Notes on latency jitter and intermittent connection drops caused by iptables INVALID DROP rules on K8s
Environment prerequisites
- kube-proxy is deployed and runs in iptables mode
- Optional condition: calico CNI
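Possible triggers & resulting symptoms
- overlay Pods communicating with services outside the cluster
- traffic between the underlay and overlay networks (the outbound path goes through the overlay and the return path through the underlay, which results in asymmetric routing)
- conntrack saturation? (the conntrack table is full)
The symptom is occasional high latency or occasional dropped connections.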
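To follow up on the conntrack saturation suspicion, one rough check is to compare the number of tracked connections with the table limit. The sketch below is not part of the original notes: it assumes the nf_conntrack module is loaded so that the two /proc/sys files exist, and the helper name conntrack_usage_pct is made up for illustration.

```shell
# Hypothetical helper: print conntrack table usage as an integer percentage.
# $1 = current number of tracked connections, $2 = table limit.
conntrack_usage_pct() {
  count=$1
  max=$2
  if [ "$max" -le 0 ]; then
    echo "invalid max: $max" >&2
    return 1
  fi
  # Integer arithmetic is enough for a rough health signal.
  echo $(( count * 100 / max ))
}

# On a node (assumes nf_conntrack is loaded):
# conntrack_usage_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                     "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

A value close to 100 suggests the table is nearly full; when it overflows, the kernel normally logs "nf_conntrack: table full, dropping packet".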
In the KUBE-FORWARD chain of the filter table, which kube-proxy maintains, there is a rule -A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
[root@gzu-prd ~]# iptables -L KUBE-FORWARD --line -nv
Chain KUBE-FORWARD (1 references)
num   pkts bytes target     prot opt in     out     source               destination
1        0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate INVALID
2        4   240 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding rules */ mark match 0x4000/0x4000
3    11412   33M ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
4        0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED
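This rule drops any traffic that connection tracking marks as INVALID, and the behavior currently cannot be disabled through configuration (short of changing the code and recompiling).
The connection tracking state of TCP flows can be inspected with conntrack -L or cat /proc/net/nf_conntrack (look for flags such as [UNREPLIED]).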
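A small summary can make that listing easier to read. The sketch below counts TCP entries per state and the number of entries flagged [UNREPLIED]. It is an illustration rather than part of the original notes: the function name conntrack_tcp_states is made up, and it assumes the usual layout in which the protocol name tcp is followed by the protocol number, the timeout, and then the TCP state. Note that the listing shows tracked flows; a packet classified as INVALID is one that could not be matched to a valid flow, so it will not appear here as a state of its own.

```shell
# Hypothetical helper: summarize TCP conntrack entries read from stdin.
# Accepts `conntrack -L` output or the contents of /proc/net/nf_conntrack.
conntrack_tcp_states() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "tcp") {
          # Assumed layout after the protocol name: number, timeout, state.
          states[$(i + 3)]++
          if (index($0, "[UNREPLIED]") > 0) unreplied++
          break
        }
      }
    }
    END {
      for (s in states) printf "%s %d\n", s, states[s]
      printf "UNREPLIED %d\n", unreplied
    }
  ' | sort
}

# On a node:
# conntrack -L 2>/dev/null | conntrack_tcp_states
# conntrack_tcp_states < /proc/net/nf_conntrack
```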
Whenever endpoints change, kube-proxy bluntly flushes its iptables rules, so you cannot avoid the problem simply by inserting an ACCEPT rule into KUBE-FORWARD.
Likewise, in the filter-table chains maintained by calico, almost every cali-fw-cali**** chain also contains the rule -m conntrack --ctstate INVALID -j DROP
[root@gzu-prd ~]# iptables-save -t filter|grep INVALID
-A cali-fw-cali02fca994756 -m comment --comment "cali:Zgj-5PhkyRyRGc5v" -m conntrack --ctstate INVALID -j DROP
-A cali-fw-cali091fd1acd82 -m comment --comment "cali:vySNraYuHVkcwzZC" -m conntrack --ctstate INVALID -j DROP
-A cali-fw-cali0945b5ec7e6 -m comment --comment "cali:YpO6T4K2fN2biMqp" -m conntrack --ctstate INVALID -j DROP
-A cali-fw-cali09725d6075c -m comment --comment "cali:3Q23jKsPGkXWWHjs" -m conntrack --ctstate INVALID -j DROP
In calico's case, however, this behavior can be turned off through the FELIX_DISABLECONNTRACKINVALIDCHECK environment variable.
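One possible way to set that variable is kubectl set env on the calico-node DaemonSet. The sketch below is an assumption-laden illustration, not the author's procedure: it assumes a manifest-based install where the DaemonSet is named calico-node and lives in kube-system, the CALICO_NAMESPACE override and the function name are made up, and an operator-managed install may revert a change made this way.

```shell
# Hypothetical helper: set one Felix environment variable on calico-node.
# $1 = KEY=VALUE pair. Assumes DaemonSet "calico-node" in kube-system
# unless CALICO_NAMESPACE is set.
set_calico_node_env() {
  kubectl -n "${CALICO_NAMESPACE:-kube-system}" set env daemonset/calico-node "$1"
}

# Usage (triggers a rolling restart of calico-node):
# set_calico_node_env FELIX_DISABLECONNTRACKINVALIDCHECK=true
```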
To find out whether a node is actually affected, the iptables hit counters are one way to observe it:
iptables -w 3 -L --line -nv|grep DROP|sort -rn -k 2|head -n 10
[root@gzu-prd ~]# iptables -w 3 -L --line -nv|grep DROP|sort -rn -k 2|head -n 10
2    19020  773K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:kRQn4VHUEHOpigCm */ ctstate INVALID
2    15617  937K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:DTf_pGZFWLZaqlg8 */ ctstate INVALID
2     7068  283K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:HGKygSKf4SfkbRyf */ ctstate INVALID
2     3845  154K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:t5nJs-UfMTVjRtBI */ ctstate INVALID
2     2312  139K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:h3VJGUlERuK34Tcz */ ctstate INVALID
2     2115  110K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:dTQ4mHZc378Z1e33 */ ctstate INVALID
2     1828  110K DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:kp1Tzme9aWaPgdKP */ ctstate INVALID
2     1556 62240 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:VaeGtNK_681jKlg9 */ ctstate INVALID
2     1330 69160 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:meQqPUz96UN62T8l */ ctstate INVALID
2     1025 53300 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:mIIn1Wh34t2SZwbR */ ctstate INVALID
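A single number is easier to track over time than a top-10 list. The sketch below sums the packet counters of every DROP rule that matches ctstate INVALID. It is an illustration under assumptions: the function name invalid_drop_total is made up, and the input should come from iptables -L -nv with the -x (exact) flag so that the counters are plain integers rather than abbreviated values such as 33M.

```shell
# Hypothetical helper: sum the pkts counter of all "DROP ... ctstate INVALID" rules.
# Reads `iptables -L -nvx` output on stdin, with or without --line.
invalid_drop_total() {
  awk '
    /ctstate INVALID/ {
      for (i = 2; i <= NF; i++) {
        # The pkts column sits two fields before the target column.
        if ($i == "DROP" && i >= 3) { total += $(i - 2); break }
      }
    }
    END { printf "%d\n", total }
  '
}

# On a node:
# iptables -w 3 -L -nvx --line | invalid_drop_total
```

Running it twice a few minutes apart and comparing the two totals shows whether INVALID packets are still being dropped.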
If you want to avoid the drops without changing any kube-proxy or calico-node parameters, a quick and dirty option is to deploy a DaemonSet in the cluster:
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: iptables-conntrack-hacker
  namespace: kube-system
  labels:
    app: iptables-conntrack
spec:
  selector:
    matchLabels:
      app: iptables-conntrack-hacker
  template:
    metadata:
      name: iptables-conntrack-hacker
      labels:
        app: iptables-conntrack-hacker
    spec:
      volumes:
        - name: lib-modules
          hostPath:
            path: /lib/modules
            type: ''
        - name: xtables-lock
          hostPath:
            path: /run/xtables.lock
            type: ''
      containers:
        - name: iptables-conntrack-hacker
          # Reuses the kube-proxy image because it already ships iptables.
          image: 'your-registry-address/kube-system/kube-proxy:v1.18.20'
          command:
            - /bin/sh
            - '-ce'
            - |
              export TZ=Asia/Shanghai;
              echo "$(date) Container started...";
              echo "Current iptables rule state:"
              iptables -w 10 -L --line -nv|grep INVALID || true
              # Check every 60s; insert the ACCEPT rule at the top of FORWARD if it is missing.
              while (true)
              do
                iptables -C FORWARD -w 15 -m conntrack -m comment --comment "To avoid invalid tcp traffic dropped by kubelet or calico" --ctstate INVALID -j ACCEPT || \
                (iptables -I FORWARD -w 10 -m conntrack -m comment --comment "To avoid invalid tcp traffic dropped by kubelet or calico" --ctstate INVALID -j ACCEPT && echo "$(date) Adding iptables rules ...");
                sleep 60
              done
          resources:
            limits:
              cpu: 250m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 64Mi
          volumeMounts:
            - name: lib-modules
              mountPath: /lib/modules
            - name: xtables-lock
              mountPath: /run/xtables.lock
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            runAsUser: 0
      restartPolicy: Always
      terminationGracePeriodSeconds: 5
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - operator: Exists
          effect: NoExecute
        - operator: Exists
          effect: NoSchedule
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 50%
  revisionHistoryLimit: 5
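As written, this DaemonSet runs on the host network and loops forever: every 60 seconds it checks whether the INVALID ACCEPT rule exists in the host's FORWARD chain and bluntly inserts it at the top if it is missing.
If you can afford the extra checks, shorten the interval to 10 - 30 seconds so that the rule is restored sooner after something removes it.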
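Note that this DaemonSet comes with one more precondition: if the overlay CNI is calico, make sure calico-node manipulates iptables in append mode.
To do so, set the FELIX_CHAININSERTMODE environment variable to Append; otherwise the jump to the cali-FORWARD chain is inserted at the very top of the FORWARD chain, ahead of the INVALID ACCEPT rule, which makes that rule ineffective.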
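Whether the ACCEPT rule really sits ahead of the calico and kube-proxy jumps can be checked from the rule order in the FORWARD chain. The sketch below is an illustration, not part of the original notes: the function name invalid_accept_position is made up, and it assumes the jumps appear as -j cali-FORWARD and -j KUBE-FORWARD in iptables -S FORWARD output.

```shell
# Hypothetical helper: report where the INVALID ACCEPT rule sits in FORWARD.
# Reads `iptables -S FORWARD` output on stdin and prints one of:
#   ok        the ACCEPT rule comes before any cali-FORWARD / KUBE-FORWARD jump
#   shadowed  one of those jumps comes first, so its INVALID DROP wins
#   missing   neither the ACCEPT rule nor a jump was found
invalid_accept_position() {
  awk '
    /--ctstate INVALID/ && /-j ACCEPT/ { print "ok"; found = 1; exit }
    /-j (cali-FORWARD|KUBE-FORWARD)/   { print "shadowed"; found = 1; exit }
    END { if (!found) print "missing" }
  '
}

# On a node:
# iptables -w 3 -S FORWARD | invalid_accept_position
```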
Related
kube-proxy(v1.18.20) code: https://github.com/kubernetes/kubernetes/blob/1f3e19b7beb1cc0110255668c4238ed63dadb7ad/pkg/proxy/iptables/proxier.go#L1503-L1511
calico v3.16 config(FELIX_DISABLECONNTRACKINVALIDCHECK): https://docs.tigera.io/archive/v3.16/reference/felix/configuration