Notes on latency spikes and occasional connection stalls caused by the iptables INVALID DROP rule in K8s

Notes from third-line support
Revision as of 12:46, 28 July 2023 (Friday), by Admin

Environment prerequisites

  • The kube-proxy component is present and runs in iptables mode
  • Optional: calico CNI

Possible triggers and resulting symptoms

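  • An overlay Pod communicating with a service outside the cluster
  • Traffic between the underlay and overlay networks (the outbound path goes over the overlay and the return path over the underlay, which produces asymmetric routing)
  • conntrack saturation?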

These produce occasional latency spikes or occasional connection stalls.
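
To check the conntrack saturation hypothesis on a node, the current table size can be compared with its limit. The following is a minimal sketch, assuming the nf_conntrack module is loaded and exposes its counters under /proc/sys/net/netfilter; the helper name conntrack_usage_pct and the WARN_PCT threshold are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# conntrack_usage_pct COUNT MAX -> prints the integer usage percentage (rounded down).
conntrack_usage_pct() {
  count=$1
  max=$2
  if [ "$max" -le 0 ]; then
    echo "invalid max: $max" >&2
    return 1
  fi
  echo $(( count * 100 / max ))
}

COUNT_FILE=/proc/sys/net/netfilter/nf_conntrack_count
MAX_FILE=/proc/sys/net/netfilter/nf_conntrack_max

# Only read the kernel counters when the nf_conntrack module exposes them.
if [ -r "$COUNT_FILE" ] && [ -r "$MAX_FILE" ]; then
  pct=$(conntrack_usage_pct "$(cat "$COUNT_FILE")" "$(cat "$MAX_FILE")")
  echo "conntrack table usage: ${pct}%"
  # WARN_PCT is a hypothetical threshold, not a kernel setting.
  if [ "$pct" -ge "${WARN_PCT:-80}" ]; then
    echo "warning: conntrack table is close to saturation" >&2
  fi
fi
```

A usage close to 100% around the time of the stalls would support the saturation hypothesis; a consistently low value makes it less likely.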


In the KUBE-FORWARD chain of the filter table, which kube-proxy maintains, there is a rule -A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP:

[root@gzu-prd ~]# iptables -L KUBE-FORWARD --line -nv
Chain KUBE-FORWARD (1 references)
num   pkts bytes target     prot opt in     out     source               destination
1        0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            ctstate INVALID
2        4   240 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding rules */ mark match 0x4000/0x4000
3    11412   33M ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
4        0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED

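This rule drops any traffic that connection tracking marks as INVALID, and at the time of writing the behavior cannot be disabled through configuration (short of changing the code and recompiling).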
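
Whether this rule is actually dropping packets on a node can be read from its packet counter. Here is a small sketch, assuming the listing is produced with the same --line -nv options used above plus -x (exact counters, so values such as 33M are not abbreviated); the helper name invalid_drop_pkts is hypothetical.

```shell
#!/bin/sh
# invalid_drop_pkts: read `iptables -L <chain> --line -nvx` output on stdin and
# print the summed packet counter of DROP rules that match ctstate INVALID.
# With --line the columns are: num pkts bytes target prot opt in out ...
invalid_drop_pkts() {
  awk '$4 == "DROP" && /ctstate INVALID/ { sum += $2 } END { print sum + 0 }'
}

# Query the live chain only when iptables is available (this needs root).
if command -v iptables >/dev/null 2>&1; then
  iptables -w 10 -L KUBE-FORWARD --line -nvx 2>/dev/null | invalid_drop_pkts
fi
```

A counter that keeps growing while the latency spikes occur points at this rule; a counter that stays at 0, as in the listing above, suggests looking elsewhere first.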

The connection tracking state of TCP flows can be inspected with conntrack -L or cat /proc/net/nf_conntrack (for example, flags such as [UNREPLIED]).
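
As a rough illustration of that inspection, the sketch below counts TCP entries that carry the [UNREPLIED] flag. It assumes the line format printed by conntrack -L or /proc/net/nf_conntrack; the helper name count_unreplied is hypothetical.

```shell
#!/bin/sh
# count_unreplied: read conntrack entries on stdin and print how many TCP
# entries carry the [UNREPLIED] flag.
count_unreplied() {
  grep -w 'tcp' | grep -c -F '[UNREPLIED]'
}

# Read the live table only when the kernel exposes it and it is readable.
if [ -r /proc/net/nf_conntrack ]; then
  echo "UNREPLIED tcp entries: $(count_unreplied < /proc/net/nf_conntrack)"
fi
```

On hosts where /proc/net/nf_conntrack is not available, the same helper can be fed from the conntrack tool, for example conntrack -L 2>/dev/null | count_unreplied. The per-CPU statistics printed by conntrack -S also include an invalid counter that may be worth watching alongside it.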

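kube-proxy bluntly flushes and rewrites its iptables rules whenever an endpoint changes, so simply inserting an ACCEPT rule into KUBE-FORWARD does not work: the rule is wiped on the next rewrite.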


Likewise, in the filter-table chains maintained by calico, nearly every cali-fw-cali**** chain contains the rule -m conntrack --ctstate INVALID -j DROP:

[root@gzu-prd ~]# iptables-save -t filter|grep INVALID
-A cali-fw-cali02fca994756 -m comment --comment "cali:Zgj-5PhkyRyRGc5v" -m conntrack --ctstate INVALID -j DROP
-A cali-fw-cali091fd1acd82 -m comment --comment "cali:vySNraYuHVkcwzZC" -m conntrack --ctstate INVALID -j DROP
-A cali-fw-cali0945b5ec7e6 -m comment --comment "cali:YpO6T4K2fN2biMqp" -m conntrack --ctstate INVALID -j DROP
-A cali-fw-cali09725d6075c -m comment --comment "cali:3Q23jKsPGkXWWHjs" -m conntrack --ctstate INVALID -j DROP

This behavior, however, can be turned off with the FELIX_DISABLECONNTRACKINVALIDCHECK environment variable.
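
One way to set that variable is with kubectl set env on the calico-node workload. The sketch below only prints the command so it can be reviewed first. It assumes calico-node runs as a DaemonSet named calico-node in the kube-system namespace, which is common for manifest-based installs but not guaranteed; CALICO_NS, CALICO_DS and the helper name are hypothetical.

```shell
#!/bin/sh
# Assumed location of the calico-node DaemonSet; override if your install differs.
CALICO_NS="${CALICO_NS:-kube-system}"
CALICO_DS="${CALICO_DS:-calico-node}"

# Print (do not run) the kubectl command that sets the Felix environment variable.
felix_disable_invalid_check_cmd() {
  echo "kubectl -n $CALICO_NS set env daemonset/$CALICO_DS FELIX_DISABLECONNTRACKINVALIDCHECK=true"
}

felix_disable_invalid_check_cmd
# To apply it after review: felix_disable_invalid_check_cmd | sh
```

Changing the environment triggers a rolling restart of calico-node. Operator-managed installations may overwrite manual changes to the DaemonSet, so there the setting would have to go through the operator's own configuration instead.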

To avoid the problem without modifying kube-proxy or calico-node parameters, a quick and dirty option is to deploy a DaemonSet in the cluster:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: iptables-conntrack-hacker
  namespace: kube-system
  labels:
    app: iptables-conntrack-hacker
spec:
  selector:
    matchLabels:
      app: iptables-conntrack-hacker
  template:
    metadata:
      name: iptables-conntrack-hacker
      labels:
        app: iptables-conntrack-hacker
    spec:
      volumes:
        - name: lib-modules
          hostPath:
            path: /lib/modules
            type: ''
        - name: xtables-lock
          hostPath:
            path: /run/xtables.lock
            type: ''
      containers:
        - name: iptables-conntrack-hacker
          image: 'your-registry-address/kube-system/kube-proxy:v1.18.20'
          command:
            - /bin/sh
            - '-c'
            - |
              export TZ=Asia/Shanghai;
              echo "$(date) postStart ...";
              iptables -C FORWARD -w 10 -m conntrack -m comment --comment "To avoid invalid tcp traffic dropped by kubelet" --ctstate INVALID -j ACCEPT || \
              { iptables -I FORWARD -w 10 -m conntrack -m comment --comment "To avoid invalid tcp traffic dropped by kubelet" --ctstate INVALID -j ACCEPT && echo "Add iptables rules ..."; };
              iptables -w 10 -L FORWARD --line -nv|grep INV;
              tail -f /dev/stdout;
          resources:
            limits:
              cpu: 250m
              memory: 256Mi
            requests:
              cpu: 1m
              memory: 1Mi
          volumeMounts:
            - name: lib-modules
              mountPath: /lib/modules
            - name: xtables-lock
              mountPath: /run/xtables.lock
          lifecycle:
            postStart:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - |
                    sleep 10;
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - |
                    export TZ=Asia/Shanghai;
                    echo "$(date) preStop delete iptables rules ..."  > /proc/1/fd/1 2>&1;
                    iptables -D FORWARD -w 10 -m conntrack -m comment --comment "To avoid invalid tcp traffic dropped by kubelet" --ctstate INVALID -j ACCEPT  > /proc/1/fd/1 2>&1;
                    sleep 10;
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            runAsUser: 0
      restartPolicy: Always
      terminationGracePeriodSeconds: 5
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      securityContext: {}
      schedulerName: default-scheduler
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - operator: Exists
          effect: NoExecute
        - operator: Exists
          effect: NoSchedule
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 50%
  revisionHistoryLimit: 2

This DaemonSet touches the host's iptables only at startup, when it bluntly inserts an INVALID ACCEPT rule.

If you have the means, change it to an infinite loop that checks every 10 to 30 seconds whether the ACCEPT rule exists and inserts it if it is missing.
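
The loop variant could look roughly like the sketch below, which would replace the one-shot commands in the container. It reuses the exact rule specification and comment string from the DaemonSet so that iptables -C recognizes a rule inserted earlier. IPTABLES, INTERVAL, RUN_LOOP and the function name are hypothetical names introduced for illustration; the loop only starts when RUN_LOOP=1.

```shell
#!/bin/sh
# IPTABLES can be overridden for dry runs; INTERVAL is the polling period in seconds.
IPTABLES="${IPTABLES:-iptables}"
INTERVAL="${INTERVAL:-30}"
# Same comment string as in the DaemonSet, so -C matches the rule it inserted.
RULE_COMMENT="To avoid invalid tcp traffic dropped by kubelet"

# ensure_invalid_accept: insert the ACCEPT rule at the top of FORWARD only when
# an identical rule is not already present (-C checks, -I inserts).
ensure_invalid_accept() {
  if ! $IPTABLES -C FORWARD -w 10 -m conntrack -m comment --comment "$RULE_COMMENT" --ctstate INVALID -j ACCEPT 2>/dev/null; then
    $IPTABLES -I FORWARD -w 10 -m conntrack -m comment --comment "$RULE_COMMENT" --ctstate INVALID -j ACCEPT &&
      echo "$(date) re-added INVALID ACCEPT rule"
  fi
}

# RUN_LOOP is a hypothetical switch so that sourcing this file has no side effects.
if [ "${RUN_LOOP:-0}" = "1" ]; then
  while true; do
    ensure_invalid_accept
    sleep "$INTERVAL"
  done
fi
```

Inside the container this would be started with something like RUN_LOOP=1 INTERVAL=15 sh ensure-rule.sh, where the script name is also just an example. Checking before inserting keeps the loop from stacking duplicate rules in FORWARD.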

Related

kube-proxy(v1.18.20) code: https://github.com/kubernetes/kubernetes/blob/1f3e19b7beb1cc0110255668c4238ed63dadb7ad/pkg/proxy/iptables/proxier.go#L1503-L1511

calico v3.16 config(FELIX_DISABLECONNTRACKINVALIDCHECK): https://docs.tigera.io/archive/v3.16/reference/felix/configuration

github issue