Brief notes on some Kubernetes pitfalls and bugs

Informal notes from third-line (tier 3) support

kubectl

kubectl rollout history

Before v1.26, kubectl rollout history printed the content of the wrong revision whenever an output flag such as -o yaml or -o json was passed.

Related issue: https://github.com/kubernetes/kubectl/issues/598#issuecomment-1230824762
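On an affected client, one possible workaround is to skip rollout history and read the revision from the Deployment's ReplicaSets, which carry a deployment.kubernetes.io/revision annotation. The sketch below assumes the output of kubectl get rs -o json has already been parsed into a dict and filtered to a single Deployment; the function name pod_template_for_revision and the sample data are hypothetical.

```python
REVISION_ANNOTATION = "deployment.kubernetes.io/revision"


def pod_template_for_revision(rs_list, revision):
    """Return the pod template of the ReplicaSet annotated with `revision`.

    `rs_list` is assumed to be the parsed output of `kubectl get rs -o json`,
    already filtered to the ReplicaSets of a single Deployment.
    """
    for rs in rs_list.get("items", []):
        annotations = rs.get("metadata", {}).get("annotations", {})
        if annotations.get(REVISION_ANNOTATION) == str(revision):
            return rs["spec"]["template"]
    return None


# Hypothetical sample data standing in for real kubectl output.
sample = {
    "items": [
        {"metadata": {"name": "web-5d9c",
                      "annotations": {REVISION_ANNOTATION: "1"}},
         "spec": {"template": {"spec": {"containers": [
             {"name": "web", "image": "nginx:1.24"}]}}}},
        {"metadata": {"name": "web-7f6b",
                      "annotations": {REVISION_ANNOTATION: "2"}},
         "spec": {"template": {"spec": {"containers": [
             {"name": "web", "image": "nginx:1.25"}]}}}},
    ]
}

template = pod_template_for_revision(sample, 1)
print(template["spec"]["containers"][0]["image"])
```

The function returns None when no ReplicaSet carries the requested revision, which is what happens once old ReplicaSets have been pruned.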

kubectl apply may hit bugs or behave unexpectedly in certain cases

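Background: kubectl apply does not simply overwrite the live object; it computes a diff and merges the result. See "How apply calculates differences and merges changes" in the Kubernetes documentation.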

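For example, with kubectl 1.18, running kubectl apply against hostAliases may append entries instead of replacing them. See the note "Unexpected behavior when using kubectl apply on hostAliases" (original title: 在使用kubectl_apply操作hostalias产生的非预期行为).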

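There is also a case where changing a probe configuration makes apply misbehave. This is most likely related to how apply computes the merge as well (seen only on 1.18; I no longer remember how to reproduce it and will add details if I run into it again).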
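To make the hostAliases "append instead of replace" symptom concrete, here is a toy comparison of two list-update semantics. It is a simplified two-way model, not kubectl's actual merge algorithm, and the choice of ip as the merge key as well as the sample entries are assumptions made for illustration.

```python
def replace_list(live, desired):
    """Replace semantics: the result is exactly the desired list."""
    return list(desired)


def merge_list_by_key(live, desired, key):
    """Simplified two-way merge keyed on `key`.

    Desired entries overwrite live entries that share the same key, and live
    entries with other keys are kept, so stale entries survive.
    """
    merged = {item[key]: item for item in live}
    for item in desired:
        merged[item[key]] = item
    return list(merged.values())


# Hypothetical live state and newly applied configuration.
live = [{"ip": "10.0.0.1", "hostnames": ["old.example.internal"]}]
desired = [{"ip": "10.0.0.2", "hostnames": ["new.example.internal"]}]

print(replace_list(live, desired))
print(merge_list_by_key(live, desired, key="ip"))
```

With replace semantics only the 10.0.0.2 entry remains; with the keyed merge both entries remain, which looks like an append from the user's point of view.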

Source-level walkthrough of the full execution path of the apply command in Kubernetes: https://juejin.cn/post/6968106028642598949

kubelet

kubelet pulls container images serially by default (parallel pull limit added in 1.27)

https://kubernetes.io/docs/concepts/containers/images/#serial-and-parallel-image-pulls

By default, kubelet pulls images serially. In other words, kubelet sends only one image pull request to the image service at a time. Other image pull requests have to wait until the one being processed is complete.

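By default, the kubelet on a node pulls container images serially (serializeImagePulls defaults to true), so the effective concurrency is 1. When pulling from public registries, one slow pull can leave every other image stuck in the pulling state. Parallel pulls require setting serializeImagePulls to false in the kubelet configuration; 1.27 added the maxParallelImagePulls field to cap how many pulls run at once.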
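The impact of pull concurrency can be illustrated with a toy scheduling model. This is only a sketch: the durations are made up, and max_parallel is a hypothetical parameter of the model, not a kubelet field.

```python
import heapq


def pull_finish_times(durations, max_parallel):
    """Return the finish time of each pull.

    Pulls start in the given order, with at most `max_parallel` pulls in
    flight at any moment.
    """
    slots = [0.0] * max_parallel  # time at which each slot becomes free
    heapq.heapify(slots)
    finish = []
    for duration in durations:
        start = heapq.heappop(slots)  # earliest free slot
        end = start + duration
        finish.append(end)
        heapq.heappush(slots, end)
    return finish


# Hypothetical pull durations in seconds: one slow public image, then two fast ones.
durations = [600, 10, 10]
print(pull_finish_times(durations, max_parallel=1))
print(pull_finish_times(durations, max_parallel=3))
```

In this model the two fast pulls finish at 610 s and 620 s when pulls are serial, and at 10 s each when three pulls may run at once.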

kubelet keeps flooding the log with 'Path "/var/lib/kubelet/pods/${pod_ID}/volumes" does not exist' errors

According to the related issue, the cause is abnormal runc cgroup garbage collection.

Issue:

https://github.com/kubernetes/kubernetes/issues/112124

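There is a cgroup cleanup script at the bottom of the issue, but the logic that derives KUBE_POD_IDS has to be adjusted to the actual environment. Even after adjusting it, rmdir on the cgroup directory fails with a "Device or resource busy" error.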
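Since the pod ID lookup is the environment-specific part, it can help to first do a dry run that only lists candidate cgroups. The sketch below is not the script from the issue; it assumes two common naming layouts (pod<uid> for the cgroupfs driver, and kubepods-<qos>-pod<uid with underscores>.slice for the systemd driver) and takes the set of live pod UIDs as input, however that set is obtained on your nodes. All names and sample values are hypothetical.

```python
import re

# Assumed layouts (verify against the node before relying on this):
#   cgroupfs driver: pod<uid>
#   systemd driver:  kubepods-<qos>-pod<uid_with_underscores>.slice
POD_CGROUP_RE = re.compile(
    r"pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})"
    r"(?:\.slice)?$"
)


def pod_uid_from_cgroup(name):
    """Extract the pod UID from a cgroup directory name, or return None."""
    match = POD_CGROUP_RE.search(name)
    return match.group(1).replace("_", "-") if match else None


def orphaned_pod_cgroups(cgroup_names, live_pod_uids):
    """Return pod cgroup names whose UID is not in `live_pod_uids` (dry run only)."""
    live = set(live_pod_uids)
    orphans = []
    for name in cgroup_names:
        uid = pod_uid_from_cgroup(name)
        if uid is not None and uid not in live:
            orphans.append(name)
    return orphans


# Hypothetical directory names and live pod UIDs.
names = [
    "pod0a1b2c3d-1111-2222-3333-444455556666",
    "kubepods-burstable-pod0a1b2c3d_aaaa_bbbb_cccc_ddddeeeeffff.slice",
    "kubepods-besteffort.slice",
]
live_uids = {"0a1b2c3d-1111-2222-3333-444455556666"}
print(orphaned_pod_cgroups(names, live_uids))
```

Nothing is removed here; the list is meant to be reviewed before any rmdir is attempted.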

Follow-up related issue:

https://github.com/kubernetes/kubernetes/issues/112151#issuecomment-1285261341

The issue attributes the trigger to disk I/O.

kubelet keeps logging "vol_data.json: no such file or directory"

Sample error log:

operationExecutor.UnmountVolume failed

failed to open volume data file [/var/lib/kubelet/pods/${pod_id}/volumes/kubernetes.io~csi/${pvc_id}/vol_data.json]: open /var/lib/kubelet/pods/${pod_id}/volumes/kubernetes.io~csi/${pvc_id}/vol_data.json: no such file or directory
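When many of these lines pile up, a small parser can list which pod and volume directories are affected. This is a minimal sketch based only on the path shape shown above; the function name affected_volumes and the IDs in the sample line are hypothetical.

```python
import re

VOL_DATA_RE = re.compile(
    r"/var/lib/kubelet/pods/(?P<pod_id>[^/]+)/volumes/"
    r"kubernetes\.io~csi/(?P<pvc_id>[^/]+)/vol_data\.json"
)


def affected_volumes(log_lines):
    """Return the set of (pod_id, pvc_id) pairs found in vol_data.json errors."""
    found = set()
    for line in log_lines:
        for match in VOL_DATA_RE.finditer(line):
            found.add((match.group("pod_id"), match.group("pvc_id")))
    return found


# Hypothetical log line following the pattern above.
sample_lines = [
    "failed to open volume data file "
    "[/var/lib/kubelet/pods/1111-aaaa/volumes/kubernetes.io~csi/pvc-bbbb/vol_data.json]: "
    "open /var/lib/kubelet/pods/1111-aaaa/volumes/kubernetes.io~csi/pvc-bbbb/vol_data.json: "
    "no such file or directory",
]
print(affected_volumes(sample_lines))
```

Each error line mentions the path twice, so collecting the pairs in a set keeps one entry per volume.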

Issue:

https://github.com/kubernetes/kubernetes/issues/85280

The issue creator notes:

When there is something wrong to execute os.Remove(volPath), volume path is left on node. However, mount path and vol_data.json is deleted.

In practice, manually unmounting the leftover mount and then restarting kubelet clears the error. Another issue also discusses this: https://github.com/kubernetes/kubernetes/issues/116847#issuecomment-1721540974

Alright one last update. If anyone is running into problems like these, make sure your CSI driver implements NodeUnpublish correctly at very minimum and idempotent. This issue imo is almost entirely caused by problematic CSI driver implementations.
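For the manual unmount step, the leftover mount points under a pod directory have to be found first. The sketch below parses text in /proc/mounts format and lists the mount points under a given prefix, deepest first; it only reports them and does not unmount anything. The function name mounts_under and the sample mount table are hypothetical.

```python
def mounts_under(mounts_text, prefix):
    """Return mount points from /proc/mounts-style text that live under `prefix`."""
    prefix = prefix.rstrip("/") + "/"
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 2 of each /proc/mounts line is the mount point.
        if len(fields) >= 2 and fields[1].startswith(prefix):
            points.append(fields[1])
    # Deepest paths first, so nested mounts come before their parents.
    return sorted(points, key=lambda p: p.count("/"), reverse=True)


# Hypothetical mount table; on a node this text would come from /proc/mounts.
sample_mounts = """\
overlay / overlay rw 0 0
tmpfs /var/lib/kubelet/pods/1111/volumes/kubernetes.io~projected/token tmpfs rw 0 0
/dev/sdb /var/lib/kubelet/pods/1111/volumes/kubernetes.io~csi/pvc-aaaa/mount ext4 rw 0 0
/dev/sdc /var/lib/kubelet/pods/2222/volumes/kubernetes.io~csi/pvc-bbbb/mount ext4 rw 0 0
"""

for point in mounts_under(sample_mounts, "/var/lib/kubelet/pods/1111"):
    print(point)
```

On a real node the input could be read with open("/proc/mounts").read(), and each reported path would then be reviewed before running umount on it.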