Kubernetes errors: node failure troubleshooting summary

kubelet fails to start, Error: No space left on device

Check the logs. The key lines are:

Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/blkio: no space left on device
or
Failed to start cAdvisor inotify_add_watch /sys/fs/cgroup/cpu,cpuacct: no space left on device

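Fix (reference: https://blog.csdn.net/xiaofang2015/article/details/80649548): inotify_add_watch reports "no space left on device" when the inotify watch limit is exhausted, not when the disk is full, so raise the limit.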

[root@node6 ~]# cat /proc/sys/fs/inotify/max_user_watches
8196
[root@node6 ~]# sysctl fs.inotify.max_user_watches=1048576
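
A value set with sysctl on the command line is lost at the next reboot. The sketch below shows one way to keep it, assuming a distribution that reads /etc/sysctl.d; the helper name inotify_conf_line and the file name 99-inotify.conf are illustrative and not from the referenced post.

```shell
# Sketch: build the sysctl.d entry that keeps the higher limit across reboots.
# The function name and the file name below are illustrative assumptions.
inotify_conf_line() {
  # $1 = desired value for fs.inotify.max_user_watches
  printf 'fs.inotify.max_user_watches=%s\n' "$1"
}

# Usage (as root):
#   inotify_conf_line 1048576 > /etc/sysctl.d/99-inotify.conf
#   sysctl --system   # reload all sysctl configuration files
```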

systemctl commands time out, Failed to start reboot.target: Connection timed out

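Symptom: systemctl commands, including reboot, fail with the timeout error shown in the heading.
Fix (reference: https://serverfault.com/questions/712928/systemctl-commands-timeout-when-ran-as-root): force the reboot. Passing --force twice reboots immediately without stopping services or unmounting file systems, so data loss is possible.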

[root@node7 ~]# systemctl --force --force reboot
Rebooting.
packet_write_wait: Connection to 10.180.3.107 port 22: Broken pipe

System OOM encountered

Warning  SystemOOM                34m (x8 over 34m)  kubelet, node5     System OOM encountered
Sep 30 18:36:22 node5 kubelet[134096]: E0930 18:36:22.037042  134096 kubelet_node_status.go:106] Unable to register node "node5" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: getsockopt: connection refused
Sep 30 18:36:21 node5 kubelet[134096]: E0930 18:36:21.898997  134096 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dnode5&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused

Just reboot the machine. The root cause was not determined.
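
If the node is still reachable before the reboot, the kernel log may show which process the OOM killer picked. This is only a minimal sketch, not part of the original diagnosis: oom_lines is a hypothetical helper, and the pattern assumes the usual kernel wording ("invoked oom-killer", "Out of memory").

```shell
# Sketch: print kernel log lines (read from stdin) that mention the OOM killer.
# oom_lines is a hypothetical helper name; the pattern is an assumption about
# the kernel's message wording.
oom_lines() {
  grep -Ei 'out of memory|oom-killer'
}

# Usage (dmesg may require root):
#   dmesg | oom_lines
#   journalctl -k | oom_lines
```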

ContainerGCFailed rpc error: code = DeadlineExceeded desc = context deadline exceeded

Symptom:

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Thu, 10 Oct 2019 20:38:34 +0800   Mon, 08 Oct 2018 23:08:22 +0800   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Thu, 10 Oct 2019 20:38:34 +0800   Fri, 23 Aug 2019 21:12:52 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 10 Oct 2019 20:38:34 +0800   Wed, 15 May 2019 16:12:45 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready            False   Thu, 10 Oct 2019 20:38:34 +0800   Thu, 10 Oct 2019 20:27:48 +0800   KubeletNotReady              PLEG is not healthy: pleg was last seen active 13m51.370099044s ago; threshold is 3m0s


Events:
  Type     Reason             Age                 From             Message
  ----     ------             ----                ----             -------
  Normal   NodeNotReady       10m (x2 over 137d)  kubelet, node37  Node node37 status is now: NodeNotReady
  Warning  ContainerGCFailed  2m (x4 over 11m)    kubelet, node37  rpc error: code = DeadlineExceeded desc = context deadline exceeded

Fix:

systemctl daemon-reexec

systemctl restart docker (yes, docker has to be restarted)
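
The DeadlineExceeded error suggests that a call to the container runtime did not return in time. A quick way to check this before restarting is to run a docker command under a time limit. The sketch below assumes GNU coreutils timeout is installed; responds_within is a hypothetical helper and the 10-second limit is arbitrary.

```shell
# Sketch: run a command and report whether it finished within a time limit.
# responds_within is a hypothetical helper; `timeout` comes from GNU coreutils.
responds_within() {
  limit="$1"; shift
  timeout "$limit" "$@" >/dev/null 2>&1
}

# Usage on the affected node (the 10-second limit is an arbitrary assumption):
#   if responds_within 10 docker ps; then
#     echo "docker answered in time"
#   else
#     echo "docker is unresponsive or failing"
#   fi
```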

Get https://127.0.0.1:6443/api/v1/namespaces/java-service/pods/example-554976b69b-4bnxh: dial tcp 127.0.0.1:6443: getsockopt: connection refused

Symptom:
The node is NotReady, and the describe output shows no event information.
On the node, check the kubelet log with systemctl status kubelet.service -l; it shows errors like the one in the heading.
Fix:

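Restart kubelet on the node. Sometimes a restart is not enough; in that case check disk, CPU, memory, and file descriptor usage.
In one case the root file system was 99% full: after kubelet was restarted the node recovered briefly and then went NotReady again, but describe did not report insufficient disk space, so disk space was not considered at first.
In another case the node had run out of memory.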
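
The resource checks above can be gathered into one small script. This is a sketch under assumptions: the function names and the 90% warning threshold are illustrative, and free and uptime are expected to be installed on the node.

```shell
# Sketch: the resource checks mentioned above, gathered in one place.
# Function names and the 90% threshold are illustrative assumptions.

# Read `df -P` output on stdin and print the usage percentage
# of the first file system listed.
df_usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_node_resources() {
  pct=$(df -P / | df_usage_pct)
  if [ -n "$pct" ] && [ "$pct" -ge 90 ]; then
    echo "WARNING: root file system is ${pct}% full"
  fi
  free -m                      # memory and swap usage in MiB
  uptime                       # load averages as a rough CPU indicator
  cat /proc/sys/fs/file-nr     # allocated, unused, and maximum file handles
}
```

Run check_node_resources on the affected node before and after restarting kubelet; a root file system close to full would explain a node that recovers and then goes NotReady again.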

container runtime is down

Cause unknown.

FailedCreatePodSandBox Failed create pod sandbox.

Cause unknown.
