Kubernetes cluster setup pitfall log: fixing CoreDNS pods stuck in CrashLoopBackOff

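I have recently been learning how to build a server cluster with Docker and Kubernetes. With no prior background in cloud computing, the process was painful: I followed an expert's tutorial step by step and still fell into countless pits.

Here is that tutorial first. In my opinion it is one of the more complete guides to building a cluster on a recent Kubernetes version.

Quickly building a Kubernetes cluster with Kubeadm (1.13+)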

https://www.cnblogs.com/RainingNight/p/using-kubeadm-to-create-a-cluster-1-13.html

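Now to the main topic.

First, my environment:

Two hosts forming a cluster with one master and one worker (master node: czb-workstation, 192.168.0.109; worker node: dl4, 192.168.0.111). Both run the following software:

Ubuntu 16.04 (Linux)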

Docker 18.03.1-ce

Kubernetes v1.13.3

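Since I did not yet understand how Docker and Kubernetes work, I could only follow the tutorial step by step and search Baidu or Google whenever I got stuck. This post focuses on a common but rather thorny problem: pods ending up in CrashLoopBackOff. During my setup the CoreDNS pods hit CrashLoopBackOff many times.

1. The first time, both CoreDNS pods went into CrashLoopBackOff right after I installed the Calico network plugin. Google turned up the cause: a DNS loop in the host environment. The host's resolver configuration points at a loopback address (127.0.0.1 or similar), so the queries CoreDNS forwards upstream come straight back to it. The fix is to remove the loopback nameserver entries from the host configuration. The files involved are: /etc/resolv.conf, /run/systemd/resolve/resolv.conf, /etc/systemd/resolved.conf

Reference: https://stackoverflow.com/questions/53075796/coredns-pods-have-crashloopbackoff-or-error-state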
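To spot the offending entries before editing the files by hand, something like the following sketch may help. It is only an illustration: the function name `loopback_nameservers` and the sample text are made up for this example, and it assumes the files use the usual `nameserver <address>` line format.

```python
import ipaddress

def loopback_nameservers(resolv_conf_text):
    """Return the nameserver addresses in resolv.conf text that are loopback."""
    found = []
    for line in resolv_conf_text.splitlines():
        # Drop trailing '#' comments, then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            try:
                addr = ipaddress.ip_address(fields[1])
            except ValueError:
                continue  # not a plain IP literal; skip it
            if addr.is_loopback:
                found.append(fields[1])
    return found

# Illustrative content, not copied from the author's machine.
sample = """\
nameserver 127.0.0.53
nameserver 192.168.0.1
"""
print(loopback_nameservers(sample))  # ['127.0.0.53']
```

On a real host the text would come from reading a file such as /etc/resolv.conf; any address the function returns is a candidate for the loop described above.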

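2. After applying the fix from that post, both CoreDNS pods did show Running at first, but not for long. A few minutes later one of them had a restart recorded, then another roughly every two minutes, and after three restart attempts, sure enough, that CoreDNS pod went into CrashLoopBackOff again...

With no other option I started debugging. First, inspect the pod with "kubectl describe pod coredns-xxxxxxx -n kube-system":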

czb@czb-workstation:~$ kubectl describe pod coredns-78d4cf999f-2mjbj -n kube-system
Name:               coredns-78d4cf999f-2mjbj
Namespace:          kube-system
Priority:           0
PriorityClassName:  
Node:               dl4/192.168.0.111
Start Time:         Tue, 26 Feb 2019 10:05:49 +0800
Labels:             k8s-app=kube-dns
                    pod-template-hash=78d4cf999f
Annotations:        cni.projectcalico.org/podIP: 192.168.1.28/32
Status:             Running
IP:                 192.168.1.28
Controlled By:      ReplicaSet/coredns-78d4cf999f
Containers:
  coredns:
    Container ID:  docker://c178ebc8719657aff484fd5277a0ab7b5184c26c92f402218c7236dfe4e20c1b
    Image:         registry.aliyuncs.com/google_containers/coredns:1.2.6
    Image ID:      docker-pullable://registry.aliyuncs.com/google_containers/coredns@sha256:0e7e5387c73f4898a7251d91f27297d3a5b210421a0b234302276feb8b264a27
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Running
      Started:      Tue, 26 Feb 2019 10:21:19 +0800
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 26 Feb 2019 10:16:49 +0800
      Finished:     Tue, 26 Feb 2019 10:18:38 +0800
    Ready:          True
    Restart Count:  7
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Environment:  
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-45bt9 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-45bt9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-45bt9
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  15m                   default-scheduler  Successfully assigned kube-system/coredns-78d4cf999f-2mjbj to dl4
  Normal   Pulled     12m (x3 over 15m)     kubelet, dl4       Container image "registry.aliyuncs.com/google_containers/coredns:1.2.6" already present on machine
  Normal   Killing    12m (x2 over 14m)     kubelet, dl4       Killing container with id docker://coredns:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Created    12m (x3 over 15m)     kubelet, dl4       Created container
  Normal   Started    12m (x3 over 15m)     kubelet, dl4       Started container
  Warning  Unhealthy  5m31s (x26 over 14m)  kubelet, dl4       Liveness probe failed: HTTP probe failed with statuscode: 503
  Warning  BackOff    44s (x12 over 3m)     kubelet, dl4       Back-off restarting failed container

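The key error is "Liveness probe failed: HTTP probe failed with statuscode: 503". On its own this told me nothing; all I knew at this point was that the CoreDNS pod scheduled on the worker dl4 was unhealthy.

The natural next step was the pod log, via "kubectl logs -f coredns-78d4cf999f-2mjbj -n kube-system":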

czb@czb-workstation:~$ kubectl logs -f coredns-78d4cf999f-2mjbj -n kube-system
.:53
2019-02-26T02:56:04.551Z [INFO] CoreDNS-1.2.6
2019-02-26T02:56:04.551Z [INFO] linux/amd64, go1.11.2, 756749c
CoreDNS-1.2.6
linux/amd64, go1.11.2, 756749c
 [INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
E0226 02:56:29.551969       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:318: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:56:29.552106       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:313: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:56:29.552135       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:311: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:00.552562       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:318: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:00.553712       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:313: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:00.554783       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:311: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:31.553284       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:318: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:31.554273       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:313: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:31.555564       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:311: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
[INFO] SIGTERM: Shutting down servers then terminating
czb@czb-workstation:~$ kubectl get pod --all-namespaces
NAMESPACE     NAME                                      READY   STATUS             RESTARTS   AGE
kube-system   calico-node-7fwc7                         2/2     Running            0          24h
kube-system   calico-node-lxkr5                         2/2     Running            2          24h
kube-system   coredns-78d4cf999f-2mjbj                  0/1     CrashLoopBackOff   21         78m
kube-system   coredns-78d4cf999f-rnft5                  1/1     Running            1          24h
kube-system   etcd-czb-workstation                      1/1     Running            1          24h
kube-system   kube-apiserver-czb-workstation            1/1     Running            1          24h
kube-system   kube-controller-manager-czb-workstation   1/1     Running            1          24h
kube-system   kube-proxy-lddlb                          1/1     Running            1          24h
kube-system   kube-proxy-rd6kt                          1/1     Running            0          24h
kube-system   kube-scheduler-czb-workstation            1/1     Running            1          24h

The log shows roughly what is wrong: the CoreDNS pod on the worker cannot reach the API server at 10.96.0.1:443, hence "dial tcp 10.96.0.1:443: i/o timeout".
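Why 10.96.0.1? A small sketch can pull the failing target out of the log and compare it with the first address of the Service CIDR, which is the address the built-in `kubernetes` Service (the API server) normally receives. The code assumes the cluster kept kubeadm's default Service CIDR of 10.96.0.0/12; the function names are invented for this illustration.

```python
import ipaddress
import re

# Assumption: the cluster kept kubeadm's default Service CIDR.
DEFAULT_SERVICE_CIDR = ipaddress.ip_network("10.96.0.0/12")

def timed_out_targets(log_text):
    """Return the distinct host:port targets that hit 'i/o timeout'."""
    return sorted(set(re.findall(r"dial tcp (\S+): i/o timeout", log_text)))

def is_first_service_ip(target, service_cidr=DEFAULT_SERVICE_CIDR):
    """True if the target's host is the first address of the Service CIDR."""
    host = target.rsplit(":", 1)[0]
    return ipaddress.ip_address(host) == service_cidr.network_address + 1

# One line copied from the CoreDNS log above.
log_line = (
    "E0226 02:56:29.551969       1 reflector.go:205] "
    "github.com/coredns/coredns/plugin/kubernetes/controller.go:318: "
    "Failed to list *v1.Namespace: Get "
    "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: "
    "dial tcp 10.96.0.1:443: i/o timeout"
)

for target in timed_out_targets(log_line):
    print(target, is_first_service_ip(target))  # 10.96.0.1:443 True
```

If the check holds, every timeout in the log is the same failure: the pod cannot route to the API server's cluster-internal address.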

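I kept googling and found an issue where someone had hit exactly the same problem and had already solved it. What a relief!

Original issue: https://github.com/coredns/coredns/issues/2325

The author also wrote up the solution as a blog post: https://medium.com/@cminion/quicknote-kubernetes-networking-issues-78f1e0d06e12

It reads roughly as follows:

Symptoms

Pods on the worker node cannot connect to the API server.

Connections to 10.96.0.1 time out.

Pods on the master (possibly untainted) work fine.

When you run kubeadm init, specify pod-network-cidr. Make sure the IP of your host/primary network is not inside the subnet you specify.

That is, if your network runs on 192.168.*.*, use 10.0.0.0/16.

If your network is 10.0.*.*, use 192.168.0.0/16.

I had forgotten how subnets work, so I did not realise that 192.168.0.1 is in the same subnet as 192.168.100.0/16. In short, a /16 mask means the subnet covers everything in 192.168.*.*.

My network was on 192.168.1.*. The master ran fine on 192.168.0.*, but my worker could not communicate because it was trying to run on 192.168.1.*, which caused routing problems on my box.

In other words, when initialising the master with "kubeadm init" and assigning a CIDR range to the Calico network plugin, the IP addresses of your own machines (master and worker nodes) must not fall inside that range. This is a very well hidden problem and easy for a newcomer to miss.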
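The subnet arithmetic is easy to check with Python's standard `ipaddress` module. The sketch below uses the two node addresses from my environment. It assumes the original pod CIDR was 192.168.0.0/16, which is consistent with the pod IP 192.168.1.28 in the kubectl describe output; the helper name `hosts_inside_pod_cidr` is made up for this example.

```python
import ipaddress

def hosts_inside_pod_cidr(pod_cidr, host_ips):
    """Return the host IPs that fall inside the pod network CIDR."""
    # strict=False accepts a CIDR with host bits set, e.g. 192.168.100.0/16.
    network = ipaddress.ip_network(pod_cidr, strict=False)
    return [ip for ip in host_ips if ipaddress.ip_address(ip) in network]

nodes = ["192.168.0.109", "192.168.0.111"]  # master and worker from this post

# Assumed original pod CIDR: both nodes sit inside it, so routes collide.
print(hosts_inside_pod_cidr("192.168.0.0/16", nodes))
# ['192.168.0.109', '192.168.0.111']

# The replacement range suggested in the quoted post: no node is inside it.
print(hosts_inside_pod_cidr("10.0.0.0/16", nodes))
# []

# A /16 keeps only the first two octets, so this is really 192.168.0.0/16.
print(ipaddress.ip_network("192.168.100.0/16", strict=False))
# 192.168.0.0/16
```

An empty list is what you want to see before passing a range to --pod-network-cidr.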

Later I found the same answer in the official Calico tutorial:

[Image: screenshot of a note in the Calico documentation]

The note there says exactly this.

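All that was left was to rebuild the cluster. Run "kubeadm reset" on both the master and the worker, then run the new "kubeadm init" on the master first: "sudo kubeadm init --image-repository registry.aliyuncs.com/google_containers --kubernetes-version v1.13.1 --pod-network-cidr=10.0.0.0/16". The remaining steps are the same as in the tutorial linked at the top of this post.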
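One more check is worth making before settling on a new range: the pod CIDR should also stay clear of the Service CIDR, since 10.96.0.1 lives there. The sketch below assumes the Service CIDR was left at kubeadm's default of 10.96.0.0/12 (the command above does not override it); the function name is invented for this illustration.

```python
import ipaddress

def overlaps_service_cidr(pod_cidr, service_cidr="10.96.0.0/12"):
    """True if the pod network shares any address with the Service network."""
    # Default argument is kubeadm's default Service CIDR (an assumption here).
    pods = ipaddress.ip_network(pod_cidr)
    services = ipaddress.ip_network(service_cidr)
    return pods.overlaps(services)

print(overlaps_service_cidr("10.0.0.0/16"))  # False: the range used above is safe
print(overlaps_service_cidr("10.0.0.0/8"))   # True: a /8 would swallow 10.96.0.0/12
```

The network plugin's own address pool presumably has to agree with the value given to --pod-network-cidr as well; check the Calico documentation for the version in use.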

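This problem bothered me on and off for many days. At first I had no idea where to start, and the screens of English text, plus the clumsy Chinese that Google Translate made of English posts, really are a big obstacle for a beginner. But I believe that if you persevere, stay patient, track down the problem step by step, and actively look for ideas and answers online, you will solve it. I am grateful to the many highly skilled people who selflessly share their experience online and patiently answer beginner questions that must seem trivial to them. It reminds me of a saying I once read: take from the internet, give back to the internet!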
