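I have recently been learning how to build a server cluster with Docker and Kubernetes. With no prior cloud-computing background, the process was painful: I followed an expert's post step by step and still hit countless pitfalls.
First, here is the tutorial I followed. In my opinion it is one of the more complete posts on building a cluster with a recent Kubernetes version.
Quickly Building a Kubernetes Cluster with Kubeadm (1.13+)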
https://www.cnblogs.com/RainingNight/p/using-kubeadm-to-create-a-cluster-1-13.html
Now to the main topic.
First, my environment:
A two-host cluster with one master and one worker (master node: czb-workstation, 192.168.0.109; worker node: dl4, 192.168.0.111). Both hosts run the following software:
linux-ubuntu 16.04
Docker 18.03.1-ce
Kubernetes v1.13.3
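Because I did not understand how Docker and Kubernetes work, all I could do was follow the tutorial step by step and search Baidu or Google whenever I got stuck. This post focuses on one very common but tricky problem from that process: pods ending up in CrashLoopBackOff. While building the cluster I hit CrashLoopBackOff on the CoreDNS pods many times.
1. The first occurrence came right after installing the Calico network plugin: both CoreDNS pods went into CrashLoopBackOff. Googling turned up the cause, a loop in the host's DNS configuration. The fix is to remove the 127.0.0.1 loopback address entries from the host's DNS settings, which live in these locations: /etc/resolv.conf, /run/systemd/resolve/resolv.conf, /etc/systemd/resolved.conf
Reference: https://stackoverflow.com/questions/53075796/coredns-pods-have-crashloopbackoff-or-error-state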
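To see whether a host still has such entries, the files listed above can be scanned for loopback nameservers. The following is only a minimal sketch in Python, not something from the referenced post: the function name `loopback_nameservers` and the sample file content are hypothetical, and it only inspects `nameserver` lines.

```python
import ipaddress


def loopback_nameservers(resolv_conf_text):
    """Return the nameserver addresses in resolv.conf text that are loopback addresses."""
    found = []
    for raw_line in resolv_conf_text.splitlines():
        # Drop trailing '#' comments and skip ';' comment lines.
        line = raw_line.split("#", 1)[0].strip()
        if not line or line.startswith(";"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            try:
                addr = ipaddress.ip_address(parts[1])
            except ValueError:
                continue  # not a parseable IP address, ignore it
            if addr.is_loopback:
                found.append(parts[1])
    return found


# Hypothetical file content, for illustration only.
sample = """# hypothetical resolv.conf content
nameserver 127.0.0.53
nameserver 192.168.0.1
search lan
"""
print(loopback_nameservers(sample))  # ['127.0.0.53']
```

On a real host you would pass in the text of each file, for example `loopback_nameservers(open("/etc/resolv.conf").read())`. An empty list means no loopback nameserver is configured in that file.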
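2. After applying the fix from that post, both CoreDNS pods did show Running at first, but it did not last. A few minutes later one of them had a RESTART recorded, then another roughly every two minutes, and after three restart attempts that CoreDNS pod, sure enough, fell into CrashLoopBackOff.
With no other option, I braced myself and started debugging. First I inspected the pod with "kubectl describe pod coredns-xxxxxxx -n kube-system":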
czb@czb-workstation:~$ kubectl describe pod coredns-78d4cf999f-2mjbj -n kube-system
Name: coredns-78d4cf999f-2mjbj
Namespace: kube-system
Priority: 0
PriorityClassName:
Node: dl4/192.168.0.111
Start Time: Tue, 26 Feb 2019 10:05:49 +0800
Labels: k8s-app=kube-dns
pod-template-hash=78d4cf999f
Annotations: cni.projectcalico.org/podIP: 192.168.1.28/32
Status: Running
IP: 192.168.1.28
Controlled By: ReplicaSet/coredns-78d4cf999f
Containers:
coredns:
Container ID: docker://c178ebc8719657aff484fd5277a0ab7b5184c26c92f402218c7236dfe4e20c1b
Image: registry.aliyuncs.com/google_containers/coredns:1.2.6
Image ID: docker-pullable://registry.aliyuncs.com/google_containers/coredns@sha256:0e7e5387c73f4898a7251d91f27297d3a5b210421a0b234302276feb8b264a27
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Tue, 26 Feb 2019 10:21:19 +0800
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 26 Feb 2019 10:16:49 +0800
Finished: Tue, 26 Feb 2019 10:18:38 +0800
Ready: True
Restart Count: 7
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Environment:
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-45bt9 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-45bt9:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-45bt9
Optional: false
QoS Class: Burstable
Node-Selectors:
Tolerations: CriticalAddonsOnly
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 15m default-scheduler Successfully assigned kube-system/coredns-78d4cf999f-2mjbj to dl4
Normal Pulled 12m (x3 over 15m) kubelet, dl4 Container image "registry.aliyuncs.com/google_containers/coredns:1.2.6" already present on machine
Normal Killing 12m (x2 over 14m) kubelet, dl4 Killing container with id docker://coredns:Container failed liveness probe.. Container will be killed and recreated.
Normal Created 12m (x3 over 15m) kubelet, dl4 Created container
Normal Started 12m (x3 over 15m) kubelet, dl4 Started container
Warning Unhealthy 5m31s (x26 over 14m) kubelet, dl4 Liveness probe failed: HTTP probe failed with statuscode: 503
Warning BackOff 44s (x12 over 3m) kubelet, dl4 Back-off restarting failed container
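The key error is "Liveness probe failed: HTTP probe failed with statuscode: 503". At the time this told me nothing useful; all I knew was that the CoreDNS pod scheduled on the worker dl4 was misbehaving.
The natural next step was the pod log, via "kubectl logs -f coredns-78d4cf999f-2mjbj -n kube-system":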
czb@czb-workstation:~$ kubectl logs -f coredns-78d4cf999f-2mjbj -n kube-system
.:53
2019-02-26T02:56:04.551Z [INFO] CoreDNS-1.2.6
2019-02-26T02:56:04.551Z [INFO] linux/amd64, go1.11.2, 756749c
CoreDNS-1.2.6
linux/amd64, go1.11.2, 756749c
[INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
E0226 02:56:29.551969 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:318: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:56:29.552106 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:313: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:56:29.552135 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:311: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:00.552562 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:318: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:00.553712 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:313: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:00.554783 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:311: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:31.553284 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:318: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:31.554273 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:313: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E0226 02:57:31.555564 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:311: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
[INFO] SIGTERM: Shutting down servers then terminating
czb@czb-workstation:~$ kubectl get pod --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-7fwc7 2/2 Running 0 24h
kube-system calico-node-lxkr5 2/2 Running 2 24h
kube-system coredns-78d4cf999f-2mjbj 0/1 CrashLoopBackOff 21 78m
kube-system coredns-78d4cf999f-rnft5 1/1 Running 1 24h
kube-system etcd-czb-workstation 1/1 Running 1 24h
kube-system kube-apiserver-czb-workstation 1/1 Running 1 24h
kube-system kube-controller-manager-czb-workstation 1/1 Running 1 24h
kube-system kube-proxy-lddlb 1/1 Running 1 24h
kube-system kube-proxy-rd6kt 1/1 Running 0 24h
kube-system kube-scheduler-czb-workstation 1/1 Running 1 24h
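From the output above, the rough picture is that the CoreDNS pod on the worker node cannot reach 10.96.0.1:443, the address it uses to query the Kubernetes API: dial tcp 10.96.0.1:443: i/o timeout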
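The same timeout can be reproduced outside CoreDNS with a plain TCP connection attempt. This is a minimal sketch only, assuming Python 3 is available wherever you run it (for example in a pod scheduled on the worker node); the helper name `can_connect` is hypothetical and not part of any Kubernetes tooling.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False
```

Calling `can_connect("10.96.0.1", 443)` from a pod on each node shows which side is affected: in the situation described here it would be expected to return False on the worker while succeeding on the master.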
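I kept googling and eventually found an issue where someone had hit exactly the same problem and had already solved it. What a relief!
Original issue: https://github.com/coredns/coredns/issues/2325
The author of that issue also wrote up the reasoning as a blog post: https://medium.com/@cminion/quicknote-kubernetes-networking-issues-78f1e0d06e12
In rough terms it says:
Symptoms
Pods on the worker node cannot connect to the API server.
Connections to 10.96.0.1 time out.
Pods on the master (probably untainted) work fine.
Solution
When you run kubeadm init, specify pod-network-cidr. Make sure the IPs of the host/primary network are not inside the subnet you specify.
That is, if your network runs on 192.168.*.*, use 10.0.0.0/16.
If your network is 10.0.*.*, use 192.168.0.0/16.
I forgot how subnets work, so I did not realise that 192.168.0.1 is in the same subnet as 192.168.100.0/16. In short, a /16 prefix means everything in 192.168.*.* is included.
Since my network was running on 192.168.1.*, the master ran fine on 192.168.0.*, but my worker could not communicate because it was trying to run on 192.168.1.*, which neatly caused routing problems on my box.
In other words, when initializing the master node with "kubeadm init", the CIDR range assigned to the Calico network plugin must not overlap with the IP addresses of your own hosts (the master and worker nodes). This problem is very well hidden and easy for a beginner to overlook.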
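The overlap can be checked before running "kubeadm init" with Python's standard `ipaddress` module. This is a minimal sketch: the node IPs are the ones from this post, the function name `hosts_inside_pod_cidr` is hypothetical, and 192.168.0.0/16 is assumed to be the pod CIDR of the first attempt (the post does not state it, but it is Calico's default and matches the pod IP 192.168.1.28 shown earlier).

```python
import ipaddress


def hosts_inside_pod_cidr(pod_cidr, host_ips):
    """Return the host IPs that fall inside the pod network CIDR."""
    # strict=False masks off host bits, so "192.168.100.0/16" is accepted too.
    network = ipaddress.ip_network(pod_cidr, strict=False)
    return [ip for ip in host_ips if ipaddress.ip_address(ip) in network]


# Node addresses from this post: master czb-workstation and worker dl4.
nodes = ["192.168.0.109", "192.168.0.111"]

# Assumed original pod CIDR (Calico's default): both nodes are inside it.
print(hosts_inside_pod_cidr("192.168.0.0/16", nodes))  # ['192.168.0.109', '192.168.0.111']

# The replacement CIDR: no node is inside it.
print(hosts_inside_pod_cidr("10.0.0.0/16", nodes))  # []

# A /16 prefix keeps only the first two octets, as the quoted post points out.
print(ipaddress.ip_network("192.168.100.0/16", strict=False))  # 192.168.0.0/16
```

An empty list is the result you want: it means none of the node addresses lies inside the pod network.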
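Later I found the same answer in the official Calico tutorial:
The note in that tutorial makes exactly this point.
The only option left was to rebuild the cluster. Run "kubeadm reset" on both the master and the worker node, then run the new "kubeadm init" command on the master first: "sudo kubeadm init --image-repository registry.aliyuncs.com/google_containers --kubernetes-version v1.13.1 --pod-network-cidr=10.0.0.0/16". The remaining steps are the same as in the tutorial linked at the top of this post. One caveat: the pod CIDR configured in the Calico manifest (the CALICO_IPV4POOL_CIDR setting in calico.yaml) presumably has to be changed to match the new --pod-network-cidr value.
This problem plagued me for many days. At first I was completely lost; screens full of English, and the clumsy Chinese that Google Translate produces from English posts, really do make things hard for a beginner. Still, I believe that if you persist, keep your impatience in check, narrow down the problem step by step, and actively look for ideas and answers online, you will solve it. I am grateful to the many skilled people who share their experience freely and patiently answer beginner questions that must seem trivial to them. It reminds me of a saying I once read: what you take from the internet, give back to the internet!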