1. Pod Liveness Probes
Quite a few applications gradually turn unusable after running continuously for a long time and can only be recovered by a restart. Kubernetes' container liveness probing can detect problems of this kind and, based on the probe result combined with the restart policy, trigger the follow-up behavior. Liveness probing is a container-level configuration; kubelet uses it to decide when a container needs to be restarted. The diagnostic itself is defined by a container handler, and Kubernetes supports three handler types for probing:
1) ExecAction: runs a command inside the container and diagnoses by its exit status code; an exit code of 0 means the probe succeeded, while any other value marks the container unhealthy. This is known as an Exec probe.
2) TCPSocketAction: diagnoses by attempting to establish a connection to a TCP port of the container; if the port can be opened the container is healthy, otherwise it is unhealthy.
3) HTTPGetAction: diagnoses by issuing an HTTP GET request to a specified path on a specified port of the container's IP address; a 2xx or 3xx response code means success, anything else means failure.
Liveness probe: determines whether a container is in the "running" state; once such a check fails, kubelet kills the container and decides whether to restart it according to its restartPolicy; a container that defines no liveness probe defaults to the "Success" state.
Readiness probe: determines whether a container is ready to serve requests; a container that fails the check is considered not yet ready, and the endpoints controller removes its IP from the endpoint list of every Service object matching the Pod; once the check passes again, the IP is added back to the endpoint list.
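All three handlers use the same probe stanza syntax. A minimal sketch of each, with port and path values that are placeholders drawn from the examples later in this article:
# Exec: run a command inside the container
livenessProbe:
  exec:
    command: ["test","-e","/tmp/healthy"]
# TCPSocket: try to open a TCP connection to a container port
livenessProbe:
  tcpSocket:
    port: 80
# HTTPGet: send an HTTP GET and judge by the response code
livenessProbe:
  httpGet:
    path: /healthy.html
    port: 80
    scheme: HTTP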
2. Container Restart Policy
Termination of a Pod object can be caused by the container program crashing, the container exceeding its resource limits, and so on. Whether the Pod's containers should then be recreated is determined by its restart policy (restartPolicy) attribute:
1) Always: restart the container whenever it terminates; this is the default.
2) OnFailure: restart the container only when it terminates with an error.
3) Never: never restart it.
Note that restartPolicy applies to all containers in a Pod object, and that it only governs restarting those containers on the same node. The first time a container needs a restart, it is restarted immediately; subsequent restarts are delayed by kubelet for increasing intervals of 10s, 20s, 40s, 80s, 160s and 300s, with 300s as the maximum delay. In fact, once bound to a node, a Pod object is never rebound to another node: it is restarted or terminated there until the node fails or is deleted.
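restartPolicy sits at the spec level, as a sibling of containers, never on an individual container. A minimal sketch, assuming a throwaway Pod name and a busybox image chosen for illustration:
apiVersion: v1
kind: Pod
metadata:
  name: restart-demo           # hypothetical name, for illustration only
spec:
  restartPolicy: OnFailure     # Always (default) | OnFailure | Never; applies to all containers
  containers:
  - name: main
    image: busybox
    args: ["/bin/sh","-c","sleep 10; exit 1"]   # non-zero exit triggers an OnFailure restart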
3. Exec Probes
An exec probe judges container health by running a user-defined command inside the target container: an exit status of 0 means the check "succeeded", and any other value means it "failed".
1) Write the exec probe YAML file
]# cat exec.yaml
apiVersion: v1
kind: Pod
metadata:
  name: exec-pod
  labels:
    test: liveness-exec
spec:
  containers:
  - name: liveness-exec-demo
    image: busybox
    imagePullPolicy: IfNotPresent
    args: ["/bin/sh","-c","touch /tmp/healthy; sleep 60; rm -rf /tmp/healthy; sleep 600"]
    livenessProbe:
      exec:
        command: ["test","-e","/tmp/healthy"]
]# kubectl apply -f exec.yaml
pod/exec-pod created
2) Check whether the file exists inside the Pod
]# kubectl get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
exec-pod   1/1     Running   0          12s   10.244.1.42   node1   <none>           <none>
]# kubectl exec exec-pod -it -- /bin/sh
/ # ls /tmp/ -l
total 0
-rw-r--r-- 1 root root 0 Aug 1 11:21 healthy
/ # exit
Here we can see that the file does exist inside the container.
3) Check the Pod status again
]# kubectl get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP            NODE    NOMINATED NODE   READINESS GATES
exec-pod   1/1     Running   1          2m37s   10.244.1.42   node1   <none>           <none>
]# kubectl exec exec-pod -it -- /bin/sh
/ # ls -l /tmp/
total 0
Sixty seconds later, another look at the Pod shows that it has been restarted once, meaning the liveness probe failed (with the default period of 10 seconds and a failure threshold of 3, the restart follows roughly 30 seconds after the file disappears); entering the Pod again confirms that the file has been deleted.
4) View the Pod's details
]# kubectl describe pods exec-pod
Name:         exec-pod
Namespace:    default
Priority:     0
Node:         node1/172.16.2.101
Start Time:   Sat, 01 Aug 2020 19:21:03 +0800
Labels:       test=liveness-exec
Annotations:
Status:       Running
IP:           10.244.1.42
IPs:
  IP:  10.244.1.42
Containers:
  liveness-exec-demo:
    Container ID:   docker://e06871bd25c2b0d556821a2bd87de0e2f4862bb43bb90cb2b7e5fe2b6b740772
    Image:          busybox
    Image ID:       docker-pullable://busybox@sha256:4f47c01fa91355af2865ac10fef5bf6ec9c7f42ad2321377c21e844427972977
    Port:           <none>
    Host Port:      <none>
    Args:
      /bin/sh
      -c
      touch /tmp/healthy; sleep 60; rm -rf /tmp/healthy; sleep 600
    State:          Running
      Started:      Sat, 01 Aug 2020 19:24:57 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Sat, 01 Aug 2020 19:22:57 +0800
      Finished:     Sat, 01 Aug 2020 19:24:56 +0800
    Ready:          True
    Restart Count:  2
    Liveness:       exec [test -e /tmp/healthy] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-47pch (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-47pch:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-47pch
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                            From               Message
  ----     ------     ----                           ----               -------
  Normal   Scheduled  <unknown>                      default-scheduler  Successfully assigned default/exec-pod to node1
  Warning  Unhealthy  <invalid> (x6 over <invalid>)  kubelet, node1     Liveness probe failed:
  Normal   Killing    <invalid> (x2 over <invalid>)  kubelet, node1     Container liveness-exec-demo failed liveness probe, will be restarted
  Normal   Pulled     <invalid> (x3 over <invalid>)  kubelet, node1     Container image "busybox" already present on machine
  Normal   Created    <invalid> (x3 over <invalid>)  kubelet, node1     Created container liveness-exec-demo
  Normal   Started    <invalid> (x3 over <invalid>)  kubelet, node1     Started container liveness-exec-demo
The Pod details show that the container's last state is Terminated with exit code 137 (128 + 9, i.e. killed by SIGKILL); in the events we can see that the liveness probe failed, after which the container was restarted.
4. HTTP Probes
An HTTP probe (HTTPGetAction) issues an HTTP GET request to the target container and judges the result by the response code: a 2xx or 3xx code means the check passed, anything else means it failed, and the container is then handled according to its restart policy.
Configuration fields available for HTTP probes (a combined sketch follows this list):
host <string>: the host name to connect to; defaults to the Pod IP
port <string>: the port number or name to connect to; required
httpHeaders <[]Object>: custom headers for the request
path <string>: the path to request on the HTTP server
scheme <string>: the protocol used for the connection, which may only be HTTP or HTTPS; defaults to HTTP
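A minimal sketch putting these fields together; host is omitted so that the probe targets the Pod IP, and the header name and value are hypothetical, for illustration only:
livenessProbe:
  httpGet:
    port: http                # container port name or number
    path: /healthy.html
    scheme: HTTP              # HTTP (default) or HTTPS
    httpHeaders:              # optional custom request headers
    - name: X-Probe-Source    # hypothetical header
      value: kubelet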
1) Write the HTTP probe YAML file
]# cat http.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness-http-demo
    image: ikubernetes/myapp:v1
    imagePullPolicy: IfNotPresent
    ports:
    - name: http
      containerPort: 80
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh","-c","echo Healthy > /usr/share/nginx/html/healthy.html"]
    livenessProbe:
      httpGet:
        path: /healthy.html
        port: http
        scheme: HTTP
]# kubectl apply -f http.yaml
pod/liveness-http created
2) Check whether the file exists inside the Pod
]# kubectl get pods -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
liveness-http   1/1     Running   0          39s   10.244.1.44   node1   <none>           <none>
]# kubectl exec liveness-http -it -- /bin/sh
/ # ls -l /usr/share/nginx/html/healthy.html
-rw-r--r-- 1 root root 8 Aug 1 11:44 /usr/share/nginx/html/healthy.html
/ # cat /usr/share/nginx/html/healthy.html
Healthy
3) Delete the file
/ # rm /usr/share/nginx/html/healthy.html
/ # exit
4) Check the Pod status again
]# kubectl get pods -o wide
NAME            READY   STATUS    RESTARTS   AGE    IP            NODE    NOMINATED NODE   READINESS GATES
liveness-http   1/1     Running   1          3m4s   10.244.1.44   node1   <none>           <none>
We can see that the Pod has been restarted once.
5) View the Pod's details
]# kubectl describe pods liveness-http
Name:         liveness-http
Namespace:    default
Priority:     0
Node:         node1/172.16.2.101
Start Time:   Sat, 01 Aug 2020 19:44:54 +0800
Labels:       test=liveness
Annotations:
Status:       Running
IP:           10.244.1.44
IPs:
  IP:  10.244.1.44
Containers:
  liveness-http-demo:
    Container ID:   docker://3549c5a13a1448260c00e138676c475b26c75fbf3417fe44ef546b3b89014037
    Image:          ikubernetes/myapp:v1
    Image ID:       docker-pullable://ikubernetes/myapp@sha256:9c3dc30b5219788b2b8a4b065f548b922a34479577befb54b03330999d30d513
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 01 Aug 2020 19:47:43 +0800
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 01 Aug 2020 19:44:54 +0800
      Finished:     Sat, 01 Aug 2020 19:47:42 +0800
    Ready:          True
    Restart Count:  1
    Liveness:       http-get http://:http/healthy.html delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-47pch (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-47pch:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-47pch
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                            From               Message
  ----     ------     ----                           ----               -------
  Normal   Scheduled  <unknown>                      default-scheduler  Successfully assigned default/liveness-http to node1
  Normal   Pulled     <invalid> (x2 over <invalid>)  kubelet, node1     Container image "ikubernetes/myapp:v1" already present on machine
  Warning  Unhealthy  <invalid> (x3 over <invalid>)  kubelet, node1     Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Killing    <invalid>                      kubelet, node1     Container liveness-http-demo failed liveness probe, will be restarted
  Normal   Created    <invalid> (x2 over <invalid>)  kubelet, node1     Created container liveness-http-demo
  Normal   Started    <invalid> (x2 over <invalid>)  kubelet, node1     Started container liveness-http-demo
The Pod details show that the container's last state is Terminated and that it has been restarted once; the events show that the liveness probe failed with HTTP status code 404, after which the container was restarted.
5. TCP Probes
A TCP liveness probe (TCPSocketAction) attempts to establish a TCP connection to a specific container port and judges the result by whether the connection succeeds: a successfully established connection passes the check. Compared with the HTTP probe it is more efficient and uses fewer resources, but it is somewhat less precise, since a successful connection does not necessarily mean the page resource is accessible. A TCP probe mainly offers the following fields:
1) host <string>: the host name to connect to; defaults to the Pod IP
2) port <string>: the port number or name to connect to; required
Liveness probe behavior attributes (these also apply to readiness probes; a combined sketch follows this list):
initialDelaySeconds <integer>: how long to wait after the container starts before the first probe, shown as the delay attribute; defaults to 0 seconds
timeoutSeconds <integer>: how long to wait before a probe attempt is considered failed, shown as the timeout attribute; defaults to 1 second, minimum 1
periodSeconds <integer>: the probe interval, shown as the period attribute; defaults to 10 seconds, minimum 1
successThreshold <integer>: while in a failed state, how many consecutive successes are required before the check is considered passing again, shown as the #success attribute; defaults to 1, minimum 1 (must be 1 for liveness probes)
failureThreshold <integer>: while in a successful state, how many consecutive failures are required before the check is considered failing, shown as the #failure attribute; defaults to 3, minimum 1
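A minimal sketch applying these behavior attributes to a TCP probe; the numbers are placeholders chosen for illustration:
livenessProbe:
  tcpSocket:
    port: 80
  initialDelaySeconds: 15   # wait 15s after container start before the first probe
  timeoutSeconds: 2         # each probe attempt must complete within 2s
  periodSeconds: 10         # probe every 10s
  successThreshold: 1       # must be 1 for liveness probes
  failureThreshold: 3       # 3 consecutive failures trigger a restart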
1) Write the TCP probe YAML file
]# cat tcp.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-tcp
spec:
  containers:
  - name: liveness-tcp-demo
    image: ikubernetes/myapp:v1
    imagePullPolicy: IfNotPresent
    ports:
    - name: http
      containerPort: 80
    livenessProbe:
      tcpSocket:
        port: http
]# kubectl apply -f tcp.yaml
pod/liveness-tcp created
2) View the Pod's status
]# kubectl get pods -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
liveness-tcp   1/1     Running   0          8s    10.244.1.45   node1   <none>           <none>
]# kubectl exec liveness-tcp -it -- /bin/sh
/ # netstat -tunlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 1/nginx: master pro
3) Stop the process listening on the Pod's port 80
/ # nginx -s stop
2020/08/01 12:02:12 [notice] 13#13: signal process started
/ # command terminated with exit code 137
4) View the Pod's status again
]# kubectl get pods -o wide
NAME           READY   STATUS    RESTARTS   AGE     IP            NODE    NOMINATED NODE   READINESS GATES
liveness-tcp   1/1     Running   1          3m16s   10.244.1.45   node1   <none>           <none>
We can see that the Pod has been restarted once. Note, however, that this restart was not triggered by a TCP probe failure: stopping nginx terminates the container's main process, so the container exits on its own and is restarted per the restart policy.
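One way to confirm this is to read the container's last termination state directly; the jsonpath query below prints the recorded reason. A probe kill would also leave "Liveness probe failed" warnings under Events in kubectl describe, which are absent here:
]# kubectl get pod liveness-tcp -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'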
6. Readiness Probes
After a Pod object starts, the containerized application usually needs some time to complete its initialization, such as loading configuration files, and some programs even need to run a warm-up procedure of some kind. If client requests are admitted before this phase finishes, users will inevitably wait too long and the experience will suffer. A Pod object should therefore not be handed client requests immediately after startup; instead it should wait until the container has finished initializing and turned "ready", especially when other Pod objects providing the same service already exist.
Like liveness probing, readiness probing is a periodic operation (the default period is 10 seconds) for judging whether a container is ready, that is, whether it has completed initialization and can serve client requests; a probe that returns the "success" state signals that the container is "ready".
Readiness probes support the same three mechanisms as liveness probes, Exec, HTTP GET and TCP Socket, each defined in the same way. What differs is the action triggered on failure: a failed readiness probe does not kill or restart the container to keep it healthy; it reports the container as not yet ready and triggers the operations that depend on readiness (for example, removing the Pod object from a Service object's endpoints) so that no client requests are routed to it. Readiness probing remains valuable even while the Pod is running: for example, when a Pod B that Pod A depends on becomes unavailable due to a network fault, Pod A's service should turn unready so that it does not serve incomplete responses to clients.
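The endpoint behavior is easy to observe by pointing a Service at the example Pod defined below; a minimal sketch, assuming a hypothetical Service name:
apiVersion: v1
kind: Service
metadata:
  name: readiness-demo-svc   # hypothetical name, for illustration only
spec:
  selector:
    test: readiness-exec     # matches the Pod label used in the next example
  ports:
  - port: 80
    targetPort: 80
While the Pod is unready, kubectl get endpoints readiness-demo-svc shows no ready addresses; once the readiness probe passes, the Pod IP appears in the endpoint list.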
1) Write the exec-style readiness probe YAML file
]# cat exec-readiness.yaml
apiVersion: v1
kind: Pod
metadata:
  name: exec-pod
  labels:
    test: readiness-exec
spec:
  containers:
  - name: readiness-exec-demo
    image: busybox
    imagePullPolicy: IfNotPresent
    args: ["/bin/sh","-c","while true;do rm -f /tmp/ready; sleep 30; touch /tmp/ready; sleep 300;done"]
    readinessProbe:
      exec:
        command: ["test","-e","/tmp/ready"]
      initialDelaySeconds: 5
      periodSeconds: 5
]# kubectl apply -f exec-readiness.yaml
pod/exec-pod created
2) View the Pod's status
]# kubectl get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
exec-pod   0/1     Running   0          7s    10.244.1.46   node1   <none>           <none>
]# kubectl exec exec-pod -it -- /bin/sh
/ # ls /tmp/
ready
3) Delete the file
/ # rm /tmp/ready
/ # exit
4) View the Pod's details again
]# kubectl get pods -o wide
NAME       READY   STATUS    RESTARTS   AGE     IP            NODE    NOMINATED NODE   READINESS GATES
exec-pod   0/1     Running   0          3m36s   10.244.1.46   node1   <none>           <none>
]# kubectl describe pods exec-pod
Name:         exec-pod
Namespace:    default
Priority:     0
Node:         node1/172.16.2.101
Start Time:   Sat, 01 Aug 2020 20:38:33 +0800
Labels:       test=readiness-exec
Annotations:
Status:       Running
IP:           10.244.1.46
IPs:
  IP:  10.244.1.46
Containers:
  readiness-exec-demo:
    Container ID:   docker://cd1b74a31ad2d0577a9bad6577a6fae620430a9ab5256a271aa908c4562fbd9f
    Image:          busybox
    Image ID:       docker-pullable://busybox@sha256:4f47c01fa91355af2865ac10fef5bf6ec9c7f42ad2321377c21e844427972977
    Port:           <none>
    Host Port:      <none>
    Args:
      /bin/sh
      -c
      while true;do rm -f /tmp/ready; sleep 30; touch /tmp/ready; sleep 300;done
    State:          Running
      Started:      Sat, 01 Aug 2020 20:38:34 +0800
    Ready:          True
    Restart Count:  0
    Readiness:      exec [test -e /tmp/ready] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-47pch (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-47pch:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-47pch
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                            From               Message
  ----     ------     ----                           ----               -------
  Normal   Scheduled  <unknown>                      default-scheduler  Successfully assigned default/exec-pod to node1
  Normal   Pulled     <invalid>                      kubelet, node1     Container image "busybox" already present on machine
  Normal   Created    <invalid>                      kubelet, node1     Created container readiness-exec-demo
  Normal   Started    <invalid>                      kubelet, node1     Started container readiness-exec-demo
  Warning  Unhealthy  <invalid> (x5 over <invalid>)  kubelet, node1     Readiness probe failed:
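The events show the readiness probe failing, yet Restart Count stays at 0: a failed readiness probe only marks the Pod unready (READY 0/1), it never restarts the container. Since the container's loop touches /tmp/ready again on its next iteration, watching the Pod will eventually show it return to READY 1/1:
]# kubectl get pods -w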