In Kubernetes, you can configure health checks via probes to determine the state of each Pod.
Once a Pod is scheduled to a node, the kubelet runs the Pod's containers. If one or all of the Pod's containers stop running (for example, the main process crashes), the kubelet restarts them, so even an application that does nothing special on its own gains self-healing simply by running in Kubernetes.
Automatically restarting containers to keep the application available is one of the advantages of Kubernetes. However, in some situations an application can misbehave even though its process has not crashed. By default, Kubernetes only checks whether a Pod's containers are running, and a running container does not necessarily mean a healthy application.
Kubernetes provides three kinds of probes:
readinessProbe: indicates whether the container is ready to serve requests (started up and ready). Before the probe's initial delay elapses, the readiness state defaults to Failure; once the probe reports success after the container has started, the state changes to Success. If no readiness probe is configured, the default state is Success.
Only when the state is Success is the Pod added to the Endpoints of the Service it belongs to, i.e., only then can the Service dispatch requests to it.
If the readiness probe detects a failure, the Pod's status is updated and the Endpoint Controller removes the endpoint for that Pod from the Service's Endpoints.
livenessProbe: determines whether the container is alive (in the Running state). If the liveness probe detects that the container is unhealthy (you can configure how many consecutive failures count as unhealthy), the kubelet kills the container and handles it according to the container restart policy restartPolicy. If no liveness probe is configured, the default state is Success, i.e., the probe always returns Success.
startupProbe: determines whether the application inside the container has started. If a startup probe is configured, all other probes are disabled until the startup probe reports Success; only after it succeeds do the other probes take effect. If the startup probe fails, the kubelet kills the container and the container is subject to its restart policy. If no startup probe is configured, the default state is Success.
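To make the relationship between the three concrete, here is a minimal sketch of a Pod spec fragment declaring all three probes on one container (the paths, ports, and thresholds are illustrative placeholders, not taken from the examples later in this article):

containers:
- name: app
  image: nginx:latest
  startupProbe:             # blocks the other two probes until it succeeds
    httpGet:
      path: /
      port: 80
    failureThreshold: 30    # allow up to 30 * 10s = 300s for a slow startup
    periodSeconds: 10
  livenessProbe:            # on failure, the container is restarted
    httpGet:
      path: /
      port: 80
    periodSeconds: 10
  readinessProbe:           # on failure, the Pod is removed from Service Endpoints
    tcpSocket:
      port: 80
    periodSeconds: 5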
The container restart policy restartPolicy has three possible values (see the sketch after this list):
Always: always restart the container after it terminates. This is the default.
OnFailure: restart the container only when it exits abnormally (non-zero exit code).
Never: never restart the container after it terminates.
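For example, a one-shot task might use OnFailure so that only abnormal exits trigger a restart (a sketch; the busybox image and the deliberately failing command are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: restart-demo
  namespace: test
spec:
  restartPolicy: OnFailure          # restart only on a non-zero exit code
  containers:
  - name: task
    image: busybox:latest
    command: ["sh", "-c", "exit 1"] # always fails, so the kubelet restarts it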
All three probes share the following parameters (combined in the sketch after this list):
initialDelaySeconds: number of seconds to wait after the container starts before running the probe. Defaults to 0.
periodSeconds: how often to run the probe. Defaults to 10.
timeoutSeconds: number of seconds after which the probe times out. Defaults to 1.
successThreshold: minimum number of consecutive successes for the probe to be considered successful after having failed. Defaults to 1.
failureThreshold: number of consecutive failures before the probe is marked as failed. For a liveness probe this causes the container to be restarted; for a readiness probe it marks the Pod as unready. Defaults to 3.
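A liveness probe tuned with all five parameters might look like this (a sketch; the values are illustrative, not recommendations):

livenessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 10   # wait 10s after container start before the first probe
  periodSeconds: 5          # probe every 5s
  timeoutSeconds: 2         # each probe attempt times out after 2s
  successThreshold: 1       # one success marks the probe healthy again
  failureThreshold: 3       # three consecutive failures restart the container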
Each probe supports three check mechanisms: exec, httpGet, and tcpSocket. exec is the most general and fits most scenarios, tcpSocket suits plain TCP services, and httpGet suits web services (an exec sketch follows this list).
exec (custom health check): runs the specified command inside the container; if the command succeeds with exit code 0, the check passes.
httpGet: performs an HTTP GET request against the container's IP address, port, and path; if the response status code is greater than or equal to 200 and less than 400, the container is considered healthy.
tcpSocket: performs a TCP check against the container's IP address and port; if a TCP connection can be established, the container is considered healthy.
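Since the two examples later in this article only use httpGet and tcpSocket, here is a sketch of an exec-based liveness probe (the /tmp/healthy file path is a hypothetical marker file, not part of the examples below):

livenessProbe:
  exec:
    command:          # exit code 0 means healthy
    - cat
    - /tmp/healthy    # fails once this file is removed
  initialDelaySeconds: 5
  periodSeconds: 5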
A probe check yields one of the following results:
Success: the check passed.
Failure: the check did not pass.
Unknown: the check did not run properly.
readinessProbe and livenessProbe can use the same check mechanism; they differ only in how the Pod is handled.
When a livenessProbe fails, the container is killed and handled according to the Pod's restart policy.
When a readinessProbe fails, the Pod's IP:Port is removed from the corresponding Endpoints list and the Pod no longer receives traffic.
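You can watch this removal happen live by keeping an eye on the Service's Endpoints while the probes run (assuming a Service named nginx exists in the test namespace; adjust the names to your setup):

kubectl get endpoints nginx -n test -w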
Start a Pod that uses both a livenessProbe and a readinessProbe:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: test
spec:
  containers:
  - name: nginx
    image: nginx:latest
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
Check the Pod:
[root@master ~]# kubectl get po -n test -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP             NODE    NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          19s   10.233.90.14   node1   <none>           <none>
#### The nginx log is shown below: a probe checks the status every 10 seconds
[root@master ~]# kubectl logs nginx -n test
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2023/04/03 06:30:29 [notice] 1#1: using the "epoll" event method
2023/04/03 06:30:29 [notice] 1#1: nginx/1.23.3
2023/04/03 06:30:29 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6)
2023/04/03 06:30:29 [notice] 1#1: OS: Linux 3.10.0-1127.el7.x86_64
2023/04/03 06:30:29 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2023/04/03 06:30:29 [notice] 1#1: start worker processes
2023/04/03 06:30:29 [notice] 1#1: start worker process 30
2023/04/03 06:30:29 [notice] 1#1: start worker process 31
2023/04/03 06:30:29 [notice] 1#1: start worker process 32
2023/04/03 06:30:29 [notice] 1#1: start worker process 33
192.168.5.227 - - [03/Apr/2023:06:31:08 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.25" "-"
192.168.5.227 - - [03/Apr/2023:06:31:18 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.25" "-"
192.168.5.227 - - [03/Apr/2023:06:31:28 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.25" "-"
192.168.5.227 - - [03/Apr/2023:06:31:38 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.25" "-"
Now test the failure case:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: test
spec:
  nodeName: node1
  restartPolicy: Always
  containers:
  - name: count
    image: nginx:latest
    imagePullPolicy: IfNotPresent
    livenessProbe:
      httpGet:
        port: 81    #### change the port to 81
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
After the Pod starts, its logs look like this: nothing listens on port 81, so the liveness probe is refused, and after three consecutive failures the kubelet stops the container, with the nginx workers shutting down gracefully:
2023/04/03 06:13:44 [notice] 32#32: gracefully shutting down
2023/04/03 06:13:44 [notice] 33#33: gracefully shutting down
2023/04/03 06:13:44 [notice] 32#32: exiting
2023/04/03 06:13:44 [notice] 33#33: exiting
2023/04/03 06:13:44 [notice] 33#33: exit
2023/04/03 06:13:44 [notice] 32#32: exit
2023/04/03 06:13:44 [notice] 31#31: gracefully shutting down
2023/04/03 06:13:44 [notice] 31#31: exiting
2023/04/03 06:13:44 [notice] 31#31: exit
2023/04/03 06:13:45 [notice] 30#30: gracefully shutting down
2023/04/03 06:13:45 [notice] 30#30: exiting
2023/04/03 06:13:45 [notice] 30#30: exit
2023/04/03 06:13:45 [notice] 1#1: signal 17 (SIGCHLD) received from 33
2023/04/03 06:13:45 [notice] 1#1: worker process 32 exited with code 0
2023/04/03 06:13:45 [notice] 1#1: worker process 33 exited with code 0
2023/04/03 06:13:45 [notice] 1#1: signal 29 (SIGIO) received
2023/04/03 06:13:45 [notice] 1#1: signal 17 (SIGCHLD) received from 32
2023/04/03 06:13:45 [notice] 1#1: signal 17 (SIGCHLD) received from 31
2023/04/03 06:13:45 [notice] 1#1: worker process 31 exited with code 0
2023/04/03 06:13:45 [notice] 1#1: signal 29 (SIGIO) received
2023/04/03 06:13:45 [notice] 1#1: signal 17 (SIGCHLD) received from 30
2023/04/03 06:13:45 [notice] 1#1: worker process 30 exited with code 0
2023/04/03 06:13:45 [notice] 1#1: exit
Check the Pod again with kubectl get po -n test:
[root@master ~]# kubectl get po -n test -o wide
NAME    READY   STATUS             RESTARTS      AGE     IP             NODE    NOMINATED NODE   READINESS GATES
nginx   0/1     CrashLoopBackOff   5 (27s ago)   6m28s   10.233.90.12   node1   <none>           <none>
[root@master ~]# kubectl describe po nginx -n test    #### describe shows the Pod details below, including "Liveness probe failed"
Name: nginx
Namespace: test
Priority: 0
Service Account: default
Node: node1/192.168.5.227
Start Time: Mon, 03 Apr 2023 14:07:44 +0800
Labels: <none>
Annotations: cni.projectcalico.org/containerID: fc89567be28e2010a6f0997482489b65649ac2136c92c603d5a5537dc2e7efbd
cni.projectcalico.org/podIP: 10.233.90.12/32
cni.projectcalico.org/podIPs: 10.233.90.12/32
Status: Running
IP: 10.233.90.12
IPs:
IP: 10.233.90.12
Containers:
count:
Container ID: containerd://10de8e85609ac2d33cac17fa995a7a1c88f318ef10ddbb6086898fb32b216bb6
Image: nginx:latest
Image ID: docker.io/library/nginx@sha256:aa0afebbb3cfa473099a62c4b32e9b3fb73ed23f2a75a65ce1d4b4f55a5c2ef2
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 03 Apr 2023 14:18:51 +0800
Finished: Mon, 03 Apr 2023 14:19:45 +0800
Ready: False
Restart Count: 7
Liveness: http-get http://:81/ delay=30s timeout=1s period=10s #success=1 #failure=3
Readiness: tcp-socket :80 delay=5s timeout=1s period=5s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zpsdf (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-zpsdf:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 10m (x3 over 12m) kubelet Container count failed liveness probe, will be restarted
Normal Pulled 10m (x4 over 13m) kubelet Container image "nginx:latest" already present on machine
Normal Started 10m (x4 over 13m) kubelet Started container count
Warning Unhealthy 9m57s (x10 over 12m) kubelet Liveness probe failed: Get "http://10.233.90.12:81/": dial tcp 10.233.90.12:81: connect: connection refused
Normal Created 8m36s (x6 over 13m) kubelet Created container count
Warning BackOff 3m22s (x19 over 7m36s) kubelet Back-off restarting failed container
[root@master ~]#