Introduction to Pod Health Checks
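By default, the kubelet judges a container's health only by whether its main process is running. It cannot see the state of the application inside the container, for example a process that is hung but still alive. Such a container keeps receiving traffic it cannot serve, and requests are lost. Health checks (probes) were introduced to close this gap.
This article covers two kinds of probes that a Pod can use to check container health: livenessProbe (liveness probe) and readinessProbe (readiness probe).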
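livenessProbe (liveness probe)
A liveness probe uses an HTTP request, a shell command, or a TCP connection to check whether the application in the container is healthy, and reports the result to the kubelet. If the probe reports the application as unhealthy, the kubelet kills the container and restarts it according to the restartPolicy defined in the Pod manifest.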
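readinessProbe (readiness probe)
A readiness probe uses the same mechanisms (HTTP request, shell command, or TCP connection) to check whether the application is healthy and able to serve requests. If it is, the container is considered Ready, and only Pods in the Ready state receive requests.
For Pods managed by a Service, the association between the Service and its Pods also depends on whether each Pod is Ready. After a Pod starts, the application usually needs some time to initialize, for example to load configuration or data, and some programs need a warm-up phase. If the Pod accepts client requests before this phase finishes, responses will be very slow and the user experience suffers. A Pod should therefore not handle client requests immediately after starting; it should wait until initialization completes and the container becomes Ready.
If a container or Pod is not Ready, Kubernetes removes the Pod from the Service's backend endpoints.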
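The effect of readiness on a Service's backends can be pictured with a toy model. The sketch below is not how Kubernetes is implemented; it only assumes a made-up input format of one Pod per line (`<name> <ip> <ready>`), and the helper name `ready_endpoints` is hypothetical.

```shell
# Toy model: print the IPs of Pods whose readiness is "true".
# Input format (an assumption for this sketch): "<name> <ip> <ready>" per line.
ready_endpoints() {
  awk '$3 == "true" { print $2 }'
}

# Made-up Pods: only the ready ones would be listed as Service backends.
printf '%s\n' \
  'web-a 10.244.1.10 true' \
  'web-b 10.244.1.11 false' \
  'web-c 10.244.1.12 true' | ready_endpoints
```

In this model the Pod `web-b` is simply skipped; once its readiness flips back to `true`, it reappears in the output.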
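Health Check Mechanisms
Both probe types described above, livenessProbe and readinessProbe, support the following ways of checking a container:
- ExecAction: runs a command inside the container; an exit status of 0 means the check succeeded
- HTTPGetAction: sends an HTTP GET request to the container's IP address, port, and path; a status code greater than or equal to 200 and less than 400 means success
- TCPSocketAction: attempts a TCP connection to the container's IP address on the given port; if the connection can be established, the check succeeds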
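The success rule for HTTP checks is easy to get wrong at the boundaries, so here is a minimal sketch of it as a shell function. The function name `http_probe_ok` is hypothetical; the rule it encodes is the one stated in the list: a status code from 200 up to, but not including, 400 counts as success.

```shell
# Returns 0 (success) when the HTTP status code is >= 200 and < 400.
http_probe_ok() {
  [ "$1" -ge 200 ] && [ "$1" -lt 400 ]
}

# Classify a few sample status codes.
for code in 200 302 399 400 503; do
  if http_probe_ok "$code"; then
    echo "$code success"
  else
    echo "$code failure"
  fi
done
```

Redirects (3xx) therefore count as healthy, while 400 itself is already a failure.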
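Each check produces one of three results:
- Success: the container passed the check
- Failure: the container failed the check
- Unknown: the check itself could not be carried out, so no action is taken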
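livenessProbe Examples
livenessProbe with ExecAction
This probe determines container health by running a user-defined command inside the target container: if the command exits with status 0, the container is considered healthy. The spec.containers.livenessProbe.exec field defines this kind of check. It has a single attribute, command, which specifies the command to run. The following manifest shows the liveness-exec approach: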
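1. Create the manifest
The Pod runs an nginx container that starts nginx, sleeps for 60 seconds, and then deletes nginx.pid.
The livenessProbe runs an exec command that tests whether nginx.pid exists. If the command returns a non-zero status, the container is restarted according to the restart policy.
The expected behavior: about 60 seconds after the container starts, nginx.pid is deleted, the exec probe begins to fail, and the container is restarted according to the restart policy.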
cat ngx-health.yaml
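apiVersion: v1
kind: Pod
metadata:
  name: ngx-health
spec:
  containers:
  - name: ngx-liveness
    image: nginx:latest
    command:
    - /bin/sh
    - -c
    # Start nginx, delete the pid file after 60s, then keep the shell (PID 1)
    # alive so the container is still running when the probe starts to fail.
    - /usr/sbin/nginx; sleep 60; rm -rf /run/nginx.pid; sleep 600
    livenessProbe:
      exec:
        # The whole test expression must be one string after "-c";
        # as separate list items only a bare `test` would run.
        command: ["/bin/sh", "-c", "test -e /run/nginx.pid"]
  restartPolicy: Always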
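One detail of exec probes that wrap their check in `/bin/sh -c` can be verified locally without a cluster: `sh -c` treats only the first following argument as the script, and any further arguments become `$0`, `$1`, and so on. The sketch below uses a temporary file as a stand-in for the pid file; the variable name `pidfile` is just for illustration.

```shell
# Create a file that certainly exists, standing in for /run/nginx.pid.
pidfile=$(mktemp)

# Separate arguments: sh -c takes only "test" as the script; "-e" becomes $0
# and the path becomes $1, so a bare `test` runs, which exits non-zero.
if /bin/sh -c test -e "$pidfile"; then
  echo "separate args: success"
else
  echo "separate args: failure"
fi

# Single string: the whole expression is the script, so the file is checked.
if /bin/sh -c "test -e '$pidfile'"; then
  echo "single string: success"
else
  echo "single string: failure"
fi

rm -f "$pidfile"
```

Only the single-string form reflects whether the file exists, so an exec probe written with separate list items after `-c` would fail on every check regardless of the file.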
2. Create the Pod
kubectl apply -f ngx-health.yaml
Wait for the Pod to become Ready.
3. View the Pod details
# First check: the container has started successfully and the events are normal
kubectl describe pods/ngx-health | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled default-scheduler Successfully assigned default/ngx-health to k8s-node03
Normal Pulling 12s kubelet, k8s-node03 Pulling image "nginx:latest"
Normal Pulled 6s kubelet, k8s-node03 Successfully pulled image "nginx:latest"
Normal Created 6s kubelet, k8s-node03 Created container ngx-liveness
Normal Started 5s kubelet, k8s-node03 Started container ngx-liveness
# Second check: the container's liveness probe has failed and the kubelet is about to restart it
kubectl describe pods/ngx-health | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled default-scheduler Successfully assigned default/ngx-health to k8s-node03
Normal Pulling 52s kubelet, k8s-node03 Pulling image "nginx:latest"
Normal Pulled 46s kubelet, k8s-node03 Successfully pulled image "nginx:latest"
Normal Created 46s kubelet, k8s-node03 Created container ngx-liveness
Normal Started 45s kubelet, k8s-node03 Started container ngx-liveness
Warning Unhealthy 20s (x3 over 40s) kubelet, k8s-node03 Liveness probe failed:
Normal Killing 20s kubelet, k8s-node03 Container ngx-liveness failed liveness probe, will be restarted
# Third check: the image has been pulled again, and the container has been re-created and started
kubectl describe pods/ngx-health | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled default-scheduler Successfully assigned default/ngx-health to k8s-node03
Warning Unhealthy 35s (x3 over 55s) kubelet, k8s-node03 Liveness probe failed:
Normal Killing 35s kubelet, k8s-node03 Container ngx-liveness failed liveness probe, will be restarted
Normal Pulling 4s (x2 over 67s) kubelet, k8s-node03 Pulling image "nginx:latest"
Normal Pulled 2s (x2 over 61s) kubelet, k8s-node03 Successfully pulled image "nginx:latest"
Normal Created 2s (x2 over 61s) kubelet, k8s-node03 Created container ngx-liveness
Normal Started 2s (x2 over 60s) kubelet, k8s-node03 Started container ngx-liveness
The wide output below shows the Pod at 22 seconds with a restart count of 0,
and again at 76 seconds, by which time it has been restarted once.
kubectl get pods -o wide | grep ngx-health
ngx-health 1/1 Running 0 22s 10.244.5.44 k8s-node03
kubectl get pods -o wide | grep ngx-health
ngx-health 1/1 Running 1 76s 10.244.5.44 k8s-node03
Second probe failure and second restart
kubectl describe pods/ngx-health | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled default-scheduler Successfully assigned default/ngx-health to k8s-node03
Normal Pulled 58s (x2 over 117s) kubelet, k8s-node03 Successfully pulled image "nginx:latest"
Normal Created 58s (x2 over 117s) kubelet, k8s-node03 Created container ngx-liveness
Normal Started 58s (x2 over 116s) kubelet, k8s-node03 Started container ngx-liveness
Warning Unhealthy 31s (x6 over 111s) kubelet, k8s-node03 Liveness probe failed:
Normal Killing 31s (x2 over 91s) kubelet, k8s-node03 Container ngx-liveness failed liveness probe, will be restarted
Normal Pulling 0s (x3 over 2m3s) kubelet, k8s-node03 Pulling image "nginx:latest"
kubectl get pods -o wide | grep ngx-health
ngx-health 1/1 Running 2 2m13s 10.244.5.44 k8s-node03
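livenessProbe with HTTPGetAction
This probe issues an HTTP GET request to the container's IP address, port, and path. If the response status code is greater than or equal to 200 and less than 400, the container is considered healthy. The spec.containers.livenessProbe.httpGet field defines this kind of check, with the following configuration fields: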
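- host: the host to send the request to; defaults to the Pod IP. A Host: header can be set in httpHeaders instead
- port: the port to send the request to; required, in the range 1-65535
- httpHeaders <[]Object>: custom request headers
- path: the HTTP resource path to request, that is, the URL path
- scheme: the protocol used for the connection, either HTTP or HTTPS; defaults to HTTP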
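To see how these fields combine into a single request, the following sketch assembles the URL a probe with the given scheme, host, port, and path would target. The helper `probe_url` is hypothetical and exists only for illustration; the sample IP is taken from the Pod output later in this article.

```shell
# probe_url <scheme> <host> <port> <path>
# Assemble the URL that the four httpGet fields describe.
probe_url() {
  scheme=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  printf '%s://%s:%s%s\n' "$scheme" "$2" "$3" "$4"
}

url=$(probe_url HTTP 10.244.2.36 80 /index.html)
echo "$url"

# From a machine that can reach the Pod network, the status code could be
# checked by hand with curl, for example:
#   curl -s -o /dev/null -w '%{http_code}\n' "$url"
```

This is only a way to reproduce the request manually; the kubelet issues the real probe itself from the node.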
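1. Create the manifest
The Pod runs an nginx container that starts nginx, sleeps for 60 seconds, and then deletes nginx.pid.
The livenessProbe uses httpGet to request index.html from the nginx document root on port 80. The host defaults to the Pod IP and the scheme is HTTP. If the request fails, the container is restarted according to the restart policy.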
cat ngx-health.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ngx-health
spec:
  containers:
  - name: ngx-liveness
    image: nginx:latest
    command:
    - /bin/sh
    - -c
    - /usr/sbin/nginx; sleep 60; rm -rf /run/nginx.pid
    livenessProbe:
      httpGet:
        path: /index.html
        port: 80
        scheme: HTTP
  restartPolicy: Always
2. Create the Pod
kubectl apply -f ngx-health.yaml
3. Check the Pod status
# Container being created
kubectl get pods -o wide | grep ngx-health
ngx-health 0/1 ContainerCreating 0 7s k8s-node02
# Container running
kubectl get pods -o wide | grep ngx-health
ngx-health 1/1 Running 0 19s 10.244.2.36 k8s-node02
4. View the Pod events
The image has been pulled and the container has started successfully
kubectl describe pods/ngx-health | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled default-scheduler Successfully assigned default/ngx-health to k8s-node02
Normal Pulling 30s kubelet, k8s-node02 Pulling image "nginx:latest"
Normal Pulled 15s kubelet, k8s-node02 Successfully pulled image "nginx:latest"
Normal Created 15s kubelet, k8s-node02 Created container ngx-liveness
Normal Started 14s kubelet, k8s-node02 Started container ngx-liveness
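About 60 seconds after the container became ready, the image is being pulled again, which means the container is being restarted. Note that the events contain no Unhealthy entry: deleting nginx.pid does not stop nginx from serving, so the HTTP probe keeps succeeding. The restart happens because the wrapper shell exits once it has deleted the file, which terminates the container (status Completed), and restartPolicy: Always then starts it again.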
kubectl describe pods/ngx-health | grep -A 15 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled default-scheduler Successfully assigned default/ngx-health to k8s-node02
Normal Pulled 63s kubelet, k8s-node02 Successfully pulled image "nginx:latest"
Normal Created 63s kubelet, k8s-node02 Created container ngx-liveness
Normal Started 62s kubelet, k8s-node02 Started container ngx-liveness
Normal Pulling 1s (x2 over 78s) kubelet, k8s-node02 Pulling image "nginx:latest"
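After the image is pulled, the container is created and started again; the Age column now shows the most recent occurrence of each event (x2)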
kubectl describe pods/ngx-health | grep -A 15 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled default-scheduler Successfully assigned default/ngx-health to k8s-node02
Normal Pulling 18s (x2 over 95s) kubelet, k8s-node02 Pulling image "nginx:latest"
Normal Pulled 2s (x2 over 80s) kubelet, k8s-node02 Successfully pulled image "nginx:latest"
Normal Created 2s (x2 over 80s) kubelet, k8s-node02 Created container ngx-liveness
Normal Started 1s (x2 over 79s) kubelet, k8s-node02 Started container ngx-liveness
The wide output shows the Pod passing through Completed and then running again with a restart count of 1
kubectl get pods -o wide | grep ngx-health
ngx-health 0/1 Completed 0 96s 10.244.2.36 k8s-node02
kubectl get pods -o wide | grep ngx-health
ngx-health 1/1 Running 1 104s 10.244.2.36 k8s-node02
The container log shows the probe requests below; by default the probe runs every 10 seconds
kubectl logs -f pods/ngx-health
10.244.2.1 - - [15/May/2020:03:01:13 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "kube-probe/1.18" "-"
10.244.2.1 - - [15/May/2020:03:01:23 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "kube-probe/1.18" "-"
10.244.2.1 - - [15/May/2020:03:01:33 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "kube-probe/1.18" "-"
10.244.2.1 - - [15/May/2020:03:01:43 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "kube-probe/1.18" "-"
10.244.2.1 - - [15/May/2020:03:01:53 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "kube-probe/1.18" "-"
10.244.2.1 - - [15/May/2020:03:02:03 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "kube-probe/1.18" "-"
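The 10-second period can be read straight off the access log. As a small sketch, the hypothetical helper `probe_intervals` below filters the kube-probe requests and prints the gap in seconds between consecutive ones. It assumes the default nginx access-log layout shown above and ignores midnight rollover.

```shell
# Print the number of seconds between consecutive kube-probe requests.
probe_intervals() {
  grep 'kube-probe' | awk '{
    split($4, t, ":")                      # $4 looks like [15/May/2020:03:01:13
    now = t[2] * 3600 + t[3] * 60 + t[4]   # seconds since midnight
    if (NR > 1) print now - prev
    prev = now
  }'
}

# Three of the log lines from above as sample input.
probe_intervals <<'EOF'
10.244.2.1 - - [15/May/2020:03:01:13 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "kube-probe/1.18" "-"
10.244.2.1 - - [15/May/2020:03:01:23 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "kube-probe/1.18" "-"
10.244.2.1 - - [15/May/2020:03:01:33 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "kube-probe/1.18" "-"
EOF
```

Against a live Pod, the same function could be fed from `kubectl logs pods/ngx-health`.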
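livenessProbe with TCPSocketAction
This probe performs a TCP check against the container's IP address and port: if a TCP connection can be established, the container is considered healthy. It is cheaper than an HTTP probe but less precise, because a successful connection does not guarantee that the page can actually be served. The spec.containers.livenessProbe.tcpSocket field defines this kind of check, with two attributes: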
- host: the target IP address to connect to; defaults to the Pod IP
- port: the target port to connect to; required
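The following manifest uses the liveness-tcp approach. It opens a connection to port 80/tcp on the Pod IP and judges the result by whether the connection is established:
1. Create the manifest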
apiVersion: v1
kind: Pod
metadata:
  name: ngx-health
spec:
  containers:
  - name: ngx-liveness
    image: nginx:latest
    command:
    - /bin/sh
    - -c
    - /usr/sbin/nginx; sleep 60; rm -rf /run/nginx.pid
    livenessProbe:
      tcpSocket:
        port: 80
  restartPolicy: Always
2. Create the Pod
kubectl apply -f ngx-health.yaml
3. View the Pod events
# Container created and started successfully
kubectl describe pods/ngx-health | grep -A 15 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled default-scheduler Successfully assigned default/ngx-health to k8s-node02
Normal Pulling 19s kubelet, k8s-node02 Pulling image "nginx:latest"
Normal Pulled 9s kubelet, k8s-node02 Successfully pulled image "nginx:latest"
Normal Created 8s kubelet, k8s-node02 Created container ngx-liveness
Normal Started 8s kubelet, k8s-node02 Started container ngx-liveness
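# About 60 seconds after the container became ready, the image is being pulled again
# (no Unhealthy event is recorded: the container exited on its own when the wrapper shell finished)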
kubectl describe pods/ngx-health | grep -A 15 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled default-scheduler Successfully assigned default/ngx-health to k8s-node02
Normal Pulled 72s kubelet, k8s-node02 Successfully pulled image "nginx:latest"
Normal Created 71s kubelet, k8s-node02 Created container ngx-liveness
Normal Started 71s kubelet, k8s-node02 Started container ngx-liveness
Normal Pulling 10s (x2 over 82s) kubelet, k8s-node02 Pulling image "nginx:latest"
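# The wide output also shows the Pod in the Completed state, meaning the container's main process exited with status 0; the container is restarted next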
kubectl get pods -o wide | grep ngx-health
ngx-health 0/1 Completed 0 90s 10.244.2.37 k8s-node02
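Health Check Parameters
The sections above introduced the two probe types, which act at different stages of a container's life, and the check mechanisms they support. The following parameters tune their behavior: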
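- initialDelaySeconds: how long to wait after the container has started before the first probe runs; defaults to 0
- periodSeconds: how often the probe runs; defaults to 10 seconds, minimum 1 second
- successThreshold: number of consecutive successes required, after a failure, for the probe to count as successful again; defaults to 1, and must be 1 for liveness probes
- timeoutSeconds: how long to wait for a probe to respond; defaults to 1 second, minimum 1 second
- failureThreshold: number of consecutive failures required, after a success, for the probe to count as failed; defaults to 3, minimum 1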
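The "consecutive" wording of the two thresholds is easiest to follow in a small simulation. The sketch below is a simplified model, not kubelet code: the function name `simulate`, the `ok`/`fail` tokens, and the initial `healthy` state are all assumptions made for illustration. It best matches a readiness probe, since a container that fails its liveness threshold is restarted rather than left to recover.

```shell
# simulate <failureThreshold> <successThreshold> <result>...
# Each result is "ok" or "fail"; prints the probe state after every result.
simulate() {
  failure_threshold=$1
  success_threshold=$2
  shift 2
  state=healthy
  fails=0
  oks=0
  for result in "$@"; do
    if [ "$result" = ok ]; then
      oks=$((oks + 1))
      fails=0                      # a success breaks the failure streak
      if [ "$oks" -ge "$success_threshold" ]; then state=healthy; fi
    else
      fails=$((fails + 1))
      oks=0                        # a failure breaks the success streak
      if [ "$fails" -ge "$failure_threshold" ]; then state=unhealthy; fi
    fi
    echo "$result -> $state"
  done
}

# failureThreshold=3, successThreshold=1
simulate 3 1 ok fail fail ok fail fail fail ok
```

Two failures followed by a success do not change the state, because the streak is reset; only three failures in a row flip it to unhealthy, and a single success flips it back.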
Health Checks in Practice
The following example uses both a readinessProbe and a livenessProbe.
Readiness probe configuration:
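- The first readiness probe runs 5 seconds (initialDelaySeconds) after the container starts. It requests index.html from the container's web root over HTTP; if the probe succeeds, the Pod is marked Ready.
- The probe then repeats at the interval given by periodSeconds, set here to 10 seconds.
- Each probe times out after 3 seconds. After a single failure (failureThreshold), the Pod is removed from the Service's backends, and client requests through the Service no longer reach it.
- The readiness probe keeps running against the removed Pod. After one success, the value set by successThreshold, the Pod is added back to the backends.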
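Liveness probe configuration:
- The first liveness probe runs 15 seconds (initialDelaySeconds) after the container starts. It uses tcpSocket to check port 80 of the container; the probe succeeds if the TCP connection can be established.
- The probe runs every 3 seconds with a timeout of 1 second. After 2 consecutive failures, the container is restarted according to the restart policy.
- After a single failed probe the liveness probe keeps running; one subsequent success is enough for the container to be considered healthy again.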
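These settings translate into rough timing estimates. The arithmetic below is a back-of-the-envelope sketch with hypothetical helper names; it ignores the probe timeout and any scheduling jitter in the kubelet, so the real numbers will differ slightly.

```shell
# first_possible_failure <initialDelaySeconds> <periodSeconds> <failureThreshold>
# Earliest time (seconds after container start) at which the failure threshold
# can be reached, assuming every probe fails from the very first one.
first_possible_failure() {
  echo $(( $1 + ($3 - 1) * $2 ))
}

# detection_window <periodSeconds> <failureThreshold>
# Rough upper bound on how long a running container can be broken before the
# failure threshold is reached.
detection_window() {
  echo $(( $1 * $2 ))
}

echo "liveness: earliest restart decision at $(first_possible_failure 15 3 2)s after start"
echo "liveness: detection window about $(detection_window 3 2)s"
echo "readiness: detection window about $(detection_window 10 1)s"
```

With the values described above, a broken application is restarted within roughly 6 seconds by the liveness probe, while the readiness probe may take up to about 10 seconds to pull the Pod out of the Service.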
1. The manifest
cat nginx-health.yaml
#create namespace
apiVersion: v1
kind: Namespace
metadata:
  name: nginx-health-ns
  labels:
    resource: nginx-ns
spec:
---
#create deploy and pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-health-deploy
  namespace: nginx-health-ns
  labels:
    resource: nginx-deploy
spec:
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx-health
  template:
    metadata:
      namespace: nginx-health-ns
      labels:
        app: nginx-health
    spec:
      restartPolicy: Always
      containers:
      - name: nginx-health-containers
        image: nginx:1.17.1
        imagePullPolicy: IfNotPresent
        command:
        - /bin/sh
        - -c
        - /usr/sbin/nginx; sleep 60; rm -rf /run/nginx.pid
        readinessProbe:
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 3
          failureThreshold: 1
          httpGet:
            path: /index.html
            port: 80
            scheme: HTTP
        livenessProbe:
          initialDelaySeconds: 15
          periodSeconds: 3
          successThreshold: 1
          timeoutSeconds: 1
          failureThreshold: 2
          tcpSocket:
            port: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
---
#create service
apiVersion: v1
kind: Service
metadata:
  name: nginx-health-svc
  namespace: nginx-health-ns
  labels:
    resource: nginx-svc
spec:
  clusterIP: 10.106.189.88
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx-health
  sessionAffinity: ClientIP
  type: ClusterIP
2. Create the resources
kubectl apply -f nginx-health.yaml
namespace/nginx-health-ns created
deployment.apps/nginx-health-deploy created
service/nginx-health-svc created
3. View the created resources
kubectl get all -n nginx-health-ns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/nginx-health-deploy-6bcc8f7f74-6wc6t 1/1 Running 0 24s 10.244.3.50 k8s-node01
pod/nginx-health-deploy-6bcc8f7f74-cns27 1/1 Running 0 24s 10.244.5.52 k8s-node03
pod/nginx-health-deploy-6bcc8f7f74-rsxjj 1/1 Running 0 24s 10.244.2.42 k8s-node02
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/nginx-health-svc ClusterIP 10.106.189.88 80/TCP 25s app=nginx-health
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
deployment.apps/nginx-health-deploy 3/3 3 3 25s nginx-health-containers nginx:1.17.1 app=nginx-health
NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
replicaset.apps/nginx-health-deploy-6bcc8f7f74 3 3 3 25s nginx-health-containers nginx:1.17.1 app=nginx-health,pod-template-hash=6bcc8f7f74
4. Check the Pod status: none of the Pods is ready, and all containers are in the Completed state and about to be restarted
kubectl get pods -n nginx-health-ns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-health-deploy-6bcc8f7f74-6wc6t 0/1 Completed 0 64s 10.244.3.50 k8s-node01
nginx-health-deploy-6bcc8f7f74-cns27 0/1 Completed 0 64s 10.244.5.52 k8s-node03
nginx-health-deploy-6bcc8f7f74-rsxjj 0/1 Completed 0 64s 10.244.2.42 k8s-node02
5. One Pod has finished restarting and is ready
kubectl get pods -n nginx-health-ns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-health-deploy-6bcc8f7f74-6wc6t 1/1 Running 1 73s 10.244.3.50 k8s-node01
nginx-health-deploy-6bcc8f7f74-cns27 0/1 Running 1 73s 10.244.5.52 k8s-node03
nginx-health-deploy-6bcc8f7f74-rsxjj 0/1 Running 1 73s 10.244.2.42 k8s-node02
6. All three Pods have finished restarting and are ready
kubectl get pods -n nginx-health-ns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-health-deploy-6bcc8f7f74-6wc6t 1/1 Running 1 85s 10.244.3.50 k8s-node01
nginx-health-deploy-6bcc8f7f74-cns27 1/1 Running 1 85s 10.244.5.52 k8s-node03
nginx-health-deploy-6bcc8f7f74-rsxjj 1/1 Running 1 85s 10.244.2.42 k8s-node02
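Rather than comparing the READY column by eye, the check can be scripted. The sketch below is one possible approach: the helper `all_ready` is hypothetical, and it assumes the default `kubectl get pods` table layout with a header line and READY as the second column.

```shell
# Exit 0 only if every Pod row shows READY as n/n (for example 1/1).
all_ready() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) notready++ }
       END { exit (notready > 0) }'
}

# Sample input modeled on the output from step 5 (one Pod ready, one not).
if printf '%s\n' \
  'NAME READY STATUS RESTARTS AGE' \
  'nginx-health-deploy-6bcc8f7f74-6wc6t 1/1 Running 1 73s' \
  'nginx-health-deploy-6bcc8f7f74-cns27 0/1 Running 1 73s' | all_ready
then
  echo "all Pods ready"
else
  echo "some Pods not ready"
fi
```

On a cluster, the input would come from `kubectl get pods -n nginx-health-ns` piped into `all_ready`.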