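From the theory and practice in the previous articles, we know that Docker Swarm automatically judges the health of the containers in a service and decides whether to remove and recreate them, so that the configured number of replicas is maintained. But how does it make that judgment?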
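Every container has a STATUS that describes its run state: created, restarting, running, removing, paused, exited, or dead. The most important part is the container's exit code, which is shown in parentheses after Exited. A status of Exited (N) with N != 0 means the container exited abnormally.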
Use docker kill http.2.a6i8uov6efb4e0wjioha02o9y to simulate one replica of the service exiting abnormally, then check with docker ps -a:
...
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
aad2191ae7d5 nandy/show-host-info:v2 "/app" 18 seconds ago Exited (2) 4 seconds ago http.2.a6i8uov6efb4e0wjioha02o9y
5e0b2da6b4e5 nandy/show-host-info:v2 "/app" 18 seconds ago Up 17 seconds 80/tcp http.1.p35na0wooa509aqd53qzr1n50
...
Exited (2) means the container stopped abnormally. Swarm notices right away that the container has exited and immediately creates a new one to replace it:
...
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c08cdedcf461 nandy/show-host-info:v2 "/app" 9 seconds ago Up 4 seconds 80/tcp http.2.ak0lef9jebrqz2racsf2vq8fl
aad2191ae7d5 nandy/show-host-info:v2 "/app" 24 seconds ago Exited (2) 10 seconds ago http.2.a6i8uov6efb4e0wjioha02o9y
5e0b2da6b4e5 nandy/show-host-info:v2 "/app" 24 seconds ago Up 23 seconds 80/tcp http.1.p35na0wooa509aqd53qzr1n50
...
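The exit code can also be read without eyeballing the STATUS column. The following is a minimal sketch: the helper names (exit_code_from_status, classify_exit, container_exit_code) are hypothetical and not part of Docker, and the last helper assumes a running Docker daemon.

```shell
# Extract the exit code from a `docker ps -a` STATUS string such as
# "Exited (2) 4 seconds ago". Prints nothing for containers that have not exited.
exit_code_from_status() {
  printf '%s\n' "$1" | sed -n 's/^Exited (\([0-9][0-9]*\)).*/\1/p'
}

# Classify an exit code: 0 is a clean exit, anything else is abnormal.
classify_exit() {
  if [ "$1" -eq 0 ]; then
    echo "clean"
  else
    echo "abnormal"
  fi
}

# Ask the Docker daemon directly instead of parsing text.
# Requires a running daemon; the argument is a container name or ID.
container_exit_code() {
  docker inspect --format '{{.State.ExitCode}}' "$1"
}
```

For the killed replica above, container_exit_code http.2.a6i8uov6efb4e0wjioha02o9y should report the same value that docker ps -a shows in parentheses.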
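But think about it in light of real-world server operations experience: is the container's abnormal exit, or the lack of one, enough to tell whether the service is healthy, that is, responding to requests normally? What if the service hangs (abnormal CPU usage, abnormal memory usage, a code bug…) while the container itself never exits? What if the service exposed an endpoint, and health were judged by requesting it periodically and checking for the expected response?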
The star of this article is healthcheck. It has five sub-options:
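--health-cmd=COMMAND: the command that performs the check. Exit status 0 means healthy, 1 means unhealthy.
--health-interval=DURATION (default: 30s): the time between runs of the health check.
--health-timeout=DURATION (default: 30s): if a single check does not finish within this time, it counts as a failure.
--health-retries=N (default: 3): the number of consecutive failures after which the container is marked unhealthy.
--health-start-period=DURATION (default: 0s): a warm-up window after the container starts. health-cmd still runs during this window, but failures are not counted toward health-retries until a check first succeeds or the window ends.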
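How these options interact is easier to see with a small simulation. This is a simplified model rather than Docker's actual implementation: health_status is a hypothetical helper, and the start period is expressed as a number of leading probes instead of a duration.

```shell
# Simulate the health state for a sequence of probe results.
# $1 = retries
# $2 = number of leading probes that fall inside the start period (assumption)
# remaining args = probe exit codes in order (0 = pass, non-zero = fail)
# The body runs in a subshell so its variables do not leak to the caller.
health_status() (
  retries=$1; grace=$2; shift 2
  status=starting; failures=0; n=0; started=0
  for rc in "$@"; do
    n=$((n + 1))
    if [ "$rc" -eq 0 ]; then
      # Any success resets the failure streak and ends the warm-up.
      status=healthy; failures=0; started=1
    elif [ "$started" -eq 1 ] || [ "$n" -gt "$grace" ]; then
      # Failures count only after the warm-up or after a first success.
      failures=$((failures + 1))
      if [ "$failures" -ge "$retries" ]; then status=unhealthy; fi
    fi
  done
  echo "$status"
)
```

For example, health_status 3 0 1 1 1 models three consecutive failures with no start period, while health_status 3 2 1 1 1 1 models four failures of which the first two fall inside the warm-up window and are ignored.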
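Create the service again, this time adding a container health check to the earlier command. Note that the check command, curl -f http://localhost:80, requires curl to be installed in the image: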
docker service create --network httpnet --name http --replicas 2 -p 81:80 \
--health-cmd "curl -f http://localhost:80 || exit 1" --health-interval 5s --health-timeout 3s --health-retries 3 --health-start-period 30s \
nandy/show-host-info:v2
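Once the service is running, each container's health state can also be read straight from the daemon. A minimal sketch follows; the helper names are hypothetical, and both helpers assume a running Docker daemon and a container that has a health check configured (the Health field is absent otherwise).

```shell
# Print the health status of a container: starting, healthy, or unhealthy.
# The argument is a container name or ID.
container_health() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Print the full health record as JSON, including the failing streak and
# the exit code and captured output of the most recent probes.
container_health_log() {
  docker inspect --format '{{json .State.Health}}' "$1"
}
```

The JSON form is the more useful one when a container turns unhealthy, because it carries the output of the failing health-cmd runs.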
Run docker ps to check:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
add8115df6f1 nandy/show-host-info:v2 "python run.py" About a minute ago Up About a minute (healthy) 80/tcp http.1.q1jjcnjcrfcxsafzdvtvhi7ho
66e77ee79c0a nandy/show-host-info:v2 "python run.py" About a minute ago Up About a minute (healthy) 80/tcp http.2.m0oaoqjia0d6m3rzmsh5sr7n6
STATUS now shows an extra (healthy) state. To verify that curl -f http://localhost:80 really runs every 5s and succeeds, look at the container's log with docker logs --tail 5 http.1.q1jjcnjcrfcxsafzdvtvhi7ho:
[2018-10-16 17:54:42 +0800] - (sanic.access)[INFO][127.0.0.1:50220]: GET http://localhost/ 200 41
[2018-10-16 17:54:47 +0800] - (sanic.access)[INFO][127.0.0.1:50224]: GET http://localhost/ 200 41
[2018-10-16 17:54:52 +0800] - (sanic.access)[INFO][127.0.0.1:50228]: GET http://localhost/ 200 41
[2018-10-16 17:54:58 +0800] - (sanic.access)[INFO][127.0.0.1:50232]: GET http://localhost/ 200 41
[2018-10-16 17:55:03 +0800] - (sanic.access)[INFO][127.0.0.1:50236]: GET http://localhost/ 200 41
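The spacing between probes can be checked mechanically rather than by reading timestamps. The sketch below assumes the log format shown above, where the second whitespace-separated field is HH:MM:SS, and that all entries fall on the same day; probe_gaps is a hypothetical helper name.

```shell
# Read access-log lines on stdin and print the gap in seconds between
# consecutive entries, based on the HH:MM:SS field of the leading timestamp.
probe_gaps() {
  awk '{
    split($2, t, ":")                      # $2 looks like 17:54:42
    now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
    if (NR > 1) print now - prev
    prev = now
  }'
}
```

It can be fed with docker logs --tail 5 http.1.q1jjcnjcrfcxsafzdvtvhi7ho 2>&1 | probe_gaps (the 2>&1 covers the case where the access log is written to stderr). For the five lines above the gaps are 5, 5, 6, and 5 seconds. An occasional gap slightly above 5s is expected, since each interval is counted from the completion of the previous check.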