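Symptoms
Both creating a container with docker run -d centos:v1 /bin/bash and entering one with docker exec -it container_name bash fail with the error "/usr/bin/docker-current: Error response from daemon: shim error: context deadline exceeded." Commands such as docker ps, docker stats, and docker info still work.
Environment
Host OS: CentOS Linux release 7.3.1611 (Core)
Kernel version: 3.10.0-693.el7.x86_64. This kernel version includes the fix for the bug that limited a single host to at most 100 containers (beyond that, an xfs filesystem bug was triggered and the machine rebooted on its own).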
Docker version:
Client:
Version: 1.12.6
API version: 1.24
Package version: docker-common-1.12.6-16.el7.centos.x86_64
Go version: go1.7.4
Git commit: 3a094bd/1.12.6
Built: Fri Apr 14 13:46:13 2017
OS/Arch: linux/amd64

Server:
Version: 1.12.6
API version: 1.24
Package version: docker-common-1.12.6-16.el7.centos.x86_64
Go version: go1.7.4
Git commit: 3a094bd/1.12.6
Built: Fri Apr 14 13:46:13 2017
OS/Arch: linux/amd64
Docker info:
Containers: 68
Running: 39
Paused: 0
Stopped: 0
Images: 38
Server Version: 1.12.6
Storage Driver: devicemapper
Pool Name: docker-253:0-3222085682-pool
Pool Blocksize: 65.54 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 23.86 GB
Data Space Total: 107.4 GB
Data Space Available: 83.51 GB
Metadata Space Used: 48.09 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.099 GB
Thin Pool Minimum Free Space: 10.74 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
WARNING: Usage of loopback devices is strongly discouraged for production use. Use --storage-opt dm.thinpooldev to specify a custom block storage device.
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.135-RHEL7 (2016-09-28)
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
Volume: local
Network: bridge host null overlay
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp
Kernel Version: 3.10.0-693.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 48
Total Memory: 251.1 GiB
Name: t-docker-02-12
ID: 4OTZ:QXM3:XSQW:ZPQK:2XEF:W25W:R5DN:QL6X:RMXV:63WP:BHAB:NGPK
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
registry.sfbest.com
127.0.0.0/8
Registries: docker.io (secure)
Analysis
1.1 Log contents
The Docker daemon log contains a large number of errors, shown below:
Jan 9 11:00:32 t-docker-02-12 dockerd-current: time="2018-01-09T11:00:32.482494003+08:00" level=error msg="Error running exec in container: rpc error: code = 2 desc = shim error: context deadline exceeded"
Jan 9 11:00:47 t-docker-02-12 dockerd-current: time="2018-01-09T11:00:47.540579791+08:00" level=error msg="Error running exec in container: rpc error: code = 2 desc = shim error: context deadline exceeded"
Jan 9 11:01:02 t-docker-02-12 dockerd-current: time="2018-01-09T11:01:02.581747742+08:00" level=error msg="Error running exec in container: rpc error: code = 2 desc = shim error: context deadline exceeded"
Jan 9 11:01:17 t-docker-02-12 dockerd-current: time="2018-01-09T11:01:17.614305903+08:00" level=error msg="Error running exec in container: rpc error: code = 2 desc = shim error: context deadline exceeded"
Jan 9 11:01:32 t-docker-02-12 dockerd-current: time="2018-01-09T11:01:32.658808780+08:00" level=error msg="Error running exec in container: rpc error: code = 2 desc = shim error: context deadline exceeded"
Jan 9 11:01:47 t-docker-02-12 dockerd-current: time="2018-01-09T11:01:47.702526455+08:00" level=error msg="Error running exec in container: rpc error: code = 2 desc = shim error: context deadline exceeded"
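To gauge how often the daemon hits this timeout, the matching log lines can be counted. The following is a minimal sketch: count_shim_errors is a hypothetical helper name, and the log path in the usage comment is an assumption (the post does not say where these lines were read from).

```shell
# Count log lines that carry the shim timeout error.
# Reads log text on stdin, so it works with any log source.
count_shim_errors() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c 'shim error: context deadline exceeded' || true
}

# Usage (path is an assumption; point it at wherever dockerd logs on the host):
#   count_shim_errors < /var/log/messages
```

Reading from stdin rather than a fixed file keeps the helper usable with journalctl output or an exported log as well.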
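1.2 Google search
Searching Google for "shim error: context deadline exceeded" turned up other people who had hit related problems, but no root cause or fix. Some attribute it to a bug in Docker 1.12, but the bug described there does not appear to be related to the problem seen here: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=1443103
1.3 Attempted fixes
1.3.1 docker exec processes
The first suspicion was that many "docker exec -it containerid bash" sessions had not been exited properly, leaving too many "docker exec" processes behind and interfering with docker run and docker exec. All "docker exec" processes were killed, but the problem remained.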
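One way to find such leftover client processes is to filter the process list. This is only a sketch: list_docker_exec_pids is a hypothetical helper, and it assumes the client binary is named docker or docker-current and that exec is the first argument (global flags placed before exec would be missed).

```shell
# Print the PIDs of docker client processes running the "exec" subcommand.
# Input: lines of "PID COMMAND ARGS..." as produced by `ps -eo pid,args`.
list_docker_exec_pids() {
    # $2 is the executable (bare name or full path), $3 the first argument.
    awk '$2 ~ /(^|\/)docker(-current)?$/ && $3 == "exec" { print $1 }'
}

# Usage: review the PIDs first, then kill them if appropriate.
#   ps -eo pid,args | list_docker_exec_pids
#   ps -eo pid,args | list_docker_exec_pids | xargs -r kill
```

The daemon itself (dockerd-current) is not matched, because the pattern requires the command name to end in docker or docker-current.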
1.3.2 Anomaly in docker info
Docker info:
Containers: 68
Running: 39
Paused: 0
Stopped: 0
Images: 38
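There are 68 containers in total, but only 39 are running; all the others are in the Exited state.
After these Exited containers were deleted, docker run and docker exec worked again and the problem was resolved.
The current suspicion is that too many Exited containers caused the problem.
Because this is a test host, it inevitably accumulates experimental containers that may never start at all, and a container that fails to start ends up in the Exited state.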
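A quick tally of container states makes this kind of build-up easier to spot than reading docker info alone. The sketch below assumes the status text comes from docker ps -a --format '{{.Status}}', whose first word is the state (Up, Exited, Created, and so on); summarize_container_states is a hypothetical helper name.

```shell
# Tally containers by the first word of their status line.
# Input: one status per line, e.g. "Up 3 hours" or "Exited (1) 5 minutes ago".
summarize_container_states() {
    awk '{ count[$1]++ } END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}

# Usage:
#   docker ps -a --format '{{.Status}}' | summarize_container_states
# To list only the IDs of exited containers:
#   docker ps -aq -f status=exited
```

The status=exited filter is a stricter alternative to grepping for the word Exited, since it cannot match a container whose name or command happens to contain that word.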
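Preventive measures
Periodically run docker rm $(docker ps -a | grep Exited | awk '{print $1}') to clean up leftover containers;
Ship the Docker and system logs to ELK and monitor their contents; if more than 10 entries containing "error" appear within one minute, send an email alert.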
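The per-minute threshold can be prototyped in plain shell before any ELK rule exists. This is a stand-in sketch, not an ELK configuration: minutes_over_threshold is a hypothetical helper, and it assumes syslog-style lines that begin with "Mon D HH:MM:SS" like the log excerpt in section 1.1.

```shell
# Print "<minute> <count>" for every minute in which the number of lines
# containing "error" exceeds the threshold (first argument, default 10).
# Input: syslog-style lines on stdin, e.g. "Jan 9 11:00:32 host dockerd-current: ...".
minutes_over_threshold() {
    awk -v limit="${1:-10}" '
        /error/ {
            # Bucket by month, day, and HH:MM, e.g. "Jan 9 11:00".
            minute = $1 " " $2 " " substr($3, 1, 5)
            count[minute]++
        }
        END {
            for (m in count)
                if (count[m] > limit) print m, count[m]
        }
    ' | sort
}

# Usage (log path is an assumption):
#   minutes_over_threshold 10 < /var/log/messages
```

Any non-empty output means the threshold was crossed, so the result can be handed to whatever mail or alerting mechanism is already in place.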