nvidia-docker installation error notes

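Many of our services now need to be deployed with Docker for business reasons, so I recently started learning how to use it.

The code is Python and the deep learning framework is TensorFlow. According to the official instructions, Docker and nvidia-docker must be installed first. Installing Docker was straightforward; I mostly followed this guide: https://yeasy.gitbooks.io/docker_practice/install/centos.html

Installing nvidia-docker, however, had a few small pitfalls, which I am recording here.

For nvidia-docker I mostly followed the official instructions: https://github.com/NVIDIA/nvidia-docker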

# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
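Before running the test command, it can help to confirm that the nvidia runtime was actually registered with Docker. The sketch below assumes that the nvidia-docker2 package registers the runtime in /etc/docker/daemon.json; the function name has_nvidia_runtime is made up for this note, and the grep is only a rough text match, not a real JSON check.

```shell
# Hypothetical helper: rough check that a Docker daemon config mentions
# an "nvidia" runtime. Assumes the config lives in /etc/docker/daemon.json
# unless another path is passed as the first argument.
has_nvidia_runtime() {
    config_file="${1:-/etc/docker/daemon.json}"
    # Succeed only if the file is readable and contains the quoted key "nvidia".
    [ -r "$config_file" ] && grep -q '"nvidia"' "$config_file"
}

# Usage (prints a hint instead of failing silently):
#   has_nvidia_runtime || echo "nvidia runtime not found in daemon.json"
```

If the check fails, the --runtime=nvidia flag in the test command is unlikely to work, so this is worth ruling out first.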

The first few steps went smoothly, but the final test command

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

failed every time with the following error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=15997 /var/lib/docker/overlay2/5e678ed1c028293c3a8d9edc227307b89239e8c41672174811378c40e2dbbec9/merged]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: cuda >= 9.0\\\\n\\\"\"": unknown.

The last part of the message, "requirement error: unsatisfied condition: cuda >= 9.0", suggests a CUDA version problem.
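The useful part of that error is buried in several layers of escaped quotes. As a small sketch, the filter below pulls out only the "unsatisfied condition" clause. The function name extract_unsatisfied_condition is invented for this note, and the pattern assumes the message keeps the "name operator version" shape seen above.

```shell
# Hypothetical helper: read Docker's error text on stdin and print only
# the "unsatisfied condition: <name> <op> <version>" clause, if present.
extract_unsatisfied_condition() {
    grep -o 'unsatisfied condition: [a-z_]* [<>=]* [0-9.]*'
}

# Usage: docker writes the error to stderr, so merge it into stdout first.
#   docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi 2>&1 \
#     | extract_unsatisfied_condition
```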

Check the CUDA version on the host:

cat /usr/local/cuda/version.txt

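Output: CUDA Version 8.0.61

Sure enough, the host is on CUDA 8.0, which does not satisfy the cuda >= 9.0 requirement of the 9.0-base image. (Strictly speaking, nvidia-container-cli checks the CUDA version supported by the installed NVIDIA driver rather than the toolkit under /usr/local/cuda. The 375.66 driver on this machine belongs to the CUDA 8.0 generation, so the conclusion is the same.)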
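To make the comparison explicit, here is a minimal sketch that reduces the version.txt line to major.minor and compares it with the version the image requires. Both function names are made up for this note. Keep in mind that version.txt describes the installed toolkit, while the container check itself is driven by the driver, so treat this as an illustration of the comparison rather than a reimplementation of it.

```shell
# Hypothetical helper: keep only major.minor from a line such as
# "CUDA Version 8.0.61".
cuda_major_minor() {
    printf '%s\n' "$1" | sed -n 's/^CUDA Version \([0-9]*\.[0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed when version $1 >= version $2,
# comparing the major and minor parts numerically.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 > y[1] + 0)
        exit !(x[2] + 0 >= y[2] + 0)
    }'
}

# Usage:
#   host=$(cuda_major_minor "$(cat /usr/local/cuda/version.txt)")
#   version_ge "$host" 9.0 || echo "host CUDA $host is older than 9.0"
```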

So I changed the command to use an 8.0 image:

docker run --runtime=nvidia --rm nvidia/cuda:8.0-base nvidia-smi

That failed as well:

Unable to find image 'nvidia/cuda:8.0-base' locally
docker: Error response from daemon: manifest for nvidia/cuda:8.0-base not found.

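Apparently there is no image with that tag. As far as I can tell, the 8.0 images are published only in runtime and devel variants, without a base variant.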

A Baidu search eventually turned up the correct command:

docker run --runtime=nvidia --rm nvidia/cuda:8.0-devel nvidia-smi

This time nvidia-smi ran successfully:

Thu Feb 14 07:56:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 0000:00:0C.0     Off |                    0 |
| N/A   42C    P0    28W / 250W |      0MiB / 16276MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Summary: the image tag in the test command has to match the CUDA version that the host actually supports.
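As a rough sketch of that rule, the helper below builds the test command from a CUDA major.minor version. The function name cuda_test_command is hypothetical. The tag choice only encodes what was observed in this post (8.0-base missing, 8.0-devel working, 9.0-base used by the official instructions); using "base" for every other version is an assumption.

```shell
# Hypothetical helper: print (not run) the nvidia-smi test command for a
# given CUDA major.minor version such as "8.0" or "9.0".
cuda_test_command() {
    version="$1"
    case "$version" in
        8.*) flavor="devel" ;;  # 8.0-base was not found; 8.0-devel worked here
        *)   flavor="base"  ;;  # assumption: later versions provide a base tag
    esac
    printf 'docker run --runtime=nvidia --rm nvidia/cuda:%s-%s nvidia-smi\n' \
        "$version" "$flavor"
}

# Usage: review the printed command, then run it by hand.
#   cuda_test_command 8.0
```

Printing the command instead of executing it keeps the helper safe to run on machines without a GPU.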
