At the last step of installing nvidia-docker, you are usually asked to run the following command to test whether the installation succeeded:
sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
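According to the official tutorial, the installation succeeded if this command prints the nvidia-smi output.
However, not everyone will see that output. I ran into a series of problems, so I am recording the fixes here for reference.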
1. Error: nvidia-docker | 2018/11/01 15:05:51 Error: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.26/version: dial unix /var/run/docker.sock: connect: permission denied
This error usually means insufficient permissions. Prefix the command with sudo to run it with administrator privileges, i.e. sudo nvidia-docker ....... (rest omitted)
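To confirm that the failure really is a socket permission problem, you can test whether the current user can write to the Docker socket. This is only a minimal sketch: `check_docker_socket` is a hypothetical helper name of my own, and the default path is simply the one quoted in the error message.

```shell
#!/bin/sh
# Hypothetical helper (not part of Docker or nvidia-docker):
# reports whether the current user can write to a Docker socket path.
check_docker_socket() {
    sock="${1:-/var/run/docker.sock}"   # default path taken from the error message
    if [ ! -e "$sock" ]; then
        echo "missing"   # nothing at that path: the daemon is probably not running
    elif [ -w "$sock" ]; then
        echo "ok"        # writable: docker commands should work without sudo
    else
        echo "denied"    # exists but not writable: use sudo or fix group membership
    fi
}

check_docker_socket
```

If this prints `denied`, prefixing the command with sudo should work. Adding your user to the `docker` group is a common alternative, but it effectively grants root-level access to that user.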
2. Error: the image is pulled first, and once the download completes this error appears: container_linux.go:247: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH"
docker: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH".
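This problem puzzled me for a long time, because I could not find similar reports online. At first I assumed the NVIDIA driver was installed incorrectly, so that nvidia-smi could not be found in $PATH. I tried using export to add the executable /usr/bin/nvidia-smi, but that had no effect. After reading some blog posts, I suspected either the driver installation or something related to the kernel.
However, running nvidia-smi on the host showed that the driver was installed correctly, so I guessed that the Docker version did not match the kernel version.
I then installed a specific version of Docker, which did not help either. What finally worked was specifying the image tag for my installed CUDA version when running the container: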
sudo nvidia-docker run --rm nvidia/cuda:8.0-devel nvidia-smi
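Rather than typing the tag by hand, you could derive it from the CUDA toolkit on the host. The sketch below assumes that `nvcc` is installed and that its version output contains a phrase of the form `release X.Y`. `cuda_image_tag` is a hypothetical helper name, and the `-devel` suffix is copied from the command above. Not every CUDA version necessarily has a matching tag on Docker Hub.

```shell
#!/bin/sh
# Hypothetical helper: read `nvcc --version` output on stdin and print
# an image name of the form nvidia/cuda:<major.minor>-devel.
cuda_image_tag() {
    # Extract the first "release X.Y" number found in the input.
    version=$(sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1)
    if [ -z "$version" ]; then
        echo "could not find a CUDA release number" >&2
        return 1
    fi
    echo "nvidia/cuda:${version}-devel"
}

# Demo with a sample line in the format nvcc prints for CUDA 8.0.
echo "Cuda compilation tools, release 8.0, V8.0.61" | cuda_image_tag
```

On a real host you would run `nvcc --version | cuda_image_tag` and pass the result to `sudo nvidia-docker run --rm <image> nvidia-smi`.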
Even this was not the end, though: a new problem appeared, described in item 3.
3. Error:
Running the command
sudo nvidia-docker run --rm nvidia/cuda:8.0-devel nvidia-smi
produces this error:
8.0-devel: Pulling from nvidia/cuda
18d680d61657: Pull complete
0addb6fece63: Pull complete
78e58219b215: Pull complete
eb6959a66df2: Pull complete
7a0b022c2633: Pull complete
2536ccb3c0e4: Pull complete
72568544e638: Pull complete
eb3c7a1fa7df: Pull complete
Digest: sha256:953003e905c1201b7bc45de52d94f8e1b1111f510ce7e550791281dc8b8649e2
Status: Downloaded newer image for nvidia/cuda:8.0-devel
nvidia-docker | 2018/11/01 10:20:17 Error: Could not load UVM kernel module. Is nvidia-modprobe installed?
Following the hint in the message, install nvidia-modprobe:
sudo apt-get install nvidia-modprobe
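Before retrying, you can check whether the UVM kernel module is actually loaded. This is a rough sketch: `uvm_loaded` is a hypothetical helper name, and it assumes the module is listed under a name such as `nvidia_uvm` (some distribution packages put a version number in the name, which the pattern also tolerates).

```shell
#!/bin/sh
# Hypothetical helper: read a module listing (the format of /proc/modules
# or `lsmod`) on stdin and report whether an NVIDIA UVM module is listed.
uvm_loaded() {
    # Match a module name at the start of a line, e.g. nvidia_uvm or nvidia_384_uvm.
    if grep -Eq '^nvidia[_0-9]*uvm[[:space:]]'; then
        echo "loaded"
    else
        echo "not loaded"
    fi
}

# Check the running kernel if /proc/modules is readable.
if [ -r /proc/modules ]; then
    uvm_loaded < /proc/modules
fi
```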
After installing it, try again:
sudo nvidia-docker run --rm nvidia/cuda:8.0-devel nvidia-smi
There is still an error:
docker: Error response from daemon: create nvidia_driver_384.130: create nvidia_driver_384.130: Error looking up volume plugin nvidia-docker: legacy plugin: plugin not found.
See 'docker run --help'.
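Here 384.130 is my driver version; yours may differ, but that does not matter. After some digging, I found that the driver installation was not the problem. Run the following command: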
systemctl status nvidia-docker
The output looks like this:
nvidia-docker.service - NVIDIA Docker plugin
Loaded: loaded (/lib/systemd/system/nvidia-docker.service; enabled; vendor pr
Active: failed (Result: start-limit-hit) since 四 2018-11-01 09:28:07 CST; 1h
Docs: https://github.com/NVIDIA/nvidia-docker/wiki
Process: 15768 ExecStopPost=/bin/rm -f $SPEC_FILE (code=exited, status=0/SUCCE
Process: 15761 ExecStartPost=/bin/sh -c /bin/echo unix://$SOCK_DIR/nvidia-dock
Process: 15746 ExecStartPost=/bin/sh -c /bin/mkdir -p $( dirname $SPEC_FILE )
Process: 15745 ExecStart=/usr/bin/nvidia-docker-plugin -s $SOCK_DIR (code=exit
Main PID: 15745 (code=exited, status=1/FAILURE)
11月 01 09:28:06 yu408 systemd[1]: Failed to start NVIDIA Docker plugin.
11月 01 09:28:06 yu408 systemd[1]: nvidia-docker.service: Unit entered failed st
11月 01 09:28:06 yu408 systemd[1]: nvidia-docker.service: Failed with result 'ex
11月 01 09:28:07 yu408 systemd[1]: nvidia-docker.service: Service hold-off time
11月 01 09:28:07 yu408 systemd[1]: Stopped NVIDIA Docker plugin.
11月 01 09:28:07 yu408 systemd[1]: nvidia-docker.service: Start request repeated
11月 01 09:28:07 yu408 systemd[1]: Failed to start NVIDIA Docker plugin.
11月 01 09:28:07 yu408 systemd[1]: nvidia-docker.service: Unit entered failed st
11月 01 09:28:07 yu408 systemd[1]: nvidia-docker.service: Failed with result 'st
The Active state is failed, so start the service with the following command:
systemctl start nvidia-docker
Then check the status again:
systemctl status nvidia-docker
Output like the following means the service started successfully:
nvidia-docker.service - NVIDIA Docker plugin
Loaded: loaded (/lib/systemd/system/nvidia-docker.service; enabled; vendor pr
Active: active (running) since 四 2018-11-01 10:53:01 CST; 17s ago
Docs: https://github.com/NVIDIA/nvidia-docker/wiki
Process: 15768 ExecStopPost=/bin/rm -f $SPEC_FILE (code=exited, status=0/SUCCE
Process: 20905 ExecStartPost=/bin/sh -c /bin/echo unix://$SOCK_DIR/nvidia-dock
Process: 20889 ExecStartPost=/bin/sh -c /bin/mkdir -p $( dirname $SPEC_FILE )
Main PID: 20888 (nvidia-docker-p)
Tasks: 8
Memory: 36.5M
CPU: 1.165s
CGroup: /system.slice/nvidia-docker.service
└─20888 /usr/bin/nvidia-docker-plugin -s /var/lib/nvidia-docker
11月 01 10:53:01 yu408 systemd[1]: Starting NVIDIA Docker plugin...
11月 01 10:53:01 yu408 nvidia-docker-plugin[20888]: /usr/bin/nvidia-docker-plugi
11月 01 10:53:01 yu408 nvidia-docker-plugin[20888]: /usr/bin/nvidia-docker-plugi
11月 01 10:53:01 yu408 systemd[1]: Started NVIDIA Docker plugin.
11月 01 10:53:01 yu408 nvidia-docker-plugin[20888]: /usr/bin/nvidia-docker-plugi
11月 01 10:53:03 yu408 nvidia-docker-plugin[20888]: /usr/bin/nvidia-docker-plugi
11月 01 10:53:03 yu408 nvidia-docker-plugin[20888]: /usr/bin/nvidia-docker-plugi
11月 01 10:53:03 yu408 nvidia-docker-plugin[20888]: /usr/bin/nvidia-docker-plugi
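If you want to script the status check above instead of reading the output by eye, the state can be pulled out of the `Active:` line. This is a minimal sketch, and `active_state` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `systemctl status` output on stdin and print
# the word after "Active:" (for example "failed" or "active").
active_state() {
    sed -n 's/^[[:space:]]*Active:[[:space:]]*\([a-z]*\).*/\1/p' | head -n 1
}

# Demo with the start of the "Active:" line from the failed status output.
echo "Active: failed (Result: start-limit-hit)" | active_state
```

On a real host you would run `systemctl status nvidia-docker | active_state`; `systemctl is-active nvidia-docker` reports the same state directly. If the service refuses to start because of `start-limit-hit`, `sudo systemctl reset-failed nvidia-docker` may be needed before starting it again, though I did not need that here.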
Then run the test command once more:
sudo nvidia-docker run --rm nvidia/cuda:8.0-devel nvidia-smi
If the nvidia-smi output appears, everything works. Done!
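As a recap, the error messages in this post and the steps that resolved them for me can be written as a small lookup. `suggest_fix` is a hypothetical helper of my own, the patterns are substrings of the messages quoted above, and the suggestions only reflect what worked on my machine.

```shell
#!/bin/sh
# Hypothetical helper summarizing this post: map an error message to the
# step that resolved it for the author.
suggest_fix() {
    case "$1" in
        *"permission denied"*)
            echo "run the command with sudo" ;;
        *"executable file not found in \$PATH"*)
            echo "specify an image tag matching your CUDA version" ;;
        *"Could not load UVM kernel module"*)
            echo "install nvidia-modprobe" ;;
        *"plugin not found"*)
            echo "start the nvidia-docker service" ;;
        *)
            echo "unknown error" ;;
    esac
}

suggest_fix "Error: Could not load UVM kernel module. Is nvidia-modprobe installed?"
```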