Some Thoughts on nvidia-docker Installation Errors

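When I installed the Autoware Docker image earlier, the errors I hit came from the NVIDIA driver, from CUDA, or from nvidia-docker. This time Autoware did install successfully in the end, but some problems remain.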

dong@cvad:~$ docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Unable to find image 'nvidia/cuda:latest' locally
latest: Pulling from nvidia/cuda
898c46f3b1a1: Pull complete 
63366dfa0a50: Pull complete 
041d4cd74a92: Pull complete 
6e1bee0f8701: Pull complete 
131dbe7c254d: Pull complete 
5bca6b05dcd6: Pull complete 
0d286a7b6e12: Pull complete 
5776d2c6371d: Pull complete 
768e84e7fc24: Pull complete 
Digest: sha256:eba1dc5810e40f60625ee797d618a6bd11be24cb67bc6647a4d36392202bb013
Status: Downloaded newer image for nvidia/cuda:latest

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=9333 /var/lib/docker/overlay2/b64e08438f267f9c15fe7e68340df81ab085636359d502d92b9586324c5d8b41/merged]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla\\\\n\\\"\"": unknown.

Take this command:

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

dong@cvad:~$ sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused "process_linux.go:407: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=14008 /var/lib/docker/overlay2/48fa7509300f737df294dd36ae954b441b8343d20b039020eabecf5eeb90fe91/merged]\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla\\n\""": unknown.

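The command above is used to verify that nvidia-docker was installed correctly. If it was, the command should print the GPU information instead of an error.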
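The error text is long, but only two parts of it matter for diagnosis: the `--require=` list and the condition reported as unsatisfied. Below is a minimal sketch for pulling both out of the message. The sample string is an abbreviated copy of the error above ("..." marks the parts I cut), the function names are my own, and the second function assumes the list is followed by ` --pid=` as it is in this log.

```shell
# Abbreviated sample of the error text from the log above; "..." marks cut parts.
err='exec command: [/usr/bin/nvidia-container-cli ... --require=cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=14008 ...]\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla\\n'

# Print the condition that nvidia-container-cli reported as unmet.
# Reads the error text on stdin.
unsatisfied_condition() {
  sed -n 's/.*unsatisfied condition: \([a-z]* *[<>=]* *[A-Za-z0-9.]*\).*/\1/p'
}

# Print each alternative from the --require= list on its own line.
# Assumes the list is followed by " --pid=", as in the log above.
require_alternatives() {
  sed -n 's/.*--require=\(.*\) --pid=.*/\1/p' | tr ' ' '\n'
}

printf '%s\n' "$err" | unsatisfied_condition
printf '%s\n' "$err" | require_alternatives
```

As far as I understand the nvidia-container-runtime documentation, the space-separated entries are alternatives and the comma-separated parts inside one entry must all hold. Read that way, the image accepts either `cuda>=10.1`, or a Tesla card with a 384.x driver, or a Tesla card with a 410.x driver, and `brand = tesla` is simply the last check that failed on this GeForce card.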

So how do I fix this?

First, look through third-party write-ups on installing nvidia-docker to see whether they offer any clues. Search: nvidia docker installation

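Building an NVIDIA CUDA Docker image on Ubuntu 16.04
This one covers installing Docker CE, uninstalling Docker CE, getting sudo permissions for docker, and installing the nvidia docker image with the runtime version. It left my head spinning.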

Setting up an nvidia-docker runtime environment - Ubuntu 16.04
In this one the installation succeeded; it installs from a .deb file.

Next, search for the error message itself.

github | nvidia | k8s-device-plugin
This one reports the same error.

github | Docker image tensorflow/tensorflow:latest-gpu failed to use nvidia/cuda image #25618
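This one made me suspect a version mismatch. There, the poster's problem was that the driver was too old. On my machine I have CUDA 8.0 and NVIDIA Docker 2.0: CUDA 8.0 is an old release, while NVIDIA Docker 2.0 is a new one. So I either update CUDA or roll NVIDIA Docker back to 1.0.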
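To make the mismatch idea concrete, here is a rough sketch that compares a host driver version against the minimum driver a CUDA release needs. The table values are an assumption on my part, quoted from memory of NVIDIA's CUDA release notes, so check them before relying on them. The function names are hypothetical, and the comparison relies on GNU `sort -V`.

```shell
# Minimum Linux driver version per CUDA release.
# ASSUMPTION: values quoted from memory of NVIDIA's release notes; verify them.
min_driver_for_cuda() {
  case "$1" in
    8.0)  echo 375.26 ;;
    9.0)  echo 384.81 ;;
    10.0) echo 410.48 ;;
    10.1) echo 418.39 ;;
    *)    return 1 ;;   # CUDA version not in this small table
  esac
}

# Succeeds when driver version $1 is new enough for CUDA version $2.
# Returns 2 when the CUDA version is not in the table.
driver_supports_cuda() {
  min_needed=$(min_driver_for_cuda "$2") || return 2
  # "sort -V -C" succeeds when its input is already in version order,
  # that is, when min_needed <= driver.
  printf '%s\n%s\n' "$min_needed" "$1" | sort -V -C
}

for cuda in 8.0 9.0 10.0 10.1; do
  if driver_supports_cuda 384.130 "$cuda"; then
    echo "driver 384.130 can run CUDA $cuda"
  else
    echo "driver 384.130 is too old for CUDA $cuda"
  fi
done
```

With the 384.130 driver shown by `nvidia-smi` later in this post, the `cuda>=10.1` requirement in the error message cannot be met under these assumed values, while a CUDA 8.0 image can.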

Search: what is the docker daemon

Understanding the docker daemon | Docker in Practice (1): Understanding the Architecture

Also about the daemon | Docker Series (1): An Introduction

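I now think the problem is a version mismatch between the CUDA driver and nvidia-docker. Two things need confirming: 1. which versions the Autoware Docker installation tutorials choose; 2. what a search for the failing verification command itself turns up, rather than a search for the error output. I also remember that the nvidia-docker GitHub repository has a similar error report mentioning "345".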

1.
| isl_qdu | Installing Autoware with Docker
Installation (version 1.0)
The article I followed uses nvidia docker 1.0.

2.
The official Autoware Docker installation is also based on nvidia docker 1:
autoware github | autoware docker install | Generic x86 Docker

$ wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb

3.
The obvious fix is to install nvidia docker 1. Before that, some background on docker and nvidia docker is worth adding.

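Docker generally serves CPU-based applications.
For GPU-based work, such as what we are doing here, nvidia-docker is needed.
nvidia-docker is a wrapper layer on top of docker: it goes through nvidia-docker-plugin and then calls docker.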
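The practical difference between the two generations shows up in how a GPU container is started: version 1.0 is invoked through the `nvidia-docker` wrapper, while version 2.0 registers an `nvidia` runtime that is selected with `docker run --runtime=nvidia`. The sketch below only prints the matching command line and runs nothing; `gpu_run_command` is a name I made up for illustration.

```shell
# Print the command line that starts a GPU container for a given
# nvidia-docker major version (1 or 2). Nothing is executed.
gpu_run_command() {
  nd_version=$1
  shift
  case "$nd_version" in
    1) echo "nvidia-docker run $*" ;;               # wrapper command (1.0)
    2) echo "docker run --runtime=nvidia $*" ;;     # nvidia runtime (2.0)
    *) echo "unsupported nvidia-docker version: $nd_version" >&2
       return 1 ;;
  esac
}

gpu_run_command 1 --rm nvidia/cuda nvidia-smi
gpu_run_command 2 --rm nvidia/cuda nvidia-smi
```

Note that the verification command used throughout this post is the `--runtime=nvidia` form, which is the 2.0 style.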

More details

4.
Now reinstall nvidia docker 1.

dong@cvad:~$ sudo docker --version
Docker version 18.09.5, build e8ff056

dong@cvad:~$ nvidia-smi
Tue Apr 30 10:36:35 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   29C    P8     7W / 120W |    338MiB /  6069MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1470      G   /usr/lib/xorg/Xorg                           181MiB |
|    0      2954      G   compiz                                       149MiB |
|    0      3248      G   fcitx-qimpanel                                 5MiB |
+-----------------------------------------------------------------------------+
dong@cvad:~$ 

nvidia docker wiki | Installation (version 1.0)

  1. GNU/Linux x86_64 with kernel version > 3.10
  2. Docker >= 1.9 (official docker-engine, docker-ce or docker-ee only)
  3. NVIDIA GPU with Architecture > Fermi (2.1)
  4. NVIDIA drivers >= 340.29 with binary nvidia-modprobe

nvidia docker wiki | Installation (version 2.0)

  1. GNU/Linux x86_64 with kernel version > 3.10
  2. Docker >= 1.12
  3. NVIDIA GPU with Architecture > Fermi (2.1)
  4. NVIDIA drivers ~= 361.93 (untested on older versions)
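The version 2.0 prerequisites can be checked mechanically. The sketch below is one way to do it, under several assumptions: I treat "~= 361.93" as "at least 361.93", the function names are my own, the kernel version in the example call is hypothetical (on a real host it would come from `uname -r`), and the comparison relies on GNU `sort -V`. The GPU architecture requirement is not checked.

```shell
# Succeeds when version $1 >= version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeeds when version $1 > version $2.
version_gt() {
  [ "$1" != "$2" ] && version_ge "$1" "$2"
}

# Check kernel, Docker, and driver versions against the 2.0 list above.
# Arguments: kernel version, Docker version, NVIDIA driver version.
check_nvidia_docker2_host() {
  kernel_ver=$1
  docker_ver=$2
  driver_ver=$3
  status=0
  version_gt "$kernel_ver" 3.10   || { echo "kernel $kernel_ver is not > 3.10"; status=1; }
  version_ge "$docker_ver" 1.12   || { echo "Docker $docker_ver is not >= 1.12"; status=1; }
  version_ge "$driver_ver" 361.93 || { echo "driver $driver_ver is older than 361.93"; status=1; }
  return "$status"
}

# Docker and driver versions are taken from the output above;
# the kernel version 4.15.0 is a made-up placeholder.
if check_nvidia_docker2_host 4.15.0 18.09.5 384.130; then
  echo "host meets the listed 2.0 prerequisites"
fi
```

Docker 18.09.5 and driver 384.130 pass both the 1.0 and the 2.0 lists, which suggests the prerequisites themselves are not what is failing here.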

Judging by these requirements, reinstalling nvidia docker may not actually solve the problem.

Operation

dong@cvad:~$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
>   sudo apt-key add -
OK

dong@cvad:~$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

dong@cvad:~$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
>   sudo tee /etc/apt/sources.list.d/nvidia-docker.list
deb https://nvidia.github.io/libnvidia-container/ubuntu16.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/ubuntu16.04/$(ARCH) /

dong@cvad:~$ sudo apt-get update
获取:1 file:/var/cuda-repo-8-0-local  InRelease
忽略:1 file:/var/cuda-repo-8-0-local  InRelease
获取:2 file:/var/cuda-repo-8-0-local  Release [574 B]
获取:2 file:/var/cuda-repo-8-0-local  Release [574 B]
命中:4 https://nvidia.github.io/libnvidia-container/ubuntu16.04/amd64  InRelease
命中:5 https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  InRelease
命中:6 https://download.docker.com/linux/ubuntu xenial InRelease
命中:7 https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64  InRelease
命中:8 http://packages.ros.org/ros/ubuntu xenial InRelease                     
获取:9 http://security.ubuntu.com/ubuntu xenial-security InRelease [109 kB]    
命中:10 http://cn.archive.ubuntu.com/ubuntu xenial InRelease                   
获取:11 http://cn.archive.ubuntu.com/ubuntu xenial-updates InRelease [109 kB]  
获取:12 http://security.ubuntu.com/ubuntu xenial-security/main amd64 DEP-11 Metadata [68.0 kB]
命中:13 http://ppa.launchpad.net/apt-fast/stable/ubuntu xenial InRelease       
获取:14 http://cn.archive.ubuntu.com/ubuntu xenial-backports InRelease [107 kB]
获取:15 http://security.ubuntu.com/ubuntu xenial-security/main DEP-11 64x64 Icons [67.1 kB]
忽略:16 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial InRelease    
获取:17 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 DEP-11 Metadata [116 kB]
获取:18 http://cn.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages [947 kB]
获取:19 http://security.ubuntu.com/ubuntu xenial-security/universe DEP-11 64x64 Icons [173 kB]
忽略:20 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial Release      
获取:21 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 DEP-11 Metadata [2,464 B]
获取:22 http://cn.archive.ubuntu.com/ubuntu xenial-updates/main i386 Packages [819 kB]
获取:23 http://cn.archive.ubuntu.com/ubuntu xenial-updates/main amd64 DEP-11 Metadata [318 kB]
获取:24 http://cn.archive.ubuntu.com/ubuntu xenial-updates/main DEP-11 64x64 Icons [233 kB]
获取:25 http://cn.archive.ubuntu.com/ubuntu xenial-updates/universe amd64 Packages [747 kB]
忽略:26 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main amd64 Packages
获取:27 http://cn.archive.ubuntu.com/ubuntu xenial-updates/universe i386 Packages [684 kB]
获取:28 http://cn.archive.ubuntu.com/ubuntu xenial-updates/universe amd64 DEP-11 Metadata [252 kB]
获取:29 http://cn.archive.ubuntu.com/ubuntu xenial-updates/universe DEP-11 64x64 Icons [353 kB]
获取:30 http://cn.archive.ubuntu.com/ubuntu xenial-updates/multiverse amd64 DEP-11 Metadata [5,960 B]
获取:31 http://cn.archive.ubuntu.com/ubuntu xenial-updates/multiverse DEP-11 64x64 Icons [14.3 kB]
获取:32 http://cn.archive.ubuntu.com/ubuntu xenial-backports/main amd64 DEP-11 Metadata [3,324 B]
获取:33 http://cn.archive.ubuntu.com/ubuntu xenial-backports/universe amd64 DEP-11 Metadata [5,104 B]
忽略:34 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main i386 Packages
忽略:35 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main all Packages
忽略:36 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main Translation-zh_CN
忽略:37 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main Translation-zh
忽略:38 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main Translation-en
忽略:39 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main amd64 DEP-11 Metadata
忽略:40 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main DEP-11 64x64 Icons
忽略:26 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main amd64 Packages
----------------- (omitted)
忽略:40 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main DEP-11 64x64 Icons
错误:26 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main amd64 Packages
  404  Not Found
忽略:34 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main i386 Packages
忽略:35 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main all Packages
忽略:36 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main Translation-zh_CN
忽略:37 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main Translation-zh
忽略:38 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main Translation-en
忽略:39 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main amd64 DEP-11 Metadata
忽略:40 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial/main DEP-11 64x64 Icons
已下载 5,132 kB,耗时 1分 13秒 (69.6 kB/s)
正在读取软件包列表... 完成
W: 软件包仓库 Release 文件 /var/lib/apt/lists/_var_cuda-repo-8-0-local_Release 内 Date 条目无效
W: 仓库 “http://ppa.launchpad.net/fcitx-team/nightly/ubuntu xenial Release” 没有 Release 文件。
N: 无法认证来自该源的数据,所以使用它会带来潜在风险。
N: 参见 apt-secure(8) 手册以了解仓库创建和用户配置方面的细节。
E: 无法下载 http://ppa.launchpad.net/fcitx-team/nightly/ubuntu/dists/xenial/main/binary-amd64/Packages  404  Not Found
E: 部分索引文件下载失败。如果忽略它们,那将转而使用旧的索引文件。

dong@cvad:~$ sudo apt-fast install nvidia-docker
nvidia-docker                            1.0.1-1                  2.2MiB
Download size: 2.2MiB

Do you want to download the packages? [Y/n] 
[#223a1a 1.1MiB/2.1MiB(53%) CN:2 DL:167KiB ETA:6s]                             
[#223a1a 2.1MiB/2.1MiB(98%) CN:1 DL:80KiB]                                     
04/30 15:09:58 [NOTICE] Verification finished successfully. file=/var/cache/apt/apt-fast/nvidia-docker_1.0.1-1_amd64.deb

04/30 15:09:58 [NOTICE] Download complete: /var/cache/apt/apt-fast/nvidia-docker_1.0.1-1_amd64.deb

下载结果:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
223a1a|OK  |   110KiB/s|/var/cache/apt/apt-fast/nvidia-docker_1.0.1-1_amd64.deb

状态标识:
(OK):完成下载。
正在读取软件包列表... 完成
正在分析软件包的依赖关系树       
正在读取状态信息... 完成       
下列软件包是自动安装的并且现在不需要了:
  libnvidia-container-tools libnvidia-container1 nvidia-container-runtime
  nvidia-container-runtime-hook
使用'sudo apt autoremove'来卸载它(它们)。
下列【新】软件包将被安装:
  nvidia-docker
升级了 0 个软件包,新安装了 1 个软件包,要卸载 0 个软件包,有 164 个软件包未被升级。
需要下载 0 B/2,266 kB 的归档。
解压缩后会消耗 14.1 MB 的额外空间。
正在选中未选择的软件包 nvidia-docker。
(正在读取数据库 ... 系统当前共安装有 302907 个文件和目录。)
正准备解包 .../nvidia-docker_1.0.1-1_amd64.deb  ...
正在解包 nvidia-docker (1.0.1-1) ...
正在处理用于 ureadahead (0.100.0-19) 的触发器 ...
ureadahead will be reprofiled on next reboot
正在设置 nvidia-docker (1.0.1-1) ...
Configuring user
Setting up permissions
正在处理用于 ureadahead (0.100.0-19) 的触发器 ...

dong@cvad:~$ sudo pkill -SIGHUP dockerd

#test------------------------------test-------------------------------test

dong@cvad:~$ sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Unable to find image 'nvidia/cuda:latest' locally
latest: Pulling from nvidia/cuda
898c46f3b1a1: Pull complete 
63366dfa0a50: Pull complete 
041d4cd74a92: Pull complete 
6e1bee0f8701: Pull complete 
131dbe7c254d: Pull complete 
5bca6b05dcd6: Pull complete 
0d286a7b6e12: Pull complete 
5776d2c6371d: Pull complete 
768e84e7fc24: Pull complete 
Digest: sha256:eba1dc5810e40f60625ee797d618a6bd11be24cb67bc6647a4d36392202bb013
Status: Downloaded newer image for nvidia/cuda:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=18238 /var/lib/docker/overlay2/9125ce2210561fc0820c654d4d95828a66b48367be02a67ac548ed635cc3d87b/merged]\\\\nnvidia-container-cli: requirement error: unsatisfied condition: brand = tesla\\\\n\\\"\"": unknown.
dong@cvad:~$ sudo docker version
Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:44:24 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false
dong@cvad:~$ sudo nvidia-docker version
NVIDIA Docker: 1.0.1

Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:44:24 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false


dong@cvad:~$ sudo docker image ls
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
nvidia/cuda         latest              3517732c5437        3 weeks ago         2.74GB
hello-world         latest              fce289e99eb9        3 months ago        1.84kB
autoware/autoware   1.7.0-kinetic       4dd12ac386c0        11 months ago       9.28GB

dong@cvad:~$ sudo docker image rm 351
Untagged: nvidia/cuda:latest
Untagged: nvidia/cuda@sha256:eba1dc5810e40f60625ee797d618a6bd11be24cb67bc6647a4d36392202bb013
Deleted: sha256:3517732c5437ffd87d7c144e1d62cadef128c07a6ccf29f5eb1877d491b0304d
Deleted: sha256:966a04f6408c5c189cd54530de4d3874f202def3de20fb6f2c7d30a39afe2a53
Deleted: sha256:0d53da6826e47b0378024562532b5046946a3bb361ebd17efabcd67b257b7fd2
Deleted: sha256:6c6dbf9c188adebee58cf5d0d3f246c9c45d38adc5aaebf3f433c848be3dc6e9
Deleted: sha256:efde8f73f882aa3a2babd65103b465c7d48410d6824cd2909552aa21d5067079
Deleted: sha256:678a66a6df0cbb0c989b81d83cc2d30ddf1af0414479be003ee7a40e1f2e9953
Deleted: sha256:e783d8ee44ce099d51cbe699f699a04e43c9af445d85d8576f0172ba92e4e16c
Deleted: sha256:cc7fae10c2d465c5e4b95167987eaa53ae01a13df6894493efc5b28b95c1bba2
Deleted: sha256:99fc3504db138523ca958c0c1887dd5e8b59f8104fbd6fd4eed485c3e25d2446
Deleted: sha256:762d8e1a60542b83df67c13ec0d75517e5104dee84d8aa7fe5401113f89854d9

dong@cvad:~$ sudo docker image ls
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
hello-world         latest              fce289e99eb9        3 months ago        1.84kB
autoware/autoware   1.7.0-kinetic       4dd12ac386c0        11 months ago       9.28GB

dong@cvad:~$ sudo nvidia-docker version
NVIDIA Docker: 1.0.1

Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:44:24 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false
dong@cvad:~$ 

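After switching versions it still does not work. My current thinking is that the nvidia-docker check pulls nvidia/cuda:latest directly,
so I will now try pulling the CUDA 8.0 image by hand.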
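The reason the check ends up on the newest CUDA image is Docker's default tag: an image name without a tag is treated as `NAME:latest`, which is why the log says `Unable to find image 'nvidia/cuda:latest' locally`. A small sketch of that rule follows; `resolve_image_ref` is a hypothetical helper that only manipulates strings and does not contact any registry.

```shell
# Print the image reference Docker will actually use: a name without a
# tag or digest is treated as NAME:latest.
resolve_image_ref() {
  # Look only at the part after the last "/", so that a registry port
  # such as host:5000/image is not mistaken for a tag.
  case "${1##*/}" in
    *:*|*@*) echo "$1" ;;          # tag or digest already present
    *)       echo "$1:latest" ;;   # no tag given
  esac
}

resolve_image_ref nvidia/cuda
resolve_image_ref nvidia/cuda:8.0
```

Pinning the tag, as in `nvidia/cuda:8.0`, keeps the check on a CUDA version chosen to match the installed driver rather than on whatever `latest` currently points to.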

5.
Pull nvidia/cuda:8.0.

[end]
