Update: [Resolved. One of the GPUs turned out to have unstable hardware.]
See my final write-up:
https://blog.csdn.net/u012911347/article/details/82854018
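This is yet another post about CUDA and NVIDIA. It records a temporary fix for two problems: only one of the two GPUs showing up, and CUDA being unusable.
For more background, see my previous few posts. In short: Ubuntu 18.04 with CUDA 9.0 on driver 390.48 suddenly broke. What followed was a round of repairs with apt, aptitude, and .run installers; things worked, broke again, and got fixed again. What finally worked was removing the PPA, installing nvidia-384 with apt from the official Ubuntu repositories, then running the CUDA 9.0 .run installer and choosing to install only the CUDA toolkit, without letting the driver bundled in the .run file overwrite the system one. That setup worked normally for two days.
This morning it was broken again: nvidia-smi showed only one GPU normally, and the other reported ERR!: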
Mon Sep 17 09:49:55 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48 Driver Version: 390.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:21:00.0 On | N/A |
|ERR! 44C P8 ERR! / 250W | 295MiB / 11144MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:2D:00.0 Off | N/A |
| 0% 36C P8 10W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1817 G /usr/lib/xorg/Xorg 40MiB |
| 0 1860 G /usr/bin/gnome-shell 83MiB |
| 0 2088 G /usr/lib/xorg/Xorg 146MiB |
| 0 2240 G /usr/bin/gnome-shell 4MiB |
| 0 2254 G /opt/teamviewer/tv_bin/TeamViewer 14MiB |
+-----------------------------------------------------------------------------+
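Honestly, I was closer to a breakdown than the system was. When things go wrong, everything goes wrong: nvidia-persistenced, deviceQuery reporting FAIL, nvidia-smi missing a card. I have fought with CUDA on this workstation many times without finding a stable fix or even learning where the problem lies. This time, for example: nvidia-384 installed via apt, CUDA 9.0 from the .run file, none of the four CUDA 9.0 patches applied, and the driver currently reported as 390.48. After two good days I thought the problem was solved, and then it broke again first thing in the morning. deviceQuery printed: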
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL
Searching for this message, or for "returned 3", turned up no fix. I did notice, though, that Xorg's CPU usage was extremely high:
2088 root 20 0 504212 140412 93340 R 100.3 0.1 979:58.25 /usr/lib/xorg/Xorg vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -noreset -keeptty -verbose 3
That gave me a lead for the eventual fix. Not ready to give up, I rebooted once more, and the deviceQuery output changed:
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
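Since deviceQuery gave two quite different outputs within minutes, it can help to reduce each run to a few comparable fields. The following is only a minimal sketch: the function names `parse_device_query` and `looks_healthy` are made up for illustration, and it assumes the output format seen in the two runs above. For reference, return code 3 is the runtime's initialization error, which is what the "-> initialization error" line spells out.

```python
import re


def parse_device_query(output: str) -> dict:
    """Pull the verdict, device count and error code out of deviceQuery text.

    Fields that do not appear in the output are left as None.
    """
    summary = {"result": None, "num_devs": None, "error_code": None}

    m = re.search(r"Result\s*=\s*(PASS|FAIL)", output)
    if m:
        summary["result"] = m.group(1)

    m = re.search(r"NumDevs\s*=\s*(\d+)", output)
    if m:
        summary["num_devs"] = int(m.group(1))

    # Printed only when cudaGetDeviceCount itself fails.
    m = re.search(r"cudaGetDeviceCount returned (\d+)", output)
    if m:
        summary["error_code"] = int(m.group(1))

    return summary


def looks_healthy(summary: dict, expected_gpus: int) -> bool:
    """PASS alone is not enough: the device count must match as well."""
    return summary["result"] == "PASS" and summary["num_devs"] == expected_gpus
```

With two cards installed, `looks_healthy(parse_device_query(text), 2)` is false for both runs above: the first fails outright, and the second passes but only counts one device.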
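Before the reboot, nvidia-smi showed one working card but deviceQuery failed. Now deviceQuery passes, although with one device missing (NumDevs = 1), and nvidia-smi has lost a card entirely: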
Mon Sep 17 10:01:06 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48 Driver Version: 390.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:2D:00.0 Off | N/A |
| 0% 37C P5 23W / 250W | 0MiB / 11178MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
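Good grief. I had no idea how to get more precise information. While searching I learned two things worth noting here. First, the "smi" in nvidia-smi stands for System Management Interface. Second, the number after "NVIDIA-SMI" in the header is, as far as I can tell, the version of the nvidia-smi utility itself, while "Driver Version" is the version of the loaded GPU driver; the two are printed separately and are sometimes different.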
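To check which cards the driver currently sees without eyeballing the table, nvidia-smi can also print selected fields as CSV. Below is a rough sketch, not a definitive tool: the helper names (`parse_gpu_csv`, `missing_bus_ids`, `query_gpus`) are invented for this post, and the bus IDs to expect have to be supplied by hand (here they would be the two addresses from the tables above).

```python
import subprocess

# Field names accepted by `nvidia-smi --query-gpu`.
QUERY_FIELDS = "index,pci.bus_id,name,driver_version"


def parse_gpu_csv(text: str) -> list:
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4:
            continue  # skip anything that is not a 4-field data row
        index, bus_id, name, driver = parts
        gpus.append(
            {"index": int(index), "bus_id": bus_id, "name": name, "driver": driver}
        )
    return gpus


def missing_bus_ids(gpus: list, expected: list) -> list:
    """Return the expected PCI bus IDs that nvidia-smi did not report."""
    seen = {g["bus_id"].upper() for g in gpus}
    return [b for b in expected if b.upper() not in seen]


def query_gpus() -> list:
    """Run nvidia-smi (must be on PATH) and parse its CSV output."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Calling `missing_bus_ids(query_gpus(), ["00000000:21:00.0", "00000000:2D:00.0"])` after each boot would name the card that dropped out, which is easier to log over several days than a screenshot of the table.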
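As for logs, I already knew about syslog; the Xorg log is also worth checking, and hardware-related messages show up in dmesg. These are the GPU-related lines I found there: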
[ 4.894342] NVRM: GPU at PCI:0000:21:00: GPU-7ce0c4e1-86a8-fe64-288b-da563f52cc95
[ 4.894344] NVRM: GPU Board Serial Number:
[ 4.894346] NVRM: Xid (PCI:0000:21:00): 62, 13adb(75b8) 00000000 00000000
[ 51.725515] NVRM: Xid (PCI:0000:21:00): 32, Channel ID 00000000 intr 80042000
[ 51.736863] NVRM: RmInitAdapter failed! (0x26:0xffff:1123)
[ 51.736926] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 56.665188] NVRM: Xid (PCI:0000:21:00): 32, Channel ID 00000000 intr 80002000
[ 56.671764] NVRM: RmInitAdapter failed! (0x26:0xffff:1123)
[ 56.671787] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 61.553430] NVRM: Xid (PCI:0000:21:00): 32, Channel ID 00000000 intr 80002000
[ 61.560047] NVRM: RmInitAdapter failed! (0x26:0xffff:1123)
[ 61.560070] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 66.293581] NVRM: Xid (PCI:0000:21:00): 32, Channel ID 00000000 intr 80002000
[ 66.300253] NVRM: RmInitAdapter failed! (0x26:0xffff:1123)
[ 66.300276] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 71.055603] NVRM: Xid (PCI:0000:21:00): 32, Channel ID 00000000 intr 80002000
[ 71.066633] NVRM: RmInitAdapter failed! (0x26:0xffff:1123)
[ 71.066682] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 71.626922] usb 1-4: USB disconnect, device number 3
[ 75.959977] NVRM: Xid (PCI:0000:21:00): 32, Channel ID 00000000 intr 80002000
[ 75.971194] NVRM: RmInitAdapter failed! (0x26:0xffff:1123)
[ 75.971228] NVRM: rm_init_adapter failed for device bearing minor number 0
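Reading all the way to the end, the failing device is always 21:00.0, while the card at 2D:00.0 is fine. This matches the earlier nvidia-smi output that listed only one card; the Bus-Id column in nvidia-smi shows which card is which.
This is odd. One card works and deviceQuery passes, so in principle the driver and CUDA should be fine. Yet the other card fails to initialize, and Xorg is burning CPU, so I started to suspect the attached monitor. Sure enough, after unplugging the monitor cable and rebooting, everything worked: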
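Scrolling through dmesg by hand works, but the same conclusion can be reached by counting NVRM events per PCI device. This is a small sketch under the assumption that the log lines look like the ones above; `summarize_nvrm` is a hypothetical helper name, and the meaning of each Xid code has to be looked up in NVIDIA's Xid documentation.

```python
import re
from collections import defaultdict

# Matches e.g. "NVRM: Xid (PCI:0000:21:00): 32, Channel ID ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+),")
INIT_FAIL_RE = re.compile(r"NVRM: RmInitAdapter failed!")


def summarize_nvrm(log_text: str):
    """Count Xid codes per PCI device and RmInitAdapter failures overall.

    Returns (xids, init_failures), where xids maps a PCI address to a
    dict of {xid_code: occurrences}.
    """
    xids = defaultdict(lambda: defaultdict(int))
    init_failures = 0
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            xids[m.group(1).lower()][int(m.group(2))] += 1
        elif INIT_FAIL_RE.search(line):
            init_failures += 1
    return {dev: dict(codes) for dev, codes in xids.items()}, init_failures
```

Feeding it the text of `dmesg` (for instance saved to a file first) gives one entry per misbehaving card. For the excerpt above, every Xid event lands on 0000:21:00 and nothing is recorded for the card at 2D:00.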
Mon Sep 17 10:31:38 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48 Driver Version: 390.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:21:00.0 Off | N/A |
| 0% 47C P8 10W / 250W | 19MiB / 11170MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:2D:00.0 Off | N/A |
| 0% 36C P8 10W / 250W | 2MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1675 G /usr/lib/xorg/Xorg 9MiB |
| 0 1723 G /usr/bin/gnome-shell 7MiB |
+-----------------------------------------------------------------------------+
A quick test with TensorFlow:
2018-09-17 11:05:17.424097: I tensorflow/core/common_runtime/placer.cc:886] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-09-17 11:05:17.424115: I tensorflow/core/common_runtime/placer.cc:886] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
d: (Const): /job:localhost/replica:0/task:0/device:GPU:1
2018-09-17 11:05:17.424133: I tensorflow/core/common_runtime/placer.cc:886] d: (Const)/job:localhost/replica:0/task:0/device:GPU:1
e: (Const): /job:localhost/replica:0/task:0/device:GPU:1
2018-09-17 11:05:17.424151: I tensorflow/core/common_runtime/placer.cc:886] e: (Const)/job:localhost/replica:0/task:0/device:GPU:1
[[22. 28.]
[49. 64.]]
[[22. 28.]
[49. 64.]]
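Both cards really do work now.
To summarize:
I had tried everything short of the last resort, reinstalling the OS. Then, on a whim, I ran without a monitor attached, and both cards came up fine. Also, as my earlier posts describe, the original crash of CUDA and the driver was triggered by plotting a figure in Matlab. So I am now fairly confident about the cause: the display-related part of the driver is unstable or buggy, and with a monitor attached, one or both cards fail to initialize. Between the NVIDIA driver and the Xorg graphics stack, the combination feels fragile, and I cannot tell which of the two is to blame.
Appendix 1, monitor details:
A Dell P2715Q 4K monitor with what should be the original DP cable: the mini DP end goes into the monitor and the standard DP end into the 1080 Ti. In practice the resolution was set to 2560*1440.
Appendix 2, a bit of venting:
Consider this one more idea for debugging strange CUDA problems: try running without a monitor attached. How long it keeps working remains to be seen; if problems come back, I will follow up on the blog.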