The Server Suddenly Stopped Working: A Summary of Fixes for "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver."

  1. Our lab was being renovated, so the server was offline for a while. When it became usable again, the GPU no longer worked:
limiao@xtgly-X9DRG-HF:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
  2. I searched online and tried many approaches without success. Link 1, Link 2, and Link 3 got me part of the way, to the following state:
xtgly@xtgly-X9DRG-HF:/home/limiao$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

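As you can see, the error is no longer the original one, which is real progress. Getting from step 1 to step 2 took a lot of effort and many failed attempts; in the end it was Link 2 and Link 3 that got me there.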
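The two messages above are different failure states, so it helps to tell them apart when retrying fixes. A minimal sketch, assuming POSIX sh; the function name and the state labels are made up for illustration and are not part of nvidia-smi:

```shell
#!/bin/sh
# Classify an nvidia-smi message into the failure states seen in this post.
# The labels printed here are illustrative names, not official status codes.
classify_nvidia_smi() {
    case "$1" in
        *"couldn't communicate with the NVIDIA driver"*)
            echo "no_driver_communication" ;;
        *"Driver/library version mismatch"*)
            echo "version_mismatch" ;;
        *)
            echo "other" ;;
    esac
}

# On a live machine the message can be captured like this:
#   classify_nvidia_smi "$(nvidia-smi 2>&1)"
```

The first state means nvidia-smi could not reach the driver at all; the second means it did reach something, but the versions disagree.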

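Below I record my fix in detail, because following Link 2 alone still runs into some problems. The workaround for those came from a comment under Link 3, which I have upvoted in thanks :)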

  • The fix mainly comes down to switching the kernel.
    How to switch:
    1. Method 1:
    a. List all installed kernels with the following command:
        xtgly@xtgly-X9DRG-HF:/home/limiao$ sudo dpkg --get-selections |grep linux-image
        linux-image-5.11.0-25-generic                   install
        linux-image-5.11.0-27-generic                   install
        linux-image-5.4.0-42-generic                    deinstall
        linux-image-5.4.0-47-generic                    deinstall
        linux-image-5.4.0-48-generic                    deinstall
        linux-image-5.4.0-51-generic                    deinstall
        linux-image-5.4.0-52-generic                    deinstall
        linux-image-5.4.0-53-generic                    deinstall
        linux-image-5.4.0-56-generic                    deinstall
        linux-image-5.4.0-58-generic                    deinstall
        linux-image-5.4.0-59-generic                    deinstall
        linux-image-5.8.0-34-generic                    deinstall
        linux-image-5.8.0-36-generic                    deinstall
        linux-image-5.8.0-38-generic                    deinstall
        linux-image-5.8.0-41-generic                    deinstall
        linux-image-5.8.0-43-generic                    deinstall
        linux-image-5.8.0-44-generic                    deinstall
        linux-image-5.8.0-45-generic                    deinstall
        linux-image-5.8.0-48-generic                    deinstall
        linux-image-5.8.0-49-generic                    deinstall
        linux-image-5.8.0-50-generic                    deinstall
        linux-image-5.8.0-53-generic                    deinstall
        linux-image-5.8.0-55-generic                    deinstall
        linux-image-5.8.0-59-generic                    deinstall
        linux-image-5.8.0-63-generic                    deinstall
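
In this listing only the entries marked install are kernels that are still present; deinstall entries are packages that were removed with their configuration files left behind. A small sketch for filtering the list, where the function name installed_kernels is made up for illustration:

```shell
#!/bin/sh
# Read `dpkg --get-selections` output on stdin and print only the
# linux-image packages whose selection state is "install".
installed_kernels() {
    awk '$1 ~ /^linux-image/ && $2 == "install" { print $1 }'
}

# On a live machine:
#   dpkg --get-selections | installed_kernels
```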
    

b. Install the replacement kernel and its headers in one step with the following command:

xtgly@xtgly-X9DRG-HF:/home/limiao$ sudo apt-get install linux-image-5.4.0-51-lowlatency linux-headers-5.4.0-51-lowlatency
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  amd64-microcode intel-microcode iucode-tool linux-headers-generic-hwe-20.04 thermald
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  linux-headers-5.4.0-51 linux-modules-5.4.0-51-lowlatency
Suggested packages:
  fdutils linux-tools
The following NEW packages will be installed:
  linux-headers-5.4.0-51 linux-headers-5.4.0-51-lowlatency linux-image-5.4.0-51-lowlatency linux-modules-5.4.0-51-lowlatency
0 upgraded, 4 newly installed, 0 to remove and 281 not upgraded.
Need to get 74.1 MB of archives.
After this operation, 360 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://cn.archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-headers-5.4.0-51 all 5.4.0-51.56 [11.0 MB]
Get:1 http://cn.archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-headers-5.4.0-51 all 5.4.0-51.56 [11.0 MB]
Get:2 http://cn.archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-headers-5.4.0-51-lowlatency amd64 5.4.0-51.56 [1,255 kB]
Get:3 http://cn.archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-modules-5.4.0-51-lowlatency amd64 5.4.0-51.56 [52.9 MB]
Get:3 http://cn.archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-modules-5.4.0-51-lowlatency amd64 5.4.0-51.56 [52.9 MB]
Get:4 http://cn.archive.ubuntu.com/ubuntu focal-updates/main amd64 linux-image-5.4.0-51-lowlatency amd64 5.4.0-51.56 [8,959 kB]
Fetched 63.3 MB in 18min 27s (57.2 kB/s)
Selecting previously unselected package linux-headers-5.4.0-51.
(Reading database ... 238581 files and directories currently installed.)
Preparing to unpack .../linux-headers-5.4.0-51_5.4.0-51.56_all.deb ...
Unpacking linux-headers-5.4.0-51 (5.4.0-51.56) ...
Selecting previously unselected package linux-headers-5.4.0-51-lowlatency.
Preparing to unpack .../linux-headers-5.4.0-51-lowlatency_5.4.0-51.56_amd64.deb ...
Unpacking linux-headers-5.4.0-51-lowlatency (5.4.0-51.56) ...
Selecting previously unselected package linux-modules-5.4.0-51-lowlatency.
Preparing to unpack .../linux-modules-5.4.0-51-lowlatency_5.4.0-51.56_amd64.deb ...
Unpacking linux-modules-5.4.0-51-lowlatency (5.4.0-51.56) ...
Selecting previously unselected package linux-image-5.4.0-51-lowlatency.
Preparing to unpack .../linux-image-5.4.0-51-lowlatency_5.4.0-51.56_amd64.deb ...
Unpacking linux-image-5.4.0-51-lowlatency (5.4.0-51.56) ...
Setting up linux-headers-5.4.0-51 (5.4.0-51.56) ...
Setting up linux-modules-5.4.0-51-lowlatency (5.4.0-51.56) ...
Setting up linux-headers-5.4.0-51-lowlatency (5.4.0-51.56) ...
/etc/kernel/header_postinst.d/dkms:
 * dkms: running auto installation service for kernel 5.4.0-51-lowlatency

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.4.0-51-lowlatency IGNORE_CC_MISMATCH='1' modules.......
cleaning build area...

DKMS: build completed.

nvidia.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-51-lowlatency/updates/dkms/

nvidia-uvm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-51-lowlatency/updates/dkms/

nvidia-modeset.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-51-lowlatency/updates/dkms/

nvidia-drm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-51-lowlatency/updates/dkms/

depmod...

DKMS: install completed.
   ...done.
Setting up linux-image-5.4.0-51-lowlatency (5.4.0-51.56) ...
I: /boot/vmlinuz.old is now a symlink to vmlinuz-5.11.0-27-generic
I: /boot/initrd.img.old is now a symlink to initrd.img-5.11.0-27-generic
I: /boot/vmlinuz is now a symlink to vmlinuz-5.4.0-51-lowlatency
I: /boot/initrd.img is now a symlink to initrd.img-5.4.0-51-lowlatency
Processing triggers for linux-image-5.4.0-51-lowlatency (5.4.0-51.56) ...
/etc/kernel/postinst.d/dkms:
 * dkms: running auto installation service for kernel 5.4.0-51-lowlatency
   ...done.
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-5.4.0-51-lowlatency
/etc/kernel/postinst.d/zz-update-grub:
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.11.0-27-generic
Found initrd image: /boot/initrd.img-5.11.0-27-generic
Found linux image: /boot/vmlinuz-5.11.0-25-generic
Found initrd image: /boot/initrd.img-5.11.0-25-generic
Found linux image: /boot/vmlinuz-5.4.0-51-lowlatency
Found initrd image: /boot/initrd.img-5.4.0-51-lowlatency
Found memtest86+ image: /boot/memtest86+.elf
Found memtest86+ image: /boot/memtest86+.bin
done
xtgly@xtgly-X9DRG-HF:/home/limiao$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

c. As shown above, this attempt failed; nvidia-smi still reports the same error:

xtgly@xtgly-X9DRG-HF:/home/limiao$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
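
One thing worth checking at this point: installing a kernel package does not change the running kernel until the machine is rebooted into it, and nvidia-smi talks to the module loaded in the running kernel. The transcript above does not show a reboot, so this may or may not explain the failure. A minimal sketch for the check, with a hypothetical function name:

```shell
#!/bin/sh
# Succeed only if the running kernel release equals the target release.
# $1 = target release, e.g. 5.4.0-51-lowlatency
# $2 = running release (optional; defaults to `uname -r`)
kernel_is_active() {
    target="$1"
    running="${2:-$(uname -r)}"
    [ "$running" = "$target" ]
}

# On a live machine:
#   kernel_is_active 5.4.0-51-lowlatency || echo "not booted into the new kernel yet"
```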
  2. Method 2 (worked): switch the kernel with synaptic
    a. First, install synaptic:
xtgly@xtgly-X9DRG-HF:/home/limiao$ sudo apt-get install synaptic
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  amd64-microcode intel-microcode iucode-tool linux-headers-generic-hwe-20.04 thermald
Use 'sudo apt autoremove' to remove them.
The following additional packages will be installed:
  libept1.6.0 libxapian30
Suggested packages:
  xapian-tools dwww menu deborphan apt-xapian-index tasksel
The following NEW packages will be installed:
  libept1.6.0 libxapian30 synaptic
0 upgraded, 3 newly installed, 0 to remove and 281 not upgraded.
Need to get 1,362 kB of archives.
After this operation, 6,346 kB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 http://cn.archive.ubuntu.com/ubuntu focal/universe amd64 libept1.6.0 amd64 1.1+nmu3ubuntu3 [79.6 kB]
Get:2 http://cn.archive.ubuntu.com/ubuntu focal/universe amd64 libxapian30 amd64 1.4.14-2 [661 kB]
Get:3 http://cn.archive.ubuntu.com/ubuntu focal/universe amd64 synaptic amd64 0.84.6ubuntu5 [622 kB]
Fetched 1,362 kB in 6s (210 kB/s)
Selecting previously unselected package libept1.6.0:amd64.
(Reading database ... 274961 files and directories currently installed.)
Preparing to unpack .../libept1.6.0_1.1+nmu3ubuntu3_amd64.deb ...
Unpacking libept1.6.0:amd64 (1.1+nmu3ubuntu3) ...
Selecting previously unselected package libxapian30:amd64.
Preparing to unpack .../libxapian30_1.4.14-2_amd64.deb ...
Unpacking libxapian30:amd64 (1.4.14-2) ...
Selecting previously unselected package synaptic.
Preparing to unpack .../synaptic_0.84.6ubuntu5_amd64.deb ...
Unpacking synaptic (0.84.6ubuntu5) ...
Setting up libxapian30:amd64 (1.4.14-2) ...
Setting up libept1.6.0:amd64 (1.1+nmu3ubuntu3) ...
Setting up synaptic (0.84.6ubuntu5) ...
Processing triggers for mime-support (3.64ubuntu1) ...
Processing triggers for hicolor-icon-theme (0.17-2) ...
Processing triggers for gnome-menus (3.36.0-1ubuntu1) ...
Processing triggers for libc-bin (2.31-0ubuntu9) ...
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for desktop-file-utils (0.24-1ubuntu3) ...

b. Then run synaptic. I hit the following problem:

xtgly@xtgly-X9DRG-HF:/home/limiao$ sudo synaptic
MoTTY X11 proxy: Unsupported authorisation protocol
Unable to init server: Could not connect: Connection refused
Failed to initialize GTK.

Probably you're running Synaptic on Wayland with root permission.
Please restart your session without Wayland, or run Synaptic without root permission

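The fix given in Link 3 is:

Error when using a graphical interface remotely: MoTTY X11 proxy: Unsupported authorisation protocol
Fix:
cp /root/.Xauthority /home/xxx/.Xauthority
where xxx is the user name

That did not work for me, though. According to a comment by weixin_43952819, doing the reverse fixes it: find .Xauthority in your own user's home directory and copy it to root's. What I actually ran is below (it seems this has to be repeated each time before launching synaptic):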

xtgly@xtgly-X9DRG-HF:/home/limiao$ sudo cp .Xauthority /root
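
Since the copy seems to be needed before every launch, it can be wrapped in a small helper. This is a sketch under assumptions: the function name and the COPY_CMD variable are made up, and the default source path is a guess. In the session above the file was copied from /home/limiao (the current directory), so pass the path explicitly if your login user differs from the current user.

```shell
#!/bin/sh
# Copy an X authority file into root's home so that X11-forwarded GUI
# programs started with sudo can authenticate.
# $1 = source .Xauthority file (default: $HOME/.Xauthority, an assumption)
# $2 = destination directory   (default: /root)
# COPY_CMD can be overridden, e.g. COPY_CMD=cp for a dry run in a temp dir.
refresh_root_xauthority() {
    src="${1:-$HOME/.Xauthority}"
    dest_dir="${2:-/root}"
    if [ ! -f "$src" ]; then
        echo "no X authority file at $src" >&2
        return 1
    fi
    # Left unquoted on purpose so that "sudo cp" splits into two words.
    ${COPY_CMD:-sudo cp} "$src" "$dest_dir/.Xauthority"
}

# On a live machine:
#   refresh_root_xauthority /home/limiao/.Xauthority && sudo synaptic
```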

c. Run it again. This time there is no error and the remote window opens:

xtgly@xtgly-X9DRG-HF:/home/limiao$ sudo synaptic

The window I got looks like this:
[Image 1]
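d. Search for the kernel you want to switch to: click Search and enter the version number (this window responds slowly, so be patient), as in the screenshot:
[Image 2]
e. Select the kernel (the screenshot actually shows the kernel I had already selected and installed). On an ordinary machine the packages to install are the headers and the image; be careful not to pick the wrong ones. Find the headers first: for example, to install linux-headers-5.4.0-33-generic, right-click it and choose "Mark for Installation". linux-headers-5.4.0-33 gets marked along with it; it is required, so do not unmark it. Scroll down to linux-image-5.4.0-33-generic, right-click it, and choose "Mark for Installation" as well. Make sure its version matches the headers. There are now three marked packages. Note: I suggest also installing linux-image-extra-5.4.0-33-generic (on recent Ubuntu releases, including 20.04, this package is named linux-modules-extra-5.4.0-33-generic). It may not be needed and it makes the kernel list more cluttered, but it can fix potential driver incompatibilities.
[Image 3]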
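f. Download the kernel

Click Apply at the top and expand "To be installed". You should see the three selected packages (four with the extra package). Check once more that the image matches the headers.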
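
The "image matches headers" check can also be done mechanically on a list of package names. A minimal sketch, assuming one package name per line on stdin; the function name unmatched_images is hypothetical:

```shell
#!/bin/sh
# Print every versioned linux-image package that has no linux-headers
# package with the same version suffix in the same list.
unmatched_images() {
    pkgs="$(cat)"
    # Only names like linux-image-5.4.0-33-generic are considered;
    # linux-image-extra-* and similar are skipped by the [0-9] filter.
    printf '%s\n' "$pkgs" | grep '^linux-image-[0-9]' | while IFS= read -r img; do
        suffix="${img#linux-image-}"
        if ! printf '%s\n' "$pkgs" | grep -qxF "linux-headers-$suffix"; then
            echo "$img"
        fi
    done
}

# On a live machine, after installation:
#   dpkg --get-selections | awk '$2 == "install" { print $1 }' | unmatched_images
```

Empty output means every selected image has matching headers.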

g. After confirming, click Apply to download and install.
h. Reboot:

xtgly@xtgly-X9DRG-HF:/home/limiao$ systemctl reboot -i

i. After rebooting, run uname -r to confirm that the kernel has changed:

limiao@xtgly-X9DRG-HF:~$ su xtgly
Password:
xtgly@xtgly-X9DRG-HF:/home/limiao$ uname -r
5.4.0-33-generic

j. This still does not fix the main problem, though. nvidia-smi still reports:

xtgly@xtgly-X9DRG-HF:/home/limiao$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

k. Next, run the following two commands: sudo apt-get install dkms and sudo dkms install -m nvidia -v 455.23.05

xtgly@xtgly-X9DRG-HF:/home/limiao$ sudo apt-get install dkms
[sudo] password for xtgly:
Reading package lists... Done
Building dependency tree
Reading state information... Done
dkms is already the newest version (2.8.1-5ubuntu2).
The following packages were automatically installed and are no longer required:
  amd64-microcode intel-microcode iucode-tool thermald
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 281 not upgraded.
xtgly@xtgly-X9DRG-HF:/home/limiao$ sudo dkms install -m nvidia -v 455.23.05

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.4.0-33-generic IGNORE_CC_MISMATCH='1' modules.........
cleaning build area...

DKMS: build completed.

nvidia.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-33-generic/updates/dkms/

nvidia-uvm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-33-generic/updates/dkms/

nvidia-modeset.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-33-generic/updates/dkms/

nvidia-drm.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/5.4.0-33-generic/updates/dkms/

depmod....

DKMS: install completed.
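
To confirm the build is registered for the running kernel, the output of dkms status can be filtered. This is a sketch under an assumption about the output format: it expects one line per module and kernel that starts with the module name and ends in "installed", for example "nvidia, 455.23.05, 5.4.0-33-generic, x86_64: installed". The function name is hypothetical.

```shell
#!/bin/sh
# Read `dkms status` output on stdin and succeed if the given module is
# reported as installed for the given kernel release.
# $1 = module name, $2 = kernel release
dkms_module_installed() {
    grep "^$1" | grep -F "$2" | grep -q 'installed'
}

# On a live machine:
#   dkms status | dkms_module_installed nvidia "$(uname -r)" && echo "module present"
```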

Note: the 455.23.05 in sudo dkms install -m nvidia -v 455.23.05 comes from the output of the following command (see nvidia-455.23.05 on the last line):

xtgly@xtgly-X9DRG-HF:/home/limiao$ ll /usr/src
total 32
drwxr-xr-x  8 root root 4096 8月  22 03:22 ./
drwxr-xr-x 14 root root 4096 8月   1  2020 ../
drwxr-xr-x 24 root root 4096 8月  22 03:01 linux-headers-5.4.0-33/
drwxr-xr-x  7 root root 4096 8月  22 03:02 linux-headers-5.4.0-33-generic/
drwxr-xr-x  7 root root 4096 7月  30 09:33 linux-headers-5.8.0-63-generic/
drwxr-xr-x 24 root root 4096 7月  30 09:33 linux-hwe-5.8-headers-5.8.0-63/
drwxr-xr-x  4 root root 4096 8月  21 19:42 linux-source-5.4.0/
lrwxrwxrwx  1 root root   45 7月  16 02:01 linux-source-5.4.0.tar.bz2 -> linux-source-5.4.0/linux-source-5.4.0.tar.bz2
drwxr-xr-x  7 root root 4096 5月  26 20:25 nvidia-455.23.05/
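
The version can also be read off the directory name with a short script instead of by eye. A minimal sketch; the function name nvidia_dkms_versions is made up, and it assumes the driver sources live in directories named nvidia-<version> as in the listing above:

```shell
#!/bin/sh
# Print the version part of every nvidia-<version> directory.
# $1 = directory to scan (default: /usr/src)
nvidia_dkms_versions() {
    src_dir="${1:-/usr/src}"
    for d in "$src_dir"/nvidia-*/; do
        [ -d "$d" ] || continue      # glob did not match anything
        name="$(basename "$d")"
        echo "${name#nvidia-}"
    done
}

# On a live machine:
#   sudo dkms install -m nvidia -v "$(nvidia_dkms_versions | tail -n 1)"
```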

l. Check nvidia-smi again. The earlier steps have had an effect, although this is now a different problem:

xtgly@xtgly-X9DRG-HF:/home/limiao$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

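I could not resolve this one for now because the server became unreachable again. It should be easy to fix, though, so I will get to it once the server is back online :)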
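
As a starting point for that follow-up, the mismatch message suggests comparing the kernel module version with the user-space library version. The sketch below is untested on this server and rests on assumptions: the helper names are made up, the loaded module version is assumed to be readable from /sys/module/nvidia/version, and the NVML library is assumed to be a file named libnvidia-ml.so.<version>.

```shell
#!/bin/sh
# Extract the version from an NVML library file name such as
# libnvidia-ml.so.455.23.05 (pass the real file, not the .so.1 symlink).
lib_version_from_path() {
    name="$(basename "$1")"
    echo "${name#libnvidia-ml.so.}"
}

# Succeed if the two version strings are identical.
versions_match() {
    [ "$1" = "$2" ]
}

# On a live machine (paths are assumptions, adjust as needed):
#   loaded="$(cat /sys/module/nvidia/version)"
#   lib="$(lib_version_from_path "$(readlink -f /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1)")"
#   versions_match "$loaded" "$lib" || echo "module $loaded vs library $lib"
```

If the two differ, the module built by DKMS and the installed user-space driver packages come from different driver versions.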
