Update: [Resolved: one of the graphics cards turned out to have unstable hardware]
See my final write-up:
https://blog.csdn.net/u012911347/article/details/82854018
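First, my setup:
Ubuntu 18.04 with CUDA 9.0. After the machine had been running for a while, the graphical desktop suddenly crashed; at the time I was operating Matlab remotely through Teamviewer.
The main symptom was "starting nvidia persistence daemon" failing and restarting in a loop, and nvidia-smi would not run. A few of the log entries: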
Sep 12 18:56:57 hp-server2 kernel: [ 1304.784857] NVRM: Xid (PCI:0000:21:00): 32, Channel ID 00000000 intr 80002000
Sep 12 18:56:57 hp-server2 /usr/lib/gdm3/gdm-x-session[21891]: (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:33:0:0. Please
Sep 12 18:56:57 hp-server2 /usr/lib/gdm3/gdm-x-session[21891]: (EE) NVIDIA(GPU-0): check your system's kernel log for additional error
Sep 12 18:56:57 hp-server2 /usr/lib/gdm3/gdm-x-session[21891]: (EE) NVIDIA(GPU-0): messages and refer to Chapter 8: Common Problems in the
Sep 12 18:56:57 hp-server2 /usr/lib/gdm3/gdm-x-session[21891]: (EE) NVIDIA(GPU-0): README for additional information.
Sep 12 18:56:57 hp-server2 /usr/lib/gdm3/gdm-x-session[21891]: (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
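The two kinds of log lines above name the same card in different notations: the kernel line uses a hexadecimal PCI address (0000:21:00), while the X server uses decimal (PCI:33:0:0). Below is a minimal sketch of two helpers for lining them up. The function names are my own and purely illustrative, and the conversion assumes PCI domain 0, which the X-style form does not carry. As far as I know, NVIDIA's Xid documentation describes code 32 as an invalid or corrupted push buffer stream, with driver, bus, and hardware problems all listed as possible causes, so the code alone does not pinpoint the fault.

```shell
#!/bin/sh
# Convert an Xorg-style bus ID (decimal, e.g. "PCI:33:0:0") into the
# kernel/lspci-style address (hex, e.g. "0000:21:00.0").
# Assumption: PCI domain 0000 (the Xorg form shown above omits the domain).
xorg_busid_to_pci() {
    old_ifs=$IFS
    IFS=:
    # Split "PCI:33:0:0" on ':' into $1=PCI $2=bus $3=device $4=function
    set -- $1
    IFS=$old_ifs
    printf '0000:%02x:%02x.%x\n' "$2" "$3" "$4"
}

# Read log lines on stdin and print "<pci-address> <xid-code>"
# for every "NVRM: Xid" line; all other lines are ignored.
extract_xid() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p'
}
```

Possible usage: `journalctl -k | extract_xid` to see which card reports Xid errors, then `lspci -s 21:00.0` to check which physical device that address belongs to. On a machine with more than one card, this is one way to see whether the errors always come from the same device.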
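Searching Baidu and Google turned up nothing, and I was close to giving up: I have hit Ubuntu graphics crashes many times and never managed to repair them, yet today's image-processing work cannot be done without a graphical desktop. I had picked up a little about lightdm and gdm along the way, but a few commands plus a reboot did nothing. I considered reinstalling everything from nvidia, which would also mean dealing with cuda again.
Honestly, it is exhausting. The Linux graphical desktop always feels unstable to me, unlike Windows, where the graphical interface feels like part of the kernel: the machine either boots into the desktop or does not boot at all. After a long struggle I decided to reinstall cuda, because I remembered that the cuda installation also installs the driver. I ran into the same problem as in my earlier blog post: apt could not install cuda. So I removed cuda with aptitude, and the desktop came back to normal.
I then installed cuda with aptitude, and unfortunately the system went back to the same state, unable to enter the desktop: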
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:33:0:0. Please
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE) NVIDIA(GPU-0): check your system's kernel log for additional error
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE) NVIDIA(GPU-0): messages and refer to Chapter 8: Common Problems in the
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE) NVIDIA(GPU-0): README for additional information.
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE) NVIDIA(0): Failing initialization of X screen 0
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (II) UnloadModule: "nvidia"
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (II) UnloadSubModule: "wfb"
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (II) UnloadSubModule: "fb"
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE) Screen(s) found, but none have a usable configuration.
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE)
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: Fatal server error:
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE) no screens found(EE)
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE)
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: Please consult the The X.Org Foundation support
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: #011 at http://wiki.x.org
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: for help.
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
Sep 12 19:38:56 hp-server2 /usr/lib/gdm3/gdm-x-session[19106]: (EE)
Sep 12 19:24:23 hp-server2 dbus-daemon[5463]: [session uid=120 pid=5463] AppArmor D-Bus mediation is enabled
Sep 12 19:24:23 hp-server2 dbus-daemon[5463]: [session uid=120 pid=5463] Activating service name='org.gnome.ScreenSaver' requested by ':1.12' (uid=120 pid=5466 comm="/usr/lib/gnome-session/gnome-session-binary --auto" label="unconfined")
Sep 12 19:24:23 hp-server2 org.gnome.ScreenSaver[5463]: Unable to init server: 无法连接: 拒绝连接
Sep 12 19:24:23 hp-server2 gnome-screensav[5472]: 无法打开显示:
Sep 12 19:24:23 hp-server2 dbus-daemon[5463]: [session uid=120 pid=5463] Activated service 'org.gnome.ScreenSaver' failed: Process org.gnome.ScreenSaver exited with status 1
Sep 12 19:24:23 hp-server2 gnome-session[5466]: gnome-session-binary[5466]: CRITICAL: Unable to create a DBus proxy for GnomeScreensaver: 为 org.gnome.ScreenSaver 调用 StartServiceByName 出错: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.gnome.ScreenSaver exited with status 1
Sep 12 19:24:23 hp-server2 gnome-session-binary[5466]: CRITICAL: Unable to create a DBus proxy for GnomeScreensaver: 为 org.gnome.ScreenSaver 调用 StartServiceByName 出错: GDBus.Error:org.freedesktop.DBus.Error.Spawn.ChildExited: Process org.gnome.ScreenSaver exited with status 1
Sep 12 19:24:23 hp-server2 kernel: [ 335.660889] gnome-shell[5474]: segfault at 20 ip 00007f79a33e081d sp 00007fff4359cff0 error 4 in libmutter-2.so.0.0.0[7f79a32f2000+156000]
Sep 12 19:24:23 hp-server2 systemd[1]: Starting NVIDIA Persistence Daemon...
Sep 12 19:24:23 hp-server2 nvidia-persistenced: Verbose syslog connection opened
Sep 12 19:24:23 hp-server2 nvidia-persistenced: Now running with user ID 123 and group ID 127
Sep 12 19:24:23 hp-server2 systemd[1]: Started NVIDIA Persistence Daemon.
Sep 12 19:24:23 hp-server2 nvidia-persistenced: Started (5485)
Sep 12 19:24:23 hp-server2 nvidia-persistenced: device 0000:21:00.0 - registered
Sep 12 19:24:23 hp-server2 nvidia-persistenced: device 0000:2d:00.0 - registered
Sep 12 19:24:23 hp-server2 nvidia-persistenced: Local RPC service initialized
Sep 12 19:24:24 hp-server2 gnome-session[5466]: gnome-session-binary[5466]: WARNING: Application 'org.gnome.Shell.desktop' killed by signal 11
Sep 12 19:24:24 hp-server2 gnome-session-binary[5466]: WARNING: Application 'org.gnome.Shell.desktop' killed by signal 11
Sep 12 19:24:24 hp-server2 gnome-session-binary[5466]: Unrecoverable failure in required component org.gnome.Shell.desktop
Sep 12 19:24:24 hp-server2 nvidia-persistenced: Received signal 15
Sep 12 19:24:24 hp-server2 nvidia-persistenced: Socket closed.
Sep 12 19:24:24 hp-server2 systemd[1]: Stopping NVIDIA Persistence Daemon...
Sep 12 19:24:24 hp-server2 nvidia-persistenced: PID file unlocked.
Sep 12 19:24:24 hp-server2 nvidia-persistenced: PID file closed.
Sep 12 19:24:24 hp-server2 nvidia-persistenced: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Sep 12 19:24:24 hp-server2 nvidia-persistenced: Shutdown (5485)
Sep 12 19:24:24 hp-server2 systemd[1]: Stopped NVIDIA Persistence Daemon.
Sep 12 19:24:24 hp-server2 gdm3: GdmDisplay: display lasted 0.439243 seconds
Sep 12 19:24:24 hp-server2 gdm3: Child process -5461 was already dead.
Sep 12 19:24:24 hp-server2 gdm3: Child process 5443 was already dead.
Sep 12 19:24:24 hp-server2 gdm3: Unable to kill session worker process
Sep 12 19:24:24 hp-server2 systemd[1]: Stopping User Manager for UID 120...
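To tell a one-off failure from a restart loop in a log like the one above, it can help to count how often the persistence daemon was started and how long each gdm display survived. The following is only a rough sketch; the function names are made up for illustration, and the patterns simply match the message text shown in the excerpt above.

```shell
#!/bin/sh
# Count how many times systemd tried to start the NVIDIA persistence
# daemon in the log text read from stdin.
count_persistenced_starts() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status clean in that case.
    grep -c 'Starting NVIDIA Persistence Daemon' || true
}

# Print the lifetime (in seconds) of each gdm display found on stdin,
# one value per line. Sub-second values suggest X dies right after start.
gdm_display_durations() {
    sed -n 's/.*GdmDisplay: display lasted \([0-9.]*\) seconds.*/\1/p'
}
```

For example, `count_persistenced_starts < /var/log/syslog` and `gdm_display_durations < /var/log/syslog`. Many starts combined with display lifetimes well under a second, like the 0.439243 seconds above, would point to a loop rather than a single crash.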
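For those of us who are not professional sysadmins, a problem like this is really hard to handle. All I could do was keep trying, starting with uninstalling cuda: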
sudo aptitude remove cuda
Then remove the nvidia packages:
sudo apt-get purge 'nvidia-*'
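Before and after a purge like this, it may be worth checking which nvidia packages are actually installed. Here is a small sketch; the helper name is hypothetical, and it only looks at package names starting with "nvidia-", so related packages with other prefixes (libnvidia-*, cuda-*) are not covered.

```shell
#!/bin/sh
# Read `dpkg -l` output on stdin and print the names of packages that
# are fully installed (status "ii") and whose name starts with "nvidia-".
list_installed_nvidia() {
    awk '$1 == "ii" && $2 ~ /^nvidia-/ { print $2 }'
}
```

Usage: `dpkg -l | list_installed_nvidia`. If the purge removed everything, the command should print nothing.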
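After rebooting, the desktop showed the login screen, but logging in failed. I kept going, upgraded, rebooted, and found I could enter the desktop, although it flickered because it was running on a non-nvidia driver. Since installing cuda configures the driver automatically, I kept trying to reinstall cuda. Installation with aptitude succeeded, but it did not fix the desktop crash, and nvidia-smi could no longer be invoked afterwards. So the aptitude method from my earlier blog post no longer worked; further searching turned up an approach that did.
First, purge everything from nvidia again with the command above. Then: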
sudo add-apt-repository ppa:graphics-drivers/ppa
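I had never used this PPA before, because I had installed the cuda repository package with dpkg into /var/cuda-repo-9-0-local and then ran apt/aptitude install cuda. But aptitude no longer worked and apt reported dependency errors, so this time I added this repository. Next I ran an apt update, and then: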
sudo apt install cuda
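To my surprise it worked; the installation ran to the end without errors.
At this point the desktop environment was working again, and so was cuda.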
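A quick way to confirm that state after such a reinstall is to check that the expected tools are reachable. This is just a sketch; `check_tool` is a hypothetical helper, and whether nvcc is on the PATH depends on how the CUDA bin directory was set up on the machine.

```shell
#!/bin/sh
# Report whether a command can be found on the current PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}
```

For example, `check_tool nvidia-smi` and `check_tool nvcc`. If nvidia-smi is found, running it directly shows whether the driver can still talk to every card.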
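In summary, I took a lot of detours. At first I assumed it was the graphical interface again, or a display manager such as gdm. Only after focusing on nvidia's cuda and driver did I finally resolve it. The aptitude method I used before could install cuda, but once things broke it could not repair them. Switching to adding the repository and installing with apt solved all the problems.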