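I hit this error while running nvidia-smi. The fix is trivially simple, a reboot is all it takes, but I still want to analyze the cause and dig a little further. I always hope for an unexpected payoff, haha.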
The NVIDIA Management Library (NVML) is a C-based programmatic interface for monitoring and managing various states within NVIDIA Tesla™ GPUs. It is intended to be a platform for building 3rd party applications, and is also the underlying library for the NVIDIA-supported nvidia-smi tool. NVML is thread-safe so it is safe to make simultaneous NVML calls from multiple threads. In short, NVML is a programmable interface, a platform for building third-party applications, and the library that certain tools (such as nvidia-smi) depend on.
[zuosi@localhost]$cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 396.44 Wed Jul 11 16:51:49 PDT 2018
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)
This shows an NVRM version of 396.44. Now let's look at the version of the installed graphics driver packages.
[zuosi@localhost]$sudo dpkg --list | grep nvidia
rc nvidia-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 384.130
ii nvidia-396 396.82-0ubuntu1 amd64 NVIDIA binary driver - version 396.82
ii nvidia-cuda-dev 7.5.18-0ubuntu1 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 7.5.18-0ubuntu1 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 7.5.18-0ubuntu1 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 7.5.18-0ubuntu1 amd64 NVIDIA CUDA development toolkit
ii nvidia-opencl-dev:amd64 7.5.18-0ubuntu1 amd64 NVIDIA OpenCL development files
rc nvidia-opencl-icd-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-opencl-icd-396 396.82-0ubuntu1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-profiler 7.5.18-0ubuntu1 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 418.40.04-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-visual-profiler 7.5.18-0ubuntu1 amd64 NVIDIA Visual Profiler for CUDA and OpenCL
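Note: ii means the package should be installed and is installed; rc means the package was removed/uninstalled but its configuration files are still present.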
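The status column can also be filtered mechanically instead of read by eye. The following is only a minimal sketch: the function name `packages_in_state` is made up for illustration, and it assumes the column layout shown in the listing above (status code first, package name second).

```shell
# Print the names of packages whose dpkg status code equals $1 (e.g. "rc" or "ii").
# Reads `dpkg --list` style lines on stdin.
# packages_in_state is a hypothetical helper name, not a dpkg command.
packages_in_state() {
    awk -v state="$1" '$1 == state { print $2 }'
}

# Example use on a machine with dpkg:
#   dpkg --list | grep nvidia | packages_in_state rc
```

Fed the listing above, the `rc` filter should leave only the two 384-series packages, which are the ones that were removed but still have configuration files behind.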
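The driver version shown here is nvidia-396 at 396.82, so the discrepancy must be right here (396.44 vs 396.82): the version already loaded in the kernel is clearly lagging behind. Why did it work before and suddenly stop matching? Let's check the dpkg log.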
[zuosi@localhost]$cat /var/log/dpkg.log| grep nvidia
2019-04-29 14:08:57 upgrade nvidia-396:amd64 396.44-0ubuntu1 396.82-0ubuntu1
2019-04-29 14:08:57 status half-configured nvidia-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:04 status unpacked nvidia-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:04 status half-installed nvidia-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:13 status half-installed nvidia-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:13 status unpacked nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:13 status unpacked nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:14 upgrade nvidia-opencl-icd-396:amd64 396.44-0ubuntu1 396.82-0ubuntu1
2019-04-29 14:09:14 status half-configured nvidia-opencl-icd-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-opencl-icd-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:14 status half-installed nvidia-opencl-icd-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:14 status half-installed nvidia-opencl-icd-396:amd64 396.44-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:14 upgrade nvidia-settings:amd64 410.72-0ubuntu1 418.40.04-0ubuntu1
2019-04-29 14:09:14 status half-configured nvidia-settings:amd64 410.72-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-settings:amd64 410.72-0ubuntu1
2019-04-29 14:09:14 status half-installed nvidia-settings:amd64 410.72-0ubuntu1
2019-04-29 14:09:14 status half-installed nvidia-settings:amd64 410.72-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-settings:amd64 418.40.04-0ubuntu1
2019-04-29 14:09:14 status unpacked nvidia-settings:amd64 418.40.04-0ubuntu1
2019-04-29 14:09:59 configure nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:59 status unpacked nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:59 status unpacked nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:09:59 status half-configured nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:54 status installed nvidia-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:55 configure nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:55 status unpacked nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:55 status unpacked nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:55 status half-configured nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:55 status installed nvidia-opencl-icd-396:amd64 396.82-0ubuntu1
2019-04-29 14:10:55 configure nvidia-settings:amd64 418.40.04-0ubuntu1
2019-04-29 14:10:55 status unpacked nvidia-settings:amd64 418.40.04-0ubuntu1
2019-04-29 14:10:55 status unpacked nvidia-settings:amd64 418.40.04-0ubuntu1
2019-04-29 14:10:55 status half-configured nvidia-settings:amd64 418.40.04-0ubuntu1
2019-04-29 14:10:55 status installed nvidia-settings:amd64 418.40.04-0ubuntu1
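Most of the log above is status bookkeeping; the lines that matter are the `upgrade` records. As a rough sketch, they can be pulled out with awk. The helper name `list_upgrades` is hypothetical, and the field positions (date, time, action, package, old version, new version) are assumed from the log lines shown above.

```shell
# Summarize package upgrades found in dpkg log text read from stdin.
# Assumed field layout: date time action package old-version new-version.
# list_upgrades is an illustrative name, not part of dpkg.
list_upgrades() {
    awk '$3 == "upgrade" { printf "%s %s %s: %s -> %s\n", $1, $2, $4, $5, $6 }'
}

# Example use:
#   grep nvidia /var/log/dpkg.log | list_upgrades
```

Applied to the log above, this would reduce the output to the three upgrade records for nvidia-396, nvidia-opencl-icd-396 and nvidia-settings.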
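Clearly, the NVIDIA graphics driver was upgraded once (apparently because I ran apt-get upgrade by hand?), from 396.44 to 396.82, but the kernel module still has to be reloaded. In fact the new kernel driver module is already in place, just waiting for you to load it into the kernel. See for yourself.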
[zuosi@localhost]$find /lib/modules/$(uname -r) -name "*nvidia*.ko" -ls
8677356 64 -rw-r--r-- 1 root root 63846 Feb 13 04:31 /lib/modules/4.15.0-46-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
8650998 72 -rw-r--r-- 1 root root 69852 Apr 29 14:10 /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396_drm.ko
8650995 18392 -rw-r--r-- 1 root root 18830596 Apr 29 14:10 /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396.ko
8650997 1292 -rw-r--r-- 1 root root 1319556 Apr 29 14:10 /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396_modeset.ko
8650999 1260 -rw-r--r-- 1 root root 1286612 Apr 29 14:10 /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396_uvm.ko
[zuosi@localhost]$modinfo /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396.ko
filename: /lib/modules/4.15.0-46-generic/updates/dkms/nvidia_396.ko
alias: char-major-195-*
version: 396.82
supported: external
license: NVIDIA
srcversion: 1972864AFC73362967DE403
alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: ipmi_msghandler
retpoline: Y
name: nvidia
vermagic: 4.15.0-46-generic SMP mod_unload
parm: NVreg_Mobile:int
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_UpdateMemoryTypes:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_CheckPCIConfigSpace:int
parm: NVreg_EnablePCIeGen3:int
parm: NVreg_EnableMSI:int
parm: NVreg_TCEBypassMode:int
parm: NVreg_UseThreadedInterrupts:int
parm: NVreg_EnableStreamMemOPs:int
parm: NVreg_EnableBacklightHandler:int
parm: NVreg_RestrictProfilingToAdminUsers:int
parm: NVreg_EnableUserNUMAManagement:int
parm: NVreg_MemoryPoolSize:int
parm: NVreg_IgnoreMMIOCheck:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RegistryDwordsPerDevice:charp
parm: NVreg_RmMsg:charp
parm: NVreg_AssignGpus:charp
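So the module on disk is 396.82 while the one loaded in the kernel still reports 396.44. A small check along these lines could flag that state before nvidia-smi complains. It is a sketch only: `loaded_version` and `check_versions` are hypothetical helper names, and the sed pattern assumes the `NVRM version:` line format shown earlier.

```shell
# Extract the loaded driver version from /proc/driver/nvidia/version text on stdin.
# Assumes a line like: "NVRM version: NVIDIA UNIX x86_64 Kernel Module 396.44 ..."
loaded_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Compare the loaded version ($1) with the on-disk version ($2).
# Returns 0 when they match and 1 otherwise.
check_versions() {
    if [ "$1" = "$2" ]; then
        echo "OK: loaded module and on-disk module are both $1"
    else
        echo "MISMATCH: loaded $1, on disk $2 (reload the module or reboot)"
        return 1
    fi
}

# Example use on a machine with the driver installed:
#   loaded=$(loaded_version < /proc/driver/nvidia/version)
#   ondisk=$(modinfo -F version "/lib/modules/$(uname -r)/updates/dkms/nvidia_396.ko")
#   check_versions "$loaded" "$ondisk"
```

`modinfo -F version` prints only the version field, which avoids scrolling through the long `parm:` list.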
I took the simplest route: a reboot, which loads the 396.82 graphics driver kernel module, heh.