Updating the NVIDIA GPU driver on an Ubuntu 16.04 server from the command line, with fixes for the errors I hit

Why would you need this?

  • My tensorflow 2.1.0 requires CUDA 10.1 + cuDNN 7.6.5, and CUDA 10.1 in turn requires the NVIDIA driver to be upgraded to >= 418.39. For details, see: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#major-components


Installation

1. Look up a suitable driver version on the official site

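First, the official download link: https://www.nvidia.cn/Download/index.aspx?lang=cn

Second, how do you pick the right NVIDIA driver? Pay attention to the following:

  • Operating system (mine is 64-bit Linux)
  • GPU model (mine is a Tesla V series; run lspci |grep -i nvidia and the answer is obvious from the output)
  • CUDA toolkit version (you should already know this, or you can check it with nvcc -V; mine is CUDA 10.1)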
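The three checks above can be gathered into one small script. This is only a sketch: the helper name version_ge is made up for illustration, the 418.39 threshold is the one quoted earlier for CUDA 10.1, and the script assumes GNU sort (for sort -V) and that nvidia-smi is on the PATH if a driver is already installed.

```shell
#!/usr/bin/env bash
# Minimum driver for CUDA 10.1 on Linux, as quoted from the release notes above.
REQUIRED_DRIVER="418.39"

# Hypothetical helper: succeed if version $1 >= version $2 (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# GPU model, as seen on the PCI bus.
lspci 2>/dev/null | grep -i nvidia || echo "no NVIDIA device listed (or lspci is missing)"

# CUDA toolkit version, if nvcc is installed.
if command -v nvcc >/dev/null 2>&1; then
    nvcc -V
fi

# Currently installed driver version, if any.
if command -v nvidia-smi >/dev/null 2>&1; then
    current="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)"
    if [ -n "$current" ] && version_ge "$current" "$REQUIRED_DRIVER"; then
        echo "driver $current is new enough for CUDA 10.1"
    else
        echo "driver '${current:-unknown}' is older than $REQUIRED_DRIVER, upgrade needed"
    fi
fi
```

Running the same script again after the installation is a quick way to confirm that the new driver is the one actually loaded.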

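Download the .run file (make it executable with chmod +x if it is not already), then run it directly from the command line: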

sudo ./NVIDIA-Linux-x86_64-418.126.02.run -no-x-check -no-nouveau-check -no-opengl-files


Errors I ran into during installation:

Problem 1:

An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.

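This one is simple. As the message says, something is still using the 'nvidia-uvm' kernel module, so the installation cannot proceed. What to do?

Run the following command to see which processes are holding nvidia-uvm:

sudo lsof | grep nvidia.uvm

Once you have the PID, kill the process with "sudo kill -9 <pid>". Then run the downloaded .run file again and the installer will get past this error.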
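When many processes hold the device, picking PIDs out of the lsof output by eye gets tedious. A minimal sketch of a filter, assuming the standard lsof column layout where the PID is the second field; pids_from_lsof is a made-up helper name, not part of lsof:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read lsof output on stdin and print each PID once.
# Assumes the default lsof layout, where the second column is the PID.
pids_from_lsof() {
    awk '$2 ~ /^[0-9]+$/ { print $2 }' | sort -un
}

# Example use (left commented out because it terminates processes):
#   sudo lsof 2>/dev/null | grep 'nvidia.uvm' | pids_from_lsof
#   sudo kill <pid>       # try a plain SIGTERM first
#   sudo kill -9 <pid>    # force only if the process refuses to exit
```

On a shared server it is worth checking who owns each PID before killing it, since these are usually other people's training jobs.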

Problem 2:

The CC version check failed

The kernel was built with gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) , but the current compiler version is cc (Ubuntu 4.8.5-4ubuntu2) 4.8.5.

This may lead to subtle problems; if you are not certain whether the mismatched compiler will be compatible with your kernel, you may wish to abort installation, set the CC environment variable to the name of the compiler used to compile your kernel, and restart installation. (Answer: Abort installation

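This one is simple too. As the message says, the kernel was built with gcc 5.4.0, but the compiler currently in use is gcc 4.8.5. We need to install gcc 5.4.0 and switch the system compiler over to it.

How? Download the gcc 5.4.0 archive from the official site, unpack it locally, and install it step by step. Following this post is enough: https://blog.csdn.net/Marilynviolet/article/details/100009979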
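Before rebuilding gcc, it helps to confirm the mismatch yourself. The sketch below assumes /proc/version uses the "gcc version X.Y.Z" wording seen on Ubuntu 16.04 kernels (newer kernels format this line differently), and kernel_gcc_version is a made-up helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the gcc version the kernel was built with.
# Reads /proc/version by default; a different file can be passed for testing.
kernel_gcc_version() {
    sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p' "${1:-/proc/version}"
}

echo "kernel built with gcc: $(kernel_gcc_version 2>/dev/null)"
echo "current cc:            $(cc --version 2>/dev/null | head -n 1)"

# If a matching compiler is already installed under another name (gcc-5 here is
# an assumed name), the installer message itself suggests pointing CC at it:
#   sudo CC="$(command -v gcc-5)" ./NVIDIA-Linux-x86_64-418.126.02.run -no-x-check -no-nouveau-check -no-opengl-files
```

Whether sudo passes CC through depends on the local sudoers policy, so treat the commented command as something to try rather than a guaranteed fix.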

Problem 3:

Some prompts you will see during installation, and how I answered them:

  • The distribution-provided pre-install script failed! Are you sure you want to continue? Choose Yes to continue.
  • Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. Choose No to continue.
  • Install NVIDIA's 32-bit compatibility libraries? Choose No to continue.
  • Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up. Choose Yes to continue.

 

Additional references:

  • Installing, checking, and switching gcc versions: https://cloud.tencent.com/developer/article/1430839
  • A roundup of problems during installation: https://blog.csdn.net/u013832707/article/details/93157805

 


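One last bit of rambling. I did this installation and configuration on a company server, so everyone had to stop their GPU jobs and wait for me. At first I wanted to ask more experienced colleagues for help, but while chatting with them I realized that many of them simply follow blog posts step by step as well. This kind of setup is not really hard, just fiddly, and the only way through is to spend the time doing it hands-on. After all, nobody is born an expert.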
