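Notes on installing the GPU driver, CUDA, and cuDNN for a discrete graphics card

I took quite a few detours along the way.

Mostly because installing the graphics driver has so many pitfalls.

The first big pitfall: I don't recommend Ubuntu 16, because the installation failed on several machines. Upgrading the kernel might fix it, but that route is too long.

Beyond that, there are a few checkpoints to watch.

1. Check the installed driver. If you can't tell what state the system is in, you can't find the problem or fix it.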

hy@hy-Mi-Gaming-Laptop-15-6:~/kxwell$ ls /usr/src | grep nvidia
nvidia-455.38

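The output is the installed version.

If no driver has been installed, nothing is printed. If the version is too old, I recommend uninstalling it first: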

sudo apt-get remove --purge 'nvidia*'
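The /usr/src check above can also be scripted, which helps when several machines need checking. This is only a minimal sketch: it assumes the driver sources show up as nvidia-<version> directories, as in the output above, and the function name installed_nvidia_versions is my own.

```python
import os
import re


def installed_nvidia_versions(src_dir="/usr/src"):
    """Return the versions found as nvidia-<version> entries in src_dir."""
    pattern = re.compile(r"^nvidia-(\d+(?:\.\d+)*)$")
    try:
        names = os.listdir(src_dir)
    except OSError:
        # Directory missing or unreadable: treat it as "nothing installed".
        return []
    versions = []
    for name in sorted(names):
        match = pattern.match(name)
        if match:
            versions.append(match.group(1))
    return versions


found = installed_nvidia_versions()
print(found if found else "no nvidia-* entry under /usr/src")
```

An empty result corresponds to the "nothing is printed" case described above.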

Check the graphics card information

lspci | grep VGA shows the device information.

lspci -vnn | grep VGA -A 12 shows which driver is in use.

(base) hy@hy-Default-string:~$ lspci -vnn | grep VGA -A 12
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2184] (rev a1) (prog-if 00 [VGA controller])
    Subsystem: Device [1b4c:1366]
    Flags: bus master, fast devsel, latency 0, IRQ 134
    Memory at de000000 (32-bit, non-prefetchable) [size=16M]
    Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Memory at d0000000 (64-bit, prefetchable) [size=32M]
    I/O ports at e000 [size=128]
    Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities:
    Kernel driver in use: nouveau
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aeb] (rev a1)

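The output above shows that the nouveau driver is still in use rather than the nvidia driver. That is why nvidia-smi prints nothing: the driver for the card is not correctly installed yet.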
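Reading the "Kernel driver in use" line by eye is easy to get wrong, so here is a small sketch that pulls it out of the lspci text. The sample is an abbreviated excerpt of the output above; the function name kernel_driver_in_use is just an illustration.

```python
import re


def kernel_driver_in_use(lspci_text):
    """Return the driver on the first 'Kernel driver in use:' line, or None."""
    match = re.search(r"Kernel driver in use:\s*(\S+)", lspci_text)
    return match.group(1) if match else None


# Abbreviated excerpt of the `lspci -vnn | grep VGA -A 12` output in this post.
LSPCI_SAMPLE = """\
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2184] (rev a1)
    Kernel driver in use: nouveau
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
"""

driver = kernel_driver_in_use(LSPCI_SAMPLE)
print(driver)
if driver != "nvidia":
    print("the proprietary nvidia driver is not active yet")
```

On a real machine the text would come from running lspci -vnn, for example through subprocess.run.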

Update the system configuration (this regenerates the initramfs):

sudo update-initramfs -u

2. Installing the graphics driver automatically on Ubuntu 18.04

List the devices:

ubuntu-drivers devices

To install the recommended version, just type this in a terminal:

sudo ubuntu-drivers autoinstall

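This completes the automatic installation. Then reboot.

In my tests this worked fine on a 1050. On an RTX 3060, however, the automatic install could only offer version 460, which booted to a black screen. After uninstalling it and installing 465 instead, it worked.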

sudo apt-get install nvidia-driver-465

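How to verify: nvidia-smi should display the card information and the highest CUDA version it supports.

3. Installing the CUDA libraries

nvidia-smi shows the card information and the supported CUDA version.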

(base) hy@hy-Default-string:~$ nvidia-smi
Mon Jul 12 10:27:19 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 465.27       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   44C    P8    18W / 220W |    633MiB / 12052MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1337      G   /usr/bin/gnome-shell              175MiB |
|    0   N/A  N/A      1572      G   /usr/lib/xorg/Xorg                194MiB |
|    0   N/A  N/A      1703      G   /usr/bin/gnome-shell               72MiB |
|    0   N/A  N/A      2193      G   gnome-control-center               34MiB |
|    0   N/A  N/A      2210      G   /usr/lib/firefox/firefox          118MiB |
|    0   N/A  N/A      2299      G   /usr/lib/firefox/firefox            2MiB |
+-----------------------------------------------------------------------------+

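The highest supported version here is 11.3. For compatibility with most libraries you can install an 11.x release. 10.2 is also an option on older cards, but as far as I know an Ampere card such as the RTX 3060 (compute capability 8.6) needs CUDA 11.1 or newer. I'll work out later how to switch between CUDA versions.

CUDA versions and driver versions correspond to each other to some extent, which you can see from the name of the CUDA installer, so it is best to choose the installer according to what nvidia-smi reports.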
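To make that correspondence concrete, the sketch below reads the two versions from the nvidia-smi banner and the two versions embedded in a runfile name such as cuda_11.3.0_465.19.01_linux.run, then compares them. The comparison rules are my own conservative heuristic, not NVIDIA's official compatibility table, and the helper names are made up for illustration.

```python
import re


def parse_smi_header(line):
    """Extract (driver_version, cuda_version) from the nvidia-smi banner line."""
    match = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", line)
    return (match.group(1), match.group(2)) if match else None


def parse_runfile_name(name):
    """Extract (cuda_version, bundled_driver) from cuda_<cuda>_<driver>_linux.run."""
    match = re.fullmatch(r"cuda_([\d.]+)_([\d.]+)_linux\.run", name)
    return (match.group(1), match.group(2)) if match else None


def version_tuple(text):
    """Turn '465.19.01' into (465, 19, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


header = "| NVIDIA-SMI 465.27       Driver Version: 465.27       CUDA Version: 11.3     |"
driver, max_cuda = parse_smi_header(header)
cuda, bundled_driver = parse_runfile_name("cuda_11.3.0_465.19.01_linux.run")

# Heuristic 1: the toolkit's major.minor should not exceed what the driver reports.
toolkit_ok = version_tuple(cuda)[:2] <= version_tuple(max_cuda)
# Heuristic 2 (assumption): the installed driver is at least as new as the bundled one.
driver_ok = version_tuple(driver) >= version_tuple(bundled_driver)
print(toolkit_ok, driver_ok)
```

With the values from this post both checks pass, which matches the experience that 11.3 installs cleanly on top of driver 465.27.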

CUDA has since been updated to 11.4. To find installers for older versions, go to this page:

CUDA Toolkit Archive | NVIDIA Developer

Find the version you want.

Choosing the runfile is more convenient.

wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda_11.3.0_465.19.01_linux.run

sudo sh cuda_11.3.0_465.19.01_linux.run

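Follow the prompts shown on the download page and by the installer. It seems the installation does not go through unless the existing driver is uninstalled first.

Another way: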

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin

sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600

wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda-repo-ubuntu1804-11-3-local_11.3.0-465.19.01-1_amd64.deb

sudo dpkg -i cuda-repo-ubuntu1804-11-3-local_11.3.0-465.19.01-1_amd64.deb

sudo apt-key add /var/cuda-repo-ubuntu1804-11-3-local/7fa2af80.pub

sudo apt-get update

sudo apt-get -y install cuda

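In practice, the deb installation method can lead to a black screen, so I still recommend the runfile method.

The paths probably still need to be configured before it takes effect. As a crude shortcut, install an older version first: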

sudo apt install nvidia-cuda-toolkit


Then confirm:

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85


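The version looks too low, but let's move on for now.

To uninstall CUDA, run CUDA's own uninstaller from the /usr/local/cuda/bin directory.

After installation, the paths need to be configured: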

nano ~/.bashrc

export CUDA_HOME=/usr/local/cuda-xxx
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export PATH=${CUDA_HOME}/bin:${PATH}
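One detail of these exports is easy to get wrong: the CUDA directory should go in front of what is already there, without overwriting it and without leaving an empty entry (an empty entry in a search path is treated as the current directory). The sketch below shows that logic in isolation. prepend_path is a hypothetical helper, and /usr/local/cuda-11.4 is only an assumed install prefix.

```python
def prepend_path(existing, new_dir):
    """Return a colon-separated search path with new_dir first.

    Duplicates of new_dir and empty entries are dropped.
    """
    parts = [p for p in existing.split(":") if p and p != new_dir]
    return ":".join([new_dir] + parts)


cuda_home = "/usr/local/cuda-11.4"  # assumed install prefix, adjust to your version
print(prepend_path("/usr/bin:/bin", cuda_home + "/bin"))
print(prepend_path("", cuda_home + "/lib64"))
```

The second call covers the case where LD_LIBRARY_PATH was not set at all.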

Then

sudo nano /etc/ld.so.conf

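and add the library directory as a bare path on its own line (the include keyword is for pulling in other config files, not directories):

/usr/local/cuda-11.4/lib64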


Then run

 sudo ldconfig


You can check the result with nvcc -V:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0


Perfect.
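If nvcc still reports 9.1 at this point, the shell is most likely still picking up the copy installed by apt rather than the one under /usr/local. A small sketch for checking the release number from the nvcc -V text follows. The sample is the output shown above, and nvcc_release is a name I made up.

```python
import re


def nvcc_release(nvcc_output):
    """Return (major, minor) from the 'release X.Y' part of nvcc -V, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


# The nvcc -V output recorded in this post.
NVCC_SAMPLE = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
"""

release = nvcc_release(NVCC_SAMPLE)
print(release)
if release is not None and release < (11, 1):
    print("this toolkit predates CUDA 11.1, check PATH")
```

The (11, 1) threshold is specific to the RTX 3060 used here and would differ for other cards.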

4. Installing cuDNN

Download page. Make sure the version matches your CUDA version.

cuDNN Archive | NVIDIA Developer

Download the matching version, then extract it.

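# cuDNN 8 ships several cudnn*.h headers, so copy them all
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include/

# -P keeps the libcudnn symlinks as symlinks
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64/

sudo chmod a+r /usr/local/cuda/include/cudnn*.h

sudo chmod a+r /usr/local/cuda/lib64/libcudnn*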

Add the path as well, otherwise tensorflow-gpu won't find the library:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64

5. Setting up PyTorch or TensorFlow to verify that the GPU is usable

Install TensorFlow with conda:

conda install tensorflow-gpu

import tensorflow as tf

Log message:

2021-07-12 11:14:15.015023: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


If this message appears, TensorFlow and CUDA are installed correctly.

hello=tf.constant("Hello")


Message:

Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory

This means cuDNN is not installed, or at least cannot be found.

Go back and recheck the cuDNN installation above.
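Rather than starting TensorFlow each time to see whether the library is found, the dynamic loader can be asked directly. This is a minimal sketch using ctypes from the standard library. The two library names are the ones that appear in the log messages above, and can_load is just an illustrative helper.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open library_name."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


for name in ("libcudart.so.11.0", "libcudnn.so.8"):
    print(name, "ok" if can_load(name) else "NOT found by the loader")
```

If libcudnn.so.8 is reported as not found, the usual suspects are the copy step, LD_LIBRARY_PATH, or a missing ldconfig run.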

Then, with some excitement, run a test to confirm that the GPU is usable:

>>> tf.test.is_gpu_available()
WARNING:tensorflow:From :1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-07-12 11:29:23.778891: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 11:29:23.780518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3060 computeCapability: 8.6
coreClock: 1.897GHz coreCount: 28 deviceMemorySize: 11.77GiB deviceMemoryBandwidth: 335.32GiB/s
2021-07-12 11:29:23.780720: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 11:29:23.782325: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 11:29:23.783285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-07-12 11:29:23.783316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-12 11:29:23.783325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0
2021-07-12 11:29:23.783333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N
2021-07-12 11:29:23.783411: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 11:29:23.783913: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 11:29:23.784384: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/device:GPU:0 with 9710 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3060, pci bus id: 0000:01:00.0, compute capability: 8.6)
True


Compute capability 8.6, oh yeah! (The 3060 is 8.6, the 1660 is 7.5.)

The newer form of the command is:

tf.config.list_physical_devices('GPU')

 

PyTorch official site

PyTorch

conda install command

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

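If the PyTorch build and the CUDA version don't match, you may get an error, so use the command above to pin the CUDA version and install the matching torch build.

Confirm in PyTorch: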

torch.cuda.is_available()

It should return True.
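A single True or False says little when something is off, so a slightly fuller report can help. The sketch below collects what PyTorch itself sees. It only uses calls I'm confident exist (torch.version.cuda, torch.cuda.is_available, torch.cuda.get_device_name, torch.cuda.get_device_capability) and returns None when torch is not installed. torch_gpu_report is my own name.

```python
def torch_gpu_report():
    """Return a dict describing PyTorch's view of the GPU, or None without torch."""
    try:
        import torch
    except ImportError:
        return None
    report = {
        "torch": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        # Only query the device when CUDA is actually usable.
        report["device"] = torch.cuda.get_device_name(0)
        report["capability"] = torch.cuda.get_device_capability(0)
    return report


print(torch_gpu_report())
```

On the machine in this post I'd expect the capability entry to be (8, 6), matching the TensorFlow log above.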

Troubleshooting

1. No kernel image at runtime

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
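I didn't record a fix for this one. As far as I know, a common cause is that the installed PyTorch binary was not built for the card's compute capability (8.6 for the RTX 3060), which is why pinning a new enough cudatoolkit matters. The sketch below shows a simplified version of that check. The arch lists are made-up examples, not taken from any real build; on a real system they would come from torch.cuda.get_arch_list() and the capability from torch.cuda.get_device_capability(). PTX entries are ignored here.

```python
import re


def arch_supported(capability, arch_list):
    """Simplified rule: some sm_XY entry has the same major and a minor <= the device's."""
    major, minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)(\d)", arch)
        if not match:
            continue  # skip compute_XY (PTX) and anything unrecognised
        arch_major, arch_minor = int(match.group(1)), int(match.group(2))
        if arch_major == major and arch_minor <= minor:
            return True
    return False


# Hypothetical arch lists for illustration only.
older_build = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
newer_build = older_build + ["sm_80", "sm_86"]

print(arch_supported((8, 6), older_build))
print(arch_supported((8, 6), newer_build))
```

When the check fails for your card, reinstalling PyTorch with a matching cudatoolkit is the first thing I would try.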

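2. Error during training

RuntimeError:a leaf Variable that requires grad has been used in an in-place

Put with torch.no_grad(): in front of the code that raises the error, so that the in-place operation runs inside that block.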

3. Missing libraries when installing CUDA

Missing recommended library: libGLU.so;Missing recommended library: libXmu.so

Install the libraries:

sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev

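4. Fixing the Could not load dynamic library 'libcusolver.so.11' problem

The quick and dirty way: I found a different version of the file under /usr/local/cuda-11.0 and simply copied it to the expected name.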

sudo cp libcusolver.so.10 libcusolver.so.11
