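I took quite a few detours getting this working.
Most of the pitfalls are in installing the GPU driver.
The first big pitfall: I don't recommend Ubuntu 16. The installation failed on several machines; a kernel upgrade might fix it, but that is too long a path.
Beyond that, there are a few checkpoints to watch.
1. Check which driver is already installed. If you can't read this result, you can't diagnose or fix anything.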
hy@hy-Mi-Gaming-Laptop-15-6:~/kxwell$ ls /usr/src | grep nvidia
nvidia-455.38
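The output is the installed version.
If no driver has been installed, nothing is printed. If the version is too old, uninstall it first (the pattern is quoted so the shell does not expand it):
sudo apt-get remove --purge '^nvidia-.*'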
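The same check as ls /usr/src | grep nvidia can be scripted. This is only a minimal sketch, assuming the driver sources live in directories named nvidia-<version> as in the output above; the function name installed_nvidia_versions is made up for illustration.

```python
import os
import re

def installed_nvidia_versions(src_dir="/usr/src"):
    """Return driver versions found as nvidia-<version> entries in src_dir."""
    pattern = re.compile(r"^nvidia-(\d+(?:\.\d+)*)$")
    try:
        entries = os.listdir(src_dir)
    except FileNotFoundError:
        # Directory missing: treat it the same as "nothing installed".
        return []
    versions = []
    for name in sorted(entries):
        match = pattern.match(name)
        if match:
            versions.append(match.group(1))
    return versions

if __name__ == "__main__":
    found = installed_nvidia_versions()
    print(found if found else "no NVIDIA driver sources found")
```

An empty list corresponds to the "nothing is printed" case of the shell command.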
Check the GPU information
lspci | grep VGA shows the device information.
lspci -vnn | grep VGA -A 12 shows which driver is in use.
(base) hy@hy-Default-string:~$ lspci -vnn | grep VGA -A 12
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2184] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Device [1b4c:1366]
Flags: bus master, fast devsel, latency 0, IRQ 134
Memory at de000000 (32-bit, non-prefetchable) [size=16M]
Memory at c0000000 (64-bit, prefetchable) [size=256M]
Memory at d0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities:
Kernel driver in use: nouveau
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aeb] (rev a1)
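The output above shows that the card is still using the nouveau driver rather than NVIDIA's driver, which is why nvidia-smi prints nothing: the driver is not yet installed correctly.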
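Reading the "Kernel driver in use" line by eye is easy to get wrong when several devices are listed, so here is a small sketch that pulls it out of lspci -vnn output for the VGA device only. The helper name vga_kernel_driver and the shortened sample text are illustrative assumptions, not part of any tool.

```python
import re

# A device header starts with a PCI address such as "01:00.0 ".
_DEVICE_HEADER = re.compile(r"^(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7] ")

def vga_kernel_driver(lspci_output):
    """Return the driver named on the VGA device's 'Kernel driver in use' line."""
    in_vga = False
    for line in lspci_output.splitlines():
        if _DEVICE_HEADER.match(line):
            # Track whether the lines that follow belong to the VGA device.
            in_vga = "VGA compatible controller" in line
        elif in_vga and line.strip().startswith("Kernel driver in use:"):
            return line.split(":", 1)[1].strip()
    return None

sample = """01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2184] (rev a1)
\tKernel driver in use: nouveau
\tKernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:1aeb] (rev a1)
"""
print(vga_kernel_driver(sample))  # prints: nouveau
```

A result of nouveau means the NVIDIA driver is not active yet; once it is installed correctly, the same line should name nvidia.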
Update the initramfs
sudo update-initramfs -u
2. Installing the GPU driver automatically on Ubuntu 18.04
List the devices
ubuntu-drivers devices
To install the recommended version, just run this in a terminal:
sudo ubuntu-drivers autoinstall
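This completes the automatic installation; then reboot.
In my tests a 1050 installed without problems. On an RTX 3060, however, only version 460 could be installed this way, and it booted to a black screen; after uninstalling it and installing 465 instead, it worked.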
sudo apt-get install nvidia-driver-465
Verification: nvidia-smi should show the GPU information along with the highest CUDA version the driver supports.
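For a scripted version of this verification, the driver version and the highest supported CUDA version can be read from the nvidia-smi header line. This is a sketch under the assumption that the header keeps the "Driver Version: ... CUDA Version: ..." wording; both function names are hypothetical.

```python
import re
import subprocess

def parse_nvidia_smi_header(text):
    """Extract (driver_version, cuda_version) from nvidia-smi output text."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return (driver.group(1) if driver else None,
            cuda.group(1) if cuda else None)

def query_nvidia_smi():
    """Run nvidia-smi; return None if it is missing or exits with an error."""
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True,
                                text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvidia_smi_header(result.stdout)

if __name__ == "__main__":
    print(query_nvidia_smi())
```

A return value of None from query_nvidia_smi corresponds to the case where the driver is not installed correctly.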
3. Installing the CUDA libraries
nvidia-smi shows the GPU information and the supported CUDA version.
(base) hy@hy-Default-string:~$ nvidia-smi
Mon Jul 12 10:27:19 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27 Driver Version: 465.27 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 0% 44C P8 18W / 220W | 633MiB / 12052MiB | 14% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1337 G /usr/bin/gnome-shell 175MiB |
| 0 N/A N/A 1572 G /usr/lib/xorg/Xorg 194MiB |
| 0 N/A N/A 1703 G /usr/bin/gnome-shell 72MiB |
| 0 N/A N/A 2193 G gnome-control-center 34MiB |
| 0 N/A N/A 2210 G /usr/lib/firefox/firefox 118MiB |
| 0 N/A N/A 2299 G /usr/lib/firefox/firefox 2MiB |
+-----------------------------------------------------------------------------+
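Here the highest supported version is 11.3. For compatibility with most libraries you could install an 11.x release or 10.2, but note that the RTX 3060 (compute capability 8.6) needs CUDA 11.1 or later, so 10.2 only suits older cards such as the 1660. Switching between CUDA versions is something to figure out later.
The CUDA version and the driver version are tied together, as the installer file name shows (it contains both), so it is best to choose the CUDA version based on what nvidia-smi reports.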
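To make the file name relationship concrete, the sketch below splits a runfile name into its CUDA version and driver version and compares the latter with the installed driver. It assumes the cuda_<cuda>_<driver>_linux.run naming used by the 11.3.0 installer, and it treats the embedded driver version as a lower bound, which is my reading rather than something the file name guarantees. All function names are hypothetical.

```python
import re

_RUNFILE = re.compile(r"^cuda_(\d+(?:\.\d+)*)_(\d+(?:\.\d+)*)_linux\.run$")

def parse_runfile_name(filename):
    """Return (cuda_version, driver_version) from a runfile name, or None."""
    match = _RUNFILE.match(filename)
    return (match.group(1), match.group(2)) if match else None

def version_tuple(version):
    """Turn '465.19.01' into (465, 19, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def driver_is_new_enough(installed_driver, runfile_name):
    """Compare the installed driver with the version embedded in the file name."""
    parsed = parse_runfile_name(runfile_name)
    if parsed is None:
        raise ValueError(f"unrecognised runfile name: {runfile_name}")
    _, bundled_driver = parsed
    return version_tuple(installed_driver) >= version_tuple(bundled_driver)

print(parse_runfile_name("cuda_11.3.0_465.19.01_linux.run"))  # ('11.3.0', '465.19.01')
print(driver_is_new_enough("465.27", "cuda_11.3.0_465.19.01_linux.run"))  # True
```

With the driver 465.27 reported by nvidia-smi, the 11.3.0 runfile passes this check.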
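CUDA has since moved on to 11.4; older versions are available from this page:
CUDA Toolkit Archive | NVIDIA Developer
Find the version you want.
The runfile installer is the more convenient choice.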
wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda_11.3.0_465.19.01_linux.run
sudo sh cuda_11.3.0_465.19.01_linux.run
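Follow the installer's prompts. It seems the installation will not go through unless the existing driver is uninstalled first.
Alternatively, use the deb package: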
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda-repo-ubuntu1804-11-3-local_11.3.0-465.19.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-11-3-local_11.3.0-465.19.01-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu1804-11-3-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
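In practice the deb route can leave you with a black screen, so the runfile route is still the one I recommend.
The paths probably still need to be configured before CUDA is usable. As a quick and dirty alternative, install the older version from the Ubuntu repositories first: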
sudo apt install nvidia-cuda-toolkit
Then confirm:
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
This version looks too old, but carry on for now.
To uninstall CUDA, run CUDA's own uninstaller from /usr/local/cuda/bin.
After installation, configure the paths:
nano ~/.bashrc
export CUDA_HOME=/usr/local/cuda-xxx
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export PATH=${CUDA_HOME}/bin:${PATH}
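Then
sudo nano /etc/ld.so.conf
and add the library directory as a bare path (the include directive is for pulling in other config files, not directories)
/usr/local/cuda-11.4/lib64
Then run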
sudo ldconfig
Check the result with nvcc -V:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0
Perfect.
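The two nvcc -V outputs above differ only in the release line, so that is the part worth checking in a script. A minimal sketch, assuming the "release X.Y" wording shown in both outputs; nvcc_release is a made-up helper name.

```python
import re

def nvcc_release(nvcc_output):
    """Return the toolkit release as (major, minor), or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None

old = "Cuda compilation tools, release 9.1, V9.1.85"
new = "Cuda compilation tools, release 11.4, V11.4.100"
print(nvcc_release(old), nvcc_release(new))  # (9, 1) (11, 4)
print(nvcc_release(old) < nvcc_release(new))  # True
```

If nvcc -V still reports the old release after editing ~/.bashrc, the PATH change has probably not been picked up by the current shell yet.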
4. Installing cuDNN
Download page (make sure the version matches your CUDA version):
cuDNN Archive | NVIDIA Developer
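Download the matching version and extract it, then copy every cuDNN header (cuDNN 8 ships more than cudnn.h) and the libraries, keeping symlinks:
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include/
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn*.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
Add the library path, otherwise tensorflow-gpu will not find cuDNN: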
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
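Before moving on to TensorFlow, it can help to confirm that the copy and the path setting actually took effect. The sketch below lists the cuDNN files under a CUDA directory and checks whether its lib64 directory appears in LD_LIBRARY_PATH. It assumes the /usr/local/cuda layout used above; both function names are hypothetical.

```python
import glob
import os

def cudnn_files(cuda_home="/usr/local/cuda"):
    """Return (headers, libraries) matching the cuDNN file name patterns."""
    headers = sorted(glob.glob(os.path.join(cuda_home, "include", "cudnn*.h")))
    libs = sorted(glob.glob(os.path.join(cuda_home, "lib64", "libcudnn*")))
    return headers, libs

def lib_dir_on_path(cuda_home="/usr/local/cuda", ld_library_path=None):
    """Check whether <cuda_home>/lib64 is one of the LD_LIBRARY_PATH entries."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    target = os.path.normpath(os.path.join(cuda_home, "lib64"))
    return any(os.path.normpath(entry) == target
               for entry in ld_library_path.split(":") if entry)

if __name__ == "__main__":
    headers, libs = cudnn_files()
    print("headers:", headers)
    print("libraries:", libs)
    print("lib64 on LD_LIBRARY_PATH:", lib_dir_on_path())
```

Empty lists or a False result point back at the cp or export steps above.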
5. Setting up PyTorch or TensorFlow and verifying that the GPU is usable
Install TensorFlow with conda:
conda install tensorflow-gpu
import tensorflow as tf
Log output:
2021-07-12 11:14:15.015023: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Seeing this message means TensorFlow and CUDA are installed correctly.
hello=tf.constant("Hello")
Log output:
Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
This means cuDNN is not installed.
Go back and recheck the cuDNN installation above.
Then, with some excitement, test that the GPU is usable:
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-07-12 11:29:23.778891: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 11:29:23.780518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3060 computeCapability: 8.6
coreClock: 1.897GHz coreCount: 28 deviceMemorySize: 11.77GiB deviceMemoryBandwidth: 335.32GiB/s
2021-07-12 11:29:23.780720: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 11:29:23.782325: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 11:29:23.783285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-07-12 11:29:23.783316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-12 11:29:23.783325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-07-12 11:29:23.783333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-07-12 11:29:23.783411: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 11:29:23.783913: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-12 11:29:23.784384: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/device:GPU:0 with 9710 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3060, pci bus id: 0000:01:00.0, compute capability: 8.6)
True
Compute capability 8.6, hooray! (The 3060 is 8.6; the 1660 is 7.5.)
In newer versions the call is:
tf.config.list_physical_devices('GPU')
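As a small script rather than an interactive session, the newer call might be wrapped like this. It is only a sketch: tensorflow_gpu_names is a hypothetical helper, and the guard simply lets the script run on a machine where TensorFlow is not installed.

```python
import importlib.util

def tensorflow_gpu_names():
    """Return names of GPUs visible to TensorFlow, or None if it is not installed."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf
    return [device.name for device in tf.config.list_physical_devices("GPU")]

if __name__ == "__main__":
    names = tensorflow_gpu_names()
    if names is None:
        print("TensorFlow is not installed in this environment")
    elif not names:
        print("TensorFlow is installed but sees no GPU")
    else:
        print("GPUs:", names)
```

An empty list is the counterpart of tf.test.is_gpu_available() returning False.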
PyTorch website:
PyTorch
conda install command:
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
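If the PyTorch build and the CUDA version don't match, you may get errors, so use the command above to pin the CUDA version and install the matching torch build.
Verify PyTorch: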
torch.cuda.is_available()
A return value of True means the GPU is usable.
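Beyond the plain True or False, a few more fields help when comparing against the TensorFlow log above. This is a sketch with a hypothetical helper name, torch_cuda_summary, guarded so it also runs where PyTorch is absent.

```python
import importlib.util

def torch_cuda_summary():
    """Summarise what PyTorch sees, or return None if PyTorch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    if not torch.cuda.is_available():
        return {"available": False}
    return {
        "available": True,
        "device": torch.cuda.get_device_name(0),
        # (major, minor) compute capability of the first GPU
        "capability": torch.cuda.get_device_capability(0),
        # CUDA version this PyTorch build was compiled against
        "built_for_cuda": torch.version.cuda,
    }

if __name__ == "__main__":
    print(torch_cuda_summary())
```

The built_for_cuda field is the one to compare with the cudatoolkit version given in the conda command.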
Troubleshooting
1.
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
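One common cause of this error, and my assumption here, is that the installed PyTorch build contains no compiled kernels for the GPU's compute capability (for example an older build on the 8.6 RTX 3060). The sketch below applies a simplified form of CUDA's binary compatibility rule: code built for sm_XY runs on a device of capability X.Z when Z is at least Y. It ignores PTX (compute_XY) entries, and arch_list_covers is a hypothetical helper.

```python
def arch_list_covers(capability, arch_list):
    """Simplified check: is there an sm_XY entry usable on this device?"""
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX entries such as compute_80
        digits = arch[3:]
        if not digits.isdigit() or len(digits) < 2:
            continue  # skip anything that is not a plain sm_XY name
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False

# An RTX 3060 (8.6) against a build without any sm_8x kernels.
print(arch_list_covers((8, 6), ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]))  # False
print(arch_list_covers((8, 6), ["sm_70", "sm_75", "sm_80", "sm_86"]))  # True
```

With PyTorch installed, the real inputs would come from torch.cuda.get_device_capability(0) and torch.cuda.get_arch_list(). A False result suggests reinstalling PyTorch with a cudatoolkit version that supports the card.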
2. Error during training
RuntimeError:a leaf Variable that requires grad has been used in an in-place
Put with torch.no_grad(): in front of the code that raises the error, so that the in-place operation runs inside that block.
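A tiny reproduction may make the fix easier to place. This is a sketch on a made-up tensor rather than the original training code: an in-place update on a leaf tensor that requires grad raises the error, while the same update inside torch.no_grad() goes through. The guard only keeps the script runnable where PyTorch is not installed.

```python
import importlib.util

def in_place_update_demo():
    """Show the failing in-place update and the no_grad version of it."""
    import torch
    w = torch.ones(3, requires_grad=True)  # leaf tensor tracked by autograd
    try:
        w -= 0.1  # in-place change to a leaf that requires grad
        raised = False
    except RuntimeError:
        raised = True
    with torch.no_grad():
        w -= 0.1  # allowed: autograd does not record this update
    return raised, w

if importlib.util.find_spec("torch") is not None:
    print(in_place_update_demo())
```

This pattern fits manual weight updates; if the in-place operation is part of the forward pass whose gradient is needed, disabling autograd around it would also drop that gradient.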
3. Missing libraries when installing CUDA
Install them:
sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev
4. Fixing Could not load dynamic library 'libcusolver.so.11'
The quick and dirty fix: /usr/local/cuda-11.0 contained a different version of the file, so I simply copied it under the expected name:
sudo cp libcusolver.so.10 libcusolver.so.11