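Official site: https://developer.nvidia.com
CUDA (Compute Unified Device Architecture) is the general-purpose parallel computing architecture introduced by the GPU vendor NVIDIA. It is a parallel computing platform and programming model that lets the GPU tackle complex computational problems.
CUDA consists of three parts: the CUDA Toolkit, the CUDA driver, and the NVIDIA GPU driver.
On Linux, the CUDA driver and the NVIDIA GPU device driver are shipped together as the NVIDIA driver.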
CUDA Driver & NVIDIA Driver
CUDA itself includes the CUDA driver and the GPU kernel-mode driver; on Linux both are part of the NVIDIA driver.
So once the NVIDIA driver is installed, installing the CUDA Toolkit is all that is needed to run CUDA programs.
Official description: https://developer.nvidia.com/cudnn
The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.
cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
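cuDNN is a software library designed for deep learning workloads; it provides many specialized routines such as convolution.
NVIDIA driver and CUDA Toolkit version compatibility: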
https://docs.nvidia.cn/cuda/cuda-toolkit-release-notes/index.html
$ nvidia-smi
If the command prints GPU information, an old driver is already installed.
sudo apt autoremove
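sudo /usr/bin/nvidia-uninstall # may not exist, depending on the driver version
sudo apt-get --purge remove "*nvidia*"
# Remove the old driver packages (patterns are quoted so the shell does not expand them)
sudo apt-get purge "nvidia-cuda*"
sudo apt-get purge "nvidia*"
sudo apt-get purge "libnvidia*"
Reboot after uninstalling.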
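Nouveau is an open-source 3D driver for NVIDIA GPUs developed by a third party; it is not endorsed or supported by NVIDIA.
Although Nouveau cannot match NVIDIA's official proprietary driver, it does let Linux cope with a wide range of NVIDIA hardware, so users can reach the desktop with decent display quality right after installing the system. That is why many Linux distributions ship Nouveau by default.
Check whether nouveau is running: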
lsmod | grep nouveau
Any output means nouveau is loaded; no output means it is disabled.
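The same check can be scripted. Below is a minimal sketch, assuming the usual lsmod layout in which the first column is the module name; the function name and the sample text are hypothetical and only for illustration.

```python
def nouveau_loaded(lsmod_output: str) -> bool:
    """Return True if a module named exactly 'nouveau' is listed."""
    for line in lsmod_output.splitlines():
        fields = line.split()
        # The first column of lsmod output is the module name.
        if fields and fields[0] == "nouveau":
            return True
    return False


# Hypothetical sample resembling lsmod output with nouveau loaded.
sample = """Module                  Size  Used by
nouveau              1503232  1
ttm                    98304  1 nouveau
"""
print(nouveau_loaded(sample))
```

Matching on the first column rather than on the whole line avoids counting modules that merely list nouveau in their "Used by" column. On a real machine the text could come from subprocess.run(["lsmod"], capture_output=True, text=True).stdout.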
Disable the bundled nouveau driver for NVIDIA GPUs:
sudo vim /usr/lib/modprobe.d/blacklist-nouveau.conf
# sudo vim /etc/modprobe.d/blacklist-nouveau.conf
Add the following content:
blacklist nouveau
options nouveau modeset=0
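# Rebuild the initramfs so the blacklist takes effect
sudo update-initramfs -u # Debian/Ubuntu
sudo dracut -f # dracut-based distributions such as CentOS or Fedora
sudo systemctl set-default multi-user.target # boot to the text console instead of the desktop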
sudo reboot
# A reboot is required after this change. Afterwards, confirm that Nouveau is really gone:
lsmod | grep nouveau
Stop the X server (use the command that matches your display manager):
sudo service lightdm stop
sudo service gdm stop
sudo service kdm stop # this is the one that worked for me as I use kdm and Linux Mint
Install gcc and g++:
$ sudo apt update
# $ sudo apt install gcc-9 g++-9
$ sudo apt install gcc g++
# Check the versions
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# Get Ubuntu version information
$ uname -a
Linux ubuntu 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04 # release number
Codename: xenial
# Show GPU information
$ lspci | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
08:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
08:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
...
$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001B06sv00001462sd00003609bc03sc00i00
vendor : NVIDIA Corporation
model : GP102 [GeForce GTX 1080 Ti]
driver : nvidia-driver-390 - distro non-free
driver : nvidia-driver-510 - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-510-server - distro non-free recommended
driver : nvidia-driver-470 - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
The entry marked recommended is the version suggested for this GPU. To install the recommended version:
sudo ubuntu-drivers autoinstall
To install a specific version instead:
sudo apt install nvidia-driver-510-server
Reboot after the installation; otherwise nvidia-smi may fail with:
$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
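The recommended entry in the ubuntu-drivers devices listing shown earlier can also be picked out by a script. This is a rough sketch, assuming driver lines keep the "driver : <package> - ... recommended" shape seen above; the function name is made up for illustration.

```python
from typing import Optional


def recommended_driver(devices_output: str) -> Optional[str]:
    """Return the package name on the driver line tagged 'recommended'."""
    for line in devices_output.splitlines():
        fields = line.split()
        # Expected shape: driver : <package> - distro non-free recommended
        if len(fields) >= 3 and fields[0] == "driver" and fields[-1] == "recommended":
            return fields[2]
    return None


# A few lines taken from the listing in this article.
sample = """\
driver : nvidia-driver-390 - distro non-free
driver : nvidia-driver-510-server - distro non-free recommended
driver : xserver-xorg-video-nouveau - distro free builtin
"""
print(recommended_driver(sample))
```

The returned name is what would be passed to sudo apt install when you prefer an explicit install over ubuntu-drivers autoinstall.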
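If the query above shows a model name that you can find on the download page, skip this step.
If it only shows a device ID such as 1b01 and no matching model, the following references can help identify the model:
https://blog.csdn.net/zhuguiqian/article/details/104795435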
http://pci-ids.ucw.cz/mods/PC/10de?action=help?help=pci
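To get the ID to look up, the [vendor:device] pair can be extracted from lspci -nn style output. The sketch below assumes that bracketed format and that 10de is NVIDIA's vendor ID, which matches the [10de:1b06] entry in the lspci -vnn output later in this article; the helper name is hypothetical.

```python
import re
from typing import List, Tuple

# Matches a bracketed vendor:device pair such as [10de:1b06].
PCI_ID = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]", re.IGNORECASE)


def nvidia_device_ids(lspci_nn_output: str) -> List[Tuple[str, str]]:
    """Collect (vendor, device) pairs for lines whose vendor ID is 10de."""
    ids = []
    for line in lspci_nn_output.splitlines():
        match = PCI_ID.search(line)
        if match and match.group(1).lower() == "10de":
            ids.append((match.group(1).lower(), match.group(2).lower()))
    return ids


sample = """\
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b06] (rev a1)
        Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10ef] (rev a1)
"""
print(nvidia_device_ids(sample))
```

The second element of each pair (for example 1b06) is the device ID to search for in the PCI ID database.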
Pick the right driver for your system and GPU model:
https://www.nvidia.com/Download/index.aspx?lang=en-us
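The download is an installer file such as NVIDIA-Linux-x86_64-470.94.run; substitute your own file name in the commands that follow.
GeForce drivers: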
https://www.nvidia.cn/geforce/drivers/
sudo chmod +x NVIDIA-Linux-x86_64-525.60.11.run
sudo sh ./NVIDIA-Linux-x86_64-525.60.11.run --no-x-check --no-opengl-files
Install NVIDIA’s 32-bit compatibility libraries?
Choose Yes.
1. ERROR: An NVIDIA kernel module ‘nvidia-uvm’ appears to already be loaded in your kernel.
ERROR: An NVIDIA kernel module ‘nvidia-uvm’ appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a
CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.
Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running,
you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVIDIA kernel module’s usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
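Solution:
Check whether the previous NVIDIA driver was removed completely (see the uninstall steps at the top of this article). Remove any old version, then reboot the machine.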
The distribution-provided pre-install script failed! Are you sure you want to continue?
This is caused by an issue in the NVIDIA driver installer package itself; you can simply choose Yes or Continue to proceed with the installation.
$ nvidia-smi
Fri Dec 2 13:55:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11 Driver Version: 525.60.11 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:05:00.0 Off | N/A |
Note that Driver Version: 525.60.11 is the version of the driver that was just installed.
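That comparison can be automated. The following is a small sketch, assuming the installer file is named NVIDIA-Linux-x86_64-<version>.run and that nvidia-smi prints a "Driver Version:" field as in the output above; both helper names are hypothetical.

```python
import re


def installer_version(run_file: str) -> str:
    """Extract the driver version embedded in the .run installer name."""
    match = re.search(r"NVIDIA-Linux-x86_64-([\d.]+)\.run$", run_file)
    if match is None:
        raise ValueError(f"unexpected installer name: {run_file}")
    return match.group(1)


def smi_driver_version(smi_output: str) -> str:
    """Extract the Driver Version field from nvidia-smi output."""
    match = re.search(r"Driver Version:\s*([\d.]+)", smi_output)
    if match is None:
        raise ValueError("Driver Version field not found")
    return match.group(1)


header = "| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |"
expected = installer_version("NVIDIA-Linux-x86_64-525.60.11.run")
print(expected == smi_driver_version(header))
```

If the two versions differ after a reboot, an older driver is probably still installed alongside the new one.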
CUDA Toolkit Archive download page:
https://developer.nvidia.com/cuda-toolkit-archive
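The PyTorch release I am going to install supports CUDA 10.2 and 11.3 (https://pytorch.org/get-started/locally/).
CUDA 10.2 does not support Ubuntu 20.*, so I opened the 11.3 page and chose the matching options step by step, which produced the download script below: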
https://developer.nvidia.com/cuda-11.3.0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
With the runfile installer, I got this script instead:
wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda_11.3.1_465.19.01_linux.run
sudo sh cuda_11.3.1_465.19.01_linux.run
Choices during the run:
accept the above EULA? Enter accept.
Uncheck Driver, then select Install.
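Unchecking the driver makes sense when the driver already installed is recent enough. As a rough sketch, assuming the runfile name follows the cuda_<toolkit>_<driver>_linux.run pattern seen above (the second number being the driver bundled with that toolkit), the two versions can be compared numerically. The helper names are hypothetical, and treating "installed is at least as new as bundled" as sufficient is my assumption, not an official rule.

```python
import re
from typing import Tuple

RUNFILE = re.compile(r"^cuda_(\d+(?:\.\d+)*)_(\d+(?:\.\d+)*)_linux\.run$")


def split_runfile_name(name: str) -> Tuple[str, str]:
    """Return (toolkit_version, bundled_driver_version) from a runfile name."""
    match = RUNFILE.match(name)
    if match is None:
        raise ValueError(f"unexpected runfile name: {name}")
    return match.group(1), match.group(2)


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '465.19.01' into (465, 19, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


toolkit, bundled_driver = split_runfile_name("cuda_11.3.1_465.19.01_linux.run")
installed_driver = "525.60.11"  # the driver installed earlier in this article
print(toolkit, bundled_driver)
print(version_tuple(installed_driver) >= version_tuple(bundled_driver))
```

Comparing integer tuples rather than strings avoids ordering mistakes such as "9.0" sorting after "10.2".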
Edit ~/.bashrc, then run source ~/.bashrc to apply the changes:
export CUDA_LIB_PATH=/usr/local/cuda-11.3/lib64
export CUDA_BIN_PATH=/usr/local/cuda-11.3/bin
export CUDA_HOME=/usr/local/cuda-11.3
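When several toolkits are installed, these three variables are easy to leave pointing at different versions. The sketch below is a hypothetical consistency check that takes the environment as a plain dictionary; it assumes the bin and lib64 layout used in the exports above.

```python
import posixpath
from typing import Dict, List


def check_cuda_env(env: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the variables agree."""
    home = env.get("CUDA_HOME")
    if not home:
        return ["CUDA_HOME is not set"]
    expected = {
        "CUDA_BIN_PATH": posixpath.join(home, "bin"),
        "CUDA_LIB_PATH": posixpath.join(home, "lib64"),
    }
    problems = []
    for name, want in expected.items():
        got = env.get(name)
        if got != want:
            problems.append(f"{name} is {got!r}, expected {want!r}")
    return problems


env = {
    "CUDA_LIB_PATH": "/usr/local/cuda-11.3/lib64",
    "CUDA_BIN_PATH": "/usr/local/cuda-11.3/bin",
    "CUDA_HOME": "/usr/local/cuda-11.3",
}
print(check_cuda_env(env))
```

In a live shell the same function can be called as check_cuda_env(dict(os.environ)) after importing os.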
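Switching CUDA versions essentially means managing the cuda symlink:
sudo rm -f /usr/local/cuda # remove the symlink to the old version
# Create the symlink to the new version; the first path is the install path of the CUDA version you want.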
sudo ln -s /usr/local/cuda-11.3 /usr/local/cuda
sudo ln -s /usr/local/cuda-11.3/bin/nvcc /usr/bin/nvcc
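The remove-then-link sequence leaves a short window with no cuda link at all. One alternative, sketched here under the assumption of a POSIX file system, is to create the new link under a temporary name and rename it over the old one. The demonstration runs in a throwaway directory rather than /usr/local, and switch_symlink is a hypothetical helper.

```python
import os
import tempfile


def switch_symlink(link_path: str, target: str) -> None:
    """Point link_path at target, replacing an existing symlink in one step."""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # Renaming the temporary link over the old one swaps the link itself.
    os.replace(tmp_link, link_path)


# Demonstration in a temporary directory standing in for /usr/local.
root = tempfile.mkdtemp()
for version in ("cuda-10.2", "cuda-11.3"):
    os.mkdir(os.path.join(root, version))

link = os.path.join(root, "cuda")
switch_symlink(link, os.path.join(root, "cuda-10.2"))
switch_symlink(link, os.path.join(root, "cuda-11.3"))
print(os.path.basename(os.readlink(link)))
```

Running this against the real /usr/local/cuda would need root privileges.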
cuDNN download page: https://developer.nvidia.com/rdp/cudnn-archive
# Extract the archive
tar -xvf cudnn-10.2-linux-x64-v8.2.1.32.tgz
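Extracting the archive produces a cuda directory.
Configuration:
sudo cp cuda/include/cudnn*.h /usr/local/cuda-10.2/include # cuDNN 8 ships several headers; use the path of your CUDA version
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-10.2/lib64 # -P keeps the library symlinks; use the path of your CUDA version
sudo chmod a+r /usr/local/cuda-10.2/include/cudnn*.h /usr/local/cuda-10.2/lib64/libcudnn*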
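After copying, one way to confirm which cuDNN version is in place is to read the version macros from the header. This sketch assumes the CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL defines, which as far as I know live in cudnn_version.h for cuDNN 8 and in cudnn.h for older releases; the sample header text is made up to mirror v8.2.1.

```python
import re
from typing import Tuple


def cudnn_version(header_text: str) -> Tuple[int, int, int]:
    """Parse (major, minor, patch) from the cuDNN version macros."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{name}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"{name} not found in header")
        parts.append(int(match.group(1)))
    return parts[0], parts[1], parts[2]


# Hypothetical header excerpt for illustration.
sample_header = """\
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 1
"""
print(cudnn_version(sample_header))
```

On a real system the text would be read from the header under the include directory of the CUDA version you copied into.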
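This section shows how to install gcc with apt-get and then change the symlink;
you can also download an installation package and install from that.
Check the gcc and g++ versions: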
gcc --version
g++ --version
You may see output like this:
$ gcc --version
gcc (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ g++ --version
g++ (Ubuntu 6.5.0-2ubuntu1~16.04) 6.5.0 20181026
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
List the gcc versions installed on this machine:
sudo dpkg -l | grep gcc
You may get results like these:
ii gcc 4:5.3.1-1ubuntu1 amd64 GNU C compiler
ii gcc-5 5.4.0-6ubuntu1~16.04.11 amd64 GNU C compiler
ii gcc-5-base:amd64 5.4.0-6ubuntu1~16.04.11 amd64 GCC, the GNU Compiler Collection (base package)
ii gcc-6 6.5.0-2ubuntu1~16.04 amd64 GNU C compiler
ii gcc-6-base:amd64 6.5.0-2ubuntu1~16.04 amd64 GCC, the GNU Compiler Collection (base package)
...
ii gcc-9-base:amd64 9.4.0-1ubuntu1~16.04 amd64 GCC, the GNU Compiler Collection (base package)
...
ii gir1.2-packagekitglib-1.0 0.8.17-4ubuntu6~gcc5.4ubuntu1.4 amd64 GObject introspection data for the PackageKit GLib library
ii libgcc-5-dev:amd64 5.4.0-6ubuntu1~16.04.11 amd64 GCC support library (development files)
ii libgcc-6-dev:amd64 6.5.0-2ubuntu1~16.04 amd64 GCC support library (development files)
ii libgcc1:amd64 1:9.4.0-1ubuntu1~16.04 amd64 GCC support library
If the gcc version you need is already installed, there is no need to install it again; just change the /usr/bin/gcc symlink.
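The dpkg listing above mixes compilers with base and support packages, so it can help to filter it down to the versioned compiler packages. This is a minimal sketch, assuming the usual dpkg -l column order (status, name, version, ...); the function name is hypothetical.

```python
import re
from typing import Dict


def installed_gcc_versions(dpkg_output: str) -> Dict[str, str]:
    """Map versioned compiler packages (gcc-5, gcc-6, ...) to their versions."""
    found = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Keep only installed ("ii") packages named exactly gcc-<number>.
        if len(fields) >= 3 and fields[0] == "ii" and re.fullmatch(r"gcc-\d+", fields[1]):
            found[fields[1]] = fields[2]
    return found


# Lines taken from the listing in this article.
sample = """\
ii gcc 4:5.3.1-1ubuntu1 amd64 GNU C compiler
ii gcc-5 5.4.0-6ubuntu1~16.04.11 amd64 GNU C compiler
ii gcc-5-base:amd64 5.4.0-6ubuntu1~16.04.11 amd64 GCC, the GNU Compiler Collection (base package)
ii gcc-6 6.5.0-2ubuntu1~16.04 amd64 GNU C compiler
ii gcc-9-base:amd64 9.4.0-1ubuntu1~16.04 amd64 GCC, the GNU Compiler Collection (base package)
"""
print(installed_gcc_versions(sample))
```

Note that gcc-9-base alone does not mean the gcc-9 compiler is installed, which is why only exact gcc-<number> names are kept.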
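Otherwise, search the package index for available versions and install the one you need:
sudo apt-cache search gcc
This returns many results: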
...
cpp-5 - GNU C preprocessor
cpp-5-aarch64-linux-gnu - GNU C preprocessor
...
gcc - GNU C compiler
gcc-5 - GNU C compiler
gcc-5-aarch64-linux-gnu - GNU C compiler
gcc-5-aarch64-linux-gnu-base - GCC, the GNU Com
sudo apt-get install gcc-6 g++-6
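After the installation succeeds, you can run sudo dpkg -l | grep gcc to list the installed versions.
At this point gcc --version may still report the old version, in which case the default gcc (the symlink) has to be changed.
Inspect the symlink: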
ls -l /usr/bin/gcc
The output may look like this, meaning gcc currently points to gcc-5:
lrwxrwxrwx 1 root root 5 Feb 11 2016 /usr/bin/gcc -> gcc-5
Repoint the gcc link to gcc-6:
cd /usr/bin
sudo rm gcc
sudo ln -s gcc-6 gcc
sudo rm g++
sudo ln -s g++-6 g++
Running gcc --version again now shows the selected version.
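This problem tends to show up when using apt-get.
Since I do not use redis at the moment, I simply removed its entry, as follows:
1. Inspect the overrides by running
dpkg-statoverride --list
The output contains a redis entry.
2. Edit the /var/lib/dpkg/statoverride file.
Here I open it with vim:
sudo vim /var/lib/dpkg/statoverride
The last line is redis redis 640 /etc/redis/redis.conf; delete that line and save the file.
After that, apt-get no longer reports this error.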
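The manual edit amounts to dropping every override entry owned by one user. A minimal sketch of that filter, assuming each line has the form <user> <group> <mode> <path> as in the redis line above; the function name and the first sample line are hypothetical. It only transforms text and does not touch the real file.

```python
def drop_overrides_for_user(statoverride_text: str, user: str) -> str:
    """Return the file content without entries whose owner is `user`."""
    kept = []
    for line in statoverride_text.splitlines():
        fields = line.split()
        # Each entry is: <user> <group> <mode> <path>
        if fields and fields[0] == user:
            continue
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# The first line is a made-up entry; the second is the one from this article.
sample = (
    "root crontab 2755 /usr/bin/crontab\n"
    "redis redis 640 /etc/redis/redis.conf\n"
)
print(drop_overrides_for_user(sample, "redis"), end="")
```

Back up /var/lib/dpkg/statoverride before writing any filtered content over it, since dpkg reads this file on every run.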
NVIDIA® CUDA Toolkit 11.6 no longer supports development or running applications on macOS.
While there are no tools which use macOS as a target environment, NVIDIA is making macOS host versions of these tools that you can launch profiling and debugging sessions on supported target platforms.
CUDA driver update to support CUDA Toolkit 10.1 Update 1 and macOS 10.13.6
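CUDA no longer supports macOS, although the debugging and profiling tools can still be installed on macOS.
The CUDA driver supports at most macOS 10.13.6 and CUDA Toolkit 10.1 (released in May 2019).
CUDA cannot be installed on macOS 10.14, 10.15, or later, and CUDA 10.2 or later cannot be installed on macOS.
See the official site for details: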
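The two commands can report different CUDA versions.
The version reported by nvcc -V is the CUDA runtime version, that is, the installed toolkit.
The CUDA version shown by nvidia-smi is the highest CUDA version the current driver supports.
So I believe (unverified) that if the CUDA version shown by nvidia-smi is too low, the NVIDIA driver needs to be upgraded.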
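The relationship between the two numbers can be checked mechanically: the toolkit release from nvcc -V should not exceed the CUDA version from nvidia-smi. Below is a rough sketch, assuming the "release X.Y" and "CUDA Version: X.Y" strings seen in the outputs in this article; the helper names are hypothetical.

```python
import re
from typing import Tuple


def nvcc_release(nvcc_output: str) -> Tuple[int, int]:
    """Parse the toolkit release, e.g. 'release 7.5' gives (7, 5)."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("release not found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def smi_cuda_version(smi_output: str) -> Tuple[int, int]:
    """Parse the highest CUDA version the driver supports."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        raise ValueError("CUDA Version not found in nvidia-smi output")
    return int(match.group(1)), int(match.group(2))


def toolkit_supported(nvcc_output: str, smi_output: str) -> bool:
    """True if the installed toolkit is not newer than the driver allows."""
    return nvcc_release(nvcc_output) <= smi_cuda_version(smi_output)


nvcc_text = "Cuda compilation tools, release 7.5, V7.5.17"
smi_text = "| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |"
print(toolkit_supported(nvcc_text, smi_text))
```

A False result would point to the case described above, where the driver is too old for the installed toolkit.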
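This error is reported when the CUDA version and the NVIDIA driver do not match.
To check the GPU driver version, look at the Driver Version field in the nvidia-smi output.
NVIDIA driver and CUDA Toolkit version compatibility: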
https://docs.nvidia.cn/cuda/cuda-toolkit-release-notes/index.html
If the CUDA version is old and the PyTorch version is new, running PyTorch may fail with errors like these:
No module named ‘packaging’
AttributeError: module ‘logging’ has no attribute ‘getLogger’
Different CUDA versions call for different PyTorch versions; see:
https://pytorch.org/get-started/previous-versions/
If PyTorch was installed against an older CUDA version, installing apex may fail with:
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 10.2.
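Before building extensions, it can be worth comparing the CUDA version PyTorch was built with (the value of torch.version.cuda, shown later in this article as '10.2') against the local nvcc release. This sketch assumes the check is a simple major.minor comparison, which may not match exactly what apex does; the function name and both nvcc sample strings are hypothetical.

```python
import re


def build_matches_toolkit(torch_cuda: str, nvcc_output: str) -> bool:
    """Compare PyTorch's build CUDA version with the nvcc release (major.minor)."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    if match is None:
        raise ValueError("release not found in nvcc output")
    return match.group(1) == torch_cuda


# Hypothetical nvcc -V excerpts for two different local toolkits.
matching = "Cuda compilation tools, release 10.2, V10.2.89"
mismatched = "Cuda compilation tools, release 11.3, V11.3.109"
print(build_matches_toolkit("10.2", matching))
print(build_matches_toolkit("10.2", mismatched))
```

When the versions differ, either install the PyTorch build that matches the local toolkit or switch the toolkit that nvcc resolves to.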
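To help troubleshoot problems during installation, the information you may need to look up and the commands for doing so are collected here.
# Check the OS version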
$ lsb_release -a
LSB Version: core-9.20160110ubuntu0.2-amd64:core-9.20160110ubuntu0.2-noarch:security-9.20160110ubuntu0.2-amd64:security-9.20160110ubuntu0.2-noarch
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"
# Show the Ubuntu kernel and architecture information
$ uname -a
Linux ubuntu-101 4.4.0-210-generic #242-Ubuntu SMP Fri Apr 16 09:57:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
# Show driver information for all devices (including the GPU and its drivers)
$ ubuntu-drivers devices
# Show the GPU model / NVIDIA GPU information
$ lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
$ nvcc -V # requires nvidia-cuda-toolkit
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
# Check the CUDA version (older releases)
$ cat /usr/local/cuda/version.txt
CUDA Version 10.1.105
# Show the version of the NVIDIA kernel module used by the driver
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 418.39 Sat Feb 9 19:19:37 CST 2019
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)
$ lshw -c video
# $ lshw -C display
WARNING: you should run this program as super-user.
*-display
description: VGA compatible controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:01:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:16 memory:ee000000-eeffffff memory:d0000000-dfffffff memory:e0000000-e1ffffff ioport:e000(size=128) memory:ef000000-ef07ffff
WARNING: output may be incomplete or inaccurate, you should run this program as super-user.
# lspci | grep -i nvidia lists all NVIDIA devices; the command below shows more detail.
$ lspci -vnn | grep VGA -A 12
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1b06] (rev a1) (prog-if 00 [VGA controller])
Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3609]
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at ee000000 (32-bit, non-prefetchable) [size=16M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at e0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at ef000000 [disabled] [size=512K]
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_418_drm, nvidia_418
01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:10ef] (rev a1)
# Check hardware acceleration, i.e. whether hardware-based 3D acceleration is enabled.
$ glxinfo | grep OpenGL
The program 'glxinfo' is currently not installed. You can install it by typing:
sudo apt install mesa-utils
# View the system driver log
$ cat /var/log/dpkg.log | grep nvidia
2022-01-06 06:02:02 upgrade nvidia-driver-470:amd64 470.86-0ubuntu0.20.04.1 470.86-0ubuntu0.20.04.2
2022-01-06 06:02:02 status half-configured nvidia-driver-470:amd64 470.86-0ubuntu0.20.04.1
# List the installed driver packages
$ sudo dpkg --list | grep nvidia-*
ii libnvidia-common-470 470.86-0ubuntu0.20.04.2 all Shared files used by the NVIDIA libraries
ii nvidia-compute-utils-470 470.86-0ubuntu0.20.04.2 amd64 NVIDIA compute utilities
ii nvidia-driver-470 470.86-0ubuntu0.20.04.2 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-470 470.86-0ubuntu0.20.04.2 amd64 Shared files used with the kernel module
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
# Monitor GPU status continuously
$ watch -n 1 nvidia-smi
$ nvidia-smi
Sat May 28 18:13:47 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 32% 44C P8 12W / 250W | 336MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1051 G /usr/lib/xorg/Xorg 24MiB |
| 0 N/A N/A 1154 G /usr/bin/gnome-shell 49MiB |
| 0 N/A N/A 1838 G /usr/lib/xorg/Xorg 174MiB |
| 0 N/A N/A 1988 G /usr/bin/gnome-shell 83MiB |
+-----------------------------------------------------------------------------+
Activate the relevant environment:
$ python3
Python 3.8.8 (default, Apr 13 2021, 19:58:26)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available() # check whether CUDA is available
True # CUDA is available
>>> print(torch.__version__) # check the torch version
1.9.0+cu102
# Number of available CUDA devices
>>> print(torch.cuda.device_count())
1
>>> torch.version.cuda
'10.2'
# Index of the GPU currently in use:
>>> device = torch.cuda.current_device()
>>> device
0
# Compute capability and name of the given GPU:
>>> torch.cuda.get_device_capability(device)
(6, 1)
>>> torch.cuda.get_device_name(device)
'NVIDIA GeForce GTX 1080 Ti'
# Release the cached GPU memory held by the program:
>>> torch.cuda.empty_cache()
伊织, 2022-12-02 (Fri)