[PyTorch] Ubuntu + Anaconda + CUDA + PyTorch setup guide | Fixing the nvidia-smi error "NVIDIA-SMI has failed"

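Overall reference for the Ubuntu + Anaconda + CUDA + PyTorch setup: https://aitechtogether.com/article/14143.html

Download and install Anaconda

Download the Anaconda .sh installer on your client machine, then upload it to the server over SFTP.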

Start the installation with bash:

bash Anaconda3-2023.03-Linux-x86_64.sh

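Press Enter at each prompt and type yes where the installer asks for confirmation.

Run conda -V to check the installed conda version. If you see "-sh: conda: command not found", conda has not been added to the system PATH. Add it with the following command (replace yourName with your own user name):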

export PATH=/home/yourName/anaconda3/bin/:$PATH

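Running conda -V again should now print the conda version. Note that export only affects the current shell session; to make the change permanent, add the same line to ~/.bashrc.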
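The same check can be scripted. This is a minimal sketch, assuming Python 3 is available on the server; the helper name find_conda and the ~/anaconda3/bin directory are illustrative assumptions, not part of conda itself.

```python
import os
import shutil


def find_conda(extra_dir=None):
    """Return the path of the conda executable, or None if it is not found.

    extra_dir is an optional directory (for example ~/anaconda3/bin) that is
    searched in addition to the directories already on PATH.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        search_path = os.path.expanduser(extra_dir) + os.pathsep + search_path
    return shutil.which("conda", path=search_path)


if __name__ == "__main__":
    # Assumed default install location; adjust to where Anaconda was installed.
    location = find_conda("~/anaconda3/bin")
    if location is None:
        print("conda not found; add the Anaconda bin directory to PATH")
    else:
        print("conda found at", location)
```

If the script reports that conda is missing, the export line above (with the correct install directory) is the fix.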

Create a new environment with conda

First, add mirror channels hosted in China:

conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/r
conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/pro
conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/msys2
conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/mro
conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/free
conda config --add channels https://anaconda.mirrors.sjtug.sjtu.edu.cn/pkgs/main
conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/
conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/

Run conda config --show-sources to list all configured channels.

Create the new environment:

conda create -n envName python=3.8

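Once the download finishes, enter the new environment with source activate envName (on recent conda versions, conda activate envName is the recommended form).

Run conda list to see the packages installed in the environment.

Install PyTorch

First check the CUDA version on your server. When installing PyTorch, pick a build whose CUDA version is no higher than the one your server reports; otherwise you may run into compatibility problems.

Run nvidia-smi to check the CUDA version. The "CUDA Version" field in its header is the highest CUDA version the installed driver supports.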
My CUDA version here is 12.0, so I chose the PyTorch build for CUDA 11.7.
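The selection rule above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the list of builds passed to pick_build is an assumed sample, not an official list of PyTorch releases.

```python
import re


def driver_cuda_version(nvidia_smi_output):
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi output as (X, Y)."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pick_build(driver_version, available_builds):
    """Return the newest build that does not exceed the driver's CUDA version."""
    def as_tuple(version):
        major, minor = version.split(".")
        return int(major), int(minor)

    usable = [b for b in available_builds if as_tuple(b) <= driver_version]
    return max(usable, key=as_tuple) if usable else None


# Header line in the format printed by nvidia-smi on this machine.
header = "| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |"
version = driver_cuda_version(header)
# Assumed example list of CUDA builds; check the PyTorch site for the real ones.
print(version, pick_build(version, ["11.7", "12.1"]))
```

With a driver that reports CUDA 12.0, the 12.1 build is skipped and 11.7 is selected, which matches the choice made here.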

If nvidia-smi prints an error instead, follow the steps in the next section.

nvidia-smi errors

Running nvidia-smi fails with:

(pttest) $ nvidia-smi                                                      
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The NVIDIA driver needs to be installed. Start by adding the graphics-drivers PPA:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update

Next, find the matching driver version.

Install the nvidia-cuda-toolkit package (it provides nvcc, which is used further below):

sudo apt-get install nvidia-cuda-toolkit

Check which graphics drivers the system offers and note the one marked recommended:

(pttest) $ sudo ubuntu-drivers devices
== /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0 ==
modalias : pci:v000010DEd00001F07sv00001043sd0000866Dbc03sc00i00
vendor   : NVIDIA Corporation
model    : TU106 [GeForce RTX 2070 Rev. A]
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-515-open - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-525-open - distro non-free recommended
driver   : nvidia-driver-515-server - distro non-free
driver   : nvidia-driver-515 - distro non-free
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-510 - distro non-free
driver   : nvidia-driver-525 - distro non-free
driver   : nvidia-driver-418-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

The recommended driver is nvidia-driver-525-open.
In my testing nvidia-driver-525-open did not work; nvidia-driver-525 was needed instead.
With nvidia-driver-525-open installed, nvidia-smi ends up reporting "No devices were found":

$ nvidia-smi
No devices were found

So install nvidia-driver-525 here:

sudo apt-get install nvidia-driver-525
sudo reboot
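The driver choice above can be sketched as a small parser for the ubuntu-drivers devices output. This is a hedged illustration: parse_drivers and proprietary_variant are hypothetical helper names, and falling back from the -open package to the plain one reflects only the experience on this machine, not a general rule.

```python
def parse_drivers(ubuntu_drivers_output):
    """Return (all_drivers, recommended) from `ubuntu-drivers devices` output."""
    drivers, recommended = [], None
    for line in ubuntu_drivers_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() != "driver":
            continue
        fields = value.split()
        if not fields:
            continue
        name = fields[0]
        drivers.append(name)
        if "recommended" in fields:
            recommended = name
    return drivers, recommended


def proprietary_variant(name):
    """Map e.g. nvidia-driver-525-open to nvidia-driver-525."""
    return name[: -len("-open")] if name.endswith("-open") else name


# Shortened excerpt of the output shown above.
sample = """\
model    : TU106 [GeForce RTX 2070 Rev. A]
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-525-open - distro non-free recommended
driver   : nvidia-driver-525 - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
"""
drivers, recommended = parse_drivers(sample)
fallback = proprietary_variant(recommended)
status = "(available)" if fallback in drivers else "(not listed)"
print(recommended, "->", fallback, status)
```

The package name it prints is the one to pass to apt-get install if the -open variant leaves nvidia-smi reporting "No devices were found".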

Use nvcc -V to check the CUDA toolkit:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

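nvcc responds, so the CUDA toolkit is installed (release 11.5 here). Note that nvcc reports the toolkit version, not the version of the kernel driver.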

Check the version of the installed driver:

$ ls /usr/src | grep nvidia
nvidia-525.105.17

Run the following commands in order:

sudo apt-get install dkms
sudo dkms install -m nvidia -v 525.105.17

sudo dkms install -m nvidia -v 525.105.17 fails with the following error:

$ sudo dkms install -m nvidia -v 525.105.17
Error! Your kernel headers for kernel 5.15.0-69-generic cannot be found.
Please install the linux-headers-5.15.0-69-generic package or use the --kernelsourcedir option to tell DKMS where it's located.

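This error means the header files matching the currently running kernel are missing.
DKMS needs those kernel headers to build the NVIDIA driver module.
Run the following commands to install them: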

sudo apt-get update
sudo apt-get install linux-headers-$(uname -r)

Here $(uname -r) expands to the version of the currently running kernel.
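To see which package name that expansion produces without installing anything, a minimal Python sketch follows. The helper name headers_package is an assumption for illustration; platform.release() is the standard-library call that reports the running kernel release on Linux.

```python
import platform


def headers_package(kernel_release=None):
    """Return the Ubuntu kernel headers package name for a kernel release."""
    if kernel_release is None:
        # platform.release() reports the running kernel, like `uname -r` on Linux.
        kernel_release = platform.release()
    return "linux-headers-" + kernel_release


# Kernel release taken from the DKMS error message above.
print(headers_package("5.15.0-69-generic"))
# Package name for whatever kernel this machine is running right now.
print(headers_package())
```

The first line printed is the package that the DKMS error message asked for.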

Once the headers are installed, re-run sudo dkms install -m nvidia -v 525.105.17, then reboot.

$ sudo dkms install -m nvidia -v 525.105.17
Module nvidia/525.105.17 already installed on kernel 5.15.0-69-generic (x86_64).
$ sudo reboot

Run nvidia-smi again:

$ nvidia-smi
Fri Apr 14 20:21:29 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 20%   47C    P8    23W / 185W |      1MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


nvidia-smi now displays normally.
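As a last check, it can help to confirm that PyTorch itself sees the GPU. The sketch below assumes PyTorch has been installed in the activated conda environment; gpu_summary is an illustrative helper name, and the function simply reports what it finds rather than failing when PyTorch or CUDA is missing.

```python
def gpu_summary():
    """Describe whether PyTorch is installed and whether it can use CUDA."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed, but CUDA is not available"
    # torch.version.cuda is the CUDA version PyTorch was built against.
    return "CUDA {} on {}".format(torch.version.cuda, torch.cuda.get_device_name(0))


print(gpu_summary())
```

If this reports that CUDA is not available while nvidia-smi works, the likely cause is a CPU-only PyTorch build or a build for a CUDA version newer than the driver supports.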
