Installing fabricmanager to fix the NumCudaDevices() error from print(torch.cuda.is_available())

Installing fabricmanager

Problem: print(torch.cuda.is_available()) reports an error even though CUDA and cuDNN are both installed and their versions match. The error is:

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
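
To tell this failure apart from other CUDA initialization problems, the check can be wrapped so that the warning text is captured and inspected. The following is a minimal sketch, not part of the original post: the helper names is_error_802 and check_cuda are hypothetical, and it assumes PyTorch surfaces the message through Python's warnings module, as the UserWarning prefix above suggests.

```python
import warnings


def is_error_802(message: str) -> bool:
    """Return True if a warning message looks like the Error 802 case shown above."""
    return "Error 802" in message and "system not yet initialized" in message


def check_cuda() -> bool:
    """Call torch.cuda.is_available() and point at fabricmanager if Error 802 appears.

    Assumption: the CUDA initialization message is emitted as a Python warning.
    """
    import torch  # imported lazily so is_error_802 can be used without PyTorch

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is not filtered out
        available = torch.cuda.is_available()

    for w in caught:
        if is_error_802(str(w.message)):
            print("CUDA initialization failed with Error 802; check nvidia-fabricmanager")
    return available
```

Calling print(check_cuda()) on the affected machine should print False, with the hint shown only if the captured warning matches Error 802.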

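Explanation: NVIDIA A100 GPUs that are connected through NVLink and NVSwitch need the nvidia-fabricmanager service in addition to the GPU driver, in the version that matches the driver, so that the GPUs can be interconnected through NVSwitch. If only the NVIDIA GPU driver is installed, the GPUs cannot be used normally. The installation steps are as follows: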

Download the fabricmanager package that matches your driver version from: Index of /compute/cuda/repos/ubuntu2204/x86_64 (nvidia.cn)
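
Because the package version has to match the driver version, it can help to compare the two before installing. The sketch below is an illustration only: the helper names package_version, driver_version, and versions_match are hypothetical, and the filename pattern is assumed from the single package name used in this post. The driver version is read with nvidia-smi's --query-gpu=driver_version option.

```python
import re
import subprocess
from typing import Optional

# Assumed naming scheme, inferred from nvidia-fabricmanager-535_535.104.05-1_amd64.deb
PKG_PATTERN = re.compile(
    r"^nvidia-fabricmanager-(\d+)_(\d+(?:\.\d+)+)-\d+_amd64\.deb$"
)


def package_version(filename: str) -> Optional[str]:
    """Extract the full version (e.g. '535.104.05') from a package filename."""
    match = PKG_PATTERN.match(filename)
    if match is None:
        return None
    branch, version = match.groups()
    # The branch in the package name should equal the major part of the version.
    if version.split(".")[0] != branch:
        return None
    return version


def driver_version() -> str:
    """Ask nvidia-smi for the installed driver version (first GPU's line)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0].strip()


def versions_match(filename: str, driver: str) -> bool:
    """True if the package filename carries exactly the given driver version."""
    return package_version(filename) == driver
```

On the target machine, versions_match("nvidia-fabricmanager-535_535.104.05-1_amd64.deb", driver_version()) indicates whether the downloaded file corresponds to the installed driver.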

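# If an older version is present, remove it before downloading again

# Install manually
sudo apt-get install ./nvidia-fabricmanager-535_535.104.05-1_amd64.deb
# Enable the service
sudo systemctl enable nvidia-fabricmanager
# Restart the service
sudo systemctl restart nvidia-fabricmanager
# Check the status
sudo systemctl status nvidia-fabricmanager
# Installation succeeded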
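
The status check can also be scripted, for example before launching a training job. This is a rough sketch, not something from the original post: service_is_active and its run parameter are hypothetical names, and it relies on systemctl is-active printing "active" for a running unit.

```python
import subprocess
from typing import Callable, List


def _run(cmd: List[str]) -> str:
    """Run a command and return its stripped standard output."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout.strip()


def service_is_active(
    name: str = "nvidia-fabricmanager",
    run: Callable[[List[str]], str] = _run,
) -> bool:
    """Return True if `systemctl is-active <name>` reports 'active'.

    The `run` parameter exists only so the check can be exercised
    without systemd, e.g. in a test.
    """
    return run(["systemctl", "is-active", name]) == "active"
```

If service_is_active() returns True, rerunning print(torch.cuda.is_available()) is the final confirmation that the GPUs are usable.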
