Installed CUDA version 12.1 does not match the version torch was compiled with 11.7

Running the LMFlow script ./scripts/run_finetune.sh fails with an error.

The root cause is that the CUDA version installed on this machine does not match the CUDA version torch was compiled with.

Error output:

Exception: >- DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.7, unable to compile cuda/cpp extensions without a matching cuda version.
Exception ignored in: 
Traceback (most recent call last):
  File "/home/gaosong/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 110, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: 
Traceback (most recent call last):
  File "/home/gaosong/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 110, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
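The first exception comes from a version check in DeepSpeed's op builder, which runs before it JIT-compiles the CPU Adam extension. The AttributeError lines are only a side effect: the extension never loaded, so `__del__` runs on an object that never got its `ds_opt_adam` attribute. Below is a simplified sketch of that comparison, assuming (as a simplification) that only the major.minor parts of the two versions are compared; the real DeepSpeed code has extra rules that vary between releases, and the helper names here are hypothetical.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Reduce a version string such as '12.1' or '11.7.1' to (major, minor)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def cuda_versions_match(installed: str, torch_cuda: str) -> bool:
    """Simplified check: the installed toolkit and torch's build must agree on major.minor."""
    return major_minor(installed) == major_minor(torch_cuda)


if __name__ == "__main__":
    # Values taken from the error message: installed 12.1, torch built with 11.7.
    print(cuda_versions_match("12.1", "11.7"))  # False, so the op cannot be built
```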

Approach 1: find out which CUDA version each torch release is built against, then upgrade torch.

The project currently depends on: torch==2.0.0
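Pip wheels of torch usually carry the CUDA build in the local version suffix of `torch.__version__`, for example `2.0.0+cu117`, which is a quick way to see which CUDA release a given torch build targets. A small helper, assuming the `+cuXYZ` naming used by PyTorch wheels; the function name is hypothetical, and CPU-only or ROCm builds simply return None.

```python
import re
from typing import Optional

# Matches the local version label used by PyTorch wheels, e.g. "+cu117" or "+cu121".
_CU_TAG = re.compile(r"\+cu(\d{2,3})$")


def cuda_from_torch_version(torch_version: str) -> Optional[str]:
    """Turn '2.0.0+cu117' into '11.7'; return None when there is no +cuXYZ tag."""
    match = _CU_TAG.search(torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # Two-digit tags (cu92) have a one-digit major; three-digit tags (cu117) a two-digit major.
    if len(digits) == 2:
        major, minor = digits[0], digits[1:]
    else:
        major, minor = digits[:2], digits[2:]
    return f"{major}.{minor}"


if __name__ == "__main__":
    for v in ["2.0.0+cu117", "2.0.1+cu117", "2.0.0+cpu"]:
        print(v, "->", cuda_from_torch_version(v))
```

In practice you would pass `torch.__version__` to it.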

Approach 2: downgrade CUDA.

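Check which CUDA version torch was built with:

import torch
print(torch.version)                   # the torch.version module, including the path of version.py
print(torch.__version__)               # torch release, e.g. 2.0.0+cu117
print(torch.version.cuda)              # CUDA version torch was compiled with, e.g. 11.7
print(torch.backends.cudnn.version())  # cuDNN version shipped with torch
print(torch.cuda.is_available())       # whether torch can see a usable GPU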

Check the server's CUDA version from the command line:

nvidia-smi
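Note that the "CUDA Version" in the nvidia-smi header is the highest CUDA version the installed driver supports, not necessarily the toolkit used to compile extensions. As far as I can tell from its source, DeepSpeed's op builder takes the toolkit version from `nvcc -V` under `CUDA_HOME`. A sketch that reads the version the same way, assuming `nvcc` lives in `$CUDA_HOME/bin` or on `PATH`; the helper names are hypothetical.

```python
import os
import re
import shutil
import subprocess
from typing import Optional

# nvcc -V prints a line like: "Cuda compilation tools, release 12.1, V12.1.105"
_RELEASE = re.compile(r"release (\d+)\.(\d+)")


def parse_nvcc_release(output: str) -> Optional[str]:
    """Extract 'major.minor' from `nvcc -V` output, or None if not found."""
    match = _RELEASE.search(output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def installed_cuda_version() -> Optional[str]:
    """Run nvcc from $CUDA_HOME/bin if set, else from PATH; None if nvcc is missing."""
    cuda_home = os.environ.get("CUDA_HOME")
    nvcc = os.path.join(cuda_home, "bin", "nvcc") if cuda_home else shutil.which("nvcc")
    if not nvcc or not os.path.exists(nvcc):
        return None
    result = subprocess.run([nvcc, "-V"], capture_output=True, text=True, check=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("nvcc reports CUDA:", installed_cuda_version())
```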

Approach 1: after upgrading to torch 2.0.1, the CUDA version was still 11.7, and requirements.txt offers no way to specify the CUDA version.

Approach 2: downgrading CUDA is too much hassle.

The problem was finally solved another way: edit torch's version.py. The path below comes from the output of print(torch.version):

vi /home/gaosong/anaconda3/envs/gpt/lib/python3.9/site-packages/torch/version.py

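Change cuda = '11.7' to cuda = '12.1'. This only changes the version string that DeepSpeed compares against; the installed torch binaries are still the CUDA 11.7 build.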
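The same edit can be scripted. A minimal sketch, assuming torch is importable (the file's location comes from `torch.version.__file__`, the same path that print(torch.version) shows) and that version.py contains a line of the form `cuda = '11.7'`, possibly with a type annotation as in some releases. It keeps a `.bak` copy so the change can be undone; both function names are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Matches `cuda = '11.7'` and the annotated form `cuda: Optional[str] = '11.7'`.
_CUDA_LINE = re.compile(r"^(cuda\s*(?::\s*Optional\[str\]\s*)?=\s*)'[^']*'", re.MULTILINE)


def rewrite_cuda_string(source: str, new_version: str) -> str:
    """Return version.py text with its cuda string replaced by new_version."""
    new_source, count = _CUDA_LINE.subn(rf"\g<1>'{new_version}'", source)
    if count != 1:
        raise ValueError(f"expected exactly one cuda assignment, found {count}")
    return new_source


def patch_torch_version_file(new_version: str) -> Path:
    """Back up torch/version.py to version.py.bak, then rewrite it in place."""
    import torch.version  # imported lazily so rewrite_cuda_string works without torch

    path = Path(torch.version.__file__)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(rewrite_cuda_string(path.read_text(), new_version))
    return path
```

Call `patch_torch_version_file("12.1")` once and then start a fresh Python process; a process that already imported torch keeps the old value of `torch.version.cuda`.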
