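LMFlow: running the script ./scripts/run_finetune.sh fails with an error
The main cause is that the CUDA version installed on the machine does not match the CUDA version torch was compiled with.
Error output: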
Exception: >- DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.7, unable to compile cuda/cpp extensions without a matching cuda version.
Exception ignored in:
Traceback (most recent call last):
File "/home/gaosong/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 110, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in:
Traceback (most recent call last):
File "/home/gaosong/anaconda3/envs/gpt/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 110, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
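Approach 1: work out which torch versions go with which CUDA versions, then upgrade torch
The project currently depends on: torch==2.0.0
Approach 2: downgrade the CUDA version
Check the torch and CUDA versions from Python: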
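import torch
print(torch.version)                   # module repr, includes the path to torch/version.py
print(torch.__version__)               # torch release, e.g. 2.0.0
print(torch.version.cuda)              # CUDA version torch was compiled with
print(torch.backends.cudnn.version())  # cuDNN version
print(torch.cuda.is_available())       # whether a usable GPU is visible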
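Check the server's CUDA version from the command line:
nvidia-smi
Note: nvidia-smi reports the highest CUDA version the driver supports, while nvcc --version reports the installed toolkit version, which is the one named in the error above.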
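The two checks above can be combined into one script that prints both versions side by side. This is a minimal sketch, not how DeepSpeed performs its own check: the function names are hypothetical, and it assumes the relevant nvcc is the one found on PATH (DeepSpeed may look it up elsewhere, for example via CUDA_HOME).

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Return 'major.minor' from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


def toolkit_cuda_version():
    """Return the version of the nvcc found on PATH, or None if there is none."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_version():
    """Return the CUDA version torch was built with, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda  # None for CPU-only builds


def describe_versions():
    """Summarize both versions in one line for a quick side-by-side check."""
    return (
        f"toolkit (nvcc): {toolkit_cuda_version()}, "
        f"torch built with: {torch_cuda_version()}"
    )
```

Calling print(describe_versions()) inside the conda environment shows whether the two values differ; on the machine described here they would presumably be 12.1 and 11.7.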
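Approach 1 result: after upgrading to torch 2.0.1 the CUDA version was still 11.7, and I found no way to specify the CUDA version in requirements.txt.
Approach 2 result: downgrading CUDA is too much trouble.
I solved the problem a different way. The path below comes from the output of print(torch.version):
vi /home/gaosong/anaconda3/envs/gpt/lib/python3.9/site-packages/torch/version.py
Change cuda = '11.7' to cuda = '12.1'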
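The same edit can be scripted instead of done by hand in vi. The following is a rough sketch under assumptions: both function names are hypothetical, and it assumes version.py contains exactly one line of the form cuda = '11.7' as shown above (other torch releases may format this line differently). It keeps a .bak copy so the change can be undone.

```python
import re
import shutil


def patch_cuda_line(source, new_version):
    """Return `source` with the value on its `cuda = '...'` line replaced."""
    pattern = re.compile(r"^(cuda\s*=\s*)'[^']*'", flags=re.MULTILINE)
    patched, count = pattern.subn(
        lambda m: f"{m.group(1)}'{new_version}'", source
    )
    if count != 1:
        # Refuse to guess if the file does not look the way we assumed.
        raise ValueError(f"expected exactly one cuda line, found {count}")
    return patched


def patch_version_file(path, new_version):
    """Back up the file at `path`, then rewrite its cuda line in place."""
    with open(path, encoding="utf-8") as handle:
        source = handle.read()
    patched = patch_cuda_line(source, new_version)  # may raise before any write
    shutil.copy(path, path + ".bak")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(patched)
```

To apply it, import torch.version and call patch_version_file(torch.version.__file__, "12.1"). This only changes the version string torch reports, so the installed torch binaries are still the ones built against 11.7.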