Notes on a bizarre PyTorch error: CUDA error: out of memory (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:241)

Full error output:
Traceback (most recent call last):
File "train.py", line 73, in
feature = model(seq)
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 148, in forward
inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 159, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter_kwargs
inputs = scatter(inputs, target_gpus, dim) if inputs else []
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
res = scatter_map(inputs)
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
return Scatter.apply(target_gpus, None, dim, obj)
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/cuda/comm.py", line 147, in scatter
return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: out of memory (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:241)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f601966c813 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x1cb50 (0x7f60198adb50 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1de6e (0x7f60198aee6e in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x279 (0x7f601f2977e9 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: + 0x4225538 (0x7f601dce4538 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #5: + 0x3cdec28 (0x7f601d79dc28 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #6: + 0x1c34521 (0x7f601b6f3521 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #7: at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) + 0x272 (0x7f601b6f3ec2 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #8: + 0x1f5db40 (0x7f601ba1cb40 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #9: + 0x3af0873 (0x7f601d5af873 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #10: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef, c10::optional > const&, long, c10::optional > > const&) + 0x4db (0x7f601e0da10b in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #11: + 0x7851a3 (0x7f60653891a3 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x210ac4 (0x7f6064e14ac4 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #20: THPFunction_apply(_object*, _object*) + 0x936 (0x7f60650a5686 in /home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #60: __libc_start_main + 0xe7 (0x7f607452db97 in /lib/x86_64-linux-gnu/libc.so.6)

Suspected trigger:
The training script was launched in the background with nohup (nohup python train.py) on three GPUs (0, 1, 2), i.e. the script explicitly set os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2". I then force-closed the SSH session. After logging back in, I wanted to stop the training, so I found the Python PID with top and killed it with kill. Running the training script again then produced the error above.

Suspected cause:
Although kill terminated the Python process, it apparently never fully released the GPU resources that had been allocated to it (even though nvidia-smi showed no memory in use at all).
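
A quick way to test this guess (my own addition, not a step from the original troubleshooting) is to attempt a tiny allocation on every visible device; a card whose resources were never actually released will typically fail immediately:

import torch

# Try a small allocation on each visible GPU; a device that is still held
# by the killed process usually raises "CUDA error: out of memory" here.
for i in range(torch.cuda.device_count()):
    try:
        torch.empty(1, device=f"cuda:{i}")
        print(f"cuda:{i} OK")
    except RuntimeError as e:
        print(f"cuda:{i} failed: {e}")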

The bizarre fix:
(1) Add the following two lines to the script:

torch.cuda.current_device()  # this line doesn't seem to do much, but whatever, might as well keep it
torch.cuda._initialized = True

The error then changed to:
Traceback (most recent call last):
File "train.py", line 46, in
model = nn.DataParallel(model, device_ids=[0,1,2])
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 133, in __init__
_check_balance(self.device_ids)
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 19, in _check_balance
dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 19, in <listcomp>
dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/cuda/__init__.py", line 337, in get_device_properties
raise AssertionError("Invalid device id")
AssertionError: Invalid device id
Reason:
An unavailable GPU was being requested, so that card's resources were presumably still tied up.
So I checked the number and names of the available GPUs in a Python console and found:

>>> import torch
>>> torch.cuda.device_count()
2  # nvidia-smi reports 3 cards, but only 2 show up here
>>> torch.cuda.get_device_name(0)
'GeForce RTX 2080 Ti'
>>> torch.cuda.get_device_name(1)
'GeForce RTX 2080 Ti'
>>> torch.cuda.get_device_name(2)  # the third card turns out to be the problem one; it cannot be accessed
Traceback (most recent call last):
  File "", line 1, in 
  File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/cuda/__init__.py", line 312, in get_device_name
    return get_device_properties(device).name
  File "/home/liang/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/cuda/__init__.py", line 337, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

(2) Next I ran the following command:
fuser -v /dev/nvidia*
and killed every PID it listed (this is a headless server with no display attached to the GPUs, so it was safe to kill them all).
Checking the available GPUs in the Python console again, the third card was still invalid.
Then I set os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2" in the Python console, checked the available GPUs again, and the third card became usable (see the sketch after this step).
However, running the training script at this point still failed with the Invalid device id error.
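
For reference, here is a minimal sketch of that console check in my own wording. CUDA_VISIBLE_DEVICES is read when the CUDA context is first created, so in a fresh interpreter it has to be set before the first CUDA call; that is general CUDA/PyTorch behaviour, not something spelled out in the original post:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"   # set before the first CUDA call

import torch

print(torch.cuda.device_count())               # expect 3 once all cards are visible again
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))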
(3) And now for the key part:
I deleted the two lines added in step (1), and, damn it, everything just worked…
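
For completeness, a hypothetical minimal version of the arrangement that ended up working: set the visible devices before importing torch, wrap the model in DataParallel, and do no manual CUDA initialization. The model, tensor shapes, and batch size below are placeholders, not taken from the original train.py:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"   # select the three GPUs up front

import torch
import torch.nn as nn

model = nn.Linear(128, 64)                                    # stand-in for the real model
model = nn.DataParallel(model, device_ids=[0, 1, 2]).cuda()   # same DataParallel call as in the traceback

seq = torch.randn(32, 128).cuda()                             # stand-in for a real batch
feature = model(seq)
print(feature.shape)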

Summary:
I suspect this is a non-issue for CUDA veterans, advanced Linux users, or seasoned deep-learning practitioners, which is probably why no solution turns up online. I'm recording it here for myself so I don't lose my mind the next time it happens.
