I was training my model on GPU 2 of a server. Since that training run was still going, I had to rerun my evaluation experiments on a different GPU to check how well the model performs. When reloading the trained deep-learning model in PyTorch, I hit this error: RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:32.
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=32 error=10 : invalid device ordinal
Traceback (most recent call last):
  File "las_test.py", line 35, in <module>
    listener = torch.load(listener_model_path)
  File "/home/zyh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 303, in load
    return _load(f, map_location, pickle_module)
  File "/home/zyh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 469, in _load
    result = unpickler.load()
  File "/home/zyh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 437, in persistent_load
    data_type(size), location)
  File "/home/zyh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 88, in default_restore_location
    result = fn(storage, location)
  File "/home/zyh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 70, in _cuda_deserialize
    return obj.cuda(device)
  File "/home/zyh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 68, in _cuda
    with torch.cuda.device(device):
  File "/home/zyh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/__init__.py", line 227, in __enter__
    torch._C._cuda_setDevice(self.idx)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:32
The cause of this error is that when PyTorch saves a model it also records which GPU each tensor lived on. When the model is loaded again and the recorded device is not the same (or is unavailable), loading fails with invalid device ordinal. We can trace the error back through the source, starting from the frame in the traceback:

  File "/home/zyh/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 303, in load
    return _load(f, map_location, pickle_module)
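To make the failure mode concrete, here is a small pure-Python sketch of how the location tag saved with each storage interacts with `map_location`. Note that `remap_location` is a hypothetical helper written for illustration only, not part of the torch API:

```python
def remap_location(saved_tag, map_location=None):
    """Illustrative sketch (not torch's actual code) of how torch.load
    resolves a storage's saved device tag.

    saved_tag: the tag recorded at save time, e.g. 'cuda:2'.
    map_location: None, a device string, or a {old_tag: new_tag} dict.
    """
    if map_location is None:
        # Default behavior: restore onto the original device. This is
        # what raises "invalid device ordinal" when that GPU does not
        # exist on the loading machine.
        return saved_tag
    if isinstance(map_location, str):
        # A plain string forces every storage onto that one device.
        return map_location
    if isinstance(map_location, dict):
        # A dict remaps only the tags it mentions; others pass through.
        return map_location.get(saved_tag, saved_tag)
    raise TypeError("unsupported map_location in this sketch")

# The model was saved on GPU 2; we want to test elsewhere.
assert remap_location('cuda:2') == 'cuda:2'                         # would fail on load
assert remap_location('cuda:2', 'cpu') == 'cpu'                     # forced to CPU
assert remap_location('cuda:2', {'cuda:2': 'cuda:0'}) == 'cuda:0'   # remapped to GPU 0
```

The callable form of `map_location` (shown in the docstring below) is handled separately: it receives the CPU-deserialized storage and the tag, and may return a storage directly.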
Stepping into serialization.py, we find that the source code's own docstring already explains all of this:
def load(f, map_location=None, pickle_module=pickle):
"""Loads an object saved with :func:`torch.save` from a file.
:meth:`torch.load` uses Python's unpickling facilities but treats storages,
which underlie tensors, specially. They are first deserialized on the
CPU and are then moved to the device they were saved from. If this fails
(e.g. because the run time system doesn't have certain devices), an exception
is raised. However, storages can be dynamically remapped to an alternative
set of devices using the `map_location` argument.
If `map_location` is a callable, it will be called once for each serialized
storage with two arguments: storage and location. The storage argument
will be the initial deserialization of the storage, residing on the CPU.
Each serialized storage has a location tag associated with it which
identifies the device it was saved from, and this tag is the second
argument passed to map_location. The builtin location tags are `'cpu'` for
CPU tensors and `'cuda:device_id'` (e.g. `'cuda:2'`) for CUDA tensors.
`map_location` should return either None or a storage. If `map_location` returns
a storage, it will be used as the final deserialized object, already moved to
the right device. Otherwise, :meth:`torch.load` will fall back to the default
behavior, as if `map_location` wasn't specified.
If `map_location` is a string, it should be a device tag, where all tensors
should be loaded.
Otherwise, if `map_location` is a dict, it will be used to remap location tags
appearing in the file (keys), to ones that specify where to put the
storages (values).
User extensions can register their own location tags and tagging and
deserialization methods using `register_package`.
Args:
f: a file-like object (has to implement read, readline, tell, and seek),
or a string containing a file name
map_location: a function, string or a dict specifying how to remap storage
locations
pickle_module: module used for unpickling metadata and objects (has to
match the pickle_module used to serialize file)
Example:
>>> torch.load('tensors.pt')
# Load all tensors onto the CPU
>>> torch.load('tensors.pt', map_location='cpu')
# Load all tensors onto the CPU, using a function
>>> torch.load('tensors.pt', map_location=lambda storage, loc: storage)
# Load all tensors onto GPU 1
>>> torch.load('tensors.pt', map_location=lambda storage, loc: storage.cuda(1))
# Map tensors from GPU 1 to GPU 0
>>> torch.load('tensors.pt', map_location={'cuda:1':'cuda:0'})
# Load tensor from io.BytesIO object
>>> with open('tensor.pt', 'rb') as f:
buffer = io.BytesIO(f.read())
>>> torch.load(buffer)
"""
The examples at the end of the docstring spell out our options for a fix:
# Load all tensors onto the CPU:
torch.load('path of your model', map_location='cpu')
# Load all tensors onto the CPU, using a function:
torch.load('path of your model', map_location=lambda storage, loc: storage)
# Load all tensors onto GPU 1:
torch.load('path of your model', map_location=lambda storage, loc: storage.cuda(1))
# Map tensors from GPU 1 (the GPU the model was trained on) to GPU 0 (the remapped GPU):
torch.load('path of your model', map_location={'cuda:1':'cuda:0'})
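The callable form deserves a closer look, since its two-argument signature is easy to get wrong. `torch.load` invokes the function once per serialized storage with `(storage, location)`; returning the storage unchanged keeps it on the CPU. The sketch below mimics that protocol with a stand-in `FakeStorage` class (a made-up name for illustration) so it runs without torch or a GPU:

```python
class FakeStorage:
    """Stand-in for a torch storage object, for illustration only."""
    def __init__(self, name):
        self.name = name

seen_locations = []

def to_cpu(storage, location):
    # This is the shape of a map_location callable: it receives the
    # already-deserialized CPU storage plus the saved device tag.
    # Returning the storage unchanged keeps everything on the CPU.
    seen_locations.append(location)
    return storage

# Simulate what torch.load does internally for two storages that
# were saved from GPU 2:
weights, bias = FakeStorage('weights'), FakeStorage('bias')
restored = [to_cpu(s, 'cuda:2') for s in (weights, bias)]

assert restored == [weights, bias]            # storages untouched, i.e. CPU
assert seen_locations == ['cuda:2', 'cuda:2'] # tag reported per storage
```

With real torch, `to_cpu` here behaves like `lambda storage, loc: storage` from the examples above; substituting `storage.cuda(0)` in the return would instead move each storage to GPU 0.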
In the end, loading all the tensors onto GPU 0 solved the problem.
Reference:
https://blog.csdn.net/shincling/article/details/78919282
https://pytorch-cn.readthedocs.io/zh/latest/