The model was trained on a multi-GPU server, but testing runs on my own desktop, which has only one GPU. The test fails with the following error:
File "/home/fuxueping/sdb/PycharmProjects/face_recognition/test_face_recognition_pytorc
h.py", line 563, in
test(method)
File "/home/fuxueping/sdb/PycharmProjects/face_recognition/test_face_recognition_pytorc
h.py", line 547, in test
regfeat = extractFeature_mobileFace(net, regimglist, batch)
File "/home/fuxueping/sdb/PycharmProjects/face_recognition/test_face_recognition_pytorc
h.py", line 190, in extractFeature_mobileFace
checkpoint = torch.load(model)
File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-
packages/torch/serialization.py", line 303, in load
return _load(f, map_location, pickle_module)
File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-
packages/torch/serialization.py", line 469, in _load
result = unpickler.load()
File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-
packages/torch/serialization.py", line 437, in persistent_load
data_type(size), location)
File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-
packages/torch/serialization.py", line 88, in default_restore_location
result = fn(storage, location)
File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-
packages/torch/serialization.py", line 70, in _cuda_deserialize
return obj.cuda(device)
File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-
packages/torch/_utils.py", line 68, in _cuda
with torch.cuda.device(device):
File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-
packages/torch/cuda/__init__.py", line 227, in __enter__
torch._C._cuda_setDevice(self.idx)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:32
Error scenario: I am training my model on the server using GPUs 4, 5, 6 and 7. The training job is still running, so I test the intermediate checkpoints on my own desktop, which has only GPU 0. When reloading the trained model in PyTorch, it fails with: RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:32.
Cause: when PyTorch saves a model with torch.save, the CUDA tensors are pickled together with the device they were on. When the checkpoint is reloaded on a machine that does not have that same GPU (here, cuda:4 through cuda:7 do not exist on the single-GPU desktop), torch.load tries to restore the storages onto the original device and fails with invalid device ordinal.
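To make the cause concrete, here is a minimal sketch of the situation. The tiny network and the checkpoint path are made up for illustration; it assumes the training script wraps the model in nn.DataParallel on GPUs 4-7, as described above:
# --- On the training server, where GPUs 4-7 exist (hypothetical minimal example) ---
import torch
import torch.nn as nn

net = nn.Linear(512, 128)                                    # stand-in for the real face-recognition net
net = nn.DataParallel(net, device_ids=[4, 5, 6, 7]).cuda(4)  # parameters now live on cuda:4

# torch.save pickles each CUDA tensor together with its storage location,
# so every parameter in the checkpoint remembers that it lives on cuda:4.
# The DataParallel wrapper is also why the saved keys carry a "module." prefix.
torch.save(net.state_dict(), "checkpoint.pth")

# --- On the desktop, which only has GPU 0 ---
# The default torch.load tries to put the storages back on cuda:4, which does not
# exist here, so it raises "cuda runtime error (10) : invalid device ordinal":
# checkpoint = torch.load("checkpoint.pth")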
Solution:
Original code:
#gpu 0
checkpoint = torch.load(model)
state_dict = {k.replace("module.", ""): v for k, v in checkpoint.items()}
net.load_state_dict(state_dict)
Modified code:
#gpu 0
checkpoint = torch.load(model, map_location=lambda storage, loc: storage.cuda(0))
state_dict = {k.replace("module.", ""): v for k, v in checkpoint.items()}
net.load_state_dict(state_dict)
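With map_location, the tensors are restored directly onto the requested device, so the GPU ordinal recorded at training time no longer matters. Other map_location forms achieve the same thing; the sketch below assumes the same DataParallel checkpoint saved from GPUs 4-7 (the string form "cpu" is accepted on PyTorch 0.4 and later, older versions need the lambda form):
# Option 1: load everything onto the CPU first (also works on a machine without CUDA)
checkpoint = torch.load(model, map_location="cpu")

# Option 2: remap the training GPUs explicitly onto the local GPU 0
checkpoint = torch.load(model, map_location={"cuda:4": "cuda:0", "cuda:5": "cuda:0",
                                             "cuda:6": "cuda:0", "cuda:7": "cuda:0"})

# Strip the "module." prefix added by nn.DataParallel, then load and move to GPU 0
state_dict = {k.replace("module.", ""): v for k, v in checkpoint.items()}
net.load_state_dict(state_dict)
net = net.cuda(0)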
For a more detailed explanation, see: https://blog.csdn.net/yinhui_zhang/article/details/86572232