PyTorch test error: RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module

The model was trained on a multi-GPU server, but testing was done on my own desktop, which has only one GPU. The test failed with:

  File "/home/fuxueping/sdb/PycharmProjects/face_recognition/test_face_recognition_pytorch.py", line 563, in <module>
    test(method)
  File "/home/fuxueping/sdb/PycharmProjects/face_recognition/test_face_recognition_pytorch.py", line 547, in test
    regfeat = extractFeature_mobileFace(net, regimglist, batch)
  File "/home/fuxueping/sdb/PycharmProjects/face_recognition/test_face_recognition_pytorch.py", line 190, in extractFeature_mobileFace
    checkpoint = torch.load(model)
  File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-packages/torch/serialization.py", line 303, in load
    return _load(f, map_location, pickle_module)
  File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-packages/torch/serialization.py", line 469, in _load
    result = unpickler.load()
  File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-packages/torch/serialization.py", line 437, in persistent_load
    data_type(size), location)
  File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-packages/torch/serialization.py", line 88, in default_restore_location
    result = fn(storage, location)
  File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-packages/torch/serialization.py", line 70, in _cuda_deserialize
    return obj.cuda(device)
  File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-packages/torch/_utils.py", line 68, in _cuda
    with torch.cuda.device(device):
  File "/home/fuxueping/anaconda2/envs/pytorch_yolov3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 227, in __enter__
    torch._C._cuda_setDevice(self.idx)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:32

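Scenario: I was training my model on the server using GPUs 4, 5, 6, and 7. Training was still running, so I tested an intermediate checkpoint on my desktop, which only has GPU 0. Reloading the trained model in PyTorch raised: RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:32.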

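Cause: when PyTorch saves a model, it also records which GPU each tensor was stored on. On reload, torch.load tries to put each tensor back on the device with the same index (here, one of GPUs 4 to 7). The desktop has no GPU with that index, so the call fails with invalid device ordinal.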
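The `map_location` argument of `torch.load` is what overrides the recorded device. The following is a minimal sketch of that argument only: it saves a small dictionary from the CPU into an in-memory buffer (a stand-in for a real checkpoint file), so it does not reproduce the error itself, and the key name `weight` is made up for illustration.

```python
import io

import torch

# Save a small state dict to an in-memory buffer (stands in for a checkpoint file).
buffer = io.BytesIO()
torch.save({"weight": torch.ones(2, 3)}, buffer)
buffer.seek(0)

# map_location overrides the device tag recorded at save time.
# "cpu" always exists; on a single-GPU machine "cuda:0" would also be valid.
checkpoint = torch.load(buffer, map_location="cpu")
loaded_device = checkpoint["weight"].device
print(loaded_device)
```

Passing a string such as `"cuda:0"` is equivalent in effect to the lambda form `lambda storage, loc: storage.cuda(0)`: every tensor ends up on GPU 0 regardless of where it was saved.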

Solution:

Code before the change:

    #gpu 0
    checkpoint = torch.load(model)
    state_dict = {k.replace("module.", ""): v for k, v in checkpoint.items()}
    net.load_state_dict(state_dict)

After the change:

    #gpu 0
    checkpoint = torch.load(model,map_location=lambda storage, loc: storage.cuda(0))
    state_dict = {k.replace("module.", ""): v for k, v in checkpoint.items()}
    net.load_state_dict(state_dict)
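A note on the key renaming in the snippets above: the "module." prefix typically comes from saving a model wrapped in `nn.DataParallel`. `k.replace("module.", "")` removes every occurrence of that substring, not just the leading one. A slightly safer variant might look like the sketch below; `strip_module_prefix` is a hypothetical helper name, and plain integers stand in for tensors so the example runs without a checkpoint file.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict):
    """Return a copy of state_dict with a leading "module." removed from each key."""
    prefix = "module."
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        # Only strip at the start, so a name such as "head.module.bias" is untouched.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Plain values stand in for tensors so the sketch runs without a checkpoint file.
example = OrderedDict([("module.conv1.weight", 1), ("head.module.bias", 2)])
cleaned_example = strip_module_prefix(example)
print(list(cleaned_example))
```

With a real checkpoint, the result of `strip_module_prefix(checkpoint)` would be passed to `net.load_state_dict` in place of the dictionary comprehension.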

For a detailed explanation, see: https://blog.csdn.net/yinhui_zhang/article/details/86572232
