RuntimeError: storage has wrong size: expected XXX got XXX, and going from multi-GPU training to multi-GPU testing

  1. When loading a checkpoint that was saved during training, my code hit the following error:
Traceback (most recent call last):
  File "/root/PycharmProjects/test.py", line 8, in <module>
    model_dict = torch.load("0.pth")
  File "torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "torch/serialization.py", line 709, in _legacy_load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 926 got 24

The cause: when training with DistributedDataParallel, every process executes the save, and the concurrent writes to the same file corrupt the checkpoint, which is why torch.load later reports a storage-size mismatch. Only the process with local_rank == 0 should save the model weights:

if args.local_rank == 0:  # only one process writes the checkpoint
    state = {
        'epoch': epoch + 1, 'state_dict': model.state_dict(),
        'best_top5': best_top5, 'optimizer': optimizer.state_dict(),
    }
    torch.save(state, filename)
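
For completeness, a minimal loading sketch for a checkpoint with the layout saved above; the filename is a placeholder. One caveat worth knowing: a state_dict taken from the DDP wrapper has its keys prefixed with "module.", so either load it into the wrapped model or strip the prefix first, as below.

# minimal loading sketch (filename is hypothetical; layout matches the save above)
checkpoint = torch.load("checkpoint.pth", map_location="cpu")
start_epoch = checkpoint['epoch']
best_top5 = checkpoint['best_top5']
# keys saved from the DDP wrapper start with "module."; strip that prefix
# when loading into a plain (unwrapped) model
state_dict = {k[len('module.'):] if k.startswith('module.') else k: v
              for k, v in checkpoint['state_dict'].items()}
model.load_state_dict(state_dict)
optimizer.load_state_dict(checkpoint['optimizer'])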
  2. After training on multiple GPUs, testing with a different number of GPUs (or a single GPU) raises an error like:
RuntimeError: Attempting to deserialize object on CUDA device 3 but torch.cuda.device_count() is 1. please use torch.load with map_location to map your storages to an XXX.

Suppose the model was trained on 8 GPUs but must be tested on 4. Then set the map_location argument of torch.load to:

torch.load("path", map_location={'cuda:7':'cuda:3'})
#前一个参数表示训练GPU数量,后一参数表示当前GPU数量

To test on a single GPU instead:

torch.load("path", map_location={'cuda:7':'cuda:0'})
