Fixing "your module has parameters that were not used in producing loss" in MMDetection multi-GPU training


When launching MMDetection multi-GPU training with:

./tools/dist_train.sh config_file num_gpus

the following error appears:

RuntimeError: Expected to have finished reduction in the prior
iteration before starting a new one. This error indicates that your
module has parameters that were not used in producing loss. You can
enable unused parameter detection by (1) passing the keyword argument
find_unused_parameters=True to
torch.nn.parallel.DistributedDataParallel; (2) making sure all forward
function outputs participate in calculating loss. If you already have
done the above two steps, then the distributed data parallel module
wasn't able to locate the output tensors in the return value of your
module's forward function. Please include the loss function and the
structure of the return value of forward of your module when reporting
this issue (e.g. list, dict, iterable)

One cause of this error is that the network contains parameters that never participate in the loss computation. The first fix, as the error message suggests, is to set find_unused_parameters=True where the model is wrapped for distributed training; in MMDetection this happens in mmdet/apis/train.py.
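As a minimal sketch (not the actual mmdet wrapper code), the effect of the flag can be reproduced with a toy model whose `unused` layer is defined but never called in `forward()`. Running single-process DDP on CPU with the gloo backend, two consecutive iterations complete only because `find_unused_parameters=True` is passed; without it, the second `backward()` raises exactly the error above:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process CPU "distributed" setup, just to demonstrate the flag.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # defined but never used in forward()

    def forward(self, x):
        return self.used(x)

# Without find_unused_parameters=True, the second backward() below raises
# "Expected to have finished reduction in the prior iteration ...".
model = DDP(Toy(), find_unused_parameters=True)

for _ in range(2):
    loss = model(torch.randn(2, 4)).sum()
    loss.backward()

dist.destroy_process_group()
```

After both iterations, the dead layer still has no gradient (`param.grad is None`), which is also the signal used by the debugging trick below.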

However, extra unused parameters still inflate the model size, so an alternative fix is to find them and delete them from the network definition:

  • Locate the installed mmcv package and open the optimizer.py file under mmcv/runner/hooks.
  • Find the after_train_iter function in that file.
  • On the line right after runner.outputs['loss'].backward(), add the following code:
    for name, param in runner.model.named_parameters():
        if param.grad is None:
            print(name)
  • Run the single-GPU training command:
python tools/train.py config_file

This prints the names of all parameters that did not take part in the loss computation.
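The detection trick itself does not depend on mmcv or the distributed launcher: any parameter that autograd never touched is left with `param.grad is None` after `backward()`. A self-contained sketch with a hypothetical toy model (the name `dead_head` is made up for illustration):

```python
import torch
import torch.nn as nn

class ToyDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.dead_head = nn.Linear(8, 2)  # defined but never called in forward()

    def forward(self, x):
        return self.backbone(x)

model = ToyDetector()
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Same check as the snippet added to after_train_iter above:
unused = [name for name, p in model.named_parameters() if p.grad is None]
print(unused)  # → ['dead_head.weight', 'dead_head.bias']
```

Every name printed here points at a module (or raw parameter) that can be removed from `__init__` without affecting training.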

Finally, comment out those parameters in the network definition (and remove the temporary print snippet from optimizer.py).
