Finding Layers That Don't Participate in Loss Computation in PyTorch, and How to Fix It

I. The Training Problem

When computing the loss in PyTorch, if some layers do not take part in the computation, multi-GPU distributed (DDP) training will fail with an error telling you to set find_unused_parameters=True.
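
To make the failure mode concrete, here is a minimal sketch (the ToyModel class and its layer names are made up for illustration, not taken from the original training code): a module holds a layer whose output never reaches the loss, which is exactly what makes DDP complain at backward time.

import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(10, 10)
        self.unused = nn.Linear(10, 10)  # defined but never called below

    def forward(self, x):
        # self.unused contributes nothing to the output, so its parameters
        # receive no gradient; DDP's reducer then raises the error above.
        return self.used(x)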

II. Solutions

  1. As the error message suggests, pass find_unused_parameters=True when constructing the DDP model, for example:
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank,
            # nn.MultiheadAttention incompatibility with DDP https://github.com/pytorch/pytorch/issues/26698
            # find_unused_parameters=any(isinstance(layer, nn.MultiheadAttention) for layer in model.modules()))
            find_unused_parameters=True)  # add this line
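
Note that find_unused_parameters=True makes DDP traverse the autograd graph on every iteration to mark unused parameters as ready, which adds per-step overhead, so it is worth finding and fixing the offending layers instead. For a quick hunt in an ordinary single-process run, you can check which parameters end up with no gradient after one backward pass; a minimal sketch (report_unused_parameters is a hypothetical helper, and the model/loss objects are assumptions for illustration):

import torch
import torch.nn as nn

def report_unused_parameters(model: nn.Module, loss: torch.Tensor):
    # Reset every gradient to None, then backpropagate once: any parameter
    # whose .grad is still None afterwards never contributed to the loss.
    model.zero_grad(set_to_none=True)
    loss.backward()
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]

# e.g. with the ToyModel above:
#   model = ToyModel()
#   loss = model(torch.randn(2, 10)).sum()
#   print(report_unused_parameters(model, loss))  # ['unused.weight', 'unused.bias']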
  2. Setting the flag alone does not address the root cause, though: you still cannot tell which layers were left out of the loss computation. Prefix the launch command with TORCH_DISTRIBUTED_DEBUG=DETAIL (requires PyTorch 1.9 or later) and PyTorch will print the parameters that did not receive gradients, for example:
# launch command
TORCH_DISTRIBUTED_DEBUG=DETAIL python -m torch.distributed.launch --nproc_per_node 8 --master_port 9527 train_aux.py --workers 32 --device 0,1,2,3,4,5,6,7 --sync-bn
# sample error output
Parameters which did not receive grad for rank 2: model.144.m.1.weight, model.144.m.1.bias, model.144.m.2.weight, model.144.m.2.bias, model.144.m.3.weight, model.144.m.3.bias, model.144.m2.0.weight, model.144.m2.0.bias, model.144.m2.1.weight, model.144.m2.1.bias, model.144.m2.2.weight, model.144.m2.2.bias, model.144.m2.3.weight, model.144.m2.3.bias, model.144.ia.1.implicit, model.144.ia.2.implicit, model.144.ia.3.implicit, model.144.im.1.implicit, model.144.im.2.implicit, model.144.im.3.implicit
Parameter indices which did not receive grad for rank 2: 437 438 439 440 441 442 443 444 445 446 447 448 449 450 452 453 454 456 457 458 # without TORCH_DISTRIBUTED_DEBUG=DETAIL, only this indices line is printed
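
If editing the launch command is inconvenient, the variable can also be set from inside the training script; a minimal sketch, assuming it runs before torch.distributed is initialized (the setting is read when distributed training starts up):

import os

# Must be set before the process group is created, otherwise it is ignored.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch.distributed as dist

dist.init_process_group(backend="nccl")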
