Fix: AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel


During multi-GPU distributed training with PyTorch, the program crashed with the following error:

AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
  0%|          | 0/310 [00:01
Traceback (most recent call last):
  File "dist_train.py", line 234, in <module>
    main()
  File "dist_train.py", line 152, in main
    train(net, optimizer)
  File "dist_train.py", line 187, in train
    map_out,num_pre = net(images,Grad_Imgs)  # num_pre [B,1] map_out[B,1,H,W]
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/SR/SheetCounting_2/Test_demo/model/TS_Net.py", line 124, in forward
    feat_map_r1 = self.r1_conv1(im_data)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 529, in forward
    raise AttributeError('SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel')
AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'dist_train.py', '--local_rank=1']' returned non-zero exit status 1.

The key part of the message is: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel

Taken literally, this says that SyncBatchNorm is supported only when the model is wrapped in torch.nn.parallel.DistributedDataParallel. However, my program really does use it.

And the DDP in my code is exactly the torch.nn.parallel.DistributedDataParallel mentioned in the error message:

from torch.nn.parallel import DistributedDataParallel as DDP
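For context: in PyTorch builds from this era, DDP flags each SyncBatchNorm layer when it wraps the model, and the layer raises this AttributeError at forward time if that flag was never set, which is how the error can appear even though DDP is in use. Below is a minimal sketch of the setup the error message expects, assuming the one-process-per-GPU launch shown in the traceback (the toy Sequential model is illustrative, not this post's TS_Net):

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torch.distributed.launch passes --local_rank to each worker process.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend="nccl")  # env:// rendezvous set up by the launcher
torch.cuda.set_device(args.local_rank)

# A toy model standing in for the real network.
net = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))
# Replace every BatchNorm*d layer in the model with SyncBatchNorm.
net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net).cuda(args.local_rank)
# SyncBatchNorm layers only run inside a DDP-wrapped module (one GPU per process).
net = DDP(net, device_ids=[args.local_rank], output_device=args.local_rank)

Each worker is then started by the launcher, e.g. python -m torch.distributed.launch --nproc_per_node=2 dist_train.py, which is what produced the --local_rank=1 argument visible in the traceback above.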


Solution:

It eventually turned out to be a CUDA/PyTorch version problem. SyncBatchNorm is a relatively recent addition to PyTorch, and upgrading to a newer CUDA build (with a matching PyTorch) made the error go away. If you train inside Docker, try switching to a newer image.
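Before swapping environments, it is worth confirming what the current build actually reports; these are standard PyTorch introspection calls, nothing specific to this codebase:

import torch
import torch.distributed as dist

print(torch.__version__)          # PyTorch build, e.g. 1.7.1
print(torch.version.cuda)         # CUDA version this PyTorch was compiled against
print(torch.cuda.is_available())  # False usually means a driver/CUDA mismatch
print(dist.is_available())        # distributed package compiled into this build
print(dist.is_nccl_available())   # NCCL backend usable (required for multi-GPU DDP)

If torch.version.cuda disagrees with the CUDA toolkit in your environment, or is_nccl_available() returns False, a newer image with a matching PyTorch/CUDA pair is the likely fix, consistent with what solved it here.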
