[fairseq] Error: TypeError: _broadcast_coalesced(): incompatible function arguments

Background

I overrode the model's state_dict method to add two extra entries: dynamic_mask (a dict whose values are tensors) and allocated_neuron_num (an int).

def state_dict(self, destination=None, prefix='', keep_vars=False):
    state_dict = super().state_dict(destination, prefix, keep_vars)
    # Extra entries on top of the usual parameters/buffers:
    # a dict of tensors and a plain int, neither of which is itself a Tensor.
    state_dict['model.dynamic_mask'] = gloVar.dynamic_mask
    state_dict['model.allocated_neuron_num'] = gloVar.allocated_neuron_num
    return state_dict

Training with this override then failed with:

  File "/data3/syxu/sparsenmt_exp/sparsenmt/fairseq/fairseq/models/distributed_fairseq_model.py", line 58, in DistributedFairseqModel
    wrapped_model = DistributedDataParallel(
  File "/data3/syxu/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 580, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/data3/syxu/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 597, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/data3/syxu/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
TypeError: _broadcast_coalesced(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch._C._distributed_c10d.ProcessGroup, tensors: List[at::Tensor], buffer_size: int, src: int = 0) -> None
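
Why this happens, as far as I can tell from the PyTorch 1.8 source in the traceback: when DistributedDataParallel is constructed, it broadcasts the values of the wrapped module's state_dict to the other ranks, and dist._broadcast_coalesced only accepts a list of tensors. The two extra entries above are a dict and an int, so the call rejects its arguments. Below is a minimal, hypothetical reproduction sketch (ToyModel is made up here, not the original fairseq model):

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

    def state_dict(self, destination=None, prefix='', keep_vars=False):
        sd = super().state_dict(destination, prefix, keep_vars)
        sd['model.dynamic_mask'] = {'w': torch.ones(4)}  # dict, not a Tensor
        sd['model.allocated_neuron_num'] = 7             # int, not a Tensor
        return sd

model = ToyModel()
# At construction time, DistributedDataParallel broadcasts the values of
# model.state_dict() across ranks via dist._broadcast_coalesced, whose
# signature (see the error above) only accepts List[Tensor].
# These are the entries that break that broadcast:
print([k for k, v in model.state_dict().items()
       if not isinstance(v, torch.Tensor)])
# -> ['model.dynamic_mask', 'model.allocated_neuron_num']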

Solution

Avoid the DistributedDataParallel wrapper shown in the traceback. According to the fairseq documentation, this is controlled by the --ddp-backend option. The error occurred with --ddp-backend=pytorch_ddp (the default); switching to either legacy_ddp or no_c10d makes the error go away, since fairseq then no longer wraps the model in the stock DistributedDataParallel (see the command-line sketch below).
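
For reference, a minimal sketch of what the change looks like on the command line (the data directory and --arch value are placeholders for a generic setup, not my actual training command):

fairseq-train data-bin/my-dataset \
    --arch transformer \
    --ddp-backend=legacy_ddp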

References

https://fairseq.readthedocs.io/en/latest/command_line_tools.html
https://blog.csdn.net/j___t/article/details/104368597
