raise RuntimeError("Distributed package doesn't have NCCL " "built in") RuntimeError: Distributed package doesn't have NCCL built in

I hit the following error while reproducing StyleGAN3:

torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ubuntu/lxd-workplace/landf/face/stylegan3-main/train.py", line 38, in subprocess_fn
    torch.distributed.init_process_group(backend='nccl', init_method=init_method, rank=rank, world_size=c.num_gpus)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 583, in init_process_group
    default_pg = _new_process_group_helper(
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 708, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

The call that fails is the one using the nccl backend here:

    # Init torch.distributed.
    if c.num_gpus > 1:
        init_file = os.path.abspath(os.path.join(temp_dir, '.torch_distributed_init'))
        if os.name == 'nt':
            init_method = 'file:///' + init_file.replace('\\', '/')
            torch.distributed.init_process_group(backend='gloo', init_method=init_method, rank=rank, world_size=c.num_gpus)
        else:
            init_method = f'file://{init_file}'
            torch.distributed.init_process_group(backend='nccl', init_method=init_method, rank=rank, world_size=c.num_gpus)

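Every result I found on Baidu was about this error on Windows. The advice was to pass backend='gloo' in the dist.init_process_group call, i.e. to use Gloo instead of NCCL on Windows. But I am running on a Linux server.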
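The Gloo workaround can be written so that it only kicks in when NCCL is actually missing. The sketch below is not part of StyleGAN3; choose_backend is a hypothetical helper name, and it assumes that falling back to Gloo is acceptable for your training run. It only uses torch.distributed.is_available() and torch.distributed.is_nccl_available().

```python
import os


def choose_backend():
    """Return 'nccl' if this PyTorch build ships it, else 'gloo'.

    Hypothetical helper for illustration. Returns None when PyTorch is
    not installed or was built without distributed support.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return None
    if not dist.is_available():
        return None
    # NCCL is not offered on Windows, so only probe for it elsewhere.
    if os.name != 'nt' and dist.is_nccl_available():
        return 'nccl'
    return 'gloo'


if __name__ == '__main__':
    print('backend:', choose_backend())
```

The returned string could then be passed as the backend argument of torch.distributed.init_process_group in place of the hard-coded 'nccl'. On a Linux machine this is a stopgap: if it picks 'gloo', the installed PyTorch build is the thing to fix.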

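The code itself is correct, so I began to suspect the PyTorch version.
In the terminal, run python,
then at the >>> prompt run import torch,
and then:
[screenshot in the original post]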
In the end I tracked it down: it was indeed the PyTorch version. After switching to version 1.8 the error went away.
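A check like the one done in the interpreter can be sketched as a small script. The function name describe_torch_build is hypothetical, and this is not necessarily what the original screenshot showed. It reports the installed PyTorch version, the CUDA version the build was compiled against, and whether NCCL is included. A CPU-only build (where torch.version.cuda is None) typically has no NCCL, which would produce the error above.

```python
def describe_torch_build():
    """Collect facts about the installed PyTorch build.

    Hypothetical helper for illustration. Works even when PyTorch is
    missing, in which case only {'torch': None} is returned.
    """
    info = {}
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        info['torch'] = None
        return info
    info['torch'] = torch.__version__
    # CUDA version the wheel was compiled against; None for CPU-only builds.
    info['cuda_build'] = torch.version.cuda
    info['cuda_available'] = torch.cuda.is_available()
    has_dist = dist.is_available()
    info['distributed'] = has_dist
    # is_nccl_available() is only safe to call when distributed is built in.
    info['nccl'] = bool(has_dist and dist.is_nccl_available())
    return info


if __name__ == '__main__':
    for key, value in describe_torch_build().items():
        print(f'{key}: {value}')
```

If nccl comes out False on a Linux machine with GPUs, reinstalling a CUDA-enabled PyTorch build is the likely fix. In my case that was version 1.8.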