【debug】NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system err

Error message

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 18994 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 18995) of binary:/data/xxx/anaconda3/envs/openmmlab/bin/python

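Single-machine multi-GPU training of the mmseg project ran without problems, but after switching to multi-machine multi-GPU training it failed with RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3.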

Solution

First, I found a suggestion online: run the following lines before the training command:

export NCCL_DEBUG=info
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
......
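One thing worth checking with these variables is that the interface given to NCCL_SOCKET_IFNAME actually exists on each server, since eth0 is only the name used in the suggestion above. Below is a minimal sketch, assuming a Unix-like host; the helper names available_interfaces and nccl_env are made up for illustration and are not part of NCCL, PyTorch, or mmseg.

```python
import socket


def available_interfaces():
    """Return the names of the network interfaces on this machine."""
    return [name for _index, name in socket.if_nameindex()]


def nccl_env(ifname, interfaces=None):
    """Build the three NCCL variables from the post (hypothetical helper).

    Raises ValueError if ifname is not among the known interfaces, so a
    wrong interface name is caught before training starts.
    """
    if interfaces is None:
        interfaces = available_interfaces()
    if ifname not in interfaces:
        raise ValueError(f"{ifname!r} not found; available: {interfaces}")
    return {
        "NCCL_DEBUG": "info",
        "NCCL_SOCKET_IFNAME": ifname,
        "NCCL_IB_DISABLE": "1",
    }


if __name__ == "__main__":
    # Print the interface names so the right one can be picked per server.
    print(available_interfaces())
```

If the name checks out, the returned dictionary could be applied with os.environ.update(...) before the distributed process group is initialized, which should have the same effect as the export lines.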

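But this did not work for my problem; the same error kept appearing. After a long search, it turned out to be a firewall issue. Just turn off the firewall on both servers. Just turn off the firewall on both servers. Just turn off the firewall on both servers.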

Important things are worth saying three times.
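To see whether a firewall is in the way before (or after) changing it, a plain TCP probe from one server to the other can help. This is a rough sketch only: the address 192.168.1.10 and port 29500 below are placeholders, not values from my setup, and probe is a made-up helper name. The probe reports "open" only if something is already listening on that port, so it is best run while the training job on the master node is waiting for the other nodes to join.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try a TCP connection and describe the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on this port.
        return "refused"
    except socket.timeout:
        # No answer at all; packets silently dropped by a firewall look like this.
        return "timeout"
    except OSError as exc:
        # Other failures, e.g. "No route to host".
        return f"error: {exc}"


if __name__ == "__main__":
    # Placeholder master address and port; replace with your own values.
    print(probe("192.168.1.10", 29500))
```

A "timeout" or "No route to host" result while the master is listening points toward the firewall. Note that NCCL also opens additional ports of its own between the nodes, so an "open" result on the rendezvous port alone does not prove that every connection NCCL needs will get through.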
