RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 18994 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 18995) of binary:/data/xxx/anaconda3/envs/openmmlab/bin/python
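My mmseg project trains fine on a single machine with multiple GPUs, but switching to multi-machine, multi-GPU training fails with RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3.
First, I found a suggestion online: add the following lines before the training command: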
export NCCL_DEBUG=info
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
......
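One thing worth checking before relying on NCCL_SOCKET_IFNAME=eth0 is whether an interface with that name actually exists on each server, since interface names differ between machines. The following is only a minimal sketch using Python's standard library; the helper names list_interface_names and interface_exists are made up for illustration and are not part of NCCL, PyTorch, or mmseg.

```python
import socket


def list_interface_names():
    """Return the names of the network interfaces known to the OS."""
    # socket.if_nameindex() returns (index, name) pairs for each interface.
    return [name for _index, name in socket.if_nameindex()]


def interface_exists(name):
    """Return True if an interface with this exact name is present."""
    return name in list_interface_names()
```

Running print(list_interface_names()) on each node shows which names are available; if interface_exists("eth0") is False on a node, the value given to NCCL_SOCKET_IFNAME probably needs to be changed for that node.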
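But this didn't work for my problem; I still got the same error. After searching for a long time, it turned out to be a firewall issue. Turn off the firewall on both servers and it works, turn off the firewall on both servers and it works, turn off the firewall on both servers and it works.
Important things are worth saying three times.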
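A quick way to see whether a firewall is in the way is to test from one server whether a TCP port on the other is reachable. This is a rough sketch with the standard library only; can_connect is a hypothetical helper, and the host and port you pass to it are whatever your own launch command uses for the master node (they are not given in this post). As far as I understand, NCCL also opens connections on other, dynamically chosen ports, so a successful check on one port does not prove the firewall allows everything NCCL needs.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, while the training job is starting on the master node, calling can_connect(master_ip, master_port) from the other server (with your own values substituted) returning False would point toward a blocked port or a wrong address rather than an NCCL bug.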