[Multi-node, multi-GPU] mmsegmentation training fails with "RuntimeError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/"

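Multi-node, multi-GPU training (2 machines, 4 GPUs each) fails with an NCCL error. The error and the launch commands used on each machine are as follows.

Error message: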

RuntimeError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1248, unhandled system error, NCCL version 2.12.10

Command on the first machine:

NNODES=2 NODE_RANK=0 PORT=8888 MASTER_ADDR=192.168.XX.XX sh tools/dist_train.sh ./configs/temp.py 4

Command on the second machine:

NNODES=2 NODE_RANK=1 PORT=8888 MASTER_ADDR=192.168.XX.XX sh tools/dist_train.sh ./configs/temp.py 4
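
For context, the variables in the launch commands above are read by tools/dist_train.sh and forwarded to PyTorch's distributed launcher. The sketch below shows roughly what that mapping looks like. It is an assumption based on the usual OpenMMLab script layout, not a copy of the script in your checkout, and the defaults shown (1, 0, 29500, 127.0.0.1) may differ in your version. It only prints the launcher command instead of running it.

```shell
#!/bin/sh
# Hypothetical sketch of how the launch variables map to launcher flags.
# The real tools/dist_train.sh in your mmsegmentation version may differ.
CONFIG=./configs/temp.py                  # config file passed on the command line
GPUS=4                                    # GPUs (processes) per machine
NNODES="${NNODES:-1}"                     # total number of machines
NODE_RANK="${NODE_RANK:-0}"               # index of this machine, 0 .. NNODES-1
PORT="${PORT:-29500}"                     # rendezvous port on the master machine
MASTER_ADDR="${MASTER_ADDR:-127.0.0.1}"   # address of the NODE_RANK=0 machine

LAUNCH_CMD="python -m torch.distributed.launch --nnodes=$NNODES --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --nproc_per_node=$GPUS --master_port=$PORT tools/train.py $CONFIG --launcher pytorch"

# Print instead of executing, so the mapping can be inspected safely.
echo "$LAUNCH_CMD"
```

Both machines must use the same NNODES, PORT, and MASTER_ADDR. Only NODE_RANK differs, and MASTER_ADDR must be an address of the first machine that the second machine can reach.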

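Solution: disable NCCL's InfiniBand and P2P transports and turn on NCCL debug logging before launching. On the first machine (use NODE_RANK=1 on the second):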

export NCCL_IB_DISABLE=1; export NCCL_P2P_DISABLE=1; NCCL_DEBUG=INFO NNODES=2 NODE_RANK=0 PORT=8888 MASTER_ADDR=192.168.XX.XX sh tools/dist_train.sh ./configs/temp.py 4
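
The same workaround can be kept in a small wrapper script so that both machines run identical settings. This is a minimal sketch, not part of mmsegmentation. It assumes it is run from the mmsegmentation root directory and that NODE_RANK is supplied from the environment. NCCL_IB_DISABLE=1 makes NCCL skip the InfiniBand/RoCE transport and fall back to IP sockets, NCCL_P2P_DISABLE=1 turns off direct GPU-to-GPU transfers, and NCCL_DEBUG=INFO prints NCCL's initialization log.

```shell
#!/bin/sh
# Sketch: apply the NCCL workaround on each machine before launching.
# NODE_RANK is the only value that differs between the two machines.
NODE_RANK="${NODE_RANK:-0}"   # 0 on the first machine, 1 on the second

export NCCL_IB_DISABLE=1    # skip the InfiniBand/RoCE transport, fall back to IP sockets
export NCCL_P2P_DISABLE=1   # skip direct GPU-to-GPU (P2P) transfers
export NCCL_DEBUG=INFO      # log NCCL initialization and transport selection

# Launch only if the training script is present in the current directory tree.
if [ -f tools/dist_train.sh ]; then
    NNODES=2 NODE_RANK="$NODE_RANK" PORT=8888 MASTER_ADDR=192.168.XX.XX \
        sh tools/dist_train.sh ./configs/temp.py 4
else
    echo "tools/dist_train.sh not found; run this from the mmsegmentation root" >&2
fi
```

Saved under a hypothetical name such as run_node.sh, it would be started with `NODE_RANK=0 sh run_node.sh` on the first machine and `NODE_RANK=1 sh run_node.sh` on the second. Disabling these transports may reduce communication throughput, so once training starts it can be worth re-enabling them one at a time to see which one actually triggers the error.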

 
