Fixing the DeepSpeed error "up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store"

Reference

https://github.com/NVIDIA/nccl/issues/708

Problem

The following error is raised when running with DeepSpeed:

RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer

Solution

1. Look up the name of your network interface:

ifconfig

2. Set NCCL_SOCKET_IFNAME to the correct interface:

export NCCL_SOCKET_IFNAME=[interface name found in step 1]

With the correct interface set, the error goes away.
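The two steps can be combined into a small script. This is only a sketch under a few assumptions: it expects the `ip` tool from iproute2 to be installed, it guesses the interface from the route to an arbitrary public address (1.1.1.1), and the helper names `iface_from_route` and `detect_iface` are made up for illustration. On a multi-NIC machine the guess may not be the interface your nodes use to reach each other, so check it against the `ifconfig` output before relying on it.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `ip route get` output on stdin and print
# the interface name that follows the "dev" keyword.
iface_from_route() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Hypothetical helper: guess the interface that carries outbound traffic.
# Prints nothing if `ip` is missing or no route is found.
detect_iface() {
    ip route get 1.1.1.1 2>/dev/null | iface_from_route
}

iface="$(detect_iface || true)"

if [ -n "$iface" ]; then
    export NCCL_SOCKET_IFNAME="$iface"
    # NCCL_DEBUG=INFO makes NCCL log details such as the interface it selected.
    export NCCL_DEBUG=INFO
    echo "NCCL_SOCKET_IFNAME=$NCCL_SOCKET_IFNAME"
else
    echo "Could not detect an interface; set NCCL_SOCKET_IFNAME manually." >&2
fi
```

Source the script (for example `source set_nccl_iface.sh`, where the file name is just a placeholder) rather than executing it, so the exported variables stay in the shell that launches `deepspeed`.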
