PyTorch multi-node multi-GPU error: Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA

When running PyTorch multi-node multi-GPU (DDP) training, the following error is raised:

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

Solution

多台机器上面的CUDA和Pytorch版本号不一致,必须要保证每台机器上面的CUDA和Pytorch版本号一致才能运行成功
