PyTorch multi-GPU distributed training: all_gather_object blocks and deadlocks

In PyTorch multi-GPU distributed training, the all_gather_object function (backed by torch._C._distributed_c10d) can block indefinitely and deadlock while waiting for the other ranks.

The fix is to call torch.cuda.set_device(local_rank) before any process-group communication. The PyTorch documentation explains why:

For NCCL-based process groups, internal tensor representations of objects must be moved to the GPU device before communication takes place. In this case, the device used is given by torch.cuda.current_device() and it is the user's responsibility to ensure that this is set so that each rank has an individual GPU, via torch.cuda.set_device().
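Below is a minimal sketch of this fix (assuming a launch via torchrun, which provides the LOCAL_RANK environment variable; the script name and the gathered object are made up for illustration). The key point is that torch.cuda.set_device(local_rank) runs before init_process_group and before all_gather_object, so each rank binds to its own GPU:

```python
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])

    # Bind this process to its own GPU *before* any collective call, so that
    # torch.cuda.current_device() is correct when NCCL moves the serialized
    # object tensors onto the GPU. Skipping this is what causes the hang.
    torch.cuda.set_device(local_rank)

    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Gather an arbitrary picklable Python object from every rank.
    obj = {"rank": rank, "msg": f"hello from rank {rank}"}
    gathered = [None] * world_size
    dist.all_gather_object(gathered, obj)

    if rank == 0:
        print(gathered)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Example launch on a single machine with N GPUs: `torchrun --nproc_per_node=N demo_all_gather_object.py`. If set_device is omitted, every rank defaults to cuda:0, the ranks contend for the same device, and the collective never completes.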
