RuntimeError: The server socket has failed to listen on any local network address.

Error details: RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
This error occurred while I was using torch.nn.parallel.DistributedDataParallel for distributed training. I launched program A with python -m torch.distributed.launch --nproc_per_node=2 trainA.py, and it worked fine. Then, while A was still running, I tried to launch program B with python -m torch.distributed.launch --nproc_per_node=2 trainB.py and got the error above.
It turns out the problem is the master port, not the network address as such. As the error reports, port 29500 is already in use: it is the default master port of torch.distributed.launch, so program A had already taken it. Giving program B a different port fixes the conflict, so I ran python -m torch.distributed.launch --nproc_per_node=2 --master_port='29501' trainB.py.
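Picking 29501 by hand works until something else is already using that port too. As a rough sketch (not part of PyTorch; the helper names is_port_free and find_free_port are my own, and I assume checking on 127.0.0.1 is enough for a single-machine run), the standard socket module can check a port or ask the OS for an unused one:

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "Address already in use", as in the error above.
            return False
        return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]


if __name__ == "__main__":
    # Hypothetical usage: print a port to pass to --master_port.
    print(find_free_port())
```

The printed number can then be passed as --master_port when launching trainB.py. Note that the port is released before the launcher starts, so another process could in principle grab it in between; for a couple of training jobs on one machine this is usually good enough.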
Problem solved!!!
