Tutorials (3) and (4) covered the low-level logic behind DistributedDataParallel, so you should now have a reasonable understanding of distributed data parallelism. PyTorch provides a convenient interface, torch.nn.parallel.DistributedDataParallel, which makes it fairly easy to convert our code to distributed data-parallel mode. In this tutorial, I will modify the code step by step into a DDP version launched with torch.distributed.launch.
To better understand this tutorial, we need to care about what torch.distributed.launch actually does. Let's first look at the arguments torch.distributed.launch takes, obtained with python -m torch.distributed.launch --help. The source code can be found in torch.distributed.launch.
> python -m torch.distributed.launch --help
usage: launch.py [-h] [--nnodes NNODES] [--node_rank NODE_RANK]
[--nproc_per_node NPROC_PER_NODE] [--master_addr MASTER_ADDR]
[--master_port MASTER_PORT] [--use_env] [-m] [--no_python]
training_script ...
Here I describe the arguments of torch.distributed.launch in detail:
nnodes: the number of nodes; one node corresponds to one machine.
node_rank: the index of the node, starting from 0.
nproc_per_node: the number of processes on a node; usually one process uses one GPU, so this is often described as the number of GPUs on the node.
master_addr: the IP address of the master node, i.e. the machine whose rank is 0. The purpose of this argument is to let the other nodes know where node 0 is, so they can send the parameters they train to it for processing.
master_port: the port of the master node, used for communication.
use_env: with --use_env, PyTorch puts the local_rank of the current process into an environment variable instead of into args.local_rank. Note that the official documentation now recommends deprecating torch.distributed.launch in favor of torchrun. In torchrun, the --use_env flag has been removed and its behavior is the default, which forces users to read the rank of the current process on its machine from the LOCAL_RANK environment variable.
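If your code needs to run both under the legacy launcher (without --use_env) and under torchrun, a minimal compatibility sketch might look like the following (the --local_rank argument name is what the legacy launcher injects; everything else is my own illustration, not part of the tutorial code):
import argparse
import os
# Sketch (assumption): prefer the LOCAL_RANK environment variable set by
# --use_env / torchrun, and fall back to the --local_rank argument that the
# legacy torch.distributed.launch passes to the training script.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args, _ = parser.parse_known_args()
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
print(f"resolved local rank: {local_rank}")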
After the code is launched with torch.distributed.launch, each process gets five variables set in its environment: MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK and WORLD_SIZE. RANK, LOCAL_RANK and WORLD_SIZE are described below:
RANK: obtained with os.environ["RANK"]; the index of the process, where generally one GPU corresponds to one process. It is a global index, starting from 0, with a maximum value of the total number of GPUs minus 1.
LOCAL_RANK: obtained with os.environ["LOCAL_RANK"]; the index of the process within its own machine, starting from 0, with a maximum value of the number of GPUs on that machine minus 1.
WORLD_SIZE: obtained with os.environ["WORLD_SIZE"]; the total number of processes launched (summed over all machines).
To make this easier to understand, here is an example: suppose we use 2 machines with 4 GPUs each. Then RANK takes values in [0, 7], LOCAL_RANK on each machine takes values in [0, 3], and WORLD_SIZE is 8.
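To make the mapping concrete, here is a tiny sketch (my own illustration, not code from the launcher) of how the global RANK relates to node_rank and LOCAL_RANK in this 2-machine, 4-GPU setup:
# Illustration only (assumption): torch.distributed.launch assigns global ranks
# as node_rank * nproc_per_node + local_rank.
nproc_per_node = 4
for node_rank in range(2):                     # 2 machines
    for local_rank in range(nproc_per_node):   # 4 GPUs per machine
        rank = node_rank * nproc_per_node + local_rank
        print(f"node_rank={node_rank}, LOCAL_RANK={local_rank} -> RANK={rank}")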
Next, I run an actual test on our servers (2 servers, 4 GPUs each) to print out these five variables. The code is as follows:
import os
import time
import torch.distributed as dist
print("before running dist.init_process_group()")
MASTER_ADDR = os.environ["MASTER_ADDR"]
MASTER_PORT = os.environ["MASTER_PORT"]
LOCAL_RANK = os.environ["LOCAL_RANK"]
RANK = os.environ["RANK"]
WORLD_SIZE = os.environ["WORLD_SIZE"]
print("MASTER_ADDR: {}\tMASTER_PORT: {}".format(MASTER_ADDR, MASTER_PORT))
print("LOCAL_RANK: {}\tRANK: {}\tWORLD_SIZE: {}".format(LOCAL_RANK, RANK, WORLD_SIZE))
dist.init_process_group('nccl')
print("after running dist.init_process_group()")
time.sleep(60) # Sleep for a while to avoid exceptions that occur when some processes end too quickly.
dist.destroy_process_group()
We first test the single-machine, multi-GPU case. The syntax is:
> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
arguments of your training script)
Next, we run the code. We can see that torch.distributed.launch automatically added MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK and WORLD_SIZE to the environment variables.
> CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1 MASTER_PORT: 29500
LOCAL_RANK: 0 RANK: 0 WORLD_SIZE: 2
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1 MASTER_PORT: 29500
LOCAL_RANK: 1 RANK: 1 WORLD_SIZE: 2
after running dist.init_process_group()
after running dist.init_process_group()
> CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1 MASTER_PORT: 29500
LOCAL_RANK: 0 RANK: 0 WORLD_SIZE: 4
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1 MASTER_PORT: 29500
LOCAL_RANK: 2 RANK: 2 WORLD_SIZE: 4
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1 MASTER_PORT: 29500
LOCAL_RANK: 3 RANK: 3 WORLD_SIZE: 4
before running dist.init_process_group()
MASTER_ADDR: 127.0.0.1 MASTER_PORT: 29500
LOCAL_RANK: 1 RANK: 1 WORLD_SIZE: 4
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()
Taking 2 machines as an example, suppose the master node's IP address is 192.168.1.1 and its port is 1234.
The syntax on machine 1 is as follows:
> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1"
--master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
and all other arguments of your training script)
The syntax on machine 2 is as follows:
> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
--nnodes=2 --node_rank=1 --master_addr="192.168.1.1"
--master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
and all other arguments of your training script)
The results of running on our servers are as follows:
Machine 1 (master, IP: 192.168.1.105):
> python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr='192.168.1.105' --master_port='12345' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105 MASTER_PORT: 12345
LOCAL_RANK: 0 RANK: 0 WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105 MASTER_PORT: 12345
LOCAL_RANK: 1 RANK: 1 WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105 MASTER_PORT: 12345
LOCAL_RANK: 3 RANK: 3 WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105 MASTER_PORT: 12345
LOCAL_RANK: 2 RANK: 2 WORLD_SIZE: 8
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()
Machine 2 (IP: 192.168.1.106):
> python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 1 --master_addr='192.168.1.105' --master_port='12345' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105 MASTER_PORT: 12345
LOCAL_RANK: 0 RANK: 4 WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105 MASTER_PORT: 12345
LOCAL_RANK: 1 RANK: 5 WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105 MASTER_PORT: 12345
LOCAL_RANK: 3 RANK: 7 WORLD_SIZE: 8
before running dist.init_process_group()
MASTER_ADDR: 192.168.1.105 MASTER_PORT: 12345
LOCAL_RANK: 2 RANK: 6 WORLD_SIZE: 8
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()
after running dist.init_process_group()
For more details, you can go straight to my code:
https://github.com/HongxinXiang/pytorch-multi-GPU-training-tutorial/tree/master/test/torch_distributed_launch
Now let's formally start converting the basic training code into DDP training code. There are four main places to modify.
1. At the very beginning of the code, initialize the environment DDP needs.
def setup_DDP(backend="nccl", verbose=False):
    """
    We don't set ADDR and PORT in here, like:
        # os.environ['MASTER_ADDR'] = 'localhost'
        # os.environ['MASTER_PORT'] = '12355'
    because the program's ADDR and PORT can be given automatically at startup.
    E.g. you can set ADDR and PORT by using:
        python -m torch.distributed.launch --master_addr="192.168.1.201" --master_port=23456 ...
    You also don't need to set rank and world_size in dist.init_process_group() explicitly.
    :param backend: communication backend, e.g. "nccl" for GPU training
    :param verbose: if True, print the rank information of the current process
    :return: rank, local_rank, world_size, device
    """
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # If the OS is Windows or macOS, use gloo instead of nccl
    dist.init_process_group(backend=backend)
    # set distributed device
    device = torch.device("cuda:{}".format(local_rank))
    if verbose:
        print(f"local rank: {local_rank}, global rank: {rank}, world size: {world_size}")
    return rank, local_rank, world_size, device

rank, local_rank, world_size, device = setup_DDP(verbose=True)
2. batch_size, DistributedSampler and DataLoader:
batch_size: I divide the original batch_size=64 by world_size, so each GPU processes its own share of the data. When passing the batch_size argument, it should be increased appropriately as the number of GPUs grows.
DistributedSampler: initialize a DistributedSampler for the training and test sets.
DataLoader: when initializing the DataLoader, pass the sampler argument.
batch_size = 64 // world_size  # [*] // world_size
train_sampler = DistributedSampler(training_data, shuffle=True)  # [*]
test_sampler = DistributedSampler(test_data, shuffle=False)  # [*]
train_dataloader = DataLoader(training_data, batch_size=batch_size, sampler=train_sampler)  # [*] sampler=...
test_dataloader = DataLoader(test_data, batch_size=batch_size, sampler=test_sampler)  # [*] sampler=...
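As an optional sanity check (my own addition, not part of the tutorial code), each rank should now see roughly 1/world_size of the training set:
# Optional sanity check (assumption, not in the original script): with a
# DistributedSampler, each rank holds about len(training_data) / world_size samples.
print(f"rank {rank}: {len(train_sampler)} of {len(training_data)} training samples, "
      f"{len(train_dataloader)} batches of batch size {batch_size}")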
3. Wrap the defined model with torch.nn.parallel.DistributedDataParallel, and explicitly specify the device the model uses (device_ids) as well as the device where the output lives (output_device).
from torch.nn.parallel import DistributedDataParallel as DDP
# initialize model
model = NeuralNetwork().to(device) # copy model from cpu to gpu
# [*] using DistributedDataParallel
model = DDP(model, device_ids=[local_rank], output_device=local_rank) # [*] DDP(...)
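One small detail worth adding here (my own suggestion, not something the tutorial requires): it is common to also pin the process to its GPU before building the model, so that any tensor created with a bare .cuda() call lands on the correct device:
# Optional (assumption, a common convention rather than part of the tutorial):
# bind this process to its local GPU before constructing the model.
torch.cuda.set_device(local_rank)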
4. Saving the model is the same as with a single machine and a single GPU. To avoid saving the model multiple times, we save it only on the master host.
# [*] save model on rank 0
if dist.get_rank() == 0:
    model_state_dict = model.state_dict()
    torch.save(model_state_dict, "model.pth")
    print("Saved PyTorch Model State to model.pth")
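A small optional refinement (my own note, not part of the tutorial code): DDP stores the wrapped network in model.module, so saving model.module.state_dict() gives a checkpoint whose keys carry no "module." prefix and can be loaded directly into a plain, non-DDP model:
# Optional variant (assumption, not in the original code): save the unwrapped module
# so the checkpoint loads into a plain NeuralNetwork() without renaming keys.
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "model.pth")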
Besides these, there are two optional modifications (a combined sketch of the resulting epoch loop follows after the second item):
1. Set the sampler's epoch at the start of each epoch, so that the sampler knows which epoch training is currently on.
# [*] set sampler
train_dataloader.sampler.set_epoch(t)
test_dataloader.sampler.set_epoch(t)
2. Print training and test logs only on the host with rank=0. Some other print() calls can also be changed to print_only_rank0().
def print_only_rank0(log):
    if dist.get_rank() == 0:
        print(log)

def train(...):
    ...
    # [*] only print log on rank 0
    if dist.get_rank() == 0 and batch % 100 == 0:
        loss, current = loss.item(), batch * len(X)
        print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
    ...

def test(...):
    ...
    # [*] only print log on rank 0
    print_only_rank0(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
    ...
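Putting the two optional modifications together, a minimal sketch of the epoch loop could look like the following (the train/test functions, loss_fn and optimizer are assumed to be the ones from the earlier, non-distributed version of this tutorial):
# Sketch only (assumptions: train(), test(), loss_fn and optimizer are defined
# as in the earlier single-GPU tutorial code).
epochs = 5
for t in range(epochs):
    print_only_rank0(f"Epoch {t + 1}\n-------------------------------")
    # [*] tell the DistributedSampler which epoch this is, so shuffling differs per epoch
    train_dataloader.sampler.set_epoch(t)
    test_dataloader.sampler.set_epoch(t)
    train(train_dataloader, model, loss_fn, optimizer)
    test(test_dataloader, model, loss_fn)
print_only_rank0("Done!")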
Finally, the complete code can be accessed at the link below:
https://github.com/HongxinXiang/pytorch-multi-GPU-training-tutorial/blob/master/single-machine-and-multi-GPU-DistributedDataParallel-launch.py
We run it on 2 servers, whose IPs are 192.168.1.105 (master) and 192.168.1.106. Each machine has 4 GPUs.
For multi-machine, multi-GPU training, we need to pay attention to two points before running:
1. Make sure the machines can ping each other;
2. You may hit the error RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8. This may mean that NCCL is not installed correctly, that NCCL failed to establish communication, or that a firewall is blocking traffic, among other causes. We can use environment variables to switch the run into DEBUG mode, which gives us more information to locate the error, as shown in the commands below (see torch.distributed for more details).
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
Next, let's run our program.
Machine 1 (master, IP: 192.168.1.105):
> CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr='192.168.1.105' --master_port='12345' single-machine-and-multi-GPU-DistributedDataParallel-launch.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Using device: cuda:3
local rank: 3, global rank: 3, world size: 8
Using device: cuda:2
Using device: cuda:1
local rank: 1, global rank: 1, world size: 8
local rank: 2, global rank: 2, world size: 8
Using device: cuda:0
local rank: 0, global rank: 0, world size: 8
tesla-105:1475:1475 [0] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1475:1475 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
tesla-105:1475:1475 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1475:1475 [0] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1475:1475 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.1
tesla-105:1477:1477 [1] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1477:1477 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
tesla-105:1477:1477 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1477:1477 [1] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1477:1477 [1] NCCL INFO Using network Socket
tesla-105:1481:1481 [3] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1481:1481 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
tesla-105:1481:1481 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1481:1481 [3] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1481:1481 [3] NCCL INFO Using network Socket
tesla-105:1480:1480 [2] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.105<0>
tesla-105:1480:1480 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
tesla-105:1480:1480 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-105:1480:1480 [2] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.105<0>
tesla-105:1480:1480 [2] NCCL INFO Using network Socket
tesla-105:1481:2165 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1481:2165 [3] NCCL INFO Trees [0] -1/-1/-1->3->2|2->3->-1/-1/-1 [1] -1/-1/-1->3->2|2->3->-1/-1/-1
tesla-105:1481:2165 [3] NCCL INFO Setting affinity for GPU 3 to ff,c00ffc00
tesla-105:1475:2146 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
tesla-105:1477:2150 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1477:2150 [1] NCCL INFO Trees [0] 2/4/-1->1->0|0->1->2/4/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
tesla-105:1477:2150 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
tesla-105:1475:2146 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7
tesla-105:1475:2146 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1480:2169 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-105:1480:2169 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
tesla-105:1480:2169 [2] NCCL INFO Setting affinity for GPU 2 to ff,c00ffc00
tesla-105:1475:2146 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->5|5->0->1/-1/-1
tesla-105:1475:2146 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
tesla-105:1481:2165 [3] NCCL INFO Channel 00 : 3[b1000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 00 : 2[af000] -> 3[b1000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 2[af000] via direct shared memory
tesla-105:1475:2146 [0] NCCL INFO Channel 00 : 7[b1000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 00 : 2[af000] -> 1[1a000] via direct shared memory
tesla-105:1475:2146 [0] NCCL INFO Channel 00 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 4[18000] -> 1[1a000] [receive] via NET/Socket/0
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 7[b1000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-105:1481:2165 [3] NCCL INFO Channel 00 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 00 : 1[1a000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1481:2165 [3] NCCL INFO Channel 01 : 3[b1000] -> 4[18000] [send] via NET/Socket/0
tesla-105:1480:2169 [2] NCCL INFO Channel 01 : 2[af000] -> 3[b1000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO Channel 01 : 1[1a000] -> 2[af000] via direct shared memory
tesla-105:1481:2165 [3] NCCL INFO Channel 01 : 3[b1000] -> 2[af000] via P2P/IPC
tesla-105:1480:2169 [2] NCCL INFO Channel 01 : 2[af000] -> 1[1a000] via direct shared memory
tesla-105:1481:2165 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1481:2165 [3] NCCL INFO comm 0x7fd634001060 rank 3 nranks 8 cudaDev 3 busId b1000 - Init COMPLETE
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 0[18000] -> 5[1a000] [send] via NET/Socket/0
tesla-105:1477:2150 [1] NCCL INFO Channel 01 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-105:1477:2150 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1477:2150 [1] NCCL INFO comm 0x7f30b4001060 rank 1 nranks 8 cudaDev 1 busId 1a000 - Init COMPLETE
tesla-105:1480:2169 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1480:2169 [2] NCCL INFO comm 0x7f37d4001060 rank 2 nranks 8 cudaDev 2 busId af000 - Init COMPLETE
tesla-105:1475:2146 [0] NCCL INFO Channel 01 : 5[1a000] -> 0[18000] [receive] via NET/Socket/0
tesla-105:1475:2146 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-105:1475:2146 [0] NCCL INFO comm 0x7f9e54001060 rank 0 nranks 8 cudaDev 0 busId 18000 - Init COMPLETE
tesla-105:1475:1475 [0] NCCL INFO Launch mode Parallel
DistributedDataParallel(
(module): NeuralNetwork(
(flatten): Flatten(start_dim=1, end_dim=-1)
(linear_relu_stack): Sequential(
(0): Linear(in_features=784, out_features=512, bias=True)
(1): ReLU()
(2): Linear(in_features=512, out_features=512, bias=True)
(3): ReLU()
(4): Linear(in_features=512, out_features=10, bias=True)
)
)
)
Epoch 1
-------------------------------
loss: 2.294374 [ 0/60000]
loss: 2.301075 [ 800/60000]
loss: 2.315739 [ 1600/60000]
loss: 2.299692 [ 2400/60000]
loss: 2.258646 [ 3200/60000]
loss: 2.252302 [ 4000/60000]
loss: 2.218223 [ 4800/60000]
loss: 2.126724 [ 5600/60000]
loss: 2.174220 [ 6400/60000]
loss: 2.177455 [ 7200/60000]
Test Error:
Accuracy: 4.1%, Avg loss: 2.166388
Epoch 2
-------------------------------
loss: 2.136480 [ 0/60000]
loss: 2.127040 [ 800/60000]
loss: 2.118551 [ 1600/60000]
loss: 2.051364 [ 2400/60000]
loss: 2.076279 [ 3200/60000]
loss: 2.002108 [ 4000/60000]
loss: 2.075573 [ 4800/60000]
loss: 1.959522 [ 5600/60000]
loss: 1.861534 [ 6400/60000]
loss: 1.872814 [ 7200/60000]
Test Error:
Accuracy: 7.2%, Avg loss: 1.908959
Epoch 3
-------------------------------
loss: 2.081742 [ 0/60000]
loss: 1.841850 [ 800/60000]
loss: 1.939971 [ 1600/60000]
loss: 1.684577 [ 2400/60000]
loss: 1.648371 [ 3200/60000]
loss: 1.774270 [ 4000/60000]
loss: 1.552769 [ 4800/60000]
loss: 1.508346 [ 5600/60000]
loss: 1.516589 [ 6400/60000]
loss: 1.481997 [ 7200/60000]
Test Error:
Accuracy: 7.8%, Avg loss: 1.533547
Epoch 4
-------------------------------
loss: 1.625404 [ 0/60000]
loss: 1.543570 [ 800/60000]
loss: 1.428792 [ 1600/60000]
loss: 1.446484 [ 2400/60000]
loss: 1.841029 [ 3200/60000]
loss: 1.320562 [ 4000/60000]
loss: 1.511142 [ 4800/60000]
loss: 1.444456 [ 5600/60000]
loss: 1.570060 [ 6400/60000]
loss: 1.482602 [ 7200/60000]
Test Error:
Accuracy: 8.0%, Avg loss: 1.256674
Epoch 5
-------------------------------
loss: 1.064455 [ 0/60000]
loss: 1.233810 [ 800/60000]
loss: 1.168940 [ 1600/60000]
loss: 1.227281 [ 2400/60000]
loss: 1.437644 [ 3200/60000]
loss: 1.195065 [ 4000/60000]
loss: 1.305991 [ 4800/60000]
loss: 1.258441 [ 5600/60000]
loss: 0.970569 [ 6400/60000]
loss: 1.698888 [ 7200/60000]
Test Error:
Accuracy: 8.2%, Avg loss: 1.083617
Done!
Saved PyTorch Model State to model.pth
Machine 2 (IP: 192.168.1.106):
> CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 1 --master_addr='192.168.1.105' --master_port='12345' single-machine-and-multi-GPU-DistributedDataParallel-launch.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Using device: cuda:0
Using device: cuda:1
local rank: 1, global rank: 5, world size: 8
local rank: 0, global rank: 4, world size: 8
Using device: cuda:2
local rank: 2, global rank: 6, world size: 8
Using device: cuda:3
local rank: 3, global rank: 7, world size: 8
tesla-106:1942:1942 [1] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1942:1942 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
tesla-106:1942:1942 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1942:1942 [1] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1942:1942 [1] NCCL INFO Using network Socket
tesla-106:1988:1988 [3] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1988:1988 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
tesla-106:1988:1988 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1988:1988 [3] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1988:1988 [3] NCCL INFO Using network Socket
tesla-106:1943:1943 [2] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1943:1943 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
tesla-106:1943:1943 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1943:1943 [2] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1943:1943 [2] NCCL INFO Using network Socket
tesla-106:1940:1940 [0] NCCL INFO Bootstrap : Using [0]eno2:192.168.1.106<0>
tesla-106:1940:1940 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
tesla-106:1940:1940 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
tesla-106:1940:1940 [0] NCCL INFO NET/Socket : Using [0]eno2:192.168.1.106<0>
tesla-106:1940:1940 [0] NCCL INFO Using network Socket
tesla-106:1988:2787 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1988:2787 [3] NCCL INFO Trees [0] -1/-1/-1->7->6|6->7->-1/-1/-1 [1] -1/-1/-1->7->6|6->7->-1/-1/-1
tesla-106:1988:2787 [3] NCCL INFO Setting affinity for GPU 3 to ff,c00ffc00
tesla-106:1943:2821 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1943:2821 [2] NCCL INFO Trees [0] 7/-1/-1->6->5|5->6->7/-1/-1 [1] 7/-1/-1->6->5|5->6->7/-1/-1
tesla-106:1943:2821 [2] NCCL INFO Setting affinity for GPU 2 to ff,c00ffc00
tesla-106:1942:2786 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1942:2786 [1] NCCL INFO Trees [0] 6/-1/-1->5->4|4->5->6/-1/-1 [1] 6/0/-1->5->4|4->5->6/0/-1
tesla-106:1942:2786 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
tesla-106:1940:2831 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
tesla-106:1940:2831 [0] NCCL INFO Trees [0] 5/-1/-1->4->1|1->4->5/-1/-1 [1] 5/-1/-1->4->-1|-1->4->5/-1/-1
tesla-106:1942:2786 [1] NCCL INFO Channel 00 : 5[1a000] -> 6[af000] via direct shared memory
tesla-106:1940:2831 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
tesla-106:1943:2821 [2] NCCL INFO Channel 00 : 6[af000] -> 7[b1000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 00 : 7[b1000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 3[b1000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 4[18000] -> 5[1a000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 00 : 7[b1000] -> 6[af000] via P2P/IPC
tesla-106:1942:2786 [1] NCCL INFO Channel 00 : 5[1a000] -> 4[18000] via P2P/IPC
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 4[18000] -> 1[1a000] [send] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 00 : 1[1a000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1988:2787 [3] NCCL INFO Channel 01 : 7[b1000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1943:2821 [2] NCCL INFO Channel 00 : 6[af000] -> 5[1a000] via direct shared memory
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 6[af000] via direct shared memory
tesla-106:1943:2821 [2] NCCL INFO Channel 01 : 6[af000] -> 7[b1000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO Channel 01 : 7[b1000] -> 6[af000] via P2P/IPC
tesla-106:1988:2787 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1988:2787 [3] NCCL INFO comm 0x7fbb14001060 rank 7 nranks 8 cudaDev 3 busId b1000 - Init COMPLETE
tesla-106:1943:2821 [2] NCCL INFO Channel 01 : 6[af000] -> 5[1a000] via direct shared memory
tesla-106:1940:2831 [0] NCCL INFO Channel 01 : 3[b1000] -> 4[18000] [receive] via NET/Socket/0
tesla-106:1940:2831 [0] NCCL INFO Channel 01 : 4[18000] -> 5[1a000] via P2P/IPC
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 0[18000] -> 5[1a000] [receive] via NET/Socket/0
tesla-106:1943:2821 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1943:2821 [2] NCCL INFO comm 0x7f6fec001060 rank 6 nranks 8 cudaDev 2 busId af000 - Init COMPLETE
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 4[18000] via P2P/IPC
tesla-106:1940:2831 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1940:2831 [0] NCCL INFO comm 0x7f5550001060 rank 4 nranks 8 cudaDev 0 busId 18000 - Init COMPLETE
tesla-106:1942:2786 [1] NCCL INFO Channel 01 : 5[1a000] -> 0[18000] [send] via NET/Socket/0
tesla-106:1942:2786 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tesla-106:1942:2786 [1] NCCL INFO comm 0x7f75d4001060 rank 5 nranks 8 cudaDev 1 busId 1a000 - Init COMPLETE
While running the code, we hit an error:
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
tesla-105:29334:30213 [1] NCCL INFO Channel 00 : 6[af000] -> 1[1a000] [receive] via NET/Socket/1
tesla-105:29334:30213 [1] NCCL INFO Channel 00 : 1[1a000] -> 0[18000] via P2P/IPC
tesla-105:29331:30215 [0] NCCL INFO Channel 01 : 4[18000] -> 0[18000] [receive] via NET/Socket/1
tesla-105:29331:30215 [0] NCCL INFO Channel 01 : 0[18000] -> 1[1a000] via P2P/IPC
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying
tesla-105:29336:30216 [2] NCCL INFO Call to connect returned Connection refused, retrying
This problem is related to network communication. To resolve it, you can try the following:
1. The master port may already be in use; try a different --master_port;
2. Explicitly specify the network interface NCCL should use via the NCCL_SOCKET_IFNAME environment variable (you can use ifconfig to check the available network interfaces), for example:
> export NCCL_SOCKET_IFNAME=eth0