DDP increases parallelism by raising the effective batch size, improves communication efficiency through Ring-Reduce (ring all-reduce) data exchange, and sidesteps the Python GIL by launching multiple processes, all of which speeds up training. DDP is consistently and significantly faster than DP, reaching a speedup slightly below the number of GPUs (for example, roughly 3x on four GPUs).
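Conceptually, each DDP process computes gradients locally and then averages them with an all-reduce collective, which NCCL typically implements as a ring all-reduce. Below is a minimal illustrative sketch of that collective (not part of the original tutorial), assuming the script is launched with torchrun so every process gets its own LOCAL_RANK:
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# Each rank starts with its own "gradient"; after all_reduce every rank holds the sum
grad = torch.ones(4, device=f"cuda:{local_rank}") * dist.get_rank()
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()   # averaging, which is what DDP does with gradients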
In detail:
In DP mode there is only a single process (so it is heavily constrained by the GIL). The master GPU effectively acts as a parameter server: it broadcasts its parameters to the other GPUs; after the backward pass, each GPU sends its gradients to the master, which averages the gathered gradients, updates the parameters, and then sends the updated parameters back to the other GPUs. This update scheme concentrates both compute and communication on the master, which congests the link and slows down training. DP's advantage is that it is very simple to use in code.
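For comparison, here is a minimal DP sketch (my own illustration, not from the original; single process, all visible GPUs, with the same toy model and shapes as the DDP example below):
import torch
import torch.nn as nn
from torch import optim

# DP: one process drives every visible GPU; no launcher or process group is needed
model = nn.DataParallel(nn.Linear(10, 10).cuda())   # GPU 0 acts as the "master"
optimizer = optim.SGD(model.parameters(), lr=0.001)

outputs = model(torch.randn(20, 10).cuda())   # the batch is scattered across GPUs automatically
loss = nn.MSELoss()(outputs, torch.randn(20, 10).cuda())
loss.backward()                               # gradients end up on the master's copy of the model
optimizer.step()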
import os
# new: read local_rank from the environment variable set by torchrun
local_rank = int(os.environ["LOCAL_RANK"])
import torch
import torch.nn as nn
from torch import optim
import torch.distributed as dist
##############################################
# Old way (works with torch.distributed.launch, now deprecated):
# import argparse
# parser = argparse.ArgumentParser()
# parser.add_argument("--local_rank", type=int, default=-1)
# FLAGS = parser.parse_args()
# local_rank = FLAGS.local_rank
##############################################
# Bind this process to its own GPU and join the process group
torch.cuda.set_device(local_rank)
dist.init_process_group(backend='nccl')
device = torch.device("cuda", local_rank)
# Build the model and wrap it with DDP
model = nn.Linear(10, 10).to(device)
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
# Forward pass
outputs = model(torch.randn(20, 10).to(device))
labels = torch.randn(20, 10).to(device)
loss_fn = nn.MSELoss()
# Backward pass (DDP averages gradients across processes here)
loss_fn(outputs, labels).backward()
# Optimizer step
optimizer = optim.SGD(model.parameters(), lr=0.001)
optimizer.step()
If you instead use the commented-out argparse code and launch with python -m torch.distributed.launch --nproc_per_node 4 main.py, i.e. read local_rank from the command line, you will get the warning below, because that launcher is deprecated.
FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
Instead, use the lines marked "new" near the top of the script and simply launch with torchrun main.py.
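If the same script needs to work under both launchers, one common pattern (a sketch of mine, not from the original) is to prefer the environment variable and fall back to the command-line flag:
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()
# torchrun sets LOCAL_RANK; the old launcher passes --local_rank instead
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))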
# Get the world size; it is the same in every process (e.g. 16)
torch.distributed.get_world_size()
# Get the rank; every process has its own unique index
torch.distributed.get_rank()
# Get local_rank; there is no torch.distributed.local_rank(), so read it from the environment.
# You normally use local_rank to decide which GPU of the current machine this process runs on.
local_rank = int(os.environ["LOCAL_RANK"])
## The main.py file
import os
import torch
import torch.nn as nn
from torch import optim
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
from torch.utils.data import DataLoader, Dataset
import warnings
warnings.filterwarnings('ignore')

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend='nccl')
device = torch.device("cuda", local_rank)
# print(f"world size: {torch.distributed.get_world_size()}")
print(f"rank: {torch.distributed.get_rank()}")

class CPPDataset(Dataset):
    def __init__(self, seqs, labels):
        self.seqs = seqs
        self.labels = labels

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        return self.seqs[idx], self.labels[idx]

# Build the model and wrap it with DDP
model = nn.Linear(10, 1).to(device)
model = DDP(model, device_ids=[local_rank], output_device=local_rank)

X_train = torch.randn(64, 10).to(device)
y_train = torch.zeros(64).to(device)
train_dataset = CPPDataset(X_train, y_train)

# New step 1: use DistributedSampler so each process sees its own shard of the data.
# DDP wraps up the details for us -- just plug it in.
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
# Note that batch_size here is the per-process batch size: the effective total batch size
# is this batch_size multiplied by the number of processes (world_size).
trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=4, sampler=train_sampler)

loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(4):
    # New step 2: set the sampler's epoch; DistributedSampler derives its shuffling seed from it,
    # keeping shuffling consistent across processes but different between epochs.
    trainloader.sampler.set_epoch(epoch)
    # The rest of the loop is identical to single-GPU training.
    for x, yy in trainloader:
        optimizer.zero_grad()
        prediction = model(x)
        loss = loss_fn(prediction.squeeze(), yy)
        loss.backward()
        optimizer.step()

# Saving the model parameters only needs to happen on rank 0
if dist.get_rank() == 0:
    checkpoint = {
        "net": model.module.state_dict()
    }
    torch.save(checkpoint, 'model.pt')
Finally, run torchrun --nproc_per_node 4 main.py to train on a single machine with multiple GPUs.
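Since only rank 0 writes model.pt, here is a minimal sketch of loading it afterwards for evaluation (loading on CPU is my assumption; note that the checkpoint stores model.module.state_dict(), i.e. the unwrapped model):
import torch
import torch.nn as nn

checkpoint = torch.load('model.pt', map_location='cpu')
eval_model = nn.Linear(10, 1)                  # same architecture as in main.py
eval_model.load_state_dict(checkpoint["net"])  # "net" holds model.module.state_dict()
eval_model.eval()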
If you want to run two torchrun jobs on the same machine at the same time, use torchrun --nproc_per_node 4 --master_port=22224 main.py; the master_port must differ from the one used by the job that is already running, otherwise the two will conflict.
Note: if your model returns several outputs and only some of them are used to compute the loss, e.g. (logits, _) = model(x_pad, mask), you may hit the error below. In that case, pass find_unused_parameters=True where the model is wrapped with DDP.
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss.
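A minimal sketch of the fix, changing the DDP wrap used in main.py above:
model = DDP(model, device_ids=[local_rank], output_device=local_rank,
            find_unused_parameters=True)  # allow DDP to skip parameters that received no gradient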