A few useful reference posts (not yet organized):
基于pytorch多GPU单机多卡训练实践_多卡训练效果不如单卡-CSDN博客
关于PyTorch单机多卡训练_能用torch.device()实现多卡训练吗-CSDN博客
Pytorch多机多卡分布式训练 - 知乎 (zhihu.com)
当代研究生应当掌握的并行训练方法(单机多卡) - 知乎 (zhihu.com)
DataParallel is slower and generally not recommended: it runs a single process with multiple threads and gathers the outputs (and hence the loss and gradient computation) on one GPU, so that GPU becomes a bottleneck.
The DataParallel-specific parts of the training script look like this:
# main.py
import torch
import torch.nn as nn
import torch.optim as optim

gpus = [0, 1, 2, 3]                               # GPUs that take part in training
torch.cuda.set_device('cuda:{}'.format(gpus[0]))  # default device = the GPU that gathers outputs

train_dataset = ...
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=...)

model = ...
criterion = ...
# device_ids: the GPUs that take part in training; output_device: the GPU that gathers outputs and gradients
model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])
optimizer = optim.SGD(model.parameters(), lr=...)

for epoch in range(100):
    for batch_idx, (images, target) in enumerate(train_loader):
        images = images.cuda(non_blocking=True)
        target = target.cuda(non_blocking=True)
        ...
        output = model(images)
        loss = criterion(output, target)
        ...
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
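For a self-contained illustration of the same pattern, here is a minimal sketch; the toy model, tensor shapes, and random data are made up for the example, and it assumes at least two visible GPUs:

# minimal DataParallel sketch (toy model and random data, for illustration only)
import torch
import torch.nn as nn
import torch.optim as optim

gpus = [0, 1]                     # assumes at least two visible GPUs
torch.cuda.set_device(gpus[0])

model = nn.Linear(32, 4)          # toy model standing in for a real network
model = nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for step in range(3):
    images = torch.randn(64, 32).cuda(non_blocking=True)     # each batch is split across device_ids
    target = torch.randint(0, 4, (64,)).cuda(non_blocking=True)
    output = model(images)        # per-GPU replicas run in parallel, outputs are gathered on gpus[0]
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

It runs as an ordinary single-process script (python main.py); DataParallel does the splitting and gathering internally, which is exactly where the overhead comes from.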
For everyday use, the built-in torch.distributed with DistributedDataParallel (one process per GPU) is the better choice.
The torch.distributed-specific parts of the training script look like this:
# main.py
import argparse

import torch
import torch.distributed as dist
import torch.optim as optim

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=-1, type=int,
                    help='local process rank on this node (filled in by the launcher)')
args = parser.parse_args()

dist.init_process_group(backend='nccl')  # one process per GPU, communicating over NCCL
torch.cuda.set_device(args.local_rank)   # bind this process to its own GPU

train_dataset = ...
# DistributedSampler gives each process a different, non-overlapping shard of the data
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=...,  # batch size *per process*
                                           sampler=train_sampler)

model = ...
criterion = ...
model = torch.nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[args.local_rank])
optimizer = optim.SGD(model.parameters(), lr=...)

for epoch in range(100):
    train_sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for batch_idx, (images, target) in enumerate(train_loader):
        images = images.cuda(non_blocking=True)
        target = target.cuda(non_blocking=True)
        ...
        output = model(images)
        loss = criterion(output, target)
        ...
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
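The template above stops at the training loop. A common companion pattern, not part of the original snippet, is to save checkpoints from rank 0 only (every process holds an identical copy of the weights) and to shut the process group down at the end; the file name below is just a placeholder:

# after training: save from one process only and tear down (assumed pattern, not in the original template)
if dist.get_rank() == 0:
    # unwrap the DDP module so the checkpoint loads without DistributedDataParallel
    torch.save(model.module.state_dict(), 'checkpoint.pth')
dist.barrier()                  # let every rank wait until rank 0 has finished writing
dist.destroy_process_group()    # clean shutdown of the NCCL process group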
To run it, start the script with the torch.distributed.launch launcher (full example code: https://github.com/tczhangzhi/pytorch-distributed/blob/master/distributed.py):
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py
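On newer PyTorch releases, torch.distributed.launch is deprecated in favor of torchrun, which passes the rank through the LOCAL_RANK environment variable instead of a --local_rank argument; a roughly equivalent invocation would be:

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 main.py

with the script reading the rank from the environment, e.g.:

import os
local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun for each worker process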
The above mainly follows: 当代研究生应当掌握的并行训练方法(单机多卡) - 知乎 (zhihu.com)