DDP (DistributedDataParallel) and DP (DataParallel) are both PyTorch mechanisms for accelerating training through parallelism. Their use cases differ slightly:
DP mode
DP is mainly used for single-node multi-GPU training. It requires very few code changes: only the model is wrapped, and the dataset and communication code stay untouched. A typical initialization looks like this:
import torch
import torchvision
model = torchvision.models.resnet101(num_classes=10)
model = torch.nn.DataParallel(model) # wrap the model with DataParallel
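After wrapping, the training loop itself does not need to change. A minimal sketch of a forward pass under DP (the dummy input shape is only an illustrative assumption):
model = model.cuda()                          # parameters live on GPU 0, the default output device
inputs = torch.randn(8, 3, 224, 224).cuda()   # assumed dummy batch for resnet101
outputs = model(inputs)                       # DP splits the batch across all visible GPUs and gathers the outputs on GPU 0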
DDP mode
DDP is mainly used for multi-node multi-GPU training. It requires more changes: both the model and the dataset need to be wrapped, and torch.distributed must be initialized. A typical initialization looks like this:
import torch
import torchvision
from torchvision import transforms
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
model = torchvision.models.resnet101(num_classes=10)
# wrap the model with DDP
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) # replace BatchNorm layers with SyncBatchNorm
model = DDP(model, device_ids=[local_rank], output_device=local_rank)
# wrap the dataset with a DistributedSampler
trans = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (1.0,))])
data_set = torchvision.datasets.MNIST("./", train=True, transform=trans, target_transform=None, download=True)
sampler = torch.utils.data.distributed.DistributedSampler(data_set) if rank != -1 and distributed else None
data_loader_train = torch.utils.data.DataLoader(dataset=data_set, batch_size=batch_size, shuffle=(sampler is None), sampler=sampler)
# initialize torch.distributed
# There are two main ways to launch it: torch.multiprocessing.spawn and torch.distributed.launch; both are described in detail below
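For reference, a minimal sketch of the initialization call itself; both init_method forms are standard torch.distributed usage, and the concrete address, world_size and rank values are only illustrative:
# option A: read MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE from environment variables
dist.init_process_group(backend='nccl', init_method='env://')
# option B: pass the rendezvous address explicitly (illustrative values)
dist.init_process_group(backend='nccl', init_method='tcp://192.168.10.23:29500', world_size=4, rank=0)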
Assume N nodes (machines) participate in training and each node has 2 GPUs.
DDP runs one process per GPU, and the following concepts are important:
nnode: number of nodes participating in training, here N
nproc_per_node: number of GPUs per node (all nodes must have the same count), here 2
rank: index of the current node; the master node is 0 and the others are 1, 2, 3, ..., N-1 (the scripts below call this node_rank; note that torch.distributed itself uses "rank" for the global process index)
local_rank: index of the current process (and its GPU) within the node, always running from 0 to nproc_per_node-1. With os.environ["CUDA_VISIBLE_DEVICES"] = "0,1" the local_ranks 0 and 1 map to physical GPUs 0 and 1;
with CUDA_VISIBLE_DEVICES = "1,2" the local_ranks are still 0 and 1, but they map to physical GPUs 1 and 2
master_addr: IP address of the master node
master_port: port on the master node used for rendezvous
world_size: total number of processes participating in training, i.e. nnode * nproc_per_node, here 2N
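To make these definitions concrete, here is a minimal sketch of the rank arithmetic for an assumed setup of 2 nodes with 2 GPUs each (all values are illustrative):
nnode = 2                                              # N = 2 nodes
nproc_per_node = 2                                     # 2 GPUs (one process each) per node
world_size = nnode * nproc_per_node                    # 4 processes in total
node_rank = 1                                          # this process runs on the second machine
local_rank = 0                                         # first process/GPU on this machine
global_rank = node_rank * nproc_per_node + local_rank  # = 2, unique across all machines
The plain single-GPU MNIST training script below is the baseline that the following sections wrap for DDP.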
import torch
import torchvision
import torch.utils.data.distributed
from torchvision import transforms
import argparse, os
def select_device(device=''):
cpu = device.lower() == 'cpu'
if cpu:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # force torch.cuda.is_available() = False
elif device: # non-cpu device requested
os.environ['CUDA_VISIBLE_DEVICES'] = device # set environment variable
assert torch.cuda.is_available() # check availability
cuda = not cpu and torch.cuda.is_available()
return torch.device('cuda:0' if cuda else 'cpu')
def main(args, device):
    # data loading, using torchvision's built-in MNIST dataset
trans = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (1.0,))])
data_set = torchvision.datasets.MNIST("./", train=True, transform=trans, target_transform=None, download=True)
data_loader_train = torch.utils.data.DataLoader(dataset=data_set, batch_size=args.batch_size)
    # build the network with torchvision's ResNet; conv1 is replaced so it accepts 1-channel MNIST images
    net = torchvision.models.resnet101(num_classes=10)
    net.conv1 = torch.nn.Conv2d(1, 64, (7, 7), (2, 2), (3, 3), bias=False)
net = net.to(device)
    # define loss and optimizer
criterion = torch.nn.CrossEntropyLoss()
opt = torch.optim.Adam(net.parameters(), lr=0.001)
    # training loop
for epoch in range(args.epochs):
for i, data in enumerate(data_loader_train):
images, labels = data
            images, labels = images.to(device), labels.to(device)
opt.zero_grad()
outputs = net(images)
loss = criterion(outputs, labels)
loss.backward()
opt.step()
print("loss: {}".format(loss.item()))
    # save a checkpoint
torch.save(net, "my_net.pth")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=30)
parser.add_argument("--batch_size", type=int, default=2, help="number of total batch_size")
parser.add_argument('--device', default='', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
parser.add_argument("--num_works", type=int, default=8, help="number of workers for dataloader")
args = parser.parse_args()
print("Command Line Args:", args)
device = select_device(args.device)
main(args, device)
Next, the single-GPU training code above is wrapped for distributed training, first via mp.spawn and then via torch.distributed.launch.
Straight to the code. It supports single-node single-GPU, single-node multi-GPU, and multi-node multi-GPU training and can be run directly:
(1) comm.py, which mainly contains the helpers for obtaining local_rank, rank, and world_size.
"""
This file contains primitives for multi-gpu communication.
This is useful when doing distributed training.
"""
import functools
import logging
import numpy as np
import pickle
import torch
import torch.distributed as dist
import os
_LOCAL_PROCESS_GROUP = None
# total number of processes participating in training; e.g. N nodes with 2 GPUs each gives world_size = 2*N
def get_world_size():
if not dist.is_available():
return 1
if not dist.is_initialized():
return 1
return dist.get_world_size()
# global rank of the current process; the master process is 0, the others are 1, 2, ..., world_size-1
def get_rank():
if not dist.is_available():
return 0
if not dist.is_initialized():
return 0
return dist.get_rank()
# rank of the current process within its own node (a per-machine index, not the global rank)
def get_local_rank():
"""
Returns:
The rank of the current process within the local (per-machine) process group.
"""
if not dist.is_available():
return 0
if not dist.is_initialized():
return 0
assert _LOCAL_PROCESS_GROUP is not None
return dist.get_rank(group=_LOCAL_PROCESS_GROUP)
# total number of processes on the current node, i.e. processes per machine
def get_local_size():
"""
Returns:
The size of the per-machine process group,
i.e. the number of processes per machine.
"""
if not dist.is_available():
return 1
if not dist.is_initialized():
return 1
return dist.get_world_size(group=_LOCAL_PROCESS_GROUP)
def select_device(device=''):
# device = 'cpu' or '0' or '0,1,2,3'
cpu = device.lower() == 'cpu'
if cpu:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # force torch.cuda.is_available() = False
elif device: # non-cpu device requested
os.environ['CUDA_VISIBLE_DEVICES'] = device # set environment variable
assert torch.cuda.is_available()
cuda = not cpu and torch.cuda.is_available()
return torch.device('cuda:0' if cuda else 'cpu')
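A small usage sketch of these helpers, showing their fallback behaviour before any process group exists (as implemented above); the select_device call assumes a CUDA-capable machine:
import comm

# before dist.init_process_group() has been called, the helpers fall back to single-process defaults
print(comm.get_world_size())      # -> 1
print(comm.get_rank())            # -> 0
device = comm.select_device('0')  # exposes GPU 0 only and returns torch.device('cuda:0')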
(2) main_mp.py, the main training script; the required changes and brief explanations are in the inline comments.
import torch
import torchvision
import torch.backends.cudnn as cudnn
from torchvision import transforms
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
import torch.multiprocessing as mp
import logging
from torch.nn.parallel import DistributedDataParallel as DDP
import argparse
import numpy as np
import random
import comm
def init_seeds(seed=0):
# Initialize random number generator (RNG) seeds
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if seed == 0: # slower, more reproducible
cudnn.benchmark, cudnn.deterministic = False, True
else: # faster, less reproducible
cudnn.benchmark, cudnn.deterministic = True, False
def train(args, device):
world_size = args.num_nodes * args.num_gpus
    distributed = args.num_nodes > 1  # more than one node means multi-node distributed training
    model = torchvision.models.resnet101(num_classes=10)
    model.conv1 = torch.nn.Conv2d(1, 64, (7, 7), (2, 2), (3, 3), bias=False)  # accept 1-channel MNIST input
    rank = args.node_rank
    if distributed:
        device = torch.device('cuda', comm.get_local_rank())  # each DDP process trains on its own GPU
    model.to(device)
    init_seeds(2 + rank)  # fix the random seed (offset by rank) so runs are reproducible
cuda = device.type != 'cpu'
    # single-node (non-distributed) multi-GPU: fall back to DataParallel
if cuda and rank == -1 and torch.cuda.device_count() > 1:
model = torch.nn.DataParallel(model)
if distributed:
        # replace BatchNorm with SyncBatchNorm; this requires the DDP process group to be initialized first
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
# DDP model
model = DDP(model, device_ids=[comm.get_local_rank()], output_device=comm.get_local_rank())
total_batch_size = args.batch_size
# batch_size per gpu
batch_size = total_batch_size // world_size
trans = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (1.0,))])
data_set = torchvision.datasets.MNIST("./", train=True, transform=trans, target_transform=None, download=True)
# datasets DDP mode
sampler = torch.utils.data.distributed.DistributedSampler(data_set) if rank != -1 and distributed else None
data_loader_train = torch.utils.data.DataLoader(dataset=data_set, batch_size=batch_size, shuffle=(sampler is None), sampler=sampler)
if rank in [-1, 0]:
pass # master node code, you can create test_datasets for val or test
# loss opt
criterion = torch.nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=0.001)
# net train
for epoch in range(args.epochs):
model.train()
if rank != -1:
            # set the current epoch on the sampler so shuffling changes per epoch but stays consistent across processes
data_loader_train.sampler.set_epoch(epoch)
for i, data in enumerate(data_loader_train):
images, labels = data
images, labels = images.cuda(), labels.cuda()
opt.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
opt.step()
print("loss: {}".format(loss.item()))
    # save the model on the master node only
if args.node_rank in [-1, 0]:
torch.save(model, "my_net.pth")
def main(args, device):
    ##################### newly added code #########################
world_size = args.num_nodes * args.num_gpus
    # if there is more than one node, start one process per GPU with mp.spawn
if args.num_nodes > 1:
        if args.master_addr == "auto":
            # multi-node training needs an explicit master address; "auto" is only meaningful on a single node
            assert args.num_nodes == 1, "master_addr='auto' cannot be used for multi-node training"
            args.master_addr = '127.0.0.1'
        # start the worker processes with mp.spawn
mp.spawn(
            _distributed_worker,  # function executed in every spawned process
            nprocs=args.num_gpus,  # processes started on this node = number of GPUs per node
            args=(train, world_size, device, args),  # extra arguments passed to _distributed_worker
daemon=False,
)
else:
        # single node: start training directly (DataParallel is used inside train() when several GPUs are visible)
train(args, device)
def _distributed_worker(local_rank, train, world_size, device, args):
    # e.g. with os.environ["CUDA_VISIBLE_DEVICES"] = "2,3,4,5", local_rank still takes the values 0,1,2,3
    assert torch.cuda.is_available(), "cuda is not available. Please check your installation."
    assert args.num_gpus <= torch.cuda.device_count()
    machine_rank = args.node_rank  # index of this node
    num_gpus_per_machine = args.num_gpus  # number of GPUs per node
    global_rank = machine_rank * num_gpus_per_machine + local_rank  # global rank of this process
    init_method = 'tcp://{}:{}'.format(args.master_addr, args.master_port)  # URL used to initialize the process group
try:
        # use the NCCL backend (recommended for GPU training)
        dist.init_process_group(
            backend="nccl", init_method=init_method, world_size=world_size, rank=global_rank
)
print("init_process_group well done!")
except Exception as e:
logger = logging.getLogger(__name__)
logger.error("Process group URL: {}".format(args.master_addr))
raise e
assert num_gpus_per_machine <= torch.cuda.device_count()
torch.cuda.set_device(local_rank)
assert comm._LOCAL_PROCESS_GROUP is None
for i in range(args.num_nodes):
ranks_on_i = list(range(i * num_gpus_per_machine, (i + 1) * num_gpus_per_machine))
        pg = dist.new_group(ranks_on_i)  # create a group from the processes on node i; the handle can be passed as the "group" argument to collectives
        if i == machine_rank:
            comm._LOCAL_PROCESS_GROUP = pg  # remember the process group of the local machine
    train(args, device)  # start training
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=30)
parser.add_argument("--batch_size", type=int, default=2, help="number of total batch_size")
parser.add_argument('--device', default='', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
parser.add_argument("--num_work", type=int, default=8, help="number of workers for dataloader")
    ########################### newly added arguments ###########################
    parser.add_argument("--num_nodes", type=int, default=2, help="number of nodes")
    parser.add_argument("--num_gpus", type=int, default=1, help="number of GPUs per node")
    parser.add_argument("--node_rank", type=int, default=-1, help="the rank of this machine (unique per machine)")
    parser.add_argument("--master_addr", default="auto", help="IP address of the master node")
    parser.add_argument("--master_port", default="29500", help="port of the master node")
    ########################### newly added arguments ###########################
args = parser.parse_args()
print("Command Line Args:", args)
device = comm.select_device(args.device)
main(args, device)
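A side note on the calling convention used above: mp.spawn(fn, nprocs=k, args=a) starts k processes and calls fn(i, *a) with the process index i (0 .. k-1) as the first positional argument, which is why _distributed_worker receives local_rank first. A minimal, self-contained sketch:
import torch.multiprocessing as mp

def _worker(local_rank, msg):
    # mp.spawn prepends the process index as the first argument
    print("process {}: {}".format(local_rank, msg))

if __name__ == "__main__":
    mp.spawn(_worker, nprocs=2, args=("hello",), daemon=False)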
(3) Launch scripts
Note: every node runs the same training code; just launch it with the script below on each node and set the parameters for that node.
For example, with two machines (2 GPUs each) participating in training:
Launch script on node 0:
export NNODE=2 # total number of nodes
export NODE_RANK=0 # index of this node; 0 is the master node
export NUM_GPUS_PER_NODE=2 # number of GPUs on this node
export MASTER_ADDR="192.168.10.23" # IP of the master node
export MASTER_PORT="29500" # port on the master node
python main_mp.py --num_nodes $NNODE --num_gpus $NUM_GPUS_PER_NODE --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT --epochs 20 --batch_size 16 --device 0,1 # DDP
Launch script on node 1:
export NNODE=2 # total number of nodes
export NODE_RANK=1 # index of this node; 0 is the master node
export NUM_GPUS_PER_NODE=2 # number of GPUs on this node
export MASTER_ADDR="192.168.10.23" # IP of the master node
export MASTER_PORT="29500" # port on the master node
python main_mp.py --num_nodes $NNODE --num_gpus $NUM_GPUS_PER_NODE --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT --epochs 20 --batch_size 16 --device 0,1 # DDP
The launch approach writes its configuration into environment variables via command-line arguments; it requires fewer code changes and is easy to understand.
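Concretely, the classic torch.distributed.launch entry point sets MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK as environment variables in every process it starts, and passes --local_rank to the script as a command-line argument (newer versions can export LOCAL_RANK instead when --use_env is given). A minimal sketch of how a script picks these up:
import os, argparse

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)  # filled in by torch.distributed.launch
args = parser.parse_args()

world_size = int(os.environ.get('WORLD_SIZE', 1))  # total number of processes, set by the launcher
global_rank = int(os.environ.get('RANK', -1))      # global rank of this process, set by the launcher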
(1) Straight to the code. It supports single-node single-GPU, single-node multi-GPU, and multi-node multi-GPU training; the necessary comments and brief explanations are inline.
import torch
import torchvision
import torch.backends.cudnn as cudnn
from torchvision import transforms
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from contextlib import contextmanager
import logging
from torch.nn.parallel import DistributedDataParallel as DDP
import argparse
import os
import numpy as np
import random
logger = logging.getLogger(__name__)
@contextmanager
def torch_distributed_zero_first(local_rank: int):
"""
Decorator to make all processes in distributed training wait for each local_master to do something.
"""
if local_rank not in [-1, 0]:
torch.distributed.barrier()
yield
if local_rank == 0:
torch.distributed.barrier()
def select_device(device=''):
# device = 'cpu' or '0' or '0,1,2,3'
cpu = device.lower() == 'cpu'
if cpu:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # force torch.cuda.is_available() = False
elif device: # non-cpu device requested
os.environ['CUDA_VISIBLE_DEVICES'] = device # set environment variable
assert torch.cuda.is_available(), f'CUDA unavailable, invalid device {device} requested' # check availability
cuda = not cpu and torch.cuda.is_available()
return torch.device('cuda:0' if cuda else 'cpu')
def init_seeds(seed=0):
# Initialize random number generator (RNG) seeds
random.seed(seed)
np.random.seed(seed)
init_torch_seeds(seed)
def init_torch_seeds(seed=0):
torch.manual_seed(seed)
if seed == 0: # slower, more reproducible
cudnn.benchmark, cudnn.deterministic = False, True
else: # faster, less reproducible
cudnn.benchmark, cudnn.deterministic = True, False
def train(args, device):
epochs, batch_size, total_batch_size, rank = args.epochs, args.batch_size, args.total_batch_size, args.global_rank
cuda = device.type != 'cpu'
    init_seeds(2 + rank)  # fix the random seed (offset by rank) so runs are reproducible
if rank in [-1, 0]:
pass # master node code
    # model initialization
    model = torchvision.models.resnet101(num_classes=10).to(device)
    model.conv1 = torch.nn.Conv2d(1, 64, (7, 7), (2, 2), (3, 3), bias=False).to(device)  # accept 1-channel MNIST input
    # single-node (non-distributed) multi-GPU: DataParallel
if cuda and rank == -1 and torch.cuda.device_count() > 1:
model = torch.nn.DataParallel(model)
if args.world_size > 1 and cuda and rank != -1:
        # replace BatchNorm with SyncBatchNorm; this requires the DDP process group to be initialized first
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
logger.info('Using SyncBatchNorm()')
    with torch_distributed_zero_first(rank):  # rank 0 downloads/prepares the dataset first; the other processes wait until it is done
trans = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (1.0,))])
data_set = torchvision.datasets.MNIST("./", train=True, transform=trans, target_transform=None, download=True)
batch_size = min(batch_size, len(data_set))
nw = min([os.cpu_count() // args.world_size, batch_size if batch_size > 1 else 0, args.num_works]) # number of workers
# datasets DDP mode
sampler = torch.utils.data.distributed.DistributedSampler(data_set) if rank != -1 else None
data_loader_train = torch.utils.data.DataLoader(dataset=data_set, batch_size=batch_size,num_workers=nw,
shuffle=(sampler is None), sampler=sampler)
if rank in [-1, 0]:
pass # you can create test_datasets for val or test
# DDP mode
if cuda and rank != -1:
model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)
    # define loss and optimizer
criterion = torch.nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=0.001)
    # training loop
    for epoch in range(epochs):
model.train()
if args.global_rank != -1:
data_loader_train.sampler.set_epoch(epoch)
for i, data in enumerate(data_loader_train):
images, labels = data
images, labels = images.cuda(), labels.cuda()
opt.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
opt.step()
            # if i % 10 == 0:
            print("iter: {}, loss: {}".format(i, loss.item()))
    # save the checkpoint on the master node only
if rank in [-1, 0]:
torch.save(model, "my_net.pth")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=30)
parser.add_argument("--batch_size", type=int, default=2, help="number of total batch_size")
parser.add_argument('--device', default='', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
parser.add_argument("--num_works", type=int, default=8, help="number of workers for dataloader")
    ######################## newly added argument #######################################
    # the main training script must parse --local_rank; torch.distributed.launch passes it to tell each process its index on the local node
parser.add_argument('--local_rank', type=int, default=-1, help='DDP parameter, do not modify')
args = parser.parse_args()
    # when started via torch.distributed.launch, WORLD_SIZE and RANK can be read from the environment; the launcher computes WORLD_SIZE automatically (see the run scripts below for how they are set)
args.world_size = int(os.environ['WORLD_SIZE']) if 'WORLD_SIZE' in os.environ else 1
args.global_rank = int(os.environ['RANK']) if 'RANK' in os.environ else -1
args.total_batch_size = args.batch_size
print("Command Line Args:", args)
    # select the visible GPUs
device = select_device(args.device)
    if args.local_rank != -1:  # args.local_rank is filled in by torch.distributed.launch
        assert torch.cuda.device_count() > args.local_rank
        torch.cuda.set_device(args.local_rank)
        device = torch.device('cuda', args.local_rank)  # each process trains on its own GPU
        dist.init_process_group(backend='nccl', init_method='env://')  # distributed backend
    assert args.total_batch_size % args.world_size == 0, '--batch_size must be a multiple of the number of processes'
    args.batch_size = args.total_batch_size // args.world_size  # batch_size per gpu
    train(args, device)  # train
(2) Launch scripts
Note: every node runs the same training code; just launch it with the script below on each node and set the parameters for that node.
For example, with two machines (2 GPUs each) participating in training:
Launch script on node 0:
export NNODE=2
export NODE_RANK=0
export NUM_GPUS_PER_NODE=2
export MASTER_ADDR="192.168.10.23"
export MASTER_PORT="29500"
python -m torch.distributed.launch --nnodes $NNODE --node_rank $NODE_RANK --nproc_per_node $NUM_GPUS_PER_NODE --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT main_launch.py --epochs 20 --batch_size 4 --device 0,1 # DDP
# The arguments below are required by the torch.distributed.launch entry point; the launcher uses them to write WORLD_SIZE and RANK into the environment.
# --nnodes          total number of nodes
# --node_rank       index of this node
# --nproc_per_node  number of processes (GPUs) per node
# --master_addr     IP of the master node
# --master_port     port on the master node
Launch script on node 1:
export NNODE=2
export NODE_RANK=1
export NUM_GPUS_PER_NODE=2
export MASTER_ADDR="192.168.10.23"
export MASTER_PORT="29500"
python -m torch.distributed.launch --nnodes $NNODE --node_rank $NODE_RANK --nproc_per_node $NUM_GPUS_PER_NODE --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT main_launch.py --epochs 20 --batch_size 4 --device 0,1 # DDP
torch.distributed.barrier() synchronizes all processes; a short example follows below, and other references cover it in more detail.
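For example, a common pattern lets only rank 0 do some one-off work while the other ranks wait; a hedged sketch, where prepare_dataset is a hypothetical placeholder and the process group is assumed to be initialized already:
import torch.distributed as dist

def prepare_dataset():
    pass  # hypothetical placeholder: download / preprocess data once

# assumes dist.init_process_group(...) has already been called
if dist.get_rank() == 0:
    prepare_dataset()  # only the master process does the work
dist.barrier()  # every process blocks here until all processes (including rank 0) arrive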
One more pitfall I have already hit, so take note: the hostnames of the participating nodes must not be identical!