When we use DataParallel for distributed training, suppose we train on four GPUs with batch_size set to 8. The program launches only a single process, and each GPU is handed a batch_size=2 slice of the data for its forward pass. Once all 4 GPUs have finished their forward passes, the main process gathers the results from every GPU and performs the loss computation, backward pass and parameter update. All of this happens in the main process, so the main process sees forward results for the full batch_size=8.
When we use DistributedDataParallel instead, suppose we again train on 4 GPUs with a total batch_size of 8. The program launches four processes at startup, one per GPU, and each GPU runs the forward pass on batch_size=2 of data. In each process the program performs its own loss computation, backward pass and parameter update; that is exactly the difference from DataParallel: every process does this work itself. Which raises our question: if every process computes its own loss and runs its own backward pass, how is training kept in sync across the model replicas?
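To make the contrast concrete, here is a minimal sketch of how the two wrappers are typically set up (the device ids and the four-GPU assumption are only illustrative, not taken from the experiment below):

import torch
import torchvision

model = torchvision.models.resnet18(num_classes=9)

# DataParallel: a single process; the wrapper scatters each batch of 8 across
# the listed GPUs (2 samples each), gathers the outputs on the default device,
# and loss/backward/step all run in this one process.
dp_model = torch.nn.DataParallel(model.cuda(), device_ids=[0, 1, 2, 3])

# DistributedDataParallel: one process per GPU, each wrapping its own replica.
# It requires an initialized process group (see train.py below), so it is only
# sketched here as a comment:
# ddp_model = torch.nn.parallel.DistributedDataParallel(
#     model.cuda(), device_ids=[local_rank], output_device=local_rank)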
datasets.py: this dataset class generates random 224x224 images, each paired with a random label in the range 0-8.
import numpy as np
import torch
from torch.utils.data import Dataset


class RandomClsDS(Dataset):
    # Returns a random 3x224x224 image tensor and a random class label in [0, 8].
    def __init__(self):
        pass

    def __len__(self):
        return 10000

    def __getitem__(self, item):
        image = torch.randn(3, 224, 224)
        label = np.random.randint(0, 9)
        return image, label
train.py:
import os
import sys

sys.path.append(os.path.join(os.path.dirname(__file__), os.path.pardir))

import torch
import torch.distributed as dist
import torchvision
from torch.utils.data import DataLoader
from tqdm import tqdm

from datasets.datasets import RandomClsDS


def reduce_val(val):
    # Average a tensor across all processes (used below to average the loss).
    world_size = dist.get_world_size()
    with torch.no_grad():
        dist.all_reduce(val)  # blocking sum over all processes
        val /= world_size
    return val


# torch.distributed.launch sets these environment variables for each process.
local_rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])
rank = int(os.environ['RANK'])

dist.init_process_group('nccl', world_size=world_size, rank=rank)
torch.cuda.set_device(local_rank)

train_ds = RandomClsDS()
train_dataloader = DataLoader(train_ds, batch_size=2, num_workers=1)
train_dataloader = tqdm(train_dataloader)

model = torchvision.models.resnet18(num_classes=9)
model = torch.nn.parallel.DistributedDataParallel(
    model.cuda(), device_ids=[local_rank], output_device=local_rank)
criterion = torch.nn.CrossEntropyLoss()

for index, (images, labels) in enumerate(train_dataloader):
    if index > 1:
        # print(model.module.fc.weight.grad[0][0])
        pass
    output = model(images.cuda())
    loss = criterion(output, labels.long().cuda())
    print(loss)
    loss.backward()
    loss = reduce_val(loss)
    # Print one entry of the fc-layer gradient to compare it across processes.
    print(model.module.fc.weight.grad[0][0])
    break
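The script above is only a probe: it never creates an optimizer, and every process iterates over the whole dataset. In a real DDP training loop one would typically also use a DistributedSampler so that each process sees its own shard of the data, and run an optimizer step after backward(). A minimal sketch of that, reusing train_ds, model and criterion from train.py (the SGD optimizer and the learning rate here are arbitrary illustrative choices):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each process only draws samples from its own shard of the dataset.
sampler = DistributedSampler(train_ds, num_replicas=world_size, rank=rank, shuffle=True)
train_dataloader = DataLoader(train_ds, batch_size=2, num_workers=1, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    sampler.set_epoch(epoch)  # make shuffling differ between epochs
    for images, labels in train_dataloader:
        optimizer.zero_grad()
        output = model(images.cuda())
        loss = criterion(output, labels.long().cuda())
        loss.backward()   # DDP averages the gradients across processes here
        optimizer.step()  # every replica applies the same averaged gradient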
Running the following command produces the output below:
CUDA_VISIBLE_DEVICES=2,3,6,7 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr='10.100.37.21' --master_port='29500' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
tensor(2.5341, device='cuda:0', grad_fn=)
tensor(2.0336, device='cuda:2', grad_fn=)
tensor(2.3375, device='cuda:1', grad_fn=)
tensor(2.5774, device='cuda:3', grad_fn=)
tensor(-0.0832, device='cuda:2')
tensor(-0.0832, device='cuda:0')
tensor(-0.0832, device='cuda:3')
tensor(-0.0832, device='cuda:1')
Analysis:
In this experiment the program launched 4 processes in parallel via DistributedDataParallel. Each process performed just a single token forward pass and computed a loss on its result, shown below:
tensor(2.5341, device='cuda:0', grad_fn=)
tensor(2.0336, device='cuda:2', grad_fn=)
tensor(2.3375, device='cuda:1', grad_fn=)
tensor(2.5774, device='cuda:3', grad_fn=)
Every process produced a different loss, yet after loss.backward() the gradient held by the model replica on each of the 4 GPUs is identical, shown below:
tensor(-0.0832, device='cuda:2')
tensor(-0.0832, device='cuda:0')
tensor(-0.0832, device='cuda:3')
tensor(-0.0832, device='cuda:1')
Hypothesis: DistributedDataParallel must have an internal synchronization mechanism during the backward pass. Each process first finishes computing its own loss; internally the gradients are presumably reduced (averaged) across processes before they are used to update the parameters. This would explain why the losses differ between the 4 processes while the gradients after backward are identical, and since every process then runs the same optimizer step, the model parameters in every process are also identical after each iteration. Repeating the experiment many times, the loss prints from the different processes always appear before the grad prints, which suggests a synchronization point before or after loss.backward(); my guess is that it sits inside loss.backward() itself, which still needs to be verified.
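One way to probe this guess (a sketch, not a definitive proof of the internals) is DDP's no_sync() context manager, which disables gradient synchronization for backward passes run inside it. If the hypothesis holds, gradients printed inside no_sync() should differ between processes, while a normal backward() should print identical values on every GPU. Reusing model, criterion, images and labels from train.py:

# Backward WITHOUT gradient synchronization: each process keeps its purely
# local gradient, so the printed values should differ between GPUs.
model.zero_grad()
with model.no_sync():
    loss = criterion(model(images.cuda()), labels.long().cuda())
    loss.backward()
print('no_sync grad:', model.module.fc.weight.grad[0][0])

# Backward WITH DDP's normal hooks: gradients are all-reduced (averaged)
# across processes during backward(), so every GPU should print the same value.
model.zero_grad()
loss = criterion(model(images.cuda()), labels.long().cuda())
loss.backward()
print('synced grad :', model.module.fc.weight.grad[0][0])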
Now change how the code above is launched to simulate training across two nodes (i.e. two machines; in this example both commands still run on the same machine):
Command 1: CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr='10.100.37.21' --master_port='29500' train.py
Command 2: CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr='10.100.37.21' --master_port='29500' train.py
When only command 1 is started, the program blocks:
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr='10.100.37.21' --master_port='29500' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Only once command 2 is also started does the program run normally:
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr='10.100.37.21' --master_port='29500' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
tensor(1.9137, device='cuda:0', grad_fn=)
tensor(2.3257, device='cuda:1', grad_fn=)
tensor(0.0404, device='cuda:0')tensor(0.0404, device='cuda:1')
This, too, should be a synchronization mechanism built into torch.distributed.
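Beyond this implicit rendezvous at startup (dist.init_process_group only returns once all world_size processes have registered with the master address and port), torch.distributed also exposes explicit synchronization primitives. A small sketch, reusing the process group initialized in train.py, of dist.barrier(), which likewise blocks every process until all of them have reached the same point:

import torch.distributed as dist

# Every rank waits here until all ranks in the process group have arrived,
# just as startup waits until every node has joined the rendezvous.
dist.barrier()
if dist.get_rank() == 0:
    print('all', dist.get_world_size(), 'processes passed the barrier')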