PyTorch Distributed Series 2: How Does DistributedDataParallel Synchronize?

Experiment 2: How does DistributedDataParallel synchronize?

  1. Before starting the experiment, a word about DataParallel. When we use DataParallel for multi-GPU training, say with four GPUs and batch_size set to 8, only one process is started. Each GPU is given a batch_size=2 slice for the forward pass; once all four GPUs have finished their forward passes, the main process gathers the results from every GPU and performs the loss computation, backward pass, and parameter update. All of this happens in the main process, which therefore sees the forward result at batch_size=8.
  2. When we use DistributedDataParallel for distributed training, say with 4 GPUs and a total batch_size of 8, the launcher starts four processes at the same time. Each process is responsible for one GPU, and each GPU runs the forward pass on a batch_size=2 slice of the data. The difference from DataParallel is that every process computes its own loss, runs its own backward pass, and updates its own parameters. Which brings us to our question: if every process computes its own loss and its own gradients, how is model training kept synchronized? (A minimal sketch of the two wrapping styles follows this list.)
    OK, let's begin the experiment. We won't look at the source code, we'll just guess...
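Before the experiment code, here is a minimal sketch of how the two wrappers are typically set up (my illustration, not code from this post; the build_* helper names are made up):

import os
import torch
import torch.distributed as dist
import torchvision


def build_dataparallel_model():
    # DataParallel: a single process splits each batch across all visible GPUs,
    # gathers the outputs, then computes loss/backward/step in that one process.
    model = torchvision.models.resnet18(num_classes=9).cuda()
    return torch.nn.DataParallel(model)


def build_ddp_model():
    # DistributedDataParallel: this runs once in EACH process (one per GPU);
    # every process computes its own loss, backward pass, and optimizer step.
    local_rank = int(os.environ['LOCAL_RANK'])
    dist.init_process_group('nccl',
                            world_size=int(os.environ['WORLD_SIZE']),
                            rank=int(os.environ['RANK']))
    torch.cuda.set_device(local_rank)
    model = torchvision.models.resnet18(num_classes=9).cuda()
    return torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank)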
Code used in the experiment
  1. Dataset class datasets.py: this class randomly generates 224x224 images together with random labels in 0-8
import numpy as np
import torch
from torch.utils.data import Dataset


class RandomClsDS(Dataset):
    """Returns a random 224x224 image paired with a random label in [0, 8]."""

    def __init__(self):
        pass

    def __len__(self):
        return 10000

    def __getitem__(self, item):
        image = torch.randn(3, 224, 224)   # fake 3x224x224 image
        label = np.random.randint(0, 9)    # random label 0-8
        return image, label

  2. Training script train.py
import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), os.path.pardir))
from datasets.datasets import RandomClsDS
import torch
import torch.nn as nn
import torch.distributed as dist
import torchvision
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

def reduce_val(val):
    # Average a tensor across all processes: all_reduce sums the value from
    # every rank, then we divide by the world size.
    world_size = dist.get_world_size()
    with torch.no_grad():
        dist.all_reduce(val)   # blocking sum across ranks
        val /= world_size
    return val

# Rank information is provided by the launcher through environment variables.
local_rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])
rank = int(os.environ['RANK'])

# Join the NCCL process group and bind this process to its own GPU.
dist.init_process_group('nccl', world_size=world_size, rank=rank)
torch.cuda.set_device(local_rank)

# Per-process batch_size=2, so with 4 processes the global batch is 8.
# (A DistributedSampler would normally be used here so that ranks see
# different data; it does not matter for this random dataset.)
train_ds = RandomClsDS()
train_dataloader = torch.utils.data.DataLoader(train_ds, batch_size=2, num_workers=1)
train_dataloader = tqdm(train_dataloader)

model = torchvision.models.resnet18(num_classes=9)
# Wrap the model in DDP: each process keeps a full replica on its own GPU.
model = torch.nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank], output_device=local_rank)

criterion = torch.nn.CrossEntropyLoss()

# Run a single iteration: print each rank's own loss, backpropagate, then
# print one gradient element so the values can be compared across ranks.
for index, (images, labels) in enumerate(train_dataloader):
    output = model(images.cuda())
    loss = criterion(output, labels.long().cuda())
    print(loss)                               # per-rank loss
    loss.backward()
    loss = reduce_val(loss)                   # average the loss across ranks
    print(model.module.fc.weight.grad[0][0])  # one gradient element, per rank
    break
 
Experiment 1

Run the command CUDA_VISIBLE_DEVICES=2,3,6,7 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr='10.100.37.21' --master_port='29500' train.py; the output is shown below:

CUDA_VISIBLE_DEVICES=2,3,6,7 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --node_rank=0 --master_addr='10.100.37.21' --master_port='29500' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
tensor(2.5341, device='cuda:0', grad_fn=)
tensor(2.0336, device='cuda:2', grad_fn=)
tensor(2.3375, device='cuda:1', grad_fn=)
tensor(2.5774, device='cuda:3', grad_fn=)
tensor(-0.0832, device='cuda:2')
tensor(-0.0832, device='cuda:0')
tensor(-0.0832, device='cuda:3')
tensor(-0.0832, device='cuda:1')

Analysis:

In this experiment the program was launched as 4 parallel processes (one per GPU); each process performs only a single, symbolic forward pass and computes the loss on its result, shown below:

tensor(2.5341, device='cuda:0', grad_fn=)
tensor(2.0336, device='cuda:2', grad_fn=)
tensor(2.3375, device='cuda:1', grad_fn=)
tensor(2.5774, device='cuda:3', grad_fn=)

Each process's loss is different, yet once loss.backward() has run, the model gradients on the 4 GPUs are identical, as shown below:

tensor(-0.0832, device='cuda:2')
tensor(-0.0832, device='cuda:0')
tensor(-0.0832, device='cuda:3')
tensor(-0.0832, device='cuda:1')

Guess: DistributedDataParallel must have some internal synchronization mechanism around the backward pass. Each process finishes computing its own loss, and somewhere inside DDP a reduce operation is presumably applied during backpropagation before the gradients are written back. That would explain why the losses in the 4 processes differ while the gradients after backward are identical; when the optimizer step then runs, the parameters of the model in every process end up the same after each iteration. I repeated the run above many times, and the loss prints from the different processes always appear before the grad prints, so there should be a synchronization point right before or after loss.backward(); my personal guess is that it sits inside loss.backward()??? To be verified.
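One way to probe this guess (my own sketch, not part of the experiment above) is to disable DDP's built-in synchronization with its no_sync() context manager and average the gradients by hand with all_reduce, then compare the printed values. The manual_grad_average helper below is a hypothetical name; it mimics what the guess says DDP does internally:

import torch
import torch.distributed as dist

def manual_grad_average(model):
    # Average every parameter gradient across all ranks by hand:
    # all_reduce sums the gradients, the division turns the sum into a mean.
    world_size = dist.get_world_size()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad)
                p.grad /= world_size

# Inside the training loop, with `model` wrapped in DistributedDataParallel:
#
#   with model.no_sync():      # skip DDP's built-in gradient synchronization
#       loss.backward()
#   print(model.module.fc.weight.grad[0][0])   # differs between ranks here
#   manual_grad_average(model)
#   print(model.module.fc.weight.grad[0][0])   # identical again on every rank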

Experiment 2

Now launch the same code as follows, simulating a training run across two nodes (i.e., two machines; in this example they are actually the same machine):

Command 1: CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr='10.100.37.21' --master_port='29500' train.py

Command 2: CUDA_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr='10.100.37.21' --master_port='29500' train.py

When only command 1 is started, the program blocks:

CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr='10.100.37.21' --master_port='29500' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************

Only when command 2 is started as well does the program proceed normally:

CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr='10.100.37.21' --master_port='29500' train.py
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
tensor(1.9137, device='cuda:0', grad_fn=)
tensor(2.3257, device='cuda:1', grad_fn=)
tensor(0.0404, device='cuda:0')tensor(0.0404, device='cuda:1')

This should also be a synchronization mechanism built into torch.distributed: training does not start until every rank across all nodes has joined the process group.
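As a side note, here is a small sketch of mine (assuming the same environment-variable launch as above) showing an explicit synchronization point you can add yourself: dist.barrier() blocks every rank until all ranks in the job have reached it, much like the startup above blocked until both nodes were launched.

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ['LOCAL_RANK'])
dist.init_process_group('nccl',
                        world_size=int(os.environ['WORLD_SIZE']),
                        rank=int(os.environ['RANK']))
torch.cuda.set_device(local_rank)

print(f"rank {dist.get_rank()} reached the barrier")
dist.barrier()   # returns only after every rank in the job has called barrier()
print(f"rank {dist.get_rank()} passed the barrier")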
