DDP distributed training: the official demo and related notes

'''
1. Distributed training: parallelized computation across processes and across a cluster.
2. torch.distributed.init_process_group() initializes the process group; you must specify
    the communication backend for the workers, usually nccl (developed by NVIDIA),
    with one process per GPU.
    NCCL is NVIDIA's GPU communication library, used to exchange model parameters and
    gradients among all training nodes.
3. DDP makes all workers in the process group communicate with each other.
4. The batch_size in torch.utils.data.DataLoader is the batch size per process. In other
    words, the effective (global) batch size is this batch_size multiplied by the degree
    of parallelism (world_size).

#########################################################################
rank is the index of a process within the whole distributed job; local_rank is the index
    of a process relative to its own node.
nnodes is the number of physical nodes (a node can be a machine or a container, and a
    node may hold multiple GPUs).
node_rank is the index of a physical node.
nproc_per_node is the number of processes launched on each physical node.
########################################
A quick exercise: each node has 16 GPUs, nproc_per_node=8, nnodes=3, and this machine's
    node_rank is 1. What is world_size?
Answer: world_size = 3 * 8 = 24 (the GPU count per node and node_rank are distractors)
Conclusion: world_size = nproc_per_node * nnodes
##########################################################################

Launchers for multi-process groups: torch.distributed.launch (deprecated) or torchrun.
Typical launch command: torchrun --nproc_per_node 2 main.py
(legacy form: python3 -m torch.distributed.launch --nproc_per_node 2 main.py)
'''
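The rank arithmetic above can be sketched in plain Python. This is only an illustration of the counting rules, not a torch API; all names here are made up:

```python
# Illustrative rank arithmetic for a job with 3 nodes and 8 processes per node.
nnodes = 3
nproc_per_node = 8

# Total number of processes across the whole job:
world_size = nnodes * nproc_per_node  # 24

def global_rank(node_rank, local_rank):
    """Global rank of a process, derived from its node index and local rank."""
    return node_rank * nproc_per_node + local_rank

print(world_size)          # 24
print(global_rank(0, 0))   # 0, the first process on the first node
print(global_rank(2, 7))   # 23, the last process in the job
```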

import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))


def demo_basic():
    dist.init_process_group('nccl')
    rank = dist.get_rank()
    print(f'running DDP on rank {rank}')

    # map this process's rank onto one of the locally visible GPUs
    n_gpus = torch.cuda.device_count()
    device_id = rank % n_gpus
    model = ToyModel().to(device_id)
    '''
    With multiple GPUs and multiple processes, constructing DDP without device_ids and
    with broadcast_buffers=False can raise:
    'RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
    (when checking argument for argument mat1 in method wrapper_CUDA_addmm)'
    Example:
    ddp_model = DDP(model, broadcast_buffers=False)
    '''
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    optimizer.zero_grad()
    # inputs and labels must live on the same device as the model,
    # and the model outputs 5 features, so the labels must too
    inputs = torch.randn(3, 10).to(device_id)
    labels = torch.randn(3, 5).to(device_id)
    output = ddp_model(inputs)
    loss = loss_fn(output, labels)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    demo_basic()
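
Point 4 in the notes above (batch_size is per process) is normally handled with a DistributedSampler, which gives each rank a disjoint shard of the dataset. A minimal sketch follows; the dataset and sizes are made up for illustration, and num_replicas/rank are passed explicitly only so the snippet runs without an initialized process group. In a real DDP job you would omit them and let the sampler read both from the process group:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset: 12 samples matching ToyModel's input (10) and output (5) widths.
dataset = TensorDataset(torch.randn(12, 10), torch.randn(12, 5))

# Simulate rank 0 of a 2-process job; each replica sees 12 / 2 = 6 samples.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=3, sampler=sampler)  # 2 batches of 3

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for inputs, labels in loader:
        pass  # forward / backward / step, as in demo_basic
```

Note that the global batch size here is 3 * 2 = 6, matching the "batch_size times world_size" rule from the notes.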
