Before you start: https://pytorch.org/tutorials/beginner/dist_overview.html
CLASS torch.nn.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, dim=0, broadcast_buffers=True, process_group=None, bucket_cap_mb=25, find_unused_parameters=False, check_reduction=False, gradient_as_bucket_view=False)
Implements distributed data parallelism, based on the torch.distributed package, at the module level.
This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backwards pass, gradients from each node are averaged.
The batch size should be larger than the number of GPUs used locally.
See also: Basics and Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel. The same constraints on inputs as for torch.nn.DataParallel apply to torch.nn.parallel.DistributedDataParallel.
Creation of this class requires that torch.distributed already be initialized, by calling torch.distributed.init_process_group().
DistributedDataParallel is proven to be significantly faster than torch.nn.DataParallel for single-node multi-GPU data parallel training.
To use DistributedDataParallel on a host with N GPUs, you should spawn up N processes, ensuring that each process exclusively works on a single GPU from 0 to N-1. This can be done by either setting CUDA_VISIBLE_DEVICES for every process or by calling:
>>> torch.cuda.set_device(i)
where i is from 0 to N-1. In each process, you should construct the module as follows:
>>> torch.distributed.init_process_group(
>>>     backend='nccl', world_size=N, init_method='...'
>>> )
>>> model = DistributedDataParallel(model, device_ids=[i], output_device=i)
To spawn up multiple processes per node, you can use either torch.distributed.launch or torch.multiprocessing.spawn.
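For illustration, a minimal sketch using torch.multiprocessing.spawn; the worker function demo_worker, the MyModel class, and the init_method address are assumptions of this sketch, not part of the documented API:
>>> # Per-node spawning sketch: one process per local GPU.
>>> import torch
>>> import torch.distributed as dist
>>> import torch.multiprocessing as mp
>>> from torch.nn.parallel import DistributedDataParallel as DDP
>>>
>>> def demo_worker(i, N):
>>>     # mp.spawn passes the process index i (0..N-1) as the first argument.
>>>     dist.init_process_group(backend='nccl',
>>>                             init_method='tcp://127.0.0.1:23456',
>>>                             world_size=N, rank=i)
>>>     torch.cuda.set_device(i)
>>>     model = MyModel().cuda(i)                   # MyModel: hypothetical module
>>>     ddp_model = DDP(model, device_ids=[i], output_device=i)
>>>
>>> if __name__ == '__main__':
>>>     N = torch.cuda.device_count()
>>>     mp.spawn(demo_worker, args=(N,), nprocs=N, join=True)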
Note:
Please refer to PyTorch Distributed Overview for a brief introduction to all features related to distributed training.
Note:
The nccl backend is currently the fastest and highly recommended backend when using GPUs. This applies to both single-node and multi-node distributed training.
Note:
This module also supports mixed-precision distributed training. This means that your model can have different types of parameters, such as mixed types of fp16 and fp32, and the gradient reduction on these mixed types of parameters will just work fine.
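As a minimal sketch of such a mixed-parameter model (the layer sizes, and converting only the first layer to fp16, are illustrative assumptions; i is the local GPU index from the construction above):
>>> # One submodule holds fp16 parameters, the other fp32; DDP reduces each
>>> # parameter's gradient in its own dtype.
>>> import torch
>>> import torch.nn as nn
>>> from torch.nn.parallel import DistributedDataParallel as DDP
>>>
>>> model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4)).cuda(i)
>>> model[0].half()          # parameters of the first layer become fp16
>>> ddp_model = DDP(model, device_ids=[i], output_device=i)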
Note:
If you use torch.save on one process to checkpoint the module, and torch.load on some other processes to recover it, make sure that map_location is configured properly for every process. Without map_location, torch.load would recover the module to the devices where the module was saved from.
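For example, a minimal sketch of that checkpoint pattern, assuming ddp_model and rank are available from the construction above; the file name and the rank-to-device remapping are assumptions about your setup:
>>> # Rank 0 saves; every rank remaps the saved 'cuda:0' tensors onto its own
>>> # GPU via map_location before loading.
>>> CHECKPOINT = 'ddp_checkpoint.pt'     # hypothetical path
>>> if rank == 0:
>>>     torch.save(ddp_model.state_dict(), CHECKPOINT)
>>> torch.distributed.barrier()          # wait until the checkpoint is written
>>> map_location = {'cuda:0': f'cuda:{rank}'}
>>> ddp_model.load_state_dict(torch.load(CHECKPOINT, map_location=map_location))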
Note:
When a model is trained on M nodes with batch=N, the gradient will be M times smaller when compared to the same model trained on a single node with batch=M*N (because the gradients between different nodes are averaged). You should take this into consideration when you want to obtain a mathematically equivalent training process compared to the local training counterpart.
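A toy arithmetic check of this note, assuming each node computes a summed loss over its local batch (illustrative only; whether you compensate, e.g. by rescaling the learning rate, depends on how your loss reduces over the batch):
>>> # DDP-style averaging over M nodes vs. a single node over the full batch.
>>> import torch
>>> M, N = 4, 8
>>> w = torch.tensor(2.0, requires_grad=True)
>>> x = torch.randn(M * N)
>>> (w * x).sum().backward()                       # single node, batch = M*N
>>> single_node_grad = w.grad.clone()
>>> node_losses = [(w * x[k * N:(k + 1) * N]).sum() for k in range(M)]
>>> node_grads = [torch.autograd.grad(l, w)[0] for l in node_losses]
>>> ddp_grad = sum(node_grads) / M                 # what DDP's averaging produces
>>> assert torch.allclose(single_node_grad, M * ddp_grad)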
Note:
Parameters are never broadcast between processes. The module performs an all-reduce step on gradients and assumes that they will be modified by the optimizer in all processes in the same way. Buffers (e.g. BatchNorm stats) are broadcast from the module in process of rank 0, to all other replicas in the system in every iteration.
Note:
If you are using DistributedDataParallel in conjunction with the Distributed RPC Framework, you should always use torch.distributed.autograd.backward() to compute gradients and torch.distributed.optim.DistributedOptimizer for optimizing parameters.
Example:
>>> import torch
>>> import torch.distributed.autograd as dist_autograd
>>> import torch.distributed.rpc as rpc
>>> from torch.nn.parallel import DistributedDataParallel as DDP
>>> from torch import optim
>>> from torch.distributed.optim import DistributedOptimizer
>>> from torch.distributed.rpc import RRef
>>>
>>> t1 = torch.rand((3, 3), requires_grad=True)
>>> t2 = torch.rand((3, 3), requires_grad=True)
>>> rref = rpc.remote("worker1", torch.add, args=(t1, t2))
>>> ddp_model = DDP(my_model)  # my_model: a locally defined nn.Module (assumed)
>>>
>>> # Setup optimizer
>>> optimizer_params = [rref]
>>> for param in ddp_model.parameters():
>>>     optimizer_params.append(RRef(param))
>>>
>>> dist_optim = DistributedOptimizer(
>>>     optim.SGD,
>>>     optimizer_params,
>>>     lr=0.05,
>>> )
>>>
>>> with dist_autograd.context() as context_id:
>>>     pred = ddp_model(rref.to_here())
>>>     loss = loss_func(pred, target)  # loss_func and target assumed to be defined
>>>     dist_autograd.backward(context_id, [loss])
>>>     dist_optim.step(context_id)
Warning:
Constructor, forward method, and differentiation of the output (or a function of the output of this module) are distributed synchronization points. Take that into account in case different processes might be executing different code.
Warning:
This module assumes all parameters are registered in the model by the time it is created. No parameters should be added nor removed later. Same applies to buffers.
Warning:
This module assumes that the parameters of the model are registered in the same order in every distributed process. The module itself will conduct gradient allreduce following the reverse order of the registered parameters of the model. In other words, it is the user's responsibility to ensure that each distributed process has the exact same model and thus the exact same parameter registration order.
Warning:
This module allows parameters with non-rowmajor-contiguous strides. For example, your model may contain some parameters whose torch.memory_format is torch.contiguous_format and others whose format is torch.channels_last. However, corresponding parameters in different processes must have the same strides.
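A hedged sketch consistent with this requirement, assuming every process runs the identical conversion before wrapping the model (the layer choice is illustrative; i is the local GPU index):
>>> # Mixed memory formats are fine as long as every process applies the same
>>> # conversions, so corresponding parameters end up with equal strides.
>>> import torch
>>> import torch.nn as nn
>>> from torch.nn.parallel import DistributedDataParallel as DDP
>>>
>>> model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.Conv2d(8, 8, 3)).cuda(i)
>>> model[0] = model[0].to(memory_format=torch.channels_last)  # channels_last weights
>>> # model[1] keeps torch.contiguous_format
>>> ddp_model = DDP(model, device_ids=[i], output_device=i)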
Warning:
This module doesn't work with torch.autograd.grad() (i.e. it will only work if gradients are to be accumulated in the .grad attributes of parameters).
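In other words, a short sketch of the supported pattern versus the unsupported one (inputs and the summed loss are placeholders):
>>> # Supported: backward() accumulates into .grad, which triggers DDP's all-reduce.
>>> loss = ddp_model(inputs).sum()       # placeholder forward/loss
>>> loss.backward()
>>>
>>> # Not supported: torch.autograd.grad() returns gradients directly and
>>> # bypasses DDP's reducer.
>>> # grads = torch.autograd.grad(loss, ddp_model.parameters())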
Warning:
If you plan on using this module with an nccl backend or a gloo backend (that uses Infiniband), together with a DataLoader that uses multiple workers, please change the multiprocessing start method to forkserver (Python 3 only) or spawn. Unfortunately Gloo (that uses Infiniband) and NCCL2 are not fork safe, and you will likely experience deadlocks if you don't change this setting.
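A minimal sketch of changing the start method (the dataset and worker count are placeholders; the per-loader multiprocessing_context argument is shown as an alternative):
>>> # Pick a fork-safe start method before creating DataLoader workers.
>>> import torch
>>> import torch.multiprocessing as mp
>>>
>>> mp.set_start_method('forkserver')    # or 'spawn'
>>> loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=4)
>>> # Alternatively, per loader: DataLoader(..., multiprocessing_context='spawn')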
Warning:
Forward and backward hooks defined on module and its submodules won't be invoked anymore, unless the hooks are initialized in the forward() method.
Warning:
You should never try to change your model's parameters after wrapping up your model with DistributedDataParallel, because the DistributedDataParallel constructor registers additional gradient reduction functions on all the parameters of the model at construction time. If you change the model's parameters afterwards, the gradient reduction functions no longer match the correct set of parameters.
Warning:
Using DistributedDataParallel in conjunction with the Distributed RPC Framework is experimental and subject to change.
Warning:
The gradient_as_bucket_view mode does not yet work with Automatic Mixed Precision (AMP). AMP maintains stashed gradients that are used for unscaling gradients. With gradient_as_bucket_view=True, these stashed gradients will point to communication buckets in the first iteration. In the next iteration, the communication buckets are mutated and thus these stashed gradients will be unexpectedly mutated as well, which might lead to wrong results.