pytorch, sync batch norm and DistributedDataParallel (DDP)

background

  • PyTorch computes batch statistics separately for each device.
    The default behavior of batch norm, in PyTorch and most other frameworks, is to compute batch statistics separately for each device. This means that if we train a model with batch norm layers on multiple GPUs, the batch statistics will not reflect the whole batch; instead, each GPU computes statistics only over the slice of data it receives. The intuition is that this may harm convergence and hurt performance, and this performance drop is in fact known to happen for object detection models and GANs.
  • DataParallel: replicates the input module to device_ids and gathers the outputs to output_device; single-process, multi-threaded; only works on a single machine (see the sketch after this list)
  • DistributedDataParallel = DataParallel + the necessary parameter and gradient synchronization; multi-process; works for both single- and multi-machine training
  • Model parallel: used when the model is too large to fit on a single GPU
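
A minimal sketch of the DataParallel behavior described above, assuming a toy convolutional model and at least two visible GPUs; the layer sizes and batch size are arbitrary:

```python
import torch
import torch.nn as nn

# Toy model containing a BatchNorm layer; any model with batch norm behaves the same way.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())

if torch.cuda.device_count() > 1:
    # DataParallel replicates the module onto device_ids inside a single process,
    # scatters each input batch across the GPUs and gathers the outputs back to
    # output_device (GPU 0 by default).
    dp_model = nn.DataParallel(model.cuda(), device_ids=list(range(torch.cuda.device_count())))
    out = dp_model(torch.randn(8, 3, 32, 32).cuda())
    # Each replica saw only its own slice of the batch, so each BatchNorm layer
    # computed its statistics over a fraction of the 8 samples.
```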

solution

  • sync batchnorm module between devices
  • SyncBatchNorm + DDP (not DP)
    In order to compute batch norm statistics across all GPUs, we need to use the synchronized batch norm module released by PyTorch, which requires some changes to our code. We cannot use SyncBatchNorm with nn.DataParallel(…). SyncBatchNorm requires a very specific setting: torch.nn.parallel.DistributedDataParallel(…) in the multi-process, single-GPU-per-process configuration. In other words, we need to launch a separate process for each GPU. The conversion sketch below and the steps that follow show how to use SyncBatchNorm on a single machine with multiple GPUs.
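
A minimal sketch of the conversion step, assuming the same toy model as above; the conversion only takes effect once the model runs inside an initialized process group and is wrapped in DDP (shown in the steps below):

```python
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())

# Recursively replaces every nn.BatchNorm*d layer with nn.SyncBatchNorm.
# Statistics are only synchronized across processes after the model is wrapped
# in DistributedDataParallel with one process per GPU.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```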

steps

  • set up device and init process group
  • convert model to sync batchnorm version
  • wrap model with DDP
  • adapt the dataloader: create a train_sampler (DistributedSampler) and call set_epoch on it at the start of each epoch in the training loop
  • move input and target (label) tensors to the GPU with non_blocking=True
  • launch the processes with torch.distributed.launch / torch.multiprocessing (see the sketch below)
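
A minimal single-machine sketch that puts these steps together, assuming a toy model and random tensors in place of a real dataset; the TCP address/port, model, and hyperparameters are placeholders, not part of the original post:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def worker(local_rank, ngpus_per_node):
    # 1. set up device and init process group (single node, so rank == local_rank)
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl",
                            init_method="tcp://127.0.0.1:23456",  # arbitrary free port
                            world_size=ngpus_per_node,
                            rank=local_rank)

    # 2. convert the model to its sync batchnorm version
    model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model).cuda(local_rank)

    # 3. wrap the model with DDP (one process per GPU)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # 4. adapt the dataloader: DistributedSampler gives each process its own shard
    dataset = TensorDataset(torch.randn(256, 3, 16, 16), torch.randint(0, 2, (256,)))
    train_sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=train_sampler, pin_memory=True)

    criterion = nn.CrossEntropyLoss().cuda(local_rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(2):
        train_sampler.set_epoch(epoch)  # reshuffle shards differently each epoch
        for inputs, targets in loader:
            # 5. move input and target with non_blocking=True (overlaps copy with compute)
            inputs = inputs.cuda(local_rank, non_blocking=True)
            targets = targets.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    # 6. launch one process per GPU with torch.multiprocessing
    ngpus = torch.cuda.device_count()
    mp.spawn(worker, args=(ngpus,), nprocs=ngpus)
```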

config

  • distributed backends
    • Use the NCCL backend for distributed GPU training
    • Use the Gloo backend for distributed CPU training
  • world_size (int) Number of processes participating in the job.
  • rank (int) Rank of the current process, ranging from 0 to world_size - 1.
  • local_rank (int) Rank of the process within the current node, ranging from 0 to ngpus_per_node - 1; corresponds to the local GPU id.

$$
world\_size = ngpus\_per\_node * n\_node \\
local\_rank \in [0, ngpus\_per\_node) \\
rank \in [0, world\_size)
$$
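
A small worked example of these relationships, using a hypothetical job with 2 nodes and 4 GPUs per node:

```python
ngpus_per_node = 4
n_node = 2
world_size = ngpus_per_node * n_node            # 8 processes in total

node_rank = 1                                   # this process runs on the second node
local_rank = 3                                  # it drives the last GPU on that node
rank = node_rank * ngpus_per_node + local_rank  # global rank 7, in [0, world_size)
```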


reference

  • https://github.com/pytorch/examples/tree/master/imagenet, Multi-processing Distributed Data Parallel Training Example
  • https://github.com/dougsouza/pytorch-sync-batchnorm-example
  • https://pytorch.org/tutorials/beginner/aws_distributed_training_tutorial.html
  • torch.distributed.launch, This helper utility can be used to launch multiple processes per node for distributed training (see the example below).
  • torch.multiprocessing.spawn, This helper function can be used to spawn multiple processes. It works by passing in the function that you want to run and spawns N processes to run it. This can be used for multiprocess distributed training as well.
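
A hypothetical train.py showing what a script launched with torch.distributed.launch needs to do: the utility starts one copy of the script per GPU, passes --local_rank to each copy, and exports the rendezvous environment variables (newer PyTorch releases replace it with torchrun):

```python
# Hypothetical train.py; launch it with something like:
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
# The launcher exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
# so the env:// init method needs no further arguments.
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()}, local_rank {args.local_rank}")
dist.destroy_process_group()
```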
