PyTorch multi-machine multi-GPU distributed training

  • 1. Single-machine multi-GPU with DataParallel:
    • Imports and parameters
    • Dummy DataSet
    • Simple Model
    • Create Model and DataParallel
    • Run the Model
    • Results
    • Summary
    • Another demo
  • 2. Multi-machine multi-GPU with torch.nn.parallel.DistributedDataParallel()
    • Initialization
      • Initialization method 1 -- TCP (rank 0 network address, most useful I think)
      • Initialization method 2 -- Shared file-system initialization
  • 3. On Python's GIL, multithreading, and multiprocessing

Note: with DataParallel, the batch size must be set to n times the single-GPU value (it is the global batch, which gets split across the n GPUs), whereas with DistributedDataParallel the batch size is set just as for a single GPU (each process loads its own per-GPU batch).
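
For concreteness, a minimal sketch of the two batch-size conventions (the toy dataset, the per-GPU batch of 32 and the GPU count are illustrative, not from the original text):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1000, 5))   # toy dataset
per_gpu_batch = 32
num_gpus = 4

# DataParallel: a single process with a single DataLoader; batch_size is the
# global batch, which DataParallel later splits across the GPUs (32 per GPU here).
dp_loader = DataLoader(dataset, batch_size=per_gpu_batch * num_gpus, shuffle=True)

# DistributedDataParallel: one process per GPU, each with its own DataLoader;
# batch_size is the per-process (i.e. per-GPU) batch. num_replicas/rank are
# given explicitly here only so the sketch runs without an initialized process group.
ddp_loader = DataLoader(dataset, batch_size=per_gpu_batch,
                        sampler=DistributedSampler(dataset, num_replicas=num_gpus, rank=0))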


1. Single-machine multi-GPU with DataParallel:

Imports and parameters

Import PyTorch modules and define parameters.

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

Device

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Dummy DataSet

Make a dummy (random) dataset. You just need to implement the __getitem__ (and __len__) methods.

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

Simple Model

For the demo, our model just gets an input, performs a linear operation, and gives an output. However, you can use DataParallel on any model (CNN, RNN, Capsule Net etc.)

We’ve placed a print statement inside the model to monitor the size of input and output tensors. Please pay attention to what is printed at batch rank 0.

class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())

        return output

Create Model and DataParallel

First, we need to make a model instance and check if we have multiple GPUs. If we do, we can wrap the model with nn.DataParallel. Then we can put the model on the GPUs with model.to(device).

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
  model = nn.DataParallel(model)

model.to(device)

Out:

Let's use 2 GPUs!

Run the Model

Now we can see the sizes of input and output tensors.

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", input.size(),
          "output_size", output.size())

Results

If you have no GPU or a single GPU, then when we feed a batch of 30 inputs the model receives 30 inputs and produces 30 outputs, as expected. But if you have multiple GPUs, each replica only sees a slice of the batch. With 2 GPUs, you will see:

Let's use 2 GPUs!
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
    In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
    In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
    In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])

Summary

DataParallel splits your data automatically and sends job orders to multiple models on several GPUs. After each model finishes its job, DataParallel collects and merges the results before returning them to you.

Another demo

Source: https://www.jianshu.com/p/b366cad90a6c
In general, multi-GPU training comes in two flavors, single-machine multi-GPU and multi-machine multi-GPU. Their PyTorch implementations differ, because the multi-machine case additionally needs a communication setup between the machines.
Single-machine multi-GPU is very easy to implement in PyTorch. The basic idea: suppose we read in one batch of data of size [16, 10, 5] and have four GPUs available. The computation then follows these steps:
  • PyTorch first replicates the model onto the 4 GPUs.
  • The data is split into 4 chunks of size [4, 10, 5] each, which are placed, in order, onto the model replica on each GPU.
  • Each GPU runs its forward pass independently.
  • Once the forward passes finish, PyTorch gathers the results from the four GPUs (each of size [4, 10, 5] here), concatenates them in order back into a [16, 10, 5] tensor, and the loss is computed on that.
The whole process is: synchronize model parameters → forward in parallel → compute the loss → backpropagate the gradients. The demo below follows these steps (a dummy target and loss criterion are added so the snippet runs end to end).

import torch
import torch.nn as nn

# Define the model
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear = nn.Linear(5, 5)

    def forward(self, inputs):
        output = self.linear(inputs)
        return output

model = Model()

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Assume there is just one batch of data
data = torch.rand([16, 10, 5])
# Dummy target and criterion, added only so this demo runs end to end
target = torch.rand([16, 10, 5])
crit = nn.MSELoss()

# The forward pass requires the data to be on GPU 0 (but see the note below)
# device = torch.device('cuda:0')
# data = data.to(device)
data = data.cuda()
target = target.cuda()

# Replicate the network onto multiple GPUs
model_p = torch.nn.DataParallel(model.cuda(), device_ids=[0, 1, 2, 3])
logits = model_p(data)

# Compute the loss and take an optimization step
loss = crit(logits, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

Note: it looks as if all the data and the initial network must be placed on GPU 0. In fact this is not required; you only need to make sure the data and the initial network live on the first GPU of the set you selected, i.e. the first entry of device_ids.
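
Continuing the snippet above, a hedged sketch of using only GPUs 2 and 3 (the device indices are illustrative): the model and data just need to sit on the first device listed in device_ids, which does not have to be cuda:0.

device = torch.device('cuda:2')                      # first entry of device_ids below
model_p = torch.nn.DataParallel(model.to(device), device_ids=[2, 3])
logits = model_p(data.to(device))                    # data on cuda:2; cuda:0 is untouched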

2. Multi-machine multi-GPU with torch.nn.parallel.DistributedDataParallel()

References on the functionality of DistributedDataParallel:
pytorch tutorial: https://pytorch.org/tutorials/intermediate/dist_tuto.html
https://pytorch.org/docs/stable/notes/ddp.html?highlight=distributeddataparallel
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

In the single-machine synchronous case, torch.nn.parallel.DistributedDataParallel() wrapper may still have advantages over other approaches to data-parallelism, including torch.nn.DataParallel():

  • Each process maintains its own optimizer and performs a complete optimization step with each iteration. While this may appear redundant, since the gradients have already been gathered together and averaged across processes and are thus the same for every process, this means that no parameter broadcast step is needed, reducing time spent transferring tensors between nodes.
  • Each process contains an independent Python interpreter, eliminating the extra interpreter overhead and “GIL-thrashing” (see Part 3) that comes from driving several execution threads, model replicas, or GPUs from a single Python process. This is especially important for models that make heavy use of the Python runtime, including models with recurrent layers or many small components. (A minimal sketch of this one-process-per-GPU setup follows this list.)
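
To make both points concrete, here is a minimal single-machine, one-process-per-GPU sketch (the loopback address, port, toy model and dummy data are placeholders, not from the original text):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each process owns exactly one GPU, one model replica and its own optimizer.
    os.environ['MASTER_ADDR'] = '127.0.0.1'   # placeholder address
    os.environ['MASTER_PORT'] = '23456'       # placeholder port
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    model = nn.Linear(5, 2).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    data = torch.randn(30, 5).cuda(rank)      # per-process (per-GPU) batch
    target = torch.randn(30, 2).cuda(rank)

    loss = nn.MSELoss()(ddp_model(data), target)
    optimizer.zero_grad()
    loss.backward()                           # gradients are averaged across processes here
    optimizer.step()                          # every process applies the same update

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)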

Initialization

The package needs to be initialized using the torch.distributed.init_process_group() function before calling any other methods. This blocks until all processes have joined.

torch.distributed.is_available()

Returns True if the distributed package is available. Otherwise, torch.distributed does not expose any other APIs. Currently, torch.distributed is available on Linux and MacOS. Set USE_DISTRIBUTED=1 to enable it when building PyTorch from source. Currently, the default value is USE_DISTRIBUTED=1 for Linux and USE_DISTRIBUTED=0 for MacOS.
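
A small usage sketch of this check (the backend names match the parameter list below):

import torch
import torch.distributed as dist

if dist.is_available():
    # Pick a backend that this build and machine actually support.
    backend = 'nccl' if torch.cuda.is_available() and dist.is_nccl_available() else 'gloo'
    print('torch.distributed is available; using backend:', backend)
else:
    print('torch.distributed is not available in this build')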

torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')

Initializes the default distributed process group, and this will also initialize the distributed package.
There are 2 main ways to initialize a process group:

  • Specify store, rank, and world_size explicitly.
  • Specify init_method (a URL string) which indicates where/how to discover peers. Optionally specify rank and world_size, or encode all required parameters in the URL and omit them.

If neither is specified, init_method is assumed to be “env://” (a minimal sketch of this style follows the parameter list below).
Parameters:

  • backend (str or Backend) – The backend to use. Depending on build-time configurations, valid values include mpi, gloo, and nccl. This field should be given as a lowercase string (e.g., “gloo”), which can also be accessed via Backend attributes (e.g., Backend.GLOO). If using multiple processes per machine with nccl backend, each process must have exclusive access to every GPU it uses, as sharing GPUs between processes can result in deadlocks.
  • init_method (str, optional) – URL specifying how to initialize the process group. Default is “env://” if no init_method or store is specified. Mutually exclusive with store.
  • world_size (int, optional) – Number of processes participating in the job. Required if store is specified.
  • rank (int, optional) – Rank of the current process. Required if store is specified.
  • store (Store, optional) – Key/value store accessible to all workers, used to exchange connection/address information. Mutually exclusive with init_method.
  • timeout (timedelta, optional) – Timeout for operations executed against the process group. Default value equals 30 minutes. This is applicable for the gloo backend. For nccl, this is applicable only if the environment variable NCCL_BLOCKING_WAIT is set to 1.
  • group_name (str, optional, deprecated) – Group name.
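
As noted above, when neither store nor init_method is given, init_method defaults to “env://”. A minimal sketch of that style (the address, port and rank values are placeholders; each process must export its own RANK):

import os
import torch.distributed

# Environment-variable ("env://") initialization: a launcher script typically
# exports these variables before starting each process.
os.environ['MASTER_ADDR'] = '10.1.1.20'   # placeholder: address of the rank 0 machine
os.environ['MASTER_PORT'] = '23456'       # placeholder: a free port on that machine
os.environ['WORLD_SIZE'] = '4'
os.environ['RANK'] = '0'                  # each process sets its own rank

torch.distributed.init_process_group(backend='nccl')   # init_method defaults to 'env://'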

Initialization method 1 – TCP (rank 0 network address, most useful I think)

Initialization over TCP requires a network address that is reachable from all processes (here, an address belonging to the rank 0 process) and a desired world_size. This initialization method requires that all processes have manually specified ranks.

torch.distributed.init_process_group(backend, init_method='tcp://10.1.1.20:23456',
                        rank=args.rank, world_size=4)
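
For context, a hedged sketch of where args.rank might come from: every machine runs the same script, passing a different --rank (the address and port mirror the example above and are placeholders):

import argparse
import torch.distributed

parser = argparse.ArgumentParser()
parser.add_argument('--rank', type=int, required=True)   # unique per process: 0 .. world_size-1
args = parser.parse_args()

# e.g.  python train.py --rank 0    on 10.1.1.20 (the rank 0 machine)
#       python train.py --rank 1    on the next machine, and so on up to rank 3
torch.distributed.init_process_group(backend='gloo',
                        init_method='tcp://10.1.1.20:23456',
                        rank=args.rank, world_size=4)
print('initialized rank', torch.distributed.get_rank(),
      'of', torch.distributed.get_world_size())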

Initialization method 2 – Shared file-system initialization

Another initialization method makes use of a file system that is shared and visible from all machines in a group, along with a desired world_size. The URL should start with file:// and contain a path to a non-existent file (in an existing directory) on a shared file system. File-system initialization will automatically create that file if it doesn’t exist, but will not delete the file. Therefore, it is your responsibility to make sure that the file is cleaned up before the next init_process_group() call on the same file path/name.

# rank should always be specified
torch.distributed.init_process_group(backend, init_method='file:///mnt/nfs/sharedfile',
                        world_size=4, rank=args.rank)
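
One hedged way to satisfy the cleanup requirement above is to give every run its own file, for example keyed by a job id from your scheduler (JOB_ID is a placeholder environment variable, and args.rank comes from your own argument parsing as in the TCP example):

import os
import torch.distributed

# A fresh file per run avoids tripping over a stale rendezvous file left by a previous job.
init_file = 'file:///mnt/nfs/ddp_init_' + os.environ.get('JOB_ID', 'default')

torch.distributed.init_process_group(backend='gloo', init_method=init_file,
                        world_size=4, rank=args.rank)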

3. On Python's GIL, multithreading, and multiprocessing

Reference (Zhihu): https://zhuanlan.zhihu.com/p/20953544
