Import PyTorch modules and define parameters.
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
# Parameters and DataLoaders
input_size = 5
output_size = 2
batch_size = 30
data_size = 100
Select the device.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Make a dummy (random) dataset. You just need to implement __getitem__.
class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len
rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)
For the demo, our model just takes an input, performs a linear operation, and gives an output. However, you can use DataParallel on any model (CNN, RNN, Capsule Net, etc.).
We've placed a print statement inside the model to monitor the size of the input and output tensors. Please pay attention to what is printed at batch rank 0.
class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())
        return output
First, we need to make a model instance and check whether we have multiple GPUs. If we do, we can wrap the model with nn.DataParallel. Then we put the model on the GPUs with model.to(device).
model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)
Out:
Let's use 2 GPUs!
Now we can see the sizes of input and output tensors.
for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", input.size(),
          "output_size", output.size())
If you have no GPU or a single GPU, a batch of 30 inputs produces 30 outputs inside the model, exactly as expected. But if you have multiple GPUs, the batch is split across them. With 2 GPUs, you will see:
Let's use 2 GPUs!
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
DataParallel splits your data automatically and dispatches work to model replicas on several GPUs. After each replica finishes its job, DataParallel collects and merges the results before returning them to you.
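Under the hood this is just a scatter/replicate/apply/gather cycle. Here is a minimal sketch using the lower-level helpers in torch.nn.parallel (the two-GPU device list and the tiny linear module are assumptions for illustration, not part of the example above):
import torch
import torch.nn as nn
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

device_ids = [0, 1]                                       # assumed: two visible GPUs
module = nn.Linear(5, 2).cuda(device_ids[0])              # model lives on the first device
inputs = torch.randn(30, 5).cuda(device_ids[0])           # one full batch

replicas = replicate(module, device_ids)                  # copy the model onto every device
chunks = scatter(inputs, device_ids)                      # split the batch along dim 0
outputs = parallel_apply(replicas[:len(chunks)], chunks)  # run each replica on its chunk
result = gather(outputs, device_ids[0])                   # merge results on the first device
print(result.size())                                      # torch.Size([30, 2])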
Reference (Jianshu): https://www.jianshu.com/p/b366cad90a6c
In general, multi-GPU computation falls into two cases: a single machine with multiple GPUs, and multiple machines with multiple GPUs. The two are implemented differently in PyTorch, because the multi-machine case additionally requires communication settings (protocols, etc.) between the machines.
Single-machine multi-GPU is very easy to implement in PyTorch. The basic idea: suppose we read in one batch of data of size [16, 10, 5] at a time and have four GPUs available. The computation then follows these steps:
With 4 GPUs available, PyTorch first replicates the model onto all 4 GPUs.
The data is then split into 4 parts and fed, in order, to the model copy on each of the four GPUs; each part has size [4, 10, 5].
Each GPU runs its forward pass independently.
After the forward passes finish, PyTorch collects the results from the four GPUs (each of size [4, 10, 5]), concatenates them back in order into [16, 10, 5], and computes the loss.
The whole process is simply: synchronize the model parameters → forward passes in parallel → compute the loss → backpropagate the gradients.
import torch
import torch.nn as nn

# Define the model
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear = nn.Linear(5, 5)

    def forward(self, inputs):
        output = self.linear(inputs)
        return output

model = Model()

# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Assume a single batch of data, plus a dummy target and loss criterion
# (assumed here so the example runs end to end)
data = torch.rand([16, 10, 5])
target = torch.rand([16, 10, 5])
crit = nn.MSELoss()

# The forward pass requires the data to be placed on GPU 0
# device = torch.device('cuda:0')
# data = data.to(device)
data = data.cuda()
target = target.cuda()

# Replicate the network across multiple GPUs
model_p = torch.nn.DataParallel(model.cuda(), device_ids=[0, 1, 2, 3])
logits = model_p(data)

# Compute the loss and update the parameters
loss = crit(logits, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
Note: it may look as though all the data and the initial network must be placed on GPU 0. In fact this is not required; you only need to make sure that the data and the initial network sit on the first GPU in the device_ids list you selected.
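For example, a minimal sketch assuming you want to use GPUs 1-3 only; the model and data are placed on cuda:1, the first entry of device_ids, and the output is gathered back there as well:
import torch
import torch.nn as nn

device = torch.device('cuda:1')            # first GPU in device_ids, not GPU 0
model = nn.Linear(5, 5).to(device)
data = torch.rand([16, 10, 5]).to(device)

model_p = nn.DataParallel(model, device_ids=[1, 2, 3])
logits = model_p(data)                     # gathered onto device_ids[0], i.e. cuda:1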
References for the functionality of DistributedDataParallel:
pytorch tutorial: https://pytorch.org/tutorials/intermediate/dist_tuto.html
https://pytorch.org/docs/stable/notes/ddp.html?highlight=distributeddataparallel
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
In the single-machine synchronous case, the torch.nn.parallel.DistributedDataParallel() wrapper may still have advantages over other approaches to data parallelism, including torch.nn.DataParallel().
The package needs to be initialized using the torch.distributed.init_process_group() function before calling any other methods. This blocks until all processes have joined.
torch.distributed.is_available()
Returns True if the distributed package is available. Otherwise, torch.distributed does not expose any other APIs. Currently, torch.distributed is available on Linux and MacOS. Set USE_DISTRIBUTED=1 to enable it when building PyTorch from source. Currently, the default value is USE_DISTRIBUTED=1 for Linux and USE_DISTRIBUTED=0 for MacOS.
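A quick runtime check (a minimal sketch, nothing here is specific to any particular cluster):
import torch.distributed as dist

if dist.is_available():
    print("torch.distributed is available")
    print("NCCL backend built in:", dist.is_nccl_available())
    print("MPI backend built in:", dist.is_mpi_available())
else:
    print("this PyTorch build was compiled without distributed support")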
torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')
Initializes the default distributed process group, and this will also initialize the distributed package.
There are 2 main ways to initialize a process group: specify store, rank, and world_size explicitly, or specify init_method (a URL string) which indicates where and how to discover peers. If neither is specified, init_method is assumed to be "env://".
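For example, a minimal sketch of the default env:// initialization (the address, port, and gloo backend are placeholder choices):
import os
import torch.distributed as dist

# env:// reads these four values from the environment
os.environ["MASTER_ADDR"] = "127.0.0.1"   # placeholder: address of the rank 0 process
os.environ["MASTER_PORT"] = "29500"       # placeholder: a free port on that machine
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

dist.init_process_group(backend="gloo", init_method="env://")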
There are two ways to initialize using TCP, both requiring a network address reachable from all processes and a desired world_size. The first way requires specifying an address that belongs to the rank 0 process. This initialization method requires that all processes have manually specified ranks.
torch.distributed.init_process_group(backend, init_method='tcp://10.1.1.20:23456',
                                     rank=args.rank, world_size=4)
Another initialization method makes use of a file system that is shared and visible from all machines in a group, along with a desired world_size. The URL should start with file:// and contain a path to a non-existent file (in an existing directory) on a shared file system. File-system initialization will automatically create that file if it doesn’t exist, but will not delete the file. Therefore, it is your responsibility to make sure that the file is cleaned up before the next init_process_group() call on the same file path/name.
# rank should always be specified
torch.distributed.init_process_group(backend, init_method='file:///mnt/nfs/sharedfile',
                                     world_size=4, rank=args.rank)
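Putting the pieces together, here is a minimal single-machine DistributedDataParallel sketch. It assumes at least one CUDA GPU and the NCCL backend; the tiny model, loss, and port are illustrative choices, not from the text above:
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # one process per GPU; env:// picks up the address/port set in the main block
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = nn.Linear(5, 2).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(30, 5).to(rank)   # in practice each process loads its own shard
    targets = torch.randn(30, 2).to(rank)

    loss = nn.MSELoss()(ddp_model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()                        # gradients are all-reduced across processes here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # placeholder: all processes on this machine
    os.environ["MASTER_PORT"] = "29501"       # placeholder: any free port
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)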
Reference (Zhihu): https://zhuanlan.zhihu.com/p/20953544