PyTorch distributed training
Cutting-edge deep learning models are growing at an exponential rate: where last year’s GPT-2 had ~1.5 billion parameters, this year’s GPT-3 has 175 billion. GPT is a somewhat extreme example; nevertheless, the “enbiggening” of the SOTA is driving larger and larger models into production applications, challenging the ability of even the most powerful GPU cards to finish model training jobs in a reasonable amount of time.
To deal with these problems, practitioners are increasingly turning to distributed training. Distributed training is the set of techniques for training a deep learning model using multiple GPUs and/or multiple machines. Distributing training jobs allows you to push past the single-GPU memory bottleneck, developing ever larger and more powerful models by leveraging many GPUs simultaneously.
This blog post is an introduction to distributed training in pure PyTorch using the torch.nn.parallel.DistributedDataParallel API. We will:
- Discuss distributed training in general and data parallelization in particular
- Cover the relevant features of torch.distributed and DistributedDataParallel and show how they are used by example
- Benchmark a real training script to see the time savings in action
You can follow along in code by checking out the companion GitHub repo.
What is distributed training?
Before we can dive into DistributedDataParallel, we first need to acquire some background knowledge about distributed training in general.
There are basically two different forms of distributed training in common use today: data parallelization and model parallelization.
In data parallelization, the model training job is split on the data. Each GPU in the job receives its own independent slice of the data batch, i.e. its own “batch slice”. Each GPU uses this data to independently calculate a gradient update. For example, if you were to use two GPUs and a batch size of 32, one GPU would handle forward and back propagation on the first 16 records, and the second on the last 16. These gradient updates are then synchronized among the GPUs, averaged together, and finally applied to the model.
(The synchronization step is technically optional, but theoretically faster asynchronous update strategies are still an active area of research.)
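To make the two-GPU example concrete, here is a minimal pure-Python sketch with no GPUs involved. It assumes a toy one-weight linear model with a mean squared error loss; the data values and the function name mse_gradient are made up for illustration. It checks that averaging the gradients from two equally sized batch slices gives the same result as computing the gradient over the full batch of 32.

```python
# Illustrative only: a one-parameter linear model y = w * x with MSE loss.
def mse_gradient(w, records):
    """Gradient of mean((w * x - y) ** 2) with respect to w over the records."""
    return sum(2 * (w * x - y) * x for x, y in records) / len(records)


# A "batch" of 32 (x, y) records; the values are invented for the demo.
batch = [(float(i), 3.0 * i + 1.0) for i in range(32)]
w = 0.5

# Each simulated GPU handles one batch slice of 16 records.
slice_a, slice_b = batch[:16], batch[16:]
grad_a = mse_gradient(w, slice_a)
grad_b = mse_gradient(w, slice_b)

# Synchronization step: average the per-slice gradients.
averaged = (grad_a + grad_b) / 2
full_batch = mse_gradient(w, batch)

print(abs(averaged - full_batch) < 1e-9)
```

Note that the plain average only matches the full-batch gradient when every slice has the same number of records, which is the situation described above.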
In model parallelization, the model training job is split on the model. Each GPU in the job receives a slice of the model, e.g. a subset of its layers. So for example, one GPU might be responsible for its output head, another might handle the input layers, and another, the hidden layers in between.
While each of these techniques has its advantages and disadvantages, data parallelization is the easier of the two to implement (it requires no knowledge of the underlying network architecture) and is thus the strategy that is usually tried first.
(It’s also possible to combine the techniques, i.e. to use model and data parallelization simultaneously, but this is an advanced topic that we won’t be covering here.)
Since this blog post is an introduction to the DistributedDataParallel API, we will not be discussing model parallelization in any further detail, but stay tuned for a future post on the subject!
How data parallelization works
In the previous section I gave a high-level overview of what data parallelization is. In this section, we will dig further into the details.
The first data parallelization technique to see widespread adoption was the parameter server strategy in TensorFlow. This feature actually predates the very first release of TensorFlow, having been implemented in its Google-internal predecessor, DistBelief, way back in 2012. This strategy is illustrated well in the following diagram (taken from a post on the Uber Engineering blog):
In the parameter server strategy there is a variable number of worker and parameter processes, with each worker process maintaining its own independent copy of the model in GPU memory. Gradient updates are computed as follows:
- Upon receiving the go signal, each worker process accumulates the gradients for its particular batch slice.
- The workers send their updates to the parameter servers in a fan-out manner.
- The parameter servers wait until they have all worker updates, then average the total gradient for the portion of the gradient update parameter space they are responsible for.
- The gradient updates are fanned out to the workers, which sum them up and apply them to their in-memory copy of the model weights (thus keeping the worker models in sync).
- Once every worker has applied the updates, a new batch of training is ready to begin.
Whilst simple to implement, this strategy has some major limitations. The most important of these is the fact that each additional parameter server requires n_workers additional network calls at each synchronization step, an O(n²) complexity cost when the number of parameter servers grows along with the number of workers. The overall speed of the computation depends on the slowest connection, so large parameter-server-based model training jobs get to be very inefficient in practice, pushing net GPU utilization to 50% or below.
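The quadratic cost is easy to see with a little arithmetic. The sketch below assumes that every worker talks to every parameter server once per synchronization step, and (for the demo loop only) that there is one parameter server per worker; the function name sync_calls_per_step is hypothetical.

```python
def sync_calls_per_step(n_workers, n_param_servers):
    """Worker-to-server calls in one synchronization step, assuming every
    worker contacts every parameter server exactly once."""
    return n_workers * n_param_servers


for n in (2, 4, 8, 16):
    # Demo assumption: the number of parameter servers equals the number of workers.
    print(f"{n} workers, {n} servers: {sync_calls_per_step(n, n)} calls per step")
```

Doubling the size of the job quadruples the number of calls under this assumption, which is where the O(n²) figure comes from.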
(For more background I recommend watching “Inside TensorFlow: tf.distribute.Strategy”.)
More modern distributed training strategies do away with parameter servers.
In the DistributedDataParallel strategy, every process is a worker process. Each process still maintains a complete in-memory copy of the model weights, but batch slice gradient updates are now synchronized and averaged directly on the worker processes themselves. This is achieved using a technique borrowed from the high-performance computing world, an all-reduce algorithm:
This diagram shows one particular implementation of an all-reduce algorithm, ring all-reduce, in action. As you can see, this algorithm provides an elegant way of synchronizing the state of a set of variables (in this case tensors) among a collection of processes. The vectors are passed around in a sequence of direct worker-to-worker connections. This eliminates the network bottleneck created by the worker-to-parameter-server connections, substantially improving performance.
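For readers who prefer code to diagrams, here is a rough single-process simulation of ring all-reduce in pure Python. It is a teaching sketch, not how NCCL, Gloo, or MPI implement it: the "workers" are just lists, the function name ring_all_reduce is my own, and for simplicity the vector length is assumed to divide evenly by the number of workers.

```python
def ring_all_reduce(worker_vectors):
    """Simulate a ring all-reduce (sum) over equally sized vectors, one per worker."""
    n = len(worker_vectors)
    length = len(worker_vectors[0])
    assert length % n == 0, "for simplicity, the vector length must divide evenly"
    chunk = length // n
    # data[i][c] is chunk c of the vector held by worker i.
    data = [[list(v[c * chunk:(c + 1) * chunk]) for c in range(n)]
            for v in worker_vectors]

    # Phase 1, scatter-reduce: after n - 1 steps, worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot the outgoing chunks so all sends in a step happen "at once".
        sends = [(i, (i - step) % n, list(data[i][(i - step) % n]))
                 for i in range(n)]
        for i, c, payload in sends:
            dst = (i + 1) % n  # each worker only talks to its ring neighbour
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]

    # Phase 2, all-gather: pass the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(data[i][(i + 1 - step) % n]))
                 for i in range(n)]
        for i, c, payload in sends:
            data[(i + 1) % n][c] = payload

    # Flatten each worker's chunks back into a single vector.
    return [[x for c in worker for x in c] for worker in data]


grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
summed = ring_all_reduce(grads)
averaged = [[x / len(grads) for x in vec] for vec in summed]
print(averaged)
```

Every worker ends up with the same summed vector, and each worker only ever sends data to its immediate neighbour in the ring.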
In this scheme, gradient updates are computed as follows:
- Each worker maintains its own copy of the model weights and its own copy of the dataset.
- Upon receiving the go signal, each worker process draws a disjoint batch from the dataset and computes a gradient for that batch.
- The workers use an all-reduce algorithm to synchronize their individual gradients, so that every node ends up with the same average gradient locally.
- Each worker applies the gradient update to its local copy of the model.
- The next batch of training begins.
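The steps above can be sketched by hand with torch.distributed.all_reduce. DistributedDataParallel does this gradient synchronization for you (and more efficiently), so treat the following as an illustration of the idea rather than something you would write in practice. The helper names average_gradients and demo_single_process are hypothetical, and the demo runs a one-process "world" on CPU with the gloo backend and a file-based rendezvous so that it works without GPUs.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def average_gradients(model):
    """All-reduce every parameter gradient and divide by the world size,
    so each process ends up holding the same averaged gradient."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size


def demo_single_process():
    """Run one training step in a one-process world (CPU, gloo backend)."""
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    dist.init_process_group(
        "gloo", init_method=f"file://{init_file}", rank=0, world_size=1
    )
    try:
        model = torch.nn.Linear(4, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        loss = model(torch.randn(8, 4)).pow(2).mean()
        loss.backward()           # compute the local gradient
        average_gradients(model)  # synchronize gradients via all-reduce
        optimizer.step()          # apply the update to the local model copy
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(demo_single_process())
```

With a world size of one the all-reduce is a no-op, but the same average_gradients function would average across processes in a larger job.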
This all-reduce strategy was brought to the forefront in the 2017 Baidu paper “Bringing HPC Techniques to Deep Learning”. The great thing about it is that it is based on well-understood HPC techniques with longstanding open source implementations. All-reduce is included in the Message Passing Interface (MPI) de facto standard, which is why PyTorch DistributedDataParallel offers no fewer than three different backend implementations: Open MPI, NVIDIA NCCL, and Facebook Gloo.
Data distributed, part 1: process initialization
Unfortunately, modifying your training script to use the DistributedDataParallel strategy is not a simple one-line change.
To demonstrate how the API works, we will build our way towards a complete distributed training script (which we will go on to benchmark later in this article). I recommend following along with the code on GitHub.
The first and most complicated new thing you need to handle is process initialization. A vanilla PyTorch training script executes a single copy of its code inside of a single process. With data parallelized models, the situation is more complicated: there are now as many simultaneous copies of the training script as there are GPUs in the training cluster, each one running in a different process.
Consider the following minimal example:
# multi_init.py
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

NUM_EPOCHS = 20  # placeholder value; set to however long you want to train
WORLD_SIZE = torch.cuda.device_count()


def init_process(rank, size, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)


def train(rank, num_epochs, world_size):
    init_process(rank, world_size)
    print(f"{rank + 1}/{world_size} process initialized.\n")
    # rest of the training script goes here!


if __name__ == "__main__":
    # mp.spawn passes the process rank as the first argument to train
    mp.spawn(train, args=(NUM_EPOCHS, WORLD_SIZE), nprocs=WORLD_SIZE, join=True)
In the world of MPI, world size is the number of processes being orchestrated, and (global) rank is the position of the current process in that world. So for example, if this script were executing on a beefy machine with four GPUs onboard, WORLD_SIZE would be 4 (because torch.cuda.device_count() == 4), so mp.spawn would spawn 4 different processes, with ranks 0, 1, 2, and 3 respectively. The process with rank 0 is given a few extra responsibilities, and is therefore referred to as the master process.
The current process’s rank is passed to the spawn entrypoint (in this case, the train function) as its first argument. Before train can actually do any work, it needs to first set up its connections to its peer processes. This is the responsibility of dist.init_process_group. When run in the master process, this method sets up a socket listener on MASTER_ADDR:MASTER_PORT and starts handling connections from the other processes. Once all of the processes have connected, this method handles setting up the peer connections allowing the processes to communicate.
Note that this recipe only works for training on a single multi-GPU machine! The same machine is used to launch every single process in the job, so training can only leverage the GPUs connected to that specific machine. This makes things easy: setting up IPC is as easy as finding a free port on localhost, which is immediately visible to all processes on that machine. IPC across machines is much more complicated, as it requires configuring an external IP address visible to all machines.
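If you would rather not hardcode a port like 29500, one common trick is to ask the operating system for a free one. The sketch below uses only the standard library; find_free_port is a hypothetical helper name, and the approach assumes the port is still free by the time the training processes start using it (there is a small window in which another program could grab it).

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 means "pick any free port"
        return sock.getsockname()[1]


port = find_free_port()
print(f"MASTER_ADDR=127.0.0.1 MASTER_PORT={port}")
```

The port should be chosen once in the parent process, before spawning, and passed down to the workers so that every process agrees on the same MASTER_PORT.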
In this introductory tutorial we will focus specifically on the single-machine case, aka vertical scaling. Even on its own, vertical scaling is an extremely powerful tool. On the cloud, vertical scaling allows you to scale your deep learning training job all the way up to an 8xV100 instance (e.g. a p3.16xlarge on AWS). That’s a lot of deep learning horsepower, in the ballpark of an NVIDIA DGX-1, a system that retailed for $150,000 at launch!
We will discuss horizontal scaling with data parallelization in a future blog post. In the meantime, to see a code recipe showing it in action, check out the PyTorch AWS tutorial.
Data distributed, part 2: process synchronization
Now that we understand the initialization process, we can start filling out the body of the train method that does all of the work.
Recall what we have so far:
def train(rank, num_epochs, world_size):
    init_process(rank, world_size)
    print(
        f"{rank + 1}/{world_size} process initialized.\n"
    )
    # rest of the training script goes here!
Each of our four training processes runs this function to completion, exiting when it is finished. If we were to run this code right now (via python multi_init.py), we would see something like the following printed out to our console:
$ python multi_init.py
1/4 process initialized.
3/4 process initialized.
2/4 process initialized.
4/4 process initialized.
The processes are independently executed, and there are no guarantees about what state any one process is in at any one point in the training loop. This requires making some careful changes to your initialization process.
(1) Any methods that download data should be isolated to the master process.
Failing to do so will replicate the download process across all of the processes, resulting in four processes writing to the same file simultaneously, a surefire recipe for data corruption.
Luckily, this is easy to do:
# import torch.distributed as dist
if rank == 0:
    downloading_dataset()
    downloading_model_weights()
dist.barrier()
print(
    f"Rank {rank + 1}/{world_size} training process passed data download barrier.\n"
)
The dist.barrier call in this code sample blocks each process until every process has reached it, so no process continues until the master process (rank == 0) has finished running downloading_dataset and downloading_model_weights. This isolates all of the network I/O to a single process and prevents the other processes from jumping ahead until it’s done.
(2) The data loader needs to use DistributedSampler. Code sample:
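from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def get_dataloader(rank, world_size):
    dataset = PascalVOCSegmentationDataset()
    sampler = DistributedSampler(
        dataset, rank=rank, num_replicas=world_size, shuffle=True
    )
    # shuffling is handled by the sampler, so DataLoader's own shuffle stays off
    dataloader = DataLoader(
        dataset, batch_size=8, sampler=sampler
    )
    return dataloader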
DistributedSampler uses rank and world_size to figure out how to split the dataset across the processes into non-overlapping batches. At every training step, each worker process retrieves batch_size observations from its local copy of the dataset. In the example case of four GPUs, this means an effective batch size of 8 * 4 = 32.
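To see the non-overlapping split for yourself, you can construct one DistributedSampler per rank in a single process. This is a small sketch under a few assumptions: a plain list of 32 integers stands in for the real dataset, shuffling is turned off so the output is easy to read, and num_replicas and rank are passed explicitly so that no process group has to be initialized.

```python
from torch.utils.data.distributed import DistributedSampler

dataset = list(range(32))  # stand-in dataset with 32 records
world_size = 4

# One sampler per simulated rank; each yields the dataset indices for that rank.
shards = [
    list(DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False))
    for rank in range(world_size)
]

for rank, indices in enumerate(shards):
    print(f"rank {rank}: {indices}")
```

When shuffle=True, it is worth calling sampler.set_epoch(epoch) at the start of every epoch; otherwise each epoch reuses the same shuffled order.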
(3) Tensors need to be loaded onto the correct device. To do so, parameterize your .cuda() calls with the rank of the device the process is managing:
batch = batch.cuda(rank)
segmap = segmap.cuda(rank)
model = model.cuda(rank)
(4) Any randomness in model initialization must be made deterministic.
It’s extremely important that the models start and remain synchronized throughout the entire training process. Otherwise, you’ll get inaccurate gradients and the model will fail to converge.
Random initialization methods like torch.nn.init.kaiming_normal_ can be made deterministic using the following code:
import numpy as np
import torch

torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(0)
The PyTorch documentation has an entire page dedicated to this topic: Reproducibility.
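A quick way to convince yourself that seeding works is to build the same model twice and compare the weights. The sketch below runs on CPU; seed_everything and build_model are hypothetical helper names, and the tiny linear layer is only a stand-in for a real network.

```python
import random

import torch


def seed_everything(seed=0):
    """Seed the Python and PyTorch random number generators."""
    random.seed(seed)
    torch.manual_seed(seed)


def build_model():
    # Hypothetical tiny model; kaiming_normal_ draws from the global torch RNG.
    layer = torch.nn.Linear(8, 4)
    torch.nn.init.kaiming_normal_(layer.weight)
    return layer


seed_everything(0)
model_a = build_model()
seed_everything(0)
model_b = build_model()

print(torch.equal(model_a.weight, model_b.weight))
```

In a distributed job the equivalent is calling the seeding code with the same seed in every process before the model is constructed.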
(5) Any methods that perform file I/O should be isolated to the master process.
This is necessary for the same reason that isolating network I/O is necessary: the inefficiency and potential for data corruption created by concurrent writes to the same file. Again, this is easy to do using simple conditional logic:
if rank == 0:
    if not os.path.exists('/spell/checkpoints/'):
        os.mkdir('/spell/checkpoints/')
    torch.save(
        model.state_dict(),
        f'/spell/checkpoints/model_{epoch}.pth'
    )
As an aside, note that any global loss values or statistics you want to log will require you to synchronize the data yourself. This can be done using additional MPI primitives in torch.distributed not covered in-depth in this tutorial. Check out this gist I prepared for a quick intro, and refer to the Distributed Communication Package PyTorch docs page for a detailed API reference.
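As a rough idea of what that synchronization can look like, here is a sketch that averages a scalar (such as a per-process loss) across all processes with dist.all_reduce. The helper names global_mean and demo_single_process are my own, and the demo again uses a one-process world on CPU with the gloo backend. If you use the NCCL backend, the tensor would need to live on the process's GPU instead.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def global_mean(value):
    """Average a Python float across all processes in the default group."""
    tensor = torch.tensor([value], dtype=torch.float64)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # every process gets the sum
    return (tensor / dist.get_world_size()).item()


def demo_single_process():
    """Initialize a one-process world and reduce a made-up loss value."""
    init_file = os.path.join(tempfile.mkdtemp(), "metrics_init")
    dist.init_process_group(
        "gloo", init_method=f"file://{init_file}", rank=0, world_size=1
    )
    try:
        local_loss = 0.25  # placeholder for this process's loss
        return global_mean(local_loss)
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(demo_single_process())
```

Since all_reduce leaves the result on every process, any rank can log it, though it usually makes sense to log from rank 0 only.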
(6) The model must be wrapped in DistributedDataParallel.
Assuming you’ve done everything else correctly, this is where the magic happens. ✨
model = DistributedDataParallel(model, device_ids=[rank])
Congratulations! Assuming you’ve done everything right (it’s a lot), your model is now training in distributed data parallel mode!
To see complete code samples, head to the GitHub repo.
What about DataParallel?
Readers familiar with the PyTorch API may know that there is also one other data parallelization strategy in PyTorch, torch.nn.DataParallel. This API is much easier to use; all you have to do is wrap your model initialization like so:
model = nn.DataParallel(model)
A one-liner change! Why not just use that instead?
Under the hood, DataParallel uses multithreading, instead of multiprocessing, to manage its GPU workers. This greatly simplifies the implementation: since the workers are all different threads of the same process, they all have access to the same shared state without requiring any additional synchronization steps.
However, using multithreading for computational jobs in Python is famously slow, due to the presence of the Global Interpreter Lock. As the benchmarks in the next section will show, models parallelized using DataParallel are significantly slower than those parallelized using DistributedDataParallel.
Nevertheless, DataParallel remains an extremely useful tool for model training jobs you want to speed up but don’t want to spend the additional time and energy optimizing fully.
Benchmarks
To benchmark distributed model training performance I trained a DeepLabV3-ResNet 101 model (via Torch Hub) on the PASCAL VOC 2012 dataset (from torchvision datasets) for 20 epochs. I used the Spell API to launch five different versions of this model training job: once on a single V100 (a p3.2xlarge on AWS), and once each on a V100x4 (p3.8xlarge) and a V100x8 (p3.16xlarge) using DistributedDataParallel and DataParallel. This benchmark excludes the time spent downloading data at the beginning of the run; only model training and saving time counts.
The results are not definitive by any means, but should nevertheless give you some sense of the time savings distributed training nets you:
As you can clearly see, DistributedDataParallel is noticeably more efficient than DataParallel, but still far from perfect. Switching from a V100x1 to a V100x4 is a 4x multiplier on raw GPU power but only 3x on model training speed. Doubling the compute further by moving up to a V100x8 only produces a ~30% improvement in training speed. By that point DataParallel almost catches up to DistributedDataParallel in (in)efficiency.
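One way to put a number on that (in)efficiency is scaling efficiency: the observed speedup divided by the number of GPUs. The sketch below uses only the approximate figures quoted above, and it assumes that the "~30% improvement" means the V100x8 run is roughly 1.3 times as fast as the V100x4 run; the function name scaling_efficiency is hypothetical.

```python
def scaling_efficiency(speedup, n_gpus):
    """Fraction of ideal linear scaling achieved: speedup divided by GPU count."""
    return speedup / n_gpus


speedup_x4 = 3.0               # "only 3x on model training speed"
speedup_x8 = speedup_x4 * 1.3  # assumed reading of "a ~30% improvement"

print(f"V100x4: {scaling_efficiency(speedup_x4, 4):.0%} of linear scaling")
print(f"V100x8: {scaling_efficiency(speedup_x8, 8):.0%} of linear scaling")
```

By this rough measure, moving from four GPUs to eight drops DistributedDataParallel from about three quarters of ideal scaling to about half.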
Conclusion
In this article we discussed distributed training and data parallelization, learned about the DistributedDataParallel and DataParallel APIs, and applied them to a real model to get some time-saved benchmarks.
Note that this is still an active area of development. The PyTorch team landed a new PR just this month that promises substantial improvements to DistributedDataParallel performance. Expect those times to come down in future releases!
Something that I don’t think gets discussed often enough is the impact that distributed training has on developer productivity. Taking even a moderately-sized model from “this takes three hours to train” to “this takes one hour to train” greatly increases the volume of experiments you can perform on and with the model in a single day, a substantial improvement to your developer velocity.
Hopefully you found this guide helpful! To learn more about the distributed training API I recommend browsing the distributed docs.
Note: This article originally appeared on the Spell.ML Blog.
Translated from: https://towardsdatascience.com/distributed-model-training-in-pytorch-using-distributeddataparallel-d3d3864dc2a7