Reading Notes on Neural Network and Deep Learning Papers

[Figure 1] This figure shows my classification and summary of these papers.

My reading notes are below. Each note, following the paper's title, is divided into several parts: a summary of the paper, its advantages, an evaluation, and possible improvements.
Learning representations by back-propagating errors

  1. The team describe a new back-propagation learning procedure for networks of neurone-like units, which repeatedly adjusts the weights of the connections in the network so as to gradually reduce the difference between the actual output vector and the desired output vector. As a consequence of these weight adjustments, the internal 'hidden' units, which belong to neither the input nor the output, come to represent significant features of the task domain, and the interactions among these units capture the regularities of the task. This learning procedure is distinguished from earlier, simpler methods such as the perceptron-convergence procedure (which does not learn representations) by its ability to create useful new features.
    Aiming to find a powerful synaptic modification rule that allows a neural network to develop an internal structure appropriate for a particular task domain, many earlier attempts used self-organizing neural networks. The task is therefore important, and its difficulty depends closely on how the task is specified. If the task is specified by the desired state vector of the output units and the input units are directly connected to the output units, it is easy to find learning rules that progressively reduce the difference between the actual and desired outputs. Things are different, and more difficult, when we introduce hidden units whose actual or desired states are not specified by the task: to achieve the desired input-output behaviour, the learning procedure itself must decide when the hidden units should be active. The team prove that a general-purpose and relatively simple procedure is powerful enough to construct appropriate internal representations.

2. In its simplest form, the learning procedure uses a layered network with a layer of input units at the bottom, any number of intermediate layers, and a layer of output units at the top. Connections within a layer or from higher to lower layers are forbidden, but connections can skip intermediate layers. An input vector is presented to the network by setting the states of the input units. Then the states of the units in each layer are determined by applying equations (1) and (2) to the connections coming from lower layers. All units within a layer have their states set in parallel, but different layers have their states set sequentially, starting at the bottom and working upwards until the states of the output units are determined.
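The following is a minimal NumPy sketch of this forward/backward computation, assuming logistic units as in equations (1) and (2) and the squared-error measure; the layer sizes, toy input and learning rate are illustrative choices, not values from the paper.

```python
# A minimal sketch of the procedure described above: forward pass through one
# hidden layer, then back-propagation of the error derivatives and a gradient
# descent step on the weights.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(2, 3))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(3, 1))   # hidden -> output weights
lr = 0.5                                  # illustrative learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # equation (2): y_j = 1 / (1 + e^{-x_j})

x = np.array([[0.0, 1.0]])                # one input vector
d = np.array([[1.0]])                     # desired output vector

for step in range(1000):
    # Forward pass: states are set layer by layer, bottom to top.
    h = sigmoid(x @ W1)                   # equation (1): x_j = sum_i y_i w_ji, then (2)
    y = sigmoid(h @ W2)

    # Backward pass: propagate dE/dx back and adjust weights by gradient descent.
    e_out = (y - d) * y * (1 - y)         # error derivative at the output units
    e_hid = (e_out @ W2.T) * h * (1 - h)  # error derivative at the hidden units
    W2 -= lr * h.T @ e_out
    W1 -= lr * x.T @ e_hid
```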

3. The most obvious limitation of this learning procedure concerns local versus global minima: the error surface may contain local minima, so gradient descent is not guaranteed to find a global minimum. In practice, however, this rarely causes problems.

4. Although the learning procedure, in its current form, is not a plausible model of learning in brains, its application to various tasks shows that interesting internal representations can be constructed by gradient descent in weight-space. This suggests that it is worth looking for more biologically plausible ways of doing gradient descent in neural networks.
Attention Is All You Need
1. The team propose the Transformer, a new, simple network architecture that is based solely on attention mechanisms and dispenses with recurrence and convolutions entirely. Like the dominant sequence transduction models, which are built on complex recurrent or convolutional neural networks, it consists of two parts, an encoder and a decoder, connected here through an attention mechanism. Experiments show that these models are superior in quality, are more parallelizable, and require significantly less time to train.

2. Recurrent language models and encoder-decoder architectures are currently the dominant approaches to sequence modeling and transduction problems such as language modeling and machine translation. Although recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, the fundamental constraint of sequential computation, along with the vanishing-gradient problem, still remains. With the help of attention mechanisms, which allow modeling of dependencies without regard to their distance in the input or output sequences, the Transformer can eschew recurrence: it relies entirely on attention to draw global dependencies between input and output, which allows significantly more parallelization and reaches a new state of the art in translation quality, especially after sufficient training.

3. The Transformer follows the overall encoder-decoder architecture. The encoder maps an input sequence of symbol representations to a sequence of continuous representations, and the decoder then generates an output sequence. The whole process is auto-regressive: each output depends on the previously generated outputs. The encoder is composed of a stack of N = 6 identical layers, and so is the decoder.
The attention function operates on queries, keys and values: it maps a query and a set of key-value pairs to an output, where the output is a weighted sum of the values. The Transformer uses multi-head attention in three different ways. In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder; the encoder contains self-attention layers; and self-attention layers in the decoder allow each position in the decoder to attend to all positions up to and including that position. A minimal sketch of the underlying scaled dot-product attention is given below.
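The sketch below shows scaled dot-product attention, the building block of multi-head attention, in plain NumPy; the shapes and toy inputs are illustrative assumptions, not values from the paper.

```python
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# Toy example: 4 query positions, 6 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```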

4. Considering the total computational complexity per layer, the amount of computation that can be parallelized, and the path length between long-range dependencies in the network, the team chose self-attention. Self-attention enables parallel computation, has lower per-layer complexity, and yields more interpretable models.

5. The team trained on the standard dataset and used the Adam optimizer. Residual dropout and label smoothing were used for regularization. Further experiments show that the big Transformer model achieves higher scores at a lower training cost than the previously published best models. As expected, bigger models are better, and dropout is very helpful in avoiding over-fitting.
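As a reminder of what label smoothing does, here is a small sketch of one common formulation, softening a one-hot target before computing the cross-entropy loss; the smoothing value and class count are illustrative assumptions, not the paper's exact setup.

```python
# Replace a one-hot target with (1 - eps) on the true class and eps spread
# uniformly over all K classes, which discourages over-confident predictions.
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

target = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(target))   # [0.025, 0.025, 0.925, 0.025]
```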
As the first sequence transduction model based entirely on attention, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers for translation tasks. The team plans to extend the Transformer to problems involving input and output modalities other than text.

GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism

  1. Scaling up the capacity of deep neural networks is known to be an effective way to improve model quality on several different machine learning tasks, but increasing model capacity beyond the memory limit of a single accelerator has, in many instances, required developing special algorithms or infrastructure. Such solutions are usually architecture-specific and do not transfer to other tasks. Considering these constraints and limitations, the team introduced GPipe, a pipeline parallelism library that allows scaling any network which can be expressed as a sequence of layers, thereby meeting the requirements of efficient, task-independent model parallelism. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility to scale a variety of different networks to very large sizes efficiently. Moreover, when a model is partitioned across multiple accelerators, GPipe achieves almost linear speedup using a novel batch-splitting pipelining algorithm. The team trained large-scale neural networks on image classification, attaining a top-1 accuracy of 84.4% on ImageNet-2012. The approach also works well on multilingual neural machine translation: they trained a single large-scale Transformer model on a corpus spanning over 100 languages and achieved better quality than all bilingual models.

  2. Thanks to methods that have facilitated scaling the effective capacity of neural networks, deep learning has made great progress. In general, the larger the model, the better its performance on the task, and it is widely acknowledged that there is a strong correlation between model size and classification accuracy. However, although larger models have brought remarkable quality improvements in many fields, scaling neural networks introduces significant practical challenges: large models run into hardware constraints such as memory limitations and communication bandwidth on accelerators. In practice, users have to divide large models into parts assigned to different accelerators (such as GPUs or TPUs), yet efficient model-parallel algorithms are extremely difficult to design and implement. Practitioners therefore often face difficult trade-offs among scaling capacity, flexibility (or specificity to particular tasks and architectures), and training efficiency. When the architecture and task are fixed, model-parallel algorithms can be made very efficient; but with the development of deep learning and the increasing demand for reliable and flexible infrastructure, researchers need to be able to scale neural networks easily across the wide variety of machine learning tasks they now perform.
    GPipe solves the memory limitation problem by partitioning the model across different accelerators. Each model can be specified as a sequence of layers, and consecutive groups of layers can be partitioned into cells, each placed on a separate accelerator. On top of this partitioned setup, the team propose a novel pipeline parallelism algorithm that splits a mini-batch of training examples into smaller micro-batches and then pipelines the execution of each set of micro-batches over the cells. This algorithm helps researchers train increasingly large models by deploying more accelerators. GPipe can also be combined with data parallelism, using even more accelerators in a complementary manner to expand the scale of neural network training. The team applied RMSProp during training and also kept communication overhead low to optimize performance. A tiny sketch of the micro-batch splitting idea follows.
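The sketch below illustrates, in plain Python, splitting a mini-batch into micro-batches and streaming them through a sequence of cells; the "cells" are toy functions and the whole loop is a conceptual illustration, not the GPipe implementation (where each cell runs on its own accelerator so different micro-batches overlap in time).

```python
import numpy as np

def split_into_micro_batches(batch, num_micro):
    """Split a mini-batch along its first axis into micro-batches."""
    return np.array_split(batch, num_micro)

# Hypothetical 3-cell partition of a layer sequence.
cells = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: x ** 2]

def pipelined_forward(batch, num_micro=4):
    micro_batches = split_into_micro_batches(batch, num_micro)
    outputs = []
    # Each micro-batch flows through the cells in order; with one cell per
    # accelerator, cell k can work on micro-batch i+1 while cell k+1 works on i.
    for mb in micro_batches:
        for cell in cells:
            mb = cell(mb)
        outputs.append(mb)
    return np.concatenate(outputs)

print(pipelined_forward(np.arange(8.0)))
```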
    3. There are limitations: imperfect partitioning algorithms can lead to load imbalance, and better partitioning algorithms could improve on the current heuristic approach. The team found that re-computation time was the main contributor to GPipe overhead, taking up to 23% of the total step time; another source of overhead was load imbalance.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  1. The paper proposes a new language representation model named BERT, which stands for Bidirectional Encoder Representations from Transformers.
    As is well known, pre-trained models have proven effective for many natural language tasks, including sentence-level tasks such as natural language inference and paraphrasing, which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks like named entity recognition and question answering, where models are required to produce fine-grained output at the token level. In previous work, the two approaches for applying pre-trained language representations to downstream tasks, feature-based and fine-tuning, share the same objective function during pre-training, and both use unidirectional language models to learn general language representations. The team argue that it is these techniques that restrict the power of the pre-trained representations. The core limitation is that standard language models are unidirectional, which limits the choice of architectures usable during pre-training. These restrictions are sub-optimal for sentence-level tasks and can be devastating for token-level tasks, where it is crucial to incorporate context from both directions.

  2. In this paper, the team propose BERT to improve the fine-tuning based approaches. Inspired by the Cloze task, BERT uses a "masked language model" pre-training objective to alleviate the previously mentioned unidirectionality constraint. Unlike left-to-right language model pre-training, this objective lets the representation incorporate context from both directions, which allows the team to pre-train a deep bidirectional Transformer. In their work, they demonstrate the importance of bidirectional pre-training and show that pre-trained representations reduce the need for many heavily engineered task-specific architectures. BERT becomes the first fine-tuning based model to achieve state-of-the-art performance on a large suite of both sentence-level and token-level tasks, obtaining new state-of-the-art results on eleven NLP tasks.
    The BERT framework contains two steps: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different tasks. For fine-tuning, BERT is first initialized with the pre-trained parameters, and all of these parameters are then fine-tuned using labeled data from the downstream tasks. Each downstream task has its own fine-tuned model, even though they are all initialized with the same pre-trained parameters. Pre-training and fine-tuning use the same architecture except for the output layers, so the difference between the pre-trained architecture and the final downstream architecture is minimal; a significant strength of BERT is this unified architecture across different tasks. In addition, to train a model that understands sentence-level relationships, they also introduce a next-sentence prediction task for pre-training.

3. In their experiments, they mask 15% of all WordPiece tokens. Compared with denoising auto-encoders, BERT only predicts the masked words instead of reconstructing the whole input. This leads to slower convergence than a left-to-right model and to a mismatch between pre-training and fine-tuning, because the [MASK] token never appears during fine-tuning; if it is used too much in pre-training, the model's performance suffers. And, obviously, BERT consumes a lot of hardware resources. A simplified sketch of the masking step is given below.
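This is a simplified sketch of the masked-language-model input corruption described above: roughly 15% of tokens are replaced by a mask symbol and their original ids are kept as prediction targets. The token ids and the [MASK] id are illustrative assumptions, and the full 
BERT recipe also sometimes keeps or randomly replaces selected tokens, which is omitted here.

```python
import random

MASK_ID = 103          # hypothetical id of the [MASK] token
MASK_PROB = 0.15

def mask_tokens(token_ids, rng=random.Random(0)):
    masked = list(token_ids)
    targets = {}                       # position -> original token id to predict
    for pos, tok in enumerate(token_ids):
        if rng.random() < MASK_PROB:
            targets[pos] = tok
            masked[pos] = MASK_ID      # the model must predict the original token here
    return masked, targets

tokens = [2023, 2003, 1037, 3722, 6251, 1012, 2009, 2573]
print(mask_tokens(tokens))
```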

NASPipe: High-Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism

  1. In this work, inspired by classical CPU instruction pipelining, the team present NASPipe, the first high-performance and reproducible distributed parallel supernet training system, built on a causal synchronous parallel (CSP) pipeline scheduling abstraction. NASPipe divides a supernet across GPUs, executing multiple generated sub-tasks (subnets) concurrently in a pipelined manner; meanwhile, it tracks the correlations between subnets and deterministically resolves all causal dependencies caused by the subnets' layer sharing.
    Neural Architecture Search (NAS) has contributed significantly to building high-quality Deep Neural Networks (DNNs) for different applications and devices. Because of its low computation cost and high quality, the supernet paradigm, which composes the entire search space into a single supernet and trains it recurrently, is the most prevalent and widely accepted. Traditional NAS paradigms, which train each explored DNN to convergence, cost too much; the supernet paradigm instead trains one monolithic model rather than tens of thousands of standalone DNNs, while preserving the quality of the searched DNNs. Unfortunately, although industry and academia have created systems for easily defining NAS supernets and for training large standalone models, such as GPipe, DeepSpeed and PipeDream, none of these is designed to train extremely large supernets effectively.
    There are two main problems. One is that it is difficult to deterministically resolve the dependencies between subnets activated in parallel, because currently available large-scale DNN training systems are designed to parallelize the training of multiple batches within the same DNN model, not to capture and enforce causal dependencies. The other is to manage the extra-large supernet context among GPUs efficiently, so as to leave more cache space for larger-batch training and achieve higher GPU utilization.

  2. To enforce high-quality and reproducible supernet training, NASPipe concurrently executes the subnets generated by supernet-based exploration algorithms, oversees the connections between subnets, and preserves all causal dependencies resulting from layer sharing. As DNNs grow larger and larger, a single GPU may no longer be able to hold a subnet by itself, which makes pipeline parallelism one of the most efficient ways to train large models. To parallelize subnet execution, the team divide each subnet into parts (subsets of layers) instead of deploying each whole subnet task on one GPU, so that each GPU executes one part and the subnet executions form a pipeline. Thanks to pipeline parallelism, the team can resolve causal dependencies and execute synchronizations in supernet training efficiently, locally on each GPU and in a decentralized manner.
    The team introduce a pipeline scheduler that promotes the subnet tasks with larger chronological order into execution to improve pipeline efficiency. By leveraging the status of the current stage and the status passed from other stages, NASPipe predicts the upcoming subnets most likely to be scheduled in the next few steps, so that it can manage GPU memory efficiently and precisely swap in the context of the subnets about to be executed.
    The team have prototyped NASPipe on PyTorch, one of the most prevalent DNN execution frameworks, where it serves as the training system for a supernet-based NAS algorithm. NASPipe has several advantages over recent pipeline training systems: first, it is reproducible, producing the same training process and results regardless of the number of GPUs; second, it executes efficiently despite the dependencies that exist across subnets; third, it is scalable, providing roughly linearly increasing computation power as the number of GPUs grows.

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

  1. In this paper, the team propose a novel interleaved pipeline schedule as part of a technique called PTD-P, and show how to combine tensor, pipeline, and data parallelism to scale training to thousands of GPUs.
    In recent years, Transformer-based language models have advanced rapidly as large-scale compute has become available and datasets have grown larger, driving fast progress in natural language processing (NLP). Large language models have achieved state-of-the-art accuracy on multiple tasks, and as a result the number of parameters in the most advanced NLP models has increased exponentially. There are two main challenges in training such models. First, GPU memory capacity is limited: even a multi-GPU server cannot hold a large-scale model, and the parameters of these models no longer fit in the main memory of even the largest NVIDIA 80 GB A100 card. Second, even if the model could be fit onto a single GPU (for example, by swapping parameters between host and device memory), the large number of compute operations required would lead to unrealistically long training times.
    This calls for parallelism. In this paper, they show how to combine PTD-P (inter-node pipeline parallelism, intra-node tensor parallelism, and data parallelism) to achieve high aggregate throughput and train trillion-parameter-scale models, enabling end-to-end training in a reasonable time.

2. They propose a novel interleaved pipeline schedule that can increase throughput by more than 10%, with memory footprint comparable to existing methods.
Data-parallel scale-out usually works well, but it has two limitations: (a) beyond a certain point, the per-GPU batch size becomes too small, reducing GPU utilization and increasing communication cost; and (b) the maximum number of devices that can be used is the batch size, which limits the number of accelerators usable for training.
Tensor (intra-layer) model parallelism, in which the matrix multiplications within each transformer layer are split across multiple GPUs, can be used to overcome these limitations, but it is not sufficient for the largest models: a larger model needs to be split across multiple multi-GPU servers.
Pipeline model parallelism is another technique that supports large-scale model training, in which the layers of the model are striped across multiple GPUs. To achieve high efficiency it usually requires a larger batch size; in this work they also introduce a new pipeline schedule that improves efficiency at small batch sizes.
They consider the performance impact of combining pipeline and tensor model parallelism with data parallelism, as well as the interactions among these forms of parallelism. With data parallelism, each worker has a copy of the complete model, the input dataset is sharded, and workers periodically aggregate their gradients. With pipeline parallelism, the layers of the model are partitioned across multiple devices; when the model repeats the same transformer block, each device can be assigned the same number of transformer layers. Periodic pipeline flushes are introduced to preserve strict optimizer semantics, so that optimizer steps stay synchronized across devices; at the beginning and end of each batch, devices are idle. With tensor model parallelism, each individual layer of the model is divided across multiple devices, as sketched below.
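The following is a conceptual NumPy sketch of the tensor (intra-layer) model parallelism idea: the weight matrix of one layer is split by columns across two hypothetical devices, each computes its shard, and the shards are concatenated. It illustrates the idea only and is not the Megatron-LM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))         # activations: batch of 4, hidden size 16
W = rng.normal(size=(16, 32))        # full weight matrix of the layer

# Column-wise split: "device 0" owns the first 16 output columns, "device 1" the rest.
W0, W1 = W[:, :16], W[:, 16:]
Y0 = X @ W0                          # computed on device 0
Y1 = X @ W1                          # computed on device 1
Y = np.concatenate([Y0, Y1], axis=1) # gather the partial outputs

assert np.allclose(Y, X @ W)         # same result as the unpartitioned layer
```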
They also implemented three model-specific optimizations on the computation graph to achieve high performance.

PipeDream: Generalized Pipeline Parallelism for DNN Training

  1. The team present PipeDream, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and to reduce communication traffic as much as possible. Deep neural networks (DNNs) have made great progress in a range of applications, but their training is very time-consuming and requires efficient multi-accelerator parallelization. As DNN deployments become more widespread, training and compute costs grow, so training needs to be executed in parallel across multiple accelerators (such as GPUs). DNN training proceeds in iterations of forward and backward computation; in each iteration, the training loop processes a mini-batch of input data and updates the model parameters. Current methods focus on parallelizing each iteration of the optimization algorithm across a group of workers; unfortunately, intra-batch parallelization can suffer from large communication costs.
    Although pipelining is a simple idea, DNN training poses an important challenge that traditional pipelining does not face: a naive schedule first completes the forward passes for all input mini-batches and only then runs the backward passes. Such an approach has low statistical efficiency, increasing the number of passes through the dataset required to produce a high-quality model, and it may prevent the model from reaching the desired target accuracy.
  2. In this paper, they propose PipeDream, a system that uses pipeline parallelism to enable faster DNN training by combining intra-batch parallelism with inter-batch parallelism. PipeDream divides the model among the available workers, assigning a group of consecutive operators (called layers in DNN terminology) in the operator graph to each worker, and then overlaps the computation and communication of different inputs in a pipelined manner.
    PipeDream automatically determines how to partition the DNN's operators based on a short profiling run performed on a single GPU, balancing the computational load across the different stages while minimizing communication for the target platform. It extends 1F1B to incorporate round-robin scheduling across data-parallel stages, while ensuring that gradients in the backward pass are routed to the worker that performed the corresponding forward pass, because the same weight version and intermediate outputs are required for correct gradient computation. The combined scheduling algorithm, 1F1B-RR, produces a static schedule of operators that each worker runs repeatedly, keeping all workers highly utilized. Pipeline-parallel training can therefore be viewed as a combination of inter-batch pipelining and intra-batch parallelism. Their evaluation, spanning many combinations of DNN models, datasets and hardware configurations, confirms the training-time advantage of PipeDream's pipeline parallelism.
    PipeDream achieves high hardware efficiency with no pipeline stalls in steady state, and statistical efficiency comparable to data parallelism using the same number of workers; this more nuanced approach to pipelining outperforms other solutions. Pipeline-parallel DNN training helps reduce the communication overheads that bottleneck intra-batch parallelism: by automatically partitioning DNN training across workers and combining inter-batch pipelining with intra-batch parallelism, PipeDream better overlaps computation with communication while minimizing the amount of data communicated. A small sketch of the 1F1B steady state follows.
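The sketch below generates, for each pipeline stage, the order of forward (F) and backward (B) steps in a 1F1B-style schedule: earlier stages warm up with extra forward passes, then every stage alternates one forward with one backward. It is an illustrative schedule generator under simplifying assumptions, not the PipeDream scheduler itself (which also handles round-robin data-parallel replicas and weight versioning).

```python
def one_f_one_b_schedule(stage, num_stages, num_microbatches):
    """Return the sequence of ('F', i) / ('B', i) steps executed by one stage."""
    warmup = num_stages - stage - 1          # earlier stages warm up with more forwards
    schedule, next_f, next_b = [], 0, 0
    for _ in range(warmup):
        schedule.append(('F', next_f)); next_f += 1
    while next_b < num_microbatches:         # steady state: alternate 1F and 1B
        if next_f < num_microbatches:
            schedule.append(('F', next_f)); next_f += 1
        schedule.append(('B', next_b)); next_b += 1
    return schedule

for s in range(3):
    print(f"stage {s}:", one_f_one_b_schedule(s, num_stages=3, num_microbatches=6))
```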

Single Path One-Shot Neural Architecture Search with Uniform Sampling

  1. This work proposes a single-path one-shot model to address the challenges in supernet training. The central idea is to build a simplified supernet in which all architectures are single paths, which alleviates the weight co-adaptation problem. Training is performed by uniform path sampling, so that all architectures (and their weights) are trained fully and equally.
    Neural architecture search (NAS) aims to automate architecture engineering by treating architecture design as a search problem. Recent methods use a weight-sharing strategy to reduce the amount of computation: a supernet subsuming all architectures is trained only once, each architecture inherits its weights from the supernet, and only light fine-tuning is performed, so the computation cost is greatly reduced. Most weight-sharing methods use continuous relaxation to parameterize the search space, which raises two issues. First, the weights in the supernet are deeply coupled. Second, joint optimization introduces further coupling between the architecture parameters and the supernet weights; the greedy nature of gradient-based methods inevitably introduces bias into the optimization process and can easily mislead the architecture search. Existing one-shot methods still have coupled weights in the supernet, their optimization is complex and involves sensitive hyper-parameters, and they have not shown competitive results on large datasets.
    To alleviate these problems, they propose a simple but effective single-path one-shot method.

  2. This work revisits the one-shot paradigm and proposes a new approach to further simplify training and enhance architecture search. Based on the observation that the accuracy of an architecture using inherited weights should be predictive of its accuracy with fully optimized weights, they suggest that supernet training should be stochastic, with all architectures having their weights optimized simultaneously. This leads to a uniform sampling strategy.
    They discuss the disadvantages of existing NAS methods that rely on nested and joint optimization, and propose a single-path one-shot method with uniform sampling that overcomes the drawbacks of existing one-shot methods. Its simplicity supports a rich search space, including novel designs for channel size and bit width, all handled in a unified way. They propose a simple search space composed of single-path architectures to reduce weight coupling in the supernet; the training is hyperparameter-free and easy to converge. A minimal sketch of uniform single-path sampling follows.
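The sketch below shows the core sampling idea in plain Python: each supernet layer offers several candidate choice blocks, and one block per layer is picked uniformly at random for every training step, so only a single path is activated. The blocks here are toy functions; the real method trains shared weights inside each candidate block.

```python
import random

# Hypothetical supernet: 4 layers, each with 3 candidate operations.
supernet = [
    [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    for _ in range(4)
]

def sample_single_path(supernet, rng=random.Random(0)):
    """Uniformly pick one candidate block per layer."""
    return [rng.randrange(len(choices)) for choices in supernet]

def run_path(supernet, path, x):
    for layer, choice in zip(supernet, path):
        x = layer[choice](x)
    return x

path = sample_single_path(supernet)
print("sampled path:", path, "-> output:", run_path(supernet, path, 1.0))
```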
    Comprehensive experiments show that their method is flexible and effective: it is easy to train and fast to search, and it effortlessly supports complex search spaces (e.g., building blocks, channel sizes, mixed-precision quantization) and different search constraints (e.g., FLOPs, latency), so it can be conveniently adapted to various needs. It defines the supernet and performs weight inheritance in a similar manner to prior work, but proceeds sequentially, combining the advantages of nested and joint optimization methods. Architecture search is efficient and flexible, and the method achieves state-of-the-art performance on ImageNet, a large dataset.
    Comprehensive experiments also show that their method achieves better results than other methods across several different search spaces. They analyze the search cost and the correlation of their method; the approach is more efficient, especially when multiple searches are required.

Scaling Distributed Machine Learning with the Parameter Server

  1. This paper describes a third-generation open-source implementation of a parameter server, focusing on the system aspects of distributed inference.
    In recent years, distributed optimization and inference have become key to solving large-scale machine learning problems. The growth of data and the increasing complexity of models have led to an explosion in the number of parameters, and no single machine can solve these problems quickly enough. The intensive computational workload and heavy data traffic require careful system design, so designing an efficient distributed algorithm is not easy.
    Large, complex models are often shared globally by all worker nodes, which must frequently access the shared parameters while performing computations to optimize them. Sharing brings three challenges: (1) parameter access requires a large amount of bandwidth; (2) many machine learning algorithms are sequential, and the resulting synchronization barriers hurt performance; (3) in a cloud environment, machines are unreliable and training tasks can be preempted.
    They propose a parameter server framework for distributed machine learning: data and workloads are distributed over worker nodes, while server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices. The framework manages asynchronous data communication between nodes and supports flexible consistency models, elastic scalability and continuous fault tolerance.

2. There are two key challenges in building a high-performance parameter server system: (1) communication: although the parameters could be updated as key-value pairs as in a traditional data store, using that abstraction naively is inefficient; (2) fault tolerance: fault tolerance is critical at scale, and efficient operation cannot require a complete restart of a long-running computation.
The parameter server framework gives developers two advantages: first, it keeps application-specific code simple by factoring out the common components of machine learning systems; second, as a shared platform for system-level optimization, it provides a robust, versatile and high-performance implementation that can handle a wide range of algorithms, from sparse logistic regression to topic models and distributed sketching. Their design is guided by workloads found in real systems, and the novelty of the system lies in the synergy achieved by choosing the right system techniques, adapting them to the machine learning algorithms, and modifying the machine learning algorithms to be more system-friendly. The parameter server provides five key features: efficient communication (an asynchronous communication model that does not block computation unless requested, optimized for machine learning tasks to reduce network traffic and overhead), flexible consistency models, elastic scalability, continuous fault tolerance and durability, and ease of use; it is designed for long-term deployment. A toy sketch of the push/pull style of key-value interface such a server exposes is given below.
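This is a toy, single-process sketch of the kind of key-value push/pull interface a parameter server exposes: workers push gradients for a set of keys and pull the current parameter values back. It is an illustration of the idea only, not the paper's implementation or API, and it omits sharding, asynchrony and fault tolerance.

```python
from collections import defaultdict

class ToyParameterServer:
    def __init__(self, learning_rate=0.1):
        self.params = defaultdict(float)   # key -> parameter value
        self.lr = learning_rate

    def push(self, keys, grads):
        """Apply gradient updates sent by a worker."""
        for k, g in zip(keys, grads):
            self.params[k] -= self.lr * g

    def pull(self, keys):
        """Return the current values for the requested keys."""
        return [self.params[k] for k in keys]

server = ToyParameterServer()
server.push(keys=[0, 1, 2], grads=[0.5, -1.0, 0.25])   # one worker's update
print(server.pull([0, 1, 2]))                           # [-0.05, 0.1, -0.025]
```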
They demonstrate experiments on several challenging tasks, using real datasets with billions of variables, which fully show the system's efficiency.

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

  1. In recent years, much of the progress in deep learning has been directly related to significant increases in model scale. Language models have been scaled to hundreds of billions of parameters and trained on ever larger datasets, unlocking new capabilities. However, training these very large models on distributed clusters currently requires a great deal of model-definition and engineering work specific to the cluster environment, such as tuning and selecting parallelization dimensions and choosing pipeline schemes (partitioning decisions).
    Automating the parallelization of large models would significantly accelerate ML research and production by enabling model developers to quickly explore new model designs without worrying about the underlying system challenges. Unfortunately, it requires navigating a complex plan space that grows exponentially with the number of parallelism dimensions and the size of the model and cluster.
    This paper proposes Alpa, which automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator and pipeline parallelism.

  2. Existing model-parallel training systems either require users to manually create a parallelization plan or can automatically generate one only from a limited space of model-parallel configurations; neither is sufficient to scale complex deep learning models on distributed compute devices.
    Automating the parallelization of large models would significantly accelerate ML research and production, but it requires navigating a complex plan space that grows exponentially with the parallelism dimensions and the size of the model and cluster. The key observation is that the different parallelization techniques can be organized into a hierarchical space and mapped onto the hierarchical structure of the compute cluster; a parallel execution plan can then be expressed hierarchically by specifying the plan within each parallelism category, which brings many advantages.
    They designed and implemented Alpa, the first compiler to automatically generate parallel execution plans covering data, operator and pipeline parallelism. Given a model description and a cluster configuration, Alpa partitions the cluster into multiple device meshes and distributes the training of large DL models by treating parallelism at two levels: inter-operator parallelism and intra-operator parallelism. On this basis, Alpa constructs a new hierarchical space of model-parallel execution plans, designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level, and implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices.
    Their evaluation shows that the parallelization plans generated by Alpa match or outperform hand-tuned model-parallel training systems, even on the models those systems were designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and to models without manually designed plans. They evaluated Alpa on training large models with billions of parameters and obtained efficient execution results.

vPIPE: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training

  1. The team introduce vPIPE, the first dynamic layer live-partition and memory management system for pipeline parallelism, which serves as a transparent acceleration layer between a typical pipeline-parallel system and its underlying execution engine. To meet both design goals G1 (carefully manage all stages' training memory so that no GPU exceeds its physical memory capacity) and G2 (enforce a "balanced" partition) transparently, it automatically finds a globally near-optimal plan that migrates layers among stages and relocates each layer's activations and parameters to its current stage's GPU or CPU memory. In this way, vPIPE can significantly relieve the pressure on tense stages of the pipeline and improve pipeline throughput in a balanced manner.
    vPIPE makes two key contributions. The first is an online search algorithm for layer partitioning and memory management plans, which is fast, near-optimal and stage-distributed, and is used to find globally effective swapping, recomputation and partitioning strategies; it improves the efficiency and scalability of vPIPE. The second is a transparent live-migration protocol for rebalancing the layer distribution across training pipelines, which neither delays the upper system's functioning nor changes the staleness of its parameters.

  2. In recent years, to obtain higher modeling capacity, the scale of large deep neural networks has been increasing explosively, with more layers and more parameters per layer. Pipeline parallelism is an effective method for training large DNNs, and an efficient pipeline system should achieve the two key goals G1 and G2. Although previous work has put much effort into building pipeline-parallel systems, it remains difficult to achieve these complex and dynamic design goals at the same time.
    Existing pipeline-parallel systems fall into two types. The first stores the activation tensors generated during the forward passes directly in GPU memory; it must therefore keep the batch size moderate, even though a larger training batch size could lead to higher GPU ALU utilization and higher throughput. The second discards all activation tensors after the forward passes and recomputes them during the backward passes; this significantly alleviates the imbalance in GPU memory utilization between earlier and later stages, but at the cost of an extra forward pass. In addition, when NAS is enabled in the DNN model, both types of pipeline-parallel systems suffer even more severe throughput degradation, because the number and layout of the model's layers can be modified at runtime by the search algorithm evaluating the NAS-enabled transformer.
    The team believe the key issue is that the memory management and layer-partition strategies of these systems are static. When a stage becomes tense because GPU memory explodes or newly activated layers arrive, these static strategies cannot use the idle GPU resources available in adjacent stages to relieve the pressure. vPIPE instead computes a mixed plan of swapping and recomputation for all layers on each stage rather than an all-recompute strategy, and instead of using a static partition, it generates a new partition plan and transparently live-migrates layers from a dense stage to an adjacent stage. This not only reduces the memory burden of the dense stage (G1) but also yields a more balanced partition (G2) with higher throughput.
    Achieving this involves two challenges. The first is to find globally effective swap, recompute and repartition (SRP) strategies across all stages; using a powerful decomposition method, they create a fast-converging, near-optimal search algorithm to address it. The second is how to migrate layers live (without GPU pauses or pipeline flushes) while keeping vPIPE transparent to the generic upper-layer pipeline-parallel system; they propose a new live-migration protocol. The key observation is that the time window between an activation's generation (in the forward pass) and its final use (in the corresponding backward pass) allows vPIPE to perform subtle interleaving and migrate layers transparently without changing the parameter staleness of the upper system. A small sketch of the per-layer swap-versus-recompute choice is given below.
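This is a conceptual sketch of the swap-versus-recompute decision discussed above: a layer's activation memory is freed either by swapping the tensor to CPU memory or by discarding and recomputing it, whichever costs less time. The cost model, bandwidth value and layer numbers are illustrative assumptions, not vPIPE's actual search algorithm.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    activation_mb: float      # size of the stored activation in MB
    recompute_ms: float       # time to recompute the activation in the backward pass

PCIE_GB_PER_S = 12.0          # assumed effective CPU<->GPU transfer bandwidth

def choose_strategy(layer: Layer) -> str:
    swap_ms = layer.activation_mb / PCIE_GB_PER_S   # MB divided by GB/s gives milliseconds
    return "swap" if swap_ms < layer.recompute_ms else "recompute"

for layer in [Layer("attention", 256.0, 8.0), Layer("mlp", 64.0, 12.0)]:
    print(layer.name, "->", choose_strategy(layer))
```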

  3. vPIPE has two limitations. First, it assumes that any single layer of a DNN workload trained with vPIPE fits within the memory limits of a single GPU. Second, vPIPE's layer-migration protocol remains live only when the time cost of transferring a layer's tensors can be overlapped with the computation time of DNN training.
