This figure shows my classification and summary of these papers.
My reading notes are below. Each note follows the paper's headline and is divided into several parts: a summary of the paper, its advantages, an evaluation, and possible improvements.
Learning representations by back-propagating errors
2. In its simplest form, the learning procedure uses a network divided into three levels: a layer of input units at the bottom, any number of intermediate layers, and a layer of output units at the top. Connections within a layer or from higher to lower layers are forbidden, but connections can skip intermediate layers. An input vector is presented to the network by setting the states of the input units. The states of the units in each layer are then determined by applying equations (1) and (2) to the connections coming from lower layers. All units within a layer have their states set in parallel, but different layers have their states set sequentially, starting at the bottom and working upwards until the states of the output units are determined.
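A minimal sketch (not the paper's original code) of this bottom-up pass, assuming illustrative layer sizes and random weights: equation (1) computes each unit's total input as a weighted sum of the states of lower-layer units, and equation (2) squashes it with the logistic function.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))           # equation (2): y_j = 1 / (1 + e^{-x_j})

def forward(input_vector, weights, biases):
    """Set layer states sequentially, from the input layer up to the output layer."""
    y = np.asarray(input_vector, dtype=float)  # states of the input units
    for W, b in zip(weights, biases):
        x = W @ y + b                          # equation (1): weighted sum of lower-layer states
        y = logistic(x)                        # states of this layer, set in parallel
    return y                                   # states of the output units

# Illustrative 3-2-1 network with random weights.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((1, 2))]
biases = [np.zeros(2), np.zeros(1)]
print(forward([1.0, 0.0, 1.0], weights, biases))
```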
3. The most obvious limitation of this learning procedure concerns local versus global minima: the error surface may contain local minima, so gradient descent is not guaranteed to find a global minimum. In practice, however, this rarely causes problems.
4. Although the learning procedure, in its current form, is not a plausible model of learning in brains, its application to various tasks shows that interesting internal representations can be constructed by gradient descent in weight space. This suggests that it is worth looking for more biologically plausible ways of doing gradient descent in neural networks.
Attention Is All You Need
1. The team proposes the Transformer, a new and simple network architecture based entirely on attention mechanisms, dispensing with recurrence and convolutions altogether. Like the dominant sequence transduction models built on complex recurrent or convolutional neural networks, it consists of an encoder and a decoder, but the two are connected purely through an attention mechanism. The experiments show that these models are superior in quality, are more parallelizable, and require significantly less time to train.
2. Recurrent language models and encoder-decoder architectures are the established approaches to sequence modeling and transduction problems such as language modeling and machine translation. Although recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, the constraints imposed by sequential computation and the vanishing-gradient problem remain. Attention mechanisms allow dependencies to be modeled without regard to their distance in the input or output sequences, which lets the Transformer eschew recurrence entirely. The Transformer therefore relies solely on attention to draw global dependencies between input and output, which allows significantly more parallelization and reaches a new state of the art in translation quality after sufficient training.
3. The Transformer follows the overall encoder-decoder architecture. The encoder maps an input sequence of symbol representations to a sequence of continuous representations, from which the decoder generates an output sequence. The whole process is auto-regressive: each output depends on the previously generated outputs. The encoder is composed of a stack of N = 6 identical layers, and so is the decoder.
An attention function maps a query and a set of key-value pairs to an output, where the output is a weighted sum of the values. The Transformer uses multi-head attention in three different ways. In the “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. The encoder contains self-attention layers. Self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. A sketch of the underlying attention computation follows.
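A minimal NumPy sketch of the attention function described above, following the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V; the shapes and random inputs are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# Illustrative self-attention: queries, keys and values all come from the same sequence.
rng = np.random.default_rng(0)
seq = rng.standard_normal((5, 64))                  # 5 positions, d_model = 64
out = scaled_dot_product_attention(seq, seq, seq)
print(out.shape)                                    # (5, 64)
```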
4. Considering the total computational complexity per layer, the amount of computation that can be parallelized, and the path length between long-range dependencies in the network, the team chose self-attention. Self-attention enables parallel computation, has lower per-layer complexity for typical sequence lengths, and tends to yield more interpretable models.
5. The team trained on the standard WMT 2014 translation datasets and optimized with Adam. Residual dropout and label smoothing were used for regularization. Further experiments show that the big Transformer model achieves higher scores at lower training cost than the previously published best models. As expected, bigger models are better, and dropout is very helpful in avoiding over-fitting.
As the first sequence transduction model based entirely on attention, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers for translation tasks. The team plans to extend the Transformer to problems involving input and output modalities other than text.
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Scaling up the capacity of deep neural networks is known to be an effective way to improve model quality for several different machine learning tasks. In many instances, however, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure, and such solutions are usually architecture-specific and do not transfer to other tasks. Considering these constraints and limitations, the team introduced GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers, meeting the requirements of efficient and task-independent model parallelism. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility to scale a variety of different networks efficiently to very large sizes. Moreover, when a model is divided across multiple accelerators, GPipe achieves almost linear speedup using a novel batch-splitting pipelining algorithm. The team trained large-scale neural networks on image classification, attaining a top-1 accuracy of 84.4% on ImageNet-2012. The approach also works well for multilingual neural machine translation: a single large-scale Transformer model trained on a corpus spanning over 100 languages achieved better quality than all bilingual models.
Deep learning has made great progress thanks to methods that scale up the effective capacity of neural networks. In general, the larger the model, the better its performance on the task; it is widely acknowledged that there is a strong correlation between model size and classification accuracy. However, although larger models have brought remarkable quality improvements across many fields, scaling neural networks introduces significant practical challenges. Large models face hardware constraints such as memory limitations and communication bandwidth on accelerators. In practice, users have to divide large models into parts and assign them to different accelerators (e.g., GPUs) to deal with the problem, but efficient model-parallel algorithms are extremely difficult to design and implement. Practitioners are therefore often forced to make difficult trade-offs among scaling capacity, flexibility (or specificity to particular tasks and architectures), and training efficiency. Model-parallel algorithms can be very efficient when tailored to a specific architecture and task, but they do not generalize. With the development of deep learning and the increasing demand for reliable and flexible infrastructure, researchers need to be able to scale neural networks easily so that they can tackle a wide variety of machine learning tasks.
GPipe addresses the memory limitation by partitioning the model across different accelerators. Each model can be specified as a sequence of layers, and consecutive groups of layers are partitioned into cells, each placed on a separate accelerator. On top of this partitioned setup, the team proposes a novel pipeline parallelism algorithm that splits a mini-batch of training examples into smaller micro-batches and pipelines the execution of each set of micro-batches over the cells. This algorithm lets researchers train increasingly large models simply by deploying more accelerators. GPipe can also be combined with data parallelism in a complementary manner, using even more accelerators to expand the scale of neural network training. The team applied RMSProp during training, and GPipe itself introduces very little additional communication overhead.
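A simplified, illustrative sketch (not GPipe's actual implementation) of the batch-splitting idea described above: a mini-batch is split into micro-batches, and each micro-batch flows through the sequence of cells (layer groups), so different cells can work on different micro-batches at the same time. The pipeline is only simulated on one device, and the cell functions are placeholders.

```python
import numpy as np

def split_into_microbatches(minibatch, num_micro):
    return np.array_split(minibatch, num_micro)

def pipeline_forward(microbatches, cells):
    """Schedule forward steps so that cell k processes micro-batch i at step i + k."""
    num_micro, num_cells = len(microbatches), len(cells)
    buffers = [list(microbatches)] + [[] for _ in range(num_cells)]
    for step in range(num_micro + num_cells - 1):
        # At each step, every cell that has an input available processes one micro-batch.
        for k in reversed(range(num_cells)):
            if buffers[k]:
                buffers[k + 1].append(cells[k](buffers[k].pop(0)))
    return np.concatenate(buffers[-1])

# Illustrative example: 4 micro-batches flowing through 2 "cells".
cells = [lambda x: x * 2.0, lambda x: x + 1.0]
minibatch = np.arange(8, dtype=float)
print(pipeline_forward(split_into_microbatches(minibatch, 4), cells))
```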
3. In such scenarios, imperfect partitioning algorithms might lead to load imbalance, and better partitioning algorithms could improve performance over the heuristic approach. The authors found that re-computation time was the main contributor to GPipe's overhead, taking up to 23% of the total step time; another source of overhead was load imbalance.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
The paper proposes a new language representation model named BERT, which stands for Bidirectional Encoder Representations from Transformers.
As is well known, pre-trained models have proven effective for many natural language tasks, including sentence-level tasks such as natural language inference and paraphrasing, which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models must produce fine-grained output at the token level. In previous work, the two approaches for applying pre-trained language representations to downstream tasks, feature-based and fine-tuning, share the same objective function during pre-training, and both use unidirectional language models to learn general language representations. The team argues that it is these current techniques that restrict the power of the pre-trained representations. The core limitation is that standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training. These restrictions are sub-optimal for sentence-level tasks and can be devastating for token-level tasks, where it is crucial to incorporate context from both directions.
In this paper, the team proposes BERT to improve the fine-tuning based approach. Inspired by the Cloze task, BERT uses a “masked language model” pre-training objective to remove the unidirectionality constraint mentioned above. Unlike left-to-right language model pre-training, the masked objective enables the representation to incorporate context from both directions, which allows the team to pre-train a deep bidirectional Transformer. In their work, they demonstrate the importance of bidirectional pre-training and show that pre-trained representations reduce the need for many heavily engineered task-specific architectures. BERT is the first fine-tuning based model to achieve state-of-the-art performance on a large suite of both sentence-level and token-level tasks, obtaining new state-of-the-art results on 11 NLP tasks.
The BERT framework contains two steps: pre-training and fine-tuning. In pre-training, the model is trained on unlabeled data over different tasks. For fine-tuning, BERT is first initialized with the pre-trained parameters, and all of the parameters are then fine-tuned using labeled data from the downstream tasks. Each downstream task has its own fine-tuned model, even though they are all initialized with the same pre-trained parameters. Pre-training and fine-tuning use the same architecture except for the output layers, so the difference between the pre-trained architecture and the final downstream architecture is minimal; a significant strength of BERT is this unified architecture across different tasks. In addition, to train a model that understands sentence-level relationships, they also introduce a next-sentence prediction task for pre-training.
3. In their experiments, they mask 15% of all WordPiece tokens. Compared with denoising auto-encoders, BERT only predicts the masked words instead of reconstructing the whole input, which leads to slower convergence than a left-to-right model. There is also a mismatch between pre-training and fine-tuning, because the [MASK] token never appears during fine-tuning; if [MASK] is used too heavily in pre-training, the model's performance suffers. BERT also consumes a lot of hardware resources.
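A rough sketch of the masking step described above, assuming a toy whitespace tokenizer rather than BERT's actual WordPiece machinery: 15% of the tokens are selected, and the model is trained to predict only those positions rather than reconstructing the whole input. (The paper additionally replaces some selected tokens with random or unchanged tokens instead of [MASK] to reduce the pre-training/fine-tuning mismatch; the sketch omits that detail.)

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # the prediction target is the original token
            masked[i] = mask_token   # the input sees [MASK] instead
    return masked, labels

tokens = "the model predicts only the masked words".split()
masked, labels = mask_tokens(tokens)
print(masked)   # input with some positions replaced by [MASK]
print(labels)   # targets are non-None only at masked positions
```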
NASPipe: High-Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism
In this work, inspired by classic CPU instruction pipelining, the team presents NASPipe, the first high-performance and reproducible distributed parallel supernet training system, built on a causal synchronous parallel (CSP) pipeline scheduling abstraction. NASPipe partitions a supernet across GPUs and executes multiple generated sub-tasks (subnets) concurrently in a pipelined manner; in the meantime, it monitors the correlations between subnets and deterministically resolves all causal dependencies caused by the subnets' layer sharing.
Neural Architecture Search (NAS) has contributed significantly to building high-quality Deep Neural Networks (DNNs) for different applications and devices. Because of its low computational cost and high quality, the supernet paradigm, which composes the entire search space into a single supernet and trains it repeatedly, is the most prevalent and widely accepted. Traditional NAS paradigms, which train each explored DNN to convergence, cost too much; instead of training tens of thousands of standalone DNNs, the supernet paradigm trains one monolithic network while preserving the quality of the searched DNNs. Unfortunately, although industry and academia have created systems for easily defining NAS supernets and for training large standalone models, such as GPipe, DeepSpeed and PipeDream, none of these is designed to train extremely large supernets effectively.
There are two main problems. One is that it is difficult to deterministically resolve the dependencies between subnets activated in parallel: currently available large-scale DNN training systems are designed to parallelize the training of multiple batches within the same DNN model, not to capture and enforce causal dependencies. The other is to manage the extra-large supernet context among GPUs efficiently, so as to leave more cache space for larger-batch training and achieve higher GPU utilization.
To enforce high-quality and reproducible supernet training, NASPipe concurrently executes the subnets generated by supernet-based exploration algorithms, oversees the relations between subnets, and resolves all causal dependencies resulting from layer sharing. As DNNs grow larger, a single GPU can no longer hold an entire subnet, which makes pipeline parallelism one of the most efficient ways to train large models. To parallelize subnet execution, the team divides each subnet into parts (subsets of layers) instead of deploying each subnet on a single GPU; each GPU executes one part, turning the subnet executions into a pipeline. Thanks to pipeline parallelism, the team can efficiently resolve causal dependencies and perform the necessary synchronizations in supernet training locally on each GPU, in a decentralized manner.
The team introduces a pipeline scheduler that promotes the subnet tasks with larger chronological order into execution to improve pipeline efficiency. By leveraging the status of the current stage and the status passed from other stages, NASPipe predicts the upcoming subnets most likely to be scheduled in the next few steps, so that it can manage GPU memory efficiently and swap in the contexts of the subnets to be executed precisely.
The team has prototyped NASPipe on PyTorch, one of the most prevalent DNN execution frameworks, where it serves as the training system for a supernet-based NAS algorithm. NASPipe has three advantages over recent pipeline training systems: first, it is reproducible, producing the same training process and results regardless of the number of GPUs; second, it executes efficiently despite the dependencies that exist across subnets; third, it is scalable, providing roughly linearly increasing computation power as the number of GPUs grows.
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
2. They propose a novel interleaved pipeline schedule, which can increase throughput by more than 10% with memory consumption comparable to existing approaches.
Data-parallel scale-out usually works well, but it has two limitations: a) beyond a certain point, the per-GPU batch size becomes too small, reducing GPU utilization and increasing communication cost; and b) the maximum number of devices that can be used equals the batch size, which limits the number of accelerators usable for training.
Tensor (intra-layer) model parallelism, in which the matrix multiplications within each transformer layer are split across multiple GPUs, can be used to overcome these limitations, but it is not suitable for the largest models: such models must be split across multiple multi-GPU servers, which makes tensor parallelism across servers inefficient.
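An illustrative NumPy sketch of the tensor (intra-layer) parallelism idea above: the weight matrix of a single matrix multiplication is split column-wise across two "devices", each device computes its shard, and the shards are concatenated. Device placement is only simulated here, and the names and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))        # activations: batch of 4, hidden size 8
W = rng.standard_normal((8, 16))       # full weight matrix of one linear layer

W_dev0, W_dev1 = np.hsplit(W, 2)       # each "GPU" holds half of the columns
Y_dev0 = X @ W_dev0                    # computed on device 0
Y_dev1 = X @ W_dev1                    # computed on device 1
Y = np.concatenate([Y_dev0, Y_dev1], axis=1)  # gather the partial outputs

assert np.allclose(Y, X @ W)           # identical result to the unsplit multiplication
```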
Pipeline model parallelism is another technique that supports large-scale model training, in which the model's layers are striped over multiple GPUs. To achieve high efficiency it usually requires a larger batch size. In this work, they also introduce a new pipeline schedule that improves efficiency at small batch sizes.
They consider the performance impact of combining pipeline and tensor model parallelism with data parallelism, including the interaction between data parallelism and the two forms of model parallelism. With data parallelism, each worker has a copy of the complete model, the input dataset is sharded, and workers periodically aggregate their gradients. With pipeline parallelism, the model's layers are partitioned across multiple devices; when the model consists of repeated transformer blocks, each device can be assigned an equal number of transformer layers. Periodic pipeline flushes are introduced to preserve strict optimizer semantics, so that optimizer steps stay synchronized across devices; at the beginning and end of each batch, devices are idle. With tensor model parallelism, each layer of the model is divided over multiple devices. A small worked example of how these degrees compose is given below.
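A small worked example, with illustrative numbers rather than the paper's configurations, of how the three parallelism degrees compose: with pipeline degree p, tensor degree t and data-parallel degree d, the total number of GPUs is n = p * t * d, and a model built from repeated transformer layers can assign the same number of layers to every pipeline stage.

```python
p, t, d = 8, 4, 2                      # pipeline, tensor and data parallel degrees (illustrative)
n_gpus = p * t * d
num_layers = 48                        # transformer layers in the model (illustrative)
layers_per_stage = num_layers // p     # equal split across pipeline stages

print(f"total GPUs: {n_gpus}")                           # 64
print(f"layers per pipeline stage: {layers_per_stage}")  # 6
```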
They implemented three model-specific optimizations on the computation graph to achieve high performance.
PipeDream: Generalized Pipeline Parallelism for DNN Training
Single Path One-Shot Neural Architecture Search with Uniform Sampling
This work proposes a single-path one-shot model to address the challenges in supernet training. The central idea is to build a simplified supernet in which all architectures are single paths, thus reducing the weight coupling problem. Training is performed by uniform path sampling, so all architectures (and their weights) are trained fully and equally.
Neural architecture search (NAS) aims to automate architecture engineering by solving the architecture design problem automatically. Recent methods use a weight sharing strategy to reduce the amount of computation: a supernet subsuming all architectures is trained only once, each architecture inherits its weights from the supernet, and only fine-tuning is performed, so the computational cost is greatly reduced. Most weight sharing methods use continuous relaxation to parameterize the search space, which raises two issues. First, the weights in the supernet are deeply coupled. Second, joint optimization introduces further coupling between the architecture parameters and the supernet weights; the greedy nature of gradient-based methods inevitably introduces bias into the optimization and can easily mislead the architecture search. Existing one-shot methods still have coupled weights in the supernet, their optimization is complex and involves sensitive hyper-parameters, and they have not shown competitive results on large datasets.
In order to alleviate these problems, they proposed a simple but effective single-path one-shot method.
This work revisits the one-shot paradigm and proposes a new approach to further simplify training and enhance architecture search. Based on the observation that the accuracy of an architecture using inherited weights should be predictive of its accuracy with optimized weights, they argue that supernet training should be stochastic, with all architectures optimizing their weights simultaneously. This leads to a uniform sampling strategy, sketched below.
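A condensed PyTorch-style sketch of the uniform path sampling described above; the choice blocks, hyper-parameters and data are all placeholders. At every training step a single path is drawn uniformly at random from the choice blocks, and only the weights on that path receive gradients.

```python
import random
import torch
import torch.nn as nn

class ChoiceBlock(nn.Module):
    """One supernet block holding several candidate operations."""
    def __init__(self, channels, candidates=3):
        super().__init__()
        self.ops = nn.ModuleList(
            nn.Sequential(nn.Linear(channels, channels), nn.ReLU())
            for _ in range(candidates)
        )

    def forward(self, x, choice):
        return self.ops[choice](x)        # only the sampled candidate is executed

class Supernet(nn.Module):
    def __init__(self, channels=16, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(ChoiceBlock(channels) for _ in range(depth))

    def forward(self, x, path):
        for block, choice in zip(self.blocks, path):
            x = block(x, choice)
        return x

net = Supernet()
opt = torch.optim.SGD(net.parameters(), lr=0.1)
for step in range(10):                                   # toy training loop
    path = [random.randrange(3) for _ in net.blocks]     # uniform single-path sampling
    x, y = torch.randn(8, 16), torch.randn(8, 16)
    loss = nn.functional.mse_loss(net(x, path), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```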
They discuss the disadvantages of existing NAS methods that use nested or joint optimization, and propose a single-path one-shot method with uniform sampling that overcomes the drawbacks of existing one-shot methods. Its simplicity supports a rich search space, including novel designs for channel size and bit width, all handled in a unified way. They propose a simple search space composed of single-path architectures to reduce weight coupling in the supernet. The training is hyperparameter-free and easy to converge.
Comprehensive experiments show that their method is flexible and effective: it is easy to train and fast to search, and it effortlessly supports complex search spaces (e.g., building blocks, channels, mixed-precision quantization) and different search constraints (e.g., FLOPs, latency), so it can be conveniently used for various needs. It defines the supernet and performs weight inheritance in a manner similar to prior one-shot work, but the two steps are performed sequentially, combining the advantages of nested and joint optimization methods. Architecture search is efficient and flexible, and it achieves state-of-the-art performance on ImageNet, a large dataset.
Comprehensive experiments show that their method achieves better results than other methods in several different search spaces. They also analyze the search cost and the correlation behavior of their method; the approach is more efficient, especially when multiple searches are required.
Scaling Distributed Machine Learning with the Parameter Server
2. There are two key challenges in building a high-performance parameter server system: 1) communication: although parameters can be represented as key-value pairs as in traditional data stores, using this abstraction naively is inefficient; 2) fault tolerance: fault tolerance is critical at scale, and for efficient operation it must not require a complete restart of long-running computations.
The parameter server framework provides developers with two advantages: first, it keeps application-specific code simple by factoring out the common components of machine learning systems; second, as a shared platform for system-level optimization, it provides a robust, versatile and high-performance implementation that can handle algorithms ranging from sparse logistic regression to topic models and distributed sketching. The design is guided by workloads found in real systems, and its novelty lies in the synergy achieved by choosing the right system techniques, adapting them to the machine learning algorithms, and modifying the machine learning algorithms to be more system-friendly. The parameter server offers five key features, the first being efficient communication: the asynchronous communication model does not block computation (unless requested) and is optimized for machine learning tasks to reduce network traffic and overhead.
The framework is also flexible, scalable, fault tolerant, durable and easy to use, and is designed for long-term deployment.
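A toy, in-process sketch of the parameter-server pattern summarized above (this is not the paper's system): workers pull the current key-value parameters, compute updates on their data shard, and push them back; the server applies the updates. Real implementations make push/pull asynchronous and distributed; the "gradient" here is a placeholder.

```python
import numpy as np

class ParameterServer:
    def __init__(self, keys, dim, lr=0.1):
        self.params = {k: np.zeros(dim) for k in keys}   # parameters stored as key-value pairs
        self.lr = lr

    def pull(self, keys):
        return {k: self.params[k].copy() for k in keys}

    def push(self, grads):
        for k, g in grads.items():                       # apply gradient updates
            self.params[k] -= self.lr * g

def worker_step(server, keys, data):
    w = server.pull(keys)                                # fetch the latest parameters
    grads = {k: w[k] - data[k] for k in keys}            # placeholder "gradient" computation
    server.push(grads)                                   # send updates back to the server

server = ParameterServer(keys=["w1", "w2"], dim=4)
data = {"w1": np.ones(4), "w2": 2 * np.ones(4)}
for _ in range(50):
    worker_step(server, ["w1", "w2"], data)
print(server.params["w1"])   # converges toward the worker's target values
```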
They present experiments on several challenging tasks using real datasets with billions of variables, demonstrating the system's efficiency.
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
In recent years, much of the progress in deep learning has been directly related to significant increases in model scale: language models have been scaled to hundreds of billions of parameters and trained on larger datasets, enabling new capabilities. However, training these very large models on distributed clusters currently requires a large amount of engineering work specific to both the model definition and the cluster environment, such as tuning and selecting parallelization dimensions and choosing pipeline schemes (partition selection).
Automated parallelization of large-scale models would significantly accelerate ML research and production by enabling model developers to quickly explore new model designs without worrying about the underlying system challenges. Unfortunately, it requires navigating a complex plan space that grows exponentially with the parallelism dimensions and the size of the model and cluster.
This paper proposes Alpa, which automates model-parallel training of large-scale deep learning (DL) models by generating execution plans that unify data, operator and pipeline parallelism.
Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model-parallel configurations; neither is sufficient to scale complex deep learning models out onto distributed compute devices.
Their main observation is that the different parallelization techniques can be organized into a hierarchical space and mapped onto the hierarchical structure of the computing cluster. A parallel execution plan can then be expressed hierarchically by specifying a plan within each parallelism category, which brings many advantages.
They designed and implemented Alpa, the first compiler to automatically generate parallel execution plans covering data, operator and pipeline parallelism. Given a model description and a cluster configuration, Alpa works by dividing the cluster into multiple device meshes and treating parallelism at two levels: inter-operator parallelism and intra-operator parallelism. On this basis, Alpa builds a new hierarchical space of model-parallel execution plans, designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level, and implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. An illustrative sketch of such a hierarchical plan follows.
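A purely illustrative data structure, not Alpa's actual API, showing the two-level idea described above: the model is first cut into pipeline stages (inter-operator parallelism), each stage is assigned a device mesh, and within every stage an intra-operator sharding choice is recorded per operator. All names are made up.

```python
hierarchical_plan = {
    "stages": [
        {   # inter-operator level: which layers form this stage and on which mesh it runs
            "layers": ["embed", "block_0", "block_1"],
            "device_mesh": {"hosts": 1, "gpus_per_host": 4},
            # intra-operator level: how each operator's tensors are sharded on the mesh
            "sharding": {"block_0.matmul": "column", "block_1.matmul": "row"},
        },
        {
            "layers": ["block_2", "block_3", "lm_head"],
            "device_mesh": {"hosts": 1, "gpus_per_host": 4},
            "sharding": {"block_2.matmul": "column", "block_3.matmul": "row"},
        },
    ],
}

# A compiler-like search would score many such candidate plans and keep the best;
# this only shows the shape of one candidate.
print(len(hierarchical_plan["stages"]), "pipeline stages")
```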
Their evaluation shows that the parallelization plans generated by Alpa match or outperform hand-tuned model-parallel training systems, even on the models those systems were designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and to models without manually designed plans. They evaluated Alpa on training large models with billions of parameters and achieved efficient execution.
vPIPE: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training
The team introduces vPIPE, the first dynamic layer live-partitioning and memory management system for pipeline parallelism, which serves as a transparent acceleration layer between a typical pipeline-parallel system and its underlying execution engine. To meet both design goals transparently, G1 (carefully managing every stage's training memory so that no GPU exceeds its physical memory capacity) and G2 (enforcing a “balanced” partition), it automatically finds a globally near-optimal plan that migrates layers among stages and relocates each layer's activations and parameters to its current stage's GPU or CPU memory. In this way, vPIPE significantly relieves memory pressure on tense stages and improves pipeline throughput in a balanced manner.
vPIPE makes two key contributions. The first is an online search algorithm for layer partitioning and memory management plans, which is fast, near-optimal, runs distributed across stages, and looks for globally effective swap, recompute and partition strategies; it improves the efficiency and scalability of vPIPE. The second is a transparent live-migration protocol for rebalancing layer distribution across training pipelines, which neither stalls the system above it nor changes the staleness of its parameters.
In recent years, to obtain higher modeling capacity, the scale of large deep neural networks has increased explosively, with more layers and more parameters per layer. Pipeline parallelism is an effective method for training large DNNs, and an efficient pipeline system should achieve the two key goals G1 and G2. Although previous work has put a lot of effort into building pipeline-parallel systems, it remains difficult to achieve these complex and dynamic design goals at the same time.
Existing pipeline-parallel systems fall into two types. The first type stores the activation tensors generated during the forward passes directly in GPU memory; it has to keep the batch size moderate, even though a larger training batch size would lead to higher GPU ALU utilization and higher throughput. The second type discards all activation tensors after the forward passes and recomputes them in the backward passes; this significantly alleviates the imbalance in GPU memory utilization between earlier and later stages, but at the cost of an extra forward pass. In addition, when NAS is enabled in the DNN model, both types of pipeline-parallel systems suffer more severe throughput degradation, because the number and layout of the model's layers can be modified by the runtime search algorithm and evaluated by running the NAS-enabled transformer.
The team believes the root cause is that the memory management and layer partitioning strategies of these systems are static. When a stage becomes tense because GPU memory explodes or newly activated layers arrive, static strategies cannot use the idle GPU resources available in adjacent stages to relieve the pressure. vPIPE instead computes a mixed plan of swapping and recomputation for all layers on each stage rather than an all-recompute strategy, and instead of using a static partition, it generates a new partition plan and transparently live-migrates layers from the tense stage to an adjacent stage. This not only reduces the memory burden of the tense stage (G1) but also yields a more balanced partition (G2) with higher throughput. A simplified sketch of the per-layer swap-versus-recompute choice follows.
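A simplified sketch of the per-layer swap-versus-recompute idea described above (not vPIPE's actual algorithm): for each layer on a stage, the estimated cost of swapping its activations to CPU memory is compared against recomputing them in the backward pass, and the cheaper choices are kept until the stage fits in GPU memory. The costs and sizes are invented for illustration.

```python
def plan_swap_or_recompute(layers, memory_budget):
    """layers: list of dicts with activation size, swap cost and recompute cost."""
    plan, used = {}, sum(l["act_size"] for l in layers)
    # Consider layers whose eviction frees the most memory per unit of overhead first.
    for layer in sorted(layers, key=lambda l: -l["act_size"] / min(l["swap_cost"], l["recompute_cost"])):
        if used <= memory_budget:
            break
        plan[layer["name"]] = "swap" if layer["swap_cost"] < layer["recompute_cost"] else "recompute"
        used -= layer["act_size"]        # evicting this layer's activations frees GPU memory
    return plan

layers = [
    {"name": "conv1", "act_size": 4, "swap_cost": 2.0, "recompute_cost": 1.0},
    {"name": "conv2", "act_size": 8, "swap_cost": 1.5, "recompute_cost": 3.0},
    {"name": "fc",    "act_size": 2, "swap_cost": 0.5, "recompute_cost": 0.2},
]
print(plan_swap_or_recompute(layers, memory_budget=6))
```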
Achieving this goal raises two challenges. The first is to find globally effective swap, recompute and repartition (SRP) strategies across all stages; using a powerful decomposition method, they created a fast-converging, near-optimal search algorithm to cope with it. The second is how to migrate layers live (without GPU pauses or pipeline flushes) while keeping vPIPE transparent to the general upper-level pipeline-parallel system; they propose a new live-migration protocol. The key observation is that the time window between an activation's generation (in the forward pass) and its final use (in the corresponding backward pass) allows vPIPE to perform subtle interleaving that migrates layers transparently without changing the parameter staleness of the upper system.
vPIPE has two limitations. First, it assumes that a single layer fits within the memory limit of a single GPU for any DNN workload trained with vPIPE. Second, vPIPE's layer-migration protocol remains live only when the time cost of transferring a layer's tensors can overlap with the computation time of DNN training.