1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs

1. Abstract

  • Gradients can be quantized aggressively, to as little as one bit per value, at no or nearly no loss of accuracy, provided the quantization error is carried forward across minibatches (error feedback): each minibatch's quantization error is added to the gradient of the next minibatch.
  • This size reduction makes it feasible to parallelize SGD through data parallelism on fast processors such as recent GPUs.
  • The paper combines this finding with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization interacts well with AdaGrad, giving a small accuracy gain.

2. Intro

  • In data parallelism, minibatches are split across nodes. Each node computes a sub-gradient on its sub-minibatch. These sub-gradients, each of the same dimension as the full model, must be summed over all nodes and redistributed.
  • Applied directly to typical training configurations, this process is infeasible due to the high bandwidth it takes to exchange sub-minibatch gradients across all nodes.
  • The avenues for improving data-parallel efficiency are to increase the minibatch size and to reduce how much data gets exchanged.
  • The paper proposes to reduce bandwidth by aggressively quantizing the sub-gradients, to as little as one bit per value. This does not or almost not reduce word accuracy, but only if the quantization error is carried forward across minibatches, i.e. the error in quantizing the gradient of one minibatch is added (fed back) to the gradient of the next minibatch.
  • Data parallelism here does not change the convergence behavior, i.e. the clock-vs-objective curve. Unlike Hogwild/ASGD, this paper focuses on deterministic convergence behavior.
  • In this category, an alternative to data parallelism is model parallelism, where the model is distributed over nodes. One can also parallelize over layers [19]: each GPU processes one or more consecutive layers, with data flowing up and down through the layers between GPUs.
  • That work showed, however, that delayed updates can work, which motivated the double-buffering technique applied in this paper.

3. Data-parallel Deterministically Distributed SGD

  • CD-DNN-HMM model

3.1. Data-parallel Distributed SGD

  • The optimal node count is reached when computation and communication fully overlap, i.e. when computation time equals communication time.
  • The time cost breaks into four parts: processing each sample (three matrix products per layer); post-processing the computed gradient (momentum + AdaGrad), a component-wise operation (e.g., additions); exchanging the float-valued sub-gradients; and applying the update to the model parameters, a component-wise, fixed cost.
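A toy version of the overlap argument above; all numbers and parameter names are illustrative, not taken from the paper:

```python
def optimal_node_count(minibatch_size, t_per_sample, t_exchange):
    """With K nodes, each node computes its sub-gradient in roughly
    minibatch_size * t_per_sample / K seconds. Adding nodes helps while
    that still exceeds the (roughly K-independent) exchange time, so the
    optimum is the K at which compute time equals communication time."""
    return minibatch_size * t_per_sample / t_exchange

# e.g. 1024 samples at 1 ms each, against a 256 ms gradient exchange:
# optimal_node_count(1024, 0.001, 0.256) -> 4.0
```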

3.2. Double Buffering with Half Batches

  • To achieve concurrent computation, each minibatch is broken in half: the sub-gradients of one half-minibatch are exchanged while the sub-gradients of the next half-minibatch are computed.
  • The computation therefore uses a model that is outdated by N/2 samples (delayed update [19, 8]); the delay is fixed, and convergence is not fundamentally affected.
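A minimal scheduling sketch of this pipeline (function names are hypothetical, and the loop is sequential; in the real system the exchange of the pending half runs concurrently with the next computation):

```python
def double_buffered_sgd(half_batches, compute_subgrad, exchange, apply_update, model):
    """Process half-minibatches so that while one half's sub-gradient is
    'in flight' (being exchanged), the next half is computed against a
    model that is stale by half a minibatch (N/2 samples)."""
    in_flight = None
    for half in half_batches:
        g = compute_subgrad(model, half)  # model lags by one pending half-batch
        if in_flight is not None:
            model = apply_update(model, exchange(in_flight))
        in_flight = g
    if in_flight is not None:             # drain the last pending half
        model = apply_update(model, exchange(in_flight))
    return model
```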

3.3. Potential Faster-Than-Fixed-Cost Communication

  • When communication cost drops below the fixed cost, the network is no longer saturated and speed is limited by the fixed cost instead; this is the situation that arises with 1-bit SGD.
  • In this case, double buffering with half-minibatches no longer makes sense, as it masks communication cost at the expense of an additional fixed cost, which is now the larger term.

3.4. Relation to Hogwild/ASGD

  • Hogwild differs in that it uses an unsynchronized gradient exchange (via a parameter server). It is another form of delayed update, one where the delay varies non-deterministically across model parameters.

4. 1-bit SGD with Error Feedback

  • Quantization error is unavoidable and, if accumulated, can cause divergence.
  • Following Sigma-Delta modulation, when quantizing a parameter's gradient, the quantization error is saved and added to the gradient of the next minibatch before that minibatch is quantized.
  • As long as error feedback is used, gradients can be quantized all the way down to 1 bit at no or nearly no loss of accuracy.
  • For the 1-bit implementation, a constant quantization threshold of 0 is a good (and cheap) choice, whereas the reconstruction values used by the unquantizer Q1(·) are tied within each weight-matrix column (j, l). The two values per column are recomputed so as to minimize the squared quantization error, and are transmitted in each data exchange.
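A NumPy sketch of this scheme, with threshold 0 and two per-column reconstruction values (for a fixed partition, the least-squares fit is the column mean of the positive and of the negative entries; all names are illustrative):

```python
import numpy as np

def quantize_1bit(grad, error):
    """Quantize a gradient matrix to 1 bit per value with error feedback:
    the residual returned here must be added to the next minibatch's
    gradient before that gradient is quantized."""
    g = grad + error                     # feed back last minibatch's error
    mask = g >= 0.0                      # constant quantization threshold of 0
    n_pos = np.maximum(mask.sum(axis=0), 1)
    n_neg = np.maximum((~mask).sum(axis=0), 1)
    # Two reconstruction values per weight-matrix column: the means of the
    # positive and negative entries minimize the squared quantization error.
    pos_val = np.where(mask, g, 0.0).sum(axis=0) / n_pos
    neg_val = np.where(~mask, g, 0.0).sum(axis=0) / n_neg
    reconstructed = np.where(mask, pos_val, neg_val)
    new_error = g - reconstructed        # carried into the next minibatch
    return mask, (pos_val, neg_val), new_error
```

Only `mask` (1 bit per value) plus the two floats per column need to be transmitted, which is where the bandwidth saving comes from.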

4.1. Aggregating the Gradients

  • Each compute node is responsible for aggregating a 1/K-th subset of the model parameters, which it receives in quantized form from all peer nodes.
  • The quantized values are summed, post-processed (AdaGrad, momentum), quantized again, and redistributed to the compute nodes; each minibatch's gradient is thus quantized twice.
  • The first quantization is applied to the sub-gradients before they are summed, so averaging reduces the quantization error. The second quantization happens after AdaGrad, where gradient values lie in a more homogeneous numeric range.
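The two-phase data flow above can be summarized in a few lines (a schematic only, with `quantize` and `postprocess` left abstract):

```python
def exchange_gradients(subgrads, quantize, postprocess):
    """Two-phase exchange: quantize each node's sub-gradient once, sum
    them (averaging shrinks the quantization error), post-process the sum
    (AdaGrad, momentum), then quantize a second time for redistribution."""
    aggregate = sum(quantize(g) for g in subgrads)  # first quantization
    return quantize(postprocess(aggregate))         # second quantization
```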

5. System Description

  • The first 45 minutes of data are used to select a suitable minibatch size.
  • The learning rate is decayed based on accuracy on a cross-validation set.
  • AdaGrad is used to normalize the gradients per dimension over time.
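As a reminder of what AdaGrad does per dimension (a standard textbook sketch, not the paper's exact implementation; `lr` and `eps` are illustrative defaults):

```python
import numpy as np

def adagrad_update(param, grad, accum, lr=0.01, eps=1e-8):
    """AdaGrad: scale each component by the inverse root of its accumulated
    squared-gradient history, so dimensions with large past gradients get
    smaller effective learning rates (normalizing the update's range)."""
    accum = accum + grad * grad
    param = param - lr * grad / (np.sqrt(accum) + eps)
    return param, accum
```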

6. Experimental Results

6.1. Cost Measurements

  • The costs that do not depend on batch size, i.e. gradient post-processing (AdaGrad, momentum) and the fixed cost of the model update, are essentially constant in time.

6.2. Effect of 1-bit Quantization

  • 1-bit quantization works well across all setups, with a minor but consistent impact on training-set frame accuracy.
  • Double buffering has only a minor impact on accuracy.

7. Conclusion

  • 1-bit quantization allows a significant reduction in data-exchange bandwidth for data-parallel SGD at no or nearly no loss of accuracy, making data-parallel distribution of SGD feasible even with modern fast hardware (GPUs).
  • For this to work, quantization-error feedback is essential.
  • Quantization and AdaGrad interact, to mutual benefit.
