Part I.
Introduction
Convolutional neural networks show reliable results on object recognition and detection that are useful in real-world applications. Concurrent with this recent progress in recognition, interesting advances have been happening in virtual reality, augmented reality, and wearable devices. Putting these two pieces together, it is the right time to equip smart portable devices with the power of state-of-the-art recognition systems.
However, DCNN-based recognition systems need large amounts of memory and computational power. While they perform well on expensive, GPU-based machines, these models ([1][2]) quickly overtax the limited storage, battery power, and compute capabilities of smaller devices such as cell phones and embedded electronics.
In light of this, substantial research effort has been invested in speeding up DCNNs at both run time and training time, on both general-purpose ([3]) and specialized hardware. Approaches such as quantization and sparsification ([4]) have also been proposed.
Part II.
Development Process
To improve the speed of neural networks on CPUs, [5] leverage SSSE3 and SSE4 fixed-point instructions, which provide a 3× improvement over an optimized floating-point baseline. Weights are scaled by their maximum magnitude in each layer and normalized to fall in the [−128, 127] range (signed 8-bit). [6] observed that, in the context of low-precision fixed-point computation, the rounding scheme plays a crucial role in determining the network's behavior during training. Using SR (Stochastic Rounding), they showed that deep networks can be trained with only 16-bit-wide fixed-point representations (IL = 4 integer bits; FL = 12 fractional bits) and incur little to no degradation in classification accuracy (on MNIST and CIFAR-10).
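The two ideas above can be sketched in a few lines of NumPy: per-layer scaling of weights into the signed 8-bit range, and stochastic rounding onto a 16-bit fixed-point grid with IL = 4 and FL = 12. The function names and toy tensors below are illustrative only and are not taken from [5] or [6].

```python
import numpy as np

def quantize_int8(w):
    """Scale a layer's weights by their maximum magnitude into [-128, 127] (signed 8-bit)."""
    scale = 127.0 / np.max(np.abs(w))
    return np.round(w * scale).astype(np.int8), scale

def stochastic_round_fixed_point(x, il=4, fl=12):
    """Stochastically round x onto a fixed-point grid with `il` integer bits
    and `fl` fractional bits: each value is rounded up with probability equal
    to its fractional distance from the lower grid point."""
    eps = 2.0 ** -fl                        # grid resolution
    limit = 2.0 ** (il - 1)                 # representable range is [-limit, limit - eps]
    scaled = x / eps
    lower = np.floor(scaled)
    prob_up = scaled - lower                # fractional part = probability of rounding up
    rounded = lower + (np.random.rand(*x.shape) < prob_up)
    return np.clip(rounded * eps, -limit, limit - eps)

w = np.random.randn(64, 128).astype(np.float32) * 0.1
w_int8, scale = quantize_int8(w)
w_fxp = stochastic_round_fixed_point(w)
```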
Binary weights, i.e., weights constrained to only two possible values (e.g., −1 or 1), would bring great benefits to specialized DL hardware by replacing many multiply-accumulate operations with simple accumulations. Introduced by [7], BC (BinaryConnect) obtained near state-of-the-art results on permutation-invariant MNIST, CIFAR-10, and SVHN. The key point of BC is that the weights are binarized only during the forward and backward propagations, not during the parameter update. Keeping good-precision weights during the parameter update is necessary for SGD to work at all, which means two sets of weights must be maintained during training. This method was inherited by BN (BinaryNet, [8]), which uses the sign function for binarization. In addition, BN trains DNNs with binary weights and binary activations when computing the parameters' gradients. Dedicated deep-learning hardware (FPGAs, etc.) could thus be used to accelerate network training without significant loss (on MNIST, CIFAR-10, and SVHN). Compared with an unoptimized GPU kernel, their binary matrix-multiplication GPU kernel runs 7× faster.
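A minimal sketch of the BinaryConnect training scheme on a single linear layer, assuming plain SGD and a squared-error loss; only the deterministic sign() binarization is shown, and all names and shapes are illustrative rather than taken from [7].

```python
import numpy as np

def binarize(w):
    """Deterministic binarization: +1 where w >= 0, -1 elsewhere."""
    return np.where(w >= 0.0, 1.0, -1.0)

rng = np.random.default_rng(0)
W_real = rng.normal(scale=0.1, size=(784, 10))   # real-valued weights kept for updates
lr = 0.01

# One SGD step on a single linear layer with a squared-error loss.
x = rng.normal(size=(32, 784))
y_target = rng.normal(size=(32, 10))

W_bin = binarize(W_real)             # binary weights used in the forward/backward passes
y = x @ W_bin                        # forward pass
grad_y = 2.0 * (y - y_target)        # gradient of the squared-error loss
grad_W = x.T @ grad_y / len(x)       # backward pass, computed w.r.t. the binary weights
W_real -= lr * grad_W                # ...but the update is applied to the real-valued weights
W_real = np.clip(W_real, -1.0, 1.0)  # clipping keeps the real weights in the binarization range
```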
Recent work has turned to the ImageNet dataset (about 1.2M training images) to fully test the performance of binary networks [9]. The classification accuracy of a Binary-Weight-Network (BWN) version of AlexNet (batch normalization added, LRN removed) is nearly the same as that of the full-precision AlexNet (top-5 accuracy only 0.8% lower). BWN still follows [7]'s propagation/update framework, but approximates each real-valued weight filter W as αB, where B is a binary filter and α is a scaling factor, resulting in a 32× memory saving. With binarized weights and activations, XNOR-Net approximates convolutions using primarily binary operations. This yields 58× faster convolution (in terms of the number of high-precision operations) and the same 32× memory saving, but suffers considerable precision loss (AlexNet on ImageNet: top-5 accuracy 11% below the 80.2% full-precision baseline).
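A small sketch of the BWN approximation W ≈ αB for a single filter, with α taken as the mean absolute weight as described in [9]; the filter shape and names here are illustrative.

```python
import numpy as np

def binary_weight_approx(W):
    """Approximate a real-valued filter W by alpha * B,
    where B = sign(W) and alpha = mean(|W|)."""
    B = np.sign(W)
    B[B == 0] = 1.0                  # treat exact zeros as +1
    alpha = np.mean(np.abs(W))
    return alpha, B

W = np.random.randn(3, 3, 64) * 0.05     # one 3x3 filter over 64 input channels
alpha, B = binary_weight_approx(W)
W_approx = alpha * B
err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(f"relative approximation error: {err:.3f}")
```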
Facing the same precision problem (AlexNet with 8-bit weights, activations, and gradients on ImageNet: top-1 accuracy 3% below the 55.9% baseline; configurations with lower bitwidths fare even worse), DoReFa-Net [10] uses low-bitwidth weights and activations together with low-bitwidth parameter gradients. In particular, during the backward pass, parameter gradients are stochastically quantized to low-bitwidth numbers before being propagated to the convolutional layers. Since the convolutions in the forward and backward passes can then operate on low-bitwidth weights and activations/gradients respectively, DoReFa-Net can use bit-convolution kernels to accelerate both training and inference. Due to the memory limits of dedicated hardware, the quantization functions cannot be too complicated, and partly for this reason their design appears quite empirical.
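The following sketch shows DoReFa-style k-bit quantizers: a deterministic quantizer for weights and a stochastic quantizer for gradients. The exact clipping and scaling details vary by tensor type in [10], so this should be read as a simplified illustration rather than a faithful reimplementation.

```python
import numpy as np

def quantize_k(x, k):
    """Deterministic k-bit quantization of x in [0, 1] onto 2^k evenly spaced levels."""
    n = float(2 ** k - 1)
    return np.round(x * n) / n

def quantize_weights(w, k):
    """k-bit weights: squash with tanh into [0, 1], quantize, map back to [-1, 1]."""
    t = np.tanh(w)
    x = t / (2 * np.max(np.abs(t))) + 0.5
    return 2 * quantize_k(x, k) - 1

def quantize_gradient_stochastic(g, k):
    """Stochastic k-bit quantization of gradients: add uniform noise before
    rounding so the quantizer is unbiased in expectation."""
    n = float(2 ** k - 1)
    max_abs = np.max(np.abs(g)) + 1e-12
    x = g / (2 * max_abs) + 0.5                          # map into [0, 1]
    noise = np.random.uniform(-0.5, 0.5, g.shape) / n
    x_q = np.clip(np.round((x + noise) * n) / n, 0.0, 1.0)
    return 2 * max_abs * (x_q - 0.5)                     # map back to the original range

w = np.random.randn(256, 256) * 0.1
g = np.random.randn(256, 256) * 0.01
w_q = quantize_weights(w, k=1)               # 1-bit weights
g_q = quantize_gradient_stochastic(g, k=6)   # 6-bit gradients
```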
Recently, [11] revealed the remarkable robustness of DCNNs to distortions beyond quantization, including additive and multiplicative noise and a class of non-linear projections of which binarization is just a special case, which may point toward the essence of binary quantization.
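To make these distortions concrete, the sketch below applies additive noise, multiplicative noise, and a saturating non-linear projection (with binarization as its limiting case) to a weight tensor; the distortion strengths are arbitrary and purely illustrative, not the settings used in [11].

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(512, 512))

# Additive Gaussian noise on the weights.
W_add = W + rng.normal(scale=0.05, size=W.shape)

# Multiplicative noise: each weight scaled by a random factor around 1.
W_mul = W * rng.normal(loc=1.0, scale=0.3, size=W.shape)

# A saturating non-linear projection of the weights; as the saturation
# sharpens this degenerates to binarization, the special case noted above.
W_proj = np.tanh(W / 0.05) * np.mean(np.abs(W))
W_bin = np.sign(W) * np.mean(np.abs(W))
```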
Part III.
Reference
[1] Krizhevsky, A., Sutskever, I., Hinton, G.E. ImageNet classification with deep convolutional neural networks. NIPS 2012.
[2] Simonyan, K., Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
[3] Han, S., Pool, J., Tran, J., Dally, W. Learning both weights and connections for efficient neural networks. NIPS 2015.
[4] Han, S., Mao, H., Dally, W.J. Deep Compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149.
[5] Vanhoucke, V., Senior, A., Mao, M.Z. Improving the speed of neural networks on CPUs. NIPS 2011.
[6] Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P. Deep learning with limited numerical precision. ICML 2015.
[7] Courbariaux, M., Bengio, Y., David, J.-P. BinaryConnect: Training deep neural networks with binary weights during propagations. NIPS 2015.
[8] Courbariaux, M., Bengio, Y. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv:1602.02830.
[9] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. arXiv:1603.05279.
[10] Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160.
[11] Merolla, P., Appuswamy, R., Arthur, J., Esser, S.K., Modha, D. Deep neural networks are robust to weight binarization and other non-linear distortions. arXiv:1606.01981.