For an arbitrary neural network with n convolutional layers {L_1,...,L_n}, there are n sets of weights W = {W_1,...,W_n} belonging to these layers. For the i-th convolutional layer, its weight is denoted as W_i ∈ R^{c_i×s_i×s_i×o_i}, where s_i is the kernel size, and c_i and o_i are the numbers of input and output channels of this layer, respectively. Given a batch of training samples X and the corresponding ground truth Y, the training loss of the network is denoted as L_cls, where L_cls can be the cross-entropy loss, the mean squared error, etc. The given deep neural network is trained by solving an empirical risk minimization problem.
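For concreteness, the empirical risk minimization problem can be written as follows (a standard formulation; the network function f and the number of training samples N are notation introduced here for illustration):

min_W (1/N) Σ_{j=1}^{N} L_cls(f(x_j; W), y_j), (x_j, y_j) ∈ (X, Y).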
Basically, the training and inference of conventional neural networks utilize floating-point numbers, i.e., both the weights and activations are stored using 32-bit precision. Model quantization methods represent the weights or activations in neural networks with low-bit values so as to reduce the computation and memory costs. To quantize the weights Wi and activations Ai, these floating-point numbers need to be restricted to a finite set of values. The quantization function is usually defined as:
Q(z) = γ_j, ∀ z ∈ (u_j, u_{j+1}], (1)
where (u_j, u_{j+1}] denotes a real-number interval (j = 1,...,2^b), b is the quantization bit-width, and z is the input value, i.e., a weight or an activation. The quantization function in Eq. (1) maps all the values in the range (u_j, u_{j+1}] to γ_j. For the choice of these intervals, the widely used strategy is a uniform quantization function [23,51] in which the above range is equally split, i.e.,
Q(z) = R((z − l)/∆) · ∆ + l, (2)
where the original range (l, r) is divided into 2^b uniform intervals, ∆ = (r − l)/2^b is the interval length, and R is the rounding function. To make the non-differentiable quantization process optimizable end-to-end, the straight-through estimator [2] is usually adopted to approximate the derivative of the quantization function.
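As a minimal sketch of Eq. (2) combined with the straight-through estimator (assuming PyTorch; the function name and the clamping of values to (l, r) are our own illustrative choices, not the paper's specification):

```python
import torch

def uniform_quantize(z, b, l, r):
    """Uniform quantizer of Eq. (2): split (l, r) into 2^b intervals and
    round each value to the nearest quantization level."""
    delta = (r - l) / (2 ** b)               # interval length
    z_clipped = torch.clamp(z, l, r)         # keep values inside (l, r)
    q = torch.round((z_clipped - l) / delta) * delta + l
    # Straight-through estimator: the forward pass uses the quantized value q,
    # while the backward pass treats the rounding as the identity function.
    return z_clipped + (q - z_clipped).detach()
```

For example, a weight tensor could be quantized to 4 bits with `w_q = uniform_quantize(w, b=4, l=float(w.min()), r=float(w.max()))`.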
Having revisited the conventional network quantization approach above, we now describe our dynamic quantization strategy for higher performance in terms of both network accuracy and efficiency.
Although a great number of quantization methods have been explored, the bit-width in conventional quantization methods is usually static for all inputs. In fact, the diversity of natural images in recent datasets is very high, and most existing quantization algorithms do not account for their variability and intrinsic complexity. To allocate computational resources precisely, we propose dynamic quantization, which adjusts the bit-width of each layer according to the input. Suppose there are K bit-width candidates, i.e., b_1, b_2, ..., b_K. Dynamic quantization aims to select an optimal bit-width for quantizing the weights and activations of each layer, which can be formulated by aggregating multiple bit-widths as follows:
Q_{i,j}(·) = Σ_{k=1}^{K} p^k_{i,j} · Q_{b_k}(·), s.t. p^k_{i,j} ∈ {0,1}, Σ_{k=1}^{K} p^k_{i,j} = 1, (3)

where p^k_{i,j} denotes the selection of the k-th bit-width for the j-th sample in the i-th layer. Then the dynamic quantization of the j-th sample in the i-th layer can be formulated as:

Y_{i,j} = Ŵ_{i,j} ∗ X̂_{i,j}, (4)

Ŵ_{i,j} = Q_{i,j}(W_i) = Σ_{k=1}^{K} p^k_{i,j} · Q_{∆^k_i}(W_i), (5)

X̂_{i,j} = Q_{i,j}(X_{i,j}) = Σ_{k=1}^{K} p^k_{i,j} · Q_{δ^k_i}(X_{i,j}), (6)

where ∗ denotes the convolution operation, ∆^k_i and δ^k_i denote the quantization intervals of weights and activations, respectively, and the same bit-width is applied to the weights and activations within one layer. By exploiting Eq. (4)-(6), the dynamic convolutional layer is exactly the aggregation of mixed-precision layers for a given input, which can fully exploit the potential of the resulting quantized neural network. Note that biases are omitted from the formulation for convenience; in practice, the quantization interval ∆^k_i of the biases is the same as that of the weights.
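To make Eq. (3)-(6) concrete, the following sketch aggregates the candidate quantizers with a one-hot selection vector. It reuses the hypothetical uniform_quantize helper from the earlier sketch, treats the selection as shared across the batch for simplicity, and picks illustrative quantization ranges and a 3×3 kernel; none of these choices come from the paper itself.

```python
import torch
import torch.nn.functional as F

def dynamic_quantize(z, p, bit_widths, l, r):
    """Eq. (5)/(6): sum_k p[k] * Q_{b_k}(z), where p is the (one-hot)
    selection vector over the K bit-width candidates."""
    out = torch.zeros_like(z)
    for k, b in enumerate(bit_widths):
        out = out + p[k] * uniform_quantize(z, b, l, r)
    return out

def dynamic_conv(x, w, p, bit_widths):
    """Eq. (4): convolve the dynamically quantized activations and weights."""
    w_hat = dynamic_quantize(w, p, bit_widths, l=float(w.min()), r=float(w.max()))
    x_hat = dynamic_quantize(x, p, bit_widths, l=0.0, r=float(x.max()))
    return F.conv2d(x_hat, w_hat, padding=1)   # 3x3 kernels assumed
```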
The diagram of the proposed dynamic quantized network (DQNet) is shown in Figure 2. Given an image, our goal is to provide an optimal trade-off between network performance and computational burden, so that the quantized neural network can still work well under a given computation limit. However, Eq. (4)-(6) cannot be directly optimized, since the selection indicators p^k_{i,j} of each layer are not fixed but vary with the input x. The test dataset cannot be obtained in advance, and for the training dataset, the storage required for the selection indicators would be tremendous. For example, for the ResNet-50 network with five bit-width candidates, the number of possible bit-width configurations per sample is 5^50 ≈ 8.8×10^34.
To address the above challenging problem, we employ a bit-controller to predict the bit-widths of weights and activations of all layers by identifying the complexity of each input sample. In practice, the bit-controller will output a vector consisting of prediction logits, representing the selection probabilities of each bit-width candidate in each layer.
The bit-controller is carefully designed with an extremely small architecture to avoid noticeably increasing the overall memory and computation burden of the resulting network. Specifically, the bit-controller is a smaller network consisting of the first several layers of the main network followed by an MLP, where the MLP consists of only two fully-connected layers in practice. In this way, our DQNet predicts the bit-width of each layer with negligible computation. We examine the impact of reusing some layers of the main quantized network for the bit-controller in Sec. 4.4. In addition, the bit-controller is jointly trained with the main quantized network in an end-to-end manner and generates the prediction logits of the bit-widths of all the subsequent layers at once.
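A minimal sketch of such a bit-controller (assuming PyTorch; the hidden size of 64, the class name, and the argument names are illustrative choices rather than the paper's specification):

```python
import torch.nn as nn

class BitController(nn.Module):
    """Tiny controller: a few early layers shared with the main network,
    followed by a two-layer MLP that outputs K bit-width logits per layer."""
    def __init__(self, shared_stem, feat_dim, num_layers, num_bits):
        super().__init__()
        self.stem = shared_stem              # first several layers of the main network
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, num_layers * num_bits),
        )
        self.num_layers, self.num_bits = num_layers, num_bits

    def forward(self, x):
        f = self.pool(self.stem(x)).flatten(1)
        logits = self.mlp(f)
        # one row of K logits per quantized layer: [batch, n_layers, K]
        return logits.view(-1, self.num_layers, self.num_bits)
```

For a ResNet-style backbone, shared_stem could be the first residual stage of the quantized network and feat_dim its output channel count.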
Assuming that the output logits of the bit-controller are h^1, h^2, ..., h^K for the weights and activations of a certain layer, the bit-width can be selected accordingly, i.e., p^k is determined as

p^k = 1, if k = argmax_{k'} h^{k'}; p^k = 0, otherwise. (7)
During both training and inference, only one bit-width is selected for each layer, as shown in Figure 3. To make the argmax sampling differentiable, we utilize the Gumbel-softmax trick during training:
p̃^k = exp((h^k + π^k)/τ) / Σ_{k'=1}^{K} exp((h^{k'} + π^{k'})/τ), (8)

where τ is the temperature parameter that controls how closely the samples approximate one-hot vectors, and π^k is a random noise sample that follows the Gumbel distribution, which can be described as:

π^k = −log(−log(u^k)), u^k ∼ U(0,1). (9)
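During training, PyTorch's built-in F.gumbel_softmax implements Eq. (8)-(9) directly; the sketch below (the function name and the plain-argmax fallback at inference are our additions) shows how a hard one-hot selection can be drawn while keeping gradients for the soft probabilities:

```python
import torch
import torch.nn.functional as F

def select_bit_width(logits, tau=1.0, training=True):
    """Hard one-hot bit-width selection (Eq. (7)). With hard=True, the forward
    pass yields a one-hot vector while gradients flow through the Gumbel-softmax
    probabilities of Eq. (8)-(9)."""
    if training:
        return F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
    # At inference, a plain argmax over the logits is sufficient.
    return F.one_hot(logits.argmax(dim=-1), logits.shape[-1]).float()
```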
Algorithm 1 The training procedure of the proposed dynamic quantization scheme.
Input: Input images X and their labels; the trade-off parameter α.
Output: The dynamic quantized neural network using our algorithm.
1: Set the number of epochs T and the batch size m for training DQNet
2: Initialize the weights of the network and the bit-controller randomly and quantize the input images
3: for t = 1,··· ,T do
4: Output the probabilities of different bit-widths for each layer using the bit-controller.
5: Generate the one-hot vectors for selections of different bit-widths using Eq. (7).
6: for i = 1,··· ,n do
7: for j = 1,··· ,m do
8: Quantize X_{i,j} and W_i for the j-th sample in the i-th layer according to the selection of different bit-widths using Eq. (5) and (6).
9: end for
10: Calculate the output of the i-th layer using Eq. (4).
11: end for
12: Update the entire network using back-propagation according to Eq. (11) and utilize Eq. (8) when updating the Gumbel-softmax layer.
13: end for
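As a rough illustration of Algorithm 1, a single training iteration could look as follows; dqnet, bit_controller, and their return values (the logits and the consumed Bit-FLOPs) are hypothetical interfaces assumed for this sketch:

```python
import torch
import torch.nn.functional as F

def train_step(dqnet, bit_controller, optimizer, x, y, alpha, b_target, tau=1.0):
    """One iteration of Algorithm 1 (hypothetical interfaces)."""
    logits = bit_controller(x)                                   # step 4: per-layer bit-width logits
    p = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)     # step 5: Eq. (7)/(8)
    out, bit_flops = dqnet(x, p)                                 # steps 6-11: Eq. (4)-(6)
    loss = F.cross_entropy(out, y) \
         + alpha * torch.clamp(bit_flops - b_target, min=0).mean()   # Eq. (11)
    optimizer.zero_grad()
    loss.backward()                                              # step 12: STE + Gumbel gradients
    optimizer.step()
    return loss.item()
```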
During the feed-forward process, after obtaining the bit-width of a certain layer for a specific sample, DQNet will quantize the weights and the activations accordingly, i.e.,

Ŵ_{i,j} = Q_{i,j}(W_i) = Q_{∆^k_i}(W_i),
X̂_{i,j} = Q_{i,j}(X_{i,j}) = Q_{δ^k_i}(X_{i,j}), where k = argmax_{k'} h^{k'}, (10)

where ∆^k_i and δ^k_i are the quantization intervals of the weights and activations under the predicted bit-width, respectively. During back-propagation, the quantized network utilizes the gradient computed by Eq. (8). It is worth noting that although there are multiple bit-width candidates, only the weights at the largest bit-width need to be stored in practice. Since the bottleneck on devices is the inference time, this slight additional weight storage can be ignored.
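At inference time, Eq. (10) thus reduces to running a single quantizer per layer. A sketch, reusing the hypothetical uniform_quantize helper from above (the quantization ranges and the function name are illustrative):

```python
def quantize_at_inference(x, w, logits, bit_widths):
    """Inference path of Eq. (10): only the argmax bit-width is evaluated,
    so exactly one quantizer runs per layer."""
    k = int(logits.argmax(dim=-1))
    b = bit_widths[k]
    w_hat = uniform_quantize(w, b, l=float(w.min()), r=float(w.max()))
    x_hat = uniform_quantize(x, b, l=0.0, r=float(x.max()))
    return x_hat, w_hat
```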
In order to control the computational cost of DQNet flexibly, we add a Bit-FLOPs constraint term to the loss function, so the total loss is formulated as:

L_total = L_cls + α · max(Σ_{i=1}^{n} B_i − B_tar, 0), (11)

where B_i denotes the Bit-FLOPs consumed by the i-th layer and B_tar is the target Bit-FLOPs budget.
Given a target Bit-FLOPs budget, the dynamic quantized neural network captures the inherent variance in the computational requirements of the dataset and allocates optimal bit-widths to different instances and different layers. The training procedure of the proposed dynamic quantization scheme is summarized in Algorithm 1.
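Eq. (11) needs the per-layer Bit-FLOPs B_i. A hedged sketch of this bookkeeping, under the common convention that Bit-FLOPs equal the layer's FLOPs multiplied by the weight and activation bit-widths (the function and argument names are ours, not the paper's):

```python
import torch

def total_bit_flops(flops_per_layer, p, bit_widths):
    """Sum of B_i in Eq. (11) under the (one-hot) per-layer selections p,
    assuming Bit-FLOPs = FLOPs x b_w x b_a with b_w = b_a = selected bit-width."""
    bits = torch.tensor(bit_widths, dtype=p.dtype, device=p.device)
    selected = (p * bits).sum(dim=-1)                  # selected bit-width per layer
    return (flops_per_layer * selected ** 2).sum(dim=-1)   # sum over layers
```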