, uint32, etc.), the scale (float32) and z (uint) are the factors by which we scale and shift the number line. z is the quantized 'zero-point', which will always map back exactly to 0.f.
Consider a floating point variable with range (Xmin, Xmax) that needs to be quantized to the range (0, N_levels − 1), where N_levels = 256 for 8 bits of precision. We derive two parameters, Scale (∆) and Zero-point (z), which map the floating point values to integers. The scale specifies the step size of the quantizer, and floating point zero maps to the zero-point. The zero-point is an integer, which ensures that zero is quantized with no error. This matters because common operations like zero padding then cause no quantization error.
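The mapping described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: round-to-nearest, clamping to (0, N_levels − 1), and a range that is widened to include 0.0. The function names `choose_quant_params`, `quantize` and `dequantize` are made up for this example and are not taken from any library.

```python
def choose_quant_params(x_min, x_max, n_levels=256):
    """Derive the scale (step size) and the integer zero-point for (x_min, x_max)."""
    # Assumption: widen the range so it contains 0.0, otherwise zero
    # could not be represented exactly.
    x_min = min(x_min, 0.0)
    x_max = max(x_max, 0.0)
    if x_max == x_min:
        return 1.0, 0  # degenerate range, any positive scale works
    scale = (x_max - x_min) / (n_levels - 1)
    zero_point = int(round(-x_min / scale))
    zero_point = max(0, min(n_levels - 1, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, n_levels=256):
    """Map a float to an integer level in [0, n_levels - 1]."""
    q = int(round(x / scale)) + zero_point
    return max(0, min(n_levels - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer level back to a float."""
    return scale * (q - zero_point)


scale, zero_point = choose_quant_params(-1.0, 3.0)
q_zero = quantize(0.0, scale, zero_point)      # equals zero_point
x_back = dequantize(q_zero, scale, zero_point)  # exactly 0.0
```

Because the zero-point is an integer, float 0.0 lands exactly on one of the 256 levels and comes back as exactly 0.0; any other in-range value comes back with an error of at most half a step.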
1 . Post Training Quantization
(1) . Weight only quantization
(2) . Quantizing weights and activations
2 . Quantization Aware Training
1 . Post Training Quantization
In many cases, it is desirable to reduce the model size by compressing weights, and/or to quantize both weights and activations for faster inference, without retraining the model. Post Training Quantization techniques are simpler to use and allow for quantization with limited data.
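Strictly speaking, with Post Training Quantization the computation is still carried out in float rather than int, so it only reduces the model size and does not improve speed.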
(1) . Weight only quantization
A simple approach is to only reduce the precision of the weights of the network from float to 8 bits. Since only the weights are quantized, this can be done without requiring any validation data.
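As a rough illustration of why no data is needed, the sketch below derives the quantizer parameters from the weights themselves and stores one byte per weight. It assumes a single scale and zero-point for the whole tensor; `quantize_weights` and `dequantize_weights` are hypothetical helper names, not an existing API.

```python
from array import array


def quantize_weights(weights, n_levels=256):
    """Quantize a list of float weights to 8-bit levels (one byte each)."""
    # The range comes from the weights alone, so no calibration data is needed.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / (n_levels - 1) or 1.0  # guard against all-zero weights
    zero_point = int(round(-w_min / scale))
    levels = bytes(
        max(0, min(n_levels - 1, int(round(w / scale)) + zero_point))
        for w in weights
    )
    return levels, scale, zero_point


def dequantize_weights(levels, scale, zero_point):
    """Recover approximate float weights from the stored bytes."""
    return [scale * (q - zero_point) for q in levels]


weights = [-0.5, -0.1, 0.0, 0.25, 0.7]
levels, scale, zero_point = quantize_weights(weights)
restored = dequantize_weights(levels, scale, zero_point)

float32_size = array("f", weights).itemsize * len(weights)  # bytes as float32
int8_size = len(levels)                                     # bytes after quantization
```

Apart from the two stored parameters, the quantized tensor takes a quarter of the float32 storage, and each restored weight differs from the original by at most half a quantization step.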
(2) . Quantizing weights and activations
One can quantize a floating point model to 8-bit precision by calculating the quantizer parameters for all the quantities to be quantized. Since activations need to be quantized, one needs calibration data and needs to calculate the dynamic ranges of activations.
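A minimal sketch of the calibration step might look like this. It assumes the dynamic range is simply the min and max seen over a handful of calibration inputs; the `RangeObserver` class, the toy `relu_layer` and the sample data are all invented for the example.

```python
class RangeObserver:
    """Tracks the running min/max of every activation value it is shown."""

    def __init__(self):
        self.x_min = float("inf")
        self.x_max = float("-inf")

    def observe(self, activations):
        self.x_min = min(self.x_min, min(activations))
        self.x_max = max(self.x_max, max(activations))

    def quant_params(self, n_levels=256):
        # Widen the observed range to include 0.0 so zero stays exact.
        x_min = min(self.x_min, 0.0)
        x_max = max(self.x_max, 0.0)
        scale = (x_max - x_min) / (n_levels - 1) or 1.0
        zero_point = int(round(-x_min / scale))
        return scale, zero_point


def relu_layer(inputs, weights):
    """Hypothetical tiny layer: one ReLU output per weight row."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]


weights = [[0.5, -1.0], [1.5, 0.25]]
calibration_inputs = [[1.0, 2.0], [-1.0, 0.5], [3.0, -2.0]]

observer = RangeObserver()
for sample in calibration_inputs:
    # Run the float model and record what the activations look like.
    observer.observe(relu_layer(sample, weights))

scale, zero_point = observer.quant_params()
```

Plain min/max tracking is the simplest choice; it is sensitive to outliers, so the calibration data should resemble what the model sees at inference time.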
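In this mode, on top of weight quantization, kernels that support quantized execution first quantize their inputs, then compute the activation, and then de-quantize the result back to float32. Kernels without quantized support compute directly in float32. This is somewhat faster than computing everything in float32.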
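The quantize, compute, de-quantize flow can be illustrated with a dot product. This is a sketch under assumptions, not how any particular kernel is implemented: the input range is taken from the input itself at run time, and all helper names (`quant_params`, `quantize_all`, `quantized_dot`) are hypothetical.

```python
def quant_params(values, n_levels=256):
    """Scale and zero-point for the range spanned by `values` (and 0.0)."""
    v_min = min(min(values), 0.0)
    v_max = max(max(values), 0.0)
    scale = (v_max - v_min) / (n_levels - 1) or 1.0
    zero_point = int(round(-v_min / scale))
    return scale, zero_point


def quantize_all(values, scale, zero_point, n_levels=256):
    """Quantize every float in `values` to an integer level."""
    return [
        max(0, min(n_levels - 1, int(round(v / scale)) + zero_point))
        for v in values
    ]


def quantized_dot(weights_q, w_scale, w_zp, inputs):
    """Quantize the float inputs, accumulate in integers, return a float."""
    # Step 1: quantize the incoming float activations.
    x_scale, x_zp = quant_params(inputs)
    inputs_q = quantize_all(inputs, x_scale, x_zp)
    # Step 2: integer-only multiply-accumulate.
    acc = sum((qw - w_zp) * (qx - x_zp) for qw, qx in zip(weights_q, inputs_q))
    # Step 3: de-quantize the accumulator back to float.
    return acc * w_scale * x_scale


weights = [0.5, -1.0, 0.25]
inputs = [1.0, 2.0, -3.0]

# Weights are quantized once, ahead of time.
w_scale, w_zp = quant_params(weights)
weights_q = quantize_all(weights, w_scale, w_zp)

approx = quantized_dot(weights_q, w_scale, w_zp, inputs)
exact = sum(w * x for w, x in zip(weights, inputs))
```

The inner loop only touches integers; the two float scales are applied once at the end, which is where the result returns to float.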
2 . Quantization Aware Training
Quantization aware training models quantization during training and can provide higher accuracies than post training quantization schemes.
We model the effect of quantization using simulated quantization operations on both weights and activations. For the backward pass, we use the straight through estimator to model quantization. Note that we use simulated quantized weights and activations for both forward and backward pass calculations.
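To make the idea concrete, here is a toy sketch of a simulated ("fake") quantization op with a straight through estimator, applied to a single scalar weight. It assumes a fixed weight range of (-1, 1) and lets the gradient pass unchanged inside that range and be zero outside it. All function names and the training setup are invented for illustration.

```python
def fake_quant(x, x_min, x_max, n_levels=256):
    """Simulated quantization: quantize, then immediately de-quantize (stays float)."""
    scale = (x_max - x_min) / (n_levels - 1)
    zero_point = int(round(-x_min / scale))
    q = max(0, min(n_levels - 1, int(round(x / scale)) + zero_point))
    return scale * (q - zero_point)


def fake_quant_grad(x, x_min, x_max, upstream):
    """Straight through estimator: treat rounding as identity in the backward pass."""
    # Assumption: the gradient is zero outside the clamping range.
    return upstream if x_min <= x <= x_max else 0.0


def train_step(w, x, y, lr, w_range=(-1.0, 1.0)):
    """One SGD step on loss = (fake_quant(w) * x - y) ** 2."""
    w_q = fake_quant(w, *w_range)           # forward pass uses the quantized weight
    pred = w_q * x
    loss = (pred - y) ** 2
    d_wq = 2.0 * (pred - y) * x             # gradient w.r.t. the quantized weight
    d_w = fake_quant_grad(w, *w_range, d_wq)  # passed straight through to the float weight
    return w - lr * d_w, loss


w = 0.0
losses = []
for _ in range(50):
    w, loss = train_step(w, x=1.0, y=0.5, lr=0.1)
    losses.append(loss)
```

The float weight `w` keeps accumulating small updates, while the forward and backward computations only ever see its quantized value, so the model learns under the same rounding it will face after conversion.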
(1) . Operations that currently support quantization aware training: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/toco/graph_transformations/quantize.cc
(2) . Quantization aware training is not currently supported with Keras; support has to wait for TensorFlow 2.0.
(1) https://arxiv.org/pdf/1806.08342.pdf
(2) https://arxiv.org/pdf/1712.05877.pdf