TensorFlow Lite Quantization Principles

I. Principles

The basic quantization formula:

$$ r = S \, (q - z) $$

Here:

  • r is the real value (usually float32)
  • q is its quantized representation as a B-bit integer (uint8, uint32, etc.)
  • S (float32) and z (uint) are the factors by which we scale and shift the number line. z is the quantized ‘zero-point’ which will always map back exactly to 0.f.

Simplification:

$$ \Delta = \frac{x_{max} - x_{min}}{N_{levels} - 1} $$

$$ x_{int} = \mathrm{round}\!\left(\frac{x}{\Delta}\right) + z, \qquad x_Q = \mathrm{clamp}\left(0,\ N_{levels} - 1,\ x_{int}\right) $$

Consider a floating point variable with range (x_min, x_max) that needs to be quantized to the range (0, N_levels − 1), where N_levels = 256 for 8 bits of precision. We derive two parameters, scale (∆) and zero-point (z), which map the floating point values to integers. The scale specifies the step size of the quantizer, and floating point zero maps to the zero-point. The zero-point is an integer, ensuring that zero is quantized with no error. This is important to ensure that common operations like zero padding do not cause quantization error.
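To make the mapping concrete, here is a minimal NumPy sketch of an asymmetric uint8 quantizer built from the two formulas above. The function names (quantize_params, quantize, dequantize) are my own illustration, not part of any TensorFlow API:

```python
import numpy as np

def quantize_params(x_min, x_max, n_levels=256):
    """Derive scale and zero-point for an asymmetric uint8 quantizer."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must contain 0
    scale = (x_max - x_min) / (n_levels - 1)
    zero_point = int(round(-x_min / scale))          # integer, so 0.0 maps exactly
    return scale, zero_point

def quantize(r, scale, zero_point, n_levels=256):
    q = np.round(r / scale) + zero_point
    return np.clip(q, 0, n_levels - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

r = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
s, z = quantize_params(r.min(), r.max())
q = quantize(r, s, z)
print(q, dequantize(q, s, z))  # 0.0 round-trips with no error
```

Because the zero-point is rounded to an integer, 0.0 always quantizes and de-quantizes exactly, which is the property the paragraph above emphasizes for zero padding.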

Quantized convolution:

[Figure: derivation of the quantized convolution / matrix multiplication arithmetic]
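The derivation in the second reference (arXiv:1712.05877) reduces a float product $r_3 = r_1 \cdot r_2$ to integer arithmetic: substituting $r = S(q - z)$ gives $q_3 = z_3 + M \sum (q_1 - z_1)(q_2 - z_2)$ with $M = S_1 S_2 / S_3$. A NumPy sketch of that identity (real TFLite kernels implement the rescale M as a fixed-point multiply rather than a float multiply):

```python
import numpy as np

def quantized_matmul(q1, s1, z1, q2, s2, z2, s3, z3):
    """Integer matmul following r = S(q - z):
    q3 = z3 + (s1*s2/s3) * sum((q1 - z1) * (q2 - z2))."""
    # The inner accumulation is pure int32 arithmetic.
    acc = (q1.astype(np.int32) - z1) @ (q2.astype(np.int32) - z2)
    m = (s1 * s2) / s3   # rescale factor M; fixed-point in real kernels
    q3 = np.round(m * acc) + z3
    return np.clip(q3, 0, 255).astype(np.uint8)
```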

II. Quantization Methods

1. Post-Training Quantization

(1) Weight-only quantization

(2) Quantizing weights and activations

2. Quantization-Aware Training

[Figure: overview of the quantization schemes]

1. Post-Training Quantization

In many cases, it is desirable to reduce the model size by compressing weights and/or quantizing both weights and activations for faster inference, without having to re-train the model. Post-training quantization techniques are simpler to use and allow for quantization with limited data.

Strictly speaking, in post-training quantization the computation is still carried out in float rather than int, so it only reduces the model size; inference speed is not really improved.

(1) Weight-only quantization

A simple approach is to only reduce the precision of the weights of the network from float to 8 bits. Since only the weights are quantized, this can be done without requiring any validation data.

In this mode, the model's weights are quantized to uint8 for storage, but during computation they are de-quantized back to float.
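As a sketch of how this mode is enabled with the tf.lite.TFLiteConverter API (TF 1.14+/2.x; this post pre-dates that API, so treat it as an illustration, and note the SavedModel path is a placeholder):

```python
import tensorflow as tf

# Placeholder path standing in for your own SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")

# Optimize.DEFAULT with no representative dataset selects weight-only
# (dynamic-range) quantization: weights are stored as 8-bit integers
# and de-quantized back to float at inference time.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("model_weight_quant.tflite", "wb") as f:
    f.write(tflite_model)
```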

(2) Quantizing weights and activations

One can quantize a floating point model to 8-bit precision by calculating the quantizer parameters for all the quantities to be quantized. Since activations need to be quantized, calibration data is required in order to estimate the dynamic ranges of the activations.

In this mode, on top of weight quantization, kernels that support quantized computation first quantize their inputs, run the computation (including the activation) in the quantized domain, and then de-quantize the result back to float32; unsupported kernels simply compute in float32. This is somewhat faster than doing everything directly in float32.
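With the same converter API as above, this mode additionally needs a representative dataset for calibration; a minimal sketch, where the input shape and the random data are stand-ins for your real calibration samples:

```python
import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# A few hundred representative inputs are enough for the converter to
# estimate the (min, max) dynamic range of every activation tensor.
def representative_dataset():
    for _ in range(100):
        # Must match the model's input signature; random data is a stand-in.
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
```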

2. Quantization-Aware Training

Quantization-aware training models quantization during training and can provide higher accuracy than post-training quantization schemes.

We model the effect of quantization using simulated quantization operations on both weights and activations. For the backward pass, we use the straight-through estimator to model quantization. Note that we use simulated quantized weights and activations for both forward and backward pass calculations.

In this mode, besides quantizing the weights, simulated (fake) quantization is also inserted during training to find the min/max output range of each op. The goal is that the whole computation path, not just during training but also at inference time, runs in uint8: the model is not only compressed, the computation speed is also improved.
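A minimal sketch of how this rewrite was done in the TF 1.x era this post targets, using tf.contrib.quantize (build_model_and_loss is a hypothetical stand-in for your own model-building code):

```python
import tensorflow as tf  # TF 1.x; tf.contrib was removed in TF 2.0

g = tf.Graph()
with g.as_default():
    # Hypothetical helper standing in for your own float model definition.
    loss = build_model_and_loss()

    # Rewrites the graph in place, inserting fake-quant ops after weights
    # and activations so training sees the rounding/clamping error and
    # records each op's (min, max) output range.
    tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=2000)

    train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(loss)

# For export, rebuild the model in a fresh graph and call
# tf.contrib.quantize.create_eval_graph() before freezing and converting.
```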

 

III. Generating the .tflite File

https://blog.csdn.net/qq_16564093/article/details/78996563

IV. Notes

(1) Ops currently supported by aware-quant: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/toco/graph_transformations/quantize.cc

[Figure: list of ops handled by quantize.cc]

(2) Keras does not currently support aware-quant; this will not be supported until TensorFlow 2.0.

 

V. References

(1) R. Krishnamoorthi, "Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper", https://arxiv.org/pdf/1806.08342.pdf

(2) B. Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference", https://arxiv.org/pdf/1712.05877.pdf
