There are two approaches to quantizing a TensorFlow Lite model:
"hybrid" post-training quantization and post-training integer quantization.
The "hybrid" post-training quantization approach reduces model size and latency in many cases, but it has the limitation of requiring floating-point computation, which may not be available in all hardware accelerators (e.g., Edge TPUs).
Post-training integer quantization enables users to take an already-trained floating-point model and fully quantize it to only use 8-bit signed integers (i.e. `int8`). By leveraging this quantization scheme, we can get reasonable quantized model accuracy across many models without resorting to retraining a model with quantization-aware training. With this new tool, models will continue to be 4x smaller, but will see even greater CPU speed-ups. Fixed-point hardware accelerators, such as Edge TPUs, will also be able to run these models.
1] Per-axis (i.e., per-channel) or per-tensor weights are quantized to int8 with the representable range [-127, 127], and the zero-point is fixed at quantized value 0.
2] Per-tensor activations/inputs are quantized to int8 with the representable range [-128, 127]; their zero-point lies in [-128, 127] and is computed from the formula below.
Quantization parameters:
S (scale) = (R_max − R_min) / (Q_max − Q_min): how much real value each quantized unit represents.
Z (zero-point) = Q_max − R_max / S: which quantized value represents real zero.
Note that weights and activations are quantized differently:
1. For weights, real zero is represented by quantized value 0; for activations, real zero is represented by Z = Q_max − R_max / S.
2. The fixed-point ranges also differ: weights use [-127, 127], activations use [-128, 127].
For selecting the quantization thresholds, this scheme takes a simple approach:
For weights, the actual minimum and maximum values determine the quantization parameters.
For activation outputs, a moving average of the minimum and maximum values across batches determines the quantization parameters.
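To make the two schemes concrete, here is a minimal NumPy sketch (the function names are mine, not TFLite's); real values are recovered with real_value = scale * (q − zero_point):

import numpy as np

def quantize_weights_symmetric(w):
    # Weights: symmetric int8, range [-127, 127], zero_point fixed at 0.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_activations_asymmetric(x):
    # Activations: asymmetric int8, range [-128, 127],
    # zero_point derived from the observed real range.
    r_min, r_max = float(x.min()), float(x.max())
    scale = (r_max - r_min) / (127 - (-128))
    zero_point = int(round(127 - r_max / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

# e.g. a ReLU6-style range [0, 6]: scale = 6/255 ≈ 0.0235,
# zero_point = 127 - 6/0.0235 = -128, so real 0.0 maps to q = -128.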
The following simple example demonstrates how to generate a float TFLite model (no quantization), a hybrid post-training quantized model, and a post-training integer quantized model.
The TensorFlow version used is 2.0.0.
import tensorflow as tf
import numpy as np
print (tf.__version__)
(x_train, y_train),(x_test, y_test) = tf.keras.datasets.mnist.load_data()
#x_train, x_test = (x_train / 255.0), (x_test / 255.0)
x_train, x_test = (x_train / 255.0).astype(np.float32), (x_test / 255.0).astype(np.float32)
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1)
model.evaluate(x_test, y_test)
model.save('saved_model')
model.save('keras_model.h5')
Interestingly, the same save() function is used for both; the format is determined by the path: an .h5 suffix produces a single HDF5 file, otherwise a SavedModel directory is created.
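A quick sanity check from Python (a sketch using only the standard library):

import os
print(os.path.isdir('saved_model'))      # True: SavedModel directory
print(os.path.isfile('keras_model.h5'))  # True: single HDF5 file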
Before generating the quantized models, here is how a TFLiteConverter is created.
The Python API for converting TensorFlow models to TensorFlow Lite is tf.lite.TFLiteConverter. TFLiteConverter provides the following classmethods to convert a model based on the original model format:
TFLiteConverter.from_saved_model(): Converts SavedModel directories.
TFLiteConverter.from_keras_model(): Converts tf.keras models.
TFLiteConverter.from_concrete_functions(): Converts concrete functions.
A difference from 1.x is that from_keras_model_file("xxx.h5") is gone; instead, load the Keras model first and then convert it:
keras_model = tf.keras.models.load_model('keras_model.h5')
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
float_model = converter.convert()
open("float_model.tflite", "wb").write(float_model)
converter_hybrid = tf.lite.TFLiteConverter.from_saved_model('saved_model')
converter_hybrid.optimizations = [tf.lite.Optimize.DEFAULT]
hybrid_quant = converter_hybrid.convert()
open("hybrid_quant.tflite", "wb").write(hybrid_quant)
Compared with the float model and hybrid quantization, integer quantization is more involved:
a Dataset has to be built from the training data, using from_tensor_slices, batch, and take.
x_train, x_test = (x_train / 255.0).astype(np.float32), (x_test / 255.0).astype(np.float32)
The astype cast here is important; without it the conversion fails with the error shown below.
mnist_ds = tf.data.Dataset.from_tensor_slices((x_train)).batch(1)
def representative_data_gen():
    for input_value in mnist_ds.take(100):
        yield [input_value]
Traceback (most recent call last):
File "post_training_integer_quantization.py", line 57, in
integer_quant = converter_integer.convert()
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/lite/python/lite.py", line 450, in convert
constants.FLOAT)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/lite/python/lite.py", line 239, in _calibrate_quantize_model
inference_output_type, allow_float)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/lite/python/optimize/calibrator.py", line 75, in calibrate_and_quantize
self._calibrator.FeedTensor(calibration_sample)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/lite/python/optimize/tensorflow_lite_wrap_calibration_wrapper.py", line 112, in FeedTensor
return _tensorflow_lite_wrap_calibration_wrapper.CalibrationWrapper_FeedTensor(self, input_value)
ValueError: Cannot set tensor: Got tensor of type NOTYPE but expected type FLOAT32 for input 7, name: flatten_input
Reading the code along the backtrace from top to bottom (quoted after the sketch below), the failure clearly involves dataset_gen().
https://github.com/tensorflow/tensorflow/issues/30861 discusses both casting the ndarray's dtype and converting the ndarray to a tensor:
train = tf.convert_to_tensor(np.array(train, dtype='float32'))
I verified that this suggestion works; in fact, casting the ndarray's dtype without converting it to a tensor works too. The key is that the data yielded by representative_data_gen must be float32.
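Applied to this example, a generator that casts each batch explicitly might look like this (a sketch; the astype(np.float32) on x_train shown earlier achieves the same result):

def representative_data_gen():
    for input_value in mnist_ds.take(100):
        # Cast each calibration batch to float32 so FeedTensor
        # receives FLOAT32 rather than NOTYPE.
        yield [tf.cast(input_value, tf.float32)]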
448 if self._is_calibration_quantize():
449 result = self._calibrate_quantize_model(result, constants.FLOAT,
450 constants.FLOAT)
233 def _calibrate_quantize_model(self, result, inference_input_type,
234 inference_output_type):
235 allow_float = not self._is_int8_target_required()
236 calibrate_quantize = _calibrator.Calibrator(result)
237 return calibrate_quantize.calibrate_and_quantize(
238 self.representative_dataset.input_gen, inference_input_type,
239 inference_output_type, allow_float)
57 def calibrate_and_quantize(self, dataset_gen, input_type, output_type,
58 allow_float):
59 """Calibrates the model with specified generator and then quantizes it.
60
61 Returns:
62 A quantized model.
63
64 Args:
65 dataset_gen: A generator that generates calibration samples.
66 input_type: A tf.dtype representing the desired real-value input type.
67 output_type: A tf.dtype representing the desired real-value output type.
68 allow_float: A boolean. False if the resulting model cannot perform float
69 computation, useful when targeting an integer-only backend.
70 If False, an error will be thrown if an operation cannot be
71 quantized, otherwise the model will fallback to float ops.
72 """
73 self._calibrator.Prepare()
74 for calibration_sample in dataset_gen():
75 self._calibrator.FeedTensor(calibration_sample)
76 return self._calibrator.QuantizeModel(
77 np.dtype(input_type.as_numpy_dtype()).num,
78 np.dtype(output_type.as_numpy_dtype()).num, allow_float)
111 def FeedTensor(self, input_value):
112 return _tensorflow_lite_wrap_calibration_wrapper.CalibrationWrapper_FeedTensor(self, input_value)
mnist_ds = tf.data.Dataset.from_tensor_slices((x_train)).batch(1)
def representative_data_gen():
    for input_value in mnist_ds.take(100):
        yield [input_value]
converter_integer = tf.lite.TFLiteConverter.from_saved_model('saved_model')
converter_integer.optimizations = [tf.lite.Optimize.DEFAULT]
#converter.representative_dataset = representative_data_gen
converter_integer.representative_dataset = tf.lite.RepresentativeDataset(representative_data_gen)
# In TF 2.x this constraint is expressed via target_spec (target_ops is the 1.x attribute):
converter_integer.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
integer_quant = converter_integer.convert()
open("integer_quan.tflite", "wb").write(integer_quant)
-rw-rw-r-- 1 ws ws  408320 Jan  2 19:43 float_model.tflite
-rw-rw-r-- 1 ws ws  103664 Jan  2 19:43 hybrid_quant.tflite
-rw-rw-r-- 1 ws ws  104144 Jan  2 19:43 integer_quan.tflite
-rw-rw-r-- 1 ws ws 1248544 Jan  2 19:43 keras_model.h5
-rw-rw-r-- 1 ws ws  104144 Jan  2 17:52 post_training_integer_quan.tflite
drwxr-xr-x 4 ws ws    4096 Jan  2 10:47 saved_model
Comparing float_model, hybrid, integer, and the h5 model:
float_model and hybrid have identical graph structure but different parameter dtypes (float32 vs. int8); the integer-quantized model additionally has a quantize op inserted at the input and a dequantize op at the output.
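This can be confirmed from the input/output details: with the TF 2.0 converter the model's interface stays float32, precisely because the quantize/dequantize pair sits at the boundary (a sketch):

interp_check = tf.lite.Interpreter(model_path='integer_quan.tflite')
interp_check.allocate_tensors()
print(interp_check.get_input_details()[0]['dtype'])   # expect numpy.float32
print(interp_check.get_output_details()[0]['dtype'])  # expect numpy.float32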
(x_train, y_train),(x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = (x_train / 255.0).astype(np.float32), (x_test / 255.0).astype(np.float32)
Reuse x_test and y_test from above; since the inference results will be compared against the labels, build the Dataset as follows:
mnist_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(1)
interpreter = tf.lite.Interpreter(model_path='float_model.tflite')
interpreter.allocate_tensors()
interpreter_quant = tf.lite.Interpreter(model_path='integer_quan.tflite')
interpreter_quant.allocate_tensors()
for img, label in mnist_ds:  # img and label are not temporaries; Python loop variables remain usable after the loop
    break
interpreter.set_tensor(interpreter.get_input_details()[0]["index"], img)
interpreter.invoke()
predictions = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
The integer quantization model:
for img, label in mnist_ds:
    break
interpreter_quant.set_tensor(
    interpreter_quant.get_input_details()[0]["index"], img)
interpreter_quant.invoke()
predictions = interpreter_quant.get_tensor(
    interpreter_quant.get_output_details()[0]["index"])
The inference result, predictions, is a NumPy array, while label is a tensor.
predictions is one-hot-style (one score per class), while label is sparse-encoded: array([7]) represents the digit "7".
In [52]: predictions
Out[52]:
array([[0. , 0. , 0. , 0.00390625, 0. ,
0. , 0. , 0.99609375, 0. , 0. ]],
dtype=float32)
In [53]: label
Out[53]:
def eval_model(interpreter, mnist_ds):
    total_seen = 0
    num_correct = 0
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    for img, label in mnist_ds:
        total_seen += 1
        interpreter.set_tensor(input_index, img)
        interpreter.invoke()
        predictions = interpreter.get_tensor(output_index)
        if np.argmax(predictions) == label.numpy()[0]:  # the key step: compare argmax against the sparse label
            num_correct += 1
        if total_seen % 500 == 0:
            print("Accuracy after %i images: %f" %
                  (total_seen, float(num_correct) / float(total_seen)))
    return float(num_correct) / float(total_seen)
Validate with the same test set using the different interpreters:
# Create smaller dataset for demonstration purposes
mnist_ds_demo = mnist_ds.take(2000)
print(eval_model(interpreter, mnist_ds_demo))
print(eval_model(interpreter_quant, mnist_ds_demo))
In [57]: print(eval_model(interpreter, mnist_ds_demo))
Accuracy after 500 images: 0.974000
Accuracy after 1000 images: 0.961000
Accuracy after 1500 images: 0.954000
Accuracy after 2000 images: 0.951500
Accuracy after 2500 images: 0.946000
Accuracy after 3000 images: 0.947000
Accuracy after 3500 images: 0.949714
Accuracy after 4000 images: 0.945750
Accuracy after 4500 images: 0.945111
Accuracy after 5000 images: 0.945400
Accuracy after 5500 images: 0.949273
Accuracy after 6000 images: 0.950000
Accuracy after 6500 images: 0.951385
Accuracy after 7000 images: 0.952000
Accuracy after 7500 images: 0.954000
Accuracy after 8000 images: 0.955750
Accuracy after 8500 images: 0.957412
Accuracy after 9000 images: 0.959444
Accuracy after 9500 images: 0.960421
Accuracy after 10000 images: 0.959000
0.959
In [58]: print(eval_model(interpreter_quant, mnist_ds_demo))
Accuracy after 500 images: 0.978000
Accuracy after 1000 images: 0.964000
Accuracy after 1500 images: 0.956667
Accuracy after 2000 images: 0.954000
Accuracy after 2500 images: 0.948000
Accuracy after 3000 images: 0.948333
Accuracy after 3500 images: 0.950857
Accuracy after 4000 images: 0.946500
Accuracy after 4500 images: 0.945333
Accuracy after 5000 images: 0.945600
Accuracy after 5500 images: 0.949273
Accuracy after 6000 images: 0.950167
Accuracy after 6500 images: 0.951538
Accuracy after 7000 images: 0.952143
Accuracy after 7500 images: 0.954133
Accuracy after 8000 images: 0.955875
Accuracy after 8500 images: 0.957412
Accuracy after 9000 images: 0.959444
Accuracy after 9500 images: 0.960316
Accuracy after 10000 images: 0.958800
0.9588
In this example, accuracy drops slightly after quantization: 0.959 → 0.9588.
The op versions in the generated model file are determined by tensorflow/lite/tools/versioning/op_version.cc, so the versions used when registering ops must match; otherwise the corresponding op cannot be found.
// Op versions discussed in this file are enumerated here:
// tensorflow/lite/tools/versioning/op_version.cc
class MicroMutableOpResolver : public OpResolver {
 public:
  const TfLiteRegistration* FindOp(tflite::BuiltinOperator op,
                                   int version) const override;
  const TfLiteRegistration* FindOp(const char* op, int version) const override;
  void AddBuiltin(tflite::BuiltinOperator op, TfLiteRegistration* registration,
                  int min_version = 1, int max_version = 1);
  void AddCustom(const char* name, TfLiteRegistration* registration,
                 int min_version = 1, int max_version = 1);

 private:
  TfLiteRegistration registrations_[TFLITE_REGISTRATIONS_MAX];
  int registrations_len_ = 0;
  TF_LITE_REMOVE_VIRTUAL_DELETE
};
// Register each supported op with AddBuiltin(op, registration,
// min_version, max_version):
AllOpsResolver::AllOpsResolver() {
  AddBuiltin(BuiltinOperator_FULLY_CONNECTED, Register_FULLY_CONNECTED(), 1, 4);
  AddBuiltin(BuiltinOperator_MAX_POOL_2D, Register_MAX_POOL_2D());
  AddBuiltin(BuiltinOperator_SOFTMAX, Register_SOFTMAX());
  AddBuiltin(BuiltinOperator_LOGISTIC, Register_LOGISTIC());
  AddBuiltin(BuiltinOperator_SVDF, Register_SVDF());
  AddBuiltin(BuiltinOperator_CONV_2D, Register_CONV_2D(), 1, 3);
  AddBuiltin(BuiltinOperator_CONCATENATION, Register_CONCATENATION(), 1, 3);
  AddBuiltin(BuiltinOperator_DEPTHWISE_CONV_2D, Register_DEPTHWISE_CONV_2D(), 1, 3);
  AddBuiltin(BuiltinOperator_AVERAGE_POOL_2D, Register_AVERAGE_POOL_2D(), 1, 2);
}
The comment indicates that TfLiteQuantizationParams will be deprecated in favor of TfLiteAffineQuantization, but TfLiteQuantizationParams is still in use because some ops do not support per-channel quantization; in effect, TfLiteAffineQuantization is the array form of TfLiteQuantizationParams.
// Legacy. Will be deprecated in favor of TfLiteAffineQuantization.
// If per-layer quantization is specified this field will still be populated in
// addition to TfLiteAffineQuantization.
// Parameters for asymmetric quantization. Quantized values can be converted
// back to float using:
// real_value = scale * (quantized_value - zero_point)
typedef struct {
  float scale;
  int32_t zero_point;
} TfLiteQuantizationParams;
// Parameters for asymmetric quantization across a dimension (i.e per output
// channel quantization).
// quantized_dimension specifies which dimension the scales and zero_points
// correspond to.
// For a particular value in quantized_dimension, quantized values can be
// converted back to float using:
// real_value = scale * (quantized_value - zero_point)
typedef struct {
  TfLiteFloatArray* scale;
  TfLiteIntArray* zero_point;
  int32_t quantized_dimension;
} TfLiteAffineQuantization;
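In other words, TfLiteAffineQuantization carries one (scale, zero_point) pair per slice along quantized_dimension. A minimal NumPy sketch of per-channel dequantization (names are mine):

import numpy as np

def dequantize_per_channel(q, scales, zero_points, quantized_dimension=0):
    # real_value = scale[c] * (quantized_value - zero_point[c]),
    # with c indexing along quantized_dimension.
    q = q.astype(np.float32)
    shape = [1] * q.ndim
    shape[quantized_dimension] = -1
    s = np.asarray(scales, dtype=np.float32).reshape(shape)
    z = np.asarray(zero_points, dtype=np.float32).reshape(shape)
    return s * (q - z)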
From the dump below one can see that the zero_point of input/output/activation tensors is not 0, whereas the zero_point of weights and biases is 0 but per-channel; activations are always single-channel (per-tensor).
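Note that the dump below is taken from a separate, convolutional model, not the dense MNIST model above. A sketch that prints similar information, assuming a TF version whose get_tensor_details() exposes 'quantization_parameters' (the model path is hypothetical):

interp = tf.lite.Interpreter(model_path='some_int8_model.tflite')  # hypothetical path
interp.allocate_tensors()
for t in interp.get_tensor_details():
    qp = t['quantization_parameters']
    scales, zero_points = qp['scales'], qp['zero_points']
    if len(scales) == 0:
        continue  # tensor carries no quantization parameters
    print('--- tensor name:', t['name'], '---')
    print('channels:', len(scales))
    for s, z in zip(scales, zero_points):
        print('quantization: scale: %f, zero_point: %d' % (s, z))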
<<----------------------------------------------------------------
quantization: -8.117244720458984 ≤ 0.07573700696229935 * (q - -21) ≤ 11.19569206237793
--- tensor name: depthwise_conv2d_input_int8 ---
quantization: scale: 0.075737, zero_point: -21
channels: 1
quantization: scale: 0.075737, zero_point: -21
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/depthwise_conv2d/depthwise/ReadVariableOp ---
quantization: scale: 0.003461, zero_point: 0
channels: 8
quantization: scale: 0.003461, zero_point: 0
quantization: scale: 0.003643, zero_point: 0
quantization: scale: 0.004665, zero_point: 0
quantization: scale: 0.003722, zero_point: 0
quantization: scale: 0.002931, zero_point: 0
quantization: scale: 0.003288, zero_point: 0
quantization: scale: 0.003704, zero_point: 0
quantization: scale: 0.003462, zero_point: 0
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/depthwise_conv2d/depthwise_bias ---
quantization: scale: 0.000262, zero_point: 0
channels: 8
quantization: scale: 0.000262, zero_point: 0
quantization: scale: 0.000276, zero_point: 0
quantization: scale: 0.000353, zero_point: 0
quantization: scale: 0.000282, zero_point: 0
quantization: scale: 0.000222, zero_point: 0
quantization: scale: 0.000249, zero_point: 0
quantization: scale: 0.000281, zero_point: 0
quantization: scale: 0.000262, zero_point: 0
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/depthwise_conv2d/Relu ---
quantization: scale: 0.033231, zero_point: -128
channels: 1
quantization: scale: 0.033231, zero_point: -128
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/average_pooling2d/AvgPool ---
quantization: scale: 0.033231, zero_point: -128
channels: 1
quantization: scale: 0.033231, zero_point: -128
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/conv2d/Conv2D/ReadVariableOp ---
quantization: scale: 0.005083, zero_point: 0
channels: 16
quantization: scale: 0.005083, zero_point: 0
quantization: scale: 0.006344, zero_point: 0
quantization: scale: 0.005207, zero_point: 0
quantization: scale: 0.003689, zero_point: 0
quantization: scale: 0.004211, zero_point: 0
quantization: scale: 0.005478, zero_point: 0
quantization: scale: 0.010819, zero_point: 0
quantization: scale: 0.004262, zero_point: 0
quantization: scale: 0.010639, zero_point: 0
quantization: scale: 0.004018, zero_point: 0
quantization: scale: 0.009138, zero_point: 0
quantization: scale: 0.003741, zero_point: 0
quantization: scale: 0.005365, zero_point: 0
quantization: scale: 0.003906, zero_point: 0
quantization: scale: 0.004800, zero_point: 0
quantization: scale: 0.004889, zero_point: 0
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/conv2d/Conv2D_bias ---
quantization: scale: 0.000169, zero_point: 0
channels: 16
quantization: scale: 0.000169, zero_point: 0
quantization: scale: 0.000211, zero_point: 0
quantization: scale: 0.000173, zero_point: 0
quantization: scale: 0.000123, zero_point: 0
quantization: scale: 0.000140, zero_point: 0
quantization: scale: 0.000182, zero_point: 0
quantization: scale: 0.000360, zero_point: 0
quantization: scale: 0.000142, zero_point: 0
quantization: scale: 0.000354, zero_point: 0
quantization: scale: 0.000134, zero_point: 0
quantization: scale: 0.000304, zero_point: 0
quantization: scale: 0.000124, zero_point: 0
quantization: scale: 0.000178, zero_point: 0
quantization: scale: 0.000130, zero_point: 0
quantization: scale: 0.000160, zero_point: 0
quantization: scale: 0.000162, zero_point: 0
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/conv2d/Relu ---
quantization: scale: 0.012436, zero_point: -128
channels: 1
quantization: scale: 0.012436, zero_point: -128
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/average_pooling2d_1/AvgPool ---
quantization: scale: 0.012436, zero_point: -128
channels: 1
quantization: scale: 0.012436, zero_point: -128
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/dense/MatMul/ReadVariableOp/transpose ---
quantization: scale: 0.009346, zero_point: 0
channels: 1
quantization: scale: 0.009346, zero_point: 0
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/dense/MatMul_bias ---
quantization: scale: 0.000116, zero_point: 0
channels: 1
quantization: scale: 0.000116, zero_point: 0
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/dense/Relu ---
quantization: scale: 0.047351, zero_point: -128
channels: 1
quantization: scale: 0.047351, zero_point: -128
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/dense_1/MatMul/ReadVariableOp/transpose ---
quantization: scale: 0.009934, zero_point: 0
channels: 1
quantization: scale: 0.009934, zero_point: 0
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/dense_1/MatMul_bias ---
quantization: scale: 0.000470, zero_point: 0
channels: 1
quantization: scale: 0.000470, zero_point: 0
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: StatefulPartitionedCall/sequential/dense_1/BiasAdd ---
quantization: scale: 0.079378, zero_point: 28
channels: 1
quantization: scale: 0.079378, zero_point: 28
----------------------------------------------------------------->>
<<----------------------------------------------------------------
--- tensor name: Identity_int8 ---
quantization: scale: 0.003906, zero_point: -128
channels: 1
quantization: scale: 0.003906, zero_point: -128
----------------------------------------------------------------->>