Performance
Performance is an important consideration when training machine learning models.
Better performance speeds up and scales research while also providing end users with near-instant predictions.
This section details the high-level APIs and best practices for building and training high-performance models, and for quantizing models for the lowest latency and highest throughput at inference.
The TensorFlow Model Optimization Toolkit is a set of techniques for optimizing models for inference; post-training quantization is described below.
XLA (Accelerated Linear Algebra) is an experimental compiler for linear algebra that optimizes TensorFlow computations; separate guides explore XLA in more depth.
Inference efficiency is a critical issue when deploying machine learning models to mobile devices.
Whereas the computational demand for training grows with the number of models trained on different architectures, the computational demand for inference grows in proportion to the number of users.
The TensorFlow Model Optimization Toolkit minimizes the complexity of inference: the model size, the latency, and the power consumption.
Model optimization is useful for reducing the model size, latency, and power consumption of inference. Model optimization uses multiple techniques; quantization is the one described in the rest of this section.
Quantizing deep neural networks uses techniques that allow for reduced-precision representations of weights and, optionally, activations, for both storage and computation.
Quantization provides several benefits. Many CPU and hardware accelerator implementations provide SIMD instruction capabilities, which are especially beneficial for quantization.
TensorFlow Lite provides several levels of support for quantization.
Post-training quantization quantizes weights and activations post training and is very easy to use. Quantization-aware training allows for training networks that can be quantized with minimal accuracy drop and is only available for a subset of convolutional neural network architectures.
Below are the results of the latency and accuracy of post-training quantization and quantization-aware training on a few models.
All latency numbers are measured on Pixel 2 devices using a single big core.
As the toolkit improves, so will the numbers here:
| Model | Top-1 Accuracy (Original) | Top-1 Accuracy (Post Training Quantized) | Top-1 Accuracy (Quantization Aware Training) | Latency (Original) (ms) | Latency (Post Training Quantized) (ms) | Latency (Quantization Aware Training) (ms) | Size (Original) (MB) | Size (Optimized) (MB) |
|---|---|---|---|---|---|---|---|---|
| Mobilenet-v1-1-224 | 0.709 | 0.657 | 0.70 | 180 | 145 | 80.2 | 16.9 | 4.3 |
| Mobilenet-v2-1-224 | 0.719 | 0.637 | 0.709 | 117 | 121 | 80.3 | 14 | 3.6 |
| Inception_v3 | 0.78 | 0.772 | 0.775 | 1585 | 1187 | 637 | 95.7 | 23.9 |
| Resnet_v2_101 | 0.770 | 0.768 | N/A | 3973 | 2868 | N/A | 178.3 | 44.9 |
Table 1: Benefits of model quantization for select CNN models
As a starting point, check if the models in the TensorFlow Lite model repository can work for your application. If not, we recommend starting with the post-training quantization tool, since it is broadly applicable and does not require training data.
For cases where the accuracy and latency targets are not met, or hardware accelerator support is important, quantization-aware training is the better option.
Post-training quantization is a general technique that reduces model size while also providing up to 3x lower latency, with little degradation in model accuracy. Post-training quantization quantizes weights from floating point to 8 bits of precision.
This technique is enabled as an option in the TensorFlow Lite model converter:
```python
import tensorflow as tf

# saved_model_dir is the path to an existing SavedModel directory.
converter = tf.contrib.lite.TocoConverter.from_saved_model(saved_model_dir)

# Enable post-training quantization of the weights.
converter.post_training_quantize = True

# Convert and write out the quantized TensorFlow Lite model.
tflite_quantized_model = converter.convert()
open("quantized_model.tflite", "wb").write(tflite_quantized_model)
```
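For reference, a model quantized this way can then be run with the TensorFlow Lite interpreter. The sketch below is a minimal example assuming a TensorFlow 1.x environment (matching the tf.contrib.lite converter above) and a model with a single input and a single output tensor.

```python
import numpy as np
import tensorflow as tf

# Load the quantized model produced by the converter above.
interpreter = tf.contrib.lite.Interpreter(model_path="quantized_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a random tensor with the expected shape and dtype, then run inference.
input_data = np.random.random_sample(input_details[0]["shape"]).astype(
    input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])
```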
At inference, weights are converted from 8-bits of precision to floating-point and computed using floating point kernels. This conversion is done once and cached to reduce latency.
To further improve latency, hybrid operators dynamically quantize activations to 8-bits and perform computations with 8-bit weights and activations. This optimization provides latencies close to fully fixed-point inference. However, the outputs are still stored using floating-point, so the speedup with hybrid ops is less than a full fixed-point computation.
Hybrid ops are available for the most compute-intensive operators in a network.
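To make the hybrid computation more concrete, the sketch below is a rough NumPy illustration (not the actual TensorFlow Lite kernels, which are more sophisticated and may use per-channel scales) of how 8-bit weights and dynamically quantized 8-bit activations can approximate a floating-point matrix multiply.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Symmetric linear quantization of a float tensor to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    scale = np.abs(x).max() / qmax            # a single scale for the tensor
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# Weights are quantized once, offline; activations are quantized
# dynamically at inference time.
weights = np.random.randn(64, 32).astype(np.float32)
activations = np.random.randn(32, 1).astype(np.float32)

q_w, s_w = quantize_symmetric(weights)
q_a, s_a = quantize_symmetric(activations)

# Integer matrix multiply with 32-bit accumulators, then rescale to float.
acc = q_w.astype(np.int32) @ q_a.astype(np.int32)
approx = acc.astype(np.float32) * (s_w * s_a)

exact = weights @ activations
print(np.max(np.abs(approx - exact)))         # small quantization error
```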
Since weights are quantized post-training, there could be an accuracy loss, particularly for smaller networks. Pre-trained fully quantized models are provided for specific networks in the TensorFlow Lite model repository. It is important to check the accuracy of the quantized model to verify that any degradation in accuracy is within acceptable limits. There is a tool to evaluate TensorFlow Lite model accuracy.
If the accuracy drop is too high, consider using quantization-aware training.
TensorFlow approaches the conversion of floating-point arrays of numbers into 8-bit representations as a compression problem.
Since the weights and activation tensors in trained neural network models tend to have values that are distributed across comparatively small ranges (for example, -15 to +15 for weights or -500 to 1000 for image model activations), and since neural nets tend to be robust to noise, the error introduced by quantizing to a small set of values keeps the precision of the overall results within an acceptable threshold.
A chosen representation must perform fast calculations, especially the large matrix multiplications that comprise the bulk of the computations while running a model.
The quantized range is represented with two floats that store the overall minimum and maximum values, corresponding to the lowest and highest quantized values.
Each entry in the quantized array represents a float value in that range, distributed linearly between the minimum and maximum.
For example, with a minimum of -10.0, a maximum of 30.0, and an 8-bit array, the quantized values represent the following:
| Quantized | Float |
|---|---|
| 0 | -10.0 |
| 128 | 10.0 |
| 255 | 30.0 |
Table 2: Example quantized value range
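To make the mapping in Table 2 concrete, the short sketch below converts between 8-bit quantized values and the floats they represent, using the minimum and maximum from the example above; 128 maps to roughly 10.08, which Table 2 rounds to 10.0.

```python
range_min, range_max = -10.0, 30.0       # stored alongside the quantized array
scale = (range_max - range_min) / 255.0  # width of one 8-bit step

def dequantize(q):
    """Map an 8-bit value (0-255) back to the float it represents."""
    return range_min + q * scale

def quantize(x):
    """Map a float in [range_min, range_max] to the nearest 8-bit value."""
    return int(round((x - range_min) / scale))

for q in (0, 128, 255):
    print(q, dequantize(q))              # -> -10.0, ~10.08, 30.0
```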
The advantages of this representation format are that it can compactly cover an arbitrary range of float values and that converting between quantized and float values is a simple linear transformation.