TensorFlow Performance: Model Optimization

Performance

Performance is an important consideration when training machine learning models.

Performance speeds up and scales research while also providing end users with near-instant predictions.

This section covers the high-level APIs and best practices for building and training high-performance models, and for quantizing models to achieve the lowest latency and highest throughput at inference.

 

  • Performance Guide contains a collection of best practices for optimizing your TensorFlow code.
  • Data input pipeline guide describes the tf.data API for building efficient data input pipelines for TensorFlow.
  • Benchmarks contains a collection of benchmark results for a variety of hardware configurations.
  • For optimizing inference on GPUs, refer to NVIDIA TensorRT™ integration with TensorFlow.

 

 

The TensorFlow Model Optimization Toolkit is a set of techniques for optimizing models for inference:

  • Overview, which introduces the model optimization toolkit.
  • Post-training quantization, which describes post-training quantization.

 

XLA (Accelerated Linear Algebra) is an experimental compiler for linear algebra that optimizes TensorFlow computations. The following guides explore XLA:

  • XLA Overview, which introduces XLA.
  • Broadcasting Semantics, which describes XLA's broadcasting semantics.
  • Developing a new back end for XLA, which explains how to re-target TensorFlow in order to optimize the performance of the computational graph for particular hardware.
  • Using JIT Compilation, which describes the XLA JIT compiler that compiles and runs parts of TensorFlow graphs via XLA in order to optimize performance.
  • Operation Semantics, which is a reference manual describing the semantics of operations in the ComputationBuilder interface.
  • Shapes and Layout, which details the Shape protocol buffer.
  • Using AOT compilation, which explains tfcompile, a standalone tool that compiles TensorFlow graphs into executable code in order to optimize performance.

 

Model optimization

Inference efficiency is a critical issue when deploying machine learning models to mobile devices.

Whereas the computational demand for training grows with the number of models trained on different architectures, the computational demand for inference grows in proportion to the number of users.

The TensorFlow Model Optimization Toolkit minimizes the complexity of inference: the model size, latency, and power consumption.

Use cases

Model optimization is useful for:

  • Deploying models to edge devices with restrictions on processing, memory, or power consumption, for example, mobile and Internet of Things (IoT) devices.
  • Reducing the payload size for over-the-air model updates.
  • Executing on hardware constrained to fixed-point operations.
  • Optimizing models for special-purpose hardware accelerators.

Optimization methods

Model optimization uses multiple techniques:

  • Reduced parameter count, for example, pruning and structured pruning (see the sketch after this list).
  • Reduced representational precision, for example, quantization.
  • An updated model topology that is more efficient, with fewer parameters or faster execution, for example, tensor decomposition methods and distillation.
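To make the pruning item concrete, below is a minimal sketch of magnitude-based pruning. It assumes the separate tensorflow_model_optimization package and a small tf.keras model; the layer sizes and training data are made up for illustration and are not part of the toolkit documentation above.

import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Dummy data standing in for a real training set (assumption for illustration).
x_train = np.random.rand(256, 784).astype("float32")
y_train = np.random.randint(0, 10, size=(256,))

# A small dense model; any Keras model supported by the pruning API works.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Gradually raise sparsity from 0% to 80% of the weights over 1000 steps.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=1000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)

pruned.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])

# UpdatePruningStep applies the pruning schedule as training progresses.
pruned.fit(x_train, y_train, epochs=2,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before export so the saved model stays small.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)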

Model quantization

Quantizing deep neural networks uses techniques that allow for reduced precision representations of weights and, optionally, activations for both storage and computation.


Quantization provides several benefits:


  • Support on existing CPU platforms.
  • Quantizing activations reduces memory access costs for reading and storing intermediate activations.
  • Many CPU and hardware accelerator implementations provide SIMD instruction capabilities, which are especially beneficial for quantization.

 

TensorFlow Lite provides several levels of support for quantization.

Post-training quantization quantizes weights and activations after training and is very easy to use. Quantization-aware training allows for training networks that can be quantized with minimal accuracy drop; it is only available for a subset of convolutional neural network architectures.
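
For illustration, here is a minimal sketch of the quantization-aware rewriting step using the TF 1.x tf.contrib.quantize API; the toy model below is an assumption and stands in for a real convolutional network.

import tensorflow as tf

# A toy float training graph; in practice this is your real network.
inputs = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(inputs, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# Rewrite the graph with fake-quantization ops so training learns weights
# that survive 8-bit quantization; delay quantization for 2000 steps so the
# float model stabilizes first.
tf.contrib.quantize.create_training_graph(
    input_graph=tf.get_default_graph(), quant_delay=2000)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# ...run training with a tf.Session as usual...

# At export time, rebuild the model in a fresh graph, call
# tf.contrib.quantize.create_eval_graph(), then freeze and convert it with
# the TensorFlow Lite converter.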

 

Latency and accuracy results

Below are the latency and accuracy results of post-training quantization and quantization-aware training on a few models.

All latency numbers are measured on Pixel 2 devices using a single big core.

As the toolkit improves, so will the numbers here:

Model | Top-1 Accuracy (Original) | Top-1 Accuracy (Post-Training Quantized) | Top-1 Accuracy (Quantization-Aware Training) | Latency (Original) (ms) | Latency (Post-Training Quantized) (ms) | Latency (Quantization-Aware Training) (ms) | Size (Original) (MB) | Size (Optimized) (MB)
--- | --- | --- | --- | --- | --- | --- | --- | ---
Mobilenet-v1-1-224 | 0.709 | 0.657 | 0.70 | 180 | 145 | 80.2 | 16.9 | 4.3
Mobilenet-v2-1-224 | 0.719 | 0.637 | 0.709 | 117 | 121 | 80.3 | 14 | 3.6
Inception_v3 | 0.78 | 0.772 | 0.775 | 1585 | 1187 | 637 | 95.7 | 23.9
Resnet_v2_101 | 0.770 | 0.768 | N/A | 3973 | 2868 | N/A | 178.3 | 44.9

Table 1: Benefits of model quantization for select CNN models

Choice of quantization tool

As a starting point, check if the models in the TensorFlow Lite model repository can work for your application. If not, we recommend that users start with the post-training quantization tool since this is broadly applicable and does not require training data.

For cases where the accuracy and latency targets are not met, or hardware accelerator support is important, quantization-aware training is the better option.


 

 

Post-training quantization

Post-training quantization is a general technique to reduce model size while also providing up to 3x lower latency with little degradation in model accuracy. Post-training quantization quantizes weights from floating point to 8 bits of precision.

This technique is enabled as an option in the TensorFlow Lite model converter:

import tensorflow as tf

# Load a SavedModel and convert it to TensorFlow Lite format with
# post-training weight quantization enabled (TF 1.x contrib API).
converter = tf.contrib.lite.TocoConverter.from_saved_model(saved_model_dir)
converter.post_training_quantize = True
tflite_quantized_model = converter.convert()

# Write the quantized flatbuffer to disk.
open("quantized_model.tflite", "wb").write(tflite_quantized_model)



At inference, weights are converted from 8-bits of precision to floating-point and computed using floating point kernels. This conversion is done once and cached to reduce latency.

To further improve latency, hybrid operators dynamically quantize activations to 8 bits and perform computations with 8-bit weights and activations. This optimization provides latencies close to fully fixed-point inference. However, the outputs are still stored using floating point, so the speedup with hybrid ops is less than a full fixed-point computation.

Hybrid ops are available for the most compute-intensive operators in a network:


  • tf.contrib.layers.fully_connected
  • tf.nn.conv2d
  • tf.nn.embedding_lookup
  • BasicRNN
  • tf.nn.bidirectional_dynamic_rnn for BasicRNNCell type
  • tf.nn.dynamic_rnn for LSTM and BasicRNN Cell types

Since weights are quantized post-training, there could be an accuracy loss, particularly for smaller networks. Pre-trained fully quantized models are provided for specific networks in the TensorFlow Lite model repository. It is important to check the accuracy of the quantized model to verify that any degradation in accuracy is within acceptable limits. There is a tool to evaluate TensorFlow Lite model accuracy.
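
As a rough sketch of such an accuracy check (not the official evaluation tool), the code below loads the quantized model produced by the converter example above with the TF 1.x tf.contrib.lite.Interpreter and measures top-1 accuracy; the input shape and evaluation data are assumptions for illustration.

import numpy as np
import tensorflow as tf

# Placeholder evaluation set shaped like a 224x224 RGB image classifier input.
x_eval = np.random.rand(100, 224, 224, 3).astype(np.float32)
y_eval = np.random.randint(0, 1000, size=(100,))

# Load the quantized flatbuffer written by the converter example above.
interpreter = tf.contrib.lite.Interpreter(model_path="quantized_model.tflite")
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

correct = 0
for image, label in zip(x_eval, y_eval):
    # TFLite expects a batch dimension on the input.
    interpreter.set_tensor(input_index, image[np.newaxis, ...])
    interpreter.invoke()
    prediction = np.argmax(interpreter.get_tensor(output_index)[0])
    correct += int(prediction == label)

print("Top-1 accuracy:", correct / len(y_eval))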

If the accuracy drop is too high, consider using quantization-aware training.

Representation for quantized tensors

TensorFlow approaches the conversion of floating-point arrays of numbers into 8-bit representations as a compression problem.

Since the weights and activation tensors in trained neural network models tend to have values distributed across comparatively small ranges (for example, -15 to +15 for weights, or -500 to 1000 for image model activations), and since neural nets tend to be robust at handling noise, the error introduced by quantizing to a small set of values keeps the precision of the overall results within an acceptable threshold.

A chosen representation must perform fast calculations, especially the large matrix multiplications that make up the bulk of the computation while running a model.

 

The quantized representation stores two floats that record the overall minimum and maximum values, corresponding to the lowest and highest quantized values.

Each entry in the quantized array represents a float value in that range, distributed linearly between the minimum and maximum.

For example, with a minimum of -10.0, a maximum of 30.0, and an 8-bit array, the quantized values represent the following:

Quantized | Float
--- | ---
0 | -10.0
128 | 10.0
255 | 30.0

Table 2: Example quantized value range
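
To make the mapping in Table 2 concrete, here is a small sketch of the linear quantize/dequantize arithmetic; quantize and dequantize are illustrative helpers, not TensorFlow APIs.

# Linear min/max quantization over 8 bits (256 levels, indices 0..255).
def quantize(value, min_val=-10.0, max_val=30.0, levels=255):
    scale = (max_val - min_val) / levels
    return int(round((value - min_val) / scale))

def dequantize(q, min_val=-10.0, max_val=30.0, levels=255):
    scale = (max_val - min_val) / levels
    return min_val + q * scale

print(quantize(-10.0))  # 0
print(quantize(10.0))   # 128
print(quantize(30.0))   # 255
print(dequantize(128))  # ~10.08; the small error comes from 8-bit precision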

The advantages of this representation format are:

  • It efficiently represents an arbitrary magnitude of ranges.
  • The values don't have to be symmetrical.
  • The format represents both signed and unsigned values.
  • The linear spread makes multiplications straightforward.

 

 
