On-Device Neural Net Inference with Mobile GPUs


Juhyun Lee, Nikolay Chirkov, Ekaterina Ignasheva, Yury Pisarchyk, Mogan Shieh, Fabio Riccardi, Raman Sarokin, Andrei Kulik, Matthias Grundmann

Google Research, 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA


Abstract

On-device inference of machine learning models for mobile phones is desirable due to its lower latency and increased privacy. Running such a compute-intensive task solely on the mobile CPU, however, can be difficult due to limited computing power, thermal constraints, and energy consumption. App developers and researchers have begun exploiting hardware accelerators to overcome these challenges. Recently, device manufacturers are adding neural processing units into high-end phones for on-device inference, but these account for only a small fraction of hand-held devices. In this paper, we present how we leverage the mobile GPU, a ubiquitous hardware accelerator on virtually every phone, to run inference of deep neural networks in real-time for both Android and iOS devices. By describing our architecture, we also discuss how to design networks that are mobile GPU-friendly. Our state-of-the-art mobile GPU inference engine is integrated into the open-source project TensorFlow Lite and publicly available at https://www.tensorflow.org/lite.


1 Introduction

On-device machine learning (ML) offers a variety of benefits. The most apparent is the improved inference latency: By skipping the data upload to the server and wait-time for the inference result, the app can respond more quickly to the user’s request. Removing the server dependency has additional benefits, such as:

  • Removing the need to maintain inference servers,
  • Running with limited or no connectivity, and
  • Reducing privacy concerns as the user data remains on the device.

However, on-device ML is not trivial. Despite both recent advances in mobile hardware technology and efforts to efficiently run deep networks on mobile devices, mobile CPUs continue to be less powerful than those found in servers. Running deep net inference on a mobile device means adding a significant compute-intensive task to the CPU, which competes with existing logic. Fully utilizing the mobile CPU comes with additional unwanted costs, e.g. increased energy consumption leads to shorter battery life, and an increase in the phone’s thermal profile causes throttling, resulting in slower computation.

Dynamic frequency scaling (CPU throttling) is a power management technique in computer architecture whereby the frequency of a microprocessor can be automatically adjusted “on the fly” depending on the actual needs, to conserve power and reduce the amount of heat generated by the chip.


Hardware accelerators such as digital signal processors offer solutions to overcome these challenges. The demand for on-device ML has led to a recent trend of phone manufacturers integrating dedicated neural processing units (NPUs) into high-end next-generation phones, which account for only a small fraction of the current distribution of mobile devices.

Our primary goal is a fast inference engine with wide coverage for TensorFlow Lite (TFLite) [8]. By leveraging the mobile GPU, a ubiquitous hardware accelerator on virtually every phone, we can achieve real-time performance for various deep network models. Table 1 demonstrates that the GPU has significantly more compute power than the CPU.

Table 1: Example of available compute power on mobile in gigaflops (billion floating point operations per second). FP16 and FP32 refer to 16- and 32-bit floating point arithmetic, respectively.

This paper presents the techniques we adopt for TFLite GPU and how we achieve an average acceleration of 2-9x for various deep networks on GPU compared to CPU inference. We first describe the general mobile GPU architecture and GPU programming, followed by how we materialize this with Compute Shaders for Android devices with OpenGL ES 3.1+ [16], and Metal Shaders for iOS devices with iOS 9+ [1].


2 Related Work

Various research efforts from both academia and industry endeavor to bring deep neural network inference, previously limited to servers, to mobile devices. Those efforts can be roughly categorized into three strategies:

  • Network architecture-driven,
  • Hardware-driven, and
  • ML framework-driven.



Neural network researchers have focused on optimizing their network architectures explicitly for processing on-device in various domains such as image classification [10, 21], object localization [11], and image enhancements [13, 14]. Many of these techniques involve reducing the model size by re-designing the network architecture and adding pre-/post-training quantization of weights. With these, one can achieve faster computation and smaller memory footprint, leading to reduced inference latency at the cost of slightly degraded model accuracy. MorphNet [9] takes a unique path of reducing the number of floating point operations per second which is optimized during training of the model. Our work is complementary to these efforts and instead focuses on optimizing the inference engine that runs the neural network rather than the model or training.

Major hardware manufacturers have made architectural changes responding to demands for faster mobile inference, and are publishing software development kits (SDKs) to expose them: Arm Compute Library [4], Huawei HiAI SDK [12], MediaTek NeuroPilot SDK [17], and Qualcomm SNPE SDK [20]. These libraries are vendor-specific and either cannot be re-used on a different architecture or do not guarantee the expected performance boost on other platforms. Our work does not add new hardware or SDKs. Instead, we use well-established hardware, the mobile GPU, and well-supported graphics and compute standards such as OpenGL [16] and Metal [1], to achieve high-performance neural network inference.


Apple presented the Metal Performance Shaders with support of convolutional neural networks [3] accelerated by GPU. This is a solution built on top of the Metal API and allows custom operations. Our approach is analogous to Apple’s on iOS devices. Apple also released CoreML [2], an end-to-end solution for inference on mobile devices using CPU, GPU, and NPU, if available.

Android introduced the Android Neural Networks API [7], which serves as a layer between hardware and higher-level ML frameworks and which vendors must implement for Android 8.1 or later. Our work has wider coverage and does not depend on a specific Android version, nor does it require vendors to implement individual APIs for deep network processing.


Some of the latest mobile-friendly ML frameworks are:

  • Caffe2 [6] which focuses on CPU inference and uses Arm Compute Library for Arm Mali GPUs.
  • MACE [24] which employs OpenCL, which is not a part of the standard Android OS.

TFLite GPU leverages the mobile GPU with OpenGL ES for Android devices and Metal for iOS devices. The specific version requirements are OpenGL ES 3.1+, available on more than 52% of all Android devices [23], and iOS 9+. One of our biggest strengths is that our framework employs open standards, i.e. it is not limited to a specific hardware vendor, and thus covers a wide range of devices.


3 General Architecture

This section explains the general architecture of TFLite GPU, consisting of an initialization phase followed by a model inference phase. The techniques in this section are independent of the architecture of the underlying GPU.

3.1 Initialization

TFLite provides APIs for the delegation of the execution of neural network sub-graphs to another library. We exploit this feature to integrate the GPU backend into TFLite. Given a neural net model, TFLite first checks whether it can execute all the operators in the model with our GPU delegate. Our GPU backend identifies supported operators, and TFLite then partitions the graph into several sub-graphs, substituting the sub-graphs with virtual “delegate nodes”. From that point, the GPU backend is responsible for executing this sub-graph, as depicted in Figure 1. Unsupported operators are by default computed by the CPU. Ideally, the whole graph would be compatible with our mobile GPU backend for maximum performance.


Figure 1: TFLite’s delegate mechanism: Operations supported by the GPU delegate will run on the GPU, and the rest on the CPU.
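
To make the delegation concrete, below is a minimal sketch of enabling the GPU backend from C++ through TFLite’s delegate API. This is a sketch, not the engine’s internals; the V2 API names come from later TensorFlow Lite releases and may differ across versions, and error handling is omitted.

```cpp
#include <memory>

#include "tensorflow/lite/delegates/gpu/delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Build an interpreter and hand all supported sub-graphs to the GPU
// delegate; unsupported operators automatically fall back to the CPU.
std::unique_ptr<tflite::Interpreter> BuildGpuInterpreter(
    const tflite::FlatBufferModel& model) {
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(model, resolver)(&interpreter);

  // TFLite replaces supported sub-graphs with virtual "delegate nodes"
  // that the GPU backend executes.
  TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
  TfLiteDelegate* delegate = TfLiteGpuDelegateV2Create(&options);
  interpreter->ModifyGraphWithDelegate(delegate);
  // Note: the delegate must outlive the interpreter; release it with
  // TfLiteGpuDelegateV2Delete() after the interpreter is destroyed.
  return interpreter;
}
```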

As our mobile GPU inference engine is primarily designed for high-performance execution, we first inspect the model and resolve obvious inefficiencies. For example:

  • Merging pad as an option of another op where it was previously described separately.
  • Removing superfluous identity operations, e.g. resize with scale one or single input add/concat.

While these inefficiencies might be caught by the architect, artifacts such as these crop up inevitably, and we should still optimize these whenever possible.
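
As an illustration, detecting such identity operations can be as simple as the following sketch; the IR types and fields here are hypothetical, not TFLite GPU’s actual graph representation.

```cpp
#include <vector>

// Hypothetical minimal IR node: an op type, a resize scale, and the ids
// of the nodes producing its inputs.
struct Node {
  enum class Op { kAdd, kConcat, kResize, kConv2d, kOther } op;
  float scale = 1.0f;       // only meaningful for kResize
  std::vector<int> inputs;  // producer node ids
};

// An op is an identity if it provably forwards its single input
// unchanged: a resize with scale 1, or an add/concat with one input.
bool IsIdentity(const Node& n) {
  switch (n.op) {
    case Node::Op::kResize:
      return n.inputs.size() == 1 && n.scale == 1.0f;
    case Node::Op::kAdd:
    case Node::Op::kConcat:
      return n.inputs.size() == 1;
    default:
      return false;
  }
}
```

A cleanup pass then rewires every consumer of an identity node to that node’s single input and drops the node from the graph.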


Note that, in contrast to CPU backends, which work without initialization, GPU backends require initialization involving shader compilation and optimization by the driver before inference. The cost of this process depends on network size and may take from a few milliseconds to seconds, but it is incurred once and not again for subsequent runs until the cache is invalidated for one of the following reasons: the application is updated or re-installed, the device is rebooted, the cache memory runs out, or other OS-specific reasons apply.


3.2 Running Inference

The inference phase is fairly straightforward. The input tensors are reshaped to the PHWC4 format, detailed later in Section 4, if their channel size is not equal to 4. For each operator, shader programs are linked by binding resources such as the operator’s input/output tensors, weights, etc., and dispatched, i.e. inserted into the command queue. The GPU driver then takes care of scheduling and executing all shader programs in the queue, and makes the result available to the CPU via CPU/GPU synchronization. There might be a final conversion from PHWC4 to HWC if the output tensor has a channel size not equal to 4.
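At the API level, dispatching an operator looks roughly like the following OpenGL ES 3.1 sketch; the binding indices and buffer roles are illustrative assumptions, not TFLite GPU’s actual internals.

```cpp
#include <GLES3/gl31.h>

// Enqueue one operator's compute shader. `program` is a compiled and
// linked compute program; `input`, `weights`, and `output` are SSBOs
// holding PHWC4 data. Binding indices are hypothetical.
void DispatchOp(GLuint program, GLuint input, GLuint weights, GLuint output,
                GLuint groups_x, GLuint groups_y, GLuint groups_z) {
  glUseProgram(program);
  glBindBufferBase(GL_SHADER_STORAGE_BUFFER, /*index=*/0, input);
  glBindBufferBase(GL_SHADER_STORAGE_BUFFER, /*index=*/1, weights);
  glBindBufferBase(GL_SHADER_STORAGE_BUFFER, /*index=*/2, output);
  // Inserts the program into the command queue; the driver schedules it
  // without synchronizing with the CPU.
  glDispatchCompute(groups_x, groups_y, groups_z);
  // Make this op's writes visible to the next shader that reads them.
  glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}
```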


Figure 2: Example of PHWC4 memory layout (best viewed in color). A tensor of shape $(H{=}8, W{=}6, C{=}12)$ is split into 4-element slices of size $(H, W, 4)$ which are stored sequentially as a continuous 2D array of size $(HC/4{=}24, 4W{=}24)$.

H * (C / 4) = HC/4 = 8 * (12 / 4) = 24
4 * W = 4W = 4 * 6 = 24

For maximum performance, one should avoid CPU/GPU synchronization at all costs and, preferably, never leave the GPU context if real-time processing is needed. The most ideal scenario would be the following: the camera provides an RGBA texture that goes directly to TFLite GPU, and the output of the network is then rendered directly to the screen.


Shader Program Optimization
In the GPU inference engine, operators exist in the form of shader programs. The shader programs eventually get compiled and inserted into the command queue and the GPU executes programs from this queue without synchronization with the CPU.

To reduce the number of shader programs in the command queue, we consolidate them into meaningful aggregates while maximizing parallelism and well-defined data dependencies.

The following techniques are employed when generating the source code for the shader programs:

  • Fusing element-wise operators with computationally expensive operators, e.g. activations with convolution, to reduce the number of shader programs.
  • In-lining parameters and small objects directly into the shader program to reduce memory I/O overhead.
  • Baking uniforms into the source code, instead of passing them in the run-time, allowing drivers to produce more optimal code.
  • Creating specialized versions of shaders, e.g. convolution specialized for a certain kernel size, to manually optimize shaders for particular cases.
  • Implementing specializations of shader programs optimized for a certain GPU architecture to improve the op’s performance in that environment.

In computing, inline expansion, or inlining, is a manual or compiler optimization that replaces a function call site with the body of the called function. Inlining eliminates function-call overhead (pushing to the stack, saving and restoring registers) at the cost of larger generated code, which can lower the instruction cache hit rate.
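
As an illustration of baking uniforms into the source and fusing a cheap element-wise op into a neighboring one, here is a hypothetical generator for a fused “bias add + ReLU” compute shader. It is a minimal sketch under assumed naming, not TFLite GPU’s actual code generator.

```cpp
#include <string>

// Generates GLSL for a fused "bias add + ReLU" over a PHWC4 tensor.
// The pixel count and slice count are baked into the source as literals
// instead of run-time uniforms, letting the driver unroll and optimize.
// Hypothetical sketch, not TFLite GPU's actual generator.
std::string GenerateBiasReluSource(int h, int w, int channels) {
  const int slices = channels / 4;  // assume channels divisible by 4
  std::string src;
  src += "#version 310 es\n";
  src += "layout(local_size_x = 64) in;\n";
  src += "layout(std430, binding = 0) readonly buffer B0 { vec4 data[]; } input0;\n";
  src += "layout(std430, binding = 1) readonly buffer B1 { vec4 data[]; } bias0;\n";
  src += "layout(std430, binding = 2) writeonly buffer B2 { vec4 data[]; } output0;\n";
  // Baked constants: tensor geometry appears as literals in the source.
  src += "const int kPixels = " + std::to_string(h * w) + ";\n";
  src += "const int kSlices = " + std::to_string(slices) + ";\n";
  src += "void main() {\n";
  src += "  int pixel = int(gl_GlobalInvocationID.x);\n";
  src += "  if (pixel >= kPixels) return;\n";
  src += "  for (int s = 0; s < kSlices; ++s) {\n";
  // PHWC4: slice s of a given pixel lives at s * kPixels + pixel.
  src += "    vec4 v = input0.data[s * kPixels + pixel] + bias0.data[s];\n";
  src += "    output0.data[s * kPixels + pixel] = max(v, vec4(0.0));\n";
  src += "  }\n";
  src += "}\n";
  return src;
}
```

Because kSlices is a compile-time constant in the generated source, the driver can fully unroll the loop, which it could not do if the slice count arrived as a run-time uniform.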


After the source code for each program is generated, each shader gets compiled. This compilation step can take a while, from several milliseconds to seconds. Typically, app developers can hide this latency while loading the model or starting the app for the first time. Once all shader programs are compiled, the GPU backend is ready for inference.

4 Data Layout

Most modern GPUs use a homogeneous coordinate [18] system which represents points in space with coordinates $(x, y, z, w)$. A homogeneous coordinate $(x, y, z, w)$, where $w \neq 0$, represents a point $(x/w, y/w, z/w, 1)$ in a 3D space. This allows affine transformations and projective transformations to be represented in the form of 4D matrix multiplications. GPUs are essentially processors optimized for 4-element vector compute and load/store operations.
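
For example, a translation by $(t_x, t_y, t_z)$, which is not a linear map of 3D points, becomes a single 4D matrix multiplication in homogeneous coordinates (with $w{=}1$ the result is $(x{+}t_x, y{+}t_y, z{+}t_z, 1)$):

```latex
\begin{pmatrix} x + t_x w \\ y + t_y w \\ z + t_z w \\ w \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & t_x \\
0 & 1 & 0 & t_y \\
0 & 0 & 1 & t_z \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ w \end{pmatrix}
```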

While TFLite does not restrict tensors to a certain shape, many operators assume 4D input/output tensors shaped as $[B, H, W, C]$, where $B$, $H$, $W$, $C$ respectively represent batch size, height, width, and channel size. For convenience, the rest of the paper will mostly describe tensors assuming a batch size of 1, or $[H, W, C]$ for short. This simplified example can be generalized if we consider batches to be a concatenation of multiple tensors.

In TFLite GPU, a tensor is split into 4-channel slices which are stored sequentially in memory. If the number of channels is not divisible by 4, it is padded with zeroes. This memory layout, called PHWC4 (Figure 2), optimally reduces cache misses in the graphics architecture. This is tightly coupled with how compute threads are executed on the GPU, which defines the order of computation and, more importantly, the order of memory load instructions.
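
To make the layout concrete, here is a minimal sketch of an HWC-to-PHWC4 conversion mirroring Figure 2; this is a hypothetical helper, not TFLite GPU’s actual implementation.

```cpp
#include <vector>

// Convert a dense HWC float tensor into PHWC4: channels are grouped into
// slices of 4, each slice stored as a contiguous (H, W, 4) plane, and the
// last slice zero-padded when C is not divisible by 4. Hypothetical helper.
std::vector<float> HwcToPhwc4(const std::vector<float>& hwc,
                              int h, int w, int c) {
  const int slices = (c + 3) / 4;  // ceil(C / 4)
  std::vector<float> phwc4(static_cast<size_t>(slices) * h * w * 4, 0.0f);
  for (int s = 0; s < slices; ++s) {
    for (int y = 0; y < h; ++y) {
      for (int x = 0; x < w; ++x) {
        for (int i = 0; i < 4; ++i) {
          const int ch = s * 4 + i;
          if (ch < c) {  // positions past C stay zero (padding)
            phwc4[(((s * h + y) * w) + x) * 4 + i] =
                hwc[((y * w) + x) * c + ch];
          }
        }
      }
    }
  }
  return phwc4;
}
```

With $H{=}8$, $W{=}6$, $C{=}12$ as in Figure 2, this produces three $(8, 6, 4)$ slices laid out back to back.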

References

https://arxiv.org/abs/1907.01989
https://ar5iv.org/abs/1907.01989
