Contents
XLA: Optimizing Compiler for Machine Learning
XLA Shape
XLA Layout
XLA Service memory optimization
HLO instruction
HloOrdering
XLA aliasing
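XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can speed up TensorFlow models, potentially with no source code changes.
The input language to XLA is called "HLO IR", or just HLO (High Level Optimizer). The semantics of HLO are described on the Operation Semantics page. It is most convenient to think of HLO as a compiler IR.
XLA is also built on the LLVM framework. The frontend takes a Graph as input, but instead of lowering the Graph directly to LLVM IR, it first converts it to HLO IR, XLA's own intermediate representation, for which a series of optimization passes has been designed. The optimized HLO IR is then lowered to LLVM IR.
When a TensorFlow program is run, all of its operations are executed individually by the TensorFlow executor. Each TensorFlow operation has a precompiled GPU kernel implementation that the executor dispatches to.
XLA provides an alternative mode of running models: it compiles the TensorFlow graph into a sequence of computation kernels generated specifically for the given model. Because these kernels are unique to the model, they can exploit model-specific information for optimization. As an example, consider an optimization XLA performs in the context of a simple TensorFlow computation: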
import tensorflow as tf

def model_fn(x, y, z):
    return tf.reduce_sum(x + y * z)
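Run without XLA, the graph launches three kernels: one for the multiplication, one for the addition, and one for the reduction. XLA, however, can optimize the graph so that it computes the result in a single kernel launch. It does this by "fusing" the addition, multiplication, and reduction into a single GPU kernel. Moreover, this fused operation does not write the intermediate values produced by y*z and x+y*z out to memory; instead it "streams" the results of these intermediate computations directly to their consumers while keeping them entirely in GPU registers. Fusion is XLA's single most important optimization. Memory bandwidth is typically the scarcest resource on hardware accelerators, so removing memory operations is one of the best ways to improve performance.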
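The effect of fusion can be illustrated with a pure-Python analogy. This is only a sketch of the idea, not XLA output: the function names unfused and fused are hypothetical, Python lists stand in for device memory, and a local variable stands in for a GPU register.

```python
def unfused(x, y, z):
    # "Kernel" 1: elementwise multiply, materializes y*z as a full array.
    prod = [b * c for b, c in zip(y, z)]
    # "Kernel" 2: elementwise add, materializes x + y*z as a full array.
    total = [a + p for a, p in zip(x, prod)]
    # "Kernel" 3: reduction over the second intermediate array.
    return sum(total)


def fused(x, y, z):
    # One pass: each intermediate value exists only in a local variable
    # and is never stored as an array.
    acc = 0.0
    for a, b, c in zip(x, y, z):
        acc += a + b * c
    return acc


x, y, z = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]
print(unfused(x, y, z), fused(x, y, z))
```

Both functions compute the same value, but the fused version makes a single pass and allocates no intermediate arrays, which is the memory traffic that fusion removes.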
The XLA Shape proto (xla_data.proto) describes the rank, size, and data type of an N-dimensional array (array in short).
The rank of an array is equal to the number of dimensions. The true rank of an array is the number of dimensions which have a size greater than 1.
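The difference between rank and true rank can be sketched in a few lines. The helpers rank and true_rank below are hypothetical names that operate on a plain list of dimension sizes; they are not the XLA API.

```python
def rank(dims):
    # Rank is simply the number of dimensions.
    return len(dims)


def true_rank(dims):
    # True rank counts only the dimensions whose size is greater than 1.
    return sum(1 for size in dims if size > 1)


dims = [1, 5, 1, 3]
print(rank(dims), true_rank(dims))
```

For the sizes [1, 5, 1, 3] the rank is 4, while only two dimensions have a size greater than 1.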
Dimensions are numbered from 0 up to N-1 for an N dimensional array. The dimension numbers are arbitrary labels for convenience. The order of these dimension numbers does not imply a particular minor/major ordering in the layout of the shape. The layout is determined by the Layout proto.
By convention, dimensions are listed in increasing order of dimension number. For example, for a 3-dimensional array of size [A x B x C], dimension 0 has size A, dimension 1 has size B and dimension 2 has size C.
Some utilities in XLA also support negative indexing, similarly to Python; dimension -1 is the last dimension (equivalent to N-1 for an N dimensional array). For example, for the 3-dimensional array described above, dimension -1 has size C, dimension -2 has size B and so on.
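Negative dimension numbers can be mapped to their non-negative equivalents as follows. This is a minimal sketch; normalize_dim is a hypothetical helper, not a function from the XLA code base.

```python
def normalize_dim(dim, rank):
    # Accept dimension numbers in [-rank, rank) and map them to [0, rank).
    if not -rank <= dim < rank:
        raise ValueError(f"dimension {dim} out of range for rank {rank}")
    return dim + rank if dim < 0 else dim


sizes = ["A", "B", "C"]  # sizes of a 3-dimensional array [A x B x C]
print(sizes[normalize_dim(-1, 3)], sizes[normalize_dim(-2, 3)])
```

With rank 3, dimension -1 resolves to dimension 2 (size C) and dimension -2 to dimension 1 (size B).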
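Two, three, and four dimensional arrays often have specific letters associated with dimensions. For example, for a 2D array:
dimension 0: y
dimension 1: x
For a 3D array:
dimension 0: z
dimension 1: y
dimension 2: x
For a 4D array:
dimension 0: p
dimension 1: z
dimension 2: y
dimension 3: x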
Functions in the XLA API which take dimensions do so in increasing order of dimension number. This matches the ordering used when passing dimensions as an initializer_list; e.g.
ShapeUtil::MakeShape(F32, {A, B, C, D})
will create a shape whose dimension size array consists of the sequence [A, B, C, D].
The Layout proto describes how an array is represented in memory. The Layout proto includes the following fields:
message Layout {
  repeated int64 minor_to_major = 1;
  repeated int64 padded_dimensions = 2;
  optional PaddingValue padding_value = 3;
}
Minor-to-major dimension ordering
The only required field is minor_to_major. This field describes the minor-to-major ordering of the dimensions within a shape. Values in minor_to_major are an ordering of the dimensions of the array (0 to N-1 for an N dimensional array) with the first value being the most-minor dimension up to the last value which is the most-major dimension. The most-minor dimension is the dimension which changes most rapidly when stepping through the elements of the array laid out in linear memory.
For example, consider the following 2D array of size [2 x 3]:
a b c
d e f
Here dimension 0 is size 2, and dimension 1 is size 3. If the minor_to_major field in the layout is [0, 1] then dimension 0 is the most-minor dimension and dimension 1 is the most-major dimension. This corresponds to the following layout in linear memory:
a d b e c f
This minor-to-major dimension order of 0 up to N-1 is akin to column-major (at rank 2). Assuming a monotonic ordering of dimensions, another name we may use to refer to this layout in the code is simply "dim 0 is minor".
On the other hand, if the minor_to_major field in the layout is [1, 0] then the layout in linear memory is:
a b c d e f
A minor-to-major dimension order of N-1 down to 0 for an N dimensional array is akin to row-major (at rank 2). Assuming a monotonic ordering of dimensions, another name we may use to refer to this layout in the code is simply "dim 0 is major".
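The two layouts of the [2 x 3] example can be reproduced with a short script. This is a sketch under the assumption that the array is held as nested Python lists; linearize is a hypothetical helper, not part of XLA.

```python
from itertools import product


def linearize(array, dims, minor_to_major):
    """Return the elements of a nested-list array in linear-memory order."""
    major_to_minor = list(reversed(minor_to_major))
    out = []
    # itertools.product varies its last argument fastest, so listing the
    # dimensions major-to-minor makes the most-minor dimension change fastest.
    for layout_index in product(*(range(dims[d]) for d in major_to_minor)):
        index = [0] * len(dims)
        for d, i in zip(major_to_minor, layout_index):
            index[d] = i
        elem = array
        for i in index:
            elem = elem[i]
        out.append(elem)
    return out


array = [["a", "b", "c"], ["d", "e", "f"]]
print(" ".join(linearize(array, [2, 3], [0, 1])))  # dim 0 is minor
print(" ".join(linearize(array, [2, 3], [1, 0])))  # dim 0 is major
```

The first call yields the column-major-like order and the second the row-major-like order shown in the text.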
Default minor-to-major ordering
The default layout for newly created Shapes is "dimension order is major-to-minor" (akin to row-major at rank 2).
Padding
Padding is defined in the optional padded_dimensions and padding_value fields. The field padded_dimensions describes the sizes (widths) to which each dimension is padded. If present, the number of elements in padded_dimensions must equal the rank of the shape.
For example, given the [2 x 3] array defined above, if padded_dimensions is [3, 5] then dimension 0 is padded to a width of 3 and dimension 1 is padded to a width of 5. The layout in linear memory (assuming a padding value of 0 and column-major layout) is:
a d 0 b e 0 c f 0 0 0 0 0 0 0
This is equivalent to the layout of the following array with the same minor-to-major dimension order:
a b c 0 0
d e f 0 0
0 0 0 0 0
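The padded layout above can be checked with a small script. As before, this is only an illustrative sketch: padded_linearize is a hypothetical helper working on nested Python lists, and the padding value defaults to 0 as assumed in the example.

```python
from itertools import product


def padded_linearize(array, dims, padded_dims, minor_to_major, pad=0):
    """Lay out a nested-list array in linear memory, padding each dimension."""
    major_to_minor = list(reversed(minor_to_major))
    out = []
    # Walk the padded index space with the most-minor dimension fastest.
    for layout_index in product(*(range(padded_dims[d]) for d in major_to_minor)):
        index = [0] * len(dims)
        for d, i in zip(major_to_minor, layout_index):
            index[d] = i
        if all(i < size for i, size in zip(index, dims)):
            elem = array
            for i in index:
                elem = elem[i]
        else:
            # The index lies in the padding region of at least one dimension.
            elem = pad
        out.append(elem)
    return out


array = [["a", "b", "c"], ["d", "e", "f"]]
padded = padded_linearize(array, [2, 3], [3, 5], [0, 1])
print(" ".join(str(e) for e in padded))
```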
Indexing into arrays
The class IndexUtil in index_util.h provides utilities for converting between multidimensional indices and linear indices given a shape and layout. Multidimensional indices include an int64 index for each dimension. Linear indices are a single int64 value which indexes into the buffer holding the array. See shape_util.h and layout_util.h in the same directory for utilities that simplify creation and manipulation of shapes and layouts.
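The conversion between the two kinds of index follows directly from the minor-to-major order. The functions below are hypothetical Python analogues written for illustration (unpadded layouts only); they are not the IndexUtil API itself.

```python
def multidim_to_linear(dims, minor_to_major, index):
    linear = 0
    stride = 1
    # The most-minor dimension has stride 1; each later dimension's stride
    # is the product of the sizes of all dimensions more minor than it.
    for d in minor_to_major:
        linear += index[d] * stride
        stride *= dims[d]
    return linear


def linear_to_multidim(dims, minor_to_major, linear):
    index = [0] * len(dims)
    # Peel off one dimension at a time, starting with the most-minor one.
    for d in minor_to_major:
        index[d] = linear % dims[d]
        linear //= dims[d]
    return index


dims = [2, 3]
print(multidim_to_linear(dims, [0, 1], [1, 2]))  # element f, dim 0 is minor
print(multidim_to_linear(dims, [1, 0], [1, 0]))  # element d, dim 0 is major
print(linear_to_multidim(dims, [0, 1], 5))
```

In the [2 x 3] example, element f sits at index [1, 2] and is the last element of the "dim 0 is minor" layout, while element d at index [1, 0] is the fourth element of the "dim 0 is major" layout.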
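For a training framework, GPU memory optimization is critical, chiefly because the GPU + CUDA stack is still far less capable than the CPU + Linux combination. The latter has a mature memory-management mechanism built on virtual memory: the Linux kernel is responsible for using memory efficiently, and even when physical memory runs short, techniques such as swap and memory compression keep memory available. With GPU + CUDA, much of this work is left to the programmer. A GPU program deals directly with physical device memory, and if it requests more than the device's capacity, the whole program crashes outright with a core dump. In addition, the memory is integrated on the GPU board and cannot be expanded the way host memory can, and GPUs themselves are expensive. Finally, with brute-force scaling currently paying off in deep learning training, memory consumption is growing noticeably faster than Moore's law, which sharpens the tension between memory supply and demand. It is precisely because training frameworks do a great deal of optimization that models can run at all.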
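The memory optimization in XLA Service follows the same "static graph" design philosophy as TensorFlow as a whole: optimize globally first, then execute. xla/service/buffer_assignment.cc is the core of the memory optimization. In version 1.14, xla/service/ supports two backends, cpu and gpu, and each has further backend-specific optimization algorithms.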
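Core files
Common memory optimization:
xla/service/buffer_assignment  Core memory optimization file
xla/service/buffer_liveness.cc  Liveness (lifetime) analysis of buffers
GPU-specific:
xla/service/gpu/buffer_allocations.cc  Composition of BufferAllocations
xla/service/gpu/gpu_hlo_schedule.cc  Processing order of the HLO instructions, which is closely tied to the memory optimization strategy. Put simply, HloInstructions executed in parallel in BFS order will generally consume more GPU memory than the same HloInstructions executed strictly sequentially.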
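At its core, XLA Service memory optimization manages the relationship between LogicalBuffers and BufferAllocations. The principle is to use as few BufferAllocations as possible to hold as many LogicalBuffers as possible. Using fewer of them requires analyzing the HLO graph, which raises the question of ordering: the strategy used to generate an ordering directly affects the constraints between any two LogicalBuffers. In the simplest case, traversing the graph with DFS versus BFS yields very different memory dependencies among its nodes.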
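The idea of packing many LogicalBuffers into few BufferAllocations can be sketched with a greedy interval-based assignment. This is a deliberately simplified illustration and not the algorithm in buffer_assignment.cc: the function assign_buffers, the tuple format, and the assumption of a single fixed sequential instruction order are all hypothetical.

```python
def assign_buffers(logical_buffers):
    """Greedily pack logical buffers into as few allocations as possible.

    Each logical buffer is (name, size, first_use, last_use), where the uses
    are positions in one fixed sequential instruction order.
    """
    allocations = []  # each entry: {"size": int, "buffers": [(name, first, last)]}
    for name, size, first, last in sorted(logical_buffers, key=lambda b: b[2]):
        for alloc in allocations:
            # Two buffers interfere if their live ranges overlap.
            overlaps = any(first <= l and f <= last for _, f, l in alloc["buffers"])
            if not overlaps:
                alloc["size"] = max(alloc["size"], size)
                alloc["buffers"].append((name, first, last))
                break
        else:
            # No existing allocation is free for this live range.
            allocations.append({"size": size, "buffers": [(name, first, last)]})
    return allocations


buffers = [("a", 64, 0, 1), ("b", 64, 1, 2), ("c", 32, 2, 3), ("d", 64, 3, 4)]
result = assign_buffers(buffers)
for alloc in result:
    print(alloc["size"], [name for name, _, _ in alloc["buffers"]])
```

In this example four logical buffers totalling 224 bytes fit into two allocations of 64 bytes each, because buffers whose live ranges do not overlap can share storage. A different instruction order would produce different live ranges and therefore a different packing.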
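HLO instructions are the ops; they correspond to the operation semantics listed on the official site. The source comment quoted below already explains them very clearly. Op fusion and lowering to LLVM IR both happen at this level.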
HLO instructions are the atomic unit of the high-level compiler's IR. HloInstructions live inside of an HloComputation, which is analogous to a function in other programming languages. Nodes have no total order within their computation. Instead, they have a partial ordering determined by their data and control dependencies. HLO does not have basic blocks or explicit "branch" instructions. Instead, certain HloInstructions -- namely, kWhile, kConditional, and kCall -- encode control flow. For example, the kConditional HLO executes one of two possible computations, depending on the runtime value of a predicate. HLO is pure (mostly). It has no concept of mutable state. Instead, data values are produced by one HLO and flow into consumers across dependency edges.
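HloOrdering is the base class that describes the order in which HloInstructions are executed. Its derived classes are PredecessorHloOrdering, DependencyHloOrdering, and SequentialHloOrdering. DependencyHloOrdering is based on dependencies, so instructions can run in parallel: performance is higher, but more memory is consumed. SequentialHloOrdering is fully serial: performance is relatively lower, but more memory is saved. PredecessorHloOrdering is an abstract class whose subclasses must fill in predecessors_; this is the approach the GPU backend uses. Different orderings change the memory dependencies and, in turn, the sequence in which kernels execute once launched on the GPU.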
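The difference between a dependency-based and a sequential ordering can be sketched with a toy graph. The functions below are hypothetical simplifications written for illustration; they do not reproduce the HloOrdering class hierarchy or its API.

```python
def reachable(users, src, dst):
    """True if dst can be reached from src along operand -> user edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(users.get(node, []))
    return False


def executes_before_dependency(users, a, b):
    # Partial order: a precedes b only if b transitively depends on a.
    return a != b and reachable(users, a, b)


def executes_before_sequential(sequence, a, b):
    # Total order: a precedes b if it appears earlier in the schedule.
    return sequence.index(a) < sequence.index(b)


# p feeds two independent instructions x and y, both consumed by r.
users = {"p": ["x", "y"], "x": ["r"], "y": ["r"], "r": []}
sequence = ["p", "x", "y", "r"]

print(executes_before_dependency(users, "x", "y"),
      executes_before_dependency(users, "y", "x"))
print(executes_before_sequential(sequence, "x", "y"))
```

Under the dependency-based order, x and y are unordered, so they may run concurrently and their buffers must be assumed live at the same time. Under the sequential order, x is known to run before y, which gives a buffer assigner more freedom to reuse memory.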
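When building an XLA program, you can specify the desired aliasing between input and output buffers.
Defining aliasing at compile time
For example, consider a trivial HLO module that simply adds 1 to its input: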
HloModule increment

ENTRY entry {
  %p = f32[] parameter(0)
  %c = f32[] constant(1)
  ROOT %out = f32[] add(%p, %c)
}
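This module will allocate two 4-byte buffers: one for the input %p and one for the output %out.
However, it is often desirable to perform the update in place (for example, when the input variable is no longer live after the computation in the frontend that generates the expression, as in the increment p++).
To perform such an update efficiently, you can specify the input aliasing: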
HloModule increment, input_output_alias={ {}: 0 }

ENTRY entry {
  %p = f32[] parameter(0)
  %c = f32[] constant(1)
  ROOT %out = f32[] add(%p, %c)
}
The format specifies that the entire output (marked by {}) is aliased to input parameter 0.
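The effect of aliasing the output to the input can be mimicked in plain Python. This is only an analogy under stated assumptions: the function increment and its alias_output_to_input flag are hypothetical, and a one-element array of C floats stands in for the f32[] buffer.

```python
from array import array


def increment(p, alias_output_to_input=False):
    """Add 1 to a scalar held in a one-element float buffer."""
    # Without aliasing, a separate output buffer is allocated.
    # With aliasing, the result is written back into the input buffer.
    out = p if alias_output_to_input else array("f", [0.0])
    out[0] = p[0] + 1.0
    return out


p = array("f", [41.0])

fresh = increment(p)
print(fresh is p, p[0], fresh[0])      # separate buffer, input unchanged

aliased = increment(p, alias_output_to_input=True)
print(aliased is p, p[0])              # same buffer, updated in place
```

Without aliasing, the input and output occupy two buffers; with aliasing, only one buffer exists and the input value is overwritten, which is why this is appropriate only when the input is no longer needed afterwards.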
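In TensorFlow clusters compiled with XLA, all resource variable updates are aliased at compile time (whether aliasing actually happens at runtime depends on whether anything else holds a reference to the resource variable tensor).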