Relay workflow;
Pass functionality;
Get the unit test tests/python/relay/test_pass_inline.py to pass and understand how the pass works;
Can inlining be used to generate a single call func for a whole network? That approach would support the current BANG C optimization and code generation well (a minimal sketch of the Inline pass follows this list).
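Below is a minimal, hedged sketch of how the Inline pass can be exercised from Python. It is an illustration rather than the actual unit test: the toy module, the function names, and the "Inline" attribute value are assumptions modeled on tests/python/relay/test_pass_inline.py.

import tvm
from tvm import relay

# Build a toy module with a helper function marked for inlining.
x = relay.var("x", shape=(2, 2), dtype="float32")
helper = relay.Function([x], relay.nn.relu(x))
helper = helper.with_attr("Inline", tvm.tir.IntImm("int32", 1))  # mark as inlinable

mod = tvm.IRModule()
mod["helper"] = helper
y = relay.var("y", shape=(2, 2), dtype="float32")
mod["main"] = relay.Function([y], mod.get_global_var("helper")(y))

# Run the Inline pass: calls to "helper" are replaced by its body inside main.
mod = relay.transform.Inline()(mod)
print(mod["main"])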
Relay workflow
1 Import the model from a framework such as TensorFlow, PyTorch, or ONNX. The importer layer is where TVM ingests models from other frameworks (e.g., TensorFlow, PyTorch, or ONNX). The level of support TVM provides for each frontend varies as the open-source project keeps improving. If you run into problems importing a model into TVM, try converting it to ONNX first.
2 Translate to Relay, TVM's high-level model language. A model imported into TVM is represented in Relay, a functional language and intermediate representation (IR) for neural networks. Relay applies graph-level optimizations to the model.
3 Lower to the Tensor Expression (TE) representation. Lowering means transforming a higher-level representation into a lower-level one. After the high-level optimizations, Relay runs the FuseOps pass to partition the model into many small subgraphs and lowers each subgraph to a TE representation. Tensor Expression (TE) is a domain-specific language for describing tensor computations. TE also provides scheduling primitives for specifying low-level loop optimizations such as tiling, vectorization, parallelization, unrolling, and fusion (a small TE sketch follows this list). To help convert Relay into TE, TVM includes the Tensor Operator Inventory (TOPI), which has predefined templates for common tensor operators (e.g., conv2d, transpose).
4 Search for the best schedule using the auto-tuning modules AutoTVM or AutoScheduler. A schedule specifies the low-level loop optimizations for an operator or subgraph defined in TE. The auto-tuning modules search for the best schedule, comparing candidates against a cost model and measurements on the target device.
5 Choose the optimal configuration for model compilation. After tuning, the auto-tuning module produces tuning records in JSON format. This step picks the best schedule for each subgraph.
6 Lower to Tensor Intermediate Representation (TIR), TVM's low-level intermediate representation. After the best configurations are chosen in the tuning step, each TE subgraph is lowered to TIR and optimized by low-level optimization passes. The optimized TIR is then lowered to the target compiler of the hardware platform. This is the final code-generation phase that produces an optimized model which can be deployed to production. TVM supports several compiler backends, including LLVM (covering standard CPU targets as well as NVPTX and AMDGPU), specialized compilers such as NVCC, and embedded or accelerator targets through the Bring Your Own Codegen (BYOC) framework.
7 Compile to machine code. At the end of this process, the compiler-specific generated code is lowered to machine code.
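As a rough illustration of steps 3 and 4, here is a small TE definition with a hand-written schedule, assuming a TVM build where the classic TE schedule API is available. The shapes, split factors, and primitive choices are arbitrary; they only sketch how scheduling primitives are applied.

import tvm
from tvm import te

n = 1024
A = te.placeholder((n, n), name="A")
B = te.placeholder((n, n), name="B")
C = te.compute((n, n), lambda i, j: A[i, j] + B[i, j], name="C")

s = te.create_schedule(C.op)
io, ii = s[C].split(C.op.axis[0], factor=32)   # tile the outer loop
jo, ji = s[C].split(C.op.axis[1], factor=8)
s[C].parallel(io)                              # low-level loop optimizations
s[C].vectorize(ji)

f = tvm.build(s, [A, B, C], target="llvm")     # lower the schedule to machine code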
The following lines of code illustrate TVM's compilation flow. The flow includes not only the optimization passes over the Relay IR that remove redundant operators (the passes), but also code generation (codegen) that compiles the Relay program into code executable on a specific backend (here, LLVM).
import tvm
from tvm import relay

# mod and params are assumed to come from a frontend importer,
# e.g. relay.frontend.from_onnx(...)
target = "llvm"
target_host = "llvm"
dev = tvm.cpu(0)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, target_host=target_host, params=params)
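As a follow-up sketch (not part of the original snippet), the compiled library can be executed with the graph executor. The input name "data" and its shape are assumptions that depend on the imported model.

import numpy as np
from tvm.contrib import graph_executor

m = graph_executor.GraphModule(lib["default"](dev))
m.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))  # name/shape assumed
m.run()
out = m.get_output(0).numpy()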
Relay's position in the TVM software stack:
Relay’s functional, statically typed intermediate representation (IR) unifies and generalizes existing DL IRs to express state-of-the-art models.
TVM can compile models into linkable object modules, which can then be run with the lightweight TVM runtime. The runtime provides a C API to dynamically load the model, plus entry points for other languages such as Python and Rust. TVM can also build a bundled deployment in which the runtime is combined with the model in a single package.
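A short sketch of that deployment path, continuing from the lib produced above; the file name is a placeholder.

import tvm
from tvm.contrib import graph_executor

lib.export_library("compiled_model.so")            # ahead-of-time artifact on disk

loaded = tvm.runtime.load_module("compiled_model.so")
dev = tvm.cpu(0)
m = graph_executor.GraphModule(loaded["default"](dev))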
Introduction to Relay
Relay's main design goals are expressivity, composability, and portability, the three challenges discussed in the paper notes below.
Pass
A Pass in TVM is one of a series of optimizations performed on the Relay IR, similar to the onnxoptimizer used by onnx-simplifier: it simplifies the computation graph, removes redundant operators, and improves inference efficiency. TVM abstracts all passes in tvm/include/tvm/ir/transform.h, which mainly defines PassContext, PassInfo, Pass, and Sequential.
PassContext here is the C++ implementation behind the Python interface used above; it holds the parameters a pass run depends on, such as the optimization level, other required passes, and any passes explicitly disabled. PassInfo records metadata about a pass, including its opt_level, its name, and the passes it requires to run first. The Pass class is the body that actually executes a pass; it is a base class, and the concrete C++ implementation of each pass lives in tvm/src/relay/transforms and inherits from it. Finally, Sequential is a container that holds a sequence of passes.
Note that not every pass is defined under tvm/src/relay/transforms; for example, the first example below lives in the tvm/src/relay/backend/vm folder.
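A minimal sketch of how these pieces are used from Python: composing several standard Relay passes with Sequential and running them under a PassContext. The particular passes and options are illustrative, and mod is assumed to be an existing Relay IRModule.

import tvm
from tvm import relay

seq = tvm.transform.Sequential([
    relay.transform.SimplifyInference(),
    relay.transform.FoldConstant(),
    relay.transform.FuseOps(fuse_opt_level=2),
])

# PassContext controls the opt_level, disabled passes, etc. for the passes above.
with tvm.transform.PassContext(opt_level=3, disabled_pass=["AlterOpLayout"]):
    mod = seq(mod)   # mod: an existing Relay IRModule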
Paper reading:
Relay: A High-Level Compiler for Deep Learning
The scenario above highlights the three-pronged extensibility challenge for DL IRs:
1. Expressivity: it should be straightforward to write models involving control flow, first-class functions, and data structures (e.g., trees, graphs, and lists).
2. Composability: it should be straightforward to add new optimizations and compose them with existing ones (e.g., quantization, operator fusion, and partial evaluation).
3. Portability: it should be straightforward to add new hardware targets (e.g., TPU, Inferentia).
Previous IRs have struggled to address these challenges, treating each component of the framework as a disconnected set of programming tasks. Operators are defined in low-level
languages like C++, connected by a dataflow graph, and then scripted in a host language like Python. Consequently, program analyses cannot cross language boundaries between
components, inhibiting optimization and deployment. Learning from previous IRs, we have designed Relay, which features a principled approach to addressing extensibility and
improves expressivity, composability, and portability over previous frameworks.
We make the following contributions:
• The Relay IR, a tensor-oriented, statically typed functional IR, which we describe in Section 3. Relay’s design is motivated by the insight that functional IRs, used by languages
from the ML family, can be readily adapted to support DL. With its expressive semantics, including control flow, data structures, and first-class functions, Relay can represent entire state-of-the-art models.
• The insight that common features in ML frameworks, such as quantization and shape inference, can be reframed as standard compiler passes. By using this reframing we can
tap into decades of traditional compilers research to design composable optimization passes.
• A platform-agnostic representation of operators and domain specific optimizations which work in concert to provide portability across hardware backends.
We evaluate Relay on several systems and over a diverse set of vision and NLP workloads to demonstrate that (1) Relay enables expressive programs via a large breadth of models, (2) Relay supports composition of program-level optimizations such as quantization and fusion, and (3) Relay provides portability by targeting a number of hardware backends. Not only does Relay provide these three properties, we do so while also demonstrating competitive performance. Relay is an open-source academic project. It has been deployed at a popular web service provider, a telecommunications and consumer electronics manufacturer, and a social media company, among others.
2.1. Deep Learning Frameworks
In the early days of deep learning, practitioners and researchers would program in general-purpose languages like Python utilizing scientific computing libraries like NumPy, which provide low-level operators such as matrix multiplication. In order to accelerate model execution, frameworks supporting accelerators such as GPU were introduced. Early frameworks represented models as directed "computation graphs", where each node represents an operator, and each edge represents the flow of data from one operator to another. Computation graphs provide a limited programming model, enabling straightforward mapping of operators onto GPUs. Large technology companies, such as Google, Facebook, and Amazon, drive the development of frameworks, and consequently, each company has its own stack consisting of the core framework (TensorFlow, PyTorch, MxNet), compilers (XLA, Glow, TVM), and hardware accelerators (TPU, GraphCore, Inferentia). Frameworks can be roughly categorized into those which support static computation graphs and those which support dynamic computation graphs. Frameworks which use static graphs are said to be define-and-run frameworks, whereas frameworks which use dynamic graphs are said to be define-by-run frameworks.
Define-And-Run Frameworks TensorFlow, Caffe [19], and Theano [5] are define-and-run frameworks. Static graphs represent a whole program, enabling optimization and simplified deployment, by removing the need for a host language like Python. TensorFlow (TF) extends pure dataflow graphs with control edges to emulate the functionality of if and while. TF's representation captures many state-of-the-art models, provides support for heterogeneous hardware back-ends, and enables reverse-mode automatic differentiation [4, 1]. TF's encoding of control has limitations, as control-flow structures do not clearly map to familiar control-structures, instead using specialized encodings which make adapting traditional optimizations challenging. Furthermore, unmodified TensorFlow does not support building models where the shape of the computation graph is dependent on the input, frustrating researchers who wish to experiment with complex models. TensorFlow Fold addresses this particular limitation [26] but offers no general and extensible solution. The crux of the problem is the lack of generic mechanisms for users to define new control flow combinators (e.g., fold) and data types.
Define-By-Run Frameworks PyTorch [33], Gluon [12], Chainer [50], and TensorFlow eager-mode [41] are define-by-run frameworks which attempt to address the challenges of
previous work. The approach popularized by PyTorch is to use a host language (e.g., Python) to eagerly execute operations while simultaneously building a computation graph as a side effect. By using the full host language, its features may be used to provide a highly expressive programming model to users. However, dynamic frameworks construct a graph per program trace and must re-optimize when the graph topology changes, costing CPU cycles and incurring communication overhead between the host machine and accelerators. Instead of just representing traces, Relay combines the advantages of both worlds by representing the whole program ahead of time, while supporting constructs like control flow, first-class functions, and data structures.
2.2. Low-Level Tensor Compilers
Low-level tensor compilers are focused on the production of high-performance operators which implement compute intensive operations such as matrix multiplication or convolution. There are a number of competing approaches, both from academic and commercial entities, such as TVM [7], Halide [35], Tensor Comprehensions (TC) [53], and Diesel [11]. The most notable designs are either inspired by the compute-schedule split introduced by Halide and adapted by TVM, or the polyhedral framework, as used by TC and Diesel. Operator compilers perform code generation for sets of scalar loop nests, but only represent a restricted subset of a whole program, ignoring details such as memory allocation/management, data structures, closures, and arbitrary control flow. Relay focuses on composing generic operators and the surrounding program into an efficiently orchestrated DL program.
2.3. Deep Learning Compilers
DL frameworks have adopted compilers to tackle both performance and portability for existing applications, most notably XLA [55], Glow [38], nGraph [10], ONNC [24], PlaidML [9], and ModelCompiler. These graph compilers use computation graph IRs and provide lowering onto a variety of targets. Often graph compilers only perform high-level optimizations and then offload to vendor-specific libraries. Due to their limited programming model, they provide the same functionality as Relay with a more limited language. The most comparable points to Relay are recent developments in the TensorFlow and PyTorch ecosystems of MLIR and TorchScript, respectively. Google introduced MLIR as a path forward for unifying its myriad of IRs. Upon first examination MLIR might appear to be a replacement for XLA and related TF compiler efforts, but it is not that. MLIR is shared infrastructure for constructing a set of interoperating IR "dialects" which can be used to construct compilers. The MLIR project is working on IR dialects for TF's IR and a low-level polyhedral IR, but does not yet have an end-to-end solution for deep learning built upon MLIR; the insights in this paper can guide MLIR's dialect development.
TorchScript is a high-level Python-like IR developed as the first layer of PyTorch's JIT compiler. PyTorch (since v1.0) can rewrite a subset of user programs into TorchScript, an idealized subset of Python. TorchScript can then be executed by the TorchScript VM or JIT-compiled to a target platform. TorchScript sits many layers above code generation and must accommodate the flexible semantics of Python, which rules out entire classes of static analysis. In order to optimize away this dynamic behavior, TorchScript has a profiling JIT mode which identifies stable program traces during execution. These stable static traces can then be optimized by lower-level compilers such as Glow or Relay to perform the last level of code generation. Microsoft released ModelCompiler, a system for efficiently compiling RNNs defined in CNTK to CPU. ModelCompiler uses Halide to represent low-level operations, but lacks the expressivity of the Relay IR and only demonstrates support for CPUs.
2.4. Programming Languages for Deep Learning
In recent years, the design of new programming languages, or the augmentation of existing ones, has become a popular area of research. New languages designed for machine learning and related tasks include Lantern [54], Lift [43], Flux.jl [18], AutoGraph [30], Swift for TensorFlow [48], and JAX [25]. Lantern [54] is the most related work to Relay as it can be
used as a code generator. Lantern is a deep learning DSL in Scala that uses lightweight modular staging (LMS) to lower code into C++ and CUDA. Lantern’s defining feature is the
use of delimited continuations to perform automatic differentiation. Delimited continuations provide an elegant algorithm for AD, only requiring local transforms, but incur the cost of heap-allocated structures, and a less straightforward mapping to define-by-run frameworks. Lantern solves this problem by using a CPS transform which complicates further optimization and code generation. Lantern does not yet support hardware accelerators, and does not focus on full program optimizations. The alternative approach is the augmentation of languages to support deep learning, the most notable being systems like AutoGraph, Flux.jl, Swift for TensorFlow, and JAX. These systems are designed to be user-facing programming environments for deep learning and use a compiler IR to generate code. For all intents and purposes Relay could be the IR in question; therefore Relay complements these systems well by providing a more expressive IR to map computation onto.
3.1. IR
The Relay IR is designed to subsume the functionality of computation graph-based IRs while providing greater faculties for abstraction and control flow. We present Relay's design by incrementally building up to the full IR starting from a subset that corresponds to a simple computation graph. Deep learning models fundamentally operate on tensors. Hence, Relay's primary value type is a tensor and operators are included as language primitives (see the tensor constant and operator rules in Figure 1). Relay leaves the implementation of each operator opaque; the operators are represented by a lower-level IR, which is optimized independently. A computation graph, in its simplest form, is a directed acyclic graph with multiple inputs and a single output. Relay uses three constructs to support these simple graphs: (1) variable, (2) function call, and (3) operator; see Figure 1 for the corresponding rules.
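As a non-paper aside, these three constructs map directly onto TVM's Python API; a minimal sketch (shapes are arbitrary):

import tvm
from tvm import relay

x = relay.var("x", shape=(1, 64))       # variable
w = relay.var("w", shape=(32, 64))
y = relay.nn.dense(x, w)                # operator + function call
z = relay.nn.relu(y)
func = relay.Function([x, w], z)        # a simple single-output dataflow graph
print(func)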
Multiple Outputs Computation graph IRs have primitive support for multiple outputs because many tensor operators require it. For example, the split operator separates a tensor along a given axis and returns each component. In Relay, multiple outputs can be modeled as tuples, requiring only two rules: tuple formation and tuple projection.
Let By construction, computation graphs enjoy implicit sharing of subcomputations via multiple outgoing dependency edges. Implicit sharing is often implemented via pointers that
uniquely identify subgraphs, a property useful for both execution and analysis. Previous frameworks often obtain this sharing by using a host language’s name binding to construct a graph (e.g., by binding a Python variable to a subgraph and using that variable to construct other subgraphs). General purpose programming languages, on the other hand, provide
explicit sharing via binding constructs, such as let. In programs free of scope, ordering, and effects, implicit sharing and explicit sharing are semantically equivalent. However, in practice, user programs rely on effects and ordering, requiring previous approaches to provide workarounds. For example, TensorFlow's Eager Mode inserts dummy control edges in its generated graphs to impose effect ordering. The lack of lexical scope in computation graphs complicates language features, like first-class functions and control flow, and reduces the precision of traditional analyses, such as liveness, because the high-level program structure is absent. The addition of a humble let binding, a central concept in functional languages, provides explicit sharing and a solution to the problems outlined above.
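A small sketch (not from the paper) of explicit sharing with let in the TVM Python API: the subcomputation bound to t is reused without relying on host-language name binding.

import tvm
from tvm import relay

x = relay.var("x", shape=(8,))
t = relay.var("t", shape=(8,))
shared = relay.add(x, relay.const(1.0))            # subcomputation to be shared
body = relay.Let(t, shared, relay.multiply(t, t))  # explicit sharing via let
func = relay.Function([x], body)
print(func)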
Control Flow Emerging models, particularly in the domain of natural language processing, increasingly rely on data-dependent control flow, forcing frameworks based on computation graph IRs to incorporate control flow, often through ad hoc and difficult-to-extend constructs. For example, TensorFlow Fold [27] extends TF with special combinators that dynamically compute a graph for each shape permutation; these high-level constructs are opaque to further optimizations. The functional programming community has demonstrated that recursion and pattern matching are sufficient to implement arbitrary combinators for control flow and iteration (e.g., maps, folds, and scans). To support the definition of functional combinators we enrich Relay with two more language features to implement arbitrary combinators: if and first-class recursive functions.
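For illustration (again outside the paper text), a recursive accumulator built from exactly those two features, if and a let-bound recursive function, in the TVM Python API:

import tvm
from tvm import relay

i = relay.var("i", shape=(), dtype="int32")
acc = relay.var("acc", shape=(), dtype="int32")
loop = relay.var("loop")   # bound to the recursive function by the let below

step = relay.If(
    relay.equal(i, relay.const(0, "int32")),
    acc,                                                # base case: return accumulator
    loop(relay.subtract(i, relay.const(1, "int32")),    # recursive call
         relay.add(acc, i)),
)
sum_to = relay.Function([i, acc], step)
expr = relay.Let(loop, sum_to, loop(relay.const(10, "int32"), relay.const(0, "int32")))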
First-Class Functions A computation graph is a single computation from multiple inputs to multiple outputs. While it is tempting to reinterpret a graph as a function, graphs lack
functional abstraction and named recursion. The addition of first-class named functions dramatically increases Relay's expressivity, allowing it to encode generic higher-order functions and thus capture higher-level program structure. First-class functions also enable simpler implementations of importers that map higher-level programs to our IR. For example, an instance of TensorFlow's looping construct tf.while_loop can be represented as a single specialized loop function or a generic fold over the loop state. See Figure 2 for an example of this conversion (via the Relay TensorFlow frontend).
Data Abstraction Many models make use of additional data types beyond tuples, such as lists, trees, and graphs [21, 46, 23]. Relay borrows from functional languages a generic and principled method of extension: algebraic data types (ADTs). To support them, we add mechanisms for (1) type declaration and (2) pattern matching. This final addition results in a strict functional language, closely resembling the core of languages like OCaml and SML. The increase in expressivity introduced by the Relay IR introduces new optimization challenges, which we discuss in Sec. 4.
3.2. Type System
Relay’s type system is essential to optimizations. Typing guarantees both well-formedness of the program and provides crucial tensor shape information to perform allocation, check
correctness, and facilitate loop optimizations. Shape information is also valuable for data layout transformations and tensorization, two transformations often demanded by hardware accelerators. In computation graph IRs, only numeric data types and shapes are tracked for each operator. Symbolic shapes (i.e., shape polymorphism) are only handled dynamically, inhibiting certain types of optimizations. It is possible to model arbitrarily complex static properties, such as shape information, with a dependent type theory [40],
but such a design incurs significant user complexity. By incorporating shape analysis into a broader type system, Relay’s type system balances the desire for static tensor shapes with
usability. In this subsection, we describe how to extend a polymorphic type system with shape information and type inference with shape inference.
Tensor Types The primitive value in Relay is a tensor, which has a shape and a base type (tensor type in Figure 1). Base types describe the elements of tensors by tracking the bit width, the number of lanes (for utilizing vectorized intrinsics), and whether the type is floating point or integral. To ensure Relay can offload tensor computation to devices with greatly varying architectures, Relay tensors may only contain base types, preventing, for example, tensors of closures. The shape of a tensor is a tuple of integers describing the tensor's dimensions. A dimension may be a variable or arithmetic expression that indicates how the output shape of an operator depends on those of its inputs. Functions may be polymorphic over shapes, which results in shape constraints that must be solved during type inference. Sec. 3.2 describes the process. Relay also supports a special shape called Any, which is used to mark a dynamic shape when static relationships are not profitable to model.
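A brief sketch (not from the paper) of how these tensor types look in TVM's Python API, including an Any dimension for a dynamic batch size:

import tvm
from tvm import relay

static_t = relay.TensorType((1, 3, 224, 224), "float32")
dynamic_t = relay.TensorType((relay.Any(), 3, 224, 224), "float32")  # dynamic batch dim
x = relay.var("x", dynamic_t)
print(static_t, dynamic_t)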
Operators and Type Relations Operators are one of the key primitives that differ from those of general-purpose programming languages. Relay's use of opaque operators enables backends to choose different lowering strategies based on the hardware target. Relay's operator set is extensible, meaning that users may add new operations. Supporting common or user-defined tensor operators requires a type system that can adapt to complex shape relationships between input and output types (e.g., elementwise operators with broadcasting semantics).
To handle the constraints between operators’ argument shapes, Relay’s type system introduces type relations. A type relation is implemented as a function in the meta-language
and represents a symbolic relationship between the input and output types. When developers add a new operator to Relay, they may constrain its type with an existing relation or add their own. Function types may include one or more type relations over a subset of the argument types and the return type. The type checker enforces that these relationships hold at each call site.
Type Inference To incorporate type relations into Relay's type system, we enrich a Hindley-Milner-style type inference algorithm with a constraint solver. Relay's inference algorithm has three steps: first, it performs a pass over the AST, generating types and a set of relations, then it solves the incurred constraints, and finally annotates each sub-expression with its inferred type. When the type inference algorithm visits a function call site, the function's type relations are instantiated with the concrete argument types at the call site. Each instantiated relation is added to the queue of relations to solve. The relationship between a call's type variables and relations is added as an edge to a bipartite dependency graph where the two disjoint sets are type variables and type relations. Traditional unification constraints are represented using a modified union-find structure that integrates with this dependency graph. Once the queue is populated, the algorithm will dequeue a relation and attempt to solve it. There are two cases when solving a type relation:
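Connecting this to TVM's implementation (a non-paper sketch): the InferType pass runs this relation-solving machinery and populates each sub-expression's checked_type. The shapes below are made up for illustration.

import tvm
from tvm import relay

x = relay.var("x", shape=(4, 8), dtype="float32")
y = relay.var("y", shape=(8, 16), dtype="float32")
f = relay.Function([x, y], relay.nn.dense(x, relay.transpose(y)))  # dense expects a (units, in_dim) weight

mod = tvm.IRModule.from_expr(f)
mod = relay.transform.InferType()(mod)
print(mod["main"].ret_type)   # inferred by the dense/transpose type relations, e.g. (4, 16)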
3.3. Compiler Framework
The process for compiling Relay proceeds in three stages. First, the frontend converts input formats into the Relay IR. Next, the Relay compiler typechecks and optimizes the program to produce the final program. After performing optimizations, the Relay backend transforms the Relay program into a form that can be executed on the intended hardware, based on the specified execution mechanism. The backend additionally lowers Relay operators into a TVM expression, computes a schedule for the final TVM expression, and lowers it into native code.
Frontend There are several ways to write a Relay program. A user can build an in-memory representation of a program in C++ or Python, parse one written in the Relay text format, load one from the on-disk serialization format, or import one from popular frameworks and interchange formats (e.g., TensorFlow, MxNet, Keras, DarkNet, and ONNX). Many frameworks and interchange formats use static computation graph-based representations, which can easily be translated into Relay. A greater challenge is translating frameworks with a richer computation model such as TensorFlow (TF). TF supports control flow and includes TensorArray, a write-once tensor container. We can extract the loop structure out of the TF graph, converting it to a Relay loop, and transform the TensorArray into a Relay list. Once new deep learning languages and IRs under development are stable, it is likely
they can be translated into Relay (see Section 2.4). PyTorch provides an expressive programming model, and is a good fit for Relay, which has integration into PyTorch’s JIT infrastructure, enabling users to transparently use Relay for improved performance.
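A tiny sketch of the frontend stage in TVM (not from the paper): importing an ONNX model into Relay. The file name, input name, and shape are placeholders.

import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")              # placeholder path
shape_dict = {"input": (1, 3, 224, 224)}          # placeholder input name/shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
print(mod["main"])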
Compiler Once a Relay abstract syntax tree (AST) is produced, the program is optimized by applying a series of Relay-to-Relay passes. Between each pass, Relay performs type
inference and checking, rejecting malformed programs as well as populating shape and type information that passes can utilize. The Relay compiler supports traditional optimizations
(e.g., constant folding, common subexpression elimination, and dead code elimination) and domain-specific optimizations (see Sec. 4).
Backends Relay produces machine-specific code by decomposing the problem of code generation into multiple distinct phases. Relay translates all operators into TVM expressions to produce dense linear algebra kernels [7, 53, 35]. TVM produces low-level operators that expect a fixed calling convention, as well as preallocated inputs and outputs. The result is an object file containing hardware-specific implementations of all operations. The remaining Relay program then is executed or compiled, with operator invocations replaced by calls to the optimized operators. By representing operators as TVM expressions, we can programmatically transform them and automatically generate new implementations for the transformed operators. Optimizations like fusion and quantization rely on this novel behavior. After primitive operators are lowered, the remaining Relay program ties together operator invocations, allocation, control-flow, recursion, and high-level data structures. There are multiple options for executing the combined full program: the Relay interpreter (with JIT compilation), an