[Paper Notes] IOS: Inter-Operator Scheduler for CNN Acceleration

IOS: Inter-Operator Scheduler for CNN Acceleration
Proceedings of the 4th MLSys Conference, San Jose, CA, USA, 2021.
arXiv | GitHub | Website

Overview

As hardware compute capability grows, the sequential execution of convolutional neural networks (CNNs) no longer exposes enough parallelism to fully utilize the available resources. This paper proposes IOS (Inter-Operator Scheduler), an algorithm that schedules the operators of a deep learning model's computation graph from a global perspective. By combining inter-operator and intra-operator parallelism, and by using dynamic programming to search for efficient schedules, IOS achieves higher device utilization for CNNs. Experiments show that IOS improves CNN inference performance by 1.1× to 1.5× compared with TensorRT.

Existing approaches and their problems

  1. MetaFlow [1] fuses multiple operators that match a specific pattern into a larger operator to increase operator granularity.

  2. Greedy schedule [2] directly executes all currently available CNN operators on many-core CPUs to maximize resource utilization. Its problems:

    1. It puts more operators in the early stages and fewer in the later ones, resulting in low utilization toward the end.
    2. Executing too many operators on the device concurrently can cause resource contention that hurts performance.
  3. Challenges in finding the optimal parallelized schedule for a CNN model:

    1. The number of candidate schedules grows exponentially with the number of operators;
    2. The optimal schedule depends on the specific hardware and runtime configuration (e.g., the batch size).

Algorithm details

Recursive partitioning of the computation graph

Illustration of an ending
  1. Split the computation graph $G$ into $G'$ and $S$ such that every edge between $G'$ and $S$ goes from $G'$ into $S$. $S$ is called an ending of $G$.
  2. In the optimal schedule of $G$, the operators in the last stage necessarily form an ending of $G$.
  3. By enumerating the endings $S$ of $G$, finding the optimal schedule of $G$ reduces to finding the optimal schedule of $G' = G - S$.
  4. Recursively partitioning the computation graph in this way schedules the whole graph; a brute-force sketch of the ending enumeration follows this list.
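
To make the notion of an ending concrete, here is a minimal brute-force enumeration sketch (my own representation, not the paper's implementation): an ending is a subset with no edge leaving it, i.e. one closed under successors.

```python
from itertools import combinations

def endings(ops, succ):
    """Yield every non-empty ending S of the operator set `ops`: a subset
    such that every successor (within ops) of a member of S is also in S,
    so all edges between ops - S and S point into S."""
    ops = frozenset(ops)
    for r in range(1, len(ops) + 1):
        for cand in combinations(sorted(ops), r):
            s = frozenset(cand)
            if all(w in s for u in s for w in succ.get(u, ()) if w in ops):
                yield s

# Diamond graph a -> {b, c} -> d: the endings are {d}, {b,d}, {c,d},
# {b,c,d}, and {a,b,c,d}.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
for s in endings("abcd", succ):
    print(sorted(s))
```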

Defining the cost function

$$L[V] \;=\; \min_{S \,\in\, \mathrm{endings}(V)} \bigl\{\, L[V - S] + \mathrm{stage\_latency}(S) \,\bigr\}, \qquad L[\varnothing] = 0$$

where $V$ is a subset of the operators and $S$ ranges over the endings of $V$; $\mathrm{stage\_latency}(S)$ is the latency of executing stage $S$ under the better of the two parallelization strategies. Consequently $L[G]$, evaluated on the full operator set, is the model inference latency achieved by the optimal schedule for the whole graph.

Pseudocode

Algorithm 1 Inter-Operator Scheduler (IOS)

The implementation of IOS consists of three functions:

  1. InterOperatorScheduler
    takes a computation graph as input and returns the optimal schedule found by IOS
  2. Scheduler
    a recursive function implementing the dynamic programming algorithm that finds the optimal schedule for a subset of the operators in the graph
  3. GenerateStage
    chooses the better parallelization strategy for the given operators
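
Below is a compact Python sketch of this dynamic program, mirroring Scheduler with memoization. It is a brute-force illustration of the recurrence, not the authors' implementation: the `stage_latency` argument stands in for a cost model that, like GenerateStage, would pick the better of merging and concurrent execution for each stage and measure its latency.

```python
from functools import lru_cache
from itertools import combinations

def ios_schedule(ops, succ, stage_latency):
    """Sketch of the IOS dynamic program over operator subsets.
    Returns (latency, stages) for the best schedule found."""
    def endings(v):
        # Same brute-force ending enumeration as in the earlier sketch.
        for r in range(1, len(v) + 1):
            for cand in combinations(sorted(v), r):
                s = frozenset(cand)
                if all(w in s for u in s for w in succ.get(u, ()) if w in v):
                    yield s

    @lru_cache(maxsize=None)
    def best(v):
        # L[V] = min over endings S of V of (L[V - S] + stage_latency(S)).
        if not v:
            return 0.0, ()
        return min(((best(v - s)[0] + stage_latency(s),
                     best(v - s)[1] + (s,)) for s in endings(v)),
                   key=lambda t: t[0])

    return best(frozenset(ops))

# Toy cost model: each stage costs 1 ms plus 0.5 ms per operator, so the
# optimizer prefers fewer, wider stages.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
latency, stages = ios_schedule("abcd", succ, lambda s: 1 + 0.5 * len(s))
print(latency, [sorted(s) for s in stages])
```

The search is exponential in the number of operators, which is exactly why the paper introduces the pruning parameters discussed in the ablation section below.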

Time complexity of the algorithm

(Omitted.)

Experiments

Comparison between different schedules

End-to-end performance comparison
  • IOS-Merge schedule
    only takes the “operator merge” strategy
  • IOS-Parallel schedule
    only takes the “concurrent execution” strategy
  • IOS-Both schedule
    considers both parallelization strategies
  • sequential schedule
    executes the operators one by one according to a certain topological ordering
  • greedy schedule
    puts all operators that can currently be executed into one stage, and repeats this process until all operators have been scheduled (a minimal sketch follows this list)
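
For contrast with IOS's search, the greedy baseline fits in a few lines. A minimal sketch (the graph representation and names are my own):

```python
def greedy_schedule(ops, preds):
    """Greedy baseline: each stage takes every not-yet-scheduled operator
    whose predecessors have all finished; repeat until none remain."""
    done, stages, remaining = set(), [], set(ops)
    while remaining:
        stage = {op for op in remaining if set(preds.get(op, ())) <= done}
        if not stage:
            raise ValueError("the graph contains a cycle")
        stages.append(stage)
        done |= stage
        remaining -= stage
    return stages

# Diamond graph a -> {b, c} -> d: three stages {a}, {b, c}, {d}.
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(greedy_schedule("abcd", preds))
```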

Comparison with cuDNN-based frameworks

End-to-end performance comparison

Figure 7 shows that IOS consistently outperforms all five baseline frameworks on the four benchmark CNNs. IOS achieves a 1.1× to 1.5× speedup compared with the state-of-the-art frameworks TASO, TVM-cuDNN, and TensorRT.

More active warps improve device utilization

Comparison of active warps

Operator computation is ultimately executed as GPU kernels. A kernel launches a set of threads, groups them into thread blocks, and dispatches each block to a streaming multiprocessor (SM). The thread blocks on an SM are further divided into warps; a warp, containing a fixed number of threads, is the basic execution unit and executes in single-instruction-multiple-threads (SIMT) fashion.

A warp is considered active from the time it starts executing on an SM until its last instruction completes. An SM hides warp stalls caused by memory accesses through fast context switching: in each cycle, each warp scheduler selects an eligible warp and issues its instruction. If no eligible warp is available, the compute resources sit idle for that cycle.

Increasing the number of active warps is an effective way to raise the probability that an eligible warp is available in each cycle, which makes it crucial for device utilization.
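
As a back-of-the-envelope illustration (toy numbers of my own, assuming the usual warp size of 32): the active-warp count scales with the number of resident kernels, which is exactly what executing operators concurrently increases.

```python
def kernel_warps(num_blocks, threads_per_block, warp_size=32):
    """Warps launched by one kernel: each thread block is split into
    ceil(threads_per_block / warp_size) warps."""
    return num_blocks * -(-threads_per_block // warp_size)  # ceil division

one_kernel = kernel_warps(num_blocks=8, threads_per_block=96)  # 8 * 3 = 24 warps
two_concurrent = 2 * one_kernel  # two operators' kernels resident: 48 warps
print(one_kernel, two_concurrent)
```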

Figure 8 shows the number of active warps on the whole GPU throughout the repeated execution of both the IOS and the Sequential schedule, sampled using NVIDIA’s CUPTI profiling toolset every 2.1 ms. IOS schedule achieves 58% more active warps on average compared to the Sequential schedule.

Ablation studies

Schedule pruning to reduce the search time

Trade-off between the optimized latency and the optimization cost for Inception V3 and NasNet.

IOS can trade off the optimized inference latency against the optimization (search) time by adjusting two pruning parameters: the maximum number of operators in each group, and the maximum number of groups in a stage. Figure 9 shows how different settings of these parameters affect both the optimized latency and the optimization cost; the pruning rule itself is sketched below.
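
The pruning rule reduces to a simple predicate over a candidate stage. A sketch with hypothetical parameter names (the group representation and the limits' names are mine; IOS applies such limits while enumerating stages):

```python
def stage_allowed(stage_groups, max_ops_per_group, max_groups_per_stage):
    """Keep a candidate stage only if it respects both pruning limits:
    at most max_groups_per_stage groups, each containing at most
    max_ops_per_group operators."""
    return (len(stage_groups) <= max_groups_per_stage and
            all(len(group) <= max_ops_per_group for group in stage_groups))
```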

Specialized schedules improve performance

Latency (ms) of schedules specialized for batch sizes 1, 32, and 128, and of schedules specialized for NVIDIA Tesla K80 and V100

Different workloads (e.g., the same model run at different batch sizes) have different computational characteristics, so each workload calls for its own schedule. Table 3 (1) shows that running each batch size with the schedule specialized for that batch size yields the best performance.

Likewise, different devices call for different schedules. Table 3 (2) shows that a schedule specialized for a particular device achieves better performance on it.

The schedule found by IOS for the last block of Inception V3

Figure 10 shows that the same network structure is given different schedules for different batch sizes.

There are two stages in schedule (1), which is optimized for batch size 1, and four stages in schedule (2), which is optimized for batch size 32. Schedule (1) is 28% faster than schedule (2) at batch size 1, while schedule (2) is 8% faster than (1) at batch size 32. The two schedules differ in two ways. First, convolutions f and g in schedule (2) are merged into a single convolution. This is because the activations (the output tensors of operators) become the memory bottleneck at large batch sizes, so it pays to reduce memory access even at the cost of extra computation: the merged kernel accesses the output of convolution c only once, instead of twice as in schedule (1). However, because the kernel sizes of f and g are 3x1 and 1x3, respectively, both are expanded to 3x3 by zero-padding, which increases the amount of computation (a toy cost comparison follows this paragraph). Second, schedule (2) has more stages than schedule (1). A similar phenomenon appears at large batch sizes because of resource contention: when multiple operators execute on the device concurrently, they compete for shared resources such as the last-level cache, so concurrent execution can degrade performance, and this becomes more severe as the batch size, and with it the demand for shared resources, grows.
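
A toy cost comparison for the f/g merge just described, counting only multiply-accumulates per output element per channel pair and reads of convolution c's output (my own simplification; the real trade-off also involves kernel-launch overhead and cache behavior):

```python
# Separate 3x1 and 1x3 convolutions: 3 + 3 = 6 MACs per output element
# (per channel pair), but convolution c's output is read twice.
macs_separate, c_reads_separate = 3 * 1 + 1 * 3, 2

# Merged: both kernels are zero-padded to 3x3, so 9 + 9 = 18 MACs,
# three times the arithmetic, but c's output is read only once.
macs_merged, c_reads_merged = 3 * 3 + 3 * 3, 1

# At large batch sizes memory traffic dominates, so trading MACs for
# fewer reads wins; at batch size 1 the extra MACs are not worth it.
print(macs_separate, c_reads_separate, macs_merged, c_reads_merged)
```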

Consistent improvement across batch sizes

Throughput comparison

In practice, models are served with various batch sizes. Figure 11 shows that throughput increases with batch size and saturates once the batch size exceeds 128, after which it no longer grows. Moreover, IOS achieves the best throughput among all baselines at every batch size.
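
For reference, throughput here is the usual inferences-per-second measure; a one-line helper (assuming latency is measured per batch) makes the saturation behavior easy to reason about: once latency starts growing linearly with batch size, throughput stops improving.

```python
def throughput(batch_size, batch_latency_ms):
    """Inferences per second when a whole batch takes batch_latency_ms."""
    return batch_size * 1000.0 / batch_latency_ms
```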

Intra-operator and inter-operator parallelism

End-to-end performance comparison

TVM exploits intra-operator parallelism by searching for a schedule for each kernel on a specific device, whereas IOS focuses on inter-operator parallelism and leaves the exploitation of intra-operator parallelism to the cuDNN library.

Intra-operator and inter-operator parallelism are orthogonal and can be exploited together. Figure 12 shows that IOS outperforms TVM on some networks, because on a sufficiently powerful device intra-operator parallelism alone cannot supply enough parallelism. On other networks, TVM outperforms IOS, because TVM finds more efficient kernel implementations for separable convolutions, which make up a large fraction of the operators in those models.


  1. Jia, Z., Thomas, J., Warszawski, T., Gao, M., Zaharia, M., and Aiken, A. Optimizing DNN computation with relaxed graph substitutions. In Talwalkar, A., Smith, V., and Zaharia, M. (eds.), Proceedings of Machine Learning and Systems, volume 1, pp. 27–39, 2019. ↩

  2. Tang, L., Wang, Y., Willke, T. L., and Li, K. Scheduling computation graphs of deep learning models on manycore CPUs. arXiv preprint arXiv:1807.09667, 2018. ↩
