AI Compute Basics -- The Roofline Model

Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures


Background 1: Amdahl's Law. Gene Amdahl made an insightful observation about how much improving the performance of one part of a system can speed up the system as a whole. This observation became known as Amdahl's Law.
Background 2: David Patterson, 2017 Turing Award laureate, professor of computer science at UC Berkeley, and Distinguished Engineer at Google. A master of computer architecture, Patterson led the Berkeley team that designed RISC-I (reduced instruction set computer), laying the foundation of the RISC architecture, which was later chosen by Sun Microsystems (since acquired by Oracle) for its SPARC processors. Together with John Hennessy, former president of Stanford University and now chairman of Google's parent company Alphabet, he wrote *Computer Architecture: A Quantitative Approach*, which pioneered an analytical, quantitative framework for the field and remains its classic textbook. After retiring from Berkeley in 2016, Patterson joined the Google Brain team as a Distinguished Engineer and contributed substantially to two generations of TPUs.

In March 2018, David Patterson and John Hennessy jointly received the 2017 ACM Turing Award, in recognition of pioneering a systematic, quantitative approach to the design and evaluation of computer architectures, with lasting impact on the microprocessor industry.

Amdahl's Law

[Figure: Amdahl's Law]

1. Abstract

We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for floating point computations.

2. The roofline model

We believe that for the recent past and foreseeable future, off-chip memory bandwidth will often be the constraining resource [23]. Hence, we want a model that relates processor performance to off-chip memory traffic.

**Operational intensity**: operations per byte of DRAM traffic. Operational intensity suggests the DRAM bandwidth needed by a kernel on a particular computer.
The Y-axis is attainable floating-point performance; the X-axis is operational intensity.

[Figure 1: Roofline models; (b) compares the AMD Opteron X2 and Opteron X4]
Attainable GFlops/sec = Min(Peak Floating Point Performance, Peak Memory Bandwidth x Operational Intensity).

At first, attainable performance increases with operational intensity; past the ridge point it stays flat at peak performance.
Depending on which term of the min() wins, a kernel that falls below the roof is in one of two regimes: memory-bound (left of the ridge point) or compute-bound (right of it).
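
To make the min() formula concrete, here is a minimal Python sketch; the peak compute and bandwidth numbers are hypothetical, chosen only for illustration, not taken from the paper.

```python
# A minimal sketch of the Roofline formula. peak_gflops and peak_bw_gbs
# are hypothetical machine parameters, not values from the paper.

def attainable_gflops(peak_gflops, peak_bw_gbs, operational_intensity):
    """Attainable GFlops/sec = min(peak compute, peak bandwidth * OI)."""
    return min(peak_gflops, peak_bw_gbs * operational_intensity)

peak_gflops = 100.0   # assumed peak floating-point performance (GFlops/s)
peak_bw_gbs = 25.0    # assumed peak DRAM bandwidth (GB/s)

for oi in (0.5, 1.0, 4.0, 8.0):
    perf = attainable_gflops(peak_gflops, peak_bw_gbs, oi)
    bound = "memory-bound" if peak_bw_gbs * oi < peak_gflops else "compute-bound"
    print(f"OI={oi:4.1f} Flops/Byte -> {perf:6.1f} GFlops/s ({bound})")
```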

As the right-hand figure shows, once you have drawn the roofline for a given computer you can reuse it across different kernels, since the roofline does not change.
Figure 1b compares the Roofline models of two systems. As expected, the ridge point shifts right, from 1.0 on the Opteron X2 to 4.4 on the Opteron X4; hence, to achieve peak performance on the X4, a kernel needs an operational intensity greater than 4.4.

The x-coordinate of the ridge point is the minimum operational intensity required to achieve maximum performance. If the ridge point is far to the right, only kernels with very high operational intensity can reach the machine's maximum performance; if it is far to the left, almost any kernel can potentially do so.
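
The ridge point is simply peak compute divided by peak memory bandwidth. The sketch below uses illustrative machine numbers picked to reproduce the ridge points quoted above (1.0 and 4.4); they are assumptions, not measurements from the paper.

```python
# The ridge point is peak compute divided by peak memory bandwidth.
# The specs below are assumed, chosen only so the ridge points come out
# near the values quoted in the text (1.0 for the X2, ~4.4 for the X4).

machines = {
    "Opteron X2 (assumed specs)": (17.6, 17.6),   # (GFlops/s, GB/s)
    "Opteron X4 (assumed specs)": (73.0, 16.6),
}

for name, (peak_gflops, peak_bw_gbs) in machines.items():
    ridge = peak_gflops / peak_bw_gbs
    print(f"{name}: ridge point at {ridge:.1f} Flops/Byte")
```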

4. Adding ceilings to the roofline model

The Roofline model gives an upper bound on performance. Suppose your program is running far below its roofline: which optimizations should you perform, and in what order?

To reduce computational bottlenecks:

  1. Improve instruction-level parallelism (ILP) and apply SIMD.
  2. Balance floating-point operation mix.
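
As a sketch of how such ceilings stack below the roofline, the snippet below divides a hypothetical peak by hypothetical loss factors; both the peak and the factors are assumptions for illustration, not the paper's measured ceilings.

```python
# Sketch of computational ceilings below the roofline: each ceiling is the
# peak divided by the cost of a missing optimization. Peak and loss factors
# are hypothetical, chosen only to illustrate the layering.

peak_gflops = 100.0

ceilings = {
    "peak floating-point":        peak_gflops,
    "w/out balanced mul/add mix": peak_gflops / 2,          # assumed 2x penalty
    "w/out SIMD":                 peak_gflops / 2 / 4,      # assumed further 4x
    "w/out ILP":                  peak_gflops / 2 / 4 / 2,  # assumed further 2x
}

for label, gflops in ceilings.items():
    print(f"{label:26s}: {gflops:6.1f} GFlops/s")
```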

To reduce memory bottlenecks:

  1. Restructure loops for unit-stride accesses. Optimizing for unit-stride memory accesses engages hardware prefetching, which significantly increases memory bandwidth (a small sketch follows the figure below).
  2. Ensure memory affinity. Most microprocessors today include a memory controller on the same chip with the processors. If the system has two multicore chips, then some addresses go to the DRAM local to one multicore chip and the rest must go over a chip interconnect to access the DRAM that is local to another chip.
  3. Use software prefetching.
[Figure: Roofline model with ceilings, three plots]
The first plot adds computational ceilings, the second adds memory-bandwidth ceilings, and the third combines the two.
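
As promised in item 1 above, here is a small NumPy sketch contrasting unit-stride with large-stride access over the same buffer; the array size and stride are arbitrary choices, and absolute timings will vary by machine.

```python
# Unit-stride access streams cache lines and is prefetch-friendly;
# large-stride access touches one element per 256 bytes and defeats it.
import time
import numpy as np

x = np.random.rand(32_000_000)   # ~256 MB of float64
contig = x[:1_000_000].copy()    # 1,000,000 contiguous elements
strided = x[::32]                # 1,000,000 elements at a 256-byte stride

t0 = time.perf_counter()
contig.sum()                     # unit-stride traversal
t1 = time.perf_counter()
strided.sum()                    # strided traversal of the same element count
t2 = time.perf_counter()

print(f"unit-stride sum: {t1 - t0:.4f} s")
print(f"strided sum    : {t2 - t1:.4f} s")
```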

[Figure: operational intensities of four floating-point kernels on the Roofline]
The figure above shows four floating-point kernels, each with its own operational intensity.
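
For intuition about where such operational intensities come from, here is a worked example for a hypothetical triad-style kernel (not one of the four in the figure), counting flops and compulsory DRAM bytes per element.

```python
# Worked example: operational intensity of a hypothetical triad-style
# kernel y[i] = a * x[i] + y[i] over float64 arrays, counting only
# compulsory DRAM traffic (caches ignored):
#   flops per element : 1 multiply + 1 add = 2
#   bytes per element : read x (8) + read y (8) + write y (8) = 24

flops_per_elem = 2
bytes_per_elem = 3 * 8

oi = flops_per_elem / bytes_per_elem
print(f"operational intensity = {oi:.3f} Flops/Byte")   # = 0.083
```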

  1. Intel. Intel includes a snoop filter to prevent unnecessary coherency traffic on the bus. If the working set is small enough for the hardware to filter, the snoop filter nearly doubles the delivered memory bandwidth.
    [Figure: Roofline for the Intel machine]
  2. AMD
    [Figure: Roofline for the AMD machine]
  3. IBM
    [Figure: Roofline for the IBM machine]

Fallacy: The model does not take into account all features of modern processors, such as caches or prefetching.
Fallacy: Doubling cache size will increase operational intensity
Fallacy: The model doesn’t account for the long memory latency
Fallacy: The model ignores integer units in floating-point programs, which can limit performance
Fallacy: The model is limited to easily optimized kernels that never hit in the cache
Fallacy: The model is limited to floating-point programs

Conclusions

This paper describes a simple and visual model to help see which systems would be a good match to important kernels, or conversely, to see how to change kernel code or hardware to run desired kernels well.
