A Case Study of Performance Jitter in AI Model Inference on CPU

Problem

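In early 2019, a customer observed latency spikes during inference in their environment, as shown below. The workload was FP32 inference of a CNN model.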

Figure: inference latency spikes

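The log shows that the jitter occurs fairly often. In the dual-socket environment, the latency during a spike reaches roughly 20 times the average latency, which is enough to affect the SLA (Service Level Agreement).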
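To make "spike" concrete when scanning a latency log, a minimal sketch along the following lines can flag outlier iterations. The function name, the `factor` threshold, and the sample numbers are illustrative assumptions, not values from the customer's log. The median is used as the baseline here because the spikes themselves inflate the mean.

```python
from statistics import median

def find_latency_spikes(latencies_ms, factor=5.0):
    """Return (iteration index, latency) pairs above factor x the median latency.

    `factor` is an arbitrary illustrative threshold, not a value from the source.
    """
    if not latencies_ms:
        return []
    baseline = median(latencies_ms)
    return [(i, v) for i, v in enumerate(latencies_ms) if v > factor * baseline]

if __name__ == "__main__":
    # Made-up per-iteration latencies in milliseconds, for illustration only.
    sample = [10.0] * 99 + [200.0]
    for index, value in find_latency_spikes(sample):
        print(f"iteration {index}: {value} ms")
```

Feeding the per-iteration latencies from the benchmark log into such a function gives both the spike frequency and the spike-to-baseline ratio in one pass.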

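Initial Triage

Is the environment bare metal or a cloud instance?
Bare metal.

Were the threads pinned to cores?
Yes, with the following commands: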

$ export OMP_NUM_THREADS=
$ export MKL_NUM_THREADS=
$ taskset -c {0-} numactl -l caffe time -model  -iterations 100

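No effect.

Does disabling Turbo help?
No effect.

Is there any disk I/O?
The test uses dummy data rather than real data, so there is no disk I/O during the test.

Could it be a hardware problem?
- Any errors in dmesg?
  None.
- Any frequency jitter while the benchmark is running?
  "turbostat" shows nothing obviously abnormal.
- Does an SGEMM benchmark show a similar problem?
  No.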

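Reproduction

With the low-hanging checks exhausted, we changed direction: reproduce the problem in our local lab first, then debug.

Environment Alignment

We asked the customer to provide the following information: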
- HW
  - lscpu
  - dmidecode -t memory
- BIOS
  - P-state
  - C-state
  - NUMA: enabled
- OS
  - kernel version: 4.19.4
  - power governor: performance
```
$ cpupower frequency-info
$ cpupower frequency-set -g performance
```
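When aligning two machines, it can help to confirm the governor on every CPU rather than on a single one. The sketch below reads the standard cpufreq sysfs files. The function name is hypothetical, and it assumes the cpufreq driver exposes `scaling_governor` under `/sys/devices/system/cpu/cpuN/cpufreq/`.

```python
from pathlib import Path

def non_performance_cpus(sysfs_root="/sys/devices/system/cpu"):
    """Map CPU name -> governor for each CPU whose governor is not 'performance'."""
    result = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        governor = gov_file.read_text().strip()
        if governor != "performance":
            # gov_file is .../cpuN/cpufreq/scaling_governor, so step up twice for cpuN.
            result[gov_file.parent.parent.name] = governor
    return result

if __name__ == "__main__":
    offenders = non_performance_cpus()
    print(offenders or "no CPU with cpufreq reports a non-performance governor")
```

An empty result on both the customer machine and the lab machine is one less variable to worry about during reproduction.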

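After aligning the framework commit, batch size, CPU SKU, kernel version, and commands with the customer, we were able to reproduce the customer's problem in our lab with both FP32 and INT8.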

FP32 AVX-512
INT8 VNNI

Debug

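[Start] Is it caused by resource contention between the OS and the application?

We had previously seen a performance degradation on a certain cloud whose root cause turned out to be the OS and KVM contending with the application for resources. It was fixed by using taskset to keep the application off cores 0-1. Is this problem similar?

Attempt: taskset the application to cores 2-23 so that it does not contend with the OS and KVM for cores 0-1.
Result: the problem persisted, so this attempt failed.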
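One way to double-check that such pinning took effect is to compare the CPU each thread last ran on with the allowed core list. The sketch below is an illustration, not part of the original investigation. It relies on field 39 (`processor`) of `/proc/<pid>/task/<tid>/stat` as documented in proc(5). The helper names are hypothetical, and stride syntax such as `0-10:2` is not handled.

```python
def parse_cpu_list(spec):
    """Expand a taskset-style list such as '2-23' or '0,2-3' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        low, _, high = part.partition("-")
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus

def last_cpu_from_stat(stat_line):
    """Return field 39 (CPU last executed on) from a /proc/<pid>/task/<tid>/stat line."""
    # comm (field 2) may contain spaces or parentheses, so split after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[36])  # rest[0] is field 3, so field 39 is rest[36]

def threads_outside(allowed_spec, stat_lines):
    """Return the CPUs seen in stat_lines that are not in the allowed list."""
    allowed = parse_cpu_list(allowed_spec)
    return sorted({last_cpu_from_stat(s) for s in stat_lines} - allowed)
```

On a live system, the `stat_lines` would come from reading every `/proc/<pid>/task/*/stat` file of the benchmark process. An empty result means no thread was last seen outside the intended cores.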

The analysis reached a dead end.

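[Turning point] The customer mentioned that the jitter disappears with NUMA off

We reproduced the same behavior in the lab: the jitter disappears with NUMA off. The analysis therefore shifted toward NUMA. Because the spikes recur at intervals, we also started thinking in terms of periodic events and wanted to know whether the jitter was positively correlated with kernel or hardware events. So we used "perf stat" to collect latency alongside event counts.
After many experiments, we narrowed the candidates down to two events: minor-faults and context-switches.

At the points where latency jumps, the counts of both minor-faults and context-switches also rise sharply. What causes these two events to surge? Answering that starts with understanding what the two events mean.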
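The kind of correlation check described above can be sketched as follows. This is an assumption-laden illustration, not the tooling used in the investigation: it assumes counts were collected with something like `perf stat -I <ms> -x, -e minor-faults,context-switches`, that each CSV line starts with timestamp, count, unit, event name (worth verifying against your perf version), and that latency has already been aggregated per interval. All sample numbers are made up.

```python
import math
from collections import defaultdict

def parse_perf_intervals(text):
    """Parse interval-mode CSV output of perf stat into {event: [count, ...]}.

    Assumed field order per line: timestamp,count,unit,event,...
    Lines without a numeric count (e.g. "<not counted>") are skipped, which
    shifts later samples; acceptable for a sketch, not for careful analysis.
    """
    series = defaultdict(list)
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        fields = line.strip().split(",")
        if len(fields) < 4:
            continue
        try:
            count = float(fields[1])
        except ValueError:
            continue
        series[fields[3]].append(count)
    return dict(series)

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

if __name__ == "__main__":
    # Made-up numbers: one latency value (ms) and one event count per interval.
    latency_ms = [10.1, 10.3, 205.0, 10.2]
    minor_faults = [120.0, 118.0, 9800.0, 125.0]
    print(f"correlation: {pearson(latency_ms, minor_faults):.3f}")
```

A coefficient close to 1 for one event and near 0 for others is a reasonable hint about which events deserve a closer look, though it does not by itself establish causation.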

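minor-faults: It occurs when the code (or data) needed is actually already in memory, but it isn't allocated to that process/thread. The faults here are minor page faults rather than major page faults, which indicates that the data stayed in memory throughout but was being moved between threads.
context-switches: A context switch (also sometimes referred to as a process switch or a task switch) is the switching of the CPU (central processing unit) from one process or thread to another. So not only the data but also the threads were being moved.

But we had pinned the OMP threads with KMP_AFFINITY, taskset, and numactl when running the workload. By our understanding of these three tools, the threads should not float. Could the kernel have behavior of its own and still place threads and data according to its own policy? Sure enough, we found a kernel feature called Automatic NUMA Balancing, described as follows: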

An application will generally perform best when the threads of its processes are accessing memory on the same NUMA node as the threads are scheduled. Automatic NUMA balancing moves tasks (which can be threads or processes) closer to the memory they are accessing. It also moves application data to memory closer to the tasks that reference it. This is all done automatically by the kernel when automatic NUMA balancing is active.

Automatic NUMA balancing uses a number of algorithms and data structures, which are only active and allocated if automatic NUMA balancing is active on the system:

Periodic NUMA unmapping of process memory

NUMA hinting fault

Migrate-on-Fault (MoF) - moves memory to where the program using it runs

task_numa_placement - moves running programs closer to their memory

Because Automatic NUMA Balancing moves both data and threads, the simultaneous surge in minor-faults and context-switches makes sense.
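On kernels built with NUMA balancing support, `/proc/vmstat` exposes counters such as `numa_hint_faults` and `numa_pages_migrated`, which offer a more direct way to see this activity. The sketch below diffs two snapshots of those counters. The function names and the sample numbers are illustrative assumptions.

```python
NUMA_COUNTERS = (
    "numa_pte_updates",
    "numa_hint_faults",
    "numa_hint_faults_local",
    "numa_pages_migrated",
)

def parse_vmstat(text):
    """Extract the NUMA balancing counters from the text of /proc/vmstat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in NUMA_COUNTERS:
            counters[parts[0]] = int(parts[1])
    return counters

def numa_counter_delta(before, after):
    """Per-counter increase between two snapshots taken with parse_vmstat."""
    return {name: after[name] - before[name] for name in after if name in before}

if __name__ == "__main__":
    # Made-up snapshots standing in for /proc/vmstat read before and after a run.
    before = parse_vmstat("numa_hint_faults 10\nnuma_pages_migrated 5\n")
    after = parse_vmstat("numa_hint_faults 510\nnuma_pages_migrated 205\n")
    print(numa_counter_delta(before, after))
```

On a live system, the two snapshots would be read from `/proc/vmstat` before and after a benchmark run. Counters that keep growing during the run suggest that hinting faults and page migrations are happening while the workload executes.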

[Solution]

$ cat /proc/sys/kernel/numa_balancing
$ sysctl -w kernel.numa_balancing=0
or 
$ echo 0 > /proc/sys/kernel/numa_balancing
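Before launching a benchmark, a harness can verify that the setting is actually off. The following is a minimal sketch with a hypothetical function name. It treats any value other than `0` as enabled and returns `None` when the file is missing, for example on a kernel built without NUMA balancing.

```python
from pathlib import Path

def numa_balancing_enabled(path="/proc/sys/kernel/numa_balancing"):
    """Return True if enabled, False if disabled, None if the sysctl file is absent."""
    sysctl_file = Path(path)
    if not sysctl_file.exists():
        return None
    return sysctl_file.read_text().strip() != "0"

if __name__ == "__main__":
    state = numa_balancing_enabled()
    print({True: "enabled", False: "disabled", None: "not available"}[state])
```

Note that `sysctl -w` and `echo` only change the running kernel. To keep the setting across reboots, the usual approach is a `kernel.numa_balancing = 0` line in `/etc/sysctl.conf` or in a file under `/etc/sysctl.d/`.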

[Results]

Problem solved!

[Further Readings]

Linux kernel profiling with perf
Automatic NUMA Balancing
Context Switch Definition
CPU Frequency Scaling
irqbalance
Understanding page faults and memory swap-in/outs: when should you worry?
Memory Management Unit