PMU : Performance Monitoring Unit
perf_events is an event-oriented observability tool that can help with advanced performance analysis and troubleshooting.
perf_events is part of the Linux kernel, under tools/perf. While it uses many Linux tracing features, some are not yet exposed via the perf command, and need to be used via the ftrace interface instead.
You may wish to measure more events simultaneously than the hardware counters can support (the NMI watchdog may steal one counter, too).
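If that extra counter matters for an experiment, the NMI watchdog can be disabled temporarily through its usual procfs knob (requires root):
echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog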
If there are more events than counters, the kernel uses time multiplexing (switch frequency = HZ, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies to PMU events. With multiplexing, an event is not measured all the time. At the end of the run, the tool scales the count based on total time enabled vs time running. The actual formula is:
final_count = raw_count * time_enabled/time_running
Events are currently managed in round-robin fashion. Therefore each event will eventually get a chance to run. If there are N counters, then up to the first N events on the round-robin list are programmed into the PMU. In certain situations it may be less than that because some events may not be measured together or they compete for the same counter. Furthermore, the perf_events interface allows multiple tools to measure the same thread or CPU at the same time. Each event is added to the same round-robin list. There is no guarantee that all events of a tool are stored sequentially in the list.
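The same scaling can be reproduced from the raw counter interface. Below is a minimal sketch (not the perf tool's own code) that opens one event with perf_event_open(2), reads back time_enabled and time_running, and applies the formula above; error handling is kept to a minimum:
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Thin wrapper: glibc does not export perf_event_open() directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    /* Ask the kernel to report how long the event was scheduled on a counter. */
    attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
                       PERF_FORMAT_TOTAL_TIME_RUNNING;

    int fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the workload being measured here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    struct { uint64_t value, time_enabled, time_running; } r;
    if (read(fd, &r, sizeof(r)) != sizeof(r)) { perror("read"); return 1; }

    /* Same scaling that perf stat applies when the event was multiplexed. */
    double final_count = r.time_running ?
        (double)r.value * r.time_enabled / r.time_running : 0.0;
    printf("raw=%llu scaled=%.0f\n", (unsigned long long)r.value, final_count);
    return 0;
}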
A simple experiment; from the scaling percentages below you can see that four counters are available:
perf stat -a -e cycles,cycles,cycles,cycles,cycles
^C
Performance counter stats for 'system wide':
49,606,709,293 cycles (80.04%)
49,542,219,044 cycles (80.04%)
49,512,868,392 cycles (80.03%)
49,528,694,848 cycles (80.02%)
49,539,412,782 cycles (80.01%)
perf stat -a -e cycles,cycles,cycles,cycles,cycles,cycles
^C
Performance counter stats for 'system wide':
22,251,237,320 cycles (66.86%)
22,232,407,299 cycles (66.81%)
22,232,703,664 cycles (66.77%)
22,233,091,623 cycles (66.74%)
22,193,679,137 cycles (66.70%)
22,163,799,967 cycles (66.70%)
There’s a problem with event profiling that you don’t really encounter with CPU profiling (timed sampling). With timed sampling, it doesn’t matter if there was a small sub-microsecond delay between the interrupt and reading the instruction pointer (IP). Some CPU profilers introduce this jitter on purpose, as another way to avoid lockstep sampling. But for event profiling, it does matter: if you’re trying to capture the IP on some PMC event, and there’s a delay between the PMC overflow and capturing the IP, then the IP will point to the wrong address. This is skew. Another contributing problem is that micro-ops are processed in parallel and out-of-order, while the instruction pointer points to the resumption instruction, not the instruction that caused the event. I’ve talked about this before.
The solution is “precise sampling”, which on Intel is PEBS (Precise Event-Based Sampling), and on AMD it is IBS (Instruction-Based Sampling). These use CPU hardware support to capture the real state of the CPU at the time of the event. perf can use precise sampling by adding a :p modifier to the PMC event name, eg, “-e instructions:p”. The more p’s, the more accurate. Here are the docs from tools/perf/Documentation/perf-list.txt:
The 'p' modifier can be used for specifying how precise the instruction
address should be. The 'p' modifier can be specified multiple times:
0 - SAMPLE_IP can have arbitrary skid
1 - SAMPLE_IP must have constant skid
2 - SAMPLE_IP requested to have 0 skid
3 - SAMPLE_IP must have 0 skid
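For example, to request PEBS-assisted sampling of cycles with "requested to have 0 skid" precision (the workload name here is just a placeholder):
perf record -e cycles:pp -- ./myapp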
perf record produces a perf.data file in the current directory (if the file already exists, the old one is renamed to perf.data.old) that records the raw profiling data. A subsequent perf report reads it and prints the aggregated results. perf.data contains only raw samples; perf report also needs the local symbol tables, the pid-to-process mapping and so on to build the report, so perf.data cannot simply be copied to another machine and analyzed there.
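A minimal round trip looks like this (./myapp is a placeholder for the program being profiled, and perf report is run on the same machine that produced perf.data):
perf record -g -- ./myapp
perf report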
perf record supports three stack-unwinding methods: fp, dwarf and lbr. The method is chosen with the --call-graph option; -g is shorthand for --call-graph fp.
fp is the frame pointer, i.e. the EBP/RBP register on x86. fp points to the base of the current stack frame, where the previous frame's frame-pointer value is saved, so the call chain can be walked frame by frame starting from fp. However, GCC optimizes the frame pointer away by default: unless -fno-omit-frame-pointer is given, EBP is treated as an ordinary general-purpose register, and unwinding based on it is simply wrong. Even after specifying -fno-omit-frame-pointer I still could not get correct call stacks; according to the GCC manual, the option does not guarantee that every function call actually uses the frame pointer, so fp-based unwinding had to be abandoned.
dwarf is a debug-information format; the DWARF data emitted when compiling with GCC's -g flag contains everything needed for stack unwinding, and libunwind can unwind it. For a further introduction see the article "关于DWARF"; notably, GDB also relies on DWARF debug info when it unwinds the stack. In practice, dwarf mode recovered accurate call stacks reliably.
lbr stands for Last Branch Records, a set of hardware registers on newer Intel CPUs that record the addresses of the most recent branch jumps, mainly to support profiling tools like perf; see "An introduction to last branch records" and "Advanced usage of last branch records" for details. It is the fastest and most accurate method, but it has one big limitation: because the hardware ring buffer of registers is finite, the stack depth lbr can record is also limited. The exact limit depends on the CPU implementation, typically 32 entries; stacks deeper than that come back wrong.
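A quick way to compare the three methods on the same binary (the program name is a placeholder; dwarf needs the binary built with -g, and fp only has a chance of working with -fno-omit-frame-pointer):
gcc -O2 -g -fno-omit-frame-pointer -o myapp myapp.c
perf record --call-graph fp -- ./myapp
perf record --call-graph dwarf -- ./myapp
perf record --call-graph lbr -- ./myapp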
Software events may have a default period. This means that when you use them for sampling, you’re sampling a subset of events, not tracing every event. You can check with perf record -vv.
Sampling a subset by default may be a good thing, especially for high frequency events like context switches.
To specify a fixed sampling period instead, use the -c option. For instance, to collect a sample every 10,000 occurrences of the cycles event on CPU 0 only:
perf record -c 10000 -e cycles:ppp -C 0 sleep 5
[ perf record: Woken up 257 times to write data ]
[ perf record: Captured and wrote 66.366 MB perf.data (1682147 samples) ]
Samples: 1M of event 'cycles:ppp', Event count (approx.): 16821470000
Overhead Command Shared Object Symbol
-n, --show-nr-samples
Show a column with the number of samples
-C, --cpu
Collect samples only on the list of CPUs provided. Multiple
CPUs can be provided as a comma-separated list with no space:
0,1. Ranges of CPUs are specified with -: 0-2. In per-thread
mode with inheritance mode on (default), samples are captured
only when the thread executes on the designated CPUs. Default
is to monitor all CPUs.
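For example, to add the sample-count column and only report samples taken on CPU 0, using the text output mode:
perf report -n -C 0 --stdio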
Default event: cycle counting
By default, perf record uses the cycles event as the sampling event. This is a generic hardware event that is mapped to a hardware-specific PMU event by the kernel. For Intel, it is mapped to UNHALTED_CORE_CYCLES. This event does not maintain a constant correlation to time in the presence of CPU frequency scaling. Intel provides another event, called UNHALTED_REFERENCE_CYCLES but this event is NOT currently available with perf_events.
On AMD systems, the event is mapped to CPU_CLK_UNHALTED and this event is also subject to frequency scaling. On any Intel or AMD processor, the cycle event does not count when the processor is idle, i.e., when it calls mwait().
Hardware event sampling is performed in the following way. PERF configures the hardware performance counters to count the selected events. It also configures each counter to generate an interrupt after the occurrence of the number of events specified by the sampling period. PERF starts the counters and launches the workload. When the sampling period expires for a counter, it generates an interrupt. PERF then reads the restart program counter value from the interrupt stack and writes the PC value (and other information) to a sample buffer. PERF writes the sample buffer to the profile data file when the buffer is full. Finally, PERF re-arms (no pun intended) the performance counter and sets the sampling period after making any needed adjustments for sampling frequency. This whole process continues until the workload completes and PERF disables the hardware performance counters.
When we decrease the sampling period, we increase the number of sampling interrupts that must be handled, and therefore the workload perturbation. The core cannot execute the workload while it is handling an interrupt, so more CPU cycles are “stolen” from the workload when the sampling period is shortened. A shorter sampling period increases statistical accuracy, but it comes at the cost of a longer workload elapsed time.
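In practice it is often easier to let perf pick the period: the -F option targets an average sampling rate and adjusts the period automatically. A common whole-system profile, for example:
perf record -F 99 -a -g -- sleep 10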
l2_rqsts.all_demand_miss
l2_rqsts.all_pf
l2_rqsts.demand_data_rd_hit
l2_rqsts.demand_data_rd_miss
l2_rqsts.miss
l2_rqsts.pf_miss
l2_rqsts.rfo_miss
l2_trans.l2_wb
l2_rqsts.miss
[All requests that miss L2 cache]
# perf stat -e l2_rqsts.miss -I 1000 -C 0 sleep 10
# time counts unit events
1.000356222 213,888 l2_rqsts.miss
2.001538827 169,999 l2_rqsts.miss
3.001849629 840,332 l2_rqsts.miss
4.002167643 182,219 l2_rqsts.miss
5.002446657 217,416 l2_rqsts.miss
6.002611179 1,258,836 l2_rqsts.miss
7.002893517 1,123,108 l2_rqsts.miss
8.003138038 1,014,948 l2_rqsts.miss
9.003384195 1,173,721 l2_rqsts.miss
10.003566874 418,776 l2_rqsts.miss
10.004632920 13,491 l2_rqsts.miss
For example, with sw_prefetch_access.t1_t2: insert the following instruction at the relevant spot in the code, usually inside the for loop:
_mm_prefetch(memory_address_, _MM_HINT_T2);
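A minimal sketch of what that insertion can look like (the array name and prefetch distance are made up for illustration; _mm_prefetch is declared in <xmmintrin.h>):
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T2 */
#include <stddef.h>

/* Sum an array while prefetching a few cache lines ahead with the T2 hint. */
double sum_with_prefetch(const double *data, size_t n)
{
    const size_t dist = 64;   /* prefetch distance in elements, tune per workload */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            _mm_prefetch((const char *)&data[i + dist], _MM_HINT_T2);
        sum += data[i];
    }
    return sum;
}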
Then use perf stat to capture the counter values; the experiment shows the effect is quite noticeable:
# perf stat -e inst_retired.any,inst_retired.any_p,sw_prefetch_access.t0,sw_prefetch_access.t1_t2 -p $! sleep 5
Performance counter stats for process id '144495':
32,733,406,923 inst_retired.any
32,733,698,727 inst_retired.any_p
806,260 sw_prefetch_access.t0
902,886,959 sw_prefetch_access.t1_t2
5.005418935 seconds time elapsed
vs. the same benchmark with AVX removed in the fixed-width-types path:
Performance counter stats for process id '169232':
16,805,056,337 inst_retired.any
16,805,127,576 inst_retired.any_p
808,181 sw_prefetch_access.t0
391,681,924 sw_prefetch_access.t1_t2
5.005929544 seconds time elapsed
inst_retired.any here indicates that the instruction stream is not being fully utilized and the task is frequently being switched out.
pidstat can report the minflt/s and majflt/s page-fault statistics:
taskset -c 2 ../debug/src/benchmarks/GoogleBenchmarkColumnarToRow & pidstat -r -u -p $! -t 1 &
perf record --call-graph lbr ./GoogleBenchmarkColumnarToRow
perf report -v -n -g
# or simply use -g:
perf report -g
The various events supported by perf can be listed with perf list:
# perf list
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
...
The stub generated in the code (text) segment is the Procedure Linkage Table (PLT), while the placeholder slots in the data segment form the Global Offset Table (GOT).
#include <stdio.h>

int main()
{
    printf("hello");
    printf("hello again");
    return 0;
}
Viewing the result with objdump -d:
0000000000400526 <main>:
  400526:  55                    push   %rbp
  400527:  48 89 e5              mov    %rsp,%rbp
  40052a:  bf d4 05 40 00        mov    $0x4005d4,%edi
  40052f:  b8 00 00 00 00        mov    $0x0,%eax
  400534:  e8 c7 fe ff ff        callq  400400 <printf@plt>
  400539:  bf da 05 40 00        mov    $0x4005da,%edi
  40053e:  b8 00 00 00 00        mov    $0x0,%eax
  400543:  e8 b8 fe ff ff        callq  400400 <printf@plt>
  400548:  b8 00 00 00 00        mov    $0x0,%eax
  40054d:  5d                    pop    %rbp
  40054e:  c3                    retq
  40054f:  90                    nop
The linker sees that printf is defined in a shared library, so it resolves the printf call to printf@plt at address 0x400400.
callq is just the call instruction (the suffix indicates the operand size: q is 8 bytes, l is 4 bytes, w is 2 bytes). The call instruction tells the CPU to jump to the given address and execute the new function.
jmpq is likewise the jmp instruction; the q suffix is GNU assembler notation meaning the jump target is a 64-bit address, while l denotes a 32-bit address.
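To see the PLT stubs and the GOT relocations described above on the same binary (a.out is assumed as the output name), standard objdump invocations suffice:
objdump -d -j .plt a.out    # disassemble the PLT stubs
objdump -R a.out            # dynamic relocations that fill in the GOT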
https://blog.csdn.net/hknaruto/article/details/126778632