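IACA (Intel® Architecture Code Analyzer) is a static code analysis tool from Intel. It reports the data dependencies, throughput, and latency of a piece of code, which helps a great deal with understanding how the CPU executes it and with performance tuning.
To analyze a program you must insert special markers into the code; iaca locates the marked region and analyzes it statically. The usual way to do this is with the macros in the iacaMarks.h header that Intel provides. Usage: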
while ( condition )
{
    IACA_START
    /* loop body to analyze */
}
IACA_END
The macros expand to inline assembly (or an intrinsic). For example, IACA_START expands to the following (MSVC syntax):
__asm mov ebx, 111
__asm _emit 0x64
__asm _emit 0x67
__asm _emit 0x90
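I now need to analyze a piece of Pony code, but Pony does not support inline assembly, and a call into a C library through FFI cannot be inlined either. One solution is to add an intrinsic to the Pony compiler, but that is a fair amount of work, so I took a shortcut instead.
First, add placeholder marker calls to the code: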
while i < size do
  IACA.start()
  p(i)? = p(i)? xor mask_key(i % 4)?
  i = i + 1
end
IACA.stop()
IACA is defined like this:
primitive IACA
  fun start(): None => None
  fun stop(): None => None
Compile to LLVM IR:
ponyc . -r=ir -d
Debug mode is used because it makes the replacement easier. Open the generated LLVM IR file and find the calls to IACA.start() and IACA.stop():
;
Right after the call to start, add this inline asm:
tail call void asm sideeffect ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90",""()
Right before the call to stop, add this inline asm:
tail call void asm sideeffect ".byte 0xbb, 0xde, 0, 0, 0, 0x64, 0x67, 0x90",""()
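The two .byte sequences are the machine-code encoding of the markers: 0xbb is the opcode for mov ebx, imm32, followed by the 32-bit immediate in little-endian order (0x6f = 111 for start, 0xde = 222 for stop), then the three bytes 0x64, 0x67, 0x90. A minimal C sketch that rebuilds them is below; the names iaca_marker_bytes and IACA_MARKER_LEN are hypothetical and are not part of iacaMarks.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { IACA_MARKER_LEN = 8 };

/* Hypothetical helper, not part of iacaMarks.h: writes the 8 marker bytes
 * for the given marker id (111 for start, 222 for stop) into out. */
void iaca_marker_bytes(uint32_t id, uint8_t out[IACA_MARKER_LEN])
{
    assert(out != NULL);

    out[0] = 0xbb;                           /* opcode: mov ebx, imm32 */
    for (size_t i = 0; i < 4; i++)           /* imm32, little-endian   */
        out[1 + i] = (uint8_t)(id >> (8 * i));
    out[5] = 0x64;                           /* fs segment prefix      */
    out[6] = 0x67;                           /* address-size prefix    */
    out[7] = 0x90;                           /* nop                    */
}
```

Calling iaca_marker_bytes(111, buf) fills buf with the bytes used after the start call, and iaca_marker_bytes(222, buf) with those used before the stop call.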
Then compile with clang -O3 -c to get an object file, which iaca can analyze. Here is the result for the Pony code above:
Throughput Analysis Report
--------------------------
Block Throughput: 2.75 Cycles Throughput Bottleneck: Dependency chains
Loop Count: 23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
--------------------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 2.5 2.5 | 2.5 2.5 | 1.0 | 1.0 | 1.0 | 0.0 |
--------------------------------------------------------------------------------------------------
DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
| 1* | | | | | | | | | mov esi, edi
| 1 | | | | | | 1.0 | | | and esi, 0x3
| 2^ | | | 0.5 0.5 | 0.5 0.5 | | | 1.0 | | cmp qword ptr [rcx+0x8], rsi
| 0*F | | | | | | | | | jbe 0x2a
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | mov rbx, qword ptr [rcx+0x18]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | movzx ebx, byte ptr [rbx+rsi*1]
| 4 | 1.0 | | 1.0 1.0 | 1.0 1.0 | 1.0 | | | | xor byte ptr [rdx+rdi*1], bl
| 1 | | 1.0 | | | | | | | inc rdi
| 1* | | | | | | | | | cmp rdi, rax
| 0*F | | | | | | | | | jb 0xffffffffffffffdc
Total Num Of Uops: 12
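To read the instruction listing it may help to compare it with a rough C equivalent of the Pony loop. This is only a sketch: the function name xor_mask and its signature are assumptions, and it has none of Pony's runtime checks. In the listing, and esi, 0x3 corresponds to i % 4, the movzx loads the mask byte, and xor byte ptr [rdx+rdi*1], bl is the read-modify-write on p[i]. The cmp qword ptr [rcx+0x8], rsi / jbe pair looks like the bounds check that Pony emits for the partial call mask_key(i % 4)?, which the C version does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C counterpart of the Pony loop: XOR every byte of p with
 * the 4-byte mask key, cycling through the key. No bounds checks. */
void xor_mask(uint8_t *p, size_t size, const uint8_t mask_key[4])
{
    assert(p != NULL || size == 0);
    assert(mask_key != NULL);

    for (size_t i = 0; i < size; i++)
        p[i] ^= mask_key[i % 4];
}
```

Applying the function twice with the same key restores the original buffer, which is a quick sanity check for a loop like this.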