Adding IACA markers to a Pony program (Part 1)

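IACA (Intel® Architecture Code Analyzer) is a static code analysis tool from Intel. It reports the data dependencies, throughput, and latency of a block of code, which helps a great deal in understanding how the CPU executes it and in optimizing performance.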

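To analyze a program, you must insert specific markers into the code. iaca locates the marked region and analyzes it statically. The usual way to do this is with the macros in Intel's iacaMarks.h, used like this: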

while ( condition )
{
    IACA_START
    
}
IACA_END

The macros expand to inline assembly (or an intrinsic). For example, IACA_START expands to:

__asm  mov ebx, 111
__asm  _emit 0x64
__asm  _emit 0x67
__asm  _emit 0x90
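
To make the marker concrete, here is a minimal Python sketch that builds the raw bytes for the start and end markers. It assumes the standard x86 encoding of `mov ebx, imm32` (opcode 0xBB followed by a little-endian 32-bit immediate) and that the end marker uses the value 222, as in iacaMarks.h. The helper name `iaca_marker` is made up for illustration.

```python
import struct

def iaca_marker(value: int) -> bytes:
    """Return the byte sequence of an IACA marker carrying `value` in ebx."""
    # 0xBB is the opcode for `mov ebx, imm32`; the immediate is little-endian.
    mov_ebx = b"\xbb" + struct.pack("<I", value)
    # 0x64 (fs segment prefix), 0x67 (address-size prefix), 0x90 (nop):
    # a prefixed nop that iaca recognizes as the marker signature.
    return mov_ebx + b"\x64\x67\x90"

START = iaca_marker(111)  # IACA_START
END = iaca_marker(222)    # IACA_END

print(START.hex(" "))
print(END.hex(" "))
```

These are the same eight bytes that a `.byte` directive has to emit when the marker cannot be written as inline assembly in the source language.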

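Now I need to analyze a piece of Pony code. Pony does not support inline assembly, and a call into a C library through FFI cannot be inlined either. One solution is to add an intrinsic to the Pony compiler, but that is a fair amount of work, so I took a shortcut instead: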

First, add marker calls to the code:

while i < size do
  IACA.start()
  p(i)? = p(i)? xor mask_key(i % 4)?
  i = i + 1
end
IACA.stop()

IACA is defined as a primitive with two empty functions:

primitive IACA
  fun start(): None => None

  fun stop(): None => None

Compile to LLVM IR:

ponyc . -r=ir -d

Debug mode is used so that the calls are easy to find and replace. Open the generated LLVM IR file and locate the calls to IACA.start() and IACA.stop():

; 

After the call to start, add this inline asm:

tail call void asm sideeffect ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90",""()

Before the call to stop, add this inline asm:

tail call void asm sideeffect ".byte 0xbb, 0xde, 0, 0, 0, 0x64, 0x67, 0x90",""()
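
The manual edit above can be scripted. The following is a rough Python sketch, not a finished tool: it inserts the start marker after every matching start call and the end marker before every matching stop call. The function names in the sample IR (`@IACA_start_hypothetical`, `@IACA_stop_hypothetical`) and the regular expressions are placeholders; the real mangled names have to be read from the IR that ponyc generates.

```python
import re

# The two inline-asm lines, copied from the text above.
START_ASM = 'tail call void asm sideeffect ".byte 0xbb, 0x6f, 0, 0, 0, 0x64, 0x67, 0x90",""()'
STOP_ASM = 'tail call void asm sideeffect ".byte 0xbb, 0xde, 0, 0, 0, 0x64, 0x67, 0x90",""()'

def patch_ir(ir_text: str, start_pat: str, stop_pat: str) -> str:
    """Insert IACA marker asm around calls matching the given patterns."""
    out = []
    for line in ir_text.splitlines():
        stripped = line.lstrip()
        indent = line[:len(line) - len(stripped)]
        # Only touch call instructions, not the function definitions.
        is_call = re.search(r"\bcall\b", stripped) is not None
        if is_call and re.search(stop_pat, stripped):
            out.append(indent + STOP_ASM)   # end marker goes before stop()
            out.append(line)
        elif is_call and re.search(start_pat, stripped):
            out.append(line)
            out.append(indent + START_ASM)  # start marker goes after start()
        else:
            out.append(line)
    return "\n".join(out) + "\n"

# Hypothetical IR fragment; real ponyc output uses different names.
sample = """\
  call void @IACA_start_hypothetical(ptr %0)
  %x = add i64 %i, 1
  call void @IACA_stop_hypothetical(ptr %0)
"""
patched = patch_ir(sample, r"@IACA\w*start", r"@IACA\w*stop")
print(patched)
```

Taking the patterns as parameters keeps the sketch independent of how the compiler mangles names. It is worth checking the patched file by hand before compiling it, since a pattern that is too broad would mark the wrong region.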

Then compile the modified IR with clang -O3 -c to get an object file, which iaca can analyze. Here is the result for the Pony code above:

Throughput Analysis Report
--------------------------
Block Throughput: 2.75 Cycles       Throughput Bottleneck: Dependency chains
Loop Count:  23
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  1.0     0.0  |  1.0  |  2.5     2.5  |  2.5     2.5  |  1.0  |  1.0  |  1.0  |  0.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1*     |             |      |             |             |      |      |      |      | mov esi, edi
|   1      |             |      |             |             |      | 1.0  |      |      | and esi, 0x3
|   2^     |             |      | 0.5     0.5 | 0.5     0.5 |      |      | 1.0  |      | cmp qword ptr [rcx+0x8], rsi
|   0*F    |             |      |             |             |      |      |      |      | jbe 0x2a
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | mov rbx, qword ptr [rcx+0x18]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx ebx, byte ptr [rbx+rsi*1]
|   4      | 1.0         |      | 1.0     1.0 | 1.0     1.0 | 1.0  |      |      |      | xor byte ptr [rdx+rdi*1], bl
|   1      |             | 1.0  |             |             |      |      |      |      | inc rdi
|   1*     |             |      |             |             |      |      |      |      | cmp rdi, rax
|   0*F    |             |      |             |             |      |      |      |      | jb 0xffffffffffffffdc
Total Num Of Uops: 12
