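FLOPs is a key metric for the computational cost of a scientific program: it is the number of floating-point operations the program performs over one complete run. Here I use Perf, the Linux performance profiling tool, to measure a program's FLOPs.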
First, install perf. On Ubuntu/Debian:
apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`
On CentOS:
yum install perf
Next, download and build libpfm4, whose example programs are used below to look up event codes:
git clone git://perfmon2.git.sourceforge.net/gitroot/perfmon2/libpfm4
cd libpfm4
make
Go into the examples folder and run the showevtinfo program to see which events are related to FLOPs. On my machine, I found the following events:
IDX : 419430470
PMU name : skl (Intel Skylake)
Name : FP_ARITH_INST_RETIRED
Equiv : None
Flags : None
Desc : Floating-point instructions retired
Code : 0xc7
Umask-00 : 0x01 : PMU : [SCALAR_DOUBLE] : None : Number of scalar double precision floating-point arithmetic instructions (multiply by 1 to get flops)
Umask-01 : 0x02 : PMU : [SCALAR_SINGLE] : None : Number of scalar single precision floating-point arithmetic instructions (multiply by 1 to get flops)
Umask-02 : 0x04 : PMU : [128B_PACKED_DOUBLE] : None : Number of scalar 128-bit packed double precision floating-point arithmetic instructions (multiply by 2 to get flops)
Umask-03 : 0x08 : PMU : [128B_PACKED_SINGLE] : None : Number of scalar 128-bit packed single precision floating-point arithmetic instructions (multiply by 4 to get flops)
Umask-04 : 0x10 : PMU : [256B_PACKED_DOUBLE] : None : Number of scalar 256-bit packed double precision floating-point arithmetic instructions (multiply by 4 to get flops)
Umask-05 : 0x20 : PMU : [256B_PACKED_SINGLE] : None : Number of scalar 256-bit packed single precision floating-point arithmetic instructions (multiply by 8 to get flops)
Umask-06 : 0x40 : PMU : [512B_PACKED_DOUBLE] : None : Number of scalar 512-bit packed double precision floating-point arithmetic instructions (multiply by 8 to get flops)
Umask-07 : 0x80 : PMU : [512B_PACKED_SINGLE] : None : Number of scalar 512-bit packed single precision floating-point arithmetic instructions (multiply by 16 to get flops)
Modif-00 : 0x00 : PMU : [k] : monitor at priv level 0 (boolean)
Modif-01 : 0x01 : PMU : [u] : monitor at priv level 1, 2, 3 (boolean)
Modif-02 : 0x02 : PMU : [e] : edge level (may require counter-mask >= 1) (boolean)
Modif-03 : 0x03 : PMU : [i] : invert (boolean)
Modif-04 : 0x04 : PMU : [c] : counter-mask in range [0-255] (integer)
Modif-05 : 0x05 : PMU : [t] : measure any thread (boolean)
Modif-06 : 0x07 : PMU : [intx] : monitor only inside transactional memory region (boolean)
Modif-07 : 0x08 : PMU : [intxcp] : do not count occurrences inside aborted transactional memory region (boolean)
#-----------------------------
IDX : 419430469
PMU name : skl (Intel Skylake)
Name : FP_ARITH
Equiv : FP_ARITH_INST_RETIRED
Flags : None
Desc : Floating-point instructions retired
Code : 0xc7
Umask-00 : 0x01 : PMU : [SCALAR_DOUBLE] : None : Number of scalar double precision floating-point arithmetic instructions (multiply by 1 to get flops)
Umask-01 : 0x02 : PMU : [SCALAR_SINGLE] : None : Number of scalar single precision floating-point arithmetic instructions (multiply by 1 to get flops)
Umask-02 : 0x04 : PMU : [128B_PACKED_DOUBLE] : None : Number of scalar 128-bit packed double precision floating-point arithmetic instructions (multiply by 2 to get flops)
Umask-03 : 0x08 : PMU : [128B_PACKED_SINGLE] : None : Number of scalar 128-bit packed single precision floating-point arithmetic instructions (multiply by 4 to get flops)
Umask-04 : 0x10 : PMU : [256B_PACKED_DOUBLE] : None : Number of scalar 256-bit packed double precision floating-point arithmetic instructions (multiply by 4 to get flops)
Umask-05 : 0x20 : PMU : [256B_PACKED_SINGLE] : None : Number of scalar 256-bit packed single precision floating-point arithmetic instructions (multiply by 8 to get flops)
Umask-06 : 0x40 : PMU : [512B_PACKED_DOUBLE] : None : Number of scalar 512-bit packed double precision floating-point arithmetic instructions (multiply by 8 to get flops)
Umask-07 : 0x80 : PMU : [512B_PACKED_SINGLE] : None : Number of scalar 512-bit packed single precision floating-point arithmetic instructions (multiply by 16 to get flops)
Modif-00 : 0x00 : PMU : [k] : monitor at priv level 0 (boolean)
Modif-01 : 0x01 : PMU : [u] : monitor at priv level 1, 2, 3 (boolean)
Modif-02 : 0x02 : PMU : [e] : edge level (may require counter-mask >= 1) (boolean)
Modif-03 : 0x03 : PMU : [i] : invert (boolean)
Modif-04 : 0x04 : PMU : [c] : counter-mask in range [0-255] (integer)
Modif-05 : 0x05 : PMU : [t] : measure any thread (boolean)
Modif-06 : 0x07 : PMU : [intx] : monitor only inside transactional memory region (boolean)
Modif-07 : 0x08 : PMU : [intxcp] : do not count occurrences inside aborted transactional memory region (boolean)
#-----------------------------
IDX : 419430414
PMU name : skl (Intel Skylake)
Name : FP_ASSIST
Equiv : None
Flags : None
Desc : X87 floating-point assists
Code : 0xca
Umask-00 : 0x1001e : PMU : [ANY] : [default] : Cycles with any input/output SEE or FP assists
Modif-00 : 0x00 : PMU : [k] : monitor at priv level 0 (boolean)
Modif-01 : 0x01 : PMU : [u] : monitor at priv level 1, 2, 3 (boolean)
Modif-02 : 0x02 : PMU : [e] : edge level (may require counter-mask >= 1) (boolean)
Modif-03 : 0x03 : PMU : [i] : invert (boolean)
Modif-04 : 0x04 : PMU : [c] : counter-mask in range [0-255] (integer)
Modif-05 : 0x05 : PMU : [t] : measure any thread (boolean)
Modif-06 : 0x07 : PMU : [intx] : monitor only inside transactional memory region (boolean)
Modif-07 : 0x08 : PMU : [intxcp] : do not count occurrences inside aborted transactional memory region (boolean)
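The listing above attaches a FLOPs multiplier to each Umask of FP_ARITH_INST_RETIRED. As a rough sketch, assuming the output keeps the line format shown here, those multipliers could be pulled out with a small Python helper. The function name parse_flop_multipliers is made up for illustration and is not part of libpfm4.

```python
import re

# Matches Umask lines such as:
#   Umask-02 : 0x04 : PMU : [128B_PACKED_DOUBLE] : None : ... (multiply by 2 to get flops)
UMASK_RE = re.compile(
    r"Umask-\d+\s*:\s*(0x[0-9a-fA-F]+)\s*:\s*PMU\s*:\s*\[(\w+)\].*"
    r"\(multiply by (\d+) to get flops\)"
)


def parse_flop_multipliers(text):
    """Return {umask_name: (umask_value, flops_per_instruction)}.

    Lines without a "multiply by N to get flops" note (Modif lines,
    FP_ASSIST, and so on) are skipped.
    """
    result = {}
    for line in text.splitlines():
        match = UMASK_RE.search(line)
        if match:
            value, name, multiplier = match.groups()
            result[name] = (int(value, 16), int(multiplier))
    return result


# Two lines copied from the showevtinfo output above.
sample = (
    "Umask-00 : 0x01 : PMU : [SCALAR_DOUBLE] : None : Number of scalar double "
    "precision floating-point arithmetic instructions (multiply by 1 to get flops)\n"
    "Umask-05 : 0x20 : PMU : [256B_PACKED_SINGLE] : None : Number of scalar 256-bit "
    "packed single precision floating-point arithmetic instructions "
    "(multiply by 8 to get flops)\n"
)

if __name__ == "__main__":
    for name, (value, multiplier) in parse_flop_multipliers(sample).items():
        print(f"{name}: umask={value:#04x}, flops per instruction={multiplier}")
```

Because FP_ARITH is listed as an equivalent of FP_ARITH_INST_RETIRED, running the helper over the full listing yields the same eight entries for both.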
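In the same directory, run the check_events program to get the raw code for each event. Its arguments are the Name and Umask values found in the previous step. My command was: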
./check_events FP_ARITH_INST_RETIRED:SCALAR_SINGLE FP_ARITH:SCALAR_SINGLE FP_ASSIST
This gives the following output:
Requested Event: FP_ARITH_INST_RETIRED:SCALAR_SINGLE
Actual Event: skl::FP_ARITH_INST_RETIRED:SCALAR_SINGLE:k=1:u=1:e=0:i=0:c=0:t=0:intx=0:intxcp=0
PMU : Intel Skylake
IDX : 419430470
Codes : 0x5302c7
Requested Event: FP_ARITH:SCALAR_SINGLE
Actual Event: skl::FP_ARITH_INST_RETIRED:SCALAR_SINGLE:k=1:u=1:e=0:i=0:c=0:t=0:intx=0:intxcp=0
PMU : Intel Skylake
IDX : 419430470
Codes : 0x5302c7
Requested Event: FP_ASSIST
Actual Event: skl::FP_ASSIST:ANY:k=1:u=1:e=0:i=0:c=1:t=0:intx=0:intxcp=0
PMU : Intel Skylake
IDX : 419430414
Codes : 0x1531eca
The Codes value in each result is the event code we need.
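To see how a Codes value relates to the Code, Umask and modifiers printed earlier, here is a small sketch that splits it into fields. It assumes the standard Intel event-select register layout (event in bits 0-7, unit mask in bits 8-15, flag bits above that, counter mask in bits 24-31). The function name decode_raw_code is hypothetical.

```python
def decode_raw_code(code):
    """Split a raw Intel event-select value into its fields."""
    return {
        "event": code & 0xFF,            # bits 0-7: event select (Code)
        "umask": (code >> 8) & 0xFF,     # bits 8-15: unit mask (Umask)
        "usr": (code >> 16) & 1,         # bit 16: count in user mode (u)
        "os": (code >> 17) & 1,          # bit 17: count in kernel mode (k)
        "edge": (code >> 18) & 1,        # bit 18: edge detect (e)
        "any_thread": (code >> 21) & 1,  # bit 21: measure any thread (t)
        "enable": (code >> 22) & 1,      # bit 22: counter enable
        "invert": (code >> 23) & 1,      # bit 23: invert counter mask (i)
        "cmask": (code >> 24) & 0xFF,    # bits 24-31: counter mask (c)
    }


if __name__ == "__main__":
    # The two codes reported by check_events above.
    for raw in (0x5302C7, 0x1531ECA):
        fields = decode_raw_code(raw)
        print(hex(raw), {key: hex(value) for key, value in fields.items()})
```

For 0x5302c7 the low byte is the event code 0xc7 and the next byte is the SCALAR_SINGLE umask 0x02. For 0x1531eca the top byte is 1, which matches the c=1 modifier shown for FP_ASSIST.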
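Pick the program you want to measure, then run it under perf stat, passing each event code as a raw event (the letter r followed by the hex code without the 0x prefix). For example: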
sudo perf stat -e r5302c7 -e r1531eca ./example.py
The output looks like this:
Performance counter stats for './example.py':
13,061,638 r5302c7
1 r1531eca
1.834101748 seconds time elapsed
1.888016000 seconds user
0.231023000 seconds sys
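Here, the value next to r5302c7 is the number of scalar single-precision floating-point arithmetic instructions the program retired. Since its multiplier is 1, that is also the FLOPs contributed by those instructions. If the program also uses double-precision or packed (SIMD) instructions, the other Umask events must be counted as well, each weighted by its multiplier, to get the program's total FLOPs.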