SoC performance benchmark

Preface

This article describes the programs used to benchmark SoC (including SMP) performance, along with the steps to build and run them. At the end, I provide two scripts that make the benchmarking work more efficient.

These benchmark programs evaluate integer and floating-point performance as well as the latency of the L1 and L2 caches. All of the tools can be downloaded from the Internet, and some of them come from lmbench. For lmbench, see my previous blog post (in Chinese), "ARM Linux BenchMark", and the GitHub repo that accompanies it:

https://github.com/tonyho/ARM_BenchMark

In addition, if you want to compare the SoC in a phone with an ARM Linux board, you can do the following:

①Install the benchmark APKs on the Android phone and run them (Roy Longbottom has collected and modified many benchmark tools for Android)

②Then use the tools in the repo below to run the benchmarks on the ARM Linux board:

https://github.com/tonyho/ARM-MP-BenchMark

③Compare the results

1. Integer BenchMark: CoreMark(version:1.01)

compile:

Download the CoreMark source from http://www.eembc.org/

①Compile the source code for a single-core CPU:
arm-poky-linux-gnueabi-gcc -c -march=armv7-a -mfloat-abi=hard -mfpu=neon -mtune=cortex-a15 -I./ -Isimple -DITERATIONS=0 -DSEED_METHOD=SEED_ARG -DCOMPILER_FLAGS=\""-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15-Os\"" -Os core_main.c core_list_join.c core_matrix.c core_state.c core_util.c simple/core_portme.c

Link:

arm-poky-linux-gnueabi-gcc core_main.o core_list_join.o core_matrix.o core_state.o core_util.o core_portme.o -o coremark -lc

For static link:

arm-poky-linux-gnueabi-gcc core_main.o core_list_join.o core_matrix.o core_state.o core_util.o core_portme.o -o coremark.static -lc -static
②Compile the source code for a multicore CPU:
cp linux/ -r arm_ti

# Set CC and LD to the cross-compile toolchain's gcc

gvim arm_ti/core_portme.mak

# Build CoreMark:

make PORT_DIR=./arm_ti/ XCFLAGS="-DMULTITHREAD=4 -DUSE_FORK=1"
make PORT_DIR=./arm_ti/ REBUILD=1
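
The manual edit of core_portme.mak can also be done non-interactively. The following is only a sketch: it assumes the file assigns the tools on lines that start with CC and LD followed by "=" (check your copy first), and set_toolchain is a hypothetical helper name, not part of CoreMark.

```shell
#!/bin/sh
# set_toolchain FILE COMPILER
# Hypothetical helper: rewrites the CC and LD assignments in a port makefile
# so that both point at the given cross compiler.
# Assumption: the assignments look like "CC = ..." / "LD = ..." with optional
# whitespace (spaces or tabs) before the "=".
set_toolchain() {
    file=$1
    compiler=$2
    # Write to a temporary file and move it back; "sed -i" is not portable.
    tmp="$file.tmp.$$"
    sed -e "s|^CC[[:space:]]*=.*|CC = $compiler|" \
        -e "s|^LD[[:space:]]*=.*|LD = $compiler|" \
        "$file" > "$tmp" && mv "$tmp" "$file"
}

# Example (uncomment to apply it to the copied port directory):
# set_toolchain arm_ti/core_portme.mak arm-poky-linux-gnueabi-gcc
```

Variables such as CFLAGS are left untouched because the patterns require "=" right after the name and optional whitespace.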

③Toolchain problem
Some toolchains, such as the one built by Yocto 1.6.1, cannot pass a string macro that contains spaces. For these, copy the port directory and compile directly, writing the flags string without spaces:

cp linux/ -r arm_ti

# Set CC and LD to the cross-compile toolchain's gcc

gvim arm_ti/core_portme.mak

Build the source code; the output executable is coremark.exe:

make clean && arm-poky-linux-gnueabi-gcc -O2 -I./arm_ti/ -I. -DFLAGS_STR=\""-O2-DMULTITHREAD=2-DUSE_FORK=1-DPERFORMANCE_RUN=1-lrt"\" -DITERATIONS=0 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 core_list_join.c core_main.c core_matrix.c core_state.c core_util.c ./arm_ti//core_portme.c -o ./coremark.exe -lrt

usage:

1. Copy coremark (coremark.exe for the multicore build) to /usr/bin
cp coremark/coremark.exe ...
2. Run CoreMark

Replace ITER_PROFILE with an iteration count large enough that CoreMark runs for at least 1 minute.

time coremark/coremark.exe 0x0 0x0 0x66 ITER_PROFILE 7 1 2000
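
A suitable value for ITER_PROFILE can be estimated from a short pilot run: multiply the Iterations/Sec it reports by the target duration and round up. A minimal sketch, where min_iterations is a hypothetical helper name and 60 seconds is the assumed default target:

```shell
#!/bin/sh
# min_iterations RATE [SECONDS]
# Hypothetical helper: prints the smallest whole iteration count that keeps
# CoreMark running for at least SECONDS (default 60) at RATE iterations/sec.
min_iterations() {
    awk -v rate="$1" -v secs="${2:-60}" 'BEGIN {
        n = rate * secs
        whole = int(n)
        if (whole < n) whole++      # round up to the next whole iteration
        printf "%d\n", whole
    }'
}

# Using the single-core rate from the example log in this article:
min_iterations 1595.215133    # prints 95713
```

Since the rate varies a little between runs, it is reasonable to pick a rounder, somewhat larger number than the printed minimum.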
3. Get the average result

When CoreMark prints the result, rerun it several times, take the Iterations/Sec value of each run, compute the average, and fill in the table. E.g.:

time coremark 0x0 0x0 0x66 400000 7 1 2000
①single core result log example
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 250749878
Total time (secs): 250.749878
Iterations/Sec : 1595.215133
Iterations : 400000
Compiler version : GCC4.8.3 20140401 (prerelease)
Compiler flags : arm-poky-linux-gnueabi-gcc4.8.3-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x65c5
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 1595.215133 / GCC4.8.3 20140401 (prerelease) arm-poky-linux-gnueabi-gcc4.8.3-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15 / STACK

real 4m10.831s
user 4m10.750s
sys 0m0.000s
②multicore/multithread result log example


2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 58661
Total time (secs): 58.661000 
Iterations/Sec : 9546.376639 
Iterations : 560000 
Compiler version : GCC4.8.3 20140401 (prerelease) 
Compiler flags : -O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt 
Parallel Fork : 2 
Memory location : Please put data memory location here 
(e.g. code in flash, data on heap etc) 
seedcrc : 0xe9f5 
[0]crclist : 0xe714 
[1]crclist : 0xe714 
[0]crcmatrix : 0x1fd7 
[1]crcmatrix : 0x1fd7 
[0]crcstate : 0x8e3a 
[1]crcstate : 0x8e3a 
[0]crcfinal : 0xbd59 
[1]crcfinal : 0xbd59 
Correct operation validated. See readme.txt for run and reporting rules. 
CoreMark 1.0 : 9546.376639 / GCC4.8.3 20140401 (prerelease) -O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt / Heap / 2:Fork 
real 0m58.670s 
user 1m57.260s 
sys 0m0.000s
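
The averaging step can be sketched with awk. This assumes the output of several CoreMark runs has been appended to one log file; avg_iterations_per_sec is a hypothetical helper name.

```shell
#!/bin/sh
# avg_iterations_per_sec LOGFILE
# Averages every "Iterations/Sec : <value>" line found in a log that holds
# the output of several CoreMark runs, one after another.
avg_iterations_per_sec() {
    awk -F':' '
        /^Iterations\/Sec/ { sum += $2; n++ }   # $2 is the value after the colon
        END { if (n) printf "%.6f\n", sum / n } # print nothing if no run found
    ' "$1"
}

# Example: avg_iterations_per_sec coremark.log
```

The plain "Iterations : N" line is skipped because the pattern only matches lines that start with "Iterations/Sec".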

For more details, refer to the ARM document "CoreMark Benchmarking for ARM Cortex Processors".

2. Float BenchMark

Use lat_ops from lmbench (version 3.0), a single-core test program.

1. Program location

The program is lmbench/bin/lat_ops. Copy lmbench to the target board:

cp -r lmbench /

2. run

Change the working directory to lmbench/bin/arm-linux, run lat_ops several times, and take the average as the result.
For example:

root@xxx:/# cd /lmbench/bin/arm-linux/ 
root@xxx:/lmbench/bin/arm-linux# ./lat_ops 
integer bit: 0.67 nanoseconds 
integer add: 0.67 nanoseconds 
integer mul: 2.08 nanoseconds 
integer div: 57.43 nanoseconds 
integer mod: 8.11 nanoseconds 
int64 bit: 0.68 nanoseconds 
uint64 add: 0.74 nanoseconds 
int64 mul: 3.36 nanoseconds 
int64 div: 90.15 nanoseconds 
int64 mod: 62.60 nanoseconds 
float add: 3.36 nanoseconds 
float mul: 4.04 nanoseconds 
float div: 12.14 nanoseconds 
double add: 3.36 nanoseconds 
double mul: 4.04 nanoseconds 
double div: 21.52 nanoseconds 
float bogomflops: 10.77 nanoseconds 
double bogomflops: 20.20 nanoseconds
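
Averaging several lat_ops runs per operation can be sketched as follows, assuming the runs were appended to one log file in the "name: value nanoseconds" format shown above. avg_lat_ops is a hypothetical helper name.

```shell
#!/bin/sh
# avg_lat_ops LOGFILE
# For each operation name ("integer bit", "float add", ...) found in the log,
# prints "name = average" in the order the names first appear.
avg_lat_ops() {
    awk -F': ' '
        /nanoseconds/ {
            if (!($1 in sum)) order[++k] = $1   # remember first-seen order
            sum[$1] += $2 + 0                   # "0.67 nanoseconds" -> 0.67
            cnt[$1]++
        }
        END {
            for (i = 1; i <= k; i++)
                printf "%s = %.6f\n", order[i], sum[order[i]] / cnt[order[i]]
        }
    ' "$1"
}

# Example: avg_lat_ops lat_ops.log
```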

3. L1 L2 Cache Latency BenchMark

Use lat_mem_rd from lmbench (version 3.0), a single-core test program.

1. prepare

Program location: lmbench/bin/lat_mem_rd. Copy lmbench to the target board:

cp -r lmbench /

2. run

Change the working directory to lmbench/bin/arm-linux, run lat_mem_rd several times, and take the average as the result.

./lat_mem_rd 1M

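In the output log, the first column is the array size in MB and the second column is the load latency in nanoseconds. Take the latency from the following rows:
0.00098 --> L1 Cache
0.12500 --> L2 Cache
E.g.: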

root@xxx:/lmbench/bin/arm-linux# ./lat_mem_rd 1M
"stride=128
0.00049 2.687
0.00098 2.688
0.00195 2.688
0.00293 2.688
0.00391 2.669
0.00586 2.669
0.00781 2.669
0.01172 2.669
0.01562 2.669
0.02344 8.708
0.03125 7.198
0.04688 13.687
0.06250 13.189
0.09375 14.683
0.12500 14.683
0.18750 14.746
0.25000 14.746
0.37500 14.783
0.50000 14.933
0.75000 27.538
1.00000 70.250
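
Picking the L1 and L2 rows out of several runs and averaging them can be sketched like this. It assumes the runs were captured into one log file (for example with 2>&1) and that the sizes 0.00098 and 0.12500 are the right rows for your cache sizes; avg_latency is a hypothetical helper name.

```shell
#!/bin/sh
# avg_latency LOGFILE SIZE_MB
# Averages the latency (second column, in ns) of every two-column row whose
# first column equals SIZE_MB, and prints "SIZE_MB = average".
avg_latency() {
    awk -v size="$2" '
        NF == 2 && $1 == size { sum += $2; n++ }
        END { if (n) printf "%s = %.6f\n", size, sum / n }
    ' "$1"
}

# Example:
# avg_latency lat_mem_rd.log 0.00098   # L1 row
# avg_latency lat_mem_rd.log 0.12500   # L2 row
```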

4. DMIPS BenchMark

Use Dhrystone (version 2.1), a single-core test program.

1. Get the source

Download the source from: http://www.roylongbottom.org.uk/linux%20benchmarks.htm#anchor4

wget 'http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz' 
wget 'http://linux-sunxi.org/images/a/a1/Classic_benchmarks.patch' 
tar -xzf classic_benchmarks.tar.gz 
patch -p0 < Classic_benchmarks.patch 
cd classic_benchmarks/source_code/


2. Setting the tuning options

Change the toolchain path and the tuning options in the Makefile:

gvim Makefile 
CC=gcc-4.7 ==> CC=XXXX-gcc 
CFLAGS=-static -O3 -mcpu=cortex-A8 -mtune=cortex-A8 -mfpu=neon -funroll-loops ==> 
CFLAGS=-static -O3 -mcpu=cortex-A15 -mtune=cortex-A15 -mfpu=neon -funroll-loops

3. Change the SoC type string and the CPU frequency

gvim common_32bit/cpuidc.c

Change the string and SoC frequency:

strcpy(idString1, "Cortex A8"); ==> strcpy(idString1, "Cortex A15"); 
megaHz = 1000; ==> megaHz = 1500;

4. build the program

make

5. run the dhry2 test program

1. Copy dhry2 to the target board, make it executable, and run it:

cp dhry2 XXXX 
chmod a+x ./dhry2 
./dhry2

2. The VAX MIPS rating is the DMIPS value. Rerun the program several times and take the average as the result.
E.g.:

root@xxx:/# dhry2
####################################################
getDetails and MHz

Assembler CPUID and RDTSC 
CPU Cortex A8, Features Code 00000000, Model Code 00000000

Measured - Minimum 1500 MHz, Maximum 1500 MHz
Linux Functions
get_nprocs() - CPUs 2, Configured CPUs 2
get_phys_pages() and size - RAM Size 1.97 GB, Page Size 4096 Bytes
uname() - Linux, saturn15, 3.10.31-ltsi
#1 SMP PREEMPT Tue Dec 9 13:39:16 JST 2014, armv7l

##########################################

Dhrystone Benchmark, Version 2.1 (Language: C or C++)

Optimisation Opt 3 64 Bit
Register option not selected

40000 runs 0.00 seconds 
400000 runs 0.05 seconds 
4000000 runs 0.49 seconds 
8000000 runs 0.97 seconds 
16000000 runs 1.94 seconds 
32000000 runs 3.89 seconds

Final values (* implementation-dependent):

Int_Glob: O.K. 5 Bool_Glob: O.K. 1
Ch_1_Glob: O.K. A Ch_2_Glob: O.K. B
Arr_1_Glob[8]: O.K. 7 Arr_2_Glob8/7: O.K. 32000010
Ptr_Glob-> Ptr_Comp: * 610704
Discr: O.K. 0 Enum_Comp: O.K. 2
Int_Comp: O.K. 17 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob-> Ptr_Comp: * 610704 same as above
Discr: O.K. 0 Enum_Comp: O.K. 1
Int_Comp: O.K. 18 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc: O.K. 5 Int_2_Loc: O.K. 13
Int_3_Loc: O.K. 7 Enum_Loc: O.K. 1 
Str_1_Loc: O.K. DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc: O.K. DHRYSTONE PROGRAM, 2'ND STRING

Microseconds for one run through Dhrystone: 0.12 
Dhrystones per Second: 8232458 
VAX MIPS rating = 4685.52

Press Enter
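
The VAX MIPS rating is the Dhrystones-per-second figure divided by 1757, the score of the VAX 11/780 reference machine. The arithmetic can be checked with a small sketch; dmips and dmips_per_mhz are hypothetical helper names, and the 1500 MHz clock is the value measured in the log above.

```shell
#!/bin/sh
# dmips DHRYSTONES_PER_SECOND
# Converts a Dhrystones-per-second figure into DMIPS (VAX MIPS rating).
dmips() {
    awk -v dps="$1" 'BEGIN { printf "%.2f\n", dps / 1757 }'
}

# dmips_per_mhz DHRYSTONES_PER_SECOND CLOCK_MHZ
# Normalises the DMIPS value by the CPU clock, for comparing different SoCs.
dmips_per_mhz() {
    awk -v dps="$1" -v mhz="$2" 'BEGIN { printf "%.2f\n", dps / 1757 / mhz }'
}

# Values taken from the example log:
dmips 8232458               # prints 4685.52
dmips_per_mhz 8232458 1500  # prints 3.12
```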

5. Scripts

For a benchmark, we usually run the test several times and then average the results to get the final figure. I have written two scripts to do this.

The two scripts are in my Bitbucket snippet, CPU_BenchMark_Scripts:

  • CPUBenchMark_Average.sh: runs on the host, or on a target board that has bash, awk, and grep
  • CPU_RunBenchMark.sh: runs on the target


CPU_RunBenchMark.sh runs the benchmark programs and stores the results in PROGRAM_NAME.log, where PROGRAM_NAME is the name of the program, e.g. coremark.

CPUBenchMark_Average.sh averages the results stored in PROGRAM_NAME.log.

The steps to use the scripts are:

①Copy the benchmark programs (coremark.exe, dhry2, lat_ops, lat_mem_rd) to the target board

②Copy CPU_RunBenchMark.sh and CPUBenchMark_Average.sh to the same directory as the benchmark programs

③Modify CPU_RunBenchMark.sh to match your directory layout

runTest coremark_v1.0 'time ./coremark.exe 0x0 0x0 0x66 200000 7 1 2000' coremark.log 
runTest classic_benchmarks/source_code 'echo | ./dhry2' dhry2.log 10
runTest lmbench/bin/arm-linux './lat_ops' lat_ops.log
runTest lmbench/bin/arm-linux './lat_mem_rd 1M' lat_mem_rd.log

The runTest shell function runs a program ($2) located in the directory $1 and appends its output to the log file $3.

④Modify the for loop to set how many times each benchmark program runs.

for i in 1 2 3 4 5 6 7 8 9 10;do
eval "$2" 2>&1 | tee -a $3
done
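
For orientation, a function with this interface could look like the sketch below. It is not the original script: the optional fourth argument as a run count (suggested by the "10" on the dhry2 line), the default of 10 runs, and writing the log inside the program's directory are all assumptions.

```shell
#!/bin/sh
# runTest DIR COMMAND LOGFILE [COUNT]
# Sketch only: runs COMMAND inside DIR COUNT times (assumed default: 10) and
# appends stdout and stderr of every run to LOGFILE (relative to DIR).
runTest() {
    dir=$1
    cmd=$2
    log=$3
    count=${4:-10}
    (
        # Subshell, so the caller's working directory is left unchanged.
        cd "$dir" || exit 1
        i=0
        while [ "$i" -lt "$count" ]; do
            eval "$cmd" 2>&1 | tee -a "$log"
            i=$((i + 1))
        done
    )
}

# Example: runTest lmbench/bin/arm-linux './lat_ops' lat_ops.log 5
```

Using a counter instead of a fixed "1 2 3 ..." list means the number of runs can be changed per program without editing the loop.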

⑤Average the results

If the target board ships grep and awk, just run CPUBenchMark_Average.sh there. If the board does not have these tools, copy the logs and the script to a host PC and run it on the host. The script prints the results to STDOUT, e.g.:

$ sh average.sh 
===========CoreMark================================
Iterations/Sec = 9569.107810
===========Dhry2===================================
VAX MIPS rating = 4685.468000
===========L1 Lat==================================
0.00098 = 2.669300
===========L2 Lat==================================
0.12500 = 14.684400
===========integer=================================
integer bit = 0.670000
integer add = 0.670000
integer mul = 2.070000
integer div = 56.908000
integer mod = 8.044000
===========int64==================================
int64 bit = 0.670000
uint64 add = 0.710000
int64 mul = 3.340000
int64 div = 89.491000
int64 mod = 62.155000
===========float==================================
float add = 3.340000
float mul = 4.009000
float div = 12.022000
===========double=================================
double add = 3.340000
double mul = 4.010000
double div = 21.372000
===========float/double bogo======================
float bogomflops = 10.688000
double bogomflops = 20.038000

If this article has formatting problems, please see: http://www.hexiongjun.com/?p=174

Please credit the source when reposting. Author: TonyHo hexiongjun.com

