This article describes the programs used to benchmark SoC performance (including SMP configurations) and the steps to build and run them. At the end, I provide two scripts that make the benchmarking work more efficient.
These benchmark programs evaluate integer and floating-point performance as well as the latency of the L1 and L2 caches. All of them can be downloaded from the Internet, and some of them come from lmbench. For lmbench, see my previous blog post (in Chinese), ARM Linux BenchMark, and the GitHub repo that accompanies it:
https://github.com/tonyho/ARM_BenchMark
Besides, if you want to compare the SoC in a phone with the one on an ARM Linux board, you can do the following:
①Install the benchmark APKs on the Android phone and run them (Roy Longbottom has collected and adapted many benchmark tools for Android)
②Then use the tools in the repo below to run the benchmarks on the ARM Linux board:
https://github.com/tonyho/ARM-MP-BenchMark
③Compare the results
Download CoreMark from http://www.eembc.org/, then compile it:
arm-poky-linux-gnueabi-gcc -c -march=armv7-a -mfloat-abi=hard -mfpu=neon -mtune=cortex-a15 -I./ -Isimple -DITERATIONS=0 -DSEED_METHOD=SEED_ARG -DCOMPILER_FLAGS=\""-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15-Os\"" -Os core_main.c core_list_join.c core_matrix.c core_state.c core_util.c simple/core_portme.c
Link:
arm-poky-linux-gnueabi-gcc core_main.o core_list_join.o core_matrix.o core_state.o core_util.o core_portme.o -o coremark -lc
For a static link:
arm-poky-linux-gnueabi-gcc core_main.o core_list_join.o core_matrix.o core_state.o core_util.o core_portme.o -o coremark.static -lc -static
cp linux/ -r arm_ti
#Set CC and LD to the cross-compile toolchain's gcc
gvim arm_ti/core_portme.mak
#Build CoreMark:
make PORT_DIR=./arm_ti/ XCFLAGS="-DMULTITHREAD=4 -DUSE_FORK=1"
make PORT_DIR=./arm_ti/ REBUILD=1
③Toolchain problem
Some toolchains, such as the one built by Yocto 1.6.1, cannot pass a string macro that contains spaces. In that case, compile by hand and remove the spaces from the flags string:
cp linux/ -r arm_ti
#Set CC and LD to the cross-compile toolchain's gcc
gvim arm_ti/core_portme.mak
Build the source code; the output executable is coremark.exe:
make clean && arm-poky-linux-gnueabi-gcc -O2 -I./arm_ti/ -I. -DFLAGS_STR=\""-O2-DMULTITHREAD=2-DUSE_FORK=1-DPERFORMANCE_RUN=1-lrt"\" -DITERATIONS=0 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 core_list_join.c core_main.c core_matrix.c core_state.c core_util.c ./arm_ti//core_portme.c -o ./coremark.exe -lrt
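The space-free FLAGS_STR used for the toolchain problem above can be generated instead of typed by hand. This is only a minimal sketch: strip_spaces is a hypothetical helper name, not part of CoreMark, and the flags are the ones from the command above.

```shell
# Hypothetical helper (not part of CoreMark): remove every space from a
# flags string so it can be passed as a single string macro.
strip_spaces() {
    printf '%s' "$1" | tr -d ' '
}

flags="-O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt"
flags_str=$(strip_spaces "$flags")
echo "$flags_str"
```

The resulting string only affects the "Compiler flags" line that CoreMark prints; the real options are still passed to the compiler separately.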
cp coremark/coremark.exe ...
Replace ITER_PROFILE with a number large enough that CoreMark runs for at least 1 minute.
time coremark/coremark.exe 0x0 0x0 0x66 ITER_PROFILE 7 1 2000
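One way to pick the iteration count is to do a short trial run, read its Iterations/Sec value, and scale it to the target duration. The sketch below assumes that approach; iterations_for is a hypothetical helper name, and the rate in the example is the dual-core figure from the sample output in this article.

```shell
# Hypothetical helper: iterations_for RATE SECONDS
# Prints the smallest whole iteration count that keeps a run going for
# SECONDS at the measured RATE (Iterations/Sec from a trial run).
iterations_for() {
    awk -v rate="$1" -v secs="$2" 'BEGIN {
        n = rate * secs
        r = int(n)
        if (r < n) r++      # round up to a whole iteration
        print r
    }'
}

iterations_for 9546.376639 60   # prints 572783
```

In practice you would round this up further to a convenient number, since the rate varies a little between runs.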
When CoreMark prints the result, rerun it several times, take the Iterations/Sec value from each run, and average them to fill in the table. E.g.:
time coremark 0x0 0x0 0x66 400000 7 1 2000
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 250749878
Total time (secs): 250.749878
Iterations/Sec : 1595.215133
Iterations : 400000
Compiler version : GCC4.8.3 20140401 (prerelease)
Compiler flags : arm-poky-linux-gnueabi-gcc4.8.3-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x65c5
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 1595.215133 / GCC4.8.3 20140401 (prerelease) arm-poky-linux-gnueabi-gcc4.8.3-march=armv7-a-mfloat-abi=hard-mfpu=neon-mtune=cortex-a15 / STACK

real 4m10.831s
user 4m10.750s
sys 0m0.000s

2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 58661
Total time (secs): 58.661000
Iterations/Sec : 9546.376639
Iterations : 560000
Compiler version : GCC4.8.3 20140401 (prerelease)
Compiler flags : -O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt
Parallel Fork : 2
Memory location : Please put data memory location here (e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
[1]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[1]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[1]crcstate : 0x8e3a
[0]crcfinal : 0xbd59
[1]crcfinal : 0xbd59
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 9546.376639 / GCC4.8.3 20140401 (prerelease) -O2 -DMULTITHREAD=2 -DUSE_FORK=1 -DPERFORMANCE_RUN=1 -lrt / Heap / 2:Fork

real 0m58.670s
user 1m57.260s
sys 0m0.000s
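Averaging the Iterations/Sec values by hand gets tedious, so here is a minimal sketch of how it could be done with awk. It assumes the output of several runs was appended to one log file; average_coremark is a hypothetical name, and the demo log simply repeats the single-core value shown above.

```shell
# Hypothetical helper: average_coremark LOGFILE
# Averages every "Iterations/Sec : N" line found in LOGFILE.
average_coremark() {
    awk -F: '/^Iterations\/Sec/ { sum += $2; n++ }
             END { if (n) printf "%.6f\n", sum / n }' "$1"
}

# Demo with a throwaway log that holds two identical runs.
log=$(mktemp)
printf 'Iterations/Sec : 1595.215133\nIterations/Sec : 1595.215133\n' > "$log"
average_coremark "$log"
rm -f "$log"
```

The pattern is anchored at the start of the line so that the "Iterations :" line is not counted.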
For more details, refer to the ARM document: CoreMark Benchmarking for ARM Cortex Processors
Use lat_ops from lmbench (version 3.0); it is a single-core test program.
Program position: lmbench/bin/lat_ops. Copy lmbench to the target board:
cp -r lmbench /
Change the working directory to lmbench/bin/arm-linux, run lat_ops several times, and take the average as the result.
For example:
root@xxx:/# cd /lmbench/bin/arm-linux/
root@xxx:/lmbench/bin/arm-linux# ./lat_ops
integer bit: 0.67 nanoseconds
integer add: 0.67 nanoseconds
integer mul: 2.08 nanoseconds
integer div: 57.43 nanoseconds
integer mod: 8.11 nanoseconds
int64 bit: 0.68 nanoseconds
uint64 add: 0.74 nanoseconds
int64 mul: 3.36 nanoseconds
int64 div: 90.15 nanoseconds
int64 mod: 62.60 nanoseconds
float add: 3.36 nanoseconds
float mul: 4.04 nanoseconds
float div: 12.14 nanoseconds
double add: 3.36 nanoseconds
double mul: 4.04 nanoseconds
double div: 21.52 nanoseconds
float bogomflops: 10.77 nanoseconds
double bogomflops: 20.20 nanoseconds
Use lat_mem_rd from lmbench (version 3.0); it is a single-core test program.
Program position: lmbench/bin/lat_mem_rd. Copy lmbench to the target board:
cp -r lmbench /
Change the working directory to lmbench/bin/arm-linux, run lat_mem_rd several times, and take the average as the result.
./lat_mem_rd 1M
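In the program output, the first column is the array size in MB and the second column is the latency in nanoseconds. The latency values are read from the following rows:
0.00098-->L1 Cache
0.12500-->L2 Cache
E.g.: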
root@xxx:/lmbench/bin/arm-linux# ./lat_mem_rd 1M
"stride=128
0.00049 2.687
0.00098 2.688
0.00195 2.688
0.00293 2.688
0.00391 2.669
0.00586 2.669
0.00781 2.669
0.01172 2.669
0.01562 2.669
0.02344 8.708
0.03125 7.198
0.04688 13.687
0.06250 13.189
0.09375 14.683
0.12500 14.683
0.18750 14.746
0.25000 14.746
0.37500 14.783
0.50000 14.933
0.75000 27.538
1.00000 70.250
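Picking the L1 and L2 rows out of several runs can be sketched with awk as follows. This assumes the lat_mem_rd output of all runs was captured into one log file; latency_at is a hypothetical helper name and the demo values are a few rows in the same format as the output above.

```shell
# Hypothetical helper: latency_at SIZE_MB LOGFILE
# Averages the latency (second column) of every row whose first column
# equals SIZE_MB, e.g. 0.00098 for L1 and 0.12500 for L2.
latency_at() {
    awk -v size="$1" '$1 == size { sum += $2; n++ }
                      END { if (n) printf "%.4f\n", sum / n }' "$2"
}

# Demo with a throwaway log holding rows from two runs.
log=$(mktemp)
printf '0.00098 2.688\n0.12500 14.683\n0.00098 2.669\n0.12500 14.683\n' > "$log"
latency_at 0.00098 "$log"   # L1
latency_at 0.12500 "$log"   # L2
rm -f "$log"
```

The "stride=128 header line is skipped automatically because its first field never compares equal to a size.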
Use Dhrystone (version 2.1); it is a single-core test program.
Get the source from: http://www.roylongbottom.org.uk/linux%20benchmarks.htm#anchor4
wget 'http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz'
wget 'http://linux-sunxi.org/images/a/a1/Classic_benchmarks.patch'
tar -xzf classic_benchmarks.tar.gz
patch -p0 < Classic_benchmarks.patch
cd classic_benchmarks/source_code/
Change the toolchain path and the tuning options:
gvim Makefile
CC=gcc-4.7 ==> CC=XXXX-gcc
CFLAGS=-static -O3 -mcpu=cortex-A8 -mtune=cortex-A8 -mfpu=neon -funroll-loops ==> CFLAGS=-static -O3 -mcpu=cortex-A15 -mtune=cortex-A15 -mfpu=neon -funroll-loops
gvim common_32bit/cpuidc.c
Change the CPU name string and the SoC frequency:
strcpy(idString1, "Cortex A8"); ==> strcpy(idString1, "Cortex A15");
megaHz = 1000; ==> megaHz = 1500;
make
1. Copy dhry2 to the target board, make it executable, and run it:
cp dhry2 XXXX
chmod a+x ./dhry2
./dhry2
2. The VAX MIPS rating is the DMIPS value. Rerun the program several times and take the average as the result.
E.g.:
root@xxx:/# dhry2
####################################################
getDetails and MHz

Assembler CPUID and RDTSC
CPU Cortex A8, Features Code 00000000, Model Code 00000000
Measured - Minimum 1500 MHz, Maximum 1500 MHz
Linux Functions
get_nprocs() - CPUs 2, Configured CPUs 2
get_phys_pages() and size - RAM Size 1.97 GB, Page Size 4096 Bytes
uname() - Linux, saturn15, 3.10.31-ltsi #1 SMP PREEMPT Tue Dec 9 13:39:16 JST 2014, armv7l

##########################################

Dhrystone Benchmark, Version 2.1 (Language: C or C++)

Optimisation Opt 3
64 Bit Register option not selected

40000 runs 0.00 seconds
400000 runs 0.05 seconds
4000000 runs 0.49 seconds
8000000 runs 0.97 seconds
16000000 runs 1.94 seconds
32000000 runs 3.89 seconds

Final values (* implementation-dependent):

Int_Glob: O.K. 5
Bool_Glob: O.K. 1
Ch_1_Glob: O.K. A
Ch_2_Glob: O.K. B
Arr_1_Glob[8]: O.K. 7
Arr_2_Glob8/7: O.K. 32000010
Ptr_Glob-> Ptr_Comp: * 610704
Discr: O.K. 0
Enum_Comp: O.K. 2
Int_Comp: O.K. 17
Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob-> Ptr_Comp: * 610704 same as above
Discr: O.K. 0
Enum_Comp: O.K. 1
Int_Comp: O.K. 18
Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc: O.K. 5
Int_2_Loc: O.K. 13
Int_3_Loc: O.K. 7
Enum_Loc: O.K. 1
Str_1_Loc: O.K. DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc: O.K. DHRYSTONE PROGRAM, 2'ND STRING

Microseconds for one run through Dhrystone: 0.12
Dhrystones per Second: 8232458
VAX MIPS rating = 4685.52

Press Enter
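The VAX MIPS rating is the Dhrystones per Second value divided by 1757, the score of the VAX 11/780 reference machine. The sketch below redoes that arithmetic with the numbers from the output above and also derives DMIPS/MHz from the 1500 MHz clock; the function names are hypothetical.

```shell
# Hypothetical helpers for the Dhrystone arithmetic.
# dmips DHRYSTONES_PER_SEC: divide by the VAX 11/780 score of 1757.
dmips() {
    awk -v d="$1" 'BEGIN { printf "%.2f\n", d / 1757 }'
}

# dmips_per_mhz DHRYSTONES_PER_SEC MHZ: normalise DMIPS by the clock.
dmips_per_mhz() {
    awk -v d="$1" -v mhz="$2" 'BEGIN { printf "%.2f\n", d / 1757 / mhz }'
}

dmips 8232458               # prints 4685.52
dmips_per_mhz 8232458 1500  # prints 3.12
```

DMIPS/MHz is the figure usually quoted when comparing cores that run at different clock speeds.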
When benchmarking, we usually run each test several times and average the results to get a final value. I have written two scripts to do this.
The two scripts are in my Bitbucket snippet, CPU_BenchMark_Scripts:
CPU_RunBenchMark.sh runs the benchmark programs and stores the results in PROGRAM_NAME.log, where PROGRAM_NAME is the name of the program, e.g. coremark.
CPUBenchMark_Average.sh averages the results stored in PROGRAM_NAME.log.
The steps to use the scripts are:
①Copy the benchmark programs (coremark.exe, dhry2, lat_ops, lat_mem_rd) to the target board
②Copy CPU_RunBenchMark.sh and CPUBenchMark_Average.sh to the same directory as the benchmark programs
③Modify CPU_RunBenchMark.sh to match your directory layout
runTest coremark_v1.0 'time ./coremark.exe 0x0 0x0 0x66 200000 7 1 2000' coremark.log
runTest classic_benchmarks/source_code 'echo | ./dhry2' dhry2.log 10
runTest lmbench/bin/arm-linux './lat_ops' lat_ops.log
runTest lmbench/bin/arm-linux './lat_mem_rd 1M' lat_mem_rd.log
The runTest shell function runs a program ($2) located in the directory $1 and appends its output to the log file $3.
④Modify the for loop to set how many times each benchmark program runs.
for i in 1 2 3 4 5 6 7 8 9 10;do
    eval "$2" 2>&1 | tee -a $3
done
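For readers who do not have the snippet at hand, a function of this shape could look like the sketch below. It is not the code from CPU_RunBenchMark.sh: the optional fourth argument as a repeat count and the log path being relative to the program directory are both assumptions made for illustration.

```shell
# Sketch of a runTest-like function (NOT the original script).
# Usage: runTest DIR COMMAND LOGFILE [COUNT]
# Assumptions: COUNT defaults to 10, LOGFILE is relative to DIR.
runTest() {
    dir=$1; cmd=$2; log=$3; count=${4:-10}
    (
        cd "$dir" || exit 1
        i=0
        while [ "$i" -lt "$count" ]; do
            # Capture stdout and stderr, and keep showing them on screen.
            eval "$cmd" 2>&1 | tee -a "$log"
            i=$((i + 1))
        done
    )
}

# Example call (left commented out because it needs the benchmark binaries):
# runTest lmbench/bin/arm-linux './lat_mem_rd 1M' lat_mem_rd.log 5
```

Running the loop in a subshell keeps the caller's working directory unchanged between tests.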
⑤Average the results
If the target board ships grep and awk, just run CPUBenchMark_Average.sh there. If the board does not have these tools, copy the logs and scripts to a host PC and run the script there. It prints the results to STDOUT, e.g.:
$ sh average.sh
===========CoreMark================================
Iterations/Sec = 9569.107810
===========Dhry2===================================
VAX MIPS rating = 4685.468000
===========L1 Lat==================================
0.00098 = 2.669300
===========L2 Lat==================================
0.12500 = 14.684400
===========integer=================================
integer bit = 0.670000
integer add = 0.670000
integer mul = 2.070000
integer div = 56.908000
integer mod = 8.044000
===========int64==================================
int64 bit = 0.670000
uint64 add = 0.710000
int64 mul = 3.340000
int64 div = 89.491000
int64 mod = 62.155000
===========float==================================
float add = 3.340000
float mul = 4.009000
float div = 12.022000
===========double=================================
double add = 3.340000
double mul = 4.010000
double div = 21.372000
===========float/double bogo======================
float bogomflops = 10.688000
double bogomflops = 20.038000
If this article has formatting problems, please read it at: http://www.hexiongjun.com/?p=174
Please credit the source when reposting. Author: TonyHo hexiongjun.com