Advanced NDK
3.1 Assembly
3.1.1 Greatest Common Divisor
3.1.2 Color Conversion
3.1.3 Parallel Computation of Average
3.1.4 ARM Instructions
3.1.5 ARM NEON
3.2 C Extensions
3.2.1 Built-in Functions
3.3 Tips
3.3.1 Inlining Functions
3.3.2 Unrolling Loops
3.3.3 Preloading Memory
3.3.4 LDM/STM Instead Of LDR/STR
3.4 Summary
Chapter 2 showed you how to set up a project using the Android NDK and how you could use C or C++ code in your Android application. In many cases, this is as far as you will need to go. However, there may be times when digging a little deeper is required to find what can be optimized even more.
In this chapter, you will get your hands dirty and learn how you can use a low-level language to take advantage of all the bells and whistles the CPU has to offer, which may not be possible to use from plain C or C++ code. The first part of the chapter shows you several examples of how you can optimize functions using assembly language and gives you an overview of the ARM instruction set. The second part covers some of the C extensions the GCC compiler supports that you can take advantage of to improve your application’s performance. Finally, the chapter concludes with a few very simple tips for optimizing code relatively quickly.
While the latest Android NDK supports the armeabi, armeabi-v7a, and x86 ABIs, this chapter will focus on the first two as Android is mostly deployed on ARM-based devices. If you plan on writing assembly code, then ARM should be your first target.
While the first Google TV devices were Intel-based, Google TV does not yet support the NDK.
3.1 Assembly
The NDK allows you to use C and C++ in your Android applications. Chapter 2 showed you what native code would look like after the C or C++ code is compiled and how you could use objdump -d to disassemble a file (object file or library). For example, the ARM assembly code of computeIterativelyFaster is shown again in Listing 3-1.
Listing 3-1. ARM Assembly Code of C Implementation of computeIterativelyFaster
00000410 <computeIterativelyFaster>:
410: e3500001 cmp r0, #1 ; 0x1
414: e92d0030 push {r4, r5}
418: 91a02000 movls r2, r0
41c: 93a03000 movls r3, #0 ; 0x0
420: 9a00000e bls 460 <computeIterativelyFaster+0x50>
424: e2400001 sub r0, r0, #1 ; 0x1
428: e1b010a0 lsrs r1, r0, #1
42c: 03a02001 moveq r2, #1 ; 0x1
430: 03a03000 moveq r3, #0 ; 0x0
434: 0a000009 beq 460 <computeIterativelyFaster+0x50>
438: e3a02001 mov r2, #1 ; 0x1
43c: e3a03000 mov r3, #0 ; 0x0
440: e0024000 and r4, r2, r0
444: e3a05000 mov r5, #0 ; 0x0
448: e0944002 adds r4, r4, r2
44c: e0a55003 adc r5, r5, r3
450: e0922004 adds r2, r2, r4
454: e0a33005 adc r3, r3, r5
458: e2511001 subs r1, r1, #1 ; 0x1
45c: 1afffff9 bne 448 <computeIterativelyFaster+0x38>
460: e1a01003 mov r1, r3
464: e1a00002 mov r0, r2
468: e8bd0030 pop {r4, r5}
46c: e12fff1e bx lr
H. Guihot, Pro Android Apps Performance Optimization, © Hervé Guihot 2012
In addition to allowing you to use C or C++ in your application, the Android NDK also lets you write assembly code directly. Strictly speaking, such a feature is not NDK-specific as assembly code is supported by the GCC compiler, which is used by the Android NDK. Consequently, almost everything you learn in this chapter can also be applied to other projects of yours, for example in applications targeting iOS devices like the iPhone.
As you can see, assembly code can be quite difficult to read, let alone write. However, being able to understand assembly code will allow you to more easily identify bottlenecks and therefore more easily optimize your applications. It will also give you bragging rights.
To familiarize yourself with assembly, we will look at three simple examples:
Computation of the greatest common divisor
Conversion from one color format to another
Parallel computation of the average of 8-bit values
These examples are simple enough to understand for developers who are new to assembly, yet they exhibit important principles of assembly optimization. Because these examples introduce you to only a subset of the available instructions, a more complete introduction to the ARM instruction set will follow as well as a brief introduction to the überpowerful ARM SIMD instructions. Finally, you will learn how to dynamically check what CPU features are available, a mandatory step in your applications that target features not available on all devices.
3.1.1 Greatest Common Divisor
The greatest common divisor (gcd) of two non-zero integers is the largest positive integer that divides both numbers. For example, the greatest common divisor of 10 and 55 is 5. An implementation of a function that computes the greatest common divisor of two integers is shown in Listing 3-2.
Listing 3-2. Greatest Common Divisor Simple Implementation
unsigned int gcd (unsigned int a, unsigned int b)
{
    // a and b must be different from zero (else, hello infinite loop!)
    while (a != b) {
        if (a > b) {
            a -= b;
        } else {
            b -= a;
        }
    }
    return a;
}
If you define APP_ABI in your Application.mk file such that the x86, armeabi, and armeabi-v7a architectures are supported in your application, then you will have three different libraries. Disassembling each library will result in three different pieces of assembly code. However, since you have the option to compile in either ARM or Thumb mode with the armeabi and armeabi-v7a ABIs, there are actually a total of five pieces of code you can review.
TIP: Instead of specifying each individual ABI you want to compile a library for, you can define APP_ABI as “all” (APP_ABI := all) starting with NDK r7. When a new ABI is supported by the NDK you will only have to execute ndk-build without having to modify Application.mk.
Listing 3-3 shows the resulting x86 assembly code while Listing 3-4 and Listing 3-5 show the ARMv5 and ARMv7 assembly code respectively. Because different versions of compilers can output different code, the code you will observe may be slightly different from that shown here. The generated code will also depend on the optimization level and other options you may have defined.
Listing 3-3. x86 Assembly Code
00000000 <gcd>:
0: 8b 54 24 04 mov 0x4(%esp),%edx
4: 8b 44 24 08 mov 0x8(%esp),%eax
8: 39 c2 cmp %eax,%edx
a: 75 0a jne 16 <gcd+0x16>
c: eb 12 jmp 20 <gcd+0x20>
e: 66 90 xchg %ax,%ax
10: 29 c2 sub %eax,%edx
12: 39 d0 cmp %edx,%eax
14: 74 0a je 20 <gcd+0x20>
16: 39 d0 cmp %edx,%eax
18: 72 f6 jb 10 <gcd+0x10>
1a: 29 d0 sub %edx,%eax
1c: 39 d0 cmp %edx,%eax
1e: 75 f6 jne 16 <gcd+0x16>
20: f3 c3 repz ret
If you are familiar with the x86 mnemonics, you can see that this code makes heavy use of the jump instructions (jne, jmp, je, jb). Also, while most instructions are 16-bit (for example, “f3 c3”), some are 32-bit.
NOTE: Make sure you use the right version of objdump to disassemble object files and libraries. For example, using the ARM version of objdump to attempt to disassemble an x86 object file will result in the following message:
arm-linux-androideabi-objdump: Can't disassemble for architecture UNKNOWN!
Listing 3-4. ARMv5 Assembly Code (ARM Mode)
00000000 <gcd>:
0: e1500001 cmp r0, r1
4: e1a03000 mov r3, r0
8: 0a000004 beq 20 <gcd+0x20>
c: e1510003 cmp r1, r3
10: 30613003 rsbcc r3, r1, r3
14: 20631001 rsbcs r1, r3, r1
18: e1510003 cmp r1, r3
1c: 1afffffa bne c <gcd+0xc>
20: e1a00001 mov r0, r1
24: e12fff1e bx lr
Listing 3-5. ARMv7a Assembly Code (ARM Mode)
00000000 <gcd>:
0: e1500001 cmp r0, r1
4: e1a03000 mov r3, r0
8: 0a000004 beq 20 <gcd+0x20>
c: e1510003 cmp r1, r3
10: 30613003 rsbcc r3, r1, r3
14: 20631001 rsbcs r1, r3, r1
18: e1510003 cmp r1, r3
1c: 1afffffa bne c <gcd+0xc>
20: e1a00001 mov r0, r1
24: e12fff1e bx lr
As it turns out, the GCC compiler generates the same code for the armeabi and armeabi-v7a ABIs when the code shown in Listing 3-2 is compiled in ARM mode. This won’t always be the case though, as the compiler usually takes advantage of new instructions defined in newer ABIs.
Because you could decide to compile the code in Thumb mode instead of ARM mode, let’s also review the code that would be generated in Thumb mode. Listing 3-6 shows the ARMv5 assembly code (armeabi ABI in Application.mk) while Listing 3-7 shows the ARMv7 assembly code (armeabi-v7a ABI in Application.mk).
Listing 3-6. ARMv5 Assembly Code (Thumb Mode)
00000000 <gcd>:
0: 1c03 adds r3, r0, #0
2: 428b cmp r3, r1
4: d004 beq.n 10 <gcd+0x10>
6: 4299 cmp r1, r3
8: d204 bcs.n 14 <gcd+0x14>
a: 1a5b subs r3, r3, r1
c: 428b cmp r3, r1
e: d1fa bne.n 6 <gcd+0x6>
10: 1c08 adds r0, r1, #0
12: 4770 bx lr
14: 1ac9 subs r1, r1, r3
16: e7f4 b.n 2 <gcd+0x2>
All instructions in Listing 3-6 are 16-bit (for example, “e7f4,” the last instruction of the listing) and the twelve instructions therefore require 24 bytes of space.
Listing 3-7. ARMv7 Assembly Code (Thumb Mode)
00000000 <gcd>:
0: 4288 cmp r0, r1
2: 4603 mov r3, r0
4: d007 beq.n 16 <gcd+0x16>
6: 4299 cmp r1, r3
8: bf34 ite cc
a: ebc1 0303 rsbcc r3, r1, r3
e: ebc3 0101 rsbcs r1, r3, r1
12: 4299 cmp r1, r3
14: d1f7 bne.n 6 <gcd+0x6>
16: 4608 mov r0, r1
18: 4770 bx lr
1a: bf00 nop
This time, the two listings are different. While the ARMv5 architecture uses the Thumb instruction set (all 16-bit instructions), the ARMv7 architecture supports the Thumb2 instruction set and instructions can be 16- or 32-bit.
As a matter of fact, Listing 3-7 looks a lot like Listing 3-5. The main differences are the use of the ite (if-then-else) instruction in Listing 3-7, and the fact that the ARM code is 40 bytes long while the Thumb2 code is only 28 bytes long.
NOTE: Even though the ARM architecture is the dominant one, being able to read x86 assembly code cannot hurt.
Usually, the GCC compiler does a pretty good job and you won’t have to worry too much about the generated code. However, if a piece of code you wrote in C or C++ turns out to be one of the bottlenecks of your application, you should carefully review the assembly code the compiler generated and determine whether you could do better by writing the assembly code yourself. Very often the compiler will generate high-quality code and you won’t be able to do better. That being said, there are cases where, armed with both an intimate knowledge of the instruction set and a slight taste for suffering, you can achieve better results than the compiler.
NOTE: Consider modifying the C/C++ code to achieve better performance as it is often much easier than writing assembly code.
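As one illustration of the note above, here is a minimal sketch of a C-level change: Euclid's algorithm based on the remainder operator instead of repeated subtraction. The function name gcd_mod is hypothetical and this version does not appear in the original listings. Whether it is actually faster is an assumption to verify by measurement, since on cores without a hardware divide instruction the % operator is implemented by a library routine.

```c
/* Hypothetical alternative to Listing 3-2 (not from the original text):
   Euclid's algorithm using the remainder operator. */
unsigned int gcd_mod (unsigned int a, unsigned int b)
{
    /* Unlike the subtraction-based loop, this loop also terminates
       when a or b is zero. */
    while (b != 0) {
        unsigned int r = a % b;
        a = b;
        b = r;
    }
    return a;
}
```

For example, gcd_mod(10, 55) goes through the pairs (10, 55), (55, 10), (10, 5), (5, 0) and returns 5.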
The gcd function can indeed be implemented differently, resulting in code that is not only faster but also more compact, as shown in Listing 3-8.
Listing 3-8. Hand-crafted Assembly Code
.global gcd_asm
.func gcd_asm
gcd_asm:
cmp r0, r1
subgt r0, r0, r1
sublt r1, r1, r0
bne gcd_asm
bx lr
.endfunc
.end
Not including the final instruction to return from the function, the core of the algorithm is implemented in only four instructions. Measurements also showed this implementation to be faster. Note the single CMP instruction in Listing 3-8 compared with the two CMP instructions in Listing 3-7.
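To make the role of conditional execution in Listing 3-8 easier to follow, the loop can be modeled in C as shown below. This is only a sketch: the name gcd_asm_model and the flag variables are hypothetical. Both subtractions and the branch test the flags produced by the single comparison, which is why the flags are captured before either subtraction runs. Because GT and LT are signed condition codes, the model assumes both inputs are non-zero and below 2^31.

```c
/* C model of the hand-crafted loop: one comparison per iteration,
   followed by two conditionally executed subtractions. */
unsigned int gcd_asm_model (unsigned int a, unsigned int b)
{
    int ne;
    do {
        /* cmp r0, r1: GT and LT are signed conditions, hence the casts */
        int gt = (int) a > (int) b;
        int lt = (int) a < (int) b;
        ne = (a != b);
        if (gt) a -= b;    /* subgt r0, r0, r1 */
        if (lt) b -= a;    /* sublt r1, r1, r0 */
    } while (ne);          /* bne gcd_asm */
    return a;              /* bx lr (result in r0) */
}
```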
This code can be copied in a file called gcd_asm.S and added to the list of files to compile in Android.mk. Because this file is using ARM instructions, it obviously won’t compile if the target ABI is x86. Consequently, your Android.mk file should make sure the file is only part of the list of files to compile when it is compatible with the ABI.
Listing 3-9 shows how to modify Android.mk accordingly.
Listing 3-9. Android.mk
LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)
LOCAL_MODULE := chapter3
LOCAL_SRC_FILES := gcd.c
ifeq ($(TARGET_ARCH_ABI),armeabi)
LOCAL_SRC_FILES += gcd_asm.S
endif
ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
LOCAL_SRC_FILES += gcd_asm.S
endif
include $(BUILD_SHARED_LIBRARY)
Because gcd_asm.S is already written using assembly code, the resulting object file should look extremely similar to the source file. Listing 3-10 shows the disassembled code and indeed, the disassembled code is virtually identical to the source.
Listing 3-10. Disassembled gcd_asm Code
00000000 <gcd_asm>:
0: e1500001 cmp r0, r1
4: c0400001 subgt r0, r0, r1
8: b0411000 sublt r1, r1, r0
c: 1afffffb bne 0 <gcd_asm>
10: e12fff1e bx lr
NOTE: The assembler may in some cases substitute some instructions for others so you may still observe slight differences between the code you wrote and the disassembled code.
By simplifying the assembly code, we achieved better results without dramatically
making maintenance more complicated.
3.1.2 Color Conversion
A common operation in graphics routines is to convert a color from one format to another. For example, a 32-bit value representing a color with four 8-bit channels (alpha, red, green, and blue) could be converted to a 16-bit value representing a color with three channels (5 bits for red, 6 bits for green, 5 bits for blue, no alpha). The two formats would typically be referred to as ARGB8888 and RGB565 respectively.
Listing 3-11 shows a trivial implementation of such a conversion.
Listing 3-11. Implementation of Color Conversion Function
unsigned int argb8888_to_rgb565 (unsigned int color)
{
    /*
    input: aaaaaaaarrrrrrrrggggggggbbbbbbbb
    output: 0000000000000000rrrrrggggggbbbbb
    */
    return
        /* red */ ((color >> 8) & 0xF800) |
        /* green */ ((color >> 5) & 0x07E0) |
        /* blue */ ((color >> 3) & 0x001F);
}
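As a worked example of the conversion, the sketch below computes the same result one channel at a time, which may be easier to check by hand. The function name rgb565_from_channels is hypothetical and is not part of the original code. For the input 0xFFFF8040, the red channel 0xFF keeps its top 5 bits (31), the green channel 0x80 keeps its top 6 bits (32), and the blue channel 0x40 keeps its top 5 bits (8), giving (31 << 11) | (32 << 5) | 8 = 0xFC08.

```c
/* Same conversion as Listing 3-11, written one channel at a time
   (hypothetical helper, for illustration only). */
unsigned int rgb565_from_channels (unsigned int color)
{
    unsigned int r = (color >> 16) & 0xFF;   /* 8-bit red   */
    unsigned int g = (color >> 8) & 0xFF;    /* 8-bit green */
    unsigned int b = color & 0xFF;           /* 8-bit blue  */

    /* keep the 5, 6, and 5 most significant bits of each channel */
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3);
}
```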
Once again, five pieces of assembly code can be analyzed. Listing 3-12 shows the x86 assembly code, Listing 3-13 shows the ARMv5 assembly code in ARM mode, Listing 3-14 shows the ARMv7 assembly code in ARM mode, Listing 3-15 shows the ARMv5 assembly code in Thumb mode, and finally Listing 3-16 shows the ARMv7 assembly code in Thumb mode.
Listing 3-12. x86 Assembly Code
00000000 <argb8888_to_rgb565>:
0: 8b 54 24 04 mov 0x4(%esp),%edx
4: 89 d0 mov %edx,%eax
6: 89 d1 mov %edx,%ecx
8: c1 e8 05 shr $0x5,%eax
b: c1 e9 08 shr $0x8,%ecx
e: 25 e0 07 00 00 and $0x7e0,%eax
13: 81 e1 00 f8 00 00 and $0xf800,%ecx
19: c1 ea 03 shr $0x3,%edx
1c: 09 c8 or %ecx,%eax
1e: 83 e2 1f and $0x1f,%edx
21: 09 d0 or %edx,%eax
23: c3 ret
Listing 3-13. ARMv5 Assembly Code (ARM Mode)
00000000 <argb8888_to_rgb565>:
0: e1a022a0 lsr r2, r0, #5
4: e1a03420 lsr r3, r0, #8
8: e2022e7e and r2, r2, #2016 ; 0x7e0
c: e2033b3e and r3, r3, #63488 ; 0xf800
10: e1a00c00 lsl r0, r0, #24
14: e1823003 orr r3, r2, r3
18: e1830da0 orr r0, r3, r0, lsr #27
1c: e12fff1e bx lr
Listing 3-14. ARMv7 Assembly Code (ARM Mode)
00000000 <argb8888_to_rgb565>:
0: e7e431d0 ubfx r3, r0, #3, #5
4: e1a022a0 lsr r2, r0, #5
8: e1a00420 lsr r0, r0, #8
c: e2022e7e and r2, r2, #2016 ; 0x7e0
10: e2000b3e and r0, r0, #63488 ; 0xf800
14: e1820000 orr r0, r2, r0
18: e1800003 orr r0, r0, r3
1c: e12fff1e bx lr
Listing 3-15. ARMv5 Assembly Code (Thumb Mode)
00000000 <argb8888_to_rgb565>:
0: 23fc movs r3, #252
2: 0941 lsrs r1, r0, #5
4: 00db lsls r3, r3, #3
6: 4019 ands r1, r3
8: 23f8 movs r3, #248
a: 0a02 lsrs r2, r0, #8
c: 021b lsls r3, r3, #8
e: 401a ands r2, r3
10: 1c0b adds r3, r1, #0
12: 4313 orrs r3, r2
14: 0600 lsls r0, r0, #24
16: 0ec2 lsrs r2, r0, #27
18: 1c18 adds r0, r3, #0
1a: 4310 orrs r0, r2
1c: 4770 bx lr
1e: 46c0 nop (mov r8, r8)
Listing 3-16. ARMv7 Assembly Code (Thumb Mode)
00000000 <argb8888_to_rgb565>:
0: 0942 lsrs r2, r0, #5
2: 0a03 lsrs r3, r0, #8
4: f402 62fc and.w r2, r2, #2016 ; 0x7e0
8: f403 4378 and.w r3, r3, #63488 ; 0xf800
c: 4313 orrs r3, r2
e: f3c0 00c4 ubfx r0, r0, #3, #5
12: 4318 orrs r0, r3
14: 4770 bx lr
16: bf00 nop
Simply looking at how many instructions are generated, the ARMv5 code in Thumb mode seems to be the least efficient. That being said, counting the number of instructions is not an accurate way of determining how fast or how slow a piece of code is going to be. To get a closer estimate of the duration, one would have to count how many cycles each instruction will need to complete. For example, the “orr r3, r2” instruction needs only one cycle to execute. Today’s CPUs make it quite hard to compute how many cycles will ultimately be needed as they can execute several instructions per cycle and in some cases even execute instructions out of order to maximize throughput.
NOTE: For example, refer to the Cortex-A9 Technical Reference Manual to learn more about the cycle timings of instructions.
Now, it is possible to write a slightly different version of the same conversion function using the UBFX and BFI instructions, as shown in Listing 3-17.
Listing 3-17. Hand-crafted Assembly Code
.global argb8888_to_rgb565_asm
.func argb8888_to_rgb565_asm
argb8888_to_rgb565_asm:
// r0=aaaaaaaarrrrrrrrggggggggbbbbbbbb
// r1=undefined (scratch register)
ubfx r1, r0, #3, #5
// r1=000000000000000000000000000bbbbb
lsr r0, r0, #10
// r0=0000000000aaaaaaaarrrrrrrrgggggg
bfi r1, r0, #5, #6
// r1=000000000000000000000ggggggbbbbb
lsr r0, r0, #9
// r0=0000000000000000000aaaaaaaarrrrr
bfi r1, r0, #11, #5
// r1=0000000000000000rrrrrggggggbbbbb
mov r0, r1
// r0=0000000000000000rrrrrggggggbbbbb
bx lr
.endfunc
.end
Since this code uses the UBFX and BFI instructions (both introduced in the ARMv6T2 architecture), it won’t compile for the armeabi ABI (ARMv5). Obviously it won’t compile for the x86 ABI either.
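The register comments in Listing 3-17 can be replayed in C. The sketch below models UBFX and BFI as small helper functions and then follows the hand-crafted sequence step by step. The helper names and argb8888_to_rgb565_model are hypothetical and exist only for illustration; the helpers are valid for field widths between 1 and 31 bits.

```c
/* Model of UBFX: extract `width` bits of src starting at bit `lsb`. */
static unsigned int ubfx (unsigned int src, unsigned int lsb, unsigned int width)
{
    return (src >> lsb) & ((1u << width) - 1u);
}

/* Model of BFI: copy the low `width` bits of src into dst at bit `lsb`,
   leaving the other bits of dst unchanged. */
static unsigned int bfi (unsigned int dst, unsigned int src, unsigned int lsb, unsigned int width)
{
    unsigned int mask = ((1u << width) - 1u) << lsb;
    return (dst & ~mask) | ((src << lsb) & mask);
}

unsigned int argb8888_to_rgb565_model (unsigned int r0)
{
    unsigned int r1 = ubfx(r0, 3, 5);   /* ubfx r1, r0, #3, #5  */
    r0 >>= 10;                          /* lsr  r0, r0, #10     */
    r1 = bfi(r1, r0, 5, 6);             /* bfi  r1, r0, #5, #6  */
    r0 >>= 9;                           /* lsr  r0, r0, #9      */
    r1 = bfi(r1, r0, 11, 5);            /* bfi  r1, r0, #11, #5 */
    return r1;                          /* mov  r0, r1          */
}
```

With the input 0xFFFF8040, r1 is 0x0008 after the UBFX, 0x0408 after the first BFI, and 0xFC08 after the second one.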
Similar to what was shown in Listing 3-9, your Android.mk should make sure the file is only compiled with the right ABI. Listing 3-18 shows the addition of the rgb.c and rgb_asm.S files to the build.
Listing 3-18. Android.mk
LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)
LOCAL_MODULE := chapter3
LOCAL_SRC_FILES := gcd.c rgb.c
ifeq ($(TARGET_ARCH_ABI),armeabi)
LOCAL_SRC_FILES += gcd_asm.S
endif
ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
LOCAL_SRC_FILES += gcd_asm.S rgb_asm.S
endif
include $(BUILD_SHARED_LIBRARY)
If you add rgb_asm.S to the list of files to compile with the armeabi ABI, you will then get the following errors:
Error: selected processor does not support `ubfx r1,r0,#3,#5'
Error: selected processor does not support `bfi r1,r0,#5,#6'
Error: selected processor does not support `bfi r1,r0,#11,#5'
3.1.3 Parallel Computation of Average
In this example, we want to treat each 32-bit value as four independent 8-bit values and compute the byte-wise average between two such values. For example, the average of 0x10FF3040 and 0x50FF7000 would be 0x30FF5020 (average of 0x10 and 0x50 is 0x30, average of 0xFF and 0xFF is 0xFF, and so on).
An implementation of such a function is shown in Listing 3-19.
Listing 3-19. Implementation of Parallel Average Function
unsigned int avg8 (unsigned int a, unsigned int b)
{
    return
        ((a >> 1) & 0x7F7F7F7F) +
        ((b >> 1) & 0x7F7F7F7F) +
        (a & b & 0x01010101);
}
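To see what Listing 3-19 computes, it can help to compare it with a straightforward version that handles the four bytes one after another. The following sketch is such a reference; the name avg8_reference is hypothetical and the loop is not part of the original code. Each byte average is rounded down, which matches the listing: the two halved values contribute the integer parts and the (a & b & 0x01010101) term adds 1 only when both low bits are set.

```c
/* Straightforward reference: average each of the four bytes separately. */
unsigned int avg8_reference (unsigned int a, unsigned int b)
{
    unsigned int result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8) {
        unsigned int x = (a >> shift) & 0xFF;
        unsigned int y = (b >> shift) & 0xFF;
        /* x + y is at most 510, so the halved sum always fits in a byte */
        result |= ((x + y) >> 1) << shift;
    }
    return result;
}
```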
As with the two previous examples, five pieces of assembly code are shown in Listings 3-20 to 3-24.
Listing 3-20. x86 Assembly Code
00000000 <avg8>:
0: 8b 54 24 04 mov 0x4(%esp),%edx
4: 8b 44 24 08 mov 0x8(%esp),%eax
8: 89 d1 mov %edx,%ecx
a: 81 e1 01 01 01 01 and $0x1010101,%ecx
10: d1 ea shr %edx
12: 21 c1 and %eax,%ecx
14: 81 e2 7f 7f 7f 7f and $0x7f7f7f7f,%edx
1a: d1 e8 shr %eax
1c: 8d 14 11 lea (%ecx,%edx,1),%edx
1f: 25 7f 7f 7f 7f and $0x7f7f7f7f,%eax
24: 8d 04 02 lea (%edx,%eax,1),%eax
27: c3 ret
Listing 3-21. ARMv5 Assembly Code (ARM Mode)
00000000 <avg8>:
0: e59f301c ldr r3, [pc, #28] ; 24 <avg8+0x24>
4: e59f201c ldr r2, [pc, #28] ; 28 <avg8+0x28>
8: e0003003 and r3, r0, r3
c: e0033001 and r3, r3, r1
10: e00200a0 and r0, r2, r0, lsr #1
14: e0830000 add r0, r3, r0
18: e00220a1 and r2, r2, r1, lsr #1
1c: e0800002 add r0, r0, r2
20: e12fff1e bx lr
24: 01010101 .word 0x01010101
28: 7f7f7f7f .word 0x7f7f7f7f
Because the ARMv5 MOV instruction can only encode a limited set of immediate values (an 8-bit constant rotated by an even number of bits) and 0x01010101 is not one of them, an LDR instruction is used instead to load 0x01010101 into register r3. Similarly, an LDR instruction is used to load 0x7f7f7f7f into r2.
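The reason MOV cannot be used here comes from the way ARM-mode data-processing instructions encode immediates: as an 8-bit constant rotated right by an even number of bits. The sketch below tests whether a value fits that encoding; the function name is_arm_immediate is hypothetical. It reports that 0x7e0 and 0xf800 (used as immediates in Listing 3-13) are encodable, while 0x01010101 and 0x7f7f7f7f are not.

```c
#include <stdint.h>

/* Returns 1 if value can be encoded as an ARM-mode data-processing
   immediate: an 8-bit constant rotated right by an even number of bits. */
int is_arm_immediate (uint32_t value)
{
    int rot;

    for (rot = 0; rot < 32; rot += 2) {
        /* Rotating left by rot undoes a rotate right by rot. */
        uint32_t undone = (rot == 0) ? value
                                     : ((value << rot) | (value >> (32 - rot)));
        if (undone <= 0xFF) {
            return 1;
        }
    }
    return 0;
}
```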
Listing 3-22. ARMv7 Assembly Code (ARM Mode)
00000000 <avg8>:
0: e3003101 movw r3, #257 ; 0x101
4: e3072f7f movw r2, #32639 ; 0x7f7f
8: e3403101 movt r3, #257 ; 0x101
c: e3472f7f movt r2, #32639 ; 0x7f7f
10: e0003003 and r3, r0, r3
14: e00200a0 and r0, r2, r0, lsr #1
18: e0033001 and r3, r3, r1
1c: e00220a1 and r2, r2, r1, lsr #1
20: e0830000 add r0, r3, r0
24: e0800002 add r0, r0, r2
28: e12fff1e bx lr
Instead of using an LDR instruction to copy 0x01010101 to r3, the ARMv7 code uses two MOV instructions: the first one, MOVW, is to copy a 16-bit value (0x0101) to the bottom 16 bits of r3 while the second one, MOVT, is to copy 0x0101 to the top 16 bits of r3. After these two instructions, r3 will indeed contain the 0x01010101 value. The rest of the assembly code looks like the ARMv5 assembly code.
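The MOVW/MOVT pair described above can be modeled in a few lines of C. This is only an illustrative sketch; the function names movw, movt, and load_constant are hypothetical. MOVW writes the bottom half and clears the top half, and MOVT then fills in the top half without touching the bottom half.

```c
#include <stdint.h>

/* movw: write a 16-bit immediate to the bottom half, clearing the top half. */
static uint32_t movw (uint16_t imm)
{
    return imm;
}

/* movt: write a 16-bit immediate to the top half, keeping the bottom half. */
static uint32_t movt (uint32_t reg, uint16_t imm)
{
    return (reg & 0x0000FFFFu) | ((uint32_t) imm << 16);
}

/* Builds a full 32-bit constant the way the two instructions do. */
uint32_t load_constant (uint16_t top, uint16_t bottom)
{
    uint32_t r3 = movw(bottom);   /* movw r3, #bottom */
    return movt(r3, top);         /* movt r3, #top    */
}
```

Here load_constant(0x0101, 0x0101) yields 0x01010101, the value loaded into r3 in Listing 3-22.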
Listing 3-23. ARMv5 Assembly Code (Thumb Mode)
00000000 <avg8>:
0: b510 push {r4, lr}
2: 4c05 ldr r4, [pc, #20] (18 <avg8+0x18>)
4: 4b05 ldr r3, [pc, #20] (1c <avg8+0x1c>)
6: 4004 ands r4, r0
8: 0840 lsrs r0, r0, #1
a: 4018 ands r0, r3
c: 400c ands r4, r1
e: 1822 adds r2, r4, r0
10: 0848 lsrs r0, r1, #1
12: 4003 ands r3, r0
14: 18d0 adds r0, r2, r3
16: bd10 pop {r4, pc}
18: 01010101 .word 0x01010101
1c: 7f7f7f7f .word 0x7f7f7f7f
Since this code makes use of the r4 register, it needs to be saved onto the stack and later restored.
Listing 3-24. ARMv7 Assembly Code (Thumb Mode)
00000000 <avg8>:
0: f000 3301 and.w r3, r0, #16843009 ; 0x1010101
4: 0840 lsrs r0, r0, #1
6: 400b ands r3, r1
8: f000 307f and.w r0, r0, #2139062143 ; 0x7f7f7f7f
c: 0849 lsrs r1, r1, #1
e: 1818 adds r0, r3, r0
10: f001 317f and.w r1, r1, #2139062143 ; 0x7f7f7f7f
14: 1840 adds r0, r0, r1
16: 4770 bx lr
The Thumb2 assembly code is more compact because 0x01010101 and 0x7f7f7f7f can be encoded directly as immediate operands of the AND instructions, so no separate instruction is needed to load them into registers.
Before deciding to write optimized assembly code, you may stop and think a little bit about how the C code itself could be optimized. After a little bit of thinking, you may end up with the code shown in Listing 3-25.
Listing 3-25. Faster Implementation of Parallel Average Function
unsigned int avg8_faster (unsigned int a, unsigned int b)
{
    return (((a ^ b) & 0xFEFEFEFE) >> 1) + (a & b);
}
The C code is more compact than the first version and would appear to be faster. The first version used two >>, four &, and two + operations (total of eight “basic” operations) while the new version uses only five “basic” operations. Intuitively, the second implementation should be faster. And it is indeed.
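The identity behind Listing 3-25 is a + b = (a ^ b) + 2 * (a & b): the XOR adds the bits without carries and the AND gives the carries. Halving both sides gives ((a ^ b) >> 1) + (a & b), and masking with 0xFEFEFEFE before the shift keeps the lowest bit of each byte from sliding into its neighbor. The sketch below checks that both versions agree; count_mismatches is a hypothetical helper that tries every pair of byte values, replicated in all four byte lanes.

```c
/* Listing 3-19 */
static unsigned int avg8 (unsigned int a, unsigned int b)
{
    return ((a >> 1) & 0x7F7F7F7F) + ((b >> 1) & 0x7F7F7F7F) + (a & b & 0x01010101);
}

/* Listing 3-25 */
static unsigned int avg8_faster (unsigned int a, unsigned int b)
{
    return (((a ^ b) & 0xFEFEFEFE) >> 1) + (a & b);
}

/* Returns the number of byte pairs for which the two versions disagree. */
unsigned int count_mismatches (void)
{
    unsigned int x, y, mismatches = 0;

    for (x = 0; x < 256; x++) {
        for (y = 0; y < 256; y++) {
            /* replicate each byte value in all four lanes */
            unsigned int a = x * 0x01010101u;
            unsigned int b = y * 0x01010101u;
            if (avg8(a, b) != avg8_faster(a, b)) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```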
Listing 3-26 shows the resulting ARMv7 Thumb assembly code.
Listing 3-26. ARMv7 Assembly Code (Thumb Mode)
00000000 <avg8_faster>:
0: ea81 0300 eor.w r3, r1, r0
4: 4001 ands r1, r0
6: f003 33fe and.w r3, r3, #4278124286 ; 0xfefefefe
a: eb01 0053 add.w r0, r1, r3, lsr #1
e: 4770 bx lr
This faster implementation results in faster and more compact code (not including the instruction to return from the function, four instructions instead of eight).
While this may sound terrific, a closer look at the ARM instruction set reveals the UHADD8 instruction, which performs an unsigned byte-wise addition, halving the results. This happens to be exactly what we want to compute. Consequently, an even faster implementation can easily be written and is shown in Listing 3-27.
Listing 3-27. Hand-crafted Assembly Code
.global avg8_asm
.func avg8_asm
avg8_asm:
uhadd8 r0, r0, r1
bx lr
.endfunc
.end
Other “parallel instructions” exist. For example, UHADD16 would be like UHADD8 but instead of performing byte-wise additions it would perform halfword-wise additions.
These instructions can improve performance significantly but because compilers have a hard time generating code that uses them, you will often find yourself having to write the assembly code manually in order to take advantage of them.
NOTE: Parallel instructions were introduced in the ARMv6 architecture so you won’t be able to use them when compiling for the armeabi ABI (ARMv5).
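For targets where the parallel instructions are not available, the halfword-wise case can be written in C with the same trick as Listing 3-25. The sketch below is a C model of what UHADD16 computes; the function name avg16 is hypothetical. The only change from the byte-wise version is the mask, which now clears the lowest bit of each 16-bit half instead of each byte.

```c
/* Halfword-wise average of two packed 16-bit values (C model of UHADD16). */
unsigned int avg16 (unsigned int a, unsigned int b)
{
    return (((a ^ b) & 0xFFFEFFFEu) >> 1) + (a & b);
}
```

For example, avg16(0x00100020, 0x00300040) gives 0x00200030: the average of 0x0010 and 0x0030 in the top half and the average of 0x0020 and 0x0040 in the bottom half.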
Writing whole functions using assembly code can quickly become tedious. In many cases, only parts of a routine would benefit from using assembly code while the rest can be written in C or C++. The GCC compiler lets you mix assembly with C/C++, as shown in Listing 3-28.
Listing 3-28. Assembly Code Mixed With C Code
unsigned int avg8_fastest (unsigned int a, unsigned int b)
{
#if defined(__ARM_ARCH_7A__)
    unsigned int avg;
    asm("uhadd8 %[average], %[val1], %[val2]"
        : [average] "=r" (avg)
        : [val1] "r" (a), [val2] "r" (b));
    return avg;
#else
    return (((a ^ b) & 0xFEFEFEFE) >> 1) + (a & b); // default generic implementation
#endif
}
NOTE: Visit http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html for more information about extended asm and http://gcc.gnu.org/onlinedocs/gcc/Constraints.html for details about the constraints. A single asm() statement can include multiple instructions.
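As a sketch of an asm() statement that holds several instructions, the color conversion of Listing 3-17 could be embedded in a C function as shown below. This variant is not from the original text: the function name argb8888_to_rgb565_inline and the operand names are hypothetical, and the code has not been measured. The "=&r" constraints mark the outputs as early-clobber, because they are written before the input has been read for the last time. On other architectures the generic C expression is used.

```c
/* Hypothetical inline-assembly variant of the color conversion. */
unsigned int argb8888_to_rgb565_inline (unsigned int color)
{
#if defined(__ARM_ARCH_7A__)
    unsigned int out, tmp;
    /* instructions are separated by "\n\t" inside a single asm() statement */
    asm("ubfx %[out], %[in], #3, #5\n\t"
        "lsr  %[tmp], %[in], #10\n\t"
        "bfi  %[out], %[tmp], #5, #6\n\t"
        "lsr  %[tmp], %[tmp], #9\n\t"
        "bfi  %[out], %[tmp], #11, #5"
        : [out] "=&r" (out), [tmp] "=&r" (tmp)
        : [in] "r" (color));
    return out;
#else
    /* generic implementation, same expression as Listing 3-11 */
    return ((color >> 8) & 0xF800) | ((color >> 5) & 0x07E0) | ((color >> 3) & 0x001F);
#endif
}
```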
The updated Android.mk is shown in Listing 3-29.
Listing 3-29. Android.mk
LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)
LOCAL_MODULE := chapter3
LOCAL_SRC_FILES := gcd.c rgb.c avg8.c
ifeq ($(TARGET_ARCH_ABI),armeabi)
LOCAL_SRC_FILES += gcd_asm.S
endif
ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
LOCAL_SRC_FILES += gcd_asm.S rgb_asm.S avg8_asm.S
endif
include $(BUILD_SHARED_LIBRARY)
This example shows that sometimes, good knowledge of the instruction set is needed to achieve the best performance. Since Android devices are mostly based on ARM architectures, you should focus your attention on the ARM instruction set. The ARM documentation available on the ARM website (infocenter.arm.com) is of great quality so make sure you use it.
3.1.4 ARM Instructions
ARM instructions are plentiful. While the goal is not to document in great detail what each one does, Table 3-1 shows the list of available ARM instructions, each one with a brief description.
As you become familiar with them, you will learn that some of them are used much more often than others, albeit the more obscure ones are often the ones that can dramatically improve performance. For example, the ADD and MOV instructions are practically ubiquitous while the SETEND instruction is not going to be used very often (yet it is a great instruction when you need to access data of different endianness).
NOTE: For detailed information about these instructions, refer to the ARM Compiler Toolchain Assembler Reference document available at http://infocenter.arm.com.
Table 3-1. ARM Instructions
Mnemonic Description
ADC Add with carry
ADD Add
ADR Generate PC- or register-relative address
ADRL (pseudo-instruction) Generate PC- or register-relative address
AND Logical AND
ASR Arithmetic Shift Right
B Branch
BFC Bit Field Clear
BFI Bit Field Insert
BIC Bit Clear
BKPT Breakpoint
BL Branch with Link
BLX Branch with Link, change instruction set
BX Branch, change instruction set
BXJ Branch, change to Jazelle
CBZ Compare and Branch if Zero
CBNZ Compare and Branch if Not Zero
CDP Coprocessor Data Processing
CDP2 Coprocessor Data Processing
CHKA Check Array
CLREX Clear Exclusive
CLZ Count Leading Zeros
CMN Compare Negative
CMP Compare
CPS Change Processor State
DBG Debug Hint
DMB Data Memory Barrier
DSB Data Synchronization Barrier
ENTERX Change state to ThumbEE
EOR Exclusive OR
HB Handler Branch
HBL Handler Branch
HBLP Handler Branch
HBP Handler Branch
ISB Instruction Synchronization Barrier
IT If-Then
LDC Load Coprocessor
LDC2 Load Coprocessor
LDM Load Multiple registers
LDR Load Register
LDR (pseudo-instruction) Load Register
LDRB Load Register with Byte
LDRBT Load Register with Byte, user mode
LDRD Load Registers with two words
LDREX Load Register, Exclusive
LDREXB Load Register with Byte, Exclusive
LDREXD Load Registers with two words, Exclusive
LDREXH Load Register with Halfword, Exclusive
LDRH Load Register with Halfword
LDRHT Load Register with Halfword, user mode
LDRSB Load Register with Signed Byte
LDRSBT Load Register with Signed Byte, user mode
LDRSH Load Register with Signed Halfword
LDRT Load Register, user mode
LEAVEX Exit ThumbEE state
LSL Logical Shift Left
LSR Logical Shift Right
MAR Move from Registers to 40-bit Accumulator
MCR Move from Register to Coprocessor
MCR2 Move from Register to Coprocessor
MCRR Move from Registers to Coprocessor
MCRR2 Move from Registers to Coprocessor
MIA Multiply with Internal Accumulate
MIAPH Multiply with Internal Accumulate, Packed Halfwords
MIAxy Multiply with Internal Accumulate, Halfwords
MLA Multiply and Accumulate
MLS Multiply and Subtract
MOV Move
MOVT Move Top
MOV32 (pseudo) Move 32-bit value to register
MRA Move from 40-bit Accumulators to Registers
MRC Move from Coprocessor to Register
MRC2 Move from Coprocessor to Register
MRRC Move from Coprocessor to Registers
MRRC2 Move from Coprocessor to Registers
MRS Move from PSR to Register
MRS Move from system Coprocessor to Register
MSR Move from Register to PSR
MSR Move from Register to system Coprocessor
MUL Multiply
MVN Move Not
NOP No Operation
ORN Logical OR NOT
ORR Logical OR
PKHBT Pack Halfwords (Bottom + Top)
PKHTB Pack Halfwords (Top + Bottom)
PLD Preload Data
PLDW Preload Data with intent to Write
PLI Preload Instructions
POP Pop registers from stack
PUSH Push registers to stack
QADD Signed Add, Saturate
QADD8 Parallel Signed Add (4 x 8-bit), Saturate
QADD16 Parallel Signed Add (2 x 16-bit), Saturate
QASX Exchange Halfwords, Signed Add and Subtract, Saturate
QDADD Signed Double and Add, Saturate
QDSUB Signed Double and Subtract, Saturate
QSAX Exchange Halfwords, Signed Subtract and Add, Saturate
QSUB Signed Subtract, Saturate
QSUB8 Parallel Signed Subtract (4 x 8-bit), Saturate
QSUB16 Parallel Signed Subtract (2 x 16-bit), Saturate
RBIT Reverse Bits
REV Reverse bytes (change endianness)
REV16 Reverse bytes in halfwords
REVSH Reverse bytes in bottom halfword and sign extend
RFE Return From Exception
ROR Rotate Right
RRX Rotate Right with Extend
RSB Reverse Subtract
RSC Reverse Subtract with Carry
SADD8 Parallel Signed Add (4 x 8-bit)
SADD16 Parallel Signed Add (2 x 16-bit)
SASX Exchange Halfwords, Signed Add and Subtract
SBC Subtract with Carry
SBFX Signed Bit Field Extract
SDIV Signed Divide
SEL Select bytes
SETEND Set Endianness for memory access
SEV Set Event
SHADD8 Signed Add (4 x 8-bit), halving the results
SHADD16 Signed Add (2 x 16-bit), halving the results
SHASX Exchange Halfwords, Signed Add and Subtract, halving the results
SHSAX Exchange Halfwords, Signed Subtract and Add, halving the results
SHSUB8 Signed Subtract (4 x 8-bit), halving the results
SHSUB16 Signed Subtract (2 x 16-bit), halving the results
SMC Secure Monitor Call
SMLAxy Signed Multiply with Accumulate
SMLAD Dual Signed Multiply Accumulate
SMLAL Signed Multiply Accumulate
SMLALxy Signed Multiply Accumulate
SMLALD Dual Signed Multiply Accumulate Long
SMLAWy Signed Multiply with Accumulate
SMLSD Dual Signed Multiply Subtract Accumulate
SMLSLD Dual Signed Multiply Subtract Accumulate Long
SMMLA Signed top word Multiply with Accumulate
SMMLS Signed top word Multiply with Subtract
SMMUL Signed top word Multiply
SMUAD Dual Signed Multiply and Add
SMULxy Signed Multiply
SMULL Signed Multiply
SMULWy Signed Multiply
SMUSD Dual Signed Multiply and Subtract
SRS Store Return State
SSAT Signed Saturate
SSAT16 Signed Saturate, parallel halfwords
SSAX Exchange Halfwords, Signed Subtract and Add
SSUB8 Signed Byte-wise subtraction
SSUB16 Signed Halfword-wise subtraction
STC Store Coprocessor
STC2 Store Coprocessor
STM Store Multiple Registers (see LDM)
STR Store Register (see LDR)
STRB Store Register with byte
STRBT Store Register with byte, user mode
STRD Store Registers with two words
STREX Store Register, Exclusive (see LDREX)
STREXB Store Register with Byte, Exclusive
STREXD Store Register with two words, Exclusive
STREXH Store Register with Halfword, Exclusive
STRH Store Register with Halfword
STRHT Store Register with Halfword, user mode
STRT Store Register, user mode
SUB Subtract
SUBS Exception Return, no stack
SVC Supervisor Call
SWP Swap Registers and Memory (deprecated in v6)
SWPB Swap Registers and Memory (deprecated in v6)
SXTAB Sign Extend Byte and Add
SXTAB16 Sign Extend two 8-bit values to two 16-bit values and Add
SXTAH Sign Extend Halfword and Add
SXTB Sign Extend Byte
SXTB16 Sign Extend two 8-bit values to two 16-bit values
SXTH Sign Extend Halfword
SYS Execute system coprocessor instruction
TBB Table Branch Byte
TBH Table Branch Halfword
TEQ Test Equivalence
TST Test
UADD8 Parallel Unsigned Add (4 x 8-bit)
UADD16 Parallel Unsigned Add (2 x 16-bit)
UASX Exchange Halfwords, Unsigned Add and Subtract
UBFX Unsigned Bit Field Extract
UDIV Unsigned Divide
UHADD8 Unsigned Add (4 x 8-bit), halving the results
UHADD16 Unsigned Add (2 x 16-bit), halving the results
UHASX Exchange Halfwords, Unsigned Add and Subtract, halving the results
UHSAX Exchange Halfwords, Unsigned Subtract and Add, halving the results
UHSUB8 Unsigned Subtract (4 x 8-bit), halving the results
UHSUB16 Unsigned Subtract (2 x 16-bit), halving the results
USAD8 Unsigned Sum of Absolute Difference
USADA8 Accumulate Unsigned Sum of Absolute Difference
USAT Unsigned Saturate
USAT16 Unsigned Saturate, parallel halfwords
USAX Exchange Halfwords, Unsigned Subtract and Add
USUB8 Unsigned Byte-wise subtraction
USUB16 Unsigned Halfword-wise subtraction
UXTB Zero Extend Byte
UXTB16 Zero Extend two 8-bit values to two 16-bit values
UXTH Zero Extend, Halfword
WFE Wait For Event
WFI Wait For Interrupt
YIELD Yield
3.1.5 ARM NEON
NEON is a 128-bit SIMD (Single Instruction, Multiple Data) extension to the Cortex-A family of processors. If you understood what the UHADD8 instruction was doing in Listing 3-27, then you will easily understand NEON. NEON registers are seen as vectors. For example, a 128-bit NEON register can be seen as four 32-bit integers, eight 16-bit integers, or even sixteen 8-bit integers (the same way the UHADD8 instruction interprets a 32-bit register as four 8-bit values). A NEON instruction would then perform the same operation on all elements.
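To make the idea of "one register, several elements" concrete, here is a minimal scalar sketch of what a lane-wise halving add such as UHADD8 computes. The function name halving_add_4x8 is hypothetical and only illustrates the semantics; the real instruction processes all four lanes at once.

```c
#include <stdint.h>

/* Scalar model of a lane-wise halving add: each 32-bit word is treated as
 * four independent unsigned 8-bit lanes, and every lane of the result holds
 * (a_lane + b_lane) / 2. */
uint32_t halving_add_4x8(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    int lane;

    for (lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        uint32_t avg = (x + y) >> 1;  /* the 9-bit sum is halved, so no lane overflows */

        result |= avg << (8 * lane);
    }
    return result;
}
```

For example, halving_add_4x8(0x10203040, 0x30405060) yields 0x20304050: each byte is averaged with its counterpart and no carry leaks into the neighboring lane. A NEON instruction applies the same principle to a 128-bit register.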
NEON has several important features:
A single instruction can perform multiple operations (after all, this is the essence of SIMD instructions)
Independent registers
Independent pipeline
Many NEON instructions will look similar to ARM instructions. For example, the VADD instruction will add corresponding elements in two vectors, which is similar to what the ADD instruction does (although the ADD instruction simply adds two 32-bit registers and does not treat them as vectors). All NEON instructions start with the letter V, so identifying them is easy.
There are basically two ways to use NEON in your code:
You can use NEON instructions in hand-written assembly code.
You can use NEON intrinsics defined in arm_neon.h, a header file provided in the NDK.
The NDK provides sample code for NEON (hello-neon), so you should first review this code. While using NEON can greatly increase performance, it may also require you to modify your algorithms a bit to fully take advantage of vectorization.
To use NEON instructions, make sure you add the .neon suffix to the file name in Android.mk’s LOCAL_SRC_FILES. If all files need to be compiled with NEON support, you can set LOCAL_ARM_NEON to true in Android.mk.
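As an illustration of the intrinsics approach, the following sketch adds two integer buffers four elements at a time. It is not taken from the NDK samples: the function name add_buffers_intrinsics is hypothetical, and the preprocessor guard (an assumption about how the file is built) lets the same file fall back to plain C when NEON support is not enabled.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Adds src[i] to dst[i] for every i in [0, size). */
void add_buffers_intrinsics(int32_t* dst, const int32_t* src, int size)
{
    int i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* NEON path: four 32-bit lanes per iteration */
    for (; i + 4 <= size; i += 4) {
        int32x4_t d = vld1q_s32(dst + i);     /* load four ints from dst */
        int32x4_t s = vld1q_s32(src + i);     /* load four ints from src */

        vst1q_s32(dst + i, vaddq_s32(d, s));  /* lane-wise add, then store */
    }
#endif

    /* scalar path: leftovers, or every element when NEON is not enabled */
    for (; i < size; i++) {
        dst[i] += src[i];
    }
}
```

The intrinsics map closely to the VLD1, VADD, and VST1 instructions, but the compiler still takes care of register allocation and scheduling, which usually makes this approach easier to maintain than hand-written assembly.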
A great way to learn about NEON is to look at the Android source code itself. For example, SkBlitRow_opts_arm.cpp (in the external/skia/src/opts directory) contains several routines using NEON instructions, using asm() or intrinsics. In the same directory you will also find SkBlitRow_opts_SSE2.cpp, which contains optimized routines using x86 SIMD instructions. The Skia source code is also available online at http://code.google.com/p/skia.
3.1.6 CPU Features
As you have seen already, not all CPUs are the same. Even within the same family of processors (for example, the ARM Cortex family), not all processors support the same features as some are optional. For example, not all ARMv7 processors support the NEON extension or the VFP extension. For this reason, Android provides functions to help you query what kind of platform the code is running on. These functions are defined in cpu-features.h, a header file provided in the NDK, and Listing 3-30 shows you how to use these functions to determine whether a generic function should be used or one that takes advantage of the NEON instruction set.
Listing 3-30. Checking CPU Features
#include <stdint.h>
#include <cpu-features.h>

static inline int has_features(uint64_t features, uint64_t mask)
{
    return ((features & mask) == mask);
}

static void (*my_function)(int* dst, const int* src, int size); // function pointer

extern void neon_function(int* dst, const int* src, int size); // defined in some other file
extern void default_function(int* dst, const int* src, int size);

int init ()
{
    AndroidCpuFamily cpu = android_getCpuFamily();
    uint64_t features = android_getCpuFeatures();
    int count = android_getCpuCount(); // ignore here

    if (cpu == ANDROID_CPU_FAMILY_ARM) {
        if (has_features(features,
                ANDROID_CPU_ARM_FEATURE_ARMv7|
                ANDROID_CPU_ARM_FEATURE_NEON))
        {
            my_function = neon_function;
        }
        else
        {
            // use default functions here
            my_function = default_function; // generic function
        }
    }
    else
    {
        my_function = default_function; // generic function
    }
    return 0;
}

void call_function(int* dst, const int* src, int size)
{
    // we assume init() was called earlier to set the my_function pointer
    my_function(dst, src, size);
}
To use the CPU features functions, you will have to do two things in your Android.mk:
Add “cpufeatures” to the list of static libraries to link with (LOCAL_STATIC_LIBRARIES := cpufeatures).
Import the android/cpufeatures module by adding $(call import-module,android/cpufeatures) at the end of Android.mk.
Typically, probing the capabilities of the platform will be one of the first tasks you will have to perform in order to use the best possible functions.
If your code depends on the presence of the VFP extension, you may also have to check whether NEON is supported. The ANDROID_CPU_ARM_FEATURE_VFPv3 flag is for the minimum profile of the extension with sixteen 64-bit floating-point registers (D0 to D15).
If NEON is supported, then thirty-two 64-bit floating-point registers are available (D0 to D31). Registers are shared between NEON and VFP and registers are aliased:
Q0 (128-bit) is an alias for D0 and D1 (both 64-bit).
D0 is an alias for S0 and S1 (S registers are single-precision 32-bit registers).
The fact that registers are shared and aliased is a very important detail, so make sure you use registers carefully when hand-writing assembly code.
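The aliasing can be pictured with a plain C union. This is only a mental model, not how the hardware is accessed: the type quad_register_model and the function write_s0_s1_read_d0 are hypothetical names introduced for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one quad register and the narrower views that
 * alias the same storage. */
typedef union {
    uint8_t  bytes[16]; /* the raw 128 bits (a Q register) */
    uint64_t d[2];      /* the same bits seen as two 64-bit D registers */
    uint32_t s[4];      /* the same bits seen as four 32-bit S registers */
} quad_register_model;

/* Writes the two "S registers" that alias the first "D register" and
 * returns what that D register now holds. */
uint64_t write_s0_s1_read_d0(uint32_t s0, uint32_t s1)
{
    quad_register_model q;

    memset(&q, 0, sizeof(q));
    q.s[0] = s0;   /* changes one half of d[0] */
    q.s[1] = s1;   /* changes the other half of d[0] */
    return q.d[0]; /* reflects both writes */
}
```

In the same way, assembly code that writes S0 or S1 silently changes D0 and Q0, so a value kept in D0 across such a write is lost.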
NOTE: Refer to the NDK’s documentation and more particularly to CPU-FEATURES.html for more information about the APIs.
3.2 C Extensions
The Android NDK comes with the GCC compiler (version 4.4.3 in release 7 of the NDK). As a consequence, you are able to use the C extensions the GNU Compiler Collection supports. Among the ones that are particularly interesting, as far as performance is concerned, are:
Built-in functions
Vector instructions
NOTE: Visit http://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html for an exhaustive list of the GCC C extensions.
3.2.1 Built-in Functions
Built-in functions, sometimes referred to as intrinsics, are functions handled in a special manner by the compiler. Built-in functions are often used to allow for some constructs the language does not support, and are often inlined, that is, the compiler replaces the call with a series of instructions specific to the target and typically optimized. For example, a call to the __builtin_clz() function would result in a CLZ instruction being generated (if the code is compiled for ARM and the CLZ instruction is available). When no optimized version of the built-in function exists, or when optimizations are turned off, the compiler simply makes a call to a function containing a generic implementation.
For example, GCC supports the following built-in functions:
__builtin_return_address
__builtin_frame_address
__builtin_expect
__builtin_assume_aligned
__builtin_prefetch
__builtin_ffs
__builtin_clz
__builtin_ctz
__builtin_clrsb
__builtin_popcount
__builtin_parity
__builtin_bswap32
__builtin_bswap64
Using built-in functions allows you to keep your code more generic while still taking advantage of optimizations available on some platforms.
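As a brief illustration, the sketch below wraps a few of these built-in functions in small helpers. The helper names (floor_log2, count_set_bits, swap_bytes32) are hypothetical; only the __builtin_* calls come from GCC. Note that the result of __builtin_clz() is undefined for an argument of zero, so that case is handled separately.

```c
#include <stdint.h>

/* Returns floor(log2(x)), or -1 when x is 0. */
int floor_log2(unsigned int x)
{
    if (__builtin_expect(x == 0, 0)) {  /* hint: this branch is unlikely */
        return -1;
    }
    /* index of the highest set bit = (number of bits - 1) - leading zeros */
    return (int)(sizeof(unsigned int) * 8 - 1) - __builtin_clz(x);
}

/* Returns the number of bits set in x. */
int count_set_bits(unsigned int x)
{
    return __builtin_popcount(x);
}

/* Reverses the byte order of a 32-bit value (endianness conversion). */
uint32_t swap_bytes32(uint32_t x)
{
    return __builtin_bswap32(x);
}
```

The same source compiles for any ABI; whether a single instruction such as CLZ or REV is emitted depends on the target and the optimization level.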
3.2.2 Vector Instructions
Vector instructions are not really common in C code. However, with more and more CPUs supporting SIMD instructions, using vectors in your algorithms can accelerate your code quite significantly.
Listing 3-31 shows how you can define your own vector type using the vector_size variable attribute and how you can add two vectors.
Listing 3-31. Vectors
typedef int v4int __attribute__ ((vector_size (16))); // vector of four integers (16 bytes)

void add_buffers_vectorized (int* dst, const int* src, int size)
{
    v4int* dstv4int = (v4int*) dst;
    const v4int* srcv4int = (v4int*) src;
    int i;

    for (i = 0; i < size/4; i++) {
        *dstv4int++ += *srcv4int++;
    }

    // leftovers
    if (size & 0x3) {
        dst = (int*) dstv4int;
        src = (int*) srcv4int;
        switch (size & 0x3) {
            case 3: *dst++ += *src++; // falls through
            case 2: *dst++ += *src++; // falls through
            case 1:
            default: *dst += *src;
        }
    }
}

// simple implementation
void add_buffers (int* dst, const int* src, int size)
{
    while (size--) {
        *dst++ += *src++;
    }
}
How this code is compiled depends on whether the target supports SIMD instructions and whether the compiler is told to use them. To tell the compiler to use NEON instructions, simply add the .neon suffix to the file name in Android.mk’s LOCAL_SRC_FILES. Alternatively, you can set LOCAL_ARM_NEON to true if all files need to be compiled with NEON support.
Listing 3-32 shows the resulting assembly code when the compiler does not use ARM SIMD instructions (NEON), whereas Listing 3-33 shows the use of the NEON instructions. (The add_buffers function is compiled the same way and is not shown in the second listing.) The loop spans offsets 0x36 to 0x78 in Listing 3-32 and offsets 0x1a to 0x2e in Listing 3-33.
Listing 3-32. Without NEON Instructions
00000000 <add_buffers_vectorized>:
0: e92d 0ff0 stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
4: f102 0803 add.w r8, r2, #3 ; 0x3
8: ea18 0822 ands.w r8, r8, r2, asr #32
c: bf38 it cc
e: 4690 movcc r8, r2
10: b08e sub sp, #56
12: 4607 mov r7, r0
14: 468c mov ip, r1
16: ea4f 08a8 mov.w r8, r8, asr #2
1a: 9201 str r2, [sp, #4]
1c: f1b8 0f00 cmp.w r8, #0 ; 0x0
20: 4603 mov r3, r0
22: 460e mov r6, r1
24: dd2c ble.n 80 <add_buffers_vectorized+0x80>
26: 2500 movs r5, #0
28: f10d 0928 add.w r9, sp, #40 ; 0x28
2c: 462e mov r6, r5
2e: f10d 0a18 add.w sl, sp, #24 ; 0x18
32: f10d 0b08 add.w fp, sp, #8 ; 0x8
36: 197c adds r4, r7, r5
38: 3601 adds r6, #1
3a: e894 000f ldmia.w r4, {r0, r1, r2, r3}
3e: e889 000f stmia.w r9, {r0, r1, r2, r3}
42: eb0c 0305 add.w r3, ip, r5
46: 3510 adds r5, #16
48: 4546 cmp r6, r8
4a: cb0f ldmia r3!, {r0, r1, r2, r3}
4c: e88a 000f stmia.w sl, {r0, r1, r2, r3}
50: 9b0a ldr r3, [sp, #40]
52: 9a06 ldr r2, [sp, #24]
54: 4413 add r3, r2
56: 9a07 ldr r2, [sp, #28]
58: 9302 str r3, [sp, #8]
5a: 9b0b ldr r3, [sp, #44]
5c: 4413 add r3, r2
5e: 9a08 ldr r2, [sp, #32]
60: 9303 str r3, [sp, #12]
62: 9b0c ldr r3, [sp, #48]
64: 4413 add r3, r2
66: 9a09 ldr r2, [sp, #36]
68: 9304 str r3, [sp, #16]
6a: 9b0d ldr r3, [sp, #52]
6c: 4413 add r3, r2
6e: 9305 str r3, [sp, #20]
70: e89b 000f ldmia.w fp, {r0, r1, r2, r3}
74: e884 000f stmia.w r4, {r0, r1, r2, r3}
78: d1dd bne.n 36 <add_buffers_vectorized+0x36>
7a: 0136 lsls r6, r6, #4
7c: 19bb adds r3, r7, r6
7e: 4466 add r6, ip
80: 9901 ldr r1, [sp, #4]
82: f011 0203 ands.w r2, r1, #3 ; 0x3
86: d007 beq.n 98 <add_buffers_vectorized+0x98>
88: 2a02 cmp r2, #2
8a: d00f beq.n ac <add_buffers_vectorized+0xac>
8c: 2a03 cmp r2, #3
8e: d007 beq.n a0 <add_buffers_vectorized+0xa0>
90: 6819 ldr r1, [r3, #0]
92: 6832 ldr r2, [r6, #0]
94: 188a adds r2, r1, r2
96: 601a str r2, [r3, #0]
98: b00e add sp, #56
9a: e8bd 0ff0 ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
9e: 4770 bx lr
a0: 6819 ldr r1, [r3, #0]
a2: f856 2b04 ldr.w r2, [r6], #4
a6: 188a adds r2, r1, r2
a8: f843 2b04 str.w r2, [r3], #4
ac: 6819 ldr r1, [r3, #0]
ae: f856 2b04 ldr.w r2, [r6], #4
b2: 188a adds r2, r1, r2
b4: f843 2b04 str.w r2, [r3], #4
b8: e7ea b.n 90 <add_buffers_vectorized+0x90>
ba: bf00 nop
00000000 <add_buffers>:
0: b470 push {r4, r5, r6}
2: b14a cbz r2, 18 <add_buffers+0x18>
4: 2300 movs r3, #0
6: 461c mov r4, r3
8: 58c6 ldr r6, [r0, r3]
a: 3401 adds r4, #1
c: 58cd ldr r5, [r1, r3]
e: 1975 adds r5, r6, r5
10: 50c5 str r5, [r0, r3]
12: 3304 adds r3, #4
14: 4294 cmp r4, r2
16: d1f7 bne.n 8 <add_buffers+0x8>
18: bc70 pop {r4, r5, r6}
1a: 4770 bx lr
Listing 3-33. With NEON Instructions
00000000 <add_buffers_vectorized>:
0: b470 push {r4, r5, r6}
2: 1cd6 adds r6, r2, #3
4: ea16 0622 ands.w r6, r6, r2, asr #32
8: bf38 it cc
a: 4616 movcc r6, r2
c: 4604 mov r4, r0
e: 460b mov r3, r1
10: 10b6 asrs r6, r6, #2
12: 2e00 cmp r6, #0
14: dd0f ble.n 36 <add_buffers_vectorized+0x36>
16: 460d mov r5, r1
18: 2300 movs r3, #0
1a: 3301 adds r3, #1
1c: ecd4 2b04 vldmia r4, {d18-d19}
20: ecf5 0b04 vldmia r5!, {d16-d17}
24: 42b3 cmp r3, r6
26: ef62 08e0 vadd.i32 q8, q9, q8
2a: ece4 0b04 vstmia r4!, {d16-d17}
2e: d1f4 bne.n 1a <add_buffers_vectorized+0x1a>
30: 011b lsls r3, r3, #4
32: 18c4 adds r4, r0, r3
34: 18cb adds r3, r1, r3
36: f012 0203 ands.w r2, r2, #3 ; 0x3
3a: d008 beq.n 4e <add_buffers_vectorized+0x4e>
3c: 2a02 cmp r2, #2
3e: 4621 mov r1, r4
40: d00d beq.n 5e <add_buffers_vectorized+0x5e>
42: 2a03 cmp r2, #3
44: d005 beq.n 52 <add_buffers_vectorized+0x52>
46: 680a ldr r2, [r1, #0]
48: 681b ldr r3, [r3, #0]
4a: 18d3 adds r3, r2, r3
4c: 600b str r3, [r1, #0]
4e: bc70 pop {r4, r5, r6}
50: 4770 bx lr
52: 6820 ldr r0, [r4, #0]
54: f853 2b04 ldr.w r2, [r3], #4
58: 1882 adds r2, r0, r2
5a: f841 2b04 str.w r2, [r1], #4
5e: 6808 ldr r0, [r1, #0]
60: f853 2b04 ldr.w r2, [r3], #4
64: 1882 adds r2, r0, r2
66: f841 2b04 str.w r2, [r1], #4
6a: e7ec b.n 46 <add_buffers_vectorized+0x46>
You can quickly see that the loop was compiled in far fewer instructions when NEON instructions are used. As a matter of fact, the vldmia instruction loads four integers from memory, the vadd.i32 instruction performs four additions, and the vstmia instruction stores four integers in memory. This results in more compact and more efficient code.
Using vectors is a double-edged sword though:
They allow you to use SIMD instructions when available while still maintaining generic code that can compile for any ABI, regardless of its support for SIMD instructions. (The code in Listing 3-31 compiles just fine for the x86 ABI as it is not NEON-specific.)
They can result in low-performing code when the target does not support SIMD instructions. (The add_buffers function is far simpler than its “vectorized” equivalent and results in simpler assembly code: see how many times data is read from and written to the stack in add_buffers_vectorized when SIMD instructions are not used.)
NOTE: Visit http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html for more information about vectors.
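One caveat the vectorized function appears to carry is alignment: casting an int pointer to a pointer to a 16-byte vector type assumes the buffers are suitably aligned for that type. The following sketch is one possible variant that does not make this assumption. The function name add_buffers_vectorized_unaligned is hypothetical, and copying through memcpy() is a technique swapped in for illustration, not something taken from the text.

```c
#include <string.h>

typedef int v4int __attribute__ ((vector_size (16))); /* vector of four integers */

/* Adds src[i] to dst[i] for every i in [0, size), without assuming that the
 * buffers are 16-byte aligned. */
void add_buffers_vectorized_unaligned(int* dst, const int* src, int size)
{
    int i;

    for (i = 0; i + 4 <= size; i += 4) {
        v4int d, s;

        /* memcpy() lets the compiler choose loads that are safe for
         * unaligned addresses */
        memcpy(&d, dst + i, sizeof(d));
        memcpy(&s, src + i, sizeof(s));
        d += s;                          /* four additions at once */
        memcpy(dst + i, &d, sizeof(d));
    }

    /* leftovers */
    for (; i < size; i++) {
        dst[i] += src[i];
    }
}
```

With optimizations enabled, compilers typically replace these small fixed-size memcpy() calls with direct loads and stores, but you should review the generated assembly code to confirm this for your target.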
3.3 Tips
The following are a few things you can do in your code relatively easily to achieve better performance.
3.3.1 Inlining Functions
Because function calls can be expensive operations, inlining functions (that is, the process of replacing the function call with the body of the function itself) can make your code run faster. Making a function inlined is simply a matter of adding the “inline” keyword as part of its definition. An example of an inline function is shown in Listing 3-30.
You should use this feature carefully though as it can result in bloated code, negating the advantages of the instruction cache. Typically, inlining works better for small functions, where the overhead of the call itself is significant.
NOTE: Alternatively, use macros.
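The two options can be compared side by side in a small sketch. All names here (square, SQUARE, sum_of_squares, sum_of_squares_macro) are hypothetical and chosen only for illustration.

```c
/* Small function: a good candidate for inlining because the cost of the
 * call would be comparable to the work the function does. */
static inline int square(int x)
{
    return x * x;
}

/* Macro alternative: expanded textually, so the argument is evaluated twice.
 * Never pass an expression with side effects such as SQUARE(i++). */
#define SQUARE(x) ((x) * (x))

int sum_of_squares(const int* values, int count)
{
    int sum = 0;
    int i;

    for (i = 0; i < count; i++) {
        sum += square(values[i]);  /* the compiler may replace this call with x * x */
    }
    return sum;
}

int sum_of_squares_macro(const int* values, int count)
{
    int sum = 0;
    int i;

    for (i = 0; i < count; i++) {
        sum += SQUARE(values[i]);  /* always expanded in place */
    }
    return sum;
}
```

Keep in mind that the inline keyword is only a hint: the compiler may decide not to inline the function, whereas a macro is always expanded but offers no type checking.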
3.3.2 Unrolling Loops
A classic way to optimize loops is to unroll them, sometimes partially. Results will vary and you should experiment in order to measure gains, if any. Make sure the body of the loop does not become too big though as this could have a negative impact on the instruction cache.
Listing 3-34 shows a trivial example of loop unrolling.
Listing 3-34. Unrolling
void add_buffers_unrolled (int* dst, const int* src, int size)
{
    int i;

    for (i = 0; i < size/4; i++) {
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        // GCC not really good at that though... No LDM/STM generated
    }

    // leftovers
    if (size & 0x3) {
        switch (size & 0x3) {
            case 3: *dst++ += *src++; // falls through
            case 2: *dst++ += *src++; // falls through
            case 1:
            default: *dst += *src;
        }
    }
}
3.3.3 Preloading Memory
When you know with a certain degree of confidence that specific data will be accessed or specific instructions will be executed, you can preload (or prefetch) this data or these instructions before they are used.
Because moving data from external memory to the cache takes time, giving enough time to transfer the data from external memory to the cache can result in better performance as this may cause a cache hit when the instructions (or data) are finally accessed.
To preload data, you can use:
GCC’s __builtin_prefetch()
PLD and PLDW ARM instructions in assembly code
You can also use the PLI ARM instruction (ARMv7 and above) to preload instructions.
Some CPUs automatically preload memory, so you may not always observe any gain. However, since you have a better knowledge of how your code accesses data, preloading data can still yield great results.
Listing 3-35 shows how you can take advantage of the preloading built-in function.
Listing 3-35. Preloading Memory
void add_buffers_unrolled_prefetch (int* dst, const int* src, int size)
{
    int i;

    for (i = 0; i < size/8; i++) {
        __builtin_prefetch(dst + 8, 1, 0); // prepare to write
        __builtin_prefetch(src + 8, 0, 0); // prepare to read
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
        *dst++ += *src++;
    }

    // leftovers
    for (i = 0; i < (size & 0x7); i++) {
        *dst++ += *src++;
    }
}
You should be careful about preloading memory though as it may in some cases degrade the performance. Anything you decide to move into the cache will cause other things to be removed from the cache, possibly impacting performance negatively. Make sure that what you preload is very likely to be needed by your code or else you will simply pollute the cache with useless data.
NOTE: While ARM supports the PLD, PLDW, and PLI instructions, x86 supports the PREFETCHT0, PREFETCHT1, PREFETCHT2, and PREFETCHNTA instructions. Refer to the ARM and x86 documentation for more information. Change the last parameter of __builtin_prefetch() and compile for x86 to see which instructions will be used.
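To experiment with that last parameter, a sketch along the following lines may help. The function name sum_with_prefetch and the two macros are hypothetical, and the values chosen for them are assumptions to be tuned by measuring. The second and third arguments of __builtin_prefetch() must be compile-time constants, which is why they are expressed as macros here.

```c
#include <stddef.h>

/* Both values are assumptions chosen for illustration only. */
#define PREFETCH_DISTANCE 16 /* how many elements ahead of the current index */
#define PREFETCH_LOCALITY 3  /* 0 = no temporal locality ... 3 = high temporal locality */

/* Sums the array while asking for upcoming elements to be brought into the cache. */
long sum_with_prefetch(const int* data, size_t count)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (i + PREFETCH_DISTANCE < count) {
            /* second argument: 0 = prepare for a read, 1 = prepare for a write */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, PREFETCH_LOCALITY);
        }
        sum += data[i];
    }
    return sum;
}
```

Rebuilding with PREFETCH_LOCALITY set to 0, 1, 2, and 3 and disassembling the result shows which prefetch instruction the compiler selects for each value. A production loop would typically issue one prefetch per cache line rather than one per element.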
3.3.4 LDM/STM Instead Of LDR/STR
Loading multiple registers with a single LDM instruction is faster than loading registers using multiple LDR instructions. Similarly, storing multiple registers with a single STM instruction is faster than using multiple STR instructions.
While the compiler is often capable of generating such instructions (even when memory accesses are somewhat scattered in your code), you should try to help the compiler as much as possible by writing code that can more easily be optimized by the compiler. For example, the code in Listing 3-36 shows a pattern the compiler should quite easily recognize and generate LDM and STM instructions for (assuming an ARM ABI). Ideally, access to memory should be grouped together whenever possible so that the compiler can generate better code.
Listing 3-36. Pattern to Generate LDM And STM
unsigned int a, b, c, d;
// assuming src and dst are pointers to int

// read source values
a = *src++;
b = *src++;
c = *src++;
d = *src++;

// do something here with a, b, c and d

// write values to dst buffer
*dst++ = a;
*dst++ = b;
*dst++ = c;
*dst++ = d;
NOTE: Unrolling loops and inlining functions can also help the compiler generate LDM or STM instructions more easily.
Unfortunately, the GCC compiler does not always do a great job at generating LDM and STM instructions. Review the generated assembly code and write the assembly code yourself if you think performance would improve significantly with the use of the LDM and STM instructions.
3.4 Summary
Dennis Ritchie, the father of the C language, said C has the power of assembly language and the convenience of… assembly language. In this chapter you saw that in some cases you may have to use assembly language to achieve the desired results. Although assembly is a very powerful language that provides an unobfuscated view of the machine capabilities, it can make maintenance significantly more difficult as by definition assembly language targets a specific architecture. However, assembly code or built-in functions would typically be found in very small portions of your application, where performance is critical, and therefore maintenance should remain relatively easy. If you carefully select which part of your application should be written in Java, which part should be written in C, and which part should be written in assembly, you can make sure your application’s performance is astonishing and impresses your users.