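First, I found the header file needed to use the NEON library from C source code: arm_neon.h.
// I added the include line to the code and ran ndk-build, but it reported an error.
// The message said a flag had to be added. I appended it to LOCAL_CXX_FLAGS, but the error persisted.
// Searching for NDK + NEON finally turned up a few leads.
The Android.mk file can follow this example: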
LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)
# List the source files to compile here; only some of them are shown
LOCAL_SRC_FILES := NcHevcDecoder.cpp JNI_OnLoad.cpp TAppDecTop.cpp
# Default header include paths
# The options after -g must be added before arm_neon.h can be used
# -mfloat-abi=softfp -mfpu=neon are required for arm_neon.h
LOCAL_CFLAGS := -D__cpusplus -g -mfloat-abi=softfp -mfpu=neon -march=armv7-a -mtune=cortex-a8
LOCAL_LDLIBS := -lz -llog
TARGET_ARCH_ABI := armeabi-v7a
LOCAL_ARM_MODE := arm
ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
# Enable NEON optimization
LOCAL_ARM_NEON := true
endif
LOCAL_MODULE := NcHevcDecoder
# Build a shared library
include $(BUILD_SHARED_LIBRARY)
Reference: http://blog.csdn.net/gg137608987/article/details/7565843
The Application.mk file:
APP_PROJECT_PATH := $(call my-dir)/..
APP_PLATFORM := android-10
APP_STL := stlport_static
APP_ABI := armeabi-v7a
APP_CPPFLAGS += -fexceptions
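The APP_ABI line specifies the target platform type for the build, which makes it possible to optimize for a particular platform. Once it is set this way, the target device must of course support NEON instructions.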
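Because NEON code only builds when the compiler flags enable it, it can help to guard the include. The following is a minimal sketch, not taken from the original project: it relies on the predefined macros __ARM_NEON (ACLE) and __ARM_NEON__ (older GCC), and the names BUILT_WITH_NEON, built_with_neon and fill8 are hypothetical. It is only a compile-time check. Whether the device actually supports NEON at run time is a separate question that this sketch does not answer.

```c
#include <stdint.h>

/* __ARM_NEON is the ACLE macro; older GCC releases define __ARM_NEON__. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define BUILT_WITH_NEON 1
#else
#define BUILT_WITH_NEON 0
#endif

/* Hypothetical helper: 1 if this file was compiled with NEON enabled. */
int built_with_neon(void)
{
    return BUILT_WITH_NEON;
}

/* Hypothetical helper: write `value` into 8 consecutive bytes. */
void fill8(uint8_t out[8], uint8_t value)
{
#if BUILT_WITH_NEON
    vst1_u8(out, vdup_n_u8(value));   /* duplicate into 8 lanes, one 64-bit store */
#else
    int i;
    for (i = 0; i < 8; i++)           /* portable fallback, same result */
        out[i] = value;
#endif
}
```

With the flags from the Android.mk above, the NEON branch is the one that gets compiled. Without them, the file still builds and takes the scalar path.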
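void reference_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  for (i=0; i<n; i++)
  {
    int r = *src++;                   // load red
    int g = *src++;                   // load green
    int b = *src++;                   // load blue
    int y = (r*77)+(g*151)+(b*28);    // weighted average, scaled by 256
    *dest++ = (y>>8);                 // undo the scale and store
  }
}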
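void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  uint8x8_t rfac = vdup_n_u8 (77);  // conversion weight R
  uint8x8_t gfac = vdup_n_u8 (151); // conversion weight G
  uint8x8_t bfac = vdup_n_u8 (28);  // conversion weight B
  n /= 8;                           // 8 pixels per iteration
  for (i=0; i<n; i++)
  {
    uint8x8x3_t rgb = vld3_u8 (src);                // load and de-interleave 8 RGB pixels
    uint16x8_t temp = vmull_u8 (rgb.val[0], rfac);  // r*77, widened to 16 bit
    temp = vmlal_u8 (temp, rgb.val[1], gfac);       // += g*151
    temp = vmlal_u8 (temp, rgb.val[2], bfac);       // += b*28
    vst1_u8 (dest, vshrn_n_u16 (temp, 8));          // shift by 8, narrow, store 8 gray pixels
    src  += 8*3;
    dest += 8;
  }
}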
vmull.u8 multiplies each byte of the first argument with each corresponding byte of the second argument. Each result becomes a 16 bit unsigned integer, so no overflow can happen. The entire result is returned as a 128 bit NEON register pair. vmlal.u8 does the same thing as vmull.u8 but also adds the content of another register to the result.
vshrn.u16 does two things at once: It does the shift and afterwards converts the 16 bit integers back to 8 bit by removing all the high-bytes from the result. We get back from the 128 bit register pair to a single 64 bit register.
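The per-lane arithmetic described above can be modelled in plain C. This is a sketch for illustration only: the function names are hypothetical, and each function mimics a single lane of the corresponding instruction rather than a whole register. The weights 77, 151 and 28 sum to 256, so the largest possible accumulator value is 255 * 256 = 65280, which still fits in 16 bits.

```c
#include <stdint.h>

/* One lane of vmull.u8: 8 bit x 8 bit -> 16 bit, at most 255*255 = 65025. */
uint16_t lane_vmull_u8(uint8_t a, uint8_t b)
{
    return (uint16_t)(a * b);
}

/* One lane of vmlal.u8: accumulate a*b into a 16 bit value. */
uint16_t lane_vmlal_u8(uint16_t acc, uint8_t a, uint8_t b)
{
    return (uint16_t)(acc + a * b);
}

/* One lane of vshrn with #8: shift right by 8, then keep the low byte. */
uint8_t lane_vshrn_8(uint16_t v)
{
    return (uint8_t)(v >> 8);
}

/* Gray value of one pixel, computed the way one NEON lane computes it. */
uint8_t lane_gray(uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t acc = lane_vmull_u8(r, 77);   /* r * 77          */
    acc = lane_vmlal_u8(acc, g, 151);      /* + g * 151       */
    acc = lane_vmlal_u8(acc, b, 28);       /* + b * 28        */
    return lane_vshrn_8(acc);              /* divide by 256   */
}
```

For a white pixel, lane_gray(255, 255, 255) accumulates 65280 and returns 255. Pure red returns 76, because 77 * 255 = 19635 and 19635 >> 8 = 76.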
C-version: 15.1 cycles per pixel.
NEON-version: 9.9 cycles per pixel.
That’s only a speed-up of factor 1.5. I expected much more from the NEON implementation. It processes 8 pixels with just 6 instructions after all.
What’s going on here? A look at the assembler output explained it all. Here is the inner-loop part of the convert function:
160: f46a040f vld3.8 {d16-d18}, [sl]
164: e1a0c005 mov ip, r5
168: ecc80b06 vstmia r8, {d16-d18}
16c: e1a04007 mov r4, r7
170: e2866001 add r6, r6, #1 ; 0x1
174: e28aa018 add sl, sl, #24 ; 0x18
178: e8bc000f ldm ip!, {r0, r1, r2, r3}
17c: e15b0006 cmp fp, r6
180: e1a08005 mov r8, r5
184: e8a4000f stmia r4!, {r0, r1, r2, r3}
188: eddd0b06 vldr d16, [sp, #24]
18c: e89c0003 ldm ip, {r0, r1}
190: eddd2b08 vldr d18, [sp, #32]
194: f3c00ca6 vmull.u8 q8, d16, d22
198: f3c208a5 vmlal.u8 q8, d18, d21
19c: e8840003 stm r4, {r0, r1}
1a0: eddd3b0a vldr d19, [sp, #40]
1a4: f3c308a4 vmlal.u8 q8, d19, d20
1a8: f2c80830 vshrn.i16 d16, q8, #8
1ac: f449070f vst1.8 {d16}, [r9]
1b0: e2899008 add r9, r9, #8 ; 0x8
1b4: caffffe9 bgt 160
// The generated assembly was then optimized further by hand. The optimized code is as follows:
# r0: Ptr to destination data
# r1: Ptr to source data
# r2: Iteration count:
push {r4-r5,lr}
lsr r2, r2, #3
# build the three constants:
mov r3, #77
mov r4, #151
mov r5, #28
vdup.8 d3, r3
vdup.8 d4, r4
vdup.8 d5, r5
.loop:
# load 8 pixels:
vld3.8 {d0-d2}, [r1]!
# compute the weighted average:
vmull.u8 q3, d0, d3
vmlal.u8 q3, d1, d4
vmlal.u8 q3, d2, d5
# shift and store:
vshrn.u16 d6, q3, #8
vst1.8 {d6}, [r0]!
subs r2, r2, #1
bne .loop
pop { r4-r5, pc }
As can be seen, the NEON optimization gives a speed-up of more than 7x (it processes 8 pixels at a time). In theory it should be 8x.
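The hand-written loop shifts the count right by 3 (lsr r2, r2, #3) and tests the counter only at the bottom of the loop. It therefore handles whole groups of 8 pixels and, as written, expects a count of at least 8. Below is a minimal sketch of a C wrapper that sends the leftover pixels through scalar code. It assumes the routine is callable from C with the (dest, src, count) order given in its register comments. The names convert_fn, convert_scalar and convert_any_length are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed C signature of a fast routine that handles multiples of 8 pixels. */
typedef void (*convert_fn)(uint8_t *dest, const uint8_t *src, int n);

/* Scalar conversion with the same weights (77, 151, 28) and the same >>8. */
void convert_scalar(uint8_t *dest, const uint8_t *src, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        int r = src[3 * i];
        int g = src[3 * i + 1];
        int b = src[3 * i + 2];
        dest[i] = (uint8_t)((r * 77 + g * 151 + b * 28) >> 8);
    }
}

/* Convert n RGB pixels: whole groups of 8 go to `fast`, the rest to scalar.
 * Passing fast == NULL sends everything through the scalar path. */
void convert_any_length(convert_fn fast, uint8_t *dest, const uint8_t *src, int n)
{
    int bulk;
    if (n <= 0)
        return;
    bulk = (fast != NULL) ? (n & ~7) : 0;   /* largest multiple of 8 <= n */
    if (bulk > 0)
        fast(dest, src, bulk);              /* never called with fewer than 8 */
    convert_scalar(dest + bulk, src + 3 * bulk, n - bulk);
}
```

Passing convert_scalar itself as `fast` is a convenient way to check the splitting logic on a machine without NEON.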