ARM Cortex-A8/A9 - Android NDK - NEON: Introduction and Optimization (a compiled summary of resources)

(1) What is the NDK:

Introduction from the official Android developer site:

http://developer.android.com/sdk/ndk/overview.html

The Android NDK is a toolset that lets you embed components that make use of native code in your Android applications.

Android applications run in the Dalvik virtual machine. The NDK allows you to implement parts of your applications using native-code languages such as C and C++. This can provide benefits to certain classes of applications, in the form of reuse of existing code and in some cases increased speed.

The NDK provides:

·        A set of tools and build files used to generate native code libraries from C and C++ sources

·        A way to embed the corresponding native libraries into an application package file (.apk) that can be deployed on Android devices

·        A set of native system headers and libraries that will be supported in all future versions of the Android platform, starting from Android 1.5. Applications that use native activities must be run on Android 2.3 or later.

·        Documentation, samples, and tutorials

The latest release of the NDK supports these instruction sets:

·        ARMv5TE (including Thumb-1 instructions)

·        ARMv7-A (including Thumb-2 and VFPv3-D16 instructions, with optional support for NEON/VFPv3-D32 instructions)

·        x86 instructions (see CPU-ARCH-ABIS.HTML for more information)

ARMv5TE machine code will run on all ARM-based Android devices. ARMv7-A will run only on devices such as the Verizon Droid or Google Nexus One that have a compatible CPU. The main difference between the two instruction sets is that ARMv7-A supports hardware FPU, Thumb-2, and NEON instructions. You can target either or both of the instruction sets; ARMv5TE is the default, but switching to ARMv7-A is as easy as adding a single line to the application's Application.mk file, without needing to change anything else in the file. You can also build for both architectures at the same time and have everything stored in the final .apk. Complete information is provided in the CPU-ARCH-ABIS.HTML in the NDK package.

The NDK provides stable headers for libc (the C library), libm (the Math library), OpenGL ES (3D graphics library), the JNI interface, and other libraries.

(2) Sample applications

The NDK includes sample applications that illustrate how to use native code in your Android applications:

·        hello-jni: a simple application that loads a string from a native method implemented in a shared library and then displays it in the application UI.

·        hello-neon: a simple application that shows how to use the cpufeatures library to check CPU capabilities at runtime, then use NEON intrinsics if supported by the CPU. Specifically, the application implements two versions of a tiny benchmark for a FIR filter loop, a C version and a NEON-optimized version for devices that support it.
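The pattern that hello-neon is described as using (a plain C FIR loop, a NEON-intrinsics FIR loop, and a runtime check that picks between them) can be sketched roughly as below. This is not the sample's actual source: the function names fir_filter_c, fir_filter_neon, cpu_has_neon and fir_filter are made up for illustration, and the 32-bit output format is an assumption. The android_getCpuFamily() and android_getCpuFeatures() calls come from the NDK cpufeatures header and are only compiled on 32-bit ARM Android builds; on any other target the sketch falls back to the C loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_BUILD 1
#else
#define HAVE_NEON_BUILD 0
#endif

#if defined(__ANDROID__) && defined(__arm__)
#include <cpu-features.h>   /* NDK cpufeatures library */
#endif

/* Plain C FIR: out[i] = sum over k of in[i + k] * coeff[k].
 * The caller must provide n_out + n_taps - 1 input samples. */
void fir_filter_c(int32_t *out, const int16_t *in, const int16_t *coeff,
                  size_t n_out, size_t n_taps)
{
    for (size_t i = 0; i < n_out; i++) {
        int32_t acc = 0;
        for (size_t k = 0; k < n_taps; k++)
            acc += (int32_t)in[i + k] * coeff[k];
        out[i] = acc;
    }
}

#if HAVE_NEON_BUILD
/* Same filter, four taps per step with a widening multiply-accumulate. */
void fir_filter_neon(int32_t *out, const int16_t *in, const int16_t *coeff,
                     size_t n_out, size_t n_taps)
{
    for (size_t i = 0; i < n_out; i++) {
        int32x4_t acc4 = vdupq_n_s32(0);
        size_t k = 0;
        for (; k + 4 <= n_taps; k += 4)
            acc4 = vmlal_s16(acc4, vld1_s16(in + i + k), vld1_s16(coeff + k));
        /* Horizontal add of the four partial sums. */
        int32x2_t acc2 = vadd_s32(vget_low_s32(acc4), vget_high_s32(acc4));
        acc2 = vpadd_s32(acc2, acc2);
        int32_t acc = vget_lane_s32(acc2, 0);
        for (; k < n_taps; k++)            /* leftover taps */
            acc += (int32_t)in[i + k] * coeff[k];
        out[i] = acc;
    }
}
#endif

/* Runtime check: returns 1 only when NEON is known to be usable. */
int cpu_has_neon(void)
{
#if defined(__ANDROID__) && defined(__arm__)
    return android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
           (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0;
#elif defined(__aarch64__)
    return 1;   /* Advanced SIMD is always present on AArch64 */
#else
    return 0;   /* unknown target: stay on the C path */
#endif
}

/* Dispatcher: NEON version if built in and supported, otherwise C. */
void fir_filter(int32_t *out, const int16_t *in, const int16_t *coeff,
                size_t n_out, size_t n_taps)
{
    assert(out != NULL && in != NULL && coeff != NULL);
#if HAVE_NEON_BUILD
    if (cpu_has_neon()) {
        fir_filter_neon(out, in, coeff, n_out, n_taps);
        return;
    }
#endif
    fir_filter_c(out, in, coeff, n_out, n_taps);
}
```

In a real ARMv7-A project the NEON routine would normally live in its own source file that is the only one compiled with NEON enabled, so that the rest of the library still runs on CPUs without NEON.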

(3) Download the Android NDK

http://developer.android.com/sdk/ndk/index.html

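(4) Android NDK installation and setup

http://blog.csdn.net/zwcai/article/details/6211670

Android NDK development is essentially Linux development, except that the main output is a .so library for the SDK side to call. The tools involved are the Linux development tools plus the SDK interface components.

Setting up the NDK build environment

1. Download the Android NDK r4 Windows package and unzip it wherever you like, e.g. d:/android

2. Install a reasonably recent version of cygwin. During installation, select the Linux components you need, mainly the make, gcc, and g++ tools.

3. Run cygwin and add the environment variable to the .bash_profile configuration file: NDK=/cygdrive/d/android/android-ndk-r4; export NDK. Also add this path to the Windows PATH environment variable.

4. Go to one of the examples under android-ndk-r4/sample, such as hello-jni, and run ndk-build to compile it. The result is libxxx.so; if the build succeeds, the build environment is set up correctly.

Eclipse NDK setup:

1. Open the project from sample in Eclipse: New Android project -> Create from existing source. At this point the Java project can be built.

2. Configure the NDK build options so that Eclipse invokes the cygwin build tools and refreshes the affected files (mainly libxxx.so) once the build finishes. For the setup procedure, see:
http://www.360doc.com/content/11/0223/17/2734308_95473676.shtml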

(5) Java Native Interface

http://en.wikipedia.org/wiki/Java_Native_Interface

JNI enables one to write native methods to handle situations when an application cannot be written entirely in the Java programming language, e.g. when the standard Java class library does not support the platform-specific features or program library. It is also used to modify an existing application (written in another programming language) to be accessible to Java applications. Many of the standard library classes depend on JNI to provide functionality to the developer and the user, e.g. file I/O and sound capabilities. Including performance- and platform-sensitive API implementations in the standard library allows all Java applications to access this functionality in a safe and platform-independent manner.

The JNI framework lets a native method use Java objects in the same way that Java code uses these objects. A native method can create Java objects and then inspect and use these objects to perform its tasks. A native method can also inspect and use objects created by Java application code.

JNI is sometimes referred to as the "escape hatch" for Java developers because it enables them to add functionality to their Java application that the standard Java APIs cannot otherwise provide. It can be used to interface with code written in other languages, such as C and C++. It is also used for time-critical calculations or operations like solving complicated mathematical equations, because native code may be faster than JVM code.

http://java.sun.com/docs/books/jni/ (Oracle's official site, with a detailed introduction to JNI and the JNI documentation):

An entire chapter is devoted to avoiding common traps and pitfalls. The book uses numerous examples to illustrate programming techniques that have proven to be effective.


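(6) NDK Android* application porting methodologies

http://software.intel.com/zh-cn/articles/ndk-android-application-porting-methodologies/

This guide is intended to help developers port existing ARM*-based NDK applications to x86. If you already have a working application and need to know how to quickly make it discoverable by x86 devices in Android* Market, this article provides some introductory information. The guide also offers tips and guidance to help you resolve compiler issues you may run into during porting.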

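(7)  Android NDK: NEON optimization for ARM Cortex-A8/A9

http://blog.csdn.net/zwcai/article/details/6843531

I have recently been porting algorithms to the Android platform, or more precisely, optimizing them for ARM Cortex-A8 and A9 platforms. Floating-point capability turns out to vary considerably between chips. The A9 series is clearly stronger than the A8 series, by a bit more than 3x, which is presumably the benefit of the pipelined VFP. Even with the same core, though, chips from different vendors differ quite a bit. I started with my own phone, an ATRIX with a dual-core A9 Tegra 2 processor. By my measurement it ran the floating-point algorithm at one third the speed of my desktop PC; normalized to the same clock frequency, the two are nearly on par. The PC algorithm could be compiled and used as-is and met the speed target straight away. The float-to-fixed-point conversion we used to do in the DSP days could be skipped entirely, with no optimization at all, which was astonishing. But once I tried other A8 boards the pleasant surprise vanished: optimization still had to be done, and there was no way around the work. The main tool available is NEON.

     For the NEON optimization I picked a few typical functions, such as vector inner product, scaled sum, and cross-coherence coefficient, and had someone try them. At first we followed the usual TI DSP approach and rewrote the arithmetic with a series of NEON intrinsics, but the speed-up was only 10%, so that went nowhere. I examined the generated assembly and found a lot of stack operations: the key arithmetic was a single instruction, but it was surrounded by a dozen or so vstr and vldr instructions, so no wonder it was slow. Searching online turned up similar reports; the compiler's optimization of NEON intrinsics seems weak, and it cannot chain the operations together. So I played the last card: hand-optimized inline assembly. After studying the instruction set for a while, I picked the simplest function, the scaled sum. In assembly it comes down to just three arithmetic instructions, and even counting the loads and stores it is only about ten instructions, far fewer than the several dozen the compiler produced. Running it showed a 5x speed-up. That made it worthwhile, so I had the engineers do the other functions as well; the best improvement was 20x. The catch is that it is laborious to write: half a day for one small function, so it is only practical for the time-critical core.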
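For reference, the intrinsics-based starting point described above might look roughly like the following for two of the functions mentioned. The post does not show its own code, so everything here is an assumption: the names scaled_sum and dot_product, the definition of the scaled sum as out[i] = a*x[i] + b*y[i], and the use of single-precision floats. On targets without NEON the sketch compiles to the scalar loops only.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0
#endif

/* Scaled sum (assumed definition): out[i] = a * x[i] + b * y[i]. */
void scaled_sum(float *out, float a, const float *x,
                float b, const float *y, size_t n)
{
    assert(out != NULL && x != NULL && y != NULL);
    size_t i = 0;
#if USE_NEON
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);   /* load 4 floats from x */
        float32x4_t vy = vld1q_f32(y + i);   /* load 4 floats from y */
        float32x4_t r = vmulq_n_f32(vx, a);  /* r = x * a            */
        r = vmlaq_n_f32(r, vy, b);           /* r = r + y * b        */
        vst1q_f32(out + i, r);               /* store 4 results      */
    }
#endif
    for (; i < n; i++)                       /* scalar path and leftovers */
        out[i] = a * x[i] + b * y[i];
}

/* Vector inner product: sum of x[i] * y[i]. */
float dot_product(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
    float sum = 0.0f;
#if USE_NEON
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4)
        acc = vmlaq_f32(acc, vld1q_f32(x + i), vld1q_f32(y + i));
    /* Reduce the four partial sums to one value. */
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    s = vpadd_f32(s, s);
    sum = vget_lane_f32(s, 0);
#endif
    for (; i < n; i++)                       /* scalar path and leftovers */
        sum += x[i] * y[i];
    return sum;
}
```

Whether a loop like this ends up surrounded by extra loads and stores depends on the compiler version and flags, so it is worth inspecting the generated assembly before deciding to hand-write it.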

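    Two good references:

    http://hilbert-space.de/?p=22  An RGB-to-grayscale optimization example. It shows the assembly generated for the function as well as the hand-written assembly; the optimization mainly targets the loads and stores.

   http://blogs.arm.com/software-enablement/241-coding-for-neon-part-3-matrix-multiplication/ An official example: analysis and implementation of porting matrix multiplication to NEON.

   The OpenMAX library is also worth a look.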

(7) Cortex™-A8 Technical Reference Manual

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/index.html

(8) Introducing NEON™ Development Article

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/index.html

(9) Coding for NEON:

Coding for NEON - Part 1: Load and Stores

http://blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores/

Coding for NEON - Part 2: Dealing With Leftovers

http://blogs.arm.com/software-enablement/196-coding-for-neon-part-2-dealing-with-leftovers/

Coding for NEON - Part 3: Matrix Multiplication

http://blogs.arm.com/software-enablement/241-coding-for-neon-part-3-matrix-multiplication/

Coding for NEON - Part 4: Shifting Left and Right

http://blogs.arm.com/software-enablement/277-coding-for-neon-part-4-shifting-left-and-right/

http://search.arm.com/search?q=Coding+for+NEON&site=Site-Search&btnG=Search&entqr=0&output=xml_no_dtd&sort=date%3AD%3AL%3Ad1&client=Search&ud=1&oe=UTF-8&ie=UTF-8&getfields=Description&proxystylesheet=Search

(10) ARM NEON Optimization. An Example

http://hilbert-space.de/?p=22

(11) What is the fastest way to copy memory on a Cortex-A8?


http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

ARM Technical Support Knowledge Articles

Applies to: Cortex-A8, RealView Development Suite (RVDS)

Answer

Many applications frequently copy substantial amounts of data from one area of memory to another, typically using the memcpy() C library function. As this can be quite time consuming, it may be worth spending some time optimizing the functions that do this. There is no single 'best method' for implementing a copy routine, as the performance will depend upon many factors (see below). This article looks at seven possible schemes, and compares their performance in a particular system.

Target system and conditions

The various schemes were tested using a Beagle Board from Texas Instruments. This board is based on an OMAP 3530 SoC, which is based on an ARM Cortex-A8 processor. The implementations have been written with the assumption that the source address, destination address and number of bytes to transfer are all multiples of the level 1 cache line size (64 bytes). However, the tests all copied 16 Mbytes of data, so the overhead of checking alignment and ensuring the main loop could assume 64 byte alignment would be insignificant by comparison (this may not be the case for smaller memory copies). The execution time was measured using the performance counters integrated into the Cortex-A8 processor, which provide a highly accurate measure.

For all tests, the L1NEON bit was set, meaning that loads using the NEON instruction set can cause an L1 data cache linefill. Both the level 1 and level 2 caches are enabled, with both the code and data memory regions being used marked as cacheable. The MMU and branch prediction are also enabled.

Some of the routines make use of the preload instruction (PLD). This instruction causes the level 2 cache to start loading the data some time before the processor executes the code to access this data. This effectively starts the memory request early, so may mean the processor does not have to wait as long for it to be available.
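From C, a similar preload hint can be requested with the GCC/Clang builtin __builtin_prefetch, which compilers targeting ARM can lower to PLD. The sketch below is only an illustration of the idea, not one of the article's measured routines: the function name copy_words_prefetch and the PREFETCH_AHEAD distance of 256 bytes are assumptions, and the best distance depends on the memory system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE     64u    /* Cortex-A8 cache line size quoted in the article */
#define PREFETCH_AHEAD 256u   /* assumed prefetch distance in bytes; tune per system */

/* Word-by-word copy that issues one prefetch hint per cache line.
 * Like the article's routines, it assumes the byte count is a
 * multiple of the cache line size. */
void copy_words_prefetch(uint32_t *dst, const uint32_t *src, size_t bytes)
{
    assert(dst != NULL && src != NULL);
    assert(bytes % CACHE_LINE == 0);

    const size_t words_per_line = CACHE_LINE / sizeof(uint32_t);

    for (size_t done = 0; done < bytes; done += CACHE_LINE) {
        /* Hint only: a prefetch does not fault, so running past the end
         * of the buffer near the last iterations is harmless. The address
         * is formed with integer arithmetic to avoid out-of-bounds
         * pointer arithmetic. */
        __builtin_prefetch((const void *)((uintptr_t)src + PREFETCH_AHEAD), 0, 0);

        for (size_t w = 0; w < words_per_line; w++)
            *dst++ = *src++;
    }
}
```

The second argument (0) marks the prefetch as being for a read, and the third (0) says the data has low temporal locality, which suits a streaming copy.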

Routines:

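1.  Word by Word memory copy

This is a very simple loop which reads one word from the source, writes it to the destination and decrements a counter. The performance of this function is taken as the reference for the other tests.

WordCopy
      LDR r3, [r1], #4      ; load one word from the source, post-increment r1
      STR r3, [r0], #4      ; store it to the destination, post-increment r0
      SUBS r2, r2, #4       ; r2 = bytes remaining
      BGT WordCopy          ; loop while bytes remain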

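2.  Load-Multiple memory copy

The previous example is modified to use LDM and STM instructions, transferring 8 words per iteration. Due to the extra registers used, these must be stored to the stack and later restored.

LDMCopy
      PUSH {r4-r10}         ; save the callee-saved registers used below
LDMloop
      LDMIA r1!, {r3-r10}   ; load 8 words (32 bytes) from the source
      STMIA r0!, {r3-r10}   ; store 8 words to the destination
      SUBS r2, r2, #32      ; r2 = bytes remaining
      BGT LDMloop           ; loop while bytes remain
      POP {r4-r10}          ; restore the saved registers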

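3.  NEON memory copy

This implementation uses NEON load and store instructions.

NEONCopy
      VLDM r1!, {d0-d7}     ; load 64 bytes into d0-d7
      VSTM r0!, {d0-d7}     ; store 64 bytes to the destination
      SUBS r2, r2, #0x40    ; r2 = bytes remaining
      BGT NEONCopy          ; loop while bytes remain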

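4.  Word by Word memory copy with preload

WordCopyPLD
      PLD [r1, #0x100]      ; preload data 256 bytes ahead of the source pointer
      MOV r12, #16          ; 16 words = one 64-byte cache line

WordCopyPLD1
      LDR r3, [r1], #4
      STR r3, [r0], #4
      SUBS r12, r12, #1
      BNE WordCopyPLD1      ; inner loop: copy one cache line
      SUBS r2, r2, #0x40    ; r2 = bytes remaining
      BNE WordCopyPLD       ; outer loop: next cache line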

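5.  Load-Multiple memory copy with preload

LDMCopyPLD
      PUSH {r4-r10}         ; save the callee-saved registers used below
LDMloopPLD
      PLD [r1, #0x80]       ; preload data 128 bytes ahead of the source pointer
      LDMIA r1!, {r3-r10}   ; first 32 bytes
      STMIA r0!, {r3-r10}
      LDMIA r1!, {r3-r10}   ; second 32 bytes
      STMIA r0!, {r3-r10}
      SUBS r2, r2, #0x40    ; r2 = bytes remaining
      BGT LDMloopPLD        ; loop while bytes remain
      POP {r4-r10}          ; restore the saved registers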

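6.  NEON memory copy with preload

NEONCopyPLD
      PLD [r1, #0xC0]       ; preload data 192 bytes ahead of the source pointer
      VLDM r1!, {d0-d7}     ; load 64 bytes into d0-d7
      VSTM r0!, {d0-d7}     ; store 64 bytes to the destination
      SUBS r2, r2, #0x40    ; r2 = bytes remaining
      BGT NEONCopyPLD       ; loop while bytes remain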

7.  Mixed ARM and NEON memory copy with preload

This final implementation interleaves ARM and NEON multiple load and store instructions, as well as using the preload instruction.

ARMNEONPLD
      PUSH {r4-r11}
      MOV r3, r0

ARMNEON
      PLD [r1, #192]
      PLD [r1, #256]
      VLD1.64 {d0-d3}, [r1@128]!
      VLD1.64 {d4-d7}, [r1@128]!
      VLD1.64 {d16-d19}, [r1@128]!
      LDM r1!, {r4-r11}
      SUBS r2, r2, #128
      VST1.64 {d0-d3}, [r3@128]!
      VST1.64 {d4-d7}, [r3@128]!
      VST1.64 {d16-d19}, [r3@128]!
      STM r3!, {r4-r11}
      BGT ARMNEON
      POP {r4-r11}

Results:

Test                                  Cycles      Time (seconds)   Mbytes per second   Relative
Word by Word memory copy              52401776    0.104804         152.6665814         100%
Load-Multiple memory copy             47235011    0.09447          169.3658968         111%
NEON memory copy                      52389453    0.104779         152.7024915         100%
Word by Word memory copy with PLD     68774347    0.137549         116.3224421         76%
Load-Multiple memory copy with PLD    53277011    0.106554         150.158574          98%
NEON memory copy with PLD             35102279    0.070205         227.9054303         149%
Mixed ARM and NEON memory copy        46742131    0.093484         171.1518031         112%
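The derived columns in the results can be reproduced from the cycle counts with a few lines of arithmetic. The sketch below assumes a 500 MHz core clock, which is not stated in the article but is what the ratio of cycles to seconds in the results implies, and takes 16 Mbytes as 16 x 2^20 bytes. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CLOCK_HZ    500.0e6   /* assumption: inferred from cycles / seconds in the results */
#define COPY_MBYTES 16.0      /* each test copied 16 Mbytes */

/* Convert a cycle count into elapsed time at the assumed clock. */
double cycles_to_seconds(uint64_t cycles)
{
    return (double)cycles / CLOCK_HZ;
}

/* Throughput for the 16-Mbyte copy. */
double mbytes_per_second(uint64_t cycles)
{
    return COPY_MBYTES / cycles_to_seconds(cycles);
}

/* Speed relative to a reference run, as a percentage (reference = 100). */
double relative_percent(uint64_t cycles, uint64_t reference_cycles)
{
    assert(cycles > 0);
    return 100.0 * (double)reference_cycles / (double)cycles;
}
```

For example, the word-by-word copy took 52401776 cycles, which gives about 0.1048 seconds and about 152.67 Mbytes per second, and the NEON copy with PLD at 35102279 cycles comes out at roughly 149% of that reference.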

Some of these results may be surprising.

The Load-multiple routine offers only 11% better performance, despite requiring far fewer instructions to execute and not as many branch instructions. The increase is limited because the processor will be achieving a 100% instruction cache hit ratio, so instructions can easily be fetched as quickly as they can be executed. Branch prediction will also work efficiently in this example, negating the effect of executing more branches. The merging write buffer also means that the memory system sees burst writes, in the same way as it would for single word writes.

The NEON memory copy routine has a few benefits that are not shown in the performance of the copy itself. Firstly, the loop can execute without corrupting the contents of many ARM (integer) registers. This would be of more benefit for a small memory copy, where the overhead of stacking / restoring these registers would be significant. The Cortex-A8 processor can also be configured (though it is not for these tests) to only allocate into the level-2 cache for NEON loads; this would prevent the memory copy routine replacing the existing data in the level-1 cache with data that will not be reused. However, the results show that the copy performance itself is the same as with ARM code.

A large gain is obtained by using the PLD instruction within the NEON code loop. This is because it allows the processor to issue a future load to the memory system before it is required, meaning the memory controller can start accessing these locations early. The nature of SDRAM means that (with a suitable controller) having multiple requests to work on can hide the long latency of the first access in a burst. This routine is similar to that used by the ARM compiler for code generated for the Cortex-A8.

Factors affecting memory copy performance

·        Amount of data to be copied
Some implementations have an overhead to set up, but then copy data more quickly. When copying a large block of data, the speed increase in the main loop of the function will outweigh the extra time spent in the setup. For a small amount of data this may not be the case. One example of this is stacking a large number of registers at the beginning of the function, to allow the use of LDM and STM instructions with many registers in the main loop of the function.

·        Alignment
Even with the unaligned access capabilities introduced in ARMv6, the ARM architecture provides better performance when loading quantities that are word aligned. There are also coarser alignment granularities that can assist performance. A load multiple instruction on the Cortex-A8 can load 2 registers per cycle from the level 1 cache, but only if the address is 64-bit aligned. Cache behaviour (discussed later) can affect the performance of data accesses depending on its alignment relative to the size of a cache line. For the Cortex-A8, a level 1 cache line is 64 bytes, and a level 2 cache line is 64 bytes.

·        Memory characteristics
The performance of this function will be largely dependent only on the accesses to data memory. This is because it is likely to have a small inner loop, so the instruction accesses will cache well, and no calculations are being performed on the data so the processor's arithmetic units will not be heavily loaded. Therefore the performance will vary greatly depending upon the speed of the memory. Certain types of memory system also perform better with certain types of access patterns than others; most notably, SDRAM has a long latency for the initial access in a burst, but can very quickly supply the rest of the burst. With a suitable memory controller, the AXI bus will allow multiple outstanding requests to be serviced by the SDRAM controller in parallel, hiding much of the effect of this initial latency. However, some code sequences will make better use of this than others.

·        Cache usage
If a large amount of data is being copied, it is possible that this data will completely replace the existing data in the data caches. While this will have little effect on the performance of the memory copy itself, it may slow down subsequent code, leading to an overall drop in performance.

·        Code dependencies
A standard memcpy() function, particularly with slow memory, will result in the processor being unused for much of the time. It may therefore be more efficient to enable the processor to execute some other code in parallel with the memory copy. There are several ways of doing this, but they will only be helpful if there is other work the processor can do which is not dependent on the memory copy having completed.
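The alignment and data-size factors above are why a general-purpose copy routine usually splits the work into an unaligned head, a cache-line-aligned body handled by the fast loop, and a short tail. A minimal sketch of that split is shown below. The names copy_split, split_for_alignment and copy_aligned_body are made up for illustration, only the destination is aligned here (the article's routines assume source and destination are both aligned), and memcpy stands in for whichever optimized per-line routine would really be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64u   /* Cortex-A8 L1 and L2 line size quoted in the article */

typedef struct {
    size_t head;   /* bytes before the first cache-line boundary */
    size_t body;   /* whole cache lines, in bytes */
    size_t tail;   /* bytes left after the last whole line */
} copy_split;

/* Bytes needed to advance p to the next cache-line boundary (0 if aligned). */
size_t bytes_to_line_boundary(const void *p)
{
    return (size_t)((CACHE_LINE - ((uintptr_t)p % CACHE_LINE)) % CACHE_LINE);
}

/* Work out how an n-byte copy to dst divides into head, body and tail. */
copy_split split_for_alignment(const void *dst, size_t n)
{
    copy_split s;
    s.head = bytes_to_line_boundary(dst);
    if (s.head > n)
        s.head = n;
    s.body = (n - s.head) / CACHE_LINE * CACHE_LINE;
    s.tail = n - s.head - s.body;
    return s;
}

/* Copy n bytes, handling the aligned body one cache line at a time. */
void *copy_aligned_body(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;
    copy_split parts = split_for_alignment(dst, n);

    memcpy(d, s, parts.head);                     /* unaligned head */
    d += parts.head;
    s += parts.head;

    for (size_t off = 0; off < parts.body; off += CACHE_LINE)
        memcpy(d + off, s + off, CACHE_LINE);     /* placeholder for the fast loop */

    memcpy(d + parts.body, s + parts.body, parts.tail);   /* short tail */
    return dst;
}
```

For small copies the head and tail can make up most of the work, which is one reason the setup overhead matters more there than in the 16-Mbyte tests.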

Other methods to copy memory

The C function memcpy() is blocking; once it is called it does not return until the memory copy is complete. This can result in the processor pipeline being idle for many cycles while waiting for memory accesses. It is often more efficient to perform the memory copy in the background, allowing the processor pipeline to be used for other tasks. However, this is likely to be more complex to control, needing to either poll for completion or handle an interrupt at completion. There are a few possible ways to achieve this:

·        Use a DMA engine
A DMA engine could copy the data either in one block or as a series of sub-blocks. This allows the processor to perform other tasks while the copy completes. Breaking the copy into sub-blocks would enable the processor to use some of the copied data before the entire transfer is complete.

·        Use the preload engine built into the Cortex-A8
The preload engine can be used to load a way of the level 2 cache with data. The processor can start this process, then perform other tasks until it receives an interrupt to say this load is complete. It can then start the preload engine off to store this data back to memory, and again run other tasks while this is completing.

·        Use another processor
For example, it is possible to use the DSP present on the OMAP 3530 to complete the memory copy, in a similar way to the DMA engine.
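The sub-block idea can be illustrated in plain C without any DMA hardware. The sketch below is a synchronous stand-in only: it copies one sub-block at a time and calls a completion callback after each, which is the point where a real DMA or DSP transfer would raise an interrupt or set a flag to be polled. The names block_done_fn, copy_in_subblocks and count_blocks are made up for illustration, and no real DMA driver API is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Called after each sub-block has been copied; block points at the
 * freshly written destination data. */
typedef void (*block_done_fn)(const void *block, size_t len, void *user);

/* Copy n bytes in sub-blocks of at most block_size bytes, notifying the
 * caller after each one so it can start consuming data early. */
void copy_in_subblocks(void *dst, const void *src, size_t n,
                       size_t block_size,
                       block_done_fn on_block_done, void *user)
{
    assert(dst != NULL && src != NULL && block_size > 0);
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n > 0) {
        size_t len = n < block_size ? n : block_size;
        memcpy(d, s, len);            /* a DMA engine would do this part */
        if (on_block_done != NULL)
            on_block_done(d, len, user);
        d += len;
        s += len;
        n -= len;
    }
}

/* Example callback: counts how many sub-blocks have completed. */
void count_blocks(const void *block, size_t len, void *user)
{
    (void)block;
    (void)len;
    ++*(size_t *)user;
}
```

With a real background engine the callback would run from the completion interrupt, and the processor would be free to do unrelated work between notifications.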
