Improving Software Performance with the PLD Instruction

ARM's detailed description of the PLD instruction is as follows:

Preload Data and Preload Instruction. The processor can signal the memory system that a data or instruction load from an address is likely in the near future.

PLtype{cond} [Rn {, #offset}]
PLtype{cond} [Rn, +/-Rm {, shift}]
PLtype{cond} label


where:
type
    can be one of:
    D
        Data address
    DW
        Data address with intention to write
    I
        Instruction address.

    type cannot be DW if the syntax specifies label.
cond
    is an optional condition code.
    Note:
    cond is permitted only in 32-bit Thumb code, using a preceding IT instruction. This is an unconditional instruction in ARM code and you must not use cond.
Rn
    is the register on which the memory address is based.
offset
    is an immediate offset. If offset is omitted, the address is the value in Rn.
Rm
    is a register containing a value to be used as the offset.
shift
    is an optional shift.
label
    is a PC-relative expression.

The offset is applied to the value in Rn before the preload takes place. The result is used as the memory address for the preload. The range of offsets permitted is:

  •     -4095 to +4095 for ARM instructions
  •     -255 to +4095 for 32-bit Thumb instructions, when Rn is not PC.
  •     -4095 to +4095 for 32-bit Thumb instructions, when Rn is PC.

The assembler calculates the offset from the PC for you. The assembler generates an error if label is out of range.

In ARM code, the value in Rm is added to or subtracted from the value in Rn. In 32-bit Thumb code, the value in Rm can only be added to the value in Rn. The result is used as the memory address for the preload.

The range of shifts permitted is:

  •     LSL #0 to #3 for 32-bit Thumb instructions
  •     Any one of the following for ARM instructions:
  •         LSL #0 to #31
  •         LSR #1 to #32
  •         ASR #1 to #32
  •         ROR #1 to #31
  •         RRX

No alignment checking is performed for preload instructions.

Rm must not be PC. For Thumb instructions Rm must also not be SP.
Rn must not be PC for Thumb instructions of the syntax PLtype{cond} [Rn, +/-Rm{, #shift}].
ARM PLD is available in ARMv5TE and above.
32-bit Thumb PLD is available in ARMv6T2 and above.

PLDW is available only in ARMv7 and above that implement the Multiprocessing Extensions.
PLI is available only in ARMv7 and above.

There are no 16-bit Thumb PLD, PLDW, or PLI instructions.
These are hint instructions, and their implementation is optional. If they are not implemented, they execute as NOPs.
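
For use from C, here is a minimal sketch (my own illustration, not part of the ARM documentation) of how the immediate-offset, plain, and register-offset-with-shift forms map onto GCC-style inline assembly. The function names and the 64-byte offset are made up for demonstration:

#include <stdint.h>

/* Immediate-offset form: hint that data 64 bytes past p will be read soon.
   PLD is only a hint; cores that do not implement it treat it as a NOP. */
static inline void preload_read(const void *p)
{
    asm ("PLD [%0, #64]"::"r" (p));
}

/* PLDW: preload with intent to write (ARMv7 + Multiprocessing Extensions). */
static inline void preload_write(void *p)
{
    asm ("PLDW [%0]"::"r" (p));
}

/* Register-offset form with a shift: the address hinted is p + (i << 2). */
static inline void preload_indexed(const void *p, uint32_t i)
{
    asm ("PLD [%0, %1, LSL #2]"::"r" (p), "r" (i));
}

On ARM targets, GCC's __builtin_prefetch() also lowers to PLD (and, where the target supports it, __builtin_prefetch(p, 1) to PLDW), so the compiler intrinsic is often the more portable way to get the same hint.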

========================================================================================

Someone ran a memcpy-style copy test and found that the performance improvement is quite noticeable:

Android ARM: PLD preload magic

http://blog.javia.org/android-arm-preload-magic/

I just discovered the incredible performance boost that can be achieved by using the PLD (“Preload Data”) ARM assembler instruction.

What I needed to do was convert image pixel data from RGB to RGBA format — from 3 bytes/pixel to 4 bytes/pixel; fullscreen in real time during animation. But the general situation is any time you need to process a large amount of RAM data really fast.

while (n--) {
    *dest++ = *src++;
}

This loop is plain: it just copies data from a memory source to a destination. It is used here just as a placeholder for some processing of src data. (Of course, if you only need to copy the data you should use memcpy instead.)

Let’s time this loop over 1MB of data, on a Samsung Galaxy Tab 10.1 with a Tegra 2 processor — it takes about 25ms. What slows the loop down is waiting for data that is not in the processor cache to be fetched from main memory, which is slow. We can fix this by directing the CPU to prefetch data ahead of the read. We modify the loop by adding the PLD magic line:

while (n--) {
    asm ("PLD [%0, #128]"::"r" (src));
    *dest++ = *src++;
}

That asm line starts preloading data from memory to the CPU cache, 128 bytes ahead of the current src location, without blocking the CPU.

We measure again, and the same loop over 1MB of data now takes only 8ms instead of 25ms — it is three times faster! Amazing for that 1-liner, I say. By the way, this is now very close to the performance of memcpy, which is itself implemented in highly-optimized ARM assembly.

You may observe that our loop may be optimized a little bit further by doing partial unrolling — processing more than a single element at each iteration.

With partial loop unrolling:

n /= 4; // assume n is a multiple of 4
while (n--) {
    asm ("PLD [%0, #128]"::"r" (src));
    *dest++ = *src++;
    *dest++ = *src++;
    *dest++ = *src++;
    *dest++ = *src++;
}

The conclusion is that if you find yourself optimizing to death some piece of C/C++ code on Android that reads a lot of memory, you should try using PLD and profile again to see if it helps. Enjoy!

asm ("PLD [%0, #128]"::"r" (src));

PS:
If you’re curious about the RGB_888 to RGBA_8888 conversion speed, it is possible to do a fullscreen conversion (1280×752 px) on the Tab in about 7ms, which is quite impressive IMO. This is faster than the corresponding memcpy() RGBA to RGBA which takes about 8ms, and thus makes the case for the introduction of the RGB_888 (3bytes/pixel) Bitmap format in the Android Java API (as it saves RAM and memory bandwidth when the Alpha channel isn’t needed).
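
For reference, the kind of conversion loop being described might look roughly like the sketch below (my own illustration, not the author's code; the pixel byte order and the fixed 0xFF alpha are assumptions, and the 128-byte prefetch distance is simply carried over from the loop above):

#include <stdint.h>
#include <stddef.h>

/* Expand packed RGB_888 (3 bytes/pixel) into RGBA_8888 (4 bytes/pixel),
   forcing alpha to 0xFF, while preloading the source stream ahead of the reads. */
static void rgb888_to_rgba8888(uint32_t *dst, const uint8_t *src, size_t npixels)
{
    while (npixels--) {
        asm ("PLD [%0, #128]"::"r" (src));
        uint32_t r = src[0], g = src[1], b = src[2];
        /* On little-endian ARM this stores the bytes R, G, B, A in memory. */
        *dst++ = r | (g << 8) | (b << 16) | 0xFF000000u;
        src += 3;
    }
}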

========================================================================================

The PLD instruction gives a fairly noticeable performance boost for memory-access-intensive code.


In a recent example of using NEON to accelerate floating-point computation (the dot product of two float vectors with len=10000), appropriate use of PLD reduced the execution time from 79us to 33us.

The Cortex-A9 is already a very complex core; it is often difficult to analyze an instruction sequence quantitatively, and more often the best sequence is found through repeated experimentation. Between the PLD and the LDR that consumes the data, you need to leave enough time for the cache line fill to complete, yet keep the interval from getting too long: the longer it is, the higher the probability that a context switch will interrupt the loop and cause the preloaded data to be flushed out of the cache.
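
As a rough sketch of the kind of kernel meant here (NEON intrinsics plus an inline PLD; the 256-byte prefetch distance is only an illustrative starting point, and in practice it is exactly the knob that has to be tuned by experiment, as described above):

#include <arm_neon.h>
#include <stddef.h>

/* Dot product of two float vectors using NEON, with PLD hints issued a fixed
   distance ahead of the current position. */
static float dot_product_neon_pld(const float *a, const float *b, size_t n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Prefetch 256 bytes (64 floats) ahead of both streams.  This distance
           must be long enough to hide the cache line fill, yet short enough
           that a context switch is unlikely to evict the line before the load
           that uses it. */
        asm ("PLD [%0, #256]"::"r" (a + i));
        asm ("PLD [%0, #256]"::"r" (b + i));
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }

    /* Reduce the four accumulator lanes to a scalar. */
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    float result = vget_lane_f32(vpadd_f32(s, s), 0);

    /* Handle any leftover elements. */
    for (; i < n; i++)
        result += a[i] * b[i];

    return result;
}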

[END]

