ARM's NEON technology is a 64/128-bit hybrid SIMD architecture designed to accelerate the performance of multimedia and signal processing applications, including video encoding and decoding, audio encoding and decoding, 3D graphics, speech and image processing.
This is the first part of a series of posts on how to write SIMD code for NEON using assembly language. The series will cover getting started with NEON, using it efficiently, and later, hints and tips for more experienced coders. We will begin by looking at memory operations, and how to use the flexible load and store with permute instructions.
We will start with a concrete example. You have a 24-bit RGB image, where the pixels are arranged in memory as R, G, B, R, G, B... You want to perform a simple image processing operation, like switching the red and blue channels. How can you do this efficiently using NEON?
Using a load that pulls RGB data linearly from memory into registers makes the red/blue swap awkward.
Code to swap channels based on this input is not going to be elegant: it needs masking, shifting and recombining, and it is unlikely to be efficient.
NEON provides structure load and store instructions to help in these situations. They pull in data from memory and simultaneously separate values into different registers. For this example, you can use VLD3 to split up red, green and blue as they are loaded.
Now switch the red and blue registers (VSWP d0, d2) and write the data back to memory, with reinterleaving, using the similarly named VST3 store instruction.
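To make the data movement concrete, here is a minimal sketch in portable C that models what the load, swap and store sequence does to eight pixels at a time. It is not NEON code: the function names (model_vld3_8, model_vst3_8, swap_red_blue) are hypothetical, and the arrays d0, d1, d2 simply stand in for 64-bit NEON registers holding eight bytes each.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of VLD3.8 {d0, d1, d2}, [r0]:
   read 24 bytes and deinterleave them into three 8-lane "registers". */
static void model_vld3_8(const uint8_t *src,
                         uint8_t d0[8], uint8_t d1[8], uint8_t d2[8])
{
    for (size_t lane = 0; lane < 8; ++lane) {
        d0[lane] = src[3 * lane + 0];   /* red   */
        d1[lane] = src[3 * lane + 1];   /* green */
        d2[lane] = src[3 * lane + 2];   /* blue  */
    }
}

/* Scalar model of VST3.8 {d0, d1, d2}, [r0]:
   reinterleave three 8-lane "registers" into 24 bytes. */
static void model_vst3_8(uint8_t *dst,
                         const uint8_t d0[8], const uint8_t d1[8],
                         const uint8_t d2[8])
{
    for (size_t lane = 0; lane < 8; ++lane) {
        dst[3 * lane + 0] = d0[lane];
        dst[3 * lane + 1] = d1[lane];
        dst[3 * lane + 2] = d2[lane];
    }
}

/* Swap the red and blue channels of a packed RGB image in place.
   Assumption for this sketch: pixel_count is a multiple of 8. */
void swap_red_blue(uint8_t *rgb, size_t pixel_count)
{
    assert(pixel_count % 8 == 0);
    for (size_t i = 0; i < pixel_count; i += 8) {
        uint8_t d0[8], d1[8], d2[8];
        model_vld3_8(rgb + 3 * i, d0, d1, d2);
        /* Models VSWP d0, d2 by storing the registers in swapped order. */
        model_vst3_8(rgb + 3 * i, d2, d1, d0);
    }
}
```

Calling swap_red_blue(image, 8) on a 24-byte buffer exchanges the first and third byte of every pixel and leaves the green bytes where they were.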
NEON structure loads read data from memory into 64-bit NEON registers, with optional deinterleaving. Stores work similarly, reinterleaving data from registers before writing it to memory.
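The structure load and store instructions have a syntax consisting of five parts.
The instruction mnemonic, which is either VLD for loads or VST for stores.
A numeric interleave pattern, from 1 to 4, giving the number of elements in each structure.
An element type specifying the number of bits in each accessed element.
A list of 64-bit NEON registers to be read or written.
An ARM address register holding the memory location to be accessed.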
Instructions are available to load, store and deinterleave structures containing from one to four equally sized elements, where the elements are the usual NEON supported widths of 8, 16 or 32 bits.
VLD1 is the simplest form. It loads one to four registers of data from memory, with no deinterleaving. Use this when processing an array of non-interleaved data.
VLD2 loads two or four registers of data, deinterleaving even and odd elements into those registers. Use this to separate stereo audio data into left and right channels.
VLD3 loads three registers and deinterleaves. Useful for splitting RGB pixels into channels.
VLD4 loads four registers and deinterleaves. Use it to process ARGB image data.
Loads and stores interleave elements based on the size specified to the instruction. For example, loading two NEON registers with VLD2.16 results in four 16-bit elements in the first register, and four 16-bit elements in the second, with adjacent pairs (even and odd) separated to each register.
Changing the element size to 32 bits causes the same amount of data to be loaded, but now only two elements make up each vector, again separated into even and odd elements.
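The effect of element size can be sketched with two small scalar models in C. Both consume the same 16 bytes of memory; only the granularity of the even/odd split changes. The function names are hypothetical and the arrays d0 and d1 stand in for 64-bit NEON registers, so treat this as an illustration of the lane layout rather than as NEON code.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of VLD2.16 {d0, d1}, [r0]:
   eight 16-bit elements in memory, even ones to d0, odd ones to d1. */
void model_vld2_16(const uint16_t src[8], uint16_t d0[4], uint16_t d1[4])
{
    for (size_t lane = 0; lane < 4; ++lane) {
        d0[lane] = src[2 * lane];       /* elements 0, 2, 4, 6 */
        d1[lane] = src[2 * lane + 1];   /* elements 1, 3, 5, 7 */
    }
}

/* Scalar model of VLD2.32 {d0, d1}, [r0]:
   the same 16 bytes, now treated as four 32-bit elements. */
void model_vld2_32(const uint32_t src[4], uint32_t d0[2], uint32_t d1[2])
{
    for (size_t lane = 0; lane < 2; ++lane) {
        d0[lane] = src[2 * lane];       /* elements 0, 2 */
        d1[lane] = src[2 * lane + 1];   /* elements 1, 3 */
    }
}
```

With interleaved 16-bit stereo samples L0, R0, L1, R1, ... the first model leaves the four left samples in d0 and the four right samples in d1.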
Element size also affects endianness handling. In general, if you specify the correct element size to the load and store instructions, bytes will be read from memory in the appropriate order, and the same code will work on little and big-endian systems.
Finally, element size has an impact on pointer alignment. Alignment to the element size will generally give better performance, and it may be a requirement of your target operating system. For example, when loading 32-bit elements, align the address of the first element to at least 32 bits.
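If you want to check this from C before handing a buffer to a NEON routine, a small helper along the following lines is one option. This is a sketch: is_aligned is a hypothetical name, and it assumes a flat address space where converting a pointer to uintptr_t yields its numeric address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if ptr is aligned to `bytes`, which must be a power of two
   (for example 4 when loading 32-bit elements). */
int is_aligned(const void *ptr, size_t bytes)
{
    assert(bytes != 0 && (bytes & (bytes - 1)) == 0);
    return ((uintptr_t)ptr & (bytes - 1)) == 0;
}
```

For 32-bit elements you would call is_aligned(buffer, 4) and fall back to a slower path, or fix the allocation, if it returns 0.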
In addition to loading multiple elements, structure loads can also read single elements from memory with deinterleaving, either to all lanes of a NEON register, or to a single lane, leaving the other lanes intact.
The latter form is useful when you need to construct a vector from data scattered in memory.
Stores are similar, providing support for writing single or multiple elements with interleaving.
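The two single-element forms can be modelled in scalar C as follows. This is a sketch only: the function names are hypothetical, d0 stands in for a 64-bit NEON register holding eight bytes, and the instruction forms named in the comments (VLD1.8 {d0[]}, [r0] and VLD1.8 {d0[2]}, [r0]) are the 8-bit, single-register variants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of VLD1.8 {d0[]}, [r0]:
   load one byte and replicate it to all eight lanes. */
void model_vld1_8_all_lanes(const uint8_t *src, uint8_t d0[8])
{
    for (size_t lane = 0; lane < 8; ++lane) {
        d0[lane] = *src;
    }
}

/* Scalar model of VLD1.8 {d0[lane]}, [r0]:
   load one byte into a single lane, leaving the other lanes intact. */
void model_vld1_8_one_lane(const uint8_t *src, uint8_t d0[8], unsigned lane)
{
    assert(lane < 8);
    d0[lane] = *src;
}

/* Build a vector from eight bytes scattered in memory,
   one single-lane load per element. */
void gather_bytes(const uint8_t *ptrs[8], uint8_t d0[8])
{
    for (unsigned lane = 0; lane < 8; ++lane) {
        model_vld1_8_one_lane(ptrs[lane], d0, lane);
    }
}
```

gather_bytes shows the scattered-data case: each call fills exactly one lane, so eight loads are needed to populate the whole register.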
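Structure load and store instructions support three formats for specifying addresses.
[Rn{,:align}] is the simplest form. Data is loaded from, or stored to, the address in Rn.
[Rn{,:align}]! updates the pointer after the memory access, incrementing it by the number of bytes read or written, ready for the next elements.
[Rn{,:align}], Rm increments the pointer by the value in register Rm after the memory access. This is useful when reading or writing groups of elements that are separated by fixed widths, e.g. when reading a vertical line of data from an image.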
You can also specify an alignment for the pointer passed in Rn, using the optional :align parameter, which often speeds up memory accesses.
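The two pointer-updating forms can be sketched in scalar C. The function names are hypothetical, d0 stands in for a 64-bit NEON register, and the image layout (8-bit pixels, rows separated by a fixed stride in bytes) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models the writeback form, VLD1.8 {d0}, [r0]!:
   load eight consecutive bytes and return the pointer advanced by
   the number of bytes read. */
const uint8_t *load_8_writeback(const uint8_t *r0, uint8_t d0[8])
{
    for (size_t lane = 0; lane < 8; ++lane) {
        d0[lane] = r0[lane];
    }
    return r0 + 8;
}

/* Models eight single-lane loads of the form
   VLD1.8 {d0[lane]}, [r0], r1 with r1 holding the row stride:
   each load reads one pixel, then steps down one row. */
void load_column_8(const uint8_t *pixel, size_t stride, uint8_t d0[8])
{
    assert(stride > 0);
    size_t offset = 0;
    for (size_t lane = 0; lane < 8; ++lane) {
        d0[lane] = pixel[offset];   /* the load itself */
        offset += stride;           /* post-increment by Rm */
    }
}
```

load_column_8(image + 3, 16, d0) collects the pixel in column 3 from each of the first eight rows of an image whose rows are 16 bytes apart.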
We have only dealt with structure loads and stores in this post. NEON also provides:
VLDR and VSTR to load or store a single register as a 64-bit value.
VLDM and VSTM to load multiple registers as 64-bit values. Useful for storing and retrieving registers from the stack.
In the next post, we will look at efficiently handling arrays with lengths that are not a multiple of the vector size.