http://www.linuxjournal.com/content/introduction-gcc-compiler-intrinsics-vector-processing?page=0,1
http://stackoverflow.com/questions/7156908/sse-intrinsic-functions-reference
Table 1. GCC Command-Line Options to Generate SIMD Code
Processor | Options |
---|---|
X86/MMX/SSE1/SSE2 | -mfpmath=sse -mmmx -msse -msse2 |
ARM Neon | -mfpu=neon -mfloat-abi=softfp |
Freescale Altivec | -maltivec -mabi=altivec |
Here are the include files you need:
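As a rough sketch, the headers can be selected with the macros GCC predefines when the matching option from Table 1 is enabled. The header names (mmintrin.h, xmmintrin.h, emmintrin.h, arm_neon.h, altivec.h) are the ones GCC ships; the helper `simd_header_name` is hypothetical and exists only to show which branch was taken.

```c
/* Select the intrinsics header from the macros GCC predefines when the
   corresponding -m option is active. */
#if defined(__SSE2__)
#include <emmintrin.h>   /* X86 SSE2; GCC's version also pulls in the SSE1 and MMX headers */
#define SIMD_HEADER "emmintrin.h"
#elif defined(__SSE__)
#include <xmmintrin.h>   /* X86 SSE1 */
#define SIMD_HEADER "xmmintrin.h"
#elif defined(__MMX__)
#include <mmintrin.h>    /* X86 MMX */
#define SIMD_HEADER "mmintrin.h"
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>    /* ARM Neon types and intrinsics */
#define SIMD_HEADER "arm_neon.h"
#elif defined(__ALTIVEC__)
#include <altivec.h>     /* Freescale Altivec types and intrinsics */
#define SIMD_HEADER "altivec.h"
#else
#define SIMD_HEADER "none"
#endif

/* Hypothetical helper: reports which header this translation unit included. */
static const char *simd_header_name(void)
{
    return SIMD_HEADER;
}
```

Printing `simd_header_name()` from a small test program is a quick way to confirm that the compiler options actually reached the build.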
The X86 compatibles with MMX, SSE1 and SSE2 have the following types:
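The x86 headers define `__m64` (MMX, 64 bits), `__m128` (SSE1, four floats), and with SSE2 `__m128d` (two doubles) and `__m128i` (128 bits of integer lanes). The sketch below uses `__m128i` as eight 16-bit lanes. The function name `add8_s16` is hypothetical, and the unaligned load/store intrinsics are chosen here only so that any pointer is acceptable.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: adds eight 16-bit integers lane by lane. */
static void add8_s16(const int16_t *a, const int16_t *b, int16_t *out)
{
#if defined(__SSE2__)
    /* __m128i holds 128 bits of integer data; here it is read as 8 x int16. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
#else
    /* Scalar fallback for targets without SSE2. */
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(a[i] + b[i]);
#endif
}
```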
Table 2. Subset of vector operators and intrinsics used in the examples.
Operation | Altivec | Neon | MMX/SSE/SSE2 |
---|---|---|---|
loading vector | vec_ld | vld1q_f32 | _mm_set_epi16 |
 | vec_splat | vld1q_s16 | _mm_set1_epi16 |
 | vec_splat_s16 | vsetq_lane_f32 | _mm_set1_pi16 |
 | vec_splat_s32 | vld1_u8 | _mm_set_pi16 |
 | vec_splat_s8 | vdupq_lane_s16 | _mm_load_ps |
 | vec_splat_u16 | vdupq_n_s16 | _mm_set1_ps |
 | vec_splat_u32 | vmovq_n_f32 | _mm_loadh_pi |
 | vec_splat_u8 | vset_lane_u8 | _mm_loadl_pi |
storing vector | vec_st | vst1_u8 | |
 | | vst1q_s16 | _mm_store_ps |
 | | vst1q_f32 | |
 | | vst1_s16 | |
add | vec_madd | vaddq_s16 | _mm_add_epi16 |
 | vec_mladd | vaddq_f32 | _mm_add_pi16 |
 | vec_adds | vmlaq_n_f32 | _mm_add_ps |
subtract | vec_sub | vsubq_s16 | |
multiply | vec_madd | vmulq_n_s16 | _mm_mullo_epi16 |
 | vec_mladd | vmulq_s16 | _mm_mullo_pi16 |
 | | vmulq_f32 | _mm_mul_ps |
 | | vmlaq_n_f32 | |
arithmetic shift | vec_sra | vshrq_n_s16 | _mm_srai_epi16 |
 | vec_srl | | _mm_srai_pi16 |
 | vec_sr | | |
byte permutation | vec_perm | vtbl1_u8 | _mm_shuffle_pi16 |
 | vec_sel | vtbx1_u8 | _mm_shuffle_ps |
 | vec_mergeh | vget_high_s16 | |
 | vec_mergel | vget_low_s16 | |
 | | vdupq_lane_s16 | |
 | | vdupq_n_s16 | |
 | | vmovq_n_f32 | |
 | | vbsl_u8 | |
type conversion | vec_cts | vmovl_u8 | _mm_packs_pu16 |
 | vec_unpackh | vreinterpretq_s16_u16 | |
 | vec_unpackl | vcvtq_u32_f32 | |
 | vec_cts | vqmovn_s32 | _mm_cvtps_pi16 |
 | vec_ctu | vqmovun_s16 | _mm_packus_epi16 |
 | | vqmovn_u16 | |
 | | vcvtq_f32_s32 | |
 | | vmovl_s16 | |
 | | vmovq_n_f32 | |
vector combination | vec_pack | vcombine_u16 | |
 | vec_packsu | vcombine_u8 | |
 | | vcombine_s16 | |
maximum | | | _mm_max_ps |
minimum | | | _mm_min_ps |
vector logic | | | _mm_andnot_ps |
 | | | _mm_and_ps |
 | | | _mm_or_ps |
rounding | vec_trunc | | |
misc | | | _mm_empty |
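To show how a few of the operators in Table 2 fit together, here is a minimal sketch that computes `out[i] = a[i] * scale + b[i]` four floats at a time. It is not taken from the article's samples: the function name `scale_add_f32` is hypothetical, and the SSE path uses the unaligned `_mm_loadu_ps`/`_mm_storeu_ps` instead of the table's `_mm_load_ps`/`_mm_store_ps`, which require 16-byte aligned pointers.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = a[i] * scale + b[i] for i in [0, n). */
static void scale_add_f32(const float *a, const float *b, float scale,
                          float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    const __m128 vs = _mm_set1_ps(scale);       /* broadcast scale to 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);        /* unaligned 4-float load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vs), vb));
    }
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        /* vmlaq_n_f32(acc, x, s) computes acc + x * s per lane. */
        vst1q_f32(out + i, vmlaq_n_f32(vb, va, scale));
    }
#endif
    /* Scalar loop: handles the tail, or everything when no SIMD path exists. */
    for (; i < n; i++)
        out[i] = a[i] * scale + b[i];
}
```

The trailing scalar loop doubles as the fallback path, so the same function compiles on targets where neither macro is defined.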
Next, your code should check at runtime whether the processor it is running on supports the vector instructions you compiled for. If you have no vector code path for that processor, fall back to your scalar code. If you do have one, and it is faster, use it. Test processor features on X86 with the cpuid instruction, available through <cpuid.h>. (You saw examples of that in samples/simple/x86/*c.) We couldn't find anything as well established for Altivec and Neon, so the examples there parse /proc/cpuinfo. (Serious code might instead execute a test SIMD instruction. If the processor raises a SIGILL signal when it encounters that instruction, the feature is not available.)
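The runtime check can be sketched as follows. This is not the code from samples/simple/x86; the function names are hypothetical. `__get_cpuid` and `bit_SSE2` come from GCC's <cpuid.h>. The feature strings "neon" and "altivec" are assumptions about how Linux reports those units in /proc/cpuinfo, and the plain substring search is deliberately simple.

```c
#include <stdio.h>
#include <string.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* Returns 1 if the CPU reports SSE2 through cpuid leaf 1, else 0. */
static int have_sse2(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                       /* leaf 1 not supported */
    return (edx & bit_SSE2) != 0;
#else
    return 0;
#endif
}

/* Returns 1 if any line of /proc/cpuinfo contains the given substring.
   Assumption: a Linux system; a very long line split across two reads
   could hide a match. */
static int cpuinfo_has(const char *feature)
{
    char line[1024];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return 0;
    while (!found && fgets(line, sizeof line, f))
        found = strstr(line, feature) != NULL;
    fclose(f);
    return found;
}

/* Picks a code path name; a real program would select function pointers. */
static const char *pick_code_path(void)
{
    if (have_sse2())
        return "sse2";
    if (cpuinfo_has("neon"))
        return "neon";
    if (cpuinfo_has("altivec"))
        return "altivec";
    return "scalar";
}
```

Calling `pick_code_path()` once at startup and caching the result keeps the check out of the hot loop.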
In summary, GCC offers intrinsics that let you get more from your processor without going all the way to assembly. We have covered the basic types and some of the vector math functions. When you use intrinsics, test thoroughly: check both speed and correctness against a scalar version of your code. Processors differ in which features they offer and in how well those features perform, so this is a wide-open field. The more effort you put into it, the more you will get out.
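One way to test against a scalar version is sketched below, assuming an SSE2 target with a scalar fallback elsewhere. All function names are hypothetical. The lanes are unsigned so the scalar reference can keep the low 16 bits of each product without signed overflow, which matches what `_mm_mullo_epi16` stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

typedef void (*mul_fn)(const uint16_t *, const uint16_t *, uint16_t *, size_t);

/* Scalar reference: low 16 bits of each product. */
static void mul_u16_scalar(const uint16_t *a, const uint16_t *b,
                           uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint16_t)((uint32_t)a[i] * b[i]);
}

/* Vector version: 8 lanes per step with SSE2, scalar loop for the tail. */
static void mul_u16_simd(const uint16_t *a, const uint16_t *b,
                         uint16_t *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_mullo_epi16(va, vb));
    }
#endif
    for (; i < n; i++)
        out[i] = (uint16_t)((uint32_t)a[i] * b[i]);
}

/* Correctness: counts lanes where the two versions disagree (n <= 1024). */
static size_t count_mismatches(size_t n)
{
    static uint16_t a[1024], b[1024], want[1024], got[1024];
    uint32_t seed = 12345u;
    size_t bad = 0;
    if (n > 1024)
        n = 1024;
    for (size_t i = 0; i < n; i++) {    /* simple LCG for test inputs */
        seed = seed * 1664525u + 1013904223u;
        a[i] = (uint16_t)(seed >> 16);
        seed = seed * 1664525u + 1013904223u;
        b[i] = (uint16_t)(seed >> 16);
    }
    mul_u16_scalar(a, b, want, n);
    mul_u16_simd(a, b, got, n);
    for (size_t i = 0; i < n; i++)
        bad += (want[i] != got[i]);
    return bad;
}

/* Speed: CPU seconds spent running fn reps times over the same buffers. */
static double seconds_for(mul_fn fn, const uint16_t *a, const uint16_t *b,
                          uint16_t *out, size_t n, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        fn(a, b, out, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Comparing `seconds_for(mul_u16_scalar, ...)` with `seconds_for(mul_u16_simd, ...)` on realistic buffer sizes gives a first impression only. At higher optimization levels the compiler may vectorize the scalar loop itself, so it is worth checking the generated assembly.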
The GCC include files that map intrinsics to compiler built-ins (e.g., arm_neon.h) and the GCC info pages that explain those built-ins:
http://gcc.gnu.org/onlinedocs/gcc/Target-Builtins.html
http://ds9a.nl/gcc-simd/
http://softpixel.com/~cwright/programming/simd/index.php
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/BABCJFDG.html
http://www.arm.com/products/processors/technologies/neon.php
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s04s02.html
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0205j/BABGHIFH.html
http://www.tommesani.com/Docs.html
http://www.linuxjournal.com/article/7269
http://developer.apple.com/hardwaredrivers/ve/sse.html
http://en.wikipedia.org/wiki/Multiplication_algorithm#Shift_and_add
http://www.ibm.com/developerworks/power/library/pa-unrollav1/
http://en.wikipedia.org/wiki/MMX_(instruction_set)
Integrated Performance Primitives
http://software.intel.com/en-us/articles/intel-ipp/
http://software.intel.com/en-us/articles/non-commercial-software-download/
OpenMAX
http://www.khronos.org/developers/resources/openmax
Freescale AltiVec Libs for Linux
http://www.freescale.com/webapp/sps/site/overview.jsp?code=DRPPCNWALTVCLIB
AltiVec Technology Programming Interface Manual
http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf
http://developer.apple.com/hardwaredrivers/ve/instruction_crossref.html
Ian Ollmann's Altivec Tutorial
http://www-linux.gsi.de/~ikisel/reco/Systems/Altivec.pdf
http://arstechnica.com/civis/viewtopic.php?f=19&t=381165
RealView Compilation Tools Compiler Reference Guide (especially Appendix E)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348c/index.html
RealView Compilation Tools Assembler Guide (especially Chapter 5)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/index.html
Intel C++ Intrinsics Reference
http://software.intel.com/sites/default/files/m/9/4/c/8/e/18072-347603.pdf