Learning roadmap: https://zhuanlan.zhihu.com/p/...
Original article: https://developer.arm.com/doc...
Neon intrinsics reference: https://developer.arm.com/arc...

Optimizing C Code with Neon Intrinsics (translation notes)

What is Neon?

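What Neon provides
Thirty-two 128-bit vector registers (on AArch64) plus SIMD instructions that operate on them
How to use Neon
1. Neon-enabled open source libraries (for example, the Arm Compute Library)
2. Auto-vectorization in the compiler
3. Neon intrinsics (#include <arm_neon.h>)
4. Hand-written Neon assembly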

Why Neon intrinsics?

No need to hand-write assembly
Highly portable
Flexible: use Neon where it helps and plain C/C++ everywhere else
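
As an illustration of the flexibility point, here is a minimal sketch that mixes a Neon loop with ordinary C in one function. The helper name `add_f32` is hypothetical and not part of the original article; the `__ARM_NEON` guard is an assumption added so that the same file still builds on targets without Neon.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats. */
void add_f32(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Neon path: four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Plain C handles the tail, or everything when Neon is unavailable. */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

The scalar loop picks up wherever the vector loop stopped, so the function accepts any length.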

Example: RGB deinterleaving (HWC -> CHW)

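Plain C version. Compiled with Arm Compiler 6 at -O3, it uses no Neon instructions or registers: each individual 8-bit value is held in its own 64-bit general-purpose register.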

void rgb_deinterleave_c(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) {
    /*
     * Take the elements of "rgb" and store the individual colors "r", "g", and "b".
     */
    for (int i=0; i < len_color; i++) {
        r[i] = rgb[3*i];
        g[i] = rgb[3*i+1];
        b[i] = rgb[3*i+2];
    }
}

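Neon version. It processes 16 pixels per iteration, so it only covers a len_color that is a multiple of 16; any remaining pixels are left unprocessed.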

void rgb_deinterleave_neon(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) {
    /*
     * Take the elements of "rgb" and store the individual colors "r", "g", and "b"
     */
    int num8x16 = len_color / 16;
    uint8x16x3_t intlv_rgb;   // three registers, each holding 16 x 8-bit unsigned integers
    for (int i=0; i < num8x16; i++) {
        intlv_rgb = vld3q_u8(rgb+3*16*i); // maps to the LD3 instruction
        vst1q_u8(r+16*i, intlv_rgb.val[0]); // maps to the ST1 instruction
        vst1q_u8(g+16*i, intlv_rgb.val[1]);
        vst1q_u8(b+16*i, intlv_rgb.val[2]);
    }
}
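
The Neon routine above skips any pixels beyond the last full group of 16. One way to cover them, sketched here under the assumption that a scalar clean-up loop is acceptable, is shown below. The name `rgb_deinterleave_any` is hypothetical, and the `__ARM_NEON` guard is added only so the file also compiles where Neon is unavailable.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical variant of rgb_deinterleave_neon that accepts any len_color. */
void rgb_deinterleave_any(uint8_t *r, uint8_t *g, uint8_t *b,
                          const uint8_t *rgb, int len_color) {
    int i = 0;
#if defined(__ARM_NEON)
    /* Vector part: LD3 splits 48 interleaved bytes into three 16-byte registers. */
    for (; i + 16 <= len_color; i += 16) {
        uint8x16x3_t v = vld3q_u8(rgb + 3 * i);
        vst1q_u8(r + i, v.val[0]);
        vst1q_u8(g + i, v.val[1]);
        vst1q_u8(b + i, v.val[2]);
    }
#endif
    /* Scalar part: the remaining 0..15 pixels (or all of them without Neon). */
    for (; i < len_color; i++) {
        r[i] = rgb[3 * i];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}
```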

The complete source above can be compiled and disassembled on an Arm machine with the following commands:
```
gcc -g -O3 rgb.c -o exe_rgb_o3
objdump -d exe_rgb_o3 > disasm_rgb_o3
```

Matrix multiplication example

Plain C version (float, column-major storage)

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int j_idx=0; j_idx < m; j_idx++) {
            C[n*j_idx + i_idx] = 0;
            for (int k_idx=0; k_idx < k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}

Neon version. It only works for matrices whose dimensions are both multiples of four.

/*
 * Copyright (C) Arm Limited, 2019 All rights reserved. 
 * 
 * The example code is provided to you as an aid to learning when working 
 * with Arm-based technology, including but not limited to programming tutorials. 
 * Arm hereby grants to you, subject to the terms and conditions of this Licence, 
 * a non-exclusive, non-transferable, non-sub-licensable, free-of-charge licence, 
 * to use and copy the Software solely for the purpose of demonstration and 
 * evaluation.
 * 
 * You accept that the Software has not been tested by Arm therefore the Software 
 * is provided "as is", without warranty of any kind, express or implied. In no 
 * event shall the authors or copyright holders be liable for any claim, damages 
 * or other liability, whether in action or contract, tort or otherwise, arising 
 * from, out of or in connection with the Software or the use of Software.
 */

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <math.h>

#include <arm_neon.h>

#define BLOCK_SIZE 4


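void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
        for (int i_idx=0; i_idx < n; i_idx++) {
                for (int j_idx=0; j_idx < m; j_idx++) {
                        C[n*j_idx + i_idx] = 0;
                        for (int k_idx=0; k_idx < k; k_idx++) {
                                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
                        }
                }
        }
}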
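
The listing above stops before the Neon routine. As a stand-in, here is a minimal sketch of a Neon matrix multiply that follows the same column-major layout as matrix_multiply_c. It is not Arm's original code: the name `matrix_multiply_sketch` is hypothetical, it vectorizes only along the rows of A (four rows at a time) rather than working on 4x4 blocks, and it adds a scalar loop for leftover rows plus an `__ARM_NEON` guard so it builds anywhere.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical sketch: C (n x m) = A (n x k) * B (k x m), all column-major. */
void matrix_multiply_sketch(const float *A, const float *B, float *C,
                            uint32_t n, uint32_t m, uint32_t k) {
    for (uint32_t j = 0; j < m; j++) {
        uint32_t i = 0;
#if defined(__ARM_NEON)
        /* Four consecutive rows of column j of C are computed per iteration. */
        for (; i + 4 <= n; i += 4) {
            float32x4_t acc = vdupq_n_f32(0.0f);
            for (uint32_t kk = 0; kk < k; kk++) {
                /* Four rows of column kk of A are contiguous in memory. */
                float32x4_t a = vld1q_f32(A + n * kk + i);
                /* acc += a * B[kk][j], with the scalar broadcast to all lanes. */
                acc = vmlaq_n_f32(acc, a, B[k * j + kk]);
            }
            vst1q_f32(C + n * j + i, acc);
        }
#endif
        /* Scalar loop for leftover rows (or all rows without Neon). */
        for (; i < n; i++) {
            float sum = 0.0f;
            for (uint32_t kk = 0; kk < k; kk++) {
                sum += A[n * kk + i] * B[k * j + kk];
            }
            C[n * j + i] = sum;
        }
    }
}
```

Because the matrices are column-major, four rows of one column sit next to each other in memory, which is what lets a single vld1q_f32 load them.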

 
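Program conventions

Macros
__ARM_NEON - defined when the compiler supports Advanced SIMD; always 1 on AArch64
__ARM_NEON_FP - defined when Neon floating-point operations are supported
__ARM_FEATURE_CRYPTO - the cryptographic instructions (such as AES and SHA) are available
__ARM_FEATURE_FMA - fused multiply-add is available
For details, see https://developer.arm.com/architectures/system-architectures/software-standards/acle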
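
These macros are typically tested with the preprocessor to pick a code path at compile time. The following is a minimal sketch; the function name `madd_f32x4` is hypothetical. It uses the fused intrinsic when both `__ARM_NEON` and `__ARM_FEATURE_FMA` are defined, the non-fused multiply-accumulate when only Neon is available, and plain C otherwise.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: acc[i] += a[i] * b[i] for exactly four floats. */
void madd_f32x4(float *acc, const float *a, const float *b) {
#if defined(__ARM_NEON) && defined(__ARM_FEATURE_FMA)
    /* Fused multiply-add: a single rounding step. */
    vst1q_f32(acc, vfmaq_f32(vld1q_f32(acc), vld1q_f32(a), vld1q_f32(b)));
#elif defined(__ARM_NEON)
    /* Multiply-accumulate without the fusion guarantee. */
    vst1q_f32(acc, vmlaq_f32(vld1q_f32(acc), vld1q_f32(a), vld1q_f32(b)));
#else
    /* Portable fallback. */
    for (int i = 0; i < 4; i++) {
        acc[i] += a[i] * b[i];
    }
#endif
}
```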
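
Types
baseW_t is a scalar, baseWxL_t is a vector, and baseWxLxN_t is an array of vectors, where base is the element type, W its width in bits, L the number of lanes, and N the number of vectors. For example, uint8x16x3_t is an array of three vectors, each with 16 lanes of 8-bit unsigned integers.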
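
To make the naming scheme concrete, the sketch below moves from a 128-bit float32x4_t (4 lanes of 32 bits) to two 64-bit float32x2_t halves (2 lanes of 32 bits) and finally to scalar floats. The function name `sum4` is hypothetical, and the non-Neon branch is an assumption added for portability; it adds the elements in the same order as the Neon branch.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns p[0] + p[1] + p[2] + p[3]. */
float sum4(const float *p) {
#if defined(__ARM_NEON)
    float32x4_t v = vld1q_f32(p);        /* float32 x 4 lanes, 128 bits */
    float32x2_t lo = vget_low_f32(v);    /* lanes 0 and 1, 64 bits */
    float32x2_t hi = vget_high_f32(v);   /* lanes 2 and 3, 64 bits */
    float32x2_t s = vadd_f32(lo, hi);    /* {p0 + p2, p1 + p3} */
    return vget_lane_f32(s, 0) + vget_lane_f32(s, 1);
#else
    return (p[0] + p[2]) + (p[1] + p[3]);
#endif
}
```

The intrinsic names follow the types: a `q` after the operation name (as in vld1q_f32) marks the 128-bit form, and its absence (as in vadd_f32) marks the 64-bit form.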

Functions
