Getting Started with NEON: Optimizing C Code with Neon Intrinsics (translation notes)

Learning roadmap: https://zhuanlan.zhihu.com/p/...
Original article: https://developer.arm.com/doc...
Neon instruction search: https://developer.arm.com/arc...

Optimizing C Code with Neon Intrinsics (translation notes)

What is Neon?

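  • What Neon provides
    Thirty-two 128-bit vector registers, plus SIMD instructions that operate on several lanes of data at once
  • How to use Neon
    1. Neon-enabled open source libraries (for example, the Arm Compute Library)
    2. Auto-vectorization in the compiler
    3. Neon intrinsics (#include <arm_neon.h>)
    4. Neon assembler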
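
To make option 3 concrete, here is a minimal sketch of an element-wise float addition written with Neon intrinsics. The function name `add_f32` is hypothetical and does not come from the original article. The `__ARM_NEON` guard is an assumption added so that the same file also builds on hosts without Neon, where only the scalar loop runs.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, len). */
void add_f32(float *dst, const float *a, const float *b, size_t len) {
    assert(len == 0 || (dst != NULL && a != NULL && b != NULL));
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Four 32-bit floats fit in one 128-bit register, so step by 4. */
    for (; i + 4 <= len; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);      /* load 4 floats from b */
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add lane-wise and store */
    }
#endif
    /* Scalar loop: the tail on Arm, the whole array elsewhere. */
    for (; i < len; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

The vector loop and the scalar loop share the index `i`, so the scalar loop picks up wherever the vector loop stopped.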

Why Neon intrinsics?

  • No need to hand-write assembly
  • Highly portable
  • Flexible: use Neon where it helps and plain C/C++ where it does not

Example: RGB deinterleaving (HWC -> CHW)


  • Plain C version. Compiled with Arm Compiler 6 at -O3, it uses no Neon instructions or registers: each 8-bit value is held in its own 64-bit general-purpose register

    void rgb_deinterleave_c(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) {
        /*
         * Take the elements of "rgb" and store the individual colors "r", "g", and "b".
         */
        for (int i=0; i < len_color; i++) {
            r[i] = rgb[3*i];
            g[i] = rgb[3*i+1];
            b[i] = rgb[3*i+2];
        }
    }
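  • Neon intrinsics version. It processes 16 pixels per iteration, so it covers the whole input only when len_color is a multiple of 16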

    void rgb_deinterleave_neon(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) {
        /*
         * Take the elements of "rgb" and store the individual colors "r", "g", and "b"
         */
        int num8x16 = len_color / 16;
        uint8x16x3_t intlv_rgb;   // three registers, each holding 16 x 8-bit unsigned integers
        for (int i=0; i < num8x16; i++) {
            intlv_rgb = vld3q_u8(rgb+3*16*i); // maps to the LD3 instruction
            vst1q_u8(r+16*i, intlv_rgb.val[0]); // maps to the ST1 instruction
            vst1q_u8(g+16*i, intlv_rgb.val[1]);
            vst1q_u8(b+16*i, intlv_rgb.val[2]);
        }
    }
  • The complete source above can be compiled and disassembled on an Arm machine with the following commands:

    gcc -g -O3 rgb.c -o exe_rgb_o3
    objdump -d exe_rgb_o3 > disasm_rgb_o3
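
Because the Neon loop above advances 16 pixels at a time, pixels beyond the last full group of 16 are not written. One common way to handle this, sketched below, is to finish with a scalar loop. The name `rgb_deinterleave_any_len` is hypothetical, and the `__ARM_NEON` guard is an assumption added so the sketch also builds on non-Arm hosts.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical variant: deinterleaves len_color pixels for any len_color >= 0. */
void rgb_deinterleave_any_len(uint8_t *r, uint8_t *g, uint8_t *b,
                              const uint8_t *rgb, int len_color) {
    assert(len_color >= 0);
    int i = 0;
#if defined(__ARM_NEON)
    /* Vector part: each vld3q_u8 reads 48 bytes and splits them into R, G, B. */
    for (; i + 16 <= len_color; i += 16) {
        uint8x16x3_t intlv_rgb = vld3q_u8(rgb + 3 * i);
        vst1q_u8(r + i, intlv_rgb.val[0]);
        vst1q_u8(g + i, intlv_rgb.val[1]);
        vst1q_u8(b + i, intlv_rgb.val[2]);
    }
#endif
    /* Scalar part: the remaining 0..15 pixels (or everything without Neon). */
    for (; i < len_color; i++) {
        r[i] = rgb[3 * i];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}
```

With `len_color = 19`, for example, the vector loop handles pixels 0 to 15 and the scalar loop handles pixels 16 to 18.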

Matrix multiplication example


  • Plain C version (float)

    void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
        for (int i_idx=0; i_idx < n; i_idx++) {
            for (int j_idx=0; j_idx < m; j_idx++) {
                C[n*j_idx + i_idx] = 0;
                for (int k_idx=0; k_idx < k; k_idx++) {
                    C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
                }
            }
        }
    }
  • Neon intrinsics version (it only works for matrices whose dimensions are multiples of four)
/*
 * Copyright (C) Arm Limited, 2019 All rights reserved. 
 * 
 * The example code is provided to you as an aid to learning when working 
 * with Arm-based technology, including but not limited to programming tutorials. 
 * Arm hereby grants to you, subject to the terms and conditions of this Licence, 
 * a non-exclusive, non-transferable, non-sub-licensable, free-of-charge licence, 
 * to use and copy the Software solely for the purpose of demonstration and 
 * evaluation.
 * 
 * You accept that the Software has not been tested by Arm therefore the Software 
 * is provided "as is", without warranty of any kind, express or implied. In no 
 * event shall the authors or copyright holders be liable for any claim, damages 
 * or other liability, whether in action or contract, tort or otherwise, arising 
 * from, out of or in connection with the Software or the use of Software.
 */

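#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <math.h>

#include <arm_neon.h>

#define BLOCK_SIZE 4


void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
        for (int i_idx=0; i_idx < n; i_idx++) {
                for (int j_idx=0; j_idx < m; j_idx++) {
                        C[n*j_idx + i_idx] = 0;
                        for (int k_idx=0; k_idx < k; k_idx++) {
                                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
                        }
                }
        }
}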
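
The Neon routine itself does not appear in the listing above. The following is a minimal sketch of a blocked Neon multiply in the same column-major layout; it follows the idea of the original article but is not a copy of Arm's code. The names `matmul_ref` and `matmul_blocked_sketch` and the macro `HAVE_NEON_FMA_SKETCH` are hypothetical. The sketch assumes `n` and `m` are multiples of 4, and it falls back to the scalar routine when not built for AArch64 with Neon.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#define HAVE_NEON_FMA_SKETCH 1   /* hypothetical macro, local to this sketch */
#endif

#define BLOCK_SIZE 4

/* Scalar reference. Column-major: A is n x k, B is k x m, C is n x m. */
void matmul_ref(const float *A, const float *B, float *C,
                uint32_t n, uint32_t m, uint32_t k) {
    for (uint32_t i = 0; i < n; i++) {
        for (uint32_t j = 0; j < m; j++) {
            float sum = 0.0f;
            for (uint32_t p = 0; p < k; p++) {
                sum += A[n * p + i] * B[k * j + p];
            }
            C[n * j + i] = sum;
        }
    }
}

/* Computes C in 4x4 blocks; each accumulator holds one 4-element column of C. */
void matmul_blocked_sketch(const float *A, const float *B, float *C,
                           uint32_t n, uint32_t m, uint32_t k) {
    assert(n % BLOCK_SIZE == 0 && m % BLOCK_SIZE == 0);
#if defined(HAVE_NEON_FMA_SKETCH)
    for (uint32_t i = 0; i < n; i += BLOCK_SIZE) {
        for (uint32_t j = 0; j < m; j += BLOCK_SIZE) {
            float32x4_t acc[BLOCK_SIZE];
            for (int c = 0; c < BLOCK_SIZE; c++) {
                acc[c] = vdupq_n_f32(0.0f);
            }
            for (uint32_t p = 0; p < k; p++) {
                /* Rows i..i+3 of column p of A. */
                float32x4_t a_col = vld1q_f32(A + n * p + i);
                for (int c = 0; c < BLOCK_SIZE; c++) {
                    /* acc[c] += a_col * B[p, j + c] (fused multiply-add). */
                    acc[c] = vfmaq_n_f32(acc[c], a_col, B[k * (j + c) + p]);
                }
            }
            for (int c = 0; c < BLOCK_SIZE; c++) {
                vst1q_f32(C + n * (j + c) + i, acc[c]);
            }
        }
    }
#else
    matmul_ref(A, B, C, n, m, k);
#endif
}
```

The sketch multiplies a column of A by a scalar from B with `vfmaq_n_f32`, which lets the inner work sit in a plain loop instead of being unrolled by lane index. Fused multiply-add rounds once per step, so on floating-point data the results can differ slightly from the scalar version.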

Program conventions

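  • Macros

    __ARM_NEON - Advanced SIMD is supported by the compiler; always 1 on AArch64
    __ARM_NEON_FP - Neon floating-point operations are supported
    __ARM_FEATURE_CRYPTO - the cryptographic instructions (such as AES and SHA) are available, so the crypto Neon intrinsics can be used
    __ARM_FEATURE_FMA - fused multiply-add instructions are available
    For details, see https://developer.arm.com/architectures/system-architectures/software-standards/acle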
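  • Types

    baseW_t: scalar; baseWxL_t: vector; baseWxLxN_t: array of vectors, e.g. uint8x16x3_t
    (base = element type, W = element width in bits, L = number of lanes, N = number of vectors)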
  • Functions
    [Figure 3: naming pattern of Neon intrinsic functions]
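
The three conventions can be seen together in a short sketch. The helper names `built_with_neon` and `sum16_u8` are hypothetical. The first uses the `__ARM_NEON` macro as a compile-time feature test; the second uses vector types whose names encode element type, width and lane count, and intrinsics whose `q` marks a 128-bit operation and whose suffix names the element type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Returns 1 when this file was compiled with Advanced SIMD support, else 0. */
int built_with_neon(void) {
#if defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}

/* Sums 16 bytes, widening as it goes so the total cannot overflow. */
uint32_t sum16_u8(const uint8_t *p) {
    assert(p != NULL);
#if defined(__ARM_NEON)
    uint8x16_t v = vld1q_u8(p);          /* uint, 8 bits, 16 lanes */
    uint16x8_t s16 = vpaddlq_u8(v);      /* pairwise add long: 16 x u8 -> 8 x u16 */
    uint32x4_t s32 = vpaddlq_u16(s16);   /* pairwise add long: 8 x u16 -> 4 x u32 */
    uint32_t lanes[4];
    vst1q_u32(lanes, s32);               /* store the 4 partial sums */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    uint32_t total = 0;
    for (int i = 0; i < 16; i++) {
        total += p[i];
    }
    return total;
#endif
}
```

Reading `vpaddlq_u8` piece by piece: `v` marks a vector intrinsic, `paddl` is the operation (pairwise add long), `q` selects the 128-bit form, and `_u8` gives the input element type.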
