NEON优化系列文章:
- NEON优化1:软件性能优化、降功耗怎么搞?link
- NEON优化2:ARM优化高频指令总结, link
- NEON优化3:矩阵转置的指令优化案例,link
- NEON优化4:floor/ceil函数的优化案例,link
- NEON优化5:log10函数的优化案例,link
- NEON优化6:关于交叉存取与反向交叉存取,link
- NEON优化7:性能优化经验总结,link
- NEON优化8:性能优化常见问题QA,link
NEON优化过程中,经常遇到内存、存器变量间的读写,NEON内存读写指令中默认是交叉存取,部分特殊指令可以反向交叉存取。
什么是交叉存取,什么是反向交叉存取?
a1 b1 a2 b2
,ld2q读出到2个寄存器为:val[0]: a1 a2, val[1]: b1 b2
a1 b1 a2 b2
,ld1q读出到1个寄存器为:val[0]: a1 b1 a2 b2
主要从内存读写、寄存器间交互进行说明。
int32x4x2_t vzipq_s32(int32x4_t a, int32x4_t b);
int32x4x2_t vuzpq_s32(int32x4_t a, int32x4_t b);
ld4q/st4q功能测试
#define ROW_NUM 4
#define COL_NUM 4
// initial
float M[ROW_NUM][COL_NUM] = {
{0, 1, 2, 3},
{4, 5, 6, 7},
{8, 9, 10, 11},
{12, 13, 14, 15},
};
int i, j;
float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);
float MT[4][4];
vst1q_f32(&MT[0][0], vf32x4x4fTmpABCD.val[0]); // 0 4 8 12
vst1q_f32(&MT[1][0], vf32x4x4fTmpABCD.val[1]);
vst1q_f32(&MT[2][0], vf32x4x4fTmpABCD.val[2]);
vst1q_f32(&MT[3][0], vf32x4x4fTmpABCD.val[3]); // 3 7 11 15
printf("ver1:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
printf("ver2:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
输出结果
ver1:
0.000000 4.000000 8.000000 12.000000
1.000000 5.000000 9.000000 13.000000
2.000000 6.000000 10.000000 14.000000
3.000000 7.000000 11.000000 15.000000
ver2:
0.000000 1.000000 2.000000 3.000000
4.000000 5.000000 6.000000 7.000000
8.000000 9.000000 10.000000 11.000000
12.000000 13.000000 14.000000 15.000000
zip/uzp功能测试
#define ROW_NUM 4
#define COL_NUM 4
// initial
float M[ROW_NUM][COL_NUM] = {
{0, 1, 2, 3},
{4, 5, 6, 7},
{8, 9, 10, 11},
{12, 13, 14, 15},
};
float MT[4][4];
// 按列读,按行写
float32x4_t vf32x4fTmp1 = vld1q_f32(&M[0][0]); // 0 1 2 3
float32x4_t vf32x4fTmp2 = vld1q_f32(&M[1][0]); // 4 5 6 7
float32x4x2_t vf32x4x2fTmpZip = vzipq_f32(vf32x4fTmp1, vf32x4fTmp2);
vst1q_f32(&MT[0][0], vf32x4x2fTmpZip.val[0]); // 0 4 1 5
vst1q_f32(&MT[1][0], vf32x4x2fTmpZip.val[1]); // 2 6 3 7
// 按行读,按列写
float32x4_t vf32x4fTmp3 = vld1q_f32(&M[2][0]); // 8 9 10 11
float32x4_t vf32x4fTmp4 = vld1q_f32(&M[3][0]); // 12 13 14 15
float32x4x2_t vf32x4x2fTmpUzp = vuzpq_f32(vf32x4fTmp3, vf32x4fTmp4);
vst1q_f32(&MT[2][0], vf32x4x2fTmpUzp.val[0]); // 8 10 12 14
vst1q_f32(&MT[3][0], vf32x4x2fTmpUzp.val[1]); // 9 11 13 15
printf("ver1:\n");
int i, j;
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
有了前面这些对比,所谓交叉存取、反向交叉存取,简单来看,想象一个矩阵,交叉存取就是按列读出来,按行写入到新变量,反向交叉存取则按行读出来,按列写进去,如此而已。