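1. Problem description
Compile the code below with GCC 10.2 using the options -O3 -msse4.2 -fprefetch-loop-arrays. -O3 enables vectorization by default, and all three loops in the bread function look vectorizable, yet only the last loop is actually vectorized.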
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#define NUM (512*1024)
static char array[NUM];
long bread(void* buf, long nbytes)
{
    long sum = 0;
    register long *p, *next;
    register char *end;
    p = (long*)buf;
    end = (char*)buf + nbytes;
    for (next = p + 128; (void*)next <= (void*)end; p = next, next += 128) {
        sum +=
            p[0]+p[1]+p[2]+p[3]+p[4]+p[5]+p[6]+p[7]+
            p[8]+p[9]+p[10]+p[11]+p[12]+p[13]+p[14]+
            p[15]+p[16]+p[17]+p[18]+p[19]+p[20]+p[21]+
            p[22]+p[23]+p[24]+p[25]+p[26]+p[27]+p[28]+
            p[29]+p[30]+p[31]+p[32]+p[33]+p[34]+p[35]+
            p[36]+p[37]+p[38]+p[39]+p[40]+p[41]+p[42]+
            p[43]+p[44]+p[45]+p[46]+p[47]+p[48]+p[49]+
            p[50]+p[51]+p[52]+p[53]+p[54]+p[55]+p[56]+
            p[57]+p[58]+p[59]+p[60]+p[61]+p[62]+p[63]+
            p[64]+p[65]+p[66]+p[67]+p[68]+p[69]+p[70]+
            p[71]+p[72]+p[73]+p[74]+p[75]+p[76]+p[77]+
            p[78]+p[79]+p[80]+p[81]+p[82]+p[83]+p[84]+
            p[85]+p[86]+p[87]+p[88]+p[89]+p[90]+p[91]+
            p[92]+p[93]+p[94]+p[95]+p[96]+p[97]+p[98]+
            p[99]+p[100]+p[101]+p[102]+p[103]+p[104]+
            p[105]+p[106]+p[107]+p[108]+p[109]+p[110]+
            p[111]+p[112]+p[113]+p[114]+p[115]+p[116]+
            p[117]+p[118]+p[119]+p[120]+p[121]+p[122]+
            p[123]+p[124]+p[125]+p[126]+p[127];
    }
    for (next = p + 16; (void*)next <= (void*)end; p = next, next += 16) {
        sum +=
            p[0]+p[1]+p[2]+p[3]+p[4]+p[5]+p[6]+p[7]+
            p[8]+p[9]+p[10]+p[11]+p[12]+p[13]+p[14]+
            p[15];
    }
    for (next = p + 1; (void*)next <= (void*)end; p = next, next++) {
        sum += *p;
    }
    return sum;
}
int main()
{
    int i;
    FILE *out;
    out = fopen("output.txt","w");
    long time_use=0;
    struct timeval start;
    struct timeval end;
    gettimeofday(&start,NULL);
    memset((void*)array, 1, NUM);
    for(i=0;i<1000;i++)
    {
        // the main goal is to measure the execution time of bread
        long sum = bread(array, NUM);
        if (out!=NULL)
            fprintf(out,"%ld\t",sum);
        sum = 0;
    }
    gettimeofday(&end,NULL);
    time_use=(end.tv_sec-start.tv_sec)*1000000+(end.tv_usec-start.tv_usec); // microseconds
    printf("time = %ld\n", time_use);
    if (out!=NULL)
        fclose(out);
    return 0;
}
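Adding the -fopt-info-vec-all option prints only the vectorization part of the optimization report. The command is gcc -O3 -msse4.2 -fprefetch-loop-arrays -fopt-info-vec-all=vec.log
vec.log contains the following (excerpt); only the last loop (line 45) was vectorized.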
bw_mmap_rd.c:45:5: optimized: loop vectorized using 16 byte vectors
bw_mmap_rd.c:39:5: missed: couldn't vectorize loop
bw_mmap_rd.c:39:5: missed: not vectorized: unsupported data-type
bw_mmap_rd.c:17:5: missed: couldn't vectorize loop
bw_mmap_rd.c:17:5: missed: not vectorized: unsupported data-type
bw_mmap_rd.c:9:6: note: vectorized 1 loops in function.
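2. Analysis
Adding the -fdump-tree-vect-all option dumps the vectorizer's intermediate representation. The command is gcc -O3 -msse4.2 -fprefetch-loop-arrays -fdump-tree-vect-all
The generated bw_mmap_rd.c.161t.vect file contains the following (excerpt):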
bw_mmap_rd.c:39:5: note: Cost model analysis:
Vector inside of loop cost: 516
Vector prologue cost: 20
Vector epilogue cost: 284
Scalar iteration cost: 256
Scalar outside cost: 32
Vector outside cost: 304
prologue iterations: 0
epilogue iterations: 1
bw_mmap_rd.c:39:5: missed: cost model: the vector iteration cost = 516 divided by the scalar iteration cost = 256 is greater or equal to the vectorization factor = 2.
bw_mmap_rd.c:39:5: missed: not vectorized: vectorization not profitable.
bw_mmap_rd.c:39:5: missed: not vectorized: vector version will never be profitable.
bw_mmap_rd.c:39:5: missed: Loop costings may not be worthwhile.
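For the second loop (line 39), GCC's cost model estimates that the vectorized version costs more than the scalar one, so there is no gain and the loop is not vectorized. An SSE XMM register is 128 bits (16 bytes) wide and the loop's elements are of type long (8 bytes), so one vector holds only 2 elements; that is why the vectorization factor is 2. The cost before vectorization is the Scalar iteration cost, 256. The cost after vectorization, per scalar iteration, is Vector inside of loop cost / vectorization factor, i.e. 516 / 2 = 258, which is not below 256.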
bw_mmap_rd.c:17:5: note: Cost model analysis:
Vector inside of loop cost: 5636
Vector prologue cost: 20
Vector epilogue cost: 2076
Scalar iteration cost: 2048
Scalar outside cost: 32
Vector outside cost: 2096
prologue iterations: 0
epilogue iterations: 1
bw_mmap_rd.c:17:5: missed: cost model: the vector iteration cost = 5636 divided by the scalar iteration cost = 2048 is greater or equal to the vectorization factor = 2.
bw_mmap_rd.c:17:5: missed: not vectorized: vectorization not profitable.
bw_mmap_rd.c:17:5: missed: not vectorized: vector version will never be profitable.
bw_mmap_rd.c:17:5: missed: Loop costings may not be worthwhile.
The same holds for the first loop (line 17): the cost before vectorization is 2048 and the cost after is 5636 / 2 = 2818, so there is no gain and it is not vectorized.
bw_mmap_rd.c:45:5: note: Cost model analysis:
Vector inside of loop cost: 20
Vector prologue cost: 20
Vector epilogue cost: 44
Scalar iteration cost: 16
Scalar outside cost: 32
Vector outside cost: 64
prologue iterations: 0
epilogue iterations: 1
Calculated minimum iters for profitability: 4
bw_mmap_rd.c:45:5: note: Runtime profitability threshold = 4
bw_mmap_rd.c:45:5: note: Static estimate profitability threshold = 14
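For the third loop (line 45), the cost before vectorization is 16 and the cost after is 20 / 2 = 10, so vectorization pays off. GCC computes a profitability threshold of 4 iterations, and the loop is vectorized.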
The executable built from the code above runs in (us):
time = 16267
Adding the -fvect-cost-model=unlimited option forces GCC to vectorize. The command is gcc -O3 -msse4.2 -fprefetch-loop-arrays -fopt-info-vec-all=vec.log -fvect-cost-model=unlimited
The vec.log optimization report now shows that all loops were vectorized:
bw_mmap_rd.c:45:5: optimized: loop vectorized using 16 byte vectors
bw_mmap_rd.c:39:5: optimized: loop vectorized using 16 byte vectors
bw_mmap_rd.c:17:5: optimized: loop vectorized using 16 byte vectors
bw_mmap_rd.c:9:6: note: vectorized 3 loops in function.
According to the bw_mmap_rd.c.161t.vect file, the -fvect-cost-model=unlimited option disables the cost model:
bw_mmap_rd.c:45:5: note: cost model disabled.
bw_mmap_rd.c:39:5: note: cost model disabled.
bw_mmap_rd.c:17:5: note: cost model disabled.
With all three loops vectorized, the resulting executable takes longer to run, so performance gets worse:
time = 74177
That is why GCC vectorizes only the last loop.
3. Supplementary notes
3.1 Optimization options
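Among the options used above, -O3 turns on many other optimization options, including -ftree-loop-vectorize (loop vectorization) and -ftree-slp-vectorize (SLP vectorization). -msse4.2 selects the SSE4.2 instruction set; if it is not given, GCC defaults to SSE2 (on x86-64). -fprefetch-loop-arrays enables array prefetching: GCC emits prefetcht0-style instructions that move data into the data cache (dcache) ahead of use.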
3.2 Log messages
When compiling with gcc -O3 -msse4.2 -fprefetch-loop-arrays -fopt-info-vec-all=vec.log, the optimization report contains the following:
bw_mmap_rd.c:45:5: optimized: loop vectorized using 16 byte vectors
bw_mmap_rd.c:39:5: missed: couldn't vectorize loop
bw_mmap_rd.c:39:5: missed: not vectorized: unsupported data-type
bw_mmap_rd.c:17:5: missed: couldn't vectorize loop
bw_mmap_rd.c:17:5: missed: not vectorized: unsupported data-type
bw_mmap_rd.c:9:6: note: vectorized 1 loops in function.
Here, bw_mmap_rd.c:39:5: missed: not vectorized: unsupported data-type is printed by the following piece of code in gcc/tree-vect-loop.c in the GCC sources:
  /* TODO: Analyze cost. Decide if worth while to vectorize. */
  if (dump_enabled_p ())
    {
      dump_printf_loc (MSG_NOTE, vect_location, "vectorization factor = ");
      dump_dec (MSG_NOTE, vectorization_factor);
      dump_printf (MSG_NOTE, "\n");
    }
  if (known_le (vectorization_factor, 1U))
    return opt_result::failure_at (vect_location,
                                   "not vectorized: unsupported data-type\n");
  LOOP_VINFO_VECT_FACTOR (loop_vinfo) = vectorization_factor;
  return opt_result::success ();
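According to the bw_mmap_rd.c.161t.vect file, this is the output for vectorization_factor = 1. In that case (vector mode V8QI, an 8-byte vector) a vector holds only a single long element. GCC does not support such a vector (a vector with only one element is just a scalar), so it prints this message. The third loop, by contrast, already satisfies the vectorization conditions, so the vectorization_factor = 1 case is never evaluated for it and the message does not appear.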
So only bw_mmap_rd.c:39:5: missed: couldn't vectorize loop matters here; bw_mmap_rd.c:39:5: missed: not vectorized: unsupported data-type is not the reason this loop was not vectorized.
bw_mmap_rd.c:39:5: note: ==> examining statement: sum_310 = _288 + sum_318;
bw_mmap_rd.c:39:5: note: get vectype for scalar type: long int
bw_mmap_rd.c:39:5: note: vectype: vector(1) long int
bw_mmap_rd.c:39:5: note: nunits = 1
bw_mmap_rd.c:39:5: note: ==> examining statement: next_312 = next_332 + 128;
bw_mmap_rd.c:39:5: note: skip.
bw_mmap_rd.c:39:5: note: ==> examining statement: if (end_301 >= next_312)
bw_mmap_rd.c:39:5: note: skip.
bw_mmap_rd.c:39:5: note: vectorization factor = 1
bw_mmap_rd.c:39:5: missed: not vectorized: unsupported data-type
bw_mmap_rd.c:39:5: missed: can't determine vectorization factor.
bw_mmap_rd.c:39:5: note: ***** Analysis failed with vector mode V8QI
bw_mmap_rd.c:39:5: missed: couldn't vectorize loop
bw_mmap_rd.c:39:5: missed: not vectorized: unsupported data-type