Reposted from http://dsalli0927.blog.163.com/blog/static/888076072008715535584/
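Memory access on the C6000 is expensive. To raise the C6000's data throughput, each Load/Store instruction should access more than one data item. When a program operates on a run of short (16-bit) data, it can use a word (32-bit int) access to fetch two shorts at once, then process them with a matching C6000 instruction, such as _add2(), which performs two 16-bit additions at the same time. This reduces the number of memory accesses. Similarly, on the C64x, a run of int data can be accessed with double-word memory accesses. This kind of optimization is called data packing.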
For example, one word access can replace two 16-bit short accesses:
void vecsum4(short *restrict sum, const short *restrict in1, const short *restrict in2, unsigned int N)
{
    int i;
    #pragma MUST_ITERATE(10);
    /* Each pass handles two shorts with one word access, so step by 2. */
    for (i = 0; i < N; i += 2)
        _amem4(&sum[i]) = _add2(_amem4_const(&in1[i]), _amem4_const(&in2[i]));
}
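Notes:
#pragma MUST_ITERATE(10) tells the compiler that the loop that follows runs at least 10 times. This information is essential for software pipelining.
_amem4: intrinsics of this kind specify the number of bytes in each memory access and whether the start address must be aligned. _amem4(&sum[i]) tells the compiler that this is a 4-byte access, aligned to a word boundary, starting at &sum[i]. _amem4_const(&in1[i]) adds the const qualifier, which indicates that in1 is a constant array whose values do not change in this routine.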
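To try the packed addition on a host compiler, the behaviour of these intrinsics can be imitated in portable C. The following is only a sketch under stated assumptions: add2_emul, load_word, store_word, and vecsum_packed are hypothetical helper names, not TI APIs; the sketch assumes that _add2 adds the upper and lower 16-bit halves independently without saturation, and it uses memcpy in place of the aligned access that _amem4 performs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for _add2: adds the two 16-bit halves of a and b
   independently, so a carry out of the low half never reaches the high half. */
static uint32_t add2_emul(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;
    uint32_t hi = (((a >> 16) + (b >> 16)) & 0xFFFFu) << 16;
    return hi | lo;
}

/* Hypothetical stand-ins for _amem4_const and _amem4: move 4 bytes
   (two shorts) at once. memcpy keeps the host code free of alignment
   and aliasing problems. */
static uint32_t load_word(const int16_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

static void store_word(int16_t *p, uint32_t w)
{
    memcpy(p, &w, sizeof w);
}

/* Host version of the packed vector sum; n is assumed to be even. */
static void vecsum_packed(int16_t *sum, const int16_t *in1,
                          const int16_t *in2, unsigned n)
{
    unsigned i;
    for (i = 0; i < n; i += 2)
        store_word(&sum[i], add2_emul(load_word(&in1[i]), load_word(&in2[i])));
}

/* Self-check: the packed result matches element-wise 16-bit addition. */
static void vecsum_packed_check(void)
{
    const int16_t a[4] = { 1, -2, 300, 32767 };
    const int16_t b[4] = { 5, -7, -300, 1 };
    int16_t s[4];

    vecsum_packed(s, a, b, 4);
    assert(s[0] == 6 && s[1] == -9 && s[2] == 0);
    assert(s[3] == -32768);  /* 16-bit wraparound, no saturation */
}
```

Because each half is added separately and written back in the same layout, the result is the same on little-endian and big-endian hosts.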
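The vecsum4 example assumes that N is even. If N can be odd, a few tricks are available.
For example, the array length can be padded so that the element count stays even. If the routine must cope with arbitrary counts, or with arrays that may start on a short (halfword) boundary instead of a word boundary, a better approach is to test the incoming arguments inside the routine and select a code path to match:
Example: a general-purpose vector sum routine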
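void vecsum5(short *restrict sum, const short *restrict in1, const short *restrict in2, unsigned int N)
{
    int i;
    /* Test whether sum, in2 and in1 are all aligned to a word boundary. */
    if (((int)sum | (int)in2 | (int)in1) & 0x02)
    {
        /* At least one pointer is only halfword aligned: add one short at a time. */
        #pragma MUST_ITERATE(20);
        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
    }
    else
    {
        /* All pointers are word aligned: add two shorts per word access. */
        #pragma MUST_ITERATE(10);
        for (i = 0; i + 1 < N; i += 2)
            _amem4(&sum[i]) = _add2(_amem4_const(&in1[i]), _amem4_const(&in2[i]));
        /* Odd N: i now indexes the last element, which is added separately. */
        if (N & 0x01) sum[i] = in1[i] + in2[i];
    }
}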
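The dispatch logic can also be exercised on a host. The sketch below is an assumption-laden illustration, not the TI code: is_word_aligned and vecsum_general are hypothetical names, uintptr_t replaces the (int) casts (which rely on 32-bit pointers), and memcpy replaces _amem4. The function returns which path it took so that the selection can be checked.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Nonzero when p sits on a 4-byte (word) boundary. For a short pointer,
   which is already 2-byte aligned, this matches the test of bit 1. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0;
}

/* Hypothetical host version of the general vector sum. Returns 1 when
   the packed (word) path was taken and 0 for the scalar path. */
static int vecsum_general(int16_t *sum, const int16_t *in1,
                          const int16_t *in2, unsigned n)
{
    unsigned i;

    if (!is_word_aligned(sum) || !is_word_aligned(in1) || !is_word_aligned(in2)) {
        for (i = 0; i < n; i++)
            sum[i] = (int16_t)(in1[i] + in2[i]);
        return 0;
    }

    /* Process pairs while at least two elements remain. */
    for (i = 0; i + 1 < n; i += 2) {
        uint32_t a, b, r;
        memcpy(&a, &in1[i], sizeof a);
        memcpy(&b, &in2[i], sizeof b);
        /* Add the two 16-bit halves independently. */
        r = ((((a >> 16) + (b >> 16)) & 0xFFFFu) << 16) | ((a + b) & 0xFFFFu);
        memcpy(&sum[i], &r, sizeof r);
    }
    /* For odd n, i equals n - 1 here and one element is left over. */
    if (n & 1u)
        sum[i] = (int16_t)(in1[i] + in2[i]);
    return 1;
}

static void vecsum_general_check(void)
{
    /* The union gives the short arrays 4-byte alignment on typical hosts. */
    union { uint32_t w[4]; int16_t h[8]; } a, b, s;
    int k;

    for (k = 0; k < 8; k++) {
        a.h[k] = (int16_t)(k + 1);
        b.h[k] = (int16_t)(10 * (k + 1));
        s.h[k] = -1;
    }

    /* Aligned pointers, odd count: packed path plus one tail element. */
    assert(vecsum_general(s.h, a.h, b.h, 5) == 1);
    assert(s.h[0] == 11 && s.h[3] == 44 && s.h[4] == 55);
    assert(s.h[5] == -1);  /* nothing written past the end */

    /* Pointers offset by one short: scalar path. */
    assert(vecsum_general(s.h + 1, a.h + 1, b.h + 1, 3) == 0);
    assert(s.h[1] == 22 && s.h[2] == 33 && s.h[3] == 44);
}
```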
/////////////////////////////////////////////////////////////////////////
The following example shows a kernel that benefits from the packed compare and expand intrinsics. The Clear Below Threshold kernel scans an image of 8-bit unsigned pixels and sets every pixel that is less than or equal to a given threshold to 0.
Clear Below Threshold Kernel
void clear_below_thresh(unsigned char *restrict image, int count, unsigned char threshold)
{
    int i;
    for (i = 0; i < count; i++)
    {
        if (image[i] <= threshold) image[i] = 0;
    }
}
Vectorization techniques are applied to the code (as described in Packed-Data Processing on the C64x), giving the result shown in the following example. The _cmpgtu4() intrinsic compares four pixels at a time against the replicated threshold, and the _xpnd4() intrinsic expands the comparison result into a byte mask used to set pixels to 0. Note that the new code has the restriction that the input image must be double-word aligned and must contain a multiple of 8 pixels. These restrictions are reasonable because common image sizes are a multiple of 8 pixels.
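The compare-then-expand step can be followed on a host with a small emulation. This is a sketch, assuming that _cmpgtu4 sets result bit k when unsigned byte k of its first operand is greater than byte k of the second, and that _xpnd4 turns each of the low four bits into a 0xFF or 0x00 byte. The names cmpgtu4_emul, xpnd4_emul, and mask_check are hypothetical and are not part of the TI toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for _cmpgtu4: bit k of the result is 1 when
   unsigned byte k of a is greater than unsigned byte k of b. */
static uint32_t cmpgtu4_emul(uint32_t a, uint32_t b)
{
    uint32_t bits = 0;
    int k;
    for (k = 0; k < 4; k++) {
        uint32_t ak = (a >> (8 * k)) & 0xFFu;
        uint32_t bk = (b >> (8 * k)) & 0xFFu;
        if (ak > bk)
            bits |= 1u << k;
    }
    return bits;
}

/* Hypothetical stand-in for _xpnd4: expands each of the low 4 bits
   into a full byte, 0xFF for a 1 bit and 0x00 for a 0 bit. */
static uint32_t xpnd4_emul(uint32_t bits)
{
    uint32_t mask = 0;
    int k;
    for (k = 0; k < 4; k++)
        if (bits & (1u << k))
            mask |= (uint32_t)0xFFu << (8 * k);
    return mask;
}

/* Four pixels against a threshold of 0x30 in every byte. */
static void mask_check(void)
{
    uint32_t pixels = 0x10803040u;  /* bytes 3..0: 0x10 0x80 0x30 0x40 */
    uint32_t thresh = 0x30303030u;
    uint32_t bits = cmpgtu4_emul(pixels, thresh);
    uint32_t mask = xpnd4_emul(bits);

    assert(bits == 0x5u);           /* only bytes 2 and 0 exceed 0x30 */
    assert(mask == 0x00FF00FFu);
    assert((pixels & mask) == 0x00800040u);  /* 0x10 and 0x30 cleared */
}
```

The pixel equal to the threshold (0x30) is cleared as well, which matches the <= test in the scalar kernel.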
Clear Below Threshold Kernel, Using _cmpgtu4 and _xpnd4 Intrinsics
void clear_below_thresh(unsigned char *restrict image, int count, unsigned char threshold)
{
    int i;
    unsigned t3_t2_t1_t0;               /* Threshold (replicated) */
    unsigned p7_p6_p5_p4, p3_p2_p1_p0;  /* Pixels */
    unsigned c7_c6_c5_c4, c3_c2_c1_c0;  /* Comparison results */
    unsigned x7_x6_x5_x4, x3_x2_x1_x0;  /* Expanded masks */
    /* Replicate the threshold value four times in a single word */
    unsigned temp = _pack2(threshold, threshold);
    t3_t2_t1_t0 = _packl4(temp, temp);
    for (i = 0; i < count; i += 8)
    {
        /* Load 8 pixels from input image (one double-word). */
        p7_p6_p5_p4 = _hi(_amemd8(&image[i]));
        p3_p2_p1_p0 = _lo(_amemd8(&image[i]));
        /* Compare each of the pixels to the threshold. */
        c7_c6_c5_c4 = _cmpgtu4(p7_p6_p5_p4, t3_t2_t1_t0);
        c3_c2_c1_c0 = _cmpgtu4(p3_p2_p1_p0, t3_t2_t1_t0);
        /* Expand the comparison results to generate a bitmask. */
        x7_x6_x5_x4 = _xpnd4(c7_c6_c5_c4);
        x3_x2_x1_x0 = _xpnd4(c3_c2_c1_c0);
        /* Apply mask to the pixels. Pixels that were less than or */
        /* equal to the threshold will be forced to 0 because the */
        /* corresponding mask bits will be all 0s. The pixels that */
        /* were greater will not be modified, because their mask */
        /* bits will be all 1s. */
        p7_p6_p5_p4 = p7_p6_p5_p4 & x7_x6_x5_x4;
        p3_p2_p1_p0 = p3_p2_p1_p0 & x3_x2_x1_x0;
        /* Store the thresholded pixels back to the image. */
        _amemd8(&image[i]) = _itod(p7_p6_p5_p4, p3_p2_p1_p0);
    }
}
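The threshold replication at the top of the kernel can be checked in isolation. The sketch below assumes that _pack2 places the low halfword of its first operand in the upper half of the result and that of its second operand in the lower half, and that _packl4 keeps the low byte of each halfword of both operands. The names pack2_emul, packl4_emul, replicate_threshold, and replicate_check are hypothetical host-side helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for _pack2: low half of a goes to the upper
   half of the result, low half of b goes to the lower half. */
static uint32_t pack2_emul(uint32_t a, uint32_t b)
{
    return ((a & 0xFFFFu) << 16) | (b & 0xFFFFu);
}

/* Hypothetical stand-in for _packl4: keeps the low byte of each halfword,
   a's two bytes in the upper half of the result and b's in the lower half. */
static uint32_t packl4_emul(uint32_t a, uint32_t b)
{
    return (((a >> 16) & 0xFFu) << 24) | ((a & 0xFFu) << 16) |
           (((b >> 16) & 0xFFu) << 8)  |  (b & 0xFFu);
}

/* Replicate an 8-bit threshold into all four bytes of a word. */
static uint32_t replicate_threshold(uint8_t threshold)
{
    uint32_t temp = pack2_emul(threshold, threshold);  /* 0x00TT00TT */
    return packl4_emul(temp, temp);                    /* 0xTTTTTTTT */
}

static void replicate_check(void)
{
    assert(pack2_emul(0x30u, 0x30u) == 0x00300030u);
    assert(replicate_threshold(0x30u) == 0x30303030u);
    /* On a host, a multiply yields the same replicated word. */
    assert(replicate_threshold(0xA5u) == 0xA5u * 0x01010101u);
}
```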