Data Packing Techniques

Reposted from http://dsalli0927.blog.163.com/blog/static/888076072008715535584/

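Memory accesses on the C6000 are expensive. To raise the data throughput of the C6000, a single load/store instruction should move more than one data item. When a program operates on a run of consecutive short (16-bit) values, it can use a word (32-bit int) access to fetch two shorts at once, and then process them with a matching C6000 instruction such as _add2(), which performs two 16-bit additions simultaneously. This reduces the number of memory accesses. Similarly, on the C64x, a run of consecutive int values can be accessed with double-word (64-bit) loads and stores. This kind of optimization is called data packing.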
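
To make the idea concrete, here is a minimal host-side sketch of what a packed 16-bit add does. The name add2_model is hypothetical: it only models the behaviour of _add2() in portable C so the semantics can be checked on a PC. On a real C6000 target you would call the intrinsic itself.

```c
#include <stdint.h>

/* Hypothetical portable model of the C6000 _add2() intrinsic (an
 * illustration, not TI code): the two 16-bit halves of a and b are
 * added independently, and no carry passes from the lower half into
 * the upper half. */
static uint32_t add2_model(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;          /* lower halfword sum */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* upper halfword sum */
    return hi | lo;
}
```

For instance, add2_model(0x00010002, 0x00030004) yields 0x00040006: one 32-bit operation has produced two independent 16-bit sums.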

For example, using word accesses in place of accesses to two 16-bit shorts:

void vecsum4(short *restrict sum, const short *restrict in1,
             const short *restrict in2, unsigned int N)
{
    int i;

    /* Each iteration handles two shorts, so N must be even. */
    #pragma MUST_ITERATE(10)
    for (i = 0; i < N; i += 2)
        _amem4(&sum[i]) = _add2(_amem4_const(&in1[i]), _amem4_const(&in2[i]));
}

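Notes:

#pragma MUST_ITERATE(10) states that the loop that follows executes at least 10 times. This information is essential for software pipelining.

_amem4: intrinsics of this family specify how many bytes each memory access moves and whether the address must be aligned. _amem4(&sum[i]) tells the compiler that this is a 4-byte access at address &sum[i], aligned on a word boundary. _amem4_const(&in1[i]) is the const variant: it indicates that the in1 array is only read and its values do not change in this routine.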

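The example above assumes an even number of elements. For an odd number, a few tricks are available.

For example, pad the arrays by one element so that the loop still processes an even number. If the routine must handle arbitrary element counts, or arrays whose start address may only be aligned on a short (2-byte) boundary, a better approach is to inspect the arguments inside the routine and select a code path accordingly:

Example: a general-purpose vector sum routine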

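void vecsum5(short *restrict sum, const short *restrict in1,
             const short *restrict in2, unsigned int N)
{
    int i;

    /* Test whether sum, in2 and in1 are all aligned to a word boundary.
       short pointers are already 2-byte aligned, so only bit 1 is checked. */
    if (((int)sum | (int)in2 | (int)in1) & 0x02)
    {
        /* At least one pointer is not word aligned: add one short at a time. */
        #pragma MUST_ITERATE(20)
        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
    }
    else
    {
        /* All pointers are word aligned: add two shorts per iteration,
           stopping before a possible odd last element. */
        #pragma MUST_ITERATE(10)
        for (i = 0; i < (N & ~1u); i += 2)
            _amem4(&sum[i]) = _add2(_amem4_const(&in1[i]), _amem4_const(&in2[i]));

        /* If N is odd, i now indexes the last element; add it on its own. */
        if (N & 0x01) sum[i] = in1[i] + in2[i];
    }
}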
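
The same dispatch can be tried on a host machine. The following is a sketch under stated assumptions: add2_model, all_word_aligned and vecsum_model are hypothetical names, memcpy stands in for the _amem4 word accesses, and short is assumed to be 16 bits wide. Unlike the routine above, it masks the addresses with 0x3, so it makes no assumption about the low address bit.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical portable model of _add2(): two independent 16-bit adds. */
static uint32_t add2_model(uint32_t a, uint32_t b)
{
    return (((a >> 16) + (b >> 16)) << 16) | ((a + b) & 0xFFFFu);
}

/* Returns nonzero when all three pointers are 4-byte (word) aligned. */
static int all_word_aligned(const void *a, const void *b, const void *c)
{
    return (((uintptr_t)a | (uintptr_t)b | (uintptr_t)c) & 0x3u) == 0;
}

/* Host-side sketch of the general vector sum; assumes 16-bit short. */
static void vecsum_model(short *sum, const short *in1, const short *in2,
                         unsigned int n)
{
    unsigned int i = 0;

    if (all_word_aligned(sum, in1, in2)) {
        unsigned int pairs_end = n & ~1u;   /* largest even count <= n */
        for (; i < pairs_end; i += 2) {
            uint32_t a, b, s;
            memcpy(&a, &in1[i], sizeof a);  /* one word = two shorts */
            memcpy(&b, &in2[i], sizeof b);
            s = add2_model(a, b);
            memcpy(&sum[i], &s, sizeof s);
        }
    }

    /* Scalar path: the odd tail element, or everything when unaligned. */
    for (; i < n; i++)
        sum[i] = (short)(in1[i] + in2[i]);
}
```

Both paths produce identical results. The packed path only halves the number of loads, stores and additions.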

/////////////////////////////////////////////////////////////////////////

The following example shows a kernel that benefits from the packed compare and expand intrinsics. The Clear Below Threshold kernel scans an image of 8-bit unsigned pixels and sets every pixel that is at or below a given threshold to 0.
Clear Below Threshold Kernel

void clear_below_thresh(unsigned char *restrict image, int count, unsigned char threshold)
{
    int i;

    for (i = 0; i < count; i++)
    {
        if (image[i] <= threshold) image[i] = 0;
    }
}
Vectorization techniques are applied to the code (as described in Packed-Data Processing on the C64x), giving the result shown in the following example. The _cmpgtu4() intrinsic compares the pixels against the threshold values, and the _xpnd4() intrinsic generates a mask for setting pixels to 0. Note that the new code adds two restrictions: the input image must be double-word aligned, and count must be a multiple of 8 pixels. These restrictions are reasonable because common image sizes are a multiple of 8 pixels.

Clear Below Threshold Kernel, Using _cmpgtu4 and _xpnd4 Intrinsics

void clear_below_thresh(unsigned char *restrict image, int count, unsigned char threshold)
{
    int i;
    unsigned t3_t2_t1_t0;               /* Threshold (replicated) */
    unsigned p7_p6_p5_p4, p3_p2_p1_p0;  /* Pixels */
    unsigned c7_c6_c5_c4, c3_c2_c1_c0;  /* Comparison results */
    unsigned x7_x6_x5_x4, x3_x2_x1_x0;  /* Expanded masks */

    /* Replicate the threshold value four times in a single word */
    unsigned temp = _pack2(threshold, threshold);
    t3_t2_t1_t0 = _packl4(temp, temp);

    for (i = 0; i < count; i += 8)
    {
        /* Load 8 pixels from input image (one double-word). */
        p7_p6_p5_p4 = _hi(_amemd8(&image[i]));
        p3_p2_p1_p0 = _lo(_amemd8(&image[i]));

        /* Compare each of the pixels to the threshold. */
        c7_c6_c5_c4 = _cmpgtu4(p7_p6_p5_p4, t3_t2_t1_t0);
        c3_c2_c1_c0 = _cmpgtu4(p3_p2_p1_p0, t3_t2_t1_t0);

        /* Expand the comparison results to generate a bitmask. */
        x7_x6_x5_x4 = _xpnd4(c7_c6_c5_c4);
        x3_x2_x1_x0 = _xpnd4(c3_c2_c1_c0);

        /* Apply mask to the pixels. Pixels that were less than or */
        /* equal to the threshold will be forced to 0 because the  */
        /* corresponding mask bits will be all 0s. The pixels that */
        /* were greater will not be modified, because their mask   */
        /* bits will be all 1s.                                    */
        p7_p6_p5_p4 = p7_p6_p5_p4 & x7_x6_x5_x4;
        p3_p2_p1_p0 = p3_p2_p1_p0 & x3_x2_x1_x0;

        /* Store the thresholded pixels back to the image. */
        _amemd8(&image[i]) = _itod(p7_p6_p5_p4, p3_p2_p1_p0);
    }
}
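
To see how the compare and expand steps cooperate, the sketch below models them in portable C for one word of four pixels. The names cmpgtu4_model, xpnd4_model and clear_below4_model are hypothetical: they illustrate the semantics of _cmpgtu4() and _xpnd4() and are not TI code.

```c
#include <stdint.h>

/* Hypothetical model of _cmpgtu4(): bit k of the result is set when
 * unsigned byte k of a is greater than unsigned byte k of b. */
static uint32_t cmpgtu4_model(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    int k;
    for (k = 0; k < 4; k++) {
        uint32_t byte_a = (a >> (8 * k)) & 0xFFu;
        uint32_t byte_b = (b >> (8 * k)) & 0xFFu;
        if (byte_a > byte_b)
            result |= 1u << k;
    }
    return result;
}

/* Hypothetical model of _xpnd4(): bit k of bits expands into byte k
 * of the result, giving 0xFF when the bit is set and 0x00 otherwise. */
static uint32_t xpnd4_model(uint32_t bits)
{
    uint32_t mask = 0;
    int k;
    for (k = 0; k < 4; k++) {
        if (bits & (1u << k))
            mask |= 0xFFu << (8 * k);
    }
    return mask;
}

/* Clears every byte of four packed pixels that is <= threshold. */
static uint32_t clear_below4_model(uint32_t pixels, uint8_t threshold)
{
    uint32_t t = threshold * 0x01010101u;  /* replicate into all 4 bytes */
    return pixels & xpnd4_model(cmpgtu4_model(pixels, t));
}
```

With pixels 0x10203040 and threshold 0x20, the comparison yields 0x3 (only the two low bytes exceed the threshold), the expanded mask is 0x0000FFFF, and the result is 0x00003040.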

