OpenCL Performance Optimization Case Study Series 2: Two Simple Ways to Avoid Local Memory Bank Conflicts

Reposted from: http://hi.baidu.com/fsword73/item/51df1fafe6083e268919d39e

Author: fsword73

Bank conflicts are a common problem in memory access, and avoiding them is an effective way to speed up memory access. Two examples follow: Reduction and Prefix Sum.

1 Using Padding to Avoid Bank Conflicts in Reduction

    Take the AMD Radeon HD 5870 as an example: local memory has 32 banks and each wavefront has 64 threads. The number of bank conflicts is given by

        bank conflicts = STRIDE * 64 / 32 - 2    (when STRIDE is an even number of DWORDs)

        bank conflicts = 0                       (when STRIDE is an odd number of DWORDs)

        STRIDE = 2,    bank conflicts = 2

        STRIDE = 4,    bank conflicts = 6

        STRIDE = 8,    bank conflicts = 14

        STRIDE = 10,   bank conflicts = 18

        STRIDE = 12,   bank conflicts = 22

        The Reduction code works on uint4 elements, i.e. a STRIDE of 4 DWORDs, so its bank conflicts = 6.
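The numbers above can be cross-checked with a small host-side model. This is only a sketch under stated assumptions: thread t of a wavefront reads the single DWORD at index t * STRIDE, the bank is the DWORD index modulo 32, and "bank conflicts" means the extra hits on the busiest bank beyond the conflict-free baseline of 64 / 32 = 2. The function names are hypothetical. Under this model strides 2, 4 and 8 give 2, 6 and 14, and any odd stride gives 0, matching the list. Strides 10 and 12 come out as 2 and 6 in this model rather than 18 and 22, so the formula may be exact only for power-of-two strides; that reading is an assumption, not something the source states.

```c
#define NUM_BANKS 32        /* local memory banks on the HD 5870, per the text */
#define WAVEFRONT_SIZE 64   /* threads per wavefront, per the text */

/* Largest number of threads in one wavefront that hit the same bank when
 * thread t reads the DWORD at index t * stride (assumed access pattern). */
int max_hits_per_bank(int stride)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < WAVEFRONT_SIZE; ++t) {
        int bank = (t * stride) % NUM_BANKS;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}

/* Extra hits on the busiest bank beyond the conflict-free baseline of
 * WAVEFRONT_SIZE / NUM_BANKS = 2 hits per bank. */
int extra_bank_hits(int stride)
{
    return max_hits_per_bank(stride) - WAVEFRONT_SIZE / NUM_BANKS;
}
```

For the reduction kernel, extra_bank_hits(4) models the uint4 layout and extra_bank_hits(5) models the same data padded to 5 DWORDs.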

          Original code:

__kernel
void 
reduce(__global uint4* input, __global uint4* output, __local uint4* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);

    unsigned int localSize = get_local_size(0);
    sdata[tid] = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem
    for(unsigned int s = localSize / 2; s > 0; s >>= 1) 
    {
        if(tid < s) 
        {
            sdata[tid] += sdata[tid + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0];
}

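Optimized code. The struct must be declared with __attribute__((packed)) so that it is 5 DWORDs wide; without it, the 16-byte alignment of uint4 rounds the struct size up to 8 DWORDs.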

typedef struct __attribute__((packed))
{
    uint4 d;
    uint r;
} myData;

__kernel
void 
reduce(__global uint4* input, __global uint4* output, __local myData* sdata)
{
    // load shared mem
    unsigned int tid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);

    unsigned int localSize = get_local_size(0);
    sdata[tid].d = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // do reduction in shared mem
    for(unsigned int s = localSize / 2; s > 0; s >>= 1) 
    {
        if(tid < s) 
        {
            sdata[tid].d += sdata[tid + s].d;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write result for this block to global mem
    if(tid == 0) output[bid] = sdata[0].d;
}
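The 5-DWORD versus 8-DWORD sizes can be checked on the host with a rough stand-in for the OpenCL types. This is a sketch, not the kernel's actual layout: uint4_like is a hypothetical host type that mimics the 16-byte size and alignment of uint4, and the packed attribute is assumed to behave as it does in GCC and Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for OpenCL's uint4: 16 bytes, 16-byte aligned. */
typedef struct {
    _Alignas(16) uint32_t s[4];
} uint4_like;

/* Unpacked: the 16-byte alignment of d rounds the struct up to 32 bytes. */
typedef struct {
    uint4_like d;
    uint32_t r;
} elem_unpacked;

/* Packed (GCC/Clang attribute): 16 + 4 = 20 bytes, i.e. 5 DWORDs. */
typedef struct __attribute__((packed)) {
    uint4_like d;
    uint32_t r;
} elem_packed;

/* Element sizes expressed in DWORDs, i.e. the STRIDE between neighbours. */
size_t unpacked_dwords(void) { return sizeof(elem_unpacked) / sizeof(uint32_t); }
size_t packed_dwords(void)   { return sizeof(elem_packed) / sizeof(uint32_t); }
```

A stride of 5 DWORDs is odd, which by the rule in section 1 gives zero bank conflicts; the unpacked stride of 8 would give 14.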

2 Avoiding Bank Conflicts in Prefix Sum

    Our intern Liu Yuanhao (a graduate student at Tongji University) used a very simple trick to avoid bank conflicts in Prefix Sum.

    #define HD5870_BANKS 32

    #define AVOID_BANK_CONFLICTS(x) ((x) + (x) / HD5870_BANKS)

    Original:

      block[2*tid] = input[2*tid];

    Optimized:

      block[AVOID_BANK_CONFLICTS(2*tid)] = input[2*tid];
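To see what the padding macro above buys, the sketch below counts how many threads land on the busiest bank with and without the index remapping. It assumes the bank is the DWORD index modulo 32 and that each element of block is one DWORD; pad_index and worst_bank_hits are hypothetical helper names, not part of the original code.

```c
#define NUM_BANKS 32   /* HD 5870 local memory banks, per the text */

/* Same mapping as the padding macro: skip one slot after every NUM_BANKS elements. */
int pad_index(int x)
{
    return x + x / NUM_BANKS;
}

/* Number of threads hitting the busiest bank when thread tid accesses
 * element first + step * tid, optionally remapped through pad_index(). */
int worst_bank_hits(int threads, int first, int step, int use_padding)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int tid = 0; tid < threads; ++tid) {
        int idx = first + step * tid;
        if (use_padding)
            idx = pad_index(idx);
        int bank = idx % NUM_BANKS;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

For the load block[2*tid] with 64 threads, worst_bank_hits(64, 0, 2, 0) is 4 and worst_bank_hits(64, 0, 2, 1) is 2, the conflict-free baseline. For eight threads accessing indices 63, 127, ..., 511 (a stride of 64), the unpadded count is 8 because every access falls in bank 31, while the padded count is 1. Note that a padded buffer needs room for the largest padded index, roughly length + length / 32 elements.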

    Now let us analyze the execution efficiency of Prefix Sum.

int offset = 1;

for(int d = length >> 1; d > 0; d >>= 1)
{
    barrier(CLK_LOCAL_MEM_FENCE);

    if(tid < d)
    {
        int ai = offset * (2 * tid + 1) - 1;
        int bi = offset * (2 * tid + 2) - 1;

        block[bi] += block[ai];
    }
    offset *= 2;
}
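To make the loop concrete, here is a serial host-side emulation of it. It is a sketch under two assumptions: length is a power of two, and running all values of tid for one d before moving to the next d stands in for the barrier. The function name upsweep is hypothetical.

```c
/* Serial emulation of the loop above: the inner for-loop plays the role of
 * the work-items, and completing it before the next d plays the role of
 * barrier(CLK_LOCAL_MEM_FENCE). Assumes length is a power of two. */
void upsweep(int *block, int length)
{
    int offset = 1;
    for (int d = length >> 1; d > 0; d >>= 1) {
        for (int tid = 0; tid < d; ++tid) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            block[bi] += block[ai];
        }
        offset *= 2;
    }
}
```

With the input {1, 2, 3, 4, 5, 6, 7, 8} the array becomes {1, 3, 3, 10, 5, 11, 7, 36}: the last element holds the total, and only d of the threads do any work at each step, which is what the efficiency analysis counts.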

With length = 512, a static analysis of the execution efficiency gives:

    d = 256:   256 threads = 4 wavefronts

    d = 128:   128 threads = 2 wavefronts

    d = 64:     64 threads = 1 wavefront

    d = 32:     32 threads = 1 partial wavefront (50% SIMD utilization)

    d = 16:     16 threads = 1 partial wavefront (25% SIMD utilization)

    d = 8:       8 threads = 1 partial wavefront (12.5% SIMD utilization)

    d = 4:       4 threads = 1 partial wavefront (6.25% SIMD utilization)

    d = 2:       2 threads = 1 partial wavefront (3.125% SIMD utilization)

    d = 1:       1 thread  = 1 partial wavefront (1.5625% SIMD utilization)

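   Execution efficiency:

    511 threads doing real work / (13 wavefronts actually issued * 64 threads each) = 511 / 832 = 61.4%. So the biggest bottleneck of Prefix Sum is how to raise the actual utilization of the ALUs and the local memory unit.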
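The 61.4% figure can be reproduced with a few lines of host code. This sketch assumes, as the table does, that every step issues whole wavefronts of 64 threads and that only the tid < d threads do useful work; the function name and its output parameters are hypothetical.

```c
#define WAVEFRONT_SIZE 64   /* threads per wavefront on the HD 5870, per the text */

/* Fraction of issued SIMD lanes that do useful work over the whole loop.
 * Also reports the total active threads and the total wavefronts issued. */
double upsweep_utilization(int length, int *active_threads, int *wavefronts)
{
    int active = 0;
    int waves = 0;
    for (int d = length >> 1; d > 0; d >>= 1) {
        active += d;                                         /* threads with tid < d */
        waves += (d + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;  /* round up to whole wavefronts */
    }
    *active_threads = active;
    *wavefronts = waves;
    return (double)active / (waves * WAVEFRONT_SIZE);
}
```

For length = 512 this reports 511 active threads and 13 wavefronts, giving 511 / 832, about 0.614.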
