【CUDA】学习记录(5)-Reduction优化

Professional CUDA C Programing
代码下载:http://www.wrox.com/WileyCDA/#并行规约#并行规约
串行的规约:

int sum = 0;
for (int i = 0; i < N; i++)
    sum += array[i];

规约的特点:可交换性,计算结果和计算顺序无关。所以max,min等问题也可以用相同的方法解决。
并行的基本思路:
➤ 1将输入的vector分成更小的数据块
➤ 2一个thread负责计算一个小数据块的总和
➤ 3累加个分块的总和
最普遍的思想:一个数据块只包含两个数据,一个thread完成一对数据对的求和,每次迭代的过程中数据量减半,直到只剩一个元素。(每次迭代的计算的过程都在本地进行运算)
➤ 相邻的数据对
➤ 间隔的数据对(等长度的间隔)


Screenshot from 2017-05-02 14:42:59.png

C语言串行计算:

// Recursive Implementation of Interleaved Pair Approach
int cpuRecursiveReduce(int *data, int const size)
{
    // stop condition
    if (size == 1) return data[0];
    // renew the stride
    int const stride = size / 2;
    // in-place reduction
    for (int i = 0; i < stride; i++)
    {
        data[i] += data[i + stride];
    }
    // call recursively
    return cpuRecursiveReduce(data, stride);
}

1.Neighbored Pair Implementation with divergence

每个线程计算两个相邻元素的和

__global__ void reduceNeighbored (int *g_idata, int *g_odata, unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;

    // boundary check
    if (idx >= n) return;

    // in-place reduction in global memory
    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if ((tid % (2 * stride)) == 0)
        {
            idata[tid] += idata[tid + stride];
        }
        // synchronize within threadblock
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}

一个block处理的过程

一个block处理过程分析:
block 0:data 0~data 7
一个block处理的开始数据地址:
int *idata = g_idata + blockIdx.x * blockDim.x;
for (int stride = 1; stride < blockDim.x; stride *= 2)
1.stride=1,tid=0,2,4,6四个thread计算相邻元素的和。

_syncthreads()//保证在这次for循环过程中block中的0,2,4,6线程都已经完成了计算。

2.stride=2,tid=0,4的线程参与计算
3.stride=4,tid=0参与了计算,完成了3 1 7 0 4 1 6 3 这8个数据的求和计算,并将结果保存到g_odata[0]=25;


Screenshot from 2017-05-02 15:17:56.png

假设有64个数据,8个block,每个block处理8个数据的和。g_idata保存了输入的原始数据,g_odata有8个数据,g_odata[i]表示的是block i的计算总和,当在CPU端串行计算g_odata数组的和则计算出了所有数据的总和。

nvcc -arch=35 -rdc=true nestedReduce.cu -o nestedReduce
./nestedReduce

注意:直接nvcc编译报错
http://stackoverflow.com/questions/19287461/compiling-code-containing-dynamic-parallelism-fails

2. Improving Divergence in Parallel Reduction

每个线程计算两个相邻元素的和的计算过程我们发现:

if ((tid % (2 * stride)) == 0)

只有当tid满足一定条件时才会执行,如比计算8个数据时:
第一次迭代:tid=0,2,4,6只有一半的线程需要进行计算。
第二次迭代:tid=0,4只有25%的线程参与了计算。
第三次迭代:tid=0,只有1个线程参与了计算。
虽然只有一部分线程参与计算,但是也需要调度所有的线程。因为线程的调度以warp为基本单位,warp是32个连续的thread,比如第二次迭代时:线程0-4,虽然只需要0和4进行计算,但是仍然要调度1,3线程。
合理安排thread处理的数据,减少分支,thread计算相邻元素的和。

Screenshot from 2017-05-02 16:49:31.png

 for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        // convert tid into local array index
        int index = 2 * stride * tid;//最大的区别
        if (index < blockDim.x)
        {
            idata[index] += idata[index + stride];
        }

        // synchronize within threadblock
        __syncthreads();
    }

如果一个block有512个threads,第一次迭代时8个warp reduction,剩余8个warp do nothing,第二次迭代 4个warp reduction,剩余12个warp do nothing,第三次迭代2个warp执行,剩余14个warp donothing,第四次1个warp执行...

3. Reducing with Interleaved Pairs

Screenshot from 2017-05-02 17:30:52.png

每个线程计算等间距的两个元素的和:全局内存访问顺序有所不同

__global__ void reduceInterleaved (int *g_idata, int *g_odata, unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x;

    // boundary check
    if(idx >= n) return;

    // in-place reduction in global memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
        {
            idata[tid] += idata[tid + stride];
        }

        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}

Unrolling Loops

展开循环是能够降低分支的频率和减少循环的维护开销,是一项优化的技术。

//方法1
for (int i = 0; i < 100; i++) {
    a[i] = b[i] + c[i];
}
//方法2
for (int i = 0; i < 100; i+=2) {
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
}

从高级语言层面是无法看出性能提升的原因的,需要从low-level instruction层面去分析,第二段代码循环次数减少了一半,而循环体两句语句的读写操作的执行在CPU上是可以同时执行互相独立的,所以相对第一段,第二段性能要好。
Unrolling 在CUDA编程中意义更重。我们的目标依然是通过减少指令执行消耗,增加更多的独立指令来提高性能。这样就会增加更多的并行操作从而产生更高的指令和内存带宽(bandwidth)。也就提供了更多的eligible warps来帮助hide instruction/memory latency 。

4. Reducing with Unrolling

在reduceInterleaved核函数中一个block 处理一个data block。我们如何用single block处理两个data block来展开循环?对于每个线程块,计算来自两个data block的数据。这是一个循环分区(在第1章中介绍):每个线程在多个数据块上工作处理每个数据块中的单个元素。
Step1:单个线程块计算相邻两个data block对应位置上的数据。

 block0:1  0  1  1  3  4  0  2
 block1:1  2  5  7  2  1  6  3
 block1:2  2  6  8  5  5  6  5

Step2:一个线程块按照reduceInterleaved计算一个data blcok中的数据总和。

__global__ void reduceUnrolling2 (int *g_idata, int *g_odata, unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x * 2;

    // unrolling 2
    if (idx + blockDim.x < n) g_idata[idx] += g_idata[idx + blockDim.x];

    __syncthreads();

    // in-place reduction in global memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
        {
            idata[tid] += idata[tid + stride];
        }

        // synchronize within threadblock
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}

Screenshot from 2017-05-02 19:59:46.png
    int size = 1 << 24; // total number of elements to reduce
    int blocksize = 512;   // initial block size
    dim3 block (blocksize, 1);
    dim3 grid  ((size + block.x - 1) / block.x, 1);
    //由于每个block处理两个data block,所以需要调整grid的配置.
    reduceUnrolling2<<>>(d_idata, d_odata, size);

实验结果1:Improving Divergence in Parallel Reduction

nvprof --metrics inst_per_warp ./reduceInteger
Device "GeForce GT 740M (0)"
reduceNeighbored Instructions per warp  291.625000  291.625000  291.625000
reduceNeighboredLess Instructions per warp  115.812500  115.812500  115.812500
nvprof --metrics gld_throughput ./reduceInteger
reduceNeighbored Global Load Throughput  16.840GB/s  16.840GB/s  16.839GB/s
reduceNeighboredLess Global Load Throughput  17.905GB/s  17.905GB/s  17.905GB/s

实验结果2:Reducing with Interleaved Pairs

./reduceInteger starting reduction at device 0: GeForce GT 740M     with array size 16777216  grid 32768 block 512
cpu reduce      elapsed 0.046758 sec cpu_sum: 2139353471
gpu Neighbored  elapsed 0.031896 sec gpu_sum: 2139353471 <<>>
gpu Neighbored2 elapsed 0.029930 sec gpu_sum: 2139353471 <<>>
gpu Interleaved elapsed 0.018442 sec gpu_sum: 2139353471 <<>>

实验结果3:循环展开的不同程度
reduceUnrolling4: each threadblock handles 4 data blocks
reduceUnrolling8: each threadblock handles 8 data blocks
可以看出,同一个thread中如果能有更多的独立的load/store操作,会产生更好的性能,因为这样做memory latency能够更好的被隐藏。我们可以使用nvprof的dram_read_throughput来验证:

nvprof --metrics dram_read_throughput ./reduceInteger
//gt740m
reduceUnrolling8 Device Memory Read Throughput  10.086GB/s  
reduceUnrolling4 Device Memory Read Throughput  7.5325GB/s  
reduceUnrolling2 Device Memory Read Throughput  5.6396GB/s  

5. Reducing with Unrolled Warps

__syncthreads用于块内同步。在reduction kernel中,他被用来在每次循环中保证同一个线程块中所有thread的写global memory的操作都已完成,这样才能进行下一阶段的计算。那么,当kernel进行到只需要少于或等32个thread(也就是一个warp)呢?由于我们是使用的SIMT模式,warp内的thread 是有一个隐式的同步过程的。最后六次迭代可以用下面的语句展开:

 // unrolling warp
    if (tid < 32)
    {
        volatile int *vmem = idata;
        vmem[tid] += vmem[tid + 32];
        vmem[tid] += vmem[tid + 16];
        vmem[tid] += vmem[tid +  8];
        vmem[tid] += vmem[tid +  4];
        vmem[tid] += vmem[tid +  2];
        vmem[tid] += vmem[tid +  1];
    }

warp unrolling避免了最后一个warp内的__syncthreads同步操作。这里注意下volatile修饰符,他告诉编译器每次执行赋值时必须将vmem[tid]的值store回global memory。如果不这样做的话,编译器或cache可能会优化我们读写global/shared memory。有了这个修饰符,编译器就会认为这个值会被其他thread修改,从而使得每次读写都直接去memory而不是去cache或者register。
最初我没有理解unrolling warp的含义,我们知道在单个warp中,指令遵循SIMT,就是在同一时刻32个线程执行相同的指令,也就是说当活动线程数目少于32个时,我们不需要进行同步控制。
32个thread处理64个数据相加:
第一次迭代:[0]+[0+32]、[1]+[[1+32]、[2]+[[2+32]...[31]+[31+32]由于这32个线程在一个warp中,所以这些计算默认是同步的,计算结果保存到[0-31]。
第二次迭代:[0]+[0+16]、[1]+[[1+16]、[2]+[[2+16]...[15]+[15+16]...([15]+[15+16]),由于这32个线程在一个warp中,所以这些计算默认是同步的,计算结果仍然保存到[0-31],但是有效的计算结果只有[0-15].
第三次迭代:[0]+[0+8]、[1]+[[1+8]、[2]+[[2+8]...[7]+[7+8]...([31]+[31+8]),有效的计算结果保存到[0-7].
第四次迭代:[0]+[0+4]、[1]+[[1+4]、[2]+[[2+4]、[3]+[3+4]...([31]+[31+4]),有效的计算结果保存到[0-3].
第五次迭代:[0]+[0+2]、[1]+[[1+2]...([31]+[31+2]),有效的计算结果保存到[0-1].
第六次迭代:[0]+[0+1]...[31]+[31+1],有效的计算结果保存到[0].
在每次迭代的过程中,由于是在一个warp内,SIMT,所以不需再使用块内同步。

reduceUnrollWarps8<<>> (d_idata, d_odata, size);
//实验结果gt740m
nvprof --metrics stall_sync ./reduceInteger
gpu Unrolling8  elapsed 0.097200 sec gpu_sum: 2139353471 <<>>
gpu UnrollWarp8 elapsed 0.096202 sec gpu_sum: 2139353471 <<>>
nvprof --metrics stall_sync ./reduceInteger//因为同步占用的时间
reduceUnrollWarps8 Issue Stall Reasons(Synchronization)       9.97% 
reduceUnrolling8   Issue Stall Reasons (Synchronization)      14.54%  

6. Reducing with Complete Unrolling

如果在编译时已知了迭代次数,就可以完全把循环展开。Fermi和Kepler每个block的最大thread数目都是1024,博文中的kernel的迭代次数都是基于blockDim的,所以完全展开循环是可行的。

__global__ void reduceCompleteUnrollWarps8 (int *g_idata, int *g_odata,
        unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x * 8 + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x * 8;

    // unrolling 8
    if (idx + 7 * blockDim.x < n)
    {
        int a1 = g_idata[idx];
        int a2 = g_idata[idx + blockDim.x];
        int a3 = g_idata[idx + 2 * blockDim.x];
        int a4 = g_idata[idx + 3 * blockDim.x];
        int b1 = g_idata[idx + 4 * blockDim.x];
        int b2 = g_idata[idx + 5 * blockDim.x];
        int b3 = g_idata[idx + 6 * blockDim.x];
        int b4 = g_idata[idx + 7 * blockDim.x];
        g_idata[idx] = a1 + a2 + a3 + a4 + b1 + b2 + b3 + b4;
    }

    __syncthreads();

    // in-place reduction and complete unroll
    if (blockDim.x >= 1024 && tid < 512) idata[tid] += idata[tid + 512];

    __syncthreads();

    if (blockDim.x >= 512 && tid < 256) idata[tid] += idata[tid + 256];

    __syncthreads();

    if (blockDim.x >= 256 && tid < 128) idata[tid] += idata[tid + 128];

    __syncthreads();

    if (blockDim.x >= 128 && tid < 64) idata[tid] += idata[tid + 64];

    __syncthreads();

    // unrolling warp
    if (tid < 32)
    {
        volatile int *vsmem = idata;
        vsmem[tid] += vsmem[tid + 32];
        vsmem[tid] += vsmem[tid + 16];
        vsmem[tid] += vsmem[tid +  8];
        vsmem[tid] += vsmem[tid +  4];
        vsmem[tid] += vsmem[tid +  2];
        vsmem[tid] += vsmem[tid +  1];
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}

main中调用方式

reduceCompleteUnrollWarps8<<>>(d_idata, d_odata, size);

7. Reducing with Template Functions

CUDA代码支持模板,我们可以如下设置block大小:

template 
__global__ void reduceCompleteUnroll(int *g_idata, int *g_odata,
                                     unsigned int n)
{
    // set thread ID
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x * 8 + threadIdx.x;

    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * blockDim.x * 8;

    // unrolling 8
    if (idx + 7 * blockDim.x < n)
    {
        int a1 = g_idata[idx];
        int a2 = g_idata[idx + blockDim.x];
        int a3 = g_idata[idx + 2 * blockDim.x];
        int a4 = g_idata[idx + 3 * blockDim.x];
        int b1 = g_idata[idx + 4 * blockDim.x];
        int b2 = g_idata[idx + 5 * blockDim.x];
        int b3 = g_idata[idx + 6 * blockDim.x];
        int b4 = g_idata[idx + 7 * blockDim.x];
        g_idata[idx] = a1 + a2 + a3 + a4 + b1 + b2 + b3 + b4;
    }

    __syncthreads();

    // in-place reduction and complete unroll
    if (iBlockSize >= 1024 && tid < 512) idata[tid] += idata[tid + 512];

    __syncthreads();

    if (iBlockSize >= 512 && tid < 256)  idata[tid] += idata[tid + 256];

    __syncthreads();

    if (iBlockSize >= 256 && tid < 128)  idata[tid] += idata[tid + 128];

    __syncthreads();

    if (iBlockSize >= 128 && tid < 64)   idata[tid] += idata[tid + 64];

    __syncthreads();

    // unrolling warp
    if (tid < 32)
    {
        volatile int *vsmem = idata;
        vsmem[tid] += vsmem[tid + 32];
        vsmem[tid] += vsmem[tid + 16];
        vsmem[tid] += vsmem[tid +  8];
        vsmem[tid] += vsmem[tid +  4];
        vsmem[tid] += vsmem[tid +  2];
        vsmem[tid] += vsmem[tid +  1];
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = idata[0];
}

注意:最大的不同则是定义了template 来设置block的大小。
代码中的if (iBlockSize >= 1024 && tid < 512) idata[tid] += idata[tid + 512];等判断语句会根据iBlockSize的值去掉无用的语句,比如iBlockSize=256,则该语句永远为false,则编译器会自动把该语句去除。实际的作用类似于case:

switch (blocksize) {  
    case 1024:  
        reduceCompleteUnroll<1024><<>>(d_idata, d_odata, size);  
        break;  
    case 512:  
        reduceCompleteUnroll<512><<>>(d_idata, d_odata, size);  
        break;  
    case 256:  
        reduceCompleteUnroll<256><<>>(d_idata, d_odata, size);  
        break;  
    case 128:  
        reduceCompleteUnroll<128><<>>(d_idata, d_odata, size);  
        break;  
    case 64:  
        reduceCompleteUnroll<64><<>>(d_idata, d_odata, size);  
        break;  
}      

总结Reduction Kernel Performance

ccit@ccit:~/hym/CodeSamples/chapter03$ ./reduceInteger 
./reduceInteger starting reduction at device 0: Tesla K80     with array size 16777216  grid 32768 block 512
cpu reduce      elapsed 0.085437 sec cpu_sum: 2139353471
gpu Neighbored  elapsed 0.008413 sec gpu_sum: 2139353471 <<>>
gpu Neighbored2 elapsed 0.007775 sec gpu_sum: 2139353471 <<>>
gpu Interleaved elapsed 0.005087 sec gpu_sum: 2139353471 <<>>
gpu Unrolling2  elapsed 0.002878 sec gpu_sum: 2139353471 <<>>
gpu Unrolling4  elapsed 0.001665 sec gpu_sum: 2139353471 <<>>
gpu Unrolling8  elapsed 0.000970 sec gpu_sum: 2139353471 <<>>
gpu UnrollWarp8 elapsed 0.000873 sec gpu_sum: 2139353471 <<>>
gpu Cmptnroll8  elapsed 0.000865 sec gpu_sum: 2139353471 <<>>
gpu Cmptnroll   elapsed 0.000831 sec gpu_sum: 2139353471 <<>>
==11132== Profiling application: ./reduceInteger
==11132== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 73.30%  80.877ms         9  8.9864ms  8.8970ms  9.0921ms  [CUDA memcpy HtoD]
  7.67%  8.4588ms         1  8.4588ms  8.4588ms  8.4588ms  reduceNeighbored(int*, int*, unsigned int)
  7.13%  7.8633ms         1  7.8633ms  7.8633ms  7.8633ms  reduceNeighboredLess(int*, int*, unsigned int)
  4.66%  5.1374ms         1  5.1374ms  5.1374ms  5.1374ms  reduceInterleaved(int*, int*, unsigned int)
  2.64%  2.9117ms         1  2.9117ms  2.9117ms  2.9117ms  reduceUnrolling2(int*, int*, unsigned int)
  1.43%  1.5799ms         1  1.5799ms  1.5799ms  1.5799ms  reduceUnrolling4(int*, int*, unsigned int)
  0.82%  901.81us         1  901.81us  901.81us  901.81us  reduceUnrolling8(int*, int*, unsigned int)
  0.77%  847.57us         1  847.57us  847.57us  847.57us  reduceUnrollWarps8(int*, int*, unsigned int)
  0.76%  836.63us         1  836.63us  836.63us  836.63us  reduceCompleteUnrollWarps8(int*, int*, unsigned int)
  0.74%  821.97us         1  821.97us  821.97us  821.97us  void reduceCompleteUnroll(int*, int*, unsigned int)
  0.10%  107.23us         9  11.914us  5.8560us  21.216us  [CUDA memcpy DtoH]

==11132== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 55.86%  341.89ms         2  170.94ms  533.87us  341.36ms  cudaMalloc
 24.73%  151.33ms         1  151.33ms  151.33ms  151.33ms  cudaDeviceReset
 13.32%  81.521ms        18  4.5289ms  37.614us  9.1564ms  cudaMemcpy
  5.01%  30.662ms        18  1.7035ms  123.31us  8.4688ms  cudaDeviceSynchronize
  0.49%  3.0276ms       364  8.3170us     133ns  289.29us  cuDeviceGetAttribute
  0.25%  1.5256ms         4  381.39us  364.76us  393.17us  cuDeviceTotalMem
  0.12%  707.06us         1  707.06us  707.06us  707.06us  cudaGetDeviceProperties
  0.10%  588.83us         2  294.41us  211.99us  376.83us  cudaFree
  0.07%  452.08us         9  50.231us  39.824us  65.419us  cudaLaunch
  0.04%  247.65us         4  61.912us  54.456us  65.017us  cuDeviceGetName
  0.00%  12.626us        27     467ns     152ns  1.3440us  cudaSetupArgument
  0.00%  9.8140us         9  1.0900us     648ns  1.6530us  cudaConfigureCall
  0.00%  9.7070us         1  9.7070us  9.7070us  9.7070us  cudaSetDevice
  0.00%  4.7200us        12     393ns     160ns     750ns  cuDeviceGet
  0.00%  4.1790us         3  1.3930us     288ns  3.2960us  cuDeviceGetCount
ccit@ccit:~/hym/CodeSamples/chapter03$ nvprof ./reduceInteger
==11226== NVPROF is profiling process 11226, command: ./reduceInteger
./reduceInteger starting reduction at device 0: Tesla K80     with array size 16777216  grid 32768 block 512
cpu reduce      elapsed 0.060118 sec cpu_sum: 2139353471
gpu Neighbored  elapsed 0.008543 sec gpu_sum: 2139353471 <<>>
gpu Neighbored2 elapsed 0.007923 sec gpu_sum: 2139353471 <<>>
gpu Interleaved elapsed 0.005194 sec gpu_sum: 2139353471 <<>>
gpu Unrolling2  elapsed 0.002968 sec gpu_sum: 2139353471 <<>>
gpu Unrolling4  elapsed 0.001631 sec gpu_sum: 2139353471 <<>>
gpu Unrolling8  elapsed 0.000953 sec gpu_sum: 2139353471 <<>>
gpu UnrollWarp8 elapsed 0.000905 sec gpu_sum: 2139353471 <<>>
gpu Cmptnroll8  elapsed 0.000911 sec gpu_sum: 2139353471 <<>>
gpu Cmptnroll   elapsed 0.000884 sec gpu_sum: 2139353471 <<>>
==11226== Profiling application: ./reduceInteger
==11226== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 72.93%  79.366ms         9  8.8184ms  8.7691ms  8.9602ms  [CUDA memcpy HtoD]
  7.77%  8.4588ms         1  8.4588ms  8.4588ms  8.4588ms  reduceNeighbored(int*, int*, unsigned int)
  7.22%  7.8620ms         1  7.8620ms  7.8620ms  7.8620ms  reduceNeighboredLess(int*, int*, unsigned int)
  4.72%  5.1380ms         1  5.1380ms  5.1380ms  5.1380ms  reduceInterleaved(int*, int*, unsigned int)
  2.68%  2.9131ms         1  2.9131ms  2.9131ms  2.9131ms  reduceUnrolling2(int*, int*, unsigned int)
  1.45%  1.5792ms         1  1.5792ms  1.5792ms  1.5792ms  reduceUnrolling4(int*, int*, unsigned int)
  0.83%  901.11us         1  901.11us  901.11us  901.11us  reduceUnrolling8(int*, int*, unsigned int)
  0.78%  849.88us         1  849.88us  849.88us  849.88us  reduceUnrollWarps8(int*, int*, unsigned int)
  0.77%  838.42us         1  838.42us  838.42us  838.42us  reduceCompleteUnrollWarps8(int*, int*, unsigned int)
  0.75%  821.56us         1  821.56us  821.56us  821.56us  void reduceCompleteUnroll(int*, int*, unsigned int)
  0.09%  102.27us         9  11.363us  5.2480us  20.416us  [CUDA memcpy DtoH]

==11226== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 49.61%  248.53ms         2  124.26ms  574.69us  247.95ms  cudaMalloc
 27.09%  135.73ms         1  135.73ms  135.73ms  135.73ms  cudaDeviceReset
 15.93%  79.815ms        18  4.4342ms  33.741us  9.0381ms  cudaMemcpy
  6.14%  30.739ms        18  1.7077ms  122.83us  8.4672ms  cudaDeviceSynchronize
  0.56%  2.7905ms       364  7.6660us     128ns  274.24us  cuDeviceGetAttribute
  0.29%  1.4569ms         4  364.24us  326.50us  378.80us  cuDeviceTotalMem
  0.14%  683.42us         1  683.42us  683.42us  683.42us  cudaGetDeviceProperties
  0.10%  499.91us         2  249.96us  185.44us  314.47us  cudaFree
  0.09%  446.98us         9  49.663us  41.375us  67.690us  cudaLaunch
  0.04%  214.37us         4  53.591us  52.862us  54.100us  cuDeviceGetName
  0.00%  12.251us        27     453ns     146ns  1.2490us  cudaSetupArgument
  0.00%  11.131us         1  11.131us  11.131us  11.131us  cudaSetDevice
  0.00%  8.9040us         9     989ns     785ns  1.8110us  cudaConfigureCall
  0.00%  4.0720us         3  1.3570us     257ns  3.4250us  cuDeviceGetCount
  0.00%  2.9850us        12     248ns     125ns     518ns  cuDeviceGet

根据该实验结果我们可以发现拷贝数据的时间要80ms,在cpu上计算的总时间60ms。

Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla K80 (0)"
    Kernel: reduceInterleaved(int*, int*, unsigned int)
          1                            gld_efficiency             Global Memory Load Efficiency      96.15%      96.15%      96.15%
          1                            gst_efficiency            Global Memory Store Efficiency      95.52%      95.52%      95.52%
    Kernel: reduceCompleteUnrollWarps8(int*, int*, unsigned int)
          1                            gld_efficiency             Global Memory Load Efficiency     107.54%     107.54%     107.54%
          1                            gst_efficiency            Global Memory Store Efficiency      99.40%      99.40%      99.40%
    Kernel: reduceNeighbored(int*, int*, unsigned int)
          1                            gld_efficiency             Global Memory Load Efficiency      25.02%      25.02%      25.02%
          1                            gst_efficiency            Global Memory Store Efficiency      25.00%      25.00%      25.00%
    Kernel: reduceUnrolling8(int*, int*, unsigned int)
          1                            gld_efficiency             Global Memory Load Efficiency      99.21%      99.21%      99.21%
          1                            gst_efficiency            Global Memory Store Efficiency      97.71%      97.71%      97.71%
    Kernel: reduceUnrolling4(int*, int*, unsigned int)
          1                            gld_efficiency             Global Memory Load Efficiency      98.68%      98.68%      98.68%
          1                            gst_efficiency            Global Memory Store Efficiency      97.71%      97.71%      97.71%
    Kernel: reduceUnrolling2(int*, int*, unsigned int)
          1                            gld_efficiency             Global Memory Load Efficiency      98.04%      98.04%      98.04%
          1                            gst_efficiency            Global Memory Store Efficiency      97.71%      97.71%      97.71%
    Kernel: void reduceCompleteUnroll(int*, int*, unsigned int)
          1                            gld_efficiency             Global Memory Load Efficiency     107.54%     107.54%     107.54%
          1                            gst_efficiency            Global Memory Store Efficiency      99.40%      99.40%      99.40%
    Kernel: reduceUnrollWarps8(int*, int*, unsigned int)
          1                            gld_efficiency             Global Memory Load Efficiency     107.54%     107.54%     107.54%
          1                            gst_efficiency            Global Memory Store Efficiency      99.40%      99.40%      99.40%
    Kernel: reduceNeighboredLess(int*, int*, unsigned int)
          1                            gld_efficiency             Global Memory Load Efficiency      25.02%      25.02%      25.02%
          1                            gst_efficiency            Global Memory Store Efficiency      25.00%      25.00%      25.00%

你可能感兴趣的:(【CUDA】学习记录(5)-Reduction优化)