CUDA (27): Parallel Reduction, Original Implementation

Abstract
This post implements a basic, unoptimized version of parallel reduction in CUDA.

1. Key Code

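// Reduction within a single CUDA block (interleaved addressing).
// Assumes tid is the thread index within the block (e.g. threadIdx.x)
// and data holds 1024 elements, one per thread.
for (int i = 1; i < 1024; i *= 2) {
    // In each pass, threads whose index is a multiple of 2*i accumulate
    // the partial sum located i elements to the right.
    if ((tid % (2 * i)) == 0) {
        data[tid] += data[tid + i];
    }
    // Wait until every thread has finished this pass before doubling the stride.
    __syncthreads();
}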

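So, what does the code above do? In each pass, every thread whose index is a multiple of 2*i adds the element i positions to its right into its own element. The stride i doubles on every pass (1, 2, 4, ..., 512), so after ten passes data[0] holds the sum of all 1024 elements. The operation diagram below illustrates this.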

[Figure 1: operation diagram of the parallel reduction]
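
The passes can also be traced on the host. The following is a minimal sketch in plain C++, not the GPU kernel itself: the function name `reduce_trace` is hypothetical, each loop index `tid` stands in for one CUDA thread, and finishing the inner loop before doubling `i` plays the role of `__syncthreads()`. It assumes the input length is a non-zero power of two.

```cpp
#include <cstdio>
#include <vector>

// Host-side emulation of the interleaved reduction loop (illustration only).
// Assumption: data.size() is a non-zero power of two.
// Returns the total, which ends up in data[0] just as on the GPU.
int reduce_trace(std::vector<int> data, bool verbose) {
    const int n = static_cast<int>(data.size());
    for (int i = 1; i < n; i *= 2) {
        // One "parallel" pass: only indices that are multiples of 2*i are active.
        for (int tid = 0; tid < n; ++tid) {
            if (tid % (2 * i) == 0) {
                data[tid] += data[tid + i];
            }
        }
        // Leaving the inner loop corresponds to the __syncthreads() barrier.
        if (verbose) {
            std::printf("stride %d:", i);
            for (int v : data) std::printf(" %d", v);
            std::printf("\n");
        }
    }
    return data[0];
}
```

Calling `reduce_trace({0, 1, 2, 3, 4, 5, 6, 7}, true)` prints the array after each pass and returns 28:

stride 1: 1 1 5 3 9 5 13 7
stride 2: 6 1 5 3 22 5 13 7
stride 4: 28 1 5 3 22 5 13 7

Only the entries at multiples of 2*i change in a pass. The other entries keep stale partial sums that are never read again.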

2. Experimental Results
[Figure 2: experimental results]

In this first experiment, we implement the basic reduction in CUDA. The kernel runs on an NVIDIA GTX 780 Ti, with an Intel Core i7 as the host CPU and Windows 7 as the operating system. The results show that computing the reduction 0+1+2+…+1023 one thousand times takes 13.417 ms in total, or about 13.4 µs per reduction on average.
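
When timing a kernel like this, it helps to confirm that the result is correct as well. The sketch below computes the expected value on the host in two independent ways. The helper names `reference_sum` and `closed_form_sum` are hypothetical and are not taken from the original source.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: serial reference sum of 0, 1, ..., n-1.
std::int64_t reference_sum(int n) {
    std::vector<std::int64_t> values(n);
    std::iota(values.begin(), values.end(), std::int64_t{0});  // fill with 0..n-1
    return std::accumulate(values.begin(), values.end(), std::int64_t{0});
}

// Hypothetical helper: closed form n(n-1)/2 for the same sum.
std::int64_t closed_form_sum(int n) {
    return static_cast<std::int64_t>(n) * (n - 1) / 2;
}
```

For n = 1024 both functions return 523776, which is the value data[0] should hold once the reduction of 0+1+2+…+1023 has finished.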

3. More Details
For more details, see my source code on GitHub. Anyone interested in this project is warmly welcome to contribute.
