CUDA (30): Parallel Reduction with Loop Unrolling

Abstract
This post describes how to eliminate unnecessary loops in CUDA code through loop unrolling.

1. Loop Unrolling on the CPU
When talking about "unrolling", the first thing that comes to mind is "#pragma unroll", a compiler optimization directive. As Wikipedia explains, loop unrolling is a technique that attempts to optimize a program's execution speed at the expense of its binary size. For example, consider this original code:

for (int i = 0; i < 4; i++) {
	cout << "hello world" << endl;
}

After unrolling, the loop-control instructions (counter increment and branch test) disappear, buying higher performance at the unavoidable cost of more code space:

cout << "hello world" << endl;
cout << "hello world" << endl;
cout << "hello world" << endl;
cout << "hello world" << endl;

Remember that there is no free lunch: loop unrolling is a classic instance of the space-time tradeoff.

2. Loop Unrolling on the GPU
The key part of a CUDA reduction kernel, with the tree-reduction loop fully unrolled, is shown below (assuming a block of 1024 threads, a shared-memory array data, and tid = threadIdx.x).

// Tree reduction with the loop fully unrolled (1024 threads per block).
	if(tid < 512){
		data[tid] += data[tid + 512];
	}
	__syncthreads();

	if(tid < 256){
		data[tid] += data[tid + 256];
	}
	__syncthreads();

	if(tid < 128){
		data[tid] += data[tid + 128];
	}
	__syncthreads();

	if(tid < 64){
		data[tid] += data[tid + 64];
	}
	__syncthreads();

	// The last 32 active threads form a single warp, so no __syncthreads()
	// is needed between steps. The volatile pointer (assuming data is int*)
	// keeps the compiler from caching shared-memory values in registers,
	// which this warp-synchronous pattern relies on.
	if(tid < 32){
		volatile int *vdata = data;
		vdata[tid] += vdata[tid + 32];
		vdata[tid] += vdata[tid + 16];
		vdata[tid] += vdata[tid + 8];
		vdata[tid] += vdata[tid + 4];
		vdata[tid] += vdata[tid + 2];
		vdata[tid] += vdata[tid + 1];
	}

3. Experimental Results
[Figure 1: experimental results]

Unrolling the reduction loop yields a further performance improvement over the looped version. That's really cool, isn't it?

4. More details
The complete CUDA-based reduction code can be viewed on GitHub.
