A Common CUDA Error Scenario: Random Results

Recently I needed to revisit my earlier CUDA optimization code to measure the speedup of each optimization, so after every optimization step I had to compare the C and CUDA results. As it happened, I ran into the following situation:

Kernel functionality: the kernel D_SDF1_SRC1_MultiChan has to process multiple channels within a single block, because the channels can share data; shared memory is used to provide faster and fewer memory accesses. Its block layout is shown in the figure below:
(Figure 1: block layout of the multi-channel kernel)
So the intention was for the threads in each row to use a different constant, with threadIdx.y as the row identifier:

__global__ void SRC1Kernel_MultiChan(short *d_sr1, float *d_hsrc1, complexType *d_ssrc1, double *d_fc)
{
	int bx = blockIdx.x; //unique i
	int tx = threadIdx.x; //unique m
	int ty = threadIdx.y; //unique fc
	int stride = gridDim.x; 
	int stride_SM = blockDim.x * ty;

	__shared__ float add_re[Q * ChanNum];
	__shared__ float add_im[Q * ChanNum];
	__shared__ double fc_kernel[ChanNum];

	if (tx == 0) fc_kernel[ty] = d_fc[ty];
	__syncthreads();


	for (int i = bx; i < OLENGTH - Q; i += stride) {
		int t1 = tx + ((D * i) % I) * Q;
		int t2 = (D * i) / I - tx + (Q - 1);

		double w_tmp = fc_kernel[ty];

		w_tmp  = (w_tmp / d_fs) * t2;
		w_tmp = 2 * d_pi *(w_tmp - (int)w_tmp);

		add_re[tx] = d_hsrc1[t1] * d_sr1[t2] * __cosf(w_tmp);
		add_im[tx] = d_hsrc1[t1] * d_sr1[t2] * __sinf(w_tmp) * -1;

		__syncthreads();

		if (tx == 0) {
			for (int j = 1; j < Q; ++j) {
				add_re[0 + stride_SM] += add_re[j + stride_SM];
				add_im[0 + stride_SM] += add_im[j + stride_SM];
			}
		}

		__syncthreads();

		d_ssrc1[i + ty * OLENGTH].re = add_re[0 + stride_SM];
		d_ssrc1[i + ty * OLENGTH].im = add_im[0 + stride_SM];

	}
}

But I found that the result was always wrong. However, when I changed

//double w_tmp = fc_kernel[ty]; // changed to
double w_tmp = fc_kernel[0];

the result became correct.
(Figure 2)
Analyzing it carefully: when ty was used as the row identifier, the result differed slightly on every run, i.e. it was random. A closer look at the code revealed the cause: when each row's results were written into shared memory, the data overlapped:

		add_re[tx] = d_hsrc1[t1] * d_sr1[t2] * __cosf(w_tmp);
		add_im[tx] = d_hsrc1[t1] * d_sr1[t2] * __sinf(w_tmp) * -1;

Because only tx is used as the index here, every row's results are written into the same first blockDim.x slots of shared memory. And since the GPU actually executes in units of warps (32 threads), the final contents of add_re and add_im depend on whichever warp happened to write last, which changes from run to run.
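
To make the failure mode concrete, here is a minimal, self-contained toy kernel (a hypothetical sketch, not part of the original project) that reproduces the same collision: two threadIdx.y rows write their partial results into the same shared-memory slots, so the per-row sums depend on whichever warp's stores landed last:

#include <cstdio>
#include <cuda_runtime.h>

#define COLS 32   // plays the role of Q (one full warp per row)
#define ROWS 2    // plays the role of ChanNum

__global__ void overlappingReduce(float *out)
{
	__shared__ float buf[ROWS * COLS];
	int tx = threadIdx.x, ty = threadIdx.y;

	// BUG: the index ignores ty, so both rows race over the same COLS slots;
	// the fix is buf[tx + ty * COLS] = ...
	buf[tx] = (float)(ty + 1);          // row 0 writes 1.0f, row 1 writes 2.0f
	__syncthreads();

	if (tx == 0) {
		float sum = 0.0f;
		for (int j = 0; j < COLS; ++j)
			sum += buf[j];              // reads whatever survived the write race
		out[ty] = sum;                  // intended: 32.0 for row 0, 64.0 for row 1
	}
}

int main()
{
	float *d_out, h_out[ROWS];
	cudaMalloc(&d_out, ROWS * sizeof(float));
	overlappingReduce<<<1, dim3(COLS, ROWS)>>>(d_out);
	cudaMemcpy(h_out, d_out, ROWS * sizeof(float), cudaMemcpyDeviceToHost);
	printf("row 0: %.1f  row 1: %.1f\n", h_out[0], h_out[1]);
	cudaFree(d_out);
	return 0;
}

Both rows end up printing the same sum, somewhere between 32.0 and 64.0 depending on which warp wrote last, instead of the intended 32.0 and 64.0; changing the write to buf[tx + ty * COLS] makes the outputs deterministic.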

There is a coincidental twist here: when

//double w_tmp = fc_kernel[ty]; // changed to
double w_tmp = fc_kernel[0];

is used, every thread in the two rows computes exactly the same value, so the final result also happens to come out correct, much like that meme:
(Figure 3)
So once this indexing is corrected, everything works as expected:

__global__ void SRC1Kernel_MultiChan(short *d_sr1, float *d_hsrc1, complexType *d_ssrc1, double *d_fc)
{
	int bx = blockIdx.x; //unique i
	int tx = threadIdx.x; //unique m
	int ty = threadIdx.y; //unique fc
	int stride = gridDim.x; 
	int stride_SM = blockDim.x * ty;

	__shared__ float add_re[Q * ChanNum];
	__shared__ float add_im[Q * ChanNum];
	__shared__ double fc_kernel[ChanNum];

	if (tx == 0) {
		fc_kernel[ty] = d_fc[ty];
	}
	__syncthreads();


	for (int i = bx; i < OLENGTH - Q; i += stride) {
		int t1 = tx + ((D * i) % I) * Q;
		int t2 = (D * i) / I - tx + (Q - 1);

		double w_tmp = fc_kernel[ty];

		w_tmp  = (w_tmp / d_fs) * t2;
		w_tmp = 2 * d_pi *(w_tmp - (int)w_tmp);

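		// stride_SM = blockDim.x * ty gives each ty row its own Q-slot region
		// of shared memory, so the rows no longer overwrite each other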
		add_re[tx + stride_SM] = d_hsrc1[t1] * d_sr1[t2] * __cosf(w_tmp);
		add_im[tx + stride_SM] = d_hsrc1[t1] * d_sr1[t2] * __sinf(w_tmp) * -1;

		__syncthreads();

		if (tx == 0) {
			for (int j = 1; j < Q; ++j) {
				add_re[0 + stride_SM] += add_re[j + stride_SM];
				add_im[0 + stride_SM] += add_im[j + stride_SM];
			}
		}

		__syncthreads();

		d_ssrc1[i + ty * OLENGTH].re = add_re[0 + stride_SM];
		d_ssrc1[i + ty * OLENGTH].im = add_im[0 + stride_SM];

	}
}
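
The launch configuration is not shown above; as a minimal sketch, assuming the block is laid out as Q threads in x and ChanNum threads in y (which the index stride_SM = blockDim.x * ty relies on) and picking an arbitrary grid size, it could look like this:

	// Minimal launch sketch (the grid size 256 is an assumed value, not from the original code).
	dim3 block(Q, ChanNum);          // blockDim.x == Q, blockDim.y == ChanNum
	dim3 grid(256);                  // gridDim.x becomes the stride of the i loop
	SRC1Kernel_MultiChan<<<grid, block>>>(d_sr1, d_hsrc1, d_ssrc1, d_fc);
	cudaDeviceSynchronize();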
