Map:映射,在特定的位置读取和写入。
Gather:收集,从多个不同的位置读入,写入一个位置。
Scatter:分发,写入多个位置。
第一个quiz答案:Scatter
有点费解,一开始我以为是Map,后来有人提出一个这样的理解,貌似能说的通。先附在下方,留待日后回过头来理解。
MAP:
output[i] = function(input[i]);
SCATTER:
output[ scatterLoc[i] ] = input[i];
GATHER:
output[i] = input[ gatherLoc[i] ];
我自己的理解:map代表一对一,gather代表多对一,scatter代表一对多
第二个quiz答案:5,9,7
3、Transpose:Tasks re-order data elements in memory
AOS: array of structures
SOA:structures of array
4、what kind of communication pattern
第三个quiz的答案:A、E、C、B
对最后一个选B做一下解释,因为如果选stencil的话,就是对每一个元素都进行取平均值的操作,可是,代码的上方有一个if语句,它只对奇数进行这样的操作,所以不满足stencil的定义,故只能是属于scatter。
5、各种通信模式总结:
Map:one to one
Transpose:one to one
gather:many to one
scatter:one to many
stencil:several to one(一种特殊的收集)
reduce:all to one
scan/sor:all to all
6、第四个quiz答案:1、2、4
解答:一个线程块不能在多个SM上执行;不同线程块之间不能合作解决同一个问题,因为无法通信
7、第五个quiz答案:1、2
8、第六个quiz答案:上述均错
9、第七个quiz答案:16!16个线程块以随机的方式执行,根据排列组合,有A16种可能,即16的阶乘。
10、cuda可以保证:
-所有在同一个线程块中的线程运行在同一个SM上。
-在一个内核中的所有块,在下个内核中任意块启动前都已完成。
11、共享内存是在一个块的线程之间共享。
12、第八个quiz答案:全对。
13、第九个quiz答案:3
array[idx] = threadIdx.x下方需要一个同步,保证所有数组元素都赋值完成。
为了实现题目要求,需要对代码进行改写,如下所示
int temp = array[idx+1]; array[idx] = temp;需要在上述两句代码中间和结尾各加一句同步语句,这样才能防止乱序的情况发生。比如a[1]本来是存放a[2]的值,可是a[2]已经存了a[3],这样就会导致a[1]中最后是a[3]的值,这显然是错误的,所以需要同步语句。
14、第十个quiz答案:B
15、第十一个quiz答案:3124,较简单。
16、第十二个quiz答案:1、3、4、5、6,注意存取连续内存即可。
17、第十三个quiz答案:第1、2、4、5可以实现,3会出错。
然后再来看一下各自花的时间排序,我们知道原子操作是更花费时间的,所以最快的肯定是第三项,接下来是第一项,在1000000个元素上进行加一操作,这个比其他原子操作更快,再下来是第四项操作,因为对比第二项,它访问的元素更少,故比第二项快,最慢的是第五项,因为线程数目是其他几项的十倍,所以这个是最慢的。
按时间排序将是2、4、1、3、5
18、避免线程发散
19、第十四个quiz答案:
原始代码如下:
// Homework 2 // Image Blurring // // In this homework we are blurring an image. To do this, imagine that we have // a square array of weight values. For each pixel in the image, imagine that we // overlay this square array of weights on top of the image such that the center // of the weight array is aligned with the current pixel. To compute a blurred // pixel value, we multiply each pair of numbers that line up. In other words, we // multiply each weight with the pixel underneath it. Finally, we add up all of the // multiplied numbers and assign that value to our output for the current pixel. // We repeat this process for all the pixels in the image. // To help get you started, we have included some useful notes here. //**************************************************************************** // For a color image that has multiple channels, we suggest separating // the different color channels so that each color is stored contiguously // instead of being interleaved. This will simplify your code. // That is instead of RGBARGBARGBARGBA... we suggest transforming to three // arrays (as in the previous homework we ignore the alpha channel again): // 1) RRRRRRRR... // 2) GGGGGGGG... // 3) BBBBBBBB... // // The original layout is known an Array of Structures (AoS) whereas the // format we are converting to is known as a Structure of Arrays (SoA). // As a warm-up, we will ask you to write the kernel that performs this // separation. You should then write the "meat" of the assignment, // which is the kernel that performs the actual blur. We provide code that // re-combines your blurred results for each color channel. //**************************************************************************** // You must fill in the gaussian_blur kernel to perform the blurring of the // inputChannel, using the array of weights, and put the result in the outputChannel. // Here is an example of computing a blur, using a weighted average, for a single // pixel in a small image. // // Array of weights: // // 0.0 0.2 0.0 // 0.2 0.2 0.2 // 0.0 0.2 0.0 // // Image (note that we align the array of weights to the center of the box): // // 1 2 5 2 0 3 // ------- // 3 |2 5 1| 6 0 0.0*2 + 0.2*5 + 0.0*1 + // | | // 4 |3 6 2| 1 4 -> 0.2*3 + 0.2*6 + 0.2*2 + -> 3.2 // | | // 0 |4 0 3| 4 2 0.0*4 + 0.2*0 + 0.0*3 // ------- // 9 6 5 0 3 9 // // (1) (2) (3) // // A good starting place is to map each thread to a pixel as you have before. // Then every thread can perform steps 2 and 3 in the diagram above // completely independently of one another. // Note that the array of weights is square, so its height is the same as its width. // We refer to the array of weights as a filter, and we refer to its width with the // variable filterWidth. //**************************************************************************** // Your homework submission will be evaluated based on correctness and speed. // We test each pixel against a reference solution. If any pixel differs by // more than some small threshold value, the system will tell you that your // solution is incorrect, and it will let you try again. // Once you have gotten that working correctly, then you can think about using // shared memory and having the threads cooperate to achieve better performance. //**************************************************************************** // Also note that we've supplied a helpful debugging function called checkCudaErrors. // You should wrap your allocation and copying statements like we've done in the // code we're supplying you. Here is an example of the unsafe way to allocate // memory on the GPU: // // cudaMalloc(&d_red, sizeof(unsigned char) * numRows * numCols); // // Here is an example of the safe way to do the same thing: // // checkCudaErrors(cudaMalloc(&d_red, sizeof(unsigned char) * numRows * numCols)); // // Writing code the safe way requires slightly more typing, but is very helpful for // catching mistakes. If you write code the unsafe way and you make a mistake, then // any subsequent kernels won't compute anything, and it will be hard to figure out // why. Writing code the safe way will inform you as soon as you make a mistake. // Finally, remember to free the memory you allocate at the end of the function. //**************************************************************************** #include "reference_calc.cpp" #include "utils.h" __global__ void gaussian_blur(const unsigned char* const inputChannel, unsigned char* const outputChannel, int numRows, int numCols, const float* const filter, const int filterWidth) { // TODO // NOTE: Be sure to compute any intermediate results in floating point // before storing the final result as unsigned char. // NOTE: Be careful not to try to access memory that is outside the bounds of // the image. You'll want code that performs the following check before accessing // GPU memory: // // if ( absolute_image_position_x >= numCols || // absolute_image_position_y >= numRows ) // { // return; // } // NOTE: If a thread's absolute position 2D position is within the image, but some of // its neighbors are outside the image, then you will need to be extra careful. Instead // of trying to read such a neighbor value from GPU memory (which won't work because // the value is out of bounds), you should explicitly clamp the neighbor values you read // to be within the bounds of the image. If this is not clear to you, then please refer // to sequential reference solution for the exact clamping semantics you should follow. } //This kernel takes in an image represented as a uchar4 and splits //it into three images consisting of only one color channel each __global__ void separateChannels(const uchar4* const inputImageRGBA, int numRows, int numCols, unsigned char* const redChannel, unsigned char* const greenChannel, unsigned char* const blueChannel) { // TODO // // NOTE: Be careful not to try to access memory that is outside the bounds of // the image. You'll want code that performs the following check before accessing // GPU memory: // // if ( absolute_image_position_x >= numCols || // absolute_image_position_y >= numRows ) // { // return; // } } //This kernel takes in three color channels and recombines them //into one image. The alpha channel is set to 255 to represent //that this image has no transparency. __global__ void recombineChannels(const unsigned char* const redChannel, const unsigned char* const greenChannel, const unsigned char* const blueChannel, uchar4* const outputImageRGBA, int numRows, int numCols) { const int2 thread_2D_pos = make_int2( blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y); const int thread_1D_pos = thread_2D_pos.y * numCols + thread_2D_pos.x; //make sure we don't try and access memory outside the image //by having any threads mapped there return early if (thread_2D_pos.x >= numCols || thread_2D_pos.y >= numRows) return; unsigned char red = redChannel[thread_1D_pos]; unsigned char green = greenChannel[thread_1D_pos]; unsigned char blue = blueChannel[thread_1D_pos]; //Alpha should be 255 for no transparency uchar4 outputPixel = make_uchar4(red, green, blue, 255); outputImageRGBA[thread_1D_pos] = outputPixel; } unsigned char *d_red, *d_green, *d_blue; float *d_filter; void allocateMemoryAndCopyToGPU(const size_t numRowsImage, const size_t numColsImage, const float* const h_filter, const size_t filterWidth) { //allocate memory for the three different channels //original checkCudaErrors(cudaMalloc(&d_red, sizeof(unsigned char) * numRowsImage * numColsImage)); checkCudaErrors(cudaMalloc(&d_green, sizeof(unsigned char) * numRowsImage * numColsImage)); checkCudaErrors(cudaMalloc(&d_blue, sizeof(unsigned char) * numRowsImage * numColsImage)); //TODO: //Allocate memory for the filter on the GPU //Use the pointer d_filter that we have already declared for you //You need to allocate memory for the filter with cudaMalloc //be sure to use checkCudaErrors like the above examples to //be able to tell if anything goes wrong //IMPORTANT: Notice that we pass a pointer to a pointer to cudaMalloc //TODO: //Copy the filter on the host (h_filter) to the memory you just allocated //on the GPU. cudaMemcpy(dst, src, numBytes, cudaMemcpyHostToDevice); //Remember to use checkCudaErrors! } void your_gaussian_blur(const uchar4 * const h_inputImageRGBA, uchar4 * const d_inputImageRGBA, uchar4* const d_outputImageRGBA, const size_t numRows, const size_t numCols, unsigned char *d_redBlurred, unsigned char *d_greenBlurred, unsigned char *d_blueBlurred, const int filterWidth) { //TODO: Set reasonable block size (i.e., number of threads per block) const dim3 blockSize; //TODO: //Compute correct grid size (i.e., number of blocks per kernel launch) //from the image size and and block size. const dim3 gridSize; //TODO: Launch a kernel for separating the RGBA image into different color channels // Call cudaDeviceSynchronize(), then call checkCudaErrors() immediately after // launching your kernel to make sure that you didn't make any mistakes. cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError()); //TODO: Call your convolution kernel here 3 times, once for each color channel. // Again, call cudaDeviceSynchronize(), then call checkCudaErrors() immediately after // launching your kernel to make sure that you didn't make any mistakes. cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError()); // Now we recombine your results. We take care of launching this kernel for you. // // NOTE: This kernel launch depends on the gridSize and blockSize variables, // which you must set yourself. recombineChannels<<<gridSize, blockSize>>>(d_redBlurred, d_greenBlurred, d_blueBlurred, d_outputImageRGBA, numRows, numCols); cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError()); } //Free all the memory that we allocated //TODO: make sure you free any arrays that you allocated void cleanup() { checkCudaErrors(cudaFree(d_red)); checkCudaErrors(cudaFree(d_green)); checkCudaErrors(cudaFree(d_blue)); }
// Homework 2 // Image Blurring // // In this homework we are blurring an image. To do this, imagine that we have // a square array of weight values. For each pixel in the image, imagine that we // overlay this square array of weights on top of the image such that the center // of the weight array is aligned with the current pixel. To compute a blurred // pixel value, we multiply each pair of numbers that line up. In other words, we // multiply each weight with the pixel underneath it. Finally, we add up all of the // multiplied numbers and assign that value to our output for the current pixel. // We repeat this process for all the pixels in the image. // To help get you started, we have included some useful notes here. //**************************************************************************** // For a color image that has multiple channels, we suggest separating // the different color channels so that each color is stored contiguously // instead of being interleaved. This will simplify your code. // That is instead of RGBARGBARGBARGBA... we suggest transforming to three // arrays (as in the previous homework we ignore the alpha channel again): // 1) RRRRRRRR... // 2) GGGGGGGG... // 3) BBBBBBBB... // // The original layout is known an Array of Structures (AoS) whereas the // format we are converting to is known as a Structure of Arrays (SoA). // As a warm-up, we will ask you to write the kernel that performs this // separation. You should then write the "meat" of the assignment, // which is the kernel that performs the actual blur. We provide code that // re-combines your blurred results for each color channel. //**************************************************************************** // You must fill in the gaussian_blur kernel to perform the blurring of the // inputChannel, using the array of weights, and put the result in the outputChannel. // Here is an example of computing a blur, using a weighted average, for a single // pixel in a small image. // // Array of weights: // // 0.0 0.2 0.0 // 0.2 0.2 0.2 // 0.0 0.2 0.0 // // Image (note that we align the array of weights to the center of the box): // // 1 2 5 2 0 3 // ------- // 3 |2 5 1| 6 0 0.0*2 + 0.2*5 + 0.0*1 + // | | // 4 |3 6 2| 1 4 -> 0.2*3 + 0.2*6 + 0.2*2 + -> 3.2 // | | // 0 |4 0 3| 4 2 0.0*4 + 0.2*0 + 0.0*3 // ------- // 9 6 5 0 3 9 // // (1) (2) (3) // // A good starting place is to map each thread to a pixel as you have before. // Then every thread can perform steps 2 and 3 in the diagram above // completely independently of one another. // Note that the array of weights is square, so its height is the same as its width. // We refer to the array of weights as a filter, and we refer to its width with the // variable filterWidth. //**************************************************************************** // Your homework submission will be evaluated based on correctness and speed. // We test each pixel against a reference solution. If any pixel differs by // more than some small threshold value, the system will tell you that your // solution is incorrect, and it will let you try again. // Once you have gotten that working correctly, then you can think about using // shared memory and having the threads cooperate to achieve better performance. //**************************************************************************** // Also note that we've supplied a helpful debugging function called checkCudaErrors. // You should wrap your allocation and copying statements like we've done in the // code we're supplying you. Here is an example of the unsafe way to allocate // memory on the GPU: // // cudaMalloc(&d_red, sizeof(unsigned char) * numRows * numCols); // // Here is an example of the safe way to do the same thing: // // checkCudaErrors(cudaMalloc(&d_red, sizeof(unsigned char) * numRows * numCols)); // // Writing code the safe way requires slightly more typing, but is very helpful for // catching mistakes. If you write code the unsafe way and you make a mistake, then // any subsequent kernels won't compute anything, and it will be hard to figure out // why. Writing code the safe way will inform you as soon as you make a mistake. // Finally, remember to free the memory you allocate at the end of the function. //**************************************************************************** #include "reference_calc.cpp" #include "utils.h" __global__ void gaussian_blur(const unsigned char* const inputChannel, unsigned char* const outputChannel, int numRows, int numCols, const float* const filter, const int filterWidth) { // TODO // NOTE: Be sure to compute any intermediate results in floating point // before storing the final result as unsigned char. // NOTE: Be careful not to try to access memory that is outside the bounds of // the image. You'll want code that performs the following check before accessing // GPU memory: // // if ( absolute_image_position_x >= numCols || // absolute_image_position_y >= numRows ) // { // return; // } // NOTE: If a thread's absolute position 2D position is within the image, but some of // its neighbors are outside the image, then you will need to be extra careful. Instead // of trying to read such a neighbor value from GPU memory (which won't work because // the value is out of bounds), you should explicitly clamp the neighbor values you read // to be within the bounds of the image. If this is not clear to you, then please refer // to sequential reference solution for the exact clamping semantics you should follow. int c = blockIdx.x * blockDim.x + threadIdx.x; //cols num int r = blockIdx.y * blockDim.y + threadIdx.y; //rows num if(c >= numCols || r >= numRows) { return; } float result = 0.f; //For every value in the filter around the pixel (c, r) for (int filter_r = -filterWidth/2; filter_r <= filterWidth/2; ++filter_r) { for (int filter_c = -filterWidth/2; filter_c <= filterWidth/2; ++filter_c) { //Find the global image position for this filter position //clamp to boundary of the image int image_r = min(max(r + filter_r, 0), static_cast<int>(numRows - 1)); int image_c = min(max(c + filter_c, 0), static_cast<int>(numCols - 1)); float image_value = static_cast<float>(inputChannel[image_r * numCols + image_c]); float filter_value = filter[(filter_r + filterWidth/2) * filterWidth + filter_c + filterWidth/2]; result += image_value * filter_value; } } outputChannel[r * numCols + c] = result; } //This kernel takes in an image represented as a uchar4 and splits //it into three images consisting of only one color channel each __global__ void separateChannels(const uchar4* const inputImageRGBA, int numRows, int numCols, unsigned char* const redChannel, unsigned char* const greenChannel, unsigned char* const blueChannel) { // TODO // // NOTE: Be careful not to try to access memory that is outside the bounds of // the image. You'll want code that performs the following check before accessing // GPU memory: // // if ( absolute_image_position_x >= numCols || // absolute_image_position_y >= numRows ) // { // return; // } const int2 thread_2D_pos = make_int2( blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y); const int thread_1D_pos = thread_2D_pos.y * numCols + thread_2D_pos.x; if (thread_2D_pos.x >= numCols || thread_2D_pos.y >= numRows) return; uchar4 rgba = inputImageRGBA[thread_1D_pos]; redChannel[thread_1D_pos] = rgba.x; greenChannel[thread_1D_pos] = rgba.y; blueChannel[thread_1D_pos] = rgba.z; } //This kernel takes in three color channels and recombines them //into one image. The alpha channel is set to 255 to represent //that this image has no transparency. __global__ void recombineChannels(const unsigned char* const redChannel, const unsigned char* const greenChannel, const unsigned char* const blueChannel, uchar4* const outputImageRGBA, int numRows, int numCols) { const int2 thread_2D_pos = make_int2( blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y); const int thread_1D_pos = thread_2D_pos.y * numCols + thread_2D_pos.x; //make sure we don't try and access memory outside the image //by having any threads mapped there return early if (thread_2D_pos.x >= numCols || thread_2D_pos.y >= numRows) return; unsigned char red = redChannel[thread_1D_pos]; unsigned char green = greenChannel[thread_1D_pos]; unsigned char blue = blueChannel[thread_1D_pos]; //Alpha should be 255 for no transparency uchar4 outputPixel = make_uchar4(red, green, blue, 255); outputImageRGBA[thread_1D_pos] = outputPixel; } unsigned char *d_red, *d_green, *d_blue; float *d_filter; void allocateMemoryAndCopyToGPU(const size_t numRowsImage, const size_t numColsImage, const float* const h_filter, const size_t filterWidth) { //allocate memory for the three different channels //original checkCudaErrors(cudaMalloc(&d_red, sizeof(unsigned char) * numRowsImage * numColsImage)); checkCudaErrors(cudaMalloc(&d_green, sizeof(unsigned char) * numRowsImage * numColsImage)); checkCudaErrors(cudaMalloc(&d_blue, sizeof(unsigned char) * numRowsImage * numColsImage)); //TODO: //Allocate memory for the filter on the GPU //Use the pointer d_filter that we have already declared for you //You need to allocate memory for the filter with cudaMalloc //be sure to use checkCudaErrors like the above examples to //be able to tell if anything goes wrong //IMPORTANT: Notice that we pass a pointer to a pointer to cudaMalloc checkCudaErrors(cudaMalloc(&d_filter, sizeof(float) * filterWidth * filterWidth)); //TODO: //Copy the filter on the host (h_filter) to the memory you just allocated //on the GPU. cudaMemcpy(dst, src, numBytes, cudaMemcpyHostToDevice); //Remember to use checkCudaErrors! checkCudaErrors(cudaMemcpy(d_filter,h_filter,sizeof(float) * filterWidth * filterWidth,cudaMemcpyHostToDevice)); } void your_gaussian_blur(const uchar4 * const h_inputImageRGBA, uchar4 * const d_inputImageRGBA, uchar4* const d_outputImageRGBA, const size_t numRows, const size_t numCols, unsigned char *d_redBlurred, unsigned char *d_greenBlurred, unsigned char *d_blueBlurred, const int filterWidth) { //TODO: Set reasonable block size (i.e., number of threads per block) const dim3 blockSize(32,32); //TODO: //Compute correct grid size (i.e., number of blocks per kernel launch) //from the image size and and block size. const dim3 gridSize((numCols+31)/32,(numRows+31)/32); //TODO: Launch a kernel for separating the RGBA image into different color channels separateChannels<<<gridSize,blockSize>>>(d_inputImageRGBA,numRows,numCols,d_red,d_green,d_blue); // Call cudaDeviceSynchronize(), then call checkCudaErrors() immediately after // launching your kernel to make sure that you didn't make any mistakes. cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError()); //TODO: Call your convolution kernel here 3 times, once for each color channel. gaussian_blur<<<gridSize,blockSize>>>(d_red,d_redBlurred,numRows,numCols,d_filter,filterWidth); gaussian_blur<<<gridSize,blockSize>>>(d_green,d_greenBlurred,numRows,numCols,d_filter,filterWidth); gaussian_blur<<<gridSize,blockSize>>>(d_blue,d_blueBlurred,numRows,numCols,d_filter,filterWidth); // Again, call cudaDeviceSynchronize(), then call checkCudaErrors() immediately after // launching your kernel to make sure that you didn't make any mistakes. cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError()); // Now we recombine your results. We take care of launching this kernel for you. // // NOTE: This kernel launch depends on the gridSize and blockSize variables, // which you must set yourself. recombineChannels<<<gridSize, blockSize>>>(d_redBlurred, d_greenBlurred, d_blueBlurred, d_outputImageRGBA, numRows, numCols); cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError()); } //Free all the memory that we allocated //TODO: make sure you free any arrays that you allocated void cleanup() { checkCudaErrors(cudaFree(d_red)); checkCudaErrors(cudaFree(d_green)); checkCudaErrors(cudaFree(d_blue)); }