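I have started learning CUDA programming by following the Udacity course. I am writing down some notes and takeaways here so that I can refer back to them later.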
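1. Latency vs Bandwidth
This section introduces latency, which can be understood simply as the time it takes to finish a task.
It also introduces throughput, which can be understood as the amount of work finished per unit of time, for example people per hour, i.e. how many people are delivered each hour.
The answers to the quiz in this section are:
car: latency 22.5 hours; throughput 0.089 people/hour
bus: latency 90 hours; throughput 0.44 people/hour
The GPU is optimized for throughput, not for latency.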
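To make the two definitions concrete, here is a small sketch in plain C++. The trip parameters are assumptions that are not in these notes (a 4500 km trip, a car carrying 2 people at 200 km/h, a bus carrying 40 people at 50 km/h); they were chosen only because they reproduce the quiz answers above. The `Vehicle` struct and both function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical description of a vehicle; the values used with it are assumptions.
struct Vehicle {
    double speed_kmh;   // travel speed in km/h
    double passengers;  // people carried per trip
};

// Latency: time needed to finish one trip, in hours.
double latency_hours(const Vehicle& v, double distance_km) {
    assert(v.speed_kmh > 0.0);
    return distance_km / v.speed_kmh;
}

// Throughput: people delivered per hour of travel.
double throughput_people_per_hour(const Vehicle& v, double distance_km) {
    return v.passengers / latency_hours(v, distance_km);
}
```

With these assumed values, the car has the lower latency (22.5 hours against 90 hours), but the bus has about five times the throughput (roughly 0.44 against 0.089 people per hour). The bus plays the role of the GPU in this comparison.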
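2. What Can GPU Do in CUDA
This section covers what the GPU can do, and its quiz offers five options. The GPU is a co-processor controlled by the CPU, so it cannot initiate a request to receive or send data on its own. That rules out options 1 and 3; the GPU can only respond to requests coming from the CPU. Option 5 is about computing a kernel, which the GPU can do, so it is selected as well. The answer is 2, 4 and 5.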
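3. Copy to Host or Copy to Device
This section explains how to copy data from host to device, from device to host, and from device to device. In code, the direction is chosen by the kind argument of cudaMemcpy: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost or cudaMemcpyDeviceToDevice.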
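4. Unit 1 programming assignment: Problem Set 1
Convert from color to greyscale: the assignment asks us to convert a color image into a greyscale image. I will paste the code first and then go through it line by line.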
// Homework 1
// Color to Greyscale Conversion

//A common way to represent color images is known as RGBA - the color
//is specified by how much Red, Green and Blue is in it.
//The 'A' stands for Alpha and is used for transparency, it will be
//ignored in this homework.

//Each channel Red, Blue, Green and Alpha is represented by one byte.
//Since we are using one byte for each color there are 256 different
//possible values for each color. This means we use 4 bytes per pixel.

//Greyscale images are represented by a single intensity value per pixel
//which is one byte in size.

//To convert an image from color to grayscale one simple method is to
//set the intensity to the average of the RGB channels. But we will
//use a more sophisticated method that takes into account how the eye
//perceives color and weights the channels unequally.

//The eye responds most strongly to green followed by red and then blue.
//The NTSC (National Television System Committee) recommends the following
//formula for color to greyscale conversion:

//I = .299f * R + .587f * G + .114f * B

//Notice the trailing f's on the numbers which indicate that they are
//single precision floating point constants and not double precision
//constants.

//You should fill in the kernel as well as set the block and grid sizes
//so that the entire image is processed.

#include "reference_calc.cpp"
#include "utils.h"
#include <stdio.h>

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
  //Fill in the kernel to convert from color to greyscale
  //the mapping from components of a uchar4 to RGBA is:
  // .x -> R ; .y -> G ; .z -> B ; .w -> A
  //
  //The output (greyImage) at each pixel should be the result of
  //applying the formula: output = .299f * R + .587f * G + .114f * B;
  //Note: We will be ignoring the alpha channel for this conversion

  //First create a mapping from the 2D block and grid locations
  //to an absolute 2D location in the image, then use that to
  //calculate a 1D offset
  int c = (blockIdx.x * blockDim.x) + threadIdx.x;  // column index (x direction)
  int r = (blockIdx.y * blockDim.y) + threadIdx.y;  // row index (y direction)

  // The grid is rounded up, so threads outside the image must do nothing.
  if ((c < numCols) && (r < numRows))
  {
    uchar4 rgba = rgbaImage[r * numCols + c];
    float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
    greyImage[r * numCols + c] = channelSum;
  }
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
  //You must fill in the correct sizes for the blockSize and gridSize
  const dim3 blockSize(16, 16, 1);  // 16 x 16 = 256 threads per block
  const dim3 gridSize((numCols + 15) / 16, (numRows + 15) / 16, 1);  // enough blocks to cover the image
  rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);

  cudaDeviceSynchronize();
  checkCudaErrors(cudaGetLastError());
}

First, look at these two lines:

  const dim3 blockSize(16, 16, 1);
  const dim3 gridSize((numCols + 15) / 16, (numRows + 15) / 16, 1);

They define two variables of type dim3. blockSize is the number of threads in each block, and gridSize is the number of blocks in the grid. Note the order in the launch: the kernel is called as rgba_to_greyscale<<<gridSize, blockSize>>>, with the grid first and the block second. Here blockSize is (16, 16, 1), that is, 256 threads per block. The 16 could also be 32 (32 x 32 = 1024 threads), as long as the divisor in gridSize is changed to match, but it cannot be 64. The block size is fixed precisely to keep the number of threads per block within the hardware limit: current GPUs allow at most 1024 threads per block, and older GPUs only 512. Putting the rounded-up quotient into blockSize instead would make the block grow with the image and exceed this limit for large images.
Next, look at
  gridSize((numCols + 15) / 16, (numRows + 15) / 16, 1)
If you are new to CUDA programming you may wonder why 15 is added. It is there so that integer division does not drop the remainder when the image size is not a multiple of 16. For example, suppose numCols is 31. In integer arithmetic 31 / 16 is 1, which would leave 15 columns without any thread. To round up, we add 15 first: (31 + 15) / 16 = 2. Another easy mistake is mixing up the width and the height of the image. Before I thought it through, I wrote ((numRows + 15) / 16, (numCols + 15) / 16, 1) instead of ((numCols + 15) / 16, (numRows + 15) / 16, 1). That is wrong: numCols is the width of the image, which runs along x, and numRows is the height, which runs along y, so the x component must be computed from numCols and the y component from numRows.
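The +15 rounding trick and the threads-per-block limit can be checked on the host without a GPU. The following is a minimal sketch in plain C++; `Dim3` is a hypothetical stand-in for CUDA's `dim3`, and `ceil_div`, `grid_for_image` and `threads_per_block` are illustrative names that are not part of the assignment.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3, so the sketch compiles without the CUDA toolkit.
struct Dim3 {
    unsigned x, y, z;
};

// Integer division rounded up: the smallest q such that q * b >= a (for b > 0).
// With b = 16 this is exactly (a + 15) / 16.
unsigned ceil_div(unsigned a, unsigned b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Number of blocks needed to cover a numCols x numRows image with tile x tile blocks.
// The x component comes from the columns, the y component from the rows.
Dim3 grid_for_image(unsigned numCols, unsigned numRows, unsigned tile) {
    return Dim3{ceil_div(numCols, tile), ceil_div(numRows, tile), 1};
}

// Total number of threads in one block.
unsigned threads_per_block(const Dim3& block) {
    return block.x * block.y * block.z;
}
```

For a 31-column image, `ceil_div(31, 16)` gives 2 where plain `31 / 16` gives 1. A 16 x 16 block has 256 threads and a 32 x 32 block has 1024, while a 64 x 64 block would have 4096, which is over the 1024-thread limit.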
Now look at the code inside the kernel:

  int c = (blockIdx.x * blockDim.x) + threadIdx.x;
  int r = (blockIdx.y * blockDim.y) + threadIdx.y;
  if ((c < numCols) && (r < numRows))
  {
    uchar4 rgba = rgbaImage[r * numCols + c];
    float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
    greyImage[r * numCols + c] = channelSum;
  }

Start with c, the column index. It is computed from threadIdx.x together with blockIdx.x * blockDim.x, which makes sense once you think about it: the column index is the distance along the x axis. In the same way r, the row index, is computed from threadIdx.y together with blockIdx.y * blockDim.y.
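The condition if ((c < numCols) && (r < numRows)) is also essential. The launch dimensions were deliberately rounded up with (numCols+15)/16 and (numRows+15)/16, so the launch can contain more threads than the image has pixels. The check keeps c and r inside the image, so the extra threads never read or write out of bounds.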
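The effect of the guard can be seen by replaying a launch on the CPU. The sketch below is plain C++, not CUDA: the nested loops stand in for blockIdx and threadIdx, square tile x tile blocks and a rounded-up grid are assumed, and `simulate_launch` is a hypothetical helper written for this note.

```cpp
#include <cstddef>
#include <vector>

// Replays a 2D launch on the CPU. hits[i] counts how many simulated threads
// wrote pixel i. Returns the number of threads that fell outside the image
// and were skipped by the bounds check.
int simulate_launch(int numCols, int numRows, int tile, std::vector<int>& hits) {
    hits.assign(static_cast<std::size_t>(numCols) * numRows, 0);
    const int gridX = (numCols + tile - 1) / tile;  // blocks along x, rounded up
    const int gridY = (numRows + tile - 1) / tile;  // blocks along y, rounded up
    int skipped = 0;
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < tile; ++ty)
                for (int tx = 0; tx < tile; ++tx) {
                    const int c = bx * tile + tx;  // same formula as in the kernel
                    const int r = by * tile + ty;
                    if (c < numCols && r < numRows)
                        ++hits[r * numCols + c];
                    else
                        ++skipped;  // without the guard this would be an out-of-range access
                }
    return skipped;
}
```

For a 31 x 20 image with 16 x 16 blocks, the grid is 2 x 2, so 1024 threads are replayed for 620 pixels. Each pixel is written exactly once and the remaining 404 threads are rejected by the check.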
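Finally, look at r * numCols + c. Readers who are not familiar with image processing may wonder about this, but it is simple. The image is stored row by row, r is the row index and c is the column index. To find how many pixels come before the current one, multiply the number of columns in a row by the row index and then add the current column index. That is what r * numCols + c means.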
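The same indexing and the same weights can be written as an ordinary CPU loop, which is handy for checking individual pixels by hand. This is only a sketch in plain C++: `Rgba` is a hypothetical stand-in for CUDA's `uchar4`, the function names are illustrative, and this is not the course's reference implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's uchar4: .x -> R, .y -> G, .z -> B, .w -> A.
struct Rgba {
    unsigned char x, y, z, w;
};

// Offset of pixel (r, c) in an image stored row by row.
std::size_t pixel_offset(std::size_t r, std::size_t c, std::size_t numCols) {
    return r * numCols + c;
}

// Weighted sum used in the kernel; the alpha channel is ignored.
unsigned char to_grey(const Rgba& p) {
    const float channelSum = .299f * p.x + .587f * p.y + .114f * p.z;
    return static_cast<unsigned char>(channelSum);  // truncates toward zero
}

// CPU version of the conversion: one loop iteration per pixel instead of one thread.
std::vector<unsigned char> rgba_to_greyscale_cpu(const std::vector<Rgba>& image,
                                                 std::size_t numRows, std::size_t numCols) {
    std::vector<unsigned char> grey(numRows * numCols);
    for (std::size_t r = 0; r < numRows; ++r)
        for (std::size_t c = 0; c < numCols; ++c)
            grey[pixel_offset(r, c, numCols)] = to_grey(image[pixel_offset(r, c, numCols)]);
    return grey;
}
```

In an image with 5 columns, the pixel at row 2, column 3 sits at offset 2 * 5 + 3 = 13. A pure green pixel (0, 255, 0) maps to 149, because .587 * 255 is about 149.7 and the conversion to unsigned char truncates.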
The rest is easy to follow, so I will not go over it here.