Udacity cs344 - Introduction to Parallel Programming study notes - Unit 1

I have started learning CUDA programming by following the Udacity course, and I am writing down some notes and takeaways here for future reference.

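1. Latency Vs Bandwidth

This section introduces latency, which can simply be understood as the time a task takes.

It also introduces throughput, which can simply be understood as a rate such as people per hour, that is, how many people are handled in one hour.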

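The answers to the quiz in this section are:

car: latency 22.5 hours; throughput 0.089 people/hour

bus: latency 90 hours; throughput 0.44 people/hour

GPUs are optimized for throughput rather than latency.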
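
The quiz numbers can be reproduced with a few lines of arithmetic. This is only a sketch: trip_stats and TripStats are names made up for illustration, and the trip figures used below (a 4500 km trip, a car carrying 2 people at 200 km/h, a bus carrying 40 people at 50 km/h) are assumptions that are not recorded in these notes; they are used because they give the answers listed above.

```cpp
// Latency and throughput of one vehicle on a fixed-length trip.
// Hypothetical helper; the inputs used with it are assumptions.
struct TripStats {
    double latency_hours;        // time for one trip
    double throughput_per_hour;  // people delivered per hour
};

TripStats trip_stats(double distance_km, double speed_kmh, int passengers) {
    TripStats s;
    s.latency_hours = distance_km / speed_kmh;
    s.throughput_per_hour = passengers / s.latency_hours;
    return s;
}
```

With those assumed figures, trip_stats(4500.0, 200.0, 2) gives 22.5 hours and about 0.089 people per hour, and trip_stats(4500.0, 50.0, 40) gives 90 hours and about 0.44 people per hour. The bus is four times slower per trip but moves about five times as many people per hour, which is the trade-off the GPU makes.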


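2. What Can GPU Do in CUDA

This section covers what the GPU can do, and the quiz lists five options. The GPU acts as a co-processor to the CPU, so it cannot initiate a request to receive or send data; it can only respond to requests from the CPU. That rules out options 1 and 3. Option 5 means computing a kernel, which the GPU can certainly do, so it is selected. The answer is 2, 4 and 5.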


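3. Copy to Host or Copy to Device

This section covers how to copy data from host to device, from device to host, and from device to device. In code, the direction is given by the last argument of cudaMemcpy: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost or cudaMemcpyDeviceToDevice.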


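4. Unit 1 programming assignment: Problem set 1

Convert from color to greyscale: the task is to convert a color image into a greyscale image. I will post the code first and then go through the important lines.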

// Homework 1
// Color to Greyscale Conversion

//A common way to represent color images is known as RGBA - the color
//is specified by how much Red, Green and Blue is in it.
//The 'A' stands for Alpha and is used for transparency, it will be
//ignored in this homework.

//Each channel Red, Blue, Green and Alpha is represented by one byte.
//Since we are using one byte for each color there are 256 different
//possible values for each color.  This means we use 4 bytes per pixel.

//Greyscale images are represented by a single intensity value per pixel
//which is one byte in size.

//To convert an image from color to grayscale one simple method is to
//set the intensity to the average of the RGB channels.  But we will
//use a more sophisticated method that takes into account how the eye 
//perceives color and weights the channels unequally.

//The eye responds most strongly to green followed by red and then blue.
//The NTSC (National Television System Committee) recommends the following
//formula for color to greyscale conversion:

//I = .299f * R + .587f * G + .114f * B

//Notice the trailing f's on the numbers which indicate that they are 
//single precision floating point constants and not double precision
//constants.

//You should fill in the kernel as well as set the block and grid sizes
//so that the entire image is processed.

#include "reference_calc.cpp"
#include "utils.h"
#include <stdio.h>

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
  //TODO
  //Fill in the kernel to convert from color to greyscale
  //the mapping from components of a uchar4 to RGBA is:
  // .x -> R ; .y -> G ; .z -> B ; .w -> A
  //
  //The output (greyImage) at each pixel should be the result of
  //applying the formula: output = .299f * R + .587f * G + .114f * B;
  //Note: We will be ignoring the alpha channel for this conversion

  //First create a mapping from the 2D block and grid locations
  //to an absolute 2D location in the image, then use that to
  //calculate a 1D offset
  int c = (blockIdx.x * blockDim.x) + threadIdx.x;
  int r = (blockIdx.y * blockDim.y) + threadIdx.y;
    
  if ((c < numCols) && (r < numRows))  {
      uchar4 rgba = rgbaImage[r * numCols + c];
      float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
      greyImage[r * numCols + c] = channelSum;
  }
}

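void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
  //You must fill in the correct sizes for the blockSize and gridSize
  //16 x 16 = 256 threads per block, well under the 1024-thread limit
  const dim3 blockSize(16, 16, 1);
  //enough blocks to cover the whole image, rounded up in each dimension
  const dim3 gridSize((numCols+15)/16, (numRows+15)/16, 1);
  rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
  
  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}

Let's look at these two lines first:
  const dim3 blockSize(16, 16, 1);
  const dim3 gridSize((numCols+15)/16, (numRows+15)/16, 1);

They define two variables of type dim3. blockSize is the number of threads in each block, and gridSize is the number of blocks in the grid. Here blockSize is (16, 16, 1), which is 256 threads per block. 32 x 32 would also work, but 64 x 64 would not: the block size is fixed like this to make sure the number of threads per block stays within the limit, which is 1024 on current GPUs and 512 on older ones. The image size therefore has to go into gridSize, not blockSize.

Next, look at

gridSize((numCols+15)/16, (numRows+15)/16, 1)

Anyone new to CUDA programming may wonder why 15 is added. It makes the integer division round up instead of down when the size is not a multiple of 16. For example, if numCols is 31, then 31/16 is 1, which would leave the last 15 columns without any threads. To round up we add 15 first, and (31+15)/16 gives 2. Another easy mistake is mixing up the two image dimensions. At first, before thinking it through, I wrote gridSize((numRows+15)/16, (numCols+15)/16, 1) instead of gridSize((numCols+15)/16, (numRows+15)/16, 1). That is wrong: numCols is the width of the image, which runs along x, and numRows is the height, which runs along y, so the arguments have to be in this order.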
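
The rounding-up trick can be sketched on its own in plain C++. ceil_div, ChunkCount and chunks_for_image are hypothetical names introduced only for this illustration; the only thing taken from the assignment is the (n + 15) / 16 pattern, generalised to any chunk size.

```cpp
// Ceiling division for positive integers: the smallest q with q * b >= a.
unsigned int ceil_div(unsigned int a, unsigned int b) {
    return (a + b - 1) / b;
}

// Hypothetical helper type: how many chunks are needed along each axis.
struct ChunkCount {
    unsigned int x;  // along the columns (image width)
    unsigned int y;  // along the rows (image height)
};

// Number of chunk-sized pieces needed to cover a numRows x numCols image.
ChunkCount chunks_for_image(unsigned int numRows, unsigned int numCols,
                            unsigned int chunk) {
    ChunkCount n;
    n.x = ceil_div(numCols, chunk);
    n.y = ceil_div(numRows, chunk);
    return n;
}
```

ceil_div(31, 16) is 2, while plain 31 / 16 is 1. Note that the x count is derived from numCols and the y count from numRows, which is the ordering discussed above.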


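Now let's look at the code inside the kernel function:

  int c = (blockIdx.x * blockDim.x) + threadIdx.x;
  int r = (blockIdx.y * blockDim.y) + threadIdx.y;
    
  if ((c < numCols) && (r < numRows))  {
      uchar4 rgba = rgbaImage[r * numCols + c];
      float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
      greyImage[r * numCols + c] = channelSum;
  }

Start with c, the column index. It is computed from threadIdx.x together with blockIdx.x * blockDim.x, which makes sense once you think about it: the column index is the distance along the x axis. In the same way, r, the row index, is computed from threadIdx.y together with blockIdx.y * blockDim.y.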

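Next, the condition if ((c < numCols) && (r < numRows)) is essential. The launch dimensions computed with (numCols+15)/16 and (numRows+15)/16 were deliberately rounded up, so the launch can contain threads whose c or r falls outside the image. The check keeps those threads from reading or writing out of bounds.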
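
To see how many threads the check actually filters out, the launch can be emulated on the CPU. This is a rough sketch, not part of the assignment: LaunchCount and count_threads are made-up names, and the loops simply visit every (block, thread) pair and apply the same index arithmetic and guard as the kernel.

```cpp
// Result of emulating a 2D kernel launch on the CPU.
struct LaunchCount {
    long launched;   // threads started by the launch
    long in_bounds;  // threads that pass the bounds check
};

// Visits every thread of a gridX x gridY grid of blockX x blockY blocks.
// Hypothetical helper; only the index arithmetic mirrors the kernel.
LaunchCount count_threads(int numRows, int numCols,
                          int gridX, int gridY, int blockX, int blockY) {
    LaunchCount n = {0, 0};
    for (int by = 0; by < gridY; ++by) {
        for (int bx = 0; bx < gridX; ++bx) {
            for (int ty = 0; ty < blockY; ++ty) {
                for (int tx = 0; tx < blockX; ++tx) {
                    int c = bx * blockX + tx;  // blockIdx.x * blockDim.x + threadIdx.x
                    int r = by * blockY + ty;  // blockIdx.y * blockDim.y + threadIdx.y
                    ++n.launched;
                    if ((c < numCols) && (r < numRows)) {
                        ++n.in_bounds;
                    }
                }
            }
        }
    }
    return n;
}
```

For a 31 x 31 image covered by 2 x 2 blocks of 16 x 16 threads, count_threads(31, 31, 2, 2, 16, 16) reports 1024 launched threads, of which 961 pass the check. The remaining 63 would index outside the image without it.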

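Finally, look at r * numCols + c. Readers who are not familiar with image processing may find this a little puzzling, but it is simple. The image is stored row by row, r is the row index and c is the column index. To find how many pixels are stored before the current one, multiply the number of pixels in a row (numCols) by the row index and add the current column index. That is exactly what r * numCols + c means.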
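
The row-major mapping and its inverse can be written out as a small sketch. The function names flat_index, row_of and col_of are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>

// Row-major offset of pixel (r, c) when each row holds numCols pixels.
std::size_t flat_index(std::size_t r, std::size_t c, std::size_t numCols) {
    return r * numCols + c;
}

// Inverse mapping: recover the row and the column from the offset.
std::size_t row_of(std::size_t index, std::size_t numCols) { return index / numCols; }
std::size_t col_of(std::size_t index, std::size_t numCols) { return index % numCols; }
```

In an image with 4 columns, the pixel at row 2, column 3 has offset 2 * 4 + 3 = 11, and dividing 11 by 4 gives back row 2 with remainder 3.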

The rest is easy to understand, so I won't go over it.
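
As a cross-check of the kernel's arithmetic, the same weighted sum can be run as an ordinary CPU loop. This is a minimal sketch under stated assumptions: Rgba is a hypothetical stand-in for CUDA's uchar4 so the code compiles without the CUDA toolkit, and to_grey and rgba_to_grey_cpu are names made up here, not the reference implementation shipped with the assignment.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's uchar4: x = R, y = G, z = B, w = A.
struct Rgba {
    unsigned char x, y, z, w;
};

// Weighted sum from the assignment; the alpha channel (w) is ignored.
unsigned char to_grey(const Rgba& p) {
    float intensity = .299f * p.x + .587f * p.y + .114f * p.z;
    return static_cast<unsigned char>(intensity);  // truncates toward zero
}

// CPU loop over a row-major image, producing one output byte per pixel.
std::vector<unsigned char> rgba_to_grey_cpu(const std::vector<Rgba>& image,
                                            std::size_t numRows,
                                            std::size_t numCols) {
    std::vector<unsigned char> grey(numRows * numCols);
    for (std::size_t r = 0; r < numRows; ++r) {
        for (std::size_t c = 0; c < numCols; ++c) {
            grey[r * numCols + c] = to_grey(image[r * numCols + c]);
        }
    }
    return grey;
}
```

A pure red pixel (255, 0, 0) comes out as 76, pure green as 149 and pure blue as 29, which shows how unevenly the three channels are weighted.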

