Udacity CS344 - Introduction to Parallel Programming Study Notes - Unit 3

1. First quiz answers: 6 and 21. The question is simple; just count.


2. What is a "reduce" operation?

A reduce operation takes two inputs:

1) a set of input elements

2) a reduction operator, which must be a binary operator and must be associative (a minimal serial sketch follows this list)
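
As a minimal serial sketch (my own illustration, not from the course; the function name and signature are hypothetical), a reduce simply folds the operator over the input, starting from that operator's identity element:

  // Serial reduce: fold a binary associative operator over the input,
  // starting from that operator's identity element.
  float reduce(const float *elements, int n, float identity,
               float (*op)(float, float)) {
    float acc = identity;
    for (int i = 0; i < n; i++)
      acc = op(acc, elements[i]);
    return acc;
  }

Associativity is what lets a parallel version regroup the work, e.g. computing (a + b) + (c + d) instead of ((a + b) + c) + d.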


3. Second quiz answers: multiply, minimum, logical or, bitwise and.


4. Third quiz answer: options 2 and 3 are correct.


5. Fourth quiz answer: (a+b)+(c+d).


6. Fifth quiz answer: log n.


7. Sixth quiz answer: 3x (I didn't fully understand this at first).

According to the solution in the video, take N = 1024. With global memory, the reads across all rounds of the reduction total (1024 + 512 + ... + 1) and the writes total (512 + 256 + ... + 1); with shared memory, only 1024 global reads and 1 global write are needed. In terms of N, the global-memory version needs about 3N global-memory operations while the shared-memory version needs N + 1, roughly a 3x difference. The arithmetic is spelled out below.
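
Spelling out the geometric series (my restatement of the video's argument):

  global memory:  reads  = N + N/2 + ... + 1   = 2N - 1
                  writes = N/2 + N/4 + ... + 1 = N - 1
                  total  = 3N - 2  ≈ 3N

  shared memory:  global reads + writes = N + 1

  ratio ≈ 3N / (N + 1) ≈ 3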


8. The scan operation

A scan computes the running totals of its input. Scan also involves the notion of an identity element: for a given operator, the identity is the element that, combined with any other element, leaves that element unchanged. A small worked example follows.
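
A small worked example (my own, using addition, whose identity is 0):

  input:          [1 2 3 4]
  inclusive scan: [1 3 6 10]   (output i = sum of inputs 0..i)
  exclusive scan: [0 1 3 6]    (output i = sum of inputs 0..i-1)

The identity depends on the operator: 0 for addition, 1 for multiplication, +infinity for minimum.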


9. Seventh quiz answers: 1, 0, 1.


10. Eighth quiz answers:

identity: 0

output: 0, 3, 3, 4, 4, 5


11. Ninth quiz answer:

The code is as follows (a serial exclusive scan):

  for (int i = 0; i < ARRAY_SIZE; i++) {
    out[i] = acc;              // output i gets the sum of all earlier elements
    acc = acc + elements[i];   // then fold in the current element
  }
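
Swapping the two statements in the loop body yields the inclusive scan instead, since each output then includes the current element:

  for (int i = 0; i < ARRAY_SIZE; i++) {
    acc = acc + elements[i];
    out[i] = acc;   // inclusive: output i includes elements[i]
  }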


12. Tenth quiz answer:
steps: O(log n)
work: O(n^2)



13. Eleventh quiz answer:
steps: O(log n)
work: O(n log n)



14. Twelfth quiz answers:
2, 4, 4, 0, 0, 2, 0, 2, 2, 4


15. Hillis-Steele (H-S) algorithm summary:
steps: O(log n)
work: O(n log n)

Blelloch algorithm summary:
steps: O(2 log n)
work: O(2n)

A sketch of a Hillis-Steele scan kernel is given below.
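
Since the homework code below only uses Blelloch, here is a minimal single-block Hillis-Steele inclusive scan for comparison (my own sketch; the kernel name is hypothetical, n must equal blockDim.x, and the launch must supply 2 * n * sizeof(float) of dynamic shared memory for double buffering):

  __global__ void hillis_steele_scan(float *d_out, const float *d_in, int n) {
    extern __shared__ float temp[];     // 2 * n floats, used as a double buffer
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout * n + tid] = d_in[tid];   // load input (inclusive scan)
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
      pout = 1 - pout;                  // swap the read and write buffers
      pin = 1 - pout;
      if (tid >= offset)
        temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
      else
        temp[pout * n + tid] = temp[pin * n + tid];
      __syncthreads();                  // every step touches the whole array
    }
    d_out[tid] = temp[pout * n + tid];
  }

Each of the log n steps performs O(n) additions, which is where the O(n log n) work count comes from.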


16. Thirteenth quiz answer:
H-S, Blelloch, serial


17. Fourteenth quiz answer: exclusive scan


18. Fifteenth quiz answers:
n and n/b; fairly easy.


19. Sixteenth quiz answer:
1000; fairly easy.


20. Seventeenth quiz answer:
no


21. Eighteenth quiz answer:
reduce


22. Nineteenth quiz answers:
256, 8

The homework code below is adapted from someone else's solution, and it is very well written. Step 4 was a bit hard for me, though I did come to understand the Blelloch algorithm itself.
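
For reference, step 4 in the code is a standard two-level scan: the histogram is split into 256-bin chunks and each chunk is Blelloch-scanned by one thread block (scan), which also records each chunk's total in d_sums; a second, single-block scan (scan2) then computes the exclusive scan of those chunk totals; finally, add_scan adds scanned total i to every element of chunk i, stitching the per-chunk scans into one global exclusive scan.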

/* Udacity Homework 3
   HDR Tone-mapping

  Background HDR
  ==============

  A High Dynamic Range (HDR) image contains a wider variation of intensity
  and color than is allowed by the RGB format with 1 byte per channel that we
  have used in the previous assignment.

  To store this extra information we use single precision floating point for
  each channel.  This allows for an extremely wide range of intensity values.

  In the image for this assignment, the inside of a church with light coming in
  through stained glass windows, the raw input floating point values for the
  channels range from 0 to 275.  But the mean is .41 and 98% of the values are
  less than 3!  This means that certain areas (the windows) are extremely bright
  compared to everywhere else.  If we linearly map this [0-275] range into the
  [0-255] range that we have been using then most values will be mapped to zero!
  The only thing we will be able to see are the very brightest areas - the
  windows - everything else will appear pitch black.

  The problem is that although we have cameras capable of recording the wide
  range of intensity that exists in the real world our monitors are not capable
  of displaying them.  Our eyes are also quite capable of observing a much wider
  range of intensities than our image formats / monitors are capable of
  displaying.

  Tone-mapping is a process that transforms the intensities in the image so that
  the brightest values aren't nearly so far away from the mean.  That way when
  we transform the values into [0-255] we can actually see the entire image.
  There are many ways to perform this process and it is as much an art as a
  science - there is no single "right" answer.  In this homework we will
  implement one possible technique.

  Background Chrominance-Luminance
  ================================

  The RGB space that we have been using to represent images can be thought of as
  one possible set of axes spanning a three dimensional space of color.  We
  sometimes choose other axes to represent this space because they make certain
  operations more convenient.

  Another possible way of representing a color image is to separate the color
  information (chromaticity) from the brightness information.  There are
  multiple different methods for doing this - a common one during the analog
  television days was known as Chrominance-Luminance or YUV.

  We choose to represent the image in this way so that we can remap only the
  intensity channel and then recombine the new intensity values with the color
  information to form the final image.

  Old TV signals used to be transmitted in this way so that black & white
  televisions could display the luminance channel while color televisions would
  display all three of the channels.
  

  Tone-mapping
  ============

  In this assignment we are going to transform the luminance channel (actually
  the log of the luminance, but this is unimportant for the parts of the
  algorithm that you will be implementing) by compressing its range to [0, 1].
  To do this we need the cumulative distribution of the luminance values.

  Example
  -------

  input : [2 4 3 3 1 7 4 5 7 0 9 4 3 2]
  min / max / range: 0 / 9 / 9

  histo with 3 bins: [4 7 3]

  cdf : [4 11 14]


  Your task is to calculate this cumulative distribution by following these
  steps.

*/

#include "utils.h"

__global__ void reduce_minimum(float * d_out, const float * const d_in, const size_t numItem) {
  // sdata is allocated in the kernel call: 3rd arg to <<<b, t, shmem>>>
  extern __shared__ float sdata[];

  int myId = threadIdx.x + blockDim.x * blockIdx.x;
  int tid  = threadIdx.x;

  // load shared mem from global mem
  sdata[tid] = FLT_MAX;     // identity element for min
  if (myId < numItem)
    sdata[tid] = d_in[myId];

  __syncthreads();            // make sure entire block is loaded!

  // do reduction in shared mem
  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] = min(sdata[tid], sdata[tid + s]);
    }
    __syncthreads();        // make sure all adds at one stage are done!
  }

  // only thread 0 writes result for this block back to global mem
  if (tid == 0) {
    d_out[blockIdx.x] = sdata[0];
  }
}

__global__ void reduce_maximum(float * d_out, const float * const d_in, const size_t numItem) {
  // sdata is allocated in the kernel call: 3rd arg to <<<b, t, shmem>>>
  extern __shared__ float sdata[];

  int myId = threadIdx.x + blockDim.x * blockIdx.x;
  int tid  = threadIdx.x;

  // load shared mem from global mem
  sdata[tid] = -FLT_MAX;    // identity element for max
  if (myId < numItem)
    sdata[tid] = d_in[myId];

  __syncthreads();            // make sure entire block is loaded!

  // do reduction in shared mem
  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
      sdata[tid] = max(sdata[tid], sdata[tid + s]);
    }
    __syncthreads();        // make sure all adds at one stage are done!
  }

  // only thread 0 writes result for this block back to global mem
  if (tid == 0) {
    d_out[blockIdx.x] = sdata[0];
  }
}

__global__ void histogram(unsigned int *d_bins, const float * const d_in, const size_t numBins, const float min_logLum, const float range, const size_t numRows, const size_t numCols) {
  
  int myId = threadIdx.x + blockDim.x * blockIdx.x;
  if (myId >= (numRows * numCols))
    return;

  float myItem = d_in[myId];
  // clamp so that myItem == max_logLum maps into the last bin instead of overflowing
  int myBin = min(static_cast<int>((myItem - min_logLum) / range * numBins),
                  static_cast<int>(numBins) - 1);
  atomicAdd(&(d_bins[myBin]), 1);
}

__global__ void scan(unsigned int *d_out, unsigned int *d_sums, const unsigned int * const d_in, const unsigned int numBins, const unsigned int numElems)  {

  extern __shared__ unsigned int sdata[];   // holds unsigned int counts, not floats
  int myId = blockIdx.x * blockDim.x + threadIdx.x;
  int tid = threadIdx.x;
  int offset = 1;

  // load two items per thread into shared memory
  if ((2 * myId) < numBins) {
    sdata[2 * tid] = d_in[2 * myId];
  }
  else {
    sdata[2 * tid] = 0;
  }
  
  if ((2 * myId + 1) < numBins) {
    sdata[2 * tid + 1] = d_in[2 * myId + 1];
  }
  else {
    sdata[2 * tid + 1] = 0;
  }

  // Reduce (up-sweep) phase
  for (unsigned int d = numElems >> 1; d > 0; d >>= 1) {
    if (tid < d)  {
      int ai = offset * (2 * tid + 1) - 1;
      int bi = offset * (2 * tid + 2) - 1;
      sdata[bi] += sdata[ai];
    }
    offset *= 2;
    __syncthreads();
  }
    
  // clear the last element
  if (tid == 0) {
    d_sums[blockIdx.x] = sdata[numElems - 1];
    sdata[numElems - 1] = 0;
  }
  
  // Down Sweep
  for (unsigned int d = 1; d < numElems; d *= 2) {
    offset >>= 1;
    if (tid < d) {
      int ai = offset * (2 * tid + 1) - 1;
      int bi = offset * (2 * tid + 2) - 1;
      unsigned int t = sdata[ai];
      sdata[ai] = sdata[bi];
      sdata[bi] += t;
    }
    __syncthreads();
  }
 
  // write the output to global memory
  if ((2 * myId) < numBins) {
    d_out[2 * myId] = sdata[2 * tid];
  }
  if ((2 * myId + 1) < numBins) {
    d_out[2 * myId + 1] = sdata[2 * tid + 1];
  }
}

// This version only works with a single block! The number of items to scan
// must be at most 2 * blockDim.x, since each thread loads and scans two elements.
__global__ void scan2(unsigned int *d_out, const unsigned int * const d_in, const unsigned int numBins, const unsigned int numElems)  {

  extern __shared__ unsigned int sdata[];   // holds unsigned int counts, not floats
  int tid = threadIdx.x;
  int offset = 1;

  // load two items per thread into shared memory
  if ((2 * tid) < numBins) {
    sdata[2 * tid] = d_in[2 * tid];  
  }
  else {
    sdata[2 * tid] = 0;
  }

  if ((2 * tid + 1) < numBins) {
    sdata[2 * tid + 1] = d_in[2 * tid + 1];  
  }
  else {
    sdata[2 * tid + 1] = 0;
  }

  // Reduce (up-sweep) phase
  for (unsigned int d = numElems >> 1; d > 0; d >>= 1) {
    if (tid < d)  {
      int ai = offset * (2 * tid + 1) - 1;
      int bi = offset * (2 * tid + 2) - 1;
      sdata[bi] += sdata[ai];
    }
    offset *= 2;
    __syncthreads();
  }
    
  // clear the last element
  if (tid == 0) {
    sdata[numElems - 1] = 0;
  }
  
  // Down Sweep
  for (unsigned int d = 1; d < numElems; d *= 2) {
    offset >>= 1;
    if (tid < d) {
      int ai = offset * (2 * tid + 1) - 1;
      int bi = offset * (2 * tid + 2) - 1;
      unsigned int t = sdata[ai];
      sdata[ai] = sdata[bi];
      sdata[bi] += t;
    }
    __syncthreads();
  }
 
  // write the output to global memory
  if ((2 * tid) < numBins) {
    d_out[2 * tid] = sdata[2 * tid];
  }

  if ((2 * tid + 1) < numBins) {
    d_out[2 * tid + 1] = sdata[2 * tid + 1];
  }
}

__global__ void add_scan(unsigned int *d_out, const unsigned int * const d_in, const unsigned int numBins) {

  if (blockIdx.x == 0)
    return;

  int myId = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int myOffset = d_in[blockIdx.x];

  if ((2 * myId) < numBins) {
    d_out[2 * myId] += myOffset;
  }
  if ((2 * myId + 1) < numBins) {
    d_out[2 * myId + 1] += myOffset;
  }

}

void your_histogram_and_prefixsum(const float* const d_logLuminance,
                                  unsigned int* const d_cdf,
                                  float &min_logLum,
                                  float &max_logLum,
                                  const size_t numRows,
                                  const size_t numCols,
                                  const size_t numBins)
{
  //TODO
  /*Here are the steps you need to implement
    1) find the minimum and maximum value in the input logLuminance channel
       store in min_logLum and max_logLum
    2) subtract them to find the range
    3) generate a histogram of all the values in the logLuminance channel using
       the formula: bin = (lum[i] - lumMin) / lumRange * numBins
    4) Perform an exclusive scan (prefix sum) on the histogram to get
       the cumulative distribution of luminance values (this should go in the
       incoming d_cdf pointer which already has been allocated for you)       */

  // Initialization
  unsigned int numItem = numRows * numCols;
  dim3 blockSize(256, 1, 1);
  dim3 gridSize(numItem / blockSize.x + 1, 1, 1);
    
  float * d_inter_min;
  float * d_inter_max;
  unsigned int * d_histogram;
  unsigned int * d_sums;
  unsigned int * d_incr;

  checkCudaErrors(cudaMalloc(&d_inter_min, sizeof(float) * gridSize.x));
  checkCudaErrors(cudaMalloc(&d_inter_max, sizeof(float) * gridSize.x));
  checkCudaErrors(cudaMalloc(&d_histogram, sizeof(unsigned int) * numBins));
  checkCudaErrors(cudaMemset(d_histogram, 0, sizeof(unsigned int) * numBins));
     
  // Step 1: Reduce (min and max). It could be done in one step only!
  reduce_minimum<<<gridSize, blockSize, sizeof(float) * blockSize.x>>>(d_inter_min, d_logLuminance, numItem);
  reduce_maximum<<<gridSize, blockSize, sizeof(float) * blockSize.x>>>(d_inter_max, d_logLuminance, numItem);
  numItem = gridSize.x;
  gridSize.x = numItem / blockSize.x + 1;

  // NOTE: the later rounds reduce d_inter_min / d_inter_max in place. This
  // relies on every block loading its inputs before another block overwrites
  // them, which CUDA does not guarantee; ping-pong buffers would be safer.
  while (numItem > 1) {
    reduce_minimum<<<gridSize, blockSize, sizeof(float) * blockSize.x>>>(d_inter_min, d_inter_min, numItem);
    reduce_maximum<<<gridSize, blockSize, sizeof(float) * blockSize.x>>>(d_inter_max, d_inter_max, numItem);
    numItem = gridSize.x;
    gridSize.x = numItem / blockSize.x + 1;
  }

  // Step 2: Range
  checkCudaErrors(cudaMemcpy(&min_logLum, d_inter_min, sizeof(float), cudaMemcpyDeviceToHost));
  checkCudaErrors(cudaMemcpy(&max_logLum, d_inter_max, sizeof(float), cudaMemcpyDeviceToHost));

  float range = max_logLum - min_logLum;

  // Step 3: Histogram
  gridSize.x = numRows * numCols / blockSize.x + 1;
  histogram<<<gridSize, blockSize>>>(d_histogram, d_logLuminance, numBins, min_logLum, range, numRows, numCols);

  // Step 4: Exclusive scan - Blelloch
  unsigned int numElems = 256;
  blockSize.x = numElems / 2;
  gridSize.x = numBins / numElems;
  if (numBins % numElems != 0)
    gridSize.x++;
  checkCudaErrors(cudaMalloc(&d_sums, sizeof(unsigned int) * gridSize.x));
  checkCudaErrors(cudaMemset(d_sums, 0, sizeof(unsigned int) * gridSize.x));

  // First-level scan to obtain the scanned blocks
  scan<<<gridSize, blockSize, sizeof(unsigned int) * numElems>>>(d_cdf, d_sums, d_histogram, numBins, numElems);

  // Second-level scan to obtain the scanned blocks sums
  numElems = gridSize.x;

  // Round numElems up to the next power of 2 (32-bit bit-smearing trick)
  unsigned int nextPow = numElems;
  nextPow--;
  nextPow = (nextPow >> 1) | nextPow;
  nextPow = (nextPow >> 2) | nextPow;
  nextPow = (nextPow >> 4) | nextPow;
  nextPow = (nextPow >> 8) | nextPow;
  nextPow = (nextPow >> 16) | nextPow;
  nextPow++;

  blockSize.x = nextPow / 2;
  gridSize.x = 1;
  checkCudaErrors(cudaMalloc(&d_incr, sizeof(unsigned int) * numElems));
  checkCudaErrors(cudaMemset(d_incr, 0, sizeof(unsigned int) * numElems));
  scan2<<<gridSize, blockSize, sizeof(unsigned int) * nextPow>>>(d_incr, d_sums, numElems, nextPow);

  // Add scanned block sum i to all values of scanned block i
  numElems = 256;
  blockSize.x = numElems / 2;
  gridSize.x = numBins / numElems;
  if (numBins % numElems != 0)
    gridSize.x++;
  add_scan<<<gridSize, blockSize>>>(d_cdf, d_incr, numBins);

  // Clean memory
  checkCudaErrors(cudaFree(d_inter_min));
  checkCudaErrors(cudaFree(d_inter_max));
  checkCudaErrors(cudaFree(d_histogram));
  checkCudaErrors(cudaFree(d_sums));
  checkCudaErrors(cudaFree(d_incr));
}



