CUDA Learning Series (4): Parallel Task Types and Memory Allocation

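This is the fourth installment of the CUDA learning series. It first introduces the main parallel communication patterns (map, gather, scatter, stencil, transpose), then reviews the CUDA memory model and discusses, at a high level, how to write efficient code, and finally covers control flow and one of its important components, atomic operations. Reference: Udacity CS344.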

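(1). Parallel Communication Patterns

In an earlier installment, CUDA Learning Series (2): CUDA memory & variables, we introduced the different types of memory and variables. In this installment we classify tasks by how they map threads to memory: map, gather, scatter, stencil, and transpose.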

1.1 Map, Gather, Scatter

  • Map: one input - one output
  • Gather: several inputs - one output
    e.g. image blur by averaging
  • Scatter: one input - several outputs
    e.g. add a value to its neighbors
    (so named because each thread scatters its result to several memory locations)

[Figure: schematic of Map, Gather, and Scatter]
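The three patterns above can be sketched as three small kernels. This is a minimal illustration, not code from the course: the kernel names, the `PatternKernel` typedef, and the `run_pattern` host helper are all made up for this example, and the scatter kernel uses `atomicAdd` because several threads may write to the same output element.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Map: each thread reads one input element and writes one output element.
__global__ void map_square(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

// Gather: each thread reads several inputs (itself and its two neighbors,
// clamped at the edges) and writes exactly one output.
__global__ void gather_blur3(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int l = max(i - 1, 0);
    int r = min(i + 1, n - 1);
    out[i] = (in[l] + in[i] + in[r]) / 3.0f;
}

// Scatter: each thread reads one input and adds it to several outputs.
// Different threads can hit the same output, hence atomicAdd.
__global__ void scatter_add_to_neighbors(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i > 0)     atomicAdd(&out[i - 1], in[i]);
    if (i < n - 1) atomicAdd(&out[i + 1], in[i]);
}

// Hypothetical host helper: copies h_in to the GPU, runs one of the kernels
// above on a zero-initialized output array, and returns the output.
typedef void (*PatternKernel)(const float*, float*, int);

std::vector<float> run_pattern(PatternKernel kernel, const std::vector<float>& h_in) {
    int n = static_cast<int>(h_in.size());
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);  // scatter accumulates into zeros
    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    kernel<<<blocks, threads>>>(d_in, d_out, n);
    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

For example, `run_pattern(scatter_add_to_neighbors, {1, 2, 3})` adds each input to its left and right neighbors in the output. Error checking is omitted to keep the sketch short.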

1.2 Stencil, Transpose

  • Stencil: for every position in the input,
    stencil input: the neighborhood of that point
    stencil output: the value at that point
    e.g. image blur by averaging
    This also shows that stencil closely resembles gather. In fact, stencil is a special case of gather with two extra requirements: the inputs must form a neighborhood, and the operation must be applied to every element of the input.
    Illustrations:

    1. 2D stencil (two example shapes):
      [Figures: two 2D stencil shapes]
    2. 3D stencil:
      [Figure: 3D stencil]
  • Transpose
    input: matrix M
    output: M^T
    Illustrations:

    1. Matrix transpose
      [Figure: matrix transpose]

    2. Transpose represented as a vector
      [Figure: transpose of a matrix stored as a vector]
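Both patterns can be written as short 2D kernels. The sketch below assumes a row-major matrix stored as a flat vector (the "vector" view of transpose above). The kernel names, the `Kernel2D` typedef, and the `run_2d` host helper are hypothetical names chosen for this example; the stencil clamps at the borders so that every element is processed, as the definition requires.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Stencil: every element becomes the average of its 5-point neighborhood
// (itself plus up/down/left/right, with indices clamped at the borders).
__global__ void stencil_5pt_average(const float* in, float* out, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= rows || col >= cols) return;
    int up = max(row - 1, 0), down = min(row + 1, rows - 1);
    int left = max(col - 1, 0), right = min(col + 1, cols - 1);
    out[row * cols + col] = (in[row * cols + col] + in[up * cols + col] +
                             in[down * cols + col] + in[row * cols + left] +
                             in[row * cols + right]) / 5.0f;
}

// Transpose: element (row, col) of a rows x cols row-major matrix is written
// to position (col, row) of the cols x rows result.
__global__ void transpose_naive(const float* in, float* out, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        out[col * rows + row] = in[row * cols + col];
    }
}

// Hypothetical host helper: runs one of the kernels above on a rows x cols matrix.
typedef void (*Kernel2D)(const float*, float*, int, int);

std::vector<float> run_2d(Kernel2D kernel, const std::vector<float>& h_in, int rows, int cols) {
    size_t bytes = h_in.size() * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    dim3 threads(16, 16);
    dim3 blocks((cols + 15) / 16, (rows + 15) / 16);
    kernel<<<blocks, threads>>>(d_in, d_out, rows, cols);
    std::vector<float> h_out(h_in.size());
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Note that the naive transpose reads rows with adjacent threads touching adjacent addresses, but its writes are `rows` elements apart, so one side of the copy is always strided.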

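Exercise
Q:
In the quiz figure, label each statement underlined in blue as map/gather/scatter/stencil/transpose:
[Figure: quiz code]

A: The four statements are labeled A, E, C, B, in that order.
I got the last one wrong by choosing both B and D. Why not D? Recall the definition of stencil: an averaging stencil must compute the average at every position, but the statement in the quiz is guarded by the condition if(i%2).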

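So what should we pay attention to for each parallel communication pattern?
1. How can threads access memory efficiently? How can data be reused?
2. How can threads exchange partial results with each other (through shared memory)? Is that safe?

In the next section we first review the memory model covered earlier, then work through concrete problems to show how to program with it.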

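(2). Programming Model and Memory Model

In parts 1 and 3 we discussed how SMs relate to grids, blocks, and threads: each kernel launch can use its own thread organization (gridDim, blockDim, grid shape, block shape), so different kernels may run with different grid and block configurations.
[Figure: grid, block, and thread organization]

As covered in part 1, different GPUs have different numbers of hardware SMs (streaming multiprocessors). The GPU is responsible for assigning blocks to SMs, and all SMs run independently and in parallel.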

2.1 Memory model

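In part 2 we covered the different kinds of memory. Let's first review the memory model.

[Figure: CUDA memory model]

Every thread can access:
1. local memory, private to that thread
2. shared memory, shared by the threads of its block
3. global memory, shared by all threads on the GPU (including threads on different SMs)

As a review, here are two quizzes.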

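Quiz 1:
[Figure: quiz 1]

Ans: A, B, and D.
Explanation: by definition, a block runs on exactly one SM, and threads from different blocks cannot cooperate, even when those blocks run on the same SM.

Quiz 2:
[Figure: quiz 2]

Ans: none of them.
Explanation: the programmer cannot control when or in what order blocks execute, and assigning blocks to SMs is done by the GPU, not specified by the programmer.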

2.2 Memory in Program

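How to write efficient programs, at a high level
  1. Maximize arithmetic intensity
    arithmetic intensity = compute operations / memory accessed
    That is, maximize the computation done per thread and minimize the memory accessed per thread (the real goal is to minimize the time spent on memory access).
    How: keep frequently accessed data in memory that is fast to access (part 2 describes the GPU's memories at the hardware level). For the local, shared, and global memory just discussed, access speed ranks as
    local > shared >> global >> CPU memory
    So if a kernel reads a value in global memory many times, it can first copy that value into a shared memory variable and then access the shared copy instead.

  2. Minimize memory access stride
    As the coalesced memory access figure shows:
    [Figure: coalesced, strided, and random memory access]

    When the threads of a GPU access adjacent memory locations, the access is called coalesced. When the threads access memory a fixed step apart (hopping along), it is called strided. Access with no regular pattern is called random. Access speed ranks as
    coalesced > strided > random

  3. Avoid thread divergence
    This was covered in the previous two parts.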
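The first two guidelines can be combined in one kernel. The sketch below is an illustrative example, not code from the course: each output is the sum of a 5-element window, so every input value is needed by up to five threads. Each block therefore loads its tile once from global memory with a coalesced read, keeps it in shared memory, and does all the repeated reads from there. The names `BLOCK`, `RADIUS`, `window_sum_shared`, and `window_sum_on_gpu` are made up for this example, and the kernel assumes it is launched with exactly `BLOCK` threads per block.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK  128   // threads per block (the kernel assumes blockDim.x == BLOCK)
#define RADIUS 2     // window covers indices g-RADIUS .. g+RADIUS

__global__ void window_sum_shared(const int* in, int* out, int n) {
    __shared__ int tile[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * BLOCK + threadIdx.x;  // global index
    int s = threadIdx.x + RADIUS;              // index inside the tile

    // Coalesced load: consecutive threads read consecutive addresses.
    tile[s] = (g < n) ? in[g] : 0;
    // The first RADIUS threads also load the left and right halo cells.
    if (threadIdx.x < RADIUS) {
        int left = g - RADIUS;
        int right = g + BLOCK;
        tile[s - RADIUS] = (left >= 0) ? in[left] : 0;
        tile[s + BLOCK] = (right < n) ? in[right] : 0;
    }
    __syncthreads();  // the tile must be complete before any thread reads it

    if (g < n) {
        int sum = 0;
        for (int k = -RADIUS; k <= RADIUS; ++k) sum += tile[s + k];  // shared-memory reads
        out[g] = sum;
    }
}

// Hypothetical host helper: out[g] = sum of in[g-2 .. g+2], zero-padded at the ends.
std::vector<int> window_sum_on_gpu(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    size_t bytes = n * sizeof(int);
    int *d_in, *d_out;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    window_sum_shared<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Without the shared tile, each thread would issue five global reads instead of roughly one, so the same arithmetic would be paid for with about five times as much global memory traffic.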

Exercise:

Rank the statements on lines 5, 6, 7, and 8 of the code below by execution speed (1 = fastest, 4 = slowest):

```cuda
__global__ void f(float* x, float* y, float* z) {  // line 1
    float s, t, u;                                  // line 2
    __shared__ float a, b, c;                       // line 3
    // ...                                          // line 4
    s = *x;                                         // line 5
    t = s;                                          // line 6
    a = b;                                          // line 7
    *y = *z;                                        // line 8
}                                                   // line 9
```

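Ans: the speed ranks of lines 5, 6, 7, and 8 are 3, 1, 2, and 4. Line 6 copies between local variables, line 7 touches only shared memory, line 5 reads global memory once, and line 8 both reads and writes global memory.

In the next section we look at control flow and synchronization in concrete programming problems.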

(3). Control Flow and Synchronization

3.1 Program execution order

Before discussing control flow, let's look at an example that tests the order in which different blocks run.

Demo code:

```cuda
#include <stdio.h>
#define Num_block 16
#define Num_thread 1

__global__ void print() {
    printf("Num: %d\n", blockIdx.x);
}

int main() {
    // launch the kernel
    print<<<Num_block, Num_thread>>>();
    // Wait for the kernel to finish. This also flushes the device-side
    // printf() output, which otherwise would not be displayed.
    cudaDeviceSynchronize();
    return 0;
}
```

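Compile command (sm_21 targets compute capability 2.1 GPUs; substitute the architecture of your own GPU):
nvcc -arch=sm_21 -I ~/NVIDIA_GPU_Computing_SDK/C/common/inc print.cu

Output of two runs:

[Figure: output of two runs, with the block numbers printed in different orders]

The output differs from run to run, which shows that the execution order of different blocks cannot be controlled, just as the quiz answer above said. So what if we want to synchronize threads?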

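3.2 Synchronization

In part 2 we introduced the synchronization function __syncthreads() in an example. It sets a barrier: every thread of a block that reaches the barrier stops and waits until all threads of the block have reached it. That raises a question.

Exercise:
Consider a program that moves the element at each position i to position i-1. How many __syncthreads() calls does it need?
E.g. the kernel contains the following: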

```cuda
// ...
int idx = threadIdx.x;
__shared__ int array[128];
array[idx] = idx;
if (idx < 127) {
    array[idx] = array[idx + 1];  // move element i+1 into position i
}
// ...
```

Ans: three.

```cuda
// ...
int idx = threadIdx.x;
__shared__ int array[128];
array[idx] = idx;
__syncthreads();  // without this, array may be read before every element is initialized
int tmp = 0;
if (idx < 127) {
    tmp = array[idx + 1];
}
__syncthreads();  // without this, a thread may overwrite array[idx] before its neighbor has read it
if (idx < 127) {
    array[idx] = tmp;
}
__syncthreads();  // without this, later code is not guaranteed to see the updated values
// ...
```
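For reference, the exercise can be turned into a complete kernel as sketched below. The names `shift_left` and `shift_left_on_gpu` are invented for this example. The three barriers sit outside the `if` branches so that all 128 threads reach each one; calling `__syncthreads()` in a branch that only some threads of the block take is not safe.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One block of 128 threads: fill array[i] = i, then shift every element left by one.
__global__ void shift_left(int* out) {
    __shared__ int array[128];
    int idx = threadIdx.x;

    array[idx] = idx;
    __syncthreads();                  // barrier 1: all elements are initialized

    int tmp = 0;
    if (idx < 127) tmp = array[idx + 1];
    __syncthreads();                  // barrier 2: all reads finish before any write

    if (idx < 127) array[idx] = tmp;
    __syncthreads();                  // barrier 3: all writes are visible to the block

    out[idx] = array[idx];            // copy the result out so the host can inspect it
}

// Hypothetical host helper: launches the kernel and returns the 128 results.
std::vector<int> shift_left_on_gpu() {
    const int n = 128;
    int* d_out;
    cudaMalloc((void**)&d_out, n * sizeof(int));
    shift_left<<<1, n>>>(d_out);
    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

After the shift, position i holds i+1 for i < 127, and the last position keeps its original value 127.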

Quiz: can the following program produce collisions (data races)? If so, where?

```cuda
__global__ void f() {                  // line 1
    __shared__ int s[1024];            // line 2
    int i = threadIdx.x;               // line 3
    __syncthreads();                   // line 4
    s[i] = s[i-1];                     // line 5
    __syncthreads();                   // line 6
    if (i % 2) s[i] = s[i-1];          // line 7
    __syncthreads();                   // line 8
    s[i] = (s[i-1] + s[i+1]) / 2;      // line 9
    printf("%d\n", s[i]);              // line 10
}                                      // line 11
```

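Ans: collisions occur on
1. Line 5. As in the previous exercise, it should be int tmp = s[i-1]; __syncthreads(); s[i] = tmp;
2. Line 9, for the same reason.
PS: line 7 is fine. Only odd-numbered threads write, each to its own odd index, and they read only even indices, which no thread writes on that line.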

3.3 Atomic Memory Operation

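This section introduces atomic operations.
First consider a problem: use 1,000,000 threads to increment the elements of a 10-element array, so that each element should end up incremented 100,000 times. Try writing this yourself first; it is simple. Following the approach we have used so far, we get the code below.

Note: gputimer.h can be downloaded from my resources page.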

```cuda
#include <stdio.h>
#include "gputimer.h"
using namespace Gadgetron;


#define NUM_THREADS 1000000
#define ARRAY_SIZE  10
#define BLOCK_WIDTH 1000
void print_array(int *array, int size)
{
    printf("{ ");
    for (int i = 0; i<size; i++)  { printf("%d ", array[i]); }
    printf("}\n");
}
__global__ void increment_naive(int *g)
{
     // which thread is this?
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
     i = i % ARRAY_SIZE;
     g[i] = g[i] + 1;
}
__global__ void increment_atomic(int *g)
{
     // which thread is this?
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
     i = i % ARRAY_SIZE;
     atomicAdd(&g[i], 1);
}
int main(int argc,char **argv)
{
    GPUTimer timer;
    printf("%d total threads in %d blocks writing into %d array elements\n",
           NUM_THREADS, NUM_THREADS / BLOCK_WIDTH, ARRAY_SIZE);
    // declare and allocate host memory
    int h_array[ARRAY_SIZE];
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);

    // declare, allocate, and zero out GPU memory
    int * d_array;
    cudaMalloc((void **) &d_array, ARRAY_BYTES);
    cudaMemset((void *) d_array, 0, ARRAY_BYTES);
    // launch the kernel - comment out one of these
    timer.start();
    //increment_atomic<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    increment_naive<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    timer.stop();

    // copy back the array of sums from GPU and print
    cudaMemcpy(h_array, d_array, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    print_array(h_array, ARRAY_SIZE);

    // free GPU memory allocation and exit
    cudaFree(d_array);
    return 0;
}
```

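Output of two runs:

[Figure: output of two runs of increment_naive, showing the read-modify-write error]

Every element of the result is 648 or 647, far from the expected 100,000. Why?

Look at the kernel: each thread executes g[i] = g[i] + 1, which is a read-modify-write operation. Many threads read the same value of g[i] before any of them writes back, so slower threads overwrite the results written by faster threads and most increments are lost. How do we fix this? We introduce atomic operations and change the kernel above to: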

```cuda
__global__ void increment_atomic(int *g)
{
     // which thread is this?
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
     i = i % ARRAY_SIZE;
     atomicAdd(&g[i], 1);
}
```

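This gives the result:
[Figure: output of increment_atomic]

The result is now correct. What mechanism does atomicAdd rely on? Atomic operations use special hardware built into the GPU, which guarantees atomicity: at any moment only one thread can perform a read-modify-write operation on a given memory location.

Atomic operations have limitations:
1. Only certain operations and data types are supported (limited functionality)
2. There are still no ordering constraints (execution order remains undefined)
3. Access to memory is serialized (so they are slow)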
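The first limitation has a common workaround: an operation that has no built-in atomic can be built from a compare-and-swap loop with `atomicCAS`. The sketch below shows the idea for integer multiplication. `atomic_mul`, `product_kernel`, and `product_on_gpu` are names invented for this example, and the other two limitations still apply: the updates happen in no particular order (harmless here because multiplication is commutative) and they are serialized on a single address.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: atomically performs *address *= val and returns the old value.
// The loop retries whenever another thread changed *address between our read
// and our compare-and-swap.
__device__ int atomic_mul(int* address, int val) {
    int old = *address;
    int assumed;
    do {
        assumed = old;
        old = atomicCAS(address, assumed, assumed * val);
    } while (assumed != old);
    return old;
}

// Every thread multiplies its element into the single shared result.
__global__ void product_kernel(const int* in, int* result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomic_mul(result, in[i]);
}

// Hypothetical host helper: returns the product of all elements of h_in.
int product_on_gpu(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    int *d_in, *d_result;
    int h_result = 1;  // multiplicative identity
    cudaMalloc((void**)&d_in, n * sizeof(int));
    cudaMalloc((void**)&d_result, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, &h_result, sizeof(int), cudaMemcpyHostToDevice);
    int threads = 128;
    product_kernel<<<(n + threads - 1) / threads, threads>>>(d_in, d_result, n);
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_result);
    return h_result;
}
```

Because every thread contends for the same address, this is only a demonstration of the technique; a real reduction would combine values within each block first and issue far fewer atomic operations.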

(4). Summary

This lesson covered the following:

  • communication patterns

    • map
    • gather
    • scatter
    • stencil
    • transpose
  • gpu hardware & programming model

    • SMs, threads, blocks ordering
    • synchronization
    • Memory model - local, global, shared memory
  • efficient GPU programming

    • coalesced memory access
    • faster memory for commonly used variables

OK, that wraps up lesson 3. I will upload the exercises in a couple of days, so stay tuned!


from: http://blog.csdn.net/abcjennifer/article/details/43374009
