This article gives a brief introduction to CUDA programming. References:
1. CUDA programming documentation on the NVIDIA website
2. CUDA Programming Guide 5.0, Chinese edition, translated by 风辰
3. Analysis of the CUDA thread execution model
CUDA programming lets a program run on a heterogeneous system, i.e. a CPU plus a GPU (acting as a coprocessor to the host); each has its own memory space, and the two are separated by the PCIe bus.
In code, the prefix h_ conventionally denotes host memory and d_ denotes device memory.
The kernel is the key concept in CUDA programming: it is the code that runs on the GPU, marked with the __global__ specifier.
A CUDA program contains code that runs on the host and code that runs on the device; the host code is standard C, while the device code is CUDA C. To make full use of the device's (GPU's) compute capability, serial code is executed on the host and parallel code on the device. All of the code can live in a single source file, or it can be split across multiple files and libraries. The NVIDIA C compiler (nvcc) compiles both the host and the device code into an executable.
Through calls to CUDA, the program manages the global, constant, and texture memory spaces visible to kernels, including allocating and freeing device memory and transferring data between host and device.
The typical processing flow of a CUDA program (a minimal end-to-end sketch follows the list):
1. Allocate host memory and initialize it
2. Allocate device memory
3. Copy the initialized host memory to the allocated device memory, i.e. copy the data from the CPU to the GPU
4. Compute on the GPU, i.e. launch a kernel that operates on the data stored on the GPU
5. Copy the results from the device back to the host, i.e. copy the results from the GPU to the CPU
6. Process the data copied back to the host
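Putting the six steps together, here is a minimal sketch of a complete program (the kernel addVec and all variable names are illustrative, not from the referenced articles; error checking is omitted):

```
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void addVec(const float *A, const float *B, float *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

int main()
{
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    // 1. Allocate host memory and initialize it
    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    float *h_C = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // 2. Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // 3. Copy data from the CPU to the GPU
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // 4. Launch the kernel on the GPU
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    addVec<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // 5. Copy the results from the GPU back to the CPU
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // 6. Use the results on the host, then release all resources
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```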
CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<...>>> execution configuration syntax (see C Language Extensions). Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
CUDA C extends C: when a kernel is called, it is executed N times in parallel by N different CUDA threads (note: each individual thread actually executes it only once), unlike an ordinary C function, which runs only once per call.
A kernel is defined with the __global__ declaration specifier, and the number of threads that execute a given kernel call is specified with the new <<<...>>> execution configuration syntax. Each thread that executes the kernel has a unique thread ID, accessible inside the kernel through the built-in threadIdx variable (note: this only means the ID is unique within a block; it is not necessarily unique across the whole grid).
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
...
// Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C);
...
}
Here, each of the N threads performs one pair-wise addition in VecAdd() (note: since only one block is used, the thread ID is also globally unique).
Grid: a thread grid, i.e. a group of thread blocks.
One grid <-> one kernel (kernel: the core program executed on the GPU).
Launching a kernel requires specifying an execution configuration, i.e. the thread organization.
Dim and Id attributes of threads and blocks:
blockDim (x, y, z): the block dimensions, corresponding to the second parameter of the execution configuration.
Together with gridDim, blockIdx, and threadIdx, these are built-in, read-only variables. From them a thread's absolute ID can be computed, mapping each thread onto the data it processes (see the sketch below).
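For example, a common way (a sketch, not taken from the referenced articles) to derive a global one-dimensional thread index from the built-in variables is:

```
// Global 1-D index of the current thread across the whole grid
int tid = blockIdx.x * blockDim.x + threadIdx.x;
// Total number of launched threads, handy for grid-stride loops
int stride = gridDim.x * blockDim.x;
```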
For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional block of threads, called a thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.
The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).
As an example, the following code adds two matrices A and B of size NxN and stores the result into matrix C:
For convenience, threadIdx is a 3-component vector, so threads can be identified by a one-, two-, or three-dimensional thread index, forming a one-, two-, or three-dimensional block of threads called a thread block. This provides a natural way to invoke a computation over the elements of a domain such as a vector, matrix, or volume.
A thread's index and its thread ID are directly related: for a one-dimensional block they are identical; for a two-dimensional block of size (Dx, Dy), the thread ID of the thread with index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of the thread with index (x, y, z) is (x + y Dx + z Dx Dy) (translator's note: this ordering differs from the usual C array layout, so take care to understand it).
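The same relation, written as code (a small illustrative sketch):

```
// Linear thread ID inside a (possibly 3-D) block, per the formula x + y*Dx + z*Dx*Dy
int tidInBlock = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
```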
The following example code adds two N×N matrices A and B and stores the result in matrix C.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
int i = threadIdx.x;
int j = threadIdx.y;
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
// Kernel invocation with one block of N * N * 1 threads
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
...
}
There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.
Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks as illustrated by Figure 6. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.
Because all threads of a block must reside on the same processor core and share that core's limited memory resources, the number of threads per block is limited. On current GPUs, a thread block may contain up to 1024 threads.
However, a kernel can be executed by multiple equally shaped blocks, so the total number of threads equals the number of threads per block times the number of blocks.
Blocks are organized into a one-, two-, or three-dimensional grid. The number of blocks in a grid is usually dictated by the size of the data being processed rather than by the number of processors in the system, and it can greatly exceed the latter.
The number of threads per block and the number of blocks per grid specified in the <<<…>>> syntax can be of type int or dim3. Two-dimensional blocks or grids can be specified as in the example above.
Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-dimensional index accessible within the kernel through the built-in blockIdx variable. The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.
Extending the previous MatAdd() example to handle multiple blocks, the code becomes as follows.
The number of threads per block and the number of blocks per grid are specified in the <<<...>>> syntax; the arguments can be of type int or dim3. Two-dimensional blocks or grids can be specified as in the example above.
Each block in the grid can be identified by a one-, two-, or three-dimensional index, accessible in the kernel through the built-in blockIdx variable. The block dimensions are accessible in the kernel through the built-in blockDim variable.
Extending the previous MatAdd() example to handle multiple blocks, the code becomes the following.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i < N && j < N)
    C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
// Kernel invocation
dim3 threadsPerBlock(16, 16);
dim3 numBlocks(N/threadsPerBlock.x, N/threadsPerBlock.y); //N = numBlocks * threadsPerBlock
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
...
}
A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice. The grid is created with enough blocks to have one thread per matrix element as before. For simplicity, this example assumes that the number of threads per grid in each dimension is evenly divisible by the number of threads per block in that dimension, although that need not be the case.
Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores as illustrated by Figure 5, enabling programmers to write code that scales with the number of cores.
Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. Shared Memory gives an example of using shared memory.
For efficient cooperation, the shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache) and __syncthreads() is expected to be lightweight.
A thread block size of 16×16 (256 threads), although arbitrary here, is a common choice. As before, the grid is created with enough blocks so that there is one thread per matrix element. For simplicity, this example assumes that the number of threads per grid in each dimension is evenly divisible by the number of threads per block in that dimension, although that need not be the case.
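When N is not a multiple of the block size, a common pattern (a sketch, not from the original) is to round the grid size up and rely on the bounds check inside the kernel, as the MatAdd kernel above already does:

```
dim3 threadsPerBlock(16, 16);
// Round up so every matrix element is covered even when N is not a multiple of 16
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
```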
Thread blocks are required to execute independently: it must be possible to execute them in any order, in parallel or in series. This independence requirement allows blocks to be scheduled in any order across any number of cores, enabling programmers to write code that scales with the number of cores.
Threads within a block can cooperate by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, synchronization points can be specified in a kernel by calling the __syncthreads() intrinsic; __syncthreads() acts as a barrier at which all threads of the block must wait before any of them is allowed to proceed.
For cooperation to be efficient, shared memory is expected to be a low-latency memory close to each processor core (much like an L1 cache), and __syncthreads() is expected to be lightweight.
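As an illustration of __shared__ data combined with __syncthreads() barriers, here is a hedged sketch (not from the referenced articles) of a block-level sum reduction; it assumes the kernel is launched with 256 threads per block and one input element per thread:

```
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];                   // visible to all threads of this block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                             // wait until every thread has loaded its element

    // Tree reduction: halve the number of active threads at each step
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                         // barrier between reduction steps
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];                // one partial sum per block
}
```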
During execution, CUDA threads can access data from multiple memory spaces:
CUDA threads may access data from multiple memory spaces during their execution as illustrated by Figure 7. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. All threads have access to the same global memory.
There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Device Memory Accesses). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats(see Texture and Surface Memory).
The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
During execution, CUDA threads can access data from multiple memory spaces. Each thread has private local memory. Each block has shared memory that is visible to all threads of the block and has the same lifetime as the block. All threads can access the same global memory.
There are also two additional read-only memory spaces accessible to all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Device Memory Accesses). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Texture and Surface Memory).
Within the same application, the global, constant, and texture memory spaces are persistent across kernel launches.
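For example, a small read-only table can be placed in constant memory and filled from the host. The sketch below is illustrative (the names coeff, scale, and setCoefficients are made up); cudaMemcpyToSymbol is the standard runtime call for writing to a __constant__ variable:

```
__constant__ float coeff[16];            // read-only in kernels, visible to all threads

__global__ void scale(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= coeff[i % 16];
}

void setCoefficients(const float *h_coeff)
{
    // Copy 16 floats from the host into the constant-memory symbol
    cudaMemcpyToSymbol(coeff, h_coeff, 16 * sizeof(float));
}
```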
At execution time, CUDA runs each kernel issued by the host (CPU) on the device (GPU) as a thread grid. Each grid contains multiple thread blocks, and each block contains multiple threads.
CUDA adopts the SIMT (Single Instruction, Multiple Thread) execution model. SIMT is the counterpart of SIMD (Single Instruction, Multiple Data) on the CPU and can be seen as an improvement on it. Unlike a CPU processing vector data with SIMD, SIMT spares the developer the effort of packing data into vectors of a suitable length, and it allows each thread to take its own branch.
The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp.
When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block.
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. The individual threads composing a warp start together at the same program address, but each has its own instruction address counter and register state, so they are free to branch and execute independently. The term warp originates from weaving, the first parallel-thread technology. A half-warp is the first or second half of a warp; a quarter-warp is the first, second, third, or fourth quarter of a warp.
When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps, and each warp is scheduled for execution by a warp scheduler. The way a block is partitioned into warps is always the same: each warp contains threads with consecutive, increasing thread IDs, and the first warp contains thread 0. The thread hierarchy section above describes how thread IDs relate to thread indices within a block.
A warp executes one common instruction at a time, so full efficiency is achieved when all 32 threads of a warp follow the same execution path. If the threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling the threads that are not on that path; when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently, regardless of whether they execute common or distinct code paths.
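A small sketch (not from the original) contrasting a divergent branch with a warp-uniform branch:

```
__global__ void branches(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: even and odd lanes of the same warp take different paths,
    // so the warp executes both paths one after the other.
    if (threadIdx.x % 2 == 0)
        a[i] *= 2.0f;
    else
        a[i] += 1.0f;

    // Warp-uniform: threadIdx.x / 32 is the warp index within the block, so the
    // condition is identical for all 32 threads of a warp and no divergence occurs.
    if ((threadIdx.x / 32) % 2 == 0)
        a[i] -= 3.0f;
}
```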
The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually.
The SIMT architecture is similar to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements.
One difference is that a SIMD vector organization exposes the SIMD width to the software: the vector width is fixed by the hardware, and data must be packed into vectors before it can be processed. SIMT instructions, by contrast, specify the execution and branching behavior of a single thread. In SIMT the execution width is hidden as a hardware detail, and the hardware adapts automatically to different execution widths; for example, a block consists of a number of warps, and its execution width can vary from 1 up to 512 threads (on early devices; current GPUs allow up to 1024, as noted earlier). If CUDA had been designed on the SIMD model, every block's execution width would have to be exactly one warp, i.e. 32 threads, which would greatly reduce programming flexibility. SIMT lets programmers write thread-level parallel code for independent, scalar threads as well as data-parallel code for coordinated threads. For correctness, programmers can essentially ignore the SIMT behavior; substantial performance gains can then be obtained simply by keeping code in which the threads of a warp rarely diverge.
Another difference is that the elements of a SIMD vector can communicate with each other freely, because they live in the same address space (in the same CPU register), whereas in SIMT each thread's registers are private, and threads can communicate only through shared memory and synchronization mechanisms.
CUDA program skeleton
__global__ void Kernel_First(args)
{
}
void main()
{
cudaMalloc(d_A);                    // allocate device memory
cudaMemcpy(HostToDevice);           // copy input data from host to device
Kernel_First<<<Grid, Block>>>(d_A); // launch the kernel
cudaMemcpy(DeviceToHost);           // copy results from device back to host
cudaFree(d_A);                      // free device memory
}
C vs CUDA C
C program
//CPU program
//sum of two vectors a and b
void add_cpu(float *a, float *b, int N)
{
for (int idx = 0; idx < N; idx++)
    a[idx] += b[idx];
}

void main()
{
.....
add_cpu(a, b, N);
}
CUDA C program
//CUDA program
//sum of two vectors a and b
__global__ void add_gpu(float *a, float *b, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N)
a[idx] += b[idx];
}
void main()
{
...
dim3 dimBlock(256);
dim3 dimGrid(ceil(N / 256.0));  // round up so all N elements are covered
add_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
CUDA C is a variant of C created by extending the language. The extensions mainly cover the following aspects:
Function type qualifiers determine whether a function executes on the CPU or on the GPU, and whether it is called from the CPU or from the GPU.
__device__ means the function is called from the GPU and executes on the GPU; devices of compute capability 2.0 and later support recursion.
__global__ means the function is called from the CPU and executes on the GPU, i.e. it is a kernel function; on devices of compute capability 3.5 and later, a __global__ function may launch another __global__ function (dynamic parallelism).
__host__ means the function is called from the CPU and executes on the CPU; this is the default case and corresponds to a traditional C function.

Function type qualifier | Called from | Executes on |
---|---|---|
__device__ float DeviceFunc() | GPU | GPU |
__global__ void KernelFunc() | CPU | GPU |
__host__ float HostFunc() | CPU | CPU |
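A brief sketch (illustrative names, not from the original) showing the three qualifiers used together:

```
__device__ float square(float x)           // GPU-only helper, callable from device code
{
    return x * x;
}

__global__ void squareAll(float *a, int n) // kernel: launched from the host, runs on the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = square(a[i]);
}

__host__ void fillInput(float *a, int n)   // ordinary host function (the default qualifier)
{
    for (int i = 0; i < n; ++i)
        a[i] = (float)i;
}
```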
Variable type qualifiers specify where a variable is stored.
__device__ : the data resides in device (global) memory; all threads can access it, and the host can also access it through the runtime library.
__shared__ : the data resides in shared memory; only threads within the same block can access it, threads in other blocks cannot.
__constant__ : the data resides in constant memory; all threads can read it, and the host can access it through the runtime library.
texture : data bound to the texture can be read through the texture cache, which accelerates reads.
The execution configuration operator passes the execution parameters of a kernel function; it takes four arguments.
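The four arguments are the grid dimensions, the block dimensions, the number of bytes of dynamically allocated shared memory per block, and the stream; the last two are optional. A hedged sketch (someKernel and args are placeholders):

```
dim3 dimGrid(64);          // number of blocks in the grid
dim3 dimBlock(256);        // number of threads per block
size_t sharedBytes = 0;    // dynamic shared memory per block (optional, defaults to 0)
cudaStream_t stream = 0;   // stream to launch into (optional, defaults to the default stream)

someKernel<<<dimGrid, dimBlock, sharedBytes, stream>>>(args);
```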
Built-in variables provide, at run time, information such as the grid and block dimensions and the block and thread indices.
Built-in vector types are constructed from the basic integer and floating-point types (e.g. char4, ushort3, double2, dim3); their components are accessed through x, y, z, and w, and they have specific alignment requirements on the device.
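A small sketch of the built-in vector types and dim3 (values are illustrative):

```
float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f);  // 16-byte aligned on the device
float sum = v.x + v.y + v.z + v.w;

dim3 grid(8, 4);     // unspecified components default to 1, so this is an 8 x 4 x 1 grid
dim3 block(16, 16);  // a 16 x 16 x 1 thread block
```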