CUDA is a platform and programming model for CUDA-enabled GPUs, which the platform uses to perform computation. CUDA provides C/C++ language extensions and APIs for programming and managing GPUs.
A CUDA program uses both the CPU and the GPU for computation:
the CPU and GPU are separate systems with their own memory spaces. Typically, serial work runs on the CPU, while parallel computation is offloaded to the GPU.
Take the Hello World program as an example. Compared with its plain C counterpart, the biggest differences are the `__global__` specifier and the `<<<...>>>` syntax:

- `__global__` specifier: marks a function that runs on the device (GPU). Such functions, also called "kernels", can be invoked from host code, e.g. from the `main()` function.
- `<<<...>>>` syntax: when a kernel is called, its execution configuration is supplied via `<<<...>>>`, e.g. `cuda_hello<<<1,1>>>()`. In CUDA terminology, this is called a "kernel launch".

Compiling a CUDA program is similar to compiling a C program. NVIDIA provides a CUDA compiler named `nvcc` in its CUDA toolkit to compile CUDA code; source files conventionally use the `.cu` extension.
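A minimal sketch of such a program (the exact listing is an assumption; the kernel name `cuda_hello` matches the launch shown above):

```c
#include <stdio.h>

// Kernel: runs on the GPU, callable from host code
__global__ void cuda_hello() {
    printf("Hello World from GPU!\n");
}

int main() {
    // Kernel launch: 1 block of 1 thread
    cuda_hello<<<1,1>>>();
    // Wait for the kernel to finish so its printf output is flushed
    cudaDeviceSynchronize();
    return 0;
}
```

Saved as hello.cu, it compiles with `nvcc hello.cu -o hello`.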
Take vector addition as an example. The corresponding C implementation (vector_add.c) is:
```c
#include <stdlib.h>

#define N 10000000

void vector_add(float *out, float *a, float *b, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

int main() {
    float *a, *b, *out;

    // Allocate memory
    a   = (float*)malloc(sizeof(float) * N);
    b   = (float*)malloc(sizeof(float) * N);
    out = (float*)malloc(sizeof(float) * N);

    // Initialize arrays
    for (int i = 0; i < N; i++) {
        a[i] = 1.0f; b[i] = 2.0f;
    }

    // Main function
    vector_add(out, a, b, N);
}
```
Suppose the corresponding CUDA program (vector_add.cu) were simply:
```c
#define N 10000000

__global__ void vector_add(float *out, float *a, float *b, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

int main() {
    float *a, *b, *out;

    // Allocate memory
    a   = (float*)malloc(sizeof(float) * N);
    b   = (float*)malloc(sizeof(float) * N);
    out = (float*)malloc(sizeof(float) * N);

    // Initialize arrays
    for (int i = 0; i < N; i++) {
        a[i] = 1.0f; b[i] = 2.0f;
    }

    // Main function: the kernel is launched with host pointers
    vector_add<<<1,1>>>(out, a, b, N);
}
```
However, the CUDA program above does not work, because the CPU and GPU are separate entities with their own memory spaces: the CPU cannot directly access GPU memory, and the GPU cannot directly access CPU memory. In CUDA terminology, data that the GPU accesses must be stored in device memory. CUDA provides APIs to allocate device memory and to transfer data between host memory and device memory.
The basic workflow of a CUDA program is as follows:

- Allocate and free device memory with `cudaMalloc()` and `cudaFree()`, which correspond to `malloc()` and `free()` in C:
  - `cudaMalloc(void **devPtr, size_t count);`
  - `cudaFree(void *devPtr);`
- Copy data with the `cudaMemcpy` function, similar to `memcpy` in C (see the sketch after this list):
  - `cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind)`, where `kind` gives the direction: `cudaMemcpyHostToDevice` or `cudaMemcpyDeviceToHost`.
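To make the calling pattern concrete, here is a minimal allocate/copy/free round-trip (an illustrative snippet, not part of the original example; the error check is one common convention):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    float host_in[4] = {1.f, 2.f, 3.f, 4.f}, host_out[4];
    float *dev_buf;

    // Allocate device memory; note cudaMalloc takes a pointer to the pointer
    cudaMalloc((void**)&dev_buf, sizeof(host_in));

    // Host -> device, then device -> host
    cudaMemcpy(dev_buf, host_in, sizeof(host_in), cudaMemcpyHostToDevice);
    cudaMemcpy(host_out, dev_buf, sizeof(host_in), cudaMemcpyDeviceToHost);

    // Every runtime call returns a cudaError_t; report the most recent error
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(dev_buf);
    return 0;
}
```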
Following this workflow, the CUDA program (vector_add.cu) is revised as follows:
```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define N 10000000
#define MAX_ERR 1e-6

__global__ void vector_add(float *out, float *a, float *b, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

int main() {
    float *a, *b, *out;
    float *d_a, *d_b, *d_out;

    // Allocate host memory
    a   = (float*)malloc(sizeof(float) * N);
    b   = (float*)malloc(sizeof(float) * N);
    out = (float*)malloc(sizeof(float) * N);

    // Initialize host arrays
    for (int i = 0; i < N; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    // Allocate device memory
    cudaMalloc((void**)&d_a, sizeof(float) * N);
    cudaMalloc((void**)&d_b, sizeof(float) * N);
    cudaMalloc((void**)&d_out, sizeof(float) * N);

    // Transfer data from host to device memory
    cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(float) * N, cudaMemcpyHostToDevice);

    // Executing kernel
    vector_add<<<1,1>>>(d_out, d_a, d_b, N);

    // Transfer data back to host memory
    cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);

    // Verification
    for (int i = 0; i < N; i++) {
        assert(fabs(out[i] - a[i] - b[i]) < MAX_ERR);
    }
    printf("PASSED\n");

    // Deallocate device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);

    // Deallocate host memory
    free(a);
    free(b);
    free(out);
}
```
Compile it, and use `time` to check the program's performance:

```
$> nvcc vector_add.cu -o vector_add
$> time ./vector_add
```
NVIDIA also provides a command-line profiler called `nvprof`, which gives more detailed performance information:

```
$> nvprof ./vector_add
```
Profiling on a Tesla M2050, for example, gives:

```
==6326== Profiling application: ./vector_add
==6326== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 97.55%  1.42529s         1  1.42529s  1.42529s  1.42529s  vector_add(float*, float*, float*, int)
  1.39%  20.318ms         2  10.159ms  10.126ms  10.192ms  [CUDA memcpy HtoD]
  1.06%  15.549ms         1  15.549ms  15.549ms  15.549ms  [CUDA memcpy DtoH]
```
CUDA uses the kernel execution configuration `<<<...>>>` to tell the CUDA runtime how many threads to launch on the GPU.
CUDA organizes threads into groups called "thread blocks".
A kernel can launch multiple thread blocks, which are organized into a "grid" structure.
The syntax of the kernel execution configuration is:

`<<<M, T>>>`

which launches the kernel with a grid of M thread blocks, each containing T parallel threads.
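As a quick illustration (a hypothetical snippet, not part of the original tutorial), `<<<2, 4>>>` launches a grid of M = 2 blocks with T = 4 threads each; the built-in index variables introduced below identify each of the 8 threads:

```c
#include <stdio.h>

// Each of the 2 * 4 = 8 threads prints its position in the grid
__global__ void whoami() {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    whoami<<<2, 4>>>();       // M = 2 thread blocks, T = 4 threads per block
    cudaDeviceSynchronize();  // wait so the device-side printf output is flushed
    return 0;
}
```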
Next, we parallelize the vector addition above with multiple threads, for example one thread block with 256 threads. The kernel execution configuration becomes:

```c
vector_add<<<1, 256>>>(d_out, d_a, d_b, N);
```

CUDA provides built-in variables for accessing thread information; this example uses the following two:

- `threadIdx.x`: the index of the thread within the block. In this example it ranges from 0 to 255.
- `blockDim.x`: the size of the thread block (the number of threads in the block). In this example it is 256.

The complete vector_add_thread.cu source:
```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define N 10000000
#define MAX_ERR 1e-6

__global__ void vector_add(float *out, float *a, float *b, int n) {
    int index = threadIdx.x;   // this thread's index within the block (0..255)
    int stride = blockDim.x;   // number of threads in the block (256)

    // Each thread processes elements index, index + stride, index + 2*stride, ...
    for (int i = index; i < n; i += stride) {
        out[i] = a[i] + b[i];
    }
}

int main() {
    float *a, *b, *out;
    float *d_a, *d_b, *d_out;

    // Allocate host memory
    a   = (float*)malloc(sizeof(float) * N);
    b   = (float*)malloc(sizeof(float) * N);
    out = (float*)malloc(sizeof(float) * N);

    // Initialize host arrays
    for (int i = 0; i < N; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    // Allocate device memory
    cudaMalloc((void**)&d_a, sizeof(float) * N);
    cudaMalloc((void**)&d_b, sizeof(float) * N);
    cudaMalloc((void**)&d_out, sizeof(float) * N);

    // Transfer data from host to device memory
    cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(float) * N, cudaMemcpyHostToDevice);

    // Executing kernel
    vector_add<<<1,256>>>(d_out, d_a, d_b, N);

    // Transfer data back to host memory
    cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);

    // Verification
    for (int i = 0; i < N; i++) {
        assert(fabs(out[i] - a[i] - b[i]) < MAX_ERR);
    }
    printf("PASSED\n");

    // Deallocate device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);

    // Deallocate host memory
    free(a);
    free(b);
    free(out);
}
```
```
$> nvcc vector_add_thread.cu -o vector_add_thread
$> nvprof ./vector_add_thread
```

The performance is now:

```
==6430== Profiling application: ./vector_add_thread
==6430== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 39.18%  22.780ms         1  22.780ms  22.780ms  22.780ms  vector_add(float*, float*, float*, int)
 34.93%  20.310ms         2  10.155ms  10.137ms  10.173ms  [CUDA memcpy HtoD]
 25.89%  15.055ms         1  15.055ms  15.055ms  15.055ms  [CUDA memcpy DtoH]
```
So far we have used only one thread block. CUDA GPUs have multiple parallel processors called Streaming Multiprocessors (SMs); each SM contains many parallel processing units and can run multiple concurrent thread blocks. To take full advantage of a CUDA GPU, a kernel should therefore be launched with multiple thread blocks. CUDA provides two more built-in variables for this case:

- `blockIdx.x`: the index of the block within the grid.
- `gridDim.x`: the size of the grid (the number of blocks).

If a total of N threads is needed and each thread block has 256 threads, at least N/256 (rounded up) thread blocks are required. Each thread needs a unique index, computed as:

```c
int tid = blockIdx.x * blockDim.x + threadIdx.x;
```

For example, with 256 threads per block, thread 3 of block 2 gets tid = 2 * 256 + 3 = 515.
The multi-block source, vector_add_grid.cu:
```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define N 10000000
#define MAX_ERR 1e-6

__global__ void vector_add(float *out, float *a, float *b, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Handling arbitrary vector size
    if (tid < n) {
        out[tid] = a[tid] + b[tid];
    }
}

int main() {
    float *a, *b, *out;
    float *d_a, *d_b, *d_out;

    // Allocate host memory
    a   = (float*)malloc(sizeof(float) * N);
    b   = (float*)malloc(sizeof(float) * N);
    out = (float*)malloc(sizeof(float) * N);

    // Initialize host arrays
    for (int i = 0; i < N; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    // Allocate device memory
    cudaMalloc((void**)&d_a, sizeof(float) * N);
    cudaMalloc((void**)&d_b, sizeof(float) * N);
    cudaMalloc((void**)&d_out, sizeof(float) * N);

    // Transfer data from host to device memory
    cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(float) * N, cudaMemcpyHostToDevice);

    // Executing kernel: grid_size rounds up, and any extra threads
    // are masked off by the tid < n check inside the kernel
    int block_size = 256;
    int grid_size = (N + block_size) / block_size;
    vector_add<<<grid_size,block_size>>>(d_out, d_a, d_b, N);

    // Transfer data back to host memory
    cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);

    // Verification
    for (int i = 0; i < N; i++) {
        assert(fabs(out[i] - a[i] - b[i]) < MAX_ERR);
    }
    printf("PASSED\n");

    // Deallocate device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);

    // Deallocate host memory
    free(a);
    free(b);
    free(out);
}
```
Compile and profile:

```
$> nvcc vector_add_grid.cu -o vector_add_grid
$> nvprof ./vector_add_grid
```

On the Tesla M2050, the performance is:

```
==6564== Profiling application: ./vector_add_grid
==6564== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 55.65%  20.312ms         2  10.156ms  10.150ms  10.162ms  [CUDA memcpy HtoD]
 41.24%  15.050ms         1  15.050ms  15.050ms  15.050ms  [CUDA memcpy DtoH]
  3.11%  1.1347ms         1  1.1347ms  1.1347ms  1.1347ms  vector_add(float*, float*, float*, int)
```
| Version         | Execution Time (ms) | Speedup  |
|-----------------|---------------------|----------|
| 1 thread        | 1425.29             | 1.00x    |
| 1 block         | 22.78               | 62.56x   |
| Multiple blocks | 1.13                | 1261.32x |
OpenCL stands for Open Computing Language. OpenCL is an open standard for cross-platform parallel programming of heterogeneous systems (CPUs, GPUs, and other processors).

References:
[1] Getting started with OpenCL and GPU Computing
[2] Introduction to GPU Computing with OpenCL
[3] OpenCL™ Programming Guide for the CUDA™ Architecture
[4] CUDA Tutorial