CUDA stands for Compute Unified Device Architecture; it was created by NVIDIA, the dominant GPU maker. Many people think of a graphics card as the thing that makes games look stunning. Behind the scenes, it is exactly that massive graphics-computing power that lets a computer run large 3D games at high image quality and frame rates.
A CPU has a small number of powerful cores and devotes much of its chip area to caches and control logic, which makes it well suited to serial tasks. A GPU, by contrast, spends most of its area on arithmetic units: it contains multiple Streaming Multiprocessors (SMs), and each SM holds many CUDA cores.
This article assumes the relevant environment is already configured and will not dwell on setup; those who need it can easily find many tutorials. All demonstrations here use Linux.
With the environment in place, the next step is to inspect the GPU hardware we have, which also helps later when setting launch parameters. The following program prints the GPU's hardware information.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
/**
* @brief print device properties
*
 * @param prop device properties filled in by cudaGetDeviceProperties
*/
void showDeviceProp(cudaDeviceProp &prop) {
printf("Device name: %s\n", prop.name);
printf(" Compute capability: %d.%d\n", prop.major, prop.minor);
printf(" Clock rate: %d\n", prop.clockRate); // reported in kHz
printf(" Memory clock rate: %d\n", prop.memoryClockRate); // reported in kHz
printf(" Memory bus width: %d\n", prop.memoryBusWidth); // in bits
// peak bandwidth (GB/s) = 2 (DDR) * memory clock (kHz) * bus width (bytes) / 1e6
printf(" Peak memory bandwidth (GB/s): %.1f\n", 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
printf(" Total global memory: %lu\n", prop.totalGlobalMem);
printf(" Total shared memory per block: %lu\n", prop.sharedMemPerBlock);
printf(" Total registers per block: %d\n", prop.regsPerBlock);
printf(" Warp size: %d\n", prop.warpSize);
printf(" Maximum memory pitch: %lu\n", prop.memPitch);
printf(" Maximum threads per block: %d\n", prop.maxThreadsPerBlock);
printf(" Maximum dimension of block: %d x %d x %d\n", prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
printf(" Maximum dimension of grid: %d x %d x %d\n", prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
printf(" Total constant memory: %lu\n", prop.totalConstMem);
printf(" Texture alignment: %lu\n", prop.textureAlignment);
printf(" Concurrent copy and execution: %s\n", prop.deviceOverlap ? "Yes" : "No");
printf(" Number of multiprocessors: %d\n", prop.multiProcessorCount);
printf(" Kernel execution timeout: %s\n", prop.kernelExecTimeoutEnabled ? "Yes" : "No");
printf(" Integrated GPU sharing Host Memory: %s\n", prop.integrated ? "Yes" : "No");
}
int main() {
int num_devices;
cudaDeviceProp properties;
cudaGetDeviceCount(&num_devices);
printf("%d CUDA devices found\n", num_devices);
for (int i = 0; i < num_devices; i++) {
cudaGetDeviceProperties(&properties, i);
printf("Device %d: \"%s\"\n", i, properties.name);
showDeviceProp(properties);
}
return 0;
}
Compile the program with nvcc device_query.cu -o device_query, then run ./device_query to print your machine's hardware information:
1 CUDA devices found
Device 0: "NVIDIA Tesla K40c"
Device name: NVIDIA Tesla K40c
Compute capability: 3.5
Clock rate: 745000
Memory clock rate: 3004000
Memory bus width: 384
Peak memory bandwidth (GB/s): 288.4
Total global memory: 11996954624
Total shared memory per block: 49152
Total registers per block: 65536
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension of block: 1024 x 1024 x 64
Maximum dimension of grid: 2147483647 x 65535 x 65535
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 15
Kernel execution timeout: No
Integrated GPU sharing Host Memory: No
You can see this is an NVIDIA Tesla K40c with compute capability 3.5, along with various other parameters. It's fine if you can't interpret them all; they are not critical at this point, and the main purpose is to check that the environment was configured successfully. To see the complete set of parameters you would add more statements printing the corresponding fields; the full list of cudaDeviceProp members can be found in the NVIDIA API documentation.
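One thing the program above glosses over: every CUDA runtime call returns a cudaError_t, and in real code you should check it. Below is a minimal sketch of such a check; the CHECK_CUDA macro name is my own convention, not part of the CUDA API.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
// CHECK_CUDA is a local naming convention, not a CUDA API symbol.
#define CHECK_CUDA(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

int main() {
    int num_devices;
    CHECK_CUDA(cudaGetDeviceCount(&num_devices));
    printf("%d CUDA devices found\n", num_devices);
    return 0;
}

Wrapping each cudaGetDeviceProperties call in the query program the same way would turn silent failures (for example, a driver/toolkit mismatch) into clear error messages.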
The code is split into two parts: one runs on the CPU, also called the host, and the other runs on the GPU, also called the device. The relationship between the two is shown in the figure below.
When the program starts, data is first sent to the GPU over the bus; after the GPU finishes computing, the results are sent back to the host. Because transfer time depends on bus bandwidth, data volume, and other factors, avoid transferring data back and forth repeatedly; otherwise the GPU may well spend more time waiting for data than actually computing.
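To make this host/device round trip concrete, here is a minimal vector-addition sketch, not from the original text: the kernel name vecAdd and all sizes are illustrative, and the __global__ syntax it uses is explained in the next section. The two cudaMemcpy calls are exactly the bus transfers described above.

#include <stdio.h>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2 * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Host -> device transfer over the bus (the part to minimize).
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Device -> host transfer of the result.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]); // expect 30.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}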
Defining code that runs on the GPU (kernel functions)
Code that runs on the GPU must be declared as follows:
__global__ void mykernel(void) {
    // computation goes here
}
__global__ indicates that the function runs on the GPU; nvcc compiles the host part and the device part separately.
Grids and thread blocks
A grid, at the top level, can be divided into blocks along up to three dimensions, and threads in different blocks cannot communicate with each other. A thread block, at the lower level, can likewise arrange its threads along up to three dimensions, and threads within the same block can communicate.
Each kernel launch has exactly one grid, but that grid can contain many blocks. Dividing threads into grids and blocks makes the structure clearer and thread management more flexible.
The figure above is a schematic of a two-dimensional grid of two-dimensional blocks; it comes from 谭生's blog, which is well written and comprehensive, and recommended if you want to study this systematically.
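In practice the block size is usually fixed first (a multiple of the warp size, 32) and the grid size is derived from the problem size by rounding up. Here is a small sketch of that calculation; the element count and block size are arbitrary choices for illustration:

#include <stdio.h>

int main() {
    // Cover n elements with blocks of 256 threads; round the block count
    // up so the last (possibly partial) block still covers the tail.
    const int n = 100000;
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    printf("%d blocks of %d threads\n", blocksPerGrid, threadsPerBlock); // 391 blocks
    // A kernel would then be launched as:
    // mykernel<<<blocksPerGrid, threadsPerBlock>>>(...);
    return 0;
}

Kernels launched this way guard with a bounds check (if (index < n)), since the last block may contain threads past the end of the data.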
Launching a kernel
A kernel is launched as shown below. The program implements the model in the figure above: the grid contains 6 blocks, and each block contains 15 threads.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
__global__ void mykernel(void) {
    // Global (row, column) coordinates of this thread within the whole grid.
    int col_index = threadIdx.x + blockIdx.x * blockDim.x;
    int row_index = threadIdx.y + blockIdx.y * blockDim.y;
    printf("hello from (%d,%d) \n", row_index, col_index);
}
int main(void) {
    dim3 grid(2, 3);  // 2 x 3 = 6 blocks in the grid
    dim3 block(3, 5); // 3 x 5 = 15 threads per block
    mykernel<<<grid, block>>>();
    // Wait for the kernel to finish so its printf output is flushed.
    cudaDeviceSynchronize();
    return 0;
}
After compiling with nvcc grid_and_block.cu -o a.out and running ./a.out, the terminal prints 90 coordinates in total (6 blocks x 15 threads), as shown below. Note that they do not appear in order: blocks are scheduled onto the multiprocessors in no guaranteed order.
hello from (10,3)
hello from (10,4)
hello from (10,5)
hello from (11,3)
hello from (11,4)
hello from (11,5)
hello from (12,3)
hello from (12,4)
hello from (12,5)
hello from (13,3)
hello from (13,4)
hello from (13,5)
hello from (14,3)
hello from (14,4)
hello from (14,5)
hello from (0,0)
hello from (0,1)
hello from (0,2)
hello from (1,0)
hello from (1,1)
hello from (1,2)
hello from (2,0)
hello from (2,1)
hello from (2,2)
hello from (3,0)
hello from (3,1)
hello from (3,2)
hello from (4,0)
hello from (4,1)
hello from (4,2)
hello from (10,0)
hello from (10,1)
hello from (10,2)
hello from (11,0)
hello from (11,1)
hello from (11,2)
hello from (12,0)
hello from (12,1)
hello from (12,2)
hello from (13,0)
hello from (13,1)
hello from (13,2)
hello from (14,0)
hello from (14,1)
hello from (14,2)
hello from (5,0)
hello from (5,1)
hello from (5,2)
hello from (6,0)
hello from (6,1)
hello from (6,2)
hello from (7,0)
hello from (7,1)
hello from (7,2)
hello from (8,0)
hello from (8,1)
hello from (8,2)
hello from (9,0)
hello from (9,1)
hello from (9,2)
hello from (5,3)
hello from (5,4)
hello from (5,5)
hello from (6,3)
hello from (6,4)
hello from (6,5)
hello from (7,3)
hello from (7,4)
hello from (7,5)
hello from (8,3)
hello from (8,4)
hello from (8,5)
hello from (9,3)
hello from (9,4)
hello from (9,5)
hello from (0,3)
hello from (0,4)
hello from (0,5)
hello from (1,3)
hello from (1,4)
hello from (1,5)
hello from (2,3)
hello from (2,4)
hello from (2,5)
hello from (3,3)
hello from (3,4)
hello from (3,5)
hello from (4,3)
hello from (4,4)
hello from (4,5)
This post gave a rough overview of what CUDA is and the basic concepts to build up, followed by examples of using kernel functions. More updates tomorrow!