前面三节已经对CUDA做了一个简单的介绍,这一节开始真正进入编程环节。
首先,初学者应该对自己使用的设备有较为扎实的理解和掌握,这样对后面学习并行程序优化很有帮助,了解硬件详细参数可以通过上节介绍的几本书和官方资料获得,但如果仍然觉得不够直观,那么我们可以自己动手获得这些内容。
以第二节例程为模板,我们稍加改动的部分代码如下:
// Add vectors in parallel.
cudaError_t cudaStatus;
int num = 0;
cudaDeviceProp prop;
cudaStatus = cudaGetDeviceCount(&num);
for(int i = 0;i
这个改动的目的是让我们的程序自动通过调用cuda API函数获得设备数目和属性,所谓“知己知彼,百战不殆”。
cudaError_t 是cuda错误类型,取值为整数。
cudaDeviceProp为设备属性结构体,其定义可以从cuda Toolkit安装目录中找到,我的路径为:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include\driver_types.h,找到定义为:
/**
* CUDA device properties
*/
struct __device_builtin__ cudaDeviceProp
{
char name[256]; /**< ASCII string identifying device */
size_t totalGlobalMem; /**< Global memory available on device in bytes */
size_t sharedMemPerBlock; /**< Shared memory available per block in bytes */
int regsPerBlock; /**< 32-bit registers available per block */
int warpSize; /**< Warp size in threads */
size_t memPitch; /**< Maximum pitch in bytes allowed by memory copies */
int maxThreadsPerBlock; /**< Maximum number of threads per block */
int maxThreadsDim[3]; /**< Maximum size of each dimension of a block */
int maxGridSize[3]; /**< Maximum size of each dimension of a grid */
int clockRate; /**< Clock frequency in kilohertz */
size_t totalConstMem; /**< Constant memory available on device in bytes */
int major; /**< Major compute capability */
int minor; /**< Minor compute capability */
size_t textureAlignment; /**< Alignment requirement for textures */
size_t texturePitchAlignment; /**< Pitch alignment requirement for texture references bound to pitched memory */
int deviceOverlap; /**< Device can concurrently copy memory and execute a kernel. Deprecated. Use instead asyncEngineCount. */
int multiProcessorCount; /**< Number of multiprocessors on device */
int kernelExecTimeoutEnabled; /**< Specified whether there is a run time limit on kernels */
int integrated; /**< Device is integrated as opposed to discrete */
int canMapHostMemory; /**< Device can map host memory with cudaHostAlloc/cudaHostGetDevicePointer */
int computeMode; /**< Compute mode (See ::cudaComputeMode) */
int maxTexture1D; /**< Maximum 1D texture size */
int maxTexture1DMipmap; /**< Maximum 1D mipmapped texture size */
int maxTexture1DLinear; /**< Maximum size for 1D textures bound to linear memory */
int maxTexture2D[2]; /**< Maximum 2D texture dimensions */
int maxTexture2DMipmap[2]; /**< Maximum 2D mipmapped texture dimensions */
int maxTexture2DLinear[3]; /**< Maximum dimensions (width, height, pitch) for 2D textures bound to pitched memory */
int maxTexture2DGather[2]; /**< Maximum 2D texture dimensions if texture gather operations have to be performed */
int maxTexture3D[3]; /**< Maximum 3D texture dimensions */
int maxTextureCubemap; /**< Maximum Cubemap texture dimensions */
int maxTexture1DLayered[2]; /**< Maximum 1D layered texture dimensions */
int maxTexture2DLayered[3]; /**< Maximum 2D layered texture dimensions */
int maxTextureCubemapLayered[2];/**< Maximum Cubemap layered texture dimensions */
int maxSurface1D; /**< Maximum 1D surface size */
int maxSurface2D[2]; /**< Maximum 2D surface dimensions */
int maxSurface3D[3]; /**< Maximum 3D surface dimensions */
int maxSurface1DLayered[2]; /**< Maximum 1D layered surface dimensions */
int maxSurface2DLayered[3]; /**< Maximum 2D layered surface dimensions */
int maxSurfaceCubemap; /**< Maximum Cubemap surface dimensions */
int maxSurfaceCubemapLayered[2];/**< Maximum Cubemap layered surface dimensions */
size_t surfaceAlignment; /**< Alignment requirements for surfaces */
int concurrentKernels; /**< Device can possibly execute multiple kernels concurrently */
int ECCEnabled; /**< Device has ECC support enabled */
int pciBusID; /**< PCI bus ID of the device */
int pciDeviceID; /**< PCI device ID of the device */
int pciDomainID; /**< PCI domain ID of the device */
int tccDriver; /**< 1 if device is a Tesla device using TCC driver, 0 otherwise */
int asyncEngineCount; /**< Number of asynchronous engines */
int unifiedAddressing; /**< Device shares a unified address space with the host */
int memoryClockRate; /**< Peak memory clock frequency in kilohertz */
int memoryBusWidth; /**< Global memory bus width in bits */
int l2CacheSize; /**< Size of L2 cache in bytes */
int maxThreadsPerMultiProcessor;/**< Maximum resident threads per multiprocessor */
};
后面的注释已经说明了其字段代表意义,可能有些术语对于初学者理解起来还是有一定困难,没关系,我们现在只需要关注以下几个指标:
name:就是设备名称;
totalGlobalMem:就是显存大小;
major,minor:CUDA设备版本号,有1.1, 1.2, 1.3, 2.0, 2.1等多个版本;
clockRate:GPU时钟频率;
multiProcessorCount:GPU大核数,一个大核(专业点称为流多处理器,SM,Stream-Multiprocessor)包含多个小核(流处理器,SP,Stream-Processor)
编译,运行,我们在VS2008工程的cudaGetDeviceProperties()函数处放一个断点,单步执行这一函数,然后用Watch窗口,切换到Auto页,展开+,在我的笔记本上得到如下结果:
可以看到,设备名为GeForce 610M,显存1GB,设备版本2.1(比较高端了,哈哈),时钟频率为950MHz(注意950000单位为kHz),大核数为1。在一些高性能GPU上(如Tesla,Kepler系列),大核数可能达到几十甚至上百,可以做更大规模的并行处理。
PS:今天看SDK代码时发现在help_cuda.h中有个函数实现从CUDA设备版本查询相应大核中小核的数目,觉得很有用,以后编程序可以借鉴,摘抄如下:
// Beginning of GPU Architecture definitions
inline int _ConvertSMVer2Cores(int major, int minor)
{
// Defines for GPU Architecture types (using the SM version to determine the # of cores per SM
typedef struct
{
int SM; // 0xMm (hexidecimal notation), M = SM Major version, and m = SM minor version
int Cores;
} sSMtoCores;
sSMtoCores nGpuArchCoresPerSM[] =
{
{ 0x10, 8 }, // Tesla Generation (SM 1.0) G80 class
{ 0x11, 8 }, // Tesla Generation (SM 1.1) G8x class
{ 0x12, 8 }, // Tesla Generation (SM 1.2) G9x class
{ 0x13, 8 }, // Tesla Generation (SM 1.3) GT200 class
{ 0x20, 32 }, // Fermi Generation (SM 2.0) GF100 class
{ 0x21, 48 }, // Fermi Generation (SM 2.1) GF10x class
{ 0x30, 192}, // Kepler Generation (SM 3.0) GK10x class
{ 0x35, 192}, // Kepler Generation (SM 3.5) GK11x class
{ -1, -1 }
};
int index = 0;
while (nGpuArchCoresPerSM[index].SM != -1)
{
if (nGpuArchCoresPerSM[index].SM == ((major << 4) + minor))
{
return nGpuArchCoresPerSM[index].Cores;
}
index++;
}
// If we don't find the values, we default use the previous one to run properly
printf("MapSMtoCores for SM %d.%d is undefined. Default to use %d Cores/SM\n", major, minor, nGpuArchCoresPerSM[7].Cores);
return nGpuArchCoresPerSM[7].Cores;
}
// end of GPU Architecture definitions
可见,设备版本2.1的一个大核有48个小核,而版本3.0以上的一个大核有192个小核!
前文说到过,当我们用的电脑上有多个显卡支持CUDA时,怎么来区分在哪个上运行呢?这里我们看一下addWithCuda这个函数是怎么做的。
cudaError_t cudaStatus;
// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
goto Error;
}
使用了cudaSetDevice(0)这个操作,0表示能搜索到的第一个设备号,如果有多个设备,则编号为0,1,2...。
再看我们本节添加的代码,有个函数cudaGetDeviceCount(&num),这个函数用来获取设备总数,这样我们选择运行CUDA程序的设备号取值就是0,1,...num-1,于是可以一个个枚举设备,利用cudaGetDeviceProperties(&prop)获得其属性,然后利用一定排序、筛选算法,找到最符合我们应用的那个设备号opt,然后调用cudaSetDevice(opt)即可选择该设备。选择标准可以从处理能力、版本控制、名称等各个角度出发。后面讲述流并发过程时,还要用到这些API。
如果希望了解更多硬件内容可以结合http://www.geforce.cn/hardware获取。