1. deviceQuery is very important: its output is a practical guide for the block/grid configuration and memory-hierarchy usage you will face in CUDA programming.
deviceQuery is actually a CUDA sample, so it must be compiled before use. It is located in /opt/cuda/cuda70/NVIDIA_CUDA-7.0_Samples, or possibly in a local cuda folder (I am not certain which).
Because those files are read-only, copy them into your home directory; since the build uses files from NVIDIA_CUDA-7.0_Samples/common, copy the whole NVIDIA_CUDA-7.0_Samples directory.
Run make, and you get the deviceQuery executable.
For any GPU programming project, I recommend making the compilation and run of deviceQuery the very first step.
One result I did not understand was the compute mode (I found the explanation in the nvidia-smi manual):
Compute mode indicates whether a single program or multiple programs are allowed to use the GPU at the same time.
Compute Mode: The compute mode flag indicates whether individual or multiple compute applications may run on the GPU.
"Default" means multiple contexts are allowed per device.
"Exclusive Thread" means only one context is allowed per device, usable from one thread at a time.
"Exclusive Process" means only one context is allowed per device, usable from multiple threads at a time. "
“Prohibited" means no contexts are allowed per device (no compute apps).
"EXCLUSIVE_PROCESS" was added in CUDA 4.0. Prior CUDA releases supported only one exclusive mode, which is equivalent to "EXCLUSIVE_THREAD" in CUDA 4.0 and beyond.
For all CUDA-capable products
2. nvidia-smi. There is an nvidia-smi manual: http://developer.download.nvidia.com/compute/cuda/6_0/rel/gdk/nvidia-smi.331.38.pdf
nvidia-smi is the NVIDIA System Management Interface: a command-line utility built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
GPU configuration options (such as ECC memory capability) may be enabled and disabled.
The nvidia-smi command is installed along with the NVIDIA driver, so it is available as soon as the driver is.
nvidia-smi -i 0 -q displays all information for one device (-i selects the GPU index).
nvidia-smi -h shows the help.
3. Using cudaGetDeviceProperties()
deviceQuery actually calls cudaGetDeviceProperties() and prints each property one by one.
For example, the program and its output:
void PrintDeviceProperties(cudaDeviceProp devProp)
{
    FILE *deviceProperties = fopen("DeviceProperties.txt", "a+");
    fprintf(deviceProperties, "Major revision number: %d\n", devProp.major);
    fprintf(deviceProperties, "Minor revision number: %d\n", devProp.minor);
    fprintf(deviceProperties, "Name: %s\n", devProp.name);
    /* These fields are size_t, so %zu is the correct format specifier. */
    fprintf(deviceProperties, "Total global memory: %zu\n", devProp.totalGlobalMem);
    fprintf(deviceProperties, "Total shared memory per block: %zu\n", devProp.sharedMemPerBlock);
    fprintf(deviceProperties, "Total registers per block: %d\n", devProp.regsPerBlock);
    fprintf(deviceProperties, "Warp size: %d\n", devProp.warpSize);
    fprintf(deviceProperties, "Maximum memory pitch: %zu\n", devProp.memPitch);
    fprintf(deviceProperties, "Maximum threads per block: %d\n", devProp.maxThreadsPerBlock);
    for (int i = 0; i < 3; ++i)
        fprintf(deviceProperties, "Maximum dimension %d of block: %d\n", i, devProp.maxThreadsDim[i]);
    for (int i = 0; i < 3; ++i)
        fprintf(deviceProperties, "Maximum dimension %d of grid: %d\n", i, devProp.maxGridSize[i]);
    fprintf(deviceProperties, "Clock rate: %d\n", devProp.clockRate);
    fprintf(deviceProperties, "Total constant memory: %zu\n", devProp.totalConstMem);
    fprintf(deviceProperties, "Texture alignment: %zu\n", devProp.textureAlignment);
    fprintf(deviceProperties, "Concurrent copy and execution: %s\n", devProp.deviceOverlap ? "Yes" : "No");
    fprintf(deviceProperties, "Number of multiprocessors: %d\n", devProp.multiProcessorCount);
    fprintf(deviceProperties, "Kernel execution timeout: %s\n", devProp.kernelExecTimeoutEnabled ? "Yes" : "No");
    fclose(deviceProperties);
}

And the result is as follows:

Major revision number: 2
Minor revision number: 0
Name: Tesla C2075
Total global memory: 1341849600
Total shared memory per block: 49152
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1147000
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 14
Kernel execution timeout: No