CUDA driver API notes

So far I have only written CUDA code with the runtime API and never really understood contexts and the like, so I decided to take a quick look at the driver API.

context management:
Some APIs that look like they could help with performance tuning:

  1. CUresult cuCtxGetSharedMemConfig (CUsharedconfig *pConfig)
    Returns the shared memory bank width of the current context (4 bytes or 8 bytes); useful when checking for bank conflicts.

  2. CUresult cuCtxPopCurrent (CUcontext *pctx):
    Pops the current context off the calling CPU thread's context stack; the handle of that context is returned in *pctx and can later be made current on another CPU thread via cuCtxPushCurrent().

  3. CUresult cuCtxPushCurrent (CUcontext ctx):
    Pushes the context specified by ctx onto the calling CPU thread's context stack, making it that thread's current context; the previous current context can be made current again by calling cuCtxDestroy() or cuCtxPopCurrent().

  4. CUresult cuCtxSetCurrent (CUcontext ctx):
    Binds the context specified by ctx to the calling CPU thread. If the thread already has a context stack, ctx replaces the top of that stack; if ctx is NULL, the call is effectively a pop of the stack's top.

  5. CUresult cuCtxSynchronize (void)
    Blocks until the device has completed all preceding requested tasks in the current context. A short sketch of these context-stack calls follows.
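Below is a minimal sketch of how these calls fit together in a single-threaded program; device index 0 is an assumption and error handling is reduced to a simple check macro.

```c
#include <cuda.h>
#include <stdio.h>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    fprintf(stderr, "%s failed: %d\n", #call, (int)r); return 1; } } while (0)

int main(void) {
    CUdevice dev;
    CUcontext ctx, popped;
    CUsharedconfig cfg;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));        // ctx becomes current for this thread

    CHECK(cuCtxGetSharedMemConfig(&cfg));    // 4-byte or 8-byte bank width
    printf("shared memory bank size config: %d\n", (int)cfg);

    CHECK(cuCtxPopCurrent(&popped));         // detach ctx from this thread
    // `popped` could now be made current on another CPU thread via cuCtxPushCurrent()
    CHECK(cuCtxPushCurrent(popped));         // make it current again here

    CHECK(cuCtxSynchronize());               // wait for all work in the current context
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```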

module management
  1. CUresult cuModuleGetFunction (CUfunction *hfunc, CUmodule hmod, const char *name)
    Returns in *hfunc the handle of the function named name located in module hmod.

  2. CUresult cuModuleLoad (CUmodule *module, const char *fname)
    Takes a filename fname and loads the corresponding module into the current context, returning its handle in *module. The file should be a cubin file as output by nvcc, or a PTX file either as output by nvcc or handwritten. A short loading-and-launch sketch follows this list.
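For a concrete picture, here is a rough sketch of loading a module and launching a kernel from it. The file name kernel.ptx and the kernel name add_one are placeholders for whatever nvcc produced from your source (e.g. nvcc -ptx kernel.cu), and cuLaunchKernel (not listed above) does the actual launch; error checking is omitted for brevity.

```c
#include <cuda.h>
#include <stddef.h>

// Assumes a context is already current on this thread (e.g. via cuCtxCreate),
// and that the kernel is declared extern "C" so its name is not mangled.
void launch_from_module(CUdeviceptr d_data, int n) {
    CUmodule mod;
    CUfunction fn;

    cuModuleLoad(&mod, "kernel.ptx");           // cubin or PTX file produced by nvcc
    cuModuleGetFunction(&fn, mod, "add_one");   // look the kernel up by name

    void *args[] = { &d_data, &n };             // kernel parameters, passed by address
    cuLaunchKernel(fn,
                   (n + 255) / 256, 1, 1,       // grid dimensions
                   256, 1, 1,                   // block dimensions
                   0, NULL,                     // dynamic shared memory, stream
                   args, NULL);                 // parameters, extra options
    cuCtxSynchronize();                         // wait for the kernel to finish
    cuModuleUnload(mod);
}
```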

memory management
  1. CUresult cuArrayCreate (CUarray *pHandle, const CUDA_ARRAY_DESCRIPTOR *pAllocateArray)
    Creates a CUDA array; the array handle is returned in *pHandle, and pAllocateArray describes the array to create (width, height, format, etc.).

  2. CUresult cuIpcGetMemHandle (CUipcMemHandle *pHandle, CUdeviceptr dptr)
    Takes a pointer to the base of an existing device memory allocation created with cuMemAlloc and exports it for use in another process. This is a lightweight operation and may be called multiple times on an allocation without adverse effects.
    Used for IPC (restricted to devices with support for unified addressing on Linux operating systems).

  3. CUresult cuIpcOpenMemHandle (CUdeviceptr *pdptr, CUipcMemHandle handle, unsigned int Flags)
    Used together with the previous call: maps memory exported from another process with cuIpcGetMemHandle into the current device address space, returning in *pdptr a pointer that is valid in this process.

  4. CUresult cuMemAlloc (CUdeviceptr *dptr, size_t bytesize)
    Allocates bytesize bytes of linear memory on the device and returns in *dptr a pointer to the allocated memory. A sketch combining cuMemAlloc with the two IPC calls follows this list.
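As a rough sketch of the IPC flow above (Linux, devices with unified addressing): one process allocates and exports, the other maps the handle. How the handle bytes travel between the processes (pipe, socket, shared file) is not shown here.

```c
#include <cuda.h>
#include <stddef.h>

// Process A: allocate device memory and export a handle for it.
CUipcMemHandle export_buffer(CUdeviceptr *dptr, size_t bytes) {
    CUipcMemHandle handle;
    cuMemAlloc(dptr, bytes);             // plain linear device allocation
    cuIpcGetMemHandle(&handle, *dptr);   // lightweight, can be called repeatedly
    return handle;                       // ship these bytes to process B somehow
}

// Process B: map the exported allocation into its own address space.
CUdeviceptr import_buffer(CUipcMemHandle handle) {
    CUdeviceptr dptr;
    cuIpcOpenMemHandle(&dptr, handle, CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS);
    return dptr;                         // valid here until cuIpcCloseMemHandle(dptr)
}
```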

occupancy
  1. CUresult cuOccupancyMaxActiveBlocksPerMultiprocessor (int *numBlocks, CUfunction func, int blockSize, size_t dynamicSMemSize)
    Computes the occupancy of the given function, i.e. the maximum number of active blocks per SM, and returns it in *numBlocks.

  2. CUresult cuOccupancyMaxPotentialBlockSize (int *minGridSize, int *blockSize, CUfunction func, CUoccupancyB2DSize blockSizeToDynamicSMemSize, size_t dynamicSMemSize, int blockSizeLimit)
    Suggests a launch configuration with reasonable occupancy. Returns in *blockSize a block size that can achieve the maximum occupancy, and in *minGridSize the minimum grid size needed to reach it.
    Input parameters: func (handle of the function to analyze); dynamicSMemSize (per-block dynamic shared memory usage, used when the amount does not depend on the block size); blockSizeToDynamicSMemSize (a user-provided callback mapping a block size to its dynamic shared memory usage, needed when the amount does depend on the block size); blockSizeLimit (an upper bound on the block size, 0 meaning no limit). Usage of both occupancy calls is sketched below.
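A sketch of how the two occupancy calls might be combined to pick a launch configuration; kernel is assumed to be a CUfunction already obtained via cuModuleGetFunction and assumed to use no dynamic shared memory.

```c
#include <cuda.h>
#include <stdio.h>

void pick_launch_config(CUfunction kernel, int n) {
    int minGridSize = 0, blockSize = 0, blocksPerSM = 0;

    // Suggested block size, plus the minimum grid size needed for max occupancy.
    cuOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel,
                                     NULL,  // blockSizeToDynamicSMemSize: not needed here
                                     0,     // dynamicSMemSize: kernel uses none
                                     0);    // blockSizeLimit: no limit

    // Occupancy at that block size: max resident blocks per SM.
    cuOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, blockSize, 0);

    int gridSize = (n + blockSize - 1) / blockSize;  // enough blocks to cover n elements
    printf("blockSize=%d minGridSize=%d blocksPerSM=%d gridSize=%d\n",
           blockSize, minGridSize, blocksPerSM, gridSize);
}
```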

profile control

  1. CUresult cuProfilerInitialize (const char *configFile, const char *outputFile, CUoutput_mode outputMode)
    Initializes profiling: the file given by configFile lists the options to be profiled, outputFile names the output file, and outputMode selects the output format.
    Profiling flow:
    cuProfilerInitialize();
    cuProfilerStart();
    // code to be profiled
    cuProfilerStop();

    cuProfilerStart();
    // code to be profiled
    cuProfilerStop();

  2. CUresult cuProfilerStart (void);
    CUresult cuProfilerStop (void);
    The point of these two calls is that profiling can be restricted to selected pieces of code; a concrete sketch with the arguments filled in follows.
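A concrete version of that flow with the arguments spelled out; the config and output file names are placeholders, and CU_OUT_CSV is one of the CUoutput_mode values (CU_OUT_KEY_VALUE_PAIR being the other).

```c
#include <cuda.h>
#include <cudaProfiler.h>

void profiled_regions(void) {
    // The config file lists the counters/options to profile; results go to the CSV file.
    cuProfilerInitialize("profile_config.txt", "profile_out.csv", CU_OUT_CSV);

    cuProfilerStart();
    // ... first piece of code to be profiled ...
    cuProfilerStop();

    // Everything between a Stop and the next Start stays out of the profile.
    cuProfilerStart();
    // ... second piece of code to be profiled ...
    cuProfilerStop();
}
```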
