CUDA | Writing and Compiling a CUDA Code

  • CUDA is very similar to C++, with a few additions
  • All the usual C++ pitfalls (e.g. segmentation faults) remain in CUDA, but they are more challenging to detect

Example: Vector addition

  • takes two vectors on the CPU
  • passes them to the GPU
  • adds them on the GPU
  • passes them back to the CPU
  • outputs them on the CPU

The full code starts with:
#include <iostream>
#include <cmath>

Memory:

  • Memory is best allocated on the GPU from the CPU

  • Dynamic memory allocation is possible from the GPU, but not advisable for performance reasons

  • allocating memory: input the number of bytes
    float *a, *b, *c;
    cudaMalloc((void **) &a, N * sizeof(float));

  • sets “a” equal to a memory address on the GPU that is the start of a block of memory of size N*sizeof(float) bytes.

Copying data:
On the CPU, allocate space as normal:
float *aHost = new float [N];

Memory copy:
cudaMemcpy(a, aHost, N*sizeof(float), cudaMemcpyHostToDevice);

  • copies N*sizeof(float) bytes of data from aHost to a.

In order to copy data back from the GPU, use cudaMemcpyDeviceToHost, as in the sketch below.
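
A minimal sketch of the host-side transfers, assuming the device pointers a, b, c and the size N from above (bHost and cHost are extra host buffers introduced here for illustration):

float *aHost = new float [N];
float *bHost = new float [N];
float *cHost = new float [N];

// ... fill aHost and bHost here ...

// copy the inputs from the CPU to the GPU
cudaMemcpy(a, aHost, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(b, bHost, N*sizeof(float), cudaMemcpyHostToDevice);

// ... launch the kernel (see below) ...

// copy the result back from the GPU to the CPU
cudaMemcpy(cHost, c, N*sizeof(float), cudaMemcpyDeviceToHost);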

Freeing memory:
cudaFree(a);

  • releases the memory pointed to by a, making it available to later cudaMalloc calls.

Vector addition - kernel
__global__ void add(float* a, float* b, float* c, int N)
{
    int i = threadIdx.x;
    if (i < N)
    {
        c[i] = a[i] + b[i];
    }
}

  • Kernel designated by the __global__ keyword
  • Kernel must have a void return type
  • No direct return of information is possible from kernels (asynchronous execution)
  • Thread number given by the built-in variable threadIdx (components .x, .y, .z)

Launching kernel:

  • Kernel launches require thread-block and grid-dimension sizes to be specified

Call a simple kernel:
const int N = 1024;
add<<<1, N>>> (a, b, c, N);
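
Putting the pieces together, one possible minimal version of the whole program (a sketch only; error checking omitted and the input values are arbitrary):

#include <iostream>

__global__ void add(float* a, float* b, float* c, int N)
{
    int i = threadIdx.x;
    if (i < N)
    {
        c[i] = a[i] + b[i];
    }
}

int main()
{
    const int N = 1024;

    // allocate and fill host arrays
    float *aHost = new float [N];
    float *bHost = new float [N];
    float *cHost = new float [N];
    for (int i = 0; i < N; ++i) { aHost[i] = i; bHost[i] = 2*i; }

    // allocate device arrays
    float *a, *b, *c;
    cudaMalloc((void **) &a, N * sizeof(float));
    cudaMalloc((void **) &b, N * sizeof(float));
    cudaMalloc((void **) &c, N * sizeof(float));

    // copy inputs to the device
    cudaMemcpy(a, aHost, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b, bHost, N * sizeof(float), cudaMemcpyHostToDevice);

    // launch one block of N threads
    add<<<1, N>>> (a, b, c, N);

    // copy the result back and print a couple of entries
    cudaMemcpy(cHost, c, N * sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << cHost[0] << " " << cHost[N-1] << std::endl;

    // free device and host memory
    cudaFree(a); cudaFree(b); cudaFree(c);
    delete[] aHost; delete[] bHost; delete[] cHost;
    return 0;
}

Compile with nvcc, e.g. nvcc -o add add.cu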

Error handling:
cudaGetLastError(void)

  • returns the last error, but also resets the last error to cudaSuccess

cudaGetErrorString()

  • returns a human-readable error message for a given error code (see the sketch below)
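
For example, a sketch of checking the launch of the add kernel from above:

add<<<1, N>>> (a, b, c, N);

cudaError_t err = cudaGetLastError();       // did the launch itself succeed?
if (err != cudaSuccess)
{
    std::cerr << "Kernel launch failed: " << cudaGetErrorString(err) << std::endl;
}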

Vector addition - Larger vectors

  • A thread block has a maximum size of 1024 threads and only uses a single SM
  • To use larger arrays (and more SMs), we must use a grid of thread-blocks
  • Use blockIdx, which contains the index of the current block within the grid

dim3 blocks ((int) ceil(N/1024.0));
add<<<blocks, 1024>>> (a, b, c, N);

In the kernel:
int index = blockIdx.x * blockDim.x + threadIdx.x;
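
Combining the two snippets above into one sketch (the value of N here is just an example):

__global__ void add(float* a, float* b, float* c, int N)
{
    // global index of this thread across all blocks in the grid
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // the last block may extend past the end of the arrays, so keep the bound check
    if (index < N)
    {
        c[index] = a[index] + b[index];
    }
}

// host side: enough 1024-thread blocks to cover all N elements
const int N = 1000000;
dim3 blocks ((int) ceil(N/1024.0));
add<<<blocks, 1024>>> (a, b, c, N);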

General thread blocks and grids:

  • how to decide which blocks and grids to use, i.e. how to divide the data among threads
  • one grid-cell, matrix-element or data-point per thread (at least initially); see the sketch after this list
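
For instance, a 2D problem with one matrix element per thread might use 2D blocks and grids; the 16x16 block shape and the matrixScale kernel are purely illustrative:

__global__ void matrixScale(float* m, float s, int rows, int cols)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols)
    {
        m[row * cols + col] *= s;           // one matrix element per thread
    }
}

// host side: 16x16 threads per block, enough blocks to cover the whole matrix
dim3 threads (16, 16);
dim3 blocks ((cols + threads.x - 1) / threads.x,
             (rows + threads.y - 1) / threads.y);
matrixScale<<<blocks, threads>>> (m, 2.0f, rows, cols);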

Complex Function:
For complicated functions of 2 vectors, we may want to use a separate function:
__device__: GPU-only functions, callable from device code
__global__: kernel functions, executed on the GPU and launchable from host code

Functions to be run on the GPU must be declared __device__ or __global__.
__device__ functions can be called, to any depth, from within a __global__ function (see the sketch below).
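
A sketch of a __device__ helper called from a kernel (the helper complexOp is made up for this example):

__device__ float complexOp(float x, float y)
{
    // some more involved combination of the two inputs
    return x*x + 2.0f*x*y + y;
}

__global__ void combine(float* a, float* b, float* c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        c[i] = complexOp(a[i], b[i]);       // __device__ function called from a __global__ kernel
    }
}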

Thread limits:

  • Maximum 1024 threads per block
  • Maximum x/y dimension of a block (in threads): 1024
  • Maximum z dimension of a block: 64
  • Maximum x dimension of a grid: 2^31 - 1 (y and z: 65535)
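
These limits can be queried at runtime with cudaGetDeviceProperties; a sketch for device 0:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

std::cout << "Max threads per block: " << prop.maxThreadsPerBlock << std::endl;
std::cout << "Max block dims: " << prop.maxThreadsDim[0] << " x "
          << prop.maxThreadsDim[1] << " x " << prop.maxThreadsDim[2] << std::endl;
std::cout << "Max grid dims:  " << prop.maxGridSize[0] << " x "
          << prop.maxGridSize[1] << " x " << prop.maxGridSize[2] << std::endl;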

Profiling:

  • asynchronous calls make timing CUDA programs difficult

  • clock() may not give fine enough timings

  • record events on the GPU and measure the time between them afterwards (see the timing sketch after this list):
    cudaEvent_t
    cudaEventRecord
    cudaEventSynchronize
    cudaEventElapsedTime
    cudaEventDestroy
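
A sketch of timing the add launch from above with events (cudaEventCreate is also needed to create the events first):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
add<<<blocks, 1024>>> (a, b, c, N);
cudaEventRecord(stop);

cudaEventSynchronize(stop);                 // wait until the stop event has actually occurred

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);     // elapsed time in milliseconds
std::cout << "Kernel took " << ms << " ms" << std::endl;

cudaEventDestroy(start);
cudaEventDestroy(stop);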

CUDA extensions to C++ - Host functions

  • C++ 17 code should be permitted

  • Functions can be prefixed with __host__

  • Callable on and by the host only

  • use nvcc for all compilation: host compiler is used for host-only code

  • Kernel functions

  • must be prefixed with __global__

  • executed on device, callable from host or device

  • parameters cannot be references

  • parameters are passed via constant memory

  • must have void return type

  • call is asynchronous - returns before device has finished

  • Device functions

  • prefixed with __device__

  • all valid C++ code

  • most C++17 features supported

  • __device__ code is executed on the device and callable from the device only

  • Functions:

  • Functions can be prefixed with both __device__ and __host__; they are then compiled for both CPU and GPU as necessary (see the sketch below).
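
A sketch of a function compiled for both CPU and GPU (squarePlusOne is a made-up example):

__host__ __device__ float squarePlusOne(float x)
{
    return x*x + 1.0f;                      // the same source is compiled for host and device
}

__global__ void apply(float* a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        a[i] = squarePlusOne(a[i]);         // used on the GPU
    }
}

// ... and the very same function can be called from ordinary host code:
// float y = squarePlusOne(3.0f);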

  • Variable attributes:

  • __device__ defined outside functions:

  • Resides in global memory on device (kernels can read/write)

  • Lasts for whole application

  • __constant__ defined outside functions:

  • Resides in constant memory on the device

  • kernel functions can read __constant__ variables directly (but cannot write to them)

  • __shared__:

  • Resides in the shared memory of the Streaming Multiprocessor (SM) on which the thread block is running

  • Lasts for the lifetime of the thread block

  • Shared by, and accessible to, all threads in the same block (see the sketch below)
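
A sketch of __shared__ memory in use: each block stages its slice of the input in shared memory before working on it (reversing within a block is just an illustration, and N is assumed to be a multiple of the block size):

__global__ void reverseWithinBlock(float* a, int N)
{
    __shared__ float tile[1024];            // one element per thread, visible to the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = a[i];               // stage this block's slice in shared memory

    __syncthreads();                        // wait until every thread in the block has written

    a[i] = tile[blockDim.x - 1 - threadIdx.x];
}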
