Example: Vector addition
The full code starts with the usual headers:
#include <cuda_runtime.h>
#include <iostream>
Memory:
Memory is best allocated on the GPU from the CPU
Dynamic memory allocation is possible from the GPU, but not advisable for performance reasons
Allocating memory: pass in the number of bytes
float *a, *b, *c;
cudaMalloc((void **) &a, N*sizeof(float));
sets “a” equal to a memory address on the GPU that is the start of a block of memory of size N*sizeof(float) bytes.
Copying data:
On the CPU, allocate space as normal:
float *aHost = new float [N];
Memory copy:
cudaMemcpy(a, aHost, N*sizeof(float), cudaMemcpyHostToDevice);
To copy data back from the GPU, use cudaMemcpyDeviceToHost instead.
Freeing memory:
cudaFree(a);
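Putting the allocation, copy, and free steps above together, a minimal host-side sketch (N and the initial values are illustrative):

```cuda
#include <cuda_runtime.h>

int main()
{
    const int N = 1024;

    // Host allocation as normal
    float *aHost = new float[N];
    for (int i = 0; i < N; ++i) aHost[i] = 1.0f;

    // Device allocation: pass the address of the pointer and the byte count
    float *a;
    cudaMalloc((void **) &a, N * sizeof(float));

    // Copy host -> device, and (after any kernel work) device -> host
    cudaMemcpy(a, aHost, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(aHost, a, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Free in both memory spaces
    cudaFree(a);
    delete[] aHost;
    return 0;
}
```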
Vector addition - kernel
__global__ void add(float* a, float* b, float* c, int N)
{
    int i = threadIdx.x;
    if (i < N)
    {
        c[i] = a[i] + b[i];
    }
}
Launching kernel:
Call a simple kernel:
const int N = 1024;
add<<<1, N>>>(a, b, c, N);
Error handling:
cudaGetLastError(void) returns (and clears) the last error from a runtime call or kernel launch
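Since a kernel launch returns no status itself, a common pattern is to query the error state just after the launch; a sketch (the helper name is illustrative, not part of the CUDA API):

```cuda
#include <cuda_runtime.h>
#include <iostream>

// Report any error left by the most recent launch or runtime call.
// (checkLastError is an illustrative helper, not a CUDA function.)
void checkLastError(const char* label)
{
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        std::cerr << label << ": " << cudaGetErrorString(err) << std::endl;
}
```

Call it as checkLastError("add kernel") immediately after the add<<<...>>>(...) launch.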
Vector addition - Larger vectors
dim3 blocks((int) ceil(N/1024.0));
add<<<blocks, 1024>>>(a, b, c, N);
in kernel:
int index = blockIdx.x * blockDim.x + threadIdx.x;
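With that global index, the kernel guards against the spare threads in the last block; a sketch matching the launch above:

```cuda
__global__ void add(float* a, float* b, float* c, int N)
{
    // Global index across all blocks
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)   // last block may have threads past the end of the vectors
    {
        c[index] = a[index] + b[index];
    }
}
```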
General thread blocks and grids:
Complex Function:
For complicated functions of 2 vectors, we may want to use a separate function:
__device__: GPU-only functions
__global__: functions callable from the CPU and run on the GPU
Functions to be run on the GPU must be marked __device__ or __global__
__device__ functions can be called to any depth within a __global__ function
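A sketch of a separate __device__ helper used inside a kernel (the function f is illustrative):

```cuda
// __device__ helper: compiled for, and callable from, the GPU only
__device__ float f(float x, float y)
{
    return x * x + y;   // stand-in for a complicated function of two inputs
}

// __global__ kernel calling the __device__ function
__global__ void apply(float* a, float* b, float* c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        c[i] = f(a[i], b[i]);
}
```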
Thread limits:
Profiling:
Asynchronous calls make timing CUDA programs awkward
clock() may not give fine enough resolution
Instead, record events, then find the elapsed time between them afterwards:
cudaEvent_t
cudaEventRecord
cudaEventSynchronize
cudaEventElapsedTime
cudaEventDestroy
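Putting the event API together (cudaEventCreate is also needed to create the events), a timing sketch:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
add<<<blocks, 1024>>>(a, b, c, N);   // the work being timed
cudaEventRecord(stop);

// Block until the stop event has actually occurred on the device
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```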
CUDA extensions to C++ - Host functions
C++17 code should be permitted
Functions can be prefixed with __host__
Callable on and by the host only
use nvcc for all compilation: host compiler is used for host-only code
Kernel functions
must be prefixed with __global__
executed on device, callable from host or device
parameters cannot be references
parameters are passed via constant memory
must have void return type
call is asynchronous - returns before device has finished
Device functions
prefixed with __device__
all valid C++ code
most C++17 features supported
__device__ code is executed on the device and callable from the device only
Functions:
Functions can be prefixed with both __device__ and __host__ and are then compiled for both CPU and GPU as necessary.
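A sketch of a function compiled for both sides (the function name is illustrative):

```cuda
// Compiled once for the CPU and once for the GPU:
// callable from host code and from kernels alike.
__host__ __device__ float square(float x)
{
    return x * x;
}
```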
Variable attributes:
__device__, defined outside functions:
Resides in global memory on device (kernels can read/write)
Lasts for whole application
__constant__, defined outside functions:
Resides in constant memory on the device
Kernel functions can directly access __constant__ variables (read-only)
__shared__:
Resides in shared memory of Streaming Multiprocessor on which thread block is running
Lasts for lifetime of thread block
Shared and accessible for all threads in same block
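A sketch of a __shared__ array used for a block-level partial sum (illustrative; a block size of 256 is assumed):

```cuda
__global__ void blockSum(const float* in, float* out, int N)
{
    // One entry per thread in the block, in on-chip shared memory
    __shared__ float buf[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < N) ? in[i] : 0.0f;
    __syncthreads();   // all threads must have written before any reads

    // Simple tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];   // one partial sum per block
}
```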