Some basic concepts of GPU programming:
Here is an overview of a GPU (Fermi architecture) [1]:
It is a 16-way many-core GPU (16 SMs). Each of these cores has the following architecture [1]:
Each of these cores is called an SM (Streaming Multiprocessor). It consists of several SPs (Streaming Processors) and other resources such as registers and shared memory.
SP: the basic computation component and the actual processing unit. Parallel computation uses many SPs at the same time. SP <==> CUDA core in NVIDIA's terminology (not to be confused with the "core" of the many-core view above, which is an SM).
Some basic concepts in CUDA programming:
Warp: A warp is the unit of execution on the GPU. Warps are formed by the hardware, not declared by the programmer. Each core (SM) can hold up to 48 warps at a time. Threads in the same warp execute the same instruction on different data (like SIMD does), so a warp is an abstraction of one SIMD-style computation. The size of a warp is 32, so there are at most 32 threads in a warp (a 32-wide SIMD vector).
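To make the warp size concrete, here is a minimal host-side C++ sketch (plain C++, not CUDA) of how the threads of a block map onto warps. It assumes a one-dimensional block whose threads are grouped into warps by consecutive index; the function names are my own and are not part of any CUDA API.

```cpp
#include <cassert>

// Warp size described in the notes: 32 threads per warp.
constexpr int kWarpSize = 32;

// Number of warps needed to cover a block of threadsPerBlock threads.
// A partially filled last warp still occupies a whole warp.
inline int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Which warp a thread belongs to, given its index inside the block.
inline int warpIndex(int threadInBlock) {
    assert(threadInBlock >= 0);
    return threadInBlock / kWarpSize;
}

// The lane of a thread, i.e. its position inside the 32-wide vector.
inline int laneIndex(int threadInBlock) {
    assert(threadInBlock >= 0);
    return threadInBlock % kWarpSize;
}
```

For example, a block of 100 threads needs warpsPerBlock(100) = 4 warps, and the last warp has only 4 active lanes. Thread 70 of the block sits in warp 2, lane 6.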
Threads: A thread in the GPU is the computation performed in one SIMD lane. In other words, a SIMD computation that is N lanes wide corresponds to N threads.
CUDA is a mental model for exposing parallelism to the GPU, and also a language extension that lets us communicate with the GPU. It keeps most of the conventions of the C language, so it is easy to get a program running. We expose the parallelism through the blocks we specify in our code, and the GPU then automatically divides each block into one or more warps.
Therefore, we say that CUDA follows the SIMT (Single Instruction, Multiple Threads) model. It is different from SIMD.
SIMD exposes the SIMD width to the programmer. SIMT abstracts the number of threads in a thread block as a user-specified parameter [1], which we set with <<<numOfBlocksPerGrid, numOfThreadsPerBlock>>> when launching a kernel. The kernel function only describes what each thread needs to do.
With the abstraction made by SIMT, we can easily get the code running. However, we still need to focus on optimisation.
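The idea that the kernel only says what one thread does can be sketched on the host in plain C++. This is a sequential emulation, not real CUDA: ThreadCoord, addOne and launch are hypothetical names standing in for the built-in blockIdx.x / blockDim.x / threadIdx.x variables, a kernel body, and a <<<...>>> launch. On a real GPU the two loops would not exist, because the threads run in parallel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for CUDA's blockIdx.x, blockDim.x and threadIdx.x.
struct ThreadCoord {
    int blockIdx;   // which block this thread is in
    int blockDim;   // number of threads per block
    int threadIdx;  // index of this thread inside its block
};

// Illustrative "kernel" body: the work done by ONE thread.
inline void addOne(const ThreadCoord& t, std::vector<float>& data) {
    int i = t.blockIdx * t.blockDim + t.threadIdx;  // global element index
    if (i < static_cast<int>(data.size())) {        // extra threads do nothing
        data[i] += 1.0f;
    }
}

// Sequential emulation of addOne<<<numBlocks, threadsPerBlock>>>(data).
inline void launch(int numBlocks, int threadsPerBlock, std::vector<float>& data) {
    assert(numBlocks > 0 && threadsPerBlock > 0);
    for (int b = 0; b < numBlocks; ++b) {
        for (int t = 0; t < threadsPerBlock; ++t) {
            addOne(ThreadCoord{b, threadsPerBlock, t}, data);
        }
    }
}
```

Calling launch(3, 4, data) on a vector of 10 elements creates 12 logical threads. The bounds check inside addOne makes the 2 surplus threads idle, which is the usual pattern when the data size is not a multiple of the block size.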
Block: blocks let us fully expose the concurrency, and the hardware or runtime decides whether to serialise their execution as necessary. Each block is assigned to one SM and runs there until it finishes (one SM may hold several blocks at once). Each block is limited to a maximum of 1024 threads. The grid is far less constrained (up to 65535 blocks per grid dimension on Fermi), so in practice you can create as many threads as you need.
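Because of the 1024-thread limit, a large problem has to be spread over many blocks. A minimal C++ sketch of the usual block-count calculation follows; blocksNeeded and kMaxThreadsPerBlock are names I made up for illustration, and the sketch assumes a one-dimensional grid.

```cpp
#include <cassert>

// Thread limit quoted in the notes (Fermi).
constexpr int kMaxThreadsPerBlock = 1024;

// Number of blocks needed so that every element gets its own thread.
// Rounds up, so the last block may be only partially used.
inline int blocksNeeded(long long numElements, int threadsPerBlock) {
    assert(numElements > 0);
    assert(threadsPerBlock > 0 && threadsPerBlock <= kMaxThreadsPerBlock);
    return static_cast<int>((numElements + threadsPerBlock - 1) / threadsPerBlock);
}
```

For one million elements with 1024 threads per block, blocksNeeded(1000000, 1024) gives 977 blocks, and the last block has threads that fall outside the data and must be guarded by a bounds check.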
Grids: a grid consists of many blocks, and a block consists of several threads (one or more warps). Each call of a kernel function launches exactly one grid.
In summary, threads ==> blocks ==> grids are the concepts from the perspective of the programmer (user).
From the perspective of the GPU, there are only many warps waiting to be executed. Every clock cycle, the GPU picks a warp that is ready and executes one instruction for the 32 threads within that warp.
Some details of CUDA and GPU:
Typically, each register is 4 bytes (32 bits) and each thread uses about 20 registers, so a warp needs 4*20*32 = 2560 bytes (2.5KB). A core holds up to 48 warps, so it needs 4*20*32*48 = 122880 bytes = 120KB, which is provided as a 128KB register file per core. With 16 SMs in the Fermi architecture, a register file of 16*128KB = 2MB is needed in total.
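The register-file arithmetic can be checked with a few lines of C++. This sketch assumes 4-byte (32-bit) registers and takes the other numbers from the notes; rounding 120KB up to the next power of two to reach 128KB is my own assumption about how the quoted size comes about.

```cpp
#include <cassert>

constexpr long long kBytesPerRegister   = 4;   // assumption: 32-bit registers
constexpr long long kRegistersPerThread = 20;  // "about 20" from the notes
constexpr long long kThreadsPerWarp     = 32;
constexpr long long kWarpsPerSM         = 48;
constexpr long long kNumSMs             = 16;  // Fermi

// Register bytes needed by one warp.
inline long long bytesPerWarp() {
    return kBytesPerRegister * kRegistersPerThread * kThreadsPerWarp;
}

// Register bytes needed by all warps resident on one SM.
inline long long bytesPerSM() {
    return bytesPerWarp() * kWarpsPerSM;
}

// Smallest power of two that is >= n (assumed provisioning rule).
inline long long roundUpToPowerOfTwo(long long n) {
    assert(n > 0);
    long long p = 1;
    while (p < n) p *= 2;
    return p;
}

// Register file size of the whole GPU, in bytes.
inline long long totalRegisterFileBytes() {
    return roundUpToPowerOfTwo(bytesPerSM()) * kNumSMs;
}
```

With these numbers, bytesPerWarp() is 2560, bytesPerSM() is 122880 (120KB), the rounded size per SM is 131072 (128KB), and totalRegisterFileBytes() is 2097152 (2MB).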
Why does the GPU run in warps?
A 128KB register file is a large memory, and it takes more than one clock cycle to read data from it. If threads did not run in warps, loading the operands for each thread separately would cost a lot of time. In fact, to balance computation against the cost of loading data, the GPU provides a 16-wide physical SIMD vector, which delivers half of a warp's operands per clock cycle, so each warp instruction is split into two passes. However, when programming we can simply assume a 32-wide vector.
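The split into two passes can be written down as a small C++ sketch. The function names are hypothetical, and the mapping of lanes 0 to 15 onto the first pass and lanes 16 to 31 onto the second is an assumption made for illustration; the notes only say that half of the warp is served per clock cycle.

```cpp
#include <cassert>

// Number of passes needed to push one warp instruction through the
// physical SIMD unit (e.g. a 32-thread warp on a 16-wide unit).
inline int passesPerWarpInstruction(int warpSize, int physicalSimdWidth) {
    assert(physicalSimdWidth > 0 && warpSize > 0);
    assert(warpSize % physicalSimdWidth == 0);
    return warpSize / physicalSimdWidth;
}

// Pass in which a given lane is served, assuming lanes are served in
// index order (an illustrative assumption, not a hardware guarantee).
inline int passOfLane(int lane, int physicalSimdWidth) {
    assert(lane >= 0 && physicalSimdWidth > 0);
    return lane / physicalSimdWidth;
}
```

With a warp size of 32 and a 16-wide unit, passesPerWarpInstruction(32, 16) is 2, which is the "two times" mentioned above.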
Reference:
[1] CMU 'How to write fast code' by Jike Chong and Ian Lane