Course Objectives
Objectives
CPU: latency-oriented design
GPU: throughput-oriented design
Small caches – to boost memory throughput
Simple control
Energy-efficient ALUs – many, long-latency but heavily pipelined for high throughput
Requires a very large number of threads to tolerate latencies
CPUs and GPUs are designed very differently
Good applications use both the CPU and the GPU
Objectives
Understand the importance and nature of scalability and portability in parallel programming
Scalability
The same application runs efficiently on new generations of cores
The same application runs efficiently on more of the same cores
Portability
Ways to accelerate applications
Libraries
Vector Addition in Thrust
thrust::device_vector<float> deviceInput1(inputLength);
thrust::device_vector<float> deviceInput2(inputLength);
thrust::device_vector<float> deviceOutput(inputLength);
thrust::copy(hostInput1, hostInput1 + inputLength, deviceInput1.begin());
thrust::copy(hostInput2, hostInput2 + inputLength, deviceInput2.begin());
thrust::transform(deviceInput1.begin(), deviceInput1.end(), deviceInput2.begin(), deviceOutput.begin(), thrust::plus<float>());
Compiler Directives
OpenACC
- Compiler directives for C, C++, and FORTRAN
#pragma acc parallel loop \
  copyin(input1[0:inputLength], input2[0:inputLength]), \
  copyout(output[0:inputLength])
for(i = 0; i < inputLength; ++i) {
output[i] = input1[i] + input2[i];
}
Programming Languages
Objectives
Learn the basic API functions in CUDA host code
Data parallelism – the vector addition example
vector A: A[0] A[1] A[2] … A[N-1]
vector B: B[0] B[1] B[2] … B[N-1]
vector C: C[0] C[1] C[2] … C[N-1]   (each C[i] = A[i] + B[i])
Vector addition – traditional C code
// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int i;
for (i = 0; i<n; i++)
h_C[i] = h_A[i] + h_B[i];
}
int main()
{
// Memory allocation for h_A, h_B, and h_C
// I/O to read h_A and h_B, N elements
...
vecAdd(h_A, h_B, h_C, N);
}
Heterogeneous computing: vecAdd CUDA host code
#include <cuda.h>
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int size = n* sizeof(float);
float *d_A, *d_B, *d_C;
// Part 1
// Allocate device memory for A, B, and C
// copy A and B to device memory
// Part 2
// Kernel launch code – the device performs the actual vector addition
// Part 3
// copy C from the device memory
// Free device vectors
}
Overview of CUDA memory (partial)
Device code can: read/write per-thread registers and read/write device global memory
Host code can: allocate/free device global memory and transfer data to/from it
CUDA device memory management API functions (listed below)
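For reference, the CUDA runtime signatures of the three device-memory functions used in the host code below:
cudaError_t cudaMalloc(void **devPtr, size_t size);   // allocate device global memory
cudaError_t cudaFree(void *devPtr);                   // free device global memory
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
                       cudaMemcpyKind kind);          // kind: cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost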
Vector Addition Host Code
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int size = n * sizeof(float); float *d_A, *d_B, *d_C;
cudaMalloc((void **) &d_A, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &d_B, size);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &d_C, size);
// Kernel invocation code – to be shown later
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree (d_C);
}
In Practice, Check for API Errors in Host Code
cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
exit(EXIT_FAILURE);
}
Arrays of parallel threads
Thread blocks: scalable cooperation
blockIdx and threadIdx
(Omitted.)
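A minimal sketch of how these built-in variables combine into a per-thread global index (the same pattern used by the vector-addition kernel below):
int i = blockIdx.x * blockDim.x + threadIdx.x; // blockDim.x threads per block, blockIdx.x selects the block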
Example: the vector addition kernel
Device Code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
if(i < n)
C[i] = A[i] + B[i];
}
Host Code
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
// d_A, d_B, d_C allocations and copies omitted
// Run ceil(n/256.0) blocks of 256 threads each
vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
}
More on kernel launch (host code)
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
dim3 DimGrid((n - 1)/256 + 1, 1, 1);
dim3 DimBlock(256, 1, 1);
vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);
}
More on CUDA function declarations
Declaration | Executed on | Callable from |
---|---|---|
__device__ | device | device |
__global__ | device | host |
__host__ | host | host |
Note: a __global__ function must return void
Objectives
Understand multi-dimensional grids
An example of a multi-dimensional grid
Using a 2D grid to process a picture
Source code of PictureKernel
Scales every pixel value by 2.0
__global__ void PictureKernel(float* d_Pin, float* d_Pout, int height, int width)
{
// Calculate the row # of the d_Pin and d_Pout element
int Row = blockIdx.y*blockDim.y + threadIdx.y;
// Calculate the column # of the d_Pin and d_Pout element
int Col = blockIdx.x*blockDim.x + threadIdx.x;
// each thread computes one element of d_Pout if in range
if ((Row < height) && (Col < width)) {
d_Pout[Row*width+Col] = 2.0*d_Pin[Row*width+Col];
}
}
Host code for launching PictureKernel
// assume that the picture is m × n,
// m pixels in y dimension and n pixels in x dimension
// input d_Pin has been allocated on and copied to device
// output d_Pout has been allocated on device
...
dim3 DimGrid((n-1)/16 + 1, (m-1)/16 + 1, 1);
dim3 DimBlock(16, 16, 1);
PictureKernel<<<DimGrid, DimBlock>>>(d_Pin, d_Pout, m, n);
...
RGB to grayscale conversion
The color-to-gray computation formula
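The luminance formula used in the kernel below (weights as they appear in the code): Gray = 0.21f*r + 0.71f*g + 0.07f*b.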
RGB to grayscale conversion code
#define CHANNELS 3 // we have 3 channels corresponding to RGB
// The input image is encoded as unsigned characters [0, 255]
__global__ void colorConvert(unsigned char * grayImage,
unsigned char * rgbImage,int width, int height) {
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
if (x < width && y < height) {
// get 1D coordinate for the grayscale image
int grayOffset = y*width + x;
// one can think of the RGB image as having
// CHANNELS times as many columns as the grayscale image
int rgbOffset = grayOffset*CHANNELS;
unsigned char r = rgbImage[rgbOffset ]; // red value for pixel
unsigned char g = rgbImage[rgbOffset + 1]; // green value for pixel
unsigned char b = rgbImage[rgbOffset + 2]; // blue value for pixel
// perform the rescaling and store it
// We multiply by floating point constants
grayImage[grayOffset] = 0.21f*r + 0.71f*g + 0.07f*b;
}
}
Image blur (box filter)
__global__
void blurKernel(unsigned char * in, unsigned char * out, int w, int h) {
int Col = blockIdx.x * blockDim.x + threadIdx.x;
int Row = blockIdx.y * blockDim.y + threadIdx.y;
if (Col < w && Row < h) {
int pixVal = 0;
int pixels = 0;
// Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box
for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
int curRow = Row + blurRow;
int curCol = Col + blurCol;
// Verify we have a valid image pixel
if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
pixVal += in[curRow * w + curCol];
pixels++; // Keep track of number of pixels in the accumulated total
}
}
}
// Write our new pixel value out
out[Row * w + Col] = (unsigned char)(pixVal / pixels);
}
}
Objectives
Understand how CUDA kernels utilize hardware execution resources
Warp example
If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
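Worked answer: each warp has 32 threads, so each block contributes 256/32 = 8 warps and the SM holds 3 × 8 = 24 warps.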
Thread scheduling (continued)
The SM implements zero-overhead warp scheduling
Considerations
For matrix multiplication using multiple blocks, should I use 8×8, 16×16, or 32×32 blocks on Fermi?
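A quick worked analysis, using the Fermi limits quoted later in these notes (1,536 threads and 8 blocks per SM): 8×8 = 64 threads per block gives at most 8 × 64 = 512 threads per SM (the 8-block limit bites first, 33% occupancy); 16×16 = 256 threads per block allows 6 blocks = 1,536 threads (100%); 32×32 = 1,024 threads per block allows only 1 block = 1,024 threads (67%). So 16×16 is the best of the three.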
Objectives
Learn to use CUDA memory types effectively in a parallel program
Matrix multiplication
A basic matrix multiplication kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
// Calculate the row index of the P element and M
int Row = blockIdx.y*blockDim.y+threadIdx.y;
// Calculate the column index of P and N
int Col = blockIdx.x*blockDim.x+threadIdx.x;
if ((Row < Width) && (Col < Width)) {
float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k) {
Pvalue += M[Row*Width+k]*N[k*Width+Col];
}
P[Row*Width+Col] = Pvalue;
}
}
Analyzing the threads:
Memory and registers in the von Neumann model
The programmer's view of CUDA memory
Declaring CUDA variables
Variable declaration | Memory | Scope | Lifetime |
---|---|---|---|
int LocalVar; | register | thread | thread |
__device__ __shared__ int SharedVar; | shared | block | block |
__device__ int GlobalVar; | global | grid | application |
__device__ __constant__ int ConstantVar; | constant | grid | application |
Note: __device__ is optional when used together with __shared__ or __constant__
Shared memory in CUDA
A special type of memory whose contents are explicitly declared and used in the kernel source code
The hardware view of CUDA memory
Global memory access pattern of the basic matrix multiplication kernel
Tiling/Blocking
Partition the global memory contents into tiles
Focus the computation of the threads on one or a small number of tiles at each point in time
The basic idea of tiling
In a congested traffic system, significantly reducing the number of vehicles can greatly improve the latency seen by all vehicles
Carpooling requires synchronization
In tiling
Outline of the tiling technique
Objectives
Understand the design of a tiled parallel algorithm for matrix multiplication
Barrier synchronization
Synchronizes all threads in a block
All threads in the same block must reach the __syncthreads() before any of them can move on
Best used to coordinate the phased execution of a tiled algorithm
Tiled matrix multiplication kernel
Objectives
Learn to write a tiled matrix multiplication kernel
Loading the tiles
Loading Input Tile 0 of M (Phase 0) Loading Input Tile 0 of N (Phase 0)
Loading Input Tile 1 of M (Phase 1) Loading Input Tile 1 of N (Phase 1)
M and N are dynamically allocated – use 1D indexing
M[Row][p*TILE_WIDTH+tx]  —>  M[Row*Width + p*TILE_WIDTH + tx]
N[p*TILE_WIDTH+ty][Col]  —>  N[(p*TILE_WIDTH+ty)*Width + Col]
Tiled matrix multiplication kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width){
__shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
__shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;
int Row = by * blockDim.y + ty;
int Col = bx * blockDim.x + tx;
float Pvalue = 0;
// Loop over the M and N tiles required to compute the P element
for (int p = 0; p < Width/TILE_WIDTH; ++p) {
// Collaborative loading of M and N tiles into shared memory
ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
ds_N[ty][tx] = N[(p*TILE_WIDTH+ty)*Width + Col];
__syncthreads(); // wait until every thread in the block has loaded its tile element
for (int i = 0; i < TILE_WIDTH; ++i){
Pvalue += ds_M[ty][i] * ds_N[i][tx];
}
__syncthreads(); // wait until every thread has finished using the tiles before they are overwritten in the next phase
}
P[Row*Width+Col] = Pvalue;
}
Tile (thread block) size considerations
Each thread block should have many threads
With TILE_WIDTH = 16, in each phase each block performs 2×256 = 512 float loads from global memory for 256 × (2×16) = 8,192 mul/add operations (16 floating-point operations per memory load)
With TILE_WIDTH = 32, in each phase each block performs 2×1024 = 2,048 float loads from global memory for 1,024 × (2×32) = 65,536 mul/add operations (32 floating-point operations per memory load)
Shared memory and threads
For an SM with 16 KB of shared memory (worked example below)
Each __syncthreads() can reduce the number of active threads for a block
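A worked example, assuming TILE_WIDTH = 16 and 4-byte floats: each block uses two 16×16 tiles, i.e. 2 × 16 × 16 × 4 B = 2 KB of shared memory, so the 16 KB budget alone would allow up to 8 blocks per SM; in practice the thread limit then becomes the binding constraint (e.g. on Fermi, 1,536 threads per SM allows at most 6 blocks of 256 threads).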
Handling matrices of arbitrary size
A "simple" solution
Boundary conditions when loading an input M tile
Boundary conditions when loading an input N tile
__global__ void MatrixMul_sm(int* p_A, int* p_B, int* p_C, int A_row, int A_col, int B_col) {
__shared__ double sharedM[TILE_SIZE][TILE_SIZE];
__shared__ double sharedN[TILE_SIZE][TILE_SIZE];
int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;
int row = by * TILE_SIZE + ty;
int col = bx * TILE_SIZE + tx;
int Pvalue = 0;
for (int i = 0; i < (int)(ceil((double)A_col / TILE_SIZE)); i++) {
if (i * TILE_SIZE + tx < A_col && row < A_row) {
sharedM[ty][tx] = p_A[row * A_col + i * TILE_SIZE + tx];
}else {
sharedM[ty][tx] = 0;
}
if (i * TILE_SIZE + ty < A_col && col < B_col) {
sharedN[ty][tx] = p_B[(i * TILE_SIZE + ty) * B_col + col];
}else {
sharedN[ty][tx] = 0;
}
__syncthreads();
for (int j = 0; j < TILE_SIZE; j++) {
Pvalue += sharedM[ty][j] * sharedN[j][tx];
}
__syncthreads();
}
if (row < A_row && col < B_col) {
p_C[row * B_col + col] = Pvalue;
}
}
A simplified version
for(int p = 0; p < (Width - 1) / TILE_WIDTH + 1; p++){
if(Row < Width && p * TILE_WIDTH + tx < Width){
ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
}else{
ds_M[ty][tx] = 0.0;
}
if(p*TILE_WIDTH + ty < Width && Col < Width){
ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
}else{
ds_N[ty][tx] = 0.0;
}
__syncthreads();
if(Row < Width && Col < Width){
for(int i = 0;i < TILE_WIDTH; ++i){
Pvalue += ds_M[ty][i]*ds_N[i][tx];
}
}
__syncthreads();
}
if(Row < Width && Col < Width){
P[Row*Width + Col] = Pvalue;
}
Key points
Handling general rectangular matrices
Objectives
Understand how CUDA threads execute on SIMD hardware
Warps as scheduling units
Each block is divided into 32-thread warps
Warps in multi-dimensional thread blocks
The thread block is first linearized into 1D in row-major order
The linearized block is then partitioned into warps
The SM is a SIMD processor
The control unit for instruction fetch, decode, and control is shared among multiple processing units
SIMD execution among the threads of a warp
All threads in a warp must execute the same instruction at any point in time
This works efficiently when all threads follow the same control-flow path
Control Divergence
Control divergence occurs when threads in a warp take different control-flow paths by making different control decisions
The execution of threads taking different paths is serialized on current GPUs
Control divergence example
Analyze a vector of 1,000 elements
Assume a block size of 256 threads (so 4 blocks, i.e. 1,024 threads, are launched)
All threads in blocks 0, 1, and 2 are within the valid range
Most warps in block 3 will not have control divergence (threads 768–991 are all within range)
One warp in block 3 will have control divergence (threads 992–1,023 straddle the boundary at element 1,000)
The performance impact of serializing this one divergent warp will be small
Objectives
Learn to analyze the performance impact of control divergence
The performance impact of control divergence
Boundary-condition checks are essential for the full functionality and robustness of parallel code
Two types of blocks when loading M tiles
Analysis of the impact of control divergence
Control divergence when loading M tiles (Type 1)
Control divergence when loading M tiles (Type 2)
Note: Type 1 and Type 2 refer to the figure above
The overall impact of control divergence
Additional notes
Objectives
Understand that memory bandwidth is a first-order performance factor in a massively parallel processor.
The DRAM core array is slow
Reading from a cell in the core array is a very slow process
DDR: core speed = ½ interface speed
DDR2/GDDR3: core speed = ¼ interface speed
DDR3/GDDR4: core speed = ⅛ interface speed – …
Likely to get worse in the future
DRAM bursting
For a DDR{2,3} SDRAM core clocked at 1/N the interface speed:
Load (N × interface width) DRAM bits from the same row at once into an internal buffer, then transfer them in N steps at interface speed
DDR3/GDDR4: buffer width = 8× interface width
DRAM burst timing example
Modern DRAM systems are designed to always be accessed in burst mode. The burst bytes are transferred to the processor, but are discarded when the accesses are not to consecutive locations.
DRAM Bursting with Banking
Objectives
Understand that memory coalescing is important for effectively utilizing memory bandwidth in CUDA.
DRAM bursts – a system view
The address space is partitioned into burst sections
Basic example: a 16-byte address space with 4-byte burst sections
Coalesced memory accesses
When all threads of a warp execute a load instruction, if every accessed location falls into the same burst section, only one DRAM request is made and the access is fully coalesced.
Un-coalesced memory accesses
When the accessed locations span burst section boundaries:
Some of the bytes that are accessed and transferred are not used by the threads
The two access patterns of basic matrix multiplication
i is the loop counter of the inner-product loop in the kernel code; A is m × n, B is n × k, and Col = blockIdx.x*blockDim.x + threadIdx.x
Accesses to the B matrix are coalesced; accesses to the A matrix are not.
Objectives
Learn the parallel histogram computation pattern
A text histogram example
A simple parallel histogram algorithm
Input partitioning affects memory access efficiency
Sectioned partitioning results in poor memory access efficiency
Change to interleaved partitioning
Interleaved partitioning
For coalescing and better memory access performance
Objectives
Understand data races in parallel computing
The purpose of atomic operations – ensuring correct results
The result we want:
Objectives
Learn to use atomic operations in parallel programs
Key concepts of atomic operations
A read-modify-write operation performed on a memory location by a single hardware instruction
The hardware ensures that no other thread can perform another read-modify-write operation on the same location until the current atomic operation is complete
Atomic Add
int atomicAdd(int* address, int val);
Reads the 32-bit word old from the location pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.
unsigned int atomicAdd(unsigned int* address, unsigned int val);
Unsigned 32-bit integer atomic add.
unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
Unsigned 64-bit integer atomic add.
float atomicAdd(float* address, float val);
Single-precision floating-point atomic add (compute capability ≥ 2.0)
A basic text histogram kernel
- The kernel receives a pointer to the input buffer of byte values
- Each thread processes the input in a strided pattern
__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
// stride is total number of threads
int stride = blockDim.x * gridDim.x;
// All threads handle blockDim.x * gridDim.x
// consecutive elements
while (i < size) {
atomicAdd(&(histo[buffer[i]]), 1);
i += stride;
}
}
- The kernel receives a pointer to the input buffer of byte values
- Each thread processes the input in a strided pattern; this variant bins the letters 'a'–'z' into 7 four-letter bins
__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
// stride is total number of threads
int stride = blockDim.x * gridDim.x;
// All threads handle blockDim.x * gridDim.x
// consecutive elements
while (i < size) {
int alphabet_position = buffer[i] - 'a';
if (alphabet_position >= 0 && alphabet_position < 26)
atomicAdd(&(histo[alphabet_position/4]), 1);
i += stride;
}
}
Objectives
Understand the main performance considerations of atomic operations
Learn to write high-performance kernels by privatizing outputs (a sketch follows below)
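A minimal sketch of a privatized version of the letter-histogram kernel above (assuming the same buffer/size/histo parameters; not verbatim from the notes): each block accumulates into a private 7-bin copy in shared memory, then merges it into the global histogram with far fewer global atomics.
__global__ void histo_privatized_kernel(unsigned char *buffer, long size, unsigned int *histo) {
  __shared__ unsigned int histo_private[7]; // one private histogram per block
  if (threadIdx.x < 7) histo_private[threadIdx.x] = 0;
  __syncthreads();
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
  while (i < size) {
    int alphabet_position = buffer[i] - 'a';
    if (alphabet_position >= 0 && alphabet_position < 26)
      atomicAdd(&(histo_private[alphabet_position/4]), 1); // cheap shared-memory atomic
    i += stride;
  }
  __syncthreads();
  if (threadIdx.x < 7)
    atomicAdd(&(histo[threadIdx.x]), histo_private[threadIdx.x]); // one global atomic per bin per block
}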
Objectives
Learn convolution, an important parallel computation pattern
Convolution boundary conditions
1D:
A 1D convolution kernel with boundary condition handling
This kernel treats all elements outside the valid input range as 0
__global__ void convolution_1D_basic_kernel(float *N, float *M, float *P, int Mask_Width, int Width) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
float Pvalue = 0;
int N_start_point = i - (Mask_Width/2);
if (i < Width) {
for (int j = 0; j < Mask_Width; j++) {
if (N_start_point + j >= 0 && N_start_point + j < Width) {
Pvalue += N[N_start_point + j]*M[j];
}
}
P[i] = Pvalue;
}
}
2D:
A 2D convolution kernel with boundary condition handling
This kernel treats all elements outside the valid input range as 0
__global__ void convolution_2D_basic_kernel(unsigned char * in, unsigned char * mask, unsigned char * out, int maskwidth, int w, int h) {
int Col = blockIdx.x * blockDim.x + threadIdx.x;
int Row = blockIdx.y * blockDim.y + threadIdx.y;
if (Col < w && Row < h) {
int pixVal = 0;
int N_start_col = Col - (maskwidth/2);
int N_start_row = Row - (maskwidth/2);
// Accumulate the weighted sum over the surrounding box
for(int j = 0; j < maskwidth; ++j) {
for(int k = 0; k < maskwidth; ++k) {
int curRow = N_start_row + j;
int curCol = N_start_col + k;
// Verify we have a valid image pixel
if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
pixVal += in[curRow * w + curCol] * mask[j*maskwidth+k];
}
}
}
// Write our new pixel value out
out[Row * w + Col] = (unsigned char)(pixVal);
}
}
Objectives
Understand the tiled convolution algorithm
Input data requirements
Suppose we want each block to compute T output elements
Two design options
Design 2
float output = 0.0f;
if((index_i >= 0) && (index_i < Width)) {
Ns[tx] = N[index_i];
} else {
Ns[tx] = 0.0f;
}
Some threads do not participate in computing the output
if (threadIdx.x < O_TILE_WIDTH){
// Only threads 0 through O_TILE_WIDTH-1 participate in computing the output.
output = 0.0f;
for(j = 0; j < Mask_Width; j++) {
output += M[j] * Ns[j + threadIdx.x];
}
// index_o = blockIdx.x*O_TILE_WIDTH + threadIdx.x
P[index_o] = output;
}
Setting the block size
#define O_TILE_WIDTH 1020
#define Mask_Width 5
#define BLOCK_WIDTH (O_TILE_WIDTH + Mask_Width - 1)
dim3 dimBlock(BLOCK_WIDTH, 1, 1);
dim3 dimGrid((Width - 1) / O_TILE_WIDTH + 1, 1, 1);
Objectives
Learn to write a 2D convolution kernel
2D image matrices with automatic padding
Row-major layout with pitch
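A minimal sketch (assuming a float image of width × height) of how a pitched allocation is obtained with the CUDA runtime; cudaMallocPitch pads each row so that rows start at aligned addresses:
float *d_img;
size_t pitch; // pitch is returned in bytes
cudaMallocPitch((void **)&d_img, &pitch, width * sizeof(float), height);
// element (row, col) is then addressed as:
// float *row_ptr = (float *)((char *)d_img + row * pitch);
// float value = row_ptr[col];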
Setting the block size
#define O_TILE_WIDTH 12
#define Mask_Width 5
#define BLOCK_WIDTH (O_TILE_WIDTH + Mask_Width - 1)
dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH, 1);
dim3 dimGrid((Width - 1) / O_TILE_WIDTH + 1, (Height - 1) / O_TILE_WIDTH + 1, 1);
Using constant memory and caching for the mask
// partial code
__global__ void convolution_2D_kernel(float *P, float *N, int height, int width, int channels,
const float * __restrict__ M) {
__shared__ float Ns[TILE_SIZE+MAX_MASK_WIDTH-1][TILE_SIZE+MAX_MASK_HEIGHT-1];
int tx = threadIdx.x;
int ty = threadIdx.y;
int row_o = blockIdx.y*O_TILE_WIDTH + ty;
int col_o = blockIdx.x*O_TILE_WIDTH + tx;
int row_i = row_o - 2; // halo offset of 2 on each side; assumes a 5-wide mask
int col_i = col_o - 2;
if((row_i >= 0) && (row_i < height) && (col_i >= 0) && (col_i < width)) {
Ns[ty][tx] = N[row_i * width + col_i];
} else{
Ns[ty][tx] = 0.0f;
}
float output = 0.0f;
if(ty < O_TILE_WIDTH && tx < O_TILE_WIDTH){
for(int i = 0; i < MASK_WIDTH; i++) {
for(int j = 0; j < MASK_WIDTH; j++) {
output += M[i*MASK_WIDTH+j] * Ns[i+ty][j+tx];
}
}
if(row_o < height && col_o < width)
P[row_o*width + col_o] = output;
}
}
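An alternative sketch (assuming MASK_WIDTH is 5): the mask can also be placed in constant memory and copied there from the host, instead of being passed as a __restrict__ pointer argument:
#define MASK_WIDTH 5
__constant__ float M_c[MASK_WIDTH][MASK_WIDTH];
// host side, after filling h_M:
// cudaMemcpyToSymbol(M_c, h_M, MASK_WIDTH * MASK_WIDTH * sizeof(float));
// the kernel then reads M_c[i][j] directly; constant memory is cached and broadcasts
// efficiently when all threads in a warp read the same mask element.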
Reduction factor:
What is a bank conflict?
Banks in GPU shared memory
For example, a thread block declares the following shared memory: __shared__ float sData[32][32];
Bank conflict
No Bank Conflicts
Multi-way Bank Conflicts
Parallel reduction
The parallel reduction kernel
Version with branch divergence
Each thread block takes 2*blockDim.x input elements
Each thread loads 2 elements into shared memory
Assume the warp size is 8 and there are 8 banks; do bank conflicts occur when stride = 1?
sdata[8] and sdata[9], accessed by thread 4, map to bank[0] and bank[1],
while sdata[0] and sdata[1], accessed by thread 0, also map to bank[0] and bank[1],
so bank conflicts occur!
__shared__ float partialSum[2*BLOCK_SIZE];
unsigned int t = threadIdx.x;
unsigned int start = 2*blockIdx.x*blockDim.x;
partialSum[t] = input[start + t];
partialSum[blockDim.x + t] = input[start + blockDim.x + t];
for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
__syncthreads();
if (t % stride == 0)
partialSum[2*t]+= partialSum[2*t+stride];
}
Version without branch divergence
for (unsigned int stride = blockDim.x; stride > 0; stride /= 2) {
__syncthreads();
if (t < stride)
partialSum[t] += partialSum[t+stride];
}
Transpose (matrix transpose)
Naive transpose
Reads are coalesced; writes are not (adjacent threads write with a stride of the matrix height)
__global__ void transposeNaive(float *odata, float *idata, int width, int height){
//TILE_DIM = blockDim.x;
int col = blockIdx.x * TILE_DIM + threadIdx.x ;
int row = blockIdx.y * TILE_DIM + threadIdx.y ;
int index_in = col + width * row ;
int index_out= row + height * col ;
odata[index_out] = idata[index_in] ;
}
Transpose via shared memory (kernel)
First store the tile of elements handled by a warp into shared memory, then write them to global memory as contiguous data.
__syncthreads() is needed because each thread uses data that other threads loaded from global memory into shared memory.
Assume the warp size is 32, there are 32 banks, and the tile is 32x32; do bank conflicts occur?
1 - Data in the same column is stored in the same bank
2 - Reading one column of the tile for a warp therefore causes a 32-way bank conflict
__global__ void transposeCoalesced(float *odata, float *idata, int width, int height){
__shared__ float tile[TILE_DIM][TILE_DIM];
int col = blockIdx.x * TILE_DIM + threadIdx.x;
int row = blockIdx.y * TILE_DIM + threadIdx.y;
int index_in = col + row * width;
col = blockIdx.y * TILE_DIM + threadIdx.x;
row = blockIdx.x * TILE_DIM + threadIdx.y;
int index_out = col + row * height;
tile[threadIdx.y][threadIdx.x] = idata[index_in];
__syncthreads();
odata[index_out] = tile[threadIdx.x][threadIdx.y];
}
Solution – pad the shared memory array
__shared__ float tile[TILE_DIM][TILE_DIM+1];
After padding, the elements that share a bank lie along an anti-diagonal, so reading a column of the tile no longer causes conflicts
__global__ void transposeCoalesced(float *odata, float *idata, int width, int height){
__shared__ float tile[TILE_DIM][TILE_DIM + 1];
int col = blockIdx.x * TILE_DIM + threadIdx.x;
int row = blockIdx.y * TILE_DIM + threadIdx.y;
int index_in = col + row * width;
col = blockIdx.y * TILE_DIM + threadIdx.x;
row = blockIdx.x * TILE_DIM + threadIdx.y;
int index_out = col + row * height;
tile[threadIdx.y][threadIdx.x] = idata[index_in];
__syncthreads();
odata[index_out] = tile[threadIdx.x][threadIdx.y];
}
Loop unrolling
Instructions in each loop iteration: one floating-point multiply and one floating-point add
Other instructions in each iteration: loop counter update, branch, and address arithmetic mixed in
Per iteration this amounts to:
2 floating-point instructions
1 loop branch instruction
2 address arithmetic instructions
1 loop counter increment instruction
for (int k = 0; k < BLOCK_SIZE; ++k)
{
Pvalue += Ms[ty][k] * Ns[k][tx];
}
Unrolled:
No more loop
No more loop counter updates
No more branches
Constant indices – no more address arithmetic
Pvalue +=
Ms[ty][0] * Ns[0][tx] +
Ms[ty][1] * Ns[1][tx] +
...
Ms[ty][15] * Ns[15][tx]; // BLOCK_SIZE = 16
Loop unrolling: automating it
What is the downside of loop unrolling? Register usage goes up substantially
#pragma unroll BLOCK_SIZE
for (int k = 0; k < BLOCK_SIZE; ++k)
{
Pvalue += Ms[ty][k] * Ns[k][tx];
}
Objectives
Learn the parallel reduction pattern
A parallel reduction tree algorithm performs N-1 operations in log(N) steps
Quick analysis
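For example, with N = 8 inputs the tree performs 4 + 2 + 1 = 7 = N - 1 additions in log2(8) = 3 steps.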
Objectives
Master the parallel scan (prefix sum) algorithm
Inclusive scan (prefix sum) definition
Definition: a scan operation takes a binary associative operator ⊕ and an array of n elements [x0, x1, …, xn-1],
and returns the array [x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-1)].
Example: if ⊕ is addition, then scanning the array [3 1 7 0 4 1 6 3] returns [3 4 11 11 15 16 22 25].
Inclusive sequential addition scan
Given a sequence [x0, x1, x2, … ]
Calculate output [y0, y1, y2, … ]
Such that y0 = x0, y1 = x0 + x1, y2 = x0 + x1 + x2, …
Using the recursive definition yi = yi-1 + xi
A work-efficient C implementation
y[0] = x[0];
for (i = 1; i < Max_i; i++) y[i] = y [i-1] + x[i];
Objectives
Learn to write and analyze a high-performance scan kernel
A better parallel scan algorithm
Kernel
__global__ void work_inefficient_scan_kernel(float *X, float *Y, int InputSize) {
__shared__ float XY[SECTION_SIZE];
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < InputSize) {
XY[threadIdx.x] = X[i];
}
// the code below performs iterative scan on XY
for (unsigned int stride = 1; stride <= threadIdx.x; stride *= 2) {
__syncthreads();
float in1 = XY[threadIdx.x - stride];
__syncthreads();
XY[threadIdx.x] += in1;
}
__syncthreads();
if (i < InputSize) {
Y[i] = XY[threadIdx.x];
}
}
Objectives
Learn to write a work-efficient scan kernel
Parallel scan – the reduction phase
// XY[2*BLOCK_SIZE] is in shared memory
for (unsigned int stride = 1;stride <= BLOCK_SIZE; stride *= 2)
{
int index = (threadIdx.x+1)*stride*2 - 1;
if(index < 2*BLOCK_SIZE)
XY[index] += XY[index-stride];
__syncthreads();
}
put it together
Kernel
for (unsigned int stride = BLOCK_SIZE/2; stride > 0; stride /= 2) {
__syncthreads();
int index = (threadIdx.x+1)*stride*2 - 1;
if(index+stride < 2*BLOCK_SIZE) {
XY[index + stride] += XY[index];
}
}
__syncthreads();
if (i < InputSize)
Y[i] = XY[threadIdx.x];
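Putting the two phases together, a hedged sketch (assuming blockDim.x == BLOCK_SIZE and one section of 2*BLOCK_SIZE elements per block) of the complete work-efficient scan kernel:
__global__ void work_efficient_scan_kernel(float *X, float *Y, int InputSize) {
  __shared__ float XY[2*BLOCK_SIZE];
  int i = 2 * blockIdx.x * blockDim.x + threadIdx.x;
  // each thread loads two elements of the section, padding with 0 past the end
  XY[threadIdx.x] = (i < InputSize) ? X[i] : 0.0f;
  XY[threadIdx.x + blockDim.x] = (i + blockDim.x < InputSize) ? X[i + blockDim.x] : 0.0f;
  // reduction (up-sweep) phase
  for (unsigned int stride = 1; stride <= BLOCK_SIZE; stride *= 2) {
    __syncthreads();
    int index = (threadIdx.x+1)*stride*2 - 1;
    if (index < 2*BLOCK_SIZE)
      XY[index] += XY[index-stride];
  }
  // post-reduction (down-sweep) phase
  for (unsigned int stride = BLOCK_SIZE/2; stride > 0; stride /= 2) {
    __syncthreads();
    int index = (threadIdx.x+1)*stride*2 - 1;
    if (index+stride < 2*BLOCK_SIZE)
      XY[index + stride] += XY[index];
  }
  __syncthreads();
  // each thread writes back its two results
  if (i < InputSize) Y[i] = XY[threadIdx.x];
  if (i + blockDim.x < InputSize) Y[i + blockDim.x] = XY[threadIdx.x + blockDim.x];
}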
1. If we want to allocate an array of v integer elements in CUDA device global memory, what would be an appropriate expression for the second argument of the cudaMalloc() call?
A n
B v
C n*sizeof(int)
D v*sizeof(int)
Correct answer: D (cudaMalloc() takes the allocation size in bytes, so v*sizeof(int))
2. If we want to allocate an array of n floating-point elements and have a floating-point pointer variable d_A point to the allocated memory, what would be an appropriate expression for the first argument of the cudaMalloc() call?
A n
B (void *) d_A
C *d_A
D (void**)&d_A
Correct answer: D
3. If we want to copy 3000 bytes of data from host array h_A (h_A is a pointer to element 0 of the source array) to device array d_A (d_A is a pointer to element 0 of the destination array), what would be an appropriate API call for this in CUDA?
A cudaMemcpy(3000, h_A, d_A, cudaMemcpyHostToDevice);
B cudaMemcpy(h_A, d_A, 3000, cudaMemcpyDeviceToHost);
C cudaMemcpy(d_A, h_A, 3000, cudaMemcpyHostToDevice);
D cudaMemcpy(3000, d_A, h_A, cudaMemcpyHostToDevice);
Correct answer: C
4. How would one declare a variable err that can appropriately receive the returned value of a CUDA API call?
A int err;
B cudaError err;
C cudaError_t err;
D cudaSuccess_t err;
Correct answer: C
A new summer intern was frustrated with CUDA. He has been complaining that CUDA is very tedious: he had to declare many functions that he plans to execute on both the host and the device twice, once as a host function and once as a device function. What is your response?
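One possible response (a hedged example, not from the original notes): CUDA lets a single definition carry both qualifiers, so the compiler generates a host version and a device version from the same source, e.g.
__host__ __device__ float square(float x) { return x * x; } // hypothetical example; callable from both host and device code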
1 If we need to use each thread to calculate one output element of a vector addition, what would be the expression for mapping the thread/block indices to the data index:
A i=threadIdx.x + threadIdx.y;
B i=blockIdx.x + threadIdx.x;
C i=blockIdx.x * blockDim.x + threadIdx.x;
D i=blockIdx.x * threadIdx.x;
Correct answer: C
2 We want to use each thread to calculate two (adjacent) output elements of a vector addition. Assume that variable i should be the index for the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index of the first element?
A i = blockIdx.x*blockDim.x + threadIdx.x + 2;
B i = blockIdx.x*threadIdx.x*2
C i = (blockIdx.x*blockDim.x + threadIdx.x)*2
D i = blockIdx.x*blockDim.x*2 + threadIdx.x
Correct answer: C
3 We want to use each thread to calculate two output elements of a vector addition. Each thread block processes 2*blockDim.x consecutive elements that form two sections. All threads in each block will first process a section, each processing one element. They will then all move to the next section, again each processing one element. Assume that variable i should be the index for the first element to be processed by a thread. What would be the expression for mapping the thread/block indices to the data index of the first element?
A i = blockIdx.x*blockDim.x + threadIdx.x + 2;
B i = blockIdx.x*threadIdx.x*2
C i = (blockIdx.x*blockDim.x + threadIdx.x)*2
D i = blockIdx.x*blockDim.x*2 + threadIdx.x
Correct answer: D
4 For a vector addition, assume that the vector length is 8000, each thread calculates one output element, and the thread block size is 1024 threads. The programmer configures the kernel launch to have a minimal number of thread blocks to cover all output elements. How many threads will be in the grid?
A 8000
B 8196
C 8192
D 8200
Correct answer: C
5 On the Fermi architecture (one of NVIDIA's GPU architectures), each SM can accept 1536 threads and 8 blocks. Which of the following configurations achieve 100% SM utilization? ( )
A int threadPerBlock=128;
B int threadPerBlock=256;
C int threadPerBlock=512;
D int threadPerBlock=1024;
Correct answer: BC
6 On the Kepler architecture (one of NVIDIA's GPU architectures), each SM can accept 2048 threads and 16 blocks. Which of the following configurations achieve 100% SM utilization? ( )
A int threadPerBlock=128;
B int threadPerBlock=256;
C int threadPerBlock=512;
D int threadPerBlock=1024;
Correct answer: ABCD
1 Assume that a kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?
A 1;
B 1,000
C 512
D 512,000
Correct answer: B
1 We are to process a 600x800 (800 pixels in the x or horizontal direction, 600 pixels in the y or vertical direction) picture with the PictureKernel(). That is, m's value is 600 and n's value is 800.
__global__ void PictureKernel(float* d_Pin, float* d_Pout, int n, int m) {
// Calculate the row # of the d_Pin and d_Pout element to process
int Row = blockIdx.y*blockDim.y + threadIdx.y;
// Calculate the column # of the d_Pin and d_Pout element to process
int Col = blockIdx.x*blockDim.x + threadIdx.x;
// each thread computes one element of d_Pout if in range
if ((Row < m) && (Col < n)) {
d_Pout[Row*n+Col] = 2*d_Pin[Row*n+Col];
}
}
Assume that we decided to use a grid of 16X16 blocks. That is, each block is organized as a 2D
16X16 array of threads. How many warps will be generated during the execution of the kernel?
A 37*16
B 38*50
C 38*8*50
D 38*50*2
Correct answer: C
2 In Question 1, how many warps will have control divergence?
A 37 +50*8
B 38*16
C 50
D 0
Correct answer: D
3 In Question 1, if we are to process an 800x600 picture (600 pixels in the x or horizontal direction and 800 pixels in the y or vertical direction), how many warps will have control divergence?
A 37 + 50*8
B 38*16
C 50*8
D 0
Correct answer: C
4 In Question 1, if we are to process a 799x600 picture (600 pixels in the x direction and 799 pixels in the y direction), how many warps will have control divergence?
A 37 + 50*8
B (37 + 50)*8
C 50*8
D 0
Correct answer: A
1 Assume that we want to use each thread to calculate two (adjacent) output elements of a vector addition. Assume that variable i should be initialized with the index for the first element to be processed by a thread. Which of the following should be used for such initialization to allow correct, coalesced memory accesses to these first elements in the following statement? if(i < n) C[i] = A[i] + B[i];
A int i=(blockIdx.x*blockDim.x)*2 + threadIdx.x;
B int i=(blockIdx.x*blockDim.x + threadIdx.x)*2;
C int i=(threadIdx.x*blockDim.x)*2 + blockIdx.x;
D int i=(threadIdx.x*blockDim.x + blockIdx.x)*2;
Correct answer: A
2 Continuing from Question 1, what would be the correct statement for each thread to process the second element?
A if (i
D if(i+blockDim.x < n) C[i+blockDim.x]=A[i+blockDim.x] + B[i+ blockDim.x];
Correct answer: D
3 Assume the following simple matrix multiplication kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width){
int Row = blockIdx.y*blockDim.y+threadIdx.y;
int Col = blockIdx.x*blockDim.x+threadIdx.x;
if ((Row < Width) && (Col < Width)){
float Pvalue = 0;
for (int k = 0; k < Width; ++k){
Pvalue += M[Row*Width+k]*N[k*Width+Col];
}
P[Row*Width+Col] = Pvalue;
}
}
Based on the judging criterion in Lecture 6.2, which of the following is true?
A M[Row*Width+k] and N[k*Width+Col] are coalesced but P[Row*Width+Col] is not
B M[Row*Width+k], N[k*Width+Col] and P[Row*Width+Col] are all coalesced
C M[Row*Width+k] is not coalesced but N[k*Width+Col] and P[Row*Width+Col] both are
D M[Row*Width+k] is coalesced but N[k*Width+Col] and P[Row*Width+Col] are not
Correct answer: C
4 For the tiled single-precision matrix multiplication kernel in question 3, assume that each thread block is 32×32 and the system has a DRAM burst size of 128 bytes. How many DRAM bursts will be delivered to the processor as a result of loading one A-matrix tile by a thread block (one k step)? Keep in mind that each single-precision floating-point number is four bytes.
A 16
B 32
C 64
D 128
Correct answer: B
Explanation: each burst holds 128/4 = 32 single-precision values, i.e. one 32-element row of the tile, so loading the 32×32 tile takes 32 bursts.
1 Assume that each atomic operation in a DRAM system has a total latency of 100ns. What is the maximal throughput we can get for atomic operations on the same global memory variable?
A 100G atomic operations per second
B 1G atomic operations per second
C 0.01G atomic operations per second
D 0.0001G atomic operations per second
Correct answer: C
2 For a processor that supports atomic operations in the L2 cache, assume that each atomic operation takes 4ns to complete in the L2 cache and 100ns to complete in DRAM. Assume that 90% of the atomic operations hit in the L2 cache. What is the approximate throughput for atomic operations on the same global memory variable?
A 0.225G atomic operations per second
B 2.75G atomic operations per second
C 0.0735G atomic operations per second
D 100G atomic operations per second
Correct answer: C
3 In question 1, assume that a kernel performs 5 floating-point operations per atomic operation. What is the maximal floating-point throughput of the kernel execution as limited by the throughput of the atomic operations?
A 500 GFLOPS
B 5 GFLOPS
C 0.05 GFLOPS
D 0.0005 GFLOPS
Correct answer: C
4 In Question 1, assume that we privatize the global memory variable into shared memory variables in the kernel and the shared memory access latency is 1ns. All original global memory atomic operations are converted into shared memory atomic operations. For simplicity, assume that the additional global memory atomic operations for accumulating the privatized variables into the global variable add 10% to the total execution time. Assume that a kernel performs 5 floating-point operations per atomic operation. What is the maximal floating-point throughput of the kernel execution as limited by the throughput of the atomic operations?
A 4500 GFLOPS
B 45 GFLOPS
C 4.5 GFLOPS
D 0.45 GFLOPS
Correct answer: C
5 We want to perform an atomic add operation to add the value of an integer variable Partial to a global memory integer variable Total. Which one of the following statements should be used?
A atomicAdd(Total, 1);
B atomicAdd(&Total,&Partial);
C atomicAdd(Total, &Partial);
D atomicAdd(&Total,Partial);
Correct answer: D