These notes are organized around the Turing family of graphics cards.
Streaming Multiprocessor (SM): the processing cores of the GPU.
Graphics Processing Clusters (GPCs)
Texture Processing Clusters (TPCs)
Raster Operations Units (ROPs): the rasterization units. Rasterization happens after the model has been fully built and basic lighting and the corresponding textures have been applied. Besides meeting the coordinate-transformation requirements of 2D output, the rasterizer's biggest contribution is occlusion handling: because of perspective, much of the built model is hidden from view, and by testing Z values the rasterization stage culls those invisible parts.
| Product | RTX 2080 Ti | RTX 2080 | RTX 2070 | RTX 2060 | GTX 1660 Ti |
|---|---|---|---|---|---|
| GPU die | TU102 | TU104 | TU106 | TU106 | TU116 |
The TU104 and TU106 GPUs utilize the same basic architecture as TU102, scaled down to different degrees for different usage models and market segments.
Graphics Processing Cluster (GPC)
├── 6 × Texture Processing Cluster (TPC)
│   └── 2 × Streaming Multiprocessor (SM), each containing:
│       ├── 64 CUDA cores (64 FP32 cores + 64 INT32 cores, plus 2 FP64 units)
│       ├── 8 mixed-precision Tensor Cores
│       ├── 1 RT Core
│       ├── 64K 32-bit registers (a 256 KB register file)
│       ├── 4 texture units
│       ├── 96 KB of L1/shared memory
│       ├── 16 special function units for single-precision floating-point transcendentals
│       └── 4 warp schedulers
Each 32-bit memory controller is paired with:
├── 8 ROP units
└── 512 KB of L2 cache
Optimal performance is covered in the performance-optimization notes further down.

Partial deviceQuery output:

lsy@lsy-MS-7B79:~$ cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery
lsy@lsy-MS-7B79:/usr/local/cuda-10.0/samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce RTX 2060"
CUDA Driver Version / Runtime Version 10.1 / 10.0
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 5901 MBytes (6188105728 bytes)
(30) Multiprocessors, ( 64) CUDA Cores/MP: 1920 CUDA Cores
GPU Max Clock rate: 1710 MHz (1.71 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 3145728 bytes (3M bytes)
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes(64 KBs)
Total amount of shared memory per block: 49152 bytes(48 KBs)
Total number of registers available per block: 65536(64*4 KBs)
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 29 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
The corresponding output on a machine with a GTX 1660 Ti:
Device 0: "GeForce GTX 1660 Ti"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 5945 MBytes (6233391104 bytes)
(24) Multiprocessors, ( 64) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 1455 MHz (1.46 GHz)
Memory Clock rate: 6001 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS
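The same values can also be queried programmatically through the runtime. A minimal sketch (error handling omitted; the fields printed here are just a selection of what cudaDeviceProp contains):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);                 // number of CUDA-capable devices
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);    // fill in the properties struct
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
        printf("  SMs: %d, shared mem/block: %zu bytes, regs/block: %d\n",
               prop.multiProcessorCount, prop.sharedMemPerBlock, prop.regsPerBlock);
        printf("  asyncEngineCount: %d, concurrentKernels: %d\n",
               prop.asyncEngineCount, prop.concurrentKernels);
    }
    return 0;
}
```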
Problems well suited to the GPU should have the following two characteristics:
GPU weaknesses: small caches and somewhat higher memory-access latency; weak logic/control (branching) capability.
An analogy. Goal: finish x math exam papers.
The CPU's compute unit = a math professor; the GPU's compute units = 20 high-school students; each paper must be completed by a single person.
A large CPU cache = after finishing one paper, fetching the next one takes very little time.
If x = 1, the CPU is faster.
If x = 1000, the GPU is faster: while a paper is being fetched, the 20 students almost always still have papers to work on, so the memory latency is hidden (even a lower latency would not help here).
Turing GPUs have compute capability sm_75. The compute capability encodes the features a GPU supports (e.g. dynamic parallelism, half-precision floating point) and its technical specifications (e.g. the maximum number of concurrently resident kernels, the number of 32-bit registers per SM).
CUDA provides a set of C-language extensions and a runtime library that help programmers familiar with C write programs that run on the GPU. Any source file that uses the C extensions must be compiled with nvcc. nvcc is a compiler driver: it offers gcc-like command-line options and invokes other tools to carry out the different stages of compilation.
Offline compilation: nvcc separates device code from host code and rewrites the <<<...>>> kernel-launch syntax into calls to the runtime functions that launch the kernel.
Just-in-time compilation: Any PTX code loaded by an application at runtime is compiled further to binary code by the device driver. This is called just-in-time compilation.
Runtime library contents: C functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, etc.
A multithreaded program (a kernel) is partitioned into thread blocks that execute independently.
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
__global__, threadIdx, and the <<<...>>> syntax are CUDA C extensions to the C language.
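For completeness, here is a hedged sketch of the host-side setup that the `...` above elides: allocating device memory, copying data, launching the kernel, and cleaning up through the runtime library. The size N and the h_/d_ names are illustrative.

```cpp
#include <cstdlib>
#include <cuda_runtime.h>
// Assumes the VecAdd kernel defined above.

int main() {
    const int N = 1024;
    size_t size = N * sizeof(float);

    // Host allocations (initialization omitted)
    float *h_A = (float*)malloc(size), *h_B = (float*)malloc(size), *h_C = (float*)malloc(size);

    // Device allocations
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size); cudaMalloc(&d_B, size); cudaMalloc(&d_C, size);

    // Host -> device transfers
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel invocation with N threads in a single block
    VecAdd<<<1, N>>>(d_A, d_B, d_C);

    // Device -> host transfer of the result
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```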
![Grid of thread blocks](image/grid-of-thread-blocks.png)
From largest to smallest: grid > block > thread. The corresponding built-in variables inside a kernel are blockIdx, blockDim, and threadIdx.
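A minimal sketch of how these built-in variables are typically combined into a global index; the 2D shapes (16x16 blocks) are only illustrative:

```cpp
// Each thread computes one element of an N x N matrix sum.
__global__ void MatAdd(const float* A, const float* B, float* C, int N) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x index within the grid
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y index within the grid
    if (row < N && col < N)
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}

// Launch: 16x16 blocks, enough of them to cover the matrix.
// dim3 threads(16, 16);
// dim3 blocks((N + 15) / 16, (N + 15) / 16);
// MatAdd<<<blocks, threads>>>(A, B, C, N);
```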
Thread blocks execute independently of one another. The threads within a block must all reside on the same SM; they can share that block's shared memory and synchronize through __syncthreads. Threads execute in units of warps (thread resources are also allocated per warp). The resources a block occupies (thread count, shared memory, registers per thread) affect execution efficiency; see the performance-optimization notes.
Threads are created, managed, scheduled, and executed in units of warps. Each thread has its own instruction address counter and register state, but (before compute capability 7.0) the warp shares a single program counter together with an active mask that marks which threads are active.
Before 7.0, the threads of a warp execute in lockstep, which gives implicit warp-wide synchronization. If the warp diverges, threads that are inactive on a given execution path still step through the instructions (wasting compute resources); they simply do not read or write data.
From 7.0 onward, every thread has its own program counter and call stack, so scheduling (and execution) happens at the granularity of individual threads; this makes better use of the compute resources and of communication between threads of the same warp. A schedule optimizer determines how to group active threads from the same warp together into SIMT units.
Independent Thread Scheduling: currently only devices with sm_7.0+ support it. It allows the threads of a warp to make progress without staying in lockstep. You can of course choose not to rely on this feature.
The warp vote functions take a predicate as input and compare it with 0:

int __all_sync(unsigned mask, int predicate);   // returns non-zero if predicate is non-zero for all active threads in mask
int __any_sync(unsigned mask, int predicate);   // mask has 32 bits; bit i marks whether lane i takes part in the synchronization
unsigned __ballot_sync(unsigned mask, int predicate);
unsigned __activemask();

__syncwarp() synchronizes the threads within a warp.

How many blocks can reside on one SM depends on the number of registers the kernel needs, the block's shared-memory size, and the SM's own resources and technical specifications. If a single SM cannot accommodate even one block, the kernel cannot execute.
A block is split into one or more warps for execution.
__syncthreads(): on devices before 7.0 this is effectively synchronization between warps, so it is enough for one thread of each warp to reach the barrier; from 7.0 onward it synchronizes individual threads.

The memory parameters below are based on the RTX 2060.
Classified by visibility:
Per thread: local memory (at most 512 KB per thread) and register data.
Per device: global memory.
In addition, two special memory spaces: texture memory and constant memory.
Automatic variables in device code (normally held in registers; a single thread can use at most 255 registers) may end up in local memory in three cases: dynamically indexed arrays; large arrays and structures; whatever else spills once registers run out. Compile with --ptxas-options=-v to see the usage.
Global memory: with -Xptxas -dlcm=ca, all global-memory data can be cached in the unified L1/texture cache.
Shared memory: has 32 banks that are organized such that successive 32-bit words map to successive banks; each bank has a bandwidth of 32 bits per clock cycle, and there is a broadcast mechanism when several threads read the same word.
Constant memory: served by a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory. A request is split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
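As a small illustration of block-shared memory, its banking, and __syncthreads, here is a hedged sketch; the segment length of 64 (two warps per block) is arbitrary:

```cpp
// Each block reverses a 64-element segment in shared memory.
// Consecutive threads touch consecutive 32-bit words, so the accesses
// fall in different banks and there are no bank conflicts.
// Launch with 64 threads per block.
__global__ void reverseSegment(float* data) {
    __shared__ float tile[64];
    int t = threadIdx.x;
    int base = blockIdx.x * 64;

    tile[t] = data[base + t];    // load the segment into shared memory
    __syncthreads();             // make all writes visible to the block
    data[base + t] = tile[63 - t];
}
```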
Difference: texture memory is read-only, surface memory is readable and writable; texture memory supports hardware filtering and interpolation.
Size: limited by the SM's technical specifications, e.g.
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Location: resides in device memory.
Cache: a dedicated cache; the cache working set per multiprocessor for texture memory is 32 KB.
Advantages: the texture cache is optimized for 2D spatial locality; dedicated units compute the addresses; broadcast; automatic filtering and interpolation.
Usage: there are two APIs for accessing texture and surface memory, the texture reference API (with restrictions) and the texture object API.
Texture fetching: CUDA arrays are optimized for texture fetching and, on the device, can only be accessed through texture fetches; linear memory gets no such optimization.
Read mode: cudaReadModeNormalizedFloat or cudaReadModeElementType
Address mode: cudaAddressModeBorder, cudaAddressModeClamp, cudaAddressModeWrap, and cudaAddressModeMirror
Filter mode: cudaFilterModePoint or cudaFilterModeLinear
A texture object is created with cudaCreateTextureObject() from a resource description of type struct cudaResourceDesc, which specifies the texture, and from a texture description of type struct cudaTextureDesc:
// Allocate CUDA array in device memory
cudaChannelFormatDesc channelDesc =
cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);
// Copy to device memory some data located at address h_data
// in host memory
cudaMemcpyToArray(cuArray, 0, 0, h_data, size, cudaMemcpyHostToDevice);
// Specify texture
struct cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypeArray;
resDesc.res.array.array = cuArray;
// Specify texture object parameters
struct cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.addressMode[0] = cudaAddressModeWrap;
texDesc.addressMode[1] = cudaAddressModeWrap;
texDesc.filterMode = cudaFilterModeLinear;
texDesc.readMode = cudaReadModeElementType;
texDesc.normalizedCoords = 1;
// Create texture object
cudaTextureObject_t texObj = 0;
cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
// Destroy texture object
cudaDestroyTextureObject(texObj);
// Free device memory
cudaFreeArray(cuArray);
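The snippet above only creates the texture object; below is a hedged sketch of a kernel that actually samples it. It assumes the normalized coordinates and linear filtering configured above, and the output layout is illustrative:

```cpp
// Read the texture with normalized coordinates and write the sampled
// values to a plain global-memory array.
__global__ void copyFromTexture(cudaTextureObject_t texObj, float* output,
                                int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float u = (x + 0.5f) / width;    // normalizedCoords = 1, so coordinates lie in [0, 1)
    float v = (y + 0.5f) / height;
    output[y * width + x] = tex2D<float>(texObj, u, v);
}
```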
texture reference API:
texture<DataType, Type, ReadMode> texRef;
texture<float, cudaTextureType2D, cudaReadModeElementType> texRef;
// Set texture reference parameters
texRef.addressMode[0] = cudaAddressModeWrap;
texRef.addressMode[1] = cudaAddressModeWrap;
texRef.filterMode = cudaFilterModeLinear;
texRef.normalized = true;
cudaBindTextureToArray(texRef, cuArray);
cudaUnbindTexture(texRef);
The following operations can be regarded as independent tasks that may run concurrently:
Concurrency is achieved through operations that are asynchronous with respect to the host: they return control to the host before the device has finished the work, so the host can continue executing. The asynchronous operations between host and device include the following:
Kernels in the same CUDA context (roughly the analogue of a single CPU process) can execute concurrently. For compute capability 7.5, the maximum number of concurrent kernels is 128.
Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
Intra-device data transfers can overlap with kernel execution as long as the device supports concurrent kernel execution.
Host-device data transfers can overlap with kernel execution only if asyncEngineCount is greater than 0; the host memory involved must be page-locked.
Concurrent transfers in both directions additionally require asyncEngineCount of at least 2; again, the host memory must be page-locked.
The last three kinds of concurrency are managed through streams. A stream is a sequence of commands that execute in order; commands in different streams may execute concurrently.
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);

for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>
        (outputDevPtr + i * size, inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}

for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);
When a CUDA runtime function is called without specifying a stream, the command is issued to the default stream. A compiler option selects which kind of default stream is used:
--default-stream per-thread: the default stream is a regular stream, and each host thread has its own default stream.
--default-stream legacy: the default stream is the special NULL stream; each device has a single NULL stream shared by all host threads. The NULL stream can trigger implicit synchronization.
Within a stream, commands execute in the order in which the host code issued them.
cudaDeviceSynchronize(): the host waits until the preceding commands in all streams of all host threads have completed.
cudaStreamSynchronize(): the host waits until the preceding commands in the given stream have completed.
cudaStreamWaitEvent(): commands issued to the given stream after this call wait for the event to complete; for the NULL stream, the subsequent commands of all streams wait for the event.
cudaStreamQuery(): checks whether the preceding commands in the given stream have completed, without synchronizing.
Two commands from different streams cannot run concurrently if the host thread issues any of the following operations between them:
For streams, the book *Professional CUDA C Programming* is a better reference.
Callbacks execute on the host. A callback can be inserted at any point in a stream; it is invoked once all preceding operations in that stream have completed. It also acts as a synchronization point: commands issued to the stream after it are deferred until the callback finishes. If the stream is the NULL stream, the waiting and deferring extend to the commands of all streams.
void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status, void *data){
    printf("Inside callback %zu\n", (size_t)data);
}
...
for (size_t i = 0; i < 2; ++i) {
cudaMemcpyAsync(devPtrIn[i], hostPtr[i], size, cudaMemcpyHostToDevice, stream[i]);
MyKernel<<<100, 512, 0, stream[i]>>>(devPtrOut[i], devPtrIn[i], size);
cudaMemcpyAsync(hostPtr[i], devPtrOut[i], size, cudaMemcpyDeviceToHost, stream[i]);
cudaStreamAddCallback(stream[i], MyCallback, (void*)i, 0);
}
Streams can also be given relative priorities. At runtime, as blocks of the low-priority stream finish, waiting blocks in the higher-priority stream are scheduled in their place.
Events can be used for timing and for synchronization between streams.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDev + i * size, inputHost + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>
        (outputDev + i * size, inputDev + i * size, size);
    cudaMemcpyAsync(outputHost + i * size, outputDev + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
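Besides timing, an event can also make one stream wait for work submitted to another. A hedged sketch (the kernel and buffer names are illustrative):

```cpp
// stream[1] must not start ConsumerKernel until ProducerKernel in stream[0] has finished.
cudaEvent_t produced;
cudaEventCreateWithFlags(&produced, cudaEventDisableTiming);  // synchronization-only event

ProducerKernel<<<100, 256, 0, stream[0]>>>(devBuf);
cudaEventRecord(produced, stream[0]);          // mark the point to wait for
cudaStreamWaitEvent(stream[1], produced, 0);   // stream[1] waits for the event
ConsumerKernel<<<100, 256, 0, stream[1]>>>(devBuf);

cudaEventDestroy(produced);
```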
All runtime functions return an error code, but for an asynchronous function the returned code can only reflect errors that occurred before the call: errors from earlier asynchronous work or invalid parameters of the call itself. To determine whether an error occurred at a given point, synchronize first, e.g. with cudaDeviceSynchronize(). The runtime keeps one error variable per host thread (initialized to cudaSuccess) that is overwritten whenever an error occurs. cudaPeekAtLastError returns this error variable; cudaGetLastError returns it and resets it.
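A common pattern built on these functions is a checking macro wrapped around every runtime call; a hedged sketch (the macro name is arbitrary):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap runtime calls; for kernels, check cudaGetLastError() (and synchronize if
// you also want errors from the asynchronous execution itself).
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc(&d_A, size));
// MyKernel<<<grid, block>>>(d_A);
// CUDA_CHECK(cudaGetLastError());          // launch/configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());     // errors from the execution itself
```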
Turing adds a second parallel execution unit next to every CUDA core: the Turing SM has a new independent integer datapath that can execute integer instructions concurrently with the floating-point math datapath. In previous generations, executing these instructions would have blocked floating-point instructions from issuing.
Register usage can be controlled with the maxrregcount compiler option or with launch bounds (see __launch_bounds__ below).
Performance optimization revolves around three basic strategies:
Maximize parallel execution: the application should perform as much work in parallel as possible (avoiding synchronization where it can), and that parallelism should map onto the different units of the system.
Optimize memory usage: deliver the needed data as fast as possible. Reduce data transfers over low-bandwidth paths, and use the best memory-access patterns (texture memory, shared memory, locality and coalesced access, memory alignment, and so on).
Device memory is accessed through 32-, 64-, or 128-byte memory transactions; the memory accesses of a warp are coalesced into one or more such transactions.
Global memory: global-memory instructions support reads and writes of 1, 2, 4, 8, or 16 bytes, so watch the size and alignment of user-defined types (if a type is, say, 32 bytes and the elements are stored contiguously, the accesses become strided and throughput drops; see the struct sketch after this list). Memory allocated through the CUDA memory APIs is aligned to at least 256 bytes.
Local memory: automatic variables in device code (normally held in registers; at most 255 registers per thread) may end up in local memory in three cases: dynamically indexed arrays; large arrays and structures; whatever else spills once registers run out. Compile with --ptxas-options=-v to see the usage.
For the characteristics of the other memory spaces, see the memory-model notes above.
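To illustrate the alignment point from the global-memory note above, a hedged sketch: padding a 12-byte element to 16 bytes with __align__(16) lets each thread load its element in a single aligned 16-byte access. The struct and kernel names are illustrative:

```cpp
// 12-byte element: per-element loads cannot map cleanly onto 16-byte accesses.
struct Point3 { float x, y, z; };

// Same data padded and aligned to 16 bytes: each element can be read with one
// 16-byte access, and a warp's accesses stay nicely coalesced.
struct __align__(16) Point3Aligned { float x, y, z, pad; };

__global__ void lengths(const Point3Aligned* pts, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    Point3Aligned p = pts[i];            // one aligned 16-byte load per thread
    out[i] = sqrtf(p.x * p.x + p.y * p.y + p.z * p.z);
}
```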
Optimize instruction usage: use the compute units efficiently (fully, and without waste). For example:
Reduce low-throughput instructions, e.g. by using intrinsic functions and single-precision arithmetic.
Reduce divergence within a warp (its impact is smaller from compute capability 7.0 onward).
Reduce the instruction count, e.g. avoid unnecessary synchronization and use __restrict__.
Avoid issuing long runs of the same kind of instruction; giving a task a suitable mix of floating-point and integer work raises the utilization of the compute resources.
When precision allows, use -ftz=true, -prec-div=false, and -prec-sqrt=false. __fdividef(x, y) is faster than the / operator; -use_fast_math turns all of these fast paths on at once (see the sketch after this list).
The rsqrtf() instruction is only generated under -prec-div=false and -prec-sqrt=false (the compiler then optimizes 1.0f/sqrtf() into it).
Trigonometric functions: prefer the single-precision versions. Arguments are in radians and should not be too large in magnitude, otherwise the slow argument-reduction path is taken.
Integer arithmetic: division and modulo compile to roughly 20 instructions; optimize them with bit operations where possible.
Half-precision arithmetic: use the half2 vector type and its intrinsics so that each instruction operates on two values.
Type conversions: these also take instructions; avoid unnecessary ones.
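A hedged sketch pulling a few of these points together: single-precision literals plus the __expf/__fdividef intrinsics (the same substitutions -use_fast_math applies globally). The kernel is illustrative:

```cpp
// Fast logistic transform on an array of floats.
__global__ void fastLogistic(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = data[i];
    // 1.0f (not 1.0) keeps the arithmetic in single precision;
    // __expf and __fdividef are the fast, lower-accuracy intrinsics.
    data[i] = __fdividef(1.0f, 1.0f + __expf(-x));
}
```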
Function execution space specifiers denote whether a function executes on the host or on the device and whether it is callable from the host or from the device.
__global__, __device__, __host__
Variable memory space specifiers: __device__, __shared__ and __constant__
Built-in variables: gridDim, blockDim, blockIdx, threadIdx, warpSize
void __threadfence_block(); // preceding writes become visible within the block; preceding reads are ordered
void __threadfence(); // in addition to the block-scope effect, preceding writes become visible device-wide; only the ordering is guaranteed, so a reader may still hit a stale cache and read incorrectly
void __threadfence_system(); // in addition to the block-scope effect, preceding writes become visible system-wide
void __syncthreads(); // blocks until all threads of the block reach this point; preceding memory reads/writes become visible within the block; all threads of the block must reach it (or none of them)
int __syncthreads_count(int predicate);
int __syncthreads_and(int predicate);
int __syncthreads_or(int predicate);
void __syncwarp(unsigned mask=0xffffffff); // warp-level synchronization; the warp's threads need not execute it convergently; an executing thread waits until every thread in mask has executed the function
These provide execution synchronization plus visibility of preceding memory reads and writes.
T __ldg(const T* address); // the data is cached in the read-only data (texture) cache
Atomic functions perform atomic operations on 32-bit or 64-bit words in global or shared memory.
Atomic functions also have scoped variants, e.g. atomicAdd_system() and atomicAdd_block()
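A hedged sketch that uses atomicAdd at both scopes, first on shared memory and then on global memory; the 256-bin histogram is illustrative:

```cpp
// Each block accumulates a private histogram in shared memory, then merges it
// into the global histogram; both steps use atomic additions.
__global__ void histogram256(const unsigned char* in, int n, unsigned int* bins) {
    __shared__ unsigned int local[256];
    int t = threadIdx.x;
    for (int b = t; b < 256; b += blockDim.x) local[b] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + t; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&local[in[i]], 1u);          // shared-memory atomic
    __syncthreads();

    for (int b = t; b < 256; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);         // global-memory atomic
}
```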
int __all_sync(unsigned mask, int predicate);
int __any_sync(unsigned mask, int predicate);
unsigned __ballot_sync(unsigned mask, int predicate);
unsigned __activemask();
unsigned int __match_any_sync(unsigned mask, T value);
unsigned int __match_all_sync(unsigned mask, T value, int *pred);
mask specifies the participating threads. The result is returned only to the active participating threads (not all threads named in mask have to reach this point), and non-participating threads do not need to synchronize.
T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize); // reads var from the lane whose ID is the caller's lane ID minus delta
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);
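A hedged sketch of the classic use of these shuffle intrinsics, a warp-level sum reduction (it assumes all 32 lanes of the warp participate):

```cpp
// Butterfly reduction: after the loop every lane of the warp holds the warp's sum.
__device__ float warpReduceSum(float val) {
    unsigned mask = 0xffffffffu;               // all 32 lanes participate
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_xor_sync(mask, val, offset);
    return val;
}
```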
Using Tensor Cores to compute the matrix operation D = A*B + C:
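One way to do this from CUDA C is the warp-level WMMA API. A hedged sketch for a single 16x16x16 tile (half-precision inputs, single-precision accumulation; the leading dimensions and layouts are illustrative):

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile: D = A*B + C (half inputs, float accumulate).
__global__ void wmmaTile(const half* A, const half* B, const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::load_matrix_sync(a, A, 16);                       // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::load_matrix_sync(acc, C, 16, wmma::mem_row_major);

    wmma::mma_sync(acc, a, b, acc);                         // acc = a*b + acc
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

// Launch with a single warp:  wmmaTile<<<1, 32>>>(A, B, C, D);
```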
void __prof_trigger(int counter); // each SM has 16 hardware counters
void assert(int expression);
void* malloc(size_t size);
void free(void* ptr);
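A hedged sketch of device-side assert together with per-thread heap allocation; the buffer size is arbitrary:

```cpp
#include <cassert>

// Each thread allocates a small scratch buffer from the device heap,
// uses it, and frees it again.
__global__ void scratchDemo() {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int* buf = (int*)malloc(4 * sizeof(int));
    assert(buf != nullptr);              // device-side assert (traps on failure)
    for (int k = 0; k < 4; ++k) buf[k] = i + k;
    free(buf);
}
// The default device heap is 8 MB; it can be raised with
// cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes) before the first launch.
```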
Execution configuration: <<< Dg, Db, Ns, S >>> (grid dimensions, block dimensions, dynamic shared-memory bytes per block, and the stream).
__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
MyKernel(...)
{
...
}
#pragma unroll
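A hedged example of the pragma; the fixed trip count of 8 is illustrative (the compiler can unroll such loops on its own, the pragma just makes it explicit):

```cpp
__global__ void saxpy8(const float* x, float* y, float a) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
    #pragma unroll
    for (int k = 0; k < 8; ++k)          // fully unrolled: 8 independent FMAs
        y[base + k] = a * x[base + k] + y[base + k];
}
```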