A series of deep learning models has emerged in recent years. The most time-consuming parts of these models, including convolutions, fully connected layers, and attention, can all be converted into GEMM operations. So the importance of GEMM optimization can hardly be overstated.
This article introduces tiling in GEMM and how data is moved across the multi-level memory hierarchy. This is the core idea of HPC optimization: placing data in closer (faster) storage to hide access latency and reduce the impact of the memory wall.
The article proceeds in four parts: first, tiling and data movement at the global memory level; then tiling and data movement at the shared memory level; then tiling at the register level and avoiding bank conflicts; and finally, prefetching to better hide memory latency.
A primer on the CUDA SASS instruction set
CUDA Microarchitecture and Instruction Set (3): SASS Instruction Set Classification
First, the simplest instruction form looks like FFMA R2, R4, R2, R5;. Here FFMA means the instruction performs a 32-bit float fused multiply-add, i.e. R2 = R4*R2 + R5. FFMA is conventionally called the opcode (Opcode; strictly speaking, it refers to the binary encoding corresponding to FFMA), and the following R2, R4, R2, R5 are the general purpose registers (General Purpose Register, GPR for short, sometimes simply Register) participating in the operation, called operands (Operand). The SASS convention is that the first operand is the destination (dst) and the rest are sources (src, variously numbered src0, src1, src2… or srcA, srcB, srcC…; both styles appear in practice).
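As a sketch of the opcode/operand convention above (plain Python for illustration, not real SASS tooling; `parse_sass` is a hypothetical helper), splitting a textual instruction into opcode, destination, and sources:

```python
def parse_sass(instr: str):
    """Split a simple SASS-style instruction into opcode and operands.

    Illustrative only: real SASS is a binary encoding; this just mirrors
    the textual convention that the first operand is dst, the rest src.
    """
    instr = instr.rstrip(";").strip()
    opcode, _, rest = instr.partition(" ")
    operands = [op.strip() for op in rest.split(",")]
    dst, srcs = operands[0], operands[1:]
    return opcode, dst, srcs

# FFMA computes dst = src0 * src1 + src2, here R2 = R4*R2 + R5
opcode, dst, srcs = parse_sass("FFMA R2, R4, R2, R5;")
```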
Memory transactions
When a warp executes an instruction that accesses memory, it is important to consider the access pattern created by the threads in that warp.
For example, when loading data through the L1 cache, an entire 128-byte cache line is fetched, regardless of whether one thread is reading one value (the least efficient pattern), or if all 32 threads are reading consecutive 4-byte values (the most efficient pattern).
A memory “request” is an instruction which accesses memory, and a “transaction” is the movement of a unit of data between two regions of memory. In other words, a memory transaction is a movement of data, while a memory request is an instruction.
Efficient access patterns minimize the number of transactions incurred by a request. Inefficient patterns make large numbers of transactions, using only a small amount of data from each transaction, wasting bandwidth in the connections between regions of the Memory Hierarchy.
Each memory transaction is, in effect, a postman: per trip he can carry one letter from A to B, a letter can hold at most 128 bytes, and one trip takes 200 days. So would you rather write one 128-byte letter, or 32 four-byte letters? The choice speaks for itself.
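To make the request/transaction distinction concrete, here is a minimal sketch (assuming 128-byte L1 cache lines, as in the passage above) that counts how many transactions one warp-wide load request incurs by counting the distinct cache lines touched:

```python
LINE_BYTES = 128  # L1 cache line size assumed above

def transactions(addresses):
    """Number of 128-byte cache lines touched by one warp's load request."""
    return len({addr // LINE_BYTES for addr in addresses})

# 32 threads reading consecutive 4-byte values: the efficient pattern
coalesced = [4 * t for t in range(32)]

# 32 threads each striding by a full cache line: one transaction per thread
strided = [128 * t for t in range(32)]
```

With the coalesced pattern the whole request is served by a single 128-byte transaction; the strided pattern needs 32 transactions while using only 4 bytes from each.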
SIMT Architecture
The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp.
When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution.
The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs, with the first warp containing thread 0. For a block of dimensions (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y·Dx + z·Dx·Dy).
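The linearization and warp partitioning above can be checked with a short sketch (the block dimensions below are chosen arbitrarily for illustration):

```python
WARP_SIZE = 32

def thread_id(x, y, z, Dx, Dy):
    """Linear thread ID of thread (x, y, z) in a block of shape (Dx, Dy, Dz)."""
    return x + y * Dx + z * Dx * Dy

def warp_of(x, y, z, Dx, Dy):
    """Warps are consecutive runs of 32 thread IDs; warp 0 holds thread 0."""
    return thread_id(x, y, z, Dx, Dy) // WARP_SIZE

# Example block of shape (16, 4, 2): thread (3, 2, 1) has
# ID 3 + 2*16 + 1*16*4 = 99, so it lands in warp 99 // 32 = 3.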
Threads within the same warp executing different instructions is known as warp divergence.
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
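The serialization described above can be sketched as a simplified model (not actual hardware behavior): the warp executes each taken branch path in turn, with threads not on the current path masked off.

```python
WARP_SIZE = 32

def run_divergent(warp_values):
    """Execute `x > 0 ? x*2 : -x` for one warp, one branch path at a time.

    Returns the per-thread results and the number of serialized passes:
    a uniform warp needs 1 pass, a divergent warp needs 2.
    """
    passes = 0
    results = list(warp_values)
    taken = [v > 0 for v in warp_values]

    if any(taken):                      # then-path, active mask = taken
        passes += 1
        for i, v in enumerate(warp_values):
            if taken[i]:
                results[i] = v * 2

    if not all(taken):                  # else-path, active mask = not taken
        passes += 1
        for i, v in enumerate(warp_values):
            if not taken[i]:
                results[i] = -v

    return results, passes
```

When all 32 threads agree on the branch, only one path is executed; as soon as one thread disagrees, both paths run and each thread sits idle during the pass it is masked off from.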