Contents
Compute cost
Memory access
Compute performance metric:
● FLOPS: floating point operations per second
Compute cost metric:
● MACCs or MADDs: multiply-accumulate operations
The difference between FLOPS and FLOPs:
FLOPS: all uppercase, short for floating point operations per second, i.e. the number of floating-point operations performed each second. It measures computation speed and is a hardware performance metric.
FLOPs: note the lowercase s, short for floating point operations (the s marks the plural), i.e. a count of floating-point operations. It measures computation cost and can be used to gauge the complexity of an algorithm/model.
Note: MACCs count multiply-accumulate operations; FLOPs count multiplications and additions separately, i.e. their sum.
A dot-product example:
● y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + ... + w[n-1]*x[n-1]
Counting w[0]*x[0] + ... as 1 MACC, the dot product above takes n MACCs.
The expression contains n floating-point multiplications and n - 1 floating-point additions, so it is 2n - 1 FLOPs.
A MACC is therefore roughly two FLOPs.
Note: strictly speaking, the formula above has only n - 1 additions, one fewer than the number of multiplications. The MACC count here is an approximation, much like Big-O notation is an approximation of an algorithm's complexity.
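To make the counting concrete, here is a minimal Python sketch (plain Python, names are illustrative) that reports both cost measures for a dot product of length n:

```python
# Count MACCs and exact FLOPs for y = w[0]*x[0] + ... + w[n-1]*x[n-1].
def dot_product_costs(n: int):
    maccs = n          # one multiply-accumulate per element pair
    flops = 2 * n - 1  # n multiplications + (n - 1) additions
    return maccs, flops

for n in (3, 100, 4096):
    maccs, flops = dot_product_costs(n)
    # For large n, 2n - 1 ≈ 2n, so 1 MACC ≈ 2 FLOPs.
    print(f"n={n}: {maccs} MACCs, {flops} FLOPs")
```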
Actual convolution compute cost:
For the details, see the paper "PRUNING CONVOLUTIONAL NEURAL NETWORKS FOR RESOURCE EFFICIENT INFERENCE, ICLR 2017".
Assuming convolution is implemented with a sliding window and the nonlinearity is computed for free, the FLOPs of a convolution layer are 2 × Hout × Wout × (Cin × K × K + 1) × Cout, where the +1 accounts for the bias.
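As a quick sanity check, the sketch below evaluates this formula for a hypothetical layer; the 28 × 28 × 256 → 512 shape is illustrative, not taken from any particular network:

```python
# Evaluate the conv-layer FLOPs formula for an illustrative layer shape.
def conv_flops(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    # Per output value: Cin*K*K multiply-accumulates plus the bias term,
    # each counted as 2 FLOPs, over Hout*Wout*Cout output values.
    return 2 * h_out * w_out * (c_in * k * k + 1) * c_out

print(conv_flops(h_out=28, w_out=28, c_in=256, c_out=512, k=3))  # 1,850,490,880
```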
FLOPs for each layer of a neural network:
● Fully Connected Layer: multiplying a vector of length I with an I × J matrix to get a vector of length J takes I × J MACCs or (2I - 1) × J FLOPs.
● Activation Layer: we do not measure these in MACCs but in FLOPs, because they're not dot products.
● Convolution Layer: K × K × Cin × Hout × Wout × Cout MACCs
● Depthwise-Separable Layer: (K × K × Cin × Hout × Wout) + (Cin × Hout × Wout × Cout) MACCs
= Cin × Hout × Wout × (K × K + Cout) MACCs
● The speedup factor over a normal convolution is K × K × Cout / (K × K + Cout), as the sketch below confirms.
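Here is a minimal sketch of the MACC formulas above; the helper names are hypothetical, and Hout/Wout are taken as already-known output dimensions:

```python
# MACCs for a standard convolution.
def conv_maccs(k, c_in, h_out, w_out, c_out):
    return k * k * c_in * h_out * w_out * c_out

# MACCs for a depthwise-separable convolution (depthwise + 1x1 pointwise).
def separable_maccs(k, c_in, h_out, w_out, c_out):
    depthwise = k * k * c_in * h_out * w_out  # one K x K filter per input channel
    pointwise = c_in * h_out * w_out * c_out  # 1 x 1 convolution across channels
    return depthwise + pointwise              # = Cin*Hout*Wout*(K*K + Cout)

k, c_in, h, w, c_out = 3, 256, 28, 28, 512
print(conv_maccs(k, c_in, h, w, c_out) / separable_maccs(k, c_in, h, w, c_out))  # ≈ 8.85
print(k * k * c_out / (k * k + c_out))  # K*K*Cout / (K*K + Cout), also ≈ 8.85
```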
● Compute cost is only one side of inference speed; the other important side is memory bandwidth, which can matter even more than the compute cost.
● On modern computers, a single memory access from main memory is much slower than a single computation, by a factor of about 100 or more!
● How many memory accesses does a network perform? For each layer, they include:
1. reading the layer's input
2. computing the result, which includes loading the weights
3. writing the layer's output
Memory for weights
● Fully Connected: with input size I and output size J, the total is (I + 1) × J weights (the +1 is the bias).
● Convolutional layers have fewer weights than fully-connected layers: K × K × Cin × Cout
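A tiny sketch of these weight counts, with illustrative layer sizes (the conv count excludes biases, matching the formula above; add Cout more if the layer has them):

```python
# Weight counts for the two layer types.
def fc_weights(i, j):
    return (i + 1) * j           # I*J weights plus J biases

def conv_weights(k, c_in, c_out):
    return k * k * c_in * c_out  # add c_out more if the layer has biases

print(fc_weights(4096, 4096))     # 16,781,312
print(conv_weights(3, 256, 512))  # 1,179,648
```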
● Because memory access is so slow, a large number of memory accesses can dominate a network's running time, even outweighing the compute cost.
Feature maps and intermediate results
● Convolution Layer:
input = Hout × Wout × Cin × K × K × Cout  (each output value reads a K × K × Cin input window)
output = Hout × Wout × Cout
weights = K × K × Cin × Cout + Cout
(the weight accesses are negligible next to the input accesses)
Example: Cin = 256, Cout = 512, H = W = 28, K = 3, S = 1
1. Normal convolution layer
input = 28 × 28 × 256 × 3 × 3 × 512 = 924,844,032
output = 28 × 28 × 512 = 401,408
weights = 3 × 3 × 256 × 512 + 512 = 1,180,160
total = 926,425,600
2. Depthwise layer + pointwise layer
1)depthwise layer
input = 28 × 28 × 256 × 3 × 3 = 1,806,336
output = 28 × 28 × 256 = 200,704
weights = 3 × 3 × 256 + 256 = 2,560
total = 2,009,600
2)pointwise layer
input = 28 × 28 × 256 × 1 × 1 × 512 = 102,760,448
output = 28 × 28 × 512 = 401,408
weights = 1 × 1 × 256 × 512 + 512 = 131,584
total = 103,293,440
total of both layers = 105,303,040
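The whole accounting can be reproduced with a short sketch, assuming one memory access per value read or written (no caching effects) and stride 1 with "same" padding, so Hout = Hin and the formulas above apply directly:

```python
# Memory accesses for one convolution layer: read the input window for every
# output value, write every output value, and load the weights once.
def conv_accesses(h_out, w_out, c_in, c_out, k):
    inputs = h_out * w_out * c_in * k * k * c_out  # K*K*Cin reads per output value
    outputs = h_out * w_out * c_out                # one write per output value
    weights = k * k * c_in * c_out + c_out         # filter weights + biases
    return inputs + outputs + weights

# Depthwise layer (one K x K filter per channel), then a 1 x 1 pointwise layer.
def separable_accesses(h_out, w_out, c_in, c_out, k):
    depthwise = (h_out * w_out * c_in * k * k      # input reads
                 + h_out * w_out * c_in            # output writes
                 + k * k * c_in + c_in)            # weights + biases
    pointwise = conv_accesses(h_out, w_out, c_in, c_out, k=1)
    return depthwise + pointwise

print(conv_accesses(28, 28, 256, 512, 3))       # 926,425,600
print(separable_accesses(28, 28, 256, 512, 3))  # 105,303,040
```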
Case study
● Input dimension: 126 × 224
MobileNet V1 parameters (multiplier = 1.0): 1.6M
MobileNet V2 parameters (multiplier = 1.0): 0.5M
MobileNet V2 parameters (multiplier = 1.4): 1.0M
MobileNet V1 MACCs (multiplier = 1.0): 255M
MobileNet V2 MACCs (multiplier = 1.0): 111M
MobileNet V2 MACCs (multiplier = 1.4): 214M
MobileNet V1 memory accesses (multiplier = 1.0): 283M
MobileNet V2 memory accesses (multiplier = 1.0): 159M
MobileNet V2 memory accesses (multiplier = 1.4): 286M
MobileNet V2 (multiplier = 1.4) is slightly slower than MobileNet V1 (multiplier = 1.0).
This provides some support for my hypothesis that the number of memory accesses is the primary factor determining the speed of the neural net.
Conclusion
“I hope this shows that all these things — number of computations, number of parameters, and number of memory accesses — are deeply related. A model that works well on mobile needs to carefully balance those factors.”