网络模型计算量评估

目录

计算量

访存


计算量

计算性能指标:

FlOPS: floating point operations per second

计算量指标:

● MACCs or MADDs: multiply accumulate operations

 

FLOPS和FLOPs的区别:

FLOPS:注意全大写,是floating point operations per second的缩写,意指每秒浮点运算次数,理解为计算速度。是一个衡量硬件性能的指标。

FLOPs:注意s小写,是floating point operations的缩写(s表复数),意指运算数,理解为计算量。可以用来衡量算法/模型的复杂度。

注意点:MACCs就是乘加次数,FLOPs就是乘与加的次数之和

 

点乘求和举例说明

y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + ... + w[n-1]*x[n-1]

w[0]*x[0] + ... 认为是1MACC,所以是n MACCs

上式乘加表达式中包含n个浮点乘法和n - 1浮点加法,所以是2n-1 FLOPS

一个 MACC差不多是两个FLOPS

注意点: 严格的说,上述公式中只有n-1个加法,比乘法数少一个。这里MACC数量是一个近似值,就像Big-O符号是一个算法复杂性的近似值一样。

 

实际卷积计算量:

关于计算量相关的细节可以参考文章PRUNING CONVOLUTIONAL NEURAL NETWORKS FOR RESOURCE EFFICIENT INFERENCE, ICLR2017》,

假设采用滑动窗实现卷积且忽略非线性计算开销,则卷积核的FLOPs为:

                                                                      FLOPs=2HW(C_{in}K^{2}+1)C_{out}

神经网络各层FLOPs计算:

● Full Connected Layermultiplying a vector of length I with an I × J matrix to get a vector of length J, takes I × J MACCs or (2I - 1) × J FLOPs.

● Activate Layer:  We do not measure these in MACCs but in FLOPs, because they’re not dot products.

Convolution Layer: K × K × Cin × Hout × Wout × Cout MACCs

Depthwise-Seperable Layer: (K × K × Cin × Hout × Wout) + (Cin × Hout × Wout × Cout) MACCs

->Cin × Hout × Wout × (K × K + Cout) MACCs

● Factor is K × K × Cout / (K × K + Cout).

 

访存

计算量仅仅是运算速度的一个方面,另一个重要的方面是内存带宽(memory bandwidth),甚至比计算量还重要

对现代计算机来说,a single memory access from main memory is much slower than a single computation — by a factor of about 100 or more!

一个网络来说,内存访问需要访问多少次呢?对每一层来说,包括了以下内存访问:

    1. 读取每层的输入

    2. 计算结果:包括了载入权重

    3. 输出每层的结果

 

Memory for weights

● Full Connected:输入I/输出J,总共是(I+1)*J

Convolutional layers have less weights than fully-connected layersK × K × Cin × Cout

因为内存访问非常慢,所以大量的内存访问给网络运行带来的很大的影响,甚至超过了计算量

 

Feature maps and intermediate results

● Convolution Layer:

 (the weights here are negligible)

input = Hin × Win × Cin × K × K × Cout

output = Hout × Wout × Cout

weights = K × K × Cin × Cout + Cout

Example: Cin = 256, Cout = 512, H = W = 28, K = 3,S = 1

 

1、Normal convolution layer

     input = 28 × 28 × 256 × 3 × 3 × 512 = 924,844,032

     output = 28 × 28 × 512 = 401,408

     weights = 3 × 3 × 256 × 512 + 512 = 1,180,160

     total = 926,425,600

 

2、depthwise layer+pointwise layer

    1)depthwise layer

        input = 28 × 28 × 256 × 3 × 3 = 1,806,336

        output = 28 × 28 × 256 = 200,704

        weights = 3 × 3 × 256 + 256 = 2,560

        total = 2,009,600

     2)pointwise layer

        input = 28 × 28 × 256 × 1 × 1 × 512 = 102,760,448

        output = 28 × 28 × 512 = 401,408

        weights = 1 × 1 × 256 × 512 + 512 = 131,584

         total = 103,293,440

         total of both layers = 105,303,040

 

案例研究

● Input dimension: 126x224

MobileNet V1 parameters (multiplier = 1.0): 1.6M

MobileNet V2 parameters (multiplier = 1.0): 0.5M

MobileNet V2 parameters (multiplier = 1.4): 1.0M

MobileNet V1 MACCs (multiplier = 1.0): 255M

MobileNet V2 MACCs (multiplier = 1.0): 111M

MobileNet V2 MACCs (multiplier = 1.4): 214M

MobileNet V1 memory accesses (multiplier = 1.0): 283M

MobileNet V2 memory accesses (multiplier = 1.0): 159M

MobileNet V2 memory accesses (multiplier = 1.4): 286M

MobileNet V2 (multiplier = 1.4) is slightly slower than MobileNet V1 (multiplier = 1.0)

This provides some proof for my hypothesis that the amount of memory accesses is the primary factor for determining the speed of the neural net.

 

结论

“I hope this shows that all these things — number of computations, number of parameters, and number of memory accesses — are deeply related. A model that works well on mobile needs to carefully balance those factors.”

你可能感兴趣的:(轻量神经网络,macc,FLOPS,FLOPs,计算量,神经网络)