Output size:
The number of output channels equals the number of filters: 64
H_out = W_out = (H - K + 2P)/S + 1 = (227 - 11 + 2*2)/4 + 1 = 56
Memory (KB):
Number of output elements = C * H * W = 64 * 56 * 56 = 200704; bytes per element = 4 (32-bit floating point); KB = 200704 * 4 / 1024 = 784
Parameters (k):
Weight shape = Cout * Cin * K * K = 64 * 3 * 11 * 11
Bias shape = Cout = 64
Number of parameters = 64 * 3 * 11 * 11 + 64 = 23296
FLOPs (M) (important!):
Number of floating point operations (one multiply + one add counted together, since they can be done in one cycle)
= (number of output elements) x (ops per output element)
= (Cout x H_out x W_out) x (Cin x K x K) = 200704 x 363 = 72855552 (~72.9M)
The ReLU immediately following conv1 is omitted here.
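A minimal sketch (plain Python, assuming the AlexNet-style conv1 hyperparameters used above: 3x227x227 input, 64 filters of size 11x11, stride 4, pad 2) that reproduces the output size, memory, parameter, and FLOP numbers:

```python
# Conv layer bookkeeping for the conv1 example above.
Cin, H, W = 3, 227, 227           # input channels and spatial size
Cout, K, S, P = 64, 11, 4, 2      # filters, kernel size, stride, padding

H_out = (H - K + 2 * P) // S + 1  # (227 - 11 + 4) / 4 + 1 = 56
W_out = (W - K + 2 * P) // S + 1

out_elems = Cout * H_out * W_out      # 64 * 56 * 56 = 200704
mem_kb = out_elems * 4 / 1024         # 4 bytes per float32 element -> 784 KB
params = Cout * Cin * K * K + Cout    # weights + biases = 23296
flops = out_elems * Cin * K * K       # multiply-adds = 72855552

print(H_out, W_out, out_elems, mem_kb, params, flops)
# 56 56 200704 784.0 23296 72855552
```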
For pooling layers the bookkeeping is the same, but there are no learnable parameters.
How was AlexNet designed? Trial and error.
Design rules for VGG:
Option 1: conv(5x5, C->C)
Params: 25C^2; FLOPs: 25C^2HW
Option 2: conv(3x3, C->C), conv(3x3, C->C)
Params: 18C^2; FLOPs: 18C^2HW. Same receptive field, fewer parameters, and less computation; moreover, with two convs we can insert a ReLU between them, which adds depth and nonlinear computation.
Note: be careful with the FLOPs here. Doubling the channel count for the half-size input after each pooling keeps the FLOPs the same, so the conv layers at each spatial resolution take the same amount of computation (see the sketch below).
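A short sketch (plain Python, with a hypothetical stage size C=64, H=W=56 just for illustration) that reproduces the 25C^2 vs 18C^2 comparison and the constant-FLOPs-per-stage property:

```python
# Parameter and FLOP counts in the same convention as the notes:
# params = Cout*Cin*K*K (biases ignored), FLOPs = output elements * Cin*K*K,
# with "same" padding so the output stays H x W.
def conv_params(cin, cout, k):
    return cout * cin * k * k

def conv_flops(cin, cout, k, h, w):
    return cout * h * w * cin * k * k

C, H, W = 64, 56, 56   # hypothetical stage size

p1 = conv_params(C, C, 5)              # one 5x5 conv:  25 C^2
f1 = conv_flops(C, C, 5, H, W)         #                25 C^2 H W
p2 = 2 * conv_params(C, C, 3)          # two 3x3 convs: 18 C^2
f2 = 2 * conv_flops(C, C, 3, H, W)     #                18 C^2 H W

# Next stage: half the spatial size, double the channels -> FLOPs unchanged.
f_next = 2 * conv_flops(2 * C, 2 * C, 3, H // 2, W // 2)
print(p1, p2, f1, f2, f2 == f_next)    # the comparison ends with True
```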
GoogLeNet: the Inception module, a local unit with parallel branches that is repeated many times throughout the network.
Use 1x1 bottleneck layers to reduce the channel dimension before the expensive conv (sketch below).
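A minimal PyTorch sketch of the bottleneck idea (hypothetical channel sizes, not the exact Inception configuration): squeeze the channels with a cheap 1x1 conv before the expensive 3x3 conv.

```python
import torch
import torch.nn as nn

# Expensive: a 3x3 conv operating directly on 256 channels.
expensive = nn.Conv2d(256, 256, kernel_size=3, padding=1)

# Bottlenecked: a 1x1 conv reduces 256 -> 64 channels, then the 3x3 conv
# runs on the cheaper 64-channel representation.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=3, padding=1),
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(expensive), n_params(bottleneck))   # the bottleneck is ~3.6x cheaper

x = torch.randn(1, 256, 28, 28)
print(bottleneck(x).shape)                         # torch.Size([1, 256, 28, 28])
```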
What happens when we go deeper?
This is an optimization problem: deeper models are harder to optimize; in particular, they do not learn the identity functions needed to emulate shallower models.
-> Change the network so that learning identity functions with the extra layers is easy.
A residual block can easily learn the identity function: if the weights of its two conv layers are set to zero, the block computes the identity, which makes it easy for a deep network to emulate a shallower one. It also improves gradient flow in deep networks, because the add gate copies the gradient and passes it back through the shortcut.
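A minimal residual block sketch in PyTorch (a basic two-conv block with hypothetical sizes; batch norm and downsampling omitted for brevity), showing that zeroing the conv weights turns the block into the identity:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)      # shortcut: add the input back in

block = BasicBlock(64)
for p in block.parameters():           # zero all conv weights and biases
    nn.init.zeros_(p)

x = torch.relu(torch.randn(1, 64, 8, 8))   # non-negative input
print(torch.allclose(block(x), x))         # True: the block computes the identity
```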
Learn from VGG: stages of 3x3 convs.
Learn from GoogLeNet: an aggressive stem to downsample the input before applying the residual blocks, and global average pooling to avoid expensive FC layers.
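A sketch of these two borrowed ideas (ResNet-style hyperparameters; batch norm omitted): an aggressive stem that downsamples 224 -> 56 before any residual blocks, and a global-average-pool head in place of large FC layers.

```python
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # 224 -> 112
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),         # 112 -> 56
)

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # global average pool: C x H x W -> C x 1 x 1
    nn.Flatten(),
    nn.Linear(64, 1000),       # one small FC layer instead of VGG's huge ones
)

x = torch.randn(1, 3, 224, 224)
feats = stem(x)                # residual blocks would go here
print(feats.shape)             # torch.Size([1, 64, 56, 56])
print(head(feats).shape)       # torch.Size([1, 1000])
```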