Paper notes: "Deep Residual Learning for Image Recognition" by Kaiming He

Source code released with the paper: https://github.com/KaimingHe/deep-residual-networks

  • The paper opens by noting that the deeper a neural network is, the harder it is to train and optimize. **The main motivation (purpose) of ResNet** is to address the degradation phenomenon in deep networks, which is an optimization problem: optimization with SGD becomes more difficult. Degradation problem: as the network depth increases, the training error does not decrease but instead increases.

In this paper, we address the degradation problem by introducing a deep residual learning framework.

  • Two major problems in deep networks: vanishing/exploding gradients and the degradation problem

vanishing/exploding gradients, which hamper convergence from the beginning.

Remedy: normalization, which balances the activation function's linear (sensitive) region against its saturated region.

This problem, however, has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation.
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated and then degrades rapidly.

An important assumption in deep learning is the IID assumption: data are independent and identically distributed. Batch Normalization normalizes each layer's outputs (before they enter the activation function) to zero mean and unit variance, followed by a learnable scale and shift, so that every layer receives inputs with a consistent distribution. BN was inspired by whitening the input images.
[Batch Normalization] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

See also: an in-depth explanation of batch normalization (深入理解batch normalization批标准化).
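
A minimal NumPy sketch of that normalization step (the function name and shapes here are made up for illustration; a real BN layer also keeps running statistics for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then apply a learnable scale and shift."""
    mean = x.mean(axis=0)                        # per-feature mean over the mini-batch
    var = x.var(axis=0)                          # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)      # zero mean, unit variance
    return gamma * x_hat + beta                  # learnable scale (gamma) and shift (beta)

x = 3.0 * np.random.randn(32, 4) + 5.0           # a mini-batch with shifted, scaled features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```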

  • ResNet learns the residual function F(x)
    [Figure: the residual learning building block]
    The hypothesis is that the residual mapping is easier to optimize.

Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) − x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping.
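
To make the F(x) + x idea concrete, here is a minimal PyTorch sketch of a residual building block (the code released with the paper is in Caffe; class and variable names here are my own, and details such as the exact BN/ReLU placement are simplified):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two stacked 3x3 conv layers fit the residual F(x); the block outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))   # first layer of F(x)
        out = self.bn2(self.conv2(out))            # second layer of F(x)
        return self.relu(out + x)                  # recast mapping: F(x) + x, then ReLU
```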

  • ResNet shortcut connections vs. highway connections
    The two are somewhat similar: both use shortcuts so that, in the forward pass, feature information from earlier layers can flow directly to later layers, and in backpropagation the gradient can also flow back more easily, which helps deeper networks optimize well.
  1. Compared with highway connections, the shortcut connections in ResNet introduce no extra parameters.
  2. The shortcuts in highway connections are gated.

“highway networks” present shortcut connections with gating functions.

See also: shortcut connection vs. highway network (a sketch follows).
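
A minimal sketch of the difference, using fully connected layers for brevity (shapes and names are assumptions, not taken from either paper's code): the ResNet shortcut adds the input back with no parameters, while the highway shortcut mixes F(x) and x through a learned gate T(x), which costs extra parameters.

```python
import torch
import torch.nn as nn

class ResidualFC(nn.Module):
    """ResNet-style: y = F(x) + x; the shortcut path has no parameters."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.f(x) + x

class HighwayFC(nn.Module):
    """Highway-style: y = T(x) * F(x) + (1 - T(x)) * x, with a learned gate T(x)."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.gate = nn.Linear(dim, dim)  # extra parameters for the gating function

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * self.f(x) + (1.0 - t) * x
```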

  • Identity mapping by shortcuts
    The identity mapping in ResNet:
    y = F(x, {Wi}) + x        (Eqn. 1)
    where F(x, {Wi}) denotes the residual mapping to be learned.

The operation F + x is performed by a shortcut connection and element-wise addition. The element-wise addition is performed on two feature maps, channel by channel.

After the addition, another ReLU activation is applied.
The channel dimensions of x and F must be equal. When the dimensions increase (as in the figure below), two options can be used to make them match (a code sketch follows this list):

  • with extra zero entries padded for the increasing dimensions; this option introduces no extra parameter;
  • the projection shortcut in Eqn. (2), i.e. a 1×1 convolution, which performs a linear projection Ws by the shortcut connections to match the dimensions:
    y = F(x, {Wi}) + Ws x        (Eqn. 2)
    [Figure: residual network with dotted shortcuts where dimensions increase]
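
A minimal sketch of the two options, assuming PyTorch NCHW tensors (function names and the stride-2 subsampling detail are illustrative, not taken from the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def shortcut_zero_pad(x, out_channels, stride=2):
    """Option A: identity shortcut with extra zero channels; no extra parameters."""
    x = x[:, :, ::stride, ::stride]            # subsample spatially when the map size is halved
    pad = out_channels - x.size(1)             # number of zero channels to append
    return F.pad(x, (0, 0, 0, 0, 0, pad))      # pad order: (W, W, H, H, C_front, C_back)

def shortcut_projection(in_channels, out_channels, stride=2):
    """Option B (Eqn. 2): projection shortcut Ws, i.e. a strided 1x1 convolution."""
    return nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False)

x = torch.randn(1, 64, 56, 56)
print(shortcut_zero_pad(x, 128).shape)          # torch.Size([1, 128, 28, 28])
print(shortcut_projection(64, 128)(x).shape)    # torch.Size([1, 128, 28, 28])
```
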
  • Plain network vs. residual network
    The former has no shortcuts; the network is just a stack of layers, e.g. VGG (Oxford Visual Geometry Group, all 3×3 convolutions). Two design rules in VGG (sketched in code after this item):
  • for the same output feature map size, the layers have the same number of filters;
  • if the feature map size is halved, the number of filters is doubled.
    See also: 一文读懂VGG (Zhihu).
    [Figure: training curves of plain networks vs. ResNets]
    This result shows that with ResNet the network depth can increase while the error decreases.
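
As a sketch of those two design rules (the function and its arguments are hypothetical, not from VGG's or ResNet's code), a plain stage could be built like this:

```python
import torch.nn as nn

def plain_stage(in_channels, out_channels, num_layers, halve_size):
    """Stack of 3x3 conv layers following the two VGG-style design rules."""
    # Rule 2: when the feature map size is halved (stride 2), the filter count is doubled,
    # so out_channels is typically 2 * in_channels when halve_size is True.
    layers = [nn.Conv2d(in_channels, out_channels, 3, stride=2 if halve_size else 1, padding=1),
              nn.ReLU(inplace=True)]
    # Rule 1: for the same output feature map size, every layer keeps the same filter count.
    for _ in range(num_layers - 1):
        layers += [nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```
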
  • Deeper Bottleneck Architectures
    [Figure: a bottleneck building block (1×1, 3×3, 1×1 convolutions)]
    To keep the training time affordable, the authors adopt a bottleneck design to study deeper ResNet structures (a sketch follows).
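
A minimal PyTorch sketch of the bottleneck block (the 256-64-64-256 channel sizes follow the paper's example; other details such as strides and downsampling are omitted):

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, wrapped by an identity shortcut."""
    def __init__(self, channels=256, mid_channels=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, mid_channels, kernel_size=1, bias=False),      # 1x1 reduces width
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False),  # 3x3 at the smaller width
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, kernel_size=1, bias=False),      # 1x1 restores width
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # identity shortcut around the bottleneck
```
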
  • Exploring Over 1000 layers
    [Figure: results with 110-layer and 1202-layer ResNets on CIFAR-10]

The authors find that with depth = 1202 and depth = 110 the training errors are about the same, but the test error at depth = 1202 is worse than at depth = 110; they attribute this to overfitting. Maxout and dropout can alleviate overfitting.

  • Maxout and Dropout
    Maxout requires a parameter k, which increases the number of parameters by a factor of k. A maxout layer can be viewed as an "activation function" that is piecewise linear, not fixed, and learnable; it acts as a hidden layer and a function approximator, equivalent to inserting k extra nodes between the original input and output layers (a sketch follows).
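
A minimal PyTorch sketch of a maxout unit as described above (class and parameter names are mine, following Goodfellow et al.'s maxout formulation): k parallel linear pieces per output unit, with the element-wise maximum as the activation, which is where the k-fold parameter growth comes from.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout layer: a piecewise-linear, learnable 'activation' over k linear pieces."""
    def __init__(self, in_features, out_features, k=2):
        super().__init__()
        self.k = k
        self.linear = nn.Linear(in_features, out_features * k)  # k times the parameters

    def forward(self, x):
        z = self.linear(x)                       # (batch, out_features * k)
        z = z.view(x.size(0), -1, self.k)        # (batch, out_features, k) pieces
        return z.max(dim=-1).values              # take the max over the k pieces
```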
