[Deep Learning] Convolutional Neural Networks (CNNs)

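This is the second post in a series that summarizes and translates a Deep Learning course. It covers convolutional neural networks. The material follows the lecture notes Feedforward Nets and Conv Nets (lecturer: Dario Garcia) and the Stanford cs231n video lectures.
First post in this series: [Deep Learning] Neural Network Basics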

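This post gives a brief overview of the main ideas behind convolutional neural networks (Conv Neural Networks). For the parts that interest you, the referenced papers give the details.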

Convolutional Neural Networks

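Although backpropagation was proposed in the 1980s, it was not until 2006 that Hinton and Salakhutdinov published a paper^[1] showing that a deep neural network could actually be trained. The method in that paper still differs from how deep networks are trained today: it is very sensitive to initialization, so in a pre-training stage each hidden layer is trained separately with a restricted Boltzmann machine to obtain initial weights, and only then is the whole network fine-tuned with backpropagation. In 2010, Acoustic Modeling using Deep Belief Networks^[2] brought deep neural networks a clear practical win in speech recognition. In 2012, Alex Krizhevsky et al. published Imagenet classification with deep convolutional neural networks^[3]; AlexNet cut the image classification error benchmark substantially and became a milestone in image classification.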

How CNNs Work: CNN from the Inside

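CNNs are inspired by biology. In the late 1950s, Hubel and Wiesel recorded electrical signals from cat brains while presenting visual stimuli and analyzed how the neurons responded. They concluded:

1. Nearby cells in the cortex correspond to nearby regions of the visual field (the cortical map preserves spatial layout).

[Figure: Visual cortex mapping]

2. Neurons are organized hierarchically
Responses progress from edges and contours to oriented corners, blobs (connected regions of an image with similar color or texture), and so on. The figure below visualizes the convolutional filters of a neural network: each cell shows the input that maximizes the output of the corresponding neuron, that is, what the neuron is "looking for", by analogy with neurons in biological vision. Different filters produce different effects: averaging each pixel with its neighbors blurs the image, while comparing the center pixel with the surrounding pixels detects edges. First-layer filters usually resemble Gabor filters and pick up basic image features; deeper layers combine those features to learn more complex patterns.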
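The blur and edge-detection filters mentioned above can be sketched in a few lines of plain Python. This is only an illustration: the 3x3 box-blur and Laplacian-style kernels and the toy 5x5 image are my own choices, not taken from the lecture.

```python
def filter2d(image, kernel):
    """Slide a square kernel over a 2-D image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Box blur: every output pixel is the mean of its 3x3 neighborhood.
BLUR = [[1 / 9] * 3 for _ in range(3)]
# Laplacian-style kernel: compares the center pixel with its 8 neighbors.
EDGE = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]

# Toy image (assumed for illustration): dark left columns, bright right columns.
image = [[0, 0, 9, 9, 9] for _ in range(5)]

blurred = filter2d(image, BLUR)  # the hard step becomes a gradual ramp
edges = filter2d(image, EDGE)    # nonzero near the step, zero in the flat area
print(edges[0])  # [-27, 27, 0]
```

The edge kernel sums to zero, so it responds only where the center pixel differs from its surroundings, which is why the flat bright region maps to 0.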

[Figure: Hierarchy of filters across convolutional layers]

Filter visualization tool: Deep Visualization Toolbox (video)

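CNN Volumes
Convolutional layer
A convolutional layer preserves the spatial structure of the image. Several different filters are slid across the input, and at every position each filter takes a dot product over the full depth of the input (in practice, the 3-D image patch and the filter are both flattened into 1-D vectors for the dot product). Unlike a fully connected layer, a convolutional layer has no separate weight for every input-output pair. Instead, each filter connects locally to the previous layer's output. In the resulting activation map, every neuron is connected to only a small region of the input, and all neurons in the same map share the same weights.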
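A minimal sketch of the full-depth dot product, written in plain Python for readability rather than speed. The `[C][H][W]` layout, the function name `conv_layer`, and the toy values are assumptions made for this example.

```python
def conv_layer(volume, filters, biases):
    """volume: [C][H][W]; filters: [K][C][F][F]; returns K activation maps."""
    C, H, W = len(volume), len(volume[0]), len(volume[0][0])
    F = len(filters[0][0])
    maps = []
    for filt, bias in zip(filters, biases):
        # The same filter weights are reused at every position (weight sharing).
        weights = [filt[c][a][b]
                   for c in range(C) for a in range(F) for b in range(F)]
        amap = []
        for i in range(H - F + 1):
            row = []
            for j in range(W - F + 1):
                # Flatten the C x F x F patch, then take a single dot product.
                patch = [volume[c][i + a][j + b]
                         for c in range(C) for a in range(F) for b in range(F)]
                row.append(sum(p * w for p, w in zip(patch, weights)) + bias)
            amap.append(row)
        maps.append(amap)
    return maps

# Toy input (assumed): 3 channels of 4x4, all ones.
volume = [[[1] * 4 for _ in range(4)] for _ in range(3)]
filters = [
    [[[1, 0], [0, 1]]] * 3,  # filter 0: 2x2 diagonal, repeated over 3 channels
    [[[1, 1], [1, 1]]] * 3,  # filter 1: 2x2 all ones, repeated over 3 channels
]
out = conv_layer(volume, filters, biases=[0, 1])
print(len(out), len(out[0]), len(out[0][0]))  # 2 3 3
```

Two filters give an output of depth 2, and each 3x3 activation map is produced by only 2 x 2 x 3 weights plus one bias.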

[Figure: The convolution operation, in the loose sense of the term]

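After adding a border of zero-padding of width P around the input, the output width and height are:

(N - F + 2P) / S + 1, where N is the input width or height, F the filter size, and S the stride.

So with stride 1 and suitable padding (P = (F - 1) / 2, for example one ring of zeros for a 3x3 filter), the output has the same spatial size as the input, and its depth equals the number of distinct filters. Typically the first layer has depth 3 for the RGB channels, and the depth of each later layer is set by how many kinds of features its filters need to detect.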
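The output-size rule can be checked with a small helper. This sketch assumes the standard formula (N - F + 2P) / S + 1, with N the input size, F the filter size, P the zero-padding width, and S the stride; the function name is hypothetical.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Spatial output size of a convolution: (n - f + 2 * padding) / stride + 1."""
    span = n - f + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1

print(conv_output_size(32, 3, padding=1, stride=1))  # 32: same size as the input
print(conv_output_size(32, 5, padding=0, stride=1))  # 28
print(conv_output_size(7, 3, padding=0, stride=2))   # 3
```

Raising on a non-integer result is a design choice for this sketch; many frameworks round down instead.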

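[Figure: ConvNet]

1x1 convolution
Changes only the number of channels (usually to reduce dimensionality); the spatial size stays the same.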
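As a rough sketch, a 1x1 convolution is just a per-pixel linear map across channels. The `[C][H][W]` layout, the function name `conv1x1`, and the weights below are assumptions for illustration.

```python
def conv1x1(volume, weights):
    """volume: [C_in][H][W]; weights: [C_out][C_in]; returns [C_out][H][W]."""
    H, W = len(volume[0]), len(volume[0][0])
    return [
        [
            [sum(w * volume[c][i][j] for c, w in enumerate(w_row))
             for j in range(W)]
            for i in range(H)
        ]
        for w_row in weights
    ]

# Toy input (assumed): 4 channels of 2x2, channel c filled with the value c + 1.
volume = [[[c + 1] * 2 for _ in range(2)] for c in range(4)]
weights = [[1, 1, 1, 1], [1, 0, 0, -1]]  # 4 input channels -> 2 output channels
out = conv1x1(volume, weights)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
```

Height and width are untouched; only the depth goes from 4 to 2.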

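Pooling layer / Subsampling
Shrinks width and height while leaving depth unchanged. Pooling layers typically use no padding and non-overlapping windows.

[Figure: Max pooling]

Pooling provides (local) translation invariance
Pooling is applied to each channel independently
Spatial pyramid pooling handles multiple scales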
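A minimal sketch of non-overlapping max pooling on one channel, assuming a square window whose stride equals its size and no padding. The 4x4 input is an illustrative toy example.

```python
def max_pool(channel, size=2):
    """Non-overlapping max pooling on one channel ([H][W]); stride == size."""
    H, W = len(channel), len(channel[0])
    return [
        [
            max(channel[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, W - size + 1, size)
        ]
        for i in range(0, H - size + 1, size)
    ]

def max_pool_volume(volume, size=2):
    """Pool every channel independently, so the depth is unchanged."""
    return [max_pool(channel, size) for channel in volume]

channel = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(max_pool(channel))  # [[6, 8], [3, 4]]

# A feature that moves inside one pooling window leaves the output unchanged.
print(max_pool([[9, 0], [0, 0]]) == max_pool([[0, 0], [0, 9]]))  # True
```

The last line shows the limited sense in which pooling gives translation invariance: only shifts that stay within a window are absorbed.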
Fully connected layer
32x32x3 image -> stretch to 3072x1
CNN parameters

[Figure: Number of parameters in a convolutional layer]
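The parameter count of a convolutional layer can be sketched as follows, assuming the usual convention that each of K filters has F x F x C_in weights plus one bias. The function names and the example layer sizes are illustrative, not taken from the lecture.

```python
def conv_params(f, c_in, k, bias=True):
    """Parameters of a conv layer with k filters of size f x f x c_in."""
    return (f * f * c_in + (1 if bias else 0)) * k

def fc_params(n_in, n_out):
    """Parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

# Ten 5x5 filters over an RGB input (assumed example sizes).
print(conv_params(5, 3, 10))       # 760
# Ten fully connected neurons over a flattened 32x32x3 image, for comparison.
print(fc_params(32 * 32 * 3, 10))  # 30730
```

The conv count does not depend on the image size at all, which is a direct consequence of weight sharing.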

Training a CNN
Online demo (https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html)

CNN Architectures

A typical architecture is [(CONV-RELU)*N-POOL]*M-(FC-RELU)*K, SOFTMAX
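To make the pattern concrete, this small sketch expands it into a flat list of layer names. The helper `build_layers` and the values of N, M, K are hypothetical; it only manipulates names and builds no actual network.

```python
def build_layers(n, m, k):
    """Expand [(CONV-RELU)*N-POOL]*M-(FC-RELU)*K, SOFTMAX into layer names."""
    block = ["CONV", "RELU"] * n + ["POOL"]
    return block * m + ["FC", "RELU"] * k + ["SOFTMAX"]

print("-".join(build_layers(n=2, m=2, k=1)))
# CONV-RELU-CONV-RELU-POOL-CONV-RELU-CONV-RELU-POOL-FC-RELU-SOFTMAX
```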

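[Figure: CNN architecture]

The activation function is listed as a separate layer because the output is determined jointly by the neurons along the depth dimension.

[Figure: ReLU inputs]

Summaries: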
https://www.cnblogs.com/guoyaohua/p/8534077.html
https://medium.com/@sidereal/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
https://my.oschina.net/u/876354/blog/1637819

LeNet
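Proposed by LeCun for handwritten digit recognition
AlexNet
Five convolutional layers (with pooling), three fully connected layers, ReLU, Dropout 0.5, and a final Softmax or SVM classifier

[Figure: AlexNet]

ZFNet
Reduces the kernel size and stride, because a large kernel with a large stride breaks the local-connectivity assumption
Increases the number of filters (channels) for better representational power

[Figure: ZFNet]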
VGG
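VGG16/19 were proposed by the Oxford Visual Geometry Group^[4]^[5]. The main idea is to add layers, which enlarges the receptive field and at the same time increases nonlinear fitting capacity (a stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer). The downside is that the network is deep enough to need pre-training.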
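The receptive-field claim can be verified with a short calculation. This is a sketch under the usual recurrence (each layer widens the field by (k - 1) times the product of the earlier strides); the channel count C = 64 in the weight comparison is an arbitrary example.

```python
def stacked_receptive_field(kernel_sizes, strides=None):
    """Effective receptive field of stacked conv layers (first layer first)."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= s             # stride multiplies the step between neighbors
    return rf

print(stacked_receptive_field([3, 3, 3]))  # 7
print(stacked_receptive_field([7]))        # 7

# Weights for C input and C output channels, biases ignored.
C = 64
three_3x3 = 3 * (3 * 3 * C * C)
one_7x7 = 7 * 7 * C * C
print(three_3x3, one_7x7)  # 110592 200704
```

The three-layer stack covers the same 7x7 window with roughly 55% of the weights and two extra nonlinearities in between.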
Unlike AlexNet, VGG drops Local Response Normalization.

A problem shared by AlexNet and VGG is the huge number of parameters in the fully connected layers.
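A back-of-the-envelope check of that claim for VGG-16. The layer sizes below (a 7x7x512 feature map feeding 4096, 4096, and 1000 units) are the standard VGG-16 configuration for 224x224 inputs as I understand it; treat them as assumptions of this sketch.

```python
def fc_params(n_in, n_out):
    """Parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

def conv_params(f, c_in, k):
    """Parameters of a conv layer with k filters of size f x f x c_in, plus biases."""
    return (f * f * c_in + 1) * k

# Assumed VGG-16 classifier head: 7x7x512 -> 4096 -> 4096 -> 1000.
vgg16_fc = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]
fc_total = sum(fc_params(n_in, n_out) for n_in, n_out in vgg16_fc)
print(fc_total)  # 123642856

# One of the largest conv layers, 3x3 with 512 -> 512 channels, for comparison.
print(conv_params(3, 512, 512))  # 2359808
```

Under these assumptions the three fully connected layers hold about 124 million parameters, the large majority of the roughly 138 million usually quoted for the whole network.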

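[Figure: VGG]

GoogLeNet
GoogLeNet^[6], with the later Inception variants^[7]^[8]
Depth increases to 22 layers; it was the first…
Borrows the Network in Network (NIN) idea: each module combines kernels of different sizes
That makes the channel count explode, so 1x1 convolutions are introduced to reduce the number of channels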
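To see why the 1x1 reduction helps, here is a rough multiplication count for a single 5x5 branch. The sizes (a 28x28x192 input, 32 output channels, a 16-channel bottleneck) are illustrative numbers chosen for this sketch, not figures from the GoogLeNet paper.

```python
def conv_mults(h, w, c_in, c_out, f):
    """Multiplications for a same-padded, stride-1 conv with an h x w x c_out output."""
    return h * w * c_out * f * f * c_in

# Direct 5x5 conv: 192 -> 32 channels.
direct = conv_mults(28, 28, 192, 32, 5)
# 1x1 conv down to 16 channels first, then the 5x5 conv: 16 -> 32 channels.
reduced = conv_mults(28, 28, 192, 16, 1) + conv_mults(28, 28, 16, 32, 5)
print(direct, reduced)  # 120422400 12443648
```

With these numbers the bottleneck cuts the multiplications by almost a factor of ten while keeping the same output shape.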

Inception-v3
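Borrowing the Encoder-Decoder idea, replaces a 3x3 convolution with a 3x1 convolution followed by a 1x3 convolution; the receptive field is unchanged, but is the expressive power really increased??

Highway Networks
ResNet
Proposed by Microsoft. Introduces residual blocks whose layers fit the residual F(x) = H(x) - x instead of the target mapping H(x) directly, which alleviates the vanishing-gradient problem^[9]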
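A minimal sketch of the residual idea, y = relu(F(x) + x). For brevity the residual branch F uses two small fully connected layers instead of convolutions, which is a simplification of mine; all names and values are illustrative.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    """y = relu(F(x) + x), with F = linear -> relu -> linear."""
    f = linear(relu(linear(x, w1, b1)), w2, b2)     # residual branch F(x)
    return relu([fi + xi for fi, xi in zip(f, x)])  # identity shortcut adds x

x = [1.0, 2.0, 3.0]
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

# With all-zero weights F(x) = 0, so the block reduces to the identity.
print(residual_block(x, zero_w, zero_b, zero_w, zero_b))  # [1.0, 2.0, 3.0]
```

Because the shortcut passes x through unchanged, a block only has to learn a correction on top of the identity, and gradients have a direct path back to earlier layers.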
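Residual networks contain significant redundancy
ResNeXt
DenseNet
Introduces dense connectivity
Between dense blocks, conv and pooling layers reduce dimensionality
ShuffleNet
Spatial complexity
1x1

Channel complexity
The main performance bottleneck of a standard conv lies in the channel dimension
-> group conv -> depth-wise conv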
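The progression from standard to group to depth-wise convolution can be compared by weight count. This sketch assumes the input and output channels are split evenly into independent groups; the channel count of 256 and the group count of 8 are arbitrary examples.

```python
def grouped_conv_weights(f, c_in, c_out, groups=1):
    """Weights of an f x f conv whose channels are split into independent groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return f * f * (c_in // groups) * (c_out // groups) * groups

c = 256
standard = grouped_conv_weights(3, c, c)             # every output sees every input
grouped = grouped_conv_weights(3, c, c, groups=8)    # 8 independent groups
depthwise = grouped_conv_weights(3, c, c, groups=c)  # one filter per channel
pointwise = grouped_conv_weights(1, c, c)            # 1x1 conv that mixes channels
print(standard, grouped, depthwise, pointwise)  # 589824 73728 2304 65536
```

A depth-wise conv followed by a 1x1 conv needs 2304 + 65536 weights here, against 589824 for the standard 3x3 conv.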

Advanced CNN Module

Transpose Convolution
Combines upsampling with convolution. It goes by several names: transposed convolution, "deconvolution" (Deconv, a loose use of the term), Upconv, fractionally strided convolution, and backward strided convolution.

See Vincent Dumoulin, Francesco Visin: A guide to convolution arithmetic for deep learning.
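A 1-D sketch of transposed convolution, assuming no padding: every input value scatters a scaled copy of the kernel into the output, so the output length is (n - 1) * stride + k. The function name and the toy input are illustrative.

```python
def conv_transpose1d(x, kernel, stride=1):
    """1-D transposed convolution without padding."""
    k = len(kernel)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, xi in enumerate(x):
        for a, w in enumerate(kernel):
            # Overlapping kernel copies are summed.
            out[i * stride + a] += xi * w
    return out

print(conv_transpose1d([1.0, 2.0], [1.0, 1.0, 1.0], stride=2))
# [1.0, 1.0, 3.0, 2.0, 2.0]
```

Two inputs become five outputs, and the middle value 3.0 is where the two kernel copies overlap.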
3D Convolution
3D CNN
Often requires group conv
Graph Convolution
Graph neural networks
Non-grid data; manifold learning in non-Euclidean spaces

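CNN Applications

CNNs are currently used for image classification (datasets: MNIST; CIFAR, with 10 or 100 classes of low-resolution images; ImageNet, with 1,000 classes including similar categories that are hard to tell apart), retrieval and captioning, detection and segmentation (bounding box -> segmentation; the PASCAL VOC dataset^[10]), autonomous driving, video classification, face recognition, pose recognition, and, combined with reinforcement learning, game playing (ATARI videogames or the GO boardgame).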

Style transfer
Computes Gram matrices^[11]
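As a rough sketch, the Gram matrix of a feature volume is the matrix of inner products between its flattened channels. The `[C][H][W]` layout and the toy activations are assumptions, and normalization (which varies between implementations) is left out.

```python
def gram_matrix(features):
    """features: [C][H][W]; returns the C x C matrix of channel inner products."""
    # Flatten each channel into a vector of length H * W.
    flat = [[v for row in channel for v in row] for channel in features]
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in flat] for fi in flat]

# Toy activations (assumed): 2 channels of 2x2.
features = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
print(gram_matrix(features))  # [[2, 2], [2, 4]]
```

Entry (i, j) measures how strongly channels i and j fire together, regardless of where in the image, which is why it captures texture and style rather than layout.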
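Image colorization^[12]^[13]
Uses a deconvolution architecture^[14], which also suits image segmentation because it allows classification at the pixel level^[15].

[Figure: Deconvolution architecture]

[Figure: Image colorization]

Retrieval / Captioning
Retrieval maps text -> image and captioning maps image -> text. Combining a CNN with a text representation model produces a feature space shared by the two modalities^[16].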


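Segmentation
Mask R-CNN
semantic segmentation & instance segmentation
Face recognition
Based on HOG (histogram of oriented gradients) features
Facial landmark detection

[Figure: Face recognition system]

Future outlook

References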

1. Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
2. Mohamed, Abdel-rahman, George E. Dahl, and Geoffrey Hinton. "Acoustic modeling using deep belief networks." IEEE Trans. Audio, Speech & Language Processing 20.1 (2012): 14-22.
3. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
4. Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
5. http://www.robots.ox.ac.uk/~vgg/practicals/cnn/
6. Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
7. https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/
8. http://iamaaditya.github.io/2016/03/one-by-one-convolution/
9. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
10. Chen, Liang-Chieh, et al. "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs." arXiv preprint arXiv:1606.00915 (2016).
11. Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "A neural algorithm of artistic style." arXiv preprint arXiv:1508.06576 (2015).
12. Zhang, Richard, Phillip Isola, and Alexei A. Efros. "Colorful image colorization." European Conference on Computer Vision. Springer International Publishing, 2016.
13. Iizuka, Satoshi, Edgar Simo-Serra, and Hiroshi Ishikawa. "Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification." ACM Transactions on Graphics (TOG) 35.4 (2016): 110.
14. http://www.tensorflowexamples.com/2017/01/transposed-convnets-or-deconvolution.html
15. Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015.
16. Kiros, Ryan, Ruslan Salakhutdinov, and Richard S. Zemel. "Unifying visual-semantic embeddings with multimodal neural language models." arXiv preprint arXiv:1411.2539 (2014).