Going Deeper with convolutions

Reposted from: http://blog.csdn.net/stdcoutzyx/article/details/40759903

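This paper targets the ImageNet 2014 competition, where its method took first place in both task 1 (classification) and task 2 (detection). It focuses on efficient deep neural network architectures for computer vision: by improving the structure of the network, it increases depth without increasing the demand for computing resources, and thereby improves results.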

1. Main Contribution

  • Improve utilization of the computing resources inside the network, which is achieved by carefully crafted design and allows for increasing the depth and width of the network while keeping the computational budget constant.

  • Architecture decisions are based on the Hebbian principle and the intuition of multi-scale processing.

  • A 22-layer deep network is assessed in the competition.

2. Related Work

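  • The network architecture proposed in the paper is called Inception; the name derives from the paper's reference 12 (Network in Network).
  • The recent trend in CNNs is to increase the number of layers and the layer size, while using dropout to address overfitting.
  • The paper's reference 15 uses Gabor filters of different sizes to handle multiple scales, which is similar to the Inception model in this paper.
  • Following reference 12, the paper makes heavy use of 1×1 convolution kernels. Here their main role is dimension reduction, which removes computational bottlenecks.
  • The leading approach for the detection task is Regions with Convolutional Neural Networks (R-CNN) (reference 6). It works in two steps:
    • First utilize low-level cues such as color and superpixel consistency for potential object proposals in a category-agnostic fashion.
    • Then use CNN classifiers to identify object categories at those locations.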

3. Motivation and High level considerations

3.1. Drawbacks of increasing CNN size directly:

  • More prone to overfitting.
  • Dramatically increased use of computational resources. (For example, if most weights end up close to zero, a lot of computation is wasted.)

3.2. How to solve it?

  • The fundamental way would be by ultimately moving from fully connected to sparsely connected architectures.
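  • The paper's reference 2 shows that, by analyzing the correlation statistics of activations, an optimal sparse network structure can be constructed layer by layer. This resonates with the Hebbian principle: neurons that fire together, wire together.
  • At a lower level, today's hardware is very inefficient at computing on non-uniform sparse data structures, especially when the libraries in use have been optimized for dense matrices. Since the paper's reference 11, ConvNets traditionally used random, sparse connection tables to break symmetry and improve learning, but reference 9 switched back to full connections in order to exploit the efficiency of dense computation.
  • So the question is whether there is a method that keeps the sparsity of the network structure while still exploiting the high computational performance of dense matrices. The paper proposes the Inception module to achieve this.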

4. Architectural Detail

The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.

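How can the optimal structure be found? Think of it this way: a unit in a lower layer corresponds to some region of the image. A 1×1 kernel still covers just that region, while a 3×3 kernel covers a larger one (and a 5×5 kernel a larger one still). This leads to the design in Figure 1.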

Figure 1: Inception Module, Naïve version

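To reduce dimensionality, 1×1 kernels are placed before the expensive convolutions, giving the design in Figure 2. Dimension reduction works mainly thanks to the success of embeddings: even a low-dimensional representation can still carry a lot of information.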

Figure 2: Inception Module with dimension reductions
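To make the dimension-reduction idea concrete, here is a minimal pure-Python sketch of a 1×1 convolution: every pixel's channel vector is multiplied by the same small matrix, so the spatial size stays the same while the channel count shrinks. The function name `conv1x1`, the toy feature map, and the weights are all made up for illustration; they are not taken from the paper.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution (no bias) to an H x W x C_in nested list.

    weights is a C_in x C_out nested list. The same matrix is applied to
    every pixel, so only the channel dimension changes.
    """
    c_out = len(weights[0])
    return [
        [
            [sum(pixel[i] * weights[i][o] for i in range(len(pixel)))
             for o in range(c_out)]
            for pixel in row
        ]
        for row in feature_map
    ]


# Toy 2 x 2 feature map with 4 channels per pixel (illustrative values).
fmap = [[[1, 2, 3, 4], [0, 1, 0, 1]],
        [[2, 2, 2, 2], [4, 3, 2, 1]]]

# Hypothetical 4 -> 2 projection: sum channels 0 and 2, and channels 1 and 3.
w = [[1, 0],
     [0, 1],
     [1, 0],
     [0, 1]]

reduced = conv1x1(fmap, w)  # still 2 x 2 spatially, now 2 channels per pixel
```

In a trained network the projection weights are learned rather than hand-picked as they are here.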

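The filter concatenation layer joins the outputs of the 1×1, 3×3, and 5×5 convolutions, together with the output of the pooling branch, along the channel dimension.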
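The module with dimension reductions can be sketched as a forward pass in NumPy. This is only an illustration under several assumptions: stride 1 with zero "same" padding, ReLU after every convolution, no biases, random weights, and branch widths (64, 96, 128, 16, 32, 32) that are illustrative values in the style of the paper's first Inception module. The helper names (`conv2d_same`, `maxpool3_same`, `init_params`, `inception_forward`) are hypothetical.

```python
import numpy as np


def conv2d_same(x, w):
    """Stride-1 convolution with zero 'same' padding.

    x: (H, W, C_in) feature map; w: (k, k, C_in, C_out) kernel, k odd.
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, wd = x.shape[:2]
    out = np.zeros((h, wd, w.shape[3]))
    for i in range(k):
        for j in range(k):
            # Each kernel offset contributes one shifted channel projection.
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out


def maxpool3_same(x):
    """3x3 max pooling, stride 1, padded so the spatial size is unchanged."""
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    h, wd = x.shape[:2]
    shifts = [xp[i:i + h, j:j + wd, :] for i in range(3) for j in range(3)]
    return np.max(shifts, axis=0)


def relu(x):
    return np.maximum(x, 0.0)


def init_params(rng, c_in, n1, n3r, n3, n5r, n5, npool):
    """Random weights for the four branches (illustrative initialization)."""
    def kernel(k, ci, co):
        return rng.standard_normal((k, k, ci, co)) * 0.01
    return {
        "1x1": kernel(1, c_in, n1),
        "3x3_reduce": kernel(1, c_in, n3r), "3x3": kernel(3, n3r, n3),
        "5x5_reduce": kernel(1, c_in, n5r), "5x5": kernel(5, n5r, n5),
        "pool_proj": kernel(1, c_in, npool),
    }


def inception_forward(x, p):
    b1 = relu(conv2d_same(x, p["1x1"]))
    b2 = relu(conv2d_same(relu(conv2d_same(x, p["3x3_reduce"])), p["3x3"]))
    b3 = relu(conv2d_same(relu(conv2d_same(x, p["5x5_reduce"])), p["5x5"]))
    b4 = relu(conv2d_same(maxpool3_same(x), p["pool_proj"]))
    # Filter concatenation: stack the branch outputs along the channel axis.
    return np.concatenate([b1, b2, b3, b4], axis=-1)


rng = np.random.default_rng(0)
params = init_params(rng, c_in=192, n1=64, n3r=96, n3=128, n5r=16, n5=32, npool=32)
x = rng.standard_normal((8, 8, 192))  # small spatial size to keep the demo fast
y = inception_forward(x, params)      # channels: 64 + 128 + 32 + 32 = 256
```

The spatial size is preserved in every branch, which is what makes the channel-wise concatenation possible.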

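The benefit of this design is that it prevents the demand for computing resources from exploding as layers are added, so both the width and the depth of the network can grow. According to the paper, networks built from Inception layers can be 2-3× faster than similarly performing networks without them.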
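A rough multiplication count shows why the 1×1 reductions matter. The sketch below compares a naïve module with a dimension-reduced one on a 28×28×192 input. The branch widths are illustrative values in the style of the paper's first Inception module, and the counting rule (multiplications only, stride 1, same padding, no biases) is a simplifying assumption.

```python
def conv_mults(h, w, k, c_in, c_out):
    """Multiplications for a stride-1, same-padded k x k convolution."""
    return h * w * k * k * c_in * c_out


H = W = 28
C_IN = 192
# Illustrative branch widths: 1x1, 3x3 reduce, 3x3, 5x5 reduce, 5x5, pool proj.
N1, N3R, N3, N5R, N5, NPOOL = 64, 96, 128, 16, 32, 32

# Naive module: 3x3 and 5x5 kernels see all 192 input channels, and the
# pooling branch passes all 192 channels straight to the concatenation.
naive_mults = (conv_mults(H, W, 1, C_IN, N1)
               + conv_mults(H, W, 3, C_IN, N3)
               + conv_mults(H, W, 5, C_IN, N5))
naive_channels = N1 + N3 + N5 + C_IN

# Reduced module: 1x1 convolutions shrink the channel count first, and the
# pooling branch is projected down to NPOOL channels.
reduced_mults = (conv_mults(H, W, 1, C_IN, N1)
                 + conv_mults(H, W, 1, C_IN, N3R) + conv_mults(H, W, 3, N3R, N3)
                 + conv_mults(H, W, 1, C_IN, N5R) + conv_mults(H, W, 5, N5R, N5)
                 + conv_mults(H, W, 1, C_IN, NPOOL))
reduced_channels = N1 + N3 + N5 + NPOOL

print(naive_mults, naive_channels)      # 303464448 416
print(reduced_mults, reduced_channels)  # 128049152 256
```

Under these assumptions the reduced module needs about 2.4 times fewer multiplications and emits 256 channels instead of 416, so the next layer is cheaper as well. This is a per-module count, not a measurement of end-to-end speed.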

5. GoogLeNet

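The network is shown in Figure 3. The more detailed architecture diagram is too large to include here; see the original paper.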

Figure 3: GoogLeNet incarnation of the Inception architecture

6. Training Methodology

  • Using the DistBelief distributed machine learning system with a modest amount of model and data parallelism.
  • Training uses asynchronous stochastic gradient descent with 0.9 momentum, a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs), and Polyak averaging (the paper's reference 13).
  • Sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area and whose aspect ratio is chosen randomly between 3/4 and 4/3.
  • Photometric distortions
  • Random interpolation methods for resizing, adopted relatively late and in conjunction with other hyperparameter changes.
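Three of the training ingredients above can be sketched in a few lines of Python. Everything here is an assumption layered on the summary: `BASE_LR` is a hypothetical starting rate (none is given), the aspect ratio is drawn uniformly (the text only says "randomly"), the retry-then-fallback logic in `sample_crop` is one possible way to handle crops that do not fit, and `polyak_update` is the textbook uniform average of parameter iterates rather than a confirmed detail of the authors' setup.

```python
import math
import random

BASE_LR = 0.01  # hypothetical starting rate; the summary does not state one


def learning_rate(epoch, base_lr=BASE_LR):
    """Step schedule: cut the rate by 4% (multiply by 0.96) every 8 epochs."""
    return base_lr * 0.96 ** (epoch // 8)


def polyak_update(avg, params, step):
    """Running uniform mean of parameter iterates; step counts from 1."""
    return [a + (p - a) / step for a, p in zip(avg, params)]


def sample_crop(width, height, rng, max_tries=10):
    """Sample a patch covering 8%-100% of the image area with an aspect
    ratio between 3/4 and 4/3. Returns (x, y, w, h) in pixels."""
    for _ in range(max_tries):
        area = rng.uniform(0.08, 1.0) * width * height
        ratio = rng.uniform(3 / 4, 4 / 3)  # width / height of the patch
        w = int(round(math.sqrt(area * ratio)))
        h = int(round(math.sqrt(area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    return 0, 0, width, height  # fallback: use the whole image


rng = random.Random(0)
crop = sample_crop(640, 480, rng)
rates = [learning_rate(e) for e in (0, 8, 16)]  # base, 0.96*base, 0.9216*base
```

Photometric distortions and the choice of interpolation method are left out because the summary gives no parameters for them.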

7. Experiments Setup and Results

  • Trained 7 versions of the same GoogLeNet model and performed ensemble prediction with them. These models are trained with the same initialization, but differ in sampling methodologies and the random order in which they see input images.
  • Testing: resize the image to 4 scales where the shorter dimension is 256, 288, 320, or 352. Take the left, center, and right squares of each resized image. From each square, take the 4 corner crops and the center 224×224 crop, plus the whole square resized to 224×224, along with their mirrored versions. That gives 4×3×6×2 = 144 crops per image.
  • Softmax probabilities are averaged over multiple crops and over all individual classifiers to obtain the final prediction. Simple averaging performed best. The results are as follows:

Figure 4: Performance in the competition

Figure 5: Performance of fusions of models
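The test-time procedure from this section can be sketched as follows: enumerate the 4×3×6×2 crop descriptors, then average softmax probabilities over crops. The label strings, the function names, and the toy logits are hypothetical; actual image cropping and model inference are left out.

```python
import math
from itertools import product

SCALES = (256, 288, 320, 352)          # length of the shorter side after resizing
SQUARES = ("left", "center", "right")  # square regions of the resized image
VIEWS = ("top_left", "top_right", "bottom_left", "bottom_right",
         "center", "resized_square")   # five 224x224 crops plus the full square
MIRRORS = (False, True)                # original and horizontally flipped


def enumerate_crops():
    """All (scale, square, view, mirrored) combinations for one image."""
    return list(product(SCALES, SQUARES, VIEWS, MIRRORS))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_predictions(logits_per_crop):
    """Simple average of softmax probabilities over crops (or over models)."""
    probs = [softmax(logits) for logits in logits_per_crop]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]


crops = enumerate_crops()
print(len(crops))  # 144

# Toy logits for two crops over three classes (made-up numbers).
final = average_predictions([[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]])
```

The same `average_predictions` helper could be applied a second time across the 7 models, since the summary describes averaging over both crops and classifiers.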

8. Reference

[1]. Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[J]. arXiv preprint arXiv:1409.4842, 2014.
