Rethinking the Inception Architecture for Computer Vision: Notes

An improved Inception CNN architecture for computer vision (3.46% top-5 error on ImageNet)


Abstract: Goal: "Here we are exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization."

That is, the paper explores how to scale networks up so that the added computation is used as efficiently as possible, through suitably factorized convolutions and aggressive regularization.


Results: "We benchmark our methods on the ILSVRC 2012 classification challenge validation set demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and with using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error."







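That is: about 5 billion multiply-adds per inference and fewer than 25 million parameters for the single model; the best result uses an ensemble of four models.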


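Introduction: GoogLeNet was designed to save memory and reduce computational complexity. However, the Inception architecture is still hard to modify, and its computational cost remains large when it is scaled up.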

"Also, [20] does not provide a clear description about the contributing factors that lead to the various design decisions of the GoogLeNet architecture."


The role of Inception is still not well explained: "In this paper, we start with describing a few general principles and optimization ideas that proved to be useful for scaling up convolution networks in efficient ways. Although our principles are not limited to Inception-type networks, they are easier to observe in that context as the generic structure of the Inception style building blocks is flexible enough to incorporate those constraints naturally. This is enabled by the generous use of dimensional reduction and parallel structures of the Inception modules which allows for mitigating the impact of structural changes on nearby components. Still, one needs to be cautious about doing so, as some guiding principles should be observed to maintain high quality of the models."

In short, the paper first discusses some general principles and optimization ideas for scaling up networks efficiently. These principles are not limited to Inception-type networks, but they are easier to observe there.


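This paper:

Conclusion:

The paper provides several design principles for scaling up convolutional networks and studies them in the context of the Inception architecture. This guidance can lead to high-performance vision networks at a relatively modest computation cost compared to simpler, more monolithic architectures. "Our highest quality version of Inception-v3 reaches 21.2% top-1 and 5.6% top-5 error for single crop evaluation on the ILSVRC 2012 classification, setting a new state of the art."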


These are the numbers for the best Inception-v3 version with single-crop evaluation. The paper improves accuracy while using less computation.





 

 

5×5 conv -> 3×3 conv + 3×3 fc (a fully connected layer on top of the 3×3 output grid of the first layer): "we end up with a net (9+9)/25× reduction of computation, resulting in a relative gain of 28% by this factorization"
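The 28% figure can be checked by counting weights per input/output channel pair. The sketch below is only an illustration of that arithmetic: the helper names are hypothetical, and it assumes every layer keeps the same number of channels and the same number of output positions, so the cost is proportional to the kernel area.

```python
def kernel_weights(kernels):
    """Total kernel area (weights per input/output channel pair) of a stack of conv layers."""
    return sum(kh * kw for kh, kw in kernels)


def cost_ratio(original, factorized):
    """Cost of the factorized stack relative to the single original kernel.

    Assumption: equal channel counts and equal numbers of output positions
    in every layer, so cost scales with kernel area only.
    """
    return kernel_weights(factorized) / kernel_weights([original])


# 5x5 replaced by two stacked 3x3 layers: (9 + 9) / 25
ratio_5x5 = cost_ratio((5, 5), [(3, 3), (3, 3)])
saving_5x5 = 1.0 - ratio_5x5

# 3x3 replaced by 3x1 followed by 1x3: (3 + 3) / 9
ratio_3x3 = cost_ratio((3, 3), [(3, 1), (1, 3)])

print(f"5x5 -> 3x3 + 3x3: ratio {ratio_5x5:.2f}, saving {saving_5x5:.0%}")
print(f"3x3 -> 3x1 + 1x3: ratio {ratio_3x3:.2f}")
```

The same helper applies to the asymmetric n×1 / 1×n splits mentioned later in these notes, since only the kernel areas change.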

 

Instead, we argue that the auxiliary classifiers act as regularizer.

This is supported by the fact that the main classifier of the network performs better if the side branch is batch-normalized [7] or has a dropout layer. This also gives a weak supporting evidence for the conjecture that batch normalization acts as a regularizer.

 

In addition, gradient clipping [14] was found to be useful to stabilize the training.
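As a rough illustration of what gradient clipping does, here is a minimal sketch of clipping by global norm. These notes do not say which clipping variant or threshold was used, so both the function and the `max_norm` value below are assumptions chosen for the example.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the limit: leave the gradient untouched.
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# Hypothetical gradient with norm 5.0, clipped to an assumed threshold of 2.0.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=2.0)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(clipped, clipped_norm)
```

Clipping keeps the direction of the update and only limits its size, which is why it tends to stabilize training when occasional gradients are very large.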

 

GoogLeNet characteristics: "Much of the original gains of the GoogLeNet network [20] arise from a very generous use of dimension reduction."

 

Network architecture:


Comparison of results:

[Figure 1: comparison of results (image not preserved)]

 


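The paper's results and models are first-rate, and some of its advice on how to design convolutional networks is very useful. However, some of its explanations of the underlying principles are still debatable, and its use of academic terminology and its references could be improved. Worth reading.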

The latest NN from Google (Inception v3) recognizes images with best-ever accuracy while considerably reducing the amount of computation. For example, it can recognize that an image contains an apple, a poodle, an airplane, etc. The paper explains in detail how to construct and train such a network.

1. “Avoid representational bottlenecks” (“especially early in the network”). I.e. decrease layer sizes (“representation size”) gradually (moving from NN input up to output layers).

2. Try using more features (“higher dimensional representations”), e.g. if a layer is 160x120x64, try using 160x120x(>64).

3. Try more aggressive pooling (“Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power.”) because xy-wise-adjacent units are often correlated anyway.

4. Try increasing *both* network width and depth (not only one of those at a time)

– replace a 5×5 convolution with two 3×3 convolutions connected in series; similarly convert 7×7 and larger filters into stacks of 3×3s

– replace a 3×3 convolution with a 3×1 convolution connected in series with a 1×3; similarly, e.g., 7×7 to 7×1 and 1×7 for “medium-sized” layers

– view a 5×5 convolution as a “two-stage funnel-like” structure: a 3×3 grid of 3×3 convolution outputs, followed by a single 3×3 layer on top

– do use “batch normalization” (critical)

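– use “label dropout” (called label smoothing in the published paper) for regularization, “encouraging the model to be less confident”: the one-hot target is mixed with a small amount of a prior distribution over labels, which amounts to sometimes supplying a randomly drawn, possibly incorrect, label during training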
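The “label dropout” idea in the last bullet can be sketched as label smoothing with a uniform prior. This is a minimal illustration, not the paper's training code: the function name is hypothetical, and `epsilon` and the class count are assumed example values.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over all classes.

    With probability mass (1 - epsilon) the target stays on the true class;
    the remaining epsilon is spread evenly over all num_classes labels
    (assumed uniform prior).
    """
    target = [epsilon / num_classes] * num_classes
    target[true_class] += 1.0 - epsilon
    return target


# Hypothetical 4-class problem where the ground truth is class 2.
target = smooth_labels(true_class=2, num_classes=4, epsilon=0.1)
print(target)
```

Because the target never reaches exactly 1 for any class, the model is not rewarded for pushing one logit far above all the others, which is the “less confident” effect the bullet refers to.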

EXTRA DETAILS
– Inception v3 is 42 layers deep, trained using TensorFlow running 50 NN replicas

– to reduce computation – without much performance loss – “change the resolution of the input without further adjustment to the model”

– Inception-v3 achieves new state-of-the-art on ImageNet 2012:

– 21.2% top-1 and 5.6% top-5 error for single frame evaluation, 5 billion multiply-adds per inference, <25 million parameters

– using 4 identical networks, averaging their predictions (ensemble learning) and multi-crop evaluation, 3.5% top-5 error and 17.3% top-1 error

REFRESHER
An Inception NN uses an "Inception" module instead of a traditional convolution module. The Inception module applies a bunch of various-size convolutions/pooling in a tree-branch-like manner and concatenates those outputs back into a single output.
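The concatenation step of an Inception module can be sketched without any framework. The example below is a toy illustration under stated assumptions: feature maps are nested lists of shape H x W x C, the branch outputs are made-up constant maps standing in for the results of real convolution/pooling branches, and all helper names are hypothetical.

```python
def constant_map(height, width, channels, value):
    """Build a toy H x W x C feature map filled with one value (stand-in for a branch output)."""
    return [[[value] * channels for _ in range(width)] for _ in range(height)]


def concat_channels(branch_outputs):
    """Concatenate H x W x C_i feature maps along the channel axis."""
    height = len(branch_outputs[0])
    width = len(branch_outputs[0][0])
    for fmap in branch_outputs:
        # Branches can only be concatenated if they keep the same spatial size.
        assert len(fmap) == height and len(fmap[0]) == width
    return [
        [sum((fmap[y][x] for fmap in branch_outputs), []) for x in range(width)]
        for y in range(height)
    ]


# Three hypothetical branches (e.g. 1x1 conv, 3x3 conv, pooling) with 2, 3 and 1 channels.
branches = [
    constant_map(4, 4, 2, 0.1),
    constant_map(4, 4, 3, 0.2),
    constant_map(4, 4, 1, 0.3),
]
output = concat_channels(branches)
output_shape = (len(output), len(output[0]), len(output[0][0]))
print(output_shape)
```

The output keeps the spatial size of the branches, and its channel count is the sum of the branch channel counts.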


