Original paper: https://arxiv.org/pdf/1409.4842.pdf
Propose a deep convolutional neural network architecture codenamed Inception.
The main hallmark of this architecture is the improved utilization of the computing resources inside the network.
Increase the depth and width of the network while keeping the computational budget constant.
Architectural decisions are based on the Hebbian principle and the intuition of multi-scale processing.
Note:
Hebbian theory:
A theory in neuroscience that proposes an explanation for the adaptation of neurons in the brain during the learning process. It describes a basic mechanism for synaptic plasticity, where an increase in synaptic efficacy arises from the presynaptic cell's repeated and persistent stimulation of a postsynaptic cell (from Wikipedia).
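As a minimal aside (my own illustration, not from the paper), the classic Hebbian learning rule can be written as

$$\Delta w_{ij} = \eta \, x_i \, y_j,$$

where $w_{ij}$ is the weight from presynaptic unit $i$ to postsynaptic unit $j$, $x_i$ and $y_j$ are their activities, and $\eta$ is a learning rate: units that fire together strengthen their connection.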
Uses 12 times fewer parameters than AlexNet.
This architecture can run efficiently on mobile and embedded devices thanks to its low power and memory footprint.
A standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling), followed by one or more fully-connected layers.
Max-pooling layers result in loss of accurate spatial information.
Inspired by a neuroscience model of the primate visual cortex that uses a series of fixed Gabor filters of different sizes to handle multiple scales. In contrast, all filters in the Inception architecture are learned, and Inception layers are repeated many times (leading to the 22-layer GoogLeNet).
Network-in-Network is an approach to increase the representational power of neural networks. Additional 1×1 convolutional layers are added to the network to increase its depth. The dual purpose of using 1×1 convolutional layers (a minimal sketch follows):
1. Mainly as dimension-reduction modules to remove computational bottlenecks that would otherwise limit the size of the network.
2. Allowing the depth, and also the width, of the network to be increased without a significant performance penalty.
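A minimal PyTorch sketch, not from the paper, of the dimension-reduction use of 1×1 convolutions; the channel counts (256, 64, 128) are made-up examples:

```python
import torch
import torch.nn as nn

# A 1x1 convolution squeezes 256 input channels down to 64 before the 3x3 conv,
# cutting the 3x3 conv's cost by roughly 4x compared to running it on 256 channels.
reduce_then_conv = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),              # dimension reduction
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),   # the "expensive" convolution
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 256, 28, 28)     # dummy feature map
print(reduce_then_conv(x).shape)    # torch.Size([1, 128, 28, 28])
```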
R-CNN decomposes the overall detection problem into two subproblems:
1. Utilize low-level cues such as color and superpixel consistency to generate category-agnostic object location proposals.
2. Use CNN classifiers to identify object categories at those locations.
The GoogLeNet detection pipeline improves both of the above stages, e.g. with multi-box prediction for higher object bounding-box recall and with ensemble approaches for better categorization of bounding-box proposals.
Note:
1. Local Contrast Normalization
Enforcing a sort of local competition between adjacent features in a feature map and between features at the same spatial location in different feature maps.
reference: What is the Best Multi-Stage Architecture for Object Recognition?
2. Gabor filter
In image processing, a Gabor filter is a linear filter used for edge detection.
Impulse response (from the Wikipedia article cited here):
Complex: $g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\dfrac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \exp\left(i\left(2\pi \dfrac{x'}{\lambda} + \psi\right)\right)$
Real: $g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\dfrac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\left(2\pi \dfrac{x'}{\lambda} + \psi\right)$
Imaginary: $g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\dfrac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \sin\left(2\pi \dfrac{x'}{\lambda} + \psi\right)$
where $x' = x\cos\theta + y\sin\theta$ and $y' = -x\sin\theta + y\cos\theta$; $\lambda$ is the wavelength of the sinusoidal factor, $\theta$ the orientation of the filter's parallel stripes, $\psi$ the phase offset, $\sigma$ the standard deviation of the Gaussian envelope, and $\gamma$ the spatial aspect ratio.
reference: Wikipedia
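A short NumPy sketch (my own illustration; parameter names follow the formula above) that builds the real part of a Gabor kernel:

```python
import numpy as np

def gabor_kernel(size=21, wavelength=5.0, theta=0.0, psi=0.0, sigma=3.0, gamma=0.5):
    """Real part of a Gabor filter, following the formula above."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot**2 + (gamma * y_rot)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_rot / wavelength + psi)
    return envelope * carrier

kernel = gabor_kernel(theta=np.pi / 4)   # 45-degree oriented filter
print(kernel.shape)                      # (21, 21)
```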
The most straightforward way of improving the performance of deep neural networks is increasing their size, both depth and width. This leads to two drawbacks:
1. A larger number of parameters, which makes the enlarged network more prone to overfitting, especially if labeled training data is limited.
2. A dramatically increased use of computational resources.
A fundamental way of solving both issues is sparsity: replacing the fully connected layers by sparse ones, even inside the convolutions.
If the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs.
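A toy NumPy sketch of the kind of analysis described above; this is my own illustration, not an algorithm from the paper. It computes activation correlations and greedily groups units whose outputs are highly correlated:

```python
import numpy as np

def cluster_correlated_units(activations, threshold=0.8):
    """activations: (num_samples, num_units). Greedily group units whose
    pairwise correlation exceeds `threshold` (illustrative only)."""
    corr = np.corrcoef(activations, rowvar=False)   # (num_units, num_units)
    unassigned = set(range(corr.shape[0]))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        group = [seed] + [j for j in list(unassigned) if corr[seed, j] > threshold]
        unassigned -= set(group)
        clusters.append(group)
    return clusters

acts = np.random.randn(1000, 32)                         # fake activations of 32 units
acts[:, 1] = acts[:, 0] + 0.05 * np.random.randn(1000)   # make unit 1 track unit 0
print(cluster_correlated_units(acts))                    # units 0 and 1 end up together
```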
Today’s computing infrastructures are inefficient at numerical calculation on non-uniform sparse data, especially because the available numerical libraries are optimized for extremely fast dense matrix multiplication. Most current use of sparsity comes from employing convolutions, but convolutions are implemented as collections of dense connections to the patches in the earlier layers. Traditional ConvNets used random, sparse connection tables, whereas current state-of-the-art architectures for computer vision have a uniform, dense structure; the large number of filters and greater batch size allow for efficient dense computation.
The Inception architecture tries to exploit the advantages of both sparse structure and dense matrix multiplication.
Clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication.
Note:
1. Dense and sparse matrices
If most elements of a matrix are zero, the matrix is sparse.
If most elements of a matrix are nonzero, the matrix is dense.
reference: wikipedia
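A quick SciPy/NumPy illustration of the trade-off (sizes and density are arbitrary, my own example): dense multiplication goes through highly optimized BLAS routines, while sparse formats only store and touch the nonzeros at the cost of irregular memory access:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

dense = rng.random((1000, 1000))
mostly_zero = dense * (rng.random((1000, 1000)) < 0.01)   # ~1% nonzero entries
sparse_mat = sparse.csr_matrix(mostly_zero)

print(f"density: {sparse_mat.nnz / (1000 * 1000):.3f}")

# Dense matmul: uniform memory access, fast BLAS kernels.
_ = dense @ dense
# Sparse matmul: only stored nonzeros are touched, but access is non-uniform.
_ = sparse_mat @ sparse_mat
```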
Main idea: find out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.
Translation invariance means the whole network can be built from convolutional building blocks: find the optimal local construction and repeat it spatially.
We should analyze the correlation statistics of the preceding layer and cluster units with highly correlated outputs into groups.
In the lower layers, correlated units concentrate in local regions, so they can be covered by 1×1 convolutions. However, there will also be a smaller number of more spatially spread-out clusters that require convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions.
The current Inception architecture is restricted to filter sizes 1×1, 3×3 and 5×5.
Combine all these layers, with their output filter banks concatenated into a single output vector that forms the input of the next stage.
The ratio of 3x3 and 5x5 convolutions should increase as we move to higher layers because correlated units’ concentration is expected to decrease.
However, the computational budget blows up as the number of Inception modules increases. This leads to the second idea: judiciously applying dimension reductions and projections wherever the computational requirements would otherwise increase too much.
Ways of dimension reduction:
1. 1×1 convolutions used to compute reductions before the expensive 3×3 and 5×5 convolutions.
2. A 1×1 projection after the built-in max-pooling branch.
Benefits (see the sketch after this list):
1. The number of units at each stage can be increased significantly without an uncontrolled blow-up in computational complexity.
2. Visual information is processed at various scales and then aggregated, so the next stage can abstract features from different scales simultaneously.
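A minimal PyTorch sketch of an Inception module with dimension reduction, as described above; the class and argument names are mine, and the channel counts are left as parameters:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches whose outputs are concatenated along the channel dimension."""
    def __init__(self, in_ch, n1x1, n3x3_reduce, n3x3, n5x5_reduce, n5x5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(                      # plain 1x1 convolutions
            nn.Conv2d(in_ch, n1x1, kernel_size=1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(                      # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, n3x3_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(n3x3_reduce, n3x3, kernel_size=3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(                      # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, n5x5_reduce, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(n5x5_reduce, n5x5, kernel_size=5, padding=2), nn.ReLU(inplace=True))
        self.branch_pool = nn.Sequential(                  # 3x3 max-pool, then 1x1 projection
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)
```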
With careful manual design, networks built with Inception modules can be 2-3x faster than similarly performing architectures without them.
It is suggested to start using Inception modules only at higher layers.
All the convolutions, including those inside the Inception modules, use rectified linear activation.
The size of the receptive field in our network is 224×224 taking RGB color channels with mean subtraction.
“#3×3 reduce” and “#5×5 reduce” stand for the number of 1×1 filters in the reduction layers used before the 3×3 and 5×5 convolutions.
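For example, continuing the hypothetical InceptionModule sketch from above, the inception (3a) row of the paper's table (#1×1 = 64, #3×3 reduce = 96, #3×3 = 128, #5×5 reduce = 16, #5×5 = 32, pool proj = 32, with a 192-channel 28×28 input) would be instantiated as:

```python
inception_3a = InceptionModule(192, n1x1=64, n3x3_reduce=96, n3x3=128,
                               n5x5_reduce=16, n5x5=32, pool_proj=32)
out = inception_3a(torch.randn(1, 192, 28, 28))
print(out.shape)   # torch.Size([1, 256, 28, 28]) = 64 + 128 + 32 + 32 output channels
```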
Use average pooling and an extra linear layer before the classifier; the extra linear layer makes it easy to adapt and fine-tune the network for other label sets, and it does not affect performance significantly.
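A minimal sketch of such a head, assuming the paper's 7×7 average pooling, 40% dropout, 1000 ImageNet classes, and 1024 channels after the last Inception module:

```python
import torch.nn as nn

head = nn.Sequential(
    nn.AvgPool2d(kernel_size=7, stride=1),   # 7x7x1024 -> 1x1x1024
    nn.Flatten(),
    nn.Dropout(p=0.4),
    nn.Linear(1024, 1000),                   # linear layer; softmax loss is applied on top
)
```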
Add auxiliary classifiers connected to intermediate layers to encourage discrimination in the lower stages, increase the gradient signal that gets propagated back, and provide additional regularization. During training, their losses are added to the total loss with a weight of 0.3; at inference time, the auxiliary classifiers are discarded.
Structure of the extra network on the side (the auxiliary classifier):
1. An average pooling layer with 5×5 filter size and stride 3.
2. A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
3. A fully connected layer with 1024 units and rectified linear activation.
4. A dropout layer with a 70% ratio of dropped outputs.
5. A linear layer with softmax loss as the classifier.
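A PyTorch rendering of that side branch plus the 0.3-weighted loss (my own sketch; `in_ch` would be 512 at inception (4a) or 528 at inception (4d), where the auxiliary classifiers are attached):

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Auxiliary branch: avg pool -> 1x1 conv -> FC -> dropout -> linear classifier."""
    def __init__(self, in_ch, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)   # e.g. 14x14 -> 4x4
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)    # dimension reduction
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.dropout = nn.Dropout(p=0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = torch.relu(self.fc1(torch.flatten(x, 1)))
        return self.fc2(self.dropout(x))

# During training the auxiliary losses are discounted by 0.3:
# total_loss = main_loss + 0.3 * (aux_loss_1 + aux_loss_2)
```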
DistBelief, CPU based
Detected objects count as correct if they match the class of the ground truth and their bounding boxes overlap by at least 50 percent (Jaccard index).
Extraneous detections count as false positives.
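A small helper showing this criterion; boxes are assumed to be `(x1, y1, x2, y2)` tuples, my own convention:

```python
def jaccard_index(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_correct_detection(pred_class, pred_box, gt_class, gt_box):
    return pred_class == gt_class and jaccard_index(pred_box, gt_box) >= 0.5

print(is_correct_detection("dog", (1, 1, 11, 11), "dog", (0, 0, 10, 10)))  # True (IoU ~ 0.68)
```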
The approach taken by GoogLeNet for detection is similar to the R-CNN but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the Selective Search [20] approach with multi-box predictions for higher object bounding box recall.
Superpixel size was increased by 2x to cut down the number of false positives.
Ensemble of 6 ConvNets when classifying.
No bounding box regression.
No use of external data.
I think the most important idea of the Inception architecture is:
Approximating the expected optimal sparse structure by readily available dense building blocks.