A Summary of CNN Architecture Design for Classification / Feature Extraction

Notes

I have recently been working on a face recognition project with CNNs. To learn from prior work and design a network that works well, I went through the papers that did well on ImageNet in 2012 (AlexNet), 2014 (VGGNet, GoogLeNet), 2015 (ResNet), and 2016, and wrote up a short summary. At my advisor's request it is written entirely in English, and I'm too lazy to go through it again in Chinese, so here it is...

AlexNet:

Proposed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton in 2012, AlexNet was the first large-scale convolutional network successfully applied to the ImageNet challenge. The network consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers. AlexNet achieves top-1 and top-5 test error rates of 37.5% and 17.0% on the ILSVRC-2010 test set (the previous best were 45.7% and 25.7%), and a 15.3% top-5 test error on the ILSVRC-2012 test set (the previous best was 26.2%).

Some important tricks that AlexNet used:
1. ReLU nonlinearity (proposed by Nair and Hinton in 2010). ReLU accelerates training dramatically compared to saturating nonlinearities such as tanh or sigmoid.
2. Local Response Normalization. LRN introduces a kind of lateral inhibition, creating competition for large activations among the outputs of different neurons. According to the authors, LRN reduces the top-1 and top-5 error rates by 1.4% and 1.2%. However, later experiments showed that LRN may not help that much.
The response-normalized activity is b^i_{x,y} = a^i_{x,y} / \left(k + \alpha \sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)} (a^j_{x,y})^2\right)^{\beta}, where the sum runs over n adjacent feature maps at the same spatial position, N is the total number of feature maps, and k, n, \alpha, \beta are hyper-parameters (k=2, n=5, \alpha=10^{-4}, \beta=0.75 in the paper).
3. Overlapping Pooling. Using pooling windows of size 3 with stride 2 (so that neighbouring windows overlap), AlexNet reduces the top-1 and top-5 error rates by 0.4% and 0.3%. Why this scheme works was not discussed. (See the sketch after this list.)
4. Data augmentation. Cropping and horizontally reflecting images, and altering the intensities of the RGB channels of the training images.
5. Dropout.
Tricks 1, 4, 5 are still widely used in current models.
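As a concrete illustration of tricks 1-3, here is a minimal PyTorch sketch (my own, not the authors' original code; the sizes loosely follow the first conv block of the paper):

```python
import torch
import torch.nn as nn

# A simplified AlexNet-style first conv block:
# ReLU nonlinearity (trick 1), LRN (trick 2), overlapping pooling (trick 3).
block = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),
    nn.ReLU(inplace=True),
    # Local response normalization. Hyper-parameters loosely follow the paper
    # (n=5, beta=0.75, k=2); note that PyTorch divides alpha by `size`
    # internally, so the constants are not an exact match.
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    # Overlapping pooling: window 3 with stride 2 (window larger than stride).
    nn.MaxPool2d(kernel_size=3, stride=2),
)

x = torch.randn(1, 3, 224, 224)   # a dummy batch of RGB images
print(block(x).shape)             # torch.Size([1, 96, 26, 26])
```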

VGGNet:

VGGNet was proposed by Karen Simonyan and Andrew Zisserman from the Visual Geometry Group (VGG) at the University of Oxford. By replacing large conv filters with very small (3*3) ones, VGGNet pushed the depth of CNNs to 19 weighted layers (16 conv layers and 3 fully-connected layers) and achieved a significant improvement on the ImageNet challenge (6.8% top-5 error on ILSVRC-2014).

Some important tricks that VGGNet used:
1. Using very small receptive fields throughout the whole net (with stride 1). There are two advantages: two stacked 3*3 conv layers have an effective receptive field of 5*5 but with fewer parameters, and stacking two conv layers lets us insert an extra non-linear rectification layer instead of having just one.
2. 1*1 conv layers. They increase the nonlinearity of the network without affecting the receptive fields of the conv layers.
3. Heavy test-time data augmentation.

It's worth noticing that VGGNet has 3 fully-connected layers of sizes (4096, 4096, 1000), whereas GoogLeNet and ResNet each have only one fully-connected layer. That is probably the main reason why VGGNet is larger and slower than the other two: the first fully-connected layer alone, mapping a 7*7*512 feature map to 4096 units, holds roughly 103 million parameters.

Another question is whether the computation needed for each forward pass increases even though the number of parameters decreases. Let's do some simple math. Assume the input feature map is m*n with C_in channels, where m and n are much greater than 5. A 5*5 conv layer with C output channels needs approximately 25*C_in*C*m*n multiplications per forward pass. Two stacked 3*3 conv layers, the first with C' output channels and the second with C, need approximately 9*C_in*C'*m*n multiplications in the first layer and 9*C'*C*m*n in the second, i.e. 9*C'*(C_in + C)*m*n in total. With equal widths (C' = C = C_in) that is 18*C^2*m*n versus 25*C^2*m*n, so the factorized version is actually cheaper as well; the cost only blows up if the intermediate width C' is allowed to grow, because every conv layer's cost scales with the product of its input and output channel counts.
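To make this arithmetic concrete, here is a tiny Python counter (my own back-of-the-envelope sketch; the channel count 256 is arbitrary) for the multiplications per output position:

```python
def conv_mults(kernel, c_in, c_out):
    """Approximate multiplications per output spatial position,
    ignoring padding, stride and bias terms."""
    return kernel * kernel * c_in * c_out

c = 256  # assume equal widths everywhere: C_in = C' = C = 256

single_5x5 = conv_mults(5, c, c)                          # 25 * C^2
stacked_3x3 = conv_mults(3, c, c) + conv_mults(3, c, c)   # 18 * C^2

print(single_5x5, stacked_3x3)   # 1638400 vs 1179648: the stack is cheaper
```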

GoogLeNet

Proposed in 2014 by researchers from Google, the University of North Carolina, the University of Michigan and Magic Leap (while writing this report I noticed that one of the co-authors is Yangqing Jia). GoogLeNet tried to increase the depth and width of the network while keeping the computational budget constant. The model consists of 22 weighted layers organised around a novel structure called "Inception", and achieved a top-5 error rate of 6.67% on the ILSVRC-2014 test set.

The most "straightforward" contribution of this paper is the novel "Inception" structure, which consists of several stacked conv filters of multiple scales together with pooling layers (figure omitted: parallel 1*1, 3*3 and 5*5 convolutions plus a 3*3 max-pooling branch, concatenated along the channel dimension). The structure itself is pretty simple; what matters is the motivation and the high-level considerations behind it: make the model more sparse, so that it is more efficient and less prone to overfitting. Instead of using sparse matrices, the authors chose a sparse structure built from dense blocks, so that they could still take advantage of highly tuned numerical libraries that allow for extremely fast dense matrix multiplication.

It's worth noting that the conv filters in the Inception structure are sparsely connected, and that 1*1 conv filters are used to reduce the number of feature maps before the expensive 3*3 and 5*5 convolutions. Since a conv layer's cost scales with the product of its input and output channel counts (see the VGGNet section above), this is how GoogLeNet avoids an explosion in the demand for computational power.

Another advantage, not mentioned in the paper but one that I believe matters, is that by concatenating both conv and pooling branches, the Inception structure captures features at multiple scales, which is no doubt beneficial.
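Below is a minimal PyTorch sketch of an Inception-style module (my own simplification; the channel widths roughly follow the first Inception module of GoogLeNet, and ReLUs are omitted). It shows the parallel branches, the 1*1 reductions in front of the expensive 3*3 and 5*5 convs, and the channel-wise concatenation:

```python
import torch
import torch.nn as nn

class InceptionSketch(nn.Module):
    """Simplified Inception module (ReLUs omitted for brevity): parallel 1x1,
    3x3 and 5x5 conv branches plus a pooling branch, with 1x1 reductions in
    front of the expensive convs, concatenated along the channel axis."""
    def __init__(self, c_in):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, 64, kernel_size=1)
        self.b2 = nn.Sequential(                       # 1x1 reduce, then 3x3
            nn.Conv2d(c_in, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.b3 = nn.Sequential(                       # 1x1 reduce, then 5x5
            nn.Conv2d(c_in, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.b4 = nn.Sequential(                       # pool, then 1x1 project
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(c_in, 32, kernel_size=1))

    def forward(self, x):
        # Each branch keeps the spatial size; concatenation stacks the
        # multi-scale feature maps along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionSketch(192)(x).shape)   # torch.Size([1, 256, 28, 28])
```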

Important tricks that GoogLeNet used:
1. Stacked conv filters
2. Sparse structure
3. Sample multi-scale patches from the training set

BN-Inception-V2

Proposed by Google in 2015. BN-Inception-V2 is basically GoogLeNet trained with batch normalization. Although the structure wasn't new, BN-Inception-V2 reached a top-5 error rate of 4.9%(!) on the ImageNet validation set.

Batch normalization addresses the problem of "internal covariate shift": the change in the distribution of network activations due to the change in network parameters during training. It has long been known that a network converges faster if its inputs are whitened (linearly transformed to have zero mean and unit variance, and decorrelated). If we regard each layer as a small network of its own, it seems reasonable to whiten its input (which is the output of the layer below) as well.

Since fully whitening each layer's inputs is costly and not everywhere differentiable, batch normalization normalizes each scalar feature independently, making it have zero mean and unit variance. During training, batch normalization is performed as shown below (Algorithm 1 in the paper); a \gamma and a \beta are learned for each input dimension. During inference, the means and variances are fixed, so the normalization is simply a linear transform applied to each activation.
For a mini-batch B = \{x_1, ..., x_m\}: \mu_B = \frac{1}{m}\sum_i x_i, \quad \sigma_B^2 = \frac{1}{m}\sum_i (x_i - \mu_B)^2, \quad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta.

For convolutional layers, to preserve the convolutional property (different elements of the same feature map should be treated in the same way), batch normalization jointly normalizes all the activations of a feature map over the mini-batch and over all spatial locations. So for conv layers a \gamma and a \beta are learned per channel, rather than per dimension.
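A small NumPy sketch (my own, training-time statistics only) of this per-channel normalization for conv activations of shape (N, C, H, W):

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for conv activations of shape (N, C, H, W).
    Statistics are shared over the batch and all spatial locations, so one
    (gamma, beta) pair is learned per channel."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # per-channel mean
    var = x.var(axis=(0, 2, 3), keepdims=True)     # per-channel variance
    x_hat = (x - mean) / np.sqrt(var + eps)        # normalize
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 16, 14, 14)                 # a dummy mini-batch
y = batch_norm_conv(x, np.ones(16), np.zeros(16))
print(y.mean(axis=(0, 2, 3)).round(6))             # ~0 per channel
print(y.std(axis=(0, 2, 3)).round(3))              # ~1 per channel
```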

One of the major advantages of batch normalization is that it allows higher learning rates. For a scalar \alpha, it is easy to see that BN(Wu) = BN(\alpha Wu), and hence \frac{\partial BN(\alpha Wu)}{\partial u} = \frac{\partial BN(Wu)}{\partial u} and \frac{\partial BN(\alpha Wu)}{\partial (\alpha W)} = \frac{1}{\alpha} \frac{\partial BN(Wu)}{\partial W} by the chain rule, so the scale of the layer parameters neither amplifies the gradient during backpropagation nor leads to model explosion.
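A quick numerical check of the BN(Wu) = BN(\alpha Wu) claim (my own toy example; the normalization here is just the zero-mean/unit-variance step, without \gamma and \beta):

```python
import numpy as np

def bn(z, eps=1e-8):
    """Normalize each output dimension over the mini-batch."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.normal(size=(32, 10))        # a mini-batch of 32 inputs
W = rng.normal(size=(10, 5))         # a random weight matrix
alpha = 7.3                          # an arbitrary rescaling of the weights

# Scaling the weights does not change the normalized activations.
print(np.allclose(bn(u @ W), bn(u @ (alpha * W)), atol=1e-5))   # True
```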

Another advantage is that during training an example is seen in conjunction with the other examples in its mini-batch, so the network doesn't produce deterministic values for a given example. In that sense batch normalization acts a bit like dropout. Experiments showed that dropout can be either removed or reduced in strength when batch normalization is used.

Inception-V3

Proposed by Google in 2015. The paper discusses several useful principles for designing network structures and applies them to the old GoogLeNet. Inception-V3 achieved a 17.3% top-1 error and a 3.5% top-5 error on ILSVRC-2012 classification.

Four principles are discussed in the paper:
1. Avoid representational bottlenecks, especially early in the network. (As I understand it, "bottleneck" mainly refers to aggressive downsampling and extreme compression in width.)
2. Higher-dimensional representations are easier to process locally within a network. (I don't really understand this one.)
3. Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power.
4. Balance the width and depth of the network.

Following these principles, the paper proposes several concrete techniques.
1. Factorizing Convolutions with Large Filter Sizes. Similar to the technique used in VGGNet, but one step further: besides factorizing large filters into smaller ones, they also factorize them into asymmetric ones, e.g. a 5*5 conv filter into stacked 1*5 and 5*1 filters. Asymmetric factorization obviously uses fewer parameters than symmetric factorization. (Both this and the grid-size reduction below are sketched in code after this list.)
2. Efficient Grid Size Reduction. To avoid a representational bottleneck, the activation dimension of the network is usually expanded before pooling, which is computationally expensive. Pooling followed by convolution, on the other hand, is cheaper but creates a representational bottleneck as the dimension drops. The paper suggests using two parallel stride-2 branches, one convolutional and one pooling, and concatenating their outputs.
3. Auxiliary Classifiers. Experiments showed that auxiliary classifiers accelerate convergence towards the end of training. The authors argue that they act as regularizers; I don't quite get this.
4. Label Smoothing. Using q'(k|x) = (1 - \epsilon) \delta_{k,y} + \epsilon u(k), where q' is the training label distribution, \delta_{k,y} is the ground-truth one-hot distribution, and u(k) is a prior distribution over labels, improves classification performance by about 0.2%. This trick is not new; Prof. Zhang (张学工) covered the same idea in his Pattern Recognition class in our third year.

The structure of Inception-V3 is pretty complicated; please refer to the paper for the details.

ResNet

Proposed by Microsoft in 2015. Winner of the ILSVRC-2015 classification, detection and localization tasks, and winner of the COCO detection and segmentation tasks. ResNet achieved a 3.57% top-5 error on the ILSVRC-2015 classification test set.

The idea of ResNet was inspired by the degradation problem: as the network depth increases, accuracy gets saturated (which is perhaps unsurprising) and then degrades rapidly. The authors argue that this happens because the stacked layers struggle to model the identity mapping (if they could, a deeper net should perform at least as well as its shallower counterpart). It therefore seems reasonable to add "shortcut connections" to the network, so that the layers only have to fit the residual with respect to the identity; this turned out to be extremely effective. A more intuitive explanation is that the shortcuts let information (and gradients) flow more freely, which leads to better convergence.
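As an illustration, here is a minimal PyTorch sketch of a basic residual block (my own, using the common conv-BN-ReLU ordering and an identity shortcut only; the paper also uses projection shortcuts and deeper "bottleneck" blocks):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), where F is two 3x3
    conv layers. The shortcut lets the layers fit a residual mapping, and
    information/gradients flow directly through the addition."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)   # the shortcut connection

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64)(x).shape)   # same shape as the input
```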

Inception-V4

Proposed recently by Google (ICLR 2016 workshop). The paper introduces Inception-V4, a deeper pure-Inception variant, and Inception-ResNet, which combines residual connections with Inception modules. Experiments showed that the networks with residual connections did converge faster, but beat a widened Inception-V3 only by a small margin. The authors argue that "the final quality seems to be much more correlated with the model size than with the use of residual connections".

SqueezeNet

Proposed recently by the DeepScale team. They introduce a novel building block called the "Fire module", which I believe is essentially a simplified Inception module. Using Fire modules, SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters.

Important tricks that SqueezeNet used:
1. Replace 3x3 filters with 1x1 filters, since a 1x1 filter has 9x fewer parameters than a 3x3 filter.
2. Decrease the number of input channels feeding the 3x3 filters; that is what "squeeze" means. A similar technique is used in the Inception module (see the sketch after this list).
3. Downsample late in the network so that conv layers have large activation maps. This is similar to the "avoid representational bottlenecks" principle discussed for Inception-V3.
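Here is a minimal PyTorch sketch of a Fire-module-like block (my own reading of the paper's description; the channel counts roughly follow an early Fire module and are otherwise illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Fire(nn.Module):
    """Fire module sketch: a 1x1 'squeeze' layer shrinks the channel count,
    then parallel 1x1 and 3x3 'expand' layers are concatenated."""
    def __init__(self, c_in, c_squeeze, c_expand1, c_expand3):
        super().__init__()
        self.squeeze = nn.Conv2d(c_in, c_squeeze, kernel_size=1)
        self.expand1 = nn.Conv2d(c_squeeze, c_expand1, kernel_size=1)
        self.expand3 = nn.Conv2d(c_squeeze, c_expand3, kernel_size=3, padding=1)

    def forward(self, x):
        s = F.relu(self.squeeze(x))   # few input channels feed the 3x3 filters
        return torch.cat([F.relu(self.expand1(s)),
                          F.relu(self.expand3(s))], dim=1)

x = torch.randn(1, 96, 55, 55)
print(Fire(96, 16, 64, 64)(x).shape)   # torch.Size([1, 128, 55, 55])
```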

Metric Learning?

Might be important, need to look into it.
