Paper reading: SPPnet: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Table of Contents

  • 1. Network Overview
  • 2. Why CNNs Require Fixed-Size Input
  • 3. Crop and Warp
  • 4. Visualization of the Feature Maps
  • 5. Window and Stride Computation at Different Pooling Scales
  • 6. Multi-size Training
  • 7. Full-image Representations Improve Accuracy
  • 8. Applying SPP-net to Object Detection
  • 9. Model Combination for Detection
  • 10. Appendix: Mapping a Window to Feature Maps
  • References

1. Network Overview

[Figure 1: SPP-net architecture]

This paper was published by Kaiming He before Fast R-CNN and strongly influenced both Fast R-CNN and Faster R-CNN. It targets the problem that the fully connected layers force the network input to a fixed size. The proposed fix is to insert an SPP layer after the last convolutional layer and before the fully connected layers: pooling at several scales yields a fixed number of features regardless of input size, which can then be fed to the FC layers, so the network accepts inputs of multiple sizes. The paper also describes how to train at multiple sizes. An important point is that the idea carries over to object detection, but the classifier there is still an SVM rather than part of the network: the full pipeline has four stages (selective search for proposals, feature extraction, SVM classification, and bounding-box regression), so there is still a gap to Fast R-CNN.

2. Why CNNs Require Fixed-Size Input

A CNN can be split into a convolutional part and a fully connected part. The parameters of the convolutional part are the kernels, which can handle inputs of any size and produce outputs of the corresponding size. The fully connected part is different: its parameters are the connection weights from every input to every neuron, so if the input size is not fixed, the number of fully connected parameters cannot be fixed either.

Intuitively, SPP-net replaces the original pool5, which used a fixed (3×3) window, with pooling whose window size and stride adapt in proportion to the feature map, guaranteeing that the pooled feature vector always has the same length.
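The size dependence of the FC layers can be made concrete with a small sketch; the numbers below (a 13×13×256 conv5 map, 4096 FC units) are illustrative choices, not exact figures from the paper:

```python
# An FC layer's weight matrix has shape (flattened_conv_size, fc_units), so
# its parameter count is pinned to the conv output size: feed it a feature
# map of another size and the matrix multiply no longer matches.
def fc_weight_count(h, w, channels, fc_units):
    return h * w * channels * fc_units

n_small = fc_weight_count(13, 13, 256, 4096)  # e.g. from a 224x224 input
n_large = fc_weight_count(15, 15, 256, 4096)  # a larger input, larger map
print(n_small == n_large)  # False: one weight matrix cannot serve both
```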

3. Crop and Warp

[Figure 2: crop vs. warp examples]

As the figure shows, both crop and warp can damage the object: cropping may leave the object incomplete, while warping changes its aspect ratio and deforms it.

4. Visualization of the Feature Maps

[Figure 3: feature map visualizations]

The feature maps retain the spatial positions of the features, and feature maps at different layers encode different types of features.

From the paper:

These outputs are known as feature
maps [1] - they involve not only the strength of the
responses, but also their spatial positions.

5. Window and Stride Computation at Different Pooling Scales

[Figure 4: spatial pyramid pooling structure]

For an image with a given size, we can pre-compute
the bin sizes needed for spatial pyramid pooling.
Consider the feature maps after conv5 that have a size
of a×a (e.g., 13×13). With a pyramid level of n×n
bins, we implement this pooling level as a sliding
window pooling, where the window size win = ⌈a/n⌉ (ceiling)
and stride str = ⌊a/n⌋ (floor).
With an l-level pyramid,
we implement l such layers.
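The pooling rule above can be sketched in NumPy; the pyramid levels {1×1, 2×2, 3×3, 6×6} match the detection pyramid in the paper, while the 256-channel map and the use of a square input are illustrative assumptions:

```python
import math
import numpy as np

def spp_pool(feature_map, levels=(1, 2, 3, 6)):
    """Max-pool an a x a x C feature map into a fixed-length vector.

    For an n x n pyramid level: window win = ceil(a / n), stride = floor(a / n).
    """
    a = feature_map.shape[0]
    pooled = []
    for n in levels:
        win = math.ceil(a / n)
        stride = a // n
        for i in range(n):
            for j in range(n):
                patch = feature_map[i * stride:i * stride + win,
                                    j * stride:j * stride + win, :]
                pooled.append(patch.max(axis=(0, 1)))  # one value per channel
    return np.concatenate(pooled)

# Different conv5 sizes (from different input images) -> same feature length.
f13 = spp_pool(np.random.rand(13, 13, 256))
f10 = spp_pool(np.random.rand(10, 10, 256))
print(f13.shape, f10.shape)  # both (12800,) = 256 * (1 + 4 + 9 + 36)
```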

6. Multi-size Training

Input sizes range over [180, 224]. Suppose the last convolutional layer outputs a 13×13 map; for a pyramid level with n×n bins, sliding-window pooling uses window size ⌈13/n⌉ and stride ⌊13/n⌋, as in Section 5. Training alternates between two fixed-size networks (one for 224, one for 180) that share all parameters: one network is trained for a complete epoch, then training switches to the other. Multiple sizes are used only at training time; at test time SPP-net is applied directly to images of any size.

Note: multi-size training brings only a small performance gain.

From the paper:

The output of the spatial
pyramid pooling layer of this 180-network has the
same fixed length as the 224-network. As such, this
180-network has exactly the same parameters as the
224-network in each layer. In other words, during
training we implement the varying-input-size SPP-net
by two fixed-size networks that share parameters.
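The parameter-sharing point can be illustrated with a toy single-channel sketch (the 3×3 kernel, the 18/22 input sizes, and the use of only a 1×1 pyramid level are all simplifications for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
shared_kernel = rng.random((3, 3))  # the SAME weights serve both "networks"

def conv_valid(img, k):
    """Valid-mode 2D correlation: conv weights do not depend on the
    input size, unlike fully connected weights."""
    kh, kw = k.shape
    h, w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (img[i:i + kh, j:j + kw] * k).sum()
    return out

# The "180-network" and "224-network" share every parameter; a single 1x1
# pyramid level (global max) already yields equal-length outputs for both.
f_small = conv_valid(rng.random((18, 18)), shared_kernel).max()
f_large = conv_valid(rng.random((22, 22)), shared_kernel).max()
print(np.shape(f_small), np.shape(f_large))  # () () -- same (scalar) shape
```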

7. Full-image Representations Improve Accuracy

This shows the importance of maintaining the
complete content. Even though our network is trained
using square images only, it generalizes well to other
aspect ratios.

8. Applying SPP-net to Object Detection

[Figure 5: the SPP-net detection pipeline]

For detection, SPP-net first uses a proposal method (selective search) to generate candidate boxes. Unlike R-CNN, which forwards every candidate region through the deep network separately, SPP-net extracts features from the whole image once and then maps each candidate box onto conv5. Because the candidate boxes differ in size, their mapped regions on conv5 also differ in size, so the SPP layer is used to pool each region into a feature of the same dimension; classification and regression then proceed as in R-CNN. This is far faster than the original approach: as R-CNN itself noted, the network's receptive field is very large, so each region of interest had to be rescaled to the network's input size and convolved all the way to conv5, which is expensive. SPP-net computes the convolutional features only once and does the rest on top of conv5. Even so, the framework has flaws. On the training side SPP-net does not exploit its own advantage: it keeps the traditional multi-stage training procedure, which remains computationally heavy, and classification and bounding-box regression could have been learned jointly to make the framework cleaner. Kaiming He left these points unaddressed, and fixing them is exactly what produced the next landmark work, Fast R-CNN.

From the paper:

Our SPP-net can also be used for object detection.
We extract the feature maps from the entire image
only once (possibly at multiple scales). Then we apply the spatial pyramid pooling on each candidate
window of the feature maps to pool a fixed-length
representation of this window (see Figure 5). Because
the time-consuming convolutions are only applied
once, our method can run orders of magnitude faster.

The paper's comparison with DPM, OverFeat, and SS:

Our method extracts window-wise features from
regions of the feature maps, while R-CNN extracts
directly from image regions. In previous works, the
Deformable Part Model (DPM) [23] extracts features
from windows in HOG [24] feature maps, and the
Selective Search (SS) method [20] extracts from windows in encoded SIFT feature maps. The Overfeat
detection method [5] also extracts from windows of
deep convolutional feature maps, but needs to predefine the window size. On the contrary, our method
enables feature extraction in arbitrary windows from
the deep convolutional feature maps.

9. Model Combination for Detection

From the paper:

We pre-train another network in ImageNet, using
the same structure but different random initializations. Then we repeat the above detection algorithm.
Given the two models, we first use either model
to score all candidate windows on the test image.
Then we perform non-maximum suppression on the
union of the two sets of candidate windows (with
their scores). A more confident window given by
one method can suppress those less confident given
by the other method. After combination, the mAP
is boosted to 60.9% (Table 12). In 17 out of all 20
categories the combination performs better than either
individual model. This indicates that the two models
are complementary.
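The combination step described above reduces to running one greedy non-maximum suppression over the union of both models' scored windows. A minimal sketch (the boxes and scores are made-up toy data):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection-over-union of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]     # drop overlapping, lower-scored boxes
    return keep

# Windows from model A and model B are pooled into one set, so a confident
# window from either model can suppress the other's overlapping duplicates.
boxes = np.array([[0., 0., 10., 10.],     # model A
                  [1., 1., 11., 11.],     # model B, same object as above
                  [20., 20., 30., 30.]])  # model B, a separate object
scores = np.array([0.9, 0.6, 0.8])
print(nms(boxes, scores))  # [0, 2]
```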

10. Appendix: Mapping a Window to Feature Maps

For this part, see Section 3 of a labmate's blog post (reference 4 below):
【Deep Learning】SPP-Net
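The appendix's corner-mapping rule can also be sketched directly. S = 16 corresponds to the total stride before conv5 in the paper's network configurations, and the formula assumes each layer with filter size p pads ⌊p/2⌋ pixels; the helper name is mine:

```python
import math

def window_to_conv5(x1, y1, x2, y2, S=16):
    """Project an image-space window (x1, y1, x2, y2) onto conv5 coordinates.

    Appendix rule: a left/top corner maps to floor(x / S) + 1, and a
    right/bottom corner maps to ceil(x / S) - 1, where S is the product
    of all strides of the layers before conv5.
    """
    return (math.floor(x1 / S) + 1, math.floor(y1 / S) + 1,
            math.ceil(x2 / S) - 1, math.ceil(y2 / S) - 1)

print(window_to_conv5(32, 48, 320, 240))  # (3, 4, 19, 14)
```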

References

1. 目标检测——从RCNN到Faster RCNN 串烧

2. RCNN学习笔记(3)

3. 目标检测(3)-SPPNet

4. 【Deep Learning】SPP-Net
