R-CNN, SPPnet, and Fast R-CNN: Paper Study Notes

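------R-CNN, SPPnet, and Fast R-CNN are a series of top-conference papers on object detection. It took me quite a while to get a feel for them, so I am recording my notes here. Reading the original papers is still the best choice, but they are written in English and lean heavily on ideas from earlier work, so diving straight in is not friendly to beginners. It may be more efficient to first read one of the existing write-ups online to get the general idea, and then read the original papers with specific questions in mind.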

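Here is the blog of rbg (Ross Girshick), where most of the papers, source code, and slides can be found:
[rbg's blog]
A site with very good explanations that I recommend:
Gluon.ai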

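------Object detection means precisely locating the objects in a given image and labeling each one with its category. In other words, it has to answer both where an object is and what it is. This is not an easy problem: object sizes vary widely, objects appear at arbitrary angles and poses, they can be anywhere in the image, and they can belong to many different categories. Existing CNNs already do a good job of answering what is in an image, that is, classification. The following papers show roughly how object detection developed from there.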

R-CNN

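The basic idea: propose a large number of regions in the image and classify each one; the detection result follows from those classifications.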

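  1. Select a large number of candidate boxes from the image (selective search).
  2. Resize each candidate box to fit the convolutional network that follows (so that the network's output has a fixed size).
  3. Extract features from each region with a CNN, then classify them with SVMs.
  4. Train a linear regression model to refine the positions of the candidate boxes. The regressor is fit with a regularized least-squares loss on box offsets; bounding-box IoU is used to pick which proposal and ground-truth pairs to train on, not as the loss itself.
![figure](https://img-blog.csdn.net/201808251657309?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NpbmF0XzM0MDIyMjk4/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)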
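The IoU measure and the box offsets that the regression step works with can be sketched as follows. This is only a minimal illustration, assuming boxes are `(x1, y1, x2, y2)` tuples; the function names are my own, and the offset parameterization follows the one given in the R-CNN paper's appendix.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def regression_targets(proposal, gt):
    """Offsets (tx, ty, tw, th) that move a proposal onto a ground-truth box."""
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    px, py = proposal[0] + pw / 2, proposal[1] + ph / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gx, gy = gt[0] + gw / 2, gt[1] + gh / 2
    # Centre shifts are scaled by the proposal size; sizes use a log ratio.
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))

proposal = (10.0, 10.0, 110.0, 110.0)
gt = (20.0, 20.0, 120.0, 120.0)
overlap = iou(proposal, gt)
targets = regression_targets(proposal, gt)
```

Here the two boxes have the same size and are shifted by 10 pixels, so the size targets are zero and only the centre offsets are non-zero.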

selective search

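R-CNN uses Selective Search (SS) to obtain about 2k candidate boxes per image.
The basic idea is as follows:

An over-segmentation method first splits the image into many small regions. The algorithm then looks at the current regions and merges the two that are most similar, and repeats this step until the whole image has been merged into a single region. The merging rule, which R-CNN uses unchanged, gives priority to four kinds of region pairs: those with similar color (color histograms); those with similar texture (gradient histograms); those whose merged area is small; and those whose merged region fills a large share of its bounding box. Finally, every region that existed at any point during the process is output as a candidate region.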
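The greedy merging loop can be sketched roughly like this. It is a toy version under loud assumptions: regions are plain dicts with a normalized color histogram, a pixel count, and a box; every pair of regions is treated as adjacent; and only the color and size terms are scored. The real algorithm merges neighboring regions only and also uses texture and fill terms. All names here are made up for illustration.

```python
from itertools import combinations

def similarity(a, b, image_area):
    """Color-histogram intersection plus a term that favors small merges."""
    color = sum(min(x, y) for x, y in zip(a["hist"], b["hist"]))
    size = 1.0 - (a["size"] + b["size"]) / image_area
    return color + size

def merge(a, b):
    """Union of two regions: size-weighted histogram and enclosing box."""
    size = a["size"] + b["size"]
    hist = [(x * a["size"] + y * b["size"]) / size
            for x, y in zip(a["hist"], b["hist"])]
    box = (min(a["box"][0], b["box"][0]), min(a["box"][1], b["box"][1]),
           max(a["box"][2], b["box"][2]), max(a["box"][3], b["box"][3]))
    return {"hist": hist, "size": size, "box": box}

def propose(regions, image_area):
    """Merge the most similar pair until one region remains; keep every box seen."""
    active = list(regions)
    proposals = [r["box"] for r in active]
    while len(active) > 1:
        i, j = max(combinations(range(len(active)), 2),
                   key=lambda p: similarity(active[p[0]], active[p[1]], image_area))
        merged = merge(active[i], active[j])
        active = [r for k, r in enumerate(active) if k not in (i, j)] + [merged]
        proposals.append(merged["box"])
    return proposals

initial = [
    {"hist": [1.0, 0.0], "size": 10, "box": (0, 0, 5, 2)},
    {"hist": [0.9, 0.1], "size": 10, "box": (5, 0, 10, 2)},
    {"hist": [0.0, 1.0], "size": 20, "box": (0, 2, 10, 4)},
]
boxes = propose(initial, image_area=40)
```

With n starting regions the loop performs n - 1 merges, so it emits 2n - 1 boxes in total, from the smallest fragments up to the whole image.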

The original paper's description of the network structure


------ Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class specific linear SVMs.

![figure](https://img-blog.csdn.net/20180825170331275?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NpbmF0XzM0MDIyMjk4/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

------Region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work (e.g., [34, 36]).

------Feature extraction. We extract a 4096-dimensional feature vector from each region proposal using the Caffe [22] implementation of the CNN described by Krizhevsky et al. [23]. Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [22, 23] for more network architecture details.

------ In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227×227 pixel size). Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16).
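The dilation described in the quote comes down to a little arithmetic: after warping, the original box should occupy 227 - 2p pixels per side, which fixes how much context to add in source pixels. Below is a small sketch of that calculation; the function name is my own, and handling of boxes that run past the image border is left out.

```python
def dilate_box(box, out_size=227, p=16):
    """Grow (x1, y1, x2, y2) so the original box spans out_size - 2p pixels after warping."""
    x1, y1, x2, y2 = box
    inner = out_size - 2 * p            # warped pixels left for the original box
    pad_x = p * (x2 - x1) / inner       # horizontal context in source pixels
    pad_y = p * (y2 - y1) / inner       # vertical context in source pixels
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)

dilated = dilate_box((100, 50, 295, 440))
```

For this box, 195 pixels wide and 390 tall, the padding is 16 source pixels horizontally and 32 vertically; both shrink to exactly 16 pixels once the dilated box is warped to 227×227.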


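From the passages quoted above, two basic problems with R-CNN stand out:

  1. The image first goes through selective search, and then every candidate region is sent through the CNN for feature extraction, which causes a great deal of repeated convolution.
  2. Because the CNN is followed by fully connected (FC) layers, the size of the CNN's output is strictly constrained, which in turn fixes the input image size for the whole network.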

SPP-net

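------SPP-net mainly solves R-CNN's need for a fixed-size input image (the CNN is followed by FC layers, and a fully connected layer requires a fixed-dimensional vector as input). The paper states at the outset that an image fed to the CNN has to be cropped or warped to bring it to a fixed size, and both have unavoidable drawbacks: a crop cannot see the whole image, and a warp distorts the image geometrically.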

------ In this paper, we introduce a spatial pyramid pooling (SPP) [14], [15] layer to remove the fixed-size constraint of the network. Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully connected layers (or other classifiers).

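![figure](https://img-blog.csdn.net/20180825172500278?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NpbmF0XzM0MDIyMjk4/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
Compared with **RCNN**, **SPPnet** mainly solves two problems:

  1. Many candidate regions per image, each run through the convolutional layers separately ==> a single convolutional pass over the whole image, with the candidate boxes applied to the resulting feature map (i.e. conv5) to obtain the features of each candidate.
  2. Candidate boxes had to be resized before entering the CNN so that the CNN output had a uniform size for the FC layers --> use SPP (spatial pyramid pooling), a scalable pooling layer that divides its input into m*n parts whatever the input resolution. This is the first notable feature of SPP-net: its input is the conv5 feature map plus the candidate boxes on that feature map (obtained by mapping the image-space boxes through the network stride), and its output is a fixed-size (m*n) feature.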
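Mapping a candidate box from image coordinates onto the conv5 feature map can be sketched as a division by the network's total stride. The stride of 16 and the floor/ceil rounding below are assumptions chosen for illustration; the paper's appendix uses a slightly different rounding rule, and `project_box` is a made-up name.

```python
import math

def project_box(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cell indices."""
    x1, y1, x2, y2 = box
    fx1, fy1 = math.floor(x1 / stride), math.floor(y1 / stride)
    fx2, fy2 = math.ceil(x2 / stride), math.ceil(y2 / stride)
    # Guarantee at least one cell so pooling always has something to read.
    return fx1, fy1, max(fx2, fx1 + 1), max(fy2, fy1 + 1)

window = project_box((33, 70, 210, 160))
```

The result is a half-open window of feature-map cells, which can then be pooled to a fixed size regardless of how large the original box was.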

The original paper's description of SPP


------The Spatial Pyramid Pooling Layer

------The convolutional layers accept arbitrary input sizes, but they produce outputs of variable sizes. The classifiers (SVM/softmax) or fully-connected layers require fixed-length vectors. Such vectors can be generated by the Bag-of-Words (BoW) approach [16] that pools the features together. Spatial pyramid pooling [14], [15] improves BoW in that it can maintain spatial information by pooling in local spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size. This is in contrast to the sliding window pooling of the previous deep networks [3], where the number of sliding windows depends on the input size.

![figure](https://img-blog.csdn.net/20180825172650150?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NpbmF0XzM0MDIyMjk4/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

------ To adopt the deep network for images of arbitrary sizes, we replace the last pooling layer (e.g., pool5, after the last convolutional layer) with a spatial pyramid pooling layer. Figure 3 illustrates our method. In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling). The outputs of the spatial pyramid pooling are kM dimensional vectors with the number of bins denoted as M (k is the number of filters in the last convolutional layer). The fixed-dimensional vectors are the input to the fully-connected layer.
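To make the kM-dimensional output concrete, here is a minimal pure-Python sketch of spatial pyramid max pooling. The pyramid levels (1, 2, 4), the floor/ceil bin boundaries, and the nested-list layout of the feature maps are assumptions for illustration, not the paper's exact configuration.

```python
import math

def spp(feature_maps, levels=(1, 2, 4)):
    """Max-pool each channel into n x n bins per level and concatenate.

    feature_maps is a list of k channels, each a list of rows.
    """
    out = []
    for n in levels:
        for channel in feature_maps:
            h, w = len(channel), len(channel[0])
            for i in range(n):
                r0, r1 = math.floor(i * h / n), math.ceil((i + 1) * h / n)
                for j in range(n):
                    c0, c1 = math.floor(j * w / n), math.ceil((j + 1) * w / n)
                    out.append(max(channel[r][c]
                                   for r in range(r0, r1)
                                   for c in range(c0, c1)))
    return out

def ramp(h, w):
    """Toy channel whose values grow left to right, top to bottom."""
    return [[r * w + c for c in range(w)] for r in range(h)]

small = spp([ramp(6, 8), ramp(6, 8)])
large = spp([ramp(13, 10), ramp(13, 10)])
```

With k = 2 channels and M = 1 + 4 + 16 = 21 bins, both inputs give a 42-dimensional vector even though their spatial sizes differ, which is what lets the FC layers follow directly.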

------With spatial pyramid pooling, the input image can be of any sizes. This not only allows arbitrary aspect ratios, but also allows arbitrary scales. We can resize the input image to any scale (e.g., min(w,h)=180, 224, …) and apply the same deep network. When the input image is at different scales, the network (with the same filter sizes) will extract features at different scales.

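![figure](https://img-blog.csdn.net/20180825173153463?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NpbmF0XzM0MDIyMjk4/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

------In the figure above, sizeX is the pooling window size and stride is the step; pool3x3 means the pooled output is 3x3. So pool1x1 max-pools over the entire input feature map, and pool2x2 splits the input feature map into 2x2 = 4 regions and max-pools each one. Given the sizeX and stride settings, max pooling essentially finds the maximum within each region. (Personally, I suspect this selects the same feature repeatedly, since the global maximum is also the maximum of whichever region contains it at every level; would mean pooling or Gaussian pooling work better?)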
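The window size and stride for each pyramid level follow from the feature-map size. The sketch below uses the rule from the SPP paper (window = ceil(a/n), stride = floor(a/n)) and assumes a 13×13 conv5 map; the function names are my own.

```python
import math

def pool_params(a, n):
    """Window size and stride that turn an a x a map into n x n bins."""
    window = math.ceil(a / n)
    stride = math.floor(a / n)
    return window, stride

def output_size(a, window, stride):
    """Number of window positions along one axis (no padding)."""
    return (a - window) // stride + 1

settings = {n: pool_params(13, n) for n in (3, 2, 1)}
sizes = {n: output_size(13, *settings[n]) for n in settings}
```

Because the window is at least as large as the stride, neighboring bins can overlap by a row or column, but the number of bins per level always comes out as n×n.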

R-CNN vs. SPPnet: structure comparison

![R-CNN structure diagram](https://img-blog.csdn.net/20180825174144418?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NpbmF0XzM0MDIyMjk4/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
R-CNN structure diagram
![SPPnet structure diagram](https://img-blog.csdn.net/20180825174055989?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NpbmF0XzM0MDIyMjk4/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
SPPnet structure diagram

Fast R-CNN

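Fast R-CNN likewise targets the shortcomings of R-CNN:

  1. As in SPPnet, instead of extracting candidate regions from the image and sending each one through the CNN, the whole image goes through the CNN once, and the candidate regions are then located on the resulting feature map.
  2. The second point is also close to SPPnet: the extracted feature regions differ in size and cannot be fed directly to the FC layers for classification. Unlike SPPnet, Fast R-CNN uses an RoI pooling layer to map them to a fixed size.
    That is: the convolutional layers produce a ***feature map***, and each RoI box selects the corresponding region on it (equivalently, each region of the ***feature map*** maps back to a region of the original image). After the last convolutional layer and before the FC layers, the RoI pooling layer brings every region to the same size (it can be seen as a single-level SPP).
  3. End-to-end training: classification and localization are trained together, and the loss of the whole network is the sum of the two losses. This speeds things up further.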
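The combined loss in the last point can be sketched as follows. The form used here, log loss for classification plus a smooth L1 box loss that is switched off for background RoIs, is the one given in the Fast R-CNN paper; the function names, the argument layout, and the convention that class index 0 means background are my own choices for illustration.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multitask_loss(probs, true_class, pred_offsets, target_offsets, lam=1.0):
    """Classification log loss plus box loss for non-background RoIs.

    probs: softmax output over K + 1 classes, index 0 being background.
    pred_offsets / target_offsets: 4-tuples for the true class.
    """
    cls_loss = -math.log(probs[true_class])
    if true_class == 0:          # background RoIs have no box target
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))
    return cls_loss + lam * loc_loss

fg = multitask_loss([0.1, 0.8, 0.1], 1, (0.5, 0.0, 2.0, 0.0), (0.0, 0.0, 0.0, 0.0))
bg = multitask_loss([0.9, 0.05, 0.05], 0, (1.0, 1.0, 1.0, 1.0), (0.0, 0.0, 0.0, 0.0))
```

Both heads share the same convolutional features, so a single backward pass updates the classifier and the box regressor together.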

The original paper's description of the network structure

**Fast R-CNN architecture**
------A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.

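![figure](https://img-blog.csdn.net/20180825175140256?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3NpbmF0XzM0MDIyMjk4/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

As for RoI pooling, the authors say it can be viewed as a special case of SPPnet: a pooling layer with only a single pyramid level. H and W are parameters set according to the input size of the fully connected layers, while h and w are the height and width of the candidate region on the feature map. Because the region is divided into H x W sub-windows, each sub-window has size approximately h/H × w/W.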

**The RoI pooling layer**
------ The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r,c,h,w) that specifies its top-left corner (r,c) and its height and width (h,w).
------ RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].
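RoI max pooling on a single channel can be sketched in a few lines. This is a minimal illustration, assuming the feature map is a nested list and that sub-window boundaries are rounded with floor and ceil; `roi_pool` is a made-up name, not an API from any library.

```python
import math

def roi_pool(channel, roi, out_h=7, out_w=7):
    """Max-pool the (r, c, h, w) window of one channel into an out_h x out_w grid."""
    r, c, h, w = roi
    pooled = []
    for i in range(out_h):
        r0 = r + math.floor(i * h / out_h)
        r1 = r + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = c + math.floor(j * w / out_w)
            c1 = c + math.ceil((j + 1) * w / out_w)
            row.append(max(channel[y][x]
                           for y in range(r0, r1)
                           for x in range(c0, c1)))
        pooled.append(row)
    return pooled

# Toy 15 x 20 feature map whose value encodes its position.
feature = [[y * 20 + x for x in range(20)] for y in range(15)]
a = roi_pool(feature, (2, 3, 8, 10), out_h=2, out_w=2)
b = roi_pool(feature, (0, 0, 15, 20), out_h=2, out_w=2)
```

The two RoIs differ in size, yet both come out as a 2×2 grid; with the paper's H = W = 7 the same code yields 7×7 for any RoI.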

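Another point is the training optimization. Fast R-CNN builds each SGD mini-batch from N = 2 images and samples R/N = 128/2 = 64 RoIs from each, whereas R-CNN and SPPnet in effect use N = 128 images with one RoI from each. The authors say this makes training roughly 64× faster, and that the slow convergence one might expect from drawing so many correlated RoIs from so few images did not show up in practice.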

**Fine-tuning for detection**
------Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.
------ We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy). One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.
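The hierarchical sampling and the origin of the 64× figure can be sketched like this. The dataset below is hypothetical (10 images with 2000 proposal indices each), the names are my own, and the foreground/background balancing used in real training is left out.

```python
import random

def sample_minibatch(rois_by_image, n_images=2, rois_total=128, seed=0):
    """Pick n_images images, then rois_total / n_images RoIs from each."""
    rng = random.Random(seed)
    per_image = rois_total // n_images
    batch = []
    for image_id in rng.sample(sorted(rois_by_image), n_images):
        chosen = rng.sample(rois_by_image[image_id], per_image)
        batch.extend((image_id, roi) for roi in chosen)
    return batch

# Hypothetical dataset: 10 images with 2000 proposal indices each.
dataset = {f"img{i}": list(range(2000)) for i in range(10)}
batch = sample_minibatch(dataset)

# Images that must pass through the conv layers for one mini-batch.
conv_passes_hierarchical = len({image_id for image_id, _ in batch})
conv_passes_one_roi_per_image = 128   # one RoI from each of 128 images
speedup = conv_passes_one_roi_per_image // conv_passes_hierarchical
```

Both schemes process 128 RoIs per mini-batch, but the hierarchical one runs the expensive convolutional layers on only 2 images instead of 128, hence the rough factor of 64.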

