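------R-CNN, Fast R-CNN, and Faster R-CNN are a series of top-conference papers on object detection. It took me a long time with them before things gradually clicked, so I am writing my notes down here. Reading the original papers is still the best option, but they are in English and lean heavily on ideas from earlier work, so going straight to the papers is not friendly to beginners. A better route may be to read an existing write-up online first to get the general idea, and then read the original paper with specific questions in mind; that can get you twice the result for half the effort.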
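Here is RBG's (Ross Girshick's) blog, where most of the papers, along with source code and slides, can be found:
[RBG's blog]
A site with very good explanations that I recommend: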
Gluon.ai
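------Object detection means precisely locating the objects in a given image and labeling each one with its category. In other words, it has to answer both "where is the object" and "what is it" as a single pipeline. This is not an easy problem: object sizes vary widely, the angle and pose of an object are not fixed, objects can appear anywhere in the image, and on top of that they can belong to several different categories. Existing CNNs already solve the "what is in the image" question, i.e. classification, very well; the next few papers trace the rough development of object detection.
The basic idea: box out a large number of regions in the image and classify the object in each region, which yields the detection result.
R-CNN uses Selective Search (SS) to obtain about 2k candidate boxes per image.
Its basic procedure is as follows: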
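First, an over-segmentation method splits the image into many small regions. Then, looking at the current set of regions, the two neighboring regions most likely to belong together are merged. This step repeats until the whole image has been merged into a single region. Note that the merging rule here is the same one R-CNN relies on, and it prioritizes merging four kinds of regions: those with similar color (color histograms); those with similar texture (gradient histograms); those whose total area after merging is small; and those whose merged area fills a large share of its bounding box. Finally, every region that existed at any point is output as a candidate region.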
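The merging loop above can be sketched roughly as follows. This is a toy illustration of the greedy grouping idea, not the actual Selective Search code: the region dicts (`size`, `hist`, `members`), the adjacency set, and the two-term similarity (color histogram intersection plus a size term, with texture and fill left out) are all simplifying assumptions of mine.

```python
def similarity(a, b, image_size):
    # Color term: histogram intersection of two normalized histograms.
    color = sum(min(x, y) for x, y in zip(a["hist"], b["hist"]))
    # Size term: larger when the merged region would still be small.
    size = 1.0 - (a["size"] + b["size"]) / image_size
    return color + size


def merge(a, b):
    total = a["size"] + b["size"]
    # Size-weighted average keeps the merged histogram normalized.
    hist = [(a["size"] * x + b["size"] * y) / total
            for x, y in zip(a["hist"], b["hist"])]
    return {"size": total, "hist": hist, "members": a["members"] | b["members"]}


def hierarchical_grouping(regions, neighbours, image_size):
    """regions: {id: region dict}; neighbours: set of (id_a, id_b) pairs."""
    regions = dict(regions)
    neighbours = {tuple(sorted(p)) for p in neighbours}
    proposals = list(regions.values())  # every region ever seen is a proposal
    next_id = max(regions) + 1
    while neighbours:
        # Pick the most similar pair of adjacent regions and merge it.
        i, j = max(neighbours,
                   key=lambda p: similarity(regions[p[0]], regions[p[1]], image_size))
        regions[next_id] = merge(regions.pop(i), regions.pop(j))
        proposals.append(regions[next_id])
        # Anything that touched i or j now touches the merged region.
        rewired = set()
        for a, b in neighbours:
            if {a, b} == {i, j}:
                continue
            a = next_id if a in (i, j) else a
            b = next_id if b in (i, j) else b
            rewired.add(tuple(sorted((a, b))))
        neighbours = rewired
        next_id += 1
    return proposals


# Toy input: four strips of a 100-pixel image, two reddish and two bluish.
regions = {
    0: {"size": 25, "hist": [1.0, 0.0], "members": frozenset({0})},
    1: {"size": 25, "hist": [0.9, 0.1], "members": frozenset({1})},
    2: {"size": 25, "hist": [0.2, 0.8], "members": frozenset({2})},
    3: {"size": 25, "hist": [0.0, 1.0], "members": frozenset({3})},
}
proposals = hierarchical_grouping(regions, {(0, 1), (1, 2), (2, 3)}, image_size=100)
print([sorted(p["members"]) for p in proposals])
```

In this toy run the two reddish strips merge first, then the two bluish ones, and finally everything collapses into one region; all seven regions seen along the way are kept as proposals.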
------ Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class specific linear SVMs.
------Region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work (e.g., [34, 36]).
------Feature extraction. We extract a 4096-dimensional feature vector from each region proposal using the Caffe [22] implementation of the CNN described by Krizhevsky et al. [23]. Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [22, 23] for more network architecture details.
------ In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227 × 227 pixel size). Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16).
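To make the "dilate, then warp" step concrete, here is a small sketch of the box arithmetic as I understand it. The function name `dilate_box` and the continuous (x1, y1, x2, y2) coordinates are my own assumptions, and clipping to the image border is left out. The reasoning: if p pixels of context must remain on each side after warping to 227 × 227, the original box may only occupy 227 − 2p pixels, so the box has to grow by a factor of 227 / (227 − 2p) around its center.

```python
def dilate_box(x1, y1, x2, y2, p=16, out_size=227):
    """Expand a tight box so that, after warping to out_size x out_size,
    p pixels of context surround the original box on every side."""
    # The original box must map to out_size - 2p pixels of the warped image,
    # so the dilated box is larger by this factor in each dimension.
    scale = out_size / (out_size - 2 * p)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) / 2 * scale
    half_h = (y2 - y1) / 2 * scale
    return cx - half_w, cy - half_h, cx + half_w, cy + half_h


# A 195 x 195 box needs no resizing to fill 227 - 2 * 16 = 195 warped pixels,
# so here the padding in the original image is also 16 pixels per side.
print(dilate_box(100, 100, 295, 295))
```

For boxes of other sizes the padding in original-image pixels differs, but it always comes out as 16 pixels once the dilated box is warped to 227 × 227.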
From the emphasized parts of the passages above, several basic problems of R-CNN can be seen:
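------SPP-net mainly solves the problem noted above that R-CNN requires fixed-size input images (the CNN is followed by FC layers, and a fully connected layer needs a fixed-dimensional vector as input). The paper states at the outset that an image fed into a CNN has to be cropped or warped to resize it to a fixed size, and that both options have unavoidable drawbacks: a crop cannot see the whole image, and a warp tends to deform the image geometrically and so distort it.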
------ In this paper, we introduce a spatial pyramid pooling (SPP) [14], [15] layer to remove the fixed-size constraint of the network. Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully connected layers (or other classifiers).
------The Spatial Pyramid Pooling Layer
------The convolutional layers accept arbitrary input sizes, but they produce outputs of variable sizes. The classifiers (SVM/softmax) or fully-connected layers require fixed-length vectors. Such vectors can be generated by the Bag-of-Words (BoW) approach [16] that pools the features together. Spatial pyramid pooling [14], [15] improves BoW in that it can maintain spatial information by pooling in local spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size. This is in contrast to the sliding window pooling of the previous deep networks [3], where the number of sliding windows depends on the input size.
------ To adopt the deep network for images of arbitrary sizes, we replace the last pooling layer (e.g., pool5, after the last convolutional layer) with a spatial pyramid pooling layer. Figure 3 illustrates our method. In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling). The outputs of the spatial pyramid pooling are kM dimensional vectors with the number of bins denoted as M (k is the number of filters in the last convolutional layer). The fixed-dimensional vectors are the input to the fully-connected layer.
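Here is a minimal pure-Python sketch of what such a layer computes, to show that the output length kM does not depend on the input size. The nested-list feature map, the pyramid levels `(4, 2, 1)` (so M = 16 + 4 + 1 = 21), and the floor/ceil bin boundaries are assumptions on my part; the concatenation order is also just one possible choice.

```python
import math


def spp(feature_map, levels=(4, 2, 1)):
    """feature_map: list of k channels, each an h x w nested list.
    Returns a flat list of length k * M, where M = sum(n * n for n in levels)."""
    out = []
    for channel in feature_map:
        h, w = len(channel), len(channel[0])
        for n in levels:
            for i in range(n):
                # Bin i covers rows [floor(i*h/n), ceil((i+1)*h/n)).
                r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
                for j in range(n):
                    c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                    # Max pooling inside the spatial bin.
                    out.append(max(channel[r][c]
                                   for r in range(r0, r1)
                                   for c in range(c0, c1)))
    return out


def make_map(k, h, w):
    # Hypothetical feature map whose values simply grow with position.
    return [[[ch + r * w + c for c in range(w)] for r in range(h)]
            for ch in range(k)]


small = spp(make_map(3, 10, 7))   # k = 3 filters on a 10 x 7 map
large = spp(make_map(3, 13, 13))  # k = 3 filters on a 13 x 13 map
print(len(small), len(large))     # both are k * M = 3 * 21
```

Both feature maps, despite their different sizes and aspect ratios, end up as vectors of the same length, which is what the fully connected layers need.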
------With spatial pyramid pooling, the input image can be of any sizes. This not only allows arbitrary aspect ratios, but also allows arbitrary scales. We can resize the input image to any scale (e.g., min(w,h)=180, 224, …) and apply the same deep network. When the input image is at different scales, the network (with the same filter sizes) will extract features at different scales.
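------In the figure above, sizeX is the size of the pooling window, stride is the step, and pool3x3 means the pooled output is 3x3. It follows that pool1x1 applies max pooling over the entire input feature map, while pool2x2 splits the input feature map into 2x2 = 4 regions and max-pools each one. Given how sizeX and stride are constrained, max pooling essentially finds the maximum in each region. (My own take: this selects some features repeatedly, because the overall maximum is also the maximum of any region that contains it; might mean pooling or Gaussian pooling work better?)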
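The sizeX and stride values can be worked out with a couple of lines. I am assuming here that the figure is the SPP-net paper's example with a 13 × 13 conv5 feature map, and using the paper's rule of window = ⌈a/n⌉ and stride = ⌊a/n⌋ for an a × a map and an n × n pyramid level; the helper names are mine.

```python
import math


def spp_window(a, n):
    """Window size and stride for pooling an a x a map down to n x n."""
    return math.ceil(a / n), a // n


def pooled_size(a, window, stride):
    """Output side length of a plain sliding-window pooling layer."""
    return (a - window) // stride + 1


for n in (3, 2, 1):
    window, stride = spp_window(13, n)
    print(f"pool{n}x{n}: sizeX={window}, stride={stride}, "
          f"output={pooled_size(13, window, stride)}")
```

For a = 13 this gives sizeX/stride of 5/4, 7/6 and 13/13, and the sliding window lands on exactly 3, 2 and 1 positions per side.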
Fast R-CNN is likewise an improvement aimed at R-CNN's shortcomings:
**Fast R-CNN architecture**
------A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
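As for RoI pooling, the authors say it can be viewed as a special case of SPP-net, namely a pooling layer that uses only a single pyramid level. H and W are hyper-parameters set according to the fully connected layers, and h and w are the height and width of the candidate region on the feature map. Because the region is divided into H x W sub-windows, each sub-window has a size of approximately h/H × w/W.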
**The RoI pooling layer**
------ The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r,c,h,w) that specifies its top-left corner (r,c) and its height and width (h,w).
------ RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].
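A rough single-channel sketch of RoI max pooling as described above. The nested-list feature map, the function name `roi_max_pool`, and the floor/ceil rounding of sub-window boundaries are my assumptions (the paper defers the exact sub-window calculation to SPPnet), and the RoI is assumed to lie fully inside the feature map.

```python
import math


def roi_max_pool(feature_map, roi, H=7, W=7):
    """feature_map: 2D nested list (one channel); roi: (r, c, h, w)
    with top-left corner (r, c) and height/width (h, w) in feature-map cells.
    Returns an H x W nested list."""
    r, c, h, w = roi
    out = []
    for i in range(H):
        # Rows of sub-window i, roughly h / H cells tall.
        r0 = r + (i * h) // H
        r1 = r + math.ceil((i + 1) * h / H)
        row = []
        for j in range(W):
            # Columns of sub-window j, roughly w / W cells wide.
            c0 = c + (j * w) // W
            c1 = c + math.ceil((j + 1) * w / W)
            row.append(max(feature_map[y][x]
                           for y in range(r0, r1)
                           for x in range(c0, c1)))
        out.append(row)
    return out


# Hypothetical 8 x 8 feature map whose value encodes the cell position.
fmap = [[y * 8 + x for x in range(8)] for y in range(8)]
print(roi_max_pool(fmap, (1, 2, 4, 6), H=2, W=2))  # a 4 x 6 RoI
print(roi_max_pool(fmap, (0, 0, 8, 8), H=2, W=2))  # an 8 x 8 RoI
```

RoIs of different sizes both come out as 2 × 2 here, in the same way that the real layer always produces H × W (e.g. 7 × 7) for the fully connected layers. In a full network the same pooling would run independently on every channel.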
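Another point is the optimization of training. Fast R-CNN builds each SGD mini-batch from N = 2 images and computes R/N = 128/2 = 64 RoIs per image, whereas R-CNN and SPPnet effectively use N = 128 images and compute one RoI from each. The authors say this optimization makes training roughly 64× faster, and that the slow convergence one might worry about, since RoIs from the same image are correlated, did not show up in practice.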
**Fine-tuning for detection**
------Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.
------ We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy). One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.
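The hierarchical sampling can be sketched like this. It is only an illustration of "N images first, then R/N RoIs per image": the `rois_by_image` dictionary and the function name are hypothetical, each image is assumed to have at least R/N proposals, and how RoIs are chosen within an image (e.g. any foreground/background balancing) is ignored.

```python
import random


def sample_minibatch(rois_by_image, N=2, R=128, rng=random):
    """rois_by_image: {image_id: [roi, ...]}.
    Returns a list of R (image_id, roi) pairs drawn from only N images."""
    per_image = R // N
    # Step 1: sample N images.
    images = rng.sample(sorted(rois_by_image), N)
    batch = []
    # Step 2: sample R / N RoIs from each of those images.
    for image_id in images:
        for roi in rng.sample(rois_by_image[image_id], per_image):
            batch.append((image_id, roi))
    return batch


# Hypothetical dataset: 10 images with 2000 proposals each (ints as stand-ins).
dataset = {image_id: list(range(2000)) for image_id in range(10)}
batch = sample_minibatch(dataset, N=2, R=128, rng=random.Random(0))

images_used = {image_id for image_id, _ in batch}
# Conv layers run once per image: 2 forward passes here versus 128 when every
# RoI comes from a different image, which is where the rough 64x comes from.
print(len(batch), len(images_used), 128 // len(images_used))
```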
The evolution of deep-learning-based object detection: R-CNN, Fast R-CNN, Faster R-CNN
RCNN, Fast RCNN, Faster RCNN: a summary