Fast R-CNN
ICCV 2015
https://github.com/rbgirshick/fast-rcnn
This paper is a bridge in the object detection line of work: R-CNN led to Fast R-CNN, which in turn led to the more complete Faster R-CNN. Its biggest gain is speed.
A Fast R-CNN network takes as input an entire image and a set of object proposals.
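The input to Fast R-CNN is an entire image plus a set of rectangular boxes (object proposals) that may contain objects, obtained by selective search or another proposal method.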
The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map.
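The whole image passes through a series of convolutional and pooling layers to produce a convolutional feature map.
Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
For each proposal box, an RoI pooling layer extracts a fixed-length feature vector from the feature map.
The RoI pooling layer is the key to the step from R-CNN to Fast R-CNN. R-CNN first warps each proposal box to a fixed size, then computes convolutional features for that box, and then performs classification and box regression. Fast R-CNN instead computes the convolutional feature map of the whole image first, then uses the RoI pooling layer to extract a fixed-size feature vector for each box, which feeds the subsequent classification and box regression.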
Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
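Each feature vector is fed into several fully connected layers, followed by two sibling output layers: one outputs softmax class probabilities (K object classes plus one background class), and the other predicts bounding-box coordinates for each of the K object classes.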
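To make the shapes of the two sibling outputs concrete, here is a minimal sketch in plain Python. The helper names (`linear`, `init_layer`, `sibling_heads`), the toy feature size `D`, and the random weights are assumptions for illustration only; they are not taken from the paper or its released code.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng, n_out, n_in):
    """Toy random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def sibling_heads(feature, cls_layer, bbox_layer, num_classes):
    """Apply both output layers to one RoI feature vector."""
    # Classification head: K + 1 probabilities (K classes plus background).
    cls_probs = softmax(linear(feature, *cls_layer))
    # Regression head: 4 * K numbers, regrouped as one 4-tuple per class.
    flat = linear(feature, *bbox_layer)
    boxes = [flat[4 * k:4 * k + 4] for k in range(num_classes)]
    return cls_probs, boxes

K, D = 20, 16                      # K classes; D is a toy feature length
rng = random.Random(0)
cls_layer = init_layer(rng, K + 1, D)
bbox_layer = init_layer(rng, 4 * K, D)
feature = [rng.random() for _ in range(D)]
cls_probs, boxes = sibling_heads(feature, cls_layer, bbox_layer, K)
print(len(cls_probs), len(boxes), len(boxes[0]))
```

The point is only the output layout: one probability vector of length K + 1 and one set of four box values per object class, both computed from the same RoI feature vector.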
2.1. The RoI pooling layer
The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7).
The RoI pooling layer uses max pooling to turn a feature-map region of arbitrary size into a small feature map of fixed size H × W (e.g., 7 × 7).
The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level.
The RoI layer is a special case of the spatial pyramid pooling layer in SPPnets, with only one pyramid level.
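The pooling step can be sketched as follows: the RoI window is divided into an H × W grid of sub-windows and each sub-window is max-pooled. This is a single-channel, plain-Python illustration. The function name `roi_max_pool`, the inclusive `(r0, c0, r1, c1)` RoI format, and the floor/ceil rounding of the bin edges are assumptions of this sketch, not details given in the text above.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)

def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows (one channel).
    roi: (r0, c0, r1, c1), inclusive corners in feature-map coordinates.
    """
    r0, c0, r1, c1 = roi
    h = r1 - r0 + 1
    w = c1 - c0 + 1
    pooled = []
    for i in range(out_h):
        # Floor for the start and ceil for the end keep every bin non-empty,
        # even when the RoI is smaller than the output grid.
        row_start = r0 + (i * h) // out_h
        row_end = r0 + ceil_div((i + 1) * h, out_h)
        out_row = []
        for j in range(out_w):
            col_start = c0 + (j * w) // out_w
            col_end = c0 + ceil_div((j + 1) * w, out_w)
            out_row.append(max(
                feature_map[r][c]
                for r in range(row_start, row_end)
                for c in range(col_start, col_end)
            ))
        pooled.append(out_row)
    return pooled

# Toy 4 x 4 feature map with values 0..15.
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
full = roi_max_pool(fmap, (0, 0, 3, 3), out_h=2, out_w=2)
part = roi_max_pool(fmap, (1, 1, 3, 3), out_h=2, out_w=2)
tiny = roi_max_pool(fmap, (2, 2, 2, 2), out_h=2, out_w=2)
print(full, part, tiny)
```

Whatever the RoI size (4 × 4, 3 × 3, or a single cell here), the output is always 2 × 2, which is what lets the fully connected layers accept it. In a real network the same pooling would run independently on every channel.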
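From this, Fast R-CNN makes three main improvements:
1. Convolution is no longer run per region proposal but once over the whole image, which removes a lot of repeated computation. R-CNN ran the convolutional network on each region proposal separately; with roughly 2000 heavily overlapping proposals per image, much of that computation was redundant.
2. RoI pooling resizes the features. The fully connected layers require inputs of a fixed size, so the features of an arbitrarily sized region proposal cannot be fed to them directly.
3. The regressor is trained inside the network, with one regressor per class, and softmax replaces the original SVM classifiers.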
2.3. Fine-tuning for detection
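Training all network weights with back-propagation is an important capability of Fast R-CNN.
Being able to train all network weights by back-propagation is an important capability of Fast R-CNN. SPPnet, by contrast, cannot update the weights below its spatial pyramid pooling layer.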
The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained.
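Back-propagation through the SPP layer is inefficient because each training sample (RoI) comes from a different image, which is exactly the strategy used to train R-CNN and SPPnet.
The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).
Fast R-CNN uses a more efficient training strategy: sample 64 RoIs from each image, compute the convolutional feature map of the whole image once, and then train on every RoI using that shared feature map.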
In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).
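The hierarchical sampling described above can be sketched in a few lines of plain Python. The function `sample_minibatch`, the `rois_by_image` dictionary, and the toy proposal boxes are hypothetical names and data for illustration; the sketch samples RoIs uniformly and ignores any balancing of foreground and background RoIs.

```python
import random

def sample_minibatch(rois_by_image, n_images=2, rois_per_batch=128, rng=None):
    """Sample N images first, then R/N RoIs from each sampled image."""
    rng = rng or random.Random()
    per_image = rois_per_batch // n_images
    image_ids = rng.sample(sorted(rois_by_image), n_images)
    batch = []
    for image_id in image_ids:
        candidates = rois_by_image[image_id]
        chosen = rng.sample(candidates, min(per_image, len(candidates)))
        batch.extend((image_id, roi) for roi in chosen)
    return batch

# Toy dataset: 10 images with 200 proposal boxes (x0, y0, x1, y1) each.
rng = random.Random(0)
rois_by_image = {
    f"img{i}": [(x, x, x + 10, x + 10) for x in range(200)]
    for i in range(10)
}
batch = sample_minibatch(rois_by_image, n_images=2, rois_per_batch=128, rng=rng)

# Whole-image conv passes needed for this mini-batch: one per distinct image.
conv_passes_hierarchical = len({image_id for image_id, _ in batch})
# Sampling one RoI from each of 128 different images needs 128 passes.
conv_passes_per_roi = 128
print(len(batch), conv_passes_hierarchical,
      conv_passes_per_roi // conv_passes_hierarchical)
```

With N = 2 and R = 128 the mini-batch holds 64 RoIs from each of 2 images, so the convolutional layers run on 2 images instead of 128. That ratio of 64 matches the "roughly 64×" figure quoted above, although counting forward passes is only a rough proxy for actual training time.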
Fast R-CNN jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages.
Multi-task loss
A multi-task loss on each labeled RoI is used to jointly train for classification and bounding-box regression.
OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers; however, these methods use stage-wise training, which we show is suboptimal for Fast R-CNN.
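These notes do not write out the multi-task loss, so the sketch below follows the definition in the Fast R-CNN paper: L = L_cls + λ[u ≥ 1] L_loc, where L_cls is the log loss for the true class u, L_loc is a smooth L1 loss summed over the four box values, and background RoIs (u = 0) contribute no box loss. The function names and the list-based inputs are assumptions for illustration.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def multi_task_loss(cls_probs, true_class, box_preds, box_target, lam=1.0):
    """Joint classification and box-regression loss for one RoI.

    cls_probs: K + 1 softmax probabilities; index 0 is background.
    true_class: ground-truth class u in 0..K.
    box_preds: K lists of 4 predicted values, for classes 1..K.
    box_target: 4 regression targets for the true class (unused if u == 0).
    """
    loss_cls = -math.log(cls_probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so only L_cls applies.
        return loss_cls
    predicted = box_preds[true_class - 1]
    loss_loc = sum(smooth_l1(t - v) for t, v in zip(predicted, box_target))
    return loss_cls + lam * loss_loc

probs = [0.5, 0.25, 0.25]                      # background + K = 2 classes
preds = [[0.5, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0]]
target = [0.0, 0.0, 0.0, 0.0]

loss_background = multi_task_loss(probs, 0, preds, target)
loss_foreground = multi_task_loss(probs, 1, preds, target)
print(loss_background, loss_foreground)
```

Because both terms are summed into one scalar, a single backward pass updates the shared layers for classification and localization together, which is the contrast with the stage-wise training mentioned above.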