Original paper: LINK
Citations: 10839 (as of 08/09/2020)
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
本文提出了一种快速的基于区域的卷积网络方法(Fast R-CNN)进行目标检测。Fast R-CNN以先前的工作为基础,使用深度卷积网络对目标建议进行有效分类。与以前的工作相比,Fast R-CNN采用了多项创新,可以提高训练和测试速度,同时还可以提高检测精度。Fast R-CNN比R-CNN训练非常深的VGG16网络快9倍,在测试时快213倍,在PASCAL VOC 2012上达到更高的mAP。与SPPnet相比,Fast R-CNN训练VGG16快3倍,测试速度提高了10倍,并且更加准确。Fast R-CNN是使用Python和C ++(使用Caffe)实现的,并且可以在https://github.com/rbgirshick/fast-rcnn的开源MIT许可下获得。
Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches (e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.
最近,深层ConvNets [14,16]显着提高了图像分类[14]和对象检测[9,19]的准确性。与图像分类相比,目标检测是一项更具挑战性的任务,需要更复杂的方法来解决。由于这种复杂性,目前的方法(例如[9、11、19、25])在速度较慢且不佳的多阶段管道中训练模型。
Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called “proposals”) must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.
复杂性之所以出现是因为检测需要精确定位对象,从而带来两个主要挑战。首先,必须处理许多候选对象位置(通常称为“建议”)。其次,这些候选对象仅提供粗略的定位,必须对其进行细化以实现精确的定位。这些问题的解决方案通常会损失速度,准确性或简单性。
In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.
在本文中,我们简化了基于ConvNet的最新对象检测器的训练过程[9,11]。我们提出了一种单阶段训练算法,该算法可以共同学习以对对象建议进行分类并优化其空间位置。
The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).
所得方法可以训练非常深的检测网络(VGG16 [20]),其速度比R-CNN [9]快9倍,比SPPnet [11]快3倍。在运行时,检测网络在0.3s内处理图像(不包括对象建议时间),同时以66%的mAP(对于R-CNN为62%)达到 PASCAL VOC 2012[7] 的最高准确性。
The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:
基于区域的卷积网络方法(RCNN)[9]通过使用深层的ConvNet对目标建议进行分类,实现了出色的目标检测精度。但是,R-CNN具有明显的缺点:
R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6 × 6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.
R-CNN速度很慢,因为它为每个对象建议执行一次ConvNet前向传递,而不共享计算。[11]提出了空间金字塔池化网络(SPPnet),以通过共享计算来加速R-CNN。SPPnet方法为整个输入图像计算卷积特征图,然后使用从共享特征图中提取的特征向量对每个对象建议进行分类。通过将建议中的特征图的一部分最大池化为固定大小的输出(例如6×6)来提取建议的特征。池化多个输出大小,然后像在空间金字塔池化中一样进行串联[15]。在测试时,SPPnet将R-CNN的速度提高了10到100倍。由于建议特征提取速度更快,训练时间也减少了3倍。
SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.
SPPnet也有明显的缺点。像R-CNN一样,训练是一个多阶段的管道,涉及提取特征,对网络进行对数损失微调,训练SVM,最后拟合边界框回归,特征也会写入磁盘。但是与R-CNN不同,文献[11]中提出的微调算法无法更新空间金字塔池之前的卷积层。毫不奇怪,此限制(固定的卷积层)限制了非常深的网络的准确性。
We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it is comparatively fast to train and test. The Fast R-CNN method has several advantages:
我们提出了一种新的训练算法,该算法可以解决R-CNN和SPPnet的缺点,同时提高其速度和准确性。我们称此方法为“Fast R-CNN”,因为它的训练和测试速度相对较快。Fast RCNN方法具有以下优点:
Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
图1说明了Fast R-CNN架构。Fast R-CNN网络将整个图像和一组对象建议作为输入。网络首先使用几个卷积(conv)和最大池化层处理整个图像,以生成卷积特征图。然后,对于每个对象建议,感兴趣区域(region of interest,RoI)池化层从特征图中提取固定长度的特征向量。每个特征向量都被送到一系列全连接层中,这些层最终分支为两个同级输出层:一层在K个对象类以及所有“背景”类上产生softmax概率估计,另一层分别为K个对象类输出4个实数,表示边界框位置。
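Figure 1. Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.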
The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H ×W (e.g., 7×7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).
RoI池化层使用最大池化将任何有效感兴趣区域内的特征转换为具有固定空间范围 H×W(例如 7×7)的小特征图,其中H和W是层超参数,独立于任何特定的RoI。在本文中,RoI是卷积特征图上的一个矩形窗口。每个RoI由一个四元组 (r, c, h, w) 定义,该四元组指定其左上角 (r, c) 以及其高度和宽度 (h, w)。
RoI max pooling works by dividing the h×w RoI window into an H×W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].
RoI最大池化的工作方式是将 h×w RoI 窗口划分为大小约为 h/H × w/W 的子窗口的 H×W 网格,然后将每个子窗口中的最大池化到相应的输出网格单元中。像标准最大池化一样,池化独立应用于每个特征图通道。RoI层只是在SPPnet [11]中使用的空间金字塔池化层的特例,在该空间中,金字塔层只有一个。我们使用[11]中给出的池化窗口计算。
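The RoI max pooling step can be sketched in a few lines of plain Python. This is only an illustration for a single feature-map channel, not the authors' Caffe layer: the function name `roi_max_pool` is hypothetical, and the floor/ceil rule used to place the sub-window edges is an assumption standing in for the sub-window calculation of [11].

```python
import math

def roi_max_pool(feature_map, roi, H, W):
    """Max-pool one RoI of a single-channel feature map into an H x W grid.

    feature_map: list of rows (list of lists of floats).
    roi: (r, c, h, w), top-left corner and height/width in feature-map cells.
    """
    r, c, h, w = roi
    out = [[0.0] * W for _ in range(H)]
    for ph in range(H):
        # Assumed binning rule: floor for the start edge, ceil for the end edge,
        # so every sub-window covers at least one cell.
        r0 = r + math.floor(ph * h / H)
        r1 = r + math.ceil((ph + 1) * h / H)
        for pw in range(W):
            c0 = c + math.floor(pw * w / W)
            c1 = c + math.ceil((pw + 1) * w / W)
            out[ph][pw] = max(
                feature_map[i][j] for i in range(r0, r1) for j in range(c0, c1)
            )
    return out

# Toy 6 x 8 feature map whose value grows with the row and column index.
fmap = [[float(8 * i + j) for j in range(8)] for i in range(6)]
pooled = roi_max_pool(fmap, roi=(1, 2, 4, 6), H=2, W=2)
print(pooled)
```

Each of the four output cells holds the maximum of a sub-window of roughly h/H × w/W = 2 × 3 cells. In a full layer the same routine would run independently on every channel.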
We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.
我们使用三个经过预训练的ImageNet [4]网络进行实验,每个网络具有五个最大池化层以及五到十三个卷积层(有关网络详细信息,请参见第4.1节)。当预训练的网络初始化Fast R-CNN网络时,它将经历三个转换。
First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).
首先,最后一个最大池化层被RoI池化层取代,该RoI池化层通过将H和W设置为与网络的第一个全连接层兼容(例如,对于VGG16,H = W = 7)进行配置。
Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).
其次,将网络的最后一个全连接层和softmax(经过1000路ImageNet分类训练)替换为先前描述的两个同级层(K + 1类的全连接层和softmax以及特定于类别的边界框回归器)。
Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
第三,修改网络以获取两个数据输入:图像列表和这些图像中的RoI列表。
Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.
使用反向传播训练所有网络权重是Fast R-CNN的一项重要功能。首先,让我们说明一下 为什么SPPnet无法更新空间金字塔池化层以下的权重。
The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).
根本原因是,当每个训练样本(即RoI)来自不同的图像时,通过SPP层进行的反向传播效率非常低,这正是R-CNN和SPPnet网络的训练方式。效率低下的原因是,每个RoI可能都有很大的接收场,通常跨越整个输入图像。由于前向通过必须处理整个接收场,因此训练输入很大(通常是整个图像)。
We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).
我们提出了一种更有效的训练方法,该方法可以在训练过程中利用特征共享。在Fast RCNN训练中,首先对N个图像进行采样,然后对每个图像中的 R/N RoI 进行采样,从而对随机梯度下降(SGD)小批次进行分层采样。至关重要的是,来自同一图像的RoI在前向和后向计算中共享计算和内存。使N变小会减少小批量计算。例如,当使用N = 2和R = 128时,提出的训练方案比从128个不同图像中采样一个RoI快大约64倍(即R-CNN和SPPnet策略)。
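Hierarchical sampling can be sketched as follows. The helper `sample_minibatch` and the toy dataset are hypothetical; the point is only that a mini-batch of R = 128 RoIs touches N = 2 images, so the shared conv computation runs on 2 images instead of 128.

```python
import random

def sample_minibatch(rois_per_image, N=2, R=128, rng=random):
    """Sample N images, then R // N RoIs from each of them.

    rois_per_image: dict mapping image id -> list of RoIs.
    Returns a list of (image_id, roi) pairs of length R.
    """
    image_ids = rng.sample(sorted(rois_per_image), N)
    batch = []
    for image_id in image_ids:
        rois = rois_per_image[image_id]
        batch.extend((image_id, roi) for roi in rng.sample(rois, R // N))
    return batch

# Toy dataset: 10 images with 2000 placeholder RoIs each.
dataset = {f"img{i}": [(i, k) for k in range(2000)] for i in range(10)}
batch = sample_minibatch(dataset, rng=random.Random(0))
images_touched = {image_id for image_id, _ in batch}

# Images that need a conv forward/backward pass: 2 here versus 128 when
# every RoI comes from a different image.
reduction = len(batch) // len(images_touched)
print(len(batch), len(images_touched), reduction)
```

The factor of 64 counts image-level passes only, which is why the text describes the speedup as rough.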
One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.
这种策略的一个担心是,由于来自同一图像的RoI相互关联,因此可能会导致训练收敛缓慢。这种担忧似乎不是实际问题,并且在N = 2和R = 128的情况下,与R-CNN相比,使用更少的SGD迭代可以获得良好的结果。
In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.
除分层采样外,Fast R-CNN使用简化的训练过程和一个微调阶段来共同优化softmax分类器和边界框回归器,而不是在三个单独的阶段中训练softmax分类器,SVM和回归器[9,11]。下面描述了此过程的组成部分(损失,小批量采样策略,通过RoI池化层进行的反向传播以及SGD超参数)。
Multi-task loss. A Fast R-CNN network has two sibling output layers. The first outputs a discrete probability distribution (per RoI), $p = (p_0, \ldots, p_K)$, over $K + 1$ categories. As usual, $p$ is computed by a softmax over the $K + 1$ outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets, $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, for each of the $K$ object classes, indexed by $k$. We use the parameterization for $t^k$ given in [9], in which $t^k$ specifies a scale-invariant translation and log-space height/width shift relative to an object proposal.
多任务损失。Fast R-CNN网络具有两个同级输出层。第一个输出在 $K + 1$ 个类别上的离散概率分布(每个RoI),$p = (p_0, \ldots, p_K)$。通常,$p$ 是通过在全连接层的 $K + 1$ 个输出上的softmax计算的。第二个同级层为 $K$ 个对象类别中的每一个(以 $k$ 为索引)输出边界框回归偏移 $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$。我们使用[9]中给出的 $t^k$ 的参数化,其中 $t^k$ 指定了相对于对象建议的尺度不变平移和对数空间高度/宽度偏移。
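As an illustration of a scale-invariant translation with a log-space width/height shift, the sketch below encodes a target box relative to a proposal and decodes it again. The function names are hypothetical, and the assumptions are that boxes are given as (center x, center y, width, height) and that the formulas follow the R-CNN parameterization cited as [9].

```python
import math

def encode(proposal, target):
    """Offsets t = (t_x, t_y, t_w, t_h) that map the proposal onto the target."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return (
        (gx - px) / pw,       # translation in units of proposal width
        (gy - py) / ph,       # translation in units of proposal height
        math.log(gw / pw),    # log-space width shift
        math.log(gh / ph),    # log-space height shift
    )

def decode(proposal, t):
    """Apply offsets t to a proposal to recover the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))

proposal = (50.0, 50.0, 20.0, 40.0)
target = (55.0, 60.0, 30.0, 40.0)
t = encode(proposal, target)
refined = decode(proposal, t)

# Doubling the size of both boxes leaves the offsets unchanged.
t_scaled = encode((100.0, 100.0, 40.0, 80.0), (110.0, 120.0, 60.0, 80.0))
print(t, refined, t_scaled)
```

Because the translation is divided by the proposal size and the size change is a ratio, the same offsets describe the same relative correction at any image scale.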
Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression:
每个训练RoI都标有真实类别 $u$ 和真实边界框回归目标 $v$。我们在每个标记的RoI上使用多任务损失 $L$ 来共同训练分类和边界框回归:
$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v), \tag{1}$$
in which $L_{cls}(p, u) = -\log p_u$ is the log loss for the true class $u$.
其中 $L_{cls}(p, u) = -\log p_u$ 是真实类别 $u$ 的对数损失。
The second task loss, $L_{loc}$, is defined over a tuple of true bounding-box regression targets for class $u$, $v = (v_x, v_y, v_w, v_h)$, and a predicted tuple $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, again for class $u$. The Iverson bracket indicator function $[u \geq 1]$ evaluates to 1 when $u \geq 1$ and 0 otherwise. By convention the catch-all background class is labeled $u = 0$. For background RoIs there is no notion of a ground-truth bounding box and hence $L_{loc}$ is ignored. For bounding-box regression, we use the loss
第二个任务损失 $L_{loc}$ 定义在类别 $u$ 的真实边界框回归目标元组 $v = (v_x, v_y, v_w, v_h)$ 和同样针对类别 $u$ 的预测元组 $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$ 上。Iverson括号指示函数 $[u \geq 1]$ 在 $u \geq 1$ 时取值为1,否则为0。按照惯例,包罗所有背景的类别标记为 $u = 0$。对于背景RoI,没有真实边界框的概念,因此 $L_{loc}$ 被忽略。对于边界框回归,我们使用损失
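$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i), \tag{2}$$
in which
$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise,} \end{cases} \tag{3}$$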
is a robust $L_1$ loss that is less sensitive to outliers than the $L_2$ loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with $L_2$ loss can require careful tuning of learning rates in order to prevent exploding gradients. Eq. 3 eliminates this sensitivity.
是一个健壮的L1 loss,对异常值的敏感性不如R-CNN和SPPnet中使用的L2 loss。当回归目标不受限制时,使用L2 loss进行训练可能需要仔细调整学习率,以防止梯度爆炸。等式3消除了这种敏感性。
The hyper-parameter $\lambda$ in Eq. 1 controls the balance between the two task losses. We normalize the ground-truth regression targets $v_i$ to have zero mean and unit variance. All experiments use $\lambda = 1$.
式1中的超参数 λ λ λ 控制两个任务损失之间的平衡。我们对具有零均值和单位方差的真实标注回归目标 v i v_i vi 进行归一化。所有实验均使用 λ = 1 λ = 1 λ=1。
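The multi-task loss for a single RoI can be written out directly. This is a minimal sketch with hypothetical function names: `p` is the softmax output, `u` the ground-truth class (0 for background), `t_u` the predicted offsets for class `u`, and `v` the regression targets. The smooth L1 term is assumed to be quadratic for |x| < 1 and linear otherwise.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1, so outliers have bounded gradient."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multitask_loss(p, u, t_u, v, lam=1.0):
    """Classification log loss plus lam * localization loss for foreground RoIs."""
    l_cls = -math.log(p[u])
    # Iverson bracket [u >= 1]: background RoIs contribute no localization loss.
    l_loc = sum(smooth_l1(ti - vi) for ti, vi in zip(t_u, v)) if u >= 1 else 0.0
    return l_cls + lam * l_loc

zeros = (0.0, 0.0, 0.0, 0.0)
foreground = multitask_loss([0.2, 0.8], 1, (0.5, 0.0, 0.0, 2.0), zeros)
background = multitask_loss([0.5, 0.5], 0, (0.5, 0.0, 0.0, 2.0), zeros)
print(foreground, background)
```

Setting `lam=0.0` reduces this to the classification-only objective, and the gradient of `smooth_l1` never exceeds 1 in magnitude, which is the property that removes the sensitivity to unbounded targets.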
We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a two-network system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers; however, these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).
我们注意到,[6]使用相关的损失来训练分类对象建议网络。与我们的方法不同,[6]提倡将定位和分类分开的两个网络系统。 OverFeat [19],R-CNN [9]和SPPnet [11]也训练分类器和边界框定位器,但是这些方法使用分阶段训练,对于快速R-CNN(5.1节),我们证明它们是次优的。
Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.
小批量采样。在微调过程中,每个SGD微型批处理均由N = 2张图像构成,并随机选择(按照惯例,我们实际上对数据集的排列进行迭代)。我们使用大小为R = 128的小批次,从每个图像中采样64个RoI。像[9]中一样,我们从目标提案中获得25%的投资回报率,这些建议的联合交叉点(IoU)与真实标注边界框至少重叠0.5。这些RoI包括标有前景对象类(即u≥1)的示例,其余的RoI则是从对象建议中采样的,这些对象建议的最大IoU范围为[0.1,0.5),紧随[11]。这些是背景样本,并标有u =0。较低的阈值0.1似乎可以作为难样本挖掘的启发式方法[8]。在训练期间,图像以0.5的概率水平翻转。不使用其他数据扩充。
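The foreground/background split for one image can be sketched as below. The helper names are hypothetical, boxes are assumed to be (x1, y1, x2, y2) corners, and the image is assumed to have at least one ground-truth box. The thresholds (0.5, 0.1) and the 25% foreground share come from the paragraph above.

```python
import random

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def sample_rois(proposals, gt_boxes, gt_classes, n=64, fg_fraction=0.25, rng=random):
    """Return up to n (box, label) pairs: label u >= 1 for foreground, 0 for background."""
    fg, bg = [], []
    for box in proposals:
        overlaps = [iou(box, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=overlaps.__getitem__)
        if overlaps[best] >= 0.5:
            fg.append((box, gt_classes[best]))   # foreground: class of best match
        elif overlaps[best] >= 0.1:
            bg.append((box, 0))                  # background: max IoU in [0.1, 0.5)
        # proposals with max IoU below 0.1 are never sampled
    n_fg = min(int(n * fg_fraction), len(fg))
    n_bg = min(n - n_fg, len(bg))
    return rng.sample(fg, n_fg) + rng.sample(bg, n_bg)

gt = [(0, 0, 10, 10)]
proposals = [(0, 0, 10, 10), (0, 0, 10, 6), (0, 0, 10, 4), (20, 20, 30, 30)]
picked = sample_rois(proposals, gt, gt_classes=[3], n=4, rng=random.Random(0))
labels = sorted(label for _, label in picked)
print(labels)
```

With `n=64` this yields at most 16 foreground RoIs per image; how a shortfall of foreground or background candidates is handled is a choice of this sketch, not something the text specifies.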
Back-propagation through RoI pooling layers. Backpropagation routes derivatives through the RoI pooling layer. For clarity, we assume only one image per mini-batch (N = 1), though the extension to N > 1 is straightforward because the forward pass treats all images independently.
通过RoI池化层进行反向传播。反向传播通过RoI池化层派生。为了清楚起见,我们假设每个小批量(N = 1)仅一张图像,尽管扩展到N> 1很简单,因为前向通过独立地对待所有图像。
Let $x_i \in \mathbb{R}$ be the $i$-th activation input into the RoI pooling layer and let $y_{rj}$ be the layer's $j$-th output from the $r$-th RoI. The RoI pooling layer computes $y_{rj} = x_{i^*(r,j)}$, in which $i^*(r, j) = \operatorname{argmax}_{i' \in \mathcal{R}(r,j)} x_{i'}$. $\mathcal{R}(r, j)$ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. A single $x_i$ may be assigned to several different outputs $y_{rj}$.
令 $x_i \in \mathbb{R}$ 为RoI池化层的第 $i$ 个激活输入,令 $y_{rj}$ 为该层对第 $r$ 个RoI的第 $j$ 个输出。RoI池化层计算 $y_{rj} = x_{i^*(r,j)}$,其中 $i^*(r, j) = \operatorname{argmax}_{i' \in \mathcal{R}(r,j)} x_{i'}$。$\mathcal{R}(r, j)$ 是子窗口中输入的索引集,输出单元 $y_{rj}$ 在该子窗口上做最大池化。单个 $x_i$ 可以被分配给几个不同的输出 $y_{rj}$。
The RoI pooling layer's backwards function computes the partial derivative of the loss function with respect to each input variable $x_i$ by following the argmax switches:
RoI池层的向后函数通过遵循argmax开关,针对每个输入变量 x i x_i xi 计算损失函数的偏导数:
$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j \left[ i = i^*(r, j) \right] \frac{\partial L}{\partial y_{rj}}. \tag{4}$$
In words, for each mini-batch RoI $r$ and for each pooling output unit $y_{rj}$, the partial derivative $\partial L / \partial y_{rj}$ is accumulated if $i$ is the argmax selected for $y_{rj}$ by max pooling. In back-propagation, the partial derivatives $\partial L / \partial y_{rj}$ are already computed by the backwards function of the layer on top of the RoI pooling layer.
换句话说,对于每个小批量RoI $r$ 和每个池化输出单元 $y_{rj}$,如果 $i$ 是通过最大池化为 $y_{rj}$ 选择的argmax,则累积偏导数 $\partial L / \partial y_{rj}$。在反向传播中,偏导数 $\partial L / \partial y_{rj}$ 已经由RoI池化层之上的层的反向函数计算得到。
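The accumulation rule can be sketched over flattened inputs. The function `roi_pool_backward` is a hypothetical illustration: `argmax[r][j]` stores the input index that won the max for output unit j of RoI r (recorded during the forward pass), and `grad_y[r][j]` is the gradient arriving from the layer above.

```python
def roi_pool_backward(num_inputs, argmax, grad_y):
    """Route each output gradient back to the input that produced the max.

    argmax[r][j]: index i*(r, j) of the input selected for output y_rj.
    grad_y[r][j]: dL/dy_rj from the layer above.
    Returns dL/dx as a list of length num_inputs.
    """
    grad_x = [0.0] * num_inputs
    for r, row in enumerate(argmax):
        for j, i_star in enumerate(row):
            # Overlapping RoIs may select the same input, so gradients add up.
            grad_x[i_star] += grad_y[r][j]
    return grad_x

# Two RoIs with two output units each; input 3 is the argmax for two outputs.
argmax = [[0, 3], [3, 5]]
grad_y = [[1.0, 2.0], [4.0, 8.0]]
grad_x = roi_pool_backward(6, argmax, grad_y)
print(grad_x)
```

Inputs that were never selected receive zero gradient, and input 3 receives 2.0 + 4.0 because it fed outputs in both RoIs.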
SGD hyper-parameters. The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001. When training on VOC07 or VOC12 trainval we run SGD for 30k mini-batch iterations, and then lower the learning rate to 0.0001 and train for another 10k iterations. When we train on larger datasets, we run SGD for more iterations, as described later. A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.
SGD超参数。从零均值高斯分布分别使用标准偏差0.01和0.001初始化用于softmax分类和边界框回归的完全连接层。偏差被初始化为0,所有层使用权重为1的每层学习率,使用偏差为2的全局学习率为0.001。在对VOC07或VOC12的train val进行训练时,我们对30k个小批量迭代运行SGD,然后将学习率降低至0.0001,然后再进行10k迭代训练。当我们在更大的数据集上进行训练时,我们将运行SGD进行更多的迭代,如稍后所述。使用的动量为0.9,参数衰减为0.0005(基于权重和偏差)。
We explore two ways of achieving scale invariant object detection: (1) via “brute force” learning and (2) by using image pyramids. These strategies follow the two approaches in [11]. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.
我们探索了实现尺度不变物体检测的两种方法:(1)通过“brute force”学习和(2)使用图像金字塔。这些策略遵循[11]中的两种方法。在brute force方法中,在训练和测试期间,每个图像均以预定义的像素大小进行处理。网络必须直接从训练数据中学习尺度不变对象检测。
The multi-scale approach, in contrast, provides approximate scale-invariance to the network through an image pyramid. At test-time, the image pyramid is used to approximately scale-normalize each object proposal. During multi-scale training, we randomly sample a pyramid scale each time an image is sampled, following [11], as a form of data augmentation. We experiment with multi-scale training for smaller networks only, due to GPU memory limits.
相比之下,多尺度方法通过图像金字塔为网络提供近似的尺度不变性。在测试时,图像金字塔用于近似缩放每个对象建议的比例。在多尺度训练中,我们根据[11]每次对图像进行采样时都会随机抽取金字塔比例,作为数据增强的一种形式。由于GPU内存的限制,我们仅针对较小的网络进行了多尺度训练。
Once a Fast R-CNN network is fine-tuned, detection amounts to little more than running a forward pass (assuming object proposals are pre-computed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R object proposals to score. At test-time, R is typically around 2000, although we will consider cases in which it is larger (≈ 45k). When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to $224^2$ pixels in area [11].
一旦对Fast R-CNN网络进行了微调,检测基本上只需运行一次前向传递(假设对象建议已预先计算)。网络将一幅图像(或图像金字塔,编码为图像列表)和待评分的 $R$ 个对象建议列表作为输入。在测试时,$R$ 通常约为2000,尽管我们也会考虑更大的情况(≈45k)。使用图像金字塔时,每个RoI被分配到使缩放后的RoI面积最接近 $224^2$ 像素的尺度[11]。
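The scale-assignment rule for an image pyramid can be sketched as a one-line search. The function name and the list of pyramid scales are illustrative assumptions; a scale is taken to mean the length of the resized image's shortest side.

```python
def best_scale(roi_w, roi_h, image_short_side, scales, target_area=224 * 224):
    """Pick the pyramid scale at which the rescaled RoI area is closest to target_area."""
    def area_at(s):
        factor = s / image_short_side   # resize factor applied to the whole image
        return (roi_w * factor) * (roi_h * factor)
    return min(scales, key=lambda s: abs(area_at(s) - target_area))

scales = [480, 576, 688, 864, 1200]   # illustrative pyramid, an assumption
small_roi = best_scale(100, 100, image_short_side=500, scales=scales)
large_roi = best_scale(300, 300, image_short_side=600, scales=scales)
print(small_roi, large_roi)
```

Small proposals are sent to the largest pyramid level and large proposals to the smallest, so the network sees objects at roughly the size it was pre-trained on.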
For each test RoI $r$, the forward pass outputs a class posterior probability distribution $p$ and a set of predicted bounding-box offsets relative to $r$ (each of the $K$ classes gets its own refined bounding-box prediction). We assign a detection confidence to $r$ for each object class $k$ using the estimated probability $\Pr(\text{class} = k \mid r) \triangleq p_k$. We then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN [9].
对于每个测试RoI $r$,前向传递输出类后验概率分布 $p$ 和相对于 $r$ 的一组预测的边界框偏移量($K$ 个类中的每一个都具有自己的精确边界框预测)。我们使用估计概率 $\Pr(\text{class} = k \mid r) \triangleq p_k$ 为每个对象类别 $k$ 分配 $r$ 的检测置信度。然后,我们使用R-CNN [9]的算法和设置为每个类别独立执行非极大值抑制。
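Per-class non-maximum suppression is commonly implemented greedily, as in the sketch below. The function names are hypothetical, and the IoU threshold of 0.3 is an assumed value for illustration, since the text only defers to the settings of [9].

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.3):
    """Greedy NMS over (score, box) detections of a single class.

    Visits detections from highest to lowest score and keeps one only if it
    does not overlap an already-kept box by more than iou_threshold.
    """
    keep = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in keep):
            keep.append((score, box))
    return keep

detections = [
    (0.8, (1, 1, 11, 11)),     # overlaps the 0.9 box heavily and is suppressed
    (0.9, (0, 0, 10, 10)),
    (0.7, (20, 20, 30, 30)),   # far away, so it survives
]
kept = nms(detections)
kept_scores = [score for score, _ in kept]
print(kept_scores)
```

Running this once per class, on the class-specific refined boxes and their scores $p_k$, gives the final detections.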
For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. On the contrary, for detection the number of RoIs to process is large and nearly half of the forward pass time is spent computing the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD [5, 23].
对于全图像分类,与转换层相比,计算完全连接的层所花费的时间少。相反,对于检测到的RoI的数量很大,正向传播的近一半时间花费在计算全连接层上(见图2)。通过使用截断SVD压缩它们,可以轻松地加速大型的全连接层[5,23]。
In this technique, a layer parameterized by the u × v weight matrix W is approximately factorized as
在此技术中,将由 $u \times v$ 权重矩阵 $W$ 参数化的层近似分解为
$$W \approx U \Sigma_t V^T \tag{5}$$
using SVD. In this factorization, $U$ is a $u \times t$ matrix comprising the first $t$ left-singular vectors of $W$, $\Sigma_t$ is a $t \times t$ diagonal matrix containing the top $t$ singular values of $W$, and $V$ is a $v \times t$ matrix comprising the first $t$ right-singular vectors of $W$. Truncated SVD reduces the parameter count from $uv$ to $t(u + v)$, which can be significant if $t$ is much smaller than $\min(u, v)$. To compress a network, the single fully connected layer corresponding to $W$ is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix $\Sigma_t V^T$ (and no biases) and the second uses $U$ (with the original biases associated with $W$). This simple compression method gives good speedups when the number of RoIs is large.
使用SVD。在此分解中,$U$ 是由 $W$ 的前 $t$ 个左奇异向量组成的 $u \times t$ 矩阵,$\Sigma_t$ 是包含 $W$ 的前 $t$ 个奇异值的 $t \times t$ 对角矩阵,$V$ 是由 $W$ 的前 $t$ 个右奇异向量组成的 $v \times t$ 矩阵。截断SVD可将参数数量从 $uv$ 减少到 $t(u + v)$,如果 $t$ 远小于 $\min(u, v)$,这一减少将十分显著。为了压缩网络,将与 $W$ 对应的单个全连接层替换为两个全连接层,它们之间没有非线性。其中第一层使用权重矩阵 $\Sigma_t V^T$(无偏置),第二层使用 $U$(带有与 $W$ 相关联的原始偏置)。当RoI数量很大时,这种简单的压缩方法可以提供良好的加速效果。
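The two-layer replacement can be checked numerically with NumPy. The sizes below are toy values chosen for illustration, and the weight matrix is built to have rank exactly t so that the truncated factorization reproduces the original layer. A trained layer would only be approximated.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v, t = 64, 256, 8   # toy sizes (assumption): output dim, input dim, kept rank

# A rank-t weight matrix, so truncation to t singular values loses nothing.
W = rng.standard_normal((u, t)) @ rng.standard_normal((t, v))
b = rng.standard_normal(u)

U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_t, S_t, Vt_t = U[:, :t], S[:t], Vt[:t, :]

W1 = S_t[:, None] * Vt_t   # first layer: Sigma_t V^T, shape (t, v), no bias
W2 = U_t                   # second layer: U, shape (u, t), keeps the original bias

x = rng.standard_normal(v)
y_full = W @ x + b
y_compressed = W2 @ (W1 @ x) + b   # no non-linearity between the two layers

params_before = u * v
params_after = t * (u + v)
print(params_before, params_after, np.allclose(y_full, y_compressed))
```

Here the parameter count drops from uv = 16384 to t(u + v) = 2560. The same arithmetic applied to a 25088 × 4096 layer with t = 1024 gives roughly 29.9M parameters instead of 102.8M.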
Three main results support this paper’s contributions:
以下三个主要结果支持了本文的工作:
Our experiments use three pre-trained ImageNet models that are available online. The first is the CaffeNet (essentially AlexNet [14]) from R-CNN [9]. We alternatively refer to this CaffeNet as model S, for “small.” The second network is VGG_CNN_M_1024 from [3], which has the same depth as S, but is wider. We call this network model M, for “medium.” The final network is the very deep VGG16 model from [20]. Since this model is the largest, we call it model L. In this section, all experiments use single-scale training and testing (s = 600; see Section 5.2 for details).
我们的实验使用了三个可在线使用的预先训练的ImageNet模型。第一个是R-CNN [9]的CaffeNet(本质上是AlexNet [14])。我们也可以将此CaffeNet称为模型S,以表示“小型”。第二个网络是[3]中的VGG CNN M 1024,其深度与S相同,但宽度更大。我们称此网络模型M为“中”。最终的网络是来自[20]的非常深的VGG16模型。由于此模型是最大的模型,因此我们将其称为L模型。在本节中,所有实验均使用单尺度训练和测试(s = 600;有关详细信息,请参见5.2节)。
On these datasets, we compare Fast R-CNN (FRCN, for short) against the top methods on the comp4 (outside data) track from the public leaderboard (Table 2, Table 3). For the NUS_NIN_c2000 and BabyLearning methods, there are no associated publications at this time and we could not find exact information on the ConvNet architectures used; they are variants of the Network-in-Network design [17]. All other methods are initialized from the same pre-trained VGG16 network.
在这些数据集上,我们将Fast R-CNN(简称FRCN)与公共排行榜上comp4(外部数据)轨道上的顶级方法(表2,表3)进行了比较.3对于NUS NIN c2000和BabyLearning方法,目前没有相关出版物,我们无法找到有关所使用的ConvNet体系结构的确切信息;它们是网络中网络设计的变体[17]。所有其他方法均从相同的预训练VGG16网络初始化。
Fast R-CNN achieves the top result on VOC12 with a mAP of 65.7% (and 68.4% with extra data). It is also two orders of magnitude faster than the other methods, which are all based on the “slow” R-CNN pipeline. On VOC10, SegDeepM [25] achieves a higher mAP than Fast R-CNN (67.2% vs. 66.1%). SegDeepM is trained on VOC12 trainval plus segmentation annotations; it is designed to boost R-CNN accuracy by using a Markov random field to reason over R-CNN detections and segmentations from the O2P [1] semantic-segmentation method. Fast R-CNN can be swapped into SegDeepM in place of R-CNN, which may lead to better results. When using the enlarged 07++12 training set (see Table 2 caption), Fast R-CNN’s mAP increases to 68.8%, surpassing SegDeepM.
快速R-CNN以65.7%的mAP(在有额外数据的情况下为68.4%)在VOC12上获得最高的结果。它也比其他所有基于“慢速” R-CNN管道的方法快两个数量级。在VOC10上,SegDeepM [25]比快速R-CNN获得了更高的mAP(67.2%对66.1%)。 SegDeepM接受了VOC12 Trainval加上分段注释的培训;通过使用Markov随机字段来推理O-P [1]语义分段方法中的R-CNN检测和分段,可以提高R-CNN的准确性。快速R-CNN可以代替R-CNN交换到SegDeepM中,这可能会导致更好的结果。使用扩大的07 ++ 12训练集(请参见表2标题)时,Fast R-CNN的mAP增加到68.8%,超过了SegDeepM。
On VOC07, we compare Fast R-CNN to R-CNN and SPPnet. All methods start from the same pre-trained VGG16 network and use bounding-box regression. The VGG16 SPPnet results were computed by the authors of [11]. SPPnet uses five scales during both training and testing. The improvement of Fast R-CNN over SPPnet illustrates that even though Fast R-CNN uses single-scale training and testing, fine-tuning the conv layers provides a large improvement in mAP (from 63.1% to 66.9%). R-CNN achieves a mAP of 66.0%. As a minor point, SPPnet was trained without examples marked as “difficult” in PASCAL. Removing these examples improves Fast R-CNN mAP to 68.1%. All other experiments use “difficult” examples.
在VOC07上,我们将Fast R-CNN与R-CNN和SPPnet进行了比较。所有方法均从相同的预训练VGG16网络开始,并使用包围盒回归。 VGG16 SPPnet结果由[11]的作者计算。在培训和测试期间,SPPnet使用五个等级。与SPPnet相比,Fast R-CNN的改进表明,即使Fast R-CNN使用单尺度训练和测试,对卷积层进行微调也可以在mAP上实现较大的改进(从63.1%到66.9%)。 R-CNN的mAP达到66.0%。较小的一点是,对SPPnet进行了训练,没有在PASCAL中标记为“困难”的示例。删除这些示例会将Fast R-CNN mAP提升到68.1%。所有其他实验都使用“困难”示例。
Fast training and testing times are our second main result. Table 4 compares training time (hours), testing rate (seconds per image), and mAP on VOC07 between Fast R-CNN, R-CNN, and SPPnet. For VGG16, Fast R-CNN processes images 146× faster than R-CNN without truncated SVD and 213× faster with it. Training time is reduced by 9×, from 84 hours to 9.5. Compared to SPPnet, Fast R-CNN trains VGG16 2.7× faster (in 9.5 vs. 25.5 hours) and tests 7× faster without truncated SVD or 10× faster with it. Fast R-CNN also eliminates hundreds of gigabytes of disk storage, because it does not cache features.
快速的训练和测试时间是我们的第二个主要结果。表4比较了快速RCNN,R-CNN和SPPnet在VOC07上的训练时间(小时),测试速率(每幅图像的秒数)和mAP。对于VGG16,快速R-CNN处理图像的速度比没有截断SVD的R-CNN快146倍,使用SVD则快213倍。训练时间从84小时减少到9.5倍,减少了9倍。与SPPnet相比,Fast RCNN将VGG16的训练速度提高了2.7倍(在9.5与25.5小时之间),测试速度提高了7倍,而截断SVD的速度却提高了10倍。快速R-CNN还消除了数百GB的磁盘存储空间,因为它没有缓存功能。
Truncated SVD. Truncated SVD can reduce detection time by more than 30% with only a small (0.3 percentage point) drop in mAP and without needing to perform additional fine-tuning after model compression. Fig. 2 illustrates how using the top 1024 singular values from the 25088×4096 matrix in VGG16's fc6 layer and the top 256 singular values from the 4096×4096 fc7 layer reduces runtime with little loss in mAP. Further speed-ups are possible with smaller drops in mAP if one fine-tunes again after compression.
截断的SVD。截短的SVD可以将检测时间减少30%以上,而mAP仅下降很小(0.3个百分点),并且在模型压缩后无需执行其他微调。图2说明了如何在VGG16的fc6层中使用25088×4096矩阵中的前1024个奇异值和在4096×4096 fc7层中使用前256个奇异值,以减少mAP的运行时间。如果压缩后再次进行微调,则mAP下降较小的情况下可能会进一步加快速度。
For the less deep networks considered in the SPPnet paper [11], fine-tuning only the fully connected layers appeared to be sufficient for good accuracy. We hypothesized that this result would not hold for very deep networks. To validate that fine-tuning the conv layers is important for VGG16, we use Fast R-CNN to fine-tune, but freeze the thirteen conv layers so that only the fully connected layers learn. This ablation emulates single-scale SPPnet training and decreases mAP from 66.9% to 61.4% (Table 5). This experiment verifies our hypothesis: training through the RoI pooling layer is important for very deep nets.
对于SPPnet论文[11]中考虑的深度较浅的网络,仅微调全连接层似乎足以实现良好的精度。我们假设此结果不适用于非常深的网络。为了验证微调conv层对VGG16的重要性,我们使用Fast R-CNN进行微调,但冻结了13个conv层,以便仅全连接层学习。这种消融模拟了单规模SPPnet训练,并将mAP从66.9%降低到61.4%(表5)。该实验验证了假设:通过RoI池化层进行的训练对于非常深的网络很重要。
Does this mean that all conv layers should be fine-tuned? In short, no. In the smaller networks (S and M) we find that conv1 is generic and task independent (a well-known fact [14]). Allowing conv1 to learn, or not, has no meaningful effect on mAP. For VGG16, we found it only necessary to update layers from conv3_1 and up (9 of the 13 conv layers). This observation is pragmatic: (1) updating from conv2_1 slows training by 1.3× (12.5 vs. 9.5 hours) compared to learning from conv3_1; and (2) updating from conv1_1 over-runs GPU memory. The difference in mAP when learning from conv2_1 up was only +0.3 points (Table 5, last column). All Fast R-CNN results in this paper using VGG16 fine-tune layers conv3_1 and up; all experiments with models S and M fine-tune layers conv2 and up.
这是否意味着所有卷积层都应进行微调?简而言之,没有。在较小的网络(S和M)中,我们发现conv1是通用的且与任务无关(众所周知的事实[14])。允许conv1学习或不影响mAP。对于VGG16,我们发现只需要更新conv3_1和更高的层(13个conv层中的9个)。这种观察是务实的:(1)与从conv3_1学习相比,从conv2_1更新会使训练速度降低1.3倍(12.5与9.5小时)。(2)从conv1_1更新会超出GPU内存。从conv2_1开始学习时,mAP的差异仅为+0.3分(表5,最后一栏)。本文所有的Fast R-CNN结果均使用VGG16微调层conv3_1和更高;使用模型S和M进行的所有实验均会微调conv2层及以上的层。
We conducted experiments to understand how Fast R-CNN compares to R-CNN and SPPnet, as well as to evaluate design decisions. Following best practices, we performed these experiments on the PASCAL VOC07 dataset.
我们进行了实验,以了解Fast RCNN与R-CNN和SPPnet的比较,以及评估设计决策。按照最佳做法,我们在PASCAL VOC07数据集上进行了这些实验。
Multi-task training is convenient because it avoids managing a pipeline of sequentially-trained tasks. But it also has the potential to improve results because the tasks influence each other through a shared representation (the ConvNet) [2]. Does multi-task training improve object detection accuracy in Fast R-CNN?
To test this question, we train baseline networks that use only the classification loss, $L_{cls}$, in Eq. 1 (i.e., setting λ = 0). These baselines are printed for models S, M, and L in the first column of each group in Table 6. Note that these models do not have bounding-box regressors. Next (second column per group), we take networks that were trained with the multi-task loss (Eq. 1, λ = 1), but we disable bounding-box regression at test time. This isolates the networks’ classification accuracy and allows an apples-to-apples comparison with the baseline networks.
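To make the role of λ concrete, here is a minimal sketch of a multi-task loss for a single RoI. It assumes Eq. 1 has the form L = L_cls + λ[u ≥ 1] L_loc, with a log loss for classification and a smooth L1 loss summed over the four box offsets, and it assumes class index 0 denotes background. The function names are illustrative only.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(class_probs, true_class, pred_offsets, target_offsets, lam=1.0):
    """Loss for one RoI (sketch).

    class_probs    : softmax probabilities over K+1 classes (index 0 = background, assumed)
    true_class     : ground-truth class index u
    pred_offsets   : predicted box offsets for class u (4 numbers)
    target_offsets : regression targets v (4 numbers)
    lam            : weight on the localization term; lam = 0 gives the
                     classification-only baseline described above
    """
    l_cls = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so L_loc is skipped.
        return l_cls
    l_loc = sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))
    return l_cls + lam * l_loc


probs = [0.25, 0.75]  # hypothetical two-way softmax output
print(multitask_loss(probs, 1, [2.0, 0.5, 0.0, 0.0], [0.0] * 4, lam=0.0))
print(multitask_loss(probs, 1, [2.0, 0.5, 0.0, 0.0], [0.0] * 4, lam=1.0))
```

With `lam=0.0` the box offsets have no effect on the loss, which is why the baseline networks in this ablation carry no usable bounding-box regressors.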
Across all three networks we observe that multi-task training improves pure classification accuracy relative to training for classification alone. The improvement ranges from +0.8 to +1.1 mAP points, showing a consistent positive effect from multi-task learning.
Finally, we take the baseline models (trained with only the classification loss), tack on the bounding-box regression layer, and train them with $L_{loc}$ while keeping all other network parameters frozen. The third column in each group shows the results of this stage-wise training scheme: mAP improves over column one, but stage-wise training underperforms multi-task training (fourth column per group).
We compare two strategies for achieving scale-invariant object detection: brute-force learning (single scale) and image pyramids (multi-scale). In either case, we define the scale s of an image to be the length of its shortest side.
All single-scale experiments use s = 600 pixels; s may be less than 600 for some images as we cap the longest image side at 1000 pixels and maintain the image’s aspect ratio. These values were selected so that VGG16 fits in GPU memory during fine-tuning. The smaller models are not memory bound and can benefit from larger values of s; however, optimizing s for each model is not our main concern. We note that PASCAL images are 384 × 473 pixels on average and thus the single-scale setting typically upsamples images by a factor of 1.6. The average effective stride at the RoI pooling layer is thus ≈ 10 pixels.
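The single-scale arithmetic above can be checked with a short sketch. `rescale_factor` is a hypothetical helper, and the feature stride of 16 is an assumption based on VGG16 having four 2×2 pooling layers ahead of the RoI pooling layer.

```python
def rescale_factor(height, width, target_short=600, max_long=1000):
    """Scale factor that brings the shortest side to `target_short` pixels,
    reduced if necessary so the longest side stays within `max_long`.
    The aspect ratio is preserved because one factor is used for both sides."""
    short_side = min(height, width)
    long_side = max(height, width)
    factor = target_short / short_side
    if factor * long_side > max_long:
        factor = max_long / long_side
    return factor


# Average PASCAL image size quoted in the text.
factor = rescale_factor(384, 473)
print(f"upsampling factor: {factor:.4f}")

# Assumed conv feature stride of 16, expressed in original-image pixels.
print(f"effective stride: {16 / factor:.2f} pixels")

# An elongated image where the 1000-pixel cap takes over, so s ends up below 600.
capped = rescale_factor(500, 2000)
print(f"capped factor: {capped}, resulting s = {500 * capped:.0f}")
```

For the average image, 600 / 384 is about 1.6 and 16 / 1.6 is about 10, which is where the quoted upsampling factor and effective stride come from.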
In the multi-scale setting, we use the same five scales specified in [11] (s ∈ {480,576,688,864,1200}) to facilitate comparison with SPPnet. However, we cap the longest side at 2000 pixels to avoid exceeding GPU memory.
Table 7 shows models S and M when trained and tested with either one or five scales. Perhaps the most surprising result in [11] was that single-scale detection performs almost as well as multi-scale detection. Our findings confirm their result: deep ConvNets are adept at directly learning scale invariance. The multi-scale approach offers only a small increase in mAP at a large cost in compute time (Table 7). In the case of VGG16 (model L), we are limited to using a single scale by implementation details. Yet it achieves a mAP of 66.9%, which is slightly higher than the 66.0% reported for R-CNN [10], even though R-CNN uses “infinite” scales in the sense that each proposal is warped to a canonical size.
Since single-scale processing offers the best tradeoff between speed and accuracy, especially for very deep models, all experiments outside of this sub-section use single-scale training and testing with s = 600 pixels.
A good object detector should improve when supplied with more training data. Zhu et al. [24] found that DPM [8] mAP saturates after only a few hundred to thousand training examples. Here we augment the VOC07 trainval set with the VOC12 trainval set, roughly tripling the number of images to 16.5k, to evaluate Fast R-CNN. Enlarging the training set improves mAP on VOC07 test from 66.9% to 70.0% (Table 1). When training on this dataset we use 60k mini-batch iterations instead of 40k.
We perform similar experiments for VOC10 and 2012, for which we construct a dataset of 21.5k images from the union of VOC07 trainval, test, and VOC12 trainval. When training on this dataset, we use 100k SGD iterations and lower the learning rate by 0.1× each 40k iterations (instead of each 30k). For VOC10 and 2012, mAP improves from 66.1% to 68.8% and from 65.7% to 68.4%, respectively.
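The step schedule described above can be sketched as a plain function. The base learning rate of 0.001 is an assumption made for this example (this section does not state it), and `step_learning_rate` is an illustrative name rather than a solver API.

```python
def step_learning_rate(iteration, base_lr=0.001, gamma=0.1, step_size=40_000):
    """Learning rate under a step policy: multiply by `gamma` once every
    `step_size` iterations. `base_lr` is an assumed starting value."""
    return base_lr * gamma ** (iteration // step_size)


# 100k-iteration run on the enlarged dataset, with a drop every 40k iterations.
for it in (0, 39_999, 40_000, 80_000, 99_999):
    print(it, step_learning_rate(it))

# The shorter schedule mentioned in the text drops every 30k iterations instead.
print(step_learning_rate(30_000, step_size=30_000))
```

Over 100k iterations this gives two drops (at 40k and 80k), so the final 20k iterations run at one hundredth of the starting rate.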
Fast R-CNN uses the softmax classifier learnt during fine-tuning instead of training one-vs-rest linear SVMs post-hoc, as was done in R-CNN and SPPnet. To understand the impact of this choice, we implemented post-hoc SVM training with hard negative mining in Fast R-CNN. We use the same training algorithm and hyper-parameters as in R-CNN.
Table 8 shows softmax slightly outperforming SVM for all three networks, by +0.1 to +0.8 mAP points. This effect is small, but it demonstrates that “one-shot” fine-tuning is sufficient compared to previous multi-stage training approaches. We note that softmax, unlike one-vs-rest SVMs, introduces competition between classes when scoring a RoI.
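The remark about competition between classes can be illustrated with a toy example. The raw scores below are hypothetical, and the sketch only contrasts the two scoring rules; it does not reproduce the SVM training used in the experiment.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for one RoI over three classes.
before = [1.0, 3.0, 2.5]
after = [1.0, 3.0, 4.0]  # only the third class's raw score changes

# One-vs-rest style scoring: each class keeps its own independent score,
# so class 1 is scored 3.0 in both cases.
print("independent score for class 1:", before[1], "->", after[1])

# Softmax scoring: the same change lowers class 1's probability,
# because all classes share one normalization.
print("softmax prob for class 1:", softmax(before)[1], "->", softmax(after)[1])
```

Under softmax, a stronger response from one class necessarily takes probability mass from the others, whereas one-vs-rest classifiers score each class in isolation.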
There are (broadly) two types of object detectors: those that use a sparse set of object proposals (e.g., selective search [21]) and those that use a dense set (e.g., DPM [8]). Classifying sparse proposals is a type of cascade [22] in which the proposal mechanism first rejects a vast number of candidates leaving the classifier with a small set to evaluate. This cascade improves detection accuracy when applied to DPM detections [21]. We find evidence that the proposal-classifier cascade also improves Fast R-CNN accuracy.
Using selective search’s quality mode, we sweep from 1k to 10k proposals per image, each time re-training and re-testing model M. If proposals serve a purely computational role, increasing the number of proposals per image should not harm mAP.
We find that mAP rises and then falls slightly as the proposal count increases (Fig. 3). This result is difficult to predict without actually running the experiment. The state-of-the-art for measuring object proposal quality is Average Recall (AR) [12]. AR correlates well with mAP for several proposal methods using R-CNN, when using a fixed number of proposals per image. Fig. 3 shows that AR (solid red line) does not correlate well with mAP as the number of proposals per image is varied. AR must be used with care; higher AR due to more proposals does not imply that mAP will increase. Fortunately, training and testing with model M takes less than 2.5 hours. Fast R-CNN thus enables efficient, direct evaluation of object proposal mAP, which is preferable to proxy metrics.
We also investigate Fast R-CNN when using densely generated boxes (over scale, position, and aspect ratio), at a rate of about 45k boxes / image. This dense set is rich enough that when each selective search box is replaced by its closest (in IoU) dense box, mAP drops only 1 point (to 57.7%, Fig. 3, blue triangle).
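The box-replacement step above can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in continuous coordinates, and `snap_to_dense` is a hypothetical helper; how the dense boxes were actually generated and matched is not specified in this section.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def snap_to_dense(proposals, dense_boxes):
    """Replace each proposal with the dense box that overlaps it most (by IoU)."""
    return [max(dense_boxes, key=lambda d: iou(p, d)) for p in proposals]


# Toy data: two sparse proposals and a tiny "dense" set.
proposals = [(1, 1, 9, 9), (22, 18, 31, 29)]
dense = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
print(snap_to_dense(proposals, dense))
```

This brute-force search costs one IoU evaluation per (proposal, dense box) pair, which is acceptable for a sketch but would likely be vectorized for 45k boxes per image.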
The statistics of the dense boxes differ from those of selective search boxes. Starting with 2k selective search boxes, we test mAP when adding a random sample of 1000 × {2,4,6,8,10,32,45} dense boxes. For each experiment we re-train and re-test model M. When these dense boxes are added, mAP falls more strongly than when adding more selective search boxes, eventually reaching 53.0%.
We also train and test Fast R-CNN using only dense boxes (45k / image). This setting yields a mAP of 52.9% (blue diamond). Finally, we check if SVMs with hard negative mining are needed to cope with the dense box distribution. SVMs do even worse: 49.3% (blue circle).
We applied Fast R-CNN (with VGG16) to the MS COCO dataset [18] to establish a preliminary baseline. We trained on the 80k image training set for 240k iterations and evaluated on the “test-dev” set using the evaluation server. The PASCAL-style mAP is 35.9%; the new COCO-style AP, which also averages over IoU thresholds, is 19.7%.
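The difference between the two metrics can be sketched numerically. The example assumes the COCO convention of ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, and the per-threshold AP values are made up purely for illustration; they are not the paper's results. A real evaluation also averages over categories, which this sketch omits.

```python
def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 (assumed COCO convention)."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def coco_style_ap(ap_at_threshold):
    """Average AP over all IoU thresholds.
    `ap_at_threshold` maps each threshold to the AP measured at it."""
    thresholds = coco_iou_thresholds()
    return sum(ap_at_threshold[t] for t in thresholds) / len(thresholds)


def pascal_style_ap(ap_at_threshold):
    """PASCAL-style AP only counts matches at IoU >= 0.5."""
    return ap_at_threshold[0.5]


# Hypothetical AP curve that decays as the IoU requirement tightens.
fake_ap = {t: max(0.0, 0.40 - 0.8 * (t - 0.5)) for t in coco_iou_thresholds()}
print("PASCAL-style:", pascal_style_ap(fake_ap))
print("COCO-style:  ", coco_style_ap(fake_ap))
```

Because AP generally falls as the required overlap rises, the averaged COCO-style number is expected to be lower than the PASCAL-style one, as with the 19.7% vs. 35.9% reported above.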
This paper proposes Fast R-CNN, a clean and fast update to R-CNN and SPPnet. In addition to reporting state-of-the-art detection results, we present detailed experiments that we hope provide new insights. Of particular note, sparse object proposals appear to improve detector quality. This issue was too costly (in time) to probe in the past, but becomes practical with Fast R-CNN. Of course, there may exist yet undiscovered techniques that allow dense boxes to perform as well as sparse proposals. Such methods, if developed, may help further accelerate object detection.
Acknowledgements. I thank Kaiming He, Larry Zitnick, and Piotr Dollár for helpful discussions and encouragement.