We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN [6, 18] that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets) [9], for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20× faster than the Faster R-CNN counterpart.
A prevalent family [8, 6, 18] of deep networks for object detection can be divided into two subnetworks by the Region-of-Interest (RoI) pooling layer [6]: (i) a shared, “fully convolutional” subnetwork independent of RoIs, and (ii) an RoI-wise subnetwork that does not share computation. This decomposition [8] historically resulted from the pioneering classification architectures, such as AlexNet [10] and VGG Nets [23], which consist of two subnetworks by design: a convolutional subnetwork ending with a spatial pooling layer, followed by several fully-connected (fc) layers. Thus the (last) spatial pooling layer in image classification networks is naturally turned into the RoI pooling layer in object detection networks [8, 6, 18].
But recent state-of-the-art image classification networks such as Residual Nets (ResNets) [9] and GoogLeNets [24, 26] are by design fully convolutional. By analogy, it appears natural to use all convolutional layers to construct the shared, convolutional subnetwork in the object detection architecture, leaving the RoI-wise subnetwork with no hidden layer. However, as empirically investigated in this work, this naïve solution turns out to have considerably inferior detection accuracy that does not match the network’s superior classification accuracy. To remedy this issue, in the ResNet paper [9] the RoI pooling layer of the Faster R-CNN detector [18] is unnaturally inserted between two sets of convolutional layers. This creates a deeper RoI-wise subnetwork that improves accuracy, at the cost of lower speed due to the unshared per-RoI computation.
We argue that the aforementioned unnatural design is caused by a dilemma of increasing translation invariance for image classification vs. respecting translation variance for object detection. On one hand, the image-level classification task favors translation invariance: a shift of an object inside an image should be indiscriminative. Thus, deep (fully) convolutional architectures that are as translation-invariant as possible are preferable, as evidenced by the leading results on ImageNet classification [9, 24, 26]. On the other hand, the object detection task needs localization representations that are translation-variant to an extent. For example, translation of an object inside a candidate box should produce meaningful responses for describing how well the candidate box overlaps the object. We hypothesize that deeper convolutional layers in an image classification network are less sensitive to translation. To address this dilemma, the ResNet paper’s detection pipeline [9] inserts the RoI pooling layer between convolutional layers: this region-specific operation breaks down translation invariance, and the post-RoI convolutional layers are no longer translation-invariant when evaluated across different regions. However, this design sacrifices training and testing efficiency since it introduces a considerable number of region-wise layers (Table 1).
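The tension between invariance and variance can be made concrete with a toy 1-D sketch. Everything here is an illustrative assumption rather than part of any published pipeline: the function names, the idealized 0/1 activations, and the two-bin box response are all hypothetical. A globally pooled response stays constant as the object shifts, whereas a response tied to positions inside a fixed candidate box drops as the object moves away from the box.

```python
def object_maps(length, start, width):
    """Two idealized 1-D activation maps for an object covering [start, start + width).

    Assumption for illustration: one map fires on the object's left half,
    the other on its right half.
    """
    half = width // 2
    left = [1.0 if start <= x < start + half else 0.0 for x in range(length)]
    right = [1.0 if start + half <= x < start + width else 0.0 for x in range(length)]
    return left, right


def global_response(left, right):
    """Translation-invariant: global max pooling ignores where the object sits."""
    return max(max(left), max(right))


def box_response(left, right, box_start, box_width):
    """Translation-variant: the box's left bin reads only the 'left' map,
    its right bin reads only the 'right' map, and the two bins are averaged."""
    half = box_width // 2
    left_bin = left[box_start:box_start + half]
    right_bin = right[box_start + half:box_start + box_width]
    return 0.5 * (sum(left_bin) / half + sum(right_bin) / half)


# Fixed candidate box over cells 2..5; the object starts aligned with it, then shifts.
for shift in range(3):
    left, right = object_maps(length=10, start=2 + shift, width=4)
    print(shift, global_response(left, right), box_response(left, right, 2, 4))
# prints:
# 0 1.0 1.0
# 1 1.0 0.5
# 2 1.0 0.0
```

The global response cannot tell the three placements apart, while the box response falls as the overlap between box and object shrinks, which is the kind of signal a detector needs.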
In this paper, we develop a framework called Region-based Fully Convolutional Network (R-FCN) for object detection. Our network consists of shared, fully convolutional architectures, as is the case with FCN [15]. To incorporate translation variance into FCN, we construct a set of position-sensitive score maps by using a bank of specialized convolutional layers as the FCN output. Each of these score maps encodes the position information with respect to a relative spatial position (e.g., “to the left of an object”). On top of this FCN, we append a position-sensitive RoI pooling layer that shepherds information from these score maps, with no weight (convolutional/fc) layers following. The entire architecture is learned end-to-end. All learnable layers are convolutional and shared on the entire image, yet encode spatial information required for object detection. Figure 1 illustrates the key idea and Table 1 compares the methodologies among region-based detectors.
Figure 1: Key idea of R-FCN for object detection. In this illustration, there are k × k = 3 × 3 position-sensitive score maps generated by a fully convolutional network. For each of the k × k bins in an RoI, pooling is only performed on one of the k² maps (marked by different colors).
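The pooling rule in Figure 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ps_roi_pool`, the map ordering (map index `i * k + j` belongs to bin row `i`, column `j`), the use of average pooling within a bin, and integer end-exclusive RoI coordinates are all choices made here for the example, and only a single class is handled.

```python
def ps_roi_pool(score_maps, roi, k):
    """Position-sensitive RoI pooling for one class (illustrative sketch).

    score_maps: list of k*k maps, each a list of rows (H x W).
                Assumed ordering: map i*k + j is dedicated to bin (row i, col j).
    roi:        (x0, y0, x1, y1), integer, end-exclusive, inside the maps,
                with x1 > x0 and y1 > y0.
    Returns a k x k grid; bin (i, j) is pooled from map i*k + j only.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(k):
        # Row span of bin i: floor for the start, ceiling for the end.
        ys = y0 + (i * h) // k
        ye = y0 + ((i + 1) * h + k - 1) // k
        row = []
        for j in range(k):
            xs = x0 + (j * w) // k
            xe = x0 + ((j + 1) * w + k - 1) // k
            fmap = score_maps[i * k + j]  # the single map read for this bin
            vals = [fmap[y][x] for y in range(ys, ye) for x in range(xs, xe)]
            row.append(sum(vals) / len(vals))  # average pooling (an assumption)
        pooled.append(row)
    return pooled


# Toy input: map m is constant with value m, so each bin reveals which map it read.
k, size = 3, 6
score_maps = [[[float(m)] * size for _ in range(size)] for m in range(k * k)]
print(ps_roi_pool(score_maps, roi=(0, 0, 6, 6), k=k))
# prints [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
```

Because the toy maps are constant, the output shows directly that the top-left bin reads map 0, the top-middle bin reads map 1, and so on. The layer has no learnable weights of its own, so the per-RoI work is only indexing and averaging.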
Using the 101-layer Residual Net (ResNet-101) [9] as the backbone, our R-FCN yields competitive results of 83.6% mAP on the PASCAL VOC 2007 set and 82.0% on the 2012 set. Meanwhile, our results are achieved at a test-time speed of 170ms per image using ResNet-101, which is 2.5× to 20× faster than the Faster R-CNN + ResNet-101 counterpart in [9]. These experiments demonstrate that our method manages to address the dilemma between invariance/variance on translation, and fully convolutional image-level classifiers such as ResNets can be effectively converted to fully convolutional object detectors.
Overview. Following R-CNN [7], we adopt the popular two-stage object detection strategy [7, 8, 6, 18, 1, 22] that consists of: (i) region proposal, and (ii) region classification. Although methods that do not rely on region proposal do exist (e.g., [17, 14]), region-based systems still possess leading accuracy on several benchmarks [5, 13, 20]. We extract candidate regions by the Region Proposal Network (RPN) [18], which is a fully convolutional architecture in itself. Following [18], we share the features between RPN and R-FCN. Figure 2 shows an overview of the system.
Figure 2: Overall architecture of R-FCN. A Region Proposal Network (RPN) [18] proposes candidate RoIs, which are then applied on the score maps. All learnable weight layers are convolutional and are computed on the entire image; the per-RoI computational cost is negligible.
R-CNN [7] has demonstrated the effectiveness of using region proposals [27, 28] with deep networks. R-CNN evaluates convolutional networks on cropped and warped regions, and computation is not shared among regions (Table 1). SPPnet [8], Fast R-CNN [6], and Faster R-CNN [18] are “semi-convolutional”, in which a convolutional subnetwork performs shared computation on the entire image and another subnetwork evaluates individual regions.
There have been object detectors that can be thought of as “fully convolutional” models. OverFeat [21] detects objects by sliding multi-scale windows on the shared convolutional feature maps; similarly, in Fast R-CNN [6] and [12], sliding windows that replace region proposals are investigated. In these cases, one can recast a sliding window of a single scale as a single convolutional layer. The RPN component in Faster R-CNN [18] is a fully convolutional detector that predicts bounding boxes with respect to reference boxes (anchors) of multiple sizes. The original RPN is class-agnostic in [18], but its class-specific counterpart is applicable (see also [14]) as we evaluate in the following.
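The remark that a single-scale sliding window can be recast as one convolutional layer can be checked on a small example. The sketch below assumes a single-channel feature map and a linear window classifier; both function names are hypothetical, and "convolution" is used in the CNN sense (cross-correlation, no padding, stride 1). With multiple channels the same argument holds with an extra sum over channels.

```python
def sliding_window_scores(feat, weights, bias):
    """Crop every window, flatten it, and apply a linear classifier to the crop."""
    kh, kw = len(weights), len(weights[0])
    flat_w = [v for row in weights for v in row]
    out = []
    for y in range(len(feat) - kh + 1):
        row = []
        for x in range(len(feat[0]) - kw + 1):
            crop = [feat[y + dy][x + dx] for dy in range(kh) for dx in range(kw)]
            row.append(sum(c * v for c, v in zip(crop, flat_w)) + bias)
        out.append(row)
    return out


def conv2d_valid(feat, kernel, bias):
    """One conv layer (stride 1, no padding) whose kernel is the classifier's weights."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(len(feat) - kh + 1):
        row = []
        for x in range(len(feat[0]) - kw + 1):
            acc = bias
            for dy in range(kh):
                for dx in range(kw):
                    acc += feat[y + dy][x + dx] * kernel[dy][dx]
            row.append(acc)
        out.append(row)
    return out


feat = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
weights = [[1, 0], [0, -1]]  # 2 x 2 window classifier, reshaped as a kernel
assert sliding_window_scores(feat, weights, 1) == conv2d_valid(feat, weights, 1)
print(conv2d_valid(feat, weights, 1))
# prints [[-4, -4, -4], [-4, -4, -4], [-4, -4, -4]]
```

The two functions compute the same score at every window position, so evaluating the window classifier densely costs one convolution over the shared feature map rather than one forward pass per window.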
Another family of object detectors resorts to fully-connected (fc) layers for generating holistic object detection results on an entire image, such as [25, 4, 17].
We presented Region-based Fully Convolutional Networks, a simple but accurate and efficient framework for object detection. Our system naturally adopts the state-of-the-art image classification backbones, such as ResNets, that are by design fully convolutional. Our method achieves accuracy competitive with the Faster R-CNN counterpart, but is much faster during both training and inference.
We intentionally keep the R-FCN system presented in the paper simple. There have been a series of orthogonal extensions of FCNs that were developed for semantic segmentation (e.g., see [2]), as well as extensions of region-based methods for object detection (e.g., see [9, 1, 22]). We expect our system will easily enjoy the benefits of the progress in the field.
Related references