Original: https://arxiv.org/abs/1506.01497
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
Index Terms—Object Detection, Region Proposal, Convolutional Neural Network.
1 INTRODUCTION
Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.
Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.
One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.
In this paper, we show that an algorithmic change—computing proposals with a deep convolutional neural network—leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10ms per image).
Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task of generating detection proposals.
RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel “anchor” boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.
To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.
We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11], where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of Selective Search at test-time—the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [3], our detection method still has a frame rate of 5fps (including all steps) on a GPU and thus is a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).
A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built into commercial systems such as at Pinterest [17], with user engagement improvements reported.
In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN are also used by several other leading entries in these competitions. These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.
2 RELATED WORK
Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).
Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned into a convolutional layer for detecting multiple class-specific objects. The MultiBox methods [26], [27] generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the “single-box” fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN [5]. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., 224×224), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in context with our method. Concurrent with our work, the DeepMask method [28] is developed for learning segmentation proposals.
Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [9] computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [1] on shared convolutional feature maps is developed for efficient region-based object detection [1], [30] and semantic segmentation [29]. Fast R-CNN [2] enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.
3 FASTER R-CNN
Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with ‘attention’ [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.
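To make the two-module structure concrete, the following is a minimal sketch of the test-time data flow (Python; backbone, rpn, and detector are hypothetical stand-ins for the actual components, not the released API):

# Sketch of the Faster R-CNN forward pass; helper names are hypothetical.
def faster_rcnn_forward(image, backbone, rpn, detector):
    features = backbone(image)                   # shared full-image convolutional features
    proposals = rpn(features)                    # RPN: scored rectangular region proposals
    detections = detector(features, proposals)   # Fast R-CNN head reuses the same features
    return detections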
3.1 Region Proposal Networks
A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.
To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following). This feature is fed into two sibling fully-connected layers—a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n×n convolutional layer followed by two sibling 1×1 convolutional layers (for reg and cls, respectively).
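A minimal sketch of this mini-network, assuming a VGG-16 backbone (512-d features) and k = 9 anchors per location; PyTorch is used here purely for illustration (the released implementations are in MATLAB/Caffe and Python):

import torch.nn as nn

class RPNHead(nn.Module):
    # n x n conv (n = 3) followed by two sibling 1 x 1 convs for cls and reg.
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(in_channels, 2 * k, kernel_size=1)  # objectness scores
        self.reg = nn.Conv2d(in_channels, 4 * k, kernel_size=1)  # box coordinates

    def forward(self, features):
        h = self.relu(self.conv(features))
        return self.cls(h), self.reg(h)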
3.1.1 Anchors
At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not object for each proposal. The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of a size W × H (typically ∼2,400), there are WHk anchors in total.
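A sketch of how the k = 9 reference boxes at one location can be enumerated, assuming the default scales and aspect ratios of Section 3.3 (the released code differs in bookkeeping details such as rounding):

import numpy as np

def generate_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # Returns (9, 4) boxes (x1, y1, x2, y2) centered at the origin; area = scale^2,
    # ratio = height / width. Shifting them over the W x H grid gives WHk anchors.
    anchors = []
    for s in scales:
        for r in ratios:
            w, h = s / np.sqrt(r), s * np.sqrt(r)
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)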
Translation-Invariant Anchors
An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image, the proposal should translate and the same function should be able to predict the proposal in either location. This translation-invariant property is guaranteed by our method. As a comparison, the MultiBox method [27] uses k-means to generate 800 anchors, which are not translation invariant. So MultiBox does not guarantee that the same proposal is generated if an object is translated.
The translation-invariant property also reduces the model size. MultiBox has a (4 + 1) × 800-dimensional fully-connected output layer, whereas our method has a (4 + 2) × 9-dimensional convolutional output layer in the case of k = 9 anchors. As a result, our output layer has 2.8 × 10⁴ parameters (512 × (4 + 2) × 9 for VGG-16), two orders of magnitude fewer than MultiBox’s output layer that has 6.1 × 10⁶ parameters (1536 × (4 + 1) × 800 for GoogleNet [34] in MultiBox [27]). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox. We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC.
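The parameter counts quoted above can be checked directly (weight matrices only, biases omitted):

# Output-layer weight counts: ours vs. MultiBox.
rpn_params = 512 * (4 + 2) * 9          # 27,648    ≈ 2.8 × 10^4 (VGG-16, k = 9)
multibox_params = 1536 * (4 + 1) * 800  # 6,144,000 ≈ 6.1 × 10^6 (GoogleNet in MultiBox)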
Multi-Scale Anchors as Regression References
Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The images are resized at multiple scales, and feature maps (HOG [8] or deep convolutional features [9], [1], [2]) are computed for each scale (Figure 1(a)). This way is often useful but is time-consuming. The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps. For example, in DPM [8], models of different aspect ratios are trained separately using different filter sizes (such as 5×7 and 7×5). If this way is used to address multiple scales, it can be thought of as a “pyramid of filters” (Figure 1(b)). The second way is usually adopted jointly with the first way [8].
As a comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. We show by experiments the effects of this scheme for addressing multiple scales and sizes (Table 8).
Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image, as is also done by the Fast R-CNN detector [2]. The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.
3.1.2 Loss Function
For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples; but we still adopt the first condition for the reason that in some rare cases the second condition may find no positive sample. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
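A sketch of this labeling rule, given a precomputed IoU matrix between anchors and ground-truth boxes (cross-boundary anchors, which are ignored as described in Section 3.3, are assumed to be filtered out beforehand):

import numpy as np

def label_anchors(iou, neg_thresh=0.3, pos_thresh=0.7):
    # iou: (num_anchors, num_gt). Returns 1 (positive), 0 (negative), -1 (ignored).
    labels = -np.ones(iou.shape[0], dtype=int)
    max_iou = iou.max(axis=1)
    labels[max_iou < neg_thresh] = 0   # negative: IoU < 0.3 for all ground-truth boxes
    labels[max_iou > pos_thresh] = 1   # condition (ii): IoU overlap higher than 0.7
    labels[iou.argmax(axis=0)] = 1     # condition (i): highest-IoU anchor per ground-truth box
    return labels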
With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN [2]. Our loss function for an image is defined as:
L({pi}, {ti}) = (1/Ncls) Σi Lcls(pi, p∗i) + λ (1/Nreg) Σi p∗i Lreg(ti, t∗i).   (1)
Here, i is the index of an anchor in a mini-batch and pi is the predicted probability of anchor i being an object. The ground-truth label p∗i is 1 if the anchor is positive, and is 0 if the anchor is negative. ti is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t∗i is that of the ground-truth box associated with a positive anchor. The classification loss Lcls is log loss over two classes (object vs. not object). For the regression loss, we use Lreg(ti, t∗i) = R(ti − t∗i) where R is the robust loss function (smooth L1) defined in [2]. The term p∗i Lreg means the regression loss is activated only for positive anchors (p∗i = 1) and is disabled otherwise (p∗i = 0). The outputs of the cls and reg layers consist of {pi} and {ti} respectively.
The two terms are normalized by Ncls and Nreg and weighted by a balancing parameter λ. In our current implementation (as in the released code), the cls term in Eqn.(1) is normalized by the mini-batch size (i.e., Ncls = 256) and the reg term is normalized by the number of anchor locations (i.e., Nreg ∼ 2,400). By default we set λ = 10, and thus both cls and reg terms are roughly equally weighted. We show by experiments that the results are insensitive to the values of λ in a wide range (Table 9). We also note that the normalization as above is not required and could be simplified.
For bounding box regression, we adopt the parameterizations of the 4 coordinates following [5]:
tx = (x − xa)/wa, ty = (y − ya)/ha,
tw = log(w/wa), th = log(h/ha),   (2)
t∗x = (x∗ − xa)/wa, t∗y = (y∗ − ya)/ha,
t∗w = log(w∗/wa), t∗h = log(h∗/ha),
where x, y, w, and h denote the box’s center coordinates and its width and height. Variables x, xa, and x∗ are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.
Nevertheless, our method achieves bounding-box regression by a different manner from previous RoI-based (Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.
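A sketch of the parameterization in Eqn.(2) and the smooth L1 form of Lreg (boxes given as center coordinates plus width and height):

import numpy as np

def bbox_transform(anchor, box):
    # Eqn.(2): offsets of a box relative to an anchor; both are (x, y, w, h).
    xa, ya, wa, ha = anchor
    x, y, w, h = box
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def smooth_l1(d):
    # Robust loss R from [2]: 0.5 d^2 if |d| < 1, else |d| - 0.5 (elementwise).
    d = np.abs(d)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5)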
3.1.3 Training RPNs
The RPN can be trained end-to-end by back-propagation and stochastic gradient descent (SGD) [35]. We follow the “image-centric” sampling strategy from [2] to train this network. Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they dominate. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.
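A sketch of this sampling step, given per-anchor labels as in Section 3.1.2 (1 positive, 0 negative, -1 ignored):

import numpy as np

def sample_minibatch(labels, batch_size=256, pos_fraction=0.5, rng=np.random):
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), int(batch_size * pos_fraction))     # at most 128 positives
    pos = rng.choice(pos, n_pos, replace=False)
    neg = rng.choice(neg, batch_size - n_pos, replace=False)  # pad with negatives
    return np.concatenate([pos, neg])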
We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. All other layers (i.e., the shared convolutional layers) are initialized by pre-training a model for ImageNet classification [36], as is standard practice [5]. We tune all layers of the ZF net, and conv3_1 and up for the VGG net to conserve memory [2]. We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. We use a momentum of 0.9 and a weight decay of 0.0005 [37]. Our implementation uses Caffe [38].
3.2 Sharing Features for RPN and Fast R-CNN
Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will utilize these proposals. For the detection network, we adopt Fast R-CNN [2]. Next we describe algorithms that learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2).
Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks. We discuss three ways for training networks with features shared:
(i) Alternating training. In this solution, we first train RPN, and use the proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.
(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one network during training as in Figure 2. In each SGD iteration, the forward pass generates region proposals which are treated just like fixed, pre-computed proposals when training a Fast R-CNN detector. The backward propagation takes place as usual, where for the shared layers the backward propagated signals from both the RPN loss and the Fast R-CNN loss are combined. This solution is easy to implement. But this solution ignores the derivative w.r.t. the proposal boxes’ coordinates that are also network responses, so is approximate. In our experiments, we have empirically found this solver produces close results, yet reduces the training time by about 25-50% compared with alternating training. This solver is included in our released Python code.
(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by RPN are also functions of the input. The RoI pooling layer [2] in Fast R-CNN accepts the convolutional features and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint training. In a non-approximate joint training solution, we need an RoI pooling layer that is differentiable w.r.t. the box coordinates. This is a nontrivial problem and a solution can be given by an “RoI warping” layer as developed in [15], which is beyond the scope of this paper.
4-Step Alternating Training. In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. In the first step, we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share convolutional layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. As such, both networks share the same convolutional layers and form a unified network. A similar alternating training can be run for more iterations, but we have observed negligible improvements.
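In outline form, the schedule reads as follows (a sketch only; the three callables are hypothetical placeholders for the actual training routines):

def alternating_4step(init_imagenet, train_rpn, train_fast_rcnn):
    rpn = train_rpn(init_imagenet())                       # step 1: RPN end-to-end
    det = train_fast_rcnn(init_imagenet(), proposals=rpn)  # step 2: separate detector
    rpn = train_rpn(det, freeze_shared=True)               # step 3: init from detector, tune RPN-only layers
    det = train_fast_rcnn(det, proposals=rpn, freeze_shared=True)  # step 4: tune Fast R-CNN-only layers
    return det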
3.3 Implementation Details
We train and test both region proposal and object detection networks on images of a single scale [1], [2]. We re-scale the images such that their shorter side is s = 600 pixels [2]. Multi-scale feature extraction (using an image pyramid) may improve accuracy but does not exhibit a good speed-accuracy trade-off [2]. On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ∼10 pixels on a typical PASCAL image before resizing (∼500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.
For anchors, we use 3 scales with box areas of 128², 256², and 512² pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation experiments on their effects in the next section. As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales, saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net. We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible—one may still roughly infer the extent of an object if only the middle of the object is visible.
The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so they do not contribute to the loss. For a typical 1000 × 600 image, there will be roughly 20000 (≈ 60 × 40 × 9) anchors in total. With the cross-boundary anchors ignored, there are about 6000 anchors per image for training. If the boundary-crossing outliers are not ignored in training, they introduce large, difficult to correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.
Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals, but evaluate different numbers of proposals at test-time.
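A sketch of greedy NMS as used here (boxes as (x1, y1, x2, y2) corner coordinates, ranked by their cls scores):

import numpy as np

def nms(boxes, scores, iou_thresh=0.7, top_n=2000):
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]          # highest-scoring proposal first
    keep = []
    while order.size > 0 and len(keep) < top_n:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # intersection of the kept box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]     # suppress overlaps above the 0.7 threshold
    return keep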
4 EXPERIMENTS
4.1 Experiments on PASCAL VOC
We comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [11]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results on the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the “fast” version of ZF net [32] that has 5 convolutional layers and 3 fully-connected layers, and the public VGG-16 model [3] that has 13 convolutional layers and 3 fully-connected layers. We primarily evaluate detection mean Average Precision (mAP), because this is the actual metric for object detection (rather than focusing on object proposal proxy metrics).
Table 2 (top) shows Fast R-CNN results when trained and tested using various region proposal methods. These results use the ZF net. For Selective Search (SS) [4], we generate about 2000 proposals by the “fast” mode. For EdgeBoxes (EB) [6], we generate the proposals by the default EB setting tuned for 0.7 IoU. SS has an mAP of 58.7% and EB has an mAP of 58.6% under the Fast R-CNN framework. RPN with Fast R-CNN achieves competitive results, with an mAP of 59.9% while using up to 300 proposals. Using RPN yields a much faster detection system than using either SS or EB because of shared convolutional computations; the fewer proposals also reduce the region-wise fully-connected layers’ cost (Table 5).
Ablation Experiments on RPN. To investigate the behavior of RPNs as a proposal method, we conducted several ablation studies. First, we show the effect of sharing convolutional layers between the RPN and Fast R-CNN detection network. To do this, we stop after the second step in the 4-step training process. Using separate networks reduces the result slightly to 58.7% (RPN+ZF, unshared, Table 2). We observe that this is because in the third step when the detector-tuned features are used to fine-tune the RPN, the proposal quality is improved.
Next, we disentangle the RPN’s influence on training the Fast R-CNN detection network. For this purpose, we train a Fast R-CNN model by using the 2000 SS proposals and ZF net. We fix this detector and evaluate the detection mAP by changing the proposal regions used at test-time. In these ablation experiments, the RPN does not share features with the detector.
Replacing SS with 300 RPN proposals at test-time leads to an mAP of 56.8%. The loss in mAP is because of the inconsistency between the training/testing proposals. This result serves as the baseline for the following comparisons.
Somewhat surprisingly, the RPN still leads to a competitive result (55.1%) when using the top-ranked 100 proposals at test-time, indicating that the top-ranked RPN proposals are accurate. On the other extreme, using the top-ranked 6000 RPN proposals (without NMS) has a comparable mAP (55.2%), suggesting NMS does not harm the detection mAP and may reduce false alarms.
Next, we separately investigate the roles of RPN’s cls and reg outputs by turning off either of them at test-time. When the cls layer is removed at test-time (thus no NMS/ranking is used), we randomly sample N proposals from the unscored regions. The mAP is nearly unchanged with N = 1000 (55.8%), but degrades considerably to 44.6% when N = 100. This shows that the cls scores account for the accuracy of the highest ranked proposals.
On the other hand, when the reg layer is removed at test-time (so the proposals become anchor boxes), the mAP drops to 52.1%. This suggests that the high-quality proposals are mainly due to the regressed box bounds. The anchor boxes, though having multiple scales and aspect ratios, are not sufficient for accurate detection.
We also evaluate the effects of more powerful networks on the proposal quality of RPN alone. We use VGG-16 to train the RPN, and still use the above detector of SS+ZF. The mAP improves from 56.8% (using RPN+ZF) to 59.2% (using RPN+VGG). This is a promising result, because it suggests that the proposal quality of RPN+VGG is better than that of RPN+ZF. Because proposals of RPN+ZF are competitive with SS (both are 58.7% when consistently used for training and testing), we may expect RPN+VGG to be better than SS. The following experiments justify this hypothesis.
Performance of VGG-16. Table 3 shows the results of VGG-16 for both proposal and detection. Using RPN+VGG, the result is 68.5% for unshared features, slightly higher than the SS baseline. As shown above, this is because the proposals generated by RPN+VGG are more accurate than SS. Unlike SS that is pre-defined, the RPN is actively trained and benefits from better networks. For the feature-shared variant, the result is 69.9%—better than the strong SS baseline, yet with nearly cost-free proposals. We further train the RPN and detection network on the union set of PASCAL VOC 2007 trainval and 2012 trainval. The mAP is 73.2%. Figure 5 shows some results on the PASCAL VOC 2007 test set. On the PASCAL VOC 2012 test set (Table 4), our method has an mAP of 70.4% trained on the union set of VOC 2007 trainval+test and VOC 2012 trainval. Table 6 and Table 7 show the detailed numbers.
In Table 5 we summarize the running time of the entire object detection system. SS takes 1-2 seconds depending on content (on average about 1.5s), and Fast R-CNN with VGG-16 takes 320ms on 2000 SS proposals (or 223ms if using SVD on fully-connected layers [2]). Our system with VGG-16 takes in total 198ms for both proposal and detection. With the convolutional features shared, the RPN alone only takes 10ms computing the additional layers. Our region-wise computation is also lower, thanks to fewer proposals (300 per image). Our system has a frame-rate of 17 fps with the ZF net.
Sensitivities to Hyper-parameters. In Table 8 we investigate the settings of anchors. By default we use 3 scales and 3 aspect ratios (69.9% mAP in Table 8). If using just one anchor at each position, the mAP drops by a considerable margin of 3-4%. The mAP is higher if using 3 scales (with 1 aspect ratio) or 3 aspect ratios (with 1 scale), demonstrating that using anchors of multiple sizes as the regression references is an effective solution. Using just 3 scales with 1 aspect ratio (69.8%) is as good as using 3 scales with 3 aspect ratios on this dataset, suggesting that scales and aspect ratios are not disentangled dimensions for the detection accuracy. But we still adopt these two dimensions in our designs to keep our system flexible.
In Table 9 we compare different values of λ in Equation (1). By default we use λ = 10, which makes the two terms in Equation (1) roughly equally weighted after normalization. Table 9 shows that our result is impacted just marginally (by ∼1%) when λ is within a scale of about two orders of magnitude (1 to 100). This demonstrates that the result is insensitive to λ in a wide range.
Analysis of Recall-to-IoU. Next we compute the recall of proposals at different IoU ratios with ground-truth boxes. It is noteworthy that the Recall-to-IoU metric is just loosely [19], [20], [21] related to the ultimate detection accuracy. It is more appropriate to use this metric to diagnose the proposal method than to evaluate it.
In Figure 4, we show the results of using 300, 1000, and 2000 proposals. We compare with SS and EB, and the N proposals are the top-N ranked ones based on the confidence generated by these methods. The plots show that the RPN method behaves gracefully when the number of proposals drops from 2000 to 300. This explains why the RPN has a good ultimate detection mAP when using as few as 300 proposals. As we analyzed before, this property is mainly attributed to the cls term of the RPN. The recall of SS and EB drops more quickly than RPN when the proposals are fewer.
One-Stage Detection vs. Two-Stage Proposal + Detection. The OverFeat paper [9] proposes a detection method that uses regressors and classifiers on sliding windows over convolutional feature maps. OverFeat is a one-stage, class-specific detection pipeline, and ours is a two-stage cascade consisting of class-agnostic proposals and class-specific detections. In OverFeat, the region-wise features come from a sliding window of one aspect ratio over a scale pyramid. These features are used to simultaneously determine the location and category of objects. In RPN, the features are from square (3×3) sliding windows and predict proposals relative to anchors with different scales and aspect ratios. Though both methods use sliding windows, the region proposal task is only the first stage of Faster R-CNN—the downstream Fast R-CNN detector attends to the proposals to refine them. In the second stage of our cascade, the region-wise features are adaptively pooled [1], [2] from proposal boxes that more faithfully cover the features of the regions. We believe these features lead to more accurate detections.
To compare the one-stage and two-stage systems, we emulate the OverFeat system (and thus also circumvent other differences of implementation details) by one-stage Fast R-CNN. In this system, the “proposals” are dense sliding windows of 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1). Fast R-CNN is trained to predict class-specific scores and regress box locations from these sliding windows. Because the OverFeat system adopts an image pyramid, we also evaluate using convolutional features extracted from 5 scales. We use those 5 scales as in [1], [2].
Table 10 compares the two-stage system and two variants of the one-stage system. Using the ZF model, the one-stage system has an mAP of 53.9%. This is lower than the two-stage system (58.7%) by 4.8%. This experiment justifies the effectiveness of cascaded region proposals and object detection. Similar observations are reported in [2], [39], where replacing SS region proposals with sliding windows leads to ∼6% degradation in both papers. We also note that the one-stage system is slower as it has considerably more proposals to process.
4.2 Experiments on MS COCO
We present more results on the Microsoft COCO object detection dataset [12]. This dataset involves 80 object categories. We experiment with the 80k images on the training set, 40k images on the validation set, and 20k images on the test-dev set. We evaluate the mAP averaged for IoU ∈ [0.5 : 0.05 : 0.95] (COCO’s standard metric, simply denoted as mAP@[.5, .95]) and [email protected] (PASCAL VOC’s metric).
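COCO’s standard metric simply averages AP over ten IoU thresholds; a sketch, assuming a hypothetical ap_at(t) that returns the dataset AP at IoU threshold t:

import numpy as np

def coco_map(ap_at):
    thresholds = np.linspace(0.5, 0.95, 10)   # 0.50, 0.55, ..., 0.95
    return float(np.mean([ap_at(t) for t in thresholds]))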
There are a few minor changes of our system made for this dataset. We train our models on an 8-GPU implementation, and the effective mini-batch size becomes 8 for RPN (1 per GPU) and 16 for Fast R-CNN (2 per GPU). The RPN step and Fast R-CNN step are both trained for 240k iterations with a learning rate of 0.003 and then for 80k iterations with 0.0003. We modify the learning rates (starting with 0.003 instead of 0.001) because the mini-batch size is changed. For the anchors, we use 3 aspect ratios and 4 scales (adding 64²), mainly motivated by handling small objects on this dataset. In addition, in our Fast R-CNN step, the negative samples are defined as those with a maximum IoU with ground truth in the interval of [0, 0.5), instead of [0.1, 0.5) used in [1], [2]. We note that in the SPPnet system [1], the negative samples in [0.1, 0.5) are used for network fine-tuning, but the negative samples in [0, 0.5) are still visited in the SVM step with hard-negative mining. But the Fast R-CNN system [2] abandons the SVM step, so the negative samples in [0, 0.1) are never visited. Including these [0, 0.1) samples improves [email protected] on the COCO dataset for both Fast R-CNN and Faster R-CNN systems (but the impact is negligible on PASCAL VOC).
The rest of the implementation details are the same as on PASCAL VOC. In particular, we keep using 300 proposals and single-scale (s = 600) testing. The testing time is still about 200ms per image on the COCO dataset.
In Table 11 we first report the results of the Fast R-CNN system [2] using the implementation in this paper. Our Fast R-CNN baseline has 39.3% [email protected] on the test-dev set, higher than that reported in [2]. We conjecture that the reason for this gap is mainly due to the definition of the negative samples and also the changes of the mini-batch sizes. We also note that the mAP@[.5, .95] is just comparable.
Next we evaluate our Faster R-CNN system. Using the COCO training set to train, Faster R-CNN has 42.1% [email protected] and 21.5% mAP@[.5, .95] on the COCO test-dev set. This is 2.8% higher for [email protected] and 2.2% higher for mAP@[.5, .95] than the Fast R-CNN counterpart under the same protocol (Table 11). This indicates that RPN performs excellently for improving the localization accuracy at higher IoU thresholds. Using the COCO trainval set to train, Faster R-CNN has 42.7% [email protected] and 21.9% mAP@[.5, .95] on the COCO test-dev set. Figure 6 shows some results on the MS COCO test-dev set.
Faster R-CNN in ILSVRC & COCO 2015 competitions. We have demonstrated that Faster R-CNN benefits more from better features, thanks to the fact that the RPN completely learns to propose regions by neural networks. This observation is still valid even when one increases the depth substantially to over 100 layers [18]. Only by replacing VGG-16 with a 101-layer residual net (ResNet-101) [18], the Faster R-CNN system increases the mAP from 41.5%/21.2% (VGG-16) to 48.4%/27.2% (ResNet-101) on the COCO val set. With other improvements orthogonal to Faster R-CNN, He et al. [18] obtained a single-model result of 55.7%/34.9% and an ensemble result of 59.0%/37.4% on the COCO test-dev set, which won the 1st place in the COCO 2015 object detection competition. The same system [18] also won the 1st place in the ILSVRC 2015 object detection competition, surpassing the second place by absolute 8.5%. RPN is also a building block of the 1st-place winning entries in ILSVRC 2015 localization and COCO 2015 segmentation competitions, for which the details are available in [18] and [15] respectively.
4.3 From MS COCO to PASCAL VOC
Large-scale data is of crucial importance for improving deep neural networks. Next, we investigate how the MS COCO dataset can help with the detection performance on PASCAL VOC.
As a simple baseline, we directly evaluate the COCO detection model on the PASCAL VOC dataset, without fine-tuning on any PASCAL VOC data. This evaluation is possible because the categories on COCO are a superset of those on PASCAL VOC. The categories that are exclusive on COCO are ignored in this experiment, and the softmax layer is performed only on the 20 categories plus background. The mAP under this setting is 76.1% on the PASCAL VOC 2007 test set (Table 12). This result is better than that trained on VOC07+12 (73.2%) by a good margin, even though the PASCAL VOC data are not exploited.
Then we fine-tune the COCO detection model on the VOC dataset. In this experiment, the COCO model is in place of the ImageNet-pre-trained model (that is used to initialize the network weights), and the Faster R-CNN system is fine-tuned as described in Section 3.2. Doing so leads to 78.8% mAP on the PASCAL VOC 2007 test set. The extra data from the COCO set increases the mAP by 5.6%. Table 6 shows that the model trained on COCO+VOC has the best AP for every individual category on PASCAL VOC 2007. Similar improvements are observed on the PASCAL VOC 2012 test set (Table 12 and Table 7). We note that the test-time speed of obtaining these strong results is still about 200ms per image.
5 CONCLUSION
We have presented RPNs for efficient and accurate region proposal generation. By sharing convolutional features with the down-stream detection network, the region proposal step is nearly cost-free. Our method enables a unified, deep-learning-based object detection system to run at near real-time frame rates. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.