In object detection, an intersection over union (IoU) threshold is required to define positives and negatives. An object detector trained with a low IoU threshold, e.g. 0.5, usually produces noisy detections. However, detection performance tends to degrade as the IoU threshold increases. Two main factors are responsible for this: 1) overfitting during training, due to exponentially vanishing positive samples, and 2) an inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, is proposed to address these problems. It consists of a sequence of detectors trained with increasing IoU thresholds, so that they are sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of a detector is a good distribution for training the next, higher quality detector. The resampling of progressively improved hypotheses guarantees that all detectors have a positive set of examples of equivalent size, reducing the overfitting problem. The same cascade procedure is applied at inference, enabling a closer match between the hypotheses and the detector quality of each stage. A simple implementation of the Cascade R-CNN is shown to surpass all single-model object detectors on the challenging COCO dataset. Experiments also show that the Cascade R-CNN is widely applicable across detector architectures, achieving consistent gains independently of the baseline detector strength. The code will be made available at https://github.com/zhaoweicai/cascade-rcnn.
Object detection is a complex problem, requiring the solution of two main tasks. First, the detector must solve the recognition problem, to distinguish foreground objects from background and assign them the proper object class labels. Second, the detector must solve the localization problem, to assign accurate bounding boxes to different objects. Both of these are particularly difficult because the detector faces many “close” false positives, corresponding to “close but not correct” bounding boxes. The detector must find the true positives while suppressing these close false positives.
Many of the recently proposed object detectors are based on the two-stage R-CNN framework [12, 11, 27, 21], where detection is framed as a multi-task learning problem that combines classification and bounding box regression. Unlike in object recognition, an intersection over union (IoU) threshold is required to define positives and negatives. However, the commonly used threshold values u, typically u = 0.5, establish quite a loose requirement for positives. The resulting detectors frequently produce noisy bounding boxes, as shown in Figure 1 (a). Hypotheses that most humans would consider close false positives frequently pass the IoU ≥ 0.5 test. While the examples assembled under the u = 0.5 criterion are rich and diversified, they make it difficult to train detectors that can effectively reject close false positives.
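To make the role of the threshold u concrete, the following is a minimal sketch of how an IoU test separates positives from negatives. The (x1, y1, x2, y2) box convention and the function names `iou` and `is_positive` are assumptions made for illustration; they are not taken from the paper or its released code.

```python
from typing import Sequence, Tuple

# Assumed box convention: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive(hypothesis: Box, ground_truths: Sequence[Box], u: float = 0.5) -> bool:
    """A hypothesis counts as positive if its best IoU with any ground truth is >= u."""
    best = max((iou(hypothesis, g) for g in ground_truths), default=0.0)
    return best >= u


# Hypothetical example: a box covering only 60% of the object.
ground_truth = (0.0, 0.0, 10.0, 10.0)
loose_box = (0.0, 0.0, 10.0, 6.0)
print(iou(loose_box, ground_truth))                   # IoU of the loose box
print(is_positive(loose_box, [ground_truth], u=0.5))  # accepted at u = 0.5
print(is_positive(loose_box, [ground_truth], u=0.7))  # rejected at u = 0.7
```

The same box is a positive under u = 0.5 and a negative under u = 0.7, which is the sense in which the lower threshold is a loose requirement.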
In this work, we define the quality of a hypothesis as its IoU with the ground truth, and the quality of a detector as the IoU threshold u used to train it. The goal is to investigate the so far poorly researched problem of learning high quality object detectors, whose outputs contain few close false positives, as shown in Figure 1 (b). The basic idea is that a single detector can only be optimal for a single quality level. This is known in the cost-sensitive learning literature [7, 24], where the optimization of different points of the receiver operating characteristic (ROC) requires different loss functions. The main difference is that we consider the optimization for a given IoU threshold, rather than a given false positive rate.
The idea is illustrated by Figure 1 (c) and (d), which present the localization and detection performance, respectively, of three detectors trained with IoU thresholds of u = 0.5, 0.6, 0.7. The localization performance is evaluated as a function of the IoU of the input proposals, and the detection performance as a function of the IoU threshold, as in COCO [20]. Note that, in Figure 1 (c), each bounding box regressor performs best for examples of IoU close to the threshold at which the detector was trained. This also holds for detection performance, up to overfitting. Figure 1 (d) shows that the detector of u = 0.5 outperforms the detector of u = 0.6 for low IoU examples, but underperforms it at higher IoU levels. In general, a detector optimized at a single IoU level is not necessarily optimal at other levels. These observations suggest that higher quality detection requires a closer quality match between the detector and the hypotheses that it processes. In general, a detector can only have high quality if presented with high quality proposals.
However, to produce a high quality detector, it does not suffice to simply increase u during training. In fact, as seen for the detector of u = 0.7 in Figure 1 (d), this can degrade detection performance. The problem is that the distribution of hypotheses produced by a proposal detector is usually heavily imbalanced towards low quality. In general, forcing larger IoU thresholds leads to an exponentially smaller number of positive training samples. This is particularly problematic for neural networks, which are known to be very example intensive, and makes the “high u” training strategy quite prone to overfitting. Another difficulty is the mismatch between the quality of the detector and that of the testing hypotheses at inference. As shown in Figure 1, high quality detectors are only necessarily optimal for high quality hypotheses. Detection could be suboptimal when such detectors are asked to work on hypotheses of other quality levels.
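The shrinking positive set can be illustrated with a small counting sketch. The IoU values below are made up for illustration and are only meant to mimic a proposal distribution skewed towards low quality; they are not measurements from the paper, and `count_positives` is a hypothetical helper name.

```python
from typing import Dict, Sequence


def count_positives(ious: Sequence[float], thresholds: Sequence[float]) -> Dict[float, int]:
    """For each threshold u, count the hypotheses whose IoU with the ground truth is >= u."""
    return {u: sum(1 for v in ious if v >= u) for u in thresholds}


# Made-up IoUs of 14 proposals with their matched ground truth, skewed to low quality.
proposal_ious = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45,
                 0.52, 0.55, 0.58, 0.63, 0.66, 0.72, 0.81]

counts = count_positives(proposal_ious, [0.5, 0.6, 0.7])
for u, n in counts.items():
    print(f"u = {u}: {n} positives out of {len(proposal_ious)}")
```

On this toy data the positive set drops from 7 to 4 to 2 examples as u goes from 0.5 to 0.6 to 0.7, which is the kind of shrinkage that makes training with a high u alone prone to overfitting.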
In this paper, we propose a new detector architecture, Cascade R-CNN, that addresses these problems. It is a multi-stage extension of the R-CNN, where detector stages deeper into the cascade are sequentially more selective against close false positives. The cascade of R-CNN stages is trained sequentially, using the output of one stage to train the next. This is motivated by the observation that the output IoU of a regressor is almost invariably better than the input IoU. This observation can be made in Figure 1 (c), where all plots are above the gray line. It suggests that the output of a detector trained with a certain IoU threshold is a good distribution to train the detector of the next higher IoU threshold. This is similar to the bootstrapping methods commonly used to assemble datasets in the object detection literature [31, 8]. The main difference is that the resampling procedure of the Cascade R-CNN does not aim to mine hard negatives. Instead, by adjusting bounding boxes, each stage aims to find a good set of close false positives for training the next stage. When operating in this manner, a sequence of detectors adapted to increasingly higher IoUs can beat the overfitting problem, and thus be effectively trained. At inference, the same cascade procedure is applied. The progressively improved hypotheses are better matched to the increasing detector quality at each stage. This enables higher detection accuracies, as suggested by Figure 1 (c) and (d).
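The resampling idea can be sketched with a toy cascade. Everything here is an assumption made for illustration: the stage thresholds 0.5, 0.6, 0.7 are borrowed from the three detectors of Figure 1, and `make_toy_regressor` is a hypothetical stand-in for a learned bounding box regressor that simply moves each box a fixed fraction towards a known ground truth. A real regressor has no access to the ground truth at inference; the toy only mimics the observation that a stage's output IoU tends to exceed its input IoU.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) convention
Regressor = Callable[[Box], Box]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def make_toy_regressor(target: Box, step: float) -> Regressor:
    """Hypothetical stand-in for a trained stage: shifts a box `step` of the way to `target`."""
    def regress(box: Box) -> Box:
        x1, y1, x2, y2 = (b + step * (t - b) for b, t in zip(box, target))
        return (x1, y1, x2, y2)
    return regress


def run_cascade(proposals: Sequence[Box], stages: Sequence[Regressor]) -> List[List[Box]]:
    """Apply the stages in order; the outputs of one stage are the hypotheses of the next.

    Returns the hypotheses seen at each step: history[0] holds the proposals,
    history[t] holds the boxes produced by stage t.
    """
    boxes = list(proposals)
    history = [boxes]
    for regress in stages:
        boxes = [regress(b) for b in boxes]
        history.append(boxes)
    return history


ground_truth = (0.0, 0.0, 10.0, 10.0)
proposals = [(3.0, 0.0, 13.0, 10.0), (0.0, 4.0, 10.0, 14.0), (5.0, 5.0, 15.0, 15.0)]
stages = [make_toy_regressor(ground_truth, step=0.5) for _ in range(3)]
thresholds = [0.5, 0.6, 0.7]  # assumed increasing IoU threshold per stage

history = run_cascade(proposals, stages)
# Positives available to stage t are counted on its INPUT hypotheses, history[t].
positives_per_stage = [
    sum(iou(b, ground_truth) >= u for b in history[t])
    for t, u in enumerate(thresholds)
]
print(positives_per_stage)
```

In this toy setting the number of positives available to each stage does not collapse as the threshold rises, because each stage receives the refined boxes of the previous one rather than the original proposals.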
The Cascade R-CNN is quite simple to implement and train end-to-end. Our results show that a vanilla implementation, without any bells and whistles, surpasses all previous state-of-the-art single-model detectors by a large margin on the challenging COCO detection task [20], especially under the higher quality evaluation metrics. In addition, the Cascade R-CNN can be built with any two-stage object detector based on the R-CNN framework. We have observed consistent gains (of 2∼4 points), at a marginal increase in computation. This gain is independent of the strength of the baseline object detectors. We thus believe that this simple and effective detection architecture can be of interest for many object detection research efforts.
Due to the success of the R-CNN [12] architecture, the two-stage formulation of the detection problem, which combines a proposal detector and a region-wise classifier, has become predominant in the recent past. To reduce redundant CNN computations in the R-CNN, the SPP-Net [15] and Fast-RCNN [11] introduced the idea of region-wise feature extraction, significantly speeding up the overall detector. Later, the Faster-RCNN [27] achieved further speedups by introducing a Region Proposal Network (RPN). This architecture has become a leading object detection framework. Some more recent works have extended it to address various problems of detail. For example, the R-FCN [4] proposed efficient region-wise fully convolutional computation without accuracy loss, to avoid the heavy region-wise CNN computations of the Faster-RCNN, while the MS-CNN [1] and FPN [21] detect proposals at multiple output layers, so as to alleviate the scale mismatch between the RPN receptive fields and actual object size, for high-recall proposal detection.
One-stage object detection architectures have also become popular, mostly due to their computational efficiency. These architectures are close to the classic sliding window strategy [31, 8]. YOLO [26] outputs very sparse detection results by forwarding the input image once. When implemented with an efficient backbone network, it enables real-time object detection with fair performance. SSD [23] detects objects in a way similar to the RPN [27], but uses multiple feature maps at different resolutions to cover objects at various scales. The main limitation of these architectures is that their accuracies are typically below those of two-stage detectors. Recently, RetinaNet [22] was proposed to address the extreme foreground-background class imbalance in dense object detection, achieving better results than state-of-the-art two-stage object detectors.
Some explorations in multi-stage object detection have also been proposed. The multi-region detector [9] introduced iterative bounding box regression, where an R-CNN is applied several times to produce better bounding boxes. CRAFT [33] and AttractioNet [10] used a multi-stage procedure to generate accurate proposals, and forwarded them to a Fast-RCNN. [19, 25] embedded the classic cascade architecture of [31] in object detection networks. [3] alternated between a detection task and a segmentation task, for instance segmentation.
In this paper, we proposed a multi-stage object detection framework, the Cascade R-CNN, for the design of high quality object detectors. This architecture was shown to avoid the problems of overfitting at training and quality mismatch at inference. The solid and consistent detection improvements of the Cascade R-CNN on the challenging COCO dataset suggest that modeling and understanding the various concurring factors is required to advance object detection. The Cascade R-CNN was shown to be applicable to many object detection architectures. We believe that it can be useful to many future object detection research efforts.
Figure 3. The architectures of different frameworks. “I” is input image, “conv” backbone convolutions, “pool” region-wise feature extraction, “H” network head, “B” bounding box, and “C” classification. “B0” is proposals in all architectures.