Cascade mask RCNN|Cascade R-CNN: High Quality Object Detection and Instance Segmentation

2019.6.24

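Cascade R-CNN was proposed at CVPR 2018 for object detection. A second version, released in June 2019, extends it to instance segmentation as Cascade Mask R-CNN.

Paper: https://arxiv.org/abs/1906.09756v1

Code:

https://github.com/zhaoweicai/cascade-rcnn (Caffe version)

https://github.com/zhaoweicai/Detectron-Cascade-RCNN (Detectron version)

https://github.com/open-mmlab/mmdetection (version 1)

Instance segmentation surpasses Mask R-CNN; object detection reaches 50.9 AP on the COCO dataset.

The core idea is to train a sequence of cascaded detectors using different IoU thresholds. The approach can be used to cascade existing detectors and obtain more accurate object detection.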

Paper text and notes

Abstract

In object detection, the intersection over union (IoU) threshold is frequently used to deﬁne positives/negatives. The threshold used to train a detector deﬁnes its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overﬁtting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overﬁtting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset, and signiﬁcantly improves high-quality detection on generic and speciﬁc object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN. To facilitate future research, two implementations are made available at https://github.com/zhaoweicai/cascade-rcnn (Caffe) and https://github.com/zhaoweicai/Detectron-Cascade-RCNN (Detectron).

1 INTRODUCTION

Object detection is a complex problem, requiring the solution of two tasks. First, the detector must solve the recognition problem, distinguishing foreground objects from background and assigning them the proper object class labels. Second, the detector must solve the localization problem, assigning accurate bounding boxes to different objects. An effective architecture for the solution of the two tasks, on which many of the recently proposed object detectors are based, is the two-stage R-CNN framework [21], [22], [37], [47]. This frames detection as a multi-task learning problem that combines classiﬁcation, to solve the recognition problem, and bounding box regression, to solve localization.

Despite the success of this architecture, the two problems can be difﬁcult to solve accurately. This is partly due to the fact that there are many “close” false positives, corresponding to “close but not correct” bounding boxes. An effective detector must ﬁnd all true positives in an image, while suppressing these close false positives. This requirement makes detection more difﬁcult than other classiﬁcation problems, e.g. object recognition, where the difference between positives and negatives is not as ﬁne-grained. In fact, the boundary between positives and negatives must be carefully deﬁned. In the literature, this is done by thresholding the intersection over union (IoU) score between candidate and ground truth bounding boxes. While the threshold is typically set at the value of u = 0.5, this is a very loose requirement for positives. The resulting detectors frequently produce noisy bounding boxes, as shown in Fig. 1 (a).

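Note (as understood from online sources): in segmentation, classification is performed on the bbox, so differences in the mask or the bbox do not necessarily affect the classification result.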

Hypotheses that most humans would consider close false positives frequently pass the IoU ≥ 0.5 test. While training examples assembled under the u = 0.5 criterion are rich and diverse, they make it difﬁcult to train detectors that can effectively reject close false positives.

In this work, we deﬁne the quality of a detection hypothesis as its IoU with the ground truth, and the quality of a detector as the IoU threshold u used to train it. Some examples of hypotheses of increasing quality are shown in Fig. 1 (c). The goal is to investigate the poorly researched problem of learning high quality object detectors. As shown in Fig. 1 (b), these are detectors that produce few close false positives. The starting premise is that a single detector can only be optimal for a single quality level. This is known in the cost-sensitive learning literature [12], [42], where the optimization of different points of the receiver operating characteristic (ROC) requires different loss functions. The main difference is that we consider the optimization for a given IoU threshold, rather than false positive rate.

Some evidence in support of this premise is given in Fig. 2, which presents the bounding box localization performance, classiﬁcation loss and detection performance, respectively, of three detectors trained with IoU thresholds of u = 0.5,0.6,0.7. Localization and classiﬁcation are evaluated as a function of the detection hypothesis IoU. Detection is evaluated as a function of the IoU threshold, as in COCO [36]. Fig. 2 (a) shows that the three bounding box regressors tend to achieve the best performance for examples of IoU in the vicinity of the threshold used for detector training. Fig. 2 (c) shows a similar effect for detection, up to some overﬁtting for the highest thresholds. The detector trained with u = 0.5 outperforms the detector trained with u = 0.6 for low IoUs, underperforming it at higher IoUs. In general, a detector optimized for a single IoU value is not optimal for other values. This is also conﬁrmed by the classiﬁcation loss, shown in Fig. 2 (b), whose peaks are near the thresholds used for detector training. In general, the threshold determines the classiﬁcation boundary where the classiﬁer is most discriminative, i.e. has largest margin [6], [15].

The observations above suggest that high quality detection requires a close match between the quality of the detector and that of the detection hypotheses. The detector will only achieve high quality if presented with high quality proposals. This, however, cannot be guaranteed by simply increasing the threshold u during training. On the contrary, as seen for the detector of u = 0.7 in Fig. 2 (c), forcing a high value of u usually degrades detection performance. We refer to this problem, i.e. that training a detector with a higher threshold leads to poorer performance, as the paradox of high-quality detection. This problem has two causes. First, object proposal mechanisms tend to produce hypothesis distributions heavily imbalanced towards low quality. As a result, the use of larger IoU thresholds during training exponentially reduces the number of positive training examples. This is particularly problematic for neural networks, which are very example intensive, making the “high u” training strategy very prone to overfitting. Second, there is a mismatch between the quality of the detector and that of the hypotheses available at inference time. Since, as shown in Fig. 2, high quality detectors are only optimal for high quality hypotheses, detection performance can degrade substantially for hypotheses of lower quality.

In this paper, we propose a new detector architecture, denoted as Cascade R-CNN, that addresses these problems, to enable high quality object detection. The new architecture is a multi-stage extension of the R-CNN, where detector stages deeper into the cascade are sequentially more selective against close false positives. As is usual for classifier cascades [49], [54], the cascade of R-CNN stages is trained sequentially, using the output of one stage to train the next. This leverages the observation that the output IoU of a bounding box regressor is almost always better than its input IoU, as can be seen in Fig. 2 (a), where nearly all plots are above the gray line. As a result, the output of a detector trained with a certain IoU threshold is a good hypothesis distribution to train the detector of the next higher IoU threshold. This has some similarity to bootstrapping methods commonly used to assemble datasets for object detection [14], [54]. The main difference is that the resampling performed by the Cascade R-CNN does not aim to mine hard negatives. Instead, by adjusting bounding boxes, each stage aims to find a good set of close false positives for training the next stage. The main outcome of this resampling is that the quality of the detection hypotheses increases gradually, from one stage to the next. As a result, the sequence of detectors addresses the two problems underlying the paradox of high-quality detection. First, because the resampling operation guarantees the availability of a large number of examples for the training of all detectors in the sequence, it is possible to train detectors of high IoU without overfitting. Second, the use of the same cascade procedure at inference time produces a set of hypotheses of progressively higher quality, well matched to the increasing quality of the detector stages. This enables higher detection accuracies, as suggested by Fig. 2.

bootstrapping

The Cascade R-CNN is quite simple to implement and is trained end-to-end. Our results show that a vanilla implementation, without any bells and whistles, surpasses almost all previous state-of-the-art single-model detectors on the challenging COCO detection task [36], especially under the stricter evaluation metrics. In addition, the Cascade R-CNN can be built with any two-stage object detector based on the R-CNN framework. We have observed consistent gains (of 2 to 4 points, and more under stricter localization metrics), at a marginal increase in computation. This gain is independent of the strength of the baseline object detectors, for all the models we have tested. We thus believe that this simple and effective detection architecture can be of interest for many object detection research efforts.

A preliminary version of this manuscript was previously published in [3]. After the original publication, the Cascade R-CNN has been successfully reproduced within many different codebases, including the popular Detectron [20], PyTorch, and TensorFlow, showing consistent and reliable improvements independently of implementation codebase. In this expanded version, we have extended the Cascade R-CNN to instance segmentation, by adding a mask head to the cascade, denoted as Cascade Mask R-CNN. This is shown to achieve non-trivial improvements over the popular Mask R-CNN [25]. A new and more extensive evaluation is also presented, showing that the Cascade R-CNN is compatible with many complementary enhancements proposed in the detection and instance segmentation literatures, some of which were introduced after [3], e.g. GroupNorm [56]. Finally, we further present the results of a larger set of experiments, performed on various popular generic/specific object detection datasets, including PASCAL VOC [13], KITTI [16], CityPerson [64] and WiderFace [61]. These experiments demonstrate that the paradox of high quality object detection applies to all these tasks, and that the Cascade R-CNN enables more effective high quality detection than previously available methods. Due to these properties, as well as its generality and flexibility, the Cascade R-CNN has recently been adopted by the winning teams of the COCO 2018 instance segmentation challenge, the OpenImage 2018 challenge, and the Wider Challenge 2018. To facilitate future research, we have released the code on two codebases, Caffe [31] and Detectron [20].

2 RELATED WORK

Due to the success of the R-CNN [22] detector, which combines a proposal detector and a region-wise classifier, this two-stage architecture has become predominant in the recent past. To reduce redundant CNN computations, the SPP-Net [26] and Fast R-CNN [21] introduced the idea of region-wise feature extraction, enabling the sharing of the bulk of feature computations by object instances. The Faster R-CNN [47] then achieved further speed-ups by introducing a region proposal network (RPN), becoming the cornerstone of modern object detection. Later, some works extended this detector to address various problems of detail. For example, the R-FCN [8] proposed efficient region-wise full convolutions to avoid the heavy CNN computations of the Faster R-CNN; and the Mask R-CNN [25] added a network head that computes object masks to support instance segmentation. Some more recent works have focused on normalizing feature statistics [45], [56], modeling relations between instances [28], non-maximum suppression (NMS) [1], and other aspects [39], [50].

Scale invariance, an important requisite for effective object detection, has also received substantial attention in the literature [2], [37], [52]. While natural images contain objects at various scales, the fixed receptive field size of the filters implemented by the RPN [47] makes it prone to scale mismatches. To overcome this, the MS-CNN [2] introduced a multi-scale object proposal network, by generating outputs at multiple layers. This leverages the different receptive field sizes of the different layers to produce a set of scale-specific proposal generators, which is then combined into a strong multi-scale generator. Similarly, the FPN [37] detects high-recall proposals at multiple output layers, with recourse to a scale-invariant feature representation by adding a top-down connection across feature maps of different network depths. Both the MS-CNN and FPN rely on a feature pyramid representation for multi-scale object detection. SNIP [52], on the other hand, recently revisited the image pyramid in modern object detection. It normalizes the gradients from different object scales during training, such that the whole detector is scale-specific. Scale-invariant detection is achieved by using an image pyramid at inference.

Multi-scale proposals

MS-CNN: Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, pages 354–370, 2016

FPN

One-stage object detection architectures have also become popular for their computational efﬁciency. YOLO [46] outputs very sparse detection results and enables realtime object detection, by forwarding the input image once through an efﬁcient backbone network. SSD [40] detects objects in a way similar to the RPN [47], but uses multiple feature maps at different resolutions to cover objects at various scales. The main limitation of these detectors is that their accuracy is typically below that of two-stage detectors. The RetinaNet [38] detector was proposed to address the extreme foreground-background class imbalance of dense object detection, achieving results comparable to two-stage detectors. Recently, CornerNet [33] proposed to detect an object bounding box as a pair of keypoints, abandoning the widely used concept of anchors ﬁrst introduced by the Faster R-CNN. This detector has achieved very good performance with the help of some training and testing enhancements. ReﬁneDet [65] added an anchor reﬁnement module to the single-shot SSD [40], to improve localization accuracy. This is somewhat similar to the cascaded localization implemented by the proposed Cascade R-CNN, but ignores the problem of high-quality detection.

Some explorations in multi-stage object detection have also been proposed. The multi-region detector of [17] introduced iterative bounding box regression, where an R-CNN is applied several times to produce successively more accurate bounding boxes. [18], [19], [60] used a multi-stage procedure to generate accurate proposals, which are forwarded to an accurate model (e.g. Fast R-CNN). [43], [62] proposed an alternative procedure to localize objects sequentially. While this is similar in spirit to the Cascade R-CNN, these methods use the same regressor iteratively for accurate localization. On the other hand, [34], [44] embedded the classic cascade architecture of [54] in an object detection network. Finally, [7] iterated between the detection and segmentation tasks, to achieve improved instance segmentation.

Upon publication of the conference version of this manuscript, several works have pursued the idea behind Cascade R-CNN [5], [32], [41], [55]. [41], [55] applied it to single-shot object detectors, showing nontrivial improvements for high quality single-shot detection, for general objects and pedestrians, respectively. The IoU-Net [32] explored high-quality localization in greater detail, achieving some gains over the Cascade R-CNN by cascading more bounding box regression steps. [24] showed it is possible to achieve state-of-the-art object detectors without ImageNet pretraining, with the help of the Cascade R-CNN. These works show that the Cascade R-CNN idea is robust and applicable to various object detection architectures. This suggests that it should continue to be useful despite future advances in object detection.

3 HIGH QUALITY OBJECT DETECTION

In this section, we discuss the challenges of high quality object detection.

3.1 Object Detection

While the ideas proposed in this work can be applied to various detector architectures, we focus on the popular two-stage architecture of the Faster R-CNN [47], shown in Fig. 3 (a). The first stage is a proposal sub-network, in which the entire image is processed by a backbone network, e.g. ResNet [27], and a proposal head (“H0”) is applied to produce preliminary detection hypotheses, known as object proposals. In the second stage, these hypotheses are processed by a region-of-interest detection sub-network (“H1”), denoted as a detection head. A final classification score (“C”) and a bounding box (“B”) are assigned per hypothesis. The entire detector is learned end-to-end, using a multi-task loss with bounding box regression and classification components.

3.1.1 Bounding Box Regression

A bounding box b = (b_x, b_y, b_w, b_h) contains the four coordinates of an image patch x. Bounding box regression aims to regress a candidate bounding box b into a target bounding box g, using a regressor f(x, b). This is learned from a training set (g_i, b_i), by minimizing the risk

$$R_{loc}[f] = \sum_i L_{loc}(f(x_i, b_i), g_i). \tag{1}$$

As in Fast R-CNN [21],

$$L_{loc}(a, b) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(a_i - b_i), \tag{2}$$

where

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{3}$$

is the smooth L1 loss function. To encourage invariance to scale and location, smooth L1 operates on the distance vector ∆ = (δ_x, δ_y, δ_w, δ_h) defined by

$$\delta_x = (g_x - b_x)/b_w, \quad \delta_y = (g_y - b_y)/b_h, \quad \delta_w = \log(g_w/b_w), \quad \delta_h = \log(g_h/b_h). \tag{4}$$

Since bounding box regression usually performs minor adjustments on b, the numerical values of (4) can be very small. This usually makes the regression loss much smaller than the classification loss. To improve the effectiveness of multi-task learning, ∆ is normalized by its mean and variance, e.g. δ_x is replaced by

$$\delta_x' = (\delta_x - \mu_x)/\sigma_x. \tag{5}$$

This is widely used in the literature [2], [8], [25], [37], [47].
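
The regression target and loss described above can be sketched in a few lines of plain Python. This is only an illustration, assuming the standard Fast R-CNN parameterization, boxes given as (center x, center y, width, height), and made-up normalization statistics; the names `encode_deltas`, `normalize`, `smooth_l1`, `loc_loss`, `MEANS` and `STDS` are hypothetical and not taken from the paper's code.

```python
import math

def encode_deltas(b, g):
    """Distance vector (dx, dy, dw, dh) from candidate box b to target box g.

    Assumption: boxes are (cx, cy, w, h) tuples (center, width, height).
    """
    bx, by, bw, bh = b
    gx, gy, gw, gh = g
    return ((gx - bx) / bw, (gy - by) / bh,
            math.log(gw / bw), math.log(gh / bh))

def normalize(deltas, means, stds):
    """Standardize each component: (delta - mean) / std."""
    return tuple((d - m) / s for d, m, s in zip(deltas, means, stds))

def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def loc_loss(pred, target):
    """Sum of smooth L1 over the four box components."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))

# Hypothetical statistics, chosen only for illustration.
MEANS = (0.0, 0.0, 0.0, 0.0)
STDS = (0.1, 0.1, 0.2, 0.2)

b = (50.0, 50.0, 20.0, 40.0)   # candidate box
g = (52.0, 48.0, 20.0, 40.0)   # ground truth box
deltas = encode_deltas(b, g)               # small raw values
target = normalize(deltas, MEANS, STDS)    # rescaled regression target
loss = loc_loss((0.0, 0.0, 0.0, 0.0), target)  # loss of a "no adjustment" prediction
```

The raw deltas here are on the order of 0.1 or less, which is why the text rescales them before they enter the multi-task loss.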

3.1.2 Classiﬁcation

The classifier is a function h(x) that assigns an image patch x to one of M + 1 classes, where class 0 contains background and the remaining classes the objects to detect. h(x) is an (M + 1)-dimensional estimate of the posterior distribution over classes, i.e. h_k(x) = p(y = k|x), where y is the class label. Given a training set (x_i, y_i), it is learned by minimizing the classification risk

$$R_{cls}[h] = \sum_i L_{cls}(h(x_i), y_i), \tag{6}$$

where

$$L_{cls}(h(x), y) = -\log h_y(x) \tag{7}$$

is the cross-entropy loss.
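
As a small illustration of the classification risk with the cross-entropy loss, the sketch below assumes that the posterior h(x) is obtained by applying a softmax to raw class scores; the softmax step and the function names are assumptions made for the example, not details stated in the text above.

```python
import math

def softmax(scores):
    """Turn raw class scores into a posterior estimate over M + 1 classes."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(posterior, y):
    """Cross-entropy loss: minus the log-probability of the true class y."""
    return -math.log(posterior[y])

def classification_risk(posteriors, labels):
    """Sum of cross-entropy losses over a training set."""
    return sum(cross_entropy(h, y) for h, y in zip(posteriors, labels))

# M = 2 object classes plus background (class 0); equal scores give a uniform posterior.
h = softmax([0.0, 0.0, 0.0])
loss = cross_entropy(h, 1)
```

With a uniform posterior over three classes the loss equals log 3, the value a classifier has to beat to be better than chance.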

3.2 Detection Quality

Consider a ground truth object of bounding box g associated with class label y, and a detection hypothesis x of bounding box b. Since b usually includes an object and some amount of background, it can be difficult to determine if a detection is correct or not. This is usually addressed by the intersection over union (IoU) metric

$$IoU(b, g) = \frac{|b \cap g|}{|b \cup g|}. \tag{8}$$

If the IoU is above a threshold u, the patch is considered an example of the class of the object of bounding box g and denoted “positive”. Thus, the class label of a hypothesis x is a function of u,

$$y_u = \begin{cases} y, & IoU(b, g) \ge u \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$

If the IoU does not exceed the threshold for any object, x is assigned to the background and denoted “negative”.
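
The labeling rule above can be sketched as follows. The example assumes boxes in corner format (x1, y1, x2, y2) and, when several ground truth objects are present, matches the hypothesis to the one with the highest IoU; both choices, and the names `iou` and `assign_label`, are assumptions made for illustration.

```python
def iou(b, g):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(b[0], g[0]), max(b[1], g[1])
    ix2, iy2 = min(b[2], g[2]), min(b[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    union = area_b + area_g - inter
    return inter / union if union > 0 else 0.0

def assign_label(b, gt_boxes, gt_labels, u):
    """Class of the best-overlapping ground truth if its IoU >= u, else 0 (background)."""
    best_iou, best_label = 0.0, 0
    for g, y in zip(gt_boxes, gt_labels):
        overlap = iou(b, g)
        if overlap > best_iou:
            best_iou, best_label = overlap, y
    return best_label if best_iou >= u else 0

gt_boxes = [(0.0, 0.0, 10.0, 10.0)]
gt_labels = [3]                       # hypothetical class index
b = (2.0, 0.0, 12.0, 10.0)            # shifted copy of the ground truth, IoU = 2/3

label_loose = assign_label(b, gt_boxes, gt_labels, u=0.5)   # positive
label_strict = assign_label(b, gt_boxes, gt_labels, u=0.7)  # background
```

The same hypothesis is a positive at u = 0.5 and a negative at u = 0.7, which is the sense in which the label is a function of u.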

Although there is no need to define positive/negative examples for the bounding box regression task, an IoU threshold u is also required to select the set of samples used to train the regressor. While the IoU thresholds used for the two tasks do not have to be identical, this is usual in practice. Hence, the IoU threshold u defines the quality of a detector. Large thresholds encourage detected bounding boxes to be tightly aligned with their ground truth counterparts. Small thresholds reward detectors that produce loose bounding boxes, of small overlap with the ground truth.

A main challenge of object detection is that, no matter the choice of threshold, the detection setting is highly adversarial. When u is high, positives contain less background but it is difficult to assemble large positive training sets. When u is low, richer and more diverse positive training sets are possible, but the trained detector has little incentive to reject close false positives. In general, it is very difficult to guarantee that a single classifier performs uniformly well over all IoU levels. At inference, since the majority of the hypotheses produced by a proposal detector, e.g. RPN [47] or selective search [53], have low quality, the detector must be more discriminant for lower quality hypotheses. A standard compromise between these conflicting requirements is to settle on u = 0.5, which is used in almost all modern object detectors. This, however, is a relatively low threshold, leading to low quality detections that most humans consider close false positives, as shown in Fig. 1 (a).

3.3 Challenges to High Quality Detection

Despite the significant progress in object detection of the past few years, few works attempted to address high quality detection. This is mainly due to the following reasons.

First, evaluation metrics have historically placed greater emphasis on the low quality detection regime. For performance evaluation, an IoU threshold u is used to determine whether a detection is a success (IoU(b, g) ≥ u) or failure (IoU(b, g) < u). Many object detection datasets, including PASCAL VOC [13], ImageNet [48], Caltech Pedestrian [11], etc., use u = 0.5. This is partly because these datasets were established a while ago, when object detection performance was far from what it is today. However, this loose evaluation standard is adopted even by relatively recent datasets, such as WiderFace [61], or CityPersons [64]. This is one of the main reasons why performance has saturated for many of these datasets. Others, such as COCO [36] or KITTI [16], use stricter evaluation metrics: average precision at u = 0.7 for cars in KITTI, and mean average precision across u = [0.5 : 0.05 : 0.95] in COCO. While recent works have focused on these less saturated datasets, most detectors are still designed with the loose IoU threshold of u = 0.5, associated with the low-quality detection regime. In this work, we show that there is plenty of room for improvement when a stricter evaluation metric, e.g. u ≥ 0.75, is used and that it is possible to achieve significant improvements by designing detectors specifically for the high quality regime.
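
To make the effect of the evaluation threshold concrete, the sketch below counts how many detections pass the success test IoU(b, g) ≥ u at each of the COCO-style thresholds 0.5, 0.55, ..., 0.95. It is deliberately simplified: it is not average precision (which also ranks detections by confidence score), and the IoU values are invented for the example.

```python
def coco_thresholds():
    """The ten thresholds 0.50, 0.55, ..., 0.95, rounded to avoid float drift."""
    return [round(0.5 + 0.05 * k, 2) for k in range(10)]

def success_rate(ious, u):
    """Fraction of detections counted as successes, i.e. with IoU >= u."""
    return sum(1 for overlap in ious if overlap >= u) / len(ious)

# Hypothetical IoUs between five detections and their matched ground truth boxes.
ious = [0.56, 0.62, 0.71, 0.80, 0.93]

thresholds = coco_thresholds()
rates = {u: success_rate(ious, u) for u in thresholds}
```

In this toy set every detection counts as correct at u = 0.5, while fewer than half survive at u = 0.75, which is why a metric fixed at u = 0.5 says little about localization quality.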

Second, the design of high quality object detectors is not a trivial generalization of existing approaches, due to the paradox of high quality detection. To beat the paradox, it is necessary to match the qualities of the hypothesis generator and the object detector. In the literature, there have been efforts to increase the quality of hypotheses, e.g. by iterative bounding box regression [18], [19] or better RPN design [2], [37], and some efforts to increase the quality of the object detector, e.g. by using the integral loss on a set of IoU thresholds [63]. These attempts fail to guarantee high quality detection because they consider only one of the goals, missing the fact that the qualities of both tasks need to be increased simultaneously. On one hand, raising the quality of the hypotheses has little benefit if the detector remains of low quality, because the latter is not trained to discriminate high quality from low quality hypotheses. On the other, if only the detector quality is increased, there are too few high quality hypotheses for it to classify, leading to no detection improvement. In fact, because the set of positive samples decreases quickly with u, as shown in Fig. 4 (left), a high u detector can easily overfit and perform worse than a low u detector, as shown in Fig. 2 (c).

4 CASCADE R-CNN

In this section we introduce the Cascade R-CNN detector.

4.1 Architecture

The architecture of the Cascade R-CNN is shown in Fig. 3 (b). It is a multi-stage extension of the Faster R-CNN architecture of Fig. 3 (a). In this work, we focus on the detection sub-network, simply adopting the RPN [47] of Fig. 3 (a) for proposal detection. However, the Cascade R-CNN is not limited to this proposal mechanism; other choices should be possible. As discussed in the section above, the goal is to increase the quality of hypotheses and detector simultaneously, to enable high quality object detection. This is achieved with a combination of cascaded bounding box regression and cascaded detection.

Cascade R-CNN的架构如图3（b）所示。它是图3（a）的Faster R-CNN架构的多阶段扩展。 在这项工作中，我们专注于检测子网，简单地采用图3（a）的RPN [47]进行提案检测。 但是，Cascade R-CNN不限于此提议机制，其他选择应该是可能的。如上一节所述，目标是同时提高假设和检测器的质量，以实现高质量的物体检测。这是通过级联边界框回归和级联检测的组合实现的。

4.2 Cascaded Bounding Box Regression

High quality hypotheses can be easily produced during training, where ground truth bounding boxes are available, e.g. by sampling around the ground truth. The difficulty is to produce high quality proposals at inference, when ground truth is unavailable. This problem is addressed with resort to cascaded bounding box regression.

在训练期间可以容易地产生高质量的假设，其中可以获得地面真实边界框，例如，通过围绕gt抽样。当gt不可用时，难以在推理中产生高质量的提议。通过级联边界框回归解决了这个问题。

As shown in Fig. 2 (a), a single regressor cannot usually perform uniformly well over all quality levels. However, as is commonly done for pose regression [10] or face alignment [4], [58], [59], the regression task can be decomposed into a sequence of simpler steps. In the Cascade R-CNN detector, the idea is implemented with a cascaded regressor with the architecture of Fig. 3 (b). This consists of a cascade of specialized regressors

如图2（a）所示，单个回归量通常不能在所有质量水平上均匀地表现。然而，正如通常对姿势回归[10]或面部对齐[4]，[58]，[59]所做的那样，回归任务可以分解为一系列更简单的步骤。 在Cascade R-CNN探测器中，该想法是使用具有图3（b）结构的级联回归器实现的。 这包括一系列专门的回归量

$$f(x, b) = f_T \circ f_{T-1} \circ \cdots \circ f_1(x, b),$$

where $T$ is the total number of cascade stages. The key point is that each regressor $f_t$ is optimized for the bounding box distribution $\{b^t\}$ generated by the previous regressor, rather than the initial distribution $\{b^1\}$. In this way, the hypotheses are improved progressively.

其中T是级联阶段的总数。关键点是每个回归量ft都针对前一个回归量生成的边界框分布{bt}进行了优化，而不是初始分布{b1}。通过这种方式，假设逐步得到改善。
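The composition of stage-specific regressors described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: boxes are written as (cx, cy, w, h), the image features x are omitted, and `make_toy_stage` is a hypothetical stand-in for a learned regressor that peeks at a known ground truth box. None of these names come from the paper or its released code.

```python
import math
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]    # (cx, cy, w, h), an assumed box format
Delta = Tuple[float, float, float, float]  # (dx, dy, dw, dh)


def apply_delta(box: Box, delta: Delta) -> Box:
    """Shift the center by a fraction of the box size and rescale w, h."""
    cx, cy, w, h = box
    dx, dy, dw, dh = delta
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def cascade_regress(box: Box, stages: Sequence[Callable[[Box], Delta]]) -> List[Box]:
    """Apply f_1, then f_2, ..., then f_T; return the box after every stage."""
    boxes = [box]
    for predict_delta in stages:
        # Each stage sees the output of the previous one, not the initial box.
        box = apply_delta(box, predict_delta(box))
        boxes.append(box)
    return boxes


def make_toy_stage(gt: Box, step: float) -> Callable[[Box], Delta]:
    """Hypothetical regressor: predicts a fraction `step` of the exact offset."""
    def predict_delta(box: Box) -> Delta:
        cx, cy, w, h = box
        gx, gy, gw, gh = gt
        return (step * (gx - cx) / w, step * (gy - cy) / h,
                step * math.log(gw / w), step * math.log(gh / h))
    return predict_delta


def center_error(box: Box, gt: Box) -> float:
    """Euclidean distance between box centers."""
    return math.hypot(box[0] - gt[0], box[1] - gt[1])


gt = (4.0, 2.0, 20.0, 10.0)
stages = [make_toy_stage(gt, 0.5) for _ in range(3)]  # T = 3 toy stages
trajectory = cascade_regress((0.0, 0.0, 10.0, 10.0), stages)
errors = [center_error(b, gt) for b in trajectory]
```

In this toy setting the center error shrinks after every stage, which mirrors the progressive improvement of hypotheses. In the real detector each stage is a separately trained network head, so the three stages would not share the same function.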

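Fig. 5: Distribution of the distance vector Δ of (4) (without normalization) at different cascade stages. Top: plot of (δx, δy). Bottom: plot of (δw, δh). Red dots are outliers for the increasing IoU thresholds of later stages, and the statistics shown are obtained after outlier removal.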

This is illustrated in Fig. 5, which presents the distribution of the regression distance vector ∆ = (δx, δy, δw, δh) at different cascade stages. Note that most hypotheses become closer to the ground truth as they progress through the cascade. There are also some hypotheses that fail to meet the stricter IoU criteria of the later cascade stages. These are declared outliers and eliminated. It should be noted that, as discussed in Section 3.1.1, ∆ needs to be mean/variance normalized, as in (5), for effective multi-task learning. The mean and variance statistics computed after this outlier removal step are used to normalize ∆ at each cascade stage. Our experiments show that this implementation of cascaded bounding box regression generates hypotheses of very high quality at both training and inference.

这在图5中示出，其示出了在不同级联阶段的回归距离矢量Δ=（δx，δy，δw，δh）的分布。请注意，大多数假设随着它们在级联中的进展而变得更接近基本事实。还有一些假设无法满足后期级联阶段更严格的IoU标准。这些被宣布为异常值并被淘汰。应该注意的是，如3.1.1节所述，Δ需要是均值/方差归一化，如（5）所示，用于有效的多任务学习。在该异常值去除步骤之后计算的均值和方差统计量用于在每个级联阶段归一化Δ。我们的实验表明，级联边界框回归的这种实现在训练和推理中产生了非常高质量的假设。
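As a rough illustration of the distance vector Δ and its per-stage normalization, the sketch below encodes Δ with the usual R-CNN parameterization and then standardizes each component. The (cx, cy, w, h) box format, the function names, and the sample boxes are assumptions made for this example. Outlier removal is assumed to have happened before `normalize_deltas` is called.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), an assumed box format


def encode_delta(box: Box, gt: Box) -> Tuple[float, float, float, float]:
    """Distance vector (dx, dy, dw, dh) from a hypothesis to its ground truth."""
    bx, by, bw, bh = box
    gx, gy, gw, gh = gt
    return ((gx - bx) / bw, (gy - by) / bh,
            math.log(gw / bw), math.log(gh / bh))


def normalize_deltas(deltas: Sequence[Sequence[float]]):
    """Standardize each of the four components with stage-level statistics."""
    columns = list(zip(*deltas))
    mus = [mean(c) for c in columns]
    # Guard against a zero standard deviation in degenerate toy inputs.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    normed: List[Tuple[float, ...]] = [
        tuple((d - m) / s for d, m, s in zip(delta, mus, sigmas))
        for delta in deltas
    ]
    return normed, mus, sigmas


gt = (1.0, 2.0, 20.0, 10.0)
boxes = [(0.0, 0.0, 10.0, 10.0), (2.0, 1.0, 8.0, 12.0), (5.0, 5.0, 10.0, 10.0)]
deltas = [encode_delta(b, gt) for b in boxes]
normed, mus, sigmas = normalize_deltas(deltas)
```

Because the hypotheses get closer to the ground truth at later stages, the statistics `mus` and `sigmas` would be recomputed for each stage rather than shared.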

4.3 Cascaded Detection

As shown in the left of Fig. 4, the initial hypotheses distribution produced by the RPN is heavily tilted towards low quality. For example, only 2.9% of examples are positive for an IoU threshold u = 0.7. This makes it difficult to train a high quality detector. The Cascade R-CNN addresses the problem by using cascade regression as a resampling mechanism. This is inspired by Fig. 2 (a), where nearly all curves are above the diagonal gray line, showing that a bounding box regressor trained for a certain u tends to produce bounding boxes of higher IoU. Hence, starting from examples $\{(x_i, b_i)\}$, cascade regression successively resamples an example distribution $\{(x'_i, b'_i)\}$ of higher IoU. This enables the sets of positive examples of the successive stages to keep a roughly constant size, even when the detector quality u is increased. Figure 4 illustrates this property, showing how the example distribution tilts more heavily towards high quality examples after each resampling step.

如图4左侧所示，由RPN产生的初始假设分布严重倾向于低质量。 例如，只有2.9％的例子对于IoU阈值u = 0.7是正的。这使得训练高质量探测器变得困难。 Cascade R-CNN通过使用级联回归作为重采样机制来解决该问题。 这受到图2（a）的启发，其中几乎所有曲线都在对角灰线上方，表明对某个u训练的边界框回归器倾向于产生更高IoU的边界框。因此，从示例{（xi，bi）}开始，级联回归连续地重新采样较高IoU的示例分布{（x'i，b'i）}。这使得连续级的正例组能够保持大致恒定的大小，即使在检测器质量u增加时也是如此。图4说明了此属性，显示了每个重采样步骤后示例分布如何更倾向于高质量示例。
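The resampling effect can be made concrete with a small IoU computation. The sketch below counts what fraction of a proposal set is positive at a given threshold u, before and after one regression step. Boxes are (x1, y1, x2, y2) corners here, and `toy_regress` is a hypothetical regressor that simply moves each box halfway to the ground truth. The proposal set is invented for the example and does not reproduce the statistics of Fig. 4.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corners, an assumed format


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def positive_fraction(proposals: Sequence[Box], gt: Box, u: float) -> float:
    """Share of proposals whose IoU with the ground truth reaches u."""
    return sum(iou(p, gt) >= u for p in proposals) / len(proposals)


def toy_regress(p: Box, gt: Box) -> Box:
    """Hypothetical regressor: move every coordinate halfway to the ground truth."""
    return tuple((pc + gc) / 2 for pc, gc in zip(p, gt))


gt = (0.0, 0.0, 10.0, 10.0)
proposals = [(s, 0.0, 10.0 + s, 10.0) for s in (1.0, 2.0, 3.0, 5.0)]
resampled = [toy_regress(p, gt) for p in proposals]

before = positive_fraction(proposals, gt, 0.7)
after = positive_fraction(resampled, gt, 0.7)
```

With these toy boxes, one of four proposals is positive at u = 0.7 before regression and three of four afterwards, so the next stage can be trained at a higher threshold without running out of positives.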

At each stage $t$, the R-CNN head includes a classifier $h_t$ and a regressor $f_t$ optimized for the corresponding IoU threshold $u^t$, where $u^t > u^{t-1}$. These are learned with loss

在每个阶段t，R-CNN头包括分类器ht和针对相应的IoU阈值ut优化的回归器ft，其中u^t> u^t-1。这些都是在loss中学到的

$$L(x^t, g) = L_{cls}(h_t(x^t), y^t) + \lambda [y^t \geq 1] L_{loc}(f_t(x^t, b^t), g),$$

where $b^t = f_{t-1}(x^{t-1}, b^{t-1})$, $g$ is the ground truth object for $x^t$, $\lambda = 1$ is the trade-off coefficient, $y^t$ is the label of $x^t$ under the $u^t$ criterion, according to (9), and $[\cdot]$ is the indicator function. Note that the use of $[\cdot]$ implies that the IoU threshold u of bounding box regression is identical to that used for classification. This cascade learning has three important consequences for detector training. First, the potential for overfitting at large IoU thresholds u is reduced, since positive examples become plentiful at all stages (see Fig. 4). Second, detectors of deeper stages are optimal for higher IoU thresholds. Third, because some outliers are removed as the IoU threshold increases (see Fig. 5), the learning effectiveness of bounding box regression increases in the later stages. This simultaneous improvement of hypotheses and detector quality enables the Cascade R-CNN to beat the paradox of high quality detection. At inference, the same cascade is applied. The quality of the hypotheses is improved sequentially, and higher quality detectors are only required to operate on higher quality hypotheses, for which they are optimal. This enables the high quality object detection results of Fig. 1 (b), as suggested by Fig. 2.

其中bt = ft-1（xt-1，bt-1），g是xt的gt对象，λ= 1权衡系数，yt是ut标准下的xt标签，根据（9），[·]是指标功能。注意，使用[·]意味着边界框回归的IoU阈值u与用于分类的IoU阈值u相同。这种级联学习对探测器训练有三个重要影响。首先，在大的IoU阈值u下过度拟合的可能性降低，因为正例在所有阶段都变得充足（见图4）。其次，更深阶段的探测器对于更高的IoU阈值是最佳的。第三，因为随着IoU阈值的增加，一些异常值被移除（见图5），边界框回归的学习效果在后期阶段会增加。同时改进假设和探测器质量使Cascade R-CNN能够击败高质量探测的悖论。在推论中，应用相同的级联。假设的质量是顺序改进的，并且更高质量的检测器仅需要在更高质量的假设上操作，对于这些假设它们是最佳的。这使得能够获得图1（b）的高质量物体检测结果，如图2所示。
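A minimal sketch of the per-stage loss, assuming the usual two-term form (a classification term plus a localization term that is switched on by the indicator only for positives). Cross-entropy and smooth L1 are common choices assumed here for the two terms, and the thresholds 0.5, 0.6, 0.7 are illustrative values. All function names are hypothetical.

```python
import math
from typing import Sequence


def assign_label(iou_with_gt: float, gt_class: int, u: float) -> int:
    """Label under threshold u: the ground truth class if IoU >= u, else 0 (background)."""
    return gt_class if iou_with_gt >= u else 0


def cross_entropy(class_probs: Sequence[float], label: int) -> float:
    """Assumed classification loss: negative log-probability of the label."""
    return -math.log(class_probs[label])


def smooth_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Assumed localization loss, summed over the four delta components."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def stage_loss(class_probs: Sequence[float], pred_delta: Sequence[float],
               target_delta: Sequence[float], label: int, lam: float = 1.0) -> float:
    """Classification loss plus lam * [label >= 1] * localization loss."""
    loss = cross_entropy(class_probs, label)
    if label >= 1:  # the indicator: only positives contribute to box regression
        loss += lam * smooth_l1(pred_delta, target_delta)
    return loss


# The same hypothesis (IoU 0.65 with a class-3 object) under three stage thresholds.
labels = [assign_label(0.65, 3, u) for u in (0.5, 0.6, 0.7)]

background = stage_loss([0.5, 0.5], (0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0), label=0)
positive = stage_loss([0.5, 0.5], (0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0), label=1)
```

The `labels` list shows why the stage threshold matters: the same box is a positive for the first two toy stages and background for the third, so it contributes to box regression only in the former.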

4.4 Differences from Previous Works

The Cascade R-CNN has similarities to previous works using iterative bounding box regression and integral loss for detection. There are, however, important differences.

Cascade R-CNN与先前的工作具有相似性，使用迭代边界框回归和检测的积分损失。然而，存在重要的差异。

Iterative Bounding Box Regression: Some works [17], [18], [27] have previously argued that the use of a single bounding box regressor f is insufficient for accurate localization. These methods apply f iteratively, as a post-processing step

迭代边界框回归：一些作品[17]，[18]，[27]之前曾提出使用单个边界框回归量f不足以进行精确定位。这些方法迭代地应用，作为后处理步骤

$$f'(x, b) = f \circ f \circ \cdots \circ f(x, b), \tag{13}$$

that refines a bounding box b. This is called iterative bounding box regression and denoted as iterative BBox. It can be implemented with the inference architecture of Fig. 3 (c) where all heads are identical. Note that this is only for inference, as training is identical to that of a two-stage object detector, e.g. the Faster R-CNN of Fig. 3 (a) with u = 0.5. This approach ignores two problems. First, as shown in Fig. 2, a regressor f trained at u = 0.5 is suboptimal for hypotheses of higher IoUs. It actually degrades bounding box accuracy for IoUs larger than 0.85. Second, as shown in Fig. 5, the distribution of bounding boxes changes significantly after each iteration. While the regressor is optimal for the initial distribution, it can be quite suboptimal after that. Due to these problems, iterative BBox requires a fair amount of human engineering, in the form of proposal accumulation, box voting, etc., and has somewhat unreliable gains [17], [18], [27]. Usually, there is no benefit beyond applying f twice.

精炼边界框b。这称为迭代边界框回归并表示为迭代BBox。它可以用图3（c）的推理架构实现，其中所有头都是相同的。注意，这仅用于推理，因为训练与两级物体检测器的训练相同，例如，图3（a）的Faster R-CNN，u = 0.5。这种方法忽略了两个问题。首先，如图2所示，在u = 0.5时训练的回归f对于较高IoU的假设是次优的。它实际上降低了大于0.85的IoU的边界框精度。其次，如图5所示，边界框的分布在每次迭代后显着变化。虽然回归量对于初始分布是最佳的，但在此之后它可能是非常不理想的。由于这些问题，迭代BBox需要相当数量的人工工程，以提案积累，盒子投票等形式，并且有一些不可靠的收益[17]，[18]，[27]。通常，除了两次应用f之外没有任何好处。

The Cascade R-CNN differs from iterative BBox in several ways. First, while iterative BBox is a post processing procedure used to improve bounding boxes, the Cascade R-CNN uses cascade regression as a resampling mechanism that changes the distribution of hypotheses processed by the different stages. Second, because cascade regression is used at both training and inference, there is no discrepancy between training and inference distributions. Third, the multiple specialized regressors $\{f_T, f_{T-1}, \cdots, f_1\}$ are optimal for the resampled distributions of the different stages. This is unlike the single f of (13), which is only optimal for the initial distribution. Our experiments show that the Cascade R-CNN enables more precise localization than that possible with iterative BBox, and requires no human engineering.

Cascade R-CNN在几个方面与迭代BBox不同。首先，虽然迭代BBox是用于改进边界框的后处理过程，但是级联R-CNN使用级联回归作为重新采样机制，其改变由不同阶段处理的假设的分布。其次，由于级联回归用于训练和推理，因此训练和推理分布之间不存在差异。第三，多个专用回归量{fT，fT -1，···，f1}对于不同阶段的重采样分布是最佳的。这与（13）的单个f不同，它仅对初始分布是最佳的。我们的实验表明，Cascade R-CNN能够实现比迭代BBox更精确的定位，并且不需要人工设计。

Integral Loss: [63] proposed an ensemble of classifiers with the architecture of Fig. 3 (d) and trained with the integral loss. This is a loss

积分损失：[63]提出了一个具有图3（d）结构的分类器集合，并用积分损失进行训练。这是一种损失

$$L_{cls}(h(x), y) = \sum_{u \in U} L_{cls}(h_u(x), y_u), \tag{14}$$

that targets various quality levels, defined by a set of IoU thresholds $U = \{0.5, 0.55, \cdots, 0.75\}$, chosen to fit the evaluation metric of the COCO challenge.

目标是由一组IoU阈值U = {0.5,0.55，...，0.75}定义的各种质量水平，选择这些阈值以适应COCO挑战的评估指标。

The Cascade R-CNN differs from this detector in several ways. First, (14) fails to address the problem that the various loss terms operate on different numbers of positives. As shown on Fig. 4 (left), the set of positive samples decreases quickly with u. This is particularly problematic because it makes the high quality classifiers very prone to overfitting. On the other hand, as shown in Fig. 4, the resampling of the Cascade R-CNN produces a nearly constant number of positive examples as the IoU threshold u increases. Second, at inference, the high quality classifiers are required to process proposals of overwhelmingly low quality, for which they are not optimal. This is unlike the higher quality detectors of the Cascade R-CNN, which are only required to operate on higher quality hypotheses. Third, the integral loss is designed to fit the COCO metrics and, by definition, the classifiers are ensembled at inference. The Cascade R-CNN aims to achieve high quality detection, and the high quality detector itself in the last stage can obtain the state-of-the-art detection performance. Due to all this, the integral loss detector of Fig. 3 (d) usually fails to outperform the vanilla detector of Fig. 3 (a), for most quality levels. This is unlike the Cascade R-CNN, which can have significant improvements.

Cascade R-CNN在几个方面与该探测器不同。首先，（14）未能解决各种损失条款对不同数量的正数进行操作的问题。如图4（左）所示，正样本集随u快速下降。这尤其成问题，因为它使得高质量分类器非常容易过度拟合。另一方面，如图4所示，当IoU阈值u增加时，级联R-CNN的重新采样产生几乎恒定数量的正例。其次，在推断时，需要高质量的分类器来处理压倒性低质量的提议，因为它们不是最佳的。这与Cascade R-CNN的高质量探测器不同，后者仅需要在更高质量的假设下操作。第三，积分损失旨在符合COCO指标，并且根据定义，分类器在推理中被整合。 Cascade RCNN旨在实现高质量检测，最后阶段的高质量检测器本身可以获得最先进的检测性能。由于所有这些，对于大多数质量水平，图3（d）的积分损失检测器通常不能胜过图3（a）的香草检测器。这与Cascade R-CNN不同，后者可以有显着的改进。
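To see why the terms of the integral loss operate on different numbers of positives, the sketch below sums one classification loss per threshold in U and counts the positives each term receives from the same, unresampled proposal set. Cross-entropy is an assumed choice for the per-threshold loss, and the IoU values and function names are invented for the example.

```python
import math
from typing import List, Sequence

# The threshold set given in the text: 0.5, 0.55, ..., 0.75.
U = (0.5, 0.55, 0.6, 0.65, 0.7, 0.75)


def integral_loss(probs_per_u: Sequence[Sequence[float]], iou_with_gt: float,
                  gt_class: int, thresholds: Sequence[float] = U) -> float:
    """Sum over u of the classification loss of classifier h_u on label y_u."""
    total = 0.0
    for u, class_probs in zip(thresholds, probs_per_u):
        y_u = gt_class if iou_with_gt >= u else 0  # label under threshold u
        total += -math.log(class_probs[y_u])       # assumed cross-entropy term
    return total


def positives_per_threshold(ious: Sequence[float],
                            thresholds: Sequence[float] = U) -> List[int]:
    """How many proposals count as positive for each classifier h_u."""
    return [sum(i >= u for i in ious) for u in thresholds]


# One proposal scored by six toy classifiers that all output uniform probabilities.
loss = integral_loss([[0.5, 0.5]] * len(U), iou_with_gt=0.65, gt_class=1)

# Four toy proposals shared by all classifiers, with no resampling between them.
counts = positives_per_threshold([0.52, 0.58, 0.66, 0.8])
```

The shrinking `counts` list is the imbalance described above: every classifier sees the same proposals, so the high-threshold classifiers are left with few positives.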

5 INSTANCE SEGMENTATION

Instance segmentation has become popular in the recent past [7], [25], [39]. It aims to predict pixel-level segmentation for each instance, in addition to determining its object class. This is more difficult than object detection, which only predicts a bounding box (plus class) per instance. In general, instance segmentation is implemented in addition to object detection, and a stronger object detector usually leads to improved instance segmentation. The most popular instance segmentation method is arguably the Mask R-CNN [25]. Like the Cascade R-CNN, it is a variant on the two-stage detector. In this section, we extend the Cascade R-CNN architecture to the instance segmentation task, by adding a segmentation branch similar to that of the Mask R-CNN.

实例分割在最近变得流行[7]，[25]，[39]。除了确定其对象类之外，它还旨在预测每个实例的像素级分割。这比对象检测更困难，对象检测仅预测每个实例的边界框（加上类）。通常，除了对象检测之外还实现实例分割，并且更强的对象检测器通常导致改进的实例分割。最流行的实例分割方法可以说是Mask R-CNN [25]。与Cascade R-CNN一样，它是两级探测器的变体。在本节中，我们通过添加类似于Mask R-CNN的分段分支，将Cascade R-CNN架构扩展到实例分段任务。

5.1 Mask R-CNN

The Mask R-CNN [25] extends the Faster R-CNN by adding a segmentation branch in parallel to the existing detection branch during training. It has the architecture of Fig. 6 (a). The training instances are the positive examples also used to train the detection task. At inference, object detections are complemented with segmentation masks, for all detected objects.

Mask R-CNN [25]通过在训练期间与现有检测分支并行地添加分段分支来扩展更快的R-CNN。它具有图6（a）的架构。训练实例是用于训练检测任务的正例。在推理中，对于所有检测到的对象，对象检测用分段掩码补充。

5.2 Cascade Mask R-CNN

In the Mask R-CNN, the segmentation branch is inserted in parallel to the detection branch. However, the Cascade R-CNN has multiple detection branches. This raises the questions of 1) where to add the segmentation branch and 2) how many segmentation branches to add. We consider three strategies for mask prediction in the Cascade R-CNN. The first two strategies address the first question, adding a single mask prediction head at either the first or last stage of the Cascade R-CNN, as shown in Fig. 6 (b) and (c), respectively. Since the instances used to train the segmentation branch are the positives of the detection branch, their number varies in these two strategies. As shown in Fig. 4, placing the segmentation head later on the cascade leads to more examples. However, because segmentation is a pixel-wise operation, a large number of highly overlapping instances is not necessarily as helpful as for object detection, which is a patch-based operation. The third strategy addresses the second question, adding a segmentation branch to each cascade stage, as shown in Fig. 6 (d). This maximizes the diversity of samples used to learn the mask prediction task.

在掩码R-CNN中，分段分支与检测分支并行插入。然而，Cascade R-CNN具有多个检测分支。这提出了以下问题：1）添加分段分支的位置和2）要添加的分段分支数量。我们在Cascade R-CNN中考虑了三种掩模预测策略。前两个策略解决了第一个问题，在级联R-CNN的第一级或最后一级添加了单个掩码预测头，分别如图6（b）和（c）所示。由于用于训练分割分支的实例是检测分支的正数，因此它们的数量在这两种策略中不同。如图4所示，稍后将分段头放置在级联上导致更多示例。然而，因为分割是逐像素操作，所以大量高度重叠的实例不一定与对象检测一样有用，对象检测是基于补丁的操作。第三个策略解决了第二个问题，为每个级联阶段添加了一个分支分支，如图6（d）所示。这最大化了用于学习掩模预测任务的样本的多样性。

At inference time, all three strategies predict the segmentation masks on the patches produced by the final object detection stage, irrespective of the cascade stage on which the segmentation mask is implemented and how many segmentation branches there are. The final mask prediction is obtained from the single segmentation branch for the architectures of Fig. 6 (b) and (c), and from the ensemble of three segmentation branches for the architecture of Fig. 6 (d). Our experiments show that these architectures of the Cascade Mask R-CNN outperform the Mask R-CNN.

在推理时，所有三种策略都预测最终对象检测阶段产生的片上的分割掩模，而不管实现分割掩模的级联阶段和有多少分割分支。最终的掩模预测是从图6（b）和（c）的体系结构的单个分割分支获得的，并且是从图6（d）的体系结构的三个分割分支的集合中获得的。我们的实验表明，Cascade Mask R-CNN的这些架构优于Mask R-CNN
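For the architecture with one segmentation branch per stage, the final mask comes from an ensemble of the three branches. The text does not say how the ensemble is formed, so the sketch below assumes a plain average of per-pixel mask probabilities followed by a 0.5 threshold. The function name and the tiny 1x2 masks are made up for illustration.

```python
from typing import List, Sequence

Mask = List[List[float]]  # per-pixel foreground probabilities, rows x columns


def ensemble_masks(mask_probs: Sequence[Mask], threshold: float = 0.5) -> List[List[int]]:
    """Average the masks predicted by several branches, then binarize.

    Averaging is an assumption of this sketch, not a detail given in the text.
    All masks are expected to be predicted on the same final-stage box.
    """
    n = len(mask_probs)
    rows, cols = len(mask_probs[0]), len(mask_probs[0][0])
    averaged = [
        [sum(m[r][c] for m in mask_probs) / n for c in range(cols)]
        for r in range(rows)
    ]
    return [[1 if p >= threshold else 0 for p in row] for row in averaged]


# Three toy branch outputs for the same detected box.
branch_outputs = [
    [[0.9, 0.2]],
    [[0.6, 0.4]],
    [[0.3, 0.3]],
]
final_mask = ensemble_masks(branch_outputs)
```

With a single branch, as in the first two strategies, the same function reduces to thresholding that branch's output.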

6 EXPERIMENTAL RESULTS

7 CONCLUSION
