The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework [11], the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network. Through a sequence of advances [10, 28, 20, 14], this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark [21].
Despite the success of two-stage detectors, a natural question to ask is: could a simple one-stage detector achieve similar accuracy? One-stage detectors are applied over a regular, dense sampling of object locations, scales, and aspect ratios. Recent work on one-stage detectors, such as YOLO [26, 27] and SSD [22, 9], demonstrates promising results, yielding faster detectors with accuracy within 10-40% relative to state-of-the-art two-stage methods.
This paper pushes the envelope further: we present a one-stage object detector that, for the first time, matches the state-of-the-art COCO AP of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) [20] or Mask R-CNN [14] variants of Faster R-CNN [28]. To achieve this result, we identify class imbalance during training as the main obstacle impeding one-stage detectors from achieving state-of-the-art accuracy and propose a new loss function that eliminates this barrier.
Class imbalance is addressed in R-CNN-like detectors by a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search [35], EdgeBoxes [39], DeepMask [24, 25], RPN [28]) rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples. In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio (1:3), or online hard example mining (OHEM) [31], are performed to maintain a manageable balance between foreground and background.
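The fixed-ratio heuristic used in the second stage can be sketched as follows. This is an illustrative sketch only: the function name `sample_minibatch`, the batch size of 256, the label encoding (1 = foreground, 0 = background), and the proposal counts are assumptions made for the example, not details given in the text.

```python
import random

def sample_minibatch(labels, batch_size=256, fg_fraction=0.25, seed=0):
    """Pick candidate indices at a fixed foreground:background ratio.

    labels: list of 0 (background) or 1 (foreground), one per candidate.
    fg_fraction=0.25 corresponds to the 1:3 ratio mentioned in the text.
    """
    rng = random.Random(seed)
    fg = [i for i, y in enumerate(labels) if y == 1]
    bg = [i for i, y in enumerate(labels) if y == 0]
    # Cap foreground at its quota, then fill the rest with background.
    num_fg = min(len(fg), int(batch_size * fg_fraction))
    num_bg = min(len(bg), batch_size - num_fg)
    return rng.sample(fg, num_fg), rng.sample(bg, num_bg)

# Hypothetical proposal set: 2000 candidates, 100 of them foreground.
labels = [1] * 100 + [0] * 1900
fg_idx, bg_idx = sample_minibatch(labels)
print(len(fg_idx), len(bg_idx))  # 64 192
```

Most of the 1900 background candidates are simply discarded, which is how this heuristic keeps the two classes in balance.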
In contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image. In practice this often amounts to enumerating ∼100k locations that densely cover spatial positions, scales, and aspect ratios. While similar sampling heuristics may also be applied, they are inefficient as the training procedure is still dominated by easily classified background examples. This inefficiency is a classic problem in object detection that is typically addressed via techniques such as bootstrapping [33, 29] or hard example mining [37, 8, 31].
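To see how a dense detector arrives at a figure on the order of 100k candidates, the count can be estimated with a short calculation. The image size, the five feature-map strides, and the nine anchors per location below are assumptions chosen for illustration; this section does not specify them.

```python
import math

def count_anchors(height, width, strides=(8, 16, 32, 64, 128), anchors_per_cell=9):
    """Count candidate boxes on a multi-scale grid (all settings are assumed)."""
    total = 0
    for s in strides:
        # One grid cell per stride-sized step in each direction.
        cells = math.ceil(height / s) * math.ceil(width / s)
        total += cells * anchors_per_cell
    return total

print(count_anchors(640, 800))  # 95985
```

Under these assumptions a single 640x800 image yields roughly 96k candidate boxes, of which only a handful overlap an object.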
In this paper, we propose a new loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance. The loss function is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases (see Figure 1). Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. Experiments show that our proposed Focal Loss enables us to train a high-accuracy, one-stage detector that significantly outperforms the alternatives of training with the sampling heuristics or hard example mining, the previous state-of-the-art techniques for training one-stage detectors. Finally, we note that the exact form of the focal loss is not crucial, and we show other instantiations can achieve similar results.
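A minimal sketch of such a dynamically scaled cross entropy is given below. It assumes the scaling factor takes the form (1 - p_t)^gamma with gamma = 2, where p_t is the probability the model assigns to the correct class; this section does not state the exact form, so the formula and the default value of `gamma` should be read as assumptions.

```python
import math

def cross_entropy(p_t):
    """Standard cross entropy for the probability of the correct class."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Cross entropy scaled by (1 - p_t)**gamma; gamma=0 recovers plain CE."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)

for p_t in (0.1, 0.5, 0.9, 0.99):
    print(f"p_t={p_t:.2f}  CE={cross_entropy(p_t):.4f}  FL={focal_loss(p_t):.6f}")
```

With gamma = 2, an example classified with p_t = 0.9 has its loss scaled by about 0.01, while a hard example with p_t = 0.1 keeps a factor of 0.81, so the scaling factor decays toward zero exactly as confidence in the correct class grows.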
To demonstrate the effectiveness of the proposed focal loss, we design a simple one-stage object detector called RetinaNet, named for its dense sampling of object locations in an input image. Its design features an efficient in-network feature pyramid and use of anchor boxes. It draws on a variety of recent ideas from [22, 6, 28, 20]. RetinaNet is efficient and accurate; our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of 39.1 while running at 5 fps, surpassing the previously best published single-model results from both one-stage and two-stage detectors (see Figure 2).
Classic Object Detectors: The sliding-window paradigm, in which a classifier is applied on a dense image grid, has a long and rich history. One of the earliest successes is the classic work of LeCun et al. who applied convolutional neural networks to handwritten digit recognition [19, 36]. Viola and Jones [37] used boosted object detectors for face detection, leading to widespread adoption of such models. The introduction of HOG [4] and integral channel features [5] gave rise to effective methods for pedestrian detection. DPMs [8] helped extend dense detectors to more general object categories and had top results on PASCAL [7] for many years. While the sliding-window approach was the leading detection paradigm in classic computer vision, with the resurgence of deep learning [18], two-stage detectors, described next, quickly came to dominate object detection.
Two-stage Detectors: The dominant paradigm in modern object detection is based on a two-stage approach. As pioneered in the Selective Search work [35], the first stage generates a sparse set of candidate proposals that should contain all objects while filtering out the majority of negative locations, and the second stage classifies the proposals into foreground classes / background. R-CNN [11] upgraded the second-stage classifier to a convolutional network, yielding large gains in accuracy and ushering in the modern era of object detection. R-CNN was improved over the years, both in terms of speed [15, 10] and by using learned object proposals [6, 24, 28]. Region Proposal Networks (RPN) integrated proposal generation with the second-stage classifier into a single convolutional network, forming the Faster R-CNN framework [28]. Numerous extensions to this framework have been proposed, e.g. [20, 31, 32, 16, 14].
One-stage Detectors: OverFeat [30] was one of the first modern one-stage object detectors based on deep networks. More recently SSD [22, 9] and YOLO [26, 27] have renewed interest in one-stage methods. These detectors have been tuned for speed but their accuracy trails that of two-stage methods. SSD has a 10-20% lower AP, while YOLO focuses on an even more extreme speed/accuracy trade-off (see Figure 2). Recent work showed that two-stage detectors can be made fast simply by reducing input image resolution and the number of proposals, but one-stage methods trailed in accuracy even with a larger compute budget [17]. In contrast, the aim of this work is to understand if one-stage detectors can match or surpass the accuracy of two-stage detectors while running at similar or faster speeds.
The design of our RetinaNet detector shares many similarities with previous dense detectors, in particular the concept of 'anchors' introduced by RPN [28] and the use of feature pyramids as in SSD [22] and FPN [20]. We emphasize that our simple detector achieves top results not based on innovations in network design but due to our novel loss.
Class Imbalance: Both classic one-stage object detection methods, like boosted detectors [37, 5] and DPMs [8], and more recent methods, like SSD [22], face a large class imbalance during training. These detectors evaluate 10^4-10^5 candidate locations per image but only a few locations contain objects. This imbalance causes two problems: (1) training is inefficient as most locations are easy negatives that contribute no useful learning signal; (2) en masse, the easy negatives can overwhelm training and lead to degenerate models. A common solution is to perform some form of hard negative mining [33, 37, 8, 31, 22] that samples hard examples during training or more complex sampling/reweighing schemes [2]. In contrast, we show that our proposed focal loss naturally handles the class imbalance faced by a one-stage detector and allows us to efficiently train on all examples without sampling and without easy negatives overwhelming the loss and computed gradients.
Robust Estimation: There has been much interest in designing robust loss functions (e.g., Huber loss [13]) that reduce the contribution of outliers by down-weighting the loss of examples with large errors (hard examples). In contrast, rather than addressing outliers, our focal loss is designed to address class imbalance by down-weighting inliers (easy examples) such that their contribution to the total loss is small even if their number is large. In other words, the focal loss performs the opposite role of a robust loss: it focuses training on a sparse set of hard examples.
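For contrast, the behaviour of a robust loss can be illustrated with the standard Huber loss, which grows only linearly once the error exceeds a threshold and therefore down-weights large errors relative to a squared loss. The threshold `delta=1.0` and the sample residuals are assumptions chosen for the illustration.

```python
def squared_loss(residual):
    """Plain squared error, which grows quadratically with the residual."""
    return 0.5 * residual * residual

def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear beyond delta."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

for r in (0.5, 2.0, 10.0):
    print(r, squared_loss(r), huber_loss(r))
# 0.5 0.125 0.125
# 2.0 2.0 1.5
# 10.0 50.0 9.5
```

The largest residual is penalized far less by the Huber loss than by the squared loss, which is the opposite of the focal loss: there it is the small-error (easy) examples whose contribution is reduced.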
The Focal Loss is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000). We introduce the focal loss starting from the cross entropy (CE) loss for binary classification:
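Written with the symbols y and p used in this section, the standard binary cross entropy takes the following form (a reconstruction from the surrounding definitions):

```latex
\mathrm{CE}(p, y) =
\begin{cases}
-\log(p) & \text{if } y = 1 \\
-\log(1 - p) & \text{otherwise.}
\end{cases}
```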
In the above y ∈ {±1} specifies the ground-truth class and p ∈ [0, 1] is the model's estimated probability for the class with label y = 1. For notational convenience, we define p_t:
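Consistent with the roles of y and p just described, p_t can be taken as the probability assigned to the ground-truth class, which lets the cross entropy be written in a single term (again a reconstruction from the surrounding text):

```latex
p_t =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{otherwise,}
\end{cases}
\qquad
\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t).
```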
The CE loss can be seen as the blue (top) curve in Figure 1. One notable property of this loss, which can be easily seen in its plot, is that even examples that are easily classified (p_t ≫ 0.5) incur a loss with non-trivial magnitude. When summed over a large number of easy examples, these small loss values can overwhelm the rare class.
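A small numerical illustration makes the summation effect concrete. The batch composition (100,000 easy negatives at p_t = 0.9 against 100 hard positives at p_t = 0.1, a 1:1000 ratio) is a hypothetical example, not a measurement from the paper.

```python
import math

def cross_entropy(p_t):
    """Cross entropy for the probability assigned to the correct class."""
    return -math.log(p_t)

# Hypothetical batch with a 1:1000 foreground-to-background ratio.
num_easy, p_t_easy = 100_000, 0.9   # easy, well-classified negatives
num_hard, p_t_hard = 100, 0.1       # hard, misclassified positives

easy_total = num_easy * cross_entropy(p_t_easy)
hard_total = num_hard * cross_entropy(p_t_hard)
print(f"easy total: {easy_total:.0f}, hard total: {hard_total:.0f}")
print(f"easy / hard: {easy_total / hard_total:.1f}")
```

Each easy negative contributes only about 0.1 to the loss, yet in aggregate the easy negatives outweigh the hard positives by a factor of more than 40 in this example.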
In this work, we identify class imbalance as the primary obstacle preventing one-stage object detectors from surpassing top-performing, two-stage methods. To address this, we propose the focal loss which applies a modulating term to the cross entropy loss in order to focus learning on hard negative examples. Our approach is simple and highly effective. We demonstrate its efficacy by designing a fully convolutional one-stage detector and report extensive experimental analysis showing that it achieves state-of-the-art accuracy and speed. Source code is available at https://github.com/facebookresearch/Detectron [12].
Figure 3. The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) [20] backbone on top of a feedforward ResNet architecture [16] (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d). The network design is intentionally simple, which enables this work to focus on a novel focal loss function that eliminates the accuracy gap between our one-stage detector and state-of-the-art two-stage detectors like Faster R-CNN with FPN [20] while running at faster speeds.