Paper reading: FCOS: Fully Convolutional One-Stage Object Detection



We propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogous to semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor-box free, as well as proposal free. By eliminating the predefined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes, such as calculating overlaps during training, and significantly reduces the training memory footprint. More importantly, we also avoid all hyper-parameters related to anchor boxes, to which the final detection performance is often very sensitive. With non-maximum suppression (NMS) as the only post-processing step, our detector FCOS outperforms previous anchor-based one-stage detectors with the advantage of being much simpler. For the first time, we demonstrate a much simpler and more flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks.

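In short: the paper proposes FCOS, a fully convolutional one-stage detector that solves object detection by per-pixel prediction, much like semantic segmentation. Today's most popular detectors, whether one-stage (RetinaNet, SSD, YOLOv3) or two-stage (Faster R-CNN), mostly rely on pre-defined anchor boxes. FCOS needs neither anchors nor proposals, so it avoids the anchor-related computation (such as overlap calculation during training), reduces training memory, and removes all anchor-related hyper-parameters, to which detection performance is usually very sensitive. With NMS as its only post-processing step, FCOS outperforms earlier anchor-based one-stage detectors while being simpler, and the authors hope it can serve as a simple, strong alternative for many other instance-level tasks.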


1. Introduction

Object detection is a fundamental yet challenging task in computer vision, which requires the algorithm to predict a bounding box with a category label for each instance of interest in an image. All current mainstream detectors such as Faster R-CNN [20], SSD [15] and YOLOv2, v3 [19] rely on a set of pre-defined anchor boxes, and it has long been believed that the use of anchor boxes is the key to detectors' success. Despite their great success, it is important to note that anchor-based detectors suffer from some drawbacks:


1. As shown in [12, 20], detection performance is sensitive to the sizes, aspect ratios and number of anchor boxes. For example, in RetinaNet [12], varying these hyper-parameters affects the performance by up to 4% in AP on the COCO benchmark [13]. As a result, these hyper-parameters need to be carefully tuned in anchor-based detectors.


2. Even with careful design, because the scales and aspect ratios of anchor boxes are kept fixed, detectors have difficulty dealing with object candidates with large shape variations, particularly for small objects. The pre-defined anchor boxes also hamper the generalization ability of detectors, as they need to be redesigned for new detection tasks with different object sizes or aspect ratios.


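3. To achieve a high recall rate, an anchor-based detector has to place anchor boxes densely on the input image (e.g., more than 180K anchor boxes in feature pyramid networks (FPN) [11] for an image whose shorter side is 800). Most of these anchor boxes are labelled as negative samples during training. The excessive number of negative samples aggravates the imbalance between positive and negative samples in training.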

4. An excessively large number of anchor boxes also significantly increases the amount of computation and the memory footprint when computing the intersection-over-union (IOU) scores between all anchor boxes and ground-truth boxes during training.
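
To make the cost in drawback 4 concrete, here is a minimal NumPy sketch of the pairwise IOU computation between anchors and ground-truth boxes. The function name `pairwise_iou` and the (x0, y0, x1, y1) box layout are assumptions chosen for illustration, not code from the paper.

```python
import numpy as np

def pairwise_iou(anchors, gts):
    """IOU between every anchor and every ground-truth box.

    anchors: (A, 4) array, gts: (G, 4) array, both as (x0, y0, x1, y1).
    Returns an (A, G) array, so memory grows with A * G.
    """
    anchors = np.asarray(anchors, dtype=np.float64)
    gts = np.asarray(gts, dtype=np.float64)
    # Intersection rectangle via broadcasting: (A, 1, 2) against (1, G, 2).
    top_left = np.maximum(anchors[:, None, :2], gts[None, :, :2])
    bottom_right = np.minimum(anchors[:, None, 2:], gts[None, :, 2:])
    wh = np.clip(bottom_right - top_left, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    union = area_a[:, None] + area_g[None, :] - inter
    # Guard against division by zero for degenerate boxes.
    return inter / np.maximum(union, 1e-12)
```

With on the order of 180K anchors and a few dozen ground-truth boxes, this matrix has millions of entries per image, which is the overhead an anchor-free detector avoids.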

Recently, fully convolutional networks (FCNs) [16] have achieved tremendous success in dense prediction tasks such as semantic segmentation [16], depth estimation [14], keypoint detection [2], and counting [1]. As one of the high-level vision tasks, object detection might be the only one deviating from the neat fully convolutional per-pixel prediction framework, mainly due to the use of anchor boxes. It is natural to ask a question: Can we solve object detection in the neat per-pixel prediction fashion, analogous to FCN for semantic segmentation, for example? Thus those fundamental vision tasks can be unified in (almost) one single framework. We show that the answer is affirmative. Moreover, we demonstrate that, for the first time, the much simpler FCN-based detector achieves even better performance than its anchor-based counterparts.

In the literature, some works attempted to leverage FCN-based frameworks for object detection, such as DenseBox [9] and UnitBox [24]. Specifically, these FCN-based frameworks directly predict a 4D vector plus a class category at each spatial location on a level of feature maps. As shown in Fig. 1 (left), the 4D vector depicts the relative offsets from the four sides of a bounding box to the location. These frameworks are similar to the FCNs for semantic segmentation, except that each location is required to regress a 4D continuous vector. However, to handle bounding boxes with different sizes, DenseBox [9] resizes training images to a fixed scale. Thus DenseBox has to perform detection on image pyramids, which is against FCN's philosophy of computing all convolutions once. Besides, and more significantly, these methods are mainly used in special-domain object detection such as scene text detection [25] or face detection [24, 9], since it is believed that they do not work well when applied to generic object detection with highly overlapped bounding boxes. As shown in Fig. 1 (right), the highly overlapped bounding boxes result in an intractable ambiguity during training: it is not clear which bounding box the pixels in the overlapped regions should regress.

In the sequel, we take a closer look at the issue and show that with FPN this ambiguity can be largely eliminated. As a result, our method can already obtain detection accuracy comparable with that of traditional anchor-based detectors. Furthermore, we observe that our method may produce a number of low-quality predicted bounding boxes at locations that are far from the center of a target object. In order to suppress these low-quality detections, we introduce a novel center-ness branch (only one layer) to predict the deviation of a pixel from the center of its corresponding bounding box, as defined in Eq. (3). This score is then used to down-weight low-quality detected bounding boxes and merge the detection results in NMS. The simple yet effective center-ness branch allows the FCN-based detector to outperform anchor-based counterparts under exactly the same training and testing settings.
Eq. (3) is:
$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$$

This new detection framework enjoys the following advantages.
Detection is now unified with many other FCN-solvable tasks such as semantic segmentation, making it easier to re-use ideas from those tasks.
Detection becomes proposal free and anchor free, which significantly reduces the number of design parameters. The design parameters typically need heuristic tuning, and many tricks are involved in order to achieve good performance. Therefore, our new detection framework makes the detector, particularly its training, considerably simpler. Moreover, by eliminating the anchor boxes, our new detector completely avoids the complex IOU computation and matching between anchor boxes and ground-truth boxes during training and reduces the total training memory footprint by a factor of 2 or so.
The proposed detector can be immediately extended to solve other vision tasks with minimal modification, including instance segmentation and key-point detection. We believe that this new method can be the new baseline for many instance-wise prediction problems.

2. Related Work

Anchor-based Detectors. Anchor-based detectors inherit the ideas from traditional sliding-window and proposal-based detectors such as Fast R-CNN [5]. In anchor-based detectors, the anchor boxes can be viewed as pre-defined sliding windows or proposals, which are classified as positive or negative patches, with an extra offset regression to refine the prediction of bounding box locations. Therefore, the anchor boxes in these detectors may be viewed as training samples. Unlike previous detectors like Fast R-CNN, which compute image features for each sliding window/proposal repeatedly, anchor boxes make use of the feature maps of convolutional neural networks (CNNs) and avoid repeated feature computation, speeding up detection dramatically.

In short: anchor boxes act as pre-defined sliding windows or proposals that are classified as positive or negative and then refined by offset regression, so they can be viewed as training samples. Because they reuse the CNN feature maps, they avoid recomputing features for every window or proposal. Faster R-CNN (in its RPNs) [20], SSD [15] and YOLOv2 [18] popularized the anchor-box design, which has become the convention in modern detectors.

However, as described above, anchor boxes result in excessively many hyper-parameters, which typically need to be carefully tuned in order to achieve good performance. Besides the hyper-parameters of anchor shapes described above, anchor-based detectors also need other hyper-parameters to label each anchor box as a positive, ignored or negative sample. Previous works often employ the intersection over union (IOU) between anchor boxes and ground-truth boxes to label them (e.g., a positive anchor if its IOU is in [0.5, 1]). These hyper-parameters have shown a great impact on the final accuracy and require heuristic tuning. Meanwhile, these hyper-parameters are specific to detection tasks, making detection tasks deviate from the neat fully convolutional network architectures used in other dense prediction tasks such as semantic segmentation.

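In short: beyond the anchor shapes themselves, anchor-based detectors need further hyper-parameters, such as the IOU range [0.5, 1] for positive anchors, to label each anchor as positive, ignored or negative. These settings strongly affect final accuracy, require heuristic tuning, and are specific to detection, which pulls detection away from the plain fully convolutional design used for semantic segmentation and other dense prediction tasks.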

Anchor-free Detectors. The most popular anchor-free detector might be YOLOv1 [17]. Instead of using anchor boxes, YOLOv1 predicts bounding boxes at points near the center of objects. Only the points near the center are used, since they are considered to be able to produce higher-quality detections. However, since only points near the center are used to predict bounding boxes, YOLOv1 suffers from low recall, as mentioned in YOLOv2 [18]. As a result, YOLOv2 [18] makes use of anchor boxes as well. Compared to YOLOv1, FCOS takes advantage of all points in a ground-truth bounding box to predict the bounding boxes, and the low-quality detected bounding boxes are suppressed by the proposed "center-ness" branch. As a result, FCOS is able to provide recall comparable with that of anchor-based detectors, as shown in our experiments.


CornerNet [10] is a recently proposed one-stage anchor-free detector, which detects a pair of corners of a bounding box and groups them to form the final detected bounding box. CornerNet requires much more complicated post-processing to group the pairs of corners belonging to the same instance. An extra distance metric is learned for the purpose of grouping.

Another family of anchor-free detectors, such as [24], is based on DenseBox [9]. This family of detectors has been considered unsuitable for generic object detection due to the difficulty in handling overlapping bounding boxes and the low recall. In this work, we show that both problems can be largely alleviated with multi-level FPN prediction. Moreover, we also show that, together with our proposed center-ness branch, the much simpler detector can achieve even better detection performance than its anchor-based counterparts.

Figure 2: The network architecture of FCOS, where C3, C4 and C5 denote the feature maps of the backbone network and P3 to P7 are the feature levels used for the final prediction. H × W is the height and width of the feature maps. "/s" (s = 8, 16, ..., 128) is the down-sampling ratio of the feature maps at that level to the input image. As an example, all the numbers are computed with an 800 × 1024 input.

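As readers have probably noticed, the feature pyramid here is not the standard FPN structure, and the P6 and P7 levels seem somewhat redundant. The experiments compare against RetinaNet (Table 7) to show the advantage of the FCOS output head design, but in my opinion that comparison is not very convincing.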

3. Our Approach

In this section, we first reformulate object detection in a per-pixel prediction fashion. Next, we show how we make use of multi-level prediction to improve the recall and resolve the ambiguity resulting from overlapped bounding boxes in training. Finally, we present our proposed center-ness branch, which helps suppress the low-quality detected bounding boxes and improves the overall performance by a large margin.

3.1. Fully Convolutional One-Stage Object Detector

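Anchor-based detectors treat each location on the input image as the center of anchor boxes and regress the target bounding box for those anchor boxes. In contrast, we directly regress the target bounding box for each location. In other words, our detector views locations, rather than anchor boxes, as training samples, which is the same as in FCNs for semantic segmentation.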

Put differently: an anchor-based detector regresses anchors to the ground truth, whereas this method regresses locations to the ground truth directly, so locations replace anchors as the training samples.

It is worth noting that FCOS can leverage as many foreground samples as possible to train the regressor. This is different from anchor-based detectors, which only consider the anchor boxes with a sufficiently high IOU with ground-truth boxes as positive samples. We argue that this may be one of the reasons that FCOS outperforms its anchor-based counterparts.

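In short: FCOS trains the regressor on every foreground location, whereas anchor-based detectors keep only the anchor boxes whose IOU with a ground-truth box is high enough. The authors suggest this may be one reason FCOS performs better.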

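We define the training loss function as follows:
$$L(\{p_{x,y}\}, \{t_{x,y}\}) = \frac{1}{N_{\text{pos}}} \sum_{x,y} L_{\text{cls}}(p_{x,y}, c^*_{x,y}) + \frac{\lambda}{N_{\text{pos}}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} L_{\text{reg}}(t_{x,y}, t^*_{x,y})$$
where L_cls is the focal loss and L_reg is the IOU loss as in UnitBox [24]. N_pos is the number of positive samples and λ is the balance weight for L_reg. The indicator function is 1 if c*_{x,y} > 0 and 0 otherwise.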

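Inference with FCOS is straightforward. Given an input image, we forward it through the network and obtain the classification scores p_{x,y} and the regression prediction t_{x,y} for each location (x, y) on the feature maps F_i. We then choose the locations with p_{x,y} > 0.05 as positive samples and invert Eq. (1) to obtain the predicted bounding boxes.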
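
The inference step above can be sketched as follows. Eq. (1) of the paper defines the targets as l* = x − x0, t* = y − y0, r* = x1 − x and b* = y1 − y, so inverting it recovers the box corners from a location and its predicted distances. The function name `decode_boxes` and the array layouts are assumptions for illustration; the 0.05 threshold is the one quoted above.

```python
import numpy as np

def decode_boxes(points, ltrb, scores, score_thresh=0.05):
    """Turn per-location predictions into candidate boxes.

    points: (N, 2) image-space (x, y) of each location.
    ltrb:   (N, 4) predicted distances (l, t, r, b) to the box sides.
    scores: (N, C) per-class classification scores.
    Returns boxes (K, 4) as (x0, y0, x1, y1), class ids (K,), scores (K,).
    """
    points = np.asarray(points, dtype=np.float64)
    ltrb = np.asarray(ltrb, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    # Keep every (location, class) pair whose score passes the threshold.
    loc_idx, cls_idx = np.nonzero(scores > score_thresh)
    x, y = points[loc_idx, 0], points[loc_idx, 1]
    l, t, r, b = (ltrb[loc_idx, k] for k in range(4))
    # Invert Eq. (1): x0 = x - l, y0 = y - t, x1 = x + r, y1 = y + b.
    boxes = np.stack([x - l, y - t, x + r, y + b], axis=1)
    return boxes, cls_idx, scores[loc_idx, cls_idx]
```

The surviving boxes would then go through NMS, which is not shown here.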

3.2. Multi-level Prediction with FPN for FCOS

Here we show how two possible issues of the proposed FCOS can be resolved with multi-level prediction with FPN [11]. 1) The large stride (e.g., 16) of the final feature maps in a CNN can result in a relatively low best possible recall (BPR). For anchor-based detectors, low recall rates due to the large stride can be compensated to some extent by lowering the required IOU scores for positive anchor boxes. For FCOS, at first glance one may think that the BPR can be much lower than for anchor-based detectors, because it is impossible to recall an object that no location on the final feature maps encodes due to the large stride. Here, we empirically show that even with a large stride, FCN-based FCOS is still able to produce a good BPR, and it can even be better than the BPR of the anchor-based detector RetinaNet [12] in the official implementation Detectron [6] (refer to Table 1). Therefore, the BPR is actually not a problem for FCOS. Moreover, with multi-level FPN prediction [11], the BPR can be improved further to match the best BPR the anchor-based RetinaNet can achieve. 2) Overlaps in ground-truth boxes can cause intractable ambiguity during training, i.e., which bounding box should a location in the overlap regress? This ambiguity results in degraded performance of FCN-based detectors. In this work, we show that the ambiguity can be largely resolved with multi-level prediction, and the FCN-based detector can obtain on par, sometimes even better, performance compared with anchor-based ones.

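The two issues, restated:

  1. The large stride of the final CNN feature maps can lower the best possible recall (BPR), i.e., the upper bound of the recall a detector can reach. Anchor-based detectors can partly compensate by lowering the IOU required for positive anchor boxes. For FCOS one might expect a much lower BPR, because an object cannot be recalled if no location on the final feature maps falls on it. Experiments show, however, that FCOS still reaches a good BPR even with a large stride, better than that of RetinaNet [12] in the official implementation, so BPR is not actually a problem for FCOS. With multi-level FPN prediction the BPR improves further and matches the best BPR RetinaNet can achieve.

  2. Overlapping ground-truth boxes make it ambiguous which box a location inside the overlap should regress, and this ambiguity degrades FCN-based detectors. Multi-level prediction largely resolves it, so the FCN-based detector performs on par with, and sometimes better than, anchor-based ones.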

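Following FPN, we detect objects of different sizes on different levels of feature maps. Specifically, we use five levels of feature maps, defined as {P3, P4, P5, P6, P7}.
P3, P4 and P5 are produced from the backbone feature maps C3, C4 and C5, each followed by a 1×1 convolution. As shown in Fig. 2, P6 and P7 are produced by applying one convolutional layer with stride 2 on P5 and P6, respectively. As a result, P3, P4, P5, P6 and P7 have strides 8, 16, 32, 64 and 128, respectively.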
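
As a quick check of the strides listed above, the sketch below computes the spatial size of each level for the 800 × 1024 example input. Rounding up with ceiling division is an assumption about how padding is handled, and the helper name `feature_map_sizes` is hypothetical.

```python
import math

# Stride of each feature level relative to the input image.
STRIDES = {"P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}

def feature_map_sizes(height, width, strides=STRIDES):
    """Return {level: (H, W)} for an input image of the given size."""
    return {
        level: (math.ceil(height / s), math.ceil(width / s))
        for level, s in strides.items()
    }

sizes = feature_map_sizes(800, 1024)
for level, (h, w) in sizes.items():
    print(f"{level}: {h} x {w}")
```

Most locations sit on P3, since each coarser level has roughly a quarter of the locations of the level before it.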

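Unlike anchor-based detectors, which assign anchor boxes of different sizes to different feature levels, we directly limit the range of bounding box regression for each level. We first compute the regression targets l*, t*, r* and b* for each location on all feature levels. If a location satisfies max(l*, t*, r*, b*) > m_i or max(l*, t*, r*, b*) < m_{i-1}, it is set as a negative sample and no longer needs to regress a bounding box. Here m_i is the maximum distance that feature level i needs to regress; in the paper, m2, m3, m4, m5, m6 and m7 are set to 0, 64, 128, 256, 512 and ∞, respectively. Since objects of different sizes are assigned to different feature levels, and most overlaps occur between objects of considerably different sizes, multi-level prediction largely alleviates the ambiguity mentioned above and brings the FCN-based detector to the same level as anchor-based ones, as our experiments show.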
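
The range rule can be sketched as below. The bounds (0, 64, 128, 256, 512, ∞) are the m_i values reported in the FCOS paper; the names `REGRESS_RANGES` and `is_positive_on_level` are hypothetical and chosen only for this illustration.

```python
import numpy as np

# (m_{i-1}, m_i) for each level, using the values reported in the paper.
REGRESS_RANGES = {
    "P3": (0, 64), "P4": (64, 128), "P5": (128, 256),
    "P6": (256, 512), "P7": (512, np.inf),
}

def is_positive_on_level(ltrb_targets, level, ranges=REGRESS_RANGES):
    """Boolean mask of the locations that `level` is responsible for.

    ltrb_targets: (N, 4) regression targets (l*, t*, r*, b*).
    A location becomes negative when max(l*, t*, r*, b*) > m_i or
    max(l*, t*, r*, b*) < m_{i-1}; otherwise it stays positive.
    """
    lo, hi = ranges[level]
    max_dist = np.asarray(ltrb_targets, dtype=np.float64).max(axis=1)
    return (max_dist >= lo) & (max_dist <= hi)
```

Read literally, the rule keeps a target whose largest distance equals a boundary value (for example exactly 64) on both neighbouring levels.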

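Finally, following [11, 12], we share the heads between different feature levels, which not only makes the detector parameter-efficient but also improves detection performance. However, we observe that different feature levels are required to regress different size ranges (e.g., [0, 64] for P3 and [64, 128] for P4), so it is not reasonable to use identical heads for different feature levels. Therefore, instead of the standard exp(x), we use exp(s_i x) with a trainable scalar s_i that automatically adjusts the base of the exponential function for feature level P_i, which empirically improves detection performance.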
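
A minimal sketch of the per-level scaling, using plain NumPy rather than a training framework. In a real model s_i would be a learnable parameter; initialising it to 1 is an assumption made here, and the class name `ScaledExp` is hypothetical.

```python
import numpy as np

class ScaledExp:
    """Maps shared-head outputs x to positive distances via exp(s_i * x)."""

    def __init__(self, levels=("P3", "P4", "P5", "P6", "P7"), init=1.0):
        # One scalar per feature level; it would be trainable in a real model.
        self.scale = {level: float(init) for level in levels}

    def __call__(self, x, level):
        return np.exp(self.scale[level] * np.asarray(x, dtype=np.float64))
```

Because the exponential is always positive, the predicted distances (l, t, r, b) are positive whatever the raw head output is, and the scalar lets each level stretch the same head output to its own size range.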

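To summarize: FPN-based multi-scale prediction improves recall and alleviates the ambiguity caused by overlapping bounding boxes. Five feature maps P3, P4, P5, P6 and P7 with strides 8, 16, 32, 64 and 128 are used, where P6 and P7 are obtained by down-sampling P5 and P6, respectively. A target that falls outside a level's regression range is not regressed on that level, which effectively reduces the ambiguity from overlapping targets (the authors assume that overlapping targets differ considerably in size).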

3.3. Center-ness for FCOS

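Center-ness can be thought of as a center score: it indicates how close the current pixel is to the center of the ground-truth target. In the accompanying heat map, red denotes a center-ness of 1, blue denotes a center-ness of 0, and everything else takes values between 0 and 1.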

After using multi-level prediction in FCOS, there is still a performance gap between FCOS and anchor-based detectors. We observed that it is due to a lot of low-quality predicted bounding boxes produced by locations far away from the center of an object.
We propose a simple yet effective strategy to suppress these low-quality detected bounding boxes without introducing any hyper-parameters. Specifically, we add a single-layer branch, in parallel with the classification branch, to predict the "center-ness" of a location (i.e., the distance from the location to the center of the object that the location is responsible for), as shown in Fig. 2. Given the regression targets l*, t*, r* and b* for a location, the center-ness target is defined as
$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$$


We employ sqrt here to slow down the decay of the center-ness. The center-ness ranges from 0 to 1 and is thus trained with binary cross entropy (BCE) loss. The loss is added to the loss function Eq. (2). When testing, the final score (used for ranking the detected bounding boxes) is computed by multiplying the predicted center-ness with the corresponding classification score. Thus the center-ness can down-weight the scores of bounding boxes far from the center of an object. As a result, with high probability, these low-quality bounding boxes are filtered out by the final non-maximum suppression (NMS) process, improving the detection performance remarkably.

In short: the square root slows the decay of center-ness; the value lies in [0, 1], is trained with a BCE loss added to Eq. (2), and at test time is multiplied with the classification score to rank the detected boxes, so boxes predicted far from an object's center score low and are very likely removed by NMS.
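
The center-ness target and the test-time score can be sketched as follows. The sketch assumes the inputs are positive locations, so all four distances are greater than zero; the function names are hypothetical.

```python
import numpy as np

def centerness(ltrb_targets):
    """Center-ness target from (l*, t*, r*, b*), following Eq. (3).

    ltrb_targets: (N, 4) array with strictly positive entries.
    """
    ltrb = np.asarray(ltrb_targets, dtype=np.float64)
    left_right = ltrb[:, [0, 2]]
    top_bottom = ltrb[:, [1, 3]]
    ratio = (left_right.min(axis=1) / left_right.max(axis=1)) * (
        top_bottom.min(axis=1) / top_bottom.max(axis=1)
    )
    # The square root slows the decay away from the center.
    return np.sqrt(ratio)

def final_score(cls_score, predicted_centerness):
    """Test-time ranking score: classification score times center-ness."""
    return np.asarray(cls_score) * np.asarray(predicted_centerness)
```

A location at the exact center of its box gets a center-ness of 1, while a location four times closer to the left side than to the right (and vertically centered) gets 0.5, so its ranking score is halved.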

From the perspective of anchor-based detectors, which use two IOU thresholds T_low and T_high to label the anchor boxes as negative, ignored and positive samples, the center-ness can be viewed as a soft threshold. It is learned during the training of networks and does not need to be tuned. Moreover, with this strategy, our detector can still view any locations falling into a ground-truth box as positive samples, except for the ones set as negative samples in the aforementioned multi-level prediction, so as to use as many training samples as possible for the regressor.

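In short: where anchor-based detectors need the two hand-set IOU thresholds T_low and T_high, center-ness acts as a soft threshold that is learned during training and needs no manual tuning. Every location inside a ground-truth box still counts as a positive sample, except those already marked negative by multi-level prediction, so the regressor sees as many training samples as possible.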

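1. Two possible issues of FCOS can be resolved by multi-level prediction with FPN.

  • A large stride on the final feature map may lead to a relatively low BPR. For anchor-based detectors, the low recall caused by a large stride can be partly compensated by lowering the IOU threshold for positive anchors. For FCOS, one's first reaction might be that the BPR would be much lower than for anchor-based detectors, because an object with no location on the final feature map cannot be recalled. The authors used a large stride and found that FCOS still achieves a good BPR, even better than RetinaNet's, and therefore conclude that BPR is not a problem for FCOS (why?). Moreover, multi-level prediction improves the BPR further.
  • When ground-truth boxes overlap, which ground truth should a location regress?
    This is resolved by multi-level prediction, because objects of different sizes are assigned to different feature levels.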


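  • Map each point (x, y) on the feature map back to the original input image;
  • If the mapped point falls inside a ground-truth bounding box Bi, it is a positive sample and its class label c is the label of Bi; otherwise it is a negative sample with class label c = 0 (background);
  • The regression target is (l, t, r, b), i.e., the distances from the point to the left, top, right and bottom sides of the bounding box.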
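
The three steps above can be sketched as follows. The mapping (⌊s/2⌋ + x·s, ⌊s/2⌋ + y·s) from a feature-map location to the image is the one given in the FCOS paper, as is picking the smallest-area box when a location falls inside several boxes. Function names and array layouts are assumptions, and the sketch assumes at least one ground-truth box.

```python
import numpy as np

def locations_on_image(feat_h, feat_w, stride):
    """Image-space (x, y) for every feature-map location, in row-major order."""
    xs = np.arange(feat_w) * stride + stride // 2
    ys = np.arange(feat_h) * stride + stride // 2
    grid_x, grid_y = np.meshgrid(xs, ys)
    return np.stack([grid_x.ravel(), grid_y.ravel()], axis=1).astype(np.float64)

def fcos_targets(points, gt_boxes, gt_labels):
    """Per-location class label (0 = background) and (l, t, r, b) targets.

    points: (N, 2); gt_boxes: (G, 4) as (x0, y0, x1, y1); gt_labels: (G,), >= 1.
    """
    points = np.asarray(points, dtype=np.float64)
    gt_boxes = np.asarray(gt_boxes, dtype=np.float64)
    gt_labels = np.asarray(gt_labels)
    x, y = points[:, 0:1], points[:, 1:2]  # (N, 1) each, broadcast against G
    ltrb = np.stack(
        [x - gt_boxes[:, 0], y - gt_boxes[:, 1],
         gt_boxes[:, 2] - x, gt_boxes[:, 3] - y],
        axis=2,
    )  # (N, G, 4)
    inside = ltrb.min(axis=2) > 0  # point lies strictly inside the box
    areas = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    # Among the boxes containing a point, choose the one with the smallest area.
    masked_area = np.where(inside, areas[None, :], np.inf)
    best = masked_area.argmin(axis=1)
    positive = inside.any(axis=1)
    labels = np.where(positive, gt_labels[best], 0)
    targets = ltrb[np.arange(len(points)), best]
    targets[~positive] = 0.0  # background locations carry no regression target
    return labels, targets
```

For instance, with `locations_on_image(2, 2, 8)` and two nested boxes, the location inside both boxes takes the label of the smaller one, while the other locations take the label of the larger box.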



