Paper Reading: FCOS: Fully Convolutional One-Stage Object Detection

Paper: https://arxiv.org/pdf/1904.01355.pdf
Source code: https://github.com/tianzhi0549/FCOS

Abstract

We propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogous to semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor box free, as well as proposal free. By eliminating the predefined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes such as calculating overlapping during training and significantly reduces the training memory footprint. More importantly, we also avoid all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance. With the only post-processing non-maximum suppression (NMS), our detector FCOS outperforms previous anchor-based one-stage detectors with the advantage of being much simpler. For the first time, we demonstrate a much simpler and flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks.

Summary: FCOS is a fully convolutional one-stage detector that treats detection as per-pixel prediction, much like semantic segmentation. The popular one-stage detectors (RetinaNet, SSD, YOLOv3) and two-stage detectors (Faster R-CNN) mostly depend on pre-defined anchor boxes; FCOS needs neither anchors nor proposals. Dropping the anchors removes the overlap computation during training, shrinks the training memory footprint, and removes every anchor-related hyper-parameter, to which final accuracy is usually very sensitive. With NMS as its only post-processing step, FCOS beats earlier anchor-based one-stage detectors while being simpler, and the authors hope it can serve as a simple yet strong alternative for other instance-level tasks.

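The best-known anchor-free detector is actually YOLOv1. It mainly showed that a regression network can also do object detection, but its low recall left it with little practical value, so the YOLO authors built the anchor-based YOLOv2 on top of it. FCOS keeps the anchor-free design and adds three strategies: per-pixel regression, multi-scale features, and center-ness. The overall pipeline is shown in the network architecture figure (Figure 2). With these pieces, an anchor-free detector matches the mainstream anchor-based detectors.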

1. Introduction

Object detection is a fundamental yet challenging task in computer vision, which requires the algorithm to predict a bounding box with a category label for each instance of interest in an image. All current mainstream detectors such as Faster R-CNN [20], SSD [15] and YOLOv2, v3 [19] rely on a set of pre-defined anchor boxes and it has long been believed that the use of anchor boxes is the key to detectors' success. Despite their great success, it is important to note that anchor-based detectors suffer some drawbacks:

Summary: object detection must predict a bounding box and a category label for every instance of interest. The mainstream detectors (Faster R-CNN [20], SSD [15], YOLOv2/v3 [19]) all rely on pre-defined anchor boxes, long believed to be the key to their success, but anchor-based detectors have the drawbacks listed below.

As shown in [12, 20], detection performance is sensitive to the sizes, aspect ratios and number of anchor boxes. For example, in RetinaNet [12], varying these hyper-parameters affects the performance up to 4% in AP on the COCO benchmark [13]. As a result, these hyper-parameters need to be carefully tuned in anchor-based detectors.

1. Detection performance is sensitive to the size, aspect ratio, and number of anchor boxes. In RetinaNet [12], varying these hyper-parameters changes AP on the COCO benchmark [13] by up to 4%, so they have to be tuned carefully.

Even with careful design, because the scales and aspect ratios of anchor boxes are kept fixed, detectors encounter difficulties to deal with object candidates with large shape variations, particularly for small objects. The pre-defined anchor boxes also hamper the generalization ability of detectors, as they need to be redesigned on new detection tasks with different object sizes or aspect ratios.

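2. Even with careful design, the scales and aspect ratios of the anchors stay fixed, so detectors struggle with objects whose shapes vary widely, small objects in particular. Pre-defined anchors also limit generalization, because they have to be redesigned for new detection tasks with different object sizes or aspect ratios.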

3. To reach a high recall rate, an anchor-based detector has to place anchor boxes densely over the input image (for example, more than 180K anchor boxes in a feature pyramid network (FPN) [11] for an image whose shorter side is 800). Most of these anchor boxes are labelled as negative samples during training, and the excess of negatives aggravates the imbalance between positive and negative samples.

An excessively large number of anchor boxes also significantly increases the amount of computation and memory footprint when computing the intersection-over-union (IOU) scores between all anchor boxes and ground-truth boxes during training.
4. Too many anchor boxes also significantly increase computation and memory use during training, because the IOU between every anchor box and every ground-truth box has to be computed.
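The "more than 180K anchor boxes" figure in drawback 3 can be sanity-checked with a little arithmetic. The sketch below is only an illustration under assumptions that are not stated in the text: a RetinaNet-style pyramid with strides 8 to 128, 9 anchors per location, feature-map sizes rounded up, and an 800 x 1333 input (shorter side 800). The function name `count_anchors` is made up for this example.

```python
import math

def count_anchors(height, width, strides=(8, 16, 32, 64, 128), anchors_per_location=9):
    """Count anchor boxes over all pyramid levels for one input image.

    Assumptions (not from the paper text): ceil rounding of feature-map
    sizes and 9 anchors at every location of every level.
    """
    total_locations = 0
    for s in strides:
        total_locations += math.ceil(height / s) * math.ceil(width / s)
    return total_locations * anchors_per_location

# Assumed input size: shorter side 800, longer side 1333.
n_anchors = count_anchors(800, 1333)
print(n_anchors)  # 200700 under these assumptions
```

Under these assumptions there are 22,300 locations and roughly 200K anchors, which is consistent with the order of magnitude quoted above. An anchor-free detector makes one prediction per location instead of nine.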

Recently, fully convolutional networks (FCNs) [16] have achieved tremendous success in dense prediction tasks such as semantic segmentation [16], depth estimation [14], keypoint detection [2], and counting [1]. As one of the high-level vision tasks, object detection might be the only one deviating from the neat fully convolutional per-pixel prediction framework, mainly due to the use of anchor boxes. It is natural to ask a question: Can we solve object detection in the neat per-pixel prediction fashion, analogous to FCN for semantic segmentation, for example? Thus those fundamental vision tasks can be unified in (almost) one single framework. We show that the answer is affirmative. Moreover, we demonstrate that, for the first time, the much simpler FCN-based detector achieves even better performance than its anchor-based counterparts.
Summary: FCNs [16] have been very successful in dense prediction tasks such as semantic segmentation [16], depth estimation [14], keypoint detection [2], and counting [1]. Object detection may be the only high-level vision task that departs from this per-pixel framework, mainly because of anchor boxes. The paper asks whether detection can be solved per pixel in the same way, so that these fundamental vision tasks share (almost) one framework. The answer is yes, and the much simpler FCN-based detector even outperforms its anchor-based counterparts.

In the literature, some works attempted to leverage the FCN-based framework for object detection, such as DenseBox [9] and UnitBox [24]. Specifically, these FCN-based frameworks directly predict a 4D vector plus a class category at each spatial location on a level of feature maps. As shown in Fig. 1 (left), the 4D vector depicts the relative offsets from the four sides of a bounding box to the location. These frameworks are similar to the FCNs for semantic segmentation, except that each location is required to regress a 4D continuous vector. However, to handle the bounding boxes with different sizes, DenseBox [9] resizes training images to a fixed scale. Thus DenseBox has to perform detection on image pyramids, which is against FCN's philosophy of computing all convolutions once. Besides, more significantly, these methods are mainly used in special-domain object detection such as scene text detection [25] or face detection [24, 9], since it is believed that these methods do not work well when applied to generic object detection with highly overlapped bounding boxes. As shown in Fig. 1 (right), the highly overlapped bounding boxes result in an intractable ambiguity during training: it is not clear which bounding box to regress for the pixels in the overlapped regions.
Summary: earlier FCN-based detectors such as DenseBox [9] and UnitBox [24] predict a 4D vector plus a class at every spatial location of a feature map, where the 4D vector holds the offsets from the location to the four sides of the box (Fig. 1, left). DenseBox resizes training images to a fixed scale to handle boxes of different sizes, so it has to run on image pyramids, which goes against the FCN idea of computing all convolutions once. More importantly, these methods were mostly used in special domains such as scene text detection [25] or face detection [24, 9], because highly overlapping boxes in generic detection make it ambiguous which box a pixel in the overlap should regress (Fig. 1, right).

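[Figure 1: left, the 4D vector (l, t, r, b) that FCOS predicts at each location; right, a location inside overlapping bounding boxes, whose regression target is ambiguous.]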

In the sequel, we take a closer look at the issue and show that with FPN this ambiguity can be largely eliminated. As a result, our method can already obtain comparable detection accuracy with those traditional anchor-based detectors. Furthermore, we observe that our method may produce a number of low-quality predicted bounding boxes at the locations that are far from the center of a target object. In order to suppress these low-quality detections, we introduce a novel center-ness branch (only one layer) to predict the deviation of a pixel to the center of its corresponding bounding box, as defined in Eq. (3). This score is then used to down-weight low-quality detected bounding boxes and merge the detection results in NMS. The simple yet effective center-ness branch allows the FCN-based detector to outperform anchor-based counterparts under exactly the same training and testing settings.
Summary: with FPN the ambiguity is largely eliminated, and the method already reaches accuracy comparable to traditional anchor-based detectors. The method can still produce many low-quality boxes at locations far from the center of the target object. To suppress them, a new single-layer center-ness branch predicts how far a pixel deviates from the center of its bounding box, as defined in Eq. (3). The score down-weights low-quality boxes before NMS, and with it the FCN-based detector outperforms its anchor-based counterparts under exactly the same training and testing settings.
Eq. (3), where l*, t*, r*, b* are the distances from the location to the left, top, right and bottom sides of its box:

$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}} \tag{3}$$

This new detection framework enjoys the following advantages.
Detection is now unified with many other FCN-solvable tasks such as semantic segmentation, making it easier to re-use ideas from those tasks.
1. Detection now shares one framework with other FCN-solvable tasks such as semantic segmentation, so ideas from those tasks are easier to reuse.
Detection becomes proposal free and anchor free, which significantly reduces the number of design parameters. The design parameters typically need heuristic tuning and many tricks are involved in order to achieve good performance. Therefore, our new detection framework makes the detector, particularly its training, considerably simpler. Moreover, by eliminating the anchor boxes, our new detector completely avoids the complex IOU computation and matching between anchor boxes and ground-truth boxes during training and reduces the total training memory footprint by a factor of 2 or so.
2. Detection becomes proposal-free and anchor-free, which greatly reduces the number of design parameters that would otherwise need heuristic tuning and many tricks. The detector, and its training in particular, becomes considerably simpler. Without anchors there is no IOU computation or matching between anchor boxes and ground-truth boxes during training, and the total training memory footprint drops by a factor of about 2.
3. Without bells and whistles, FCOS achieves state-of-the-art results among one-stage detectors. It can also be used as the region proposal network (RPN) of a two-stage detector, where it performs significantly better than anchor-based RPNs. Given that the much simpler anchor-free detector performs even better, the authors encourage the community to rethink the necessity of anchor boxes in object detection, which are currently considered the de facto standard.
The proposed detector can be immediately extended to solve other vision tasks with minimal modification, including instance segmentation and key-point detection. We believe that this new method can be the new baseline for many instance-wise prediction problems.
4. With minimal modification, the detector can be extended right away to other vision tasks, including instance segmentation and key-point detection. The authors believe it can become a new baseline for many instance-wise prediction problems.

2. Related Work

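Anchor-based Detectors. Anchor-based detectors inherit the ideas from traditional sliding-window and proposal-based detectors such as Fast R-CNN [5]. In anchor-based detectors, the anchor boxes can be viewed as pre-defined sliding windows or proposals, which are classified as positive or negative patches, with an extra offsets regression to refine the prediction of bounding box locations. Therefore, the anchor boxes in these detectors may be viewed as training samples. Unlike previous detectors like Fast R-CNN, which compute image features for each sliding window/proposal repeatedly, anchor boxes make use of the feature maps of convolutional neural networks (CNNs) and avoid repeated feature computation, speeding up the detection process dramatically. The design of anchor boxes was popularized by Faster R-CNN in its RPNs [20], SSD [15] and YOLOv2 [18], and has become the convention in modern detectors.

Summary: anchor-based detectors inherit ideas from sliding-window and proposal-based detectors such as Fast R-CNN [5]. Anchor boxes act as pre-defined sliding windows or proposals that are classified as positive or negative patches and refined by an extra offset regression, so the anchors can be viewed as the training samples. Unlike Fast R-CNN, which recomputes image features for every window or proposal, anchors reuse the CNN feature maps and avoid repeated feature computation, which speeds up detection dramatically. Faster R-CNN (in its RPNs [20]), SSD [15] and YOLOv2 [18] popularized the design, and it became the convention for modern detectors.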

However, as described above, anchor boxes result in excessively many hyper-parameters, which typically need to be carefully tuned in order to achieve good performance. Besides the hyper-parameters of anchor shapes described above, the anchor-based detectors also need other hyper-parameters to label each anchor box as a positive, ignored or negative sample. In previous works, they often employ intersection over union (IOU) between anchor boxes and ground-truth boxes to label them (e.g., a positive anchor if its IOU is in [0.5, 1]). These hyper-parameters have shown a great impact on the final accuracy, and require heuristic tuning. Meanwhile, these hyper-parameters are specific to detection tasks, making detection tasks deviate from the neat fully convolutional network architectures used in other dense prediction tasks such as semantic segmentation.

Summary: anchor boxes bring many hyper-parameters that need careful tuning. Besides the anchor shapes, further hyper-parameters decide whether each anchor is a positive, ignored, or negative sample, usually through the IOU between the anchor and the ground-truth boxes (for example, positive if the IOU lies in [0.5, 1]). These settings strongly affect final accuracy, require heuristic tuning, and are specific to detection, which pushes detection away from the neat fully convolutional architectures used in other dense prediction tasks such as semantic segmentation.
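To make the IOU-based labelling concrete, here is a minimal sketch of how an anchor-based detector might label one anchor. The positive threshold 0.5 comes from the example in the text; the negative threshold 0.4 and the names `iou` and `label_anchor` are assumptions made for illustration, not something the paper prescribes.

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    """Label one anchor by its best IOU with any ground-truth box.

    pos_thresh follows the [0.5, 1] example in the text; neg_thresh is an
    assumed value. Both are exactly the kind of hyper-parameter FCOS removes.
    """
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thresh:
        return "positive"
    if best < neg_thresh:
        return "negative"
    return "ignored"

gt = [(0, 0, 10, 10)]
print(label_anchor((0, 0, 10, 10), gt))    # positive
print(label_anchor((0, 0, 10, 22), gt))    # ignored (IOU is about 0.45)
print(label_anchor((20, 20, 30, 30), gt))  # negative
```

Every anchor has to be compared with every ground-truth box, which is where the extra computation and memory mentioned in the introduction come from.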

Anchor-free Detectors. The most popular anchor-free detector might be YOLOv1 [17]. Instead of using anchor boxes, YOLOv1 predicts bounding boxes at points near the center of objects. Only the points near the center are used since they are considered to be able to produce higher-quality detection. However, since only points near the center are used to predict bounding boxes, YOLOv1 suffers from low recall as mentioned in YOLOv2 [18]. As a result, YOLOv2 [18] makes use of anchor boxes as well. Compared to YOLOv1, FCOS takes advantage of all points in a ground-truth bounding box to predict the bounding boxes, and the low-quality detected bounding boxes are suppressed by the proposed "center-ness" branch. As a result, FCOS is able to provide comparable recall with anchor-based detectors as shown in our experiments.

Summary: YOLOv1 [17] is probably the most popular anchor-free detector. It predicts boxes only at points near the object center, which are expected to give higher-quality detections, but this also causes the low recall noted in YOLOv2 [18], which therefore switched to anchor boxes. FCOS instead uses all points inside a ground-truth box to predict boxes and suppresses the low-quality ones with the center-ness branch, so its recall is comparable to that of anchor-based detectors.

CornerNet [10] is a recently proposed one-stage anchor-free detector, which detects a pair of corners of a bounding box and groups them to form the final detected bounding box. CornerNet requires much more complicated post-processing to group the pairs of corners belonging to the same instance. An extra distance metric is learned for the purpose of grouping.
Summary: CornerNet [10] is a recent one-stage anchor-free detector that detects a pair of box corners and groups them into the final box. Grouping the corners of the same instance requires far more complicated post-processing, and an extra distance metric has to be learned for it.

Another family of anchor-free detectors such as [24] are based on DenseBox [9]. This family of detectors has been considered unsuitable for generic object detection due to difficulty in handling overlapping bounding boxes and the recall being low. In this work, we show that both problems can be largely alleviated with multi-level FPN prediction. Moreover, we also show that, together with our proposed center-ness branch, the much simpler detector can achieve even better detection performance than its anchor-based counterparts.
Summary: another family of anchor-free detectors, such as [24], builds on DenseBox [9]. It has been considered unsuitable for generic detection because overlapping boxes are hard to handle and recall is low. The paper shows that multi-level FPN prediction largely alleviates both problems, and that with the center-ness branch the much simpler detector even outperforms its anchor-based counterparts.
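[Figure 2: the network architecture of FCOS.] C3, C4 and C5 denote the feature maps of the backbone network, and P3 to P7 are the feature levels used for the final prediction. H x W is the height and width of the feature maps. "/s" (s = 8, 16, ..., 128) is the down-sampling ratio of the feature level to the input image. As an example, all the numbers are computed with an 800 x 1024 input.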

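As you have probably noticed, the feature pyramid here is not the standard FPN structure: the P6 and P7 levels look somewhat redundant. The experiments (Table 7) compare against RetinaNet to show the advantage of the FCOS output design, and in my view that comparison is not very convincing.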

3. Our Approach

In this section, we first reformulate object detection in a per-pixel prediction fashion. Next, we show how we make use of multi-level prediction to improve the recall and resolve the ambiguity resulting from overlapped bounding boxes in training. Finally, we present our proposed center-ness branch, which helps suppress the low-quality detected bounding boxes and improves the overall performance by a large margin.
Summary: this section first reformulates detection as per-pixel prediction, then shows how multi-level prediction improves recall and resolves the ambiguity caused by overlapping boxes during training, and finally presents the center-ness branch, which suppresses low-quality boxes and improves overall performance by a large margin.

3.1. Fully Convolutional One-Stage Object Detector

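Feature map at layer i of the backbone CNN, where s is the total stride up to that layer and C here denotes the number of channels:

$$F_i \in \mathbb{R}^{H \times W \times C}$$

Ground-truth bounding boxes of an input image, written as a set {B_i}:

$$B_i = (x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}) \in \mathbb{R}^4 \times \{1, 2, \dots, C\}$$

The top-left and bottom-right corners of the box are:

$$(x_0^{(i)}, y_0^{(i)}) \quad \text{and} \quad (x_1^{(i)}, y_1^{(i)})$$

The class of the object inside the box is:

$$c^{(i)}$$

The total number of classes is C (80 for COCO).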

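Each location (x, y) on the feature map F_i, whose total stride is s, can be mapped back onto the input image as $(\lfloor s/2 \rfloor + xs, \lfloor s/2 \rfloor + ys)$, which is near the center of the receptive field of (x, y).
Anchor-based detectors treat locations on the input image as the centers of anchor boxes and regress the target bounding box relative to those anchors. FCOS instead regresses the target bounding box directly at each location. In other words, the detector treats locations, rather than anchor boxes, as training samples, which is the same as in FCNs for semantic segmentation.

Put differently: an anchor-based detector regresses anchors to the ground truth, whereas FCOS regresses locations to the ground truth, with locations replacing anchors as the training samples.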
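The mapping from feature-map cells back to image coordinates can be sketched with NumPy as below. This is an illustrative sketch, not the authors' code; the function name `compute_locations` and the row-major output order are choices made for this example.

```python
import numpy as np

def compute_locations(feat_h, feat_w, stride):
    """Map every feature-map cell (x, y) to (floor(s/2) + x*s, floor(s/2) + y*s)."""
    xs = np.arange(feat_w) * stride + stride // 2
    ys = np.arange(feat_h) * stride + stride // 2
    grid_x, grid_y = np.meshgrid(xs, ys)            # both have shape (feat_h, feat_w)
    # One (x, y) row per cell, ordered row by row.
    return np.stack([grid_x.ravel(), grid_y.ravel()], axis=1)

locs = compute_locations(feat_h=2, feat_w=3, stride=8)
print(locs)
# [[ 4  4]
#  [12  4]
#  [20  4]
#  [ 4 12]
#  [12 12]
#  [20 12]]
```

Each returned point is one training sample in FCOS, so a feature map of size H x W contributes exactly H x W candidates.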

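A location (x, y) is a positive sample if it falls into any ground-truth bounding box, and its class label c* is then the class label of that box B_i. Otherwise it is a negative sample with c* = 0 (background). Besides the classification label, each sample also has a 4D vector $t^* = (l^*, t^*, r^*, b^*)$ as its regression target, where the four values are the distances from the location to the four sides of the bounding box. A location that falls into multiple bounding boxes is considered an ambiguous sample. For now, the bounding box with the smallest area is simply chosen as its regression target. The next section shows that multi-level prediction significantly reduces the number of ambiguous samples. If location (x, y) is associated with a bounding box B_i with corners $(x_0^{(i)}, y_0^{(i)})$ and $(x_1^{(i)}, y_1^{(i)})$, its regression targets are given by Eq. (1):

$$l^* = x - x_0^{(i)}, \quad t^* = y - y_0^{(i)}, \quad r^* = x_1^{(i)} - x, \quad b^* = y_1^{(i)} - y \tag{1}$$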
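The target assignment described above (positive if inside any box, smallest-area box when ambiguous, distances to the four sides as regression targets) can be sketched as follows. This is a minimal NumPy illustration of a single feature level; `fcos_targets` and its argument layout are assumptions made for this example and are not taken from the official repository.

```python
import numpy as np

def fcos_targets(locations, gt_boxes, gt_labels):
    """Assign a class label and (l*, t*, r*, b*) to every location.

    locations: (N, 2) image coordinates (x, y).
    gt_boxes:  (M, 4) boxes as (x0, y0, x1, y1).
    gt_labels: (M,) class ids in 1..C; 0 is reserved for background.
    """
    x, y = locations[:, 0:1], locations[:, 1:2]        # (N, 1) each
    l = x - gt_boxes[:, 0]                             # (N, M), Eq. (1)
    t = y - gt_boxes[:, 1]
    r = gt_boxes[:, 2] - x
    b = gt_boxes[:, 3] - y
    ltrb = np.stack([l, t, r, b], axis=2)              # (N, M, 4)
    inside = ltrb.min(axis=2) > 0                      # location lies inside the box
    areas = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    # Ambiguous locations pick the smallest box; boxes not containing the
    # location get an infinite cost so they are never chosen.
    cost = np.where(inside, areas[None, :], np.inf)    # (N, M)
    best = cost.argmin(axis=1)
    idx = np.arange(len(locations))
    labels = np.where(np.isfinite(cost[idx, best]), gt_labels[best], 0)
    # Regression targets of background locations are meaningless and unused.
    return labels, ltrb[idx, best]

locations = np.array([[5, 5], [50, 50], [200, 200]])
gt_boxes = np.array([[0, 0, 100, 100], [40, 40, 60, 60]], dtype=float)
gt_labels = np.array([1, 2])
labels, targets = fcos_targets(locations, gt_boxes, gt_labels)
print(labels)      # [1 2 0]
print(targets[1])  # [10. 10. 10. 10.]
```

The point (50, 50) lies inside both boxes and is assigned to the smaller one, while (200, 200) lies in no box and becomes background.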
It is worth noting that FCOS can leverage as many foreground samples as possible to train the regressor. It is different from anchor-based detectors, which only consider the anchor boxes with a sufficiently high IOU with ground-truth boxes as positive samples. We argue that it may be one of the reasons that FCOS outperforms its anchor-based counterparts.

Summary: FCOS can use as many foreground samples as possible to train the regressor, whereas anchor-based detectors count only anchors with a high enough IOU with a ground-truth box as positives. The authors argue this may be one reason FCOS outperforms its anchor-based counterparts.

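Matching the training targets, the final layer of the network predicts an 80D vector p of classification scores and a 4D vector t = (l, t, r, b) of bounding box coordinates. Following [12], C binary classifiers are trained instead of one multi-class classifier. Similar to [12], four convolutional layers are added after the backbone feature maps for each of the classification and regression branches. Because the regression targets are always positive, exp(x) is applied on top of the regression branch to map any real number to (0, ∞). It is worth noting that FCOS has 9x fewer network output variables than the popular anchor-based detectors [12, 20], which use 9 anchor boxes per location.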
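The training loss is defined as Eq. (2):

$$L(\{p_{x,y}\}, \{t_{x,y}\}) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}(p_{x,y}, c^*_{x,y}) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} L_{reg}(t_{x,y}, t^*_{x,y}) \tag{2}$$

Here L_cls is the focal loss and L_reg is the IOU loss as in UnitBox [24]. N_pos is the number of positive samples, and λ, set to 1 in the paper, is the balance weight for L_reg. The summation runs over all locations on the feature maps. The indicator function $\mathbb{1}_{\{c^*_{x,y} > 0\}}$ is 1 if c* > 0 at that location and 0 otherwise.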
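A rough NumPy sketch of this loss may help in reading Eq. (2). It is not the authors' implementation: the focal loss parameters alpha = 0.25 and gamma = 2 are the commonly used RetinaNet defaults and are assumed here, the IOU loss is taken as -ln(IOU), and all function names are made up for the example.

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-9):
    """Sigmoid focal loss per location; p and target have shape (N, C)."""
    pt = np.where(target == 1, p, 1.0 - p)
    w = np.where(target == 1, alpha, 1.0 - alpha)
    return (-w * (1.0 - pt) ** gamma * np.log(pt + eps)).sum(axis=1)

def iou_loss(pred, target, eps=1e-9):
    """-ln(IOU) for boxes given as (l, t, r, b) measured from the same location."""
    area_p = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    area_t = (target[:, 0] + target[:, 2]) * (target[:, 1] + target[:, 3])
    w = np.minimum(pred[:, 0], target[:, 0]) + np.minimum(pred[:, 2], target[:, 2])
    h = np.minimum(pred[:, 1], target[:, 1]) + np.minimum(pred[:, 3], target[:, 3])
    inter = w * h
    return -np.log((inter + eps) / (area_p + area_t - inter + eps))

def fcos_loss(p, t, labels, t_star, lam=1.0):
    """Eq. (2): p is (N, C), t and t_star are (N, 4), labels is (N,) with 0 = background."""
    n, c = p.shape
    pos = labels > 0                       # indicator 1{c* > 0}
    onehot = np.zeros((n, c))
    onehot[pos, labels[pos] - 1] = 1       # C binary targets per location
    n_pos = max(int(pos.sum()), 1)
    l_cls = focal_loss(p, onehot).sum() / n_pos
    l_reg = iou_loss(t[pos], t_star[pos]).sum() / n_pos
    return l_cls + lam * l_reg

p = np.array([[0.99, 0.01], [0.01, 0.01]])
labels = np.array([1, 0])
t_star = np.array([[10.0, 10.0, 10.0, 10.0], [1.0, 1.0, 1.0, 1.0]])
good = fcos_loss(p, t_star.copy(), labels, t_star)
bad = fcos_loss(p, np.array([[5.0, 10.0, 10.0, 10.0], [1.0, 1.0, 1.0, 1.0]]), labels, t_star)
print(good < bad)  # True
```

Only positive locations contribute to the regression term, while every location, positive or negative, contributes to the classification term.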
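Inference in FCOS is straightforward. Given an input image, a forward pass through the network yields the classification scores p_{x,y} and the regression prediction t_{x,y} for every location (x, y) on the feature maps F_i. Locations with p_{x,y} > 0.05 are kept as positive samples, and Eq. (1) is inverted to obtain the predicted bounding boxes.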
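The decoding step can be sketched like this. The 0.05 threshold comes from the text; the function name `decode_boxes` and the array layout are assumptions for illustration.

```python
import numpy as np

def decode_boxes(locations, ltrb, scores, score_thresh=0.05):
    """Turn per-location predictions into boxes by inverting Eq. (1).

    locations: (N, 2) image coordinates, ltrb: (N, 4) predicted distances,
    scores: (N, C) per-class probabilities.
    """
    loc_idx, cls_idx = np.nonzero(scores > score_thresh)
    x, y = locations[loc_idx, 0], locations[loc_idx, 1]
    l, t, r, b = ltrb[loc_idx].T
    boxes = np.stack([x - l, y - t, x + r, y + b], axis=1)   # (x0, y0, x1, y1)
    return boxes, cls_idx + 1, scores[loc_idx, cls_idx]      # class ids are 1-based

locations = np.array([[50.0, 50.0], [10.0, 10.0]])
ltrb = np.array([[10.0, 20.0, 30.0, 40.0], [1.0, 1.0, 1.0, 1.0]])
scores = np.array([[0.9, 0.01], [0.02, 0.03]])
boxes, classes, kept_scores = decode_boxes(locations, ltrb, scores)
print(boxes)    # [[40. 30. 80. 90.]]
print(classes)  # [1]
```

The second location is dropped because none of its class scores exceeds 0.05.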

3.2. Multi-level Prediction with FPN for FCOS

Here we show how two possible issues of the proposed FCOS can be resolved with multi-level prediction with FPN [11]. 1) The large stride (e.g., 16) of the final feature maps in a CNN can result in a relatively low best possible recall (BPR). For anchor-based detectors, low recall rates due to the large stride can be compensated to some extent by lowering the required IOU scores for positive anchor boxes. For FCOS, at first glance one may think that the BPR can be much lower than that of anchor-based detectors, because it is impossible to recall an object which no location on the final feature maps encodes due to a large stride. Here, we empirically show that even with a large stride, FCN-based FCOS is still able to produce a good BPR, and it can be even better than the BPR of the anchor-based detector RetinaNet [12] in the official implementation Detectron [6] (refer to Table 1). Therefore, the BPR is actually not a problem of FCOS. Moreover, with multi-level FPN prediction [11], the BPR can be improved further to match the best BPR the anchor-based RetinaNet can achieve. 2) Overlaps in ground-truth boxes can cause intractable ambiguity during training, i.e., which bounding box should a location in the overlap regress? This ambiguity results in degraded performance of FCN-based detectors. In this work, we show that the ambiguity can be greatly resolved with multi-level prediction, and the FCN-based detector can obtain on par, sometimes even better, performance compared with anchor-based ones.

Summary: two possible issues of FCOS are resolved by multi-level prediction with FPN [11].

  1. The large stride of the final CNN feature maps can lead to a relatively low best possible recall (BPR). Anchor-based detectors can partly compensate by lowering the IOU required for positive anchors. For FCOS one might expect a much lower BPR, because an object that no location on the final feature map encodes cannot be recalled. The experiments show that even with a large stride FCOS still reaches a good BPR, even better than RetinaNet [12] in the official Detectron [6] implementation (Table 1), so BPR is not really a problem for FCOS. With multi-level FPN prediction the BPR improves further and matches the best BPR RetinaNet can achieve.

  2. Overlapping ground-truth boxes cause ambiguity during training: which box should a location in the overlap regress? This ambiguity degrades FCN-based detectors. Multi-level prediction largely resolves it, and the FCN-based detector then performs on par with, and sometimes better than, anchor-based ones.


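Following FPN [11], FCOS detects objects of different sizes on different levels of feature maps. Specifically, five levels of feature maps are used, {P3, P4, P5, P6, P7}.
P3, P4 and P5 are produced from the backbone feature maps C3, C4 and C5, each followed by a 1x1 convolution with the top-down connections of [11]. As the architecture figure (Figure 2) shows, P6 and P7 are produced by applying one convolutional layer with stride 2 to P5 and P6, respectively. As a result, P3, P4, P5, P6 and P7 have strides 8, 16, 32, 64 and 128.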
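Unlike anchor-based detectors, which assign anchor boxes of different sizes to different feature levels, FCOS directly limits the range of bounding box regression on each level. First, the regression targets l*, t*, r*, b* are computed for every location on every level. A location is then set as a negative sample, and is no longer required to regress a bounding box, if it satisfies max(l*, t*, r*, b*) > m_i or max(l*, t*, r*, b*) < m_{i-1}, where m_i is the maximum distance that feature level i needs to regress. In the paper, m_2, m_3, m_4, m_5, m_6 and m_7 are set to 0, 64, 128, 256, 512 and ∞. Because objects of different sizes are assigned to different feature levels, and most overlaps happen between objects of considerably different sizes, multi-level prediction largely alleviates the ambiguity described above and brings the FCN-based detector to the same level as anchor-based ones, as the experiments show.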
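The level assignment rule can be written in a few lines. The ranges below follow the m_i values used in the FCOS paper (0, 64, 128, 256, 512, ∞); the helper name `levels_for_target` is made up for this sketch.

```python
import numpy as np

# (m_{i-1}, m_i) for P3, P4, P5, P6, P7, following the values in the FCOS paper.
LEVEL_RANGES = [(0, 64), (64, 128), (128, 256), (256, 512), (512, np.inf)]

def levels_for_target(ltrb):
    """Return the level indices (0 = P3 ... 4 = P7) on which this target stays positive.

    A location is negative on level i if max(l*, t*, r*, b*) > m_i or < m_{i-1}.
    """
    m = max(ltrb)
    return [i for i, (lo, hi) in enumerate(LEVEL_RANGES) if lo <= m <= hi]

print(levels_for_target((10, 20, 30, 40)))   # [0]  -> handled by P3
print(levels_for_target((100, 5, 5, 5)))     # [1]  -> handled by P4
print(levels_for_target((600, 1, 1, 1)))     # [4]  -> handled by P7
```

A small box and a large box that overlap in the image therefore usually end up on different levels, so a location in the overlap is no longer ambiguous on either level.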
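Finally, following [11, 12], the heads are shared between different feature levels, which makes the detector more parameter-efficient and also improves detection performance. However, different feature levels have to regress different size ranges (for example, [0, 64] for P3 and [64, 128] for P4), so using identical heads for all levels is not reasonable. Instead of the standard exp(x), FCOS therefore uses exp(s_i x) with a trainable scalar s_i that automatically adjusts the base of the exponential function for feature level P_i, which the experiments show improves detection performance.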
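A tiny sketch of the per-level scaled exponential is given below. In a real framework s_i would be a learnable parameter updated by backpropagation; here it is a plain NumPy array, and the class name `ScaledExp` and the initial value 1.0 are assumptions made for illustration.

```python
import numpy as np

class ScaledExp:
    """Maps raw regression outputs x to positive distances exp(s_i * x)."""

    def __init__(self, num_levels=5, init=1.0):
        # One scalar s_i per feature level P3..P7 (trainable in a real model).
        self.scales = np.full(num_levels, init)

    def __call__(self, x, level):
        return np.exp(self.scales[level] * x)

scaled_exp = ScaledExp()
raw = np.array([-2.0, 0.0, 2.0])
out = scaled_exp(raw, level=0)
print((out > 0).all())  # True: outputs are always positive distances
```

Because the convolutional head is shared, the scalar s_i is the only level-specific piece, which lets each level settle on its own output range.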

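To recap: FPN-based multi-scale prediction improves recall and eases the ambiguity caused by overlapping bounding boxes. Five feature maps P3, P4, P5, P6 and P7 with strides 8, 16, 32, 64 and 128 are used, where P6 and P7 are obtained by downsampling P5 and P6, respectively. A target whose size falls outside a level's regression range is not regressed on that level, which effectively reduces the ambiguity from overlapping targets (the authors assume that overlapping targets differ considerably in size).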

3.3. Center-ness for FCOS

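[Figure 3: center-ness heat map.]
Red, blue, and the other colors denote 1, 0, and the values in between, respectively. Center-ness is computed by Eq. (3) and decays from 1 to 0 as the location moves away from the center of the object. At test time, the center-ness predicted by the network is multiplied by the classification score, which down-weights the low-quality boxes predicted at locations far from the object center.

Center-ness can be read as a "center score": it indicates whether the current pixel lies in the central region of its ground-truth target. In the heat map of Figure 3, red marks a center-ness of 1, blue a center-ness of 0, and everything else lies between 0 and 1.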

After using multi-level prediction in FCOS, there is still a performance gap between FCOS and anchor-based detectors. We observed that it is due to a lot of low-quality predicted bounding boxes produced by locations far away from the center of an object.
Summary: even with multi-level prediction, a performance gap remains between FCOS and anchor-based detectors. The cause is the many low-quality boxes predicted at locations far from the center of an object.
We propose a simple yet effective strategy to suppress these low-quality detected bounding boxes without introducing any hyper-parameters. Specifically, we add a single-layer branch, in parallel with the classification branch, to predict the "center-ness" of a location (i.e., the distance from the location to the center of the object that the location is responsible for), as shown in Fig. 2.
Summary: these low-quality boxes are suppressed by a simple strategy that adds no hyper-parameters: a single-layer branch, parallel to the classification branch, predicts the "center-ness" of a location, that is, the distance from the location to the center of the object it is responsible for (Fig. 2).
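Given the regression targets l*, t*, r*, b* of a location, the center-ness target is defined as Eq. (3):

$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}} \tag{3}$$

Center-ness measures how far the current pixel deviates from the center of its ground-truth target: the smaller the value, the larger the deviation.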
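The center-ness target is easy to compute directly from the four distances. The sketch below assumes all four distances are positive, which holds for any location strictly inside its box; the function name `centerness` is chosen for this example.

```python
import math

def centerness(l, t, r, b):
    """Center-ness target of Eq. (3); l, t, r, b are assumed to be > 0."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(centerness(10, 10, 10, 10))  # 1.0 at the exact center of the box
print(centerness(5, 5, 15, 15))    # about 0.333, a quarter of the way in
print(centerness(1, 10, 19, 10))   # about 0.229, close to the left edge
```

The value is 1 only at the exact center and falls toward 0 as the location approaches any side of the box.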

We employ sqrt here to slow down the decay of the center-ness. The center-ness ranges from 0 to 1 and is thus trained with binary cross entropy (BCE) loss. The loss is added to the loss function Eq. (2). When testing, the final score (used for ranking the detected bounding boxes) is computed by multiplying the predicted center-ness with the corresponding classification score. Thus the center-ness can down-weight the scores of bounding boxes far from the center of an object. As a result, with high probability, these low-quality bounding boxes might be filtered out by the final non-maximum suppression (NMS) process, improving the detection performance remarkably.

Summary: the square root slows down the decay of the center-ness. Center-ness ranges from 0 to 1, so it is trained with a binary cross entropy loss that is added to the loss of Eq. (2). At test time the predicted center-ness is multiplied by the corresponding classification score, and the product is the final score used to rank the detected boxes. Boxes far from the object center get lower scores and are very likely removed by the final NMS, which improves detection performance.
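The test-time ranking can be sketched as follows: multiply the classification score by the center-ness, then run a plain single-class NMS on the product. The IOU threshold of 0.6 and the function name `rank_and_nms` are assumptions for this illustration, and real detectors run NMS per class.

```python
import numpy as np

def rank_and_nms(boxes, cls_scores, centerness, iou_thresh=0.6):
    """Rank boxes by cls_score * centerness and apply greedy NMS.

    boxes: (N, 4) as (x0, y0, x1, y1). Returns kept indices, best first.
    """
    scores = cls_scores * centerness               # final ranking score
    order = np.argsort(-scores)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        w = np.minimum(boxes[i, 2], boxes[rest, 2]) - np.maximum(boxes[i, 0], boxes[rest, 0])
        h = np.minimum(boxes[i, 3], boxes[rest, 3]) - np.maximum(boxes[i, 1], boxes[rest, 1])
        inter = np.clip(w, 0, None) * np.clip(h, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]            # drop boxes overlapping the kept one
    return keep

boxes = np.array([[0, 0, 100, 100], [2, 2, 102, 102], [200, 200, 300, 300]], dtype=float)
cls_scores = np.array([0.90, 0.95, 0.80])
ctr = np.array([0.9, 0.3, 0.9])
print(rank_and_nms(boxes, cls_scores, ctr))  # [0, 2]
```

Box 1 has the highest classification score, but its low center-ness pushes it below box 0, so NMS keeps box 0 and suppresses box 1.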

From the perspective of anchor-based detectors, which use two IOU thresholds T_low and T_high to label the anchor boxes as negative, ignored and positive samples, the center-ness can be viewed as a soft threshold. It is learned during the training of networks and does not need to be tuned. Moreover, with the strategy, our detector can still view any locations falling into a ground-truth box as positive samples, except for the ones set as negative samples in the aforementioned multi-level prediction, so as to use as many training samples as possible for the regressor.

Summary: anchor-based detectors use two IOU thresholds, T_low and T_high, to label anchors as negative, ignored, or positive; seen from that angle, center-ness is a soft threshold. It is learned during training and needs no manual tuning. With this strategy the detector can still treat every location inside a ground-truth box as a positive sample, except those already set as negative by multi-level prediction, so the regressor sees as many training samples as possible.

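Thoughts:
1. Two possible problems of FCOS are resolved by multi-level prediction with FPN.

  • A large stride on the final feature map may lead to a relatively low BPR. For anchor-based detectors, the low recall caused by a large stride can be partly compensated by lowering the IOU threshold for positive anchors. For FCOS, the first guess is that the BPR would be much lower than for anchor-based detectors, because an object with no corresponding location on the final feature map cannot be recalled. The authors find that even with a large stride FCOS still obtains a good BPR, even better than RetinaNet's, and conclude that BPR is not a problem for FCOS (why?). Multi-level prediction improves the BPR further.
  • When boxes overlap, which ground truth should a location regress to?
    Multi-level prediction resolves this, because objects of different sizes are assigned to different feature levels.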

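2. How does FCOS differ from anchor-based detectors?
In an anchor-based detector, the input image goes through the backbone to produce the final feature map, say 17x17x256, and pre-defined anchors are then placed at every position of that feature map. This is exactly where FCOS differs: it regresses directly at every point of the feature map. The procedure is:

  • Map every point (x, y) of the feature map back to the original input image.
  • If the mapped point lies inside a ground-truth bounding box, it is a positive sample and its class label c is the class label of B_i; otherwise it is a negative sample with c = 0 (background).
  • The regression target is (l, t, r, b), the distances from the point to the left, top, right and bottom sides of the bounding box.

Because FCOS obtains many positive samples this way and uses all of them for regression, it gains a solid performance improvement. An anchor-based method instead computes the IOU between each preset anchor and its ground truth, and treats the anchor as a positive sample only when the IOU exceeds the chosen threshold.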

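Summary
(1) YOLOv1 is also anchor-free, but it predicts only from points in the central region of the object, so its recall is low. FCOS uses every point inside the object region, and its recall is comparable to that of anchor-based methods.
(2) Center-ness clearly improves the results, but it lacks a theoretical explanation.
(3) As a new anchor-free method, FCOS does outperform YOLOv1, CornerNet and FSAF. However, inference speed is the inherent advantage of a one-stage detector, and the paper never reports speed, so there is still some way to go before anchor-free detectors are also fast.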

References:
https://blog.csdn.net/diligent_321/article/details/89069018
https://blog.csdn.net/qq_34718684/article/details/89118775
https://blog.csdn.net/WZZ18191171661/article/details/89258086
Other write-ups on this paper:
https://www.cnblogs.com/fourmi/p/10771436.html
