The following is excerpted from the paper (FCOS: Fully Convolutional One-Stage Object Detection); my English is quite poor, so treat the translated notes as a rough guide.
Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor-box free, as well as proposal free.
…
Benefits of being anchor-box free: it avoids the complicated computation related to anchor boxes, such as calculating overlapping during training, and it avoids all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance.
Recently, fully convolutional networks (FCNs) have achieved tremendous success in dense prediction tasks such as semantic segmentation, depth estimation, keypoint detection and counting.
In the literature, some works attempted to leverage the FCNs-based framework for object detection, such as DenseBox. Specifically, these FCN-based frameworks directly predict a 4D vector plus a class category at each spatial location on a level of feature maps.
The 4D vector depicts the relative offsets from the four sides of a bounding box to the location.
…
Drawbacks of DenseBox: in order to handle bounding boxes of different sizes, DenseBox crops and resizes training images to a fixed scale, so it has to perform detection on an image pyramid, which is against the FCN philosophy of computing all convolutions once. Moreover, when applied to generic object detection with highly overlapped bounding boxes, it performs poorly: highly overlapped boxes cause ambiguity, since it is unclear which bounding box a pixel in the overlapping region should regress.
In the sequel, we take a closer look at the issue and show that with FPN (Feature Pyramid Networks) this ambiguity can be largely eliminated.
Furthermore, we observe that our method may produce a number of low-quality predicted bounding boxes at the locations that are far from the center of a target object. In order to suppress these low-quality detections, we introduce a novel "center-ness" branch (only one layer) to predict the deviation of a pixel to the center of its corresponding bounding box.
In this section, we first reformulate object detection in a per-pixel prediction fashion. Next, we show how we make use of multi-level prediction to improve the recall and resolve the ambiguity resulting from overlapped bounding boxes. Finally, we present our proposed "center-ness" branch, which helps suppress the low-quality detected bounding boxes and improves the overall performance by a large margin.
the feature maps at layer $i$ of a backbone CNN: $F_i \in \mathbb{R}^{H \times W \times C}$
the total stride until the layer: $s$
the ground-truth bounding boxes for an input image: $\{B_i\}$
$B_i = (x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}) \in \mathbb{R}^4 \times \{1, 2, \dots, C\}$
$(x_0^{(i)}, y_0^{(i)})$: the left-top corner of the bounding box
$(x_1^{(i)}, y_1^{(i)})$: the right-bottom corner
$C$ is the number of classes, which is 80 for the MS-COCO dataset
Training sample
location $(x, y)$ falls into any ground-truth box → a positive sample, $c^*$ is the class label of the ground-truth box
Otherwise → a negative sample, $c^* = 0$ (background class)
the regression targets: a 4D real vector $\mathbf{t}^* = (l^*, t^*, r^*, b^*)$
If a location falls into multiple bounding boxes, it is considered an ambiguous sample; we simply choose the bounding box with the minimal area as its regression target
$l^* = x - x_0^{(i)}, \quad t^* = y - y_0^{(i)}, \quad r^* = x_1^{(i)} - x, \quad b^* = y_1^{(i)} - y$
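A minimal sketch of this target assignment in PyTorch (the helper name, tensor layout, and `inf` masking are my assumptions, not the paper's released code):

```python
import torch

def fcos_targets(locations, boxes, labels):
    """Assign a class label and (l*, t*, r*, b*) to every location.

    locations: (N, 2) tensor of (x, y) coordinates on the input image.
    boxes:     (M, 4) ground-truth boxes (x0, y0, x1, y1).
    labels:    (M,) class labels in {1, ..., C}.
    """
    xs, ys = locations[:, 0:1], locations[:, 1:2]            # (N, 1) each
    l = xs - boxes[:, 0]                                     # (N, M) distances
    t = ys - boxes[:, 1]
    r = boxes[:, 2] - xs
    b = boxes[:, 3] - ys
    reg = torch.stack([l, t, r, b], dim=2)                   # (N, M, 4)

    # A location is inside a box iff all four distances are positive.
    inside = reg.min(dim=2).values > 0                       # (N, M)

    # If a location falls into multiple boxes, pick the minimal-area box.
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    areas = areas[None, :].repeat(len(locations), 1)         # (N, M)
    areas[~inside] = float("inf")
    min_area, idx = areas.min(dim=1)                         # (N,)

    cls_targets = labels[idx].clone()
    cls_targets[torch.isinf(min_area)] = 0                   # background: c* = 0
    return cls_targets, reg[torch.arange(len(locations)), idx]
```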
Output
an 80D vector $\mathbf{p}$ of classification labels
a 4D vector $\mathbf{t} = (l, t, r, b)$ of bounding-box coordinates
regression targets are always positive → employ $\exp(x)$ to map the regression output to $(0, \infty)$
Loss function
$L(\{\mathbf{p}_{x,y}\}, \{\mathbf{t}_{x,y}\}) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}(\mathbf{p}_{x,y}, c^*_{x,y}) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} L_{reg}(\mathbf{t}_{x,y}, \mathbf{t}^*_{x,y})$
$L_{cls}$ → focal loss
cross entropy (CE):
$CE(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases}$
$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases}$
→ $CE(p, y) = CE(p_t) = -\log(p_t)$
There is an extreme imbalance between foreground and background classes during training.
A common method for addressing class imbalance is to introduce a weighting factor $\alpha \in [0, 1]$ for class $1$ and $1 - \alpha$ for class $-1$.
Background pixels far outnumber foreground pixels, so a weight $\alpha_t$ is added on top of the cross entropy: $CE(p_t) = -\alpha_t \log(p_t)$.
While $\alpha$ balances the importance of positive/negative examples, it does not differentiate between easy/hard examples. Instead, we propose to reshape the loss function to down-weight easy examples and thus focus training on hard negatives.
More formally, we propose to add a modulating factor $(1 - p_t)^\gamma$ to the cross entropy loss, with tunable focusing parameter $\gamma \ge 0$.
$\alpha_t$ only balances positive/negative examples; the modulating factor $(1 - p_t)^\gamma$ is further introduced so that training focuses on hard, misclassified examples.
$FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$
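A minimal sketch of the focal loss above for the per-location, per-class binary classification used here; $\alpha = 0.25$ and $\gamma = 2$ are the defaults reported in [2], and the helper name is mine:

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), summed over entries;
    the training loss divides this sum by N_pos as in the formula above.

    logits:  raw scores before the sigmoid.
    targets: same shape, 1.0 for positive, 0.0 for negative.
    """
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)        # p_t as defined above
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    # (1 - p_t)^gamma down-weights easy examples; clamp avoids log(0).
    return (-alpha_t * (1 - p_t) ** gamma
            * torch.log(p_t.clamp(min=1e-8))).sum()
```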
$L_{reg}$ → IoU loss
$N_{pos}$ → the number of positive samples
$\lambda$ → the balance weight for $L_{reg}$ ($1$ in this paper)
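For $L_{reg}$, a sketch of the IoU loss from UnitBox [3] in the $(l, t, r, b)$ parameterization, assuming prediction and target describe boxes around the same location (helper name is mine):

```python
import torch

def iou_loss(pred, target, eps=1e-7):
    """-ln(IoU) between predicted and target boxes sharing one location.

    pred, target: (N, 4) tensors of positive (l, t, r, b) distances.
    """
    pl, pt, pr, pb = pred.unbind(dim=1)
    tl, tt, tr, tb = target.unbind(dim=1)
    inter_w = torch.min(pl, tl) + torch.min(pr, tr)
    inter_h = torch.min(pt, tt) + torch.min(pb, tb)
    inter = inter_w * inter_h
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    return -torch.log((inter / union.clamp(min=eps)).clamp(min=eps)).sum()

# Total loss as in the formula above (lambda = 1, only positives regress):
# loss = (focal_loss(cls_logits, cls_onehot) + iou_loss(t_pos, t_star)) / n_pos
```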
Inference
input images →
classification scores $\mathbf{p}_{x,y}$ (locations with scores $> 0.05$ are chosen as positive samples)
the regression prediction $\mathbf{t}_{x,y}$
five levels of feature maps → $\{P_3, P_4, P_5, P_6, P_7\}$
limit the range of bounding box regression for each level:
$\max(l^*, t^*, r^*, b^*) > m_i$ or $\max(l^*, t^*, r^*, b^*) < m_{i-1}$ → negative sample
$m_i$ is the maximum distance that feature level $i$ needs to regress. In this work, $m_2, m_3, m_4, m_5, m_6$ and $m_7$ are set as $0$, $64$, $128$, $256$, $512$ and $\infty$, respectively.
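A minimal sketch of this per-level filter, assuming `reg_targets` holds the $(l^*, t^*, r^*, b^*)$ distances for the candidate locations (helper name is mine):

```python
import torch

def in_level_range(reg_targets, m_lo, m_hi):
    """True where a location stays positive at this feature level.

    reg_targets: (N, 4) tensor of (l*, t*, r*, b*).
    m_lo, m_hi:  (m_{i-1}, m_i), e.g. (0, 64) for P3 and (512, inf) for P7.
    """
    max_dist = reg_targets.max(dim=1).values
    # max > m_i or max < m_{i-1} makes the location a negative sample.
    return (max_dist >= m_lo) & (max_dist <= m_hi)
```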
we share the heads between different feature levels
different levels regress different size ranges → employ $\exp(s_i x)$ with a trainable scalar $s_i$ to automatically adjust the base of the exponential function for feature level $P_i$
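A minimal sketch of such a trainable per-level scale as a PyTorch module (the class name mirrors common FCOS implementations, but this is an illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """One trainable scalar s_i per feature level, so the shared head can
    output exp(s_i * x) and adapt its regression range to that level."""

    def __init__(self, init_value=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_value))

    def forward(self, x):
        return torch.exp(self.scale * x)
```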
Problem: low-quality predicted bounding boxes are produced by locations far away from the center of an object.
The center-ness depicts the normalized distance from the location to the center of the object that the location is responsible for.
$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$
the final score (used for ranking the detected bounding boxes) is computed by multiplying the predicted center-ness with the corresponding classification score.
→ thus the center-ness can down-weight the scores of bounding boxes far from the center of an object.
→ with high probability, these low-quality bounding boxes might be filtered out by the final non-maximum suppression (NMS) process, improving the detection performance remarkably
Weighting the final score filters out the low-quality bounding boxes generated at locations far from the object center.
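A minimal sketch computing the center-ness target from the regression targets, with the inference-time re-weighting shown as a comment (helper name is mine):

```python
import torch

def centerness_target(reg_targets):
    """centerness* = sqrt(min(l,r)/max(l,r) * min(t,b)/max(t,b)).

    reg_targets: (N, 4) tensor of (l*, t*, r*, b*) for positive locations.
    """
    l, t, r, b = reg_targets.unbind(dim=1)
    lr = torch.min(l, r) / torch.max(l, r)
    tb = torch.min(t, b) / torch.max(t, b)
    return torch.sqrt(lr * tb)

# At inference, the ranking score fed to NMS is the classification score
# multiplied by the predicted center-ness, down-weighting off-center boxes:
# score = cls_score * centerness_pred
```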
An alternative to the center-ness is to make use of only the central portion of the ground-truth bounding box as positive samples, at the price of one extra hyper-parameter.
[1] Tian Z., Shen C., Chen H., et al. FCOS: Fully Convolutional One-Stage Object Detection. 2019: 9627-9636.
[2] Lin T.-Y., Goyal P., Girshick R., et al. Focal Loss for Dense Object Detection. 2017: 2999-3007.
[3] Yu J., Jiang Y., Wang Z., et al. UnitBox: An Advanced Object Detection Network. 2016: 516-520.