Paper Reading: RetinaNet: Focal Loss for Dense Object Detection

Table of Contents

  • 1. Network Overview
  • 2. Why two-stage detectors can relatively avoid the easy/hard sample imbalance problem
  • 3. On the one-stage vs. two-stage debate
  • 4. What Focal Loss actually does
  • 5. Class Imbalance
  • 6. Focal Loss Definition
  • 7. Two properties of the focal loss
  • 8. Class Imbalance and Model Initialization
  • 9. Inference
  • 10. The relationship between γ and α
  • 11. Where the 30.2 AP comes from
  • 12. Ablation experiments
  • 13. Anchor Density
  • References

1. Network Overview

[Figures 1-3]
On the surface this paper merely proposes a new loss function, Focal Loss, but it is built on a deep analysis of one-stage and two-stage detectors. The authors conclude that the essential reason one-stage detectors lag behind two-stage ones is class imbalance among the anchors: severe class imbalance easily derails classifier training, because a classifier that labels every sample as negative can still achieve high accuracy.
[Note]: The class imbalance here includes both the imbalance between positive and negative samples and the imbalance between easy negatives and hard negatives.

OHEM, which I read about a couple of days ago, tackles the same problem, but as this paper points out, OHEM trains the network only on high-loss RoIs and simply discards the low-loss ones, whereas Focal Loss merely down-weights the loss of low-loss anchors without abandoning them entirely. The paper's comparison shows Focal Loss outperforming OHEM by about 3 AP points.

2. Why two-stage detectors can relatively avoid the easy/hard sample imbalance problem

Because two-stage detectors have an RPN or Selective Search (SS) stage shielding them.
In the first stage, the RPN performs a simple binary classification of the anchors (it only distinguishes foreground from background and does not decide the fine-grained class). After this initial screening, the number of background boxes is drastically cut down. Although background boxes still far outnumber foreground boxes, the gap is no longer as extreme as among the originally generated anchors; in effect, the problem goes from "extremely imbalanced" to "moderately imbalanced".
That said, two-stage detectors cannot fully avoid this problem either, otherwise OHEM would never have been needed; they only alleviate, to a large extent, the damage class imbalance does to detection accuracy. In the second stage the classifier takes over and performs a much easier second round of (fine-grained) classification on the pre-screened boxes. The classifier is therefore trained much better, and the final detection accuracy is naturally higher. The cost of shuffling data through two stages, however, is added complexity and a severe slowdown in detection speed.

3. On the one-stage vs. two-stage debate

Detection wants the model to localize objects precisely, i.e., to recover the exact translation and scale parameters of an object, while recognizing the object's category requires the model to tolerate a certain amount of translation, scaling, and other transforms; the two goals conflict. So in my view, doing recognition and localization simultaneously inside a one-stage framework is inherently contradictory, and the smaller the inter-class differences, the less stable one-stage detectors become. The problems of one-stage detection are therefore not something a single Focal Loss can solve. Besides, one-stage methods represented by YOLO and SSD were proposed mainly to address inference speed, not accuracy. The same contradiction also exists in one-stage "词法分割"; if that were solved well, the one-stage problem would be solved at the same time, and two-stage methods would become irrelevant.
(I don't quite understand what "词法分割" refers to here; it may be a typo for semantic segmentation.)

The paper also suggests that the speed of two-stage detectors can be improved later on, whereas the accuracy of one-stage detectors is the harder part to fix.

4. What Focal Loss actually does

| Change in loss (vs. CE) | High-volume class (easy negatives) | Low-volume class (positives and hard negatives) |
|---|---|---|
| When correctly classified | drops sharply | drops slightly |
| When misclassified | drops moderately | nearly unchanged |

The loss function is a dynamically scaled cross entropy loss (i.e., modified from the standard cross-entropy loss), where the scaling factor decays to zero as confidence in the correct class increases, see Figure 1. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. Experiments show that our proposed Focal Loss enables us to train a high-accuracy, one-stage detector that significantly outperforms the alternatives of training with the sampling heuristics or hard example mining, the previous state-of-the-art techniques for training one-stage detectors. Finally, we note that the exact form of the focal loss is not crucial, and we show other instantiations can achieve similar results (i.e., the exact form of this loss function does not matter).
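
To make the table concrete, here is a small worked example with numbers of my own choosing, using the focal-loss form defined in Section 6, FL(p_t) = −(1 − p_t)^γ log(p_t), with γ = 2:

$$
\begin{aligned}
p_t = 0.99 \text{ (easy, well classified)}: \quad & \mathrm{CE} = -\log 0.99 \approx 0.010, \quad (1-p_t)^2 = 10^{-4}, \quad \mathrm{FL} \approx 1.0\times10^{-6} \\
p_t = 0.10 \text{ (hard, misclassified)}: \quad & \mathrm{CE} = -\log 0.10 \approx 2.303, \quad (1-p_t)^2 = 0.81, \quad \mathrm{FL} \approx 1.87
\end{aligned}
$$

The easy example's loss shrinks by four orders of magnitude, while the hard example's loss is reduced by less than 20%, which is exactly the behaviour summarized in the table.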

5. Class Imbalance

This imbalance causes two problems:
(1) training is inefficient as most locations are easy negatives that contribute no useful learning signal;
(2) en masse,
the easy negatives can overwhelm training and lead to degenerate models.

6. Focal Loss Definition

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases}$$

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise,} \end{cases} \qquad \text{so that } \mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t)$$

A common method for addressing class imbalance is to introduce a weighting factor α ∈ [0, 1] for class 1 and 1 − α for class −1. In practice α may be set by inverse class frequency or treated as a hyperparameter to set by cross validation. For notational convenience, we define αt analogously to how we defined pt. We write the α-balanced CE loss as:

$$\mathrm{CE}(p_t) = -\alpha_t \log(p_t)$$

Easily classified negatives comprise
the majority of the loss and dominate the gradient. While
α balances the importance of positive/negative examples, it
does not differentiate between easy/hard examples.

The focal loss adds a modulating factor to the cross entropy:

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$

and the α-balanced variant used in practice is:

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
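
Below is a minimal sketch of this α-balanced focal loss in PyTorch, written directly from the formulas above. It is my own illustration rather than the authors' reference implementation; treating the classifier output as a single per-anchor foreground/background logit is my simplification (RetinaNet actually predicts one binary output per class per anchor).

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  raw scores before the sigmoid, shape (N,)
    targets: 1 for foreground, 0 for background, shape (N,)
    """
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)              # p_t as defined above
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))   # alpha_t defined analogously
    # -log(p_t), computed from the logits for numerical stability
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    return alpha_t * (1 - p_t) ** gamma * ce
```

With `gamma=0` this reduces to the α-balanced CE loss above; `gamma=2, alpha=0.25` is the combination the paper reports as working best.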

7. Two properties of the focal loss

(1) When an example is misclassified and pt
is small, the modulating factor is near 1 and the loss is unaffected.
As pt → 1, the factor goes to 0 and the loss for well-classified
examples is down-weighted.
(2) The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, FL is equivalent to CE, and as γ is increased the effect of the modulating factor is likewise increased (we found γ = 2 to work best in our experiments).
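
A quick check of both properties using the sketch from Section 6 (again just an illustrative snippet of my own; it assumes the `focal_loss` helper defined above):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, -0.5])   # one easy positive, one hard positive
targets = torch.tensor([1, 1])

# gamma = 0: the modulating factor is 1, so FL reduces to (alpha-weighted) CE
fl_g0 = focal_loss(logits, targets, alpha=0.5, gamma=0.0)
ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
print(torch.allclose(fl_g0, 0.5 * ce))        # True

# gamma = 2: the easy example is down-weighted far more strongly than the hard one
fl_g2 = focal_loss(logits, targets, alpha=0.5, gamma=2.0)
print(fl_g2 / fl_g0)                          # ≈ [0.0022, 0.39]
```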

8. Class Imbalance and Model Initialization

The paper brings this point up several times:


Binary classification models are by default initialized to
have equal probability of outputting either y = -1 or 1.
Under such an initialization, in the presence of class imbalance, the loss due to the frequent class can dominate total
loss and cause instability in early training. To counter this,
we introduce the concept of a ‘prior’ for the value of p estimated by the model for the rare class (foreground) at the
start of training. We denote the prior by π and set it so that
the model’s estimated p for examples of the rare class is low,
e.g. 0.01. We note that this is a change in model initialization (see §4.1) and not of the loss function. We found this
to improve training stability for both the cross entropy and
focal loss in the case of heavy class imbalance.

[Note]: Roughly speaking, at model initialization we make the predicted foreground probability p small for the first iterations, e.g. 0.01. That way, when the network, which sees mostly easy background, suddenly encounters a foreground example, that example's loss −log(p) is large, while each negative example's loss −log(1 − p) stays small because p is small; so even with a huge number of negatives, the total loss is not dominated by them.

How is this implemented? That brings us to the implementation details in §4.1.
The original text:

Initialization: All new conv layers except the final one in the RetinaNet subnets are initialized with bias b = 0 and a Gaussian weight fill with σ = 0.01. For the final conv layer of the classification subnet, we set the bias initialization to b = −log((1 − π)/π), where π specifies that at the start of training every anchor should be labeled as foreground with confidence of ∼π. We use π = .01 in all experiments, although results are robust to the exact value. As explained in §3.3, this initialization prevents the large number of background anchors from generating a large, destabilizing loss value in the first iteration of training.

In other words, at the very start of training every anchor is treated as foreground with a small probability (about 0.01).
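
A quick sanity check of this bias initialization (my own snippet; only the formula b = −log((1 − π)/π) comes from the paper):

```python
import math

pi = 0.01                              # prior foreground probability
b = -math.log((1 - pi) / pi)           # bias of the final conv layer of the classification subnet
print(b)                               # ≈ -4.595

# with the weights drawn from a zero-mean Gaussian, the initial output is roughly sigmoid(b)
p0 = 1 / (1 + math.exp(-b))
print(p0)                              # ≈ 0.01, i.e. every anchor starts as low-confidence foreground
```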

9. Inference

RetinaNet forms a single FCN comprised of a
ResNet-FPN backbone, a classification subnet, and a box
regression subnet, see Figure 3. As such, inference involves
simply forwarding an image through the network. To improve speed, we only decode box predictions from at most
1k top-scoring predictions per FPN level, after thresholding detector confidence at 0.05. The top predictions from
all levels are merged and non-maximum suppression with a
threshold of 0.5 is applied to yield the final detections.
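
A rough sketch of this post-processing in PyTorch (my own reading of the paragraph above, not the authors' code; box decoding from anchor offsets is omitted, class handling is simplified to a single score per box, and `torchvision.ops.nms` is assumed to be available):

```python
import torch
from torchvision.ops import nms

def postprocess(per_level_boxes, per_level_scores,
                score_thresh=0.05, topk=1000, nms_thresh=0.5):
    """per_level_boxes: list of (N_i, 4) tensors; per_level_scores: list of (N_i,) tensors."""
    all_boxes, all_scores = [], []
    for boxes, scores in zip(per_level_boxes, per_level_scores):
        keep = scores > score_thresh              # threshold detector confidence at 0.05
        boxes, scores = boxes[keep], scores[keep]
        if scores.numel() > topk:                 # keep at most the 1k top-scoring predictions per level
            scores, idx = scores.topk(topk)
            boxes = boxes[idx]
        all_boxes.append(boxes)
        all_scores.append(scores)
    boxes = torch.cat(all_boxes)                  # merge the top predictions from all FPN levels
    scores = torch.cat(all_scores)
    keep = nms(boxes, scores, nms_thresh)         # NMS with IoU threshold 0.5
    return boxes[keep], scores[keep]
```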

10. The relationship between γ and α

We emphasize that when training RetinaNet, the focal loss is
applied to all ∼100k anchors in each sampled image. This
stands in contrast to common practice of using heuristic
sampling (RPN) or hard example mining (OHEM, SSD) to
select a small set of anchors (e.g., 256) for each minibatch.
The total focal loss of an image is computed as the sum
of the focal loss over all ∼100k anchors, normalized by the
number of anchors assigned to a ground-truth box. We perform the normalization by the number of assigned anchors,
not total anchors, since the vast majority of anchors are easy
negatives and receive negligible loss values under the focal
loss.
Finally we note that α, the weight assigned to the rare
class, also has a stable range, but it interacts with γ making it necessary to select the two together (see Tables 1a
and 1b). In general α should be decreased slightly as γ is
increased (for γ = 2, α = 0.25 works best).

As γ increases, α should be decreased slightly; the two interact and have to be chosen together.
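
A minimal sketch of the per-image loss normalization described above (illustrative only; it reuses the `focal_loss` helper from Section 6, and the label convention of 1 = assigned to a ground-truth box, 0 = background, −1 = ignored is my assumption):

```python
import torch

num_anchors = 100_000
cls_logits = torch.randn(num_anchors)                  # classification logits for one image
anchor_labels = torch.zeros(num_anchors, dtype=torch.long)
anchor_labels[:50] = 1                                 # pretend 50 anchors are assigned to GT boxes
anchor_labels[50:100] = -1                             # a few anchors ignored during training

valid = anchor_labels >= 0                             # drop ignored anchors
per_anchor_loss = focal_loss(cls_logits[valid], anchor_labels[valid], alpha=0.25, gamma=2.0)

# normalize by the number of anchors assigned to a ground-truth box, not by all ~100k anchors
num_assigned = (anchor_labels == 1).sum().clamp(min=1)
cls_loss = per_anchor_loss.sum() / num_assigned
```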

11. Where the 30.2 AP comes from

Our first attempt to train RetinaNet uses standard cross entropy (CE) loss without any modifications to the initialization or learning strategy. This
fails quickly,
with the network diverging during training.
However, simply initializing the last layer of our model such
that the prior probability of detecting an object is π = .01
(see §4.1) enables effective learning. Training RetinaNet
with ResNet-50 and this initialization already yields a respectable AP of 30.2 on COCO. Results are insensitive to
the exact value of π so we use π = .01 for all experiments.

Simply applying the π prior initialization is enough for RetinaNet to reach 30.2 AP; note that neither γ nor α is used at this point.

12. Ablation experiments

[Figure 4: ablation experiment results from the paper]

13. Anchor Density

One of the most important design factors in a one-stage detection system is how densely it covers the space of possible image boxes. Two-stage detectors can
classify boxes at any position, scale, and aspect ratio using
a region pooling operation [10]. In contrast, as one-stage
detectors use a fixed sampling grid, a popular approach for
achieving high coverage of boxes in these approaches is to
use multiple ‘anchors’ [28] at each spatial position to cover
boxes of various scales and aspect ratios.
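
As a small illustration of this anchor design, here is a sketch that enumerates the anchor shapes at one spatial position (my own snippet; the 3 scales × 3 aspect ratios = 9 anchors per position and the base size of 32 on the P3 level follow the paper's §4 description):

```python
import itertools
import math

def anchors_at_position(base_size, scales=(2**0, 2**(1/3), 2**(2/3)), ratios=(0.5, 1.0, 2.0)):
    """Return (w, h) pairs of the anchors centred at one spatial position."""
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        area = (base_size * scale) ** 2
        w = math.sqrt(area / ratio)        # ratio = h / w
        h = w * ratio
        anchors.append((w, h))
    return anchors

print(len(anchors_at_position(base_size=32)))   # 9 anchors per position on the P3 level
```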

References

1. How to evaluate Kaiming's "Focal Loss for Dense Object Detection"? (如何评价Kaiming的Focal Loss for Dense Object Detection?)

2. Focal Loss

3. Paper reading: RetinaNet (论文阅读: RetinaNet)
