Paper: A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection
Implementation based on mmdet: aLRP Loss
Mainstream detectors optimise a combined loss of the form $L = L_c + w_r L_r$, which has several drawbacks.
Ranking-based loss formulations can alleviate the positive/negative imbalance to some extent, but most existing methods are limited to the classification task and leave the localisation branch unchanged.
In the AP Loss ranking loss $L=\frac{1}{Z}\sum_{i\in P}l(i)$, $Z$ is a normalising constant.
Define $L_{ij}$ as the loss computed on the pair of positive sample $i$ and negative sample $j$; it can be viewed as the loss $l(i)$ of positive $i$ distributed onto negative $j$ through the probability $p(j|i)$:

$$L_{ij}=\begin{cases} l(i)\,p(j|i) & \text{for } i\in P,\ j\in N\\ 0 & \text{otherwise} \end{cases}$$

The AP Loss ranking loss can therefore be written as $L=\frac{1}{Z}\sum_{i\in P}l(i)=\frac{1}{Z}\sum_{i\in P}\sum_{j\in N}L_{ij}$.
This formulation is more flexible: by choosing $p(j|i)$ appropriately, the distribution of the loss can be changed, for example to put more weight on hard examples. A small sketch of this primary-term view follows.
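To make the primary-term view concrete, here is a minimal sketch (not code from the repo; the function name and the choice of $p(j|i)$ are assumptions) that materialises $L_{ij}$ for AP Loss, using the ranking error $l(i)=N_{FP}(i)/rank(i)$, the uniform distribution $p(j|i)=H(x_{ij})/N_{FP}(i)$ over higher-scoring negatives, and the same piecewise-linear relaxation of $H$ that appears in the code further below:

```python
import torch

# Sketch only: ap_primary_terms is a hypothetical helper, not the paper's code.
def ap_primary_terms(fg_scores, bg_scores, delta=1.0):
    # x_ij = s_j - s_i for every (positive i, negative j) pair
    x_bg = bg_scores[None, :] - fg_scores[:, None]                  # (|P|, |N|)
    H_bg = torch.clamp(x_bg / (2 * delta) + 0.5, min=0, max=1)      # relaxed step H(x_ij)
    x_fg = fg_scores[None, :] - fg_scores[:, None]
    H_fg = torch.clamp(x_fg / (2 * delta) + 0.5, min=0, max=1)
    H_fg.fill_diagonal_(0)                                          # exclude k = i

    FP_num = H_bg.sum(dim=1)                                        # N_FP(i): negatives ranked above i
    rank = 1 + H_fg.sum(dim=1) + FP_num                             # rank(i) among all samples
    l_i = FP_num / rank                                             # per-positive ranking error
    p_j_given_i = H_bg / FP_num.clamp(min=1e-12)[:, None]           # p(j|i)
    return l_i[:, None] * p_j_given_i                               # primary terms L_ij
```

Summing the returned $L_{ij}$ over $j$ recovers $l(i)$, and summing over both indices (divided by $Z$) recovers the ranking loss above.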
Moreover, with this formulation the total gradient contributions of positive and negative samples are balanced (Theorem 2 in the paper):

$$\sum_{i\in P}\left|\frac{\partial L}{\partial s_i}\right|=\sum_{j\in N}\left|\frac{\partial L}{\partial s_j}\right|$$
AP Loss is extended into aLRP Loss to address these shortcomings. Following the relation between precision and AP Loss, aLRP Loss is defined as the average LRP error of the positive samples over the PR curve:
$$L^{aLRP}:=\frac{1}{|P|}\sum_{i\in P}l^{LRP}(i)$$
Assume the anchors are dense enough to cover every ground-truth box, i.e. $N_{FN}=0$ (no ground truth is missed or assigned as a negative). The positive set $P$ then corresponds to TP, the negative set $N$ to FP, and FN does not enter the loss. Starting from the definition of the LRP metric:
$$LRP(s)=\frac{1}{\cancel{N_{FN}}+N_{TP}+N_{FP}}\left(\cancel{N_{FN}}+N_{FP}+\sum_{k\in TP}\varepsilon_{loc}(k)\right)$$
For a positive sample $i$, the loss $l^{LRP}(i)$ is defined as follows, where $N_{FP}(i)$ is the number of negatives ranked above $i$ (its rank among the negatives) and $rank(i)=N_{TP}(i)+N_{FP}(i)$ is its rank among all positive and negative samples:
$$l^{LRP}(i)=\frac{1}{rank(i)}\left(N_{FP}(i)+\varepsilon_{loc}(i)+\sum_{k\in P,k\neq i}\varepsilon_{loc}(k)H(x_{ik})\right)$$
This expression can also be split into two parts, a classification term and a localisation term:
$$l^{LRP}(i)=\textcolor{orangered}{\frac{N_{FP}(i)}{rank(i)}}+\textcolor{blue}{\frac{1}{rank(i)}\left(\varepsilon_{loc}(i)+\sum_{k\in P,k\neq i}\varepsilon_{loc}(k)H(x_{ik})\right)}$$
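The two terms can be computed directly for a single positive; the sketch below (a hypothetical helper, not repo code) assumes `eps_loc` holds the localisation errors $\varepsilon_{loc}$ of all positives, e.g. derived from $1-IoU$:

```python
import torch

def lrp_loss_single(i, fg_scores, bg_scores, eps_loc, delta=1.0):
    # relaxed H(x_ik) against the other positives
    H_fg = torch.clamp((fg_scores - fg_scores[i]) / (2 * delta) + 0.5, min=0, max=1)
    H_fg[i] = 0                                            # drop k = i from the sum
    # relaxed H(x_ij) against the negatives
    H_bg = torch.clamp((bg_scores - fg_scores[i]) / (2 * delta) + 0.5, min=0, max=1)

    FP_num = H_bg.sum()                                    # N_FP(i)
    rank_i = 1 + H_fg.sum() + FP_num                       # rank(i)

    cls_term = FP_num / rank_i                                      # classification (ranking) part
    loc_term = (eps_loc[i] + (eps_loc * H_fg).sum()) / rank_i       # localisation part
    return cls_term, loc_term                                       # l^LRP(i) = cls_term + loc_term
```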
Note that aLRP Loss only computes the localisation loss on correctly classified samples, so at the start of training, when classification is still poor, the loss is dominated by the classification term and the localisation branch is hard to optimise. To mitigate this, a self-balancing factor is introduced: since by Theorem 2 the gradient contributions of positives and negatives are equal, the box gradient $\frac{\partial L^{aLRP}}{\partial B}$ is multiplied by the average of $\frac{L^{aLRP}}{L_{loc}^{aLRP}}$ over the epoch, so that the classification scores and the localisation gradients contribute comparably to aLRP Loss.
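A minimal sketch of that self-balancing weight, under the assumption that the ratio is accumulated per iteration and refreshed once per epoch (class name and initial value are illustrative, not the repo's exact code):

```python
class SelfBalance:
    """Keeps a running average of L^aLRP / L^aLRP_loc and exposes it as a weight."""

    def __init__(self, init_weight=50.0):   # starting value before the first epoch ends (assumption)
        self.weight = init_weight
        self.ratio_sum, self.count = 0.0, 0

    def accumulate(self, total_loss, loc_loss):
        # called every iteration with the scalar loss values of this batch
        self.ratio_sum += float(total_loss) / max(float(loc_loss), 1e-12)
        self.count += 1

    def step_epoch(self):
        # refresh the weight with the epoch average, then reset the accumulators
        if self.count > 0:
            self.weight = self.ratio_sum / self.count
        self.ratio_sum, self.count = 0.0, 0


# per-iteration use (sketch): loss = cls_loss + self_balance.weight * loc_loss
```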
As with AP Loss, aLRP Loss can be written through primary terms $L_{ij}^{aLRP}=l^{LRP}(i)\,p(j|i)$; the corresponding target value is:
$$\begin{aligned} L_{ij}^{aLRP^*}&=l^{LRP}(i)^*\cdot p(j|i)\\ &=\frac{1}{rank(i)}\left(\cancel{N_{FP}(i)}+\varepsilon_{loc}(i)+\cancel{\sum_{k\in P,k\neq i}\varepsilon_{loc}(k)H(x_{ik})}\right)\cdot p(j|i)\\ &=\frac{\varepsilon_{loc}(i)}{rank(i)}\cdot p(j|i) \end{aligned}$$
The error-driven update for $x_{ij}$ then follows, from which $\frac{\partial L^{aLRP}}{\partial s_i}$ can be derived.
$$\begin{aligned} \Delta x_{ij}&=\left(l^{LRP}(i)^*-l^{LRP}(i)\right)p(j|i)\\ &=-\frac{1}{rank(i)}\left(N_{FP}(i)+\sum_{k\in P,k\neq i}\varepsilon_{loc}(k)H(x_{ik})\right)\frac{H(x_{ij})}{N_{FP}(i)} \end{aligned}$$
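Summing these updates over the pairs gives the score gradients that the code below stores in `classification_grads`. With the normaliser $\frac{1}{|P|}$, and reading the sign convention off the code (my summary, not a formula quoted from the paper):

$$\frac{\partial L^{aLRP}}{\partial s_i}=\frac{1}{|P|}\sum_{j\in N}\Delta x_{ij},\qquad \frac{\partial L^{aLRP}}{\partial s_j}=-\frac{1}{|P|}\sum_{i\in P}\Delta x_{ij}$$

Since $\Delta x_{ij}\le 0$, positives receive negative gradients (their scores are pushed up) while negatives receive positive gradients (their scores are pushed down).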
Highlight
The code structure is similar to that of AP Loss: it produces a `classification_grads` tensor that is used directly in backpropagation. The gradient of a positive sample is determined by both its ranking quality and its regression quality: `torch.sum(fg_relations * regression_losses)` accumulates the regression losses of the other positives ranked above it, and `FP_num/rank[ii]` is its ranking loss. The gradient of a negative sample is jointly determined by all positives it is wrongly ranked above, via `relevant_bg_grad += (bg_relations*(-fg_grad[ii]/FP_num))`: the larger a positive's total loss and the fewer negatives ranked above it (smaller `FP_num`), the larger its contribution to each such negative.
```python
import torch


class aLRPLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, logits, targets, regression_losses, delta=1., eps=1e-5):
        classification_grads = torch.zeros(logits.shape).cuda()
        # ---------------------#
        #  Filter fg logits
        # ---------------------#
        fg_labels = (targets == 1)
        fg_logits = logits[fg_labels]
        fg_num = len(fg_logits)

        rank = torch.zeros(fg_num).cuda()
        prec = torch.zeros(fg_num).cuda()
        fg_grad = torch.zeros(fg_num).cuda()

        # --------------------------------------#
        #  Filter non-trivial negative samples
        # --------------------------------------#
        # Do not use bg with scores less than the minimum fg logit,
        # since changing their scores does not affect precision
        threshold_logit = torch.min(fg_logits) - delta
        # Get valid bg logits
        relevant_bg_labels = ((targets == 0) & (logits >= threshold_logit))
        relevant_bg_logits = logits[relevant_bg_labels]
        relevant_bg_grad = torch.zeros(len(relevant_bg_logits)).cuda()

        # -----------------------------#
        #  Loop on positive indices
        # -----------------------------#
        # sort the fg logits and loop over each positive following that order
        order = torch.argsort(fg_logits)
        for ii in order:
            # x_ij as score differences with the other fgs
            fg_relations = fg_logits - fg_logits[ii]
            # apply the piecewise-linear relaxation of the step function H(x_ij)
            fg_relations = torch.clamp(fg_relations / (2 * delta) + 0.5, min=0, max=1)
            # discard i = j in the summation for rank^+(i)
            fg_relations[ii] = 0
            # x_ij as score differences with the bgs
            bg_relations = relevant_bg_logits - fg_logits[ii]
            # apply the piecewise-linear relaxation of H(x_ij) for the bgs
            bg_relations = torch.clamp(bg_relations / (2 * delta) + 0.5, min=0, max=1)

            # rank of the example within the fgs (rank^+(i)) and number of bgs with larger scores (N_FP(i))
            rank_pos = 1 + torch.sum(fg_relations)
            FP_num = torch.sum(bg_relations)
            # store the total, since it is also the normaliser of the aLRP regression error: rank(i)
            rank[ii] = rank_pos + FP_num
            # precision of this example, used for the classification loss value
            prec[ii] = rank_pos / rank[ii]
            # for stability, only update when FP_num exceeds a small eps; no AP interpolation here
            if FP_num > eps:
                # fg_grad[ii] = -(localisation error + ranking error) / rank(i)
                fg_grad[ii] = -(torch.sum(fg_relations * regression_losses) + FP_num) / rank[ii]
                # distribute -fg_grad[ii] over the responsible bgs via H(x_ij) / N_FP(i)
                relevant_bg_grad += (bg_relations * (-fg_grad[ii] / FP_num))

        # aLRP with the gradient formulation
        classification_grads[fg_labels] = fg_grad
        classification_grads[relevant_bg_labels] = relevant_bg_grad
        classification_grads /= fg_num

        cls_loss = 1 - prec.mean()
        ctx.save_for_backward(classification_grads)

        return cls_loss, rank, order

    @staticmethod
    def backward(ctx, out_grad1, out_grad2, out_grad3):
        g1, = ctx.saved_tensors
        # hand the precomputed error-driven gradients back to the logits
        return g1 * out_grad1, None, None, None, None
```
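A usage sketch with illustrative tensors (the shapes, the 5% positive rate and the random regression losses are assumptions; the Function requires CUDA because of the internal `.cuda()` calls):

```python
import torch

logits = torch.randn(1000, device='cuda', requires_grad=True)       # scores of all anchors
targets = (torch.rand(1000, device='cuda') < 0.05).float()          # 1 = positive, 0 = negative
# localisation errors of the positives, ordered like logits[targets == 1]
regression_losses = torch.rand(int(targets.sum().item()), device='cuda')

cls_loss, rank, order = aLRPLoss.apply(logits, targets, regression_losses, 1.0, 1e-5)
cls_loss.backward()   # writes the precomputed classification_grads into logits.grad
```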