[Paper Reading | AAAI 2020 | Visual Tracking] SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines

Introduction

Paper: SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines

Code: MegviiDetection/video_analyst

This paper improves on SiamFC and proposes SiamFC++, which achieves state-of-the-art results on OTB2015, VOT2018, LaSOT, GOT-10k, and TrackingNet.

SiamFC++ is designed according to the following guidelines:

  • G1 (decomposition of classification and state estimation): The tracker should perform two sub-tasks, classification and state estimation.
  • G2 (non-ambiguous scoring): The classification score should directly represent the confidence of target existence in the "field of view", i.e. the sub-window of the corresponding pixel, rather than relying on pre-defined settings such as anchor boxes.
  • G3 (prior knowledge-free): Tracking approaches should be free of prior knowledge such as the scale/ratio distribution of targets.
  • G4 (estimation quality assessment): An estimation quality score independent of the classification score should be used.

Related Work

The target state estimation of most existing trackers falls into three categories:

  1. Rescaling the search patch into multiple scales and assembling a mini-batch of scaled images, handling scale changes by rescaling the search region (e.g., DCF, SiamFC).
  2. Iteratively refining the coarse initial location obtained by classification for accurate box estimation (e.g., ATOM).
  3. Using an RPN to regress the location shift and size difference between pre-defined anchor boxes and the target (e.g., SiamRPN).

Main Content

(Figure 1: overall framework of SiamFC++)

The figure above shows the overall framework of SiamFC++. The blue and red parts are the branches newly added on top of the original SiamFC: compared with SiamFC, SiamFC++ has an extra quality-assessment branch and a bounding-box regression branch.

Siamese-based Feature Extraction and Matching

The computation of the embedding features via cross-correlation can be expressed as:

$$f_{i}(z, x)=\psi_{i}(\phi(z)) \star \psi_{i}(\phi(x)), \quad i \in\{\mathrm{cls}, \mathrm{reg}\}$$

Both $\psi_{cls}$ and $\psi_{reg}$ are applied after the common feature extractor $\phi$ to map the shared features into task-specific feature spaces. Note that the features produced by $\psi_{cls}$ and $\psi_{reg}$ are of the same size.
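
As an illustration, here is a minimal PyTorch sketch of a depth-wise cross-correlation, one common way to implement the $\star$ operator above; the function name and the depth-wise choice are my assumptions, not necessarily what video_analyst does:

```python
import torch
import torch.nn.functional as F

def xcorr_depthwise(x, z):
    """Depth-wise cross-correlation between search features x and template features z."""
    # x: (B, C, Hx, Wx)  search-region features psi_i(phi(x))
    # z: (B, C, Hz, Wz)  template features      psi_i(phi(z))
    batch, channels = x.size(0), x.size(1)
    x = x.view(1, batch * channels, x.size(2), x.size(3))
    z = z.view(batch * channels, 1, z.size(2), z.size(3))
    out = F.conv2d(x, z, groups=batch * channels)   # correlate each channel independently
    return out.view(batch, channels, out.size(2), out.size(3))

# one correlation map per head, matching the formula above:
# f_cls = xcorr_depthwise(psi_cls_x, psi_cls_z)
# f_reg = xcorr_depthwise(psi_reg_x, psi_reg_z)
```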

Application of Design Guidelines in Head Network

After the embedding features are extracted, a classification head and a regression head are designed, splitting the model into a classification task and a regression task (following guideline G1).

The regression head outputs a 4-dimensional vector $\boldsymbol{t}^{*}=\left(l^{*}, t^{*}, r^{*}, b^{*}\right)$ at each location; its elements are defined as follows (where $s$ denotes the total stride of the backbone, 8 in this paper):

$$\begin{aligned}
l^{*} &=\left(\left\lfloor\tfrac{s}{2}\right\rfloor+x s\right)-x_{0}, & t^{*} &=\left(\left\lfloor\tfrac{s}{2}\right\rfloor+y s\right)-y_{0} \\
r^{*} &=x_{1}-\left(\left\lfloor\tfrac{s}{2}\right\rfloor+x s\right), & b^{*} &=y_{1}-\left(\left\lfloor\tfrac{s}{2}\right\rfloor+y s\right)
\end{aligned}$$

where $(x_{0}, y_{0})$ and $(x_{1}, y_{1})$ denote the top-left and bottom-right corners of the ground-truth bounding box $B^{*}$ associated with the point $(x, y)$.
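
To make the target computation concrete, here is a small NumPy sketch of these per-location regression targets; the function and variable names are my own, not taken from video_analyst:

```python
import numpy as np

def regression_targets(feat_size, stride, gt_box):
    """Compute (l*, t*, r*, b*) for every feature-map location.

    feat_size: (H, W) of the regression/score map
    stride:    total stride s of the backbone (8 in the paper)
    gt_box:    (x0, y0, x1, y1) ground-truth box in input-image coordinates
    """
    H, W = feat_size
    x0, y0, x1, y1 = gt_box
    # image coordinates corresponding to each feature-map location (x, y)
    xs = stride // 2 + np.arange(W) * stride
    ys = stride // 2 + np.arange(H) * stride
    xs, ys = np.meshgrid(xs, ys)                  # each of shape (H, W)
    l = xs - x0
    t = ys - y0
    r = x1 - xs
    b = y1 - ys
    targets = np.stack([l, t, r, b], axis=-1)     # (H, W, 4)
    # a location is positive iff it falls inside the ground-truth box,
    # i.e. all four distances are positive
    pos_mask = targets.min(axis=-1) > 0
    return targets, pos_mask
```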

One branch of the classification head outputs the classification score map $\psi_{cls}$, whose labels are assigned as follows:

A location $(x, y)$ on the feature map $\psi_{cls}$ is considered a positive sample if its corresponding location $\left(\left\lfloor\frac{s}{2}\right\rfloor+x s,\ \left\lfloor\frac{s}{2}\right\rfloor+y s\right)$ on the input image falls into the ground-truth bounding box; otherwise, it is a negative sample.

The other branch predicts PSS (the paper notes that IoU can be used instead). This branch assesses the quality of the predicted bounding box and suppresses boxes far from the target center (guideline G4); a small sketch of the target computation follows.
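
The PSS target is defined in the paper analogously to the centerness score of FCOS; the sketch below reuses the hypothetical `regression_targets` outputs from above and is only an illustration:

```python
import numpy as np

def pss_targets(targets, pos_mask):
    """Quality (PSS) target per location, defined like FCOS centerness.

    targets:  (H, W, 4) array of (l*, t*, r*, b*) from the sketch above
    pos_mask: (H, W) boolean mask of positive locations
    """
    l, t, r, b = np.moveaxis(targets, -1, 0)
    with np.errstate(invalid="ignore", divide="ignore"):
        pss = np.sqrt((np.minimum(l, r) / np.maximum(l, r)) *
                      (np.minimum(t, b) / np.maximum(t, b)))
    # the quality loss is only applied at positive locations
    # (see the indicator term in the training objective below)
    return np.where(pos_mask, pss, 0.0)
```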

Training Objective

The overall training objective is:

$$\begin{aligned}
L\left(\left\{p_{x, y}\right\}, q_{x, y},\left\{\boldsymbol{t}_{x, y}\right\}\right) = {} & \frac{1}{N_{\mathrm{pos}}} \sum_{x, y} L_{\mathrm{cls}}\left(p_{x, y}, c_{x, y}^{*}\right) \\
& +\frac{\lambda}{N_{\mathrm{pos}}} \sum_{x, y} \mathbb{1}_{\left\{c_{x, y}^{*}>0\right\}} L_{\text{quality}}\left(q_{x, y}, q_{x, y}^{*}\right) \\
& +\frac{\lambda}{N_{\mathrm{pos}}} \sum_{x, y} \mathbb{1}_{\left\{c_{x, y}^{*}>0\right\}} L_{\mathrm{reg}}\left(\boldsymbol{t}_{x, y}, \boldsymbol{t}_{x, y}^{*}\right)
\end{aligned}$$

where $L_{cls}$ is the focal loss (see Focal Loss for Dense Object Detection), $L_{quality}$ is the binary cross-entropy (BCE) loss, and $L_{reg}$ is the IoU loss.
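
To show the structure of this objective, here is a rough PyTorch sketch; it is not the implementation in video_analyst, the (l, t, r, b)-style IoU loss is one common variant, and all tensor names and shapes are assumptions (inputs are flattened over locations):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss  # focal loss used for L_cls

def iou_loss_ltrb(pred, target, eps=1e-7):
    """IoU loss for boxes in (l, t, r, b) offset form, one value per location."""
    pl, pt, pr, pb = pred.unbind(-1)
    tl, tt, tr, tb = target.unbind(-1)
    inter = (torch.min(pl, tl) + torch.min(pr, tr)) * \
            (torch.min(pt, tt) + torch.min(pb, tb))
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    return 1.0 - inter / (union + eps)

def siamfcpp_loss(cls_logits, quality_logits, reg_pred,
                  cls_target, quality_target, reg_target, lam=1.0):
    """Sketch of the objective above: focal + BCE(quality) + IoU(regression)."""
    pos = cls_target > 0                                  # indicator 1{c* > 0}
    n_pos = pos.sum().clamp(min=1).float()
    loss_cls = sigmoid_focal_loss(cls_logits, cls_target.float(),
                                  reduction="sum") / n_pos
    loss_quality = F.binary_cross_entropy_with_logits(
        quality_logits[pos], quality_target[pos], reduction="sum") * lam / n_pos
    loss_reg = iou_loss_ltrb(reg_pred[pos], reg_target[pos]).sum() * lam / n_pos
    return loss_cls + loss_quality + loss_reg
```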

Supplementary Notes

The authors discuss the post-processing of the prediction stage in more detail in Appendix B.

模型将cls_scorequality_accessment进行element-wise production后得到score map.之后对score map进行系列处理(乘hanning window等操作)得到最终的score map,即 s ~ [ x ] \tilde{s}[x] s~[x].

An argmax operation then selects the predicted bounding box:
$$\begin{aligned}
x^{*} &=\underset{x \in[0 \,..\, N-1]^{\otimes 2}}{\arg \max }\ \tilde{s}[x] \\
B_{\text{curr}} &=B\left[x^{*}\right]
\end{aligned}$$

Finally, the box size is updated by linear interpolation with the previous frame to obtain the final prediction ($\alpha$ is a hyperparameter):

$$\begin{aligned}
\alpha^{\prime} &=\tilde{s}\left[x^{*}\right] \cdot \alpha \\
B_{\text{pred}}.\text{size} &=\left(1-\alpha^{\prime}\right) \cdot B_{\text{prev}}.\text{size}+\alpha^{\prime} \cdot B_{\text{curr}}.\text{size}
\end{aligned}$$
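
A compact NumPy sketch of this whole post-processing chain (window weighting, argmax, size smoothing); the convex-combination windowing scheme, default values, and all names are my assumptions, not the exact code in video_analyst:

```python
import numpy as np

def select_and_smooth(cls_score, quality, boxes, prev_size,
                      window_influence=0.2, alpha=0.5):
    """Pick the best box from the score map and smooth its size.

    cls_score, quality: (H, W) maps from the two heads
    boxes:              (H, W, 4) decoded boxes (x0, y0, x1, y1), one per location
    prev_size:          (w, h) of the previous frame's box
    """
    H, W = cls_score.shape
    score = cls_score * quality                           # element-wise product
    # cosine (Hanning) window penalizing large displacements from the center
    hanning = np.outer(np.hanning(H), np.hanning(W))
    score = (1 - window_influence) * score + window_influence * hanning
    # argmax over the 2-D score map
    y, x = np.unravel_index(np.argmax(score), score.shape)
    box_curr = boxes[y, x]
    curr_size = np.array([box_curr[2] - box_curr[0], box_curr[3] - box_curr[1]])
    # linear interpolation of the size with the previous frame
    a = alpha * score[y, x]
    new_size = (1 - a) * np.asarray(prev_size) + a * curr_size
    return box_curr, new_size
```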

Experimental Results

(Figure 2: experimental results of SiamFC++ on the benchmarks)

Summary

SiamFC++ retains the simplicity and high speed of SiamFC, and the experiments in the paper show that the newly added regression branch indeed brings a large improvement in tracking performance. The paper also borrows several ideas from object detection; detection and tracking are becoming ever more closely connected, so developments in object detection are worth following as well.
