paper:SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines
code:MegviiDetection/video_analyst
This paper improves on SiamFC and proposes SiamFC++, which achieves state-of-the-art results on OTB2015, VOT2018, LaSOT, GOT-10k, and TrackingNet.
The paper designs SiamFC++ around a set of target state estimation guidelines (referred to below as G1 and so on).
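Target state estimation in most current trackers falls into three broad categories, represented respectively by (1) DCF and SiamFC, (2) ATOM, and (3) SiamRPN.
The paper's framework figure shows the main architecture of SiamFC++; the blue and red parts are the branches added relative to the original SiamFC. Unlike SiamFC, SiamFC++ adds a quality assessment branch and a bounding-box (bbox) regression branch.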
Computing the embedding features through cross-correlation can be written as:
$$f_{i}(z, x)=\psi_{i}(\phi(z)) \star \psi_{i}(\phi(x)), \quad i \in\{\mathrm{cls}, \mathrm{reg}\}$$
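Here $z$ is the template image, $x$ is the search image, $\phi$ is the shared backbone used for common feature extraction, and $\star$ denotes cross-correlation. Both $\psi_{cls}$ and $\psi_{reg}$ are applied after common feature extraction to adjust the common features into task-specific feature spaces. Note that the features extracted by $\psi_{cls}$ and $\psi_{reg}$ are of the same size.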
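To make the $\star$ operation concrete, the sketch below slides a small template feature map over a larger search feature map and sums the element-wise products at each offset. It is a minimal single-channel, pure-Python illustration; the function name `xcorr2d` and the toy inputs are hypothetical, and a real tracker would apply this to multi-channel feature tensors.

```python
def xcorr2d(search, template):
    """Valid-mode 2D cross-correlation of one channel (the kernel is not flipped)."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    out = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            acc = 0.0
            # Sum of element-wise products between the template and the
            # search window whose top-left corner is (i, j).
            for u in range(th):
                for v in range(tw):
                    acc += search[i + u][j + v] * template[u][v]
            row.append(acc)
        out.append(row)
    return out


# Toy example: the template pattern sits in the middle of the search map.
search = [[0, 0, 0, 0],
          [0, 1, 2, 0],
          [0, 3, 4, 0],
          [0, 0, 0, 0]]
template = [[1, 2],
            [3, 4]]
response = xcorr2d(search, template)
```

The response peaks at the offset where the search window matches the template, which is the property the tracker relies on to localize the target.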
After the embedding features are extracted, a classification head and a regression head split the model into a classification task and a regression task (following guideline G1).
The regression head outputs a 4-dimensional vector for each location. Its regression target $\boldsymbol{t}^{*}=\left(l^{*}, t^{*}, r^{*}, b^{*}\right)$ is defined as follows (where $s$ is the stride of the backbone network, 8 in this paper):
$$
\begin{aligned}
l^{*} &= \left(\left\lfloor\frac{s}{2}\right\rfloor + xs\right) - x_{0}, &\quad t^{*} &= \left(\left\lfloor\frac{s}{2}\right\rfloor + ys\right) - y_{0} \\
r^{*} &= x_{1} - \left(\left\lfloor\frac{s}{2}\right\rfloor + xs\right), &\quad b^{*} &= y_{1} - \left(\left\lfloor\frac{s}{2}\right\rfloor + ys\right)
\end{aligned}
$$
where $(x_0, y_0)$ and $(x_1, y_1)$ denote the left-top and right-bottom corners of the ground-truth bounding box $B^{*}$ associated with point $(x, y)$.
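The target definition above amounts to mapping a feature-map location back to an image point and measuring its distance to the four sides of the box. A minimal sketch, assuming integer coordinates and a box given as `(x0, y0, x1, y1)`; the function name `regression_target` and the example numbers are hypothetical:

```python
def regression_target(x, y, box, stride=8):
    """Return (l*, t*, r*, b*) for feature-map location (x, y).

    box is (x0, y0, x1, y1): left-top and right-bottom corners in image pixels.
    """
    x0, y0, x1, y1 = box
    # Image point corresponding to the feature-map location.
    px = stride // 2 + x * stride
    py = stride // 2 + y * stride
    return (px - x0, py - y0, x1 - px, y1 - py)


# Location (5, 4) maps to image point (44, 36) when the stride is 8.
target = regression_target(x=5, y=4, box=(20, 10, 80, 70))
```

All four distances are positive exactly when the image point lies strictly inside the box.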
The classification head has two branches. One branch outputs the classification score on the feature map $\psi_{cls}$. A location $(x, y)$ on the feature map $\psi_{cls}$ is considered a positive sample if its corresponding location $\left(\left\lfloor\frac{s}{2}\right\rfloor + xs, \left\lfloor\frac{s}{2}\right\rfloor + ys\right)$ on the input image falls inside the ground-truth bounding box. Otherwise, it is a negative sample.
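The labeling rule can be sketched as a loop over the score map that tests whether each mapped image point lies inside the ground-truth box. This is an illustrative sketch only: the function name `classification_labels`, the square `score_size`, and the choice of strict inequalities at the box boundary are assumptions, not details taken from the paper or the repository.

```python
def classification_labels(score_size, box, stride=8):
    """Return a score_size x score_size grid of 1 (positive) / 0 (negative)."""
    x0, y0, x1, y1 = box
    labels = []
    for y in range(score_size):
        row = []
        for x in range(score_size):
            # Image point corresponding to feature-map location (x, y).
            px = stride // 2 + x * stride
            py = stride // 2 + y * stride
            # Assumption: points exactly on the box boundary count as negative.
            inside = x0 < px < x1 and y0 < py < y1
            row.append(1 if inside else 0)
        labels.append(row)
    return labels


# With stride 8 the image points are 4, 12, 20, 28, 36 along each axis,
# so only indices 1 to 3 fall inside the box (10, 10, 30, 30).
labels = classification_labels(score_size=5, box=(10, 10, 30, 30))
```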
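The other branch predicts the Prior Spatial Score (PSS); the paper notes that IoU can be used instead. This branch assesses the quality of the predicted bbox and is used to suppress bboxes that are far from the target center.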
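The post does not give the PSS formula. As an assumption, the sketch below uses the centerness-style definition $\sqrt{\frac{\min(l^{*}, r^{*})}{\max(l^{*}, r^{*})} \cdot \frac{\min(t^{*}, b^{*})}{\max(t^{*}, b^{*})}}$ computed from the regression targets, which equals 1 at the box center and decays toward the edges. The function name `prior_spatial_score` is hypothetical.

```python
import math


def prior_spatial_score(l, t, r, b):
    """Centerness-style quality target in (0, 1] for a positive sample.

    Assumes all four distances are positive, which holds for points
    strictly inside the ground-truth box.
    """
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


center = prior_spatial_score(30, 30, 30, 30)      # point at the box center
off_center = prior_spatial_score(24, 26, 36, 34)  # point shifted from the center
```

A location near the box border gets a small score, so multiplying it into the classification score down-weights boxes predicted from far-off-center locations.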
The final training loss is:
$$
\begin{aligned}
L\left(\left\{p_{x, y}\right\}, q_{x, y},\left\{\boldsymbol{t}_{x, y}\right\}\right) ={}& \frac{1}{N_{\mathrm{pos}}} \sum_{x, y} L_{\mathrm{cls}}\left(p_{x, y}, c_{x, y}^{*}\right) \\
&+\frac{\lambda}{N_{\mathrm{pos}}} \sum_{x, y} \mathbf{1}_{\left\{c_{x, y}^{*}>0\right\}} L_{\mathrm{quality}}\left(q_{x, y}, q_{x, y}^{*}\right) \\
&+\frac{\lambda}{N_{\mathrm{pos}}} \sum_{x, y} \mathbf{1}_{\left\{c_{x, y}^{*}>0\right\}} L_{\mathrm{reg}}\left(\boldsymbol{t}_{x, y}, \boldsymbol{t}_{x, y}^{*}\right)
\end{aligned}
$$
where $L_{cls}$ is the focal loss (see "Focal Loss for Dense Object Detection"), $L_{quality}$ is the BCE loss, and $L_{reg}$ is the IoU loss.
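The three terms can be sketched per location as below. This is a minimal pure-Python illustration, not the repository's implementation: the focal-loss defaults `alpha=0.25` and `gamma=2.0` come from the focal loss paper, the $-\ln(\mathrm{IoU})$ form of the IoU loss is an assumption, and the function names and the dictionary layout of `samples` are hypothetical. Predicted probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def focal_loss(p, label, alpha=0.25, gamma=2.0):
    """Binary focal loss for one location; p is the predicted probability."""
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def bce_loss(q, q_star):
    """Binary cross-entropy between predicted quality q and target q_star."""
    return -(q_star * math.log(q) + (1.0 - q_star) * math.log(1.0 - q))


def iou_loss(pred, target):
    """-ln(IoU) for two (l, t, r, b) offset vectors sharing one anchor point."""
    pl, pt, pr, pb = pred
    tl, tt, tr, tb = target
    inter = (min(pl, tl) + min(pr, tr)) * (min(pt, tt) + min(pb, tb))
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    return -math.log(inter / union)


def total_loss(samples, lam=1.0):
    """Combine the three terms; quality and regression use positives only."""
    n_pos = max(1, sum(s["c_star"] for s in samples))
    cls = sum(focal_loss(s["p"], s["c_star"]) for s in samples)
    positives = [s for s in samples if s["c_star"] > 0]
    quality = sum(bce_loss(s["q"], s["q_star"]) for s in positives)
    reg = sum(iou_loss(s["t"], s["t_star"]) for s in positives)
    return (cls + lam * quality + lam * reg) / n_pos


samples = [
    {"p": 0.9, "c_star": 1, "q": 0.8, "q_star": 0.7,
     "t": (24, 26, 36, 34), "t_star": (24, 26, 36, 34)},
    {"p": 0.1, "c_star": 0},  # negative sample: only the classification term applies
]
loss = total_loss(samples)
```

Normalizing by the number of positives keeps the loss scale comparable across frames where the target covers very different numbers of score-map locations.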
The authors discuss the inference procedure in more detail in Appendix B.
The model takes the element-wise product of cls_score and quality_accessment to obtain a score map. The score map then goes through a series of post-processing steps (such as multiplication by a Hanning window) to produce the final score map, $\tilde{s}[x]$.
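A minimal sketch of this fusion step, following the description above literally: multiply the two maps element-wise, then multiply by a 2D Hanning window built as the outer product of two 1D windows. The function names `hanning` and `fuse_scores` are hypothetical, the other post-processing steps in the series are omitted, and an actual implementation may combine the window with the scores differently.

```python
import math


def hanning(n):
    """Symmetric Hann window of length n: 0.5 - 0.5 * cos(2*pi*k / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def fuse_scores(cls_score, quality):
    """Element-wise product of the two maps, weighted by a 2D Hanning window."""
    n = len(cls_score)
    w = hanning(n)
    fused = []
    for y in range(n):
        row = []
        for x in range(n):
            # The 2D window value is the outer product of the 1D windows.
            row.append(cls_score[y][x] * quality[y][x] * w[y] * w[x])
        fused.append(row)
    return fused


cls_score = [[0.5] * 3 for _ in range(3)]
quality = [[1.0] * 3 for _ in range(3)]
score_map = fuse_scores(cls_score, quality)
```

The window favors locations near the center of the search region, which encodes the expectation that the target moves little between consecutive frames.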
An argmax over the score map then selects the predicted bbox:
$$
\begin{aligned}
x^{*} &= \arg\max_{x \in [0..N-1] \otimes 2} \tilde{s}[x] \\
B_{\text{curr}} &= B\left[x^{*}\right]
\end{aligned}
$$
Finally, the bbox is updated according to the following equations to give the final prediction ($\alpha$ is a hyperparameter):
$$
\begin{aligned}
\alpha^{\prime} &= \bar{s}\left[x^{*}\right] \cdot \alpha \\
B_{\text{pred}}.\text{size} &= \left(1-\alpha^{\prime}\right) \cdot B_{\text{prev}}.\text{size} + \alpha^{\prime} \cdot B_{\text{curr}}.\text{size}
\end{aligned}
$$
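The selection and size-smoothing steps can be sketched together as follows. This is an illustrative simplification: it uses one score map for both the argmax and the smoothing weight, whereas the equations distinguish $\tilde{s}$ from $\bar{s}$; the function name `select_and_smooth`, the `(cx, cy, w, h)` box layout, and the value `alpha=0.5` are all hypothetical.

```python
def select_and_smooth(score_map, boxes, prev_size, alpha=0.5):
    """Pick the best-scoring box and blend its size with the previous size.

    score_map: N x N grid of scores.
    boxes:     N x N grid of (cx, cy, w, h) box predictions.
    prev_size: (w, h) of the box from the previous frame.
    """
    n = len(score_map)
    # Argmax over all 2D locations of the score map.
    best_y, best_x = max(
        ((y, x) for y in range(n) for x in range(n)),
        key=lambda yx: score_map[yx[0]][yx[1]],
    )
    cx, cy, w_curr, h_curr = boxes[best_y][best_x]
    # A confident prediction moves the size further toward the new estimate.
    alpha_p = score_map[best_y][best_x] * alpha
    w = (1.0 - alpha_p) * prev_size[0] + alpha_p * w_curr
    h = (1.0 - alpha_p) * prev_size[1] + alpha_p * h_curr
    return (cx, cy, w, h)


score_map = [[0.1, 0.2],
             [0.8, 0.3]]
boxes = [[(0, 0, 10, 10), (1, 0, 10, 10)],
         [(0, 1, 40, 20), (1, 1, 10, 10)]]
pred = select_and_smooth(score_map, boxes, prev_size=(20, 10))
```

Here the best score is 0.8, so the effective weight is 0.4 and the predicted size lands between the previous size (20, 10) and the current estimate (40, 20).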
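SiamFC++ keeps the simplicity and speed of SiamFC, and the paper's experiments show that the newly added regression branch substantially improves tracking performance. The paper also borrows several practices from object detection. Detection and tracking seem to be growing ever closer, so developments in object detection are worth following closely.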