[Reading Notes] BoT-SORT: Robust Associations Multi-Pedestrian Tracking

Reading notes on the BoT-SORT paper.

Abstract

The paper proposes a new robust state-of-the-art tracker that combines the advantages of motion and appearance information, adds camera-motion compensation, and uses a more accurate Kalman filter state vector.

1. Introduction

In recent years, tracking-by-detection has become the most effective paradigm for multi-object tracking. A tracking-by-detection pipeline consists of an object detector followed by a tracker.
The tracker usually consists of two main parts:
(1) A motion model and state estimator that predicts the track bounding boxes in subsequent frames; the Kalman filter is the most common choice.
(2) An association step that matches the detections of the new frame to the current set of tracks. Two approaches are commonly used:
(a) localization: computing the IoU between the predicted track bounding boxes and the detected boxes;
(b) appearance: using an appearance model to solve a re-identification (Re-ID) task.

2. Related Works

Tracking-by-detection

Motion models

Most detection-based trackers rely on a motion model. The Kalman filter [1] with a constant-velocity assumption is currently the standard way of modeling object motion, and most SORT-like algorithms adopt it, e.g. SORT [2] and DeepSORT [3]. Several works use Kalman filters with more elaborate noise models, such as the NSA Kalman filter used in GIAOTracker [4] and StrongSORT [5]. However, in complex scenes camera motion makes the apparent object motion non-linear and leads to incorrect Kalman predictions. Many works therefore add camera motion compensation (CMC), e.g. [6] and MAT [7]; CMC estimates the camera motion through classical image registration and corrects the Kalman filter accordingly.

Appearance models and re-identification

In SORT-like algorithms, the choice between localization and appearance cues creates a trade-off between the tracker's detection ability (MOTA) and its ability to maintain correct identities over time (IDF1): IoU-based association usually achieves higher MOTA, while Re-ID achieves higher IDF1. Re-identifying targets from appearance features has also become a common approach, e.g. MGN [8], OSNet [9] and BoT [10], but it has clear limitations, in particular poor performance under heavy crowding and occlusion. Recently, several joint trackers have been proposed that train the detector together with other components such as the motion, embedding and association models [11][12][13]; their advantage is low computational cost and respectable performance.

Some works rely only on a high-performance detector and motion information, without appearance features, e.g. [14] and ByteTrack [15].

3. Method

For detection-based multi-object tracking, the paper proposes three main modifications and integrates them into ByteTrack, yielding two new trackers: BoT-SORT and BoT-SORT-ReID (an extension of BoT-SORT that adds a re-identification module).
ByteTrack overall framework (BoT-SORT is built on top of BYTE):
[Figure: ByteTrack pipeline overview]

Pipeline (the highlighted parts are BoT-SORT's modifications):

  1. The detector is the same as in ByteTrack: YOLOX, with YOLOX-X as the backbone and COCO-pretrained weights as initialization.
  2. All detection boxes are split by the detection score threshold $\tau = 0.6$ into a high-confidence set $D_{high}$ and a low-confidence set $D_{low}$.
  3. CMC and the Kalman filter ==(with a modified KF state vector)== are used to predict the new position of each track in $\mathcal{T}$ for the current frame.
  4. First association: using the combined IoU & Re-ID similarity (an association method that fuses motion and appearance cues), $D_{high}$ is matched against all predicted track boxes $\mathcal{T}$ with the Hungarian algorithm; matched tracks are removed from the candidate set. The appearance features are extracted with BoT + ResNeSt50 and updated with an EMA. Unmatched detections become $D_{remain}$ and unmatched tracks become $\mathcal{T}_{remain}$.
  5. Second association: using IoU alone as the similarity, $D_{low}$ is matched against the remaining predicted tracks $\mathcal{T}_{remain}$. (Re-ID appearance features are not used here because $D_{low}$ mostly contains boxes degraded by occlusion or motion blur, whose appearance features are unreliable.) Tracks still unmatched after the second association, $\mathcal{T}_{re\text{-}remain}$, are moved to $\mathcal{T}_{lost}$. A lost track is deleted only after it has been missing for more than a fixed number of frames; otherwise it is kept in the track set $\mathcal{T}$.
  6. Each frame outputs the bounding boxes and track IDs of the tracks in the current frame ($\mathcal{T}_{lost}$ is not output). (See the association sketch after the quotations below.)

1. "The detector is YOLOX [24] with YOLOX-X as the backbone and the COCO-pretrained model [36] as the initialized weights… The YOLO series detectors are also adopted by a large number of methods for their excellent balance of accuracy and speed." [15]
"For detection, we adopt YOLOX-X pretrained on COCO as our detector for an improved time-accuracy trade-off." [6]
2. "Separate all the detection boxes into two parts, $D_{high}$ and $D_{low}$, according to the detection score threshold $\tau$. For the detection boxes whose scores are higher than $\tau$, we put them into the high-score detection boxes $D_{high}$. For the detection boxes whose scores are lower than $\tau$, we put them into the low-score detection boxes $D_{low}$." [15]
3. "We adopt the Kalman filter to predict the new locations in the current frame of each track in $\mathcal{T}$."
4. "For the appearance branch, a stronger appearance feature extractor, BoT, is applied to replace the original simple CNN. By taking ResNeSt50 as the backbone and pretraining on the DukeMTMC-reID dataset, it can extract much more discriminative features." [6]
"The EMA updating strategy not only enhances the matching quality, but also reduces the time consumption." [6]
5. "We find it important to use IoU alone as the Similarity#2 in the second association, because the low-score detection boxes usually contain severe occlusion or motion blur and appearance features are not reliable." [15]
6. "The output of each individual frame is the bounding boxes and identities of the tracks $\mathcal{T}$ in the current frame. Note that we do not output the boxes and identities of $\mathcal{T}_{lost}$." [15]
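The two-stage association described above can be summarized in a short, self-contained sketch. It operates on plain NumPy boxes and uses SciPy's Hungarian solver; the detector, Kalman prediction, CMC and Re-ID steps are outside its scope, and the function names and matching thresholds below are illustrative assumptions, not the official ByteTrack/BoT-SORT API.

```python
# Minimal sketch of the BYTE-style two-stage association (assumed interface).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU for boxes given as (x1, y1, x2, y2) numpy arrays."""
    a = boxes_a[:, None, :]                       # (Na, 1, 4)
    b = boxes_b[None, :, :]                       # (1, Nb, 4)
    lt = np.maximum(a[..., :2], b[..., :2])
    rb = np.minimum(a[..., 2:], b[..., 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(cost, max_cost):
    """Hungarian assignment; pairs with cost above `max_cost` are rejected."""
    if cost.size == 0:
        return [], list(range(cost.shape[0])), list(range(cost.shape[1]))
    r, c = linear_sum_assignment(cost)
    keep = cost[r, c] <= max_cost
    matches = list(zip(r[keep], c[keep]))
    un_r = [i for i in range(cost.shape[0]) if i not in r[keep]]
    un_c = [j for j in range(cost.shape[1]) if j not in c[keep]]
    return matches, un_r, un_c

def two_stage_associate(track_boxes, det_boxes, det_scores, tau=0.6):
    """Stage 1: high-score dets vs. all tracks; stage 2: low-score dets vs. leftovers."""
    high = det_scores >= tau
    d_high, d_low = det_boxes[high], det_boxes[~high]

    # First association (the paper fuses IoU with Re-ID here; plain IoU shown)
    cost1 = 1.0 - iou_matrix(track_boxes, d_high)
    m1, t_remain, _ = match(cost1, max_cost=0.8)   # threshold is illustrative

    # Second association on the remaining tracks, IoU only
    cost2 = 1.0 - iou_matrix(track_boxes[t_remain], d_low)
    m2, t_re_remain, _ = match(cost2, max_cost=0.5)

    # Tracks unmatched twice are candidates for T_lost
    return m1, m2, [t_remain[i] for i in t_re_remain]
```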

[Figure: BoT-SORT tracking pipeline]
Interpolation: the interpolation step is an optional post-processing addition; see ByteTrack for details.
[Figure: tracklet interpolation as an optional post-processing step]
Tracklet interpolation in ByteTrack:

"Suppose we have a tracklet $T$ whose box is lost due to occlusion from frame $t_1$ to $t_2$. The tracklet box of $T$ at frame $t_1$ is $B_{t_1} \in \mathbb{R}^4$, which contains the top-left and bottom-right coordinates of the bounding box. Let $B_{t_2}$ represent the tracklet box of $T$ at frame $t_2$. We set a hyper-parameter $\sigma$ representing the maximum interval over which we perform tracklet interpolation, i.e. interpolation is performed only when $t_2 - t_1 \le \sigma$. The interpolated box of tracklet $T$ at frame $t$ can be computed as follows:" [15]
$$B_t = B_{t_1} + (B_{t_2} - B_{t_1})\,\frac{t - t_1}{t_2 - t_1} \tag{1}$$
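A minimal sketch of this linear interpolation, assuming $(x_1, y_1, x_2, y_2)$ boxes. The default gap limit `sigma` is an assumption (ByteTrack's interpolation script reportedly uses a value around 20 frames), not a value taken from this note.

```python
# Hypothetical helper implementing Eq. (1) for one occluded gap of a track.
import numpy as np

def interpolate_tracklet(b_t1, b_t2, t1, t2, sigma=20):
    """Linearly interpolate the missing boxes of a track between frames t1 and t2.

    b_t1, b_t2: (4,) arrays [x1, y1, x2, y2] at frames t1 and t2.
    Returns {frame: box} for the missing frames, or {} if the gap exceeds sigma.
    """
    if t2 - t1 > sigma:
        return {}
    b_t1, b_t2 = np.asarray(b_t1, float), np.asarray(b_t2, float)
    out = {}
    for t in range(t1 + 1, t2):
        ratio = (t - t1) / (t2 - t1)          # position inside the gap
        out[t] = b_t1 + (b_t2 - b_t1) * ratio  # Eq. (1)
    return out
```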

3.1 Kalman Filter

The common choice is a discrete Kalman filter with a constant-velocity model. SORT uses the seven-dimensional state vector $\mathbf{x} = [x_c, y_c, s, a, \dot{x}_c, \dot{y}_c, \dot{s}]^T$, where $(x_c, y_c)$ are the 2D coordinates of the object center in the image plane, $s$ is the bounding-box area and $a$ is the bounding-box aspect ratio.
The authors find that directly estimating the bounding-box width and height performs better, and therefore redefine the KF state vector and measurement vector.
[Figure: BoT-SORT KF state vector $\mathbf{x}_k = [x_c, y_c, w, h, \dot{x}_c, \dot{y}_c, \dot{w}, \dot{h}]^T$ and measurement vector $\mathbf{z}_k = [z_{x_c}, z_{y_c}, z_w, z_h]^T$]
DeepSORT proposed making the process-noise covariance $Q$ and the measurement-noise covariance $R$ functions of the predicted and measured state, so $Q_k$ and $R_k$ here are time-varying. They are adapted to the slightly different state vector $\mathbf{x}$, with $\sigma_p = 0.05$, $\sigma_v = 0.00625$, $\sigma_m = 0.05$.
[Figure: definitions of the time-varying noise covariances $Q_k$ and $R_k$]
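Below is a hedged sketch of how the constant-velocity KF matrices could be built for the 8-dimensional state above, with time-varying $Q_k$ and $R_k$ scaled by the current box width/height. The exact placement of $w$ and $h$ inside the noise terms follows the common SORT-family convention and is an assumption, not the paper's verbatim matrices.

```python
# Sketch of the constant-velocity KF matrices for x = [xc, yc, w, h, dxc, dyc, dw, dh]^T.
import numpy as np

def make_kf_matrices(w, h, dt=1.0,
                     sigma_p=0.05, sigma_v=0.00625, sigma_m=0.05):
    # State transition F (constant velocity) and measurement matrix H
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)
    H = np.hstack([np.eye(4), np.zeros((4, 4))])

    # Time-varying process noise Q_k: position stds scaled by sigma_p,
    # velocity stds by sigma_v, both relative to the current w/h (assumed layout)
    std_pos = [sigma_p * w, sigma_p * h, sigma_p * w, sigma_p * h]
    std_vel = [sigma_v * w, sigma_v * h, sigma_v * w, sigma_v * h]
    Q = np.diag(np.square(std_pos + std_vel))

    # Time-varying measurement noise R_k
    R = np.diag(np.square([sigma_m * w, sigma_m * h, sigma_m * w, sigma_m * h]))
    return F, H, Q, R
```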

These changes improve HOTA; the authors conjecture that the modified KF helps fit the bounding-box width more accurately.

3.2 Camera Motion Compensation (CMC)

Detection-based trackers rely heavily on the overlap between the predicted track boxes and the detected object positions; under camera motion this overlap degrades, which easily causes ID switches and false matches.
The paper therefore assumes rigid camera motion and that object positions change only slightly from one frame to the next. The next paragraph in short: since the camera parameters cannot be obtained directly, background key-points are extracted and used to estimate the camera motion.

In the absence of additional data about the camera motion (e.g. navigation, IMU) or the camera intrinsic matrix, image registration between two adjacent frames is a good approximation of the camera's rigid motion projected onto the image plane. The paper follows the global motion compensation (GMC) technique used in OpenCV's video-stabilization module with affine transformations. This image-registration approach is suited to apparent background motion. First, image key-points are extracted; then sparse optical flow is used for feature tracking with translation-based local outlier suppression. The affine matrix is solved with RANSAC [16]. Using a sparse registration technique allows dynamic objects in the scene to be ignored based on the detections, potentially leading to a more accurate estimate of the background motion.
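A minimal OpenCV sketch of this sparse-registration GMC step (key-point extraction, sparse optical flow, RANSAC affine fit). The specific OpenCV calls and parameter values are illustrative choices, not necessarily the exact implementation used by BoT-SORT.

```python
# Sketch of global motion estimation between two consecutive grayscale frames.
import cv2
import numpy as np

def estimate_camera_motion(prev_gray, curr_gray):
    """Return a 2x3 affine matrix A mapping frame k-1 coordinates to frame k."""
    identity = np.hstack([np.eye(2), np.zeros((2, 1))])

    # 1) key-points on the previous frame (mostly background corners)
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=1000,
                                       qualityLevel=0.01, minDistance=3)
    if pts_prev is None:
        return identity

    # 2) track them into the current frame with sparse optical flow
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   pts_prev, None)
    good_prev = pts_prev[status.flatten() == 1]
    good_curr = pts_curr[status.flatten() == 1]
    if len(good_prev) < 4:
        return identity

    # 3) robust affine fit; RANSAC suppresses foreground (moving-object) points
    A, _ = cv2.estimateAffinePartial2D(good_prev, good_curr, method=cv2.RANSAC)
    return A if A is not None else identity
```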

To transform the predicted bounding box from the coordinates of frame $k-1$ to those of frame $k$, the affine matrix $A_{k-1}^{k}$ is computed as follows:
[Figure: computation of the affine transformation $A_{k-1}^{k}$]
$\hat{\mathbf{x}}_{k|k-1}$ and $\hat{\mathbf{x}}'_{k|k-1}$ are the KF-predicted state vectors at time $k$ before and after camera-motion compensation, respectively. The compensation step is shown below (it can be skipped when the camera motion is slow):

[Figure: compensating the predicted KF state vector and covariance with $A_{k-1}^{k}$]

The Kalman update then proceeds as usual:

[Figure: Kalman filter update equations]
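A hedged sketch of applying the estimated affine warp $A_{k-1}^{k}$ to the predicted KF mean and covariance. The 8×8 block structure (which state components receive the rotation/scale part $M$ and which receive the translation) follows my reading of the equations and should be treated as an assumption.

```python
# Sketch of the camera-motion compensation of the Kalman prediction (assumed layout).
import numpy as np

def apply_cmc_to_kf(A, mean, cov):
    """Warp the predicted KF state/covariance from frame k-1 into frame k.

    A    : 2x3 affine matrix from the image registration step.
    mean : (8,) predicted state [xc, yc, w, h, dxc, dyc, dw, dh].
    cov  : (8, 8) predicted covariance.
    """
    M = A[:2, :2]                       # rotation/scale part
    t = A[:2, 2]                        # translation part

    M8 = np.kron(np.eye(4), M)          # apply M to every (x, y)-like pair
    t8 = np.zeros(8)
    t8[:2] = t                          # only the box centre is translated

    mean_c = M8 @ mean + t8
    cov_c = M8 @ cov @ M8.T
    return mean_c, cov_c
```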

3.3 IoU–Re-ID Fusion

Appearance features are extracted with BoT + ResNeSt50 and integrated into the tracker; the matched track's appearance state $e$ is updated with an EMA, where $f$ is the appearance embedding of the currently matched detection and $\alpha$ is the momentum term. Because appearance features are easily corrupted by crowds, occlusion and blurred objects, only high-confidence detections are considered.
$$e_i^k = \alpha\, e_i^{k-1} + (1-\alpha)\, f_i^k$$
$$\text{BoT + ResNeSt50} \;\Rightarrow\; \text{EMA} \;\Rightarrow\; \text{appearance\_cost} \tag{2}$$

To exploit recent advances in deep visual representations, appearance features are integrated into the tracker. To extract these Re-ID features, the stronger baseline on top of BoT (SBS) [28] from the FastReID library is adopted, with ResNeSt50 [56] as the backbone [19]. An exponential moving average (EMA) mechanism, as in [46], updates the matched track's appearance state $e_i^k$ of the $i$-th track at frame $k$, where $f_i^k$ is the appearance embedding of the currently matched detection and $\alpha = 0.9$ is the momentum term. Since appearance features can be vulnerable to crowds, occlusion and blurred objects, only high-confidence detections are used in order to keep the feature vectors correct. To match the averaged track appearance state $e$ with the new detection embedding $f$, the cosine similarity is measured.
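A small sketch of the EMA appearance-state update and the cosine distance used for matching. Variable names are illustrative; $\alpha = 0.9$ follows the text above, and the re-normalization step is an assumption.

```python
# EMA update of a track's appearance state and cosine distance to a new embedding.
import numpy as np

def update_appearance(track_emb, det_emb, alpha=0.9):
    """e_i^k = alpha * e_i^{k-1} + (1 - alpha) * f_i^k, then re-normalise."""
    e = alpha * np.asarray(track_emb) + (1.0 - alpha) * np.asarray(det_emb)
    return e / (np.linalg.norm(e) + 1e-12)

def cosine_distance(track_emb, det_emb):
    """1 - cosine similarity between the smoothed track state and a detection embedding."""
    sim = np.dot(track_emb, det_emb) / (
        np.linalg.norm(track_emb) * np.linalg.norm(det_emb) + 1e-12)
    return 1.0 - sim
```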

To combine the motion information and the appearance information, i.e. the IoU distance matrix and the cosine distance matrix, two steps are taken: first, candidates with low cosine similarity or that are far apart (in terms of IoU) are rejected; second, the smaller of the two distances is taken as the final value of the cost matrix.

IoU is the overlap ratio between two bounding boxes; in detection metrics it is computed between a predicted box and the ground-truth box, while here it is computed between the predicted track box and the detected box.

$$\hat{d}_{i,j}^{cos} = \begin{cases} 0.5 \cdot d_{i,j}^{cos}, & (d_{i,j}^{cos} < \theta_{emb}) \wedge (d_{i,j}^{iou} < \theta_{iou}) \\ 1, & \text{otherwise} \end{cases} \tag{3}$$
$$C_{i,j} = \min\{\, d_{i,j}^{iou},\; \hat{d}_{i,j}^{cos} \,\} \tag{4}$$

$d_{i,j}^{iou}$ is the IoU distance between the $i$-th predicted track bounding box and the $j$-th detected bounding box, $d_{i,j}^{cos}$ is the cosine distance between the averaged appearance descriptor of track $i$ and the descriptor of new detection $j$, and $\hat{d}_{i,j}^{cos}$ is the new appearance cost proposed in this paper. $\theta_{iou}$ is the proximity threshold used to reject unlikely track–detection pairs, and $\theta_{emb} = 0.25$ is the appearance threshold used to separate positive associations (same identity at different times) between track appearance states and detection embedding vectors from negative ones (different identities at different times).
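A compact sketch of the fusion rule in Eqs. (3)–(4): the cosine distance is gated by both thresholds, halved when it passes, and the element-wise minimum with the IoU distance gives the final cost. The value $\theta_{iou} = 0.5$ below is a placeholder assumption; the text above only fixes $\theta_{emb} = 0.25$.

```python
# Fused IoU / Re-ID cost matrix following Eqs. (3)-(4).
import numpy as np

def fuse_costs(d_iou, d_cos, theta_iou=0.5, theta_emb=0.25):
    """d_iou, d_cos: (num_tracks, num_dets) distance matrices with values in [0, 1]."""
    d_cos_hat = np.ones_like(d_cos)                     # default: 1 (rejected)
    passed = (d_cos < theta_emb) & (d_iou < theta_iou)  # gate by both thresholds
    d_cos_hat[passed] = 0.5 * d_cos[passed]             # Eq. (3)
    return np.minimum(d_iou, d_cos_hat)                 # Eq. (4)
```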

4. Experiments

4.1 Experimental Setup

metrics

MOTA: Multiple Object Tracking Accuracy; mainly reflects detection performance.
IDF1: evaluates identity-association performance, i.e. the tracker's ability to maintain correct identities over time.
HOTA: Higher-Order Tracking Accuracy; balances the effects of accurate detection and accurate association.
FPS: tracker speed (frames per second).
[Table: comparison of association strategies on the MOT17 validation set; the Masking column indicates how distant associations are rejected]
Masking refers to the way distant associations are rejected (Masking = IoU means that associations whose IoU distance is too large are rejected).

4.2 Ablation Study

Re-ID Module

Appearance descriptors are an intuitive way to associate the same person over time, and can potentially overcome large displacements and long occlusions. In high-frame-rate videos, however, IoU-only matching can outperform recent attempts at using Re-ID with cosine similarity. This section compares different strategies for combining motion and visual embeddings in the tracker's first association step, on the MOT17 validation set (Table 2). IoU alone outperforms the Re-ID-based methods, excluding the proposed one; hence for low-resource applications IoU is a good design choice. The proposed IoU–Re-ID combination with IoU masking achieves the highest MOTA, IDF1 and HOTA, benefiting from both motion and appearance information.

current MOTA

cMOTA is MOTA accumulated up to frame $t$, so that $\mathrm{cMOTA}(t=T) = \mathrm{MOTA}$; it helps to quickly locate where the tracker fails.
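A hedged formalisation, assuming cMOTA is simply the standard MOTA accumulated over the first $t$ frames (so that the full-sequence value equals MOTA):

$$\mathrm{cMOTA}(t) = 1 - \frac{\sum_{k=1}^{t}\left(\mathrm{FN}_k + \mathrm{FP}_k + \mathrm{IDSW}_k\right)}{\sum_{k=1}^{t}\mathrm{GT}_k}$$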
[Figure: cMOTA vs. frame index, with (green) and without (red) camera-motion correction]
In the figure above, before camera-motion correction (red curve) the cMOTA drops sharply between frames 400 and 470 and then plateaus. The cMOTA curve reveals that the tracker fails because of a camera rotation. After adding CMC (green curve), the MOTA stays high under the same detection and tracking parameters.

4.4 Limitations

In scenes with a high density of dynamic objects, the camera-motion estimate can fail because background key-points are scarce, and computing the global camera motion is time-consuming.
A separate appearance branch also makes the tracker slow; the feature-extraction network could instead be merged into the detector in a joint detection-embedding fashion.

Compared with joint trackers and several appearance-free trackers, trackers with a separate appearance branch run relatively slowly. Deep feature extraction is applied only to high-confidence detections in order to reduce the computational cost. If necessary, the feature-extractor network can be merged into the detection head in a joint detection-embedding fashion.

Appendix: Related Works

[2] SORT: "Despite only using a rudimentary combination of familiar techniques such as the Kalman filter and the Hungarian algorithm for the tracking components, this approach achieves an accuracy comparable to state-of-the-art online trackers."
1. leverage the power of CNN-based detection in the context of MOT; 2. a pragmatic tracking approach based on the Kalman filter and the Hungarian algorithm, evaluated on a recent MOT benchmark. [2]
[4] GIAOTracker: "It consists of three stages, i.e., online tracking, global link and post-processing. Given detections in every frame, the first stage generates reliable tracklets using information of camera motion, object motion and object appearance. Then they are associated into trajectories by exploiting global clues and refined through four post-processing methods." [4]
[6] CMC: "We introduce two key innovations to recover much of this performance drop. We treat occluded object detection in temporal sequences as a short-term forecasting challenge, bringing to bear tools from dynamic sequence prediction. Second, we build dynamic models that explicitly reason in 3D from monocular videos without calibration, using observations produced by monocular depth estimators." [6]
[7] MAT: "In this paper, we propose an enhanced MOT paradigm, namely Motion-Aware Tracker (MAT). Our MAT is a plug-and-play solution.
First, the nonrigid pedestrian motion and rigid camera motion are blended seamlessly to develop the Integrated Motion Localization (IML) module.
Second, the Dynamic Reconnection Context (DRC) module is devised to guarantee the robustness for long-range motion-based reconnection. The core ideas in DRC are the motion-based dynamic-window and cyclic pseudo-observation trajectory filling strategy, which can smoothly fill in the tracking fragments caused by occlusion or blur.
At last, we present the 3D Integral Image (3DII) module to efficiently cut off useless track-detection association connections using temporal-spatial constraints." [7]
[8] Multiple Granularity Network (MGN): an end-to-end feature learning strategy integrating discriminative information with various granularities; a multi-branch deep network architecture consisting of one branch for global feature representations and two branches for local feature representations. [8]
[9] OSNet: "In this paper, a novel deep ReID CNN is designed, termed Omni-Scale Network (OSNet), for omni-scale feature learning. This is achieved by designing a residual block composed of multiple convolutional feature streams, each detecting features at a certain scale. Importantly, a novel unified aggregation gate is introduced to dynamically fuse multi-scale features with input-dependent channel-wise weights." [9]
[10] BoT: combines 6 training tricks (among them, a neck structure named BNNeck) and only uses global features from a ResNet50 backbone. [10] (to be read)


  1. R. G. Brown and P. Y. C. Hwang. Introduction to Random Signals and Applied Kalman Filtering: with MATLAB Exercises and Solutions, 3rd ed. Wiley, New York, NY, 1997.

  2. A. Bewley, Z. Ge, L. Ott, F. Ramos and B. Upcroft, "Simple online and realtime tracking," 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 3464-3468, doi: 10.1109/ICIP.2016.7533003.

  3. N. Wojke, A. Bewley and D. Paulus, "Simple online and realtime tracking with a deep association metric," 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3645-3649, doi: 10.1109/ICIP.2017.8296962.

  4. Y. Du, J. Wan, Y. Zhao, B. Zhang, Z. Tong and J. Dong, "GIAOTracker: A comprehensive framework for MCMOT with global information and optimizing strategies in VisDrone 2021," 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021, pp. 2809-2819, doi: 10.1109/ICCVW54120.2021.00315.

  5. Y. Du, Y. Song, B. Yang, and Y. Zhao. StrongSORT: Make DeepSORT great again. arXiv preprint arXiv:2202.13514, 2022.

  6. T. Khurana, A. Dave and D. Ramanan, "Detecting Invisible People," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 3154-3164, doi: 10.1109/ICCV48922.2021.00316.

  7. Shoudong Han, Piao Huang, Hongwei Wang, En Yu, Donghaisheng Liu, Xiaofeng Pan. MAT: Motion-aware multi-object tracking. Neurocomputing, Volume 476, 2022, pp. 75-86. ISSN 0925-2312.

  8. G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, pages 274-282, 2018.

  9. K. Zhou, Y. Yang, A. Cavallaro and T. Xiang, "Omni-Scale Feature Learning for Person Re-Identification," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 3701-3711, doi: 10.1109/ICCV.2019.00380.

  10. H. Luo, Y. Gu, X. Liao, S. Lai and W. Jiang, "Bag of Tricks and a Strong Baseline for Deep Person Re-Identification," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 1487-1495, doi: 10.1109/CVPRW.2019.00190.

  11. C. Liang, Z. Zhang, Y. Lu, X. Zhou, B. Li, X. Ye, J. Zou. Rethinking the competition between detection and ReID in multi-object tracking. arXiv preprint arXiv:2010.12138, 2020.

  12. Z. Lu, V. Rathod, R. Votel and J. Huang, "RetinaTrack: Online Single Stage Joint Detection and Tracking," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 14656-14666, doi: 10.1109/CVPR42600.2020.01468.

  13. Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang. Towards real-time multi-object tracking. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XI, pages 107-122. Springer, 2020.

  14. D. Stadler and J. Beyerer, "Modelling Ambiguous Assignments for Multi-Person Tracking in Crowds," 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 2022, pp. 133-142, doi: 10.1109/WACVW54805.2022.00019.

  15. Y. Zhang, P. Sun, Y. Jiang, D. Yu, Z. Yuan, P. Luo, W. Liu, and X. Wang. ByteTrack: Multi-object tracking by associating every detection box. arXiv preprint arXiv:2110.06864, 2021.
