
Image Tracking - MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors (CVPR 2023)

  • Abstract
  • 1. Introduction
  • 2. Related Work
  • 3. Method
    • 3.1 Revisiting MOTR
    • 3.2 Motivation
    • 3.3 Overall Architecture
  • References

Disclaimer: this translation is a personal study record only.

Paper Information

  • Title: MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors (CVPR 2023)
  • Authors: Yuang Zhang, Tiancai Wang, Xiangyu Zhang
  • Paper: https://openaccess.thecvf.com/content/CVPR2023/papers/Zhang_MOTRv2_Bootstrapping_End-to-End_Multi-Object_Tracking_by_Pretrained_Object_Detectors_CVPR_2023_paper.pdf
  • Code: https://github.com/megvii-research/MOTRv2

Abstract

  In this paper, we propose MOTRv2, a simple yet effective pipeline to bootstrap end-to-end multi-object tracking with a pretrained object detector. Existing end-to-end methods, e.g., MOTR [43] and TrackFormer [20], are inferior to their tracking-by-detection counterparts mainly due to their poor detection performance. We aim to improve MOTR by elegantly incorporating an extra object detector. We first adopt the anchor formulation of queries and then use an extra object detector to generate proposals as anchors, providing detection priors to MOTR. The simple modification greatly eases the conflict between the joint learning of detection and association tasks in MOTR. MOTRv2 keeps the query propagation feature and scales well on large-scale benchmarks. MOTRv2 ranks 1st (73.4% HOTA on DanceTrack) in the 1st Multiple People Tracking in Group Dance Challenge. Moreover, MOTRv2 reaches state-of-the-art performance on the BDD100K dataset. We hope this simple and effective pipeline can provide some new insights to the end-to-end MOT community. Code is available at https://github.com/megvii-research/MOTRv2.

1. Introduction

  Multi-object tracking (MOT) aims to predict the trajectories of all objects in a streaming video. It can be divided into two parts: detection and association. For a long time, the state-of-the-art performance on MOT has been dominated by tracking-by-detection methods [4, 36, 44, 45], whose strong detection performance copes with various appearance distributions. These trackers [44] first employ an object detector (e.g., YOLOX [11]) to localize the objects in each frame, and then associate the tracks by ReID features or IoU matching. The superior performance of these methods partially results from datasets and metrics biased toward detection performance. However, as revealed by the DanceTrack dataset [27], their association strategies remain to be improved under complex motion.

Figure 1. Performance comparison between MOTR (gray bars) and MOTRv2 (orange bars) on the DanceTrack and BDD100K datasets. MOTRv2 greatly improves the performance of MOTR across different scenarios.

  Recently, MOTR [43] introduced a fully end-to-end framework for MOT. The association process is performed by updating the track queries, while new-born objects are detected by the detect queries. Its association performance on DanceTrack is impressive, while the detection results are inferior to those of tracking-by-detection methods, especially on the MOT17 dataset. We attribute the inferior detection performance to the conflict between the joint detection and association processes. Since state-of-the-art trackers [6, 9, 44] tend to employ extra object detectors, a natural question is how to incorporate MOTR with an extra object detector for better detection performance. One direct way is to perform IoU matching between the predictions of the track queries and those of the extra object detector (similar to TransTrack [28]). In our practice, this brings only marginal improvements in object detection while disobeying the end-to-end nature of MOTR.

  Inspired by tracking-by-detection methods that take detection results as input, we wonder whether it is possible to feed the detection results as input and reduce the learning of MOTR to the association task. Recently, there have been some advances [18, 35] in anchor-based modeling for DETR. For example, DAB-DETR initializes object queries with the center point, height, and width of an anchor box. Similar to them, we modify the initialization of both the detect and track queries in MOTR. We replace the learnable position embedding (PE) of the detect queries in MOTR with the sine-cosine PE [30] of anchors, producing an anchor-based MOTR tracker. With this anchor-based modeling, the proposals generated by an extra object detector can serve as the anchor initialization of MOTR, providing local priors. The transformer decoder is used to predict the relative offsets w.r.t. the anchors, making the optimization of the detection task much easier.
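  To make this anchor-based initialization concrete, the snippet below is a minimal sketch of mapping a (cx, cy, w, h) anchor to the sine-cosine PE [30] that replaces the learnable embedding. The function names, feature dimension, and temperature are illustrative assumptions, not the authors' exact code.

```python
import math
import torch

def pos2posemb(coord, num_feats=64, temperature=10000):
    # Sine-cosine embedding of one normalized coordinate, following the
    # positional encoding of [30]: half sine channels, half cosine channels.
    dim_t = torch.arange(num_feats // 2, dtype=torch.float32)
    dim_t = temperature ** (2 * dim_t / num_feats)
    pos = coord.unsqueeze(-1) * 2 * math.pi / dim_t   # (N, num_feats // 2)
    return torch.cat((pos.sin(), pos.cos()), dim=-1)  # (N, num_feats)

def anchor_to_query_pe(anchors):
    # anchors: (N, 4) tensor of (cx, cy, w, h), normalized to [0, 1].
    # Embedding all four box parameters and concatenating them yields a
    # (N, 4 * num_feats) positional embedding for the queries.
    return torch.cat([pos2posemb(anchors[:, i]) for i in range(4)], dim=-1)
```

  With such a formulation, initializing a query from an external proposal amounts to nothing more than encoding that proposal's box as the query's PE.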

Figure 2. The overall architecture of MOTRv2. The proposals produced by the state-of-the-art detector YOLOX [11] are used to generate proposal queries, which replace the detect queries in MOTR [43] for detecting new-born objects. The track queries are transferred from the previous frame and used to predict the bounding boxes of the tracked objects. The concatenation of the proposal queries and track queries, together with the image features, is fed into MOTR to generate predictions frame by frame.

  Compared to the original MOTR, the proposed MOTRv2 brings many advantages. It greatly benefits from the strong detection performance introduced by the extra object detector. The detection task is implicitly decoupled from the MOTR framework, easing the conflict between the detection and association tasks in the shared transformer decoder. MOTRv2 learns to track instances across frames given the detection results from the extra detector.

  Compared to the original MOTR, MOTRv2 achieves large performance improvements on the DanceTrack, BDD100K, and MOT17 datasets (see Figure 1). On the DanceTrack dataset, MOTRv2 surpasses its tracking-by-detection counterparts by a large margin (14.8% higher HOTA compared to OC-SORT [6]), and its AssA metric is 18.8% higher than that of the second-best method. On the large-scale multi-class BDD100K dataset [42], we achieve 43.6% mMOTA, which is 2.4% better than the previous best solution, Unicorn [41]. MOTRv2 also achieves state-of-the-art performance on the MOT17 dataset [15, 21]. We hope our concise and elegant design can serve as a strong baseline for future end-to-end multi-object tracking research.

2. Related Work

  Tracking by detection. The dominant methods [6, 44] mainly follow the tracking-by-detection pipeline: an object detector first predicts object bounding boxes for each frame, and a separate algorithm is then used to associate the instance bounding boxes across adjacent frames. The performance of these methods depends heavily on the quality of object detection.

  There are many attempts at association using the Hungarian algorithm [14]. SORT [4] applies a Kalman filter [37] to each tracked instance and matches using the Intersection-over-Union (IoU) matrix between the Kalman-predicted boxes and the detected boxes. DeepSORT [38] introduces a separate network to extract instance appearance features and uses pairwise cosine distance on top of SORT. JDE [36], Track-RCNN [25], FairMOT [45], and Unicorn [41] further explore the joint training of object detection and appearance embedding. ByteTrack [44] leverages a powerful YOLOX-based [11] detector and achieves state-of-the-art performance; it introduces an enhanced SORT algorithm that also associates low-score detection boxes instead of only high-score ones. BoT-SORT [1] further designs better Kalman filter states, camera motion compensation, and ReID feature fusion. TransMOT [9] and GTR [48] use spatial-temporal transformers when computing the assignment matrix, e.g., for feature interaction and history information aggregation. OC-SORT [6] relaxes the linear motion assumption and uses a learnable motion model.
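  As a concrete illustration of the matching step these trackers share, here is a self-contained sketch of IoU-based association with the Hungarian algorithm [14]. The (x1, y1, x2, y2) box format and the 0.3 threshold are illustrative assumptions; real trackers add score gating, track lifecycle logic, and other heuristics.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(boxes_a, boxes_b):
    # boxes_a: (N, 4), boxes_b: (M, 4), both in (x1, y1, x2, y2) format.
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    lt = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])  # intersection top-left
    rb = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])  # intersection bottom-right
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def associate(predicted_boxes, detected_boxes, iou_threshold=0.3):
    # Hungarian matching on the negated IoU matrix; pairs whose IoU falls
    # below the threshold are treated as unmatched.
    cost = -iou_matrix(predicted_boxes, detected_boxes)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if -cost[r, c] >= iou_threshold]
```

  In SORT-style pipelines, matched pairs update the per-track Kalman filters, while unmatched detections spawn new tracks.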

  Although our method also benefits from a robust detector, we do not compute a similarity matrix; instead, we use track queries with anchors to jointly model motion and appearance.

  Tracking by query propagation. Another paradigm for MOT extends query-based object detectors [7, 29, 49] to tracking. These methods force each query to recall the same instance across different frames. The interaction between queries and image features can be performed either in parallel or serially over time.

  Parallel methods take a short video clip as input and use one set of queries to interact with all frames to predict the trajectory of each instance. VisTR [34] and follow-up works [8, 40] extend DETR [7] to detect tracks in short video clips. Since parallel methods require the whole video as input, they are memory-consuming and limited to short clips of a few dozen frames.

  Serial methods perform frame-by-frame query interaction with image features and iteratively update the track queries associated with instances. Tracktor++ [2] utilizes the R-CNN [12] regression head for iterative instance re-localization across frames. TrackFormer [20] and MOTR [43] extend from Deformable DETR [49]; they predict object bounding boxes and update the track queries to detect the same instances in subsequent frames. MeMOT [5] builds short-term and long-term instance feature memory banks to generate track queries. TransTrack [28] propagates track queries once to find the object locations in the next frame. P3AFormer [46] adopts flow-guided image feature propagation. Unlike MOTR, TransTrack and P3AFormer still use location-based Hungarian matching between history tracks and current detections rather than propagating queries through the whole video.

  Our method inherits the query propagation approach for long-term end-to-end tracking while also leveraging a powerful object detector to provide object location priors. It greatly outperforms existing matching-based and query-based methods in terms of tracking performance under complex motion.

3. Method

  Here we introduce MOTRv2, which builds on proposal query generation (Sec. 3.4) and proposal propagation (Sec. 3.5).

3.1 Revisiting MOTR

  MOTR [43] is a fully end-to-end multi-object tracking framework built on the Deformable DETR [49] architecture. It introduces track queries and object queries. Object queries are responsible for detecting new-born or lost objects, while each track query is responsible for tracking one unique instance over time. To initialize a track query, MOTR uses the output of the object query associated with a newly detected object. Track queries are updated over time based on their states and the current image features, which enables them to predict tracks in an online manner.

  The tracklet-aware label assignment in MOTR assigns track queries to previously tracked instances, while object queries are assigned to the remaining instances via bipartite matching. MOTR also introduces a temporal aggregation network to enhance the track query features and a collective average loss to balance the loss across frames.
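  A rough sketch of this online propagation loop may help; `motr_decoder`, its return signature, and the plain score-threshold filtering are hypothetical placeholders standing in for MOTR's actual decoder and tracklet-aware assignment, not the real interface.

```python
import torch

def track_video(frames, motr_decoder, detect_queries, score_thresh=0.5):
    # Hypothetical sketch of MOTR-style online tracking: detect queries
    # find new-born objects; kept outputs become track queries that are
    # carried over to the next frame.
    track_queries = detect_queries.new_zeros((0, detect_queries.shape[-1]))
    results = []
    for features in frames:
        queries = torch.cat([track_queries, detect_queries], dim=0)
        hidden, boxes, scores = motr_decoder(queries, features)
        keep = scores > score_thresh  # active tracks + newly detected objects
        track_queries = hidden[keep]  # updated states propagate to next frame
        results.append(boxes[keep])
    return results
```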

3.2 Motivation

  A major limitation of end-to-end multi-object tracking frameworks is their inferior detection performance compared to tracking-by-detection methods [6, 44] that rely on independent object detectors. To address this limitation, we propose to incorporate a YOLOX [11] object detector to generate proposals that serve as object anchors, providing detection priors to MOTR. This greatly eases the conflict between the joint learning of detection and association tasks in MOTR and improves the detection performance.

3.3 Overall Architecture

  As shown in Figure 2, the proposed MOTRv2 architecture consists of two main components: a state-of-the-art object detector and a modified, anchor-based MOTR tracker.

  The object detector component first generates proposals for both training and inference. For each frame, YOLOX produces a set of proposals, each consisting of center coordinates, width, height, and a confidence score. The modified anchor-based MOTR component is responsible for learning track association based on the generated proposals. Sec. 3.4 describes how the detect queries in the original MOTR framework are replaced by proposal queries. The modified MOTR now takes the concatenation of track queries and proposal queries as input. Sec. 3.5 describes the interaction between the concatenated queries and the frame features for updating the bounding boxes of tracked objects.
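  The following sketch shows one plausible way such proposal queries could be formed and concatenated with track queries, assuming a single shared learnable query modulated by an embedding of the confidence score; `shared_query` and `score_mlp` are hypothetical modules inferred from the description above, not the released code.

```python
import torch

def build_proposal_queries(proposals, shared_query, score_mlp):
    # proposals: (N, 5) tensor of YOLOX outputs (cx, cy, w, h, score),
    # with boxes normalized to [0, 1].
    # Each proposal reuses one shared learnable query embedding (1, D),
    # modulated by an embedding of its confidence score; the box itself
    # becomes the anchor encoded by the sine-cosine PE in the decoder.
    anchors = proposals[:, :4]
    queries = shared_query.expand(proposals.shape[0], -1) + score_mlp(proposals[:, 4:5])
    return queries, anchors

# Hypothetical per-frame usage: the tracker consumes the concatenation of
# track queries (carried over from the previous frame) and proposal queries.
# queries = torch.cat([track_queries, proposal_queries], dim=0)
# anchors = torch.cat([track_anchors, proposal_anchors], dim=0)
```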

References

[1] Nir Aharon, Roy Orfaig, and Ben-Zion Bobrovsky. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651, 2022.
[2] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixé. Tracking without bells and whistles. In ICCV, 2019.
[3] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008:1–10, 2008.
[4] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In ICIP, 2016.
[5] Jiarui Cai, Mingze Xu, Wei Li, Yuanjun Xiong, Wei Xia, Zhuowen Tu, and Stefano Soatto. MeMOT: Multi-object tracking with memory. In CVPR, pages 8090–8100, 2022.
[6] Jinkun Cao, Xinshuo Weng, Rawal Khirodkar, Jiangmiao Pang, and Kris Kitani. Observation-centric SORT: Rethinking SORT for robust multi-object tracking. arXiv preprint arXiv:2203.14360, 2022.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[8] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In CVPR, 2022.
[9] Peng Chu, Jiang Wang, Quanzeng You, Haibin Ling, and Zicheng Liu. TransMOT: Spatial-temporal graph transformer for multiple object tracking. arXiv preprint arXiv:2104.00194, 2021.
[10] Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. MOT20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003, 2020.
[11] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430, 2021.
[12] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[14] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[15] Laura Leal-Taixé, Anton Milan, Ian Reid, Stefan Roth, and Konrad Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015.
[16] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. DN-DETR: Accelerate DETR training by introducing query denoising. In CVPR, pages 13619–13627, 2022.
[17] Siyuan Li, Martin Danelljan, Henghui Ding, Thomas E Huang, and Fisher Yu. Tracking every thing in the wild. In ECCV, pages 498–515. Springer, 2022.
[18] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329, 2022.
[19] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. HOTA: A higher order metric for evaluating multi-object tracking. IJCV, 129(2):548–578, 2021.
[20] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixé, and Christoph Feichtenhofer. TrackFormer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702, 2021.
[21] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
[22] Jiangmiao Pang, Linlu Qiu, Xia Li, Haofeng Chen, Qi Li, Trevor Darrell, and Fisher Yu. Quasi-dense similarity learning for multiple object tracking. In CVPR, 2021.
[23] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, 2016.
[24] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. CrowdHuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
[25] Bing Shuai, Andrew G Berneshawi, Davide Modolo, and Joseph Tighe. Multi-object tracking with siamese track-rcnn. arXiv preprint arXiv:2004.07786, 2020.
[26] Daniel Stadler and Jürgen Beyerer. Modelling ambiguous assignments for multi-person tracking in crowds. In WACV, pages 133–142, 2022.
[27] Peize Sun, Jinkun Cao, Yi Jiang, Zehuan Yuan, Song Bai, Kris Kitani, and Ping Luo. DanceTrack: Multi-object tracking in uniform appearance and diverse motion. arXiv preprint arXiv:2111.14690, 2021.
[28] Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, and Ping Luo. TransTrack: Multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
[29] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, and Masayoshi Tomizuka. Sparse R-CNN: End-to-end object detection with learnable proposals. arXiv preprint arXiv:2011.12450, 2020.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[31] Qiang Wang, Yun Zheng, Pan Pan, and Yinghui Xu. Multiple object tracking with correlation learning. In CVPR, pages 3876–3886, 2021.
[32] Shuai Wang, Hao Sheng, Yang Zhang, Yubin Wu, and Zhang Xiong. A general recurrent tracking framework without real data. In ICCV, pages 13219–13228, 2021.
[33] Yongxin Wang, Kris Kitani, and Xinshuo Weng. Joint object detection and multi-object tracking with graph neural networks. In ICRA, pages 13708–13715. IEEE, 2021.
[34] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In CVPR, 2021.
[35] Yingming Wang, Xiangyu Zhang, Tong Yang, and Jian Sun. Anchor DETR: Query design for transformer-based detector. arXiv preprint arXiv:2109.07107, 2021.
[36] Zhongdao Wang, Liang Zheng, Yixuan Liu, Yali Li, and Shengjin Wang. Towards real-time multi-object tracking. In ECCV, 2020.
[37] Greg Welch, Gary Bishop, et al. An introduction to the Kalman filter, 1995.
[38] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In ICIP, 2017.
[39] Jialian Wu, Jiale Cao, Liangchen Song, Yu Wang, Ming Yang, and Junsong Yuan. Track to detect and segment: An online multi-object tracker. In CVPR, 2021.
[40] Junfeng Wu, Yi Jiang, Wenqing Zhang, Xiang Bai, and Song Bai. SeqFormer: A frustratingly simple model for video instance segmentation. arXiv preprint arXiv:2112.08275, 2021.
[41] Bin Yan, Yi Jiang, Peize Sun, Dong Wang, Zehuan Yuan, Ping Luo, and Huchuan Lu. Towards grand unification of object tracking. In ECCV, 2022.
[42] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In CVPR, June 2020.
[43] Fangao Zeng, Bin Dong, Yuang Zhang, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. MOTR: End-to-end multiple-object tracking with transformer. In ECCV, pages 659–675. Springer, 2022.
[44] Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. ByteTrack: Multi-object tracking by associating every detection box. arXiv preprint arXiv:2110.06864, 2021.
[45] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. FairMOT: On the fairness of detection and re-identification in multiple object tracking. IJCV, pages 1–19, 2021.
[46] Zelin Zhao, Ze Wu, Yueqing Zhuang, Boxun Li, and Jiaya Jia. Tracking objects as pixel-wise distributions, 2022.
[47] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In ECCV, 2020.
[48] Xingyi Zhou, Tianwei Yin, Vladlen Koltun, and Philipp Krähenbühl. Global tracking transformers. In CVPR, 2022.
[49] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2020.
