The most successful multi-object tracking methods follow the tracking-by-detection paradigm. Typical approaches either re-identify objects over long time spans using sophisticated appearance models [25, 4] or compute the trajectory of each object by global optimization [2, 29, 3].
In recent years, object detection has improved dramatically, producing highly accurate detectors such as Fast/Faster/Mask R-CNN [11, 24, 12], FCN [20], and SSD [19].
This changes the requirements placed on tracking methods: being able to rely on increasingly accurate detections enables much simpler tracking-by-detection approaches [6, 5, 30]. These methods share the core principle of associating detections with high spatio-temporal overlap into a single track, where the overlap is measured by the IOU of detection bounding boxes in consecutive frames.
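Since this overlap measure drives everything that follows, a minimal sketch may help. The box layout `(x1, y1, x2, y2)` is our illustration choice, not prescribed by the cited papers:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```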
SORT [5] uses a Kalman filter motion model and the Hungarian algorithm to solve the association problem. It was later extended to DeepSORT [30], which adds deep appearance features to handle long-term occlusions. Similar to SORT, the IOU tracker [6] relies solely on detections rather than image information. This simple tracker uses no motion model and associates detections to tracks in a greedy fashion. As a result, it can run at several thousand frames per second (assuming the required detections are available) while outperforming far more complex state-of-the-art methods [21]. The main drawback of this simple approach is that it demands high recall from the underlying detector: every gap caused by one or a few missed detections not only produces false negatives but also terminates and restarts tracks, leading to a high number of track fragmentations and identity switches.
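To illustrate the assignment step that SORT solves optimally, here is a hedged sketch casting IOU-based matching as a linear assignment problem via `scipy.optimize.linear_sum_assignment`. Note that SORT actually matches detections against Kalman-predicted track positions; for brevity this sketch matches against the tracks' last boxes and reuses the `iou()` helper sketched above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_boxes, det_boxes, sigma_iou):
    """Return (track_idx, det_idx) pairs whose IOU clears sigma_iou."""
    if not track_boxes or not det_boxes:
        return []
    # Cost of assigning detection d to track t is 1 - IOU(t, d).
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= sigma_iou]
```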
In this paper, we address this problem by incorporating a visual single-object tracker into the IOU tracking scheme to increase robustness against missing detections. The idea is to continue each track with a visual tracker whenever no new detection can be associated, thereby filling the gaps between track fragments. This reduces the number of track fragmentations and identity switches. The speed and accuracy of the method depend on the performance of the chosen object detector and visual tracker. In experiments on the UA-DETRAC [28] and VisDrone [31] datasets, we study the influence of different state-of-the-art multi-object detectors and single-object trackers on the performance of the framework. Track fragmentation and identity switches are reduced at low computational cost, and the method outperforms the state of the art on both datasets.
We build on the IOU tracking method of [6] and complement it with a single-object tracker to compensate for missing detections, thereby reducing the high number of track fragmentations and identity switches.
We first review the concept and limitations of the IOU tracker. Its main idea is that modern object detectors are accurate enough that the impact of false positive and false negative (FP/FN) detections can be neglected. Under this assumption, the tracking-by-detection paradigm becomes remarkably simple.
The IOU tracker associates detections using IOU information alone, in a greedy fashion: each detection is assigned to the track whose detection in the previous frame overlaps it with the highest IOU. Although not optimal, this heuristic is highly efficient. Alternatively, the association can be cast as a linear assignment problem and solved with the Hungarian algorithm. In real-world data, FP/FN detections disturb the tracking process, so the output must be filtered by rules such as requiring every track to contain at least one high-confidence detection $(\geq \sigma_h)$ and to have a minimum length of $t_{\min}$ frames. These rules effectively remove FPs that would otherwise corrupt the result. FNs, on the other hand, cause tracks to end prematurely: since the IOU tracker does not propagate the last detection, a new track is started at the next available detection. Together, these factors lead to a high number of identity switches and track fragmentations.
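As a concrete reference, here is a minimal sketch of one frame of this greedy association together with the $\sigma_h$ / $t_{\min}$ filtering. The data layout (dicts with `boxes` and `max_score`) and all names are our own assumptions; `iou()` is the helper sketched earlier:

```python
def iou_tracker_frame(active, finished, detections, sigma_iou, sigma_h, t_min):
    """One frame of greedy IOU association; detections are dicts with 'box' and 'score'."""
    dets = list(detections)  # local copy; each detection is matched at most once
    still_active = []
    for track in active:
        best = max(dets, key=lambda d: iou(track['boxes'][-1], d['box']), default=None)
        if best is not None and iou(track['boxes'][-1], best['box']) >= sigma_iou:
            # Greedy match: extend the track with the best-overlapping detection.
            track['boxes'].append(best['box'])
            track['max_score'] = max(track['max_score'], best['score'])
            dets.remove(best)
            still_active.append(track)
        elif track['max_score'] >= sigma_h and len(track['boxes']) >= t_min:
            # The track ends here; keep it only if it passes both filter rules.
            finished.append(track)
    # Every detection left unmatched starts a new track.
    still_active += [{'boxes': [d['box']], 'max_score': d['score']} for d in dets]
    return still_active
```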
FN detections pose a particular problem for the IOU tracker because missed detections are not propagated. We therefore propose to extend the IOU tracker by falling back to single-object visual tracking whenever no detection is available for association. The complete tracking extension is illustrated in Figure 1.
Figure 1. Principle of the extended IOU tracker: missing detections cause tracks to fragment (a). A visual tracker can fill the resulting gaps (b). The final tracking result is far less fragmented (c).
Visual tracking is performed in two directions. First, if no detection satisfies the IOU matching threshold $\sigma_{IOU}$ of [6], a visual tracker is initialized at the last known position (the detection of the previous frame) and used to follow the object for up to $ttl$ (time-to-live) frames. If a new detection satisfying the $\sigma_{IOU}$ threshold appears within those $ttl$ frames, visual tracking stops and IOU tracking resumes; otherwise the track is terminated. This is usually enough to reliably compensate for a small number of missing detections.
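A sketch of this forward fallback under assumed interfaces: `VisualTracker` stands in for any off-the-shelf single-object tracker (e.g. KCF [14]); the trivial zero-motion placeholder below exists only to keep the sketch runnable:

```python
class VisualTracker:
    """Hypothetical stand-in for a single-object tracker (e.g. KCF [14]).
    This placeholder simply repeats the last box (zero-motion model)."""
    def __init__(self, image, box):
        self.box = box

    def update(self, image):
        return self.box

def extend_track(track, detections, image, sigma_iou, ttl):
    """Continue a track by detection if possible, otherwise by visual tracking."""
    best = max(detections, key=lambda d: iou(track['boxes'][-1], d['box']), default=None)
    if best is not None and iou(track['boxes'][-1], best['box']) >= sigma_iou:
        # A detection matched within ttl frames: resume plain IOU tracking.
        track['boxes'].append(best['box'])
        track['visual'], track['ttl_used'] = None, 0
        return 'matched'
    if track.get('ttl_used', 0) < ttl:
        # No match: keep the track alive on visual cues alone.
        if track.get('visual') is None:
            track['visual'] = VisualTracker(image, track['boxes'][-1])
        track['boxes'].append(track['visual'].update(image))
        track['ttl_used'] = track.get('ttl_used', 0) + 1
        return 'visual'
    return 'finished'  # ttl exhausted without a matching detection
```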
However, the more frames the visual tracker runs, the more likely it becomes to drift off the original object or jump to another one. To limit the number of consecutive frames in which an object is tracked by visual cues alone, we additionally track each newly started track backwards through the last $ttl$ frames. If the overlap criterion is met with respect to an already finished track, the two are merged. In this way, gaps of up to $2 \cdot ttl$ frames can be bridged while each individual visual tracker runs for at most $ttl$ frames.
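The backward pass can be sketched in the same spirit, reusing `iou()` and the `VisualTracker` stand-in from above. We assume the last $ttl$ frames are buffered; if the backward path of a new track overlaps a finished track with IOU $\geq \sigma_{IOU}$, the gap is closed and the identities merged:

```python
def try_merge_backwards(new_track, finished, frame_buffer, sigma_iou, ttl):
    """Track a new track backwards through up to ttl buffered frames and
    merge it with a finished track if the overlap criterion is met."""
    vt = VisualTracker(frame_buffer[-1], new_track['boxes'][0])
    backwards = []  # backward path, most recent frame first
    for image in reversed(frame_buffer[-(ttl + 1):-1]):
        box = vt.update(image)
        backwards.append(box)
        for track in finished:
            if iou(track['boxes'][-1], box) >= sigma_iou:
                # Overlap found: close the gap and merge the two identities.
                track['boxes'] += list(reversed(backwards)) + new_track['boxes']
                finished.remove(track)
                return track
    return None  # no merge; the backward stub is discarded (see below)
```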
While forward and backward visual tracking helps merge discontinued tracks, it also leaves residual stubs of visually tracked boxes at the start and end of every finished track, as shown in Figure 1(b). Since a track should begin when the object enters the scene and end when it leaves, these stubs cannot contribute to gap closing and easily introduce errors, because the object of interest may no longer be in the scene at all. For this reason, we restrict the use of visual tracking to gap closing and trim these visually tracked stubs from the tracks. (My reading: if the forward and backward visual trackers fail to meet, the point is treated as the end of one track and the start of another; the tracks are not merged, and the residual stubs are removed.)
[1] N. M. Al-Shakarji, G. Seetharaman, F. Bunyak, and K. Palaniappan. Robust multi-object tracking with semantic color correlation. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 1–7, 2017.
[2] A. Andriyenko and K. Schindler. Multi-target tracking by continuous energy minimization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1265–1272, 2011.
[3] A. Andriyenko, K. Schindler, and S. Roth. Discrete-continuous optimization for multi-target tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1926–1933, 2012.
[4] S.-H. Bae and K.-J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1218–1225, 2014.
[5] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In IEEE International Conference on Image Processing, pages 3464–3468, 2016.
[6] E. Bochinski, V. Eiselein, and T. Sikora. High-speed tracking-by-detection without using image information. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 1–6, 2017.
[7] Z. Cai, M. Saberian, and N. Vasconcelos. Learning complexity-aware cascades for deep pedestrian detection. In IEEE International Conference on Computer Vision, pages 3361–3369, 2015.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[9] C. Dicle, O. I. Camps, and M. Sznaier. The way they move: Tracking multiple targets with similar appearance. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2304–2311, 2013.
[10] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun. 3d traffic scene understanding from movable platforms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):1012–1025, 2014.
[11] R. Girshick. Fast r-cnn. In IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[12] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
[15] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In International conference on Pattern Recognition, pages 2756–2759, 2010.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[17] T. Kutschbach, E. Bochinski, V. Eiselein, and T. Sikora. Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data. In International Workshop on Traffic and Street Surveillance for Safety and Security at IEEE AVSS 2017, pages 1–5, 2017.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.- Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37, 2016.
[20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[21] S. Lyu, M.-C. Chang, D. Du, L. Wen, H. Qi, Y. Li, Y. Wei, L. Ke, T. Hu, M. Del Coco, et al. Ua-detrac 2017: Report of avss2017 & iwt4s challenge on advanced traffic monitoring. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 1–7, 2017.
[22] A. Milan, S. Roth, and K. Schindler. Continuous energy minimization for multitarget tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):58–72, 2014.
[23] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1201–1208, 2011.
[24] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
[25] S. Tang, M. Andriluka, B. Andres, and B. Schiele. Multiple people tracking by lifted multicut and person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3701–3710, 2017.
[26] W. Tian and M. Lauer. Fast cyclist detection by cascaded detector and geometric constraint. In International Conference on Intelligent Transportation Systems, pages 1286–1291, 2015.
[27] W. Tian and M. Lauer. Joint tracking with event grouping and temporal constraints. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 1–5, 2017.
[28] L. Wen, D. Du, Z. Cai, Z. Lei, M. Chang, H. Qi, J. Lim, M. Yang, and S. Lyu. UA-DETRAC: A new benchmark and protocol for multi-object tracking. CoRR, abs/1511.04136, 2015.
[29] L. Wen, W. Li, J. Yan, Z. Lei, D. Yi, and S. Z. Li. Multiple target tracking based on undirected hierarchical relation hypergraph. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1282–1289, 2014.
[30] N. Wojke, A. Bewley, and D. Paulus. Simple online and realtime tracking with a deep association metric. In IEEE International Conference on Image Processing, pages 3645–3649, 2017.
[31] P. Zhu, L. Wen, D. Du, X. Bian, H. Ling, Q. Hu, H. Cheng, C. Liu, X. Liu, W. Ma, Q. Nie, H. Wu, L. Wang, A. Schumann, D. Wang, D. Ortego, E. Luna, E. Michail, E. Bochinski, F. Ni, F. Bunyak, G. Zhang, G. Seetharaman, G. Li, H. Yu, I. Kompatsiaris, J. Zhao, J. Gao, J. Martinez, J. Miguel, K. Palaniappan, K. Avgerinakis, L. Sommer, M. Lauer, M. Liu, N. Al-Shakarji, O. Acatay, P. Giannakeris, Q. Zhao, Q. Ma, Q. Huang, S. Vrochidis, T. Sikora, T. Senst, W. Song, W. Tian, W. Zhang, Y. Zhao, Y. Bai, Y. Wu, Y. Wang, Y. Li, Z. Pi, and Z. Ma. VisDrone-VDT2018: The vision meets drone video detection and tracking challenge results. In European Conference on Computer Vision, pages 1–24, 2018.