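"Deep Learning in Video Multi-Object Tracking: A Survey" (paper link)
I recently started looking into multi-object tracking, so I began with a fairly recent survey paper from 2019 as an introduction.
The paper focuses on single-camera videos and 2D data. It breaks generic MOT algorithms into four steps, describes how deep learning is applied in each step, and points to representative papers for further reading.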
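Multi-object tracking (MOT) takes a video as input and, without any prior knowledge of the targets (their appearance or their number), tracks the trajectories of objects from one or more classes. Common examples are pedestrian tracking and vehicle tracking.
Unlike single-object tracking (SOT), MOT must not only output a bounding box for every target in every frame, but also label each box with a target ID in order to tell intra-class objects apart.
In addition, SOT has prior knowledge of the target's appearance, because the bounding box in the first frame of each video is given, whereas MOT has none. As a result, SOT mostly relies on correlation-filter methods, while MOT currently mostly follows tracking by detection (covered in detail later).
The difficulties of MOT lie in
The dominant MOT approach today is tracking by detection: a standard object detector first extracts a set of bounding boxes, and boxes that contain the same target are then assigned the same ID based on the relationship between consecutive frames. Because detection quality is already fairly good, MOT is often regarded as an assignment problem, i.e. how to match corresponding bboxes across frames.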
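To make the tracking-by-detection idea concrete, here is a minimal sketch that carries IDs from one frame to the next by greedy IoU matching. It is not the method of any particular paper in the survey; the function names `iou` and `propagate_ids`, the `(x1, y1, x2, y2)` box format, and the 0.3 threshold are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def propagate_ids(prev_tracks, detections, next_id, iou_threshold=0.3):
    """Assign target IDs to the detections of the current frame.

    prev_tracks: dict mapping target ID -> box in the previous frame.
    detections: list of boxes in the current frame.
    Returns (dict ID -> box for the current frame, updated next_id).
    """
    # Score every (track, detection) pair and visit the pairs best-first.
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in prev_tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    current, used_tracks, used_dets = {}, set(), set()
    for score, tid, j in pairs:
        if score < iou_threshold:
            break  # all remaining pairs overlap too little to be a match
        if tid in used_tracks or j in used_dets:
            continue
        current[tid] = detections[j]
        used_tracks.add(tid)
        used_dets.add(j)
    # Detections that matched no existing track start new tracks.
    for j, det in enumerate(detections):
        if j not in used_dets:
            current[next_id] = det
            next_id += 1
    return current, next_id


frame1 = {1: (10, 10, 50, 50), 2: (100, 100, 150, 150)}
frame2_dets = [(102, 101, 152, 151), (12, 11, 52, 51), (300, 300, 340, 340)]
tracks, next_id = propagate_ids(frame1, frame2_dets, next_id=3)
```

In this toy input the two shifted boxes keep IDs 1 and 2, and the box that overlaps nothing is given the new ID 3. Greedy matching is only the simplest choice here; real trackers usually add appearance or motion cues and a globally optimal assignment.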
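MOT algorithms fall into two groups, batch and online. Batch tracking algorithms may use past, current, and future frames when determining the tracks in the current frame, whereas online tracking algorithms may use only past and current frames.
Note that online does not mean real-time: a real-time algorithm is necessarily online, but most online algorithms are still too slow for a real-time environment, especially those that use deep learning, which tend to be computationally intensive.
Mainstream MOT algorithms can be summarized in four steps: detection, feature extraction (including motion prediction), affinity computation, and association.
Common MOT evaluation metrics come in three groups: the metrics defined by Wu and Nevatia, the CLEAR MOT metrics, and the ID metrics.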
classical metrics
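Name | Definition |
---|---|
Mostly Tracked (MT) | Number of targets that are correctly tracked in at least 80% of their frames |
Fragments | Number of trajectory hypotheses that cover at most 80% of a ground-truth trajectory (one true trajectory may be covered by several fragments) |
Mostly Lost (ML) | Number of targets that are correctly tracked in less than 20% of their frames |
False trajectories | Number of predicted trajectories that do not correspond to any real target |
ID switches | Number of times a target is correctly tracked but its ID is wrongly changed |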
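
As a small illustration of the 80% and 20% thresholds in the table above, the sketch below counts MT and ML targets from per-target coverage ratios. The function name `classify_targets` and the middle bucket `partially_tracked` are my own labels, not terms taken from the table.

```python
def classify_targets(tracked_ratio):
    """Count Mostly Tracked / Mostly Lost targets.

    tracked_ratio maps a target ID to the fraction of its ground-truth
    frames in which it was correctly tracked (a value in [0, 1]).
    """
    counts = {"MT": 0, "ML": 0, "partially_tracked": 0}
    for ratio in tracked_ratio.values():
        if ratio >= 0.8:      # tracked in at least 80% of its frames
            counts["MT"] += 1
        elif ratio < 0.2:     # tracked in less than 20% of its frames
            counts["ML"] += 1
        else:                 # hypothetical name for everything in between
            counts["partially_tracked"] += 1
    return counts


ratios = {"a": 0.95, "b": 0.80, "c": 0.50, "d": 0.10}
summary = classify_targets(ratios)
```

With these example ratios, targets `a` and `b` count as MT, `d` as ML, and `c` falls in between.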
CLEAR MOT metrics
Ground truth and predictions are matched using IoU (together with a continuity constraint), and FP, FN, Fragm, and IDSW are then counted, where Fragm is the total number of fragments and IDSW is the total number of ID switches.
The following two scores are commonly used.
$$MOTA = 1 - \frac{FN + FP + IDSW}{GT} \in (-\infty, 1]$$
where GT is the number of ground-truth boxes.
$$MOTP = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}$$
where $c_t$ is the number of correctly matched targets in frame $t$, and $d_{t,i}$ is the IoU between detected target $i$ and its matched ground-truth target.
MOTA measures the quality of tracking, while MOTP focuses on the quality of detection, i.e. how precise the bounding boxes are (IoU).
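The two formulas can be checked with a few lines of code. This is a direct transcription under the definitions above, not the official MOTChallenge evaluation code; the function names and the toy numbers are assumptions.

```python
def mota(fn, fp, idsw, gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over frames."""
    if gt <= 0:
        raise ValueError("gt must be the positive number of ground-truth boxes")
    return 1.0 - (fn + fp + idsw) / gt


def motp(ious_per_frame):
    """MOTP as the mean IoU over all matched pairs.

    ious_per_frame holds one list per frame with the IoU of every
    correctly matched (detection, ground truth) pair in that frame,
    so len(frame) plays the role of c_t and the values are d_{t,i}.
    """
    total_iou = sum(sum(frame) for frame in ious_per_frame)
    total_matches = sum(len(frame) for frame in ious_per_frame)
    if total_matches == 0:
        raise ValueError("no matched pairs, MOTP is undefined")
    return total_iou / total_matches


good = mota(fn=20, fp=10, idsw=5, gt=100)   # roughly 0.65
bad = mota(fn=80, fp=50, idsw=0, gt=100)    # negative: more errors than boxes
precision = motp([[0.8, 0.6], [0.7]])       # mean IoU of three matches, roughly 0.7
```

The second call shows why MOTA is unbounded below: once the error count exceeds the number of ground-truth boxes, the score goes negative.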
ID scores
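ID scores complement the CLEAR MOT metrics: each predicted trajectory is mapped to the ground-truth object it matches for the largest number of frames. A bipartite matching algorithm yields IDTP, IDFP, and IDFN, from which the ID scores (IDP, IDR, and IDF1) are computed.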
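For reference, these scores are usually defined as identification precision, recall, and F1. The definitions below are the standard ones as I recall them from the ID-measures literature, written with the IDTP/IDFP/IDFN counts introduced above:

$$IDP = \frac{IDTP}{IDTP + IDFP}, \qquad IDR = \frac{IDTP}{IDTP + IDFN}, \qquad IDF1 = \frac{2\,IDTP}{2\,IDTP + IDFP + IDFN}$$

IDF1 is the harmonic mean of IDP and IDR, so it rewards trackers that keep the same identity on a target for as long as possible.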
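Detection accuracy strongly affects the performance of the whole tracking algorithm, so many datasets provide public detection results, which makes performance comparisons between algorithms fairer. Some algorithms, however, integrate their own detectors in order to improve tracking performance.
Simple Online and Realtime Tracking (SORT) algorithm 2016 IEEE ICIP, one of the first MOT pipelines to use a CNN for pedestrian detection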
<-highlight
Multiple object tracking with high performance detection and appearance feature 2016 ECCV, modified Faster R-CNN with skip-pooling and multi-region features, SOTA on MOT16
Automatic individual pig detection and tracking in pig farms 2019 Sensors, DCF-based online tracking method with HOG and Color Names features to predict tag-boxes, refines bboxes with DCF
Joint detection and online multi-object tracking 2018 CVPR, refines SSD detection results using the other steps of the tracking algorithm, uses affinity scores to replace NMS in SSD <-highlight
Multi-object tracking with correlation filter for autonomous vehicle 2018 Sensors, uses a CNN-based Correlation Filter (CCF) on cropped RoIs so that SSD generates more accurate bboxes
Instance flow based online multiple object tracking 2017 IEEE ICIP, obtains an instance-aware semantic segmentation map of the current frame with a Multi-task Network Cascade, then uses optical flow to predict position and shape in the next frame.
<- suitable for moving cameras
auto-encoders
Learning deep features for multiple object tracking by using a multi-task learning strategy 2014 IEEE ICIP, the first to apply deep learning in an MOT algorithm; uses auto-encoders to refine visual features. It shows that feature refinement can greatly improve the tracking model.
CNNs as visual feature extractors
-> use pre-trained CNN to extract features.
DeepSORT: Simple online and realtime tracking with a deep association metric 2017 IEEE ICIP, extracts feature vectors with a custom residual CNN and adds the cosine distance between vectors to the affinity scores. This overcomes the main weakness of the original SORT algorithm, namely too many ID switches.
An Automatic Tracking Method for Multiple Cells Based on Multi-Feature Fusion 2018 IEEE Access, distinguishes fast-moving from slow-moving targets and applies a different similarity criterion to each kind
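The cosine distance used as an appearance term in the DeepSORT entry above can be sketched in a few lines. The vectors here are made-up toy features rather than CNN outputs, and `min_distance_to_gallery` (comparing a detection with several stored features of one track and keeping the smallest distance) is an illustrative helper of my own, not DeepSORT's actual code.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def min_distance_to_gallery(gallery, feature):
    """Smallest cosine distance between a detection feature and any
    feature previously stored for a track (hypothetical helper)."""
    return min(cosine_distance(stored, feature) for stored in gallery)


track_gallery = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]  # toy appearance features
same_person = [0.95, 0.05, 0.0]
other_person = [0.0, 1.0, 0.0]

d_same = min_distance_to_gallery(track_gallery, same_person)
d_other = min_distance_to_gallery(track_gallery, other_person)
```

A detection that looks like the tracked target gets a distance near 0, while an unrelated one gets a distance near 1, which is what lets the appearance term reduce ID switches.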
Siamese networks (reference material)
A Siamese network consists of two identical subnetworks with shared parameters. It takes two images as input and aims to decide whether they contain the same object; to achieve this, the convolutional layers learn discriminative features.
Similarity mapping with enhanced siamese network for multi-object tracking 2016 NIPS, trains a Siamese network with contrastive loss; takes two images, their IoU score, and their area ratio as input
Tracking persons-of-interest via adaptive discriminative features 2016 ECCV, SymTriplet loss (triplet network)
Multi-object tracking with quadruplet convolutional neural networks 2017 CVPR, quadruplet network that also takes the temporal distance between detections into account
Online multi-target tracking with tensor-based high-order graph matching 2018 ICPR, triplet network based on Mask R-CNN
Eliminating exposure bias and loss-evaluation mismatch in multiple object tracking 2019 CVPR, ReID triplet CNN + bidirectional LSTM
Online multi-object tracking with dual matching attention networks 2018 ECCV, spatial attention networks (SAN) + bidirectional LSTM; the attention mechanism suppresses the background inside each bbox. The whole network is used to recover from occlusion when the ECO tracker, which is trained with hard example mining, loses the target. <- highlight
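The contrastive and triplet losses named in the entries above can be sketched on plain feature vectors. These are the common textbook forms, assuming Euclidean distance and the margins shown; they are not the exact losses of the cited papers (SymTriplet in particular is a modified variant that is not reproduced here).

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same, margin=1.0):
    """Pull embeddings of the same object together and push embeddings
    of different objects at least `margin` apart."""
    d = euclidean(u, v)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Ask the anchor to be closer to the positive than to the negative
    by at least `margin` (unsquared-distance variant)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


pull = contrastive_loss([0.0, 0.0], [3.0, 4.0], same=True)        # 0.5 * 5^2
push = contrastive_loss([0.0, 0.0], [3.0, 4.0], same=False)       # already far apart
easy = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])           # constraint satisfied
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])           # constraint violated
```

Both losses go to zero once the embedding already separates the objects, so training effort concentrates on the confusable pairs.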
more complex approaches for visual feature extraction
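Learning to track multiple cues with long-term dependencies 2017 ICCV, uses three RNNs to compute several kinds of features (appearance + motion + interactions) and feeds their outputs into an LSTM that computes the affinity. SOTA on MOT15 and MOT16 at the time of publication.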
<- highlight
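Online multi-object tracking by decision making 2015 ICCV, the previous paper follows an overall algorithm similar to this one; this paper uses a Markov Decision Process (MDP) based framework
A directed sparse graphical model for multi-target tracking 2018 CVPR, reduces the cost of similarity computation: a hidden Markov model predicts each object's position in the next few frames, and similarity is computed only for detections that are sufficiently close to the HMM predictions.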
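The cost saving described in the last entry comes from gating: only detections near the predicted position are scored at all. The sketch below shows the gating step only. It swaps the paper's HMM for a plain constant-velocity prediction, and the names `predict_next`, `gate_detections`, and the pixel radius are assumptions for illustration.

```python
import math


def predict_next(prev_center, curr_center):
    """Constant-velocity stand-in for the motion model (not an HMM)."""
    return (2 * curr_center[0] - prev_center[0],
            2 * curr_center[1] - prev_center[1])


def gate_detections(predicted_center, detection_centers, radius):
    """Indices of detections within `radius` of the predicted position.

    Only these candidates would be passed on to the (expensive)
    appearance-similarity computation.
    """
    return [i for i, center in enumerate(detection_centers)
            if math.dist(predicted_center, center) <= radius]


predicted = predict_next(prev_center=(0.0, 10.0), curr_center=(5.0, 10.0))
detections = [(11.0, 10.0), (50.0, 50.0), (10.0, 13.0)]
candidates = gate_detections(predicted, detections, radius=5.0)
```

Here the prediction is `(10.0, 10.0)`, so only the first and third detections survive the gate and the far-away one is never compared.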
CNNs for motion prediction: correlation filters
Hierarchical convolutional features for visual tracking 2015 ICCV, correlation filter whose output is a response map for the tracked object and an estimate of the object's new position in the next frame
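Many models use the distance between the CNN features of tracklets and detections as the similarity measure. The methods below instead compute similarity directly with a deep model, so no hand-defined distance metric between features is needed. They fall into two main families, LSTM-based and CNN-based.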
RNN and LSTMs
Online multi-target tracking using recurrent neural networks 2017 AAAI, one of the first to compute similarity with a deep network; an end-to-end learning approach for online MOT. Fast (165 FPS); does not use appearance features
<- highlight
Siamese LSTMs
An online and flexible multi-object tracking framework using long short-term memory 2018 CVPR. Step one computes IoU as the affinity measure to capture short, reliable tracklets; step two feeds motion and appearance features into a Siamese LSTM to compute the affinity.
Bidirectional LSTMs
Online multi-object tracking with dual matching attention networks, see the Siamese networks part of Section 3.2
Use of LSTMs in MHT frameworks
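Multiple Hypothesis Tracking (MHT): builds a tree of potential track hypotheses for each candidate target, which gives a systematic solution to the data association problem. The likelihood of each track is computed, and the most likely combination of tracks is selected.
Multiple hypothesis tracking revisited 2015 ICCV. Revisits the classical multiple hypothesis tracking algorithm in a tracking-by-detection framework and proposes MHT-DAM (Multiple Hypothesis Tracking with Discriminative Appearance Modeling)
Eliminating exposure bias and loss-evaluation mismatch in multiple object tracking 2019 CVPR. In a variant of MHT, an LSTM computes tracklet scores, and tracklets are iteratively pruned and grown until the most likely combination of tracks is selected. The paper's core contribution is to identify and address two problems that commonly arise when applying RNNs to MOT: loss-evaluation mismatch and exposure bias. The algorithm achieves the highest IDF1 on several datasets, but not the best MOTA. <-highlight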
CNNs for affinity computation
Multiple people tracking by lifted multicut and person re-identification 2017 CVPR. A novel graph-based formulation that links and clusters person hypotheses over time by solving an instance of a minimum cost lifted multi-cut problem. SOTA on MOT16.
<-highlight
Siamese CNNs
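Siamese CNNs are a common way to compute similarity. Unlike Section 3.2, where they produce a feature vector for each of two images and the distance between the vectors is then computed, here the output of the Siamese CNN is used directly as the similarity.
Compared with the other steps, deep learning is used relatively little in the association step; traditional algorithms such as the Hungarian algorithm are still widely used. Below are three kinds of deep learning approaches that have been tried in the association step.
Online multi-target tracking using recurrent neural networks, see Section 3.3
Joint detection and online multi-object tracking, see Section 3.1
Multi-agent reinforcement learning for multi-object tracking 2018 AAMAS
Collaborative deep reinforcement learning for multi-object tracking 2018 ECCV <-highlight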
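As a baseline for comparison with the learned approaches above, the classical association step solves a minimum-cost assignment between tracks and detections. The sketch below finds that optimum by exhaustive search over permutations, which is a stand-in that returns the same answer as the Hungarian algorithm on tiny inputs but is far too slow for real use. The function name and the cost values are assumptions.

```python
from itertools import permutations


def optimal_assignment(cost):
    """Minimum-cost one-to-one assignment for a square cost matrix.

    cost[i][j] is the cost of assigning track i to detection j
    (for example 1 - affinity). Returns (assignment, total_cost),
    where assignment[i] is the detection index chosen for track i.
    Brute force: O(n!) time, only meant for very small n.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost


# Picking the single cheapest pair first (0 -> 0, cost 0.1) would force
# track 1 onto detection 1 (cost 0.9); the global optimum swaps them.
cost_matrix = [[0.10, 0.20],
               [0.15, 0.90]]
assignment, total_cost = optimal_assignment(cost_matrix)
```

The example is chosen so that a greedy choice is suboptimal, which is the reason a global assignment solver is preferred in the association step.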
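For a fair comparison, only results obtained on the full MOTChallenge test set are shown.
The experiments are first split by whether they use public or private detections, and then further into online and batch algorithms. MOTA is the primary metric. The speed column is less reliable, because the reported results usually exclude the runtime of the detection step (which, being deep-learning based, is often the most expensive one), and the hardware used for testing also differs between papers.
See the paper for the detailed comparison of results.
Some papers try to compensate for missing detections on top of the public detections and thereby reduce FN: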
Heterogeneous association graph fusion for target association in multiple object tracking 2018 IEEE TCSVT. superpixel extraction algorithm
Real-time multiple people tracking with deeply learned candidate selection and person re-identification 2018 ICME. MOTDT: use R-FCN to integrate missing detections with new detections, best MOTA and lowest FN among online algorithms on MOT17.
Online multi-object tracking with convolutional neural networks 2017 ICIP. Employ a Particle Filter algorithm and rely on detections only to initialize new targets and to recover lost ones.
Collaborative deep reinforcement learning for multi-object tracking, see Section 3.4. Learns a motion model of the objects. Second best among online methods on MOT16.
steps | best approach |
---|---|
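private detection | Faster R-CNN (SSD is faster but performs worse) |
feature extraction | CNN to extract appearance features (appearance is the most important feature, but many of the best-performing algorithms also combine several other features to compute similarity, most importantly motion features, which are commonly extracted with LSTMs / Kalman filters / Bayesian filters) |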
affinity | hand-crafted distance metrics on feature vectors / Siamese CNN |
association | - |
Common findings
Possible future research directions