Paper link
[1409.7618 Multiple Object Tracking: A Literature Review - arXiv]
Abstract
Introduces the contributions of this survey to the MOT field.
1. Introduction
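This survey focuses only on pedestrian tracking, for three reasons:
- Pedestrians are non-rigid objects, which makes them an ideal subject for the MOT problem.
- The applications are broad and the commercial value is high.
- 70% of the work in the MOT field targets pedestrians.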
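Compared with single object tracking (SOT), MOT must also deal with:
- Frequent occlusions.
- Initialization and termination of tracks.
- Similar appearance.
- Interactions among objects.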
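The contributions of this survey can be summarized as:
- A unified formulation, and two different ways of categorizing MOT algorithms.
- A survey and analysis of the different key components.
- Experimental results of different methods on commonly used datasets.
- Open problems and a discussion of future directions.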
2. MOT Problem
2.1 Problem Formulation
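MOT is defined from a probabilistic point of view: in short, given the observations, find the sequence of states that maximizes the posterior probability.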
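As a sketch of that formulation, the maximum a posteriori estimate can be written as below. The symbols are notation assumed here, since these notes do not define any: $S_{1:t}$ stands for the sequential states of all objects up to frame $t$, and $O_{1:t}$ for the observations collected over the same frames.

```latex
\hat{S}_{1:t} = \arg\max_{S_{1:t}} P\left(S_{1:t} \mid O_{1:t}\right)
```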
2.2 MOT Categorization
Methods are categorized by three criteria, following the order in which a tracking task is processed:
a) initialization method, b) processing mode, and c) type of output.
2.2.1 Initialization Method
Most existing MOT works can be grouped into two sets [51], depending on how objects are initialized: Detection-Based Tracking (DBT) and Detection-Free Tracking (DFT).
DBT is the more general of the two, because DFT cannot handle objects that newly appear or disappear.
2.2.2 Processing Mode
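MOT can also be categorized into online tracking and offline tracking. The difference is whether observations from future frames are utilized when handling the current frame: online tracking relies only on observations up to the current frame, while offline tracking may also use future frames.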
2.2.3 Type of Output
This criterion classifies MOT methods into deterministic ones and probabilistic ones, depending on the randomness of output.
Stochastic (probabilistic) Tracking. The output of stochastic tracking varies from one run to the next.
Deterministic Tracking. The output of deterministic tracking is constant when running the methods multiple times.
3. MOT Component
3.1 Appearance Model
Technically, an appearance model includes two components: visual representation and statistical measuring.
3.1.1 Visual Representation
- Local features. KLT, optical flow
- Region features. Here, order means the order of discrepancy when computing the representation.
  - Zero-order. Color histogram, raw pixel template
  - First-order. Gradient-based representations like HOG, level-set formulation
  - Up-to-second-order. Region covariance matrix
- Others.
3.1.2 Statistical Measuring
Given a visual representation, a statistical measure computes the affinity between two observations.
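As an illustration of a statistical measure over a zero-order representation, the sketch below compares two color histograms with the Bhattacharyya coefficient. This choice of measure and the function names are assumptions made for the example; the notes do not prescribe a particular measure.

```python
import math


def normalize(hist):
    """Scale a histogram so its bins sum to 1."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive mass")
    return [h / total for h in hist]


def bhattacharyya_affinity(hist_a, hist_b):
    """Affinity in [0, 1]: 1 for identical distributions, 0 for disjoint ones."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    p = normalize(hist_a)
    q = normalize(hist_b)
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


# Two observations with the same color distribution, and one with no overlap.
same = bhattacharyya_affinity([4, 2, 0, 2], [2, 1, 0, 1])
disjoint = bhattacharyya_affinity([1, 1, 0, 0], [0, 0, 1, 1])
```

A higher value means the two observations look more alike, so it can be used directly as an association score between a detection and an existing track.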
3.2 Motion Model
The motion model captures the dynamic behavior of an object. It estimates the potential position of objects in the future frames, thereby reducing the search space.
3.2.1 Linear Motion Model
- Velocity smoothness is modeled by enforcing the velocity values of an object in successive frames to change smoothly.
- Position smoothness directly penalizes the discrepancy between the observed position and the estimated position.
- Acceleration smoothness
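The smoothness terms above can be sketched as simple cost functions on 2D positions. This is a minimal illustration under assumptions made here (squared Euclidean penalties, one time unit between frames, hypothetical function names), not the formulation of any specific method in the survey.

```python
def predict_constant_velocity(prev_pos, curr_pos):
    """Extrapolate the next position assuming the last velocity is kept."""
    vx = curr_pos[0] - prev_pos[0]
    vy = curr_pos[1] - prev_pos[1]
    return (curr_pos[0] + vx, curr_pos[1] + vy)


def velocity_smoothness_cost(track):
    """Sum of squared changes in velocity between successive frames."""
    velocities = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(track, track[1:])]
    return sum(
        (v2[0] - v1[0]) ** 2 + (v2[1] - v1[1]) ** 2
        for v1, v2 in zip(velocities, velocities[1:])
    )


def position_smoothness_cost(observed, estimated):
    """Squared distance between the observed and the estimated position."""
    return (observed[0] - estimated[0]) ** 2 + (observed[1] - estimated[1]) ** 2


straight = [(0, 0), (1, 1), (2, 2)]
jerky = [(0, 0), (1, 0), (3, 0)]
next_pos = predict_constant_velocity(straight[-2], straight[-1])
```

A track moving at constant velocity gets zero velocity cost, while a track that speeds up or turns is penalized in proportion to the squared change.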
3.2.2 Non-linear Motion Model
3.3 Interaction Model
Interaction model, also known as mutual motion model, captures the influence of an object on other objects.
3.3.1 Social Force Models
Social force models are also known as group models. In these models, each object is considered to be dependent on other objects and environmental factors.
- Individual force.
  - fidelity, which means one should not change his desired destination
  - constancy, which means one should not suddenly change his momentum, including speed and direction
- Group force.
  - attraction, which means individuals moving together as a group should stay close
  - repulsion, which means that individuals moving together as a group should keep some distance away from others to make all members comfortable
  - coherence, which means individuals moving together as a group should move with similar velocity
3.3.2 Crowd Motion Pattern Models
Inspired by the crowd simulation literature [24], motion patterns are introduced to alleviate the difficulty of tracking an individual object in the crowd.
3.4 Exclusion Model
Exclusion is a constraint employed to avoid physical collisions when seeking a solution to the MOT problem. It arises from the fact that two distinct objects cannot occupy the same physical space in the real world.
3.4.1 Detection-level Exclusion Modeling
Two different detection responses in the same frame cannot be assigned to the same target.
- “Soft” modeling. Detection-level exclusion is “softly” modeled by minimizing a cost term to penalize the case of violation.
- “Hard” modeling. “Hard” modeling of detection-level exclusion is implemented by applying explicit constraint.
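To make the "hard" variant concrete, the sketch below assigns detections to targets under an explicit one-to-one constraint, so no two detections in a frame end up on the same target. It uses brute-force enumeration for clarity, which is an assumption of this example and only feasible for a handful of objects; the cost matrix layout (`cost[target][detection]`) and the function name are also hypothetical.

```python
from itertools import permutations


def assign_with_hard_exclusion(cost):
    """Return (assignment, total_cost) with a distinct detection per target.

    cost[t][d] is the cost of giving detection d to target t.
    Requires at least as many detections as targets.
    """
    n_targets = len(cost)
    n_dets = len(cost[0]) if cost else 0
    if n_targets > n_dets:
        raise ValueError("need at least as many detections as targets")

    best_total = float("inf")
    best = ()
    # Each permutation picks a distinct detection for every target,
    # which is exactly the hard exclusion constraint.
    for perm in permutations(range(n_dets), n_targets):
        total = sum(cost[t][d] for t, d in enumerate(perm))
        if total < best_total:
            best_total, best = total, perm
    return list(best), best_total


# Both targets prefer detection 0; exclusion forces one of them to detection 1.
cost = [[1, 2],
        [1, 5]]
assignment, total = assign_with_hard_exclusion(cost)
```

Without the constraint both targets would take detection 0. In practice the same constraint is usually enforced with a polynomial-time assignment solver such as the Hungarian algorithm rather than enumeration.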
3.4.2 Trajectory-level Exclusion Modeling
Generally, trajectory-level exclusion is modeled by penalizing the case that two close detection hypotheses have different trajectory labels. This will suppress one trajectory label.
3.5 Occlusion Handling
Occlusion is perhaps the most critical challenge in MOT. It is a primary cause for ID switches or fragmentation of trajectories.
3.5.1 Part-to-whole
This strategy is built on the assumption that a part of the object is still visible when an occlusion happens.
The tracker detects the occlusion and uses only the unoccluded parts for estimation. Specifically, parts are derived by dividing objects uniformly into grids [54], or by fitting multiple parts to a specific kind of object such as a human, e.g. 15 non-overlapping parts as in [51], and parts detected by the DPM detector [123] in [81, 124].
3.5.2 Hypothesize-and-test
This strategy sidesteps challenges from occlusion by hypothesizing proposals and testing the proposals according to observations at hand.
3.5.3 Buffer-and-recover
This strategy buffers observations when occlusion happens and remembers states of objects before occlusion.
3.5.4 Others
The strategies described above may not cover all the tactics explored in the community.
3.6 Inference
3.6.1 Probabilistic Inference
Approaches based on probabilistic inference typically represent states of objects as a distribution with uncertainty. The goal of a tracking algorithm is to estimate the probabilistic distribution of target state by a variety of probability reasoning methods based on existing observations.
- Kalman filter. In the case of a linear system and Gaussian-distributed object states, the Kalman filter [39] is proven to be the optimal estimator. It has been applied in [37].
- Extended Kalman filter. To include the non-linear case, the extended Kalman filter is one possible solution. It approximates the non-linear system by a Taylor expansion [36].
- Particle filter. Monte Carlo sampling based models have also become popular in tracking, especially after the introduction of the particle filter [132, 133, 134, 54, 105, 34, 35, 10]. This strategy models the underlying distribution by a set of weighted particles, which removes the need for any assumptions about the distribution itself [105, 34, 35, 38].
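As a concrete instance of the linear-Gaussian case in the Kalman filter bullet, here is a minimal sketch of a 1D constant-velocity Kalman filter in plain Python. The class name, the diagonal process noise `q`, the measurement noise `r`, and the initial covariance are all assumptions chosen for illustration; a real tracker would typically use a 2D or bounding-box state and tuned noise values.

```python
class ConstantVelocityKalman1D:
    """State is [position, velocity]; only position is measured."""

    def __init__(self, q=1e-4, r=0.1, initial_var=100.0):
        self.pos, self.vel = 0.0, 0.0
        # 2x2 state covariance, stored as four scalars.
        self.p00, self.p01 = initial_var, 0.0
        self.p10, self.p11 = 0.0, initial_var
        self.q, self.r = q, r

    def predict(self, dt=1.0):
        # x <- F x with F = [[1, dt], [0, 1]]
        self.pos += dt * self.vel
        # P <- F P F^T + q * I
        p00 = self.p00 + dt * (self.p10 + self.p01) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def update(self, z):
        # Innovation and its variance for H = [1, 0].
        y = z - self.pos
        s = self.p00 + self.r
        k0, k1 = self.p00 / s, self.p10 / s  # Kalman gain
        self.pos += k0 * y
        self.vel += k1 * y
        # P <- (I - K H) P
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


# Track an object moving one unit per frame, observed without noise.
kf = ConstantVelocityKalman1D()
for z in range(1, 21):
    kf.predict()
    kf.update(float(z))
```

After a few frames the filter recovers a velocity the measurements never reported directly, which is what lets the predict step narrow the search region in the next frame.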
3.6.2 Deterministic Optimization
As opposed to the probabilistic inference methods, approaches based on deterministic optimization aim to find the maximum a posteriori (MAP) solution to MOT. To that end, the task of inferring the data association, the target states, or both is typically cast as an optimization problem.
3.6.3 Discussion
In practice, deterministic optimization, or energy minimization, is more widely used than probabilistic approaches.
3.7 Summary
- It is important to note that not all existing MOT methods have all the components.
- In general, appearance, motion and inference are mandatory in most methods.
- It is also notable that these components are not orthogonal to each other.
4. MOT Evaluation
For a given MOT approach, metrics and datasets are required to evaluate its performance quantitatively.
4.1 Metrics
4.1.1 Metrics for Detection
- Accuracy. FAF, FPPI, MODA
- Precision. MODP
4.1.2 Metrics for Tracking
- Accuracy. IDs, MOTA
- Precision. MOTP, TDE, OSPA
- Completeness. MT, PT, ML, FM
- Robustness. RS, RL
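Of the tracking metrics listed above, MOTA is the one most often reported. Under its standard CLEAR MOT definition it combines false negatives, false positives and ID switches, normalized by the total number of ground-truth objects over all frames. The sketch below assumes the error counts have already been summed over the sequence; the function and parameter names are hypothetical.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy from sequence-level error counts.

    All arguments are totals over every frame of the sequence.
    The score is at most 1 and can be negative when errors exceed
    the number of ground-truth objects.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# 10 misses, 5 false alarms and 5 identity switches over 100 ground-truth boxes.
score = mota(10, 5, 5, 100)
```

Because all three error types are pooled into a single sum, two trackers with the same MOTA can behave very differently, which is one reason the completeness and robustness metrics are reported alongside it.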
4.2 Datasets
4.3 Public Algorithms
4.4 Benchmark Results
Strictly speaking, in order to make a direct and fair comparison, one needs to fix all the other components while varying the one under consideration.
5. Summary
This paper has described methods and problems related to the task of Multiple Object Tracking (MOT) in videos.
5.1 Existing Issues
- One major issue in MOT research is that the performance of an MOT method depends heavily on the object detector.
- Another nuisance is that an overly complicated MOT solution has many parameters to tune.
5.2 Future Directions
- MOT with video adaptation. A customization of the object detector is necessary to improve MOT performance. One solution, proposed by Shu et al. [192], adapts a generic pedestrian detector to a specific video by progressively refining it.
- MOT under multiple cameras. There are two configurations. In the first, multiple cameras record the same scene, i.e., multiple views. In the second, each camera records a different scene, i.e., a non-overlapping multi-camera network.
- Multiple 3D object tracking. 3D tracking requires camera calibration, or otherwise has to overcome the challenges of estimating camera poses and scene layout. Designing the 3D model is a further issue that does not arise in 2D MOT.
- MOT with scene understanding. The results of scene understanding can provide contextual information and scene structure, which helps tracking when they are properly incorporated into an MOT algorithm.
- MOT with deep learning. Deep learning based models have emerged as an extremely powerful framework to deal with different kinds of vision problems including image classification [198], object detection [186, 187, 188], and more relevantly single object tracking [184].
- MOT with other computer vision tasks. Possible combinations include object segmentation [206, 207, 208, 209], re-identification [210, 194, 211], human pose estimation [18, 212, 213, 214, 215], and action recognition [19].