Paper translation: Deep SORT: Simple Online and Realtime Tracking with a Deep Association Metric

Related blog walkthrough 1: https://blog.csdn.net/cdknight_happy/article/details/79731981 (DeepSort paper study notes, by cdknight_happy)
Related blog walkthrough 2: https://www.cnblogs.com/YiXiaoZhou/p/7074037.html
Related blog walkthrough 3: http://www.cnblogs.com/yanwei-li/p/8643446.html

Related blog that helps with understanding: https://www.cnblogs.com/xiaozhi_5638/p/9376784.html

GitHub: https://github.com/abewley/sort

GitHub: https://github.com/nwojke/deep_sort

GitHub: https://github.com/Qidian213/deep_sort_yolov3

 


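Background you need:

Mahalanobis distance: https://blog.csdn.net/lzhf1122/article/details/72935323

Cosine distance (minimum cosine distance): https://blog.csdn.net/lin00jian/article/details/51209715

Hungarian algorithm, combining the two metrics, Kalman filter, matching cascade strategy

Color key: pink = key algorithms; purple = uncommon terms; green = citations and formulas not yet filled in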

ABSTRACT

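Simple Online and Realtime Tracking (SORT) is a pragmatic approach to multiple object tracking with a focus on simple, effective algorithms.
In this paper, appearance information is integrated to improve the performance of SORT.
Thanks to this extension, we can track objects through longer periods of occlusion, which effectively reduces the number of identity switches.
In the spirit of the original framework, we place much of the computational complexity into an offline pre-training stage, in which we learn a deep association metric on a large-scale person re-identification dataset.
During online application, we establish measurement-to-track associations using nearest-neighbor queries in visual appearance space.
Experimental evaluation shows that our extensions reduce the number of identity switches by 45%, achieving overall competitive performance at high frame rates.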

1. INTRODUCTION

Due to recent progress in object detection, tracking-by-detection has become the leading paradigm in multiple object
tracking.
Within this paradigm, object trajectories are usually found in a global optimization problem that processes entire
video batches at once.
For example, flow network formulations [1, 2, 3] and probabilistic graphical models [4, 5, 6, 7] have become popular frameworks of this type.

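Thanks to recent progress in object detection, tracking-by-detection has become the dominant paradigm in multiple object tracking.
In this paradigm, object trajectories are usually found by solving a global optimization problem that processes an entire video batch at once.
Flow network formulations [1, 2, 3] and probabilistic graphical models [4, 5, 6, 7], for example, have become popular frameworks of this type.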

However, due to batch processing, these methods are not applicable in online scenarios where a target identity must be available at each time step.
More traditional methods are Multiple Hypothesis Tracking (MHT) [8] and the Joint Probabilistic Data Association Filter (JPDAF) [9].
These methods perform data association on a frame-by-frame basis.
In the JPDAF, a single state hypothesis is generated by weighting individual measurements by their association likelihoods.
In MHT, all possible hypotheses are tracked, but pruning schemes must be applied for computational tractability.
Both methods have recently been revisited in a tracking-by-detection scenario [10, 11] and shown promising results.
However, the performance of these methods comes at increased computational and implementation complexity.

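Because they process data in batches, however, these methods (the flow network formulations and probabilistic graphical models) cannot be used in online scenarios, where a target identity must be available at every time step.
The more traditional methods are Multiple Hypothesis Tracking (MHT) [8] and the Joint Probabilistic Data Association Filter (JPDAF) [9].
Both perform data association frame by frame.
JPDAF generates a single state hypothesis by weighting each individual measurement by its association likelihood.
MHT tracks every possible hypothesis, so pruning schemes are needed to keep the computation tractable.
Both methods were recently revisited in a tracking-by-detection setting [10, 11] and showed promising results.
That performance, however, comes at the price of higher computational and implementation complexity (for both JPDAF and MHT).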

 

Simple online and realtime tracking (SORT) [12] is a much simpler framework that performs Kalman filtering in image space and frame-by-frame data association using the Hungarian method with an association metric that measures bounding box overlap.
This simple approach achieves favorable performance at high frame rates.
On the MOT challenge dataset [13], SORT with a state-of-the-art people detector [14] ranks on average higher than MHT on standard detections.
This not only underlines the influence of object detector performance on overall tracking results, but is also an important insight from a practitioner's point of view.

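Simple online and realtime tracking (SORT) [12] is a much simpler framework: it runs a Kalman filter in image space and associates data frame by frame with the Hungarian method, using an association metric that measures bounding box overlap.
This simple approach performs well at high frame rates.
On the MOT challenge dataset [13], SORT combined with a state-of-the-art people detector [14] ranks higher on average than MHT on standard detections.
This underlines how strongly detector performance affects overall tracking results, and it is also an important insight from a practitioner's point of view.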

 

While achieving overall good performance in terms of tracking precision and accuracy, SORT returns a relatively high number of identity switches.
This is because the employed association metric is only accurate when state estimation uncertainty is low.
Therefore, SORT has a deficiency in tracking through occlusions as they typically appear in frontal-view camera scenes.
We overcome this issue by replacing the association metric with a more informed metric that combines motion and appearance information.
In particular, we apply a convolutional neural network (CNN) that has been trained to discriminate pedestrians on a large-scale person re-identification dataset.
Through integration of this network we increase robustness against misses and occlusions while keeping the system easy to implement, efficient, and applicable to online scenarios.
Our code and a pre-trained CNN model are made publicly available to facilitate research experimentation and practical application development.

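Although SORT performs well overall in terms of tracking precision and accuracy, it returns a relatively high number of identity switches.
The reason is that its association metric is only accurate when state estimation uncertainty is low.
As a result, SORT is weak at tracking through occlusions, which typically occur in frontal-view camera scenes.
We overcome this by replacing the association metric with a more informed one that combines motion and appearance information.
Specifically, we apply a convolutional neural network (CNN) trained to discriminate pedestrians on a large-scale person re-identification dataset.
Integrating this network makes the tracker more robust to missed detections and occlusions while keeping the system easy to implement, efficient, and applicable to online scenarios.
Our code and a pre-trained CNN model are publicly available to support research experiments and practical application development.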

2. SORT WITH DEEP ASSOCIATION METRIC

We adopt a conventional single hypothesis tracking methodology with recursive Kalman filtering and frame-by-frame data association.
In the following section we describe the core components of this system in greater detail.

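We adopt a conventional single-hypothesis tracking approach with recursive Kalman filtering and frame-by-frame data association.
The following sections describe the core components of this system in more detail.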

2.1. Track Handling and State Estimation

The track handling and Kalman filtering framework is mostly identical to the original formulation in [12] (Simple online and realtime tracking).
We assume a very general tracking scenario where the camera is uncalibrated and where we have no ego-motion information available.
While these circumstances pose a challenge to the filtering framework, it is the most common setup considered in recent multiple object tracking benchmarks [15].
Therefore, our tracking scenario is defined on the eight dimensional state space (u, v, γ, h, ẋ, ẏ, γ̇, ḣ) that contains the bounding box center position (u, v), aspect ratio γ, height h, and their respective velocities in image coordinates.
We use a standard Kalman filter with constant velocity motion and linear observation model, where we take the bounding coordinates (u, v, γ, h) as direct observations of the object state.

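Track handling and the Kalman filtering framework are mostly identical to the original formulation in [12].
We assume a very general tracking scenario in which the camera is uncalibrated and no ego-motion information is available.
These conditions are challenging for the filtering framework, but they are the most common setup in recent multiple object tracking benchmarks [15].
The tracking scenario is therefore defined on the eight-dimensional state space (u, v, γ, h, ẋ, ẏ, γ̇, ḣ), which contains the bounding box center position (u, v), the aspect ratio γ, the height h, and their respective velocities in image coordinates.
We use a standard Kalman filter with a constant-velocity motion model and a linear observation model, taking the bounding box coordinates (u, v, γ, h) as direct observations of the object state.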
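
To make the constant-velocity model concrete, here is a minimal pure-Python sketch of the prediction and projection steps for the 8-dimensional state. The function names, the time step `dt`, and the noise matrices passed in are my own assumptions for illustration; this is not the code from the deep_sort repository.

```python
# State: (u, v, gamma, h, du, dv, dgamma, dh); measurement: (u, v, gamma, h).
NDIM = 4


def motion_matrix(dt=1.0):
    """8x8 constant-velocity transition: each position gains dt * its velocity."""
    size = 2 * NDIM
    F = [[1.0 if r == c else 0.0 for c in range(size)] for r in range(size)]
    for i in range(NDIM):
        F[i][NDIM + i] = dt
    return F


def observation_matrix():
    """4x8 matrix that reads (u, v, gamma, h) directly off the state."""
    return [[1.0 if r == c else 0.0 for c in range(2 * NDIM)] for r in range(NDIM)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def predict(mean, cov, process_noise, dt=1.0):
    """Kalman prediction: x' = F x, P' = F P F^T + Q."""
    F = motion_matrix(dt)
    new_mean = [sum(f * x for f, x in zip(row, mean)) for row in F]
    FPFt = matmul(matmul(F, cov), transpose(F))
    size = len(mean)
    new_cov = [[FPFt[r][c] + process_noise[r][c] for c in range(size)]
               for r in range(size)]
    return new_mean, new_cov


def project(mean, cov, meas_noise):
    """Project the state into measurement space: y = H x, S = H P H^T + R."""
    H = observation_matrix()
    y = [sum(h * x for h, x in zip(row, mean)) for row in H]
    HPHt = matmul(matmul(H, cov), transpose(H))
    S = [[HPHt[r][c] + meas_noise[r][c] for c in range(NDIM)] for r in range(NDIM)]
    return y, S
```

Calling `predict` repeatedly without an update keeps adding to the covariance, which is exactly the growing uncertainty that the matching cascade later has to deal with.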

 

For each track k we count the number of frames since the last successful measurement association a_k.
This counter is incremented during Kalman filter prediction and reset to 0 when the track has been associated with a measurement.
Tracks that exceed a predefined maximum age A_max are considered to have left the scene and are deleted from the track set.
New track hypotheses are initiated for each detection that cannot be associated to an existing track.
These new tracks are classified as tentative during their first three frames.
During this time, we expect a successful measurement association at each time step.
Tracks that are not successfully associated to a measurement within their first three frames are deleted.

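For each track k we count the number of frames a_k since its last successful measurement association.
The counter is incremented during Kalman filter prediction and reset to 0 as soon as the track is associated with a measurement.
Tracks whose counter exceeds a predefined maximum age A_max are considered to have left the scene and are deleted from the track set.

A new target is detected as follows: if a detection cannot be associated with any existing track, a new track hypothesis is started for it.
These new tracks are classified as tentative during their first three frames.
During that period we expect a successful measurement association at every time step.
Tracks that are not successfully associated with a measurement within their first three frames are deleted.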

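(My notes

Track creation and removal
For every tracked target, record a_k, the number of frames since a detection last matched the track; as soon as a detection is correctly associated with the track, reset a_k to 0.
If a_k exceeds the maximum threshold A_max, tracking of that target is considered finished.
A new target is detected as follows: if a detection cannot be associated with any existing tracker, a new target may have appeared.
If the position predictions of this potential new tracker are correctly associated with detections in 3 consecutive frames, a new moving target is confirmed;
otherwise it is treated as a false alarm and the track is deleted.

End of my notes)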
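
The track lifecycle described above can be sketched as a small state machine. The class and attribute names below are hypothetical, and the default `max_age=30` is an assumption taken from the 30-frame figure mentioned in the experiment notes at the end of this post.

```python
from enum import Enum


class TrackState(Enum):
    TENTATIVE = 1
    CONFIRMED = 2
    DELETED = 3


class Track:
    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.n_init = n_init          # frames a new track must be matched in a row
        self.max_age = max_age        # A_max
        self.hits = 1                 # the detection that created the track
        self.time_since_update = 0    # a_k
        self.state = TrackState.TENTATIVE

    def predict(self):
        """Called once per frame together with the Kalman prediction."""
        self.time_since_update += 1

    def update(self):
        """Called when the track is associated with a measurement."""
        self.hits += 1
        self.time_since_update = 0
        if self.state is TrackState.TENTATIVE and self.hits >= self.n_init:
            self.state = TrackState.CONFIRMED

    def mark_missed(self):
        """Called when no measurement was associated in this frame."""
        if self.state is TrackState.TENTATIVE:
            self.state = TrackState.DELETED      # treated as a false alarm
        elif self.time_since_update > self.max_age:
            self.state = TrackState.DELETED      # target has left the scene
```

A tentative track is dropped on its first miss, while a confirmed track survives until a_k exceeds A_max.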
 

2.2. Assignment Problem

A conventional way to solve the association between the predicted Kalman states and newly arrived measurements is to build an assignment problem that can be solved using the Hungarian algorithm.
Into this problem formulation we integrate motion and appearance information through combination of two appropriate metrics.
To incorporate motion information we use the (squared)  Mahalanobis distance between predicted Kalman states and newly arrived measurements:

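A conventional way to associate the predicted Kalman states with newly arrived measurements is to build an assignment problem that can be solved with the Hungarian algorithm.
Into this formulation we integrate motion and appearance information by combining two suitable metrics.
To incorporate motion information we use the (squared) Mahalanobis distance between predicted Kalman states and newly arrived measurements: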

d^{(1)}(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i)          (1)

where we denote the projection of the i-th track distribution into measurement space by (y_i, S_i) and the j-th bounding box detection by d_j.
The Mahalanobis distance takes state estimation uncertainty into account by measuring how many standard deviations the detection is away from the mean track location.
Further, using this metric it is possible to exclude unlikely associations by thresholding the Mahalanobis distance d^{(1)}(i, j) at a 95% confidence interval computed from the inverse χ² distribution.
We denote this decision with an indicator

b^{(1)}_{i,j} = \mathbb{1}[d^{(1)}(i, j) \le t^{(1)}]          (2)

that evaluates to 1 if the association between the i-th track and j-th detection is admissible.
For our four dimensional measurement space the corresponding Mahalanobis threshold is t^{(1)} = 9.4877.

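Here (y_i, S_i) denotes the projection of the i-th track distribution into measurement space, and d_j denotes the j-th bounding box detection.
The Mahalanobis distance accounts for state estimation uncertainty by measuring how many standard deviations the detection lies from the mean track location.
With this metric, unlikely associations can also be excluded by thresholding the Mahalanobis distance d^{(1)}(i, j) at the 95% confidence interval computed from the inverse χ² distribution:

b^{(1)}_{i,j} = \mathbb{1}[d^{(1)}(i, j) \le t^{(1)}]          (2)

This indicator evaluates to 1 if the association between the i-th track and the j-th detection is admissible.
For our four-dimensional measurement space the corresponding Mahalanobis threshold is t^{(1)} = 9.4877.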
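
As a rough illustration of the gating step, the sketch below computes the squared Mahalanobis distance through a Cholesky factorization and compares it with the 9.4877 threshold (the 0.95 quantile of the χ² distribution with 4 degrees of freedom). The helper names are my own, and the hand-written Cholesky routine is only there to keep the example free of third-party imports; a real tracker would normally use a linear algebra library.

```python
import math

CHI2_95_4DOF = 9.4877  # threshold t^(1) quoted above for a 4-D measurement space


def cholesky(S):
    """Lower-triangular L with L L^T = S, for symmetric positive-definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def squared_mahalanobis(d, y, S):
    """(d - y)^T S^{-1} (d - y), via forward substitution on L z = d - y."""
    L = cholesky(S)
    diff = [a - b for a, b in zip(d, y)]
    z = []
    for i, row in enumerate(L):
        z.append((diff[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return sum(v * v for v in z)


def admissible(d, y, S, threshold=CHI2_95_4DOF):
    """Indicator b^(1): 1 if the detection falls inside the gate, else 0."""
    return 1 if squared_mahalanobis(d, y, S) <= threshold else 0
```

Solving the triangular system instead of inverting S_i is a common numerical choice; the result is the same quantity as in equation (1).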

 

While the Mahalanobis distance is a suitable association metric when motion uncertainty is low, in our image-space problem formulation the predicted state distribution obtained from the Kalman filtering framework provides only a rough estimate of the object location.
In particular, unaccounted camera motion can introduce rapid displacements in the image plane, making the Mahalanobis distance a rather uninformed metric for tracking through occlusions. Therefore, we integrate a second metric into the assignment problem.

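The Mahalanobis distance is a suitable association metric when motion uncertainty is low, but in our image-space formulation the predicted state distribution obtained from the Kalman filter gives only a rough estimate of the object location.

In particular, unaccounted camera motion can introduce rapid displacements in the image plane, which makes the Mahalanobis distance a rather uninformed metric for tracking through occlusions. We therefore integrate a second metric into the assignment problem.

The following blog explains the assignment problem very clearly; the notes below are taken from it as a supplement: https://blog.csdn.net/cdknight_happy/article/details/79731981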

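(From another blog

The assignment problem

Metric 1: Mahalanobis distance

The conventional way to associate detections with track predictions is the Hungarian method. The authors consider both motion information and target appearance information.

  • Motion association: the Mahalanobis distance between the Kalman-predicted motion state of an existing target and a detection is used to associate motion information:
    d^{(1)}(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i)
    Here d_j is the position of the j-th detection box, y_i is the target position predicted by the i-th tracker, and S_i is the covariance matrix of the i-th track projected into measurement space. If the Mahalanobis distance of an association is below the threshold t^{(1)}, the motion association is considered successful. The indicator used is b^{(1)}_{i,j} = \mathbb{1}[d^{(1)}(i, j) \le t^{(1)}], and the authors set t^{(1)} = 9.4877.

Metric 2: appearance features

  • When motion uncertainty is low, the Mahalanobis matching above is a suitable association metric, but estimating the motion state with a Kalman filter in image space gives only a rough prediction. Camera motion in particular can make the Mahalanobis association fail and cause ID switches. The authors therefore introduce a second metric: for each detection d_j, compute a feature vector r_j subject to ||r_j|| = 1.
  • The authors build a gallery for each tracked target that stores the feature vectors of its most recent 100 successfully associated frames. The second metric is the minimum cosine distance between the features in the i-th tracker's gallery R_i and the feature vector of the j-th detection in the current frame:
    d^{(2)}(i, j) = \min\{ 1 - r_j^T r_k^{(i)} \mid r_k^{(i)} \in R_i \}
    If this distance is below a given threshold, the association is successful. The threshold is obtained from a separate training set.
  • The final metric is a linear weighting of the two, c_{i,j} = \lambda d^{(1)}(i, j) + (1 - \lambda) d^{(2)}(i, j). An association enters the combined cost c_{i,j} only if both gating thresholds are satisfied; that is, it is considered admissible only when it lies in the intersection of the two gating regions.
    The motion metric works well for short-term prediction and matching, whereas the appearance metric is more effective after long occlusions.
  • When the camera moves, λ can be set to 0. The Mahalanobis threshold still applies, however: an association that fails the first gate does not enter the combined cost c_{i,j}.

End of quoted blog)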
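
The appearance metric and the combined, gated cost can be sketched as follows. The function names are hypothetical, and the appearance threshold `t2=0.2` is only a placeholder: the text says the real value is learned from a separate training set and does not state it here.

```python
from collections import deque

GALLERY_SIZE = 100  # features kept per track, as described in the notes above


def min_cosine_distance(gallery, r_j):
    """Smallest 1 - r_j . r_k over the gallery; all features are unit length."""
    return min(1.0 - sum(a * b for a, b in zip(r_k, r_j)) for r_k in gallery)


def combined_cost(d1, d2, lam, t1=9.4877, t2=0.2):
    """Return (c_ij, b_ij): the weighted cost and the joint admissibility flag.

    d1 is the squared Mahalanobis distance, d2 the minimum cosine distance.
    t2 is an assumed placeholder value, not one taken from the paper.
    """
    cost = lam * d1 + (1.0 - lam) * d2
    is_admissible = int(d1 <= t1 and d2 <= t2)
    return cost, is_admissible


# A deque with maxlen drops the oldest feature once 100 are stored.
gallery = deque(maxlen=GALLERY_SIZE)
gallery.append([1.0, 0.0])
gallery.append([0.0, 1.0])

d2 = min_cosine_distance(gallery, [0.6, 0.8])
cost, ok = combined_cost(d1=4.0, d2=d2, lam=0.0)
```

With `lam=0.0` the cost is purely the appearance distance, yet a pair whose Mahalanobis distance exceeds `t1` is still rejected by the gate, which matches the last bullet above.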

 

2.3. Matching Cascade

Instead of solving for measurement-to-track associations in a global assignment problem, we introduce a cascade that solves a series of subproblems.
To motivate this approach, consider the following situation:
When an object is occluded for a longer period of time, subsequent Kalman filter predictions increase the uncertainty associated with the object location.

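Rather than solving measurement-to-track association as one global assignment problem, we introduce a cascade that solves a series of subproblems.
To motivate this approach, consider the following situation:
When an object is occluded for a longer period, the successive Kalman filter predictions increase the uncertainty about the object location.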

Consequently, probability mass spreads out in state space and the observation likelihood becomes less peaked.
Intuitively, the association metric should account for this spread of probability mass by increasing the measurement-to-track distance.
Counterintuitively, when two tracks compete for the same detection, the Mahalanobis distance favors larger uncertainty, because it effectively reduces the distance in standard deviations of any detection towards the projected track mean.

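As a consequence, probability mass spreads out in state space and the observation likelihood becomes less peaked.
Intuitively, the association metric should account for this spread by increasing the measurement-to-track distance.
Counterintuitively, when two tracks compete for the same detection, the Mahalanobis distance favors the one with larger uncertainty, because larger uncertainty effectively shrinks the distance, measured in standard deviations, between any detection and the projected track mean.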
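
A one-dimensional toy calculation shows the effect. The numbers are invented purely for illustration: one track was seen recently and has a small variance, the other has been occluded for a while and has a large variance.

```python
def squared_mahalanobis_1d(x, mean, var):
    """Squared distance from x to mean, measured in units of variance."""
    return (x - mean) ** 2 / var


detection = 10.0

# Recently seen track: predicted mean is close to the detection, variance is small.
recent = squared_mahalanobis_1d(detection, mean=6.0, var=4.0)

# Long-occluded track: predicted mean is far away, but variance has grown.
occluded = squared_mahalanobis_1d(detection, mean=0.0, var=100.0)

print(recent, occluded)  # 4.0 1.0
```

The occluded track is 10 units away in Euclidean terms against 4 for the recent track, yet it gets the smaller Mahalanobis distance, so a single global assignment would hand it the detection.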

This is an undesired behavior as it can lead to increased track fragmentations and unstable tracks.
Therefore, we introduce a matching cascade that gives priority to more frequently seen objects to encode our notion of probability spread in the association likelihood.

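This behavior is undesirable because it can lead to more track fragmentation and unstable tracks.
We therefore introduce a matching cascade that gives priority to more frequently seen objects, which encodes our notion of probability spread in the association likelihood.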

Listing 1 outlines our matching algorithm.
As input we provide the set of track T and detection D indices as well as the maximum age A_max.
In lines 1 and 2 we compute the association cost matrix and the matrix of admissible associations.
We then iterate over track age n to solve a linear assignment problem for tracks of increasing age.
In line 6 we select the subset of tracks T_n that have not been associated with a detection in the last n frames.
In line 7 we solve the linear assignment between tracks in T_n and unmatched detections U.
In lines 8 and 9 we update the set of matches and unmatched detections, which we return after completion in line 11.
Note that this matching cascade gives priority to tracks of smaller age, i.e., tracks that have been seen more recently.

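Listing 1 outlines our matching algorithm.
The inputs are the track indices T, the detection indices D, and the maximum age A_max.

1: Compute the association cost matrix C = [c_{i,j}]
2: Compute the matrix of admissible associations B = [b_{i,j}]
We then iterate over track age n, solving a linear assignment problem for tracks of increasing age.
3: Initialize the set of matches M to the empty set
4: Initialize the set of unmatched detections U to D
5: for n from 1 to A_max:
6: select the subset of tracks T_n that have not been associated with a detection in the last n frames
7: solve the linear assignment between the tracks in T_n and the unmatched detections U, using C
8: update the set of matches
9: update the set of unmatched detections; the result is returned in line 11.
Note: this matching cascade gives priority to tracks of smaller age, i.e. tracks that have been seen more recently.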
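
The steps of the listing can be sketched in Python as below. To keep the example self-contained, the linear assignment is solved by brute force over permutations, which is my own stand-in for the Hungarian algorithm and only practical for a handful of tracks and detections. The function names and the `INF` sentinel are also assumptions.

```python
from itertools import permutations

INF = 1e5  # cost assigned to inadmissible pairs so they are never kept


def min_cost_assignment(cost):
    """Optimal assignment by exhaustive search; returns (row, col) pairs."""
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0
    if n_rows == 0 or n_cols == 0:
        return []
    if n_rows <= n_cols:
        best = min(permutations(range(n_cols), n_rows),
                   key=lambda cols: sum(cost[r][c] for r, c in enumerate(cols)))
        return list(enumerate(best))
    best = min(permutations(range(n_rows), n_cols),
               key=lambda rows: sum(cost[r][c] for c, r in enumerate(rows)))
    return [(r, c) for c, r in enumerate(best)]


def matching_cascade(track_ages, detections, C, B, max_age):
    """track_ages[i]: frames since track i was last matched (1 = seen last frame).
    C[i][j]: association cost; B[i][j]: 1 if the pair is admissible, else 0."""
    matches = []
    unmatched = list(detections)
    for n in range(1, max_age + 1):
        if not unmatched:
            break
        tracks_n = [i for i, age in enumerate(track_ages) if age == n]
        if not tracks_n:
            continue
        sub = [[C[i][j] if B[i][j] else INF for j in unmatched] for i in tracks_n]
        for r, c in min_cost_assignment(sub):
            if sub[r][c] < INF:
                matches.append((tracks_n[r], unmatched[c]))
        matched = {j for _, j in matches}
        unmatched = [j for j in unmatched if j not in matched]
    return matches, unmatched


# Track 0 has been occluded for 3 frames and has the lower cost,
# but track 1 (age 1) is matched first and takes the only detection.
matches, unmatched = matching_cascade(
    track_ages=[3, 1], detections=[0],
    C=[[0.1], [0.5]], B=[[1], [1]], max_age=30)
print(matches, unmatched)  # [(1, 0)] []
```

The usage example shows the point of the cascade: a single global assignment would give the detection to track 0, whereas the cascade gives it to the more recently seen track.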

In a final matching stage, we run intersection over union association as proposed in the original SORT algorithm [12] on the set of unconfirmed and unmatched tracks of age n = 1.
This helps to account for sudden appearance changes, e.g., due to partial occlusion with static scene geometry, and to increase robustness against erroneous initialization.

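In a final matching stage, we run intersection over union (IoU) association, as proposed in the original SORT algorithm [12], on the set of unconfirmed tracks and unmatched tracks of age n = 1.
This helps to account for sudden appearance changes, for example due to partial occlusion by static scene geometry, and increases robustness against erroneous initialization.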
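
For reference, the IoU measure used in this final stage can be computed as below. The box layout `(x_min, y_min, width, height)` and the function names are assumptions made for this sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    inter = inter_w * inter_h
    return inter / (aw * ah + bw * bh - inter)


def iou_cost(box_a, box_b):
    """Cost for the assignment problem: smaller means more overlap."""
    return 1.0 - iou(box_a, box_b)
```

Two 2x2 boxes offset by one pixel in each direction share an area of 1 out of a union of 7, so their IoU is 1/7.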

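(From another blog

After a target has been occluded for a long time, the uncertainty of the Kalman prediction grows considerably and the observation likelihood in state space becomes much less peaked.
If two trackers now compete for the same detection, the track that has been occluded longer often has the smaller Mahalanobis distance, so the detection is more likely to be associated with it; this undesirable effect tends to break the continuity of tracking.
Think of it this way: suppose the predicted state is normally distributed. Repeated prediction without an update makes the variance of that distribution grow, so a point that is far from the mean in Euclidean distance can get the same Mahalanobis distance as a point that was close under the earlier distribution.
The authors therefore use a matching cascade that gives priority to targets that are seen more frequently. The algorithm is shown in the figure below:
[Figure 2: Listing 1, the matching cascade algorithm]

The core idea of the matching cascade is to match tracks grouped by how long they have been missing, from shortest to longest. This gives the highest priority to the most recently seen targets and resolves the problem described above.
In the final matching stage, IoU-based matching is also run on unconfirmed tracks and unmatched tracks of age 1.
This mitigates large changes caused by sudden appearance changes or partial occlusion.
There is a downside too: some newly created tracks may end up linked to old tracks, although this is rare.

End of quoted blog)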

 

2.4. Deep Appearance Descriptor

Network architecture:

[Figure 3: Table 1, overview of the CNN architecture]

By using simple nearest neighbor queries without additional metric learning, successful application of our method requires a well-discriminating feature embedding to be trained offline, before the actual online tracking application.
To this end, we employ a CNN that has been trained on a large-scale person re-identification dataset [21] that contains over 1,100,000 images of 1,261 pedestrians, making it well suited for deep metric learning in a people tracking context.

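Because we use simple nearest-neighbor queries without additional metric learning, the method only works well if a well-discriminating feature embedding has been trained offline, before the actual online tracking application.
To this end, we employ a CNN trained on a large-scale person re-identification dataset [21] that contains over 1,100,000 images of 1,261 pedestrians, which makes it well suited to deep metric learning in a people tracking context.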

The CNN architecture of our network is shown in Table 1.
In summary, we employ a wide residual network [22] with two convolutional layers followed by six residual blocks.
The global feature map of dimensionality 128 is computed in dense layer 10.
A final batch and l2 normalization projects features onto the unit hypersphere to be compatible with our cosine appearance metric.
In total, the network has 2,800,864 parameters and one forward pass of 32 bounding boxes takes approximately 30 ms on an Nvidia GeForce GTX 1050 mobile GPU.
Thus, this network is well suited for online tracking, provided that a modern GPU is available.
While the details of our training procedure are out of the scope of this paper, we provide a pre-trained model in our GitHub repository along with a script that can be used to generate features.

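The CNN architecture of the network is shown in Table 1.
In short, we use a wide residual network [22] with two convolutional layers followed by six residual blocks.
The global feature map of dimensionality 128 is computed in dense layer 10.
A final batch and l2 normalization projects the features onto the unit hypersphere so that they are compatible with our cosine appearance metric.
In total the network has 2,800,864 parameters, and one forward pass of 32 bounding boxes takes approximately 30 ms on an Nvidia GeForce GTX 1050 mobile GPU.
The network is therefore well suited to online tracking, provided that a modern GPU is available.
The details of the training procedure are beyond the scope of the paper, but a pre-trained model is provided in the GitHub repository together with a script that can be used to generate features.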
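
The reason for the final l2 normalization is that, once every feature has unit length, the cosine similarity of two features is simply their dot product. A minimal sketch, with hypothetical function names and a toy 2-D vector standing in for the 128-D feature:

```python
import math


def l2_normalize(feature):
    """Scale a feature vector to unit length (project it onto the unit hypersphere)."""
    norm = math.sqrt(sum(v * v for v in feature))
    return [v / norm for v in feature]


def cosine_distance_unit(a, b):
    """Cosine distance for vectors that are already unit length: 1 - a . b."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


r = l2_normalize([3.0, 4.0])          # roughly [0.6, 0.8]
same = cosine_distance_unit(r, r)     # roughly 0.0 for identical directions
```

Normalizing once at extraction time means the online nearest-neighbor query needs no further division by vector norms.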

 

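The model is trained offline on a person re-identification dataset. It outputs 128-dimensional normalized features. On a GTX 1050 mobile GPU, extracting features for 32 bounding boxes takes about 30 ms. The pre-trained model and code are at https://github.com/nwojke/deep_sort

Experiments

The authors use the high-performance Faster R-CNN model trained in "POI: Multiple Object Tracking with High Performance Detection and Appearance Feature" for detection. The detection confidence threshold is set to 0.3.
Compared with SORT:
- ID switches are reduced by 45%;
- deep appearance information is incorporated, which greatly improves tracking of occluded targets;
- FP rises considerably; the paper attributes this mainly to detection errors on static scene geometry and to the large maximum track age (SORT only matches between adjacent frames, whereas Deep SORT allows a track to be missing for up to 30 frames while the Kalman constant-velocity model is unchanged, which is the main cause of the increase in FP);
- it runs at 20 Hz, so it is still practical;
- it achieves state-of-the-art online tracking results.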
 

 
