Translation and Notes: "Simple Online and Realtime Tracking with a Deep Association Metric"

Mostly machine translation; proofreading in progress…

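Background

  1. Person Re-Identification Dataset: a dataset built with two cameras, each capturing one image of every person.
  2. My understanding: the problem this paper addresses is that in SORT, when a tracked target is occluded for a while and then reappears, SORT simply treats it as a new target. The paper improves on this by adding an appearance descriptor produced by a trained neural network. An open question: does the learned descriptor work better than SIFT?
  3. Chi-square distribution: when a probability distribution is used to model a situation, long-run outcomes are fairly stable and predictable. But what if observations differ from expectations? Is the deviation a normal small fluctuation, or is the model wrong? The chi-square statistic is used to analyze the results and rule out suspicious ones. Formula: $\chi^2=\sum\frac{\left(O-E\right)^2}{E}$, where $O$ is the observed count and $E$ the expected count.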

Abstract

Simple Online and Realtime Tracking (SORT) is a pragmatic approach to multiple object tracking with a focus on simple, effective algorithms. In this paper, we integrate appearance information to improve the performance of SORT. Due to this extension we are able to track objects through longer periods of occlusions, effectively reducing the number of identity switches. In spirit of the original framework we place much of the computational complexity into an offline pre-training stage where we learn a deep association metric on a large-scale person re-identification dataset. During online application, we establish measurement-to-track associations using nearest neighbor queries in visual appearance space. Experimental evaluation shows that our extensions reduce the number of identity switches by 45%, achieving overall competitive performance at high frame rates.

Translation: Simple Online and Realtime Tracking (SORT) is a practical multiple object tracking method that emphasizes simple, effective algorithms. This paper integrates appearance information to improve SORT's performance. With this extension, targets can be tracked through long occlusions and picked up again when they reappear, which effectively reduces the number of identity switches. In keeping with the original framework, most of the computational complexity is moved into an offline pre-training stage, in which a deep association metric is learned on a large-scale person re-identification dataset. During online application, measurement-to-track associations are established with nearest neighbor queries in visual appearance space. Experiments show that the extension reduces identity switches by 45% while achieving competitive overall performance at high frame rates.

Introduction

Due to recent progress in object detection, tracking-by-detection has become the leading paradigm in multiple object tracking. Within this paradigm, object trajectories are usually found in a global optimization problem that processes entire video batches at once. For example, flow network formulations [1, 2, 3] and probabilistic graphical models [4, 5, 6, 7] have become popular frameworks of this type. However, due to batch processing, these methods are not applicable in online scenarios where a target identity must be available at each time step. More traditional methods are Multiple Hypothesis Tracking (MHT) [8] and the Joint Probabilistic Data Association Filter (JPDAF) [9]. These methods perform data association on a frame-by-frame basis. In the JPDAF, a single state hypothesis is generated by weighting individual measurements by their association likelihoods. In MHT, all possible hypotheses are tracked, but pruning schemes must be applied for computational tractability. Both methods have recently been revisited in a tracking-by-detection scenario [10, 11] and shown promising results. However, the performance of these methods comes at increased computational and implementation complexity.


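Translation: Thanks to recent progress in object detection, tracking-by-detection has become the leading paradigm in multiple object tracking. In this framework, tracking depends heavily on the accuracy of the detections. Depending on the algorithm and how it was trained, detection quality can be unsatisfactory: false positives ($FP>0$), missed detections (recall $R<1$), and bounding boxes with a poor intersection over union (IoU). To address this, some researchers use batch mode, where a batch is a short video segment or an entire video, so that tracking becomes a global optimization problem within the batch. Seen from an online setting, this amounts to assuming that frames yet to be captured are already available. The main algorithms of this kind are flow network formulations [1, 2, 3] and probabilistic graphical models [4, 5, 6, 7].

Translation: When real-time tracking is required, the more traditional methods are Multiple Hypothesis Tracking (MHT) [8] and the Joint Probabilistic Data Association Filter (JPDAF) [9]. Both perform data association frame by frame. JPDAF generates a single state hypothesis by weighting individual measurements by their association likelihoods. MHT tracks all possible hypotheses, so pruning schemes are needed to keep the computation tractable. Both methods were recently revisited in a tracking-by-detection setting [10, 11] with promising results, but their performance comes at the cost of higher computational and implementation complexity.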


SORT[12] is a much simpler framework that performs Kalman filtering in image space and frame-by-frame data association using the Hungarian method with an association metric that measures bounding box overlap. This simple approach achieves favorable performance at high frame rates. On the MOT challenge dataset [13], SORT with a state-of-the-art people detector [14] ranks on average higher than MHT on standard detections. This not only underlines the influence of object detector performance on overall tracking results, but is also an important insight from a practitioners point of view.

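Translation: SORT [12] is a much simpler framework. It performs Kalman filtering in image space and frame-by-frame data association with the Hungarian method, using bounding box overlap as the association metric. This simple approach achieves good performance at high frame rates. On the MOT challenge dataset [13], SORT combined with a state-of-the-art people detector [14] ranks on average higher than MHT running on the standard detections. This underlines how strongly detector performance influences overall tracking results, and it is also an important insight from a practitioner's point of view.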

While achieving overall good performance in terms of tracking precision and accuracy, SORT returns a relatively high number of identity switches. This is, because the employed association metric is only accurate when state estimation uncertainty is low. Therefore, SORT has a deficiency in tracking through occlusions as they typically appear in frontal-view camera scenes. We overcome this issue by replacing the association metric with a more informed metric that combines motion and appearance information. In particular, we apply a convolutional neural network (CNN) that has been trained to discriminate pedestrians on a large-scale person re-identification dataset. Through integration of this network we increase robustness against misses and occlusions while keeping the system easy to implement, efficient, and applicable to online scenarios. Our code and a pre-trained CNN model are made publicly available to facilitate research experimentation and practical application development.

Translation: Although SORT achieves good overall tracking precision and accuracy, it returns a relatively high number of identity switches. This is because the association metric it uses is only accurate when state estimation uncertainty is low. SORT therefore struggles to track through occlusions, which are common in frontal-view camera scenes. We overcome this by replacing the association metric with a more informed one that combines motion and appearance information. Specifically, we apply a convolutional neural network (CNN) trained to discriminate pedestrians on a large-scale person re-identification dataset. Integrating this network increases robustness against misses and occlusions while keeping the system easy to implement, efficient, and applicable to online scenarios. The code and a pre-trained CNN model are publicly available to support research experiments and practical application development.

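SORT with Deep Association Metric

We adopt a conventional single hypothesis tracking methodology with recursive Kalman filtering and frame-by-frame data association. In the following section we describe the core components of this system in greater detail.

Translation: We adopt a conventional single hypothesis tracking approach with recursive Kalman filtering and frame-by-frame data association. The following section describes the core components of the system in greater detail.

Track Handling and State Estimation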

The track handling and Kalman filtering framework is mostly identical to the original formulation in [12]. We assume a very general tracking scenario where the camera is uncalibrated and where we have no ego-motion information available. While these circumstances pose a challenge to the filtering framework, it is the most common setup considered in recent multiple object tracking benchmarks [15]. Therefore, our tracking scenario is defined on the eight dimensional state space $(u,v,\gamma,h,\dot{x},\dot{y},\dot{\gamma},\dot{h})$ that contains the bounding box center position $(u,v)$, aspect ratio $\gamma$, height $h$, and their respective velocities in image coordinates. We use a standard Kalman filter with constant velocity motion and linear observation model, where we take the bounding coordinates $(u,v,\gamma,h)$ as direct observations of the object state.

Translation: Track handling and the Kalman filtering framework are mostly identical to the original formulation in [12]. We assume a very general tracking scenario in which the camera is uncalibrated and no information about the camera's own motion (ego-motion) is available. These conditions are challenging for the filter, but they are the most common setup in recent multiple object tracking benchmarks [15]. The tracking scenario is therefore defined on the eight-dimensional state space $(u,v,\gamma,h,\dot{x},\dot{y},\dot{\gamma},\dot{h})$, which contains the bounding box center position $(u,v)$, the aspect ratio $\gamma$, the height $h$, and their respective velocities in image coordinates. We use a standard Kalman filter with a constant velocity motion model and a linear observation model, and take the bounding coordinates $(u,v,\gamma,h)$ as direct observations of the object state.


Note: YOLOv3 outputs the box center coordinates together with the width and height, which is also four-dimensional: $\left(b_x, b_y, b_w, b_h\right)$.
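To make the state layout concrete, here is a minimal sketch that turns a center/width/height box into the measurement $(u,v,\gamma,h)$ and applies the constant velocity prediction to the state mean. The function names are hypothetical, the aspect ratio is assumed to be width divided by height, and covariance propagation is left out, so this is only the mean part of a Kalman predict step.

```python
from typing import List, Tuple


def box_to_measurement(bx: float, by: float, bw: float, bh: float) -> Tuple[float, float, float, float]:
    """Convert (center x, center y, width, height) to (u, v, gamma, h).

    Assumption of this sketch: gamma = width / height.
    """
    if bh <= 0:
        raise ValueError("box height must be positive")
    return (bx, by, bw / bh, bh)


def predict_mean(state: List[float], dt: float = 1.0) -> List[float]:
    """Constant velocity prediction for the 8-dimensional state mean.

    state = [u, v, gamma, h, du, dv, dgamma, dh]; velocities stay unchanged.
    """
    position, velocity = state[:4], state[4:]
    new_position = [p + dt * v for p, v in zip(position, velocity)]
    return new_position + list(velocity)


measurement = box_to_measurement(100.0, 50.0, 40.0, 80.0)
state = list(measurement) + [2.0, -1.0, 0.0, 0.5]
predicted = predict_mean(state)
```

A full filter would also propagate the state covariance and add process noise in the same step.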


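For each track $k$ we count the number of frames since the last successful measurement association $a_k$. This counter is incremented during Kalman filter prediction and reset to 0 when the track has been associated with a measurement. Tracks that exceed a predefined maximum age $A_{\textnormal{max}}$ are considered to have left the scene and are deleted from the track set. New track hypotheses are initiated for each detection that cannot be associated to an existing track. These new tracks are classified as tentative during their first three frames. During this time, we expect a successful measurement association at each time step. Tracks that are not successfully associated to a measurement within their first three frames are deleted.

Translation: For each track $k$ we count the number of frames $a_k$ since its last successful measurement association. The counter is incremented during Kalman filter prediction and reset to 0 when the track is associated with a detection. Tracks whose counter exceeds the predefined maximum age ($a_k \gt A_{\textnormal{max}}$) are considered to have left the scene and are deleted from the track set. A new track hypothesis is started for every detection that cannot be associated with an existing track. New tracks are classified as tentative during their first three frames, and during this time a successful association is expected in every frame. Tracks that are not successfully associated with a detection within their first three frames are deleted.


Note: what does this mean? Roughly, it guards against false detections. If one frame produces a detection and the following three frames contain nothing that matches it, the detector most likely misfired.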
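The track life cycle described above can be sketched as a small state machine. The class and method names, the default `max_age = 30`, and the way hits are counted are assumptions made for illustration; the text only fixes the three-frame tentative period and the existence of a maximum age.

```python
from dataclasses import dataclass

TENTATIVE, CONFIRMED, DELETED = "tentative", "confirmed", "deleted"


@dataclass
class Track:
    max_age: int = 30          # A_max; hypothetical value
    n_init: int = 3            # frames a new track stays tentative
    age_since_update: int = 0  # a_k: frames since the last association
    hits: int = 1              # the creating detection counts as the first hit
    state: str = TENTATIVE

    def predict(self) -> None:
        """Called once per frame before association; increments a_k."""
        self.age_since_update += 1

    def mark_matched(self) -> None:
        """Called when the track was associated with a detection."""
        self.age_since_update = 0
        self.hits += 1
        if self.state == TENTATIVE and self.hits >= self.n_init:
            self.state = CONFIRMED

    def mark_missed(self) -> None:
        """Called when no detection was associated in this frame."""
        if self.state == TENTATIVE:
            self.state = DELETED
        elif self.age_since_update > self.max_age:
            self.state = DELETED


# A detection that is never seen again is dropped immediately.
ghost = Track()
ghost.predict()
ghost.mark_missed()

# A track matched in its first three frames becomes confirmed.
steady = Track()
for _ in range(2):
    steady.predict()
    steady.mark_matched()
```

A confirmed track survives misses until its counter exceeds `max_age`, which is what lets it be recovered after an occlusion.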


Assignment Problem

A conventional way to solve the association between the predicted Kalman states and newly arrived measurements is to build an assignment problem that can be solved using the Hungarian algorithm. Into this problem formulation we integrate motion and appearance information through combination of two appropriate metrics.

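Translation: The conventional way to associate predicted Kalman states with newly arrived measurements is to build an assignment problem that can be solved with the Hungarian algorithm. Into this formulation we integrate motion and appearance information by combining two appropriate metrics.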

To incorporate motion information we use the (squared) Mahalanobis distance between predicted Kalman states and newly arrived measurements:

Translation: To incorporate motion information we use the (squared) Mahalanobis distance between predicted Kalman states and newly arrived measurements:

$$d^{(1)}\left(i,j\right) = \left(\bm{d}_j-\bm{y}_i\right)^{T}\bm{S}_i^{-1}\left(\bm{d}_j-\bm{y}_i\right) \tag{1}$$

where we denote the projection of the $i$-th track distribution into measurement space by $\left(\bm{y}_i, \bm{S}_i\right)$ and the $j$-th bounding box detection by $\bm{d}_j$. The Mahalanobis distance takes state estimation uncertainty into account by measuring how many standard deviations the detection is away from the mean track location. Further, using this metric it is possible to exclude unlikely associations by thresholding the Mahalanobis distance at a 95% confidence interval computed from the inverse $\chi^2$ distribution. We denote this decision with an indicator

Translation: Here $\left(\bm{y}_i, \bm{S}_i\right)$ denotes the projection of the $i$-th track distribution into measurement space and $\bm{d}_j$ the $j$-th bounding box detection. The Mahalanobis distance accounts for state estimation uncertainty by measuring how many standard deviations the detection lies from the mean track location. The metric also makes it possible to exclude unlikely associations by thresholding the Mahalanobis distance at the 95% confidence level computed from the inverse $\chi^2$ distribution. This decision is expressed by the indicator

$$b_{i,j}^{(1)}=\mathbb{1}\left[d^{(1)}\left(i,j\right)\le t^{(1)}\right] \tag{2}$$


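Note: suppose we track targets in a video shot from a drone. In frame $k$ the detector outputs $m$ boxes $\mathcal{D}_k=\left[\bm{d}_1,\bm{d}_2,\dots,\bm{d}_m\right]$, while frame $k-1$ produced $n$ boxes that have already been verified as the track set $\mathcal{T}_{k-1}=\left[\bm{r}_1,\bm{r}_2,\dots,\bm{r}_n\right]$. We now need to solve an assignment to obtain $\mathcal{T}_k$.

  1. Run the Kalman filter on $\mathcal{T}_{k-1}$ to obtain the predictions $\mathcal{Y}_{k}=\left[\bm{y}_1,\bm{y}_2,\dots,\bm{y}_n\right]$. Matching $\mathcal{Y}_{k}$ against $\mathcal{D}_{k}$ and confirming the matches yields $\mathcal{T}_k$.
  2. There are three possible relations between $m$ and $n$: 1) $m=n$ (no target left and none entered, or as many left as entered); 2) $m<n$ (more targets left the frame than entered); 3) $m>n$ (fewer targets left the frame than entered).
  3. What is $\bm{S}_i$? It plays the role of the covariance matrix in the Mahalanobis distance. Here it is not a sample covariance estimated from data but the covariance of the $i$-th track's Kalman prediction projected into measurement space. In the standard Kalman filter this is the innovation covariance $\bm{S} = \bm{H}\bm{P}\bm{H}^{T} + \bm{R}$, with predicted state covariance $\bm{P}$, observation matrix $\bm{H}$, and measurement noise covariance $\bm{R}$.
  4. The values $d^{(1)}\left(i,j\right)$ are the entries of the cost matrix.
  5. What is the chi-square distribution used for? To check whether $\bm{y}_i$ and $\bm{d}_j$ are a plausible match.
    Chi-square formula: $\chi^2=\sum\frac{\left(O-E\right)^2}{E}$. At the 5% significance level, the chi-square critical value for 4 degrees of freedom is 9.4877 (from the table). Using "Mahalanobis distance at most 9.4877" as the matching criterion may look odd at first, but $d^{(1)}$ is the squared Mahalanobis distance, which follows a chi-square distribution with 4 degrees of freedom when the measurement residual is Gaussian, so the threshold applies to it directly.
    Mahalanobis distance: a distance commonly used in metric learning. Like the Euclidean, Manhattan, and Hamming distances, it measures similarity between data points, but it also handles high-dimensional data whose dimensions are correlated or differently scaled. For a single point: $D_M\left(x\right)=\sqrt{\left(x-\mu\right)^T\bm{\Sigma}^{-1}\left(x-\mu\right)}$. Between two data points $x$ and $y$: $D_M\left(x,y\right)=\sqrt{\left(x-y\right)^T\bm{\Sigma}^{-1}\left(x-y\right)}$. The $\bm{S}_i$ above corresponds to $\bm{\Sigma}$ here. How is this covariance matrix computed? It comes out of the Kalman filter's predict step, as described in item 3.

A linked discussion notes that because camera shake is significant, the constant velocity model behind the Kalman prediction does not work well, so the Mahalanobis distance contributes little as a cost. It is not useless, though: it still limits the cost matrix through the gate matrix.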


that evaluates to $1$ if the association between the $i$-th track and $j$-th detection is admissible. For our four dimensional measurement space the corresponding Mahalanobis threshold is $t^{(1)} = 9.4877$.

Translation: $b_{i,j}^{(1)}$ is a binary value that equals $1$ if the association between the $i$-th track and the $j$-th detection is admissible. For the four-dimensional measurement space (4 degrees of freedom), the corresponding Mahalanobis threshold (strictly speaking, the chi-square critical value) is $t^{(1)} = 9.4877$.
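The threshold can be checked numerically, and the gate itself is a one-line comparison. The sketch below uses the closed-form chi-square CDF for 4 degrees of freedom, $F(x) = 1 - e^{-x/2}\left(1 + x/2\right)$, and, purely to stay dependency-free, assumes a diagonal covariance, which the actual filter does not guarantee. All function names are hypothetical.

```python
import math
from typing import Sequence

GATE_4DOF = 9.4877  # threshold t^(1) quoted in the text


def chi2_cdf_4dof(x: float) -> float:
    """CDF of the chi-square distribution with 4 degrees of freedom."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-x / 2.0) * (1.0 + x / 2.0)


def squared_mahalanobis_diag(d: Sequence[float], y: Sequence[float],
                             variances: Sequence[float]) -> float:
    """Squared Mahalanobis distance for a diagonal covariance (sketch assumption)."""
    return sum((dj - yj) ** 2 / s for dj, yj, s in zip(d, y, variances))


def admissible(d: Sequence[float], y: Sequence[float],
               variances: Sequence[float]) -> bool:
    """Motion gate: 1 if the squared distance is within the threshold."""
    return squared_mahalanobis_diag(d, y, variances) <= GATE_4DOF


y = (100.0, 50.0, 0.5, 80.0)        # predicted measurement (u, v, gamma, h)
variances = (16.0, 16.0, 0.01, 25.0)
near = (104.0, 52.0, 0.5, 85.0)     # squared distance 1 + 0.25 + 0 + 1
far = (120.0, 50.0, 0.5, 80.0)      # squared distance 25

coverage = chi2_cdf_4dof(GATE_4DOF)
near_ok = admissible(near, y, variances)
far_ok = admissible(far, y, variances)
```

With a full covariance matrix the sum becomes the quadratic form of Eq. 1, typically computed with a linear solve rather than an explicit inverse.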

While the Mahalanobis distance is a suitable association metric when motion uncertainty is low, in our image-space problem formulation the predicted state distribution obtained from the Kalman filtering framework provides only a rough estimate of the object location. In particular, unaccounted camera motion can introduce rapid displacements in the image plane, making the Mahalanobis distance a rather uninformed metric for tracking through occlusions. Therefore, we integrate a second metric into the assignment problem. For each bounding box detection $\bm{d}_j$ we compute an appearance descriptor $\bm{r}_j$ with $\left\| \bm{r}_j \right\| = 1$. Further, we keep a gallery $\mathcal{R}_k = \left\{\bm{r}_k^{(i)}\right\}_{k=1}^{L_k}$ of the last $L_k = 100$ associated appearance descriptors for each track $k$. Then, our second metric measures the smallest cosine distance between the $i$-th track and $j$-th detection in appearance space:

Translation: The Mahalanobis distance is a suitable association metric when motion uncertainty is low, but in our image-space formulation the predicted state distribution from the Kalman filter gives only a rough estimate of the object location. In particular, unaccounted camera motion can introduce rapid displacements in the image plane, which makes the Mahalanobis distance a rather uninformed metric for tracking through occlusions. We therefore integrate a second metric into the assignment problem. For each bounding box detection $\bm{d}_j$ we compute an appearance descriptor $\bm{r}_j$ with $\left\| \bm{r}_j \right\| = 1$. In addition, for each track $k$ we keep a gallery $\mathcal{R}_k = \left\{\bm{r}_k^{(i)}\right\}_{k=1}^{L_k}$ of the last $L_k = 100$ appearance descriptors associated with it. The second metric is then the smallest cosine distance between the $i$-th track and the $j$-th detection in appearance space:

$$d^{(2)}\left(i,j\right)=\min\left\{1-\bm{r}_j^{\mathrm{T}}\bm{r}_k^{(i)} \;\middle|\; \bm{r}_k^{(i)}\in \mathcal{R}_i\right\} \tag{3}$$

Again, we introduce a binary variable to indicate if an association is admissible according to this metric

Translation: Again, we introduce a binary variable that indicates whether an association is admissible according to this metric:

$$b_{i,j}^{(2)}=\mathbb{1}\left[d^{(2)}\left(i,j\right)\le t^{(2)}\right] \tag{4}$$

and we find a suitable threshold for this indicator on a separate training dataset. In practice, we apply a pre-trained CNN to compute bounding box appearance descriptors. The architecture of this network is described in Section 2.4.

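Translation: A suitable threshold for this indicator is found on a separate training dataset. In practice, a pre-trained CNN computes the bounding box appearance descriptors. The architecture of this network is discussed in the "Deep Appearance Descriptor" section.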
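The appearance metric reduces to a dot product once descriptors are unit length. The following sketch keeps a per-track gallery in a bounded `collections.deque` and returns the smallest cosine distance to a new descriptor. The two-dimensional vectors are toy values and the function names are hypothetical; real descriptors would come from the CNN.

```python
import math
from collections import deque
from typing import Deque, Iterable, List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so that cosine distance is 1 - dot product."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def min_cosine_distance(gallery: Iterable[Sequence[float]],
                        descriptor: Sequence[float]) -> float:
    """Smallest cosine distance between a detection descriptor and a track gallery."""
    return min(1.0 - sum(a * b for a, b in zip(r, descriptor)) for r in gallery)


# The gallery keeps only the most recent 100 descriptors of one track.
gallery: Deque[List[float]] = deque(maxlen=100)
gallery.append(l2_normalize([1.0, 0.0]))
gallery.append(l2_normalize([1.0, 1.0]))

same_look = min_cosine_distance(gallery, l2_normalize([2.0, 0.0]))
other_look = min_cosine_distance(gallery, l2_normalize([0.0, 3.0]))
```

Taking the minimum over the gallery means one good match among the stored views is enough, which helps when the target's pose changes between frames.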

In combination, both metrics complement each other by serving different aspects of the assignment problem. On the one hand, the Mahalanobis distance provides information about possible object locations based on motion that are particularly useful for short-term predictions. On the other hand, the cosine distance considers appearance information that are particularly useful to recover identities after long-term occlusions, when motion is less discriminative. To build the association problem we combine both metrics using a weighted sum:

Translation: The two metrics complement each other by serving different aspects of the assignment problem. The Mahalanobis distance provides motion-based information about possible object locations, which is particularly useful for short-term predictions. The cosine distance captures appearance information, which is particularly useful for recovering identities after long-term occlusions, when motion alone is not enough to tell whether a reappearing target is one that was seen before. To build the association problem we combine both metrics in a weighted sum:

$$c_{i,j}=\lambda\, d^{(1)}\left(i,j\right)+\left(1-\lambda\right) d^{(2)}\left(i,j\right) \tag{5}$$

where we call an association admissible if it is within the gating region of both metrics:

Translation: An association is called admissible if it lies within the gating region of both metrics:

$$b_{i,j}=\prod_{m=1}^{2}b_{i,j}^{(m)} \tag{6}$$

The influence of each metric on the combined association cost can be controlled through hyperparameter $\lambda$. During our experiments we found that setting $\lambda=0$ is a reasonable choice when there is substantial camera motion. In this setting, only appearance information are used in the association cost term. However, the Mahalanobis gate is still used to disregard infeasible assignments based on possible object locations inferred by the Kalman filter.

Translation: The hyperparameter $\lambda$ controls how much each metric influences the combined association cost. In our experiments, $\lambda=0$ proved to be a reasonable choice when there is substantial camera motion. In this setting only appearance information enters the association cost term. The Mahalanobis gate is still used, however, to discard assignments that are infeasible given the possible object locations inferred by the Kalman filter.
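The weighted sum and the product of gates can be sketched as follows. The appearance threshold `t2 = 0.2` is a made-up placeholder (the text only says it is tuned on separate training data), and replacing gated-out entries with a large `INFEASIBLE` cost is a convenience of this sketch rather than something the text prescribes.

```python
from typing import List, Sequence

INFEASIBLE = 1e5  # sentinel cost for gated-out pairs (sketch assumption)


def combined_cost(d1: Sequence[Sequence[float]], d2: Sequence[Sequence[float]],
                  lam: float, t1: float = 9.4877, t2: float = 0.2) -> List[List[float]]:
    """Weighted sum of motion and appearance costs, kept only where both gates pass.

    d1[i][j]: squared Mahalanobis distance between track i and detection j.
    d2[i][j]: smallest cosine distance between track i and detection j.
    """
    cost = []
    for row1, row2 in zip(d1, d2):
        cost_row = []
        for motion, appearance in zip(row1, row2):
            gate = motion <= t1 and appearance <= t2   # b = b1 * b2
            value = lam * motion + (1.0 - lam) * appearance
            cost_row.append(value if gate else INFEASIBLE)
        cost.append(cost_row)
    return cost


d1 = [[2.0, 30.0], [12.0, 1.0]]
d2 = [[0.1, 0.05], [0.15, 0.3]]
# With lam = 0 the cost is pure appearance, but the motion gate still applies.
cost = combined_cost(d1, d2, lam=0.0)
```

In the example, the pair (0, 1) has the best appearance score but fails the motion gate, which is exactly the case the Mahalanobis gate is kept for when $\lambda=0$.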

Matching Cascade

Instead of solving for measurement-to-track associations in a global assignment problem, we introduce a cascade that solves a series of subproblems. To motivate this approach, consider the following situation: When an object is occluded for a longer period of time, subsequent Kalman filter predictions increase the uncertainty associated with the object location. Consequently, probability mass spreads out in state space and the observation likelihood becomes less peaked. Intuitively, the association metric should account for this spread of probability mass by increasing the measurement-to-track distance. Counterintuitively, when two tracks compete for the same detection, the Mahalanobis distance favors larger uncertainty, because it effectively reduces the distance in standard deviations of any detection towards the projected track mean. This is an undesired behavior as it can lead to increased track fragmentations and unstable tracks. Therefore, we introduce a matching cascade that gives priority to more frequently seen objects to encode our notion of probability spread in the association likelihood.

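Translation: Instead of solving measurement-to-track association as one global assignment problem, we introduce a cascade that solves a series of subproblems. To motivate this, consider an object that is occluded for a longer period: successive Kalman filter predictions increase the uncertainty of its location, so probability mass spreads out in state space and the observation likelihood becomes less peaked. Intuitively, the association metric should account for this spread by increasing the measurement-to-track distance. Counterintuitively, when two tracks compete for the same detection, the Mahalanobis distance favors the one with larger uncertainty, because larger uncertainty reduces the distance, measured in standard deviations, from any detection to the projected track mean. This is undesirable because it can lead to more track fragmentation and unstable tracks. We therefore introduce a matching cascade that gives priority to more frequently seen objects, which encodes our notion of probability spread in the association likelihood.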

Listing 1: Matching cascade (pseudocode)
Input: track indices $\mathcal{T}=\left\{1,\dots,N\right\}$, detection indices $\mathcal{D}=\left\{1,\dots,M\right\}$, maximum age $A_{\textnormal{max}}$
1. Compute the cost matrix $\bm{C}=\left[c_{i,j}\right]$ using Eq. 5 (the combined cost $c_{i,j}$)
2. Compute the gate matrix $\bm{B}=\left[b_{i,j}\right]$ using Eq. 6 (the combined gate $b_{i,j}$)
3. Initialize the set of matches $\mathcal{M} \leftarrow \emptyset$
4. Initialize the set of unmatched detections $\mathcal{U} \leftarrow \mathcal{D}$
5. for $n \in \left\{1,\dots,A_{\textnormal{max}}\right\}$ do
6.     Select tracks by age $\mathcal{T}_n\leftarrow \left\{i \in \mathcal{T} \mid a_i=n \right\}$
7.     $\left[x_{i,j}\right] \leftarrow \textnormal{min\_cost\_matching}\left(\bm{C}, \mathcal{T}_n,\mathcal{U}\right)$
8.     $\mathcal{M} \leftarrow \mathcal{M} \cup\left\{(i,j) \mid b_{i,j} \cdot x_{i,j} \gt 0\right\}$
9.     $\mathcal{U} \leftarrow \mathcal{U} \setminus \left\{j \mid \sum_{i} b_{i,j} \cdot x_{i,j} \gt 0\right\}$
10. end for
11. return $\mathcal{M},\mathcal{U}$

Listing 1 outlines our matching algorithm. As input we provide the set of track $\mathcal{T}$ and detection $\mathcal{D}$ indices as well as the maximum age $A_{\textnormal{max}}$. In lines 1 and 2 we compute the association cost matrix and the matrix of admissible associations. We then iterate over track age $n$ to solve a linear assignment problem for tracks of increasing age. In line 6 we select the subset of tracks $\mathcal{T}_n$ that have not been associated with a detection in the last $n$ frames. In line 7 we solve the linear assignment between tracks in $\mathcal{T}_n$ and unmatched detections $\mathcal{U}$. In lines 8 and 9 we update the set of matches and unmatched detections, which we return after completion in line 11. Note that this matching cascade gives priority to tracks of smaller age, i.e., tracks that have been seen more recently.

Translation: Listing 1 gives the pseudocode of the matching algorithm. The inputs are the track indices $\mathcal{T}$, the detection indices $\mathcal{D}$, and the maximum age $A_{\textnormal{max}}$. Lines 1 and 2 compute the association cost matrix and the matrix of admissible associations. The algorithm then iterates over track age $n$ and solves a linear assignment problem for tracks of increasing age. Line 6 selects the subset of tracks $\mathcal{T}_n$ that have not been associated with a detection in the last $n$ frames. Line 7 solves the linear assignment between the tracks in $\mathcal{T}_n$ and the unmatched detections $\mathcal{U}$. Lines 8 and 9 update the set of matches and the set of unmatched detections, which are returned in line 11. Note that the cascade gives priority to tracks of smaller age, that is, tracks that have been seen more recently.
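A runnable sketch of the cascade in Listing 1 is given below. To stay dependency-free, the assignment step is a brute-force search over permutations, which is only a stand-in for the Hungarian method and is practical for a handful of tracks at most. Indices are 0-based, unlike the listing, and all function names are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def min_cost_matching(cost: Sequence[Sequence[float]], rows: Sequence[int],
                      cols: Sequence[int]) -> List[Tuple[int, int]]:
    """Brute-force minimum-cost assignment between the given rows and columns."""
    rows, cols = list(rows), list(cols)
    k = min(len(rows), len(cols))
    if k == 0:
        return []
    if len(rows) <= len(cols):
        candidates = (list(zip(rows, p)) for p in permutations(cols, k))
    else:
        candidates = (list(zip(p, cols)) for p in permutations(rows, k))
    return min(candidates, key=lambda pairs: sum(cost[i][j] for i, j in pairs))


def matching_cascade(cost: Sequence[Sequence[float]], gate: Sequence[Sequence[bool]],
                     ages: Sequence[int], num_detections: int,
                     max_age: int) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Match tracks to detections, youngest tracks (smallest a_i) first."""
    matches: List[Tuple[int, int]] = []
    unmatched = list(range(num_detections))
    for n in range(1, max_age + 1):
        if not unmatched:
            break
        tracks_n = [i for i, age in enumerate(ages) if age == n]
        for i, j in min_cost_matching(cost, tracks_n, unmatched):
            if gate[i][j]:            # keep only admissible pairs
                matches.append((i, j))
        matched = {j for _, j in matches}
        unmatched = [j for j in unmatched if j not in matched]
    return matches, unmatched


# Track 0 was seen last frame (age 1); track 1 has been occluded (age 5).
cost = [[0.3, 0.9], [0.05, 0.8]]
gate = [[True, True], [True, False]]
matches, unmatched = matching_cascade(cost, gate, ages=[1, 5],
                                      num_detections=2, max_age=30)
```

In this example a single global assignment would give detection 0 to the occluded track 1 because its cost is lowest, while the cascade lets the recently seen track 0 claim it first.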

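In a final matching stage, we run intersection over union association as proposed in the original SORT algorithm [12] on the set of unconfirmed and unmatched tracks of age $n = 1$. This helps to account for sudden appearance changes, e.g., due to partial occlusion with static scene geometry, and to increase robustness against erroneous initialization.

Translation: In a final matching stage, intersection over union (IoU) association as proposed in the original SORT algorithm [12] is run on the set of unconfirmed and unmatched tracks of age $n = 1$. This helps to account for sudden appearance changes, for example due to partial occlusion by static scene geometry, and increases robustness against erroneous initialization.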
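For reference, the intersection over union of two boxes can be computed as in the sketch below. Boxes are assumed to be given as corner coordinates `(x1, y1, x2, y2)`, which differs from the $(u,v,\gamma,h)$ parametrization used by the filter, so a conversion would be needed in practice. An IoU-based association cost is commonly taken as one minus this value.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0.0 else 0.0


overlapping = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
disjoint = iou((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0))
```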

Deep Appearance Descriptor

By using simple nearest neighbor queries without additional metric learning, successful application of our method requires a well-discriminating feature embedding to be trained offline, before the actual online tracking application. To this end, we employ a CNN that has been trained on a large-scale person re-identification dataset [21] that contains over 1,100,000 images of 1,261 pedestrians, making it well suited for deep metric learning in a people tracking context.

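Translation: Because the method uses simple nearest neighbor queries without additional metric learning, it needs a well-discriminating feature embedding that is trained offline, before the actual online tracking. To this end, we employ a CNN trained on a large-scale person re-identification dataset [21] that contains over 1,100,000 images of 1,261 pedestrians, which makes it well suited for deep metric learning in a people tracking context.

The CNN architecture of our network is shown in Table 1. In summary, we employ a wide residual network [22] with two convolutional layers followed by six residual blocks. The global feature map of dimensionality 128 is computed in dense layer 10. A final batch and $\ell_2$ normalization projects features onto the unit hypersphere to be compatible with our cosine appearance metric. In total, the network has 2,800,864 parameters and one forward pass of 32 bounding boxes takes approximately 30 ms on an Nvidia GeForce GTX 1050 mobile GPU. Thus, this network is well suited for online tracking, provided that a modern GPU is available. While the details of our training procedure are out of the scope of this paper, we provide a pre-trained model in our GitHub repository along with a script that can be used to generate features.

Translation: The CNN architecture is shown in Table 1. In summary, we use a wide residual network [22] with two convolutional layers followed by six residual blocks. The global feature map of dimensionality 128 is computed in dense layer 10. A final batch normalization and $\ell_2$ normalization project the features onto the unit hypersphere so that they are compatible with the cosine appearance metric. In total the network has 2,800,864 parameters, and one forward pass of 32 bounding boxes takes approximately 30 ms on an Nvidia GeForce GTX 1050 mobile GPU. The network is therefore well suited for online tracking, provided a modern GPU is available. The details of the training procedure are beyond the scope of the paper, but a pre-trained model is provided in the GitHub repository together with a script that can be used to generate features.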

Experiments

Conclusion

We have presented an extension to SORT that incorporates appearance information through a pre-trained association metric. Due to this extension, we are able to track through longer periods of occlusion, making SORT a strong competitor to state-of-the-art online tracking algorithms. Yet, the algorithm remains simple to implement and runs in real time.

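Translation: We have presented an extension to SORT that incorporates appearance information through a pre-trained association metric. With this extension, objects can be tracked through longer periods of occlusion, which makes SORT a strong competitor to state-of-the-art online tracking algorithms. The algorithm nevertheless remains simple to implement and runs in real time.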

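Supplement

Detection inaccuracies include false positives, missed detections, and inaccurate bounding boxes. To mitigate this, some researchers propose batch mode. The batch can be a fixed sliding window or an entire video, so tracking becomes a global optimization problem within the batch. In essence, such methods use information from future frames for tracking.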
