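This dataset takes an approach that repurposes an existing large-scale object detection dataset (YouTube-BB, sparsely annotated) for object tracking. In other words, it is a subset of the video object detection dataset YT-BB: about 1.1 TB, with roughly 30,000 videos. The first dataset I worked with was ImageNet VID 2015 (densely annotated), about 100 GB with around 5,000 videos. It felt huge back then, but by comparison it really is...
We provide more than 30K videos with more than 14 million dense bounding box annotations.
In addition, we introduce a new benchmark composed of 500 novel videos, modeled with a distribution similar to our training dataset (evaluating more than 20 trackers).
(ii) Current deep-learning-based trackers are often constrained: they typically rely on models pretrained for object classification, or are trained on object detection datasets such as ImageNet Videos, or on small datasets. All of these are limiting factors.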
multi-object tracking: object detection algorithms
single-object tracking: tracking-by-detection (model representation + object search)
Correlation Filter Trackers
The main reason behind the impressive performance of CF trackers lies in the approximate dense sampling achieved by circulantly shifting the target patch samples. Also, the remarkable runtime performance is achieved by efficiently solving the underlying ridge regression problem in the Fourier domain.
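To make the Fourier-domain trick concrete, here is a minimal 1-D sketch of a linear correlation filter (MOSSE-style closed form), not the implementation of any specific tracker mentioned here. The names `dft`, `train_filter`, and `respond`, the regularizer `lam`, and the toy signal are all assumptions made for illustration. A naive DFT keeps the example dependency-free; a real tracker would use a 2-D FFT.

```python
import cmath
import math
import random


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (enough for a tiny demo)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(values[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def train_filter(x, y, lam=1e-2):
    """Ridge regression over all cyclic shifts of x, solved per frequency.

    Closed form: W = conj(X) * Y / (conj(X) * X + lam), element-wise.
    """
    X, Y = dft(x), dft(y)
    return [xc.conjugate() * yc / (xc.conjugate() * xc + lam)
            for xc, yc in zip(X, Y)]


def respond(w_hat, z):
    """Response of the learned filter on a new patch z, for every shift."""
    Z = dft(z)
    return [v.real for v in dft([w * zc for w, zc in zip(w_hat, Z)],
                                inverse=True)]


n = 32
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]               # toy "target patch"
y = [math.exp(-min(t, n - t) ** 2 / 2.0) for t in range(n)]  # desired response, peak at shift 0
w_hat = train_filter(x, y)

z = x[-3:] + x[:-3]                                          # patch cyclically shifted by 3
response = respond(w_hat, z)
peak = max(range(n), key=response.__getitem__)               # estimated displacement
```

Training costs one element-wise division per frequency instead of a matrix inversion, which is where the speed of CF trackers comes from; the position of the response peak gives the estimated displacement of the target.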
Deep Trackers
Three representative methods are covered: MDNet, SiamFC, and a third that I have not read, embarrassingly (Visual tracking with fully convolutional
Object Tracking Datasets
VIVID: an early attempt to build a tracking dataset for surveillance purposes
NfS: provides a set of 100 videos with high framerate, in an attempt to focus on fast motion.
UAV123/UAV20L: gather another application-specific collection of 123 videos and 20 long videos captured from a UAV or generated from a flight simulator.
NUS PRO: gathers an application-specific collection of 365 videos for people and rigid object tracking
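Object detection datasets such as ImageNet Video or YouTube-BoundingBoxes provide object detection bboxes, but their annotations are relatively sparse or at a low frame rate, so they lack information about the motion dynamics of the object.
It collects 30,643 video segments with an average duration of 16.6 s and 14,431,266 frames in total, split into a training set and a test set.
From YouTube-BoundingBoxes (a large-scale dataset for object detection with about 380,000 video segments, annotated once per second, collected directly from YouTube), 30,132 training videos were carefully selected and split into 12 training subsets of 2,511 videos each; each video has a bit more than 400 frames (short-term tracking). There are 21 object categories.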
Coarse annotations are provided by YT-BB at 1 fps. In order to increase the annotation density, we rely on a mixture of state-of-the-art trackers to fill in missing annotations. We claim that any tracker is reliable on a small time lapse of 1 second.
As a result, we densely annotated the 30,132 videos using a weighted average between a forward and a backward pass using the DCF tracker (300 fps).
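A rough sketch of the forward/backward fusion between two consecutive 1 fps annotations is shown below. The linear weighting by temporal distance to each annotated frame is an assumption, since the exact weights are not given in these notes; `fuse_tracks` and the `(x, y, w, h)` box layout are hypothetical names chosen for illustration.

```python
def fuse_tracks(forward, backward):
    """Blend two box trajectories that span the same frames.

    forward[i]  : box (x, y, w, h) at frame i, tracked from the earlier annotation
    backward[i] : box (x, y, w, h) at frame i, tracked back from the later annotation
    Assumed weighting: linear in the temporal distance to each annotated frame.
    """
    n = len(forward)
    if n != len(backward) or n < 2:
        raise ValueError("need two trajectories of equal length >= 2")
    fused = []
    for i, (f, b) in enumerate(zip(forward, backward)):
        wb = i / (n - 1)  # backward pass is trusted more near the later annotation
        fused.append(tuple((1.0 - wb) * fv + wb * bv for fv, bv in zip(f, b)))
    return fused


forward = [(0, 0, 10, 10), (2, 0, 10, 10), (4, 0, 10, 10)]
backward = [(0, 0, 10, 10), (4, 0, 10, 10), (8, 0, 10, 10)]
fused = fuse_tracks(forward, backward)
```

With this weighting the fused track agrees with the forward pass at the first annotated frame and with the backward pass at the last one, so drift accumulated in either direction is damped toward the nearest annotation.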
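A test set of 511 videos with the same distribution as the training set was also built. The test videos were annotated by Mechanical Turk workers (Turkers) using the VATIC tool (a video annotation tool based on optical flow). The test set provides 15 attributes: the first five (Scale Variation, Aspect Ratio Change, Fast Motion, Low Resolution, Out-of-View) are computed automatically from the annotations, and the remaining ten (Illumination Variation, Camera Motion, Motion Blur, Background Clutter, Similar Object, Deformation, In-Plane Rotation, Out-of-Plane Rotation, Partial Occlusion, Full Occlusion) were checked manually.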
We evaluate the trackers through an online server. In a fashion similar to OTB100, we perform a One Pass Evaluation (OPE) and measure the success and precision of the trackers over the 511 videos.
success S: IoU between the ground truth bounding boxes and the ones generated by trackers. The trackers are ranked using the AUC (Area Under the Curve), with thresholds from 0 to 1.
precision P: measured as the distance in pixels between the centers Cgt and Ctr. The trackers are ranked using this metric with a conventional threshold of 20 pixels.
Pnorm: we normalize the precision over the size of the ground truth bounding box. We rank tracking algorithms using the Area Under the Curve (AUC), with thresholds from 0 to 0.5.
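The three metrics can be sketched as follows. This is a minimal illustration and not the code of the evaluation server: boxes are assumed to be `(x, y, w, h)` tuples, the AUC is approximated by averaging the rate over evenly spaced thresholds, and the per-axis division by ground truth width and height in `normalized_center_error` is an assumed form of the normalization. All function names are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_error(gt, tr):
    """Distance in pixels between the box centers Cgt and Ctr."""
    return math.hypot((gt[0] + gt[2] / 2) - (tr[0] + tr[2] / 2),
                      (gt[1] + gt[3] / 2) - (tr[1] + tr[3] / 2))


def normalized_center_error(gt, tr):
    """Center offset divided per axis by the ground truth width and height."""
    dx = ((gt[0] + gt[2] / 2) - (tr[0] + tr[2] / 2)) / gt[2]
    dy = ((gt[1] + gt[3] / 2) - (tr[1] + tr[3] / 2)) / gt[3]
    return math.hypot(dx, dy)


def success_auc(gts, trs, steps=21):
    """Mean success rate over IoU thresholds 0..1 (approximates the AUC)."""
    ious = [iou(g, t) for g, t in zip(gts, trs)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(v > thr for v in ious) / len(ious) for thr in thresholds]
    return sum(rates) / len(rates)


def precision_at(gts, trs, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    errs = [center_error(g, t) for g, t in zip(gts, trs)]
    return sum(e <= threshold for e in errs) / len(errs)


def normalized_precision_auc(gts, trs, steps=51, max_threshold=0.5):
    """Mean normalized precision over thresholds 0..0.5 (approximates the AUC)."""
    errs = [normalized_center_error(g, t) for g, t in zip(gts, trs)]
    thresholds = [max_threshold * i / (steps - 1) for i in range(steps)]
    rates = [sum(e <= thr for e in errs) / len(errs) for thr in thresholds]
    return sum(rates) / len(rates)
```

Normalizing by the ground truth size makes the precision comparable across videos of different resolution and objects of different scale, which the fixed 20 pixel threshold does not.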