VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results

Abstract

  Recently, automatic visual data understanding from drone platforms has become highly demanded. To this end, the Vision Meets Drone Object Detection in Image Challenge is held for the second time, in conjunction with the 17th International Conference on Computer Vision, and focuses on image object detection on drones. Results of 33 object detection algorithms are presented. For each participating detector, a short description is provided in the appendix. Our goal is to advance the state-of-the-art detection algorithms and provide a comprehensive evaluation platform for them. The evaluation protocol of the VisDrone-DET2019 Challenge and the comparison results of all the submitted detectors on the released dataset are publicly available on the website: http://www.aiskyeye.com/. The challenge results demonstrate that there still remains much room for improvement for object detection algorithms on drones.

Reading notes:
  Recently there has been a strong demand for automatic understanding of visual data from drone platforms. To this end, the Vision Meets Drone object detection in image challenge, held in conjunction with the 17th ICCV, focuses on image object detection on drones. Results of 33 object detection algorithms are presented, and each participating detector is given a short description in the appendix. The goal is to advance state-of-the-art detection algorithms and to provide a comprehensive evaluation platform for them. The challenge rules and the comparison results of all detectors on the released dataset are publicly available at http://www.aiskyeye.com/. The challenge results show that there is still much room for improvement in object detection algorithms for drones.
Keywords:

  • www.aiskyeye.com

Phrases encountered:

  1. to this end
  2. in conjunction with

Introduction

  Object detection is breaking into a wide range of high-level computer vision applications, such as autonomous driving, face detection and recognition, and activity recognition. Although significant progress has been achieved in recent years, these algorithms usually focus on detection in general scenarios instead of drone-based scenes. This is because such studies are seriously limited by the lack of public large-scale benchmarks or datasets.

Reading notes:
  Object detection is breaking into a wide range of high-level computer vision applications, such as autonomous driving, face detection and recognition, and activity recognition. Although much progress has been made in these areas in recent years, the algorithms focus on general scenes rather than drone-based ones, because research is severely limited by the lack of public large-scale datasets.

  To advance state-of-the-art detection algorithms in drone-based scenes, the 1st Vision Meets Drone Object Detection in Images Challenge was held on September 8, 2018, in conjunction with the 15th European Conference on Computer Vision in Munich, Germany. Compared with preliminary drone-based datasets, a larger-scale drone-based object detection dataset was proposed to evaluate detection algorithms in real scenarios. In total, 34 object detection methods were submitted to this challenge, and we provided a comprehensive performance evaluation for them.

Reading notes:
  To advance state-of-the-art detection algorithms for drone-based scenes, the first Vision Meets Drone object detection in images challenge was held in 2018, in conjunction with the 15th ECCV in Munich, Germany. Compared with the earliest drone-based datasets, a larger-scale drone-based object detection dataset was proposed to evaluate detection algorithms in real scenarios. 34 object detection methods were submitted to that challenge, and a comprehensive performance evaluation was provided for them.

  In this paper, researchers are encouraged to submit algorithms to detect objects of ten predefined categories (e.g., pedestrian and car) in the VisDrone-DET2019 dataset. Specifically, 33 out of 47 detection methods perform better than the state-of-the-art baseline. Since the submissions are derived from work recently published at top computer vision conferences or journals, we believe this challenge is useful to further promote the development of object detection algorithms on drone platforms. The experiments can be found at our website: http://www.aiskyeye.com/.

Reading notes:
  In this paper, researchers are encouraged to submit algorithms that detect objects of ten predefined categories in the VisDrone-DET2019 dataset. Specifically, 33 of the 47 detection methods perform better than the baseline state of the art. Since they derive from work recently published at top computer vision conferences or journals, this challenge should help further promote the development of object detection algorithms on drone platforms. The experiments can be found on the website: http://www.aiskyeye.com/.

2. Related Work

2.1. Anchor-based Detectors

  The current state-of-the-art anchor-based detectors can be divided into two categories: (1) the two-stage methods with higher accuracy, and (2) the one-stage methods with higher efficiency.

  • What do two-stage and one-stage mean, respectively?

  Based on previous works, Lin et al. develop the focal loss to address the class imbalance issue in object detection by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. To address the shortcomings of current two-stage methods, Li et al. propose a new two-stage detector that makes the head of the network as light as possible, by using a thin feature map and a cheap R-CNN subnet (pooling and a single fully-connected layer). To inherit the merits of both two-stage and one-stage methods, Zhang et al. propose a single-shot detector formed by two inter-connected modules, the anchor refinement module and the object detection module. Moreover, Cascade R-CNN is a multi-stage object detection architecture. That is, a sequence of detectors is trained with increasing IoU thresholds to be sequentially more selective against close false positives. Recently, Duan et al. propose the channel-aware deconvolutional network to detect small objects, especially for drone-based scenes. To keep the favourable performance independent of the network architecture, Zhu et al. train detectors from scratch using BatchNorm with a larger learning rate.
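
To make the focal loss idea concrete, here is a minimal PyTorch sketch of the binary focal loss described above; `alpha` and `gamma` are the default values reported by Lin et al., and the function name is ours:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    targets: float tensor of 0/1 labels, same shape as logits.
    The modulating factor (1 - p_t)^gamma shrinks the loss of
    well-classified examples (p_t close to 1), so training focuses
    on hard, misclassified ones.
    """
    # Plain cross entropy per element, no reduction yet.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)      # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```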

Reading notes:
Based on previous work, Lin et al. developed the focal loss to solve the class imbalance problem in object detection, by reshaping the standard cross entropy loss to down-weight well-classified examples. To address the shortcomings of current two-stage methods, Li et al. proposed a new two-stage detector that makes the head of the network as light as possible, using a thin feature map and a cheap R-CNN subnet (pooling and a single fully-connected layer). To inherit the advantages of both two-stage and one-stage methods, Zhang et al. built a single-shot detector from two inter-connected modules (an anchor refinement module and an object detection module). In addition, Cascade R-CNN is a multi-stage object detection architecture: a sequence of detectors trained with increasing IoU thresholds becomes sequentially more selective against close false positives. Recently, Duan et al. proposed a channel-aware deconvolutional network for detecting small objects, especially in drone-based scenes. To keep good performance independent of the network architecture, Zhu et al. trained detectors from scratch using BatchNorm with a larger learning rate.

Questions:

  1. What is the class imbalance issue?
  2. What is focal loss?

Vocabulary:

  • entropy n.
  • down-weights v.
  • well-classified a.
  • address v.
  • shortcoming n.
  • light a.
  • a sequence of detectors is trained with increasing IoU thresholds to be sequentially more selective against close false positives
  • sequentially ad.
  • scratch

2.2. Anchor-free Detectors

  Although anchor-based detectors have achieved much progress in object detection, it is still difficult to select optimal anchor parameters. To guarantee a high recall, more anchors are essential, but they introduce high computational complexity. Moreover, different datasets correspond to different optimal anchors. To solve these issues, anchor-free detectors have attracted much research recently and have achieved significant advances with complex backbone networks.
  Law and Deng propose CornerNet to detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolutional neural network. To decrease the high processing cost, they further introduce CornerNet-Lite. It is a combination of two efficient variants of CornerNet: CornerNet-Saccade with an attention mechanism and CornerNet-Squeeze with a new compact backbone architecture. Moreover, Duan et al. detect each object as a triplet, rather than a pair, of keypoints, which improves both precision and recall. Zhou et al. further model an object as the center point of its bounding box, and regress all other object properties, such as size, 3D location, orientation, and even pose. On the other hand, Kong et al. propose an accurate, flexible and completely anchor-free framework, which predicts category-sensitive semantic maps for the object existence possibility and a category-agnostic bounding box for each position that potentially contains an object. Tian et al. solve object detection in a per-pixel prediction fashion, analogous to semantic segmentation.
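
As a rough illustration of the center-point idea (Zhou et al.), the sketch below decodes boxes from a single-class center heatmap plus a per-pixel width/height map. It is a simplification under our own assumptions: the real detector also applies a 3x3 max-pool as peak NMS and regresses a sub-pixel center offset, both omitted here, and the function and argument names are ours:

```python
import numpy as np

def decode_centers(heatmap, wh, k=100):
    """Turn the top-k heatmap peaks into boxes [x1, y1, x2, y2, score].

    heatmap: (H, W) center scores in [0, 1]
    wh:      (H, W, 2) predicted box width and height at each pixel
    """
    H, W = heatmap.shape
    flat = heatmap.ravel()
    top = np.argsort(flat)[::-1][:k]        # indices of the k strongest centers
    ys, xs = np.unravel_index(top, (H, W))
    w, h = wh[ys, xs, 0], wh[ys, xs, 1]
    return np.stack([xs - w / 2, ys - h / 2,
                     xs + w / 2, ys + h / 2,
                     flat[top]], axis=1)
```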

Reading notes:
  Although anchor-based detectors have made great progress in object detection, selecting the optimal anchor parameters remains difficult. To guarantee a high recall, a large number of anchors is necessary, but they introduce high computational complexity. Moreover, different datasets correspond to different optimal anchors. To solve these problems, anchor-free detectors have attracted a lot of research in recent years and have achieved significant advances with complex backbone networks.
  Law and Deng proposed CornerNet, which uses a single convolutional neural network to detect an object bounding box as a pair of keypoints, top-left and bottom-right. To reduce the high processing cost, they further introduced CornerNet-Lite, a combination of two efficient CornerNet variants: CornerNet-Saccade with an attention mechanism and CornerNet-Squeeze with a new compact backbone. In addition, Duan et al. detect each object as a triplet rather than a pair of keypoints, which improves both precision and recall. Zhou et al. further model an object as the center point of its bounding box and regress all other object properties, such as size, 3D location, orientation and even pose. On the other hand, Kong et al. proposed an accurate, flexible and completely anchor-free framework that predicts category-sensitive semantic maps for the possibility of an object existing and a category-agnostic bounding box for each position that potentially contains an object. Tian et al. solve object detection in a per-pixel prediction manner, similar to semantic segmentation.

  • Does anchor-free mean "without anchors" or "free anchors"?
  • Does category-agnostic mean "category-independent" or "category-unknowable"?
  • What is recall?
  • What is a backbone?
  • What are CornerNet and CornerNet-Lite?
  • What is an attention mechanism?

Phrases:

  • computational complexity
  • saccade
  • mechanism
  • combination of
  • correspond to
  • regress v.
  • orientation n.
  • On the other hand, Kong et al. propose an accurate, flexible and completely anchor-free framework, which predicts category-sensitive semantic maps for the object existence possibility and a category-agnostic bounding box for each position that potentially contains an object.

3. The VisDrone-DET2019 Challenge

  Similar to the VisDrone-2018 Challenge, we mainly focus on humans and vehicles in our daily life, and detect ten object categories of interest including pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle.
  To obtain results on the VisDrone-DET2019 test-challenge set, the participants must generate the results in the defined format and upload them to the evaluation server. If the results of the submitted method are above the performance of Cascade R-CNN, it will be automatically published in the ICCV 2019 workshop proceedings. Moreover, only the algorithms with a detailed description (e.g., speed, GPU and CPU information) have the right of authorship.

Reading notes:
  Similar to the VisDrone-2018 challenge, the focus is on the humans and vehicles of daily life, detecting ten object categories of interest: pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle.
  To obtain results on the VisDrone-DET2019 test-challenge set, participants need to generate results in the specified format and upload them to the evaluation server. If a submitted method exceeds the performance of Cascade R-CNN, it will be automatically published in the ICCV 2019 workshop proceedings. Moreover, only algorithms with a detailed description (e.g., speed, GPU and CPU information) have the right of authorship.

  • What does the test-challenge set refer to?
  • What do the "results" refer to?

Vocabulary:

  • awning-tricycle
  • van
  • workshop
  • authorship

3.1. The VisDrone-DET2019 Dataset

  The VisDrone-DET2019 dataset uses the same data as the VisDrone-DET2018 dataset, namely 8,599 images captured by drone platforms in different places at different heights. Moreover, more than 540k bounding boxes of targets are annotated with ten predefined categories. The dataset is divided into training, validation and testing subsets (6,471 for training, 548 for validation, 1,580 for testing), which are collected from different locations but similar environments.
  Furthermore, we use the evaluation protocol in MS COCO to evaluate the results of detection algorithms, including the AP, AP50, AP75, AR1, AR10, AR100 and AR500 metrics. Specifically, AP is computed by averaging over all 10 Intersection over Union (IoU) thresholds (i.e., in the range [0.50:0.95] with the uniform step size 0.05) of all categories, and is used as the primary metric for ranking. AP50 and AP75 are computed at the single IoU thresholds 0.5 and 0.75 over all categories. The AR1, AR10, AR100 and AR500 scores are the maximum recalls given 1, 10, 100 and 500 detections per image respectively, averaged over all categories and IoU thresholds. Note that these criteria penalize missing detections of objects as well as duplicate detections (two detection results for the same object instance). Please refer to [26] for more details.
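
For reference, the IoU between a detection and a ground-truth box is simply the intersection area over the union area; a minimal sketch (our own helper, not code from the benchmark toolkit):

```python
def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# The primary AP metric averages per-threshold AP over the 10 IoU
# thresholds 0.50, 0.55, ..., 0.95 (and over all ten categories):
thresholds = [0.50 + 0.05 * i for i in range(10)]
```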

Reading notes:

  • namely ad.
  • metric n. a.
  • rank v. n.
  • respectively ad.
  • duplicate v. n. a.
  • criteria n.
  • penalize v.
  • difference between valuate and evaluate
  • overall vs. over all
  • What is the validation set used for?
  • What do APx and ARx mean?
  • Intersection over Union (IoU)

3.2. Submitted Detectors

  There are 47 different detection methods submitted to the VisDrone-DET2019 challenge, 33 of which perform better than the state-of-the-art object detector Cascade R-CNN. Besides Cascade R-CNN, the VisDrone team also gives the results of another 6 baseline methods, i.e., CornerNet, Light-RCNN, DetNet59, RefineDet, RetinaNet and FPN. For these baselines, the default parameters are used or set to reasonable values. Thus, there are 39 algorithms in total included in the VisDrone-DET2019 Challenge.
  Nine submitted detectors improve the Cascade R-CNN, namely Airia-GA-Cascade (A.2), Cascade R-CNN+ (A.4), Cascade R-CNN++ (A.5), DCRCNN (A.13), DPN (A.14), DPNet-ensemble (A.15), MSCRDet (A.25), SAMFR-CascadeRCNN (A.29), and SGE-cascadeR-CNN (A.30). Six detectors are based on CenterNet, including CenterNet (A.6), CenterNet-Hourglass (A.7), CN-DhVaSa (A.9), ConstraintNet (A.10), GravityNet (A.20) and RRNet (A.28). Five detectors are derived from RetinaNet, i.e., DA-RetinaNet (A.11), EHR-RetinaNet (A.16), FSRetinanet (A.19), MOD-RETINANET (A.24) and retinaplus (A.27). Three detectors employ the FPN representation [24], i.e., ACM-OD (A.1), BetterFPN (A.3) and ODAC (A.26). Three detectors (i.e., DBCL (A.12), HTC-drone (A.22) and S+D (A.31)) conduct segmentation of the objects to restrain the background noise. Four algorithms use ensembles of state-of-the-art detectors. Specifically, EnDet (A.17) combines YOLO and Faster R-CNN, while TSEN (A.33) uses ensembles of 3 two-stage methods: Faster R-CNN, Guided Anchoring and Libra R-CNN. ERCNNs (A.18) is generated by Faster R-CNN and Cascade R-CNN with different backbones. Libra-HBR (A.23) considers SNIPER, Libra R-CNN, and Cascade R-CNN. Different from the FPN model, more multi-scale fusion strategies are proposed. CNAnet (A.8) uses multi-neighbor layer fusion modules to fuse the current layer with its multiple neighboring higher layers. HRDet+ (A.21) maintains a high-resolution representation by connecting high-to-low convolutions in parallel. TridentNet (A.32) constructs a parallel multi-branch architecture where each branch shares the same transformation parameters but has a different receptive field.

In the VisDrone-DET2019 challenge, 47 different detection methods were submitted, of which 33 perform better than the state-of-the-art detector Cascade R-CNN. Besides Cascade R-CNN, the results of another 6 baseline methods are given: CornerNet, Light-RCNN, DetNet59, RefineDet, RetinaNet and FPN. For these baselines, default parameters are used or set to reasonable values. Thus, there are 39 algorithms in total in the challenge.

What does "result of anchor" mean?

3.3. Results and Analysis

The overall results of the submissions are presented in Table 2. We find that DPNet-ensemble (A.15) achieves the best performance among all submitted methods, i.e., 29.62% AP score. It follows the idea of FPN [24] and improves Cascade R-CNN [2] by adding a global context module (GC) and deformable convolution (DC) into the backbone network. RRNet and ACM-OD rank in second place with more than 29% AP score. RRNet (A.28) is an anchor-free detector based on CenterNet, where a re-regression module predicts the bias between the coarse bounding boxes and the ground truth. ACM-OD (A.1) employs the FPN framework and an active learning scheme operating jointly with object augmentation. Among the 7 baseline methods provided by the VisDrone team, CornerNet achieves the best performance, while RetinaNet performs the worst.
We also report the detection results for each object category in Table 3. We observe that all the best results for the different kinds of objects are produced by the detectors with the top 6 AP scores. However, they do not achieve good detection results for person and awning-tricycle. This is because a person does not always maintain a standing pose and usually overlaps with other kinds of objects such as tricycles and bicycles, while awning-tricycle lacks training data.

What is a context module?

deformable
coarse
