Learning RoI Transformer for Detecting Oriented Objects in Aerial Images

Jian Ding, Nan Xue, Yang Long, Gui-Song Xia, Qikai Lu

Wuhan University (WHU)
Computational and Photogrammetric Vision Team (CAPTAIN)
State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS)
Computer Science (CS)
Computer Vision (CV)
aerial ['eərɪəl]: adj. in the air, of aviation, airy, imaginary. n. antenna
orient ['ɔːrɪənt; 'ɒr-]: v. to face, determine the position of, adapt. n. the East. adj. eastern, rising (of the sun), lustrous (of gems)
transformer [træns'fɔː(r)mə(r)]: n. electrical transformer, converter
corresponding author: the author who handles correspondence about a paper

arXiv (archive - the X represents the Greek letter chi [χ]) is a repository of electronic preprints approved for posting after moderation, but not full peer review.

Abstract

Object detection in aerial images is an active yet challenging task in computer vision because of the bird's-eye view perspective, the highly complex backgrounds, and the variant appearances of objects. Especially when detecting densely packed objects in aerial images, methods relying on horizontal proposals for common object detection often introduce mismatches between the Regions of Interest (RoIs) and objects. This leads to the common misalignment between the final object classification confidence and localization accuracy. Although rotated anchors have been used to tackle this problem, their design always multiplies the number of anchors and dramatically increases the computational complexity. In this paper, we propose a RoI Transformer to address these problems. More precisely, to improve the quality of region proposals, we first design a Rotated RoI (RRoI) learner to transform a Horizontal Region of Interest (HRoI) into a Rotated Region of Interest (RRoI). Based on the RRoIs, we then propose a Rotated Position Sensitive RoI Align (RPS-RoI-Align) module to extract rotation-invariant features from them for boosting subsequent classification and regression. Our RoI Transformer is lightweight and can be easily embedded into detectors for oriented object detection. A simple implementation of the RoI Transformer has achieved state-of-the-art performance on two common and challenging aerial datasets, i.e., DOTA and HRSC2016, with a neglectable reduction in detection speed. Our RoI Transformer outperforms the deformable Position Sensitive RoI pooling when oriented bounding-box annotations are available. Extensive experiments have also validated the flexibility and effectiveness of our RoI Transformer. The results demonstrate that it can be easily integrated with other detector architectures and significantly improve performance.

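boost [buːst]: vt. to promote, increase, support. vi. to advertise, to steal. n. a push, help, promotion
variant ['veərɪənt]: n. a variation, an altered form. adj. different, varied
appearance [ə'pɪər(ə)ns]: n. look, outward form, emergence, showing up
bird view: bird's-eye view
perspective [pə'spektɪv]: n. viewpoint, vista, perspective drawing. adj. of perspective
pack [pæk]: n. package, group, backpack, parcel, deck. vt. to wrap, compress, bundle, select, fill up. vi. to crowd, pack goods, be packed, cluster
horizontal [hɒrɪ'zɒnt(ə)l]: adj. level, of the horizon, of the same rank. n. horizontal line, plane, or position
region of interest (ROI)
tackle ['tæk(ə)l]: v. to deal with, handle, confront someone, tackle (in sport), seize
dramatically [drə'mætɪkəlɪ]: adv. theatrically, strikingly, markedly, sharply
Object Detection in Aerial Images (ODAI)
neglectable [nɪ'ɡlektəbl]: adj. small enough to ignore, negligible
deformable [,di'fɔ:məbl]: adj. able to change shape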

Object detection based on horizontal bounding boxes (Horizontal Task) and object detection based on oriented bounding boxes (Oriented Task)

1 Introduction

Object detection in aerial images aims at locating objects of interest (e.g., vehicles, airplanes) on the ground and identifying their categories. With more and more aerial images becoming available, object detection in aerial images has been a specific but active topic in computer vision [1-4]. However, unlike natural images, which are often taken from horizontal perspectives, aerial images are typically taken from a bird's-eye view, which implies that objects in aerial images are arbitrarily oriented. Moreover, the highly complex backgrounds and variant appearances of objects further increase the difficulty of object detection in aerial images. These problems have often been approached as an oriented and densely packed object detection task [5-7], which is new yet well-grounded and has attracted much attention in the past decade [8-12].

imply [ɪm'plaɪ]: vt. to mean, suggest, entail
decade ['dekeɪd; dɪ'keɪd]: n. ten years, a ten-year period, a group of ten

Much of the recent progress on object detection in aerial images has benefited a lot from the R-CNN frameworks [2, 4, 7, 13-18]. These methods have reported promising detection performance by using horizontal bounding boxes as regions of interest (RoIs) and then relying on region-based features for category identification [2, 4, 16]. However, as observed in [5, 19], these horizontal RoIs (HRoIs) typically lead to misalignments between the bounding boxes and objects. For instance, as shown in Fig. 1, because objects in aerial images are oriented and densely distributed, several object instances are often crowded together and contained in one HRoI. As a result, it usually turns out to be difficult to train a detector to extract object features and identify each object's accurate localization.

promise ['prɒmɪs]: n. a pledge, hope. vt. to pledge, give grounds for expecting. vi. to make a pledge, show potential

Figure 1: Horizontal (top) vs. rotated RoI warping (bottom) illustrated in an image with many densely packed objects. One horizontal RoI often contains several instances, which introduces ambiguity into the subsequent classification and localization tasks. By contrast, rotated RoI warping usually provides more accurate regions for instances and enables better extraction of discriminative features for object detection.

discriminative [dɪs'krɪmɪnətɪv]: adj. distinguishing, discriminatory, discerning
warp [wɔːp]: n. a bend, distortion, bias. vt. to deform, bias, misrepresent. vi. to become bent or twisted

Instead of horizontal bounding boxes, oriented bounding boxes have been employed to eliminate the mismatch between RRoIs and the corresponding objects [5, 19, 20]. In order to achieve high recall at the RRoI generation phase, a large number of anchors with different angles, scales and aspect ratios are required. These methods have demonstrated promising potential for detecting sparsely distributed objects [8-10, 21]. However, due to the highly diverse directions of objects in aerial images, it is often intractable to acquire accurate RRoIs that pair with all the objects in an aerial image by using RRoIs with limited directions. Consequently, the elaborate design of RRoIs with as many directions and scales as possible usually suffers from high computational complexity at the region classification and localization phases.

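consequently ['kɒnsɪkw(ə)ntlɪ]: adv. therefore, as a result
sparsely ['spɑrsli]: adv. thinly, scantily
diverse [daɪ'vɜːs; 'daɪvɜːs]: adj. different, varied, of many kinds
intractable [ɪn'træktəb(ə)l]: adj. hard to deal with, hard to cure, stubborn, unruly
elaborate [ɪ'læb(ə)rət]: adj. carefully made, detailed, painstaking. vt. to work out carefully, explain in detail, synthesize from simple components. vi. to describe in detail, become complex
ideal [aɪ'dɪəl; aɪ'diːəl]: adj. perfect, imagined, unrealistic. n. an ideal, a model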

As the regular operations in conventional networks for object detection [14] have limited generalization to rotation and scale variations, some orientation and scale invariance is required in the design of RoIs and the corresponding extracted features. To this end, the Spatial Transformer [22] and the deformable convolution and RoI pooling layers [23] have been proposed to model geometric variations. However, they are mainly designed for general geometric deformation and do not use oriented bounding box annotations. In aerial images, there is only rigid deformation, and oriented bounding box annotations are available. Thus, it is natural to argue that it is important to extract rotation-invariant region features and to eliminate the misalignment between region features and objects, especially for densely packed ones.

regular ['regjʊlə]: adj. periodic, orderly, qualified, tidy, ordinary. n. a frequent customer, a full team member, a mainstay. adv. regularly, often
rigid ['rɪdʒɪd]: adj. strict, stiff, inflexible, hard, exact

In this paper, we propose a module called RoI Transformer, which targets the detection of oriented and densely packed objects through supervised RRoI learning and feature extraction based on position sensitive alignment within a two-stage framework [13-15, 24, 25]. It consists of two parts. The first is the RRoI Learner, which learns the transformation from HRoIs to RRoIs. The second is the Rotated Position Sensitive RoI Align, which extracts rotation-invariant features from the RRoI for subsequent object classification and location regression. To further improve the efficiency, we adopt a light head structure for all RoI-wise operations. We extensively test and evaluate the proposed RoI Transformer on two public datasets for object detection in aerial images, i.e., DOTA [5] and HRSC2016 [19], and compare it with state-of-the-art approaches, such as deformable PS RoI pooling [23]. In summary, our contributions are threefold:

  • We propose a supervised rotated RoI learner, which is a learnable module that can transform horizontal RoIs into RRoIs. This design not only effectively alleviates the misalignment between RoIs and objects, but also avoids the large number of RRoIs otherwise designed for oriented object detection.

  • We design a Rotated Position Sensitive RoI Alignment module for spatially invariant feature extraction, which can effectively boost object classification and location regression. The module is a crucial design when using light-head RoI-wise operations, which guarantees efficiency and low complexity.

  • We achieve state-of-the-art performance on several public large-scale datasets for oriented object detection in aerial images. Experiments also show that the proposed RoI Transformer can be easily embedded into other detector architectures with significant detection performance improvements.

alleviate [ə'liːvɪeɪt]: vt. to lessen, ease
grantee [ɡrɑːn'tiː]: n. assignee, recipient of a grant

2 Related Work

2.1 Oriented Bounding Box Regression

Detecting oriented objects is an extension of general horizontal object detection. The objective of this problem is to locate and classify an object with orientation information, which is mainly tackled with methods based on region proposals. HRoI-based methods [5, 26] usually use a normal RoI warping to extract features from an HRoI and regress position offsets relative to the ground truths. HRoI-based methods suffer from misalignment between region features and instances. RRoI-based methods [9, 10] usually use a rotated RoI warping to extract features from a RRoI and regress position offsets relative to the RRoI, which can avoid the problem of misalignment to a certain extent.

misalignment [mɪsə'laɪnmənt]: n. lack of coincidence, failure to line up

However, RRoI-based methods involve generating a lot of rotated proposals. [10] adopted the method in [8] for rotated proposals. The SRBBS [8] is difficult to embed in a neural network, so it costs extra time for rotated proposal generation. [9, 12, 21, 27] used a design of rotated anchors in the RPN [15]. However, the design is still time-consuming due to the dramatic increase in the number of anchors (num_scales $\times$ num_aspect_ratios $\times$ num_angles), for example, $3 \times 5 \times 6 = 90$ anchors at a location. A large number of anchors increases the computation of parameters in the network and also degrades the efficiency of matching between proposals and ground truths. Furthermore, direct matching between oriented bounding boxes (OBBs) is harder than that between horizontal bounding boxes (HBBs) because of the existence of plenty of redundant rotated anchors. Therefore, in the design of rotated anchors, both [9] and [28] used a relaxed matching strategy: some anchors that do not achieve an IoU above 0.5 with any ground truth are still assigned as True Positive samples, which can still cause the problem of misalignment. In this work, we still use horizontal anchors. The difference is that once the HRoIs are generated, we transform them into RRoIs by a light fully connected layer. With this strategy, it is unnecessary to increase the number of anchors, and many precise RRoIs can be acquired, which boosts the matching process. So we directly use the IoU between OBBs as the matching criterion, which can effectively avoid the problem of misalignment.
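The anchor arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `anchors_per_location` is hypothetical, and the counts (3 scales, 5 aspect ratios, 6 angles) are the ones quoted in the text.

```python
def anchors_per_location(num_scales: int, num_aspect_ratios: int, num_angles: int = 1) -> int:
    """Number of anchors placed at a single feature-map location."""
    return num_scales * num_aspect_ratios * num_angles


# Horizontal anchors have a single (axis-aligned) orientation.
horizontal = anchors_per_location(3, 5)        # 15
# Rotated anchors multiply the count by the number of angles.
rotated = anchors_per_location(3, 5, 6)        # 90

print(horizontal, rotated, rotated // horizontal)
```

With these numbers, rotated anchors cost six times as many candidates per location as horizontal ones, which is the overhead the RoI Transformer avoids by keeping horizontal anchors.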

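dramatic [drə'mætɪk]: adj. theatrical, sharp, striking, exciting
degrade [dɪ'greɪd]: vt. to belittle, disgrace, demote, break down. vi. to drop in rank, decline, deteriorate
criterion [kraɪ'tɪərɪən]: n. standard, rule, norm, basis for judgment
oriented bounding box (OBB)
horizontal bounding box (HBB)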

2.2 Spatial-invariant Feature Extraction

CNN frameworks generalize well to translation-invariant features while showing poor performance on rotation and scale variations. For image feature extraction, the Spatial Transformer [22] and deformable convolution [23] were proposed for modeling arbitrary deformation. They are learned from the target tasks without extra supervision. For region feature extraction, the deformable RoI pooling [23] was proposed, which is achieved by learning offsets for the sampling grid of RoI pooling. It can better model deformation at the instance level compared to regular RoI warping [14, 24, 25]. The STN and deformable modules are widely used for recognition in the fields of scene text and aerial images [29-33]. As for object detection in aerial images, there are more rotation and scale variations, but hardly any nonrigid deformation. Therefore, our RoI Transformer only models the rigid spatial transformation, which is learned in the format of $(d_{x}, d_{y}, d_{w}, d_{h}, d_{\theta})$. However, different from deformable RoI pooling, our RoI Transformer learns the offsets with the supervision of ground truth. The RRoIs can also be used for further rotated bounding box regression, which can also contribute to the object localization performance.

nonrigid [nɒn'rɪdʒɪd]: adj. not rigid

2.3 Light RoI-wise Operations

RoI-wise operation is the bottleneck of efficiency in two-stage algorithms because the computation is not shared. The Light-head R-CNN [34] was proposed to address this problem by using a larger separable convolution to get a thin feature. It also employs the PS RoI pooling [24] to further reduce the dimensionality of feature maps. A single fully connected layer is applied on the pooled features with a dimensionality of 10, which can significantly improve the speed of two-stage algorithms. In aerial images, there exist scenes where the number of instances is large. For example, over 800 instances are densely packed on a single $1024 \times 1024$ image. Our approach is similar to Deformable RoI pooling [23], where the RoI-wise operations are conducted twice. The light-head design is also employed to guarantee efficiency.

bottleneck ['bɒt(ə)lnek]: n. bottleneck, obstacle
guarantee [gær(ə)n'tiː]: n. assurance, warranty, guarantor, written guarantee, collateral. vt. to ensure, vouch for

3 RoI Transformer

In this section, we present the details of our proposed RoI Transformer. It contains a trainable fully connected layer, termed the RRoI Learner, which learns the rotated RoIs from the estimated horizontal RoIs, and a RRoI warping layer, which then warps the feature maps to maintain the rotation invariance of deep features. Both layers are differentiable for end-to-end training. The architecture is shown in Fig. 2.

differentiable [,dɪfə'renʃɪəb(ə)l]: adj. differentiable (in calculus), distinguishable

Figure 2: The architecture of the RoI Transformer. Each HRoI is passed to a RRoI learner. The RRoI learner in our network is a PS RoI Align followed by a fully connected layer with a dimension of 5, which regresses the offsets of the rotated ground truth (RGT) relative to the HRoI. The box decoder is at the end of the RRoI Learner; it takes the HRoI and the offsets as input and outputs the decoded RRoIs. Then the feature map and the RRoI are passed to the RRoI warping for geometry robust feature extraction. The combination of the RRoI Learner and RRoI warping forms a RoI Transformer (RT). The geometry robust pooled feature from the RoI Transformer is then used for classification and RRoI regression.

3.1 RRoI Learner

The RRoI learner aims at learning rotated RoIs from the feature map of horizontal RoIs. Suppose we have obtained $n$ horizontal RoIs denoted by $\{\mathcal{H}_{i}\}$, each in the format $(x, y, w, h)$ for the predicted 2D location, width and height of an HRoI; the corresponding feature maps can be denoted as $\{\mathcal{F}_{i}\}$ with the same index. Since every HRoI is the external rectangle of a RRoI in ideal scenarios, we try to infer the geometry of the RRoI from every feature map $\mathcal{F}_{i}$ using fully connected layers. We follow the offset learning for object detection to devise the regression target as

$$
\begin{aligned}
t_{x}^{\ast} &= \frac{1}{w_{r}}\left( (x^{\ast} - x_{r})\cos\theta_{r} + (y^{\ast} - y_{r})\sin\theta_{r} \right),\\
t_{y}^{\ast} &= \frac{1}{h_{r}}\left( (y^{\ast} - y_{r})\cos\theta_{r} - (x^{\ast} - x_{r})\sin\theta_{r} \right),\\
t_{w}^{\ast} &= \log\frac{w^{\ast}}{w_{r}}, \quad t_{h}^{\ast} = \log\frac{h^{\ast}}{h_{r}},\\
t_{\theta}^{\ast} &= \frac{1}{2\pi} \left( (\theta^{\ast} - \theta_{r}) \bmod 2\pi \right),
\end{aligned} \tag{1}
$$

where $(x_{r}, y_{r}, w_{r}, h_{r}, \theta_{r})$ is a stacked vector representing the location, width, height and orientation of a RRoI, respectively, and $(x^{\ast}, y^{\ast}, w^{\ast}, h^{\ast}, \theta^{\ast})$ are the ground truth parameters of an oriented bounding box. The modular operation wraps the angle offset $\theta^{\ast} - \theta_{r}$ into $[0, 2\pi)$ for the convenience of computation, so the target $t_{\theta}^{\ast}$ lies in $[0, 1)$. Indeed, the target for HRoI regression is a special case of Eq. (1) if $\theta^{\ast} = \frac{3\pi}{2}$. The relative offsets are illustrated in Fig. 3. Mathematically, the fully connected layer outputs a vector $(t_{x}, t_{y}, t_{w}, t_{h}, t_{\theta})$ for every feature map $\mathcal{F}_{i}$ by

devise [dɪ'vaɪz]: vt. to design, think up, invent, plot, bequeath. n. a bequest
modular ['mɒdjʊlə]: adj. modular, of a modulus, built from standard components

$$
t = \mathcal{G}(\mathcal{F}; \Theta) \tag{2}
$$

where $\mathcal{G}$ represents the fully connected layer, $\Theta$ denotes the weight parameters of $\mathcal{G}$, and $\mathcal{F}$ is the feature map of an HRoI.
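To make the regression target of Eq. (1) concrete, here is a minimal pure-Python sketch of the encoding step. The function name `encode_rroi_target` and the tuple layout `(x, y, w, h, theta)` with angles in radians are assumptions made for illustration, not the authors' code.

```python
import math


def encode_rroi_target(roi, gt):
    """Encode a ground-truth OBB relative to a reference RoI, following Eq. (1).

    Both arguments are (x, y, w, h, theta) tuples, theta in radians.
    """
    x_r, y_r, w_r, h_r, theta_r = roi
    x_g, y_g, w_g, h_g, theta_g = gt
    dx, dy = x_g - x_r, y_g - y_r
    cos_t, sin_t = math.cos(theta_r), math.sin(theta_r)
    # Center offsets are expressed in the coordinate frame bound to the RoI.
    t_x = (dx * cos_t + dy * sin_t) / w_r
    t_y = (dy * cos_t - dx * sin_t) / h_r
    # Scale offsets are log-ratios, as in standard box regression.
    t_w = math.log(w_g / w_r)
    t_h = math.log(h_g / h_r)
    # The angle difference is wrapped into [0, 2*pi) and normalized.
    t_theta = ((theta_g - theta_r) % (2 * math.pi)) / (2 * math.pi)
    return (t_x, t_y, t_w, t_h, t_theta)


# Example: a reference RoI with theta_r = 0 (an assumed value for this demo).
target = encode_rroi_target((10.0, 20.0, 4.0, 2.0, 0.0),
                            (12.0, 21.0, 8.0, 2.0, math.pi / 2))
print(target)
```

In this example the center offsets are normalized by the reference width and height, the width doubles (so `t_w` is `log 2`), and a quarter turn maps to `t_theta = 0.25`.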

Figure 3: An example explaining the relative offset. There are three coordinate systems. $XOY$ is bound to the image. $x_{1}O_{1}y_{1}$ and $x_{2}O_{2}y_{2}$ are bound to two RRoIs (blue rectangles), respectively. The yellow rectangle represents the RGT. The right two rectangles are obtained from the left two rectangles by translation and rotation while keeping the relative position unchanged. $(\Delta x_{1}, \Delta y_{1})$ is not equal to $(\Delta x_{2}, \Delta y_{2})$ if both are expressed in $XOY$. They are the same if $(\Delta x_{1}, \Delta y_{1})$ is expressed in $x_{1}O_{1}y_{1}$ and $(\Delta x_{2}, \Delta y_{2})$ in $x_{2}O_{2}y_{2}$. $\alpha_{1}$ and $\alpha_{2}$ denote the angles of the two RRoIs, respectively.
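The point of Fig. 3 can be reproduced numerically. The sketch below, with hypothetical helper names and made-up coordinates, moves a RRoI and its ground-truth center by the same rotation and translation, then compares the center offsets in the image frame and in the frame bound to each RRoI.

```python
import math


def rigid_transform(point, angle, shift):
    """Rotate a point about the origin by `angle`, then translate by `shift`."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1])


def local_offset(roi_center, roi_angle, gt_center):
    """Offset of gt_center expressed in the frame bound to the RRoI."""
    dx = gt_center[0] - roi_center[0]
    dy = gt_center[1] - roi_center[1]
    c, s = math.cos(roi_angle), math.sin(roi_angle)
    return (dx * c + dy * s, dy * c - dx * s)


# First configuration: a RRoI and the center of its ground truth.
roi1_center, roi1_angle = (10.0, 20.0), math.radians(30)
gt1_center = (13.0, 22.0)

# Second configuration: the same pair after a rotation and a translation.
angle, shift = math.radians(45), (50.0, -5.0)
roi2_center = rigid_transform(roi1_center, angle, shift)
gt2_center = rigid_transform(gt1_center, angle, shift)
roi2_angle = roi1_angle + angle

image_offset_1 = (gt1_center[0] - roi1_center[0], gt1_center[1] - roi1_center[1])
image_offset_2 = (gt2_center[0] - roi2_center[0], gt2_center[1] - roi2_center[1])
local_1 = local_offset(roi1_center, roi1_angle, gt1_center)
local_2 = local_offset(roi2_center, roi2_angle, gt2_center)

print(image_offset_1, image_offset_2)  # differ in the image frame
print(local_1, local_2)                # agree in the RRoI-bound frames
```

This is why Eq. (1) rotates the center offsets by $\theta_{r}$: the regression target then depends only on the relative pose of the RRoI and its ground truth, not on where the pair sits in the image.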

While training the layer $\mathcal{G}$, we need to match the input HRoIs with the ground truth oriented bounding boxes (OBBs). For computational efficiency, the matching is performed between the HRoIs and the axis-aligned bounding boxes of the original ground truth. Once an HRoI is matched, we set $t_{\theta}^{\ast}$ directly by the definition in Eq. (1). The Smooth L1 loss [13] is used as the loss function for optimization. For the predicted $t$ in every forward pass, we decode it from offsets to the parameters of a RRoI. That is to say, our proposed RRoI learner can learn the parameters of the RRoI from the HRoI feature map $\mathcal{F}$.
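The decoding step and the loss can be sketched as follows. The decoder is simply the algebraic inverse of Eq. (1), derived here for illustration; the function names, the tuple layout, and the `beta` threshold parameter of the Smooth L1 loss are assumptions rather than details given in the text.

```python
import math


def decode_rroi(roi, t):
    """Invert Eq. (1): turn predicted offsets t into RRoI parameters."""
    x_r, y_r, w_r, h_r, theta_r = roi
    t_x, t_y, t_w, t_h, t_theta = t
    cos_t, sin_t = math.cos(theta_r), math.sin(theta_r)
    # Rotate the local center offsets back into the image frame.
    x = x_r + t_x * w_r * cos_t - t_y * h_r * sin_t
    y = y_r + t_x * w_r * sin_t + t_y * h_r * cos_t
    w = w_r * math.exp(t_w)
    h = h_r * math.exp(t_h)
    theta = (theta_r + 2 * math.pi * t_theta) % (2 * math.pi)
    return (x, y, w, h, theta)


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over coordinates (quadratic below beta, linear above)."""
    total = 0.0
    for p, q in zip(pred, target):
        d = abs(p - q)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


decoded = decode_rroi((10.0, 20.0, 4.0, 2.0, 0.0),
                      (0.5, 0.5, math.log(2.0), 0.0, 0.25))
loss = smooth_l1((0.0, 2.0), (0.5, 0.0))
print(decoded, loss)
```

Decoding the offsets `(0.5, 0.5, log 2, 0, 0.25)` against the reference `(10, 20, 4, 2, 0)` gives a box centered at `(12, 21)` with size `8 x 2` and angle `pi / 2`.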

3.2 Rotated Position Sensitive RoI Align

Once the parameters of the RRoI are obtained, we are able to extract rotation-invariant deep features for oriented object detection. Here, we propose the Rotated Position Sensitive (RPS) RoI Align module to extract the rotation-invariant features within a network.

Given the input feature map $\mathcal{D}$ with $H \times W \times C$ channels and a RRoI $(x_{r}, y_{r}, w_{r}, h_{r}, \theta_{r})$, where $(x_{r}, y_{r})$ denotes the center of the RRoI, $(w_{r}, h_{r})$ denotes its width and height, and $\theta_{r}$ gives its orientation, the RPS RoI pooling divides the rotated RoI into $K \times K$ bins and outputs a feature map $\mathcal{Y}$ with the shape of $K \times K \times C$. For the bin with index $(i, j)$ $(0 \leq i, j < K)$ of the output channel $c$ $(0 \leq c < C)$, we have

$$
\mathcal{Y}_{c}(i,j) = \sum_{(x,y) \in bin(i,j)} D_{i,j,c}(\mathcal{T}_{\theta}(x,y)) / n_{i,j}, \tag{3}
$$

where $D_{i,j,c}$ is one feature map out of the $K \times K \times C$ feature maps. The channel mapping is the same as in the original Position Sensitive RoI pooling [24]. $n_{i,j}$ is the number of sampling locations in the bin. $bin(i,j)$ denotes the coordinate set $\{ i \frac{w_{r}}{k} + (s_{x} + 0.5) \frac{w_{r}}{k \times n}; s_{x} = 0, 1, \ldots, n-1 \} \times \{ j \frac{h_{r}}{k} + (s_{y} + 0.5) \frac{h_{r}}{k \times n}; s_{y} = 0, 1, \ldots, n-1 \}$. Each $(x, y) \in bin(i,j)$ is converted to $(x', y')$ by $\mathcal{T}_{\theta}$, where

$$
\left( \begin{array}{c} x'\\ y' \end{array} \right) = \left( \begin{array}{cc} \cos\theta & -\sin\theta\\ \sin\theta & \cos\theta \end{array} \right) \left( \begin{array}{c} x - w_{r}/2\\ y - h_{r}/2 \end{array} \right) + \left( \begin{array}{c} x_{r}\\ y_{r} \end{array} \right), \tag{4}
$$

Typically, Eq. (3) is implemented by bilinear interpolation.
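Eqs. (3) and (4) can be sketched for a single bin in plain Python. This is an illustrative reading of the formulas, not the authors' implementation: the helper names are hypothetical, the RRoI is assumed to be given directly in feature-map coordinates (no stride scaling), `plane` stands for the already selected position-sensitive map $D_{i,j,c}$ stored as rows `plane[y][x]`, and samples falling outside the map are clamped to its border.

```python
import math


def bin_samples(i, j, w_r, h_r, k, n):
    """n x n sampling points of bin (i, j), in RRoI-local coordinates."""
    xs = [i * w_r / k + (s + 0.5) * w_r / (k * n) for s in range(n)]
    ys = [j * h_r / k + (s + 0.5) * h_r / (k * n) for s in range(n)]
    return [(x, y) for x in xs for y in ys]


def to_image(x, y, rroi):
    """Eq. (4): map a RRoI-local point to feature-map coordinates."""
    x_r, y_r, w_r, h_r, theta = rroi
    u, v = x - w_r / 2, y - h_r / 2
    c, s = math.cos(theta), math.sin(theta)
    return (c * u - s * v + x_r, s * u + c * v + y_r)


def bilinear(plane, x, y):
    """Bilinearly interpolate plane[y][x] at a real-valued location."""
    height, width = len(plane), len(plane[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * plane[y0][x0] + ax * plane[y0][x1]
    bottom = (1 - ax) * plane[y1][x0] + ax * plane[y1][x1]
    return (1 - ay) * top + ay * bottom


def pool_bin(plane, rroi, i, j, k, n):
    """Eq. (3): average of the interpolated samples of bin (i, j)."""
    samples = bin_samples(i, j, rroi[2], rroi[3], k, n)
    values = [bilinear(plane, *to_image(x, y, rroi)) for x, y in samples]
    return sum(values) / len(values)


# Toy 8 x 8 plane whose value equals the x coordinate.
plane = [[float(x) for x in range(8)] for _ in range(8)]
upright = pool_bin(plane, (4.0, 4.0, 4.0, 2.0, 0.0), 0, 0, 2, 2)
rotated = pool_bin(plane, (4.0, 4.0, 4.0, 2.0, math.pi / 2), 0, 0, 2, 2)
print(upright, rotated)
```

Because the toy plane is linear in $x$, each pooled value equals the mean $x'$ coordinate of the bin's sampling points, which makes the effect of the rotation in Eq. (4) easy to inspect by hand.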

interpolation [ɪn,tɜːpəʊ'leɪʃən]: n. insertion, tampering, filling in, interpolation (in mathematics)
geometry [dʒɪ'ɒmɪtrɪ]: n. the study of geometry, geometric structure

Figure 4: Rotated RoI warping. The shape of the warped feature is a horizontal rectangle (we use $3 \times 3$ as an example here). The sampling grid for RoI warping is determined by the RRoI $(x_{r}, y_{r}, w_{r}, h_{r}, \theta_{r})$. We show the image instead of the feature map for better explanation. After RRoI warping, the extracted features are geometry robust (the orientations of all the vehicles are the same).

3.3 RoI Transformer for Oriented Object Detection

The combination of the RRoI Learner and RPS RoI Align forms a RoI Transformer (RT) module. It can be used to replace the normal RoI warping operation. The pooled feature from RT is rotation-invariant, and the RRoIs provide a better initialization for the later regression because the matched RRoI is closer to the RGT than the matched HRoI is. As mentioned before, a RRoI is a tuple with 5 elements $(x_{r}, y_{r}, w_{r}, h_{r}, \theta_{r})$. To eliminate ambiguity, we use $h$ to denote the short side and $w$ the long side of a RRoI. The orientation perpendicular to $h$ and falling in $[0, \pi]$ is chosen as the final direction of a RRoI. These conventions remove the ambiguity in the representation and are required to reduce the rotation variations.
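The convention above (long side $w$, short side $h$, angle restricted to half a turn) can be sketched as a small normalization step. The function name is hypothetical, angles are assumed to be in radians, and the half-open range $[0, \pi)$ is used here so that each box has exactly one representation; the text itself states the closed range $[0, \pi]$.

```python
import math

def normalize_rroi(x, y, w, h, theta):
    """Return an equivalent box with w >= h and theta in [0, pi)."""
    if w < h:
        # Swapping the sides is the same box rotated by a quarter turn.
        w, h = h, w
        theta += math.pi / 2.0
    # A rectangle is unchanged by a half-turn, so the angle is reduced modulo pi.
    theta = theta % math.pi
    return (x, y, w, h, theta)
```

For example, a box given as width 2, height 10 and angle 0 is rewritten as width 10, height 2 and angle $\pi/2$, which describes the same rectangle.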

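ambiguity [æmbɪ'gjuːɪtɪ]: n. vagueness, lack of clarity, an equivocal statement
vertical ['vɜːtɪk(ə)l]: adj. vertical, upright, overhead, of the vertex, lengthwise n. vertical line, vertical plane, vertical position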

IoU between OBBs. In common deep-learning-based detectors, IoU calculation is needed in two cases: the first is the matching process, and the second is Non-Maximum Suppression (NMS). The IoU between two OBBs can be calculated by Eq. (5):

$$
IoU = \frac{area(B_{1} \cap B_{2})}{area(B_{1} \cup B_{2})}, \tag{5}
$$

where $B_1$ and $B_2$ represent two OBBs, say, a RRoI and a RGT. The calculation of IoU between OBBs is similar to that between horizontal bounding boxes (HBBs). The only difference is that the IoU calculation for OBBs is performed on polygons, as illustrated in Fig. 5. In our model, during the matching process, each RRoI is assigned as a True Positive if its IoU with any RGT is over 0.5. It is worth noting that although a RRoI and a RGT are both quadrilaterals, their intersection may be a polygon of various shapes, e.g. a hexagon as shown in Fig. 5(a). For long and thin bounding boxes, a slight jitter in the angle may cause the IoU of two predicted OBBs to be very low, which makes NMS difficult, as can be seen in Fig. 5(b).
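One way to evaluate Eq. (5) for two OBBs is to clip one rectangle against the other and measure the resulting polygon. The sketch below uses Sutherland-Hodgman clipping and the shoelace formula. This is a choice made for illustration, as the text does not say which polygon routine the authors use, and all function names are hypothetical. Boxes are `(x, y, w, h, theta)` with `theta` in radians.

```python
import math

def obb_corners(x, y, w, h, theta):
    """Corners of an OBB, ordered counter-clockwise in a y-up frame."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + c * dx - s * dy, y + s * dx + c * dy) for dx, dy in half]

def polygon_area(pts):
    """Shoelace formula; the sign depends on the vertex order."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0

def _side(a, b, p):
    """Positive when p lies to the left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def clip(subject, clipper):
    """Sutherland-Hodgman: clip a convex polygon by a convex CCW polygon."""
    output = subject
    for a, b in zip(clipper, clipper[1:] + clipper[:1]):
        if not output:
            break
        current, output = output, []
        for p, q in zip(current, current[1:] + current[:1]):
            sp, sq = _side(a, b, p), _side(a, b, q)
            if sp >= 0:
                output.append(p)  # p is inside or on the clipping edge
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)  # edge p -> q crosses the clipping line
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return output

def obb_iou(box1, box2):
    """IoU of two oriented boxes via polygon intersection."""
    poly = clip(obb_corners(*box1), obb_corners(*box2))
    inter = abs(polygon_area(poly)) if len(poly) >= 3 else 0.0
    union = box1[2] * box1[3] + box2[2] * box2[3] - inter
    return inter / union if union > 0 else 0.0
```

As a check on the remark about long, thin boxes: for two concentric 100 $\times$ 4 boxes that differ only by a 5 degree rotation, `obb_iou((0, 0, 100, 4, 0), (0, 0, 100, 4, math.radians(5)))` is roughly 0.3, already below a 0.5 threshold.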

quadrilateral [,kwɒdrɪ'læt(ə)r(ə)l]: n. quadrilateral adj. four-sided
diverse [daɪ'vɜːs; 'daɪvɜːs]: adj. different, dissimilar, varied, of many kinds
hexagon ['heksəg(ə)n]: n. hexagon adj. hexagonal
jitter ['dʒɪtə]: n. nervousness, shaking v. to be nervous, to shake

Figure 5: Examples of IoU between oriented bounding boxes (OBBs). (a) IoU between a RRoI and a matched RGT. The red hexagon indicates the intersection area between the RRoI and the RGT. (b) The intersection between two long and thin bounding boxes. For long and thin bounding boxes, a slight jitter in the angle may lead to a very low IoU between the two boxes. The red quadrilateral is the intersection area. In such a case, the predicted OBB with a score of 0.53 cannot be suppressed since the IoU is very low.

Targets Calculation. After RRoI warping, the rotation-invariant feature can be acquired. For consistency, the offsets also need to be rotation-invariant. To achieve this goal, we use the relative offsets as explained in Fig. 3. The main idea is to compute the offsets in the coordinate system bound to the RRoI rather than to the image. Eq. (1) is the derived formulation of the relative offsets.
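The idea of offsets measured in the RRoI's own coordinate system can be illustrated as follows. Eq. (1) is not reproduced in this part of the text, so the sketch assumes a common form: centre offsets projected onto the RRoI axes and normalized by its sides, log ratios for the sizes, and an angle difference normalized by $2\pi$. The exact normalization and the function names are assumptions.

```python
import math

def relative_offsets(rroi, gt):
    """Offsets of `gt` expressed in the coordinate frame bound to `rroi`."""
    x_r, y_r, w_r, h_r, th_r = rroi
    x_g, y_g, w_g, h_g, th_g = gt
    dx, dy = x_g - x_r, y_g - y_r
    c, s = math.cos(th_r), math.sin(th_r)
    t_x = (dx * c + dy * s) / w_r   # displacement along the RRoI's long axis
    t_y = (dy * c - dx * s) / h_r   # displacement along the RRoI's short axis
    t_w = math.log(w_g / w_r)
    t_h = math.log(h_g / h_r)
    t_theta = ((th_g - th_r) % (2 * math.pi)) / (2 * math.pi)  # assumed normalization
    return (t_x, t_y, t_w, t_h, t_theta)

def rotate_box(box, phi):
    """Rotate a box about the image origin by phi (used only to test invariance)."""
    x, y, w, h, th = box
    c, s = math.cos(phi), math.sin(phi)
    return (c * x - s * y, s * x + c * y, w, h, th + phi)
```

Rotating both the RRoI and the ground truth by the same angle leaves these targets unchanged, which is the rotation invariance the paragraph asks for; offsets measured along the image axes would not have this property.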

consistently [kən'sɪstəntli]: adv. consistently, uniformly, steadily

4 Experiments and Analysis

4.1 Datasets

For the experiments, we choose two datasets for oriented object detection in aerial images: DOTA [5] and HRSC2016 [19].

DOTA [5]. This is the largest dataset for object detection in aerial images with oriented bounding box annotations. It contains 2806 large-size images. There are objects of 15 categories, including Baseball diamond (BD), Ground track field (GTF), Small vehicle (SV), Large vehicle (LV), Tennis court (TC), Basketball court (BC), Storage tank (ST), Soccer-ball field (SBF), Roundabout (RA), Swimming pool (SP), and Helicopter (HC). The fully annotated DOTA images contain 188,282 instances. The instances in this dataset vary greatly in scale, orientation, and aspect ratio. As shown in [5], algorithms designed for regular horizontal object detection achieve only modest performance on it. Like PASCAL VOC [35] and COCO [36], DOTA provides an evaluation server¹.

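baseball ['beɪsbɔːl]: n. baseball, the game of baseball
diamond ['daɪəmənd]: n. diamond (gemstone), rhombus, diamond suit in cards adj. diamond-shaped, made of diamond
Baseball Diamond: baseball infield, baseball field
ground track field: n. track and field ground
tennis court: a court for playing tennis
court [kɔːt]: n. law court, playing court, royal court, flattery vt. to invite (trouble), to woo, to seek to win vi. to woo
tennis ['tenɪs]: n. tennis
soccer ['sɒkə]: n. association football, football
roundabout ['raʊndəbaʊt]: n. traffic circle, circular junction, rotating platform, merry-go-round, swivel chair, indirect route adj. indirect, circuitous, circular
helicopter ['helɪkɒptə]: n. helicopter vi. to travel by helicopter vt. to transport by helicopter
modest ['mɒdɪst]: adj. humble, unassuming, moderate, demure, shy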

¹ http://captain.whu.edu.cn/DOTAweb/

We use both the training and validation sets for training and the testing set for testing. We apply limited data augmentation. Specifically, we resize the images at two scales (1.0 and 0.5) for training and testing. After image rescaling, we crop a series of 1024 $\times$ 1024 patches from the original images with a stride of 824. For categories with a small number of samples, we apply a rotation augmentation chosen randomly from 4 angles (0, 90, 180, 270) to reduce the effect of the imbalance between categories. With all these processes, we obtain 37373 patches, far fewer than in the official baseline implementation (150,342 patches) [5]. For testing, 1024 $\times$ 1024 patches are also employed. No other tricks are used, except that the stride for image sampling is set to 512.
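The cropping step can be sketched as a sliding window. The patch size (1024) and the strides (824 for training, 512 for testing) come from the text; the handling of the image border is an assumption (one extra window is added so that the last patch ends at the image edge), and the function names are hypothetical.

```python
def window_starts(length, patch=1024, stride=824):
    """Start offsets of the sliding windows along one image dimension."""
    if length <= patch:
        # Image smaller than a patch: a single window, to be padded by the caller.
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:
        # Assumed border rule: add a final window flush with the image edge.
        starts.append(length - patch)
    return starts

def patch_boxes(width, height, patch=1024, stride=824):
    """(x_min, y_min, x_max, y_max) of every patch cropped from the image."""
    return [(x, y, x + patch, y + patch)
            for y in window_starts(height, patch, stride)
            for x in window_starts(width, patch, stride)]
```

With a stride of 824, neighbouring patches overlap by 200 pixels, so an object cut by one patch border has a chance of appearing whole in the adjacent patch.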

HRSC2016 [19]. HRSC2016 [19] is a challenging dataset for ship detection in aerial images. The images are collected from Google Earth. It contains 1061 images and more than 20 categories of ships with various appearances. The image size ranges from 300 $\times$ 300 to 1500 $\times$ 900. The training, validation and test sets include 436, 181 and 444 images, respectively. For data augmentation, we only adopt horizontal flipping. The images are resized to (512, 800), where 512 is the length of the short side and 800 is the maximum length of an image side.
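The (512, 800) resizing rule can be read as: scale the short side to 512 unless that would push the long side past 800, in which case the long side is capped at 800. A minimal sketch under that reading, with a hypothetical function name and rounding to whole pixels as an assumption:

```python
def resize_shape(width, height, short=512, max_side=800):
    """Target (width, height, scale) under a short-side / max-side rule."""
    scale = short / min(width, height)
    if scale * max(width, height) > max_side:
        # The long side would exceed the cap, so the cap decides the scale.
        scale = max_side / max(width, height)
    return (round(width * scale), round(height * scale), scale)
```

For the two extreme sizes quoted for HRSC2016, a 300 $\times$ 300 image becomes 512 $\times$ 512, while a 1500 $\times$ 900 image is limited by the long side and becomes 800 $\times$ 480.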

Google Earth: Google's virtual globe and satellite imagery program

4.2 Implementation details

Baseline Framework. For the experiments, we build the baseline network inspired by Light-Head R-CNN [34] with a ResNet101 [39] backbone. Our final detection performance is based on the FPN [40] network, which is not employed in the ablation experiments for simplicity.

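simplicity [sɪm'plɪsɪtɪ]: n. plainness, ease, naivety, foolishness
ablation [ə'bleɪʃ(ə)n]: n. ablation, removal
ablation experiment: ablation study
physiological [,fɪzɪə'lɒdʒɪkəl]: adj. of physiology, physiological
psychology [saɪ'kɒlədʒɪ]: n. psychology, mental state
nervous ['nɜːvəs]: adj. of the nerves, anxious, vigorous
surgical ['sɜːdʒɪk(ə)l]: adj. surgical, of surgery n. surgical operation, surgical ward
removal [rɪ'muːv(ə)l]: n. dismissal, moving, elimination, relocation
pioneer [paɪə'nɪə]: n. pioneer, trailblazer vt. to open up, to initiate, to advocate vi. to act as a pioneer
physiologist [,fɪzɪ'ɑlədʒɪst]: n. physiologist
lesion ['liːʒ(ə)n]: n. damage, bodily injury, functional impairment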

A basic research method of physiological psychology based on ablation, especially during the first three-quarters of the 20th century, in which an attempt is made to determine the functions of a specific region of the nervous system by examining the behavioural effects of its surgical removal. It was pioneered in 1824 by the French physiologist Marie Jean Pierre Flourens (1794-1867) and is also called a lesion experiment.

Light-Head R-CNN OBB: We modified the regression branch of the fully-connected layer in the second stage so that it predicts OBBs, similar to the work in DOTA [5]. The only difference is that we replace $((x_{i}, y_{i}), i = 1, 2, 3, 4)$ with $(x, y, w, h, \theta)$ for the representation of an OBB. Since there is an additional parameter $\theta$, we do not double the regression loss as the original Light-Head R-CNN [34] does. The hyperparameters of the large separable convolutions are set to $k = 15$, $C_{mid} = 56$, $C_{out} = 490$. OHEM [41] is not employed for sampling in the training phase. For the RPN, we use 15 anchors, the same as the original Light-Head R-CNN [34], and the batch size of the RPN [15] is set to 512. There are 6000 RoIs from the RPN before Non-Maximum Suppression (NMS) and 800 RoIs after NMS. Then 512 RoIs are sampled for the training of R-CNN. The learning rate is set to 0.0005 for the first 14 epochs and then divided by 10 for the last 4 epochs. For testing, we adopt 6000 RoIs before NMS and 1000 after NMS.

Light-Head R-CNN OBB with FPN: The Light-Head R-CNN OBB with FPN uses the FPN [40] as a backbone network. Since no source code was publicly available for Light-Head R-CNN based on FPN, our implementation details could be different. We simply added the large separable convolution on the features of every level $P_{2}, P_{3}, P_{4}, P_{5}$. The hyperparameters of the large separable convolution are set to $k = 15$, $C_{mid} = 64$, $C_{out} = 490$. The batch size of the RPN is set to 512. There are 6000 RoIs from the RPN before NMS and 600 RoIs after NMS. Then 512 RoIs are sampled for the training of R-CNN. The learning rate is set to 0.005 for the first 5 epochs and divided by a factor of 10 for the last 2 epochs.
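The two step schedules described for the baselines can be written down as a small configuration. The numbers are taken from the text; the dictionary keys, the function name and the zero-based epoch index are assumptions made for this sketch.

```python
SCHEDULES = {
    # Values from the text; key names are illustrative only.
    "light_head_obb": {"base_lr": 0.0005, "epochs_at_base": 14, "epochs_decayed": 4},
    "light_head_obb_fpn": {"base_lr": 0.005, "epochs_at_base": 5, "epochs_decayed": 2},
}

def learning_rate(config, epoch, factor=10.0):
    """Learning rate for a zero-based epoch index under a single step decay."""
    total = config["epochs_at_base"] + config["epochs_decayed"]
    if not 0 <= epoch < total:
        raise ValueError("epoch outside the training schedule")
    if epoch < config["epochs_at_base"]:
        return config["base_lr"]
    return config["base_lr"] / factor
```

Under this reading, the version without FPN trains for 18 epochs in total and the FPN version for 7.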

4.3 Comparison with Deformable PS RoI Pooling

To verify that the performance gain does not come from extra computation, we compared our method with deformable PS RoI pooling, since both employ a RoI warping operation to model geometric variations. For the experiments, we use the Light-Head R-CNN OBB as our baseline. The deformable PS RoI pooling and the RoI Transformer are each used to replace the PS RoI Align in Light-Head R-CNN [34].

Complexity. Both the RoI Transformer and deformable RoI pooling have a light localisation network, which is a fully connected layer applied to the normally pooled feature. In our RoI Transformer, only 5 parameters $(t_{x}, t_{y}, t_{w}, t_{h}, t_{\theta})$ are learned. The deformable PS RoI pooling learns offsets for each bin, so the number of parameters is 7 $\times$ 7 $\times$ 2. Our module is therefore lighter than deformable PS RoI pooling. As can be seen in Tab. 4, our RoI Transformer model uses less memory (273MB compared to 273.2MB) and runs faster in the inference phase (0.17s compared to 0.206s per image). Because we use the light-head design, the memory savings over deformable PS RoI pooling are small. However, the RoI Transformer is slower than deformable PS RoI pooling in training (0.475s compared to 0.445s), since there is an extra matching process between the RRoIs and RGTs during training.

localization [,lokəlɪ'zeʃən]: n. localization, positioning
normal ['nɔːm(ə)l]: adj. normal, regular, standard n. normal state, standard, norm, normal (line)

Detection Accuracy. The comparison results are shown in Tab. 4. The deformable PS RoI pooling outperforms the Light-Head R-CNN OBB baseline by 5.6 points, whereas it brings only a 1.4 point improvement for R-FCN [24] on Pascal VOC [35], as pointed out in [23]. This shows that geometry modeling is more important for object detection in aerial images. However, the deformable PS RoI pooling is still 3.85 points lower than our RoI Transformer. We argue that there are two reasons: 1) Our RoI Transformer can better model the geometry variations in aerial images. 2) The regression targets of deformable PS RoI pooling are still relative to the HRoI rather than using the boundary of the offsets. Our regression targets are relative to the RRoI, which gives a better initialization for regression. Visualizations of some detection results based on the Light-Head R-CNN OBB baseline, deformable Position Sensitive RoI pooling and the RoI Transformer are shown in Fig. 7, Fig. 8 and Fig. 9, respectively. The results in Fig. 7 and the first column of Fig. 8 are taken from the same large image. They show that the RoI Transformer can precisely locate the instances in scenes with densely packed objects, while the Light-Head R-CNN OBB baseline and the deformable RoI pooling localize instances less accurately. It is worth noting that the head of a truck is misclassified as a small vehicle (the blue bounding box) by all three methods, as shown in Fig. 7 and Fig. 8, although our proposed RoI Transformer has the fewest misclassified instances. The second column in Fig. 8 is a complex scene containing long and thin instances, where both the Light-Head R-CNN OBB baseline and deformable PS RoI pooling generate many false positives. These false positives are hard to suppress with NMS for the reason explained in Fig. 5(b). Benefiting from the consistency between region feature and instance, the detection results based on the RoI Transformer contain far fewer false positives.

Figure 6: Visualization of detection results from RoI Transformer in DOTA.

Figure 7: Visualization of detection results on a scene where many densely packed instances exist. We select the predicted bounding boxes with scores above 0.1, and NMS with a threshold of 0.1 is applied for duplicate removal.

Figure 8: Visualization of detection results in DOTA. The first row shows the results from the RoI Transformer. The second row shows the results from the Light-Head R-CNN OBB baseline. The last row shows the results from deformable PS RoI pooling. In the visualization, we select the predicted bounding boxes with scores above 0.1, and NMS with a threshold of 0.1 is applied for duplicate removal.

Figure 9: Visualization of detection results in DOTA. The first row shows the results from the RoI Transformer. The second row shows the results from the Light-Head R-CNN OBB baseline. The last row shows the results from deformable PS RoI pooling. In the visualization, we select the predicted bounding boxes with scores above 0.1, and NMS with a threshold of 0.1 is applied for duplicate removal.

Table 1: Results of ablation studies. We use the Light-Head R-CNN OBB detector as our baseline. The leftmost column lists the optional settings for the RoI Transformer. In the four experiments on the right, we explore the appropriate settings for the RoI Transformer.

leftmost ['lɛftmost]: adj. furthest to the left
enlarge [ɪn'lɑːdʒ; en-]: vi. to expand, to magnify, to elaborate vt. to expand, to make larger, to extend

Table 2: Comparisons with the state-of-the-art methods on HRSC2016.

Table 3: Comparisons with state-of-the-art detectors on DOTA [5]. The short names for each category can be found in Section 4.1. FR-O indicates the Faster R-CNN OBB detector, which is the official baseline provided by DOTA [5]. RRPN indicates the Rotation Region Proposal Networks, which use a rotated anchor design. R2CNN means Rotational Region CNN, which is a HRoI-based method that does not use the RRoI warping operation. RDFPN means the Rotation Dense Feature Pyramid Networks. It also uses a rotated anchor design, together with a variation of FPN. The work of Yang et al. [38] is an extension of R-DFPN.

Table 4: Comparison of our RoI Transformer with deformable PS RoI pooling and Light-Head R-CNN OBB on accuracy, speed and memory. All speeds are measured on images of size 1024 $\times$ 1024 on a single TITAN X (Pascal). The post-processing time (i.e. NMS) is not included. LR-O, DPSRP and RT denote the Light-Head R-CNN OBB, deformable Position Sensitive RoI pooling and RoI Transformer, respectively.

4.4 Ablation Studies

We conduct a series of ablation experiments on DOTA to analyze the accuracy of our proposed RoI Transformer. We use the Light-Head R-CNN OBB as our baseline and then gradually change the settings. Simply adding the RoI Transformer yields a 4.87 point improvement in mAP. The other settings are discussed in the following.

Light RRoI Learner. To guarantee efficiency, we directly apply a fully connected layer with an output dimension of 5 to the pooled features from the HRoI warping. As a comparison, we also tried more fully connected layers for the RRoI learner, as shown in the first and second columns of Tab. 1. We find a small drop (0.22 points) in mAP when we add one more fully connected layer with an output dimension of 2048 to the RRoI learner. This slight accuracy degradation is likely because the additional fully connected layer with higher dimensionality requires a longer time to converge.

Contextual RRoI. As pointed out in [9, 42], appropriate enlargement of the RoI improves performance. A horizontal RoI may contain much background, while a precise RRoI hardly contains redundant background, as illustrated in Fig. 10. Completely abandoning contextual information makes it difficult to classify and locate an instance, even for a human. Therefore, it is necessary to enlarge the region of the feature to an appropriate degree. Here, we enlarge the long side of the RRoI by a factor of 1.2 and the short side by a factor of 1.4. The enlargement of the RRoI improves AP by 2.86 points, as shown in Tab. 1.
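The enlargement step is a per-side scaling about the box centre. A minimal sketch with the factors from the text (1.2 for the long side, 1.4 for the short side); the function name is hypothetical, and the box is `(x, y, w, h, theta)`.

```python
def enlarge_rroi(rroi, long_factor=1.2, short_factor=1.4):
    """Scale the long and short sides of an RRoI; centre and angle are kept."""
    x, y, w, h, theta = rroi
    if w >= h:
        return (x, y, w * long_factor, h * short_factor, theta)
    # If the box was not normalized to w >= h, the roles of the sides swap.
    return (x, y, w * short_factor, h * long_factor, theta)
```

The short side receives the larger factor, so a thin box gains proportionally more context across its width than along its length.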

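contextual [kɒn'tekstjʊəl]: adj. of context, relating to what comes before and after
degradation [,degrə'deɪʃ(ə)n]: n. deterioration, demotion, downgrading, decline
convergence [kən'vɜːdʒəns]: n. convergence, coming together, gathering
enlargement [ɪn'lɑːdʒm(ə)nt; en-]: n. enlargement, enlarged photograph, addition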

NMS on RRoIs. Since the obtained RoIs are rotated, we are free to decide whether to conduct another NMS on the RRoIs transformed from the HRoIs. This comparison is shown in the last two columns of Tab. 1. We find an improvement of about 1.5 points in mAP if we remove this NMS. This is reasonable because more RoIs remain without the additional NMS, which could increase the recall.
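To make the trade-off concrete, here is a sketch of the greedy NMS procedure with a pluggable IoU function. It is not the authors' code: the names are hypothetical, and a horizontal-box IoU is used only as a self-contained stand-in (an OBB IoU would be passed in for RRoIs).

```python
def hbb_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_fn, threshold):
    """Indices of the boxes kept, visited in order of decreasing score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box too much.
        if all(iou_fn(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```

Every box dropped here is a candidate that the second stage never sees, which is consistent with the observation that skipping the extra NMS leaves more RoIs and can raise recall.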

duplicate [ˈdjuːplɪkeɪt]: vt. to copy, to double n. copy, replica adj. copied, twofold vi. to copy, to repeat
removal [rɪ'muːv(ə)l]: n. dismissal, moving, elimination, relocation

Figure 10: Comparison of 3 kinds of region for feature extraction. (a) The Horizontal Region. (b) The rectified Region after RRoI Warping. (c) The rectified Region with appropriate context after RRoI warping.

4.5 Comparisons with the State-of-the-art

We compared the performance of our proposed RoI Transformer with the state-of-the-art algorithms on two datasets, DOTA [5] and HRSC2016 [19]. The settings are described in Sec. 4.2; we simply replace the Position Sensitive RoI Align with our proposed RoI Transformer. Our baseline and RoI Transformer results are obtained without using OHEM [41] in the training phase.

Results on DOTA. We compared our results with the state of the art on DOTA. Note that RRPN [9] and R2CNN [26] were originally designed for scene text detection; their results come from a version re-implemented for DOTA by a third party². As can be seen in Tab. 3, our RoI Transformer achieves an mAP of 67.74 on DOTA, outperforming the previous state of the art without FPN (61.01) by 6.71 points. It even outperforms the previous FPN-based method by 5.45 points. With FPN, the Light-Head OBB baseline achieves an mAP of 66.95, which outperforms the previous state-of-the-art detectors but is still slightly lower than the RoI Transformer. When the RoI Transformer is added to the Light-Head OBB FPN baseline, mAP improves by 2.6 points, reaching a peak of 69.56. This indicates that the proposed RoI Transformer can be easily embedded in other frameworks and significantly improves detection performance. Besides, there is a significant improvement on densely packed small instances (e.g. small vehicles, large vehicles, and ships). For example, the detection performance for the ship category improves by 26.34 points over the previous best result (57.25) achieved by R2CNN [26]. Some qualitative results of the RoI Transformer on DOTA are given in Fig. 6.

² https://github.com/DetectionTeamUCAS/RRPN_Faster-RCNN_Tensorflow

Results on HRSC2016. HRSC2016 contains many thin and long ship instances with arbitrary orientations. We use 4 scales $\{64^{2}, 128^{2}, 256^{2}, 512^{2}\}$ and 5 aspect ratios $\{1/3, 1/2, 1, 2, 3\}$, yielding $k = 20$ anchors for RPN initialization. This is because there are more aspect ratio variations in HRSC, but relatively fewer scale changes. The other settings are the same as those in Sec. 4.2. We conduct the experiments without FPN and still achieve the best mAP. Specifically, with our proposed method the mAP reaches 86.16, which is 1.86 points higher than that of RRD [37]. Note that RRD is built on SSD [43] for oriented object detection and uses multiple layers for feature extraction with 13 different box aspect ratios $\{1, 2, 3, 5, 7, 9, 15, 1/2, 1/3, 1/5, 1/7, 1/9, 1/15\}$, while our framework employs only the final output features with just 5 box aspect ratios. In Fig. 11, we visualize some detection results on HRSC2016. The orientations of the ships are evenly distributed over $2\pi$. In the last row, there are closely arranged ships, which are difficult to distinguish with horizontal rectangles, whereas our proposed RoI Transformer handles these cases effectively. The incomplete ship detected in the third picture of the last row demonstrates the robustness of our proposed RoI Transformer detection method.
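The anchor count follows from taking every combination of scale and aspect ratio. The sketch below enumerates them, assuming that each scale $s^2$ is an anchor area and that the ratio means height over width; both conventions and the function name are assumptions, since the text only lists the values.

```python
import math

def make_anchors(scales=(64, 128, 256, 512), ratios=(1 / 3, 1 / 2, 1, 2, 3)):
    """(width, height) of one anchor per (scale, ratio) pair, area fixed at scale**2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(1.0 / r)  # assumed convention: r = height / width
            h = s * math.sqrt(r)
            anchors.append((w, h))
    return anchors
```

Four scales times five ratios give the $k = 20$ anchors mentioned above, all of them horizontal; no rotated anchors are enumerated.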

Figure 11: Visualization of detection results from the RoI Transformer in HRSC2016. We select the predicted bounding boxes with scores above 0.1, and NMS with a threshold of 0.1 is applied for duplicate removal.

5 Conclusion

In this paper, we proposed a module called RoI Transformer to model the geometry transformation and solve the problem of misalignment between region features and objects. The design brings significant improvements for oriented object detection on the challenging DOTA and HRSC datasets with a negligible increase in computation cost. The deformable module is a well-designed structure for modeling geometry transformation and is widely used for oriented object detection. Nevertheless, the comprehensive comparisons with deformable RoI pooling verify that our model is more reasonable when oriented bounding box annotations are available. Therefore, our module can serve as an optional substitute for deformable RoI pooling in oriented object detection.

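substitution [sʌbstɪ'tjuːʃn]: n. replacement, the act of putting one thing in place of another, a substitute
aspect ratio: the ratio of width to height
removal [rɪ'muːv(ə)l]: n. dismissal, moving, elimination, relocation
stability [stə'bɪlɪtɪ]: n. steadiness, firmness, perseverance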

References

[5] DOTA: A Large-scale Dataset for Object Detection in Aerial Images
[8] Ship Rotated Bounding Box Space for Ship Extraction From High-Resolution Optical Satellite Images With Complex Backgrounds
[9] Arbitrary-Oriented Scene Text Detection via Rotation Proposals
[10] Rotated region based CNN for ship detection
[12] Towards Multi-class Object Detection in Unconstrained Remote Sensing Imagery
[13] Rich feature hierarchies for accurate object detection and semantic segmentation
[15] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
[19] Ship Rotated Bounding Box Space for Ship Extraction From High-Resolution Optical Satellite Images With Complex Backgrounds
[21] Toward Arbitrary-Oriented Ship Detection With Rotated Region Proposal and Discrimination Networks
[22] Spatial Transformer Networks
[23] Deformable Convolutional Networks
[24] R-FCN: Object Detection via Region-based Fully Convolutional Networks
[26] R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection
[27] Automatic Ship Detection of Remote Sensing Images from Google Earth in Complex Scenes Based on Multi-Scale Rotation Dense Feature Pyramid Networks
[28] Learning a Rotation Invariant Detector with Rotatable Bounding Box
[34] Light-Head R-CNN: In Defense of Two-Stage Object Detector
[37] Rotation-Sensitive Regression for Oriented Scene Text Detection

WORDBOOK

University of Chinese Academy of Sciences, UCAS: 中国科学院大学 (short form: 国科大)

KEY POINTS

DetectionTeamUCAS
https://github.com/DetectionTeamUCAS

Rotation-sensitive Regression for Oriented Scene Text Detection
https://github.com/MhLiao/RRD

Learning RoI Transformer for Detecting Oriented Objects in Aerial Images
https://github.com/dingjiansw101/RoITransformer_DOTA

DOTA: A Large-scale Dataset for Object Detection in Aerial Images
https://github.com/dingjiansw101/Faster_RCNN_for_DOTA

TextBoxes++: A Single-Shot Oriented Scene Text Detector
https://github.com/MhLiao/TextBoxes_plusplus

R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection
https://github.com/DetectionTeamUCAS/R2CNN_Faster-RCNN_Tensorflow
