Date:2018-10-22
Author:哪咔吗
Source Link:http://arxiv.org/pdf/1703.06870v3.pdf
NN)Mask R-CNN
- Abstract
- 1. Introduction
In principle Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool [18, 12], the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.
原则上,Mask R-CNN是R-CNN的直观扩展,但正确构建掩模分支对于获得好的结果至关重要。最重要的是,更快的RCNN并非针对网络输入和输出之间的像素对像素对齐而设计的。RoIPool [18,12]是参与实例的事实核心操作,为特征提取执行粗略的空间量化,这一点最为明显。为了找到错位,我们提出了一个简单的,无量化的图层,称为RoIAlign,忠实地保留了确切的空间位置。尽管1遵循通用术语,但我们使用对象检测来表示通过边界框而不是掩码进行检测,并使用语义分割来表示每像素分类而不区分实例。但是我们注意到,实例分割既是语义的,也是一种检测形式。
Footnote 1: Following common terminology, we use object detection to denote detection via bounding boxes, not masks, and semantic segmentation to denote per-pixel classification without differentiating instances. Yet we note that instance segmentation is both semantic and a form of detection.
Figure 2. Mask R-CNN results on the COCO test set. These results are based on ResNet-101 [19], achieving a mask AP of 35.7 and running at 5 fps. Masks are shown in color, and bounding box, category, and confidences are also shown.
图2.掩盖COCO测试集上的R-CNN结果。这些结果基于ResNet-101 [19],实现了35.7的掩模AP,并以5 fps运行。面具以彩色显示,还显示了边界框,类别和置信度。
Despite being a seemingly minor change, RoIAlign has a large impact: it improves mask accuracy by relative 10% to 50%, showing bigger gains under stricter localization metrics. Second, we found it essential to decouple mask and class prediction: we predict a binary mask for each class independently, without competition among classes, and rely on the network’s RoI classification branch to predict the category. In contrast, FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification, and based on our experiments works poorly for instance segmentation.
一个看似微小的变化,RoIAlign具有很大的影响:它将掩模精度提高了10%到50%,在更严格的本地化指标下显示出更大的收益。其次,我们发现解耦模板和类别预测至关重要:我们独立预测每个类别的二进制掩码,而不需要在类别间进行竞争,并依靠网络的RoI分类分支来预测类别。相比之下,FCNs通常执行每像素多类别分类,结合分割和分类,并基于我们的实验在分割实例方面效果不佳。
Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task [28], including the heavily engineered entries from the 2016 competition winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments, we evaluate multiple basic instantiations, which allows us to demonstrate its robustness and analyze the effects of core factors.
没有花里胡哨之力,Mask R-CNN超越了COCO实例分割任务中所有先前的最新单模型结果[28],其中包括来自2016年竞赛冠军的大量工程项目。作为副产品,我们的方法也擅长COCO物体检测任务。在消融实验中,我们评估了多个基本实例,这使我们能够展示其强大性并分析核心因素的影响。
Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. We believe the fast train and test speeds, together with the framework’s flexibility and accuracy, will benefit and ease future research on instance segmentation.
我们的模型可以在GPU上以每帧200毫秒的速度运行,并且在单个8 GPU计算机上进行COCO培训需要一到两天。我们相信,快速训练和测试速度,以及框架的灵活性和准确性,将会对实例分割的未来研究起到一定的作用。
Finally, we showcase the generality of our framework via the task of human pose estimation on the COCO keypoint dataset [28]. By viewing each keypoint as a one-hot binary mask, with minimal modification Mask R-CNN can be applied to detect instance-specific poses. Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition, and at the same time runs at 5 fps. Mask R-CNN, therefore, can be seen more broadly as a flexible framework for instance-level recognition and can be readily extended to more complex tasks.
最后,我们通过COCO关键点数据集上的人体姿态估计任务展示了我们框架的一般性[28]。通过将每个关键点视为一个热门的二进制掩码,只需进行最少的修改Mask R-CNN可用于检测实例特定的姿势。Mask R-CNN超越2016年COCO关键点竞赛的冠军,同时运行速度为5 fps。因此,面膜R-CNN可以更广泛地视为实例级别识别的灵活框架,并且可以很容易地扩展到更复杂的任务。
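To make the "keypoint as a one-hot binary mask" idea concrete, here is a minimal sketch. The function name `keypoint_to_one_hot`, the mask resolution `m`, and the use of integer pixel coordinates are illustrative assumptions, not details given in the text above.

```python
def keypoint_to_one_hot(x, y, m):
    """Encode one keypoint as an m x m binary mask with a single foreground pixel.

    x, y are integer column/row indices inside the m x m grid (an assumption;
    the text above does not say how coordinates are discretized).
    """
    if not (0 <= x < m and 0 <= y < m):
        raise ValueError("keypoint lies outside the m x m grid")
    mask = [[0] * m for _ in range(m)]
    mask[y][x] = 1  # exactly one pixel is labeled as foreground
    return mask


# Example: a keypoint at column 2, row 1 of a 4 x 4 grid.
for row in keypoint_to_one_hot(2, 1, 4):
    print(row)
```

Each keypoint type would get its own mask of this form, so a pose is a stack of one-hot masks per instance.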
We have released code to facilitate future research.
我们已发布代码以促进未来的研究。
2. Related Work
R-CNN: The Region-based CNN (R-CNN) approach [13] to bounding-box object detection is to attend to a manageable number of candidate object regions [42, 20] and evaluate convolutional networks [25, 24] independently on each RoI. R-CNN was extended [18, 12] to allow attending to RoIs on feature maps using RoIPool, leading to fast speed and better accuracy. Faster R-CNN [36] advanced this stream by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements (e.g., [38, 27, 21]), and is the current leading framework in several benchmarks.
R-CNN:基于区域的CNN(R-CNN)方法[13]对边界框对象进行检测是为了关注可管理数量的候选目标区域[42,20]并独立评估卷积网络[25,24]在每个RoI上。R-CNN得到了扩展[18,12],允许使用RoIPool在功能地图上参与RoI,从而实现更快的速度和更高的准确性。更快的R-CNN [36]通过学习区域建议网络(RPN)的注意机制来推进这一流程。更快速的R-CNN灵活性强,适用于许多后续改进(例如[38,27,21]),并且是几个基准测试中的当前领先框架。
Instance Segmentation: Driven by the effectiveness of R-CNN, many approaches to instance segmentation are based on segment proposals. Earlier methods [13, 15, 16, 9] resorted to bottom-up segments [42, 2]. DeepMask [33] and following works [34, 8] learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods, segmentation precedes recognition, which is slow and less accurate. Likewise, Dai et al. [10] proposed a complex multiple-stage cascade that predicts segment proposals from bounding-box proposals, followed by classification. Instead, our method is based on parallel prediction of masks and class labels, which is simpler and more flexible.
实例细分:在RCNN的有效性的推动下,许多实例细分的方法都基于细分提案。早期的方法[13,15,16,9]采用了自下而上的方法[42,2]。DeepMask [33]和以下着作[34,8]学会提出片段候选者,然后由Fast R-CNN进行分类。在这些方法中,分割先于识别,这是缓慢的并且不太准确。同样,戴等人。 [10]提出了一个复杂的多级级联,从包围盒提议中预测段提议,然后进行分类。相反,我们的方法基于面具和类标签的并行预测,它更简单,更灵活。
Most recently, Li et al. [26] combined the segment proposal system in [8] and object detection system in [11] for “fully convolutional instance segmentation” (FCIS). The common idea in [8, 11, 26] is to predict a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object classes, boxes, and masks, making the system fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges (Figure 6), showing that it is challenged by the fundamental difficulties of segmenting instances.
最近,李等人。 [26]将[8]中的段提议系统和[11]中的对象检测系统合并为“完全卷积实例分段”(FCIS)。[8,11,26]中的共同想法是预测一组完全卷积的位置敏感输出通道。这些通道同时处理对象类,框和掩码,使系统更快。但是FCIS在重叠实例上表现出系统性错误并产生虚假边缘(图6),表明它受到分割实例的根本困难的挑战。
Another family of solutions [23, 4, 3, 29] to instance segmentation is driven by the success of semantic segmentation. Starting from per-pixel classification results (e.g., FCN outputs), these methods attempt to cut the pixels of the same category into different instances. In contrast to the segmentation-first strategy of these methods, Mask R-CNN is based on an instance-first strategy. We expect a deeper incorporation of both strategies will be studied in the future.
另一个解决方案家族[23,4,3,29]实例分割是由语义分割的成功驱动的。从每像素分类结果(例如,FCN输出)开始,这些方法试图将相同类别的像素切割成不同的实例。与这些方法的分段第一策略相比,Mask R-CNN基于实例第一策略。我们预计未来将研究更深入的两种战略。
3. Mask R-CNN
Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN.
掩码R-CNN在概念上是简单的:更快的R-CNN对于每个候选对象具有两个输出,一个类别标签和一个边界框偏移;为此,我们添加一个输出对象掩码的第三个分支。面具R-CNN因此是一个自然而直观的想法。但是额外的掩码输出与类和盒输出不同,需要提取对象的更精细的空间布局。接下来,我们介绍Mask R-CNN的关键元素,包括像素对像素对齐,这是Fast / Faster R-CNN的主要缺失部分。
Faster R-CNN: We begin by briefly reviewing the Faster R-CNN detector [36]. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN [12], extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference. We refer readers to [21] for latest, comprehensive comparisons between Faster R-CNN and other frameworks.
更快的R-CNN:我们首先回顾一下更快的R-CNN探测器[36]。更快的R-CNN由两个阶段组成。第一阶段称为区域提议网络(RPN),提出候选对象边界框。第二阶段本质上是Fast R-CNN [12],使用每个候选框中的RoIPool提取特征,并执行分类和边界框回归。两个阶段使用的功能可以共享以加快推断速度。我们引用读者[21]对Faster R-CNN和其他框架进行最新,全面的比较。
Mask R-CNN: Mask R-CNN adopts the same two-stage procedure, with an identical first stage (which is RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions (e.g. [33, 10, 26]). Our approach follows the spirit of Fast R-CNN [12] that applies bounding-box classification and regression in parallel (which turned out to largely simplify the multi-stage pipeline of original R-CNN [13]).
掩码R-CNN:掩码R-CNN采用相同的两阶段过程,具有相同的第一阶段(即RPN)。在第二阶段,与预测类和盒子偏移并行,Mask R-CNN也为每个RoI输出一个二进制掩码。这与大多数最近的系统形成对比,其中分类依赖于掩模预测(例如[33,10,26])。我们的方法遵循Fast R-CNN [12]的精神,它并行地应用了边界框分类和回归(其原来大大简化了原始R-CNN的多级流水线[13])。
Formally, during training, we define a multi-task loss on each sampled RoI as $L = L_{cls} + L_{box} + L_{mask}$. The classification loss $L_{cls}$ and bounding-box loss $L_{box}$ are identical to those defined in [12]. The mask branch has a $Km^2$-dimensional output for each RoI, which encodes $K$ binary masks of resolution $m \times m$, one for each of the $K$ classes. To this we apply a per-pixel sigmoid, and define $L_{mask}$ as the average binary cross-entropy loss. For an RoI associated with ground-truth class $k$, $L_{mask}$ is only defined on the $k$-th mask (other mask outputs do not contribute to the loss).
形式上,在训练期间,我们将每个抽样的RoI的多任务丢失定义为L = Lcls + Lbox + Lmask。分类损失和边界框损失与[12]中定义的相同。掩码分支对每个RoI都有一个维输出,它编码分辨率为的K个二进制掩码,每个K类一个掩码。为此,我们应用每像素S形,并将定义为平均二叉交叉熵损失。对于与地面实况类别k相关的RoI,仅在第k个掩模上定义(其他掩模输出不会造成损失)。
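The mask loss described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the nested-list layout (`logits[k][row][col]`), the function names, and the small `eps` guard are hypothetical choices, not the authors' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mask_loss(logits, target, k):
    """Average binary cross-entropy over the k-th of K predicted masks.

    logits: K masks, each an m x m nested list of raw scores (assumed layout).
    target: m x m nested list of 0/1 ground-truth values for class k.
    """
    eps = 1e-12  # guards against log(0); an implementation detail assumed here
    pred = logits[k]  # the masks of the other K - 1 classes are ignored
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for z, t in zip(row_p, row_t):
            p = sigmoid(z)  # per-pixel sigmoid
            total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
            count += 1
    return total / count


# K = 2 classes, m = 2. Only the mask of the ground-truth class (k = 1) matters.
logits = [[[9.0, -9.0], [9.0, -9.0]],   # class 0: never touched when k = 1
          [[0.0, 0.0], [0.0, 0.0]]]     # class 1: p = 0.5 at every pixel
target = [[1, 0], [0, 1]]
print(mask_loss(logits, target, k=1))   # close to ln(2)
```

Because the loss indexes a single mask, changing the scores of any other class leaves the value unchanged, which is the "no contribution from other mask outputs" property stated in the text.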
Our definition of $L_{mask}$ allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [30] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results.
我们对的定义允许网络为每个班级生成口罩,而不需要在班级间进行竞争;我们依靠专用分类分支来预测用于选择输出掩码的类别标签。这样可以将掩码和类别预测分开。这与将FCN [30]应用于语义分割时的常见做法不同,后者通常使用每像素softmax和多项叉熵损失。在这种情况下,跨班级的面具竞争;在我们的例子中,每像素S形和二进制丢失,他们不。我们通过实验显示这个公式对于良好的实例分割结果是关键的。
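The difference between "competing" and "non-competing" class masks can be seen at a single pixel. The sketch below is illustrative only; the logit values are made up and the helper names are assumptions.

```python
import math


def per_class_sigmoid(logits):
    """Independent probability per class: no interaction between classes."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Multinomial probabilities: classes share one unit of probability mass."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Scores of K = 3 classes at one pixel, before and after class 1 gets a higher score.
before = [2.0, 2.0, -1.0]
after = [2.0, 5.0, -1.0]

print(per_class_sigmoid(before)[0], per_class_sigmoid(after)[0])  # identical
print(softmax(before)[0], softmax(after)[0])                      # class 0 drops
```

With the sigmoid, raising the score of class 1 leaves the probability of class 0 untouched; with the softmax, the same change pushes class 0 down even though its own score did not move.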
Mask Representation: A mask encodes an input object’s spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully-connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions.
掩码表示法:掩码编码输入对象的空间布局。因此,与通过完全连接(fc)层不可避免地折叠成短输出矢量的类标签或框偏移不同,提取掩模的空间结构可以通过卷积提供的像素到像素对应自然地解决。
Specifically, we predict an $m \times m$ mask from each RoI using an FCN [30]. This allows each layer in the mask branch to maintain the explicit $m \times m$ object spatial layout without collapsing it into a vector representation that lacks spatial dimensions. Unlike previous methods that resort to fc layers for mask prediction [33, 34, 10], our fully convolutional representation requires fewer parameters, and is more accurate as demonstrated by experiments.
具体而言,我们使用FCN预测每个RoI的掩码[30]。这允许掩码分支中的每个层保持显式对象空间布局,而不将其折叠成缺少空间维度的向量表示。与之前采用fc层进行掩模预测的方法不同[33,34,10],我们的完全卷积表示需要更少的参数,并且如实验所证明的那样更精确。
This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the following RoIAlign layer that plays a key role in mask prediction.
这种像素到像素的行为要求我们的RoI特征(它们本身是小特征图)能够很好地对齐以忠实地保留显式的每像素空间对应关系。这促使我们开发了以下RoAlign图层,该图层在遮罩预测中发挥关键作用。
RoIAlign: RoIPool [12] is a standard operation for extracting a small feature map (e.g., 7×7) from each RoI. RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map, this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Quantization is performed, e.g., on a continuous coordinate $x$ by computing $[x/16]$, where 16 is a feature map stride and $[\cdot]$ is rounding; likewise, quantization is performed when dividing into bins (e.g., 7×7). These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks.
RoIlign:RoIPool [12]是从每个RoI提取小特征映射(例如7×7)的标准操作。RoIPool首先将浮点数RoI量化为特征映射的离散粒度,然后将这个量化的RoI细分为自身量化的空间仓,最后汇总每个仓所涵盖的特征值(通常通过最大池)。例如,通过计算在连续坐标x上执行量化,其中16是特征映射步长并且是舍入;同样地,当分成分箱(例如,7×7)时执行量化。这些量化引入了RoI和提取的特征之间的错位。虽然这可能不会影响分类,这对于小型翻译很有用,但它对预测像素精确的蒙版有很大的负面影响。
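A small worked example shows how much the rounding step can shift a coordinate. This is a sketch under assumptions: the helper names are hypothetical, and ties are rounded half up, which the text does not specify.

```python
import math


def quantize(x, stride=16):
    """RoIPool-style quantization: map an image coordinate to a feature-map index."""
    return int(math.floor(x / stride + 0.5))  # round half up (assumed tie rule)


def misalignment_px(x, stride=16):
    """Shift, in image pixels, between x and the position its index maps back to."""
    return abs(quantize(x, stride) * stride - x)


for x in (96.0, 100.0, 103.0, 107.0):
    print(x, x / 16, quantize(x), misalignment_px(x))
```

A coordinate such as x = 100 maps to the continuous position 6.25 on a stride-16 feature map, but rounding snaps it to index 6, which corresponds to image pixel 96. The error can reach half a stride (8 pixels here), and the second quantization when dividing into bins adds further error on top.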
To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries or bins (i.e., we use $x/16$ instead of $[x/16]$). We use bilinear interpolation [22] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average), see Figure 3 for details. We note that the results are not sensitive to the exact sampling locations, or how many points are sampled, as long as no quantization is performed.
为了解决这个问题,我们提出一个RoIlign层,它可以消除RoIPool的严格量化,正确地将提取的特征与输入对齐。我们提出的改变很简单:我们避免任何RoI边界或分区的量化(即,我们使用而不是)。我们使用双线性插值[22]来计算每个RoI bin中四个有规律采样位置的输入特征的精确值,并汇总结果(使用最大值或平均值),详细信息请参见图3。我们注意到,只要未执行量化,结果对精确的采样位置不敏感,或者采样了多少个点。
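The RoIAlign procedure described above (no rounding, bilinear sampling at four regular points per bin, then averaging) can be sketched for a single-channel feature map as follows. The function names, the nested-list feature map, the convention that feature index i sits at continuous coordinate i, and the clamping at the border are all assumptions made for this illustration; it is not the authors' implementation.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at continuous (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range (assumed border rule)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feat, roi, out_size=2, samples=2, stride=16):
    """Average-pooled RoIAlign; roi = (x1, y1, x2, y2) in image pixels."""
    x1, y1, x2, y2 = (c / stride for c in roi)  # x / 16, with no rounding
    bin_h = (y2 - y1) / out_size                # bins keep fractional sizes
    bin_w = (x2 - x1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            values = []
            for sy in range(samples):           # samples x samples regular grid
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear(feat, y, x))
            row.append(sum(values) / len(values))
        out.append(row)
    return out


# An 8 x 8 ramp whose value equals the column index, so outputs are easy to check.
ramp = [[float(col) for col in range(8)] for _ in range(8)]
print(roi_align(ramp, (16, 16, 80, 80)))  # RoI edges fall on feature-map cells
print(roi_align(ramp, (24, 16, 88, 80)))  # RoI shifted right by half a stride
```

On the ramp, shifting the RoI by 8 image pixels (half a cell at stride 16) shifts every pooled value by 0.5, so the sub-cell offset survives pooling instead of being rounded away.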
RoIAlign leads to large improvements as we show in §4.2. We also compare to the RoIWarp operation proposed in [10]. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in [10] as quantizing RoI just like RoIPool. So even though RoIWarp also adopts bilinear resampling motivated by [22], it performs on par with RoIPool as shown by experiments (more details in Table 2c), demonstrating the crucial role of alignment.
正如我们在§4.2中所展示的,RoIAlign带来了巨大的改进。我们也比较了[10]中提出的RoIWarp操作。与RoIlign不同,RoIWarp忽略了对齐问题,并在[10]中将RoI与RoIPool一样量化为RoI。所以即使RoIWarp也采用[22]激励的双线性重采样,它可以像RoIPool一样实验(表2c中的更多细节),证明了对齐的关键作用。
Network Architecture: To demonstrate the generality of our approach, we instantiate Mask R-CNN with multiple architectures. For clarity, we differentiate between: (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition (classification and regression) and mask prediction that is applied separately to each RoI. We denote the backbone architecture using the nomenclature network-depth-features. We evaluate ResNet [19] and ResNeXt [45] networks of depth 50 or 101 layers. The original implementation of Faster R-CNN with ResNets [19] extracted features from the final convolutional layer of the 4-th stage, which we call C4. This backbone with ResNet-50, for example, is denoted by ResNet-50-C4. This is a common choice used in [19, 10, 21, 39].
网络体系结构:为了演示我们的方法的一般性,我们实例化具有多种体系结构的Mask R-CNN。为了清楚起见,我们区分:(i)用于整个图像上的特征提取的卷积骨干架构,以及(ii)用于边界框识别(分类和回归)的网络头和分别应用于每个RoI的掩模预测。我们用命名网络深度特征来表示骨干架构。我们评估深度为50或101层的ResNet [19]和ResNeXt [45]网络。带ResNets的更快的R-CNN的原始实施
[19]从第四阶段的最后卷积层提取特征,我们称之为C4。例如,ResNet-50的骨干用ResNet-50-C4表示。这是[19,10,21,39]中常用的选择。
We also explore another more effective backbone recently proposed by Lin et al. [27], called a Feature Pyramid Network (FPN). FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask R-CNN gives excellent gains in both accuracy and speed. For further details on FPN, we refer readers to [27].
我们还探索了Lin等人最近提出的另一种更有效的骨干。 [27],称为特征金字塔网络(FPN)。FPN使用具有横向连接的自顶向下架构从单一比例输入构建网络内特征金字塔。更快的R-CNN和FPN骨干网根据其规模从不同层次的特征金字塔中提取RoI特征,但其他方法与vanilla ResNet类似。使用ResNet-FPN主干进行MaskNRCNN特征提取,可以提高精度和速度。有关FPN的更多详细信息,请参阅[27]。
For the network head we closely follow architectures presented in previous work to which we add a fully convolutional mask prediction branch. Specifically, we extend the Faster R-CNN box heads from the ResNet [19] and FPN [27] papers. Details are shown in Figure 4. The head on the ResNet-C4 backbone includes the 5-th stage of ResNet (namely, the 9-layer ‘res5’ [19]), which is compute-intensive. For FPN, the backbone already includes res5 and thus allows for a more efficient head that uses fewer filters. We note that our mask branches have a straightforward structure. More complex designs have the potential to improve performance but are not the focus of this work.
对于网络负责人,我们密切关注以前工作中提出的架构,并在其中添加完全卷积掩码预测分支。具体而言,我们从ResNet [19]和FPN [27]论文中扩展了更快的R-CNN盒头。详细情况如图4所示。ResNet-C4主干上包含ResNet的第5级(即9层’res5’[19]),它是计算密集型的。对于FPN,骨干已经包含res5,因此可以使用更少的滤波器来提高效率。我们注意到我们的面具分支有一个简单的结构。更复杂的设计有提高性能的潜力,但不是这项工作的重点。
Figure 4. Head Architecture: We extend two existing Faster R-CNN heads [19, 27]. Left/Right panels show the heads for the ResNet C4 and FPN backbones, from [19] and [27], respectively, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers as can be inferred from context (conv preserves spatial dimension while deconv increases it). All convs are 3×3, except the output conv which is 1×1, deconvs are 2×2 with stride 2, and we use ReLU [31] in hidden layers. Left: ‘res5’ denotes ResNet’s fifth stage, which for simplicity we altered so that the first conv operates on a 7×7 RoI with stride 1 (instead of 14×14 / stride 2 as in [19]). Right: ‘×4’ denotes a stack of four consecutive convs.
图4.头架构:我们扩展了两个现有的更快的RCNN头[19,27]。左/右面板分别显示来自[19]和[27]的ResNet C4和FPN骨干的头部,其中添加了掩膜分支。数字表示空间分辨率和频道。箭头表示可以从上下文推断的conv,deconv或fc图层(conv会保留空间维度,而deconv会增加它)。所有的转换都是3×3,除了输出转换为1×1,解压缩为2×2和步长2,并且我们在隐藏层中使用了ReLU [31]。左:res5表示ResNet的第五阶段,为了简单起见,我们改变了第一阶段的第一阶段,以步幅1(而不是14×14 /步幅2,如[19]中的7×7阶段)操作。右:“×4”表示一连串四次转换。
3.1. Implementation Details
We set hyper-parameters following existing Fast/Faster R-CNN work [12, 36, 27]. Although these decisions were made for object detection in original papers [12, 36, 27], we found our instance segmentation system is robust to them.
我们在现有的快速/更快的R-CNN工作之后设置超参数[12,36,27]。尽管这些决策是在原始文件中进行对象检测的[12,36,27],但我们发现我们的实例分割系统对它们是强健的。
Training: As in Fast R-CNN, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. The mask loss is defined only on positive RoIs. The mask target is the intersection between an RoI and its associated ground-truth mask.
培训:与Fast R-CNN一样,如果RoI的IoU的地面实况框至少为0.5,则认为是正面的,否则为负面。掩模损失仅在正向RoI上定义。掩码目标是RoI与其关联的地面实况蒙版之间的交集。
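The positive/negative rule above depends on box IoU. A minimal sketch follows, assuming boxes are `(x1, y1, x2, y2)` tuples with positive area; the function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def label_roi(roi, gt_boxes, thresh=0.5):
    """Return (is_positive, index of the best-matching ground-truth box or None)."""
    best_iou, best_idx = 0.0, None
    for idx, gt in enumerate(gt_boxes):
        iou = box_iou(roi, gt)
        if iou > best_iou:
            best_iou, best_idx = iou, idx
    if best_iou >= thresh:  # "IoU of at least 0.5" counts as positive
        return True, best_idx
    return False, None      # negatives carry no mask loss


ground_truth = [(5, 0, 15, 10), (0, 0, 10, 8)]
print(label_roi((0, 0, 10, 10), ground_truth))
```

For a positive RoI, the returned index identifies the ground-truth instance whose mask, intersected with the RoI, would serve as the mask target.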
We adopt image-centric training [12]. Images are resized such that their scale (shorter edge) is 800 pixels [27]. Each mini-batch has 2 images per GPU and each image has N sampled RoIs, with a ratio of 1:3 of positives to negatives [12]. N is 64 for the C4 backbone (as in [12, 36]) and 512 for FPN (as in [27]). We train on 8 GPUs (so the effective mini-batch size is 16) for 160k iterations, with a learning rate of 0.02 which is decreased by a factor of 10 at the 120k iteration. We use a weight decay of 0.0001 and momentum of 0.9. With ResNeXt [45], we train with 1 image per GPU and the same number of iterations, with a starting learning rate of 0.01. The RPN anchors span 5 scales and 3 aspect ratios, following [27]. For convenient ablation, RPN is trained separately and does not share features with Mask R-CNN, unless specified. For every entry in this paper, RPN and Mask R-CNN have the same backbones and so they are shareable.
我们采用图像中心训练[12]。调整图像的大小以使其比例(较短的边缘)为800像素[27]。每个微型批次每个GPU有2个图像,每个图像具有N个采样的RoI,比例为1:3的正负极[12]。C4骨架的N为64(如[12,36]),FPN为512(如[27])。我们在8个GPU(有效小批量大小为16)上进行160k次迭代训练,学习率为0.02,在120k迭代时减少10。我们使用0.0001的重量衰减和0.9的动量。使用ResNeXt [45],我们每个GPU训练1个图像,迭代次数相同,初始学习率为0.01。RPN锚点跨越5个尺度和3个纵横比,见[27]。为了方便消融,除非另有说明,否则RPN将单独进行培训并且不会与Mask R-CNN共享特征。对于本文中的每个条目,RPN和Mask R-CNN具有相同的主干,因此它们可共享。
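The schedule and sampling numbers above can be collected into a small sketch. The function names are hypothetical. Reading "decreased by 10" as division by 10, and applying the drop from iteration 120,000 onward, are assumptions about wording the text leaves open.

```python
def learning_rate(iteration, base_lr=0.02, step=120_000, gamma=0.1):
    """Step schedule: base_lr until `step`, then base_lr * gamma."""
    return base_lr * gamma if iteration >= step else base_lr


def split_rois(n, pos_fraction=0.25):
    """Target split of N sampled RoIs at a 1:3 positive-to-negative ratio."""
    n_pos = int(n * pos_fraction)
    return n_pos, n - n_pos


GPUS, IMAGES_PER_GPU = 8, 2
print("effective mini-batch:", GPUS * IMAGES_PER_GPU)
print("C4  (N = 64): ", split_rois(64))
print("FPN (N = 512):", split_rois(512))
print("lr at 100k / 130k:", learning_rate(100_000), learning_rate(130_000))
```

The 1:3 ratio means a quarter of the sampled RoIs per image are positives, i.e. 16 of 64 for C4 and 128 of 512 for FPN.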
Inference: At test time, the proposal number is 300 for the C4 backbone (as in [36]) and 1000 for FPN (as in [27]). We run the box prediction branch on these proposals, followed by non-maximum suppression [14]. The mask branch is then applied to the highest scoring 100 detection boxes. Although this differs from the parallel computation used in training, it speeds up inference and improves accuracy (due to the use of fewer, more accurate RoIs). The mask branch can predict K masks per RoI, but we only use the k-th mask, where k is the predicted class by the classification branch. The m×m floating-number mask output is then resized to the RoI size, and binarized at a threshold of 0.5.
推论:在测试时,C4主干的提案编号为300(如[36]),FPN的提案编号为1000(如[27])。我们对这些提议运行盒子预测分支,然后是非最大抑制[14]。然后将掩码分支应用于得分最高的100个检测框。虽然这与训练中使用的并行计算不同,但它加快了推理速度并提高了准确性(由于使用了更少,更准确的RoI)。掩模分支可以预测每个RoI的K个掩模,但我们只使用第k个掩模,其中k是分类分支预测的类。然后将m×m浮点数掩码输出调整为RoI大小,并在阈值0.5下进行二进制化。
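The last two inference steps (pick the k-th mask, resize it to the RoI, threshold at 0.5) can be sketched as below. Nearest-neighbor resizing and treating a value of exactly 0.5 as foreground are assumptions chosen to keep the example short; the text does not state the interpolation method or the tie rule, and the function name is hypothetical.

```python
def select_and_binarize(masks, k, roi_h, roi_w, thresh=0.5):
    """Turn the k-th m x m float mask into a roi_h x roi_w binary mask.

    masks: K nested lists of per-pixel probabilities, one m x m mask per class.
    k: class index predicted by the classification branch.
    """
    mask = masks[k]  # the other K - 1 masks are discarded
    m = len(mask)
    out = []
    for y in range(roi_h):
        src_y = min(int(y * m / roi_h), m - 1)  # nearest-neighbor lookup (assumed)
        row = []
        for x in range(roi_w):
            src_x = min(int(x * m / roi_w), m - 1)
            row.append(1 if mask[src_y][src_x] >= thresh else 0)
        out.append(row)
    return out


# K = 2 classes, m = 2, and a 4 x 4 RoI whose predicted class is k = 0.
masks = [[[0.9, 0.1], [0.2, 0.8]],
         [[0.0, 0.0], [0.0, 0.0]]]
for row in select_and_binarize(masks, 0, 4, 4):
    print(row)
```

Only one mask per detection is ever resized, which keeps the cost of the mask branch proportional to the number of kept detections rather than to the number of classes.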
Figure 5. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1).
图5.在COCO测试图像上使用ResNet-101-FPN并以5 fps运行并带有35.7掩模AP(表1)的Mask R-CNN的更多结果。
Table 1. Instance segmentation mask AP on COCO test-dev. MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM [38]. All entries are single-model results.
表1. COCO test-dev上的实例分段掩码AP。跨国公司[10]和FCIS [26]分别是2015年和2016年分类挑战的赢家。没有花里胡哨的,Mask R-CNN胜过了更复杂的FCIS +++,其中包括多尺度训练/测试,水平测试和OHEM [38]。所有条目都是单模型结果。
Note that since we only compute masks on the top 100 detection boxes, Mask R-CNN adds a small overhead to its Faster R-CNN counterpart (e.g., ∼20% on typical models).
请注意,由于我们仅计算前100个检测框中的掩码,Mask R-CNN为其较快的R-CNN对象(例如典型模型上的约20%)增加了一个小的开销。
4. Experiments: Instance Segmentation
We perform a thorough comparison of Mask R-CNN to the state of the art along with comprehensive ablations on the COCO dataset [28]. We report the standard COCO metrics including AP (averaged over IoU thresholds), AP50, AP75, and AP_S, AP_M, AP_L (AP at different scales). Unless noted, AP is evaluated using mask IoU. As in previous work [5, 27], we train using the union of 80k train images and a 35k subset of val images (trainval35k), and report ablations on the remaining 5k val images (minival). We also report results on test-dev [28].
我们对Mask R-CNN进行了彻底的比较,并对COCO数据集进行了全面的消融[28]。我们报告标准的COCO指标,包括AP(平均在IoU阈值上),AP50,AP75和APS,APM,APL(AP在不同尺度上)。除非另有说明,否则AP正在使用掩膜IoU进行评估。和以前的工作[5,27]一样,我们训练使用80k列车图像和val图像的35k子集(trainval35k)的联合,并报告其余5k val图像(微型)上的消融。我们还在测试开发中报告结果[28]。
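Mask IoU, the overlap measure behind the mask AP numbers, is computed over pixels rather than box corners. The sketch below assumes masks are equal-sized nested lists of 0/1 values and uses the standard COCO threshold grid of 0.50 to 0.95 in steps of 0.05. It does not compute AP itself, which also requires ranking detections by score; the helper names are hypothetical.

```python
def mask_iou(a, b):
    """IoU of two binary masks given as equal-sized nested lists of 0/1."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union else 0.0


# The ten IoU thresholds over which COCO averages AP.
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou):
    """Thresholds at which a prediction with this IoU counts as a match."""
    return [t for t in COCO_IOU_THRESHOLDS if iou >= t]


pred = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
truth = [[1, 1, 1], [1, 1, 1], [0, 0, 0]]
iou = mask_iou(pred, truth)
print(iou, thresholds_passed(iou))
```

AP50 and AP75 correspond to the single thresholds 0.50 and 0.75, which is why gains that show up mainly in AP75 indicate better localization.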
4.1. Main Results
We compare Mask R-CNN to the state-of-the-art methods in instance segmentation in Table 1. All instantiations of our model outperform baseline variants of previous state-of-the-art models. This includes MNC [10] and FCIS [26], the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN with ResNet-101-FPN backbone outperforms FCIS+++ [26], which includes multi-scale train/test, horizontal flip test, and online hard example mining (OHEM) [38]. While outside the scope of this work, we expect many such improvements to be applicable to ours. Mask R-CNN outputs are visualized in Figures 2 and 5. Mask R-CNN achieves good results even under challenging conditions. In Figure 6 we compare our Mask R-CNN baseline and FCIS+++ [26]. FCIS+++ exhibits systematic artifacts on overlapping instances, suggesting that it is challenged by the fundamental difficulty of instance segmentation. Mask R-CNN shows no such artifacts.
我们将Mask R-CNN与表1中实例分割中的最新方法进行了比较。我们模型的所有实例都优于先前最先进的模型的基线变体。其中包括MNC [10]和FCIS [26],分别是2015年和2016年分类挑战的获胜者。ResNet-101-FPN骨干网掩码R-CNN的性能优于FCIS +++ [26],其中包括多尺度训练/测试,水平流测试和在线硬示例挖掘(OHEM)[38]。虽然超出了本工作的范围,但我们预计许多此类改进将适用于我们的工作。图2和图5中显示了掩膜R-CNN输出。面具R-CNN即使在具有挑战性的条件下也能取得良好效果。在图6中,我们比较了我们的Mask R-CNN基线和FCIS +++ [26]。FCIS +++在重叠的实例中展现出系统性的人为因素,这表明它受到实例分割根本困难的挑战。掩码R-CNN没有显示这样的文物。
Figure 6. FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.
图6. FCIS +++ [26](顶部)与屏蔽R-CNN(底部,ResNet-101-FPN)。 FCIS展示重叠对象的系统性文物。
(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).
(b)多项式与独立式口罩(ResNet-50-C4):通过类别式口罩(sigmoid)进行解耦可获得多项式口罩(softmax)的巨大收益。
(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.
(e)掩模分支(ResNet-50-FPN):用于掩模预测的完全卷积网络(FCN)与多层感知器(MLP,完全连接)。FCN改善了结果,因为它们利用了对空间布局的明确编码。
Table 2. Ablations. We train on trainval35k, test on minival, and report mask AP unless otherwise noted.
表2.消融。除非另有说明,否则我们在trainval35k上训练,在minival上测试,并报告mask AP。
(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.
(a)骨干架构:更好的骨干带来预期的收益:更深的网络效果更好,FPN优于C4功能,ResNeXt改进ResNet。
(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features (Table 2c), resulting in big accuracy gaps.
(d)RoIlign(ResNet-50-C5,步幅32):使用大步功能的面罩级和盒级AP。错位比步幅-16的特征更严重(表2c),导致很大的精度差距。
4.2. Ablation Experiments
We run a number of ablations to analyze Mask R-CNN. Results are shown in Table 2 and discussed in detail next.
我们运行一些消融来分析Mask R-CNN。结果显示在表2中并在下面详细讨论。
Architecture: Table 2a shows Mask R-CNN with various backbones. It benefits from deeper networks (50 vs. 101) and advanced designs including FPN and ResNeXt. We note that not all frameworks automatically benefit from deeper or advanced networks (see benchmarking in [21]).
架构:表2a显示了具有各种骨架的Mask R-CNN。它受益于更深的网络(50对101)和先进的设计,包括FPN和ResNeXt。我们注意到并非所有框架都自动从更深或更高级的网络中获益(参见[21]中的基准测试)。
Multinomial vs. Independent Masks: Mask R-CNN decouples mask and class prediction: as the existing box branch predicts the class label, we generate a mask for each class without competition among classes (by a per-pixel sigmoid and a binary loss). In Table 2b, we compare this to using a per-pixel softmax and a multinomial loss (as commonly used in FCN [30]). This alternative couples the tasks of mask and class prediction, and results in a severe loss in mask AP (5.5 points). This suggests that once the instance has been classified as a whole (by the box branch), it is sufficient to predict a binary mask without concern for the categories, which makes the model easier to train.
多项式与独立式掩码:掩码R-CNN分离掩码和类别预测:由于现有的分支预测类别标签,因此我们为每个类别生成一个掩码,而不会在类别间进行竞争(按像素S形和二进制丢失)。在表2b中,我们将其与使用每像素softmax和多项损失(如FCN [30]中常用的)进行比较。这种替代方案将掩模和类别预测的任务相结合,并导致掩模AP(5.5分)的严重损失。这表明一旦实例被整体分类(通过盒子分支),预测二进制掩码就足够了,而不用考虑类别,这使得模型更易于训练。
(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ∼3 points and AP75 by ∼5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers.
(c)RoIlign(ResNet-50-C4):使用各种RoI图层蒙版结果。我们的RoIlign层将AP提高了约3分,AP75提高了约5分。使用适当的对齐是造成RoI层之间巨大差距的唯一因素。
Class-Specific vs. Class-Agnostic Masks: Our default instantiation predicts class-specific masks, i.e., one mask per class. Interestingly, Mask R-CNN with class-agnostic masks (i.e., predicting a single output regardless of class) is nearly as effective: it has 29.7 mask AP vs. 30.3 for the class-specific counterpart on ResNet-50-C4. This further highlights the division of labor in our approach which largely decouples classification and segmentation.
Class-Specific与Class-Agnostic Masks:我们的默认实例化预测了类特定的掩码,即一个
每个班级的面具。有趣的是,具有分类掩码的掩码R-CNN(即预测单个输出而不管类别)几乎同样有效:它具有29.7掩码AP,而对于ResNet-50-C4上的类别特定对应字符,掩码AP为30.3。这进一步突出了我们的方法中的分工,这种分工在很大程度上将分类和分割分开。
RoIAlign: An evaluation of our proposed RoIAlign layer is shown in Table 2c. For this experiment we use the ResNet-50-C4 backbone, which has stride 16. RoIAlign improves AP by about 3 points over RoIPool, with much of the gain coming at high IoU (AP75). RoIAlign is insensitive to max/average pool; we use average in the rest of the paper. Additionally, we compare with RoIWarp proposed in MNC [10] that also adopts bilinear sampling. As discussed in §3, RoIWarp still quantizes the RoI, losing alignment with the input. As can be seen in Table 2c, RoIWarp performs on par with RoIPool and much worse than RoIAlign. This highlights that proper alignment is key.
Roialign:我们建议的RoIlign层的评估如表2c所示。在这个实验中,我们使用了跨度为16的ResNet50-C4主干。RoIAlign比RoIPool提高了约3个点,其中很大的收益来自高IoU(AP75)。RoIlign对最大/平均水池不敏感;我们在本文的其余部分使用平均值。另外,我们与在MNC [10]中提出的RoIWarp进行比较,该方法也采用双线性采样。正如§3所讨论的那样,RoIWarp仍然量化了RoI,失去了与输入的一致性。从表2c可以看出,RoIWarp的表现与RoIPool相当,比RoIAlign差很多。这突出表明正确的对齐是关键。
We also evaluate RoIAlign with a ResNet-50-C5 backbone, which has an even larger stride of 32 pixels. We use the same head as in Figure 4 (right), as the res5 head is not applicable. Table 2d shows that RoIAlign improves mask AP by a massive 7.3 points, and mask AP75 by 10.5 points (50% relative improvement). Moreover, we note that with RoIAlign, using stride-32 C5 features (30.9 AP) is more accurate than using stride-16 C4 features (30.3 AP, Table 2c). RoIAlign largely resolves the long-standing challenge of using large-stride features for detection and segmentation. Finally, RoIAlign shows a gain of 1.5 mask AP and 0.5 box AP when used with FPN, which has finer multi-level strides. For keypoint detection that requires finer alignment, RoIAlign shows large gains even with FPN (Table 6).
我们还用一个ResNet-50-C5骨干来评估RoIlign,这个骨干有32个像素的更大步幅。我们使用与图4(右)相同的头,因为res5头不适用。表2d显示RoIAlign提高了掩模AP的7.3点,掩盖AP75 10.5点(相对提高50%)。此外,我们注意到使用RoIAlign,使用步幅-32 C5功能(30.9 AP)比使用步幅-16 C4功能(30.3 AP,表2c)更准确。RoIAlign在很大程度上解决了使用大步功能进行检测和分割的长期挑战。最后,与FPN一起使用时,RoIAlign显示1.5掩模AP和0.5盒AP的增益,FPN具有更精细的多级步幅。对于需要精细对齐的关键点检测,RoIAlign即使使用FPN也显示出较大的增益(表6)。
Mask Branch: Segmentation is a pixel-to-pixel task and we exploit the spatial layout of masks by using an FCN. In Table 2e, we compare multi-layer perceptrons (MLP) and FCNs, using a ResNet-50-FPN backbone. Using FCNs gives a 2.1 mask AP gain over MLPs. We note that we choose this backbone so that the conv layers of the FCN head are not pre-trained, for a fair comparison with MLP.
遮罩分支:分割是一个像素到像素的任务,我们通过使用FCN来利用遮罩的空间布局。在表2e中,我们使用ResNet-50-FPN主干比较了多层感知器(MLP)和FCN。使用FCN可以提供2.1 Mbps的AP掩码。我们注意到,我们选择了这个骨干,这样FCN头部的conv层没有经过预先训练,与MLP进行公平比较。
4.3. Bounding Box Detection Results
We compare Mask R-CNN to the state-of-the-art COCO bounding-box object detection in Table 3. For this result, even though the full Mask R-CNN model is trained, only the classification and box outputs are used at inference (the mask output is ignored). Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models, including the single-model variant of GRMI [21], the winner of the COCO 2016 Detection Challenge. Using ResNeXt-101-FPN, Mask R-CNN further improves results, with a margin of 3.0 points box AP over the best previous single model entry from [39] (which used Inception-ResNet-v2-TDM).
我们将Mask R-CNN与表3中的最新COCO包围盒对象检测进行比较。对于这个结果,即使训练完整的Mask R-CNN模型,只有分类和框输出用于推理(掩码输出被忽略)。使用ResNet-101FPN的面罩R-CNN优于以前所有先进模型的基础变体,其中包括COMI 2016检测挑战赛获胜者GRMI [21]的单模型变体。使用ResNeXt-101-FPN,Mask R-CNN进一步改进了结果,与[39](使用Inception-ResNet-v2-TDM)的最佳单一模型条目相比,框AP的余量为3.0分。
As a further comparison, we trained a version of Mask R-CNN but without the mask branch, denoted by “Faster R-CNN, RoIAlign” in Table 3. This model performs better than the model presented in [27] due to RoIAlign. On the other hand, it is 0.9 points box AP lower than Mask R-CNN. This gap of Mask R-CNN on box detection is therefore due solely to the benefits of multi-task training.
Lastly, we note that Mask R-CNN attains a small gap between its mask and box AP: e.g., 2.7 points between 37.1 (mask, Table 1) and 39.8 (box, Table 3). This indicates that our approach largely closes the gap between object detection and the more challenging instance segmentation task.
4.4. Timing
Inference: We train a ResNet-101-FPN model that shares features between the RPN and Mask R-CNN stages, following the 4-step training of Faster R-CNN [36]. This model runs at 195ms per image on an Nvidia Tesla M40 GPU (plus 15ms CPU time resizing the outputs to the original resolution), and achieves statistically the same mask AP as the unshared one. We also report that the ResNet-101-C4 variant takes ∼400ms as it has a heavier box head (Figure 4), so we do not recommend using the C4 variant in practice.
Although Mask R-CNN is fast, we note that our design is not optimized for speed, and better speed/accuracy tradeoffs could be achieved [21], e.g., by varying image sizes and proposal numbers, which is beyond the scope of this paper.
Training: Mask R-CNN is also fast to train. Training with ResNet-50-FPN on COCO trainval35k takes 32 hours in our synchronized 8-GPU implementation (0.72s per 16-image mini-batch), and 44 hours with ResNet-101-FPN. In fact, fast prototyping can be completed in less than one day when training on the train set. We hope such rapid training will remove a major hurdle in this area and encourage more people to perform research on this challenging topic.
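The timing figures in this section can be cross-checked with a little arithmetic. The sketch below is a back-of-the-envelope check, not a benchmark. The helper names are hypothetical, and the 160k-iteration schedule is an assumption taken from the paper's training setup, which is not restated in this section.

```python
def frames_per_second(gpu_ms, cpu_ms=0.0):
    """Throughput implied by a per-image latency given in milliseconds."""
    return 1000.0 / (gpu_ms + cpu_ms)


def training_hours(iterations, seconds_per_iteration):
    """Wall-clock training time implied by a fixed per-iteration cost."""
    return iterations * seconds_per_iteration / 3600.0


if __name__ == "__main__":
    # 195 ms GPU + 15 ms CPU per image, as reported for ResNet-101-FPN.
    print(f"inference: {frames_per_second(195, 15):.2f} fps")
    # ASSUMPTION: 160k iterations; 0.72 s per 16-image mini-batch is from the text.
    print(f"training: {training_hours(160_000, 0.72):.1f} hours")
```

Under that assumed schedule, 0.72 s per iteration works out to the 32 hours reported for ResNet-50-FPN, and 210 ms per image is a little under the 5 fps quoted in the figure captions.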
5. Mask R-CNN for Human Pose Estimation
Our framework can easily be extended to human pose estimation. We model a keypoint’s location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types (e.g., left shoulder, right elbow). This task helps demonstrate the flexibility of Mask R-CNN. We note that minimal domain knowledge for human pose is exploited by our system, as the experiments are mainly to demonstrate the generality of the Mask R-CNN framework. We expect that domain knowledge (e.g., modeling structures [6]) will be complementary to our simple approach.
Figure 7. Keypoint detection results on COCO test using Mask R-CNN (ResNet-50-FPN), with person segmentation masks predicted from the same model. This model has a keypoint AP of 63.1 and runs at 5 fps.
Table 4. Keypoint detection AP on COCO test-dev. Ours is a single model (ResNet-50-FPN) that runs at 5 fps. CMU-Pose+++ [6] is the 2016 competition winner that uses multi-scale testing, post-processing with CPM [44], and filtering with an object detector, adding a cumulative ∼5 points (clarified in personal communication). †: G-RMI was trained on COCO plus MPII [1] (25k images), using two models (Inception-ResNet-v2 for bounding box detection and ResNet-101 for keypoints).
Implementation Details: We make minor modifications to the segmentation system when adapting it for keypoints. For each of the K keypoints of an instance, the training target is a one-hot m × m binary mask where only a single pixel is labeled as foreground. During training, for each visible ground-truth keypoint, we minimize the cross-entropy loss over an m²-way softmax output (which encourages a single point to be detected). We note that, as in instance segmentation, the K keypoints are still treated independently. We adopt the ResNet-FPN variant, and the keypoint head architecture is similar to that in Figure 4 (right). The keypoint head consists of a stack of eight 3×3 512-d conv layers, followed by a deconv layer and 2× bilinear upscaling, producing an output resolution of 56×56. We found that a relatively high resolution output (compared to masks) is required for keypoint-level localization accuracy.
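The keypoint loss described above, a cross-entropy over a softmax taken across all pixels of the output map, can be sketched as follows. This is a minimal pure-Python illustration for a single keypoint, not the released implementation. The function name and the nested-list layout of the logits are assumptions made for the example.

```python
import math


def keypoint_softmax_loss(logits, target_rc):
    """Cross-entropy over a softmax spanning every pixel of one keypoint map.

    logits: square nested list (m rows of m floats) of raw scores.
    target_rc: (row, col) of the single foreground pixel in the one-hot target.
    """
    flat = [value for row in logits for value in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    log_partition = peak + math.log(sum(math.exp(v - peak) for v in flat))
    row, col = target_rc
    # With a one-hot target, the loss is -log softmax at the target pixel.
    return log_partition - logits[row][col]


if __name__ == "__main__":
    m = 56  # output resolution stated in the text
    uniform = [[0.0] * m for _ in range(m)]
    print(keypoint_softmax_loss(uniform, (10, 20)))  # equals log(m * m)

    peaked = [row[:] for row in uniform]
    peaked[10][20] = 12.0  # confident prediction at the target pixel
    print(keypoint_softmax_loss(peaked, (10, 20)))
```

Because the softmax normalizes over the whole map, raising the score of one pixel lowers the probability of all the others, which is what pushes the head toward a single detected point per keypoint.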
Models are trained on all COCO trainval35k images that contain annotated keypoints. To reduce overfitting, as this training set is smaller, we train using image scales randomly sampled from [640, 800] pixels; inference is on a single scale of 800 pixels. We train for 90k iterations, starting from a learning rate of 0.02 and reducing it by a factor of 10 at 60k and 80k iterations. We use bounding-box NMS with a threshold of 0.5. Other details are identical to those in §3.1.
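The keypoint training schedule can be written down as a step learning-rate function plus a scale sampler. This is a sketch under assumptions: it reads "reducing it by 10" as dividing the learning rate by 10, it samples integer scales uniformly (the text only says "randomly sampled"), and the function names are hypothetical.

```python
import random


def step_learning_rate(iteration, base_lr=0.02, milestones=(60_000, 80_000), factor=0.1):
    """Learning rate at a given iteration: multiplied by `factor` at each milestone passed."""
    drops = sum(1 for milestone in milestones if iteration >= milestone)
    return base_lr * factor ** drops


def sample_train_scale(rng, low=640, high=800):
    """Draw one training image scale in pixels (ASSUMPTION: uniform over integers)."""
    return rng.randint(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    for it in (0, 59_999, 60_000, 80_000, 89_999):
        print(it, step_learning_rate(it), sample_train_scale(rng))
```

The same function covers the other schedules in the appendices by changing `base_lr` and `milestones`.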
Main Results and Ablations: We evaluate the person keypoint AP (APkp) and experiment with a ResNet-50-FPN backbone; more backbones will be studied in the appendix. Table 4 shows that our result (62.7 APkp) is 0.9 points higher than the COCO 2016 keypoint detection winner [6] that uses a multi-stage processing pipeline (see caption of Table 4). Our method is considerably simpler and faster.
More importantly, we have a unified model that can simultaneously predict boxes, segments, and keypoints while running at 5 fps. Adding a segment branch (for the person category) improves the APkp to 63.1 (Table 4) on test-dev. More ablations of multi-task learning on minival are in Table 5. Adding the mask branch to the box-only (i.e., Faster R-CNN) or keypoint-only versions consistently improves these tasks. However, adding the keypoint branch reduces the box/mask AP slightly, suggesting that while keypoint detection benefits from multi-task training, it does not in turn help the other tasks. Nevertheless, learning all three tasks jointly enables a unified system to efficiently predict all outputs simultaneously (Figure 7).
We also investigate the effect of RoIAlign on keypoint detection (Table 6). Though this ResNet-50-FPN backbone has finer strides (e.g., 4 pixels on the finest level), RoIAlign still shows significant improvement over RoIPool and increases APkp by 4.4 points. This is because keypoint detections are more sensitive to localization accuracy. This again indicates that alignment is essential for pixel-level localization, including masks and keypoints.
Table 5. Multi-task learning of box, mask, and keypoint about the person category, evaluated on minival. All entries are trained on the same data for fair comparisons. The backbone is ResNet-50-FPN. The entries with 64.2 and 64.7 AP on minival have test-dev AP of 62.7 and 63.1, respectively (see Table 4).
Table 6. RoIAlign vs. RoIPool for keypoint detection on minival. The backbone is ResNet-50-FPN.
Given the effectiveness of Mask R-CNN for extracting object bounding boxes, masks, and keypoints, we expect it to be an effective framework for other instance-level tasks.
Appendix A: Experiments on Cityscapes
We further report instance segmentation results on the Cityscapes [7] dataset. This dataset has fine annotations for 2975 train, 500 val, and 1525 test images. It has 20k coarse training images without instance annotations, which we do not use. All images are 2048×1024 pixels. The instance segmentation task involves 8 object categories. Instance segmentation performance on this task is measured by the COCO-style mask AP (averaged over IoU thresholds); AP50 (i.e., mask AP at an IoU of 0.5) is also reported.
Implementation: We apply our Mask R-CNN models with the ResNet-FPN-50 backbone; we found the 101-layer counterpart performs similarly due to the small dataset size. We train with image scale (shorter side) randomly sampled from [800, 1024], which reduces overfitting; inference is on a single scale of 1024 pixels. We use a mini-batch size of 1 image per GPU (so 8 on 8 GPUs) and train the model for 24k iterations, starting from a learning rate of 0.01 and reducing it to 0.001 at 18k iterations. It takes ∼4 hours of training on a single 8-GPU machine under this setting.
Results: Table 7 compares our results to the state of the art on the val and test sets. Without using the coarse training set, our method achieves 26.2 AP on test, a relative improvement of over 30% over the previous best entry (DIN [3]), and is also better than the concurrent SGN [29] (25.0 AP). Both DIN and SGN use fine + coarse data. Compared to the best entry using fine data only (17.4 AP), we achieve a ∼50% improvement.
For the person and car categories, the Cityscapes dataset exhibits a large number of within-category overlapping instances (on average 6 people and 9 cars per image). We argue that within-category overlap is a core difficulty of instance segmentation. Our method shows massive improvement on these two categories over the other best entries (relative ∼40% improvement on person from 21.8 to 30.5 and ∼20% improvement on car from 39.4 to 46.9), even though our method does not exploit the coarse data.
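The relative improvements quoted for Cityscapes follow directly from the absolute AP values given in the text. The helper below is a small sketch for checking them. Its name is hypothetical, and only numbers stated in this section are used.

```python
def relative_gain_percent(old_ap, new_ap):
    """Relative improvement of new_ap over old_ap, in percent."""
    return (new_ap - old_ap) / old_ap * 100.0


if __name__ == "__main__":
    comparisons = {
        "overall vs. best fine-only entry": (17.4, 26.2),
        "person": (21.8, 30.5),
        "car": (39.4, 46.9),
    }
    for name, (old_ap, new_ap) in comparisons.items():
        print(f"{name}: {relative_gain_percent(old_ap, new_ap):.1f}%")
```

The three results land close to the ∼50%, ∼40%, and ∼20% figures reported above.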
A main challenge of the Cityscapes dataset is training models in a low-data regime, particularly for the categories of truck, bus, and train, which have about 200-500 training samples each. To partially remedy this issue, we further report a result using COCO pre-training. To do this, we initialize the corresponding 7 categories in Cityscapes from a pre-trained COCO Mask R-CNN model (rider being randomly initialized). We fine-tune this model for 4k iterations in which the learning rate is reduced at 3k iterations, which takes ∼1 hour for training given the COCO model.
Figure 8. Mask R-CNN results on Cityscapes test (32.0 AP). The bottom-right image shows a failure prediction.
The COCO pre-trained Mask R-CNN model achieves 32.0 AP on test, almost a 6 point improvement over the fine-only counterpart. This indicates the important role the amount of training data plays. It also suggests that methods on Cityscapes might be influenced by their low-shot learning performance. We show that using COCO pre-training is an effective strategy on this dataset.
Finally, we observed a bias between the val and test AP, as is also observed from the results of [23, 4, 29]. We found that this bias is mainly caused by the truck, bus, and train categories, with the fine-only model having val/test AP of 28.8/22.8, 53.5/32.2, and 33.0/18.6, respectively. This suggests that there is a domain shift on these categories, which also have little training data. COCO pre-training helps to improve results the most on these categories; however, the domain shift persists with 38.0/30.1, 57.5/40.9, and 41.2/30.9 val/test AP, respectively. Note that for the person and car categories we do not see any such bias (val/test AP are within ±1 point).
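The val/test bias discussed above is easier to read as per-category gaps. The sketch below only rearranges the numbers quoted in the text. The dictionary layout and function name are illustrative choices, not part of the paper.

```python
# (val AP, test AP) per category, copied from the text.
FINE_ONLY = {"truck": (28.8, 22.8), "bus": (53.5, 32.2), "train": (33.0, 18.6)}
COCO_PRETRAINED = {"truck": (38.0, 30.1), "bus": (57.5, 40.9), "train": (41.2, 30.9)}


def val_test_gaps(results):
    """Return val AP minus test AP for each category, rounded to one decimal."""
    return {name: round(val - test, 1) for name, (val, test) in results.items()}


if __name__ == "__main__":
    print("fine-only:", val_test_gaps(FINE_ONLY))
    print("COCO pre-trained:", val_test_gaps(COCO_PRETRAINED))
```

Both models score higher in absolute terms after COCO pre-training, yet every category keeps a val/test gap of several points, which is the persistence of the domain shift that the paragraph describes.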
Example results on Cityscapes are shown in Figure 8.
Table 8. Enhanced detection results of Mask R-CNN on COCO minival. Each row adds an extra component to the above row. We denote the ResNeXt model by ‘X’ for notational brevity.
Appendix B: Enhanced Results on COCO
As a general framework, Mask R-CNN is compatible with complementary techniques developed for detection/segmentation, including improvements made to Fast/Faster R-CNN and FCNs. In this appendix we describe some techniques that improve over our original results. Thanks to its generality and flexibility, Mask R-CNN was used as the framework by the three winning teams in the COCO 2017 instance segmentation competition, which all significantly outperformed the previous state of the art.
Instance Segmentation and Object Detection
We report some enhanced results of Mask R-CNN in Table 8. Overall, the improvements increase mask AP by 5.1 points (from 36.7 to 41.8) and box AP by 7.7 points (from 39.6 to 47.3). Each model improvement increases both mask AP and box AP consistently, showing good generalization of the Mask R-CNN framework. We detail the improvements next. These results, along with future updates, can be reproduced with our released code at https://github.com/facebookresearch/Detectron, and can serve as higher baselines for future research.
Updated baseline: We start with an updated baseline with a different set of hyper-parameters. We lengthen the training to 180k iterations, in which the learning rate is reduced by a factor of 10 at 120k and 160k iterations. We also change the NMS threshold to 0.5 (from a default value of 0.3). The updated baseline has 37.0 mask AP and 40.5 box AP.
End-to-end training: All previous results used stage-wise training, i.e., training RPN as the first stage and Mask R-CNN as the second. Following [37], we evaluate end-to-end (‘e2e’) training that jointly trains RPN and Mask R-CNN. We adopt the ‘approximate’ version in [37] that only computes partial gradients in the RoIAlign layer by ignoring the gradient w.r.t. RoI coordinates. Table 8 shows that e2e training improves mask AP by 0.6 and box AP by 1.2.
ImageNet-5k pre-training: Following [45], we experiment with models pre-trained on a 5k-class subset of ImageNet (in contrast to the standard 1k-class subset). This 5× increase in pre-training data improves both mask and box AP by 1 point. As a reference, [40] used ∼250× more images (300M) and reported a 2-3 box AP improvement on their baselines.
Table 9. Enhanced keypoint results of Mask R-CNN on COCO minival. Each row adds an extra component to the above row. Here we use only keypoint annotations but no mask annotations. We denote ResNet by ‘R’ and ResNeXt by ‘X’ for brevity.
Train-time augmentation: Scale augmentation at train time further improves results. During training, we randomly sample a scale from [640, 800] pixels and we increase the number of iterations to 260k (with the learning rate reduced by a factor of 10 at 200k and 240k iterations). Train-time augmentation improves mask AP by 0.6 and box AP by 0.8.
Model architecture: By upgrading the 101-layer ResNeXt to its 152-layer counterpart [19], we observe an increase of 0.5 mask AP and 0.6 box AP. This shows a deeper model can still improve results on COCO.
Using the recently proposed non-local (NL) model [43], we achieve 40.3 mask AP and 45.0 box AP. This result is without test-time augmentation, and the method runs at 3 fps on an Nvidia Tesla P100 GPU at test time.
Test-time augmentation: We combine the model results evaluated using scales of [400, 1200] pixels with a step of 100 and on their horizontal flips. This gives us a single-model result of 41.8 mask AP and 47.3 box AP.
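The set of test-time views implied by this description can be enumerated directly. The sketch below assumes the scale range is inclusive at both ends, and the function name is hypothetical. How the per-view predictions are merged is not specified in the text and is not modeled here.

```python
def tta_views(min_scale=400, max_scale=1200, step=100):
    """List (scale, flipped) pairs: every scale, with and without a horizontal flip."""
    scales = range(min_scale, max_scale + 1, step)  # ASSUMPTION: inclusive range
    return [(scale, flipped) for scale in scales for flipped in (False, True)]


if __name__ == "__main__":
    views = tta_views()
    print(len(views), "views per image")
    print(views[:4])
```

Under the inclusive-range assumption this gives 9 scales and 18 forward passes per image, which explains why the reported speed figures are quoted without test-time augmentation.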
The above result is the foundation of our submission to the COCO 2017 competition (which also used an ensemble, not discussed here). The first three winning teams for the instance segmentation task were all reportedly based on an extension of the Mask R-CNN framework.
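The per-step gains reported in this appendix can be tallied to confirm that they add up to the totals quoted at its start. This is a bookkeeping sketch only. The names are illustrative, and the ImageNet-5k gain is taken as exactly 1.0 AP, which is an assumption about the rounding used in the text.

```python
UPDATED_BASELINE = (37.0, 40.5)  # (mask AP, box AP) stated for the updated baseline
REPORTED_DELTAS = [
    ("end-to-end training", 0.6, 1.2),
    ("ImageNet-5k pre-training", 1.0, 1.0),  # ASSUMPTION: "1 AP" read as exactly 1.0
    ("train-time augmentation", 0.6, 0.8),
    ("152-layer ResNeXt", 0.5, 0.6),
]


def accumulate(start, deltas):
    """Apply (name, mask_delta, box_delta) steps in order and return the running totals."""
    mask_ap, box_ap = start
    rows = []
    for name, mask_delta, box_delta in deltas:
        mask_ap = round(mask_ap + mask_delta, 1)
        box_ap = round(box_ap + box_delta, 1)
        rows.append((name, mask_ap, box_ap))
    return rows


def total_gain(first, last):
    """Overall (mask, box) AP gain between two reported results."""
    return tuple(round(b - a, 1) for a, b in zip(first, last))


if __name__ == "__main__":
    for row in accumulate(UPDATED_BASELINE, REPORTED_DELTAS):
        print(row)
    print("overall:", total_gain((36.7, 39.6), (41.8, 47.3)))
```

The running total reaches 39.7 mask AP and 44.1 box AP before the non-local model, so the non-local and test-time augmentation steps account for the remaining gap to 41.8 and 47.3.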
Keypoint Detection
We report enhanced results of keypoint detection in Table 9. As an updated baseline, we extend the training schedule to 130k iterations in which the learning rate is reduced by a factor of 10 at 100k and 120k iterations. This improves APkp by about 1 point. Replacing ResNet-50 with ResNet-101 and ResNeXt-101 increases APkp to 66.1 and 67.3, respectively. With a recent method called data distillation [35], we are able to exploit the additional 120k unlabeled images provided by COCO. In brief, data distillation is a self-training strategy that uses a model trained on labeled data to predict annotations on unlabeled images, and in turn updates the model with these new annotations. Mask R-CNN provides an effective framework for such a self-training strategy. With data distillation, Mask R-CNN APkp improves by 1.8 points to 69.1. We observe that Mask R-CNN can benefit from extra data, even if that data is unlabeled.
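The self-training loop described above can be illustrated on a toy problem. The sketch below is generic pseudo-label self-training with a 1-D nearest-centroid classifier. It is not the data distillation method of [35] and has nothing to do with the actual keypoint model. All function names, the confidence margin, and the data are hypothetical.

```python
def fit_centroids(points, labels):
    """Fit a 1-D nearest-centroid classifier: one mean per class (0 and 1)."""
    centroids = {}
    for cls in (0, 1):
        members = [p for p, label in zip(points, labels) if label == cls]
        centroids[cls] = sum(members) / len(members)
    return centroids


def predict(centroids, x):
    """Assign x to the class with the nearer centroid."""
    return 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1


def self_train(labeled_x, labeled_y, unlabeled_x, margin=1.0):
    """One round of self-training: pseudo-label confident points, then refit."""
    model = fit_centroids(labeled_x, labeled_y)
    pseudo_x, pseudo_y = [], []
    for x in unlabeled_x:
        # Confidence proxy: how much closer x is to one centroid than the other.
        confidence = abs(abs(x - model[0]) - abs(x - model[1]))
        if confidence >= margin:
            pseudo_x.append(x)
            pseudo_y.append(predict(model, x))
    return fit_centroids(labeled_x + pseudo_x, labeled_y + pseudo_y)


if __name__ == "__main__":
    print(self_train([0.0, 4.0], [0, 1], [1.0, 5.0, 2.0]))
```

In this toy run the ambiguous point at 2.0 is skipped, while the two confident points are pseudo-labeled and shift both centroids, mirroring the predict-then-retrain cycle in the text.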
By using the same test-time augmentation as used for instance segmentation, we further boost APkp to 70.4.
Acknowledgements: We would like to acknowledge Ilija Radosavovic for contributions to code release and enhanced results, and the Caffe2 team for engineering support.
References
[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[2] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[3] A. Arnab and P. H. Torr. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017.
[4] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
[5] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
[6] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[8] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
[9] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
[10] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
[11] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
[12] R. Girshick. Fast R-CNN. In ICCV, 2015.
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[14] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.
[15] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[17] Z. Hayder, X. He, and M. Salzmann. Shape-aware instance segmentation. In CVPR, 2017.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[20] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? PAMI, 2015.
[21] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
[22] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
[23] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother. InstanceCut: from edges to instances with multicut. In CVPR, 2017.
[24] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[25] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
[26] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
[27] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[29] S. Liu, J. Jia, S. Fidler, and R. Urtasun. SGN: Sequential grouping networks for instance segmentation. In ICCV, 2017.
[30] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[31] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[32] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
[33] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
[34] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
[35] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: Towards omni-supervised learning. arXiv:1712.04440, 2017.
[36] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[37] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In TPAMI, 2017.
[38] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
[39] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016.
[40] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[41] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In ICLR Workshop, 2016.
[42] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
[43] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. arXiv:1711.07971, 2017.
[44] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[45] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.