Mask R-CNN Paper Reading Notes

Abstract
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.
The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.
Introduction
The vision community has rapidly improved object detection and semantic segmentation results over a short period of time. In large part, these advances have been driven by powerful baseline systems, such as the Fast/Faster R-CNN [1], [2] and Fully Convolutional Network (FCN) [3] frameworks for object detection and semantic segmentation, respectively. These methods are conceptually intuitive and offer flexibility and robustness, together with fast training and inference time. Our goal in this work is to develop a comparably enabling framework for instance segmentation.

Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance. It therefore combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances. Given this, one might expect a complex method is required to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass prior state-of-the-art instance segmentation results.
Our method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression (Figure 1). The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation.
In principle Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool [5], [1], the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations. Despite being a seemingly minor change, RoIAlign has a large impact: it improves mask accuracy by relative 10% to 50%, showing bigger gains under stricter localization metrics. Second, we found it essential to decouple mask and class prediction: we predict a binary mask for each class independently, without competition among classes, and rely on the network's RoI classification branch to predict the category. In contrast, FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification, and based on our experiments works poorly for instance segmentation.
Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task [6], including the heavily-engineered entries from the 2016 competition winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments, we evaluate multiple basic instantiations, which allows us to demonstrate its robustness and analyze the effects of core factors.

Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. We believe the fast train and test speeds, together with the framework's flexibility and accuracy, will benefit and ease future research on instance segmentation.
Finally, we showcase the generality of our framework via the task of human pose estimation on the COCO keypoint dataset [6]. By viewing each keypoint as a one-hot binary mask, with minimal modification Mask R-CNN can be applied to detect instance-specific poses. Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition, and at the same time runs at 5 fps. Mask R-CNN, therefore, can be seen more broadly as a flexible framework for instance-level recognition and can be readily extended to more complex tasks.
A preliminary version of this manuscript was published previously [7]. As a generic framework, Mask R-CNN is compatible with complementary techniques developed for detection/segmentation, as have been widely witnessed in Fast/Faster R-CNN and FCN in the past years. This manuscript also describes some techniques that improve over our original results published in [7]. Thanks to its generality and flexibility, Mask R-CNN was used as the framework by the three winning teams in the COCO 2017 instance segmentation competition, which all significantly outperformed the previous state of the art. We have released code to facilitate future research.
Related Work
R-CNN: The Region-based CNN (R-CNN) approach [8] to bounding-box object detection is to attend to a manageable number of candidate object regions [9], [10] and evaluate convolutional networks [11], [12] independently on each RoI. R-CNN was extended [5], [1] to allow attending to RoIs on feature maps using RoIPool, leading to fast speed and better accuracy. Faster R-CNN [2] advanced this stream by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements (e.g., [13], [14], [15]), and is the current leading framework in several benchmarks.

Instance Segmentation: Driven by the effectiveness of R-CNN, many approaches to instance segmentation are based on segment proposals. Earlier methods [8], [16], [17], [18] resorted to bottom-up segments [9], [19]. DeepMask [20] and following works [21], [22] learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods, segmentation precedes recognition, which is slow and less accurate. Likewise, Dai et al. [23] proposed a complex multiple-stage cascade that predicts segment proposals from bounding-box proposals, followed by classification. Instead, our method is based on parallel prediction of masks and class labels, which is simpler and more flexible.
Most recently, Li et al. [24] combined the segment proposal system in [22] and object detection system in [25] for "fully convolutional instance segmentation" (FCIS). The common idea in [22], [25], [24] is to predict a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object classes, boxes, and masks, making the system fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges (Figure 6), showing that it is challenged by the fundamental difficulties of segmenting instances.

Another family of solutions [26], [27], [28], [29] to instance segmentation is driven by the success of semantic segmentation. Starting from per-pixel classification results (e.g., FCN outputs), these methods attempt to cut the pixels of the same category into different instances. In contrast to the segmentation-first strategy of these methods, Mask R-CNN is based on an instance-first strategy. We expect a deeper incorporation of both strategies will be studied in the future.
3. Mask R-CNN
Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN.
Faster R-CNN: We begin by briefly reviewing the Faster R-CNN detector [2]. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN [1], extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference. We refer readers to [15] for latest, comprehensive comparisons between Faster R-CNN and other frameworks.
Mask R-CNN: Mask R-CNN adopts the same two-stage procedure, with an identical first stage (which is RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions (e.g. [20], [23], [24]). Our approach follows the spirit of Fast R-CNN [1] that applies bounding-box classification and regression in parallel (which turned out to largely simplify the multi-stage pipeline of original R-CNN [8]).
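To make the two-stage procedure concrete, here is a minimal PyTorch-style sketch of the overall flow with the mask branch run in parallel to the box branch. All module names and the constructor interface are hypothetical placeholders for illustration, not the paper's actual code or a real library API:

```python
import torch.nn as nn

class TwoStageDetector(nn.Module):
    """Stage 1: RPN proposes boxes. Stage 2: per-RoI heads predict class, box, mask."""
    def __init__(self, backbone, rpn, roi_extract, box_head, mask_head):
        super().__init__()
        self.backbone, self.rpn = backbone, rpn
        self.roi_extract = roi_extract          # RoIPool in Faster R-CNN, RoIAlign here
        self.box_head, self.mask_head = box_head, mask_head

    def forward(self, images):
        features = self.backbone(images)        # features can be shared by both stages
        proposals = self.rpn(features)          # stage 1: candidate object boxes
        rois = self.roi_extract(features, proposals)
        cls_scores, box_deltas = self.box_head(rois)   # existing Faster R-CNN branch
        masks = self.mask_head(rois)                   # new branch, run in parallel
        return cls_scores, box_deltas, masks
```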
Formally, during training, we define a multi-task loss on each sampled RoI as $L = L_{cls} + L_{box} + L_{mask}$. The classification loss $L_{cls}$ and bounding-box loss $L_{box}$ are identical to those defined in [1]. The mask branch has a $Km^2$-dimensional output for each RoI, which encodes $K$ binary masks of resolution $m \times m$, one for each of the $K$ classes. To this we apply a per-pixel sigmoid, and define $L_{mask}$ as the average binary cross-entropy loss. For an RoI associated with ground-truth class $k$, $L_{mask}$ is only defined on the $k$-th mask (other mask outputs do not contribute to the loss).
Our definition of $L_{mask}$ allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [3] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results.
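As a sanity check on the loss definition, here is a short PyTorch-style sketch of $L_{mask}$: a per-pixel sigmoid with binary cross-entropy, evaluated only on the mask channel of the ground-truth class. The tensor shapes and names are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    """L_mask for a batch of positive RoIs.

    mask_logits: (N, K, m, m) raw outputs, one m x m mask per class.
    gt_masks:    (N, m, m) binary targets, already cropped/resized to each RoI.
    gt_classes:  (N,) ground-truth class index k for each RoI.
    """
    n = mask_logits.shape[0]
    # Select only the k-th mask per RoI; the other K-1 masks get no gradient.
    per_class = mask_logits[torch.arange(n), gt_classes]        # (N, m, m)
    # Per-pixel sigmoid + average binary cross-entropy.
    return F.binary_cross_entropy_with_logits(per_class, gt_masks.float())
```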
Mask Representation: A mask encodes an input object's spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully-connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions.

Specifically, we predict an m×m mask from each RoI using an FCN [3]. This allows each layer in the mask branch to maintain the explicit m×m object spatial layout without collapsing it into a vector representation that lacks spatial dimensions. Unlike previous methods that resort to fc layers for mask prediction [20], [21], [23], our fully convolutional representation requires fewer parameters, and is more accurate as demonstrated by experiments.

This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the following RoIAlign layer that plays a key role in mask prediction.
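A sketch of what "fully convolutional" means for the mask branch: every layer keeps the m×m spatial grid, in contrast to an fc head that would flatten it. The layer count and channel widths below are illustrative assumptions, not the paper's exact head:

```python
import torch
import torch.nn as nn

# Fully convolutional mask head: input (N, C, m, m) -> output (N, K, m, m).
# No layer flattens the feature map, so the pixel-to-pixel layout survives.
def make_mask_head(in_channels=256, num_classes=80, width=256):
    return nn.Sequential(
        nn.Conv2d(in_channels, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, num_classes, 1),   # K masks, one per class
    )

# An fc head with the same output size would need width*m*m -> K*m*m weights
# (millions of parameters for m = 28) and would discard the spatial structure.
roi_features = torch.randn(4, 256, 14, 14)
masks = make_mask_head()(roi_features)      # shape (4, 80, 14, 14)
```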
RoIAlign: RoIPool [1] is a standard operation for extracting a small feature map (e.g., 7×7) from each RoI. RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Quantization is performed, e.g., on a continuous coordinate x by computing [x/16], where 16 is a feature map stride and [·] is rounding; likewise, quantization is performed when dividing into bins (e.g., 7×7). These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks.
To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries or bins (i.e., we use x/16 instead of [x/16]). We use bilinear interpolation [31] to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average). See Figure 3 for our implementation details. We note that the results are not sensitive to where the four sampling points are located in the bin, or how many points are sampled, as long as no quantization is performed on any coordinates involved.

RoIAlign leads to large improvements as we show in §4.2. We also compare to the RoIWarp operation proposed in [23]. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in [23] as quantizing RoI just like RoIPool. So even though RoIWarp also adopts bilinear resampling motivated by [31], it performs on par with RoIPool as shown by experiments (more details in Table 2c), demonstrating the crucial role of alignment.

Figure 3: RoIAlign implementation. The dashed grid is the feature map on which RoIAlign is performed, the solid lines an RoI (with 2×2 bins in this example), and the dots the 4 sampling points in each bin. The value at each sampling point is computed by bilinear interpolation from the nearby grid points on the feature map. No quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points.

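Reflecting the figure above, here is a simplified sketch of RoIAlign for a single RoI: continuous coordinates are kept (x/16, never rounded), each bin is sampled at four regular points by bilinear interpolation, and the samples are averaged. It is written with plain loops for clarity; real implementations vectorize this, and the 0.25/0.75 sample placement is one reasonable choice among several:

```python
import torch

def bilinear(feature, x, y):
    """Bilinearly interpolate feature (C, H, W) at a continuous point (x, y)."""
    H, W = feature.shape[1:]
    x0, y0 = min(int(x), W - 1), min(int(y), H - 1)
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feature[:, y0, x0] + wx * (1 - wy) * feature[:, y0, x1]
            + (1 - wx) * wy * feature[:, y1, x0] + wx * wy * feature[:, y1, x1])

def roi_align_single(feature, roi, out_size=7, stride=16):
    """feature: (C, H, W); roi: (x1, y1, x2, y2) in image coordinates (floats)."""
    x1, y1, x2, y2 = (v / stride for v in roi)        # continuous x/16, never [x/16]
    bin_w, bin_h = (x2 - x1) / out_size, (y2 - y1) / out_size
    out = torch.zeros(feature.shape[0], out_size, out_size)
    for i in range(out_size):
        for j in range(out_size):
            # 4 regularly spaced sample points inside bin (i, j), then average
            samples = [bilinear(feature, x1 + (j + dx) * bin_w, y1 + (i + dy) * bin_h)
                       for dy in (0.25, 0.75) for dx in (0.25, 0.75)]
            out[:, i, j] = torch.stack(samples).mean(dim=0)
    return out
```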
Network Architecture: To demonstrate the generality of our approach, we instantiate Mask R-CNN with multiple architectures. For clarity, we differentiate between: (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition (classification and regression) and mask prediction that is applied separately to each RoI. We denote the backbone architecture using the nomenclature network-depth-features. We evaluate ResNet [4] and ResNeXt [32] networks of depth 50 or 101 layers. The original implementation of Faster R-CNN with ResNets [4] extracted features from the final convolutional layer of the 4-th stage, which we call C4. This backbone with ResNet-50, for example, is denoted by ResNet-50-C4. This is a common choice used in [4], [23], [15], [33].
We also explore another more effective backbone recently proposed by Lin et al. [14], called a Feature Pyramid Network (FPN). FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask R-CNN gives excellent gains in both accuracy and speed. For further details on FPN, we refer readers to [14].

For the network head we closely follow architectures presented in previous works, to which we add a fully convolutional mask prediction branch. Specifically, we extend the Faster R-CNN box heads from the ResNet [4] and FPN [14] papers. Details are shown in Figure 4. The head on the ResNet-C4 backbone includes the 5-th stage of ResNet (namely, the 9-layer 'res5' [4]), which is compute-intensive. For FPN, the backbone already includes res5 and hence allows for a more efficient head that uses fewer filters. We note that our mask branches have a straightforward structure; more complex designs have the potential to improve performance but are not the focus of this work.
3.1 Implementation Details
We set hyper-parameters following existing Fast/Faster R-CNN work [1], [2], [14]. Although these decisions were made for object detection in original papers [1], [2], [14], we found our instance segmentation system is robust to them.
Training: As in Fast R-CNN, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. The mask loss $L_{mask}$ is defined only on positive RoIs. The mask target is the intersection between an RoI and its associated ground-truth mask.
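A sketch of how a mask target might be built under that definition: the ground-truth mask is restricted to the RoI (their intersection) and resampled to the m×m prediction resolution. The function name, the use of bilinear `interpolate`, and the 0.5 re-binarization are assumptions of this note, not the paper's code:

```python
import torch
import torch.nn.functional as F

def mask_target(gt_mask, roi, m=28):
    """gt_mask: (H, W) binary tensor; roi: (x1, y1, x2, y2) integer box.
    Returns the m x m training target for this RoI."""
    x1, y1, x2, y2 = roi
    crop = gt_mask[y1:y2, x1:x2].float()     # intersection of RoI and gt mask
    crop = crop[None, None]                  # (1, 1, h, w) for interpolate
    target = F.interpolate(crop, size=(m, m), mode="bilinear", align_corners=False)
    return (target[0, 0] > 0.5).float()      # back to a binary m x m mask
```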
We adopt image-centric training [1]. Images are resized such that their scale (shorter edge) is 800 pixels [14]. Each mini-batch has 2 images per GPU and each image has N sampled RoIs, with a ratio of 1:3 of positive to negatives [1]. N is 64 for the C4 backbone (as in [1], [2]) and 512 for FPN (as in [14]). We train on 8 GPUs (so the effective mini-batch size is 16) for 160k iterations, with a learning rate of 0.02 which is decreased by 10 at the 120k iteration. We use a weight decay of 0.0001 and a momentum of 0.9. When using ResNeXt [32], we use a mini-batch size of 1 image per GPU with the same number of iterations, and a starting learning rate of 0.01.

The RPN anchors span 5 scales and 3 aspect ratios, following [14]. For convenient ablation, RPN is trained separately and does not share features with Mask R-CNN, unless specified. For every entry in this paper, RPN and Mask R-CNN have the same backbones and so they are shareable.
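For reference, the hyper-parameters above gathered into one place as a hypothetical config dict; the values are taken directly from the text, but the key names are ours:

```python
# Training schedule from the text, as an illustrative config (key names are ours).
TRAIN_CFG = {
    "image_shorter_side": 800,        # resize so the shorter edge is 800 px
    "images_per_gpu": 2,              # 1 for ResNeXt backbones
    "rois_per_image": {"C4": 64, "FPN": 512},
    "positive_fraction": 0.25,        # 1:3 positive-to-negative RoI sampling
    "fg_iou_threshold": 0.5,          # RoI is positive if IoU >= 0.5 with a gt box
    "num_gpus": 8,                    # effective mini-batch of 16 images
    "iterations": 160_000,
    "base_lr": 0.02,                  # 0.01 for ResNeXt; divided by 10 at 120k iters
    "lr_decay_at": 120_000,
    "weight_decay": 1e-4,
    "momentum": 0.9,
    "rpn_anchor_scales": 5,
    "rpn_anchor_aspect_ratios": 3,
}
```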
Inference: At test time, the proposal number is 300 for the C4 backbone (as in [2]) and 1000 for FPN (as in [14]). We run the box prediction branch on these proposals, followed by non-maximum suppression [34]. The mask branch is then applied to the highest scoring 100 detection boxes. Although this differs from the parallel computation used in training, it speeds up inference and improves accuracy (due to the use of fewer, more accurate RoIs). The mask branch can predict K masks per RoI, but we only use the k-th mask, where k is the predicted class by the classification branch. The m×m floating-number mask output is then resized to the RoI size, and binarized at a threshold of 0.5. Note that since we only compute masks on the top 100 detection boxes, Mask R-CNN adds a small overhead to its Faster R-CNN counterpart (e.g., ∼20% on typical models).
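Putting the inference steps together as a sketch: box branch first, NMS, keep the top 100, run the mask branch only on those, select the k-th mask, resize to the box, and binarize at 0.5. The helper functions (`box_head`, `mask_head`, `roi_align`, `nms`) are placeholders for whatever detector implementation is in use, and the assumed shapes are noted in comments:

```python
import torch
import torch.nn.functional as F

def mask_rcnn_inference(features, proposals, box_head, mask_head, roi_align, nms):
    """Hypothetical glue code for the test-time procedure described above."""
    # 1. Box branch on all proposals (300 for C4, 1000 for FPN), then NMS.
    scores, boxes = box_head(roi_align(features, proposals))  # scores: (N, K)
    keep = nms(boxes, scores)                  # indices sorted by score
    # 2. Mask branch only on the 100 highest-scoring detections.
    top = keep[:100]
    mask_logits = mask_head(roi_align(features, boxes[top]))  # (n, K, m, m)
    k = scores[top].argmax(dim=1)                             # predicted class per box
    masks = mask_logits[torch.arange(len(top)), k].sigmoid()  # use only the k-th mask
    # 3. Resize each m x m mask to its box size and binarize at 0.5.
    results = []
    for mask, box in zip(masks, boxes[top]):
        x1, y1, x2, y2 = box.round().int().tolist()
        h, w = max(y2 - y1, 1), max(x2 - x1, 1)
        full = F.interpolate(mask[None, None], size=(h, w), mode="bilinear",
                             align_corners=False)[0, 0]
        results.append((box, full > 0.5))
    return results
```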
