Paper Reading Notes (30): Learning to Segment Every Thing

Existing methods for object instance segmentation require all training instances to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ∼100 well-annotated classes. The goal of this paper is to propose a new partially supervised training paradigm, together with a novel weight transfer function, that enables training instance segmentation models over a large set of categories for which all have box annotations, but only a small fraction have mask annotations. These contributions allow us to train Mask R-CNN to detect and segment 3000 visual concepts using box annotations from the Visual Genome dataset and mask annotations from the 80 classes in the COCO dataset. We carefully evaluate our proposed approach in a controlled study on the COCO dataset. This work is a first step towards instance segmentation models that have broad comprehension of the visual world.

Object detectors have become significantly more accurate (e.g., [10, 33]) and gained important new capabilities. One of the most exciting is the ability to predict a foreground segmentation mask for each detected object (e.g., [15]), a task called instance segmentation. In practice, typical instance segmentation systems are restricted to a narrow slice of the vast visual world that includes only around 100 object categories.

A principal reason for this limitation is that state-of-the-art instance segmentation algorithms require strong supervision, and such supervision may be limited and expensive to collect for new categories [22]. By comparison, bounding box annotations are more abundant and less expensive [4]. This fact raises a question: Is it possible to train high-quality instance segmentation models without complete instance segmentation annotations for all categories? With this motivation, our paper introduces a new partially supervised instance segmentation task and proposes a novel transfer learning method to address it.

We formulate the partially supervised instance segmentation task as follows: (1) given a set of categories of interest, a small subset has instance mask annotations, while the other categories have only bounding box annotations; (2) the instance segmentation algorithm should utilize this data to fit a model that can segment instances of all object categories in the set of interest. Since the training data is a mixture of strongly annotated examples (those with masks) and more weakly annotated examples (those with only boxes), we refer to the task as partially supervised.

The main benefit of the proposed partially supervised paradigm is that it allows us to build a large-scale instance segmentation model by exploiting both types of existing datasets: those with bounding box annotations over a large number of classes, such as Visual Genome [19], and those with instance mask annotations over a small number of classes, such as COCO [22]. As we will show, this enables us to scale state-of-the-art instance segmentation methods to thousands of categories, a capability that is critical for their deployment in real-world uses.

To address partially supervised instance segmentation, we propose a novel transfer learning approach built on Mask R-CNN [15]. Mask R-CNN is well-suited to our task because it decomposes the instance segmentation problem into the subtasks of bounding box object detection and mask prediction. These subtasks are handled by dedicated network ‘heads’ that are trained jointly. The intuition behind our approach is that once trained, the parameters of the bounding box head encode an embedding of each object category that enables the transfer of visual information for that category to the partially supervised mask head.

We materialize this intuition by designing a parameterized weight transfer function that is trained to predict a category’s instance segmentation parameters as a function of its bounding box detection parameters. The weight transfer function can be trained end-to-end in Mask R-CNN using classes with mask annotations as supervision. At inference time, the weight transfer function is used to predict the instance segmentation parameters for every category, thus enabling the model to segment all object categories, including those without mask annotations at training time.
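
The idea of predicting segmentation weights from detection weights can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two-layer MLP with a leaky ReLU, the dimensions, the random initialization, and the names `make_transfer_fn`, `transfer`, and `mask_logit` are all hypothetical and not taken from the paper. The point it shows is that the transfer parameters are shared across classes, so a class without mask annotations ("kite" in this toy setup) still receives segmentation weights.

```python
import random


def leaky_relu(x, slope=0.01):
    """Leaky ReLU activation (an assumed choice for this sketch)."""
    return x if x > 0 else slope * x


def matvec(matrix, vector):
    """Multiply a list-of-lists matrix (rows x cols) by a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def make_transfer_fn(det_dim, seg_dim, hidden_dim, seed=0):
    """Build a hypothetical class-agnostic transfer function T(w_det; theta).

    theta = (w1, w2) is randomly initialized here; in a real system it would
    be learned from the classes that have mask annotations.
    """
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.1) for _ in range(det_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(seg_dim)]

    def transfer(w_det_c):
        hidden = [leaky_relu(h) for h in matvec(w1, w_det_c)]
        return matvec(w2, hidden)

    return transfer


# Toy detection weights for three classes. In this sketch "person" and "dog"
# stand for classes with masks (set A) and "kite" for a box-only class (set B).
det_dim, seg_dim = 8, 4
rng = random.Random(1)
w_det = {
    name: [rng.gauss(0, 1) for _ in range(det_dim)]
    for name in ["person", "dog", "kite"]
}

# The same transfer function is applied to every class, with or without masks.
transfer = make_transfer_fn(det_dim, seg_dim, hidden_dim=6)
w_seg = {name: transfer(w) for name, w in w_det.items()}


def mask_logit(feature, class_name):
    """Mask logit at one pixel: dot product of its feature with the class's w_seg."""
    return sum(f * w for f, w in zip(feature, w_seg[class_name]))
```

Because `transfer` depends only on a class's detection weights, adding a new box-only category requires no mask-specific parameters for that category.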

We evaluate our approach in two settings. First, we use the COCO dataset [22] to simulate the partially supervised instance segmentation task as a means of establishing quantitative results on a dataset with high-quality annotations and evaluation metrics. Specifically, we split the full set of COCO categories into a subset with mask annotations and a complementary subset for which the system has access to only bounding box annotations. Because the COCO dataset involves only a small number (80) of semantically well-separated classes, quantitative evaluation is precise and reliable. Experimental results show that our method improves results over a strong baseline with up to a 40% relative increase in mask AP on categories without training masks.

In our second setting, we train a large-scale instance segmentation model on 3000 categories using the Visual Genome (VG) dataset [19]. VG contains bounding box annotations for a large number of object categories; however, quantitative evaluation is challenging because many categories are semantically overlapping (e.g., near synonyms) and the annotations are not exhaustive, making precision and recall difficult to measure. Moreover, VG is not annotated with instance masks. Instead, we use VG to provide qualitative output of a large-scale instance segmentation model. Output of our model is illustrated in Figures 1 and 5.

Instance segmentation. Instance segmentation is a highly active research area [12, 13, 5, 31, 32, 6, 14, 20, 18, 2], with Mask R-CNN [15] representing the current state-of-the-art. These methods assume a fully supervised training scenario in which all categories of interest have instance mask annotations during training. Fully supervised training, however, makes it difficult to scale these systems to thousands of categories. The focus of our work is to relax this assumption and enable training models even when masks are available for only a small subset of categories. To do this, we develop a novel transfer learning approach built on Mask R-CNN.

Weight prediction and task transfer learning. Instead of directly learning model parameters, prior work has explored predicting them from other sources (e.g., [11]). In [8], image classifiers are predicted from the natural language description of a zero-shot category. In [26], a small neural network is used to predict the classifier weights of the composition of two concepts from the classifier weights of each individual concept. Here, we design a model that predicts the class-specific instance segmentation weights used in Mask R-CNN, instead of training them directly, which is not possible in our partially supervised training scenario.

Our approach is also a type of transfer learning [27] where knowledge gained from one task helps with another task. Most related to our work, LSDA [17] transforms whole-image classification parameters into object detection parameters through a domain adaptation procedure. LSDA can be seen as transferring knowledge learned on an image classification task to an object detection task, whereas we consider transferring knowledge learned from bounding box detection to instance segmentation.

Weakly supervised semantic segmentation. Prior work trains semantic segmentation models from weak supervision. (Note that semantic segmentation is a pixel-labeling task that is different from instance segmentation, which is an object detection task.) Image-level labels and object size constraints are used in [29], while other methods use boxes as supervision for expectation-maximization [28] or iterate between proposal generation and training [4]. Point supervision and objectness potentials are used in [3]. Most work in this area addresses only semantic segmentation, treats each class independently, and relies on hand-crafted bottom-up proposals that generalize poorly. Our work is complementary to these approaches, as we explore generalizing segmentation models trained from a subset of classes to other classes without relying on bottom-up segmentation.

Visual embeddings. Object categories may be modeled by continuous ‘embedding’ vectors in a visual-semantic space, where nearby vectors are often close in appearance or semantic ontology. Class embedding vectors may be obtained via natural language processing techniques (e.g. word2vec [25] and GloVe [30]), from visual appearance information (e.g. [7]), or both (e.g. [35]). In our work, the parameters of Mask R-CNN’s box head contain class-specific appearance information and can be seen as embedding vectors learned by training for the bounding box object detection task. The class embedding vectors enable transfer learning in our model by sharing appearance information between visually related classes. We also compare with the NLP-based GloVe embeddings [30] in our experiments.

Let C be the set of object categories (i.e., ‘things’ [1]) for which we would like to train an instance segmentation model. Most existing approaches assume that all training examples in C are annotated with instance masks. We relax this requirement and instead assume that C = A ∪ B where examples from the categories in A have masks, while those in B have only bounding boxes. Since the examples of the B categories are weakly labeled w.r.t. the target task (instance segmentation), we refer to training on the combination of strong and weak labels as a partially supervised learning problem. Noting that one can easily convert instance masks to bounding boxes, we assume that bounding box annotations are also available for classes in A.
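
The setup above can be made concrete with a small sketch. It assumes a binary mask stored as a list of rows and inclusive pixel coordinates; the helper names `mask_to_box` and `available_labels`, and the example category names, are hypothetical and used only for illustration. It shows why classes in A can always supply boxes as well, and how the label types differ between A and B.

```python
def mask_to_box(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around nonzero pixels.

    Coordinates are inclusive pixel indices. Returns None for an empty mask.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


def available_labels(category, set_a, set_b):
    """Label types usable for training a category under C = A ∪ B."""
    if category in set_a:
        return ("box", "mask")  # masks exist, and boxes can be derived from them
    if category in set_b:
        return ("box",)  # only bounding boxes are annotated
    raise KeyError(f"{category!r} is not in C = A ∪ B")


# Toy 4x5 binary mask of a single instance.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_box(mask)  # (1, 1, 3, 2)

# Hypothetical split of three categories into A (masks) and B (boxes only).
set_a, set_b = {"person", "dog"}, {"kite"}
```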

Given an instance segmentation model like Mask R-CNN that has a bounding box detection component and a mask prediction component, we propose the Mask^X R-CNN method that transfers category-specific information from the model’s bounding box detectors to its instance mask predictors.

This paper addresses the problem of large-scale instance segmentation by formulating a partially supervised learning paradigm in which only a subset of classes have instance masks during training while the rest have box annotations. We propose a novel transfer learning approach, where a learned weight transfer function predicts how each class should be segmented based on parameters learned for detecting bounding boxes. Experimental results on the COCO dataset demonstrate that our method greatly improves the generalization of mask prediction to categories without mask training data. Using our approach, we build a large-scale instance segmentation model over 3000 classes in the Visual Genome dataset. The qualitative results are encouraging and illustrate an exciting new research direction into large-scale instance segmentation. They also reveal that scaling instance segmentation to thousands of categories, without full supervision, is an extremely challenging problem with ample opportunity for improved methods.

Figure 2. Detailed illustration of our Mask^X R-CNN method. Instead of directly learning the mask prediction parameters w_seg, Mask^X R-CNN predicts a category’s segmentation parameters w_seg from its corresponding detection parameters w_det, using a learned weight transfer function T. For training, T only needs mask data for the classes in set A, yet it can be applied to all classes in set A ∪ B at test time. We also augment the mask head with a complementary fully connected multi-layer perceptron (MLP).
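
One way to read the statement that T only needs mask data for the classes in A is that the mask loss simply skips instances of box-only classes. The sketch below illustrates that reading under assumptions: it uses an average per-pixel binary cross-entropy, represents each instance as a dictionary with hypothetical keys `category`, `pred`, and `target`, and the function name `mask_loss` is made up for this example.

```python
import math


def binary_cross_entropy(prob, target, eps=1e-7):
    """Per-pixel binary cross-entropy, with probabilities clamped for stability."""
    prob = min(max(prob, eps), 1 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


def mask_loss(instances, set_a):
    """Average per-pixel loss over instances whose class has mask annotations.

    Instances of classes outside set_a (box-only classes) contribute nothing,
    so no ground-truth mask is ever required for them.
    """
    total, count = 0.0, 0
    for inst in instances:
        if inst["category"] not in set_a:
            continue  # class in B: no ground-truth mask, so no mask loss
        for p, t in zip(inst["pred"], inst["target"]):
            total += binary_cross_entropy(p, t)
            count += 1
    return total / count if count else 0.0


# Toy batch: one instance from A with a two-pixel mask, one box-only instance.
instances = [
    {"category": "dog", "pred": [0.9, 0.1], "target": [1, 0]},
    {"category": "kite", "pred": [0.5, 0.5], "target": None},
]
loss = mask_loss(instances, set_a={"person", "dog"})
```

At test time no such filtering is needed, since predicting a mask for a class only requires its detection weights and the learned T.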
