FAIR releases a large-scale instance segmentation dataset with many categories and few training samples per category
2019.8
Paper: https://arxiv.org/pdf/1908.03195.pdf
HTML version: https://www.arxiv-vanity.com/papers/1908.03195/
Project page: http://www.lvisdataset.org
Abstract
Progress on object detection is enabled by datasets that focus the research community’s attention on open challenges. This process led us from simple images to complex scenes and from bounding boxes to segmentation masks. In this work, we introduce LVIS (pronounced ‘el-vis’): a new dataset for Large Vocabulary Instance Segmentation. We plan to collect ∼2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images. Due to the Zipfian distribution of categories in natural images, LVIS naturally has a long tail of categories with few training samples. Given that state-of-the-art deep learning methods for object detection perform poorly in the low-sample regime, we believe that our dataset poses an important and exciting new scientific challenge. LVIS is available at http://www.lvisdataset.org.
1 INTRODUCTION
A central goal of computer vision is to endow algorithms with the ability to intelligently describe images. Object detection is a canonical image description task; it is intuitively appealing, useful in applications, and straightforward to benchmark in existing settings. The accuracy of object detectors has improved dramatically and new capabilities, such as predicting segmentation masks and 3D representations, have been developed. There are now exciting opportunities to push these methods towards new goals.
Today, rigorous evaluation of general purpose object detectors is mostly performed in the few category regime (e.g. 80) or when there are a large number of training examples per category (e.g. 100 to 1000+). Thus, there is an opportunity to enable research in the natural setting where there are a large number of categories and per-category data is sometimes scarce. The long tail of rare categories is inescapable; annotating more images simply uncovers previously unseen, rare categories (see Fig.9 and [29, 25, 24, 27]). Efficiently learning from few examples is a significant open problem in machine learning and computer vision, making this opportunity one of the most exciting from a scientific and practical perspective. But to open this area to empirical study, a suitable, high-quality dataset and benchmark is required.
We aim to enable this new research direction by designing and collecting LVIS (pronounced ‘el-vis’)—a benchmark dataset for research on Large Vocabulary Instance Segmentation. We are collecting instance segmentation masks for more than 1000 entry-level object categories (see Fig.1). When completed, we plan for our dataset to contain 164k images and ∼2 million high-quality instance masks. Our annotation pipeline starts from a set of images that were collected without prior knowledge of the categories that will be labeled in them. We engage annotators in an iterative object spotting process that uncovers the long tail of categories that naturally appears in the images and avoids using machine learning algorithms to automate data labeling.
We designed a crowdsourced annotation pipeline that enables the collection of our large-scale dataset while also yielding high-quality segmentation masks. Quality is important for future research because relatively coarse masks, such as those in the COCO dataset[18], limit the ability to differentiate algorithm-predicted mask quality beyond a certain, coarse point. When compared to expert annotators, our segmentation masks have higher overlap and boundary consistency than both COCO and ADE20K[28].
To build our dataset, we adopt an evaluation-first design principle. This principle states that we should first determine exactly how to perform quantitative evaluation and only then design and build a dataset collection pipeline to gather the data entailed by the evaluation. We select our benchmark task to be COCO-style instance segmentation and we use the same COCO-style average precision (AP) metric that averages over categories and different mask intersection over union (IoU) thresholds[19]. Task and metric continuity with COCO reduces barriers to entry.
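As a concrete reference point, the snippet below is a minimal sketch of computing COCO-style mask AP with the official pycocotools package (the COCO API cited as [19]); the annotation and result file names are placeholders, not files shipped with LVIS.

```python
# Minimal sketch: COCO-style mask AP with pycocotools (file names are placeholders).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("lvis_style_annotations.json")        # ground truth in COCO format
coco_dt = coco_gt.loadRes("detector_results.json")   # detector output (segmentation masks + scores)

evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.evaluate()     # per-image, per-category matching at IoU thresholds 0.50:0.05:0.95
evaluator.accumulate()   # build precision/recall curves
evaluator.summarize()    # prints AP averaged over categories and IoU thresholds
```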
Buried within this seemingly innocuous task choice are immediate technical challenges: How do we fairly evaluate detectors when one object can reasonably be labeled with multiple categories (see Fig.2)? How do we make the annotation workload feasible when labeling 164k images with segmented objects from over 1000 categories?
The essential design choice resolving these challenges is to build a federated dataset: a single dataset that is formed by the union of a large number of smaller constituent datasets, each of which looks exactly like a traditional object detection dataset for a single category. Each small dataset provides the essential guarantee of exhaustive annotations for a single category—all instances of that category are annotated. Multiple constituent datasets may overlap and thus a single object within an image can be labeled with multiple categories. Furthermore, since the exhaustive annotation guarantee only holds within each small dataset, we do not require the entire federated dataset to be exhaustively annotated with all categories, which dramatically reduces the annotation workload. Crucially, at test time the membership of each image with respect to the constituent datasets is not known by the algorithm and thus it must make predictions as if all categories will be evaluated. The evaluation oracle evaluates each category fairly on its constituent dataset.
In the remainder of this paper, we summarize how our dataset and benchmark relate to prior work, provide details on the evaluation protocol, describe how we collected data, and then discuss results of the analysis of this data.
Dataset Timeline.
We report detailed analysis on the 5000 image val subset that we have annotated twice. We have now annotated an additional 77k images (split between train, val, and test), representing ∼50% of the final dataset; we refer to this as LVIS v0.5 (see §A for details). The first LVIS Challenge, based on v0.5, will be held at the COCO Workshop at ICCV 2019.
1.1 RELATED DATASETS
Datasets shape the technical problems researchers study and consequently the path of scientific discovery[17]. We owe much of our current success in image recognition to pioneering datasets such as MNIST[16], BSDS[20], Caltech 101[6], PASCAL VOC[5], ImageNet[23], and COCO[18]. These datasets enabled the development of algorithms that detect edges, perform large-scale image classification, and localize objects by bounding boxes and segmentation masks. They were also used in the discovery of important ideas, such as Convolutional Networks[15, 13], Residual Networks[10], and Batch Normalization[11].
LVIS is inspired by these and other related datasets, including those focused on street scenes (Cityscapes[3] and Mapillary[22]) and pedestrians (Caltech Pedestrians[4]). We review the most closely related datasets below.
COCO[18] is the most popular instance segmentation benchmark for common objects. It contains 80 categories that are pairwise distinct. There are a total of 118k training images, 5k validation images, and 41k test images. All 80 categories are exhaustively annotated in all images (ignoring annotation errors), leading to approximately 1.2 million instance segmentation masks. To establish continuity with COCO, we adopt the same instance segmentation task and AP metric, and we are also annotating all images from the COCO 2017 dataset. All 80 COCO categories can be mapped into our dataset. In addition to representing an order of magnitude more categories than COCO, our annotation pipeline leads to higher-quality segmentation masks that more closely follow object boundaries (see §4).
ADE20K[28] is an ambitious effort to annotate almost every pixel in 25k images with object instance, ‘stuff’, and part segmentations. The dataset includes approximately 3000 named objects, stuff regions, and parts. Notably, ADE20K was annotated by a single expert annotator, which increases consistency but also limits dataset size. Due to the relatively small number of annotated images, most of the categories do not have enough data to allow for both training and evaluation. Consequently, the instance segmentation benchmark associated with ADE20K evaluates algorithms on the 100 most frequent categories. In contrast, our goal is to enable benchmarking of large vocabulary instance segmentation methods.
iNaturalist[26] contains nearly 900k images annotated with bounding boxes for 5000 plant and animal species. Similar to our goals, iNaturalist emphasizes the importance of benchmarking classification and detection in the few example regime. Unlike our effort, iNaturalist does not include segmentation masks and is focussed on a different image and fine-grained category distribution; our category distribution emphasizes entry-level categories.
Open Images v4[14] is a large dataset of 1.9M images. The detection portion of the dataset includes 15M bounding boxes labeled with 600 object categories. The associated benchmark evaluates the 500 most frequent categories, all of which have over 100 training samples (>70% of them have over 1000 training samples). Thus, unlike our benchmark, low-shot learning is not integral to Open Images. Also different from our dataset is the use of machine learning algorithms to select which images will be annotated by using classifiers for the target categories. Our data collection process, in contrast, involves no machine learning algorithms and instead discovers the objects that appear within a given set of images. Starting with release v4, Open Images has used a federated dataset design for object detection.
2 DATASET DESIGN
We followed an evaluation-first design principle: prior to any data collection, we precisely defined what task would be performed and how it would be evaluated. This principle is important because there are technical challenges that arise when evaluating detectors on a large vocabulary dataset that do not occur when there are few categories. These must be resolved first, because they have profound implications for the structure of the dataset, as we discuss next.
2.1 TASK AND EVALUATION OVERVIEW
Task and Metric. Our dataset benchmark is the instance segmentation task: given a fixed, known set of categories, design an algorithm that when presented with a previously unseen image will output a segmentation mask for each instance of each category that appears in the image along with the category label and a confidence score. Given the output of an algorithm over a set of images, we compute mask average precision (AP) using the definition and implementation from the COCO dataset[19] (for more detail see §2.3).
Evaluation Challenges. Datasets like PASCAL VOC and COCO use manually selected categories that are pairwise disjoint: when annotating a car, there’s never any question if the object is instead a potted plant or a sofa. When increasing the number of categories, it is inevitable that other types of pairwise relationships will occur: (1) partially overlapping visual concepts; (2) parent-child relationships; and (3) perfect synonyms. See Fig.2 for examples.
If these relations are not properly addressed, then the evaluation protocol will be unfair. For example, most toys are not deer and most deer are not toys, but a toy deer is both—if a detector outputs deer and the object is only labeled toy, the detection will be marked as wrong. Likewise, if a car is only labeled vehicle, and the algorithm outputs car, it will be incorrectly judged to be wrong. Or, if an object is only labeled backpack and the algorithm outputs the synonym rucksack, it will be incorrectly penalized. Providing a fair benchmark is important for accurately reflecting algorithm performance.
These problems occur when the ground-truth annotations are missing one or more true labels for an object. If an algorithm happens to predict one of these correct, but missing labels, it will be unfairly penalized. Now, if all objects are exhaustively and correctly labeled with all categories, then the problem is trivially solved. But correctly and exhaustively labeling 164k images each with 1000 categories is undesirable: it forces a binary judgement deciding if each category applies to each object; there will be many cases of genuine ambiguity and inter-annotator disagreement. Moreover, the annotation workload will be very large. Given these drawbacks, we describe our solution next.
2.2 FEDERATED DATASETS
Our key observation is that the desired evaluation protocol does not require us to exhaustively annotate all images with all categories. What is required instead is that for each category c there must exist two disjoint subsets of the entire dataset D for which the following guarantees hold:
Positive set: there exists a subset of images Pc⊆D such that all instances of c in Pc are segmented. In other words, Pc is exhaustively annotated for category c.
Negative set: there exists a subset of images Nc⊆D such that no instance of c appears in any of these images.
Given these two subsets for a category c, Pc∪Nc can be used to perform standard COCO-style AP evaluation for c. The evaluation oracle only judges the algorithm on a category c over the subset of images in which c has been exhaustively annotated; if a detector reports a detection of category c on an image i∉Pc∪Nc, the detection is not evaluated.
By collecting the per-category sets into a single dataset, D=∪c(Pc∪Nc), we arrive at the concept of a federated dataset. A federated dataset is a dataset that is formed by the union of smaller constituent datasets, each of which looks exactly like a traditional object detection dataset for a single category. By not annotating all images with all categories, freedom is created to design an annotation process that avoids ambiguous cases and collects annotations only if there is sufficient inter-annotator agreement. At the same time, the workload can be dramatically reduced.
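To make the federated evaluation rule concrete, here is an illustrative sketch (plain Python with hypothetical field names) of how detections are restricted before scoring a category: only images in Pc∪Nc are considered, and detections on any other image are simply ignored.

```python
# Illustrative sketch of federated evaluation: category c is scored only on Pc ∪ Nc.
def filter_for_category(detections, positive_set, negative_set, category):
    """Keep only detections of `category` on images where that category is
    actually evaluated (its positive or negative set)."""
    eval_images = positive_set | negative_set          # images that count for this category
    return [d for d in detections
            if d["category"] == category and d["image_id"] in eval_images]

# Example: a "deer" detection on image 7 is ignored because 7 is not in P_deer ∪ N_deer.
P = {"deer": {1, 2}, "toy": {3}}
N = {"deer": {3, 4}, "toy": {1}}
dets = [
    {"image_id": 1, "category": "deer", "score": 0.9},
    {"image_id": 7, "category": "deer", "score": 0.8},  # not evaluated for "deer"
]
print(filter_for_category(dets, P["deer"], N["deer"], "deer"))
```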
Finally, we note that positive set and negative set membership on the test split is not disclosed and therefore algorithms have no side information about what categories will be evaluated in each image. An algorithm thus must make its best prediction for all categories in each test image.
Reduced Workload. Federated dataset design allows us to make |Pc∪Nc|≪|D|,∀c. This choice dramatically reduces the workload and allows us to undersample the most frequent categories in order to avoid wasting annotation resources on them (e.g. person accounts for 30% of COCO). Of our estimated ∼2 million instances, likely no single category will account for more than ∼3% of the total instances.
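As a rough illustration of the savings (the per-category evaluation set size below is an assumption, not a number from the paper), compare the number of category-image decisions under exhaustive labeling with the federated design:

```python
# Back-of-the-envelope workload comparison (all sizes are illustrative assumptions).
num_images, num_categories = 164_000, 1_200
exhaustive_checks = num_images * num_categories          # every (image, category) pair
per_category_eval = 3_000                                # assumed |Pc ∪ Nc| per category
federated_checks = num_categories * per_category_eval
print(exhaustive_checks, federated_checks, exhaustive_checks / federated_checks)
# ~196.8M vs ~3.6M category-image decisions: a >50x reduction under these assumptions.
```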
2.3 EVALUATION DETAILS
The challenge evaluation server will only return the overall AP, not per-category APs. We do this because: (1) it avoids leaking which categories are present in the test set; (2) given that tail categories are rare, there will be few examples for evaluation in some cases, which makes per-category AP unstable; (3) by averaging over a large number of categories, the overall category-averaged AP has lower variance, making it a robust metric for ranking algorithms.
Non-Exhaustive Annotations. We also collect an image-level boolean label, eci, indicating if image i∈Pc is exhaustively annotated for category c. In most cases (91%), this flag is true, indicating that the annotations are indeed exhaustive. In the remaining cases, there is at least one instance in the image that is not annotated. Missing annotations often occur in ‘crowds’ where there are a large number of instances and delineating them is difficult. During evaluation, we do not count false positives for category c on images i that have eci set to false. We do measure recall on these images: the detector is expected to predict accurate segmentation masks for the labeled instances. Our strategy differs from other datasets that use a small maximum number of instances per image, per category (10-15) together with ‘crowd regions’ (COCO) or use a special ‘group of c’ label to represent 5 or more instances (Open Images v4). Our annotation pipeline (§3) attempts to collect segmentations for all instances in an image, regardless of count, and then checks if the labeling is in fact exhaustive. See Fig.3.
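A minimal sketch of how the eci flag could change scoring for one image and one category is shown below; the IoU-based matching step is abstracted away and all names are illustrative.

```python
# Sketch of how the per-image exhaustiveness flag e_ci changes scoring (illustrative only).
def count_errors(detections, gt_instances, exhaustive):
    """Return (false_positives, missed_gt) for one category on one image.
    `detections` and `gt_instances` are lists of ids; real matching is IoU-based,
    a simple set intersection stands in for it here."""
    matched = set(detections) & set(gt_instances)
    missed_gt = len(gt_instances) - len(matched)            # recall is always measured
    unmatched_dets = len(detections) - len(matched)
    false_positives = unmatched_dets if exhaustive else 0   # FPs ignored when e_ci is False
    return false_positives, missed_gt

print(count_errors(["a", "b", "c"], ["a"], exhaustive=False))  # -> (0, 0): extra dets not penalized
print(count_errors(["a", "b", "c"], ["a"], exhaustive=True))   # -> (2, 0)
```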
Hierarchy.
During evaluation, we treat all categories the same; we do nothing special in the case of hierarchical relationships. To perform best, for each detected object o, the detector should output the most specific correct category as well as all more general categories, e.g., a canoe should be labeled both canoe and boat. The detected object o in image i will be evaluated with respect to all labeled positive categories {c | i∈Pc}, which may be any subset of categories between the most specific and the most general.
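The hierarchy rule can be satisfied on the detector side by expanding each prediction to include all ancestor categories; the parent map below is a toy example, not the LVIS hierarchy.

```python
# Illustrative sketch: expand a prediction to the most specific category plus all ancestors.
PARENT = {"canoe": "boat", "boat": "vehicle"}   # toy hierarchy, not the LVIS ontology

def expand_labels(category):
    labels = [category]
    while labels[-1] in PARENT:                 # walk up parent links to the root
        labels.append(PARENT[labels[-1]])
    return labels

print(expand_labels("canoe"))   # ['canoe', 'boat', 'vehicle']
```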
Synonyms.
A federated dataset that separates synonyms into different categories is valid, but is unnecessarily fragmented (see Fig.2, right). We use WordNet[21] to avoid splitting synonyms into separate categories. Specifically, in LVIS each category c is a WordNet synset—a word sense specified by a set of synonyms and a definition.
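The snippet below is a small sketch using NLTK's WordNet interface (it assumes the wordnet corpus has been downloaded) showing how two synonyms such as 'backpack' and 'rucksack' resolve to the same synset.

```python
# Sketch: resolving synonyms to a shared WordNet synset with NLTK.
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

backpack = wn.synsets("backpack", pos=wn.NOUN)[0]
rucksack = wn.synsets("rucksack", pos=wn.NOUN)[0]

print(backpack.name())          # synset identifier, e.g. 'backpack.n.01'
print(backpack == rucksack)     # True: both words map to the same synset
print(backpack.lemma_names())   # all synonyms in the synset
print(backpack.definition())    # the definition annotators see
```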
3 DATASET CONSTRUCTION
In this section we provide an overview of the annotation pipeline that we use to collect LVIS.
3.1 ANNOTATION PIPELINE
Fig.4 illustrates our annotation pipeline by showing the output of each stage, which we describe below. For now, assume that we have a fixed category vocabulary V. We will describe how the vocabulary was collected in §3.2.
Object Spotting, Stage 1.
The goals of the object spotting stage are to: (1) generate the positive set, Pc, for each category c∈V and (2) elicit vocabulary recall such that many different object categories are included in the dataset.
Object spotting is an iterative process in which each image is visited a variable number of times. On the first visit, an annotator is asked to mark one object with a point and to name it with a category c∈V using an autocomplete text input. On each subsequent visit, all previously spotted objects are displayed and an annotator is asked to mark an object of a previously unmarked category or to skip the image if no more categories in V can be spotted. When an image has been skipped 3 times, it will no longer be visited. The autocomplete is performed against the set of all synonyms, presented with their definitions; we internally map the selected word to its synset/category to resolve synonyms.
Obvious and salient objects are spotted early in this iterative process. As an image is visited more, less obvious objects are spotted, including incidental, non-salient ones. We run the spotting stage twice, and for each image we retain categories that were spotted in both runs. Thus two people must independently agree on a name in order for it to be included in the dataset; this increases naming consistency.
To summarize the output of stage 1: for each category in the vocabulary, we have a (possibly empty) set of images in which one object of that category is marked per image. This defines an initial positive set, Pc, for each category c.
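A schematic sketch of the stage-1 spotting loop is given below; the annotator interaction is abstracted into a callback and the details are illustrative, not the exact crowdsourcing interface.

```python
# Schematic sketch of the stage-1 spotting protocol (annotator behavior is simulated).
def spot_image(image, vocabulary, ask_annotator, max_skips=3):
    """Repeatedly ask annotators to name one not-yet-spotted category in `image`,
    until the image has been skipped `max_skips` times."""
    spotted, skips = set(), 0
    while skips < max_skips:
        category = ask_annotator(image, already_spotted=spotted)  # None means "skip"
        if category is None or category not in vocabulary:
            skips += 1
        else:
            spotted.add(category)
    return spotted

# Two independent runs; only categories named in both are kept for the positive sets:
# run1, run2 = spot_image(img, V, annotator_a), spot_image(img, V, annotator_b)
# kept = run1 & run2
```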
Exhaustive Instance Marking, Stage 2.
The goals of this stage are to: (1) verify stage 1 annotations and (2) take each image i∈Pc and mark all instances of c in i with a point.
In this stage, (i,c) pairs from stage 1 are each sent to 5 annotators. They are asked to perform two steps. First, they are shown the definition of category c and asked to verify if it describes the spotted object. Second, if it matches, then the annotators are asked to mark all other instances of the same category. If it does not match, there is no second step. To prevent frequent categories from dominating the dataset and to reduce the overall workload, we subsample frequent categories such that no positive set exceeds 1% of the images in the dataset.
To ensure annotation quality, we embed a ‘gold set’ within the pool of work. These are cases for which we know the correct ground-truth. We use the gold set to automatically evaluate the work quality of each annotator so that we can direct work towards more reliable annotators. We use 5 annotators per (i,c) pair to help ensure instance-level recall.
To summarize, from stage 2 we have exhaustive instance spotting for each image i∈Pc for each category c∈V.
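The subsampling rule from stage 2 (no positive set larger than 1% of the dataset) could be implemented as in the sketch below; the function and its inputs are illustrative.

```python
import random

# Sketch: cap each category's positive set at 1% of the dataset (stage-2 subsampling).
def subsample_positive_sets(positive_sets, num_images, cap_fraction=0.01, seed=0):
    rng = random.Random(seed)
    cap = int(cap_fraction * num_images)
    return {c: set(rng.sample(sorted(imgs), cap)) if len(imgs) > cap else imgs
            for c, imgs in positive_sets.items()}

# e.g. with 164k images, no positive set keeps more than 1,640 images.
```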
Instance Segmentation, Stage 3.
The goals of the instance segmentation stage are to: (1) verify the category for each marked object from stage 2 and (2) upgrade each marked object from a point annotation to a full segmentation mask.
To do this, each pair (i,o) of image i and marked object instance o is presented to one annotator who is asked to verify that the category label for o is correct and if it is correct, to draw a detailed segmentation mask for it (e.g. see Fig.3).
We use a training task to establish our quality standards. Annotator quality is assessed with a gold set and by tracking their average vertex count per polygon. We use these metrics to assign work to reliable annotators.
In sum, from stage 3 we have for each image and spotted instance pair one segmentation mask (if it is not rejected).
Segment Verification, Stage 4.
The goal of the segment verification stage is to verify the quality of the segmentation masks from stage 3. We show each segmentation to up to 5 annotators and ask them to rate its quality using a rubric. If two or more annotators reject the mask, then we requeue the instance for stage 3 segmentation. Thus we only accept a segmentation if 4 annotators agree it is high-quality. Unreliable workers from stage 3 are not invited to judge segmentations in stage 4; we also use rejection rates from this stage to monitor annotator reliability. We iterate between stages 3 & 4 a total of four times, each time only re-annotating rejected instances.
To summarize the output of stage 4 (after iterating back and forth with stage 3): we have a high-quality segmentation mask for >99% of all marked objects.
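The accept-or-requeue logic of stages 3 and 4 can be summarized by the sketch below; the segmentation and rating steps are abstracted into callbacks and all names are illustrative.

```python
# Sketch of the stage-4 acceptance rule and the stage-3/4 re-annotation loop.
def mask_accepted(ratings, max_rejections=1):
    """ratings: booleans from up to 5 annotators (True = high quality).
    A mask is accepted unless two or more annotators reject it."""
    return ratings.count(False) <= max_rejections

def annotate_with_verification(instances, segment, rate, rounds=4):
    accepted, pending = {}, list(instances)
    for _ in range(rounds):                       # iterate stages 3 & 4 up to four times
        still_rejected = []
        for inst in pending:
            mask = segment(inst)                  # stage 3: draw a polygon mask
            if mask_accepted(rate(mask)):         # stage 4: up to 5 quality ratings
                accepted[inst] = mask
            else:
                still_rejected.append(inst)       # requeue for the next round
        pending = still_rejected
    return accepted, pending
```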
Full Recall Verification, Stage 5.
The full recall verification stage finalizes the positive sets. The goal is to find images i∈Pc where c is not exhaustively annotated. We do this by asking annotators if there are any unsegmented instances of category c in i. We ask up to 5 annotators and require at least 4 to agree that annotation is exhaustive. As soon as two believe it is not, we mark the exhaustive annotation flag eci as false. We use a gold set to maintain quality.
To summarize the output of stage 5: we have a boolean flag eci for each image i∈Pc indicating if category c is exhaustively annotated in image i. This finalizes the positive sets along with their instance segmentation annotations.
Negative Sets, Stage 6.
The final stage of the pipeline is to collect a negative set Nc for each category c in the vocabulary. We do this by randomly sampling images i∈D∖Pc, where D is all images in the dataset. For each sampled image i, we ask up to 5 annotators if category c appears in image i. If any one annotator reports that it does, we reject the image. Otherwise i is added to Nc. We sample until the negative set Nc reaches a target size of 1% of the images in the dataset. We use a gold set to maintain quality.
To summarize, from stage 6 we have a negative image set Nc for each category c∈V such that the category does not appear in any of the images in Nc.
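A sketch of the stage-6 sampling loop is shown below; the up-to-5-annotator query is abstracted into a single predicate and all names are illustrative.

```python
import random

# Sketch of stage-6 negative-set collection (annotator responses are abstracted away).
def build_negative_set(all_images, positive_set, category_present, target_fraction=0.01, seed=0):
    """`category_present(image)` stands in for asking up to 5 annotators whether the
    category appears in the image; any 'yes' rejects the image."""
    rng = random.Random(seed)
    target = int(target_fraction * len(all_images))
    candidates = list(set(all_images) - positive_set)   # sample from D \ Pc
    rng.shuffle(candidates)
    negatives = set()
    for image in candidates:
        if len(negatives) >= target:
            break
        if not category_present(image):                  # rejected if any annotator spots it
            negatives.add(image)
    return negatives
```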
3.2 VOCABULARY CONSTRUCTION
We construct the vocabulary V with an iterative process that starts from a large super-vocabulary and uses the object spotting process (stage 1) to winnow it down. We start from 8.8k synsets that were selected from WordNet by removing some obvious cases (e.g. proper nouns) and then finding the intersection with highly concrete common nouns[2]. This yields a high-recall set of concrete, and thus likely visual, entry-level synsets. We then apply object spotting to 10k COCO images with autocomplete against this super-vocabulary. This yields a reduced vocabulary with which we repeat the process once more. Finally, we perform minor manual editing. The resulting vocabulary contains 1723 synsets—the upper bound on the number of categories that can appear in LVIS.
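The winnowing of the super-vocabulary might look like the sketch below, which assumes a word-to-concreteness lookup in the spirit of [2] and a hand-picked threshold; it is illustrative only, not the exact procedure.

```python
# Sketch of the super-vocabulary filter: WordNet noun synsets that are not proper nouns
# and whose lemmas look highly "concrete" (threshold and lexicon format are assumptions).
from nltk.corpus import wordnet as wn   # requires the wordnet corpus to be downloaded

def build_super_vocabulary(concreteness, threshold=4.5):
    """`concreteness` maps a lowercase word to a rating, e.g. from Brysbaert et al. [2]."""
    keep = []
    for synset in wn.all_synsets(pos=wn.NOUN):
        if synset.instance_hypernyms():           # drop proper nouns (instance synsets)
            continue
        lemmas = [l.lower().replace("_", " ") for l in synset.lemma_names()]
        if any(concreteness.get(l, 0.0) >= threshold for l in lemmas):
            keep.append(synset)
    return keep
```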
4.3 EVALUATION PROTOCOL
COCO Detectors on LVIS.
To validate our annotations and federated dataset design we downloaded three Mask R-CNN[9] models from the Detectron Model Zoo[7] and evaluated them on LVIS annotations for the categories in COCO. Tab.2 shows that both box AP and mask AP are close between our annotations and the original ones from COCO for all models, which span a wide AP range. This result validates our annotations and evaluation protocol: even though LVIS uses a federated dataset design with sparse annotations, the quantitative outcome closely reproduces the ‘gold standard’ results from dense COCO annotations.
Federated Dataset Simulations.
For insight into how AP changes with positive and negative set sizes |Pc| and |Nc|, we randomly sample smaller evaluation sets from COCO val2017 and recompute AP. To plot quartiles and min-max ranges, we re-test each setting 20 times. In panel (a) we use all positive instances for evaluation, but vary max|Nc| between 50 and 5k. AP decreases somewhat (∼2 points) as we increase the number of negative images, since the ratio of negative to positive examples grows with fixed |Pc| and increasing |Nc|. Next, in panel (b) we set max|Nc|=50 and vary |Pc|. We observe that even with a small positive set size of 80, AP is similar to the baseline with low variance. With smaller positive sets (down to 5) variance increases, but the AP gap from 1st to 3rd quartile remains below 2 points. These simulations, together with the COCO detectors tested on LVIS (Tab.2), indicate that including smaller evaluation sets for each category is viable for evaluation.
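The sampling protocol behind these simulations can be sketched as follows; the AP computation itself is abstracted into a callback and the quartile summary mirrors the plotted statistics.

```python
import random

# Schematic of the federated-evaluation simulation: subsample Pc and Nc, re-score, repeat.
def simulate_ap(positive_set, negative_set, compute_ap, max_pos, max_neg, trials=20, seed=0):
    """`compute_ap(images)` stands in for a full COCO-style AP computation restricted
    to the given images; only the sampling and the repeated trials are modeled here."""
    rng = random.Random(seed)
    aps = []
    for _ in range(trials):
        pos = rng.sample(sorted(positive_set), min(max_pos, len(positive_set)))
        neg = rng.sample(sorted(negative_set), min(max_neg, len(negative_set)))
        aps.append(compute_ap(set(pos) | set(neg)))
    aps.sort()
    return aps[len(aps) // 4], aps[len(aps) // 2], aps[3 * len(aps) // 4]   # quartiles
```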
Low-Shot Detection.
To validate the claim that low-shot detection is a challenging open problem, we trained Mask R-CNN on random subsets of COCO train2017 ranging from 1k to 118k images. For each subset, we optimized the learning rate schedule and weight decay by grid search. Results on val2017 are shown in panel (c). At 1k images, mask AP drops from 36.4% (full dataset) to 9.8% (1k subset). In the 1k subset, 89% of the categories have more than 20 training instances, while the low-shot literature typically considers ≪ 20 examples per category[8].
Low-Shot Category Statistics.
Fig.9 (left) shows category growth as a function of image count (up to 977 categories in 5k images). Extrapolating the trajectory, our final dataset will include over 1k categories (upper bounded by the vocabulary size, 1723). Since the number of categories increases during data collection, the low-shot nature of LVIS is somewhat independent of the dataset scale; see Fig.9 (right), where we bin categories based on how many images they appear in: rare (1-10 images), common (11-100), and frequent (>100). These bins, as measured w.r.t. the training set, will be used to present disaggregated AP metrics.
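The binning used for disaggregated metrics is straightforward; the sketch below applies it to purely illustrative counts.

```python
# Sketch: bin categories by the number of training images they appear in.
def frequency_bin(image_count):
    if image_count <= 10:
        return "rare"        # 1-10 images
    if image_count <= 100:
        return "common"      # 11-100 images
    return "frequent"        # >100 images

counts = {"deer": 4, "backpack": 57, "person": 4200}   # illustrative counts only
print({c: frequency_bin(n) for c, n in counts.items()})
# {'deer': 'rare', 'backpack': 'common', 'person': 'frequent'}
```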
5 CONCLUSION
We introduced LVIS, a new dataset designed to enable, for the first time, the rigorous study of instance segmentation algorithms that can recognize a large vocabulary of object categories (>1000) and must do so using methods that can cope with the open problem of low-shot learning. While LVIS emphasizes learning from few examples, the dataset is not small: it will span 164k images and label ∼2 million object instances. Each object instance is segmented with a high-quality mask that surpasses the annotation quality of related datasets. We plan to establish LVIS as a benchmark challenge that we hope will lead to exciting new object detection, segmentation, and low-shot learning algorithms.