[Paper Reading] Segment Anything: Study, Understanding, and Reproduction

Abstract

We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive – often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.

We introduce the Segment Anything (SA) project, which consists of a new task, a new model, and a new dataset for image segmentation. Using our efficient model in a data-collection loop, we built the largest segmentation dataset to date: over 1 billion masks on 11 million licensed, privacy-respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive, highly competitive with, and at times better than, prior fully supervised results. We release the Segment Anything Model (SAM) and the dataset (SA-1B) at https://segment-anything.com to foster research into foundation models for computer vision.

1. Introduction

Large language models pre-trained on web-scale datasets are revolutionizing NLP with strong zero-shot and few-shot generalization [10]. These “foundation models” [8] can generalize to tasks and data distributions beyond those seen during training. This capability is often implemented with prompt engineering in which hand-crafted text is used to prompt the language model to generate a valid textual response for the task at hand. When scaled and trained with abundant text corpora from the web, these models’ zero and few-shot performance compares surprisingly well to (even matching in some cases) fine-tuned models [10, 21]. Empirical trends show this behavior improving with model scale, dataset size, and total training compute [56, 10, 21, 51].

Large language models pre-trained on web-scale datasets, with strong zero-shot and few-shot generalization, are revolutionizing NLP. These "foundation models" generalize to tasks and data distributions beyond those seen during training. This capability is usually realized through prompt engineering: hand-crafted text is used to prompt the language model to generate a valid textual response for the task at hand. When scaled up and trained on abundant text corpora from the web, the zero- and few-shot performance of these models compares surprisingly well with, and in some cases matches, fine-tuned models. Empirical trends show that this behavior improves with model scale, dataset size, and total training compute.

Foundation models have also been explored in computer vision, albeit to a lesser extent. Perhaps the most prominent illustration aligns paired text and images from the web. For example, CLIP [82] and ALIGN [55] use contrastive learning to train text and image encoders that align the two modalities. Once trained, engineered text prompts enable zero-shot generalization to novel visual concepts and data distributions. Such encoders also compose effectively with other modules to enable downstream tasks, such as image generation (e.g., DALL·E [83]). While much progress has been made on vision and language encoders, computer vision includes a wide range of problems beyond this scope, and for many of these, abundant training data does not exist.

Foundation models have also been explored in computer vision, albeit to a lesser extent. Perhaps the most prominent line of work aligns paired text and images from the web. For example, CLIP and ALIGN use contrastive learning to train text and image encoders that align the two modalities. Once trained, engineered text prompts enable zero-shot generalization to novel visual concepts and data distributions, and such encoders compose effectively with other modules to enable downstream tasks such as image generation (e.g., DALL·E). However, while vision-and-language encoders have made great progress, computer vision covers a wide range of problems beyond this scope, and for many of them abundant training data simply does not exist.

In this work, our goal is to build a foundation model for image segmentation. That is, we seek to develop a promptable model and pre-train it on a broad dataset using a task that enables powerful generalization. With this model, we aim to solve a range of downstream segmentation problems on new data distributions using prompt engineering.

In this work, the goal is to build a foundation model for image segmentation: that is, to develop a promptable model and pre-train it on a broad dataset using a task that enables powerful generalization. With this model, the aim is then to solve a range of downstream segmentation problems on new data distributions using prompt engineering.

The success of this plan hinges on three components: task, model, and data. To develop them, we address the following questions about image segmentation:

1. What task will enable zero-shot generalization?

2. What is the corresponding model architecture?

3. What data can power this task and model?

The success of this plan hinges on three components: the task (what task will enable zero-shot generalization?), the model (what architecture fits that task?), and the data (what data can power both the task and the model?).

These questions are entangled and require a comprehensive solution. We start by defining a promptable segmentation task that is general enough to provide a powerful pretraining objective and to enable a wide range of downstream applications. This task requires a model that supports flexible prompting and can output segmentation masks in realtime when prompted to allow for interactive use. To train our model, we need a diverse, large-scale source of data. Unfortunately, there is no web-scale data source for segmentation; to address this, we build a “data engine”, i.e., we iterate between using our efficient model to assist in data collection and using the newly collected data to improve the model. We introduce each interconnected component next, followed by the dataset we created and the experiments that demonstrate the effectiveness of our approach.

These questions are entangled and require a comprehensive solution. The paper first defines a promptable segmentation task that is general enough both to serve as a powerful pre-training objective and to enable a wide range of downstream applications. This task requires a model that supports flexible prompts and can output segmentation masks in real time when prompted, so it can be used interactively. Training such a model needs a diverse, large-scale data source, but no web-scale data source for segmentation exists. To address this, the authors build a "data engine": they iterate between using the efficient model to assist data collection and using the newly collected data to improve the model. The paper then introduces each of these interconnected components, followed by the dataset they created and the experiments that demonstrate the effectiveness of the approach.

Task (§2). In NLP and more recently computer vision, foundation models are a promising development that can perform zero-shot and few-shot learning for new datasets and tasks often by using “prompting” techniques. Inspired by this line of work, we propose the promptable segmentation task, where the goal is to return a valid segmentation mask given any segmentation prompt (see Fig. 1a). A prompt simply specifies what to segment in an image, e.g., a prompt can include spatial or text information identifying an object. The requirement of a valid output mask means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects. We use the promptable segmentation task as both a pre-training objective and to solve general downstream segmentation tasks via prompt engineering.

[Figure 1]

Figure 1: The goal is to build a foundation model for segmentation by introducing three interconnected components: a promptable segmentation task; a segmentation model (SAM) that powers data annotation and, via prompt engineering, enables zero-shot transfer to a wide range of tasks; and a data engine for collecting SA-1B, a dataset of more than 1 billion masks.

Task (§2). In NLP, and more recently in computer vision, foundation models are a promising development that can perform zero-shot and few-shot learning on new datasets and tasks, often by using "prompting" techniques. Inspired by this line of work, the paper proposes the promptable segmentation task: given any segmentation prompt, return a valid segmentation mask (see Fig. 1a). A prompt simply specifies what to segment in an image; for example, it can contain spatial or textual information identifying an object. Requiring a valid output mask means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for at least one of those objects. The promptable segmentation task is used both as a pre-training objective and, via prompt engineering, as the way to solve general downstream segmentation tasks.
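To make the task concrete, here is a minimal conceptual sketch of the interface in my own pseudocode-style Python (the class and function names are made up for illustration and are not part of the paper or the official code): a prompt bundles optional points, a box, a coarse mask, or text, and the model must return at least one plausible mask with a score even when the prompt is ambiguous.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class SegmentationPrompt:
    """A prompt specifies *what* to segment; every field is optional."""
    points: Optional[np.ndarray] = None              # (N, 2) pixel coordinates
    point_labels: Optional[np.ndarray] = None        # (N,) 1 = foreground, 0 = background
    box: Optional[Tuple[int, int, int, int]] = None  # (x0, y0, x1, y1)
    mask: Optional[np.ndarray] = None                # coarse low-res mask used as a hint
    text: Optional[str] = None                       # free-form description (exploratory in the paper)

@dataclass
class MaskCandidate:
    mask: np.ndarray   # (H, W) boolean mask
    score: float       # model's own quality estimate, e.g. a predicted IoU

def promptable_segmentation(image: np.ndarray,
                            prompt: SegmentationPrompt) -> List[MaskCandidate]:
    """A 'valid' output: even for an ambiguous prompt (a point on a shirt),
    return a reasonable mask for at least one object it could refer to,
    usually several candidates ranked by score."""
    raise NotImplementedError  # SAM (next section) is one concrete implementation
```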

Model (§3). The promptable segmentation task and the goal of real-world use impose constraints on the model architecture. In particular, the model must support flexible prompts, needs to compute masks in amortized real-time to allow interactive use, and must be ambiguity-aware. Surprisingly, we find that a simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds prompts, and then the two information sources are combined in a lightweight mask decoder that predicts segmentation masks. We refer to this model as the Segment Anything Model, or SAM (see Fig. 1b). By separating SAM into an image encoder and a fast prompt encoder / mask decoder, the same image embedding can be reused (and its cost amortized) with different prompts. Given an image embedding, the prompt encoder and mask decoder predict a mask from a prompt in ∼50ms in a web browser. We focus on point, box, and mask prompts, and also present initial results with free-form text prompts. To make SAM ambiguity-aware, we design it to predict multiple masks for a single prompt allowing SAM to naturally handle ambiguity, such as the shirt vs. person example. 

Model (§3). The promptable segmentation task and the goal of real-world use place constraints on the model architecture. In particular, the model must support flexible prompts, must compute masks in amortized real time to allow interactive use, and must be ambiguity-aware. Surprisingly, a simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds the prompts (i.e., produces prompt embeddings), and the two sources of information are combined in a lightweight mask decoder that predicts segmentation masks. This model is called the Segment Anything Model, or SAM (see Fig. 1b). By splitting SAM into an image encoder and a fast prompt encoder / mask decoder, the same image embedding can be reused (its cost amortized) across different prompts. Given an image embedding, the prompt encoder and mask decoder predict a mask from a prompt in about 50 ms in a web browser. The work focuses on point, box, and mask prompts, and also presents initial results with free-form text prompts. To make SAM ambiguity-aware, it is designed to predict multiple masks for a single prompt, so that it can naturally handle ambiguity such as the shirt vs. person example.
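The released segment-anything package exposes exactly this encoder/decoder split. Below is a small usage sketch assuming the official SamPredictor API; the checkpoint path, image path, and the example coordinates are placeholders. set_image runs the heavy image encoder once, after which each prompt is answered by the lightweight decoder, and multimask_output=True returns several candidate masks with quality scores, which is how the ambiguity-awareness surfaces.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (variant and path are placeholders for whatever you downloaded).
sam = sam_model_registry["vit_h"](checkpoint="checkpoints/sam_vit_h.pth")
sam.to("cuda")  # or "cpu"
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the heavy image encoder once and caches the embedding

# Prompt 1: a single, possibly ambiguous foreground point -> several candidate masks.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) pixel coordinates
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # shirt vs. person: return multiple masks
)
print(masks.shape, scores)                # e.g. (3, H, W) boolean masks with scores

# Prompt 2: a box prompt reuses the cached embedding, so only the light decoder runs.
masks_box, _, _ = predictor.predict(
    box=np.array([400, 300, 700, 650]),   # XYXY box
    multimask_output=False,
)
```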

Data engine (§4). To achieve strong generalization to new data distributions, we found it necessary to train SAM on a large and diverse set of masks, beyond any segmentation dataset that already exists. While a typical approach for foundation models is to obtain data online [82], masks are not naturally abundant and thus we need an alternative strategy. Our solution is to build a “data engine”, i.e., we co-develop our model with model-in-the-loop dataset annotation (see Fig. 1c). Our data engine has three stages: assisted-manual, semi-automatic, and fully automatic. In the first stage, SAM assists annotators in annotating masks, similar to a classic interactive segmentation setup. In the second stage, SAM can automatically generate masks for a subset of objects by prompting it with likely object locations and annotators focus on annotating the remaining objects, helping increase mask diversity. In the final stage, we prompt SAM with a regular grid of foreground points, yielding on average ∼100 high-quality masks per image.

Data engine (§4). To achieve strong generalization to new data distributions, it proved necessary to train SAM on a large and diverse set of masks, beyond any segmentation dataset that already exists. The typical approach for foundation models is to obtain data online, but masks are not naturally abundant on the web, so an alternative strategy is needed. The solution is to build a "data engine": the model is co-developed with model-in-the-loop dataset annotation (see Fig. 1c). The data engine has three stages: assisted-manual, semi-automatic, and fully automatic. In the first stage, SAM assists human annotators in annotating masks, similar to a classic interactive segmentation setup (many annotation tools work this way, some backed by traditional segmentation algorithms; here an early, weaker SAM refines the masks that annotators draw). In the second stage, SAM automatically generates masks for a subset of objects by being prompted with likely object locations, while annotators focus on annotating the remaining objects, which helps increase mask diversity. In the final stage, SAM is prompted with a regular grid of foreground points, yielding on average about 100 high-quality masks per image.
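The fully automatic stage corresponds to the SamAutomaticMaskGenerator in the released code, which lays a regular points_per_side × points_per_side grid of foreground points over the image and filters the resulting masks by predicted IoU and stability. A brief sketch (the checkpoint path is a placeholder, and the thresholds shown are, to my recollection, close to the library defaults):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="checkpoints/sam_vit_b.pth")  # placeholder path
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,           # 32x32 regular grid of foreground point prompts
    pred_iou_thresh=0.88,         # keep masks that SAM itself rates as high quality
    stability_score_thresh=0.95,  # keep masks that are stable under threshold perturbation
)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts, often on the order of 100 per image
print(len(masks), sorted(masks[0].keys()))  # 'segmentation', 'area', 'bbox', 'predicted_iou', ...
```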

Dataset (§5). Our final dataset, SA-1B, includes more than 1B masks from 11M licensed and privacy-preserving images (see Fig. 2). SA-1B, collected fully automatically using the final stage of our data engine, has 400× more masks than any existing segmentation dataset [66, 44, 117, 60], and as we verify extensively, the masks are of high quality and diversity. Beyond its use in training SAM to be robust and general, we hope SA-1B becomes a valuable resource for research aiming to build new foundation models.

Dataset (§5). The final dataset, SA-1B, includes more than 1B masks from 11M licensed and privacy-preserving images (see Fig. 2). SA-1B was collected fully automatically using the final stage of the data engine and has 400× more masks than any existing segmentation dataset; as verified extensively in the paper, the masks are of high quality and diversity. Beyond its use in training SAM to be robust and general, the authors hope SA-1B becomes a valuable resource for research aimed at building new foundation models.
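For anyone who wants to load SA-1B: my understanding is that the release ships one JSON file per image, with each mask stored as a COCO run-length encoding, so pycocotools can decode it. A hedged sketch (the file name is a placeholder, and the field names follow my reading of the dataset description):

```python
import json
from pycocotools import mask as mask_utils  # pip install pycocotools

with open("sa_000000/sa_1.json") as f:  # placeholder file name
    record = json.load(f)

print(record["image"]["width"], record["image"]["height"])
for ann in record["annotations"]:
    binary_mask = mask_utils.decode(ann["segmentation"])  # COCO RLE -> (H, W) uint8 array
    # each annotation also carries bbox, area, predicted_iou, stability_score, ...
    print(ann["area"], binary_mask.shape)
```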

Responsible AI (§6). We study and report on potential fairness concerns and biases when using SA-1B and SAM. Images in SA-1B span a geographically and economically diverse set of countries and we found that SAM performs similarly across different groups of people. Together, we hope this will make our work more equitable for real-world use cases. We provide model and dataset cards in the appendix. 

Responsible AI (§6). The paper studies and reports on potential fairness concerns and biases when using SA-1B and SAM. The images in SA-1B span a geographically and economically diverse set of countries, and SAM was found to perform similarly across different groups of people. Together, the authors hope this makes the work more equitable for real-world use cases. Model and dataset cards are provided in the appendix.

Experiments (§7). We extensively evaluate SAM. First, using a diverse new suite of 23 segmentation datasets, we find that SAM produces high-quality masks from a single foreground point, often only slightly below that of the manually annotated ground truth. Second, we find consistently strong quantitative and qualitative results on a variety of downstream tasks under a zero-shot transfer protocol using prompt engineering, including edge detection, object proposal generation, instance segmentation, and a preliminary exploration of text-to-mask prediction. These results suggest that SAM can be used out-of-the-box with prompt engineering to solve a variety of tasks involving object and image distributions beyond SAM’s training data. Nevertheless, room for improvement remains, as we discuss in §8.

Experiments (§7). SAM is evaluated extensively. First, on a new, diverse suite of 23 segmentation datasets, SAM produces high-quality masks from a single foreground point, often only slightly below the manually annotated ground truth. Second, under a zero-shot transfer protocol using prompt engineering, consistently strong quantitative and qualitative results are obtained on a variety of downstream tasks, including edge detection, object proposal generation, instance segmentation, and a preliminary exploration of text-to-mask prediction. These results suggest that SAM can be used out of the box with prompt engineering to solve a variety of tasks involving object and image distributions beyond its training data. Nevertheless, room for improvement remains, as discussed in §8.
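The single-foreground-point evaluation is straightforward to approximate: prompt SAM with one point per ground-truth object, keep the highest-scoring candidate, and measure IoU against the ground truth. The sketch below is my own simplification of that protocol (the paper additionally uses human quality ratings); mask_iou and eval_single_point are hypothetical helper names, and the "center of mass" point is a crude stand-in for the paper's point-sampling choices.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def eval_single_point(predictor, image: np.ndarray, gt_masks: list) -> float:
    """Prompt with one point per ground-truth object; average the best-candidate IoU."""
    predictor.set_image(image)
    ious = []
    for gt in gt_masks:  # each gt: (H, W) boolean mask
        ys, xs = np.nonzero(gt)
        point = np.array([[xs.mean(), ys.mean()]])  # crude "center" point prompt
        masks, scores, _ = predictor.predict(
            point_coords=point, point_labels=np.array([1]), multimask_output=True
        )
        ious.append(mask_iou(masks[np.argmax(scores)], gt))
    return float(np.mean(ious)) if ious else 0.0
```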

Release. We are releasing the SA-1B dataset for research purposes and making SAM available under a permissive open license (Apache 2.0) at https://segment-anything.com. We also showcase SAM’s capabilities with an online demo.

Release. The SA-1B dataset is released for research purposes, and SAM is made available under a permissive open license (Apache 2.0) at https://segment-anything.com. SAM's capabilities are also showcased in an online demo.

Figure 2: the promised super-large dataset ^-^. SA-1B averages roughly 100 masks per image. Enough said.

2. Segment Anything Task

(To be continued)
