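Another big bang? Meta AI open-sources DINOv2, a new self-supervised foundation model: no fine-tuning needed, top results on multiple downstream tasks, and coverage of the ground beyond segmentation that SAM leaves open
Links
Paper: DINOv2: Learning Robust Visual Features without Supervision (arxiv.org)
Demo: DINOv2 by Meta AI (metademolab.com)
Introduction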
"DINOv2 complements our other recent computer vision research, including Segment Anything. Segment Anything is a promptable segmentation system focused on zero-shot generalization to diverse set of segmentation tasks. DINOv2 combines with simple linear classifiers to achieve strong results across multiple tasks beyond the segmentation sub-field, creating horizontal impact."
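Not long ago, Meta AI released Segment Anything (SAM) with much fanfare. SAM generates masks quickly and interactively, can accurately segment images it was never trained on, and can outline specific objects in an image from a text prompt or a user click. That flexibility is a first in image segmentation.
At the end of the day, though, SAM is a promptable segmentation system. It mainly serves segmentation tasks, and its help for other vision tasks (e.g. classification, retrieval, VQA) is less direct.
So, following Segment Anything, Meta AI has released another heavyweight open-source project: DINOv2. DINOv2 extracts strong image features and needs no fine-tuning on downstream tasks, which makes it suitable as a new backbone for many different applications.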
DINOv2 is able to take a video and generate a higher-quality segmentation than the original DINO method. DINOv2 allows remarkable properties to emerge, such as a robust understanding of object parts, and robust semantic and low-level understanding of images.
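The release also had Mark Zuckerberg personally backing it, so it drew a lot of attention from the start. (At the end he did not forget to mention his beloved Metaverse.)
With Zuckerberg promoting it himself, DINOv2 picked up 2k+ stars within a single day of release.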
Key features ✨
Meta AI's official blog summarizes DINOv2's features as follows:
- Meta AI has built DINOv2, a new method for training high-performance computer vision models.
- DINOv2 delivers strong performance and does not require fine-tuning. This makes it suitable for use as a backbone for many different computer vision tasks.
- Because it uses self-supervision, DINOv2 can learn from any collection of images. It can also learn features, such as depth estimation, that the current standard approach cannot.
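DINOv2 is a new method for training high-performance computer vision models that uses self-supervised learning to match or surpass the results of the standard approaches used in the field. Like other self-supervised systems, models trained with the DINOv2 method need no associated metadata and can be trained on any collection of images. This means they can learn from every image they receive, not only from images that come with a specific set of labels, alt text, or captions. DINOv2 provides high-performance features that can be fed directly into simple linear classifiers. This flexibility means DINOv2 can serve as a multipurpose backbone for many different computer vision tasks.
The experiments in the paper show DINOv2's strong performance on downstream tasks such as classification, segmentation, and image retrieval. Most surprisingly, on depth estimation DINOv2's results are clearly better than SOTA pipelines, both in-domain and out-of-domain. The authors attribute this strong out-of-domain performance to the combination of self-supervised feature learning and lightweight task-specific modules such as linear classifiers.
Finally, because no fine-tuning is applied, the backbone stays general and the same features can be used for many different tasks at once.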
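What the work does
I won't go into the algorithmic details of DINOv2 here. This is only a brief look at what DINOv2 mainly did (my personal take, discussion welcome):
Building a new high-quality dataset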
Building a large, curated, and diverse dataset to train the models
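In today's era of large models, pushing performance further usually means that bigger models need more training data. No existing high-quality dataset was large enough for DINOv2's training needs, so Meta AI assembled a new one: starting from a large pool of uncurated data, they retrieved the images that are close to images in several curated datasets. The pipeline is illustrated by a figure in the original post.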
This approach enabled us to produce a pretraining dataset totaling 142 million images out of the 1.2 billion source images.
With this pipeline, Meta AI obtained 142 million curated images from 1.2 billion source images and named the resulting dataset LVD-142M.
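To make the retrieval idea concrete, here is a minimal sketch in plain Python. It assumes every image has already been turned into an embedding vector, and it keeps any uncurated image that lands in the top-k cosine neighbors of at least one curated image. The function names (`cosine`, `curate`), the value of `k`, and the toy 2-D vectors are all made up for illustration; the actual LVD-142M pipeline is more involved and is not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def curate(curated, uncurated, k=2):
    """Return sorted indices of uncurated items that are among the
    top-k nearest neighbors of at least one curated item.

    Hypothetical helper for illustration only.
    """
    selected = set()
    for query in curated:
        ranked = sorted(
            range(len(uncurated)),
            key=lambda i: cosine(query, uncurated[i]),
            reverse=True,
        )
        selected.update(ranked[:k])
    return sorted(selected)


# Toy embeddings (assumed): two curated "seed" images and five uncurated ones.
curated = [[1.0, 0.0], [0.0, 1.0]]
uncurated = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0], [0.7, 0.7], [0.0, -1.0]]

kept = curate(curated, uncurated, k=2)
print(kept)
```

On this toy data the two vectors that point away from every curated image are dropped, and the three that resemble a curated image are kept.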
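Distilling strong lightweight models
Large models are great, but their hardware and compute requirements are too high (only big companies and big labs get to play), so we always hope for strong, lightweight models with a lower barrier to entry.
Meta AI therefore used model distillation to compress the knowledge of the large model into smaller ones, so that researchers who follow up can cut inference cost substantially at only a minimal cost in accuracy. The resulting ViT-Small, ViT-Base, and ViT-Large models also generalize well on downstream tasks, as the experimental results later show.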
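The post does not spell out the distillation objective, so the following is only a generic teacher-student sketch, not Meta AI's recipe: the student is trained so that its softmax output matches the teacher's, measured by cross-entropy. The temperatures `t_student=0.1` and `t_teacher=0.04` and the toy logits are assumptions chosen for illustration.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_z for z in scaled]


def distillation_loss(student_logits, teacher_logits,
                      t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the teacher and student distributions.

    Temperatures are assumed values, not taken from the article.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, t_teacher)]
    student_log_probs = log_softmax(student_logits, t_student)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


# Toy logits (assumed): the teacher clearly prefers class 0.
teacher = [2.0, 1.0, 0.1]
aligned = distillation_loss([2.0, 1.0, 0.1], teacher)     # student agrees
misaligned = distillation_loss([0.1, 1.0, 2.0], teacher)  # student disagrees

print(aligned, misaligned)
```

The loss is small when the student ranks the classes the way the teacher does and large when it does not, and that gap is the signal a small model would be trained on.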
Releasing a family of high-performance pretrained models
Most importantly, Meta AI released a family of DINOv2 pretrained models to the community:
We release DINOv2 pretrained models to the community with a matching stable, accurate, and scaled implementation: We share pretraining code and recipe for ViT-L/16 (300 M params) and ViT-g/14 (1.1 B params) architectures, as well as checkpoints for a range of pretrained models from the larger ViT-g/14 down to smaller distilled models (ViT-S/14, ViT-B/14 and ViT-L/14). The performance of our approach is competitive or better than the performance of text-image models such as CLIP and OpenCLIP on a wide array of tasks, some of which are illustrated in our demo. Don’t hesitate to play with it! Our features can be used out of the box for nearest neighbor classification or paired with linear classification, yielding strong performance. DINOv2 allows skipping the model adaptation phase (fine-tuning) — our linear evaluation performance is close to their fine-tuned counterpart (within 2 percent on ImageNet-1k) .
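DINOv2 works out of the box as a feature extractor and achieves quite good results on multiple downstream tasks without fine-tuning (on ImageNet-1k, linear evaluation is within 2% of fine-tuning), as the results figure in the original post shows.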
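What "linear evaluation" means in practice: the backbone stays frozen and only a linear classifier is trained on its output features. Below is a minimal sketch of that idea, assuming the features have already been extracted. The toy 2-D vectors stand in for real DINOv2 features, and `train_linear_probe`, `predict`, the learning rate, and the epoch count are illustrative choices of mine, not anything from the DINOv2 codebase.

```python
import math


def linear_logits(weights, biases, x):
    """Compute one logit per class for feature vector x."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Train a softmax linear classifier with plain SGD.

    The features are treated as fixed inputs: nothing upstream is updated,
    which is the point of a linear probe on a frozen backbone.
    """
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logits = linear_logits(weights, biases, x)
            m = max(logits)
            exps = [math.exp(z - m) for z in logits]
            total = sum(exps)
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit c: p_c - 1[c == y]
                grad = exps[c] / total - (1.0 if c == y else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
                biases[c] -= lr * grad
    return weights, biases


def predict(weights, biases, x):
    """Return the index of the highest-scoring class."""
    logits = linear_logits(weights, biases, x)
    return max(range(len(logits)), key=lambda c: logits[c])


# Toy "frozen features" (assumed) for two classes.
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]

weights, biases = train_linear_probe(features, labels, num_classes=2)
predictions = [predict(weights, biases, x) for x in features]
print(predictions)
```

Because only the small `weights` and `biases` are learned, the same frozen features could feed several such heads for different tasks.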
Algorithmic and technical improvements
With more training data, larger models perform better than smaller ones, but their training poses two major challenges. First, increasing the model size makes the training more challenging because of potential instability. In DINOv2, we included additional regularization methods inspired by the similarity search and classification literature, making the training algorithm much more stable. Second, in order to remain tractable, larger models require more efficient implementations. The DINOv2 training code integrates the latest mixed-precision and distributed training implementations proposed in the cutting-edge PyTorch 2 (fully sharded data parallel), an efficient implementation of the stochastic depth technique, as well as the latest compute algorithm implementations of xFormers (in particular, variable-length memory-efficient attention). This allows faster and more efficient iteration cycles. Overall, with equivalent hardware, our code runs around twice as fast with only a third of the memory usage, allowing scaling in data, model size, and hardware.
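By using the latest PyTorch 2 features for mixed-precision and distributed training (fully sharded data parallel), together with xFormers techniques such as variable-length memory-efficient attention, the new code runs about twice as fast on equivalent hardware while using only a third of the memory. This helps DINOv2 scale more efficiently in data, model size, and hardware.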
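Playing with the demo
Meta has also put web demos for depth estimation, semantic segmentation, and instance retrieval on its site. You can try them directly without signing up (now this is what "Open" AI looks like).
Link: DINOv2 by Meta AI (metademolab.com)
Here are the results of my own trial runs: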
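Depth Estimation
Pretrained models rarely show off their depth estimation ability, so this demo also speaks to DINOv2's strong out-of-distribution performance.
Here I deliberately picked a night scene under non-natural lighting as the test, and the result is really impressive!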
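Semantic Segmentation
DINOv2's frozen features can easily be used for semantic segmentation.
This demo is plain semantic segmentation, so it is not as much fun to play with as SAM is on segmentation tasks.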
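Instance Retrieval
I find this demo the most interesting one: it searches a large collection of art images for works similar to a given image.
I uploaded a picture of the Yellow Crane Tower as the query:
These are the results DINOv2 returned. They feel quite close semantically (each one has a tall tower or building).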
Future directions
Meta AI also laid out the team's future research direction:
Going forward, the team plans to integrate this model, which can function as a building block, in a larger, more complex AI system that could interact with large language models. A visual backbone providing rich information on images will allow complex AI systems to reason on images in a deeper way than describing them with a single text sentence. Models trained with text supervision are ultimately limited by the image captions. With DINOv2, there is no such built-in limitation.
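In short, the plan is to combine DINOv2 with large language models (LLMs) and move toward general AI and complex AI systems. (We can just watch this part; in the end it is work for the big companies.)
Closing thoughts
DINOv2 shows a major step forward for self-supervised learning in CV, and across tasks it demonstrates strong generalization as a general-purpose vision backbone. We can expect more research built on DINOv2.
MMPreTrain
If you are interested in pretrained foundation models related to DINOv2, I recommend keeping an eye on MMPreTrain, OpenMMLab's open-source deep learning pre-training toolbox: open-mmlab/mmpretrain: OpenMMLab Pre-training Toolbox and Benchmark (github.com)
MMPreTrain covers a wide range of backbones and pretrained models and supports multiple training strategies (supervised learning, unsupervised learning, and so on). The self-supervised algorithms it includes (shown in a figure in the original post) are all recent classics from the past two years, and we can look forward to DINOv2 joining them.
References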
- Mark Zuckerberg - Continuing our work to open source more... | Facebook
- DINOv2: State-of-the-art computer vision models with self-supervised learning (facebook.com)