Kaiming He∗,† Xinlei Chen∗ Saining Xie Yanghao Li Piotr Dollar Ross Girshick
∗ equal technical contribution † project lead
Facebook AI Research (FAIR)
Figure 1. Our MAE architecture. During pre-training, a large random subset of image patches (e.g., 75%) is masked out. The encoder is applied to the small subset of visible patches. Mask tokens are introduced after the encoder, and the full set of encoded patches and mask tokens is processed by a small decoder that reconstructs the original image in pixels. After pre-training, the decoder is discarded and the encoder is applied to uncorrupted images to produce representations for recognition tasks.
Contents
Abstract
1. Introduction
2. Related Work
3. Approach
4. ImageNet Experiments
4.1. Main Properties
4.2. Comparisons with Previous Results
4.3. Partial Fine-tuning
5. Transfer Learning Experiments
6. Discussion and Conclusion
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
It is based on two core designs.
First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task.
Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3× or more) and improve accuracy.
Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data.
Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
Core significance of this paper:
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Core idea of this paper:
The MAE approach is simple: mask random patches of the input image and reconstruct the missing pixels.
Core method of this paper:
It is based on two core designs.
First, an asymmetric encoder-decoder architecture is developed, with an encoder that operates only on the visible subset of patches (without mask tokens) and a lightweight decoder that reconstructs the original image from the latent representation and the mask tokens.
Second, the paper finds that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task.
Coupling these two designs lets MAE train large models efficiently and effectively: training is accelerated (by 3x or more) and accuracy improves.
Experimental conclusions of this paper:
The scalable approach allows learning high-capacity models that generalize well: for example, among methods that use only ImageNet-1K data, a vanilla ViT-Huge model achieves the best accuracy (87.8%).
Transfer performance on downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
Deep learning has witnessed an explosion of architectures of continuously growing capability and capacity [28, 24, 47]. Aided by the rapid gains in hardware, models today can easily overfit one million images [13] and begin to demand hundreds of millions of—often publicly inaccessible—labeled images [16].
This appetite for data has been successfully addressed in natural language processing (NLP) by self-supervised pretraining. The solutions, based on autoregressive language modeling in GPT [40, 41, 4] and masked autoencoding in BERT [14], are conceptually simple: they remove a portion of the data and learn to predict the removed content. These methods now enable training of generalizable NLP models containing over one hundred billion parameters [4].
The idea of masked autoencoders, a form of more general denoising autoencoders [48], is natural and applicable in computer vision as well. Indeed, closely related research in vision [49, 39] preceded BERT. However, despite significant interest in this idea following the success of BERT, progress of autoencoding methods in vision lags behind NLP. We ask: what makes masked autoencoding different between vision and language? We attempt to answer this question from the following perspectives:
Background and significance:
Deep learning has witnessed an explosion of architectures of continuously growing capability and capacity. Aided by rapid gains in hardware, today's models can easily overfit one million images and begin to demand hundreds of millions of labeled images, which are often publicly inaccessible.
In natural language processing (NLP), this appetite for data has been successfully addressed by self-supervised pre-training. The solutions, based on autoregressive language modeling in GPT and masked autoencoding in BERT, are conceptually simple: remove a portion of the data and learn to predict the removed content. These methods now enable training of generalizable NLP models containing over one hundred billion parameters.
The idea of masked autoencoders, a form of the more general denoising autoencoder, is natural and applicable to computer vision as well. Indeed, closely related research in vision preceded BERT. However, despite significant interest in this idea following BERT's success, progress of autoencoding methods in vision lags behind NLP.
[Taken together, the first three paragraphs convey the following: processing data at very large scale matters; in NLP this has been achieved through self-supervision, by removing part of the data and learning to restore it; the masked autoencoder in this paper adopts the same strategy to build a self-supervised learner for computer vision.]
The question this raises:
What makes masked autoencoding different between vision and language? The paper attempts to answer this question from the following three perspectives (network architecture, information density, and the decoder).
(i) Until recently, architectures were different. In vision, convolutional networks [29] were dominant over the last decade [28]. Convolutions typically operate on regular grids and it is not straightforward to integrate ‘indicators’ such as mask tokens [14] or positional embeddings [47] into convolutional networks. This architectural gap, however, has been addressed with the introduction of Vision Transformers (ViT) [16] and should no longer present an obstacle.
(ii) Information density is different between language and vision. Languages are human-generated signals that are highly semantic and information-dense. When training a model to predict only a few missing words per sentence, this task appears to induce sophisticated language understanding. Images, on the contrary, are natural signals with heavy spatial redundancy; e.g., a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes. To overcome this difference and encourage learning useful features, we show that a simple strategy works well in computer vision: masking a very high portion of random patches. This strategy largely reduces redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics. To get a qualitative sense of our reconstruction task, see Figures 2 – 4.
(iii) The autoencoder’s decoder, which maps the latent representation back to the input, plays a different role between reconstructing text and images. In vision, the decoder reconstructs pixels, hence its output is of a lower semantic level than common recognition tasks. This is in contrast to language, where the decoder predicts missing words that contain rich semantic information. While in BERT the decoder can be trivial (an MLP) [14], we found that for images, the decoder design plays a key role in determining the semantic level of the learned latent representations.
(i) Until recently, the architectures were different. In vision, convolutional networks dominated over the last decade. Convolutions typically operate on regular grids, and it is not straightforward to integrate "indicators" such as mask tokens [14] or positional embeddings [47] into convolutional networks. This architectural gap, however, has been addressed by the introduction of Vision Transformers (ViT) [16] and should no longer be an obstacle.
(ii) Information density differs between language and vision. Language is a human-generated signal that is highly semantic and information-dense. When a model is trained to predict only a few missing words per sentence, the task appears to induce sophisticated language understanding. Images, by contrast, are natural signals with heavy spatial redundancy: a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes. To overcome this difference and encourage learning useful features, the paper shows that a simple strategy works well in computer vision: masking a very high portion of random patches. This largely reduces redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics. (For a qualitative sense of the reconstruction task, see Figures 2-4.)
(iii) The autoencoder's decoder, which maps the latent representation back to the input, plays a different role when reconstructing text versus images. In vision, the decoder reconstructs pixels, so its output is at a lower semantic level than common recognition tasks. This contrasts with language, where the decoder predicts missing words that carry rich semantic information. While in BERT the decoder can be trivial (an MLP), the paper finds that for images the decoder design plays a key role in determining the semantic level of the learned latent representations.
Driven by this analysis, we present a simple, effective, and scalable form of a masked autoencoder (MAE) for visual representation learning. Our MAE masks random patches from the input image and reconstructs the missing patches in the pixel space. It has an asymmetric encoder-decoder design. Our encoder operates only on the visible subset of patches (without mask tokens), and our decoder is lightweight and reconstructs the input from the latent representation along with mask tokens (Figure 1). Shifting the mask tokens to the small decoder in our asymmetric encoder-decoder results in a large reduction in computation. Under this design, a very high masking ratio (e.g., 75%) can achieve a win-win scenario: it optimizes accuracy while allowing the encoder to process only a small portion (e.g., 25%) of patches. This can reduce overall pre-training time by 3× or more and likewise reduce memory consumption, enabling us to easily scale our MAE to large models.
Driven by this analysis, the paper presents a simple, effective, and scalable form of masked autoencoder (MAE) for visual representation learning. The MAE masks random patches of the input image and reconstructs the missing patches in pixel space. It has an asymmetric encoder-decoder design: the encoder operates only on the visible subset of patches (without mask tokens), and a lightweight decoder reconstructs the input image from the latent representation together with the mask tokens (Figure 1). Shifting the mask tokens to the small decoder in this asymmetric design greatly reduces computation. Under this design, a very high masking ratio (e.g., 75%) achieves a win-win: it optimizes accuracy while letting the encoder process only a small portion (e.g., 25%) of the patches. This reduces overall pre-training time by 3x or more, likewise reduces memory consumption, and makes it easy to scale MAE to large models.
Our MAE learns very high-capacity models that generalize well. With MAE pre-training, we can train datahungry models like ViT-Large/-Huge [16] on ImageNet-1K with improved generalization performance. With a vanilla ViT-Huge model, we achieve 87.8% accuracy when finetuned on ImageNet-1K. This outperforms all previous results that use only ImageNet-1K data. We also evaluate transfer learning on object detection, instance segmentation, and semantic segmentation. In these tasks, our pre-training achieves better results than its supervised pre-training counterparts, and more importantly, we observe significant gains by scaling up models. These observations are aligned with those witnessed in self-supervised pre-training in NLP [14, 40, 41, 4] and we hope that they will enable our field to explore a similar trajectory.
MAE learns very high-capacity models that generalize well. With MAE pre-training, data-hungry models such as ViT-Large/-Huge can be trained on ImageNet-1K with improved generalization. A vanilla ViT-Huge model reaches 87.8% accuracy when fine-tuned on ImageNet-1K, outperforming all previous results that use only ImageNet-1K data.
The paper also evaluates transfer learning on object detection, instance segmentation, and semantic segmentation. On these tasks, MAE pre-training achieves better results than its supervised counterpart, and, more importantly, significant gains are observed when scaling up models. These observations are aligned with those seen in self-supervised pre-training in NLP, and the hope is that they will allow the vision community to explore a similar trajectory.
Masked language modeling and its autoregressive counterparts, e.g., BERT [14] and GPT [40, 41, 4], are highly successful methods for pre-training in NLP. These methods hold out a portion of the input sequence and train models to predict the missing content. These methods have been shown to scale excellently [4] and a large abundance of evidence indicates that these pre-trained representations generalize well to various downstream tasks.
Autoencoding is a classical method for learning representations. It has an encoder that maps an input to a latent representation and a decoder that reconstructs the input. For example, PCA and k-means are autoencoders [25]. Denoising autoencoders (DAE) [48] are a class of autoencoders that corrupt an input signal and learn to reconstruct the original, uncorrupted signal. A series of methods can be thought of as a generalized DAE under different corruptions, e.g., masking pixels [49, 39, 6] or removing color channels [59]. Our MAE is a form of denoising autoencoding, but different from the classical DAE in numerous ways.
Masked image encoding methods learn representations from images corrupted by masking. The pioneering work of [49] presents masking as a noise type in DAE. Context Encoder [39] inpaints large missing regions using convolutional networks. Motivated by the success in NLP, related recent methods [6, 16, 2] are based on Transformers [47]. iGPT [6] operates on sequences of pixels and predicts unknown pixels. The ViT paper [16] studies masked patch prediction for self-supervised learning. Most recently, BEiT [2] proposes to predict discrete tokens [37, 43].
Self-supervised learning approaches have seen significant interest in computer vision, often focusing on different pretext tasks for pre-training [15, 50, 35, 59, 38, 17]. Recently, contrastive learning [3, 21] has been popular, e.g., [51, 36, 22, 7], which models image similarity and dissimilarity (or only similarity [20, 8]) between two or more views. Contrastive and related methods strongly depend on data augmentation [7, 20, 8]. Autoencoding pursues a conceptually different direction, and it exhibits different behaviors as we will present.
Masked language modeling and its autoregressive counterparts, e.g., BERT [14] and GPT [40, 41, 4], are highly successful pre-training methods in NLP. These methods hold out a portion of the input sequence and train models to predict the missing content. They have been shown to scale excellently, and abundant evidence indicates that the pre-trained representations generalize well to various downstream tasks.
Autoencoding is a classical method for learning representations: an encoder maps an input to a latent representation and a decoder reconstructs the input. For example, PCA and k-means are autoencoders [25]. Denoising autoencoders (DAE) [48] are a class of autoencoders that corrupt an input signal and learn to reconstruct the original, uncorrupted signal. A series of methods can be viewed as generalized DAEs under different corruptions, e.g., masking pixels [49, 39, 6] or removing color channels [59]. MAE is a form of denoising autoencoding, but it differs from the classical DAE in numerous ways.
Masked image encoding methods learn representations from images corrupted by masking. The pioneering work of [49] presents masking as a noise type in DAE. Context Encoder [39] inpaints large missing regions using convolutional networks. Motivated by the success in NLP, recent related methods [6, 16, 2] are based on Transformers [47]. iGPT [6] operates on sequences of pixels and predicts unknown pixels. The ViT paper [16] studies masked patch prediction for self-supervised learning. Most recently, BEiT [2] proposes to predict discrete tokens [37, 43].
Self-supervised learning approaches have seen significant interest in computer vision, often focusing on different pretext tasks for pre-training [15, 50, 35, 59, 38, 17]. Recently, contrastive learning [3, 21] has been popular, e.g., [51, 36, 22, 7], which models image similarity and dissimilarity (or only similarity [20, 8]) between two or more views. Contrastive and related methods depend strongly on data augmentation [7, 20, 8]. Autoencoding pursues a conceptually different direction and, as the paper shows, exhibits different behaviors.
Our masked autoencoder (MAE) is a simple autoencoding approach that reconstructs the original signal given its partial observation. Like all autoencoders, our approach has an encoder that maps the observed signal to a latent representation, and a decoder that reconstructs the original signal from the latent representation. Unlike classical autoencoders, we adopt an asymmetric design that allows the encoder to operate only on the partial, observed signal (without mask tokens) and a lightweight decoder that reconstructs the full signal from the latent representation and mask tokens. Figure 1 illustrates the idea, introduced next.
The masked autoencoder (MAE) of this paper is a simple autoencoding approach that reconstructs the original signal given its partial observation. Like all autoencoders, it has an encoder that maps the observed signal to a latent representation and a decoder that reconstructs the original signal from the latent representation. Unlike classical autoencoders, MAE adopts an asymmetric design: the encoder operates only on the partial, observed signal (without mask tokens), and a lightweight decoder reconstructs the full signal from the latent representation and the mask tokens.
Masking
Following ViT [16], we divide an image into regular non-overlapping patches. Then we sample a subset of patches and mask (i.e., remove) the remaining ones. Our sampling strategy is straightforward: we sample random patches without replacement, following a uniform distribution. We simply refer to this as “random sampling”.
Random sampling with a high masking ratio (i.e., the ratio of removed patches) largely eliminates redundancy, thus creating a task that cannot be easily solved by extrapolation from visible neighboring patches (see Figures 2 – 4). The uniform distribution prevents a potential center bias (i.e., more masked patches near the image center). Finally, the highly sparse input creates an opportunity for designing an efficient encoder, introduced next.
As in ViT, the image is divided into regular non-overlapping patches. A subset of the patches is then sampled, and the remaining ones are masked (i.e., removed). The sampling strategy is straightforward: random patches are sampled without replacement, following a uniform distribution. This is simply referred to as "random sampling".
Random sampling with a high masking ratio (i.e., the ratio of removed patches) largely eliminates redundancy, creating a task that cannot easily be solved by extrapolating from visible neighboring patches (see Figures 2-4). The uniform distribution prevents a potential center bias (i.e., more masked patches near the image center). Finally, the highly sparse input creates an opportunity to design an efficient encoder, introduced next.
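To make random sampling concrete, here is a minimal sketch (not the authors' released code) of per-image uniform sampling without replacement via a noise-argsort trick; the tensor `x` is assumed to hold patch embeddings of shape [B, N, D], and `mask_ratio` is the fraction of patches to remove.

```python
import torch

def random_masking(x: torch.Tensor, mask_ratio: float = 0.75):
    """Uniformly sample patches to keep, without replacement.

    x: [B, N, D] patch embeddings. Returns the visible subset, a binary
    mask (1 = removed) in the original patch order, and the indices
    needed later to undo the shuffle.
    """
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=x.device)         # i.i.d. uniform noise per patch
    ids_shuffle = torch.argsort(noise, dim=1)         # a random permutation per image
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation

    ids_keep = ids_shuffle[:, :len_keep]              # first len_keep entries = kept patches
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=x.device)
    mask[:, :len_keep] = 0                            # 0 = keep, 1 = remove (shuffled order)
    mask = torch.gather(mask, 1, ids_restore)         # back to the original patch order
    return x_visible, mask, ids_restore
```

Sorting i.i.d. noise yields an unbiased random permutation, so every patch is equally likely to be masked, matching the uniform, center-bias-free sampling described above.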
MAE encoder
Our encoder is a ViT [16] but applied only on visible, unmasked patches. Just as in a standard ViT, our encoder embeds patches by a linear projection with added positional embeddings, and then processes the resulting set via a series of Transformer blocks. However, our encoder only operates on a small subset (e.g., 25%) of the full set. Masked patches are removed; no mask tokens are used. This allows us to train very large encoders with only a fraction of compute and memory. The full set is handled by a lightweight decoder, described next.
MAE encoder
The MAE encoder is a ViT [16] applied only to the visible, unmasked patches. As in a standard ViT, the encoder embeds patches by a linear projection with added positional embeddings and then processes the resulting set through a series of Transformer blocks. However, it operates only on a small subset of the full set (the unmasked part, typically 25% in this paper). Masked patches are removed; no mask tokens are used. This allows training very large encoders with only a fraction of the compute and memory. The full set is handled by the lightweight decoder.
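The following is a hedged sketch of an encoder that sees only the visible tokens, reusing the `random_masking` helper sketched above. The dimensions default to ViT-L-like values, the class token and other ViT details are omitted, and `nn.TransformerEncoder` stands in for the ViT blocks; none of this is the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MAEEncoder(nn.Module):
    """ViT-style encoder applied only to the visible (unmasked) patches."""

    def __init__(self, num_patches=196, patch_dim=768, dim=1024, depth=24, heads=16):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)               # linear projection of flattened patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))  # zero-init here for brevity
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patches, mask_ratio=0.75):
        x = self.patch_embed(patches) + self.pos_embed             # embed and position all patches
        x, mask, ids_restore = random_masking(x, mask_ratio)       # drop ~75% of the tokens
        x = self.norm(self.blocks(x))                              # blocks see only ~25% of the tokens
        return x, mask, ids_restore
```

Because the blocks run on roughly a quarter of the tokens, the encoder's cost per image drops accordingly, which is where most of the reported savings come from.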
MAE decoder
The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches, and (ii) mask tokens. See Figure 1. Each mask token [14] is a shared, learned vector that indicates the presence of a missing patch to be predicted. We add positional embeddings to all tokens in this full set; without this, mask tokens would have no information about their location in the image. The decoder has another series of Transformer blocks.
The MAE decoder is only used during pre-training to perform the image reconstruction task (only the encoder is used to produce image representations for recognition). Therefore, the decoder architecture can be flexibly designed in a manner that is independent of the encoder design. We experiment with very small decoders, narrower and shallower than the encoder. For example, our default decoder has <10% computation per token vs. the encoder. With this asymmetrical design, the full set of tokens are only processed by the lightweight decoder, which significantly reduces pre-training time.
MAE decoder
The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches and (ii) mask tokens; see Figure 1. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. Positional embeddings are added to all tokens in this full set; without them, the mask tokens would carry no information about their location in the image. The decoder has another series of Transformer blocks.
The MAE decoder is used only during pre-training to perform the image reconstruction task (only the encoder is used to produce image representations for recognition). The decoder architecture can therefore be designed flexibly, independently of the encoder. The authors experiment with very small decoders, narrower and shallower than the encoder; for example, the default decoder has <10% computation per token vs. the encoder. With this asymmetric design, the full set of tokens is processed only by the lightweight decoder, which significantly reduces pre-training time.
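A matching sketch of the lightweight decoder: a single shared, learnable mask token is appended for every removed patch, the full list is unshuffled back to image order with `ids_restore`, positional embeddings are added, and a narrow, shallow Transformer followed by a linear head predicts one pixel vector per patch. The 512-d width and 8-block depth mirror the paper's defaults; everything else is illustrative.

```python
import torch
import torch.nn as nn

class MAEDecoder(nn.Module):
    """Narrow, shallow decoder that reconstructs every patch in pixel space."""

    def __init__(self, num_patches=196, enc_dim=1024, dim=512, depth=8, heads=16, patch_dim=768):
        super().__init__()
        self.embed = nn.Linear(enc_dim, dim)                       # project encoder output to decoder width
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))     # shared, learned [M] vector
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, depth)
        self.head = nn.Linear(dim, patch_dim)                      # predicts the pixel values of one patch

    def forward(self, x_visible, ids_restore):
        x = self.embed(x_visible)                                  # [B, N_visible, dim]
        B, n_vis, dim = x.shape
        n_masked = ids_restore.shape[1] - n_vis
        mask_tokens = self.mask_token.expand(B, n_masked, -1)      # one mask token per removed patch
        x = torch.cat([x, mask_tokens], dim=1)                     # full-length list, still in shuffled order
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, dim))  # unshuffle to image order
        x = x + self.pos_embed                                     # mask tokens learn where they are
        return self.head(self.blocks(x))                           # [B, N, patch_dim] reconstructed pixels
```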
Reconstruction target
Our MAE reconstructs the input by predicting the pixel values for each masked patch. Each element in the decoder’s output is a vector of pixel values representing a patch. The last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch. The decoder’s output is reshaped to form a reconstructed image. Our loss function computes the mean squared error (MSE) between the reconstructed and original images in the pixel space. We compute the loss only on masked patches, similar to BERT [14].
We also study a variant whose reconstruction target is the normalized pixel values of each masked patch. Specifically, we compute the mean and standard deviation of all pixels in a patch and use them to normalize this patch. Using normalized pixels as the reconstruction target improves representation quality in our experiments.
Reconstruction target
MAE reconstructs the input by predicting the pixel values of each masked patch. Each element of the decoder's output is a vector of pixel values representing a patch. The last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch. The decoder's output is reshaped to form a reconstructed image. The loss function computes the mean squared error (MSE) between the reconstructed and original images in pixel space, and the loss is computed only on masked patches, similar to BERT [14].
The authors also study a variant whose reconstruction target is the normalized pixel values of each masked patch: the mean and standard deviation of all pixels in a patch are computed and used to normalize that patch. Using normalized pixels as the reconstruction target improves representation quality in the experiments.
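A small sketch of the loss, including the normalized-pixel variant: the target for each patch is optionally normalized by that patch's own mean and standard deviation, and the mean squared error is averaged over masked patches only. `pred` and `patches` are assumed to be [B, N, patch_dim] tensors and `mask` the [B, N] binary mask from the sampling step; `eps` is an assumed numerical constant.

```python
import torch

def mae_loss(pred, patches, mask, normalize_target=True, eps=1e-6):
    """MSE in pixel space, computed only on masked patches (mask: 1 = masked)."""
    target = patches
    if normalize_target:
        mean = target.mean(dim=-1, keepdim=True)      # per-patch mean
        var = target.var(dim=-1, keepdim=True)        # per-patch variance
        target = (target - mean) / (var + eps).sqrt()

    loss = ((pred - target) ** 2).mean(dim=-1)        # [B, N] per-patch MSE
    return (loss * mask).sum() / mask.sum()           # average over masked patches only
```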
Simple implementation
Our MAE pre-training can be implemented efficiently, and importantly, does not require any specialized sparse operations. First we generate a token for every input patch (by linear projection with an added positional embedding). Next we randomly shuffle the list of tokens and remove the last portion of the list, based on the masking ratio. This process produces a small subset of tokens for the encoder and is equivalent to sampling patches without replacement. After encoding, we append a list of mask tokens to the list of encoded patches, and unshuffle this full list (inverting the random shuffle operation) to align all tokens with their targets. The decoder is applied to this full list (with positional embeddings added). As noted, no sparse operations are needed. This simple implementation introduces negligible overhead as the shuffling and unshuffling operations are fast.
Simple implementation
MAE pre-training can be implemented efficiently and, importantly, requires no specialized sparse operations.
First, a token is generated for every input patch (by linear projection with an added positional embedding).
Next, the list of tokens is randomly shuffled, and the last portion of the list is removed according to the masking ratio. This produces a small subset of tokens for the encoder and is equivalent to sampling patches without replacement.
After encoding, a list of mask tokens is appended to the list of encoded patches, and the full list is unshuffled (inverting the random shuffle) so that all tokens are aligned with their targets.
The decoder is applied to this full list (with positional embeddings added).
As noted, no sparse operations are needed. This simple implementation introduces negligible overhead, since shuffling and unshuffling are fast.
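Putting the sketches above together, a hedged end-to-end example of one pre-training step; the `patchify` helper is a hypothetical utility for flattening 16x16 patches, not a library call. Every step is a dense concatenation or `gather`, so no sparse operations are needed.

```python
import torch

def patchify(imgs, patch_size=16):
    """[B, 3, H, W] -> [B, N, patch_size*patch_size*3] flattened patches."""
    B, C, H, W = imgs.shape
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch_size * patch_size * C)
    return x

encoder, decoder = MAEEncoder(), MAEDecoder()
imgs = torch.randn(8, 3, 224, 224)                        # a dummy batch
patches = patchify(imgs)                                  # [8, 196, 768]
latent, mask, ids_restore = encoder(patches, mask_ratio=0.75)
pred = decoder(latent, ids_restore)                       # [8, 196, 768] predicted pixels
loss = mae_loss(pred, patches, mask)                      # loss only on the 75% masked patches
loss.backward()
```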
We do self-supervised pre-training on the ImageNet-1K (IN1K) [13] training set. Then we do supervised training to evaluate the representations with (i) end-to-end fine-tuning or (ii) linear probing. We report top-1 validation accuracy of a single 224×224 crop. Details are in Appendix A.1.
Self-supervised pre-training is done on the ImageNet-1K (IN1K) training set. Supervised training is then used to evaluate the representations with (i) end-to-end fine-tuning or (ii) linear probing. Top-1 validation accuracy of a single 224×224 crop is reported. Details are in Appendix A.1 (only the table captions are reproduced below; see the original paper for the full descriptions).
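As a rough illustration of the two protocols, the snippet below sketches linear probing using the modules sketched earlier: the pre-trained encoder is frozen and only a linear classifier on pooled features is trained. The paper's actual recipes involve further details not shown here, and the names below are assumptions. End-to-end fine-tuning instead leaves the whole encoder trainable.

```python
import torch
import torch.nn as nn

encoder = MAEEncoder()                       # assume this holds MAE pre-trained weights
for p in encoder.parameters():
    p.requires_grad = False                  # linear probing: the backbone stays frozen

classifier = nn.Linear(1024, 1000)           # 1000 ImageNet-1K classes
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)

def linear_probe_logits(imgs):
    with torch.no_grad():
        feats, _, _ = encoder(patchify(imgs), mask_ratio=0.0)   # no masking at evaluation time
    return classifier(feats.mean(dim=1))     # average-pool the patch tokens, then classify
```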
Baseline: ViT-Large
We use ViT-Large (ViT-L/16) [16] as the backbone in our ablation study. ViT-L is very big (an order of magnitude bigger than ResNet-50 [24]) and tends to overfit. The following is a comparison between ViT-L trained from scratch vs. fine-tuned from our baseline MAE:
We note that it is nontrivial to train supervised ViT-L from scratch and a good recipe with strong regularization is needed (82.5%, see Appendix A.2). Even so, our MAE pretraining contributes a big improvement. Here fine-tuning is only for 50 epochs (vs. 200 from scratch), implying that the fine-tuning accuracy heavily depends on pre-training.
Baseline: ViT-Large
ViT-Large (ViT-L/16) [16] is used as the backbone of the ablation study. ViT-L is very big (an order of magnitude bigger than ResNet-50) and tends to overfit. The table in the paper compares ViT-L trained from scratch with ViT-L fine-tuned from the baseline MAE.
Note that it is nontrivial to train supervised ViT-L from scratch; a good recipe with strong regularization is needed (82.5%, see Appendix A.2 of the original paper). Even so, MAE pre-training contributes a big improvement. Fine-tuning here runs for only 50 epochs (vs. 200 from scratch), implying that the fine-tuning accuracy depends heavily on pre-training.
We ablate our MAE using the default settings in Table 1 (see caption). Several intriguing properties are observed.
Gray marks the paper's default settings in Table 1.
Masking ratio (not shown in Table 1)
Figure 5 shows the influence of the masking ratio. The optimal ratios are surprisingly high. The ratio of 75% is good for both linear probing and fine-tuning. This behavior is in contrast with BERT [14], whose typical masking ratio is 15%. Our masking ratios are also much higher than those in related works [6, 16, 2] in computer vision (20% to 50%).
The model infers missing patches to produce different, yet plausible, outputs (Figure 4). It makes sense of the gestalt of objects and scenes, which cannot be simply completed by extending lines or textures. We hypothesize that this reasoning-like behavior is linked to the learning of useful representations.
Figure 5 also shows that linear probing and fine-tuning results follow different trends. For linear probing, the accuracy increases steadily with the masking ratio until the sweet point: the accuracy gap is up to ∼20% (54.6% vs. 73.5%). For fine-tuning, the results are less sensitive to the ratios, and a wide range of masking ratios (40–80%) work well. All fine-tuning results in Figure 5 are better than training from scratch (82.5%).
The MAE is ablated using the default settings in Table 1 (see its caption).
Masking ratio: Figure 5 shows the influence of the masking ratio. The optimal ratios are surprisingly high: 75% works well for both linear probing and fine-tuning. This behavior contrasts with BERT, whose typical masking ratio is 15%, and the ratios are also much higher than those in related work in computer vision (20% to 50%).
The model infers missing patches to produce different yet plausible outputs (Figure 4). It makes sense of the gestalt of objects and scenes, which cannot be completed simply by extending lines or textures. The authors hypothesize that this reasoning-like behavior is linked to the learning of useful representations.
Figure 5 also shows that linear probing and fine-tuning follow different trends. For linear probing, accuracy increases steadily with the masking ratio until the sweet point: the gap reaches ~20% (54.6% vs. 73.5%). For fine-tuning, the results are less sensitive to the ratio, and a wide range of masking ratios (40-80%) works well. All fine-tuning results in Figure 5 are better than training from scratch (82.5%).
Decoder design
Our MAE decoder can be flexibly designed, as studied in Table 1a and 1b.
Table 1a varies the decoder depth (number of Transformer blocks). A sufficiently deep decoder is important for linear probing. This can be explained by the gap between a pixel reconstruction task and a recognition task: the last several layers in an autoencoder are more specialized for reconstruction, but are less relevant for recognition. A reasonably deep decoder can account for the reconstruction specialization, leaving the latent representations at a more abstract level. This design can yield up to 8% improvement in linear probing (Table 1a, ‘lin’). However, if fine-tuning is used, the last layers of the encoder can be tuned to adapt to the recognition task. The decoder depth is less influential for improving fine-tuning (Table 1a, ‘ft’).
Interestingly, our MAE with a single-block decoder can perform strongly with fine-tuning (84.8%). Note that a single Transformer block is the minimal requirement to propagate information from visible tokens to mask tokens. Such a small decoder can further speed up training.
In Table 1b we study the decoder width (number of channels). We use 512-d by default, which performs well under fine-tuning and linear probing. A narrower decoder also works well with fine-tuning.
Overall, our default MAE decoder is lightweight. It has 8 blocks and a width of 512-d ( gray in Table 1). It only has 9% FLOPs per token vs. ViT-L (24 blocks, 1024-d). As such, while the decoder processes all tokens, it is still a small fraction of the overall compute.
Decoder design:
The MAE decoder ablations, shown in Tables 1a and 1b, vary its depth and width.
Table 1a varies the decoder depth (number of Transformer blocks). A sufficiently deep decoder is important for linear probing (Table 1a, 'lin'). This can be explained by the gap between a pixel reconstruction task and a recognition task: the last several layers of an autoencoder are more specialized for reconstruction and less relevant for recognition. A reasonably deep decoder can absorb this reconstruction specialization, leaving the latent representations at a more abstract level; this design yields up to an 8% improvement in linear probing. However, if fine-tuning is used (Table 1a, 'ft'), the last layers of the encoder can be tuned to adapt to the recognition task, and the decoder depth is less influential.
Interestingly, the MAE with a single-block decoder performs strongly with fine-tuning (84.8%). A single Transformer block is the minimal requirement to propagate information from visible tokens to mask tokens, and such a small decoder can further speed up training.
Table 1b studies the decoder width (number of channels). 512-d is used by default, which performs well under both fine-tuning and linear probing. A narrower decoder also works well with fine-tuning.
Overall, the default MAE decoder is lightweight: 8 blocks and a width of 512-d (gray in Table 1). It has only 9% FLOPs per token vs. ViT-L (24 blocks, 1024-d). So while the decoder processes all tokens, it remains a small fraction of the overall compute.
Mask token
An important design of our MAE is to skip the mask token [M] in the encoder and apply it later in the lightweight decoder. Table 1c studies this design.
If the encoder uses mask tokens, it performs worse: its accuracy drops by 14% in linear probing. In this case, there is a gap between pre-training and deploying: this encoder has a large portion of mask tokens in its input in pretraining, which does not exist in uncorrupted images. This gap may degrade accuracy in deployment. By removing the mask token from the encoder, we constrain the encoder to always see real patches and thus improve accuracy.
Moreover, by skipping the mask token in the encoder, we greatly reduce training computation. In Table 1c, we reduce the overall training FLOPs by 3.3×. This leads to a 2.8× wall-clock speedup in our implementation (see Table 2). The wall-clock speedup is even bigger (3.5–4.1×), for a smaller decoder (1-block), a larger encoder (ViT-H), or both. Note that the speedup can be >4× for a masking ratio of 75%, partially because the self-attention complexity is quadratic. In addition, memory is greatly reduced, which can enable training even larger models or speeding up more by large-batch training. The time and memory efficiency makes our MAE favorable for training very large models.
Mask token
An important design of MAE is to skip the mask token [M] in the encoder and apply it later in the lightweight decoder. Table 1c studies this design.
If the encoder uses mask tokens, it performs worse: its accuracy drops by 14% in linear probing. In this case there is a gap between pre-training and deployment: in pre-training the encoder's input contains a large portion of mask tokens, which do not exist in uncorrupted images. This gap may degrade accuracy at deployment. Removing the mask tokens from the encoder constrains it to always see real patches, which improves accuracy.
Moreover, skipping the mask token in the encoder greatly reduces training computation. In Table 1c, the overall training FLOPs are reduced by 3.3x, which leads to a 2.8x wall-clock speedup in the MAE implementation (see Table 2). With a smaller decoder (1 block), a larger encoder (ViT-H), or both, the wall-clock speedup is even bigger (3.5-4.1x).
Note that for a masking ratio of 75% the speedup can be >4x, partly because the self-attention complexity is quadratic.
In addition, memory usage is greatly reduced, which enables training even larger models or speeding up training further with large batches.
This time and memory efficiency makes MAE favorable for training very large models.
Reconstruction target
We compare different reconstruction targets in Table 1d. Our results thus far are based on pixels without (per-patch) normalization. Using pixels with normalization improves accuracy. This per-patch normalization enhances the contrast locally. In another variant, we perform PCA in the patch space and use the largest PCA coefficients (96 here) as the target. Doing so degrades accuracy. Both experiments suggest that the high-frequency components are useful in our method.
We also compare an MAE variant that predicts tokens, the target used in BEiT [2]. Specifically for this variant, we use the DALLE pre-trained dVAE [43] as the tokenizer, following [2]. Here the MAE decoder predicts the token indices using cross-entropy loss. This tokenization improves fine-tuning accuracy by 0.4% vs. unnormalized pixels, but has no advantage vs. normalized pixels. It also reduces linear probing accuracy. In §5 we further show that tokenization is not necessary in transfer learning.
Our pixel-based MAE is much simpler than tokenization. The dVAE tokenizer requires one more pre-training stage, which may depend on extra data (250M images [43]). The dVAE encoder is a large convolutional network (40% FLOPs of ViT-L) and adds nontrivial overhead. Using pixels does not suffer from these problems.
Reconstruction target
Table 1d compares different reconstruction targets. The results so far are based on pixels without (per-patch) normalization.
Using pixels with normalization improves accuracy; this per-patch normalization enhances contrast locally.
In another variant, PCA is performed in the patch space and the largest PCA coefficients (96 here) are used as the target. Doing so degrades accuracy.
Both experiments suggest that high-frequency components are useful in this method.
The paper also compares an MAE variant that predicts tokens, the target used in BEiT. Specifically, this variant uses the DALLE pre-trained dVAE as the tokenizer, and the MAE decoder predicts the token indices with a cross-entropy loss. This tokenization improves fine-tuning accuracy by 0.4% over unnormalized pixels but has no advantage over normalized pixels; it also reduces linear probing accuracy. Section 5 further shows that tokenization is not necessary in transfer learning.
The pixel-based MAE is much simpler than tokenization. The dVAE tokenizer requires an extra pre-training stage, which may depend on extra data (250M images [43]). The dVAE encoder is a large convolutional network (40% of ViT-L's FLOPs) and adds nontrivial overhead. Using pixels does not suffer from these problems.
Data augmentation
Table 1e studies the influence of data augmentation on our MAE pre-training.
Our MAE works well using cropping-only augmentation, either fixed-size or random-size (both having random horizontal flipping). Adding color jittering degrades the results and so we do not use it in other experiments.
Surprisingly, our MAE behaves decently even if using no data augmentation (only center-crop, no flipping). This property is dramatically different from contrastive learning and related methods [51, 22, 7, 20], which heavily rely on data augmentation. It was observed [20] that using cropping-only augmentation reduces the accuracy by 13% and 28% respectively for BYOL [20] and SimCLR [7]. In addition, there is no evidence that contrastive learning can work without augmentation: the two views of an image are the same and can easily satisfy a trivial solution.
In MAE, the role of data augmentation is mainly performed by random masking (ablated next). The masks are different for each iteration and so they generate new training samples regardless of data augmentation. The pretext task is made difficult by masking and requires less augmentation to regularize training.
Data augmentation
Table 1e studies the influence of data augmentation on MAE pre-training.
MAE works well with cropping-only augmentation, either fixed-size or random-size (both with random horizontal flipping). Adding color jittering degrades the results, so it is not used in the other experiments.
Surprisingly, MAE behaves decently even with no data augmentation (only center crop, no flipping). This property is dramatically different from contrastive learning and related methods, which rely heavily on data augmentation. It was observed [20] that cropping-only augmentation reduces accuracy by 13% and 28% for BYOL [20] and SimCLR [7], respectively. Moreover, there is no evidence that contrastive learning can work without augmentation: the two views of an image would be identical and could easily satisfy a trivial solution.
In MAE, the role of data augmentation is mainly played by random masking (ablated next). The masks differ at each iteration, so they generate new training samples regardless of data augmentation. Masking makes the pretext task difficult, and less augmentation is needed to regularize training.
Mask sampling strategy
In Table 1f we compare different mask sampling strategies, illustrated in Figure 6.
The block-wise masking strategy, proposed in [2], tends to remove large blocks (Figure 6 middle). Our MAE with block-wise masking works reasonably well at a ratio of 50%, but degrades at a ratio of 75%. This task is harder than that of random sampling, as a higher training loss is observed. The reconstruction is also blurrier.
We also study grid-wise sampling, which regularly keeps one of every four patches (Figure 6 right). This is an easier task and has lower training loss. The reconstruction is sharper. However, the representation quality is lower.
Simple random sampling works the best for our MAE. It allows for a higher masking ratio, which provides a greater speedup benefit while also enjoying good accuracy.
Mask sampling strategy
Table 1f compares different mask sampling strategies, illustrated in Figure 6.
1. Block-wise masking, proposed in [2], tends to remove large blocks (Figure 6, middle). MAE with block-wise masking works reasonably well at a ratio of 50% but degrades at 75%. This task is harder than random sampling: the training loss is higher and the reconstructions are blurrier.
2. Grid-wise sampling regularly keeps one of every four patches (Figure 6, right). This is an easier task with lower training loss and sharper reconstructions, but the representation quality is lower.
3. Simple random sampling works best for MAE. It allows a higher masking ratio, which brings a larger speedup benefit while also enjoying good accuracy.
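For contrast with random sampling, here is a hedged sketch of the grid-wise strategy (keep one of every four patches on a regular 2x2 grid); `grid_h` and `grid_w` (the patch-grid size) are assumed values for a 224x224 image with 16x16 patches.

```python
import torch

def grid_masking(x, grid_h=14, grid_w=14):
    """Keep the top-left patch of every 2x2 cell (25% kept); mask the rest.

    x: [B, N, D] with N == grid_h * grid_w. Returns the visible tokens and a
    binary mask (1 = removed) in the original patch order.
    """
    B, N, D = x.shape
    rows = torch.arange(grid_h, device=x.device).repeat_interleave(grid_w)  # row index per patch
    cols = torch.arange(grid_w, device=x.device).repeat(grid_h)             # column index per patch
    keep = (rows % 2 == 0) & (cols % 2 == 0)                                # deterministic 25% subset

    x_visible = x[:, keep, :]
    mask = (~keep).float().unsqueeze(0).expand(B, -1)
    return x_visible, mask
```

Because every masked patch has a visible neighbor, the task is easier than 75% random masking, which is consistent with the lower training loss and weaker representations reported above.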
Training schedule
Our ablations thus far are based on 800-epoch pre-training. Figure 7 shows the influence of the training schedule length. The accuracy improves steadily with longer training. Indeed, we have not observed saturation of linear probing accuracy even at 1600 epochs. This behavior is unlike contrastive learning methods, e.g., MoCo v3 [9] saturates at 300 epochs for ViT-L. Note that the MAE encoder only sees 25% of patches per epoch, while in contrastive learning the encoder sees 200% (two-crop) or even more (multi-crop) patches per epoch.
Training schedule
The ablations so far are based on 800-epoch pre-training. Figure 7 shows the influence of the training schedule length. Accuracy improves steadily with longer training; indeed, saturation of linear probing accuracy is not observed even at 1600 epochs. This behavior differs from contrastive learning methods, e.g., MoCo v3 saturates at 300 epochs for ViT-L. Note that the MAE encoder sees only 25% of the patches per epoch, while in contrastive learning the encoder sees 200% (two-crop) or even more (multi-crop) patches per epoch.
Comparisons with self-supervised methods
In Table 3 we compare the fine-tuning results of self-supervised ViT models. For ViT-B, all methods perform closely. For ViT-L, the gaps among methods are bigger, suggesting that a challenge for bigger models is to reduce overfitting.
Our MAE can scale up easily and has shown steady improvement from bigger models. We obtain 86.9% accuracy using ViT-H (224 size). By fine-tuning with a 448 size, we achieve 87.8% accuracy, using only IN1K data. The previous best accuracy, among all methods using only IN1K data, is 87.1% (512 size) [56], based on advanced networks. We improve over the state-of-the-art by a nontrivial margin in the highly competitive benchmark of IN1K (no external data). Our result is based on vanilla ViT, and we expect advanced networks will perform better.
Comparing with BEiT [2], our MAE is more accurate while being simpler and faster. Our method reconstructs pixels, in contrast to BEiT that predicts tokens: BEiT reported a 1.8% degradation [2] when reconstructing pixels with ViT-B. We do not need dVAE pre-training. Moreover, our MAE is considerably faster (3.5× per epoch) than BEiT, for the reason as studied in Table 1c.
The MAE models in Table 3 are pre-trained for 1600 epochs for better accuracy (Figure 7). Even so, our total pre-training time is less than all other methods if they were trained in the same hardware. For example, with ViT-L, our MAE’s training time is 31 hours for 1600 epochs and MoCo v3’s is 36 hours for 300 epochs [9], using the same 128 TPU-v3 cores.
Comparisons with self-supervised methods
Table 3 compares the fine-tuning results of self-supervised ViT models. For ViT-B, all methods perform closely. For ViT-L, the gaps among methods are bigger, suggesting that a challenge for bigger models is to reduce overfitting.
MAE scales up easily and shows steady improvement from bigger models. With ViT-H (224 size) it obtains 86.9% accuracy, and by fine-tuning at 448 size it reaches 87.8% accuracy using only IN1K data.
Among all methods using only IN1K data, the previous best accuracy is 87.1% (512 size) [56], based on advanced networks. MAE improves over the state of the art by a nontrivial margin on the highly competitive IN1K benchmark (no external data). The result is based on vanilla ViT, and more advanced networks are expected to perform better.
Compared with BEiT [2], MAE is more accurate while being simpler and faster. MAE reconstructs pixels, whereas BEiT predicts tokens; BEiT reported a 1.8% degradation when reconstructing pixels with ViT-B. MAE does not need dVAE pre-training.
Moreover, MAE is considerably faster than BEiT (3.5x per epoch), for the reason studied in Table 1c.
The MAE models in Table 3 are pre-trained for 1600 epochs for better accuracy (Figure 7). Even so, the total pre-training time is less than that of all other methods trained on the same hardware. For example, on the same 128 TPU-v3 cores (a hardware budget that makes one's heart skip a beat T~T), MAE's ViT-L training takes 31 hours for 1600 epochs, while MoCo v3 takes 36 hours for 300 epochs.
Comparisons with supervised pre-training
In the original ViT paper [16], ViT-L degrades when trained in IN1K. See Figure 8. Our improved supervised recipe works better for training from scratch (Figure 8, “our impl.”; see A.2), but the accuracy is saturated.
Our MAE pre-training, using only IN1K, can generalize better: the gain over training from scratch is bigger for higher-capacity models. It follows a trend similar to the JFT-300M supervised pre-training in [16]. This comparison shows that our MAE can help scale up model sizes.
Comparisons with supervised pre-training
In the original ViT paper, ViT-L degrades when trained on IN1K; see Figure 8. The improved supervised recipe works better for training from scratch (Figure 8, "our impl."; see A.2), but the accuracy saturates.
MAE pre-training, using only IN1K, generalizes better: the gain over training from scratch is bigger for higher-capacity models. It follows a trend similar to the JFT-300M supervised pre-training in [16]. This comparison shows that MAE can help scale up model sizes.
Table 1 shows that linear probing and fine-tuning results are largely uncorrelated. Linear probing has been a popular protocol in the past few years; however, it misses the opportunity of pursuing strong but non-linear features—which is indeed a strength of deep learning. As a middle ground, we study a partial fine-tuning protocol: fine-tune the last several layers while freezing the others. This protocol was also used in early works, e.g., [54, 59, 35].
Figure 9 shows the results. Notably, fine-tuning only one Transformer block boosts the accuracy significantly from 73.5% to 81.0%. Moreover, if we fine-tune only “half” of the last block (i.e., its MLP sub-block), we can get 79.1%, much better than linear probing. This variant is essentially fine-tuning an MLP head. Fine-tuning a few blocks (e.g., 4 or 6) can achieve decent accuracy, which is still a small fine-tuning head compared with the frozen backbone.
In Figure 9 we also compare with MoCo v3 [9], which is a contrastive method with ViT-L results available. It has higher linear probing accuracy than our MAE. However, all of its partial fine-tuning results are worse than ours. The gap is 2.6% when tuning 4 blocks.
These results show that the MAE representations are less linearly separable, but they are stronger non-linear features and perform well when a non-linear head is tuned. These observations suggest that linear separability is not the sole metric for evaluating representation quality. It has also been observed (e.g., [8]) that linear probing is not well correlated with transfer learning performance, e.g., for object detection. To our knowledge, linear evaluation is not often used in NLP for benchmarking pre-training.
Table 1 shows that linear probing and fine-tuning results are largely uncorrelated.
Linear probing has been a popular protocol in the past few years; however, it misses the opportunity to pursue strong but non-linear features, which is precisely a strength of deep learning. As a middle ground, the paper studies a partial fine-tuning protocol: fine-tune the last several layers while freezing the others. This protocol was also used in earlier work, e.g., [54, 59, 35].
Figure 9 shows the results. Notably, fine-tuning only one Transformer block boosts accuracy significantly, from 73.5% to 81.0%.
Moreover, fine-tuning only "half" of the last block (i.e., its MLP sub-block) gives 79.1%, much better than linear probing. This variant is essentially fine-tuning an MLP head.
Fine-tuning a few blocks (e.g., 4 or 6) achieves decent accuracy while still being a small fine-tuning head compared with the frozen backbone.
Figure 9 also compares with MoCo v3 [9], a contrastive method with available ViT-L results. It has higher linear probing accuracy than MAE, but all of its partial fine-tuning results are worse; the gap is 2.6% when tuning 4 blocks.
These results show that MAE representations are less linearly separable, but they are stronger non-linear features and perform well when a non-linear head is tuned.
These observations suggest that linear separability is not the sole metric for evaluating representation quality.
Also, linear probing is not well correlated with transfer learning performance, e.g., for object detection.
To the authors' knowledge, linear evaluation is not often used in NLP for benchmarking pre-training.
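A hedged sketch of partial fine-tuning against the encoder sketched earlier: freeze everything, then unfreeze only the last few Transformer blocks (stored in `encoder.blocks.layers` for `nn.TransformerEncoder`) together with the classification head. The optimizer choice and learning rate are assumptions.

```python
import torch

def partial_finetune(encoder, classifier, num_trainable_blocks=4):
    """Fine-tune the last `num_trainable_blocks` blocks; keep the rest frozen."""
    for p in encoder.parameters():
        p.requires_grad = False                          # freeze the whole backbone first
    if num_trainable_blocks > 0:
        for block in encoder.blocks.layers[-num_trainable_blocks:]:
            for p in block.parameters():
                p.requires_grad = True                   # unfreeze only the last few blocks

    trainable = [p for p in encoder.parameters() if p.requires_grad]
    trainable += list(classifier.parameters())           # the head is always trained
    return torch.optim.AdamW(trainable, lr=1e-4)
```

Setting `num_trainable_blocks=0` recovers linear probing, and fine-tuning only the MLP sub-block of the last layer corresponds to the "half block" point discussed above.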
We evaluate transfer learning in object detection and segmentation on COCO [32] and semantic segmentation on ADE20K [60]. We use the pre-trained models in Table 3.
Transfer learning is evaluated for object detection and segmentation on COCO [32] and semantic segmentation on ADE20K [60], using the pre-trained models in Table 3.
Object detection and segmentation
We fine-tune Mask R-CNN [23] end-to-end on COCO. The ViT backbone is adapted for use with FPN [31] (see Appendix A.3). We apply this object detection system to all entries in Table 4. We report box AP for object detection and mask AP for instance segmentation.
Compared to supervised pre-training, our MAE performs better under all configurations (Table 4). With the smaller ViT-B, our MAE is 2.4 points higher than supervised pretraining (50.3 vs. 47.9, APbox). More significantly, with the larger ViT-L, our MAE pre-training outperforms supervised pre-training by 4.0 points (53.3 vs. 49.3).
The pixel-based MAE is better than or on par with the token-based BEiT, while MAE is much simpler and faster. Both MAE and BEiT are better than MoCo v3 and MoCo v3 is on par with supervised pre-training.
Object detection and segmentation
Mask R-CNN [23] is fine-tuned end-to-end on COCO. The ViT backbone is adapted for use with FPN [31] (see Appendix A.3). This object detection system is applied to all entries in Table 4. Box AP is reported for object detection and mask AP for instance segmentation.
Compared to supervised pre-training, MAE performs better under all configurations (Table 4). With the smaller ViT-B, MAE is 2.4 points higher than supervised pre-training (50.3 vs. 47.9, APbox). More significantly, with the larger ViT-L, MAE pre-training outperforms supervised pre-training by 4.0 points (53.3 vs. 49.3).
The pixel-based MAE is better than or on par with the token-based BEiT, while being much simpler and faster. Both MAE and BEiT are better than MoCo v3, and MoCo v3 is on par with supervised pre-training.
Semantic segmentation
Our experiments on ADE20K use UperNet [52] following the code in [2]. Details are in A.4. Table 5 shows that our MAE significantly improves the transferring results of ViT-L, which is 3.7 points better than the supervised pre-training counterpart (53.6 vs. 49.9). The pixel-based MAE outperforms the token-based BEiT. These observations are consistent with those in COCO.
Semantic segmentation
The ADE20K experiments use UperNet [52], following the code of [2]. Details are in A.4. Table 5 shows that MAE significantly improves the transfer results of ViT-L, by 3.7 points over the supervised pre-training counterpart (53.6 vs. 49.9). The pixel-based MAE outperforms the token-based BEiT. These observations are consistent with those on COCO.
Pixels vs. tokens
Table 6 presents an all-around comparison on pixels vs. tokens as the MAE reconstruction target. While using dVAE tokens is better than using unnormalized pixels, it is statistically similar to just using normalized pixels across all tasks and models we studied. It again shows that tokenization is not necessary for our MAE.
Pixels vs. tokens
Table 6 presents an all-around comparison of pixels vs. tokens as the MAE reconstruction target. While using dVAE tokens is better than using unnormalized pixels, it is statistically similar to simply using normalized pixels across all tasks and models studied. This again shows that tokenization is not necessary for MAE.
Simple algorithms that scale well are the core of deep learning. In NLP, simple self-supervised learning methods (e.g., [40, 14, 41, 4]) enable benefits from exponentially scaling models. In computer vision, practical pre-training paradigms are dominantly supervised (e.g., [28, 44, 24, 16]) despite progress in self-supervised learning. In this study, we observe on ImageNet and in transfer learning that an autoencoder (a simple self-supervised method similar to techniques in NLP) provides scalable benefits. Self-supervised learning in vision may now be embarking on a similar trajectory as in NLP.
On the other hand, we note that images and languages are signals of a different nature and this difference must be addressed carefully. Images are merely recorded light without a semantic decomposition into the visual analogue of words. Instead of attempting to remove objects, we remove random patches that most likely do not form a semantic segment. Likewise, our MAE reconstructs pixels, which are not semantic entities. Nevertheless, we observe (e.g., Figure 4) that our MAE infers complex, holistic reconstructions, suggesting it has learned numerous visual concepts, i.e., semantics. We hypothesize that this behavior occurs by way of a rich hidden representation inside the MAE. We hope this perspective will inspire future work.
Simple algorithms that scale well are the core of deep learning. In NLP, simple self-supervised learning methods (e.g., [40, 14, 41, 4]) benefit from exponentially scaling models. In computer vision, despite progress in self-supervised learning, practical pre-training paradigms are dominantly supervised (e.g., [28, 44, 24, 16]). In this study, on ImageNet and in transfer learning, an autoencoder (a simple self-supervised method similar to techniques in NLP) provides scalable benefits. Self-supervised learning in vision may now be embarking on a trajectory similar to NLP's.
On the other hand, images and languages are signals of a different nature, and this difference must be addressed carefully. Images are merely recorded light without a semantic decomposition into the visual analogue of words. Instead of attempting to remove objects, MAE removes random patches that most likely do not form a semantic segment. Likewise, MAE reconstructs pixels, which are not semantic entities. Nevertheless, it is observed (e.g., Figure 4) that MAE infers complex, holistic reconstructions, suggesting it has learned numerous visual concepts, i.e., semantics. The authors hypothesize that this behavior arises via a rich hidden representation inside the MAE and hope this perspective will inspire future work.
Broader impacts. The proposed method predicts content based on learned statistics of the training dataset and as such will reflect biases in those data, including ones with negative societal impacts. The model may generate inexistent content. These issues warrant further research and consideration when building upon this work to generate images.
The proposed method predicts content based on learned statistics of the training dataset and will therefore reflect biases in those data, including ones with negative societal impacts. The model may generate nonexistent content. These issues warrant further research and consideration when building on this work to generate images.
This article only walks through the original paper and does not attempt a deeper interpretation. For more in-depth discussion, see the community commentary on Zhihu: 如何看待何恺明最新一作论文Masked Autoencoders? - 知乎