Kaiming He∗,† Xinlei Chen∗ Saining Xie Yanghao Li Piotr Dollar Ross Girshick
∗ equal technical contribution † project lead
Facebook AI Research (FAIR)
Figure 1. Our MAE architecture. During pre-training, a large random subset of image patches (e.g., 75%) is masked out. The encoder is applied to the small subset of visible patches. Mask tokens are introduced after the encoder, and the full set of encoded patches and mask tokens is processed by a small decoder that reconstructs the original image in pixels. After pre-training, the decoder is discarded and the encoder is applied to uncorrupted images to produce representations for recognition tasks.
Contents
Abstract
1. Introduction
2. Related Work
3. Approach
4. ImageNet Experiments
4.1. Main Properties
4.2. Comparisons with Previous Results
4.3. Partial Fine-tuning
5. Transfer Learning Experiments
6. Discussion and Conclusion
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
It is based on two core designs.
First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task.
Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3× or more) and improve accuracy.
Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data.
Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
Core significance of this paper:
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Core idea of this paper:
The MAE approach is simple: mask random patches of the input image and reconstruct the missing pixels.
Core method of this paper:
It is based on two core designs.
First, an asymmetric encoder-decoder architecture is developed, with an encoder that operates only on the visible subset of patches (without mask tokens) and a lightweight decoder that reconstructs the original image from the latent representation and the mask tokens.
Second, the paper finds that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task.
Coupling these two designs lets MAE train large models efficiently and effectively: training is accelerated (by 3x or more) and accuracy improves.
Experimental conclusions of this paper:
The scalable approach allows learning high-capacity models that generalize well: for example, among methods that use only ImageNet-1K data, a vanilla ViT-Huge model achieves the best accuracy (87.8%).
Transfer performance on downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
Deep learning has witnessed an explosion of architectures of continuously growing capability and capacity [28, 24, 47]. Aided by the rapid gains in hardware, models today can easily overfit one million images [13] and begin to demand hundreds of millions of—often publicly inaccessible—labeled images [16].
This appetite for data has been successfully addressed in natural language processing (NLP) by self-supervised pretraining. The solutions, based on autoregressive language modeling in GPT [40, 41, 4] and masked autoencoding in BERT [14], are conceptually simple: they remove a portion of the data and learn to predict the removed content. These methods now enable training of generalizable NLP models containing over one hundred billion parameters [4].
The idea of masked autoencoders, a form of more general denoising autoencoders [48], is natural and applicable in computer vision as well. Indeed, closely related research in vision [49, 39] preceded BERT. However, despite significant interest in this idea following the success of BERT, progress of autoencoding methods in vision lags behind NLP. We ask: what makes masked autoencoding different between vision and language? We attempt to answer this question from the following perspectives:
Background and significance:
Deep learning has witnessed an explosion of architectures of continuously growing capability and capacity. Aided by rapid gains in hardware, today's models can easily overfit one million images and begin to demand hundreds of millions of labeled images, which are often publicly inaccessible.
In natural language processing (NLP), this appetite for data has been successfully addressed by self-supervised pre-training. The solutions, based on autoregressive language modeling in GPT and masked autoencoding in BERT, are conceptually simple: remove a portion of the data and learn to predict the removed content. These methods now enable training of generalizable NLP models containing over one hundred billion parameters.
The idea of masked autoencoders, a form of the more general denoising autoencoder, is natural and applicable to computer vision as well. Indeed, closely related research in vision preceded BERT. However, despite significant interest in this idea following BERT's success, progress of autoencoding methods in vision lags behind NLP.
[Taken together, the first three paragraphs convey the following: processing data at very large scale matters; in NLP this has been achieved through self-supervision, by removing part of the data and learning to restore it; the masked autoencoder in this paper adopts the same strategy to build a self-supervised learner for computer vision.]
The question this raises:
What makes masked autoencoding different between vision and language? The paper attempts to answer this question from the following three perspectives (network architecture, information density, and the decoder).
(i) Until recently, architectures were different. In vision, convolutional networks [29] were dominant over the last decade [28]. Convolutions typically operate on regular grids and it is not straightforward to integrate ‘indicators’ such as mask tokens [14] or positional embeddings [47] into convolutional networks. This architectural gap, however, has been addressed with the introduction of Vision Transformers (ViT) [16] and should no longer present an obstacle.
(ii) Information density is different between language and vision. Languages are human-generated signals that are highly semantic and information-dense. When training a model to predict only a few missing words per sentence, this task appears to induce sophisticated language understanding. Images, on the contrary, are natural signals with heavy spatial redundancy; e.g., a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes. To overcome this difference and encourage learning useful features, we show that a simple strategy works well in computer vision: masking a very high portion of random patches. This strategy largely reduces redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics. To get a qualitative sense of our reconstruction task, see Figures 2 – 4.
(iii) The autoencoder’s decoder, which maps the latent representation back to the input, plays a different role between reconstructing text and images. In vision, the decoder reconstructs pixels, hence its output is of a lower semantic level than common recognition tasks. This is in contrast to language, where the decoder predicts missing words that contain rich semantic information. While in BERT the decoder can be trivial (an MLP) [14], we found that for images, the decoder design plays a key role in determining the semantic level of the learned latent representations.
(i) Until recently, the architectures were different. In vision, convolutional networks dominated over the last decade. Convolutions typically operate on regular grids, and it is not straightforward to integrate "indicators" such as mask tokens [14] or positional embeddings [47] into convolutional networks. This architectural gap, however, has been addressed by the introduction of Vision Transformers (ViT) [16] and should no longer be an obstacle.
(ii) Information density differs between language and vision. Language is a human-generated signal that is highly semantic and information-dense. When a model is trained to predict only a few missing words per sentence, the task appears to induce sophisticated language understanding. Images, by contrast, are natural signals with heavy spatial redundancy: a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes. To overcome this difference and encourage learning useful features, the paper shows that a simple strategy works well in computer vision: masking a very high portion of random patches. This largely reduces redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics. (For a qualitative sense of the reconstruction task, see Figures 2-4.)
(iii) The autoencoder's decoder, which maps the latent representation back to the input, plays a different role when reconstructing text versus images. In vision, the decoder reconstructs pixels, so its output is at a lower semantic level than common recognition tasks. This contrasts with language, where the decoder predicts missing words that carry rich semantic information. While in BERT the decoder can be trivial (an MLP), the paper finds that for images the decoder design plays a key role in determining the semantic level of the learned latent representations.
Driven by this analysis, we present a simple, effective, and scalable form of a masked autoencoder (MAE) for visual representation learning. Our MAE masks random patches from the input image and reconstructs the missing patches in the pixel space. It has an asymmetric encoder-decoder design. Our encoder operates only on the visible subset of patches (without mask tokens), and our decoder is lightweight and reconstructs the input from the latent representation along with mask tokens (Figure 1). Shifting the mask tokens to the small decoder in our asymmetric encoder-decoder results in a large reduction in computation. Under this design, a very high masking ratio (e.g., 75%) can achieve a win-win scenario: it optimizes accuracy while allowing the encoder to process only a small portion (e.g., 25%) of patches. This can reduce overall pre-training time by 3× or more and likewise reduce memory consumption, enabling us to easily scale our MAE to large models.
Driven by this analysis, the paper presents a simple, effective, and scalable form of masked autoencoder (MAE) for visual representation learning. The MAE masks random patches of the input image and reconstructs the missing patches in pixel space. It has an asymmetric encoder-decoder design: the encoder operates only on the visible subset of patches (without mask tokens), and a lightweight decoder reconstructs the input image from the latent representation together with the mask tokens (Figure 1). Shifting the mask tokens to the small decoder in this asymmetric design greatly reduces computation. Under this design, a very high masking ratio (e.g., 75%) achieves a win-win: it optimizes accuracy while letting the encoder process only a small portion (e.g., 25%) of the patches. This reduces overall pre-training time by 3x or more, likewise reduces memory consumption, and makes it easy to scale MAE to large models.
Our MAE learns very high-capacity models that generalize well. With MAE pre-training, we can train datahungry models like ViT-Large/-Huge [16] on ImageNet-1K with improved generalization performance. With a vanilla ViT-Huge model, we achieve 87.8% accuracy when finetuned on ImageNet-1K. This outperforms all previous results that use only ImageNet-1K data. We also evaluate transfer learning on object detection, instance segmentation, and semantic segmentation. In these tasks, our pre-training achieves better results than its supervised pre-training counterparts, and more importantly, we observe significant gains by scaling up models. These observations are aligned with those witnessed in self-supervised pre-training in NLP [14, 40, 41, 4] and we hope that they will enable our field to explore a similar trajectory.
MAE learns very high-capacity models that generalize well. With MAE pre-training, data-hungry models such as ViT-Large/-Huge can be trained on ImageNet-1K with improved generalization. A vanilla ViT-Huge model reaches 87.8% accuracy when fine-tuned on ImageNet-1K, outperforming all previous results that use only ImageNet-1K data.
The paper also evaluates transfer learning on object detection, instance segmentation, and semantic segmentation. On these tasks, MAE pre-training achieves better results than its supervised counterpart, and, more importantly, significant gains are observed when scaling up models. These observations are aligned with those seen in self-supervised pre-training in NLP, and the hope is that they will allow the vision community to explore a similar trajectory.
Masked language modeling and its autoregressive counterparts, e.g., BERT [14] and GPT [40, 41, 4], are highly successful methods for pre-training in NLP. These methods hold out a portion of the input sequence and train models to predict the missing content. These methods have been shown to scale excellently [4] and a large abundance of evidence indicates that these pre-trained representations generalize well to various downstream tasks.
Autoencoding is a classical method for learning representations. It has an encoder that maps an input to a latent representation and a decoder that reconstructs the input. For example, PCA and k-means are autoencoders [25]. Denoising autoencoders (DAE) [48] are a class of autoencoders that corrupt an input signal and learn to reconstruct the original, uncorrupted signal. A series of methods can be thought of as a generalized DAE under different corruptions, e.g., masking pixels [49, 39, 6] or removing color channels [59]. Our MAE is a form of denoising autoencoding, but different from the classical DAE in numerous ways.
Masked image encoding methods learn representations from images corrupted by masking. The pioneering work of [49] presents masking as a noise type in DAE. Context Encoder [39] inpaints large missing regions using convolutional networks. Motivated by the success in NLP, related recent methods [6, 16, 2] are based on Transformers [47]. iGPT [6] operates on sequences of pixels and predicts unknown pixels. The ViT paper [16] studies masked patch prediction for self-supervised learning. Most recently, BEiT [2] proposes to predict discrete tokens [37, 43].
Self-supervised learning approaches have seen significant interest in computer vision, often focusing on different pretext tasks for pre-training [15, 50, 35, 59, 38, 17]. Recently, contrastive learning [3, 21] has been popular, e.g., [51, 36, 22, 7], which models image similarity and dissimilarity (or only similarity [20, 8]) between two or more views. Contrastive and related methods strongly depend on data augmentation [7, 20, 8]. Autoencoding pursues a conceptually different direction, and it exhibits different behaviors as we will present.
Masked language modeling and its autoregressive counterparts, e.g., BERT [14] and GPT [40, 41, 4], are highly successful pre-training methods in NLP. These methods hold out a portion of the input sequence and train models to predict the missing content. They have been shown to scale excellently, and abundant evidence indicates that the pre-trained representations generalize well to various downstream tasks.
Autoencoding is a classical method for learning representations: an encoder maps an input to a latent representation and a decoder reconstructs the input. For example, PCA and k-means are autoencoders [25]. Denoising autoencoders (DAE) [48] are a class of autoencoders that corrupt an input signal and learn to reconstruct the original, uncorrupted signal. A series of methods can be viewed as generalized DAEs under different corruptions, e.g., masking pixels [49, 39, 6] or removing color channels [59]. MAE is a form of denoising autoencoding, but it differs from the classical DAE in numerous ways.
Masked image encoding methods learn representations from images corrupted by masking. The pioneering work of [49] presents masking as a noise type in DAE. Context Encoder [39] inpaints large missing regions using convolutional networks. Motivated by the success in NLP, recent related methods [6, 16, 2] are based on Transformers [47]. iGPT [6] operates on sequences of pixels and predicts unknown pixels. The ViT paper [16] studies masked patch prediction for self-supervised learning. Most recently, BEiT [2] proposes to predict discrete tokens [37, 43].
Self-supervised learning approaches have seen significant interest in computer vision, often focusing on different pretext tasks for pre-training [15, 50, 35, 59, 38, 17]. Recently, contrastive learning [3, 21] has been popular, e.g., [51, 36, 22, 7], which models image similarity and dissimilarity (or only similarity [20, 8]) between two or more views. Contrastive and related methods depend strongly on data augmentation [7, 20, 8]. Autoencoding pursues a conceptually different direction and, as the paper shows, exhibits different behaviors.
Our masked autoencoder (MAE) is a simple autoencoding approach that reconstructs the original signal given its partial observation. Like all autoencoders, our approach has an encoder that maps the observed signal to a latent representation, and a decoder that reconstructs the original signal from the latent representation. Unlike classical autoencoders, we adopt an asymmetric design that allows the encoder to operate only on the partial, observed signal (without mask tokens) and a lightweight decoder that reconstructs the full signal from the latent representation and mask tokens. Figure 1 illustrates the idea, introduced next.
The masked autoencoder (MAE) of this paper is a simple autoencoding approach that reconstructs the original signal given its partial observation. Like all autoencoders, it has an encoder that maps the observed signal to a latent representation and a decoder that reconstructs the original signal from the latent representation. Unlike classical autoencoders, MAE adopts an asymmetric design: the encoder operates only on the partial, observed signal (without mask tokens), and a lightweight decoder reconstructs the full signal from the latent representation and the mask tokens.
Masking
Following ViT [16], we divide an image into regular non-overlapping patches. Then we sample a subset of patches and mask (i.e., remove) the remaining ones. Our sampling strategy is straightforward: we sample random patches without replacement, following a uniform distribution. We simply refer to this as “random sampling”.
Random sampling with a high masking ratio (i.e., the ratio of removed patches) largely eliminates redundancy, thus creating a task that cannot be easily solved by extrapolation from visible neighboring patches (see Figures 2 – 4). The uniform distribution prevents a potential center bias (i.e., more masked patches near the image center). Finally, the highly sparse input creates an opportunity for designing an efficient encoder, introduced next.
As in ViT, the image is divided into regular non-overlapping patches. A subset of the patches is then sampled, and the remaining ones are masked (i.e., removed). The sampling strategy is straightforward: random patches are sampled without replacement, following a uniform distribution. This is simply referred to as "random sampling".
Random sampling with a high masking ratio (i.e., the ratio of removed patches) largely eliminates redundancy, creating a task that cannot easily be solved by extrapolating from visible neighboring patches (see Figures 2-4). The uniform distribution prevents a potential center bias (i.e., more masked patches near the image center). Finally, the highly sparse input creates an opportunity to design an efficient encoder, introduced next.
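To make random sampling concrete, here is a minimal sketch (not the authors' released code) of per-image uniform sampling without replacement via a noise-argsort trick; the tensor `x` is assumed to hold patch embeddings of shape [B, N, D], and `mask_ratio` is the fraction of patches to remove.

```python
import torch

def random_masking(x: torch.Tensor, mask_ratio: float = 0.75):
    """Uniformly sample patches to keep, without replacement.

    x: [B, N, D] patch embeddings. Returns the visible subset, a binary
    mask (1 = removed) in the original patch order, and the indices
    needed later to undo the shuffle.
    """
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=x.device)         # i.i.d. uniform noise per patch
    ids_shuffle = torch.argsort(noise, dim=1)         # a random permutation per image
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation

    ids_keep = ids_shuffle[:, :len_keep]              # first len_keep entries = kept patches
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=x.device)
    mask[:, :len_keep] = 0                            # 0 = keep, 1 = remove (shuffled order)
    mask = torch.gather(mask, 1, ids_restore)         # back to the original patch order
    return x_visible, mask, ids_restore
```

Sorting i.i.d. noise yields an unbiased random permutation, so every patch is equally likely to be masked, matching the uniform, center-bias-free sampling described above.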
MAE encoder
Our encoder is a ViT [16] but applied only on visible, unmasked patches. Just as in a standard ViT, our encoder embeds patches by a linear projection with added positional embeddings, and then processes the resulting set via a series of Transformer blocks. However, our encoder only operates on a small subset (e.g., 25%) of the full set. Masked patches are removed; no mask tokens are used. This allows us to train very large encoders with only a fraction of compute and memory. The full set is handled by a lightweight decoder, described next.
MAE encoder
The MAE encoder is a ViT [16] applied only to the visible, unmasked patches. As in a standard ViT, the encoder embeds patches by a linear projection with added positional embeddings and then processes the resulting set through a series of Transformer blocks. However, it operates only on a small subset of the full set (the unmasked part, typically 25% in this paper). Masked patches are removed; no mask tokens are used. This allows training very large encoders with only a fraction of the compute and memory. The full set is handled by the lightweight decoder.
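The following is a hedged sketch of an encoder that sees only the visible tokens, reusing the `random_masking` helper sketched above. The dimensions default to ViT-L-like values, the class token and other ViT details are omitted, and `nn.TransformerEncoder` stands in for the ViT blocks; none of this is the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MAEEncoder(nn.Module):
    """ViT-style encoder applied only to the visible (unmasked) patches."""

    def __init__(self, num_patches=196, patch_dim=768, dim=1024, depth=24, heads=16):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)               # linear projection of flattened patches
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))  # zero-init here for brevity
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patches, mask_ratio=0.75):
        x = self.patch_embed(patches) + self.pos_embed             # embed and position all patches
        x, mask, ids_restore = random_masking(x, mask_ratio)       # drop ~75% of the tokens
        x = self.norm(self.blocks(x))                              # blocks see only ~25% of the tokens
        return x, mask, ids_restore
```

Because the blocks run on roughly a quarter of the tokens, the encoder's cost per image drops accordingly, which is where most of the reported savings come from.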
MAE decoder
The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches, and (ii) mask tokens. See Figure 1. Each mask token [14] is a shared, learned vector that indicates the presence of a missing patch to be predicted. We add positional embeddings to all tokens in this full set; without this, mask tokens would have no information about their location in the image. The decoder has another series of Transformer blocks.
The MAE decoder is only used during pre-training to perform the image reconstruction task (only the encoder is used to produce image representations for recognition). Therefore, the decoder architecture can be flexibly designed in a manner that is independent of the encoder design. We experiment with very small decoders, narrower and shallower than the encoder. For example, our default decoder has <10% computation per token vs. the encoder. With this asymmetrical design, the full set of tokens are only processed by the lightweight decoder, which significantly reduces pre-training time.
MAE decoder
The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches and (ii) mask tokens; see Figure 1. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. Positional embeddings are added to all tokens in this full set; without them, the mask tokens would carry no information about their location in the image. The decoder has another series of Transformer blocks.
The MAE decoder is used only during pre-training to perform the image reconstruction task (only the encoder is used to produce image representations for recognition). The decoder architecture can therefore be designed flexibly, independently of the encoder. The authors experiment with very small decoders, narrower and shallower than the encoder; for example, the default decoder has <10% computation per token vs. the encoder. With this asymmetric design, the full set of tokens is processed only by the lightweight decoder, which significantly reduces pre-training time.
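A matching sketch of the lightweight decoder: a single shared, learnable mask token is appended for every removed patch, the full list is unshuffled back to image order with `ids_restore`, positional embeddings are added, and a narrow, shallow Transformer followed by a linear head predicts one pixel vector per patch. The 512-d width and 8-block depth mirror the paper's defaults; everything else is illustrative.

```python
import torch
import torch.nn as nn

class MAEDecoder(nn.Module):
    """Narrow, shallow decoder that reconstructs every patch in pixel space."""

    def __init__(self, num_patches=196, enc_dim=1024, dim=512, depth=8, heads=16, patch_dim=768):
        super().__init__()
        self.embed = nn.Linear(enc_dim, dim)                       # project encoder output to decoder width
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))     # shared, learned [M] vector
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, depth)
        self.head = nn.Linear(dim, patch_dim)                      # predicts the pixel values of one patch

    def forward(self, x_visible, ids_restore):
        x = self.embed(x_visible)                                  # [B, N_visible, dim]
        B, n_vis, dim = x.shape
        n_masked = ids_restore.shape[1] - n_vis
        mask_tokens = self.mask_token.expand(B, n_masked, -1)      # one mask token per removed patch
        x = torch.cat([x, mask_tokens], dim=1)                     # full-length list, still in shuffled order
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, dim))  # unshuffle to image order
        x = x + self.pos_embed                                     # mask tokens learn where they are
        return self.head(self.blocks(x))                           # [B, N, patch_dim] reconstructed pixels
```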
Reconstruction target
Our MAE reconstructs the input by predicting the pixel values for each masked patch. Each element in the decoder’s output is a vector of pixel values representing a patch. The last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch. The decoder’s output is reshaped to form a reconstructed image. Our loss function computes the mean squared error (MSE) between the reconstructed and original images in the pixel space. We compute the loss only on masked patches, similar to BERT [14].
We also study a variant whose reconstruction target is the normalized pixel values of each masked patch. Specifically, we compute the mean and standard deviation of all pixels in a patch and use them to normalize this patch. Using normalized pixels as the reconstruction target improves representation quality in our experiments.
Reconstruction target
MAE reconstructs the input by predicting the pixel values of each masked patch. Each element of the decoder's output is a vector of pixel values representing a patch. The last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch. The decoder's output is reshaped to form a reconstructed image. The loss function computes the mean squared error (MSE) between the reconstructed and original images in pixel space, and the loss is computed only on masked patches, similar to BERT [14].
The authors also study a variant whose reconstruction target is the normalized pixel values of each masked patch: the mean and standard deviation of all pixels in a patch are computed and used to normalize that patch. Using normalized pixels as the reconstruction target improves representation quality in the experiments.
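A small sketch of the loss, including the normalized-pixel variant: the target for each patch is optionally normalized by that patch's own mean and standard deviation, and the mean squared error is averaged over masked patches only. `pred` and `patches` are assumed to be [B, N, patch_dim] tensors and `mask` the [B, N] binary mask from the sampling step; `eps` is an assumed numerical constant.

```python
import torch

def mae_loss(pred, patches, mask, normalize_target=True, eps=1e-6):
    """MSE in pixel space, computed only on masked patches (mask: 1 = masked)."""
    target = patches
    if normalize_target:
        mean = target.mean(dim=-1, keepdim=True)      # per-patch mean
        var = target.var(dim=-1, keepdim=True)        # per-patch variance
        target = (target - mean) / (var + eps).sqrt()

    loss = ((pred - target) ** 2).mean(dim=-1)        # [B, N] per-patch MSE
    return (loss * mask).sum() / mask.sum()           # average over masked patches only
```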
Simple implementation
Our MAE pre-training can be implemented efficiently, and importantly, does not require any specialized sparse operations. First we generate a token for every input patch (by linear projection with an added positional embedding). Next we randomly shuffle the list of tokens and remove the last portion of the list, based on the masking ratio. This process produces a small subset of tokens for the encoder and is equivalent to sampling patches without replacement. After encoding, we append a list of mask tokens to the list of encoded patches, and unshuffle this full list (inverting the random shuffle operation) to align all tokens with their targets. The decoder is applied to this full list (with positional embeddings added). As noted, no sparse operations are needed. This simple implementation introduces negligible overhead as the shuffling and unshuffling operations are fast.
Simple implementation
MAE pre-training can be implemented efficiently and, importantly, requires no specialized sparse operations.
First, a token is generated for every input patch (by linear projection with an added positional embedding).
Next, the list of tokens is randomly shuffled, and the last portion of the list is removed according to the masking ratio. This produces a small subset of tokens for the encoder and is equivalent to sampling patches without replacement.
After encoding, a list of mask tokens is appended to the list of encoded patches, and the full list is unshuffled (inverting the random shuffle) so that all tokens are aligned with their targets.
The decoder is applied to this full list (with positional embeddings added).
As noted, no sparse operations are needed. This simple implementation introduces negligible overhead, since shuffling and unshuffling are fast.
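Putting the sketches above together, a hedged end-to-end example of one pre-training step; the `patchify` helper is a hypothetical utility for flattening 16x16 patches, not a library call. Every step is a dense concatenation or `gather`, so no sparse operations are needed.

```python
import torch

def patchify(imgs, patch_size=16):
    """[B, 3, H, W] -> [B, N, patch_size*patch_size*3] flattened patches."""
    B, C, H, W = imgs.shape
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch_size * patch_size * C)
    return x

encoder, decoder = MAEEncoder(), MAEDecoder()
imgs = torch.randn(8, 3, 224, 224)                        # a dummy batch
patches = patchify(imgs)                                  # [8, 196, 768]
latent, mask, ids_restore = encoder(patches, mask_ratio=0.75)
pred = decoder(latent, ids_restore)                       # [8, 196, 768] predicted pixels
loss = mae_loss(pred, patches, mask)                      # loss only on the 75% masked patches
loss.backward()
```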
We do self-supervised pre-training on the ImageNet-1K (IN1K) [13] training set. Then we do supervised training to evaluate the representations with (i) end-to-end fine-tuning or (ii) linear probing. We report top-1 validation accuracy of a single 224×224 crop. Details are in Appendix A.1.
Self-supervised pre-training is done on the ImageNet-1K (IN1K) training set. Supervised training is then used to evaluate the representations with (i) end-to-end fine-tuning or (ii) linear probing. Top-1 validation accuracy of a single 224×224 crop is reported. Details are in Appendix A.1 (only the table captions are reproduced below; see the original paper for the full descriptions).
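As a rough illustration of the two protocols, the snippet below sketches linear probing using the modules sketched earlier: the pre-trained encoder is frozen and only a linear classifier on pooled features is trained. The paper's actual recipes involve further details not shown here, and the names below are assumptions. End-to-end fine-tuning instead leaves the whole encoder trainable.

```python
import torch
import torch.nn as nn

encoder = MAEEncoder()                       # assume this holds MAE pre-trained weights
for p in encoder.parameters():
    p.requires_grad = False                  # linear probing: the backbone stays frozen

classifier = nn.Linear(1024, 1000)           # 1000 ImageNet-1K classes
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)

def linear_probe_logits(imgs):
    with torch.no_grad():
        feats, _, _ = encoder(patchify(imgs), mask_ratio=0.0)   # no masking at evaluation time
    return classifier(feats.mean(dim=1))     # average-pool the patch tokens, then classify
```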
Baseline: ViT-Large
We use ViT-Large (ViT-L/16) [16] as the backbone in our ablation study. ViT-L is very big (an order of magnitude bigger than ResNet-50 [24]) and tends to overfit. The following is a comparison between ViT-L trained from scratch vs. fine-tuned from our baseline MAE:
We note that it is nontrivial to train supervised ViT-L from scratch and a good recipe with strong regularization is needed (82.5%, see Appendix A.2). Even so, our MAE pretraining contributes a big improvement. Here fine-tuning is only for 50 epochs (vs. 200 from scratch), implying that the fine-tuning accuracy heavily depends on pre-training.
Baseline: ViT-Large
ViT-Large (ViT-L/16) [16] is used as the backbone of the ablation study. ViT-L is very big (an order of magnitude bigger than ResNet-50) and tends to overfit. The table in the paper compares ViT-L trained from scratch with ViT-L fine-tuned from the baseline MAE.
Note that it is nontrivial to train supervised ViT-L from scratch; a good recipe with strong regularization is needed (82.5%, see Appendix A.2 of the original paper). Even so, MAE pre-training contributes a big improvement. Fine-tuning here runs for only 50 epochs (vs. 200 from scratch), implying that the fine-tuning accuracy depends heavily on pre-training.
We ablate our MAE using the default settings in Table 1 (see caption). Several intriguing properties are observed.
Gray marks the paper's default settings in Table 1.
Masking ratio (not shown in Table 1)
Figure 5 shows the influence of the masking ratio. The optimal ratios are surprisingly high. The ratio of 75% is good for both linear probing and fine-tuning. This behavior is in contrast with BERT [14], whose typical masking ratio is 15%. Our masking ratios are also much higher than those in related works [6, 16, 2] in computer vision (20% to 50%).
The model infers missing patches to produce different, yet plausible, outputs (Figure 4). It makes sense of the gestalt of objects and scenes, which cannot be simply completed by extending lines or textures. We hypothesize that this reasoning-like behavior is linked to the learning of useful representations.
Figure 5 also shows that linear probing and fine-tuning results follow different trends. For linear probing, the accuracy increases steadily with the masking ratio until the sweet point: the accuracy gap is up to ∼20% (54.6% vs. 73.5%). For fine-tuning, the results are less sensitive to the ratios, and a wide range of masking ratios (40–80%) work well. All fine-tuning results in Figure 5 are better than training from scratch (82.5%).
The MAE is ablated using the default settings in Table 1 (see its caption).
Masking ratio: Figure 5 shows the influence of the masking ratio. The optimal ratios are surprisingly high: 75% works well for both linear probing and fine-tuning. This behavior contrasts with BERT, whose typical masking ratio is 15%, and the ratios are also much higher than those in related work in computer vision (20% to 50%).
The model infers missing patches to produce different yet plausible outputs (Figure 4). It makes sense of the gestalt of objects and scenes, which cannot be completed simply by extending lines or textures. The authors hypothesize that this reasoning-like behavior is linked to the learning of useful representations.
Figure 5 also shows that linear probing and fine-tuning follow different trends. For linear probing, accuracy increases steadily with the masking ratio until the sweet point: the gap reaches ~20% (54.6% vs. 73.5%). For fine-tuning, the results are less sensitive to the ratio, and a wide range of masking ratios (40-80%) works well. All fine-tuning results in Figure 5 are better than training from scratch (82.5%).
Decoder design
Our MAE decoder can be flexibly designed, as studied in Table 1a and 1b.
Table 1a varies the decoder depth (number of Transformer blocks). A sufficiently deep decoder is important for linear probing. This can be explained by the gap between a pixel reconstruction task and a recognition task: the last several layers in an autoencoder are more specialized for reconstruction, but are less relevant for recognition. A reasonably deep decoder can account for the reconstruction specialization, leaving the latent representations at a more abstract level. This design can yield up to 8% improvement in linear probing (Table 1a, ‘lin’). However, if fine-tuning is used, the last layers of the encoder can be tuned to adapt to the recognition task. The decoder depth is less influential for improving fine-tuning (Table 1a, ‘ft’).
Interestingly, our MAE with a single-block decoder can perform strongly with fine-tuning (84.8%). Note that a single Transformer block is the minimal requirement to propagate information from visible tokens to mask tokens. Such a small decoder can further speed up training.
In Table 1b we study the decoder width (number of channels). We use 512-d by default, which performs well under fine-tuning and linear probing. A narrower decoder also works well with fine-tuning.
Overall, our default MAE decoder is lightweight. It has 8 blocks and a width of 512-d ( gray in Table 1). It only has 9% FLOPs per token vs. ViT-L (24 blocks, 1024-d). As such, while the decoder processes all tokens, it is still a small fraction of the overall compute.
Decoder design:
The MAE decoder ablations, shown in Tables 1a and 1b, vary its depth and width.
Table 1a varies the decoder depth (number of Transformer blocks). A sufficiently deep decoder is important for linear probing (Table 1a, 'lin'). This can be explained by the gap between a pixel reconstruction task and a recognition task: the last several layers of an autoencoder are more specialized for reconstruction and less relevant for recognition. A reasonably deep decoder can absorb this reconstruction specialization, leaving the latent representations at a more abstract level; this design yields up to an 8% improvement in linear probing. However, if fine-tuning is used (Table 1a, 'ft'), the last layers of the encoder can be tuned to adapt to the recognition task, and the decoder depth is less influential.
Interestingly, the MAE with a single-block decoder performs strongly with fine-tuning (84.8%). A single Transformer block is the minimal requirement to propagate information from visible tokens to mask tokens, and such a small decoder can further speed up training.
Table 1b studies the decoder width (number of channels). 512-d is used by default, which performs well under both fine-tuning and linear probing. A narrower decoder also works well with fine-tuning.
Overall, the default MAE decoder is lightweight: 8 blocks and a width of 512-d (gray in Table 1). It has only 9% FLOPs per token vs. ViT-L (24 blocks, 1024-d). So while the decoder processes all tokens, it remains a small fraction of the overall compute.
Mask token
An important design of our MAE is to skip the mask token [M] in the encoder and apply it later in the lightweight decoder. Table 1c studies this design.
If the encoder uses mask tokens, it performs worse: its accuracy drops by 14% in linear probing. In this case, there is a gap between pre-training and deploying: this encoder has a large portion of mask tokens in its input in pretraining, which does not exist in uncorrupted images. This gap may degrade accuracy in deployment. By removing the mask token from the encoder, we constrain the encoder to always see real patches and thus improve accuracy.
Moreover, by skipping the mask token in the encoder, we greatly reduce training computation. In Table 1c, we reduce the overall training FLOPs by 3.3×. This leads to a 2.8× wall-clock speedup in our implementation (see Table 2). The wall-clock speedup is even bigger (3.5–4.1×), for a smaller decoder (1-block), a larger encoder (ViT-H), or both. Note that the speedup can be >4× for a masking ratio of 75%, partially because the self-attention complexity is quadratic. In addition, memory is greatly reduced, which can enable training even larger models or speeding up more by large-batch training. The time and memory efficiency makes our MAE favorable for training very large models.
Mask token
An important design of MAE is to skip the mask token [M] in the encoder and apply it later in the lightweight decoder. Table 1c studies this design.
If the encoder uses mask tokens, it performs worse: its accuracy drops by 14% in linear probing. In this case there is a gap between pre-training and deployment: in pre-training the encoder's input contains a large portion of mask tokens, which do not exist in uncorrupted images. This gap may degrade accuracy at deployment. Removing the mask tokens from the encoder constrains it to always see real patches, which improves accuracy.
Moreover, skipping the mask token in the encoder greatly reduces training computation. In Table 1c, the overall training FLOPs are reduced by 3.3x, which leads to a 2.8x wall-clock speedup in the MAE implementation (see Table 2). With a smaller decoder (1 block), a larger encoder (ViT-H), or both, the wall-clock speedup is even bigger (3.5-4.1x).
Note that for a masking ratio of 75% the speedup can be >4x, partly because the self-attention complexity is quadratic.
In addition, memory usage is greatly reduced, which enables training even larger models or speeding up training further with large batches.
This time and memory efficiency makes MAE favorable for training very large models.
Reconstruction target
We compare different reconstruction targets in Table 1d. Our results thus far are based on pixels without (per-patch) normalization. Using pixels with normalization improves accuracy. This per-patch normalization enhances the contrast locally. In another variant, we perform PCA in the patch space and use the largest PCA coefficients (96 here) as the target. Doing so degrades accuracy. Both experiments suggest that the high-frequency components are useful in our method.
We also compare an MAE variant that predicts tokens, the target used in BEiT [2]. Specifically for this variant, we use the DALLE pre-trained dVAE [43] as the tokenizer, following [2]. Here the MAE decoder predicts the token indices using cross-entropy loss. This tokenization improves fine-tuning accuracy by 0.4% vs. unnormalized pixels, but has no advantage vs. normalized pixels. It also reduces linear probing accuracy. In §5 we further show that tokenization is not necessary in transfer learning.
Our pixel-based MAE is much simpler than tokenization. The dVAE tokenizer requires one more pre-training stage, which may depend on extra data (250M images [43]). The dVAE encoder is a large convolutional network (40% FLOPs of ViT-L) and adds nontrivial overhead. Using pixels does not suffer from these problems.
Reconstruction target
Table 1d compares different reconstruction targets. The results so far are based on pixels without (per-patch) normalization.
Using pixels with normalization improves accuracy; this per-patch normalization enhances contrast locally.
In another variant, PCA is performed in the patch space and the largest PCA coefficients (96 here) are used as the target. Doing so degrades accuracy.
Both experiments suggest that high-frequency components are useful in this method.
The paper also compares an MAE variant that predicts tokens, the target used in BEiT. Specifically, this variant uses the DALLE pre-trained dVAE as the tokenizer, and the MAE decoder predicts the token indices with a cross-entropy loss. This tokenization improves fine-tuning accuracy by 0.4% over unnormalized pixels but has no advantage over normalized pixels; it also reduces linear probing accuracy. Section 5 further shows that tokenization is not necessary in transfer learning.
The pixel-based MAE is much simpler than tokenization. The dVAE tokenizer requires an extra pre-training stage, which may depend on extra data (250M images [43]). The dVAE encoder is a large convolutional network (40% of ViT-L's FLOPs) and adds nontrivial overhead. Using pixels does not suffer from these problems.
Data augmentation
Table 1e studies the influence of data augmentation on our MAE pre-training.
Our MAE works well using cropping-only augmentation, either fixed-size or random-size (both having random horizontal flipping). Adding color jittering degrades the results and so we do not use it in other experiments.
Surprisingly, our MAE behaves decently even if using no data augmentation (only center-crop, no flipping). This property is dramatically different from contrastive learning and related methods [51, 22, 7, 20], which heavily rely on data augmentation. It was observed [20] that using cropping-only augmentation reduces the accuracy by 13% and 28% respectively for BYOL [20] and SimCLR [7]. In addition, there is no evidence that contrastive learning can work without augmentation: the two views of an image are the same and can easily satisfy a trivial solution.
In MAE, the role of data augmentation is mainly performed by random masking (ablated next). The masks are different for each iteration and so they generate new training samples regardless of data augmentation. The pretext task is made difficult by masking and requires less augmentation to regularize training.
Data augmentation
Table 1e studies the influence of data augmentation on MAE pre-training.
MAE works well with cropping-only augmentation, either fixed-size or random-size (both with random horizontal flipping). Adding color jittering degrades the results, so it is not used in the other experiments.
Surprisingly, MAE behaves decently even with no data augmentation (only center crop, no flipping). This property is dramatically different from contrastive learning and related methods, which rely heavily on data augmentation. It was observed [20] that cropping-only augmentation reduces accuracy by 13% and 28% for BYOL [20] and SimCLR [7], respectively. Moreover, there is no evidence that contrastive learning can work without augmentation: the two views of an image would be identical and could easily satisfy a trivial solution.
In MAE, the role of data augmentation is mainly played by random masking (ablated next). The masks differ at each iteration, so they generate new training samples regardless of data augmentation. Masking makes the pretext task difficult, and less augmentation is needed to regularize training.
Mask sampling strategy
In Table 1f we compare different mask sampling strategies, illustrated in Figure 6.
The block-wise masking strategy, proposed in [2], tends to remove large blocks (Figure 6 middle). Our MAE with block-wise masking works reasonably well at a ratio of 50%, but degrades at a ratio of 75%. This task is harder than that of random sampling, as a higher training loss is observed. The reconstruction is also blurrier.
We also study grid-wise sampling, which regularly keeps one of every four patches (Figure 6 right). This is an easier task and has lower training loss. The reconstruction is sharper. However, the representation quality is lower.
Simple random sampling works the best for our MAE. It allows for a higher masking ratio, which provides a greater speedup benefit while also enjoying good accuracy.
Mask sampling strategy
Table 1f compares different mask sampling strategies, illustrated in Figure 6.
1. Block-wise masking, proposed in [2], tends to remove large blocks (Figure 6, middle). MAE with block-wise masking works reasonably well at a ratio of 50% but degrades at 75%. This task is harder than random sampling: the training loss is higher and the reconstructions are blurrier.
2. Grid-wise sampling regularly keeps one of every four patches (Figure 6, right). This is an easier task with lower training loss and sharper reconstructions, but the representation quality is lower.
3. Simple random sampling works best for MAE. It allows a higher masking ratio, which brings a larger speedup benefit while also enjoying good accuracy.
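For contrast with random sampling, here is a hedged sketch of the grid-wise strategy (keep one of every four patches on a regular 2x2 grid); `grid_h` and `grid_w` (the patch-grid size) are assumed values for a 224x224 image with 16x16 patches.

```python
import torch

def grid_masking(x, grid_h=14, grid_w=14):
    """Keep the top-left patch of every 2x2 cell (25% kept); mask the rest.

    x: [B, N, D] with N == grid_h * grid_w. Returns the visible tokens and a
    binary mask (1 = removed) in the original patch order.
    """
    B, N, D = x.shape
    rows = torch.arange(grid_h, device=x.device).repeat_interleave(grid_w)  # row index per patch
    cols = torch.arange(grid_w, device=x.device).repeat(grid_h)             # column index per patch
    keep = (rows % 2 == 0) & (cols % 2 == 0)                                # deterministic 25% subset

    x_visible = x[:, keep, :]
    mask = (~keep).float().unsqueeze(0).expand(B, -1)
    return x_visible, mask
```

Because every masked patch has a visible neighbor, the task is easier than 75% random masking, which is consistent with the lower training loss and weaker representations reported above.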
Training schedule
Our ablations thus far are based on 800-epoch pre-training. Figure 7 shows the influence of the training schedule length. The accuracy improves steadily with longer training. Indeed, we have not observed saturation of linear probing accuracy even at 1600 epochs. This behavior is unlike contrastive learning methods, e.g., MoCo v3 [9] saturates at 300 epochs for ViT-L. Note that the MAE encoder only sees 25% of patches per epoch, while in contrastive learning the encoder sees 200% (two-crop) or even more (multi-crop) patches per epoch.
Training schedule
The ablations so far are based on 800-epoch pre-training. Figure 7 shows the influence of the training schedule length. Accuracy improves steadily with longer training; indeed, saturation of linear probing accuracy is not observed even at 1600 epochs. This behavior differs from contrastive learning methods, e.g., MoCo v3 saturates at 300 epochs for ViT-L. Note that the MAE encoder sees only 25% of the patches per epoch, while in contrastive learning the encoder sees 200% (two-crop) or even more (multi-crop) patches per epoch.
Comparisons with self-supervised methods
In Table 3 we compare the fine-tuning results of self-supervised ViT models. For ViT-B, all methods perform closely. For ViT-L, the gaps among methods are bigger, suggesting that a challenge for bigger models is to reduce overfitting.
Our MAE can scale up easily and has shown steady improvement from bigger models. We obtain 86.9% accuracy using ViT-H (224 size). By fine-tuning with a 448 size, we achieve 87.8% accuracy, using only IN1K data. The previous best accuracy, among all methods using only IN1K data, is 87.1% (512 size) [56], based on advanced networks. We improve over the state-of-the-art by a nontrivial margin in the highly competitive benchmark of IN1K (no external data). Our result is based on vanilla ViT, and we expect advanced networks will perform better.
Comparing with BEiT [2], our MAE is more accurate while being simpler and faster. Our method reconstructs pixels, in contrast to BEiT that predicts tokens: BEiT reported a 1.8% degradation [2] when reconstructing pixels with ViT-B. We do not need dVAE pre-training. Moreover, our MAE is considerably faster (3.5× per epoch) than BEiT, for the reason as studied in Table 1c.
The MAE models in Table 3 are pre-trained for 1600 epochs for better accuracy (Figure 7). Even so, our total pre-training time is less than all other methods if they were trained in the same hardware. For example, with ViT-L, our MAE’s training time is 31 hours for 1600 epochs and MoCo v3’s is 36 hours for 300 epochs [9], using the same 128 TPU-v3 cores.
Comparisons with self-supervised methods
Table 3 compares the fine-tuning results of self-supervised ViT models. For ViT-B, all methods perform closely. For ViT-L, the gaps among methods are bigger, suggesting that a challenge for bigger models is to reduce overfitting.
MAE scales up easily and shows steady improvement from bigger models. With ViT-H (224 size) it obtains 86.9% accuracy, and by fine-tuning at 448 size it reaches 87.8% accuracy using only IN1K data.
Among all methods using only IN1K data, the previous best accuracy is 87.1% (512 size) [56], based on advanced networks. MAE improves over the state of the art by a nontrivial margin on the highly competitive IN1K benchmark (no external data). The result is based on vanilla ViT, and more advanced networks are expected to perform better.
Compared with BEiT [2], MAE is more accurate while being simpler and faster. MAE reconstructs pixels, whereas BEiT predicts tokens; BEiT reported a 1.8% degradation when reconstructing pixels with ViT-B. MAE does not need dVAE pre-training.
Moreover, MAE is considerably faster than BEiT (3.5x per epoch), for the reason studied in Table 1c.
The MAE models in Table 3 are pre-trained for 1600 epochs for better accuracy (Figure 7). Even so, the total pre-training time is less than that of all other methods trained on the same hardware. For example, on the same 128 TPU-v3 cores (a hardware budget that makes one's heart skip a beat T~T), MAE's ViT-L training takes 31 hours for 1600 epochs, while MoCo v3 takes 36 hours for 300 epochs.
Comparisons with supervised pre-training
In the original ViT paper [16], ViT-L degrades when trained in IN1K. See Figure 8. Our improved supervised recipe works better for training from scratch (Figure 8, “our impl.”; see A.2), but the accuracy is saturated.
Our MAE pre-training, using only IN1K, can generalize better: the gain over training from scratch is bigger for higher-capacity models. It follows a trend similar to the JFT-300M supervised pre-training in [16]. This comparison shows that our MAE can help scale up model sizes.
Comparisons with supervised pre-training
In the original ViT paper, ViT-L degrades when trained on IN1K; see Figure 8. The improved supervised recipe works better for training from scratch (Figure 8, "our impl."; see A.2), but the accuracy saturates.
MAE pre-training, using only IN1K, generalizes better: the gain over training from scratch is bigger for higher-capacity models. It follows a trend similar to the JFT-300M supervised pre-training in [16]. This comparison shows that MAE can help scale up model sizes.
Table 1 shows that linear probing and fine-tuning results are largely uncorrelated. Linear probing has been a popular protocol in the past few years; however, it misses the opportunity of pursuing strong but non-linear features—which is indeed a strength of deep learning. As a middle ground, we study a partial fine-tuning protocol: fine-tune the last several layers while freezing the others. This protocol was also used in early works, e.g., [54, 59, 35].
Figure 9 shows the results. Notably, fine-tuning only one Transformer block boosts the accuracy significantly from 73.5% to 81.0%. Moreover, if we fine-tune only “half” of the last block (i.e., its MLP sub-block), we can get 79.1%, much better than linear probing. This variant is essentially fine-tuning an MLP head. Fine-tuning a few blocks (e.g., 4 or 6) can achieve decent accuracy, which is still a small fine-tuning head compared with the frozen backbone.
In Figure 9 we also compare with MoCo v3 [9], which is a contrastive method with ViT-L results available. It has higher linear probing accuracy than our MAE. However, all of its partial fine-tuning results are worse than ours. The gap is 2.6% when tuning 4 blocks.
These results show that the MAE representations are less linearly separable, but they are stronger non-linear features and perform well when a non-linear head is tuned. These observations suggest that linear separability is not the sole metric for evaluating representation quality. It has also been observed (e.g., [8]) that linear probing is not well correlated with transfer learning performance, e.g., for object detection. To our knowledge, linear evaluation is not often used in NLP for benchmarking pre-training.
Table 1 shows that linear probing and fine-tuning results are largely uncorrelated.
Linear probing has been a popular protocol in the past few years; however, it misses the opportunity to pursue strong but non-linear features, which is precisely a strength of deep learning. As a middle ground, the paper studies a partial fine-tuning protocol: fine-tune the last several layers while freezing the others. This protocol was also used in earlier work, e.g., [54, 59, 35].
Figure 9 shows the results. Notably, fine-tuning only one Transformer block boosts accuracy significantly, from 73.5% to 81.0%.
Moreover, fine-tuning only "half" of the last block (i.e., its MLP sub-block) gives 79.1%, much better than linear probing. This variant is essentially fine-tuning an MLP head.
Fine-tuning a few blocks (e.g., 4 or 6) achieves decent accuracy while still being a small fine-tuning head compared with the frozen backbone.
Figure 9 also compares with MoCo v3 [9], a contrastive method with available ViT-L results. It has higher linear probing accuracy than MAE, but all of its partial fine-tuning results are worse; the gap is 2.6% when tuning 4 blocks.
These results show that MAE representations are less linearly separable, but they are stronger non-linear features and perform well when a non-linear head is tuned.
These observations suggest that linear separability is not the sole metric for evaluating representation quality.
Also, linear probing is not well correlated with transfer learning performance, e.g., for object detection.
To the authors' knowledge, linear evaluation is not often used in NLP for benchmarking pre-training.
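A hedged sketch of partial fine-tuning against the encoder sketched earlier: freeze everything, then unfreeze only the last few Transformer blocks (stored in `encoder.blocks.layers` for `nn.TransformerEncoder`) together with the classification head. The optimizer choice and learning rate are assumptions.

```python
import torch

def partial_finetune(encoder, classifier, num_trainable_blocks=4):
    """Fine-tune the last `num_trainable_blocks` blocks; keep the rest frozen."""
    for p in encoder.parameters():
        p.requires_grad = False                          # freeze the whole backbone first
    if num_trainable_blocks > 0:
        for block in encoder.blocks.layers[-num_trainable_blocks:]:
            for p in block.parameters():
                p.requires_grad = True                   # unfreeze only the last few blocks

    trainable = [p for p in encoder.parameters() if p.requires_grad]
    trainable += list(classifier.parameters())           # the head is always trained
    return torch.optim.AdamW(trainable, lr=1e-4)
```

Setting `num_trainable_blocks=0` recovers linear probing, and fine-tuning only the MLP sub-block of the last layer corresponds to the "half block" point discussed above.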
We evaluate transfer learning in object detection and segmentation on COCO [32] and semantic segmentation on ADE20K [60]. We use the pre-trained models in Table 3.
Transfer learning is evaluated for object detection and segmentation on COCO [32] and semantic segmentation on ADE20K [60], using the pre-trained models in Table 3.
Object detection and segmentation
We fine-tune Mask R-CNN [23] end-to-end on COCO. The ViT backbone is adapted for use with FPN [31] (see Appendix A.3). We apply this object detection system to all entries in Table 4. We report box AP for object detection and mask AP for instance segmentation.
Compared to supervised pre-training, our MAE performs better under all configurations (Table 4). With the smaller ViT-B, our MAE is 2.4 points higher than supervised pretraining (50.3 vs. 47.9, APbox). More significantly, with the larger ViT-L, our MAE pre-training outperforms supervised pre-training by 4.0 points (53.3 vs. 49.3).
The pixel-based MAE is better than or on par with the token-based BEiT, while MAE is much simpler and faster. Both MAE and BEiT are better than MoCo v3 and MoCo v3 is on par with supervised pre-training.
Object detection and segmentation
Mask R-CNN [23] is fine-tuned end-to-end on COCO. The ViT backbone is adapted for use with FPN [31] (see Appendix A.3). This object detection system is applied to all entries in Table 4. Box AP is reported for object detection and mask AP for instance segmentation.
Compared to supervised pre-training, MAE performs better under all configurations (Table 4). With the smaller ViT-B, MAE is 2.4 points higher than supervised pre-training (50.3 vs. 47.9, APbox). More significantly, with the larger ViT-L, MAE pre-training outperforms supervised pre-training by 4.0 points (53.3 vs. 49.3).
The pixel-based MAE is better than or on par with the token-based BEiT, while being much simpler and faster. Both MAE and BEiT are better than MoCo v3, and MoCo v3 is on par with supervised pre-training.
Semantic segmentation
Our experiments on ADE20K use UperNet [52] following the code in [2]. Details are in A.4. Table 5 shows that our MAE significantly improves the transferring results of ViT-L, which is 3.7 points better than the supervised pre-training counterpart (53.6 vs. 49.9). The pixel-based MAE outperforms the token-based BEiT. These observations are consistent with those in COCO.
Semantic segmentation
The ADE20K experiments use UperNet [52], following the code of [2]. Details are in A.4. Table 5 shows that MAE significantly improves the transfer results of ViT-L, by 3.7 points over the supervised pre-training counterpart (53.6 vs. 49.9). The pixel-based MAE outperforms the token-based BEiT. These observations are consistent with those on COCO.
Pixels vs. tokens
Table 6 presents an all-around comparison on pixels vs. tokens as the MAE reconstruction target. While using dVAE tokens is better than using unnormalized pixels, it is statistically similar to just using normalized pixels across all tasks and models we studied. It again shows that tokenization is not necessary for our MAE.
Pixels vs. tokens
Table 6 presents an all-around comparison of pixels vs. tokens as the MAE reconstruction target. While using dVAE tokens is better than using unnormalized pixels, it is statistically similar to simply using normalized pixels across all tasks and models studied. This again shows that tokenization is not necessary for MAE.
Simple algorithms that scale well are the core of deep learning. In NLP, simple self-supervised learning methods (e.g., [40, 14, 41, 4]) enable benefits from exponentially scaling models. In computer vision, practical pre-training paradigms are dominantly supervised (e.g., [28, 44, 24, 16]) despite progress in self-supervised learning. In this study, we observe on ImageNet and in transfer learning that an autoencoder (a simple self-supervised method similar to techniques in NLP) provides scalable benefits. Self-supervised learning in vision may now be embarking on a similar trajectory as in NLP.
On the other hand, we note that images and languages are signals of a different nature and this difference must be addressed carefully. Images are merely recorded light without a semantic decomposition into the visual analogue of words. Instead of attempting to remove objects, we remove random patches that most likely do not form a semantic segment. Likewise, our MAE reconstructs pixels, which are not semantic entities. Nevertheless, we observe (e.g., Figure 4) that our MAE infers complex, holistic reconstructions, suggesting it has learned numerous visual concepts, i.e., semantics. We hypothesize that this behavior occurs by way of a rich hidden representation inside the MAE. We hope this perspective will inspire future work.
Simple algorithms that scale well are the core of deep learning. In NLP, simple self-supervised learning methods (e.g., [40, 14, 41, 4]) benefit from exponentially scaling models. In computer vision, despite progress in self-supervised learning, practical pre-training paradigms are dominantly supervised (e.g., [28, 44, 24, 16]). In this study, on ImageNet and in transfer learning, an autoencoder (a simple self-supervised method similar to techniques in NLP) provides scalable benefits. Self-supervised learning in vision may now be embarking on a trajectory similar to NLP's.
On the other hand, images and languages are signals of a different nature, and this difference must be addressed carefully. Images are merely recorded light without a semantic decomposition into the visual analogue of words. Instead of attempting to remove objects, MAE removes random patches that most likely do not form a semantic segment. Likewise, MAE reconstructs pixels, which are not semantic entities. Nevertheless, it is observed (e.g., Figure 4) that MAE infers complex, holistic reconstructions, suggesting it has learned numerous visual concepts, i.e., semantics. The authors hypothesize that this behavior arises via a rich hidden representation inside the MAE and hope this perspective will inspire future work.
Broader impacts. The proposed method predicts content based on learned statistics of the training dataset and as such will reflect biases in those data, including ones with negative societal impacts. The model may generate inexistent content. These issues warrant further research and consideration when building upon this work to generate images.
The proposed method predicts content based on learned statistics of the training dataset and will therefore reflect biases in those data, including ones with negative societal impacts. The model may generate nonexistent content. These issues warrant further research and consideration when building on this work to generate images.
This article only walks through the original paper and does not attempt a deeper interpretation. For more in-depth discussion, see the community commentary on Zhihu: 如何看待何恺明最新一作论文Masked Autoencoders? - 知乎