We present Momentum Contrast (MoCo) for unsupervised visual representation learning.
From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder.
This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning.
MoCo provides competitive results under the common linear protocol on ImageNet classification.
More importantly, the representations learned by MoCo transfer well to downstream tasks.
MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins.
This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
(In other words, unsupervised representation learning has become a much more serious option.)
Roughly speaking, this part just introduces MoCo:
Unsupervised representation learning is highly successful in natural language processing, e.g., as shown by GPT [50, 51] and BERT [12].
But supervised pre-training is still dominant in computer vision, where unsupervised methods generally lag behind.
The reason may stem from differences in their respective signal spaces.
Language tasks have discrete signal spaces (words, sub-word units, etc.) for building tokenized dictionaries, on which unsupervised learning can be based.
Computer vision, in contrast, further concerns dictionary building [54, 9, 5], as the raw signal is in a continuous, high-dimensional space and is not structured for human communication (e.g., unlike words).
Several recent studies [61, 46, 36, 66, 35, 56, 2] present promising results on unsupervised visual representation learning using approaches related to the contrastive loss [29].
Though driven by various motivations, these methods can be thought of as building dynamic dictionaries.
The “keys” (tokens) in the dictionary are sampled from data (e.g., images or patches) and are represented by an encoder network.
Unsupervised learning trains encoders to perform dictionary look-up: an encoded “query” should be similar to its matching key and dissimilar to others.
Learning is formulated as minimizing a contrastive loss [29].
From this perspective, we hypothesize that it is desirable to build dictionaries that are: (i) large and (ii) consistent as they evolve during training.
Intuitively, a larger dictionary may better sample the underlying continuous, high-dimensional visual space, while the keys in the dictionary should be represented by the same or similar encoder so that their comparisons to the query are consistent.
However, existing methods that use contrastive losses can be limited in one of these two aspects (discussed later in context).
We present Momentum Contrast (MoCo) as a way of building large and consistent dictionaries for unsupervised learning with a contrastive loss (Figure 1).
We maintain the dictionary as a queue of data samples: the encoded representations of the current mini-batch are enqueued, and the oldest are dequeued.
The queue decouples the dictionary size from the mini-batch size, allowing it to be large.
Moreover, as the dictionary keys come from the preceding several mini-batches, a slowly progressing key encoder, implemented as a momentum-based moving average of the query encoder, is proposed to maintain consistency.
MoCo is a mechanism for building dynamic dictionaries for contrastive learning, and can be used with various pretext tasks.
If you are not familiar with pretext tasks, see: What is a pretext task?
In this paper, we follow a simple instance discrimination task [61, 63, 2]: a query matches a key if they are encoded views (e.g., different crops) of the same image.
Using this pretext task, MoCo shows competitive results under the common protocol of linear classification in the ImageNet dataset [11].
(Roughly: not the current state of the art, but still a very solid result.)
A main purpose of unsupervised learning is to pre-train representations (i.e., features) that can be transferred to downstream tasks by fine-tuning.
We show that in 7 downstream tasks related to detection or segmentation, MoCo unsupervised pre-training can surpass its ImageNet supervised counterpart, in some cases by nontrivial margins.
In these experiments, we explore MoCo pre-trained on ImageNet or on a one-billion Instagram image set, demonstrating that MoCo can work well in a more real-world, billion-image scale, and relatively uncurated scenario.
These results show that MoCo largely closes the gap between unsupervised and supervised representation learning in many computer vision tasks, and can serve as an alternative to ImageNet supervised pre-training in several applications.
Unsupervised/self-supervised learning methods generally involve two aspects: pretext tasks and loss functions.
The term “pretext” implies that the task being solved is not of genuine interest, but is solved only for the true purpose of learning a good data representation.
(Roughly: the pretext task is not the task we actually care about, but solving it improves performance on the task we do care about.)
Loss functions can often be investigated independently of pretext tasks.
MoCo focuses on the loss function aspect.
Next we discuss related studies with respect to these two aspects.
Loss functions. A common way of defining a loss function is to measure the difference between a model’s prediction and a fixed target, such as reconstructing the input pixels (e.g., auto-encoders) by L1 or L2 losses, or classifying the input into pre-defined categories (e.g., eight positions [13],color bins [64]) by cross-entropy or margin-based losses.
Other alternatives, as described next, are also possible.
Contrastive losses [29] measure the similarities of sample pairs in a representation space.
Instead of matching an input to a fixed target, in contrastive loss formulations the target can vary on-the-fly during training and can be defined in terms of the data representation computed by a network[29].
Contrastive learning is at the core of several recent works on unsupervised learning [61, 46, 36, 66, 35, 56, 2],which we elaborate on later in context (Sec. 3.1).
Adversarial losses [24] measure the difference between probability distributions.
It is a widely successful technique for unsupervised data generation.
Adversarial methods for representation learning are explored in [15, 16].
There are relations (see [24]) between generative adversarial networks and noise-contrastive estimation (NCE) [28].
(All this prior work is a bit of a detour; let's get straight to how the authors' MoCo works.)
Contrastive learning [29], and its recent developments, can be thought of as training an encoder for a dictionary look-up task, as described next.
Consider an encoded query q and a set of encoded samples {k0, k1, k2, …} that are the keys of a dictionary.
Assume that there is a single key (denoted as k+) in the dictionary that q matches.
(Note one thing here: this paper in fact assumes that each query q matches exactly one key.)
A contrastive loss [29] is a function whose value is low when q is similar to its positive key k+ and dissimilar to all other keys (considered negative keys for q).
With similarity measured by dot product, a form of a contrastive loss function, called InfoNCE [46], is considered in this paper:
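Written out from the surrounding description (dot-product similarity, one positive key k+ and K negative keys, temperature τ), the InfoNCE loss of Eqn. (1) is:

$$ \mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)} \tag{1} $$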
where τ is a temperature hyper-parameter per [61].
The sum is over one positive and K negative samples.
Intuitively, this loss is the log loss of a (K+1)-way softmax-based classifier that tries to classify q as k+.
Contrastive loss functions can also be based on other forms [29, 59, 61, 36], such as margin-based losses and variants of NCE losses.
To build some intuition for this loss: it is based on dot products, so the more similar two representations are, the larger their dot product. The more similar q is to k+, the closer the softmax probability assigned to k+ gets to 1; conversely, if q is also similar to the other (negative) keys, their dot products grow and that probability drifts away from 1. The closer that probability is to 1, the smaller the loss, so the more correct the look-up, the lower the loss. In the end, isn't this just a cross-entropy loss?
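To make the cross-entropy connection concrete, here is a minimal sketch (not the authors' released code) of the InfoNCE loss computed with PyTorch's cross_entropy; it assumes the queries and keys are already L2-normalized and the negatives are stored as a C×K matrix:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, k_neg, tau=0.07):
    """InfoNCE viewed as a (K+1)-way softmax classification problem.

    q:     (N, C) L2-normalized query features
    k_pos: (N, C) L2-normalized positive keys (another view of the same images)
    k_neg: (C, K) L2-normalized negative keys (the queue)
    """
    # positive logits: one dot product per query -> (N, 1)
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)
    # negative logits: each query against all K negatives -> (N, K)
    l_neg = torch.einsum("nc,ck->nk", q, k_neg)
    # (1 + K) "classes" per query; class 0 is the positive key
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```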
The contrastive loss serves as an unsupervised objective function for training the encoder networks that represent the queries and keys [29].
(My personal understanding: this is what ties the queries and keys to each other.)
In general, the query representation is q = fq(xq) where fq is an encoder network and xq is a query sample (likewise, k = fk(xk)).
Their instantiations depend on the specific pretext task.
The input xq and xk can be images [29, 61, 63], patches [46], or context consisting of a set of patches [46].
The networks fq and fk can be identical [29, 59, 63], partially shared [46, 36, 2], or different [56].
This part is mainly about how the paper designs its contrastive loss:
1. When designing a loss function, first be clear that the loss is there to judge how well the model does its job, so it has to match what the model is asked to do. Here the model's job is a dictionary look-up: given a query, retrieve the matching key.
2. So this is essentially like a classification task, and the loss the authors use is accordingly close to a cross-entropy loss.
3. The difference is that here it measures how good the two encoders are.
From the above perspective, contrastive learning is a way of building a discrete dictionary on high-dimensional continuous inputs such as images.
The dictionary is dynamic in the sense that the keys are randomly sampled, and that the key encoder evolves during training.
Our hypothesis is that good features can be learned by a large dictionary that covers a rich set of negative samples, while the encoder for the dictionary keys is kept as consistent as possible despite its evolution.
(This sentence is really two halves: the first half corresponds to using a queue to build a large dictionary, and the second half roughly corresponds to why the key encoder is updated with momentum. Of course, another reason for the momentum update is that with such a large dictionary, back-propagating through it would be infeasible; I suspect that constraint is what led the authors to this design.)
Based on this motivation, we present Momentum Contrast as described next.
Dictionary as a queue.
At the core of our approach is maintaining the dictionary as a queue of data samples.
This allows us to reuse the encoded keys from the immediate preceding mini-batches.
The introduction of a queue decouples the dictionary size from the mini-batch size.
Our dictionary size can be much larger than a typical mini-batch size, and can be flexibly and independently set as a hyper-parameter.
My take on this paragraph: the queue supplies the keys that the query is compared against; together with the single positive key, similarities are computed over all of them, and the goal is for the query to be similar to that one positive key and dissimilar to all the negatives in the queue.
The samples in the dictionary are progressively replaced.
The current mini-batch is enqueued to the dictionary, and the oldest mini-batch in the queue is removed.
The dictionary always represents a sampled subset of all data, while the extra computation of maintaining this dictionary is manageable.
(The earlier keys in the dictionary were computed in previous iterations and simply kept around, so maintaining it costs little.)
Moreover, removing the oldest mini-batch can be beneficial, because its encoded keys are the most outdated and thus the least consistent with the newest ones.
(To understand this: the older keys in the dictionary were computed by the key encoder as it was at that time, not by the current, updated key encoder.)
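A minimal sketch of how such a queue could be maintained (illustrative names, assuming the queue is stored as a C×K tensor and K is divisible by the batch size):

```python
import torch
import torch.nn.functional as F

class KeyQueue:
    def __init__(self, feat_dim=128, queue_size=65536):
        # the dictionary: K keys, each an L2-normalized feat_dim-D vector (one column per key)
        self.queue = F.normalize(torch.randn(feat_dim, queue_size), dim=0)
        self.ptr = 0            # position of the oldest keys
        self.K = queue_size

    @torch.no_grad()
    def dequeue_and_enqueue(self, keys):
        # keys: (N, C) encoded keys of the current mini-batch
        n = keys.shape[0]
        assert self.K % n == 0  # for simplicity of the sketch
        # overwrite the oldest n keys: enqueue the new ones, dequeue the old ones
        self.queue[:, self.ptr:self.ptr + n] = keys.T
        self.ptr = (self.ptr + n) % self.K
```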
Momentum update.
Using a queue can make the dictionary large, but it also makes it intractable to update the key encoder by back-propagation (the gradient should propagate to all samples in the queue).
(Neither the memory nor the compute could handle it; I went through this in detail in the previous post.)
A naive solution is to copy the key encoder fk from the query encoder fq, ignoring this gradient. But this solution yields poor results in experiments (Sec. 4.1).
(That is, ignore the gradient here and simply copy the query encoder over to serve as the key encoder.)
We hypothesize that such failure is caused by the rapidly changing encoder that reduces the key representations’ consistency.
(The authors' hypothesis is that the key encoder needs a certain stability for the keys to be represented well.)
We propose a momentum update to address this issue.
Formally, denoting the parameters of fk as θk and those of fq as θq, we update θk by:
θk ← m θk + (1 − m) θq.   (2)
Here m ∈ [0, 1) is a momentum coefficient. Only the parameters θq are updated by back-propagation.
The momentum update in Eqn.(2) makes θk evolve more smoothly than θq.
(Because the momentum dilutes each update.)
As a result, though the keys in the queue are encoded by different encoders (in different mini-batches), the difference among these encoders can be made small.
(Because the encoder changes so little, keys computed by its earlier versions remain usable later on.)
In experiments, a relatively large momentum (e.g., m = 0.999, our default) works much better than a smaller value (e.g., m = 0.9), suggesting that a slowly evolving key encoder is a core to making use of a queue.
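In code, the momentum update of Eqn. (2) is just an exponential moving average over the encoder parameters; a minimal sketch (illustrative, with the paper's default m = 0.999):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q; no gradient ever flows into f_k
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```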
Summary of this part (it introduces how MoCo differs from previous mechanisms; the details follow):
Relations to previous mechanisms.
MoCo is a general mechanism for using contrastive losses.
We compare it with two existing general mechanisms in Figure 2.
They exhibit different properties on the dictionary size and consistency.
The end-to-end update by back-propagation is a natural mechanism (e.g., [29, 46, 36, 63, 2, 35], Figure 2a).
It uses samples in the current mini-batch as the dictionary, so the keys are consistently encoded (by the same set of encoder parameters).
(That is, all keys in a batch are encoded by the same encoder state, the one obtained after the previous batch's update, hence the consistency.)
But the dictionary size is coupled with the mini-batch size, limited by the GPU memory size.
(As discussed in the previous post, a larger batch size quickly strains GPU memory.)
It is also challenged by large mini-batch optimization [25].
(The previous post also noted that if large-batch optimization goes off track, the impact can be large.)
Some recent methods [46, 36, 2] are based on pretext tasks driven by local positions, where the dictionary size can be made larger by multiple positions.
(I did not fully follow this point.)
But these pretext tasks may require special network designs such as patchifying the input [46] or customizing the receptive field size [2], which may complicate the transfer of these networks to downstream tasks.
Another mechanism is the memory bank approach proposed by [61] (Figure 2b).
A memory bank consists of the representations of all samples in the dataset.
The dictionary for each mini-batch is randomly sampled from the memory bank with no back-propagation, so it can support a large dictionary size.
(Note: in this setup there is effectively no separate key encoder anymore.)
However, the representation of a sample in the memory bank was updated when it was last seen, so the sampled keys are essentially about the encoders at multiple different steps all over the past epoch and thus are less consistent.
(I have not read that paper, so I am not sure exactly how the updates work, but as this paper describes it, a key is updated whenever its sample is visited, so the bank of representations is not very consistent.)
A momentum update is adopted on the memory bank in [61]. Its momentum update is on the representations of the same sample, not the encoder.
This momentum update is irrelevant to our method, because MoCo does not keep track of every sample.
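To make the contrast concrete, here is a rough sketch (illustrative only, not the actual code of [61]) of a memory-bank style update: the momentum is applied to the stored representation of each sample, not to the encoder weights:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_memory_bank(memory_bank, indices, features, m=0.5):
    # memory_bank: (N_dataset, C) one stored representation per training sample
    # indices:     (B,) dataset indices of the samples in the current mini-batch
    # features:    (B, C) freshly encoded representations of those samples
    old = memory_bank[indices]
    new = m * old + (1.0 - m) * features     # momentum on the representations themselves
    memory_bank[indices] = F.normalize(new, dim=1)
```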
Moreover, our method is more memory-efficient and can be trained on billion-scale data, which can be intractable for a memory bank.
Contrastive learning can drive a variety of pretext tasks.
As the focus of this paper is not on designing a new pretext task, we use a simple one mainly following the instance discrimination task in [61], to which some recent works [63, 2] are related.
Following [61], we consider a query and a key as a positive pair if they originate from the same image, and otherwise as a negative sample pair.
Following [63, 2], we take two random “views” of the same image under random data augmentation to form a positive pair.
The queries and keys are respectively encoded by their encoders, fq and fk.
The encoder can be any convolutional neural network [39].
Algorithm 1 provides the pseudo-code of MoCo for this pretext task.
If the pseudo-code is hard to follow, see: MoCo Algorithm 1 explained.
For the current mini-batch, we encode the queries and their corresponding keys, which form the positive sample pairs.
The negative samples are from the queue.
(This becomes clear once you have worked through Algorithm 1.)
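Putting the pieces together, a simplified sketch of one training step in the spirit of Algorithm 1 (it reuses the illustrative helpers sketched earlier in this post and assumes everything lives on the same device; shuffling BN and the multi-GPU details are omitted):

```python
import torch
import torch.nn.functional as F

def moco_train_step(x_q, x_k, encoder_q, encoder_k, key_queue, optimizer,
                    m=0.999, tau=0.07):
    # x_q, x_k: two randomly augmented views of the same mini-batch of images
    q = F.normalize(encoder_q(x_q), dim=1)       # queries: gradients flow through f_q
    with torch.no_grad():
        k = F.normalize(encoder_k(x_k), dim=1)   # positive keys: no gradient through f_k

    # positives come from the current batch, negatives from the queue
    loss = info_nce_loss(q, k, key_queue.queue, tau)

    optimizer.zero_grad()
    loss.backward()                              # back-propagation updates f_q only
    optimizer.step()

    momentum_update(encoder_q, encoder_k, m)     # slowly drag f_k toward f_q
    key_queue.dequeue_and_enqueue(k)             # refresh the dictionary
    return loss.item()
```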
We adopt a ResNet [33] as the encoder, whose last fully-connected layer (after global average pooling) has a fixed-dimensional output (128-D [61]).
This output vector is normalized by its L2-norm [61].
This is the representation of the query or key.
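As an illustration (not the released code), such an encoder can be built from a standard torchvision ResNet-50 by replacing its final fully-connected layer with a 128-D head and L2-normalizing the output:

```python
import torch
import torch.nn.functional as F
import torchvision

def build_encoder(feat_dim=128):
    resnet = torchvision.models.resnet50()                        # randomly initialized
    resnet.fc = torch.nn.Linear(resnet.fc.in_features, feat_dim)  # 128-D output head
    return resnet

encoder_q = build_encoder()
x = torch.randn(2, 3, 224, 224)       # a toy batch of crops
q = F.normalize(encoder_q(x), dim=1)  # L2-normalized query/key representation
```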
The temperature τ in Eqn.(1) is set as 0.07 [61].
(This is the initial temperature setting for the temperature-scaled cross-entropy mentioned earlier.)
The data augmentation setting follows [61]: a 224×224-pixel crop is taken from a randomly resized image, and then undergoes random color jittering, random horizontal flip, and random grayscale conversion, all available in PyTorch’s torchvision package.
(That is, the augmentations are just the standard ones from torchvision; nothing out of the ordinary.)
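For reference, those augmentations map roughly onto standard torchvision transforms as below (the parameter values here are illustrative, not necessarily the exact ones used in the paper):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),           # 224x224 crop from a randomly resized image
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.4),  # random color jittering
    transforms.RandomHorizontalFlip(),           # random horizontal flip
    transforms.RandomGrayscale(p=0.2),           # random grayscale conversion
    transforms.ToTensor(),
])
```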
Our encoders fq and fk both have Batch Normalization (BN) [37] as in the standard ResNet [33].
In experiments, we found that using BN prevents the model from learning good representations, as similarly reported in [35] (which avoids using BN).
The model appears to “cheat” the pretext task and easily finds a low-loss solution.
This is possibly because the intra-batch communication among samples (caused by BN) leaks information.
We resolve this problem by shuffling BN.
We train with multiple GPUs and perform BN on the samples independently for each GPU (as done in common practice).
(That is, BN is performed independently on the samples on each GPU, with no cross-GPU interaction; this is just the usual setup.)
For the key encoder fk, we shuffle the sample order in the current mini-batch before distributing it among GPUs (and shuffle back after encoding); the sample order of the mini-batch for the query encoder fq is not altered.
(In other words, the sample order is shuffled only when passing samples through the key encoder.)
This ensures the batch statistics used to compute a query and its positive key come from two different subsets.
This effectively tackles the cheating issue and allows training to benefit from BN.
(The "cheating" here roughly means that, through BN, a sample can pick up information about the other samples in the batch. In an ordinary network this is not a big issue, but here it is: in contrastive learning the other samples act as negatives, so it is as if test content leaked into the training set.)
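Here is a single-process sketch of the idea (illustrative only: in the real multi-GPU implementation each GPU computes BN statistics on its own shard, which is what makes the shuffle matter; the chunks below merely stand in for GPUs):

```python
import torch

@torch.no_grad()
def encode_keys_with_shuffled_bn(encoder_k, x_k, num_shards=4):
    n = x_k.shape[0]
    idx_shuffle = torch.randperm(n)     # shuffle the sample order before f_k
    idx_unshuffle = torch.argsort(idx_shuffle)

    shuffled = x_k[idx_shuffle]
    # each shard stands in for one GPU: BN statistics are computed per shard,
    # so a query and its positive key no longer share batch statistics
    keys = torch.cat([encoder_k(chunk) for chunk in shuffled.chunk(num_shards)], dim=0)

    return keys[idx_unshuffle]          # restore the original order afterwards
```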
We use shuffled BN in both our method and its end-to-end ablation counterpart (Figure 2a).
(An ablation study disables one part of the method at a time, mainly to check which parts actually matter; it is essentially the controlled-variable approach.)
It is irrelevant to the memory bank counterpart (Figure 2b), which does not suffer from this issue because the positive keys are from different mini-batches in the past.
The experiments section will be filled in later.
We study unsupervised training performed in:
ImageNet-1M (IN-1M): This is the ImageNet [11] training set that has ∼1.28 million images in 1000 classes (often called ImageNet-1K; we count the image number instead, as classes are not exploited by unsupervised learning).
This dataset is well-balanced in its class distribution, and its images generally contain iconic views of objects.
Instagram-1B (IG-1B): Following [44], this is a dataset of ∼1 billion (940M) public images from Instagram.
The images are from ∼1500 hashtags [44] that are related to the ImageNet categories.
This dataset is relatively uncurated compared with IN-1M, and has a long-tailed, unbalanced distribution of real-world data. This dataset contains both iconic objects and scene-level images.
To sum up, the core innovation is: use a queue to store keys from previous batches, which yields a large set of negatives.
The momentum update and shuffle BN are innovations as well, but they both follow from that initial idea.