Reposted from a WeChat public account
原文链接: https://mp.weixin.qq.com/s?__biz=Mzg4MjgxMjgyMg==&mid=2247486049&idx=1&sn=1d98375dcbb9d0d68e8733f2dd0a2d40&chksm=cf51b898f826318ead24e414144235cfd516af4abb71190aeca42b1082bd606df6973eb963f0#rd
Video: https://www.youtube.com/watch?v=7l6fttRJzeU
Slides: https://nips.cc/media/neurips-2021/Slides/21895.pdf
Self-Supervised Learning
– Self-Prediction and Contrastive Learning
What is self-supervised learning and why do we need it?
the idea of constructing supervised learning tasks out of unsupervised datasets
Why?
✅ Data labeling is expensive, so high-quality labeled datasets are limited
✅ Learning good representations makes it easier to transfer useful information to a variety of downstream tasks ⇒ e.g. few-shot learning / zero-shot transfer to new tasks
Self-supervised learning tasks are also known as pretext tasks
Video Colorization (Vondrick et al. 2018)
Zero-shot CLIP (Radford et al. 2021)
Precursors to recent self-supervised approaches
Some ideas:
Restricted Boltzmann Machines
Autoencoders
Word2Vec
Autoregressive Modeling
Siamese networks
Multiple Instance / Metric Learning
Autoregressive model:
The autoregressive model has also been the basis for many self-supervised methods, such as GPT
Many contrastive self-supervised learning methods use a pair of neural networks and learn from their differences
– this idea can be traced back to Siamese networks
If you believe that a network f can encode x well and produce a good representation f(x),
then, for two different inputs x1 and x2, their distance can be defined as d(x1, x2) = L(f(x1), f(x2))
The idea of running two identical CNNs on two different inputs and then comparing them is exactly a Siamese network
Train by:
✅ If $x_i$ and $x_j$ are the same person, $\|f(x_i) - f(x_j)\|$ is small
✅ If $x_i$ and $x_j$ are different people, $\|f(x_i) - f(x_j)\|$ is large
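A minimal sketch of this shared-encoder comparison (not the original papers' code; the architecture, sizes, and names below are illustrative placeholders):

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """One shared encoder f applied to two inputs, compared by distance."""
    def __init__(self, in_dim=784, emb_dim=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, x1, x2):
        # the SAME weights encode both inputs
        return self.f(x1), self.f(x2)

encoder = SiameseEncoder()
x1, x2 = torch.randn(8, 784), torch.randn(8, 784)
z1, z2 = encoder(x1, x2)
# d(x1, x2) = ||f(x1) - f(x2)||: small for matching pairs, large otherwise
distance = (z1 - z2).norm(dim=1)
```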
Precursors of the recent contrastive learning techniques: multiple instance learning and metric learning
They deviate from the typical framework of empirical risk minimization
Early work:
Metric learning:
Contrastive loss:
Triplet loss
N-pair loss:
The part to be predicted pretends to be missing
Relationship: can be based on the inherent logic within the data
✅ such as different camera views of the same scene
✅ or create multiple augmented versions of the same sample
The multiple samples can be selected from the dataset based on some known logic (e.g., the order of words / sentences), or fabricated by altering the original version
i.e., we know the true relationship between samples but pretend not to know it
Self-prediction constructs prediction tasks within each individual data sample
Categories:
The autoregressive model predicts future behavior based on past behavior
Examples :
Mask a random portion of the information and pretend it is missing, irrespective of the natural sequence
e.g.,
Examples :
Some transformations (e.g., segmentation, rotation) of a data sample should maintain the original information or follow the desired innate logic
Examples
Order of image patches
✅ e.g., shuffle the patches
✅ e.g., relative position, jigsaw puzzle
Image rotation
Counting features across patches
Hybrid Self-Prediction Models: combine different types of generative modeling
Goal:
Contrastive learning can be applied to both supervised and unsupervised settings
Category
Inter-sample classification
the most dominant approach
✅ “inter-sample”: to emphasize the distinction from “intra-sample”
Feature clustering
Multiview coding
Given both similar (“positive”) and dissimilar (“negative”) candidates, identifying which ones are similar to the anchor data point is a classification task
How to construct a set of data point candidates:
Common loss functions:
Contrastive loss (Chopra et al. 2005)
Works with labeled datasets
Encodes data into an embedding vector
Given two labeled data pairs $(x_i, y_i)$ and $(x_j, y_j)$:
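For reference, one common form of this pairwise contrastive loss, with margin hyperparameter $\epsilon$:

$$
\mathcal{L}_{\text{cont}}(x_i, x_j, \theta) = \mathbb{1}[y_i = y_j]\,\|f_\theta(x_i) - f_\theta(x_j)\|_2^2 + \mathbb{1}[y_i \neq y_j]\,\max\!\big(0,\; \epsilon - \|f_\theta(x_i) - f_\theta(x_j)\|_2\big)^2
$$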
Triplet loss (Schroff et al. 2015)
Given a triplet input $(x, x^{+}, x^{-})$
Triplet loss: so called because it demands an input triplet containing one anchor, one positive, and one negative
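One standard form of the triplet loss, again with margin $\epsilon$:

$$
\mathcal{L}_{\text{triplet}}(x, x^{+}, x^{-}) = \max\!\big(0,\; \|f(x) - f(x^{+})\|_2^2 - \|f(x) - f(x^{-})\|_2^2 + \epsilon\big)
$$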
Lifted structured loss (Song et al. 2015):
For large-scale training, the batch size is often very large
Noise Contrastive Estimation (NCE; Gutmann & Hyvärinen 2010)
Given target sample distribution p and noise distribution q:
proposed in 2010 for estimating unnormalized statistical models, and later widely used for learning word embeddings
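For reference (notation assumed here, not taken from the slides): NCE trains a logistic classifier to tell target samples $x_i \sim p$ from noise samples $\tilde{x}_i \sim q$, using the logit $\ell_\theta(u) = \log p_\theta(u) - \log q(u)$ and the sigmoid $\sigma$:

$$
\mathcal{L}_{\text{NCE}} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ \log \sigma\big(\ell_\theta(x_i)\big) + \log\big(1 - \sigma(\ell_\theta(\tilde{x}_i))\big) \Big]
$$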
InfoNCE (van den Oord et al. 2018)
Given a context vector c, the positive sample should be drawn from the conditional distribution $p(x|c)$
The probability of detecting the positive sample correctly is:
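In the standard formulation (van den Oord et al. 2018), with one positive drawn from $p(x|c)$, $N-1$ negatives drawn from the proposal $p(x)$, and a scoring function $f(x, c) \propto p(x|c)/p(x)$:

$$
p(\text{pos} \mid X, c) = \frac{f(x_{\text{pos}}, c)}{\sum_{j=1}^{N} f(x_j, c)}, \qquad
\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log \frac{f(x_{\text{pos}}, c)}{\sum_{j=1}^{N} f(x_j, c)}\right]
$$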
Find similar data samples by clustering them with learned features
Core idea: use clustering algorithms to assign pseudo labels to samples so that we can then run inter-sample classification
Examples:
Apply the InfoNCE objective to two or more different views of input data
Became a mainstream contrastive learning method
Auto-Encoding Variational Bayes (Kingma & Welling 2014)
Image generation:
Jointly train an encoder in addition to the usual GAN components
GAN inversion: learning an encoder post hoc and/or optimizing the latent code for a given image
Denoising autoencoder (Vincent et al. 2008)
Context autoencoder (Pathak et al. 2016)
The prediction target can be not only the pixel values themselves, but also any subset of information derived from the image
Image Colorization
Split-brain autoencoder
In order to get representations that transfer well to downstream tasks
- Minimizing the loss function is equivalent to maximizing a lower bound on the mutual information between the predicted context $c_t$ and the future patch $x_{t+k}$
- i.e., making the latent representation of the predicted data as accurate as possible
CPC has been highly influential in contrastive learning
- showing the effectiveness of casting the problem as an inter-sample classification task
Each instance is a distinct class of its own
# classes = # training samples
Non-parametric softmax that compares features
Memory bank for storing representations of past samples $V = \{v_i\}$
The model learns to scatter the feature vectors on the hypersphere while mapping visually similar images into closer regions
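A rough sketch of a non-parametric softmax over a memory bank in this spirit (sizes, temperature, and the moving-average update are illustrative assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

num_samples, dim, tau = 10000, 128, 0.07
memory_bank = F.normalize(torch.randn(num_samples, dim), dim=1)  # V = {v_i}

def instance_discrimination_loss(features, indices):
    """features: batch embeddings f(x); indices: the instance id of each sample."""
    features = F.normalize(features, dim=1)
    # non-parametric softmax: compare against every stored representation v_i
    logits = features @ memory_bank.t() / tau
    loss = F.cross_entropy(logits, indices)  # each instance is its own class
    # refresh the memory bank entries for this batch (simple moving average here)
    with torch.no_grad():
        memory_bank[indices] = F.normalize(
            0.5 * memory_bank[indices] + 0.5 * features.detach(), dim=1
        )
    return loss

loss = instance_discrimination_loss(
    torch.randn(32, dim), torch.randint(0, num_samples, (32,))
)
```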
A natural question: are there better ways to create multi-view images? ↓
MoCo (Momentum Contrast; He et al. 2019)
MoCo v3:
Contrastive learning loss
f() – base encoder
g() – projection head layer
In-batch negative samples
✅ Use large batches to have a sufficient number of negative samples
fully symmetric;
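A minimal sketch of an in-batch-negative contrastive loss of this kind (NT-Xent style), assuming z1 and z2 are the projection-head outputs g(f(x)) for two augmented views of the same batch; note the loss is symmetric in the two views:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """z1, z2: (n, d) projections of two augmented views of the same n samples."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                    # 2n x d
    sim = z @ z.t() / tau                             # cosine similarities / temperature
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-similarity
    # positives sit n rows apart; every other sample in the batch is a negative,
    # which is why large batches supply plenty of negatives
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

# example usage with random features
loss = nt_xent_loss(torch.randn(16, 128), torch.randn(16, 128))
```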
Barlow Twins (Zbontar et al. 2021)
Learn similar representations for different augmented views of the same sample, with no contrastive component involving negative samples
The objective pushes the cross-correlation matrix between the embeddings of the two views toward the identity matrix
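A sketch of that objective under assumed variable names; the weight `lambda_` on the off-diagonal (redundancy-reduction) term is illustrative:

```python
import torch

def barlow_twins_loss(z1, z2, lambda_=5e-3):
    """z1, z2: (n, d) embeddings of two augmented views of the same batch."""
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)   # standardize each feature over the batch
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.t() @ z2 / n                  # d x d cross-correlation matrix
    # push diagonal entries to 1 (invariance) and off-diagonal entries to 0
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambda_ * off_diag

loss = barlow_twins_loss(torch.randn(32, 64), torch.randn(32, 64))
```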
Bootstrap Your Own Latent (BYOL; Grill et al. 2020)
SimSiam (Chen & He 2020)
BatchNorm seems to be playing an important role
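A rough sketch of the BYOL/SimSiam-style non-contrastive objective with stop-gradient; `f` (encoder) and `h` (predictor head) are placeholder modules, not the papers' exact architectures:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def simsiam_loss(f, h, x1, x2):
    """Negative cosine similarity with stop-gradient on the target branch."""
    z1, z2 = f(x1), f(x2)          # encoder outputs for the two views
    p1, p2 = h(z1), h(z2)          # predictor outputs
    def d(p, z):
        # .detach() implements the stop-gradient that helps prevent collapse
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)

# toy example with linear stand-ins for the encoder and predictor
f = nn.Linear(128, 64)
h = nn.Linear(64, 64)
loss = simsiam_loss(f, h, torch.randn(8, 128), torch.randn(8, 128))
```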
Another major technique for self-supervised learning:
- to learn from clusters of features
Sinkhorn-Knopp: a clustering algorithm based on optimal transport (OT)
In this approach, novel ideas based on clustering are designed to be used in conjunction with other SSL methods
Supervised Contrastive Loss (SupCon; Khosla et al. 2021)
✅ less sensitive to hyperparameter choices
Tracking object movement in time
Temporal order Verification
Predict the arrow of time, forward or backward
Tracking emerges by colorizing videos (Vondrick et al. 2018)
Used for video segmentation or human pose estimation without fine-tuning
✅ because the model can propagate the colored markings from the labeled input frame directly into its predictions
TCN (Sermanet et al. 2017)
Multi-frame TCN (Dwibedi et al. 2019)
Because video files are huge, generating coherent continuations of video has been a difficult task
Predicting videos with VQ-VAE (Walker et al. 2021)
VideoGPT: Video generation using VQ-VAE and Transformers (Yan et al. 2021)
Jukebox (Dhariwal et al. 2020)
ASR: Automatic speech recognition
Wav2Vec 2.0 (Baevski et al. 2020)
applies a contrastive loss on the representations of masked portions of the audio
✅ to learn discrete tokens from them
Speech recognition models trained on these tokens show better performance than those trained on conventional audio features / raw audio
HuBERT (Hsu et al. 2021, FAIR)
Also employed by SpeechStew (Chan et al. 2021), Big SSL (Zhang et al. 2021)
Applied to multimodal data, although the definition of self-supervised learning gets somewhat blurry here, depending on whether you consider a multimodal dataset as a single unlabeled dataset or treat one modality as providing supervision for another
MIL-NCE (Miech et al. 2020)
CLIP (Radford et al. 2021), ALIGN (Jia et al. 2021)
Pretrained language models:
Some examples that have changed the landscape of NLP research quite a lot:
GPT
✅ Autoregressive;
✅ predict the next token based on the previous tokens
BERT
✅ as a bi-directional transformer model
✅ Masked language modeling (MLM)
✅ Next sentence prediction (NSP) ⇒ a binary classifier for telling whether one sentence is the next sentence of the other
ALBERT
✅ Sentence order prediction (SOP) ⇒ Positive sample: a pair of two consecutive segments from the same document; Negative sample: same as above but with the segment order switched
ELECTRA
✅ Replaced token detection (RTD) ⇒ random tokens are replaced and considered corrupted; in parallel, a binary discriminator is trained together with the generative model to predict whether each token has been replaced
Skip-thought vectors (Kiros et al. 2015)
Quick-thought vectors (Logeswaran & Lee, 2018)
IS-BERT (“Info-Sentence BERT”; Zhang et al. 2020)
SimCSE (“Simple Contrastive learning of Sentence Embeddings”; Gao et al. 2021)
- Most of the models for learning sentence embeddings rely on supervised NLI (Natural Language Inference) datasets, such as SBERT (Reimers & Gurevych 2019) and BERT-flow
- Unsupervised sentence embedding models (e.g., unsupervised SimCSE) still have a performance gap compared with the supervised versions (e.g., supervised SimCSE)
contrastive learning can provide good results in terms of transfer performance
Data augmentation setup is critical for learning good embeddings
Methods:
image augmentation; text augmentation
Basic Image Augmentation:
Augmentation Strategies
Image mixture
Mixup (Zhang et al. 2018): weighted pixel-wise combination of two images
✅ to create new samples based on existing ones (a code sketch follows this list)
Cutmix (Yun et al. 2019): mix a local region of one image into the other
MoCHi (Mixing of Contrastive Hard Negatives): mixture of hard negative samples
✅ explicitly maintains a queue of negative samples sorted by similarity to the query in descending order ⇒ the first few samples in the queue are the hardest negatives ⇒ new hard negatives can then be created by mixing samples from this queue together, or even with the query
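A minimal sketch of Mixup-style image mixing as referenced above; the Beta-distribution parameter `alpha` and the tensor shapes are illustrative:

```python
import torch

def mixup(x_a, x_b, alpha=0.2):
    """Pixel-wise convex combination of two images (or two batches of images)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x_a + (1 - lam) * x_b, lam

mixed, lam = mixup(torch.rand(3, 32, 32), torch.rand(3, 32, 32))
```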
Lexical Edits
(Just changing the words or tokens)
EDA (Easy Data Augmentation; Wei & Zhou 2019): Synonym replacement, random insertion / swap / deletion
Contextual Augmentation (Kobayashi 2018): word substitution by BERT prediction
✅ try to find the replacement words using a bi-directional language model
Back-translation (Sennrich et al. 2015)
augments a sentence by first translating it to another language and then translating it back to the original language
✅ depends on the translation model ⇒ the meaning should stay largely unchanged
CERT (Fang et al. 2020) generates augmented sentences via back-translation
Dropout and Cutoff
SimCSE uses dropout (Gao et al. 2021)
✅ Dropout: a universal way to apply transformations to any input
✅ SimCSE: uses dropout to create different copies of the same text ⇒ universal because it does not need expert knowledge about the attributes of the input modality (the change happens at the architecture level)
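A sketch of the SimCSE idea (the `encoder` here is any module with dropout, e.g. a BERT-style encoder; the names and toy module are placeholders, not the paper's code):

```python
import torch
import torch.nn as nn

def simcse_views(encoder, inputs):
    """Two stochastic forward passes of the SAME input produce a positive pair."""
    encoder.train()        # keep dropout active
    z1 = encoder(inputs)   # first pass, one dropout mask
    z2 = encoder(inputs)   # second pass, a different dropout mask
    return z1, z2          # feed into an InfoNCE-style loss as (anchor, positive)

# toy example: an MLP with dropout stands in for a sentence encoder
encoder = nn.Sequential(nn.Linear(128, 128), nn.Dropout(0.1), nn.ReLU(), nn.Linear(128, 64))
z1, z2 = simcse_views(encoder, torch.randn(4, 128))
```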
Cutoff augmentation for text (Shen et al. 2020)
✅ masking randomly selected tokens, feature columns, or spans
Needs a large batch size
Why does contrastive learning work?
InfoNCE (van den Oord et al. 2018)
Minimizing InfoNCE leads to maximizing the MI between view1 and view2
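The bound behind this statement (van den Oord et al. 2018), with $N$ the number of samples scored against each anchor (one positive plus $N-1$ negatives):

$$
I(v_1; v_2) \ge \log N - \mathcal{L}_{\text{InfoNCE}}
$$

so minimizing the InfoNCE loss maximizes a lower bound on the mutual information between the two views.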
Q: How can we design good views?
Minimal sufficient encoder depends on downstream tasks (Tian et al. 2020)
Composite loss for finding the sweet spot (Tsai et al. 2020)
✅ helps converging to a minimal sufficient encoder
To perform well in transfer learning ⇒ we want our model to capture the mutual information between the data x and the downstream label y, $I(x; y)$
- If the mutual information between the views, $I(v_1; v_2)$, is smaller than $I(x; y)$ ⇒ the model will fail to capture useful information for the downstream tasks
- Meanwhile, if the mutual information between the views is too large ⇒ the views carry excess information unrelated to the downstream tasks ⇒ transfer performance decreases due to the noise
- ⇒ there is a sweet spot ⇒ the minimal sufficient encoder
Contrastively learned features are more uniform and aligned
- compared with a randomly initialized network or a network trained with supervised learning
- alignment is also measured, i.e., how close the features from two views of the same input are
In short, the theory of contrastive learning has been very useful, but there is still a long way to go
briefly discuss a few open research questions and areas of work to look into
Large batch size ⇒ improved transfer performance
High-quality large data corpus ⇒ better performance
Efficient negative sample selection
Combine multiple pretext tasks
Data augmentation tricks have critical impacts but are still quite ad-hoc
Modality-dependent: most augmentation methods only apply to a single modality ⇒ most of them are handcrafted by humans
Theoretical foundations
✅ e.g., on why certain augmentation works better than others
✅ to guide us to find more efficient data augmentation
Improving training efficiency
Self-supervised learning methods are pushing the deep learning arms race
❌ increases in model size and training batch size
❌ ⇒ lead to increased costs, both economic and environmental
Direct impact on economic and environmental costs
Social biases in the embedding space