BERT Paper Reading (1): Pre-training of Deep Bidirectional Transformers for Language Understanding


  • pre-training
  • bidirectional ==> alleviates the unidirectionality constraint of fine-tuning based approaches by using a "masked language model" (MLM) pre-training objective
  • fine-tuning


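==> current approaches such as OpenAI GPT use a left-to-right architecture, which imposes a unidirectionality constraint

=> the masked language model alleviates the unidirectionality and allows bidirectional context

==> is the mismatch between pre-training and fine-tuning caused by literally replacing tokens with a mask?


==> language modeling cannot directly capture the relationship between sentences?


==> the traditional two steps (encode the text pair, then apply bidirectional cross attention) ==> a single self-attention step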

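==> several attempts were made to remedy the weakness of the unidirectional LTR model

attempt 1: we add a randomly initialized BiLSTM on top of the LTR model.

==> LTR + BiLSTM does help, but it still falls far short of the pre-trained bidirectional models!

attempt 2: LTR + RTL. It would also be possible to train separate left-to-right and right-to-left models and represent each token as the concatenation of the two models, as ELMo does.


a. LTR+RTL --> the token representation costs twice as much as a single bidirectional model

b. it is non-intuitive for tasks like question answering, since the right-to-left model cannot condition the answer on the question <-- read right-to-left, the answer comes before the question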




1. Introduction

feature-based approach and fine-tuning approach

The contributions of our paper

2. Related Work

2.1 Unsupervised Feature-based Approaches

2.2 Unsupervised Fine-tuning Approaches

2.3 Transfer Learning from Supervised Data

3. BERT framework

Model Architecture of BERT

Input/Output Representations

dataset:WordPiece embeddings

we differentiate the sentences in two ways

BERT input representation

3.1 Pre-training BERT

Task #1: Masked LM

Task #2: Next Sentence Prediction (NSP)

3.2 Fine-tuning BERT

4. Experiments

4.1 GLUE

4.2 SQuAD v1.1

in the question answering task

4.3 SQuAD v2.0

4.4 SWAG

5. Ablation/əˈbleɪ.ʃən/ Studies

5.1 Effect of Pre-training Tasks




5.2 Effect of Model Size

5.3 Feature-based approach with BERT

the feature-based approach has certain advantages:

To ablate the fine-tuning approach,

6. Conclusion


BERT, Bidirectional Encoder Representations from Transformers, a new language representation model.


It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left context (left-to-right) and right context (right-to-left) in all layers.


The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks without substantial (of considerable importance, size or worth) task-specific architecture modifications.


In a word, BERT is conceptually simple and empirically powerful.


1. Introduction

Language model pre-training has been shown to be effective for improving many natural language processing tasks.

  • sentence-level tasks, e.g. natural language inference (NLI) and paraphrasing --> which aim to predict the relationships between sentences by analyzing them holistically.
  • token-level tasks, e.g. named entity recognition (NER) and question answering --> where models are required to produce fine-grained output at the token level.

feature-based approach and fine-tuning approach

There are two existing strategies for applying pre-trained language representations to downstream tasks:

==> the two approaches share the same objective function during pre-training; they both use unidirectional language models to learn general language representations.

  • feature-based approach. ELMo, for example, uses task-specific architectures that include the pre-trained representations as additional features.
  • fine-tuning approach. OpenAI GPT (Generative Pre-trained Transformer), for example, introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters.

==> unidirectionality is the weakness; we want bidirectionality

Problem: current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers.

Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks, such as question answering, where it's crucial to incorporate context from both directions.


In this paper, we improve the fine-tuning based approaches by proposing BERT.


==> the masked language model alleviates the unidirectionality and allows bidirectional context

solve: BERT alleviates the previously mentioned unidirectionality constraint by using a "masked language model" (MLM) pre-training objective. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.


Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.


In addition to the masked language model, we also use a "next sentence prediction" task that jointly pre-trains text-pair representations.


The contributions of our paper

  • demonstrate the importance of bidirectional pre-training for language representations.

<= earlier models were unidirectional during the pre-training stage

BERT uses MLM to enable pretrained deep bidirectional representations.


  • pre-trained representations reduce the need for many heavily-engineered task-specific architectures.


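BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.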


  • BERT advances the state of the art for eleven NLP tasks.

2. Related Work

A brief history of pre-training general language representations.

2.1 Unsupervised Feature-based Approaches

  • word representation

pre-trained word embeddings > embeddings learned from scratch


--> To pre-train word embedding vectors, left-to-right language modeling objectives have been used + objectives to discriminate correct from incorrect words in left and right context

  • coarser granularities (sentence embeddings, paragraph embeddings)

--> To train sentence representations, prior work has used objectives to rank candidate next sentences, or left-to-right generation of next sentence words given a representation of the previous sentence.


  • generalizing traditional word embeddings

ELMo and its predecessor generalize traditional word embedding research along a different dimension: they extract context-sensitive features <-- from a left-to-right and a right-to-left language model

The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations.


problem: related work that predicts a single word from both left and right context using LSTMs is, similar to ELMo, feature-based and not deeply bidirectional.



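problem: word2vec is context-independent and cannot model polysemy (one word with several meanings)

solve: ELMo (Embeddings from Language Models) produces a context-dependent pre-trained representation

problem: ELMo itself is feature-based and not deeply bidirectional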

2.2 Unsupervised Fine-tuning Approaches

  • the first works in this direction only pre-trained word embedding parameters from unlabeled text

  • more recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task.


advantage: few parameters need to be learned from scratch. 


2.3 Transfer Learning from Supervised Data

Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models

an effective recipe is to fine-tune models pretrained with ImageNet


3. BERT framework

two steps in our framework of BERT: pre-training, fine-tuning


  • pre-training: the model is trained on unlabeled data over different pre-training tasks


  • fine-tuning: the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.


Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.


A distinctive feature of BERT: its unified architecture across different tasks. There's minimal difference between the pre-trained architecture and the final downstream architecture.


Model Architecture of BERT

a multi-layer bidirectional Transformer encoder



L = the number of layers; H = the hidden size; A = the number of self-attention heads; 4H = the feed-forward/filter size, i.e. 3072 for H=768 and 4096 for H=1024


two model sizes

(1) BERTbase(L=12, H=768, A=12, Total Parameters=110M)

(2) BERTlarge(L=24, H=1024, A=16, Total Parameters=340M)
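
The two totals can be sanity-checked from L and H. The sketch below is a rough estimate, not the authors' accounting: it assumes a vocabulary of 30,522 WordPiece tokens, 512 positions, 2 segment types, biases on every dense layer, LayerNorm after the embeddings and after each sub-layer, and a pooler layer on top of [CLS]. All of these are assumptions about the released models rather than facts stated in these notes. The number of heads A does not change the count.

```python
def bert_param_count(L, H, vocab_size=30522, max_positions=512, num_segments=2):
    """Rough parameter count for a BERT-style encoder (assumed layout, see text)."""
    layer_norm = 2 * H                                   # scale + bias
    embeddings = (vocab_size + max_positions + num_segments) * H + layer_norm
    attention = 4 * (H * H + H)                          # query, key, value, output projections
    feed_forward = (H * 4 * H + 4 * H) + (4 * H * H + H)  # H -> 4H -> H, with biases
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = H * H + H                                   # dense layer applied to [CLS]
    return embeddings + L * per_layer + pooler


base = bert_param_count(L=12, H=768)
large = bert_param_count(L=24, H=1024)
print(f"BERTbase  ~ {base / 1e6:.1f}M parameters")
print(f"BERTlarge ~ {large / 1e6:.1f}M parameters")
```

Under these assumptions the estimates land close to the 110M and 340M quoted above; the small gap for the large model comes from the assumed vocabulary size and from rounding in the reported figures.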

comparison: BERTbase Transformer  <--compare--> OpenAI GPT Transformer:

BERT uses bidirectional self-attention; GPT uses constrained self-attention where every token can only attend to context to its left.

In the literature, the bidirectional Transformer is often referred to as a "Transformer encoder", while the left-context-only version is referred to as a "Transformer decoder" since it can be used for text generation.
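
The difference between the two kinds of self-attention can be pictured as an attention mask. The following is a minimal sketch; the function name `attention_mask` is made up for illustration and does not come from the paper or any library.

```python
def attention_mask(seq_len, bidirectional):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[bidirectional or j <= i for j in range(seq_len)]
            for i in range(seq_len)]


encoder_mask = attention_mask(4, bidirectional=True)    # BERT-style: every pair is visible
decoder_mask = attention_mask(4, bidirectional=False)   # GPT-style: left context only

for name, mask in (("bidirectional", encoder_mask), ("left-to-right", decoder_mask)):
    print(name)
    for row in mask:
        print("  " + "".join("x" if allowed else "." for allowed in row))
```

The left-to-right mask is lower triangular, which is exactly the "every token can only attend to context to its left" constraint.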

Input/Output Representations

To make BERT handle a variety of downstream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., <Question, Answer>) in one token sequence. Here a "sentence" can be an arbitrary span of contiguous text, rather than an actual linguistic sentence; a "sequence" refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.


  • dataset:WordPiece embeddings

[CLS] -> a special classification token, the first token of every sequence


the final hidden state corresponding to [CLS] -> is used as the aggregate sequence representation for classification tasks

sentence pairs -> are packed together into a single sequence

  • we differentiate the sentences in two ways

First, we separate them with a special token ([SEP]).


Second, add a learned embedding to every token ->  indicating whether it belongs to sentence A or sentence B.


  • E ~ input embeddings
  • C ~ the final hidden vector of the special [CLS] token
  • Ti ~ the final hidden vector for the i-th input token


  • BERT input representation

Given a token, BERT input representation = corresponding token embedding + segment embedding + position embedding
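
As a minimal sketch of this sum, the snippet below packs a sentence pair into one sequence and adds the three embeddings element-wise. The helper names (`pack_pair`, `make_table`, `embed`), the tiny dimension of 4, and the random initial values are all illustrative assumptions; in the real model the tables are learned.

```python
import random


def pack_pair(tokens_a, tokens_b):
    """Build [CLS] A [SEP] B [SEP] with segment ids (0 for A, 1 for B) and positions."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    positions = list(range(len(tokens)))
    return tokens, segments, positions


def make_table(keys, dim, rng):
    """Toy embedding table: one random vector per key (learned in the real model)."""
    return {key: [rng.uniform(-1, 1) for _ in range(dim)] for key in keys}


def embed(tokens, segments, positions, tok_emb, seg_emb, pos_emb):
    """Input representation = token + segment + position embedding, per token."""
    return [[t + s + p for t, s, p in zip(tok_emb[tok], seg_emb[seg], pos_emb[pos])]
            for tok, seg, pos in zip(tokens, segments, positions)]


rng = random.Random(0)
tokens, segments, positions = pack_pair(["my", "dog"], ["he", "barks"])
tok_emb = make_table(sorted(set(tokens)), 4, rng)
seg_emb = make_table([0, 1], 4, rng)
pos_emb = make_table(positions, 4, rng)
vectors = embed(tokens, segments, positions, tok_emb, seg_emb, pos_emb)

for tok, seg, pos in zip(tokens, segments, positions):
    print(f"{pos}: {tok:6s} segment={seg}")
```

Because the segment embedding is added to every token, the model can tell sentence A from sentence B even without looking at [SEP].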


3.1 Pre-training BERT

we do not use traditional left-to-right or right-to-left language models to pre-train BERT.


we pre-train BERT using two unsupervised tasks.


Task #1: Masked LM

problem: standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context.



purpose: In order to train a deep bidirectional representation


==> masking is what makes bidirectional conditioning possible

solve: we simply mask some percentage of the input tokens at random, and then predict those masked tokens.


this procedure is called "masked LM" (MLM): the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM.


we mask 15% of the tokens in each sequence at random, and we only predict the masked words rather than reconstructing the entire input.


problem: Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning.

==> because the selected tokens are literally replaced by a mask, pre-training and fine-tuning no longer match

solve: To mitigate this, we do not always replace "masked" words with the actual [MASK] token.


  • the training data generator chooses 15% of the token positions at random for prediction

if the i-th token is chosen: 80% of the time ~ replace it with the [MASK] token; 10% ~ replace it with a random token; 10% ~ keep the i-th token unchanged


  • Ti will be used to predict the original token with cross entropy loss
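
The 80/10/10 rule above can be sketched as follows. This is a simplified illustration, not the authors' data generator: it selects each position independently with probability 0.15 (rather than exactly 15% of positions) and never selects [CLS] or [SEP]; both choices are assumptions, as are the function name and the toy vocabulary.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (model inputs, labels); labels are None where nothing is predicted."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= select_prob:
            continue                        # position not chosen for prediction
        labels[i] = tok                     # the target is always the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"            # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


vocab = ["my", "dog", "is", "hairy", "cat", "the"]
tokens = ["[CLS]"] + ["my", "dog", "is", "hairy"] * 250 + ["[SEP]"]
masked, labels = mask_tokens(tokens, vocab, random.Random(0))

chosen = sum(label is not None for label in labels)
print(f"chosen for prediction: {chosen} of {len(tokens)} tokens")
print(f"of which shown as [MASK]: {masked.count('[MASK]')}")
```

Only the chosen positions contribute to the cross entropy loss; every other position has no label.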


Task #2: Next Sentence Prediction (NSP)

Many important downstream tasks such as QA and NLI are based on understanding the relationship between two sentences


problem: which is not directly captured by language modeling.

==> a language model cannot directly capture the relationship between sentences


solve: In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.


when choosing the sentence A and B for each pretraining example, 50% ~ true next -> IsNext label; 50% ~ random sentence -> NotNext label
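
A minimal sketch of how such examples could be generated from a corpus is shown below. The function name, the toy documents, and the choice to draw the random sentence from a different document are illustrative assumptions, not details taken from these notes.

```python
import random


def make_nsp_example(documents, rng):
    """Return (sentence_a, sentence_b, label) with a 50/50 IsNext/NotNext split.

    Assumes at least two documents, each with at least two sentences.
    """
    doc_idx = rng.randrange(len(documents))
    doc = documents[doc_idx]
    i = rng.randrange(len(doc) - 1)          # leave room for a true next sentence
    sentence_a = doc[i]
    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"
    # Otherwise pick a random sentence from some other document.
    other = rng.choice([d for j, d in enumerate(documents) if j != doc_idx])
    return sentence_a, rng.choice(other), "NotNext"


documents = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3"]]
rng = random.Random(0)
examples = [make_nsp_example(documents, rng) for _ in range(200)]
is_next = sum(label == "IsNext" for _, _, label in examples)
print(f"IsNext: {is_next}, NotNext: {len(examples) - is_next}")
```

This is why the task can be "trivially generated from any monolingual corpus": the labels come from sentence order alone, so no annotation is needed.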


C is used for next sentence prediction(NSP), which is closely related to representation learning objectives.


  • However, in prior work, only sentence embeddings are transferred to downstream tasks.


  • BERT transfers all parameters to initialize end-task model parameters.


  • Pre-training data

For the pre-training corpus, we use the BooksCorpus (800M words) and English Wikipedia. For Wikipedia, we extract only the text passages and ignore lists, tables, and headers.

3.2 Fine-tuning BERT

Fine-tuning is straightforward <-- since the self-attention mechanism in the Transformer allows  BERT to model many downstream tasks by swapping out the appropriate inputs and outputs.


problem: For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention.

==> the traditional two steps (encode the text pair, then apply bidirectional cross attention) ==> a single self-attention step


solve: BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences.


For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.


  • at the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-∅ pair in text classification or sequence tagging.
  • at the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

4. Experiments

we present BERT fine-tuning results on 11 NLP tasks.


4.1 GLUE

The General Language Understanding Evaluation(GLUE) benchmark is a collection of diverse natural language understanding tasks.


To fine-tune on GLUE


we represent the input sequence (for a single sentence or sentence pairs) as described in Section 3, and use the final hidden vector C corresponding to the first input token ([CLS]) as the aggregate representation.


The only new parameters introduced during fine-tuning are classification layer weights W \in \mathbb{R}^{K \times H}, where K is the number of labels. We compute a standard classification loss with C and W, i.e., \log(\mathrm{softmax}(C W^{T})).
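
To make the shapes concrete, here is a small sketch with H=3 and K=2. It assumes the loss is the negative log-probability of the correct label, which is the usual reading of "standard classification loss"; the function names and the toy numbers are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def classification_loss(C, W, label):
    """C: [CLS] vector of size H; W: K rows of size H; label: index of the true class."""
    logits = [sum(c * w for c, w in zip(C, row)) for row in W]   # C W^T, one score per label
    return -log_softmax(logits)[label]


C = [0.5, -1.0, 2.0]                 # toy [CLS] vector, H = 3
W = [[0.1, 0.2, 0.3],                # toy weights, K = 2 labels
     [-0.3, 0.0, 0.4]]
loss = classification_loss(C, W, label=0)
print(f"loss for label 0: {loss:.4f}")
```

With W set to all zeros every label gets the same probability, so the loss equals log(K), a handy check that the shapes are wired correctly.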


4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs. Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.


in the question answering task

we represent the input question and passage --> a single packed sequence

  • with the question using the A embeddings
  • the passage using the B embeddings.

we only introduce a start vector S and an end vector E during fine-tuning.


The probability of word i being the start of the answer span is computed as a dot product between Ti and S followed by a softmax over all of the words in the paragraph:

P_i = \frac{e^{S \cdot T_i}}{\sum_{j} e^{S \cdot T_j}}

The analogous formula is used for the end of the answer span.

The score of a candidate answer span from position i to position j is defined as S \cdot T_i + E \cdot T_j, and the maximum scoring span where j >= i is used as a prediction.

The training objective is the sum of the log-likelihoods of the correct start and end positions.
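
The span scoring and the training objective can be sketched in a few lines. The vectors below are toy values with hidden size 2, and the function names are made up for illustration; the brute-force search over all (i, j) pairs is the simplest possible decoder, not necessarily what the authors use.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(T, S, E):
    """Return the (i, j) with j >= i that maximizes S.T_i + E.T_j."""
    start_scores = [dot(S, t) for t in T]
    end_scores = [dot(E, t) for t in T]
    candidates = ((i, j) for i in range(len(T)) for j in range(i, len(T)))
    return max(candidates, key=lambda ij: start_scores[ij[0]] + end_scores[ij[1]])


def span_loss(T, S, E, start, end):
    """Negative sum of the log-likelihoods of the correct start and end positions."""
    p_start = softmax([dot(S, t) for t in T])
    p_end = softmax([dot(E, t) for t in T])
    return -(math.log(p_start[start]) + math.log(p_end[end]))


T = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]   # toy final hidden vectors
S, E = [1.0, 0.0], [0.0, 1.0]                          # toy start and end vectors
span = best_span(T, S, E)
print("predicted span:", span)
print(f"loss if that span is correct: {span_loss(T, S, E, *span):.4f}")
```

Here the start scores are [1, 0, 2, 0] and the end scores are [0, 1, 0, 3], so the best span starts at position 2 and ends at position 3.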


4.3 SQuAD v2.0

The SQuAD 2.0 task extends the SQuAD 1.1 problem definition

  • by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.
  • we treat questions that do not have an answer as having an answer span with start and end at the [CLS] token.

The probability space for the start and end answer span positions is extended to include the position of the [CLS] token.
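
At prediction time this becomes a comparison between the best real span and the "null" span at [CLS]. The sketch below follows the rule described in the paper, where a non-null answer is predicted only when its score beats the null score by a threshold tuned on the dev set; the function name, the toy vectors, and the default threshold of 0 are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict_with_no_answer(T, C, S, E, tau=0.0):
    """Return the best (i, j) span, or None when the [CLS] null span wins.

    T: final hidden vectors of the passage tokens; C: final hidden vector of [CLS].
    """
    null_score = dot(S, C) + dot(E, C)           # span that starts and ends at [CLS]
    best_score, span = max(
        (dot(S, T[i]) + dot(E, T[j]), (i, j))
        for i in range(len(T)) for j in range(i, len(T))
    )
    return span if best_score > null_score + tau else None


T = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
S, E = [1.0, 0.0], [0.0, 1.0]
answerable = predict_with_no_answer(T, C=[0.0, 0.0], S=S, E=E)
unanswerable = predict_with_no_answer(T, C=[3.0, 3.0], S=S, E=E)
print("weak [CLS] score  ->", answerable)
print("strong [CLS] score ->", unanswerable)
```

Raising the threshold makes the model more willing to answer "no answer"; lowering it does the opposite.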

4.4 SWAG

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference. Given a sentence, the task is to choose the most plausible continuation among four choices.


when fine-tuning on the SWAG dataset, construct four input sequences, each containing the concatenation of the given sentence(sentence A) and a possible continuation(sentence B)


The only task-specific parameter introduced is a vector whose dot product with the [CLS] token representation C denotes a score for each choice, which is normalized with a softmax layer.
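
In other words, each of the four packed sequences yields its own C, and a single shared vector turns those into four scores. A minimal sketch, with toy numbers and a hypothetical function name:

```python
import math


def choice_probabilities(cls_vectors, v):
    """cls_vectors: one [CLS] vector C per candidate continuation; v: the task vector."""
    scores = [sum(a * b for a, b in zip(v, c)) for c in cls_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy [CLS] vectors for the four (sentence A, continuation) sequences.
cls_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.0]]
v = [1.0, 1.0]                       # the only task-specific parameters
probs = choice_probabilities(cls_vectors, v)
prediction = probs.index(max(probs))
print("probabilities:", [round(p, 3) for p in probs])
print("predicted choice:", prediction)
```

The softmax runs across the four choices of one example, not across a vocabulary or label set.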


5. Ablation/əˈbleɪ.ʃən/ Studies

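--> an ablation study is the set of controlled experiments you run when you propose several ideas for improving a model at the same time, in order to verify that each idea is effective on its own

  1. add module A on top of the baseline and check the result.
  2. add module B on top of the baseline and check the result.
  3. add both modules A and B on top of the baseline and check the result.

We perform ablation experiments over a number of facets of BERT in order to better understand their relative importance.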

5.1 Effect of Pre-training Tasks

we demonstrate the importance of the deep bidirectionality of BERT by evaluating two pre-training objectives using exactly the same pre-training data, fine-tuning scheme, and hyperparameters as BERTbase:


  • No NSP

--> MLM without NSP: A bidirectional model which is trained using the "Masked LM"(MLM) but without the "next sentence prediction"(NSP) task


result: removing NSP hurts performance significantly on QNLI, MNLI and SQuAD 1.1

  • LTR & No NSP

--> A left-context-only model which is trained using a standard Left-to-Right(LTR) LM, rather than an MLM. The left-only constraint was also applied at fine-tuning, because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance. 


--> this model was pre-trained without the NSP task.

result: The LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD.


  • +BiLSTM

suspect: the LTR model will perform poorly at token predictions, since the token-level hidden states have no right-side context.


purpose: to test this suspicion, they make a good-faith attempt at strengthening the LTR system.


==> several attempts were made to remedy the weakness of the unidirectional LTR model

attempt 1: we add a randomly initialized BiLSTM on top of the LTR model.

result: This does significantly improve results on SQuAD, but the results are still far worse than those of the pre-trained bidirectional models.

==> LTR + BiLSTM does help, but it still falls far short of the pre-trained bidirectional models!

attempt 2: we recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does. However:


a. this is twice as expensive as a single bidirectional model

b. this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question <-- read right-to-left, the answer comes before the question

c. this is strictly less powerful than a deep bidirectional model, since the deep bidirectional model can use both left and right context at every layer.


5.2 Effect of Model Size

we explore the effect of model size on fine-tuning task accuracy.


we trained a number of BERT models with a differing number of layers, hidden units, and attention heads



# L = the number of layers; # H = hidden size; # A = number of attention heads; LM (ppl) = the masked LM perplexity of held-out training data

It has long been known that increasing the model size will lead to continual improvements on large-scale tasks.


BERT's results suggest that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained.


In contrast, prior feature-based work reported that increasing the hidden dimension size from 200 to 600 helped, but increasing further to 1,000 did not bring further improvements.

Those prior works used a feature-based approach. The hypothesis is that when the model is fine-tuned directly on the downstream task with only a small number of randomly initialized additional parameters, the task-specific models can benefit from the larger, more expensive pre-trained representations even when downstream task data is very small.

5.3 Feature-based approach with BERT

All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model and all parameters are jointly fine-tuned on a downstream task.


==> how does fine-tuning relate to the binary NSP classification?

the feature-based approach, where fixed features are extracted from the pre-trained model, has certain advantages:

  • First, not all tasks can be easily represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added.


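> Does the Transformer encoder architecture belong to the feature-based approach, the fine-tuning approach, or the task-specific approach? How do these differ, and how are they related?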


  • Second, there are major computational benefits to pre-compute an expensive representation of the training data once and then run many experiments with cheaper models on top of this representation.


To ablate the fine-tuning approach,

  • we apply the feature-based approach by extracting the activations from one or more layers without fine-tuning any parameters of BERT.


  • These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.


==> BERT is effective for both fine-tuning and feature-based approaches.

6. Conclusion

  • rich, unsupervised pre-training is an integral part of many language understanding systems; these results enable even low-resource tasks to benefit from deep unidirectional architectures.


  • our major contribution --> further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.


--> If you read only the conclusion you learn almost nothing, because much of the important content is scattered throughout the body of the paper; this may be a difference between Chinese and English papers.
