BERT Paper Reading (1): Pre-training of Deep Bidirectional Transformers for Language Understanding

The three core ideas of BERT:

  • pre-training
  • bidirectional ==> alleviates the unidirectionality constraint of fine-tuning based approaches by using a "masked language model" (MLM) pre-training objective
  • fine-tuning

==> Why masking?

==> Current approaches, e.g. OpenAI GPT, use a left-to-right architecture, which imposes a unidirectionality constraint

==> the masked language model relaxes the unidirectionality constraint and allows bidirectional context

==> Doesn't fully hiding tokens behind a real [MASK] create a mismatch between pre-training and fine-tuning?

==> To mitigate this, the masked words are not always replaced with the actual [MASK] token: 80% of the selected tokens are replaced with [MASK], 10% with a random token, and the remaining 10% are left unchanged

==> Language modeling cannot directly capture sentence relationships?

==> To train a model that understands sentence relationships, we pre-train on a binarized next sentence prediction (NSP) task

==> Traditional two-step text-pair encoding + bidirectional cross attention ==> a single self-attention step

==> To mitigate the weakness of unidirectional LTR models, several attempts were made

attempt 1: we add a randomly initialized BiLSTM on top of the LTR model.

==> Adding a BiLSTM on top of the unidirectional LTR model does improve results, but performance is still far worse than the pre-trained bidirectional models!

attempt 2: LTR + RTL; it would also be possible to train separate left-to-right and right-to-left models and represent each token as the concatenation of the two models, as ELMo does.

problem: 

a. LTR + RTL --> the cost of computing token representations is twice that of a single bidirectional model 

b. non-intuitive for tasks like question answering, because the right-to-left model cannot condition the answer on the question <-- in right-to-left order the answer is seen before the question

c. strictly speaking, it is less powerful than a deep bidirectional model, which can use both left and right context at every layer.


Contents

Abstract

1. Introduction

feature-based approach and fine-tuning approach

The contributions of our paper

2. Related Work

2.1 Unsupervised Feature-based Approaches

2.2 Unsupervised Fine-tuning Approaches

2.3 Transfer Learning from Supervised Data

3. BERT framework

Model Architecture of BERT

Input/Output Representations

dataset:WordPiece embeddings

we differentiate the sentences in two ways

BERT input representation

3.1 Pre-training BERT

Task #1: Masked LM

Task #2: Next Sentence Prediction (NSP)

3.2 Fine-tuning BERT

4. Experiments

4.1 GLUE

4.2 SQuAD v1.1

in the question answering task

4.3 SQuAD v2.0

4.4 SWAG

5. Ablation/əˈbleɪ.ʃən/ Studies

5.1 Effect of Pre-training Tasks

No NSP

LTR & No NSP

+BiLSTM

5.2 Effect of Model Size

5.3 Feature-based approach with BERT

the feature-based approach has certain advantages:

To ablate the fine-tuning approach,

6. Conclusion


Abstract

BERT, Bidirectional Encoder Representations from Transformers, a new language representation model.

BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left (left-to-right) and right (right-to-left) context in all layers.

The pretrained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks without substantial task-specific architecture modifications.

In one word, BERT is conceptually simple and empirically powerful.

1. Introduction

Language model pre-training has been shown to be effective for improving many natural language processing tasks, for example:

  • sentence-level tasks, e.g. natural language inference (NLI) and paraphrasing --> which aim to predict the relationships between sentences by analyzing them holistically.
  • token-level tasks, e.g. named entity recognition (NER) and question answering --> where models are required to produce fine-grained output at the token level.

feature-based approach and fine-tuning approach

There are two existing strategies for applying pre-trained language representations to downstream tasks:  

==> both share the same objective function during pre-training, and both use unidirectional language models to learn general language representations.

  • feature-based approach. ELMo uses task-specific architectures that include the pre-trained representations as additional features.
  • fine-tuning approach. OpenAI GPT (Generative Pre-trained Transformer) introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning all pretrained parameters.

==> the drawback of unidirectionality; bidirectionality is needed

Problem: current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers.

Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks, such as question answering, where it is crucial to incorporate context from both directions.

In this paper, we improve the fine-tuning based approaches with BERT.

==> the masked language model relaxes the unidirectionality constraint and allows bidirectional context

solve: BERT alleviates the previously mentioned unidirectionality constraint by using a "masked language model" (MLM) pre-training objective. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.

Unlike left-to-right language model pre-training,

the MLM objective enables the representation to fuse the left and the right context, which allows us to pretrain a deep bidirectional Transformer.

A "next sentence prediction" task jointly pretrains text-pair representations.

The contributions of our paper

  • demonstrate the importance of bidirectional pre-training for language representations <= previous models were unidirectional at the pre-training stage

BERT uses MLM to enable pretrained deep bidirectional representations.

  • pre-trained representations reduce the need for many heavily-engineered task-specific architectures.

BERT is the first fine-tuning based representation model that outperforms many task-specific architectures.

  • BERT advances the state of the art for eleven NLP tasks.

https://github.com/google-research/bert

2. Related Work

A history of pre-training general language representations.

2.1 Unsupervised Feature-based Approaches

  • word representation

pre-trained word embeddings > embeddings learned from scratch (pre-trained word embeddings outperform embeddings learned from scratch)

--> To pretrain word embedding vectors, left-to-right language modeling objectives have been used, as well as objectives to discriminate correct from incorrect words in left and right context

  • coarser granularities (sentence embeddings, paragraph embeddings)

--> To train sentence representations, prior work has used objectives to rank candidate next sentences, and left-to-right generation of next-sentence words given a representation of the previous sentence.

  • traditional word embedding

ELMo and its predecessor

extract context-sensitive features <-- from a left-to-right and a right-to-left language model

The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations.

problem: LSTM-based models similar to ELMo are feature-based and not deeply bidirectional.

--> Further reading: https://www.cnblogs.com/robert-dlut/p/9824346.html

problem: word2vec is context-independent and cannot model polysemy

solve: ELMo (Embeddings from Language Models) obtains a context-dependent pre-trained representation

problem: their model is feature-based and not deeply bidirectional

2.2 Unsupervised Fine-tuning Approaches

early work in this direction only pretrained word embedding parameters from unlabeled text

  • sentence or document encoders which produce contextual token representations have been pretrained from unlabeled text and fine-tuned for a supervised downstream task.

advantage: few parameters need to be learned from scratch. 

2.3 Transfer Learning from Supervised Data

CV research demonstrates the importance of transfer learning from large pre-trained models.

An effective recipe is to fine-tune models pretrained with ImageNet.

3. BERT framework

There are two steps in the BERT framework: pre-training and fine-tuning.

  • pre-training: the model is trained on unlabeled data over different pre-training tasks.

  • fine-tuning: the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.

Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.

A distinctive feature of BERT: its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

Model Architecture of BERT

BERT's model architecture: a multi-layer bidirectional Transformer encoder

[Figure: BERT model architecture]

L, the number of layers; H, the hidden size; A, the number of self-attention heads; 4H, the feed-forward/filter size, i.e. 3072 for H=768 and 4096 for H=1024

two model sizes

(1) BERTbase(L=12, H=768, A=12, Total Parameters=110M)

(2) BERTlarge(L=24, H=1024, A=16, Total Parameters=340M)
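
A small sketch of these two configurations as code (hypothetical helper, not from the paper; the feed-forward size 4H follows the note above):

```python
from dataclasses import dataclass

@dataclass
class BertConfig:
    L: int   # number of Transformer layers
    H: int   # hidden size
    A: int   # number of self-attention heads

    @property
    def feed_forward_size(self) -> int:
        return 4 * self.H   # 3072 for H=768, 4096 for H=1024

BERT_BASE = BertConfig(L=12, H=768, A=12)    # ~110M parameters
BERT_LARGE = BertConfig(L=24, H=1024, A=16)  # ~340M parameters
```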

Comparison: BERTbase Transformer <--compare--> OpenAI GPT Transformer:

BERT uses bidirectional self-attention; GPT uses constrained self-attention where every token can only attend to the context to its left.

In the literature, the bidirectional Transformer is often referred to as a "Transformer encoder" while the left-context-only version is referred to as a "Transformer decoder", since it can be used for text generation. 

Input/Output Representations

To make BERT handle a variety of downstream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. A "sentence" can be an arbitrary span of contiguous text, rather than an actual linguistic sentence; a "sequence" refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

  • dataset:WordPiece embeddings

[CLS] -> a special classification token, the first token of every sequence

the final hidden state corresponding to [CLS] -> is used as the aggregate sequence representation for classification tasks

sentence pairs -> are packed together into a single sequence

  • we differentiate the sentences in two ways

First, separate them with a special token [SEP]. Note: [SEP] separates the sentences; it does not by itself distinguish them!!

Second, add a learned embedding to every token -> indicating whether it belongs to sentence A or sentence B.

  • E ~ the input embedding
  • C ~ the final hidden vector of the special [CLS] token
  • Ti ~ the final hidden vector for the i-th input token

[Figure: BERT input/output representation with [CLS], [SEP], E, C, and Ti]

  • BERT input representation

Given a token, its BERT input representation = the corresponding token embedding + segment embedding + position embedding

[Figure: BERT input embeddings as the sum of token, segment, and position embeddings]
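
A minimal sketch of this sum of three embeddings, assuming PyTorch and illustrative sizes (the class and its defaults are my own, not from the paper):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, type_vocab=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)     # WordPiece token embeddings
        self.segment = nn.Embedding(type_vocab, hidden)   # sentence A / sentence B
        self.position = nn.Embedding(max_len, hidden)     # learned position embeddings

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))               # element-wise sum
```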

3.1 Pre-training BERT

We do not use traditional left-to-right or right-to-left language models to pre-train BERT.

We pre-train BERT using two unsupervised tasks.

Task #1: Masked LM

problem: standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context.

==> Why masking?

purpose: In order to train a deep bidirectional representation

==> masking allows bidirectional conditioning

solve: we simply mask some percentage of the input tokens at random, and then predict those masked tokens.

This is called a "masked LM" (MLM); the final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard LM.  

15% of the tokens are masked at random, and we only predict the masked words rather than reconstructing the entire input. 

problem: Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning.

==> always masking with the real [MASK] token creates a pre-training/fine-tuning mismatch

solve: To mitigate this, we do not always replace "masked" words with the actual [MASK] token.

  • chooses 15% of the token positions at random for prediction

if the i-th token is chosen: 80% ~ replace with the [MASK] token; 10% ~ replace with a random token; 10% ~ keep the i-th token unchanged

  • Ti will be used to predict the original token with cross entropy loss
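
A minimal sketch of this 80/10/10 masking rule (MASK_ID, VOCAB_SIZE, and the -100 "ignore" label are assumptions for illustration, not values from the paper):

```python
import random

MASK_ID = 103          # assumed [MASK] id; depends on the actual vocabulary
VOCAB_SIZE = 30522     # assumed WordPiece vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (masked_ids, labels); labels are -100 where no prediction is made."""
    masked = list(token_ids)
    labels = [-100] * len(token_ids)            # ignored by the loss (e.g. ignore_index=-100)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:        # ~85% of positions are not selected
            continue
        labels[i] = tok                         # predict the original token here
        r = random.random()
        if r < 0.8:                             # 80%: replace with [MASK]
            masked[i] = MASK_ID
        elif r < 0.9:                           # 10%: replace with a random token
            masked[i] = random.randrange(VOCAB_SIZE)
        # remaining 10%: leave the token unchanged
    return masked, labels
```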

Task #2: Next Sentence Prediction (NSP)

Many important downstream tasks such as QA and NLI are based on understanding the relationship between two sentences,

problem: which is not directly captured by language modeling.

==> language modeling cannot directly capture sentence relationships

solve: In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus.

When choosing the sentences A and B for each pretraining example: 50% of the time B is the actual next sentence -> labeled IsNext; 50% of the time B is a random sentence -> labeled NotNext.
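
A small sketch of how such NSP pairs could be built; the function and its arguments are hypothetical, only the 50/50 IsNext/NotNext split comes from the paper:

```python
import random

def make_nsp_example(doc_sentences, all_sentences):
    """Build one (sentence_a, sentence_b, label) NSP example.

    doc_sentences: sentences of one document, in order.
    all_sentences: pool of sentences from the whole corpus.
    """
    i = random.randrange(len(doc_sentences) - 1)
    sentence_a = doc_sentences[i]
    if random.random() < 0.5:                        # 50%: the true next sentence
        return sentence_a, doc_sentences[i + 1], "IsNext"
    else:                                            # 50%: a random corpus sentence
        return sentence_a, random.choice(all_sentences), "NotNext"
```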

C is used for next sentence prediction (NSP), which is closely related to representation learning objectives.

  • However, in prior work, only sentence embeddings are transferred to downstream tasks.

  • BERT transfers all parameters to initialize end-task model parameters.

  • Pre-training data

For the pre-training corpus, we use the BooksCorpus (800M words) and English Wikipedia. For Wikipedia, we extract only the text passages and ignore lists, tables, and headers.

3.2 Fine-tuning BERT

Fine-tuning is straightforward <-- since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks by swapping out the appropriate inputs and outputs.

problem: For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention.

==> the traditional two steps (encode the text pair, then apply bidirectional cross attention) ==> replaced by a single self-attention step

solve: BERT instead uses the self-attention mechanism to unify these two stages, since encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between the two sentences.

For each task, we simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.

  • at the input, sentence A and sentence B are analogous to (1) sentence pairs, (2) hypothesis-premise pairs, (3) question-passage pairs
  • at the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

4. Experiments

We present BERT fine-tuning results on 11 NLP tasks.

4.1 GLUE

The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks.

To fine-tune on GLUE:

represent the input sequence (a single sentence or a sentence pair) as described in Section 3, and use the final hidden vector C corresponding to the first input token ([CLS]) as the aggregate representation.

The only new parameters introduced during fine-tuning are the classification layer weights W ∈ R^{K×H}, where K is the number of labels. We compute a standard classification loss with C and W, i.e., log(softmax(C W^T)).
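
A minimal sketch of this classification head in PyTorch; the sizes H and K are assumptions for illustration:

```python
import torch
import torch.nn as nn

# A single linear layer on top of the [CLS] vector C, trained with a standard
# cross-entropy (log-softmax) loss, as described above.
H, K = 768, 3                                # assumed hidden size and label count
classifier = nn.Linear(H, K)                 # weights W in R^{K x H} plus a bias

def glue_loss(cls_vector, label):
    logits = classifier(cls_vector)          # C W^T + b, shape (batch, K)
    return nn.functional.cross_entropy(logits, label)

# usage: cls_vector has shape (batch, H), label has shape (batch,)
```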

4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced question/answer pairs. Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

in the question answering task

we represent the input question and passage --> a single packed sequence

  • with the question using the A embeddings
  • the passage using the B embeddings.

We only introduce a start vector S and an end vector E during fine-tuning.

The probability of word i being the start of the answer span is computed as a dot product between Ti and S followed by a softmax over all of the words in the paragraph:

P_i = \frac{e^{S \cdot T_i}}{\sum_{j} e^{S \cdot T_j}}

The analogous formula is used for the end of the answer span.

The score of a candidate answer span from position i to position j is defined as S·Ti + E·Tj, and the maximum scoring span where j >= i is used as a prediction.

The training objective is the sum of the log-likelihoods of the correct start and end positions.
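
A small sketch of the span scoring at prediction time (the function, its arguments, and max_answer_len are illustrative assumptions; only the score S·Ti + E·Tj with j >= i comes from the paper):

```python
import torch

def best_span(sequence_output, start_vec, end_vec, max_answer_len=30):
    """Pick the highest-scoring span (i, j), j >= i, scored as S·Ti + E·Tj.

    sequence_output: (seq_len, H) final hidden vectors T_i
    start_vec, end_vec: (H,) the start vector S and the end vector E
    """
    start_logits = sequence_output @ start_vec    # S · T_i for every position i
    end_logits = sequence_output @ end_vec        # E · T_j for every position j
    best = (None, float("-inf"))
    for i in range(len(start_logits)):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_logits[i] + end_logits[j]
            if score > best[1]:
                best = ((i, j), score.item())
    return best                                   # ((start, end), score)
```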

4.3 SQuAD v2.0

The SQuAD 2.0 task extends the SQuAD 1.1 problem definition

  • by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.
  • questions that do not have an answer are treated as having an answer span that starts and ends at the [CLS] token.

The probability space for the start and end answer span positions is extended to include the position of the [CLS] token.

4.4 SWAG

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference. Given a sentence, the task is to choose the most plausible continuation among four choices.

When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B).

The only task-specific parameters introduced are a vector whose dot product with the [CLS] token representation C denotes a score for each choice, which is normalized with a softmax layer.
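
A minimal sketch of this SWAG head, assuming an illustrative hidden size H and PyTorch (not the paper's code):

```python
import torch
import torch.nn as nn

H = 768                                      # assumed hidden size
score_vec = nn.Parameter(torch.randn(H))     # the single task-specific vector

def swag_probs(cls_vectors):
    # cls_vectors: (4, H), one [CLS] vector per (sentence A, candidate B) sequence
    scores = cls_vectors @ score_vec         # dot product -> one scalar per choice
    return torch.softmax(scores, dim=0)      # probabilities over the 4 choices
```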

5. Ablation/əˈbleɪ.ʃən/ Studies

--> An ablation study: when you propose several ideas to improve a model at the same time, you run controlled experiments to verify that each idea is effective on its own:

  1. Add module A on top of the baseline and measure the effect.
  2. Add module B on top of the baseline and measure the effect.
  3. Add modules A and B together on top of the baseline and measure the effect.

Ablation experiments over a number of facets of BERT in order to better understand their relative importance.

5.1 Effect of Pre-training Tasks

We demonstrate the importance of the deep bidirectionality of BERT by evaluating two pre-training objectives using exactly the same pre-training data, fine-tuning scheme, and hyperparameters as BERTbase:

  • No NSP

--> MLM without NSP: a bidirectional model trained using the "masked LM" (MLM) but without the "next sentence prediction" (NSP) task

result: removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD 1.1

  • LTR & No NSP

--> A left-context-only model trained using a standard left-to-right (LTR) LM, rather than an MLM. The left-only constraint was also applied at fine-tuning, because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance. 

--> this model was pre-trained without the NSP task.

result: The LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD.

[Table: ablation over the pre-training tasks]

  • +BiLSTM

suspect: the LTR model will perform poorly at token predictions, since the token-level hidden states have no right-side context.

purpose: to test this, the authors make a good-faith attempt at strengthening the LTR system.

==> several attempts were made to mitigate the weakness of the unidirectional LTR model

attempt 1: we add a randomly initialized BiLSTM on top of the LTR model.

result: This does significantly improve results on SQuAD, but the results are still far worse than those of the pre-trained bidirectional models.

attempt 2: we recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does.

problem: 

a. this is twice as expensive as a single bidirectional model

b. this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question

c. this is strictly less powerful than a deep bidirectional model, since the deep bidirectional model can use both left and right context at every layer.

5.2 Effect of Model Size

We explore the effect of model size on fine-tuning task accuracy.

We trained a number of BERT models with a differing number of layers, hidden units, and attention heads.

[Table: effect of BERT model size on downstream task accuracy]

#L = the number of layers; #H = hidden size; #A = number of attention heads; LM (ppl) = the masked LM perplexity of held-out training data

It has long been known that increasing the model size will lead to continual improvements on large-scale tasks.

Scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained.

increasing hidden dimension size from 200 to 600 helped, but increasing further to 1,000 did not bring further improvements.

While prior work used a feature-based approach, with fine-tuning the task-specific models can benefit from the larger, more expressive pre-trained representations even when downstream task data is very small.

5.3 Feature-based approach with BERT

All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model and all parameters are jointly fine-tuned on a downstream task.

==> how does fine-tuning relate to the binary NSP prediction task?

the feature-based approach, where fixed features are extracted from the pre-trained model, has certain advantages:

  • First, not all tasks can be easily represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added.

> Does the Transformer encoder architecture belong to the feature-based approach, the fine-tuning approach, or a task-specific approach, and how are they related? See Section 1, Introduction.

  • Second, there are major computational benefits to pre-computing an expensive representation of the training data once and then running many experiments with cheaper models on top of this representation.

To ablate the fine-tuning approach,

  • we apply the feature-based approach by extracting the activations from one or more layers without fine-tuning any parameters of BERT.

  • These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.
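
A small sketch of this frozen-features setup, assuming the Hugging Face `transformers` package (an assumption for illustration; the paper's own code is not used here):

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")
bert.eval()                                     # BERT parameters stay fixed

# randomly initialized 2-layer, 768-dimensional BiLSTM before the classifier
bilstm = nn.LSTM(input_size=768, hidden_size=768, num_layers=2,
                 bidirectional=True, batch_first=True)

def extract_features(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():                       # no fine-tuning of BERT
        hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 768)
    features, _ = bilstm(hidden)                # contextual features for a classifier
    return features
```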

==> BERT is effective for both fine-tuning and feature-based approaches.

6. Conclusion

  • Rich, unsupervised pre-training is an integral part of many language understanding systems; these results enable even low-resource tasks to benefit from deep unidirectional architectures.

  • Our major contribution --> further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.

--> If you only read the conclusion you learn very little, because much of the important content is scattered throughout the body of the paper; perhaps this is a difference between Chinese and English papers.
