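The BERT Model

BERT framework

BERT works in two stages: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over several different pre-training tasks. During fine-tuning, BERT is initialized with the pre-trained parameters and then fine-tuned on labeled data from the downstream task.

BERT is a multi-layer bidirectional Transformer encoder. Yes, the Transformer in BERT has only the encoder and no decoder!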

(Figure: BERT model)

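Model input/output representation

BERT uses WordPiece embeddings, and the final hidden-layer vector serves as the representation of each token. There are also three special tokens:

[CLS]: the first token of every sequence; its final hidden state is used as the sequence representation in classification tasks.
[SEP]: the separator for a sentence pair (A, B). It is placed at the end of each sentence, so it appears both between A and B and at the end of the sequence; see the input/output representation figure for details.
[MASK]: replaces the selected words in the masked LM task.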
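
To make the WordPiece step mentioned above more concrete, here is a minimal sketch of greedy longest-match-first sub-word splitting, which is the general strategy WordPiece tokenizers follow. The function name wordpiece_tokenize and the toy vocabulary are assumptions made for illustration; they are not taken from any library.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched: the whole word becomes unknown
        pieces.append(current)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"play", "##ing", "##ed", "un", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("unable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Each piece returned here would then be looked up in the embedding table, so a rare word is represented by several token vectors instead of a single unknown vector.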

(Figure: BERT input/output representation)

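Note also that a "sentence" in BERT is not a sentence in the linguistic sense; it can be any span of contiguous text. A "sequence" is the input token sequence, which may be a single sentence or two sentences packed together.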
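
As a rough illustration of how one or two sentences are packed into a single sequence, the sketch below adds the special tokens and builds the matching segment IDs. The helper name build_bert_input is hypothetical, and the tokens are assumed to be already split into WordPiece pieces.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Pack one or two token lists into a single sequence with segment IDs."""
    # Sentence A: [CLS] A [SEP], all marked as segment 0.
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        # Sentence B: B [SEP], all marked as segment 1.
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


pair_tokens, pair_segments = build_bert_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
single_tokens, single_segments = build_bert_input(["my", "dog", "is", "cute"])

print(pair_tokens)
print(pair_segments)
print(single_tokens)
print(single_segments)
```

With a single sentence every position gets segment ID 0, so only one segment embedding is ever looked up.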

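A few small questions

No matter how many sentences the input contains, the maximum total length is fixed, just as in other training tasks.
There can be more than two segment types, although this is rare (the released BERT models use two, A and B). If the input has only one segment, only a single segment embedding is used.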
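
The fixed total length can be sketched as follows: the pair is trimmed until it fits the budget, then the packed sequence is padded up to the maximum length. The function names, the "[PAD]" filler, and the tiny max_length of 8 are assumptions chosen for illustration, not the exact code of any release.

```python
def truncate_pair(tokens_a, tokens_b, max_tokens):
    """Trim the longer list one token at a time until both fit in max_tokens."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    while len(tokens_a) + len(tokens_b) > max_tokens:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
    return tokens_a, tokens_b


def pad_to_length(tokens, max_length, pad_token="[PAD]"):
    """Pad a sequence to max_length and return it with a 1/0 attention mask."""
    padding = max_length - len(tokens)
    mask = [1] * len(tokens) + [0] * padding
    return list(tokens) + [pad_token] * padding, mask


max_length = 8
# A pair needs three special tokens: [CLS], [SEP], [SEP].
a, b = truncate_pair(["w1", "w2", "w3", "w4"], ["v1", "v2", "v3"], max_length - 3)
sequence = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
padded, mask = pad_to_length(sequence, max_length)

short_padded, short_mask = pad_to_length(["[CLS]", "w1", "[SEP]"], max_length)
print(padded, mask)
print(short_padded, short_mask)
```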

Pre-training BERT

Unlike earlier pre-trained models, BERT is pre-trained with two unsupervised tasks: masked LM (MLM) and next sentence prediction (NSP).

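The source code shows that:

MLM and NSP share the same input, so the input to both contains masked tokens.
The masked LM task predicts only the masked tokens; the other tokens are not predicted.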
Because this is multi-task learning, the final loss is the sum of the MLM loss and the NSP loss.
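
A small sketch of how the two objectives could be combined, assuming the MLM loss is averaged over the masked positions only and then added to the NSP loss, as the paper describes. The function names and the toy probabilities are made up for illustration.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])


def pretraining_loss(mlm_probs, mlm_labels, nsp_probs, nsp_label):
    """Mean MLM loss over masked positions plus the NSP loss."""
    # mlm_labels[i] is None for positions that were not selected for masking.
    masked = [(p, y) for p, y in zip(mlm_probs, mlm_labels) if y is not None]
    mlm_loss = sum(cross_entropy(p, y) for p, y in masked) / max(len(masked), 1)
    nsp_loss = cross_entropy(nsp_probs, nsp_label)
    return mlm_loss + nsp_loss


# Two positions over a 2-word toy vocabulary; only the second one was masked.
toy_mlm_probs = [[0.5, 0.5], [0.25, 0.75]]
toy_mlm_labels = [None, 1]
toy_loss = pretraining_loss(toy_mlm_probs, toy_mlm_labels, [0.5, 0.5], 0)
print(toy_loss)
```

The first position contributes nothing to the loss, which is the "only masked tokens are predicted" point in code form.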

Task 1: MLM

I did not understand why a conventional model cannot be trained bidirectionally, or why masked LM solves that problem:

Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.

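One article explains it this way:

As the figure shows, after two layers of bidirectional processing, the output at each position already carries information about the word originally at that position. This kind of "peeking" makes the word-prediction task meaningless, because the model has already seen which word sits at each position. Ref: "Semi-supervised sequence tagging with bidirectional language models".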
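
The "see itself" effect can be traced with a toy dependency tracker. This is not a neural network: each state is just the set of input positions it depends on, and bidirectional_layer is a hypothetical helper that lets position i read every other position's state from the layer below, as a model predicting word i from both directions would.

```python
def bidirectional_layer(deps):
    """deps[j] is the set of input positions that state j depends on.

    The new state at position i combines the states of all positions
    except i itself (left context and right context).
    """
    n = len(deps)
    new_deps = []
    for i in range(n):
        seen = set()
        for j in range(n):
            if j != i:
                seen |= deps[j]
        new_deps.append(seen)
    return new_deps


n = 4
layer0 = [{j} for j in range(n)]      # raw embeddings: state j holds exactly word j
layer1 = bidirectional_layer(layer0)  # one bidirectional layer
layer2 = bidirectional_layer(layer1)  # a second layer stacked on top

for i in range(n):
    print(i, sorted(layer1[i]), sorted(layer2[i]))
```

After one layer, position i has not seen word i. After two layers it has, because its neighbours' states already contain word i. Masking the input removes the word itself, so there is nothing left to leak.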

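How masked LM works:

A fixed proportion of the tokens is selected at random and masked, and the model then predicts those tokens. The authors chose 15%. The paper also notes: “In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.”
Because the [MASK] token never appears during fine-tuning, the selected tokens get some extra handling during pre-training: a token selected for masking is replaced with [MASK] 80% of the time, replaced with a random token 10% of the time, and left unchanged 10% of the time.

The masked LM task predicts only the masked tokens; the other tokens are not predicted.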
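
The masking procedure described above can be sketched roughly as follows. The function name mask_tokens, the toy vocabulary, and the choice to always mask at least one token are assumptions for illustration; only the 15% and 80/10/10 figures come from the text.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = rng or random.Random()
    special = {"[CLS]", "[SEP]"}
    candidates = [i for i, t in enumerate(tokens) if t not in special]
    num_to_mask = max(1, round(len(candidates) * mask_prob))
    chosen = rng.sample(candidates, min(num_to_mask, len(candidates)))

    output = list(tokens)
    labels = [None] * len(tokens)
    for i in chosen:
        labels[i] = tokens[i]  # the original token is the prediction target
        roll = rng.random()
        if roll < 0.8:
            output[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            output[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return output, labels


toy_vocab = ["cat", "dog", "runs", "sleeps", "the", "a"]
toy_tokens = ["[CLS]"] + ["the", "cat", "runs", "a", "dog"] * 4 + ["[SEP]"]
masked, labels = mask_tokens(toy_tokens, toy_vocab, rng=random.Random(0))
print(masked)
print(labels)
```

Positions whose label is None are exactly the ones left out of the MLM loss.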

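Task 2: NSP

Tasks such as question answering (QA) and natural language inference (NLI) depend on understanding the relationship between two sentences, which language modeling does not capture directly. To let the model learn relationships between sentences, a binarized NSP task is added to pre-training. Specifically:

For each training example (A, B), 50% of the time B is the sentence that actually follows A and the pair is labeled IsNext; the other 50% of the time B is replaced by B', a random sentence from the corpus, and the pair is labeled NotNext.
The objective is closely related to those of Jernite et al. (2017) and Logeswaran and Lee (2018).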
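
A minimal sketch of how such pairs could be sampled. The helper make_nsp_example is hypothetical, and drawing B' from a different document is an assumption; the text only says B' comes at random from the corpus. The sketch also assumes at least two documents and that sent_index is not the last sentence of its document.

```python
import random


def make_nsp_example(documents, doc_index, sent_index, rng):
    """Return (sentence_a, sentence_b, label) for one NSP training example."""
    doc = documents[doc_index]
    sentence_a = doc[sent_index]
    if rng.random() < 0.5:
        # 50%: B really is the next sentence.
        return sentence_a, doc[sent_index + 1], "IsNext"
    # 50%: B' is a random sentence taken from another document.
    other_indices = [i for i in range(len(documents)) if i != doc_index]
    other_doc = documents[rng.choice(other_indices)]
    return sentence_a, rng.choice(other_doc), "NotNext"


toy_corpus = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the south"],
]
rng = random.Random(0)
examples = [make_nsp_example(toy_corpus, 0, 0, rng) for _ in range(20)]
for example in examples[:4]:
    print(example)
```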

My question about this part:

If the task is not of the QA or NLI kind, does BERT then have only the MLM task?

Fine-tuning BERT

A passage from the paper that I do not understand:

For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention, such as Parikh et al. (2016); Seo et al. (2017). BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences.
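
One way to read this passage: when A and B are concatenated, the self-attention weight matrix contains four blocks (A to A, A to B, B to A, B to B), and the two off-diagonal blocks are exactly cross attention in both directions. The sketch below is a simplified illustration under strong assumptions: queries and keys are the raw toy vectors, with no learned projections and a single head.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(vectors):
    """Row i holds how much position i attends to every position."""
    scale = math.sqrt(len(vectors[0]))
    weights = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights.append(softmax(scores))
    return weights


sentence_a = [[1.0, 0.0], [0.5, 0.5]]  # two toy token vectors for A
sentence_b = [[0.0, 1.0]]              # one toy token vector for B
weights = self_attention_weights(sentence_a + sentence_b)

len_a = len(sentence_a)
a_to_b = [row[len_a:] for row in weights[:len_a]]  # A tokens attending to B
b_to_a = [row[:len_a] for row in weights[len_a:]]  # B tokens attending to A
print(a_to_b)
print(b_to_a)
```

Both blocks come out of one attention pass over the concatenated sequence, so no separate cross-attention stage is needed.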

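Learning resources

From Word Embedding to BERT: A History of Pre-training Techniques in Natural Language Processing
BERT Series (1): Running the Demo
How Interviewers Ask About BERT
How to Encode Positional Information in Text Elegantly? A Brief Overview of Three Positional Encoding Methods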
