BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT: Bidirectional Encoder Representations from Transformers

1. Key Contributions


2. BERT


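  • pre-training: during pre-training, the model is trained on unlabeled data over different pre-training tasks.
  • fine-tuning: for fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are then fine-tuned using labeled data from the downstream task.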

2.1 Model Architecture

BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.

2.1.1 Prerequisites

Because BERT is built on the Transformer architecture, read the paper "Attention Is All You Need" (Vaswani et al., 2017) first.

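2.1.2 Model Definition

The model size is described by three quantities:

  • the number of layers (i.e., Transformer blocks), denoted L
  • the hidden size, denoted H
  • the number of self-attention heads, denoted A

The paper reports two model sizes: BERT_BASE (L=12, H=768, A=12) and BERT_LARGE (L=24, H=1024, A=16). BERT_BASE was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left.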
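The difference between bidirectional and left-constrained self-attention can be pictured as a difference in attention masks. The following is a minimal sketch, not code from the paper: the helper names `attention_mask` and `render` are hypothetical, padding is ignored, and each token is assumed to be allowed to attend to itself.

```python
def attention_mask(seq_len, bidirectional):
    """Return a seq_len x seq_len matrix of booleans.

    mask[i][j] is True when the token at position i may attend to position j.
    Assumption: a token may always attend to itself.
    """
    return [
        [bidirectional or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Format a mask as rows of 1s (allowed) and 0s (blocked)."""
    return "\n".join(
        " ".join("1" if allowed else "0" for allowed in row) for row in mask
    )


# BERT-style: every token sees the full sequence.
bert_mask = attention_mask(4, bidirectional=True)
# GPT-style: every token sees only itself and the tokens to its left.
gpt_mask = attention_mask(4, bidirectional=False)

print("bidirectional:")
print(render(bert_mask))
print("left-to-right:")
print(render(gpt_mask))
```

The bidirectional mask is all ones, while the left-to-right mask is lower triangular.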

2.2 Input/Output Representations

2.2.1 Sentence Handling



  1. First, the sentences are separated with a special [SEP] token.
  2. Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.

BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.
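The sum of three embeddings can be sketched with toy lookup tables. This is an illustration under stated assumptions, not the released implementation: the function names, the 4-dimensional random vectors, and the tiny vocabulary are all made up for the example. The leading [CLS] token comes from the paper and is not discussed in these notes.

```python
import random


def build_inputs(tokens_a, tokens_b):
    """Pack two sentences into one sequence with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def make_table(num_rows, dim, rng):
    """Toy embedding table: random vectors stand in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def embed(tokens, segment_ids, position_ids, vocab, tok_table, seg_table, pos_table):
    """Input embedding = token embedding + segment embedding + position embedding."""
    output = []
    for tok, seg, pos in zip(tokens, segment_ids, position_ids):
        t, s, p = tok_table[vocab[tok]], seg_table[seg], pos_table[pos]
        output.append([a + b + c for a, b, c in zip(t, s, p)])
    return output


DIM = 4  # toy hidden size, chosen only for the example
rng = random.Random(0)

tokens, segment_ids, position_ids = build_inputs(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
tok_table = make_table(len(vocab), DIM, rng)
seg_table = make_table(2, DIM, rng)
pos_table = make_table(len(tokens), DIM, rng)

embeddings = embed(
    tokens, segment_ids, position_ids, vocab, tok_table, seg_table, pos_table
)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(tok, seg, pos)
```

Each row of `embeddings` is one input vector, so the sequence of 11 tokens yields 11 vectors of size `DIM`.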

3. Pre-training BERT

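We do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using the two unsupervised tasks described in this section.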

3.1 Task #1: Masked LM



15% of the tokens in a sequence are replaced with [MASK], and the model then predicts what the original tokens were.


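Although this allows us to obtain a bidirectional pre-trained model, a downside is that it creates a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace the randomly chosen tokens with the actual [MASK] token.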

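The training data generator chooses 15% of the tokens at random for prediction. If the i-th token is chosen, it is handled as follows:

  • 80% of the time, the i-th token is replaced with [MASK]
  • 10% of the time, the i-th token is replaced with a random token
  • 10% of the time, the i-th token is left unchanged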
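The 80/10/10 rule above can be sketched as follows. This is a minimal illustration, not the paper's data generator: the function name `mask_tokens`, the toy vocabulary, and the choice to pick exactly 15% of positions (rounded, at least one) are assumptions, and the input is assumed to contain no special tokens such as [SEP].

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_ratio=0.15):
    """Return (masked_tokens, labels).

    labels maps each selected position to its original token, which is
    what the model is trained to predict.
    """
    num_to_select = max(1, round(len(tokens) * select_ratio))
    positions = rng.sample(range(len(tokens)), num_to_select)

    masked = list(tokens)
    labels = {}
    for i in positions:
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


rng = random.Random(0)
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "cat"]
tokens = ("the quick brown fox jumps over the lazy dog " * 2).split() + ["the", "cat"]

masked, labels = mask_tokens(tokens, vocab, rng)
print(masked)
print(labels)
```

Note that the label is recorded for every selected position, including the ones that are left unchanged or replaced by a random token, so the model cannot tell which inputs have been altered.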

3.2 Task #2: Next Sentence Prediction (NSP)

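To make the model understand the relationship between sentences, Task #2 chooses sentences A and B for each pre-training example. 50% of the time B is the actual sentence that follows A (labeled as IsNext), and 50% of the time it is not (labeled as NotNext).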
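Building one such example can be sketched as below. This is a simplified illustration under stated assumptions: the function name `make_nsp_example` and the toy corpus are hypothetical, every document is assumed to have at least two sentences, and the negative sentence is drawn from a different document so that it cannot accidentally be the true next sentence.

```python
import random


def make_nsp_example(documents, rng):
    """Return (sentence_a, sentence_b, label) for next sentence prediction."""
    doc_idx = rng.randrange(len(documents))
    doc = documents[doc_idx]
    # Pick any sentence except the last one, so a true next sentence exists.
    sent_idx = rng.randrange(len(doc) - 1)
    sentence_a = doc[sent_idx]

    if rng.random() < 0.5:
        # 50%: B is the sentence that actually follows A.
        return sentence_a, doc[sent_idx + 1], "IsNext"

    # 50%: B is a random sentence from another document (assumption).
    other_indices = [i for i in range(len(documents)) if i != doc_idx]
    other_doc = documents[rng.choice(other_indices)]
    return sentence_a, rng.choice(other_doc), "NotNext"


documents = [
    ["the man went to the store", "he bought a gallon of milk", "then he went home"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
    ["the model is pre-trained first", "it is fine-tuned afterwards"],
]

rng = random.Random(0)
examples = [make_nsp_example(documents, rng) for _ in range(1000)]
print(examples[0])
```

Over many samples, roughly half of the examples carry each label.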



