Note 5: BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)

Fig. 1 (Devlin et al., 2018)

  1. BERT (Bidirectional Encoder Representations from Transformers) is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

2. Two-Step Framework

  • Pre-training: The model is trained on unlabeled data over different pre-training tasks.
  • Fine-tuning: The model is initialized with the pre-trained parameters and then fine-tuned using labeled data from the downstream task; each downstream task has its own separately fine-tuned model, even though all of them start from the same pre-trained parameters (a minimal sketch follows this list).
  • Fig. 2 Overall (Devlin et al., 2018)
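
A minimal sketch of this two-step workflow, assuming PyTorch; PretrainedEncoder, the task names, and the label counts are hypothetical placeholders rather than the actual BERT implementation.

```python
# Sketch of the pre-train / fine-tune split (PyTorch).
# `PretrainedEncoder`, the task names, and label counts are hypothetical.
import copy
import torch.nn as nn

class PretrainedEncoder(nn.Module):
    """Stand-in for a BERT-style Transformer encoder."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.hidden_size = hidden_size
        # Placeholder for the stacked Transformer blocks.
        self.layer = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        return self.layer(x)

# Step 1: pre-train a single encoder on unlabeled text (MLM + NSP, not shown).
pretrained = PretrainedEncoder()

# Step 2: each downstream task gets its own copy of the pre-trained weights
# plus a small task-specific output head, and is fine-tuned separately.
tasks = {"sentiment": 2, "nli": 3}  # task name -> number of labels (hypothetical)
finetuned_models = {
    name: nn.Sequential(copy.deepcopy(pretrained),
                        nn.Linear(pretrained.hidden_size, num_labels))
    for name, num_labels in tasks.items()
}
```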

3. Input/Output Representations

  • Although the downstream tasks differ, the input representation can unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence.
  • [CLS]: It is always the first token of every sequence. For classification tasks, its final hidden state is used as the aggregate representation of the sequence.
  • [SEP]: It separates the sentences of a pair. In addition, a learned segment embedding is added to each token to indicate which sentence it belongs to.
  • For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings (see the sketch after this list).
  • Fig. 3 Input representation (Devlin et al., 2018)
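
A minimal sketch of this embedding sum, assuming PyTorch; vocab_size and hidden_size follow BERT-base (uncased), while the example token ids are dummy values.

```python
# Sketch of BERT's input representation: the element-wise sum of token, segment,
# and position embeddings. vocab_size/hidden_size follow BERT-base (uncased);
# the example token ids are dummy values.
import torch
import torch.nn as nn

vocab_size, max_len, hidden_size = 30522, 512, 768

token_emb    = nn.Embedding(vocab_size, hidden_size)
segment_emb  = nn.Embedding(2, hidden_size)        # sentence A -> 0, sentence B -> 1
position_emb = nn.Embedding(max_len, hidden_size)  # learned absolute positions

# Layout for a sentence pair: [CLS] tokens of A [SEP] tokens of B [SEP]
token_ids   = torch.tensor([[101, 7592, 2088, 102, 2129, 2024, 2017, 102]])
segment_ids = torch.tensor([[0,   0,    0,    0,   1,    1,    1,    1]])
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

input_repr = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(input_repr.shape)  # torch.Size([1, 8, 768])
```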

4. Pre-training

  • Masked LM: Mask some percentage of input tokens at random and then predict these masked tokens.
    • Mask 15% of all tokens in each sequence at random.
    • However, this creates a mismatch between pre-training and fine-tuning, since the [MASK] token never appears during fine-tuning.
    • To mitigate this problem, if the i-th token is chosen, BERT replaces it with:
      • the [MASK] token 80% of the time
      • a random token 10% of the time
      • the unchanged i-th token 10% of the time
    • Merit: Because the model does not know which input tokens have been replaced, it is forced to keep a distributional contextual representation of every input token (a minimal sketch of the masking rule follows this list).
  • Next sentence prediction (NSP) is a binarized task which can train a model to understand sentence relationships.
    • The training samples can be generated from any monolingual corpus.
    • Label IsNext: in 50% of the samples, sentence B is the actual next sentence that follows sentence A.
    • Label NotNext: in the other 50%, sentence B is a random sentence selected from the corpus.
    • The output representation of the special [CLS] token is used for NSP classification, as shown in Fig. 2.
  • Pre-training data is a document-level corpus rather than a shuffled sentence-level corpus.
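
A minimal sketch of both pre-training data steps in plain Python; VOCAB_SIZE, MASK_ID, and SPECIAL_IDS are hypothetical stand-ins, and real implementations batch this over WordPiece-tokenized documents.

```python
# Sketch of the two pre-training data steps, in plain Python.
# VOCAB_SIZE, MASK_ID and SPECIAL_IDS are hypothetical stand-ins.
import random

VOCAB_SIZE = 30000
MASK_ID = 103              # id of the [MASK] token (hypothetical)
SPECIAL_IDS = {101, 102}   # e.g. [CLS] and [SEP]; never selected for masking

def mask_tokens(token_ids, mask_prob=0.15):
    """Masked LM corruption: select ~15% of tokens, then apply the 80/10/10 rule.

    Returns (corrupted inputs, labels); labels are -100 at unselected positions.
    """
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if tok in SPECIAL_IDS or random.random() >= mask_prob:
            labels.append(-100)                       # not selected: no target
            continue
        labels.append(tok)                            # selected: predict original
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID                       # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

def make_nsp_pair(doc_sentences, corpus_sentences):
    """NSP sampling: 50% consecutive sentences (IsNext), 50% random B (NotNext)."""
    i = random.randrange(len(doc_sentences) - 1)
    if random.random() < 0.5:
        return doc_sentences[i], doc_sentences[i + 1], "IsNext"
    return doc_sentences[i], random.choice(corpus_sentences), "NotNext"

corrupted, targets = mask_tokens([101, 2023, 2003, 2019, 2742, 102])
```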

5. Fine-tuning BERT

  • Encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between the two sentences.
  • Input: the above-mentioned sentence A and sentence B are analogous to
    • sentence pairs in paraphrasing,
    • hypothesis-premise pairs in entailment,
    • question-passage pairs in question answering,
    • a degenerate text-∅ pair in text classification or sequence tagging.
  • Output:
    • the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering.
    • the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis (a minimal sketch of both heads follows this list).
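
A minimal sketch of the two output heads, assuming PyTorch; the hidden states below are random stand-ins for BERT's final-layer outputs, and the label counts are hypothetical.

```python
# Sketch of the two fine-tuning output heads (PyTorch). The hidden states below
# are random stand-ins for BERT's final-layer outputs; sizes are hypothetical
# apart from hidden_size = 768 (BERT-base).
import torch
import torch.nn as nn

hidden_size, num_classes, num_tags = 768, 2, 9
batch, seq_len = 4, 128

# Stand-in for the encoder output: one vector per token, (batch, seq_len, hidden).
hidden_states = torch.randn(batch, seq_len, hidden_size)

# Sentence-level tasks (entailment, sentiment): classify from the [CLS] vector,
# which sits at position 0 of every sequence.
cls_head = nn.Linear(hidden_size, num_classes)
sentence_logits = cls_head(hidden_states[:, 0, :])   # (batch, num_classes)

# Token-level tasks (sequence tagging, QA span prediction): apply a shared
# output layer to every token representation.
token_head = nn.Linear(hidden_size, num_tags)
token_logits = token_head(hidden_states)             # (batch, seq_len, num_tags)
```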

6. BERT vs. GPT vs. ELMo

  • BERT uses a bidirectional Transformer. OpenAI GPT (Radford et al., 2018) uses a left-to-right Transformer. ELMo (Peters et al., 2018) uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks.
  • BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.
  • Differences in pre-training model architectures (Devlin et al., 2018)

Reference

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
