BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Abstract

  • A new language representation model called BERT.
  • BERT stands for Bidirectional Encoder Representations from Transformers.
  • BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
  • Applications: question answering, language inference.
  • BERT achieves excellent results on eleven natural language processing tasks.

Introduction

  • Sentence-level tasks:
    • natural language inference
    • paraphrasing
  • Token-level tasks:
    • named entity recognition
    • question answering

Two strategies for applying pre-trained language representations to downstream tasks:

  • feature-based

    • ELMo
  • fine-tuning

    • the Generative Pre-trained Transformer (OpenAI GPT)

Both approaches use unidirectional language models to learn general language representations.

Limitation

  • standard language models are unidirectional

Related Work

Unsupervised Feature-based Approaches

  • Learning widely applicable representations of words has been an active area of research, covering non-neural and neural methods.

  • Pre-trained word embeddings are an integral part of modern NLP systems.

  • Coarser granularities: sentence embeddings and paragraph embeddings.

Unsupervised Fine-tuning Approaches

  • OpenAI GPT
  • the GLUE benchmark
  • Left-to-right language modeling
  • auto-encoder objectives

Transfer Learning from Supervised Data

  • natural language inference
  • machine translation
  • Computer vision research

BERT

  • pre-training
    • the model is trained on unlabeled data over different pre-training tasks
  • fine-tuning
    • the BERT model is first initialized with the pre-trained parameters
    • all of the parameters are fine-tuned using labeled data from the downstream tasks.
    • each downstream task has its own separate fine-tuned model
    • a key feature of BERT is its unified architecture across different tasks (a sketch of the pre-train/fine-tune pattern follows this list)
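
A minimal sketch of this two-stage pattern, assuming the Hugging Face `transformers` library (an implementation choice not made in the paper; the model name and label counts are illustrative):

```python
# Sketch of the pre-train / fine-tune pattern using the Hugging Face
# transformers library (assumption: the paper's own code uses TensorFlow).
from transformers import BertForSequenceClassification

# Every downstream task starts from the same pre-trained checkpoint and
# ends up with its own fine-tuned copy of *all* parameters.
sentiment_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # e.g. a binary classification task
entailment_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)   # e.g. a 3-way NLI task

# During fine-tuning nothing is frozen: the encoder and the new task head
# are trained jointly on the labeled downstream data.
assert all(p.requires_grad for p in sentiment_model.parameters())
```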

Model Architecture

  • BERT is a multi-layer bidirectional Transformer encoder based on the original implementation.
  • the number of layers is denoted as $L$
  • the hidden size as $H$
  • the number of self-attention heads as $A$ (the two standard model sizes are sketched in the configuration example after this list)
  • the BERT Transformer uses bidirectional self-attention
  • the GPT Transformer uses constrained self-attention where every token can only attend to context to its left
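
For reference, the two model sizes reported in the paper can be written down as a small configuration sketch (the L/H/A values, parameter counts, and 4H feed-forward size come from the paper; the dict layout is only for illustration):

```python
# The two standard model sizes from the paper, as plain dicts.
# L = number of layers, H = hidden size, A = number of self-attention heads.
BERT_BASE = dict(L=12, H=768, A=12)    # ~110M parameters
BERT_LARGE = dict(L=24, H=1024, A=16)  # ~340M parameters

# The feed-forward (filter) size is 4H in both configurations.
for name, cfg in (("BERT-base", BERT_BASE), ("BERT-large", BERT_LARGE)):
    print(name, cfg, "feed-forward size:", 4 * cfg["H"])
```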

Input/Output Representations

  • WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary
  • the input embedding is denoted as $E$
  • the final hidden vector of the special [CLS] token as $C \in R^{H}$
  • the final hidden vector for the $i^{th}$ input token as $T_i \in R^{H}$ (a tokenization sketch follows this list)
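
A short tokenization sketch, assuming the Hugging Face `transformers` WordPiece tokenizer rather than the paper's original code (the example sentences are illustrative):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # ~30,000-token WordPiece vocabulary

# A sentence pair is packed into one sequence: [CLS] sentence A [SEP] sentence B [SEP]
enc = tokenizer("the man went to the store .", "he bought a gallon of milk .")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# Segment (token_type) ids distinguish sentence A (0) from sentence B (1).
# C is the final hidden vector at the [CLS] position; T_i at the i-th token.
print(enc["token_type_ids"])
```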

Pre-training BERT

  • two unsupervised tasks

Masked LM (Task 1)

  • mask some percentage of the input tokens at random, then predict those masked tokens
  • selected tokens are replaced with a special [MASK] token (see the masking sketch below)
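
A minimal sketch of the masking rule described in the paper (15% of input tokens are selected; of those, 80% become [MASK], 10% become a random token, 10% are left unchanged). The helper function and toy vocabulary below are illustrative, not the paper's code:

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Select ~15% of tokens for prediction; replace 80% of the selected
    tokens with [MASK], 10% with a random token, and keep 10% unchanged."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= select_prob:
            continue
        labels[i] = tok                 # only selected positions are predicted
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"
        elif r < 0.9:
            masked[i] = random.choice(vocab)
        # else: keep the original token (but still predict it)
    return masked, labels

print(mask_tokens("[CLS] my dog is hairy [SEP]".split(), vocab=["cat", "ran", "apple"]))
```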

Next Sentence Prediction (NSP) (Task 2)

Many important downstream tasks, such as QA and NLI, are:

  • based on understanding the relationship between two sentences.
  • a binarized next sentence prediction task.
  • NSP examples can be trivially generated from any monolingual corpus (a pair-construction sketch follows this list)
  • in prior work, only sentence embeddings are transferred to downstream tasks
  • BERT transfers all parameters to initialize end-task model parameters
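
A sketch of how the binarized NSP examples can be generated from a monolingual corpus (the 50/50 IsNext/NotNext split is from the paper; the helper and toy corpus are illustrative, and for brevity the random sentence is not forced to come from a different document):

```python
import random

def make_nsp_examples(documents):
    """For each sentence A: 50% of the time B is the actual next sentence
    (IsNext), 50% of the time B is a randomly chosen sentence (NotNext)."""
    examples = []
    for doc in documents:
        for i in range(len(doc) - 1):
            a = doc[i]
            if random.random() < 0.5:
                b, label = doc[i + 1], "IsNext"
            else:
                b, label = random.choice(random.choice(documents)), "NotNext"
            examples.append((a, b, label))
    return examples

corpus = [
    ["the man went to the store .", "he bought a gallon of milk ."],
    ["penguins are flightless birds .", "they live mostly in the southern hemisphere ."],
]
print(make_nsp_examples(corpus))
```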

Pre-training data

  • the BooksCorpus (800M words)
  • English Wikipedia (2500M words)
  • a document-level corpus is used in order to extract long contiguous sequences

Fine-tuning BERT

Experiments

GLUE

  • the final hidden vector $C \in R^{H}$ corresponding to the first input token ([CLS])
  • classification layer weights $W \in R^{K \times H}$, where $K$ is the number of labels
  • a standard classification loss is computed with $C$ and $W$: $\log(\mathrm{softmax}(C W^{T}))$ (a PyTorch sketch of this head follows)
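
As a concreteness check, the GLUE head is just one linear layer applied to $C$; a PyTorch sketch with random tensors standing in for real model outputs (the sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H, K, batch = 768, 3, 8                     # hidden size, number of labels, batch size
C = torch.randn(batch, H)                   # final [CLS] hidden vectors (stand-ins)
W = nn.Parameter(torch.randn(K, H) * 0.02)  # the only new parameters: W in R^{K x H}

log_probs = F.log_softmax(C @ W.T, dim=-1)  # log(softmax(C W^T))
labels = torch.randint(0, K, (batch,))
loss = F.nll_loss(log_probs, labels)        # standard classification loss
print(log_probs.shape, loss.item())
```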

GLUE tasks

  • a batch size of 32
  • fine-tune for 3 epochs over the data for all GLUE tasks (hyperparameters are collected in the sketch below)
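
The GLUE fine-tuning hyperparameters reported in the paper, collected in one place (the dict form is just for convenience):

```python
# Values from the paper's GLUE fine-tuning setup.
GLUE_FINETUNE = {
    "batch_size": 32,
    "epochs": 3,
    # the best learning rate is chosen per task on the dev set
    "learning_rate_candidates": [5e-5, 4e-5, 3e-5, 2e-5],
}
print(GLUE_FINETUNE)
```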

Classic Datasets

  • The Stanford Question Answering Dataset
  • unanswerable questions are treated as having an answer span that starts and ends at the [CLS] token (a span-scoring sketch follows this list)
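
For SQuAD, the paper introduces only a start vector $S \in R^{H}$ and an end vector $E \in R^{H}$ and scores candidate spans by dot products with the token vectors $T_i$; a PyTorch sketch with random tensors standing in for real model outputs:

```python
import torch

H, seq_len = 768, 12                 # illustrative sizes
T = torch.randn(seq_len, H)          # final hidden vectors T_i (stand-ins)
S = torch.randn(H)                   # start vector (new parameters for SQuAD)
E = torch.randn(H)                   # end vector

start_scores = T @ S                 # S · T_i for every position i
end_scores = T @ E                   # E · T_j for every position j

# The score of a candidate span (i, j) is S·T_i + E·T_j, with j >= i.
span_scores = start_scores[:, None] + end_scores[None, :]
valid = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool))
span_scores = span_scores.masked_fill(~valid, float("-inf"))
i, j = divmod(int(span_scores.argmax()), seq_len)
print("predicted span:", (i, j))
```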

Ablation Studies

Effect of Model Size


Conclusion

Recent empirical improvements due to transfer
learning with language models have demonstrated
that rich, unsupervised pre-training is an integral
part of many language understanding systems.


Summary

Next steps: get the BERT code running end to end and work through the implementation until it is fully understood.

  • study the BERT code in depth, and be clear about the difference between fine-grained (token-level) and coarse-grained (sentence- and paragraph-level) tasks

In Short

In essence, the BERT model is:

  • the BERT Transformer uses bidirectional self-attention
  • a multi-layer bidirectional Transformer encoder based on the original implementation

A multi-layer Transformer encoder.

