Abstract
- A new language representation model called BERT.
- BERT stands for Bidirectional Encoder Representations from Transformers.
- BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
- Applications: question answering, language inference, etc.
- Achieves excellent results on eleven natural language processing tasks.
Introduction
- sentence-level tasks:
- natural language inference、
- paraphrasing
- token-level tasks:
- named entity recognition
- question answering
Two strategies for applying pre-trained models to downstream tasks:
- feature-based
- fine-tuning
- the Generative Pre-trained Transformer (OpenAI GPT)
Both strategies use unidirectional language models to learn general language representations.
Limitation
- standard language models are unidirectional
Related Work
Unsupervised Feature-based Approaches
- Learning widely applicable representations of words is an active research area, including non-neural and neural methods.
- Pre-trained word embeddings are an integral part of modern NLP systems.
- Coarser granularities: sentence embeddings and paragraph embeddings.
Unsupervised Fine-tuning Approaches
- OpenAI GPT
- the GLUE benchmark
- Left-to-right language modeling
- auto-encoder objectives
Transfer Learning from Supervised Data
- natural language inference
- machine translation
- Computer vision research
BERT
- pre-training
- the model is trained on unlabeled data over different pre-training tasks
- fine-tuning
- the BERT model is first initialized with the pre-trained parameters
- all of the parameters are fine-tuned using labeled data from the downstream tasks.
- Each downstream task gets its own separately fine-tuned model (see the sketch below).
- A key feature of BERT is its unified architecture across different tasks.
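The two-stage paradigm above can be made concrete with a minimal sketch. It assumes the Hugging Face `transformers` library; the checkpoint name, label count, and learning rate are illustrative choices, not values from the paper.

```python
# Minimal sketch of the pre-train -> fine-tune paradigm
# (assumes the Hugging Face `transformers` library is installed).
import torch
from transformers import BertForSequenceClassification

# Step 1: initialize the model with the pre-trained parameters.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Step 2: fine-tune ALL parameters on labeled downstream data
# (each downstream task ends up with its own fine-tuned copy of the model).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```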
Model Architecture
- BERT is a multi-layer bidirectional Transformer encoder based on the original implementation.
- the number of layers is denoted as $L$
- the hidden size as $H$
- the number of self-attention heads as $A$ (the BERT-base and BERT-large settings are sketched below)
- the BERT Transformer uses bidirectional self-attention
- the GPT Transformer uses constrained self-attention, where every token can only attend to context to its left
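The paper's two published sizes are BERT-base ($L=12$, $H=768$, $A=12$) and BERT-large ($L=24$, $H=1024$, $A=16$). The sketch below writes them as configs, assuming the Hugging Face `transformers` `BertConfig`/`BertModel` classes; it builds a randomly initialized encoder, not the pre-trained one.

```python
# The two model sizes from the paper, expressed as illustrative configs
# (assumes the Hugging Face `transformers` library).
from transformers import BertConfig, BertModel

bert_base_cfg = BertConfig(
    num_hidden_layers=12,    # L = 12
    hidden_size=768,         # H = 768
    num_attention_heads=12,  # A = 12
    intermediate_size=3072,  # feed-forward size = 4H
)
bert_large_cfg = BertConfig(
    num_hidden_layers=24,    # L = 24
    hidden_size=1024,        # H = 1024
    num_attention_heads=16,  # A = 16
    intermediate_size=4096,  # feed-forward size = 4H
)

# Builds a randomly initialized multi-layer bidirectional Transformer encoder
# with the requested L / H / A (no pre-trained weights are loaded here).
encoder = BertModel(bert_base_cfg)
```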
Input/Output Representations
- WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary
- the input embedding is denoted as $E$
- the final hidden vector of the special [CLS] token as $C \in R^{H}$
- the final hidden vector for the $i^{th}$ input token as $T_i \in R^{H}$ (see the sketch below)
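A short sketch of these input/output conventions, assuming the Hugging Face `transformers` library and the bert-base-uncased checkpoint (both are illustrative choices):

```python
# Sketch: WordPiece tokenization with the [CLS] / [SEP] convention, and where
# C and T_i come from in the encoder output (assumes Hugging Face `transformers`).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# -> ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]']

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, H)

C = hidden[:, 0, :]    # final hidden vector of [CLS], C in R^H
T_1 = hidden[:, 1, :]  # final hidden vector of the first real token, T_i in R^H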
Pre-training BERT
Masked LM (Task 1)
- mask some percentage of the input tokens at random
- selected positions are (mostly) replaced with the [MASK] token; the full masking rule is sketched below.
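A minimal sketch of the masking rule from the paper: 15% of the WordPiece tokens are selected; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The token IDs and vocabulary size below are illustrative, and -100 is a common PyTorch convention for positions ignored by the loss.

```python
# Sketch of BERT's masked-LM corruption (15% selected; 80% [MASK] / 10% random / 10% kept).
import random

MASK_ID = 103        # id of [MASK] in a standard 30k WordPiece vocab (illustrative)
VOCAB_SIZE = 30000   # matches the 30,000-token vocabulary mentioned above

def mask_tokens(token_ids, mask_prob=0.15):
    inputs = list(token_ids)
    labels = [-100] * len(inputs)   # -100 = position ignored by the LM loss
    for i in range(len(inputs)):
        if random.random() < mask_prob:
            labels[i] = inputs[i]   # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                        # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)   # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```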
Next Sentence Prediction (NSP)
Many important downstream tasks, such as QA and NLI:
- are based on understanding the relationship between two sentences.
- a binarized next sentence prediction task.
- training pairs can be generated trivially from any monolingual corpus (pair construction sketched below)
- sentence embeddings are transferred to down-stream tasks
- BERT transfers all parameters to initialize end-task model parameters
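A sketch of how the binarized NSP examples can be built from a monolingual corpus: half the time sentence B is the actual next sentence (IsNext), half the time it is a random sentence from another document (NotNext). The function and data layout are illustrative, not the paper's actual preprocessing code.

```python
# Sketch of binarized next-sentence-prediction pair construction.
import random

def make_nsp_example(doc, index, corpus):
    """doc: list of sentences; corpus: list of documents (each a list of sentences)."""
    sent_a = doc[index]
    if random.random() < 0.5 and index + 1 < len(doc):
        sent_b, label = doc[index + 1], "IsNext"              # 50%: the real next sentence
    else:
        other_doc = random.choice(corpus)
        sent_b, label = random.choice(other_doc), "NotNext"   # 50%: a random sentence
    return sent_a, sent_b, label
```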
Pre-training data
- the BooksCorpus (800M words)
- English Wikipedia (2500M words)
- long contiguous sequences
Fine-tuning BERT
Experiments
GLUE
- $C \in R^{H}$: the final hidden vector corresponding to the first input token ([CLS])
- classification layer weights $W \in R^{K \times H}$
- $\log(\mathrm{softmax}(C W^{T}))$ (sketched below)
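The classification head can be written out directly. A minimal PyTorch sketch follows, with K (number of labels) and H (hidden size) as illustrative values and a random stand-in for C:

```python
# Sketch of the GLUE head: the only new parameters are W in R^{K x H};
# the prediction is log(softmax(C W^T)) over the K labels.
import torch
import torch.nn.functional as F

K, H = 3, 768                    # e.g. 3 labels (MNLI-style), H = 768 for BERT-base
C = torch.randn(1, H)            # final hidden vector of [CLS] (random stand-in)
W = torch.randn(K, H)            # classification layer weights

logits = C @ W.T                 # shape (1, K)
log_probs = F.log_softmax(logits, dim=-1)   # log(softmax(C W^T))
```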
GLUE tasks
- a batch size of 32
- fine-tune for 3 epochs
Classic datasets
- The Stanford Question Answering Dataset
- the [CLS] token (span scoring sketched below)
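For SQuAD, the paper introduces only a start vector $S \in R^{H}$ and an end vector $E \in R^{H}$: token $i$ is scored as a span start by $S \cdot T_i$ and as a span end by $E \cdot T_j$, and the best answer span maximizes $S \cdot T_i + E \cdot T_j$ with $j \geq i$. A minimal sketch with random stand-in tensors:

```python
# Sketch of BERT's SQuAD span scoring with a start vector S and an end vector E.
import torch

H, seq_len = 768, 128
T = torch.randn(seq_len, H)   # final hidden vectors T_i (random stand-ins)
S = torch.randn(H)            # start vector
E = torch.randn(H)            # end vector

start_scores = T @ S          # S . T_i for every position i
end_scores = T @ E            # E . T_j for every position j

# Best span (i, j) with j >= i under the score S.T_i + E.T_j.
span_scores = start_scores[:, None] + end_scores[None, :]
valid = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool))
span_scores = span_scores.masked_fill(~valid, float("-inf"))
best = int(torch.argmax(span_scores))
i, j = divmod(best, seq_len)
print(f"best span: tokens {i}..{j}")
```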
Ablation Studies
Effect of Model Size
Conclusion
Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems.
Summary
- Gradually get the BERT code running and work through it thoroughly.
- Be clear about the difference between fine-grained (token-level) tasks and coarse-grained (sentence-level) tasks.
In short, the BERT model is:
- the BERT Transformer uses bidirectional self-attention
- a multi-layer bidirectional Transformer encoder based on the original implementation
- i.e., a stack of Transformer encoder layers