"Pre-trained Models for Natural Language Processing: A Survey" Reading Notes (TBC)

prior knowledge

  • inductive bias

Preface

Chapter one of a portrait of Qiu Xipeng.

1 Introduction

Different neural network architectures have been proposed:

  • CNN
  • RNN
  • GNN
  • attention mechanism

Neural methods use low-dimensional, dense vectors (a.k.a. distributed representations) in NLP tasks.
Unlike in the CV field, the performance improvement in NLP has been smaller, because large datasets are lacking (except for machine translation).
PTMs help NLP tasks avoid training a new model (which commonly has a large number of parameters) from scratch.
The generations of PTMs:

  • first generation: pre-trained word embeddings
    • feature: the pre-training model itself is no longer needed by downstream tasks (only the learned embeddings are used)
    • examples: Skip-Gram, GloVe
    • advantage:
      • computational efficiency
    • drawbacks:
      • context-free, so they fail to capture higher-level concepts in context
      • do not handle polysemy disambiguation, syntactic structures, semantic roles, or anaphora well
  • second generation: pre-trained contextual word embeddings
    • feature: the learned encoders are still needed by downstream tasks
    • examples: CoVe, ELMo, OpenAI GPT, BERT

2 Background

2.1 Language Representation Learning

Non-contextual embeddings

limitations of non-contextual embeddings:

  • the embedding for a word is static regardless of its context: fails to represent polysemous words well
  • the OOV (out-of-vocabulary) problem: mitigated by character-level or sub-word representations (CharCNN, FastText, BPE), as sketched below
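
To make the sub-word idea concrete, here is a minimal Python sketch of learning BPE merge operations on a toy corpus; the word frequencies, end-of-word marker, and merge count are made-up illustrations, not taken from the survey:

    from collections import Counter

    def merge_pair(word, pair):
        """Merge every occurrence of the adjacent symbol pair inside one word."""
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        return " ".join(out)

    def learn_bpe(word_freqs, num_merges=5):
        """Repeatedly merge the most frequent adjacent symbol pair into a new sub-word unit."""
        vocab, merges = dict(word_freqs), []
        for _ in range(num_merges):
            pairs = Counter()
            for word, freq in vocab.items():
                symbols = word.split()
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += freq          # count adjacent pairs, weighted by word frequency
            if not pairs:
                break
            best = max(pairs, key=pairs.get)       # most frequent pair becomes a merge rule
            merges.append(best)
            vocab = {merge_pair(w, best): f for w, f in vocab.items()}
        return merges

    # toy corpus: words as space-separated character sequences with an end-of-word marker
    toy = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    print(learn_bpe(toy))   # frequent fragments such as ("e", "s") and ("es", "t") get merged first

An OOV word can then be segmented into these learned sub-word units instead of being mapped to a single unknown token.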

Contextual embeddings

motivations:

  • polysemy
  • the context-dependent nature of word meaning

2.2 Neural Contextual Encoders

Most contextual encoders fall into two categories:

Sequence Models

  • convolutional models: aggregate local information from neighboring words
  • recurrent models: capture context sequentially, but in practice have short memory

Non-Sequence Models

  • recursive NN / Tree-LSTM / GCN:
    • advantages:
      • useful inductive bias
      • easy to train
    • drawbacks (problems):
      • hard to capture long-range dependencies
      • need expert knowledge and external NLP tools (a sequence model's inductive bias is sequentiality and time invariance, which requires no external NLP tools; a Tree-LSTM, however, needs a dependency parser as prior knowledge)
  • fully-connected self-attention model (Transformer)
    • features:
      • weak inductive bias
      • models the relation between every pair of words, with connection weights computed dynamically by the self-attention mechanism (see the sketch after this list)
    • advantage:
      • easy to capture long-range dependencies
    • drawbacks:
      • hard to train (1. needs a large corpus; 2. easily overfits on small training datasets)
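
A minimal NumPy sketch of single-head scaled dot-product self-attention, showing how the connection weight between every pair of positions is computed on the fly; the shapes, random projection matrices, and toy input are illustrative assumptions, not from the survey:

    import numpy as np

    def self_attention(X, W_q, W_k, W_v):
        """Single-head scaled dot-product self-attention over X of shape (seq_len, d_model)."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v              # project tokens to queries/keys/values
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # one score per pair of positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax -> dynamic connection weights
        return weights @ V                               # each output mixes information from all tokens

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))                         # 5 tokens, d_model = 16
    W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
    print(self_attention(X, W_q, W_k, W_v).shape)        # (5, 16)

Because the weights depend on token contents rather than on fixed positions, any two positions can interact directly, which is why long-range dependencies are easy to capture while the weak inductive bias demands more training data.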

2.3 Why Pre-training?

motivations:

  • the expensive annotation cost of building large-scale labeled datasets
  • unlabeled datasets are easy to obtain

advantages:

  • learn universal language representations
  • provide a good model initialization that speeds up convergence
  • act as a kind of regularization (helping to avoid overfitting on small data)

2.4 History of PTMs

  • word embeddings
    • NNLM (neural network language model): deep neural networks
    • CBOW (continuous bag-of-words) and SG (Skip-Gram): no deep network needed
    • word2vec
    • GloVe
  • contextual encoders:
    • initialize LSTMs with an LM (language model) or a sequence autoencoder
    • pre-train a deep LSTM encoder from an attentional seq2seq model on MT (machine translation) -> CoVe
    • further development directions:
      • larger-scale corpora
      • more powerful or deeper architectures
      • new pre-training tasks
  • improvements of contextual encoders
    • feature-based (fixed representations):
      • pre-train a 2-layer LSTM encoder with a BiLM (bidirectional language model) -> ELMo (Embeddings from Language Models)
      • pre-train a character-level LM
    • fine-tuning:
      • ULMFiT (Universal Language Model Fine-tuning)
      • OpenAI GPT (Generative Pre-Training)
      • BERT (Bidirectional Encoder Representations from Transformers)

3 Overview of PTMs

3.1 Pre-training Tasks

  • SL (supervised learning): learn a function that maps an input to an output, based on training data consisting of input-output pairs
  • UL (unsupervised learning): find (rather than learn) intrinsic knowledge in unlabeled data, such as clusters, densities, or latent representations
  • SSL (self-supervised learning): learn general knowledge from data without human-annotated labels, where the labels are generated automatically from the data itself (e.g., MLM, masked language modeling)

examples of SSL (self-supervised learning) tasks:

LM (probabilistic language modeling)

LM often refers specifically to auto-regressive or unidirectional LM.

One drawback: it models only one direction. An improved solution: BiLM (bidirectional LM).
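
For reference, the standard left-to-right factorization that a unidirectional LM maximizes (a backward LM conditions on the future context instead, and a BiLM combines both directions); the notation here is mine, not copied from the survey:

    % probability of a sequence x_{1:T}, factorized left to right
    p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t})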

MLM (Masked Language Modeling)

details: predict the masked tokens from the remaining tokens.
One problem: a mismatch between the pre-training and fine-tuning phases (the [MASK] token never appears during fine-tuning).
One solution: among the selected positions, use the special [MASK] token 80% of the time, a random token 10% of the time, and the original token 10% of the time (see the sketch below).
features:

  • treated as a classification task
  • static masking
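
A minimal Python sketch of this 80/10/10 rule in the style of BERT; the 15% selection rate, token strings, and vocabulary handling are simplified assumptions:

    import random

    def mask_tokens(tokens, vocab, select_prob=0.15):
        """Select ~15% of positions; of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
        inputs, labels = list(tokens), [None] * len(tokens)
        for i, tok in enumerate(tokens):
            if random.random() < select_prob:
                labels[i] = tok                       # the model must recover the original token here
                r = random.random()
                if r < 0.8:
                    inputs[i] = "[MASK]"              # 80%: replace with the special mask token
                elif r < 0.9:
                    inputs[i] = random.choice(vocab)  # 10%: replace with a random vocabulary token
                # remaining 10%: keep the original token
        return inputs, labels

Static masking (as in BERT) applies this once during data preprocessing, while dynamic masking (as in RoBERTa, see E-MLM below) re-applies it each time a sequence is fed to the model.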

Seq2seq MLM (Sequence-to-Sequence Masked Language Modeling)

details: feed the masked sequence into an encoder, and have a decoder sequentially predict the masked tokens in an auto-regressive fashion (see the equation below)
feature: treated as a generative task
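
Roughly, with \hat{x} denoting the masked input sequence and x_{i:j} the masked span, the decoder maximizes the following (my paraphrase of the objective, not the survey's exact notation):

    % the decoder predicts the masked span token by token, conditioned on the
    % encoder's view of the masked input \hat{x} and the previously generated tokens
    p(x_{i:j} \mid \hat{x}) = \prod_{t=i}^{j} p(x_t \mid \hat{x}, x_{i:t-1})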

E-MLM (Enhanced Masked Language Modeling)

enhanced MLM variants proposed:

  • RoBERTa: dynamic masking
  • UniLM: extends mask prediction to unidirectional, bidirectional, and sequence-to-sequence prediction tasks
  • XLM: TLM (translation language modeling) on concatenated parallel bilingual sentence pairs
  • SpanBERT: random contiguous words masking (CWM) and the span boundary objective (SBO)
  • StructBERT: SOR (span order recovery)
  • ???

PLM (Permuted Language Modeling)

motivation: overcome the gap between pre-training and fine-tuning caused by the artificial [MASK] token (the objective is sketched below)
example:

  • XLNet
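
The XLNet-style objective maximizes the expected log-likelihood over sampled permutations z of the factorization order, where \mathcal{Z}_T is the set of permutations of length T (my summary of the XLNet formulation, not copied from the survey):

    % sample a permutation z of positions 1..T and factorize the sequence in that order
    \max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T}
        \left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid x_{z_{<t}}\right) \right]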

DAE (denoising autoencoder): stopping here for now; I couldn't follow this part, so I'm skipping it and will come back later.

Terminology

  • PTMs: pre-trained models
  • alleviate: to ease or relieve (alleviate the feature engineering problem)
  • discrete: separate, non-continuous (discrete handcrafted features)
  • a.k.a.: also known as
  • feature engineering
  • handcrafted features
  • dense vectors = distributed representations
  • polysemous: having multiple meanings; disambiguation: resolving ambiguity
  • anaphora: reference back to an earlier expression in the text
  • general-purpose: usable for many different purposes
  • lexical: relating to words and vocabulary
  • syntactic: relating to grammar and sentence structure
  • real-valued: taking real-number values (low-dimensional real-valued vectors)
  • the limitations of
  • dynamic embedding = contextual embedding (as opposed to static embedding)
  • aggregate: to gather and combine (convolutional models aggregate the local information from neighbors by convolution operations)
  • sequentiality: the property of being ordered in a sequence
  • time invariance: behavior that does not change across time steps
  • locality: dependence only on nearby positions
  • spatial invariance: behavior that does not change across spatial positions
  • convergence: approaching a stable solution during training
  • regularization: techniques that reduce overfitting
  • precursor: a forerunner or predecessor
  • intrinsic: inherent, internal
  • cloze: a fill-in-the-blank task
  • phase: a stage (pre-training phase)
