CS224n Lecture 8: Recurrent Neural Networks and Language Models

  • Traditional language models
  • RNNs
  • RNN language models
  • Training problems and tricks
  • RNNs for other sequence tasks
  • Bidirectional and deep RNNs

Language Models

  • computes a probability for a sequence of words: $P(w_1, \dots, w_T)$

  • Useful in ML applications (e.g., machine translation) for deciding on:

    • word ordering ("a b" vs. "b a")
    • word choice ("home" vs. "house")
  • Traditional

    • conditioned on a window of the previous n words
    • Markov assumption
    • use counts to estimate probabilities
    • RAM requirement grows with the n-gram window size, since counts for every observed n-gram must be stored
  • Recurrent Neural Networks

    • RAM requirement scales only with the number of words (the word vectors), not with the context length
    • uses the same set of weights W at all time steps (see the sketch after this list)
  • Vanishing or exploding gradients

    • long-distance dependencies: in practice a plain RNN can only remember roughly the last 5-6 words
    • solution 1: initialize W to the identity matrix and use the ReLU nonlinearity $f(z) = \mathrm{rect}(z) = \max(z, 0)$
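
A minimal NumPy sketch of the RNN language-model forward step described above; the dimensions, initial values, and variable names are illustrative assumptions, not the lecture's code. It shows the two points from the list: the same weight matrices are reused at every time step, and the recurrent matrix is initialized to the identity with a ReLU nonlinearity.

```python
import numpy as np

# Illustrative sizes: vocabulary, word-vector dimension, hidden dimension.
V, d, H = 10000, 100, 200

rng = np.random.default_rng(0)
L = rng.normal(0, 0.1, (d, V))     # word-embedding matrix
W_h = np.eye(H)                    # identity init for the recurrent weights
W_x = rng.normal(0, 0.1, (H, d))   # input-to-hidden weights
W_s = rng.normal(0, 0.1, (V, H))   # hidden-to-output (softmax) weights

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_forward(word_ids):
    """Return P(next word | history) at each step; the same W_h, W_x, W_s
    are reused at every time step."""
    h = np.zeros(H)
    preds = []
    for t in word_ids:
        x = L[:, t]                     # embedding of the current word
        h = relu(W_h @ h + W_x @ x)     # shared recurrent update
        preds.append(softmax(W_s @ h))  # distribution over the next word
    return preds

probs = rnn_lm_forward([12, 7, 345])    # toy word-id sequence
print(probs[-1].shape)                  # (V,)
```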

Bidirectional RNN

  • run a second RNN over the input in reverse order and combine the forward and backward hidden states at each position (see the sketch below)
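
A minimal sketch of the bidirectional idea (all sizes and weight names are assumptions for illustration): one RNN reads left to right, a second reads right to left, and each position concatenates both hidden states.

```python
import numpy as np

H, d, T = 64, 50, 8                     # hidden size, input dim, sequence length (toy values)
rng = np.random.default_rng(1)
xs = rng.normal(0, 1, (T, d))           # toy input sequence

W_f, U_f = rng.normal(0, 0.1, (H, d)), rng.normal(0, 0.1, (H, H))  # forward RNN weights
W_b, U_b = rng.normal(0, 0.1, (H, d)), rng.normal(0, 0.1, (H, H))  # backward RNN weights

def run_rnn(seq, W, U):
    h, hs = np.zeros(H), []
    for x in seq:
        h = np.tanh(W @ x + U @ h)
        hs.append(h)
    return hs

h_fwd = run_rnn(xs, W_f, U_f)              # left-to-right pass
h_bwd = run_rnn(xs[::-1], W_b, U_b)[::-1]  # right-to-left pass, re-aligned to positions

# Each position t gets a representation from both directions.
h_bi = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
print(h_bi[0].shape)                        # (2 * H,)
```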

SMT

f: French, the source language
e: English, the target language

$\hat{e} = \arg\max_e p(f \mid e)\, p(e)$

$p(e)$: language model, treated as a weighted component that controls fluency
$p(f \mid e)$: translation model

The same decomposition is used in speech recognition: $p(\text{text} \mid \text{voice}) \propto p(\text{voice} \mid \text{text})\, p(\text{text})$
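
The decomposition follows from Bayes' rule: $p(f)$ does not depend on $e$, so it can be dropped inside the argmax.

```latex
\hat{e} = \arg\max_e p(e \mid f)
        = \arg\max_e \frac{p(f \mid e)\, p(e)}{p(f)}
        = \arg\max_e p(f \mid e)\, p(e)
```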

Translation model
  • alignment: hard
    • zero fertility (source words aligned to nothing)
    • one to many
    • many to one
    • many to many
    • reordering
  • the decoder must search over many candidate translations: beam search (see the sketch after this list)
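
A minimal, generic beam-search sketch; the `step_probs` interface and toy scores are assumptions for illustration, not the lecture's decoder. At each step only the top-k partial hypotheses are kept instead of exploring every option.

```python
import heapq
import math

def beam_search(step_probs, beam_size=3, max_len=10, eos="</s>"):
    """step_probs(prefix) -> dict of next-token probabilities.
    In SMT/NMT the score would combine translation- and language-model terms."""
    beams = [(0.0, [])]                        # (log-probability, token prefix)
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            if prefix and prefix[-1] == eos:   # finished hypotheses are kept as-is
                candidates.append((logp, prefix))
                continue
            for tok, p in step_probs(prefix).items():
                candidates.append((logp + math.log(p), prefix + [tok]))
        # keep only the top `beam_size` partial hypotheses
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])

# Toy next-token distribution, purely for illustration.
def toy_step(prefix):
    return {"the": 0.5, "cat": 0.3, "</s>": 0.2}

print(beam_search(toy_step, beam_size=2, max_len=4))
```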

NMT

Main advantage: an end-to-end trainable model; you only specify a single final objective function, and everything in between is learned by the model

RNN Translation model extensions
  1. Train different RNN weights for encoding and decoding
  2. Compute every hidden state in the decoder from (see the equation after this list):
    • the previous hidden state
    • the last hidden vector of the encoder
    • the previously predicted output word
  3. Train stacked (multi-layer) RNNs
  4. Train a bidirectional encoder (occasionally)
  5. Feed the input sequence in reverse order for simpler optimization (helps with vanishing gradients): A B C -> X Y becomes C B A -> X Y
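
Extension 2 can be written compactly; the notation here is an illustrative choice, consistent with the GRU equations below:

```latex
h_t^{\text{dec}} = \phi\!\left(h_{t-1}^{\text{dec}},\; c,\; y_{t-1}\right),
\qquad c = h_T^{\text{enc}}
```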

Advanced RNN

  • LSTM
  • GRU
GRU
  • update gate: based on current input word vector and hidden state
    $z_t = \sigma\!\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right)$
  • reset gate:
    $r_t = \sigma\!\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right)$
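
For completeness, the standard GRU combines the two gates into a new memory content and a final hidden-state update:

```latex
\tilde{h}_t = \tanh\!\left(W x_t + r_t \circ U h_{t-1}\right)
\qquad
h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t
```
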
LSTM
  1. Input gate
  2. Forget gate
  3. Output gate
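
A compact reminder of the usual LSTM equations behind these three gates (standard formulation, same notation as the GRU above, not copied from the slide):

```latex
i_t = \sigma\!\left(W^{(i)} x_t + U^{(i)} h_{t-1}\right), \quad
f_t = \sigma\!\left(W^{(f)} x_t + U^{(f)} h_{t-1}\right), \quad
o_t = \sigma\!\left(W^{(o)} x_t + U^{(o)} h_{t-1}\right)

\tilde{c}_t = \tanh\!\left(W^{(c)} x_t + U^{(c)} h_{t-1}\right), \quad
c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t, \quad
h_t = o_t \circ \tanh(c_t)
```
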
Recent Improvements
  1. Problems with the softmax
    • no zero-shot word prediction: words outside the fixed output vocabulary can never be predicted
    • fix: combine a pointer (copy) mechanism with the softmax

Tricks

  • Problem: the softmax over the whole vocabulary is huge and slow
    • class-based word prediction (instead of a full softmax); see the factorization below
  • only need to run backpropagation once
  • initialize W to the identity matrix and use ReLU (as above)
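
The class-based trick factors the word prediction into a class step and a within-class step (how words are assigned to classes is not specified in these notes):

```latex
p(w_t \mid \text{history}) = p\!\left(c_t \mid \text{history}\right)\; p\!\left(w_t \mid c_t, \text{history}\right)
```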

How to improve word embeddings

  1. Input: go from word-level to subword-level units (a toy BPE merge step is sketched after this list)

    • morphemes / byte-pair encoding (BPE)
    • character embeddings
  2. Regularization

    • preprocessing: replace some words, drop frequent words, and add infrequent words
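
A minimal sketch of the byte-pair-encoding learning loop (the toy corpus and number of merges are assumptions for illustration): repeatedly count adjacent symbol pairs and merge the most frequent one.

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs over a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters plus an end marker) -> frequency.
vocab = {tuple("low") + ("</w>",): 5,
         tuple("lower") + ("</w>",): 2,
         tuple("lowest") + ("</w>",): 3}
for _ in range(3):                  # three merge iterations, for illustration
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(vocab, pair)
    print("merged", pair)
```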

Task List

  1. NER todo: see lecture 8
  2. Machine Translation:

todos:

  • Recap the word vector equations shown at the beginning of lecture 9: Machine Translation and Advanced Recurrent LSTMs and GRUs
  • replicate the NER paper
