CMU 11-785 L15 Divergence of RNN

Variants on recurrent nets

  • Architectures
    • How to train recurrent networks of different architectures
  • Synchrony
    • The target output is time-synchronous with the input
    • The target output is order-synchronous, but not time synchronous

One to one

[Figure 1]

  • No recurrence in model

    • Exactly as many outputs as inputs
    • One to one correspondence between desired output and actual output
  • Common assumption
    $\nabla_{Y(t)} \operatorname{Div}\left(Y_{\text{target}}(1 \ldots T), Y(1 \ldots T)\right) = w_{t}\, \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text{target}}(t), Y(t)\right)$

    • $w_t$ is typically set to 1.0
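
A minimal NumPy sketch of this decomposition, using per-time cross-entropy as the divergence; the function names and the choice of cross-entropy are illustrative assumptions, not prescribed by the lecture:

```python
import numpy as np

def per_time_xent(y_target, y):
    """Cross-entropy divergence between target and output at one time step."""
    return -np.sum(y_target * np.log(y + 1e-12))

def sequence_divergence(Y_target, Y, weights=None):
    """Weighted sum of per-time divergences over (T, C) output sequences.

    With w_t = 1.0 for all t (the default), this matches the common assumption:
    the gradient w.r.t. Y(t) is just w_t times the per-time gradient.
    """
    T = Y.shape[0]
    if weights is None:
        weights = np.ones(T)          # the typical choice w_t = 1.0
    return float(sum(w * per_time_xent(yt, y)
                     for w, yt, y in zip(weights, Y_target, Y)))
```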

Many to many

[Figure 2]

  • The divergence computed is between the sequence of outputs by the network and the desired sequence of outputs
  • This is not just the sum of the divergences at individual times

Language modelling: Representing words

  • Represent words as one-hot vectors

    • Problem: the representation is high-dimensional and sparse
    • Makes no assumptions about the relative importance of words
  • Projected word vectors (a short sketch follows the figure below)

    • Replace every one-hot vector $W_i$ by $PW_i$
    • $P$ is an $M \times N$ matrix, so each $N$-dimensional one-hot vector maps to an $M$-dimensional vector
  • How to learn projections

[Figure 3]
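
A small sketch of the projection step itself (not the learning): because $W_i$ is one-hot, $PW_i$ is simply column $i$ of $P$, i.e. an embedding lookup. The sizes and names (`vocab_size`, `embed_dim`) are illustrative.

```python
import numpy as np

vocab_size, embed_dim = 10000, 300                 # N and M (illustrative sizes)
P = 0.01 * np.random.randn(embed_dim, vocab_size)  # M x N projection; learned in practice

def project(word_index):
    """P @ W_i for a one-hot W_i is simply column i of P."""
    return P[:, word_index]

# Equivalent explicit form: build the one-hot vector and multiply.
w = np.zeros(vocab_size)
w[42] = 1.0
assert np.allclose(P @ w, project(42))
```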

  • Soft bag of words
    • Predict a word based on the words in its immediate context
    • Without considering their specific positions
  • Skip-grams
    • Predict the adjacent words based on the current word (a sketch follows the figure below)

[Figure 4]
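
A bare-bones sketch of the skip-gram objective under these assumptions: the current word's projection is pushed through a softmax to predict each word in a small window around it. There is no negative sampling or other speed-up here, and the window size, shapes, and names are illustrative rather than the lecture's exact setup.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_loss(P, W_out, sentence, window=2):
    """Negative log-likelihood of the adjacent words given each current word.

    P:        (M, N) projection; column i is the embedding of word i
    W_out:    (N, M) output layer mapping an embedding to vocabulary scores
    sentence: list of word indices
    """
    loss = 0.0
    for t, center in enumerate(sentence):
        probs = softmax(W_out @ P[:, center])     # predict the context from the current word
        for k in range(-window, window + 1):      # each adjacent position in the window
            if k == 0 or not (0 <= t + k < len(sentence)):
                continue
            loss -= np.log(probs[sentence[t + k]] + 1e-12)
    return loss
```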

Many to one

  • Example
    • Question answering
      • Input: Sequence of words
      • Output: Answer at the end of the question
    • Speech recognition
      • Input: Sequence of feature vectors (e.g. Mel spectra)
      • Output: Phoneme ID at the end of the sequence

[Figure 5]

  • Outputs are actually produced for every input

    • We only read the output at the end of the sequence
  • How to train

    • Define the divergence everywhere
      • $\mathrm{DIV}\left(Y_{\text{target}}, Y\right) = \sum_{t} w_{t} \operatorname{Xent}\left(Y(t), \text{Phoneme}\right)$
    • Typical weighting scheme for speech
      • All time steps are equally important
    • For a problem like question answering
      • The answer is only expected after the question ends, so only the final output(s) are weighted
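
A sketch contrasting the two weighting schemes just described, again assuming per-time cross-entropy; the helper name and the sequence length are illustrative.

```python
import numpy as np

def many_to_one_divergence(Y_target, Y, weights):
    """DIV(Y_target, Y) = sum_t w_t * Xent(Y(t), target(t)) for (T, C) arrays."""
    xent = -np.sum(Y_target * np.log(Y + 1e-12), axis=1)   # per-time cross-entropy, shape (T,)
    return float(np.dot(weights, xent))

T = 50
speech_weights = np.ones(T)      # speech: every time step weighted equally
qa_weights = np.zeros(T)         # question answering: only the output after the
qa_weights[-1] = 1.0             # question ends contributes to the divergence
```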

Sequence-to-sequence

[Figure 6]

  • How do we know when to output symbols?
    • In fact, the network produces outputs at every time step
    • Which of these are the real outputs?
      • Outputs that represent the definitive occurrence of a symbol

[Figure 7]

  • Option 1: Simply select the most probable symbol at each time (a sketch follows this list)
    • Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
    • Cannot distinguish between an extended symbol and repetitions of the symbol
    • Resulting sequence may be meaningless
  • Option 2: Impose external constraints on what sequences are allowed
    • Only allow sequences corresponding to dictionary words
    • Sub-symbol units
  • How to train when no timing information is provided
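
A sketch of Option 1 above: take the most probable symbol at each time and merge adjacent repeats. As noted, this cannot distinguish an extended symbol from a genuine repetition; the names are illustrative.

```python
import numpy as np

def greedy_decode(Y, symbols):
    """Y: (T, C) per-time output distributions; symbols: the C symbol labels."""
    best = np.argmax(Y, axis=1)                  # most probable symbol at each time
    decoded = []
    for idx in best:                             # merge adjacent repeated symbols
        if not decoded or decoded[-1] != symbols[idx]:
            decoded.append(symbols[idx])
    return decoded
```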

[Figure 8]

  • Only the sequence of output symbols is provided for the training data
    • But no indication of which one occurs where
  • How do we compute the divergence?
    • And how do we compute its gradient?
