CMU 11-785 L14 Stability analysis and LSTMs

Stability

  • Will this necessarily be "Bounded Input Bounded Output" (BIBO)?
    • Guaranteed if the output and hidden activations are bounded
    • But will it saturate?

Analyzing Recursion

  • Sufficient to analyze the behavior of the hidden layer since it carries the relevant information

  • Assume a linear system

    • $z_k = W_h h_{k-1} + W_x x_k, \quad h_k = z_k$

[Figure 1]

  • Sufficient to analyze the response to a single input at $t = 0$ (the input is zero at all other times)

Simple scalar linear recursion

  • $h(t) = w\,h(t-1) + c\,x(t)$
  • $h_0(t) = w^t c\, x(0)$
  • If $|w| > 1$ the response blows up

Simple Vector linear recursion

  • $h(t) = W h(t-1) + C x(t)$
  • $h_0(t) = W^t C x(0)$
    [Figure 2]
  • For any input, for large $t$ the length of the hidden vector will expand or contract according to the $t$-th power of the magnitude of the largest eigenvalue of the hidden-layer weight matrix
  • If $|\lambda_{max}| > 1$ the response blows up; otherwise it contracts and shrinks to 0 rapidly (see the sketch below)

[Figure 3]
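A minimal NumPy sketch of this behavior (the matrix size, random seed, and the helper name `response_norm` are illustrative choices, not part of the lecture): iterating $h_t = W h_{t-1}$ from a single initial input shows the norm growing or decaying with the magnitude of the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)

def response_norm(W, x0, steps=30):
    """Iterate h_t = W h_{t-1} starting from h_0 = x0 and return ||h_steps||."""
    h = x0.copy()
    for _ in range(steps):
        h = W @ h
    return np.linalg.norm(h)

W = rng.standard_normal((4, 4))
W /= np.abs(np.linalg.eigvals(W)).max()      # rescale so |lambda_max| = 1
x0 = rng.standard_normal(4)

# Scaling W scales its eigenvalues by the same factor.
print("|lambda_max| = 1.1:", response_norm(1.1 * W, x0))   # grows roughly like 1.1^30
print("|lambda_max| = 0.9:", response_norm(0.9 * W, x0))   # shrinks toward 0
```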

Non-linearities

  • Sigmoid: saturates in a limited number of steps, regardless of $w$
    • To a value dependent only on $w$ (and bias, if any)
    • The rate of saturation depends on $w$
  • Tanh: sensitive to $w$, but eventually saturates
    • "Prefers" weights close to 1.0
  • ReLU: sensitive to $w$, can blow up (see the sketch below)

[Figure 4]
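A quick sketch of the scalar recursion $h(t) = f(w\,h(t-1))$ with a single nonzero input at $t=0$; the specific weight values and the helper name `run` are illustrative assumptions, not the lecture's:

```python
import numpy as np

def run(f, w, h0=1.0, steps=50):
    """Scalar recursion h(t) = f(w * h(t-1)) after a single input h(0) = h0."""
    h = h0
    for _ in range(steps):
        h = f(w * h)
    return h

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(0.0, z)

# Sigmoid settles to a fixed point that depends only on w; tanh decays for small w
# and saturates for large w; ReLU vanishes (w < 1) or blows up (w > 1).
for w in (0.5, 1.0, 2.0):
    print(f"w={w}: sigmoid -> {run(sigmoid, w):.4f}, "
          f"tanh -> {run(np.tanh, w):.4f}, ReLU -> {run(relu, w):.3e}")
```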

Lessons

  • Recurrent networks retain information from the infinite past in principle
  • In practice, they tend to blow up or forget
    • If the largest eigenvalue of the recurrent weight matrix is greater than 1, the network response may blow up
    • If it is less than one, the response dies down very quickly
  • The "memory" of the network also depends on the parameters (and activation) of the hidden units
    • Sigmoid activations saturate, and the network becomes unable to retain new information
    • ReLU activations blow up or vanish rapidly
    • Tanh activations are the most effective at storing memory
      • But the memory is still very short
      • And still sensitive to the eigenvalues of $W$

Vanishing gradient

  • A particular problem with training deep networks is that the gradient of the error with respect to the weights is unstable
  • For
    • $\operatorname{Div}(X) = D\left(f_N\left(W_{N-1} f_{N-1}\left(W_{N-2} f_{N-2}\left(\ldots W_0 X\right)\right)\right)\right)$
  • We get
    • $\nabla_{f_k} \operatorname{Div} = \nabla D \cdot \nabla f_N \cdot W_{N-1} \cdot \nabla f_{N-1} \cdot W_{N-2} \cdots \nabla f_{k+1} W_k$
  • Where
    • $\nabla f_n$ is the Jacobian of $f_n()$ with respect to its current input (see the sketch after this list)
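A rough numerical illustration of this chain product, using random matrices and tanh Jacobians purely for intuition (the dimensions, scaling factor, and stand-in pre-activations are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, steps = 8, 40

W = 0.1 * rng.standard_normal((dim, dim))    # illustrative recurrent weights (small scale)
grad = rng.standard_normal(dim)              # stand-in for the divergence gradient
print("initial gradient norm:", np.linalg.norm(grad))

for _ in range(steps):
    z = rng.standard_normal(dim)             # stand-in pre-activations
    jac = np.diag(1.0 - np.tanh(z) ** 2)     # diagonal tanh Jacobian, entries <= 1
    grad = grad @ jac @ W                    # one backward step: multiply by grad-f and W
# With small weights the product shrinks geometrically; with large weights it can instead explode.
print("gradient norm after 40 steps:", np.linalg.norm(grad))
```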

For activation

  • For an RNN
    • $\nabla f_t(z) = \begin{bmatrix} f_{t,1}'(z_1) & 0 & \cdots & 0 \\ 0 & f_{t,2}'(z_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f_{t,N}'(z_N) \end{bmatrix}$
    • For vector activations: a full matrix
    • For scalar activations: a diagonal matrix whose diagonal entries are the derivatives of the activation of the recurrent hidden layer
  • The derivative (or subgradient) of the activation function is always bounded
  • The most common activation functions, such as sigmoid, tanh() and ReLU, have derivatives that never exceed 1
    • Multiplication by the Jacobian is therefore a shrinking (or at best non-expanding) operation
    • After a few layers, the derivative of the divergence at any time is totally "forgotten"

For weights

  • In a single-layer RNN, the weight matrices at every step are identical
    • The conclusion below holds for any deep network, though
  • The chain product for $\nabla_{f_k} Div$ will
    • Expand $\nabla D$ along directions in which the singular values of the weight matrices are greater than 1
    • Shrink $\nabla D$ along directions where the singular values are less than 1
    • Repeated multiplication by the weight matrix therefore results in exploding or vanishing gradients (see the sketch below)
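A sketch of the singular-value view. To keep the arithmetic exact under repeated multiplication, the example uses a symmetric matrix (so singular values coincide with eigenvalue magnitudes); the values 1.3 and 0.6 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Symmetric weight matrix with singular values 1.3 and 0.6: repeated multiplication
# scales the two principal directions by exactly 1.3^t and 0.6^t.
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))
W = Q @ np.diag([1.3, 0.6]) @ Q.T

g = Q @ np.ones(2)                 # a "gradient" with components along both directions
for t in (1, 5, 20, 50):
    components = Q.T @ (np.linalg.matrix_power(W, t) @ g)
    print(t, components)           # first component explodes, second vanishes
```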

LSTM

Problem

  • Recurrent nets are very deep nets
  • Information gets forgotten in the forward pass too
    • Each weight matrix and activation can shrink components of the input
  • We need long-term dependencies
  • The memory retention of the network depends on the behavior of the weights and the Jacobians
  • Which in turn depend on the parameters $W$ rather than on what the network is trying to remember
  • We need memory that
    • Is not directly dependent on the vagaries of network parameters, but rather on an input-based determination of whether something must be remembered
    • Is retained until a switch based on the input flags it as OK to forget
      • "A curly brace must be remembered until the curly brace is closed"

Architecture

[Figure 5]

  • The $\sigma$ units are multiplicative gates that decide whether something is important or not

Key component

Remembered cell state

[Figure 6]

  • The multiply is a switch
    • Should I continue remembering or not? (scale up / down)
  • The addition
    • Should I augment the memory?
  • $C_t$ is the linear history carried by the constant-error carousel
  • It carries information through, affected only by a gate
    • And by the addition of new history, which is also gated…

Gates

[Figure 7]

  • Gates are simple sigmoidal units with outputs in the range (0, 1)
  • They control how much of the information is to be let through

Forget gate

[Figure 8]

  • The first gate determines whether to carry over the history or to forget it
    • More precisely, how much of the history to carry over
    • Also called the "forget" gate
    • Note, we are actually distinguishing between the cell memory $C$ and the state $h$ that is passed on over time! They are related, though
      • The hidden state is computed from the memory (which is what gets stored)

Input gate

[Figure 9]

  • The second input has two parts
    • A perceptron layer that determines if there is something new and interesting in the input
      • "I see a curly brace"
    • A gate that decides if it is worth remembering
      • "The curly brace is in a comment section, ignore it"

Memory cell update

[Figure 10]

  • If there is something new and worth remembering
    • It is added to the current memory cell (the standard update equation is given below)
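For reference, since the figure carrying the equation is not reproduced, the standard LSTM memory-cell update combines the forget-gated old memory with the input-gated candidate:

$$C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t, \qquad \tilde{C}_t = \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right)$$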

Output and Output gate

[Figure 11]

  • The output of the cell
    • Simply compress the memory with tanh to make it lie between −1 and 1
      • Note that this compression no longer affects our ability to carry memory forward
    • Controlled by an output gate
      • To decide whether the memory contents are worth reporting at this time

The “Peephole” Connection

[Figure 12]

  • The raw memory is informative by itself and can also be used as an input
    • Note, we are now using both $C$ and $h$

Forward

[Figure 13]
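Since the forward-pass figure is not reproduced, here is a minimal NumPy sketch of one LSTM step under the standard, peephole-free formulation; the function name `lstm_step`, the parameter dictionary keys, and the toy dimensions are illustrative assumptions rather than the lecture's notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, p):
    """One LSTM time step (no peepholes): returns the new hidden state h and cell state C."""
    hx = np.concatenate([h_prev, x])
    f = sigmoid(p["Wf"] @ hx + p["bf"])          # forget gate
    i = sigmoid(p["Wi"] @ hx + p["bi"])          # input gate
    C_tilde = np.tanh(p["Wc"] @ hx + p["bc"])    # candidate memory ("detected pattern")
    C = f * C_prev + i * C_tilde                 # constant-error carousel update
    o = sigmoid(p["Wo"] @ hx + p["bo"])          # output gate
    h = o * np.tanh(C)                           # compressed, gated output
    return h, C

# Toy usage with random parameters (nh hidden units, nx inputs).
rng = np.random.default_rng(0)
nh, nx = 3, 2
p = {k: 0.1 * rng.standard_normal((nh, nh + nx)) for k in ("Wf", "Wi", "Wc", "Wo")}
p.update({k: np.zeros(nh) for k in ("bf", "bi", "bc", "bo")})

h, C = np.zeros(nh), np.zeros(nh)
for _ in range(5):
    h, C = lstm_step(rng.standard_normal(nx), h, C, p)
print("h:", h, "\nC:", C)
```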

Backward [1]

[Figure 14]

$$
\begin{aligned}
\nabla_{C_t} Div ={} & \nabla_{h_t} Div \circ \left(o_t \circ \tanh'(\cdot) + \tanh(\cdot) \circ \sigma'(\cdot) W_{Co}\right) \\
& + \nabla_{C_{t+1}} Div \circ \left(f_{t+1} + C_t \circ \sigma'(\cdot) W_{Cf} + \tilde{C}_{t+1} \circ \sigma'(\cdot) W_{Ci} \circ \tanh(\cdot) \ldots\right)
\end{aligned}
$$

[Figure 15]

$$
\begin{aligned}
\nabla_{h_t} Div ={} & \nabla_{z_t} Div\, \nabla_{h_t} z_t + \nabla_{C_{t+1}} Div \circ \left(C_t \circ \sigma'(\cdot) W_{hf} + \tilde{C}_{t+1} \circ \sigma'(\cdot) W_{hi}\right) \\
& + \nabla_{C_{t+1}} Div \circ o_{t+1} \circ \tanh'(\cdot) W_{hi} + \nabla_{h_{t+1}} Div \circ \tanh(\cdot) \circ \sigma'(\cdot) W_{ho}
\end{aligned}
$$
And weights?

Gated Recurrent Units

[Figure 16]

  • Combine the forget and input gates
    • If new input is to be remembered, then old memory is to be forgotten
    • No need to compute the two separately

[Figure 17]

  • Do not bother to separately maintain compressed and regular memories
    • That is a redundant representation (see the sketch below)
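A corresponding sketch of one GRU step under one common convention (the names `gru_step`, the parameter keys, and the toy setup are illustrative assumptions); note how a single update gate plays both the forget and input roles, and a single state $h$ replaces the $C$/$h$ pair:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h_prev, p):
    """One GRU time step: a single state h serves as both memory and output."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(p["Wz"] @ hx + p["bz"])      # update gate: forget and input combined
    r = sigmoid(p["Wr"] @ hx + p["br"])      # reset gate
    h_tilde = np.tanh(p["Wh"] @ np.concatenate([r * h_prev, x]) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde  # interpolate between old and new memory

# Toy usage with random parameters.
rng = np.random.default_rng(0)
nh, nx = 3, 2
p = {k: 0.1 * rng.standard_normal((nh, nh + nx)) for k in ("Wz", "Wr", "Wh")}
p.update({k: np.zeros(nh) for k in ("bz", "br", "bh")})
h = np.zeros(nh)
for _ in range(5):
    h = gru_step(rng.standard_normal(nx), h, p)
print(h)
```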

Summary

  • LSTMs are an alternative formalism where memory is made more directly dependent on the input, rather than on network parameters/structure
  • This is done through a "constant error carousel" memory structure with no weights or activations, but instead direct switching and "increment/decrement" driven by pattern recognizers
  • They do not suffer from the vanishing gradient problem, but do still suffer from the exploding gradient issue

[1] http://arunmallya.github.io/writeups/nn/lstm/index.html#/
