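These are my notes on Liu Jianping's post "循环神经网络(RNN)模型与前向反向传播算法" (Recurrent Neural Network (RNN) Model and Forward/Backward Propagation Algorithms). They include some formula derivations and reflect my own understanding; if you spot any mistakes, please point them out.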
The mainstream RNN model:
1. $\boldsymbol x^{(t)}$ is the training-sample input at sequence index $t$, an $n\times 1$ vector;
2. $\boldsymbol h^{(t)}$ is the hidden state of the model at sequence index $t$, an $m\times 1$ vector;
3. $\boldsymbol o^{(t)}$ is the output of the model at sequence index $t$, an $l\times 1$ vector;
4. $L^{(t)}$ is the loss of the model at sequence index $t$, a scalar;
5. $\boldsymbol y^{(t)}$ is the true output (label) of the training sample at sequence index $t$, an $l\times 1$ vector;
6. $\boldsymbol U,\boldsymbol W,\boldsymbol V$ are the parameters of the linear maps, where $\boldsymbol U$ is an $m\times n$ matrix, $\boldsymbol W$ is an $m\times m$ matrix, and $\boldsymbol V$ is an $l\times m$ matrix.
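(Details omitted.) The derivations below use the forward relations $\boldsymbol h^{(t)}=\tanh\left(\boldsymbol U\boldsymbol x^{(t)}+\boldsymbol W\boldsymbol h^{(t-1)}+\boldsymbol b\right)$ and $\boldsymbol o^{(t)}=\boldsymbol V\boldsymbol h^{(t)}+\boldsymbol c$, where $\boldsymbol b$ ($m\times 1$) and $\boldsymbol c$ ($l\times 1$) are bias vectors.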
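To make the shapes concrete, here is a minimal NumPy sketch of the forward pass. It assumes the relations used in the later derivations, $\boldsymbol h^{(t)}=\tanh(\boldsymbol U\boldsymbol x^{(t)}+\boldsymbol W\boldsymbol h^{(t-1)}+\boldsymbol b)$ and $\boldsymbol o^{(t)}=\boldsymbol V\boldsymbol h^{(t)}+\boldsymbol c$, followed by a softmax. The function name `rnn_forward`, the sizes, and the zero initial state `h0` are my own choices for illustration, and vectors are stored as 1-D arrays rather than $n\times 1$ columns.

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, c, h0):
    """Run the RNN over a sequence; return hidden states and softmax outputs."""
    hs, y_hats = [], []
    h = h0
    for x in xs:
        h = np.tanh(U @ x + W @ h + b)   # hidden state, shape (m,)
        o = V @ h + c                    # output scores, shape (l,)
        e = np.exp(o - o.max())          # shift by the max for numerical stability
        hs.append(h)
        y_hats.append(e / e.sum())       # softmax prediction, shape (l,)
    return hs, y_hats

# Hypothetical sizes: input n, hidden m, output l, sequence length tau.
rng = np.random.default_rng(0)
n, m, l, tau = 4, 3, 5, 6
U = rng.normal(size=(m, n))
W = rng.normal(size=(m, m))
V = rng.normal(size=(l, m))
b = np.zeros(m)
c = np.zeros(l)
xs = [rng.normal(size=n) for _ in range(tau)]

hs, y_hats = rnn_forward(xs, U, W, V, b, c, h0=np.zeros(m))
print(hs[0].shape, y_hats[0].shape, y_hats[0].sum())
```

Each `hs[t]` has length $m$ and each `y_hats[t]` has length $l$ and sums to 1, matching the dimensions listed above.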
1. The hidden-layer activation is the tanh function, $\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$,
whose derivative is $1-\tanh^2(z)$.
2. The output-layer activation is the softmax function:
$$\hat{y}^{(t)}_i=\frac{e^{o^{(t)}_i}}{\sum\limits_{j=1}^l e^{o^{(t)}_j}}$$
In vector form:
$$\hat{\boldsymbol y}^{(t)}=\frac{e^{\boldsymbol o^{(t)}}}{\boldsymbol 1^T e^{\boldsymbol o^{(t)}}}$$
3. The loss function is
$$L=\sum\limits_{t=1}^{\tau} L^{(t)}$$
Since the true output at step $t$ belongs to class $i$, we have $y_i^{(t)}=1$ and $y_j^{(t)}=0$ for $j\neq i$, so $L^{(t)}$ can be written as:
$$\begin{aligned}
L^{(t)}&=-(\boldsymbol y^{(t)})^T\ln \hat{\boldsymbol y}^{(t)} \\
&=-\sum\limits_{k=1}^{l} y_k^{(t)} \ln \hat{y}^{(t)}_k \\
&=-\ln \hat{y}^{(t)}_i \\
&=\ln \sum_{j=1}^l e^{o^{(t)}_j}-o^{(t)}_i
\end{aligned}$$
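The three assumptions above can be checked numerically. The sketch below uses made-up values for $\boldsymbol o^{(t)}$ and a one-hot $\boldsymbol y^{(t)}$, and the helper name `softmax` is my own. It compares $1-\tanh^2(z)$ with a central difference, confirms that the softmax output sums to 1, and confirms that the cross-entropy form of $L^{(t)}$ equals the log-sum-exp form.

```python
import numpy as np

def softmax(o):
    """Softmax with the usual max-shift for numerical stability."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

# 1. tanh derivative: compare 1 - tanh^2(z) with a central difference.
z = np.linspace(-3.0, 3.0, 13)
eps = 1e-6
tanh_numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
tanh_analytic = 1.0 - np.tanh(z) ** 2
tanh_err = np.max(np.abs(tanh_numeric - tanh_analytic))

# 2. softmax: the prediction is a probability vector.
o = np.array([0.5, -1.2, 2.0, 0.3])        # made-up output scores
y_hat = softmax(o)

# 3. loss: -y^T ln(y_hat) equals ln(sum_j exp(o_j)) - o_i for a one-hot y.
y = np.array([0.0, 0.0, 1.0, 0.0])          # true class i = 2 (0-based)
i = int(np.argmax(y))
loss_ce = -y @ np.log(y_hat)
loss_lse = np.log(np.exp(o).sum()) - o[i]

print(tanh_err, y_hat.sum(), loss_ce, loss_lse)
```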
For the gradients of $\boldsymbol V$ and $\boldsymbol c$, first compute $\frac{\partial L^{(t)}}{\partial \boldsymbol o^{(t)}}$. Since $L^{(t)}$ is a scalar and $\boldsymbol o^{(t)}$ is an $l\times 1$ vector, $\frac{\partial L^{(t)}}{\partial \boldsymbol o^{(t)}}$ is also an $l\times 1$ vector:
$$\frac{\partial L^{(t)}}{\partial \boldsymbol o^{(t)}}=\left( \frac{\partial L^{(t)}}{\partial o_1^{(t)}}\ \dots\ \frac{\partial L^{(t)}}{\partial o_i^{(t)}}\ \dots\ \frac{\partial L^{(t)}}{\partial o_l^{(t)}}\right)^T$$
Here, when $k\neq i$,
$$\frac{\partial L^{(t)}}{\partial o_k^{(t)}}=\frac{e^{o^{(t)}_k}}{\sum\limits_{j=1}^l e^{o^{(t)}_j}}=\hat{y}^{(t)}_k=\hat{y}^{(t)}_k-y^{(t)}_k$$
and when $k=i$,
$$\frac{\partial L^{(t)}}{\partial o_i^{(t)}}=\frac{e^{o^{(t)}_i}}{\sum\limits_{j=1}^l e^{o^{(t)}_j}}-1=\hat{y}^{(t)}_i-y_i^{(t)}$$
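As a quick sanity check of the two cases, the sketch below compares $\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}$ against a central-difference gradient of $L^{(t)}=\ln\sum_j e^{o_j^{(t)}}-o_i^{(t)}$. The scores in `o` and the class index `i` are made-up values, and `step_loss` is a helper name of my own.

```python
import numpy as np

def step_loss(o, i):
    """Per-step loss L = ln(sum_j exp(o_j)) - o_i for true class i."""
    return np.log(np.exp(o).sum()) - o[i]

o = np.array([0.2, -0.7, 1.5, 0.1])   # made-up output scores
i = 2                                  # made-up true class (0-based)
y = np.eye(len(o))[i]                  # one-hot label
y_hat = np.exp(o) / np.exp(o).sum()    # softmax prediction

eps = 1e-6
# Perturb one coordinate of o at a time (rows of the identity matrix).
numeric = np.array([
    (step_loss(o + eps * e, i) - step_loss(o - eps * e, i)) / (2 * eps)
    for e in np.eye(len(o))
])
analytic = y_hat - y

print(np.max(np.abs(numeric - analytic)))
```

The entry for the true class is negative ($\hat y_i-1$) and all others are positive ($\hat y_k$), as the two cases predict.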
In summary, $\frac{\partial L^{(t)}}{\partial \boldsymbol o^{(t)}}=\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}$. A matrix-differential derivation reaches the same conclusion; the derivation can be followed with reference to this article:
$$\begin{aligned}
dL^{(t)}&=-d\left((\boldsymbol y^{(t)})^T\ln \hat{\boldsymbol y}^{(t)}\right)\\
&=\operatorname{tr}\left( \frac{(\boldsymbol y^{(t)})^T\,\boldsymbol 1\,\boldsymbol 1^T de^{\boldsymbol o^{(t)}}}{\boldsymbol 1^T e^{\boldsymbol o^{(t)}}}-(\boldsymbol y^{(t)})^Td\boldsymbol o^{(t)}\right) \\
&=\operatorname{tr}\left(\left( \frac{(e^{\boldsymbol o^{(t)}})^T}{\boldsymbol 1^T e^{\boldsymbol o^{(t)}}}-(\boldsymbol y^{(t)})^T\right)d\boldsymbol o^{(t)}\right)\\
&=\operatorname{tr}\left( \left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)^Td\boldsymbol o^{(t)}\right)\\
&=\operatorname{tr}\left( \left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)^Td\left(\boldsymbol V\boldsymbol h^{(t)}+\boldsymbol c\right)\right)\\
&=\operatorname{tr}\left( \left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)^T(d\boldsymbol V)\,\boldsymbol h^{(t)}+ \left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)^Td\boldsymbol c\right)\\
&=\operatorname{tr}\left( \boldsymbol h^{(t)} \left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)^Td\boldsymbol V+ \left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)^Td\boldsymbol c\right)
\end{aligned}$$
From the expression above, $\frac{\partial L^{(t)}}{\partial \boldsymbol V}= \left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)(\boldsymbol h^{(t)})^T$ and $\frac{\partial L^{(t)}}{\partial \boldsymbol c}=\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}$. The derivation uses $(\boldsymbol y^{(t)})^T\boldsymbol 1=\sum_{i=1}^l y^{(t)}_i=1$ and $\boldsymbol 1^T de^{\boldsymbol o^{(t)}}=\boldsymbol 1^T\left(e^{\boldsymbol o^{(t)}}\odot d\boldsymbol o^{(t)}\right)=\left(\boldsymbol 1\odot e^{\boldsymbol o^{(t)}}\right)^Td\boldsymbol o^{(t)}=\left(e^{\boldsymbol o^{(t)}}\right)^Td\boldsymbol o^{(t)}$.
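Summing over all time steps (since $L=\sum_{t=1}^{\tau}L^{(t)}$), the conclusion is consistent with the following:
$$\frac{\partial L}{\partial \boldsymbol c}=\sum\limits_{t=1}^{\tau}\left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right),\qquad \frac{\partial L}{\partial \boldsymbol V}=\sum\limits_{t=1}^{\tau}\left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)(\boldsymbol h^{(t)})^T$$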
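The per-step gradients with respect to $\boldsymbol V$ and $\boldsymbol c$ can also be checked with finite differences. This is only a sketch under my own assumptions: the sizes, the random values, and the helper `step_loss` are made up, and $\boldsymbol h^{(t)}$ is treated as a fixed vector, just as in the derivation.

```python
import numpy as np

def step_loss(V, c, h, y):
    """L^(t) = ln(sum_j exp(o_j)) - y^T o with o = V h + c (y is one-hot)."""
    o = V @ h + c
    return np.log(np.exp(o).sum()) - y @ o

rng = np.random.default_rng(1)
l, m = 4, 3                                # hypothetical output / hidden sizes
V = rng.normal(size=(l, m))
c = rng.normal(size=l)
h = np.tanh(rng.normal(size=m))            # stand-in hidden state
y = np.eye(l)[2]                           # made-up one-hot label

o = V @ h + c
y_hat = np.exp(o) / np.exp(o).sum()
grad_V = np.outer(y_hat - y, h)            # (y_hat - y) h^T, shape (l, m)
grad_c = y_hat - y                         # shape (l,)

eps = 1e-6
num_V = np.zeros_like(V)
for r in range(l):
    for s in range(m):
        E = np.zeros_like(V)
        E[r, s] = eps                      # perturb a single entry of V
        num_V[r, s] = (step_loss(V + E, c, h, y) - step_loss(V - E, c, h, y)) / (2 * eps)

num_c = np.array([
    (step_loss(V, c + eps * e, h, y) - step_loss(V, c - eps * e, h, y)) / (2 * eps)
    for e in np.eye(l)
])

print(np.max(np.abs(grad_V - num_V)), np.max(np.abs(grad_c - num_c)))
```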
Next, to obtain the gradients of $\boldsymbol W,\boldsymbol U,\boldsymbol b$, we first need $\boldsymbol\delta^{(t)}= \frac{\partial L}{\partial \boldsymbol h^{(t)}}$. For $t<\tau$, $\boldsymbol h^{(t)}$ affects $L$ through both $\boldsymbol o^{(t)}$ and $\boldsymbol h^{(t+1)}$, so with the parameters held fixed:
$$\begin{aligned}
dL&=\left(\frac{\partial L^{(t)}}{\partial \boldsymbol o^{(t)}}\right)^Td\boldsymbol o^{(t)}+ \left(\frac{\partial L}{\partial \boldsymbol h^{(t+1)}}\right)^Td\boldsymbol h^{(t+1)}\\
&=\left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)^Td\left(\boldsymbol V\boldsymbol h^{(t)}+\boldsymbol c\right)+(\boldsymbol\delta^{(t+1)})^T\left(\frac{\partial \boldsymbol h^{(t+1)}}{\partial \boldsymbol h^{(t)}}\right)^Td\boldsymbol h^{(t)}\\
&=\operatorname{tr}\left(\left(\left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)^T \boldsymbol V+(\boldsymbol\delta^{(t+1)})^T\left(\frac{\partial \boldsymbol h^{(t+1)}}{\partial \boldsymbol h^{(t)}}\right)^T\right) d\boldsymbol h^{(t)}\right)
\end{aligned}$$
Therefore:
$$\frac{\partial L}{\partial \boldsymbol h^{(t)}}=\boldsymbol V^T\left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)+\left(\frac{\partial \boldsymbol h^{(t+1)}}{\partial \boldsymbol h^{(t)}}\right)\boldsymbol\delta^{(t+1)}$$
where, varying only $\boldsymbol h^{(t)}$,
$$\begin{aligned}
d\boldsymbol h^{(t+1)}&=d\tanh\left(\boldsymbol U\boldsymbol x^{(t+1)}+ \boldsymbol W\boldsymbol h^{(t)}+\boldsymbol b\right)\\
&=\left(\boldsymbol 1-(\boldsymbol h^{(t+1)})^2\right)\odot d\left(\boldsymbol W\boldsymbol h^{(t)}\right)\\
&=\operatorname{diag}\left(\boldsymbol 1-(\boldsymbol h^{(t+1)})^2\right)d\left(\boldsymbol W\boldsymbol h^{(t)}\right)\\
&=\operatorname{diag}\left(\boldsymbol 1-(\boldsymbol h^{(t+1)})^2\right)\boldsymbol W\,d\boldsymbol h^{(t)}
\end{aligned}$$
Hence, with the convention $d\boldsymbol h^{(t+1)}=\left(\frac{\partial \boldsymbol h^{(t+1)}}{\partial \boldsymbol h^{(t)}}\right)^Td\boldsymbol h^{(t)}$ used above:
$$\frac{\partial \boldsymbol h^{(t+1)}}{\partial \boldsymbol h^{(t)}}=\boldsymbol W^T\operatorname{diag}\left(\boldsymbol 1-(\boldsymbol h^{(t+1)})^2\right)$$
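The layout of this derivative is easy to get wrong, so here is a small numerical check. It is a sketch with made-up sizes and random values, and the helper `step` is my own name. In the convention of these notes, entry $(i,j)$ of $\frac{\partial \boldsymbol h^{(t+1)}}{\partial \boldsymbol h^{(t)}}$ is $\frac{\partial h^{(t+1)}_j}{\partial h^{(t)}_i}$, i.e. the transpose of the usual Jacobian $\operatorname{diag}(\cdot)\boldsymbol W$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 3, 4                                   # hypothetical input / hidden sizes
U = rng.normal(size=(m, n))
W = rng.normal(size=(m, m))
b = rng.normal(size=m)
x_next = rng.normal(size=n)                   # stands in for x^(t+1)
h = np.tanh(rng.normal(size=m))               # stands in for h^(t)

def step(h_t):
    """One recurrence step: h^(t+1) = tanh(U x^(t+1) + W h^(t) + b)."""
    return np.tanh(U @ x_next + W @ h_t + b)

h_next = step(h)
# Formula from the notes: W^T diag(1 - (h^(t+1))^2).
formula = W.T @ np.diag(1.0 - h_next ** 2)

eps = 1e-6
# Row i holds d h^(t+1) / d h^(t)_i, which matches the transposed layout.
numeric = np.array([
    (step(h + eps * e) - step(h - eps * e)) / (2 * eps)
    for e in np.eye(m)
])

print(np.max(np.abs(formula - numeric)))
```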
In summary, for $t<\tau$, $\boldsymbol\delta^{(t)}= \frac{\partial L}{\partial \boldsymbol h^{(t)}}=\boldsymbol V^T\left(\hat{\boldsymbol y}^{(t)}-\boldsymbol y^{(t)}\right)+\boldsymbol W^T\operatorname{diag}\left(\boldsymbol 1-(\boldsymbol h^{(t+1)})^2\right)\boldsymbol\delta^{(t+1)}$. At the last step $t=\tau$ there is no successor term, so $\boldsymbol\delta^{(\tau)}=\boldsymbol V^T\left(\hat{\boldsymbol y}^{(\tau)}-\boldsymbol y^{(\tau)}\right)$.
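The recursion for $\boldsymbol\delta^{(t)}$ runs from the last step back to the first. The sketch below is a minimal illustration with made-up sizes and random data, using 0-based indices so that `deltas[t]` corresponds to $\boldsymbol\delta^{(t+1)}$ in the notation of these notes. The helper `loss_from` is my own: it recomputes the part of the loss that depends on a given hidden state, so the recursion can be compared with a finite difference.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, l, tau = 3, 4, 5, 6                     # hypothetical sizes
U = rng.normal(size=(m, n)) * 0.5
W = rng.normal(size=(m, m)) * 0.5
V = rng.normal(size=(l, m)) * 0.5
b = rng.normal(size=m) * 0.1
c = rng.normal(size=l) * 0.1
xs = rng.normal(size=(tau, n))
ys = np.eye(l)[rng.integers(0, l, size=tau)]  # random one-hot labels

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def loss_from(t, h_t):
    """Sum of L^(s) for s >= t, as a function of the hidden state at step t."""
    h, total = h_t, 0.0
    for s in range(t, tau):
        if s > t:
            h = np.tanh(U @ xs[s] + W @ h + b)
        total -= ys[s] @ np.log(softmax(V @ h + c))
    return total

# Forward pass, storing hidden states and predictions.
hs, y_hats, h = [], [], np.zeros(m)
for t in range(tau):
    h = np.tanh(U @ xs[t] + W @ h + b)
    hs.append(h)
    y_hats.append(softmax(V @ h + c))

# Backward recursion for delta.
deltas = [None] * tau
deltas[-1] = V.T @ (y_hats[-1] - ys[-1])      # last step: no successor term
for t in range(tau - 2, -1, -1):
    back = W.T @ ((1.0 - hs[t + 1] ** 2) * deltas[t + 1])  # W^T diag(1-h^2) delta
    deltas[t] = V.T @ (y_hats[t] - ys[t]) + back

# Finite-difference check at one step.
t, eps = 1, 1e-6
numeric = np.array([
    (loss_from(t, hs[t] + eps * e) - loss_from(t, hs[t] - eps * e)) / (2 * eps)
    for e in np.eye(m)
])
print(np.max(np.abs(deltas[t] - numeric)))
```

Multiplying elementwise by `1 - hs[t + 1] ** 2` is the same as multiplying by the diagonal matrix, without building it.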
With $\boldsymbol\delta^{(t)}$ known, take the differential of $L$ with respect to the parameters. At each step only the direct dependence of $\boldsymbol h^{(t)}$ on $\boldsymbol U,\boldsymbol W,\boldsymbol b$ is counted, with $\boldsymbol h^{(t-1)}$ held fixed, because the indirect dependence through later steps is already contained in $\boldsymbol\delta^{(t)}$:
$$\begin{aligned}
dL&=\sum\limits_{t=1}^{\tau}\left(\frac{\partial L}{\partial \boldsymbol h^{(t)}}\right)^Td\boldsymbol h^{(t)}\\
&=\sum\limits_{t=1}^{\tau}\left(\frac{\partial L}{\partial \boldsymbol h^{(t)}}\right)^Td\tanh\left(\boldsymbol U\boldsymbol x^{(t)}+ \boldsymbol W\boldsymbol h^{(t-1)}+\boldsymbol b\right)\\
&=\sum\limits_{t=1}^{\tau}\operatorname{tr}\left(\left(\frac{\partial L}{\partial \boldsymbol h^{(t)}}\right)^T\left(\left(\boldsymbol 1-(\boldsymbol h^{(t)})^2\right)\odot d\left(\boldsymbol U\boldsymbol x^{(t)}+ \boldsymbol W\boldsymbol h^{(t-1)}+\boldsymbol b\right)\right)\right)\\
&=\sum\limits_{t=1}^{\tau}\operatorname{tr}\left(\left(\frac{\partial L}{\partial \boldsymbol h^{(t)}}\odot \left(\boldsymbol 1-(\boldsymbol h^{(t)})^2\right)\right)^T\left((d\boldsymbol U)\boldsymbol x^{(t)}+(d\boldsymbol W)\boldsymbol h^{(t-1)}+d\boldsymbol b\right)\right)\\
&=\sum\limits_{t=1}^{\tau}\operatorname{tr}\left(\boldsymbol x^{(t)}\left(\operatorname{diag}\left(\boldsymbol 1-(\boldsymbol h^{(t)})^2\right)\boldsymbol\delta^{(t)}\right)^Td\boldsymbol U+\boldsymbol h^{(t-1)}\left(\operatorname{diag}\left(\boldsymbol 1-(\boldsymbol h^{(t)})^2\right)\boldsymbol\delta^{(t)}\right)^Td\boldsymbol W+\left(\operatorname{diag}\left(\boldsymbol 1-(\boldsymbol h^{(t)})^2\right)\boldsymbol\delta^{(t)}\right)^Td\boldsymbol b\right)
\end{aligned}$$
So the gradients of $\boldsymbol W,\boldsymbol U,\boldsymbol b$ are:
$$\begin{aligned}
\frac{\partial L}{\partial \boldsymbol U}&=\sum\limits_{t=1}^{\tau}\operatorname{diag}\left(\boldsymbol 1-(\boldsymbol h^{(t)})^2\right)\boldsymbol\delta^{(t)}\left(\boldsymbol x^{(t)}\right)^T\\
\frac{\partial L}{\partial \boldsymbol W}&=\sum\limits_{t=1}^{\tau}\operatorname{diag}\left(\boldsymbol 1-(\boldsymbol h^{(t)})^2\right)\boldsymbol\delta^{(t)}\left(\boldsymbol h^{(t-1)}\right)^T\\
\frac{\partial L}{\partial \boldsymbol b}&=\sum\limits_{t=1}^{\tau}\operatorname{diag}\left(\boldsymbol 1-(\boldsymbol h^{(t)})^2\right)\boldsymbol\delta^{(t)}
\end{aligned}$$
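Putting the pieces together, the following is a minimal sketch of backpropagation through time for this model, checked against central differences. It is my own illustration rather than code from the original post: the function names (`forward`, `total_loss`, `bptt`, `numeric_grad`), the sizes, the random data, and the zero initial state `h0` are all assumptions.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(params, xs, h0):
    U, W, V, b, c = params
    hs, y_hats, h = [], [], h0
    for x in xs:
        h = np.tanh(U @ x + W @ h + b)
        hs.append(h)
        y_hats.append(softmax(V @ h + c))
    return hs, y_hats

def total_loss(params, xs, ys, h0):
    _, y_hats = forward(params, xs, h0)
    return -sum(y @ np.log(p) for y, p in zip(ys, y_hats))

def bptt(params, xs, ys, h0):
    """Gradients of L w.r.t. U, W, V, b, c using the formulas in these notes."""
    U, W, V, b, c = params
    hs, y_hats = forward(params, xs, h0)
    gU, gW, gV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    gb, gc = np.zeros_like(b), np.zeros_like(c)
    carry = np.zeros_like(b)               # W^T diag(1-h^2) delta from the next step
    for t in range(len(xs) - 1, -1, -1):
        err = y_hats[t] - ys[t]            # dL^(t)/do^(t)
        gV += np.outer(err, hs[t])
        gc += err
        delta = V.T @ err + carry          # delta^(t)
        a = (1.0 - hs[t] ** 2) * delta     # diag(1 - (h^(t))^2) delta^(t)
        h_prev = hs[t - 1] if t > 0 else h0
        gU += np.outer(a, xs[t])
        gW += np.outer(a, h_prev)
        gb += a
        carry = W.T @ a
    return gU, gW, gV, gb, gc

def numeric_grad(params, k, xs, ys, h0, eps=1e-6):
    """Central-difference gradient of the total loss w.r.t. params[k]."""
    g = np.zeros_like(params[k])
    for idx in np.ndindex(*params[k].shape):
        old = params[k][idx]
        params[k][idx] = old + eps
        lp = total_loss(params, xs, ys, h0)
        params[k][idx] = old - eps
        lm = total_loss(params, xs, ys, h0)
        params[k][idx] = old               # restore the entry
        g[idx] = (lp - lm) / (2 * eps)
    return g

rng = np.random.default_rng(3)
n, m, l, tau = 3, 4, 5, 6                  # hypothetical sizes
shapes = [(m, n), (m, m), (l, m), (m,), (l,)]   # U, W, V, b, c
params = [rng.normal(size=s) * 0.5 for s in shapes]
xs = rng.normal(size=(tau, n))
ys = np.eye(l)[rng.integers(0, l, size=tau)]
h0 = np.zeros(m)

analytic = bptt(params, xs, ys, h0)
max_err = max(
    np.max(np.abs(g - numeric_grad(params, k, xs, ys, h0)))
    for k, g in enumerate(analytic)
)
print(max_err)
```

The variable `carry` holds $\boldsymbol W^T\operatorname{diag}\left(\boldsymbol 1-(\boldsymbol h^{(t)})^2\right)\boldsymbol\delta^{(t)}$, which is exactly the term the recursion needs at step $t-1$, so each step's $\boldsymbol\delta$ is computed once and reused for all three parameter gradients.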