RNNCell in Recurrent Neural Networks Explained, with the Gradient Derivation for Backpropagation

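Abstract

This article gives the defining formulas of the RNNCell unit of a recurrent neural network and derives its gradients for backpropagation.

The formulas given here are programming-oriented and complete; they can be used directly in a code implementation and have been verified in Python.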

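Related

For the companion code, see the articles:

RNN and its backpropagation implemented in pure Python and compared against PyTorch

RNNCell and its backpropagation implemented in pure Python and compared against PyTorch

For the definition and gradient of the affine transformation, see:

The affine/linear transformation function explained, with the gradient derivation for backpropagation through a fully connected layer

Index of the article series: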
https://blog.csdn.net/oBrightLamp/article/details/85067981

Main text

1. RNNCell definition

1.1 A single loop step

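Consider a rank-3 input tensor $X_{ijk}$. It can be viewed as $i$ matrices $X_{jk}$ of size $j \times k$, which also means that the input size of the recurrent cell is $k$.

Let the output size of the cell be $n$. For the input-layer matrix $X_{jk}$, let the weight matrix be $W_{nk}$ and the bias vector be $a_{1 \times n}$, and denote the transformed matrix by $Y_{jn}$.

Let the initial hidden-layer matrix be $H_{jn}$, with weight matrix $V_{n \times n}$ (written $V_{nn}$ in the formulas) and bias vector $b_{1 \times n}$, and denote the transformed matrix by $Z_{jn}$.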

With tanh as the activation function, one RNNCell loop step is:
$$
\begin{aligned}
Y_{jn} &= X_{jk}W_{nk}^T+a_{1 \times n}\\
Z_{jn} &= H_{jn}V_{nn}^T+b_{1 \times n}\\
A_{jn} &= Y_{jn}+Z_{jn}\\
O_{jn} &= \tanh(A_{jn})
\end{aligned}
$$
Denote this procedure by:
$$
O_{jn} = \mathrm{RNNCell}(X_{jk},H_{jn})
$$
At the next step, $O_{jn}$ is substituted for $H_{jn}$ and the computation is repeated with the next $X_{jk}$.
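As a rough illustration, the single step defined above can be sketched in NumPy. The function name `rnn_cell`, the concrete sizes and the random test data are assumptions made for this example only; they are not taken from the companion code.

```python
import numpy as np

def rnn_cell(x, h, W, V, a, b):
    """One RNNCell step: O = tanh(x W^T + a + h V^T + b)."""
    y = x @ W.T + a      # (j, k) @ (k, n) -> (j, n)
    z = h @ V.T + b      # (j, n) @ (n, n) -> (j, n)
    return np.tanh(y + z)

rng = np.random.default_rng(0)
j, k, n = 2, 3, 4                    # rows per matrix, input size, output size
x = rng.standard_normal((j, k))
h = np.zeros((j, n))                 # initial hidden-layer matrix (assumed zero here)
W = rng.standard_normal((n, k))
V = rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))

o = rnn_cell(x, h, W, V, a, b)
print(o.shape)  # (2, 4)
```

The biases are stored with shape `(1, n)` so that broadcasting adds them to every one of the `j` rows, mirroring $a_{1 \times n}$ and $b_{1 \times n}$.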

1.2 Iterating the loop

The RNNCell computation is written below in iterative notation.

Let $H_{jn}^{(0)}$ denote the initial hidden-layer matrix. For:
$$
X_{ijk} = X_{jk}^{(1)},X_{jk}^{(2)},X_{jk}^{(3)},\cdots,X_{jk}^{(i)}
$$
we have:
$$
\begin{aligned}
H_{jn}^{(1)} &= \mathrm{RNNCell}(X_{jk}^{(1)},H_{jn}^{(0)})\\
H_{jn}^{(2)} &= \mathrm{RNNCell}(X_{jk}^{(2)},H_{jn}^{(1)})\\
H_{jn}^{(3)} &= \mathrm{RNNCell}(X_{jk}^{(3)},H_{jn}^{(2)})\\
&\;\;\vdots\\
H_{jn}^{(i)} &= \mathrm{RNNCell}(X_{jk}^{(i)},H_{jn}^{(i-1)})
\end{aligned}
$$

Expanding the last step as an example:
$$
\begin{aligned}
Y_{jn}^{(i)} &= X_{jk}^{(i)}W_{nk}^T+a_{1 \times n}\\
Z_{jn}^{(i)} &= H_{jn}^{(i-1)}V_{nn}^T+b_{1 \times n}\\
A_{jn}^{(i)} &= Y_{jn}^{(i)}+Z_{jn}^{(i)}\\
H_{jn}^{(i)} &= \tanh(A_{jn}^{(i)})
\end{aligned}
$$
Throughout the iteration, $W_{nk}^T$, $V_{nn}^T$, $a_{1 \times n}$ and $b_{1 \times n}$ are shared and do not change.

1.3 Tensor formula

In rank-3 tensor form:
$$
H_{ijn} = \mathrm{RNNCell}^{(i)}(X_{ijk},H_{jn}^{(0)})
$$
The superscript $(i)$ on RNNCell means that $i$ loop iterations are performed.

Note that RNNCell maps an input tensor $X_{ijk}$ of size $i \times j \times k$ to an output tensor $H_{ijn}$ of size $i \times j \times n$.
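The iteration and the resulting shapes can be sketched as follows. This is a minimal NumPy sketch under assumed sizes; the helper name `rnn_forward` is hypothetical and the data is random.

```python
import numpy as np

def rnn_forward(X, H0, W, V, a, b):
    """Apply the cell to X[0], X[1], ... in turn; returns H with shape (i, j, n)."""
    h = H0
    states = []
    for x_t in X:                          # X has shape (i, j, k)
        # The same W, V, a, b are reused at every step.
        h = np.tanh(x_t @ W.T + a + h @ V.T + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
i, j, k, n = 5, 2, 3, 4
X = rng.standard_normal((i, j, k))
H0 = np.zeros((j, n))
W = rng.standard_normal((n, k))
V = rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))

H = rnn_forward(X, H0, W, V, a, b)
print(X.shape, "->", H.shape)   # (5, 2, 3) -> (5, 2, 4)
```

Each output slice `H[t]` is fed back as the hidden state of the next step, which is why the loop carries `h` forward.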

2. Backpropagation

Consider a rank-3 input tensor $X_{ijk}$ that RNNCell maps to the rank-3 output tensor $H_{ijn}$, which is then propagated forward to produce an error value (a scalar $e$). The gradient $\nabla e_{(H_{ijn})}$ of $e$ with respect to $H_{ijn}$ is supplied from upstream; we want the gradient of $e$ with respect to $X_{ijk}$.
$$
\begin{aligned}
H_{ijn} &= \mathrm{RNNCell}^{(i)}(X_{ijk},H_{jn}^{(0)})\\
e &= \mathrm{forward}(H_{ijn})
\end{aligned}
$$
To avoid notational confusion, the gradient passed down from upstream is written $\nabla e_{(Q_{ijn})} = \nabla e_{(H_{ijn})}$, and from here on $\nabla e_{(H_{ijn})}$ is reserved for the intermediate results of the iterative computation, that is, the gradient that flows back into a hidden state from the following loop step.

2.1 Gradient with respect to H

From the definition of RNNCell, every loop iteration is a composition of an affine operation and an activation function.

For:
$$
\begin{aligned}
y &= \tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}\\
\frac{dy}{dx} &= 1-y^2
\end{aligned}
$$
we get:
$$
\begin{aligned}
\nabla e_{(A_{jn}^{(i)})} &= \nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}\\
\nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})} &= \frac{d H_{jn}^{(i)}}{d A_{jn}^{(i)}}=1-{H_{jn}^{(i)}}^2
\end{aligned}
$$

In the expression above, $\nabla e_{(Q_{jn}^{(i)})}$ is supplied from upstream, and $\odot$ denotes the element-wise product, i.e. entries at the same position of the two matrices are multiplied.

Note that at this point we have the actual value of $\nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}$, because $H_{jn}^{(i)}$ is known.

Because the activation function is not necessarily $\tanh$, we keep the symbol $\nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}$ below, without loss of generality.
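The identity $dy/dx = 1 - y^2$ for tanh can be checked numerically. The following is a small sketch that compares it with a central finite difference; the sample points and the step size `eps` are arbitrary choices for illustration.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)
y = np.tanh(x)

# Analytic derivative, expressed through the output y only.
analytic = 1.0 - y ** 2

# Central finite difference as an independent reference.
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

Because the derivative depends only on the output `y`, the backward pass needs the stored hidden states but not the pre-activation values.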

For the addition, the gradient passes straight downstream:
$$
\begin{aligned}
A_{jn} &= Y_{jn}+Z_{jn}\\
\nabla e_{(Z_{jn}^{(i)})} &= \nabla e_{(A_{jn}^{(i)})} \odot 1=\nabla e_{(A_{jn}^{(i)})}
\end{aligned}
$$
The definition of the affine operation and its gradient formulas are given in the article listed under Related above.
$$
\begin{aligned}
Z_{jn} &= H_{jn}V_{nn}^T+b_{1 \times n}\\
\nabla e_{(H_{jn}^{(i-1)})} &= \nabla e_{(Z_{jn}^{(i)})} V_{nn}=\nabla e_{(A_{jn}^{(i)})} V_{nn}\\
\nabla e_{(H_{jn}^{(i-1)})} &= \left(\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}\right)V_{nn}
\end{aligned}
$$
Note that the superscript $(i)$ on the matrices here refers specifically to the last matrix of the tensor $X_{ijk}$, i.e. the $i$-th matrix.

Because RNNCell is computed iteratively, the gradient obtained for one loop step also feeds into the step before it, so the gradients are likewise computed iteratively and in order, from the last step back to the first.
$$
\begin{aligned}
\nabla e_{(H_{jn}^{(i-2)})} &= \nabla e_{(Q_{jn}^{(i-1)})} \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})}V_{nn}+\nabla e_{(H_{jn}^{(i-1)})}\odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})} V_{nn}\\
\nabla e_{(H_{jn}^{(i-2)})} &= (\nabla e_{(Q_{jn}^{(i-1)})}+\nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})}V_{nn}
\end{aligned}
$$
To keep the layout compact, a pair of parentheses in front of $V_{nn}$ is omitted above; evaluate from left to right, i.e. the element-wise product first and the matrix product with $V_{nn}$ last.

The formulas are summarized as follows:
$$
\begin{aligned}
\nabla e_{(H_{jn}^{(i-1)})} &= (\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})})V_{nn}\\
\nabla e_{(H_{jn}^{(i-2)})} &= (\nabla e_{(Q_{jn}^{(i-1)})}+\nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})}V_{nn}\\
\nabla e_{(H_{jn}^{(i-3)})} &= (\nabla e_{(Q_{jn}^{(i-2)})}+\nabla e_{(H_{jn}^{(i-2)})}) \odot \nabla {H_{jn}^{(i-2)}}_{(A_{jn}^{(i-2)})}V_{nn}\\
&\;\;\vdots\\
\nabla e_{(H_{jn}^{(0)})} &= (\nabla e_{(Q_{jn}^{(1)})}+\nabla e_{(H_{jn}^{(1)})}) \odot \nabla {H_{jn}^{(1)}}_{(A_{jn}^{(1)})}V_{nn}
\end{aligned}
$$
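The backward recursion for the hidden state can be sketched in NumPy and compared with a finite difference. Everything named here is an assumption for illustration: the helpers `forward` and `grad_h0`, the sizes, and in particular the loss $e = \sum H \odot C$ with a random matrix `C`, chosen so that the upstream gradient $\nabla e_{(Q_{ijn})}$ is simply `C`.

```python
import numpy as np

def forward(X, H0, W, V, a, b):
    """Run the cell over every X[t]; returns the stacked states H^(1..i)."""
    h, states = H0, []
    for x_t in X:
        h = np.tanh(x_t @ W.T + a + h @ V.T + b)
        states.append(h)
    return np.stack(states)

def grad_h0(H, Q, V):
    """Gradient of e w.r.t. H^(0), iterating from the last step to the first."""
    dH = np.zeros_like(H[0])                   # no later step feeds into H^(i)
    for t in reversed(range(len(H))):
        dA = (Q[t] + dH) * (1.0 - H[t] ** 2)   # element-wise product with tanh'
        dH = dA @ V                            # gradient w.r.t. the previous state
    return dH

rng = np.random.default_rng(0)
i, j, k, n = 4, 2, 3, 3
X = rng.standard_normal((i, j, k))
H0 = rng.standard_normal((j, n))
W = 0.5 * rng.standard_normal((n, k))
V = 0.5 * rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))
C = rng.standard_normal((i, j, n))             # hypothetical upstream gradient Q

def loss(h0):
    return np.sum(forward(X, h0, W, V, a, b) * C)

analytic = grad_h0(forward(X, H0, W, V, a, b), C, V)

# Directional finite-difference check along a random direction D.
D = rng.standard_normal(H0.shape)
eps = 1e-6
numeric = (loss(H0 + eps * D) - loss(H0 - eps * D)) / (2 * eps)
error = abs(numeric - np.sum(analytic * D))
print(error)
```

Initializing `dH` to zero makes the last step reduce to the first line of the summary, where only $\nabla e_{(Q_{jn}^{(i)})}$ appears.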

2.2 Gradient with respect to X

Following the example above and the gradient formula of the affine layer:
$$
\begin{aligned}
Y_{jn} &= X_{jk}W_{nk}^T+a_{1 \times n}\\
\nabla e_{(Y_{jn}^{(i)})} &= \nabla e_{(A_{jn}^{(i)})} \odot 1=\nabla e_{(A_{jn}^{(i)})}\\
\nabla e_{(X_{jk}^{(i)})} &= \nabla e_{(Y_{jn}^{(i)})} W_{nk}=\nabla e_{(A_{jn}^{(i)})}W_{nk}
\end{aligned}
$$
Likewise, the iterative formulas are:
$$
\begin{aligned}
\nabla e_{(X_{jk}^{(i)})} &= (\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}) W_{nk}\\
\nabla e_{(X_{jk}^{(i-1)})} &= (\nabla e_{(Q_{jn}^{(i-1)})}+\nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})}W_{nk}\\
\nabla e_{(X_{jk}^{(i-2)})} &= (\nabla e_{(Q_{jn}^{(i-2)})}+\nabla e_{(H_{jn}^{(i-2)})}) \odot \nabla {H_{jn}^{(i-2)}}_{(A_{jn}^{(i-2)})}W_{nk}\\
&\;\;\vdots\\
\nabla e_{(X_{jk}^{(1)})} &= (\nabla e_{(Q_{jn}^{(1)})}+\nabla e_{(H_{jn}^{(1)})}) \odot \nabla {H_{jn}^{(1)}}_{(A_{jn}^{(1)})}W_{nk}
\end{aligned}
$$

To keep the layout compact, a pair of parentheses in front of $W_{nk}$ is omitted above; evaluate from left to right.

$\nabla e_{(H_{ijn})}$ has already been computed in the previous subsection; take care to distinguish it from $\nabla e_{(Q_{ijn})}$.
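The input gradient reuses the same backward loop and only adds one matrix product with $W_{nk}$ per step. The sketch below makes the same assumptions as a typical gradient check: hypothetical helper names, random data, and a loss $e = \sum H \odot C$ so that the upstream gradient equals `C`.

```python
import numpy as np

def forward(X, H0, W, V, a, b):
    """Run the cell over every X[t]; returns the stacked states H^(1..i)."""
    h, states = H0, []
    for x_t in X:
        h = np.tanh(x_t @ W.T + a + h @ V.T + b)
        states.append(h)
    return np.stack(states)

def grad_x(H, Q, W, V):
    """Gradient of e w.r.t. every X^(t), returned with shape (i, j, k)."""
    dX = np.zeros((len(H), H.shape[1], W.shape[1]))
    dH = np.zeros_like(H[0])
    for t in reversed(range(len(H))):
        dA = (Q[t] + dH) * (1.0 - H[t] ** 2)
        dX[t] = dA @ W                 # (j, n) @ (n, k) -> (j, k)
        dH = dA @ V                    # carried back to step t-1
    return dX

rng = np.random.default_rng(1)
i, j, k, n = 4, 2, 3, 3
X = rng.standard_normal((i, j, k))
H0 = rng.standard_normal((j, n))
W = 0.5 * rng.standard_normal((n, k))
V = 0.5 * rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))
C = rng.standard_normal((i, j, n))     # hypothetical upstream gradient Q

def loss(x):
    return np.sum(forward(x, H0, W, V, a, b) * C)

dX = grad_x(forward(X, H0, W, V, a, b), C, W, V)

# Directional finite-difference check along a random direction D.
D = rng.standard_normal(X.shape)
eps = 1e-6
numeric = (loss(X + eps * D) - loss(X - eps * D)) / (2 * eps)
error = abs(numeric - np.sum(dX * D))
print(dX.shape, error)
```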

2.3 Gradient with respect to W

$W_{nk}$ is shared, and identical, across all iteration steps. By the chain rule:
$$
\nabla e_{(W_{nk})}=\nabla e_{(Q_{jn}^{(1)})}\odot\frac{dH_{jn}^{(1)}}{dW_{nk}}+\nabla e_{(Q_{jn}^{(2)})}\odot\frac{dH_{jn}^{(2)}}{dW_{nk}}+\cdots+\nabla e_{(Q_{jn}^{(i)})}\odot\frac{dH_{jn}^{(i)}}{dW_{nk}}
$$

or:
$$
\nabla e_{(W_{nk})}=\nabla e_{(W_{nk}^{(1)})}+\nabla e_{(W_{nk}^{(2)})}+\nabla e_{(W_{nk}^{(3)})}+\cdots+\nabla e_{(W_{nk}^{(i)})}
$$
Since:
$$
\begin{aligned}
\nabla e_{(A_{jn}^{(i)})} &= \nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}\\
\nabla e_{(W_{nk}^{(i)})} &= (\nabla e_{(Y_{jn}^{(i)})})^T X_{jk}^{(i)}=(\nabla e_{(A_{jn}^{(i)})})^T X_{jk}^{(i)}=(\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})})^TX_{jk}^{(i)}
\end{aligned}
$$
The superscript $(i)$ on $W_{nk}$ denotes the gradient obtained at step $i$. Over the loop iterations:
$$
\begin{aligned}
\nabla e_{(W_{nk}^{(i)})} &= (\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})})^T X_{jk}^{(i)}\\
\nabla e_{(W_{nk}^{(i-1)})} &= ((\nabla e_{(Q_{jn}^{(i-1)})}+\nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})})^TX_{jk}^{(i-1)}\\
\nabla e_{(W_{nk}^{(i-2)})} &= ((\nabla e_{(Q_{jn}^{(i-2)})}+\nabla e_{(H_{jn}^{(i-2)})}) \odot \nabla {H_{jn}^{(i-2)}}_{(A_{jn}^{(i-2)})})^TX_{jk}^{(i-2)}\\
&\;\;\vdots\\
\nabla e_{(W_{nk}^{(1)})} &= ((\nabla e_{(Q_{jn}^{(1)})}+\nabla e_{(H_{jn}^{(1)})}) \odot \nabla {H_{jn}^{(1)}}_{(A_{jn}^{(1)})})^TX_{jk}^{(1)}
\end{aligned}
$$
Finally, add up the results above:
$$
\nabla e_{(W_{nk})}=\nabla e_{(W_{nk}^{(1)})}+\nabla e_{(W_{nk}^{(2)})}+\nabla e_{(W_{nk}^{(3)})}+\cdots+\nabla e_{(W_{nk}^{(i)})}
$$
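Accumulating the per-step contributions into one gradient for the shared $W_{nk}$ might look like the sketch below. As before, the helper names, the sizes and the loss $e = \sum H \odot C$ (which makes the upstream gradient equal to `C`) are assumptions for this example.

```python
import numpy as np

def forward(X, H0, W, V, a, b):
    """Run the cell over every X[t]; returns the stacked states H^(1..i)."""
    h, states = H0, []
    for x_t in X:
        h = np.tanh(x_t @ W.T + a + h @ V.T + b)
        states.append(h)
    return np.stack(states)

def grad_w(X, H, Q, V):
    """Sum of the per-step gradients (dA^T X^(t)) over all loop steps."""
    dW = np.zeros((H.shape[2], X.shape[2]))    # same shape as W: (n, k)
    dH = np.zeros_like(H[0])
    for t in reversed(range(len(H))):
        dA = (Q[t] + dH) * (1.0 - H[t] ** 2)
        dW += dA.T @ X[t]                      # (n, j) @ (j, k) -> (n, k)
        dH = dA @ V
    return dW

rng = np.random.default_rng(2)
i, j, k, n = 4, 2, 3, 3
X = rng.standard_normal((i, j, k))
H0 = rng.standard_normal((j, n))
W = 0.5 * rng.standard_normal((n, k))
V = 0.5 * rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))
C = rng.standard_normal((i, j, n))             # hypothetical upstream gradient Q

def loss(w):
    return np.sum(forward(X, H0, w, V, a, b) * C)

dW = grad_w(X, forward(X, H0, W, V, a, b), C, V)

# Directional finite-difference check along a random direction D.
D = rng.standard_normal(W.shape)
eps = 1e-6
numeric = (loss(W + eps * D) - loss(W - eps * D)) / (2 * eps)
error = abs(numeric - np.sum(dW * D))
print(dW.shape, error)
```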

2.4 Gradient with respect to V

$V_{nn}$ is shared across all iteration steps. By the chain rule, and following the example above, it is easy to obtain:
$$
\begin{aligned}
\nabla e_{(V_{nn}^{(i)})} &= (\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})})^T H_{jn}^{(i-1)}\\
\nabla e_{(V_{nn}^{(i-1)})} &= ((\nabla e_{(Q_{jn}^{(i-1)})}+\nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})})^T H_{jn}^{(i-2)}\\
\nabla e_{(V_{nn}^{(i-2)})} &= ((\nabla e_{(Q_{jn}^{(i-2)})}+\nabla e_{(H_{jn}^{(i-2)})}) \odot \nabla {H_{jn}^{(i-2)}}_{(A_{jn}^{(i-2)})})^T H_{jn}^{(i-3)}\\
&\;\;\vdots\\
\nabla e_{(V_{nn}^{(1)})} &= ((\nabla e_{(Q_{jn}^{(1)})}+\nabla e_{(H_{jn}^{(1)})}) \odot \nabla {H_{jn}^{(1)}}_{(A_{jn}^{(1)})})^T H_{jn}^{(0)}
\end{aligned}
$$
Add up the results above:
$$
\nabla e_{(V_{nn})}=\nabla e_{(V_{nn}^{(1)})}+\nabla e_{(V_{nn}^{(2)})}+\nabla e_{(V_{nn}^{(3)})}+\cdots+\nabla e_{(V_{nn}^{(i)})}
$$
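For $V_{nn}$ the only change is that each step multiplies by the previous hidden state instead of the input, with $H_{jn}^{(0)}$ used at the first step. A minimal sketch, again with hypothetical helper names, random data, and the assumed loss $e = \sum H \odot C$:

```python
import numpy as np

def forward(X, H0, W, V, a, b):
    """Run the cell over every X[t]; returns the stacked states H^(1..i)."""
    h, states = H0, []
    for x_t in X:
        h = np.tanh(x_t @ W.T + a + h @ V.T + b)
        states.append(h)
    return np.stack(states)

def grad_v(H0, H, Q, V):
    """Sum of the per-step gradients (dA^T H^(t-1)) over all loop steps."""
    dV = np.zeros_like(V)
    dH = np.zeros_like(H[0])
    for t in reversed(range(len(H))):
        dA = (Q[t] + dH) * (1.0 - H[t] ** 2)
        h_prev = H[t - 1] if t > 0 else H0     # H^(0) feeds the first step
        dV += dA.T @ h_prev                    # (n, j) @ (j, n) -> (n, n)
        dH = dA @ V
    return dV

rng = np.random.default_rng(3)
i, j, k, n = 4, 2, 3, 3
X = rng.standard_normal((i, j, k))
H0 = rng.standard_normal((j, n))
W = 0.5 * rng.standard_normal((n, k))
V = 0.5 * rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))
C = rng.standard_normal((i, j, n))             # hypothetical upstream gradient Q

def loss(v):
    return np.sum(forward(X, H0, W, v, a, b) * C)

dV = grad_v(H0, forward(X, H0, W, V, a, b), C, V)

# Directional finite-difference check along a random direction D.
D = rng.standard_normal(V.shape)
eps = 1e-6
numeric = (loss(V + eps * D) - loss(V - eps * D)) / (2 * eps)
error = abs(numeric - np.sum(dV * D))
print(dV.shape, error)
```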

2.5 Gradients with respect to a and b

$$
\begin{aligned}
Y_{jn} &= X_{jk}W_{nk}^T+a_{1 \times n}\\
\nabla e_{(Y_{jn}^{(i)})} &= \nabla e_{(A_{jn}^{(i)})} \odot 1=\nabla e_{(A_{jn}^{(i)})}\\
\nabla e_{(a_{1 \times n}^{(i)})} &= \mathbf{1}_{1 \times j}\,\nabla e_{(Y_{jn}^{(i)})} = \mathbf{1}_{1 \times j}\,\nabla e_{(A_{jn}^{(i)})}= \mathbf{1}_{1 \times j}\left(\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}\right)
\end{aligned}
$$
Here $\mathbf{1}_{1 \times j}$ is a row vector of ones: because $a_{1 \times n}$ is added to each of the $j$ rows of $Y_{jn}$, its gradient is the $j \times n$ gradient summed over those rows, which gives the required $1 \times n$ shape.

$a_{1 \times n}$ is shared, and identical, across all iteration steps. By the chain rule, and following the example above, it is easy to obtain:
$$
\begin{aligned}
\nabla e_{(a_{1 \times n}^{(i)})} &= \mathbf{1}_{1 \times j}\left(\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}\right)\\
\nabla e_{(a_{1 \times n}^{(i-1)})} &= \mathbf{1}_{1 \times j}\left((\nabla e_{(Q_{jn}^{(i-1)})} + \nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})}\right)\\
\nabla e_{(a_{1 \times n}^{(i-2)})} &= \mathbf{1}_{1 \times j}\left((\nabla e_{(Q_{jn}^{(i-2)})} + \nabla e_{(H_{jn}^{(i-2)})}) \odot \nabla {H_{jn}^{(i-2)}}_{(A_{jn}^{(i-2)})}\right)\\
&\;\;\vdots\\
\nabla e_{(a_{1 \times n}^{(1)})} &= \mathbf{1}_{1 \times j}\left((\nabla e_{(Q_{jn}^{(1)})} + \nabla e_{(H_{jn}^{(1)})}) \odot \nabla {H_{jn}^{(1)}}_{(A_{jn}^{(1)})}\right)
\end{aligned}
$$
Add up the results above:
$$
\nabla e_{(a_{1 \times n})}=\nabla e_{(a_{1 \times n}^{(1)})}+\nabla e_{(a_{1 \times n}^{(2)})}+\nabla e_{(a_{1 \times n}^{(3)})}+\cdots+\nabla e_{(a_{1 \times n}^{(i)})}
$$
Similarly, since $b_{1 \times n}$ enters $A_{jn}$ in exactly the same way as $a_{1 \times n}$, we obtain:
$$
\nabla e_{(b_{1 \times n})}=\nabla e_{(a_{1 \times n})}
$$
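The bias gradients can be sketched in the same way. The code below sums each step's gradient over the `j` rows so that the result has the `(1, n)` shape of the bias, and checks both $a$ and $b$ numerically. The helper names, the sizes and the loss $e = \sum H \odot C$ are assumptions for illustration.

```python
import numpy as np

def forward(X, H0, W, V, a, b):
    """Run the cell over every X[t]; returns the stacked states H^(1..i)."""
    h, states = H0, []
    for x_t in X:
        h = np.tanh(x_t @ W.T + a + h @ V.T + b)
        states.append(h)
    return np.stack(states)

def grad_bias(H, Q, V):
    """Gradients of e w.r.t. a and b, each with shape (1, n)."""
    da = np.zeros((1, H.shape[2]))
    dH = np.zeros_like(H[0])
    for t in reversed(range(len(H))):
        dA = (Q[t] + dH) * (1.0 - H[t] ** 2)
        da += dA.sum(axis=0, keepdims=True)    # sum over the j rows
        dH = dA @ V
    return da, da.copy()                       # b receives the same gradient as a

rng = np.random.default_rng(4)
i, j, k, n = 4, 2, 3, 3
X = rng.standard_normal((i, j, k))
H0 = rng.standard_normal((j, n))
W = 0.5 * rng.standard_normal((n, k))
V = 0.5 * rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))
C = rng.standard_normal((i, j, n))             # hypothetical upstream gradient Q

def loss(a_, b_):
    return np.sum(forward(X, H0, W, V, a_, b_) * C)

da, db = grad_bias(forward(X, H0, W, V, a, b), C, V)

# Directional finite-difference checks, perturbing a and b separately.
D = rng.standard_normal(a.shape)
eps = 1e-6
numeric_a = (loss(a + eps * D, b) - loss(a - eps * D, b)) / (2 * eps)
numeric_b = (loss(a, b + eps * D) - loss(a, b - eps * D)) / (2 * eps)
error_a = abs(numeric_a - np.sum(da * D))
error_b = abs(numeric_b - np.sum(db * D))
print(da.shape, error_a, error_b)
```

In practice the two biases could be merged into one vector, since only their sum enters the computation; they are kept separate here to match the formulas.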
