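This article gives the defining formulas of the recurrent neural network unit RNNCell and derives its gradients for backpropagation.
The formulas are written with implementation in mind: they are complete, can be translated directly into code, and have been verified in Python.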
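For the companion code, see the articles:
Implementing the recurrent neural network RNN and its backpropagation in pure Python, compared against PyTorch
Implementing the recurrent neural network RNNCell and its backpropagation in pure Python, compared against PyTorch
For the definition and gradient of the affine transform, see:
The affine/linear transform in detail, and the gradient derivation for backpropagation through a fully connected layer
Series index:
https://blog.csdn.net/oBrightLamp/article/details/85067981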
Consider a 3rd-order input tensor $X_{ijk}$. It can be viewed as $i$ matrices $X_{jk}$ of size $j \times k$, which also means that the recurrent unit has input size $k$.
Let the output size of the unit be $n$. For the input matrix $X_{jk}$, let the weight matrix be $W_{nk}$ and the bias vector be $a_{1 \times n}$; the transformed matrix is $Y_{jn}$.
Let the initial hidden-state matrix be $H_{jn}$, with weight matrix $V_{n \times n}$ and bias vector $b_{1 \times n}$; the transformed matrix is $Z_{jn}$.
With tanh as the activation function, one RNNCell step is:
$$
\begin{aligned}
Y_{jn} &= X_{jk}W_{nk}^T+a_{1 \times n}\\
Z_{jn} &= H_{jn}V_{nn}^T+b_{1 \times n}\\
A_{jn} &= Y_{jn}+Z_{jn}\\
O_{jn} &= \tanh(A_{jn})
\end{aligned}
$$
Denote this process by:
$$
O_{jn} = \mathrm{RNNCell}(X_{jk},H_{jn})
$$
On the next iteration, $O_{jn}$ takes the place of $H_{jn}$ and the computation is repeated with the next $X_{jk}$.
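The single step defined above can be sketched in NumPy as follows. This is a minimal illustration, not the companion code: the function name `rnn_cell`, the sizes, and the random inputs are assumptions made for the example.

```python
import numpy as np

def rnn_cell(X, H, W, V, a, b):
    """One RNNCell step: O = tanh(X W^T + a + H V^T + b)."""
    Y = X @ W.T + a          # (j, k) @ (k, n) + (1, n) -> (j, n)
    Z = H @ V.T + b          # (j, n) @ (n, n) + (1, n) -> (j, n)
    A = Y + Z
    return np.tanh(A)

# Arbitrary example sizes: j rows per matrix, input size k, output size n.
j, k, n = 2, 3, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((j, k))
H = np.zeros((j, n))                 # initial hidden state
W = rng.standard_normal((n, k))
V = rng.standard_normal((n, n))
a = np.zeros((1, n))
b = np.zeros((1, n))

O = rnn_cell(X, H, W, V, a, b)
print(O.shape)  # (2, 4)
```

The biases have shape `(1, n)` and are broadcast over the `j` rows, which matches the $a_{1 \times n}$ and $b_{1 \times n}$ notation.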
Next, the RNNCell computation is written in iterative notation.
Let $H_{jn}^{(0)}$ denote the initial hidden-state matrix. For:
$$
X_{ijk} = X_{jk}^{(1)},X_{jk}^{(2)},X_{jk}^{(3)},\cdots,X_{jk}^{(i)}
$$
we have:
$$
\begin{aligned}
H_{jn}^{(1)} &= \mathrm{RNNCell}(X_{jk}^{(1)},H_{jn}^{(0)})\\
H_{jn}^{(2)} &= \mathrm{RNNCell}(X_{jk}^{(2)},H_{jn}^{(1)})\\
H_{jn}^{(3)} &= \mathrm{RNNCell}(X_{jk}^{(3)},H_{jn}^{(2)})\\
&\;\;\vdots\\
H_{jn}^{(i)} &= \mathrm{RNNCell}(X_{jk}^{(i)},H_{jn}^{(i-1)})
\end{aligned}
$$
Expanding the last step as an example:
$$
\begin{aligned}
Y_{jn}^{(i)} &= X_{jk}^{(i)}W_{nk}^T+a_{1 \times n}\\
Z_{jn}^{(i)} &= H_{jn}^{(i-1)}V_{nn}^T+b_{1 \times n}\\
A_{jn}^{(i)} &= Y_{jn}^{(i)}+Z_{jn}^{(i)}\\
H_{jn}^{(i)} &= \tanh(A_{jn}^{(i)})
\end{aligned}
$$
Throughout the iteration, $W_{nk}^T,\; V_{nn}^T,\; a_{1 \times n},\; b_{1 \times n}$ are shared across all steps and do not change.
In 3rd-order tensor notation:
$$
H_{ijn} = \mathrm{RNNCell}^{(i)}(X_{ijk},H_{jn}^{(0)})
$$
The superscript $(i)$ on RNNCell indicates $i$ rounds of the iterative computation.
Note that RNNCell maps an input tensor $X_{ijk}$ of size $i \times j \times k$ to an output tensor $H_{ijn}$ of size $i \times j \times n$.
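The unrolled iteration can be sketched as a loop over the first axis of the input tensor. This is an illustrative sketch only: the function name `rnn_forward`, the sizes, and the random values are assumptions, not taken from the companion code.

```python
import numpy as np

def rnn_forward(X, H0, W, V, a, b):
    """Apply the RNNCell step to X[0], X[1], ... in turn.

    X has shape (i, j, k); the returned tensor has shape (i, j, n).
    """
    H_prev = H0
    outputs = []
    for X_t in X:
        # The previous output is fed back in as the hidden state.
        H_prev = np.tanh(X_t @ W.T + a + H_prev @ V.T + b)
        outputs.append(H_prev)
    return np.stack(outputs)

rng = np.random.default_rng(0)
i, j, k, n = 5, 2, 3, 4                 # arbitrary example sizes
X = rng.standard_normal((i, j, k))
H0 = np.zeros((j, n))
W = rng.standard_normal((n, k))
V = rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))

H = rnn_forward(X, H0, W, V, a, b)
print(H.shape)  # (5, 2, 4)
```

The same `W`, `V`, `a`, and `b` are reused at every step, which is the parameter sharing that the gradient derivation relies on.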
Consider a 3rd-order input tensor $X_{ijk}$ that passes through RNNCell to produce the 3rd-order tensor $H_{ijn}$, which is then propagated forward to yield the error (a scalar $e$). The gradient of $e$ with respect to $H_{ijn}$, written $\nabla e_{(H_{ijn})}$, is supplied by the upstream layers; the task is to find the gradient of $e$ with respect to $X_{ijk}$.
$$
\begin{aligned}
H_{ijn} &= \mathrm{RNNCell}^{(i)}(X_{ijk},H_{jn}^{(0)})\\
e &= \mathrm{forward}(H_{ijn})
\end{aligned}
$$
To avoid confusion between symbols, the gradient passed in from upstream is written $\nabla e_{(Q_{ijn})} = \nabla e_{(H_{ijn})}$, and from here on $\nabla e_{(H_{ijn})}$ is reserved for the intermediate results of the iterative computation.
As the definition of RNNCell shows, every iteration is a composition of affine operations and an activation function.
For:
$$
\begin{aligned}
y &= \tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}\\
\frac{dy}{dx} &= 1-y^2
\end{aligned}
$$
we have:
$$
\begin{aligned}
\nabla {e}_{(A_{jn}^{(i)})} &= \nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}\\
\nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})} &= \frac{d H_{jn}^{(i)}}{d A_{jn}^{(i)}}=1-\left(H_{jn}^{(i)}\right)^2
\end{aligned}
$$
Here $\nabla e_{(Q_{jn}^{(i)})}$ is supplied by upstream, $\odot$ denotes the elementwise product (entries at the same position are multiplied), and the square is also taken elementwise.
Note that this gives the actual numerical value of $\nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}$, because $H_{jn}^{(i)}$ is already known from the forward pass.
Since the activation function need not be $\tanh$, the derivation below keeps the symbol $\nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}$ so that it stays general.
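For the tanh case, the identity $\frac{dy}{dx} = 1 - y^2$ can be checked numerically. The sketch below compares it with a central finite difference; the sample points and the step size `eps` are arbitrary choices made for illustration.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)           # arbitrary sample points
y = np.tanh(x)

# Analytic derivative, expressed through the output y only.
analytic = 1.0 - y ** 2

# Central finite difference as an independent reference.
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)

max_err = np.max(np.abs(analytic - numeric))
print(max_err < 1e-8)
```

Because the derivative depends only on the output, the backward pass can reuse the stored hidden states and does not need to keep the pre-activation values.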
For addition, the gradient passes straight through to the downstream terms:
$$
\begin{aligned}
A_{jn} &= Y_{jn}+Z_{jn}\\
\nabla {e}_{(Z_{jn}^{(i)})} &= \nabla e_{(A_{jn}^{(i)})} \odot 1=\nabla e_{(A_{jn}^{(i)})}
\end{aligned}
$$
The definition of the affine operation and its gradient formulas are given in the affine transform article referenced above.
$$
\begin{aligned}
Z_{jn} &= H_{jn}V_{nn}^T+b_{1 \times n}\\
\nabla {e}_{(H_{jn}^{(i-1)})} &= \nabla e_{(Z_{jn}^{(i)})} V_{nn}=\nabla e_{(A_{jn}^{(i)})} V_{nn}\\
\nabla {e}_{(H_{jn}^{(i-1)})} &= (\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})})V_{nn}
\end{aligned}
$$
Note that the superscript $(i)$ here refers specifically to the last matrix of the tensor $X_{ijk}$, that is, the $i$-th matrix.
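The single-step result $\nabla e_{(H^{(i-1)})} = (\nabla e_{(Q^{(i)})} \odot \nabla H^{(i)}_{(A^{(i)})})V$ can be checked against finite differences. In this sketch the scalar loss `sum(Q * H)` is a hypothetical choice, picked only because its gradient with respect to the output is exactly `Q`; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
j, k, n = 2, 3, 4                        # arbitrary example sizes
X = rng.standard_normal((j, k))
H_prev = rng.standard_normal((j, n))
W = rng.standard_normal((n, k))
V = rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))
Q = rng.standard_normal((j, n))          # stands in for the upstream gradient

def step(h_prev):
    return np.tanh(X @ W.T + a + h_prev @ V.T + b)

def loss(h_prev):
    # Hypothetical loss whose gradient w.r.t. the step output equals Q.
    return np.sum(Q * step(h_prev))

# Analytic gradient: activation derivative, then the affine rule.
H = step(H_prev)
dA = Q * (1.0 - H ** 2)                  # gradient w.r.t. A (and Z, Y)
dH_prev = dA @ V                         # (j, n) @ (n, n) -> (j, n)

# Central finite differences over every entry of H_prev.
eps = 1e-6
numeric = np.zeros_like(H_prev)
for idx in np.ndindex(*H_prev.shape):
    d = np.zeros_like(H_prev)
    d[idx] = eps
    numeric[idx] = (loss(H_prev + d) - loss(H_prev - d)) / (2 * eps)

print(np.allclose(dH_prev, numeric, atol=1e-6))
```

Note that the forward pass multiplies by `V.T` while the backward pass multiplies by `V` itself.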
Because RNNCell is computed iteratively, $H_{jn}^{(i-1)}$ influences $e$ both directly, through the upstream gradient $\nabla e_{(Q_{jn}^{(i-1)})}$, and indirectly, through the next step via $\nabla e_{(H_{jn}^{(i-1)})}$. The gradients are therefore also computed iteratively, in reverse order:
$$
\begin{aligned}
\nabla {e}_{(H_{jn}^{(i-2)})} &= \nabla e_{(Q_{jn}^{(i-1)})} \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})}V_{nn}+\nabla e_{(H_{jn}^{(i-1)})}\odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})} V_{nn}\\
\nabla {e}_{(H_{jn}^{(i-2)})} &= (\nabla e_{(Q_{jn}^{(i-1)})}+\nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})}V_{nn}
\end{aligned}
$$
To keep the layout compact, a pair of parentheses before $V_{nn}$ is omitted above; evaluate from left to right, taking the elementwise product first and the matrix product with $V_{nn}$ last.
The formulas are summarized as follows:
$$
\begin{aligned}
\nabla {e}_{(H_{jn}^{(i-1)})} &= (\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})})V_{nn}\\
\nabla {e}_{(H_{jn}^{(i-2)})} &= (\nabla e_{(Q_{jn}^{(i-1)})}+\nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})}V_{nn}\\
\nabla {e}_{(H_{jn}^{(i-3)})} &= (\nabla e_{(Q_{jn}^{(i-2)})}+\nabla e_{(H_{jn}^{(i-2)})}) \odot \nabla {H_{jn}^{(i-2)}}_{(A_{jn}^{(i-2)})}V_{nn}\\
&\;\;\vdots\\
\nabla {e}_{(H_{jn}^{(0)})} &= (\nabla e_{(Q_{jn}^{(1)})}+\nabla e_{(H_{jn}^{(1)})}) \odot \nabla {H_{jn}^{(1)}}_{(A_{jn}^{(1)})}V_{nn}
\end{aligned}
$$
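The recursion summarized above can be sketched as a reverse loop over the steps. As before, the loss `sum(Q * H)` is a hypothetical choice that makes the upstream gradient equal to `Q`; the names, sizes, and the 0.5 weight scaling are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
i, j, k, n = 4, 2, 3, 3                  # arbitrary example sizes
X = rng.standard_normal((i, j, k))
H0 = rng.standard_normal((j, n))
W = 0.5 * rng.standard_normal((n, k))
V = 0.5 * rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))
Q = rng.standard_normal((i, j, n))       # upstream gradient for every step

def forward(H_init):
    """Unrolled RNNCell; returns all hidden states, shape (i, j, n)."""
    H, h = [], H_init
    for X_t in X:
        h = np.tanh(X_t @ W.T + a + h @ V.T + b)
        H.append(h)
    return np.stack(H)

def loss(H_init):
    # Hypothetical loss whose gradient w.r.t. H equals Q.
    return np.sum(Q * forward(H_init))

H = forward(H0)
dH = np.zeros((j, n))                    # no later step feeds the last one
for t in reversed(range(i)):
    # (upstream + recurrent gradient), elementwise times tanh'.
    dA = (Q[t] + dH) * (1.0 - H[t] ** 2)
    dH = dA @ V                          # gradient w.r.t. the previous state
dH0 = dH                                 # gradient w.r.t. the initial state

eps = 1e-6
numeric = np.zeros_like(H0)
for idx in np.ndindex(*H0.shape):
    d = np.zeros_like(H0)
    d[idx] = eps
    numeric[idx] = (loss(H0 + d) - loss(H0 - d)) / (2 * eps)

print(np.allclose(dH0, numeric, atol=1e-6))
```

Initializing `dH` to zero covers the first line of the summary, where only the upstream term is present for the last step.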
Following the example above and the gradient formulas of the affine layer:
$$
\begin{aligned}
Y_{jn} &= X_{jk}W_{nk}^T+a_{1 \times n}\\
\nabla {e}_{(Y_{jn}^{(i)})} &= \nabla e_{(A_{jn}^{(i)})} \odot 1=\nabla e_{(A_{jn}^{(i)})}\\
\nabla {e}_{(X_{jk}^{(i)})} &= \nabla e_{(Y_{jn}^{(i)})} W_{nk}=\nabla e_{(A_{jn}^{(i)})}W_{nk}
\end{aligned}
$$
Likewise, the iterative formulas are:
$$
\begin{aligned}
\nabla {e}_{(X_{jk}^{(i)})} &= (\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}) W_{nk}\\
\nabla {e}_{(X_{jk}^{(i-1)})} &= (\nabla e_{(Q_{jn}^{(i-1)})}+\nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})}W_{nk}\\
\nabla {e}_{(X_{jk}^{(i-2)})} &= (\nabla e_{(Q_{jn}^{(i-2)})}+\nabla e_{(H_{jn}^{(i-2)})}) \odot \nabla {H_{jn}^{(i-2)}}_{(A_{jn}^{(i-2)})}W_{nk}\\
&\;\;\vdots\\
\nabla {e}_{(X_{jk}^{(1)})} &= (\nabla e_{(Q_{jn}^{(1)})}+\nabla e_{(H_{jn}^{(1)})}) \odot \nabla {H_{jn}^{(1)}}_{(A_{jn}^{(1)})}W_{nk}
\end{aligned}
$$
To keep the layout compact, a pair of parentheses before $W_{nk}$ is omitted above; evaluate from left to right.
$\nabla e_{(H_{ijn})}$ is the intermediate gradient computed in the preceding hidden-state recursion; take care to distinguish it from the upstream gradient $\nabla e_{(Q_{ijn})}$.
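The input gradients can be sketched by extending the reverse loop so that each step also multiplies by `W`. The loss `sum(Q * H)` is again a hypothetical choice used only to make `Q` the upstream gradient; names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
i, j, k, n = 3, 2, 3, 4                  # arbitrary example sizes
X = rng.standard_normal((i, j, k))
H0 = rng.standard_normal((j, n))
W = 0.5 * rng.standard_normal((n, k))
V = 0.5 * rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))
Q = rng.standard_normal((i, j, n))       # upstream gradient for every step

def forward(X_in):
    H, h = [], H0
    for X_t in X_in:
        h = np.tanh(X_t @ W.T + a + h @ V.T + b)
        H.append(h)
    return np.stack(H)

def loss(X_in):
    # Hypothetical loss whose gradient w.r.t. H equals Q.
    return np.sum(Q * forward(X_in))

H = forward(X)
dX = np.zeros_like(X)
dH = np.zeros((j, n))                    # intermediate hidden-state gradient
for t in reversed(range(i)):
    dA = (Q[t] + dH) * (1.0 - H[t] ** 2)
    dX[t] = dA @ W                       # (j, n) @ (n, k) -> (j, k)
    dH = dA @ V                          # passed on to step t - 1

eps = 1e-6
numeric = np.zeros_like(X)
for idx in np.ndindex(*X.shape):
    d = np.zeros_like(X)
    d[idx] = eps
    numeric[idx] = (loss(X + d) - loss(X - d)) / (2 * eps)

print(np.allclose(dX, numeric, atol=1e-6))
```

Here `Q[t]` plays the role of $\nabla e_{(Q^{(t)})}$ and `dH` the role of the intermediate $\nabla e_{(H^{(t)})}$, which keeps the two quantities visibly separate.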
$W_{nk}$ is shared and identical across all iteration steps. By the chain rule, written schematically:
$$
\nabla {e}_{(W_{nk})}=\nabla {e}_{(Q_{jn}^{(1)})}\odot\frac{dH_{jn}^{(1)}}{dW_{nk}}+\nabla {e}_{(Q_{jn}^{(2)})}\odot\frac{dH_{jn}^{(2)}}{dW_{nk}}+\cdots+\nabla {e}_{(Q_{jn}^{(i)})}\odot\frac{dH_{jn}^{(i)}}{dW_{nk}}
$$
where each $\frac{dH_{jn}^{(t)}}{dW_{nk}}$ is a total derivative that includes the dependence through the earlier hidden states. Equivalently, grouping the terms by the step at which $W_{nk}$ is used:
$$
\nabla {e}_{(W_{nk})}=\nabla e_{(W_{nk}^{(1)})}+\nabla e_{(W_{nk}^{(2)})}+\nabla e_{(W_{nk}^{(3)})}+\cdots+\nabla e_{(W_{nk}^{(i)})}
$$
Since, for the last step:
$$
\begin{aligned}
\nabla {e}_{(A_{jn}^{(i)})} &= \nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})}\\
\nabla e_{(W_{nk}^{(i)})} &= (\nabla e_{(Y_{jn}^{(i)})})^T X_{jk}^{(i)}=(\nabla e_{(A_{jn}^{(i)})})^T X_{jk}^{(i)}=(\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})})^TX_{jk}^{(i)}
\end{aligned}
$$
The superscript $(i)$ on $W_{nk}$ denotes the gradient contribution obtained at step $i$. Over the course of the iteration:
$$
\begin{aligned}
\nabla e_{(W_{nk}^{(i)})} &= (\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})})^T X_{jk}^{(i)}\\
\nabla e_{(W_{nk}^{(i-1)})} &= ((\nabla e_{(Q_{jn}^{(i-1)})}+\nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})})^TX_{jk}^{(i-1)}\\
\nabla e_{(W_{nk}^{(i-2)})} &= ((\nabla e_{(Q_{jn}^{(i-2)})}+\nabla e_{(H_{jn}^{(i-2)})}) \odot \nabla {H_{jn}^{(i-2)}}_{(A_{jn}^{(i-2)})})^TX_{jk}^{(i-2)}\\
&\;\;\vdots\\
\nabla e_{(W_{nk}^{(1)})} &= ((\nabla e_{(Q_{jn}^{(1)})}+\nabla e_{(H_{jn}^{(1)})}) \odot \nabla {H_{jn}^{(1)}}_{(A_{jn}^{(1)})})^TX_{jk}^{(1)}
\end{aligned}
$$
Finally, add up the results above:
$$
\nabla {e}_{(W_{nk})}=\nabla e_{(W_{nk}^{(1)})}+\nabla e_{(W_{nk}^{(2)})}+\nabla e_{(W_{nk}^{(3)})}+\cdots+\nabla e_{(W_{nk}^{(i)})}
$$
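The accumulation over steps can be sketched by adding each step's contribution to a running `dW` inside the reverse loop. The loss `sum(Q * H)` is a hypothetical choice that makes `Q` the upstream gradient; names and sizes are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
i, j, k, n = 3, 2, 3, 4                  # arbitrary example sizes
X = rng.standard_normal((i, j, k))
H0 = rng.standard_normal((j, n))
W = 0.5 * rng.standard_normal((n, k))
V = 0.5 * rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))
Q = rng.standard_normal((i, j, n))       # upstream gradient for every step

def forward(W_in):
    H, h = [], H0
    for X_t in X:
        h = np.tanh(X_t @ W_in.T + a + h @ V.T + b)
        H.append(h)
    return np.stack(H)

def loss(W_in):
    # Hypothetical loss whose gradient w.r.t. H equals Q.
    return np.sum(Q * forward(W_in))

H = forward(W)
dW = np.zeros_like(W)
dH = np.zeros((j, n))
for t in reversed(range(i)):
    dA = (Q[t] + dH) * (1.0 - H[t] ** 2)
    dW += dA.T @ X[t]                    # per-step term: (n, j) @ (j, k)
    dH = dA @ V

eps = 1e-6
numeric = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    d = np.zeros_like(W)
    d[idx] = eps
    numeric[idx] = (loss(W + d) - loss(W - d)) / (2 * eps)

print(np.allclose(dW, numeric, atol=1e-6))
```

Accumulating with `+=` avoids storing the per-step gradients separately before summing them.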
$V_{nn}$ is shared across all iteration steps. By the chain rule, and following the example above, it is straightforward to obtain:
$$
\begin{aligned}
\nabla e_{(V_{nn}^{(i)})} &= (\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})})^T H_{jn}^{(i-1)}\\
\nabla e_{(V_{nn}^{(i-1)})} &= ((\nabla e_{(Q_{jn}^{(i-1)})}+\nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})})^T H_{jn}^{(i-2)}\\
\nabla e_{(V_{nn}^{(i-2)})} &= ((\nabla e_{(Q_{jn}^{(i-2)})}+\nabla e_{(H_{jn}^{(i-2)})}) \odot \nabla {H_{jn}^{(i-2)}}_{(A_{jn}^{(i-2)})})^T H_{jn}^{(i-3)}\\
&\;\;\vdots\\
\nabla e_{(V_{nn}^{(1)})} &= ((\nabla e_{(Q_{jn}^{(1)})}+\nabla e_{(H_{jn}^{(1)})}) \odot \nabla {H_{jn}^{(1)}}_{(A_{jn}^{(1)})})^T H_{jn}^{(0)}
\end{aligned}
$$
Add up the results above:
$$
\nabla {e}_{(V_{nn})}=\nabla e_{(V_{nn}^{(1)})}+\nabla e_{(V_{nn}^{(2)})}+\nabla e_{(V_{nn}^{(3)})}+\cdots+\nabla e_{(V_{nn}^{(i)})}
$$
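The same pattern applies to $V_{nn}$, except that each step pairs its gradient with the previous hidden state rather than with the input. In the sketch below the loss `sum(Q * H)` is a hypothetical choice; names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
i, j, k, n = 3, 2, 3, 4                  # arbitrary example sizes
X = rng.standard_normal((i, j, k))
H0 = rng.standard_normal((j, n))
W = 0.5 * rng.standard_normal((n, k))
V = 0.5 * rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))
Q = rng.standard_normal((i, j, n))       # upstream gradient for every step

def forward(V_in):
    H, h = [], H0
    for X_t in X:
        h = np.tanh(X_t @ W.T + a + h @ V_in.T + b)
        H.append(h)
    return np.stack(H)

def loss(V_in):
    # Hypothetical loss whose gradient w.r.t. H equals Q.
    return np.sum(Q * forward(V_in))

H = forward(V)
dV = np.zeros_like(V)
dH = np.zeros((j, n))
for t in reversed(range(i)):
    dA = (Q[t] + dH) * (1.0 - H[t] ** 2)
    H_prev = H[t - 1] if t > 0 else H0   # the first step uses the initial state
    dV += dA.T @ H_prev                  # per-step term: (n, j) @ (j, n)
    dH = dA @ V

eps = 1e-6
numeric = np.zeros_like(V)
for idx in np.ndindex(*V.shape):
    d = np.zeros_like(V)
    d[idx] = eps
    numeric[idx] = (loss(V + d) - loss(V - d)) / (2 * eps)

print(np.allclose(dV, numeric, atol=1e-6))
```

The explicit `t > 0` check matters in Python, because `H[-1]` would silently select the last hidden state instead of `H0`.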
For the bias $a_{1 \times n}$, start from the last step. Because $a_{1 \times n}$ is added to every one of the $j$ rows of $Y_{jn}$, its gradient is the sum of the rows of $\nabla e_{(Y_{jn})}$, written here as left multiplication by the all-ones row vector $\mathbf{1}_{1 \times j}$ so that the result has shape $1 \times n$:
$$
\begin{aligned}
Y_{jn} &= X_{jk}W_{nk}^T+a_{1 \times n}\\
\nabla {e}_{(Y_{jn}^{(i)})} &= \nabla e_{(A_{jn}^{(i)})} \odot 1=\nabla e_{(A_{jn}^{(i)})}\\
\nabla {e}_{(a_{1 \times n}^{(i)})} &= \mathbf{1}_{1 \times j}\,\nabla e_{(Y_{jn}^{(i)})}=\mathbf{1}_{1 \times j}\,\nabla e_{(A_{jn}^{(i)})}= \mathbf{1}_{1 \times j}\,(\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})})
\end{aligned}
$$
$a_{1 \times n}$ is shared and identical across all iteration steps. By the chain rule, and following the example above, it is straightforward to obtain:
$$
\begin{aligned}
\nabla {e}_{(a_{1 \times n}^{(i)})} &= \mathbf{1}_{1 \times j}\,(\nabla e_{(Q_{jn}^{(i)})} \odot \nabla {H_{jn}^{(i)}}_{(A_{jn}^{(i)})})\\
\nabla {e}_{(a_{1 \times n}^{(i-1)})} &= \mathbf{1}_{1 \times j}\,((\nabla e_{(Q_{jn}^{(i-1)})} + \nabla e_{(H_{jn}^{(i-1)})}) \odot \nabla {H_{jn}^{(i-1)}}_{(A_{jn}^{(i-1)})})\\
\nabla {e}_{(a_{1 \times n}^{(i-2)})} &= \mathbf{1}_{1 \times j}\,((\nabla e_{(Q_{jn}^{(i-2)})} + \nabla e_{(H_{jn}^{(i-2)})}) \odot \nabla {H_{jn}^{(i-2)}}_{(A_{jn}^{(i-2)})})\\
&\;\;\vdots\\
\nabla {e}_{(a_{1 \times n}^{(1)})} &= \mathbf{1}_{1 \times j}\,((\nabla e_{(Q_{jn}^{(1)})} + \nabla e_{(H_{jn}^{(1)})}) \odot \nabla {H_{jn}^{(1)}}_{(A_{jn}^{(1)})})
\end{aligned}
$$
Add up the results above:
$$
\nabla {e}_{(a_{1 \times n})}=\nabla {e}_{(a_{1 \times n}^{(1)})}+\nabla {e}_{(a_{1 \times n}^{(2)})}+\nabla {e}_{(a_{1 \times n}^{(3)})}+\cdots+\nabla {e}_{(a_{1 \times n}^{(i)})}
$$
By the same reasoning, since $a_{1 \times n}$ and $b_{1 \times n}$ enter $A_{jn}$ in exactly the same way, we obtain:
$$
\nabla {e}_{(b_{1 \times n})}=\nabla {e}_{(a_{1 \times n})}
$$
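The bias gradients can be sketched in the same reverse loop. Because each bias has shape `(1, n)` and is broadcast over the `j` rows, the sketch sums every step's contribution over that axis so the result matches the bias shape. The loss `sum(Q * H)` and the helper `numeric_grad` are hypothetical, introduced only for this check.

```python
import numpy as np

rng = np.random.default_rng(6)
i, j, k, n = 3, 2, 3, 4                  # arbitrary example sizes
X = rng.standard_normal((i, j, k))
H0 = rng.standard_normal((j, n))
W = 0.5 * rng.standard_normal((n, k))
V = 0.5 * rng.standard_normal((n, n))
a = rng.standard_normal((1, n))
b = rng.standard_normal((1, n))
Q = rng.standard_normal((i, j, n))       # upstream gradient for every step

def forward(a_in, b_in):
    H, h = [], H0
    for X_t in X:
        h = np.tanh(X_t @ W.T + a_in + h @ V.T + b_in)
        H.append(h)
    return np.stack(H)

def loss(a_in, b_in):
    # Hypothetical loss whose gradient w.r.t. H equals Q.
    return np.sum(Q * forward(a_in, b_in))

def numeric_grad(f, P, eps=1e-6):
    """Central finite differences of a scalar function f at array P."""
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        d = np.zeros_like(P)
        d[idx] = eps
        G[idx] = (f(P + d) - f(P - d)) / (2 * eps)
    return G

H = forward(a, b)
da = np.zeros_like(a)
dH = np.zeros((j, n))
for t in reversed(range(i)):
    dA = (Q[t] + dH) * (1.0 - H[t] ** 2)
    da += dA.sum(axis=0, keepdims=True)  # sum over the j rows -> (1, n)
    dH = dA @ V
db = da.copy()                           # b enters A exactly as a does

num_a = numeric_grad(lambda p: loss(p, b), a)
num_b = numeric_grad(lambda p: loss(a, p), b)
print(np.allclose(da, num_a, atol=1e-6), np.allclose(db, num_b, atol=1e-6))
```

In practice the two biases are redundant in this formulation, since only their sum affects the output; keeping both mirrors the definition used in this article.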