Reference: 《神经网络与深度学习》 (Neural Networks and Deep Learning)
https://legacy.gitbook.com/book/xhhjin/neural-networks-and-deep-learning-zh
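These notes mainly derive the backpropagation formulas. To build intuition for backpropagation itself, it is better to read a blog post with a more concrete example, or to watch Andrew Ng's video introduction to backpropagation, which works through an example with actual numbers. Four formulas are derived: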
$$\delta_j^L=\frac{\partial C}{\partial z^L_j}=\frac{\partial C}{\partial a^L_j} \cdot \sigma'(z^L_j) \tag{BP1}$$
$$\delta^{l} = \left( (W^{l+1})^\mathsf{T} \cdot \delta^{l+1} \right) \odot \sigma'(z^l) \tag{BP2}$$
$$\frac{\partial C}{\partial w^{l}_{jk}} = \frac{\partial C}{\partial z^{l}_{j}} \cdot \frac{\partial z^{l}_{j}}{\partial w^{l}_{jk}} = \delta_j^l \cdot a_k^{l-1} \tag{BP3}$$
$$\frac{\partial C}{\partial b^{l}_{j}} = \frac{\partial C}{\partial z^{l}_{j}} \cdot \frac{\partial z^{l}_{j}}{\partial b^{l}_{j}} = \delta_j^l \tag{BP4}$$
$$\delta_j^L=\frac{\partial C}{\partial z^L_j}=\frac{\partial C}{\partial a^L_j} \cdot \sigma'(z^L_j) \tag{1}$$
In matrix form:
$$\delta^L = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L) = \nabla_a C \odot \sigma'(z^L) \tag{2}$$
Here $\odot$ denotes the Hadamard (elementwise) product.
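Equation (2) can be tried out on a small example. The following is a minimal sketch in plain Python, assuming a sigmoid activation and the quadratic cost $C = \frac{1}{2}\sum_j (a^L_j - y_j)^2$, neither of which is fixed by the notes above. The names `output_delta`, `z_L` and `y`, and all numeric values, are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def output_delta(z_L, y):
    """Equation (2): delta^L = grad_a C (elementwise product) sigma'(z^L).

    Assumes the quadratic cost C = 0.5 * sum_j (a_j - y_j)^2, so dC/da_j = a_j - y_j.
    """
    a_L = [sigmoid(z) for z in z_L]
    grad_a = [a - t for a, t in zip(a_L, y)]
    # Hadamard product: multiply matching components, no summation.
    return [g * sigmoid_prime(z) for g, z in zip(grad_a, z_L)]

z_L = [0.5, -1.0, 2.0]   # hypothetical weighted inputs of the output layer
y = [1.0, 0.0, 1.0]      # hypothetical targets
delta_L = output_delta(z_L, y)
print(delta_L)
```

With a different cost function only the `grad_a` line would change; the elementwise multiplication by $\sigma'(z^L)$ stays the same.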
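The weighted input of neuron $j$ in layer $l+1$ is
$$z^{l+1}_j = \sum_{k=1}^{K} w^{l+1}_{jk} \cdot a^l_k + b^{l+1}_j \tag{3}$$
In matrix form this is: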
$$Z^{l+1}= W^{l+1} \cdot A^l+B^{l+1} \tag{4}$$
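As an illustration of equation (4), here is a small sketch of one forward step in plain Python. It assumes a sigmoid activation; `forward_layer` and the numeric values are hypothetical. The weight matrix is stored row by row, so `W[j][k]` plays the role of $w^{l+1}_{jk}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_layer(W, a_prev, b):
    """Return (z, a) with z_j = sum_k W[j][k] * a_prev[k] + b[j] and a_j = sigmoid(z_j)."""
    z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
         for row, b_j in zip(W, b)]
    a = [sigmoid(z_j) for z_j in z]
    return z, a

# Hypothetical layer: K = 3 inputs and J = 2 outputs, so W has J rows and K columns.
W = [[0.1, -0.2, 0.3],
     [0.4, 0.5, -0.6]]
b = [0.0, 0.1]
a_prev = [1.0, 2.0, 3.0]

z, a = forward_layer(W, a_prev, b)
print(z)  # approximately [0.6, -0.3]
```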
In equation (3), $\frac{\partial C}{\partial z^{l+1}_j}$ is already known and $\frac{\partial C}{\partial a^{l}_k}$ is what we need, so the chain rule is the natural tool:
$$\frac{\partial C}{\partial a^{l}_k}= \sum_{j=1}^{J} \frac{\partial C}{\partial z^{l+1}_j} \cdot \frac{\partial z^{l+1}_j}{\partial a^{l}_k} \tag{5}$$
The formula also has an intuitive reading: $\frac{\partial C}{\partial a^{l}_k}$ measures how strongly a change in $a^{l}_k$ affects $C$. Because $a^{l}_k$ influences $C$ through every $z^{l+1}_j$ ($j = 1, \dots, J$), equation (5) has to sum over $j$.
Equation (5) can be simplified further using equation (3):
$$\frac{\partial C}{\partial a^{l}_k} = \sum_{j=1}^{J} \delta_j^{l+1} \cdot \frac{\partial z^{l+1}_j}{\partial a^{l}_k} = \sum_{j=1}^{J} \delta_j^{l+1} \cdot w_{jk}^{l+1} \tag{6}$$
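The sum over $j$ in equation (6) can be checked numerically. The sketch below is a toy setup, not part of the original derivation: it assumes layer $l+1$ is a sigmoid output layer with a quadratic cost, so that $\delta^{l+1}$ comes directly from (BP1), and compares the analytic sum with central finite differences. All names and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_inputs(a_prev, W, b):
    return [sum(w * a for w, a in zip(row, a_prev)) + b_j for row, b_j in zip(W, b)]

def cost_from_activation(a_prev, W, b, y):
    """C as a function of a^l: one sigmoid layer followed by the assumed quadratic cost."""
    z = weighted_inputs(a_prev, W, b)
    return 0.5 * sum((sigmoid(z_j) - y_j) ** 2 for z_j, y_j in zip(z, y))

def grad_wrt_activation(a_prev, W, b, y):
    """Equation (6): dC/da_k = sum_j delta_j * W[j][k]."""
    z = weighted_inputs(a_prev, W, b)
    # delta of layer l+1 from BP1 (output layer, quadratic cost).
    delta = [(sigmoid(z_j) - y_j) * sigmoid(z_j) * (1.0 - sigmoid(z_j))
             for z_j, y_j in zip(z, y)]
    return [sum(delta[j] * W[j][k] for j in range(len(W)))
            for k in range(len(a_prev))]

W = [[0.1, -0.2, 0.3],
     [0.4, 0.5, -0.6]]
b = [0.0, 0.1]
a_prev = [0.3, 0.7, 0.5]
y = [1.0, 0.0]

analytic = grad_wrt_activation(a_prev, W, b, y)

# Central finite differences: nudge one a_k at a time and watch C.
eps = 1e-6
numeric = []
for k in range(len(a_prev)):
    up = list(a_prev); up[k] += eps
    down = list(a_prev); down[k] -= eps
    numeric.append((cost_from_activation(up, W, b, y)
                    - cost_from_activation(down, W, b, y)) / (2 * eps))

max_error = max(abs(p - q) for p, q in zip(analytic, numeric))
print(max_error)  # should be close to zero
```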
Since $a^l_k = \sigma(z^l_k)$, applying the chain rule once more gives $\frac{\partial C}{\partial z^{l}_k} = \frac{\partial C}{\partial a^{l}_k} \cdot \sigma'(z^l_k)$, so:
$$\frac{\partial C}{\partial z^{l}_k} = \left(\sum_{j=1}^{J} \delta_j^{l+1} \cdot w_{jk}^{l+1}\right) \cdot \sigma'(z_k^l) \tag{7}$$
In matrix form:
$$\begin{bmatrix} \frac{\partial C}{\partial z^{l}_1} \\ \frac{\partial C}{\partial z^{l}_2} \\ \vdots \\ \frac{\partial C}{\partial z^{l}_K} \end{bmatrix} = \begin{bmatrix} w^{l+1}_{11} & w^{l+1}_{21} & \cdots & w^{l+1}_{J1} \\ w^{l+1}_{12} & w^{l+1}_{22} & \cdots & w^{l+1}_{J2} \\ \vdots & \vdots & \ddots & \vdots \\ w^{l+1}_{1K} & w^{l+1}_{2K} & \cdots & w^{l+1}_{JK} \end{bmatrix} \cdot \begin{bmatrix} \delta_1^{l+1} \\ \delta_2^{l+1} \\ \vdots \\ \delta_J^{l+1} \end{bmatrix} \odot \begin{bmatrix} \sigma'(z_1^l) \\ \sigma'(z_2^l) \\ \vdots \\ \sigma'(z_K^l) \end{bmatrix} \tag{8}$$
The weight matrix in (8) is the transpose of $W^{l+1}$, so:
$$\begin{bmatrix} \frac{\partial C}{\partial z^{l}_1} \\ \frac{\partial C}{\partial z^{l}_2} \\ \vdots \\ \frac{\partial C}{\partial z^{l}_K} \end{bmatrix} = \begin{bmatrix} w^{l+1}_{11} & w^{l+1}_{12} & \cdots & w^{l+1}_{1K} \\ w^{l+1}_{21} & w^{l+1}_{22} & \cdots & w^{l+1}_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ w^{l+1}_{J1} & w^{l+1}_{J2} & \cdots & w^{l+1}_{JK} \end{bmatrix}^\mathsf{T} \cdot \begin{bmatrix} \delta_1^{l+1} \\ \delta_2^{l+1} \\ \vdots \\ \delta_J^{l+1} \end{bmatrix} \odot \begin{bmatrix} \sigma'(z_1^l) \\ \sigma'(z_2^l) \\ \vdots \\ \sigma'(z_K^l) \end{bmatrix} \tag{9}$$
$$\delta^{l} = \left( (W^{l+1})^\mathsf{T} \cdot \delta^{l+1} \right) \odot \sigma'(z^l) \tag{10}$$
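To make the transpose concrete, the sketch below computes $\delta^l$ twice: once with the matrix form (10), and once component by component as in equation (7). The helper names and all numbers are hypothetical, and `sp` simply stands in for the vector $\sigma'(z^l)$.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def hadamard(u, v):
    return [x * y for x, y in zip(u, v)]

def backprop_delta(W_next, delta_next, sp):
    """Equation (10): delta^l = (W^T delta^{l+1}), then elementwise times sigma'(z^l)."""
    return hadamard(matvec(transpose(W_next), delta_next), sp)

def backprop_delta_sum(W_next, delta_next, sp):
    """Equation (7): the same quantity, one component k at a time."""
    J, K = len(W_next), len(sp)
    return [sum(delta_next[j] * W_next[j][k] for j in range(J)) * sp[k]
            for k in range(K)]

W_next = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]   # J = 2 rows, K = 3 columns (hypothetical values)
delta_next = [0.5, -1.0]     # hypothetical delta of layer l+1
sp = [0.25, 0.2, 0.1]        # stand-in for sigma'(z^l)

delta_matrix = backprop_delta(W_next, delta_next, sp)
delta_sum = backprop_delta_sum(W_next, delta_next, sp)
print(delta_matrix)  # approximately [-0.875, -0.8, -0.45]
print(delta_sum)     # same values
```

Note the shapes: $W^{l+1}$ has $J$ rows and $K$ columns, so its transpose maps the $J$-dimensional $\delta^{l+1}$ back to the $K$ neurons of layer $l$.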
3. Derivation of BP3:
Next we derive $\frac{\partial C}{\partial w^{l}_{jk}}$ and $\frac{\partial C}{\partial b^{l}}$, the gradients that are actually needed to update the network's parameters. We start with $\frac{\partial C}{\partial w^{l}_{jk}}$.
From equation (3), applied to layer $l$:
$$\frac{\partial C}{\partial w^{l}_{jk}} = \frac{\partial C}{\partial z^{l}_{j}} \cdot \frac{\partial z^{l}_{j}}{\partial w^{l}_{jk}} = \delta_j^l \cdot a_k^{l-1} \tag{11}$$
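Collected over all $j$ and $k$, equation (11) says that the gradient with respect to $W^l$ is the outer product of $\delta^l$ and $a^{l-1}$. A minimal sketch with hypothetical values:

```python
def weight_gradient(delta, a_prev):
    """Equation (11): grad[j][k] = delta[j] * a_prev[k], same shape as W^l."""
    return [[d_j * a_k for a_k in a_prev] for d_j in delta]

delta = [0.5, -1.0]          # hypothetical delta^l, one entry per neuron in layer l
a_prev = [1.0, 2.0, 3.0]     # hypothetical activations a^{l-1}

grad_W = weight_gradient(delta, a_prev)
print(grad_W)  # [[0.5, 1.0, 1.5], [-1.0, -2.0, -3.0]]
```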
4. Derivation of BP4:
Next we derive $\frac{\partial C}{\partial b^{l}}$:
$$\frac{\partial C}{\partial b^{l}_{j}} = \frac{\partial C}{\partial z^{l}_{j}} \cdot \frac{\partial z^{l}_{j}}{\partial b^{l}_{j}} = \delta_j^l \tag{12}$$
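The four formulas can be combined into one backward pass. The following is a sketch rather than a reference implementation: it assumes sigmoid activations and a quadratic cost on a single training example, uses plain Python lists, and picks hypothetical layer sizes and random parameters. One weight gradient is then compared against a central finite difference as a sanity check.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, x):
    """Return the weighted inputs z and the activations a of every layer (a[0] = x)."""
    activations, zs = [x], []
    for W, b in zip(weights, biases):
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b_j
             for row, b_j in zip(W, b)]
        zs.append(z)
        activations.append([sigmoid(v) for v in z])
    return zs, activations

def cost(weights, biases, x, y):
    _, activations = forward(weights, biases, x)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(activations[-1], y))

def backprop(weights, biases, x, y):
    """Return (grad_W, grad_b), one entry per layer, using BP1 to BP4."""
    zs, activations = forward(weights, biases, x)
    sp = [[sigmoid(v) * (1.0 - sigmoid(v)) for v in z] for z in zs]
    # BP1: output-layer error for the assumed quadratic cost.
    delta = [(a - t) * s for a, t, s in zip(activations[-1], y, sp[-1])]
    grad_W, grad_b = [], []
    for layer in range(len(weights) - 1, -1, -1):
        # BP3: dC/dw_jk = delta_j * a_k of the previous layer.
        grad_W.insert(0, [[d * a for a in activations[layer]] for d in delta])
        # BP4: dC/db_j = delta_j.
        grad_b.insert(0, list(delta))
        if layer > 0:
            # BP2: push the error back through W^T, then multiply by sigma'(z).
            W = weights[layer]
            delta = [sum(delta[j] * W[j][k] for j in range(len(W))) * sp[layer - 1][k]
                     for k in range(len(W[0]))]
    return grad_W, grad_b

random.seed(0)
sizes = [3, 4, 2]  # hypothetical layer sizes: input, hidden, output
weights = [[[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[random.uniform(-1, 1) for _ in range(n_out)] for n_out in sizes[1:]]
x, y = [0.2, -0.4, 0.9], [1.0, 0.0]

grad_W, grad_b = backprop(weights, biases, x, y)

# Sanity check: perturb one weight and compare with the analytic gradient.
eps = 1e-6
original = weights[0][1][2]
weights[0][1][2] = original + eps
c_plus = cost(weights, biases, x, y)
weights[0][1][2] = original - eps
c_minus = cost(weights, biases, x, y)
weights[0][1][2] = original
numeric = (c_plus - c_minus) / (2 * eps)
print(abs(numeric - grad_W[0][1][2]))  # should be close to zero
```

A gradient descent step would then subtract a learning rate times `grad_W` and `grad_b` from the corresponding parameters.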