The four fundamental equations of backpropagation (BP):
$$
\begin{aligned}
\delta_i^{(L)} &= \nabla_{y_i} Cost \cdot \sigma'(logit_i^{(L)}) \\
\delta_i^{(l)} &= \sum_j \delta_j^{(l+1)} w_{ji}^{(l+1)} \sigma'(logit_i^{(l)}) \\
\frac{\partial Cost}{\partial bias_i^{(l)}} &= \delta_i^{(l)} \\
\frac{\partial Cost}{\partial w_{ij}^{(l)}} &= \delta_i^{(l)} h_j^{(l-1)}
\end{aligned}
$$
Here $(l)$ denotes the $l$-th layer (the network has $L$ layers in total), and $i, j$ index neurons within a layer.
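The snippets below illustrate these four equations in NumPy. They assume $\sigma$ is the logistic sigmoid (an assumption; the text does not fix a particular activation), and the helper names `sigmoid` and `sigmoid_prime` are mine, reused by every later sketch:

```python
import numpy as np

def sigmoid(z):
    # logistic activation: sigma(z) = 1 / (1 + exp(-z)), elementwise
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # its derivative: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)
```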
The point of the backpropagation equations is to obtain $\frac{\partial Cost}{\partial bias_i^{(l)}}$ and $\frac{\partial Cost}{\partial w_{ij}^{(l)}}$.
In the course of the derivation, note that
$$
\begin{aligned}
\frac{\partial Cost}{\partial bias_i^{(l)}} &= \frac{\partial Cost}{\partial logit_i^{(l)}} \cdot \frac{\partial logit_i^{(l)}}{\partial bias_i^{(l)}} \\
\frac{\partial Cost}{\partial w_{ij}^{(l)}} &= \frac{\partial Cost}{\partial logit_i^{(l)}} \cdot \frac{\partial logit_i^{(l)}}{\partial w_{ij}^{(l)}}
\end{aligned}
$$
so both gradients reduce to the same quantity, $\frac{\partial Cost}{\partial logit_i^{(l)}}$.
Now, since the pre-activation of layer $l$ is a linear function of the previous layer's activations $h^{(l-1)}$:
$$
logit_i^{(l)} = w_{ij}^{(l)} h_j^{(l-1)} + \sum_{k\ne j} w_{ik}^{(l)} h_k^{(l-1)} + bias_i^{(l)}
$$
it follows that
$$
\begin{aligned}
\frac{\partial logit_i^{(l)}}{\partial bias_i^{(l)}} &= 1 \\
\frac{\partial logit_i^{(l)}}{\partial w_{ij}^{(l)}} &= h_j^{(l-1)}
\end{aligned}
$$
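Both partials are easy to check numerically. A minimal sketch (the shapes and the names `W`, `h_prev`, `bias` are made up for illustration; it reuses the imports above) perturbs a single weight and compares the finite difference against $h_j^{(l-1)}$:

```python
rng = np.random.default_rng(0)
M, N = 3, 4                        # layer l has M neurons, layer l-1 has N
W = rng.normal(size=(M, N))        # W[i, j] = w_ij^(l)
h_prev = rng.normal(size=N)        # h^(l-1)
bias = rng.normal(size=M)

def logit_vec(W, h_prev, bias):
    # logit_i^(l) = sum_j w_ij^(l) h_j^(l-1) + bias_i^(l)
    return W @ h_prev + bias

i, j, eps = 1, 2, 1e-6
W_pert = W.copy()
W_pert[i, j] += eps
fd = (logit_vec(W_pert, h_prev, bias)[i] - logit_vec(W, h_prev, bias)[i]) / eps
print(fd, h_prev[j])               # the two numbers agree up to O(eps)
```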
The only remaining problem is computing $\frac{\partial Cost}{\partial logit_i^{(l)}}$ itself, which can be done by recursion:
To keep the formulas compact, write $\delta_i^{(l)}$ for $\frac{\partial Cost}{\partial logit_i^{(l)}}$. Applying the chain rule through every neuron $j$ of layer $l+1$,
$$
\delta_i^{(l)} = \frac{\partial Cost}{\partial logit_i^{(l)}} = \sum_j \frac{\partial Cost}{\partial logit_j^{(l+1)}} \cdot \frac{\partial logit_j^{(l+1)}}{\partial logit_i^{(l)}} = \sum_j \delta_j^{(l+1)} \cdot \frac{\partial logit_j^{(l+1)}}{\partial logit_i^{(l)}}
$$
It is easy to see that
$$
logit_j^{(l+1)} = w_{ji}^{(l+1)} \sigma(logit_i^{(l)}) + \sum_{k\ne i} w_{jk}^{(l+1)} \sigma(logit_k^{(l)}) + bias_j^{(l+1)}
$$
so
$$
\frac{\partial logit_j^{(l+1)}}{\partial logit_i^{(l)}} = \frac{\partial \left[ w_{ji}^{(l+1)} \sigma(logit_i^{(l)}) + \sum_{k\ne i} w_{jk}^{(l+1)} \sigma(logit_k^{(l)}) + bias_j^{(l+1)} \right]}{\partial logit_i^{(l)}} = w_{ji}^{(l+1)} \sigma'(logit_i^{(l)})
$$
Therefore,
$$
\delta_i^{(l)} = \sum_j \delta_j^{(l+1)} w_{ji}^{(l+1)} \sigma'(logit_i^{(l)})
$$
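Spelled out as loops over neurons, the recurrence looks like this (a sketch with toy data; `W_next[j, i]` plays the role of $w_{ji}^{(l+1)}$, `delta_next` of $\delta^{(l+1)}$, and `logit_l` of $logit^{(l)}$, all illustrative names):

```python
M, N_next = 4, 3                         # widths of layers l and l+1
rng = np.random.default_rng(1)
W_next = rng.normal(size=(N_next, M))    # W_next[j, i] = w_ji^(l+1)
delta_next = rng.normal(size=N_next)     # delta^(l+1)
logit_l = rng.normal(size=M)             # pre-activations of layer l

# delta_i^(l) = (sum_j delta_j^(l+1) w_ji^(l+1)) * sigma'(logit_i^(l))
delta = np.zeros(M)
for i in range(M):
    for j in range(N_next):
        delta[i] += delta_next[j] * W_next[j, i]
    delta[i] *= sigmoid_prime(logit_l[i])
```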
This is the recurrence for $\delta_i^{(l)}$. Every recurrence needs a base case, and here it is $\delta_i^{(L)}$ at the output layer, where $y_i = \sigma(logit_i^{(L)})$:
$$
\delta_i^{(L)} = \frac{\partial Cost}{\partial logit_i^{(L)}} = \frac{\partial Cost}{\partial y_i} \cdot \frac{\partial y_i}{\partial logit_i^{(L)}} = \frac{\partial Cost}{\partial y_i} \cdot \frac{\partial \sigma(logit_i^{(L)})}{\partial logit_i^{(L)}} = \nabla_{y_i} Cost \cdot \sigma'(logit_i^{(L)})
$$
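To make the base case concrete, suppose the cost is the mean squared error $Cost = \frac{1}{2}\sum_i (y_i - t_i)^2$ with targets $t$ (an assumed example; the text leaves the cost unspecified), so that $\nabla_{y_i} Cost = y_i - t_i$:

```python
M = 3
rng = np.random.default_rng(2)
logit_L = rng.normal(size=M)       # output-layer pre-activations
y = sigmoid(logit_L)               # network outputs y_i = sigma(logit_i^(L))
t = rng.uniform(size=M)            # made-up targets

delta_L = np.empty(M)
for i in range(M):
    # delta_i^(L) = (y_i - t_i) * sigma'(logit_i^(L)) under the assumed MSE cost
    delta_L[i] = (y[i] - t[i]) * sigmoid_prime(logit_L[i])
```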
That is the whole line of thought behind the four fundamental BP equations.
Now consider the matrix form:
$$
\begin{aligned}
\delta^{(L)} &= \begin{bmatrix} \delta_1^{(L)} \\ \delta_2^{(L)} \\ \vdots \\ \delta_M^{(L)} \end{bmatrix}
= \begin{bmatrix} \nabla_{y_1} Cost \cdot \sigma'(logit_1^{(L)}) \\ \nabla_{y_2} Cost \cdot \sigma'(logit_2^{(L)}) \\ \vdots \\ \nabla_{y_M} Cost \cdot \sigma'(logit_M^{(L)}) \end{bmatrix}
= \begin{bmatrix} \nabla_{y_1} Cost \\ \nabla_{y_2} Cost \\ \vdots \\ \nabla_{y_M} Cost \end{bmatrix} \odot \begin{bmatrix} \sigma'(logit_1^{(L)}) \\ \sigma'(logit_2^{(L)}) \\ \vdots \\ \sigma'(logit_M^{(L)}) \end{bmatrix} \\
\Rightarrow \quad \delta^{(L)} &= \nabla_y Cost \odot \sigma'(logit^{(L)})
\end{aligned}
$$
Here $\odot$ denotes the Hadamard product, i.e. element-wise multiplication, and $M$ is the number of neurons in layer $L$.
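In NumPy the Hadamard product is simply the `*` operator on equal-shaped arrays, so the explicit loop from the base-case sketch above collapses to one line:

```python
# delta^(L) = nabla_y Cost ⊙ sigma'(logit^(L)), with the assumed MSE cost
delta_L = (y - t) * sigmoid_prime(logit_L)
```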
$$
\begin{aligned}
\delta^{(l)} &= \begin{bmatrix} \delta_1^{(l)} \\ \delta_2^{(l)} \\ \vdots \\ \delta_M^{(l)} \end{bmatrix}
= \begin{bmatrix} \sum_j \delta_j^{(l+1)} w_{j1}^{(l+1)} \sigma'(logit_1^{(l)}) \\ \sum_j \delta_j^{(l+1)} w_{j2}^{(l+1)} \sigma'(logit_2^{(l)}) \\ \vdots \\ \sum_j \delta_j^{(l+1)} w_{jM}^{(l+1)} \sigma'(logit_M^{(l)}) \end{bmatrix} \\
&= \begin{bmatrix} \sum_j \delta_j^{(l+1)} w_{j1}^{(l+1)} \\ \sum_j \delta_j^{(l+1)} w_{j2}^{(l+1)} \\ \vdots \\ \sum_j \delta_j^{(l+1)} w_{jM}^{(l+1)} \end{bmatrix} \odot \begin{bmatrix} \sigma'(logit_1^{(l)}) \\ \sigma'(logit_2^{(l)}) \\ \vdots \\ \sigma'(logit_M^{(l)}) \end{bmatrix} \\
&= \begin{bmatrix} w_{11}^{(l+1)} & w_{21}^{(l+1)} & \ldots & w_{N1}^{(l+1)} \\ w_{12}^{(l+1)} & w_{22}^{(l+1)} & \ldots & w_{N2}^{(l+1)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{1M}^{(l+1)} & w_{2M}^{(l+1)} & \ldots & w_{NM}^{(l+1)} \end{bmatrix} \begin{bmatrix} \delta_1^{(l+1)} \\ \delta_2^{(l+1)} \\ \vdots \\ \delta_N^{(l+1)} \end{bmatrix} \odot \begin{bmatrix} \sigma'(logit_1^{(l)}) \\ \sigma'(logit_2^{(l)}) \\ \vdots \\ \sigma'(logit_M^{(l)}) \end{bmatrix} \\
&= \begin{bmatrix} w_{11}^{(l+1)} & w_{12}^{(l+1)} & \ldots & w_{1M}^{(l+1)} \\ w_{21}^{(l+1)} & w_{22}^{(l+1)} & \ldots & w_{2M}^{(l+1)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N1}^{(l+1)} & w_{N2}^{(l+1)} & \ldots & w_{NM}^{(l+1)} \end{bmatrix}^T \begin{bmatrix} \delta_1^{(l+1)} \\ \delta_2^{(l+1)} \\ \vdots \\ \delta_N^{(l+1)} \end{bmatrix} \odot \begin{bmatrix} \sigma'(logit_1^{(l)}) \\ \sigma'(logit_2^{(l)}) \\ \vdots \\ \sigma'(logit_M^{(l)}) \end{bmatrix} \\
\Rightarrow \quad \delta^{(l)} &= [(W^{(l+1)})^T \delta^{(l+1)}] \odot \sigma'(logit^{(l)})
\end{aligned}
$$

where $N$ is the number of neurons in layer $l+1$, and $W^{(l+1)}$ is the $N \times M$ weight matrix of layer $l+1$.
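The vectorized recurrence is likewise a one-liner (same illustrative names as in the loop sketch; `W_next` has shape $N \times M$):

```python
# delta^(l) = [(W^(l+1))^T delta^(l+1)] ⊙ sigma'(logit^(l))
delta = (W_next.T @ delta_next) * sigmoid_prime(logit_l)
```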
Because $\frac{\partial Cost}{\partial bias_i^{(l)}} = \delta_i^{(l)}$ holds componentwise, stacking gives $\frac{\partial Cost}{\partial bias^{(l)}} = \delta^{(l)}$. That leaves only the weight gradient (here $N$ denotes the number of neurons in layer $l-1$):
$$
\begin{aligned}
\frac{\partial Cost}{\partial w^{(l)}} &= \begin{bmatrix} \frac{\partial Cost}{\partial w_{11}^{(l)}} & \frac{\partial Cost}{\partial w_{12}^{(l)}} & \ldots & \frac{\partial Cost}{\partial w_{1N}^{(l)}} \\ \frac{\partial Cost}{\partial w_{21}^{(l)}} & \frac{\partial Cost}{\partial w_{22}^{(l)}} & \ldots & \frac{\partial Cost}{\partial w_{2N}^{(l)}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial Cost}{\partial w_{M1}^{(l)}} & \frac{\partial Cost}{\partial w_{M2}^{(l)}} & \ldots & \frac{\partial Cost}{\partial w_{MN}^{(l)}} \end{bmatrix}
= \begin{bmatrix} \delta_1^{(l)} h_1^{(l-1)} & \delta_1^{(l)} h_2^{(l-1)} & \ldots & \delta_1^{(l)} h_N^{(l-1)} \\ \delta_2^{(l)} h_1^{(l-1)} & \delta_2^{(l)} h_2^{(l-1)} & \ldots & \delta_2^{(l)} h_N^{(l-1)} \\ \vdots & \vdots & \ddots & \vdots \\ \delta_M^{(l)} h_1^{(l-1)} & \delta_M^{(l)} h_2^{(l-1)} & \ldots & \delta_M^{(l)} h_N^{(l-1)} \end{bmatrix} \\
&= \begin{bmatrix} \delta_1^{(l)} \\ \delta_2^{(l)} \\ \vdots \\ \delta_M^{(l)} \end{bmatrix} \begin{bmatrix} h_1^{(l-1)} & h_2^{(l-1)} & \ldots & h_N^{(l-1)} \end{bmatrix} \\
\Rightarrow \quad \frac{\partial Cost}{\partial w^{(l)}} &= \delta^{(l)} \cdot (h^{(l-1)})^T
\end{aligned}
$$
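In code the outer product, together with the bias gradient, becomes (continuing the same sketch; `h_prev` here is a made-up stand-in for $h^{(l-1)}$):

```python
h_prev = rng.normal(size=5)        # illustrative h^(l-1), so N = 5
dW = np.outer(delta, h_prev)       # dCost/dW^(l) = delta^(l) (h^(l-1))^T, shape (M, N)
db = delta                         # dCost/dbias^(l) = delta^(l)
```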
Putting everything together, the matrix form of the four fundamental BP equations reads:
$$
\begin{aligned}
\delta^{(L)} &= \nabla_y Cost \odot \sigma'(logit^{(L)}) \\
\delta^{(l)} &= [(W^{(l+1)})^T \delta^{(l+1)}] \odot \sigma'(logit^{(l)}) \\
\frac{\partial Cost}{\partial bias^{(l)}} &= \delta^{(l)} \\
\frac{\partial Cost}{\partial w^{(l)}} &= \delta^{(l)} \cdot (h^{(l-1)})^T
\end{aligned}
$$
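Chaining the four matrix equations yields a complete backward pass. The sketch below is a hypothetical helper, assuming sigmoid activations everywhere and the MSE cost used earlier; it caches pre-activations and activations on the forward pass, then applies the four equations in order:

```python
def backprop(Ws, bs, x, t):
    """Gradients of Cost = 0.5 * ||y - t||^2 for a sigmoid MLP (assumed setup).
    Ws[l] maps layer-l activations to layer-(l+1) pre-activations."""
    hs, logits = [x], []
    for W, b in zip(Ws, bs):                 # forward pass, caching everything
        logits.append(W @ hs[-1] + b)
        hs.append(sigmoid(logits[-1]))
    delta = (hs[-1] - t) * sigmoid_prime(logits[-1])        # base case
    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(delta, hs[l]))           # weight gradient
        grads_b.insert(0, delta)                            # bias gradient
        if l > 0:                                           # recurrence, one layer back
            delta = (Ws[l].T @ delta) * sigmoid_prime(logits[l - 1])
    return grads_W, grads_b

# Example: a 4-5-3 network on random data.
rng = np.random.default_rng(3)
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
bs = [np.zeros(5), np.zeros(3)]
gW, gb = backprop(Ws, bs, rng.normal(size=4), rng.uniform(size=3))
```

A quick way to validate such a sketch is to compare each entry of `gW` and `gb` against a central finite difference of the cost; in float64 the two typically agree to about $10^{-6}$ relative error.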
Theoretical derivation of backpropagation in convolutional neural networks: http://www.uml.org.cn/ai/201809102.asp?artid=21154
Backpropagation in convolutional neural networks (CNNs): https://www.cnblogs.com/pinard/p/6494810.html