Backpropagation Formula Derivation for Fully Connected and Convolutional Neural Networks


Backpropagation Formulas for Fully Connected Networks

The four fundamental equations of backpropagation (BP):

$$
\begin{aligned}
\delta_i^{(L)} &= \bigtriangledown_{y_i} Cost \cdot \sigma'(logit_i^{(L)}) \\
\delta_i^{(l)} &= \sum_j \delta_j^{(l+1)} w_{ji}^{(l+1)} \sigma'(logit_i^{(l)}) \\
\frac{\partial Cost}{\partial bias_i^{(l)}} &= \delta_i^{(l)} \\
\frac{\partial Cost}{\partial w_{ij}^{(l)}} &= \delta_i^{(l)} h_j^{(l-1)}
\end{aligned}
$$

Here, the superscript $(l)$ denotes layer $l$ (there are $L$ layers in total), and $i, j$ index the neurons within a layer.
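To make the notation concrete, the following is a minimal NumPy sketch of the forward pass assumed throughout: `logit` is the pre-activation, `h = sigma(logit)` is the activation, and `sigma` is taken to be the sigmoid for illustration (the derivation itself works for any differentiable activation). The layer sizes and function names are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def sigma(x):
    # assumed activation: sigmoid, so sigma'(x) = sigma(x) * (1 - sigma(x))
    return 1.0 / (1.0 + np.exp(-x))

def forward_layer(W, bias, h_prev):
    """One fully connected layer: logit^(l) = W^(l) h^(l-1) + bias^(l), h^(l) = sigma(logit^(l))."""
    logit = W @ h_prev + bias
    h = sigma(logit)
    return logit, h
```

Later snippets reuse this convention: row $i$ of `W` holds the weights $w_{ij}$ feeding neuron $i$ of the current layer.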

The goal of the backpropagation formulas is to obtain $\frac{\partial Cost}{\partial bias_i^{(l)}}$ and $\frac{\partial Cost}{\partial w_{ij}^{(l)}}$.

In the derivation, writing out the chain rule

$$
\begin{aligned}
\frac{\partial Cost}{\partial bias_i^{(l)}} &= \frac{\partial Cost}{\partial logit_i^{(l)}} \cdot \frac{\partial logit_i^{(l)}}{\partial bias_i^{(l)}} \\
\frac{\partial Cost}{\partial w_{ij}^{(l)}} &= \frac{\partial Cost}{\partial logit_i^{(l)}} \cdot \frac{\partial logit_i^{(l)}}{\partial w_{ij}^{(l)}}
\end{aligned}
$$

shows that both derivatives require $\frac{\partial Cost}{\partial logit_i^{(l)}}$. By definition,

$$
logit_i^{(l)} = w_{ij}^{(l)} h_j^{(l-1)} + \sum_{k\ne j} w_{ik}^{(l)} h_{k}^{(l-1)} + bias_i^{(l)}
$$

Therefore,

$$
\begin{aligned}
\frac{\partial logit_i^{(l)}}{\partial bias_i^{(l)}} &= 1 \\
\frac{\partial logit_i^{(l)}}{\partial w_{ij}^{(l)}} &= h_j^{(l-1)}
\end{aligned}
$$
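These two partial derivatives are easy to sanity-check numerically with finite differences; a small sketch (the sizes and random seed are arbitrary):

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(3, 4); bias = np.random.randn(3); h_prev = np.random.randn(4)
i, j, eps = 1, 2, 1e-6

logit = lambda W, b: W @ h_prev + b

# perturb w_ij: the change in logit_i should be approximately h_prev[j]
W_plus = W.copy(); W_plus[i, j] += eps
print((logit(W_plus, bias)[i] - logit(W, bias)[i]) / eps, "vs", h_prev[j])

# perturb bias_i: the change in logit_i should be approximately 1
b_plus = bias.copy(); b_plus[i] += eps
print((logit(W, b_plus)[i] - logit(W, bias)[i]) / eps, "vs", 1.0)
```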

So the only remaining problem is computing $\frac{\partial Cost}{\partial logit_i^{(l)}}$, which can be done recursively.

To keep the formulas concise, write $\frac{\partial Cost}{\partial logit_i^{(l)}}$ as $\delta_i^{(l)}$. Then

$$
\delta_i^{(l)} = \frac{\partial Cost}{\partial logit_i^{(l)}} = \sum_j \frac{\partial Cost}{\partial logit_j^{(l+1)}} \cdot \frac{\partial logit_j^{(l+1)}}{\partial logit_i^{(l)}} = \sum_j \delta_j^{(l+1)} \cdot \frac{\partial logit_j^{(l+1)}}{\partial logit_i^{(l)}}
$$

It is easy to see that

$$
logit_j^{(l+1)} = w_{ji}^{(l+1)} \sigma(logit_i^{(l)}) + \sum_{k\ne i} w_{jk}^{(l+1)} \sigma(logit_k^{(l)}) + bias_j^{(l+1)}
$$

so

$$
\frac{\partial logit_j^{(l+1)}}{\partial logit_i^{(l)}} = \frac{\partial \left[ w_{ji}^{(l+1)} \sigma(logit_i^{(l)}) + \sum_{k\ne i} w_{jk}^{(l+1)} \sigma(logit_k^{(l)}) + bias_j^{(l+1)} \right]}{\partial logit_i^{(l)}} = w_{ji}^{(l+1)} \sigma'(logit_i^{(l)})
$$

Hence

$$
\delta_i^{(l)} = \sum_j \delta_j^{(l+1)} w_{ji}^{(l+1)} \sigma'(logit_i^{(l)})
$$
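Written as code, this recurrence is just a double loop over the neurons of the two layers. A sketch, assuming the sigmoid activation from the earlier snippet:

```python
import numpy as np

def sigma_prime(x):
    # derivative of the assumed sigmoid activation
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def delta_prev_layer(delta_next, W_next, logit_curr):
    """Scalar recurrence: delta_i^(l) = sum_j delta_j^(l+1) * w_ji^(l+1) * sigma'(logit_i^(l))."""
    M = logit_curr.shape[0]                   # neurons in layer l
    delta_curr = np.zeros(M)
    for i in range(M):
        for j in range(delta_next.shape[0]):  # neurons in layer l+1
            delta_curr[i] += delta_next[j] * W_next[j, i] * sigma_prime(logit_curr[i])
    return delta_curr
```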

This is the recurrence for $\delta_i^{(l)}$. Of course, a recurrence needs a base case, namely $\delta_i^{(L)}$ at the output layer:

$$
\delta_i^{(L)} = \frac{\partial Cost}{\partial logit_i^{(L)}} = \frac{\partial Cost}{\partial y_i} \cdot \frac{\partial y_i}{\partial logit_i^{(L)}} = \frac{\partial Cost}{\partial y_i} \cdot \frac{\partial \sigma(logit_i^{(L)})}{\partial logit_i^{(L)}} = \bigtriangledown_{y_i} Cost \cdot \sigma'(logit_i^{(L)})
$$

This is the reasoning behind the four fundamental BP equations.

Now consider the matrix form:

$$
\begin{aligned}
\delta^{(L)} &= \left[ \begin{array}{c} \delta_1^{(L)} \\ \delta_2^{(L)} \\ \vdots \\ \delta_M^{(L)} \end{array} \right]
= \left[ \begin{array}{c} \bigtriangledown_{y_1} Cost \cdot \sigma'(logit_1^{(L)}) \\ \bigtriangledown_{y_2} Cost \cdot \sigma'(logit_2^{(L)}) \\ \vdots \\ \bigtriangledown_{y_M} Cost \cdot \sigma'(logit_M^{(L)}) \end{array} \right]
= \left[ \begin{array}{c} \bigtriangledown_{y_1} Cost \\ \bigtriangledown_{y_2} Cost \\ \vdots \\ \bigtriangledown_{y_M} Cost \end{array} \right] \odot \left[ \begin{array}{c} \sigma'(logit_1^{(L)}) \\ \sigma'(logit_2^{(L)}) \\ \vdots \\ \sigma'(logit_M^{(L)}) \end{array} \right] \\
\Rightarrow \quad \delta^{(L)} &= \bigtriangledown_y Cost \odot \sigma'(logit^{(L)})
\end{aligned}
$$

Here, $\odot$ denotes the Hadamard product (element-wise multiplication), and $M$ is the number of neurons in layer $L$.
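A quick NumPy illustration of the difference between the Hadamard product and an ordinary inner product (the values are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(a * b)   # Hadamard product: element-wise, [ 4. 10. 18.]
print(a @ b)   # inner product: a single scalar, 32.0
```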

$$
\begin{aligned}
\delta^{(l)} &= \left[ \begin{array}{c} \delta_1^{(l)} \\ \delta_2^{(l)} \\ \vdots \\ \delta_M^{(l)} \end{array} \right]
= \left[ \begin{array}{c} \sum_j \delta_j^{(l+1)} w_{j1}^{(l+1)} \sigma'(logit_1^{(l)}) \\ \sum_j \delta_j^{(l+1)} w_{j2}^{(l+1)} \sigma'(logit_2^{(l)}) \\ \vdots \\ \sum_j \delta_j^{(l+1)} w_{jM}^{(l+1)} \sigma'(logit_M^{(l)}) \end{array} \right] \\
&= \left[ \begin{array}{c} \sum_j \delta_j^{(l+1)} w_{j1}^{(l+1)} \\ \sum_j \delta_j^{(l+1)} w_{j2}^{(l+1)} \\ \vdots \\ \sum_j \delta_j^{(l+1)} w_{jM}^{(l+1)} \end{array} \right] \odot \left[ \begin{array}{c} \sigma'(logit_1^{(l)}) \\ \sigma'(logit_2^{(l)}) \\ \vdots \\ \sigma'(logit_M^{(l)}) \end{array} \right] \\
&= \left[ \begin{array}{cccc} w_{11}^{(l+1)} & w_{21}^{(l+1)} & \ldots & w_{N1}^{(l+1)} \\ w_{12}^{(l+1)} & w_{22}^{(l+1)} & \ldots & w_{N2}^{(l+1)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{1M}^{(l+1)} & w_{2M}^{(l+1)} & \ldots & w_{NM}^{(l+1)} \end{array} \right] \left[ \begin{array}{c} \delta_1^{(l+1)} \\ \delta_2^{(l+1)} \\ \vdots \\ \delta_N^{(l+1)} \end{array} \right] \odot \left[ \begin{array}{c} \sigma'(logit_1^{(l)}) \\ \sigma'(logit_2^{(l)}) \\ \vdots \\ \sigma'(logit_M^{(l)}) \end{array} \right] \\
&= \left[ \begin{array}{cccc} w_{11}^{(l+1)} & w_{12}^{(l+1)} & \ldots & w_{1M}^{(l+1)} \\ w_{21}^{(l+1)} & w_{22}^{(l+1)} & \ldots & w_{2M}^{(l+1)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N1}^{(l+1)} & w_{N2}^{(l+1)} & \ldots & w_{NM}^{(l+1)} \end{array} \right]^T \left[ \begin{array}{c} \delta_1^{(l+1)} \\ \delta_2^{(l+1)} \\ \vdots \\ \delta_N^{(l+1)} \end{array} \right] \odot \left[ \begin{array}{c} \sigma'(logit_1^{(l)}) \\ \sigma'(logit_2^{(l)}) \\ \vdots \\ \sigma'(logit_M^{(l)}) \end{array} \right] \\
\Rightarrow \quad \delta^{(l)} &= \left[ (W^{(l+1)})^T \delta^{(l+1)} \right] \odot \sigma'(logit^{(l)})
\end{aligned}
$$

Here $N$ denotes the number of neurons in layer $l+1$.

Since $\frac{\partial Cost}{\partial bias_i^{(l)}} = \delta_i^{(l)}$, in vector form $\frac{\partial Cost}{\partial bias^{(l)}} = \delta^{(l)}$. That leaves only the weight gradient:

$$
\begin{aligned}
\frac{\partial Cost}{\partial w^{(l)}} &= \left[ \begin{array}{cccc} \frac{\partial Cost}{\partial w_{11}^{(l)}} & \frac{\partial Cost}{\partial w_{12}^{(l)}} & \ldots & \frac{\partial Cost}{\partial w_{1N}^{(l)}} \\ \frac{\partial Cost}{\partial w_{21}^{(l)}} & \frac{\partial Cost}{\partial w_{22}^{(l)}} & \ldots & \frac{\partial Cost}{\partial w_{2N}^{(l)}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial Cost}{\partial w_{M1}^{(l)}} & \frac{\partial Cost}{\partial w_{M2}^{(l)}} & \ldots & \frac{\partial Cost}{\partial w_{MN}^{(l)}} \end{array} \right]
= \left[ \begin{array}{cccc} \delta_1^{(l)} h_1^{(l-1)} & \delta_1^{(l)} h_2^{(l-1)} & \ldots & \delta_1^{(l)} h_N^{(l-1)} \\ \delta_2^{(l)} h_1^{(l-1)} & \delta_2^{(l)} h_2^{(l-1)} & \ldots & \delta_2^{(l)} h_N^{(l-1)} \\ \vdots & \vdots & \ddots & \vdots \\ \delta_M^{(l)} h_1^{(l-1)} & \delta_M^{(l)} h_2^{(l-1)} & \ldots & \delta_M^{(l)} h_N^{(l-1)} \end{array} \right] \\
&= \left[ \begin{array}{c} \delta_1^{(l)} \\ \delta_2^{(l)} \\ \vdots \\ \delta_M^{(l)} \end{array} \right] \left[ \begin{array}{cccc} h_1^{(l-1)} & h_2^{(l-1)} & \ldots & h_N^{(l-1)} \end{array} \right] \\
\Rightarrow \quad \frac{\partial Cost}{\partial w^{(l)}} &= \delta^{(l)} \cdot (h^{(l-1)})^T
\end{aligned}
$$
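In NumPy the last step is a single outer product; a minimal sketch with made-up dimensions:

```python
import numpy as np

delta_l = np.random.randn(3)        # delta^(l), M = 3 neurons in layer l
h_prev = np.random.randn(4)         # h^(l-1), N = 4 neurons in layer l-1
grad_W = np.outer(delta_l, h_prev)  # shape (3, 4), matches W^(l): grad_W[i, j] = delta_i^(l) * h_j^(l-1)
print(grad_W.shape)
```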

In summary, the four fundamental BP equations in matrix form are:

$$
\begin{aligned}
\delta^{(L)} &= \bigtriangledown_y Cost \odot \sigma'(logit^{(L)}) \\
\delta^{(l)} &= \left[ (W^{(l+1)})^T \delta^{(l+1)} \right] \odot \sigma'(logit^{(l)}) \\
\frac{\partial Cost}{\partial bias^{(l)}} &= \delta^{(l)} \\
\frac{\partial Cost}{\partial w^{(l)}} &= \delta^{(l)} \cdot (h^{(l-1)})^T
\end{aligned}
$$
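Putting the four matrix equations together, here is a minimal sketch for a two-layer network, assuming sigmoid activations and a squared-error cost (the cost function, layer sizes, and random initialization are illustrative assumptions), followed by a finite-difference check of one weight gradient:

```python
import numpy as np

np.random.seed(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
sigma_prime = lambda x: sigma(x) * (1.0 - sigma(x))

# a tiny 2-layer network: input (4) -> hidden (3) -> output (2)
W1, b1 = np.random.randn(3, 4), np.random.randn(3)
W2, b2 = np.random.randn(2, 3), np.random.randn(2)
x, target = np.random.randn(4), np.random.randn(2)

# forward pass
logit1 = W1 @ x + b1;  h1 = sigma(logit1)
logit2 = W2 @ h1 + b2; y = sigma(logit2)
cost = 0.5 * np.sum((y - target) ** 2)   # assumed cost: squared error, so grad_y Cost = y - target

# the four BP equations in matrix form
delta2 = (y - target) * sigma_prime(logit2)        # delta^(L) = grad_y Cost (Hadamard) sigma'(logit^(L))
delta1 = (W2.T @ delta2) * sigma_prime(logit1)     # delta^(l) = (W^(l+1))^T delta^(l+1) (Hadamard) sigma'(logit^(l))
grad_b2, grad_b1 = delta2, delta1                  # dCost/dbias^(l) = delta^(l)
grad_W2, grad_W1 = np.outer(delta2, h1), np.outer(delta1, x)  # dCost/dW^(l) = delta^(l) (h^(l-1))^T

# finite-difference check of one weight in W1
eps, (i, j) = 1e-6, (1, 2)
W1p = W1.copy(); W1p[i, j] += eps
yp = sigma(W2 @ sigma(W1p @ x + b1) + b2)
numeric = (0.5 * np.sum((yp - target) ** 2) - cost) / eps
print(grad_W1[i, j], "vs", numeric)
```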

Backpropagation Formulas for Convolutional Networks

Theoretical derivation of CNN backpropagation: http://www.uml.org.cn/ai/201809102.asp?artid=21154

Backpropagation algorithm for convolutional neural networks (CNNs): https://www.cnblogs.com/pinard/p/6494810.html
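The linked articles give the full derivation for convolutional layers. As a rough illustration of the core idea only, here is a minimal sketch of backpropagation through a single 1D convolution (valid padding, stride 1, no activation; all names and sizes are assumptions): the kernel gradient is a correlation of the input with the output deltas, and the input delta is a "full" convolution of the output deltas with the flipped kernel.

```python
import numpy as np

def conv1d_valid(x, k):
    """1D 'valid' cross-correlation, stride 1: out[i] = sum_m x[i + m] * k[m]."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def conv1d_backward(x, k, delta_out):
    """Gradients for out = conv1d_valid(x, k), given delta_out = dCost/dout."""
    # kernel gradient: correlate the input with the output deltas
    grad_k = np.array([np.dot(x[m:m + len(delta_out)], delta_out) for m in range(len(k))])
    # input gradient: 'full' convolution of delta_out with the flipped kernel
    padded = np.pad(delta_out, (len(k) - 1, len(k) - 1))
    grad_x = np.array([np.dot(padded[i:i + len(k)], k[::-1]) for i in range(len(x))])
    return grad_k, grad_x
```

A finite-difference check analogous to the fully connected case above can be used to verify both gradients.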
