Fully Connected Neural Network Basics: Backpropagation and Gradient Descent

The previous post (link) covered forward propagation in fully connected neural networks and the loss function, whose value measures how well the network fits the data. How do we actually reduce that loss value during training? That is the subject of this post: backpropagation and gradient descent.

Backpropagation

We again use the same network structure as the example:
(Figure 1: the example fully connected network structure)
Through forward propagation and the loss function, we obtain the loss as a composite function of the input $\boldsymbol{x}$, the weights $\mathbf{W}$, and the biases $\mathbf{B}$:
$$L = L(\hat{y}, y) = L(f(\boldsymbol{x}, \mathbf{W}, \mathbf{B}), y).$$
During backpropagation we need to compute the partial derivatives of the loss $L$ with respect to the weights $\mathbf{W}$ and the biases $\mathbf{B}$, so that gradient descent can then be carried out.

Depending on the particular loss function, we can first compute the derivative of the loss with respect to the network output, $\frac{\partial L}{\partial \hat{y}}$.
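For example, with the squared-error loss $L = (\hat{y} - y)^2$ that the code at the end of this post uses, this term is simply
$$\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y).$$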

Output layer:
$$\hat{y} = f_3(\mathbf{W}^{(3)}\boldsymbol{x}^{(3)} + \mathbf{B}^{(3)}) = \delta\left(w_{11}^{(3)}x_{31} + w_{12}^{(3)}x_{32} + w_{13}^{(3)}x_{33} + b_1^{(3)}\right).$$
Taking the sigmoid as the activation function, and noting that
$$\frac{\partial \delta(x)}{\partial x} = -\frac{1}{(1 + e^{-x})^2}(-e^{-x}) = \frac{1 + e^{-x}}{(1 + e^{-x})^2} - \frac{1}{(1 + e^{-x})^2} = \delta(x)(1 - \delta(x)),$$
the chain rule gives, for $\forall i = 1, \dots, 3$,
$$\frac{\partial L}{\partial w_{1i}^{(3)}} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_{1i}^{(3)}} = \frac{\partial L}{\partial \hat{y}}\, \hat{y}(1 - \hat{y})\, x_{3i},$$
and therefore
$$\frac{\partial L}{\partial \mathbf{W}^{(3)}} = \left[\frac{\partial L}{\partial w_{11}^{(3)}}, \frac{\partial L}{\partial w_{12}^{(3)}}, \frac{\partial L}{\partial w_{13}^{(3)}}\right] = \frac{\partial L}{\partial \hat{y}}\, \hat{y}(1 - \hat{y}) \left[x_{31}, x_{32}, x_{33}\right] = \frac{\partial L}{\partial \hat{y}}\, \hat{y}(1 - \hat{y})\, \boldsymbol{x}^{(3)T}.$$
In the same way, for $\forall i = 1, \dots, 3$,
$$\frac{\partial L}{\partial x_{3i}} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial x_{3i}} = \frac{\partial L}{\partial \hat{y}}\, \hat{y}(1 - \hat{y})\, w_{1i}^{(3)},$$
so
$$\frac{\partial L}{\partial \boldsymbol{x}^{(3)}} = \left[\frac{\partial L}{\partial x_{31}}, \frac{\partial L}{\partial x_{32}}, \frac{\partial L}{\partial x_{33}}\right]^T = \frac{\partial L}{\partial \hat{y}}\, \hat{y}(1 - \hat{y}) \left[w_{11}^{(3)}, w_{12}^{(3)}, w_{13}^{(3)}\right]^T = \frac{\partial L}{\partial \hat{y}}\, \hat{y}(1 - \hat{y})\, \mathbf{W}^{(3)T}.$$
Likewise,
$$\frac{\partial L}{\partial \mathbf{B}^{(3)}} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial \mathbf{B}^{(3)}} = \frac{\partial L}{\partial \hat{y}}\, \hat{y}(1 - \hat{y}).$$
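As a minimal numpy sketch of these output-layer formulas (assuming a sigmoid output unit and the squared-error loss above; the names x3, W3, b3 and their toy values are chosen purely for illustration):

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

x3 = np.array([[0.2], [0.5], [0.7]])  # output of the second hidden layer, 3x1
W3 = np.random.random((1, 3))         # output-layer weights, 1x3
b3 = np.random.random((1, 1))         # output-layer bias, 1x1
y = 0.6                               # target value

y_hat = sigmoid(W3 @ x3 + b3)         # forward pass through the output layer
dL_dyhat = 2 * (y_hat - y)            # dL/dy_hat for the squared-error loss

dL_dW3 = dL_dyhat * y_hat * (1 - y_hat) * x3.T  # 1x3, matches dL/dW^(3)
dL_dB3 = dL_dyhat * y_hat * (1 - y_hat)         # 1x1, matches dL/dB^(3)
dL_dx3 = dL_dyhat * y_hat * (1 - y_hat) * W3.T  # 3x1, matches dL/dx^(3)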

Second hidden layer: $\boldsymbol{x}^{(3)} = f_2(\mathbf{W}^{(2)}\boldsymbol{x}^{(2)} + \mathbf{B}^{(2)})$, i.e.
$$x_{31} = \delta\left(w_{11}^{(2)}x_{21} + w_{12}^{(2)}x_{22} + w_{13}^{(2)}x_{23} + b_1^{(2)}\right),$$
$$x_{32} = \delta\left(w_{21}^{(2)}x_{21} + w_{22}^{(2)}x_{22} + w_{23}^{(2)}x_{23} + b_2^{(2)}\right),$$
$$x_{33} = \delta\left(w_{31}^{(2)}x_{21} + w_{32}^{(2)}x_{22} + w_{33}^{(2)}x_{23} + b_3^{(2)}\right).$$
Observe that $w_{11}^{(2)}$ influences the loss only through $x_{31}$ (where it multiplies $x_{21}$), so
$$\frac{\partial L}{\partial w_{11}^{(2)}} = \frac{\partial L}{\partial x_{31}} \frac{\partial x_{31}}{\partial w_{11}^{(2)}} = \frac{\partial L}{\partial x_{31}}\, x_{31}(1 - x_{31})\, x_{21}.$$
More generally, for $\forall i = 1, \dots, 3$ and $j = 1, \dots, 3$,
$$\frac{\partial L}{\partial w_{ij}^{(2)}} = \frac{\partial L}{\partial x_{3i}} \frac{\partial x_{3i}}{\partial w_{ij}^{(2)}} = \frac{\partial L}{\partial x_{3i}}\, x_{3i}(1 - x_{3i})\, x_{2j},$$
and therefore
$$\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \left[\frac{\partial L}{\partial x_{3i}}\, x_{3i}(1 - x_{3i})\, x_{2j}\right]_{3 \times 3} = \left(\frac{\partial L}{\partial \boldsymbol{x}^{(3)}} \odot \boldsymbol{x}^{(3)} \odot (1 - \boldsymbol{x}^{(3)})\right)\boldsymbol{x}^{(2)T}.$$
On the other hand, $x_{21}$ contributes to $x_{31}, x_{32}, x_{33}$ through $w_{11}^{(2)}, w_{21}^{(2)}, w_{31}^{(2)}$ respectively, so
$$\begin{aligned} \frac{\partial L}{\partial x_{21}} &= \frac{\partial L}{\partial x_{31}}\frac{\partial x_{31}}{\partial x_{21}} + \frac{\partial L}{\partial x_{32}}\frac{\partial x_{32}}{\partial x_{21}} + \frac{\partial L}{\partial x_{33}}\frac{\partial x_{33}}{\partial x_{21}} \\ &= \frac{\partial L}{\partial x_{31}}x_{31}(1 - x_{31})w_{11}^{(2)} + \frac{\partial L}{\partial x_{32}}x_{32}(1 - x_{32})w_{21}^{(2)} + \frac{\partial L}{\partial x_{33}}x_{33}(1 - x_{33})w_{31}^{(2)} \\ &= \left(\frac{\partial L}{\partial \boldsymbol{x}^{(3)}} \odot \boldsymbol{x}^{(3)} \odot (1 - \boldsymbol{x}^{(3)})\right)^T \left[w_{11}^{(2)}, w_{21}^{(2)}, w_{31}^{(2)}\right]^T. \end{aligned}$$
Hence, for $\forall i = 1, \dots, 3$,
$$\frac{\partial L}{\partial x_{2i}} = \left(\frac{\partial L}{\partial \boldsymbol{x}^{(3)}} \odot \boldsymbol{x}^{(3)} \odot (1 - \boldsymbol{x}^{(3)})\right)^T \left[w_{1i}^{(2)}, w_{2i}^{(2)}, w_{3i}^{(2)}\right]^T,$$
and thus
$$\frac{\partial L}{\partial \boldsymbol{x}^{(2)}} = \left[\left(\frac{\partial L}{\partial \boldsymbol{x}^{(3)}} \odot \boldsymbol{x}^{(3)} \odot (1 - \boldsymbol{x}^{(3)})\right)^T \left[w_{1i}^{(2)}, w_{2i}^{(2)}, w_{3i}^{(2)}\right]^T\right]_{3 \times 1} = \left(\left(\frac{\partial L}{\partial \boldsymbol{x}^{(3)}} \odot \boldsymbol{x}^{(3)} \odot (1 - \boldsymbol{x}^{(3)})\right)^T \mathbf{W}^{(2)}\right)^T = \mathbf{W}^{(2)T}\left(\frac{\partial L}{\partial \boldsymbol{x}^{(3)}} \odot \boldsymbol{x}^{(3)} \odot (1 - \boldsymbol{x}^{(3)})\right).$$
Similarly,
$$\frac{\partial L}{\partial \mathbf{B}^{(2)}} = \frac{\partial L}{\partial \boldsymbol{x}^{(3)}} \odot \boldsymbol{x}^{(3)} \odot (1 - \boldsymbol{x}^{(3)}).$$
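A matching numpy sketch for the second hidden layer (again with illustrative names and random toy values; dL_dx3 stands for the gradient already propagated back from the output layer):

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

x2 = np.random.random((3, 1))      # output of the first hidden layer, 3x1
W2 = np.random.random((3, 3))      # second-hidden-layer weights, 3x3
b2 = np.random.random((3, 1))      # second-hidden-layer biases, 3x1
x3 = sigmoid(W2 @ x2 + b2)         # output of the second hidden layer, 3x1
dL_dx3 = np.random.random((3, 1))  # placeholder for the gradient from the output layer

delta = dL_dx3 * x3 * (1 - x3)     # dL/dx^(3) ⊙ x^(3) ⊙ (1 - x^(3))
dL_dW2 = delta @ x2.T              # 3x3, matches dL/dW^(2)
dL_dB2 = delta                     # 3x1, matches dL/dB^(2)
dL_dx2 = W2.T @ delta              # 3x1, matches dL/dx^(2)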

First hidden layer: $\boldsymbol{x}^{(2)} = f_1(\mathbf{W}^{(1)}\boldsymbol{x} + \mathbf{B}^{(1)})$, i.e.
$$x_{21} = \delta\left(w_{11}^{(1)}x_{11} + w_{12}^{(1)}x_{12} + b_1^{(1)}\right),$$
$$x_{22} = \delta\left(w_{21}^{(1)}x_{11} + w_{22}^{(1)}x_{12} + b_2^{(1)}\right),$$
$$x_{23} = \delta\left(w_{31}^{(1)}x_{11} + w_{32}^{(1)}x_{12} + b_3^{(1)}\right).$$
The derivation is the same as for the second hidden layer: for $\forall i = 1, \dots, 3$ and $j = 1, 2$,
$$\frac{\partial L}{\partial w_{ij}^{(1)}} = \frac{\partial L}{\partial x_{2i}} \frac{\partial x_{2i}}{\partial w_{ij}^{(1)}} = \frac{\partial L}{\partial x_{2i}}\, x_{2i}(1 - x_{2i})\, x_{1j},$$
so
$$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \left[\frac{\partial L}{\partial x_{2i}}\, x_{2i}(1 - x_{2i})\, x_{1j}\right]_{3 \times 2} = \left(\frac{\partial L}{\partial \boldsymbol{x}^{(2)}} \odot \boldsymbol{x}^{(2)} \odot (1 - \boldsymbol{x}^{(2)})\right)\boldsymbol{x}^T.$$
Likewise, for $\forall i = 1, 2$,
$$\begin{aligned} \frac{\partial L}{\partial x_{1i}} &= \frac{\partial L}{\partial x_{21}}\frac{\partial x_{21}}{\partial x_{1i}} + \frac{\partial L}{\partial x_{22}}\frac{\partial x_{22}}{\partial x_{1i}} + \frac{\partial L}{\partial x_{23}}\frac{\partial x_{23}}{\partial x_{1i}} \\ &= \frac{\partial L}{\partial x_{21}}x_{21}(1 - x_{21})w_{1i}^{(1)} + \frac{\partial L}{\partial x_{22}}x_{22}(1 - x_{22})w_{2i}^{(1)} + \frac{\partial L}{\partial x_{23}}x_{23}(1 - x_{23})w_{3i}^{(1)} \\ &= \left(\frac{\partial L}{\partial \boldsymbol{x}^{(2)}} \odot \boldsymbol{x}^{(2)} \odot (1 - \boldsymbol{x}^{(2)})\right)^T \left[w_{1i}^{(1)}, w_{2i}^{(1)}, w_{3i}^{(1)}\right]^T, \end{aligned}$$
so
$$\frac{\partial L}{\partial \boldsymbol{x}} = \left[\left(\frac{\partial L}{\partial \boldsymbol{x}^{(2)}} \odot \boldsymbol{x}^{(2)} \odot (1 - \boldsymbol{x}^{(2)})\right)^T \left[w_{1i}^{(1)}, w_{2i}^{(1)}, w_{3i}^{(1)}\right]^T\right]_{2 \times 1} = \left(\left(\frac{\partial L}{\partial \boldsymbol{x}^{(2)}} \odot \boldsymbol{x}^{(2)} \odot (1 - \boldsymbol{x}^{(2)})\right)^T \mathbf{W}^{(1)}\right)^T = \mathbf{W}^{(1)T}\left(\frac{\partial L}{\partial \boldsymbol{x}^{(2)}} \odot \boldsymbol{x}^{(2)} \odot (1 - \boldsymbol{x}^{(2)})\right).$$
And again,
$$\frac{\partial L}{\partial \mathbf{B}^{(1)}} = \frac{\partial L}{\partial \boldsymbol{x}^{(2)}} \odot \boldsymbol{x}^{(2)} \odot (1 - \boldsymbol{x}^{(2)}).$$
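The three derivations above all follow the same pattern, so the backward pass can be written as one layer-by-layer recurrence. A minimal sketch, assuming sigmoid activations in every layer (the function name and argument layout are illustrative):

import numpy as np

def backward_pass(Ws, activations, dL_dout):
    # Ws:          per-layer weight matrices, first layer first, e.g. [W1, W2, W3]
    # activations: saved forward-pass values [x, x2, x3, y_hat]
    # dL_dout:     gradient of the loss w.r.t. the network output
    dWs, dBs = [], []
    upstream = dL_dout
    for W, a, a_prev in zip(reversed(Ws), reversed(activations[1:]), reversed(activations[:-1])):
        delta = upstream * a * (1 - a)   # multiply by the sigmoid derivative a(1 - a)
        dWs.insert(0, delta @ a_prev.T)  # dL/dW of this layer
        dBs.insert(0, delta)             # dL/dB of this layer
        upstream = W.T @ delta           # gradient passed on to the layer below
    return dWs, dBs

With Ws = [W1, W2, W3] and activations = [x, x2, x3, y_hat], this loop reproduces the per-layer expressions derived above, which is also what the backward method in the code at the end of this post does step by step.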

Gradient Descent

Three gradient descent variants are commonly used when training a network: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

Batch gradient descent: every iteration uses all of the training samples to compute the gradient update. Taking the MSE loss from the previous post as an example, the derivative of $L$ with respect to a network parameter $w$ can be written as
$$\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\frac{\partial \hat{y}_i}{\partial w},$$
so all $n$ samples contribute to the gradient of $w$: the gradient used for the parameter update is the average of the gradients computed from the individual samples.

Stochastic gradient descent: batch gradient descent has an obvious drawback in that the network updates slowly, since every update requires a pass over all the data. Stochastic gradient descent addresses this by using only a single sample for each parameter update:
$$\frac{\partial L}{\partial w} = 2(\hat{y}_i - y_i)\frac{\partial \hat{y}_i}{\partial w}.$$

Mini-batch gradient descent: stochastic gradient descent fixes the slow updates of batch gradient descent, but it introduces problems of its own, such as a greater tendency to get stuck in local optima and difficulty in exploiting parallelism. Mini-batch gradient descent therefore uses a subset of the samples for each parameter update, as sketched below.
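A minimal sketch of the three variants, assuming some function grad(w, x_i, y_i) that returns the gradient for a single sample (a hypothetical helper used only for illustration):

import numpy as np

def batch_step(w, X, Y, grad, lr):
    # batch gradient descent: average the gradient over all n samples
    g = np.mean([grad(w, x, y) for x, y in zip(X, Y)], axis=0)
    return w - lr * g

def sgd_step(w, X, Y, grad, lr):
    # stochastic gradient descent: one randomly chosen sample per update
    i = np.random.randint(len(X))
    return w - lr * grad(w, X[i], Y[i])

def minibatch_step(w, X, Y, grad, lr, batch_size=8):
    # mini-batch gradient descent: a random subset of the samples per update
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    g = np.mean([grad(w, X[i], Y[i]) for i in idx], axis=0)
    return w - lr * g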

Once the parameter gradients have been computed, the parameters are updated along the negative gradient direction with a certain step size (the learning rate):
$$w = w - \alpha\frac{\partial L}{\partial w},$$
where $\alpha$ is the learning rate, an important hyperparameter: if it is too large, training tends to oscillate; if it is too small, the network converges slowly.
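A tiny one-dimensional illustration of this trade-off, minimizing the toy function L(w) = w², whose gradient is 2w (a made-up example, not part of the original network):

def descend(lr, steps=5, w=1.0):
    # plain gradient descent on L(w) = w^2
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(descend(0.1))  # small learning rate: steady but slow progress towards 0
print(descend(0.9))  # large learning rate: overshoots and oscillates around 0
print(descend(1.5))  # too large: the iterates grow in magnitude and diverge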

Besides these, several other optimization algorithms are commonly used during training; a sketch of their update rules follows the list.
Momentum: updates the parameters using a record of past gradients together with the current gradient.
AdaGrad: uses an adaptive step size for each parameter during the update.
Adam: updates the parameters with both an adaptive step size and a record of past gradients.
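A minimal numpy sketch of the textbook forms of these three update rules (the hyperparameter defaults are illustrative, not taken from this post):

import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    # Momentum: keep an exponentially decaying record of past gradients
    v = beta * v + grad
    return w - lr * v, v

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    # AdaGrad: per-parameter step size that shrinks as squared gradients accumulate
    cache = cache + grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: momentum-style first moment plus an adaptive step size, with bias correction
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # t is the 1-based iteration counter
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v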

The following code implements the network structure from the beginning of this post.

import numpy as np

class Net:
    def __init__(self):
        # weights and biases of the three layers, initialized uniformly in [0, 1)
        self.W1 = np.random.random((3, 2))  # first hidden layer: 2 inputs -> 3 units
        self.B1 = np.random.random((3, 1))
        self.W2 = np.random.random((3, 3))  # second hidden layer: 3 -> 3 units
        self.B2 = np.random.random((3, 1))
        self.W3 = np.random.random((1, 3))  # output layer: 3 -> 1 unit
        self.B3 = np.random.random((1, 1))
        self.rate = 0.1                     # learning rate
        self.gradient = None
        self.x2 = None
        self.x3 = None
        self.y = None
        self.x = None
        self.label = None

    def train(self, x, label):
        self.label = label
        self.x = x
        for i in range(100):
            self.forward(x)
            loss = self.computeLoss()
            self.backward()
            self.graDesc(self.rate)
            if i % 20 == 0:
                print("Iteration:", i, "Prediction:", self.y, "Loss:", loss)
        print("Iteration:", 99, "Prediction:", self.y, "Loss:", loss)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def forward(self, x):
        # forward propagation: two sigmoid hidden layers followed by a sigmoid output unit
        self.x2 = self.sigmoid(np.dot(self.W1, x) + self.B1)
        self.x3 = self.sigmoid(np.dot(self.W2, self.x2) + self.B2)
        self.y = self.sigmoid(np.dot(self.W3, self.x3) + self.B3)

    def computeLoss(self):
        # squared-error loss for a single sample
        return (self.y - self.label) ** 2

    def backward(self):
        # gradients follow the per-layer formulas derived above
        dLdy = 2 * (self.y - self.label)  # dL/dy_hat for the squared-error loss
        dLdW3 = dLdy * self.y * (1 - self.y) * self.x3.T
        dLdB3 = dLdy * self.y * (1 - self.y)
        dLdx3 = dLdy * self.y * (1 - self.y) * self.W3.T
        dLdW2 = np.dot(dLdx3 * self.x3 * (1 - self.x3), self.x2.T)
        dLdB2 = dLdx3 * self.x3 * (1 - self.x3)
        dLdx2 = np.dot(self.W2.T, dLdx3 * self.x3 * (1 - self.x3))
        dLdW1 = np.dot(dLdx2 * self.x2 * (1 - self.x2), self.x.T)
        dLdB1 = dLdx2 * self.x2 * (1 - self.x2)
        self.gradient = [dLdW1, dLdB1, dLdW2, dLdB2, dLdW3, dLdB3]

    def graDesc(self, rate):
        self.W1 -= rate * self.gradient[0]
        self.B1 -= rate * self.gradient[1]
        self.W2 -= rate * self.gradient[2]
        self.B2 -= rate * self.gradient[3]
        self.W3 -= rate * self.gradient[4]
        self.B3 -= rate * self.gradient[5]

if __name__ == "__main__":
    n = Net()
    x = np.array([[5, 7]]).T   # a single 2-dimensional input sample, shape 2x1
    label = np.array(0.6)      # its target value
    n.train(x, label)

The output is as follows.

Iteration: 0 Prediction: [[0.81946645]] Loss: [[0.04816552]]
Iteration: 20 Prediction: [[0.75549392]] Loss: [[0.02417836]]
Iteration: 40 Prediction: [[0.69519441]] Loss: [[0.00906198]]
Iteration: 60 Prediction: [[0.65235451]] Loss: [[0.00274099]]
Iteration: 80 Prediction: [[0.62715023]] Loss: [[0.00073713]]
Iteration: 99 Prediction: [[0.61416866]] Loss: [[0.00020075]]

Note: I am far from an expert, so if there are any mistakes, corrections and suggestions are very welcome!
