Derivation of the Backpropagation Algorithm for Convolutional Neural Networks (CNNs)

Table of Contents

    • 1. Fully Connected Layer
    • 2. Pooling Layer
    • 3. Convolutional Layer
    • 4. References

1. Fully Connected Layer

Consistent with the backpropagation algorithm for deep neural networks (DNNs), define the auxiliary variables:
$$
\left\{\begin{aligned}
&\delta^L = \frac{\partial J}{\partial z^L} = \frac{\partial J}{\partial a^L} \odot \sigma'(z^L)\\
&\delta^l = (W^{l+1})^T\delta^{l+1}\odot \sigma'(z^l)
\end{aligned}\right. \tag{1}
$$
From these we obtain the gradients of the parameters $W$ and $b$:
$$
\left\{\begin{aligned}
&\frac{\partial J}{\partial W^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial W^l} = \delta^l(a^{l-1})^T\\
&\frac{\partial J}{\partial b^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial b^l} = \delta^l
\end{aligned}\right.
$$
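As a quick illustration, here is a minimal NumPy sketch of Eq. (1) and the parameter gradients for a single fully connected layer, assuming a sigmoid activation $\sigma$; the names (`fc_backward`, `delta_next`, `a_prev`, ...) are illustrative and not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def fc_backward(delta_next, W_next, z_l, a_prev):
    """Backprop through fully connected layer l (sketch, sigmoid assumed).

    delta_next : delta^{l+1}, shape (n_{l+1}, 1)
    W_next     : W^{l+1},     shape (n_{l+1}, n_l)
    z_l        : z^l,         shape (n_l, 1)
    a_prev     : a^{l-1},     shape (n_{l-1}, 1)
    """
    # delta^l = (W^{l+1})^T delta^{l+1} ⊙ sigma'(z^l)
    delta_l = (W_next.T @ delta_next) * sigmoid_prime(z_l)
    # dJ/dW^l = delta^l (a^{l-1})^T,  dJ/db^l = delta^l
    dW = delta_l @ a_prev.T
    db = delta_l
    return delta_l, dW, db
```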

Why the element-wise (Hadamard) product is used in Eq. (1):

Take the squared-error loss function as an example:
$$
J = \frac{1}{2}\left\Vert a^L - y\right\Vert_2^2 = \frac{1}{2}\sum_{i=1}^N\left(a_i^L - y_i\right)^2 = \frac{1}{2}\sum_{i=1}^N\left(\sigma(z_i^L) - y_i\right)^2
$$

$$
\delta^L = \frac{\partial J}{\partial z^L} =
\begin{bmatrix} \dfrac{\partial J}{\partial z_1^L} \\ \dfrac{\partial J}{\partial z_2^L} \\ \vdots \\ \dfrac{\partial J}{\partial z_N^L} \end{bmatrix}
=
\begin{bmatrix} (a_1^L-y_1)\,\sigma'(z_1^L) \\ (a_2^L-y_2)\,\sigma'(z_2^L) \\ \vdots \\ (a_N^L-y_N)\,\sigma'(z_N^L) \end{bmatrix}
= (a^L-y)\odot\sigma'(z^L) = \frac{\partial J}{\partial a^L} \odot \sigma'(z^L)
$$

We can also analyze this from the dimensions of the vectors:

The vectors $a^L-y$ and $\sigma'(z^L)$ have the same dimension, both in $\mathbb{R}^{N\times1}$, so the only product that makes sense between them is element-wise.
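A small numerical sanity check of this point (illustrative only, sigmoid output assumed): the component-wise values $(a_i^L - y_i)\,\sigma'(z_i^L)$ agree with the vectorized Hadamard form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z_L = rng.normal(size=(5, 1))   # pre-activations of the output layer
y = rng.normal(size=(5, 1))     # targets

a_L = sigmoid(z_L)
sigma_prime = a_L * (1.0 - a_L)

# Component-wise computation of delta^L ...
delta_componentwise = np.array([[(a_L[i, 0] - y[i, 0]) * sigma_prime[i, 0]]
                                for i in range(5)])
# ... equals the Hadamard (element-wise) form.
delta_hadamard = (a_L - y) * sigma_prime

assert np.allclose(delta_componentwise, delta_hadamard)
```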

For the cross-entropy loss function (paired with a softmax output layer):
$$
J = -\sum_{i=1}^N y_i\log a_i^L
$$

$$
\delta^L = \frac{\partial J}{\partial z^L} = a^L - y
$$
so the $\sigma'(z^L)$ term does not appear.
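This can also be verified numerically. The sketch below (assuming a softmax output layer and a one-hot target; names are illustrative) compares finite-difference gradients of the cross-entropy loss with the closed form $a^L - y$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
z_L = rng.normal(size=5)
y = np.zeros(5)
y[2] = 1.0  # one-hot target

def J(z):
    # cross-entropy loss with softmax output
    return -np.sum(y * np.log(softmax(z)))

eps = 1e-6
grad_numeric = np.array([
    (J(z_L + eps * np.eye(5)[i]) - J(z_L - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])
grad_closed_form = softmax(z_L) - y   # a^L - y

assert np.allclose(grad_numeric, grad_closed_form, atol=1e-6)
```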

2. Pooling Layer

Let the input of the pooling layer be $a^{l}$ and its output be $z^{l+1}$. Then:
$$
z^{l+1} = \text{pool}(a^{l})
$$

$$
\delta^{l} = \frac{\partial J}{\partial z^{l}} = \frac{\partial J}{\partial z^{l+1}}\,\frac{\partial z^{l+1}}{\partial a^{l}}\,\frac{\partial a^{l}}{\partial z^{l}} = \text{upsample}(\delta^{l+1})\odot \sigma'(z^l)
$$
Here, upsample means that during backpropagation the matrix $\delta^{l+1}$ is restored to the size it had before pooling. There are two cases:

  1. For Max pooling, each element of $\delta^{l+1}$ is placed at the position where the maximum was taken during the forward pass, so the position of the maximum element within each block must be recorded during the forward pass.
  2. For Average pooling, each element of $\delta^{l+1}$ is divided by the block size (i.e., averaged over the block) and then filled into every position of the corresponding block.

For example, suppose the pooling kernel size is $2\times2$. Then:
$$
\delta^{l+1} = \begin{pmatrix} 2 & 8 \\ 4 & 6 \end{pmatrix}
\xrightarrow{\text{Max upsample}}
\begin{pmatrix} 2&0&0&0 \\ 0&0&0&8 \\ 0&4&0&0 \\ 0&0&6&0 \end{pmatrix}
$$
$$
\delta^{l+1} = \begin{pmatrix} 2 & 8 \\ 4 & 6 \end{pmatrix}
\xrightarrow{\text{Average upsample}}
\begin{pmatrix} 0.5&0.5&2&2 \\ 0.5&0.5&2&2 \\ 1&1&1.5&1.5 \\ 1&1&1.5&1.5 \end{pmatrix}
$$
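The two upsample rules can be written compactly with `np.kron`. The sketch below reproduces the two matrices above; it assumes the max-position mask was recorded during the forward pass (the `max_mask` values here are hypothetical, chosen to match the example).

```python
import numpy as np

def upsample_average(delta, k=2):
    # Spread each gradient evenly over its k x k block (divide by k*k).
    return np.kron(delta, np.ones((k, k))) / (k * k)

def upsample_max(delta, max_mask, k=2):
    # Place each gradient at the recorded argmax position of its block.
    return np.kron(delta, np.ones((k, k))) * max_mask

delta = np.array([[2.0, 8.0],
                  [4.0, 6.0]])

# Hypothetical mask from the forward pass: a 1 marks the max of each block.
max_mask = np.array([[1, 0, 0, 0],
                     [0, 0, 0, 1],
                     [0, 1, 0, 0],
                     [0, 0, 1, 0]])

print(upsample_max(delta, max_mask))   # matches the Max upsample example
print(upsample_average(delta))         # matches the Average upsample example
```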
Note that in the Average case it is easy to mistakenly assume the gradient values are simply copied and filled directly into the corresponding block positions. It is in fact easy to see why the gradient must be averaged; a small example illustrates this:

Suppose we take the average of four variables $a, b, c, d$ to obtain $z$, that is:
$$
z = \frac{1}{4}(a+b+c+d)
$$
Then the derivative of $z$ with respect to each variable is $1/4$. Let the gradient accumulated at $z$ during backpropagation be $\delta$. Then:
$$
\left\{\begin{aligned}
&\frac{\partial J}{\partial a} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial a} = \frac{1}{4}\delta\\
&\frac{\partial J}{\partial b} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial b} = \frac{1}{4}\delta\\
&\frac{\partial J}{\partial c} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial c} = \frac{1}{4}\delta\\
&\frac{\partial J}{\partial d} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial d} = \frac{1}{4}\delta
\end{aligned}\right.
$$
This makes the averaging rule easy to understand.
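For completeness, a quick finite-difference check of this four-variable example (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=4)   # the four variables a, b, c, d
delta = 3.0              # accumulated gradient at z

def z(v):
    return v.mean()      # z = (a + b + c + d) / 4

eps = 1e-6
grad = np.array([
    delta * (z(x + eps * np.eye(4)[i]) - z(x - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])
# Each input receives delta / 4.
assert np.allclose(grad, delta / 4)
```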

3. Convolutional Layer

The forward propagation formula of a convolutional layer:
$$
a^{l+1} = \sigma(z^{l+1}) = \sigma(a^l * W^{l+1} + b^{l+1})
$$

$$
\delta^{l} = \frac{\partial J}{\partial z^{l}} = \frac{\partial J}{\partial z^{l+1}}\,\frac{\partial z^{l+1}}{\partial a^{l}}\,\frac{\partial a^{l}}{\partial z^{l}} = \delta^{l+1} * \text{Rotation180}(W^{l+1})\odot \sigma'(z^l)
$$
Here, Rotation180 means the convolution kernel $W$ is rotated by 180 degrees, i.e., flipped vertically once and then horizontally once. Also note that $\delta^{l+1}$ needs to be padded appropriately: when the stride is 1, the padding is $p'=k-p-1$, where $k$ is the kernel size and $p$ is the padding used in the forward pass.
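The rotated-kernel rule can be checked numerically. The sketch below assumes stride 1, no padding, a single channel, and the common deep-learning convention that the forward "convolution" is a valid cross-correlation; it only verifies the $\delta^{l+1} * \text{Rotation180}(W^{l+1})$ part, i.e. $\partial J/\partial a^l$, leaving out the $\sigma'(z^l)$ factor.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(3)
a = rng.normal(size=(5, 5))            # a^l
W = rng.normal(size=(3, 3))            # W^{l+1}
delta_next = rng.normal(size=(3, 3))   # delta^{l+1}, same shape as z^{l+1}

def forward(a, W):
    # valid cross-correlation: z^{l+1} = a^l * W^{l+1} (pre-activation)
    return correlate2d(a, W, mode='valid')

# Loss surrogate: J = sum(delta_next * z), so dJ/dz^{l+1} = delta_next.
def J(a):
    return np.sum(delta_next * forward(a, W))

# Closed form: dJ/da^l = delta^{l+1} * Rotation180(W^{l+1}), 'full' padding.
grad_closed = correlate2d(delta_next, np.rot90(W, 2), mode='full')

# Finite-difference check.
eps = 1e-6
grad_numeric = np.zeros_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        e = np.zeros_like(a)
        e[i, j] = eps
        grad_numeric[i, j] = (J(a + e) - J(a - e)) / (2 * eps)

assert np.allclose(grad_closed, grad_numeric, atol=1e-5)
```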

For a detailed derivation, see https://www.cnblogs.com/pinard/p/6494810.html

The gradients of the parameters $W$ and $b$:
$$
\left\{\begin{aligned}
&\frac{\partial J}{\partial W^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial W^l} = a^{l-1}*\delta^l\\
&\frac{\partial J}{\partial b^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial b^l} = \sum_{u,v}(\delta^l)_{u,v}
\end{aligned}\right.
$$
Here, the gradient with respect to $W$ involves no rotation operation, and $\sum\limits_{u,v}(\delta^l)_{u,v}$ means summing $\delta^l$ over all spatial positions $u, v$; when there are multiple output channels, each channel of $\delta^l$ is summed separately, yielding one value per component of $b^l$.
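Under the same assumptions as before (stride 1, no padding, single channel, cross-correlation convention), the parameter gradients can be checked the same way:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(4)
a_prev = rng.normal(size=(5, 5))   # a^{l-1}
W = rng.normal(size=(3, 3))        # W^l
b = 0.7                            # b^l (scalar, shared across the map)
delta = rng.normal(size=(3, 3))    # delta^l = dJ/dz^l

def J(W, b):
    z = correlate2d(a_prev, W, mode='valid') + b
    return np.sum(delta * z)       # loss surrogate with dJ/dz^l = delta

# Closed forms: dJ/dW^l = a^{l-1} * delta^l,  dJ/db^l = sum over positions.
dW_closed = correlate2d(a_prev, delta, mode='valid')
db_closed = delta.sum()

# Finite-difference checks.
eps = 1e-6
dW_numeric = np.zeros_like(W)
for u in range(3):
    for v in range(3):
        e = np.zeros_like(W)
        e[u, v] = eps
        dW_numeric[u, v] = (J(W + e, b) - J(W - e, b)) / (2 * eps)
db_numeric = (J(W, b + eps) - J(W, b - eps)) / (2 * eps)

assert np.allclose(dW_closed, dW_numeric, atol=1e-5)
assert np.isclose(db_closed, db_numeric, atol=1e-5)
```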

4. References

Thanks to https://www.cnblogs.com/pinard/p/6494810.html
