Consistent with the backpropagation algorithm for deep neural networks (DNNs), define the auxiliary variable:
$$
\left\{\begin{aligned}
&\delta^L = \frac{\partial J}{\partial z^L} = \frac{\partial J}{\partial a^L} \odot \sigma'(z^L)\\
&\delta^l = (W^{l+1})^T\delta^{l+1}\odot \sigma'(z^l)
\end{aligned}\right. \tag{1}
$$
From this we obtain the gradients of the parameters $W$ and $b$:
$$
\left\{\begin{aligned}
&\frac{\partial J}{\partial W^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial W^l} = \delta^l (a^{l-1})^T\\
&\frac{\partial J}{\partial b^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial b^l} = \delta^l
\end{aligned}\right.
$$
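To make these formulas concrete, here is a minimal NumPy sketch of one forward and backward pass through a two-layer fully connected network. The layer sizes, the sigmoid activation, and the squared-error loss are illustrative assumptions, not anything fixed by the text above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Illustrative 2-layer network: input (3,1) -> hidden (4,1) -> output (2,1)
x  = rng.standard_normal((3, 1))
y  = rng.standard_normal((2, 1))
W1 = rng.standard_normal((4, 3)); b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((2, 4)); b2 = rng.standard_normal((2, 1))

# Forward pass
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Backward pass, Eq. (1), with squared-error loss J = ||a2 - y||^2 / 2
delta2 = (a2 - y) * sigmoid_prime(z2)          # delta^L = dJ/da^L ⊙ σ'(z^L)
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)   # delta^l = (W^{l+1})^T delta^{l+1} ⊙ σ'(z^l)

# Parameter gradients: dJ/dW^l = delta^l (a^{l-1})^T, dJ/db^l = delta^l
dW2, db2 = delta2 @ a1.T, delta2
dW1, db1 = delta1 @ x.T,  delta1
```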
Why does Eq. (1) use an element-wise (Hadamard) product? Take the squared-error loss function as an example:
$$
J = \frac{1}{2}\left\Vert a^L - y\right\Vert_2^2 = \frac{1}{2}\sum_{i=1}^N \left(a_i^L - y_i\right)^2 = \frac{1}{2}\sum_{i=1}^N \left(\sigma(z_i^L) - y_i\right)^2
$$
Then
$$
\delta^L = \frac{\partial J}{\partial z^L}
= \begin{bmatrix} \frac{\partial J}{\partial z_1^L} \\ \frac{\partial J}{\partial z_2^L} \\ \vdots \\ \frac{\partial J}{\partial z_N^L} \end{bmatrix}
= \begin{bmatrix} (a_1^L-y_1)\,\sigma'(z_1^L) \\ (a_2^L-y_2)\,\sigma'(z_2^L) \\ \vdots \\ (a_N^L-y_N)\,\sigma'(z_N^L) \end{bmatrix}
= (a^L-y)\odot\sigma'(z^L)
= \frac{\partial J}{\partial a^L} \odot \sigma'(z^L)
$$
We can also see this by checking dimensions: the vectors $a^L - y$ and $\sigma'(z^L)$ have the same shape, both in $\mathbb{R}^{N\times 1}$, so the only product that makes sense between them is element-wise.
With the cross-entropy loss function instead:
$$
J = -\sum_{i=1}^N y_i \log a_i^L
$$
Then
$$
\delta^L = \frac{\partial J}{\partial z^L} = a^L - y
$$
The $\sigma'(z^L)$ term no longer appears: for a softmax (or sigmoid) output layer, the derivative of the activation cancels against the cross-entropy loss.
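This cancellation is easy to confirm numerically. Below is a small sketch, assuming a softmax output layer and a one-hot target $y$ (both assumptions of this sketch), comparing $a^L - y$ against a finite-difference gradient of $J$ with respect to $z^L$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
z = rng.standard_normal(5)
y = np.zeros(5); y[2] = 1.0                  # illustrative one-hot target

def J(z):
    return -np.sum(y * np.log(softmax(z)))   # cross-entropy loss

analytic = softmax(z) - y                    # claimed delta^L = a^L - y

# Central finite-difference check of dJ/dz
eps = 1e-6
numeric = np.array([
    (J(z + eps * np.eye(5)[i]) - J(z - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])
print(np.max(np.abs(analytic - numeric)))    # ~1e-10: the rule holds
```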
Now consider the pooling layer. Let its input be $a^l$ and its output $z^{l+1}$, so that:
$$
z^{l+1} = \text{pool}(a^{l})
$$
Then
$$
\delta^{l} = \frac{\partial J}{\partial z^{l}}
= \frac{\partial J}{\partial z^{l+1}} \frac{\partial z^{l+1}}{\partial a^{l}} \frac{\partial a^{l}}{\partial z^{l}}
= \text{upsample}\left(\delta^{l+1}\right)\odot \sigma'(z^l)
$$
Here, upsample means restoring $\delta^{l+1}$ to the size the matrix had before pooling. There are two cases, depending on the pooling type: max pooling routes each gradient back to the position that held the block maximum in the forward pass (recorded at that time), while average pooling spreads it evenly over the block. For example, with a $2\times2$ pooling kernel:
$$
\delta^{l+1} = \begin{pmatrix} 2 & 8 \\ 4 & 6 \end{pmatrix}
\xrightarrow{\text{Max upsample}}
\begin{pmatrix} 2&0&0&0 \\ 0&0&0&8 \\ 0&4&0&0 \\ 0&0&6&0 \end{pmatrix}
$$
$$
\delta^{l+1} = \begin{pmatrix} 2 & 8 \\ 4 & 6 \end{pmatrix}
\xrightarrow{\text{Average upsample}}
\begin{pmatrix} 0.5&0.5&2&2 \\ 0.5&0.5&2&2 \\ 1&1&1.5&1.5 \\ 1&1&1.5&1.5 \end{pmatrix}
$$
Note that for the average case it is tempting to think the gradient is simply copied into every position of the corresponding block. A small example shows why it must be divided evenly instead:
Suppose we average four variables $a, b, c, d$ to obtain $z$:
$$
z = \frac{1}{4}(a+b+c+d)
$$
The derivative of $z$ with respect to each variable is $1/4$. If the gradient accumulated at $z$ during backpropagation is $\delta$, then
$$
\left\{\begin{aligned}
&\frac{\partial J}{\partial a} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial a} = \frac{1}{4}\delta\\
&\frac{\partial J}{\partial b} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial b} = \frac{1}{4}\delta\\
&\frac{\partial J}{\partial c} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial c} = \frac{1}{4}\delta\\
&\frac{\partial J}{\partial d} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial d} = \frac{1}{4}\delta
\end{aligned}\right.
$$
Each input receives a quarter of the gradient, exactly as in the Average upsample example above.
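Both upsample rules are short to implement. Here is a minimal NumPy sketch for a $2\times2$, stride-2 pooling window; `max_pos`, the argmax offsets recorded during the forward pass, is an assumption of this sketch:

```python
import numpy as np

def upsample_average(delta, k=2):
    # Spread each gradient evenly over its k x k block: every input
    # contributed 1/k^2 to the block average.
    return np.kron(delta, np.ones((k, k))) / (k * k)

def upsample_max(delta, max_pos, k=2):
    # Route each gradient to the position that held the block maximum
    # in the forward pass; all other positions get zero.
    out = np.zeros((delta.shape[0] * k, delta.shape[1] * k))
    for i in range(delta.shape[0]):
        for j in range(delta.shape[1]):
            r, c = max_pos[i, j]          # offsets within the block
            out[i * k + r, j * k + c] = delta[i, j]
    return out

delta = np.array([[2., 8.], [4., 6.]])
print(upsample_average(delta))            # matches the Average example
# Illustrative argmax offsets reproducing the Max example above
pos = np.array([[[0, 0], [1, 1]], [[0, 1], [1, 0]]])
print(upsample_max(delta, pos))           # matches the Max example
```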
The forward-propagation formula for a convolutional layer is:
$$
a^{l+1} = \sigma(z^{l+1}) = \sigma(a^l * W^{l+1} + b^{l+1})
$$
Then
$$
\delta^{l} = \frac{\partial J}{\partial z^{l}}
= \frac{\partial J}{\partial z^{l+1}} \frac{\partial z^{l+1}}{\partial a^{l}} \frac{\partial a^{l}}{\partial z^{l}}
= \delta^{l+1} * \text{Rotation180}\left(W^{l+1}\right) \odot \sigma'(z^l)
$$
Here, Rotation180 means the convolution kernel $W$ is rotated by 180 degrees, i.e., flipped vertically and then horizontally. Note also that $\delta^{l+1}$ must be padded appropriately before the convolution: when the stride is 1, the padding is $p' = k - p - 1$, where $k$ is the kernel size and $p$ is the padding used in the forward pass.
For the detailed derivation, see https://www.cnblogs.com/pinard/p/6494810.html
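The Rotation180 rule can also be checked numerically. The sketch below makes several assumptions: the forward "convolution" is the cross-correlation used by most deep learning frameworks, there is a single channel, stride 1, and no forward padding (so $p' = k - 1$, i.e. 'full' mode); the $\sigma'(z^l)$ factor is omitted, since it would simply multiply element-wise at the end.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(2)
a = rng.standard_normal((5, 5))           # layer input a^l
W = rng.standard_normal((3, 3))           # kernel W^{l+1}
delta_next = rng.standard_normal((3, 3))  # delta^{l+1}, same shape as z^{l+1}

# Forward: z^{l+1} = cross-correlation of a^l with W^{l+1}, 'valid', stride 1
def forward(a):
    return correlate2d(a, W, mode='valid')

# Claimed backward rule: 'full' correlation with the 180-rotated kernel
rot180 = np.flipud(np.fliplr(W))
grad_analytic = correlate2d(delta_next, rot180, mode='full')

# Finite-difference check of dJ/da for J = sum(delta_next * z)
eps = 1e-6
grad_numeric = np.zeros_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        e = np.zeros_like(a); e[i, j] = eps
        grad_numeric[i, j] = (np.sum(delta_next * forward(a + e))
                              - np.sum(delta_next * forward(a - e))) / (2 * eps)
print(np.max(np.abs(grad_analytic - grad_numeric)))  # ~1e-9: the rule holds
```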
The gradients of the parameters $W$ and $b$ are:
$$
\left\{\begin{aligned}
&\frac{\partial J}{\partial W^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial W^l} = a^{l-1} * \delta^l\\
&\frac{\partial J}{\partial b^l} = \frac{\partial J}{\partial z^l} \frac{\partial z^l}{\partial b^l} = \sum_{u,v}\left(\delta^l\right)_{u,v}
\end{aligned}\right.
$$
Note that the gradient with respect to $W$ involves no rotation, and $\sum_{u,v}(\delta^l)_{u,v}$ sums the entries of $\delta^l$ over all spatial positions $(u, v)$; when there are multiple output channels, each channel is summed separately, yielding a vector of the same length as $b^l$.
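Under the same single-channel conventions as the previous sketch, the two gradient formulas amount to a 'valid'-mode correlation and a spatial sum; a finite-difference check confirms both:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(3)
a_prev = rng.standard_normal((5, 5))      # a^{l-1}
W = rng.standard_normal((3, 3))           # W^l
b = 0.1                                   # scalar bias for one channel
delta = rng.standard_normal((3, 3))       # delta^l, same shape as z^l

# Claimed rules: dJ/dW^l = a^{l-1} * delta^l (valid correlation, no rotation),
# dJ/db^l = sum of delta^l over spatial positions
dW_analytic = correlate2d(a_prev, delta, mode='valid')
db_analytic = delta.sum()

# Finite-difference check for J = sum(delta * z), z = corr(a^{l-1}, W) + b
def J(W, b):
    return np.sum(delta * (correlate2d(a_prev, W, mode='valid') + b))

eps = 1e-6
dW_numeric = np.zeros_like(W)
for u in range(3):
    for v in range(3):
        e = np.zeros_like(W); e[u, v] = eps
        dW_numeric[u, v] = (J(W + e, b) - J(W - e, b)) / (2 * eps)
print(np.max(np.abs(dW_analytic - dW_numeric)))                     # ~1e-9
print(abs(db_analytic - (J(W, b + eps) - J(W, b - eps)) / (2 * eps)))  # ~1e-10
```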
Thanks to https://www.cnblogs.com/pinard/p/6494810.html