$z^l$: the result at layer $l$ after convolution, before activation
$a^l$: the result at layer $l$ after activation; in general, it is also the input to layer $l+1$
The forward-propagation formula of a convolutional layer is:
$$a^l = \sigma(z^l) = \sigma(a^{l-1} * W^l + b^l)$$
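As a concrete illustration of this forward step, here is a minimal single-channel NumPy sketch, assuming $*$ denotes the unflipped (cross-correlation) convolution used in the derivations below and taking the sigmoid as a stand-in for $\sigma$ (the function name and signature are illustrative):

```python
import numpy as np

def conv_forward(a_prev, W, b, stride=1):
    """z = a_prev * W + b (unflipped convolution), then a = sigmoid(z)."""
    H, Wd = a_prev.shape
    k = W.shape[0]                       # assume a square kernel
    out_h = (H - k) // stride + 1
    out_w = (Wd - k) // stride + 1
    z = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = a_prev[i*stride:i*stride+k, j*stride:j*stride+k]
            z[i, j] = np.sum(window * W) + b
    return 1.0 / (1.0 + np.exp(-z)), z   # (a^l, z^l)
```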
$\delta^l$: the error propagated back to layer $l$
$J(W,b)$: the loss function
The error at layer $l$ can be understood as the partial derivative of the loss with respect to the pre-activation output $z^l$ of layer $l$. The recurrence relating the error at layer $l$ to the error at layer $l+1$ is:
$$\delta^{l} = \frac{\partial J(W,b)}{\partial z^l} = \frac{\partial J(W,b)}{\partial z^{l+1}}\frac{\partial z^{l+1}}{\partial z^{l}} = \delta^{l+1}\frac{\partial z^{l+1}}{\partial z^{l}}$$
Therefore, to derive the relationship between $\delta^{l-1}$ and $\delta^l$, we must compute the gradient expression $\frac{\partial z^{l}}{\partial z^{l-1}}$.
Note that $z^{l}$ and $z^{l-1}$ are related by:
$$z^l = a^{l-1}*W^l + b^l = \sigma(z^{l-1})*W^l + b^l$$
A direct derivation in full generality is cumbersome, so take a simple example:
Suppose the output $a^{l-1}$ of layer $l-1$ is a 3×3 matrix, the convolution kernel $W^l$ of layer $l$ is a 2×2 matrix, and the stride is 1 pixel; then the output is a 2×2 matrix. Simplifying the bias $b^l$ to 0, we have:
$$a^{l-1} * W^l = z^{l}$$
Writing out the matrix expressions for $a$, $W$, and $z$:
$$\left( \begin{array}{ccc} a_{11}&a_{12}&a_{13} \\ a_{21}&a_{22}&a_{23}\\ a_{31}&a_{32}&a_{33} \end{array} \right) * \left( \begin{array}{cc} w_{11}&w_{12}\\ w_{21}&w_{22} \end{array} \right) = \left( \begin{array}{cc} z_{11}&z_{12}\\ z_{21}&z_{22} \end{array} \right)$$
From the definition of convolution, it is easy to derive:
$$z_{11} = a_{11}w_{11} + a_{12}w_{12} + a_{21}w_{21} + a_{22}w_{22}$$
$$z_{12} = a_{12}w_{11} + a_{13}w_{12} + a_{22}w_{21} + a_{23}w_{22}$$
$$z_{21} = a_{21}w_{11} + a_{22}w_{12} + a_{31}w_{21} + a_{32}w_{22}$$
$$z_{22} = a_{22}w_{11} + a_{23}w_{12} + a_{32}w_{21} + a_{33}w_{22}$$
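These four expansions are exactly what an unflipped, stride-1 convolution computes, which can be checked numerically. A quick sketch, assuming SciPy's `scipy.signal.correlate2d` is available:

```python
import numpy as np
from scipy.signal import correlate2d

a = np.arange(1.0, 10.0).reshape(3, 3)   # entries a_{11}..a_{33}
W = np.array([[1.0, 2.0], [3.0, 4.0]])   # entries w_{11}..w_{22}

z = correlate2d(a, W, mode='valid')      # unflipped convolution, stride 1
# Compare z_{11} with the first expansion above:
z11 = a[0, 0]*W[0, 0] + a[0, 1]*W[0, 1] + a[1, 0]*W[1, 0] + a[1, 1]*W[1, 1]
assert np.isclose(z[0, 0], z11)
```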
Next, simulate the backward differentiation:
$$\nabla a^{l-1} = \frac{\partial J(W,b)}{\partial a^{l-1}} = \frac{\partial J(W,b)}{\partial z^{l}} \frac{\partial z^{l}}{\partial a^{l-1}} = \delta^{l} \frac{\partial z^{l}}{\partial a^{l-1}}$$
In the expression above, $\delta^{l}$ is the error passed down from upstream, with dimensions 2×2 (the convolutional layer's output is 2×2, so the upstream gradient has the same dimensions as that output). From the four equations above we can compute $\frac{\partial z^{l}}{\partial a^{l-1}}$, and the dimensions of $\nabla a^{l-1}$ match those of $a^{l-1}$, namely 3×3. Concretely, take the gradient of $a_{11}$: among the four equations, $a_{11}$ appears in a product only with $z_{11}$, so we have:
$$\nabla a_{11} = \delta_{11}w_{11}$$
$$\nabla a_{12} = \delta_{11}w_{12} + \delta_{12}w_{11}$$
$$\nabla a_{22} = \delta_{11}w_{22} + \delta_{12}w_{21} + \delta_{21}w_{12} + \delta_{22}w_{11}$$
Computing the remaining entries in the same way, the result can be written in matrix form:
$$\left( \begin{array}{cccc} 0&0&0&0 \\ 0&\delta_{11}&\delta_{12}&0 \\ 0&\delta_{21}&\delta_{22}&0 \\ 0&0&0&0 \end{array} \right) * \left( \begin{array}{cc} w_{22}&w_{21}\\ w_{12}&w_{11} \end{array} \right) = \left( \begin{array}{ccc} \nabla a_{11}&\nabla a_{12}&\nabla a_{13} \\ \nabla a_{21}&\nabla a_{22}&\nabla a_{23}\\ \nabla a_{31}&\nabla a_{32}&\nabla a_{33} \end{array} \right)$$
To make the gradient computation come out right, we padded the error matrix with a ring of zeros. Convolving the 180°-rotated kernel with the backpropagated gradient error then yields the gradient error of the previous layer. This example gives an intuitive picture of why the convolution kernel must be rotated by 180° when backpropagating through an expression containing a convolution. Combining this with the derivative of the activation gives the general layer-to-layer recurrence $\delta^{l-1} = \delta^{l} * \mathrm{rot180}(W^{l}) \odot \sigma'(z^{l-1})$.
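The flip-and-pad identity is easy to verify numerically. A minimal sketch, assuming stride 1 and SciPy; the last line also shows that a 'full' convolution performs the 180° flip implicitly:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

delta = np.random.randn(2, 2)            # upstream error delta^l
W = np.random.randn(2, 2)                # kernel W^l

# Zero-pad delta with one ring and correlate with the 180-degree-rotated kernel.
grad_a = correlate2d(np.pad(delta, 1, 'constant'), np.rot90(W, 2), mode='valid')

# Compare one entry with the hand-derived sum for grad a_{22}:
g22 = (delta[0, 0]*W[1, 1] + delta[0, 1]*W[1, 0]
       + delta[1, 0]*W[0, 1] + delta[1, 1]*W[0, 0])
assert np.isclose(grad_a[1, 1], g22)
# A 'full' convolution flips the kernel for us and gives the same 3x3 result.
assert np.allclose(grad_a, convolve2d(delta, W, mode='full'))
```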
That completes the error backpropagation through a convolutional layer.
Turning to the parameter gradients, note that $z$ is related to $W$ and $b$ in a convolutional layer by:
$$z^l = a^{l-1}*W^l + b^l$$
Therefore we have:
$$\frac{\partial J(W,b)}{\partial W^{l}} = \frac{\partial J(W,b)}{\partial z^{l}}\frac{\partial z^{l}}{\partial W^{l}} = a^{l-1} * \delta^l$$
Note that the kernel is not flipped this time, because this is differentiation with respect to the layer's own parameters rather than backpropagation to the previous layer. We can analyze the process concretely.
Take a simplified example like the one in Section 4, where the input is a matrix rather than a tensor. Then for layer $l$, the derivative with respect to a particular kernel matrix $W$ can be expressed as:
$$\frac{\partial J(W,b)}{\partial W_{pq}^{l}} = \sum\limits_i\sum\limits_j \delta_{ij}^l\, a_{i+p-1,j+q-1}^{l-1}$$
Suppose the input $a$ is a 4×4 matrix, the kernel $W$ is a 3×3 matrix, and the output $z$ is a 2×2 matrix; then the backpropagated gradient error $\delta$ of $z$ is also a 2×2 matrix.
According to the formula above, we have:
$$\frac{\partial J(W,b)}{\partial W_{11}^{l}} = a_{11}\delta_{11} + a_{12}\delta_{12} + a_{21}\delta_{21} + a_{22}\delta_{22}$$
$$\frac{\partial J(W,b)}{\partial W_{12}^{l}} = a_{12}\delta_{11} + a_{13}\delta_{12} + a_{22}\delta_{21} + a_{23}\delta_{22}$$
$$\frac{\partial J(W,b)}{\partial W_{13}^{l}} = a_{13}\delta_{11} + a_{14}\delta_{12} + a_{23}\delta_{21} + a_{24}\delta_{22}$$
$$\frac{\partial J(W,b)}{\partial W_{21}^{l}} = a_{21}\delta_{11} + a_{22}\delta_{12} + a_{31}\delta_{21} + a_{32}\delta_{22}$$
Altogether we obtain nine such equations. Collected in matrix form:
$$\frac{\partial J(W,b)}{\partial W^{l}} = \left( \begin{array}{cccc} a_{11}&a_{12}&a_{13}&a_{14} \\ a_{21}&a_{22}&a_{23}&a_{24} \\ a_{31}&a_{32}&a_{33}&a_{34} \\ a_{41}&a_{42}&a_{43}&a_{44} \end{array} \right) * \left( \begin{array}{cc} \delta_{11}&\delta_{12} \\ \delta_{21}&\delta_{22} \end{array} \right)$$
From this it is clear why there is no kernel flip this time.
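This too can be confirmed numerically: correlating the input with the unflipped error matrix reproduces the hand-derived sums. A minimal sketch, again assuming SciPy:

```python
import numpy as np
from scipy.signal import correlate2d

a = np.random.randn(4, 4)                # input a^{l-1}
delta = np.random.randn(2, 2)            # upstream error delta^l

dW = correlate2d(a, delta, mode='valid') # 3x3 kernel gradient, no flip

# Compare one entry with the hand-derived sum for dW_{12}:
d12 = (a[0, 1]*delta[0, 0] + a[0, 2]*delta[0, 1]
       + a[1, 1]*delta[1, 0] + a[1, 2]*delta[1, 1])
assert np.isclose(dW[0, 1], d12)
```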
The bias $b$ is slightly special: $\delta^l$ is a three-dimensional tensor while $b$ is only a vector, so they cannot be equated directly as in a DNN. The usual approach is to sum the entries of each sub-matrix of $\delta^l$ separately, giving an error vector that is the gradient of $b$:
$$\frac{\partial J(W,b)}{\partial b^{l}} = \sum\limits_{u,v}(\delta^l)_{u,v}$$
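In the batched four-dimensional setting of the implementation below, this per-filter summation becomes a single reduction. A minimal sketch, assuming the upstream gradient `dout` has shape (N, F, H_out, W_out):

```python
import numpy as np

dout = np.random.randn(8, 6, 4, 4)   # illustrative (N, F, H_out, W_out) gradient
db = dout.sum(axis=(0, 2, 3))        # one scalar gradient per filter f
```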
A concrete implementation in Python (NumPy):
```python
import numpy as np

# Backward pass of a convolutional layer (naive loop version).
# x: input; w: convolution kernels; dout: upstream gradient dJ/dz;
# s: stride; pad: zero-padding width.
N, C, H, W = x.shape            # x is the layer input
F, C, HH, WW = w.shape          # w holds the convolution kernels
H_new = 1 + (H + 2 * pad - HH) // s
W_new = 1 + (W + 2 * pad - WW) // s
db = np.zeros(F)
dw = np.zeros_like(w)
x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
dx_padded = np.zeros_like(x_padded)
for i in range(N):              # i-th image
    for f in range(F):          # f-th filter
        for j in range(H_new):
            for k in range(W_new):
                window = x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s]
                db[f] += dout[i, f, j, k]            # bias: sum of the errors
                dw[f] += window * dout[i, f, j, k]   # kernel: input window * error
                dx_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] += w[f] * dout[i, f, j, k]
dx = dx_padded[:, :, pad:pad+H, pad:pad+W]           # strip the padding back off
```
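A hand-written backward pass like this is easiest to trust after a finite-difference check. Below is a minimal sketch that rebuilds the matching naive forward pass (`conv_forward_naive` is an illustrative helper, not part of the original) and compares one entry of `dw` against a numerical estimate under the scalar surrogate loss `np.sum(out * dout)`; it assumes `x`, `w`, `dout`, `dw`, `s`, and `pad` are the variables defined above:

```python
import numpy as np

def conv_forward_naive(x, w, s, pad):
    """Naive forward convolution matching the backward pass above
    (bias omitted, since it does not affect the dw check)."""
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    H_new = 1 + (H + 2 * pad - HH) // s
    W_new = 1 + (W + 2 * pad - WW) // s
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
    out = np.zeros((N, F, H_new, W_new))
    for i in range(N):
        for f in range(F):
            for j in range(H_new):
                for k in range(W_new):
                    out[i, f, j, k] = np.sum(
                        xp[i, :, j*s:HH+j*s, k*s:WW+k*s] * w[f])
    return out

# Finite-difference check of one kernel weight, with loss = sum(out * dout).
eps, idx = 1e-6, (0, 0, 0, 0)
w_plus, w_minus = w.copy(), w.copy()
w_plus[idx] += eps
w_minus[idx] -= eps
numeric = (np.sum(conv_forward_naive(x, w_plus, s, pad) * dout)
           - np.sum(conv_forward_naive(x, w_minus, s, pad) * dout)) / (2 * eps)
assert np.isclose(dw[idx], numeric, rtol=1e-4)
```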
Reference: 卷积神经网络(CNN)反向传播算法 (Convolutional Neural Network (CNN) Backpropagation Algorithm).