I won't go over the basics here; the focus is on the derivation.
[convolution layer + ReLU] --> pooling layer --> [convolution layer + ReLU] --> pooling layer --> Fully Connected --> …
In calculus, convolution is defined as:
$$S(t)=\int x(t-a)\,w(a)\,da$$
Its discrete form is:
$$s(t)=\sum_a x(t-a)\,w(a)$$
In matrix notation:
$$s(t)=(X * W)(t)$$
where the asterisk denotes convolution. For a two-dimensional convolution, the expression is:
$$s(i, j)=(X * W)(i, j)=\sum_m \sum_n x(i-m, j-n)\,w(m, n)$$
In CNNs, although we speak of "convolution", the formula differs slightly from the strict mathematical definition. For the two-dimensional case it is defined as:
$$s(i, j)=(X * W)(i, j)=\sum_m \sum_n x(i+m, j+n)\,w(m, n)$$
Convolution in a CNN always refers to this latter formula (strictly speaking, a cross-correlation). Here, if $x$ is a two-dimensional input matrix, then $w$ is also a two-dimensional matrix; if $X$ is a higher-dimensional tensor, then $W$ is a higher-dimensional tensor as well.
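To make the formula concrete, here is a minimal NumPy sketch of the CNN-style "convolution" (the function name `conv2d` and the toy arrays are my own illustrative choices, not from the original text):

```python
import numpy as np

def conv2d(x, w):
    """CNN-style 'convolution' (valid cross-correlation):
    s(i, j) = sum_{m, n} x(i + m, j + n) * w(m, n)."""
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w)
    return out

x = np.arange(16.0).reshape(4, 4)        # toy 4x4 input
w = np.array([[1.0, 0.0], [0.0, -1.0]])  # toy 2x2 kernel
print(conv2d(x, w))                      # 3x3 output
```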
Both convolution and pooling reduce the dimensionality of the data (although, depending on the number of filters, a convolution may instead increase the number of output channels). Together with weight sharing, these are all ways of simplifying the network.
Let's work through it layer by layer.
Regardless of how many dimensions the input has, the forward propagation from the input can be written as:
$$a^2=\sigma\left(z^2\right)=\sigma\left(a^1 * W^2+b^2\right)$$
where the superscript denotes the layer index, $*$ denotes convolution, $b$ is the bias, and $\sigma$ is the activation function, usually ReLU for convolutional layers.
The CNN model parameters here are the number of kernels $K$, the kernel size $F$, the padding $P$, and the stride $S$.
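As a sketch of this forward pass (reusing the `conv2d` helper above; the shapes, random values, and the ReLU choice are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a1 = rng.standard_normal((5, 5))  # input matrix a^1
W2 = rng.standard_normal((3, 3))  # one 3x3 kernel (K=1, F=3, P=0, S=1)
b2 = 0.1                          # scalar bias

z2 = conv2d(a1, W2) + b2          # linear part z^2
a2 = np.maximum(z2, 0.0)          # ReLU: a^2 = sigma(z^2)
```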
Next, the forward propagation from an ordinary hidden layer into a convolutional layer.
Suppose the hidden layer's output is a three-dimensional tensor made of $M$ matrices; then the convolution kernel feeding the next convolutional layer is likewise a three-dimensional tensor of $M$ sub-matrices. The expression looks just like the input-layer case:
$$a^l=\sigma\left(z^l\right)=\sigma\left(a^{l-1} * W^l+b^l\right)$$
It can also be written as the element-wise sum of $M$ sub-matrix convolutions:
$$a^l=\sigma\left(z^l\right)=\sigma\left(\sum_{k=1}^M z_k^l\right)=\sigma\left(\sum_{k=1}^M a_k^{l-1} * W_k^l+b^l\right)$$
The only difference from the previous case is that the input now comes from a hidden layer rather than from the matrix formed by the raw input image.
The CNN model parameters to define are the same as before: the number of kernels $K$, the kernel size $F$, the padding $P$, and the stride $S$.
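A sketch of this multi-channel forward pass, summing the $M$ per-channel convolutions for each of the $K$ kernels (names, shapes, and the ReLU choice are illustrative assumptions; `conv2d` is the helper defined earlier):

```python
import numpy as np

def conv_layer_forward(a_prev, W_l, b_l):
    """a_prev: (M, H, W) input tensor; W_l: (K, M, F, F) kernels;
    b_l: (K,) biases. Returns the (K, H-F+1, W-F+1) output tensor."""
    K, M, F, _ = W_l.shape
    H, Wd = a_prev.shape[1:]
    a_l = np.zeros((K, H - F + 1, Wd - F + 1))
    for k in range(K):
        # sum the M per-channel convolutions, then add the shared bias
        z_k = sum(conv2d(a_prev[m], W_l[k, m]) for m in range(M)) + b_l[k]
        a_l[k] = np.maximum(z_k, 0.0)  # ReLU
    return a_l
```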
The pooling layer's logic is simple: it shrinks and summarizes the input matrices. For example, if the input matrices are $N \times N$ and the pooling region is $k \times k$, the output matrices are all $\frac{N}{k} \times \frac{N}{k}$.
The CNN model parameters to define here are the pooling size $k$ and the pooling method (max or mean).
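A minimal pooling-forward sketch, assuming non-overlapping $k \times k$ regions and $N$ divisible by $k$:

```python
import numpy as np

def pool_forward(a, k, mode="max"):
    """Shrink an (N, N) matrix to (N//k, N//k) by k x k pooling."""
    N = a.shape[0]
    blocks = a.reshape(N // k, k, N // k, k)  # split into k x k blocks
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))
```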
The fully connected layer is just the ordinary DNN structure, i.e.:
$$a^l=\sigma\left(z^l\right)=\sigma\left(W^l a^{l-1}+b^l\right)$$
The activation function here is usually sigmoid or tanh.
After several fully connected layers, the final layer is a softmax output layer. The only difference between the output layer and an ordinary fully connected layer is that its activation function is the softmax function.
The CNN model parameters to define here are the activation function and the number of neurons in each fully connected layer.
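For completeness, a sketch of the fully connected and softmax output layers (vector shapes and function names are illustrative assumptions):

```python
import numpy as np

def fc_forward(a_prev, W, b):
    """Ordinary fully connected layer with a sigmoid activation."""
    z = W @ a_prev + b
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()
```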
By analogy with the backpropagation (BP) algorithm for ordinary DNNs:
But there are some differences:

1. The pooling layer has no activation function (equivalently, its activation can be taken as the identity, whose derivative is 1).
2. Pooling compresses the input during forward propagation, so backpropagation must first restore the error to its pre-pooling size (upsampling).
3. The convolutional layer computes its output via convolution, so the recursion for $\delta^{l-1}$ is different.
4. The convolutional layer shares the same $W$ and $b$ across positions, so the gradients of $W$ and $b$ are computed differently.
As the list shows, problem 1 is easy to handle, but problems 2, 3, and 4 are the key to the CNN backpropagation algorithm.
Neither max pooling nor mean pooling has learnable parameters. Therefore, during CNN training, the pooling layer only needs to pass the error term back to the previous layer; there are no gradients to compute.
During forward propagation, the pooling layer typically applies max pooling or mean pooling over regions of known size. Now, going in reverse, we must recover from the shrunken error $\delta^l$ the error over the larger pre-pooling region.
In backpropagation, we first restore all sub-matrices of $\delta^l$ to their pre-pooling size.
This process is usually called upsampling.
An example makes this concrete. Suppose the pooling region is $2 \times 2$ and the $k$-th sub-matrix of $\delta^l$ is:
$$\delta_k^l=\begin{pmatrix} 2 & 8 \\ 4 & 6 \end{pmatrix}$$
Since the pooling region is $2 \times 2$, we first restore $\delta_k^l$ to the original size:
$$\begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 2 & 8 & 0 \\ 0 & 4 & 6 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
For max pooling, suppose the maximum positions recorded during the forward pass were top-left, bottom-right, top-right, and bottom-left respectively; then the restored matrix is:
$$\begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 8 \\ 0 & 4 & 0 & 0 \\ 0 & 0 & 6 & 0 \end{pmatrix}$$
For mean pooling, each error is spread evenly over its region, giving:
$$\begin{pmatrix} 0.5 & 0.5 & 2 & 2 \\ 0.5 & 0.5 & 2 & 2 \\ 1 & 1 & 1.5 & 1.5 \\ 1 & 1 & 1.5 & 1.5 \end{pmatrix}$$
This gives the previous layer's $\frac{\partial J(W, b)}{\partial a_k^{l-1}}$. To obtain $\delta_k^{l-1}$:
$$\delta_k^{l-1}=\left(\frac{\partial a_k^{l-1}}{\partial z_k^{l-1}}\right)^T \frac{\partial J(W, b)}{\partial a_k^{l-1}}=\operatorname{upsample}\left(\delta_k^l\right) \odot \sigma^{\prime}\left(z_k^{l-1}\right)$$
Here the upsample function implements both the enlargement of the pooled error matrix and the redistribution of the error.
In summary, for the tensor $\delta^{l-1}$ we have:
$$\delta^{l-1}=\operatorname{upsample}\left(\delta^l\right) \odot \sigma^{\prime}\left(z^{l-1}\right)$$
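A sketch that reproduces the $2 \times 2$ example above with NumPy (the `mask` of recorded max positions is taken from the example; everything else follows the formula):

```python
import numpy as np

delta_l = np.array([[2.0, 8.0], [4.0, 6.0]])
k = 2

# mean pooling: spread each error evenly over its k x k region
up_mean = np.kron(delta_l, np.ones((k, k))) / (k * k)
# -> [[0.5, 0.5, 2, 2], [0.5, 0.5, 2, 2], [1, 1, 1.5, 1.5], [1, 1, 1.5, 1.5]]

# max pooling: route each error to the max position recorded in the
# forward pass (top-left, bottom-right, top-right, bottom-left here)
mask = np.zeros((4, 4))
mask[0, 0] = mask[1, 3] = mask[2, 1] = mask[3, 2] = 1.0
up_max = mask * np.kron(delta_l, np.ones((k, k)))

# delta^{l-1} = upsample(delta^l) ⊙ sigma'(z^{l-1}); e.g. with ReLU,
# multiply elementwise by (z_prev > 0) from the forward pass.
```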
Max pooling forward pass: $net_{1,1}^l=\max\left(net_{1,1}^{l-1},\, net_{1,2}^{l-1},\, net_{2,1}^{l-1},\, net_{2,2}^{l-1}\right)$
Suppose the maximum was taken at position $(1,1)$; then:
$$\frac{\partial net_{1,1}^l}{\partial net_{1,1}^{l-1}}=1,\qquad \frac{\partial net_{1,1}^l}{\partial net_{1,2}^{l-1}}=0,\qquad \frac{\partial net_{1,1}^l}{\partial net_{2,1}^{l-1}}=0,\qquad \frac{\partial net_{1,1}^l}{\partial net_{2,2}^{l-1}}=0$$
$$\begin{aligned}
\delta_{1,1}^{l-1}&=\frac{\partial E_d}{\partial net_{1,1}^{l-1}}=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial net_{1,1}^{l-1}}=\delta_{1,1}^l \\
\delta_{1,2}^{l-1}&=\frac{\partial E_d}{\partial net_{1,2}^{l-1}}=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial net_{1,2}^{l-1}}=0 \\
\delta_{2,1}^{l-1}&=\frac{\partial E_d}{\partial net_{2,1}^{l-1}}=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial net_{2,1}^{l-1}}=0 \\
\delta_{2,2}^{l-1}&=\frac{\partial E_d}{\partial net_{2,2}^{l-1}}=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial net_{2,2}^{l-1}}=0
\end{aligned}$$
For max pooling, the error term from the next layer is passed unchanged to the neuron at the maximum position of the corresponding block in the previous layer, while the error terms of all other neurons are 0.
That is, the layer-$l$ error $\delta_{1,1}$ is passed unchanged to whichever neuron held the maximum in the corresponding layer-$(l-1)$ block $\begin{pmatrix}\delta_{1,1} & \delta_{1,2} \\ \delta_{2,1} & \delta_{2,2}\end{pmatrix}$.
Mean pooling forward pass: $net_{1,1}^l=\frac{1}{4}\left(net_{1,1}^{l-1}+net_{1,2}^{l-1}+net_{2,1}^{l-1}+net_{2,2}^{l-1}\right)$
$$\frac{\partial net_{1,1}^l}{\partial net_{1,1}^{l-1}}=\frac{\partial net_{1,1}^l}{\partial net_{1,2}^{l-1}}=\frac{\partial net_{1,1}^l}{\partial net_{2,1}^{l-1}}=\frac{\partial net_{1,1}^l}{\partial net_{2,2}^{l-1}}=\frac{1}{4}$$
$$\begin{aligned}
\delta_{1,1}^{l-1}&=\frac{\partial E_d}{\partial net_{1,1}^{l-1}}=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial net_{1,1}^{l-1}}=\frac{1}{4}\delta_{1,1}^l \\
\delta_{1,2}^{l-1}&=\frac{\partial E_d}{\partial net_{1,2}^{l-1}}=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial net_{1,2}^{l-1}}=\frac{1}{4}\delta_{1,1}^l \\
\delta_{2,1}^{l-1}&=\frac{\partial E_d}{\partial net_{2,1}^{l-1}}=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial net_{2,1}^{l-1}}=\frac{1}{4}\delta_{1,1}^l \\
\delta_{2,2}^{l-1}&=\frac{\partial E_d}{\partial net_{2,2}^{l-1}}=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial net_{2,2}^{l-1}}=\frac{1}{4}\delta_{1,1}^l
\end{aligned}$$
For mean pooling, the error term from the next layer is distributed evenly among all neurons of the corresponding block in the previous layer. That is, the layer-$l$ error $\delta_{1,1}$ is split equally over all neurons of the corresponding layer-$(l-1)$ block $\begin{pmatrix}\delta_{1,1} & \delta_{1,2} \\ \delta_{2,1} & \delta_{2,2}\end{pmatrix}$.
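The same max-pooling rule as a function: a backward sketch that recomputes the argmax of each block from the forward-pass input (names and shapes are illustrative assumptions):

```python
import numpy as np

def maxpool_backward(delta_l, a_prev, k=2):
    """Pass each delta unchanged to the argmax of its k x k block."""
    delta_prev = np.zeros_like(a_prev)
    N = a_prev.shape[0]
    for i in range(0, N, k):
        for j in range(0, N, k):
            block = a_prev[i:i + k, j:j + k]
            p, q = np.unravel_index(block.argmax(), block.shape)
            delta_prev[i + p, j + q] = delta_l[i // k, j // k]
    return delta_prev
```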
Now for the convolutional layer. The overall idea:
Forward propagation:
$$\begin{aligned}
a_{i,j}^{l-1}&=f^{l-1}\left(net_{i,j}^{l-1}\right) \\
net^l&=\operatorname{conv}\left(W^l, a^{l-1}\right)+w_b
\end{aligned}$$
Error backpropagation:
$$\delta_{i,j}^{l-1}=\frac{\partial E_d}{\partial net_{i,j}^{l-1}}=\frac{\partial E_d}{\partial a_{i,j}^{l-1}}\frac{\partial a_{i,j}^{l-1}}{\partial net_{i,j}^{l-1}}$$
Forward computation (a $3 \times 3$ input convolved with a $2 \times 2$ kernel, giving a $2 \times 2$ output):
$$net_{1,1}^l=w_{1,1} a_{1,1}^{l-1}+w_{1,2} a_{1,2}^{l-1}+w_{2,1} a_{2,1}^{l-1}+w_{2,2} a_{2,2}^{l-1}+w_b \tag{1}$$
$$net_{1,2}^l=w_{1,1} a_{1,2}^{l-1}+w_{1,2} a_{1,3}^{l-1}+w_{2,1} a_{2,2}^{l-1}+w_{2,2} a_{2,3}^{l-1}+w_b \tag{2}$$
$$net_{2,1}^l=w_{1,1} a_{2,1}^{l-1}+w_{1,2} a_{2,2}^{l-1}+w_{2,1} a_{3,1}^{l-1}+w_{2,2} a_{3,2}^{l-1}+w_b \tag{3}$$
$$net_{2,2}^l=w_{1,1} a_{2,2}^{l-1}+w_{1,2} a_{2,3}^{l-1}+w_{2,1} a_{3,2}^{l-1}+w_{2,2} a_{3,3}^{l-1}+w_b \tag{4}$$
Error backpropagation:
$$\begin{aligned}
\frac{\partial E_d}{\partial a_{1,1}^{l-1}} &=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial a_{1,1}^{l-1}}=\delta_{1,1}^l w_{1,1} \\
\frac{\partial E_d}{\partial a_{1,2}^{l-1}} &=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial a_{1,2}^{l-1}}+\frac{\partial E_d}{\partial net_{1,2}^l}\frac{\partial net_{1,2}^l}{\partial a_{1,2}^{l-1}}=\delta_{1,1}^l w_{1,2}+\delta_{1,2}^l w_{1,1} \\
\frac{\partial E_d}{\partial a_{1,3}^{l-1}} &=\frac{\partial E_d}{\partial net_{1,2}^l}\frac{\partial net_{1,2}^l}{\partial a_{1,3}^{l-1}}=\delta_{1,2}^l w_{1,2} \\
\frac{\partial E_d}{\partial a_{2,1}^{l-1}} &=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial a_{2,1}^{l-1}}+\frac{\partial E_d}{\partial net_{2,1}^l}\frac{\partial net_{2,1}^l}{\partial a_{2,1}^{l-1}}=\delta_{1,1}^l w_{2,1}+\delta_{2,1}^l w_{1,1} \\
\frac{\partial E_d}{\partial a_{2,2}^{l-1}} &=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial a_{2,2}^{l-1}}+\frac{\partial E_d}{\partial net_{1,2}^l}\frac{\partial net_{1,2}^l}{\partial a_{2,2}^{l-1}}+\frac{\partial E_d}{\partial net_{2,1}^l}\frac{\partial net_{2,1}^l}{\partial a_{2,2}^{l-1}}+\frac{\partial E_d}{\partial net_{2,2}^l}\frac{\partial net_{2,2}^l}{\partial a_{2,2}^{l-1}} \\
&=\delta_{1,1}^l w_{2,2}+\delta_{1,2}^l w_{2,1}+\delta_{2,1}^l w_{1,2}+\delta_{2,2}^l w_{1,1} \\
\frac{\partial E_d}{\partial a_{2,3}^{l-1}} &=\frac{\partial E_d}{\partial net_{1,2}^l}\frac{\partial net_{1,2}^l}{\partial a_{2,3}^{l-1}}+\frac{\partial E_d}{\partial net_{2,2}^l}\frac{\partial net_{2,2}^l}{\partial a_{2,3}^{l-1}}=\delta_{1,2}^l w_{2,2}+\delta_{2,2}^l w_{1,2} \\
\frac{\partial E_d}{\partial a_{3,1}^{l-1}} &=\frac{\partial E_d}{\partial net_{2,1}^l}\frac{\partial net_{2,1}^l}{\partial a_{3,1}^{l-1}}=\delta_{2,1}^l w_{2,1} \\
\frac{\partial E_d}{\partial a_{3,2}^{l-1}} &=\frac{\partial E_d}{\partial net_{2,1}^l}\frac{\partial net_{2,1}^l}{\partial a_{3,2}^{l-1}}+\frac{\partial E_d}{\partial net_{2,2}^l}\frac{\partial net_{2,2}^l}{\partial a_{3,2}^{l-1}}=\delta_{2,1}^l w_{2,2}+\delta_{2,2}^l w_{2,1} \\
\frac{\partial E_d}{\partial a_{3,3}^{l-1}} &=\frac{\partial E_d}{\partial net_{2,2}^l}\frac{\partial net_{2,2}^l}{\partial a_{3,3}^{l-1}}=\delta_{2,2}^l w_{2,2}
\end{aligned}$$
Summarizing: $\frac{\partial E_d}{\partial a^{l-1}}=\delta^l * W^l$, where $*$ here is a "full" mathematical convolution: $\delta^l$ is zero-padded and the kernel $W^l$ is rotated by $180^{\circ}$.
$$\begin{aligned}
\delta_{i,j}^{l-1} &=\frac{\partial E_d}{\partial net_{i,j}^{l-1}}
=\frac{\partial E_d}{\partial a_{i,j}^{l-1}}\frac{\partial a_{i,j}^{l-1}}{\partial net_{i,j}^{l-1}}
=\sum_m \sum_n w_{m,n}^l\, \delta_{i+m,j+n}^l\, f^{\prime}\left(net_{i,j}^{l-1}\right) \\
\delta^{l-1} &=\delta^l * W^l \circ f^{\prime}\left(net^{l-1}\right)
\end{aligned}$$
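A quick numerical check of this recursion, assuming SciPy is available (`scipy.signal.convolve2d` performs the mathematical "full" convolution, i.e. it zero-pads and flips the kernel; the values below are illustrative):

```python
import numpy as np
from scipy.signal import convolve2d

delta_l = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2x2 sensitivity map
W = np.array([[0.1, 0.2], [0.3, 0.4]])        # 2x2 kernel

# full convolution = zero-pad delta_l and correlate with rot180(W)
dE_da = convolve2d(delta_l, W, mode="full")    # 3x3 result

# the center element matches the hand-derived sum for a_{2,2}:
# d11*w22 + d12*w21 + d21*w12 + d22*w11 = 2.0
print(dE_da[1, 1])
# multiplying elementwise by f'(net^{l-1}) then gives delta^{l-1}
```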
The same reasoning extends to inputs with depth $D$; for example, an RGB image has three channels, so $D=3$.
With the sensitivity map of layer $l$ in hand, we can compute the gradients of the filter weights. Because the convolutional layer shares its weights, the gradient computation is slightly different.
Recall the forward computation (1)–(4) above. By the chain rule:
$$\begin{aligned}
\frac{\partial E_d}{\partial w_{1,1}} &=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial w_{1,1}}+\frac{\partial E_d}{\partial net_{1,2}^l}\frac{\partial net_{1,2}^l}{\partial w_{1,1}}+\frac{\partial E_d}{\partial net_{2,1}^l}\frac{\partial net_{2,1}^l}{\partial w_{1,1}}+\frac{\partial E_d}{\partial net_{2,2}^l}\frac{\partial net_{2,2}^l}{\partial w_{1,1}} \\
&=\delta_{1,1}^l a_{1,1}^{l-1}+\delta_{1,2}^l a_{1,2}^{l-1}+\delta_{2,1}^l a_{2,1}^{l-1}+\delta_{2,2}^l a_{2,2}^{l-1}
\end{aligned}$$
$$\frac{\partial E_d}{\partial w_{1,2}}=\delta_{1,1}^l a_{1,2}^{l-1}+\delta_{1,2}^l a_{1,3}^{l-1}+\delta_{2,1}^l a_{2,2}^{l-1}+\delta_{2,2}^l a_{2,3}^{l-1}$$
$$\frac{\partial E_d}{\partial w_{2,1}}=\delta_{1,1}^l a_{2,1}^{l-1}+\delta_{1,2}^l a_{2,2}^{l-1}+\delta_{2,1}^l a_{3,1}^{l-1}+\delta_{2,2}^l a_{3,2}^{l-1}$$
$$\frac{\partial E_d}{\partial w_{2,2}}=\delta_{1,1}^l a_{2,2}^{l-1}+\delta_{1,2}^l a_{2,3}^{l-1}+\delta_{2,1}^l a_{3,2}^{l-1}+\delta_{2,2}^l a_{3,3}^{l-1}$$
$$\begin{aligned}
\frac{\partial E_d}{\partial w_b} &=\frac{\partial E_d}{\partial net_{1,1}^l}\frac{\partial net_{1,1}^l}{\partial w_b}+\frac{\partial E_d}{\partial net_{1,2}^l}\frac{\partial net_{1,2}^l}{\partial w_b}+\frac{\partial E_d}{\partial net_{2,1}^l}\frac{\partial net_{2,1}^l}{\partial w_b}+\frac{\partial E_d}{\partial net_{2,2}^l}\frac{\partial net_{2,2}^l}{\partial w_b} \\
&=\delta_{1,1}^l+\delta_{1,2}^l+\delta_{2,1}^l+\delta_{2,2}^l =\sum_i \sum_j \delta_{i,j}^l
\end{aligned}$$
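These per-weight sums are exactly a valid cross-correlation of $a^{l-1}$ with $\delta^l$, and the bias gradient is the sum of all the deltas. A sketch, again with illustrative values and assuming SciPy:

```python
import numpy as np
from scipy.signal import correlate2d

a_prev = np.arange(1.0, 10.0).reshape(3, 3)   # 3x3 activations a^{l-1}
delta_l = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2x2 sensitivity map

# dE/dw(m, n) = sum_{i, j} delta(i, j) * a(i + m, j + n)
dW = correlate2d(a_prev, delta_l, mode="valid")  # 2x2 weight gradient
db = delta_l.sum()                               # dE/dw_b

print(dW)
print(db)
```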