This article has no figures; for many of these topics, figures would actually make things harder to explain.
For forward and backward propagation, I only walk through the computation by analogy with gradient descent and connect it to the source code. I make no attempt to analyze why this works or how well it works; that is beyond me, and I certainly cannot say what each layer of the network is really doing. It remains something of a black art.
Consider the directional derivative of a differentiable function $f$ along a unit direction with direction cosines $\cos\beta_k$:

$$\sum_{k=1}^n\frac{\partial f(x)}{\partial x_k}\cos\beta_k$$
where the direction cosines satisfy:

$$\sum_{k=1}^n\cos^2\beta_k=1$$
Then, by the Cauchy–Schwarz inequality:

$$\Big(\sum_{k=1}^n\frac{\partial f(x)}{\partial x_k}\cos\beta_k\Big)^2\leq \Big(\sum_{k=1}^{n}\Big(\frac{\partial f(x)}{\partial x_k}\Big)^2\Big)\Big(\sum_{k=1}^{n}\cos^2\beta_k\Big)=\sum_{k=1}^{n}\Big(\frac{\partial f(x)}{\partial x_k}\Big)^2$$
Gradient descent itself is simple. From the Cauchy–Schwarz inequality and the total differential we know that, for a differentiable function, the gradient direction is the direction of fastest increase. "Fastest" here has an instantaneous flavor: it only holds at the current point. The opposite direction is therefore the direction of fastest decrease, and moving orthogonally to the gradient amounts to moving along a level surface, where the function value does not change.
Searching in the direction opposite to the gradient keeps the function value decreasing until the gradient vanishes; when it does, the algorithm stops at a stationary point (a local minimum in the typical case). Some people say there is no optimal result for the training set; that understanding is not right. More precisely, a closed-form solution sometimes does not exist, but an optimum still exists, because the $loss$ function is bounded below.
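To make this concrete, here is a toy gradient-descent loop in C for $f(x,y)=x^2+y^2$. This is my own illustrative example; the function, learning rate, and stopping threshold are arbitrary choices, not anything from darknet.

#include <math.h>
#include <stdio.h>

/* Toy objective f(x,y) = x^2 + y^2 and its gradient. */
static float fval(float x, float y) { return x*x + y*y; }
static void grad(float x, float y, float *gx, float *gy) { *gx = 2*x; *gy = 2*y; }

int main(void)
{
    float x = 3.0f, y = -2.0f;   /* starting point                  */
    float lr = 0.1f;             /* step size (learning rate)       */
    for (int i = 0; i < 100; ++i) {
        float gx, gy;
        grad(x, y, &gx, &gy);
        if (sqrtf(gx*gx + gy*gy) < 1e-6f) break;  /* gradient vanished: stop */
        x -= lr * gx;            /* move against the gradient       */
        y -= lr * gy;
        printf("step %d: f = %f\n", i, fval(x, y));
    }
    return 0;
}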
A fully connected layer can actually be viewed as a special case of a convolutional layer; conversely, you could call a convolutional layer a "not-fully-connected" layer. I think that name fits rather well.
(If you don't know what a fully connected layer is, search Baidu; there are plenty of introductions.)
Let $c_t$ be the number of neurons in layer $t$. The fully connected operation in matrix form is:

$$\left[\begin{matrix} w_t(0,0)&w_t(0,1)&\dots&w_t(0,c_{t-1})\\ w_t(1,0)&w_t(1,1)&\dots&w_t(1,c_{t-1})\\ \vdots&\vdots&\ddots &\vdots\\ w_t(c_t,0)&w_t(c_t,1)&\dots&w_t(c_t,c_{t-1}) \end{matrix}\right]\left[\begin{matrix}y_{t-1}(0)\\ y_{t-1}(1)\\ \vdots\\ y_{t-1}(c_{t-1}) \end{matrix}\right]=v_t$$
Here $y_{t-1}$ is the previous layer's output, $w_t$ holds the synaptic weights, and $v_t$ is the not-yet-activated signal (the local induced field).
Write:

$$\varphi_t(v_t)=\left[\begin{matrix}\varphi_t(v_{t}(0))\\ \varphi_t(v_{t}(1))\\ \vdots\\ \varphi_t(v_{t}(c_t)) \end{matrix}\right]$$

where $\varphi_t$ is the activation function of layer $t$.
In compact matrix form:

$$v_t=w_ty_{t-1}\\ y_t=\varphi_t(v_t)$$
These operations correspond directly to the code:
void forward_connected_layer(connected_layer l, network_state state)
{
    /* clear the output buffer */
    fill_cpu(l.outputs*l.batch, 0, l.output, 1);
    int m = l.batch;        /* number of samples in the batch        */
    int k = l.inputs;       /* c_{t-1}: size of the previous output  */
    int n = l.outputs;      /* c_t: number of neurons in this layer  */
    float *a = state.input; /* y_{t-1} */
    float *b = l.weights;   /* w_t     */
    float *c = l.output;    /* v_t     */
    /* C = A * B^T, i.e. v_t = w_t * y_{t-1} for each sample */
    gemm(0, 1, m, n, k, 1, a, k, b, k, 1, c, n);
    /* y_t = phi_t(v_t) */
    activate_array(l.output, l.outputs*l.batch, l.activation);
}
The code above has the $batch\ normalize$ part removed; you can ignore that part for now.
Here the $gemm$ function handles the matrix multiplication. $batch$ is a training parameter (you can look it up on Baidu); at prediction time this value is forced to $1$.
The first two $gemm$ arguments indicate whether each matrix is transposed, $m$ is the number of samples, and $n, k$ are the dimensions. The result is stored in $c$, i.e. $l.output$.
Correspondence:

$$c : v_t \qquad a\ (\text{state.input}) : y_{t-1} \qquad activate\_array() : \varphi_t(v_t)$$
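To make the gemm call concrete, here is a naive sketch of what gemm(0, 1, m, n, k, 1, a, k, b, k, 1, c, n) computes in the TA=0, TB=1 case, assuming darknet's row-major layout (this is an illustration, not the actual darknet implementation; since BETA is 1 here, the product simply accumulates into the zero-filled c). With a = $y_{t-1}$ (one row per sample) and b = $w_t$ (one row per neuron), each entry of c is an inner product $w_t(j,\cdot)\cdot y_{t-1}$, i.e. $v_t(j)$.

/* Naive illustration of gemm(TA=0, TB=1, ...): C += ALPHA * A * B^T.
   A is m x k, B is n x k (so B^T is k x n), C is m x n, all row-major. */
void gemm_nt_naive(int m, int n, int k, float alpha,
                   const float *a, int lda,
                   const float *b, int ldb,
                   float *c, int ldc)
{
    for (int i = 0; i < m; ++i) {          /* each sample in the batch */
        for (int j = 0; j < n; ++j) {      /* each output neuron       */
            float sum = 0;
            for (int p = 0; p < k; ++p) {  /* dot product over inputs  */
                sum += a[i*lda + p] * b[j*ldb + p];
            }
            c[i*ldc + j] += alpha * sum;   /* v_t(j) for sample i      */
        }
    }
}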
Taking $batch=1$, define a cost function for each layer:

$$J_t=\frac{1}{2}\sum_{k=0}^{c_t}(d_t(k)-y_t(k))^2=\frac{1}{2}[e_t,e_t]\\ e_t(k)=d_t(k)-y_t(k)$$
Here $d_t(k)$ is the desired output of layer $t$. For the final output layer, this desired output is just our labeled data.
When $t$ is the output layer, we can directly obtain the error $e_t$ between the desired output and the actual output, so the gradient $\nabla J_t$ with respect to the outermost weights is easy to compute:
$$\frac{\partial J_t}{\partial w_t(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial w_t(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial v_t(k)}\frac{\partial v_t(k)}{\partial w_t(a,b)}\\ =-e_t(a)\varphi'_t(v_t(a))\frac{\partial \sum_{i}w_{t}(a,i)y_{t-1}(i)}{\partial w_t(a,b)}\\= -e_t(a)\varphi'_t(v_t(a))y_{t-1}(b)$$
But that is only the output layer. For the inner layers, although we assumed each layer has a desired output, we do not actually know what that desired output is; this is a limitation of how little we currently understand about what the network's layers do.
At this point the algorithm uses the outermost layer's error in place of the inner layers' errors.
Continuing the previous computation, write: $g_t(a)=e_t(a)\varphi'_t(v_t(a))$
Then:

$$\frac{\partial J_t}{\partial w_t(a,b)}=-g_t(a)y_{t-1}(b)$$
Now consider computing:

$$\frac{\partial J_{t}}{\partial w_{t-1}(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial v_t(k)}\frac{\partial v_t(k)}{\partial w_{t-1}(a,b)}\\ =\sum_{k=0}^{c_t}-g_t(k)\frac{\partial \sum_iw_t(k,i)y_{t-1}(i)}{\partial w_{t-1}(a,b)}\\ =\sum_{k=0}^{c_t}-g_t(k)\sum_{i=0}^{c_{t-1}}\frac{\partial w_t(k,i)y_{t-1}(i)}{\partial y_{t-1}(i)}\frac{\partial y_{t-1}(i)}{\partial w_{t-1}(a,b)}\\ = \sum_{k=0}^{c_t}-g_t(k)\sum_{i=0}^{c_{t-1}}w_{t}(k,i)\varphi'_{t-1}(v_{t-1}(i))\frac{\partial \sum_jw_{t-1}(i,j)y_{t-2}(j)}{\partial w_{t-1}(a,b)}\\ =\sum_{k=0}^{c_t}-g_{t}(k)w_t(k,a)\varphi'_{t-1}(v_{t-1}(a))y_{t-2}(b)\\=y_{t-2}(b)\varphi'_{t-1}(v_{t-1}(a))\sum_{k=0}^{c_t}-g_t(k)w_t(k,a)$$
Hence:

$$g_{t-1}(a)=\varphi'_{t-1}(v_{t-1}(a))\sum_{k=0}^{c_t}g_t(k)w_t(k,a)\\ \frac{\partial J_t}{\partial w_{t-1}(a,b)}=-g_{t-1}(a)y_{t-2}(b)$$
This relationship does not hold only between the output layer $t$ and layer $t-1$; it keeps propagating all the way back.
Consider computing the gradient for a fully connected layer $l$, taking the loss of the output layer $t$ as the objective:
$$\frac{\partial J_t}{\partial w_{l}(a,b)}=\frac{\partial J_t}{\partial y_{l}(a)}\frac{\partial y_l(a)}{\partial w_{l}(a,b)}\\=\frac{\partial J_t}{\partial y_{l}(a)}\varphi_l'(v_l(a))y_{l-1}(b)$$
Because the error terms are mutually independent, and recalling the forward pass, we have (this step is not easy to grasp; see the explanation at the end of the article):

$$\frac{\partial J_t}{\partial y_l(a)} = \sum_{k=0}^{c_{l+1}}\frac{\partial J_t}{\partial y_{l+1}(k)}\frac{\partial y_{l+1}(k)}{\partial y_l(a)}\\=\sum_{k=0}^{c_{l+1}}\frac{\partial J_t}{\partial y_{l+1}(k)}\frac{\partial y_{l+1}(k)}{\partial v_{l+1}(k)}w_{l+1}(k,a)$$
where:

$$\frac{\partial J_t}{\partial w_{l}(k,a)}=\frac{\partial J_t}{\partial y_l(k)}\frac{\partial y_{l}(k)}{\partial v_l(k)}\frac{\partial v_l(k)}{\partial w_l(k,a)}\\ =\frac{\partial J_t}{\partial y_l(k)}\frac{\partial y_{l}(k)}{\partial v_l(k)}y_{l-1}(a)$$
This amounts to redefining $g$, and we get:

$$\frac{\partial J_t}{\partial w_l(k,a)}=-g_l(k)y_{l-1}(a)$$
To summarize: in the derivation above, $t$ was taken as the output layer; the summary below uses $n$ as the output layer:

$$\frac{\partial J_n}{\partial w_t(a,b)}=-g_t(a)y_{t-1}(b)\\ g_{t}(a)=\varphi'_{t}(v_{t}(a))\sum_{k=0}^{c_{t+1}}g_{t+1}(k)w_{t+1}(k,a)$$
Note that $g_t$ here is not the error itself; it implicitly carries the error propagation.
Based on these relations, let:

$$g_t = \left[\begin{matrix}g_{t}(0)\\g_{t}(1)\\ \vdots\\g_{t}(c_t)\end{matrix}\right]$$
with all vectors taken to be column vectors.
Then:
$$\frac{\partial J_n}{\partial w_t}=-g_ty_{t-1}^T\\ \ \\ g_t=\mathrm{diag}(\varphi'_t(v_t))\,w_{t+1}^Tg_{t+1}=\varphi'_t(v_t)\bigodot \big(w_{t+1}^Tg_{t+1}\big)$$
The operator $\bigodot$ denotes element-wise (Hadamard) multiplication: $c_{i,j}=a_{i,j}b_{i,j}$.
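As a rough sketch of how these two matrix equations could be turned into code, here is a naive fully connected backward step for $batch=1$. This is an illustration only, with hypothetical names; it is not darknet's backward_connected_layer.

/* Naive backward pass for one fully connected layer (batch = 1).
   Computes g_t = phi'_t(v_t) ⊙ (w_{t+1}^T g_{t+1}) and dJ/dw_t = -g_t * y_{t-1}^T. */
void fc_backward_sketch(int c_t, int c_next, int c_prev,
                        const float *dphi_vt,   /* phi'_t(v_t), length c_t            */
                        const float *w_next,    /* w_{t+1}, c_next x c_t, row-major   */
                        const float *g_next,    /* g_{t+1}, length c_next             */
                        const float *y_prev,    /* y_{t-1}, length c_prev             */
                        float *g_t,             /* out: g_t, length c_t               */
                        float *dw_t)            /* out: dJ/dw_t, c_t x c_prev         */
{
    for (int a = 0; a < c_t; ++a) {
        float s = 0;
        for (int k = 0; k < c_next; ++k)        /* sum_k g_{t+1}(k) w_{t+1}(k,a)      */
            s += g_next[k] * w_next[k*c_t + a];
        g_t[a] = dphi_vt[a] * s;                /* element-wise product with phi'     */
    }
    for (int a = 0; a < c_t; ++a)
        for (int b = 0; b < c_prev; ++b)
            dw_t[a*c_prev + b] = -g_t[a] * y_prev[b];   /* outer product              */
}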
void forward_convolutional_layer(convolutional_layer l, network_state state)
{
    int out_h = convolutional_out_height(l);
    int out_w = convolutional_out_width(l);
    int i;

    /* clear the output buffer */
    fill_cpu(l.outputs*l.batch, 0, l.output, 1);

    int m = l.n;                    /* number of filters (output channels)    */
    int k = l.size*l.size*l.c;      /* elements per filter: size^2 * channels */
    int n = out_h*out_w;            /* number of output positions             */
    float *a = l.weights;           /* filter matrix, m x k                   */
    float *b = state.workspace;     /* im2col buffer, k x n                   */
    float *c = l.output;            /* output, m x n                          */
    static int u = 0;
    u++;
    for(i = 0; i < l.batch; ++i){
        /* unfold the input into columns so convolution becomes a matrix product */
        im2col_cpu_custom(state.input, l.c, l.h, l.w, l.size, l.stride, l.pad, b);
        /* C = A * B: each row of the result is one output feature map */
        gemm(0, 0, m, n, k, 1, a, k, b, n, 1, c, n);
        c += n*m;
        state.input += l.c*l.h*l.w;
    }
    add_bias(l.output, l.biases, l.batch, l.n, out_h*out_w);
    activate_array_cpu_custom(l.output, m*n*l.batch, l.activation);
}
The basic operation here is $im2col\_cpu\_custom$.
It rearranges the previous layer's output into a matrix of shape $[(size)^2\,l.c] \times [out\_h\times out\_w]$, which becomes the current layer's input $y$.
The current layer's weights $w$ have shape $filters \times (size)^2\,l.c$.
Here $filters$ is the number of channels of the next layer (i.e. this layer's output channels); you can also think of it as the number of neurons, or the number of convolution kernels.
$size$ is the side length of the convolution kernel.
This makes it easy to see how the convolution becomes a matrix multiplication: $w\times y$ yields an output of shape $filters\times out\_h\times out\_w$.
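To show the layout, here is a naive im2col sketch (my own illustration; darknet's im2col_cpu_custom is an optimized version of the same idea, and the function name below is hypothetical):

/* Naive im2col sketch.
   Input:  image of shape channels x height x width (row-major, channel-major).
   Output: col of shape (channels*ksize*ksize) x (out_h*out_w), so that a
           convolution becomes a single matrix product with the filter matrix. */
void im2col_sketch(const float *im, int channels, int height, int width,
                   int ksize, int stride, int pad, float *col)
{
    int out_h = (height + 2*pad - ksize) / stride + 1;
    int out_w = (width  + 2*pad - ksize) / stride + 1;
    int rows = channels * ksize * ksize;      /* one row per (channel, ky, kx)  */

    for (int r = 0; r < rows; ++r) {
        int kx = r % ksize;
        int ky = (r / ksize) % ksize;
        int ch = r / ksize / ksize;
        for (int y = 0; y < out_h; ++y) {
            for (int x = 0; x < out_w; ++x) {
                int im_y = y*stride + ky - pad;
                int im_x = x*stride + kx - pad;
                float v = 0;                  /* zero padding outside the image */
                if (im_y >= 0 && im_y < height && im_x >= 0 && im_x < width)
                    v = im[(ch*height + im_y)*width + im_x];
                col[r*out_h*out_w + y*out_w + x] = v;
            }
        }
    }
}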
Consider the outermost weight gradient when a convolutional layer is itself the output layer:

$$\frac{\partial J_n}{\partial w_n(a,b)}=\frac{1}{2}\sum_{k}\sum_{m}\frac{\partial e^2_n(k,m)}{\partial w_n(a,b)}\\ =\sum_{k}\sum_{m}e_n(k,m)\frac{\partial e_n(k,m)}{\partial v_n(k,m)}\frac{\partial v_n(k,m)}{\partial w_n(a,b)} \\=\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial \sum_{i}w_n(k,i)y_{n-1}(i,m)}{\partial w_n(a,b)} \\=\sum_{m}-e_n(a,m)\varphi'_n(v_n(a,m))y_{n-1}(b,m)$$
For layer $n-1$, still taking the outermost layer's loss as the objective:

$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\frac{1}{2}\sum_{k}\sum_{m}\frac{\partial e^2_n(k,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}e_n(k,m)\frac{\partial e_n(k,m)}{\partial v_{n}(k,m)}\frac{\partial v_n(k,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial v_n(k,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial \sum_iw_n(k,i)y_{n-1}(i,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\sum_i\frac{\partial w_n(k,i)y_{n-1}(i,m)}{\partial v_{n-1}(i,m)}\frac{\partial v_{n-1}(i,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\sum_iw_n(k,i)\varphi'_{n-1}(v_{n-1}(i,m))\frac{\partial \sum_{j}w_{n-1}(i,j)y_{n-2}(j,m)}{\partial w_{n-1}(a,b)}\\ =\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))w_n(k,a)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)$$
Based on the computation above, let:

$$g_n(a,m)=e_n(a,m)\varphi'_n(v_n(a,m))$$
Then:

$$\frac{\partial J_n}{\partial w_n(a,b)}=\sum_{m}-g_n(a,m)y_{n-1}(b,m)$$

$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\sum_{k}\sum_{m}-g_n(k,m)w_n(k,a)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)\\ =\sum_{m}\Big(\sum_{k}-g_n(k,m)w_n(k,a)\Big)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)$$
Here, let:

$$g_{n-1}(a,m)=\varphi'_{n-1}(v_{n-1}(a,m))\sum_{k}g_n(k,m)w_{n}(k,a)$$
Then:

$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\sum_{m}-g_{n-1}(a,m)y_{n-2}(b,m)$$
In fact, this recursive structure of $g$ still holds. Redefining $g$:

$$\frac{\partial J_n}{\partial w_{t}(a,b)}=\sum_{k}\sum_{m}\frac{\partial J_n}{\partial y_{t}(k,m)}\frac{\partial y_t(k,m)}{\partial w_{t}(a,b)}\\ =\sum_{k}\sum_{m}\frac{\partial J_n}{\partial y_{t}(k,m)}\frac{\partial y_t(k,m)}{\partial v_{t}(k,m)}\frac{\partial v_t(k,m)}{\partial w_t(a,b)}\\ =\sum_{m}\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}y_{t-1}(b,m)$$
Here, let:

$$g_t(a,m)=\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}$$

Clearly this is consistent with the cases $t=n$ and $t=n-1$ above.
By induction:

$$\frac{\partial J_n}{\partial w_{t}(a,b)}=\sum_{m}\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}y_{t-1}(b,m)$$
For the following step (again not easy to grasp; see the explanation at the end of the article):

$$\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}=\frac{\partial J_n}{\partial y_{t}(a,m)}\varphi_{t}'(v_{t}(a,m))\\ =\varphi_{t}'(v_{t}(a,m)) \sum_{i}\frac{\partial J_n}{\partial y_{t+1}(i,m)}\frac{\partial y_{t+1}(i,m)}{\partial v_{t+1}(i,m)}\frac{\partial v_{t+1}(i,m)}{\partial y_t(a,m)}\\ =\varphi_{t}'(v_{t}(a,m)) \sum_{i}\frac{\partial J_n}{\partial y_{t+1}(i,m)}\frac{\partial y_{t+1}(i,m)}{\partial v_{t+1}(i,m)}w_{t+1}(i,a)$$
Hence:

$$g_t(a,m)=\varphi'_t(v_t(a,m))\sum_{k}g_{t+1}(k,m)w_{t+1}(k,a)$$
The convolutional backward pass can also be written in matrix form:

$$g_t=\varphi'_t(v_t)\bigodot \big(w_{t+1}^Tg_{t+1}\big)\\ \frac{\partial J_n}{\partial w_t}=-g_ty_{t-1}^T$$
Convolutional and fully connected layers are updated in the same way; a sketch of that update follows.
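Once the gradient is in hand, the weights move one step against it. Here is a minimal sketch of that update as plain SGD, with a hypothetical learning rate lr (darknet's actual update functions also apply momentum and weight decay):

/* One plain SGD step: w <- w - lr * dJ/dw.
   dw is dJ/dw = -g_t * y_{t-1}^T from the derivation above.
   Sketch only; lr is a hypothetical learning rate. */
void sgd_step_sketch(float *w, const float *dw, int n, float lr)
{
    for (int i = 0; i < n; ++i)
        w[i] -= lr * dw[i];      /* move against the gradient */
}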
Here is an explanation for the parts that are hard to grasp.
For the two hard-to-grasp spots above, you can think of a differentiable function $h(x_1,x_2,\dots,x_n)$.
Let $f(t)=h(t,t,\dots,t)$. Then:

$$\frac{df(t)}{dt}=\sum_{k=1}^n\frac{\partial h}{\partial x_k}\Big|_{x_k=t}$$
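A tiny worked example (my own, just to make this rule concrete):

$$h(x_1,x_2)=x_1x_2,\qquad f(t)=h(t,t)=t^2,\qquad \frac{df}{dt}=2t=\underbrace{x_2}_{\partial h/\partial x_1}\Big|_{x_k=t}+\underbrace{x_1}_{\partial h/\partial x_2}\Big|_{x_k=t}$$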
The output $y_t(a)$ of the $a$-th neuron in layer $t$ influences the loss of the output layer $n$ along many paths, one per neuron of the next layer, each of which takes it as a single input variable; and clearly this influence is differentiable. So:

$$\frac{\partial J_n}{\partial y_t(a)}=\sum_{k=0}^{c_{t+1}}\frac{\partial J_n}{\partial y_{t+1}(k)}\frac{\partial y_{t+1}(k)}{\partial y_t(a)}$$