1. Choosing the number of layers: a three-layer network is usually adopted (adding further layers does not improve the network's classification ability);
2. Input layer: the number of input-layer nodes equals the dimension $n$ of the input features; the activation function is linear;
3. Hidden layer: the hidden layer is what makes nonlinear classification possible, and its node count must be chosen. In general, the more hidden nodes, the stronger the network's classification ability; the activation function is usually the Sigmoid function;
4. Output layer: the number of output nodes can equal the number of classes, or a coded output (fewer nodes than classes) can be used; the activation function may be linear or Sigmoid.
As shown in the figure above, the discriminant function of this three-layer network has the form
$$
\begin{aligned}
Y^3 &= f_3\Big(\sum_{k=1}^{n_2} w_k \cdot Y^2_k - \theta\Big)\\
&= f_3\Big(\sum_{k=1}^{n_2} w_k \cdot \Big\{ f_2\Big(\sum_{j=1}^{n_1} w_{kj}\cdot Y^1_j - \theta_k\Big)\Big\} - \theta\Big)\\
&= f_3\Big(\sum_{k=1}^{n_2} w_k \cdot \Big\{ f_2\Big(\sum_{j=1}^{n_1} w_{kj}\cdot \Big[ f_1\Big(\sum_{i=1}^{n} w_{ji}\cdot X_i - \theta_j\Big)\Big] - \theta_k\Big)\Big\} - \theta\Big)
\end{aligned}
$$
where $n_2$ is the number of hidden-layer nodes and $n$ is the input feature dimension. The figure above has only one output unit (two classes); with $c$ output units ($c$ classes), the network can be viewed as computing $c$ discriminant functions $Y^3_c$, and an input is assigned to the class whose discriminant function is largest. This feed-forward computation is the recognition process.
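As a minimal illustration of this feed-forward computation, the sketch below implements the three-layer discriminant function in NumPy. The layer sizes, random weights, and all names are assumptions made for the example, not code from this article:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
n, n1, n2, c = 4, 5, 3, 2          # input dim, two inner widths, class count

# Weight matrices and thresholds for each layer (random placeholders).
W1, theta1 = rng.uniform(-1, 1, (n1, n)), rng.uniform(-1, 1, n1)
W2, theta2 = rng.uniform(-1, 1, (n2, n1)), rng.uniform(-1, 1, n2)
W3, theta3 = rng.uniform(-1, 1, (c, n2)), rng.uniform(-1, 1, c)

def discriminant(x):
    y1 = W1 @ x - theta1               # f1 linear, as stated above
    y2 = sigmoid(W2 @ y1 - theta2)     # f2 = Sigmoid in the hidden layer
    y3 = sigmoid(W3 @ y2 - theta3)     # f3 on the output layer
    return y3

x = rng.uniform(-1, 1, n)
Y3 = discriminant(x)
print("class =", int(np.argmax(Y3)))   # pick the largest discriminant
```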
In essence, the BP algorithm is a least-mean-square (LMS) error-minimization algorithm.
A. Gradient descent
Let $y_j(n)$ denote the output of neuron $j$ at the $n$-th iteration (i.e., when the $n$-th training sample is presented) and $d_j(n)$ its target value; the output error of this neuron is $e_j(n)=d_j(n)-y_j(n)$. The squared error over the whole output layer $C$ serves as the loss function: $\displaystyle E(n)=\frac{1}{2}\sum_{k\in C}e^2_k(n)$. Its minimum can be sought with the iteration $w_{ij}^{k+1}=w_{ij}^{k}+\lambda\cdot\triangle w_{ij}^{k}$; choosing $\displaystyle\triangle w_{ij}\propto -\frac{\partial E}{\partial w_{ij}}$ makes the loss decrease fastest and reach the minimum sooner.
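A quick numeric sketch of this rule on a hypothetical one-weight "network" (all values below are made up for illustration): minimizing $E(w)=\frac{1}{2}(d-wx)^2$ by stepping opposite the gradient drives $w$ toward $d/x$.

```python
# Minimal gradient-descent sketch for E(w) = 0.5 * (d - w*x)**2,
# a single linear neuron with one weight.
x, d = 2.0, 3.0          # input and target (arbitrary)
w, lam = 0.0, 0.1        # initial weight, learning rate (lambda above)

for step in range(50):
    e = d - w * x        # e(n) = d(n) - y(n) with y = w*x
    grad = -e * x        # dE/dw by the chain rule
    w += lam * (-grad)   # delta_w proportional to -dE/dw
print(w, d / x)          # w converges toward d/x = 1.5
```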
B. Chain rule: $\displaystyle\frac{\partial f(g(x))}{\partial x}=\frac{\partial f(g(x))}{\partial g(x)}\cdot \frac{\partial g(x)}{\partial x}$
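A tiny sanity check of the chain rule, using arbitrary example functions ($f(u)=u^2$, $g(x)=\sin x$ are my choices, not the article's), comparing the analytic product against a central finite difference:

```python
import numpy as np

f = lambda u: u**2          # outer function f
df = lambda u: 2 * u        # its derivative f'
g, dg = np.sin, np.cos      # inner function g and its derivative g'

x, h = 0.7, 1e-6
analytic = df(g(x)) * dg(x)                      # f'(g(x)) * g'(x)
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)  # central difference
print(analytic, numeric)                          # agree to ~1e-9
```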
Based on the above, the update rule for the weight $w_{ji}$ on the connection from neuron $i$ to neuron $j$ can be set as:
$$
w_{ji}(t+1)=w_{ji}(t)+\triangle w_{ji},\qquad \triangle w_{ji}=-\lambda\cdot\frac{\partial E(n)}{\partial w_{ji}}
$$
If neuron $j$ is an output node, then $\displaystyle\frac{\partial E(n)}{\partial w_{ji}(n)}=\frac{\partial E(n)}{\partial v_{j}(n)}\cdot\frac{\partial v_j(n)}{\partial w_{ji}(n)}$, where $v_j(n)=\sum_i w_{ji}(n)\,y_i(n)$ is the induced local field of neuron $j$, so that $\displaystyle\frac{\partial v_j(n)}{\partial w_{ji}(n)}=y_i(n)$.
Define the local gradient $\displaystyle\delta_j(n)=-\frac{\partial E(n)}{\partial v_{j}(n)}$; then:
$$
\begin{aligned}
\delta_j(n) &= -\frac{\partial E(n)}{\partial v_{j}(n)}
= -\frac{\partial E(n)}{\partial y_{j}(n)}\cdot\frac{\partial y_j(n)}{\partial v_{j}(n)}
= -\left[\frac{\partial E(n)}{\partial e_{j}(n)}\cdot\frac{\partial e_j(n)}{\partial y_{j}(n)}\right]\cdot\frac{\partial y_j(n)}{\partial v_{j}(n)}\\
&= -\left[e_j(n)\cdot(-1)\right]\cdot\varphi'_j(v_j(n))
= e_j(n)\cdot\varphi'_j(v_j(n))
\end{aligned}
$$
Hence $\displaystyle\frac{\partial E(n)}{\partial w_{ji}(n)}=-e_j(n)\cdot\varphi'_j(v_j(n))\cdot y_i(n)=-\delta_j(n)\cdot y_i(n)$.
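To make the output-node case concrete, here is a minimal sketch (the sigmoid choice, sizes, and values are assumptions for illustration) that computes $\delta_j(n)$ and the weight gradient for one output neuron, then cross-checks one gradient component with a central difference:

```python
import numpy as np

phi = lambda v: 1 / (1 + np.exp(-v))      # sigmoid activation
dphi = lambda v: phi(v) * (1 - phi(v))    # its derivative phi'

rng = np.random.default_rng(1)
y_in = rng.uniform(-1, 1, 3)              # y_i(n): inputs to output neuron j
w_j = rng.uniform(-1, 1, 3)               # w_ji(n)
d_j = 1.0                                 # target output

v_j = w_j @ y_in                          # induced local field v_j(n)
e_j = d_j - phi(v_j)                      # e_j(n) = d_j(n) - y_j(n)
delta_j = e_j * dphi(v_j)                 # delta_j = e_j * phi'(v_j)
grad = -delta_j * y_in                    # dE/dw_ji = -delta_j * y_i

# Numerical check of the first gradient component:
def E(w):
    return 0.5 * (d_j - phi(w @ y_in)) ** 2
h = 1e-6
w_plus = w_j.copy(); w_plus[0] += h
w_minus = w_j.copy(); w_minus[0] -= h
print(grad[0], (E(w_plus) - E(w_minus)) / (2 * h))   # should match
```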
Note that for each $e_k(n)=d_k(n)-y_k(n)$ in $E(n)$, the input vector of every output neuron $k$ includes $y_j(n)$, so $\displaystyle\frac{\partial E(n)}{\partial y_{j}(n)}=\frac{\partial\big[\frac{1}{2}\sum_{k\in C}e^2_k(n)\big]}{\partial y_{j}(n)}$. Viewing $E(n)$ as a function of $y_j(n)$, we get:
$$
\begin{aligned}
\frac{\partial E(n)}{\partial y_{j}(n)}
&= \frac{\partial\big[\frac{1}{2}\sum_{k\in C}e^2_k(n)\big]}{\partial y_{j}(n)}
= \sum_k\frac{\partial\big[\frac{1}{2}e^2_k(n)\big]}{\partial y_{j}(n)}
= \sum_k\left\{\frac{\partial\big[\frac{1}{2}e^2_k(n)\big]}{\partial e_k(n)}\cdot\frac{\partial e_k(n)}{\partial y_k(n)}\cdot\frac{\partial y_k(n)}{\partial v_k(n)}\cdot\frac{\partial v_k(n)}{\partial y_j(n)}\right\}\\
&= \sum_k\big[e_k(n)\cdot(-1)\cdot\varphi'_k(v_k(n))\cdot w_{kj}\big]
= -\sum_k\big[e_k(n)\cdot\varphi'_k(v_k(n))\cdot w_{kj}\big]
= -\sum_k\big[\delta_k(n)\cdot w_{kj}\big]
\end{aligned}
$$
where $\delta_k(n)=e_k(n)\cdot\varphi'_k(v_k(n))$.
For a hidden-layer neuron $j$, on the other hand:
$$
\begin{aligned}
\delta_j(n) &= -\frac{\partial E(n)}{\partial v_{j}(n)}
= -\frac{\partial E(n)}{\partial y_{j}(n)}\cdot\frac{\partial y_j(n)}{\partial v_{j}(n)}\\
&= -\Big\{-\sum_k\big[\delta_k(n)\cdot w_{kj}\big]\Big\}\cdot\varphi'_j(v_j(n))
= \varphi'_j(v_j(n))\cdot\sum_k\big[\delta_k(n)\cdot w_{kj}\big]
\end{aligned}
$$
Likewise, for a hidden-layer neuron $\displaystyle\frac{\partial E(n)}{\partial w_{ji}(n)}=-\delta_j(n)\cdot y_i(n)$, where $\delta_j(n)$ is the local gradient; note that the local gradient takes a different form in different layers.
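The hidden-layer recursion can be sketched in the same style. All arrays below are hypothetical placeholders: given the downstream deltas $\delta_k$ and weights $w_{kj}$, the hidden delta is their weighted sum scaled by $\varphi'_j$:

```python
import numpy as np

phi = lambda v: 1 / (1 + np.exp(-v))
dphi = lambda v: phi(v) * (1 - phi(v))

rng = np.random.default_rng(2)
v_hidden = rng.uniform(-1, 1, 4)       # v_j(n) for 4 hidden neurons
W_out = rng.uniform(-1, 1, (2, 4))     # w_kj: hidden j -> output k
delta_out = rng.uniform(-1, 1, 2)      # delta_k(n) of the output layer

# delta_j = phi'(v_j) * sum_k delta_k * w_kj  (vectorized over j)
delta_hidden = dphi(v_hidden) * (W_out.T @ delta_out)

# and dE/dw_ji = -delta_j * y_i for the incoming weights:
y_prev = rng.uniform(-1, 1, 3)         # y_i(n) from the previous layer
grad = -np.outer(delta_hidden, y_prev)
print(grad.shape)                      # (4, 3), one entry per weight w_ji
```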
The MLP training procedure is as follows (a runnable sketch of the whole loop follows the list):
1. Initialize the network weights (connection weights $w_{ij}$ and neuron thresholds $\theta_i$), usually to random values in $[-1,+1]$; without prior knowledge a uniform distribution can be used, and with prior knowledge the initialization can follow the prior;
2. Order the training samples: presenting every training sample to the network once constitutes one epoch; at the start of each epoch, the training samples are randomly shuffled;
3. Feed-forward computation: proceed layer by layer from the input layer to the output layer, computing each neuron's induced local field and output function signal.
The induced local field of neuron $j$ in layer $h$ is $\displaystyle v^{(h)}_j=\sum^{m_h-1}_{i=0}w^{(h)}_{ji}(n)\cdot y^{(h-1)}_i(n)$ (the $i=0$ term conventionally carries the bias/threshold).
The output function signal of neuron $j$ in layer $h$ is $\displaystyle y^{(h)}_j=\varphi_j(v^{(h)}_j)$. If neuron $j$ is in the first hidden layer ($h=1$), its inputs are the components of the input sample, i.e., $y^{(0)}_j=x_j(n)$; if neuron $j$ is in the output layer ($h=L$), let the network output be $o_j(n)=y^{(L)}_j(n)$,
and compute the error $e_j(n)=d_j(n)-y_j(n)$.
4. Back-propagate the error: moving backward from the output layer, compute the local gradient of each neuron in every layer:
$$
\delta_j^{(h)}=\begin{cases}
\varphi'_j\big(v^{(L)}_j(n)\big)\cdot e^{(L)}_j(n), & h=L\\[4pt]
\varphi'_j\big(v^{(h)}_j(n)\big)\cdot\displaystyle\sum_k\big[\delta^{(h+1)}_k(n)\cdot w^{(h+1)}_{kj}(n)\big], & h<L
\end{cases}
$$
5. Adjust the network weights and biases, i.e., update each neuron's connection weights to the preceding layer: $\displaystyle w^{(h)}_{ji}(n+1)=w^{(h)}_{ji}(n)+\lambda\cdot\delta^{(h)}_j\cdot y^{(h-1)}_i(n)$
6. Return to step 2 until the termination condition is met.
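Putting steps 1–6 together, here is a minimal, self-contained NumPy sketch of the whole procedure on the XOR problem. The layer sizes, learning rate, epoch count, and the choice to fold each threshold into a trailing bias weight (playing the role of $-\theta$) are assumptions for illustration, and convergence depends on the random seed:

```python
import numpy as np

phi = lambda v: 1 / (1 + np.exp(-v))
dphi = lambda v: phi(v) * (1 - phi(v))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
D = np.array([0.0, 1.0, 1.0, 0.0])           # XOR targets
rng = np.random.default_rng(0)

# Step 1: initialize weights in [-1, +1]; the last column of each
# matrix is the bias weight for a constant input of 1.0.
W1 = rng.uniform(-1, 1, (4, 3))              # hidden layer (4 neurons)
W2 = rng.uniform(-1, 1, (1, 5))              # output layer (1 neuron)

lam = 0.5
for epoch in range(5000):
    for idx in rng.permutation(len(X)):      # step 2: shuffle each epoch
        # Step 3: feed-forward (append 1.0 as the bias input).
        y0 = np.append(X[idx], 1.0)
        v1 = W1 @ y0
        y1 = np.append(phi(v1), 1.0)
        v2 = W2 @ y1
        e = D[idx] - phi(v2)                 # error at the output

        # Step 4: back-propagate the local gradients.
        delta2 = e * dphi(v2)                          # output layer
        delta1 = dphi(v1) * (W2[:, :-1].T @ delta2)    # hidden layer

        # Step 5: weight updates w <- w + lam * delta * y.
        W2 += lam * np.outer(delta2, y1)
        W1 += lam * np.outer(delta1, y0)
        # Step 6: loop until the termination condition (epoch count here).

out = [phi(W2 @ np.append(phi(W1 @ np.append(x, 1.0)), 1.0))[0] for x in X]
print(np.round(out, 2))                      # should approach [0, 1, 1, 0]
```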
The last three of the issues above are problems specific to multilayer perceptron networks.
For a further analysis of the BP algorithm, see Part 2 of this series on the error back-propagation (BP, Back-Propagation) algorithm.