BP Neural Network: Derivation and a Python Example

BP Neural Network

A BP (back propagation) neural network usually refers to a shallow neural network with a three-layer structure. A neural network is made up of neurons, and each neuron consists of input, computation, and output units.

[Figure 1: a single neuron]

Given the inputs $x_1, x_2, \cdots, x_n$ and the intercept $+1$ in the figure above, the output is:

$$\hat y = h_{w,b}(X) = f(w^T X) = f\Big(\sum_{i=1}^n w_i x_i + b\Big)$$

where $w$ denotes the weights and $f$ is the activation function. Common activation functions include:

$$\text{sigmoid}: \quad f(x) = \frac{1}{1+e^{-x}}$$
$$\tanh: \quad f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
$$\text{ReLU}: \quad f(x) = \max(0, x)$$
$$\text{SoftPlus}: \quad f(x) = \log(1+e^x)$$
Their curves are plotted below:

[Figure 2: activation function curves]
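For reference, here is a quick NumPy sketch of the four activations (an illustration added here, not part of the original code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    return np.log(1.0 + np.exp(x))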

The structure of a three-layer neural network:

[Figure 3: a three-layer neural network]

Notation

  • Number of layers $n_l$; for a three-layer network $n_l = 3$, and layer $l$ is denoted $L_l$
  • $L_1, L_2, L_3$ are the input layer, hidden layer, and output layer respectively
  • Weights $W = (W^1, W^2)$, where $W_{ij}^l$ denotes the parameter connecting the $j$-th neuron in layer $l$ to the $i$-th neuron in layer $l+1$; e.g. $W_{21}^1$ connects the 1st neuron of layer 1 to the 2nd neuron of layer 2
  • Bias $b_i^l$ denotes the bias of the $i$-th neuron in layer $l+1$; e.g. $b_2^1$ is the bias of the 2nd neuron in layer 2
  • $z_i^l$ denotes the input of the $i$-th neuron in layer $l$
  • $a_i^l$ denotes the output of the $i$-th neuron in layer $l$
  • $S_l$ denotes the number of neurons in layer $l$; e.g. $S_3 = 2$ means layer 3 has 2 neurons

These quantities are related as follows:

$$a_2^2 = f(z_2^2) = f(W_{21}^1 x_1 + W_{22}^1 x_2 + W_{23}^1 x_3 + b_2^1)$$

That is, each neuron's input is the weighted sum of the outputs of all neurons in the previous layer, and the neuron's output is that input passed through the activation function.
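As a concrete illustration, the computation of $a_2^2$ above in NumPy, with made-up numbers and sigmoid as $f$ (a minimal sketch):

import numpy as np

x = np.array([0.5, -1.0, 2.0])    # inputs x1, x2, x3
w = np.array([0.1, 0.4, -0.2])    # weights W_21^1, W_22^1, W_23^1
b = 0.3                           # bias b_2^1

z = np.dot(w, x) + b              # neuron input z_2^2
a = 1.0 / (1.0 + np.exp(-z))      # neuron output a_2^2 = f(z_2^2)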

Loss Function

For a single training sample $(X, y)$, the loss function is:

$$J(W,b;X,y) = \frac{1}{2}\lVert h_{w,b}(X) - y \rVert^2$$

i.e. half the squared Euclidean distance between the output-layer prediction and the true value. The difference $h_{w,b}(X) - y$ is a vector whose dimension equals the number of output-layer neurons; the squared norm reduces it to a scalar loss.

To minimize the loss, the parameters are first initialized to small random values close to 0. Forward propagation then produces a prediction, from which the loss is computed. The loss is used to adjust the parameters via gradient descent, whose update rules are:

$$W_{ij}^l = W_{ij}^l - \alpha \frac{\partial J(W,b)}{\partial W_{ij}^l}$$
$$b_i^l = b_i^l - \alpha \frac{\partial J(W,b)}{\partial b_i^l}$$

where the partial derivatives are averaged over the $m$ training samples:

$$\frac{\partial J(W,b)}{\partial W_{ij}^l} = \frac{1}{m}\sum_{k=1}^m \frac{\partial J(W,b;x^k,y^k)}{\partial W_{ij}^l}$$
$$\frac{\partial J(W,b)}{\partial b_i^l} = \frac{1}{m}\sum_{k=1}^m \frac{\partial J(W,b;x^k,y^k)}{\partial b_i^l}$$
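In code, one such gradient-descent step looks like the sketch below (W, b, and the summed per-sample gradients dW, db are placeholder names):

# placeholder names; dW, db hold sums of per-sample partial derivatives
W = W - alpha * dW / m
b = b - alpha * db / m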
Since there is a weight matrix $W$ between every pair of adjacent layers, and the prediction lives in the last (output) layer, we first solve for the gradient with respect to $W_{ij}^{n_l-1}$:

$$\frac{\partial J(W,b)}{\partial W_{ij}^{n_l-1}} = \frac{\partial \frac{1}{2}\lVert a^{n_l}-y \rVert^2}{\partial W_{ij}^{n_l-1}} = \frac{\partial \frac{1}{2}\sum_{k=1}^{S_{n_l}} (a_k^{n_l}-y_k)^2}{\partial W_{ij}^{n_l-1}} = \frac{\partial \frac{1}{2}\sum_{k=1}^{S_{n_l}} (f(z_k^{n_l})-y_k)^2}{\partial W_{ij}^{n_l-1}}$$

where $z_i^{n_l}$ equals:

$$z_i^{n_l} = \sum_{p=1}^{S_{n_l-1}} W_{ip}^{n_l-1} a_p^{n_l-1} + b_i^{n_l-1}$$

Since $z_i^{n_l}$ is differentiable with respect to $W_{ij}^{n_l-1}$, the chain rule gives:

$$\frac{\partial J(W,b)}{\partial W_{ij}^{n_l-1}} = \frac{\partial \frac{1}{2}\sum_{k=1}^{S_{n_l}} (f(z_k^{n_l})-y_k)^2}{\partial z_i^{n_l}} \cdot \frac{\partial z_i^{n_l}}{\partial W_{ij}^{n_l-1}} = \big(f(z_i^{n_l})-y_i\big)\, f'(z_i^{n_l}) \cdot \frac{\partial z_i^{n_l}}{\partial W_{ij}^{n_l-1}} = \big(f(z_i^{n_l})-y_i\big)\, f'(z_i^{n_l})\, a_j^{n_l-1}$$

(only the $k=i$ term of the sum depends on $z_i^{n_l}$, and $\partial z_i^{n_l} / \partial W_{ij}^{n_l-1} = a_j^{n_l-1}$).

The idea of the backpropagation algorithm is: for a given training sample $(X, y)$, first compute every neuron's output via forward propagation; once all outputs are available, compute a residual for each neuron. The residual of the $i$-th neuron in layer $l$ is denoted $\delta_i^l$ and measures that neuron's contribution to the final error. For the last layer:

$$\delta_i^{n_l} = \frac{\partial J(W,b)}{\partial z_i^{n_l}} = \big(f(z_i^{n_l})-y_i\big)\, f'(z_i^{n_l})$$

Substituting $\delta_i^{n_l}$ into the result above gives:

$$\frac{\partial J(W,b)}{\partial W_{ij}^{n_l-1}} = \delta_i^{n_l} \cdot a_j^{n_l-1}$$

Here $a_j^{n_l-1}$ is available from forward propagation, so only $\delta_i^{n_l}$ needs to be computed. We can derive the relation between the residuals of the second-to-last layer and the last layer:

$$\delta_i^{n_l-1} = \frac{\partial J(W,b)}{\partial z_i^{n_l-1}} = \frac{\partial \frac{1}{2}\sum_{k=1}^{S_{n_l}} (f(z_k^{n_l})-y_k)^2}{\partial z_i^{n_l-1}} = \frac{1}{2}\sum_{k=1}^{S_{n_l}} \frac{\partial (f(z_k^{n_l})-y_k)^2}{\partial z_i^{n_l-1}} = \frac{1}{2}\sum_{k=1}^{S_{n_l}} \frac{\partial (f(z_k^{n_l})-y_k)^2}{\partial z_k^{n_l}} \cdot \frac{\partial z_k^{n_l}}{\partial z_i^{n_l-1}} = \sum_{k=1}^{S_{n_l}} \delta_k^{n_l} \cdot \frac{\partial z_k^{n_l}}{\partial z_i^{n_l-1}}$$

With $z_k^{n_l} = \sum_{j=1}^{S_{n_l-1}} W_{kj}^{n_l-1} f(z_j^{n_l-1}) + b_k^{n_l-1}$, only the $j=i$ term depends on $z_i^{n_l-1}$, so:

$$\delta_i^{n_l-1} = \sum_{k=1}^{S_{n_l}} \delta_k^{n_l} \cdot W_{ki}^{n_l-1} f'(z_i^{n_l-1}) = \Big[\sum_{k=1}^{S_{n_l}} \delta_k^{n_l} W_{ki}^{n_l-1}\Big] \cdot f'(z_i^{n_l-1})$$
Generalizing to any layer $l$:

$$\delta_i^l = \Big[\sum_{k=1}^{S_{l+1}} \delta_k^{l+1} W_{ki}^l\Big] \cdot f'(z_i^l)$$
$$\frac{\partial J(W,b)}{\partial W_{ij}^l} = \delta_i^{l+1} \cdot a_j^l$$
$$\frac{\partial J(W,b)}{\partial b_i^l} = \delta_i^{l+1}$$
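These equations are everything the implementation below needs. In vectorized form they read as follows (a sketch under assumed conventions: rows of a0 are samples, W0/W1 have shape (fan_in, fan_out), and f and f_prime stand for the activation and its derivative):

# forward pass
z1 = a0 @ W0 + b0                         # hidden-layer input
a1 = f(z1)                                # hidden-layer output
z2 = a1 @ W1 + b1                         # output-layer input
a2 = f(z2)                                # prediction

# backward pass
delta2 = (a2 - y) * f_prime(z2)           # last-layer residual
delta1 = (delta2 @ W1.T) * f_prime(z1)    # hidden-layer residual

# gradients (summed over the batch)
dW1 = a1.T @ delta2
db1 = delta2.sum(axis=0)
dW0 = a0.T @ delta1
db0 = delta1.sum(axis=0)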

Python Example

# Code adapted from the book "Python机器学习算法" (Python Machine Learning Algorithms)
import numpy as np
from math import sqrt

def bp_train(feature, label, n_hidden, maxCycle, alpha, n_output):
    '''Train a three-layer BP network
    input:  feature(mat): features
            label(mat): labels
            n_hidden(int): number of hidden-layer nodes
            maxCycle(int): maximum number of iterations
            alpha(float): learning rate
            n_output(int): number of output-layer nodes
    output: w0(mat): weights between input and hidden layers
            b0(mat): biases between input and hidden layers
            w1(mat): weights between hidden and output layers
            b1(mat): biases between hidden and output layers
    '''
    m, n = np.shape(feature)
    # 1. Initialization: uniform in [-4*sqrt(6)/sqrt(fan_in+fan_out), +4*sqrt(6)/sqrt(fan_in+fan_out))
    w0 = np.mat(np.random.rand(n, n_hidden))
    w0 = w0 * (8.0 * sqrt(6) / sqrt(n + n_hidden)) - \
        np.mat(np.ones((n, n_hidden))) * (4.0 * sqrt(6) / sqrt(n + n_hidden))
    b0 = np.mat(np.random.rand(1, n_hidden))
    b0 = b0 * (8.0 * sqrt(6) / sqrt(n + n_hidden)) - \
        np.mat(np.ones((1, n_hidden))) * (4.0 * sqrt(6) / sqrt(n + n_hidden))
    w1 = np.mat(np.random.rand(n_hidden, n_output))
    w1 = w1 * (8.0 * sqrt(6) / sqrt(n_hidden + n_output)) - \
        np.mat(np.ones((n_hidden, n_output))) * (4.0 * sqrt(6) / sqrt(n_hidden + n_output))
    b1 = np.mat(np.random.rand(1, n_output))
    b1 = b1 * (8.0 * sqrt(6) / sqrt(n_hidden + n_output)) - \
        np.mat(np.ones((1, n_output))) * (4.0 * sqrt(6) / sqrt(n_hidden + n_output))

    # 2. Training
    i = 0
    while i <= maxCycle:
        # 2.1 Forward propagation
        # 2.1.1 Hidden-layer input
        hidden_input = hidden_in(feature, w0, b0)  # m x n_hidden
        # 2.1.2 Hidden-layer output
        hidden_output = hidden_out(hidden_input)
        # 2.1.3 Output-layer input
        output_in = predict_in(hidden_output, w1, b1)  # m x n_output
        # 2.1.4 Output-layer output
        output_out = predict_out(output_in)

        # 2.2 Backpropagation of the error
        # 2.2.1 Residual between hidden and output layers
        delta_output = -np.multiply((label - output_out), partial_sig(output_in))
        # 2.2.2 Residual between input and hidden layers
        delta_hidden = np.multiply((delta_output * w1.T), partial_sig(hidden_input))

        # 2.3 Update the weights and biases
        w1 = w1 - alpha * (hidden_output.T * delta_output)
        b1 = b1 - alpha * np.sum(delta_output, axis=0) * (1.0 / m)
        w0 = w0 - alpha * (feature.T * delta_hidden)
        b0 = b0 - alpha * np.sum(delta_hidden, axis=0) * (1.0 / m)
        if i % 100 == 0:
            print("\t-------- iter:", i, ", cost:",
                  (1.0 / 2) * get_cost(get_predict(feature, w0, w1, b0, b1) - label))
        i += 1
    return w0, w1, b0, b1
(1) Parameter initialization
  • Number of sample features: n=2

  • Number of hidden-layer nodes: n_hidden=20

  • Number of output-layer nodes (number of classes): n_output=2

  • Together these form a 2×20×2 three-layer network structure

  • Input-to-hidden weights w0 = np.mat(np.random.rand(n, n_hidden)), i.e. 2×20

  • Input-to-hidden biases b0 = np.mat(np.random.rand(1, n_hidden)), i.e. 1×20

  • Hidden-to-output weights w1 = np.mat(np.random.rand(n_hidden, n_output)), i.e. 20×2

  • Hidden-to-output biases b1 = np.mat(np.random.rand(1, n_output)), i.e. 1×2

A more principled random scheme can be used to initialize w0, b0, w1, b1, as sketched below.
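For example, the scaling applied in the code maps np.random.rand values from [0, 1) onto the symmetric interval $[-4\sqrt{6}/\sqrt{n_{in}+n_{out}},\ 4\sqrt{6}/\sqrt{n_{in}+n_{out}})$, a sigmoid-adjusted variant of Xavier initialization. An equivalent, more direct sketch (xavier_uniform is an illustrative helper, not from the book):

import numpy as np
from math import sqrt

def xavier_uniform(n_in, n_out, gain=4.0):
    # uniform samples in [-gain*sqrt(6)/sqrt(n_in+n_out), +gain*sqrt(6)/sqrt(n_in+n_out))
    r = gain * sqrt(6.0) / sqrt(n_in + n_out)
    return np.mat(np.random.uniform(-r, r, (n_in, n_out)))

w0 = xavier_uniform(n, n_hidden)         # 2x20
b0 = xavier_uniform(1, n_hidden)         # 1x20
w1 = xavier_uniform(n_hidden, n_output)  # 20x2
b1 = xavier_uniform(1, n_output)         # 1x2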

(2) Forward propagation
# 2.1.1 Hidden-layer input
hidden_input = hidden_in(feature, w0, b0)  # m x n_hidden
# 2.1.2 Hidden-layer output
hidden_output = hidden_out(hidden_input)
# 2.1.3 Output-layer input
output_in = predict_in(hidden_output, w1, b1)  # m x n_output
# 2.1.4 Output-layer output
output_out = predict_out(output_in)

The hidden_in method computes the hidden-layer input, corresponding to the formula:

$$z_i^l = \sum_{k=1}^{S_{l-1}} W_{ik}^{l-1} a_k^{l-1} + b_i^{l-1}$$

def hidden_in(feature, w0, b0):
    '''Compute the hidden-layer input
    input:  feature(mat): features
            w0(mat): weights between input and hidden layers
            b0(mat): biases between input and hidden layers
    output: hidden_in(mat): hidden-layer input
    '''
    m = np.shape(feature)[0]
    hidden_in = feature * w0
    for i in range(m):
        hidden_in[i, ] += b0  # add the bias row to every sample
    return hidden_in

The hidden_out method computes the hidden-layer output, corresponding to the formula:

$$a_i^l = f(z_i^l)$$

def hidden_out(hidden_in):
    '''Compute the hidden-layer output
    input:  hidden_in(mat): hidden-layer input
    output: hidden_output(mat): hidden-layer output
    '''
    hidden_output = sig(hidden_in)
    return hidden_output

The predict_in method is identical to hidden_in, and predict_out is identical to hidden_out.
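The helpers sig (the sigmoid itself) and get_cost (used in the progress printout of bp_train) are not shown in the excerpt; below are minimal definitions consistent with how the code calls them (assumptions, not the book's originals):

def sig(x):
    '''Sigmoid activation, applied elementwise.'''
    return 1.0 / (1.0 + np.exp(-x))

def get_cost(err):
    '''Assumed cost helper: mean over samples of the summed squared error.'''
    m, n = np.shape(err)
    cost_sum = 0.0
    for i in range(m):
        for j in range(n):
            cost_sum += err[i, j] * err[i, j]
    return cost_sum / m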

(3) Backpropagation
# 2.2.1 Residual between hidden and output layers
delta_output = -np.multiply((label - output_out), partial_sig(output_in))
# 2.2.2 Residual between input and hidden layers
delta_hidden = np.multiply((delta_output * w1.T), partial_sig(hidden_input))

The partial_sig method computes the sigmoid derivative at the input. delta_output corresponds to the last-layer residual formula (note that -(label - output_out) equals $f(z_i^{n_l}) - y_i$):

$$\delta_i^{n_l} = \frac{\partial J(W,b)}{\partial z_i^{n_l}} = \big(f(z_i^{n_l})-y_i\big)\, f'(z_i^{n_l})$$

delta_hidden corresponds to the general residual formula:

$$\delta_i^l = \Big[\sum_{k=1}^{S_{l+1}} \delta_k^{l+1} W_{ki}^l\Big] \cdot f'(z_i^l)$$

def partial_sig(x):
    '''Value of the sigmoid derivative
    input:  x(mat): the argument matrix
    output: out(mat): elementwise sigmoid derivative
    '''
    m, n = np.shape(x)
    out = np.mat(np.zeros((m, n)))
    for i in range(m):
        for j in range(n):
            out[i, j] = sig(x[i, j]) * (1 - sig(x[i, j]))
    return out
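Since sig already operates elementwise, the double loop can be replaced by one vectorized expression with identical behavior (a simpler equivalent sketch):

def partial_sig(x):
    '''Sigmoid derivative, vectorized: f'(x) = f(x) * (1 - f(x)).'''
    s = sig(x)
    return np.multiply(s, 1 - s)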
(4) Update the weights and biases
w1 = w1 - alpha * (hidden_output.T * delta_output)
b1 = b1 - alpha * np.sum(delta_output, axis=0) * (1.0 / m)
w0 = w0 - alpha * (feature.T * delta_hidden)
b0 = b0 - alpha * np.sum(delta_hidden, axis=0) * (1.0 / m)

These correspond to the formulas:

$$\frac{\partial J(W,b)}{\partial W_{ij}^l} = \delta_i^{l+1} \cdot a_j^l$$
$$\frac{\partial J(W,b)}{\partial b_i^l} = \delta_i^{l+1}$$

where the matrix products accumulate the per-sample gradients over the batch, and the bias gradients are additionally averaged by the factor 1/m.

(5) Prediction
def get_predict(feature, w0, w1, b0, b1):
    '''Compute the final prediction
    input:  feature(mat): features
            w0(mat): weights between input and hidden layers
            b0(mat): biases between input and hidden layers
            w1(mat): weights between hidden and output layers
            b1(mat): biases between hidden and output layers
    output: the predicted values
    '''
    return predict_out(predict_in(hidden_out(hidden_in(feature, w0, b0)), w1, b1))

Given the trained parameters, compute each output-layer neuron's value for a test sample and take the neuron with the largest output as the predicted class.
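A minimal end-to-end usage sketch (the toy features and one-hot labels are made up for illustration):

import numpy as np

# toy dataset: 4 samples, 2 features, 2 classes (XOR-style one-hot labels)
feature = np.mat([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
label = np.mat([[1, 0], [0, 1], [0, 1], [1, 0]])

w0, w1, b0, b1 = bp_train(feature, label, n_hidden=20,
                          maxCycle=1000, alpha=0.1, n_output=2)

pred = get_predict(feature, w0, w1, b0, b1)
labels_pred = np.argmax(pred, axis=1)   # pick the neuron with the largest output
print(labels_pred)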
