cs231n Notes: Two-Layer Neural Network and Backpropagation (3)

Preface

This post is written for my own study and notes. If you repost it, please include the source: https://www.jianshu.com/p/f9cb3665ff01

Neural Networks

The neural networks discussed here are artificial neural networks: models that imitate the behavior of biological neurons.

[Figure: cs231n lecture slide]

The cell body here performs the same linear transformation as in a linear classifier; what is added on top of the linear classifier is the activation function. On one hand, the activation function can be understood as the mechanism by which a biological neuron fires: the neuron is activated only when its input meets a certain condition. On the other hand, the activation function is what introduces a nonlinear transformation into the model.
Linear classifier: f = W x
Two-layer neural network: f = W2 max(0, W1 x), where the max function is one kind of activation function (ReLU).
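
As a quick illustration, here is a minimal NumPy sketch of this two-layer scoring function (the sizes D, H, C and the variable names are made up for the example; biases are omitted):

import numpy as np

D, H, C = 4, 10, 3                 # toy sizes: input dim, hidden dim, number of classes
x = np.random.randn(D)             # one input example
W1 = 0.01 * np.random.randn(D, H)  # first-layer weights
W2 = 0.01 * np.random.randn(H, C)  # second-layer weights

h = np.maximum(0, x.dot(W1))       # ReLU is the activation: max(0, W1 x)
scores = h.dot(W2)                 # class scores f = W2 max(0, W1 x)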

1. Activation Functions

[Figure: cs231n lecture slide]

There are many activation functions: sigmoid, tanh, ReLU, and so on. ReLU is the most commonly used today. Each of them has its own strengths and weaknesses, which are not covered one by one here.
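
For reference, a minimal NumPy sketch of the three activations mentioned above (the function names are my own):

import numpy as np

def sigmoid(x):
    # Squashes inputs into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs into (-1, 1); zero-centered.
    return np.tanh(x)

def relu(x):
    # Keeps positive inputs, zeroes out negative ones.
    return np.maximum(0, x)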

2. Network Architecture

[Figure: cs231n lecture slide]

A network like this is usually built from fully-connected layers: every neuron in one layer is connected to every neuron in the next layer, while neurons within the same layer are not connected.
The forward pass first applies a linear operation (multiply by the weights and add the bias), then an activation function, and repeats these two operations layer by layer. The last layer does not apply an activation function; it usually feeds into a softmax output.

import numpy as np


class TwoLayerNet(object):
    """
    A two-layer fully-connected neural network. The net has an input dimension of
    N, a hidden layer dimension of H, and performs classification over C classes.
    We train the network with a softmax loss function and L2 regularization on the
    weight matrices. The network uses a ReLU nonlinearity after the first fully
    connected layer.

    In other words, the network has the following architecture:

    input - fully connected layer - ReLU - fully connected layer - softmax

    The outputs of the second fully-connected layer are the scores for each class.
    """

    def __init__(self, input_size, hidden_size, output_size, std=1e-4):
        # Weights are initialized to small random values, biases to zeros.
        self.params = {}
        self.params['W1'] = std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def loss(self, X, y=None, reg=0.0):
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        N, D = X.shape

        # Forward pass: affine -> ReLU -> affine.
        hidden1_output = np.dot(X, W1) + b1
        hidden1_act = np.maximum(0, hidden1_output)
        scores = np.dot(hidden1_act, W2) + b2

        # If no labels are given, just return the class scores.
        if y is None:
            return scores

        # Softmax: shift scores for numerical stability, exponentiate, normalize.
        scores = scores - np.max(scores, axis=1, keepdims=True)
        p = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)

        # Average cross-entropy loss over the batch plus L2 regularization.
        loss = np.sum(-np.log(p[np.arange(N), y])) / N
        regular_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
        loss = loss + regular_loss
        # Backward pass: compute gradients with the chain rule.
        grads = {}

        # Gradient of the softmax loss with respect to the scores.
        p[np.arange(N), y] -= 1
        dscores = p / N

        # Second fully-connected layer.
        grads['W2'] = np.dot(hidden1_act.T, dscores)
        grads['b2'] = np.sum(dscores, axis=0)

        # Backpropagate into the hidden layer and through the ReLU.
        dhidden = np.dot(dscores, W2.T)
        dhidden[hidden1_act <= 0] = 0

        # First fully-connected layer.
        grads['W1'] = np.dot(X.T, dhidden)
        grads['b1'] = np.sum(dhidden, axis=0)

        # Add the gradient of the L2 regularization term.
        grads['W2'] += reg * W2
        grads['W1'] += reg * W1
        return loss, grads
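
A quick way to sanity-check the backward pass is to compare one analytic gradient against a numerical estimate on a tiny network. A minimal sketch (the toy sizes, data, and regularization value are made up for the check and are not part of the original post):

import numpy as np

net = TwoLayerNet(input_size=4, hidden_size=10, output_size=3, std=1e-1)
X = np.random.randn(5, 4)           # 5 toy examples
y = np.random.randint(3, size=5)    # toy labels

loss, grads = net.loss(X, y, reg=0.05)

# Numerically estimate dloss/dW1[0, 0] with a centered difference.
h = 1e-5
W1 = net.params['W1']               # reference, so in-place edits affect the net
W1[0, 0] += h
loss_plus, _ = net.loss(X, y, reg=0.05)
W1[0, 0] -= 2 * h
loss_minus, _ = net.loss(X, y, reg=0.05)
W1[0, 0] += h                       # restore the original value
num_grad = (loss_plus - loss_minus) / (2 * h)
print(num_grad, grads['W1'][0, 0])  # the two values should be very close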

3. Network Training

Training uses mini-batch gradient descent: each parameter update is computed from a small, randomly sampled subset of the training data. Two related variants exist. Batch gradient descent uses the entire training set for every update; it follows the exact gradient of the full training loss, but becomes very slow when the training set is large. Stochastic gradient descent uses a single sample per update; each step is very cheap, but the updates are noisy, individual update directions can be far from the true gradient, and accuracy can suffer as a result. Mini-batch gradient descent is the compromise between the two, as sketched below.
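
A minimal sketch of the difference (the helper name is my own): the only thing that changes between the three variants is how many samples are used per update.

import numpy as np

def sample_batch(X, y, batch_size):
    # batch_size == len(X):      batch gradient descent
    # batch_size == 1:           stochastic gradient descent
    # 1 < batch_size < len(X):   mini-batch gradient descent
    idx = np.random.choice(X.shape[0], batch_size)
    return X[idx], y[idx]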

def train(self, X, y, X_val, y_val,
              learning_rate=1e-3, learning_rate_decay=0.95,
              reg=5e-6, num_iters=100,
              batch_size=200, verbose=False):
        """
        Train this neural network using stochastic gradient descent.
        Inputs:
        - X: A numpy array of shape (N, D) giving training data.
        - y: A numpy array of shape (N,) giving training labels; y[i] = c means that
          X[i] has label c, where 0 <= c < C.
        - X_val: A numpy array of shape (N_val, D) giving validation data.
        - y_val: A numpy array of shape (N_val,) giving validation labels.
        - learning_rate: Scalar giving learning rate for optimization.
        - learning_rate_decay: Scalar giving factor used to decay the learning rate
          after each epoch.
        - reg: Scalar giving regularization strength.
        - num_iters: Number of steps to take when optimizing.
        - batch_size: Number of training examples to use per step.
        - verbose: boolean; if true print progress during optimization.
        """
        num_train = X.shape[0]
        iterations_per_epoch = max(num_train // batch_size, 1)

        # Use SGD to optimize the parameters in self.model
        loss_history = []
        train_acc_history = []
        val_acc_history = []

        for it in range(num_iters):
            # Sample a random minibatch of training data and labels.
            batch_inx = np.random.choice(num_train, batch_size)
            X_batch = X[batch_inx, :]
            y_batch = y[batch_inx]

            # Compute loss and gradients using the current minibatch.
            loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
            loss_history.append(loss)

            # Vanilla gradient descent update on every parameter.
            self.params['W1'] -= learning_rate * grads['W1']
            self.params['W2'] -= learning_rate * grads['W2']
            self.params['b1'] -= learning_rate * grads['b1']
            self.params['b2'] -= learning_rate * grads['b2']

            if verbose and it % 100 == 0:
                print('iteration %d / %d: loss %f' % (it, num_iters, loss))

            # Every epoch, check train and val accuracy and decay learning rate.
            if it % iterations_per_epoch == 0:
                # Check accuracy
                train_acc = (self.predict(X_batch) == y_batch).mean()
                val_acc = (self.predict(X_val) == y_val).mean()
                train_acc_history.append(train_acc)
                val_acc_history.append(val_acc)

                # Decay learning rate
                learning_rate *= learning_rate_decay

        return {
          'loss_history': loss_history,
          'train_acc_history': train_acc_history,
          'val_acc_history': val_acc_history,
        }
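
The train method above calls self.predict, which is not shown in this post. A minimal sketch consistent with the forward pass in loss (my own reconstruction, not the author's original code):

def predict(self, X):
    # Forward pass: affine -> ReLU -> affine, then take the arg-max class per example.
    hidden = np.maximum(0, np.dot(X, self.params['W1']) + self.params['b1'])
    scores = np.dot(hidden, self.params['W2']) + self.params['b2']
    return np.argmax(scores, axis=1)

With it in place, net.train(X, y, X_val, y_val, num_iters=1000, verbose=True) returns the loss and accuracy histories collected above.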

Note:
During training, mini-batch gradient descent is used to update the parameters. The gradient of the loss with respect to each parameter is computed by backpropagation, and gradient descent then applies those gradients to update the parameters.

4. Backpropagation (BP)

The BP algorithm adjusts the network parameters iteratively along the negative gradient direction to minimize the training-error objective. This overcomes the limitation that a single-layer perceptron cannot solve XOR and similar problems: XOR is not linearly separable, so a single-layer perceptron cannot represent it.
The core idea of BP is to propagate the output error backwards, layer by layer, through the hidden layers to the input layer. Concretely, this takes the form of a forward pass of the signal followed by a backward pass of the error. A BP neural network consists of an input layer, one or more hidden layers, and an output layer, each containing many individual neurons. Neurons in adjacent layers are fully connected, while neurons within the same layer are not connected to each other. The hidden-layer activations are typically sigmoid-like functions, while the input- and output-layer activations can be chosen to suit the application.
Forward pass of the signal: input layer -----> hidden layer(s) ----> output layer
Backward pass of the error: the error is represented in some form at each layer ----> the weights of each layer are corrected


Note:
With backpropagation, repeated iterations can find weights that fit the network well, making the model more accurate. However, propagating the error derivatives backwards also adds computational cost, and choosing an activation function whose derivative is cheap to compute reduces that cost; this is one reason ReLU is so well suited as an activation function.
The most important part of BP is the gradient computation, which uses the chain rule: for example, if the loss L depends on the scores s, and s depends on a weight matrix W, then dL/dW = (dL/ds) * (ds/dW).
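
As a tiny worked example of the chain rule (my own illustration, in the spirit of the cs231n notes; the numbers are made up): for f = (x + y) * z, the backward pass just multiplies local gradients.

x, y, z = -2.0, 5.0, -4.0
q = x + y            # forward: q = 3
f = q * z            # forward: f = -12

# Backward pass: chain rule, e.g. df/dx = (df/dq) * (dq/dx)
dfdq = z             # df/dq = z  -> -4
dfdz = q             # df/dz = q  ->  3
dfdx = dfdq * 1.0    # dq/dx = 1, so df/dx = -4
dfdy = dfdq * 1.0    # dq/dy = 1, so df/dy = -4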

5. Results

[Figure: training results]
