Stanford CS231n Course Notes: Assignment 2 FullyConnectedNets

Table of Contents

  • Assignment Goal
  • Layer Implementations
  • Optimization Methods
  • Inline Questions
  • References

I. Assignment Goal

In the earlier Two-layer neural network assignment, the loss function and backpropagation were implemented inside a single function, with no modularity, which does not scale to more complex architectures. The goal of this assignment is therefore to break each piece of functionality into its own module so that deeper networks can be assembled easily.

II. Layer Implementations

1. Affine layer (layers.py)

1.1 affine_forward

Inputs:
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
    - w: A numpy array of weights, of shape (D, M)
    - b: A numpy array of biases, of shape (M,)

 Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)

def affine_forward(x, w, b):
    out = None
    out = np.dot(x.reshape((x.shape[0], -1)), w) + b
    cache = (x, w, b)
    return out, cache

The forward pass is simple: out = x · w + b, with shapes (N, M) = (N, D) · (D, M) + (M,). The input x of shape (N, d_1, ..., d_k) is first reshaped to (N, D), and adding b relies on NumPy broadcasting.
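A quick shape check of affine_forward (a minimal sketch; the input shape and values are arbitrary):

import numpy as np

# Hypothetical example: N = 2 inputs of shape (4, 5, 6), so D = 4 * 5 * 6 = 120, and M = 3 outputs.
x = np.random.randn(2, 4, 5, 6)
w = np.random.randn(120, 3)
b = np.random.randn(3)

out, cache = affine_forward(x, w, b)
print(out.shape)  # (2, 3), i.e. (N, M)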

1.2 affine_backward

Inputs:
    - dout: Upstream derivative, of shape (N, M)
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)
      - b: Biases, of shape (M,)

    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (M,)

def affine_backward(dout, cache):
    x, w, b = cache
    dx, dw, db = None, None, None
    dw = np.dot(x.reshape((x.shape[0], -1)).T, dout)
    db = dout.sum(axis=0)
    dx = np.dot(dout, w.T)
    dx = dx.reshape(x.shape)
    return dx, dw, db

For the backward pass the main thing to get right is the shapes (x is again viewed as an (N, D) matrix):

dw = x.T · dout        (D, M) = (D, N) · (N, M)
db = sum of dout over axis 0 (the batch dimension)        (M,)
dx = dout · w.T, then reshaped back to x.shape        (N, D) = (N, M) · (M, D)

A finite-difference sanity check is sketched below.
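Below is a minimal finite-difference sanity check for affine_backward. The num_grad helper is a simplified stand-in for the course's eval_numerical_gradient_array; the shapes and the random seed are arbitrary.

import numpy as np

def num_grad(f, x, df, h=1e-5):
    # Central-difference estimate of d(sum(f(x) * df)) / dx, element by element.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        pos = f(x)
        x[ix] = old - h
        neg = f(x)
        x[ix] = old
        grad[ix] = np.sum((pos - neg) * df) / (2 * h)
        it.iternext()
    return grad

np.random.seed(0)
x = np.random.randn(3, 4)
w = np.random.randn(4, 5)
b = np.random.randn(5)
dout = np.random.randn(3, 5)

_, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)

dx_num = num_grad(lambda x: affine_forward(x, w, b)[0], x, dout)
dw_num = num_grad(lambda w: affine_forward(x, w, b)[0], w, dout)
db_num = num_grad(lambda b: affine_forward(x, w, b)[0], b, dout)
print(np.max(np.abs(dx - dx_num)), np.max(np.abs(dw - dw_num)), np.max(np.abs(db - db_num)))  # all should be tiny (~1e-9 or less)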

2. ReLU layer (layers.py)

2.1 relu_forward 

Input:
    - x: Inputs, of any shape

 Returns a tuple of:
    - out: Output, of the same shape as x
    - cache: x

def relu_forward(x):
    out = x * (x > 0)
    cache = x
    return out, cache

(x > 0) is a boolean mask: elements where x > 0 keep their value, and everything else becomes 0.

2.2 relu_backward

Input:
    - dout: Upstream derivatives, of any shape
    - cache: Input x, of same shape as dout

Returns:
    - dx: Gradient with respect to x

def relu_backward(dout, cache):
    dx, x = None, cache
    dx = dout * (x > 0)
    return dx

Likewise, only the positions where the input was greater than 0 receive the upstream gradient; everywhere else the gradient is 0.
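A tiny demo of relu_forward and relu_backward together (the numbers are arbitrary):

import numpy as np

x = np.array([[-2.0, 0.5], [3.0, -0.1]])
dout = np.array([[10.0, 20.0], [30.0, 40.0]])

out, cache = relu_forward(x)     # [[0.0, 0.5], [3.0, 0.0]]
dx = relu_backward(dout, cache)  # [[0.0, 20.0], [30.0, 0.0]] -- gradient flows only where x > 0
print(out)
print(dx)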

3. Loss layers: Softmax and SVM (layers.py)

def svm_loss(x, y):
    """
    Computes the loss and gradient for multiclass SVM classification.

    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
      class for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    N = x.shape[0]
    correct_class_scores = x[np.arange(N), y]
    margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
    margins[np.arange(N), y] = 0
    loss = np.sum(margins) / N
    num_pos = np.sum(margins > 0, axis=1)
    dx = np.zeros_like(x)
    dx[margins > 0] = 1
    dx[np.arange(N), y] -= num_pos
    dx /= N
    return loss, dx

def softmax_loss(x, y):
    """
    Computes the loss and gradient for softmax classification.

    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
      class for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    shifted_logits = x - np.max(x, axis=1, keepdims=True)
    Z = np.sum(np.exp(shifted_logits), axis=1, keepdims=True)
    log_probs = shifted_logits - np.log(Z)
    probs = np.exp(log_probs)
    N = x.shape[0]
    loss = -np.sum(log_probs[np.arange(N), y]) / N
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N
    return loss, dx
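A quick sanity check on both losses with small random scores (N = 100 and C = 10 chosen arbitrarily; the notebook performs a similar check): the SVM loss should come out near C - 1 = 9 and the softmax loss near log(10) ≈ 2.3.

import numpy as np

np.random.seed(231)
num_inputs, num_classes = 100, 10
x = 0.001 * np.random.randn(num_inputs, num_classes)
y = np.random.randint(num_classes, size=num_inputs)

svm_l, _ = svm_loss(x, y)          # ~9, since almost every margin is close to 1
softmax_l, _ = softmax_loss(x, y)  # ~2.3, since the predicted distribution is near uniform
print(svm_l, softmax_l)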

4. Two-layer network (fc_net.py)

See the earlier write-up of the Two-layer neural network assignment.

5. Solver

In the previous assignment, the logic for training models was coupled to the models themselves. Following a more modular design, for this assignment we have split the logic for training models into a separate class.
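For instance, a minimal sketch of driving the Solver (the model, the hyperparameter values, and the keys of the data dictionary are illustrative, not prescriptive):

model = TwoLayerNet(hidden_dim=100, reg=1e-2)
solver = Solver(model, data,                     # data: dict with 'X_train', 'y_train', 'X_val', 'y_val'
                update_rule='sgd',
                optim_config={'learning_rate': 1e-3},
                lr_decay=0.95,
                num_epochs=10,
                batch_size=100,
                print_every=100)
solver.train()
print('Best validation accuracy:', solver.best_val_acc)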

Some plotting code for the training curves:

plt.subplot(2, 1, 1)
plt.title('Training loss')
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.plot([0.5] * len(solver.val_acc_history), 'k--')
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.gcf().set_size_inches(15, 12)    
plt.show()

Layout adjustments

When arranging subplots, two issues commonly come up:

Overlapping axes: use tight_layout() to control the spacing, where pad is the padding around the whole figure, w_pad is the horizontal spacing between subplots, and h_pad is the vertical spacing between subplots.
A cramped figure: use set_size_inches() to set the exact width and height of the figure in inches.

For example:

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
fig = plt.gcf()
fig.set_size_inches(10, 8)

 

6. Multilayer network

Implement a fully-connected network with an arbitrary number of hidden layers.

class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch/layer normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch/layer norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch/layer normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3 * 32 * 32, num_classes=10,
                 dropout=1, normalization=None, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=1 then
          the network should not use dropout at all.
        - normalization: What type of normalization the network should use. Valid values
          are "batchnorm", "layernorm", or None for no normalization (the default).
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model.
        """
        self.normalization = normalization
        self.use_dropout = dropout != 1
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution centered at 0 with standard       #
        # deviation equal to weight_scale. Biases should be initialized to zero.   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to ones and shift     #
        # parameters should be initialized to zeros.                               #
        ############################################################################
        for i in range(self.num_layers - 1):
            if i == 0:
                self.params['W%s' % (i + 1)] = np.random.normal(0, weight_scale, size=(input_dim, hidden_dims[i]))
                self.params['b%s' % (i + 1)] = np.zeros(shape=(hidden_dims[i]))
                if self.normalization is not None:
                    self.params['gamma%s' % (i + 1)] = np.ones(shape=(hidden_dims[i]))
                    self.params['beta%s' % (i + 1)] = np.zeros(shape=(hidden_dims[i]))
            else:
                self.params['W%s' % (i + 1)] = np.random.normal(0, weight_scale,
                                                                size=(hidden_dims[i - 1], hidden_dims[i]))
                self.params['b%s' % (i + 1)] = np.zeros(shape=(hidden_dims[i]))
                if self.normalization is not None:
                    self.params['gamma%s' % (i + 1)] = np.ones(shape=(hidden_dims[i]))
                    self.params['beta%s' % (i + 1)] = np.zeros(shape=(hidden_dims[i]))
        self.params['W%s' % self.num_layers] = np.random.normal(0, weight_scale, size=(hidden_dims[-1], num_classes))
        self.params['b%s' % self.num_layers] = np.zeros(shape=(num_classes))
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################
        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.normalization == 'batchnorm':
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]
        if self.normalization == 'layernorm':
            self.bn_params = [{} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.normalization == 'batchnorm':
            for bn_param in self.bn_params:
                bn_param['mode'] = mode
        scores = X
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        caches = list()
        for i in range(self.num_layers - 1):
            cache = list()
            scores, fc_cache = affine_forward(scores, self.params['W%s' % (i + 1)], self.params['b%s' % (i + 1)])
            cache.append(fc_cache)
            if self.normalization == 'batchnorm':
                scores, bn_cache = batchnorm_forward(scores, self.params['gamma%s' % (i + 1)],
                                                   self.params['beta%s' % (i + 1)],
                                                   self.bn_params[i])
                cache.append(bn_cache)
            elif self.normalization == 'layernorm':
                scores, ln_cache = layernorm_forward(scores, self.params['gamma%s' % (i + 1)],
                                                     self.params['beta%s' % (i + 1)],
                                                     self.bn_params[i])
                cache.append(ln_cache)
            scores, relu_cache = relu_forward(scores)
            cache.append(relu_cache)
            if self.use_dropout:
                scores, dropout_cache = dropout_forward(scores, self.dropout_param)
                cache.append(dropout_cache)
            caches.append(cache)
        scores, fc_cache = affine_forward(scores, self.params['W%s' % self.num_layers],
                                          self.params['b%s' % self.num_layers])
        caches.append(fc_cache)
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch/layer normalization, you don't need to regularize the scale   #
        # and shift parameters.                                                    #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss, dx = softmax_loss(scores, y)
        # Add the L2 regularization term to the loss
        for i in range(1, self.num_layers + 1):
            loss += 0.5 * self.reg * np.sum(self.params['W%s' % i] ** 2)

        for i in range(self.num_layers, 0, -1):
            if i == self.num_layers:
                dx, dw, db = affine_backward(dx, caches[i - 1])
                grads['W%s' % i] = dw + self.reg * self.params['W%s' % i]
                grads['b%s' % i] = db
            else:
                if self.use_dropout:
                    dx = dropout_backward(dx, caches[i - 1][-1])
                if self.normalization is not None:
                    dx = relu_backward(dx, caches[i - 1][2])
                    if self.normalization == 'batchnorm':
                        dx, dgamma, dbeta = batchnorm_backward_alt(dx, caches[i - 1][1])
                    elif self.normalization == 'layernorm':
                        dx, dgamma, dbeta = layernorm_backward(dx, caches[i - 1][1])
                    else:
                        raise ValueError("No such normalization")
                    grads['gamma%s' % i] = dgamma
                    grads['beta%s' % i] = dbeta
                else:
                    dx = relu_backward(dx, caches[i - 1][1])
                dx, dw, db = affine_backward(dx, caches[i - 1][0])
                grads['W%s' % i] = dw + self.reg * self.params['W%s' % i]
                grads['b%s' % i] = db

        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
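Once the class is filled in, it can be gradient-checked on small random data, much like the notebook does. The sketch below assumes the course helper cs231n.gradient_check.eval_numerical_gradient is importable; the layer sizes and weight scale are arbitrary.

import numpy as np
from cs231n.gradient_check import eval_numerical_gradient

np.random.seed(231)
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))

model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,
                          reg=0.0, weight_scale=5e-2, dtype=np.float64)
loss, grads = model.loss(X, y)
for name in sorted(grads):
    f = lambda _: model.loss(X, y)[0]
    grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)
    rel_err = np.max(np.abs(grad_num - grads[name]) /
                     np.maximum(1e-8, np.abs(grad_num) + np.abs(grads[name])))
    print('%s relative error: %.2e' % (name, rel_err))  # expect ~1e-6 or smaller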

III. Optimization Methods

Summary of optimization methods (see [1], [2], and [4]): the per-parameter update rules for this part of the assignment live in optim.py.
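For reference, here is a condensed sketch of two of those update rules, written against the update_rule(w, dw, config) interface that the Solver expects; treat the default hyperparameter values as illustrative.

import numpy as np

def sgd_momentum(w, dw, config=None):
    # Momentum SGD: accumulate a velocity and move the weights along it.
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))
    v = config['momentum'] * v - config['learning_rate'] * dw
    config['velocity'] = v
    return w + v, config

def adam(w, dw, config=None):
    # Adam: per-parameter step from bias-corrected first and second moment estimates.
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(w))
    config.setdefault('v', np.zeros_like(w))
    config.setdefault('t', 0)
    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dw ** 2
    m_hat = config['m'] / (1 - config['beta1'] ** config['t'])
    v_hat = config['v'] / (1 - config['beta2'] ** config['t'])
    next_w = w - config['learning_rate'] * m_hat / (np.sqrt(v_hat) + config['epsilon'])
    return next_w, config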

IV. Inline Questions

Inline Question 1:
We've only asked you to implement ReLU, but there are a number of different activation functions that one could use in neural networks, each with its pros and cons. In particular, an issue commonly seen with activation functions is getting zero (or close to zero) gradient flow during backpropagation. Which of the following activation functions have this problem? If you consider these functions in the one dimensional case, what types of input would lead to this behaviour?

Sigmoid
ReLU
Leaky ReLU

Answer:
All three activations can produce zero or near-zero gradients. For the sigmoid, inputs with a large magnitude (very positive or very negative) saturate the function, so the gradient is close to 0. For ReLU, any negative input gives a gradient of exactly 0. For Leaky ReLU, a negative input does not kill the gradient entirely, but it is scaled down by a small slope.
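A small one-dimensional illustration of the local gradients of the three activations (the input values and the Leaky ReLU slope alpha = 0.01 are arbitrary):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-20.0, -1.0, 0.0, 1.0, 20.0])
alpha = 0.01

s = sigmoid(x)
print(s * (1 - s))                  # sigmoid gradient: ~0 for large |x| (saturation at both ends)
print((x > 0).astype(float))        # ReLU gradient: exactly 0 for x <= 0 ("dead" region)
print(np.where(x > 0, 1.0, alpha))  # Leaky ReLU gradient: small but nonzero (alpha) for x <= 0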

 

Inline Question 2: 
Did you notice anything about the comparative difficulty of training the three-layer net vs training the five layer net? In particular, based on your experience, which network seemed more sensitive to the initialization scale? Why do you think that is the case?

Answer:
Both the three-layer and the five-layer network can fit the small training set to 100% accuracy, but the three-layer network generalizes better on the validation set. In particular, the five-layer network is noticeably more sensitive to the weight initialization scale: with more layers it has more parameters and a more complicated loss surface, so a poor initialization scale makes it much easier for training to get stuck near a bad local minimum.

 

Inline Question 3:
AdaGrad, like Adam, is a per-parameter optimization method that uses the following update rule:

cache += dw**2
w += - learning_rate * dw / (np.sqrt(cache) + eps)
John notices that when he was training a network with AdaGrad that the updates became very small, and that his network was learning slowly. Using your knowledge of the AdaGrad update rule, why do you think the updates would become very small? Would Adam have the same issue?

Answer:
Most likely dw was large during the early iterations, so cache accumulated a large value quickly; since cache only ever grows, the effective step size $\frac{\text{learning\_rate}}{\sqrt{\text{cache}} + \varepsilon}$ keeps shrinking, and convergence becomes very slow. Adam does not have this problem: it keeps exponentially decaying averages of the gradient and of its square instead of an ever-growing sum, and the bias correction $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$ (and likewise for the second moment) boosts the estimates when $t$ is small, so the updates stay at a reasonable scale.
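A toy simulation on a single scalar weight with a constant gradient makes the difference concrete (the learning rate and the AdaGrad/Adam hyperparameters are arbitrary): the AdaGrad step shrinks like 1/sqrt(t), while the Adam step stays roughly constant.

import numpy as np

learning_rate, eps = 0.1, 1e-8
beta1, beta2 = 0.9, 0.999
dw = 1.0  # pretend the gradient is always 1
cache = m = v = 0.0

for t in range(1, 1001):
    # AdaGrad: the cache only ever grows, so the effective step keeps shrinking.
    cache += dw ** 2
    adagrad_step = learning_rate * dw / (np.sqrt(cache) + eps)
    # Adam: decaying averages plus bias correction keep the step roughly constant.
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw ** 2
    adam_step = learning_rate * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
    if t in (1, 10, 100, 1000):
        print(t, adagrad_step, adam_step)  # AdaGrad: 0.1 -> ~0.003; Adam: stays ~0.1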

 

References

[1] Tijmen Tieleman and Geoffrey Hinton. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning 4 (2012).

[2] Diederik Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization." ICLR 2015.

[3] https://juejin.im/post/5ad41530f265da2386705937  Matplotlib tips

[4] https://cs231n.github.io/neural-networks-3/#update  Parameter updates
