Optimizing Neural Networks: L2 Regularization

I have recently been studying neural networks, and I hope that writing up what I learn as blog posts will help me consolidate it and keep myself motivated.

Regularization

A major part of optimizing a neural network is avoiding overfitting and improving the model's ability to generalize. Common techniques include L1 and L2 regularization, dropout, and careful weight initialization. Among these, regularization is the one used most often in deep learning: a penalty term is added to the cost computed in the forward pass, and during backpropagation it acts to penalize (shrink) the weights.
The idea behind L2 regularization is actually very simple:

$$J = J + \frac{\lambda}{2m}\sum_{w}\left\|w\right\|_{2}^{2}$$

Here λ is a hyperparameter (a non-negative scalar, usually small), m is the number of examples in the input batch, and w is the weight matrix of each layer of the deep network being trained. The operation applied to each weight matrix is the squared Frobenius norm (sometimes loosely called the matrix L2 norm), i.e. every element is squared and the results are summed.
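As a sanity check, the penalty term really is just the sum of squared entries of every weight matrix, scaled by λ/(2m). A minimal NumPy sketch; the weight shapes, λ, and m below are made up purely for illustration:

import numpy as np

np.random.seed(0)
W1 = np.random.randn(4, 3)   # made-up weight matrix of layer 1
W2 = np.random.randn(1, 4)   # made-up weight matrix of layer 2
lambd = 0.1                  # regularization hyperparameter lambda
m = 32                       # number of examples in the batch

# squared Frobenius norm of each matrix = sum of the squares of all its elements
l2_penalty = (lambd / (2 * m)) * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
print(l2_penalty)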
After the loss function is modified in this way, the derivatives used in backpropagation change as well. Differentiating the penalty term with respect to w gives:
$$\frac{d}{dW}\left(\frac{1}{2}\frac{\lambda}{m}\left\|W\right\|^{2}\right) = \frac{\lambda}{m}W$$

The weights of the corresponding layer are then updated as:
$$W = W - \text{learning\_rate}\cdot\frac{dJ}{dW} - \text{learning\_rate}\cdot\frac{\lambda}{m}W$$

which is the same as:
$$W = \left(1 - \text{learning\_rate}\cdot\frac{\lambda}{m}\right)W - \text{learning\_rate}\cdot\frac{dJ}{dW}$$

In this way, w receives an extra penalty at every update.
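In code, the update for a single layer looks like the sketch below; dJ_dW stands for the gradient of the unregularized cost, and all values are made up for illustration. The assert confirms that the two forms of the update above are identical:

import numpy as np

np.random.seed(0)
learning_rate = 0.01
lambd = 0.7
m = 64

W = np.random.randn(5, 4)        # illustrative weight matrix of one layer
dJ_dW = np.random.randn(5, 4)    # gradient of the unregularized cost (made up here)

# form 1: subtract the gradient of the penalty term separately
W_form1 = W - learning_rate * dJ_dW - learning_rate * (lambd / m) * W

# form 2: equivalently, shrink W first, then take the usual gradient step
W_form2 = (1 - learning_rate * lambd / m) * W - learning_rate * dJ_dW

assert np.allclose(W_form1, W_form2)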

Intuition

As for why regularization helps a neural network avoid overfitting, my understanding is this: as a network gets deeper and has more and more layers, it becomes very easy to overfit, giving low bias but high variance (that is, good results on the training set but poor results on the test set), whereas a network with few layers is much less likely to overfit. We also know that if (nearly) all of a network's weights are zero, it can be shown that the network behaves like a much shallower one, essentially a simple linear function that cannot fit the data tightly. So by placing an extra penalty on w that pushes the weights toward 0, we make the network effectively simpler and thereby prevent overfitting.

Implementation

The following is the corresponding implementation from the deeplearning.ai course exercises. The cost function changes from the standard cross-entropy cost:

$$J = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log\left(a^{[L](i)}\right) + \left(1-y^{(i)}\right)\log\left(1-a^{[L](i)}\right)\right) \tag{1}$$

to the L2-regularized cost:
$$J_{regularized} = \underbrace{-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log\left(a^{[L](i)}\right) + \left(1-y^{(i)}\right)\log\left(1-a^{[L](i)}\right)\right)}_{\text{cross-entropy cost}} + \underbrace{\frac{1}{m}\frac{\lambda}{2}\sum_{l}\sum_{k}\sum_{j}\left(W_{k,j}^{[l]}\right)^{2}}_{\text{L2 regularization cost}} \tag{2}$$

import numpy as np

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    """
    Implement the cost function with L2 regularization. See formula (2) above.

    Arguments:
    A3 -- post-activation, output of forward propagation, of shape (output size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    parameters -- python dictionary containing parameters of the model

    Returns:
    cost - value of the regularized loss function (formula (2))
    """
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost

    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = lambd*(np.sum(np.square(W1))+ np.sum(np.square(W2))+ np.sum(np.square(W3)))/(2*m)
    ### END CODE HERE ###

    cost = cross_entropy_cost + L2_regularization_cost

    return cost
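A quick, hedged usage sketch: the shapes and values below are made up, and compute_cost is stubbed with the plain cross-entropy formula so the snippet runs on its own (in the actual course notebook this helper is provided for you). A larger lambd should give a larger regularized cost on the same data:

import numpy as np

def compute_cost(A3, Y):
    # stand-in for the notebook's helper: plain cross-entropy cost
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A3) + (1 - Y) * np.log(1 - A3)) / m)

np.random.seed(1)
Y = (np.random.rand(1, 5) > 0.5).astype(float)        # 5 made-up binary labels
A3 = np.clip(np.random.rand(1, 5), 1e-8, 1 - 1e-8)    # made-up network outputs
parameters = {"W1": np.random.randn(4, 3), "b1": np.zeros((4, 1)),
              "W2": np.random.randn(3, 4), "b2": np.zeros((3, 1)),
              "W3": np.random.randn(1, 3), "b3": np.zeros((1, 1))}

print(compute_cost_with_regularization(A3, Y, parameters, lambd=0.1))
print(compute_cost_with_regularization(A3, Y, parameters, lambd=0.7))

In the backward pass, the only change is that the derivative of the penalty term, (λ/m)·W, is added to each dW: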
def backward_propagation_with_regularization(X, Y, cache, lambd):
    """
    Implements the backward propagation of our baseline model to which we added an L2 regularization.

    Arguments:
    X -- input dataset, of shape (input size, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation()
    lambd -- regularization hyperparameter, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """

    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y

    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + lambd*W3/m
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))   # all layers except the last use ReLU, whose gradient passes through only where the activation is > 0
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + lambd*W2/m
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + lambd*W1/m
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}

    return gradients
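Since the (λ/m)·W term is already folded into dW1, dW2 and dW3 above, the parameter update itself stays a plain gradient-descent step. Below is a minimal sketch of that loop, assuming the W1/b1 ... W3/b3 naming used in the code above (the course notebook provides its own update_parameters; this is only an illustration):

def update_parameters_sketch(parameters, gradients, learning_rate):
    """Plain gradient-descent step; the L2 penalty is already inside the gradients."""
    L = len(parameters) // 2   # each layer contributes one W and one b
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= learning_rate * gradients["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * gradients["db" + str(l)]
    return parameters

That is all it takes: a penalty term in the cost, and a matching (λ/m)·W term in each dW.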
