[Deep Learning] Notes 5: Regularizing Neural Networks

A note before we start: the first week of Andrew Ng's second course covers the following three topics:

1. Parameter initialization:
   initializing all parameters to zero;
   initializing the parameters randomly;
   initialization that suppresses abnormal gradients (He initialization; see the lectures on vanishing and exploding gradients).
2. Regularizing the model:
   using the L2 norm to regularize a binary classification model, trying to avoid overfitting;
   using dropout to thin out the network, again to avoid overfitting.
3. Gradient checking:
   running a gradient check on the model to detect whether the gradients used during gradient descent are badly off.

There is a lot of material, so I am splitting it into three posts. Below is the second part: regularizing the model.

Problem description

Suppose you are an AI expert and need to design a model that tells the goalkeeper of a football team where to kick the ball so that his own teammates have the best chance of winning it. At heart this is just binary classification: for each position, either our side gets to the ball first or the opponents do. Take a look at the figure:
Figure 1: Football field

Loading and plotting the dataset
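All the snippets in this post assume the usual imports plus the helper functions shipped with the course assignment. The module names below (reg_utils, testCases) follow the deeplearning.ai assignment files, so treat this block as an assumption about your local setup rather than something from the original notes:

# Assumed imports -- module and function names follow the deeplearning.ai assignment files
import numpy as np
import matplotlib.pyplot as plt

from reg_utils import (sigmoid, relu, initialize_parameters, compute_cost,
                       forward_propagation, backward_propagation,
                       update_parameters, predict, load_2D_dataset)
from testCases import (backward_propagation_with_regularization_test_case,
                       backward_propagation_with_dropout_test_case)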

Load the dataset and have a look at it:

train_X, train_Y, test_X, test_Y = load_2D_dataset()
Figure 2: Dataset

Each dot marks a position where the ball could land: blue means one of our players would win the ball, red means an opposing player would. Our task is to train a model that draws a boundary around the positions where our players can win the ball.
We will build the following three variants and compare them:

  • no regularization
  • L2 regularization
  • dropout

The model:

  • Regularization mode: set the lambd argument to a non-zero value. (We use "lambd" instead of "lambda" because "lambda" is a reserved keyword in Python.)
  • Dropout mode (randomly shutting down units): set keep_prob to a value smaller than 1.
def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):
   
    grads = {}
    costs = []                            # to keep track of the cost
    m = X.shape[1]                        # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]  # 3-layer net: input -> 20 -> 3 -> 1
    
    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)
        
        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)
            
        # Backward propagation.
        assert(lambd==0 or keep_prob==1)    # it is possible to use both L2 regularization and dropout, 
                                            # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)
        
        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)
        
        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)
    
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()
    
    return parameters

Let's first look at how the model trains without any regularization.

Without regularization

parameters = model(train_X, train_Y)
print ("On the training set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Output:

Cost after iteration 0: 0.6557412523481002
Cost after iteration 10000: 0.16329987525724213
Cost after iteration 20000: 0.13851642423245572

On the training set:
Accuracy: 0.9478672985781991
On the test set:
Accuracy: 0.915
Figure 3: Cost without regularization

The decision boundary looks like this:


Figure 4: Model without regularization

As the figure shows, the decision boundary clearly overfits the training data when no regularization is used. Next we try L2 regularization.

L2 regularization

The standard way to avoid overfitting is called L2 regularization. It consists of appropriately modifying the cost function: instead of the original cross-entropy cost (1),

J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} \tag{1}

we now minimize the regularized cost (2):

J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} \tag{2}
The sum of squares of a single weight matrix W^{[l]} is computed with

np.sum(np.square(Wl))

Note that in the forward pass this has to be done for W^{[1]}, W^{[2]} and W^{[3]}; the three sums are added together and multiplied by \frac{\lambda}{2m}. In the backward pass, the penalty contributes an extra term \frac{\lambda}{m} W^{[l]} to each gradient dW^{[l]}.
Let's now write the corresponding functions:

def compute_cost_with_regularization(A3, Y, parameters, lambd):

    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    
    cross_entropy_cost = compute_cost(A3, Y) # This gives you the cross-entropy part of the cost
    
    L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2))  + np.sum(np.square(W3))) / (2 * m)
   
    cost = cross_entropy_cost + L2_regularization_cost
    
    return cost
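For reference, compute_cost above is the plain cross-entropy cost of expression (1) and comes from the course helpers; a minimal sketch of what it computes (my own reconstruction under that assumption, not code from these notes) would be:

def compute_cost_sketch(a3, Y):
    # Cross-entropy cost of expression (1): -1/m * sum(y*log(a) + (1-y)*log(1-a))
    m = Y.shape[1]
    logprobs = np.multiply(-np.log(a3), Y) + np.multiply(-np.log(1 - a3), 1 - Y)
    return 1. / m * np.nansum(logprobs)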

Because we changed the cost function, backpropagation must change as well: all the gradients have to be computed with respect to this new cost.

def backward_propagation_with_regularization(X, Y, cache, lambd):
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    
    dW3 = 1./m * np.dot(dZ3, A2.T) + ((lambd * W3) / m)    # extra (lambd/m) * W3 term from the L2 penalty
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))                # ReLU backward: gradient flows only where A2 > 0
    dW2 = 1./m * np.dot(dZ2, A1.T) + ((lambd * W2) / m)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T) + ((lambd * W1) / m)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

Now plug these into the model and run it:

X_assess, Y_assess, cache = backward_propagation_with_regularization_test_case()

grads = backward_propagation_with_regularization(X_assess, Y_assess, cache, lambd = 0.7)

parameters = model(train_X, train_Y, lambd = 0.7)
print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Output:

Cost after iteration 0: 0.6974484493131264
Cost after iteration 10000: 0.2684918873282239
Cost after iteration 20000: 0.2680916337127301

On the train set:
Accuracy: 0.9383886255924171
On the test set:
Accuracy: 0.93
Figure 5: Cost with L2-regularization

The decision boundary:


Figure 6: Model with L2-regularization

1. The value of λ is a hyperparameter that you can tune on a dev set. L2 regularization makes the decision boundary smoother; if λ is too large it can "over-smooth" and leave the model with high bias.
2. L2 regularization penalizes the squared values of the weights in the cost function, driving all the weights toward smaller values: having large weights simply becomes too costly. Smaller weights give a smoother model, in which the output changes more slowly as the input changes.

L2 regularization affects:

  • the cost computation: a regularization term is added to the cost;
  • the backpropagation function: the gradients with respect to the weight matrices pick up an extra term from the penalty;
  • the weights end up smaller ("weight decay"): the weights are pushed toward smaller values, as the short check below illustrates.
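To see why this is called weight decay, note that with the extra (λ/m)·W term the gradient-descent update can be rewritten as first shrinking W by a factor (1 − αλ/m) and then taking the ordinary gradient step. A minimal numeric check of that identity (all names here are made up purely for the illustration):

# Weight-decay view of the L2 update: a tiny numeric check (illustrative only)
import numpy as np

np.random.seed(0)
W = np.random.randn(3, 4)           # some weight matrix
dW_plain = np.random.randn(3, 4)    # gradient of the unregularized cost w.r.t. W
lambd, m, alpha = 0.7, 200, 0.3     # regularization strength, batch size, learning rate

# Update using the regularized gradient, as in backward_propagation_with_regularization
W_reg_update = W - alpha * (dW_plain + (lambd / m) * W)

# Equivalent "decay then step" form
W_decay_form = (1 - alpha * lambd / m) * W - alpha * dW_plain

print(np.allclose(W_reg_update, W_decay_form))   # True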

Dropout

Finally, let's regularize with dropout. The idea of dropout is to randomly shut down some units at every iteration: when you shut units down, you are actually training a modified, smaller network. The reasoning is that at each iteration you train a different model that uses only a subset of the neurons; as the iterations go on, the units become less sensitive to the activation of any other specific unit, because that other unit might be shut down at any time.

Below we shut down some units in the first and second hidden layers. There are four steps, implemented in forward_propagation_with_dropout below:
1. In the lectures a vector d^{[1]} with the same shape as a^{[1]} is created with np.random.rand(), so its entries lie between 0 and 1; here we use a vectorized version and build a matrix D^{[1]} of the same dimension as A^{[1]} (and likewise D^{[2]} for A^{[2]}).
2. Set each entry of D^{[1]} to 1 if it is below keep_prob and to 0 otherwise (D1 = D1 < keep_prob).
3. Set A^{[1]} to A^{[1]} * D^{[1]} (some units are now shut down). D^{[1]} acts as a mask: in the element-wise product, the shut-down units (entries equal to 0) simply drop out of the computation, because anything multiplied by 0 is 0.
4. Divide A^{[1]} by keep_prob. This scaling keeps the expected value of the activations, and hence of the cost, the same as without dropout; this is known as inverted dropout.

def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
   
    D1 = np.random.rand(A1.shape[0],A1.shape[1])      # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = D1 < keep_prob                               # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                      # Step 3: shut down some neurons of A1
    A1 = A1 / keep_prob                               # Step 4: scale the value of neurons that haven't been shut down
    
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
   
    D2 = np.random.rand(A2.shape[0],A2.shape[1])      # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = D2 < keep_prob                               # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                      # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                               # Step 4: scale the value of neurons that haven't been shut down
   
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    
    return A3, cache
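Here is a quick standalone check of the claim in step 4 that dividing by keep_prob preserves the expected activation (a sketch, not part of the assignment code; the names are made up for the illustration):

# Sanity check: inverted dropout keeps the mean activation roughly unchanged
import numpy as np

np.random.seed(1)
keep_prob = 0.86
A = np.random.rand(20, 100000)                 # fake layer activations
D = np.random.rand(*A.shape) < keep_prob       # dropout mask, ~86% ones
A_drop = (A * D) / keep_prob                   # shut down units, then rescale

print(A.mean())        # ~0.5
print(A_drop.mean())   # ~0.5 as well, thanks to the 1/keep_prob scaling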

Since we changed forward propagation, backpropagation has to change too: reapply the masks D^{[1]} and D^{[2]} stored in the cache to dA1 and dA2 of the first and second hidden layers, shutting down the same units as in the forward pass and rescaling by keep_prob.

def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    dA2 = np.dot(W3.T, dZ3)
    
    dA2 = dA2 * D2          # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 / keep_prob   # Step 2: Scale the value of neurons that haven't been shut down
    
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
   
    dA1 = dA1 * D1          # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 / keep_prob   # Step 2: Scale the value of neurons that haven't been shut down
  
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients
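One more point from the course notes: dropout is a training-time technique only. At prediction time you keep every unit and do no rescaling, which is exactly what keep_prob = 1 gives you. A tiny illustration under that assumption (the toy shapes and the variable X_new are made up just to make the check runnable; forward_propagation is the course helper):

# At prediction time, run the plain forward pass -- no units are dropped, no scaling.
# With keep_prob = 1 the dropout forward pass reduces to the plain one.
import numpy as np

toy_params = {
    "W1": np.random.randn(20, 2) * 0.1, "b1": np.zeros((20, 1)),
    "W2": np.random.randn(3, 20) * 0.1, "b2": np.zeros((3, 1)),
    "W3": np.random.randn(1, 3) * 0.1,  "b3": np.zeros((1, 1)),
}
X_new = np.random.randn(2, 5)   # 5 made-up examples with 2 features each

A3_plain, _ = forward_propagation(X_new, toy_params)
A3_keep1, _ = forward_propagation_with_dropout(X_new, toy_params, keep_prob=1.0)
print(np.allclose(A3_plain, A3_keep1))   # True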

With both the forward and backward passes in place, let's run the model with dropout (keep_prob = 0.86). This means that at every iteration each neuron in layers 1 and 2 is shut down with probability 14%.

X_assess, Y_assess, cache = backward_propagation_with_dropout_test_case()

gradients = backward_propagation_with_dropout(X_assess, Y_assess, cache, keep_prob = 0.8)
parameters = model(train_X, train_Y, keep_prob = 0.86, learning_rate = 0.3)

print ("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print ("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)

Output:

Cost after iteration 0: 0.6543912405149825
Cost after iteration 10000: 0.061016986574905605
Cost after iteration 20000: 0.060582435798513114

On the train set:
Accuracy: 0.928909952606635
On the test set:
Accuracy: 0.95
Figure 7: Cost with dropout

The decision boundary:


Figure 8: Model with dropout

Summary

Here is how the three models compare:

model                                train accuracy   test accuracy
3-layer NN without regularization    95%              91.5%
3-layer NN with L2-regularization    94%              93%
3-layer NN with dropout              93%              95%

As you can see, regularization lowers the training accuracy a little, because it limits how closely the network can fit the training data, but it raises the test accuracy, which is exactly what we wanted.
