[Deep Learning] Andrew Ng's Deep Learning, Course 2 "Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization", Week 2 Programming Assignment: Optimization Algorithms

Video link: [Chinese/English subtitles] Andrew Ng's Deep Learning, Course 2 — Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
References:

  1. [Chinese] [Andrew Ng course programming assignments] Course 2 - Improving Deep Neural Networks - Week 2 assignment
  2. Optimization methods

Contents

  • 0. Assignment Goals and Environment Setup
  • 1. Gradient Descent
  • 2. Mini-Batch Gradient Descent
  • 3. Gradient Descent with Momentum
  • 4. Adam
  • 5. Models with Different Optimization Algorithms
    • 5.1 Mini-batch Gradient Descent
    • 5.2 Mini-batch Gradient Descent with Momentum
    • 5.3 Mini-batch with Adam
    • 5.4 Summary
  • 6. Complete Code

0. Assignment Goals and Environment Setup

Until now, you have always used gradient descent to update the parameters and minimize the cost. In this article you will learn more advanced optimization methods that can speed up learning and perhaps even get you to a better final value of the cost function. A good optimization algorithm can mean getting a good result in just a few hours instead of days.
Gradient descent goes "downhill" on a cost function J, as shown below:

Figure 1: Minimizing the cost is like finding the lowest point in a hilly landscape
At each step of training, you update your parameters in a certain direction to try to reach the lowest possible point.
Notation: as usual, $\frac{\partial J}{\partial a} = da$ for any variable $a$.
Before you start, run the code below to make sure your environment has the libraries you need.

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets

from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
from testCase import *

plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

The numpy package can be installed with Anaconda; see the environment-setup part of this earlier article: [Deep Learning] Andrew Ng's Deep Learning, Course 1 "Neural Networks and Deep Learning", Week 2 Programming Assignment: Basics of Neural Networks. When installing the sklearn package, use the command conda install scikit-learn rather than conda install sklearn.
testCase and opt_utils can be downloaded from the link below (copied over from reference 1; if the extraction code does not work, check the one given in reference 1): download link, extraction code: g5tv.

1. Gradient Descent

A simple optimization method in machine learning is gradient descent (GD). When you take gradient steps with respect to all m examples at each step, it is also called Batch gradient descent.
Warm-up exercise: implement the gradient descent update rule. For $l = 1, ..., L$, the rule is:

$W^{[l]} = W^{[l]} - \alpha\, dW^{[l]}$
$b^{[l]} = b^{[l]} - \alpha\, db^{[l]}$

where L is the number of layers and α is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the loop, while the first parameters are $W^{[1]}$ and $b^{[1]}$, so you need to shift l to l+1 when coding.
Write the following code based on the explanation above:

def update_parameters_with_gd(parameters, grads, learning_rate):
    """
    Update parameters using one step of gradient descent
    
    Arguments:
    parameters -- python dictionary containing your parameters to be updated:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients to update each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    learning_rate -- the learning rate, scalar.
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    """

When completed, it should look like this:

def update_parameters_with_gd(parameters, grads, learning_rate):
    """
    Update parameters using one step of gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters to be updated:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients to update each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    learning_rate -- the learning rate, scalar.

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    L = len(parameters) // 2

    for l in range(L):
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]

    return parameters

After writing it, you can test it with the following code:

# Test the update_parameters_with_gd(parameters, grads, learning_rate) function
parameters, grads, learning_rate = update_parameters_with_gd_test_case()

parameters = update_parameters_with_gd(parameters, grads, learning_rate)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))

The test output is:

W1 = [[ 1.63535156 -0.62320365 -0.53718766]
 [-1.07799357  0.85639907 -2.29470142]]
b1 = [[ 1.74604067]
 [-0.75184921]]
W2 = [[ 0.32171798 -0.25467393  1.46902454]
 [-2.05617317 -0.31554548 -0.3756023 ]
 [ 1.1404819  -1.09976462 -0.1612551 ]]
b2 = [[-0.88020257]
 [ 0.02561572]
 [ 0.57539477]]

A variant of this is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent where each mini-batch contains just one example. The update rule you just implemented does not change. What changes is that you compute the gradient on just one training example at a time, rather than on the whole training set. The code examples below illustrate the difference between stochastic gradient descent and (batch) gradient descent.

(Batch) Gradient Descent:

X = data_input
Y = labels
parameters = initialize_parameters(layer_dims)
for i in range(0, num_iterations):
	# Forward propagation
	a, caches = forward_propagation(X, parameters)
	# Compute cost
	cost = compute_cost(a,Y)
	# Backward propagation
	grads = backward_propagation(a, caches, parameters)
	# update parameters
	parameters = update_parameters(parameters, grads)

Stochastic Gradient Descent:

X = data_input
Y = labels
parameters = initialize_parameters(layer_dims)
for i in range(0, num_iterations):
	for j in range(0,m):
		# Forward propagation
		a, caches = forward_propagation(X[:,j], parameters)
		# Compute cost
		cost = compute_cost(a, Y[:,j])
		# Backward propagation
		grads = backward_propagation(a, caches, parameters)
		# Update parameters
		parameters = update_parameters(parameters, grads)

In Stochastic Gradient Descent, you use only one training example before updating the gradients. When the training set is large, SGD can be faster. But the parameters will "oscillate" toward the minimum rather than converge smoothly.
Figure 1: SGD vs GD
"+" denotes the minimum of the cost. SGD leads to many oscillations on the way to convergence, but each step of SGD is much cheaper to compute than a step of GD, since it uses only one training example (as opposed to the whole batch for GD).

Note: implementing SGD requires 3 for-loops in total:

  1. Over the number of iterations
  2. Over the m training examples
  3. Over the layers (to update all parameters, from $(W^{[1]}, b^{[1]})$ to $(W^{[L]}, b^{[L]})$)

In practice, you will often get faster results if you use neither the whole training set nor only a single training example for each update. Mini-batch gradient descent uses an intermediate number of examples for each step. With mini-batch gradient descent, you loop over the mini-batches instead of looping over individual training examples (see the sketch after the summary list below).

Figure 2: SGD vs Mini-Batch GD
"+" denotes the minimum of the cost. Using mini-batches in your optimization algorithm often leads to faster optimization.

What you should remember:

  • The difference among gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
  • You have to tune the learning rate hyperparameter α.
  • With a well-tuned mini-batch size, it usually outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).
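
For completeness, here is a schematic sketch of the mini-batch loop just described, written in the same style as the two fragments above (it is only a sketch, not runnable on its own; random_mini_batches is the helper you will implement in the next section, and the full working version appears in the model() function in Section 5):

X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
	# Split the training set into mini-batches of an intermediate size
	minibatches = random_mini_batches(X, Y, mini_batch_size)
	for (minibatch_X, minibatch_Y) in minibatches:
		# Forward propagation
		a, caches = forward_propagation(minibatch_X, parameters)
		# Compute cost
		cost = compute_cost(a, minibatch_Y)
		# Backward propagation
		grads = backward_propagation(a, caches, parameters)
		# Update parameters
		parameters = update_parameters(parameters, grads)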

2. Mini-Batch Gradient Descent

Let's learn how to build mini-batches from the training set (X, Y).
There are two steps:

  • Shuffle: create a shuffled version of the training set (X, Y), as shown below. Each column of X and Y represents one training example. Note that the random shuffling is done synchronously between X and Y, so that after shuffling the $i^{th}$ column of X is still the example corresponding to the $i^{th}$ column of Y. The shuffling step ensures that the examples are split randomly into different mini-batches.
    (figure: X and Y shuffled column-by-column, in sync)
  • Partition: partition the shuffled (X, Y) into mini-batches of size mini_batch_size. The last mini-batch may be smaller, but you don't need to worry about that. When the final mini-batch is smaller than the full mini_batch_size, it will look like this:
    (figure: the shuffled set partitioned into mini-batches, with a smaller final batch)
Exercise: implement random_mini_batches. We coded the shuffling part for you. To help you with the partitioning step, here is the code that selects the indexes of the $1^{st}$ and $2^{nd}$ mini-batches:
first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size]
second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size]

Note that the last mini-batch might end up smaller than mini_batch_size=64. Let $\lfloor s \rfloor$ denote s rounded down to the nearest integer (this is math.floor(s) in Python). If the total number of examples is not a multiple of mini_batch_size=64, there will be $\lfloor \frac{m}{mini\_batch\_size} \rfloor$ mini-batches with a full 64 examples, and the final mini-batch will contain $(m - mini\_batch\_size \times \lfloor \frac{m}{mini\_batch\_size} \rfloor)$ examples.
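
As a quick arithmetic check of the formulas above, here is a small sketch (m = 148 is simply the value implied by the test case further down, where the three batches contain 64, 64 and 20 examples):

import math

m = 148                 # 64 + 64 + 20, as in the test case below
mini_batch_size = 64

num_complete = math.floor(m / mini_batch_size)   # floor(148/64) = 2 full mini-batches
last_size = m - mini_batch_size * num_complete   # 148 - 64*2 = 20 examples in the last one

print(num_complete, last_size)   # 2 20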
Complete the following function according to the description above:

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer
    
    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    
    np.random.seed(seed)            # To make your "random" minibatches the same as ours
    m = X.shape[1]                  # number of training examples
    mini_batches = []
        
    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))

When completed, the result is as follows:

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """

    np.random.seed(seed)  # To make your "random" minibatches the same as ours
    m = X.shape[1]  # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: Partition (shuffled_X, shuffled_Y). This part splits the whole training set into the corresponding number of smaller sets
    num_complete_minibatches = math.floor(m / mini_batch_size)  # first compute how many complete mini-batches of size mini_batch_size there are
    # Loop over num_complete_minibatches, handling one slice of the training set at a time
    for k in range(0, num_complete_minibatches):
        # shuffled_X and shuffled_Y have a vertical and a horizontal dimension; the vertical one holds the features of each example. We don't slice it, hence ":".
        # After the ",", the horizontal dimension is sliced in chunks of mini_batch_size, hence "k * mini_batch_size : (k + 1) * mini_batch_size".
        # "... : ..." is Python slicing; look it up if you are unfamiliar with it
        mini_batch_X = shuffled_X[:, k * mini_batch_size: (k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size: (k + 1) * mini_batch_size]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Suppose there are 6463 examples: the computation at the start of Step 2 yields 100 mini-batches, but 63 examples are left over at the end; this "if" handles that case
    if m % mini_batch_size != 0:
        # end = m - mini_batch_size * math.floor(m / mini_batch_size) # personally I think omitting this line makes little difference, but the reference solution includes it
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches

Once written, you can test it with this code:

X_assess, Y_assess, mini_batch_size = random_mini_batches_test_case()
mini_batches = random_mini_batches(X_assess, Y_assess, mini_batch_size)

print("shape of the 1st mini_batch_X: " + str(mini_batches[0][0].shape))
print("shape of the 2nd mini_batch_X: " + str(mini_batches[1][0].shape))
print("shape of the 3rd mini_batch_X: " + str(mini_batches[2][0].shape))
print("shape of the 1st mini_batch_Y: " + str(mini_batches[0][1].shape))
print("shape of the 2nd mini_batch_Y: " + str(mini_batches[1][1].shape)) 
print("shape of the 3rd mini_batch_Y: " + str(mini_batches[2][1].shape))
print("mini batch sanity check: " + str(mini_batches[0][0][0][0:3]))

You should get the following output:

shape of the 1st mini_batch_X: (12288, 64)
shape of the 2nd mini_batch_X: (12288, 64)
shape of the 3rd mini_batch_X: (12288, 20)
shape of the 1st mini_batch_Y: (1, 64)
shape of the 2nd mini_batch_Y: (1, 64)
shape of the 3rd mini_batch_Y: (1, 20)
mini batch sanity check: [ 0.90085595 -0.7612069   0.2344157 ]

What you should remember:

  • Shuffling and partitioning are the two steps required to build mini-batches
  • Powers of two are often chosen for the mini-batch size, e.g. 16, 32, 64, 128

3. Gradient Descent with Momentum

Because mini-batch gradient descent makes a parameter update after seeing only a subset of the examples, the direction of the update has some variance, so the path taken by mini-batch gradient descent "oscillates" toward convergence. Using momentum can reduce these oscillations.
Momentum takes the past gradients into account to smooth out the updates. We store the "direction" of the previous gradients in the variable $v$. Formally, this is the exponentially weighted average of the gradients of the previous steps. You can also think of $v$ as the "velocity" of a ball rolling downhill, building up speed (and momentum) according to the direction of the slope of the hill.
Figure 3: The red arrows show the direction taken by one step of mini-batch gradient descent with momentum. The blue points show the direction of the gradient on each step (following the current mini-batch). Rather than just following the gradient, we let the gradient influence $v$ and then take a step in the direction of $v$.
Exercise: initialize the velocity. The velocity $v$ is a python dictionary that needs to be initialized with arrays of zeros. Its keys are the same as those in the grads dictionary, that is, for $l = 1, ..., L$:

v["dW" + str(l+1)] = ... # 使用0初始化数组,维度与parameters["W" + str(l+1)]一样
v["db" + str(l+1)] = ... # 使用0初始化数字,维度与parameters["b" + str(l+1)]一样

Note: the iterator l starts at 0 in the for loop, while the first parameters are v["dW1"] and v["db1"] (with superscript 1). That is why we shift l to l+1 inside the for loop.
Use the hints above to complete the following function:

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    
    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """

When completed, it should look like this:

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL"
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl

    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    L = len(parameters) // 2
    v = {}

    for l in range(0, L):
        v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])

    return v

You can test your function with the following code:

# Test the initialize_velocity(parameters) function
parameters = initialize_velocity_test_case()

v = initialize_velocity(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))

The expected output is:

v["dW1"] = [[0. 0. 0.]
 [0. 0. 0.]]
v["db1"] = [[0.]
 [0.]]
v["dW2"] = [[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
v["db2"] = [[0.]
 [0.]
 [0.]]

Exercise: now implement the parameter update with momentum. The momentum update rule is, for $l = 1, ..., L$:

$v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1-\beta)\, dW^{[l]}$
$W^{[l]} = W^{[l]} - \alpha\, v_{dW^{[l]}}$

$v_{db^{[l]}} = \beta v_{db^{[l]}} + (1-\beta)\, db^{[l]}$
$b^{[l]} = b^{[l]} - \alpha\, v_{db^{[l]}}$

where L is the number of layers, β is the momentum parameter and α is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the for loop, while the first parameters are $W^{[1]}$ and $b^{[1]}$ (with superscript 1), so you need to shift l to l+1 when coding.
Complete the following function:

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- python dictionary containing your updated velocities
    """

When written, it should look like this:

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    L = len(parameters) // 2

    for l in range(L):
        v["dW" + str(l + 1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads["dW" + str(l + 1)]
        v["db" + str(l + 1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads["db" + str(l + 1)]

        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v["db" + str(l + 1)]

    return parameters, v

Copy the code below to check whether your function is correct:

# Test the update_parameters_with_momentum(parameters, grads, v, beta, learning_rate) function
parameters, grads, v = update_parameters_with_momentum_test_case()

parameters, v = update_parameters_with_momentum(parameters, grads, v, beta = 0.9, learning_rate = 0.01)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))

The output is:

W1 = [[ 1.62544598 -0.61290114 -0.52907334]
 [-1.07347112  0.86450677 -2.30085497]]
b1 = [[ 1.74493465]
 [-0.76027113]]
W2 = [[ 0.31930698 -0.24990073  1.4627996 ]
 [-2.05974396 -0.32173003 -0.38320915]
 [ 1.13444069 -1.0998786  -0.1713109 ]]
b2 = [[-0.87809283]
 [ 0.04055394]
 [ 0.58207317]]
v["dW1"] = [[-0.11006192  0.11447237  0.09015907]
 [ 0.05024943  0.09008559 -0.06837279]]
v["db1"] = [[-0.01228902]
 [-0.09357694]]
v["dW2"] = [[-0.02678881  0.05303555 -0.06916608]
 [-0.03967535 -0.06871727 -0.08452056]
 [-0.06712461 -0.00126646 -0.11173103]]
v["db2"] = [[0.02344157]
 [0.16598022]
 [0.07420442]]

Note

  • The velocity is initialized with zeros, so the algorithm will take a few iterations to "build up" velocity and start taking bigger steps.
  • If β = 0, this just becomes standard gradient descent without momentum.

How do you choose β?

  • The larger the momentum β, the smoother the update, because it takes more of the past gradients into account. But if β is too big, it can also smooth out the updates too much, and the steps taken become too large (see the short sketch below).
  • Common values for β range from 0.8 to 0.999. If you don't feel like tuning it, β = 0.9 is often a reasonable default.
  • Tuning the optimal β for your model may require trying several values to see which works best at reducing the value of the cost function J.
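
To get a feel for this smoothing effect, here is a purely illustrative sketch (not part of the assignment): it applies the exponentially weighted average v = βv + (1-β)g to a noisy, made-up sequence of "gradients" for two values of β, and measures how much the smoothed trace jumps from step to step. The larger β gives a much smoother trace:

import numpy as np

np.random.seed(0)
# A noisy "gradient" signal, invented only for this illustration
grads = np.sin(np.linspace(0, 3, 50)) + 0.5 * np.random.randn(50)

def ewma(signal, beta):
    """Exponentially weighted average: v = beta * v + (1 - beta) * g at every step."""
    v, trace = 0.0, []
    for g in signal:
        v = beta * v + (1 - beta) * g
        trace.append(v)
    return np.array(trace)

smooth_low = ewma(grads, beta=0.5)   # follows the noise fairly closely
smooth_high = ewma(grads, beta=0.9)  # much smoother, but lags behind changes

# Step-to-step variation shrinks as beta grows
print(np.std(np.diff(smooth_low)), np.std(np.diff(smooth_high)))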

What you should remember:

  • Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent. You have to tune the momentum hyperparameter β and the learning rate α.

4. Adam

Adam is one of the most effective optimization algorithms for training neural networks. It combines ideas from RMSProp and Momentum.
How does Adam work?

  1. It calculates an exponentially weighted average of the past gradients and stores it in the variable $v$ (before bias correction) and in $v^{corrected}$ (with bias correction).
  2. It calculates an exponentially weighted average of the squares of the past gradients and stores it in the variable $s$ (before bias correction) and in $s^{corrected}$ (with bias correction).
  3. It updates the parameters in a direction based on combining the information from steps 1 and 2.

For $l = 1, ..., L$, the update rule is:

$v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1-\beta_1)\frac{\partial J}{\partial W^{[l]}}$

$v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1-(\beta_1)^t}$

$s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1-\beta_2)\left(\frac{\partial J}{\partial W^{[l]}}\right)^2$

$s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1-(\beta_2)^t}$

$W^{[l]} = W^{[l]} - \alpha\frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}+\varepsilon}}$

where:

  • $t$ counts the number of Adam steps taken
  • $L$ is the number of layers
  • $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages
  • $\alpha$ is the learning rate
  • $\varepsilon$ is a very small number used to avoid dividing by zero

As usual, we store all parameters in the parameters dictionary.
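Before implementing this, a tiny illustrative sketch (not part of the assignment) of why the bias correction matters: $v$ starts at zero, so its first few values underestimate the gradient, and dividing by $1-(\beta_1)^t$ exactly compensates for that. With a constant, made-up gradient of 1.0:

import numpy as np

beta1 = 0.9
grad = 1.0      # pretend the gradient is constant at 1.0 (an illustrative assumption)
v = 0.0

for t in range(1, 4):
    v = beta1 * v + (1 - beta1) * grad
    v_corrected = v / (1 - np.power(beta1, t))
    print(t, round(v, 4), round(v_corrected, 4))

# t=1: v=0.1,   corrected 1.0
# t=2: v=0.19,  corrected 1.0
# t=3: v=0.271, corrected 1.0 -> the corrected estimate recovers the true gradient right away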
Exercise: initialize the Adam variables $v, s$, which keep track of the past information.
Instructions: the variables $v, s$ are python dictionaries that need to be initialized with arrays of zeros. Their keys are the same as those in grads. For $l = 1, ..., L$:

v["dW" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["W" + str(l+1)])
v["db" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["b" + str(l+1)])
s["dW" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["W" + str(l+1)])
s["db" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["b" + str(l+1)])

Complete the following function:

def initialize_adam(parameters) :
    """
    Initializes v and s as two python dictionaries with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters["W" + str(l)] = Wl
                    parameters["b" + str(l)] = bl
    
    Returns: 
    v -- python dictionary that will contain the exponentially weighted average of the gradient.
                    v["dW" + str(l)] = ...
                    v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
                    s["dW" + str(l)] = ...
                    s["db" + str(l)] = ...

    """

When completed, it should look like this:

def initialize_adam(parameters):
    """
    Initializes v and s as two python dictionaries with:
                - keys: "dW1", "db1", ..., "dWL", "dbL"
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.

    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters["W" + str(l)] = Wl
                    parameters["b" + str(l)] = bl

    Returns:
    v -- python dictionary that will contain the exponentially weighted average of the gradient.
                    v["dW" + str(l)] = ...
                    v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
                    s["dW" + str(l)] = ...
                    s["db" + str(l)] = ...

    """
    L = len(parameters) // 2
    v = {}
    s = {}

    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])

        s["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        s["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])

    return v, s

You can check whether the function is correct with the following code:

# Test the initialize_adam(parameters) function
parameters = initialize_adam_test_case()

v, s = initialize_adam(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))

The output is:

v["dW1"] = [[ 0.  0.  0.]
 [ 0.  0.  0.]]
v["db1"] = [[ 0.]
 [ 0.]]
v["dW2"] = [[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]
v["db2"] = [[ 0.]
 [ 0.]
 [ 0.]]
s["dW1"] = [[ 0.  0.  0.]
 [ 0.  0.  0.]]
s["db1"] = [[ 0.]
 [ 0.]]
s["dW2"] = [[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]
s["db2"] = [[ 0.]
 [ 0.]
 [ 0.]]

Exercise: now implement the parameter update with Adam. Recall the update rule, for $l = 1, ..., L$:

$v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1-\beta_1)\frac{\partial J}{\partial W^{[l]}}$

$v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1-(\beta_1)^t}$

$s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1-\beta_2)\left(\frac{\partial J}{\partial W^{[l]}}\right)^2$

$s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1-(\beta_2)^t}$

$W^{[l]} = W^{[l]} - \alpha\frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}+\varepsilon}}$

Note: the iterator l starts at 0 in the for loop, while the first parameters are $W^{[1]}$ and $b^{[1]}$. You need to shift l to l+1 when coding.

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates 
    beta2 -- Exponential decay hyperparameter for the second moment estimates 
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """

Given the formulas above, you just need to translate them into code:

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    L = len(parameters) // 2
    v_corrected = {}
    s_corrected = {}

    for l in range(L):
        v["dW" + str(l + 1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads["dW" + str(l + 1)]
        v["db" + str(l + 1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads["db" + str(l + 1)]

        v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)] / (1 - np.power(beta1, t))
        v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)] / (1 - np.power(beta1, t))

        s["dW" + str(l + 1)] = beta2 * s["dW" + str(l + 1)] + (1 - beta2) * np.power(grads["dW" + str(l +1)], 2)
        s["db" + str(l + 1)] = beta2 * s["db" + str(l + 1)] + (1 - beta2) * np.power(grads["db" + str(l + 1)], 2)

        s_corrected["dW" + str(l + 1)] = s["dW" + str(l + 1)] / (1 - np.power(beta2, t))
        s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)] / (1 - np.power(beta2, t))

        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * (v_corrected["dW" + str(l + 1)] / np.sqrt(s_corrected["dW" + str(l + 1)] + epsilon))
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * (v_corrected["db" + str(l + 1)] / np.sqrt(s_corrected["db" + str(l + 1)] + epsilon))

    return parameters, v, s

Test whether your function is correct with the following code:

# Test the update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8) function
parameters, grads, v, s = update_parameters_with_adam_test_case()
parameters, v, s  = update_parameters_with_adam(parameters, grads, v, s, t = 2)

print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))

The output is:

W1 = [[ 1.63178673 -0.61919778 -0.53561312]
 [-1.08040999  0.85796626 -2.29409733]]
b1 = [[ 1.75225313]
 [-0.75376553]]
W2 = [[ 0.32648046 -0.25681174  1.46954931]
 [-2.05269934 -0.31497584 -0.37661299]
 [ 1.14121081 -1.09245036 -0.16498684]]
b2 = [[-0.88529978]
 [ 0.03477238]
 [ 0.57537385]]
v["dW1"] = [[-0.11006192  0.11447237  0.09015907]
 [ 0.05024943  0.09008559 -0.06837279]]
v["db1"] = [[-0.01228902]
 [-0.09357694]]
v["dW2"] = [[-0.02678881  0.05303555 -0.06916608]
 [-0.03967535 -0.06871727 -0.08452056]
 [-0.06712461 -0.00126646 -0.11173103]]
v["db2"] = [[0.02344157]
 [0.16598022]
 [0.07420442]]
s["dW1"] = [[0.00121136 0.00131039 0.00081287]
 [0.0002525  0.00081154 0.00046748]]
s["db1"] = [[1.51020075e-05]
 [8.75664434e-04]]
s["dW2"] = [[7.17640232e-05 2.81276921e-04 4.78394595e-04]
 [1.57413361e-04 4.72206320e-04 7.14372576e-04]
 [4.50571368e-04 1.60392066e-07 1.24838242e-03]]
s["db2"] = [[5.49507194e-05]
 [2.75494327e-03]
 [5.50629536e-04]]

You now have three working optimization algorithms (mini-batch gradient descent, Momentum, Adam). Let's implement a model with each of these optimizers and observe the differences.

5. Models with Different Optimization Algorithms

Let's use the following "moons" dataset to test the different optimization methods. (The dataset is called "moons" because each of the two classes looks a bit like a crescent moon.)
Load and display the dataset with:

train_X, train_Y = load_dataset()
plt.show()

We have already implemented a 3-layer neural network, which you will train with:

  • Mini-batch Gradient Descent: it will call your function update_parameters_with_gd()
  • Mini-batch Momentum: it will call your functions initialize_velocity() and update_parameters_with_momentum()
  • Mini-batch Adam: it will call your functions initialize_adam() and update_parameters_with_adam()

We provide the following function:

def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64, beta=0.9,
          beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, print_cost=True):
    """
    3-layer neural network model which can be run in different optimizer modes.
    
    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    layers_dims -- python list, containing the size of each layer
    learning_rate -- the learning rate, scalar.
    mini_batch_size -- the size of a mini batch
    beta -- Momentum hyperparameter
    beta1 -- Exponential decay hyperparameter for the past gradients estimates 
    beta2 -- Exponential decay hyperparameter for the past squared gradients estimates 
    epsilon -- hyperparameter preventing division by zero in Adam updates
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    parameters -- python dictionary containing your updated parameters 
    """

    L = len(layers_dims)             # number of layers in the neural networks
    costs = []                       # to keep track of the cost
    t = 0                            # initializing the counter required for Adam update
    seed = 10                        # For grading purposes, so that your "random" minibatches are the same as ours
    
    # Initialize parameters
    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer
    if optimizer == "gd":
        pass # no initialization required for gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)
    
    # Optimization loop
    for i in range(num_epochs):
        
        # Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in minibatches:

            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch

            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)

            # Compute cost
            cost = compute_cost(a3, minibatch_Y)

            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1 # Adam counter
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
                                                               t, learning_rate, beta1, beta2,  epsilon)
        
        # Print the cost every 1000 epoch
        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)
                
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters

You will now run this 3-layer neural network with each of the three optimization methods.
From here on, replace the earlier dataset-loading code with:

train_X, train_Y = load_dataset(is_plot=False)

There is a bug in opt_utils.py that you should fix: in initialize_parameters(layer_dims), change np.sqrt(2 / layer_dims[l-1]) in the initialization of W to np.sqrt(2. / layer_dims[l-1]).
Otherwise (with Python 2 style integer division, 2 / layer_dims[l-1] truncates to 0) the weights are initialized to 0, which makes the cost decrease slowly and the accuracy low.
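
A minimal sketch of the affected line (assuming opt_utils uses He initialization of roughly this form; the surrounding code in the actual file may differ):

import numpy as np

layer_dims = [2, 5, 2, 1]   # the layer sizes used later in this assignment
l = 1

# Emulate the buggy behaviour: Python 2 integer division gives 2 / 5 == 0,
# so the scale factor is sqrt(0) and every weight becomes 0 ("//" reproduces this in Python 3)
W_bad = np.random.randn(layer_dims[l], layer_dims[l - 1]) * np.sqrt(2 // layer_dims[l - 1])

# The fix: force float division so the He scaling factor sqrt(2 / n_prev) is non-zero
W_good = np.random.randn(layer_dims[l], layer_dims[l - 1]) * np.sqrt(2. / layer_dims[l - 1])

print(W_bad.sum(), W_good.std())   # W_bad is all zeros; W_good has the expected spread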

5.1 Mini-batch Gradient Descent

Run the following code to see how the model does with mini-batch gradient descent:

# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="gd")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

The results are as follows:

Cost after epoch 0: 0.690736
Cost after epoch 1000: 0.685273
Cost after epoch 2000: 0.647072
Cost after epoch 3000: 0.619525
Cost after epoch 4000: 0.576584
Cost after epoch 5000: 0.607243
Cost after epoch 6000: 0.529403
Cost after epoch 7000: 0.460768
Cost after epoch 8000: 0.465586
Cost after epoch 9000: 0.464518

(plot of the cost over epochs)

Accuracy: 0.7966666666666666

(plot of the decision boundary for the model with gradient descent optimization)

5.2 Mini-batch Gradient Descent with Momentum

Run the code below to see how the model does with Momentum. Because this example is relatively simple, the gains from using momentum are small, but for more complex problems you may see larger gains.

# Momentum
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, beta=0.9, optimizer="momentum")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Momentum optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

The results are as follows:

Cost after epoch 0: 0.690741
Cost after epoch 1000: 0.685341
Cost after epoch 2000: 0.647145
Cost after epoch 3000: 0.619594
Cost after epoch 4000: 0.576665
Cost after epoch 5000: 0.607324
Cost after epoch 6000: 0.529476
Cost after epoch 7000: 0.460936
Cost after epoch 8000: 0.465780
Cost after epoch 9000: 0.464740

(plot of the cost over epochs)

Accuracy: 0.7966666666666666

(plot of the decision boundary for the model with Momentum optimization)

5.3 Mini-batch with Adam

Run the following code:

# Adam
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="adam")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Adam optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)

The results are as follows:

Cost after epoch 0: 0.690552
Cost after epoch 1000: 0.185501
Cost after epoch 2000: 0.150830
Cost after epoch 3000: 0.074454
Cost after epoch 4000: 0.125959
Cost after epoch 5000: 0.104344
Cost after epoch 6000: 0.100676
Cost after epoch 7000: 0.031652
Cost after epoch 8000: 0.111973
Cost after epoch 9000: 0.197940

(plot of the cost over epochs)

Accuracy: 0.94

(plot of the decision boundary for the model with Adam optimization)

5.4 Summary

Summary of the three runs above:

  • Gradient descent: accuracy 79.7%, the cost decreases slowly with oscillations
  • Momentum: accuracy 79.7%, the cost decreases slowly with oscillations
  • Adam: accuracy 94%, the cost decreases much faster
Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact here is almost negligible.
Also, the huge oscillations you see in the cost come from the fact that some mini-batches are more difficult than others for the optimization algorithm.
Adam, on the other hand, clearly outperforms mini-batch gradient descent and Momentum. If you ran the model for more epochs on this simple dataset, all three methods would lead to very good results. However, you've seen that Adam converges a lot faster.

Some advantages of Adam include:

  • Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
  • It usually works well even with little tuning of the hyperparameters (except α)

Reference:
The Adam paper: https://arxiv.org/pdf/1412.6980.pdf

6. Complete Code

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets

from course2.week2.opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from course2.week2.opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
from course2.week2.testCase import *

plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'


def update_parameters_with_gd(parameters, grads, learning_rate):
    """
    Update parameters using one step of gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters to be updated:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients to update each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    learning_rate -- the learning rate, scalar.

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    L = len(parameters) // 2

    for l in range(L):
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]

    return parameters

# # Test the update_parameters_with_gd(parameters, grads, learning_rate) function
# parameters, grads, learning_rate = update_parameters_with_gd_test_case()
# 
# parameters = update_parameters_with_gd(parameters, grads, learning_rate)
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))


def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """

    np.random.seed(seed)  # To make your "random" minibatches the same as ours
    m = X.shape[1]  # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: Partition (shuffled_X, shuffled_Y). This part splits the whole training set into the corresponding number of smaller sets
    num_complete_minibatches = math.floor(m / mini_batch_size)  # first compute how many complete mini-batches of size mini_batch_size there are
    # Loop over num_complete_minibatches, handling one slice of the training set at a time
    for k in range(0, num_complete_minibatches):
        # shuffled_X and shuffled_Y have a vertical and a horizontal dimension; the vertical one holds the features of each example. We don't slice it, hence ":".
        # After the ",", the horizontal dimension is sliced in chunks of mini_batch_size, hence "k * mini_batch_size : (k + 1) * mini_batch_size".
        # "... : ..." is Python slicing; look it up if you are unfamiliar with it
        mini_batch_X = shuffled_X[:, k * mini_batch_size: (k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size: (k + 1) * mini_batch_size]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Suppose there are 6463 examples: the computation at the start of Step 2 yields 100 mini-batches, but 63 examples are left over at the end; this "if" handles that case
    if m % mini_batch_size != 0:
        # end = m - mini_batch_size * math.floor(m / mini_batch_size) # personally I think omitting this line makes little difference, but the reference solution includes it
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    return mini_batches

# # Test the random_mini_batches(X, Y, mini_batch_size=64, seed=0) function
# X_assess, Y_assess, mini_batch_size = random_mini_batches_test_case()
# mini_batches = random_mini_batches(X_assess, Y_assess, mini_batch_size)
#
# print("shape of the 1st mini_batch_X: " + str(mini_batches[0][0].shape))
# print("shape of the 2nd mini_batch_X: " + str(mini_batches[1][0].shape))
# print("shape of the 3rd mini_batch_X: " + str(mini_batches[2][0].shape))
# print("shape of the 1st mini_batch_Y: " + str(mini_batches[0][1].shape))
# print("shape of the 2nd mini_batch_Y: " + str(mini_batches[1][1].shape))
# print("shape of the 3rd mini_batch_Y: " + str(mini_batches[2][1].shape))
# print("mini batch sanity check: " + str(mini_batches[0][0][0][0:3]))


def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL"
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl

    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    L = len(parameters) // 2
    v = {}

    for l in range(0, L):
        v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])

    return v

# # Test the initialize_velocity(parameters) function
# parameters = initialize_velocity_test_case()
#
# v = initialize_velocity(parameters)
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))


def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    L = len(parameters) // 2

    for l in range(L):
        v["dW" + str(l + 1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads["dW" + str(l + 1)]
        v["db" + str(l + 1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads["db" + str(l + 1)]

        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v["db" + str(l + 1)]

    return parameters, v

# # Test the update_parameters_with_momentum(parameters, grads, v, beta, learning_rate) function
# parameters, grads, v = update_parameters_with_momentum_test_case()
#
# parameters, v = update_parameters_with_momentum(parameters, grads, v, beta = 0.9, learning_rate = 0.01)
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))


def initialize_adam(parameters):
    """
    Initializes v and s as two python dictionaries with:
                - keys: "dW1", "db1", ..., "dWL", "dbL"
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.

    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters["W" + str(l)] = Wl
                    parameters["b" + str(l)] = bl

    Returns:
    v -- python dictionary that will contain the exponentially weighted average of the gradient.
                    v["dW" + str(l)] = ...
                    v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
                    s["dW" + str(l)] = ...
                    s["db" + str(l)] = ...

    """
    L = len(parameters) // 2
    v = {}
    s = {}

    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])

        s["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        s["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])

    return v, s

# # Test the initialize_adam(parameters) function
# parameters = initialize_adam_test_case()
#
# v, s = initialize_adam(parameters)
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))
# print("s[\"dW1\"] = " + str(s["dW1"]))
# print("s[\"db1\"] = " + str(s["db1"]))
# print("s[\"dW2\"] = " + str(s["dW2"]))
# print("s[\"db2\"] = " + str(s["db2"]))


def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    L = len(parameters) // 2
    v_corrected = {}
    s_corrected = {}

    for l in range(L):
        v["dW" + str(l + 1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads["dW" + str(l + 1)]
        v["db" + str(l + 1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads["db" + str(l + 1)]

        v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)] / (1 - np.power(beta1, t))
        v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)] / (1 - np.power(beta1, t))

        s["dW" + str(l + 1)] = beta2 * s["dW" + str(l + 1)] + (1 - beta2) * np.power(grads["dW" + str(l +1)], 2)
        s["db" + str(l + 1)] = beta2 * s["db" + str(l + 1)] + (1 - beta2) * np.power(grads["db" + str(l + 1)], 2)

        s_corrected["dW" + str(l + 1)] = s["dW" + str(l + 1)] / (1 - np.power(beta2, t))
        s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)] / (1 - np.power(beta2, t))

        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * (v_corrected["dW" + str(l + 1)] / np.sqrt(s_corrected["dW" + str(l + 1)] + epsilon))
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * (v_corrected["db" + str(l + 1)] / np.sqrt(s_corrected["db" + str(l + 1)] + epsilon))

    return parameters, v, s

# # Test the update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8) function
# parameters, grads, v, s = update_parameters_with_adam_test_case()
# parameters, v, s  = update_parameters_with_adam(parameters, grads, v, s, t = 2)
#
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))
# print("s[\"dW1\"] = " + str(s["dW1"]))
# print("s[\"db1\"] = " + str(s["db1"]))
# print("s[\"dW2\"] = " + str(s["dW2"]))
# print("s[\"db2\"] = " + str(s["db2"]))

train_X, train_Y = load_dataset(is_plot=False)
# plt.show()


def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64, beta=0.9,
          beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, print_cost=True):
    """
    3-layer neural network model which can be run in different optimizer modes.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    layers_dims -- python list, containing the size of each layer
    learning_rate -- the learning rate, scalar.
    mini_batch_size -- the size of a mini batch
    beta -- Momentum hyperparameter
    beta1 -- Exponential decay hyperparameter for the past gradients estimates
    beta2 -- Exponential decay hyperparameter for the past squared gradients estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    parameters -- python dictionary containing your updated parameters
    """

    L = len(layers_dims)  # number of layers in the neural networks
    costs = []  # to keep track of the cost
    t = 0  # initializing the counter required for Adam update
    seed = 10  # For grading purposes, so that your "random" minibatches are the same as ours

    # Initialize parameters
    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer
    if optimizer == "gd":
        pass  # no initialization required for gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)

    # Optimization loop
    for i in range(num_epochs):

        # Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in minibatches:

            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch

            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)

            # Compute cost
            cost = compute_cost(a3, minibatch_Y)

            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1  # Adam counter
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
                                                               t, learning_rate, beta1, beta2, epsilon)

        # Print the cost every 1000 epoch
        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters


# # Mini-batch
# # train 3-layer model
# layers_dims = [train_X.shape[0], 5, 2, 1]
# parameters = model(train_X, train_Y, layers_dims, optimizer="gd")
# 
# # Predict
# predictions = predict(train_X, train_Y, parameters)
# 
# # Plot decision boundary
# plt.title("Model with Gradient Descent optimization")
# axes = plt.gca()
# axes.set_xlim([-1.5, 2.5])
# axes.set_ylim([-1, 1.5])
# plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)



# # Momentum
# # train 3-layer model
# layers_dims = [train_X.shape[0], 5, 2, 1]
# parameters = model(train_X, train_Y, layers_dims, beta=0.9, optimizer="momentum")
#
# # Predict
# predictions = predict(train_X, train_Y, parameters)
#
# # Plot decision boundary
# plt.title("Model with Momentum optimization")
# axes = plt.gca()
# axes.set_xlim([-1.5, 2.5])
# axes.set_ylim([-1, 1.5])
# plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)



# Adam
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="adam")

# Predict
predictions = predict(train_X, train_Y, parameters)

# Plot decision boundary
plt.title("Model with Adam optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
