视频链接:【中英字幕】吴恩达深度学习课程第二课 — 改善深层神经网络:超参数调试、正则化以及优化
参考链接:
直到现在,你会经常使用梯度下降来更新参数并减小成本。在本篇文章中,你将学习到更多能够加速学习的先进优化方法并且能够获得成本函数的最优质的。好的优化算法能够用仅仅几个小时来获得好的结果,而不是花上许多天。
在成本函数J上进行梯度下降就像下山一样,如下图:
图1:将成本降至最低就像在丘陵地带找到最低点一样
在训练的每一步中,你都会向一个确定的方向更新你的参数以尝试获得可能的最低点。
Notations:如往常一样, ∂ J ∂ a = d a \frac{\partial{J}}{\partial{a}} = da ∂a∂J=da代表任何变量a。
开始之前,运行一下下面的代码来保证你的环境中有你所需的库。
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets
from opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
from testCas import *
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
numpy包可以使用Anaconda进行安装。安装的方式见此篇文章:【深度学习】吴恩达深度学习-Course1神经网络与深度学习-第二周神经网络基础编程中的使用环境部分,安装sklearn包时请使用命令:conda install scikit-learn
而不是conda install sklearn
。
testCase、opt_utils请点击后边的资料下载(从参考链接1中搬运来的链接,如果提取码有问题可以去看一下参考链接1中的提取码):资料下载,提取码:g5tv。
在机器学习中一个简单的优化方法就是梯度下降(gradient descent,GD),当你对m个例子执行梯度下降的每一步时,就称作Batch梯度下降。
热身练习:完成梯度下降的耿勋规则,对于 l = 1 , . . . , L l=1,...,L l=1,...,L梯度下降的规则:
W [ l ] = W [ l ] − α d W [ l ] W^{[l]}=W^{[l]}-αdW{[l]} W[l]=W[l]−αdW[l]
b [ l ] = b [ l ] − α d b [ l ] b^{[l]}=b^{[l]}-αdb{[l]} b[l]=b[l]−αdb[l]
在这其中,L是层数,α是学习率。所有的参数需要被存储在parameters
字典中。注意迭代的l
从0开始,第一个参数是 W [ 1 ] W^{[1]} W[1]和 Y [ 1 ] Y^{[1]} Y[1]。在编码时,你要注意转换成l
和l+1
。
请根据上边的解释编写下边的代码:
def update_parameters_with_gd(parameters, grads, learning_rate):
"""
Update parameters using one step of gradient descent
Arguments:
parameters -- python dictionary containing your parameters to be updated:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients to update each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
learning_rate -- the learning rate, scalar.
Returns:
parameters -- python dictionary containing your updated parameters
"""
完成后应当如下:
def update_parameters_with_gd(parameters, grads, learning_rate):
"""
Update parameters using one step of gradient descent
Arguments:
parameters -- python dictionary containing your parameters to be updated:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients to update each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
learning_rate -- the learning rate, scalar.
Returns:
parameters -- python dictionary containing your updated parameters
"""
L = len(parameters) // 2
for l in range(L):
parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]
return parameters
写完后可以用如下的代码测试一下:
# 测试update_parameters_with_gd(parameters, grads, learning_rate)函数
parameters, grads, learning_rate = update_parameters_with_gd_test_case()
parameters = update_parameters_with_gd(parameters, grads, learning_rate)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
测试结果为:
W1 = [[ 1.63535156 -0.62320365 -0.53718766]
[-1.07799357 0.85639907 -2.29470142]]
b1 = [[ 1.74604067]
[-0.75184921]]
W2 = [[ 0.32171798 -0.25467393 1.46902454]
[-2.05617317 -0.31554548 -0.3756023 ]
[ 1.1404819 -1.09976462 -0.1612551 ]]
b2 = [[-0.88020257]
[ 0.02561572]
[ 0.57539477]]
一种变种是随机梯度下降(Stochastic Gradient Descent,SGD),等同于mini-batch梯度下降中每一个mini-batch只含有一个实例。刚刚您实现的更新规则不会改变。需要改变的是,你需要一次只计算一个训练实例的梯度,而不是整个训练集的梯度。接下来的代码示例会阐述stochastic梯度下降和batch梯度下降的差异。
Batch梯度下降:
X = data_input
Y = labels
parameters = initialize_parameters(layer_dims)
for i in range(0, num_iterations):
# Forward propagation
a, caches = forward_propagation(X, parameters)
# Compute cost
cost = compute_cost(a,Y)
# Backward propagation
grads = backward_propagation(a, caches, parameters)
# update parameters
parameters = update_parameters(parameters, grads)
Stochastic梯度下降(随机梯度下降):
X = data_input
Y = labels
parameters = initialize_parameters(layer_dims)
for i in range(0, num_iterations):
for j in range(0,m):
# Forward propagation
a, caches = forward_propagation(X[:,j], parameters)
# Compute cost
cost = compute_cost(a, Y[:,j])
# Backward propagation
grads = backward_propagation(a, caches, parameters)
# Update parameters
parameters = update_parameters(parameters, grads)
在Stochastic梯度下降中,在更新梯度之前,你仅使用1个训练示例。当训练集非常大的时候,SGD会更快一些。但是参数将会朝着最小值“震荡”而不是平稳收敛。
图1:SGD vs GD
“+”代表成本的最小值。SGD导致产生了许多震荡从而达到收敛。但是SGD在每一步的计算上要快于GD,因为它只使用一个训练示例(与GD的整个批次相比)。
笔记完成SGD总共需要3个循环:
在练习中,如果你不使用个训练集或者仅使用一个训练示例来执行每一次更新,你将总是很快得到结果。Mini-batch梯度下降在每一步中使用中间数量的示例。使用Mini-batch,你在Mini-batch上进行循环而不是在单个训练实例上循环。
图2:SGD vs Mini-Batch GD
“+”代表成本的最小值。使用Mini-Batch来优化你的算法从而能够带来更快的优化。
你需要记住的:
让我们学习如何从训练集(X, Y)中建立mini-batches。
一共有两步:
mini_batch_size
大小的mni-batches
集。最后的mini-batch可能会很小,但是你不需要担心这一点。当最后的mini-batch比满mini_batch_size
小,它看起来会是这样的:random_mini_batches
。我们为你编码了洗牌的部分,来帮助你完成分割部分,我们给你以下的代码,用于为 1 s t 1^{st} 1st和 2 n d 2^{nd} 2nd两个mini-batches提供索引:first_mini_batch_X = shuffled_X[:, 0 : mini_batch_size]
second_mini_batch_X = shuffled_X[:, mini_batch_size : 2 * mini_batch_size]
注意,最后一个mini-batch的值可能比mini_batch_size=64
要小。让 ⌊ s ⌋ \lfloor s \rfloor ⌊s⌋代表s向下取整到最近的整数(在Python中就是math.floor(s)
)。如果实例的总数不是mini_batch_size=64
的倍数,则会有 ⌊ m m i n i _ b a t c h _ s i z e ⌋ \lfloor \frac{m}{mini\_batch\_size} \rfloor ⌊mini_batch_sizem⌋个mini_batch,包含完整地64个实例,并且实例的数量在最后一个mini-batch中是 ( m − m i n i _ b a t c h _ s i z e × ⌊ m m i n i _ b a t c h _ s i z e ⌋ ) (m-mini\_batch\_size × \lfloor \frac{m}{mini\_batch\_size} \rfloor ) (m−mini_batch_size×⌊mini_batch_sizem⌋)。
请根据上述描述完成如下函数:
def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
"""
Creates a list of random minibatches from (X, Y)
Arguments:
X -- input data, of shape (input size, number of examples)
Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
mini_batch_size -- size of the mini-batches, integer
Returns:
mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
"""
np.random.seed(seed) # To make your "random" minibatches the same as ours
m = X.shape[1] # number of training examples
mini_batches = []
# Step 1: Shuffle (X, Y)
permutation = list(np.random.permutation(m))
shuffled_X = X[:, permutation]
shuffled_Y = Y[:, permutation].reshape((1,m))
完成后的结果如下:
def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
"""
Creates a list of random minibatches from (X, Y)
Arguments:
X -- input data, of shape (input size, number of examples)
Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
mini_batch_size -- size of the mini-batches, integer
Returns:
mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
"""
np.random.seed(seed) # To make your "random" minibatches the same as ours
m = X.shape[1] # number of training examples
mini_batches = []
# Step 1: Shuffle (X, Y)
permutation = list(np.random.permutation(m))
shuffled_X = X[:, permutation]
shuffled_Y = Y[:, permutation].reshape((1, m))
# Step 2: Partition(shuffled_X, shuffled_Y),这一部分主要是将一整个训练集分割为相应数量的小集合
num_complete_minibatches = math.floor(m / mini_batch_size) # 首先算出以mini_batch_size为分割大小时mini_batch的数量
# 对num_complete_minibatches进行循环,一次处理一个分割后的训练集
for k in range(0, num_complete_minibatches):
# shuffled_X、shuffle_Y是具有竖着和横着两个维度的,竖着的代表该实例的每一个特征。我们不进行切割,所以用":"表示。
# 使用","隔开后,横向维度大小我们使用mini_batch_size进行切割,故用"k * mini_batch_size: (k + 1) * mini_batch_size"进行表示。
# “... : ..."是Python中的切片函数,不清楚可以了解一下
mini_batch_X = shuffled_X[:, k * mini_batch_size: (k + 1) * mini_batch_size]
mini_batch_Y = shuffled_Y[:, k * mini_batch_size: (k + 1) * mini_batch_size]
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
# 假设有6463个数据,则按照步骤2一开始的计算,可以分割出100个mini_batch,但是在后边还有63个数据,这个if就是为了防止这种情况
if m % mini_batch_size != 0:
# end = m - mini_batch_size * math.floor(m / mini_batch_size) # 个人认为这句不用也没多大影响。。但是给出的答案中有这句
mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
return mini_batches
写完后可以使用这段代码测试一下:
X_assess, Y_assess, mini_batch_size = random_mini_batches_test_case()
mini_batches = random_mini_batches(X_assess, Y_assess, mini_batch_size)
print("shape of the 1st mini_batch_X: " + str(mini_batches[0][0].shape))
print("shape of the 2nd mini_batch_X: " + str(mini_batches[1][0].shape))
print("shape of the 3rd mini_batch_X: " + str(mini_batches[2][0].shape))
print("shape of the 1st mini_batch_Y: " + str(mini_batches[0][1].shape))
print("shape of the 2nd mini_batch_Y: " + str(mini_batches[1][1].shape))
print("shape of the 3rd mini_batch_Y: " + str(mini_batches[2][1].shape))
print("mini batch sanity check: " + str(mini_batches[0][0][0][0:3]))
得到的结果应当如下:
shape of the 1st mini_batch_X: (12288, 64)
shape of the 2nd mini_batch_X: (12288, 64)
shape of the 3rd mini_batch_X: (12288, 20)
shape of the 1st mini_batch_Y: (1, 64)
shape of the 2nd mini_batch_Y: (1, 64)
shape of the 3rd mini_batch_Y: (1, 20)
mini batch sanity check: [ 0.90085595 -0.7612069 0.2344157 ]
你需要记住的:
因为mini-batch梯度下降在仅看到一部分示例后就会进行参数更新,所以更新的方向有一些变化,因此mini-batch梯度下降所走的路径将向着收敛方向“振荡”。使用Momentun可以减少这些振荡。
Momentun考虑了之前的梯度,从而能够使得update更为平整。我们将之前梯度的“方向”存储在变量 v v v中。在形式上,这是之前步骤中梯度的指数加权平均值。你可以将这个 v v v认为是一个球滚下山坡的“速度”,根据山坡的坡度/坡度的方向来增加加速度(和动量)。
图3: 这里的红色箭头展示了使用momentun后走一步mini-batch梯度下降的方向。蓝色的点表示每一步梯度的方向(遵从原本mini-batch的方向)。我们让梯度影响 v v v并且向 v v v的方向迈步而不是只是跟随梯度。
**练习:**初始化速度。速度, v v v,是一个python字典,需要用0将数组初始化。其关键与grads字典一样,即:对于 l = 1 , . . . , L : l=1,...,L: l=1,...,L:
v["dW" + str(l+1)] = ... # 使用0初始化数组,维度与parameters["W" + str(l+1)]一样
v["db" + str(l+1)] = ... # 使用0初始化数字,维度与parameters["b" + str(l+1)]一样
注意: 迭代器l在for循环从0开始,第一个参数是v[“dW1”]和v[“db1”](上标是1)。这就是为什么我们在for循环中将l
转换为l+1
利用上边的提示完成以下函数:
def initialize_velocity(parameters):
"""
Initializes the velocity as a python dictionary with:
- keys: "dW1", "db1", ..., "dWL", "dbL"
- values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
Arguments:
parameters -- python dictionary containing your parameters.
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
Returns:
v -- python dictionary containing the current velocity.
v['dW' + str(l)] = velocity of dWl
v['db' + str(l)] = velocity of dbl
"""
完成后如下:
def initialize_velocity(parameters):
"""
Initializes the velocity as a python dictionary with:
- keys: "dW1", "db1", ..., "dWL", "dbL"
- values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
Arguments:
parameters -- python dictionary containing your parameters.
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
Returns:
v -- python dictionary containing the current velocity.
v['dW' + str(l)] = velocity of dWl
v['db' + str(l)] = velocity of dbl
"""
L = len(parameters) // 2
v = {}
for l in range(0, L):
v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])
return v
可以用以下代码测试你的函数:
# 测试函数initialize_velocity(parameters)
parameters = initialize_velocity_test_case()
v = initialize_velocity(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
得到的结果应为:
v["dW1"] = [[0. 0. 0.]
[0. 0. 0.]]
v["db1"] = [[0.]
[0.]]
v["dW2"] = [[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
v["db2"] = [[0.]
[0.]
[0.]]
练习: 现在,完成使用momentun进行的参数更新。momentun的更新规则如下,对于 l = 1 , . . . , L : l = 1,...,L: l=1,...,L:
v d W [ l ] = β v d W [ l ] + ( 1 − β ) d W [ l ] v_{dW^{[l]}}=βv_{dW^{[l]}}+(1-β)dW^{[l]} vdW[l]=βvdW[l]+(1−β)dW[l]
W [ l ] = W [ l ] − α v d W [ l ] W^{[l]}=W^{[l]}-αv_{dW^{[l]}} W[l]=W[l]−αvdW[l]
v d b [ l ] = β v d b [ l ] + ( 1 − β ) d b [ l ] v_{db^{[l]}}=βv_{db^{[l]}}+(1-β)db^{[l]} vdb[l]=βvdb[l]+(1−β)db[l]
b [ l ] = b [ l ] − α v d b [ l ] b^{[l]}=b^{[l]}-αv_{db^{[l]}} b[l]=b[l]−αvdb[l]
L是层数,β是momentun的参数,α是学习率。所有的参数都应该被存储在parameters
字典中。注意迭代器l
在for
循环中从0开始且第一个参数是 W [ 1 ] W^{[1]} W[1]和 b [ 1 ] b^{[1]} b[1](上标是1).所以在编码时你将需要将l
转换为l+1
。
完成以下函数:
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
"""
Update parameters using Momentum
Arguments:
parameters -- python dictionary containing your parameters:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients for each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
v -- python dictionary containing the current velocity:
v['dW' + str(l)] = ...
v['db' + str(l)] = ...
beta -- the momentum hyperparameter, scalar
learning_rate -- the learning rate, scalar
Returns:
parameters -- python dictionary containing your updated parameters
v -- python dictionary containing your updated velocities
"""
写完后应如下:
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
"""
Update parameters using Momentum
Arguments:
parameters -- python dictionary containing your parameters:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients for each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
v -- python dictionary containing the current velocity:
v['dW' + str(l)] = ...
v['db' + str(l)] = ...
beta -- the momentum hyperparameter, scalar
learning_rate -- the learning rate, scalar
Returns:
parameters -- python dictionary containing your updated parameters
v -- python dictionary containing your updated velocities
"""
L = len(parameters) // 2
for l in range(L):
v["dW" + str(l + 1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads["dW" + str(l + 1)]
v["db" + str(l + 1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads["db" + str(l + 1)]
parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW" + str(l + 1)]
parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v["db" + str(l + 1)]
return parameters, v
直接复制下边的代码测试一下你所写的函数是否正确:
# 测试函数update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
parameters, grads, v = update_parameters_with_momentum_test_case()
parameters, v = update_parameters_with_momentum(parameters, grads, v, beta = 0.9, learning_rate = 0.01)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
得到结果如下:
W1 = [[ 1.62544598 -0.61290114 -0.52907334]
[-1.07347112 0.86450677 -2.30085497]]
b1 = [[ 1.74493465]
[-0.76027113]]
W2 = [[ 0.31930698 -0.24990073 1.4627996 ]
[-2.05974396 -0.32173003 -0.38320915]
[ 1.13444069 -1.0998786 -0.1713109 ]]
b2 = [[-0.87809283]
[ 0.04055394]
[ 0.58207317]]
v["dW1"] = [[-0.11006192 0.11447237 0.09015907]
[ 0.05024943 0.09008559 -0.06837279]]
v["db1"] = [[-0.01228902]
[-0.09357694]]
v["dW2"] = [[-0.02678881 0.05303555 -0.06916608]
[-0.03967535 -0.06871727 -0.08452056]
[-0.06712461 -0.00126646 -0.11173103]]
v["db2"] = [[0.02344157]
[0.16598022]
[0.07420442]]
Note :
怎么选择参数β :
你需要记住的:
Adam是训练神经网络的非常高效的优化算法之一。它结合了RMSProp和Mmentun的思想。
Adam是如何工作的?:
对于 l = 1 , . . . , L l=1,...,L l=1,...,L,更新规则如下:
v d W [ l ] = β 1 v d W [ l ] + ( 1 − β 1 ) ∂ J ∂ W [ l ] v_{dW^{[l]}}=β_1v_{dW^{[l]}} + (1-β_1) \frac{\partial{J}}{\partial{W^{[l]}}} vdW[l]=β1vdW[l]+(1−β1)∂W[l]∂J
v d W [ l ] c o r r e c t e d = v d W [ l ] 1 − ( β 1 ) t v^{corrected}_{dW^{[l]}}=\frac{v_{dW^{[l]}}}{1-(β_1)^t} vdW[l]corrected=1−(β1)tvdW[l]
s d W [ l ] = β 2 s d W [ l ] + ( 1 − β 2 ) ( ∂ J ∂ W [ l ] ) 2 s_{dW^{[l]}}=β_2s_{dW^{[l]}} + (1-β_2)(\frac{\partial{J}}{\partial{W^{[l]}}})^2 sdW[l]=β2sdW[l]+(1−β2)(∂W[l]∂J)2
s d W [ l ] c o r r e c t e d = s d W [ l ] 1 − ( β 2 ) 2 s^{corrected}_{dW^{[l]}}=\frac{s_{dW^{[l]}}}{1-(β_2)^2} sdW[l]corrected=1−(β2)2sdW[l]
W [ l ] = W [ l ] − α v d W [ l ] c o r r e c t e d s d W [ l ] c o r r e c t e d + ε W^{[l]} = W^{[l]} - α\frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}+ε}} W[l]=W[l]−αsdW[l]corrected+εvdW[l]corrected
在这之中:
通常,我们将所有的参数存储于parameters字典中。
练习 :初始化Adam的参数 v , s v,s v,s,这些变量用于跟踪过去的信息。
说明 :变量 v , s v,s v,s 是Python字典,需要使用0数组来初始化,它们的关键和grads一样。对于 l = 1 , . . . , L l=1,...,L l=1,...,L:
v["dW" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["W" + str(l+1)])
v["db" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["b" + str(l+1)])
s["dW" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["W" + str(l+1)])
s["db" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["b" + str(l+1)])
请完成如下函数:
def initialize_adam(parameters) :
"""
Initializes v and s as two python dictionaries with:
- keys: "dW1", "db1", ..., "dWL", "dbL"
- values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
Arguments:
parameters -- python dictionary containing your parameters.
parameters["W" + str(l)] = Wl
parameters["b" + str(l)] = bl
Returns:
v -- python dictionary that will contain the exponentially weighted average of the gradient.
v["dW" + str(l)] = ...
v["db" + str(l)] = ...
s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
s["dW" + str(l)] = ...
s["db" + str(l)] = ...
"""
完成后应如下:
def initialize_adam(parameters):
"""
Initializes v and s as two python dictionaries with:
- keys: "dW1", "db1", ..., "dWL", "dbL"
- values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
Arguments:
parameters -- python dictionary containing your parameters.
parameters["W" + str(l)] = Wl
parameters["b" + str(l)] = bl
Returns:
v -- python dictionary that will contain the exponentially weighted average of the gradient.
v["dW" + str(l)] = ...
v["db" + str(l)] = ...
s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
s["dW" + str(l)] = ...
s["db" + str(l)] = ...
"""
L = len(parameters) // 2
v = {}
s = {}
for l in range(L):
v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])
s["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
s["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])
return v, s
建议使用如下代码测试函数是否正确:
# 测试函数initialize_adam(parameters)
parameters = initialize_adam_test_case()
v, s = initialize_adam(parameters)
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))
得到的结果如下:
v["dW1"] = [[ 0. 0. 0.]
[ 0. 0. 0.]]
v["db1"] = [[ 0.]
[ 0.]]
v["dW2"] = [[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]]
v["db2"] = [[ 0.]
[ 0.]
[ 0.]]
s["dW1"] = [[ 0. 0. 0.]
[ 0. 0. 0.]]
s["db1"] = [[ 0.]
[ 0.]]
s["dW2"] = [[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]]
s["db2"] = [[ 0.]
[ 0.]
[ 0.]]
练习: 现在,完成Adam的参数更新,回顾以下更新规则, l = 1 , . . . , L l = 1,...,L l=1,...,L:
v d W [ l ] = β 1 v d W [ l ] + ( 1 − β 1 ) ∂ J ∂ W [ l ] v_{dW^{[l]}}=β_1v_{dW^{[l]}} + (1-β_1) \frac{\partial{J}}{\partial{W^{[l]}}} vdW[l]=β1vdW[l]+(1−β1)∂W[l]∂J
v d W [ l ] c o r r e c t e d = v d W [ l ] 1 − ( β 1 ) t v^{corrected}_{dW^{[l]}}=\frac{v_{dW^{[l]}}}{1-(β_1)^t} vdW[l]corrected=1−(β1)tvdW[l]
s d W [ l ] = β 2 s d W [ l ] + ( 1 − β 2 ) ( ∂ J ∂ W [ l ] ) 2 s_{dW^{[l]}}=β_2s_{dW^{[l]}} + (1-β_2)(\frac{\partial{J}}{\partial{W^{[l]}}})^2 sdW[l]=β2sdW[l]+(1−β2)(∂W[l]∂J)2
s d W [ l ] c o r r e c t e d = s d W [ l ] 1 − ( β 2 ) 2 s^{corrected}_{dW^{[l]}}=\frac{s_{dW^{[l]}}}{1-(β_2)^2} sdW[l]corrected=1−(β2)2sdW[l]
W [ l ] = W [ l ] − α v d W [ l ] c o r r e c t e d s d W [ l ] c o r r e c t e d + ε W^{[l]} = W^{[l]} - α\frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}+ε}} W[l]=W[l]−αsdW[l]corrected+εvdW[l]corrected
Note 迭代器l
的for循环从0开始,第一个参数是 W [ 1 ] 和 b [ 1 ] W^{[1]}和b^ {[1]} W[1]和b[1]。你在编码时需要将l
转换为l+1
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
beta1=0.9, beta2=0.999, epsilon=1e-8):
"""
Update parameters using Adam
Arguments:
parameters -- python dictionary containing your parameters:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients for each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
v -- Adam variable, moving average of the first gradient, python dictionary
s -- Adam variable, moving average of the squared gradient, python dictionary
learning_rate -- the learning rate, scalar.
beta1 -- Exponential decay hyperparameter for the first moment estimates
beta2 -- Exponential decay hyperparameter for the second moment estimates
epsilon -- hyperparameter preventing division by zero in Adam updates
Returns:
parameters -- python dictionary containing your updated parameters
v -- Adam variable, moving average of the first gradient, python dictionary
s -- Adam variable, moving average of the squared gradient, python dictionary
"""
根据给出的公式,照着编写一下代码就可以了:
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
beta1=0.9, beta2=0.999, epsilon=1e-8):
"""
Update parameters using Adam
Arguments:
parameters -- python dictionary containing your parameters:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients for each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
v -- Adam variable, moving average of the first gradient, python dictionary
s -- Adam variable, moving average of the squared gradient, python dictionary
learning_rate -- the learning rate, scalar.
beta1 -- Exponential decay hyperparameter for the first moment estimates
beta2 -- Exponential decay hyperparameter for the second moment estimates
epsilon -- hyperparameter preventing division by zero in Adam updates
Returns:
parameters -- python dictionary containing your updated parameters
v -- Adam variable, moving average of the first gradient, python dictionary
s -- Adam variable, moving average of the squared gradient, python dictionary
"""
L = len(parameters) // 2
v_corrected = {}
s_corrected = {}
for l in range(L):
v["dW" + str(l + 1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads["dW" + str(l + 1)]
v["db" + str(l + 1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads["db" + str(l + 1)]
v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)] / (1 - np.power(beta1, t))
v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)] / (1 - np.power(beta1, t))
s["dW" + str(l + 1)] = beta2 * s["dW" + str(l + 1)] + (1 - beta2) * np.power(grads["dW" + str(l +1)], 2)
s["db" + str(l + 1)] = beta2 * s["db" + str(l + 1)] + (1 - beta2) * np.power(grads["db" + str(l + 1)], 2)
s_corrected["dW" + str(l + 1)] = s["dW" + str(l + 1)] / (1 - np.power(beta2, t))
s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)] / (1 - np.power(beta2, t))
parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * (v_corrected["dW" + str(l + 1)] / np.sqrt(s_corrected["dW" + str(l + 1)] + epsilon))
parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * (v_corrected["db" + str(l + 1)] / np.sqrt(s_corrected["db" + str(l + 1)] + epsilon))
return parameters, v, s
用以下代码测试一下你的函数是否正确:
# 测试函数update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8)
parameters, grads, v, s = update_parameters_with_adam_test_case()
parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t = 2)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
print("v[\"dW1\"] = " + str(v["dW1"]))
print("v[\"db1\"] = " + str(v["db1"]))
print("v[\"dW2\"] = " + str(v["dW2"]))
print("v[\"db2\"] = " + str(v["db2"]))
print("s[\"dW1\"] = " + str(s["dW1"]))
print("s[\"db1\"] = " + str(s["db1"]))
print("s[\"dW2\"] = " + str(s["dW2"]))
print("s[\"db2\"] = " + str(s["db2"]))
得到的结果如下:
W1 = [[ 1.63178673 -0.61919778 -0.53561312]
[-1.08040999 0.85796626 -2.29409733]]
b1 = [[ 1.75225313]
[-0.75376553]]
W2 = [[ 0.32648046 -0.25681174 1.46954931]
[-2.05269934 -0.31497584 -0.37661299]
[ 1.14121081 -1.09245036 -0.16498684]]
b2 = [[-0.88529978]
[ 0.03477238]
[ 0.57537385]]
v["dW1"] = [[-0.11006192 0.11447237 0.09015907]
[ 0.05024943 0.09008559 -0.06837279]]
v["db1"] = [[-0.01228902]
[-0.09357694]]
v["dW2"] = [[-0.02678881 0.05303555 -0.06916608]
[-0.03967535 -0.06871727 -0.08452056]
[-0.06712461 -0.00126646 -0.11173103]]
v["db2"] = [[0.02344157]
[0.16598022]
[0.07420442]]
s["dW1"] = [[0.00121136 0.00131039 0.00081287]
[0.0002525 0.00081154 0.00046748]]
s["db1"] = [[1.51020075e-05]
[8.75664434e-04]]
s["dW2"] = [[7.17640232e-05 2.81276921e-04 4.78394595e-04]
[1.57413361e-04 4.72206320e-04 7.14372576e-04]
[4.50571368e-04 1.60392066e-07 1.24838242e-03]]
s["db2"] = [[5.49507194e-05]
[2.75494327e-03]
[5.50629536e-04]]
现在你有3个工作良好的优化函数(mini-batch梯度下降,Momentun,Adam)。让我们为每一种优化算法完善一个模型并观察其中的差异。
让我们使用接下来的“moons”数据集来测试不同的优化方法。(数据集名称叫做"moons"是因为从两个数据集中的任何一个看起来都像月牙形的月亮)
用此句读取数据集并展示:
train_X, train_Y = load_dataset()
plt.show()
我们已经完成了一个3层神经网络,你将在上面进行训练:
update_parameters_with_gd()
initialize_velocity()
和update_parameters_with_momentum()
initialize_adam()
和update_parameters_with_adam()
我们给出如下函数:
def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64, beta=0.9,
beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, print_cost=True):
"""
3-layer neural network model which can be run in different optimizer modes.
Arguments:
X -- input data, of shape (2, number of examples)
Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
layers_dims -- python list, containing the size of each layer
learning_rate -- the learning rate, scalar.
mini_batch_size -- the size of a mini batch
beta -- Momentum hyperparameter
beta1 -- Exponential decay hyperparameter for the past gradients estimates
beta2 -- Exponential decay hyperparameter for the past squared gradients estimates
epsilon -- hyperparameter preventing division by zero in Adam updates
num_epochs -- number of epochs
print_cost -- True to print the cost every 1000 epochs
Returns:
parameters -- python dictionary containing your updated parameters
"""
L = len(layers_dims) # number of layers in the neural networks
costs = [] # to keep track of the cost
t = 0 # initializing the counter required for Adam update
seed = 10 # For grading purposes, so that your "random" minibatches are the same as ours
# Initialize parameters
parameters = initialize_parameters(layers_dims)
# Initialize the optimizer
if optimizer == "gd":
pass # no initialization required for gradient descent
elif optimizer == "momentum":
v = initialize_velocity(parameters)
elif optimizer == "adam":
v, s = initialize_adam(parameters)
# Optimization loop
for i in range(num_epochs):
# Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
seed = seed + 1
minibatches = random_mini_batches(X, Y, mini_batch_size, seed)
for minibatch in minibatches:
# Select a minibatch
(minibatch_X, minibatch_Y) = minibatch
# Forward propagation
a3, caches = forward_propagation(minibatch_X, parameters)
# Compute cost
cost = compute_cost(a3, minibatch_Y)
# Backward propagation
grads = backward_propagation(minibatch_X, minibatch_Y, caches)
# Update parameters
if optimizer == "gd":
parameters = update_parameters_with_gd(parameters, grads, learning_rate)
elif optimizer == "momentum":
parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
elif optimizer == "adam":
t = t + 1 # Adam counter
parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
t, learning_rate, beta1, beta2, epsilon)
# Print the cost every 1000 epoch
if print_cost and i % 1000 == 0:
print("Cost after epoch %i: %f" % (i, cost))
if print_cost and i % 100 == 0:
costs.append(cost)
# plot the cost
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('epochs (per 100)')
plt.title("Learning rate = " + str(learning_rate))
plt.show()
return parameters
接下来,你将在3层神经网络上运行3种优化方法。
从这里开始,请将前面加载数据集的代码更换成:
train_X, train_Y = load_dataset(is_plot=False)
在opt_utilsh中有一个错误,请你更正过来:
将initialize_parameters(layer_dims)
中初始化W时的np.sqrt(2 / layer_dims[l-1])
改为np.sqrt(2. / layer_dims[l-1])
否则会出现初始化为0的情况,使得成本降低慢、精确度低。
运行如下代码来看看Mini-batch梯度下降模型是怎么运作的:
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="gd")
# Predict
predictions = predict(train_X, train_Y, parameters)
# Plot decision boundary
plt.title("Model with Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
结果如下:
Cost after epoch 0: 0.690736
Cost after epoch 1000: 0.685273
Cost after epoch 2000: 0.647072
Cost after epoch 3000: 0.619525
Cost after epoch 4000: 0.576584
Cost after epoch 5000: 0.607243
Cost after epoch 6000: 0.529403
Cost after epoch 7000: 0.460768
Cost after epoch 8000: 0.465586
Cost after epoch 9000: 0.464518
Accuracy: 0.7966666666666666
跑一下以下的代码看看在Momentum的情况下如何。因为这个例子相对来说比较简单,使用动量的收益很小;但对于更复杂的问题,你可能会看到更大的收益。
# Momentum
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, beta=0.9, optimizer="momentum")
# Predict
predictions = predict(train_X, train_Y, parameters)
# Plot decision boundary
plt.title("Model with Momentum optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
得到结果如下:
Cost after epoch 0: 0.690741
Cost after epoch 1000: 0.685341
Cost after epoch 2000: 0.647145
Cost after epoch 3000: 0.619594
Cost after epoch 4000: 0.576665
Cost after epoch 5000: 0.607324
Cost after epoch 6000: 0.529476
Cost after epoch 7000: 0.460936
Cost after epoch 8000: 0.465780
Cost after epoch 9000: 0.464740
Accuracy: 0.7966666666666666
跑一下以下的代码:
# Adam
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="adam")
# Predict
predictions = predict(train_X, train_Y, parameters)
# Plot decision boundary
plt.title("Model with Adam optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
跑出来的结果如下:
Cost after epoch 0: 0.690552
Cost after epoch 1000: 0.185501
Cost after epoch 2000: 0.150830
Cost after epoch 3000: 0.074454
Cost after epoch 4000: 0.125959
Cost after epoch 5000: 0.104344
Cost after epoch 6000: 0.100676
Cost after epoch 7000: 0.031652
Cost after epoch 8000: 0.111973
Cost after epoch 9000: 0.197940
Accuracy: 0.94
Momentum通常是有帮助的,但考虑到学习率低和数据集过于简单,其影响基本可以忽略不计。
同样的,在成本上看到的巨大波动来自于这样一个事实:对于优化算法来说,mini-batches会比其他优化算法更难。
另一方面,Adam的表现明显优于mini-batch梯度下降和Momentum。如果在这个简单的数据集上运行更多循环,这三种方法都将产生非常好的结果。然而,你已经看到Adam已经收敛的更快了。
Adam的优点包括:
引用:
Adam论文:https://arxiv.org/pdf/1412.6980.pdf
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math
import sklearn
import sklearn.datasets
from course2.week2.opt_utils import load_params_and_grads, initialize_parameters, forward_propagation, backward_propagation
from course2.week2.opt_utils import compute_cost, predict, predict_dec, plot_decision_boundary, load_dataset
from course2.week2.testCase import *
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
def update_parameters_with_gd(parameters, grads, learning_rate):
"""
Update parameters using one step of gradient descent
Arguments:
parameters -- python dictionary containing your parameters to be updated:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients to update each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
learning_rate -- the learning rate, scalar.
Returns:
parameters -- python dictionary containing your updated parameters
"""
L = len(parameters) // 2
for l in range(L):
parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]
return parameters
# # 测试update_parameters_with_gd(parameters, grads, learning_rate)函数
# parameters, grads, learning_rate = update_parameters_with_gd_test_case()
#
# parameters = update_parameters_with_gd(parameters, grads, learning_rate)
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
"""
Creates a list of random minibatches from (X, Y)
Arguments:
X -- input data, of shape (input size, number of examples)
Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
mini_batch_size -- size of the mini-batches, integer
Returns:
mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
"""
np.random.seed(seed) # To make your "random" minibatches the same as ours
m = X.shape[1] # number of training examples
mini_batches = []
# Step 1: Shuffle (X, Y)
permutation = list(np.random.permutation(m))
shuffled_X = X[:, permutation]
shuffled_Y = Y[:, permutation].reshape((1, m))
# Step 2: Partition(shuffled_X, shuffled_Y),这一部分主要是将一整个训练集分割为相应数量的小集合
num_complete_minibatches = math.floor(m / mini_batch_size) # 首先算出以mini_batch_size为分割大小时mini_batch的数量
# 对num_complete_minibatches进行循环,一次处理一个分割后的训练集
for k in range(0, num_complete_minibatches):
# shuffled_X、shuffle_Y是具有竖着和横着两个维度的,竖着的代表该实例的每一个特征。我们不进行切割,所以用":"表示。
# 使用","隔开后,横向维度大小我们使用mini_batch_size进行切割,故用"k * mini_batch_size: (k + 1) * mini_batch_size"进行表示。
# “... : ..."是Python中的切片函数,不清楚可以了解一下
mini_batch_X = shuffled_X[:, k * mini_batch_size: (k + 1) * mini_batch_size]
mini_batch_Y = shuffled_Y[:, k * mini_batch_size: (k + 1) * mini_batch_size]
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
# 假设有6463个数据,则按照步骤2一开始的计算,可以分割出100个mini_batch,但是在后边还有63个数据,这个if就是为了防止这种情况
if m % mini_batch_size != 0:
# end = m - mini_batch_size * math.floor(m / mini_batch_size) # 个人认为这句不用也没多大影响。。但是给出的答案中有这句
mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
return mini_batches
# # 测试函数random_mini_batches(X, Y, mini_batch_size=64, seed=0)
# X_assess, Y_assess, mini_batch_size = random_mini_batches_test_case()
# mini_batches = random_mini_batches(X_assess, Y_assess, mini_batch_size)
#
# print("shape of the 1st mini_batch_X: " + str(mini_batches[0][0].shape))
# print("shape of the 2nd mini_batch_X: " + str(mini_batches[1][0].shape))
# print("shape of the 3rd mini_batch_X: " + str(mini_batches[2][0].shape))
# print("shape of the 1st mini_batch_Y: " + str(mini_batches[0][1].shape))
# print("shape of the 2nd mini_batch_Y: " + str(mini_batches[1][1].shape))
# print("shape of the 3rd mini_batch_Y: " + str(mini_batches[2][1].shape))
# print("mini batch sanity check: " + str(mini_batches[0][0][0][0:3]))
def initialize_velocity(parameters):
"""
Initializes the velocity as a python dictionary with:
- keys: "dW1", "db1", ..., "dWL", "dbL"
- values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
Arguments:
parameters -- python dictionary containing your parameters.
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
Returns:
v -- python dictionary containing the current velocity.
v['dW' + str(l)] = velocity of dWl
v['db' + str(l)] = velocity of dbl
"""
L = len(parameters) // 2
v = {}
for l in range(0, L):
v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])
return v
# # 测试函数initialize_velocity(parameters)
# parameters = initialize_velocity_test_case()
#
# v = initialize_velocity(parameters)
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
"""
Update parameters using Momentum
Arguments:
parameters -- python dictionary containing your parameters:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients for each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
v -- python dictionary containing the current velocity:
v['dW' + str(l)] = ...
v['db' + str(l)] = ...
beta -- the momentum hyperparameter, scalar
learning_rate -- the learning rate, scalar
Returns:
parameters -- python dictionary containing your updated parameters
v -- python dictionary containing your updated velocities
"""
L = len(parameters) // 2
for l in range(L):
v["dW" + str(l + 1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads["dW" + str(l + 1)]
v["db" + str(l + 1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads["db" + str(l + 1)]
parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW" + str(l + 1)]
parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v["db" + str(l + 1)]
return parameters, v
# # 测试函数update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
# parameters, grads, v = update_parameters_with_momentum_test_case()
#
# parameters, v = update_parameters_with_momentum(parameters, grads, v, beta = 0.9, learning_rate = 0.01)
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))
def initialize_adam(parameters):
"""
Initializes v and s as two python dictionaries with:
- keys: "dW1", "db1", ..., "dWL", "dbL"
- values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
Arguments:
parameters -- python dictionary containing your parameters.
parameters["W" + str(l)] = Wl
parameters["b" + str(l)] = bl
Returns:
v -- python dictionary that will contain the exponentially weighted average of the gradient.
v["dW" + str(l)] = ...
v["db" + str(l)] = ...
s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
s["dW" + str(l)] = ...
s["db" + str(l)] = ...
"""
L = len(parameters) // 2
v = {}
s = {}
for l in range(L):
v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])
s["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
s["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])
return v, s
# # 测试函数initialize_adam(parameters)
# parameters = initialize_adam_test_case()
#
# v, s = initialize_adam(parameters)
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))
# print("s[\"dW1\"] = " + str(s["dW1"]))
# print("s[\"db1\"] = " + str(s["db1"]))
# print("s[\"dW2\"] = " + str(s["dW2"]))
# print("s[\"db2\"] = " + str(s["db2"]))
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
beta1=0.9, beta2=0.999, epsilon=1e-8):
"""
Update parameters using Adam
Arguments:
parameters -- python dictionary containing your parameters:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients for each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
v -- Adam variable, moving average of the first gradient, python dictionary
s -- Adam variable, moving average of the squared gradient, python dictionary
learning_rate -- the learning rate, scalar.
beta1 -- Exponential decay hyperparameter for the first moment estimates
beta2 -- Exponential decay hyperparameter for the second moment estimates
epsilon -- hyperparameter preventing division by zero in Adam updates
Returns:
parameters -- python dictionary containing your updated parameters
v -- Adam variable, moving average of the first gradient, python dictionary
s -- Adam variable, moving average of the squared gradient, python dictionary
"""
L = len(parameters) // 2
v_corrected = {}
s_corrected = {}
for l in range(L):
v["dW" + str(l + 1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads["dW" + str(l + 1)]
v["db" + str(l + 1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads["db" + str(l + 1)]
v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)] / (1 - np.power(beta1, t))
v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)] / (1 - np.power(beta1, t))
s["dW" + str(l + 1)] = beta2 * s["dW" + str(l + 1)] + (1 - beta2) * np.power(grads["dW" + str(l +1)], 2)
s["db" + str(l + 1)] = beta2 * s["db" + str(l + 1)] + (1 - beta2) * np.power(grads["db" + str(l + 1)], 2)
s_corrected["dW" + str(l + 1)] = s["dW" + str(l + 1)] / (1 - np.power(beta2, t))
s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)] / (1 - np.power(beta2, t))
parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * (v_corrected["dW" + str(l + 1)] / np.sqrt(s_corrected["dW" + str(l + 1)] + epsilon))
parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * (v_corrected["db" + str(l + 1)] / np.sqrt(s_corrected["db" + str(l + 1)] + epsilon))
return parameters, v, s
# # 测试函数update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8)
# parameters, grads, v, s = update_parameters_with_adam_test_case()
# parameters, v, s = update_parameters_with_adam(parameters, grads, v, s, t = 2)
#
# print("W1 = " + str(parameters["W1"]))
# print("b1 = " + str(parameters["b1"]))
# print("W2 = " + str(parameters["W2"]))
# print("b2 = " + str(parameters["b2"]))
# print("v[\"dW1\"] = " + str(v["dW1"]))
# print("v[\"db1\"] = " + str(v["db1"]))
# print("v[\"dW2\"] = " + str(v["dW2"]))
# print("v[\"db2\"] = " + str(v["db2"]))
# print("s[\"dW1\"] = " + str(s["dW1"]))
# print("s[\"db1\"] = " + str(s["db1"]))
# print("s[\"dW2\"] = " + str(s["dW2"]))
# print("s[\"db2\"] = " + str(s["db2"]))
train_X, train_Y = load_dataset(is_plot=False)
# plt.show()
def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64, beta=0.9,
beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, print_cost=True):
"""
3-layer neural network model which can be run in different optimizer modes.
Arguments:
X -- input data, of shape (2, number of examples)
Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
layers_dims -- python list, containing the size of each layer
learning_rate -- the learning rate, scalar.
mini_batch_size -- the size of a mini batch
beta -- Momentum hyperparameter
beta1 -- Exponential decay hyperparameter for the past gradients estimates
beta2 -- Exponential decay hyperparameter for the past squared gradients estimates
epsilon -- hyperparameter preventing division by zero in Adam updates
num_epochs -- number of epochs
print_cost -- True to print the cost every 1000 epochs
Returns:
parameters -- python dictionary containing your updated parameters
"""
L = len(layers_dims) # number of layers in the neural networks
costs = [] # to keep track of the cost
t = 0 # initializing the counter required for Adam update
seed = 10 # For grading purposes, so that your "random" minibatches are the same as ours
# Initialize parameters
parameters = initialize_parameters(layers_dims)
# Initialize the optimizer
if optimizer == "gd":
pass # no initialization required for gradient descent
elif optimizer == "momentum":
v = initialize_velocity(parameters)
elif optimizer == "adam":
v, s = initialize_adam(parameters)
# Optimization loop
for i in range(num_epochs):
# Define the random minibatches. We increment the seed to reshuffle differently the dataset after each epoch
seed = seed + 1
minibatches = random_mini_batches(X, Y, mini_batch_size, seed)
for minibatch in minibatches:
# Select a minibatch
(minibatch_X, minibatch_Y) = minibatch
# Forward propagation
a3, caches = forward_propagation(minibatch_X, parameters)
# Compute cost
cost = compute_cost(a3, minibatch_Y)
# Backward propagation
grads = backward_propagation(minibatch_X, minibatch_Y, caches)
# Update parameters
if optimizer == "gd":
parameters = update_parameters_with_gd(parameters, grads, learning_rate)
elif optimizer == "momentum":
parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
elif optimizer == "adam":
t = t + 1 # Adam counter
parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
t, learning_rate, beta1, beta2, epsilon)
# Print the cost every 1000 epoch
if print_cost and i % 1000 == 0:
print("Cost after epoch %i: %f" % (i, cost))
if print_cost and i % 100 == 0:
costs.append(cost)
# plot the cost
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('epochs (per 100)')
plt.title("Learning rate = " + str(learning_rate))
plt.show()
return parameters
# # Mini-batch
# # train 3-layer model
# layers_dims = [train_X.shape[0], 5, 2, 1]
# parameters = model(train_X, train_Y, layers_dims, optimizer="gd")
#
# # Predict
# predictions = predict(train_X, train_Y, parameters)
#
# # Plot decision boundary
# plt.title("Model with Gradient Descent optimization")
# axes = plt.gca()
# axes.set_xlim([-1.5, 2.5])
# axes.set_ylim([-1, 1.5])
# plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
# # Momentum
# # train 3-layer model
# layers_dims = [train_X.shape[0], 5, 2, 1]
# parameters = model(train_X, train_Y, layers_dims, beta=0.9, optimizer="momentum")
#
# # Predict
# predictions = predict(train_X, train_Y, parameters)
#
# # Plot decision boundary
# plt.title("Model with Momentum optimization")
# axes = plt.gca()
# axes.set_xlim([-1.5, 2.5])
# axes.set_ylim([-1, 1.5])
# plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
# Adam
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="adam")
# Predict
predictions = predict(train_X, train_Y, parameters)
# Plot decision boundary
plt.title("Model with Adam optimization")
axes = plt.gca()
axes.set_xlim([-1.5, 2.5])
axes.set_ylim([-1, 1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)