The previous part mainly covered how the data is processed during the forward pass: mean-subtraction preprocessing, weight initialization, batch normalization (BN) between the linear output and the activation function, and dropout before entering the next layer.
This part covers how the gradient descent step itself is handled during backpropagation, i.e. several optimization algorithms.
I really admire these researchers... For gradient descent you might have thought $w = w - \alpha\,dw$ was all there is to it, yet they still manage to get a lot of mileage out of it. Impressive!!
SGD here really just means mini-batch gradient descent: instead of processing the whole dataset, each iteration randomly samples batch_size examples from the full dataset, runs the forward and backward passes on them, and performs one gradient descent update.
Python code:

def sgd(w, dw, config=None):
    """
    Performs vanilla stochastic gradient descent.

    config format:
    - learning_rate: Scalar learning rate.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)   # {'learning_rate': 1e-2}

    w -= config['learning_rate'] * dw          # vanilla stochastic gradient descent
    return w, config
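The sampling of the mini-batch itself happens outside of sgd. Here is a minimal sketch of how one batch might be drawn per iteration; the names X_train, y_train, batch_size and forward_backward are placeholders made up purely for illustration:

import numpy as np

# Hypothetical data and batch size, purely for illustration.
X_train = np.random.randn(1000, 3072)          # 1000 samples, 3072 features each
y_train = np.random.randint(10, size=1000)     # 1000 labels in [0, 10)
batch_size = 64

# One iteration: draw a random mini-batch, then run forward/backward on it.
idx = np.random.choice(X_train.shape[0], batch_size, replace=False)
X_batch, y_batch = X_train[idx], y_train[idx]
# loss, dw = forward_backward(w, X_batch, y_batch)   # hypothetical model function
# w, config = sgd(w, dw, config)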
In statistics this is also known as a weighted moving average.
$v_t = \beta v_{t-1} + (1-\beta)\,\theta_t$
Assume $\beta = 0.9$,
which gives the exponentially weighted average shown in the figure:
Clearly, the larger $\beta$ is, the more weight $(1-\beta)\beta^n$ (i.e. $0.1\beta^n$ for $\beta = 0.9$) the value from $n$ days back carries, and the more days $n$ the average effectively covers. When $n = 10$ and $\beta = 0.9$, $0.9^{10} \approx 1/e$, and beyond that point the weights are small enough to be ignored.
Bias correction mainly targets the early phase of the estimate, when $v_t$ is still biased toward its zero initialization:
$v_t^{\text{corrected}} = \dfrac{\beta v_{t-1} + (1-\beta)\,\theta_t}{1-\beta^t}$
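A minimal sketch of how the exponentially weighted average behaves with and without bias correction; the series theta is made up purely for illustration:

import numpy as np

beta = 0.9
theta = 10 + np.random.randn(50)     # hypothetical daily measurements around 10

v = 0.0
for t, x in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * x    # plain exponentially weighted average
    v_corrected = v / (1 - beta**t)  # bias-corrected estimate
    if t <= 3:
        print(t, round(v, 3), round(v_corrected, 3))
# Early on, v is pulled toward its zero initialization, while v_corrected
# already sits near the level of the data; for large t the two coincide.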
Momentum: the idea is that under the weighted average the vertical oscillations average out to roughly zero, while the horizontal component consistently points in the direction of decreasing loss, so the gradient in that direction gets accelerated.
$v_{dW} = \beta v_{dW} + (1-\beta)\,dW$
$v_{db} = \beta v_{db} + (1-\beta)\,db$
$W = W - \alpha v_{dW},\qquad b = b - \alpha v_{db}$
Hyperparameters: $\alpha, \beta$, where $\beta$ is the exponentially weighted average coefficient and $\alpha$ is the learning rate.
Python code:

import numpy as np

def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.  # the exponentially weighted average coefficient
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.  # same shape as w
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))   # initialized to zero

    # Classic momentum update (CS231n style).
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v

    # Ng's formulation looks slightly different:
    # v = config['momentum'] * v + (1 - config['momentum']) * dw
    # next_w = w - config['learning_rate'] * v

    config['velocity'] = v
    return next_w, config
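On the difference flagged in the comment above: the two formulations should be equivalent up to a rescaling of the learning rate. Unrolling both recurrences (my own bookkeeping, not from the course), the CS231n-style update gives
$v_t = \mu v_{t-1} - \eta\,dW_t \;\Rightarrow\; \Delta W_t = v_t = -\eta \sum_{k\ge 0} \mu^k\,dW_{t-k},$
while Ng's version gives
$v_t = \beta v_{t-1} + (1-\beta)\,dW_t,\qquad \Delta W_t = -\alpha v_t = -\alpha(1-\beta)\sum_{k\ge 0}\beta^k\,dW_{t-k},$
so with $\mu = \beta$ and $\eta = \alpha(1-\beta)$ the two updates coincide.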
$S_{dW} = 0,\qquad S_{db} = 0$
$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\,(dW)^2$
$S_{db} = \beta_2 S_{db} + (1-\beta_2)\,(db)^2$
$W = W - \alpha\,\dfrac{dW}{\sqrt{S_{dW}}+\epsilon},\qquad b = b - \alpha\,\dfrac{db}{\sqrt{S_{db}}+\epsilon}$
How RMSprop works: suppose the parameter whose vertical oscillation we want to damp is $b$, and the oscillation is large, i.e. $|db|$ is large. After the exponentially weighted average, and unlike in momentum, $S_{db}$ is then a relatively large number, so $\dfrac{db}{\sqrt{S_{db}}+\epsilon}$ becomes relatively small and the update in the vertical direction shrinks.
Horizontally, the parameter we want to accelerate is $W$; when $dW$ is small, the same reasoning works in reverse.
Of course, in a high-dimensional space the directions that need damping might be $W_1, W_3, W_{10}, \dots$ while the directions that need accelerating might be $W_2, W_5, \dots$
def rmsprop(x, dx, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)    # learning rate
    config.setdefault('decay_rate', 0.99)       # decay rate of the exponentially weighted average
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))

    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx ** 2)
    next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])

    return next_x, config
Adam (Adaptive Moment Estimation)
For once Ng's formulas are written neatly enough that I won't retype them here... though I have no idea what that yhat in the middle is supposed to be...
The idea behind Adam is also simple: it combines momentum and RMSprop, and applies bias correction to both.
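Since the formulas are not reproduced above, here is the standard Adam update for a parameter $W$ written out (the textbook form, not copied from this post's source; the same applies to $b$):
$v_{dW} = \beta_1 v_{dW} + (1-\beta_1)\,dW,\qquad S_{dW} = \beta_2 S_{dW} + (1-\beta_2)\,(dW)^2$
$v^{\text{corrected}}_{dW} = \dfrac{v_{dW}}{1-\beta_1^t},\qquad S^{\text{corrected}}_{dW} = \dfrac{S_{dW}}{1-\beta_2^t}$
$W = W - \alpha\,\dfrac{v^{\text{corrected}}_{dW}}{\sqrt{S^{\text{corrected}}_{dW}}+\epsilon}$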
Hyperparameter settings:
$\alpha$ needs to be tuned.
$\beta_1 = 0.9$: the exponentially weighted average coefficient for $dW$.
$\beta_2 = 0.999$: the exponentially weighted average coefficient for $(dW)^2$.
$\epsilon = 10^{-8}$.
Python code:
def adam(x, dx, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.   # for dx
    - beta2: Decay rate for moving average of second moment of gradient.  # for dx**2
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.           # moving average of dx
    - v: Moving average of squared gradient.   # moving average of dx**2
    - t: Iteration number.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    config.setdefault('t', 0)

    config['t'] += 1   # t must be incremented on every call, otherwise the bias correction never decays
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx ** 2)

    # bias correction
    m_correct = config['m'] / (1 - config['beta1'] ** config['t'])
    v_correct = config['v'] / (1 - config['beta2'] ** config['t'])
    next_x = x - config['learning_rate'] * m_correct / (np.sqrt(v_correct) + config['epsilon'])

    return next_x, config
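All of these update rules share the same interface: they return the updated parameter plus a config dict that carries the optimizer state (velocity / cache / m, v, t), so the caller has to thread config back in on every step. A minimal sketch of that loop; the quadratic toy loss and its gradient are made up purely to have something to optimize:

import numpy as np

# Toy objective, purely for illustration: minimize ||x - target||^2.
target = np.array([1.0, -2.0, 3.0])
x = np.zeros(3)

config = None                          # adam() fills in its own defaults and state
for step in range(5000):
    dx = 2 * (x - target)              # gradient of the toy loss
    x, config = adam(x, dx, config)    # the returned config must be passed back in

print(x)                               # should end up close to target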
Learning rate decay. Two common schedules:
$\alpha = 0.95^{\,\text{epoch\_num}}\cdot\alpha_0$
$\alpha = \dfrac{1}{1+\text{decay\_rate}\cdot\text{epoch\_num}}\cdot\alpha_0$
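A minimal sketch of how these two schedules behave over the first few epochs; alpha0 and decay_rate are just example values:

alpha0 = 0.2          # initial learning rate, example value
decay_rate = 1.0      # example value

for epoch_num in range(5):
    alpha_exp = (0.95 ** epoch_num) * alpha0              # exponential decay
    alpha_inv = alpha0 / (1 + decay_rate * epoch_num)     # 1/t-style decay
    print(epoch_num, round(alpha_exp, 4), round(alpha_inv, 4))
# The decayed alpha would then overwrite config['learning_rate'] at the start of each epoch.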
The drawing style is soooo cute!!! People all over the world draw little stick figures the same way hahahhhhh
A saddle point is a point where the gradient is also zero, yet it is not an optimum.
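A standard textbook example (not from the post itself): for $f(x, y) = x^2 - y^2$, the gradient $\nabla f = (2x,\,-2y)$ vanishes at the origin, but $f$ increases along the $x$-axis and decreases along the $y$-axis, so $(0,0)$ is neither a minimum nor a maximum; it is a saddle point.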
By the way, ** on a NumPy array squares element-wise, which is exactly what dx**2 in the rmsprop/adam code above relies on:

import numpy as np
a = np.array([1, 2])
a**2          # array([1, 4])