A Summary of Common Optimizers

Optimizers in Brief

A neural network learns by repeatedly adjusting its parameters according to a loss function, until it reaches an approximately optimal solution.
An optimizer is the strategy that drives these parameter updates so that, guided by the loss function, they move toward the optimum faster and more accurately.

Notation

| Symbol | Meaning | Notes |
| --- | --- | --- |
| $W$ | model parameters | |
| $J(W)$ | cost function | |
| $\nabla J(W)$ | gradient | partial derivatives of the cost function with respect to the model parameters |
| $v$ | first-order momentum | captures inertia |
| $r$ | second-order momentum | controls the adaptive learning rate |
| $q$ | fully adaptive learning-rate accumulator | accumulates squared updates (used in Adadelta) |
| $\eta$ | learning rate | |
| $(X, Y)$ | the full training set | |
| $(X^{(i)}, Y^{(i)})$ | the $i$-th training sample | |
| $N$ | total number of training samples | |
| $\epsilon$ | smoothing constant | added for numerical stability; typically $10^{-10}$ |

Common Optimizers

GD

$$W_{t+1} = W_t - \eta_t \cdot \nabla J(W_t)$$
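
A minimal NumPy sketch of this update rule; the function `gd_step`, the gradient function `grad_J`, and the toy quadratic objective are illustrative assumptions, not from the original text:

```python
import numpy as np

def gd_step(w, grad_J, lr):
    """One gradient-descent step: W_{t+1} = W_t - eta * grad_J(W_t)."""
    return w - lr * grad_J(w)

# Toy objective J(W) = ||W||^2 / 2, whose gradient is simply W.
w = np.array([3.0, -2.0])
for _ in range(100):
    w = gd_step(w, lambda w: w, lr=0.1)
print(w)  # moves toward the minimizer [0, 0]
```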

SGD

Stochastic Gradient Descent (SGD)
A single sample $(X^{(i)}, Y^{(i)})$ is drawn uniformly at random to stand in for the whole training set, and its gradient is scaled by $N$ (with $J$ taken as the cost summed over all $N$ samples, $N$ times a single-sample gradient is an unbiased estimate of the full gradient):
$$W_{t+1} = W_t - \eta_t \cdot N \cdot \nabla J(W_t, X^{(i)}, Y^{(i)})$$
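
A minimal NumPy sketch of one SGD step under this convention (cost summed over all samples, hence the factor $N$). The least-squares toy problem, the name `sgd_step`, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def sgd_step(w, grad_J_i, X, Y, lr, rng):
    """One SGD step: pick one sample uniformly, scale its gradient by N."""
    N = len(X)
    i = rng.integers(N)                      # uniform random sample index
    return w - lr * N * grad_J_i(w, X[i], Y[i])

# Toy problem: least squares with per-sample cost (x·w - y)^2 / 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = X @ np.array([1.0, -2.0])
grad = lambda w, x, y: (x @ w - y) * x       # per-sample gradient
w = np.zeros(2)
for _ in range(2000):
    w = sgd_step(w, grad, X, Y, lr=1e-4, rng=rng)
print(w)  # approaches the true weights [1, -2]
```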

MBGD

Mini-batch Gradient Descent (MBGD)
Each iteration updates the parameters with a mini-batch of $m$ samples:
$$W_{t+1} = W_t - \eta_t \cdot N \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla J(W_t, X^{(i)}, Y^{(i)})$$
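
A minimal sketch of the mini-batch update, reusing the same toy least-squares setup as the SGD sketch above; the name `mbgd_step` and the batch size m=16 are illustrative assumptions:

```python
import numpy as np

def mbgd_step(w, grad_J_i, X, Y, lr, m, rng):
    """Average m per-sample gradients, scaled by N as in the text."""
    N = len(X)
    idx = rng.choice(N, size=m, replace=False)
    g = np.mean([grad_J_i(w, X[i], Y[i]) for i in idx], axis=0)
    return w - lr * N * g

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = X @ np.array([1.0, -2.0])
grad = lambda w, x, y: (x @ w - y) * x
w = np.zeros(2)
for _ in range(500):
    w = mbgd_step(w, grad, X, Y, lr=1e-4, m=16, rng=rng)
print(w)  # approaches [1, -2]
```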

NAG

Nesterov Accelerated Gradient (NAG). The gradient is evaluated at the look-ahead point $W_t - \alpha \cdot V_{t-1}$ rather than at $W_t$:
$$V_t = \alpha \cdot V_{t-1} + \eta_t \nabla J(W_t - \alpha \cdot V_{t-1})$$
$$W_{t+1} = W_t - V_t$$
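
A minimal NumPy sketch of NAG on a toy quadratic objective; the names and hyperparameters are illustrative assumptions:

```python
import numpy as np

def nag_step(w, v, grad_J, lr=0.1, alpha=0.9):
    """Nesterov step: evaluate the gradient at the look-ahead point w - alpha*v."""
    v_new = alpha * v + lr * grad_J(w - alpha * v)
    return w - v_new, v_new

# Toy objective J(W) = ||W||^2 / 2 (gradient = W).
w, v = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = nag_step(w, v, lambda w: w)
print(w)  # moves toward the minimizer [0, 0]
```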

SGD-M

$$V_t = \alpha \cdot V_{t-1} + \eta_t \cdot N \cdot \nabla J(W_t, X^{(i)}, Y^{(i)})$$
$$W_{t+1} = W_t - V_t$$
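
SGD-M is SGD with a first-order momentum (heavy-ball) term. A minimal sketch, reusing the toy least-squares setup from the SGD sketch above; function names and hyperparameters are illustrative assumptions:

```python
import numpy as np

def sgdm_step(w, v, grad_J_i, X, Y, lr, alpha, rng):
    """SGD with momentum: v <- alpha*v + lr*N*g_i ;  w <- w - v."""
    N = len(X)
    i = rng.integers(N)
    v_new = alpha * v + lr * N * grad_J_i(w, X[i], Y[i])
    return w - v_new, v_new

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = X @ np.array([1.0, -2.0])
grad = lambda w, x, y: (x @ w - y) * x
w, v = np.zeros(2), np.zeros(2)
for _ in range(2000):
    w, v = sgdm_step(w, v, grad, X, Y, lr=1e-4, alpha=0.9, rng=rng)
print(w)  # approaches [1, -2]
```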

AdaGrad

$$g_t = \frac{1}{m} \sum_{i=1}^{m} \nabla J(W_t, X^{(i)}, Y^{(i)})$$
$$r_{t+1} = r_t + g_t \odot g_t$$
$$W_{t+1} = W_t - \frac{\eta_t}{\sqrt{r_{t+1} + \epsilon}} \odot g_t$$
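
A minimal NumPy sketch of AdaGrad. For brevity it uses a full-batch gradient `grad(w)` in place of the mini-batch $g_t$ above; the toy quadratic objective and the learning rate are illustrative assumptions:

```python
import numpy as np

def adagrad_step(w, r, grad, lr=0.5, eps=1e-10):
    """AdaGrad: accumulate squared gradients, divide the step element-wise."""
    g = grad(w)
    r_new = r + g * g
    return w - lr / np.sqrt(r_new + eps) * g, r_new

# Toy objective J(W) = ||W||^2 / 2 (gradient = W).
w, r = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, r = adagrad_step(w, r, lambda w: w)
print(w)  # moves toward the minimizer [0, 0]
```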

RMSProp

Root Mean Square Propagation (RMSProp)
$$g_t = \nabla J(W_t)$$
$$r_{t+1} = \beta r_t + (1-\beta)\, g_t \odot g_t$$
$$W_{t+1} = W_t - \frac{\eta_t}{\sqrt{r_{t+1} + \epsilon}} \odot g_t$$
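
A minimal NumPy sketch of RMSProp on the same toy quadratic; $\beta = 0.9$ and the other hyperparameters are illustrative assumptions:

```python
import numpy as np

def rmsprop_step(w, r, grad, lr=0.01, beta=0.9, eps=1e-10):
    """RMSProp: exponential moving average of squared gradients."""
    g = grad(w)
    r_new = beta * r + (1 - beta) * g * g
    return w - lr / np.sqrt(r_new + eps) * g, r_new

w, r = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(500):
    w, r = rmsprop_step(w, r, lambda w: w)
print(w)  # moves toward the minimizer [0, 0]
```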

Adadelta

An Adaptive Learning Rate Method
$$g_t = \nabla J(W_t)$$
$$r_{t+1} = \beta r_t + (1-\beta)\, g_t \odot g_t$$
$$q_t = \alpha q_{t-1} + (1-\alpha)\left(\frac{\sqrt{q_{t-1} + \epsilon}}{\sqrt{r_t + \epsilon}} \odot g_t\right)^2$$
$$W_{t+1} = W_t - \frac{\sqrt{q_t + \epsilon}}{\sqrt{r_{t+1} + \epsilon}} \odot g_t$$
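
A minimal NumPy sketch of Adadelta. It follows the standard formulation in which the current step uses the previous update accumulator $q_{t-1}$; note that a larger $\epsilon$ (here 1e-6) than the $10^{-10}$ quoted above is commonly used so that the first steps are not vanishingly small. All names and hyperparameters are illustrative assumptions:

```python
import numpy as np

def adadelta_step(w, r, q, grad, alpha=0.95, beta=0.95, eps=1e-6):
    """Adadelta: RMS of past updates over RMS of past gradients sets the step size."""
    g = grad(w)
    r_new = beta * r + (1 - beta) * g * g                  # running avg of squared gradients
    delta = np.sqrt(q + eps) / np.sqrt(r_new + eps) * g    # parameter update
    q_new = alpha * q + (1 - alpha) * delta * delta        # running avg of squared updates
    return w - delta, r_new, q_new

w = np.array([3.0, -2.0])
r = q = np.zeros(2)
for _ in range(1000):
    w, r, q = adadelta_step(w, r, q, lambda w: w)
print(w)  # drifts toward the minimizer [0, 0] (Adadelta warms up slowly)
```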
