A neural network repeatedly adjusts its parameters according to the loss function until it arrives at an approximately optimal solution.
An optimizer is a strategy for updating those parameters, guided by the loss function, so that they move toward the optimum faster and more accurately.
Symbol | Meaning | Notes |
---|---|---|
$W$ | Model parameters | |
$J(W)$ | Cost function | |
$\nabla J(W)$ | Gradient | Partial derivative of the cost function with respect to the model parameters |
$v$ | First-order momentum | Represents inertia |
$r$ | Second-order momentum | Controls the adaptive learning rate |
$q$ | Fully adaptive learning rate | Accumulator used for the fully adaptive learning rate |
$\eta$ | Learning rate | |
$(X, Y)$ | The entire training set | |
$(X^{(i)}, Y^{(i)})$ | The $i$-th training sample | |
$N$ | Total number of training samples | |
$\epsilon$ | Smoothing constant | Added for numerical stability, typically around $10^{-10}$ |
Gradient Descent (GD)
Move the parameters one step of size $\eta_t$ along the negative gradient of the cost:
$$W_{t+1} = W_{t} - \eta_{t}\cdot \nabla J(W_{t})$$
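Purely as an illustration (not from the article), here is a minimal NumPy sketch of this update, assuming a sum-of-squared-errors cost $J(W)=\tfrac{1}{2}\sum_i (x_i^\top W - y_i)^2$ so the gradient has a closed form; the function name `gd_step` and the cost choice are hypothetical.

```python
import numpy as np

def gd_step(W, X, Y, lr):
    """One batch gradient-descent step: W <- W - lr * grad J(W).

    Illustrative assumption: J(W) = 0.5 * sum_i (x_i @ W - y_i)^2,
    whose gradient is X.T @ (X @ W - Y).
    """
    grad = X.T @ (X @ W - Y)      # full-batch gradient of the assumed cost
    return W - lr * grad

# Tiny usage example on random data (hypothetical)
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 3)), rng.normal(size=100)
W = np.zeros(3)
for _ in range(10):
    W = gd_step(W, X, Y, lr=1e-3)
```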
Stochastic Gradient Descent (SGD)
Uniformly at random, pick a single sample $(X^{(i)}, Y^{(i)})$ to stand in for the whole training set, and scale its gradient by $N$:
$$W_{t+1} = W_{t} - \eta_{t}\cdot N\cdot \nabla J(W_{t}, X^{(i)}, Y^{(i)})$$
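Under the same assumed cost as the sketch above, a hedged sketch of the single-sample update; the factor $N$ follows the convention that one sample stands in for the whole sum.

```python
import numpy as np

def sgd_step(W, X, Y, lr, rng=None):
    """One SGD step: a uniformly chosen sample's gradient, scaled by N."""
    rng = np.random.default_rng() if rng is None else rng
    N = X.shape[0]
    i = rng.integers(N)                    # pick one sample uniformly at random
    x_i, y_i = X[i], Y[i]
    grad_i = x_i * (x_i @ W - y_i)         # per-sample gradient of the assumed cost
    return W - lr * N * grad_i             # N * grad_i approximates the full-batch gradient
```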
Mini-batch Gradient Descent (MBGD)
Each iteration uses $m$ samples to update the parameters:
$$W_{t+1} = W_{t} - \eta_{t}\cdot N\cdot \frac{1}{m} \sum_{i=1}^{m} \nabla J(W_{t}, X^{(i)}, Y^{(i)})$$
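A sketch of the mini-batch version under the same illustrative assumptions; `m` samples are drawn without replacement each iteration.

```python
import numpy as np

def mbgd_step(W, X, Y, lr, m, rng=None):
    """One mini-batch step: average m per-sample gradients, then scale by N."""
    rng = np.random.default_rng() if rng is None else rng
    N = X.shape[0]
    idx = rng.choice(N, size=m, replace=False)    # m samples for this iteration
    Xb, Yb = X[idx], Y[idx]
    grad = Xb.T @ (Xb @ W - Yb) / m               # (1/m) * sum of per-sample gradients
    return W - lr * N * grad
```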
Nesterov Accelerated Gradient (NAG)
$$V_{t} = \alpha \cdot V_{t-1} + \eta_{t} \nabla J(W_{t} - \alpha \cdot V_{t-1})$$
$$W_{t+1} = W_{t} - V_{t}$$
With a single randomly drawn sample (as in SGD), the first-order momentum is instead accumulated from the sampled gradient:
$$V_{t} = \alpha \cdot V_{t-1} + \eta_{t}\cdot N\cdot \nabla J(W_{t}, X^{(i)}, Y^{(i)})$$
$$W_{t+1} = W_{t} - V_{t}$$
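A minimal sketch of the look-ahead update, assuming a user-supplied `grad_fn` that returns $\nabla J$ at a given point (a hypothetical helper, not part of the article).

```python
import numpy as np

def nag_step(W, V, grad_fn, lr, alpha=0.9):
    """One Nesterov accelerated-gradient step.

    grad_fn(W) is assumed to return the gradient of the cost at W.
    """
    V_new = alpha * V + lr * grad_fn(W - alpha * V)   # gradient at the look-ahead point
    W_new = W - V_new
    return W_new, V_new
```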
Adaptive Gradient (AdaGrad)
Accumulate the squared gradients element-wise and scale the learning rate separately for each parameter:
$$g_{t}=\frac{1}{m} \sum_{i=1}^{m} \nabla J(W_{t}, X^{(i)}, Y^{(i)})$$
$$r_{t+1} = r_{t}+g_{t}\odot g_{t}$$
$$W_{t+1} = W_{t} -\frac{\eta_{t}}{\sqrt{r_{t+1} + \epsilon}}\odot g_{t}$$
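A sketch of this per-parameter scaling, where `g` is the current mini-batch gradient and `eps` is the stability constant from the symbol table; the accumulator `r` never decays.

```python
import numpy as np

def adagrad_step(W, r, g, lr, eps=1e-10):
    """One step of the per-parameter scaled update above.

    r accumulates squared gradients and never decays, so each parameter's
    effective learning rate shrinks over time.
    """
    r_new = r + g * g                            # r_{t+1} = r_t + g (.) g
    W_new = W - lr / np.sqrt(r_new + eps) * g    # element-wise scaled step
    return W_new, r_new
```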
Root Mean Square Propagation (RMSProp)
$$g_{t} = \nabla J(W_{t})$$
$$r_{t+1} = \beta r_{t} + (1-\beta) g_{t}\odot g_{t}$$
$$W_{t+1} = W_{t} -\frac{\eta_{t}}{\sqrt{r_{t+1} + \epsilon}}\odot g_{t}$$
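A sketch of the decayed accumulation, assuming a decay rate `beta` of 0.9 (an illustrative value, not specified in the article).

```python
import numpy as np

def rmsprop_step(W, r, g, lr, beta=0.9, eps=1e-10):
    """One RMSProp step with an exponentially decayed average of squared gradients."""
    r_new = beta * r + (1.0 - beta) * g * g      # r_{t+1} = beta * r_t + (1 - beta) * g (.) g
    W_new = W - lr / np.sqrt(r_new + eps) * g    # per-parameter scaled step
    return W_new, r_new
```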
AdaDelta (An Adaptive Learning Rate Method)
No global learning rate $\eta$ appears in this update; the step size is fully adaptive.
$$g_{t} = \nabla J(W_{t})$$
$$r_{t+1} = \beta r_{t} + (1-\beta) g_{t}\odot g_{t}$$
$$q_{t} = \alpha q_{t-1} + (1-\alpha)\left(\frac{\sqrt{q_{t-1} + \epsilon}}{\sqrt{r_{t} + \epsilon}} \odot g_{t}\right)^{2}$$
$$W_{t+1} = W_{t} - \frac{\sqrt{q_{t} + \epsilon}}{\sqrt{r_{t+1} + \epsilon}}\odot g_{t}$$
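A sketch that mirrors the four formulas exactly as written above; `r` and `q` hold the previous accumulators, and the decay rates `alpha` and `beta` are assumed illustrative values.

```python
import numpy as np

def adadelta_step(W, r, q, g, alpha=0.9, beta=0.9, eps=1e-10):
    """One step of the fully adaptive update above; no global learning rate is used.

    r and q hold the previous accumulators (r_t and q_{t-1} in the text).
    """
    r_new = beta * r + (1.0 - beta) * g * g                        # r_{t+1}
    step = np.sqrt(q + eps) / np.sqrt(r + eps) * g                 # ratio term inside q_t, as written
    q_new = alpha * q + (1.0 - alpha) * step * step                # q_t
    W_new = W - np.sqrt(q_new + eps) / np.sqrt(r_new + eps) * g    # W_{t+1}
    return W_new, r_new, q_new
```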