https://github.com/torch/optim/blob/master/doc/intro.md
https://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate
http://cs231n.github.io/neural-networks-3/#sgd
http://www.jianshu.com/p/58b3fe300ecb
http://www.jianshu.com/p/d8222a84613c
When the learning rate is small, convergence to the minimum is slow.
When the learning rate is large, the search tends to oscillate.
The learning rate is a parameter that determines how much an updating step influences the current value of the weights, while weight decay is an additional term in the weight update rule that causes the weights to decay exponentially to zero if no other update is scheduled.
So let’s say that we have a cost or error function $E(w)$ that we want to minimize. Gradient descent tells us to modify the weights $w$ in the direction of steepest descent in $E$:

$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$$

where $\eta$ is the learning rate.
In order to effectively limit the number of free parameters in your model so as to avoid over-fitting, it is possible to regularize the cost function. An easy way to do that is by introducing a zero mean Gaussian prior over the weights, which is equivalent to changing the cost function to $\tilde{E}(w) = E(w) + \frac{\lambda}{2} w^2$. In practice this penalizes large weights and effectively limits the freedom in your model. The regularization parameter λ determines how you trade off the original cost E with the large weights penalization.
Applying gradient descent to this new cost function we obtain:

$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i} - \eta \lambda w_i$$

The new term $-\eta\lambda w_i$ coming from the regularization causes the weight to decay in proportion to its size.
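A minimal numeric sketch of this update (the quadratic cost and the names `grad_E`, `lr`, `lam` are illustrative assumptions, not taken from the linked posts):

```python
import numpy as np

def grad_E(w):
    # Hypothetical gradient of the original cost E(w); a simple quadratic bowl is assumed here.
    return 2.0 * w

lr, lam = 0.1, 0.01          # learning rate (eta) and regularization strength (lambda)
w = np.array([1.0, -2.0])    # initial weights

for _ in range(100):
    # w_i <- w_i - eta * dE/dw_i - eta * lambda * w_i
    # The extra -eta * lambda * w term is the weight decay: it shrinks every weight toward zero.
    w = w - lr * grad_E(w) - lr * lam * w
```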
When using gradient descent to minimize the objective function func(x) = x * x, the update rule is x += v, where each step's update amount v is v = -dx * lr, and dx is the first derivative of func(x) with respect to x. Intuitively, if lr is made to decay as the iterations proceed, the step size keeps shrinking and the oscillation is damped. This is where the learning-rate decay factor decay comes in:
The smaller decay is, the more slowly the learning rate decays; when decay = 0, the learning rate stays constant.
The larger decay is, the faster the learning rate decays; when decay = 1, the learning rate decays the fastest.
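The exact decay schedule is not spelled out above; the sketch below assumes the common form lr_i = lr_start / (1 + decay * i), which matches the behaviour just described (decay = 0 keeps the learning rate constant, a larger decay shrinks it faster):

```python
def dfunc(x):
    # First derivative of func(x) = x * x
    return 2.0 * x

lr_start, decay = 0.5, 0.1
x = 10.0

for i in range(100):
    # Assumed schedule: the learning rate shrinks as the iteration index i grows.
    lr = lr_start / (1.0 + decay * i)
    v = -dfunc(x) * lr   # this step's update amount
    x += v
```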
The concept of "momentum" comes from classical mechanics, where it denotes the accumulated effect of a force acting over time.
In plain gradient descent x += v, each step's update amount v is v = -dx * lr, where dx is the first derivative of the objective function func(x) with respect to x.
With momentum, each step's update amount v is instead the sum of this step's gradient descent term -dx * lr and the previous update v scaled by a factor momentum in [0, 1], i.e. v = -dx * lr + v * momentum.
When this step's gradient term -dx * lr points in the same direction as the previous update v, the previous update accelerates this step's search.
When this step's gradient term -dx * lr points in the opposite direction to the previous update v, the previous update slows this step's search down.
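A minimal sketch of this momentum update on the same toy objective func(x) = x * x (the hyperparameter values are illustrative):

```python
def dfunc(x):
    # First derivative of func(x) = x * x
    return 2.0 * x

lr, momentum = 0.1, 0.9
x, v = 10.0, 0.0

for _ in range(100):
    # v is the sum of this step's gradient term and the previous update scaled by momentum:
    #   same direction     -> the previous update accelerates the search
    #   opposite direction -> the previous update slows it down, damping the oscillation
    v = -dfunc(x) * lr + momentum * v
    x += v
```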