Trying Adam in place of plain gradient descent

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions.

Overview:
The name Adam comes from adaptive moment estimation. In probability theory, if a random variable X follows some distribution, its first moment is E(X), the mean, and its second (raw) moment is E(X^2), the mean of the squared values. Adam uses estimates of the first and second moments of the gradient of the loss with respect to each parameter to dynamically adapt a per-parameter learning rate. Adam is still a gradient-descent-based method, but the step taken for each parameter at each iteration stays within a bounded range, so a very large gradient does not translate into a very large step and the parameter values evolve relatively smoothly (see the short sketch below). In addition, it does not require a stationary objective, works with sparse gradients, and naturally performs a form of step-size annealing.
Paper reference: https://arxiv.org/pdf/1412.6980v8.pdf
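As a quick check of the "bounded step" claim above, here is a minimal sketch (the gradient value and loop length are made up for illustration, not taken from the paper): even when the gradient is enormous, the bias-corrected ratio m̂/(√v̂ + ε) has magnitude close to 1, so the actual parameter update stays on the order of the step size α.

import numpy as np

# Feed a deliberately huge constant gradient into Adam's moment estimates
# and watch the size of the resulting update (assumed toy values).
beta1, beta2, eps, alpha = 0.9, 0.999, 1e-8, 0.001
m, v = 0.0, 0.0
g = 1e6  # enormous gradient
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g          # biased 1st moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # biased 2nd (raw) moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected 1st moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected 2nd moment
    step = alpha * m_hat / (np.sqrt(v_hat) + eps)
    print(t, step)                           # stays near alpha = 0.001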
Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details, and for a slightly more efficient (but less clear) order of computation. g_t^2 indicates the elementwise square g_t ⊙ g_t. Good default settings for the tested machine learning problems are α = 0.001, β1 = 0.9, β2 = 0.999 and ε = 10^−8. All operations on vectors are element-wise. With β1^t and β2^t we denote β1 and β2 to the power t.

Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ_0: Initial parameter vector
m_0 ← 0 (Initialize 1st moment vector)
v_0 ← 0 (Initialize 2nd moment vector)
t ← 0 (Initialize timestep)
while θ_t not converged do
    t ← t + 1
    g_t ← ∇_θ f_t(θ_{t−1}) (Get gradients w.r.t. stochastic objective at timestep t)
    m_t ← β1 · m_{t−1} + (1 − β1) · g_t (Update biased first moment estimate)
    v_t ← β2 · v_{t−1} + (1 − β2) · g_t^2 (Update biased second raw moment estimate)
    m̂_t ← m_t / (1 − β1^t) (Compute bias-corrected first moment estimate)
    v̂_t ← v_t / (1 − β2^t) (Compute bias-corrected second raw moment estimate)
    θ_t ← θ_{t−1} − α · m̂_t / (√v̂_t + ε) (Update parameters)
end while
return θ_t (Resulting parameters)
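The bias correction in the two m̂_t and v̂_t lines exists because m and v start at zero, so for small t the raw moving averages are biased toward zero; dividing by (1 − β^t) undoes that. A tiny numerical check (the gradient value here is an assumption chosen for illustration):

beta1, g = 0.9, 2.0
m1 = beta1 * 0.0 + (1 - beta1) * g   # first-step estimate: 0.2, far below the true mean 2.0
m1_hat = m1 / (1 - beta1 ** 1)       # bias-corrected estimate: exactly 2.0
print(m1, m1_hat)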
In section 2 we describe the algorithm and the properties of its update rule. Section 3 explains our initialization bias correction technique, and section 4 provides a theoretical analysis of Adam's convergence in online convex programming. Empirically, our method consistently outperforms other methods for a variety of models and datasets, as shown in section 6. Overall, we show that Adam is a versatile algorithm that scales to large-scale high-dimensional machine learning problems.
Code:

import numpy as np

# One Adam update: x is the current parameter array, dx its gradient, and
# config stores the hyperparameters plus the running state m, v, t.
m = config['m'] * config['beta1'] + (1 - config['beta1']) * dx        # biased 1st moment estimate
v = config['v'] * config['beta2'] + (1 - config['beta2']) * dx * dx   # biased 2nd (raw) moment estimate
config['t'] += 1
mb = m / (1 - config['beta1'] ** config['t'])   # bias-corrected 1st moment
vb = v / (1 - config['beta2'] ** config['t'])   # bias-corrected 2nd moment
next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
config['m'] = m
config['v'] = v
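As a usage sketch, the update above can be wrapped into a function and run on a toy problem. The function name adam_step, the default config values, and the quadratic objective below are my own choices for illustration, not part of the original snippet:

import numpy as np

def adam_step(x, dx, config):
    # One Adam update, mirroring the snippet above.
    m = config['m'] * config['beta1'] + (1 - config['beta1']) * dx
    v = config['v'] * config['beta2'] + (1 - config['beta2']) * dx * dx
    config['t'] += 1
    mb = m / (1 - config['beta1'] ** config['t'])
    vb = v / (1 - config['beta2'] ** config['t'])
    next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
    config['m'], config['v'] = m, v
    return next_x

# Minimize f(x) = sum(x**2), whose gradient is 2*x.
x = np.array([5.0, -3.0])
config = {'learning_rate': 0.1, 'beta1': 0.9, 'beta2': 0.999, 'epsilon': 1e-8,
          'm': np.zeros_like(x), 'v': np.zeros_like(x), 't': 0}
for _ in range(500):
    x = adam_step(x, 2 * x, config)
print(x)  # both coordinates are driven toward 0

Because the per-step movement is bounded by roughly the learning rate, the trajectory stays stable even though the initial gradients are much larger than the learning rate.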
