In plain gradient descent x += v, the update v at each step is v = - dx * lr, where dx is the first derivative of the objective func(x) with respect to x. With momentum, the update instead combines the current gradient step - dx * lr with the previous update v scaled by a factor momentum in [0, 1], i.e. v = - dx * lr + v * momentum.
Two things follow from this formula (see the sketch after these two cases):
When the current gradient step - dx * lr points in the same direction as the previous update v, the previous update accelerates the search along that direction.
When the current gradient step - dx * lr points in the opposite direction to the previous update v, the previous update slows the search down and damps the oscillation.
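A minimal sketch of this update rule, assuming a caller-supplied gradient function grad_func and the hyperparameters lr and momentum (these names are illustrative, not from the text above):

def sgd_momentum_step(x, v, grad_func, lr=1e-2, momentum=0.9):
    # One momentum step: v = -dx * lr + v * momentum, then x += v.
    dx = grad_func(x)              # first derivative of the objective at x
    v = -dx * lr + v * momentum    # current gradient step plus decayed previous update
    x = x + v
    return x, v

# Example: minimize func(x) = x ** 2, whose derivative is 2 * x.
x, v = 5.0, 0.0
for _ in range(100):
    x, v = sgd_momentum_step(x, v, grad_func=lambda x: 2 * x)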
import numpy as np

def rmsprop(w, dw, config=None):
    if config is None: config = {}
    learning_rate = config.get('learning_rate', 1e-2)
    decay_rate = config.get('decay_rate', 0.99)
    epsilon = config.get('epsilon', 1e-8)
    cache = config.get('cache', np.zeros_like(w))

    # Moving average of squared gradients, then scale the step by its root.
    cache = decay_rate * cache + (1 - decay_rate) * dw ** 2
    next_w = w - learning_rate * dw / (np.sqrt(cache) + epsilon)
    config['cache'] = cache
    return next_w, config
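Both update rules share the same calling convention: the returned config dict carries the running statistics and must be passed back in on the next call. A minimal usage sketch under that assumption (the toy objective sum(w ** 2) and the step count are illustrative, not from the original code):

w = np.array([5.0, -3.0])
config = None
for _ in range(200):
    dw = 2 * w                       # gradient of the toy objective sum(w ** 2)
    w, config = rmsprop(w, dw, config)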
def adam(w, dw, config=None):
    """
    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(w))
    config.setdefault('v', np.zeros_like(w))
    config.setdefault('t', 0)

    m = config['m']
    v = config['v']
    t = config['t'] + 1
    beta1 = config['beta1']
    beta2 = config['beta2']
    epsilon = config['epsilon']
    learning_rate = config['learning_rate']

    # Update biased first and second moment estimates.
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * (dw ** 2)
    # Bias correction: both moments start at zero, so rescale by 1 - beta ** t.
    mb = m / (1 - beta1 ** t)
    vb = v / (1 - beta2 ** t)
    next_w = w - learning_rate * mb / (np.sqrt(vb) + epsilon)

    config['m'] = m
    config['v'] = v
    config['t'] = t
    return next_w, config
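A quick sanity check of the bias correction (a sketch assuming the adam function above is in scope, with its default hyperparameters): on the very first step m = (1 - beta1) * dw, so mb = m / (1 - beta1 ** 1) = dw; likewise vb = dw ** 2, so the corrected moments match the raw gradient instead of values shrunk toward zero.

w = np.zeros(3)
dw = np.array([1.0, -2.0, 0.5])
next_w, config = adam(w, dw)
# With t = 1: mb == dw and vb == dw ** 2, so the step is
# -learning_rate * dw / (np.abs(dw) + epsilon), roughly -1e-3 * np.sign(dw).
print(next_w)   # approximately [-0.001, 0.001, -0.001]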