September 30 Computer Vision Fundamentals Study Notes: Optimization Algorithms

Table of Contents

  • Preface
  • 1. BGD, SGD, Mini-batch GD
  • 2. Momentum, NAG
  • 3. AdaGrad, RMSProp, AdaDelta
  • 4. Adam


Preface

These are the September 30 Computer Vision Fundamentals study notes on optimization algorithms, divided into four sections:

  • BGD, SGD, mini-batch GD;
  • Momentum, NAG;
  • AdaGrad, RMSProp, AdaDelta;
  • Adam.

1. BGD, SGD, Mini-batch GD

  • Batch gradient descent (BGD):
    $$\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta)$$
    $$\theta_{t+1} = \theta_t + \Delta\theta_t$$
    where $\theta$ denotes the weights and biases and $J$ is the loss function. The gradient is computed over the entire training set for every update.
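To make the update concrete, here is a minimal NumPy sketch of a full-batch training loop on a toy least-squares problem; the synthetic data, the `grad` helper, and the learning rate are illustrative assumptions, not anything from the lecture.

```python
import numpy as np

# Toy data for a 1-parameter least-squares problem: y ≈ 3 * x.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)

def grad(theta, X, y):
    """Gradient of the mean-squared-error loss J(theta) over the whole training set."""
    residual = X @ theta - y
    return X.T @ residual / len(y)

theta = np.zeros(1)
eta = 0.1                                     # learning rate
for t in range(100):
    theta = theta - eta * grad(theta, X, y)   # theta <- theta - eta * dJ/dtheta (full batch)

print(theta)                                  # approaches [3.0]
```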


  • Stochastic gradient descent (SGD): one training example per update, with the learning rate decayed over training:
    $$\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta, x^{(i)}, y^{(i)})$$
    $$\Delta\theta_t = -\eta \cdot g_{t,i}$$
    $$\theta_{t+1} = \theta_t + \Delta\theta_t$$

Even in the convex case, SGD updates fluctuate heavily. The fluctuation can carry gradient descent into a different, better local optimum, but it can also leave the gradient oscillating around a local optimum instead of settling.
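For comparison, a sketch of SGD on the same assumed toy problem, updating on one example at a time; the decay schedule below is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)

theta = np.zeros(1)
eta0 = 0.1
for epoch in range(20):
    eta = eta0 / (1.0 + 0.1 * epoch)          # decaying learning rate
    for i in rng.permutation(len(y)):         # one example (x_i, y_i) per update
        g = (X[i] @ theta - y[i]) * X[i]      # g_{t,i}: gradient on a single example
        theta = theta - eta * g               # delta_theta_t = -eta * g_{t,i}
print(theta)
```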

  • Mini-batch gradient descent:
    $$\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta, x^{(i:i+n)}, y^{(i:i+n)}), \qquad \text{batch size} = n$$
    Compared with SGD, this reduces the variance of the parameter updates.
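The same loop with mini-batches of size n, again on the assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)

theta = np.zeros(1)
eta, n = 0.1, 16                              # n = batch size
for epoch in range(50):
    order = rng.permutation(len(y))
    for start in range(0, len(y), n):
        idx = order[start:start + n]          # examples i : i + n
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (Xb @ theta - yb) / len(idx)
        theta = theta - eta * g
print(theta)
```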


2. Momentum, NAG

  • Momentum:
    $$v_t = \gamma v_{t-1} + \eta \cdot \nabla_{\theta} J(\theta)$$
    $$\theta = \theta - v_t$$

$\gamma$ is usually set to 0.9.
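A minimal sketch of the momentum update, assuming an illustrative ill-conditioned quadratic loss whose gradient is returned by `grad` (the objective and the hyperparameters are assumptions, not from the notes):

```python
import numpy as np

def grad(theta):
    """Gradient of an assumed quadratic loss J(theta) = 0.5 * theta^T A theta."""
    A = np.diag([10.0, 1.0])                  # deliberately ill-conditioned
    return A @ theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
eta, gamma = 0.01, 0.9                        # gamma is typically 0.9
for t in range(500):
    v = gamma * v + eta * grad(theta)         # v_t = gamma * v_{t-1} + eta * grad J(theta)
    theta = theta - v                         # theta = theta - v_t
print(theta)                                  # tends toward the minimum at [0, 0]
```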

  • Nesterov Accelerated Gradient (NAG):
    $$v_t = \gamma v_{t-1} + \eta \cdot \nabla_{\theta} J(\theta - \gamma v_{t-1})$$
    $$\Delta\theta_t = -v_t$$
    $$\theta_{t+1} = \theta_t + \Delta\theta_t$$
    The difference from momentum lies in where the gradient is computed: NAG first takes a provisional step along the current velocity $v$, evaluates the loss at that look-ahead position, and then takes the gradient there.
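The NAG version of the same assumed toy sketch; the only change relative to momentum is that the gradient is evaluated at the look-ahead point theta - gamma * v.

```python
import numpy as np

def grad(theta):
    """Gradient of the same assumed quadratic loss J(theta) = 0.5 * theta^T A theta."""
    A = np.diag([10.0, 1.0])
    return A @ theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
eta, gamma = 0.01, 0.9
for t in range(500):
    lookahead = theta - gamma * v             # provisional step along the current velocity
    v = gamma * v + eta * grad(lookahead)     # gradient taken at the look-ahead position
    theta = theta - v
print(theta)
```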

3. AdaGrad, RMSProp, AdaDelta

  • Adaptive Gradient (AdaGrad):
    $$h \leftarrow h + \frac{\partial L}{\partial \mathbf{W}} \odot \frac{\partial L}{\partial \mathbf{W}}$$
    $$\mathbf{W} \leftarrow \mathbf{W} - \eta \cdot \frac{1}{\sqrt{h}} \cdot \frac{\partial L}{\partial \mathbf{W}}$$
    or, written per parameter,
    $$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii}} + \epsilon} \cdot g_{t,i}$$
    Drawback: as training goes on, $h$ keeps growing, so the step size keeps shrinking; the parameters may effectively stop updating before the model has converged.
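An AdaGrad sketch on the same assumed quadratic objective; watching h accumulate makes the drawback above visible, since the effective step keeps shrinking.

```python
import numpy as np

def grad(theta):
    """Gradient of an assumed quadratic loss J(theta) = 0.5 * theta^T A theta."""
    A = np.diag([10.0, 1.0])
    return A @ theta

theta = np.array([1.0, 1.0])
h = np.zeros_like(theta)                      # running sum of squared gradients
eta, eps = 0.5, 1e-8
for t in range(1000):
    g = grad(theta)
    h = h + g * g                             # h <- h + g ⊙ g (elementwise square)
    theta = theta - eta * g / (np.sqrt(h) + eps)
print(theta)                                  # steps shrink as h accumulates
```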

  • Root Mean Square Propagation (RMSProp):
    $$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2$$
    $$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$
    The exponential moving average keeps the denominator from growing without bound, which avoids AdaGrad's vanishing step size.
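A sketch of RMSProp under the same assumptions (quadratic toy objective, illustrative hyperparameters):

```python
import numpy as np

def grad(theta):
    """Gradient of an assumed quadratic loss J(theta) = 0.5 * theta^T A theta."""
    A = np.diag([10.0, 1.0])
    return A @ theta

theta = np.array([1.0, 1.0])
Eg2 = np.zeros_like(theta)                    # E[g^2]: moving average of squared gradients
eta, gamma, eps = 0.01, 0.9, 1e-8
for t in range(1000):
    g = grad(theta)
    Eg2 = gamma * Eg2 + (1.0 - gamma) * g * g
    theta = theta - eta * g / np.sqrt(Eg2 + eps)
print(theta)
```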

  • AdaDelta:
    $$E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1-\gamma)\Delta\theta_t^2$$
    $$RMS[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}$$
    $$\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t}\, g_t$$
    The running RMS of past updates takes the place of the global learning rate $\eta$.
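An AdaDelta sketch under the same assumptions; note that no global learning rate appears in the update.

```python
import numpy as np

def grad(theta):
    """Gradient of an assumed quadratic loss J(theta) = 0.5 * theta^T A theta."""
    A = np.diag([10.0, 1.0])
    return A @ theta

theta = np.array([1.0, 1.0])
Eg2 = np.zeros_like(theta)                    # E[g^2]
Edx2 = np.zeros_like(theta)                   # E[delta_theta^2]
gamma, eps = 0.9, 1e-6
for t in range(2000):
    g = grad(theta)
    Eg2 = gamma * Eg2 + (1.0 - gamma) * g * g
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # -RMS[dtheta]_{t-1} / RMS[g]_t * g_t
    Edx2 = gamma * Edx2 + (1.0 - gamma) * dx * dx
    theta = theta + dx
print(theta)
```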


4. Adam

  • Adaptive Moment Estimation (Adam):
    $$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
    $$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
    $$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
    $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$
    • $m$ stabilizes the update direction (the momentum part);
    • $v$ makes the step size adaptive per parameter (the RMSProp part).
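Putting the two pieces together, a minimal Adam sketch on the same assumed quadratic objective; the learning rate and beta values below are illustrative choices, not from the notes.

```python
import numpy as np

def grad(theta):
    """Gradient of an assumed quadratic loss J(theta) = 0.5 * theta^T A theta."""
    A = np.diag([10.0, 1.0])
    return A @ theta

theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)                      # first moment  (the momentum part)
v = np.zeros_like(theta)                      # second moment (the RMSProp part)
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 1001):                      # t starts at 1 for the bias correction
    g = grad(theta)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)            # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)            # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)
```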
