These notes summarize the optimizer section of Cao Jian's TensorFlow course (Peking University) on the Chinese University MOOC platform.
I had previously watched Andrew Ng's deep learning course and CS231n, both of which cover these optimizers and explain why each one works, but I find Cao Jian's presentation the easiest to memorize.
Notation:

- parameters to be optimized: $w$
- loss function: $loss$
- learning rate: $lr$
- one $batch$ is processed per iteration; $t$ is the total number of batch iterations so far

Parameter update steps (shared by all optimizers below):

1. compute the gradient of the loss with respect to the current parameters: $g_t = \nabla loss$
2. compute the first-order momentum $m_t$ and the second-order momentum $V_t$
3. compute the descent step: $\eta_t = lr \cdot m_t / \sqrt{V_t}$
4. update the parameters: $w_{t+1} = w_t - \eta_t$

First-order momentum: a function of the gradients.
Second-order momentum: a function of the squared gradients.

Each optimizer differs only in how it defines $m_t$ and $V_t$.
SGD (vanilla stochastic gradient descent, no momentum):

First-order momentum: $m_t = g_t$; second-order momentum: $V_t = 1$.

$$\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t$$

$$w_{t+1} = w_t - \eta_t = w_t - lr \cdot m_t / \sqrt{V_t} = w_t - lr \cdot g_t$$
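Below is a minimal runnable sketch of this update rule in TensorFlow 2. The toy loss $(w+1)^2$, the single scalar variable `w`, and the learning rate are my own illustrative choices, not from the course:

```python
import tensorflow as tf

lr = 0.2
w = tf.Variable(5.0)  # single trainable parameter (illustrative)

for t in range(50):
    with tf.GradientTape() as tape:
        loss = tf.square(w + 1.0)  # toy loss (w + 1)^2, minimum at w = -1
    g = tape.gradient(loss, w)     # g_t
    w.assign_sub(lr * g)           # w_{t+1} = w_t - lr * g_t

print(w.numpy())  # converges toward -1.0
```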
SGDM adds first-order momentum on top of SGD.

In SGDM, $m_t$ is an exponential moving average of the gradient directions over time ($\beta$ is a hyperparameter close to 1, commonly 0.9).

First-order momentum: $m_t = \beta \cdot m_{t-1} + (1-\beta) \cdot g_t$; second-order momentum: $V_t = 1$.

$$\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot m_t = lr \cdot (\beta \cdot m_{t-1} + (1-\beta) \cdot g_t)$$

$$w_{t+1} = w_t - \eta_t = w_t - lr \cdot (\beta \cdot m_{t-1} + (1-\beta) \cdot g_t)$$
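The same toy setup, now carrying the momentum term across iterations (a sketch; the hyperparameter values are assumptions):

```python
import tensorflow as tf

lr, beta = 0.2, 0.9
w = tf.Variable(5.0)
m = tf.Variable(0.0)  # first-order momentum m_t, initialized to 0

for t in range(50):
    with tf.GradientTape() as tape:
        loss = tf.square(w + 1.0)          # same toy loss as above
    g = tape.gradient(loss, w)
    m.assign(beta * m + (1.0 - beta) * g)  # m_t = beta * m_{t-1} + (1 - beta) * g_t
    w.assign_sub(lr * m)                   # w_{t+1} = w_t - lr * m_t

print(w.numpy())
```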
Adagrad adds second-order momentum on top of SGD.

Its second-order momentum is the cumulative sum of squared gradients from the first step up to now, so each parameter's effective learning rate shrinks monotonically as training proceeds.

First-order momentum: $m_t = g_t$; second-order momentum: $V_t = \sum_{\tau=1}^{t} g_\tau^2$.

$$\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t \Big/ \sqrt{\sum_{\tau=1}^{t} g_\tau^2}$$

$$w_{t+1} = w_t - \eta_t = w_t - lr \cdot g_t \Big/ \sqrt{\sum_{\tau=1}^{t} g_\tau^2}$$
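A sketch of the Adagrad update under the same toy setup (real implementations usually add a small epsilon inside the square root to avoid division by zero; it is omitted here to match the formula):

```python
import tensorflow as tf

lr = 1.0              # Adagrad tolerates a larger base lr since V_t only grows
w = tf.Variable(5.0)
v = tf.Variable(0.0)  # cumulative sum of squared gradients

for t in range(50):
    with tf.GradientTape() as tape:
        loss = tf.square(w + 1.0)
    g = tape.gradient(loss, w)
    v.assign_add(tf.square(g))         # V_t = V_{t-1} + g_t^2
    w.assign_sub(lr * g / tf.sqrt(v))  # w_{t+1} = w_t - lr * g_t / sqrt(V_t)

print(w.numpy())
```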
RMSProp also adds second-order momentum on top of SGD.

Here the second-order momentum is an exponential moving average of the squared gradients, so it represents an average over a recent window of time rather than the entire history.

First-order momentum: $m_t = g_t$; second-order momentum: $V_t = \beta \cdot V_{t-1} + (1-\beta) \cdot g_t^2$.

$$\eta_t = lr \cdot m_t / \sqrt{V_t} = lr \cdot g_t \Big/ \sqrt{\beta \cdot V_{t-1} + (1-\beta) \cdot g_t^2}$$

$$w_{t+1} = w_t - \eta_t = w_t - lr \cdot g_t \Big/ \sqrt{\beta \cdot V_{t-1} + (1-\beta) \cdot g_t^2}$$
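The Adagrad sketch above changes in exactly one line: the accumulator becomes an exponential moving average (beta = 0.9 is an assumed value):

```python
import tensorflow as tf

lr, beta = 0.1, 0.9
w = tf.Variable(5.0)
v = tf.Variable(0.0)  # exponential moving average of squared gradients

for t in range(50):
    with tf.GradientTape() as tape:
        loss = tf.square(w + 1.0)
    g = tape.gradient(loss, w)
    v.assign(beta * v + (1.0 - beta) * tf.square(g))  # V_t = beta * V_{t-1} + (1 - beta) * g_t^2
    w.assign_sub(lr * g / tf.sqrt(v))                 # w_{t+1} = w_t - lr * g_t / sqrt(V_t)

print(w.numpy())
```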
Adam combines SGDM's first-order momentum with RMSProp's second-order momentum, and applies bias correction to both.

First-order momentum: $m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t$

Bias-corrected first-order momentum: $\hat{m}_t = \dfrac{m_t}{1-\beta_1^t}$

Second-order momentum: $V_t = \beta_2 \cdot V_{t-1} + (1-\beta_2) \cdot g_t^2$

Bias-corrected second-order momentum: $\hat{V}_t = \dfrac{V_t}{1-\beta_2^t}$

$$\eta_t = lr \cdot \hat{m}_t / \sqrt{\hat{V}_t} = lr \cdot \dfrac{m_t}{1-\beta_1^t} \Big/ \sqrt{\dfrac{V_t}{1-\beta_2^t}}$$

$$w_{t+1} = w_t - \eta_t = w_t - lr \cdot \dfrac{m_t}{1-\beta_1^t} \Big/ \sqrt{\dfrac{V_t}{1-\beta_2^t}}$$
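Putting both momenta and the bias corrections together, still on the same toy problem (beta1 = 0.9 and beta2 = 0.999 are the usual defaults, assumed here):

```python
import tensorflow as tf

lr, beta1, beta2 = 0.1, 0.9, 0.999
w = tf.Variable(5.0)
m = tf.Variable(0.0)  # first-order momentum
v = tf.Variable(0.0)  # second-order momentum

for t in range(1, 51):  # t starts at 1 so the bias corrections are well defined
    with tf.GradientTape() as tape:
        loss = tf.square(w + 1.0)
    g = tape.gradient(loss, w)
    m.assign(beta1 * m + (1.0 - beta1) * g)             # m_t
    v.assign(beta2 * v + (1.0 - beta2) * tf.square(g))  # V_t
    m_hat = m / (1.0 - beta1 ** t)                      # bias-corrected m_t
    v_hat = v / (1.0 - beta2 ** t)                      # bias-corrected V_t
    w.assign_sub(lr * m_hat / tf.sqrt(v_hat))           # w_{t+1} = w_t - eta_t

print(w.numpy())
```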