This series adds personal study notes and supplementary derivations on top of the original course material; if there are any errors, corrections are welcome. Having studied Andrew Ng's course, I organized it into text to make it easier to look up and review. Since I have been studying English, the series is mostly in English, and I also suggest readers rely on the English with the Chinese as support, as groundwork for reading academic papers in related fields later on. - ZJ
Coursera course | deeplearning.ai | NetEase Cloud Classroom
Please credit the author and source when reposting: ZJ, WeChat official account 「SelfImprovementLab」
知乎:https://zhuanlan.zhihu.com/c_147249273
CSDN:http://blog.csdn.net/junjun_zhao/article/details/79104432
2.7 RMSprop
(Subtitle source: NetEase Cloud Classroom)
You've seen how using momentum can speed up gradient descent. There's another algorithm called RMSprop, which stands for root mean square prop, that can also speed up gradient descent. Let's see how it works. Recall our example from before: if you implement gradient descent, you can end up with huge oscillations in the vertical direction, even while it's trying to make progress in the horizontal direction. In order to provide intuition for this example, let's say that the vertical axis is the parameter b and the horizontal axis is the parameter W. It could really be W1 and W2 or some other parameters, but let's call them b and W for the sake of intuition. So you want to slow down the learning in the b direction, or the vertical direction, and speed up learning, or at least not slow it down, in the horizontal direction.

This is what the RMSprop algorithm does to accomplish this. On iteration t, it will compute as usual the derivatives dW and db on the current mini-batch, and it will keep an exponentially weighted average. But instead of vdW, I'm going to use the new notation SdW. So SdW equals beta times its previous value plus (1 - beta) times dW squared. Sometimes this is written as dW**2 to denote exponentiation; here we'll just write it as dW squared. For clarity, this squaring is an element-wise operation, so what this does is keep an exponentially weighted average of the squares of the derivatives. Similarly, Sdb equals beta times Sdb plus (1 - beta) times db squared, and again the squaring is element-wise. Next, RMSprop updates the parameters as follows. W gets updated as W minus the learning rate alpha times, whereas previously we had just dW, now dW divided by the square root of SdW. And b gets updated as b minus the learning rate times, instead of just the gradient db, db divided by the square root of Sdb.
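As a concrete illustration of that update step, here is a minimal NumPy sketch. The function name `rmsprop_update` and its argument list are my own, chosen for illustration rather than taken from the course; beta = 0.9 and epsilon = 1e-8 follow the values discussed in this lesson.

```python
import numpy as np

def rmsprop_update(W, b, dW, db, S_dW, S_db, alpha=0.001, beta=0.9, epsilon=1e-8):
    """One RMSprop step on a single mini-batch (illustrative sketch)."""
    # Exponentially weighted average of the element-wise squared gradients
    S_dW = beta * S_dW + (1 - beta) * dW ** 2
    S_db = beta * S_db + (1 - beta) * db ** 2
    # Divide each gradient by the root of its running average;
    # epsilon keeps the denominator away from zero
    W = W - alpha * dW / (np.sqrt(S_dW) + epsilon)
    b = b - alpha * db / (np.sqrt(S_db) + epsilon)
    return W, b, S_dW, S_db
```

The caller keeps `S_dW` and `S_db` (initialized to zeros) across iterations and passes them back in each time, so the averages accumulate over mini-batches.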
So let's gain some intuition about how this works. Recall that in the horizontal direction, or in this example the W direction, we want learning to go pretty fast, whereas in the vertical direction, or in this example the b direction, we want to slow down the oscillations. So with the terms SdW and Sdb, what we're hoping is that SdW will be relatively small, so that here we're dividing by a relatively small number, whereas Sdb will be relatively large, so that there we're dividing by a relatively large number in order to slow down the updates in the vertical dimension. And indeed, if you look at the derivatives, they are much larger in the vertical direction than in the horizontal direction: the slope is very large in the b direction. So with derivatives like this, db is very large and dW is relatively small, because the function is sloped much more steeply in the vertical direction, that is the b direction, than in the W direction, the horizontal direction. So db squared will be relatively large, and Sdb will be relatively large, whereas dW, and hence dW squared, will be smaller, and so SdW will be smaller.

The net effect is that your updates in the vertical direction are divided by a much larger number, which helps damp out the oscillations, whereas the updates in the horizontal direction are divided by a smaller number. So the net impact of using RMSprop is that your updates end up looking more like this: the updates in the vertical direction get damped out, and in the horizontal direction you can keep going. One effect of this is that you can therefore use a larger learning rate alpha and get faster learning without diverging in the vertical direction. Now, just for the sake of clarity, I've been calling the vertical and horizontal directions b and W just to illustrate this.
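A quick numeric illustration of that intuition (the gradient magnitudes below are made up, and the running averages are taken on their very first step from a zero initialization, just to keep the arithmetic visible):

```python
import numpy as np

# Hypothetical single step: the gradient is steep in b, shallow in W
dW, db = 0.01, 1.0
beta = 0.9
S_dW = beta * 0.0 + (1 - beta) * dW ** 2   # running averages started at zero
S_db = beta * 0.0 + (1 - beta) * db ** 2

alpha, epsilon = 0.01, 1e-8
step_W = alpha * dW / (np.sqrt(S_dW) + epsilon)
step_b = alpha * db / (np.sqrt(S_db) + epsilon)
print(step_W, step_b)  # both come out near alpha / sqrt(1 - beta), about 0.032
```

Because each coordinate is scaled by the root of its own squared-gradient average, the steep b direction no longer dominates the step. Over many iterations the running averages smooth this into a per-coordinate scaling rather than the exact equalization seen in this one-step example.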
In practice, you're in a very high dimensional space of parameters, so maybe the vertical dimensions where you're trying to damp the oscillations are some set of parameters, W1, W2, W17, and the horizontal dimensions might be W3, W4, and so on. The separation into W and b is just for illustration. In practice, dW is a very high-dimensional parameter vector, and db is also a very high-dimensional parameter vector, but the intuition is that in the dimensions where you're getting these oscillations, you end up computing a larger weighted average of the squared derivatives, and so you end up damping out the directions in which there are these oscillations. So that's RMSprop, and it stands for root mean square because you're squaring the derivatives and then taking the square root at the end.

Finally, just a couple of last details on this algorithm before we move on. In the next video, we're actually going to combine RMSprop together with momentum. So rather than using the hyperparameter beta, which we had used for momentum, I'm going to call this hyperparameter beta_2, so that we don't use the same hyperparameter for both momentum and RMSprop. You also want to make sure that your algorithm doesn't divide by zero. What if the square root of SdW is very close to zero? Then things could blow up. Just to ensure numerical stability, when you implement this in practice you add a very, very small epsilon to the denominator. It doesn't really matter what epsilon is used; 10^-8 would be a reasonable default. This just ensures slightly greater numerical stability, so that, whether from numerical round-off or any other reason, you don't end up dividing by a very, very small number. So that's RMSprop. Similar to momentum, it has the effect of damping out the oscillations in gradient descent, in mini-batch gradient descent, and allowing you to maybe use a larger learning rate alpha, and certainly speeding up the learning speed of your algorithm. So now you know how to implement RMSprop, and this will be another way for you to speed up your learning algorithm.
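Putting the pieces together, here is a minimal sketch of how the update might sit inside a mini-batch loop, using the beta_2 naming from the lesson. The helpers `compute_gradients` and `mini_batches` are placeholders I've assumed for illustration, not part of the course code; 0.9 for beta_2 is a common choice for plain RMSprop.

```python
import numpy as np

def train_with_rmsprop(W, b, mini_batches, compute_gradients,
                       alpha=0.001, beta2=0.9, epsilon=1e-8):
    """Mini-batch gradient descent with RMSprop (illustrative sketch).

    compute_gradients(W, b, batch) is assumed to return (dW, db).
    """
    S_dW, S_db = np.zeros_like(W), np.zeros_like(b)
    for batch in mini_batches:
        dW, db = compute_gradients(W, b, batch)
        # Running averages of the element-wise squared gradients
        S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2
        S_db = beta2 * S_db + (1 - beta2) * db ** 2
        # Scaled updates; epsilon avoids dividing by a near-zero number
        W = W - alpha * dW / (np.sqrt(S_dW) + epsilon)
        b = b - alpha * db / (np.sqrt(S_db) + epsilon)
    return W, b
```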
One fun fact about RMSprop: it was actually first proposed not in an academic research paper, but in a Coursera course that Geoff Hinton taught on Coursera many years ago. I guess Coursera wasn't intended to be a platform for the dissemination of novel academic research, but it worked out pretty well in that case. It was really from that Coursera course that RMSprop started to become widely known, and it really took off. We've talked about momentum, and we've talked about RMSprop. It turns out that if you put them together, you can get an even better optimization algorithm. Let's talk about that in the next video.
RMSprop
In addition to the Momentum gradient descent method described above, RMSprop (root mean square prop) is another algorithm that can speed up gradient descent.

A sample implementation of the algorithm follows the update rules summarized in the equations below.

Here, assume the gradient of parameter b lies along the vertical axis and the gradient of parameter W along the horizontal axis (in practice, of course, the parameters live in a high-dimensional space). RMSprop reduces the large swings of the gradient updates in the dimensions that oscillate, so gradient descent can move faster along the useful direction.

In this implementation, RMSprop squares the derivative terms and divides each update by the square root of their running average. To make sure the algorithm never divides by zero, a very small value such as ε = 10^-8 is added to the square root in the denominator in practice.
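For reference, the full update can be written out as follows. This restates the formulas from the transcript above: the squaring and the square root are element-wise, and β₂ is the name the lesson uses to keep this hyperparameter distinct from momentum's β.

```latex
\begin{aligned}
S_{dW} &= \beta_2 S_{dW} + (1-\beta_2)\,(dW)^2, \qquad
S_{db} = \beta_2 S_{db} + (1-\beta_2)\,(db)^2,\\
W &:= W - \alpha\,\frac{dW}{\sqrt{S_{dW}} + \varepsilon}, \qquad
b := b - \alpha\,\frac{db}{\sqrt{S_{db}} + \varepsilon}, \qquad \varepsilon \approx 10^{-8}.
\end{aligned}
```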
References:
[1]. 大树先生.吴恩达 Coursera 深度学习课程 DeepLearning.ai 提炼笔记(2-2)– 优化算法