This series only adds personal study notes and supplementary derivations on top of the original course material; if you find any errors, corrections are welcome. After working through Andrew Ng's course, I organized it into text to make review easier. Since I have been studying English, the series is primarily in English, and readers are encouraged to rely mainly on the English with the Chinese as support, as preparation for reading academic papers in this field later on. - ZJ
Coursera course | deeplearning.ai | 网易云课堂
Please credit the author and source when reposting: ZJ, WeChat public account 「SelfImprovementLab」
Zhihu: https://zhuanlan.zhihu.com/c_147249273
CSDN:http://blog.csdn.net/junjun_zhao/article/details/79102865
2.6 Gradient descent with Momentum (动量梯度下降法)
(Subtitle source: 网易云课堂)
There's an algorithm called Momentum, or gradient descent with Momentum, that almost always works faster than the standard gradient descent algorithm. In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead. In this video, let's unpack that one-sentence description and see how you can actually implement this.
As an example, let's say that you're trying to optimize a cost function which has contours like this, so the red dot denotes the position of the minimum. Maybe you start gradient descent here, and if you take one iteration of gradient descent, either batch or mini-batch gradient descent, maybe you end up heading there. But now you're on the other side of this ellipse, and you take another step of gradient descent; maybe you end up doing that. And then another step, another step, and so on. And you see that gradient descent will sort of take a lot of steps, right? Just slowly oscillating toward the minimum. And these up-and-down oscillations slow down gradient descent and prevent you from using a much larger learning rate. In particular, if you were to use a much larger learning rate, you might end up overshooting and diverging like so. And so the need to prevent the oscillations from getting too big forces you to use a learning rate that's not itself too large.
Another way of viewing this problem is that on the vertical axis you want your learning to be a bit slower, because you don't want those oscillations, but on the horizontal axis you want faster learning, because you want to aggressively move from left to right, toward that minimum, toward that red dot. So here's what you can do if you implement gradient descent with Momentum. On each iteration, or more specifically during iteration t, you would compute the usual derivatives dW, db. I'll omit the superscript square bracket [l]'s, but you compute dW, db on the current mini-batch. And if you're using batch gradient descent, then the current mini-batch would just be your whole batch; this works just as well for batch gradient descent, so if your current mini-batch is your entire training set, this works fine too. And then what you do is compute v_dW = β v_dW + (1 − β) dW. This is similar to when we were previously computing v_θ = β v_θ + (1 − β) θ_t; it's computing a moving average of the derivatives for W that you're getting. And then you similarly compute v_db = β v_db + (1 − β) db. And then you would update your weights: W gets updated as W minus the learning rate times v_dW, so instead of updating it with dW, the derivative, you update it with v_dW. And similarly, b gets updated as b − α v_db. So what this does is smooth out the steps of gradient descent.
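Written out, the update on iteration $t$ (with $dW$, $db$ computed on the current mini-batch) is:

$$
\begin{aligned}
v_{dW} &= \beta\, v_{dW} + (1-\beta)\, dW, &\qquad v_{db} &= \beta\, v_{db} + (1-\beta)\, db, \\
W &= W - \alpha\, v_{dW}, &\qquad b &= b - \alpha\, v_{db}.
\end{aligned}
$$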
For example, let's say that the last few derivatives you computed were this, this, this, this, this. If you average out these gradients, you find that the oscillations in the vertical direction will tend to average out to something closer to zero. So in the vertical direction, where you want to slow things down, this will average out positive and negative numbers, so the average will be close to zero. Whereas in the horizontal direction, all the derivatives are pointing to the right, so the average in the horizontal direction will still be pretty big. So that's why with this algorithm, after a few iterations, you find that gradient descent with Momentum ends up taking steps with much smaller oscillations in the vertical direction, while moving more quickly in the horizontal direction. And so this allows your algorithm to take a more straightforward path, or to damp out the oscillations in this path to the minimum.
One intuition for this Momentum, which works for some people but not everyone, is this: say you're trying to minimize a bowl-shaped function, right? These are really the contours of a bowl. I guess I'm not very good at drawing. When you're minimizing this type of bowl-shaped function, you can think of these derivative terms as providing acceleration to a ball that you're rolling downhill, and you can think of these Momentum terms as representing the velocity. So imagine that you have a bowl and you take a ball: the derivative imparts acceleration to this little ball, so the little ball rolls down this hill, right? It rolls faster and faster because of the acceleration. And β, because this number is a little bit less than one, plays the role of friction and prevents your ball from speeding up without limit. So rather than gradient descent taking every single step independently of all previous steps, now your little ball can roll downhill and gain Momentum; it can accelerate down this bowl and therefore gain Momentum. I find that this ball-rolling-down-a-bowl analogy seems to work for people who enjoy physics intuitions, but it doesn't work for everyone. So if this analogy of a ball rolling down the bowl doesn't work for you, don't worry about it.
Finally, let's look at some details on how you implement this. Here's the algorithm, and so you now have two hyperparameters: the learning rate α, as well as the parameter β, which controls your exponentially weighted average. The most common value for β is 0.9. Just as we were averaging over roughly the last ten days' temperature in the earlier example, this is averaging over the last ten iterations' gradients. And in practice, β = 0.9 works very well. Feel free to try different values and do some hyperparameter search, but 0.9 appears to be a pretty robust value. Well, and how about bias correction? That is, do you want to take v_dW and v_db and divide them by 1 − β^t? In practice, people don't usually do this, because after just ten iterations your moving average will have warmed up and is no longer a biased estimate.
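For reference, the bias-corrected estimates mentioned here would be

$$
v_{dW}^{\text{corrected}} = \frac{v_{dW}}{1-\beta^{t}}, \qquad
v_{db}^{\text{corrected}} = \frac{v_{db}}{1-\beta^{t}},
$$

where $t$ is the iteration number.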
So in practice, I don't really see people bothering with bias correction when implementing gradient descent with Momentum. And of course, this process initializes v_dW = 0; note that this is a matrix of zeros with the same dimensions as dW, which has the same dimensions as W. And v_db is also initialized to a vector of zeros, with the same dimensions as db, which in turn has the same dimensions as b. Finally, I just want to mention that if you read the literature on gradient descent with Momentum, you often see it with this 1 − β term omitted, so you end up with v_dW = β v_dW + dW. And the net effect of using this version in purple is that v_dW ends up being scaled up by a factor of 1/(1 − β).
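To make that scaling explicit (a quick derivation, not spelled out in the video): if the $1-\beta$ term is omitted,

$$
\tilde v_{dW} = \beta\, \tilde v_{dW} + dW
\quad\Longleftrightarrow\quad
(1-\beta)\,\tilde v_{dW} = \beta\,\big[(1-\beta)\,\tilde v_{dW}\big] + (1-\beta)\, dW,
$$

so $\tilde v_{dW} = v_{dW}/(1-\beta)$, and the update $W = W - \alpha'\,\tilde v_{dW}$ matches the original one when $\alpha' = (1-\beta)\,\alpha$.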
And so when you're performing these gradient descent updates, α just needs to be rescaled to compensate for that factor of 1/(1 − β). In practice, both versions work just fine; it just affects what the best value of the learning rate α is. But I find this particular formulation a little less intuitive, because one impact of it is that if you end up tuning the hyperparameter β, then this affects the scaling of v_dW and v_db as well, and so you may end up needing to retune the learning rate α too. So I personally prefer the formulation I have written here on the left, rather than leaving out the 1 − β term; that is, I tend to use the formula on the left, the one with the 1 − β term. In both versions, β = 0.9 is a common choice of hyperparameter; it's just that α, the learning rate, would need to be tuned differently for these two different versions. So that's it for gradient descent with Momentum. This will almost always work better than the straightforward gradient descent algorithm without Momentum. But there are still other things we can do to speed up your learning algorithm, and we'll continue talking about these in the next couple of videos.
Gradient Descent with Momentum (Notes)
The basic idea of gradient descent with momentum is to compute an exponentially weighted average of the gradients and use that average to update the weights.
When optimizing the cost function, take the function whose contours are shown in the figure below as an example:
When plain gradient descent is used to minimize this function, the iterates oscillate up and down as shown by the blue line. These relatively large oscillations slow gradient descent down and force us to use a fairly small learning rate.
If a larger learning rate is used, the iterates may overshoot and leave the region of the function, like the purple line; so to avoid this, only a smaller learning rate can be used.
What we actually want is for gradient descent to move slowly in the vertical direction of the figure, without such large up-and-down oscillations, and quickly in the horizontal direction, so that it reaches the minimum faster. Gradient descent with momentum achieves exactly this, as shown by the red line.
Algorithm implementation
A commonly used value for β is 0.9.
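A minimal NumPy sketch of the update described above (the dictionary layout with matching keys in `params` and `grads` is just an illustrative assumption, not the course's official notation):

```python
import numpy as np

def initialize_velocity(params):
    """v_dW and v_db start as zero arrays with the same shapes as W and b."""
    return {name: np.zeros_like(value) for name, value in params.items()}

def momentum_update(params, grads, v, alpha=0.01, beta=0.9):
    """One momentum step: v = beta*v + (1-beta)*grad, then param -= alpha*v."""
    for name in params:
        v[name] = beta * v[name] + (1 - beta) * grads[name]
        params[name] = params[name] - alpha * v[name]
    return params, v
```

Here `grads[name]` is assumed to hold the gradient of the cost with respect to `params[name]`, computed on the current mini-batch.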
When we run gradient descent with momentum, because of the exponentially weighted averaging, the original up-and-down oscillations along the vertical axis average out to nearly zero, so the movement in that direction becomes very small; in the horizontal direction, however, all the derivatives point the same way, so their average remains large. The result is the gradient descent trajectory shown by the red line.
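As a toy illustration of this effect (my own example, not from the course): minimize an elongated quadratic bowl with a learning rate that is too large for plain gradient descent along the steep axis; momentum averages out the oscillation and still converges.

```python
import numpy as np

# Quadratic bowl f(w) = 0.5 * (w1^2 + 25 * w2^2): gentle along axis 1, steep along axis 2.
def grad(w):
    return np.array([1.0, 25.0]) * w

def run(use_momentum, alpha=0.09, beta=0.9, steps=100):
    w = np.array([-4.0, 2.0])      # starting point
    v = np.zeros_like(w)           # velocity, initialized to zeros
    for _ in range(steps):
        g = grad(w)
        if use_momentum:
            v = beta * v + (1 - beta) * g
            w = w - alpha * v
        else:
            w = w - alpha * g
    return np.linalg.norm(w)       # distance from the minimum at the origin

print("plain gradient descent:", run(False))  # diverges: alpha * 25 > 2 on the steep axis
print("with momentum:        ", run(True))    # converges: oscillations are averaged out
```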
PS: You're welcome to follow the public account 「SelfImprovementLab」! It focuses on deep learning, machine learning, and artificial intelligence, with occasional group check-in activities around early rising, reading, exercise, English, and more.