This series only adds personal study notes and supplementary derivations on top of the original course material; if there are errors, corrections are welcome. After studying Andrew Ng's course, I organized it into text for easier review. Since I have been studying English, this series is written mainly in English, and readers are encouraged to read mainly in English with Chinese as a supplement, to prepare for reading academic papers in related fields later on. - ZJ
Coursera course | deeplearning.ai | NetEase Cloud Classroom
Please credit the author and source when reposting: ZJ, WeChat public account 「SelfImprovementLab」
知乎:https://zhuanlan.zhihu.com/c_147249273
CSDN:http://blog.csdn.net/junjun_zhao/article/details/79002219
3.9 Gradient descent for neural networks
(Subtitle source: NetEase Cloud Classroom)
All right, I think that was an exciting video. In this video you'll see how to implement gradient descent for your neural network with one hidden layer. I'm going to just give you the equations you need to implement in order to get back propagation, or gradient descent, working, and then in the video after this one I'll give some more intuition about why these particular equations are the accurate equations, the correct equations, for computing the gradients you need for your neural network. So your neural network with a single hidden layer, for now, will have parameters $W^{[1]}$, $b^{[1]}$, $W^{[2]}$ and $b^{[2]}$. As a reminder, if you have $n_x$, or alternatively $n^{[0]}$, input features, $n^{[1]}$ hidden units and $n^{[2]}$ output units (in our example so far we've only had $n^{[2]} = 1$), then the matrix $W^{[1]}$ will be $n^{[1]} \times n^{[0]}$, and $b^{[1]}$ will be an $n^{[1]}$-dimensional vector.
So you can write that as an $n^{[1]} \times 1$ matrix, really a column vector. The dimensions of $W^{[2]}$ will be $n^{[2]} \times n^{[1]}$, and the dimension of $b^{[2]}$ will be $n^{[2]} \times 1$, where again, so far, we've only seen examples where $n^{[2]}$ is equal to 1, where you have just a single output unit. You also have a cost function for the neural network, and for now I'm just going to assume that you're doing binary classification. In that case the cost, as a function of your parameters, is $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}, y)$, the average of the loss over your $m$ training examples. Here $L$ is the loss when your neural network predicts $\hat{y}$ (this is really $a^{[2]}$) and the ground-truth label is $y$, and if you're doing binary classification, the loss function can be exactly what you used for logistic regression earlier. So to train the parameters of your algorithm, you need to perform gradient descent. When training a neural network, it is important to initialize the parameters randomly rather than to all zeros.
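As a concrete illustration of these shapes, here is a minimal sketch in numpy of initializing the parameters randomly (the layer sizes are hypothetical, and the reason for the random, rather than zero, initialization of the weight matrices is discussed in a later video):

```python
import numpy as np

# Hypothetical layer sizes: n^[0] input features, n^[1] hidden units, n^[2] output units
n_x, n_1, n_2 = 3, 4, 1

# Weight matrices start as small random values rather than zeros;
# the bias vectors can safely start at zero.
W1 = np.random.randn(n_1, n_x) * 0.01   # shape (n^[1], n^[0])
b1 = np.zeros((n_1, 1))                 # shape (n^[1], 1)
W2 = np.random.randn(n_2, n_1) * 0.01   # shape (n^[2], n^[1])
b2 = np.zeros((n_2, 1))                 # shape (n^[2], 1)
```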
We'll see later why that's the case, but after initializing the parameters to something, each loop of gradient descent will compute the predictions, so you basically compute $\hat{y}^{(i)}$ for $i = 1$ through $m$. Then you need to compute the derivatives: you need to compute $dW^{[1]}$, which is the derivative of the cost function with respect to the parameter $W^{[1]}$, and you need to compute another variable, which we'll call $db^{[1]}$, the derivative, or slope, of your cost function with respect to the variable $b^{[1]}$, and so on, and similarly for the other parameters $W^{[2]}$ and $b^{[2]}$. Finally, the gradient descent update is to update $W^{[1]}$ as $W^{[1]} := W^{[1]} - \alpha\, dW^{[1]}$, where $\alpha$ is the learning rate, and to update $b^{[1]}$ as $b^{[1]} := b^{[1]} - \alpha\, db^{[1]}$, and similarly for $W^{[2]}$ and $b^{[2]}$. Sometimes I use $:=$ and sometimes $=$; either notation works fine. So this is one iteration of gradient descent, and then you repeat this some number of times until your parameters look like they're converging.
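One iteration of this loop might look like the following sketch; forward_propagation and backward_propagation are hypothetical helper functions standing in for the equations given later in this video (a full sketch of both appears at the end of this post), and alpha and num_iterations are values you choose:

```python
alpha = 0.01                          # learning rate (illustrative value)
for i in range(num_iterations):       # num_iterations: how many gradient descent steps to run
    # Forward pass: compute the predictions A2 (y-hat) for all m examples at once
    Z1, A1, Z2, A2 = forward_propagation(X, W1, b1, W2, b2)
    # Backward pass: compute the derivatives of the cost with respect to each parameter
    dW1, db1, dW2, db2 = backward_propagation(X, Y, A1, A2, W2)
    # Gradient descent update (written ":=" in the lecture)
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
```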
In previous videos we talked about how to compute the predictions, how to compute the outputs, and we saw how to do that in a vectorized way as well. So the key is to know how to compute these partial derivative terms $dW^{[1]}$ and $db^{[1]}$, as well as the derivatives $dW^{[2]}$ and $db^{[2]}$. What I'd like to do is just give you the equations you need in order to compute these derivatives, and I'll defer to the next video, which is an optional video, to go into greater depth about how we came up with those formulas. So, to summarize again, the equations for forward propagation: you have $Z^{[1]} = W^{[1]}X + b^{[1]}$, then $A^{[1]} = g^{[1]}(Z^{[1]})$, the activation function for that layer applied element-wise to $Z^{[1]}$, and then $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$.
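Continuing the earlier sketch, these forward-propagation equations, vectorized over all $m$ examples with $X$ of shape $(n^{[0]}, m)$, might look like this in numpy; tanh is assumed for the hidden activation $g^{[1]}$, and sigmoid for the output layer, as discussed next:

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# X has shape (n^[0], m): one column per training example
Z1 = np.dot(W1, X) + b1      # shape (n^[1], m)
A1 = np.tanh(Z1)             # g^[1] applied element-wise; tanh assumed here
Z2 = np.dot(W2, A1) + b2     # shape (n^[2], m)
A2 = sigmoid(Z2)             # g^[2] is sigmoid for binary classification
```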
Finally, these are all vectorized across your training set, and $A^{[2]} = g^{[2]}(Z^{[2]})$. Again, for now, if we assume you're doing binary classification, then this activation function really should be the sigmoid function, so I'm just throwing that in here. That's the forward propagation, or the left-to-right forward computation, for your neural network. Next let's compute the derivatives, so this is the back propagation step. It computes $dZ^{[2]} = A^{[2]} - Y$, where $Y$ is the ground truth; just as a reminder, all of this is vectorized across examples, so the matrix $Y$ is a $1 \times m$ matrix that lists all of your $m$ labels stacked horizontally. Then it turns out that $dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T}$ (in fact these first three equations are very similar to gradient descent for logistic regression) and $db^{[2]} = \frac{1}{m}$ np.sum($dZ^{[2]}$, axis=1, keepdims=True). Just a little detail: np.sum is a Python numpy command for summing across one dimension of a matrix, in this case summing horizontally, and what keepdims does is prevent Python from outputting one of those funny rank-1 arrays whose dimension is $(n, )$. By setting keepdims=True, you ensure that Python outputs, for $db^{[2]}$, a vector that is $n \times 1$; in fact, technically, this will be $n^{[2]} \times 1$.
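In numpy, continuing the same sketch with $Y$ of shape $(1, m)$, the output-layer gradients and the effect of keepdims look like this:

```python
m = X.shape[1]                                         # number of training examples

dZ2 = A2 - Y                                           # shape (n^[2], m)
dW2 = (1.0 / m) * np.dot(dZ2, A1.T)                    # shape (n^[2], n^[1])
db2 = (1.0 / m) * np.sum(dZ2, axis=1, keepdims=True)   # shape (n^[2], 1)

# Without keepdims, np.sum(dZ2, axis=1) would return a rank-1 array of shape (n^[2],),
# which can silently break broadcasting in later calculations.
```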
Since $n^{[2]} = 1$ here, that's just a $1 \times 1$ number, so maybe it doesn't matter, but later on we'll see when it really matters. So far what we've done is very similar to logistic regression, but now, as you continue to run back propagation, you compute $dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$. This quantity $g^{[1]\prime}$ is the derivative of whatever activation function you use for the hidden layer; for the output layer I assume that you're doing binary classification with the sigmoid function, so that's already baked into the formula for $dZ^{[2]}$. The $*$ here is an element-wise product, so $W^{[2]T} dZ^{[2]}$ is going to be an $n^{[1]} \times m$ matrix, and this element-wise derivative $g^{[1]\prime}(Z^{[1]})$ is also going to be an $n^{[1]} \times m$ matrix.
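If the hidden activation $g^{[1]}$ is tanh, as assumed in the sketch above, its derivative is $1 - \tanh(z)^2 = 1 - A^{[1]2}$, so this element-wise product can be computed as:

```python
# dZ1 = W^[2]T dZ^[2] * g^[1]'(Z^[1]); both factors have shape (n^[1], m)
dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))   # "*" is the element-wise product
```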
So this $*$ is an element-wise product of two matrices. Finally, $dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^{T}$ and $db^{[1]} = \frac{1}{m}$ np.sum($dZ^{[1]}$, axis=1, keepdims=True). Whereas previously keepdims maybe mattered less, because with $n^{[2]}$ equal to 1 $db^{[2]}$ is just a $1 \times 1$ thing, a real number, here $db^{[1]}$ will be an $n^{[1]} \times 1$ vector, so you want np.sum to output something of this dimension rather than a funny-looking rank-1 array of dimension $(n^{[1]}, )$, which could end up messing up some of your later calculations. The other way would be to not use the keepdims parameter but to explicitly call reshape, to reshape the output of np.sum into the dimension you would like $db^{[1]}$ to have. So that was forward propagation in, I guess, four equations, and back propagation in, I guess, six equations.
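Continuing the sketch, the hidden-layer gradients follow the same pattern; the commented line shows the explicit reshape alternative mentioned above (with n_1 the number of hidden units):

```python
dW1 = (1.0 / m) * np.dot(dZ1, X.T)                     # shape (n^[1], n^[0])
db1 = (1.0 / m) * np.sum(dZ1, axis=1, keepdims=True)   # shape (n^[1], 1)

# Equivalent alternative without keepdims: reshape the rank-1 result explicitly
# db1 = (1.0 / m) * np.sum(dZ1, axis=1).reshape(n_1, 1)
```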
I know I just wrote down these equations, but in the next optional video let's go over some intuitions for how the six equations of the back propagation algorithm were derived. Please feel free to watch that or not, but either way, if you implement these algorithms, you will have a correct implementation of forward prop and backprop, and you'll be able to compute the derivatives you need in order to apply gradient descent to learn the parameters of your neural network. It is possible to implement this algorithm and get it to work without deeply understanding the calculus; a lot of successful deep learning practitioners do so. But if you want, you can also watch the next video just to get a bit more intuition about the derivation of these equations.
Summary: gradient descent for neural networks

Taking the shallow neural network of this section as an example, we give the formulas for gradient descent in a neural network.

Below are the back-propagation gradient descent formulas for this example, together with their vectorized code.
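The following is a sketch reconstructing them from the equations stated in the lecture; in the code, the hidden activation is assumed to be tanh, so $g^{[1]\prime}(Z^{[1]}) = 1 - A^{[1]2}$, and the variable names are illustrative.

$$
\begin{aligned}
&\text{Forward propagation:} && Z^{[1]} = W^{[1]}X + b^{[1]}, \quad A^{[1]} = g^{[1]}(Z^{[1]}), \\
& && Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}, \quad A^{[2]} = g^{[2]}(Z^{[2]}) = \sigma(Z^{[2]}) \\
&\text{Back propagation:} && dZ^{[2]} = A^{[2]} - Y \\
& && dW^{[2]} = \tfrac{1}{m}\, dZ^{[2]} A^{[1]T} \\
& && db^{[2]} = \tfrac{1}{m}\, \text{np.sum}(dZ^{[2]}, \text{axis}{=}1, \text{keepdims=True}) \\
& && dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]}) \\
& && dW^{[1]} = \tfrac{1}{m}\, dZ^{[1]} X^{T} \\
& && db^{[1]} = \tfrac{1}{m}\, \text{np.sum}(dZ^{[1]}, \text{axis}{=}1, \text{keepdims=True})
\end{aligned}
$$

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, W1, b1, W2, b2):
    """Vectorized forward pass; X has shape (n_x, m)."""
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)              # g^[1] assumed to be tanh
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)              # g^[2] is sigmoid for binary classification
    return Z1, A1, Z2, A2

def backward_propagation(X, Y, A1, A2, W2):
    """Vectorized gradients of the cost; Y has shape (1, m)."""
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = (1.0 / m) * np.dot(dZ2, A1.T)
    db2 = (1.0 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))   # tanh'(z) = 1 - tanh(z)^2
    dW1 = (1.0 / m) * np.dot(dZ1, X.T)
    db1 = (1.0 / m) * np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1, dW2, db2
```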
PS: You are welcome to follow the WeChat public account 「SelfImprovementLab」, which focuses on deep learning, machine learning, and artificial intelligence, along with occasional group check-in activities for early rising, reading, exercise, English, and more.