Coursera | Andrew Ng (01-week-3-3.10) - Backpropagation Intuition

This series only adds personal study notes and supplementary derivations on top of the original course material; if you spot any mistakes, corrections are welcome. Having worked through Andrew Ng's course, I organized it into text to make review easier. Since I am also studying English, the series is presented mainly in English, and readers are encouraged to read it in English as well, as preparation for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom


Please credit the author and source when reposting: ZJ, WeChat official account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN:http://blog.csdn.net/junjun_zhao/article/details/79003068


3.10 Backpropagation Intuition

(Subtitle source: NetEase Cloud Classroom)

[Slide 1]

In the last video, you saw the equations for backpropagation. In this video, let's go over some intuition, using the computation graph, for how those equations were derived. This video is completely optional, so feel free to watch it or not; you should be able to do the homework either way. Recall that when we talked about logistic regression, we had a forward pass where we compute $z$, then $a$, and then the loss. Then, to take the derivatives, we had a backward pass where we first compute $da$, then go on to compute $dz$, and then go on to compute $dW$ and $db$. The definition of the loss was $L(a, y) = -y \log a - (1-y)\log(1-a)$.

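As a quick recap, following the notation used earlier in the course for logistic regression, the single-example forward pass being referred to is:

$$
z = w^T x + b,\qquad a = \sigma(z) = \frac{1}{1+e^{-z}},\qquad L(a,y) = -y\log a - (1-y)\log(1-a).
$$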

[Slide 2]

So, if you are familiar with calculus and you take the derivative of the loss with respect to $a$, that gives you the formula for $da$. If you actually work out the calculus, you can show that $da = -\frac{y}{a} + \frac{1-y}{1-a}$; you derive that just by taking the derivative of the loss. It turns out that when you take another step backwards to compute $dz$, we worked out that $dz = a - y$. I did explain why previously, but it follows from the chain rule of calculus that $dz = da \cdot g'(z)$, where $g(z) = \sigma(z)$ is the activation function for this output unit in logistic regression. So, just remember that this is still logistic regression, where we have $x_1, x_2, x_3$ and then just one sigmoid unit, which gives us $a$, which gives us $\hat{y}$. So here the activation function is a sigmoid function.

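As a one-line worked derivation of that $da$ formula, differentiating the loss term by term gives:

$$
da = \frac{\partial L}{\partial a}
= \frac{\partial}{\partial a}\Big[-y\log a - (1-y)\log(1-a)\Big]
= -\frac{y}{a} + \frac{1-y}{1-a}.
$$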

[Slide 3]

And as an aside, only for those of you familiar with the chain rule of calculus: the reason for this is that $a = \sigma(z)$, and so the partial of $L$ with respect to $z$ is equal to the partial of $L$ with respect to $a$ times $\frac{da}{dz}$. Since $a = \sigma(z)$, that last factor is $\frac{d}{dz} g(z)$, which is $g'(z)$. That's why this expression, which is dz in our code, is equal to this expression, which is da in our code, times $g'(z)$. So that last derivation will make sense only if you're familiar with calculus, and specifically the chain rule; but if not, don't worry about it.

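Putting the pieces together, and using the standard sigmoid derivative $\sigma'(z) = \sigma(z)\,(1-\sigma(z)) = a(1-a)$, the chain-rule step simplifies to the familiar result:

$$
dz = \frac{\partial L}{\partial a}\cdot\frac{da}{dz}
= \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) a(1-a)
= -y(1-a) + (1-y)a
= a - y.
$$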

[Slide 4]

I'll try to explain the intuition wherever it's needed. Finally, having computed $dz$ for logistic regression, we compute $dw$, which turns out to be $dz \cdot x$, and $db$, which is just $dz$, when you have a single training example. So that was logistic regression. What we're going to do when computing backpropagation for a neural network is a calculation a lot like this, only we'll do it twice, because now we don't have $x$ going straight to an output unit; $x$ goes to a hidden layer and then to an output unit. So instead of this computation being one step, as we have here, it will be two steps in this kind of neural network with two layers. In this two-layer neural network, we have the input layer, a hidden layer, and then the output layer.

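As a concrete illustration before moving to the two-layer case, here is a minimal NumPy sketch of the single-example logistic-regression forward and backward pass described above. The toy values and variable names are my own, chosen only for illustration; they are not from the course materials.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy single training example (hypothetical values).
x = np.array([1.0, 2.0, -1.0])   # three input features x1, x2, x3
y = 1.0                          # label
w = np.array([0.1, -0.2, 0.05])  # parameters w (same shape as x)
b = 0.0

# Forward pass: z -> a -> loss
z = np.dot(w, x) + b
a = sigmoid(z)
loss = -y * np.log(a) - (1 - y) * np.log(1 - a)

# Backward pass: da -> dz -> dw, db
da = -y / a + (1 - y) / (1 - a)
dz = da * a * (1 - a)            # chain rule: da * g'(z); simplifies to a - y
dw = dz * x                      # single example: dw = dz * x
db = dz

print(dz, a - y)                 # the two printed values should match (up to float error)
```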

Remember the steps of the computation. First you compute $z^{[1]}$ using this equation, then compute $a^{[1]}$, and then you compute $z^{[2]}$. Notice that $z^{[2]}$ also depends on the parameters $W^{[2]}$ and $b^{[2]}$. Then, based on $z^{[2]}$, compute $a^{[2]}$, and finally that gives you the loss. What backpropagation does is go backward to compute $da^{[2]}$ and then $dz^{[2]}$, then go back to compute $dW^{[2]}$ and $db^{[2]}$, then go backwards to compute $da^{[1]}$, $dz^{[1]}$, and so on. We don't need to take the derivative with respect to the input $x$, since the input $x$ for supervised learning is fixed; we're not trying to optimize $x$, so we won't bother to take derivatives with respect to $x$, at least for supervised learning. I'm going to skip explicitly computing $da^{[2]}$. If you want, you can actually compute $da^{[2]}$ and then use that to compute $dz^{[2]}$, but in practice you can collapse both of these steps into one step, so you end up with $dz^{[2]} = a^{[2]} - y$, same as before.

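Written out in the course's layer notation, the forward-pass steps being referred to (for a single example) are:

$$
z^{[1]} = W^{[1]} x + b^{[1]},\quad
a^{[1]} = g^{[1]}(z^{[1]}),\quad
z^{[2]} = W^{[2]} a^{[1]} + b^{[2]},\quad
a^{[2]} = \sigma(z^{[2]}),\quad
L(a^{[2]}, y).
$$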

[Slide 5]

And, also, I'm going to write $dW^{[2]}$ and $db^{[2]}$ down here below. You have $dW^{[2]} = dz^{[2]} \, a^{[1]T}$ and $db^{[2]} = dz^{[2]}$. This step is quite similar to logistic regression, where we had $dw = dz \cdot x$, except that now $a^{[1]}$ plays the role of $x$, and there's an extra transpose there, because of the relationship between the capital matrix $W$ and our individual parameter vector $w$: in the case of logistic regression with a single output, $W^{[2]}$ is like a row vector, whereas $w$ there was a column vector. That's why there's an extra transpose on $a^{[1]}$, whereas we didn't need one for $x$ in logistic regression. This completes half of backpropagation. Then, again, you can compute $da^{[1]}$ if you wish.


Although, in practice, the computation of $da^{[1]}$ and $dz^{[1]}$ is usually collapsed into one step, so what you'll actually implement is $dz^{[1]} = W^{[2]T} dz^{[2]} \; * \; g^{[1]\prime}(z^{[1]})$, where $*$ is an element-wise product. And let's just do a check on the dimensions. Suppose you have a network that looks like this, outputting $\hat{y}$. If you have $n_x = n^{[0]}$ input features, $n^{[1]}$ hidden units, and $n^{[2]}$ output units, where in our case $n^{[2]}$ is just one output unit, then the matrix $W^{[2]}$ is $(n^{[2]}, n^{[1]})$ dimensional, and $z^{[2]}$, and therefore $dz^{[2]}$, are going to be $(n^{[2]}, 1)$ dimensional.


[Slide 6]

That is really going to be $1 \times 1$ when we are doing binary classification, and $z^{[1]}$, and therefore also $dz^{[1]}$, are going to be $(n^{[1]}, 1)$ dimensional. Note that for any variable foo, foo and dfoo always have the same dimension; that's why $W$ and $dW$ always have the same dimension, and similarly for $b$ and $db$, $z$ and $dz$, and so on. To make sure the dimensions all match up, we have $dz^{[1]} = W^{[2]T} dz^{[2]} \; * \; g^{[1]\prime}(z^{[1]})$, where $*$ is an element-wise product. Matching the dimensions from above: the left side is $(n^{[1]}, 1)$; $W^{[2]T}$ is the transpose of $W^{[2]}$, so it is $(n^{[1]}, n^{[2]})$ dimensional; $dz^{[2]}$ is $(n^{[2]}, 1)$ dimensional; and this last term has the same dimension as $z^{[1]}$, which is also $(n^{[1]}, 1)$, hence the element-wise product.


[Slide 7]

The dimensions do make sense, right? An $(n^{[1]}, 1)$ dimensional vector can be obtained as an $(n^{[1]}, n^{[2]})$ dimensional matrix times an $(n^{[2]}, 1)$ dimensional vector, because the product of those two things gives you an $(n^{[1]}, 1)$ dimensional vector, and so this becomes the element-wise product of two $(n^{[1]}, 1)$ dimensional vectors, so the dimensions do match. One tip when implementing backprop: if you just make sure that the dimensions of your matrices match up, thinking through the dimensions of the various matrices, including $W^{[1]}, W^{[2]}, z^{[1]}, z^{[2]}, a^{[1]}, a^{[2]}$, and so on, and making sure the dimensions of these matrix operations match, that alone will often eliminate quite a lot of bugs in backprop.

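As a small illustration of that dimension-checking tip, here is a hypothetical NumPy sketch; the layer sizes, array names, and the choice of tanh for $g^{[1]}$ are my own, picked only to make the shapes concrete.

```python
import numpy as np

n0, n1, n2 = 3, 4, 1                  # example layer sizes: n^[0] inputs, n^[1] hidden, n^[2] outputs

rng = np.random.default_rng(0)
W2  = rng.standard_normal((n2, n1))   # W^[2] is (n^[2], n^[1])
dz2 = rng.standard_normal((n2, 1))    # dz^[2] is (n^[2], 1)
z1  = rng.standard_normal((n1, 1))    # z^[1]  is (n^[1], 1)

g1_prime = 1.0 - np.tanh(z1) ** 2     # derivative of tanh, assuming g^[1] = tanh

dz1 = np.dot(W2.T, dz2) * g1_prime    # (n1, n2) @ (n2, 1) -> (n1, 1), then element-wise *

# A cheap sanity check like this catches many shape bugs in backprop:
assert dz1.shape == z1.shape == (n1, 1)
```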

[Slide 8]

All right, this gives us $dz^{[1]}$. Finally, just to wrap up, there are $dW^{[1]}$ and $db^{[1]}$; we should write them here, I guess, but since I'm running out of space on the right of the slide, $dW^{[1]}$ and $db^{[1]}$ are given by the following formulas: $dW^{[1]} = dz^{[1]} x^T$ and $db^{[1]} = dz^{[1]}$. You might notice a similarity between these equations and the ones for layer two, which is really no coincidence, because $x$ plays the role of $a^{[0]}$, so $x^T$ is $a^{[0]T}$; those equations are actually very similar. That gives a sense for how backpropagation is derived. We have six key equations here, for $dz^{[2]}$, $dW^{[2]}$, $db^{[2]}$, $dz^{[1]}$, $dW^{[1]}$, and $db^{[1]}$.

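To make the six equations concrete, here is a minimal single-example NumPy sketch of the backward pass for this two-layer network. The variable names, the random toy data, and the choice of tanh for the hidden activation are my own illustrative assumptions, not prescribed by the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n0, n1, n2 = 3, 4, 1
rng = np.random.default_rng(1)

# Parameters and one training example (hypothetical values).
W1, b1 = rng.standard_normal((n1, n0)), np.zeros((n1, 1))
W2, b2 = rng.standard_normal((n2, n1)), np.zeros((n2, 1))
x, y = rng.standard_normal((n0, 1)), 1.0

# Forward pass (hidden activation g^[1] chosen as tanh for this sketch).
z1 = np.dot(W1, x) + b1
a1 = np.tanh(z1)
z2 = np.dot(W2, a1) + b2
a2 = sigmoid(z2)

# Backward pass: the six key equations for a single example.
dz2 = a2 - y                             # dz^[2] = a^[2] - y
dW2 = np.dot(dz2, a1.T)                  # dW^[2] = dz^[2] a^[1]T
db2 = dz2                                # db^[2] = dz^[2]
dz1 = np.dot(W2.T, dz2) * (1 - a1 ** 2)  # dz^[1] = W^[2]T dz^[2] * g^[1]'(z^[1])
dW1 = np.dot(dz1, x.T)                   # dW^[1] = dz^[1] x^T
db1 = dz1                                # db^[1] = dz^[1]
```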

[Slide 9]

Let me just take these six equations and copy them over to the next slide. Here they are. So far, we have written backpropagation as if you were training on a single training example at a time, but it should come as no surprise that, rather than working on a single example at a time, we would like to vectorize across different training examples. Remember that for forward propagation, when we were operating on one example at a time, we had equations like $a^{[1]} = g^{[1]}(z^{[1]})$. In order to vectorize, we took the $z$'s and stacked them up in columns, like this, up through $z^{[1](m)}$, and called this capital $Z^{[1]}$. Then we found that by stacking things up in columns and defining the capitalized versions, we just had $Z^{[1]} = W^{[1]}X + b^{[1]}$ and $A^{[1]} = g^{[1]}(Z^{[1]})$, right?

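Spelled out, the column-stacking convention from the earlier vectorization videos is:

$$
Z^{[1]} = \begin{bmatrix} z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)} \end{bmatrix},\qquad
A^{[1]} = \begin{bmatrix} a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)} \end{bmatrix},
$$

so that $Z^{[1]} = W^{[1]}X + b^{[1]}$ and $A^{[1]} = g^{[1]}(Z^{[1]})$, with each training example occupying one column.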

[Slide 10]

We defined the notation very carefully in this course to make sure that stacking examples into different columns of a matrix makes all this work out. It turns out that, if you go through the math carefully, the same trick also works for backpropagation, so the vectorized equations are as follows. First, if you take the $dz$'s for the different training examples and stack them up as the different columns of a matrix, and do the same for these other quantities, then this is the vectorized implementation; and then here's how you can compute $dW^{[2]}$. There is this extra $\frac{1}{m}$ because the cost function $J$ is $\frac{1}{m}$ times the sum, over $i = 1$ through $m$, of the losses.

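That cost function, and the reason the $\frac{1}{m}$ shows up in every derivative, is:

$$
J = \frac{1}{m}\sum_{i=1}^{m} L\big(a^{[2](i)},\, y^{(i)}\big)
\quad\Longrightarrow\quad
\frac{\partial J}{\partial W^{[2]}} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial L^{(i)}}{\partial W^{[2]}}.
$$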

[Slide 11]

When computing derivatives, we have that extra $\frac{1}{m}$ term, just as we did when we were computing the [inaudible] updates for logistic regression. That's the update you get for $db^{[2]}$: again, a sum over $dZ^{[2]}$, with a $\frac{1}{m}$. Then $dZ^{[1]}$ is computed as follows. Once again, this is an element-wise product, only whereas previously, on the earlier slide, this was an $(n^{[1]}, 1)$ dimensional vector, now it's an $(n^{[1]}, m)$ dimensional matrix; both of these are $(n^{[1]}, m)$ dimensional, which is why the asterisk denotes an element-wise product. And then, finally, there are the remaining two updates, which perhaps shouldn't look too surprising.

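Here is a minimal vectorized NumPy sketch of those update equations across $m$ examples. Again, the array names, the tanh hidden activation, and the random toy data are my own illustrative assumptions; a vectorized forward pass producing `A1` and `A2` is included so the block runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n0, n1, n2, m = 3, 4, 1, 5
rng = np.random.default_rng(2)

W1, b1 = rng.standard_normal((n1, n0)), np.zeros((n1, 1))
W2, b2 = rng.standard_normal((n2, n1)), np.zeros((n2, 1))
X = rng.standard_normal((n0, m))          # m examples stacked as columns
Y = rng.integers(0, 2, size=(1, m))       # binary labels

# Vectorized forward pass.
Z1 = np.dot(W1, X) + b1
A1 = np.tanh(Z1)
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)

# Vectorized backward pass, with the 1/m factor coming from the cost J.
dZ2 = A2 - Y                                          # (n2, m)
dW2 = (1 / m) * np.dot(dZ2, A1.T)                     # (n2, n1)
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)    # (n2, 1)
dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)               # (n1, m), element-wise *
dW1 = (1 / m) * np.dot(dZ1, X.T)                      # (n1, n0)
db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)    # (n1, 1)
```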

I hope that gives you some intuition for how the backpropagation algorithm is derived. In all of machine learning, I think the derivation of the backpropagation algorithm is actually one of the most complicated pieces of math I've seen, and it requires knowing both linear algebra as well as matrix derivatives to re-derive it from scratch from first principles. Even if you aren't an expert in matrix calculus, you can use this process to try to work through the derivation yourself; but I think there are actually plenty of deep learning practitioners who have seen the derivation at about the level you've seen in this video and already have all the right intuitions and are able to implement this algorithm very effectively. If you are an expert in calculus, do see if you can derive the whole thing from scratch; it is one of the hardest pieces of math, one of the hardest derivations, that I've seen in all of machine learning. Either way, if you implement this, it will work, and I think you have enough intuition to tune it and get it to work. There's just one last detail I want to share with you before you implement your neural network, which is how to initialize the weights of your neural network. It turns out that initializing your parameters not to zero but randomly is very important for training your neural network. In the next video, you'll see why.





PS: You are welcome to scan the code and follow the official account 「SelfImprovementLab」, which focuses on deep learning, machine learning, and artificial intelligence, and occasionally organizes group check-in activities around early rising, reading, exercise, English, and more.

[Image: QR code for the 「SelfImprovementLab」 official account]
