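This series adds my personal study notes and supplementary derivations to selected topics from the original course. If you find any errors, corrections are welcome. After taking Andrew Ng's course, I wrote the material up as text so that it is easier to look up and review. Because I am also studying English, the series is written mainly in English, and I encourage readers to read the English first and fall back on Chinese only as a supplement, as preparation for reading academic papers in the field later on. - ZJ
Coursera course | deeplearning.ai | NetEase Cloud Classroom
Please credit the author and source when reposting: ZJ, WeChat official account "SelfImprovementLab"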
Zhihu: https://zhuanlan.zhihu.com/c_147249273
CSDN: http://blog.csdn.net/junjun_zhao/article/details/79122927
3.9 Training a Softmax classifier
(Subtitles source: NetEase Cloud Classroom)
In the last video, you learned about the Softmax layer and the Softmax activation function. In this video, you will deepen your understanding of Softmax classification and learn how to train a model that uses a Softmax layer. Recall our earlier example, where the output layer computes $z^{[L]}$. We have four classes, $C = 4$, so $z^{[L]}$ is a $(4, 1)$-dimensional vector, and we compute $t$, a temporary variable obtained by element-wise exponentiation of $z^{[L]}$. Finally, if the activation function for your output layer, $g^{[L]}$, is the Softmax activation function, then the output is the temporary variable $t$ normalized to sum to 1, and this becomes $a^{[L]}$. Notice that the biggest element of the $z$ vector was 5, and the biggest probability ends up being the first probability.
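The forward computation described above (exponentiate element-wise, then normalize) can be sketched in NumPy. This is a minimal illustration, not the course's reference code: the function name `softmax` is my own, the text only states that the largest entry of $z^{[L]}$ is 5, so the other three entries are assumed for illustration, and subtracting the column maximum before exponentiating is a numerical-stability step I have added that does not change the result.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax for z of shape (C, m)."""
    # Subtracting the column max is an added stability trick; the ratio is unchanged.
    t = np.exp(z - np.max(z, axis=0, keepdims=True))  # temporary variable t
    return t / np.sum(t, axis=0, keepdims=True)       # normalize so each column sums to 1

# Assumed example values for z^{[L]} with C = 4 (only the largest entry, 5, is from the text).
z_L = np.array([[5.0], [2.0], [-1.0], [3.0]])
a_L = softmax(z_L)

print(a_L.ravel().round(3))  # [0.842 0.042 0.002 0.114]
print(a_L.sum())             # sums to 1 up to floating-point error
```

With these assumed inputs, the largest probability lands in the first position, matching the position of the largest element of $z^{[L]}$.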
The name Softmax comes from contrasting it with what's called a hard max, which would have taken the vector $z$ and mapped it to a vector such as $[1, 0, 0, 0]^T$. The hard max function looks at the elements of $z$ and puts a 1 in the position of the biggest element of $z$ and 0s everywhere else. So this is a very hard max, where the biggest element gets an output of 1 and everything else gets an output of 0. In contrast, Softmax is a more gentle mapping from $z$ to these probabilities. I'm not sure if this is a great name, but at least that was the intuition behind why we call it Softmax: it stands in contrast to the hard max.
One thing I didn't really show but had alluded to is that Softmax regression, or the Softmax activation function, generalizes the logistic activation function to $C$ classes rather than just two classes. It turns out that Softmax with $C = 2$ essentially reduces to logistic regression. I'm not going to prove this in this video, but the rough outline of the proof is as follows. If $C = 2$ and you apply Softmax, then the output layer $a^{[L]}$ outputs two numbers, maybe 0.842 and 0.158. These two numbers always have to sum to 1, so they are redundant: you don't need to compute both of them, only one. And it turns out that the way you end up computing that number reduces to the way logistic regression computes its single output. That wasn't much of a proof, but the takeaway is that Softmax regression is a generalization of logistic regression to more than two classes.
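To make the contrast between hard max and Softmax concrete, and to check the two-class claim numerically, here is a small NumPy sketch. The helper names `hardmax`, `softmax`, and `sigmoid` are my own, and the input values are assumed for illustration (only the largest entry, 5, comes from the text). The last lines check that, for $C = 2$, the first Softmax output equals the logistic sigmoid applied to the difference $z_1 - z_2$.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax for z of shape (C, m)."""
    t = np.exp(z - np.max(z, axis=0, keepdims=True))
    return t / np.sum(t, axis=0, keepdims=True)

def hardmax(z):
    """Put a 1 at the largest element of each column and 0 everywhere else."""
    out = np.zeros_like(z)
    out[np.argmax(z, axis=0), np.arange(z.shape[1])] = 1.0
    return out

def sigmoid(x):
    """Logistic function used by logistic regression."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([[5.0], [2.0], [-1.0], [3.0]])  # assumed illustrative values
print(hardmax(z).ravel())                    # [1. 0. 0. 0.]
print(softmax(z).ravel().round(3))           # a gentler mapping to probabilities

# Two-class case: arbitrary illustrative values for z.
z_two = np.array([[1.2], [-0.5]])
p = softmax(z_two)
print(np.isclose(p[0, 0], sigmoid(z_two[0, 0] - z_two[1, 0])))  # True
```

Because the two outputs sum to 1, knowing the first one (a sigmoid of a single number) determines the second, which is the redundancy the outline of the proof relies on.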
Now let's look at how you would actually train a neural network with a Softmax output layer. In particular, let's define the loss function you use to train your neural network. Take an example in your training set where the target output, the ground truth label, is $y = [0, 1, 0, 0]^T$. In the example from the previous video, this means that this is an image of a cat, because it falls into class 1. Now let's say that your neural network is currently outputting $\hat{y}$, a vector of probabilities that sum to 1, equal to $[0.3, 0.2, 0.1, 0.4]^T$. You can check that this sums to 1, and this is going to be $a^{[L]}$. The neural network is not doing very well on this example, because this is actually a cat and it assigned only a 20% chance that this is a cat.
So what is the loss function you would want to use to train this neural network? In Softmax classification, the loss we typically use is $L(\hat{y}, y) = -\sum_{j=1}^{4} y_j \log \hat{y}_j$. In the general case the sum runs from 1 to $C$; we just use 4 here. Let's look at our single example to better understand what happens. Notice that in this example $y_1 = y_3 = y_4 = 0$ and only $y_2 = 1$. So if you look at this summation, all of the terms with $y_j = 0$ are equal to 0.
The only term you are left with is $-y_2 \log \hat{y}_2$, because when you sum over the indices $j$, all the terms end up 0 except when $j = 2$. And because $y_2 = 1$, this is just $-\log \hat{y}_2$. What this means is that if your learning algorithm is trying to make the loss small, because you use gradient descent to reduce the loss on your training set, then the only way to do so is to make $-\log \hat{y}_2$ small, and the only way to do that is to make $\hat{y}_2$ as big as possible. These are probabilities, so they can never be bigger than 1. But this makes sense: because $x$ for this example is a picture of a cat, you want that output probability to be as big as possible. More generally, what this loss function does is look at whatever the ground truth class is in your training set and try to make the corresponding probability of that class as high as possible. If you are familiar with maximum likelihood estimation in statistics, this turns out to be a form of maximum likelihood estimation. If you don't know what that means, don't worry about it; the intuition we just talked about will suffice.
That is the loss on a single training example. How about the cost $J$ on the entire training set? The cost of a setting of the parameters, that is, of all the weights and biases, is defined pretty much as you would guess: the loss on your learning algorithm's predictions, summed over your training examples and divided by $m$. What you then do is use gradient descent to try to minimize this cost.
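The loss and cost just described can be written in a few lines of NumPy. This is a sketch under stated assumptions: the function names `cross_entropy_loss` and `cost` are hypothetical, inputs are assumed to have shape $(C, m)$ with one example per column, and the cost is taken to be the average of the per-example losses (the $1/m$ convention). The example reuses the target $[0, 1, 0, 0]^T$ and prediction $[0.3, 0.2, 0.1, 0.4]^T$ from the text.

```python
import numpy as np

def cross_entropy_loss(y_hat, y):
    """Per-example loss -sum_j y_j * log(y_hat_j); inputs have shape (C, m)."""
    return -np.sum(y * np.log(y_hat), axis=0)

def cost(Y_hat, Y):
    """Cost J: the per-example losses averaged over the m columns (assumed 1/m convention)."""
    return np.mean(cross_entropy_loss(Y_hat, Y))

y = np.array([[0.0], [1.0], [0.0], [0.0]])       # ground truth: class 1 (cat)
y_hat = np.array([[0.3], [0.2], [0.1], [0.4]])   # network output from the text

loss = cross_entropy_loss(y_hat, y)
print(loss)  # only the j = 2 term survives: -log(0.2), roughly 1.609

# A prediction that puts more probability on the true class has a lower loss.
y_hat_better = np.array([[0.05], [0.9], [0.02], [0.03]])
print(cross_entropy_loss(y_hat_better, y) < loss)
```

Only the entry of $\hat{y}$ at the true class affects the loss, which is why minimizing it pushes that probability toward 1.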
Finally, one more implementation detail. Notice that because $C = 4$, $y$ is a $4 \times 1$ vector, and $\hat{y}$ is also a $4 \times 1$ vector. So if you are using a vectorized implementation, the matrix capital $Y$ is going to be $y^{(1)}, y^{(2)}, \ldots, y^{(m)}$ stacked horizontally. For example, if the example above is your first training example, then the first column of this matrix $Y$ will be $[0, 1, 0, 0]^T$; then maybe the second example is a dog, maybe the third example is none of the above, and so on. This matrix $Y$ ends up being a $4 \times m$ dimensional matrix. Similarly, $\hat{Y}$ is $\hat{y}^{(1)}$ through $\hat{y}^{(m)}$ stacked horizontally, where $\hat{y}^{(1)}$ is the output on the first training example, so the first column of $\hat{Y}$ is 0.3, 0.2, 0.1, 0.4, and so on. $\hat{Y}$ itself is also a $4 \times m$ dimensional matrix.
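As a rough sketch of the vectorized layout, the label matrix $Y$ can be built from integer class labels with NumPy. The helper name `one_hot` is hypothetical, and the class indices are assumptions for illustration: the text states only that class 1 is a cat, so the indices used here for "dog" (2) and "none of the above" (0) are my own choice.

```python
import numpy as np

def one_hot(labels, C):
    """Return a (C, m) matrix whose i-th column is the one-hot vector for labels[i]."""
    m = labels.shape[0]
    Y = np.zeros((C, m))
    Y[labels, np.arange(m)] = 1.0  # one 1 per column, in the row of the true class
    return Y

# Assumed class indices: 1 = cat (from the text), 2 = dog, 0 = none of the above.
labels = np.array([1, 2, 0])
Y = one_hot(labels, C=4)

print(Y.shape)   # (4, 3), i.e. C by m
print(Y[:, 0])   # [0. 1. 0. 0.], the first training example
```

Stacking examples as columns keeps $Y$ and $\hat{Y}$ the same shape, so the loss and the gradient can be computed for all $m$ examples with element-wise operations.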
Finally, let's take a look at how you would implement gradient descent when you have a Softmax output layer. This output layer computes $z^{[L]}$, which is $C \times 1$ (in our example, $4 \times 1$), and then you apply the Softmax activation function to get $a^{[L]}$, or $\hat{y}$. That in turn allows you to compute the loss. So we have talked about how to implement the forward propagation step of a neural network to get these outputs and to compute the loss. How about the backpropagation step, or gradient descent? It turns out that the key equation you need to initialize backprop is the derivative with respect to $z$ at the last layer: $dz^{[L]} = \hat{y} - y$, that is, $\hat{y}$, the $4 \times 1$ vector, minus $y$, the $4 \times 1$ vector. All of these are $4 \times 1$ vectors when you have 4 classes, and $C \times 1$ in the more general case. Going by our usual definition of $dz$, this is the partial derivative of the cost function with respect to $z^{[L]}$. If you are an expert in calculus, you can try to derive this yourself, but using this formula will also work just fine if you need to implement this from scratch. With this, you can compute $dz^{[L]}$ and then start off the backprop process to compute all the derivatives you need throughout your neural network.
In this week's programming exercise, however, we will start to use one of the deep learning programming frameworks, and with those frameworks you usually just need to focus on getting the forward prop right. So long as you specify the forward prop pass to the programming framework, the framework will figure out how to do the backward pass for you.
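One way to gain confidence in the formula $dz^{[L]} = \hat{y} - y$ without doing the calculus is a numerical gradient check on a single example. The sketch below compares the formula against central finite differences of the loss. The helper names and the values of $z$ are assumptions for illustration, not taken from the course.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax for z of shape (C, m)."""
    t = np.exp(z - np.max(z, axis=0, keepdims=True))
    return t / np.sum(t, axis=0, keepdims=True)

def loss(z, y):
    """Cross-entropy loss of softmax(z) against a one-hot target y."""
    return float(-np.sum(y * np.log(softmax(z))))

z = np.array([[5.0], [2.0], [-1.0], [3.0]])  # assumed illustrative values
y = np.array([[0.0], [1.0], [0.0], [0.0]])   # ground truth: class 1

# Gradient from the formula in the text.
dz_formula = softmax(z) - y

# Gradient from central finite differences, one coordinate at a time.
eps = 1e-6
dz_numeric = np.zeros_like(z)
for j in range(z.shape[0]):
    step = np.zeros_like(z)
    step[j, 0] = eps
    dz_numeric[j, 0] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)

print(np.allclose(dz_formula, dz_numeric, atol=1e-6))  # True
```

The same kind of check is useful whenever a hand-written backward pass is involved; with a framework that differentiates automatically it is usually unnecessary.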
So this expression is worth keeping in mind in case you ever need to implement Softmax regression, or Softmax classification, from scratch, although you won't actually need it in this week's programming exercise, because the programming framework you use will take care of this derivative computation for you. That's it for Softmax classification. With it, you can now implement learning algorithms that categorize inputs into not just one of two classes but one of $C$ different classes. Next, I want to show you some of the deep learning programming frameworks, which can make you much more efficient at implementing deep learning algorithms. Let's go on to the next video to discuss that.
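Understanding Softmax
Why is it called Softmax? Using the earlier example, $a^{[L]}$ is computed from $z^{[L]}$ as follows:
$$t = e^{z^{[L]}}, \qquad a^{[L]}_i = \frac{t_i}{\sum_{j=1}^{C} t_j}$$
Usually, we decide a model's output class by picking the class with the largest output value, that is, by putting a 1 in the position of the maximum and 0 in every other position. This is the so-called "hard max". Softmax instead turns the model's decision from the largest raw number, 5, into the largest probability, 0.842. Compared with the hard max, the output is more "soft" and less "hard".
Softmax regression generalizes logistic regression from binary classification to multi-class classification.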
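The reduction to logistic regression for $C = 2$ can be made explicit with a short calculation. This is my own supplementary derivation, writing $z_1, z_2$ for the two entries of $z^{[L]}$ and $\sigma$ for the logistic sigmoid:

$$a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2), \qquad a_2 = 1 - a_1$$

So the two-class Softmax output is a sigmoid applied to the single number $z_1 - z_2$, and the second output carries no extra information, which is why only one output needs to be computed.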
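The Softmax loss function
With a Softmax layer, the target $y$ and the probabilities $\hat{y}$ output at some point before training has finished are:
$$y = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \qquad \hat{y} = \begin{bmatrix} 0.3 \\ 0.2 \\ 0.1 \\ 0.4 \end{bmatrix}$$
The loss function used with Softmax is:
$$L(\hat{y}, y) = -\sum_{j=1}^{4} y_j \log \hat{y}_j$$
During training, the goal is to minimize the loss function. From the target we know that $y_1 = y_3 = y_4 = 0$ and $y_2 = 1$, so substituting into $L(\hat{y}, y)$ gives:
$$L(\hat{y}, y) = -y_2 \log \hat{y}_2 = -\log \hat{y}_2$$
So minimizing the loss function amounts to making the probability $\hat{y}_2$ as large as possible.
In other words, this loss function finds the true class of each example in your training set and makes the probability assigned to that class as high as possible. This is a form of maximum likelihood estimation.
The corresponding cost function is:
$$J(W^{[1]}, b^{[1]}, \ldots) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$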
Gradient descent with Softmax
The gradient at the Softmax layer is:
$$dz^{[L]} = \hat{y} - y$$
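For readers who want to see where this gradient comes from, here is a short derivation for a single example. It is my own supplement rather than part of the lecture, writing $a_j$ for the entries of $a^{[L]} = \hat{y}$ and $z_j$ for the entries of $z^{[L]}$. Since $\log a_j = z_j - \log \sum_{k} e^{z_k}$, the loss can be written as

$$L = -\sum_{j=1}^{C} y_j \log a_j = -\sum_{j=1}^{C} y_j z_j + \Big(\sum_{j=1}^{C} y_j\Big) \log \sum_{k=1}^{C} e^{z_k}$$

Because $y$ is one-hot, $\sum_j y_j = 1$, so differentiating with respect to $z_i$ gives

$$\frac{\partial L}{\partial z_i} = -y_i + \frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}} = a_i - y_i$$

which is the $i$-th entry of $\hat{y} - y$.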
References:
[1] 大树先生. Andrew Ng's Coursera deep learning course (DeepLearning.ai), condensed notes (2-3): Hyperparameter tuning and Batch Norm.