该系列仅在原课程基础上部分知识点添加个人学习笔记,或相关推导补充等。如有错误,还请批评指教。在学习了 Andrew Ng 课程的基础上,为了更方便的查阅复习,将其整理成文字。因本人一直在学习英语,所以该系列以英文为主,同时也建议读者以英文为主,中文辅助,以便后期进阶时,为学习相关领域的学术论文做铺垫。- ZJ
Coursera 课程 |deeplearning.ai |网易云课堂
转载请注明作者和出处:ZJ 微信公众号-「SelfImprovementLab」
知乎:https://zhuanlan.zhihu.com/c_147249273
CSDN:http://blog.csdn.net/junjun_zhao/article/details/79122695
3.7 Batch norm at test time (测试时的 Batch Norm)
(字幕来源:网易云课堂)
Batch norm processes your data one mini batch at a time,but the test time you might need to process the examples one at a time.Let’s see how you can adapt your network to do that.Recall that during training,here are the equations you’d use to implement batch norm.Within a single mini batch,you’d sum over that mini batch of the ZI values to compute the mean.So here, you’re just summing over the examples in one mini batch.I’m using M to denote the number of examples in the mini batchnot in the whole training set.Then, you compute the variance and then you compute znorm z n o r m by scaling by the mean and standard deviation with Epsilon added for numerical stability.And then Z tilde is taking znorm z n o r m and rescaling by gamma and beta.So, notice that mu and sigma squared which you need for this scaling calculation are computed on the entire mini batch.But the test time you might not have a mini batch of 6428 or 2056 examples to process at the same time.So, you need some different way of coming up with mu and sigma squared.And if you have just one example,taking the mean and variance of that one example, doesn’t make sense.
Batch 归一化将你的数据以 mini-batch 的形式逐一处理,但在测试时,你可能需要对每一个样本逐一处理,我们来看一下怎样调整你的网络来做到这一点,回想一下在训练时,这些就是用来执行 Batch 归一化的等式,在一个 mini-batch 中,你将 mini-batch 的 z(i) 值求和,计算均值,所以这里你只要把一个 mini-batch 中的样本都加起来,我用 m 来表示这个 mini-batch 中的样本数量,而不是整个训练集,然后计算方差 再算 znorm z n o r m ,即用均值和标准差来调整 加上 ε 是为了数值稳定性, z̃ z ̃ 是用 γ γ 和 β β 再次调整 znorm z n o r m ,请注意用于调节计算的 μ μ 和 σ2 σ 2 ,是在整个 mini-batch 上进行计算,但是在测试时 你可能不能将一个 mini-batch 中的,6428 个或 2056 个样本同时处理,因此你需要用其他方式来得到 μ μ 和 σ2 σ 2 ,而且如果你只有一个样本,一个样本的均值和方差没有意义。
So what’s actually done in order to apply your neural network at test time is to come up with some separate estimate of mu and sigma squared.And in typical implementations of batch norm,what you do is estimate this using an exponentially weighted average where the average is across the mini batches.So, to be very concrete here’s what I mean.Let’s pick some layer L and let’s say you’re going through mini batches X1, X2 together with the corresponding values of Y and so on.So, when training on X1 for that layer L, you get some mu L.And in fact, I’m going to write this as mu for the first mini batch and that layer.And then when you train on the second mini batch for that layer and that mini batch,you end up with some second value of mu.And then for the third mini batch in this hidden layer,you end up with some third value for mu.
那么实际上 为了将你的神经网络运用于测试,就需要单独估算 μ μ 和 σ2 σ 2 ,在典型的 Batch 归一化运用中,你需要用一个指数加权平均来估算,这个平均数涵盖了所有 mini-batch ,接下来我会具体解释,我们选择 L 层 假设我们有 mini-batch X[1]X[2] X [ 1 ] X [ 2 ] ,以及对应的 y 值等等,那么在为 L层训练 X[1] X [ 1 ] 时 你就得到了 μ[L] μ [ L ] ,我还是把它写做第一个 mini-batch 和这一层的 μ μ 吧,当你训练第二个 mini-batch 在这一层和这个 mini-batch 中,你就会得到第二个 μ μ 值,然后在这一隐藏层的第三个 mini-batch ,你得到了第三个 μ μ 值。
So just as we saw how to use an exponentially weighted average to compute the mean of Theta one, Theta two, Theta three when you were trying to compute an exponentially weighted average of the current temperature,you would do that to keep track of what’s the latest average value of this mean vector you’ve seen.So that exponentially weighted average becomes your estimate for what the mean of the Zs is for that hidden layer and similarly,you use an exponentially weighted average to keep track of these values of sigma squared that you see on the first mini batch in that layer,sigma square that you see on second mini batch and so on.So you keep a running average of the mu and the sigma squared that you’re seeing for each layer as you train the neural network across different mini batches.
正如我们之前用指数加权平均,来计算 θ1θ2θ3 θ 1 θ 2 θ 3 的均值,当时是试着计算当前气温的指数加权平均,你会这样来追踪,你看到的这个均值向量的最新平均值,于是这个指数加权平均就成了,你对这一隐藏层的 z z 均值的估值 同样的,你也可以用指数加权平均来追踪,你在这一层的第一个 mini-batch 中所见的 σ² σ ² 的值,以及第二个 mini-batch 中所见的 σ² σ ² 的值等等,因此在用不同的 mini-batch 训练神经网络的同时,能够得到你所查看的每一层的 μ μ 和 σ² σ ² 的平均数的实时数值。
Then finally at test time,what you do is in place of this equation,you would just compute znorm z n o r m using whatever value your Z have,and using your exponentially weighted average of the mu and sigma square whatever was the latest value you have to do the scaling here.And then you would compute Z tilde on your one test example using that znorm z n o r m that we just computed on the left and using the beta and gamma parameters that you have learned during your neural network training process.So the takeaway from this is that during training time mu and sigma squared are computed on an entire mini batch of,say, 64, 28 or some number of examples.But at test time, you might need to process a single example at a time.So, the way to do that is to estimate mu and sigma squared from your training set.
最后在测试时,对应这个等式,你只需要用你的 z z 值来计算 znorm z n o r m ,用 μ μ 和 σ² σ ² 的指数加权平均,用你手头的最新数值来做调整,然后你可以用左边我们刚算出来的 znorm z n o r m ,和你在神经网络训练过程中得到的 β β 和 γ γ 参数,来计算你那个测试样本的z̃值,总结一下就是 在训练时, μ μ 和 σ² σ ² 是在整个 mini-batch 上计算出来的,包含了像是 64 或 28 或其他一定数量的样本,但在测试时,你可能需要逐一处理样本,方法是根据你的训练集估算 μ μ 和 σ² σ ² 。
and there are many ways to do that.You could in theory run your whole training set through your final network to get mu and sigma squared.But in practice, what people usually do is implement an exponentially weighted average where you just keep track of the mu and sigma squared values you’re seeing during training and use an exponentially weighted average, also sometimes called the running average,to just get a rough estimate of mu and sigma squared and then you use those values of mu and sigma squared at test time to do the scaling you need of the hidden unit values Z.In practice, this process is pretty robustto the exact way you used to estimate mu and sigma squared.So, I wouldn’t worry too much about exactly how you do this and if you’re using a deep learning framework,they’ll usually have some default way to estimate the mu and sigma squared that should work reasonably well as well.But in practice, any reasonable way to estimate the mean andvariance of your hidden unit values Z should work fine at test.So, that’s it for batch norm.And using it, I think you’ll be able to train much deeper networks and get your learning algorithm to run much more quickly.Before we wrap up for this week,I want to share with you some thoughts on deep learning frameworks as well.Let’s start to talk about that in the next video.
估算的方式有很多种,理论上你可以在最终的网络中运行整个训练集,来得到 μ μ 和 σ² σ ² ,但在实际操作中 我们通常运用指数加权平均,来追踪在训练过程中你看到的 μ μ 和 σ² σ ² 的值,还可以用指数加权平均,有时也叫做流动平均,来粗略估算 μ μ 和 σ² σ ² ,然后用测试中 μ μ 和 σ² σ ² 的值,来进行你所需的隐藏单元 z z 值的调整,在实践中 不管你用什么方式估算 μ μ 和 σ² σ ² ,这套过程都是比较稳健的,因此我不会太担心你具体的操作方式,而且如果你使用的是某种深度学习框架,通常会有默认的估算 μ μ 和 σ² σ ² 的方式,应该一样会起到比较好的效果,但在实践中 任何合理的估算你的隐藏单元 z 值的,均值和方差的方式 在测试中应该都会有效,Batc h归一化就讲到这里,使用 Batch 归一化 你能够训练更深的网络,让你的学习算法运行速度更快,在结束这周的课程前,我还想和你们分享一些关于深度学习框架的想法,让我们在下一段视频中一起讨论这个话题。
在测试数据上使用 Batch Norm
训练过程中,(流程: 训练 train ,开发 Dev,测试 test )我们是在每个 mini-batch 使用 Batch Norm,来计算所需要的均值 μ μ 和方差 σ2 σ 2 。但是在测试的时候,我们需要对每一个测试样本进行预测,无法计算均值和方差。
此时,我们需要单独进行估算均值 μ μ 和方差 σ2 σ 2 。通常的方法就是在我们训练的过程中,对于训练集的 mini-batch ,使用指数加权平均,当训练结束的时候,得到指数加权平均后的均值 μ μ 和方差 σ2 σ 2 ,而这些值直接用于 Batch Norm 公式的计算,用以对测试样本进行预测。
参考文献:
[1]. 大树先生.吴恩达Coursera深度学习课程 DeepLearning.ai 提炼笔记(2-3)– 超参数调试 和 Batch Norm
PS: 欢迎扫码关注公众号:「SelfImprovementLab」!专注「深度学习」,「机器学习」,「人工智能」。以及 「早起」,「阅读」,「运动」,「英语 」「其他」不定期建群 打卡互助活动。