Coursera | Andrew Ng (02-week-1-1.1): Train / Dev / Test Sets

This series only adds personal study notes and supplementary derivations on top of the original course material; if you spot any mistakes, corrections are welcome. After working through Andrew Ng's course, I wrote it up as text to make later review and reference easier. Since I have been studying English, the series is primarily in English, and readers are encouraged to work mainly from the English text as preparation for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom


Please credit the author and source when reposting: ZJ, WeChat official account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79059399


(Subtitle source: NetEase Cloud Classroom)

week 1: Setting up your ML application

1.1 Train / Dev / Test Sets


Welcome to this course on the practical aspects of deep learning. Perhaps now you've learned how to implement a neural network. In this week you'll learn the practical aspects of how to make your neural network work well, ranging from things like hyperparameter tuning, to how to set up your data, to how to make sure your optimization algorithm runs quickly, so that you get your learning algorithm to learn in a reasonable time. In this first week, we'll first talk about how to set up your machine learning problem, then we'll talk about regularization, and we'll talk about some tricks for making sure your neural network implementation is correct. With that, let's get started.


Making good choices in how you set up your training, development, and test sets can make a huge difference in helping you quickly find a good, high-performance neural network. When training a neural network you have to make a lot of decisions, such as: how many layers will your neural network have? How many hidden units do you want each layer to have? What's the learning rate? What are the activation functions you want to use for the different layers? When you're starting on a new application, it's almost impossible to correctly guess the right values for all of these, and for other hyperparameter choices, on your first attempt. So in practice, applied machine learning is a highly iterative process, in which you often start with an idea, such as wanting to build a neural network with a certain number of layers, a certain number of hidden units, maybe on certain data sets, and so on. Then you just have to code it up and try it by running your code. You run an experiment and you get back a result that tells you how well this particular network, or this particular configuration, works. Based on the outcome, you might then refine your ideas and change your choices, and maybe keep iterating in order to try to find a better and better neural network.



Today, deep learning has found great success in a lot of areas, ranging from natural language processing, to computer vision, to speech recognition, to a lot of applications on structured data as well. Structured data includes everything from advertisements to web search, which isn't just Internet search engines; it's also, for example, shopping websites, really any website that wants to deliver great search results when you enter terms into a search bar; and it extends to computer security, to logistics, such as figuring out where to send drivers to pick up and drop off things, and to many more areas. So what I'm seeing is that sometimes a researcher with a lot of experience in NLP might try to do something in computer vision. Or maybe a researcher with a lot of experience in speech recognition might jump in and try to do something on advertising. Or someone from security might want to jump in and do something on logistics. And what I've seen is that intuitions from one domain or from one application area often do not transfer to other application areas. The best choices may depend on the amount of data you have, the number of input features you have, your computer configuration, whether you're training on GPUs or CPUs (and if so, exactly what configuration of GPUs and CPUs), and many other things. So for a lot of applications, I think it's almost impossible, even for very experienced deep learning people, to correctly guess the best choice of hyperparameters the very first time. And so today, applied deep learning is a very iterative process where you just have to go around this cycle many times to hopefully find a good choice of network for your application.



So one of the things that determines how quickly you can make progress is how efficiently you can go around this cycle, and setting up your data sets well, in terms of your train, development, and test sets, can make you much more efficient at that. So if this is your training data, let's draw that as a big box. Then traditionally you might take all the data you have and carve off some portion of it to be your training set, some portion of it to be your hold-out cross validation set, which is sometimes also called the development set. For brevity, I'm just going to call this the dev set, but all of these terms mean roughly the same thing. And then you might carve out some final portion of it to be your test set.



And so the workflow is that you keep on training algorithms on your training set, and use your dev set, or your hold-out cross validation set, to see which of many different models performs best on your dev set. Then, after having done this long enough, when you have a final model that you want to evaluate, you can take the best model you have found and evaluate it on your test set, in order to get an unbiased estimate of how well your algorithm is doing. In the previous era of machine learning, it was common practice to take all your data and split it according to maybe 70/30%, in terms of what people often talk about as the 70/30 train/test split if you don't have an explicit dev set, or maybe a 60/20/20% split, in terms of 60% train, 20% dev, and 20% test. Several years ago, this was widely considered best practice in machine learning. If you have maybe 100 examples in total, maybe 1,000 examples in total, or maybe up to 10,000 examples, these sorts of ratios were perfectly reasonable rules of thumb.
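As a rough illustration of this workflow (my own sketch, not taken from the lecture), here is a minimal Python example assuming scikit-learn and a synthetic dataset: split the data 60/20/20, compare a few candidate models on the dev set, and touch the test set only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (purely illustrative).
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Carve off 20% as the test set, then split the rest 75/25 so the overall
# ratio is 60% train / 20% dev / 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Train a few candidate configurations on the training set and
# pick the one that performs best on the dev set.
candidates = [LogisticRegression(C=c, max_iter=1000) for c in (0.01, 0.1, 1.0, 10.0)]
best = max(candidates, key=lambda m: m.fit(X_train, y_train).score(X_dev, y_dev))

# Use the test set only once, for an unbiased estimate of the chosen model.
print("dev accuracy :", best.score(X_dev, y_dev))
print("test accuracy:", best.score(X_test, y_test))
```

The point mirrored from the lecture is only the workflow: model selection happens on the dev set, and the test set is reserved for the final, unbiased evaluation.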



But in the modern big data era, where, for example, you might have a million examples in total, the trend is that your dev and test sets have been becoming a much smaller percentage of the total. Because remember, the goal of the dev set, or the development set, is that you're going to test different algorithms on it and see which algorithm works better. So the dev set just needs to be big enough for you to evaluate, say, two different algorithm choices or ten different algorithm choices and quickly decide which one is doing better, and you might not need a whole 20% of your data for that. So, for example, if you have a million training examples, you might decide that just having 10,000 examples in your dev set is more than enough to evaluate which one or two algorithms does better. In a similar vein, the main goal of your test set is, given your final classifier, to give you a pretty confident estimate of how well it's doing. And again, if you have a million examples, you might decide that 10,000 examples is more than enough in order to evaluate a single classifier and give you a good estimate of how well it's doing. So in this example, where you have a million examples, if you need just 10,000 for your dev and 10,000 for your test, your ratio will be more like this: 10,000 is 1% of 1 million, so you'll have 98% train, 1% dev, 1% test. And I've also seen applications where, if you have even more than a million examples, you might end up with 99.5% train and 0.25% dev, 0.25% test, or maybe 0.4% dev and 0.1% test. So just to recap: when setting up your machine learning problem, I'll often set it up into train, dev, and test sets, and if you have a relatively small dataset, these traditional ratios might be okay. But if you have a much larger data set, it's also fine to set your dev and test sets to be much smaller than 20% or even 10% of your data. We'll give more specific guidelines on the sizes of dev and test sets later in this specialization.
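As a small, hypothetical sketch of this big-data regime (my own illustration, not from the lecture), a 98/1/1 split of a million examples can be produced with a simple random permutation of indices, assuming the examples are i.i.d.:

```python
import numpy as np

def split_indices(n_examples, dev_frac=0.01, test_frac=0.01, seed=0):
    """Return (train_idx, dev_idx, test_idx) for a dataset of n_examples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)           # shuffle so each split is a random sample
    n_dev = int(n_examples * dev_frac)
    n_test = int(n_examples * test_frac)
    return idx[n_dev + n_test:], idx[:n_dev], idx[n_dev:n_dev + n_test]

train_idx, dev_idx, test_idx = split_indices(1_000_000)
print(len(train_idx), len(dev_idx), len(test_idx))   # 980000 10000 10000
```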



One other trend we're seeing in the era of modern deep learning is that more and more people train on mismatched train and test distributions. Let's say you're building an app that lets users upload a lot of pictures, and your goal is to find pictures of cats in order to show your users; maybe all your users are cat lovers. Maybe your training set comes from cat pictures downloaded off the Internet, but your dev and test sets might comprise cat pictures from users using your app. So maybe your training set has a lot of pictures crawled off the Internet, but the dev and test sets are pictures uploaded by users. It turns out a lot of webpages have very high resolution, very professional, very nicely framed pictures of cats, but maybe your users are uploading blurrier, lower-resolution images just taken with a cell phone camera in a more casual condition. So these two distributions of data may be different. The rule of thumb I'd encourage you to follow in this case is to make sure that the dev and test sets come from the same distribution. We'll say more about this particular guideline as well, but because you will be using the dev set to evaluate a lot of different models and trying really hard to improve performance on the dev set, it's nice if your dev set comes from the same distribution as your test set. But because deep learning algorithms have such a huge hunger for training data, one trend I'm seeing is that you might use all sorts of creative tactics, such as crawling webpages, in order to acquire a much bigger training set than you would otherwise have, even if part of the cost of that is that your training set data might not come from the same distribution as your dev and test sets. But you'll find that so long as you follow this rule of thumb, progress in your machine learning algorithm will be faster. And I'll give a more detailed explanation for this particular rule of thumb later in the specialization as well.
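To make this rule of thumb concrete, here is a hypothetical sketch (the file names and counts are invented for illustration): web-crawled images feed only the training set, while the scarcer user-uploaded images, the distribution we actually care about, are split between dev and test so that those two sets share a distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two pools with different distributions (hypothetical counts and names).
web_crawled  = [f"web_{i}.jpg" for i in range(200_000)]   # abundant: high-res, professional shots
user_uploads = [f"user_{i}.jpg" for i in range(5_000)]    # scarce: blurry phone photos from the app

# Dev and test both come from the user-upload pool, so they share a distribution.
shuffled = list(rng.permutation(user_uploads))
dev_set  = shuffled[:2_500]
test_set = shuffled[2_500:]

# The training set may draw on the bigger, mismatched pool.
train_set = web_crawled

print(len(train_set), len(dev_set), len(test_set))
```

In practice you might also move some of the user uploads into the training set; the constraint from the lecture is only that dev and test come from the same distribution.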



Finally, it might be okay to not have a test set. Remember, the goal of the test set is to give you an unbiased estimate of the performance of your final network, of the network that you selected. But if you don't need that unbiased estimate, then it might be okay to not have a test set. So what you do, if you have only a dev set but not a test set, is you train on the training set, try different model architectures, evaluate them on the dev set, and then use that to iterate and try to get to a good model. Because you've fit your data to the dev set, this no longer gives you an unbiased estimate of performance; but if you don't need one, that might be perfectly fine. In the machine learning world, when you have just a train and a dev set but no separate test set, most people will call this the training set and they will call the dev set the test set. But what they actually end up doing is using the test set as a hold-out cross validation set, which maybe isn't a completely great use of terminology, because they're then overfitting to the test set.



So when a team tells you that they have only a train and a test set, I would just be cautious and think: do they really have a train/dev split? Because they're overfitting to the test set. Culturally, it might be difficult to change some of these teams' terminology and get them to call it a train/dev set rather than a train/test set, even though I think calling it a train and development set would be more correct terminology. And this is actually okay practice if you don't need a completely unbiased estimate of the performance of your algorithm. So having set up train, dev, and test sets will allow you to iterate more quickly. It will also allow you to more efficiently measure the bias and variance of your algorithm, so you can more efficiently select ways to improve your algorithm. Let's start to talk about that in the next video.



Key takeaways:

1. Training, dev, and test sets

For the sample data of a problem we want to solve, when building a model we usually divide the data into the following parts:

  • Training set (train set): used to train the algorithm or model;

  • Dev set (development set): used for cross validation, hence also called the hold-out cross validation set, in order to select the best model;

  • Test set (test set): used at the end to evaluate the model and obtain an unbiased estimate of its performance.

The small-data era

In the era of small data, for example datasets of 100, 1,000, or 10,000 examples, the data can be divided as follows:

  • Without a dev set: 70% / 30%;
  • With a dev set: 60% / 20% / 20%.

In the small-data era, splits in these proportions were perfectly reasonable.

The big-data era

In today's big-data era, however, the data available for a problem may be on the order of millions of examples, so the dev and test sets tend to become a much smaller fraction of the total.

The purpose of the dev set is to determine which of several different algorithms is more effective, so it only needs to be large enough to compare roughly 2-10 algorithms; there is no need to devote 20% of the data to it. For example, drawing 10,000 examples out of a million is sufficient for the dev set.

The main purpose of the test set is to evaluate the final model. For a single classifier trained on millions of examples, around 10,000 examples are usually enough to give a good estimate of its performance.

  • 1 million examples: 98% / 1% / 1%;

  • Well over 1 million examples: 99.5% / 0.25% / 0.25% (or 99.5% / 0.4% / 0.1%); a quick arithmetic check is sketched below.
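A purely illustrative arithmetic check of these example ratios (assuming totals of 1 million and 10 million examples):

```python
# For each (total, dev fraction, test fraction), print the train/dev/test sizes.
for n, dev_frac, test_frac in [(1_000_000, 0.01, 0.01),
                               (10_000_000, 0.0025, 0.0025)]:
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    print(f"{n:,d} -> train {n - n_dev - n_test:,d}, dev {n_dev:,d}, test {n_test:,d}")
# 1,000,000 -> train 980,000, dev 10,000, test 10,000
# 10,000,000 -> train 9,950,000, dev 25,000, test 25,000
```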

Notes

  • The dev and test sets should come from the same distribution; this helps you make progress on the machine learning algorithm faster;
  • If an unbiased estimate of the model's performance is not needed, the test set can be omitted.



PS: You are welcome to follow the WeChat official account 「SelfImprovementLab」, which focuses on deep learning, machine learning, and artificial intelligence, and also occasionally organizes group check-in activities around early rising, reading, exercise, English, and more.

