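This example comes from a real project, but the details have been disguised to protect confidentiality.
You are a famous researcher in the City of Peacetopia. The people of Peacetopia share one trait: they are afraid of birds. To protect them, you must design an algorithm that detects any bird flying over Peacetopia and alerts the population. The City Council gives you a dataset of 10,000,000 images taken by the city's security cameras. They are labelled:
Your goal is to design an algorithm that can classify new images taken by Peacetopia's security cameras.
There are a lot of decisions to make:
The City Council tells you that they want an algorithm that:
Note: Having three evaluation metrics makes it harder to choose quickly between two different algorithms and slows down the speed at which your team can iterate. True/False?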
[x] True
After further discussion, the City Council narrows down its criteria to:
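If you had the following four models, which one would you choose?
- [ ] A
Test accuracy | Runtime | Memory size |
---|---|---|
97% | 1 sec | 3MB |
- [ ] B
Test accuracy | Runtime | Memory size |
---|---|---|
99% | 13 sec | 9MB |
- [ ] C
Test accuracy | Runtime | Memory size |
---|---|---|
97% | 3 sec | 2MB |
- [x] D
Test accuracy | Runtime | Memory size |
---|---|---|
98% | 9 sec | 9MB |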
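Based on the city's requests, which of the following would you say is true?
[x] Accuracy is an optimizing metric; running time and memory size are satisficing metrics.
[ ] Accuracy is a satisficing metric; running time and memory size are optimizing metrics.
[ ] Accuracy, running time and memory size are all optimizing metrics, because you want to do well on all three.
[ ] Accuracy, running time and memory size are all satisficing metrics, because you have to do sufficiently well on all three for the system to be acceptable.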
Before implementing your algorithm, you need to split your data into train/dev/test sets. Which of these do you think is the best choice?
- [ ] A
Train | Dev | Test |
---|---|---|
3,333,334 | 3,333,333 | 3,333,333 |
- [ ] B
Train | Dev | Test |
---|---|---|
6,000,000 | 3,000,000 | 1,000,000 |
- [x] C
Train | Dev | Test |
---|---|---|
9,500,000 | 250,000 | 250,000 |
- [ ] D
Train | Dev | Test |
---|---|---|
6,000,000 | 1,000,000 | 3,000,000 |
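After you have set up your train/dev/test sets, the City Council gives you another 1,000,000 images, called the "citizens' data". Apparently the citizens of Peacetopia are so scared of birds that they volunteered to take pictures of the sky and label them, thus contributing these additional 1,000,000 images. These images come from a different distribution than the images the City Council originally gave you, but you think they could help your algorithm.
You should not add the citizens' data to the training set, because this will cause the training and dev/test set distributions to become different, thus hurting dev and test set performance. True/False?
[x] False
One member of the City Council knows only a little about machine learning, and thinks you should add the 1,000,000 citizens' data images to the test set. You object because: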
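[x] This would cause the dev and test set distributions to become different. This is a bad idea because you are not aiming where you want to hit.
[ ] The citizens' data images do not have a consistent x→y mapping with the rest of the data (similar to the New York City/Detroit housing prices example).
[ ] A bigger test set will slow down iteration because of the computational expense of evaluating models on the test set.
[x] The test set no longer reflects the distribution of data (security cameras) you most care about. (Blogger's note: the training set was taken by the security cameras; testing a camera-trained model on data taken by other people will inevitably lower accuracy. If the data is added at all, it should be added to the whole dataset so that everything shares the same distribution.)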
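You train a system, and its errors are as follows (error = 100% - accuracy):
Training set error | 4.0% |
---|---|
Dev set error | 4.5% |
This suggests that one good avenue for improving performance is to train a bigger network so as to drive down the 4.0% training error. Do you agree?
[ ] Yes, because having 4.0% training error shows you have high bias.
[ ] Yes, because this shows your bias is higher than your variance.
[ ] No, because this shows your variance is higher than your bias.
[x] No, because there is insufficient information to tell. (Blogger's note: think about Bayes optimal error. At the very least we also need the human error rate on these images; see the next question.)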
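You ask a few people to label the dataset so as to find out what human-level performance is. You find the following error rates:
Bird watching expert #1 | 0.3% error |
---|---|
Bird watching expert #2 | 0.5% error |
Normal person #1 (not an expert) | 1.0% error |
Normal person #2 (not an expert) | 1.2% error |
If your goal is to have "human-level performance" be a proxy (or estimate) for Bayes error, how would you define "human-level performance"?
[ ] 0.0% (because it is impossible to do better than this)
[x] 0.3% (error rate of expert #1)
[ ] 0.4% (between 0.3 and 0.5)
[ ] 0.75% (average of all four numbers above)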
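Which of the following statements do you agree with?
[x] A learning algorithm's performance can be better than human-level performance, but it can never be better than Bayes error.
[ ] A learning algorithm's performance can never be better than human-level performance, but it can be better than Bayes error.
[ ] A learning algorithm's performance can never be better than human-level performance nor better than Bayes error.
[ ] A learning algorithm's performance can be better than human-level performance and better than Bayes error.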
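You find that a team of ornithologists debating and discussing an image achieves an even better 0.1% error, so you define that as "human-level performance". After working further on your algorithm, you end up with the following:
Human-level performance | 0.1% |
---|---|
Training set error | 2.0% |
Dev set error | 2.1% |
Based on the evidence you have, which two of the following four options seem the most promising to try? (Check two options.)
[ ] Try increasing regularization.
[ ] Get a bigger training set to reduce variance.
[x] Try decreasing regularization.
[x] Train a bigger model to try to do better on the training set.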
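You also evaluate your model on the test set, and find the following:
Human-level performance | 0.1% |
---|---|
Training set error | 2.0% |
Dev set error | 2.1% |
Test set error | 7.0% |
What does this mean? (Check the two best options.)
[ ] You have underfit to the dev set.
[x] You should try to get a bigger dev set.
[ ] You should get a bigger test set.
[x] You have overfit to the dev set.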
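After working on this project for a year, you finally achieve:
Human-level performance | 0.10% |
---|---|
Training set error | 0.05% |
Dev set error | 0.05% |
What can you conclude? (Check all that apply.)
[x] It is now harder to measure avoidable bias, so progress will be slower going forward.
[ ] This is a statistical anomaly (the result of statistical noise), since it should not be possible to surpass human-level performance.
[ ] With only 0.09% further progress to make, you should quickly be able to close the remaining gap to 0%.
[x] If the test set is big enough for the 0.05% error estimate to be accurate, this implies Bayes error is ≤ 0.05%.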
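It turns out Peacetopia has also hired one of your competitors to build a system. Your system and your competitor's have about the same running time and memory size, and your system has higher accuracy. However, when Peacetopia tries out both systems, it concludes that it actually prefers your competitor's system, because even though your overall accuracy is higher, you have more false negatives (failing to raise an alarm when a bird is in the air). What should you do?
[ ] Look at all the models you developed during the development process and find the one with the lowest false negative error rate.
[ ] Ask your team to take into account both accuracy and false negative rate during development.
[x] Rethink the appropriate metric for this task, and ask your team to tune to the new metric.
[ ] Pick false negative rate as the new metric, and use this new metric to drive all further development.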
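You have handily beaten your competitor, and your system is now deployed in Peacetopia and is protecting the citizens from birds! But over the last few months, a new species of bird has been slowly migrating into the area, so the performance of your system slowly degrades because your system is being tested on a new type of data. (Blogger's note: the system is being evaluated on images of birds it was never trained on.)
You have only 1,000 images of the new species of bird. The city expects a better system from you within the next 3 months. Which of these should you do first?
[x] Use the data you have to define a new evaluation metric (using a new dev/test set) that takes the new species into account, and use that to drive further progress for your team.
[ ] Put the 1,000 images into the training set so as to try to do better on these birds.
[ ] Try data augmentation/data synthesis to get more images of the new type of bird.
[ ] Add the 1,000 images to your dataset and reshuffle into a new train/dev/test split.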
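The City Council thinks that having more cats in the city would help scare off birds. They are so happy with your work on the bird detector that they also hire you to build a cat detector. (Wow, cat detectors are just incredibly useful, aren't they?) Because of years of working on cat detectors, you have such a huge dataset of 100,000,000 cat images that training on this data takes about two weeks. Which of the statements do you agree with? (Check all that apply.)
[x] Needing two weeks to train will limit the speed at which you can iterate.
[x] Buying faster computers could speed up your team's iteration speed and thus your team's productivity.
[x] If 100,000,000 examples is enough to build a good enough cat detector, you might be better off training with just 10,000,000 examples to gain a roughly 10x improvement in how quickly you can run experiments, even if each model performs a bit worse because it is trained on less data.
[ ] Having built a good bird detector, you should be able to take the same model and hyperparameters and just apply them to the cat dataset, so there is no need to iterate.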
This example is adapted from a real production application, but with details disguised to protect confidentiality.
You are a famous researcher in the City of Peacetopia. The people of Peacetopia have a common characteristic: they are afraid of birds. To save them, you have to build an algorithm that will detect any bird flying over Peacetopia and alert the population.
The City Council gives you a dataset of 10,000,000 images of the sky above Peacetopia, taken from the city’s security cameras. They are labelled:
Your goal is to build an algorithm able to classify new images taken by security cameras from Peacetopia.
There are a lot of decisions to make:
The City Council tells you that they want an algorithm that:
Note: Having three evaluation metrics makes it harder for you to quickly choose between two different algorithms, and will slow down the speed with which your team can iterate. True/False?
[x] True
After further discussions, the city narrows down its criteria to:
If you had the following four models, which one would you choose?
- [ ] A
Test Accuracy | Runtime | Memory size |
---|---|---|
97% | 1 sec | 3MB |
- [ ] B
Test Accuracy | Runtime | Memory size |
---|---|---|
99% | 13 sec | 9MB |
- [ ] C
Test Accuracy | Runtime | Memory size |
---|---|---|
97% | 3 sec | 2MB |
- [x] D
Test Accuracy | Runtime | Memory size |
---|---|---|
98% | 9 sec | 9MB |
Based on the city’s requests, which of the following would you say is true?
[x] Accuracy is an optimizing metric; running time and memory size are satisficing metrics.
[ ] Accuracy is a satisficing metric; running time and memory size are optimizing metrics.
[ ] Accuracy, running time and memory size are all optimizing metrics because you want to do well on all three.
[ ] Accuracy, running time and memory size are all satisficing metrics because you have to do sufficiently well on all three for your system to be acceptable.
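The optimizing/satisficing split can be sketched in a few lines: filter out every model that misses a satisficing threshold, then maximize the optimizing metric among the rest. The thresholds below (10 seconds, 10 MB) are hypothetical values chosen for illustration, since the city's exact criteria are not reproduced in this text; the `Model` class and `pick_model` function are likewise illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Model:
    name: str
    accuracy: float   # test accuracy in percent (optimizing metric)
    runtime_s: float  # seconds to classify one image (satisficing metric)
    memory_mb: float  # model size in MB (satisficing metric)


# Hypothetical satisficing thresholds: assumptions, not taken from the text.
MAX_RUNTIME_S = 10.0
MAX_MEMORY_MB = 10.0


def pick_model(models: List[Model]) -> Optional[Model]:
    """Return the most accurate model among those meeting both thresholds."""
    feasible = [
        m for m in models
        if m.runtime_s <= MAX_RUNTIME_S and m.memory_mb <= MAX_MEMORY_MB
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda m: m.accuracy)


# The four candidate models from the question above.
candidates = [
    Model("A", 97.0, 1.0, 3.0),
    Model("B", 99.0, 13.0, 9.0),
    Model("C", 97.0, 3.0, 2.0),
    Model("D", 98.0, 9.0, 9.0),
]

best = pick_model(candidates)
print(best.name if best else "no feasible model")
```

Under these assumed thresholds, model B is excluded for its 13-second runtime despite having the highest accuracy, and the most accurate remaining model is chosen.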
Before implementing your algorithm, you need to split your data into train/dev/test sets. Which of these do you think is the best choice?
- [ ] A
Train | Dev | Test |
---|---|---|
3,333,334 | 3,333,333 | 3,333,333 |
- [ ] B
Train | Dev | Test |
---|---|---|
6,000,000 | 3,000,000 | 1,000,000 |
- [x] C
Train | Dev | Test |
---|---|---|
9,500,000 | 250,000 | 250,000 |
- [ ] D
Train | Dev | Test |
---|---|---|
6,000,000 | 1,000,000 | 3,000,000 |
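To compare the four candidate splits at a glance, it can help to convert the raw counts into percentages. The following is a small sketch using the counts from the tables above; the `fractions` helper is an illustrative name.

```python
# Candidate splits from the question: (train, dev, test) image counts.
splits = {
    "A": (3_333_334, 3_333_333, 3_333_333),
    "B": (6_000_000, 3_000_000, 1_000_000),
    "C": (9_500_000, 250_000, 250_000),
    "D": (6_000_000, 1_000_000, 3_000_000),
}


def fractions(split):
    """Return the (train, dev, test) shares as percentages, rounded to 2 places."""
    total = sum(split)
    return tuple(round(100 * n / total, 2) for n in split)


for name, split in splits.items():
    train, dev, test = fractions(split)
    print(f"{name}: train {train}%, dev {dev}%, test {test}%")
```

With 10,000,000 images, split C still leaves 250,000 images each for dev and test while keeping 95% of the data for training.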
After setting up your train/dev/test sets, the City Council comes across another 1,000,000 images, called the “citizens’ data”. Apparently the citizens of Peacetopia are so scared of birds that they volunteered to take pictures of the sky and label them, thus contributing these additional 1,000,000 images. These images are different from the distribution of images the City Council had originally given you, but you think it could help your algorithm.
You should not add the citizens’ data to the training set, because this will cause the training and dev/test set distributions to become different, thus hurting dev and test set performance. True/False?
[x] False
One member of the City Council knows a little about machine learning, and thinks you should add the 1,000,000 citizens’ data images to the test set. You object because:
[x] This would cause the dev and test set distributions to become different. This is a bad idea because you’re not aiming where you want to hit.
[ ] The 1,000,000 citizens’ data images do not have a consistent x→y mapping with the rest of the data (similar to the New York City/Detroit housing prices example from lecture).
[ ] A bigger test set will slow down the speed of iterating because of the computational expense of evaluating models on the test set.
[x] The test set no longer reflects the distribution of data (security cameras) you most care about.
You train a system, and its errors are as follows (error = 100%-Accuracy):
Training set error | 4.0% |
---|---|
Dev set error | 4.5% |
This suggests that one good avenue for improving performance is to train a bigger network so as to drive down the 4.0% training error. Do you agree?
[ ] Yes, because having 4.0% training error shows you have high bias.
[ ] Yes, because this shows your bias is higher than your variance.
[ ] No, because this shows your variance is higher than your bias.
[x] No, because there is insufficient information to tell.
You ask a few people to label the dataset so as to find out what is human-level performance. You find the following levels of accuracy:
Bird watching expert #1 | 0.3% error |
---|---|
Bird watching expert #2 | 0.5% error |
Normal person #1 (not a bird watching expert) | 1.0% error |
Normal person #2 (not a bird watching expert) | 1.2% error |
If your goal is to have “human-level performance” be a proxy (or estimate) for Bayes error, how would you define “human-level performance”?
[ ] 0.0% (because it is impossible to do better than this)
[x] 0.3% (error of expert #1)
[ ] 0.4% (average of 0.3 and 0.5)
[ ] 0.75% (average of all four numbers above)
Which of the following statements do you agree with?
[x] A learning algorithm’s performance can be better than human-level performance but it can never be better than Bayes error.
[ ] A learning algorithm’s performance can never be better than human-level performance but it can be better than Bayes error.
[ ] A learning algorithm’s performance can never be better than human-level performance nor better than Bayes error.
[ ] A learning algorithm’s performance can be better than human-level performance and better than Bayes error.
You find that a team of ornithologists debating and discussing an image gets an even better 0.1% performance, so you define that as “human-level performance.” After working further on your algorithm, you end up with the following:
Human-level performance | 0.1% |
---|---|
Training set error | 2.0% |
Dev set error | 2.1% |
Based on the evidence you have, which two of the following four options seem the most promising to try? (Check two options.)
[ ] Try increasing regularization.
[ ] Get a bigger training set to reduce variance.
[x] Try decreasing regularization.
[x] Train a bigger model to try to do better on the training set.
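The reasoning behind this answer can be written as simple arithmetic: avoidable bias is the gap between training error and the human-level proxy for Bayes error, and variance is the gap between dev error and training error. This is a minimal sketch with an illustrative function name, assuming all errors are given in percent.

```python
def diagnose(human_error, train_error, dev_error):
    """Compare avoidable bias and variance; all values are in percent."""
    avoidable_bias = train_error - human_error  # gap to the Bayes-error proxy
    variance = dev_error - train_error          # gap between train and dev
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


bias, var, focus = diagnose(human_error=0.1, train_error=2.0, dev_error=2.1)
print(f"avoidable bias: {bias:.1f}%, variance: {var:.1f}%, focus on: {focus}")
```

Here the avoidable bias (about 1.9%) dwarfs the variance (about 0.1%), which is why bias-reduction tactics such as a bigger model or less regularization look more promising than more data or more regularization.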
You also evaluate your model on the test set, and find the following:
Human-level performance | 0.1% |
---|---|
Training set error | 2.0% |
Dev set error | 2.1% |
Test set error | 7.0% |
What does this mean? (Check the two best options.)
[ ] You have underfit to the dev set.
[x] You should try to get a bigger dev set.
[ ] You should get a bigger test set.
[x] You have overfit to the dev set.
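The same gap arithmetic extends to the test set. The sketch below assumes, as the setup of this case study intends, that the dev and test sets come from the same distribution, so a large dev-to-test gap points to overfitting the dev set. The function name is illustrative.

```python
def generalization_gaps(train_error, dev_error, test_error):
    """Return the train-to-dev and dev-to-test error gaps, in percent."""
    return {
        "train_to_dev": dev_error - train_error,
        "dev_to_test": test_error - dev_error,
    }


gaps = generalization_gaps(train_error=2.0, dev_error=2.1, test_error=7.0)
for name, gap in gaps.items():
    print(f"{name}: {gap:.1f}%")
```

The dev-to-test gap of roughly 4.9% is far larger than the 0.1% train-to-dev gap, which suggests the model has been tuned too closely to the dev set and that a larger dev set would help.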
After working on this project for a year, you finally achieve:
Human-level performance | 0.10% |
---|---|
Training set error | 0.05% |
Dev set error | 0.05% |
What can you conclude? (Check all that apply.)
[x] It is now harder to measure avoidable bias, thus progress will be slower going forward.
[ ] This is a statistical anomaly (or must be the result of statistical noise) since it should not be possible to surpass human-level performance.
[ ] With only 0.09% further progress to make, you should quickly be able to close the remaining gap to 0%
[x] If the test set is big enough for the 0.05% error estimate to be accurate, this implies Bayes error is ≤ 0.05%.
It turns out Peacetopia has hired one of your competitors to build a system as well. Your system and your competitor’s have about the same running time and memory size, and your system has higher accuracy! However, when Peacetopia tries out both systems, they conclude they actually like your competitor’s system better, because even though you have higher overall accuracy, you have more false negatives (failing to raise an alarm when a bird is in the air). What should you do?
[ ] Look at all the models you’ve developed during the development process and find the one with the lowest false negative error rate.
[ ] Ask your team to take into account both accuracy and false negative rate during development.
[x] Rethink the appropriate metric for this task, and ask your team to tune to the new metric.
[ ] Pick false negative rate as the new metric, and use this new metric to drive all further development.
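To see how a system can win on accuracy yet lose on what the city cares about, consider the following sketch. The confusion-matrix counts and the false-negative weight of 10 are made up for illustration; the weighted error is only one possible way to rethink the metric, not the one prescribed by the quiz.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of all images classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def false_negative_rate(tp, fn, fp, tn):
    """Fraction of bird images for which no alarm was raised."""
    return fn / (tp + fn)


def weighted_error(tp, fn, fp, tn, fn_weight=10.0):
    """Error that penalizes a missed bird fn_weight times more than a false alarm."""
    return (fn_weight * fn + fp) / (tp + fn + fp + tn)


# Hypothetical counts on 1,000 images, 100 of which contain a bird.
yours = dict(tp=80, fn=20, fp=0, tn=900)
competitor = dict(tp=95, fn=5, fp=35, tn=865)

for name, counts in [("yours", yours), ("competitor", competitor)]:
    print(
        f"{name}: accuracy={accuracy(**counts):.3f}, "
        f"FNR={false_negative_rate(**counts):.3f}, "
        f"weighted error={weighted_error(**counts):.3f}"
    )
```

With these illustrative numbers, your system has the higher accuracy but misses four times as many birds, and the weighted metric ranks the competitor's system first, matching the city's stated preference.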
You’ve handily beaten your competitor, and your system is now deployed in Peacetopia and is protecting the citizens from birds! But over the last few months, a new species of bird has been slowly migrating into the area, so the performance of your system slowly degrades because your system is being tested on a new type of data.
You have only 1,000 images of the new species of bird. The city expects a better system from you within the next 3 months. Which of these should you do first?
[x] Use the data you have to define a new evaluation metric (using a new dev/test set) taking into account the new species, and use that to drive further progress for your team.
[ ] Put the 1,000 images into the training set so as to try to do better on these birds.
[ ] Try data augmentation/data synthesis to get more images of the new type of bird.
[ ] Add the 1,000 images into your dataset and reshuffle into a new train/dev/test split.
The City Council thinks that having more Cats in the city would help scare off birds. They are so happy with your work on the Bird detector that they also hire you to build a Cat detector. (Wow, Cat detectors are just incredibly useful, aren’t they?) Because of years of working on Cat detectors, you have such a huge dataset of 100,000,000 cat images that training on this data takes about two weeks. Which of the statements do you agree with? (Check all that apply.)
[x] Needing two weeks to train will limit the speed at which you can iterate.
[x] Buying faster computers could speed up your team’s iteration speed and thus your team’s productivity.
[x] If 100,000,000 examples is enough to build a good enough Cat detector, you might be better off training with just 10,000,000 examples to gain a ≈10x improvement in how quickly you can run experiments, even if each model performs a bit worse because it’s trained on less data.
[ ] Having built a good Bird detector, you should be able to take the same model and hyperparameters and just apply it to the Cat dataset, so there is no need to iterate.