Brief notes on Andrew Ng's book Machine Learning Yearning.
To improve an algorithm's performance, there are many things you could try:
- Get more data
- Collect a more diverse training set
- Train the algorithm longer
- Try a bigger neural network
- Try a smaller neural network
- Try adding regularization
- Change the neural network architecture
…
How do you choose? Choose well and you make rapid progress; choose poorly and you can waste a lot of effort.
Most machine learning problems leave clues that tell you what is useful to try, and what is not useful to try. Learning to read those clues will save you months or years of development time.
A few changes in prioritization can have a huge effect on your team’s productivity.
After working through the book:
you will have a deep understanding of how to set technical direction for a machine learning project.
On the rise of deep learning, two driving forces:
- Data availability
- Computational scale
The older algorithms didn't know what to do with all the data we now have. This motivates the figure comparing the performance of traditional algorithms vs. neural networks as data grows. (page 9, note 1)
The discussion is motivated by the example of building a cat detector.
You collect a large number of cat and non-cat images from the web and split them 70%/30% into a training set and a test set. With these examples you can build a detector that works well on both the training set and the test set, but performs badly when deployed in the mobile app.
Your algorithm did not generalize well to the actual distribution you care about: smartphone pictures. [Images uploaded by mobile users tend to be lower resolution, blurrier, and poorly lit, a different distribution from the images downloaded from the web.]
A 70%/30% split into training and test sets can work in practice, but it is a bad idea in more and more applications where the training distribution differs from the distribution you ultimately care about.
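A minimal sketch of this point (the file names and counts are invented for illustration): draw the dev/test sets entirely from the distribution you care about, smartphone images, even if most training data still comes from the web.

```python
import random

# Hypothetical data: 10,000 web-scraped images, 2,000 smartphone images.
web_images = [f"web_{i}.jpg" for i in range(10_000)]
mobile_images = [f"mobile_{i}.jpg" for i in range(2_000)]

random.seed(0)
random.shuffle(mobile_images)

# Naive 70/30 split of the web data alone: the resulting test set
# does not reflect the smartphone distribution you actually care about.
naive_train = web_images[:7_000]
naive_test = web_images[7_000:]

# Better: train on all the web data, but draw dev/test entirely
# from the smartphone distribution you want to do well on.
train_set = web_images
dev_set = mobile_images[:1_000]
test_set = mobile_images[1_000:]

print(len(train_set), len(dev_set), len(test_set))
```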
The purpose of the dev and test sets is to direct your team toward the most important changes to make to the machine learning system.
Choose dev and test sets to reflect data you expect to get in the future and want to do well on.
If classifier A has 90% accuracy and classifier B has 90.1% accuracy, a dev set of only 100 examples is too small to detect the 0.1% difference between them.
The test set should be large enough to give high confidence in the overall performance of your system. Summary: even in today's big-data era, machine learning problems can have more than a billion examples, so the absolute sizes of the dev/test sets grow, but the fraction of data allocated to dev/test shrinks; there is no need to make them excessively large. (page 15, note 2)
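A back-of-the-envelope check (standard binomial standard error, not from the book) of why 100 examples cannot resolve a 0.1% accuracy gap:

```python
import math

def accuracy_std_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate from n i.i.d. examples."""
    return math.sqrt(p * (1 - p) / n)

# With 100 dev examples, the noise in the accuracy estimate (~3%)
# dwarfs the 0.1% gap between classifiers A (90.0%) and B (90.1%).
print(accuracy_std_error(0.90, 100))      # 0.03
print(accuracy_std_error(0.90, 100_000))  # ~0.00095
```

So distinguishing the two classifiers reliably needs a dev set on the order of tens of thousands of examples, not 100.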
Advantage of a single-number evaluation metric:
√ classification accuracy: a single number for assessing a classifier
× precision and recall: two numbers for assessing a classifier [by precision, classifier A beats B; by recall, classifier B beats A]
Having multiple-number evaluation metrics makes it harder to compare algorithms.
If you want an evaluation metric built from multiple numbers, one approach is to combine them into a single number.
F1 score https://en.wikipedia.org/wiki/F1_score
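A small illustration (the precision/recall values are made up) of how F1, the harmonic mean of precision and recall, collapses the two numbers into one so classifiers become directly comparable:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Classifier A wins on precision, B wins on recall: hard to compare directly.
f1_a = f1_score(precision=0.95, recall=0.90)
f1_b = f1_score(precision=0.98, recall=0.85)
print(f1_a, f1_b)  # A's F1 is slightly higher here
```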
Another way to combine multiple evaluation metrics:
Suppose you have N different metrics. Consider setting N-1 of them as "satisficing" metrics (binary file size of the model, running time, …) and 1 as the "optimizing" metric (accuracy). After constraining the N-1 satisficing metrics (e.g., via thresholds), optimize accuracy.
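A sketch of this selection rule with made-up candidate models and thresholds: filter on the N-1 satisficing thresholds first, then pick the best accuracy among the survivors.

```python
# Hypothetical candidates: (name, accuracy, model size in MB, latency in ms).
candidates = [
    ("model_a", 0.92, 80, 120),
    ("model_b", 0.94, 150, 90),   # too large: fails the size threshold
    ("model_c", 0.91, 60, 40),
    ("model_d", 0.93, 95, 200),   # too slow: fails the latency threshold
]

MAX_SIZE_MB = 100      # satisficing metric 1
MAX_LATENCY_MS = 150   # satisficing metric 2

# Keep only models satisfying all N-1 thresholds, then optimize accuracy.
feasible = [m for m in candidates if m[2] <= MAX_SIZE_MB and m[3] <= MAX_LATENCY_MS]
best = max(feasible, key=lambda m: m[1])
print(best[0])  # "model_a"
```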
Example:
Suppose you are building a hardware device whose microphone listens for the user saying a particular "wakeword", which then wakes the system. (Apple's 'Hey Siri', Android's 'Okay Google', Amazon Echo's 'Alexa'.)
Two metrics arise: the false positive rate (the system wakes when nobody said the wakeword) and the false negative rate (it fails to wake when someone did).
Considering the actual use case, false negatives are the ones to avoid: if the user urgently needs the device and it stays asleep, that is a bad experience. So the optimizing metric should be the false negative rate, with the false positive rate as a satisficing metric. A concrete performance target could be:
Minimize the false negative rate, subject to at most one false positive per 24 hours of operation.
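The same selection pattern applied to this wake-word target (the candidate numbers are invented): among models with at most one false positive per 24 hours, pick the one with the lowest false-negative rate.

```python
# Hypothetical candidates: (name, false positives per 24h, false-negative rate).
models = [
    ("v1", 0.5, 0.12),
    ("v2", 3.0, 0.04),  # fewest misses, but too many spurious wake-ups
    ("v3", 0.9, 0.07),
]

feasible = [m for m in models if m[1] <= 1.0]  # satisficing: FP/24h <= 1
best = min(feasible, key=lambda m: m[2])       # optimizing: minimize FN rate
print(best[0])  # "v3"
```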
Establish the evaluation metric to optimize.
As the figure shows, development is a loop: idea, code, experiment, repeat. The faster you can go round this loop, the faster you will make progress.
This explains the importance of the dev/test sets and the metric:
Take cat detection as an example: improving classification accuracy from 95.0% to 95.1% may not produce a noticeable change in the app, but many such 0.1% improvements accumulate into a substantial one.
When starting a new project, choose the dev/test sets as quickly as possible (a clear target to aim at).
Having an initial dev/test set and metric helps you iterate quickly. If you ever find that the dev/test sets or metric are no longer pointing your team in the right direction, it is not a big deal! Just change them and make sure your team knows about the new direction.