Stanford Machine Learning Open Course Notes (5): Machine Learning System Design

A common scenario: how do you build a spam classifier?
Treat it as supervised learning, with labels spam (1) and not spam (0).

Features: start by choosing about 100 words indicative of spam or not spam. An improvement is to take the 10,000 to 50,000 most frequent words in the training set instead of hand-picking 100 words.
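The feature construction described above can be sketched as follows. This is a minimal illustration, not the course's reference code: the helper names `build_vocabulary` and `email_to_features`, the whitespace tokenization, and the toy emails are all assumptions made for the example.

```python
from collections import Counter

def build_vocabulary(emails, vocab_size):
    """Return the vocab_size most frequent words across the training emails."""
    counts = Counter(word for email in emails for word in email.lower().split())
    return [word for word, _ in counts.most_common(vocab_size)]

def email_to_features(email, vocabulary):
    """Binary feature vector: x[j] = 1 if vocabulary[j] appears in the email."""
    words = set(email.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Toy training emails, made up for illustration only.
emails = [
    "buy cheap watches now",
    "cheap deal buy now",
    "meeting notes for monday",
]

# In practice vocab_size would be 10,000 to 50,000; 4 keeps the toy readable.
vocabulary = build_vocabulary(emails, vocab_size=4)
x = email_to_features("buy now or never", vocabulary)
print(vocabulary, x)
```

Each email becomes a fixed-length 0/1 vector, so any standard classifier can be trained on it regardless of the email's length.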

[Figure 1]

Recommended steps
Step 1: Get something quick and dirty running, then test it on cross-validation data.
Step 2: Plot learning curves of the training and test errors to figure out whether the learning algorithm suffers from high bias, high variance, or something else, and use that to decide whether more data, more features, and so on are likely to help.
Step 3: Error analysis. Manually look at the examples the algorithm gets wrong and see whether you can spot any systematic patterns in the types of examples it misclassifies. This can also inspire new features.
For example, consider the case below.

[Figure 2]
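The learning curves from Step 2 can be sketched roughly as below. To stay short, this assumes a plain least-squares linear model on synthetic data rather than a spam classifier; `fit_linear`, `half_mse`, and `learning_curve` are hypothetical helper names, and the quadratic target is chosen deliberately so the linear model underfits.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept term; returns the parameter vector."""
    Xb = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return theta

def half_mse(theta, X, y):
    """Squared-error cost using the 1/(2m) convention."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((Xb @ theta - y) ** 2) / 2)

def learning_curve(X_train, y_train, X_cv, y_cv, sizes):
    """Train on the first m examples for each m; record training and CV error."""
    train_err, cv_err = [], []
    for m in sizes:
        theta = fit_linear(X_train[:m], y_train[:m])
        train_err.append(half_mse(theta, X_train[:m], y_train[:m]))
        cv_err.append(half_mse(theta, X_cv, y_cv))
    return train_err, cv_err

# Synthetic data (an assumption for the demo): quadratic target, linear model.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

sizes = [5, 20, 50, 100]
train_err, cv_err = learning_curve(X[:100], y[:100], X[100:], y[100:], sizes)
for m, tr, cv in zip(sizes, train_err, cv_err):
    print(f"m={m:3d}  train={tr:.3f}  cv={cv:.3f}")
```

If both errors stay high and close to each other as m grows, that points to high bias, and more data alone is unlikely to help. A large, persistent gap between a low training error and a high cross-validation error points to high variance instead.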

Error Metrics for Skewed Classes
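As the figure below shows, when the numbers of positive and negative examples are severely imbalanced (skewed classes), accuracy is a very unreliable metric. For instance, if only 0.5% of the examples are positive, a classifier that always predicts 0 reaches 99.5% accuracy while detecting nothing.
Examples include the cancer classification problem below and click-through-rate prediction in advertising.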

[Figure 3]


A workable alternative is to use precision and recall:
Accuracy = (true positives + true negatives) / (total examples)
Precision = (true positives) / (true positives + false positives)
Recall = (true positives) / (true positives + false negatives)
F1 score = (2 * precision * recall) / (precision + recall)

[Figure 4]
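The four formulas above translate directly into code. The sketch below is illustrative: the function name `metrics` and the toy labels are assumptions, and returning 0.0 when a denominator is zero is one common convention rather than part of the definitions.

```python
def metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Convention (an assumption): report 0.0 when the ratio is undefined.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Skewed toy data: 5 positives among 1000 examples.
y_true = [1] * 5 + [0] * 995

# A "classifier" that always predicts 0: high accuracy, zero recall.
always_zero = metrics(y_true, [0] * 1000)

# A classifier that finds 4 of the 5 positives and raises 4 false alarms.
some_pred = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 991
some = metrics(y_true, some_pred)

print(always_zero)
print(some)
```

The always-zero predictor scores 0.995 accuracy but 0 recall and 0 F1, while the second predictor has slightly lower accuracy and clearly better precision, recall, and F1.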


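There is a trade-off between precision and recall. Raising the classification threshold typically increases precision and lowers recall; lowering it does the opposite.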

[Figure 5]
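A small sketch of the precision/recall trade-off: the same predicted scores are cut at different thresholds. The scores and labels are made up for illustration, and `precision_recall_at` is a hypothetical helper name.

```python
def precision_recall_at(scores, y_true, threshold):
    """Predict 1 when score >= threshold, then compute (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision is undefined when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall

# Toy classifier scores (e.g. predicted probabilities) and true labels.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
y_true = [1, 1, 0, 1, 1, 0, 1, 0, 0]

results = {t: precision_recall_at(scores, y_true, t) for t in (0.85, 0.5, 0.3)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.85 to 0.3 raises recall from 0.4 to 1.0 while precision drops from 1.0 to about 0.71.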

Use the F1 score to evaluate precision and recall together.

[Figure 6]

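Compared with the simple average (P + R) / 2, the F score has this advantage:
F is large only when both P and R are large.
If either P or R is very small, F is necessarily small as well.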
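To make the comparison concrete, the sketch below ranks three hypothetical classifiers by the simple average and by F1. The precision/recall pairs are illustrative values, not results from the notes.

```python
def average(p, r):
    """Simple arithmetic mean of precision and recall."""
    return (p + r) / 2

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical (precision, recall) pairs for three classifiers.
candidates = {
    "A": (0.5, 0.4),
    "B": (0.7, 0.1),
    "C": (0.02, 1.0),  # e.g. a model that predicts 1 almost everywhere
}

for name, (p, r) in candidates.items():
    print(f"{name}: average={average(p, r):.3f}  F1={f1(p, r):.4f}")

best_by_average = max(candidates, key=lambda k: average(*candidates[k]))
best_by_f1 = max(candidates, key=lambda k: f1(*candidates[k]))
print(best_by_average, best_by_f1)
```

The simple average prefers classifier C, whose precision is only 0.02, whereas F1 prefers the balanced classifier A.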


Normalization
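To fit nonlinear data, low-dimensional features need to be mapped to higher-dimensional ones (x², x³, ..., x⁹, ...). After the mapping, the features must also be normalized, since the powers of x span very different ranges.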
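A minimal NumPy sketch of the mapping and normalization described above, assuming a single input feature x and mean/standard-deviation scaling. The names `poly_features` and `normalize` are hypothetical.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D feature x to columns [x, x^2, ..., x^degree]."""
    return np.column_stack([x ** p for p in range(1, degree + 1)])

def normalize(X):
    """Scale each column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

x = np.linspace(-2.0, 2.0, 9)
X_poly = poly_features(x, degree=4)

# Before normalization, x^4 ranges up to 16 while x only ranges up to 2.
print(X_poly.max(axis=0))

X_norm, mu, sigma = normalize(X_poly)
print(X_norm.mean(axis=0).round(6), X_norm.std(axis=0).round(6))
```

The `mu` and `sigma` computed on the training set should be reused when scaling cross-validation or test data, so that all sets go through the same transformation.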


