Everything You Wanted to Know About Machine Learning
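This is a translation of 10 important ideas for understanding machine learning, with my own interpretation added. These principles probably hold in most cases, but analyzing each problem on its own terms is what really matters; applying them without thinking leads only to a half-understanding. That is why 张小龙 said, ‘Everything I say is wrong.’ Note by 王犇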
1. How Does Machine Learning Work?
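- A set of possible models to look through
- A way to test whether a model is good
- A clever way to find a really good model with only a few tests
a) Define the set of feasible candidate models (the search space, which is usually very large)
b) Define a way to decide whether a model is good (the evaluation method)
c) Find an effective way to reach a reasonably good model with as few trials as possible (the optimization algorithm, model selection)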
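The three parts above can be sketched on a toy problem. This is only an illustration under made-up assumptions: the data, the threshold-style models, and the `coarse_to_fine` search are hypothetical stand-ins, not anything from the original article.

```python
# Toy data, made up for this sketch: the label is 1 when x >= 0.6.
DATA = [(i / 100, int(i / 100 >= 0.6)) for i in range(100)]

def predict(threshold, x):
    """Part 1: each threshold in [0, 1] defines one candidate model."""
    return int(x >= threshold)

def accuracy(threshold, rows):
    """Part 2: a way to test whether a model is good."""
    return sum(predict(threshold, x) == y for x, y in rows) / len(rows)

def coarse_to_fine(rows, lo=0.0, hi=1.0, rounds=3, points=5):
    """Part 3: test a few candidates, then zoom in around the best one."""
    tests = 0
    best = lo
    for _ in range(rounds):
        step = (hi - lo) / (points - 1)
        candidates = [lo + k * step for k in range(points)]
        best = max(candidates, key=lambda t: accuracy(t, rows))
        tests += len(candidates)
        # Narrow the search window to the neighbourhood of the best candidate.
        lo, hi = max(0.0, best - step), min(1.0, best + step)
    return best, tests

best_threshold, tests_used = coarse_to_fine(DATA)
print(best_threshold, accuracy(best_threshold, DATA), tests_used)
```

The search evaluates 15 candidates instead of scanning every possible threshold. It works here because accuracy has a single peak as the threshold moves; real search spaces are rarely that friendly.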
2. Overfitting Has Many Faces
The general moral of this section of the paper is to always measure the performance of your classifier on out-of-sample data.
You cannot do too many of these training and testing splits. You should even make some predictions on data you imagine yourself, to see what the model does in certain situations.
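A small sketch of why in-sample scores mislead. The assumptions are mine: the labels are coin flips, so nothing can really be learned, and the model is a memorizing nearest-neighbour lookup chosen only because it overfits so visibly.

```python
import random

def make_rows(n, rng):
    # Labels are pure coin flips, so no model can truly beat 50% here.
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]

def nearest_neighbor_predict(train, x):
    # A "memorizing" model: copy the label of the closest training point.
    return min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(train, rows):
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in rows)
    return hits / len(rows)

rng = random.Random(0)
rows = make_rows(400, rng)
train, test = rows[:200], rows[200:]

in_sample = accuracy(train, train)      # scored on the data it memorized
out_of_sample = accuracy(train, test)   # scored on data it has never seen
print(in_sample, out_of_sample)
```

The in-sample score is perfect because every training point is its own nearest neighbour, while the held-out score stays near chance. Repeating the split with different seeds gives a feel for how much the out-of-sample number varies.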
3. Intuition Fails in High Dimensions
What this means in practice is that as you add more and more input fields, you must also add more and more training data to “fill up” the space created by the additional inputs if you want to use them accurately.
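One rough way to put numbers on “filling up” the space, as a back-of-the-envelope sketch: split every input field into a fixed number of bins and ask how many rows are needed to put a few examples in every cell. The bin count and rows-per-cell values are arbitrary assumptions.

```python
def rows_needed(num_fields, bins_per_field=10, rows_per_cell=5):
    """Rows needed to place `rows_per_cell` examples in every grid cell."""
    return rows_per_cell * bins_per_field ** num_fields

for num_fields in (1, 2, 3, 6):
    print(num_fields, rows_needed(num_fields))
```

Each added field multiplies the requirement by the number of bins, so the data needed grows exponentially with the number of inputs.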
4. Theoretical Guarantees Are Not What They Seem
The only certain way (that we know of now) to know if an algorithm will model your data well is to try it out.
5. Feature Engineering is the Key
The problem here is that no single input field, or even any single pair of fields, is closely correlated with the objective.
Feature engineering is when you use your knowledge about the data to create fields that make machine learning algorithms work better.
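A simplified two-field sketch of the idea, on synthetic data invented for illustration: neither raw field is correlated with the target on its own, but a field built from domain knowledge (here, their product) is.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)
a = [rng.uniform(-1, 1) for _ in range(2000)]
b = [rng.uniform(-1, 1) for _ in range(2000)]
# Assumed ground truth: the target is 1 when both fields share a sign.
target = [1 if x * y > 0 else 0 for x, y in zip(a, b)]

# Engineered field: encodes the "same sign" knowledge directly.
product = [x * y for x, y in zip(a, b)]

print(pearson(a, target), pearson(b, target), pearson(product, target))
```

The first two correlations come out near zero and the third is clearly positive, which is the kind of signal a simple learner can pick up without having to discover the interaction itself.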
In my career, I would say an average of 70% of the project’s time goes into feature engineering, 20% goes towards figuring out what comprises a proper and comprehensive evaluation of the algorithm, and only 10% goes into algorithm selection and tuning.
6. More Data Beats A Cleverer Algorithm
There’s increasingly good evidence that, in a lot of problems, very simple machine learning techniques can be leveraged into incredibly powerful classifiers with the addition of loads of data.
A big reason for this is that, once you’ve defined your input fields, there’s only so much analytic gymnastics you can do. Computer algorithms trying to learn models have relatively few tricks they can do efficiently, and many of them are not very different from one another. As a result, performance differences between algorithms are typically not large. So if you want better classifiers, you should spend your time:
- Engineering better features
- Getting your hands on more high-quality data
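The reason is that, in general, once you have defined your features you have also limited the space you can explore. (In other words, the data sets the ceiling on the final result: the information in it is finite, and models and algorithms only search for a better solution within that space.) Many models also rest on similar principles (think of the many Learning to Rank algorithms). So if you want a better classifier, do these first:
1. Better feature engineering
2. Get more data of higher quality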
7. Learn Many Models, Not Just One
One can often make a more powerful model by learning multiple classifiers over different random subsets of the data.
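A minimal sketch of this idea using bootstrap samples and majority voting (often called bagging). The noisy toy data and the one-threshold "stump" classifier are my own assumptions, picked to keep the example short.

```python
import random

def fit_stump(rows):
    """Pick the threshold with the most hits on `rows` (predict 1 when x >= t)."""
    best_t, best_hits = None, -1
    for t, _ in rows:
        hits = sum(int(x >= t) == y for x, y in rows)
        if hits > best_hits:
            best_t, best_hits = t, hits
    return best_t

def bagged_predict(thresholds, x):
    """Majority vote over all stumps."""
    votes = sum(int(x >= t) for t in thresholds)
    return int(2 * votes > len(thresholds))

rng = random.Random(2)

def noisy_row():
    # True rule: label is 1 when x >= 0.5, with 20% of labels flipped.
    x = rng.random()
    y = int(x >= 0.5)
    if rng.random() < 0.2:
        y = 1 - y
    return x, y

train = [noisy_row() for _ in range(60)]

thresholds = []
for _ in range(25):
    # Random subset of the data: sample rows with replacement.
    sample = [rng.choice(train) for _ in train]
    thresholds.append(fit_stump(sample))

print(bagged_predict(thresholds, 0.9), bagged_predict(thresholds, 0.1))
```

Each stump sees a slightly different view of the noisy data, so their individual mistakes tend to differ and the vote smooths them out. An odd number of stumps avoids tied votes.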
8. Simplicity Does Not Imply Accuracy
So too in machine learning. If we have two models that fit the data equally well, many machine learning algorithms have a way of mathematically preferring the simpler of the two. The folk wisdom here is that a simpler model will perform better on out-of-sample testing data, because it has fewer parameters to fit, and thus is less likely to be overfit.
One should not take this rule too far. There are many places in machine learning where additional complexity can benefit performance. On top of that, it is not quite accurate to say that model complexity leads to overfitting. More accurate is that the procedure used to fit all that complexity leads to overfitting if it is not very clever. But there are plenty of cases where the complexity is brought to heel by cleverness in the model fitting process.
Thus, prefer simple models because they are smaller, faster to fit, and more interpretable, but not necessarily because they will lead to better performance; the only way to know that is to evaluate your model on test data.
8. Representable Does Not Imply Learnable
The creators of many machine learning algorithms are fond of saying that the function representing an accurate prediction on your data is representable by the learning algorithm. This means that it is possible for the algorithm to build a good model on your data.
Unfortunately, this possibility is rarely comforting by itself. Building a good model may require much more data than you have, or the good model might simply never be found by the algorithm. Just because there’s a good model out there that the algorithm can find does not mean that it will find it.
This is another great argument for feature engineering: If the algorithm can’t find a good model, but you are pretty sure that a good model exists, try engineering features that will make that model a little more obvious to the algorithm.
9. Correlation Does Not Imply Causation
The point of this common saying is that modeling observational data can only show us that two variables are related, but it cannot tell us the “why”.
You should take similar care when interpreting your models. Just because one thing predicts another doesn’t mean it causes another, and making business (or public policy) decisions based on some imagined causal relationship should be done with extreme caution.
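A small simulation of how a strong correlation can appear with no causal link at all. Everything here is synthetic and assumed for illustration: a hidden variable `z` drives both `x` and `y`, and `y` never depends on `x`.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(3)
z = [rng.gauss(0, 1) for _ in range(2000)]   # hidden common cause
x = [v + rng.gauss(0, 0.5) for v in z]       # x is driven by z
y = [v + rng.gauss(0, 0.5) for v in z]       # y is driven by z, never by x

observed = pearson(x, y)

# Intervention: set x ourselves, independently of z, and leave y alone.
x_forced = [rng.gauss(0, 1) for _ in z]
after_intervention = pearson(x_forced, y)

print(observed, after_intervention)
```

In the observational data `x` predicts `y` well, yet forcing `x` to new values leaves `y` untouched. A decision based on "changing x will change y" would fail here.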
10. The Big Picture
Machine learning is an awfully powerful tool, and like any powerful tool, misuses of it can cause a lot of damage. Understanding how machine learning works and some of the potential pitfalls can go a long way towards keeping you out of trouble.
