http://blog.csdn.net/pipisorry/article/details/46490177
Study notes for the Practical Machine Learning course
Practical Machine Learning
1.1 Prediction motivation
About this course
This course covers the basic ideas behind machine learning/prediction:
· Study design: training vs. test sets
· Conceptual issues: out of sample error, ROC curves
· Practical implementation: the caret package
What this course depends on:
· The Data Scientist's Toolbox
· R Programming
What would be useful:
· Exploratory analysis
· Reporting Data and Reproducible Research
· Regression models
Uses of machine learning
Local governments -> pension payments
Google -> whether you will click on an ad
Amazon -> what movies you will watch
Insurance companies -> what your risk of death is
Johns Hopkins -> who will succeed in their programs
Recommended books and resources
The elements of statistical learning
Machine learning (more advanced material)
List of machine learning resources on Quora
List of machine learning resources from Science
Advanced notes from MIT open courseware
Advanced notes from CMU
Kaggle machine learning competitions
1.2 What is prediction
The central dogma of prediction
Predict for these dots whether they're red or blue:
Choosing the right dataset and knowing what the specific question is are again paramount.
Potential problems
An example: the Google Flu Trends algorithm didn't anticipate that the search terms people use would change over time. People might use different terms when searching, and that would affect the algorithm's performance. Also, the way those terms were actually being used in the algorithm wasn't very well understood, so when the function of a particular search term changed, it caused problems in the algorithm.
Components of a predictor
question -> input data -> features -> algorithm -> parameters -> evaluation
Note: question: What are you trying to predict and what are you trying to predict it with?
An example of prediction: SPAM email
question -> input data -> features -> algorithm -> parameters -> evaluation
Start with a general question
Can I automatically detect emails that are SPAM or not?
Make it concrete
Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?
Note: try to make it as concrete as possible
question -> input data -> features -> algorithm -> parameters -> evaluation
rss.acs.unt.edu/Rdoc/library/kernlab/html/spam.html
question -> input data -> features -> algorithm -> parameters -> evaluation
library(kernlab)   # kernlab ships the spam dataset used in this example
data(spam)         # one row per email: word/character frequencies, capital-letter run statistics, and a spam/nonspam type label
head(spam)         # look at the first few rows
question -> input data -> features -> algorithm -> parameters -> evaluation
Our simple algorithm
Note: the best cutoff is 0.5; if the value is above 0.5 we say it's SPAM, and if it's below 0.5 we say it's HAM.
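A minimal R sketch of this simple algorithm, continuing from the spam data loaded above. The note only mentions the 0.5 cutoff; the specific feature used here (spam$your, the frequency of the word "your" in each email) is an assumption for illustration.

library(kernlab)
data(spam)

# cutoff rule from the note: above 0.5 we call it SPAM, below 0.5 we call it HAM
# (the feature spam$your is an illustrative choice, not stated in the note)
cutoff <- 0.5
prediction <- ifelse(spam$your > cutoff, "spam", "nonspam")

# in sample accuracy: fraction of emails classified correctly
mean(prediction == spam$type)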
question -> input data -> features -> algorithm -> parameters -> evaluation
1.3 Relative importance of steps
{about the tradeoffs and the different components of building a machine learning algorithm}
Relative order of importance: question > data > features > algorithms
...
Then creating features is an important component, in that if you don't compress the data in the right way you might lose all of the relevant and valuable information.
And finally, in my experience the algorithm is often the least important part of building a machine learning algorithm. It can be very important depending on the exact modality of the data you're using; for example, image data and voice data can require certain kinds of prediction algorithms that might not be as useful for other kinds of data.
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.--John Tukey
In other words, an important component of knowing how to do prediction is to know when to give up,when the data that you have is just not sufficient to answer the question that you're trying to answer.
Properties of good features
Note: there's a debate in the community about whether it's better to create features automatically or to use expert domain knowledge. In general, expert domain knowledge can help quite a bit in many, many applications, and so it should be consulted when building features for a machine learning algorithm.
Common mistakes
Note:
1. A common mistake is trying to automate feature selection in a way that doesn't allow you to understand how those features are actually being applied to make good predictions. Black box predictions can be very useful and very accurate, but they can also change on a dime if we're not paying attention to how those features actually predict the outcome.
2. A particular data set may contain outliers or weird behaviors of specific features, and not understanding those can cause problems.
Based on a bunch of features they collected in an unsupervised way: in other words, they filtered through the data to identify features that might be useful for later predictive algorithms. But even when they did this, they went back, looked at those features, and tried to figure out why they would be predictive; for example, this feature makes it very clear why it would be a good predictor for a cat.
Issues to consider when building a machine learning algorithm
Note: scalable means it's easy to apply to a large data set, whether because it's very, very fast or because it's parallelizable across multiple samples, for example.
[Gaining access to the best machine-learning methods]
Prediction is about accuracy tradeoffs
Note: for example, in medicine doctors find models in decision-tree form easier to interpret.
http://www.cs.cornell.edu/~chenhao/pub/mldg-0815.pdf
Scalability matters
http://www.techdirt.com/blog/innovation/articles/20120409/03412518422/
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
Note: Netflix never deployed the competition-winning algorithm in practice because of scalability problems; it couldn't be applied to large-scale data.
1.4 In and out of sample errors
In sample error: the error rate you get on the same data set you used to build your predictor. Sometimes called resubstitution error.
Out of sample error: the error rate you get on a new data set. Sometimes called generalization error.
Key ideas
Note:
1. In sample error is always going to be a little bit optimistic compared with the error you would get on a new sample. The reason is that, in your specific sample, your prediction algorithm will sometimes tune itself a little bit to the noise you collected in that particular data set.
2. So sometimes you want to give up a little bit of accuracy on the sample you have in order to gain accuracy on new data sets. In other words, when the noise is a little bit different, your algorithm will still be robust.
Overfitting
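To make the in sample versus out of sample idea concrete, here is a small R sketch on the same spam data. The pattern described above is that a rule tuned to the noise in a small training sample gives an optimistic in sample error and typically a worse out of sample error than a simpler rule. The sample size and cutoffs below are illustrative assumptions, not values from the course.

library(kernlab)
data(spam)
set.seed(333)

# tiny training sample; the remaining emails are held out
idx <- sample(nrow(spam), size = 10)
training <- spam[idx, ]
testing  <- spam[-idx, ]

# rule1: a convoluted rule with extra cutoffs on capitalAve, the kind you get by chasing training noise
rule1 <- function(x) ifelse(x > 2.7 | (x > 2.40 & x < 2.45), "spam", "nonspam")
# rule2: a single simple cutoff
rule2 <- function(x) ifelse(x > 2.8, "spam", "nonspam")

# in sample (resubstitution) error: error rate on the training emails themselves
mean(rule1(training$capitalAve) != training$type)
mean(rule2(training$capitalAve) != training$type)

# out of sample (generalization) error: error rate on the held-out emails
mean(rule1(testing$capitalAve) != testing$type)
mean(rule2(testing$capitalAve) != testing$type)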
1.5 Prediction study design
{about how to minimize the problems that can be caused by in sample versus out of sample errors}
1. Step 5: why do we apply the model to the test set only one time? If we applied multiple models to our test set and picked the best one, then we would be using the test set, in some sense, to train the model. In other words, we would still be getting an optimistic view of what the error would be on a completely new dataset.
2. Step 6: apply your best prediction models to your test set and refine them a little bit. You might find that some features don't work so well for out of sample prediction, and you might refine and adjust your model a little. But now, as before, your test set error is going to be a slightly optimistic estimate of your actual out of sample error. So we apply only the best model, exactly one time, to the validation set to get our prediction estimate.
3. The idea is that there is one dataset that's held out from the very start, to which you apply exactly one model exactly one time, on which you never do any training or tuning, and that will give you a good estimate of your out of sample error rate.
http://www2.research.att.com/~volinsky/papers/ASAStatComp.pdf
Rules of thumb for prediction study design
1. Rule 2: you don't get to refine your models on a test set and then apply them to a validation set, but it does ensure that your test set is of sufficient size.
2. Rule 3: first of all, you might reconsider whether you have enough samples to build a prediction algorithm at all. But if you're dead set on building a prediction or machine learning algorithm, the idea is to do cross validation and to report the caveats of the small sample size and the fact that you never got to evaluate the prediction on an out of sample (test) data set.
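A minimal sketch of a training/test/validation split in R, using the caret package mentioned in the course overview and the spam data from earlier. The 60/20/20 proportions are an assumption (a common rule of thumb for large sample sizes), since the exact figures aren't reproduced in these notes.

library(caret)
library(kernlab)
data(spam)
set.seed(32323)

# hold out 20% as a validation set that is touched exactly once, at the very end
inBuild    <- createDataPartition(y = spam$type, p = 0.80, list = FALSE)
validation <- spam[-inBuild, ]
buildData  <- spam[inBuild, ]

# split the rest 75/25, giving roughly 60% training and 20% testing overall
inTrain  <- createDataPartition(y = buildData$type, p = 0.75, list = FALSE)
training <- buildData[inTrain, ]
testing  <- buildData[-inTrain, ]

c(training = nrow(training), testing = nrow(testing), validation = nrow(validation))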
Some principles to remember
1. The test set or the validation set should be set aside and never looked at when building your model. In other words, you need to have one data set to which you apply only one model, only one time, and that data set should be completely independent of anything you used to build the prediction model.
2. So, for example, if you have time series data, in other words data collected over time, you might want to build your training set in chunks of time, but again, randomly chosen chunks of time, and build the predictions on those chunks.
3. In other words, if you want to sample any data set that might have sources of dependence over time or across space, you need to sample your data in chunks. This is called backtesting in finance, and it's basically the idea that you want to be able to use chunks of data that consist of observations over time.
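For data with dependence over time, here is a small sketch of sampling in chunks using caret's createTimeSlices; the window sizes are illustrative assumptions.

library(caret)

# stand-in for 1000 time-ordered observations
tme <- 1:1000

# each fold is a contiguous chunk: 200 observations to train on, the following 50 to test on
folds <- createTimeSlices(y = tme, initialWindow = 200, horizon = 50)
names(folds)        # "train" and "test"
folds$train[[1]]    # observations 1..200
folds$test[[1]]     # observations 201..250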
1.6 Types of errors
{about the types of errors and the ways to evaluate prediction functions}
Basic terms
For a binary classification problem:
In general, Positive = identified and negative = rejected. Therefore:
True positive = correctly identified
False positive = incorrectly identified
True negative = correctly rejected
False negative = incorrectly rejected
Medical testing example:
True positive = Sick people correctly diagnosed as sick
False positive = Healthy people incorrectly identified as sick
True negative = Healthy people correctly identified as healthy
False negative = Sick people incorrectly identified as healthy.
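In R, these four counts come from cross-tabulating predicted labels against true labels. For instance, using the hypothetical "your" cutoff rule from the earlier spam sketch:

library(kernlab)
data(spam)

# the four cells of this table are the true/false positive/negative counts
prediction <- ifelse(spam$your > 0.5, "spam", "nonspam")
table(predicted = prediction, truth = spam$type)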
Key quantities
Sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, and accuracy.
[http://en.wikipedia.org/wiki/Sensitivity_and_specificity]
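Writing TP, FP, TN, and FN for the counts of true positives, false positives, true negatives, and false negatives defined above, the standard definitions are:

sensitivity (recall) = TP / (TP + FN)
specificity = TN / (TN + FP)
positive predictive value (precision) = TP / (TP + FP)
negative predictive value = TN / (TN + FN)
accuracy = (TP + TN) / (TP + TN + FP + FN)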