Resampling Methods for Training Error & Test Error

Why use resampling methods

Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. 
Resampling methods are often used to estimate the test error.

Resampling methods:
cross-validation
Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility.
bootstrap
The bootstrap is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.


k-Fold Cross-Validation

1)Theory
k-fold CV involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds.
This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error.
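
The procedure above can be sketched by hand in base R. This is only a minimal illustration, not the code used later in this post: the simulated data, the choice k = 5, and the names dat, fold, fold_mse and cv_estimate are all assumptions made for the example. The k fold-level estimates are averaged to give the overall k-fold CV estimate.

```r
# Minimal k-fold cross-validation sketch using base R only.
set.seed(1)                                # make the fold assignment reproducible

# Simulated data (an assumption for illustration): y depends linearly on x.
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

k <- 5
# Randomly assign each observation to one of k folds of equal size.
fold <- sample(rep(1:k, length.out = n))

fold_mse <- numeric(k)
for (i in 1:k) {
  train <- dat[fold != i, ]                # the remaining k - 1 folds
  valid <- dat[fold == i, ]                # the held-out validation fold
  fit   <- lm(y ~ x, data = train)         # fit on the training folds only
  pred  <- predict(fit, newdata = valid)
  fold_mse[i] <- mean((valid$y - pred)^2)  # test-error estimate for fold i
}

cv_estimate <- mean(fold_mse)              # average of the k fold estimates
print(fold_mse)
print(cv_estimate)
```

Each observation lands in the validation set exactly once, so every row contributes to the error estimate without ever being predicted by a model that was trained on it.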


2)R implementation
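rnorm: The rnorm() function generates a vector of random normal variables; its first argument n is the sample size.
set.seed: Sometimes we want our code to reproduce the exact same set of random numbers; we can use the set.seed() function to do this.
set.seed() sets the seed of the random number generator, and a given seed produces one specific pseudo-random sequence. The main purpose of the function is to make a simulation repeatable: we often need to draw random numbers, but the results change when the same code is run again. If the same simulation results need to appear again, use set.seed(). Reproducible results matter when debugging a program or giving a demonstration, which is why a random seed is needed.
set.seed(parm): the argument is only a label. The next time you want the same random sequence, call set.seed(100) again, and the random functions that follow will generate the same samples as last time.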
cv.glm: This function calculates the estimated K-fold cross-validation prediction error for generalized linear models.
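
A small sketch of how set.seed() and rnorm() interact may help here. The seed value 100 and the variable names are arbitrary choices for illustration.

```r
set.seed(100)
first_draw <- rnorm(5)    # five draws from the standard normal distribution

set.seed(100)             # resetting the same seed restarts the same sequence
second_draw <- rnorm(5)

third_draw <- rnorm(5)    # no reset here: the sequence simply continues

print(identical(first_draw, second_draw))  # TRUE
print(identical(first_draw, third_draw))
```

Only the draws that follow a call to set.seed() with the same value are repeated; draws made without resetting the seed carry on from wherever the generator was.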

# prepare the data source
> library(ISLR)   # provides the Auto data set
> library(boot)   # the cv.glm() function is part of the boot library
> attach(Auto)
> names(Auto)     # list the variables (columns) of the data set
> fix(Auto)       # open the data in a spreadsheet-like window to inspect it
# the 10-fold cross-validation estimate of the test error is approximately 24.2
> set.seed(17)
> glm.fit = glm(mpg ~ horsepower, data = Auto)
> cv.err = cv.glm(Auto, glm.fit, K = 10)
> cv.err$delta    # raw CV estimate, then the bias-adjusted estimate
[1] 24.20520 24.19133
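
cv.glm() can also be used to choose the level of flexibility by comparing the CV error of several candidate models. The sketch below assumes simulated data with a quadratic relationship rather than the Auto data, so that it needs only the boot library; the names sim, degrees and cv_error are made up for the example.

```r
library(boot)                       # provides cv.glm()

set.seed(17)
# Simulated data (an assumption for illustration): a quadratic relationship.
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x - 1.5 * x^2 + rnorm(n)
sim <- data.frame(x = x, y = y)

degrees <- 1:4
cv_error <- numeric(length(degrees))
for (d in degrees) {
  fit <- glm(y ~ poly(x, d), data = sim)            # polynomial fit of degree d
  cv_error[d] <- cv.glm(sim, fit, K = 10)$delta[1]  # raw 10-fold CV estimate
}

print(cv_error)
print(which.min(cv_error))          # degree with the smallest estimated test error
```

Because the simulated relationship is quadratic, the linear fit should show a clearly larger CV error than the degree-2 fit, while higher degrees bring little further improvement.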

Bootstrap
The bootstrap is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.
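
As a rough illustration of measuring the accuracy of a parameter estimate, the sketch below bootstraps the standard error of a sample mean in base R. The simulated sample, the number of resamples B = 1000, and the variable names are assumptions made for the example.

```r
set.seed(1)
# Simulated sample (an assumption for illustration).
x <- rnorm(50, mean = 10, sd = 2)

B <- 1000                                   # number of bootstrap resamples
boot_means <- numeric(B)
for (b in 1:B) {
  # Draw a sample of the same size from the data, with replacement.
  resample <- sample(x, size = length(x), replace = TRUE)
  boot_means[b] <- mean(resample)           # re-estimate the parameter
}

boot_se <- sd(boot_means)                   # bootstrap standard error of the mean
print(boot_se)
print(sd(x) / sqrt(length(x)))              # textbook formula, for comparison
```

For the mean there is a closed-form standard error to compare against; the same resampling loop also applies to estimates that have no simple formula.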

TODO
