Resampling Methods (CV, Bootstrap)

Table of Contents

  • Introduction
  • Cross-Validation
    • The Validation Set Approach
      • Drawbacks
    • Leave-One-Out Cross-Validation
      • In Linear Regression
      • Drawbacks
    • K-fold Cross-Validation
  • Bootstrap
      • Steps
      • Estimate of S.E.
      • Estimate of C.I.
        • Bootstrap Percentile C.I.
        • Bootstrap S.E. based C.I.
        • Better Option (Basic Bootstrap/Reverse Percentile Interval)
      • In General
    • Bootstrap in Regression
      • Empirical Bootstrap
      • Residual Bootstrap
      • Wild Bootstrap

Introduction

Resampling methods involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. (e.g. cross-validation, bootstrap)

  • Estimates of test-set prediction error (CV)
  • S.E. and bias of estimated parameters (Bootstrap)
  • C.I. of target parameter (Bootstrap)

Cross-Validation

The training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

  • Model Complexity Low: High bias, Low variance

  • Model Complexity High: Low bias, High variance

Prediction Error Estimates

  • Large test set
  • Mathematical adjustment
    • $C_p=\frac{1}{n}(SSE_d+2d\hat{\sigma}^2)$
    • $AIC=\frac{1}{n\hat{\sigma}^2}(SSE_d+2d\hat{\sigma}^2)$
    • $BIC=\frac{1}{n\hat{\sigma}^2}(SSE_d+\log(n)\,d\hat{\sigma}^2)$
  • CV: Consider a class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held out observations.
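
Purely for illustration, a minimal numeric sketch of the adjusted criteria listed above ($C_p$, AIC, BIC); the values of `n`, `d`, `sse`, and `sigma2_hat` are hypothetical placeholders, not taken from any real fit.

```python
import numpy as np

# Hypothetical quantities from a fitted linear model with d predictors.
n = 100            # number of observations
d = 3              # number of predictors in the model
sse = 412.7        # residual sum of squares SSE_d for this model
sigma2_hat = 4.1   # estimate of the error variance (e.g. from the full model)

cp  = (sse + 2 * d * sigma2_hat) / n
aic = (sse + 2 * d * sigma2_hat) / (n * sigma2_hat)
bic = (sse + np.log(n) * d * sigma2_hat) / (n * sigma2_hat)

print(f"Cp={cp:.3f}, AIC={aic:.3f}, BIC={bic:.3f}")
```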

The Validation Set Approach

Randomly split the observations into two halves: one half is used as the training set and the other as the validation set.
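
A minimal sketch of the validation set approach on simulated data (the data-generating model and the NumPy least-squares fit below are illustrative assumptions, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: y = 2 + 3x + noise
n = 200
x = rng.uniform(0, 1, n)
y = 2 + 3 * x + rng.normal(0, 1, n)

# Random split into two halves: training set and validation set.
idx = rng.permutation(n)
train, valid = idx[: n // 2], idx[n // 2 :]

# Fit simple linear regression on the training half.
X_train = np.column_stack([np.ones(train.size), x[train]])
beta = np.linalg.lstsq(X_train, y[train], rcond=None)[0]

# Validation MSE is the estimate of the test error.
X_valid = np.column_stack([np.ones(valid.size), x[valid]])
val_mse = np.mean((y[valid] - X_valid @ beta) ** 2)
print("validation MSE:", val_mse)
```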

Drawbacks

  • The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
  • Only a subset of the observations are used to fit the model.
  • Validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.

Leave-One-Out Cross-Validation

LOOCV involves splitting the set of observations into two parts. However, instead of creating two subsets of comparable size, a single observation $(x_1, y_1)$ is used for the validation set, and the remaining observations $\{(x_2, y_2), \ldots, (x_n, y_n)\}$ make up the training set.

In Linear Regression

$$CV_{(n)}=\frac{1}{n}\sum^n_{i=1}\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2$$

  • $CV_{(n)}$ becomes a weighted MSE of the residuals from a single least-squares fit, where $\hat{y}_i$ is the $i$-th fitted value and $h_i$ is its leverage, so no refitting is needed.
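
A minimal sketch of this one-fit LOOCV shortcut, reusing the simulated `x`, `y` from the validation-set sketch above; the leverages $h_i$ are taken from the hat matrix:

```python
import numpy as np

# Design matrix for simple linear regression (reusing x, y from above).
X = np.column_stack([np.ones(x.size), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta

# Leverages h_i: diagonal of the hat matrix X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# LOOCV from a single fit: leverage-weighted training residuals.
cv_n = np.mean(((y - y_hat) / (1 - h)) ** 2)
print("LOOCV estimate:", cv_n)
```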

Drawbacks

  • Estimates from each fold are highly correlated and hence their average can have high variance.

K-fold Cross-Validation

This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error. The k-fold CV estimate is computed by averaging these values. If k=n, then it is LOOCV.
$$CV_{(k)}=\frac{1}{k}\sum^k_{i=1}MSE_i \quad\text{or}\quad CV_{(k)}=\frac{1}{k}\sum^k_{i=1}Err_i$$
Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.
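
A minimal k-fold CV sketch under the same simulated-data assumptions, with the fold logic written out explicitly rather than relying on a library:

```python
import numpy as np

def kfold_cv_mse(x, y, k=5, seed=0):
    """Average validation MSE over k folds for simple linear regression."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(x.size)
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        valid = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        X_train = np.column_stack([np.ones(train.size), x[train]])
        beta = np.linalg.lstsq(X_train, y[train], rcond=None)[0]
        X_valid = np.column_stack([np.ones(valid.size), x[valid]])
        errors.append(np.mean((y[valid] - X_valid @ beta) ** 2))
    return np.mean(errors)

print("5-fold CV estimate:", kfold_cv_mse(x, y, k=5))
```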

Bootstrap

  • A powerful statistical tool to quantify the uncertainty associated with a given estimator or statistical learning method.
  • For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient.

Steps

  • Obtain $B$ data sets, each with $n$ observations, by repeatedly sampling from the original data set $Z$ with replacement.

  • Each of these bootstrap data sets, denoted $Z^{*1},\ldots,Z^{*B}$, has the same size $n$ as the original data set, and the corresponding bootstrap estimates of $\alpha$ are denoted $\hat{\alpha}^{*1},\ldots,\hat{\alpha}^{*B}$. Because sampling is with replacement, some observations appear more than once and some not at all; on average each bootstrap sample contains about 2/3 of the distinct original observations, since $1-(1-1/n)^n\approx 1-e^{-1}\approx 0.632$.

Estimate of S.E.

$$SE_B(\hat{\theta})=\sqrt{\frac{1}{B-1}\sum^B_{r=1}\left(\hat{\theta}^{*r}-\bar{\theta}^*\right)^2},\quad \text{where } \bar{\theta}^*=\frac{1}{B}\sum^B_{r=1}\hat{\theta}^{*r}$$
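
A minimal sketch of the bootstrap S.E., where the target statistic $\hat{\theta}$ is taken to be the sample median of an illustrative data set (the data and the choice of statistic are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.exponential(scale=2.0, size=100)        # original data set Z (illustrative)

B = 2000
theta_star = np.empty(B)
for r in range(B):
    # Sample n observations from Z with replacement.
    z_star = rng.choice(z, size=z.size, replace=True)
    theta_star[r] = np.median(z_star)           # bootstrap estimate theta^{*r}

# Bootstrap SE: standard deviation of the B estimates (ddof=1 gives 1/(B-1)).
se_boot = theta_star.std(ddof=1)
print("bootstrap SE of the median:", se_boot)
```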

Estimate of C.I.

Bootstrap Percentile C.I.

$$[L,U]=[\hat{\theta}^*_{\alpha/2},\ \hat{\theta}^*_{1-\alpha/2}]$$
where $\hat{\theta}^*_{q}$ denotes the $q$ quantile of the bootstrap estimates $\hat{\theta}^{*1},\ldots,\hat{\theta}^{*B}$.

Bootstrap S.E. based C.I.

$$[L,U]=\hat{\theta}\pm z_{1-\alpha/2}\times SE_B(\hat{\theta})$$
where $\hat{\theta}$ is the estimate from the original data and $SE_B(\hat{\theta})$ is the bootstrap S.E. from above.

Better Option (Basic Bootstrap/Reverse Percentile Interval)

$$[L,U]=[2\hat{\theta}-\hat{\theta}^*_{1-\alpha/2},\ 2\hat{\theta}-\hat{\theta}^*_{\alpha/2}]$$

Key: the distribution of $\hat{\theta}^*-\hat{\theta}$ (which we can compute from the bootstrap) is approximately the same as the distribution of $\hat{\theta}-\theta$ (the quantity we actually care about).

Therefore:
$$
\begin{aligned}
1-\alpha &= P(\hat{\theta}^*_{\alpha/2}\le\hat{\theta}^*\le\hat{\theta}^*_{1-\alpha/2}) \\
&= P(\hat{\theta}^*_{\alpha/2}-\hat{\theta}\le\hat{\theta}^*-\hat{\theta}\le\hat{\theta}^*_{1-\alpha/2}-\hat{\theta}) \\
&\approx P(\hat{\theta}^*_{\alpha/2}-\hat{\theta}\le\hat{\theta}-\theta\le\hat{\theta}^*_{1-\alpha/2}-\hat{\theta}) \\
&= P(2\hat{\theta}-\hat{\theta}^*_{1-\alpha/2}\le\theta\le2\hat{\theta}-\hat{\theta}^*_{\alpha/2})
\end{aligned}
$$
where the approximation step substitutes $\hat{\theta}-\theta$ for $\hat{\theta}^*-\hat{\theta}$, using the key idea above.
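
Continuing the median example above (reusing `z` and `theta_star` from the S.E. sketch), a short sketch computing all three intervals with $\alpha = 0.05$:

```python
import numpy as np

alpha = 0.05
z_crit = 1.96                                   # z_{1-alpha/2} for alpha = 0.05

theta_hat = np.median(z)                        # estimate on the original data
lo_q, hi_q = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
se_boot = theta_star.std(ddof=1)

percentile_ci = (lo_q, hi_q)                              # percentile interval
normal_ci = (theta_hat - z_crit * se_boot,                # S.E.-based interval
             theta_hat + z_crit * se_boot)
basic_ci = (2 * theta_hat - hi_q, 2 * theta_hat - lo_q)   # basic / reverse percentile

print(percentile_ci, normal_ci, basic_ci)
```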

In General

  • Each bootstrap sample has significant overlap with the original data, so using the bootstrap to estimate prediction error seriously underestimates the true prediction error.

    • Can partly fix this problem by only using predictions for those observations that did not (by chance) occur in the current bootstrap sample. (Complicated)
  • If the data is a time series, we can't simply sample the observations with replacement. We can instead create blocks of consecutive observations, sample those blocks with replacement, and then paste the sampled blocks together to obtain a bootstrap sample.
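
A minimal moving-block bootstrap sketch for a time series; the block length of 10 and the random-walk series below are illustrative choices:

```python
import numpy as np

def moving_block_bootstrap(series, block_len=10, seed=0):
    """One bootstrap sample built by pasting together randomly chosen blocks."""
    rng = np.random.default_rng(seed)
    n = len(series)
    n_blocks = int(np.ceil(n / block_len))
    # Random starting positions of the sampled blocks.
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    blocks = [series[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]           # trim to the original length

ts = np.cumsum(np.random.default_rng(1).normal(size=200))  # illustrative random walk
ts_star = moving_block_bootstrap(ts, block_len=10)
print(ts_star[:5])
```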

Bootstrap in Regression

$$Y_i=\beta_0+\beta_1X_i+\epsilon_i,\quad i=1,\ldots,n$$

Find the S.E. and C.I. for $\beta_0$ and $\beta_1$.

Empirical Bootstrap

  • Resample $(X_1, Y_1), \ldots, (X_n, Y_n)$ with replacement and obtain:

    • Bootstrap sample 1: $(X_1^{*1}, Y_1^{*1}), \ldots, (X_n^{*1}, Y_n^{*1})$
    • Bootstrap sample 2: $(X_1^{*2}, Y_1^{*2}), \ldots, (X_n^{*2}, Y_n^{*2})$
    • ...
    • Bootstrap sample B: $(X_1^{*B}, Y_1^{*B}), \ldots, (X_n^{*B}, Y_n^{*B})$
  • For each bootstrap sample, fit the regression and obtain $(\hat{\beta}_0^{*1},\hat{\beta}_1^{*1}),\ldots,(\hat{\beta}_0^{*B},\hat{\beta}_1^{*B})$, then estimate the S.E. and C.I.
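
A minimal paired (empirical) bootstrap sketch for this regression; the simulated data, the `fit_ols` helper, and $B = 1000$ are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 1000
X = rng.uniform(0, 1, n)
Y = 1.0 + 2.0 * X + rng.normal(0, 0.5, n)       # illustrative data

def fit_ols(x, y):
    """Return (beta0, beta1) from simple least squares."""
    A = np.column_stack([np.ones(x.size), x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

betas = np.empty((B, 2))
for b in range(B):
    idx = rng.integers(0, n, size=n)            # resample (X_i, Y_i) pairs with replacement
    betas[b] = fit_ols(X[idx], Y[idx])

se = betas.std(axis=0, ddof=1)                  # bootstrap S.E. of (beta0, beta1)
ci = np.quantile(betas, [0.025, 0.975], axis=0) # percentile C.I.
print("SE:", se, "CI:", ci)
```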

Residual Bootstrap

  • Recall that the residuals $\hat{e}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$ mimic the role of the errors $\epsilon_i$.

  • Bootstrap the residuals and obtain:

    • Bootstrap residuals 1: $\hat{e}_1^{*1},\ldots,\hat{e}_n^{*1}$
    • Bootstrap residuals 2: $\hat{e}_1^{*2},\ldots,\hat{e}_n^{*2}$
    • ...
    • Bootstrap residuals B: $\hat{e}_1^{*B},\ldots,\hat{e}_n^{*B}$
  • Generate a new bootstrap sample: $X_i^{*b}=X_i,\ Y_i^{*b}=\hat{\beta}_0+\hat{\beta}_1X_i+\hat{e}_i^{*b}$, i.e. keep $X_i$ fixed and add a resampled residual to the fitted value.

  • For each bootstrap sample, fit regression and estimate S.E. and C.I.
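
A residual-bootstrap sketch reusing `X`, `Y`, `rng`, `B`, `n`, and `fit_ols` from the empirical-bootstrap block above:

```python
import numpy as np

beta_hat = fit_ols(X, Y)                        # fit on the original data
fitted = beta_hat[0] + beta_hat[1] * X
resid = Y - fitted                              # residuals mimic the errors eps_i

betas = np.empty((B, 2))
for b in range(B):
    e_star = rng.choice(resid, size=n, replace=True)  # resample residuals
    Y_star = fitted + e_star                    # X kept fixed, new responses
    betas[b] = fit_ols(X, Y_star)

print("residual-bootstrap SE:", betas.std(axis=0, ddof=1))
```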

Wild Bootstrap

When the error variance $Var(\epsilon_i\mid X_i)$ depends on the value of $X_i$ (so-called heteroskedasticity), the residual bootstrap is unstable because it reassigns residuals across observations regardless of the value of $X$. The wild bootstrap instead keeps each observation's own residual and only rescales it with a random multiplier.

  • Generate i.i.d. random variables $V_1^b,\ldots,V_n^b \sim N(0,1)$

  • Generate a new bootstrap sample: $X_i^{*b}=X_i,\ Y_i^{*b}=\hat{\beta}_0+\hat{\beta}_1X_i+V_i^b\hat{e}_i$

  • For each bootstrap sample, fit regression and estimate S.E. and C.I.
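
A wild-bootstrap sketch reusing `fitted`, `resid`, and the other objects from the residual-bootstrap block above; each observation keeps its own residual, rescaled by an independent $N(0,1)$ multiplier:

```python
import numpy as np

betas = np.empty((B, 2))
for b in range(B):
    V = rng.normal(0.0, 1.0, size=n)            # V_i^b ~ N(0,1), one per observation
    Y_star = fitted + V * resid                 # each observation's own residual, rescaled
    betas[b] = fit_ols(X, Y_star)

print("wild-bootstrap SE:", betas.std(axis=0, ddof=1))
```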
