Bias-Variance+Noise Decomposition in Linear Regression

Model:
$$
y = F(\mathbf{x}) + v
$$
Here $F(\mathbf{x})$ can be viewed as the oracle model: it does not change when the training data changes.
where $v$ is additive white noise (zero mean) with variance $\sigma^2_v$. (Note: the noise does not have to be Gaussian, but it does have to be white.)
That means, for any $\mathbf{x}_0$, we have
$$
E_{y|\mathbf{x}}[y_0|\mathbf{x}_0] = F(\mathbf{x}_0)
$$
Here $(\mathbf{x}_0, y_0)$ can be viewed as a test data point.

The expected loss of a predictor $\hat{f}$ is taken w.r.t. $\mathbf{x}_0$ and $y_0$ (it can be interpreted as an expectation w.r.t. the test data):
$$
\begin{aligned}
E_{\mathbf{x},y}[(y_0-\hat{f}(\mathbf{x}_0))^2] &= E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0) + F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2] \\
&= E_{\mathbf{x},y}[(y_0 - F(\mathbf{x}_0))^2] \qquad (1)\ (=\sigma^2_v) \\
&\quad + E_{\mathbf{x},y}[(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2] \qquad (2)\ (\text{important}) \\
&\quad + 2E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \qquad (3)\ (=0)
\end{aligned}
$$
The cross term (3) can be written as:
$$
\begin{aligned}
& E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \\
&= \int\!\!\int (y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\, p(y_0|\mathbf{x}_0)\, p(\mathbf{x}_0)\, dy_0\, d\mathbf{x}_0 \\
&= \int \{E_{y|\mathbf{x}}[(y_0-F(\mathbf{x}_0))]\}\, (F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\, p(\mathbf{x}_0)\, d\mathbf{x}_0 \\
&= 0
\end{aligned}
$$
What confused me here was: once $\mathbf{x}_0$ is fixed, why is $\hat{f}(\mathbf{x}_0)$ a constant with respect to $E_{y|\mathbf{x}}$? Because $\hat{f}$ is the model obtained from the training data, it depends only on the training data $X$. Once $\hat{f}$ has been obtained, $\hat{f}(\mathbf{x}_0)$ depends only on the input $\mathbf{x}_0$ that you plug in, so it is independent of $y_0$. Once this relationship is understood, the formulas that follow come easily. The formula below makes this explanation more intuitive.

Another way to think about the above equation:
$$
\begin{aligned}
& E_{\mathbf{x},y}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))] \\
&= E_{\mathbf{x}}\{E_{y|\mathbf{x}}[(y_0-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\,|\,\mathbf{x}_0]\} \\
&= E_{\mathbf{x}}\{E_{y|\mathbf{x}}[(y_0-F(\mathbf{x}_0))\,|\,\mathbf{x}_0]\,(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\} \\
&= E_{\mathbf{x}}\{(E_{y|\mathbf{x}}[y_0|\mathbf{x}_0]-F(\mathbf{x}_0))(F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))\} \\
&\qquad (\text{Note: because } E_{y|\mathbf{x}}[y_0|\mathbf{x}_0] = F(\mathbf{x}_0)) \\
&= 0
\end{aligned}
$$
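The vanishing of cross term (3) can also be checked numerically. The sketch below is only an illustration under assumed choices that are not in the original derivation: a hypothetical oracle `oracle_F(x) = x**2`, a hypothetical already-trained predictor `f_hat(x) = 0.8*x + 0.1`, test inputs drawn uniformly from $[-1, 1]$, and uniform (non-Gaussian) zero-mean noise with variance $\sigma_v^2$.

```python
import random


def oracle_F(x):
    # Hypothetical oracle model F(x); any fixed function would do.
    return x * x


def f_hat(x):
    # Hypothetical predictor, treated as already trained (fixed).
    return 0.8 * x + 0.1


def cross_term_estimate(n_samples=200_000, sigma_v=0.5, seed=0):
    """Monte Carlo estimate of E[(y0 - F(x0)) * (F(x0) - f_hat(x0))]."""
    rng = random.Random(seed)
    # Uniform noise on [-a, a] has variance a^2 / 3, so a = sigma_v * sqrt(3)
    # gives zero-mean, non-Gaussian noise with variance sigma_v^2.
    half_width = sigma_v * 3 ** 0.5
    total = 0.0
    for _ in range(n_samples):
        x0 = rng.uniform(-1.0, 1.0)          # test input
        v = rng.uniform(-half_width, half_width)
        y0 = oracle_F(x0) + v                # test target
        total += (y0 - oracle_F(x0)) * (oracle_F(x0) - f_hat(x0))
    return total / n_samples


if __name__ == "__main__":
    print(cross_term_estimate())
```

The estimate should come out close to zero, because $\hat{f}(\mathbf{x}_0)$ is computed without looking at $y_0$, so the noise averages out against it.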

We now analyze (2). More explicitly, $\hat{f}(\mathbf{x}_0)=\hat{f}(\mathbf{x}_0,X)$ (which $\mathbf{depends}$ on $X$). Let's define $\bar{f}(\mathbf{x}_0)=E_X[\hat{f}(\mathbf{x}_0)]$ (which does $\mathbf{not \ depend}$ on $X$). Then the term inside (2) can be rewritten as:

$$
\begin{aligned}
& (F(\mathbf{x}_0)-\hat{f}(\mathbf{x}_0))^2 \qquad (******) \\
&= (F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0) + \bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2 \\
&= (F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2 \qquad (4) \\
&\quad + (\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2 \qquad (5) \\
&\quad + 2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0)) \qquad (6)
\end{aligned}
$$
The $\mathbf{expectation}$ is taken w.r.t. the random $\mathbf{training}$ data set $X$. The cross term (6) can then be written as:
$$
\begin{aligned}
E_X[2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))] &= 2(F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))\,E_X[\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0)] \\
&= 0
\end{aligned}
$$
The factor $F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0)$ does not depend on $X$, so it can be pulled out of the expectation, and $E_X[\hat{f}(\mathbf{x}_0)] = \bar{f}(\mathbf{x}_0)$ by definition.

Then we take the expectation of $(******)$ w.r.t. $X$:
(Note: this part closely resembles the familiar $E(\hat{\theta}-\theta)^2$, and the analysis below helps us understand that formula better. Remember that $\hat{\theta}$ is a random variable that depends on the training data, so the expectation in $E(\hat{\theta}-\theta)^2$ is taken w.r.t. the training data.)
$$
\begin{aligned}
E_X[(F(\mathbf{x}_0) - \hat{f}(\mathbf{x}_0))^2] &= (F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2 \\
&\quad + E_X[(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2] \\
&= \text{Bias}^2 + \text{Variance}
\end{aligned}
$$
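The pointwise identity $E_X[(F - \hat{f})^2] = \text{Bias}^2 + \text{Variance}$ at a fixed $\mathbf{x}_0$ can be illustrated by refitting a model on many random training sets. This is a minimal sketch under assumptions of my own choosing: a hypothetical oracle `oracle_F(x) = x**2`, a straight-line least-squares fit in one dimension (`fit_line` is a helper written for this example), Gaussian noise, and training inputs uniform on $[-1, 1]$. The expectation over $X$ is replaced by an average over the simulated training sets.

```python
import random


def oracle_F(x):
    # Hypothetical oracle model; deliberately not linear, so bias is non-zero.
    return x * x


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (closed form, 1-D input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def pointwise_terms(x0=0.5, n_train=20, n_sets=2000, sigma_v=0.3, seed=1):
    """Return (bias^2, variance, mean squared deviation from F) at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_sets):
        # Draw a fresh training set X and fit f_hat on it.
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_train)]
        ys = [oracle_F(x) + rng.gauss(0.0, sigma_v) for x in xs]
        a, b = fit_line(xs, ys)
        preds.append(a + b * x0)             # f_hat(x0, X)
    f_bar = sum(preds) / n_sets              # estimate of E_X[f_hat(x0)]
    bias_sq = (oracle_F(x0) - f_bar) ** 2
    variance = sum((p - f_bar) ** 2 for p in preds) / n_sets
    mse = sum((oracle_F(x0) - p) ** 2 for p in preds) / n_sets
    return bias_sq, variance, mse


if __name__ == "__main__":
    bias_sq, variance, mse = pointwise_terms()
    print(bias_sq, variance, mse, bias_sq + variance)
```

Because `f_bar` is the sample mean of the same predictions, the sample version of cross term (6) is exactly zero, so `bias_sq + variance` matches `mse` up to floating-point rounding.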

Putting (1), (2) and (3) together, we have the decomposition of the expected error:
$$
\begin{aligned}
E_{X, \mathbf{x}_0, y_0}[(y_0 - \hat{f}(\mathbf{x}_0,X))^2] &= \sigma^2_v \qquad (\text{noise variance}) \\
&\quad + \int (F(\mathbf{x}_0) - \bar{f}(\mathbf{x}_0))^2\, p(\mathbf{x}_0)\, d\mathbf{x}_0 \qquad \text{expected squared bias} \\
&\quad + \int E_X[(\bar{f}(\mathbf{x}_0) -\hat{f}(\mathbf{x}_0))^2]\, p(\mathbf{x}_0)\, d\mathbf{x}_0 \qquad \text{expected variance}
\end{aligned}
$$
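The full decomposition can be checked end to end with a small simulation. As before, this is a sketch under assumptions that are not part of the original text: a hypothetical oracle `oracle_F(x) = x**2`, a straight-line least-squares predictor (`fit_line` is a helper written for this example), Gaussian noise, and $p(\mathbf{x}_0)$ uniform on $[-1, 1]$, approximated by a midpoint grid so that the two integrals become grid averages.

```python
import random


def oracle_F(x):
    # Hypothetical oracle model F(x).
    return x * x


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (closed form, 1-D input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def decomposition(n_train=20, n_sets=1000, n_grid=50, sigma_v=0.3, seed=2):
    """Return (expected error, noise variance, bias^2, variance)."""
    rng = random.Random(seed)
    # Midpoint grid on [-1, 1], standing in for p(x0) = Uniform(-1, 1).
    grid = [-1.0 + (2 * k + 1) / n_grid for k in range(n_grid)]
    preds = []          # preds[i][k] = f_hat(grid[k]) fitted on training set i
    total_err = 0.0
    for _ in range(n_sets):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_train)]
        ys = [oracle_F(x) + rng.gauss(0.0, sigma_v) for x in xs]
        a, b = fit_line(xs, ys)
        row = [a + b * x0 for x0 in grid]
        preds.append(row)
        for x0, p in zip(grid, row):
            # Fresh noisy test target y0 for every (training set, x0) pair.
            y0 = oracle_F(x0) + rng.gauss(0.0, sigma_v)
            total_err += (y0 - p) ** 2
    total_err /= n_sets * n_grid

    bias_sq = 0.0
    variance = 0.0
    for k, x0 in enumerate(grid):
        col = [row[k] for row in preds]
        f_bar = sum(col) / n_sets            # estimate of E_X[f_hat(x0)]
        bias_sq += (oracle_F(x0) - f_bar) ** 2
        variance += sum((p - f_bar) ** 2 for p in col) / n_sets
    return total_err, sigma_v ** 2, bias_sq / n_grid, variance / n_grid


if __name__ == "__main__":
    total, noise, bias_sq, variance = decomposition()
    print("expected error:", total)
    print("noise + bias^2 + variance:", noise + bias_sq + variance)
```

The two printed numbers should agree up to Monte Carlo error. The only remaining gap comes from the sampled test noise, since the bias and variance terms are computed from the same fitted models.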

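Short Summary: Data is divided into two parts: training and testing.
The expected squared error can be viewed as the true error or prediction error. It averages over both parts: the expectation over the training set $X$ gives the squared bias and the variance of $\hat{f}$, and the expectation over the test point $(\mathbf{x}_0, y_0)$ gives the noise variance and the averaging over $p(\mathbf{x}_0)$.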

Reference: https://web.archive.org/web/20140821063842/http://ttic.uchicago.edu/~gregory/courses/wis-ml2012/lectures/biasVarDecom.pdf
