19 贝叶斯线性回归

文章目录

  • 19 贝叶斯线性回归
    • 19.1 频率派线性回归
    • 19.2 Bayesian Method
      • 19.2.1 Inference问题
      • 19.2.2 Prediction问题

19 贝叶斯线性回归

19.1 频率派线性回归

数据与模型:

  • 样本:
    { ( x i , y i ) } i = 1 N , x i ∈ R p , y i ∈ R p {\lbrace (x_i, y_i) \rbrace}_{i=1}^{N}, \quad x_i \in {\mathbb R}^p, \quad y_i \in {\mathbb R}^p {(xi,yi)}i=1N,xiRp,yiRp

    X = ( x 1   x 2   …   x N ) T = ( x 1 T x 2 T … x N T ) = ( x 11 x 12 … x 1 N x 21 x 22 … x 2 N … x N 1 x N 2 … x N N ) , Y = ( y 1 T y 2 T … y N T ) X = (x_1 \ x_2 \ \dots \ x_N )^T = \begin{pmatrix} x_1^T \\ x_2^T \\ \dots \\ x_N^T \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1N} \\ x_{21} & x_{22} & \dots & x_{2N} \\ \dots \\ x_{N1} & x_{N2} & \dots & x_{NN} \\ \end{pmatrix} , Y = \begin{pmatrix} y_1^T \\ y_2^T \\ \dots \\ y_N^T \end{pmatrix} X=(x1 x2  xN)T= x1Tx2TxNT = x11x21xN1x12x22xN2x1Nx2NxNN ,Y= y1Ty2TyNT

  • 回归方程:
    f ( x ) = w T x = x T w , y = f ( x ) + ε ⏟ n o i s e , ε ∽ N ( 0 , σ 2 ) f(x) = w^T x = x^T w, \quad y = f(x) + \underbrace{\varepsilon}_{noise}, \quad \varepsilon \backsim N(0,\sigma^2) f(x)=wTx=xTw,y=f(x)+noise ε,εN(0,σ2)
    其中 x , y , ε x, y, \varepsilon x,y,ε都是随机变量,假设 w w w用于表示参数

在频率派的线性回归中,我们是通过假设 w w w表示一个未知的常量,转化为优化问题进行求解。我们将这种方法称为点估计,在过去我们学习过了两种方法:

  1. L S E    ⟸    M L E ( noise is Gaussian ) LSE \impliedby MLE(\text{noise is Gaussian}) LSEMLE(noise is Gaussian)——极大似然估计:
    w M L E = a r g max ⁡ w P ( D a t a ∣ w ) w_{MLE} = arg\max_{w} P(Data|w) wMLE=argwmaxP(Dataw)

  2. R e g u l a r i z e d   L S E    ⟸    M A P ( noise is Gaussian ) Regularized \ LSE \impliedby MAP(\text{noise is Gaussian}) Regularized LSEMAP(noise is Gaussian)——最大后验估计:
    w M A P = a r g max ⁡ w P ( w ∣ D a t a ) ⏟ ∝ P ( D a t a ∣ w ) ⋅ P ( w ) = a r g max ⁡ w P ( D a t a ∣ w ) ⋅ P ( w ) w_{MAP} = arg\max_{w} \underbrace{P(w|Data)}_{\propto P(Data|w) \cdot P(w)} = arg\max_{w} P(Data|w) \cdot P(w) wMAP=argwmaxP(Dataw)P(w) P(wData)=argwmaxP(Dataw)P(w)
    其中若 P ( w ) P(w) P(w)表示为Gaussian Dist则为岭回归(Ridge),若 P ( w ) P(w) P(w)表示为Laplace则为Lasso

在本章我们的目标是通过Bayesian Method解决线性回归问题:

  • 假定 w w w是一个随机变量
  • 求出后验 P ( w ∣ D a t a ) P(w|Data) P(wData)

19.2 Bayesian Method

数据与模型:

  • 样本数据:
    { ( x i , y i ) } i = 1 N , x i ∈ R p , y i ∈ R p {\lbrace (x_i, y_i) \rbrace}_{i=1}^{N}, \quad x_i \in {\mathbb R}^p, \quad y_i \in {\mathbb R}^p {(xi,yi)}i=1N,xiRp,yiRp

    X = ( x 1   x 2   …   x N ) T = ( x 1 T x 2 T … x N T ) = ( x 11 x 12 … x 1 N x 21 x 22 … x 2 N … x N 1 x N 2 … x N N ) , Y = ( y 1 T y 2 T … y N T ) X = (x_1 \ x_2 \ \dots \ x_N )^T = \begin{pmatrix} x_1^T \\ x_2^T \\ \dots \\ x_N^T \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1N} \\ x_{21} & x_{22} & \dots & x_{2N} \\ \dots \\ x_{N1} & x_{N2} & \dots & x_{NN} \\ \end{pmatrix} , Y = \begin{pmatrix} y_1^T \\ y_2^T \\ \dots \\ y_N^T \end{pmatrix} X=(x1 x2  xN)T= x1Tx2TxNT = x11x21xN1x12x22xN2x1Nx2NxNN ,Y= y1Ty2TyNT

  • 模型:
    f ( x ) = w T x = x T w , y = f ( x ) + ε ⏟ n o i s e , ε ∽ N ( 0 , σ 2 ) f(x) = w^T x = x^T w, \quad y = f(x) + \underbrace{\varepsilon}_{noise}, \quad \varepsilon \backsim N(0,\sigma^2) f(x)=wTx=xTw,y=f(x)+noise ε,εN(0,σ2)
    其中 x , y , ε , w x, y, \varepsilon, w x,y,ε,w都是随机变量,假设用于表示参数

  • 问题表示:
    { I n f e r e n c e : p o s t e r i o r ( w ) P r e d i c t i o n : x ∗ → y ∗ \begin{cases} Inference: posterior(w) \\ Prediction: x^* \rightarrow y^* \end{cases} {Inference:posterior(w)Prediction:xy

19.2.1 Inference问题

Inference问题就是求解后验: P ( w ∣ D a t a ) P(w|Data) P(wData)。接下来进行逐步的推导:
P ( w ∣ D a t a ) = P ( w ∣ X , Y ) = P ( w , Y ∣ X ) P ( Y ∣ X ) = P ( Y ∣ w , X ) ⏞ l i k e l i h o o d ⋅ P ( w ∣ X ) ⏞ p r i o r ∫ P ( Y ∣ w , X ) ⋅ P ( w ∣ X ) d w \begin{align} P(w|Data) = P(w|X, Y) = \frac{P(w, Y| X)}{P(Y|X)} = \frac{\overbrace{P(Y|w, X)}^{likelihood} \cdot \overbrace{P(w|X)}^{prior}}{\int P(Y|w, X) \cdot P(w|X) {\rm d}w} \end{align} P(wData)=P(wX,Y)=P(YX)P(w,YX)=P(Yw,X)P(wX)dwP(Yw,X) likelihoodP(wX) prior
将后验拆解开之后,我们只需要分开求解likelihood和prior:

  • 求解likelihood:
    P ( Y ∣ w , X ) = ∏ i = 1 N P ( y i ∣ w , x i ) = ∏ i = 1 N N ( y i ∣ w T x i , σ 2 ) P(Y|w, X) = \prod_{i=1}^{N} P(y_i| w, x_i) = \prod_{i=1}^{N} N(y_i| w^T x_i, \sigma^2) P(Yw,X)=i=1NP(yiw,xi)=i=1NN(yiwTxi,σ2)

  • 假设prior:
    p ( w ∣ X ) = N ( 0 , Σ p ) p(w|X) = N(0, \Sigma_p) p(wX)=N(0,Σp)

所以求解后验可以写为:
P ( w ∣ D a t a ) ∝ P ( Y ∣ w , X ) ⋅ P ( w ∣ X ) ∝ ∏ i = 1 N N ( y i ∣ w T x i , σ 2 ) ⋅ N ( 0 , Σ p ) \begin{align} P(w|Data) &\propto P(Y|w,X) \cdot P(w|X) \\ &\propto \prod_{i=1}^{N} N(y_i| w^T x_i, \sigma^2) \cdot N(0, \Sigma_p) \end{align} P(wData)P(Yw,X)P(wX)i=1NN(yiwTxi,σ2)N(0,Σp)
我们先将likelihood进行一个变换:
P ( Y ∣ w , X ) = ∏ i = 1 N N ( y i ∣ w T x i , σ 2 ) = ∏ i = 1 N 1 ( 2 π ) 1 2 σ exp ⁡ { − 1 2 σ 2 ( y i − w T x i ) 2 } = 1 ( 2 π ) N 2 σ N exp ⁡ { − 1 2 σ 2 ∑ i = 1 N ( y i − w T x i ) 2 } = 1 ( 2 π ) N 2 σ N ⏟ ∣ Σ ∣ 1 2 exp ⁡ { − 1 2 ( Y − X w ) ⏟ x − μ T σ − 2 I ⏟ Σ − 1 ( Y − X w ) } = N ( X w , σ − 2 I ) \begin{align} P(Y|w, X) &= \prod_{i=1}^{N} N(y_i| w^T x_i, \sigma^2) \\ &= \prod_{i=1}^{N} \frac{ 1 }{ {(2 \pi)}^\frac{1}{2} \sigma } \exp{\lbrace -\frac{1}{2\sigma^2} {( y_i - w^T x_i )}^2 \rbrace} \\ &= \frac{ 1 }{ {(2 \pi)}^\frac{N}{2} \sigma^N } \exp{\lbrace -\frac{1}{2\sigma^2} \sum_{i=1}^N {( y_i - w^T x_i )}^2 \rbrace} \\ &= \frac{ 1 }{ {(2 \pi)}^\frac{N}{2} \underbrace{\sigma^N}_{{|\Sigma|}^\frac{1}{2}} } \exp{\lbrace -\frac{1}{2} {\underbrace{(Y-Xw)}_{x-\mu}}^T \underbrace{\sigma^{-2} I}_{\Sigma^{-1}} {(Y-Xw)} \rbrace} \\ &= N(Xw, \sigma^{-2} I) \end{align} P(Yw,X)=i=1NN(yiwTxi,σ2)=i=1N(2π)21σ1exp{2σ21(yiwTxi)2}=(2π)2NσN1exp{2σ21i=1N(yiwTxi)2}=(2π)2N∣Σ∣21 σN1exp{21xμ (YXw)TΣ1 σ2I(YXw)}=N(Xw,σ2I)
通过上文的likelihood我们可以求解:
P ( w ∣ D a t a ) ∝ P ( Y ∣ w , X ) ⋅ P ( w ∣ X ) = N ( X w , σ − 2 I ) ) ⋅ N ( 0 , Σ p ) ∝ exp ⁡ { − 1 2 ( Y − X w ) T σ − 2 I ( Y − X w ) } ⋅ exp ⁡ { − 1 2 w T Σ p − 1 w } = exp ⁡ { − 1 2 ( Y − X w ) T σ − 2 I ( Y − X w ) − 1 2 w T Σ p − 1 w } = exp ⁡ { − 1 2 ( Y T Y − 2 Y T X w + w T X T X w ) − 1 2 w T Σ p − 1 w } \begin{align} P(w|Data) &\propto P(Y|w,X) \cdot P(w|X) = N(Xw, \sigma^{-2} I)) \cdot N(0, \Sigma_p) \\ &\propto \exp{\lbrace -\frac{1}{2} {{(Y-Xw)}}^T {\sigma^{-2} I} {(Y-Xw)} \rbrace} \cdot \exp{\lbrace -\frac{1}{2} w^T \Sigma_p^{-1} w \rbrace} \\ &= \exp{\lbrace -\frac{1}{2} {{(Y-Xw)}}^T {\sigma^{-2} I} {(Y-Xw)} -\frac{1}{2} w^T \Sigma_p^{-1} w \rbrace} \\ &= \exp{\lbrace -\frac{1}{2} {( Y^T Y - 2Y^T X w + w^T X^T X w )} -\frac{1}{2} w^T \Sigma_p^{-1} w \rbrace} \\ \end{align} P(wData)P(Yw,X)P(wX)=N(Xw,σ2I))N(0,Σp)exp{21(YXw)Tσ2I(YXw)}exp{21wTΣp1w}=exp{21(YXw)Tσ2I(YXw)21wTΣp1w}=exp{21(YTY2YTXw+wTXTXw)21wTΣp1w}

引入配方法:

若将标准的高斯分布可以得到二次项和一次项:
p ( x ) ∝ exp ⁡ { − 1 2 ( x − μ ) T Σ − 1 ( x − μ ) } = exp ⁡ { − 1 2 ( x T Σ − 1 x − x T Σ − 1 μ ⏟ 1 × 1 − μ T Σ − 1 x ⏟ 1 × 1 + μ T Σ − 1 μ ) } = exp ⁡ { − 1 2 ( x T Σ − 1 x − 2 μ T Σ − 1 x + μ T Σ − 1 μ ⏟ 与 x 无关 ) } ∝ exp ⁡ { − 1 2 x T Σ − 1 x ⏟ 二次项 − μ T Σ − 1 x ⏟ 一次项 } \begin{align} p(x) &\propto \exp{\lbrace -\frac{1}{2}{(x-\mu)}^T \Sigma^{-1} (x-\mu) \rbrace} \\ &= \exp{\lbrace -\frac{1}{2} (x^T \Sigma^{-1} x - \underbrace{x^T \Sigma^{-1} \mu}_{1 \times 1} - \underbrace{\mu^T \Sigma^{-1} x}_{1 \times 1} + \mu^T \Sigma^{-1} \mu) \rbrace} \\ &= \exp{\lbrace -\frac{1}{2} (x^T \Sigma^{-1} x - 2 \mu^T \Sigma^{-1} x + \underbrace{\mu^T \Sigma^{-1} \mu}_{与x无关}) \rbrace} \\ &\propto \exp{\lbrace \underbrace{-\frac{1}{2} x^T \Sigma^{-1} x}_{二次项} - \underbrace{\mu^T \Sigma^{-1} x}_{一次项} \rbrace} \\ \end{align} p(x)exp{21(xμ)TΣ1(xμ)}=exp{21(xTΣ1x1×1 xTΣ1μ1×1 μTΣ1x+μTΣ1μ)}=exp{21(xTΣ1x2μTΣ1x+x无关 μTΣ1μ)}exp{二次项 21xTΣ1x一次项 μTΣ1x}
我们可以通过二次项和一次项求出均值和方差

让我们用配方法,取出 P ( w ∣ D a t a ) P(w|Data) P(wData)的二次项和一次项,假设 P ( w ∣ D a t a ) P(w|Data) P(wData)的均值和方差表示为 μ w , Σ w \mu_w, \Sigma_w μw,Σw
{ 二次项: − 1 2 σ 2 w T X T X w − 1 2 w T Σ p − 1 w = − 1 2 ( w T ( σ − 2 X T X + Σ p − 1 ) w ) ⏟ − 1 2 x T Σ w − 1 x 一次项: σ − 2 Y T X w ⏟ μ T Σ w − 1 x    ⟹    { Σ w − 1 = ( σ − 2 X T X + Σ p − 1 ) μ T Σ w − 1 = σ − 2 Y T X \begin{align} &\begin{cases} \text{二次项:} -\frac{1}{2 \sigma^2} w^T X^T X w - \frac{1}{2} w^T \Sigma_p^{-1} w = \underbrace{ -\frac{1}{2} {(w^T {(\sigma^{-2} X^T X + \Sigma_p^{-1})} w)}}_{-\frac{1}{2} x^T \Sigma_w^{-1} x} \\ \text{一次项:} \underbrace{\sigma^{-2} Y^T X w}_{\mu^T \Sigma_w^{-1} x} \end{cases} \\ \implies &\begin{cases} \Sigma_w^{-1} = {(\sigma^{-2} X^T X + \Sigma_p^{-1})} \\ \mu^T \Sigma_w^{-1} = \sigma^{-2} Y^T X \end{cases} \end{align} 二次项:2σ21wTXTXw21wTΣp1w=21xTΣw1x 21(wT(σ2XTX+Σp1)w)一次项:μTΣw1x σ2YTXw{Σw1=(σ2XTX+Σp1)μTΣw1=σ2YTX
通过上文的方程可以简单求解出均值和方差:
{ Σ w = ( σ − 2 X T X + Σ p − 1 ) − 1 μ T = σ − 4 X T X Y T X + σ − 2 Σ p − 1 Y T X \begin{cases} \Sigma_w = {(\sigma^{-2} X^T X + \Sigma_p^{-1})}^{-1} \\ \mu^T = \sigma^{-4} X^T X Y^T X + \sigma^{-2} \Sigma_p^{-1} Y^T X \end{cases} {Σw=(σ2XTX+Σp1)1μT=σ4XTXYTX+σ2Σp1YTX

19.2.2 Prediction问题

Prediction问题是假设已有数据为 x ∗ x^* x,要求在 y ∗ y^* y的条件下的概率分布。

我们的条件有:
{ f ( x ) = x T w w ∽ N ( μ w , Σ w ) \begin{cases} f(x) = x^T w \\ w \backsim N(\mu_w, \Sigma_w) \end{cases} {f(x)=xTwwN(μw,Σw)
此时我们已知 f ( x ∗ ) = x ∗ T w f(x^*) = {x^*}^T w f(x)=xTw,可以根据参数的分布得到 P ( x ∗ T w ) P({x^*}^T w) P(xTw)
w ∽ N ( μ w , Σ w )    ⟹    x ∗ T w ∽ N ( x ∗ T μ w , x ∗ T Σ w x ∗ ) \begin{align} & w \backsim N(\mu_w, \Sigma_w) \\ \implies & {x^*}^T w \backsim N({x^*}^T \mu_w, {x^*}^T \Sigma_w x^*) \end{align} wN(μw,Σw)xTwN(xTμw,xTΣwx)
实际情况是我们要求解 y = f ( x ∗ ) + ε , ε ∽ N ( 0 , σ 2 ) y = f(x^*) + \varepsilon, \quad \varepsilon \backsim N(0, \sigma^2) y=f(x)+ε,εN(0,σ2),也就是求解分布 P ( y ∗ ∣ D a t a , x ∗ ) P(y^*| Data, x^*) P(yData,x)​:
{ y = x ∗ T w + ε , ε ∽ N ( 0 , σ 2 ) x ∗ T w ∽ N ( x ∗ T μ w , x ∗ T Σ w x ∗ )    ⟹    P ( y ∗ ∣ D a t a , x ∗ ) = N ( x ∗ T μ w , x ∗ T Σ w x ∗ + σ 2 ) \begin{align} &\begin{cases} y = {x^*}^T w + \varepsilon, \quad \varepsilon \backsim N(0, \sigma^2) \\ {x^*}^T w \backsim N({x^*}^T \mu_w, {x^*}^T \Sigma_w x^*) \end{cases} \\ \implies & P(y^*|Data, x^*) = N({x^*}^T \mu_w, {x^*}^T \Sigma_w x^* + \sigma^2) \end{align} {y=xTw+ε,εN(0,σ2)xTwN(xTμw,xTΣwx)P(yData,x)=N(xTμw,xTΣwx+σ2)

你可能感兴趣的:(机器学习-白板推导,线性回归,机器学习,python)