The Mathematics Behind Linear Regression

Main contents:

1. The model's mathematical expression

2. The model's objective function

3. Solving for the model parameters

Maximum likelihood estimation (MLE)

Bayesian maximum a posteriori estimation (MAP)


1. Model formula:
y = wx + b
Generalizing from one dimension to n dimensions (with the convention x_{0} = 1, so that \theta_{0} plays the role of the bias b):
h_{\theta}(x) = \sum_{i=0}^{n}{\theta_{i}x_{i}} = \theta^{T}x
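As a quick illustration (a minimal NumPy sketch, not from the original post; the function name `predict` and the sample numbers are made up), the x_{0} = 1 convention just means prepending a constant-one column to the data:

```python
import numpy as np

def predict(theta, X):
    """Hypothesis h_theta(x) = theta^T x, evaluated for each row of X."""
    return X @ theta

# Toy data: 3 samples, 1 feature; prepend x_0 = 1 so theta_0 acts as the bias b.
X_raw = np.array([[1.0], [2.0], [3.0]])
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # shape (3, 2)
theta = np.array([0.5, 2.0])                          # [b, w]
print(predict(theta, X))                              # [2.5 4.5 6.5]
```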
2. Objective function (also called the "loss function": it measures the gap between predicted values and true values):
J(\theta) = \frac{1}{2}\sum_{i=1}^{m}{(h_{\theta}(x_{i}) - y_{i})^{2}}
a. Interpretation 1: intuitively, this computes the "distance" between the predicted value and the true value; because the raw differences can be positive or negative, we take the square of each difference, as in the sketch below.
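A minimal sketch of this loss in NumPy (illustrative only; the factor 1/2 matches the formula above and exists purely to cancel the 2 that appears when differentiating):

```python
import numpy as np

def squared_loss(theta, X, y):
    """J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + one feature
y = np.array([2.5, 4.5, 6.5])
print(squared_loss(np.array([0.5, 2.0]), X, y))      # 0.0: a perfect fit
```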

b. Interpretation 2: via the model function:
y_{i} = \theta^{T}x_{i} + \varepsilon_{i}, assuming \varepsilon_{i} \sim N(0, \sigma^{2}) (a Gaussian with mean 0)
Deriving via MLE:
y_{i} = \theta^{T}x_{i} + \varepsilon_{i},
\varepsilon_{i} = y_{i} - \theta^{T}x_{i}
Substituting into the Gaussian density gives:
p(\varepsilon_{i}) = \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(\varepsilon_{i})^{2}}{2\sigma^{2}}) ,
p(y_{i}|x_{i};\theta) = \frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y_{i}-\theta^{T}x_{i})^{2}}{2\sigma^{2}}) ,
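As a sanity check (an illustrative sketch, not from the original post; the numbers are made up), this density is just the N(0, \sigma^{2}) density evaluated at the residual, which scipy can confirm:

```python
import numpy as np
from scipy.stats import norm

sigma = 1.5
theta = np.array([0.5, 2.0])
x_i, y_i = np.array([1.0, 3.0]), 7.2

residual = y_i - theta @ x_i  # epsilon_i = y_i - theta^T x_i
manual = np.exp(-residual**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
print(np.isclose(manual, norm(0, sigma).pdf(residual)))  # True
```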
The likelihood function (assuming the samples are independent) is:
L(\theta) = \prod_{i=1}^{m}p(y_{i}|x_{i};\theta)
= \prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y_{i}-\theta^{T}x_{i})^{2}}{2\sigma^{2}})
Taking the log gives the log-likelihood:
l(\theta) = logL(\theta)
= log\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y_{i}-\theta^{T}x_{i})^{2}}{2\sigma^{2}})
= \sum_{i=1}^{m}log\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y_{i}-\theta^{T}x_{i})^{2}}{2\sigma^{2}})
= mlog\frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^{2}}\cdot\frac{1}{2}\sum_{i=1}^{m} (y_{i}-\theta^{T}x_{i})^{2}
Since mlog\frac{1}{\sqrt{2\pi}\sigma} is a constant, it can be dropped; maximizing l(\theta) is therefore equivalent to minimizing the remaining sum of squares.
So, writing X for the design matrix whose i-th row is x_{i}^{T} and y for the column vector of targets:
J(\theta) = \frac{1}{2}\sum_{i=1}^{m} (\theta^{T}x_{i}-y_{i})^{2}
=\frac{1}{2}(X\theta - y)^{T}(X\theta -y)
= \frac{1}{2}(\theta^{T}X^{T} - y^{T})(X\theta - y)
= \frac{1}{2}(\theta^{T}X^{T}X\theta - \theta^{T}X^{T}y - y^{T}X\theta + y^{T}y)
\frac{dJ(\theta)}{d\theta} = \frac{1}{2}(2X^{T}X\theta - X^{T}y -(y^{T}X)^{T})
= X^{T}X\theta - X^{T}y
Setting the derivative to zero to locate the minimum:
X^{T}X\theta - X^{T}y = 0
X^{T}X\theta = X^{T}y
\theta = (X^{T}X )^{-1}X^{T}y
In practice, to guarantee that X^{T}X is invertible, a small ridge term is usually added:
\theta = (X^{T}X + \lambda I)^{-1}X^{T}y
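Putting the closed-form solution into code (a minimal NumPy sketch under the same notation; the synthetic data and the helper name `fit_normal_equation` are illustrative, not from the original post):

```python
import numpy as np

def fit_normal_equation(X, y, lam=0.0):
    """theta = (X^T X + lam * I)^(-1) X^T y; lam=0 recovers plain least squares."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Synthetic data: y = 2*x + 0.5 + Gaussian noise, with a bias column x_0 = 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 0.5 + rng.normal(0, 1.0, size=100)
X = np.column_stack([np.ones_like(x), x])

print(fit_normal_equation(X, y))           # approx [0.5, 2.0]
print(fit_normal_equation(X, y, lam=1.0))  # slightly shrunk coefficients
```

Using np.linalg.solve rather than forming (X^{T}X)^{-1} explicitly is both cheaper and numerically safer.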

c. Interpretation 3: via the Bayesian maximum a posteriori (MAP) estimate
- The parameter \theta is itself a random variable with a probability distribution.
- Given X, Y, derive the posterior of \theta:
p(\theta|Y,X) = \frac{p(\theta, Y|X)}{p(Y|X)}
- Then pick the \theta with the largest posterior probability.
Posterior: p(\theta|Y,X) = \frac{p(\theta, Y|X)}{p(Y|X)}
Prior: p(\theta|X) = \frac{1}{(\sqrt{2\pi}\gamma)^{n}}exp(-\frac{\theta^{T}\theta}{2\gamma^{2}}), assuming \theta \sim N(0, \gamma^{2}I) independently of X
Posterior \propto likelihood \times prior:
p(\theta|Y,X) = \frac{p(\theta, Y|X)}{p(Y|X)} = \frac{p(Y|\theta,X)p(\theta|X)}{\int p(Y|\theta,X)p(\theta|X)d\theta}
Since p(Y|X) does not depend on \theta, maximizing the posterior is equivalent to maximizing
p(Y|\theta,X)p(\theta|X)
Let L(\theta) = p(Y|\theta,X)p(\theta|X); maximizing it is equivalent to minimizing its negative log:
l(\theta) = -logL(\theta)
= -log(p(Y|\theta,X)p(\theta|X))
= -\sum_{i=1}^{m}log(p(y_{i}|\theta,x_{i})) - log(p(\theta|X))
= -\sum_{i=1}^{m}log\frac{1}{\sqrt{2\pi}\sigma}exp(-\frac{(y_{i}-\theta^{T}x_{i})^{2}}{2\sigma^{2}}) - log\frac{1}{(\sqrt{2\pi}\gamma)^{n}}exp(-\frac{\theta^{T}\theta}{2\gamma^{2}})
= \frac{1}{2\sigma^{2}}\sum_{i=1}^{m}{(y_{i}-\theta^{T}x_{i})^{2}} + \frac{1}{2\gamma^{2}}\theta^{T}\theta + C
Dropping the constant C and rescaling by \sigma^{2}:
J(\theta) = \frac{1}{2}\sum_{i=1}^{m}{(y_{i}-\theta^{T}x_{i})^{2}} + \frac{\lambda}{2}\theta^{T}\theta, \quad \lambda = \frac{\sigma^{2}}{\gamma^{2}}
Differentiating J(\theta) and setting the gradient to zero yields:
\theta = (X^{T}X + \lambda I)^{-1}X^{T}y
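To see that this penalized objective and the closed form agree (an illustrative sketch, not from the original post; scipy's generic minimizer stands in for the calculus above, and all variable names are made up):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 0.5 + rng.normal(0, 1.0, size=50)
X = np.column_stack([np.ones_like(x), x])
lam = 2.0

def J(theta):
    """Ridge objective: 1/2 * ||X theta - y||^2 + (lam/2) * theta^T theta."""
    r = X @ theta - y
    return 0.5 * r @ r + 0.5 * lam * theta @ theta

closed_form = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
numeric = minimize(J, x0=np.zeros(2)).x
print(np.allclose(closed_form, numeric, atol=1e-4))  # True
```

Note that the \lambda that falls out of the prior variance ratio \sigma^{2}/\gamma^{2} is exactly the ridge regularization strength: a tighter prior (smaller \gamma) shrinks \theta harder.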

