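Linear Regression: Theory and Calculation

Linear Regression via Least Squares

Linear Regression

Linear regression is one of the most common forms of regression. It can be used for prediction or for classification, and it mainly addresses problems where the relationship between the variables is linear.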

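Least Squares

The core task in linear regression is to obtain the best-fitting line from the samples. The most common method is least squares, a mathematical optimization technique that finds the function best matching the data by minimizing the sum of squared errors.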
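To make "sum of squared errors" concrete, here is a minimal sketch that scores two candidate lines against the same points. The helper name `sum_squared_error` and the sample points are made up for illustration; they are not part of the original article.

```python
def sum_squared_error(a, b, xs, ys):
    """Return the sum of squared residuals of the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical sample points that lie roughly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

good = sum_squared_error(2.0, 1.0, xs, ys)  # a line close to the data
bad = sum_squared_error(1.0, 0.0, xs, ys)   # a visibly worse line
print(good < bad)
```

Least squares picks, among all possible pairs (a, b), the one for which this score is smallest.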

Algebraic derivation:

1. Assume the fitted line is $y = ax + b$.
2. Take any sample point $(x_i, y_i)$.
3. Its error is $e_i = y_i - (ax_i + b)$.
4. The fit is best when $S = \sum_{i=1}^{n} e_i^2$ is smallest, that is, when $\sum_{i=1}^{n}(y_i - ax_i - b)^2$ is minimized.
5. Take the first-order partial derivatives:

$$\frac{\partial S}{\partial b} = -2\left(\sum_{i=1}^{n} y_i - nb - a\sum_{i=1}^{n} x_i\right)$$

$$\frac{\partial S}{\partial a} = -2\left(\sum_{i=1}^{n} x_i y_i - b\sum_{i=1}^{n} x_i - a\sum_{i=1}^{n} x_i^2\right)$$

6. Set both expressions to zero, and note that $n\overline{x} = \sum_{i=1}^{n} x_i$ and $n\overline{y} = \sum_{i=1}^{n} y_i$.
7. The final solution is:

$$a = \frac{\sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n}(x_i - \overline{x})^2}$$

$$b = \overline{y} - a\overline{x}$$

The result can also be written as:

$$a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

$$b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$
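As a cross-check of the mean-centered form of the solution, the sketch below computes a and b from the means and compares them with NumPy's own degree-1 least-squares fit, `np.polyfit`. The function name `fit_line` and the data are assumptions made for this example.

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares slope and intercept using the sample means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # a = sum((x_i - x_mean)(y_i - y_mean)) / sum((x_i - x_mean)^2)
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # b = y_mean - a * x_mean
    b = y_mean - a * x_mean
    return a, b

# Hypothetical data lying exactly on y = 2x + 1
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

a, b = fit_line(x, y)
ref_a, ref_b = np.polyfit(x, y, 1)  # returns [slope, intercept] for degree 1
print(np.isclose(a, ref_a), np.isclose(b, ref_b))
```

Both routes should agree up to floating-point rounding, since they minimize the same sum of squares.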

Code implementation

import numpy as np
import matplotlib.pyplot as plt

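def calcAB(x,y):
    # accumulate the sums used by the closed-form least-squares solution
    n = len(x)
    sumX,sumY,sumXY,sumXX = 0,0,0,0
    for i in range(0,n):
        sumX  += x[i]
        sumY  += y[i]
        sumXX += x[i]*x[i]
        sumXY += x[i]*y[i]
    # slope a and intercept b share the denominator n*sum(x^2) - (sum x)^2
    a = (n*sumXY -sumX*sumY)/(n*sumXX -sumX*sumX)
    b = (sumXX*sumY - sumX*sumXY)/(n*sumXX-sumX*sumX)
    return a,b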

xi = [1,2,3,4,5,6,7,8,9,10]
yi = [10,11.5,12,13,14.5,15.5,16.8,17.3,18,18.7]
a,b=calcAB(xi,yi)
print("y = %10.5fx + %10.5f" %(a,b))
x = np.linspace(0,10)
y = a * x + b
plt.plot(x,y)
plt.scatter(xi,yi)
plt.show()

Formula derivation

Hypothesis:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

Parameters:

$$\theta_0, \theta_1$$

Cost function:

$$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Goal:

$$\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$$

Gradient descent, where $\alpha$ is the learning rate:

repeat until convergence {
$$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1) \quad (\text{for } j = 0 \text{ and } j = 1)$$
}
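The update rule can be sketched in plain Python for the two-parameter case. This is an illustration under stated assumptions rather than the article's implementation: the function name `gradient_descent`, the learning rate 0.05, the fixed iteration count (used here instead of a convergence test), and the data are all chosen for the example.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Minimize J(theta0, theta1) = 1/(2m) * sum((theta0 + theta1*x - y)^2)."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # residuals h_theta(x_i) - y_i under the current parameters
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # partial derivatives of J with respect to theta0 and theta1
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # simultaneous update: both gradients come from the old parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Hypothetical data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))
```

If alpha is too large for the data, the iterates diverge instead of settling, so the learning rate usually has to be tuned per dataset.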

Matrix derivation

1. For a single sample, write $y = \theta_0 + \theta_1 x_1 = \theta_0 x_0 + \theta_1 x_1 = \theta^T X$ with $x_0 = 1$. In vector form:

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}, \quad X = \begin{bmatrix} 1 \\ x_1 \end{bmatrix}$$

2. So $y = \theta^T X$. Stacking every sample as one row of a matrix (also written $X$ from here on) and the targets into a vector $y$, the predictions for all samples are $X\theta$.
3. The loss function is

$$J(\theta) = \frac{1}{2}(\theta^T X^T - y^T)(X\theta - y)$$

which expands to

$$\frac{1}{2}\left(\theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y\right)$$

so the partial derivative is

$$\frac{\partial J(\theta)}{\partial \theta} = X^T X\theta - X^T y$$
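Setting the gradient $X^T X\theta - X^T y$ to zero gives the linear system $X^T X\theta = X^T y$, which can be solved directly when $X^T X$ is invertible. The sketch below does this with `np.linalg.solve` and then confirms that the gradient vanishes at the solution. The function name `normal_equation` and the data are assumptions for illustration.

```python
import numpy as np

def normal_equation(features, y):
    """Solve X^T X theta = X^T y, with a column of ones prepended to X."""
    features = np.asarray(features, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    # prepend x_0 = 1 so that theta[0] is the intercept
    X = np.hstack((np.ones((features.shape[0], 1)), features))
    theta = np.linalg.solve(X.T @ X, X.T @ y)
    return X, theta

# Hypothetical single-feature data lying exactly on y = 2x + 1
features = [[1.0], [2.0], [3.0], [4.0]]
y = [3.0, 5.0, 7.0, 9.0]

X, theta = normal_equation(features, y)
y_col = np.asarray(y, dtype=float).reshape(-1, 1)
gradient = X.T @ X @ theta - X.T @ y_col  # should be numerically zero
print(theta.ravel())
print(np.allclose(gradient, 0.0))
```

For ill-conditioned or rank-deficient data, `np.linalg.lstsq` is generally a safer choice than forming $X^T X$ explicitly.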

Code implementation

import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as datasets
 
class LR:
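    def __init__(self,X,Y):
        # number of training examples
        self.m = len(Y)
        # number of parameters: one per feature plus the intercept
        self.n = len(X[0]) + 1
        # prepend a column of ones so that theta[0] acts as the intercept
        self.X = np.hstack((np.ones((self.m,1)),X))
        self.Y = Y.reshape((self.m,1))
        # parameters are set by fit(); start from zeros so the shape is defined
        self.theta = np.zeros((self.n,1))
        # cost recorded at every iteration
        self.cost = []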
 
 
    def fit(self):
        # initialize theta with a small constant
        theta = np.zeros((self.n,1)) + 0.01
        cost_old = np.array([[0]])
        # M holds the residuals X.theta - Y
        M = self.X.dot(theta) - self.Y
        cost_new = M.T.dot(M) * 1.0 / 2 / self.m
        print('theta is :',theta)
        i = 0
        # iterate until the cost changes by no more than 0.001 between steps
        while (np.abs(cost_new - cost_old) > 0.001 ):
            cost_old = cost_new
            # gradient of the cost: X^T (X.theta - Y) / m
            theta_grad = self.X.T.dot(self.X.dot(theta) - self.Y) / self.m
            print('theta_grad',theta_grad)
            # 0.001 is the learning rate, chosen by trial and error;
            # the cost curve plotted below shows whether it is suitable
            theta = theta - theta_grad * 0.001
            M = self.X.dot(theta) - self.Y
            # same 1/(2m) scaling as the initial cost above
            cost_new = np.dot(M.T,M) / 2 / self.m
            # count the iteration and record its cost
            i = i + 1
            self.cost.append(cost_new[0][0])
        self.theta = theta
        self.cost = np.array(self.cost)
        l = len(self.cost)
        axis_X = np.arange(l)

        # plot the cost per iteration as a check that it keeps decreasing
        plt.plot(axis_X,self.cost)
        plt.xlabel('Iteration No.')
        plt.ylabel('Total Cost')
        plt.title('Iteration VS Cost')
        plt.show()
 
    def pred(self,X):
        # prepend the column of ones, matching the layout used in __init__
        X = np.hstack( ( np.ones( (len(X),1) ), X ) )
        return np.dot(X,self.theta)
 
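    def plot(self):
        # draws theta[0] + theta[1]*x, i.e. the intercept plus the first
        # feature's coefficient; this is the full model only for
        # single-feature data
        x = np.linspace(0,30,401)
        y = x * self.theta[1] + self.theta[0]
        plt.plot(x,y,'r')
        plt.legend(['Prediction Line'])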
 
  
diabetes = datasets.load_diabetes()
X = diabetes.data
Y = diabetes.target

# Train the linear regression on the given dataset
NN = LR(X,Y)

NN.fit()

# theta has shape (number of features + 1, 1)
print(NN.theta.shape)
X_test = np.array([[0.0380759,0.0506801,0.0616962,0.0218724,-0.0442235,-0.0348208,-0.0434008,-0.00259226,0.0199084,-0.0176461]])
Y_test = NN.pred(X_test)
print(Y_test)
