Linear regression is one of the most common forms of regression. It can be used for prediction or classification, and it mainly addresses problems where the relationship between the variables is linear.
The core problem in linear regression is how to obtain the best-fitting line from the samples. The most widely used method is least squares, a mathematical optimization technique that finds the best function fit by minimizing the sum of squared errors.
1. Assume the fitted line is $y = ax + b$.
2. For any sample point $(x_i, y_i)$,
3. the error is $e_i = y_i - (ax_i + b)$.
4. The fit is best when $S = \sum_{i=1}^{n} e_i^2$ is minimized, i.e. when $\sum_{i=1}^{n}(y_i - ax_i - b)^2$ is smallest.
5. Take the first-order partial derivatives:
$$\frac{\partial S}{\partial b} = -2\left(\sum_{i=1}^{n} y_i - nb - a\sum_{i=1}^{n} x_i\right)$$
$$\frac{\partial S}{\partial a} = -2\left(\sum_{i=1}^{n} x_i y_i - b\sum_{i=1}^{n} x_i - a\sum_{i=1}^{n} x_i^2\right)$$
6. Set both expressions to zero, and note that $n\overline{x} = \sum_{i=1}^{n} x_i$ and $n\overline{y} = \sum_{i=1}^{n} y_i$.
7. The final solution is
$$a = \frac{\sum_{i=1}^{n}(x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^{n}(x_i-\overline{x})^2},\qquad b = \overline{y} - a\overline{x}$$
Equivalently, expanding the sums,
$$a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2},\qquad b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum x_i^2 - (\sum x_i)^2}$$
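As a quick sketch (my addition, not part of the original post), the mean-centered form of the solution in step 7 can be written directly with NumPy; `xs` and `ys` are assumed to be sequences of equal length:

import numpy as np

def fit_line(xs, ys):
    # a = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),  b = y_mean - a * x_mean
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)
    a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - a * x.mean()
    return a, b

The implementation below instead accumulates the raw sums explicitly and uses the equivalent expanded form of a and b.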
import numpy as np
import matplotlib.pyplot as plt

def calcAB(x, y):
    n = len(x)
    sumX, sumY, sumXY, sumXX = 0, 0, 0, 0
    # Accumulate the sums needed by the closed-form solution
    for i in range(0, n):
        sumX += x[i]
        sumY += y[i]
        sumXX += x[i] * x[i]
        sumXY += x[i] * y[i]
    a = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX)
    b = (sumXX * sumY - sumX * sumXY) / (n * sumXX - sumX * sumX)
    return a, b

xi = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
yi = [10, 11.5, 12, 13, 14.5, 15.5, 16.8, 17.3, 18, 18.7]
a, b = calcAB(xi, yi)
print("y = %10.5fx + %10.5f" % (a, b))
x = np.linspace(0, 10)
y = a * x + b
plt.plot(x, y)        # fitted line
plt.scatter(xi, yi)   # original sample points
plt.show()
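As a sanity check (my addition, not part of the original), the same slope and intercept can be obtained from np.polyfit, which fits a degree-1 polynomial by least squares:

a_np, b_np = np.polyfit(xi, yi, 1)   # degree-1 fit returns [slope, intercept]
print("numpy.polyfit: y = %10.5fx + %10.5f" % (a_np, b_np))

Both values should agree with the hand-rolled calcAB result up to floating-point error.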
Hypothesis:
$$h_\theta(x) = \theta_0 + \theta_1 x$$
Parameters:
$$\theta_0, \theta_1$$
Cost function:
$$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Goal:
$$\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)$$
Gradient descent: repeat until convergence {
$$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)\qquad (\text{for } j = 0 \text{ and } j = 1)$$
}
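Plugging the linear hypothesis into the cost function, the two partial derivatives used in the update rule work out to the following (a standard result, added here for completeness):

$$\frac{\partial}{\partial\theta_0}J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
$$\frac{\partial}{\partial\theta_1}J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

so each iteration moves $\theta_0$ and $\theta_1$ by a step of size $\alpha$ against the average residual (weighted by $x^{(i)}$ in the case of $\theta_1$).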
1. For $y = \theta_0 + \theta_1 x_1 = \theta^T X = \theta_0 x_0 + \theta_1 x_1$ with $x_0 = 1$, write it in vector form:
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix},\qquad X = \begin{bmatrix} 1 \\ x_1 \end{bmatrix}$$
2. So $y = \theta^T X$.
3. The loss function (with $X$ now stacking the training examples row-wise and $y$ the vector of targets) is
$$J(\theta) = \frac{1}{2}(\theta^T X^T - y^T)(X\theta - y)$$
which expands to
$$\frac{1}{2}\left(\theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y\right)$$
and its gradient with respect to $\theta$ is
$$\frac{\partial J(\theta)}{\partial \theta} = X^T X\theta - X^T y$$
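Setting this gradient to zero gives the normal equation $X^TX\theta = X^Ty$, i.e. $\theta = (X^TX)^{-1}X^Ty$. A minimal NumPy sketch of that direct solution (my addition), reusing the xi/yi sample from the least-squares example above:

# Build the design matrix with a leading column of 1s
X_design = np.column_stack((np.ones(len(xi)), xi))
y_vec = np.array(yi, dtype=float)
theta_closed = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_vec)
print(theta_closed)   # [b, a] — should match calcAB(xi, yi)
# np.linalg.lstsq(X_design, y_vec, rcond=None)[0] computes the same solution more robustly

The implementation below instead minimizes $J(\theta)$ iteratively by gradient descent.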
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets as datasets

class LR:
    def __init__(self, X, Y):
        self.m = len(Y)                                   # number of training examples
        self.n = len(X[0]) + 1                            # number of features, plus 1 for the bias term
        self.X = np.hstack((np.ones((self.m, 1)), X))     # prepend a column of 1s to each example
        self.Y = Y.reshape((self.m, 1))
        self.theta = np.zeros((self.n, 1))
        self.cost = []

    def fit(self):
        theta = np.zeros((self.n, 1)) + 0.01              # initialise theta with a small value
        cost_old = 0.0
        M = self.X.dot(theta) - self.Y                    # residual vector
        cost_new = (M.T.dot(M) / (2.0 * self.m)).item()
        # print('the initial cost is', cost_new)
        print('theta is :', theta)
        i = 0
        while abs(cost_new - cost_old) > 0.001:
            cost_old = cost_new
            theta_grad = self.X.T.dot(self.X.dot(theta) - self.Y) / self.m   # gradient of the cost
            # print('in the %d-th iteration' % (i), 'theta:', theta)
            # print('theta_grad :', theta_grad)
            theta = theta - theta_grad * 0.001            # 0.001 is the learning rate; the cost curve tells you if it needs tuning
            M = self.X.dot(theta) - self.Y
            cost_new = (M.T.dot(M) / (2.0 * self.m)).item()
            i = i + 1                                     # iteration counter
            self.cost.append(cost_new)
        self.theta = theta
        self.cost = np.array(self.cost)
        l = len(self.cost)
        axis_X = np.arange(l)
        # Plot how the cost changes over the iterations; a falling curve is a quick sanity check
        plt.plot(axis_X, self.cost)
        plt.xlabel('Iteration No.')
        plt.ylabel('Total Cost')
        plt.title('Iteration VS Cost')
        plt.show()

    def pred(self, X):
        # Prepend the column of 1s in the same position as in training
        X = np.hstack((np.ones((len(X), 1)), X))
        return np.dot(X, self.theta)

    def plot(self):
        x = np.linspace(0, 30, 401)
        y = x * self.theta[1] + self.theta[0]
        plt.plot(x, y, 'r')
        plt.legend(['Prediction Line'])

diabetes = datasets.load_diabetes()
X = diabetes.data
Y = diabetes.target
# Train the linear regression on the diabetes dataset by gradient descent
NN = LR(X, Y)
print(NN.theta.shape)
NN.fit()
X_test = np.array([[0.0380759,0.0506801,0.0616962,0.0218724,-0.0442235,-0.0348208,-0.0434008,-0.00259226,0.0199084,-0.0176461]])
Y_test = NN.pred(X_test)
print(Y_test)
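For comparison (my addition, not part of the original), sklearn's LinearRegression solves the same model in closed form; its prediction for X_test should be in the same ballpark as Y_test, though not identical, since the gradient descent loop above stops once the cost change falls below its threshold:

from sklearn.linear_model import LinearRegression

ref = LinearRegression().fit(X, Y)
print(ref.intercept_, ref.coef_)   # closed-form intercept and weights for the same data
print(ref.predict(X_test))         # compare with Y_test above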