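Linear regression looks for a weight vector ω = {w0, w1, w2, …, wn} that minimizes the residual sum of squares between the observed samples X and the target output y. Mathematically, the objective, also called the loss function, is:
$$J(w) = \lVert Xw - y \rVert_2^2$$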
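For this loss function, J(w) is clearly non-negative. Because it is also convex in w, it attains its minimum if and only if the derivative of ∥Xw−y∥² with respect to w is zero:
$$\frac{\partial J(w)}{\partial w} = 2X^T(Xw - y) = 0 \quad\Longrightarrow\quad w = (X^T X)^{-1} X^T y$$
Here $X^{+} = (X^T X)^{-1} X^T$ is defined as the pseudo-inverse of X; this form requires $X^T X$ to be invertible.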
This is a very useful result: it shows that, with least squares, the parameters w can be computed directly from a closed-form formula.
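To make the closed-form solution concrete, here is a minimal NumPy sketch. The synthetic data, the variable names, and the choice of prepending a column of ones for the intercept w0 are all assumptions made for illustration; they are not taken from the original text.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3 + 2*x1 - x2 + small noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * features[:, 0] - 1.0 * features[:, 1] + 0.01 * rng.normal(size=200)

# Prepend a column of ones so that w[0] plays the role of the intercept w0.
X = np.hstack([np.ones((200, 1)), features])

# Normal equation: solve (X^T X) w = X^T y rather than forming the inverse explicitly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimate via the pseudo-inverse of X.
w_pinv = np.linalg.pinv(X) @ y

print("normal equation:", w_normal)
print("pseudo-inverse: ", w_pinv)
```

Both routes should agree up to floating-point error and land close to the coefficients used to generate the data. Solving the linear system is generally preferred over computing an explicit inverse for numerical reasons.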
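Note: the least-squares estimate of ω rests on the basic assumption that the variables in the model are independent of one another, i.e. that any two components xi and xj of the input vector x are independent. If the input matrix X contains columns that are linearly dependent, or nearly so, then X^T X, the matrix that has to be inverted, becomes singular or nearly singular. Such a matrix is ill-conditioned: a small change in any one of its entries can cause large changes in its determinant (in relative terms) and in its inverse. The least-squares estimate then becomes extremely sensitive to random error in the observations, and the resulting linear model has very high variance. This problem is known as multicollinearity. Multicollinearity among variables is very common in real data; its causes and the corresponding remedies are covered in "Feature Selection and Evaluation".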
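The effect of near-collinear columns can be seen through the condition number of X^T X. The sketch below uses synthetic data and variable names chosen here purely for illustration; it compares a design matrix with two unrelated columns against one whose second column is almost a copy of the first.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x_indep = rng.normal(size=n)                    # unrelated to x1
x_collinear = x1 + 1e-6 * rng.normal(size=n)    # almost an exact copy of x1

X_good = np.column_stack([x1, x_indep])
X_bad = np.column_stack([x1, x_collinear])

# A large condition number means the inverse of X^T X is highly
# sensitive to small perturbations of the data.
cond_good = np.linalg.cond(X_good.T @ X_good)
cond_bad = np.linalg.cond(X_bad.T @ X_bad)

print(f"independent columns:    cond = {cond_good:.3e}")
print(f"near-collinear columns: cond = {cond_bad:.3e}")
```

The second condition number should be many orders of magnitude larger than the first, which is the numerical signature of the ill-conditioning described above.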
The derivation follows Zhou Zhihua's Machine Learning textbook:
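For the single-feature case, the standard least-squares derivation can be sketched as follows. The notation is assumed here for illustration: m samples (x_i, y_i), a scalar weight w, a bias b, and the squared-error objective E(w, b).

```latex
\begin{aligned}
E(w, b) &= \sum_{i=1}^{m} (y_i - w x_i - b)^2 \\
\frac{\partial E}{\partial w} &= 2\left( w \sum_{i=1}^{m} x_i^2 - \sum_{i=1}^{m} (y_i - b)\, x_i \right) \\
\frac{\partial E}{\partial b} &= 2\left( m b - \sum_{i=1}^{m} (y_i - w x_i) \right) \\
\text{Setting both to zero:}\quad
w &= \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left( \sum_{i=1}^{m} x_i \right)^2},
\qquad
b = \frac{1}{m} \sum_{i=1}^{m} (y_i - w x_i),
\qquad
\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i
\end{aligned}
```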
Generalized to matrix form:
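A sketch of the matrix-form steps, using the same symbols X, w, y and J(w) as in the loss function above. It assumes the bias is absorbed into w by appending a column of ones to X, and that X^T X is invertible.

```latex
\begin{aligned}
J(w) &= (Xw - y)^T (Xw - y) \\
\nabla_w J(w) &= 2 X^T (Xw - y) \\
\nabla_w J(w) = 0 \;&\Longrightarrow\; X^T X\, w = X^T y \\
&\Longrightarrow\; w^{*} = (X^T X)^{-1} X^T y
\end{aligned}
```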
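The regression model itself is fairly simple; the key part is the gradient-descent parameter update once the loss function is given. First we write out the model, the loss function, and the gradients of the loss with respect to the parameters; then we initialize the parameters; finally we write the gradient-descent update loop.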
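Written out, with n training samples and a learning rate α (both symbols assumed here for illustration), the model, the mean-squared-error loss, its gradients, and the update rule are:

```latex
\begin{aligned}
\hat{y} &= Xw + b, &
L(w, b) &= \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \\
\frac{\partial L}{\partial w} &= \frac{2}{n} X^T (\hat{y} - y), &
\frac{\partial L}{\partial b} &= \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \\
w &\leftarrow w - \alpha \frac{\partial L}{\partial w}, &
b &\leftarrow b - \alpha \frac{\partial L}{\partial b}
\end{aligned}
```

The code that follows computes dw and db without the constant factor 2. This does not change the direction of the update; it only rescales the effective learning rate.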
import numpy as np
from sklearn.utils import shuffle
from sklearn.datasets import load_diabetes


class linearegression_model():

    def __init__(self):
        pass

    def prepare_data(self):
        # Load the diabetes dataset once and shuffle samples and targets together.
        data, target = load_diabetes(return_X_y=True)
        X, y = shuffle(data, target, random_state=42)
        X = X.astype(np.float32)
        y = y.reshape((-1, 1))
        # Stack features and target so that the last column holds y.
        data = np.concatenate((X, y), axis=1)
        return data

    def initialize_params(self, dims):
        # Zero-initialize the weights (column vector) and the bias.
        w = np.zeros((dims, 1))
        b = 0
        return w, b

    def linear_loss(self, X, y, w, b):
        num_train = X.shape[0]
        # Model prediction.
        y_hat = np.dot(X, w) + b
        # Mean squared error.
        loss = np.sum((y_hat - y) ** 2) / num_train
        # Gradients of the loss; the constant factor 2 is omitted,
        # which only rescales the effective learning rate.
        dw = np.dot(X.T, (y_hat - y)) / num_train
        db = np.sum((y_hat - y)) / num_train
        return y_hat, loss, dw, db

    def linear_train(self, X, y, learning_rate, epochs):
        w, b = self.initialize_params(X.shape[1])
        # Note: range(1, epochs) performs epochs - 1 updates.
        for i in range(1, epochs):
            y_hat, loss, dw, db = self.linear_loss(X, y, w, b)
            # Gradient-descent update.
            w += -learning_rate * dw
            b += -learning_rate * db
            if i % 10000 == 0:
                print('epoch %d loss %f' % (i, loss))
        params = {
            'w': w,
            'b': b
        }
        grads = {
            'dw': dw,
            'db': db
        }
        return loss, params, grads

    def predict(self, X, params):
        w = params['w']
        b = params['b']
        y_pred = np.dot(X, w) + b
        return y_pred

    def linear_cross_validation(self, data, k, randomize=True):
        if randomize:
            data = list(data)
            # sklearn's shuffle returns a shuffled copy instead of shuffling
            # in place, so the result has to be assigned back.
            data = shuffle(data)
        # Fold i takes every k-th row starting at row i.
        slices = [data[i::k] for i in range(k)]
        for i in range(k):
            validation = slices[i]
            # All rows from the other k - 1 folds form the training set.
            train = [row
                     for s in slices if s is not validation for row in s]
            train = np.array(train)
            validation = np.array(validation)
            yield train, validation


if __name__ == '__main__':
    lr = linearegression_model()
    data = lr.prepare_data()
    for train, validation in lr.linear_cross_validation(data, 5):
        # The first 10 columns are features; the last column is the target.
        X_train = train[:, :10]
        y_train = train[:, -1].reshape((-1, 1))
        X_valid = validation[:, :10]
        y_valid = validation[:, -1].reshape((-1, 1))
        # loss5 is reset for every fold, so the score printed below is
        # the final training loss of the current fold only.
        loss5 = []
        loss, params, grads = lr.linear_train(X_train, y_train, 0.001, 100000)
        loss5.append(loss)
        score = np.mean(loss5)
        print('five kold cross validation score is', score)
        # Mean squared error on the held-out fold.
        y_pred = lr.predict(X_valid, params)
        valid_score = np.sum(((y_pred - y_valid) ** 2)) / len(X_valid)
        print('valid score is', valid_score)
epoch 10000 loss 5611.704502
epoch 20000 loss 5258.726277
epoch 30000 loss 4960.271811
epoch 40000 loss 4707.234957
epoch 50000 loss 4492.067734
epoch 60000 loss 4308.511724
epoch 70000 loss 4151.375892
epoch 80000 loss 4016.352800
epoch 90000 loss 3899.866551
five kold cross validation score is 3798.9563843996484
valid score is 4214.092765494475
epoch 10000 loss 5421.168404
epoch 20000 loss 5106.640864
epoch 30000 loss 4838.968257
epoch 40000 loss 4610.442887
epoch 50000 loss 4414.661874
epoch 60000 loss 4246.304449
epoch 70000 loss 4100.947379
epoch 80000 loss 3974.911961
epoch 90000 loss 3865.137182
five kold cross validation score is 3769.083546495626
valid score is 4615.051403980545
epoch 10000 loss 5586.652342
epoch 20000 loss 5295.082496
epoch 30000 loss 5044.857033
epoch 40000 loss 4829.540681
epoch 50000 loss 4643.729724
epoch 60000 loss 4482.885115
epoch 70000 loss 4343.192699
epoch 80000 loss 4221.446108
epoch 90000 loss 4114.948637
five kold cross validation score is 4021.439797419828
valid score is 3787.77408008022
epoch 10000 loss 5611.711211
epoch 20000 loss 5329.043484
epoch 30000 loss 5085.368704
epoch 40000 loss 4874.725779
epoch 50000 loss 4692.097616
epoch 60000 loss 4533.259845
epoch 70000 loss 4394.653912
epoch 80000 loss 4273.280578
epoch 90000 loss 4166.610541
five kold cross validation score is 4072.518260076695
valid score is 3549.9681755234255
epoch 10000 loss 5585.410482
epoch 20000 loss 5280.637799
epoch 30000 loss 5020.541759
epoch 40000 loss 4798.003288
epoch 50000 loss 4607.068771
epoch 60000 loss 4442.757611
epoch 70000 loss 4300.901673
epoch 80000 loss 4178.011305
epoch 90000 loss 4071.163532
five kold cross validation score is 3977.917458102055
valid score is 3855.0420442947343
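In every fold the logged loss is still falling at epoch 90000, which suggests that gradient descent had not yet converged when training stopped. One way to sanity-check a gradient-descent implementation is to compare its result with a direct least-squares solve. The sketch below does this on synthetic data; the helper `gradient_descent`, the data, and the hyperparameters are hypothetical choices made for illustration, not the setup used for the run above.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, epochs=2000):
    """Batch gradient descent for linear regression (hypothetical helper)."""
    n, d = X.shape
    w = np.zeros((d, 1))
    b = 0.0
    for _ in range(epochs):
        residual = X @ w + b - y
        # Same gradient convention as the code above (factor 2 omitted).
        w -= learning_rate * (X.T @ residual) / n
        b -= learning_rate * residual.sum() / n
    return w, b

# Synthetic, well-scaled data (an assumption for illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
true_w = np.array([[1.5], [-2.0], [0.5]])
y = X @ true_w + 4.0 + 0.1 * rng.normal(size=(300, 1))

w_gd, b_gd = gradient_descent(X, y)

# Reference solution: least squares with an appended column of ones for the bias.
X_aug = np.hstack([X, np.ones((300, 1))])
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_ls, b_ls = theta[:-1], theta[-1, 0]

print("max |w_gd - w_ls|:", np.abs(w_gd - w_ls).max())
print("|b_gd - b_ls|:    ", abs(b_gd - b_ls))
```

On well-scaled features such as these, the two estimates should agree closely. If they do not, the usual suspects are a learning rate that is too small, too few epochs, or features on very different scales.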