In the previous article, 《机器学习学习笔记(4)----线性回归的数学解析》 (Machine Learning Notes (4): The Mathematics of Linear Regression), the optimal w was found with ordinary least squares (OLS).
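Recall the closed-form solution derived there:

w = (XᵀX)⁻¹ Xᵀ y

The Python implementation is fairly short; the _ols method in ols.py below computes exactly this expression: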
import numpy as np

class OLSLinearRegression:

    def _ols(self, X, y):
        '''Ordinary least squares estimate of w'''
        tmp = np.linalg.inv(np.matmul(X.T, X))
        tmp = np.matmul(tmp, X.T)
        return np.matmul(tmp, y)
        # With Python 3.5+ and a recent NumPy, the @ operator gives the
        # same result more readably:
        # return np.linalg.inv(X.T @ X) @ X.T @ y

    def _preprocess_data_X(self, X):
        '''Preprocess the data: extend X with an x0 column set to 1'''
        m, n = X.shape
        X_ = np.empty((m, n + 1))
        X_[:, 0] = 1
        X_[:, 1:] = X
        return X_

    def train(self, X_train, y_train):
        '''Train the model'''
        # Preprocess X_train (add the x0 column set to 1)
        _X_train = self._preprocess_data_X(X_train)
        # Estimate w with ordinary least squares
        self.w = self._ols(_X_train, y_train)

    def predict(self, X):
        '''Predict'''
        # Preprocess X (add the x0 column set to 1)
        _X = self._preprocess_data_X(X)
        return np.matmul(_X, self.w)
The _ols method implements the least-squares estimate itself. The _preprocess_data_X method preprocesses X by prepending the x0 column of ones, so that the intercept can be folded into w. The train method fits the model, and predict produces predictions.
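One caveat worth noting: np.linalg.inv fails, or becomes numerically unreliable, when XᵀX is singular or ill-conditioned, for example when feature columns are collinear. A more robust variant, shown here only as a sketch and not part of the original ols.py, solves the least-squares problem directly with np.linalg.lstsq, which uses an SVD-based solver:

# Sketch: a drop-in replacement for _ols inside OLSLinearRegression
# (the method name _ols_lstsq is hypothetical). np.linalg.lstsq solves
# min ||Xw - y||^2 via SVD and copes with a singular or ill-conditioned
# X.T @ X, unlike explicit inversion with np.linalg.inv.
def _ols_lstsq(self, X, y):
    w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
    return w

On a well-behaved dataset like the one used below, both versions return essentially the same w.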
To test this least-squares implementation we use the red-wine quality dataset (winequality-red.csv), available at https://archive.ics.uci.edu/ml/datasets/wine+quality.
First, load the dataset (the file is semicolon-delimited with a header row, hence delimiter=';' and skip_header):
>>> import numpy as np
>>> data = np.genfromtxt('winequality-red.csv', delimiter=';', skip_header=1)
>>> X=data[:,:-1]
>>> X
array([[ 7.4 , 0.7 , 0. , ..., 3.51 , 0.56 , 9.4 ],
[ 7.8 , 0.88 , 0. , ..., 3.2 , 0.68 , 9.8 ],
[ 7.8 , 0.76 , 0.04 , ..., 3.26 , 0.65 , 9.8 ],
...,
[ 6.3 , 0.51 , 0.13 , ..., 3.42 , 0.75 , 11. ],
[ 5.9 , 0.645, 0.12 , ..., 3.57 , 0.71 , 10.2 ],
[ 6. , 0.31 , 0.47 , ..., 3.39 , 0.66 , 11. ]])
>>> y = data[:,-1]
>>> y
array([5., 5., 5., ..., 6., 5., 6.])
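A quick sanity check on the shape: 1,599 samples and 12 columns (11 physicochemical features plus the quality score), consistent with the train/test sizes shown below.
>>> data.shape
(1599, 12)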
Next, create the model:
>>> from ols import OLSLinearRegression
>>> ols_lr = OLSLinearRegression()
Then split the data into a training set and a test set with train_test_split:
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)
Here test_size=0.3 means the test set takes 30% of the samples.
>>> X_train.shape
(1119, 11)
>>> X_test.shape
(480, 11)
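Note that train_test_split shuffles the rows randomly, so the exact split (and therefore the MSE values below) will vary from run to run. For a reproducible split, pass a fixed random_state; the seed value 0 here is an arbitrary choice:
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)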
Then, train the model:
>>> ols_lr.train(X_train, y_train)
Finally, make predictions:
>>> y_pred = ols_lr.predict(X_test)
As in the previous article, we use the mean squared error loss to measure the regression model's performance:
>>> from sklearn.metrics import mean_squared_error
>>> mse = mean_squared_error(y_test, y_pred)
>>> mse
0.4735667694917602
>>> y_train_pred = ols_lr.predict(X_train)
>>> mse_train = mean_squared_error(y_train, y_train_pred)
>>> mse_train
0.39425252552801526
The training-set error is lower than the test-set error, which is expected: the parameters were fit to the training data, so the model generalizes slightly worse to unseen samples.
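To read these numbers on the scale of the target itself, take the square root: a test MSE of about 0.47 corresponds to a root mean squared error of roughly 0.69, i.e. predictions are off by about 0.7 quality points on average.
>>> np.sqrt(mse)   # approximately 0.69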
Now do the same with the linear model from the sklearn library: train, predict, and compute the loss:
>>> from sklearn import linear_model
>>> model = linear_model.LinearRegression()
>>> model.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
>>> y_pred = model.predict(X_test)
>>> mse = mean_squared_error(y_test, y_pred)
>>> mse
0.47356676947363036
>>> y_train_pred = model.predict(X_train)
>>> mse_train = mean_squared_error(y_train, y_train_pred)
>>> mse_train
0.3942525255280152
The results are virtually identical to our hand-rolled implementation, which makes sense: sklearn's LinearRegression solves the same ordinary least-squares problem, just with a more numerically robust solver under the hood.
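As a further check (a sketch, assuming both models were fit on the same split as above), the fitted parameters themselves can be compared: our self.w stores the intercept in w[0], which corresponds to sklearn's intercept_, and the remaining entries correspond to coef_. Both comparisons should return True:
>>> np.allclose(ols_lr.w[0], model.intercept_)
>>> np.allclose(ols_lr.w[1:], model.coef_)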
References:
《Python机器学习算法:原理,实现与案例》 (Python Machine Learning Algorithms: Principles, Implementation, and Cases)