Abstract: Notes on understanding the code from a jianshu.com article (linked in the original as "写文章 (jianshu.com)").

References:

（1）Machine Learning (《机器学习》), by Zhou Zhihua, Tsinghua University Press

（2）Hands-On Machine Learning with Scikit-Learn and TensorFlow, by Aurélien Géron

（3）Python Machine Learning (《Python机器学习》), by Sebastian Raschka (US); translated by Gao Ming, Xu Ying, and Tao Hucheng

（4）Splitting the dataset: https://developer.aliyun.com/article/672642

（5）Data normalization: https://blog.csdn.net/GentleCP/article/details/109333753

（6）Model evaluation with r2_score: https://blog.csdn.net/qq_38890412/article/details/109565970

（7）LinearRegression reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

（8）matplotlib rc configuration: https://blog.csdn.net/weixin_39010770/article/details/88200298

（9）matplotlib.pyplot.plot reference: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot

1. The machine learning process

（1）Look at the big picture

（2）Get the data

（3）Discover and visualize the data to gain insights

（4）Prepare the data for machine learning algorithms

（5）Select a model and train it

（6）Fine-tune your model

（7）Present your solution

（8）Launch, monitor, and maintain your system

5. Training the model

5.1 Basic form of the linear model

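$f(\boldsymbol{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b = \boldsymbol{w}^{\mathrm{T}}\boldsymbol{x} + b$; once $\boldsymbol{w}$ and $b$ have been learned, the model is determined.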
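
To make the form $f(\boldsymbol{x}) = \boldsymbol{w}^{\mathrm{T}}\boldsymbol{x} + b$ concrete, here is a minimal plain-Python sketch that evaluates it for one sample. The helper name linear_model and the weights, bias, and feature values are made up for illustration; they are not learned from the housing data.

```python
def linear_model(x, w, b):
    """Return f(x) = w^T x + b for one sample x (a list of feature values)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


# Hypothetical parameters and sample, chosen only for illustration.
w = [2.0, -1.0, 0.5]
b = 4.0
x = [1.0, 3.0, 2.0]

print(linear_model(x, w, b))  # 2*1 - 1*3 + 0.5*2 + 4 = 4.0
```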

5.2 Linear regression

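Given a dataset $D = \{(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_m, y_m)\}$, where $\boldsymbol{x}_i = (x_{i1}; x_{i2}; \ldots; x_{id})$ and $y_i \in \mathbb{R}$, "linear regression" tries to learn a linear model that predicts the real-valued output label as accurately as possible.

A more detailed explanation of this will likewise follow in a later post.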
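
As a rough sketch of what "learning" the parameters can mean, the single-feature case has a closed-form least-squares solution. This is not taken from the article's code: the helper name fit_simple_linear_regression and the toy data are assumptions for illustration, and sklearn's LinearRegression solves the general multi-feature ordinary least-squares problem rather than using this formula directly.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y ≈ w * x + b for a single feature."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    # w = sum((x_i - x_mean) * (y_i - y_mean)) / sum((x_i - x_mean)^2)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w = s_xy / s_xx
    b = y_mean - w * x_mean
    return w, b


# Toy data generated exactly from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_simple_linear_regression(xs, ys)
print(w, b)  # 2.0 1.0
```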

5.3 Code

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(x_train, y_train)

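Import the linear regression model from sklearn's linear_model module (generalized linear models) and instantiate it, then train it on the training set that was split off earlier. lr is the model fitted on the training set x_train and its labels y_train.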
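
The snippet above relies on x_train and y_train from the earlier splitting step. A self-contained sketch, assuming synthetic data in place of the housing data (the array shapes, the coefficients in true_w, the noise level, and the random seeds are all made up), might look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 200 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])  # hypothetical "true" weights
y = X @ true_w + 3.0 + rng.normal(scale=0.1, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lr = LinearRegression()
lr.fit(x_train, y_train)

# coef_ holds the learned w, intercept_ the learned b.
print(lr.coef_, lr.intercept_)
```

With this little noise, the learned coef_ and intercept_ should land close to the values used to generate the data.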

6. Model evaluation

6.1 Performance measures

The most commonly used performance measure for regression tasks is the mean squared error (MSE).
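
For reference, the mean squared error is $\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2$. A minimal plain-Python sketch follows. The helper name mse and the numbers are illustrative only; in the article's code this role is played by sklearn's mean_squared_error.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values, not taken from the housing data.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
print(mse(y_true, y_pred))  # (1 + 0 + 4 + 1) / 4 = 1.5
```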

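6.2 Coefficient of determination

In statistics, the coefficient of determination reflects what percentage of the variation in the dependent variable can be described by the variation in the independent variables (that is, the features). Put simply, it can be used to judge how well a statistical model fits the data (its explanatory power).

In mathematical terms, the coefficient of determination is the ratio of the regression sum of squares to the total sum of squares:

$R^2 = \frac{SSR}{SST} = \frac{\sum_{i} (\hat{y}_i - \bar{y})^2}{\sum_{i} (y_i - \bar{y})^2}$

sklearn provides two ways to compute the coefficient of determination. One is the score method of the model lr trained above; the other is to import the r2_score function from sklearn.metrics.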
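
As a hand check, sklearn documents r2_score as computing 1 - SS_res/SS_tot. For an ordinary least-squares fit with an intercept, evaluated on its own training data, this equals the ratio of the regression sum of squares to the total sum of squares; on a test set the two can differ. A plain-Python sketch of the 1 - SS_res/SS_tot form, with a hypothetical helper r2 and illustrative numbers:

```python
def r2(y_true, y_pred):
    """Coefficient of determination computed as 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)             # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Illustrative values, not taken from the housing data.
y_true = [0.0, 2.0, 4.0, 6.0]
y_pred = [1.0, 2.0, 6.0, 6.0]
print(r2(y_true, y_pred))  # 1 - 5/20 = 0.75
```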

6.3 Plotting function

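# Plotting function
import matplotlib.pyplot as plt

def figure(title, *datalist):
    plt.figure(facecolor='gray', figsize=[20, 10])  # set the canvas size and background color
    for v in datalist:
        plt.plot(v[0], '-', label=v[1], linewidth=2)  # line, labelled for the legend
        plt.plot(v[0], 'o')  # markers at each data point
    plt.grid()  # draw grid lines
    plt.title(title, fontsize=20)
    plt.legend(fontsize=16)
    plt.show()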

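The use of matplotlib will be covered in a later post. For now, note that this function takes two kinds of arguments: the first is the figure title, and the rest are one or more two-element lists, where the first element is the data to plot and the second is its legend label.

As for Chinese text rendering as garbled characters, use plt's rc configuration to set the font to SimHei (黑体).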

plt.rcParams['font.sans-serif'] = ['SimHei']

6.4 Code

y_train_pred = lr.predict(x_train)

y_test_pred = lr.predict(x_test)

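First compute the model's predictions on the training set and the test set, then compare them with the true labels.

from sklearn.metrics import mean_squared_error, r2_score

print(mean_squared_error(y_train, y_train_pred))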

print(mean_squared_error(y_test, y_test_pred))

Print the mean squared error of the predictions on the training set and on the test set, respectively.

print(lr.score(x_train, y_train))

print(lr.score(x_test, y_test))

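Print the coefficient of determination of the predictions on the training set and on the test set, respectively.

# Plot predicted vs. true values (title: "predicted vs. true values, model R^2"; legend labels: "predicted" and "true")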

figure('预测值与真实值图模型的' + r'$R^2={:.4f}$'.format(r2_score(y_test, y_test_pred)), [y_test_pred, '预测值'], [y_test, '真实值'])

Visualize the prediction results on the test set.

Machine Learning: Understanding the Linear Regression Boston Housing Price Code (Part 3)