In practice, the features are rarely one-dimensional. House price prediction, for example, depends not only on floor area but on many other attributes of the house (size, number of rooms, location, whether there is a swimming pool, and so on). Predicting the price therefore requires different code from the one-variable case; a code example and its analysis follow.
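Before turning to the dataset, the idea of multiple linear regression can be stated compactly: the prediction is a weighted sum of all features plus an intercept, y = w·x + b. A minimal sketch with purely illustrative weights and a made-up sample (all numbers here are hypothetical, not learned from data):

```python
import numpy as np

# Illustrative weights for 3 features: area (sqm), rooms, age (years)
w = np.array([3000.0, 10000.0, -500.0])
b = 50000.0  # illustrative intercept

# One hypothetical house: 120 sqm, 3 rooms, 10 years old
x = np.array([120.0, 3.0, 10.0])

# Multiple linear regression prediction: y = w . x + b
y = w @ x + b
print(y)  # 435000.0
```

Fitting a model means choosing w and b so that such predictions match the observed prices as closely as possible.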
First, observe the form of the dataset. It is stored as a CSV file in the following format:
longitude latitude ... median_house_value ocean_proximity
0 -122.23 37.88 ... 452600.0 NEAR BAY
1 -122.22 37.86 ... 358500.0 NEAR BAY
2 -122.24 37.85 ... 352100.0 NEAR BAY
3 -122.25 37.85 ... 341300.0 NEAR BAY
4 -122.25 37.85 ... 342200.0 NEAR BAY
... ... ... ... ... ...
20635 -121.09 39.48 ... 78100.0 INLAND
20636 -121.21 39.49 ... 77100.0 INLAND
20637 -121.22 39.43 ... 92300.0 INLAND
20638 -121.32 39.43 ... 84700.0 INLAND
20639 -121.24 39.37 ... 89400.0 INLAND
We first import the required packages: pandas to read the file, numpy to handle the arrays, and sklearn (the machine-learning package) to fit the multiple linear regression model.
import pandas as pd
from pandas import read_csv
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
The first step is to load the data from the CSV file:
# Load the dataset
filename = "mutilp_linear_data.csv"
names = ['longitude','latitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income','median_house_value','ocean_proximity']
housing_data = read_csv(filename, names=names)
print(housing_data)
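After loading, it is useful to count missing values per column before deciding how to handle them; `DataFrame.isnull().sum()` does exactly that. A minimal self-contained sketch, with a toy frame standing in for `housing_data` (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for housing_data; one bedroom count is missing
df = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND"],
})

# Per-column count of missing values
print(df.isnull().sum())
```

On the real dataset, the same call reveals which columns need cleaning and how many rows are affected.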
While inspecting the dataset we found that some rows are missing features, but there are so few of them relative to the whole dataset that they are simply dropped here. Datasets in fields such as biology, however, often have large numbers of missing values; in such cases methods like the Bayesian imputation discussed later, or other imputation techniques, can be used instead. This step is part of preprocessing the dataset.
# Data preprocessing
housing_data = housing_data.dropna()  # drop rows with missing values
housing_data = pd.get_dummies(housing_data)  # one-hot encode categorical variables
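As an alternative to dropping rows, scikit-learn's `SimpleImputer` can fill missing values, for instance with the column median. A minimal sketch on illustrative toy data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing value (illustrative numbers)
df = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0]})

# Replace missing entries with the column median (median of 100 and 300 is 200)
imputer = SimpleImputer(strategy="median")
df[["total_bedrooms"]] = imputer.fit_transform(df[["total_bedrooms"]])
print(df["total_bedrooms"].tolist())  # [100.0, 200.0, 300.0]
```

The choice between dropping and imputing depends on how much data would be lost: here very little, so `dropna()` is acceptable.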
Next, we separate the features from the label so that training and prediction can be carried out. Based on the observed dataset, the separation can be done with the following code:
# Separate features and label
X = housing_data.drop("median_house_value", axis=1)
y = housing_data["median_house_value"]
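The drop/select pattern used above can be checked on a tiny toy frame (the column names here are hypothetical): `drop` returns all columns except the label, while indexing by name returns the label alone.

```python
import pandas as pd

# Toy frame with two hypothetical features and a target column
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "target": [10, 20]})

X = df.drop("target", axis=1)  # features: every column except "target"
y = df["target"]               # label: the "target" column only
print(list(X.columns), y.tolist())  # ['a', 'b'] [10, 20]
```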
The data then needs to be divided into a training set and a test set: the training set is used to fit the model, while the test set is used to verify its accuracy. Here an 80:20 split is used.
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
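`train_test_split` shuffles the rows and partitions them; with `test_size=0.2` the result is the 80:20 split described above. This can be verified on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 reserves 2 of the 10 samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))  # 8 2
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs produce the same split.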
With the data prepared, the model can be built with sklearn as follows:
# Build the model
model = LinearRegression()
model.fit(X_train, y_train)
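After fitting, the learned weights and intercept are exposed as `model.coef_` and `model.intercept_`. A minimal sketch on toy data generated from the exact relation y = 2x₁ + 3x₂ + 1, which ordinary least squares recovers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data satisfying y = 2*x1 + 3*x2 + 1 exactly
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)  # approximately [2. 3.] and 1.0
```

On the housing data, inspecting `coef_` in the same way shows how strongly each feature (and each one-hot `ocean_proximity` column) pushes the predicted price up or down.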
Once the model is built, we predict on the test set and evaluate the result to verify the model's accuracy:
# Predict
y_pred = model.predict(X_test)
# Evaluate
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)
The output is:
longitude latitude ... median_house_value ocean_proximity
0 -122.23 37.88 ... 452600.0 NEAR BAY
1 -122.22 37.86 ... 358500.0 NEAR BAY
2 -122.24 37.85 ... 352100.0 NEAR BAY
3 -122.25 37.85 ... 341300.0 NEAR BAY
4 -122.25 37.85 ... 342200.0 NEAR BAY
... ... ... ... ... ...
20635 -121.09 39.48 ... 78100.0 INLAND
20636 -121.21 39.49 ... 77100.0 INLAND
20637 -121.22 39.43 ... 92300.0 INLAND
20638 -121.32 39.43 ... 84700.0 INLAND
20639 -121.24 39.37 ... 89400.0 INLAND
[20640 rows x 10 columns]
RMSE: 69297.71669113009
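RMSE alone can be hard to interpret; the R² score (`sklearn.metrics.r2_score`) is a common complementary metric, measuring the fraction of variance explained. A minimal sketch computing both on illustrative toy values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative true prices and predictions, each off by 10
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 310.0])

rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # 10.0
r2 = r2_score(y_true, y_hat)                       # 0.985
print(rmse, r2)
```

An RMSE near 69,000 on prices in the hundreds of thousands suggests a plain linear model captures the broad trend but leaves substantial error, which is typical for this dataset.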
Complete code:
import pandas as pd
from pandas import read_csv
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load the dataset
filename = "mutilp_linear_data.csv"
names = ['longitude','latitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income','median_house_value','ocean_proximity']
housing_data = read_csv(filename, names=names)
print(housing_data)
# Data preprocessing
housing_data = housing_data.dropna()  # drop rows with missing values
housing_data = pd.get_dummies(housing_data)  # one-hot encode categorical variables
# Separate features and label
X = housing_data.drop("median_house_value", axis=1)
y = housing_data["median_house_value"]
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)