This article puts all of the common regression algorithms in scikit-learn into practice on the classic Boston house-price prediction problem, so it does not dwell on the theory behind each algorithm.
We use the 'Boston house prices' dataset bundled with sklearn. First, let's look at the dataset description via code:
from sklearn import datasets
boston = datasets.load_boston()
print(boston.DESCR)
The output is as follows:
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
The description shows that the dataset contains 506 samples, each with 13 features, and has no missing attribute values (Missing Attribute Values: None), which makes further analysis straightforward.
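Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2 because of the ethical concerns around the B feature. On newer versions, a minimal workaround, mirroring the snippet from scikit-learn's own deprecation warning (pandas and numpy assumed available), is to fetch the original data from the StatLib archive:

import numpy as np
import pandas as pd

# Fetch the original Boston data (source suggested by scikit-learn's deprecation notice)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # the 13 features
target = raw_df.values[1::2, 2]  # MEDV

The code below sticks with load_boston, so it assumes scikit-learn < 1.2.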
With the data in hand, let's fit and predict the Boston house prices with each of the regressors below. Straight to the code:
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
boston = datasets.load_boston()
# print(boston.DESCR)
X = boston.data
y = boston.target
# print(X.shape, y.shape)
# Standardize X and y; the fitted scalers' inverse_transform can later recover the original scale
ss_X = StandardScaler()
ss_y = StandardScaler()
X = ss_X.fit_transform(X)
y = ss_y.fit_transform(y.reshape(-1, 1))  # reshape(-1, 1) is required: StandardScaler expects 2-D input and raises an error otherwise
y = y.ravel()  # flatten y back to a 1-D array, which the regressors below expect
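# Optional sanity check (a sketch; assumes `import numpy as np` is added above):
# the scaler round-trips, i.e. inverse_transform recovers the original targets.
# np.testing.assert_allclose(ss_y.inverse_transform(y.reshape(-1, 1)).ravel(), boston.target)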
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2)
# A Series to store each model's performance (R-Squared) for ranking later
r_squared = pd.Series(dtype='float64')  # explicit dtype avoids a pandas warning for empty Series
# Fit and predict with linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_lr_predict = lr.predict(X_test)
r_squared['Linear Regression'] = lr.score(X_test, y_test)
# Fit and predict with stochastic-gradient-descent (SGD) regression
sgdr = SGDRegressor()
sgdr.fit(X_train, y_train)
y_sgdr_predict = sgdr.predict(X_test)
r_squared['SGD Regressor'] = sgdr.score(X_test, y_test)
# SVM regression with a linear kernel
linear_svr = SVR(kernel='linear')
linear_svr.fit(X_train, y_train)
y_linear_svr_predict = linear_svr.predict(X_test)
r_squared['Linear SVM'] = linear_svr.score(X_test, y_test)
# SVM regression with a polynomial kernel
poly_svr = SVR(kernel='poly')
poly_svr.fit(X_train, y_train)
y_poly_svr_predict = poly_svr.predict(X_test)
r_squared['Polynomial SVM'] = poly_svr.score(X_test, y_test)
# SVM regression with an RBF (radial basis function) kernel
rbf_svr = SVR(kernel='rbf')
rbf_svr.fit(X_train, y_train)
y_rbf_svr_predict = rbf_svr.predict(X_test)
r_squared['RBF SVM'] = rbf_svr.score(X_test, y_test)
# KNN regression with uniform weights, i.e. a plain average over the neighbors
uni_knr = KNeighborsRegressor(weights='uniform')
uni_knr.fit(X_train, y_train)
y_uni_knr_predict = uni_knr.predict(X_test)
r_squared['uniform weighted KNN'] = uni_knr.score(X_test, y_test)
# KNN regression with distance-based weights, i.e. closer neighbors count more
dis_knr = KNeighborsRegressor(weights='distance')
dis_knr.fit(X_train, y_train)
y_dis_knr_predict = dis_knr.predict(X_test)
r_squared['distance weighted KNN'] = dis_knr.score(X_test, y_test)
# Fit and predict with a regression (decision) tree
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
y_dtr_predict = dtr.predict(X_test)
r_squared['Decision Tree Regressor'] = dtr.score(X_test, y_test)
# Ensemble model: random forest regressor
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
y_rfr_predict = rfr.predict(X_test)
r_squared['Random Forest Regressor'] = rfr.score(X_test, y_test)
# Ensemble model: extremely randomized trees (Extra-Trees)
etr = ExtraTreesRegressor()
etr.fit(X_train, y_train)
y_etr_predict = etr.predict(X_test)
r_squared['Extra Trees Regressor'] = etr.score(X_test, y_test)
# Ensemble model: gradient boosted regression trees
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
y_gbr_predict = gbr.predict(X_test)
r_squared['Gradient Boosting Regressor'] = gbr.score(X_test, y_test)
# Report the linear regression model under three metrics; the ranking below uses R-Squared
print('R-Squared of Linear Regression:', lr.score(X_test, y_test))
print('Mean Squared Error of Linear Regression:', mean_squared_error(ss_y.inverse_transform(y_test.reshape(-1, 1)), ss_y.inverse_transform(y_lr_predict.reshape(-1, 1))))
print('Mean Absolute Error of Linear Regression:', mean_absolute_error(ss_y.inverse_transform(y_test.reshape(-1, 1)), ss_y.inverse_transform(y_lr_predict.reshape(-1, 1))))
# Print every model's R-Squared, sorted in descending order
print('------------ R-Squared ranking of the regression models ------------')
print(r_squared.sort_values(ascending=False))
Output:
Performance of the linear regression model under the three metrics:
R-Squared of Linear Regression: 0.7503116174489232
Mean Squared Error of Linear Regression: 22.160198304875518
Mean Absolute Error of Linear Regression: 3.241656596795042
------------ R-Squared ranking of the regression models ------------
Gradient Boosting Regressor 0.901122
Random Forest Regressor 0.869124
Extra Trees Regressor          0.863499
RBF SVM 0.834929
Polynomial SVM 0.796276
Linear Regression              0.750312
SGD Regressor 0.741797
Linear SVM                     0.733410
distance weighted KNN 0.718520
Decision Tree Regressor        0.704389
uniform weighted KNN 0.669170
dtype: float64
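Incidentally, score on any sklearn regressor returns R-Squared, so the r2_score function imported at the top computes the same number from the stored predictions. A quick check, reusing the variables from the script:

print('R-Squared via r2_score:', r2_score(y_test, y_lr_predict))  # equals lr.score(X_test, y_test)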
These results show that nonlinear tree-based models, and ensemble models in particular, achieve noticeably better performance on this problem.
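Two caveats: every model above runs with its default hyperparameters, and the ranking comes from a single 75/25 split, so the exact order can shift with random_state. A sketch of a more robust comparison using 5-fold cross-validation (reusing the standardized X and y from the script above):

from sklearn.model_selection import cross_val_score

# Cross-validated R-Squared for the best performer above, still with default parameters
cv_scores = cross_val_score(GradientBoostingRegressor(), X, y, cv=5, scoring='r2')
print('mean R-Squared: %.3f (+/- %.3f)' % (cv_scores.mean(), cv_scores.std()))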