II. Sklearn Machine Learning Basic Examples --- Regression Prediction

Reading notes on the book 《Python机器学习及实践》 (Python Machine Learning and Practice)

Regression prediction problems

The target to be predicted is a continuous variable, such as price or rainfall.

[Figure: scikit-learn algorithm cheat-sheet for choosing an estimator]

Image source: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

 


1. Linear Regressors

Simple and easy to use, but the linearity assumption limits their applicability. When the relationships among the features are unclear, a linear regression model can serve as the baseline system for most scientific experiments.

from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this snippet follows the book and needs an older version
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

boston = load_boston()
X, y = boston.data, boston.target
y = y.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=33)
# House prices span a wide range, so standardize both the features and the targets
ss_X, ss_y = StandardScaler(), StandardScaler()
X_train, y_train = ss_X.fit_transform(X_train), ss_y.fit_transform(y_train)
X_test, y_test = ss_X.transform(X_test), ss_y.transform(y_test)
y_train, y_test = y_train.reshape(-1,), y_test.reshape(-1,)

#LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_y_pred = lr.predict(X_test)
print("The value of default measurement of LinearRegression is {sc}".format(sc=lr.score(X_test, y_test)))
print("The value of R-squared of LinearRegression is {}".format(r2_score(y_test, lr_y_pred)))
print("The mean squared error of LinearRegression is {}".format(mean_squared_error(y_test, lr_y_pred)))
print("The mean absolute error of LinearRegression is {}".format(mean_absolute_error(y_test, lr_y_pred)))

#SGDRegressor
sgdr = SGDRegressor()
sgdr.fit(X_train, y_train)
sgdr_y_pred = sgdr.predict(X_test)
print("The value of default measurement of SGDRegressor is {sc}".format(sc=sgdr.score(X_test, y_test)))
print("The value of R-squared of SGDRegressor is {}".format(r2_score(y_test, sgdr_y_pred)))
print("The mean squared error of SGDRegressor is {}".format(mean_squared_error(y_test, sgdr_y_pred)))
print("The mean absolute error of SGDRegressor is {}".format(mean_absolute_error(y_test, sgdr_y_pred)))

--------------------
output:
The value of default measurement of LinearRegression is 0.6763403830998702
The value of R-squared of LinearRegression is 0.6763403830998702
The mean squared error of LinearRegression is 0.2914340857699757
The mean absolute error of LinearRegression is 0.379976703912958
The value of default measurement of SGDRegressor is 0.6550953755633774
The value of R-squared of SGDRegressor is 0.6550953755633774
The mean squared error of SGDRegressor is 0.31056381041055237
The mean absolute error of SGDRegressor is 0.3787925618115339
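Because the targets were standardized, the errors above are in standardized units. To report them in the original price units, predictions can be mapped back with StandardScaler.inverse_transform. A minimal, self-contained sketch (the prices below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up house prices standing in for boston.target.
y = np.array([21.0, 34.7, 16.5, 24.0, 50.0]).reshape(-1, 1)

ss_y = StandardScaler()
y_scaled = ss_y.fit_transform(y)          # what the model is trained on

# A model predicts in the standardized space; inverse_transform maps
# those predictions back to the original units before computing errors.
y_restored = ss_y.inverse_transform(y_scaled)
print(np.allclose(y_restored, y))  # True: the round trip recovers the prices
```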

2. Support Vector Machines (SVM)

We train SVM regressors with three different kernel configurations and make predictions on the test data with each. After the Gaussian (RBF) kernel maps the features nonlinearly, the SVM delivers the best regression performance of the three.

from sklearn.svm import SVR
# Linear kernel
linear_svr = SVR(kernel='linear')
linear_svr.fit(X_train, y_train)
linear_svr_y_pred = linear_svr.predict(X_test)
print("The value of default measurement of linear_svr is {sc}".format(sc=linear_svr.score(X_test, y_test)))
print("The mean squared error of linear_svr is {}".format(mean_squared_error(y_test, linear_svr_y_pred)))
print("The mean absolute error of linear_svr is {}".format(mean_absolute_error(y_test, linear_svr_y_pred)))
# Polynomial kernel
poly_svr = SVR(kernel='poly')
poly_svr.fit(X_train, y_train)
poly_svr_y_pred = poly_svr.predict(X_test)
print("The value of default measurement of poly_svr is {sc}".format(sc=poly_svr.score(X_test, y_test)))
print("The mean squared error of poly_svr is {}".format(mean_squared_error(y_test, poly_svr_y_pred)))
print("The mean absolute error of poly_svr is {}".format(mean_absolute_error(y_test, poly_svr_y_pred)))
# Gaussian (RBF) kernel
rbf_svr = SVR(kernel='rbf')
rbf_svr.fit(X_train, y_train)
rbf_svr_y_pred = rbf_svr.predict(X_test)
print("The value of default measurement of rbf_svr is {sc}".format(sc=rbf_svr.score(X_test, y_test)))
print("The mean squared error of rbf_svr is {}".format(mean_squared_error(y_test, rbf_svr_y_pred)))
print("The mean absolute error of rbf_svr is {}".format(mean_absolute_error(y_test, rbf_svr_y_pred)))
-----------------
output:
The value of default measurement of linear_svr is 0.651717097429608
The mean squared error of linear_svr is 0.31360572651000684
The mean absolute error of linear_svr is 0.3692598109626841

The value of default measurement of poly_svr is 0.40445405800289286
The mean squared error of poly_svr is 0.5362497453412647
The mean absolute error of poly_svr is 0.4043235900151712

The value of default measurement of rbf_svr is 0.7564068912273935
The mean squared error of rbf_svr is 0.21933948892028846
The mean absolute error of rbf_svr is 0.28099219092230115
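The RBF kernel's accuracy depends strongly on the penalty C and the kernel width gamma, which were left at their defaults above. A hedged sketch of tuning them by cross-validated grid search, shown on synthetic data (make_regression) rather than the Boston set:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression data standing in for the standardized Boston features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=33)

# Search a small grid of C and gamma by 5-fold cross-validated R-squared.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the best-scoring (C, gamma) combination
```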

3. K-Nearest Neighbors (KNN)

The regression value of a test sample is decided from the target values of its K nearest training samples, using either an arithmetic mean or a distance-weighted mean.

from sklearn.neighbors import KNeighborsRegressor

uni_knr = KNeighborsRegressor(weights='uniform')
uni_knr.fit(X_train, y_train)
uni_knr_y_predict = uni_knr.predict(X_test)

dis_knr = KNeighborsRegressor(weights='distance')
dis_knr.fit(X_train, y_train)
dis_knr_y_predict = dis_knr.predict(X_test)

print("R-squared value of uniform-weighted KNeighborsRegressor: {sc}".format(sc=uni_knr.score(X_test, y_test)))
print("The mean squared error of uniform-weighted KNeighborsRegressor: {}".format(mean_squared_error(y_test, uni_knr_y_predict)))
print("The mean absolute error of uniform-weighted KNeighborsRegressor: {}".format(mean_absolute_error(y_test, uni_knr_y_predict)))

print("R-squared value of distance-weighted KNeighborsRegressor: {sc}".format(sc=dis_knr.score(X_test, y_test)))
print("The mean squared error of distance-weighted KNeighborsRegressor: {}".format(mean_squared_error(y_test, dis_knr_y_predict)))
print("The mean absolute error of distance-weighted KNeighborsRegressor: {}".format(mean_absolute_error(y_test, dis_knr_y_predict)))
----------------
R-squared value of uniform-weighted KNeighborsRegressor: 0.6903454564606561
The mean squared error of uniform-weighted KNeighborsRegressor: 0.27882344317534674
The mean absolute error of uniform-weighted KNeighborsRegressor: 0.3198364056782287

R-squared value of distance-weighted KNeighborsRegressor: 0.7198
The mean squared error of distance-weighted KNeighborsRegressor: 0.25233849462662644
The mean absolute error of distance-weighted KNeighborsRegressor: 0.302274187769214
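The two weighting schemes differ only in how the neighbors' targets are averaged. A tiny hand computation with made-up neighbor targets and distances:

```python
import numpy as np

# Targets and distances of the K = 3 nearest neighbors (made-up numbers).
neighbor_y = np.array([20.0, 24.0, 30.0])
distances = np.array([1.0, 2.0, 4.0])

# weights='uniform': plain arithmetic mean of the neighbor targets.
uniform_pred = neighbor_y.mean()

# weights='distance': each neighbor is weighted by 1/distance,
# so closer neighbors pull the prediction toward their values.
w = 1.0 / distances
distance_pred = np.sum(w * neighbor_y) / np.sum(w)

print(round(uniform_pred, 3), round(distance_pred, 3))  # 24.667 22.571
```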

4. Regression Trees

A regression tree's leaf nodes hold continuous target values rather than discrete class labels: each leaf returns the mean of the training samples that fall into it, so predictions are piecewise constant rather than arbitrary continuous values.

Advantages:

  • Can handle nonlinear feature relationships
  • Does not require feature standardization
  • Predictions are interpretable

Disadvantages:

  • Prone to overfitting, with poor generalization
  • Unstable: small changes in the data can cause large changes in the tree structure, so predictions are unstable
  • Finding the optimal tree is NP-hard, so the best solution cannot be obtained in bounded time

from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
dtr_y_predict = dtr.predict(X_test)

print("R-squared value of DecisionTreeRegressor: {sc}".format(sc=dtr.score(X_test, y_test)))
print("The mean squared error of DecisionTreeRegressor: {}".format(mean_squared_error(y_test, dtr_y_predict)))
print("The mean absolute error of DecisionTreeRegressor: {}".format(mean_absolute_error(y_test, dtr_y_predict)))
----------------
R-squared value of DecisionTreeRegressor: 0.6578394137883
The mean squared error of DecisionTreeRegressor: 0.30809298541527674
The mean absolute error of DecisionTreeRegressor: 0.3486517193537014
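The overfitting listed among the disadvantages is easy to see: an unconstrained tree fits the training set perfectly, and capping its depth is a common remedy. A sketch on synthetic data (make_regression; the Boston data would behave similarly):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=33)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=33)

# An unconstrained tree memorizes the training set exactly.
full = DecisionTreeRegressor(random_state=33).fit(X_tr, y_tr)

# Capping the depth trades training fit for (often) better generalization.
shallow = DecisionTreeRegressor(max_depth=4, random_state=33).fit(X_tr, y_tr)

print(full.score(X_tr, y_tr))  # 1.0 on the training set
print(full.score(X_te, y_te), shallow.score(X_te, y_te))
```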

5. Ensemble Models

Many production systems favor ensemble models, and their performance is often used as the benchmark against which newly designed models are compared. Ensemble models take longer to train, but they usually deliver higher performance and better stability.

from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor

#RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
rfr_y_predict = rfr.predict(X_test)

#ExtraTreesRegressor
etr = ExtraTreesRegressor()
etr.fit(X_train, y_train)
etr_y_predict = etr.predict(X_test)

#GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
gbr_y_predict = gbr.predict(X_test)

print("R-squared value of RandomForestRegressor: {sc}".format(sc=rfr.score(X_test, y_test)))
print("The mean squared error of RandomForestRegressor: {}".format(mean_squared_error(y_test, rfr_y_predict)))
print("The mean absolute error of RandomForestRegressor: {}".format(mean_absolute_error(y_test, rfr_y_predict)))
print()
print("R-squared value of ExtraTreesRegressor: {sc}".format(sc=etr.score(X_test, y_test)))
print("The mean squared error of ExtraTreesRegressor: {}".format(mean_squared_error(y_test, etr_y_predict)))
print("The mean absolute error of ExtraTreesRegressor: {}".format(mean_absolute_error(y_test, etr_y_predict)))
# Print each feature's importance in the extra-trees model
for name, importance in zip(boston.feature_names, etr.feature_importances_):
    print(name, importance)

print()
print("R-squared value of GradientBoostingRegressor: {sc}".format(sc=gbr.score(X_test, y_test)))
print("The mean squared error of GradientBoostingRegressor: {}".format(mean_squared_error(y_test, gbr_y_predict)))
print("The mean absolute error of GradientBoostingRegressor: {}".format(mean_absolute_error(y_test, gbr_y_predict)))
----------------
R-squared value of RandomForestRegressor: 0.8061848865443059
The mean squared error of RandomForestRegressor: 0.17451769528539432
The mean absolute error of RandomForestRegressor: 0.25866750217808665

R-squared value of ExtraTreesRegressor: 0.7884953230506744
The mean squared error of ExtraTreesRegressor: 0.19044597763897325
The mean absolute error of ExtraTreesRegressor: 0.26368218132478993
CRIM 0.02714547406437261
ZN 0.008844733766319734
INDUS 0.04703140761230033
CHAS 0.027736865042215297
NOX 0.041049521038692874
RM 0.2521879378214602
AGE 0.015563841006466943
DIS 0.030784973556192842
RAD 0.0066161352621591785
TAX 0.051284319483710275
PTRATIO 0.04986567775108296
B 0.018597944399002288
LSTAT 0.4232911691960245

R-squared value of GradientBoostingRegressor: 0.8447341376327575
The mean squared error of GradientBoostingRegressor: 0.13980664342270013
The mean absolute error of GradientBoostingRegressor: 0.24442389615991972
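The feature importances above are printed in dataset order; sorting them makes the ranking (LSTAT and RM dominating) easier to read. A sketch using zip and sorted, on synthetic data with hypothetical feature names standing in for boston.feature_names:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data with hypothetical names standing in for boston.feature_names.
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=33)
feature_names = ["f0", "f1", "f2", "f3"]

etr = ExtraTreesRegressor(random_state=33).fit(X, y)

# Pair each name with its importance and sort in descending order,
# instead of indexing both arrays with a list comprehension.
ranked = sorted(zip(feature_names, etr.feature_importances_),
                key=lambda item: item[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```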

Performance Comparison

R-squared is scale-invariant, but the MSE and MAE in this table are measured on the original house-price scale, unlike the standardized errors printed above; exact values also vary slightly between runs.

Rank  Regressor                          R-squared    MSE    MAE
 1    GradientBoostingRegressor             0.8426   12.20   2.29
 2    ExtraTreesRegressor                   0.8195   13.99   2.36
 3    RandomForestRegressor                 0.8024   15.32   2.37
 4    SVM Regressor (RBF kernel)            0.7564   18.89   2.61
 5    KNN Regressor (distance-weighted)     0.7198   21.73   2.81
 6    DecisionTreeRegressor                 0.6941   23.72   3.14
 7    KNN Regressor (uniform-weighted)      0.6903   24.01   2.97
 8    LinearRegression                      0.6763   25.10   3.53
 9    SGDRegressor                          0.6599   26.38   3.55
10    SVM Regressor (linear kernel)         0.6517   27.76   3.57
11    SVM Regressor (poly kernel)           0.4045   46.18   3.75

 

 
