Reading notes on the book 《Python机器学习及实践》
The target to be predicted is a continuous variable, e.g. price or rainfall.
Image from: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Simple and easy to use, but the linearity assumption limits its scope. When the relationships among the features are unknown, a linear regression model can serve as the baseline system for most scientific experiments.
from sklearn.datasets import load_boston  # deprecated in scikit-learn 1.0 and removed in 1.2; requires an older release
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression,SGDRegressor
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
boston = load_boston()
X, y = boston.data, boston.target
y = y.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=33)
#House prices vary over a wide range, so both the features and the target values are standardized
ss_X, ss_y= StandardScaler(), StandardScaler()
X_train, y_train = ss_X.fit_transform(X_train), ss_y.fit_transform(y_train)
X_test, y_test = ss_X.transform(X_test), ss_y.transform(y_test)
y_train, y_test= y_train.reshape(-1,), y_test.reshape(-1,)
#LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_y_pred = lr.predict(X_test)
print("The value of default measurement of LinearRegression is {sc}".format(sc=lr.score(X_test,y_test)))
print("The value of R-squared of LinearRegression is {}".format(r2_score(y_test,lr_y_pred)))
print("The mean squared error of LinearRegression is {}".format(mean_squared_error(y_test,lr_y_pred)))
print("The mean absolute error of LinearRegression is {}".format(mean_absolute_error(y_test, lr_y_pred)))
#SGDRegressor
sgdr = SGDRegressor()
sgdr.fit(X_train, y_train)
sgdr_y_pred = sgdr.predict(X_test)
print("The value of default measurement of SGDRegressor is {sc}".format(sc=sgdr.score(X_test,y_test)))
print("The value of R-squared of SGDRegressor is {}".format(r2_score(y_test,sgdr_y_pred)))
print("The mean squared error of SGDRegressor is {}".format(mean_squared_error(y_test,sgdr_y_pred)))
print("The mean absolute error of SGDRegressor is {}".format(mean_absolute_error(y_test, sgdr_y_pred)))
--------------------
The value of default measurement of LinearRegression is 0.6763403830998702
The value of R-squared of LinearRegression is 0.6763403830998702
The mean squared error of LinearRegression is 0.2914340857699757
The mean absolute error of LinearRegression is 0.379976703912958
The value of default measurement of SGDRegressor is 0.6550953755633774
The value of R-squared of SGDRegressor is 0.6550953755633774
The mean squared error of SGDRegressor is 0.31056381041055237
The mean absolute error of SGDRegressor is 0.3787925618115339
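Note that the MSE/MAE logged above are on the standardized target scale, not in the original price units. R-squared is unchanged by standardization, but to report dollar-scale errors the predictions have to be mapped back through the target scaler. A minimal sketch of the round trip, using made-up stand-in values rather than the actual y_test:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic target values standing in for house prices (illustrative only;
# the notes above use the real y_train / y_test instead).
y_true = np.array([[21.0], [34.5], [13.2], [50.0]])
ss_y = StandardScaler()
y_scaled = ss_y.fit_transform(y_true)

# A model's predictions live in the standardized space; map them back
# with inverse_transform before computing price-scale MSE/MAE.
y_restored = ss_y.inverse_transform(y_scaled)
print(np.allclose(y_restored, y_true))  # True
```

This is why the MSE/MAE in the summary table at the end of these notes are much larger than the values logged here: same models, different target scale.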
Fit SVM regression models with three different kernel configurations and make predictions on the test data with each. After the RBF (Gaussian) kernel maps the features nonlinearly, the SVM shows the best regression performance of the three.
from sklearn.svm import SVR
#线性核函数
linear_svr = SVR(kernel='linear')
linear_svr.fit(X_train, y_train)
linear_svr_y_pred = linear_svr.predict(X_test)
print("The value of default measurement of linear_svr is {sc}".format(sc=linear_svr.score(X_test,y_test)))
print("The mean squared error of linear_svr is {}".format(mean_squared_error(y_test,linear_svr_y_pred)))
print("The mean absolute error of linear_svr is {}".format(mean_absolute_error(y_test, linear_svr_y_pred)))
#多项式核函数
poly_svr = SVR(kernel='poly')
poly_svr.fit(X_train, y_train)
poly_svr_y_pred = poly_svr.predict(X_test)
print("The value of default measurement of poly_svr is {sc}".format(sc=poly_svr.score(X_test,y_test)))
print("The mean squared error of poly_svr is {}".format(mean_squared_error(y_test,poly_svr_y_pred)))
print("The mean absolute error of poly_svr is {}".format(mean_absolute_error(y_test, poly_svr_y_pred)))
#高斯核函数
rbf_svr = SVR(kernel='rbf')
rbf_svr.fit(X_train, y_train)
rbf_svr_y_pred = rbf_svr.predict(X_test)
print("The value of default measurement of rbf_svr is {sc}".format(sc=rbf_svr.score(X_test,y_test)))
print("The mean squared error of rbf_svr is {}".format(mean_squared_error(y_test,rbf_svr_y_pred)))
print("The mean absolute error of rbf_svr is {}".format(mean_absolute_error(y_test, rbf_svr_y_pred)))
-----------------
output:
The value of default measurement of linear_svr is 0.651717097429608
The mean squared error of linear_svr is 0.31360572651000684
The mean absolute error of linear_svr is 0.3692598109626841
The value of default measurement of poly_svr is 0.40445405800289286
The mean squared error of poly_svr is 0.5362497453412647
The mean absolute error of poly_svr is 0.4043235900151712
The value of default measurement of rbf_svr is 0.7564068912273935
The mean squared error of rbf_svr is 0.21933948892028846
The mean absolute error of rbf_svr is 0.28099219092230115
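The three kernels above all run with default hyperparameters; for SVR, C and gamma usually matter as much as the kernel choice. A hedged sketch of a grid search over those two parameters (on synthetic stand-in data, since load_boston is unavailable in recent scikit-learn releases; the parameter grid values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the Boston data: 13 features, continuous target.
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=33)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
ss = StandardScaler()
X_train, X_test = ss.fit_transform(X_train), ss.transform(X_test)

# Search C (regularization strength) and gamma (RBF kernel width);
# both strongly affect how wiggly the fitted function is.
params = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), params, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```

The same grid-search pattern applies to the real Boston split above by swapping in the standardized X_train / y_train from the notes.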
KNN regression decides a test sample's target value from the target values of its K nearest training samples, using either an arithmetic mean (uniform weights) or a distance-weighted mean.
from sklearn.neighbors import KNeighborsRegressor
uni_knr = KNeighborsRegressor(weights='uniform')
uni_knr.fit(X_train, y_train)
uni_knr_y_predict =uni_knr.predict(X_test)
dis_knr = KNeighborsRegressor(weights='distance')
dis_knr.fit(X_train, y_train)
dis_knr_y_predict = dis_knr.predict(X_test)
print("R-squared value of uniform-weighted KNeighborsRegressor: {sc}".format(sc = uni_knr.score(X_test, y_test)))
print("The mean squared error of uniform-weighted KNeighborsRegressor: {}".format(mean_squared_error(y_test, uni_knr_y_predict)))
print("The mean absolute error of uniform-weighted KNeighborsRegressor: {}".format(mean_absolute_error(y_test, uni_knr_y_predict)))
print("R-squared value of distance-weighted KNeighborsRegressor: {sc}".format(sc = dis_knr.score(X_test, y_test)))
print("The mean squared error of distance-weighted KNeighborsRegressor: {}".format(mean_squared_error(y_test, dis_knr_y_predict)))
print("The mean absolute error of distance-weighted KNeighborsRegressor: {}".format(mean_absolute_error(y_test, dis_knr_y_predict)))
----------------
R-squared value of uniform-weighted KNeighborsRegressor: 0.6903454564606561
The mean squared error of uniform-weighted KNeighborsRegressor: 0.27882344317534674
The mean absolute error of uniform-weighted KNeighborsRegressor: 0.3198364056782287
R-squared value of distance-weighted KNeighborsRegressor: 0.7198
The mean squared error of distance-weighted KNeighborsRegressor: 0.25233849462662644
The mean absolute error of distance-weighted KNeighborsRegressor: 0.302274187769214
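The difference between the two weighting schemes shows up already on a toy example with three hypothetical neighbours (target values and distances made up for illustration; 'distance' weighting in scikit-learn means weights proportional to 1/distance):

```python
import numpy as np

# Three nearest neighbours of a query point: their target values
# and their distances to the query (illustrative numbers).
targets = np.array([10.0, 20.0, 40.0])
dists = np.array([1.0, 2.0, 4.0])

uniform_pred = targets.mean()                 # plain arithmetic mean
w = 1.0 / dists                               # closer neighbours weigh more
distance_pred = np.sum(w * targets) / w.sum() # weighted mean

print(round(uniform_pred, 2), round(distance_pred, 2))  # 23.33 17.14
```

The distance-weighted prediction is pulled toward the closest neighbour's value, which is why it tends to help when near neighbours are much more informative than far ones.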
The data at each node is treated as discrete groups rather than as a continuous function: a leaf returns the mean of the "clump" of samples that fall into it, not a specific continuous prediction.
Advantages: no need for feature standardization or a linearity assumption; the model captures nonlinear feature-target relationships and its rules are easy to interpret.
Disadvantages: prone to overfitting without constraints such as a maximum depth, and small changes in the data can produce very different trees.
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
dtr_y_predict = dtr.predict(X_test)
print("R-squared value of DecisionTreeRegressor: {sc}".format(sc = dtr.score(X_test, y_test)))
print("The mean squared error of DecisionTreeRegressor: {}".format(mean_squared_error(y_test, dtr_y_predict)))
print("The mean absolute error of DecisionTreeRegressor: {}".format(mean_absolute_error(y_test, dtr_y_predict)))
----------------
R-squared value of DecisionTreeRegressor: 0.6578394137883
The mean squared error of DecisionTreeRegressor: 0.30809298541527674
The mean absolute error of DecisionTreeRegressor: 0.3486517193537014
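One way to see the overfitting tendency mentioned above is to compare train and test scores of an unconstrained tree against a depth-limited one. A sketch on synthetic stand-in data (the depth values are illustrative, not tuned):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Boston data.
X, y = make_regression(n_samples=300, n_features=13, noise=15.0, random_state=33)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# The unconstrained tree memorises the training set (train R-squared = 1.0);
# capping max_depth trades training fit for better generalisation.
for depth in (None, 4):
    dtr = DecisionTreeRegressor(max_depth=depth, random_state=33)
    dtr.fit(X_train, y_train)
    print(depth,
          round(dtr.score(X_train, y_train), 3),
          round(dtr.score(X_test, y_test), 3))
```

In the notes above the tree runs with defaults, which is the unconstrained case.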
Many commercial system developers favor ensemble models and often use their performance as the benchmark against which newly designed models are compared. Ensembles take longer to train, but they usually deliver higher performance and better stability.
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
#RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(X_train,y_train)
rfr_y_predict = rfr.predict(X_test)
#ExtraTreesRegressor
etr = ExtraTreesRegressor()
etr.fit(X_train, y_train)
etr_y_predict = etr.predict(X_test)
#GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
gbr_y_predict = gbr.predict(X_test)
print("R-squared value of RandomForestRegressor: {sc}".format(sc = rfr.score(X_test, y_test)))
print("The mean squared error of RandomForestRegressor: {}".format(mean_squared_error(y_test, rfr_y_predict)))
print("The mean absolute error of RandomForestRegressor: {}".format(mean_absolute_error(y_test, rfr_y_predict)))
print()
print("R-squared value of ExtraTreesRegressor: {sc}".format(sc = etr.score(X_test, y_test)))
print("The mean squared error of ExtraTreesRegressor: {}".format(mean_squared_error(y_test, etr_y_predict)))
print("The mean absolute error of ExtraTreesRegressor: {}".format(mean_absolute_error(y_test, etr_y_predict)))
for name, importance in zip(boston.feature_names, etr.feature_importances_):
    print(name, importance)
print()
print("R-squared value of GradientBoostingRegressor: {sc}".format(sc = gbr.score(X_test, y_test)))
print("The mean squared error of GradientBoostingRegressor: {}".format(mean_squared_error(y_test, gbr_y_predict)))
print("The mean absolute error of GradientBoostingRegressor: {}".format(mean_absolute_error(y_test, gbr_y_predict)))
----------------
R-squared value of RandomForestRegressor: 0.8061848865443059
The mean squared error of RandomForestRegressor: 0.17451769528539432
The mean absolute error of RandomForestRegressor: 0.25866750217808665
R-squared value of ExtraTreesRegressor: 0.7884953230506744
The mean squared error of ExtraTreesRegressor: 0.19044597763897325
The mean absolute error of ExtraTreesRegressor: 0.26368218132478993
CRIM 0.02714547406437261
ZN 0.008844733766319734
INDUS 0.04703140761230033
CHAS 0.027736865042215297
NOX 0.041049521038692874
RM 0.2521879378214602
AGE 0.015563841006466943
DIS 0.030784973556192842
RAD 0.0066161352621591785
TAX 0.051284319483710275
PTRATIO 0.04986567775108296
B 0.018597944399002288
LSTAT 0.4232911691960245
R-squared value of GradientBoostingRegressor: 0.8447341376327575
The mean squared error of GradientBoostingRegressor: 0.13980664342270013
The mean absolute error of GradientBoostingRegressor: 0.24442389615991972
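The feature-importance listing above follows the dataset's column order; ranking it makes the dominant features (LSTAT and RM in the output above) stand out immediately. A sketch of sorting with np.argsort, on synthetic stand-in data with hypothetical feature names (the real notes would use boston.feature_names):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in data; "f0".."f4" are hypothetical column names.
X, y = make_regression(n_samples=200, n_features=5, random_state=33)
names = ["f0", "f1", "f2", "f3", "f4"]

etr = ExtraTreesRegressor(random_state=33)
etr.fit(X, y)

# argsort gives ascending order; reverse it to list the most
# important feature first.
order = np.argsort(etr.feature_importances_)[::-1]
for i in order:
    print(names[i], round(etr.feature_importances_[i], 4))
```

The importances always sum to 1, so the sorted list doubles as a rough "share of explained signal" per feature.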
(R-squared is unaffected by target standardization, so it matches the logs above; the MSE/MAE in this table are on the original house-price scale, whereas the logs report standardized values.)

Rank | Regressor | R-squared | MSE | MAE
---|---|---|---|---
1 | GradientBoostingRegressor | 0.8426 | 12.20 | 2.29
2 | ExtraTreesRegressor | 0.8195 | 13.99 | 2.36
3 | RandomForestRegressor | 0.8024 | 15.32 | 2.37
4 | SVM Regressor (RBF kernel) | 0.7564 | 18.89 | 2.61
5 | KNN Regressor (distance-weighted) | 0.7198 | 21.73 | 2.81
6 | DecisionTreeRegressor | 0.6941 | 23.72 | 3.14
7 | KNN Regressor (uniform-weighted) | 0.6903 | 24.01 | 2.97
8 | LinearRegression | 0.6763 | 25.10 | 3.53
9 | SGDRegressor | 0.6599 | 26.38 | 3.55
10 | SVM Regressor (linear kernel) | 0.6517 | 27.76 | 3.57
11 | SVM Regressor (poly kernel) | 0.4045 | 46.18 | 3.75