House-price prediction is a classic introductory example in many machine learning courses. A house's price is influenced by many factors, such as floor area and number of bedrooms, so is there an equation that quantitatively relates these factors to the price? That is exactly the problem machine learning tries to solve: finding the best-fitting function. This experiment uses the Boston housing dataset; for details, see sklearn's datasets module. Since price is a continuous value, predicting it is a regression problem (not logistic regression, which predicts class probabilities). With Python's powerful machine learning libraries we can easily build a model, so how do different models perform on this problem? Below is a simple test.
from sklearn.datasets import load_boston
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

# Load features and target, standardize the features,
# then hold out 30% of the samples for testing.
# Note: load_boston was removed in scikit-learn >= 1.2.
X, y = load_boston(return_X_y=True)
X = scale(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(X_train[0], y_train[0])
[ 0.67497414 -0.48772236 1.01599907 -0.27259857 1.60072524 -0.93690454
0.900575 -0.94020538 1.66124525 1.53092646 0.80657583 0.44105193
1.43354842] 12.8
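One caveat about the preprocessing above: `scale` standardizes using statistics computed over the whole dataset before the split, so information from the test rows leaks into training. A leak-free variant fits the scaler on the training split only; here is a sketch using `StandardScaler` on synthetic stand-in data (the random data and seed are illustrative, not the Boston dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same shape as the Boston features
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
y = rng.normal(loc=22.0, scale=9.0, size=506)

# Split first, then fit the scaler on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)   # statistics from training rows only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)        # test set reuses training statistics

print(X_train.mean(axis=0).round(6))     # ~0 for every feature
```

With so few samples the leakage barely changes the numbers here, but the split-then-scale order is the habit worth keeping.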
None of the models in this section has been tuned; every hyperparameter is left at its default value.
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# Lasso: linear regression with an L1 penalty (default alpha=1.0)
lasso = Lasso()
y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_score_lasso = r2_score(y_test, y_pred_lasso)
print(r2_score(y_train, lasso.predict(X_train)))  # R^2 on the training set
print(r2_score_lasso)                             # R^2 on the test set
0.6558542290928164
0.7063382166531593
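For reference, the score reported throughout is R², the coefficient of determination: R² = 1 − SS_res/SS_tot, the fraction of target variance the model explains (1.0 is perfect, 0.0 is no better than predicting the mean). A quick sketch verifying the formula against `r2_score` on toy numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))       # both ~= 0.9486
```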
from sklearn.linear_model import ElasticNet

# ElasticNet: linear regression with a mixed L1/L2 penalty
enet = ElasticNet()
y_pred_enet = enet.fit(X_train, y_train).predict(X_test)
r2_score_enet = r2_score(y_test, y_pred_enet)
print(r2_score(y_train, enet.predict(X_train)))  # training R^2
print(r2_score_enet)                             # test R^2
0.6319495878879611
0.6989301886731756
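ElasticNet differs from Lasso by mixing an L2 penalty into the L1 penalty (by default half and half, via l1_ratio=0.5), which keeps some of Lasso's feature-selection behavior while handling correlated features more gracefully. A sketch on synthetic data (hypothetical shapes and seed) showing that both drive uninformative coefficients to exactly zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, ElasticNet

# 20 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

lasso = Lasso().fit(X, y)      # pure L1 penalty
enet = ElasticNet().fit(X, y)  # L1/L2 mix with l1_ratio=0.5

print("zeroed by Lasso:", int(np.sum(lasso.coef_ == 0)))
print("zeroed by ElasticNet:", int(np.sum(enet.coef_ == 0)))
```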
from sklearn import svm

# Support vector regression with the default RBF kernel
svr = svm.SVR()
svr.fit(X_train, y_train)
y_pred_svr = svr.predict(X_test)
print(r2_score(y_train, svr.predict(X_train)))  # training R^2
print(r2_score(y_test, y_pred_svr))             # test R^2
0.6430295360107772
0.6908834297202064
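SVR's defaults (RBF kernel, C=1.0) leave a lot on the table; its main knobs are the regularization strength C and the kernel width gamma. A grid-search sketch on synthetic data follows; the parameter grid and data are illustrative, not tuned for the Boston problem:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=13, noise=5.0, random_state=0)

# Cross-validated search over C and the RBF kernel width gamma
grid = GridSearchCV(SVR(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10, 100],
                                'gamma': ['scale', 0.01, 0.1]},
                    scoring='r2', cv=5)
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```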
The following Keras model is deliberately "shallow and narrow". The power of deep models comes precisely from depth: each layer can learn features at a different level of abstraction, and more layers mean a richer feature hierarchy. A shallow, narrow model is used here because the dataset is so small that a deeper model would be hard to train and would easily overfit.
from keras.models import Sequential
from keras.layers import Dense

# Two hidden layers of 13 units each; the ReLU on the output layer
# works here only because prices are non-negative (a linear output
# is the more common choice for regression)
seq = Sequential()
seq.add(Dense(13, activation='relu', input_dim=X.shape[1]))
seq.add(Dense(13, activation='relu'))
seq.add(Dense(1, activation='relu'))
seq.compile(loss='mse', optimizer='adam')
seq.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=200, batch_size=10)
print(r2_score(y_test, seq.predict(X_test)))
Epoch 195/200
354/354 [==============================] - 0s 167us/step - loss: 8.2531 - val_loss: 7.7015
Epoch 196/200
354/354 [==============================] - 0s 156us/step - loss: 8.0658 - val_loss: 7.7211
Epoch 197/200
354/354 [==============================] - 0s 167us/step - loss: 8.0975 - val_loss: 7.6756
Epoch 198/200
354/354 [==============================] - 0s 164us/step - loss: 8.0590 - val_loss: 7.6672
Epoch 199/200
354/354 [==============================] - 0s 164us/step - loss: 8.0256 - val_loss: 7.7370
Epoch 200/200
354/354 [==============================] - 0s 173us/step - loss: 8.0272 - val_loss: 7.6881
0.905115228567158
As the results show, the deep model outperforms the others. To be fair, though, the deep model is the product of several rounds of trial-and-error hyperparameter tuning, while the other models received no tuning at all, so this comparison should be taken as indicative only.
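A single random train/test split can also flatter or punish any model by luck; a more robust comparison averages R² over several folds. A cross-validation sketch for the sklearn models on synthetic stand-in data (the Keras model would need a scikit-learn wrapper, omitted here):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the scaled Boston features
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=0)

for name, model in [('LASSO', Lasso()), ('ENET', ElasticNet()), ('SVR', SVR())]:
    scores = cross_val_score(model, X, y, scoring='r2', cv=5)
    print(f"{name}: mean R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```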
import pandas as pd

# Put the first 20 test-set predictions from each model side by side
# (真实价格 = actual price)
df = pd.DataFrame(columns=['真实价格', 'DNN', 'SVR', 'ENET', 'LASSO'])
df['真实价格'] = y_test[:20]
df['DNN'] = seq.predict(X_test[:20]).reshape(20,)
df['SVR'] = svr.predict(X_test[:20]).reshape(20,)
df['ENET'] = enet.predict(X_test[:20]).reshape(20,)
df['LASSO'] = lasso.predict(X_test[:20]).reshape(20,)
print(df)
真实价格 DNN SVR ENET LASSO
0 22.0 18.252920 21.089813 23.385601 23.312081
1 28.7 25.008955 23.710673 26.422968 26.891553
2 13.1 16.204874 15.304707 17.510302 15.982067
3 22.5 18.993956 17.257705 19.499183 17.915353
4 20.0 21.143742 21.106587 22.952084 22.742153
5 42.8 45.877789 28.505149 32.595058 34.055575
6 17.5 18.026285 17.776068 19.260607 18.383730
7 14.5 16.182840 15.307121 17.582478 15.724255
8 8.4 10.724962 12.276034 9.835185 7.893643
9 50.0 48.813076 35.115328 34.616077 35.574188
10 27.5 34.224918 19.906433 13.545103 14.658437
11 14.9 14.568207 15.160364 18.622336 18.828576
12 14.5 17.585505 17.397113 20.635838 20.146342
13 17.8 16.192526 16.716685 18.274871 17.565880
14 18.9 20.941542 20.661864 22.726536 22.514724
15 33.8 34.242413 31.564452 30.224818 30.590694
16 23.0 28.125729 24.612038 26.795441 25.767934
17 10.5 7.609363 12.236361 15.076118 14.950944
18 50.0 44.337326 30.579195 32.720842 35.185425
19 23.1 21.985710 23.457522 24.123978 24.204844
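The table makes the pattern behind the R² gap visible: the linear models and SVR pull extreme prices (rows 5, 9, 18) toward the middle, while the DNN tracks them. This can be quantified with the mean absolute error over these 20 samples; the sketch below hard-codes the values printed above:

```python
import numpy as np

# Columns: actual price, DNN, SVR, ENET, LASSO (values from the table above)
table = np.array([
    [22.0, 18.252920, 21.089813, 23.385601, 23.312081],
    [28.7, 25.008955, 23.710673, 26.422968, 26.891553],
    [13.1, 16.204874, 15.304707, 17.510302, 15.982067],
    [22.5, 18.993956, 17.257705, 19.499183, 17.915353],
    [20.0, 21.143742, 21.106587, 22.952084, 22.742153],
    [42.8, 45.877789, 28.505149, 32.595058, 34.055575],
    [17.5, 18.026285, 17.776068, 19.260607, 18.383730],
    [14.5, 16.182840, 15.307121, 17.582478, 15.724255],
    [ 8.4, 10.724962, 12.276034,  9.835185,  7.893643],
    [50.0, 48.813076, 35.115328, 34.616077, 35.574188],
    [27.5, 34.224918, 19.906433, 13.545103, 14.658437],
    [14.9, 14.568207, 15.160364, 18.622336, 18.828576],
    [14.5, 17.585505, 17.397113, 20.635838, 20.146342],
    [17.8, 16.192526, 16.716685, 18.274871, 17.565880],
    [18.9, 20.941542, 20.661864, 22.726536, 22.514724],
    [33.8, 34.242413, 31.564452, 30.224818, 30.590694],
    [23.0, 28.125729, 24.612038, 26.795441, 25.767934],
    [10.5,  7.609363, 12.236361, 15.076118, 14.950944],
    [50.0, 44.337326, 30.579195, 32.720842, 35.185425],
    [23.1, 21.985710, 23.457522, 24.123978, 24.204844],
])

actual = table[:, 0]
maes = {}
for col, name in enumerate(['DNN', 'SVR', 'ENET', 'LASSO'], start=1):
    maes[name] = float(np.mean(np.abs(table[:, col] - actual)))
    print(f"{name}: MAE = {maes[name]:.2f}")
```

On this 20-sample slice the DNN's mean absolute error is roughly half that of the untuned baselines, consistent with the R² scores above.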