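After finishing the first three weeks of tasks, each of the six members of our group had compiled a rental-listing dataset for one city. After screening out and deleting the records with anomalous data, we ended up with a combined dataset of 9,080 rental listings covering six Chinese cities (Tianjin, Xianyang, Xi'an, Baoji, Shenzhen, Beijing).
The next task is to predict the monthly rent using models provided by the sklearn library. I chose five models: k-NN, ridge regression, linear regression, decision tree, and random forest.
The k-NN code is as follows: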
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
# Load the dataset
dataset = pd.read_csv('./datasets/总数据集.csv', encoding='gbk')
# Split into training and test sets with train_test_split(); '月租' (monthly rent) is the target column
X_train, X_test, y_train, y_test = train_test_split(np.array(dataset.drop(columns=['月租'])), np.array(dataset['月租']),
                                                    test_size=0.2, random_state=22)
print(y_test)
# Build the k-NN model
knn = KNeighborsRegressor(n_neighbors=4)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(y_pred)
print('Training set score: {:.2f}'.format(knn.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(knn.score(X_test, y_test)))
The output is as follows:
[2000 1850 1300 ... 950 1500 1010]
[1175. 1451.25 1175. ... 939. 1225. 1102.5 ]
Training set score: 0.77
Test set score: 0.67
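For scikit-learn regressors, `score()` returns the coefficient of determination R², that is, 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it by hand to show what the two printed scores mean. It uses a synthetic dataset from `make_regression` as a stand-in for the rental CSV (an assumption made only so the example runs on its own), so its numbers are unrelated to the results above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the rental dataset (assumption: not the real CSV)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

knn = KNeighborsRegressor(n_neighbors=4).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print('Manual R^2: {:.4f}'.format(r2_manual))
print('knn.score:  {:.4f}'.format(knn.score(X_test, y_test)))
```

The two printed values should agree. A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of the targets.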
The ridge regression code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
# Load the dataset
dataset = pd.read_csv('./datasets/总数据集.csv', encoding='gbk')
# Split into training and test sets with train_test_split(); '月租' (monthly rent) is the target column
X_train, X_test, y_train, y_test = train_test_split(np.array(dataset.drop(columns=['月租'])), np.array(dataset['月租']),
                                                    test_size=0.25, random_state=22)
print(y_test)
# Build the ridge regression model
ridge = Ridge(alpha=10).fit(X_train, y_train)
y_pred = ridge.predict(X_test)
print(y_pred)
print('Training set score: {:.2f}'.format(ridge.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(ridge.score(X_test, y_test)))
The output is as follows:
[2000 1850 1300 ... 1600 650 600]
[1204.37018259 2163.5745905 1729.04638212 ... 2236.86333521 966.88410719
1158.33879739]
Training set score: 0.68
Test set score: 0.71
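The ridge model above uses a hand-picked `alpha=10`. One possible way to choose the regularization strength from the data is `RidgeCV`, which evaluates a list of candidate values by cross-validation. This is only a sketch: the candidate list `alphas` and the synthetic `make_regression` data are assumptions for illustration, not part of the original experiment.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rental dataset (assumption: not the real CSV)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=22)

# Candidate regularization strengths (hypothetical grid)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# RidgeCV fits the model for each alpha and keeps the best one
ridge_cv = RidgeCV(alphas=alphas).fit(X_train, y_train)

print('Selected alpha:', ridge_cv.alpha_)
print('Training set score: {:.2f}'.format(ridge_cv.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(ridge_cv.score(X_test, y_test)))
```

On the real data, the same pattern would replace `Ridge(alpha=10)` with `RidgeCV(alphas=...)` and leave the rest of the script unchanged.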
The linear regression code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load the dataset
dataset = pd.read_csv('./datasets/总数据集.csv', encoding='gbk')
# Split into training and test sets with train_test_split(); '月租' (monthly rent) is the target column
X_train, X_test, y_train, y_test = train_test_split(np.array(dataset.drop(columns=['月租'])), np.array(dataset['月租']),
                                                    test_size=0.25, random_state=22)
print(y_test)
# Build the linear regression model
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(y_pred)
print('Training set score: {:.2f}'.format(lr.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(lr.score(X_test, y_test)))
The output is as follows:
[2000 1850 1300 ... 1600 650 600]
[1204.08458918 2167.07002703 1729.80116253 ... 2235.10990609 964.65496297
1159.57735581]
Training set score: 0.68
Test set score: 0.71
The decision tree code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Load the dataset
dataset = pd.read_csv('./datasets/总数据集.csv', encoding='gbk')
# Split into training and test sets with train_test_split(); '月租' (monthly rent) is the target column
X_train, X_test, y_train, y_test = train_test_split(np.array(dataset.drop(columns=['月租'])), np.array(dataset['月租']),
                                                    test_size=0.25, random_state=22)
print(y_test)
# Build the decision tree model
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
y_pred = dtr.predict(X_test)
print(y_pred)
print('Training set score: {:.2f}'.format(dtr.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(dtr.score(X_test, y_test)))
The output is as follows:
[2000 1850 1300 ... 1600 650 600]
[ 900. 2205. 1200. ... 2000. 650. 600.]
Training set score: 1.00
Test set score: 0.91
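A training score of 1.00 next to a test score of 0.91 suggests that the unrestricted tree fits the training set exactly. One common way to explore this gap is to cap the tree depth and compare the scores. The sketch below does that on synthetic `make_regression` data; the data, `max_depth=5`, and the fixed `random_state` are all assumptions for illustration, and whether a shallower tree helps on the rental data would have to be checked there.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the rental dataset (assumption: not the real CSV)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=22)

# Fully grown tree (default settings) versus a depth-limited tree
full_tree = DecisionTreeRegressor(random_state=22).fit(X_train, y_train)
shallow_tree = DecisionTreeRegressor(max_depth=5, random_state=22).fit(X_train, y_train)

for name, model in [('full', full_tree), ('max_depth=5', shallow_tree)]:
    print('{} (depth {}): train {:.2f}, test {:.2f}'.format(
        name,
        model.get_depth(),
        model.score(X_train, y_train),
        model.score(X_test, y_test)))
```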
The random forest code is as follows:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
# Load the dataset
dataset = pd.read_csv('./datasets/总数据集.csv', encoding='gbk')
# Split into training and test sets with train_test_split(); '月租' (monthly rent) is the target column
X_train, X_test, y_train, y_test = train_test_split(np.array(dataset.drop(columns=['月租'])), np.array(dataset['月租']),
                                                    test_size=0.25, random_state=22)
print(y_test)
# Build the random forest model
rfr = RandomForestRegressor(n_estimators=200)
rfr.fit(X_train, y_train)
y_pred = rfr.predict(X_test)
print(y_pred)
print('Training set score: {:.2f}'.format(rfr.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(rfr.score(X_test, y_test)))
The output is as follows:
[2000 1850 1300 ... 1600 650 600]
[1064.5 2211.89 1233.25 ... 1319.5 775. 579.15]
Training set score: 0.99
Test set score: 0.95
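Each script above scores its model on a single train/test split, and the k-NN script uses a different `test_size` from the others. A hedged sketch of a more uniform comparison is to run all five models through `cross_val_score` on the same data. The synthetic `make_regression` data and the fixed `random_state` values are assumptions for illustration, so the ranking printed here says nothing about the rental dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the rental dataset (assumption: not the real CSV)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=22)

# The five models from this post, with the same hyperparameters where given
models = {
    'k-NN': KNeighborsRegressor(n_neighbors=4),
    'Ridge': Ridge(alpha=10),
    'Linear regression': LinearRegression(),
    'Decision tree': DecisionTreeRegressor(random_state=22),
    'Random forest': RandomForestRegressor(n_estimators=200, random_state=22),
}

cv_means = {}
for name, model in models.items():
    # 5-fold cross-validation; the default scoring for regressors is R^2
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print('{}: mean R^2 = {:.2f}'.format(name, scores.mean()))
```

To try this on the real data, `X` and `y` would be replaced by the feature matrix and the '月租' column loaded from the CSV as in the scripts above.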