Find the description of the features in the dataset. Use the other variables in the dataset to build the best model for predicting the median home price.
The dataset contains 506 cases in total.
For each case, the dataset records 14 attributes:
Feature | Description |
---|---|
MedianHomePrice | Median home price (the response variable; same as MEDV below) |
CRIM | Per-capita crime rate by town |
ZN | Proportion of residential land zoned for lots over 25,000 sq. ft. |
INDUS | Proportion of non-retail business acres per town |
CHAS | Charles River dummy variable (1 if the tract bounds the river; 0 otherwise) |
NOX | Nitric oxide concentration (parts per 10 million) |
RM | Average number of rooms per dwelling |
AGE | Proportion of owner-occupied units built before 1940 |
DIS | Weighted distances to five Boston employment centers |
RAD | Index of accessibility to radial highways |
TAX | Full-value property-tax rate per $10,000 |
PTRATIO | Pupil-teacher ratio by town |
B | 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town |
LSTAT | Percentage of the population of lower socioeconomic status |
MEDV | Median value of owner-occupied homes in $1000s |
Set up the libraries and the data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from patsy import dmatrices
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(42)
# Load the built-in Boston dataset (for familiarity only; see the note after the preview below)
boston_data = load_boston()
df = pd.DataFrame()
df['MedianHomePrice'] = boston_data.target  # MEDV, the response variable
df2 = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
df = df.join(df2)
df.head()
MedianHomePrice | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
2 | 34.7 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
3 | 33.4 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
4 | 36.2 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
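Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, partly over ethical concerns about the B feature. On newer versions, scikit-learn's deprecation notice suggests loading the raw data from the original source instead; a sketch along those lines (URL and parsing taken from that notice):
# Fetch the raw Boston data and split it into features and target (MEDV)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 features
target = raw_df.values[1::2, 2]  # MEDV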
Use the corr method to compute the pairwise correlations between the variables and check for multicollinearity.
# Plot a correlation heatmap
import seaborn as sns
plt.subplots(figsize=(10,10))  # enlarge the figure
sns.heatmap(df.corr(), annot = True, vmax = 1, square = True, cmap='RdPu')
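The heatmap can also be summarized programmatically. A minimal sketch that lists the most correlated feature pairs (the 0.7 cutoff is an assumed threshold, not from the original):
# Keep the upper triangle of the absolute correlation matrix and rank the pairs
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.7])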
Create a training dataset and a test dataset, with 20% of the data in the test set. Store the results in X_train, X_test, y_train, y_test.
X = df.drop('MedianHomePrice', axis=1)
y = df['MedianHomePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Use StandardScaler to scale all the x variables in the dataset. Store the result in X_scaled_train.
# Reset y_train's index to start from 0: its original index no longer matches the
# 0-based row order of the scaled array below, so joining on it would misalign rows
y_train = pd.Series(y_train.values)
# Use StandardScaler to scale all the x variables; store the result in X_scaled_train
scaler = StandardScaler()
X_scaled_train = scaler.fit_transform(X_train)
# Create a pandas DataFrame holding the scaled x variables plus y_train; name it training_data
training_data = pd.DataFrame(X_scaled_train, columns=X_train.columns)
training_data['MedianHomePrice'] = y_train
training_data.head()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MedianHomePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.287702 | -0.500320 | 1.033237 | -0.278089 | 0.489252 | -1.428069 | 1.028015 | -0.802173 | 1.706891 | 1.578434 | 0.845343 | -0.074337 | 1.753505 | 12.0 |
1 | -0.336384 | -0.500320 | -0.413160 | -0.278089 | -0.157233 | -0.680087 | -0.431199 | 0.324349 | -0.624360 | -0.584648 | 1.204741 | 0.430184 | -0.561474 | 19.9 |
2 | -0.403253 | 1.013271 | -0.715218 | -0.278089 | -1.008723 | -0.402063 | -1.618599 | 1.330697 | -0.974048 | -0.602724 | -0.637176 | 0.065297 | -0.651595 | 19.4 |
3 | 0.388230 | -0.500320 | 1.033237 | -0.278089 | 0.489252 | -0.300450 | 0.591681 | -0.839240 | 1.706891 | 1.578434 | 0.845343 | -3.868193 | 1.525387 | 13.4 |
4 | -0.325282 | -0.500320 | -0.413160 | -0.278089 | -0.157233 | -0.831094 | 0.033747 | -0.005494 | -0.624360 | -0.584648 | 1.204741 | 0.379119 | -0.165787 | 18.2 |
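One caveat: the scaler is fit on the training data only, which is the right practice. The scoring section at the end refits on unscaled features, so scaled test data is never needed here; but if it were, the already-fitted scaler should be reused, as in this one-line sketch:
# Transform (do not re-fit) the test features with the training-set scaler
X_scaled_test = scaler.transform(X_test)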
Fit a linear model on the training set training_data and inspect the p-values to judge significance.
# Fit a linear model with all the scaled features to predict the response
# (median home price). Don't forget to add an intercept.
training_data['intercept'] = 1
X_train1 = training_data.drop('MedianHomePrice', axis=1)
lm = sm.OLS(training_data['MedianHomePrice'], X_train1)
result = lm.fit()
result.summary()
Dep. Variable: | MedianHomePrice | R-squared: | 0.751 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.743 |
Method: | Least Squares | F-statistic: | 90.43 |
Date: | Sun, 10 May 2020 | Prob (F-statistic): | 6.21e-109 |
Time: | 20:22:27 | Log-Likelihood: | -1194.3 |
No. Observations: | 404 | AIC: | 2417. |
Df Residuals: | 390 | BIC: | 2473. |
Df Model: | 13 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
CRIM | -1.0021 | 0.308 | -3.250 | 0.001 | -1.608 | -0.396 |
ZN | 0.6963 | 0.370 | 1.882 | 0.061 | -0.031 | 1.423 |
INDUS | 0.2781 | 0.464 | 0.599 | 0.549 | -0.634 | 1.190 |
CHAS | 0.7187 | 0.247 | 2.914 | 0.004 | 0.234 | 1.204 |
NOX | -2.0223 | 0.498 | -4.061 | 0.000 | -3.001 | -1.043 |
RM | 3.1452 | 0.329 | 9.567 | 0.000 | 2.499 | 3.792 |
AGE | -0.1760 | 0.407 | -0.432 | 0.666 | -0.977 | 0.625 |
DIS | -3.0819 | 0.481 | -6.408 | 0.000 | -4.027 | -2.136 |
RAD | 2.2514 | 0.652 | 3.454 | 0.001 | 0.970 | 3.533 |
TAX | -1.7670 | 0.704 | -2.508 | 0.013 | -3.152 | -0.382 |
PTRATIO | -2.0378 | 0.321 | -6.357 | 0.000 | -2.668 | -1.408 |
B | 1.1296 | 0.271 | 4.166 | 0.000 | 0.596 | 1.663 |
LSTAT | -3.6117 | 0.395 | -9.133 | 0.000 | -4.389 | -2.834 |
intercept | 22.7965 | 0.236 | 96.774 | 0.000 | 22.333 | 23.260 |
Omnibus: | 133.052 | Durbin-Watson: | 2.114 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 579.817 |
Skew: | 1.379 | Prob(JB): | 1.24e-126 |
Kurtosis: | 8.181 | Cond. No. | 9.74 |
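Rather than reading significance off the table by eye, the p-values can also be pulled straight from the fitted results object; a small sketch using the pvalues attribute of the statsmodels results:
# Features whose p-value exceeds 0.05 in the full model (expect ZN, INDUS, AGE)
print(result.pvalues[result.pvalues > 0.05])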
Compute the VIFs on the training set.
# Compute the VIF for each x variable in the dataset
def vif_calculator(df, response):
    '''
    INPUT:
    df - DataFrame holding the x variables and the response
    response - string, the column name of the response variable
    OUTPUT:
    vif - a DataFrame of the VIFs
    '''
    df2 = df.drop(response, axis=1)  # drop the response column
    features = "+".join(df2.columns)
    y, X = dmatrices(response + ' ~ ' + features, df, return_type='dataframe')
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns
    vif = vif.round(1)
    return vif
vif = vif_calculator(training_data, 'MedianHomePrice')
vif
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py:1685: RuntimeWarning: divide by zero encountered in double_scalars
return 1 - self.ssr/self.centered_tss
This warning is expected: training_data still carries the hand-added 'intercept' column, and dmatrices adds its own 'Intercept', so the design matrix holds two constant columns. A constant has zero centered variance, which makes the auxiliary R-squared calculation divide by zero; both constant rows in the table below can be ignored.
VIF Factor | features | |
---|---|---|
0 | 0.0 | Intercept |
1 | 1.7 | CRIM |
2 | 2.5 | ZN |
3 | 3.9 | INDUS |
4 | 1.1 | CHAS |
5 | 4.5 | NOX |
6 | 1.9 | RM |
7 | 3.0 | AGE |
8 | 4.2 | DIS |
9 | 7.7 | RAD |
10 | 8.9 | TAX |
11 | 1.9 | PTRATIO |
12 | 1.3 | B |
13 | 2.8 | LSTAT |
14 | 0.0 | intercept |
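As a sanity check, a VIF is just 1/(1 - R²), where R² comes from regressing one feature on all of the others. A minimal sketch for TAX (the feature choice is purely illustrative):
# Regress TAX on the remaining columns of the design matrix and recover its VIF by hand
aux = sm.OLS(X_train1['TAX'], X_train1.drop('TAX', axis=1)).fit()
print(1 / (1 - aux.rsquared))  # should land near the 8.9 reported above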
Combining the VIFs, the correlations, and the p-values, decide which variables to drop:
Cap the VIFs at 4. NOX (4.5), DIS (4.2), RAD (7.7), and TAX (8.9) exceed this cap, and INDUS (3.9) is borderline.
TAX and RAD are strongly correlated, and so are INDUS and NOX, so dropping just one variable from each highly correlated group is enough to shrink the other's VIF. (DIS is strongly correlated with NOX and AGE, so removing those should bring DIS under the cap as well.)
Cap the p-values at 0.05. AGE (0.666) and INDUS (0.549) are clearly not significant.
Based on the p-values and the VIFs, choosing to keep RAD and INDUS means dropping AGE, NOX, and TAX. After removing these features, fit a new linear model with the remaining ones.
X_train1 = training_data.drop(['AGE', 'NOX', 'TAX', 'MedianHomePrice'], axis=1)
lm1 = sm.OLS(training_data['MedianHomePrice'], X_train1)
result1 = lm1.fit()
result1.summary()
Dep. Variable: | MedianHomePrice | R-squared: | 0.733 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.727 |
Method: | Least Squares | F-statistic: | 108.1 |
Date: | Sun, 10 May 2020 | Prob (F-statistic): | 2.77e-106 |
Time: | 21:02:41 | Log-Likelihood: | -1208.0 |
No. Observations: | 404 | AIC: | 2438. |
Df Residuals: | 393 | BIC: | 2482. |
Df Model: | 10 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
CRIM | -0.9116 | 0.317 | -2.876 | 0.004 | -1.535 | -0.289 |
ZN | 0.5622 | 0.363 | 1.548 | 0.123 | -0.152 | 1.276 |
INDUS | -0.8746 | 0.411 | -2.128 | 0.034 | -1.683 | -0.067 |
CHAS | 0.6896 | 0.252 | 2.738 | 0.006 | 0.194 | 1.185 |
RM | 3.2406 | 0.330 | 9.818 | 0.000 | 2.592 | 3.889 |
DIS | -2.1728 | 0.434 | -5.010 | 0.000 | -3.025 | -1.320 |
RAD | 0.4380 | 0.389 | 1.126 | 0.261 | -0.327 | 1.202 |
PTRATIO | -1.6369 | 0.310 | -5.288 | 0.000 | -2.246 | -1.028 |
B | 1.2106 | 0.279 | 4.345 | 0.000 | 0.663 | 1.758 |
LSTAT | -3.9851 | 0.381 | -10.470 | 0.000 | -4.733 | -3.237 |
intercept | 22.7965 | 0.243 | 93.916 | 0.000 | 22.319 | 23.274 |
Omnibus: | 126.568 | Durbin-Watson: | 2.033 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 542.197 |
Skew: | 1.310 | Prob(JB): | 1.83e-118 |
Kurtosis: | 8.034 | Cond. No. | 4.66 |
Based on the p-values, RAD (p = 0.261) should now be removed; keep the other variables. (ZN, at p = 0.123, is above 0.05 as well, but it is kept in this walkthrough.)
X_train2 = training_data.drop(['AGE', 'NOX', 'TAX', 'RAD', 'MedianHomePrice'], axis=1)
lm2 = sm.OLS(training_data['MedianHomePrice'], X_train2)
result2 = lm2.fit()
result2.summary()
Dep. Variable: | MedianHomePrice | R-squared: | 0.733 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.726 |
Method: | Least Squares | F-statistic: | 119.9 |
Date: | Sun, 10 May 2020 | Prob (F-statistic): | 4.60e-107 |
Time: | 21:02:09 | Log-Likelihood: | -1208.6 |
No. Observations: | 404 | AIC: | 2437. |
Df Residuals: | 394 | BIC: | 2477. |
Df Model: | 9 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
CRIM | -0.7616 | 0.288 | -2.647 | 0.008 | -1.327 | -0.196 |
ZN | 0.6151 | 0.360 | 1.707 | 0.089 | -0.093 | 1.323 |
INDUS | -0.7544 | 0.397 | -1.900 | 0.058 | -1.535 | 0.026 |
CHAS | 0.7067 | 0.252 | 2.810 | 0.005 | 0.212 | 1.201 |
RM | 3.3022 | 0.326 | 10.142 | 0.000 | 2.662 | 3.942 |
DIS | -2.2235 | 0.432 | -5.153 | 0.000 | -3.072 | -1.375 |
PTRATIO | -1.5090 | 0.288 | -5.239 | 0.000 | -2.075 | -0.943 |
B | 1.1502 | 0.273 | 4.206 | 0.000 | 0.613 | 1.688 |
LSTAT | -3.9413 | 0.379 | -10.406 | 0.000 | -4.686 | -3.197 |
intercept | 22.7965 | 0.243 | 93.884 | 0.000 | 22.319 | 23.274 |
Omnibus: | 134.948 | Durbin-Watson: | 2.028 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 619.161 |
Skew: | 1.381 | Prob(JB): | 3.56e-135 |
Kurtosis: | 8.399 | Cond. No. | 4.36 |
Double-check that all the VIFs are now below 4. Compared with the previous model, the R-squared value is essentially unchanged (0.733 for both).
training_data2 = training_data.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1)
vif = vif_calculator(training_data2, 'MedianHomePrice')
vif
VIF Factor | features | |
---|---|---|
0 | 0.0 | Intercept |
1 | 1.4 | CRIM |
2 | 2.2 | ZN |
3 | 2.7 | INDUS |
4 | 1.1 | CHAS |
5 | 1.8 | RM |
6 | 3.2 | DIS |
7 | 1.4 | PTRATIO |
8 | 1.3 | B |
9 | 2.4 | LSTAT |
10 | 0.0 | intercept |
Score how well each model's predictions on the test set match the actual test values.
# Model with all the variables
lm_full = LinearRegression()
lm_full.fit(X_train, y_train)
lm_full.score(X_test, y_test)  # test-set R²
0.66848257539715972
# Drop AGE, NOX, TAX
X_train_red = X_train.drop(['AGE', 'NOX', 'TAX'], axis=1)
X_test_red = X_test.drop(['AGE', 'NOX', 'TAX'], axis=1)
# Drop AGE, NOX, TAX, RAD
X_train_red2 = X_train.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1)
X_test_red2 = X_test.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1)
lm_red = LinearRegression()  # model without AGE, NOX, TAX
lm_red.fit(X_train_red, y_train)
print(lm_red.score(X_test_red, y_test))  # test-set R²
lm_red2 = LinearRegression()  # model without AGE, NOX, TAX, RAD
lm_red2.fit(X_train_red2, y_train)
print(lm_red2.score(X_test_red2, y_test))  # test-set R²
0.639421781821
0.63441065636
Judging by these scores, the model with all the variables performs best on this particular test set (0.668 versus 0.639 and 0.634). A natural follow-up is cross-validation, i.e. repeating the fit-and-score procedure over multiple train/test splits, to check whether that ranking is stable; see the sketch below.
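A minimal cross-validation sketch (the 5-fold choice is an assumption, not part of the original analysis):
from sklearn.model_selection import cross_val_score
# Mean 5-fold cross-validated R² for the full and the two reduced feature sets
for name, feats in [('all features', X),
                    ('drop AGE/NOX/TAX', X.drop(['AGE', 'NOX', 'TAX'], axis=1)),
                    ('drop AGE/NOX/TAX/RAD', X.drop(['AGE', 'NOX', 'TAX', 'RAD'], axis=1))]:
    scores = cross_val_score(LinearRegression(), feats, y, cv=5, scoring='r2')
    print(name, round(scores.mean(), 3))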