Multiple Linear Regression Exercise: Predicting Housing Prices

Goal:

Review the descriptions of the features in the dataset. Use the other variables in the dataset to build the best model for predicting the median home price.

Dataset description:

The dataset contains 506 cases in total.

Each case has 14 attributes:

Feature          Description
MedianHomePrice  Median home price (the response; identical to MEDV below)
CRIM             Per capita crime rate by town
ZN               Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS            Proportion of non-retail business acres per town
CHAS             Charles River dummy variable (1 if tract bounds the river; 0 otherwise)
NOX              Nitric oxide concentration (parts per 10 million)
RM               Average number of rooms per dwelling
AGE              Proportion of owner-occupied units built before 1940
DIS              Weighted distances to five Boston employment centers
RAD              Index of accessibility to radial highways
TAX              Full-value property tax rate per $10,000
PTRATIO          Pupil-teacher ratio by town
B                1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT            Percentage of lower-status population
MEDV             Median value of owner-occupied homes (in $1000s)

Set up the libraries and data.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from patsy import dmatrices
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

# Load the built-in Boston dataset; this part is just for reference
boston_data = load_boston()
df = pd.DataFrame()
df['MedianHomePrice'] = boston_data.target
df2 = pd.DataFrame(boston_data.data)
df2.columns = boston_data.feature_names
df = df.join(df2)
df.head()
MedianHomePrice CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 24.0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 21.6 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 34.7 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 33.4 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 36.2 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

1. Summarize each feature in the dataset
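
Before examining correlations, it helps to get a quick per-feature summary. A minimal sketch (pandas' describe reports the count, mean, standard deviation, and quartiles of every column):

# Summary statistics for each feature
df.describe()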

Use the corr method to compute the correlations between the variables and check for multicollinearity.

# Plot a heatmap of the correlation matrix
import seaborn as sns
plt.subplots(figsize=(10,10))  # adjust the figure size
sns.heatmap(df.corr(), annot = True, vmax = 1, square = True, cmap='RdPu')

[Figure 1: correlation heatmap of all variables]
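
As a complement to the heatmap, a small sketch that lists the most strongly correlated feature pairs numerically (the top-10 listing is an illustrative choice, not from the original analysis):

# Stack the upper triangle of the correlation matrix and sort by absolute value
corr_abs = df.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(10))  # RAD/TAX and NOX/INDUS rank near the top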

2. Split the dataset

Create a training set and a test set, with 20% of the data in the test set. Store the results in X_train, X_test, y_train, y_test.

X = df.drop('MedianHomePrice' , axis=1, inplace=False)
y = df['MedianHomePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42 )
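
A quick sanity check on the split sizes (506 cases give 404 for training and 102 for testing, matching the 404 observations reported in the OLS summaries below):

# Confirm the 80/20 split
print(X_train.shape, X_test.shape)  # (404, 13) (102, 13)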

3. Standardization

Use StandardScaler to scale all of the x variables in the dataset. Store the result in X_scaled_train.

# Reset y_train's index to start from 0; its original (shuffled) index would
# not align with training_data below, so the join would go wrong
y_train = pd.Series(y_train.values)

# Scale all x variables with StandardScaler and store the result in X_scaled_train
scaler = StandardScaler()
X_scaled_train = scaler.fit_transform(X_train)

# Create a pandas DataFrame holding the scaled x variables plus y_train; name it training_data
training_data = pd.DataFrame(X_scaled_train, columns = X_train.columns)

training_data['MedianHomePrice'] = y_train
training_data.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MedianHomePrice
0 1.287702 -0.500320 1.033237 -0.278089 0.489252 -1.428069 1.028015 -0.802173 1.706891 1.578434 0.845343 -0.074337 1.753505 12.0
1 -0.336384 -0.500320 -0.413160 -0.278089 -0.157233 -0.680087 -0.431199 0.324349 -0.624360 -0.584648 1.204741 0.430184 -0.561474 19.9
2 -0.403253 1.013271 -0.715218 -0.278089 -1.008723 -0.402063 -1.618599 1.330697 -0.974048 -0.602724 -0.637176 0.065297 -0.651595 19.4
3 0.388230 -0.500320 1.033237 -0.278089 0.489252 -0.300450 0.591681 -0.839240 1.706891 1.578434 0.845343 -3.868193 1.525387 13.4
4 -0.325282 -0.500320 -0.413160 -0.278089 -0.157233 -0.831094 0.033747 -0.005494 -0.624360 -0.584648 1.204741 0.379119 -0.165787 18.2
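
Note that the scaler was fit on the training data only. If the scaled-model coefficients were used for prediction later, the same fitted scaler would have to transform the test set. A sketch (X_scaled_test is not needed below, since step 8 refits sklearn models on the unscaled data):

# Express the test features in the same standardized units as the training set
X_scaled_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)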

4. Model 1: all features

Fit a linear model on the training set training_data and check the p-values to judge significance.

# Fit a linear model with all the scaled features to predict the response (median home price). Don't forget to add an intercept.
training_data['intercept'] = 1
X_train1= training_data.drop('MedianHomePrice' , axis=1, inplace=False)
lm = sm.OLS(training_data['MedianHomePrice'], X_train1)
result = lm.fit()
result.summary()
OLS Regression Results
Dep. Variable: MedianHomePrice R-squared: 0.751
Model: OLS Adj. R-squared: 0.743
Method: Least Squares F-statistic: 90.43
Date: Sun, 10 May 2020 Prob (F-statistic): 6.21e-109
Time: 20:22:27 Log-Likelihood: -1194.3
No. Observations: 404 AIC: 2417.
Df Residuals: 390 BIC: 2473.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
CRIM -1.0021 0.308 -3.250 0.001 -1.608 -0.396
ZN 0.6963 0.370 1.882 0.061 -0.031 1.423
INDUS 0.2781 0.464 0.599 0.549 -0.634 1.190
CHAS 0.7187 0.247 2.914 0.004 0.234 1.204
NOX -2.0223 0.498 -4.061 0.000 -3.001 -1.043
RM 3.1452 0.329 9.567 0.000 2.499 3.792
AGE -0.1760 0.407 -0.432 0.666 -0.977 0.625
DIS -3.0819 0.481 -6.408 0.000 -4.027 -2.136
RAD 2.2514 0.652 3.454 0.001 0.970 3.533
TAX -1.7670 0.704 -2.508 0.013 -3.152 -0.382
PTRATIO -2.0378 0.321 -6.357 0.000 -2.668 -1.408
B 1.1296 0.271 4.166 0.000 0.596 1.663
LSTAT -3.6117 0.395 -9.133 0.000 -4.389 -2.834
intercept 22.7965 0.236 96.774 0.000 22.333 23.260
Omnibus: 133.052 Durbin-Watson: 2.114
Prob(Omnibus): 0.000 Jarque-Bera (JB): 579.817
Skew: 1.379 Prob(JB): 1.24e-126
Kurtosis: 8.181 Cond. No. 9.74


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
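
For convenience, the non-significant coefficients can also be pulled out programmatically (a sketch; result.pvalues is the statsmodels attribute holding the p-values shown above):

# List coefficients with p-values above 0.05 -- candidates for removal
pvals = result.pvalues.drop('intercept')
print(pvals[pvals > 0.05].sort_values(ascending=False))  # AGE, INDUS, ZN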

5. Check whether the explanatory variables are correlated with one another:

Compute the VIF for each variable on the training set.
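
For each predictor x_i, the variance inflation factor is VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared from regressing x_i on all of the other predictors; this exercise treats VIFs above 4 as a sign of problematic multicollinearity.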

# Compute the VIF for each x variable in the dataset
def vif_calculator(df, response):
    '''
    INPUT:
    df - DataFrame containing the x variables and the response
    response - column name of the response variable (string)
    OUTPUT:
    vif - a DataFrame of the VIFs
    '''
    df2 = df.drop(response, axis = 1, inplace=False)  # drop the response column
    features = "+".join(df2.columns)
    y, X = dmatrices(response + ' ~ ' + features, df, return_type='dataframe')
    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns
    vif = vif.round(1)
    return vif
vif = vif_calculator(training_data, 'MedianHomePrice')
vif
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\regression\linear_model.py:1685: RuntimeWarning: divide by zero encountered in double_scalars
  return 1 - self.ssr/self.centered_tss
VIF Factor features
0 0.0 Intercept
1 1.7 CRIM
2 2.5 ZN
3 3.9 INDUS
4 1.1 CHAS
5 4.5 NOX
6 1.9 RM
7 3.0 AGE
8 4.2 DIS
9 7.7 RAD
10 8.9 TAX
11 1.9 PTRATIO
12 1.3 B
13 2.8 LSTAT
14 0.0 intercept

Combining the VIFs, the correlations, and the p-values, decide which variables to drop:

(The divide-by-zero warning above is harmless: it comes from computing a VIF for the constant intercept terms, whose centered sum of squares is zero; training_data's 'intercept' column duplicates the Intercept that patsy adds.)

We cap VIF at 4. INDUS, RAD, TAX, and NOX have the largest VIFs, with RAD (7.7) and TAX (8.9) well above the cutoff.

TAX and RAD are strongly correlated, as are INDUS and NOX, so dropping just one variable from each highly correlated pair is enough to bring the other's VIF down.

We cap p-values at 0.05. AGE and INDUS have large p-values.

Given the p-values and VIFs, if we choose to keep RAD and INDUS, then AGE, NOX, and TAX are dropped. After removing these features, fit a new linear model with the remaining ones.

6. Model 2: drop AGE, NOX, and TAX

X_train1 = training_data.drop(['AGE','NOX','TAX','MedianHomePrice'] , axis=1, inplace=False)
lm1 = sm.OLS(training_data['MedianHomePrice'], X_train1)
result1 = lm1.fit()
result1.summary()
OLS Regression Results
Dep. Variable: MedianHomePrice R-squared: 0.733
Model: OLS Adj. R-squared: 0.727
Method: Least Squares F-statistic: 108.1
Date: Sun, 10 May 2020 Prob (F-statistic): 2.77e-106
Time: 21:02:41 Log-Likelihood: -1208.0
No. Observations: 404 AIC: 2438.
Df Residuals: 393 BIC: 2482.
Df Model: 10
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
CRIM -0.9116 0.317 -2.876 0.004 -1.535 -0.289
ZN 0.5622 0.363 1.548 0.123 -0.152 1.276
INDUS -0.8746 0.411 -2.128 0.034 -1.683 -0.067
CHAS 0.6896 0.252 2.738 0.006 0.194 1.185
RM 3.2406 0.330 9.818 0.000 2.592 3.889
DIS -2.1728 0.434 -5.010 0.000 -3.025 -1.320
RAD 0.4380 0.389 1.126 0.261 -0.327 1.202
PTRATIO -1.6369 0.310 -5.288 0.000 -2.246 -1.028
B 1.2106 0.279 4.345 0.000 0.663 1.758
LSTAT -3.9851 0.381 -10.470 0.000 -4.733 -3.237
intercept 22.7965 0.243 93.916 0.000 22.319 23.274
Omnibus: 126.568 Durbin-Watson: 2.033
Prob(Omnibus): 0.000 Jarque-Bera (JB): 542.197
Skew: 1.310 Prob(JB): 1.83e-118
Kurtosis: 8.034 Cond. No. 4.66


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Based on the p-values, RAD (p = 0.261) should now be dropped while the other variables are kept.

7. Model 3: drop AGE, NOX, TAX, and RAD

X_train2 = training_data.drop(['AGE','NOX','TAX','RAD', 'MedianHomePrice'] , axis=1, inplace=False)
lm2 = sm.OLS(training_data['MedianHomePrice'], X_train2)
result2 = lm2.fit()
result2.summary()
OLS Regression Results
Dep. Variable: MedianHomePrice R-squared: 0.733
Model: OLS Adj. R-squared: 0.726
Method: Least Squares F-statistic: 119.9
Date: Sun, 10 May 2020 Prob (F-statistic): 4.60e-107
Time: 21:02:09 Log-Likelihood: -1208.6
No. Observations: 404 AIC: 2437.
Df Residuals: 394 BIC: 2477.
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
CRIM -0.7616 0.288 -2.647 0.008 -1.327 -0.196
ZN 0.6151 0.360 1.707 0.089 -0.093 1.323
INDUS -0.7544 0.397 -1.900 0.058 -1.535 0.026
CHAS 0.7067 0.252 2.810 0.005 0.212 1.201
RM 3.3022 0.326 10.142 0.000 2.662 3.942
DIS -2.2235 0.432 -5.153 0.000 -3.072 -1.375
PTRATIO -1.5090 0.288 -5.239 0.000 -2.075 -0.943
B 1.1502 0.273 4.206 0.000 0.613 1.688
LSTAT -3.9413 0.379 -10.406 0.000 -4.686 -3.197
intercept 22.7965 0.243 93.884 0.000 22.319 23.274
Omnibus: 134.948 Durbin-Watson: 2.028
Prob(Omnibus): 0.000 Jarque-Bera (JB): 619.161
Skew: 1.381 Prob(JB): 3.56e-135
Kurtosis: 8.399 Cond. No. 4.36


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Double-check that all of the VIFs are now below 4. Compared with the previous model, the R-squared value is unchanged (0.733).

training_data2 = training_data.drop(['AGE','NOX','TAX','RAD'] , axis=1, inplace=False)
vif = vif_calculator(training_data2, 'MedianHomePrice')
vif
VIF Factor features
0 0.0 Intercept
1 1.4 CRIM
2 2.2 ZN
3 2.7 INDUS
4 1.1 CHAS
5 1.8 RM
6 3.2 DIS
7 1.4 PTRATIO
8 1.3 B
9 2.4 LSTAT
10 0.0 intercept

8. Model evaluation

Score how well each model's test-set predictions match the actual test values.

# Model with all variables
lm_full = LinearRegression()
lm_full.fit(X_train, y_train)
lm_full.score(X_test, y_test)  # R^2 score on the test set
0.66848257539715972
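
LinearRegression's score method returns the R^2 of the predictions on the given data; the r2_score function imported earlier gives the same number (a quick equivalence check):

# score() is equivalent to r2_score on the model's predictions
preds = lm_full.predict(X_test)
print(r2_score(y_test, preds))  # matches lm_full.score(X_test, y_test)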
# Drop AGE, NOX, TAX
X_train_red = X_train.drop(['AGE','NOX','TAX'] , axis=1, inplace=False)
X_test_red = X_test.drop(['AGE','NOX','TAX'] , axis=1, inplace=False)

# Drop AGE, NOX, TAX, RAD
X_train_red2 = X_train.drop(['AGE','NOX','TAX','RAD'] , axis=1, inplace=False)
X_test_red2 = X_test.drop(['AGE','NOX','TAX','RAD'] , axis=1, inplace=False)

lm_red = LinearRegression()  # model without AGE, NOX, TAX
lm_red.fit(X_train_red, y_train)
print(lm_red.score(X_test_red, y_test))  # R^2 score

lm_red2 = LinearRegression()  # model without AGE, NOX, TAX, RAD
lm_red2.fit(X_train_red2, y_train)
print(lm_red2.score(X_test_red2, y_test))  # R^2 score

0.639421781821
0.63441065636

From these scores, the model with all of the variables performs best on this particular test set. A follow-up would be cross-validation (repeating this procedure over multiple train/test splits) to check whether that conclusion is stable.
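
A minimal sketch of that follow-up, assuming 5-fold cross-validation with shuffled folds (the KFold settings are illustrative, not part of the original run):

# Cross-validate the full-feature model; shuffling matters because the
# Boston rows are ordered, so unshuffled folds can be unrepresentative
from sklearn.model_selection import KFold, cross_val_score
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')
print(cv_scores.mean(), cv_scores.std())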
