# Hypothesis testing
import statsmodels.api as sm
Y = df1["总价"].values
X = df1[["建筑面积","室","厅","卫","中装修","毛坯","精装修","豪华装修","东","东北","东南","南","西","西北","西南","低层","高层"]]
X_ = sm.add_constant(X)
# Estimate with ordinary least squares
result = sm.OLS(Y,X_)
# The fit() method runs the estimation
summary = result.fit()
# Call summary2() to print the full table of hypothesis-test statistics
print(summary.summary2())
Results: Ordinary least squares
===================================================================
Model: OLS Adj. R-squared: 0.767
Dependent Variable: y AIC: 27289.5750
Date: 2017-12-19 20:55 BIC: 27394.4725
No. Observations: 2509 Log-Likelihood: -13627.
Df Model: 17 F-statistic: 485.5
Df Residuals: 2491 Prob (F-statistic): 0.00
R-squared: 0.768 Scale: 3076.8
---------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
---------------------------------------------------------------------
const -23.5634 5.7070 -4.1288 0.0000 -34.7545 -12.3724
建筑面积 1.6372 0.0368 44.4952 0.0000 1.5650 1.7093
室 7.9032 1.9707 4.0104 0.0001 4.0389 11.7675
厅 -12.7437 3.0179 -4.2227 0.0000 -18.6616 -6.8258
卫 8.1369 2.3460 3.4685 0.0005 3.5367 12.7372
中装修 -9.2360 6.5189 -1.4168 0.1567 -22.0191 3.5471
毛坯 -5.2365 3.2997 -1.5870 0.1126 -11.7070 1.2339
精装修 13.8924 3.2671 4.2522 0.0000 7.4859 20.2990
豪华装修 35.8055 5.0192 7.1336 0.0000 25.9632 45.6478
东 18.1810 4.1681 4.3619 0.0000 10.0076 26.3543
东北 -8.1028 12.8777 -0.6292 0.5293 -33.3549 17.1493
东南 13.4226 4.6954 2.8586 0.0043 4.2152 22.6299
南 13.3270 2.6813 4.9703 0.0000 8.0692 18.5848
西 10.4530 5.7140 1.8294 0.0675 -0.7516 21.6577
西北 0.6791 12.2561 0.0554 0.9558 -23.3540 24.7122
西南 1.0296 5.6021 0.1838 0.8542 -9.9557 12.0149
低层 13.5631 2.6789 5.0630 0.0000 8.3101 18.8161
高层 3.4208 2.8190 1.2135 0.2251 -2.1069 8.9486
-------------------------------------------------------------------
Omnibus: 836.776 Durbin-Watson: 1.849
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5433.736
Skew: 1.419 Prob(JB): 0.000
Kurtosis: 9.628 Condition No.: 1565
===================================================================
* The condition number is large (2e+03). This might indicate
strong multicollinearity or other numerical problems.
Glossary:
Coef.: the regression coefficient
Std.Err.: the standard error of the coefficient
t: the t-statistic for testing the null hypothesis that the coefficient is 0
P>|t|: the p-value, i.e. the probability of a t-statistic at least this extreme if the null hypothesis holds
[0.025, 0.975]: the bounds of the 95% confidence interval (the 2.5% and 97.5% quantiles)
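Each of these columns can also be read programmatically off the fitted results; a minimal sketch using the `summary` results object from above (these are standard statsmodels attributes):
# The summary2 columns as attributes of the fitted results
print(summary.params)      # Coef.
print(summary.bse)         # Std.Err.
print(summary.tvalues)     # t
print(summary.pvalues)     # P>|t|
print(summary.conf_int())  # [0.025  0.975]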
To run a hypothesis test, first set a significance level. a. Suppose the significance level is 0.01. b. The criterion for rejecting the null hypothesis is p < 0.01. c. In the table above, 建筑面积 has t = 44.4952 and P>|t| = 0.0000 < 0.01, so the null hypothesis is rejected (here the null hypothesis is that the regression coefficient of 建筑面积 on 总价 is 0, i.e. that floor area and total price are unrelated).
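As a sketch of the same decision rule in code (the 0.01 threshold follows point a above; note that `summary.pvalues` includes the intercept as well):
# List the predictors whose null hypothesis is rejected at the 0.01 level
alpha = 0.01
pvals = summary.pvalues
significant = pvals[pvals < alpha].index.tolist()
print("Null rejected (p < 0.01):", significant)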
F statistic:
Regression sum of squares [SSR]: the variation in the dependent variable explained by the regression model, A = sum((y'i - yavg)^2)
Error sum of squares [SSE]: the variation the linear model leaves unexplained, B = sum((yi - y'i)^2)
Total sum of squares [SST]: the overall variation in the dependent variable, C = A + B
Model mean square: MSR = SSR / k, where k = the number of independent variables (the model degrees of freedom)
Error mean square: MSE = SSE / (n - k - 1), where n = the number of observations
F statistic: F = MSR / MSE
The larger F is, and the smaller Prob(F-statistic) is, the better.
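A minimal sketch that recomputes F from the fitted results (statsmodels exposes the explained sum of squares as `ess`, the residual sum of squares as `ssr`, and the degrees of freedom as `df_model` and `df_resid`):
# Recompute F by hand and compare with the value statsmodels reports
msr = summary.ess / summary.df_model    # regression (model) mean square
mse = summary.ssr / summary.df_resid    # error mean square, df = n - k - 1
F = msr / mse
print(F, summary.fvalue)                # the two values should agree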
R Square:
The proportion of variation the regression explains; it can serve as a measure of how accurately the independent variables predict the dependent variable.
SSE (residual sum of squares) = sum((yi - y'i)^2)
SST (total sum of squares) = sum((yi - yavg)^2)
R^2 = 1 - SSE/SST; as a rule of thumb, values above 0.6 or 0.7 count as good.
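A quick sketch checking this formula against the fitted results (`summary.ssr` is the residual sum of squares; Y is the numpy array defined at the top of this section):
import numpy as np
# R^2 from first principles: 1 - SSE/SST
sse = summary.ssr                       # residual sum of squares
sst = np.sum((Y - Y.mean())**2)         # total sum of squares
print(1 - sse/sst, summary.rsquared)    # the two should match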
Adjusted R Square:
Because OLS minimizes SSE, R^2 = 1 - SSE/SST never decreases as variables are added:
in yi = b1*x1 + b2*x2 + ... + bk*xk + ..., adding any extra variable can only raise R^2 or leave it unchanged.
Adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
where n is the sample size and p is the number of regressors.
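The same check for the adjusted version (`rsquared_adj` is the value statsmodels reports; `df_model` is the number of regressors excluding the constant):
# Adjusted R^2 from the formula above
n, p = len(Y), int(summary.df_model)    # sample size, number of regressors
adj_r2 = 1 - (1 - summary.rsquared) * (n - 1) / (n - p - 1)
print(adj_r2, summary.rsquared_adj)     # the two should match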
AIC/BIC:
AIC (the Akaike Information Criterion) = 2k + n*ln(SSE/n), where k is the number of parameters, n the number of observations, and SSE the residual sum of squares. AIC rewards goodness of fit while discouraging overfitting, so the preferred model is the one with the smallest AIC; the Akaike criterion looks for the model that explains the data best with the fewest free parameters.
BIC (the Bayesian Information Criterion) = k*ln(n) + n*ln(SSE/n); it behaves like AIC but penalizes the parameter count more heavily as n grows, and again smaller is better.
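One caveat worth checking in code: statsmodels computes AIC/BIC from the full Gaussian log-likelihood, which differs from the 2k + n*ln(SSE/n) form above only by an additive constant for a fixed n, so model rankings are unaffected. A sketch:
import numpy as np
n, k = len(Y), len(summary.params)      # observations, parameters incl. intercept
aic_simple = 2*k + n*np.log(summary.ssr/n)
print(aic_simple, summary.aic)          # differ by the constant n*(ln(2*pi) + 1)
print(summary.bic)                      # reported BIC, same likelihood basis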
import itertools
# Use AIC to pick the feature set: search for the attribute combination
# with the smallest AIC. Note this exhaustively fits all 2^17 - 1 = 131071
# combinations, so it takes a while to run.
fields = ["建筑面积","室","厅","卫","中装修","毛坯","精装修","豪华装修","东","东北","东南","南","西","西北","西南","低层","高层"]
aics = {}
for i in range(1,len(fields)+1):
    for variables in itertools.combinations(fields,i):
        x1 = sm.add_constant(df1[list(variables)])
        x2 = sm.OLS(Y,x1)
        res = x2.fit()
        aics[variables] = res.aic
# Use collections.Counter to rank the dictionary entries
from collections import Counter
counter = Counter(aics)
# most_common() sorts by AIC in descending order, so the ten combinations
# with the smallest AIC sit at the end; take them in reverse
counter.most_common()[:-11:-1]
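The single best combination can also be read straight off the dictionary, without going through Counter:
# The combination with the minimum AIC
best = min(aics, key=aics.get)
print(best, aics[best])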
cols2 = ['建筑面积', '室', '厅', '卫', '精装修', '豪华装修', '东', '东南', '南', '西', '低层']
### Split into training and test sets
X = df1[cols2]
Y = df1['总价']
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=123)
# Multiple linear regression on the selected features
from sklearn.linear_model import LinearRegression
linear_multi = LinearRegression()
model1 = linear_multi.fit(x_train,y_train)
print(model1.intercept_,model1.coef_)
# score() returns R^2 on the held-out test set
print(model1.score(x_test,y_test))
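score() reports R^2 on the test set; for an error measure in the same units as 总价, a short sketch using sklearn.metrics:
import numpy as np
from sklearn.metrics import mean_squared_error
# RMSE of the test-set predictions, in the same units as the target 总价
y_pred = model1.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)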