Machine Learning Feature Selection: Using Hypothesis Testing

# Feature selection via hypothesis testing
import statsmodels.api as sm
Y = df1["总价"].values
X = df1[["建筑面积","室","厅","卫","中装修","毛坯","精装修","豪华装修","东","东北","东南","南","西","西北","西南","低层","高层"]]
X_ = sm.add_constant(X)
# Build an ordinary least squares model
result = sm.OLS(Y, X_)
# fit() runs the estimation
summary = result.fit()
# summary2() prints the full set of hypothesis-testing statistics
print(summary.summary2())
Results: Ordinary least squares
===================================================================
Model:              OLS              Adj. R-squared:     0.767
Dependent Variable: y                AIC:                27289.5750
Date:               2017-12-19 20:55 BIC:                27394.4725
No. Observations:   2509             Log-Likelihood:     -13627.
Df Model:           17               F-statistic:        485.5
Df Residuals:       2491             Prob (F-statistic): 0.00
R-squared:          0.768            Scale:              3076.8
---------------------------------------------------------------------
          Coef.     Std.Err.      t      P>|t|     [0.025     0.975]
---------------------------------------------------------------------
const    -23.5634     5.7070   -4.1288   0.0000   -34.7545   -12.3724
建筑面积       1.6372     0.0368   44.4952   0.0000     1.5650     1.7093
室          7.9032     1.9707    4.0104   0.0001     4.0389    11.7675
厅        -12.7437     3.0179   -4.2227   0.0000   -18.6616    -6.8258
卫          8.1369     2.3460    3.4685   0.0005     3.5367    12.7372
中装修       -9.2360     6.5189   -1.4168   0.1567   -22.0191     3.5471
毛坯        -5.2365     3.2997   -1.5870   0.1126   -11.7070     1.2339
精装修       13.8924     3.2671    4.2522   0.0000     7.4859    20.2990
豪华装修      35.8055     5.0192    7.1336   0.0000    25.9632    45.6478
东         18.1810     4.1681    4.3619   0.0000    10.0076    26.3543
东北        -8.1028    12.8777   -0.6292   0.5293   -33.3549    17.1493
东南        13.4226     4.6954    2.8586   0.0043     4.2152    22.6299
南         13.3270     2.6813    4.9703   0.0000     8.0692    18.5848
西         10.4530     5.7140    1.8294   0.0675    -0.7516    21.6577
西北         0.6791    12.2561    0.0554   0.9558   -23.3540    24.7122
西南         1.0296     5.6021    0.1838   0.8542    -9.9557    12.0149
低层        13.5631     2.6789    5.0630   0.0000     8.3101    18.8161
高层         3.4208     2.8190    1.2135   0.2251    -2.1069     8.9486
-------------------------------------------------------------------
Omnibus:             836.776       Durbin-Watson:          1.849
Prob(Omnibus):       0.000         Jarque-Bera (JB):       5433.736
Skew:                1.419         Prob(JB):               0.000
Kurtosis:            9.628         Condition No.:          1565
===================================================================
* The condition number is large (2e+03). This might indicate
strong multicollinearity or other numerical problems.
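The footnote flags possible multicollinearity. One way to probe it (a minimal sketch, assuming the design matrix `X_` built above; `variance_inflation_factor` is a standard statsmodels utility) is to compute variance inflation factors:

from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF per column of the design matrix; values far above 10 suggest a feature
# is close to a linear combination of the others (the 'const' row itself is
# not meaningful and can be ignored)
for i, name in enumerate(X_.columns):
    print(name, variance_inflation_factor(X_.values, i))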
Glossary:
- Coef.: the regression coefficient
- Std.Err.: the standard error of the coefficient estimate
- t: the t-statistic under the null hypothesis
- P>|t|: the p-value under the null hypothesis
- [0.025, 0.975]: the 95% confidence interval for the coefficient
- To run a hypothesis test, first set a significance level: a. suppose the significance level is 0.01; b. the criterion for rejecting the null hypothesis is p < 0.01; c. in the table above, 建筑面积 has t = 44.4952 and P>|t| = 0.0000 < 0.01, so the null hypothesis is rejected (here the null hypothesis is that the regression coefficient of 建筑面积 on 总价 is 0, i.e. that 建筑面积 and 总价 are uncorrelated). A sketch of applying this criterion in code follows this list.
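A minimal sketch of applying the p < 0.01 criterion programmatically (assuming the fitted results object `summary` from the code above; `pvalues` is a standard statsmodels attribute):

# p-value of each coefficient under the null hypothesis (coefficient = 0)
print(summary.pvalues)
# keep only the terms significant at alpha = 0.01 (the intercept appears too)
significant = summary.pvalues[summary.pvalues < 0.01].index.tolist()
print(significant)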
F-statistic:
- Regression sum of squares [RSS]: the variation in the dependent variable attributed to the regression model, A = sum((y' - y_avg)^2)
- Error sum of squares [ESS]: the variation left unexplained by the linear model, B = sum((y - y')^2)
- Total sum of squares [TSS]: the overall variation in the dependent variable, C = A + B = sum((y - y_avg)^2)
- Model mean square = RSS / regression d.f. (k), where k is the number of independent variables
- Error mean square = ESS / error d.f. (n - k - 1), where n is the number of observations
- F-statistic: F = model mean square / error mean square
- The larger the F value and the smaller Prob(F-statistic), the better; see the sketch below for a check against the summary.
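As a sanity check on the summary above, F can also be recovered from R^2 via the equivalent form F = (R^2 / k) / ((1 - R^2) / (n - k - 1)); a minimal sketch plugging in the reported values:

# F recovered from R^2 with k = 17 regressors and n = 2509 observations
r2, n, k = 0.768, 2509, 17
f = (r2 / k) / ((1 - r2) / (n - k - 1))
print(f)  # ~485, close to the reported 485.5 (the gap comes from rounding R^2)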
R-squared:
- The proportion of variation explained by the regression; it can serve as a measure of how accurately the independent variables predict the dependent variable.
- SSE (residual sum of squares) = sum((y - y')^2)
- SST (total sum of squares) = sum((yi - y_avg)^2)
- R^2 = 1 - SSE/SST; as a rule of thumb it should exceed roughly 0.6-0.7 to be considered good. A verification sketch follows this list.
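A minimal sketch verifying this formula against statsmodels (assuming the fitted object `summary` and the target array `Y` from above; `resid` and `rsquared` are standard statsmodels attributes):

import numpy as np
sse = np.sum(summary.resid ** 2)         # residual sum of squares, sum((y - y')^2)
sst = np.sum((Y - Y.mean()) ** 2)        # total sum of squares, sum((yi - y_avg)^2)
print(1 - sse / sst, summary.rsquared)   # both print ~0.768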
Adjusted R-squared:
- In R^2 = 1 - SSE/SST, OLS minimizes SSE, which implies R^2 never decreases as variables are added: in yi = b1*x1 + b2*x2 + ... + bk*xk + ..., adding any extra variable can only keep R^2 the same or increase it.
- The adjusted version penalizes extra regressors: Adj R^2 = 1 - (1 - R^2) * ((n - 1) / (n - p - 1)), where n is the sample size and p is the number of regressors; a worked check follows this list.
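Plugging the numbers from the summary into this formula as a quick check:

# n = 2509 observations, p = 17 regressors
r2, n, p = 0.768, 2509, 17
adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - p - 1))
print(adj_r2)  # ~0.766, matching the reported 0.767 up to rounding of R^2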
AIC / BIC:
- AIC (the Akaike Information Criterion) = 2k + n*ln(SSE/n), where k is the number of parameters, n the number of observations, and SSE the residual sum of squares. AIC rewards goodness of fit while discouraging overfitting, so the preferred model is the one with the smallest AIC; the Akaike criterion seeks the model that best explains the data with the fewest free parameters.
- BIC (the Bayesian Information Criterion) = k*ln(n) + n*ln(SSE/n); its penalty on the number of parameters grows with n, so it is stricter than AIC about extra regressors.
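statsmodels itself computes both criteria from the log-likelihood (AIC = 2k - 2*llf, BIC = k*ln(n) - 2*llf, where k counts the constant as well). A minimal sketch checking the summary values (assuming the fitted object `summary`; `df_model`, `llf`, `nobs`, `aic`, and `bic` are standard statsmodels attributes):

import numpy as np
k = summary.df_model + 1                      # 17 regressors plus the constant
print(2 * k - 2 * summary.llf, summary.aic)   # both ~27289.6
print(k * np.log(summary.nobs) - 2 * summary.llf, summary.bic)  # both ~27394.5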
import itertools
# Exhaustive search: use AIC to find the feature subset best suited for prediction.
# Note: with 17 candidate features this fits 2**17 - 1 = 131071 models,
# so it can take a while to run.
fields = ["建筑面积","室","厅","卫","中装修","毛坯","精装修","豪华装修","东","东北","东南","南","西","西北","西南","低层","高层"]
aics = {}
for i in range(1, len(fields) + 1):
    for variables in itertools.combinations(fields, i):
        x1 = sm.add_constant(df1[list(variables)])
        res = sm.OLS(Y, x1).fit()
        aics[variables] = res.aic
# Use collections.Counter to rank the dictionary by AIC
from collections import Counter
counter = Counter(aics)
# most_common() sorts by AIC in descending order, so slicing from the end
# yields the 10 feature combinations with the smallest AIC
counter.most_common()[:-11:-1]
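Counter is not strictly necessary here; a minimal alternative sketch pulls the single lowest-AIC subset straight from the dictionary:

# the feature combination with the smallest AIC overall
best = min(aics, key=aics.get)
print(best, aics[best])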
cols2 = ['建筑面积', '室', '厅', '卫', '精装修', '豪华装修', '东', '东南', '南', '西', '低层']
### Split the data into training and test sets
X = df1[cols2]
Y = df1['总价']
# sklearn.cross_validation was removed in newer scikit-learn versions
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=123)
# Multiple linear regression on the selected features
from sklearn.linear_model import LinearRegression
linear_multi = LinearRegression()
model1 = linear_multi.fit(x_train, y_train)
print(model1.intercept_, model1.coef_)
# score() returns R^2 on the held-out test set
print(model1.score(x_test, y_test))
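Beyond the R^2 score, a minimal sketch of checking prediction error on the held-out data (RMSE is in the same units as 总价):

import numpy as np
# root mean squared error of the test-set predictions
y_pred = model1.predict(x_test)
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(rmse)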