Data Science Case 3: Linear Regression for Auto Loans (Code)
- 7 Linear Regression Models and Diagnostics
  - Step 1: Import and clean the data
  - Step 2: Correlation analysis
  - Step 3: Linear regression
  - Step 4: Residual analysis
  - Step 5: Influential-point analysis
  - Step 6: Multicollinearity analysis (VIF function)
  - Step 7: Regularized regression
    - 1. Ridge regression
    - 2. Tuning the regularization parameter with scikit-learn
7 Linear Regression Models and Diagnostics
Data description: this is an auto-loan (credit-card expenditure) dataset.
| Field | Meaning |
| --- | --- |
| id | ID |
| Acc | Card activated (1 = activated) |
| avg_exp | Average monthly credit-card spending (CNY) |
| avg_exp_ln | Natural log of average monthly credit-card spending |
| gender | Gender (1 = male) |
| Age | Age |
| Income | Annual income (10,000 CNY) |
| Ownrent | Home ownership (1 = owns; 0 = does not) |
| Selfempl | Self-employed (1 = yes, 0 = no) |
| dist_home_val | Average home price in the residential district (10,000 CNY) |
| dist_avg_income | Local average income |
| high_avg | Income in excess of the local average |
| edu_class | Education level: 0 = primary school or below, 1 = secondary school, 2 = bachelor's, 3 = graduate |
%matplotlib inline
import matplotlib.pyplot as plt
import os
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
Step 1: Import and clean the data
(rows with missing values are simply dropped)
raw = pd.read_csv(r'.\data\creditcard_exp.csv', skipinitialspace=True)
raw.head()
| | id | Acc | avg_exp | avg_exp_ln | gender | Age | Income | Ownrent | Selfempl | dist_home_val | dist_avg_income | age2 | high_avg | edu_class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 19 | 1 | 1217.03 | 7.104169 | 1 | 40 | 16.03515 | 1 | 1 | 99.93 | 15.932789 | 1600 | 0.102361 | 3 |
| 1 | 5 | 1 | 1251.50 | 7.132098 | 1 | 32 | 15.84750 | 1 | 0 | 49.88 | 15.796316 | 1024 | 0.051184 | 2 |
| 2 | 95 | 0 | NaN | NaN | 1 | 36 | 8.40000 | 0 | 0 | 88.61 | 7.490000 | 1296 | 0.910000 | 1 |
| 3 | 86 | 1 | 856.57 | 6.752936 | 1 | 41 | 11.47285 | 1 | 0 | 16.10 | 11.275632 | 1681 | 0.197218 | 3 |
| 4 | 50 | 1 | 1321.83 | 7.186772 | 1 | 28 | 13.40915 | 1 | 0 | 100.39 | 13.346474 | 784 | 0.062676 | 2 |
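Before splitting the data, it helps to confirm how many rows are actually missing; a quick check (an addition here, assuming raw as loaded above):
raw.isnull().sum()               # missing count per column
raw['avg_exp'].isnull().sum()    # number of rows held out for prediction below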
# complete cases for model fitting; rows with missing avg_exp are held out for prediction
exp = raw[raw['avg_exp'].notnull()].copy().iloc[:, 2:].drop('age2', axis=1)
exp_new = raw[raw['avg_exp'].isnull()].copy().iloc[:, 2:].drop('age2', axis=1)
exp.describe(include='all')
| | avg_exp | avg_exp_ln | gender | Age | Income | Ownrent | Selfempl | dist_home_val | dist_avg_income | high_avg | edu_class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 70.000000 | 70.000000 | 70.000000 | 70.000000 | 70.000000 | 70.000000 | 70.000000 | 70.000000 | 70.000000 | 70.000000 | 70.000000 |
| mean | 983.655429 | 6.787787 | 0.285714 | 31.157143 | 7.424706 | 0.385714 | 0.028571 | 74.540857 | 8.005472 | -0.580766 | 1.928571 |
| std | 446.294237 | 0.476035 | 0.455016 | 7.206349 | 3.077986 | 0.490278 | 0.167802 | 36.949228 | 3.070744 | 0.432808 | 0.873464 |
| min | 163.180000 | 5.094854 | 0.000000 | 20.000000 | 3.493900 | 0.000000 | 0.000000 | 13.130000 | 3.828842 | -1.526850 | 0.000000 |
| 25% | 697.155000 | 6.547003 | 0.000000 | 26.000000 | 5.175662 | 0.000000 | 0.000000 | 49.302500 | 5.915553 | -0.887981 | 1.000000 |
| 50% | 884.150000 | 6.784627 | 0.000000 | 30.000000 | 6.443525 | 0.000000 | 0.000000 | 65.660000 | 7.084184 | -0.612068 | 2.000000 |
| 75% | 1229.585000 | 7.114415 | 1.000000 | 36.000000 | 8.494237 | 1.000000 | 0.000000 | 105.067500 | 9.123105 | -0.302082 | 3.000000 |
| max | 2430.030000 | 7.795659 | 1.000000 | 55.000000 | 16.900150 | 1.000000 | 1.000000 | 157.900000 | 18.427000 | 0.259337 | 3.000000 |
Step 2: Correlation analysis
Scatter plot
exp.plot('Income', 'avg_exp', kind='scatter')
plt.show()
exp[['Income', 'avg_exp', 'Age', 'dist_home_val']].corr(method='pearson')
| | Income | avg_exp | Age | dist_home_val |
| --- | --- | --- | --- | --- |
| Income | 1.000000 | 0.674011 | 0.369129 | 0.249153 |
| avg_exp | 0.674011 | 1.000000 | 0.258478 | 0.319499 |
| Age | 0.369129 | 0.258478 | 1.000000 | 0.109323 |
| dist_home_val | 0.249153 | 0.319499 | 0.109323 | 1.000000 |
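Pearson correlation only captures linear association; as an optional robustness check (not part of the original analysis), a Spearman rank correlation can be computed the same way:
# rank-based correlation, less sensitive to outliers and monotone transforms
exp[['Income', 'avg_exp', 'Age', 'dist_home_val']].corr(method='spearman')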
Step 3: Linear regression
1. Simple linear regression
lm_s = ols('avg_exp ~ Income', data=exp).fit()
lm_s.summary()
OLS Regression Results

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | avg_exp | R-squared: | 0.454 |
| Model: | OLS | Adj. R-squared: | 0.446 |
| Method: | Least Squares | F-statistic: | 56.61 |
| Date: | Thu, 06 Feb 2020 | Prob (F-statistic): | 1.60e-10 |
| Time: | 10:14:45 | Log-Likelihood: | -504.69 |
| No. Observations: | 70 | AIC: | 1013. |
| Df Residuals: | 68 | BIC: | 1018. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 258.0495 | 104.290 | 2.474 | 0.016 | 49.942 | 466.157 |
| Income | 97.7286 | 12.989 | 7.524 | 0.000 | 71.809 | 123.648 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 3.714 | Durbin-Watson: | 1.424 |
| Prob(Omnibus): | 0.156 | Jarque-Bera (JB): | 3.507 |
| Skew: | 0.485 | Prob(JB): | 0.173 |
| Kurtosis: | 2.490 | Cond. No. | 21.4 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
pd.DataFrame([lm_s.predict(exp), lm_s.resid], index=['predict', 'resid']
).T.head()
| | predict | resid |
| --- | --- | --- |
| 0 | 1825.141904 | -608.111904 |
| 1 | 1806.803136 | -555.303136 |
| 3 | 1379.274813 | -522.704813 |
| 4 | 1568.506658 | -246.676658 |
| 5 | 1238.281793 | -422.251793 |
lm_s.predict(exp_new)[:5]
2 1078.969552
11 756.465245
13 736.919530
19 687.077955
20 666.554953
dtype: float64
2. Multiple linear regression
lm_m = ols('avg_exp ~ Age + Income + dist_home_val + dist_avg_income',
data=exp).fit()
lm_m.summary()
OLS Regression Results

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | avg_exp | R-squared: | 0.542 |
| Model: | OLS | Adj. R-squared: | 0.513 |
| Method: | Least Squares | F-statistic: | 19.20 |
| Date: | Thu, 06 Feb 2020 | Prob (F-statistic): | 1.82e-10 |
| Time: | 10:15:18 | Log-Likelihood: | -498.59 |
| No. Observations: | 70 | AIC: | 1007. |
| Df Residuals: | 65 | BIC: | 1018. |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | -32.0078 | 186.874 | -0.171 | 0.865 | -405.221 | 341.206 |
| Age | 1.3723 | 5.605 | 0.245 | 0.807 | -9.822 | 12.566 |
| Income | -166.7204 | 87.607 | -1.903 | 0.061 | -341.684 | 8.243 |
| dist_home_val | 1.5329 | 1.057 | 1.450 | 0.152 | -0.578 | 3.644 |
| dist_avg_income | 261.8827 | 87.807 | 2.982 | 0.004 | 86.521 | 437.245 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 5.234 | Durbin-Watson: | 1.582 |
| Prob(Omnibus): | 0.073 | Jarque-Bera (JB): | 5.174 |
| Skew: | 0.625 | Prob(JB): | 0.0752 |
| Kurtosis: | 2.540 | Cond. No. | 459. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
2.1 Variable selection for multiple linear regression
'''Forward selection by AIC'''
def forward_select(data, response):
    # Greedily add, at each step, the candidate variable that gives the
    # largest drop in AIC; stop when no candidate improves the AIC.
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = float('inf'), float('inf')
    while remaining:
        aic_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {}".format(
                response, ' + '.join(selected + [candidate]))
            aic = ols(formula=formula, data=data).fit().aic
            aic_with_candidates.append((aic, candidate))
        aic_with_candidates.sort(reverse=True)
        best_new_score, best_candidate = aic_with_candidates.pop()  # smallest AIC
        if current_score > best_new_score:  # AIC improved: accept the candidate
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            print('aic is {},continuing!'.format(current_score))
        else:
            print('forward selection over!')
            break
    formula = "{} ~ {} ".format(response, ' + '.join(selected))
    print('final formula is {}'.format(formula))
    model = ols(formula=formula, data=data).fit()
    return model
data_for_select = exp[['avg_exp', 'Income', 'Age', 'dist_home_val',
'dist_avg_income']]
lm_m = forward_select(data=data_for_select, response='avg_exp')
print(lm_m.rsquared)
aic is 1007.6801413968117,continuing!
aic is 1005.4969816306302,continuing!
aic is 1005.2487355956046,continuing!
forward selection over!
final formula is avg_exp ~ dist_avg_income + Income + dist_home_val
0.541151292841195
Step 4: Residual analysis
ana1 = lm_s
exp['Pred'] = ana1.predict(exp)
exp['resid'] = ana1.resid
exp.plot('Income', 'resid', kind='scatter')
plt.show()
The residual spread grows with Income (a sign of heteroscedasticity), which motivates the log transformations of the dependent variable tried below.
ana1 = ols('avg_exp ~ Income', data=exp).fit()
ana1.summary()
OLS Regression Results

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | avg_exp | R-squared: | 0.454 |
| Model: | OLS | Adj. R-squared: | 0.446 |
| Method: | Least Squares | F-statistic: | 56.61 |
| Date: | Thu, 06 Feb 2020 | Prob (F-statistic): | 1.60e-10 |
| Time: | 10:15:54 | Log-Likelihood: | -504.69 |
| No. Observations: | 70 | AIC: | 1013. |
| Df Residuals: | 68 | BIC: | 1018. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 258.0495 | 104.290 | 2.474 | 0.016 | 49.942 | 466.157 |
| Income | 97.7286 | 12.989 | 7.524 | 0.000 | 71.809 | 123.648 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 3.714 | Durbin-Watson: | 1.424 |
| Prob(Omnibus): | 0.156 | Jarque-Bera (JB): | 3.507 |
| Skew: | 0.485 | Prob(JB): | 0.173 |
| Kurtosis: | 2.490 | Cond. No. | 21.4 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
ana2 = ols('avg_exp_ln ~ Income', exp).fit()
exp['Pred'] = ana2.predict(exp)
exp['resid'] = ana2.resid
ana2.summary()
OLS Regression Results

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | avg_exp_ln | R-squared: | 0.403 |
| Model: | OLS | Adj. R-squared: | 0.394 |
| Method: | Least Squares | F-statistic: | 45.92 |
| Date: | Thu, 06 Feb 2020 | Prob (F-statistic): | 3.58e-09 |
| Time: | 10:15:59 | Log-Likelihood: | -28.804 |
| No. Observations: | 70 | AIC: | 61.61 |
| Df Residuals: | 68 | BIC: | 66.11 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 6.0587 | 0.116 | 52.077 | 0.000 | 5.827 | 6.291 |
| Income | 0.0982 | 0.014 | 6.776 | 0.000 | 0.069 | 0.127 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 10.765 | Durbin-Watson: | 1.197 |
| Prob(Omnibus): | 0.005 | Jarque-Bera (JB): | 12.708 |
| Skew: | -0.688 | Prob(JB): | 0.00174 |
| Kurtosis: | 4.569 | Cond. No. | 21.4 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
exp['Income_ln'] = np.log(exp['Income'])
ana3 = ols('avg_exp_ln ~ Income_ln', exp).fit()
exp['Pred'] = ana3.predict(exp)
exp['resid'] = ana3.resid
exp.plot('Income_ln', 'resid',kind='scatter')
plt.show()
ana3.summary()
OLS Regression Results

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | avg_exp_ln | R-squared: | 0.480 |
| Model: | OLS | Adj. R-squared: | 0.473 |
| Method: | Least Squares | F-statistic: | 62.87 |
| Date: | Thu, 06 Feb 2020 | Prob (F-statistic): | 2.95e-11 |
| Time: | 10:16:30 | Log-Likelihood: | -23.950 |
| No. Observations: | 70 | AIC: | 51.90 |
| Df Residuals: | 68 | BIC: | 56.40 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 5.0611 | 0.222 | 22.833 | 0.000 | 4.619 | 5.503 |
| Income_ln | 0.8932 | 0.113 | 7.929 | 0.000 | 0.668 | 1.118 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 8.382 | Durbin-Watson: | 1.368 |
| Prob(Omnibus): | 0.015 | Jarque-Bera (JB): | 8.074 |
| Skew: | -0.668 | Prob(JB): | 0.0177 |
| Kurtosis: | 3.992 | Cond. No. | 13.2 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
r_sq = {'exp~Income':ana1.rsquared, 'ln(exp)~Income':ana2.rsquared,
'ln(exp)~ln(Income)':ana3.rsquared}
print(r_sq)
{'exp~Income': 0.45429062315565294, 'ln(exp)~Income': 0.4030855555329651, 'ln(exp)~ln(Income)': 0.48039279938931057}
Note that R-squared is only directly comparable between models sharing the same dependent variable; of the two ln(avg_exp) models, the log-log specification fits best.
Step 5: Influential-point analysis
# standardize the residuals by hand and flag observations beyond +/- 2
exp['resid_t'] = (exp['resid'] - exp['resid'].mean()) / exp['resid'].std()
exp[abs(exp['resid_t']) > 2]
| | avg_exp | avg_exp_ln | gender | Age | Income | Ownrent | Selfempl | dist_home_val | dist_avg_income | high_avg | edu_class | Pred | resid | Income_ln | resid_t |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 73 | 251.56 | 5.527682 | 0 | 29 | 5.1578 | 0 | 0 | 63.23 | 5.492947 | -0.335147 | 0 | 6.526331 | -0.998649 | 1.640510 | -2.910292 |
| 98 | 163.18 | 5.094854 | 0 | 22 | 3.8159 | 0 | 0 | 63.27 | 3.997789 | -0.181889 | 0 | 6.257191 | -1.162337 | 1.339177 | -3.387317 |
exp2 = exp[abs(exp['resid_t']) <= 2].copy()
ana4 = ols('avg_exp_ln ~ Income_ln', exp2).fit()
exp2['Pred'] = ana4.predict(exp2)
exp2['resid'] = ana4.resid
exp2.plot('Income', 'resid', kind='scatter')
plt.show()
ana4.rsquared
0.49397191385172456
from statsmodels.stats.outliers_influence import OLSInfluence
OLSInfluence(ana3).summary_frame().head()
| | dfb_Intercept | dfb_Income_ln | cooks_d | standard_resid | hat_diag | dffits_internal | student_resid | dffits |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.343729 | -0.381393 | 0.085587 | -1.319633 | 0.089498 | -0.413732 | -1.326996 | -0.416040 |
| 1 | 0.307196 | -0.341294 | 0.069157 | -1.201699 | 0.087409 | -0.371907 | -1.205702 | -0.373146 |
| 3 | 0.207619 | -0.244956 | 0.044984 | -1.440468 | 0.041557 | -0.299947 | -1.452165 | -0.302382 |
| 4 | 0.112301 | -0.127713 | 0.010759 | -0.575913 | 0.060926 | -0.146693 | -0.573062 | -0.145967 |
| 5 | 0.120572 | -0.150924 | 0.022274 | -1.221080 | 0.029011 | -0.211064 | -1.225579 | -0.211842 |
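A common rule of thumb, added here for illustration, flags observations whose Cook's distance exceeds 4/n:
# rule-of-thumb cutoff: Cook's distance above 4/n is worth a closer look
infl = OLSInfluence(ana3)
cooks_d, _ = infl.cooks_distance
print(exp.index[cooks_d > 4 / len(exp)].tolist())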
exp2['dist_home_val_ln'] = np.log(exp2['dist_home_val'])
exp2['dist_avg_income_ln'] = np.log(exp2['dist_avg_income'])
ana5 = ols('''avg_exp_ln ~ Age + Income_ln +
           dist_home_val_ln + dist_avg_income_ln''', exp2).fit()
exp2['Pred'] = ana5.predict(exp2)  # refresh predictions and residuals so the plot reflects ana5
exp2['resid'] = ana5.resid
exp2.plot('Income', 'resid', kind='scatter')
plt.show()
ana5.rsquared
0.5529068646270383
Step 6: Multicollinearity analysis (VIF function)
Stepwise regression does not always work.
ana5.bse
Intercept 0.317453
Age 0.005124
Income_ln 0.568848
dist_home_val_ln 0.058210
dist_avg_income_ln 0.612197
dtype: float64
def vif(df, col_i):
    # Variance inflation factor of column col_i: regress it on all other
    # columns and return 1 / (1 - R^2).
    cols = list(df.columns)
    cols.remove(col_i)
    formula = col_i + '~' + '+'.join(cols)
    r2 = ols(formula, df).fit().rsquared
    return 1. / (1. - r2)
exog = exp2[['Income_ln', 'dist_home_val_ln',
'dist_avg_income_ln']]
for i in exog.columns:
print(i, '\t', vif(df=exog, col_i=i))
Income_ln 36.653639058963186
dist_home_val_ln 1.053596313570258
dist_avg_income_ln 36.894876856102
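As a cross-check (an addition, assuming exog as defined above), statsmodels' built-in variance_inflation_factor should agree with the hand-rolled vif once an intercept column is added:
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(exog)  # the auxiliary regressions need an intercept
for i, col in enumerate(exog.columns):
    print(col, variance_inflation_factor(X_vif.values, i + 1))  # i + 1 skips the constant column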
exp2['high_avg_ratio'] = exp2['high_avg'] / exp2['dist_avg_income']
exog1 = exp2[['high_avg_ratio', 'dist_home_val_ln',
'dist_avg_income_ln']]
for i in exog1.columns:
print(i, '\t', vif(df=exog1, col_i=i))
high_avg_ratio 1.1230220802048871
dist_home_val_ln 1.0527009887483532
dist_avg_income_ln 1.1762825351755393
var_select = exp2[['avg_exp_ln', 'high_avg_ratio',
'dist_home_val_ln', 'dist_avg_income_ln']]
ana7 = forward_select(data=var_select, response='avg_exp_ln')
print(ana7.rsquared)
aic is 23.816793700737392,continuing!
aic is 20.830952279560805,continuing!
forward selection over!
final formula is avg_exp_ln ~ dist_avg_income_ln + dist_home_val_ln
0.552039773684598
formula8 = '''
avg_exp_ln ~ dist_avg_income_ln + dist_home_val_ln +
C(gender) + C(Ownrent) + C(Selfempl) + C(edu_class)
'''
ana8 = ols(formula8, exp2).fit()
ana8.summary()
OLS Regression Results

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | avg_exp_ln | R-squared: | 0.873 |
| Model: | OLS | Adj. R-squared: | 0.858 |
| Method: | Least Squares | F-statistic: | 58.71 |
| Date: | Thu, 06 Feb 2020 | Prob (F-statistic): | 1.75e-24 |
| Time: | 11:17:40 | Log-Likelihood: | 35.337 |
| No. Observations: | 68 | AIC: | -54.67 |
| Df Residuals: | 60 | BIC: | -36.92 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 4.5520 | 0.212 | 21.471 | 0.000 | 4.128 | 4.976 |
| C(gender)[T.1] | -0.4301 | 0.060 | -7.200 | 0.000 | -0.550 | -0.311 |
| C(Ownrent)[T.1] | 0.0184 | 0.045 | 0.413 | 0.681 | -0.071 | 0.107 |
| C(Selfempl)[T.1] | -0.3805 | 0.119 | -3.210 | 0.002 | -0.618 | -0.143 |
| C(edu_class)[T.2] | 0.2895 | 0.051 | 5.658 | 0.000 | 0.187 | 0.392 |
| C(edu_class)[T.3] | 0.4686 | 0.060 | 7.867 | 0.000 | 0.349 | 0.588 |
| dist_avg_income_ln | 0.9563 | 0.098 | 9.722 | 0.000 | 0.760 | 1.153 |
| dist_home_val_ln | 0.0522 | 0.034 | 1.518 | 0.134 | -0.017 | 0.121 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 3.788 | Durbin-Watson: | 2.129 |
| Prob(Omnibus): | 0.150 | Jarque-Bera (JB): | 4.142 |
| Skew: | 0.020 | Prob(JB): | 0.126 |
| Kurtosis: | 4.208 | Cond. No. | 60.2 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
formula9 = '''
avg_exp_ln ~ dist_avg_income_ln + dist_home_val_ln +
C(Selfempl) + C(gender):C(edu_class)
'''
ana9 = ols(formula9, exp2).fit()
ana9.summary()
OLS Regression Results

| | | | |
| --- | --- | --- | --- |
| Dep. Variable: | avg_exp_ln | R-squared: | 0.914 |
| Model: | OLS | Adj. R-squared: | 0.902 |
| Method: | Least Squares | F-statistic: | 78.50 |
| Date: | Thu, 06 Feb 2020 | Prob (F-statistic): | 1.42e-28 |
| Time: | 11:17:48 | Log-Likelihood: | 48.743 |
| No. Observations: | 68 | AIC: | -79.49 |
| Df Residuals: | 59 | BIC: | -59.51 |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 4.4098 | 0.178 | 24.839 | 0.000 | 4.055 | 4.765 |
| C(Selfempl)[T.1] | -0.2945 | 0.101 | -2.908 | 0.005 | -0.497 | -0.092 |
| C(edu_class)[T.2] | 0.3164 | 0.045 | 7.012 | 0.000 | 0.226 | 0.407 |
| C(edu_class)[T.3] | 0.5576 | 0.054 | 10.268 | 0.000 | 0.449 | 0.666 |
| C(gender)[T.1]:C(edu_class)[1] | -0.0054 | 0.098 | -0.055 | 0.956 | -0.201 | 0.190 |
| C(gender)[T.1]:C(edu_class)[2] | -0.4357 | 0.068 | -6.374 | 0.000 | -0.573 | -0.299 |
| C(gender)[T.1]:C(edu_class)[3] | -0.6001 | 0.065 | -9.230 | 0.000 | -0.730 | -0.470 |
| dist_avg_income_ln | 0.9893 | 0.078 | 12.700 | 0.000 | 0.833 | 1.145 |
| dist_home_val_ln | 0.0654 | 0.029 | 2.278 | 0.026 | 0.008 | 0.123 |

| | | | |
| --- | --- | --- | --- |
| Omnibus: | 5.023 | Durbin-Watson: | 1.722 |
| Prob(Omnibus): | 0.081 | Jarque-Bera (JB): | 5.070 |
| Skew: | -0.328 | Prob(JB): | 0.0793 |
| Kurtosis: | 4.166 | Cond. No. | 61.2 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Step 7: Regularized regression
1. Ridge regression
lmr = ols('avg_exp ~ Income + dist_home_val + dist_avg_income',
          data=exp).fit_regularized(alpha=1, L1_wt=0)  # L1_wt=0: pure ridge penalty
lmr.summary()
lmr1 = ols('avg_exp ~ Age + Income + dist_home_val + dist_avg_income',
           data=exp).fit_regularized(alpha=1, L1_wt=1)  # L1_wt=1: pure LASSO penalty
lmr1.summary()
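Depending on the statsmodels version, the results object returned by fit_regularized may not implement summary() (it can raise NotImplementedError); in that case the shrunken coefficients can be inspected directly:
print(lmr.params)   # ridge (L1_wt=0): coefficients shrunk toward zero
print(lmr1.params)  # LASSO (L1_wt=1): some coefficients can be exactly zero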
2. Tuning the regularization parameter with scikit-learn
from sklearn.preprocessing import StandardScaler
continuous_xcols = ['Age', 'Income', 'dist_home_val',
'dist_avg_income']
scaler = StandardScaler()
X = scaler.fit_transform(exp[continuous_xcols])
y = exp['avg_exp_ln']
d:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:645: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
return self.partial_fit(X, y)
d:\Anaconda3\lib\site-packages\sklearn\base.py:464: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
return self.fit(X, **fit_params).transform(X)
from sklearn.linear_model import RidgeCV
alphas = np.logspace(-2, 3, 100, base=10)
rcv = RidgeCV(alphas=alphas, store_cv_values=True)
rcv.fit(X, y)
RidgeCV(alphas=array([1.00000e-02, 1.12332e-02, ..., 8.90215e+02, 1.00000e+03]),
cv=None, fit_intercept=True, gcv_mode=None, normalize=False,
scoring=None, store_cv_values=True)
print('The best alpha is {}'.format(rcv.alpha_))
print('The r-square is {}'.format(rcv.score(X, y)))
The best alpha is 0.2915053062825176
The r-square is 0.47568267770194916
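As a sanity check (an addition, assuming X, y and rcv as above), the selected alpha can be re-scored with explicit 5-fold cross-validation:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

# average held-out squared error at the RidgeCV-selected alpha
mse = -cross_val_score(Ridge(alpha=rcv.alpha_), X, y,
                       cv=5, scoring='neg_mean_squared_error')
print(mse.mean())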
X_new = scaler.transform(exp_new[continuous_xcols])
np.exp(rcv.predict(X_new)[:5])
d:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
"""Entry point for launching an IPython kernel.
array([759.67677561, 606.74024213, 661.20654568, 681.888929 ,
641.06967182])
cv_values = rcv.cv_values_
# with cv=None, RidgeCV uses efficient leave-one-out CV, so the first axis of
# cv_values has one entry per sample (here 70) rather than per fold
n_fold, n_alphas = cv_values.shape
cv_mean = cv_values.mean(axis=0)
cv_std = cv_values.std(axis=0)
ub = cv_mean + cv_std / np.sqrt(n_fold)
lb = cv_mean - cv_std / np.sqrt(n_fold)
plt.semilogx(alphas, cv_mean, label='mean_score')
plt.fill_between(alphas, lb, ub, alpha=0.2)
plt.xlabel("$\\alpha$")
plt.ylabel("mean squared errors")
plt.legend(loc="best")
plt.show()
from sklearn.linear_model import Ridge
ridge = Ridge()
coefs = []
for alpha in alphas:
ridge.set_params(alpha=alpha)
ridge.fit(X, y)
coefs.append(ridge.coef_)
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
rcv.coef_
array([ 0.03321449, -0.30956185, 0.05551208, 0.59067449])
ridge.set_params(alpha=40)
ridge.fit(X, y)
ridge.coef_
array([0.03293109, 0.09907747, 0.04976305, 0.12101456])
ridge.score(X, y)
0.4255673043353688
np.exp(ridge.predict(X_new)[:5])
array([934.79025945, 727.11042209, 703.88143602, 759.04342764,
709.54172995])
from sklearn.linear_model import LassoCV
lasso_alphas = np.logspace(-3, 0, 100, base=10)
lcv = LassoCV(alphas=lasso_alphas, cv=10)
lcv.fit(X, y)
print('The best alpha is {}'.format(lcv.alpha_))
print('The r-square is {}'.format(lcv.score(X, y)))
The best alpha is 0.04037017258596556
The r-square is 0.4426451069862233
from sklearn.linear_model import Lasso
lasso = Lasso()
lasso_coefs = []
for alpha in lasso_alphas:
lasso.set_params(alpha=alpha)
lasso.fit(X, y)
lasso_coefs.append(lasso.coef_)
ax = plt.gca()
ax.plot(lasso_alphas, lasso_coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Lasso coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
lcv.coef_
array([0. , 0. , 0.02789489, 0.26549855])
With the L1 penalty, the coefficients for Age and Income are shrunk exactly to zero; only dist_home_val and dist_avg_income remain in the model.
from sklearn.linear_model import ElasticNetCV
l1_ratio = [.1, .5, .7, .9, .95, .99, 1]
encv = ElasticNetCV(l1_ratio=l1_ratio)
encv.fit(X,y)
print('The best l1_ratio is {}'.format(encv.l1_ratio_))
print('The best alpha is {}'.format(encv.alpha_))
d:\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
warnings.warn(CV_WARNING, FutureWarning)
The best l1_ratio is 0.1
The best alpha is 0.6293876843197391
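The chosen l1_ratio (0.1) sits at the lower edge of the candidate grid, so it may be worth widening the grid and setting cv explicitly; a minimal sketch with hypothetical grid values:
from sklearn.linear_model import ElasticNetCV

# wider l1_ratio grid below 0.1, with cv fixed to avoid the FutureWarning
encv2 = ElasticNetCV(l1_ratio=[.01, .05, .1, .5, .9, 1], cv=10)
encv2.fit(X, y)
print('l1_ratio: {}, alpha: {}'.format(encv2.l1_ratio_, encv2.alpha_))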