Data Science Case 3: Linear Regression on Auto Loans (Code)

  • 7 Linear Regression Models and Diagnostics
    • Step 1: Importing and cleaning the data
    • Step 2: Correlation analysis
    • Step 3: Linear regression
      • 1. Simple linear regression
      • 2. Multiple linear regression
        • 2.1 Variable selection for multiple linear regression
    • Step 4: Residual analysis
    • Step 5: Influential-point analysis
    • Step 6: Multicollinearity analysis (the vif function)
    • Step 7: Regularization
      • 1. Ridge regression
      • 2. Tuning the regularization parameter with scikit-learn

7 Linear Regression Models and Diagnostics

Data description: this is an auto loan dataset.

Field name        Meaning
id                ID
Acc               Card activated (1 = activated)
avg_exp           Average monthly credit card spending (yuan)
avg_exp_ln        Natural log of average monthly credit card spending
gender            Gender (1 = male)
Age               Age
Income            Annual income (10,000 yuan)
Ownrent           Owns home (1 = yes; 0 = no)
Selfempl          Self-employed (1 = yes, 0 = no)
dist_home_val     Average housing price in the home district (10,000 yuan)
dist_avg_income   Local average income
high_avg          Amount above the local average income
edu_class         Education level: 0 = primary school or below, 1 = middle school, 2 = bachelor, 3 = graduate

%matplotlib inline

import matplotlib.pyplot as plt
import os
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

Step 1: Importing and cleaning the data

(Missing values are simply dropped.)

raw = pd.read_csv(r'.\data\creditcard_exp.csv', skipinitialspace=True)
raw.head()
id Acc avg_exp avg_exp_ln gender Age Income Ownrent Selfempl dist_home_val dist_avg_income age2 high_avg edu_class
0 19 1 1217.03 7.104169 1 40 16.03515 1 1 99.93 15.932789 1600 0.102361 3
1 5 1 1251.50 7.132098 1 32 15.84750 1 0 49.88 15.796316 1024 0.051184 2
2 95 0 NaN NaN 1 36 8.40000 0 0 88.61 7.490000 1296 0.910000 1
3 86 1 856.57 6.752936 1 41 11.47285 1 0 16.10 11.275632 1681 0.197218 3
4 50 1 1321.83 7.186772 1 28 13.40915 1 0 100.39 13.346474 784 0.062676 2
# Training rows: avg_exp observed; drop the leading id/Acc columns and the redundant age2
exp = raw[raw['avg_exp'].notnull()].copy().iloc[:, 2:].drop('age2', axis=1)

# Rows to be predicted later: avg_exp missing
exp_new = raw[raw['avg_exp'].isnull()].copy().iloc[:, 2:].drop('age2', axis=1)

exp.describe(include='all')
avg_exp avg_exp_ln gender Age Income Ownrent Selfempl dist_home_val dist_avg_income high_avg edu_class
count 70.000000 70.000000 70.000000 70.000000 70.000000 70.000000 70.000000 70.000000 70.000000 70.000000 70.000000
mean 983.655429 6.787787 0.285714 31.157143 7.424706 0.385714 0.028571 74.540857 8.005472 -0.580766 1.928571
std 446.294237 0.476035 0.455016 7.206349 3.077986 0.490278 0.167802 36.949228 3.070744 0.432808 0.873464
min 163.180000 5.094854 0.000000 20.000000 3.493900 0.000000 0.000000 13.130000 3.828842 -1.526850 0.000000
25% 697.155000 6.547003 0.000000 26.000000 5.175662 0.000000 0.000000 49.302500 5.915553 -0.887981 1.000000
50% 884.150000 6.784627 0.000000 30.000000 6.443525 0.000000 0.000000 65.660000 7.084184 -0.612068 2.000000
75% 1229.585000 7.114415 1.000000 36.000000 8.494237 1.000000 0.000000 105.067500 9.123105 -0.302082 3.000000
max 2430.030000 7.795659 1.000000 55.000000 16.900150 1.000000 1.000000 157.900000 18.427000 0.259337 3.000000

Step 2: Correlation analysis

Scatter plot

exp.plot('Income', 'avg_exp', kind='scatter')
plt.show()
[Figure 1: scatter plot of avg_exp versus Income]
exp[['Income', 'avg_exp', 'Age', 'dist_home_val']].corr(method='pearson')
Income avg_exp Age dist_home_val
Income 1.000000 0.674011 0.369129 0.249153
avg_exp 0.674011 1.000000 0.258478 0.319499
Age 0.369129 0.258478 1.000000 0.109323
dist_home_val 0.249153 0.319499 0.109323 1.000000

Step 3: Linear regression

1. Simple linear regression

lm_s = ols('avg_exp ~ Income', data=exp).fit()
lm_s.summary()
OLS Regression Results
Dep. Variable: avg_exp R-squared: 0.454
Model: OLS Adj. R-squared: 0.446
Method: Least Squares F-statistic: 56.61
Date: Thu, 06 Feb 2020 Prob (F-statistic): 1.60e-10
Time: 10:14:45 Log-Likelihood: -504.69
No. Observations: 70 AIC: 1013.
Df Residuals: 68 BIC: 1018.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 258.0495 104.290 2.474 0.016 49.942 466.157
Income 97.7286 12.989 7.524 0.000 71.809 123.648
Omnibus: 3.714 Durbin-Watson: 1.424
Prob(Omnibus): 0.156 Jarque-Bera (JB): 3.507
Skew: 0.485 Prob(JB): 0.173
Kurtosis: 2.490 Cond. No. 21.4


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Predict: fitted values and residuals on the training set
pd.DataFrame([lm_s.predict(exp), lm_s.resid], index=['predict', 'resid']
            ).T.head()
predict resid
0 1825.141904 -608.111904
1 1806.803136 -555.303136
3 1379.274813 -522.704813
4 1568.506658 -246.676658
5 1238.281793 -422.251793
# Predictions on the new rows (where avg_exp is missing)
lm_s.predict(exp_new)[:5]
2     1078.969552
11     756.465245
13     736.919530
19     687.077955
20     666.554953
dtype: float64

2. Multiple linear regression

lm_m = ols('avg_exp ~ Age + Income + dist_home_val + dist_avg_income',
           data=exp).fit()
lm_m.summary()
OLS Regression Results
Dep. Variable: avg_exp R-squared: 0.542
Model: OLS Adj. R-squared: 0.513
Method: Least Squares F-statistic: 19.20
Date: Thu, 06 Feb 2020 Prob (F-statistic): 1.82e-10
Time: 10:15:18 Log-Likelihood: -498.59
No. Observations: 70 AIC: 1007.
Df Residuals: 65 BIC: 1018.
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -32.0078 186.874 -0.171 0.865 -405.221 341.206
Age 1.3723 5.605 0.245 0.807 -9.822 12.566
Income -166.7204 87.607 -1.903 0.061 -341.684 8.243
dist_home_val 1.5329 1.057 1.450 0.152 -0.578 3.644
dist_avg_income 261.8827 87.807 2.982 0.004 86.521 437.245
Omnibus: 5.234 Durbin-Watson: 1.582
Prob(Omnibus): 0.073 Jarque-Bera (JB): 5.174
Skew: 0.625 Prob(JB): 0.0752
Kurtosis: 2.540 Cond. No. 459.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

2.1 Variable selection for multiple linear regression

'''Forward selection: greedily add the variable that lowers AIC the most.'''
def forward_select(data, response):
    remaining = set(data.columns)
    remaining.remove(response)
    selected = []
    current_score, best_new_score = float('inf'), float('inf')
    while remaining:
        aic_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {}".format(
                response, ' + '.join(selected + [candidate]))
            aic = ols(formula=formula, data=data).fit().aic
            aic_with_candidates.append((aic, candidate))
        aic_with_candidates.sort(reverse=True)
        best_new_score, best_candidate = aic_with_candidates.pop()  # smallest AIC
        if current_score > best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            print('aic is {},continuing!'.format(current_score))
        else:
            print('forward selection over!')
            break

    formula = "{} ~ {} ".format(response, ' + '.join(selected))
    print('final formula is {}'.format(formula))
    model = ols(formula=formula, data=data).fit()
    return model
data_for_select = exp[['avg_exp', 'Income', 'Age', 'dist_home_val', 
                       'dist_avg_income']]
lm_m = forward_select(data=data_for_select, response='avg_exp')
print(lm_m.rsquared)
aic is 1007.6801413968117,continuing!
aic is 1005.4969816306302,continuing!
aic is 1005.2487355956046,continuing!
forward selection over!
final formula is avg_exp ~ dist_avg_income + Income + dist_home_val 
0.541151292841195

Step 4: Residual analysis

ana1 = lm_s  # start from the simple regression fitted above

exp['Pred'] = ana1.predict(exp)
exp['resid'] = ana1.resid
exp.plot('Income', 'resid',kind='scatter')
plt.show()
[Figure 2: residuals versus Income for the simple regression]
# When heteroscedasticity shows up, textbooks recommend weighted least squares,
# but in practice the most common fix is to take the log of the response variable.
ana1 = ols('avg_exp ~ Income', data=exp).fit()
ana1.summary()
OLS Regression Results
Dep. Variable: avg_exp R-squared: 0.454
Model: OLS Adj. R-squared: 0.446
Method: Least Squares F-statistic: 56.61
Date: Thu, 06 Feb 2020 Prob (F-statistic): 1.60e-10
Time: 10:15:54 Log-Likelihood: -504.69
No. Observations: 70 AIC: 1013.
Df Residuals: 68 BIC: 1018.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 258.0495 104.290 2.474 0.016 49.942 466.157
Income 97.7286 12.989 7.524 0.000 71.809 123.648
Omnibus: 3.714 Durbin-Watson: 1.424
Prob(Omnibus): 0.156 Jarque-Bera (JB): 3.507
Skew: 0.485 Prob(JB): 0.173
Kurtosis: 2.490 Cond. No. 21.4


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
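
As a side note, here is a minimal sketch of the textbook WLS alternative mentioned above (not in the original notebook; the Breusch-Pagan check and the 1/Income weights are illustrative assumptions):

# Hedged sketch: test for heteroscedasticity, then fit WLS with weights
# assumed inversely proportional to Income.
from statsmodels.stats.diagnostic import het_breuschpagan

lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(ana1.resid, ana1.model.exog)
print('Breusch-Pagan p-value: {:.4f}'.format(lm_pval))

ana1_wls = sm.WLS.from_formula('avg_exp ~ Income', data=exp,
                               weights=1.0 / exp['Income']).fit()
print(ana1_wls.params)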
ana2 = ols('avg_exp_ln ~ Income', exp).fit()
exp['Pred'] = ana2.predict(exp)
exp['resid'] = ana2.resid
#exp.plot('Income', 'resid',kind='scatter')
ana2.summary()
OLS Regression Results
Dep. Variable: avg_exp_ln R-squared: 0.403
Model: OLS Adj. R-squared: 0.394
Method: Least Squares F-statistic: 45.92
Date: Thu, 06 Feb 2020 Prob (F-statistic): 3.58e-09
Time: 10:15:59 Log-Likelihood: -28.804
No. Observations: 70 AIC: 61.61
Df Residuals: 68 BIC: 66.11
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 6.0587 0.116 52.077 0.000 5.827 6.291
Income 0.0982 0.014 6.776 0.000 0.069 0.127
Omnibus: 10.765 Durbin-Watson: 1.197
Prob(Omnibus): 0.005 Jarque-Bera (JB): 12.708
Skew: -0.688 Prob(JB): 0.00174
Kurtosis: 4.569 Cond. No. 21.4


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Taking logs makes the model easier to interpret (but R-squared actually went
# down here, so consider taking the log of the explanatory variable as well)
exp['Income_ln'] = np.log(exp['Income'])
ana3 = ols('avg_exp_ln ~ Income_ln', exp).fit()
exp['Pred'] = ana3.predict(exp)
exp['resid'] = ana3.resid
exp.plot('Income_ln', 'resid',kind='scatter')
plt.show()
ana3.summary()
[Figure 3: residuals versus Income_ln for the log-log model]
OLS Regression Results
Dep. Variable: avg_exp_ln R-squared: 0.480
Model: OLS Adj. R-squared: 0.473
Method: Least Squares F-statistic: 62.87
Date: Thu, 06 Feb 2020 Prob (F-statistic): 2.95e-11
Time: 10:16:30 Log-Likelihood: -23.950
No. Observations: 70 AIC: 51.90
Df Residuals: 68 BIC: 56.40
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 5.0611 0.222 22.833 0.000 4.619 5.503
Income_ln 0.8932 0.113 7.929 0.000 0.668 1.118
Omnibus: 8.382 Durbin-Watson: 1.368
Prob(Omnibus): 0.015 Jarque-Bera (JB): 8.074
Skew: -0.668 Prob(JB): 0.0177
Kurtosis: 3.992 Cond. No. 13.2


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Compare the candidate models. Caveat: ana1's R-squared is computed on avg_exp,
# while ana2/ana3 use avg_exp_ln, so only the latter two are directly comparable.
r_sq = {'exp~Income':ana1.rsquared, 'ln(exp)~Income':ana2.rsquared, 
        'ln(exp)~ln(Income)':ana3.rsquared}
print(r_sq)
{'exp~Income': 0.45429062315565294, 'ln(exp)~Income': 0.4030855555329651, 'ln(exp)~ln(Income)': 0.48039279938931057}
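
A hedged side note (not in the original): to compare ana3 with ana1 on the same scale, one illustrative check is to back-transform ana3's predictions and recompute R-squared on avg_exp directly (this ignores the retransformation/smearing bias):

# Hedged sketch: put ana3 on the original avg_exp scale for a fair comparison
pred_exp_scale = np.exp(ana3.predict(exp))
ss_res = ((exp['avg_exp'] - pred_exp_scale) ** 2).sum()
ss_tot = ((exp['avg_exp'] - exp['avg_exp'].mean()) ** 2).sum()
print('R-squared on the original scale: {:.4f}'.format(1 - ss_res / ss_tot))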

Step5、强影响点分析

# Method 1:
# find outliers via standardized residuals
exp['resid_t'] = (exp['resid'] - exp['resid'].mean()) / exp['resid'].std()
exp[abs(exp['resid_t']) > 2]
avg_exp avg_exp_ln gender Age Income Ownrent Selfempl dist_home_val dist_avg_income high_avg edu_class Pred resid Income_ln resid_t
73 251.56 5.527682 0 29 5.1578 0 0 63.23 5.492947 -0.335147 0 6.526331 -0.998649 1.640510 -2.910292
98 163.18 5.094854 0 22 3.8159 0 0 63.27 3.997789 -0.181889 0 6.257191 -1.162337 1.339177 -3.387317
# Drop outliers (|standardized residual| > 2)
exp2 = exp[abs(exp['resid_t']) <= 2].copy()
ana4 = ols('avg_exp_ln ~ Income_ln', exp2).fit()
exp2['Pred'] = ana4.predict(exp2)
exp2['resid'] = ana4.resid
exp2.plot('Income', 'resid', kind='scatter')
plt.show()
ana4.rsquared
[Figure 4: residuals versus Income after removing outliers]
0.49397191385172456
# Method 2:
# the statsmodels package provides many more influential-point diagnostics
from statsmodels.stats.outliers_influence import OLSInfluence
OLSInfluence(ana3).summary_frame().head()
dfb_Intercept dfb_Income_ln cooks_d standard_resid hat_diag dffits_internal student_resid dffits
0 0.343729 -0.381393 0.085587 -1.319633 0.089498 -0.413732 -1.326996 -0.416040
1 0.307196 -0.341294 0.069157 -1.201699 0.087409 -0.371907 -1.205702 -0.373146
3 0.207619 -0.244956 0.044984 -1.440468 0.041557 -0.299947 -1.452165 -0.302382
4 0.112301 -0.127713 0.010759 -0.575913 0.060926 -0.146693 -0.573062 -0.145967
5 0.120572 -0.150924 0.022274 -1.221080 0.029011 -0.211064 -1.225579 -0.211842
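As a hedged follow-up (not in the original), influential rows can be flagged with the common 4/n Cook's distance rule of thumb:

# Flag observations whose Cook's distance exceeds 4/n (rule of thumb)
infl = OLSInfluence(ana3)
cooks_d = infl.cooks_distance[0]   # first element is the distance itself
print(exp.index[cooks_d > 4 / len(exp)])
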
# ### Adding variables
# The single-variable regressions gave us a feel for the model; now we can add
# more continuous predictors. Before adding a variable, mind its functional form:
# local housing price and local average income behave like personal income, so
# they should be log-transformed as well.

exp2['dist_home_val_ln'] = np.log(exp2['dist_home_val'])
exp2['dist_avg_income_ln'] = np.log(exp2['dist_avg_income'])

ana5 = ols('''avg_exp_ln ~ Age + Income_ln + 
           dist_home_val_ln + dist_avg_income_ln''', exp2).fit()
exp2.plot('Income', 'resid', kind='scatter')
plt.show()
ana5.rsquared
[Figure 5: residuals versus Income for the expanded model]
0.5529068646270383

Step 6: Multicollinearity analysis (the vif function)

Stepwise regression does not always work.

ana5.bse  # standard errors of the estimates; Income_ln and dist_avg_income_ln look inflated
Intercept             0.317453
Age                   0.005124
Income_ln             0.568848
dist_home_val_ln      0.058210
dist_avg_income_ln    0.612197
dtype: float64
# Note: statsmodels.stats.outliers_influence.variance_inflation_factor fits OLS
# without adding an intercept, which can yield a misleading R-squared, so we
# define our own VIF here.
def vif(df, col_i):
    '''Variance inflation factor of col_i when regressed on the other columns.'''
    cols = list(df.columns)
    cols.remove(col_i)
    formula = col_i + ' ~ ' + ' + '.join(cols)
    r2 = ols(formula, df).fit().rsquared
    return 1. / (1. - r2)
exog = exp2[['Income_ln', 'dist_home_val_ln',
             'dist_avg_income_ln']]
for i in exog.columns:
    print(i, '\t', vif(df=exog, col_i=i))

Income_ln 	 36.653639058963186
dist_home_val_ln 	 1.053596313570258
dist_avg_income_ln 	 36.894876856102
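For comparison, a hedged sketch using the built-in statsmodels function with the intercept added explicitly (it should then agree with the formula-based VIFs above):

# Built-in VIF, after adding the constant column it needs
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(exog)
for i, col in enumerate(exog.columns, start=1):  # index 0 is the constant
    print(col, variance_inflation_factor(X_vif.values, i))
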
# Income_ln and dist_avg_income_ln are collinear; replace one of them with the
# ratio of income above the local average
exp2['high_avg_ratio'] = exp2['high_avg'] / exp2['dist_avg_income']
exog1 = exp2[['high_avg_ratio', 'dist_home_val_ln', 
              'dist_avg_income_ln']]

for i in exog1.columns:
    print(i, '\t', vif(df=exog1, col_i=i))
high_avg_ratio 	 1.1230220802048871
dist_home_val_ln 	 1.0527009887483532
dist_avg_income_ln 	 1.1762825351755393
var_select = exp2[['avg_exp_ln', 'high_avg_ratio', 
                   'dist_home_val_ln', 'dist_avg_income_ln']]
ana7 = forward_select(data=var_select, response='avg_exp_ln')
print(ana7.rsquared)
aic is 23.816793700737392,continuing!
aic is 20.830952279560805,continuing!
forward selection over!
final formula is avg_exp_ln ~ dist_avg_income_ln + dist_home_val_ln 
0.552039773684598
formula8 = '''
avg_exp_ln ~ dist_avg_income_ln + dist_home_val_ln + 
C(gender) + C(Ownrent) + C(Selfempl) + C(edu_class)
'''
ana8 = ols(formula8, exp2).fit()
ana8.summary()
OLS Regression Results
Dep. Variable: avg_exp_ln R-squared: 0.873
Model: OLS Adj. R-squared: 0.858
Method: Least Squares F-statistic: 58.71
Date: Thu, 06 Feb 2020 Prob (F-statistic): 1.75e-24
Time: 11:17:40 Log-Likelihood: 35.337
No. Observations: 68 AIC: -54.67
Df Residuals: 60 BIC: -36.92
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.5520 0.212 21.471 0.000 4.128 4.976
C(gender)[T.1] -0.4301 0.060 -7.200 0.000 -0.550 -0.311
C(Ownrent)[T.1] 0.0184 0.045 0.413 0.681 -0.071 0.107
C(Selfempl)[T.1] -0.3805 0.119 -3.210 0.002 -0.618 -0.143
C(edu_class)[T.2] 0.2895 0.051 5.658 0.000 0.187 0.392
C(edu_class)[T.3] 0.4686 0.060 7.867 0.000 0.349 0.588
dist_avg_income_ln 0.9563 0.098 9.722 0.000 0.760 1.153
dist_home_val_ln 0.0522 0.034 1.518 0.134 -0.017 0.121
Omnibus: 3.788 Durbin-Watson: 2.129
Prob(Omnibus): 0.150 Jarque-Bera (JB): 4.142
Skew: 0.020 Prob(JB): 0.126
Kurtosis: 4.208 Cond. No. 60.2


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
formula9 = '''
avg_exp_ln ~ dist_avg_income_ln + dist_home_val_ln + 
C(Selfempl) + C(gender):C(edu_class)
'''
ana9 = ols(formula9, exp2).fit()
ana9.summary()
OLS Regression Results
Dep. Variable: avg_exp_ln R-squared: 0.914
Model: OLS Adj. R-squared: 0.902
Method: Least Squares F-statistic: 78.50
Date: Thu, 06 Feb 2020 Prob (F-statistic): 1.42e-28
Time: 11:17:48 Log-Likelihood: 48.743
No. Observations: 68 AIC: -79.49
Df Residuals: 59 BIC: -59.51
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.4098 0.178 24.839 0.000 4.055 4.765
C(Selfempl)[T.1] -0.2945 0.101 -2.908 0.005 -0.497 -0.092
C(edu_class)[T.2] 0.3164 0.045 7.012 0.000 0.226 0.407
C(edu_class)[T.3] 0.5576 0.054 10.268 0.000 0.449 0.666
C(gender)[T.1]:C(edu_class)[1] -0.0054 0.098 -0.055 0.956 -0.201 0.190
C(gender)[T.1]:C(edu_class)[2] -0.4357 0.068 -6.374 0.000 -0.573 -0.299
C(gender)[T.1]:C(edu_class)[3] -0.6001 0.065 -9.230 0.000 -0.730 -0.470
dist_avg_income_ln 0.9893 0.078 12.700 0.000 0.833 1.145
dist_home_val_ln 0.0654 0.029 2.278 0.026 0.008 0.123
Omnibus: 5.023 Durbin-Watson: 1.722
Prob(Omnibus): 0.081 Jarque-Bera (JB): 5.070
Skew: -0.328 Prob(JB): 0.0793
Kurtosis: 4.166 Cond. No. 61.2


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Step 7: Regularization

1. Ridge regression

# L1_wt=0 gives ridge regression; L1_wt=1 gives the LASSO
lmr = ols('avg_exp ~ Income + dist_home_val + dist_avg_income',
          data=exp).fit_regularized(alpha=1, L1_wt=0)

lmr.summary()
# ### LASSO
lmr1 = ols('avg_exp ~ Age + Income + dist_home_val + dist_avg_income',
           data=exp).fit_regularized(alpha=1, L1_wt=1)
lmr1.summary()
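
A hedged note: depending on the statsmodels version, fit_regularized returns a RegularizedResults object that has no .summary() method; the shrunken coefficients can always be inspected directly:

print(lmr.params)    # ridge-shrunken coefficients
print(lmr1.params)   # LASSO coefficients; weak predictors are driven to exactly 0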

2. Tuning the regularization parameter with scikit-learn

from sklearn.preprocessing import StandardScaler

continuous_xcols = ['Age', 'Income', 'dist_home_val', 
                    'dist_avg_income']   # continuous predictors
scaler = StandardScaler()  # standardize to zero mean and unit variance
X = scaler.fit_transform(exp[continuous_xcols])
y = exp['avg_exp_ln']
d:\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:645: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)
d:\Anaconda3\lib\site-packages\sklearn\base.py:464: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
from sklearn.linear_model import RidgeCV

alphas = np.logspace(-2, 3, 100, base=10)

# Search for the alpha with the minimum MSE by CV
rcv = RidgeCV(alphas=alphas, store_cv_values=True) 
rcv.fit(X, y)
RidgeCV(alphas=array([1.00000e-02, 1.12332e-02, ..., 8.90215e+02, 1.00000e+03]),
    cv=None, fit_intercept=True, gcv_mode=None, normalize=False,
    scoring=None, store_cv_values=True)
print('The best alpha is {}'.format(rcv.alpha_))
print('The r-square is {}'.format(rcv.score(X, y))) 
# Default score is rsquared
The best alpha is 0.2915053062825176
The r-square is 0.47568267770194916
X_new = scaler.transform(exp_new[continuous_xcols])
np.exp(rcv.predict(X_new)[:5])
d:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
  """Entry point for launching an IPython kernel.

array([759.67677561, 606.74024213, 661.20654568, 681.888929  ,
       641.06967182])
cv_values = rcv.cv_values_  # shape (n_samples, n_alphas): leave-one-out squared errors
n_obs, n_alphas = cv_values.shape

cv_mean = cv_values.mean(axis=0)
cv_std = cv_values.std(axis=0)
ub = cv_mean + cv_std / np.sqrt(n_obs)
lb = cv_mean - cv_std / np.sqrt(n_obs)

plt.semilogx(alphas, cv_mean, label='mean_score')
plt.fill_between(alphas, lb, ub, alpha=0.2)
plt.xlabel("$\\alpha$")
plt.ylabel("mean squared errors")
plt.legend(loc="best")
plt.show()
[Figure 6: mean cross-validation error versus alpha]
# Manually choosing the regularization strength, guided by business judgment

# Ridge trace plot

from sklearn.linear_model import Ridge

ridge = Ridge()

coefs = []
for alpha in alphas:
    ridge.set_params(alpha=alpha)
    ridge.fit(X, y)
    coefs.append(ridge.coef_)


ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
[Figure 7: ridge coefficient paths versus alpha]
rcv.coef_
array([ 0.03321449, -0.30956185,  0.05551208,  0.59067449])
ridge.set_params(alpha=40)  # alpha picked from the ridge trace, by business judgment
ridge.fit(X, y)
ridge.coef_
array([0.03293109, 0.09907747, 0.04976305, 0.12101456])
ridge.score(X, y)
0.4255673043353688
np.exp(ridge.predict(X_new)[:5])
array([934.79025945, 727.11042209, 703.88143602, 759.04342764,
       709.54172995])
# lasso

from sklearn.linear_model import LassoCV

lasso_alphas = np.logspace(-3, 0, 100, base=10)
lcv = LassoCV(alphas=lasso_alphas, cv=10) # Search the min MSE by CV
lcv.fit(X, y)

print('The best alpha is {}'.format(lcv.alpha_))
print('The r-square is {}'.format(lcv.score(X, y))) 
# Default score is rsquared
The best alpha is 0.04037017258596556
The r-square is 0.4426451069862233
from sklearn.linear_model import Lasso

lasso = Lasso()
lasso_coefs = []
for alpha in lasso_alphas:
    lasso.set_params(alpha=alpha)
    lasso.fit(X, y)
    lasso_coefs.append(lasso.coef_)
ax = plt.gca()
ax.plot(lasso_alphas, lasso_coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Lasso coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
[Figure 8: LASSO coefficient paths versus alpha]
lcv.coef_
array([0.        , 0.        , 0.02789489, 0.26549855])
# Elastic net
from sklearn.linear_model import ElasticNetCV

l1_ratio = [.1, .5, .7, .9, .95, .99, 1]  # 0 (pure ridge) is excluded

encv = ElasticNetCV(l1_ratio=l1_ratio)  # pass cv explicitly (e.g. cv=5) to silence the warning below
encv.fit(X,y)

print('The best l1_ratio is {}'.format(encv.l1_ratio_))
print('The best alpha is {}'.format(encv.alpha_))
d:\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
The best l1_ratio is 0.1
The best alpha is 0.6293876843197391
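
As a short follow-up sketch (not in the original), the elastic-net fit can be inspected the same way as the ridge and LASSO fits:

print(encv.coef_)                            # elastic-net coefficients
print('The r-square is {}'.format(encv.score(X, y)))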
