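This post uses Example 7.8 from Chapter 7, "Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables," of Wooldridge's Introductory Econometrics: A Modern Approach to show how to run a regression on dummy variables in Python.
Example 7.8. The file LAWSCH85 contains data on median starting salaries for law school graduates. A key explanatory variable is the law school's rank. Because every law school has its own rank, we clearly cannot include a dummy variable for each rank. Instead we convert the rank into rank ranges, which calls for the pandas.cut function.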
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
lawsch85 = woo.dataWoo('lawsch85')
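The main parameters of pandas.cut are:
x: the one-dimensional array to be binned.
bins: the criterion for binning.
- sequence of scalars: defines the bin edges of each interval; in this case the range of x is not extended.
right: defaults to True; whether the intervals include their right edge. With bins [1, 2, 3, 4] the intervals are (1,2], (2,3], (3,4]. With right=False the right edge is excluded and the left edge included, so the intervals are [1,2), [2,3), [3,4).
labels: the label for each bin.
retbins: short for "return bins"; whether to also return the bin edges. Defaults to False.
precision: the decimal precision used for the bin labels. Defaults to 3.
include_lowest: whether the first interval includes its left edge. Defaults to False. With bins np.arange(0, 101, 10), 0 is excluded by default and the first bin is (0, 10]. With include_lowest=True, 0 is included and the first bin is shown as (-0.001, 10.0].
out: a pandas.Categorical, Series, or ndarray giving the bin (interval) that each value of x falls into; if labels is specified, the corresponding label is returned instead.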
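As a small illustration of the parameters above, the sketch below applies pd.cut to a made-up series. The values and the names `scores` and `edges` are assumptions chosen so that some points sit exactly on bin edges; they are not part of the LAWSCH85 data.

```python
import pandas as pd

# Hypothetical values, several of which fall exactly on a bin edge
scores = pd.Series([0, 10, 11, 25, 100])
edges = [0, 10, 25, 100]

# Defaults (right=True, include_lowest=False): (0,10], (10,25], (25,100]
# The value 0 falls outside the first interval and becomes NaN.
default_bins = pd.cut(scores, bins=edges)

# include_lowest=True: the first interval also contains its left edge
with_lowest = pd.cut(scores, bins=edges, include_lowest=True)

# right=False: [0,10), [10,25), [25,100)
# The value 100 now falls outside the last interval and becomes NaN.
left_closed = pd.cut(scores, bins=edges, right=False)

# labels=False returns integer bin positions instead of interval labels
codes = pd.cut(scores, bins=edges, labels=False, include_lowest=True)

print(pd.DataFrame({
    'score': scores,
    'default': default_bins,
    'include_lowest': with_lowest,
    'right=False': left_closed,
    'codes': codes,
}))
```

Comparing the columns side by side shows which edge each setting includes, which matters for LAWSCH85 because ranks such as 10 or 25 sit exactly on a cut point.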
0 128
1 104
2 34
3 49
4 95
151 17
152 21
153 143
154 3
155 120
Name: rank, Length: 156, dtype: int64
cutpts = [0, 10, 25, 40, 60, 100, 175]
lawsch85['rc'] = pd.cut(lawsch85['rank'], bins=cutpts,
labels=['(0,10]', '(10,25]', '(25,40]',
'(40,60]', '(60,100]', '(100,175]'])
0 (100,175]
1 (100,175]
2 (25,40]
3 (40,60]
4 (60,100]
151 (10,25]
152 (10,25]
153 (100,175]
154 (0,10]
155 (100,175]
Name: rc, Length: 156, dtype: category
Categories (6, object): ['(0,10]' < '(10,25]' < '(25,40]' < '(40,60]' < '(60,100]' < '(100,175]']
reg = smf.ols(formula='np.log(salary) ~ C(rc)+ LSAT + GPA + np.log(libvol) + np.log(cost)',data=lawsch85)
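Here C() marks a variable as categorical (C stands for "Categorical"). If the variable holds strings or is binary, C() is optional, because the regression model treats the variable as categorical either way. If the variable is numeric and takes more than two distinct values, it must be wrapped in C(); otherwise the model treats it as a continuous regressor.
In this example rc is a categorical column with string labels, so formula='np.log(salary) ~ C(rc)+ LSAT + GPA + np.log(libvol) + np.log(cost)'
and formula='np.log(salary) ~ rc+ LSAT + GPA + np.log(libvol) + np.log(cost)' give the same fit.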
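To make the difference between a numeric regressor and a C()-wrapped one concrete, here is a minimal sketch on synthetic data. The data frame `df`, its columns `group` and `y`, and the coefficients used to generate them are all assumptions for illustration, not taken from LAWSCH85.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (illustrative assumption): a numeric code with three levels
rng = np.random.default_rng(0)
df = pd.DataFrame({'group': np.repeat([1, 2, 3], 20)})
df['y'] = (1.0
           + 0.5 * (df['group'] == 2)
           + 2.0 * (df['group'] == 3)
           + rng.normal(0, 0.1, len(df)))

# Without C(): 'group' enters as a single continuous regressor (one slope)
fit_numeric = smf.ols('y ~ group', data=df).fit()

# With C(): 'group' is expanded into one dummy per non-base level
fit_categorical = smf.ols('y ~ C(group)', data=df).fit()

print(fit_numeric.params.index.tolist())
print(fit_categorical.params.index.tolist())
```

The first model estimates one slope for `group`, while the second estimates separate shifts for levels 2 and 3 relative to level 1, which is why a numeric code with more than two values needs C().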
results = reg.fit()
print(results.summary())
OLS Regression Results
Dep. Variable: np.log(salary) R-squared: 0.911
Model: OLS Adj. R-squared: 0.905
Method: Least Squares F-statistic: 143.2
Date: Sun, 17 Apr 2022 Prob (F-statistic): 9.45e-62
Time: 15:12:27 Log-Likelihood: 146.45
No. Observations: 136 AIC: -272.9
Df Residuals: 126 BIC: -243.8
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 9.8649 0.450 21.930 0.000 8.975 10.755
C(rc)[T.(10,25]] -0.1060 0.039 -2.739 0.007 -0.183 -0.029
C(rc)[T.(25,40]] -0.3245 0.044 -7.318 0.000 -0.412 -0.237
C(rc)[T.(40,60]] -0.4367 0.046 -9.512 0.000 -0.528 -0.346
C(rc)[T.(60,100]] -0.5680 0.047 -12.038 0.000 -0.661 -0.475
C(rc)[T.(100,175]] -0.6996 0.053 -13.078 0.000 -0.805 -0.594
LSAT 0.0057 0.003 1.858 0.066 -0.000 0.012
GPA 0.0137 0.074 0.185 0.854 -0.133 0.161
np.log(libvol) 0.0364 0.026 1.398 0.165 -0.015 0.088
np.log(cost) 0.0008 0.025 0.033 0.973 -0.049 0.051
Omnibus: 9.419 Durbin-Watson: 1.926
Prob(Omnibus): 0.009 Jarque-Bera (JB): 20.478
Skew: 0.100 Prob(JB): 3.57e-05
Kurtosis: 4.890 Cond. No. 9.85e+03
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.85e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
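The categorical variable rc has six categories: '(0,10]', '(10,25]', '(25,40]', '(40,60]', '(60,100]', and '(100,175]'. Six categories call for five dummy variables, with the remaining category serving as the base group. If no base group is specified, C() picks one automatically (the first category, '(0,10]', as the output above shows). We can also specify the base group ourselves, in the form C(variable, Treatment(base group)).
reg = smf.ols(formula='np.log(salary) ~ C(rc,Treatment("(10,25]"))+ LSAT + GPA + np.log(libvol) + np.log(cost)',data=lawsch85)
results = reg.fit()
print(results.summary())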
OLS Regression Results
Dep. Variable: np.log(salary) R-squared: 0.911
Model: OLS Adj. R-squared: 0.905
Method: Least Squares F-statistic: 143.2
Date: Sun, 17 Apr 2022 Prob (F-statistic): 9.45e-62
Time: 15:28:40 Log-Likelihood: 146.45
No. Observations: 136 AIC: -272.9
Df Residuals: 126 BIC: -243.8
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 9.7588 0.436 22.388 0.000 8.896 10.621
C(rc, Treatment("(10,25]"))[T.(0,10]] 0.1060 0.039 2.739 0.007 0.029 0.183
C(rc, Treatment("(10,25]"))[T.(25,40]] -0.2185 0.035 -6.164 0.000 -0.289 -0.148
C(rc, Treatment("(10,25]"))[T.(40,60]] -0.3307 0.035 -9.480 0.000 -0.400 -0.262
C(rc, Treatment("(10,25]"))[T.(60,100]] -0.4619 0.034 -13.460 0.000 -0.530 -0.394
C(rc, Treatment("(10,25]"))[T.(100,175]] -0.5935 0.039 -15.049 0.000 -0.672 -0.515
LSAT 0.0057 0.003 1.858 0.066 -0.000 0.012
GPA 0.0137 0.074 0.185 0.854 -0.133 0.161
np.log(libvol) 0.0364 0.026 1.398 0.165 -0.015 0.088
np.log(cost) 0.0008 0.025 0.033 0.973 -0.049 0.051
Omnibus: 9.419 Durbin-Watson: 1.926
Prob(Omnibus): 0.009 Jarque-Bera (JB): 20.478
Skew: 0.100 Prob(JB): 3.57e-05
Kurtosis: 4.890 Cond. No. 9.48e+03
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.48e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
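The two summaries above describe the same fit (same R-squared, F-statistic, and non-dummy coefficients); only the reference point of the dummies moves. The sketch below checks this with plain arithmetic on the coefficients copied from the first table: subtracting the coefficient of the new base group from every dummy, and adding it to the intercept, should reproduce the second table up to rounding. The names `base_top10`, `shift`, and `rebased` are illustrative.

```python
# Dummy coefficients and intercept copied from the first summary,
# where the base group is '(0,10]'
base_top10 = {
    'Intercept': 9.8649,
    '(10,25]': -0.1060,
    '(25,40]': -0.3245,
    '(40,60]': -0.4367,
    '(60,100]': -0.5680,
    '(100,175]': -0.6996,
}

new_base = '(10,25]'
shift = base_top10[new_base]  # effect of the new base relative to the old one

rebased = {
    # The intercept now measures the new base group
    'Intercept': round(base_top10['Intercept'] + shift, 4),
    # The old base group gets the negated coefficient
    '(0,10]': round(-shift, 4),
}
for level, coef in base_top10.items():
    if level not in ('Intercept', new_base):
        # Every other level is re-expressed relative to the new base
        rebased[level] = round(coef - shift, 4)

for name, value in rebased.items():
    print(f'{name:>10}: {value: .4f}')
```

The printed values agree with the second summary to within 0.0001; the small gaps arise because the inputs here are already rounded to four decimals.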
The complete code is as follows:
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
lawsch85 = woo.dataWoo('lawsch85')
cutpts = [0, 10, 25, 40, 60, 100, 175]
lawsch85['rc'] = pd.cut(lawsch85['rank'], bins=cutpts,
labels=['(0,10]', '(10,25]', '(25,40]',
'(40,60]', '(60,100]', '(100,175]'])
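# Default base group: the first category, '(0,10]'
reg = smf.ols(formula='np.log(salary) ~ C(rc)+ LSAT + GPA + np.log(libvol) + np.log(cost)',data=lawsch85)
results = reg.fit()
print(results.summary())
# Explicit base group '(10,25]'; note that Treatment is capitalized
reg = smf.ols(formula='np.log(salary) ~ C(rc,Treatment("(10,25]"))+ LSAT + GPA + np.log(libvol) + np.log(cost)',data=lawsch85)
results = reg.fit()
print(results.summary())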