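We use Example 15.5 from Chapter 15, "Instrumental Variables Estimation and Two Stage Least Squares", of Wooldridge's Introductory Econometrics: A Modern Approach to learn how to implement the instrumental variables (IV) method in Python. The example uses the MROZ data on the return to education for working women in the United States.
Variables: the dependent variable log(wage) is the log of the wage; the explanatory variables are educ, years of formal education, and exper, work experience.
The model is: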
$\log(wage)=\beta_0+\beta_1 educ+\beta_2 exper+\beta_3 exper^2+u$
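This equation accounts only for a working woman's own years of formal education, so relevant variables are probably omitted, which creates an endogeneity problem. We therefore use the father's education fatheduc and the mother's education motheduc as instrumental variables for educ. Both should be correlated with educ and uncorrelated with u.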
What is endogeneity?
When does endogeneity arise?
Requirements for an instrumental variable
As an example, take the simple model $y=\alpha+\beta x+\varepsilon$.
If the error term is correlated with $x$, we can look for an instrumental variable $z$ that satisfies two conditions:
(1) Relevance: $z$ is correlated with $x$, that is, $Cov(x,z)\neq 0$.
(2) Exogeneity: $z$ is uncorrelated with the error term, that is, $Cov(\varepsilon,z)=0$.
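To see why these two conditions are enough to pin down $\beta$, it may help to take the covariance of both sides of the simple model with $z$. This is a short sketch of the standard argument, using only the symbols defined above:

```latex
\begin{aligned}
Cov(z,y) &= Cov(z,\alpha+\beta x+\varepsilon) \\
         &= \beta\,Cov(z,x) + Cov(z,\varepsilon) \\
         &= \beta\,Cov(z,x) \qquad \text{(exogeneity: } Cov(\varepsilon,z)=0\text{)} \\
\Rightarrow\quad \beta &= \frac{Cov(z,y)}{Cov(z,x)} \qquad \text{(relevance: } Cov(x,z)\neq 0\text{)}
\end{aligned}
```

Exogeneity removes the term involving the error, and relevance guarantees that the denominator is not zero.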
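Implementing the instrumental variables method
The IV method is usually implemented with two-stage least squares (2SLS). The two stages are:
(1) Regress x on z and obtain the fitted value of x.
(2) Regress y on the fitted value of x to obtain $\beta$. The fitted value of x is a linear function of the exogenous z, so it is uncorrelated with the error term, and the resulting estimator is consistent.
Put simply, the first stage of 2SLS splits x into two parts: the fitted value, which is explained by z, and the remainder, which carries the part of x that is correlated with the error term. The second stage then regresses y on the fitted value of x, that is, on x with its endogenous part removed, which is why the estimate is consistent.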
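The two stages can be sketched on simulated data before turning to the real sample. The block below is a minimal illustration using only NumPy. The data-generating process, the seed, and all variable names are assumptions chosen for the example and are not part of the MROZ analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated data (all numbers are illustrative assumptions)
z = rng.normal(size=n)                    # instrument: drives x, unrelated to eps
eps = rng.normal(size=n)                  # structural error term
x = z + 0.8 * eps + rng.normal(size=n)    # x is endogenous because it contains eps
y = 1.0 + 2.0 * x + eps                   # the true slope is 2


def ols(X, y):
    """Return the OLS coefficients from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


const = np.ones(n)

# Naive OLS of y on x: biased because Cov(x, eps) != 0
beta_ols = ols(np.column_stack([const, x]), y)[1]

# Stage 1: regress x on z and keep the fitted values
Z = np.column_stack([const, z])
x_hat = Z @ ols(Z, x)

# Stage 2: regress y on the fitted values of x
beta_2sls = ols(np.column_stack([const, x_hat]), y)[1]

print(f"OLS slope:  {beta_ols:.3f}")
print(f"2SLS slope: {beta_2sls:.3f}")
```

In this design the OLS slope should land noticeably above 2, while the 2SLS slope should be close to 2.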
import wooldridge as woo
import pandas as pd
mroz = woo.dataWoo('mroz')
# Drop observations with a missing lwage
mroz = mroz.dropna(subset=['lwage'])
(1) Regress the endogenous variable on the instruments and the exogenous variables to obtain its fitted values
$educ=\pi_0+\pi_1 exper+\pi_2 exper^2+\pi_3 motheduc+\pi_4 fatheduc+v$
import statsmodels.formula.api as smf
# First-stage regression
reg_1st = smf.ols(formula='educ ~ exper + expersq + motheduc + fatheduc',
                  data=mroz)
results_1st = reg_1st.fit()
# Store the fitted values of educ for use in the second stage
mroz['educ_fitted'] = results_1st.fittedvalues
print(results_1st.summary())
The results are:
OLS Regression Results
==============================================================================
Dep. Variable: educ R-squared: 0.211
Model: OLS Adj. R-squared: 0.204
Method: Least Squares F-statistic: 28.36
Date: Sun, 17 Jul 2022 Prob (F-statistic): 6.87e-21
Time: 16:34:39 Log-Likelihood: -909.72
No. Observations: 428 AIC: 1829.
Df Residuals: 423 BIC: 1850.
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 9.1026 0.427 21.340 0.000 8.264 9.941
exper 0.0452 0.040 1.124 0.262 -0.034 0.124
expersq -0.0010 0.001 -0.839 0.402 -0.003 0.001
motheduc 0.1576 0.036 4.391 0.000 0.087 0.228
fatheduc 0.1895 0.034 5.615 0.000 0.123 0.256
==============================================================================
Omnibus: 10.903 Durbin-Watson: 1.940
Prob(Omnibus): 0.004 Jarque-Bera (JB): 20.371
Skew: -0.013 Prob(JB): 3.77e-05
Kurtosis: 4.068 Cond. No. 1.55e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.55e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
(2) Regress the dependent variable on the fitted values of the endogenous variable
$\log(wage)=\beta_0+\beta_1\widehat{educ}+\beta_2 exper+\beta_3 exper^2+u$
# Second-stage regression
reg_2nd = smf.ols(formula='lwage ~ educ_fitted + exper + expersq',
data=mroz)
results_2nd = reg_2nd.fit()
print(results_2nd.summary())
The results are:
OLS Regression Results
==============================================================================
Dep. Variable: lwage R-squared: 0.050
Model: OLS Adj. R-squared: 0.043
Method: Least Squares F-statistic: 7.405
Date: Sun, 17 Jul 2022 Prob (F-statistic): 7.62e-05
Time: 16:40:02 Log-Likelihood: -457.17
No. Observations: 428 AIC: 922.3
Df Residuals: 424 BIC: 938.6
Df Model: 3
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 0.0481 0.420 0.115 0.909 -0.777 0.873
educ_fitted 0.0614 0.033 1.863 0.063 -0.003 0.126
exper 0.0442 0.014 3.136 0.002 0.016 0.072
expersq -0.0009 0.000 -2.134 0.033 -0.002 -7.11e-05
==============================================================================
Omnibus: 53.587 Durbin-Watson: 1.959
Prob(Omnibus): 0.000 Jarque-Bera (JB): 168.354
Skew: -0.551 Prob(JB): 2.77e-37
Kurtosis: 5.868 Cond. No. 4.41e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.41e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
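The linearmodels package provides IV2SLS, which runs both stages in a single call. First import it:
from linearmodels.iv import IV2SLS
Then build the model from a formula and run the 2SLS regression:
IV2SLS.from_formula(formula, data)
formula: the regression equation, written as dep ~ exog + [endog ~ instr], where exog are the exogenous variables, endog the endogenous variables, and instr the instruments.
For this example the code is: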
from linearmodels.iv import IV2SLS
reg_iv = IV2SLS.from_formula(
formula='lwage ~ 1 + exper + expersq + [educ ~ motheduc + fatheduc]',
data=mroz)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)
print(results_iv)
The results are:
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: lwage R-squared: 0.1357
Estimator: IV-2SLS Adj. R-squared: 0.1296
No. Observations: 428 F-statistic: 8.1407
Date: Sun, Jul 17 2022 P-value (F-stat) 0.0000
Time: 16:42:41 Distribution: F(3,424)
Cov. Estimator: unadjusted
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Intercept 0.0481 0.4003 0.1202 0.9044 -0.7388 0.8350
exper 0.0442 0.0134 3.2883 0.0011 0.0178 0.0706
expersq -0.0009 0.0004 -2.2380 0.0257 -0.0017 -0.0001
educ 0.0614 0.0314 1.9530 0.0515 -0.0004 0.1232
==============================================================================
Endogenous: educ
Instruments: fatheduc, motheduc
Unadjusted Covariance (Homoskedastic)
Debiased: True
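Comparing this output with the manual second stage, the point estimates match (0.0614 for educ in both), but the standard errors do not (0.033 from the manual OLS versus 0.0314 here). The difference arises because the manual second stage computes its residuals from educ_fitted rather than from educ itself, so its reported standard errors are not the 2SLS standard errors. The sketch below illustrates this on simulated data with NumPy only. The data-generating process, seed, and names are assumptions for illustration, and the formulas assume homoskedastic errors with a degrees-of-freedom correction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated data (illustrative assumptions, not the MROZ sample)
z = rng.normal(size=n)                   # instrument
u = rng.normal(size=n)                   # structural error
x = z + 0.8 * u + rng.normal(size=n)     # endogenous regressor
y = 1.0 + 2.0 * x + u

const = np.ones(n)
X = np.column_stack([const, x])          # regressors with the actual x
Z = np.column_stack([const, z])          # instrument matrix

# Stage 1 fitted values, then stage 2 coefficients
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
X_hat = np.column_stack([const, x_hat])
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]

k = X.shape[1]
XtX_inv = np.linalg.inv(X_hat.T @ X_hat)

# Naive: residuals built from x_hat (what a manual second-stage OLS reports)
resid_naive = y - X_hat @ beta
se_naive = np.sqrt(resid_naive @ resid_naive / (n - k) * np.diag(XtX_inv))

# 2SLS: residuals built from the actual x, same (X_hat'X_hat)^{-1}
resid_2sls = y - X @ beta
se_2sls = np.sqrt(resid_2sls @ resid_2sls / (n - k) * np.diag(XtX_inv))

print(f"slope estimate:  {beta[1]:.4f}")
print(f"naive slope SE:  {se_naive[1]:.4f}")
print(f"2SLS slope SE:   {se_2sls[1]:.4f}")
```

The coefficients are identical in both cases. Only the residual variance, and therefore the standard errors, changes. This is the practical reason to prefer a dedicated 2SLS routine for inference.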
Next we test whether educ is in fact endogenous. The model is again:
$\log(wage)=\beta_0+\beta_1 educ+\beta_2 exper+\beta_3 exper^2+u$
1. Regress the suspected endogenous variable $educ$ on the exogenous variables and the instruments, and obtain the residual $\hat{v}$.
$educ=\pi_0+\pi_1 exper+\pi_2 exper^2+\pi_3 motheduc+\pi_4 fatheduc+v$
import statsmodels.formula.api as smf
# First-stage regression
reg_1st = smf.ols(formula='educ ~ exper + expersq + motheduc + fatheduc',
                  data=mroz)
results_1st = reg_1st.fit()
mroz['resid'] = results_1st.resid  # first-stage residuals
2. Add the residual $\hat{v}$ to the original equation as an extra regressor, estimate by OLS, and test its coefficient $\delta$. If $\delta$ is significantly different from zero, $educ$ is endogenous.
$\log(wage)=\beta_0+\beta_1 educ+\beta_2 exper+\beta_3 exper^2+\delta\hat{v}+u$
# Original equation augmented with the first-stage residual
reg_2 = smf.ols(formula='lwage ~ resid + educ + exper + expersq',
                data=mroz)
results_2 = reg_2.fit()
print(results_2.summary())
The results are:
OLS Regression Results
==============================================================================
Dep. Variable: lwage R-squared: 0.162
Model: OLS Adj. R-squared: 0.154
Method: Least Squares F-statistic: 20.50
Date: Sun, 17 Jul 2022 Prob (F-statistic): 1.89e-15
Time: 17:02:34 Log-Likelihood: -430.19
No. Observations: 428 AIC: 870.4
Df Residuals: 423 BIC: 890.7
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.0481 0.395 0.122 0.903 -0.727 0.824
resid 0.0582 0.035 1.671 0.095 -0.010 0.127
educ 0.0614 0.031 1.981 0.048 0.000 0.122
exper 0.0442 0.013 3.336 0.001 0.018 0.070
expersq -0.0009 0.000 -2.271 0.024 -0.002 -0.000
==============================================================================
Omnibus: 74.968 Durbin-Watson: 1.931
Prob(Omnibus): 0.000 Jarque-Bera (JB): 278.059
Skew: -0.736 Prob(JB): 4.17e-61
Kurtosis: 6.664 Cond. No. 4.42e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.42e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
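In the output above, the coefficient on resid is 0.0582 with t = 1.671 and p = 0.095, so the evidence that educ is endogenous is significant at the 10% level but not at the 5% level. The coefficient on educ (0.0614) also equals the 2SLS estimate. For contrast, the sketch below runs the same regression-based test on simulated data in which the regressor is endogenous by construction. It uses NumPy only, and the data-generating process, seed, and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Simulated data (illustrative assumptions)
z = rng.normal(size=n)                   # instrument
u = rng.normal(size=n)                   # structural error
x = z + 0.8 * u + rng.normal(size=n)     # endogenous by construction
y = 1.0 + 2.0 * x + u

const = np.ones(n)

# Step 1: first-stage regression of x on z, keep the residuals v_hat
Z = np.column_stack([const, z])
v_hat = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]

# Step 2: add v_hat to the structural equation and estimate by OLS
W = np.column_stack([const, x, v_hat])
coef = np.linalg.lstsq(W, y, rcond=None)[0]

# Conventional (homoskedastic) OLS standard errors and the t statistic on v_hat
resid = y - W @ coef
sigma2 = resid @ resid / (n - W.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(W.T @ W)))
t_vhat = coef[2] / se[2]

print(f"coefficient on x:     {coef[1]:.4f}")
print(f"coefficient on v_hat: {coef[2]:.4f}")
print(f"t statistic on v_hat: {t_vhat:.2f}")
```

Here the t statistic on v_hat should be far from zero, which is the pattern the test looks for when a regressor is endogenous.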
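Finally, because there are two instruments for one endogenous variable, we can test the overidentifying restriction. Steps:
1. Estimate the equation by 2SLS and obtain the residuals $\hat{u}$.
2. Regress $\hat{u}$ on all exogenous variables and instruments, and obtain $R^2$.
3. Under the null hypothesis that all instruments are uncorrelated with $u$, $nR^2 \sim \chi_q^2$ asymptotically, where $q$ is the number of instruments minus the number of endogenous variables. If $nR^2$ exceeds the critical value of $\chi_q^2$ at the chosen significance level, reject the null hypothesis that all instruments are exogenous.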
from linearmodels.iv import IV2SLS
import statsmodels.formula.api as smf
import scipy.stats as stats
# Step 1: estimate the equation by 2SLS and obtain the residuals
reg_iv = IV2SLS.from_formula(
formula='lwage ~ 1 + exper + expersq + [educ ~ motheduc + fatheduc]',
data=mroz)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)
# Step 2: regress the residuals on all exogenous variables and instruments
mroz['resid_iv'] = results_iv.resids
reg_aux = smf.ols(formula='resid_iv ~ exper + expersq + motheduc + fatheduc',
data=mroz)
results_aux = reg_aux.fit()
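# Step 3: compute the test statistic and its p-value
r2 = results_aux.rsquared
n = results_aux.nobs
q = 2 - 1  # number of instruments minus number of endogenous variables
teststat = n * r2
pval = 1 - stats.chi2.cdf(teststat, q)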
print(f'r2: {r2}')
print(f'n: {n}')
print(f'teststat: {teststat}')
print(f'pval: {pval}')
Output:
r2: 0.0008833442569250449
n: 428.0
teststat: 0.3780713419639192
pval: 0.5386372330714363
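With n = 428 observations, the p-value is 0.539, so at the 5% significance level we cannot reject the null hypothesis. The test gives no evidence against the exogeneity of the parents' education variables, which supports using them as instruments.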
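As a quick arithmetic check, the reported statistic and p-value can be reproduced from n and R² alone, using only the standard library. The sketch relies on the fact that for one degree of freedom the chi-squared tail probability equals erfc(sqrt(x / 2)), so it applies only to the q = 1 case here. The 5% critical value is hard-coded as an approximation.

```python
import math

# Values copied from the output above
n = 428
r2 = 0.0008833442569250449

# Test statistic n * R^2
teststat = n * r2

# For 1 degree of freedom: P(chi2_1 > x) = erfc(sqrt(x / 2))
pval = math.erfc(math.sqrt(teststat / 2))

# Approximate 5% critical value of the chi-squared distribution with 1 df
crit_5pct = 3.841

print(f"teststat: {teststat:.4f}")
print(f"pval:     {pval:.4f}")
print(f"reject at 5%: {teststat > crit_5pct}")
```

The statistic is well below the critical value, which matches the decision reached from the p-value.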