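If heteroskedasticity is present, one remedy is to keep the OLS estimates but use robust standard errors that remain valid under heteroskedasticity.
Standard errors play a central role in statistical inference: they determine the significance of coefficients and the width of confidence intervals, and therefore the conclusions of hypothesis tests. Estimating them correctly is essential in empirical work.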
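White standard errors (heteroskedasticity-robust standard errors):
HC0: heteroskedasticity-robust standard errors proposed by White (1980)
HC1: heteroskedasticity-robust standard errors proposed by MacKinnon and White (1985)
HC2: heteroskedasticity-robust standard errors proposed by MacKinnon and White (1985)
HC3: heteroskedasticity-robust standard errors proposed by MacKinnon and White (1985)
Newey-West standard errors (heteroskedasticity- and autocorrelation-robust standard errors):
HAC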
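To make the difference between the HC variants concrete, here is a minimal NumPy sketch of the sandwich estimator on synthetic data. The helper name `hc_standard_errors` and the simulated data are assumptions made for illustration; they are not part of the original example. The four variants differ only in how the squared residuals are rescaled: HC0 uses them as they are, HC1 applies the degrees-of-freedom factor n/(n-k), and HC2 and HC3 divide by (1 - h_ii) and (1 - h_ii)^2, where h_ii is the leverage of observation i.

```python
import numpy as np

def hc_standard_errors(X, y):
    """Return OLS coefficients and HC0-HC3 standard errors (illustrative helper)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Leverage h_ii: diagonal of the hat matrix X (X'X)^{-1} X'
    leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    # Each variant rescales the squared residuals differently
    omega = {
        'HC0': resid ** 2,
        'HC1': resid ** 2 * n / (n - k),
        'HC2': resid ** 2 / (1 - leverage),
        'HC3': resid ** 2 / (1 - leverage) ** 2,
    }
    se = {}
    for name, w in omega.items():
        meat = X.T @ (w[:, None] * X)        # X' diag(w) X
        cov = XtX_inv @ meat @ XtX_inv       # sandwich form
        se[name] = np.sqrt(np.diag(cov))
    return beta, se

# Synthetic data (assumed): the error standard deviation grows with x
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0, 1, n) * x

beta, se = hc_standard_errors(X, y)
for name, values in se.items():
    print(name, values)
```

Because the leverage values lie between 0 and 1, the standard errors are ordered HC0 ≤ HC2 ≤ HC3, which is why HC3 is the most conservative choice in small samples.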
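We use Example 8.4 from Chapter 8 ("Heteroskedasticity") of Wooldridge's Introductory Econometrics: A Modern Approach and the HPRICE1 data, and estimate the equation in three ways: OLS, OLS with HC0 standard errors, and OLS with HAC standard errors. The code is as follows: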
import wooldridge as woo
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
hprice1 = woo.dataWoo('hprice1')
# Fit the OLS regression model
reg = smf.ols(formula='price ~ lotsize + sqrft + bdrms', data=hprice1)
results = reg.fit()
print(results.summary().tables[1])
# Use White standard errors (heteroskedasticity-robust)
reg_HC0 = smf.ols(formula='price ~ lotsize + sqrft + bdrms', data=hprice1)
results_HC0 = reg_HC0.fit(cov_type='HC0', use_t=True)
print(results_HC0.summary().tables[1])
# Use Newey-West standard errors (heteroskedasticity- and autocorrelation-robust)
reg_HAC = smf.ols(formula='price ~ lotsize + sqrft + bdrms', data=hprice1)
results_HAC = reg_HAC.fit(cov_type='HAC', use_t=True, cov_kwds={'maxlags': 1})  # maxlags is the number of lags
print(results_HAC.summary().tables[1])
The results are:
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -21.7703 29.475 -0.739 0.462 -80.385 36.844
lotsize 0.0021 0.001 3.220 0.002 0.001 0.003
sqrft 0.1228 0.013 9.275 0.000 0.096 0.149
bdrms 13.8525 9.010 1.537 0.128 -4.065 31.770
==============================================================================
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -21.7703 41.033 -0.531 0.597 -103.368 59.828
lotsize 0.0021 0.007 0.289 0.773 -0.012 0.016
sqrft 0.1228 0.041 3.014 0.003 0.042 0.204
bdrms 13.8525 11.562 1.198 0.234 -9.139 36.844
==============================================================================
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -21.7703 36.606 -0.595 0.554 -94.565 51.024
lotsize 0.0021 0.001 1.660 0.101 -0.000 0.005
sqrft 0.1228 0.018 6.999 0.000 0.088 0.158
bdrms 13.8525 8.814 1.572 0.120 -3.675 31.380
==============================================================================
As the output shows, "OLS + robust standard errors" leaves the coefficient estimates unchanged and only replaces the standard errors. By convention, the usual OLS standard errors are reported in parentheses () and the heteroskedasticity-robust standard errors in square brackets []. The estimated equation can therefore be written as:
$$
\begin{aligned}
\widehat{price} = {} & -21.77 + 0.00207\,lotsize + 0.123\,sqrft + 13.85\,bdrms \\
& \quad (29.475) \qquad (0.001) \qquad (0.013) \qquad (9.010) \\
& \quad [41.033] \qquad [0.007] \qquad [0.041] \qquad [11.562]
\end{aligned}
$$
Because observations with a smaller variance carry more information, another way to handle heteroskedasticity is to give those observations larger weights and estimate by weighted least squares (WLS). The basic idea of WLS is to transform the variables so that the transformed model satisfies the homoskedasticity assumption, and then apply OLS.
Consider the linear model:
$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + u_i$$
Let
$$X_i = (1, x_{i1}, \dots, x_{ik})$$
The null hypothesis of conditional homoskedasticity is:
$$H_0: Var(u_i|X_i) = \sigma^2$$
If the null hypothesis does not hold, assume:
$$Var(u_i|X_i) = E(u_i^2|X_i) = \sigma^2 h(X_i) = \sigma^2 h_i$$
where $h(X)$ is some function of the explanatory variables that determines the form of the heteroskedasticity. Because a variance must be positive, $h(X) > 0$ for all $X$.
Multiplying both sides of the original model by the weight $1/\sqrt{h_i}$ gives:
$$y_i/\sqrt{h_i} = \beta_0/\sqrt{h_i} + \beta_1 x_{i1}/\sqrt{h_i} + \dots + \beta_k x_{ik}/\sqrt{h_i} + u_i/\sqrt{h_i}$$
or
$$y_i^* = \beta_0 x_{i0}^* + \beta_1 x_{i1}^* + \dots + \beta_k x_{ik}^* + u_i^*$$
where $x_{i0}^* = 1/\sqrt{h_i}$ and every starred variable is the corresponding original variable divided by $\sqrt{h_i}$. The variance of the new error term is:
$$Var(u_i/\sqrt{h_i} \mid X_i) = \frac{1}{h_i} Var(u_i \mid X_i) = \sigma^2 h_i / h_i = \sigma^2$$
Thus the new error term $u_i/\sqrt{h_i}$ is no longer heteroskedastic.
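The transformation can be checked numerically. The sketch below uses synthetic data and assumes the variance function is known, $h(x) = x^2$; both choices are made up for illustration. It shows that OLS on the variables divided by $\sqrt{h_i}$ gives the same coefficients as the closed-form weighted estimator with weights $1/h_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1, 5, n)
h = x ** 2                                   # assumed known variance function h(x) = x^2
u = rng.normal(0, 1, n) * np.sqrt(h)         # Var(u_i | x_i) = h_i (sigma^2 = 1)
y = 3.0 + 1.5 * x + u

X = np.column_stack([np.ones(n), x])

# Divide every variable, including the constant, by sqrt(h_i) and run OLS
X_star = X / np.sqrt(h)[:, None]
y_star = y / np.sqrt(h)
beta_transformed, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

# Equivalent closed form: (X'WX)^{-1} X'Wy with W = diag(1 / h_i)
w = 1.0 / h
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

print(beta_transformed)
print(beta_wls)
```

Note that the column of ones is also divided by $\sqrt{h_i}$, so the transformed regression has no ordinary intercept.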
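The difficulty with WLS lies in finding the function $h(X_i)$. We can, however, model the function $h$ and obtain an estimate of each $h_i$, denoted $\hat{h}_i$. Running weighted least squares with $\hat{h}_i$ in place of $h_i$ is called feasible weighted least squares (FWLS).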
In the Breusch-Pagan (BP) test, the following auxiliary regression is run:
$$e^2 = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \dots + \delta_k x_k + error$$
where $e^2$ is the squared residual from the original equation. The fitted values of this auxiliary regression give an estimate of the conditional variance, denoted $\hat{\sigma}^2$:
$$\hat{\sigma}^2 = \hat{\delta}_0 + \hat{\delta}_1 x_1 + \hat{\delta}_2 x_2 + \dots + \hat{\delta}_k x_k$$
However, this expression can yield $\hat{\sigma}^2 < 0$, and a variance cannot be negative. To guarantee that $\hat{\sigma}^2$ is always positive, the conditional variance function is usually specified in log form:
$$\ln e^2 = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \dots + \delta_k x_k + error$$
Estimating this equation by OLS gives the predicted value of $\ln e^2$, denoted $\ln\hat{\sigma}^2$, and hence:
$$\hat{\sigma}^2 = \exp(\hat{\delta}_0 + \hat{\delta}_1 x_1 + \hat{\delta}_2 x_2 + \dots + \hat{\delta}_k x_k)$$
The original equation is then estimated by WLS with weights $1/\hat{\sigma}^2$.
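The following sketch compares the two variance specifications on synthetic data; the data-generating process is an assumption chosen for illustration. Fitted values from the linear regression of $e^2$ can turn negative for some observations, whereas exponentiating the fitted values of the log regression is positive by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(0, 4, n)
X = np.column_stack([np.ones(n), x])
# Assumed data-generating process: error s.d. is exp(0.5 * (x - 2))
y = 1.0 + 0.5 * x + rng.normal(0, 1, n) * np.exp(0.5 * (x - 2))

# Squared OLS residuals from the original equation
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e2 = (y - X @ beta) ** 2

# Linear specification: regress e^2 on the regressors
delta_lin, *_ = np.linalg.lstsq(X, e2, rcond=None)
sigma2_linear = X @ delta_lin                # not guaranteed to be positive

# Log specification: regress ln(e^2) on the regressors, then exponentiate
delta_log, *_ = np.linalg.lstsq(X, np.log(e2), rcond=None)
sigma2_exp = np.exp(X @ delta_log)           # always positive

print('min fitted variance, linear form:', sigma2_linear.min())
print('min fitted variance, log form:   ', sigma2_exp.min())
```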
The FWLS procedure for correcting heteroskedasticity:
1. Regress $y$ on $x_1, x_2, \dots, x_k$ and obtain the residuals $e$.
2. Square the OLS residuals and take the natural log to obtain $\log(e^2)$.
3. Regress $\log(e^2)$ on $x_1, x_2, \dots, x_k$ and obtain the fitted values $\ln\hat{\sigma}^2$.
4. Exponentiate the fitted values to obtain $\hat{\sigma}^2$.
5. Estimate the equation by WLS with weights $1/\hat{\sigma}^2$.
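The five steps above can be sketched with plain NumPy as follows. The function name `fwls` and the synthetic data are assumptions for illustration only; this is not the statsmodels implementation.

```python
import numpy as np

def fwls(X, y):
    """Feasible WLS following the five steps above (illustrative helper)."""
    # Step 1: OLS of y on X, keep the residuals
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta_ols
    # Step 2: log of the squared residuals
    log_e2 = np.log(e ** 2)
    # Step 3: regress log(e^2) on X and take the fitted values
    delta = np.linalg.lstsq(X, log_e2, rcond=None)[0]
    log_sigma2_hat = X @ delta
    # Step 4: exponentiate the fitted values
    sigma2_hat = np.exp(log_sigma2_hat)
    # Step 5: WLS with weights 1 / sigma2_hat
    w = 1.0 / sigma2_hat
    beta_fwls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta_ols, beta_fwls, w

# Synthetic data (assumed): error s.d. is exp(0.5 * x)
rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(0, 3, n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 0.8 * x + rng.normal(0, 1, n) * np.exp(0.5 * x)

beta_ols, beta_fwls, weights = fwls(X, y)
print('OLS :', beta_ols)
print('FWLS:', beta_fwls)
```

Both estimators are consistent for the true coefficients here; the gain from FWLS is efficiency, which comes from down-weighting the noisier observations.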
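We use Example 8.7 from Chapter 8 ("Heteroskedasticity") of Wooldridge's Introductory Econometrics: A Modern Approach and the SMOKE data to run both OLS and FWLS estimation.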
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
smoke = woo.dataWoo('smoke')
# OLS
reg_ols = smf.ols(formula='cigs ~ np.log(income) + np.log(cigpric) +'
                          'educ + age + I(age**2) + restaurn', data=smoke)
results_ols = reg_ols.fit()
print(results_ols.summary())
# Regress log(squared residuals) on the explanatory variables
smoke['loge2'] = np.log(results_ols.resid ** 2)
reg_loge2 = smf.ols(formula='loge2 ~ np.log(income) + np.log(cigpric) +'
                            'educ + age + I(age**2) + restaurn', data=smoke)
results_loge2 = reg_loge2.fit()
# FWLS: weights are 1 / exp(fitted values of the log regression)
wls_weight = list(1 / np.exp(results_loge2.fittedvalues))
reg_wls = smf.wls(formula='cigs ~ np.log(income) + np.log(cigpric) +'
                          'educ + age + I(age**2) + restaurn',
                  weights=wls_weight, data=smoke)
results_wls = reg_wls.fit()
print(results_wls.summary())
The results are:
OLS Regression Results
==============================================================================
Dep. Variable: cigs R-squared: 0.053
Model: OLS Adj. R-squared: 0.046
Method: Least Squares F-statistic: 7.423
Date: Wed, 04 May 2022 Prob (F-statistic): 9.50e-08
Time: 20:53:14 Log-Likelihood: -3236.2
No. Observations: 807 AIC: 6486.
Df Residuals: 800 BIC: 6519.
Df Model: 6
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept -3.6398 24.079 -0.151 0.880 -50.905 43.625
np.log(income) 0.8803 0.728 1.210 0.227 -0.548 2.309
np.log(cigpric) -0.7509 5.773 -0.130 0.897 -12.084 10.582
educ -0.5015 0.167 -3.002 0.003 -0.829 -0.174
age 0.7707 0.160 4.813 0.000 0.456 1.085
I(age ** 2) -0.0090 0.002 -5.176 0.000 -0.012 -0.006
restaurn -2.8251 1.112 -2.541 0.011 -5.007 -0.643
==============================================================================
Omnibus: 225.317 Durbin-Watson: 2.013
Prob(Omnibus): 0.000 Jarque-Bera (JB): 494.255
Skew: 1.536 Prob(JB): 4.72e-108
Kurtosis: 5.294 Cond. No. 1.33e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.33e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
WLS Regression Results
==============================================================================
Dep. Variable: cigs R-squared: 0.113
Model: WLS Adj. R-squared: 0.107
Method: Least Squares F-statistic: 17.06
Date: Wed, 04 May 2022 Prob (F-statistic): 1.32e-18
Time: 20:53:14 Log-Likelihood: -3207.8
No. Observations: 807 AIC: 6430.
Df Residuals: 800 BIC: 6462.
Df Model: 6
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 5.6355 17.803 0.317 0.752 -29.311 40.582
np.log(income) 1.2952 0.437 2.964 0.003 0.437 2.153
np.log(cigpric) -2.9403 4.460 -0.659 0.510 -11.695 5.815
educ -0.4634 0.120 -3.857 0.000 -0.699 -0.228
age 0.4819 0.097 4.978 0.000 0.292 0.672
I(age ** 2) -0.0056 0.001 -5.990 0.000 -0.007 -0.004
restaurn -3.4611 0.796 -4.351 0.000 -5.023 -1.900
==============================================================================
Omnibus: 325.055 Durbin-Watson: 2.050
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1258.137
Skew: 1.908 Prob(JB): 6.30e-274
Kurtosis: 7.780 Cond. No. 2.30e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.3e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
The equation estimated by OLS:
$$
\begin{aligned}
\widehat{cigs} = {} & -3.64 + 0.880\log(income) - 0.751\log(cigpric) - 0.501\,educ \\
& + 0.771\,age - 0.0090\,age^2 - 2.83\,restaurn
\end{aligned}
$$
The equation estimated by FWLS:
$$
\begin{aligned}
\widehat{cigs} = {} & 5.64 + 1.30\log(income) - 2.94\log(cigpric) - 0.463\,educ \\
& + 0.482\,age - 0.0056\,age^2 - 3.46\,restaurn
\end{aligned}
$$
Reference: Standard errors in Python, https://www.vincentgregoire.com/standard-errors-in-python/#With-statsmodels