Decomposing R² across Variables: Relative Importance Analysis (Dominance Analysis)

Author: Hu Yuxiao (胡雨霄), London School of Economics and Political Science

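This post introduces dominance analysis, a method for assessing the relative importance of regressors, and its Stata implementation, the domin command.

1. Introduction to dominance analysis

A central question in empirical economics is how much each explanatory variable contributes to the variance of the dependent variable.

For example, Ye, Ng and Lian (2015) try to determine which cultural factors matter more for happiness (Ye D, Ng Y K, Lian Y. Culture and Happiness[J]. Social Indicators Research, 2015, 123(2):519-547. https://core.ac.uk/download/pdf/81850289.pdf). The raw coefficients clearly cannot be compared directly. Standardizing the coefficients looks like a remedy, but standardized coefficients still do not reveal the relative importance of the variables.

Two other approaches are common in the literature: stepwise regression, in which explanatory variables are entered into the regression one at a time, and significance tests. Both have drawbacks. In stepwise regression the order in which variables are entered is highly subjective, and significance tests cannot always rank the explanatory variables by importance. Building on earlier work, mainly Shorrocks (1999) and Fields (2003), Israeli (2007) therefore proposed a Shapley-value-based decomposition of the R-squared, which this post refers to as dominance analysis. The method determines how much each explanatory variable contributes to the coefficient of determination in a linear regression. This contribution in turn reflects how much each explanatory variable contributes to the variance of the dependent variable.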

Suppose the linear regression is

$y=a+\sum_{j=1}^{J} b_{j} x_{j}+e \qquad [1]$

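The total variation of the dependent variable, the total sum of squares (TSS), can be split into two parts: the regression sum of squares (RSS) and the error sum of squares (ESS). In terms of variances,

$\operatorname{Var}(y)=\operatorname{Var}(\hat{y})+\operatorname{Var}(e) \qquad [2]$

where $\hat{y}$ is the predicted value of the dependent variable.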

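The goodness of fit can then be written as

$R^{2}=\frac{\operatorname{RSS}}{\mathrm{TSS}}=\frac{\operatorname{Var}(\hat{y})}{\operatorname{Var}(y)}=1-\frac{\operatorname{Var}(e)}{\operatorname{Var}(y)} \qquad [3]$

$R^{2}$ is an important goodness-of-fit statistic. A natural way to assess the relative importance of the explanatory variables is therefore to decompose $R^{2}$ into the contribution of each variable and to rank the variables by those contributions.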
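As a quick numerical check of [3], the following Stata sketch recomputes $R^{2}$ from the variances of the fitted values and the residuals. It assumes the auto data that ships with Stata. The variable and scalar names (yhat, ehat, var_y and so on) are illustrative choices, not part of the original post.

```stata
sysuse auto, clear
quietly regress price weight length
scalar r2_reg = e(r2)               // R-squared as reported by regress

predict double yhat, xb             // fitted values
predict double ehat, residuals      // residuals

quietly summarize price
scalar var_y = r(Var)
quietly summarize yhat
scalar var_yhat = r(Var)
quietly summarize ehat
scalar var_e = r(Var)

* all three lines should display the same number
display "e(r2)             = " %6.4f r2_reg
display "Var(yhat)/Var(y)  = " %6.4f (var_yhat/var_y)
display "1 - Var(e)/Var(y) = " %6.4f (1 - var_e/var_y)
```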

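Following Fields (2003), the variance of the dependent variable can be decomposed as

$\operatorname{Var}(y)=\sum_{j=1}^{J} b_{j} \operatorname{Cov}\left(x_{j}, y\right)+\operatorname{Cov}(e, y) \qquad [4]$

Dividing by $\operatorname{Var}(y)$ gives the relative contribution of each explanatory variable:

$R^{2}(y)=\frac{\sum_{j=1}^{J} b_{j} \operatorname{Cov}\left(x_{j}, y\right)}{\operatorname{Var}(y)}=1-\frac{\operatorname{Cov}(e, y)}{\operatorname{Var}(y)} \qquad [5]$

This expression is equivalent to [3]. The difference is that [5] attaches a separate term to each explanatory variable, so the variables can be ranked by importance. However, Fields (2003) does not take the relationships among the explanatory variables into account: the coefficient of a variable depends on which other explanatory variables are in the regression.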
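The Fields decomposition in [5] can be sketched numerically as follows. This is a minimal illustration on the auto data, assuming the share of variable $x_j$ is $b_j \operatorname{Cov}(x_j, y) / \operatorname{Var}(y)$. The scalar names (f_weight, f_length and so on) are hypothetical.

```stata
sysuse auto, clear
quietly regress price weight length
scalar r2_reg   = e(r2)
scalar b_weight = _b[weight]
scalar b_length = _b[length]

* covariance matrix, rows/columns ordered as: price weight length
quietly correlate price weight length, covariance
matrix C = r(C)

* Fields share of x_j: b_j * Cov(x_j, y) / Var(y)
scalar f_weight = b_weight * C[2,1] / C[1,1]
scalar f_length = b_length * C[3,1] / C[1,1]

display "Fields share, weight = " %7.4f f_weight
display "Fields share, length = " %7.4f f_length
display "Sum of shares        = " %7.4f (f_weight + f_length)
display "e(r2)                = " %7.4f r2_reg
```

The shares add up to the R-squared of the full regression. Because the coefficient on length is negative while length and price covary positively, the share of length should come out negative in this example, which shows why the individual terms of [5] can be hard to read as importance shares.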

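By contrast, the Shapley-value approach of Shorrocks (1999) treats the contribution of an explanatory variable as its marginal effect on $R^{2}$. Specifically, the marginal effect of variable $x_{k}$ on $R^{2}$ can be written as

$M_{k}=R^{2}\left(y=a+b_{k} x_{k}+\sum_{j \neq k} b_{j} x_{j}+e\right)-R^{2}\left(y=a^{*}+\sum_{j \neq k} b_{j}^{*} x_{j}+e^{*}\right) \qquad [6]$

where $x_{j}$ ($j \neq k$) are the explanatory variables other than $x_{k}$. The expression is simply the $R^{2}$ of the full regression minus the $R^{2}$ of the regression without variable $k$. Because the coefficients usually change once an explanatory variable is dropped, the coefficients of the regression without variable $k$ are marked with an asterisk.

This raises a problem: the marginal effect of variable $k$ on the goodness of fit depends on the order in which variables are eliminated from the regression. To resolve this, the importance of variable $k$ is measured as the average of its marginal effects over all $J!$ possible elimination orders.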
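The averaging step can also be written in the textbook Shapley-value form, which groups the elimination orders by the set $S$ of other regressors that are in the model when $x_k$ is added or removed. The symbols $\phi_k$ and $R^2(S)$ are notation introduced here for illustration, with $R^2(\emptyset)=0$ for the constant-only model.

```latex
\[
\phi_k \;=\; \sum_{S \subseteq \{1,\dots,J\}\setminus\{k\}}
  \frac{|S|!\,(J-|S|-1)!}{J!}\,
  \bigl[\, R^2(S \cup \{k\}) - R^2(S) \,\bigr],
\qquad
\sum_{k=1}^{J} \phi_k \;=\; R^2(\{1,\dots,J\}).
\]
```

The second identity says that the contributions add up exactly to the $R^{2}$ of the full regression, which is what makes the result a decomposition.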

2. The Stata command domin

2.1 Installation

ssc install domin

2.2 Syntax

The basic syntax is

domin depvar indepvars [if] [in] [weight], sets((varlist) (varlist) ...)

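where

depvar: the dependent variable

indepvars: the explanatory variables

sets((varlist) (varlist) ...): treats all variables listed in one varlist as a single explanatory variable. For example, sets((x1 x2) (x3 x4)) creates two sets: set1 consists of x1 and x2, and set2 consists of x3 and x4. This option is typically used for grouped analysis.

Most applications need only the basic syntax, but this post also covers the extended syntax.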

domin depvar [indepvars] [if] [in] [weight] [, fitstat(scalar)
sets((varlist) (varlist) ...) noconditional nocomplete epsilon]

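where

fitstat(scalar) specifies the fit statistic on which the dominance analysis is based. The scalar can take three forms: a returned scalar, an ereturned scalar, or any other scalar. If the option is omitted, domin uses fitstat(e(r2)).

noconditional suppresses the conditional dominance results.

nocomplete skips the computation of the complete dominance results.

epsilon speeds up the computation and gives results close to those obtained without it. However, it cannot be combined with sets().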
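A minimal usage sketch of these options, assuming domin has been installed from SSC and that the installed version accepts the options exactly as described above (option behaviour may differ across versions):

```stata
sysuse auto, clear

* the default fit statistic, spelled out explicitly
domin price weight length, fitstat(e(r2))

* general dominance statistics only: no conditional or complete dominance output
domin price weight length, noconditional nocomplete
```

Both calls should report the same general dominance statistics. The second one only trims the output and the amount of computation.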

2.3 Application: two explanatory variables

2.3.1 Data

sysuse "auto.dta", clear

The first ten observations look as follows

. list in 1/10

     +-------------------------------------------------------------------------------------------------------------------+
     | make             price   mpg   rep78   headroom   trunk   weight   length   turn   displa~t   gear_r~o    foreign |
     |-------------------------------------------------------------------------------------------------------------------|
  1. | AMC Concord      4,099    22       3        2.5      11    2,930      186     40        121       3.58   Domestic |
  2. | AMC Pacer        4,749    17       3        3.0      11    3,350      173     40        258       2.53   Domestic |
  3. | AMC Spirit       3,799    22       .        3.0      12    2,640      168     35        121       3.08   Domestic |
  4. | Buick Century    4,816    20       3        4.5      16    3,250      196     40        196       2.93   Domestic |
  5. | Buick Electra    7,827    15       4        4.0      20    4,080      222     43        350       2.41   Domestic |
     |-------------------------------------------------------------------------------------------------------------------|
  6. | Buick LeSabre    5,788    18       3        4.0      21    3,670      218     43        231       2.73   Domestic |
  7. | Buick Opel       4,453    26       .        3.0      10    2,230      170     34        304       2.87   Domestic |
  8. | Buick Regal      5,189    20       3        2.0      16    3,280      200     42        196       2.93   Domestic |
  9. | Buick Riviera   10,372    16       3        3.5      17    3,880      207     43        231       2.93   Domestic |
 10. | Buick Skylark    4,082    19       3        3.5      13    3,400      200     42        231       3.08   Domestic |
     +-------------------------------------------------------------------------------------------------------------------+

2.3.2 Regression

The regression run with reg shows that both explanatory variables, weight and length, are significantly related to the dependent variable price.

. reg price weight length 

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     18.91
       Model |   220725280         2   110362640   Prob > F        =    0.0000
    Residual |   414340116        71  5835776.28   R-squared       =    0.3476
-------------+----------------------------------   Adj R-squared   =    0.3292
       Total |   635065396        73  8699525.97   Root MSE        =    2415.7

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   4.699065   1.122339     4.19   0.000     2.461184    6.936946
      length |  -97.96031    39.1746    -2.50   0.015    -176.0722   -19.84838
       _cons |   10386.54   4308.159     2.41   0.019     1796.316    18976.76
------------------------------------------------------------------------------

2.3.3 Relative importance analysis

Running domin gives the results below.

The Overall Fit Statistic equals the R-squared reported by reg, because R-squared is the default fit statistic of domin.

.    domin price weight length 

Total of 3 regressions

General dominance statistics: Linear regression
Number of obs             =                      74
Overall Fit Statistic     =                  0.3476

            |      Dominance      Standardized      Ranking
 price      |      Stat.          Domin. Stat.
------------+------------------------------------------------------------------------
 weight     |         0.2256      0.6491            1 
 length     |         0.1220      0.3509            2 
-------------------------------------------------------------------------------------

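The contribution of weight is 0.2256, which can be read as its average marginal contribution to the goodness of fit.
The contribution of length is 0.1220, which can be read in the same way.
In this regression weight is more important than length: it explains a larger share of the variance of the dependent variable price.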
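The other columns of the table follow from simple arithmetic on the reported numbers. As a worked check, assuming the standardized statistic is the dominance statistic divided by the overall fit statistic:

```latex
\[
0.2256 + 0.1220 = 0.3476, \qquad
\frac{0.2256}{0.3476} \approx 0.649, \qquad
\frac{0.1220}{0.3476} \approx 0.351 .
\]
```

These values agree with the Overall Fit Statistic and with the Standardized Domin. Stat. column (0.6491 and 0.3509) up to rounding of the displayed figures.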

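Next we use Stata to verify how the Dominance Stat. is obtained, applying the formulas above.

This example has only two explanatory variables, so the regression is

$y=a+b_{1} x_{1}+b_{2} x_{2}+e$

Substituting into [6], there are two ways to eliminate $x_{1}$. If $x_{1}$ is eliminated first, its marginal effect is

$M_{1}^{(1)}=R^{2}\left(y=a+b_{1} x_{1}+b_{2} x_{2}+e\right)-R^{2}\left(y=a^{*}+b_{2}^{*} x_{2}+e^{*}\right)$

If $x_{1}$ is eliminated after $x_{2}$, its marginal effect is

$M_{1}^{(2)}=R^{2}\left(y=a^{*}+b_{1}^{*} x_{1}+e^{*}\right)-R^{2}\left(y=a^{*}+e^{*}\right)$

where the second term, the $R^{2}$ of the constant-only regression, is zero. The Dominance Stat. of $x_{1}$ is the average of the two:

$\frac{1}{2}\left(M_{1}^{(1)}+M_{1}^{(2)}\right) \qquad [7]$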

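. rename (price weight length) (y x1 x2) // match the notation of the formulas
. 
. /* First elimination order: x1 is dropped first */
. 
. ** Regression 1: full regression
. qui reg y x1 x2 
. 
. local R2_all = e(r2) // R-squared of the full regression
. 
. ** Regression 2: drop x1
. qui reg y x2
. local R2_x2 = e(r2) // R-squared of regression 2
. 
. ** Marginal contribution of x1 under the first order
. local R2_m1_x1 = `R2_all'-`R2_x2'
. 
. /* Second elimination order: x2 is dropped first */
. 
. ** Regression 1b: regression without x2
. qui reg y x1
. 
. local R2_x1 = e(r2) // R-squared of regression 1b
. 
. ** Regression 2b: regression without x2 and x1 (constant only)
. qui reg y
. 
. local R2_0 = e(r2) // R-squared of regression 2b, which is 0
. 
. ** Marginal contribution of x1 under the second order
. local R2_m2_x1 = `R2_x1'-`R2_0'
. 
. /* Dominance Stat. */
. 
. ** Equation [7]: average over the two orders
. local R2_x1_Shapley = (`R2_m1_x1'+`R2_m2_x1')/2

. dis "Shapley value of weight = " in g %6.4f `R2_x1_Shapley'
Shapley value of weight = 0.2256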
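The same check can be extended to both variables at once. The sketch below reloads the data so that it is self-contained and keeps the original variable names. The scalar names (r2_full, dom_weight and so on) are illustrative. It should reproduce the Dominance Stat. column reported by domin, and the two statistics add up to the R-squared of the full regression.

```stata
sysuse auto, clear

quietly regress price weight length
scalar r2_full = e(r2)
quietly regress price weight
scalar r2_wgt = e(r2)
quietly regress price length
scalar r2_len = e(r2)

* average over the two elimination orders; the constant-only R-squared is 0
scalar dom_weight = ((r2_full - r2_len) + (r2_wgt - 0)) / 2
scalar dom_length = ((r2_full - r2_wgt) + (r2_len - 0)) / 2

display "Dominance Stat., weight = " %6.4f dom_weight
display "Dominance Stat., length = " %6.4f dom_length
display "Sum                     = " %6.4f (dom_weight + dom_length)
display "Full-model R-squared    = " %6.4f r2_full
```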

2.4 Application: several explanatory variables

2.4.1 Data

sysuse "nlsw88.dta", clear

The data look as follows

. list wage age hours tenure married in 1/10

     +---------------------------------------------+
     |     wage   age   hours     tenure   married |
     |---------------------------------------------|
  1. | 11.73913    37      48   5.333333    single |
  2. | 6.400963    37      40       5.25    single |
  3. | 5.016723    42      40       1.25    single |
  4. | 9.033813    43      42       1.75   married |
  5. | 8.083731    42      48      17.75   married |
     |---------------------------------------------|
  6. |  4.62963    39      30       2.25   married |
  7. | 10.49114    37      40         19    single |
  8. | 17.20612    40      45   14.16667   married |
  9. | 13.08374    40       8        5.5   married |
 10. | 7.745568    40      50       2.25   married |
     +---------------------------------------------+

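2.4.2 Regression

The regression run with reg shows that age is significantly related to the dependent variable wage at the 5% level, and that hours and tenure are significantly related to wage at the 1% level. The variable married is not significantly related to wage.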

. reg wage age hours tenure married

      Source |       SS           df       MS      Number of obs   =     2,227
-------------+----------------------------------   F(4, 2222)      =     29.94
       Model |   3784.9833         4  946.245825   Prob > F        =    0.0000
    Residual |  70235.8273     2,222  31.6092832   R-squared       =    0.0511
-------------+----------------------------------   Adj R-squared   =    0.0494
       Total |  74020.8106     2,226  33.2528349   Root MSE        =    5.6222

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0844975   .0390588    -2.16   0.031    -.1610931    -.007902
       hours |   .0711019   .0116507     6.10   0.000     .0482545    .0939493
      tenure |   .1665189   .0219824     7.58   0.000     .1234107    .2096272
     married |  -.2209223   .2514333    -0.88   0.380    -.7139911    .2721464
       _cons |   7.606344   1.617926     4.70   0.000     4.433539    10.77915
------------------------------------------------------------------------------

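2.4.3 Relative importance analysis

Running domin gives the results below. Because neither reg() nor fitstat() is specified, domin first reports the defaults it assumes.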

. domin wage age hours tenure married
Regression type not entered in reg(). 
reg(regress) assumed.

Fitstat type not entered in fitstat(). 
fitstat(e(r2)) assumed.


Total of 15 regressions

General dominance statistics: Linear regression
Number of obs             =                    2227
Overall Fit Statistic     =                  0.0511

            |      Dominance      Standardized      Ranking
 wage       |      Stat.          Domin. Stat.
------------+------------------------------------------------------------------------
 age        |         0.0017      0.0331            3 
 hours      |         0.0205      0.4010            2 
 tenure     |         0.0279      0.5461            1 
 married    |         0.0010      0.0198            4 
-------------------------------------------------------------------------------------

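In this regression the relative importance ranking is tenure > hours > age > married. In other words, among the determinants of the wage level (wage), job tenure (tenure) is the most important, followed by hours worked (hours), then age (age), and finally marital status (married).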
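For more than two variables the averaging has to run over all subsets of the other regressors. The following sketch computes the Shapley-weighted average by brute force for the four regressors used here. It is an illustration of the idea under the assumption that the general dominance statistic equals the Shapley value, not a description of how domin is implemented. The names insample, r2_*, dom_* and the bitmask indexing are choices made for this example.

```stata
sysuse nlsw88, clear
local xs age hours tenure married
local J : word count `xs'
local last = 2^`J' - 1

* fix the estimation sample so every sub-model uses the same observations
quietly regress wage `xs'
generate byte insample = e(sample)

* R-squared of every subset of regressors; bit j of mask m flags variable j
forvalues m = 0/`last' {
    local rhs
    forvalues j = 1/`J' {
        if mod(floor(`m'/2^(`j'-1)), 2) == 1 {
            local xj : word `j' of `xs'
            local rhs `rhs' `xj'
        }
    }
    quietly regress wage `rhs' if insample
    scalar r2_`m' = e(r2)
}

* Shapley-weighted average of the marginal contributions of each variable
scalar dom_total = 0
forvalues k = 1/`J' {
    scalar dom_`k' = 0
    forvalues m = 0/`last' {
        if mod(floor(`m'/2^(`k'-1)), 2) == 0 {
            local s = 0                      // size of the subset without x_k
            forvalues j = 1/`J' {
                local s = `s' + mod(floor(`m'/2^(`j'-1)), 2)
            }
            local mk = `m' + 2^(`k'-1)       // same subset plus x_k
            scalar dom_`k' = dom_`k' + (r2_`mk' - r2_`m') / (`J'*comb(`J'-1, `s'))
        }
    }
    scalar dom_total = dom_total + dom_`k'
    local xk : word `k' of `xs'
    display "`xk'" _col(12) %6.4f dom_`k'
}
display "sum" _col(12) %6.4f dom_total "   full R2 = " %6.4f r2_`last'
```

The displayed values should match the Dominance Stat. column above up to rounding, and their sum equals the R-squared of the full regression. The weight 1/(J*comb(J-1, s)) is the share of elimination orders in which exactly s other variables are still in the model.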

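2.4.4 Relative importance analysis (with sets)

The variables occupation, industry and race are categorical. The sets() option of domin handles categorical variables well: set1, set2 and set3 are treated as three explanatory variables built from occupation, industry and race, respectively.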

. domin wage age hours tenure married, sets((i.occupation) (i.industry) (i.race))

Total of 127 regressions

General dominance statistics: Linear regression
Number of obs             =                    2209
Overall Fit Statistic     =                  0.1990

            |      Dominance      Standardized      Ranking
 wage       |      Stat.          Domin. Stat.
------------+------------------------------------------------------------------------
 age        |         0.0013      0.0067            7 
 hours      |         0.0114      0.0572            4 
 tenure     |         0.0181      0.0908            3 
 married    |         0.0022      0.0111            6 
 set1       |         0.1133      0.5692            1 
 set2       |         0.0472      0.2374            2 
 set3       |         0.0055      0.0276            5 
-------------------------------------------------------------------------------------

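Once the sets are included, the relative importance ranking becomes set1 > set2 > tenure > hours > set3 > married > age, so the ranking differs from the previous one. Specifically, among the determinants of the wage level (wage), occupation (occupation) is the most important, followed by industry (industry), job tenure (tenure), hours worked (hours), race (race), marital status (married), and finally age (age). Note that the estimation sample also differs from the previous run (2,209 instead of 2,227 observations), so the two tables are not based on exactly the same data.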
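The regression counts reported by domin in the three examples are consistent with one regression per non-empty subset of the $p$ explanatory units, where a unit is either a single variable or a set. The symbol $p$ is notation introduced here.

```latex
\[
2^{p}-1:\qquad 2^{2}-1 = 3, \qquad 2^{4}-1 = 15, \qquad 2^{7}-1 = 127 .
\]
```

The count roughly doubles with every additional variable or set, so grouping related dummies into one set also keeps the computation manageable.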

References

[1] Fields, G. S. (2003). Accounting for income inequality and its change: A new method, with application to the distribution of earnings in the United States. In Worker well-being and public policy (pp. 1-38). Emerald Group Publishing Limited.

[2] Israeli, O. (2007). A Shapley-based decomposition of the R-square of a linear regression. The Journal of Economic Inequality, 5(2), 199-212.

[3] Shorrocks, A. F. (1999). Decomposition procedures for distributional analysis: A unified framework based on the Shapley value (mimeo). University of Essex.

[4] Shorrocks, A. F. (2012). Decomposition procedures for distributional analysis: A unified framework based on the Shapley value. The Journal of Economic Inequality, 11(1), 99-126. doi:10.1007/s10888-011-9214-z.

[5] Ye, D., Ng, Y. K., & Lian, Y. (2015). Culture and happiness. Social Indicators Research, 123(2), 519-547. Note: this article explains and applies the method introduced in this post in detail.
