Each Variable's Explanatory Contribution to a Model

After fitting a model, we usually care how much of the dependent variable it explains, and often want to know how much each independent variable contributes. Some nonlinear models such as Random Forest and XGBoost select variables as part of the fitting process, so a variable importance can be computed directly; for most other nonlinear models, each variable's contribution is hard to pin down. This article therefore restricts itself to (generalized) linear models, and considers two cases: ordinary linear regression and generalized linear models.


Ordinary Linear Regression

Decomposing the variation of the dependent variable (e.g. via ANOVA) gives

      R^{2}=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}

Here SS is short for Sums of Squares: SSR is the variation due to the Regression, SSE is the random (unexplained) variation, and SSTO is the total variation, with SSTO = SSR + SSE. R^{2} thus measures how much of the dependent variable the regression model explains, and is called the coefficient of determination. In simple linear regression (a single predictor), R^{2} equals r^{2}, where r is the correlation coefficient.

Because R^{2} can only grow as variables are added, a model with few variables but genuinely good explanatory power may show a smaller R^{2} than a model with many variables and mediocre explanatory power; comparing models with different numbers of variables this way is unfair. The adjusted R^{2} corrects for this, but it is not the focus of this article, so we will not dwell on it.
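As a quick illustration (a minimal sketch of my own using base R and the built-in mtcars data, not from any reference above), the decomposition and both versions of R^{2} can be verified by hand:

fit  <- lm(mpg ~ wt + hp, data = mtcars)          # ordinary linear regression
ssto <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # SSTO: total variation
sse  <- sum(residuals(fit)^2)                     # SSE: unexplained variation
ssr  <- ssto - sse                                # SSR: variation explained by the regression
c(R2 = ssr / ssto,                                # R^2 = SSR/SSTO
  R2.check = summary(fit)$r.squared,              # matches lm's own R^2
  adj.R2 = summary(fit)$adj.r.squared)            # the adjusted R^2 mentioned above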


Regarding each variable's contribution, Yi-Chun E. Chao et al. wrote a paper summarizing the topic; see the references at the end for details.

To evaluate the variables' relative importance, let l_j denote the relative importance of x_j. An ideal l_j should satisfy:

(1) l_j is non-negative for every x_j;

(2) the l_j sum to the model's total R^{2};

(3) l_j does not depend on the order in which x_j enters the model.

The sections below examine several candidate measures of l_j.


Univariate r^{2}

Regressing y on each variable by itself (or running a correlation analysis) yields each variable's own r^{2}, usually written r^2_{yx_j}.

However, these univariate values account for the model's R^{2} only when the variables are completely uncorrelated; only then does the following hold:

      R^{2}=\sum_{j}r^2_{yx_j}
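A small demonstration (my sketch, again with mtcars; wt and hp are correlated, so the equality fails in the expected direction):

cors <- cor(mtcars[, c("wt", "hp")], mtcars$mpg)     # r_{y,xj} for each predictor
cors^2                                               # univariate r^2 values
sum(cors^2)                                          # their sum...
summary(lm(mpg ~ wt + hp, data = mtcars))$r.squared  # ...exceeds the model R^2 here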

Type III SS and Type I SS

For details on this part, see: Sequential (or Extra) Sums of Squares

Type III SS, usually shown as Adjusted SS in software, is each variable's extra (independent) contribution once all p variables are in the model. In general the variables' Type III SS sum to less than SSR; the sum equals SSR only when the variables are completely uncorrelated. Correspondingly, a Type III r^{2} can be computed:

      r^2_{j,\mathrm{III}}=\frac{SS_{\mathrm{III}}(x_j)}{SSTO}

Type I SS, usually shown as Sequential SS in software, is the increase in SSR when the current variable is added after the preceding p−1 variables. The variables' Type I SS therefore sum exactly to SSR, but each value depends on the entry order (variables that enter first get the advantage). Correspondingly, there is a Type I r^{2}:

      r^2_{j,\mathrm{I}}=\frac{SS_{\mathrm{I}}(x_j)}{SSTO}
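In R (a sketch with base functions only; for models with interactions you would usually reach for car::Anova instead), anova() reports the Type I (sequential) SS and drop1() reports each term's marginal, Type III-style SS:

fit <- lm(mpg ~ wt + hp, data = mtcars)
anova(fit)                        # Type I SS: wt first, then hp's extra SS
anova(update(fit, . ~ hp + wt))   # reversed entry order changes the Type I SS
drop1(fit, test = "F")            # marginal SS for each term given all others (order-free)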


Partial R^{2} (Partial R-squared)

For details on this part, see: Partial R-squared

Partial R^{2} is also called the coefficient of partial determination. It too depends on the order in which variables are added: it is the proportion of the variation left unexplained by the previous p−1 variables that the newly added variable can explain, i.e.:

      R^2_{\mathrm{partial}}=\frac{SSE(\mathrm{reduced})-SSE(\mathrm{full})}{SSE(\mathrm{reduced})}

For example, adding the variables x2 and x3 to a model that already contains x1:

      R^2_{x_2,x_3|x_1}=\frac{SSE(x_1)-SSE(x_1,x_2,x_3)}{SSE(x_1)}

This quantity is typically used to test whether a newly added variable is worth including.
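A sketch of the computation (my example, with x1 = wt, x2 = hp, x3 = qsec in mtcars):

reduced <- lm(mpg ~ wt, data = mtcars)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
sse.red  <- sum(residuals(reduced)^2)    # variation x1 leaves unexplained
sse.full <- sum(residuals(full)^2)       # variation still unexplained after adding x2, x3
(sse.red - sse.full) / sse.red           # partial R^2 of x2, x3 given x1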


Pratt's Index

This measure was first proposed by Pratt. It is a product, B_jr_{yx_j}, where B_j is the standardized regression coefficient and r_{yx_j} is the correlation between y and x_j. As a measure of relative importance it is generally considered better than the measures above, and it is widely used.

R_p^2=\sum B_jr_{yx_j}, so B_jr_{yx_j} can be taken as x_j's explanatory contribution, from which each variable's share can be computed.

One problem is that the Pratt index can sometimes be negative. I wonder whether |B_jr_{yx_j}| could be used instead in that case.
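A sketch of the computation (my example; standardizing first so the coefficients are the beta weights B_j):

d   <- data.frame(scale(mtcars[, c("mpg", "wt", "hp")]))  # standardized y and x's
fit <- lm(mpg ~ wt + hp, data = d)
pratt <- coef(fit)[-1] * cor(d[, c("wt", "hp")], d$mpg)   # B_j * r_{y,xj}
pratt                          # per-variable Pratt index (can be negative in general)
sum(pratt)                     # sums to the model R^2
pratt / sum(pratt)             # each variable's share of R^2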


General Dominance Index and Relative Weight

Other methods include the General Dominance Index D_j and Johnson's Relative Weight \varepsilon _j.
D_j was first proposed by Budescu et al. As noted above, the Type I r^{2} depends on the order in which the current variable enters the model; the idea behind D_j is to enumerate every possible entry order, compute an r^{2} for each, and average them. See Yi-Chun E. Chao's paper for details. \varepsilon _j is likewise not covered here; again, see Chao's paper.
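The relaimpo package (assumed installed; this is my sketch, not from Chao's paper) implements this averaging-over-orders idea as its "lmg" metric, which coincides with the general dominance weights:

library(relaimpo)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
calc.relimp(fit, type = "lmg", rela = TRUE)   # sequential R^2 averaged over all entry orders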


VIP Score

PLSR (partial least squares regression) is also essentially a linear model. Combining the quantities produced during fitting (the projection weights and how much each projected component explains the dependent variable) yields a VIP score (Variable Importance in Projection) for each variable, which likewise reflects that variable's explanatory contribution to the model. VIP scores can be used for variable selection.
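For reference, one common formulation of the VIP score (a sketch following the usual PLS literature, not given in the sources above; here w_{jk} is variable j's weight on component k, SS_k is the variance of y explained by component k, p is the number of variables, and K the number of components):

      VIP_j=\sqrt{\frac{p\sum_{k=1}^{K}SS_k\,(w_{jk}/\|w_k\|)^{2}}{\sum_{k=1}^{K}SS_k}}

Variables with VIP greater than 1 are conventionally treated as important.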

The corresponding PLS-DA (partial least squares discriminant analysis) belongs to the generalized linear family; the principle is essentially the same as PLSR, with the regression task replaced by classification, and it also produces VIP scores.


Generalized Linear Models

The generalized linear models discussed here are mainly logistic regression and Cox regression.

Because the usual R^{2} is computed from ordinary least squares (OLS) and the F-statistic-based ANOVA, while models such as logistic regression are fitted by maximum likelihood (MLE), R^{2} cannot be computed directly. This motivates generalized versions of R^{2}, i.e. pseudo-R^{2}.

In logistic regression, one version of R^{2} is defined as:

      R^2=\frac{l(\hat{\beta _0})-l(\hat{\beta })}{l(\hat{\beta _0})-l_S(\hat{\beta })}=\frac{l(\hat{\beta _0})-l(\hat{\beta })}{l(\hat{\beta _0})}

Here l(\hat{\beta _0}) is the log likelihood of the intercept-only model, l_S(\hat{\beta }) is the log likelihood of a saturated model that fits every observation perfectly (its value is 0), and l(\hat{\beta }) is the log likelihood of the current model. The worst fit is the intercept-only model, giving R^{2}=0; the best fit reproduces the data perfectly, with log likelihood l_S(\hat{\beta })=0, giving R^{2}=1.

This kind of R^{2} usually needs further correction.
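Since l_S(\hat{\beta })=0 drops out, this is just McFadden's pseudo-R^{2}, easy to compute from base R's logLik() (a sketch of my own; mtcars' am as a binary outcome is only a stand-in):

fit  <- glm(am ~ hp + wt, family = binomial, data = mtcars)  # current model
null <- glm(am ~ 1, family = binomial, data = mtcars)        # intercept-only model
1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))       # pseudo R^2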

For other formulations of pseudo-R^{2}, see: Wikipedia and the Logistic Regression reference.


Likelihood Ratio Test

In logistic regression, the likelihood ratio test (LRT), also called the deviance test, assesses whether certain parameters in the model should be 0; in other words, whether the parameters that the new model (the complex, full model) adds over the original model (the simple, reduced model) are genuinely useful. For a walkthrough, see: 似然比检验 LRT

The test statistic is:

      \Lambda ^*=-2(l(\hat{\beta ^{(0)}})-l(\hat{\beta }))

This statistic follows a chi-square distribution, with degrees of freedom equal to the number of extra parameters; l(\hat{\beta ^{(0)}}) is the log likelihood of the original (reduced) model and l(\hat{\beta }) that of the new (full) model.

Define the deviance as −2 times the log likelihood. The statistic is then often written: G^2 = deviance(reduced) − deviance(full)

Here is an example taken from the Logistic Regression reference:

A model is fitted with the single covariate LI and compared against the model with no covariates (the null model) via the likelihood ratio test, with the following output:

[Analysis-of-deviance table for the LI model; its Total and Error deviances are quoted below.]

The null model's log likelihood is l(\hat{\beta ^{(0)}}) = −17.1859, so its deviance is 34.372, the value shown in the Total row;

the current model's log likelihood is l(\hat{\beta }) = −13.0365, so its deviance is 26.073, the value shown in the Error row;

G^2 = 34.372 − 26.073 = 8.299. Looking this up against the chi-square table gives p < 0.05, so the current model's extra parameter is worthwhile.

This example shows that a variable's chi-square value is exactly the deviance reduction it brings. The chi-square values from likelihood ratio tests can therefore serve as a variable-importance measure.
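The same calculation in R (my sketch; any pair of nested glm models works):

null <- glm(am ~ 1,  family = binomial, data = mtcars)   # reduced model
fit  <- glm(am ~ wt, family = binomial, data = mtcars)   # full model with one extra variable
anova(null, fit, test = "Chisq")    # reports the deviance drop and its chi-square p-value
deviance(null) - deviance(fit)      # G^2 computed by hand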


Wald Test

The Wald test compares each MLE parameter estimate \hat{\beta _i} against 0, using its standard error SE(\hat{\beta _i}) as the yardstick. To test whether \hat{\beta _i}=0, compute the statistic:

    Z=\frac{\hat{\beta _i}}{SE(\hat{\beta _i})}

In large samples, Z is asymptotically standard normal, and Z^{2} follows a chi-square distribution with 1 degree of freedom.

Confidence intervals for the parameters are also derived from the Wald statistic.

In linear regression, a t statistic tests whether a coefficient is 0; in GLMs, the Wald test plays that role.

The idea behind the Wald test is straightforward: compare the size of the estimate with its standard error. If the estimate's absolute value is much larger than its standard error, the estimate is unlikely to be 0; otherwise it may still be 0.

The Wald test also applies to the parameter estimates of a multivariable analysis: the larger a variable's Wald \chi ^2, the more statistically significant it is and the more it contributes to the model.

The Wald test is a rough approximation to the likelihood ratio test, but it is far easier to compute, so it is often used as a substitute.
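In R, the per-coefficient Wald statistics fall straight out of a glm fit (a sketch; the z value column of summary() is exactly this Z):

fit <- glm(am ~ hp + wt, family = binomial, data = mtcars)
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
z   <- est / se                                   # Wald Z for each coefficient
cbind(z = z, wald.chisq = z^2,                    # Z^2 ~ chi-square on 1 d.f.
      p = pchisq(z^2, df = 1, lower.tail = FALSE))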

Incidentally, I once saw a paper on an immune score for colorectal cancer that used the proportions of the regression model's Wald chi-square values to rank the variables. Digging into Frank Harrell's rms package and his book Regression Modeling Strategies, it turns out that for logistic regression and other non-OLS models one can run an ANOVA based on Wald statistics, with the chi-square values playing the role of the SS; in other words, the size of the Wald chi-square can measure a variable's importance. The book contains this line:

This is a very powerful model (ROC area = c = 0.88); the survival patterns are easy to detect. The Wald ANOVA in Table 12.2 indicates especially strong sex and pclass effects (χ2 = 199 and 109, respectively).

But I did not find any explanation of how to compute each variable's relative contribution as a proportion.

The book and the rms package documentation mention:

The plot.anova function draws a dot chart showing the relative contribution (χ2, χ2 minus d.f., AIC, partial R2, P -value, etc.) of each factor in the model.

and the margin argument of the anova function:

set to a vector of character strings to write text for selected statistics in the right margin of the dot chart. The character strings can be any combination of "chisq", "d.f.", "P", "partial R2", "proportion R2", and "proportion chisq" 

To mimic the R^{2} and partial R^{2} of linear models, Harrell demonstrates in the book how to construct R^{2} and partial R^{2} within the LRT framework, but then notes:

Since such likelihood ratio statistics are tedious to compute, the 1 d.f. Wald χ2 can be substituted for the LR statistic (keeping in mind that difficulties with the Wald statistic can arise)

In the paragraphs that follow, he presents two indexes constructed by other statisticians and traces the origin of the Wald-based pseudo-R^{2}.

[Two formula screenshots from the book are omitted here.]

Still, I learned something: besides the traditional F-statistic-based version, ANOVA can also be carried out with Wald statistics or LRT (likelihood ratio test) statistics. The relationship between the Wald test and the t test or F test is worth exploring some other time, e.g.: Are t test and one-way ANOVA both Wald tests?


In addition, D. Roland Thomas et al. defined a Pratt's-Index-style measure for logistic regression, which I will not explore here; but their paper notes:

... When the question relates to explanatory variables in logistic regression, the usual recommendation is to inspect the relative magnitudes of the Wald statistics for individual explanatory variables (or their square roots which can be interpreted as large sample z-statistics). The problem with this and related approaches can be easily explained with reference to the governance example. For the explanatory variable DISP, its Wald statistic (or its square root z-statistic) shown in Table 3 is a measure of the contribution of DISP to the logistic regression, over and above the contribution of explanatory variables SUPP and INDEP. Similarly, the Wald statistic for variable SUPP measures its contribution over and above variables DISP and INDEP. Clearly, it is not appropriate to use these two Wald statistics as measures of the relative contribution of DISP and SUPP because the reference set of variables is different in both cases (SUPP and INDEP in the first case, and DISP and INDEP in the second case). The equivalent problem occurs in linear regression, i.e., the t-statistics (or corresponding p-values) for individual variables are not appropriate for assessing relative importance. 

Setting aside what those capitalized abbreviations mean and what Table 3 contains, two conclusions can be drawn from this passage: (1) Wald statistics are the usual recommendation for assessing variable contribution; (2) the authors consider using them to measure relative contribution inappropriate.


M. Schemper also proposed a measure of variable contribution for Cox models, PVE, in a 1993 paper; see the references for details.

I came across two questions about relative importance on stackexchange, both answered by Prof. Harrell.

The first asks whether measuring relative importance with "proportion chisq" in the rms package is sound: Relative importance of variables in Cox regression

Excerpts from the exchange:

Adam Robinsson:

I've understood that relative importance of predictors is a tricky question. Suggested methods range from very complex models to very simple variable transformations. I've understood that the brightest still debate which way to go on this matter. I'm looking for an easy but still appealing method to approach this in survival analysis (Cox regression).

My aim is to answer the question: which predictor is the most important one (in terms of predicting the outcome). The reason is simple: clinicians want to know which risk factor to adress first. I understand that "important" in clinical setting is not equal to "important" in the regression-world, but there is a link.

Should I compute the proportion of explainable log-likelihood that is explained by each variable (see Frank Harrell post), by using:

library(survival); library(rms)
data(lung)
S <- Surv(lung$time, lung$status)
f <- cph(S ~ rcs(age,4) + sex, x=TRUE, y=TRUE, data=lung)
plot(anova(f), what='proportion chisq')

As I understand it, its only possible to use the 'proportion chisq' for Cox models and this should suffice to convey some sense of each variables relative importance. Or should I perhaps use the default plot(anova()), which displays Wald χ2 statistic minus its degrees of freedom for assessing the partial effect of each variable?

I would appreciate some guidance if anyone has any experience on this matter.

===========

Frank Harrell:

Thanks for trying those functions. I believe that both metrics you mentioned are excellent in this context. This is useful for any model that gives rise to Wald statistics (which is virtually all models) although likelihood ratio χ2 statistics would be even better (but more tedious to compute).

You can use the bootstrap to get confidence intervals for the ranks of variables computed these ways. For the example code type ?anova.rms.

All this is related to the "adequacy index". Two papers using the approach that have appeared in the medical literature are http://www.citeulike.org/user/harrelfe/article/13265566 and http://www.citeulike.org/user/harrelfe/article/13263849 .

===========

Adam Robinsson:

Many thanks for your time prof Harrell. I was delighted to find this function in the rms package, among the wealth of other useful functions. Considering the abovementioned approach, there was virtually no difference between the two measures. Thus, this appears to be an appealing approach, we'll see what the reviewers say.

I recently submitted a paper using your method professor Harrell. Most reviewers liked it but one reviewer claimed that Heller's method would be superior to the abovementioned method. Heller's method is explained here: ncbi.nlm.nih.gov/pmc/articles/PMC3297826 I did try Heller's method but it yields odd results (as far as I'm concerned). Have You, professor Harrell, compared the two methods and come to any conclusion as to which one is to prefer?

===========

Frank Harrell:

I like the Heller approach; I had not known about it before. I like the Kent and O'Quigley index a bit more (I'm not sure the +1 in the denominator is correct in Heller's description of it). But I still like measures that are functions of the gold standard log likelihood, such as the adequacy index, which is the easiest to compute


The other question asks which flavor of relative importance people prefer: Which variable relative importance method to use

Ha — the Pratt method shows up again (negative values and all).


Harrell's reply:

I prefer to compute the proportion of explainable log-likelihood that is explained by each variable. For OLS models the rms package makes this easy:

f <- ols(y ~ x1 + x2 + pol(x3, 2) + rcs(x4, 5) + ...)
plot(anova(f), what='proportion chisq')
# also try what='proportion R2'

The default for plot(anova()) is to display the Wald χ2 statistic minus its degrees of freedom for assessing the partial effect of each variable. Even though this is not scaled to [0, 1] it is probably the best method in general because it penalizes a variable requiring a large number of parameters to achieve the χ2. For example, a categorical predictor with 5 levels will have 4 d.f. and a continuous predictor modeled as a restricted cubic spline function with 5 knots will have 4 d.f.

If a predictor interacts with any other predictor(s), the χ2 and partial R2 measures combine the appropriate interaction effects with main effects. For example if the model was y ~ pol(age,2) * sex the statistic for sex is the combined effects of sex as a main effect plus the effect modification that sex provides for the age effect. This is an assessment of whether there is a difference between the sexes for any age.

Methods such as random forests, which do not favor additive effects, are not likelihood based, and use multiple trees, require a different notion of variable importance.


Some other interesting discussions came up in the related links:

For linear classifiers, do larger coefficients imply more important features?

Approaches to compare differences in means with differences in proportions?

Different prediction plot from survival coxph and rms cph


Other material on this topic:

Logistic Regression in R

Contribution of each Variables in Logistic Regression

Kenneth P. Burnham, Understanding AIC relative variable importance values


Effect Size

This overlaps with the material above but brings in more measures. Besides the correlation coefficient (Pearson r) and the coefficient of determination, they include Eta-squared (η2), Omega-squared (ω2), Cohen's ƒ2, and Cohen's q; see Wikipedia: Effect size.


References

Regression Methods

Regression Modeling Strategies

Logistic Regression

Wikipedia: Coefficient of determination

Wikipedia: Effect size

rms-Reference Manual

Yi-Chun E. Chao et al., "Quantifying the Relative Importance of Predictors in Multiple Linear Regression Analyses for Public Health Studies," Journal of Occupational and Environmental Hygiene, 5:8, 519-529. DOI: 10.1080/15459620802225481

Thomas, D. Roland, et al., "On Measuring the Relative Importance of Explanatory Variables in a Logistic Regression," Journal of Modern Applied Statistical Methods, Vol. 7, Iss. 1, Article 4. DOI: 10.22237/jmasm/1209614580

M. Schemper, "The relative importance of prognostic factors in studies of survival."

