用R做GLM的Summary相关指标解释——以Poission regression为例

用R做GLM的Summary相关指标解释

  • Residual
    • Response residuals
    • Working residuals
    • Pearson residuals
    • Deviance residuals
      • Deviance
      • Deviance residuals
      • Deviance residuals举例
    • Residual 小结
  • Null Deviance and Residual Deviance
    • 图解Null model, proposed model and saturated model——以Nomal regression为例
  • Confusing one: Deviance Residuals VS Residual Deviance
  • Other outputs of the summary function
    • Dispersion parameter
    • AIC
  • 参考文献

Residual

In ordinary least-squares, the residual associated with the i i i-th observation. The term residual comes from the residual sum of squares (RSS), which is defined as

r i = y i − f ( x i ) R S S = ∑ i = 1 n r i 2 r_i = y_i - f(x_i)\\ RSS=\sum_{i=1}^n r_i^2 ri=yif(xi)RSS=i=1nri2

where the residual r i r_i ri is defined as the difference between predicted values f ( x i ) f(x_i) f(xi) from the observed value y i y_i yi.

For GLMs, there are several ways for specifying residuals. We will cover four types of residuals: response residuals, working residuals, Pearson residuals, and, deviance residuals.

Response residuals

For type = "response", the conventional residual on the response level is computed, that is,
r i = y i − f ^ ( x i ) r_i=y_i−\hat f(x_i) ri=yif^(xi)

# type = "response"
res.response1 <- residuals(model.pois, type = "response")
res.response2 <- expected - estimates
all.equal(res.response1, res.response2)
## [1] TRUE

Working residuals

For type = "working", the residuals are normalized by the estimates f ^ ( x i ) \hat f(x_i) f^(xi):

r i = y i − f ^ ( x i ) f ^ ( x i ) r_i=\frac{y_i−\hat f(x_i)}{\hat f(x_i)} ri=f^(xi)yif^(xi)

# type = "working"
res.working1 <- residuals(model.pois, type="working")
res.working2 <- (expected - estimates) / estimates
all.equal(res.working1, res.working2)
## [1] TRUE

Pearson residuals

For type = "pearson", the Pearson residuals are computed. They are obtained by normalizing the residuals by the square root of the estimate:

r i = y i − f ^ ( x i ) f ( x i ) r_i=\frac{y_i−\hat f(x_i)}{\sqrt{f(x_i)}} ri=f(xi) yif^(xi)

# type = "pearson"
res.pearson1 <- residuals(model.pois, type="pearson")
res.pearson2 <- (expected - estimates) / sqrt(estimates)
all.equal(res.pearson1, res.pearson2)
## [1] TRUE

Deviance residuals

Deviance residuals are defined by the deviance.

Deviance

The deviance of a model is given by

D ( y , μ ^ ) = 2 ( log ⁡ ( p ( y ∣ θ ^ s ) ) − log ⁡ ( p ( y ∣ θ ^ 0 ) ) ) D(y,\hat μ)=2(\log (p(y∣\hat θ_s))−\log (p(y∣\hat θ_0))) D(y,μ^)=2(log(p(yθ^s))log(p(yθ^0)))

where

  • y y y is the outcome.
  • μ ^ \hat μ μ^ is the estimate of the model.
  • θ ^ s \hat θ_s θ^s and θ ^ 0 \hat θ_0 θ^0 are the parameters of the fitted saturated and proposed models, respectively. A saturated model has as many parameters as it has training points, that is, p = n p=n p=n. Thus, it has a perfect fit. The proposed model can be the any other model.
  • p(y|θ) is the likelihood of data given the model.

The deviance indicates the extent to which the likelihood of the saturated model exceeds the likelihood of the proposed model. If the proposed model has a good fit, the deviance will be small. If the proposed model has a bad fit, the deviance will be high. For example, for the Poisson model, the deviance is
D = 2 ⋅ ∑ i = 1 n [ y i ⋅ log ⁡ ( y i μ ^ i ) − ( y i − μ ^ i ) ] D=2 \cdot \sum_{i=1}^n \left[ y_i \cdot \log (\frac{y_i}{\hat μ_i})−(y_i−\hat μ_i) \right] D=2i=1n[yilog(μ^iyi)(yiμ^i)]

具体推导过程参考: Poission regression deviance 的推导

  1. 首先假设 x x x只有一个位置参数, Y Y Y 服从Poission distribution,并且 Y Y Y 的均值是一个关于 x x x 的函数,即
    E ( Y ∣ x ) = λ ( x ) \mathbb E(Y|x) = \lambda (x) E(Yx)=λ(x)

  2. 对于Poission regression用 log ⁡ \log log 作为link function,即 log ⁡ ( λ ( x ) ) = β 0 + β 1 x \log(\lambda(x))=\beta_0 + \beta_1 x log(λ(x))=β0+β1x

  3. 那么Likelihood function就是 L ( β 0 , β 1 ; y i ) = ∏ i = 1 n e − λ ( x i ) [ λ ( x i ) y i ] y i ! = ∏ i = 1 n e e − β 0 + β 1 x i [ e β 0 + β 1 x i ] y i ] y i ! L(\beta_0, \beta_1; y_i) = \prod \limits_{i=1}^n \frac{e^{-\lambda(x_i)}[\lambda(x_i)^{y_i}]}{y_i !} = \prod \limits_{i=1}^n \frac{e^{e^{-\beta_0 + \beta_1 x_i}}[e^{\beta_0 + \beta_1 x_i}]^{y_i}]}{y_i !} L(β0,β1;yi)=i=1nyi!eλ(xi)[λ(xi)yi]=i=1nyi!eeβ0+β1xi[eβ0+β1xi]yi]

Log likelihood function就是:
l ( β 0 , β 1 ; y i ) = ∑ i = 1 n y i ( β 0 + β 1 x i ) − ∑ i = 1 n e β 0 + β 1 x i − ∑ i = 1 n log ⁡ ( y i ! ) ( 1 ) l(\beta_0, \beta_1; y_i) = \sum \limits_{i=1}^n y_i(\beta_0 + \beta_1 x_i) - \sum \limits_{i=1}^n e^{\beta_0 + \beta_1 x_i} - \sum \limits_{i=1}^n \log (y_i !) \quad \quad (1) l(β0,β1;yi)=i=1nyi(β0+β1xi)i=1neβ0+β1xii=1nlog(yi!)(1)

写成用 μ ^ i \hat \mu_i μ^i表示均值预测值的形式,即
l ( μ ^ ) = ∑ i = 1 n y i log ⁡ ( μ ^ i ) − ∑ i = 1 n μ ^ i − ∑ i = 1 n log ⁡ ( y i ! ) ( 1 ′ ) l(\hat \mu) = \sum \limits_{i=1}^n y_i \log(\hat \mu_i) - \sum \limits_{i=1}^n {\hat \mu_i} - \sum \limits_{i=1}^n \log (y_i !) \quad \quad (1') l(μ^)=i=1nyilog(μ^i)i=1nμ^ii=1nlog(yi!)(1)

  1. μ 1 , μ 2 , ⋯   , μ n \mu_1,\mu_2,\cdots,\mu_n μ1,μ2,,μn代替"saturated model"的参数(saturated model:每个数据点都有一个自己的参数构成的模型)。

那么,saturated model 的 log likelihood function 即为:
l ( μ ) = ∑ i = 1 n y i log ⁡ ( μ i ) − ∑ i = 1 n μ i − ∑ i = 1 n log ⁡ ( y i ! ) ( 2 ) l(\mu) = \sum \limits_{i=1}^n y_i \log(\mu_i) - \sum \limits_{i=1}^n {\mu_i} - \sum \limits_{i=1}^n \log (y_i !) \quad \quad (2) l(μ)=i=1nyilog(μi)i=1nμii=1nlog(yi!)(2)

欲求(2)式中 μ i \mu_i μi的最优估计,即对(2)求偏导数,并令其等于0,即:
∂ ∂ μ i l ( μ ) = y i μ i − 1 = 0 \frac{\partial}{\partial \mu_i}l(\mu) = \frac{y_i}{\mu_i} - 1=0 μil(μ)=μiyi1=0
解得,’‘saturated model’’ 的参数估计值为:
μ ^ i = y i \hat \mu_i = y_i μ^i=yi
将上面的结果带入到(2)式中,即得“saturated model”的likelihood function:
l ( μ ^ ) = ∑ i = 1 n y i log ⁡ ( y i ) − ∑ i = 1 n y i − ∑ i = 1 n log ⁡ ( y i ! ) ( 3 ) l(\hat \mu) = \sum \limits_{i=1}^n y_i \log(y_i) - \sum \limits_{i=1}^n {y_i} - \sum \limits_{i=1}^n \log (y_i !) \quad \quad (3) l(μ^)=i=1nyilog(yi)i=1nyii=1nlog(yi!)(3)

  1. 由Deviance的定义: D ( y , μ ^ ) = 2 ( log ⁡ ( p ( y ∣ θ ^ s ) ) − log ⁡ ( p ( y ∣ θ ^ 0 ) ) ) D(y,\hat μ)=2(\log (p(y∣\hat θ_s))−\log (p(y∣\hat θ_0))) D(y,μ^)=2(log(p(yθ^s))log(p(yθ^0))),即用 2 ⋅ [ ( 3 ) − ( 1 ′ ) ] , 得 2 \cdot [(3)-(1')],得 2[(3)(1)]
    D ( y , μ ^ ) = 2 ⋅ [ ∑ i = 1 n y i log ⁡ y i μ ^ i − ∑ i = 1 n ( y i − μ ^ i ) ] D(y,\hat \mu)=2 \cdot \left[\sum \limits_{i=1}^n y_i \log \frac{y_i}{\hat \mu_i} - \sum \limits_{i=1}^n (y_i - \hat \mu_i) \right] D(y,μ^)=2[i=1nyilogμ^iyii=1n(yiμ^i)]

注意:这仅仅对于分类模型,有时也用 (D)表示Deviance, (D∗)表示Scaled Deviance,即D除以上一个尺度参数。


D ∗ = D ⋅ ϕ D^{*}=D \cdot \phi D=Dϕ
例如,Poission分布的尺度参数就为1,Nomal distribution的尺度参数为 σ 2 \sigma^2 σ2.

所以,Nomal distribution的Deviance推出来是
D ( y , μ ) = ( y − μ ) 2 D(y,\mu)=(y−\mu)^2 D(y,μ)=(yμ)2
Scaled Deviance是
D ∗ ( y , μ ) = ( y − μ ) 2 σ 2 D^{*}(y,\mu)=\frac{(y−\mu)^2}{\sigma^2} D(y,μ)=σ2(yμ)2
详情https://stats.stackexchange.com/questions/315788/deviance-in-simple-linear-regression

Deviance residuals

In R, the deviance residuals represent the contributions of individual samples to the deviance D. More specifically, they are defined as the signed square roots of the unit deviances.

How does such a deviance look like in practice? For example, for the Poisson distribution, the deviance residuals are defined as:(这里我个人有个小疑问暂时没解决,根号下的数不一定是正的呀)

r i = s i g n ( y − μ ^ i ) ⋅ 2 ⋅ [ y i ⋅ log ⁡ ( y i μ ^ i ) − ( y i − μ ^ i ) ] r_i=sign (y−\hat μ_i) \cdot \sqrt{2 \cdot \left[ y_i⋅\log \left( \frac{y_i} {\hat μ_i} \right)−(y_i−\hat μ_i) \right]} ri=sign(yμ^i)2[yilog(μ^iyi)(yiμ^i)]

# type = "deviance"
res.dev1 <- residuals(model.pois, type = "deviance")
res.dev2 <- residuals(model.pois)
poisson.dev <- function (y, mu) 
    # unit deviance
    2 * (y * log(ifelse(y == 0, 1, y/mu)) - (y - mu))
res.dev3 <- sqrt(poisson.dev(expected, estimates)) * 
        ifelse(expected > estimates, 1, -1)
all.equal(res.dev1, res.dev2, res.dev3)
## [1] TRUE

Deviance residuals举例

We can obtain the deviance residuals of our model using the residuals function:

summary(residuals(model.pois))

如果有以下输出

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.0451 -2.4138 -0.8169 -0.1840  1.4265  8.7946

Since the median deviance residual is close to zero, this means that our model is not biased in one direction (i.e. the out come is neither over- nor underestimated).

Tips:

Note that, for ordinary least-squares models, the deviance residual is identical to the conventional residual.

推导见:
https://bookdown.org/egarpor/PM-UC3M/glm-deviance.html#eq:dev
或者
https://stats.stackexchange.com/questions/315788/deviance-in-simple-linear-regression

Residual 小结

名称 Response residuals(conventional residuals) Working residuals Pearson residuals Deviance residuals
公式 r i = y i − f ^ ( x i ) r_i=y_i−\hat f(x_i) ri=yif^(xi) r i = y i − f ^ ( x i ) f ^ ( x i ) r_i=\frac{y_i−\hat f(x_i)}{\hat f(x_i)} ri=f^(xi)yif^(xi) r i = y i − f ^ ( x i ) f ( x i ) r_i=\frac{y_i−\hat f(x_i)}{\sqrt{f(x_i)}} ri=f(xi) yif^(xi) r i = s i g n ( y − μ ^ i ) ⋅ 2 ⋅ [ y i ⋅ log ⁡ ( y i μ ^ i ) − ( y i − μ ^ i ) ] r_i=sign (y−\hat μ_i) \cdot \sqrt{2 \cdot \left[ y_i⋅\log \left( \frac{y_i} {\hat μ_i} \right)−(y_i−\hat μ_i) \right]} ri=sign(yμ^i)2[yilog(μ^iyi)(yiμ^i)]
意义 常规 Normalize by the estimates Normalize the residuals by the square root of the estimate represent the contributions of individual samples to the deviance D D D

Null Deviance and Residual Deviance

图解Null model, proposed model and saturated model——以Nomal regression为例

这部分参考了视频:https://www.youtube.com/watch?v=9T0wlKdew6I&t=0s

我们假设只有均值这一种未知参数(即方差已知)。

用R做GLM的Summary相关指标解释——以Poission regression为例_第1张图片

如图,左上角的就是Null model,即这些数据只用一个参数去刻画;右上角的就是Saturated Model,即每个数据都有一个参数刻画它,模型很饱和;下方的是Proposed Model,即我们使用的模型,需要去评估的模型,它可以是任何介于Null Model和Saturated Model之间的模型。

Since we have already introduced the deviance, understanding the null and residual deviance is not a challenge anymore. Let us repeat the definition of the deviance once again:

D ( y , μ ) = 2 ( log ⁡ ( p ( y ∣ θ s ) ) − log ⁡ ( p ( y ∣ θ 0 ) ) ) D(y,\mu)=2(\log(p(y∣\theta^s))−\log(p(y∣\theta^0))) D(y,μ)=2(log(p(yθs))log(p(yθ0)))

The null and residual deviance differ in θ 0 θ_0 θ0:

  • Null deviance: θ 0 θ_0 θ0 refers to the null model (i.e. an intercept-only model)
  • Residual deviance: θ 0 θ_0 θ0 refers to the trained model

How can we interpret these two quantities?

  • Null deviance: A low null deviance implies that the data can be modeled well merely using the intercept. If the null deviance is low, you should consider using few features for modeling the data.
  • Residual deviance: A low residual deviance implies that the model you have trained is appropriate. Congratulations!

图解:
用R做GLM的Summary相关指标解释——以Poission regression为例_第2张图片

Confusing one: Deviance Residuals VS Residual Deviance

名称 Deviance Residuals Residual Deviance
公式 r i = s i g n ( y − μ ^ i ) ⋅ 2 ⋅ [ y i ⋅ log ⁡ ( y i μ ^ i ) − ( y i − μ ^ i ) ] r_i=sign (y−\hat μ_i) \cdot \sqrt{2 \cdot \left[ y_i⋅\log \left( \frac{y_i} {\hat μ_i} \right)−(y_i−\hat μ_i) \right]} ri=sign(yμ^i)2[yilog(μ^iyi)(yiμ^i)] R e s i d u a l D e v i a n c e = ∑ [ ( D e v i a n c e    R e s i d u a l ) i ] 2 Residual Deviance =\sum[(Deviance \, \, Residual)_i]^2 ResidualDeviance=[(DevianceResidual)i]2 D ( y , μ ^ ) = 2 ( log ⁡ ( p ( y ∣ θ ^ s ) ) − log ⁡ ( p ( y ∣ θ ^ 0 ) ) ) D(y,\hat μ)=2(\log (p(y∣\hat θ_s))−\log (p(y∣\hat θ_0))) D(y,μ^)=2(log(p(yθ^s))log(p(yθ^0)))
区别 Deviance Residuals是Residual的一种,重点关注每个样本点,区别于上文提到的另外三种Residual Residual Deviance是Deviance的一种,重点关注模型整体,是‘saturated model’与‘proposed model’的log likelihood差的二倍,区别于Null Deviance( ‘saturated model’ 与 ‘null model’ 的log likelihood差的二倍)

Other outputs of the summary function

Here, I deal with the other outputs of the GLM summary fuction: the dispersion parameter, the AIC, and the statement about Fisher scoring iterations.

Dispersion parameter

Dispersion (variability/scatter/spread) simply indicates whether a distribution is wide or narrow. The GLM function can use a dispersion parameter to model the variability.

However, for likelihood-based model, the dispersion parameter is always fixed to 1. It is adjusted only for methods that are based on quasi-likelihood estimation such as when family = "quasipoisson" or family = "quasibinomial". These methods are particularly suited for dealing with overdispersion.

AIC

The Akaike information criterion (AIC) is an information-theoretic measure that describes the quality of a model. It is defined as

A I C = 2 p − 2 ln ⁡ ( L ^ ) AIC=2p−2 \ln (\hat L) AIC=2p2ln(L^)

where p p p is the number of model parameters and L ^ \hat L L^ is the maximum of the likelihood function. A model with a low AIC is characterized by low complexity (minimizes p p p) and a good fit (maximizes L ^ \hat L L^).

参考文献

  1. 主要参考:Interpreting Generalized Linear Models

  2. 次要参考:Notes for Predictive Modeling 5.5

  3. youtube视频:Deviance Residuals

  4. BOOK:Beyond Multiple Linear Regression——Applied Generalized Linear Models and Multilevel Models in R 第四章

  5. Poission regression deviance 的推导

  6. wikipedia:Deviance

  7. https://stats.stackexchange.com/questions/5135/interpretation-of-rs-lm-output

  8. https://stats.stackexchange.com/questions/315788/deviance-in-simple-linear-regression

  9. https://stats.stackexchange.com/questions/108995/interpreting-residual-and-null-deviance-in-glm-r

  10. Data science in insurance: an R intro

你可能感兴趣的:(r语言)