注:R语言的再复习之路
glm()
函数广义线性模型的拟合形式为:
g ( μ Y ) = β 0 + ∑ j = 1 p β j X j , g(\mu_Y) = \beta_0 + \sum_{j=1}^p\beta_jX_j, g(μY)=β0+j=1∑pβjXj,
其中 g ( μ Y ) g(\mu_Y) g(μY)是条件均值函数(称为连接函数),这里 Y Y Y服从指数组分布中的一种即可。
glm()
函数基本形式:glm(formula, family = family(link = function), data = )
分布族 | 默认的连接函数 |
---|---|
binomial | link = 'logit' |
gaussian | link = 'identity' |
gamma | link = 'inverse' |
poisson | link = 'log' |
quasibinomial | link = 'logit' |
quasipoisson | link = 'log' |
函数 | 展示拟合模型的细节 |
---|---|
summary() |
展示拟合模型的细节 |
coefficients(), coef() |
列出拟合模型的参数 |
confint() |
给出模型参数的置信区间 |
residuals() |
列出拟合模型的残差值 |
anova() |
生成两个拟合模型的方差分析表 |
plot() |
生成评价拟合模型的诊断图 |
predict() |
用拟合模型对新数据集进行预测 |
deviance() |
拟合模型的偏差 |
df.residual() |
拟合模型的残差自由度 |
### 绘制初始变量的预测值与残差的图形
plot(predict(model, type = 'response'), residuals(model, type = 'deviance'))
### 识别离群点、高杠杆值点、强影响点
plot(hatvalues(model))
plot(rstudent(model))
plot(cooks.distance(model))
### 综合性的诊断图
library(car)
influencePlot(model)
### 基本的数据整理,将出轨次数转化成二值型因子
data(Affairs, package = 'AER')
Affairs <- within(Affairs, {
ynaffair <- NA
ynaffair[affairs > 0] <- 1
ynaffair[affairs == 0] <- 0
})
Affairs$ynaffair <- factor(Affairs$ynaffair, levels = c(0, 1), labels = c('No', 'Yes'))
### 构建Logistic回归
fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children + religiousness + education + occupation + rating, data = Affairs, family = binomial())
summary(fit.full)
### 把不显著的变量删除,再次构建Logistic回归
fit.reduced <- glm(ynaffair ~ age + yearsmmaried + religiousness + rating, data = Affairs, family = binomial())
summary(fit.reduced)
### 对两个拟合模型进行方差分析
anova(fit.reduced, fit.full, test = 'Chisq') ## 若检验表明两个拟合模型没有显著差别,则删除以上的变量是合理的
coef(fit.reduced)
exp(coef(fit.reduced))
将拟合参数指数化后,参数的数值表明该参数增加一个单位后,响应变量优势比 Y 1 − Y Y\over {1-Y} 1−YY变化的幅度。
### 评价婚姻评分对婚外情概率的影响
testdata <- data.frame(rating = c(1, 2, 3, 4, 5), age = mean(Affairs$age), yearsmarried = mean(Affairs$yearsmarried), religiousness = mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced, newdata = testdata, type = 'response')
### 评价年龄对婚外情概率的影响
testdata <- data.frame(rating = mean(Affairs$rating), age = seq(17, 57, 10), yearsmarried = mean(Affairs$yearsmarried), religiousness = mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced, newdata = testdata, type = 'response')
过度离势:观测到的响应变量的方差大于期望的二项分布的方差
deviance(fit.reduced)/df.residual(fit.reduced)
fit <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, data = Affairs, family = binomial())
fit.od <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, data = Affairs, family = quasibinomial())
pchisq(summary(fit.od)$dispersion * fit$df.residual, fit$df.residual, lower = F)
data(breslow.dat, package = 'robust')
fit <- glm(sumY ~ Base + Age + Trt, data = breslow.dat, family = poisson())
summary(fit)
coef(fit)
exp(coef(fit))
注:这里将参数指数化并不是一成不变的,要根据具体的模型而来,观察怎样变换能够更好地解释预测变量对于响应变量的影响。
deviance(fit) / df.residual(fit)
fit <- glm(sumY ~ Base + Age + Trt, data = breslow.dat, family = poisson())
fit.od <- glm(sumY ~ Base + Age + Trt, data = breslow.dat, family = quasipoisson())
pchisq(summary(fit.od)$dispersion * fit$df.residual, fit$df.residual, lower = F)
library(qcc)
qcc.overdispersion.test(breslow.dat$sumY, type = 'poisson')