1. Two ways to compare models
(1) Comparing two models with the anova() function
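The code in this section assumes a states data frame; a minimal sketch of how it is typically built from R's built-in state.x77 dataset (an assumption, in case it was defined earlier):
# build the states data frame from the built-in state.x77 matrix
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])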
fit1 <- lm(Murder ~ Population + Illiteracy + Income +
Frost, data = states)
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)
anova(fit2, fit1)
Model 1: Murder ~ Population + Illiteracy
Model 2: Murder ~ Population + Illiteracy + Income + Frost
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     47 289.25
2     45 289.17  2  0.078505 0.0061 0.9939
Model 1 is nested within Model 2. The anova() function tests whether Income and Frost should be added to the linear model. Because the test is not significant (p = 0.994), we conclude that these two variables do not need to be added and can be dropped from the model.
(2) The AIC (Akaike Information Criterion) can also be used to compare models. It takes into account both a model's statistical fit and the number of parameters used to achieve that fit. Models with smaller AIC values are preferred: they attain adequate fit with fewer parameters. The criterion is implemented by the AIC() function.
fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)
AIC(fit1, fit2)
     df      AIC
fit1  6 241.6429
fit2  4 237.6565
Here fit2 has the smaller AIC and is therefore the preferred model. Note that ANOVA requires nested models, whereas the AIC approach does not.
2. Several methods for reducing the number of predictors
The stepAIC() function in the MASS package performs stepwise regression (forward, backward, and combined forward/backward), based on the exact AIC criterion.
(1) Forward stepwise regression: predictors are added to the model one at a time, stopping when adding further variables would no longer improve the model.
(2) Backward stepwise regression: start with a model that contains all the predictors and remove them one at a time until removing another variable would degrade the quality of the model.
(3) Forward/backward stepwise regression (usually just called stepwise regression, to avoid the redundant name) combines the two approaches: variables enter the model one at a time, but at each step every variable is re-evaluated, and any that no longer contribute are removed. A predictor may be added and removed several times before the optimal model is reached (see the sketch after the backward example below).
library(MASS)
fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
stepAIC(fit1, direction = "backward")
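The combined search described in (3) can be requested with direction = "both". A minimal sketch, starting from an intercept-only model (fit.null is just an illustrative name):
# stepwise (forward/backward) search starting from the null model
fit.null <- lm(Murder ~ 1, data = states)
stepAIC(fit.null, direction = "both",
        scope = list(lower = ~ 1,
                     upper = ~ Population + Illiteracy + Income + Frost))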
Drawback: although stepwise regression may find a good model, there is no guarantee that it is the best one, because not every possible model is evaluated.
(4) All-subsets regression: every possible model is examined. All-subsets regression can be carried out with the regsubsets() function in the leaps package.
library(leaps)
leaps <- regsubsets(Murder ~ Population + Illiteracy +
Income + Frost, data = states, nbest = 4)
plot(leaps, scale = "adjr2")
The resulting plot shows which predictors appear in the best models of each size, ordered by adjusted R-square, with the best models at the top.
library(car)
subsets(leaps, statistic = "cp",
    main = "Cp Plot for All Subsets Regression")
abline(1, 1, lty = 2, col = "red")
In a Mallows Cp plot, good models fall close to the reference line with intercept 1 and slope 1, which is what abline(1, 1) draws.
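If numeric output is preferred over the plots, the fit statistics can be pulled from the summary of the regsubsets object; a minimal sketch using the leaps object created above:
best <- summary(leaps)
# adjusted R-square of each candidate model
best$adjr2
# predictors included in the model with the highest adjusted R-square
best$which[which.max(best$adjr2), ]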
3. How can we assess how good a model is?
(1) Cross-validation. In k-fold cross-validation the sample is split into k subsamples; each subsample in turn is held out as a test set while the model is fit on the remaining k - 1 subsamples, and the pooled predictions yield a cross-validated R-square. The shrinkage() function below uses crossval() from the bootstrap package to report both the original and the cross-validated R-square; a large drop (shrinkage) indicates that the model is unlikely to predict new data as well as it fits the current sample.
shrinkage <- function(fit, k = 10) {
    require(bootstrap)
    # fitting and prediction functions used by crossval()
    theta.fit <- function(x, y) {
        lsfit(x, y)
    }
    theta.predict <- function(fit, x) {
        cbind(1, x) %*% fit$coef
    }
    # matrix of predictors
    x <- fit$model[, 2:ncol(fit$model)]
    # vector of observed responses
    y <- fit$model[, 1]
    results <- crossval(x, y, theta.fit, theta.predict, ngroup = k)
    r2 <- cor(y, fit$fitted.values)^2
    r2cv <- cor(y, results$cv.fit)^2
    cat("Original R-square =", r2, "\n")
    cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
    cat("Change =", r2 - r2cv, "\n")
}
fit <- lm(Murder ~ Population + Income + Illiteracy +
Frost, data = states)
shrinkage(fit)
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)
shrinkage(fit2)
(2) Relative importance of variables. One simple approach is to compare standardized regression coefficients: after scaling all variables to a mean of 0 and a standard deviation of 1, the predictor with the largest absolute coefficient can be regarded as the most important.
zstates <- as.data.frame(scale(states))
zfit <- lm(Murder ~ Population + Income + Illiteracy +
Frost, data = zstates)
coef(zfit)
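As a quick way to rank predictors by these standardized coefficients, their absolute values can be sorted; a minimal sketch:
# rank predictors by absolute standardized coefficient (intercept dropped)
sort(abs(coef(zfit)[-1]), decreasing = TRUE)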
A more refined approach is the relative weights method, which closely approximates the average increase in R-square obtained by adding a predictor across all possible submodels. The relweights() function below implements it:
relweights <- function(fit, ...) {
    R <- cor(fit$model)
    nvar <- ncol(R)
    rxx <- R[2:nvar, 2:nvar]  # correlations among the predictors
    rxy <- R[2:nvar, 1]       # correlations between predictors and response
    svd <- eigen(rxx)         # eigen decomposition of the predictor correlations
    evec <- svd$vectors
    ev <- svd$values
    delta <- diag(sqrt(ev))
    # correlations between original predictors and new orthogonal variables
    lambda <- evec %*% delta %*% t(evec)
    lambdasq <- lambda^2
    # regression coefficients of Y on the orthogonal variables
    beta <- solve(lambda) %*% rxy
    rsquare <- colSums(beta^2)
    rawwgt <- lambdasq %*% beta^2
    import <- (rawwgt / rsquare) * 100
    lbls <- names(fit$model[2:nvar])
    rownames(import) <- lbls
    colnames(import) <- "Weights"
    # plot the relative weights as a bar chart
    barplot(t(import), names.arg = lbls, ylab = "% of R-Square",
            xlab = "Predictor Variables",
            main = "Relative Importance of Predictor Variables",
            sub = paste("R-Square = ", round(rsquare, digits = 3)),
            ...)
    return(import)
}
fit <- lm(Murder ~ Population + Illiteracy + Income +
Frost, data = states)
relweights(fit, col = "lightgrey")
The returned weights are percentages of R-square: they sum to 100, so each value gives the share of the model's explained variance attributable to that predictor.