Revisiting R: A Review
lm()
The general form for fitting a regression model:
myfit <- lm(Y ~ X1 + X2 + ... + Xk, data)
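Symbol | Purpose |
---|---|
~ | Separator: the response variable is on the left, the explanatory variables on the right. For example, Y ~ X + Z + W |
+ | Separates predictor variables |
: | Denotes an interaction between predictors. For example, Y ~ X + Z + X:Z |
* | Shorthand for all possible interactions. Y ~ X * Z * W expands to Y ~ X + Z + W + X:Z + X:W + Z:W + X:Z:W |
^ | Limits interactions to a given degree. Y ~ (X + Z + W)^2 expands to Y ~ X + Z + W + X:Z + X:W + Z:W |
. | Stands for all variables except the dependent variable. If a data frame contains the variables X, Y, Z, W, then Y ~ . expands to Y ~ X + Z + W |
-1 | Removes the intercept. For example, Y ~ X - 1 fits the regression of Y on X and forces the line through the origin. |
function | Mathematical functions can be used in the formula. For example, log(Y) ~ X + Z + W predicts log(Y) from X, Z, and W |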
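The expansions in the table can be checked directly in R. The following is a minimal sketch that uses terms() to list the terms a formula generates; the helper name `expand_terms` is made up for this illustration, and X, Z, W are placeholder variable names that do not need to exist.

```r
# List the model terms that a formula expands to.
# expand_terms() is a hypothetical helper, not part of base R.
expand_terms <- function(f) attr(terms(f), "term.labels")

star_terms  <- expand_terms(Y ~ X * Z * W)      # all interactions up to 3-way
power_terms <- expand_terms(Y ~ (X + Z + W)^2)  # interactions up to 2-way only

# The "intercept" attribute is 1 when an intercept is present, 0 when removed.
no_intercept <- attr(terms(Y ~ X - 1), "intercept")

print(star_terms)
print(power_terms)
print(no_intercept)
```

Because terms() only parses the formula, no data frame is needed here. The `.` shorthand is the exception: it can only be expanded when a data argument is supplied.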
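Function | Purpose |
---|---|
summary() | Shows detailed results for the fitted model |
coefficients() | Lists the model parameters of the fitted model |
confint() | Gives confidence intervals for the model parameters (95% by default) |
fitted() | Lists the predicted values of the fitted model |
residuals() | Lists the residuals of the fitted model |
anova() | Produces an analysis-of-variance table for a fitted model |
vcov() | Lists the covariance matrix of the model parameters |
AIC() | Prints the AIC (Akaike information criterion) statistic |
plot() | Produces diagnostic plots for evaluating the fitted model |
predict() | Uses the fitted model to predict the response for a new dataset |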
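As a quick illustration of the extractor functions in the table, here is a small sketch on the built-in women dataset. The data frame `new_women` and its height values are made up for the example.

```r
# Fit a simple regression and pull out pieces of the result.
fit <- lm(weight ~ height, data = women)

coefs <- coefficients(fit)   # intercept and slope
ci    <- confint(fit)        # 95% confidence intervals by default
vc    <- vcov(fit)           # covariance matrix of the estimates

# Fitted values plus residuals reproduce the observed response.
recon <- fitted(fit) + residuals(fit)

# Predict weight for new heights (hypothetical values, in inches).
new_women <- data.frame(height = c(60, 66, 72))
pred <- predict(fit, newdata = new_women)

print(coefs)
print(ci)
print(pred)
```

Note that predict() looks up predictors by name, so the column in the new data frame has to be called height, exactly as in the formula.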
fit <- lm(weight ~ height, data = women)
summary(fit)
plot(women$height, women$weight, xlab = 'Height (in inches)', ylab = 'Weight (in pounds)')
abline(fit)
# Polynomial regression: add a quadratic term in height
fit2 <- lm(weight ~ height + I(height^2), data = women)
summary(fit2)
plot(women$height, women$weight, xlab = 'Height (in inches)', ylab = 'Weight (in pounds)')
lines(women$height, fitted(fit2))
library(car)
scatterplot(weight ~ height, data = women, spread = FALSE, smoother.args = list(lty = 2), pch = 19, main = 'Women Age 30-39', xlab = 'Height (inches)', ylab = 'Weight (lbs.)')
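In the scatterplot() call, spread = FALSE suppresses the spread and asymmetry information, that is, the root-mean-square of the positive and negative residuals around the smoothed curve. smoother.args = list(lty = 2) draws the loess fit as a dashed line. pch = 19 draws the points as filled circles.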
# 1. Examine the correlations among the variables
states <- as.data.frame(state.x77[, c('Murder', 'Population', 'Illiteracy', 'Income', 'Frost')])
cor(states)
library(car)
scatterplotMatrix(states, spread = FALSE, smoother.args = list(lty = 2), main = 'Scatter Plot Matrix')
# 2. Multiple linear regression
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
fit <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
# Display the interaction graphically
library(effects)
plot(effect('hp:wt', fit, xlevels = list(wt = c(2.2, 3.2, 4.2))), multiline = TRUE)
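To see what the interaction term means numerically, the following sketch computes predictions at the same three weights using only base R. The grid of hp values (100, 150, 200) is an arbitrary choice for illustration, and the names `fit_int`, `grid`, and `hp_slope` are introduced here.

```r
# Model with an hp-by-wt interaction on the built-in mtcars data.
fit_int <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)

# Predicted mpg over a small grid: three hp values at each of three weights.
grid <- expand.grid(hp = c(100, 150, 200), wt = c(2.2, 3.2, 4.2))
grid$mpg_hat <- predict(fit_int, newdata = grid)
print(grid)

# With an interaction, the slope of mpg on hp depends on wt:
# slope(wt) = b_hp + b_hp:wt * wt
b <- coef(fit_int)
hp_slope <- unname(b["hp"] + b["hp:wt"] * c(2.2, 3.2, 4.2))
print(hp_slope)
```

If the slopes in `hp_slope` differ across the three weights, the relationship between horsepower and mileage changes with car weight, which is exactly what a multiline effect plot shows.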
Statistical assumptions of OLS regression
Calling plot() on the object returned by lm() produces four diagnostic plots for evaluating the model fit.
fit <- lm(weight ~ height, data = women)
par(mfrow = c(2, 2))
plot(fit)
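Function | Purpose |
---|---|
qqPlot() | Quantile comparison plot |
durbinWatsonTest() | Durbin-Watson test for autocorrelated errors |
crPlots() | Component-plus-residual plots |
ncvTest() | Score test for nonconstant error variance |
spreadLevelPlot() | Spread-level plot |
outlierTest() | Bonferroni outlier test |
avPlots() | Added-variable plots |
influencePlot() | Regression influence plot |
scatterplot() | Enhanced scatter plot |
scatterplotMatrix() | Enhanced scatter plot matrix |
vif() | Variance inflation factors |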
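# Refit the states model so the diagnostics below match the row labels
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
# Q-Q plot
qqPlot(fit, labels = row.names(states), simulate = TRUE, main = 'Q-Q Plot')
# Plot the distribution of studentized residuals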
residplot <- function(fit, nbreaks = 10){
z <- rstudent(fit)
hist(z, breaks = nbreaks, freq = FALSE, xlab = 'Studentized Residual', main = 'Distribution of Errors')
rug(jitter(z), col = 'brown')
curve(dnorm(x, mean = mean(z), sd = sd(z)), add = TRUE, col = 'blue', lwd = 2)
lines(density(z)$x, density(z)$y, col = 'red', lwd = 2, lty = 2)
legend('topright', legend = c('Normal Curve', 'Kernel Density Curve'), lty = c(1, 2), col = c('blue', 'red'), cex = .7)
}
residplot(fit)
library(car)
durbinWatsonTest(fit)
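For intuition about what the Durbin-Watson test measures, the statistic itself can be computed by hand from the residuals. This is only a sketch of the statistic, not a replacement for the car function, which also reports a p-value. The names `dw` and `r1` are introduced here.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

e <- as.numeric(residuals(fit))
n <- length(e)

# Durbin-Watson statistic: squared successive differences over squared residuals.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals.
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

cat("Durbin-Watson statistic =", dw, "\n")
cat("Lag-1 autocorrelation   =", r1, "\n")
```

The statistic lies between 0 and 4 and is roughly 2 * (1 - r1), so values near 2 suggest little lag-1 autocorrelation. The test is only meaningful when the row order of the data reflects some real ordering, such as time.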
library(car)
crPlots(fit)
library(car)
ncvTest(fit)
spreadLevelPlot(fit)
library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)
library(car)
vif(fit)
# A square root of the variance inflation factor above 2 suggests a multicollinearity problem
sqrt(vif(fit)) > 2
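The variance inflation factor for a predictor can also be derived by hand: regress that predictor on all the other predictors and take 1 / (1 - R²). The sketch below does this in base R for the same four predictors; `manual_vif` is a name introduced here, and for a model like this one (numeric predictors, no interactions) it should agree with what vif() reports.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
predictors <- c("Population", "Illiteracy", "Income", "Frost")

# For each predictor, regress it on the remaining predictors.
manual_vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux <- lm(reformulate(others, response = v), data = states)
  1 / (1 - summary(aux)$r.squared)
})

print(round(manual_vif, 2))
print(sqrt(manual_vif) > 2)   # same rule of thumb as above
```

A large value means the predictor is well explained by the others, so its coefficient is estimated with inflated variance.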
Definition: outliers are observations that the model predicts poorly, that is, observations with large residuals.
library(car)
outlierTest(fit)
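The idea behind the Bonferroni outlier test can be sketched in base R: take the largest absolute studentized residual, compute its two-sided t-test p-value, and multiply by the number of observations. This is an illustrative reconstruction under that assumption, with variable names chosen here, rather than the car implementation itself.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

t_res <- rstudent(fit)             # studentized (deleted) residuals
n <- length(t_res)
df_t <- df.residual(fit) - 1       # n - p - 1 degrees of freedom

worst <- which.max(abs(t_res))     # most extreme observation
p_unadj <- unname(2 * pt(abs(t_res[worst]), df = df_t, lower.tail = FALSE))
p_bonf <- min(1, n * p_unadj)      # Bonferroni adjustment for n tests

cat("Most extreme observation:", names(worst), "\n")
cat("Studentized residual:", t_res[worst], "\n")
cat("Unadjusted p =", p_unadj, " Bonferroni p =", p_bonf, "\n")
```

Because only the single most extreme residual is tested, the procedure is usually repeated after removing a flagged observation to see whether further outliers appear.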
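Definition: high-leverage points are observations whose predictor values are far from the rest, that is, observations that are unusual in X. The response value Y plays no role in determining leverage.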
hat.plot <- function(fit){
p <- length(coefficients(fit))
n <- length(fitted(fit))
plot(hatvalues(fit), main = 'Index Plot of Hat Values')
abline(h = c(2, 3) * p / n, col = 'red', lty = 2)
}
hat.plot(fit)
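Hat values are the diagonal of the hat matrix H = X (X'X)^-1 X', which depends only on the predictors. As a sketch, they can be reproduced from a QR decomposition of the design matrix and compared against the 2p/n threshold used in the plot above; the variable names are chosen for this example.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

X <- model.matrix(fit)   # design matrix, including the intercept column
p <- ncol(X)
n <- nrow(X)

# With X = QR, the hat matrix is Q Q', so its diagonal is the row sums of Q^2.
Q <- qr.Q(qr(X))
h <- rowSums(Q^2)
names(h) <- rownames(X)

# Hat values sum to p, so the average is p / n.
cat("Sum of hat values:", sum(h), " average:", p / n, "\n")

# Observations above twice the average leverage.
high_leverage <- h[h > 2 * p / n]
print(sort(high_leverage, decreasing = TRUE))
```

The response variable never enters this calculation, which matches the definition: leverage is about unusual predictor values only.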
cutoff <- 4/(nrow(states) - length(fit$coefficients) - 2)
plot(fit, which = 4, cook.levels = cutoff)
abline(h = cutoff, lty = 2, col = 'red')
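The same cutoff can be applied numerically to list the observations that the plot would flag. This is a small base R sketch using cooks.distance(); the name `influential` is introduced here.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

# Same cutoff as above: 4 / (n - k - 1), with k predictors.
cutoff <- 4 / (nrow(states) - length(coef(fit)) - 2)

d <- cooks.distance(fit)
influential <- sort(d[d > cutoff], decreasing = TRUE)

cat("Cutoff =", cutoff, "\n")
print(influential)
```

An alternative rule of thumb treats only Cook's distances above 1 as influential; the 4 / (n - k - 1) cutoff is more liberal and flags more points.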
library(car)
avPlots(fit, ask = FALSE, id.method = 'identify')
library(car)
influencePlot(fit, id.method = 'identify', main = 'Influence Plot', sub = "Circle size is proportional to Cook's distance")
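In this plot, points with a vertical coordinate above +2 or below -2 can be considered outliers. Points with a horizontal coordinate above 0.2 or 0.3 can be considered high-leverage points. Circle size is proportional to influence, so points with very large circles may be influential observations that have a disproportionate effect on the parameter estimates.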
library(car)
summary(powerTransform(states$Murder))
library(car)
boxTidwell(Murder ~ Population + Illiteracy, data = states)
library(car)
spreadLevelPlot(fit)
Can be used when multicollinearity is a problem.
Comparing nested models with the anova() function:
fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)
anova(fit2, fit1)
If the test is not significant, that is, $p \geq 0.05$, the extra variables in fit1 are not needed and can be dropped from the model.
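The F test that anova() reports for two nested models can be reproduced from the residual sums of squares. The sketch below assumes the same two models as above and recomputes the statistic and its p-value by hand; the variable names are chosen for this example.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)

rss1 <- deviance(fit1); df1 <- df.residual(fit1)   # full model
rss2 <- deviance(fit2); df2 <- df.residual(fit2)   # reduced model

# F = (drop in RSS per extra parameter) / (residual variance of the full model)
f_stat <- ((rss2 - rss1) / (df2 - df1)) / (rss1 / df1)
p_val  <- pf(f_stat, df2 - df1, df1, lower.tail = FALSE)

tab <- anova(fit2, fit1)
cat("Manual F =", f_stat, " p =", p_val, "\n")
print(tab)
```

For deviance() on an lm object, the returned value is the residual sum of squares, which is why it can stand in for RSS here.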
AIC(fit1, fit2)
Prefer the model with the smaller AIC value.
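AIC balances fit against complexity: it is -2 times the log-likelihood plus 2 times the number of estimated parameters. A minimal sketch that recomputes it from logLik() follows; `manual_aic` is a hypothetical helper written for this illustration.

```r
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)
fit2 <- lm(Murder ~ Population + Illiteracy, data = states)

# AIC = -2 * logLik + 2 * (number of estimated parameters).
# For lm, the parameter count includes the error variance.
manual_aic <- function(model) {
  ll <- logLik(model)
  -2 * as.numeric(ll) + 2 * attr(ll, "df")
}

aic_manual <- c(fit1 = manual_aic(fit1), fit2 = manual_aic(fit2))
print(aic_manual)
print(AIC(fit1, fit2))
```

Unlike the anova() comparison, AIC does not require the models to be nested, but both models must be fitted to the same response and the same observations.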
library(MASS)
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
stepAIC(fit, direction = 'backward')
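All-subsets regression is implemented with the regsubsets() function. The "best" model can be chosen by criteria such as R-squared, adjusted R-squared, or the Mallows Cp statistic.
library(leaps)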
leaps <- regsubsets(Murder ~ Population + Illiteracy + Income + Frost, data = states, nbest = 4)
plot(leaps, scale = 'adjr2')
library(car)
subsets(leaps, statistic = 'cp', main = 'Cp Plot for All Subsets Regression')
abline(1, 1, lty = 2, col = 'red')
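In $k$-fold cross-validation, the sample is split into $k$ subsamples. Each subsample serves in turn as the hold-out set while the remaining $k-1$ subsamples are combined into the training set. This yields $k$ prediction equations. The predictive performance on the $k$ hold-out samples is recorded and then averaged.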
shrinkage <- function(fit, k = 10){
require(bootstrap)
theta.fit <- function(x, y){lsfit(x, y)}
theta.predict <- function(fit, x){cbind(1, x)%*%fit$coef}
x <- as.matrix(fit$model[, 2:ncol(fit$model)])
y <- fit$model[, 1]
results <- crossval(x, y, theta.fit, theta.predict, ngroup = k)
r2 <- cor(y, fit$fitted.values)^2
r2cv <- cor(y, results$cv.fit)^2
cat('Original R-square = ', r2, '\n')
cat(k, 'Fold Cross-Validated R-square = ', r2cv, '\n')
cat('Change = ', r2 - r2cv, '\n')
}
shrinkage(fit)
The smaller the drop in R-squared, the more accurate the predictions.
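The k-fold procedure described above can also be written out in base R, which makes each step visible and avoids the dependency on the bootstrap package. The following is a rough sketch under the same setup (states data, 10 folds); the seed and the variable names are arbitrary choices for this example, and the numbers will differ slightly from the crossval() version because the fold assignment is random.

```r
set.seed(1234)  # arbitrary seed so the fold assignment is reproducible

states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
form <- Murder ~ Population + Illiteracy + Income + Frost

k <- 10
n <- nrow(states)

# Randomly assign each observation to one of k folds of (nearly) equal size.
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, predict the held-out fold.
cv_pred <- numeric(n)
for (i in seq_len(k)) {
  held_out <- fold == i
  train_fit <- lm(form, data = states[!held_out, ])
  cv_pred[held_out] <- predict(train_fit, newdata = states[held_out, ])
}

r2   <- summary(lm(form, data = states))$r.squared
r2cv <- cor(states$Murder, cv_pred)^2

cat("Original R-square =", r2, "\n")
cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
cat("Change =", r2 - r2cv, "\n")
```

Every observation is predicted exactly once by a model that never saw it, so the cross-validated R-squared is a more honest estimate of how the model would perform on new data.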