- Linear Regression
- Model
- Prediction
- GLM
- Model
- Prediction
- LOESS regression
- Fitting only
- Extrapolation
Modeling and Prediction with R
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data("cars")
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
Linear Regression
Model
We start by fitting a simple linear regression model.
model <- lm(dist~speed, cars)
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The fitted model is: dist = -17.579 + 3.932*speed.
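As a quick check, the equation can be reproduced from the fitted coefficients. This is a minimal sketch that refits the same model so it runs on its own; the names `b`, `manual` and `auto` are illustrative and not part of the original post.

```r
# Refit the model from the text so the block is self-contained
model <- lm(dist ~ speed, data = cars)

# Named vector of estimates: "(Intercept)" and "speed"
b <- coef(model)

# Apply the equation by hand for speed = 10 and compare with predict()
manual <- unname(b["(Intercept)"] + b["speed"] * 10)
auto <- unname(predict(model, newdata = data.frame(speed = 10)))
all.equal(manual, auto)
```

Working from `coef(model)` rather than the rounded numbers in the summary avoids small rounding differences.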
Prediction
Confidence interval
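Use the predict function to make predictions for new data. With interval = "confidence", it returns each fitted value together with the 95% confidence interval for the mean response at that input.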
speeds <- data.frame(speed=c(10, 20, 53))
predict(model, newdata = speeds, interval = "confidence")
## fit lwr upr
## 1 21.74499 15.46192 28.02807
## 2 61.06908 55.24729 66.89088
## 3 190.83857 159.12292 222.55422
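To see where these bounds come from, the interval can be rebuilt by hand from the standard errors that predict returns with se.fit = TRUE. The sketch below assumes the usual t-based interval for a linear model; `pred`, `crit` and `ci` are illustrative names.

```r
model <- lm(dist ~ speed, data = cars)
speeds <- data.frame(speed = c(10, 20, 53))

# se.fit = TRUE also returns the standard errors and the residual degrees of freedom
pred <- predict(model, newdata = speeds, se.fit = TRUE)

# Two-sided 95% critical value of the t distribution
crit <- qt(0.975, df = pred$df)

# Interval for the mean response: fit +/- critical value * standard error
ci <- cbind(fit = pred$fit,
            lwr = pred$fit - crit * pred$se.fit,
            upr = pred$fit + crit * pred$se.fit)
ci
```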
Prediction interval
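With interval = "prediction", predict returns the 95% prediction interval for a single new observation at each input value. This interval is wider than the confidence interval for the mean.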
predict(model, newdata = speeds, interval = "prediction")
## fit lwr upr
## 1 21.74499 -9.809601 53.29959
## 2 61.06908 29.603089 92.53507
## 3 190.83857 146.542994 235.13415
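The prediction interval is wider because it adds the residual variance to the uncertainty of the fitted mean. A minimal sketch of that calculation, assuming the standard formula for a linear model with constant error variance; `se_new` and `pi_manual` are illustrative names.

```r
model <- lm(dist ~ speed, data = cars)
speeds <- data.frame(speed = c(10, 20, 53))
pred <- predict(model, newdata = speeds, se.fit = TRUE)

# Standard error for one new observation:
# uncertainty of the fitted mean plus the residual variance
se_new <- sqrt(pred$se.fit^2 + pred$residual.scale^2)

crit <- qt(0.975, df = pred$df)
pi_manual <- cbind(fit = pred$fit,
                   lwr = pred$fit - crit * se_new,
                   upr = pred$fit + crit * se_new)
pi_manual
```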
Visualize the predictions
# 1. Add predictions
pred.int <- predict(model, interval = "prediction")
mydata <- cbind(cars, pred.int)
# 2. Regression line + confidence intervals
p <- ggplot(mydata, aes(speed, dist)) +
geom_point() +
stat_smooth(method = lm, formula = y~x)
# 3. Add prediction intervals
p + geom_line(aes(y = lwr), color = "red", linetype = "dashed")+
geom_line(aes(y = upr), color = "red", linetype = "dashed") +
theme_bw()
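In the plot:
the blue line is the fitted linear regression line
the gray band is the confidence interval
the red dashed lines mark the prediction interval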
GLM
In R, generalized linear models are fitted with the glm function.
Model
The family argument selects the error distribution and link function of the fitted model.
glm.model <- glm(dist~speed, data = cars, family = gaussian)
summary(glm.model)
##
## Call:
## glm(formula = dist ~ speed, family = gaussian, data = cars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 236.5317)
##
## Null deviance: 32539 on 49 degrees of freedom
## Residual deviance: 11354 on 48 degrees of freedom
## AIC: 419.16
##
## Number of Fisher Scoring iterations: 2
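The fitted model is: dist = -17.579 + 3.932*speed.
This is the same fit as the simple linear regression above, because a Gaussian family with the identity link is equivalent to ordinary least squares.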
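One way to compare the two fits is to check the estimates directly instead of reading them off the summaries. The sketch below refits both models so it runs on its own; the names `same_coef` and `same_var` are illustrative.

```r
model <- lm(dist ~ speed, data = cars)
glm.model <- glm(dist ~ speed, data = cars, family = gaussian)

# Coefficients from lm() and a Gaussian glm() agree up to numerical tolerance
same_coef <- all.equal(coef(model), coef(glm.model))

# The glm dispersion is the squared residual standard error of the lm fit
same_var <- all.equal(summary(glm.model)$dispersion, summary(model)$sigma^2)

c(same_coef = same_coef, same_var = same_var)
```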
Prediction
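When predicting from a glm object, set se.fit = TRUE to also return the standard error of each prediction ($se.fit) and the residual standard deviation used to compute those standard errors ($residual.scale).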
predict(glm.model, newdata = speeds, se.fit = TRUE)
## $fit
## 1 2 3
## 21.74499 61.06908 190.83857
##
## $se.fit
## 1 2 3
## 3.124921 2.895501 15.773951
##
## $residual.scale
## [1] 15.37959
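predict for glm objects has no interval argument, so a confidence interval has to be assembled from $fit and $se.fit. A minimal sketch for the Gaussian case, assuming a t-based interval with the residual degrees of freedom; for other families the interval would usually be built on the link scale instead. `crit` and `ci` are illustrative names.

```r
glm.model <- glm(dist ~ speed, data = cars, family = gaussian)
speeds <- data.frame(speed = c(10, 20, 53))
pred <- predict(glm.model, newdata = speeds, se.fit = TRUE)

# t critical value using the residual degrees of freedom of the fit
crit <- qt(0.975, df = df.residual(glm.model))

# 95% confidence interval for the mean response
ci <- cbind(fit = pred$fit,
            lwr = pred$fit - crit * pred$se.fit,
            upr = pred$fit + crit * pred$se.fit)
ci
```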
LOESS regression
LOESS (local polynomial regression fitting) can also be used to fit a model and make predictions.
cars %>%
ggplot(aes(speed, dist)) +
geom_point() +
geom_smooth(method = 'loess', formula = y~x, span = 1) + # span: 0.1 ~ 1
theme_classic()
Fitting only
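With the default settings, a fitted loess model can only predict values inside the range of the original data. Values outside that range cannot be predicted and are returned as NA.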
cars.lo <- loess(dist~speed, cars)
predict(cars.lo, speeds)
## 1 2 3
## 21.86532 56.46132 NA
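Before predicting, it can help to check which new inputs fall inside the observed range. This is a small sketch; `rng` and `inside` are illustrative names.

```r
speeds <- data.frame(speed = c(10, 20, 53))

# Observed range of the predictor in the training data
rng <- range(cars$speed)

# TRUE for inputs the default loess fit can interpolate
inside <- speeds$speed >= rng[1] & speeds$speed <= rng[2]

# Predict only for inputs inside the range
cars.lo <- loess(dist ~ speed, cars)
predict(cars.lo, speeds[inside, , drop = FALSE])
```

Keeping `drop = FALSE` ensures the subset stays a data frame with a speed column, which predict needs.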
Extrapolation
To predict values outside the range with loess, set control = loess.control(surface = "direct").
cars.lo2 <- loess(dist ~ speed, cars,
control = loess.control(surface = "direct"))
predict(cars.lo2, speeds, se = TRUE)
## $fit
## 1 2 3
## 21.86532 56.44526 963.89286
##
## $se.fit
## 1 2 3
## 4.119331 4.061865 467.666621
##
## $residual.scale
## [1] 15.31087
##
## $df
## [1] 44.55085
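However, the span of the loess smoother needs to be taken into account here. If it is too small, the fit follows the original data too closely (overfitting), which lowers prediction accuracy. The extrapolated value at speed = 53 also has a very large standard error (about 468), so it should be treated with caution.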
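The effect of span can be inspected by fitting the same model with several values and comparing the equivalent number of parameters (enp) and the residual standard error (s) stored in each loess object. This is a rough sketch; the span values and the names `spans`, `fits` and `cmp` are arbitrary choices for illustration.

```r
# Candidate span values (arbitrary choices for illustration)
spans <- c(0.5, 0.75, 1)

# Fit one loess model per span
fits <- lapply(spans, function(s) loess(dist ~ speed, data = cars, span = s))

# enp: equivalent number of parameters, a measure of model flexibility
# s:   residual standard error of the fit
cmp <- data.frame(span = spans,
                  enp = sapply(fits, function(f) f$enp),
                  rse = sapply(fits, function(f) f$s))
cmp
```

A higher enp means a more flexible curve. A residual standard error that drops only because the curve is more flexible does not by itself mean better predictions on new data.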
That concludes this brief overview of fitting and predicting with several regression models in R.
ref
https://www.journaldev.com/45290/predict-function-in-r
http://www.sthda.com/english/articles/40-regression-analysis/166-predict-in-r-model-predictions-and-confidence-intervals/