Modeling and Prediction in R

  • Linear Regression
    • Model
    • Prediction
  • GLM
    • Model
    • Prediction
  • LOESS regression
    • Fitting only
    • Extrapolation

Building models and making predictions with R

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## Warning: package 'ggplot2' was built under R version 4.0.5

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

data("cars")
head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Linear Regression

Model

We start by fitting a simple linear regression model.

model <- lm(dist~speed, cars)
summary(model)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The fitted model is: dist = -17.579 + 3.932 * speed.
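As a quick check of the equation above, the coefficients can be pulled out of the fitted object with coef() and applied by hand. This is only a minimal sketch; the variable names b and manual are illustrative and not part of the original post.

```r
# Refit the model so this block is self-contained (cars ships with base R).
model <- lm(dist ~ speed, data = cars)

# Extract the intercept and slope from the fitted object.
b <- coef(model)

# Apply dist = intercept + slope * speed by hand for speed = 10.
manual <- b[["(Intercept)"]] + b[["speed"]] * 10

# The same prediction obtained through predict().
via_predict <- unname(predict(model, newdata = data.frame(speed = 10)))

manual
via_predict
```

Both values should agree with the fit of 21.74499 that predict() reports for speed = 10 later in this post.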

Prediction

Confidence interval

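Use the predict function to make predictions for new data. With interval = "confidence" it returns the fitted values together with the 95% confidence interval for the mean response.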

speeds <- data.frame(speed=c(10, 20, 53))
predict(model, newdata = speeds, interval = "confidence")

##         fit       lwr       upr
## 1  21.74499  15.46192  28.02807
## 2  61.06908  55.24729  66.89088
## 3 190.83857 159.12292 222.55422

Prediction interval

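With interval = "prediction" it returns the 95% prediction interval for a single new observation at each input. This interval is wider than the confidence interval because it also accounts for the residual variance.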

predict(model, newdata = speeds, interval = "prediction")

##         fit        lwr       upr
## 1  21.74499  -9.809601  53.29959
## 2  61.06908  29.603089  92.53507
## 3 190.83857 146.542994 235.13415
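To make the difference between the two intervals concrete, both can be rebuilt by hand from the standard errors that predict() returns with se.fit = TRUE. The sketch below assumes the default 95% level; the names p, tcrit, conf_int, and pred_int are illustrative.

```r
# Refit the model and rebuild the new data so the block is self-contained.
model  <- lm(dist ~ speed, data = cars)
speeds <- data.frame(speed = c(10, 20, 53))

# se.fit = TRUE returns the fit, its standard error, the residual df,
# and the residual standard deviation (residual.scale).
p     <- predict(model, newdata = speeds, se.fit = TRUE)
tcrit <- qt(0.975, df = p$df)  # two-sided 95% t quantile

# Confidence interval: uncertainty of the estimated mean response only.
conf_int <- cbind(fit = p$fit,
                  lwr = p$fit - tcrit * p$se.fit,
                  upr = p$fit + tcrit * p$se.fit)

# Prediction interval: adds the residual variance of a single new observation.
pred_se  <- sqrt(p$se.fit^2 + p$residual.scale^2)
pred_int <- cbind(fit = p$fit,
                  lwr = p$fit - tcrit * pred_se,
                  upr = p$fit + tcrit * pred_se)

conf_int
pred_int
```

The only difference between the two is the extra residual.scale^2 term under the square root, which is why the prediction interval is always the wider one.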

Visualizing the predictions

# 1. Add predictions 
pred.int <- predict(model, interval = "prediction")
mydata <- cbind(cars, pred.int)
# 2. Regression line + confidence intervals
p <- ggplot(mydata, aes(speed, dist)) +
  geom_point() +
  stat_smooth(method = lm, formula = y~x)
# 3. Add prediction intervals
p + geom_line(aes(y = lwr), color = "red", linetype = "dashed")+
    geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  theme_bw()

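In the plot:

  • the blue line is the fitted linear regression line

  • the gray band is the confidence interval

  • the red dashed lines are the prediction interval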

GLM

In R, generalized linear models are fitted with the glm function.

Model

The family argument selects the error distribution (and its link function) of the model to fit.

glm.model <- glm(dist~speed, data = cars, family = gaussian)
summary(glm.model)

## 
## Call:
## glm(formula = dist ~ speed, family = gaussian, data = cars)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -29.069   -9.525   -2.272    9.215   43.201  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 236.5317)
## 
##     Null deviance: 32539  on 49  degrees of freedom
## Residual deviance: 11354  on 48  degrees of freedom
## AIC: 419.16
## 
## Number of Fisher Scoring iterations: 2

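The fitted model is: dist = -17.579 + 3.932 * speed.

This is the same fit as the simple linear regression above: a Gaussian family with its default identity link is equivalent to ordinary least squares.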
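The agreement between the two fits can be checked directly rather than by comparing printed summaries. A minimal sketch, refitting both models so it runs on its own:

```r
# Fit the same formula with lm() and with glm() using the Gaussian family.
model     <- lm(dist ~ speed, data = cars)
glm.model <- glm(dist ~ speed, data = cars, family = gaussian)

# Coefficients should agree up to numerical tolerance.
same_coef <- all.equal(coef(model), coef(glm.model))

# The glm dispersion estimate is the squared residual standard error of lm.
same_var <- all.equal(summary(glm.model)$dispersion, sigma(model)^2)

same_coef
same_var
```

This also explains the dispersion parameter of 236.5317 in the glm summary: it is the square of the residual standard error, 15.38, reported by lm.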

Prediction

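When predicting from a glm object, set se.fit = TRUE to also return the standard errors of the predictions and the residual scale (residual.scale, the residual standard deviation used to compute those standard errors).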

predict(glm.model, newdata = speeds, se.fit = TRUE)

## $fit
##         1         2         3 
##  21.74499  61.06908 190.83857 
## 
## $se.fit
##         1         2         3 
##  3.124921  2.895501 15.773951 
## 
## $residual.scale
## [1] 15.37959
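predict() for glm objects has no interval argument, so a confidence interval has to be assembled from fit and se.fit. The sketch below does this for the Gaussian model used here, assuming a 95% level and a t quantile on the residual degrees of freedom; for other families a common approach is to build the interval on the link scale instead, which is not shown here.

```r
# Refit the model and rebuild the new data so the block is self-contained.
glm.model <- glm(dist ~ speed, data = cars, family = gaussian)
speeds    <- data.frame(speed = c(10, 20, 53))

# Fitted values and their standard errors.
p <- predict(glm.model, newdata = speeds, se.fit = TRUE)

# Two-sided 95% t quantile on the residual degrees of freedom (48 here).
tcrit <- qt(0.975, df = df.residual(glm.model))

# 95% confidence interval for the mean response.
glm_ci <- cbind(fit = p$fit,
                lwr = p$fit - tcrit * p$se.fit,
                upr = p$fit + tcrit * p$se.fit)
glm_ci
```

For this Gaussian model the result should match what predict(model, newdata = speeds, interval = "confidence") gave for the lm fit earlier.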

LOESS regression

LOESS (Local Polynomial Regression Fitting) can also be used to fit a model and make predictions.

cars %>% 
  ggplot(aes(speed, dist)) +
  geom_point() +
  geom_smooth(method = 'loess', formula = y~x, span = 1) + # span: 0.1 ~ 1
  theme_classic()

Fitting only

With the default settings, a loess model can only predict values inside the range of the original data; inputs outside that range return NA.

cars.lo <- loess(dist~speed, cars)
predict(cars.lo, speeds)

##        1        2        3 
## 21.86532 56.46132       NA
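Which inputs will come back as NA can be worked out in advance by comparing them with the observed range of the predictor. A small sketch, with speed_range and in_range as illustrative names:

```r
# Default loess fit, as above.
cars.lo <- loess(dist ~ speed, data = cars)
speeds  <- data.frame(speed = c(10, 20, 53))

# Observed range of the predictor in the training data.
speed_range <- range(cars$speed)

# TRUE for inputs that fall inside the observed range.
in_range <- speeds$speed >= speed_range[1] & speeds$speed <= speed_range[2]

# Default predictions: NA is returned wherever in_range is FALSE.
pred <- predict(cars.lo, speeds)

data.frame(speed = speeds$speed, in_range = in_range, pred = unname(pred))
```

Here speed = 53 lies far beyond the largest observed speed, which is why the third prediction is missing.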

Extrapolation

To make loess predict values outside the range, set control = loess.control(surface = "direct").

cars.lo2 <- loess(dist ~ speed, cars,
  control = loess.control(surface = "direct"))
predict(cars.lo2, speeds, se = TRUE)

## $fit
##         1         2         3 
##  21.86532  56.44526 963.89286 
## 
## $se.fit
##          1          2          3 
##   4.119331   4.061865 467.666621 
## 
## $residual.scale
## [1] 15.31087
## 
## $df
## [1] 44.55085

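The span of the loess smoother matters here: if it is too small, the curve overfits the original data and predictions become less accurate. Extrapolation is also inherently unreliable, as the standard error of about 468 on the prediction for speed = 53 shows.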
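One way to see how sensitive an extrapolated value is to the span is to refit the model with a few spans and compare the predictions at the same out-of-range input. This is only a sketch: the candidate spans below are arbitrary illustrative values, not recommendations from the original post.

```r
# Assumption: these candidate spans are arbitrary illustrative choices;
# 0.75 is the loess default.
spans     <- c(0.5, 0.75, 1)
new_speed <- data.frame(speed = 53)  # well outside the observed range (4 to 25)

# Refit with each span, using direct computation so extrapolation is allowed.
extrapolated <- sapply(spans, function(s) {
  fit <- loess(dist ~ speed, data = cars, span = s,
               control = loess.control(surface = "direct"))
  unname(predict(fit, new_speed))
})
names(extrapolated) <- spans

extrapolated
```

If the values differ widely across spans, that is a sign the extrapolated prediction should not be trusted.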

That concludes this brief overview of fitting and predicting with several regression models in R.

End.

