R语言学习笔记：模型选择（一）

模型的解释力和预测力的区别

解释力的标准：R方，调整后R方

常用的预测力标准：Cp，AIC，BIC，AUC等等

案例：1986年美国各大核心棒球球队队员的表现数据和次年的收入

数据包：ISLR，Hitters数据

fix (Hitters)会出现bug，可以使用Hitters = ISLR::Hitters语句

首先进行数据清洗，去掉空白的数据，判断是否空白（NA）的函数：

is.na()

找出那些行是NA：which（is.na(Hitters[,19])）#Summary显示只有19列salary存在空白数据

命名为drop = which (is.na (Hitters[, 19]))

新的没有NA的数据：newhitters = Hitters[-drop,]

新数据有多少行：dim (newhitters)

先对球员工资进行一个探索性分析，做出频数直方图

ggplot (newhitters, aes (x = Salary))+ #在ggplot中，+号类似于ps里的叠加图层

geom_histogram (bins = 40, fill = 'red')+

ggtitle ("Histogram of Salary")+

ylab ("Frequency")#y轴名称

如果要改成频率直方图：

ggplot (newhitters) +

geom+histogram (aes (x = Salary, y = ..density..), bins = 40, fill = "blue") +

ggtitle ("HIstogram of Salary")

直方图明显右斜（头部效应），需要转化成正态分布的数据，处理方法：取对数

newhitters = newhitters %>% mutate (lnsalary = log (Salary))

制作相应的两两相关散点图，可以看出由于数据数量较多速度很慢

当变量数量多的时候，可以选择相关性热力图

所需程序包：dplyr，ggplot2，tidyverse和reshape2，代码如下：

plotData = melt (cor(newhitters[sapply (newhitters, is.numeric)]))#将非数变量剔除，melt函数将数据转换成可以用于制图的数据

ggplot (plotData,

aes (x = Var1, y = Var2, fill = value))+

geom_tile()+

ylab("")+

xlab("")+

scale_fill_gradient(low = "#56B1F7", high = '#132B43')+

guides (fill = guide_legend (title = 'correlation'))

说明大多数球员表现得数据是和来年的薪水相关的

数据预测能力的判断：预测误差1/n(∑n, i = 1((yi-y'i)^2)))越小越好

两种测量预测误差的方法：直接法与间接法

直接法：

再随机收集一批新数据，康康这个模型在新数据里的预测情况

不可以用生成模型的数据

如果收集新数据的成本可能太高，将已有的数据分成两个部分

或者cross validation交叉验证

间接法：

找一个类似于R方的统计量，用来估算预测误差

常用的间接统计量：Mallow's Cp or BIC

Cp = 1/n(RSS + 2d*^σ^2)

n是数据的个数，RSS是残差平方和，^σ模型均方误（mean square error）

Cp的性质：

随着数据量增大，Cp会无线趋近于真实的预测误差

Cp越小，模型误差越小，也就是预测精度越高

BIC： Bayesian InformationCriterion 贝叶斯信息标准

贝叶斯方法：不存在频率学派生成的客观概率，所谓概率是主观臆断和后验数据的信息之和

BIC = 1/n (RSS + k * ln(n)*^σ^2)

k是要计算的参数个数，对于自变量个数加一，因为除了每个自变量前面的系数以外，还需要计算一个截距项。BIC越小，预测能力越好

对模型进行选择，首先对原始数据进行处理，删掉salary列改为lnsalary

共有20个变量，其中潜在有19个自变量

需要使用程序包：leaps

fit1 = regsubsets (lnsalary~., newhitters, nvmax = 20, method = 'exhaustive')

如果需要包含二次项，.处改为.*.

Nvmax参数：最多使用的变量个数

Exhaustive：还有forward和backward参数，推荐exaustive

观察回归结果：summary (fit1)$which 标准：残差平方和小

左侧表头意味分别选取1/2/3/4……个自变量的时候，那一个或几个自变量的相关性好

此时已获取10个局部最优的模型，接下来需要寻找全局最优的模型。

本次涉及的全部代码：

library (GGally)

library (dplyr)

library (ggplot2)

library (tidyverse)

library (reshape2)

Hitters = ISLR::Hitters

str (Hitters)

drop = which (is.na (Hitters[, 19]))

newhitters = Hitters[-drop,]

dim (newhitters)

ggplot (newhitters, aes (x = Salary))+

geom_histogram (bins = 40, fill = 'red')+

ggtitle ("Histogram of Salary")+

ylab ("Frequency")

newhitters = newhitters %>% mutate (lnsalary = log (Salary))

newhitters %>% select (lnsalary, AtBat, Hits, HmRun, Runs, RBI) %>% ggpairs()

plotData = melt (cor(newhitters[sapply (newhitters, is.numeric)]))

ggplot (plotData,

aes (x = Var1, y = Var2, fill = value))+

geom_tile()+

ylab("")+

xlab("")+

scale_fill_gradient(low = "#56B1F7", high = '#132B43')+

guides (fill = guide_legend (title = 'correlation'))

newhitters[, 19] = NULL

fit1 = regsubsets (lnsalary~., newhitters, nvmax = 20, method = 'exhaustive')

summary (fit1)

summary (fit1)$which

R语言学习笔记：模型选择（一）

你可能感兴趣的:(R语言学习笔记：模型选择（一）)