Imputing Missing Values in R with missForest

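R has many packages for handling missing values, such as VIM, mice, Amelia, missForest, Hmisc, and mi. So why does this article choose missForest?

The reason is that missForest can impute missing values in both continuous and categorical variables. Many tools and packages do not recognize categorical variables when imputing. Suppose a binary column records "yes" and "no" answers; it is usually coded as 0 and 1, or as 1 and 2. When such tools impute the missing entries in that column, they may produce:

  • fractional values
  • values below the lower code (0 or 1)
  • values above the upper code (1 or 2)

What we actually need is one of the two integer codes. missForest handles this well, as long as the column is stored as a factor so that it is treated as categorical.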
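A small base-R illustration of the problem, using a made-up yes/no column (the vector `answer` and its values are hypothetical). Mean-imputing the numeric 0/1 coding yields a fraction, while storing the column as a factor and filling with the most frequent level stays within the valid categories. This only sketches the general issue; it is not how missForest itself imputes.

```r
# Hypothetical yes/no answers coded 0/1, with two missing entries
answer <- c(1, 0, 1, 1, NA, 0, NA, 1)

# Numeric treatment: mean imputation produces a non-integer value
num_imp <- answer
num_imp[is.na(num_imp)] <- mean(answer, na.rm = TRUE)
print(num_imp)

# Categorical treatment: store as a factor, fill with the most frequent level
fac <- factor(answer, levels = c(0, 1), labels = c("no", "yes"))
mode_level <- names(which.max(table(fac)))
fac_imp <- fac
fac_imp[is.na(fac_imp)] <- mode_level
print(fac_imp)
```

The factor version can only ever contain "no" or "yes", which is the property the article is after.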

Here is the official description of the missForest package:

The function 'missForest' in this package is used to
        impute missing values particularly in the case of mixed-type
        data. It uses a random forest trained on the observed values of
        a data matrix to predict the missing values. It can be used to
        impute continuous and/or categorical data including complex
        interactions and non-linear relations. It yields an out-of-bag
        (OOB) imputation error estimate without the need of a test set
        or elaborate cross-validation. It can be run in parallel to 
        save computation time.

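This description gives us several key facts about missForest:

  • It handles mixed-type data.
  • It uses a random forest to predict the missing values.
  • It can impute continuous and/or categorical variables.
  • It reports an out-of-bag (OOB) estimate of the imputation error.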
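The core idea, predicting a variable's missing entries from a random forest fitted on its observed entries, can be sketched for a single variable as follows. This is a deliberately simplified, single-pass illustration built on the randomForest package (assumed to be installed); missForest itself cycles over all variables with missing values and repeats until a stopping criterion is met. The choice of Petal.Length, the seed, and the 30 hidden rows are arbitrary assumptions.

```r
library(randomForest)   # assumed installed

set.seed(1)             # arbitrary seed, for reproducibility only

# Hide 30 values of one variable; the other columns stay complete
x <- iris[, 1:4]
hidden <- sample(nrow(x), 30)
x$Petal.Length[hidden] <- NA

# Fit a forest on the rows where the target is observed ...
obs <- !is.na(x$Petal.Length)
rf <- randomForest(x = x[obs, -3], y = x$Petal.Length[obs], ntree = 100)

# ... and predict the rows where it is missing
x$Petal.Length[!obs] <- predict(rf, newdata = x[!obs, -3])

# Compare the imputed values with the true ones that were hidden
head(cbind(true = iris$Petal.Length[hidden], imputed = x$Petal.Length[hidden]))
```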

The following example imputes missing values in the iris dataset.

Inspecting iris with summary shows that it contains 5 variables, of which Species is categorical.

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500 

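Use prodNA from missForest to artificially introduce missing values into the complete iris dataset.

First, the usage of prodNA:

prodNA(x, noNA = 0.1)
  1. x: the complete dataset (a matrix or data frame) into which missing values are introduced
  2. noNA: the proportion of values to set to missing (default 0.1)

To introduce 20% missing values into the complete iris dataset, set noNA to 0.2.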

library(missForest)
miris <- prodNA(iris, noNA = 0.2)

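The result is assigned to miris (missing iris dataset).

Inspect miris: the NA's entry for each variable gives its number of missing values. For example, Sepal.Length has 25 missing values.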

summary(miris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.200   Min.   :1.000   Min.   :0.100   setosa    :40  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.500   1st Qu.:0.300   versicolor:38  
 Median :5.800   Median :3.000   Median :4.200   Median :1.300   virginica :39  
 Mean   :5.841   Mean   :3.077   Mean   :3.661   Mean   :1.196   NA's      :33  
 3rd Qu.:6.400   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.300   Max.   :2.500                  
 NA's   :25      NA's   :27      NA's   :33      NA's   :32    
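The NA's counts above add up to 150 (25 + 27 + 33 + 32 + 33), which is 20% of the 150 × 5 = 750 cells. This is consistent with a fixed number of randomly chosen cells being blanked out across the whole table. A base-R sketch of that idea follows; `make_na` is a hypothetical helper written for illustration, not the package's own code, and the seed is arbitrary.

```r
set.seed(42)   # arbitrary seed, for reproducibility only

# Hypothetical helper: blank out a fixed share of cells chosen at random
make_na <- function(x, prop = 0.1) {
  n_cells <- nrow(x) * ncol(x)
  hit <- rep(FALSE, n_cells)
  hit[sample(n_cells, floor(n_cells * prop))] <- TRUE
  x[matrix(hit, nrow(x), ncol(x))] <- NA
  x
}

miris_demo <- make_na(iris, prop = 0.2)

colSums(is.na(miris_demo))   # missing cells per variable
sum(is.na(miris_demo))       # total number of missing cells
mean(is.na(miris_demo))      # overall share of missing cells
```

The same three checks can be run on miris itself to confirm how many values were removed from each variable.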

Use the missForest function to impute the dataset miris, which contains missing values.

iiris <- missForest(miris, xtrue = iris, verbose = TRUE)

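The first argument is the dataset to impute. The second, xtrue, passes the complete iris dataset so that the true imputation error can be measured. Setting verbose = TRUE prints the progress of each iteration, including the errors and the time taken. In the output, each line lists a value for the continuous variables (NRMSE) followed by one for the categorical variables (PFC): error(s) is the true error measured against xtrue, estimated error(s) is the OOB estimate, and difference(s) is the change between successive imputations, which drives the stopping rule.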

missForest iteration 1 in progress...done!
    error(s): 0.2499382 0.1212121 
    estimated error(s): 0.1813752 0.008547009 
    difference(s): 0.01295499 0.1466667 
    time: 1.09 seconds

  missForest iteration 2 in progress...done!
    error(s): 0.2296085 0.1212121 
    estimated error(s): 0.1546088 0.008547009 
    difference(s): 0.0002750566 0 
    time: 0.44 seconds

  missForest iteration 3 in progress...done!
    error(s): 0.2198678 0.1212121 
    estimated error(s): 0.1468334 0.01709402 
    difference(s): 8.633168e-05 0 
    time: 0.3 seconds

  missForest iteration 4 in progress...done!
    error(s): 0.2149133 0.09090909 
    estimated error(s): 0.1467923 0 
    difference(s): 2.614443e-05 0.006666667 
    time: 0.31 seconds

  missForest iteration 5 in progress...done!
    error(s): 0.2107928 0.09090909 
    estimated error(s): 0.1392262 0.008547009 
    difference(s): 6.956508e-05 0 
    time: 0.22 seconds

  missForest iteration 6 in progress...done!
    error(s): 0.2108121 0.09090909 
    estimated error(s): 0.1371073 0.03418803 
    difference(s): 2.83551e-05 0 
    time: 0.2 seconds

  missForest iteration 7 in progress...done!
    error(s): 0.2119592 0.09090909 
    estimated error(s): 0.1392409 0 
    difference(s): 1.827749e-05 0 
    time: 0.2 seconds

  missForest iteration 8 in progress...done!
    error(s): 0.2103703 0.09090909 
    estimated error(s): 0.1376546 0.008547009 
    difference(s): 2.309659e-05 0 
    time: 0.22 seconds

After 8 iterations, inspect the resulting iiris list with str.

The list contains two components, ximp and OOBerror; ximp is the imputed dataset.

str(iiris)
List of 2
 $ ximp    :'data.frame':	150 obs. of  5 variables:
  ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 ...
  ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.23 3.6 ...
  ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 ...
  ..$ Petal.Width : num [1:150] 0.2 0.2 0.202 0.2 0.2 ...
  ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ OOBerror: Named num [1:2] 0.1412 0.0256
  ..- attr(*, "names")= chr [1:2] "NRMSE" "PFC"
 - attr(*, "class")= chr "missForest"

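OOBerror is the out-of-bag (OOB) error, an estimate of the imputation error reported separately for continuous and categorical variables:

  • normalized root mean squared error (NRMSE): the measure for continuous variables
  • proportion of falsely classified entries (PFC): the measure for categorical variables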
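To make the two measures concrete, here is a toy calculation in base R. It assumes the usual definitions: NRMSE as the root of the mean squared error over the imputed entries divided by the variance of the true values at those entries, and PFC as the share of imputed categorical entries that differ from the truth. All vectors are made up for illustration, and the OOB figures reported by missForest are estimates of these quantities rather than exact values of them.

```r
# Continuous toy data: the last two entries were missing and then imputed
x_true <- c(1, 2, 3, 4)
x_imp  <- c(1, 2, 4, 4)
mis_x  <- c(FALSE, FALSE, TRUE, TRUE)

# NRMSE over the imputed entries only
nrmse <- sqrt(mean((x_imp[mis_x] - x_true[mis_x])^2) / var(x_true[mis_x]))
print(nrmse)

# Categorical toy data: the last three entries were missing and then imputed
f_true <- c("a", "b", "b", "a")
f_imp  <- c("a", "b", "a", "a")
mis_f  <- c(FALSE, TRUE, TRUE, TRUE)

# PFC: share of imputed entries that do not match the truth
pfc <- mean(f_imp[mis_f] != f_true[mis_f])
print(pfc)
```

In both cases 0 means a perfect imputation. An NRMSE near 1 means the imputed values are roughly no better than guessing the mean.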

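Comparing imputation methods

Keep only the first 4 columns of miris, since these are all continuous variables and no categorical variable is involved.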

miris <- miris[, 1:4]

Replace all missing values in the dataset with the mean.

library(Hmisc)
iris_mean <- impute(miris, fun = mean)

Replace all missing values in the dataset using the random forest method.

iris_forest <- missForest(miris)

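To gauge accuracy, compare each imputed dataset with the original by correlation. diag extracts the diagonal of the correlation matrix, that is, the correlation of each original variable with its imputed counterpart.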

diag(cor(iris[,-5], iris_mean))

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
   0.4870259    0.8700510    0.8922259    0.5795956 

diag(cor(iris[,-5], iris_forest$ximp))

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
   0.9740983    0.9315624    0.9893583    0.9814286 

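The results show that nonparametric random forest imputation performs clearly better than simple univariate mean replacement: all 4 variables correlate strongly with the original dataset (above 0.93), whereas the mean-imputed dataset does much worse, dropping to about 0.49 for Sepal.Length. This is worth keeping in mind, because it is easy to settle on one imputation method without ever comparing it against alternatives.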
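One caveat about the mean-imputation baseline. Hmisc's impute is written for single vectors. When it is given a whole data frame with fun = mean, as above, it appears to fill every gap with one mean taken over all observed cells rather than with per-column means. This is an inference from how the function behaves and from the pattern of correlations above, where the variables whose means lie far from the overall mean fare worst; the source does not state it. The base-R sketch below contrasts the two variants. The seed, the object names, and the choice of 120 blanked cells are assumptions made for illustration.

```r
set.seed(123)   # arbitrary seed, for reproducibility only

# Blank out 20% of the 600 numeric cells of iris
x_true <- iris[, 1:4]
x_mis <- as.matrix(x_true)
x_mis[sample(length(x_mis), 120)] <- NA
x_mis <- as.data.frame(x_mis)

# Grand-mean fill: one value for every gap, regardless of column
grand <- x_mis
grand[is.na(grand)] <- mean(as.matrix(x_mis), na.rm = TRUE)

# Per-column mean fill: each variable uses its own observed mean
by_col <- as.data.frame(lapply(x_mis, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}))

# Correlation of each original variable with its imputed counterpart
diag(cor(x_true, grand))
diag(cor(x_true, by_col))
```

Comparing missForest against the per-column variant gives a fairer picture of how much the random forest approach gains over mean replacement.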

References:

Daroczi, G. (2015). Mastering Data Analysis with R. Packt Publishing.
Stekhoven, D. J. (2013). missForest: Nonparametric Missing Value Imputation
  using Random Forest. R package version 1.4.
Stekhoven, D. J., & Buehlmann, P. (2012). MissForest - non-parametric missing value
  imputation for mixed-type data. Bioinformatics, 28(1), 112-118.
