R has many packages for handling missing values, such as VIM, mice, Amelia, missForest, Hmisc, and mi. So why does this article choose missForest?
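The reason is that missForest can impute missing values in both continuous and categorical variables. Many tools and packages do not recognize categorical variables when imputing. Suppose you have a binary column answered with "yes" and "no"; it is usually coded as 0 and 1, or 1 and 2. When such tools impute the missing entries in that column, they may produce non-integer values, whereas all we need is one of the two integer codes. missForest handles this situation well.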
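To make the problem concrete, here is a minimal base-R sketch. The vector `answers` is a made-up example, and mean replacement stands in for any method that ignores the variable type. Storing the column as a factor is how a categorical variable is marked in R, and missForest expects categorical columns to be factors.

```r
# Hypothetical yes/no column coded as 1/0, with two missing answers
answers <- c(1, 0, NA, 1, 1, NA, 0, 1)

# A type-blind method (here: mean replacement) fills in a fraction
mean_filled <- ifelse(is.na(answers), mean(answers, na.rm = TRUE), answers)
print(mean_filled)

# Marking the column as a factor tells R (and missForest) it is categorical
answers_factor <- factor(answers, levels = c(0, 1), labels = c("no", "yes"))
print(answers_factor)
```

The filled-in values in `mean_filled` are neither 0 nor 1, which is exactly the outcome the paragraph above warns about.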
Here is the official description of the missForest package:
The function 'missForest' in this package is used to
impute missing values particularly in the case of mixed-type
data. It uses a random forest trained on the observed values of
a data matrix to predict the missing values. It can be used to
impute continuous and/or categorical data including complex
interactions and non-linear relations. It yields an out-of-bag
(OOB) imputation error estimate without the need of a test set
or elaborate cross-validation. It can be run in parallel to
save computation time.
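A few important points about missForest follow from this description: it targets mixed-type data (continuous and/or categorical), it predicts the missing values with a random forest trained on the observed values, it reports an out-of-bag (OOB) error estimate without needing a separate test set, and it can run in parallel.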
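The rest of this article uses the iris dataset as an example of imputing missing values.
Use summary to inspect iris. It contains 5 variables, of which Species is categorical.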
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Use prodNA from missForest to artificially introduce some missing values into the complete iris dataset.
First, the usage of prodNA:
prodNA(x, noNA = 0.1)
To replace 20% of the values in the complete iris dataset with NA, set noNA to 0.2:
library(missForest)
miris <- prodNA(iris, noNA = 0.2)
The result is assigned to miris (missing iris dataset).
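For intuition, the idea behind prodNA can be sketched in a few lines of base R: pick a random fraction of all cells and set them to NA. The function name `make_na` and its `frac` argument are hypothetical; this is an illustration of the idea, not the package's source code.

```r
# Hypothetical helper: set a random fraction of the cells of a data frame to NA
make_na <- function(x, frac = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  hit <- rep(FALSE, n * p)
  # choose floor(n * p * frac) cell positions at random
  hit[sample(n * p, floor(n * p * frac))] <- TRUE
  x[matrix(hit, nrow = n, ncol = p)] <- NA
  x
}

set.seed(42)                    # only to make this sketch reproducible
demo <- make_na(iris, frac = 0.2)
colSums(is.na(demo))            # NA count per column
mean(is.na(demo))               # overall fraction of missing cells
```

Because the cells are drawn across the whole table, the per-column counts vary while the overall fraction stays fixed.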
Inspect miris. The NA's entries show how many values are missing in each variable; for example, Sepal.Length is missing 25 values.
summary(miris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.200 Min. :1.000 Min. :0.100 setosa :40
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.500 1st Qu.:0.300 versicolor:38
Median :5.800 Median :3.000 Median :4.200 Median :1.300 virginica :39
Mean :5.841 Mean :3.077 Mean :3.661 Mean :1.196 NA's :33
3rd Qu.:6.400 3rd Qu.:3.400 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.300 Max. :2.500
NA's :25 NA's :27 NA's :33 NA's :32
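As a quick check on the summary above, the per-variable NA counts should add up to 20% of the 150 × 5 cells. The counts below are copied from that output.

```r
# NA counts read off the summary(miris) output above
na_counts <- c(Sepal.Length = 25, Sepal.Width = 27, Petal.Length = 33,
               Petal.Width = 32, Species = 33)

total_cells <- 150 * 5
missing_fraction <- sum(na_counts) / total_cells
print(sum(na_counts))
print(missing_fraction)
```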
Use the missForest function to impute the dataset miris, which contains the missing values:
iiris <- missForest(miris, xtrue = iris, verbose = TRUE)
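The first argument is the dataset with missing values. The second, xtrue, supplies the complete iris dataset so that the true imputation error can be reported. Setting verbose = TRUE prints the progress of each iteration: the true error measured against xtrue (error(s)), the OOB estimate (estimated error(s)), the change from the previous iteration (difference(s)), and the time taken. Each pair of values refers to the continuous variables (NRMSE) and the categorical variables (PFC), in that order.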
missForest iteration 1 in progress...done!
error(s): 0.2499382 0.1212121
estimated error(s): 0.1813752 0.008547009
difference(s): 0.01295499 0.1466667
time: 1.09 seconds
missForest iteration 2 in progress...done!
error(s): 0.2296085 0.1212121
estimated error(s): 0.1546088 0.008547009
difference(s): 0.0002750566 0
time: 0.44 seconds
missForest iteration 3 in progress...done!
error(s): 0.2198678 0.1212121
estimated error(s): 0.1468334 0.01709402
difference(s): 8.633168e-05 0
time: 0.3 seconds
missForest iteration 4 in progress...done!
error(s): 0.2149133 0.09090909
estimated error(s): 0.1467923 0
difference(s): 2.614443e-05 0.006666667
time: 0.31 seconds
missForest iteration 5 in progress...done!
error(s): 0.2107928 0.09090909
estimated error(s): 0.1392262 0.008547009
difference(s): 6.956508e-05 0
time: 0.22 seconds
missForest iteration 6 in progress...done!
error(s): 0.2108121 0.09090909
estimated error(s): 0.1371073 0.03418803
difference(s): 2.83551e-05 0
time: 0.2 seconds
missForest iteration 7 in progress...done!
error(s): 0.2119592 0.09090909
estimated error(s): 0.1392409 0
difference(s): 1.827749e-05 0
time: 0.2 seconds
missForest iteration 8 in progress...done!
error(s): 0.2103703 0.09090909
estimated error(s): 0.1376546 0.008547009
difference(s): 2.309659e-05 0
time: 0.22 seconds
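The package documentation describes the stopping rule roughly as follows: iteration ends as soon as the difference between successive imputations stops decreasing for both variable types. The sketch below replays the difference(s) values from the log above to see where that happens. It is an illustration under that reading of the rule, not the package's own code; the variable names are made up, and treating an unchanged difference as "not decreasing" is an assumption.

```r
# difference(s) values copied from the log, iterations 1 to 8
diff_cont <- c(0.01295499, 0.0002750566, 8.633168e-05, 2.614443e-05,
               6.956508e-05, 2.83551e-05, 1.827749e-05, 2.309659e-05)
diff_cat  <- c(0.1466667, 0, 0, 0.006666667, 0, 0, 0, 0)

n <- length(diff_cont)
# keep going while at least one of the two differences got smaller
keep_going <- (diff_cont[-1] < diff_cont[-n]) | (diff_cat[-1] < diff_cat[-n])

# keep_going[i] describes iteration i + 1
stop_iter <- which(!keep_going)[1] + 1
print(stop_iter)
```

Under these assumptions the first iteration at which neither difference decreases is the eighth, which matches the length of the log.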
After 8 iterations, use str to inspect the resulting list iiris.
The list contains two components, ximp and OOBerror; ximp is the imputed dataset.
str(iiris)
List of 2
$ ximp :'data.frame': 150 obs. of 5 variables:
..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 ...
..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.23 3.6 ...
..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 ...
..$ Petal.Width : num [1:150] 0.2 0.2 0.202 0.2 0.2 ...
..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
$ OOBerror: Named num [1:2] 0.1412 0.0256
..- attr(*, "names")= chr [1:2] "NRMSE" "PFC"
- attr(*, "class")= chr "missForest"
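OOBerror is the out-of-bag (OOB) imputation error estimate. It measures how well the model performs on the continuous variables (NRMSE, the normalized root mean squared error) and on the categorical variables (PFC, the proportion of falsely classified entries).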
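The two measures can be sketched in base R as follows. The helper names `nrmse_sketch` and `pfc_sketch` and the toy vectors are hypothetical. The sketch computes the error against known true values, as the error(s) line of the log does when xtrue is supplied; the OOB estimate is derived from the forest's out-of-bag predictions instead, which this sketch does not reproduce.

```r
# NRMSE over the cells that were missing: RMSE scaled by the variance of the truth
nrmse_sketch <- function(imputed, truth, was_missing) {
  sqrt(mean((imputed[was_missing] - truth[was_missing])^2) /
         var(truth[was_missing]))
}

# PFC: share of the missing categorical cells that were imputed incorrectly
pfc_sketch <- function(imputed, truth, was_missing) {
  mean(as.character(imputed[was_missing]) != as.character(truth[was_missing]))
}

# Toy continuous example
truth_num   <- c(1, 2, 3, 4, 5, 6)
imputed_num <- c(1, 2.5, 3, 4, 5, 5)
miss_num    <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
nrmse_value <- nrmse_sketch(imputed_num, truth_num, miss_num)

# Toy categorical example
truth_cat   <- c("a", "b", "a", "b")
imputed_cat <- c("a", "a", "a", "b")
miss_cat    <- c(FALSE, TRUE, FALSE, TRUE)
pfc_value   <- pfc_sketch(imputed_cat, truth_cat, miss_cat)

print(nrmse_value)
print(pfc_value)
```

As a reading aid for the log above: Species has 33 missing entries, and the categorical error values 0.1212121 and 0.09090909 equal 4/33 and 3/33, which would correspond to 4 and then 3 misclassified entries.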
Next, keep only the first 4 columns of miris. These 4 variables are all continuous, so no categorical variable is involved.
miris <- miris[, 1:4]
Replace all missing values in the dataset with the mean:
library(Hmisc)
iris_mean <- impute(miris, fun = mean)
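Hmisc's impute is documented as operating on a single vector, so when the goal is to fill each column with its own mean, one option is to apply the replacement column by column. A minimal base-R sketch, where `impute_col_means` is a hypothetical helper that is not part of Hmisc or missForest:

```r
# Hypothetical helper: replace the NAs in every numeric column by that column's mean
impute_col_means <- function(df) {
  df[] <- lapply(df, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
  df
}

# Toy data frame with one missing value per column
toy <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))
filled <- impute_col_means(toy)
print(filled)
```

The same helper could be applied to the four numeric columns of miris; the assignment `df[] <- ...` keeps the result a data frame with the original column names.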
Replace all missing values in the dataset using the random forest method:
iris_forest <- missForest(miris)
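To assess accuracy, compare each imputed dataset with the original by correlation. Taking diag extracts the diagonal of the correlation matrix, that is, the correlation of each variable with its own imputed version.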
diag(cor(iris[,-5], iris_mean))
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.4870259 0.8700510 0.8922259 0.5795956
diag(cor(iris[,-5], iris_forest$ximp))
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.9740983 0.9315624 0.9893583 0.9814286
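To see why the diagonal is the right part of the matrix to read, note that cor(A, B) correlates every column of A with every column of B, so only the diagonal pairs a variable with its own counterpart. The sketch below checks this on a perturbed copy of iris; the data frame `noisy` and the noise level are made up for illustration.

```r
set.seed(1)                       # only to make this sketch reproducible
original <- iris[, 1:4]

# Hypothetical "imputed" version: the original plus a little random noise
noisy <- original + matrix(rnorm(150 * 4, sd = 0.3), nrow = 150, ncol = 4)

# Diagonal of the full cross-correlation matrix
from_diag <- diag(cor(original, noisy))

# Column-by-column correlations computed directly
per_column <- mapply(cor, original, noisy)

print(round(from_diag, 3))
print(round(per_column, 3))
```

Both vectors hold the same four numbers, one per variable.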
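The results show that imputing missing values with the nonparametric random forest method performs better than simple univariate mean replacement: all 4 variables correlate highly with the original dataset, whereas the mean-imputed dataset does much worse. This is worth noting, because many people may never have thought of comparing the results of different imputation methods.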
References:
Daroczi, G. (2015). Mastering Data Analysis with R. Packt Publishing.
Stekhoven, D. J. (2013). missForest: Nonparametric Missing Value Imputation using Random Forest. R package version 1.4.
Stekhoven, D. J., & Buehlmann, P. (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.