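R has many packages that can handle missing values, such as VIM, mice, Amelia, missForest, Hmisc, and mi. So why does this article choose missForest? The package description gives the reasons: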
The function 'missForest' in this package is used to
impute missing values particularly in the case of mixed-type
data. It uses a random forest trained on the observed values of
a data matrix to predict the missing values. It can be used to
impute continuous and/or categorical data including complex
interactions and non-linear relations. It yields an out-of-bag
(OOB) imputation error estimate without the need of a test set
or elaborate cross-validation. It can be run in parallel to
save computation time.
summary(iris)
Sepal.Length    Sepal.Width     Petal.Length    Petal.Width     Species
Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
prodNA(x, noNA = 0.1)
miris <- prodNA(iris, noNA = 0.2)
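Here prodNA removes 20% of the entries completely at random (noNA = 0.2; the default in the signature above is 0.1), and the result is assigned to miris (the "missing iris" dataset).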
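To make the idea of "removing entries completely at random" concrete, here is a minimal base R sketch. The helper name make_missing and its prop argument are made up for illustration; this is not the prodNA source, only one plausible way to get the same effect.

```r
# Hypothetical helper (not part of missForest): set a given share of all
# cells in a data frame to NA, choosing the cells uniformly at random.
make_missing <- function(x, prop = 0.1) {
  mask <- matrix(FALSE, nrow(x), ncol(x))          # one flag per cell
  n_drop <- round(prop * length(mask))             # how many cells to blank
  mask[sample(length(mask), n_drop)] <- TRUE       # pick cells at random
  x[mask] <- NA                                    # works for numeric and factor columns
  x
}

set.seed(1)
demo <- make_missing(iris, prop = 0.2)
sum(is.na(demo))       # number of blanked cells: 20% of 150 * 5
colSums(is.na(demo))   # how the NAs are spread over the columns
```

Because the cells are drawn over the whole table, the number of NAs per column varies from column to column, just as in the summary of miris.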
summary(miris)
Sepal.Length    Sepal.Width     Petal.Length    Petal.Width     Species
Min.   :4.300   Min.   :2.200   Min.   :1.000   Min.   :0.100   setosa    :40
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.500   1st Qu.:0.300   versicolor:38
Median :5.800   Median :3.000   Median :4.200   Median :1.300   virginica :39
Mean   :5.841   Mean   :3.077   Mean   :3.661   Mean   :1.196   NA's      :33
3rd Qu.:6.400   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.300   Max.   :2.500
NA's   :25      NA's   :27      NA's   :33      NA's   :32
iiris <- missForest(miris, xtrue = iris, verbose = TRUE)
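The first argument is the dataset containing missing values. The second, xtrue, supplies the complete iris dataset so that the true imputation error can be measured. Setting verbose = TRUE prints each iteration together with its imputation errors and elapsed time. Every line of the output carries two numbers: the first refers to the continuous variables (NRMSE) and the second to the categorical variable (PFC). "error(s)" is the true error computed against xtrue, "estimated error(s)" is the out-of-bag estimate, and "difference(s)" is the change between the previous and the newly imputed data, which drives the stopping criterion.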
missForest iteration 1 in progress...done!
error(s): 0.2499382 0.1212121
estimated error(s): 0.1813752 0.008547009
difference(s): 0.01295499 0.1466667
time: 1.09 seconds
missForest iteration 2 in progress...done!
error(s): 0.2296085 0.1212121
estimated error(s): 0.1546088 0.008547009
difference(s): 0.0002750566 0
time: 0.44 seconds
missForest iteration 3 in progress...done!
error(s): 0.2198678 0.1212121
estimated error(s): 0.1468334 0.01709402
difference(s): 8.633168e-05 0
time: 0.3 seconds
missForest iteration 4 in progress...done!
error(s): 0.2149133 0.09090909
estimated error(s): 0.1467923 0
difference(s): 2.614443e-05 0.006666667
time: 0.31 seconds
missForest iteration 5 in progress...done!
error(s): 0.2107928 0.09090909
estimated error(s): 0.1392262 0.008547009
difference(s): 6.956508e-05 0
time: 0.22 seconds
missForest iteration 6 in progress...done!
error(s): 0.2108121 0.09090909
estimated error(s): 0.1371073 0.03418803
difference(s): 2.83551e-05 0
time: 0.2 seconds
missForest iteration 7 in progress...done!
error(s): 0.2119592 0.09090909
estimated error(s): 0.1392409 0
difference(s): 1.827749e-05 0
time: 0.2 seconds
missForest iteration 8 in progress...done!
error(s): 0.2103703 0.09090909
estimated error(s): 0.1376546 0.008547009
difference(s): 2.309659e-05 0
time: 0.22 seconds
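The log stops after iteration 8, which fits the stopping rule as I understand it from the package documentation: iteration continues as long as the difference for at least one variable type is still getting smaller. The sketch below replays that rule on the "difference(s)" values copied from the log. The rule as coded here is an assumption, not the package source.

```r
# "difference(s)" values copied from the verbose log, iterations 1 to 8.
diff_cont <- c(0.01295499, 0.0002750566, 8.633168e-05, 2.614443e-05,
               6.956508e-05, 2.83551e-05, 1.827749e-05, 2.309659e-05)
diff_cat  <- c(0.1466667, 0, 0, 0.006666667, 0, 0, 0, 0)

# Assumed rule: keep going while at least one of the two differences
# is smaller than it was in the previous iteration.
keeps_going <- function(i) {
  diff_cont[i] < diff_cont[i - 1] || diff_cat[i] < diff_cat[i - 1]
}

stop_iter <- NA
for (i in 2:length(diff_cont)) {
  if (!keeps_going(i)) {
    stop_iter <- i
    break
  }
}
stop_iter   # 8: neither difference decreased from iteration 7 to 8
```

Iteration 5 shows why both types matter: the continuous difference rises there, but the categorical one drops back to 0, so the loop carries on.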
The returned list contains two components, ximp and OOBerror; ximp is the imputed dataset.
List of 2
$ ximp :'data.frame': 150 obs. of 5 variables:
..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 ...
..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.23 3.6 ...
..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 ...
..$ Petal.Width : num [1:150] 0.2 0.2 0.202 0.2 0.2 ...
..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
$ OOBerror: Named num [1:2] 0.1412 0.0256
..- attr(*, "names")= chr [1:2] "NRMSE" "PFC"
- attr(*, "class")= chr "missForest"
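OOBerror is the out-of-bag (OOB) error, an estimate of how well the imputation performs: NRMSE (normalized root mean squared error) for the continuous variables and PFC (proportion of falsely classified entries) for the categorical variables. For both measures, values closer to 0 are better.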
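When the true values are known, the same two measures can be computed by hand. The following base R sketch assumes the usual definitions: NRMSE as the root of the mean squared error over the imputed entries divided by the variance of the true values at those entries, and PFC as the share of imputed categorical entries that are wrong. The function names and the toy vectors are made up for illustration.

```r
# Hypothetical helpers; only entries that were missing are scored.
nrmse <- function(imputed, truth, was_missing) {
  err <- imputed[was_missing] - truth[was_missing]
  sqrt(mean(err^2) / var(truth[was_missing]))
}

pfc <- function(imputed, truth, was_missing) {
  mean(imputed[was_missing] != truth[was_missing])
}

# Toy continuous variable: entries 2, 4 and 6 were missing and imputed.
truth_num   <- c(1, 2, 3, 4, 5, 6)
imputed_num <- c(1, 2.5, 3, 3.5, 5, 6.5)
miss_num    <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
nrmse_val <- nrmse(imputed_num, truth_num, miss_num)
nrmse_val   # 0.25

# Toy categorical variable: entries 2 and 4 were missing, one imputed wrongly.
truth_cat   <- c("a", "b", "a", "b")
imputed_cat <- c("a", "a", "a", "b")
miss_cat    <- c(FALSE, TRUE, FALSE, TRUE)
pfc_val <- pfc(imputed_cat, truth_cat, miss_cat)
pfc_val     # 0.5
```

The OOB estimate is obtained differently, from trees that did not see a given observation, but it is reported on these same two scales.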
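library(Hmisc)                            # provides impute()
miris <- miris[, 1:4]                     # keep only the four numeric columns
iris_mean <- impute(miris, fun = mean)    # baseline: fill NAs with a mean
iris_forest <- missForest(miris)          # random-forest imputation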
diag(cor(iris[,-5], iris_mean))
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.4870259 0.8700510 0.8922259 0.5795956
diag(cor(iris[,-5], iris_forest$ximp))
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.9740983 0.9315624 0.9893583 0.9814286
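One caveat about the baseline: assuming impute here is Hmisc's impute, that function is documented for a single vector, so calling it on a whole data frame may not produce a per-column mean. A per-column mean baseline can be sketched in base R as follows. The helper col_mean_impute is hypothetical, and the missing values are generated inside the sketch, so its numbers are not comparable to the output above.

```r
set.seed(42)
truth <- iris[, 1:4]

# Blank out 30 randomly chosen rows in each column (20% per column).
holed <- truth
for (j in seq_along(holed)) {
  holed[sample(nrow(holed), 30), j] <- NA
}
mask <- is.na(holed)

# Hypothetical helper: replace each NA with the mean of the observed
# values in the same column.
col_mean_impute <- function(df) {
  df[] <- lapply(df, function(v) {
    v[is.na(v)] <- mean(v, na.rm = TRUE)
    v
  })
  df
}
filled <- col_mean_impute(holed)

# Per-variable correlation between the true and the imputed columns,
# the same diagnostic used in the comparison above.
diag(cor(truth, filled))
```

Running both baselines against missForest on the same miris would show how much of the gap comes from the method and how much from the way the mean is applied.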
