[Personal notes] R: handling missing values (NA)

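Contents

  • Preface
  • Assessing the distribution of missing values
    • How many rows have missing values:
    • VIM: distribution and proportion of missing values
  • Imputation methods
    • Hmisc or e1071: mean, median, and random imputation
    • DMwR2: central imputation (centralImputation) and kNN
    • rpart (omitted)
    • mice() from the mice package (omitted)
  • Evaluating the imputation (not done yet)
  • Practice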

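Preface

My project uses a dataset that contains a lot of NAs. These notes record how I handled them.
Main references: a CSDN tutorial (mostly about DMwR),
a tutorial centered on mice,
and 银河统计.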

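Assessing the distribution of missing values

How many rows have missing values:

If only a few values are missing, the easiest option is to delete the samples (rows) that contain them.
Each row is one sample, so first count how many rows contain at least one missing value: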

# ------- Missing-value statistics: count the rows that contain NA -------
head(data)
dim(data)

NAcounter = 0
for (i in seq_len(nrow(data))) {
  if (any(is.na(data[i, ]))) {
    # this row has at least one NA
    print(sprintf("%d, NA", i))
    NAcounter = NAcounter + 1
  } else {
    print(i)
  }
}
print(sprintf("There are %d/%d patients with NA", NAcounter, nrow(data)))
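
The same count can be obtained without a loop. The sketch below uses only base R; the small data frame is made up and merely stands in for the real `data`.

```r
# Toy data frame standing in for `data` (hypothetical values)
data <- data.frame(
  age    = c(61, NA, 70, 55),
  gender = c(1, 0, NA, 1),
  M      = c(0, NA, 0, 1)
)

# complete.cases() is TRUE for rows that contain no NA at all
rows_with_na <- !complete.cases(data)
na_row_count <- sum(rows_with_na)

which(rows_with_na)   # indices of the rows that contain NA
print(sprintf("There are %d/%d rows with NA", na_row_count, nrow(data)))

# Number and proportion of missing values per column
colSums(is.na(data))
colMeans(is.na(data))
```

If dropping the incomplete rows turns out to be acceptable, `na.omit(data)` or `data[complete.cases(data), ]` does that in one step.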

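It turns out I have a little over 400 rows in total, and more than 200 of them contain missing values. So simply deleting them is not an option.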

VIM: distribution and proportion of missing values

# Use the VIM package
# site="https://mirrors.tuna.tsinghua.edu.cn/CRAN"
# package_list = c("VIM")
# for(p in package_list){
#  if(!suppressWarnings(suppressMessages(require(p, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)))){
#    install.packages(p, repos=site)
#    suppressWarnings(suppressMessages(library(p, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)))
#  }
# }
  
library(VIM)
aggr(data, prop = TRUE, numbers = TRUE)
# help(aggr)  # shows the documentation for this function

[Figure 1: plot produced by aggr(data, prop=T, numbers=T)]
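The left part of this figure is easy enough to read: age, T, M and so on are column names, and the bars give the proportion of missing values in each column. I could not make much sense of the right part, so I looked at the documentation.
The relevant part of help(aggr) is quoted below. I still do not fully understand it, so I am leaving it here as a rough reference: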

Often it is of interest how many missing/imputed values are contained in each variable. Even more interesting, there may be certain combinations of variables with a high number of missing/imputed values.
If combined is FALSE, two separate plots are drawn for the missing/imputed values in each variable and the combinations of missing/imputed and non-missing values. The barplot on the left hand side shows the amount of missing/imputed values in each variable. In the aggregation plot on the right hand side, all existing combinations of missing/imputed and non-missing values in the observations are visualized. Available, missing and imputed data are color coded as given by col. Additionally, there are two possibilities to represent the frequencies of occurrence of the different combinations. The first option is to visualize the proportions or frequencies by a small bar plot and/or numbers. The second option is to let the cell heights be given by the frequencies of the corresponding combinations. Furthermore, variables may be sorted by the number of missing/imputed values and combinations by the frequency of occurrence to give more power to finding the structure of missing/imputed values.
If combined is TRUE, a small version of the barplot showing the amount of missing/imputed values in each variable is drawn on top of the aggregation plot.
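
As the documentation says, the right-hand panel shows every combination of missing and non-missing variables that occurs in the rows, together with how often it occurs. The sketch below tabulates the same information in base R on a made-up data frame; it is not how VIM computes the plot, only a way to read the numbers directly.

```r
# Toy data frame (hypothetical values); column names borrowed from the notes
data <- data.frame(
  age = c(61, NA, 70, 55, NA),
  T   = c(2, 3, NA, 1, 4),
  M   = c(0, NA, 0, 1, NA)
)

# Encode each row as a string of 0/1 flags, one flag per column, 1 = missing
pattern <- apply(is.na(data), 1, function(r) paste(as.integer(r), collapse = ""))

# Frequency of each combination, most common first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts

# The same as proportions of all rows
round(pattern_counts / nrow(data), 2)
```

Here the pattern "101" means that age and M are missing while T is observed. Each distinct pattern corresponds to one row of the aggregation plot.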

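Imputation methods

Hmisc or e1071: mean, median, and random imputation

The simplest way to impute is to fill the gaps with random values, the mean, or the median.
Both Hmisc and e1071 provide a function for this.
e1071 may be the more basic of the two: I never installed it myself, yet it was already available (presumably pulled in as a dependency of another package).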

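#install.packages("Hmisc")
library(Hmisc)
help(impute)

# The two age lines are alternatives: once the NAs are filled,
# a second call has nothing left to impute.
data$age = impute(data$age, median)      # fill with the median
# data$age = impute(data$age, mean)      # or fill with the mean
data$gender = impute(data$gender, "random")  # fill with values drawn at random from the observed ones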
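
To make explicit what these three kinds of fill do, here is a minimal base-R sketch. The helper names `fill_with` and `fill_random` and the toy vectors are made up for illustration; this is not Hmisc's implementation.

```r
set.seed(1)  # only so that the random fill is reproducible

# Toy columns (hypothetical values)
age    <- c(61, NA, 70, 55, NA, 68)
gender <- c(1, 0, NA, 1, NA, 0)

# Replace NAs with a summary statistic of the observed values
fill_with <- function(x, stat) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

# Replace NAs by sampling, with replacement, from the observed values
fill_random <- function(x) {
  missing  <- is.na(x)
  observed <- x[!missing]
  x[missing] <- observed[sample.int(length(observed), sum(missing), replace = TRUE)]
  x
}

age_median <- fill_with(age, median)
age_mean   <- fill_with(age, mean)
gender_rnd <- fill_random(gender)

age_median
age_mean
gender_rnd
```

Mean and median fills keep the column's center but shrink its variance. The random fill keeps the observed distribution but adds noise, which is why it tends to suit categorical columns such as gender better.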

Before imputation:
[Figure 2: the data before imputation]
After imputation:
[Figure 3: the data after imputation]

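DMwR2: central imputation (centralImputation) and kNN

The DMwR package provides centralImputation(), which fills the missing values of each column with a measure of that column's central tendency (the median for numeric columns, the most frequent value for categorical ones).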
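
The idea behind central imputation can be sketched in base R as follows. The function names `central_value` and `central_impute` are hypothetical, and the code is an illustration of the technique rather than the package's own source.

```r
# Central value of a column: median if numeric, otherwise the most frequent value
central_value <- function(x) {
  if (is.numeric(x)) {
    median(x, na.rm = TRUE)
  } else {
    # table() ignores NA by default; ties go to the first level in table order
    names(which.max(table(x)))
  }
}

# Fill every column of a data frame with its own central value
central_impute <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (any(missing)) df[[col]][missing] <- central_value(df[[col]])
  }
  df
}

# Toy data (hypothetical values)
toy <- data.frame(
  age   = c(61, NA, 70, 55),
  stage = factor(c("II", "III", NA, "III"))
)
filled <- central_impute(toy)
filled
```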

rpart (omitted)

mice() from the mice package (omitted)

Evaluating the imputation (not done yet)

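We need to bring in the DMwR package: install.packages("DMwR"), library(DMwR).
While on the subject of this package, it has a function manyNAs(data, 0.2) that returns the rows in which the number of missing values exceeds 20% of the number of columns; the 0.2 is adjustable.
Measuring how well the imputation worked requires the regr.eval() function from DMwR.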
————————————————
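Copyright notice: this passage is from an original article by the CSDN blogger "Taylor_zhuang", licensed under CC 4.0 BY-SA. When reposting, please include the link to the original source and this notice.
Original link: https://blog.csdn.net/zhuangailing/article/details/79253768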
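
The row filter described in the quoted passage can be approximated in base R. This is a sketch of the idea on made-up data, not the package's code, and its handling of the threshold may differ slightly from manyNAs().

```r
# Toy data frame with five columns (hypothetical values)
data <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(1, NA, 3, NA),
  c = c(1, 2, 3, NA),
  d = c(1, 2, 3, 4),
  e = c(1, 2, 3, 4)
)

threshold <- 0.2                       # adjustable, as in manyNAs(data, 0.2)
na_fraction <- rowMeans(is.na(data))   # share of missing columns in each row

sparse_rows <- which(na_fraction > threshold)          # rows with too many NAs
cleaned <- data[na_fraction <= threshold, , drop = FALSE]

sparse_rows
cleaned
```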

In practice, I found that the DMwR package could not be installed:

package ‘DMwR’ is not available for this version of R

From Stack Overflow I learned that this package has been removed from CRAN:
https://cran.r-project.org/web/packages/DMwR/index.html
However, when I typed install.packages(DM, autocompletion suggested install.packages("DMwR2").
Using DMwR2, I found that it has no regr.eval(), but it does have centralImputation(data) and
knnImputation(data).
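
Without regr.eval(), the usual error measures are easy to compute by hand. The sketch below assumes the common evaluation setup of hiding values that are actually known, imputing them, and comparing the result with the truth. The data, the choice of mean imputation, and all variable names are made up for illustration.

```r
set.seed(42)

# A fully observed toy column (hypothetical values)
truth <- c(61, 58, 70, 55, 66, 72, 49, 63, 68, 59)

# Hide three known values to mimic missingness
hidden <- sample(length(truth), 3)
masked <- truth
masked[hidden] <- NA

# Impute the hidden values with the mean of what is still observed
imputed <- masked
imputed[hidden] <- mean(masked, na.rm = TRUE)

actuals     <- truth[hidden]
predictions <- imputed[hidden]

# Common regression error measures
errors <- actuals - predictions
metrics <- c(
  mae  = mean(abs(errors)),            # mean absolute error
  mse  = mean(errors^2),               # mean squared error
  rmse = sqrt(mean(errors^2)),         # root mean squared error
  mape = mean(abs(errors / actuals))   # mean absolute percentage error
)
round(metrics, 3)
```

Repeating this with different imputation methods on the same hidden positions gives a like-for-like comparison; lower values mean the imputed values are closer to the truth.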

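According to http://www.360doc.com/content/20/0514/16/65403234_912307344.shtml, a t-test can be used to evaluate the imputation, by checking whether the mean of the imputed values differs significantly from the mean of the true values:
t.test(actuals,predictions)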
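
A minimal runnable version of that call might look like the following. Both vectors are simulated here purely for illustration; in a real check, `actuals` would be known values that were deliberately hidden and `predictions` their imputed replacements.

```r
set.seed(7)

# Simulated stand-ins (hypothetical): true values and their imputed counterparts
actuals     <- rnorm(30, mean = 65, sd = 8)
predictions <- mean(actuals) + rnorm(30, sd = 2)

# Welch two-sample t-test on the two sets of values
result <- t.test(actuals, predictions)
result$p.value

# Because each prediction belongs to one actual value, a paired test is also an option
paired_result <- t.test(actuals, predictions, paired = TRUE)
paired_result$p.value
```

A large p-value only says that the two means are not significantly different. It does not show that individual imputed values are close to the truth, so it is worth reading alongside an error measure such as RMSE.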

Practice

# ###------- Imputation: mean and median -----------#
#install.packages("Hmisc")
library(Hmisc)
help(impute)
imputed_data = data
imputed_data$age=impute(data$age,mean)        #### mean 67.06585, median 68
imputed_data$gender = impute(data$gender,"random") #0.5170732*  
imputed_data$T = impute(data$T,mean) ###  5.733496*,6.0
imputed_data$N = impute(data$N,mean) # 1.268293, 0
imputed_data$M = impute(data$M,mean) # 0.2853186,0
imputed_data$Overall.Survival..Months. = impute(data$Overall.Survival..Months.,mean) # 27.92964,22.0250
imputed_data$Overall.Survival.Status = impute(data$Overall.Survival.Status,"random")# 0.2219512,0
imputed_data$Disease.Free..Months. = impute(data$Disease.Free..Months.,mean) # 26.3183763,20.2700000
imputed_data$Disease.Free.Status = impute(data$Disease.Free.Status,"random") # 0.2087629,
imputed_data$microsatelite = impute(data$microsatelite,mean) # 0.5756098*


head(imputed_data)
colSums(is.na(imputed_data))  # remaining NA count per column; 0 for every column filled above
