About use
The official documentation describes it as follows:
an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs".
These options control how NA values are handled. What does each one actually do in practice?
1.everything
If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
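In other words, if any column contains an NA, every correlation between that column and another column is NA.
df <- data.frame(a = c(1, NA, 2, 4, 5, 6),
                 b = c(3, 5, 3, 3, 4, 5),
                 c = c(1, 2, 3, 4, 6, 7))
cor(df, use = 'everything')  # 'everything' is also the default
So every correlation between a and another column is NA, while the correlation between b and c is computed normally.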
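A quick way to confirm this is to check which off-diagonal entries of the result are NA. This is a minimal sketch that rebuilds the same df; the variable name m is only for illustration.

```r
df <- data.frame(a = c(1, NA, 2, 4, 5, 6),
                 b = c(3, 5, 3, 3, 4, 5),
                 c = c(1, 2, 3, 4, 6, 7))

m <- cor(df, use = "everything")

# Pairs that involve a are NA; the b-c pair is unaffected.
is.na(m["a", "b"])  # TRUE
is.na(m["a", "c"])  # TRUE
is.na(m["b", "c"])  # FALSE
```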
2.all.obs
If use is "all.obs", then the presence of missing observations will produce an error.
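This option requires the data to contain no NA values: any NA raises an error, and the computation only runs when the data is complete.
df <- data.frame(a = c(1, NA, 2, 4, 5, 6),
                 b = c(3, 5, 3, 3, 4, 5),
                 c = c(1, 2, 3, 4, 6, 7))
cor(df, use = 'all.obs')  # raises an error because a contains an NA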
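Because the call above stops with an error, it can be convenient in a script to catch the error rather than let it abort. The following is a sketch only; wrapping the call in tryCatch() and the names res and ok are choices made here for illustration.

```r
df <- data.frame(a = c(1, NA, 2, 4, 5, 6),
                 b = c(3, 5, 3, 3, 4, 5),
                 c = c(1, 2, 3, 4, 6, 7))

# tryCatch() turns the error into a value so the script can continue.
res <- tryCatch(
  cor(df, use = "all.obs"),
  error = function(e) conditionMessage(e)
)
print(res)  # the error message, as a character string

# With the NA-containing column left out, the same call succeeds.
ok <- cor(df[, c("b", "c")], use = "all.obs")
print(ok)
```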
3.complete.obs
If use is "complete.obs" then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error).
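This option drops every row that contains an NA and computes the pairwise correlations from the remaining rows. If every row contains an NA, it raises an error.
df <- data.frame(a = c(1, NA, 2, 4, 5, 6),
                 b = c(3, 5, 3, 3, 4, 5),
                 c = c(1, 2, 3, 4, 6, 7))
cor(df, use = 'complete.obs')
# de is df with row 2 (the row containing the NA) removed
de <- data.frame(a = c(1, 2, 4, 5, 6),
                 b = c(3, 3, 3, 4, 5),
                 c = c(1, 3, 4, 6, 7))
cor(de, use = 'complete.obs')
Here, removing row 2 from the df dataset gives the de dataset.
Running both calls shows that the two correlation matrices are identical, which confirms that complete.obs drops the rows containing an NA and computes the pairwise correlations from the remaining rows.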
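Instead of comparing the two printed matrices by eye, the equality can be checked programmatically. This is a minimal sketch using all.equal(); the second comparison assumes that na.omit() is an acceptable way to perform the same casewise deletion by hand.

```r
df <- data.frame(a = c(1, NA, 2, 4, 5, 6),
                 b = c(3, 5, 3, 3, 4, 5),
                 c = c(1, 2, 3, 4, 6, 7))
de <- data.frame(a = c(1, 2, 4, 5, 6),
                 b = c(3, 3, 3, 4, 5),
                 c = c(1, 3, 4, 6, 7))

r_df <- cor(df, use = "complete.obs")

# Same result as computing on the hand-built de.
same_as_de <- all.equal(r_df, cor(de))

# Same result as dropping incomplete rows explicitly with na.omit().
same_as_omit <- all.equal(r_df, cor(na.omit(df)))

print(c(same_as_de, same_as_omit))
```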
If every row contains an NA, an error is raised:
dt <- data.frame(a = c(1, NA, 2, 4, 5, NA),
                 b = c(3, 5, NA, 3, NA, 5),
                 c = c(NA, 2, 3, NA, 6, 7))
cor(dt, use = 'complete.obs')  # error: no row is free of NA
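Whether a data frame will trigger this error can be checked beforehand with complete.cases(), which reports for each row whether it is free of NA. The sketch below is one possible check, not the only one; the name res is illustrative.

```r
dt <- data.frame(a = c(1, NA, 2, 4, 5, NA),
                 b = c(3, 5, NA, 3, NA, 5),
                 c = c(NA, 2, 3, NA, 6, 7))

# One logical value per row: TRUE only if the row has no NA.
complete.cases(dt)
any(complete.cases(dt))  # FALSE: there are no complete cases

# So the call fails; capture the message instead of stopping.
res <- tryCatch(
  cor(dt, use = "complete.obs"),
  error = function(e) conditionMessage(e)
)
print(res)
```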
4.na.or.complete
"na.or.complete" is the same unless there are no complete cases, that gives NA.
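This option behaves like complete.obs, except that when every row contains an NA (that is, there are no complete cases), the call does not raise an error and returns NA instead.
df <- data.frame(a = c(1, NA, 2, 4, 5, 6),
                 b = c(3, 5, 3, 3, 4, 5),
                 c = c(1, 2, 3, 4, 6, 7))
cor(df, use = 'na.or.complete')
# de is df with row 2 (the row containing the NA) removed
de <- data.frame(a = c(1, 2, 4, 5, 6),
                 b = c(3, 3, 3, 4, 5),
                 c = c(1, 3, 4, 6, 7))
cor(de, use = 'na.or.complete')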
If every row contains an NA, the correlations are returned as NA:
dt <- data.frame(a = c(1, NA, 2, 4, 5, NA),
                 b = c(3, 5, NA, 3, NA, 5),
                 c = c(NA, 2, 3, NA, 6, 7))
cor(dt, use = 'na.or.complete')  # NA instead of an error
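The difference between the two options is easiest to see by running both on the same data. The sketch below uses a hypothetical helper, try_cor, written here only for illustration (it is not part of R); it returns either the correlation matrix or the error message.

```r
dt <- data.frame(a = c(1, NA, 2, 4, 5, NA),
                 b = c(3, 5, NA, 3, NA, 5),
                 c = c(NA, 2, 3, NA, 6, 7))

# Hypothetical helper: the matrix on success, the error message on failure.
try_cor <- function(data, use) {
  tryCatch(cor(data, use = use), error = function(e) conditionMessage(e))
}

strict  <- try_cor(dt, "complete.obs")    # fails: an error message
lenient <- try_cor(dt, "na.or.complete")  # succeeds: a matrix

is.character(strict)
# The correlations between distinct columns are all NA.
all(is.na(lenient[upper.tri(lenient)]))
```

This makes na.or.complete the more convenient choice inside loops or pipelines where a missing result is preferable to an aborted run.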
5.pairwise.complete.obs
Finally, if use has the value "pairwise.complete.obs" then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For cov and var, "pairwise.complete.obs" only works with the "pearson" method.
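This option is a little more involved. Note that the restriction to the Pearson method quoted above applies to cov and var, not to cor. Its effect is that when a column contains an NA, the rows with an NA are dropped only for the pairs that involve that column:
df <- data.frame(a = c(1, NA, 2, 4, 5, 6),
                 b = c(3, 5, 3, 3, 4, 5),
                 c = c(1, 2, 3, 4, 6, 7))
cor(df, use = 'pairwise.complete.obs')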
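# dm: columns a and b with row 2 removed
dm <- data.frame(a = c(1, 2, 4, 5, 6),
                 b = c(3, 3, 3, 4, 5))
cor(dm, use = 'pairwise.complete.obs')
# dn: columns a and c with row 2 removed
dn <- data.frame(a = c(1, 2, 4, 5, 6),
                 c = c(1, 3, 4, 6, 7))
cor(dn, use = 'pairwise.complete.obs')
# dr: columns b and c with all six rows kept
dr <- data.frame(b = c(3, 5, 3, 3, 4, 5),
                 c = c(1, 2, 3, 4, 6, 7))
cor(dr, use = 'pairwise.complete.obs')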
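As the results show, column a contains an NA while b and c do not. When computing the a-b and a-c correlations, the row with the NA is dropped and the remaining rows are used (as in dm and dn); when computing the b-c correlation, all of the data is used (as in dr).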
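The same comparison can be made without building dm, dn and dr by hand. The following sketch checks each entry of the pairwise result against a correlation computed on exactly the rows that pair uses, and then contrasts it with complete.obs; the names pw, co and keep are illustrative.

```r
df <- data.frame(a = c(1, NA, 2, 4, 5, 6),
                 b = c(3, 5, 3, 3, 4, 5),
                 c = c(1, 2, 3, 4, 6, 7))

pw <- cor(df, use = "pairwise.complete.obs")
co <- cor(df, use = "complete.obs")

keep <- !is.na(df$a)  # rows usable for any pair that involves a

# Pairs with a use only the rows where a is present.
ab_ok <- all.equal(pw["a", "b"], cor(df$a[keep], df$b[keep]))
ac_ok <- all.equal(pw["a", "c"], cor(df$a[keep], df$c[keep]))

# The b-c pair uses all six rows.
bc_ok <- all.equal(pw["b", "c"], cor(df$b, df$c))

# complete.obs drops row 2 for every pair, so its b-c entry differs.
bc_differs <- !isTRUE(all.equal(pw["b", "c"], co["b", "c"]))

print(c(ab_ok, ac_ok, bc_ok, bc_differs))
```

Because each entry may be based on a different subset of rows, the resulting matrix is not guaranteed to be positive semi-definite, as the documentation quoted above points out.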