R: the scale function, the match function, and %in% explained

        • 1. The scale function
        • 2. Matching two vectors with match and %in%
          • (1) The match function
          • (2) Membership tests

1. The scale function

scale: Scaling and Centering of Matrix-like Objects
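Besides the input x, the function takes two arguments, center and scale. Both default to TRUE, in which case the result is the z-score standardization.
The input can be a single vector or a multi-column matrix; the computation is done column by column and gives the same values as the apply() call below.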

df=mtcars[c("mpg","wt")]
scale(df)  ## z-score standardization
apply(df, 2, function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))  ## the same z-score, computed by hand for each column

scale takes two arguments in addition to x:

scale(x, center = TRUE, scale = TRUE)

Example (by default, both arguments are TRUE):

aa=c(1,3,5,7,9)
scale(aa)
       [,1]
[1,] -1.265
[2,] -0.632
[3,]  0.000
[4,]  0.632
[5,]  1.265
attr(,"scaled:center")
[1] 5  ## the mean
attr(,"scaled:scale")
[1] 3.16 ## the standard deviation (sd)

Centering only (subtract the mean from each value):

scale(aa, center = TRUE, scale = FALSE)
     [,1]
[1,]   -4
[2,]   -2
[3,]    0
[4,]    2
[5,]    4
attr(,"scaled:center")
[1] 5  ## the mean

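Scaling only, without subtracting the mean. With center = FALSE, each value is divided by the root mean square sqrt(sum(x^2)/(n-1)) rather than by the standard deviation:

scale(aa, scale = TRUE, center = FALSE)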
      [,1]
[1,] 0.156
[2,] 0.467
[3,] 0.778
[4,] 1.090
[5,] 1.401
attr(,"scaled:scale")
[1] 6.42
## How the values are computed
> aa/sqrt(sum(aa^2)/(length(aa)-1))
[1] 0.156 0.467 0.778 1.090 1.401

> sqrt(sum(aa^2)/(length(aa)-1))
[1] 6.42 ## this is where the scaled:scale value 6.42 above comes from
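
As a quick check of the two cases above, the following sketch compares the scaled:scale attribute with sd() and with the root mean square. The helper name rms is an assumption introduced here for illustration; it is not part of base R.

```r
aa <- c(1, 3, 5, 7, 9)

# Hypothetical helper (not in base R): root mean square with an n - 1 denominator
rms <- function(x) sqrt(sum(x^2) / (length(x) - 1))

# With center = TRUE (the default) the divisor is the standard deviation
all.equal(attr(scale(aa), "scaled:scale"), sd(aa))

# With center = FALSE the divisor is the root mean square instead
all.equal(attr(scale(aa, center = FALSE), "scaled:scale"), rms(aa))
```

Both comparisons should return TRUE, which is why the two divisors (3.16 and 6.42) differ for the same data.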

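By default, scale is exactly the z-score computation (in the examples from here on, aa is redefined as c(1,3,5,7,8)):

aa=c(1,3,5,7,8)
scale(aa) ## computed with scale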
        [,1]
[1,] -1.3270
[2,] -0.6286
[3,]  0.0698
[4,]  0.7683
[5,]  1.1175
attr(,"scaled:center")
[1] 4.8
attr(,"scaled:scale")
[1] 2.86
(aa-mean(aa))/sd(aa)  ## computed by hand
[1] -1.3270 -0.6286  0.0698  0.7683  1.1175

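If the data contain NA, those values are skipped:
For example, the mean is computed as (1+3+5+7+8)/5 = 4.8, and the sd is computed the same way, leaving out the trailing NA. This is why scale gives the same results for aa and bb below.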

aa  ## data 1
[1] 1 3 5 7 8
scale(aa)
        [,1]
[1,] -1.3270
[2,] -0.6286
[3,]  0.0698
[4,]  0.7683
[5,]  1.1175
attr(,"scaled:center")
[1] 4.8
attr(,"scaled:scale")
[1] 2.86

bb  ## data 2, contains NA
[1]  1  3  5  7  8 NA
scale(bb)
        [,1]
[1,] -1.3270
[2,] -0.6286
[3,]  0.0698
[4,]  0.7683
[5,]  1.1175
[6,]      NA
attr(,"scaled:center")
[1] 4.8
attr(,"scaled:scale")
[1] 2.86
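
The NA behaviour can be reproduced by hand with na.rm = TRUE. A minimal sketch, using the same bb as above (the variable names m, s and manual are introduced here for illustration):

```r
bb <- c(1, 3, 5, 7, 8, NA)

# Mean and sd are computed from the non-missing values only
m <- mean(bb, na.rm = TRUE)
s <- sd(bb, na.rm = TRUE)

# Manual z-scores; the NA stays NA in the result
manual <- (bb - m) / s

# as.vector() drops the matrix dimensions and attributes returned by scale()
all.equal(as.vector(scale(bb)), manual)
```

The missing value is excluded from the mean and sd but is kept as NA in its original position in the output.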

2. Matching two vectors with match and %in%

(1) The match function

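The match function matches two vectors against each other; it is useful for extracting the elements two vectors share and the elements they do not.
match takes four arguments: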

match(x, table, nomatch = NA_integer_, incomparables = NULL)
## similar in purpose to %in%, which returns TRUE/FALSE rather than positions
x %in% table

x: vector or NULL: the values to be matched. Long vectors are supported.
table: vector or NULL: the values to be matched against. Long vectors are not supported.
nomatch: the value to be returned in the case when no match is found. Note that it is coerced to integer.
incomparables: a vector of values that cannot be matched. Any value in x matching a value in this vector is assigned the nomatch value. For historical reasons, FALSE is equivalent to NULL.

match(a, b) returns, for each element of a, its position in b (NA if the element is not in b). For example:

a=c("A","B","C","D")
b=c("A","C","D","E")
match(a,b) ## for each element of a, its index in b
[1]  1 NA  2  3
b[match(a,b)] ## pull the shared elements out of b
[1] "A" NA  "C" "D"

Removing the NA from the result above:

as.vector(na.omit(b[match(a,b)]))
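
Another way to avoid the NA is the nomatch argument described above. A minimal sketch, relying on the fact that R ignores an index of 0 when subsetting (the variable name idx is introduced here for illustration):

```r
a <- c("A", "B", "C", "D")
b <- c("A", "C", "D", "E")

# Unmatched elements get 0 instead of NA
idx <- match(a, b, nomatch = 0)
idx
# [1] 1 0 2 3

# A 0 index selects nothing, so only the matched elements remain
b[idx]
# [1] "A" "C" "D"
```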

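Extracting the set difference (elements that are in b but not in a):
Because the result of match contains NA (b does not contain every element of a), using b[-match(a,b)] directly raises an error.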

> b[-match(a,b)]
Error in b[-match(a, b)] : only 0's may be mixed with negative subscripts
> -match(a,b)
[1] -1 NA -2 -3
> b[c(-1,-2,-3)]
[1] "E"
> b
[1] "A" "C" "D" "E"
Alternatively, drop the NA first, as above:

b[-as.vector(na.omit(match(a,b)))]
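
One caveat with the negative-index approach: if the two vectors share no elements, the index vector is empty, and subsetting with an empty negative index returns an empty vector rather than all of b. As a hedged alternative, logical indexing with %in% or base R's setdiff() avoids both the NA and the empty-index problem. The vector x below is a made-up example for the edge case.

```r
a <- c("A", "B", "C", "D")
b <- c("A", "C", "D", "E")

# Logical indexing: keep the elements of b that are not in a
b[!b %in% a]
# [1] "E"

# setdiff() gives the same result here (it also removes duplicates)
setdiff(b, a)
# [1] "E"

# Edge case: no shared elements, so the negative index is empty
x <- c("X", "Y")
b[-as.vector(na.omit(match(x, b)))]   # character(0), not all of b
b[!b %in% x]                          # all four elements of b
```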

(2) Membership tests

Use %in% to test membership:

a=c("A","B","C","D")
b=c("A","C","D","E")

a[a %in% b]  ## elements shared by a and b
sum(a %in% b) ## number of shared elements
a[ !a %in% b] ## elements found only in a

## Output:
> a[a %in% b]
[1] "A" "C" "D"
> sum(a %in% b)
[1] 3
> a[ !a %in% b]
[1] "B"
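
The two approaches are closely related: the R documentation defines %in% in terms of match. A short sketch of that relationship, together with base R's intersect() as another way to get the shared elements:

```r
a <- c("A", "B", "C", "D")
b <- c("A", "C", "D", "E")

# %in% is match() with nomatch = 0, turned into a logical vector
identical(a %in% b, match(a, b, nomatch = 0L) > 0L)
# [1] TRUE

# Equivalent test using the default nomatch = NA
identical(a %in% b, !is.na(match(a, b)))
# [1] TRUE

# intersect() returns the shared elements (with duplicates removed)
intersect(a, b)
# [1] "A" "C" "D"
```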

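In summary, when variables are on different scales, they should be z-score standardized with scale so that downstream analysis is more accurate;
for matching two vectors, %in% and match are both convenient, so use whichever you prefer.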

References:
https://www.jianshu.com/p/3173ee73ec7e (match function)
RStudio > ?scale
