scale: Scaling and Centering of Matrix-like Objects
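Besides the data x, the function has two arguments, center and scale. Both default to TRUE, in which case the result is the z-score standardization.
The input can be a single column of data or a multi-column matrix; the computation is done column by column. It is equivalent to the code below and produces the same values.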
df=mtcars[c("mpg","wt")]
scale(df) ## z-score standardization
apply(df, 2, function(x){(x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)}) ## apply a hand-written z-score function to each column
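The claim that the two calls give the same values can be checked directly. The snippet below is a minimal sketch; the names z_scale, z_apply and same_values are introduced here for illustration only. Since scale() also attaches the column means and standard deviations as attributes, the comparison ignores attributes.

```r
df <- mtcars[c("mpg", "wt")]

z_scale <- scale(df)
z_apply <- apply(df, 2, function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))

# scale() stores "scaled:center" and "scaled:scale" as attributes,
# so compare the numeric values only.
same_values <- isTRUE(all.equal(z_scale, z_apply, check.attributes = FALSE))
print(same_values)

# After standardization each column has mean (numerically) 0 and sd 1.
print(round(colMeans(z_scale), 10))
print(apply(z_scale, 2, sd))
```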
scale has two arguments in addition to x:
scale(x, center = TRUE, scale = TRUE)
An example (by default both arguments are TRUE):
aa=c(1,3,5,7,9)
scale(aa)
[,1]
[1,] -1.265
[2,] -0.632
[3,] 0.000
[4,] 0.632
[5,] 1.265
attr(,"scaled:center")
[1] 5 ## the mean
attr(,"scaled:scale")
[1] 3.16 ## the standard deviation (sd)
Centering only (subtract the mean from each value):
scale(aa, center = TRUE, scale = FALSE)
[,1]
[1,] -4
[2,] -2
[3,] 0
[4,] 2
[5,] 4
attr(,"scaled:center")
[1] 5 ## the mean
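Scaling only, without subtracting the mean. Note that in this case the divisor is the root mean square sqrt(sum(x^2)/(n-1)), not the standard deviation:
scale(aa, scale = TRUE, center = FALSE)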
[,1]
[1,] 0.156
[2,] 0.467
[3,] 0.778
[4,] 1.090
[5,] 1.401
attr(,"scaled:scale")
[1] 6.42
## How the values are computed
> aa/sqrt(sum(aa^2)/(length(aa)-1))
[1] 0.156 0.467 0.778 1.090 1.401
> sqrt(sum(aa^2)/(length(aa)-1))
[1] 6.42 ## this is where the 6.42 above comes from
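The difference between the two divisors is easy to miss, so here is a small sketch comparing them. It also shows one way to divide by the standard deviation without centering, by passing the divisor explicitly through the scale argument. The names rms, s and by_sd are illustrative.

```r
aa <- c(1, 3, 5, 7, 9)

# Divisor used by scale() when center = FALSE: root mean square with n - 1
rms <- sqrt(sum(aa^2) / (length(aa) - 1))
# Divisor used when center = TRUE: the ordinary standard deviation
s <- sd(aa)
print(c(rms = rms, sd = s))

# To divide by sd without subtracting the mean, supply the divisor yourself.
by_sd <- scale(aa, center = FALSE, scale = sd(aa))
print(as.vector(by_sd))
print(aa / sd(aa))
```

The two divisors coincide only when the data already have mean zero, which is why the default call (center = TRUE) ends up dividing by sd.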
With the defaults, scale is exactly the z-score computation (from here on aa is c(1,3,5,7,8)):
scale(aa) ## computed with scale
[,1]
[1,] -1.3270
[2,] -0.6286
[3,] 0.0698
[4,] 0.7683
[5,] 1.1175
attr(,"scaled:center")
[1] 4.8
attr(,"scaled:scale")
[1] 2.86
(aa-mean(aa))/sd(aa) ## computed by hand
[1] -1.3270 -0.6286 0.0698 0.7683 1.1175
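If the data contain NA, those values are skipped:
For example, the mean is computed as (1+3+5+7+8)/5 = 4.8, and the sd is computed the same way, leaving out the trailing NA. This is why aa and bb give the same scale results below.
aa ## data 1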
[1] 1 3 5 7 8
scale(aa)
[,1]
[1,] -1.3270
[2,] -0.6286
[3,] 0.0698
[4,] 0.7683
[5,] 1.1175
attr(,"scaled:center")
[1] 4.8
attr(,"scaled:scale")
[1] 2.86
bb ## data 2, contains NA
[1] 1 3 5 7 8 NA
scale(bb)
[,1]
[1,] -1.3270
[2,] -0.6286
[3,] 0.0698
[4,] 0.7683
[5,] 1.1175
[6,] NA
attr(,"scaled:center")
[1] 4.8
attr(,"scaled:scale")
[1] 2.86
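The NA handling can be reproduced by hand with na.rm = TRUE. The following is a minimal sketch; m, s and manual are names made up for this example.

```r
bb <- c(1, 3, 5, 7, 8, NA)

# Mean and sd over the non-missing values only
m <- mean(bb, na.rm = TRUE)
s <- sd(bb, na.rm = TRUE)

manual <- (bb - m) / s          # the NA stays NA in the result
print(manual)
print(as.vector(scale(bb)))

# Without na.rm = TRUE the plain summary functions return NA
print(mean(bb))
```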
The match function matches two vectors against each other. It is useful for extracting the elements two vectors share and the elements in which they differ.
match has four arguments:
match(x, table, nomatch = NA_integer_, incomparables = NULL)
## %in% does a similar job in a more direct way
x %in% table
x: vector or NULL: the values to be matched. Long vectors are supported.
table: vector or NULL: the values to be matched against. Long vectors are not supported.
nomatch: the value to be returned in the case when no match is found. Note that it is coerced to integer.
incomparables: a vector of values that cannot be matched. Any value in x matching a value in this vector is assigned the nomatch value. For historical reasons, FALSE is equivalent to NULL.
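The nomatch and incomparables arguments are easier to understand on a tiny input. This is a sketch with made-up vectors x and tbl; the variable names are not from the original post.

```r
x   <- c("A", "B", NA)
tbl <- c(NA, "A")

r_default <- match(x, tbl)                      # "B" is absent, NA matches NA
r_nomatch <- match(x, tbl, nomatch = 0L)        # absent values become 0
r_incomp  <- match(x, tbl, incomparables = NA)  # NA in x is never matched

print(r_default)
print(r_nomatch)
print(r_incomp)
```

By default match treats NA as matching NA, so incomparables = NA is one way to opt out of that behaviour.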
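match finds the position in b of each element of a. For example:
a=c("A","B","C","D")
b=c("A","C","D","E")
match(a,b) ## for each element of a, its index in b (NA if it is not in b)
[1] 1 NA 2 3
b[match(a,b)] ## pull the shared elements out of b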
[1] "A" NA "C" "D"
Remove the NA from the result above:
as.vector(na.omit(b[match(a,b)]))
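Extracting the set difference (elements that are in b but not in a):
The value returned by match contains NA, because b does not contain every element of a, so using b[-match(a,b)] directly raises an error.
> b[-match(a,b)]
Error in b[-match(a, b)] : only 0's may be mixed with negative subscripts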
> -match(a,b)
[1] -1 NA -2 -3
> b[c(-1,-2,-3)]
[1] "E"
> b
[1] "A" "C" "D" "E"
Or drop the NA first, as above:
b[-as.vector(na.omit(match(a,b)))]
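Zero is the one value that may be mixed with negative subscripts, so another option is to ask match for 0 instead of NA through its nomatch argument. The sketch below compares that with two alternatives from base R; idx and the only_b_* names are illustrative.

```r
a <- c("A", "B", "C", "D")
b <- c("A", "C", "D", "E")

# Unmatched elements give 0, which negative indexing silently ignores.
idx <- match(a, b, nomatch = 0L)
only_b_1 <- b[-idx]

# Alternatives that do not go through positions at all.
only_b_2 <- b[!b %in% a]
only_b_3 <- setdiff(b, a)   # note: setdiff also removes duplicates

print(only_b_1)
print(only_b_2)
print(only_b_3)
```

One caveat with the negative-index versions: if a and b share no elements, the index is empty (or all zeros) and b[-idx] returns an empty vector rather than all of b. The %in% and setdiff forms do not have this problem.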
Use %in% to test membership:
a=c("A","B","C","D")
b=c("A","C","D","E")
a[a %in% b] ## elements shared by a and b
sum(a %in% b) ## number of shared elements
a[ !a %in% b] ## elements that are only in a
## Output:
> a[a %in% b]
[1] "A" "C" "D"
> sum(a %in% b)
[1] 3
> a[ !a %in% b]
[1] "B"
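The relationship between the two functions can be made explicit: %in% behaves like a match call whose result is turned into TRUE/FALSE. A short sketch, with in_via_match as an illustrative name:

```r
a <- c("A", "B", "C", "D")
b <- c("A", "C", "D", "E")

# A match with nomatch = 0 is positive exactly where the element was found.
in_via_match <- match(a, b, nomatch = 0L) > 0L
print(in_via_match)
print(identical(in_via_match, a %in% b))

# %in% answers "is it there?"; match answers "where is it?"
print(which(a %in% b))   # positions within a
print(match(a, b))       # positions within b (NA when absent)
```

Because %in% always returns a logical vector of the same length as a and never NA, it is usually the safer choice for subsetting, while match is the one to use when the positions themselves are needed.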
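In summary, when variables are on different orders of magnitude, standardize them with scale (z-score) so that downstream analysis is more reliable.
For matching two vectors, %in% and match are both convenient; use whichever you prefer.
References:
https://www.jianshu.com/p/3173ee73ec7e (the match function)
RStudio > ?scale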