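Binormalization is a procedure that removes background correlation by normalizing the data in both directions (rows and columns), so that the data become comparable.
Z-score normalization is a common normalization method. Like other normalization methods, it is used to remove background effects from the data so that values can be compared.
The z-score uses the mean and standard deviation of the original data to normalize it, so that the processed data have mean 0 and standard deviation 1. Note that this rescales the data but does not change the shape of its distribution; the result is normally distributed only if the original data were.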
For a sequence $X$:
$$X=[x_1,x_2,\ldots,x_n]$$
the z-score transform is:
$$y_i=\frac{x_i-\bar{x}}{\sigma(X)}$$
where:
$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$$
$$\sigma(X)=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$$
This yields the transformed sequence $Y$:
$$Y=[y_1,y_2,\ldots,y_n]$$
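As a quick check of the z-score formulas, here is a minimal R sketch. The helper name `zscore` is an assumption made for illustration; the input values are the sample2 column from the example data used later in this article.

```r
# Hypothetical helper: z-score a numeric vector.
# sd() in R uses the (n - 1) denominator, matching sigma(X) as defined above.
zscore <- function(x) (x - mean(x)) / sd(x)

x <- c(4.173328, 1.053216, 7.352698, 6.011002, 8.525912)
y <- zscore(x)

mean(y)  # numerically 0, up to floating-point error
sd(y)    # numerically 1

# base::scale() applies the same centering and scaling to a column
all.equal(y, as.numeric(scale(x)))
```

Writing the transform as a one-line function makes it easy to apply along rows or columns with `apply()`.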
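In brief, binormalization applies z-score normalization alternately to the columns and to the rows of the data matrix, and repeats this for a number of iterations.
In practice, 20 iterations are commonly used. Ideally, after binormalization every row and every column has mean 0 (tolerance 0.001) and standard deviation 1 (tolerance 0.001).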
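The iteration limit and tolerance described here can be sketched in R as follows. This is only an illustration under stated assumptions: the helper names `zscore`, `is_binormalized`, and `binormalize`, the early stop once the tolerance is met, and the random test matrix are all hypothetical and not part of the original procedure description.

```r
# Hypothetical helper: z-score a numeric vector (sample sd, n - 1 denominator).
zscore <- function(v) (v - mean(v)) / sd(v)

# Hypothetical helper: TRUE when every row and column of m has
# mean 0 and sd 1 within the tolerance tol.
is_binormalized <- function(m, tol = 1e-3) {
  all(abs(rowMeans(m)) < tol,
      abs(colMeans(m)) < tol,
      abs(apply(m, 1, sd) - 1) < tol,
      abs(apply(m, 2, sd) - 1) < tol)
}

# Hypothetical helper: alternate column-wise and row-wise z-scoring,
# stopping early once the tolerance is met or after max_iter iterations.
# Assumes no row or column becomes constant (sd = 0).
binormalize <- function(m, max_iter = 20, tol = 1e-3) {
  converged <- FALSE
  iterations <- 0
  for (iter in seq_len(max_iter)) {
    m <- apply(m, 2, zscore)      # normalize each column
    m <- t(apply(m, 1, zscore))   # normalize each row
    iterations <- iter
    converged <- is_binormalized(m, tol)
    if (converged) break
  }
  list(data = m, iterations = iterations, converged = converged)
}

set.seed(42)
m <- matrix(rnorm(30, mean = 100, sd = 20), nrow = 6)
res <- binormalize(m)
res$converged    # whether the 0.001 tolerance was reached
res$iterations   # number of iterations actually run
```

Because the rows are normalized last in each iteration, the row statistics are exact to machine precision, while the column statistics only approach 0 and 1; the `converged` flag reports whether they got within the tolerance.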
The correlation coefficient measures the degree of linear correlation between variables and is usually denoted by $\rho$. The Pearson correlation coefficient is the most commonly used one.
The Pearson correlation coefficient is defined as:
$$\rho_{XY}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)\sigma_x\sigma_y}$$
When binormalization has normalized the data so that the means are $\bar{x}=0,\bar{y}=0$ and the standard deviations are $\sigma_x=1,\sigma_y=1$, the expression above simplifies to:
$$\rho_{XY}=\frac{\sum_{i=1}^{n}x_iy_i}{n-1}$$
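The simplification can be checked numerically. The sketch below uses the sample1 and sample2 columns from the example data in this article and compares the simplified formula with R's built-in `cor()`; the variable names `zx` and `zy` are introduced here only for illustration.

```r
# Simplified Pearson formula; valid only for inputs with mean 0 and sd 1
pearson_simplify <- function(X, Y) sum(X * Y) / (length(X) - 1)

x <- c(0.98833543, 0.06838629, 0.62895223, 0.57501167, 0.96377077)
y <- c(4.173328, 1.053216, 7.352698, 6.011002, 8.525912)

# Standardize both vectors to mean 0 and sd 1
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# On standardized inputs the simplified formula agrees with cor()
isTRUE(all.equal(pearson_simplify(zx, zy), cor(x, y)))

# On raw inputs it does not, because the means and sds are not 0 and 1
isTRUE(all.equal(pearson_simplify(x, y), cor(x, y)))
```

The second comparison is a reminder that the simplified formula is a correlation only when both inputs have already been standardized.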
The following R code simulates 20 iterations of the binormalization procedure:
#----------------functions------------------
# Pearson correlation coefficient: sum of the products of the deviations,
# divided by (n - 1) * sd(X) * sd(Y)
pearson <- function(X, Y) sum((X - mean(X)) * (Y - mean(Y))) / ((length(X) - 1) * sd(X) * sd(Y))
# Simplified Pearson correlation coefficient, valid when X and Y both have mean 0 and sd 1
pearson_simplify <- function(X, Y) sum(X * Y) / (length(X) - 1)
# Normalize data with the given mean and standard deviation
renormalization <- function(X, mean, sd) (X - mean) / sd
#----------------program---------------------
# Initialize the data
my_data <- data.frame(
  sample1 = c(0.98833543, 0.06838629, 0.62895223, 0.57501167, 0.96377077),
  sample2 = c(4.173328, 1.053216, 7.352698, 6.011002, 8.525912),
  sample3 = c(74.594239, 23.716227, 9.729248, 88.667608, 11.912238),
  sample4 = c(547.4608, 889.3850, 576.3429, 756.6556, 996.4749),
  sample5 = c(7993.627, 6547.821, 9074.916, 4500.541, 5077.699)
)
#      sample1  sample2   sample3  sample4  sample5
# 1 0.98833543 4.173328 74.594239 547.4608 7993.627
# 2 0.06838629 1.053216 23.716227 889.3850 6547.821
# 3 0.62895223 7.352698  9.729248 576.3429 9074.916
# 4 0.57501167 6.011002 88.667608 756.6556 4500.541
# 5 0.96377077 8.525912 11.912238 996.4749 5077.699
# Row means
# 1 1724.169
# 2 1492.409
# 3 1933.794
# 4 1070.490
# 5 1219.115
row_mean <- rowMeans(my_data)
# Row standard deviations
# 1 3512.107
# 2 2851.715
# 3 3999.644
# 4 1943.293
# 5 2199.147
row_sd <- apply(my_data, 1, sd)
# Column means
#   sample1   sample2    sample3     sample4      sample5
# 0.6448913 5.4232313 41.7239120 753.2638309 6638.9206345
col_mean <- colMeans(my_data)
# Column standard deviations
#   sample1   sample2    sample3     sample4      sample5
# 0.3732068 2.9306244 37.1511782 194.5173518 1922.7376092
col_sd <- sapply(my_data, sd)
# Correlation of each sample with the row means in the original data
# (each value is a Pearson coefficient and therefore lies in [-1, 1])
correlations <- sapply(my_data, function(X) pearson(X, row_mean))
# Alternate column and row normalization for 20 iterations
for (iter in 1:20) {
  # Normalize each column with the current column statistics
  for (name in colnames(my_data)) {
    my_data[[name]] <- renormalization(my_data[[name]], col_mean[[name]], col_sd[[name]])
  }
  # Recompute the row statistics, then normalize each row
  row_mean <- rowMeans(my_data)
  row_sd <- apply(my_data, 1, sd)
  for (r in seq_len(nrow(my_data))) {
    my_data[r, ] <- renormalization(my_data[r, ], row_mean[r], row_sd[r])
  }
  # Refresh the column statistics for the next iteration
  col_mean <- colMeans(my_data)
  col_sd <- sapply(my_data, sd)
}
# Final row statistics (rows were normalized last, so they are exact up to rounding)
row_mean <- rowMeans(my_data)
row_sd <- apply(my_data, 1, sd)
# > row_mean
# [1] -3.330669e-17 0.000000e+00 -6.661338e-17 2.220446e-17 -2.220446e-17
# > row_sd
# [1] 1 1 1 1 1
# > col_mean
# sample1 sample2 sample3 sample4 sample5
# -0.0002037928 0.0006153285 0.0001256324 0.0001857253 -0.0007228934
# > col_sd
# sample1 sample2 sample3 sample4 sample5
# 0.9994494 0.9991644 1.0013311 0.9993542 1.0006985
# > my_data
# sample1 sample2 sample3 sample4 sample5
# 1 1.2684309 -1.1131204 0.3195442 -0.9219689 0.4471141
# 2 -1.0276294 -1.0128814 0.1996522 1.2235862 0.6172723
# 3 -0.2465572 1.0202117 -0.7394164 -1.1013365 1.0670983
# 4 -0.7944553 0.3637377 1.3944729 0.1613831 -1.1251385
# 5 0.7991919 0.7451289 -1.1736248 0.6392647 -1.0099608
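# After binormalization the values are close to 0.
# Note: row_mean is itself numerically 0 at this point, so each value is
# effectively the covariance between a sample and the row means.
# sample1 sample2 sample3 sample4 sample5
# -1.530233e-17 -9.838496e-18 2.390882e-17 2.336509e-17 -2.213309e-17
correlations <- sapply(my_data, function(X) pearson_simplify(X, row_mean))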
As the output shows, binormalization can substantially remove the platform background correlation.