All about vst normalization in DESeq2

Preface

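First, vst is also a normalization method based on the negative binomial distribution.
Why do we need the vst normalization for large-sample data? The reasons are the following drawbacks of a plain logarithmic transformation: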

1. It is a one-size-fits-all solution, ignoring the measurement noise characteristics associated with each instrument and each run.
2. Negative values that frequently result from background correction of low-intensity signals have to be reset before taking the logarithm, and thus they are artificially truncated.
3. Logarithmic transformation inflates variances when the intensities are close to zero, although it stabilizes the variances at higher intensities.
4. A 2-fold difference can be very significant when the intensities are high; however, when the intensities are close to the background level a 2-fold difference can be within the expected measurement error.

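In short, the rlog normalization is slow on large sample sets and is sensitive to the count values, so the vst normalization is introduced to obtain a matrix of approximately homoskedastic values.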
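Point 3 in the list above can be checked with a small simulation. The following is a minimal sketch in base R, not part of DESeq2; the means, the dispersion and the pseudocount of 1 are made-up values chosen only for illustration.

```r
set.seed(42)

# Simulate negative binomial counts for one low-expressed and one
# high-expressed gene (means and dispersion are made-up values).
n_samples  <- 5000
dispersion <- 0.05
low  <- rnbinom(n_samples, mu = 2,    size = 1 / dispersion)
high <- rnbinom(n_samples, mu = 1000, size = 1 / dispersion)

# Standard deviation after a plain log2 transform with a pseudocount of 1
sd_log_low  <- sd(log2(low + 1))
sd_log_high <- sd(log2(high + 1))

cat("sd of log2(count + 1), low mean: ", round(sd_log_low, 2), "\n")
cat("sd of log2(count + 1), high mean:", round(sd_log_high, 2), "\n")
```

With these settings the log-transformed low-count gene should show the larger spread, which is the behaviour a variance-stabilizing transformation is meant to remove.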

Biostars also describes the difference between rlog and vst:

This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size. The rlog is less sensitive to size factors, which can be an issue when size factors vary widely. These transformations are useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.
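As a usage illustration of the description above, here is a hedged sketch that calls vst() on simulated data. It assumes the DESeq2 and SummarizedExperiment packages are installed and does nothing otherwise; the data set size and the lowered nsub value are arbitrary choices for a small example.

```r
vst_dim <- NULL
if (requireNamespace("DESeq2", quietly = TRUE)) {
  set.seed(1)
  # Simulated example data set: 2000 genes, 6 samples
  dds <- DESeq2::makeExampleDESeqDataSet(n = 2000, m = 6)
  # nsub is lowered from its default of 1000 because the example is small
  vsd <- DESeq2::vst(dds, blind = TRUE, nsub = 500)
  # Extract the transformed matrix (genes x samples)
  mat <- SummarizedExperiment::assay(vsd)
  vst_dim <- dim(mat)
  print(vst_dim)
}
```

The returned object is a DESeqTransform when a DESeqDataSet is passed in; the transformed values are on a log2-like scale and can be passed to clustering or PCA.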

variance-stabilizing transformation

The variance-stabilizing transformation is a normalization method proposed in 2008 (published in NAR); for details see: Model-based variance-stabilizing transformation for Illumina microarray data

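That paper mainly describes the model behind the vst normalization:
1. First, the model computes, for each gene, the mean and the variance across samples

Formula 1

The formula above computes the mean and the variance of each gene across the different samples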
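Step 1 can be sketched in a few lines of base R. The count matrix below is made up for illustration, with genes in rows and samples in columns.

```r
# Toy count matrix: 3 genes (rows) x 4 samples (columns); values are made up
counts <- matrix(c(  5,   8,   6,   9,
                    50,  65,  40,  55,
                   500, 620, 480, 560),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))

gene_mean <- rowMeans(counts)
gene_var  <- apply(counts, 1, var)   # sample variance (denominator n - 1)

print(data.frame(mean = gene_mean, variance = gene_var))
```

In this toy matrix the variance grows with the mean, which is the dependence that step 2 goes on to model.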

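2. Estimate the functional relationship between the mean and the variance

Formula 2

Here v is the variance and u is the mean; v(u) means that the variance v is a function of the mean u.
From the mathematical derivation, the function h(y) is then obtained as:
Formula 3

Formula 3 is derived from asymptotic theory (the delta method) together with Formula 2. After the transformation we obtain a matrix in which the variances of the genes are approximately equal (the transformation is in effect a further compression, so the differences in variance between genes are reduced)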
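For orientation, the relations described in step 2 can be written out as follows. This is a reconstruction under an assumption: it uses the quadratic variance function with parameters c1, c2 and c3 that the later steps refer to, and it should be checked against the cited paper.

```latex
% Assumed mean-variance relation (the role of Formula 2)
v(u) = (c_1 u + c_2)^2 + c_3

% Variance-stabilizing transformation (the role of Formula 3)
h(y) = \int^{y} \frac{du}{\sqrt{v(u)}}

% Delta-method justification: for Y with mean u and variance v(u)
\operatorname{Var}\bigl(h(Y)\bigr) \approx h'(u)^2 \, v(u) = \frac{v(u)}{v(u)} = 1
```

The last line shows why this choice of h makes the variance approximately independent of the mean.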

delta method

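The role of the delta method is to approximate the distribution of the data after the transformation h(y) from the distribution of the data before it.
By Taylor's formula, if y1 and y2 denote the expression levels of two genes, the difference before the transformation is y1 - y2 and the difference after it is h(y1) - h(y2), which to first order is h'(y)(y1 - y2). Differences in expression are therefore shrunk by the transformation, with h'() acting as the shrinkage factor, so that the variances of the genes become approximately equal.
Reference: Estimating Transformations for Regression via Additivity and Variance Stabilization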
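The delta-method argument can be checked numerically on a textbook case. The sketch below uses Poisson counts, where v(u) = u and the stabilizing transformation is h(y) = 2 * sqrt(y); this is an illustrative stand-in, not the model used in the paper or in DESeq2, and the mean levels are arbitrary.

```r
set.seed(7)

# For Poisson counts v(u) = u, so h(y) = integral of 1/sqrt(u) du = 2 * sqrt(y)
h <- function(y) 2 * sqrt(y)

lambdas <- c(20, 100, 1000)   # arbitrary mean levels
raw_var <- sapply(lambdas, function(l) var(rpois(1e5, l)))
vst_var <- sapply(lambdas, function(l) var(h(rpois(1e5, l))))

# Raw variances track the mean; transformed variances stay close to 1
print(round(rbind(raw_var, vst_var), 3))
```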

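3. Find a suitable transformation that maps the matrix before normalization (Y) to the matrix after normalization (Ỹ). To do so, rearrange Formula 2:


Formula 4

The right-hand side of the equation is a linear model, so it can be fitted using the variance and the mean of each gene, which yields c1, c2 and c3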
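A minimal sketch of this fitting step in base R follows. It assumes the variance function v(u) = (c1*u + c2)^2 + c3, so that sqrt(v - c3) = c1*u + c2 is linear in u, and it assumes c3 is already known (for example from background-level genes); the parameter values and the noise-free data are hypothetical.

```r
# Assumed model (hypothetical parameter values): v(u) = (c1 * u + c2)^2 + c3
c1_true <- 0.1; c2_true <- 2; c3_true <- 25

u <- seq(10, 1000, length.out = 200)      # per-gene means
v <- (c1_true * u + c2_true)^2 + c3_true  # noise-free per-gene variances

# Assumption: c3 is already known, so that sqrt(v - c3) = c1 * u + c2
# is a straight line in u and can be fitted by ordinary least squares.
fit <- lm(sqrt(v - c3_true) ~ u)
c2_hat <- unname(coef(fit)[1])   # intercept
c1_hat <- unname(coef(fit)[2])   # slope

cat("c1 =", c1_hat, " c2 =", c2_hat, "\n")
```

With real data the variances are noisy, so the fit would only recover the parameters approximately.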

4. Substitute c1, c2 and c3 back into Formula 3 and solve the integral:


Formula 5

This gives the transformation h(y), which can then be used for normalization
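To make step 4 concrete, the sketch below compares a closed-form candidate for h(y) with numerical integration. It assumes v(u) = (c1*u + c2)^2 + c3 with c3 > 0, for which the integral of 1/sqrt(v(u)) is asinh((c1*y + c2)/sqrt(c3))/c1 up to a constant; the parameter values are hypothetical.

```r
c1 <- 0.1; c2 <- 2; c3 <- 25   # hypothetical values, c3 > 0

# Assumed variance function and the closed-form antiderivative of 1/sqrt(v)
v <- function(u) (c1 * u + c2)^2 + c3
h <- function(y) asinh((c1 * y + c2) / sqrt(c3)) / c1

y1 <- 50; y2 <- 800
closed_form <- h(y2) - h(y1)
numeric     <- integrate(function(u) 1 / sqrt(v(u)),
                         lower = y1, upper = y2, rel.tol = 1e-10)$value

cat("closed form:", closed_form, " numerical:", numeric, "\n")
```

The two numbers should agree to many decimal places, which confirms that h is an antiderivative of 1/sqrt(v).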

The vst code in detail

In DESeq2, the code of vst is as follows:

vst <- function(object, blind=TRUE, nsub=1000, fitType="parametric") {
  if (nrow(object) < nsub) {
    stop("less than 'nsub' rows,
  it is recommended to use varianceStabilizingTransformation directly")
  }
  if (is.null(colnames(object))) {
    colnames(object) <- seq_len(ncol(object))
  }
  if (is.matrix(object)) {
    matrixIn <- TRUE
    object <- DESeqDataSetFromMatrix(object, DataFrame(row.names=colnames(object)), ~ 1)
  } else {
    if (blind) {
      design(object) <- ~ 1
    }
    matrixIn <- FALSE
  }
  if (is.null(sizeFactors(object)) & is.null(normalizationFactors(object))) {
    object <- estimateSizeFactors(object)
  }
  baseMean <- rowMeans(counts(object, normalized=TRUE))
  if (sum(baseMean > 5) < nsub) {
    stop("less than 'nsub' rows with mean normalized count > 5, 
  it is recommended to use varianceStabilizingTransformation directly")
  }

  # subset to a specified number of genes with mean normalized count > 5
  object.sub <- object[baseMean > 5,]
  baseMean <- baseMean[baseMean > 5]
  o <- order(baseMean)
  idx <- o[round(seq(from=1, to=length(o), length=nsub))]
  object.sub <- object.sub[idx,]

  # estimate dispersion trend
  object.sub <- estimateDispersionsGeneEst(object.sub, quiet=TRUE)
  object.sub <- estimateDispersionsFit(object.sub, fitType=fitType, quiet=TRUE)

  # assign to the full object
  suppressMessages({dispersionFunction(object) <- dispersionFunction(object.sub)})

  # calculate and apply the VST
  vsd <- varianceStabilizingTransformation(object, blind=FALSE)
  if (matrixIn) {
    return(assay(vsd))
  } else {
    return(vsd)
  }
}
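The gene-subsetting step in the function above (keep genes with mean normalized count > 5, then pick nsub of them evenly spaced along the ordered means) can be tried in isolation. The sketch below uses base R only, and the baseMean values are made up.

```r
set.seed(3)
nsub <- 1000

# Made-up mean normalized counts for 20000 genes
baseMean <- rlnorm(20000, meanlog = 3, sdlog = 2)

# Keep genes with mean > 5, then take nsub genes evenly spaced by rank
keep <- baseMean > 5
bm   <- baseMean[keep]
o    <- order(bm)
idx  <- o[round(seq(from = 1, to = length(o), length.out = nsub))]
selected <- bm[idx]

cat("genes kept:", sum(keep), " genes selected:", length(idx), "\n")
cat("range of selected means:", range(selected), "\n")
```

Selecting by rank means the subset covers the whole range of expression, from the lowest kept mean to the highest, so the dispersion trend is estimated over the full range at a fraction of the cost.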

The key line is this one:

# calculate and apply the VST
  vsd <- varianceStabilizingTransformation(object, blind=FALSE)

The main job of varianceStabilizingTransformation is to carry out the vst normalization, so let us look inside it:

varianceStabilizingTransformation <- function (object, blind=TRUE, fitType="parametric") {
  if (is.null(colnames(object))) {
    colnames(object) <- seq_len(ncol(object))
  }
  if (is.matrix(object)) {
    matrixIn <- TRUE
    object <- DESeqDataSetFromMatrix(object, DataFrame(row.names=colnames(object)), ~1)
  } else {
    matrixIn <- FALSE
  }
  if (is.null(sizeFactors(object)) & is.null(normalizationFactors(object))) {
    object <- estimateSizeFactors(object)
  }
  if (blind) {
    design(object) <- ~ 1
  }
  if (blind | is.null(attr(dispersionFunction(object),"fitType"))) {
    object <- estimateDispersionsGeneEst(object, quiet=TRUE)
    object <- estimateDispersionsFit(object, quiet=TRUE, fitType)
  }
  vsd <- getVarianceStabilizedData(object)
  if (matrixIn) {
    return(vsd)
  }
  se <- SummarizedExperiment(
    assays = vsd,
    colData = colData(object),
    rowRanges = rowRanges(object),
    metadata = metadata(object))
  DESeqTransform(se)
}

The key step here is

 vsd <- getVarianceStabilizedData(object)

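Opening getVarianceStabilizedData, we find the following core code (in the DESeq2 source this numerical-integration branch is the one used when the dispersion fit is of type "local"; the "parametric" fit uses a closed-form expression instead):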

ncounts <- counts(object, normalized=TRUE)
sf <- sizeFactors(object)
xg <- sinh( seq( asinh(0), asinh(max(ncounts)), length.out=1000 ) )[-1]
xim <- mean( 1/sf )
baseVarsAtGrid <- dispersionFunction(object)( xg ) * xg^2 + xim * xg
integrand <- 1 / sqrt( baseVarsAtGrid )
    
splf <- splinefun(asinh( ( xg[-1] + xg[-length(xg)] )/2 ),
      cumsum((xg[-1] - xg[-length(xg)]) * (integrand[-1] + integrand[-length(integrand)] )/2 )
)
    
    
h1 <- quantile( rowMeans(ncounts), .95 )
h2 <- quantile( rowMeans(ncounts), .999 )
eta <- ( log2(h2) - log2(h1) ) / ( splf(asinh(h2)) - splf(asinh(h1)))
xi <- log2(h1) - eta * splf(asinh(h1))
tc <- sapply( colnames(counts(object)), function(clm) {
eta * splf(asinh(ncounts[,clm])) + xi
    })
rownames( tc ) <- rownames( counts(object) )
# the function returns tc
tc

Here splinefun() builds an interpolating spline function, a numerical-analysis technique for obtaining approximate solutions:

#example
x <- c(1,8,14,21,28,35,42,65)
y <- c(65,30,70,150,40,0,15,0)
f <- splinefun(x, y)

z = c(2,9,16,25,39,55)
f(z)

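Here DESeq2 handles this step differently from the closed-form h(y) derived earlier: it integrates numerically. The inverse hyperbolic sine (monotonically increasing on its domain, with a growth rate that gradually slows) of the grid midpoints is the independent variable of the interpolating spline, and the running total from cumsum, a trapezoidal approximation of the integral of 1/sqrt(variance), is the dependent variable. Note: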

# first fit the interpolating spline function
xg <- sinh( seq( asinh(0), asinh(max(ncounts)), length.out=1000 ) )[-1]
xim <- mean( 1/sf )
baseVarsAtGrid <- dispersionFunction(object)( xg ) * xg^2 + xim * xg
integrand <- 1 / sqrt( baseVarsAtGrid )

splf <- splinefun(asinh( ( xg[-1] + xg[-length(xg)] )/2 ),
      cumsum((xg[-1] - xg[-length(xg)]) * (integrand[-1] + integrand[-length(integrand)] )/2 )
)
## cumsum() is the cumulative sum: each value is the sum of all values up to and including that position

# compute the scaling factor eta
eta <- ( log2(h2) - log2(h1) ) / ( splf(asinh(h2)) - splf(asinh(h1)) )

# normalize each column of ncounts
tc <- sapply( colnames(counts(object)), function(clm) {
eta * splf(asinh(ncounts[,clm])) + xi
    })
## splf(asinh(ncounts[,clm])) takes the inverse hyperbolic sine of the counts in ncounts and passes them through the fitted spline

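This step normalizes with the variance-stabilizing transformation: asinh(ncounts[,clm]) is fed to the fitted spline, where ncounts[,clm] is a column (clm) of the library-size-corrected count matrix. The result is multiplied by the scaling factor eta and shifted by xi, and the matrix returned is the vst-normalized matrix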
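The role of eta and xi can be seen in isolation: they form an affine rescaling chosen so that the transformed values coincide with log2 at two high quantiles of the gene means (h1 and h2 in the code above). The sketch below uses a stand-in function g in place of splf(asinh(.)) and made-up gene means; both are assumptions for illustration.

```r
# Stand-in for the fitted transform splf(asinh(.)); purely illustrative
g <- function(x) 2 * sqrt(x)

set.seed(11)
row_means <- rlnorm(5000, meanlog = 4, sdlog = 1.5)  # made-up gene means
h1 <- unname(quantile(row_means, 0.95))
h2 <- unname(quantile(row_means, 0.999))

# Solve eta * g(h) + xi = log2(h) at h = h1 and h = h2
eta <- (log2(h2) - log2(h1)) / (g(h2) - g(h1))
xi  <- log2(h1) - eta * g(h1)
vst_like <- function(x) eta * g(x) + xi

cat("at h1:", vst_like(h1), "vs log2:", log2(h1), "\n")
cat("at h2:", vst_like(h2), "vs log2:", log2(h2), "\n")
```

Because an affine rescaling does not change which values have equal variance, this calibration only puts the output on a scale comparable to log2 counts for highly expressed genes.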

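Because the inverse hyperbolic sine is monotonically increasing with a gradually slowing growth rate, and because a gene with higher expression also tends to have a larger variance across samples, fitting on the asinh scale reduces the across-sample variance of highly expressed genes, which further narrows the differences in variance between genes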
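The whole numerical-integration scheme can be reproduced outside DESeq2 to see the stabilization happen. The sketch below follows the same grid, trapezoid and spline steps as the excerpt above, but with a hypothetical dispersion function disp_fun, simulated negative binomial counts, and all size factors set to 1; the final eta and xi rescaling is left out because it does not affect the comparison.

```r
set.seed(1)

# Hypothetical dispersion trend (assumption): alpha(mu) = 0.1 + 2 / mu
disp_fun <- function(mu) 0.1 + 2 / mu

# One simulated gene per mean level, n samples each (rows = genes)
means <- c(50, 500, 5000)
n <- 2000
ncounts <- t(sapply(means, function(m) rnbinom(n, mu = m, size = 1 / disp_fun(m))))
sf <- rep(1, n)   # all size factors equal to 1

# Grid that is evenly spaced on the asinh scale, as in the excerpt above
xg  <- sinh(seq(asinh(0), asinh(max(ncounts)), length.out = 1000))[-1]
xim <- mean(1 / sf)
base_vars <- disp_fun(xg) * xg^2 + xim * xg
integrand <- 1 / sqrt(base_vars)

# Trapezoidal running integral of 1/sqrt(variance), interpolated by a spline
mid  <- (xg[-1] + xg[-length(xg)]) / 2
area <- cumsum((xg[-1] - xg[-length(xg)]) *
               (integrand[-1] + integrand[-length(integrand)]) / 2)
splf <- splinefun(asinh(mid), area)

sd_raw <- apply(ncounts, 1, sd)
sd_vst <- apply(ncounts, 1, function(x) sd(splf(asinh(x))))
print(round(rbind(means, sd_raw, sd_vst), 2))
```

The raw standard deviations differ by well over an order of magnitude across the three mean levels, while the transformed ones should all be close to 1.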
