seurat单细胞转录组整合分析教程-02

接前面的流程,我们使用FindIntegrationAnchors()函数识别锚点,该函数将Seurat对象列表作为输入,并使用IntegratedData()函数,用这些锚点将两个数据集集成在一起。

ifnb.list
# $CTRL
# An object of class Seurat 
# 14053 features across 6548 samples within 1 assay 
# Active assay: RNA (14053 features, 2000 variable features)
# 
# $STIM
# An object of class Seurat 
# 14053 features across 7451 samples within 1 assay 
# Active assay: RNA (14053 features, 2000 variable features)

immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)

# this command creates an 'integrated' data assay 
immune.combined <- IntegrateData(anchorset = immune.anchors)

先跟着FindIntegrationAnchors()函数过一遍代码,看看如何实现的。
step1:传参:seurat对象和高可变基因传入object.list和anchor.features参数,其余均为默认参数。看下seurat官网对这个函数参数的解释。

object.list = ifnb.list 
anchor.features = features 
assay = NULL 
reference = NULL 
scale = TRUE 
normalization.method = "LogNormalize" 
sct.clip.range = NULL 
reduction = "cca" 
l2.norm = TRUE 
dims = 1:30 
k.anchor = 5 
k.filter = 200 
k.score = 30 
max.features = 200 
nn.method = "annoy" 
n.trees = 50 
eps = 0 
verbose = TRUE

step2:并行运算:FindIntegrationAnchors()函数是可以做并行运算的,先暂时忽略此步。详见https://satijalab.org/seurat/articles/future_vignette.html。

my.lapply <- ifelse( 
  test = verbose && future::nbrOfWorkers() == 1, 
  yes = pbapply::pblapply, 
  no = future.apply::future_lapply 
)

step3:数据校验:查看每个样本的细胞数目,默认是30维(dims = 1:30),每个样本的细胞数目要大于dims值。另外,确保每个样本seurat对象的Integration插槽为空,样本的barcode信息不重复。

object.ncells <- sapply(X = object.list, FUN = function(x) dim(x = x)[2]) 
object.ncells 
# CTRL STIM  
# 6548 7451  
assay <- sapply(X = object.list, FUN = DefaultAssay) 
assay 
# CTRL  STIM  
# "RNA" "RNA"  
# check tool 
object.list <- lapply( 
  X = object.list, 
  FUN = function (obj) { 
    slot(object = obj, name = "tools")$Integration <- NULL 
    return(obj) 
  }) 
object.list <- CheckDuplicateCellNames(object.list = object.list)

step4:基因表达值缩放:对每个样本seurat对象的高可变基因进行scale缩放;

slot <- "data" 
if (scale) { 
  if (verbose) { 
    message("Scaling features for provided objects") 
  } 
  object.list <- my.lapply( 
    X = object.list, 
    FUN = function(object) { 
      ScaleData(object = object, features = anchor.features, verbose = FALSE) 
    } 
  ) 
}
# 查看scale缩放后的矩阵 
head(object.list[['CTRL']]@[email protected][anchor.features,1:5])


step5:采用cca降维方法,计算两个样本有几种组合的方式combinations,expand.grid()函数返回几个元素所有可能的组合,2个样本有1种组合。确定锚点索引的偏移量,当前样本的偏移量为前面样本细胞数目的累加值。
比如有两个样本的细胞数为6548,7451 ,它们的偏移量为前面样本数目的累加值 0,6548。第一个样本的偏移量为0。

nn.reduction <- reduction 
# determine pairwise combinations 
combinations <- expand.grid(1:length(x = object.list), 1:length(x = object.list)) 
combinations <- combinations[combinations$Var1 < combinations$Var2, , drop = FALSE] 
combinations  
#    Var1 Var2 
# 3    1    2 
# determine the proper offsets for indexing anchors 
objects.ncell <- sapply(X = object.list, FUN = ncol) 
objects.ncell 
# CTRL STIM  
# 6548 7451  
offsets <- as.vector(x = cumsum(x = c(0, objects.ncell)))[1:length(x = object.list)] 
offsets 
# [1]    0 6548

step6:因为无reference,寻找成对的锚点。对每个样本间组合调用核心的锚点函数anchoring.fxn。anchoring.fxn函数较复杂,又会调用RunCCA和FindAnchors等函数。


我们的样本就一组样本组合(1-2),通过这个实操逐步分解代码,进入到anchoring.fxn函数。
step6-1:object.1和object.2分别为两个样本只包含高可变基因的seurat对象,包含RNA插槽的数据。重新生成ToIntegrate插槽,包含归一化和缩放的数据。

object.1 <- DietSeurat(
  object = object.list[[i]],
  assays = assay[i],
  features = anchor.features,
  counts = FALSE,
  scale.data = TRUE,
  dimreducs = reduction
)
object.2 <- DietSeurat(
  object = object.list[[j]],
  assays = assay[j],
  features = anchor.features,
  counts = FALSE,
  scale.data = TRUE,
  dimreducs = reduction
)
head(object.1@assays$RNA@data[1:5,1:5])
head(object.1@[email protected][1:5,1:5])

# suppress key duplication warning
suppressWarnings(object.1[["ToIntegrate"]] <- object.1[[assay[i]]])
DefaultAssay(object = object.1) <- "ToIntegrate"
object.1 <- DietSeurat(object = object.1, assays = "ToIntegrate", scale.data = TRUE, dimreducs = reduction)
suppressWarnings(object.2[["ToIntegrate"]] <- object.2[[assay[j]]])
DefaultAssay(object = object.2) <- "ToIntegrate"
object.2 <- DietSeurat(object = object.2, assays = "ToIntegrate", scale.data = TRUE, dimreducs = reduction)

head(object.1@assays$ToIntegrate@data[1:5,1:5])
head(object.1@[email protected][1:5,1:5])

step6-2:针对cca,pca,lsi三种降维方式,查找配对的锚点,此步骤调用RunCCA,将前面生成的object.1,object.2传入该函数,我们只关注cca降维。

object.pair <- switch( 
  EXPR = reduction,
  'cca' = {
    object.pair <- RunCCA(
      object1 = object.1,
      object2 = object.2,
      assay1 = "ToIntegrate",
      assay2 = "ToIntegrate",
      features = anchor.features,
      num.cc = max(dims),
      renormalize = FALSE,
      rescale = FALSE,
      verbose = verbose
    )
    if (l2.norm){
      object.pair <- L2Dim(object = object.pair, reduction = reduction)
      reduction <- paste0(reduction, ".l2")
      nn.reduction <- reduction
    }
    reduction.2 <- character()
    object.pair
  },
  'pca' = {... },
  'lsi' = {... },
  stop("Invalid reduction parameter. Please choose either cca, rpca, or rlsi")
)

step6-3:进入到RunCCA函数,由于RunCCA函数为S3泛型函数,object.1和object.2均为seurat对象,先进入RunCCA.Seurat函数。
RunCCA.Seurat传入的参数

object1 = object.1
object2 = object.2
assay1 = "ToIntegrate"
assay2 = "ToIntegrate"
features = anchor.features
num.cc = max(dims),
renormalize = FALSE
rescale = FALSE
compute.gene.loadings = TRUE
add.cell.id1 = NULL
add.cell.id2 = NULL
rescale = FALSE
verbose = TRUE

RunCCA.Seurat代码比较好理解,从object.1和object.2提取高可变基因的缩放后的表达矩阵,生成data1和data2。此处有个细节,CheckFeatures函数使高可变基因由2000个,变成了1866个。

# ToIntegrate 
assay1 <- assay1 %||% DefaultAssay(object = object1) 
assay2 <- assay2 %||% DefaultAssay(object = object2) 
nfeatures <- length(x = features) 
if (!(rescale)) { 
  data.use1 <- GetAssayData(object = object1, assay = assay1, slot = "scale.data")
  data.use2 <- GetAssayData(object = object2, assay = assay2, slot = "scale.data")
  features <- CheckFeatures(data.use = data.use1, features = features, object.name = "object1", verbose = FALSE)
  length(features)
  # [1] 1938
  features <- CheckFeatures(data.use = data.use2, features = features, object.name = "object2", verbose = FALSE)
  length(features)
  # [1] 1866
  data1 <- data.use1[features, ]
  data2 <- data.use2[features, ]
}
cca.results <- RunCCA( 
  object1 = data1, 
  object2 = data2, 
  standardize = TRUE, 
  num.cc = num.cc, 
  verbose = verbose, 
)

问题1:CheckFeatures函数做了什么操作,为什么要处理?
对于一个高可变基因缩放后的表达矩阵(基因数为2000),在两个样本,分别按列计算每个基因在所有细胞的方差,去除方差为0的基因,去除后剩下1866个。这些基因的scale.data数值也为零值。这些在某个样本均为零值的基因会影响下游找锚点的步骤。

features.var <- RowVar(x = data.use[features, ])
no.var.features <- features[features.var == 0]
features <- setdiff(x = features, y = no.var.features)
features <- features[!is.na(x = features)]

/* Calculates the variance of rows of a matrix */
// [[Rcpp::export(rng = false)]]
NumericVector RowVar(Eigen::Map x){
  NumericVector out(x.rows());
  for(int i=0; i < x.rows(); ++i){
    Eigen::ArrayXd r = x.row(i).array();
    double rowMean = r.mean();
    out[i] = (r - rowMean).square().sum() / (x.cols() - 1);
  }
  return out;
}

进入到RunCCA.default函数,查看该函数,cells1和cells2为每个样本的细胞barcode信息,standardize步骤是针对每个样本高可变基因的缩放数据默认进行standardize标准化。
问题2:standardize标准化具体如何执行的,为何要做此操作?
standardize的源码位置详见链接,通过Rcpp调用C++代码,加速计算运行。

# 传入的参数
object1 = data1
object2 = data2
standardize = TRUE
num.cc = 30
verbose = TRUE
# 主要代码
set.seed(seed = 42)
cells1 <- colnames(x = object1)
cells2 <- colnames(x = object2)
if (standardize) {
  object1 <- Standardize(mat = object1, display_progress = FALSE)
  object2 <- Standardize(mat = object2, display_progress = FALSE)
}

standardize的源码

/* Performs column scaling and/or centering. Equivalent to using scale(mat, TRUE, apply(x,2,sd)) in R.
 Note: Doesn't handle NA/NaNs in the same way the R implementation does, */

// [[Rcpp::export(rng = false)]]
NumericMatrix Standardize(Eigen::Map mat, bool display_progress = true){
  Progress p(mat.cols(), display_progress);
  NumericMatrix std_mat(mat.rows(), mat.cols());
  for(int i=0; i < mat.cols(); ++i){
    p.increment();
    Eigen::ArrayXd r = mat.col(i).array();
    double colMean = r.mean();
    double colSdev = sqrt((r - colMean).square().sum() / (mat.rows() - 1));
    NumericMatrix::Column new_col = std_mat(_, i);
    for(int j=0; j < new_col.size(); j++) {
      new_col[j] = (r[j] - colMean) / colSdev;
    }
  }
  return std_mat;
}

怎么理解这个standardize的函数呢?Standardize函数是等价scale(object, TRUE, apply(object, 2, sd))的,对矩阵的每列(每个细胞)的高可变基因求标准差sd,再scale缩放。
standardize函数的缩放和ScaleData函数有何不同?
ScaleData函数是使每个基因在所有细胞的表达量均值为 0,使每个基因在所有细胞中的表达量方差为 1,针对的是单个基因在所有细胞的缩放;
standardize函数的缩放针对的是单个细胞所有高可变基因集的缩放。表达量方差此处不是1,是 每列求得的sd,apply(object, 2, sd);

testthat::expect_equal(Standardize(object1, display_progress = FALSE), scale(object1, TRUE, apply(object1, 2, sd)),check.attributes = FALSE)
dim(scale(object1, TRUE, apply(object1, 2, sd))) 
# [1] 1866 6548 
length(apply(object1, 2, sd)) 
# [1] 6548
data1[1:5, 1:5]
object1[1:5, 1:5]


重新回到RunCCA.default程序上看后面的运行,
1)crossprod是计算两个矩阵object1,object2的内积,生成mat3 ,crossprod(a,b)等价于t(a)%*%b,但计算速度更快;
2)调用irlba包,针对大型密集和稀疏矩阵的快速截断奇异值分解和主成分分析的R包。奇异值分解(SVD)的作用,就是提取一个较复杂矩阵中的关键部分*,然后用一个简单的矩阵代表其关键部分,以达到简化的目的。在遇到维度灾难的时候,数据处理常用的降维方法是SVD(奇异值分解)和PCA(主成分分析)。
3)cca.svd$u是第一个样本的特征矩阵,cca.svd$v是第二个样本的特征矩阵;

U, V, S are the matrices corresponding to the estimated left and right singular vectors, and diagonal matrix of estimated singular values, respectively.

通过前面这么多步骤,终于将矩阵2000*Ncells降到30*Ncells,实现了2000维降至30维。

顺便,我们也理解下CCA(Cannonical Correlation Analysis)的定义,Seurat每次只组合一对数据集(两个样本),为每个多维数据集降维到一个较低维的子空间。

mat3 <- crossprod(x = object1, y = object2) 
cca.svd <- irlba(A = mat3, nv = num.cc) 
cca.data <- rbind(cca.svd$u, cca.svd$v) 
colnames(x = cca.data) <- paste0("CC", 1:num.cc) 
rownames(cca.data) <- c(cells1, cells2) 
cca.data <- apply( 
  X = cca.data, 
  MARGIN = 2, 
  FUN = function(x) { 
    if (sign(x[1]) == -1) { 
      x <- x * -1 
    } 
    return(x) 
  } 
)
cca.svd$d
# [1] 651320.30 102499.22  94495.34  84746.89  52127.27  49764.53  41900.86
# [8]  38378.88  35422.20  34252.97  32273.63  25540.07  25339.30  23556.11
# [15]  22299.70  22187.58  20179.09  19572.33  18928.07  17757.20  16859.69
# [22]  16709.28  16388.62  16282.70  15720.46  15625.57  15478.67  15203.39
# [29]  14970.87  14797.46
return(list(ccv = cca.data, d = cca.svd$d))


step7:运行完上面的RunCCA.default函数,回到RunCCA.Seurat函数,将上面得到cca矩阵赋值给cca.results。
通过cca.results的整合结果构造了一个combined.object的seurat对象。默认compute.gene.loadings为true,这个操作会调用ProjectDim函数,大致意思是计算每个基因在CC_1到CC_30的权重,方便查看30个维度主要是哪些基因占主要权重。

cca.results <- list(ccv = cca.data, d = cca.svd$d)
combined.object <- merge(
  x = object1,
  y = object2,
  merge.data = TRUE
)
dim(combined.object)
# [1]  2000 13999
rownames(x = cca.results$ccv) <- Cells(x = combined.object)
colnames(x = data1) <- Cells(x = combined.object)[1:ncol(x = data1)]
colnames(x = data2) <- Cells(x = combined.object)[(ncol(x = data1) + 1):length(x = Cells(x = combined.object))]
combined.object[['cca']] <- CreateDimReducObject(
  embeddings = cca.results$ccv[colnames(combined.object), ],
  assay = assay1,
  key = "CC_"
)
combined.object[['cca']]@assay.used <- DefaultAssay(combined.object)
combined.scale <- cbind(data1,data2)
combined.object <- SetAssayData(object = combined.object,new.data = combined.scale, slot = "scale.data")

if (compute.gene.loadings) {
  combined.object <- ProjectDim(
    object = combined.object,
    reduction = "cca",
    verbose = FALSE,
    overwrite = TRUE)
}
combined.object
# An object of class Seurat 
# 2000 features across 13999 samples within 1 assay 
# Active assay: ToIntegrate (2000 features, 0 variable features)
# 1 dimensional reduction calculated: cca

new.feature.loadings.full <- data.use %*% cell.embeddings
head(new.feature.loadings.full)


step8:切回到anchoring.fxn函数,将上面得到combined.object赋值给object.pair,另外,默认情况下,是需要对降维的矩阵进行l2.norm归一化处理的。

object.pair <- list(ccv = cca.data, d = cca.svd$d) 
if (l2.norm){ 
  object.pair <- L2Dim(object = object.pair, reduction = reduction) 
  reduction <- paste0(reduction, ".l2") 
  nn.reduction <- reduction 
} 
reduction.2 <- character()

问题3:为什么需要对典型相关向量进行 L2-normalization?

Canonical correlation vectors (CCV) project the two datasets into a correlated low-dimensional space, but global differences in scale (for example, differences in normalization between datasets) can still preclude comparing CCV across datasets. To address this, we perform L2-normalization of the cell embeddings.
We perform canonical correlation analysis, followed by L2 normalization of the canonical correlation vectors, to project the datasets into a subspace defined by shared correlation structure across datasets.

典型相关向量 (CCV) 将两个数据集投影到相关的低维空间中,但全局scale差异(例如,数据集之间的归一化差异)仍然可以进行跨数据集的 CCV 比较。为了解决这个问题,我们对细胞嵌入进行 L2 归一化。我们执行典型相关分析,然后对典型相关向量进行 L2 归一化,以将数据集投影到由跨数据集的共享相关结构定义的子空间中。
当批次效应与细胞状态之间的生物学差异具有相似的模式时,为了克服这个问题,我们首先使用对角化 CCA 联合降低两个数据集的维数,然后将 L2 归一化应用于典型相关向量。

step9:最后调用FindAnchors函数,该函数也较复杂,继续调用其他函数。关于此函数,后续再写。
到这里,整个FindIntegrationAnchors函数运行完毕。

anchors <- FindAnchors( 
  object.pair = object.pair, 
  assay = c("ToIntegrate", "ToIntegrate"), 
  slot = slot, 
  cells1 = colnames(x = object.1), 
  cells2 = colnames(x = object.2), 
  internal.neighbors = internal.neighbors, 
  reduction = reduction, 
  reduction.2 = reduction.2, 
  nn.reduction = nn.reduction, 
  dims = dims, 
  k.anchor = k.anchor, 
  k.filter = k.filter, 
  k.score = k.score, 
  max.features = max.features, 
  nn.method = nn.method, 
  n.trees = n.trees, 
  eps = eps, 
  verbose = verbose 
) 
anchors[, 1] <- anchors[, 1] + offsets[i] 
anchors[, 2] <- anchors[, 2] + offsets[j]
return(anchors)

相关资料:
1.https://github.com/satijalab/seurat
2.https://www.cell.com/cell/pdf/S0092-8674(19)30559-8.pdf

你可能感兴趣的:(seurat单细胞转录组整合分析教程-02)