seurat单细胞转录组整合分析教程-02

接前面的流程，我们使用FindIntegrationAnchors()函数识别锚点，该函数将Seurat对象列表作为输入，并使用IntegratedData()函数，用这些锚点将两个数据集集成在一起。

ifnb.list
# $CTRL
# An object of class Seurat 
# 14053 features across 6548 samples within 1 assay 
# Active assay: RNA (14053 features, 2000 variable features)
# 
# $STIM
# An object of class Seurat 
# 14053 features across 7451 samples within 1 assay 
# Active assay: RNA (14053 features, 2000 variable features)

immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)

# this command creates an 'integrated' data assay 
immune.combined <- IntegrateData(anchorset = immune.anchors)

先跟着FindIntegrationAnchors()函数过一遍代码，看看如何实现的。
step1：传参：seurat对象和高可变基因传入object.list和anchor.features参数，其余均为默认参数。看下seurat官网对这个函数参数的解释。

object.list = ifnb.list 
anchor.features = features 
assay = NULL 
reference = NULL 
scale = TRUE 
normalization.method = "LogNormalize" 
sct.clip.range = NULL 
reduction = "cca" 
l2.norm = TRUE 
dims = 1:30 
k.anchor = 5 
k.filter = 200 
k.score = 30 
max.features = 200 
nn.method = "annoy" 
n.trees = 50 
eps = 0 
verbose = TRUE

step2：并行运算：FindIntegrationAnchors()函数是可以做并行运算的，先暂时忽略此步。详见https://satijalab.org/seurat/articles/future_vignette.html。

my.lapply <- ifelse( 
  test = verbose && future::nbrOfWorkers() == 1, 
  yes = pbapply::pblapply, 
  no = future.apply::future_lapply 
)

step3：数据校验：查看每个样本的细胞数目，默认是30维（dims = 1:30），每个样本的细胞数目要大于dims值。另外，确保每个样本seurat对象的Integration插槽为空，样本的barcode信息不重复。

object.ncells <- sapply(X = object.list, FUN = function(x) dim(x = x)[2]) 
object.ncells 
# CTRL STIM  
# 6548 7451  
assay <- sapply(X = object.list, FUN = DefaultAssay) 
assay 
# CTRL  STIM  
# "RNA" "RNA"  
# check tool 
object.list <- lapply( 
  X = object.list, 
  FUN = function (obj) { 
    slot(object = obj, name = "tools")$Integration <- NULL 
    return(obj) 
  }) 
object.list <- CheckDuplicateCellNames(object.list = object.list)

step4：基因表达值缩放：对每个样本seurat对象的高可变基因进行scale缩放；

slot <- "data" 
if (scale) { 
  if (verbose) { 
    message("Scaling features for provided objects") 
  } 
  object.list <- my.lapply( 
    X = object.list, 
    FUN = function(object) { 
      ScaleData(object = object, features = anchor.features, verbose = FALSE) 
    } 
  ) 
}
# 查看scale缩放后的矩阵 
head(object.list[['CTRL']]@[email protected][anchor.features,1:5])

step5：采用cca降维方法，计算两个样本有几种组合的方式combinations，expand.grid()函数返回几个元素所有可能的组合，2个样本有1种组合。确定锚点索引的偏移量，当前样本的偏移量为前面样本细胞数目的累加值。
比如有两个样本的细胞数为6548，7451 ，它们的偏移量为前面样本数目的累加值 0,6548。第一个样本的偏移量为0。

nn.reduction <- reduction 
# determine pairwise combinations 
combinations <- expand.grid(1:length(x = object.list), 1:length(x = object.list)) 
combinations <- combinations[combinations$Var1 < combinations$Var2, , drop = FALSE] 
combinations  
#    Var1 Var2 
# 3    1    2 
# determine the proper offsets for indexing anchors 
objects.ncell <- sapply(X = object.list, FUN = ncol) 
objects.ncell 
# CTRL STIM  
# 6548 7451  
offsets <- as.vector(x = cumsum(x = c(0, objects.ncell)))[1:length(x = object.list)] 
offsets 
# [1]    0 6548

step6：因为无reference，寻找成对的锚点。对每个样本间组合调用核心的锚点函数anchoring.fxn。anchoring.fxn函数较复杂，又会调用RunCCA和FindAnchors等函数。

我们的样本就一组样本组合（1-2），通过这个实操逐步分解代码，进入到anchoring.fxn函数。
step6-1：object.1和object.2分别为两个样本只包含高可变基因的seurat对象，包含RNA插槽的数据。重新生成ToIntegrate插槽，包含归一化和缩放的数据。

object.1 <- DietSeurat(
  object = object.list[[i]],
  assays = assay[i],
  features = anchor.features,
  counts = FALSE,
  scale.data = TRUE,
  dimreducs = reduction
)
object.2 <- DietSeurat(
  object = object.list[[j]],
  assays = assay[j],
  features = anchor.features,
  counts = FALSE,
  scale.data = TRUE,
  dimreducs = reduction
)
head(object.1@assays$RNA@data[1:5,1:5])
head(object.1@[email protected][1:5,1:5])

# suppress key duplication warning
suppressWarnings(object.1[["ToIntegrate"]] <- object.1[[assay[i]]])
DefaultAssay(object = object.1) <- "ToIntegrate"
object.1 <- DietSeurat(object = object.1, assays = "ToIntegrate", scale.data = TRUE, dimreducs = reduction)
suppressWarnings(object.2[["ToIntegrate"]] <- object.2[[assay[j]]])
DefaultAssay(object = object.2) <- "ToIntegrate"
object.2 <- DietSeurat(object = object.2, assays = "ToIntegrate", scale.data = TRUE, dimreducs = reduction)

head(object.1@assays$ToIntegrate@data[1:5,1:5])
head(object.1@[email protected][1:5,1:5])

step6-2：针对cca，pca，lsi三种降维方式，查找配对的锚点，此步骤调用RunCCA，将前面生成的object.1，object.2传入该函数，我们只关注cca降维。

object.pair <- switch( 
  EXPR = reduction,
  'cca' = {
    object.pair <- RunCCA(
      object1 = object.1,
      object2 = object.2,
      assay1 = "ToIntegrate",
      assay2 = "ToIntegrate",
      features = anchor.features,
      num.cc = max(dims),
      renormalize = FALSE,
      rescale = FALSE,
      verbose = verbose
    )
    if (l2.norm){
      object.pair <- L2Dim(object = object.pair, reduction = reduction)
      reduction <- paste0(reduction, ".l2")
      nn.reduction <- reduction
    }
    reduction.2 <- character()
    object.pair
  },
  'pca' = {... },
  'lsi' = {... },
  stop("Invalid reduction parameter. Please choose either cca, rpca, or rlsi")
)

step6-3：进入到RunCCA函数，由于RunCCA函数为S3泛型函数，object.1和object.2均为seurat对象，先进入RunCCA.Seurat函数。
RunCCA.Seurat传入的参数

object1 = object.1
object2 = object.2
assay1 = "ToIntegrate"
assay2 = "ToIntegrate"
features = anchor.features
num.cc = max(dims),
renormalize = FALSE
rescale = FALSE
compute.gene.loadings = TRUE
add.cell.id1 = NULL
add.cell.id2 = NULL
rescale = FALSE
verbose = TRUE

RunCCA.Seurat代码比较好理解，从object.1和object.2提取高可变基因的缩放后的表达矩阵，生成data1和data2。此处有个细节，CheckFeatures函数使高可变基因由2000个，变成了1866个。

# ToIntegrate 
assay1 <- assay1 %||% DefaultAssay(object = object1) 
assay2 <- assay2 %||% DefaultAssay(object = object2) 
nfeatures <- length(x = features) 
if (!(rescale)) { 
  data.use1 <- GetAssayData(object = object1, assay = assay1, slot = "scale.data")
  data.use2 <- GetAssayData(object = object2, assay = assay2, slot = "scale.data")
  features <- CheckFeatures(data.use = data.use1, features = features, object.name = "object1", verbose = FALSE)
  length(features)
  # [1] 1938
  features <- CheckFeatures(data.use = data.use2, features = features, object.name = "object2", verbose = FALSE)
  length(features)
  # [1] 1866
  data1 <- data.use1[features, ]
  data2 <- data.use2[features, ]
}
cca.results <- RunCCA( 
  object1 = data1, 
  object2 = data2, 
  standardize = TRUE, 
  num.cc = num.cc, 
  verbose = verbose, 
)

问题1：CheckFeatures函数做了什么操作，为什么要处理？
对于一个高可变基因缩放后的表达矩阵（基因数为2000），在两个样本，分别按列计算每个基因在所有细胞的方差，去除方差为0的基因，去除后剩下1866个。这些基因的scale.data数值也为零值。这些在某个样本均为零值的基因会影响下游找锚点的步骤。

features.var <- RowVar(x = data.use[features, ])
no.var.features <- features[features.var == 0]
features <- setdiff(x = features, y = no.var.features)
features <- features[!is.na(x = features)]

/* Calculates the variance of rows of a matrix */
// [[Rcpp::export(rng = false)]]
NumericVector RowVar(Eigen::Map x){
  NumericVector out(x.rows());
  for(int i=0; i < x.rows(); ++i){
    Eigen::ArrayXd r = x.row(i).array();
    double rowMean = r.mean();
    out[i] = (r - rowMean).square().sum() / (x.cols() - 1);
  }
  return out;
}

进入到RunCCA.default函数，查看该函数，cells1和cells2为每个样本的细胞barcode信息，standardize步骤是针对每个样本高可变基因的缩放数据默认进行standardize标准化。
问题2：standardize标准化具体如何执行的，为何要做此操作？
standardize的源码位置详见链接，通过Rcpp调用C++代码，加速计算运行。

# 传入的参数
object1 = data1
object2 = data2
standardize = TRUE
num.cc = 30
verbose = TRUE
# 主要代码
set.seed(seed = 42)
cells1 <- colnames(x = object1)
cells2 <- colnames(x = object2)
if (standardize) {
  object1 <- Standardize(mat = object1, display_progress = FALSE)
  object2 <- Standardize(mat = object2, display_progress = FALSE)
}

standardize的源码

/* Performs column scaling and/or centering. Equivalent to using scale(mat, TRUE, apply(x,2,sd)) in R.
 Note: Doesn't handle NA/NaNs in the same way the R implementation does, */

// [[Rcpp::export(rng = false)]]
NumericMatrix Standardize(Eigen::Map mat, bool display_progress = true){
  Progress p(mat.cols(), display_progress);
  NumericMatrix std_mat(mat.rows(), mat.cols());
  for(int i=0; i < mat.cols(); ++i){
    p.increment();
    Eigen::ArrayXd r = mat.col(i).array();
    double colMean = r.mean();
    double colSdev = sqrt((r - colMean).square().sum() / (mat.rows() - 1));
    NumericMatrix::Column new_col = std_mat(_, i);
    for(int j=0; j < new_col.size(); j++) {
      new_col[j] = (r[j] - colMean) / colSdev;
    }
  }
  return std_mat;
}

怎么理解这个standardize的函数呢？Standardize函数是等价scale(object, TRUE, apply(object, 2, sd))的，对矩阵的每列（每个细胞）的高可变基因求标准差sd，再scale缩放。
standardize函数的缩放和ScaleData函数有何不同？
ScaleData函数是使每个基因在所有细胞的表达量均值为 0，使每个基因在所有细胞中的表达量方差为 1，针对的是单个基因在所有细胞的缩放；
standardize函数的缩放针对的是单个细胞所有高可变基因集的缩放。表达量方差此处不是1，是每列求得的sd，apply(object, 2, sd)；

testthat::expect_equal(Standardize(object1, display_progress = FALSE), scale(object1, TRUE, apply(object1, 2, sd)),check.attributes = FALSE)
dim(scale(object1, TRUE, apply(object1, 2, sd))) 
# [1] 1866 6548 
length(apply(object1, 2, sd)) 
# [1] 6548
data1[1:5, 1:5]
object1[1:5, 1:5]

重新回到RunCCA.default程序上看后面的运行，
1）crossprod是计算两个矩阵object1，object2的内积，生成mat3 ，crossprod(a,b)等价于t(a)%*%b，但计算速度更快；
2）调用irlba包，针对大型密集和稀疏矩阵的快速截断奇异值分解和主成分分析的R包。奇异值分解(SVD）的作用，就是提取一个较复杂矩阵中的关键部分*，然后用一个简单的矩阵代表其关键部分，以达到简化的目的。在遇到维度灾难的时候，数据处理常用的降维方法是SVD(奇异值分解)和PCA(主成分分析)。
3）cca.svd$u是第一个样本的特征矩阵，cca.svd$v是第二个样本的特征矩阵；

U, V, S are the matrices corresponding to the estimated left and right singular vectors, and diagonal matrix of estimated singular values, respectively.

通过前面这么多步骤，终于将矩阵2000*Ncells降到30*Ncells，实现了2000维降至30维。

顺便，我们也理解下CCA（Cannonical Correlation Analysis）的定义，Seurat每次只组合一对数据集（两个样本），为每个多维数据集降维到一个较低维的子空间。

mat3 <- crossprod(x = object1, y = object2) 
cca.svd <- irlba(A = mat3, nv = num.cc) 
cca.data <- rbind(cca.svd$u, cca.svd$v) 
colnames(x = cca.data) <- paste0("CC", 1:num.cc) 
rownames(cca.data) <- c(cells1, cells2) 
cca.data <- apply( 
  X = cca.data, 
  MARGIN = 2, 
  FUN = function(x) { 
    if (sign(x[1]) == -1) { 
      x <- x * -1 
    } 
    return(x) 
  } 
)
cca.svd$d
# [1] 651320.30 102499.22  94495.34  84746.89  52127.27  49764.53  41900.86
# [8]  38378.88  35422.20  34252.97  32273.63  25540.07  25339.30  23556.11
# [15]  22299.70  22187.58  20179.09  19572.33  18928.07  17757.20  16859.69
# [22]  16709.28  16388.62  16282.70  15720.46  15625.57  15478.67  15203.39
# [29]  14970.87  14797.46
return(list(ccv = cca.data, d = cca.svd$d))

step7：运行完上面的RunCCA.default函数，回到RunCCA.Seurat函数，将上面得到cca矩阵赋值给cca.results。
通过cca.results的整合结果构造了一个combined.object的seurat对象。默认compute.gene.loadings为true，这个操作会调用ProjectDim函数，大致意思是计算每个基因在CC_1到CC_30的权重，方便查看30个维度主要是哪些基因占主要权重。

cca.results <- list(ccv = cca.data, d = cca.svd$d)
combined.object <- merge(
  x = object1,
  y = object2,
  merge.data = TRUE
)
dim(combined.object)
# [1]  2000 13999
rownames(x = cca.results$ccv) <- Cells(x = combined.object)
colnames(x = data1) <- Cells(x = combined.object)[1:ncol(x = data1)]
colnames(x = data2) <- Cells(x = combined.object)[(ncol(x = data1) + 1):length(x = Cells(x = combined.object))]
combined.object[['cca']] <- CreateDimReducObject(
  embeddings = cca.results$ccv[colnames(combined.object), ],
  assay = assay1,
  key = "CC_"
)
combined.object[['cca']]@assay.used <- DefaultAssay(combined.object)
combined.scale <- cbind(data1,data2)
combined.object <- SetAssayData(object = combined.object,new.data = combined.scale, slot = "scale.data")

if (compute.gene.loadings) {
  combined.object <- ProjectDim(
    object = combined.object,
    reduction = "cca",
    verbose = FALSE,
    overwrite = TRUE)
}
combined.object
# An object of class Seurat 
# 2000 features across 13999 samples within 1 assay 
# Active assay: ToIntegrate (2000 features, 0 variable features)
# 1 dimensional reduction calculated: cca

new.feature.loadings.full <- data.use %*% cell.embeddings
head(new.feature.loadings.full)

step8：切回到anchoring.fxn函数，将上面得到combined.object赋值给object.pair，另外，默认情况下，是需要对降维的矩阵进行l2.norm归一化处理的。

object.pair <- list(ccv = cca.data, d = cca.svd$d) 
if (l2.norm){ 
  object.pair <- L2Dim(object = object.pair, reduction = reduction) 
  reduction <- paste0(reduction, ".l2") 
  nn.reduction <- reduction 
} 
reduction.2 <- character()

问题3：为什么需要对典型相关向量进行 L2-normalization？

Canonical correlation vectors (CCV) project the two datasets into a correlated low-dimensional space, but global differences in scale (for example, differences in normalization between datasets) can still preclude comparing CCV across datasets. To address this, we perform L2-normalization of the cell embeddings.
We perform canonical correlation analysis, followed by L2 normalization of the canonical correlation vectors, to project the datasets into a subspace defined by shared correlation structure across datasets.

典型相关向量 (CCV) 将两个数据集投影到相关的低维空间中，但全局scale差异（例如，数据集之间的归一化差异）仍然可以进行跨数据集的 CCV 比较。为了解决这个问题，我们对细胞嵌入进行 L2 归一化。我们执行典型相关分析，然后对典型相关向量进行 L2 归一化，以将数据集投影到由跨数据集的共享相关结构定义的子空间中。
当批次效应与细胞状态之间的生物学差异具有相似的模式时，为了克服这个问题，我们首先使用对角化 CCA 联合降低两个数据集的维数，然后将 L2 归一化应用于典型相关向量。

step9：最后调用FindAnchors函数，该函数也较复杂，继续调用其他函数。关于此函数，后续再写。
到这里，整个FindIntegrationAnchors函数运行完毕。

anchors <- FindAnchors( 
  object.pair = object.pair, 
  assay = c("ToIntegrate", "ToIntegrate"), 
  slot = slot, 
  cells1 = colnames(x = object.1), 
  cells2 = colnames(x = object.2), 
  internal.neighbors = internal.neighbors, 
  reduction = reduction, 
  reduction.2 = reduction.2, 
  nn.reduction = nn.reduction, 
  dims = dims, 
  k.anchor = k.anchor, 
  k.filter = k.filter, 
  k.score = k.score, 
  max.features = max.features, 
  nn.method = nn.method, 
  n.trees = n.trees, 
  eps = eps, 
  verbose = verbose 
) 
anchors[, 1] <- anchors[, 1] + offsets[i] 
anchors[, 2] <- anchors[, 2] + offsets[j]
return(anchors)

相关资料：
1.https://github.com/satijalab/seurat
2.https://www.cell.com/cell/pdf/S0092-8674(19)30559-8.pdf

seurat单细胞转录组整合分析教程-02

你可能感兴趣的:(seurat单细胞转录组整合分析教程-02)