scmap: projection of single-cell RNA-seq data across datasets

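As more and more scRNA-seq datasets become available, carrying out comparisons between them is key. A central application is to compare datasets of similar biological origin collected by different labs to ensure that the annotation and the analysis are consistent. Moreover, as very large references such as the Human Cell Atlas (HCA) become available, an important application will be to project cells from a new sample (e.g. from a disease tissue) onto the reference to characterize differences in composition or to detect new cell types.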

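scmap is a method for projecting cells from a scRNA-seq experiment onto the cell types or individual cells identified in a different experiment. The scmap manuscript is available on bioRxiv.

scmap is built on top of the Bioconductor SingleCellExperiment class. Please read its documentation to learn how to create a SingleCellExperiment from your own data. Here we show a small example of how to do that, but note that it is not a comprehensive guide.

If you already have a SingleCellExperiment object, then proceed to the next chapter.

If you have an expression matrix, then you first need to create a SingleCellExperiment object containing your data. For illustrative purposes we will use an example expression matrix provided with scmap. The dataset (yan) represents FPKM gene expression of 90 cells derived from human embryos. The authors (Yan et al.) defined the developmental stages of all cells in the original publication (ann data frame). We will use these stages in projection later.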

library(SingleCellExperiment)
library(scmap)
head(ann)

##                 cell_type1
## Oocyte..1.RPKM.     zygote
## Oocyte..2.RPKM.     zygote
## Oocyte..3.RPKM.     zygote
## Zygote..1.RPKM.     zygote
## Zygote..2.RPKM.     zygote
## Zygote..3.RPKM.     zygote

yan[1:3, 1:3]

##          Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## C9orf152             0.0             0.0             0.0
## RPS11             1219.9          1021.1           931.6
## ELMO2                7.0            12.2             9.3

Note that the cell type information has to be stored in the cell_type1 column of the colData slot of the SingleCellExperiment object.

sce <- SingleCellExperiment(assays = list(normcounts = as.matrix(yan)), colData = ann)
logcounts(sce) <- log2(normcounts(sce) + 1)
# use gene names as feature symbols
rowData(sce)$feature_symbol <- rownames(sce)
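# flag ERCC spike-ins (isSpike<- belongs to older SingleCellExperiment releases
# and is deprecated in newer ones)
isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))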
# remove features with duplicated names
sce <- sce[!duplicated(rownames(sce)), ]
sce

## class: SingleCellExperiment 
## dim: 20214 90 
## metadata(0):
## assays(2): normcounts logcounts
## rownames(20214): C9orf152 RPS11 ... CTSC AQP7
## rowData names(1): feature_symbol
## colnames(90): Oocyte..1.RPKM. Oocyte..2.RPKM. ...
##   Late.blastocyst..3..Cell.7.RPKM. Late.blastocyst..3..Cell.8.RPKM.
## colData names(1): cell_type1
## reducedDimNames(0):
## spikeNames(1): ERCC
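
Before building the object it can help to check the raw inputs with base R alone. The following is a minimal sketch on made-up toy data (the names `expr_toy` and `ann_toy` and all values are hypothetical, not part of scmap): it verifies that the annotation rows line up with the expression columns, applies the same log transform, and drops duplicated gene names.

```r
# Toy stand-ins for an expression matrix and its annotation (hypothetical values).
# matrix() fills column by column, so each group of four numbers is one cell.
expr_toy <- matrix(
  c(0, 1219.9,  7.0, 3.0,
    0, 1021.1, 12.2, 1.0,
    0,  931.6,  9.3, 2.0),
  nrow = 4,
  dimnames = list(c("C9orf152", "RPS11", "ELMO2", "RPS11"),
                  c("cell1", "cell2", "cell3"))
)
ann_toy <- data.frame(cell_type1 = c("zygote", "zygote", "2cell"),
                      row.names = colnames(expr_toy))

# The annotation must describe exactly the cells in the matrix, in the same order
stopifnot(identical(rownames(ann_toy), colnames(expr_toy)))

# Same log transform as used for logcounts above
log_toy <- log2(expr_toy + 1)

# Keep only the first occurrence of each gene name
log_toy <- log_toy[!duplicated(rownames(log_toy)), , drop = FALSE]
dim(log_toy)
```

Running the alignment check first avoids silently attaching cell types to the wrong columns.
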

Feature selection

Once we have a SingleCellExperiment object we can run scmap. First, we need to select the most informative features (genes) from our input dataset:

sce <- selectFeatures(sce, suppress_plot = FALSE)

## Warning in linearModel(object, n_features): Your object does not contain
## counts() slot. Dropouts were calculated using logcounts() slot...

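Features highlighted in red will be used in the further analysis (projection).
The selected features are stored in the scmap_features column of the rowData slot of the input object. By default scmap selects 500 features (this can be changed with the n_features parameter):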

table(rowData(sce)$scmap_features)

## 
## FALSE  TRUE 
## 19714   500
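
The warning above mentions a linear model and dropouts. As a rough illustration of what a dropout-based selection could look like, the sketch below fits a line to log2 dropout rate versus mean log-expression on simulated data and keeps the genes with the largest positive residuals. This is our assumption about the general idea, not a reproduction of the selectFeatures internals; the simulated data and all variable names are hypothetical.

```r
set.seed(42)

# Simulated counts: 200 genes x 30 cells, each gene with its own Poisson mean
n_genes <- 200
n_cells <- 30
mu <- runif(n_genes, min = 0.2, max = 4)
counts_toy <- matrix(
  rpois(n_genes * n_cells, lambda = rep(mu, n_cells)),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), NULL)
)
log_toy <- log2(counts_toy + 1)

# Dropout rate (% of zero entries) per gene; genes at 0% or 100% carry no signal
dropout_pct <- rowMeans(log_toy == 0) * 100
keep <- dropout_pct > 0 & dropout_pct < 100

# Linear fit of log2 dropout rate against mean log-expression
d <- data.frame(
  dropout   = log2(dropout_pct[keep]),
  mean_expr = rowMeans(log_toy[keep, , drop = FALSE])
)
fit <- lm(dropout ~ mean_expr, data = d)

# Genes with more dropouts than expected for their expression level
n_features <- 20
selected <- names(sort(residuals(fit), decreasing = TRUE))[seq_len(n_features)]
is_feature <- rownames(log_toy) %in% selected
table(is_feature)
```

The logical vector `is_feature` plays the same role as the scmap_features column shown above.
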

scmap-cluster

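The scmap-cluster index of a reference dataset is created by finding the median gene expression for each cluster. By default scmap uses the cell_type1 column of the colData slot in the reference to identify clusters. Other columns can be selected manually by adjusting the cluster_col parameter: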

sce <- indexCluster(sce)

The function indexCluster automatically writes the scmap_cluster_index item of the metadata slot of the reference dataset.

head(metadata(sce)$scmap_cluster_index)

##           zygote     2cell    4cell     8cell   16cell    blast
## ABCB4   5.788589 6.2258580 5.935134 0.6667119 0.000000 0.000000
## ABCC6P1 7.863625 7.7303559 8.322769 7.4303689 4.759867 0.000000
## ABT1    0.320773 0.1315172 0.000000 5.9787977 6.100671 4.627798
## ACCSL   7.922318 8.4274290 9.662611 4.5869260 1.768026 0.000000
## ACOT11  0.000000 0.0000000 0.000000 6.4677243 7.147798 4.057444
## ACOT9   4.877394 4.2196038 5.446969 4.0685468 3.827819 0.000000

heatmap(as.matrix(metadata(sce)$scmap_cluster_index))
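
To make the idea of a per-cluster median index concrete, here is a small base R sketch on made-up values (the matrix `expr_toy` and the labels are hypothetical). It only illustrates the median step described above and does not call scmap.

```r
# Toy log-expression matrix (genes x cells) with made-up values
expr_toy <- matrix(
  c(5, 6, 7, 0, 0, 1,
    0, 1, 0, 4, 5, 6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:6))
)
clusters <- c("zygote", "zygote", "zygote", "blast", "blast", "blast")

# Median expression of every gene within every cluster (genes x clusters)
cluster_index <- sapply(unique(clusters), function(cl) {
  apply(expr_toy[, clusters == cl, drop = FALSE], 1, median)
})
cluster_index
```

Each column of `cluster_index` is one cluster centroid, which is the same layout as the scmap_cluster_index table printed above.
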

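Once the scmap-cluster index has been generated we can use it to project our dataset onto itself (for illustrative purposes only). This can be done with one index at a time, but scmap also allows simultaneous projection to multiple indexes if they are provided as a list: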

scmapCluster_results <- scmapCluster(
  projection = sce, 
  index_list = list(
    yan = metadata(sce)$scmap_cluster_index
  )
)

scmap-cluster projects the query dataset to all indexes defined in index_list. The resulting cell label assignments are merged into one matrix:

head(scmapCluster_results$scmap_cluster_labs)

##      yan     
## [1,] "zygote"
## [2,] "zygote"
## [3,] "zygote"
## [4,] "2cell" 
## [5,] "2cell" 
## [6,] "2cell"

The corresponding similarities are stored in the scmap_cluster_siml item:

head(scmapCluster_results$scmap_cluster_siml)

##            yan
## [1,] 0.9947609
## [2,] 0.9951257
## [3,] 0.9955916
## [4,] 0.9934012
## [5,] 0.9953694
## [6,] 0.9871041
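
The sketch below shows one way a query cell could be compared with cluster centroids and labelled by its best match. It uses cosine similarity alone and an illustrative cut-off of 0.7; both are assumptions of this example, and the actual decision rule inside scmapCluster may differ. The centroid values and the query cell are made up.

```r
# Cosine similarity between two numeric vectors
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Hypothetical cluster index (genes x clusters) and one query cell
index_toy <- cbind(zygote = c(6, 8, 0), blast = c(0, 0, 5))
rownames(index_toy) <- c("geneA", "geneB", "geneC")
query <- c(geneA = 5, geneB = 9, geneC = 1)

# Similarity of the query cell to every cluster centroid
siml <- apply(index_toy, 2, cosine, y = query)

# Assign the best-matching cluster, or "unassigned" below the cut-off
threshold <- 0.7  # illustrative value, an assumption of this sketch
label <- if (max(siml) >= threshold) names(which.max(siml)) else "unassigned"
label
```

A cell whose best similarity falls below the cut-off stays "unassigned" rather than being forced into the nearest cluster.
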

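scmap also provides combined results across all reference datasets (for each cell it chooses the label corresponding to the largest similarity across reference datasets):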

head(scmapCluster_results$combined_labs)

## [1] "zygote" "zygote" "zygote" "2cell"  "2cell"  "2cell"
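
With a single reference the combined labels simply repeat the yan column. The following base R sketch shows how the choice of the largest similarity could work with two references; the reference names `refA` and `refB` and all values are hypothetical, and the handling of unassigned cells in scmap itself is not modelled here.

```r
# Hypothetical labels and similarities for 3 query cells against 2 references
labs <- cbind(refA = c("zygote", "2cell", "unassigned"),
              refB = c("zygote", "4cell", "blast"))
siml <- cbind(refA = c(0.99, 0.80, 0.40),
              refB = c(0.95, 0.90, 0.85))

# For each cell, the column (reference) with the largest similarity
best <- max.col(siml, ties.method = "first")

# Pick the label from that reference using (row, column) index pairs
combined <- labs[cbind(seq_len(nrow(labs)), best)]
combined
```
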

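The results of scmap-cluster can be visualized as a Sankey diagram to show how cell clusters are matched (getSankey() function). Note that the Sankey diagram will only be informative if both the query and the reference datasets have been clustered, but it is not necessary to have meaningful labels assigned to the query (cluster1, cluster2 etc. are sufficient):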

plot(
  getSankey(
    colData(sce)$cell_type1, 
    scmapCluster_results$scmap_cluster_labs[,'yan'],
    plot_height = 400
  )
)
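
When an interactive plot is not convenient, a plain contingency table carries the same information as the Sankey diagram. This is a base R sketch with made-up label vectors (`reference` and `assigned` are hypothetical), not a scmap function.

```r
# Hypothetical original labels and projected labels for five cells
reference <- c("zygote", "zygote", "2cell", "2cell", "4cell")
assigned  <- c("zygote", "zygote", "2cell", "unassigned", "4cell")

# Rows: original labels; columns: labels assigned by the projection
tab <- table(reference = reference, assigned = assigned)
tab

# Fraction of cells whose assigned label matches the original one
agreement <- mean(reference == assigned)
agreement
```
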

scmap-cell

In contrast to scmap-cluster, scmap-cell projects cells of the input dataset onto the individual cells of the reference rather than onto cell clusters.

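scmap-cell contains a k-means step, which makes it stochastic, i.e. running it multiple times will give slightly different results. Therefore, we fix a random seed so that a user can reproduce our results exactly: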

set.seed(1)

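The effect of the seed can be checked with base R's kmeans on simulated points. This is a small sketch that does not involve scmap; the data are random and purely illustrative.

```r
# 100 simulated two-dimensional points
set.seed(123)
x <- matrix(rnorm(200), ncol = 2)

# Resetting the same seed before each run gives identical starting centres
set.seed(1)
fit1 <- kmeans(x, centers = 3)
set.seed(1)
fit2 <- kmeans(x, centers = 3)

identical(fit1$cluster, fit2$cluster)
```

Without the second set.seed(1) call the two runs may start from different centres and can produce different cluster assignments.
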
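In scmap-cell the index is created by a product quantiser algorithm, in which every cell in the reference is identified with a set of sub-centroids found via k-means clustering based on a subset of the features.

sce <- indexCell(sce)

Unlike the scmap-cluster index, the scmap-cell index contains information about each cell and therefore cannot be easily visualised. The scmap-cell index consists of two items:

names(metadata(sce)$scmap_cell_index)

## [1] "subcentroids" "subclusters"

subcentroids contains the coordinates of the subcentroids of the low-dimensional subspaces defined by the selected features and the k and M parameters of the product quantiser algorithm (see ?indexCell).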

length(metadata(sce)$scmap_cell_index$subcentroids)

## [1] 50

dim(metadata(sce)$scmap_cell_index$subcentroids[[1]])

## [1] 10  9

metadata(sce)$scmap_cell_index$subcentroids[[1]][,1:5]

##                    1         2          3          4         5
## ZAR1L    0.072987697 0.2848353 0.33713297 0.26694708 0.3051086
## SERPINF1 0.179135680 0.3784345 0.35886481 0.39453521 0.4326297
## GRB2     0.439712934 0.4246024 0.23308320 0.43238208 0.3247221
## GSTP1    0.801498298 0.1464230 0.14880665 0.19900079 0.0000000
## ABCC6P1  0.005544482 0.4358565 0.46276591 0.40280401 0.3989602
## ARGFX    0.341212258 0.4284664 0.07629512 0.47961460 0.1296112
## DCT      0.004323311 0.1943568 0.32117489 0.21259776 0.3836451
## C15orf60 0.006681366 0.1862540 0.28346531 0.01123282 0.1096438
## SVOPL    0.003004345 0.1548237 0.33551596 0.12691677 0.2525819
## NLRP9    0.101524942 0.3223963 0.40624639 0.30465156 0.4640308

In the case of our yan dataset:

- The yan dataset contains N = 90 cells.
- We selected f = 500 features (scmap default).
- M was calculated as f / 10 = 50 (scmap default for f ≤ 1000). M is the number of low-dimensional subspaces.
- The number of features in any low-dimensional subspace equals f / M = 10.
- k was calculated as k = √N ≈ 9 (scmap default).

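The arithmetic behind these defaults can be written out in a few lines of R. We assume here that the approximation of √N is a rounding down, which agrees with the 9 columns of the subcentroid matrix printed earlier; the variable names are ours.

```r
N <- 90    # cells in the reference
f <- 500   # selected features

M <- f / 10                      # number of low-dimensional subspaces
features_per_subspace <- f / M   # features in each subspace
k <- floor(sqrt(N))              # subcentroids per subspace (rounding down is assumed)

c(M = M, features_per_subspace = features_per_subspace, k = k)
```

These values match the dimensions seen above: 50 subcentroid matrices, each with 10 rows (features) and 9 columns (subcentroids).
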
subclusters contains, for every low-dimensional subspace, the index of the subcentroid to which a given cell belongs:

dim(metadata(sce)$scmap_cell_index$subclusters)

## [1] 50 90

metadata(sce)$scmap_cell_index$subclusters[1:5,1:5]

##      Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM. Zygote..1.RPKM.
## [1,]               6               6               6               6
## [2,]               5               5               5               5
## [3,]               5               5               5               5
## [4,]               3               3               3               3
## [5,]               6               6               6               6
##      Zygote..2.RPKM.
## [1,]               6
## [2,]               5
## [3,]               5
## [4,]               3
## [5,]               6
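
The two index items can be mimicked with base R on simulated data. The sketch below splits the features into M contiguous chunks and runs k-means in each chunk; splitting into contiguous chunks and skipping any normalisation are simplifications of ours, so this is not the indexCell implementation. All sizes and names are hypothetical.

```r
set.seed(1)
n_features <- 20
n_cells <- 30
M <- 4   # number of subspaces
k <- 3   # subcentroids per subspace

# Simulated expression matrix (features x cells)
expr_toy <- matrix(
  rnorm(n_features * n_cells), nrow = n_features,
  dimnames = list(paste0("gene", seq_len(n_features)),
                  paste0("cell", seq_len(n_cells)))
)

# Assign features to M equally sized chunks (one chunk per subspace)
chunks <- split(seq_len(n_features), rep(seq_len(M), each = n_features / M))

subcentroids <- vector("list", M)
subclusters <- matrix(NA_integer_, nrow = M, ncol = n_cells,
                      dimnames = list(NULL, colnames(expr_toy)))

for (m in seq_len(M)) {
  sub <- expr_toy[chunks[[m]], , drop = FALSE]
  fit <- kmeans(t(sub), centers = k)     # cluster the cells within this subspace
  subcentroids[[m]] <- t(fit$centers)    # features x k coordinates
  subclusters[m, ] <- fit$cluster        # subcentroid index of every cell
}

dim(subcentroids[[1]])
dim(subclusters)
```

Each cell is thus summarised by M small integers (one row of `subclusters` per subspace) instead of its full expression profile, which is what makes the later nearest-neighbour search cheap.
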

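Once the scmap-cell index has been generated we can use it for projection. Here we project the yan dataset onto itself (for illustrative purposes only). This can be done with one index at a time, but scmap also allows simultaneous projection to multiple indexes if they are provided as a list: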

scmapCell_results <- scmapCell(
  sce, 
  list(
    yan = metadata(sce)$scmap_cell_index
  )
)

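For each reference dataset the results contain two matrices. The cells matrix contains the IDs of the top 10 (scmap default) cells of the reference dataset that a given cell of the projection dataset is closest to: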

scmapCell_results$yan$cells[,1:3]

##       Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
##  [1,]               1               1               1
##  [2,]               2               2               2
##  [3,]               3               3               3
##  [4,]              11              11              11
##  [5,]               5               5               5
##  [6,]               6               6               6
##  [7,]               7               7               7
##  [8,]              12               8              12
##  [9,]               9               9               9
## [10,]              10              10              10

The similarities matrix contains the corresponding cosine similarities:

scmapCell_results$yan$similarities[,1:3]

##       Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
##  [1,]       0.9742737       0.9736593       0.9748542
##  [2,]       0.9742274       0.9737083       0.9748995
##  [3,]       0.9742274       0.9737083       0.9748995
##  [4,]       0.9693955       0.9684169       0.9697731
##  [5,]       0.9698173       0.9688538       0.9701976
##  [6,]       0.9695394       0.9685904       0.9699759
##  [7,]       0.9694336       0.9686058       0.9699198
##  [8,]       0.9694091       0.9684312       0.9697699
##  [9,]       0.9692544       0.9684312       0.9697358
## [10,]       0.9694336       0.9686058       0.9699198
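
For reference, cosine similarity and a nearest-neighbour ranking can be computed exactly in base R. The sketch below does an exhaustive comparison on made-up vectors, whereas scmap-cell searches through its index; the reference cells `r1` to `r4` and the query are hypothetical.

```r
# Hypothetical reference cells (columns) and one query cell, 3 features each
ref <- cbind(r1 = c(1, 0, 0), r2 = c(0.9, 0.1, 0),
             r3 = c(0, 1, 0), r4 = c(0, 0, 1))
query <- c(1, 0.05, 0)

# Scale every column to unit length so that dot products become cosines
unit_cols <- function(m) sweep(m, 2, sqrt(colSums(m^2)), "/")
sims <- drop(crossprod(unit_cols(ref), query / sqrt(sum(query^2))))

# Indices of the two most similar reference cells, best first
top <- order(sims, decreasing = TRUE)[1:2]
names(sims)[top]
round(sims[top], 4)
```
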

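If cell cluster annotation is available for the reference dataset, then in addition to finding the top 10 nearest neighbours scmap-cell can also annotate the cells of the projection dataset using the labels of the reference. It does so by looking at the top 3 nearest neighbours (scmap default): if they all belong to the same cluster in the reference and their maximum similarity is higher than a threshold (0.5 is the scmap default), the projection cell is assigned to the corresponding reference cluster: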

scmapCell_clusters <- scmapCell2Cluster(
  scmapCell_results, 
  list(
    as.character(colData(sce)$cell_type1)
  )
)

The scmap-cell results are in the same format as those provided by scmap-cluster (see above):

head(scmapCell_clusters$scmap_cluster_labs)

##      yan         
## [1,] "zygote"    
## [2,] "zygote"    
## [3,] "zygote"    
## [4,] "unassigned"
## [5,] "unassigned"
## [6,] "unassigned"
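
The assignment rule described above (top 3 neighbours agree and their maximum similarity exceeds 0.5) can be expressed as a small base R function. The function `assign_label` and its inputs are hypothetical and written only to illustrate the rule; this is not the scmapCell2Cluster source.

```r
# nn_labels: reference cluster of each nearest neighbour
# nn_sims:   similarity of each nearest neighbour to the query cell
assign_label <- function(nn_labels, nn_sims, w = 3, threshold = 0.5) {
  top <- order(nn_sims, decreasing = TRUE)[seq_len(w)]   # w closest neighbours
  same_cluster <- length(unique(nn_labels[top])) == 1
  if (same_cluster && max(nn_sims[top]) > threshold) {
    nn_labels[top][1]
  } else {
    "unassigned"
  }
}

# All three closest neighbours agree and are similar enough
res_agree <- assign_label(c("zygote", "zygote", "zygote", "2cell"),
                          c(0.97, 0.96, 0.95, 0.90))
# The three closest neighbours disagree
res_mixed <- assign_label(c("zygote", "2cell", "zygote", "2cell"),
                          c(0.97, 0.96, 0.95, 0.90))
# Neighbours agree but none is similar enough
res_weak <- assign_label(c("blast", "blast", "blast"), c(0.4, 0.3, 0.2))

c(res_agree, res_mixed, res_weak)
```

A rule of this kind leaves a cell unassigned whenever its closest neighbours fall into more than one reference cluster.
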