同时测量多种模式的数据,也称为多模式分析,代表了单细胞基因组学的一个令人兴奋的前沿,迫切需要新的算法来定义基于多种数据类型的细胞状态。每种模式的不同信息内容,即使是在同一数据集的不同细胞中,也是分析和整合多模式数据集的挑战。在(Hao等人,bioRxiv 2020)中,我们引入了"加权邻近分析"(WNN),一个无监督的框架,以了解每个细胞中每个数据类型的相对效用,从而能够对多种模式数据进行整合分析。
此教程介绍了用于分析多模式单细胞数据集的 WNN 工作流。工作流程包括三个步骤
- 每个单独模式数据的独立预处理和降维
- 学习细胞特定模式的"权重",并构建一个整合模式的 WNN 图形
- 基于WNN 图的下游分析(可视化、聚类等)
我们演示了WNN分析对两种单细胞技术多模式数据的应用:CITE-seq和10x multiome。我们根据两种模式来定义细胞状态,而不是根据任何一种单一模式。
CITE-seq, RNA + ADT的WNN分析
我们使用 (CITE-seq 数据集),该数据集由 30,672 个 scRNA-seq 文件组成,包含骨髓中的 25 种抗体一起测量。该对象包含两个assays,RNA 和抗体衍生标签 (ADT)。
要运行此教程,需要安装seurat v4。
install.packages("Seurat")
library(Seurat)
library(SeuratData)
library(cowplot)
library(dplyr)
InstallData("bmcite")
bm <- LoadData(ds = "bmcite")
我们首先独立在两个assays 上进行预处理和降维。我们使用标准的normalization,但你也可以使用 SCTransform或任何替代方法。
DefaultAssay(bm) <- 'RNA'
bm <- NormalizeData(bm) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA()
DefaultAssay(bm) <- 'ADT'
# we will use all ADT features for dimensional reduction
# we set a dimensional reduction name to avoid overwriting the
VariableFeatures(bm) <- rownames(bm[["ADT"]])
bm <- NormalizeData(bm, normalization.method = 'CLR', margin = 2) %>%
ScaleData() %>% RunPCA(reduction.name = 'apca')
对于每个细胞,我们根据RNA和蛋白质相似性的加权组合计算数据集中最接近的相邻细胞。细胞特定模式权重和多模式邻近值在单个函数中计算,在此数据集上运行大约需要 2 分钟。我们指定每个模式的维度(类似于在 scRNA-seq 聚类中指定包含的 PC 数),但您可以更改这些设置,以查看这些小改动对整体结果的影响是否最小。
# Identify multimodal neighbors. These will be stored in the neighbors slot,
# and can be accessed using bm[['weighted.nn']]
# The WNN graph can be accessed at bm[["wknn"]],
# and the SNN graph used for clustering at bm[["wsnn"]]
# Cell-specific modality weights can be accessed at bm$RNA.weight
bm <- FindMultiModalNeighbors(
bm, reduction.list = list("pca", "apca"),
dims.list = list(1:30, 1:18), modality.weight.name = "RNA.weight"
)
现在,我们可以将这些结果用于下游分析,如可视化和聚类。例如,我们可以根据 RNA 和蛋白质数据的加权组合创建数据的 UMAP 可视化,还可以在 UMAP 上执行基于图形的聚类,并将这些结果与一组细胞注释一起可视化。
bm <- RunUMAP(bm, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
bm <- FindClusters(bm, graph.name = "wsnn", algorithm = 3, resolution = 2, verbose = FALSE)
p1 <- DimPlot(bm, reduction = 'wnn.umap', label = TRUE, repel = TRUE, label.size = 2.5) + NoLegend()
p2 <- DimPlot(bm, reduction = 'wnn.umap', group.by = 'celltype.l2', label = TRUE, repel = TRUE, label.size = 2.5) + NoLegend()
p1 + p2
我们还可以仅根据RNA和蛋白质数据计算UMAP可视化并进行比较。发现RNA分析比ADT分析在识别祖细胞状态(ADT面板包含分化细胞的标记)方面更翔实,而T细胞状态(ADT分析优于RNA)的情况则相反。
bm <- RunUMAP(bm, reduction = 'pca', dims = 1:30, assay = 'RNA',
reduction.name = 'rna.umap', reduction.key = 'rnaUMAP_')
bm <- RunUMAP(bm, reduction = 'apca', dims = 1:18, assay = 'ADT',
reduction.name = 'adt.umap', reduction.key = 'adtUMAP_')
p3 <- DimPlot(bm, reduction = 'rna.umap', group.by = 'celltype.l2', label = TRUE,
repel = TRUE, label.size = 2.5) + NoLegend()
p4 <- DimPlot(bm, reduction = 'adt.umap', group.by = 'celltype.l2', label = TRUE,
repel = TRUE, label.size = 2.5) + NoLegend()
p3 + p4
我们可以在多模式 UMAP 上可视化传统marker基因和蛋白质的表达,这有助于验证所提供的注释:
p5 <- FeaturePlot(bm, features = c("adt_CD45RA","adt_CD16","adt_CD161"),
reduction = 'wnn.umap', max.cutoff = 2,
cols = c("lightgrey","darkgreen"), ncol = 3)
p6 <- FeaturePlot(bm, features = c("rna_TRDC","rna_MPO","rna_AVP"),
reduction = 'wnn.umap', max.cutoff = 3, ncol = 3)
p5 / p6
最后,我们可以可视化每个细胞所学习到的模式权重。RNA权重最高的每个群体代表祖细胞,而蛋白质权重最高的群体代表T细胞。这和我们的生物学预期一致,因为抗体面板不包含能够区分不同祖细胞群体的标记。
VlnPlot(bm, features = "RNA.weight", group.by = 'celltype.l2', sort = TRUE, pt.size = 0.1) +
NoLegend()
10x Multiome, RNA + ATAC的WNN分析
在这里,我们演示了WNN分析对第二个多模式技术的使用,即10x multiome RNA+ATAC kit。我们使用在 10x网站上公开的数据集,其中包含 10,412 个 PBMC的配对转录组和 ATAC-seq数据。
在此示例中,我们将演示如何:
- 用配对的转录组和 ATAC-seq 数据创建多模式Seurat 对象,
- 在RNA+ATAC单细胞数据上执行WNN分析
- 利用这两种模式来识别不同细胞类型和状态的调节因子
您可以在此处下载数据集。请务必下载以下文件:
- 过滤的基因barcode矩阵 (HDF5)
- ATAC 每个片段信息文件 (TSV.GZ)
- ATAC 每个片段信息索引 (TSV.GZ index)
最后,为了运行本教程,请确保安装以下包:
- seurat v4.
- Signac
- EnsDb.Hsapiens.v86
- dplyr
library(Seurat)
library(Signac)
library(EnsDb.Hsapiens.v86)
library(dplyr)
library(ggplot2)
我们将根据基因表达矩阵创建一个 Seurat 对象,然后添加 ATAC-seq 数据作为第二个assay。
# the 10x hdf5 file contains both data types.
inputdata.10x <- Read10X_h5("../data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")
# extract RNA and ATAC data
rna_counts <- inputdata.10x$`Gene Expression`
atac_counts <- inputdata.10x$Peaks
# Create Seurat object
pbmc <- CreateSeuratObject(counts = rna_counts)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Now add in the ATAC-seq data
# we'll only use peaks in standard chromosomes
grange.counts <- StringToGRanges(rownames(atac_counts), sep = c(":", "-"))
grange.use <- seqnames(grange.counts) %in% standardChromosomes(grange.counts)
atac_counts <- atac_counts[as.vector(grange.use), ]
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "hg38"
frag.file <- "../data/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz"
chrom_assay <- CreateChromatinAssay(
counts = atac_counts,
sep = c(":", "-"),
genome = 'hg38',
fragments = frag.file,
min.cells = 10,
annotation = annotations
)
pbmc[["ATAC"]] <- chrom_assay
我们根据每个模式的检测到的分子数量以及线粒体百分比执行基本 QC。
VlnPlot(pbmc, features = c("nCount_ATAC", "nCount_RNA","percent.mt"), ncol = 3,
log = TRUE, pt.size = 0) + NoLegend()
pbmc <- subset(
x = pbmc,
subset = nCount_ATAC < 7e4 &
nCount_ATAC > 5e3 &
nCount_RNA < 25000 &
nCount_RNA > 1000 &
percent.mt < 20
)
接下来,使用RNA和ATAC-seq数据的标准方法,独立地对两个assays 进行预处理和降维。
# RNA analysis
DefaultAssay(pbmc) <- "RNA"
pbmc <- SCTransform(pbmc, verbose = FALSE) %>% RunPCA() %>% RunUMAP(dims = 1:50, reduction.name = 'umap.rna', reduction.key = 'rnaUMAP_')
# ATAC analysis
# We exclude the first dimension as this is typically correlated with sequencing depth
DefaultAssay(pbmc) <- "ATAC"
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
pbmc <- RunUMAP(pbmc, reduction = 'lsi', dims = 2:50, reduction.name = "umap.atac", reduction.key = "atacUMAP_")
计算一个 WNN 图形,代表了 RNA 和 ATAC-seq 模式的加权组合。使用此图进行 UMAP 可视化和聚类
pbmc <- FindMultiModalNeighbors(pbmc, reduction.list = list("pca", "lsi"), dims.list = list(1:50, 2:50))
pbmc <- RunUMAP(pbmc, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
pbmc <- FindClusters(pbmc, graph.name = "wsnn", algorithm = 3, verbose = FALSE)
注释cluster如下。请注意,也可以使用监督映射管道注释数据集,使用vignette, or automated web tool, Azimuth.
# perform sub-clustering on cluster 6 to find additional structure
pbmc <- FindSubCluster(pbmc, cluster = 6, graph.name = "wsnn", algorithm = 3)
Idents(pbmc) <- "sub.cluster"
# add annotations
pbmc <- RenameIdents(pbmc, '19' = 'pDC','20' = 'HSPC','15' = 'cDC')
pbmc <- RenameIdents(pbmc, '0' = 'CD14 Mono', '9' ='CD14 Mono', '5' = 'CD16 Mono')
pbmc <- RenameIdents(pbmc, '10' = 'Naive B', '11' = 'Intermediate B', '17' = 'Memory B', '21' = 'Plasma')
pbmc <- RenameIdents(pbmc, '7' = 'NK')
pbmc <- RenameIdents(pbmc, '4' = 'CD4 TCM', '13'= "CD4 TEM", '3' = "CD4 TCM", '16' ="Treg", '1' ="CD4 Naive", '14' = "CD4 Naive")
pbmc <- RenameIdents(pbmc, '2' = 'CD8 Naive', '8'= "CD8 Naive", '12' = 'CD8 TEM_1', '6_0' = 'CD8 TEM_2', '6_1' ='CD8 TEM_2', '6_4' ='CD8 TEM_2')
pbmc <- RenameIdents(pbmc, '18' = 'MAIT')
pbmc <- RenameIdents(pbmc, '6_2' ='gdT', '6_3' = 'gdT')
pbmc$celltype <- Idents(pbmc)
根据基因表达、ATAC-seq 或 WNN 分析来进行可视化聚类。这些差异比之前的分析更微妙(可以探索权重,这些权重比 CITE-seq 示例更均匀地分开),但 WNN 提供了细胞状态的最清晰的分离。
p1 <- DimPlot(pbmc, reduction = "umap.rna", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("RNA")
p2 <- DimPlot(pbmc, reduction = "umap.atac", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("ATAC")
p3 <- DimPlot(pbmc, reduction = "wnn.umap", group.by = "celltype", label = TRUE, label.size = 2.5, repel = TRUE) + ggtitle("WNN")
p1 + p2 + p3 & NoLegend() & theme(plot.title = element_text(hjust = 0.5))
例如,ATAC-seq 数据有助于分离 CD4 和 CD8 T 细胞状态。这是因为存在多个 loci,在不同的 T 细胞子类型之间表现出不同的可及性。例如,可以使用Signac visualization vignette.
中的工具,将 CD8A 轨迹的"pseudobulk"轨道与基因表达水平的小提琴图一起可视化。
## to make the visualization easier, subset T cell clusters
celltype.names <- levels(pbmc)
tcell.names <- grep("CD4|CD8|Treg", celltype.names,value = TRUE)
tcells <- subset(pbmc, idents = tcell.names)
CoveragePlot(tcells, region = 'CD8A', features = 'CD8A', assay = 'ATAC', expression.assay = 'SCT', peaks = FALSE)
接下来,检查每个细胞的可及区域,以确定富集的motif。如Signac motifs vignette中描述的那样,有几种方法可以做到这一点,但我们将使用来自Greenleaf
lab的 chromVAR
包。它计算每个细胞已知motif的可及性分数,并将这些分数作为 Seurat 对象中的第三个assay (chromvar
) 添加。
要继续,请确保安装了以下包。
- chromVAR for the analysis of motif accessibility in scATAC-seq
- presto for fast differential expression analyses.
- TFBSTools for TFBS analysis
- JASPAR2020 for JASPAR motif models
- motifmatchr for motif matching
- BSgenome.Hsapiens.UCSC.hg38 for chromVAR
library(chromVAR)
library(JASPAR2020)
library(TFBSTools)
library(motifmatchr)
library(BSgenome.Hsapiens.UCSC.hg38)
# Scan the DNA sequence of each peak for the presence of each motif, and create a Motif object
DefaultAssay(pbmc) <- "ATAC"
pwm_set <- getMatrixSet(x = JASPAR2020, opts = list(species = 9606, all_versions = FALSE))
motif.matrix <- CreateMotifMatrix(features = granges(pbmc), pwm = pwm_set, genome = 'hg38', use.counts = FALSE)
motif.object <- CreateMotifObject(data = motif.matrix, pwm = pwm_set)
pbmc <- SetAssayData(pbmc, assay = 'ATAC', slot = 'motifs', new.data = motif.object)
# Note that this step can take 30-60 minutes
pbmc <- RunChromVAR(
object = pbmc,
genome = BSgenome.Hsapiens.UCSC.hg38
)
最后,探索多模式数据集,以确定每个细胞状态的关键调节因子。配对数据提供了一个独特的机会来识别符合多个标准的转录因子 (TFs),从而帮助将假定调节因子列表缩小到最有可能的候选者。我们旨在识别在多种细胞类型RNA测量中表达丰富的 TF,但也富集了 ATAC 测量中其motif的可及性。
作为一个例子和阳性对照,CCAAT增强子结合蛋白(CEBP)家族的蛋白质,包括TF CEBPB,已被反复证明在骨髓细胞包括单核细胞和树突细胞的分化和功能中发挥重要作用。可以看到,CEBPB的表达和 MA0466.2.4 motif(编码CEBPB 蛋白结合位点)的可及性都在单核细胞中富集。
#returns MA0466.2
motif.name <- ConvertMotifID(pbmc, name = 'CEBPB')
gene_plot <- FeaturePlot(pbmc, features = "sct_CEBPB", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot
我们希望量化这种关系,并搜索所有细胞类型以查找类似例子。为此,将使用presto
包执行快速差异表达分析。运行两个检测:一个使用基因表达数据,另一个使用chromVAR motis可及性。 presto
默认根据 Wilcox 秩和检验计算 p 值,限制搜索范围为在两次测试中返回显著结果的 TF。
presto
还计算了一个"AUC"统计值,它反映了每个基因(或motif)作为细胞类型标记的力度。AUC 的最大值为 1 表示一个完美的marker。由于 AUC 值在基因和motif的比例相同,可以从两个测试中取 AUC 值的平均值,并用它来对每个细胞类型的 TF 进行排序:
markers_rna <- presto:::wilcoxauc.Seurat(X = pbmc, group_by = 'celltype', assay = 'data', seurat_assay = 'SCT')
markers_motifs <- presto:::wilcoxauc.Seurat(X = pbmc, group_by = 'celltype', assay = 'data', seurat_assay = 'chromvar')
motif.names <- markers_motifs$feature
colnames(markers_rna) <- paste0("RNA.", colnames(markers_rna))
colnames(markers_motifs) <- paste0("motif.", colnames(markers_motifs))
markers_rna$gene <- markers_rna$RNA.feature
markers_motifs$gene <- ConvertMotifID(pbmc, id = motif.names)
# a simple function to implement the procedure above
topTFs <- function(celltype, padj.cutoff = 1e-2) {
ctmarkers_rna <- dplyr::filter(
markers_rna, RNA.group == celltype, RNA.padj < padj.cutoff, RNA.logFC > 0) %>%
arrange(-RNA.auc)
ctmarkers_motif <- dplyr::filter(
markers_motifs, motif.group == celltype, motif.padj < padj.cutoff, motif.logFC > 0) %>%
arrange(-motif.auc)
top_tfs <- inner_join(
x = ctmarkers_rna[, c(2, 11, 6, 7)],
y = ctmarkers_motif[, c(2, 1, 11, 6, 7)], by = "gene"
)
top_tfs$avg_auc <- (top_tfs$RNA.auc + top_tfs$motif.auc) / 2
top_tfs <- arrange(top_tfs, -avg_auc)
return(top_tfs)
}
现在,可以计算,可视化和推断任何细胞类型的调节因子。我们得到已知的调节因子,包括TBX21 for NK cells, IRF4 for plasma cells, SOX4 for hematopoietic progenitors, EBF1 and PAX5 for B cells, IRF8 and TCF4 for pDC
我们相信,类似的策略可以用来帮助聚焦不同体系中的一套推断的调节因子。
# identify top markers in NK and visualize
head(topTFs("NK"), 3)
## RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc
## 1 NK TBX21 0.7264660 0.000000e+00 NK MA0690.1 0.9223858
## 2 NK EOMES 0.5895889 1.552097e-100 NK MA0800.1 0.9263286
## 3 NK RUNX3 0.7722418 7.131401e-122 NK MA0684.2 0.7083570
## motif.pval avg_auc
## 1 2.343664e-211 0.8244259
## 2 2.786462e-215 0.7579587
## 3 7.069675e-53 0.7402994
motif.name <- ConvertMotifID(pbmc, name = 'TBX21')
gene_plot <- FeaturePlot(pbmc, features = "sct_TBX21", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot
# identify top markers in pDC and visualize
head(topTFs("pDC"), 3)
## RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc
## 1 pDC TCF4 0.9998833 3.347777e-163 pDC MA0830.2 0.9959622
## 2 pDC IRF8 0.9905372 2.063258e-124 pDC MA0652.1 0.8814713
## 3 pDC SPIB 0.9114648 0.000000e+00 pDC MA0081.2 0.8997955
## motif.pval avg_auc
## 1 2.578226e-69 0.9979228
## 2 9.702602e-42 0.9360043
## 3 1.130040e-45 0.9056302
motif.name <- ConvertMotifID(pbmc, name = 'TCF4')
gene_plot <- FeaturePlot(pbmc, features = "sct_TCF4", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot
# identify top markers in HSPC and visualize
head(topTFs("CD16 Mono"),3)
## RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc
## 1 CD16 Mono SPI1 0.8764099 4.116679e-291 CD16 Mono MA0080.5 0.8831213
## 2 CD16 Mono CEBPB 0.8675114 8.321489e-292 CD16 Mono MA0466.2 0.7859496
## 3 CD16 Mono MEF2C 0.7132221 4.210640e-79 CD16 Mono MA0497.1 0.7986104
## motif.pval avg_auc
## 1 3.965856e-188 0.8797656
## 2 1.092808e-105 0.8267305
## 3 4.462855e-115 0.7559162
motif.name <- ConvertMotifID(pbmc, name = 'SPI1')
gene_plot <- FeaturePlot(pbmc, features = "sct_SPI1", reduction = 'wnn.umap')
motif_plot <- FeaturePlot(pbmc, features = motif.name, min.cutoff = 0, cols = c("lightgrey", "darkred"), reduction = 'wnn.umap')
gene_plot | motif_plot
# identify top markers in other cell types
head(topTFs("Naive B"), 3)
## RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc
## 1 Naive B MEF2C 0.8814368 1.002848e-183 Naive B MA0497.1 0.6511956
## 2 Naive B PAX5 0.9437146 0.000000e+00 Naive B MA0014.3 0.5634996
## 3 Naive B POU2F2 0.6022295 4.067907e-13 Naive B MA0507.1 0.8773954
## motif.pval avg_auc
## 1 3.903712e-23 0.7663162
## 2 3.175138e-05 0.7536071
## 3 5.457378e-135 0.7398125
head(topTFs("HSPC"), 3)
## RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc
## 1 HSPC SOX4 0.9869221 7.766427e-71 HSPC MA0867.2 0.7104016
## 2 HSPC GATA2 0.7884615 0.000000e+00 HSPC MA0036.3 0.8308485
## 3 HSPC MEIS1 0.8830582 0.000000e+00 HSPC MA0498.2 0.6781207
## motif.pval avg_auc
## 1 2.059705e-04 0.8486618
## 2 5.336146e-09 0.8096550
## 3 1.677267e-03 0.7805894
head(topTFs("Plasma"), 3)
## RNA.group gene RNA.auc RNA.pval motif.group motif.feature motif.auc
## 1 Plasma IRF4 0.8189901 5.206829e-35 Plasma MA1419.1 0.9782567
## 2 Plasma MEF2C 0.9070644 4.738760e-12 Plasma MA0497.1 0.7687288
## 3 Plasma TCF4 0.8052937 5.956053e-12 Plasma MA0830.2 0.8046950
## motif.pval avg_auc
## 1 2.180043e-12 0.8986234
## 2 7.951876e-05 0.8378966
## 3 7.678507e-06 0.8049943