Reference: WeChat official account "bioinfomics"
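Seurat v3 can perform integrated analysis of multiple single-cell sequencing datasets. Its methods integrate data collected from different individuals, experimental conditions, sequencing technologies, and even species, with the goal of identifying cell states that are shared across datasets.
These methods first identify "anchors" between pairs of datasets. Each anchor is a pairwise correspondence between two individual cells (one from each dataset) that are assumed to originate from the same biological state. The anchors are then used either to harmonize the datasets or to transfer information from one dataset to another.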
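To make the idea of an anchor more concrete, here is a toy sketch in base R that pairs cells from two tiny "datasets" when each is the other's nearest neighbor. This is only an illustration of the pairing idea, not Seurat's actual procedure; the matrices `a` and `b` and their coordinates are made up for the example.

```r
# Toy illustration only: mutual nearest neighbors between two small datasets.
# Rows are cells, columns are two hypothetical embedding dimensions.
a <- matrix(c(0, 0,   5, 5,   10, 0),     ncol = 2, byrow = TRUE)
b <- matrix(c(0.2, 0.1,   5.1, 4.8,   20, 20), ncol = 2, byrow = TRUE)

# Pairwise Euclidean distances (rows: cells of a, columns: cells of b)
d <- as.matrix(dist(rbind(a, b)))[seq_len(nrow(a)), nrow(a) + seq_len(nrow(b))]

nn_ab <- unname(apply(d, 1, which.min))  # nearest cell in b for each cell in a
nn_ba <- unname(apply(d, 2, which.min))  # nearest cell in a for each cell in b

# Keep a pair only if the two cells choose each other
mutual  <- which(nn_ba[nn_ab] == seq_len(nrow(a)))
anchors <- data.frame(cell_a = mutual, cell_b = nn_ab[mutual])
print(anchors)
```

In this toy case the third cell of `a` has no partner: its nearest neighbor in `b` is closer to another cell, so no anchor is formed for it.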
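Integration with the standard workflow
In this tutorial, we use human pancreatic islet cell datasets generated with four different sequencing technologies: CelSeq (GSE81076), CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061). We load the data through the SeuratData package.
Install and load the required R packages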
rm(list=ls())
Sys.setenv(language='en')
options(stringsAsFactors = FALSE)
# Install and load the SeuratData package
devtools::install_github('satijalab/seurat-data')
library(SeuratData)
library(Seurat)
# List the datasets available through SeuratData
AvailableData()
# Download and install a specific dataset from SeuratData
InstallData("panc8")
library(panc8.SeuratData)
data('panc8')
panc8  # (1.3GB)
An object of class Seurat
34363 features across 14890 samples within 1 assay
Active assay: RNA (34363 features, 0 variable features)
Split the object to build the individual datasets
head(panc8@meta.data)
orig.ident nCount_RNA nFeature_RNA tech
D101_5 D101 4615.810 1986 celseq
D101_7 D101 29001.563 4209 celseq
D101_10 D101 6707.857 2408 celseq
D101_13 D101 8797.224 2964 celseq
D101_14 D101 5032.558 2264 celseq
D101_17 D101 13474.866 3982 celseq
replicate assigned_cluster celltype dataset
D101_5 celseq gamma celseq
D101_7 celseq acinar celseq
D101_10 celseq alpha celseq
D101_13 celseq delta celseq
D101_14 celseq beta celseq
D101_17 celseq ductal celseq
table(panc8@meta.data$tech)
# celseq    celseq2 fluidigmc1     indrop  smartseq2
#   1004       2285        638       8569       2394
# Split the Seurat object by sequencing technology (the "tech" metadata column) to build separate datasets
pancreas.list <- SplitObject(panc8, split.by = "tech")
# Keep the data from the four sequencing technologies used in this tutorial
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
pancreas.list
$celseq
An object of class Seurat
34363 features across 1004 samples within 1 assay
Active assay: RNA (34363 features, 2000 variable features)
$celseq2
An object of class Seurat
34363 features across 2285 samples within 1 assay
Active assay: RNA (34363 features, 2000 variable features)
$fluidigmc1
An object of class Seurat
34363 features across 638 samples within 1 assay
Active assay: RNA (34363 features, 2000 variable features)
$smartseq2
An object of class Seurat
34363 features across 2394 samples within 1 assay
Active assay: RNA (34363 features, 2000 variable features)
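Run standard preprocessing on each dataset separately
for (i in seq_along(pancreas.list)) {
  # Log-normalize the counts of each dataset
  pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
  # Select the 2000 most variable features with the "vst" method
  pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000, verbose = FALSE)
}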
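With its default settings, NormalizeData uses the "LogNormalize" method: each cell's counts are divided by that cell's total, multiplied by a scale factor of 10,000, and then transformed with log1p. The base R sketch below reproduces that arithmetic on a made-up 3 x 2 count matrix (the gene and cell names are hypothetical); it is meant to show the calculation, not to replace the Seurat call.

```r
# Toy count matrix: rows are genes, columns are cells (values are made up)
counts <- matrix(c(1, 0, 3,
                   2, 2, 0),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))

scale_factor <- 1e4

# Divide each column by its total, rescale, then take log(1 + x)
norm <- log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)
print(round(norm, 3))
```

Zero counts stay at zero after the transformation, which preserves the sparsity of the data.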
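Integrate the datasets
We first use the FindIntegrationAnchors function to identify anchors. It takes a list of Seurat objects as input; here we combine three of the objects into a reference dataset. We identify the anchors with the default parameters, for example a dataset dimensionality of 30.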
reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
Computing 2000 integration features
Scaling features for provided objects
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05s
Finding all pairwise anchors
| | 0 % ~calculating Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3514 anchors
Filtering anchors
Retained 2753 anchors
|+++++++++++++++++ | 33% ~57s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3499 anchors
Filtering anchors
Retained 2718 anchors
|++++++++++++++++++++++++++++++++++ | 67% ~29s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 6174 anchors
Filtering anchors
Retained 4540 anchors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02m 00s
pancreas.anchors
An AnchorSet object containing 20022 anchors between 3 Seurat objects
This can be used as input to IntegrateData.
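The numbers in the log above can be cross-checked with a little arithmetic. Three datasets give three pairwise comparisons, and the retained anchors of the three comparisons add up to exactly half of the reported total. This is consistent with each anchor pair being counted once per direction, although that reading is our assumption and is not stated in the output.

```r
# Retained anchors for the three pairwise comparisons, copied from the log above
retained <- c(2753, 2718, 4540)

n_pairs     <- choose(3, 2)    # number of dataset pairs among 3 datasets
total_pairs <- sum(retained)   # anchors summed over all pairs
reported    <- 20022           # total printed for the AnchorSet object

print(n_pairs)
print(total_pairs)
print(reported / total_pairs)
```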
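We then pass the identified anchors to the IntegrateData function, which returns a Seurat object. This object contains a new Assay named "integrated" that stores the integrated expression matrix, while the original expression matrix remains in the "RNA" Assay.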
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
Merging dataset 1 into 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 3 into 2 1
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
# If this step fails with an out-of-memory error (on Windows), raise the limit with memory.limit(size=56000)
pancreas.integrated
An object of class Seurat
36363 features across 5683 samples within 2 assays
Active assay: integrated (2000 features, 2000 variable features)
1 other assay present: RNA
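The dimensions of the integrated object follow directly from the inputs. As a quick sanity check, using only numbers printed earlier in this tutorial:

```r
# Cell counts of the three reference datasets, from the table of "tech" above
cells <- c(celseq = 1004, celseq2 = 2285, smartseq2 = 2394)
n_cells <- sum(cells)

# Features: all genes in the RNA assay plus the 2000 features in the integrated assay
n_features <- 34363 + 2000

print(n_cells)
print(n_features)
```

Both values match the summary printed for pancreas.integrated.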
Run the usual dimensionality reduction, clustering, and visualization on the integrated dataset
library(ggplot2)
library(cowplot)
library(patchwork)
# switch to integrated assay. The variable features of this assay are automatically
# set during IntegrateData
DefaultAssay(pancreas.integrated) <- "integrated"
# Run the standard workflow for visualization and clustering
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
# Use the group.by argument to color the cells by different metadata columns
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend()
p1 + p2
# Use split.by to draw one panel per sequencing technology
p3 <- DimPlot(pancreas.integrated, reduction = "umap", split.by = "tech")
p3
DimPlot(pancreas.integrated, reduction = "pca", group.by = "tech")
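For readers who want to see what the scaling and PCA steps above amount to, here is a simplified base R sketch on random data. It centers and scales each gene and then runs an ordinary PCA with prcomp. Seurat's own functions do more than this (for example, they use a different PCA implementation), so treat it as a conceptual stand-in; the matrix `expr` and its dimensions are invented for the example.

```r
set.seed(42)

# Hypothetical expression matrix: 5 genes (rows) x 20 cells (columns)
expr <- matrix(rnorm(5 * 20), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("cell", 1:20)))

# Center and scale each gene to mean 0 and standard deviation 1
scaled <- t(scale(t(expr)))

# PCA with cells as observations; the data are already centered and scaled
pca <- prcomp(t(scaled), center = FALSE, scale. = FALSE)

# Cell embeddings on the first two principal components
embeddings <- pca$x[, 1:2]
print(head(embeddings, 3))
```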
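Classify cell types using the integrated reference dataset
Seurat v3 also supports projecting reference data (or metadata) onto a query object. Although many of the steps are the same (both procedures begin by identifying anchors), there are two important differences between data transfer and data integration:
1) In data transfer, Seurat does not correct or modify the query expression data.
2) In data transfer, Seurat has an option (set by default) to project the PCA structure of a reference onto the query, instead of learning a joint structure with CCA. We generally suggest using this option when projecting data between scRNA-seq datasets.
Once the anchors are identified, we use the TransferData function to classify the query cells based on a vector of cell type labels from the reference dataset. TransferData returns a matrix with predicted IDs and prediction scores, which we can add to the query metadata.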
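The general idea of classifying query cells from reference labels can be sketched as a weighted vote. In the toy example below, each query cell has a weight for each reference cell, the weights are summed per label, and the label with the highest score wins. The weight matrix `w` and the labels are made up, and the way Seurat derives its weights from anchors is more involved, so this is only a conceptual sketch. The output column names mimic those of the prediction table.

```r
# Labels of four hypothetical reference cells
ref_labels <- c("alpha", "beta", "alpha", "delta")

# Hypothetical weights: rows are query cells, columns are reference cells
w <- matrix(c(0.7, 0.1, 0.2, 0.0,
              0.1, 0.6, 0.1, 0.2),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("q1", "q2"), NULL))

# One-hot encode the labels (reference cells x labels)
onehot <- sapply(unique(ref_labels), function(l) as.numeric(ref_labels == l))

# Score of each label for each query cell
scores <- w %*% onehot

predictions <- data.frame(
  predicted.id         = colnames(scores)[max.col(scores, ties.method = "first")],
  prediction.score.max = unname(apply(scores, 1, max)),
  row.names            = rownames(scores)
)
print(predictions)
```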
# Build the query dataset
pancreas.query <- pancreas.list[["fluidigmc1"]]
pancreas.query
An object of class Seurat
34363 features across 638 samples within 1 assay
Active assay: RNA (34363 features, 2000 variable features)
# Identify anchors between the reference and the query
pancreas.anchors <- FindTransferAnchors(reference = pancreas.integrated, query = pancreas.query, dims = 1:30)
Performing PCA on the provided reference using 2000 features as input.
Projecting cell embeddings
Finding neighborhoods
Finding anchors
Found 919 anchors
Filtering anchors
Retained 842 anchors
Warning message:
In UseMethod("depth") :
no applicable method for 'depth' applied to an object of class "NULL"
pancreas.anchors
An AnchorSet object containing 842 anchors between the reference and query Seurat objects.
This can be used as input to TransferData.
# Transfer the cell type labels from the reference to the query dataset
predictions <- TransferData(anchorset = pancreas.anchors, refdata = pancreas.integrated$celltype, dims = 1:30)
# Add the predictions to the query metadata
pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)
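Because the query cells also carry their original annotation labels, we can evaluate how well the predicted cell type annotations match them. In this example, the agreement is high: 617 of the 638 cells (about 96.7%) are labeled correctly.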
pancreas.query$prediction.match <- pancreas.query$predicted.id == pancreas.query$celltype
table(pancreas.query$prediction.match)
FALSE TRUE
21 617
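The agreement rate can be computed directly from the two counts in the table. The second half of the sketch shows, on made-up label vectors, how a cross-tabulation reveals which cell types get confused with each other; the `truth` and `predicted` vectors are hypothetical and not taken from the dataset.

```r
# Counts copied from the table above
match_counts <- c("FALSE" = 21, "TRUE" = 617)
accuracy <- match_counts[["TRUE"]] / sum(match_counts)
print(round(100 * accuracy, 1))

# Toy example of a confusion table (labels are made up)
truth     <- c("alpha", "alpha", "beta", "delta")
predicted <- c("alpha", "beta",  "beta", "delta")
confusion <- table(truth = truth, predicted = predicted)
print(confusion)
```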
To verify this further, we can examine canonical cell type marker genes for specific pancreatic islet cell populations.
table(pancreas.query$predicted.id)
acinar activated_stellate alpha
22 17 253
beta delta ductal
256 22 30
endothelial gamma macrophage
12 18 1
mast schwann
2 5
VlnPlot(pancreas.query, c("REG1A", "PPY", "SST", "GHRL", "VWF", "SOX10"), group.by = "predicted.id")
Code summary
rm(list=ls())
Sys.setenv(language='en')
options(stringsAsFactors = FALSE)
# Install and load the SeuratData package
devtools::install_github('satijalab/seurat-data')
library(SeuratData)
library(Seurat)
# List the datasets available through SeuratData
AvailableData()
# Download and install a specific dataset from SeuratData
InstallData("panc8")
library(panc8.SeuratData)
data('panc8')
panc8
head(panc8@meta.data)
table(panc8@meta.data$tech)
# celseq    celseq2 fluidigmc1     indrop  smartseq2
#   1004       2285        638       8569       2394
# Split the Seurat object by sequencing technology (the "tech" metadata column) to build separate datasets
pancreas.list <- SplitObject(panc8, split.by = "tech")
# Keep the data from the four sequencing technologies used in this tutorial
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
pancreas.list
for (i in seq_along(pancreas.list)) {
  pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
  pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000, verbose = FALSE)
}
# Integrate the datasets
reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
pancreas.anchors
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
# If this step fails with an out-of-memory error (on Windows), raise the limit with memory.limit(size=56000)
pancreas.integrated
# Run the usual dimensionality reduction, clustering, and visualization on the integrated dataset
library(ggplot2)
library(cowplot)
library(patchwork)
# switch to integrated assay. The variable features of this assay are automatically
# set during IntegrateData
DefaultAssay(pancreas.integrated) <- "integrated"
# Run the standard workflow for visualization and clustering
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
# Use the group.by argument to color the cells by different metadata columns
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend()
p1 + p2
p3 <- DimPlot(pancreas.integrated, reduction = "umap", split.by = "tech")
p3
DimPlot(pancreas.integrated, reduction = "pca", group.by = "tech")
# Classify cell types using the integrated reference dataset
# Build the query dataset
pancreas.query <- pancreas.list[["fluidigmc1"]]
pancreas.query
# Identify anchors between the reference and the query
pancreas.anchors <- FindTransferAnchors(reference = pancreas.integrated, query = pancreas.query, dims = 1:30)
pancreas.anchors
# Transfer the cell type labels from the reference to the query dataset
predictions <- TransferData(anchorset = pancreas.anchors, refdata = pancreas.integrated$celltype, dims = 1:30)
# Add the predictions to the query metadata
pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)
# Compare the predicted labels with the original annotations
pancreas.query$prediction.match <- pancreas.query$predicted.id == pancreas.query$celltype
table(pancreas.query$prediction.match)
# FALSE  TRUE
#    21   617
table(pancreas.query$predicted.id)
VlnPlot(pancreas.query, c("REG1A", "PPY", "SST", "GHRL", "VWF", "SOX10"), group.by = "predicted.id")