Study notes on the Seurat - Guided Clustering Tutorial (Compiled: January 11, 2022) from the Seurat website, https://satijalab.org/seurat/.
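1. Setting up the Seurat object
The dataset consists of Peripheral Blood Mononuclear Cells (PBMC) made available by 10X Genomics: 2,700 single cells sequenced on the Illumina NextSeq 500. It can be downloaded from https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz and, once decompressed, yields three files, the standard output of the 10X cellranger pipeline:
- barcodes.tsv: the barcodes of the 2,700 cells
- genes.tsv: the Ensembl ID and symbol of each gene, 32,738 genes in total
- matrix.mtx: the gene expression matrix, stored in the Matrix Market (MM) file format
An MM file has four parts (1. Header line; 2. Comment lines; 3. Size line; 4. Data lines).
Taking matrix.mtx as an example:
The first line is the header line. It contains an identifier followed by four text fields, in the form %%MatrixMarket object format field symmetry (here describing a real-valued sparse matrix).
The second line is a comment line, which starts with %.
The third line gives the number of rows, the number of columns, and the number of non-zero entries (here 32,738 genes, 2,700 cells, and 2,286,884 non-zero expression values across the whole matrix).
The fourth and all following lines each give the position of one matrix entry (its row and column) and the corresponding value.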
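The four parts of an MM file can be seen in a minimal sketch that writes a toy file and reads it back with readMM() from the Matrix package. The 3 x 4 matrix and its values are hypothetical, and the header fields (matrix coordinate real general) are an assumption about what a real-valued sparse matrix file typically declares.

```r
library(Matrix)

# Hypothetical toy matrix: 3 genes (rows) x 4 cells (columns), 5 non-zero counts
mm_lines <- c(
  "%%MatrixMarket matrix coordinate real general",  # 1. header line
  "%",                                              # 2. comment line
  "3 4 5",                                          # 3. size line: rows, columns, non-zero entries
  "1 1 4",                                          # 4. data lines: row, column, value
  "3 1 1",
  "2 2 6",
  "1 3 10",
  "3 4 2"
)

mm_file <- tempfile(fileext = ".mtx")
writeLines(mm_lines, mm_file)

# readMM() parses the file into a sparse matrix
toy <- readMM(mm_file)
print(dim(toy))
print(as.matrix(toy))
```

Entries that are not listed in the data lines are implicit zeros, which is why the file only needs one line per non-zero value.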
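The analysis starts with reading the data. Read10X() reads the output of the 10X cellranger pipeline and returns a unique molecular identifier (UMI) count matrix. Each value in this matrix is the number of molecules of a feature (i.e. gene; row) detected in a cell (column). CreateSeuratObject() then builds a Seurat object from the count matrix. The object serves as a container for both the data of a single-cell dataset (such as the count matrix) and the analysis results (such as PCA or clustering). The structure of the Seurat object is described and discussed at https://github.com/satijalab/seurat/wiki. For example, the count matrix is stored in pbmc[["RNA"]]@counts.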
library(dplyr)
library(Seurat)
library(patchwork)
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "D:/Bioinfo/Single_cell/Seurat/data/pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data)
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
Both resulting objects, the dgCMatrix and the Seurat object, are S4 objects. Their contents can be extracted and inspected with the @ and $ operators.
# Lets examine a few genes in the first thirty cells
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
## 3 x 30 sparse Matrix of class "dgCMatrix"
##
## CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
## TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
## MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
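A . in the matrix represents 0 (no molecules detected). Since most values in an scRNA-seq matrix are 0, Seurat uses a sparse-matrix representation whenever possible. For Drop-seq/inDrop/10X data this saves a large amount of memory and speeds up computation.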
dense.size <- object.size(as.matrix(pbmc.data))
dense.size
## 709591472 bytes
sparse.size <- object.size(pbmc.data)
sparse.size
## 29905192 bytes
dense.size/sparse.size
## 23.7 bytes
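The sizes above can be cross-checked with a little arithmetic, using only numbers already quoted in these notes (the size line of matrix.mtx and the two object.size() outputs). The sketch assumes the dense matrix stores each value as an 8-byte double; the remaining difference from the reported dense size is presumably the row and column names.

```r
# Numbers taken from the size line of matrix.mtx
n_genes   <- 32738
n_cells   <- 2700
n_nonzero <- 2286884

# Fraction of matrix entries that are non-zero
density <- n_nonzero / (n_genes * n_cells)

# Bytes needed for the values of a dense double matrix (8 bytes per entry)
dense_values_bytes <- n_genes * n_cells * 8

# Sizes reported by object.size() above
reported_dense  <- 709591472
reported_sparse <- 29905192
size_ratio <- reported_dense / reported_sparse

print(round(100 * density, 2))   # percent of entries that are non-zero
print(dense_values_bytes)
print(round(size_ratio, 1))
```

The ratio is a plain number: the "bytes" suffix in the printed output above appears only because dividing two object.size() results keeps their print format.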
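2. Standard pre-processing workflow
- Selecting and filtering cells based on QC metrics
- Normalizing and scaling the data
- Detecting highly variable features
2.1 QC and selecting cells for further analysis
QC metrics commonly used to filter cells:
- The number of unique genes detected in each cell
  - Low-quality cells or empty droplets often have very few genes
  - Cell doublets or multiplets may show an aberrantly high gene count
- Similarly, the total number of molecules detected in a cell (strongly correlated with the number of unique genes)
- The percentage of reads that map to the mitochondrial genome
  - Low-quality or dying cells often show extensive mitochondrial contamination
  - PercentageFeatureSet() computes the mitochondrial QC metric as the percentage of counts originating from a set of features
  - All genes whose names start with MT- are used as the set of mitochondrial genes
- The number of unique genes and the total number of molecules are calculated automatically by CreateSeuratObject() and stored in the meta data of the Seurat object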
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Show QC metrics for the first 5 cells
head(pbmc@meta.data, 5)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
## AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
## AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
## AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
## AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
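What these three QC columns amount to can be sketched in base R on a toy count matrix. This is an illustration of the idea, not Seurat's implementation; the counts, gene names, and cell names below are hypothetical.

```r
# Hypothetical counts: 4 genes (rows) x 3 cells (columns)
counts <- matrix(
  c(1, 1, 40, 8,    # cell1
    5, 5, 10, 0,    # cell2
    0, 0,  7, 3),   # cell3
  nrow = 4,
  dimnames = list(c("MT-CO1", "MT-ND1", "CD3D", "MS4A1"),
                  c("cell1", "cell2", "cell3"))
)

# Total molecules per cell (analogous to nCount_RNA)
n_count <- colSums(counts)

# Number of genes with at least one molecule per cell (analogous to nFeature_RNA)
n_feature <- colSums(counts > 0)

# Percentage of counts coming from genes whose name starts with "MT-"
mt_genes <- grep("^MT-", rownames(counts))
percent_mt <- colSums(counts[mt_genes, , drop = FALSE]) / n_count * 100

print(data.frame(n_count, n_feature, percent_mt))
```

The same pattern argument, "^MT-", is what the PercentageFeatureSet() call above uses to select the mitochondrial genes.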
Visualization: VlnPlot() draws violin plots of the QC metrics, and FeatureScatter() draws scatter plots of the relationship between two features.
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
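The filter used here removes cells that (1) have more than 2,500 or fewer than 200 unique features, or (2) have a mitochondrial count percentage above 5%. It is applied with subset(), which keeps the cells satisfying the complementary conditions.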
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
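The logic of this filter can be checked on a plain data frame. The first five rows below are the QC values of the five cells shown earlier; the last two rows are hypothetical cells added only to show what gets removed.

```r
qc <- data.frame(
  nFeature_RNA = c(779, 1352, 1129, 960, 521, 150, 3000),
  percent.mt   = c(3.0177759, 3.7935958, 0.8897363, 1.7430845, 1.2244898, 2.0, 8.5),
  row.names = c("AAACATACAACCAC-1", "AAACATTGAGCTAC-1", "AAACATTGATCAGC-1",
                "AAACCGTGCTTCCG-1", "AAACCGTGTATGCG-1",
                "hypothetical-low", "hypothetical-high")  # last two are made up
)

# Same three conditions as in the subset() call: a cell must pass all of them
keep <- qc$nFeature_RNA > 200 & qc$nFeature_RNA < 2500 & qc$percent.mt < 5

print(rownames(qc)[keep])    # cells that are kept
print(rownames(qc)[!keep])   # cells that are filtered out
```

All five real cells pass; the hypothetical cells fail because of too few features in one case and too many features plus a high mitochondrial percentage in the other.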
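2.2 Normalizing the data
By default, NormalizeData() uses the global-scaling normalization method "LogNormalize": each feature's count in a cell is divided by the total counts of that cell, multiplied by a scale factor (10,000 by default), and then log-transformed. The normalized values are stored in pbmc[["RNA"]]@data.
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
# The arguments above are the defaults, so this shorter call is equivalent
pbmc <- NormalizeData(pbmc)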
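The arithmetic behind "LogNormalize" can be written out in base R. This is a sketch on hypothetical counts, not Seurat's code; it assumes the log transform is the natural-log log1p(), i.e. log(1 + x), which keeps zeros at zero.

```r
# Hypothetical counts: 4 genes (rows) x 2 cells (columns)
counts <- matrix(
  c(1, 1, 40, 8,    # cell1, 50 molecules in total
    5, 5, 10, 0),   # cell2, 20 molecules in total
  nrow = 4,
  dimnames = list(c("MT-CO1", "MT-ND1", "CD3D", "MS4A1"), c("cell1", "cell2"))
)

scale_factor <- 10000

# Divide each column (cell) by its total, multiply by the scale factor, then log1p
relative <- sweep(counts, 2, colSums(counts), "/")
normalized <- log1p(relative * scale_factor)

print(round(normalized, 3))
```

For example, CD3D in cell1 becomes log(1 + 40 / 50 * 10000), and the zero count of MS4A1 in cell2 stays 0.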
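2.3 Identification of highly variable features (feature selection)
Next, compute the subset of features that show high cell-to-cell variation in the dataset (i.e. they are highly expressed in some cells and lowly expressed in others). Focusing on these highly variable genes in downstream analysis helps to highlight the biological signal in single-cell datasets. FindVariableFeatures() selects the features (2,000 by default), VariableFeatures() returns the selected features, and VariableFeaturePlot() visualizes them.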
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
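2.4 Scaling the data
Before dimensional reduction, ScaleData() scales the data so that each gene has a mean expression of 0 and a standard deviation of 1 across cells. The result is stored in pbmc[["RNA"]]@scale.data.
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
# Faster alternative: by default only the variable features are scaled. Each ScaleData() call
# overwrites scale.data, so run this instead of the call above, not after it.
# pbmc <- ScaleData(pbmc)
The vars.to.regress argument optionally regresses out unwanted sources of variation, such as cell cycle stage or mitochondrial contamination, to remove their effect on the data.
# Optional alternative; left commented out because it overwrites the scaling above and
# changes the downstream results.
# pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")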
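The effect of scaling (mean 0 and standard deviation 1 per gene) can be reproduced in base R with scale(). This is only a sketch on a hypothetical normalized matrix; ScaleData() itself does more, for instance it caps very large scaled values through its scale.max argument.

```r
# Hypothetical normalized expression: 2 genes (rows) x 4 cells (columns)
expr <- matrix(
  c(0, 1, 2, 3,
    2, 2, 4, 8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4))
)

# scale() works on columns, so transpose to scale each gene, then transpose back
scaled <- t(scale(t(expr)))

print(round(scaled, 3))
print(rowMeans(scaled))        # approximately 0 for every gene
print(apply(scaled, 1, sd))    # 1 for every gene
```

After scaling, highly expressed genes no longer dominate simply because of their larger absolute values, which is the reason for doing this before PCA.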
3. Linear dimensional reduction
RunPCA() performs PCA on the scaled data, here using the variable features as input.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative: NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
## Negative: LTB, IL7R, CKB, VIM, MS4A7
Visualization: VizDimLoadings(), DimPlot(), and DimHeatmap()
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
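To make the terms concrete, here is a sketch of PCA on a scaled genes x cells matrix using base R's prcomp(). Note that prcomp() is swapped in for illustration and is not the routine RunPCA() uses internally; the simulated data, with two groups of cells that differ in ten genes, is hypothetical.

```r
set.seed(42)

# Hypothetical expression: 50 genes (rows) x 30 cells (columns)
expr <- matrix(rnorm(50 * 30), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("cell", 1:30)))
# The first 15 cells express the first 10 genes more strongly
expr[1:10, 1:15] <- expr[1:10, 1:15] + 3

# Scale each gene to mean 0 and standard deviation 1 across cells
scaled <- t(scale(t(expr)))

# PCA with cells as observations and genes as variables
pca <- prcomp(t(scaled))

embeddings <- pca$x         # cells x PCs: the PCA scores of each cell
loadings   <- pca$rotation  # genes x PCs: the contribution of each gene to a PC

# Genes with the largest absolute loadings on PC1
print(head(sort(abs(loadings[, 1]), decreasing = TRUE), 5))
# Mean PC1 score of the two groups of cells
print(c(group1 = mean(embeddings[1:15, 1]), group2 = mean(embeddings[16:30, 1])))
```

The positive and negative gene lists printed for each PC above correspond to the largest loadings of either sign, and DimPlot() plots the per-cell scores.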
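4. Determining the "dimensionality" of the dataset
To overcome the extensive technical noise in any single feature of scRNA-seq data, Seurat clusters cells based on their PCA scores. Each PC essentially represents a "metafeature" that combines information across a correlated set of features. How many PCs should be chosen?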
# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot(pbmc, dims = 1:15)
ElbowPlot(pbmc)
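10 PCs are chosen here, but users are encouraged to consider the following:
- Dendritic cell and NK cell researchers may notice that genes strongly associated with PCs 12 and 13 define rare immune subsets (i.e. MZB1 is a marker for plasmacytoid DCs). However, these groups are so rare that, without prior knowledge, they are difficult to distinguish from background noise in a dataset of this size.
- Users are encouraged to repeat the downstream analyses with a different number of PCs (10, 15, or even 50). The results often do not differ dramatically.
- Users are advised to err on the higher side when choosing the number of PCs. For example, performing the downstream analyses with only 5 PCs significantly affects the results.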
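The quantity behind an elbow plot is the standard deviation of each PC. A sketch in base R, again with prcomp() swapped in and on hypothetical simulated data that contains two true components plus noise, shows how a sharp drop after the informative PCs appears.

```r
set.seed(1)

n_cells <- 100
n_genes <- 40

# Hypothetical data: two underlying signals shared across genes, plus unit-variance noise
signal <- matrix(rnorm(n_cells * 2), ncol = 2) %*%
  matrix(rnorm(2 * n_genes, sd = 3), nrow = 2)
noise <- matrix(rnorm(n_cells * n_genes), ncol = n_genes)

pca <- prcomp(signal + noise)

# Standard deviation per PC (what an elbow plot shows) and fraction of variance explained
stdev <- pca$sdev
var_explained <- stdev^2 / sum(stdev^2)

print(round(stdev[1:6], 2))
print(round(cumsum(var_explained)[1:6], 2))
```

In this toy case the standard deviation falls off sharply after the second PC; in real data the elbow is usually less clear-cut, which is why erring on the higher side is the safer choice.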
5. Clustering the cells
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95927
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8728
## Number of communities: 9
## Elapsed time: 0 seconds
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 2 3 2 1
## AAACCGTGTATGCG-1
## 6
## Levels: 0 1 2 3 4 5 6 7 8
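The resolution parameter of FindClusters() sets the granularity of the downstream clustering; increasing it leads to a larger number of clusters. Values between 0.4 and 1.2 typically give good results for single-cell datasets of around 3K cells, and the optimal resolution often increases for larger datasets. The cluster assignments can be retrieved with Idents().
6. Non-linear dimensional reduction (UMAP/tSNE)
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. Using the same PCs as in the clustering analysis as input to UMAP and tSNE is recommended.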
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
pbmc <- RunUMAP(pbmc, dims = 1:10)
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc, reduction = "umap")
saveRDS(pbmc, file = "D:/Bioinfo/Single_cell/Seurat/data/pbmc_tutorial.rds")
7. Finding differentially expressed features (cluster biomarkers)
# find all markers of cluster 2
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## IL32 2.892340e-90 1.2013522 0.947 0.465 3.966555e-86
## LTB 1.060121e-86 1.2695776 0.981 0.643 1.453850e-82
## CD3D 8.794641e-71 0.9389621 0.922 0.432 1.206097e-66
## IL7R 3.516098e-68 1.1873213 0.750 0.326 4.821977e-64
## LDHB 1.642480e-67 0.8969774 0.954 0.614 2.252497e-63
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## FCGR3A 8.246578e-205 4.261495 0.975 0.040 1.130936e-200
## IFITM3 1.677613e-195 3.879339 0.975 0.049 2.300678e-191
## CFD 2.401156e-193 3.405492 0.938 0.038 3.292945e-189
## CD68 2.900384e-191 3.020484 0.926 0.035 3.977587e-187
## RP11-290F20.3 2.513244e-186 2.720057 0.840 0.017 3.446663e-182
# find markers for every cluster compared to all remaining cells, report only the positive
# ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>%
    group_by(cluster) %>%
    slice_max(n = 2, order_by = avg_log2FC)
## # A tibble: 18 × 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
##
## 1 9.57e- 88 1.36 0.447 0.108 1.31e- 83 0 CCR7
## 2 3.75e-112 1.09 0.912 0.592 5.14e-108 0 LDHB
## 3 0 5.57 0.996 0.215 0 1 S100A9
## 4 0 5.48 0.975 0.121 0 1 S100A8
## 5 1.06e- 86 1.27 0.981 0.643 1.45e- 82 2 LTB
## 6 2.97e- 58 1.23 0.42 0.111 4.07e- 54 2 AQP3
## 7 0 4.31 0.936 0.041 0 3 CD79A
## 8 9.48e-271 3.59 0.622 0.022 1.30e-266 3 TCL1A
## 9 5.61e-202 3.10 0.983 0.234 7.70e-198 4 CCL5
## 10 7.25e-165 3.00 0.577 0.055 9.95e-161 4 GZMK
## 11 3.51e-184 3.31 0.975 0.134 4.82e-180 5 FCGR3A
## 12 2.03e-125 3.09 1 0.315 2.78e-121 5 LST1
## 13 3.13e-191 5.32 0.961 0.131 4.30e-187 6 GNLY
## 14 7.95e-269 4.83 0.961 0.068 1.09e-264 6 GZMB
## 15 1.48e-220 3.87 0.812 0.011 2.03e-216 7 FCER1A
## 16 1.67e- 21 2.87 1 0.513 2.28e- 17 7 HLA-DPB1
## 17 1.92e-102 8.59 1 0.024 2.63e- 98 8 PPBP
## 18 9.25e-186 7.29 1 0.011 1.27e-181 8 PF4
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
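The relationship between the p_val and p_val_adj columns in the tables above can be checked numerically. The sketch below assumes that p_val_adj is a Bonferroni correction over all 13,714 features of the object; the p-values are copied from the cluster 2 table, and the check only shows that this assumption reproduces the reported values.

```r
# Number of features in the Seurat object (from the object summary above)
n_features <- 13714

# Raw p-values and reported adjusted p-values from the cluster 2 marker table
p_val <- c(IL32 = 2.892340e-90, LTB = 1.060121e-86, CD3D = 8.794641e-71)
reported_adj <- c(IL32 = 3.966555e-86, LTB = 1.453850e-82, CD3D = 1.206097e-66)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_val_adj <- p.adjust(p_val, method = "bonferroni", n = n_features)

print(signif(p_val_adj, 7))
print(signif(p_val_adj / reported_adj, 6))   # ratios close to 1 mean the values agree
```

The other columns read as follows: pct.1 and pct.2 are the fractions of cells in which the gene is detected in the first and second group, and avg_log2FC is the log2 fold change of average expression between the two groups.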
Visualization functions: VlnPlot(), FeaturePlot(), RidgePlot(), CellScatter(), and DotPlot()
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
## you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
pbmc.markers %>%
    group_by(cluster) %>%
    top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
8. Identifying and annotating cell types
Cluster ID | Markers | Cell Type |
---|---|---|
0 | IL7R, CCR7 | Naive CD4+ T |
1 | CD14, LYZ | CD14+ Mono |
2 | IL7R, S100A4 | Memory CD4+ |
3 | MS4A1 | B |
4 | CD8A | CD8+ T |
5 | FCGR3A, MS4A7 | FCGR3A+ Mono |
6 | GNLY, NKG7 | NK |
7 | FCER1A, CST3 | DC |
8 | PPBP | Platelet |
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
saveRDS(pbmc, file = "D:/Bioinfo/Single_cell/Seurat/data/pbmc3k_final.rds")
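The renaming step is essentially a lookup in a named vector: the names are the old cluster IDs and the values are the new labels. A base-R sketch of that lookup, using the cluster IDs of the first five cells shown in the clustering output, looks like this (it illustrates the mapping only, not how RenameIdents() is implemented):

```r
# New labels, named by the original cluster IDs 0..8
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T",
                     "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- as.character(0:8)

# Cluster IDs of the first five cells, as printed by head(Idents(pbmc), 5)
cell_clusters <- c("AAACATACAACCAC-1" = "2", "AAACATTGAGCTAC-1" = "3",
                   "AAACATTGATCAGC-1" = "2", "AAACCGTGCTTCCG-1" = "1",
                   "AAACCGTGTATGCG-1" = "6")

# Look up the label for each cell by its cluster ID
cell_types <- setNames(unname(new.cluster.ids[cell_clusters]), names(cell_clusters))
print(cell_types)
```

Indexing by the character ID rather than by position matters here, because cluster 0 is the first element of the vector and positional indexing would shift every label by one.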
References:
https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
https://people.sc.fsu.edu/~jburkardt/data/mm/mm.html
https://www.jianshu.com/p/03b94b2034d5