Seurat - Guided Clustering Tutorial, July 16, 2020
https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html
下载数据:
https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
需要这三个文件:
barcodes.tsv、genes.tsv、matrix.mtx
本次的数据分析流程,选择了网上10X Genomics上 PBMC 外周血单核细胞(Peripheral Blood Mononuclear Cells )的公开数据。这里使用了 Illumina NextSeq 500仪器测序了 2700 个细胞。raw data可以在这里找到(https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz)。
【Read10X】函数读取10X [cellranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) pipeline的输出文件,返回一个 UMI count matrix(unique molecular identified)。在这个matrix里面的数值代表了每个特征(feature)(例如基因,行)的分子数目,这些特征可在每个细胞(列)里面被检测到。
接着我们使用count matrix去创造一个【Seurat】对象。这个对象被用作单细胞数据集的,装载数据(类似count matrix)和分析(类似PCA,或聚类结果)的容器。count matrix 被储存在【pbmc[["RNA"]]@counts】。
rm(list=ls())
options(stringsAsFactors = F)
library(dplyr)
library(Seurat)
library(patchwork)
Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")
Environment
History
Connections
List
mportDatasct
GlobalEnvironment
Data
LargedgcMatrix(88392600elements,.
pbmc.data
[1:22868841701661783263634104
0412492494495
.ai:int
[1:2701]
78121333264422447465528631171017634
int
..ap
[1:2]327382700
.@DiM:int
.@Dimnames:Listof2
.S:chr[1:32738]"MIR1302-10""AM138A""R4FS""P11-34P13.7"
AAACATT...
..:chr[1:2700]"AAACATACAACCAC-1""AAACATTGAGCTAC-1"
.x:huM[1:2286884]11211114111.
afactors:listO
Initialize the Seurat object with the raw (non-normalized data).
创建Seurat对象
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
pbmc
An object of class Seurat
13714 features across 2700 samples within 1 assay
Active assay: RNA (13714 features, 0 variable features)
Standard pre-processing workflow
质量控制指标(QC metrics)和筛选细胞作进一步分析:
Seurat允许根据任何用户定义的标准过滤单元格(cells)。
通常使用的QC指标:
每个细胞内被检测到特有的基因(****unique genes****)的数目,unique feature会因为数据质量而调整。
▲低质量的细胞或空的液滴一般含有较少的基因;
▲细胞双重态或多重态可能呈现异常高的基因count值
细胞内被监测到的分子的总数目(与unique genes高度相关)
匹配到线粒体基因组的read的百分比
▲低质量/将要死去的细胞经常呈现过度的线粒体污染情况;
▲使用
function计算线粒体QC metrics, 此function可以计算源于feature集的count值的百分比
▲使用所有以【MT-】为起始的基因集合,作为线粒体基因集
The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
QC metrics 在 Seurat :head([email protected])) QC指标存储
▲过滤掉拥有 feature count>2500, or < 200 的细胞;
▲过滤掉含有> 5% 的线粒体 count值的细胞
Visualize QC metrics as a violin plot
nCount RNA: 这些基因数目一共测到的count数目,也就是以前版本的UMI数目
nFeature RNA: 每个细胞所检测到的基因数目,也就是以前版本的nGene
percent.mt: 每个细胞所检测到的线粒体基因
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
nFeature_RNA
nCountRNA
percent.mt
15000
3000
10000
2000
10
5000
1000
5
ouo
.
.
ldentity
Identity
Identity
FeatureScatter is typically used to visualize feature-feature relationships, but can be used
for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
-0.13
0.95
20
3000
NEeJnee-u
15
wtuasjad
2000
ldentity
ldentity
pbmc3k
pbmc3k
10
1000
15000
5000
10000
5000
15000
10000
nCountRNA
nCountRNA
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc
An object of class Seurat
13714 features across 2638 samples within 1 assay
Active assay: RNA (13714 features, 0 variable features)
Normalizing the data
在去除了数据集中不需要的细胞后,下一步就是标准化数据。默认设置下,使用全局-规模(global-scaling)的标准化方法 "LogNormalize",这方法按总体表达量,对每一个细胞的特征表达量进行标准化,乘以一个比例因子(scale factor)(默认为10,000),然后log转换此结果。
标准化后的值储存在【pbmc[["RNA"]]@data】。
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
▶ 为了更为清晰,在之前的代码(和之后的代码)中,他们在函数调用(function call)中给确定的参数设置提供了默认值。然而,这些设置不是必须的,而且同样的结果可由以下代码实现:
pbmc <- NormalizeData(pbmc)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**|
Identification of highly variable features (feature selection)
下一步,计算数据集里面基因子集,该子集呈现出高度的细胞间变异(在有些细胞内高表达,在其他细胞内低表达)。还发现了专注在下游分析内的基因可帮助突出单细胞数据内的生物信号。
Seurat3 ,根据旧有版本升级了一下,通过使用【FindVariableFeatures
】函数,直接对单细胞数据中固有内的均值方差(mean-variance)的关系建模。在默认设置下,每个数据集返回2,000个feature。这些数据会被用于下游数据分析,如PCA。
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**|
Calculating feature variances of standardized and clipped values
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**|
Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
top10
[1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" "PF4" "FTH1"
[9] "GNG11" "S100A8"
plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
When using repel, set xnudge and ynudge to 0 for optimal results
plot1 + plot2
Warning messages:
1: Transformation introduced infinite values in continuous x-axis
2: Transformation introduced infinite values in continuous x-axis
PPBP
S100A9
LYZ
IGLL5
aoupieApezipjepueis
oueueApazipjepueis
GNLY
FTH1
PF4
GNG11S100A8
Non-yariablecount11714
Non-variablecount11714
Variablecount:2000
Variablecount2000
1e-021e+001e+02
1e-021e+001e+02
AverageExpression
AverageExpression
Scaling the data
使用线性转换('scaling'),它是一个降维技巧(如PCA)的标准预处理步骤,【ScaleData】函数:
▲转换每个基因的表达量,使得细胞之间的平均表达量为0;
▲按比例缩小每个基因的表达量,使得细胞之间的方差(variance)为1;
☼此步骤给予了下游分析的同等权重(equal weight),使得高表达基因不会占主要地位;
▲改结果会被储存在【pbmc[["RNA"]]@scale.data】
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
Centering and scaling data matrix
|=======================================================================| 100%
▶如果这一步骤太久了,可以这么加快?
按比例缩小数据在Seurat的工作路线里面是必须的一步,但仅在基因这方面上会被用作输入数据。因此,【ScaleData】内的默认设置仅仅是为了对之前定义好的变量特征进行按比例缩小(默认为2000个基因)。为了进行上一步,可删掉类似之前函数调用里的【features】参数。
pbmc <- ScaleData(pbmc)
Centering and scaling data matrix
|==========================================================| 100%
pbmc
An object of class Seurat
13714 features across 2638 samples within 1 assay
Active assay: RNA (13714 features, 2000 variable features)
2 dimensional reductions calculated: pca, umap
▶应该怎么去除一些不需要的变量来源,就像在Seurat v2那样?
在【Seurat v2
】里面,我们也会使用【ScaleData】函数来去除单细胞数据集里面一些不需要的变量来源。例如,我们可以“退回(regress out)”与例如细胞周期阶段相关的异质性,或与线粒体污染相关的异质性。这些特征会在【Seurat v3】 中的【ScaleData】中得到认证。
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
Regressing out percent.mt
|| 100%
Centering and scaling data matrix
|| 100%pbmc
An object of class Seurat
13714 features across 2638 samples within 1 assay
Active assay: RNA (13714 features, 2000 variable features)
2 dimensional reductions calculated: pca, umap
Perform linear dimensional reduction
对scaled过的数据进行PCA分析。默认设置上,只有之前的确定变量的feature被用于输入数据,但可被【features】参数定义,如果你希望选择一个不同的子集。
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
PC_ 1
Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP
PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD
Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW
CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A
MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3
PC_ 2
Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB
BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
Negative: NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2
CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC
PC_ 3
Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA
HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8
PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B
Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1
NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA
PC_ 4
Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A
SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC
GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1
Negative: VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL
AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7
LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6
PC_ 5
Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2
GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5
RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX
Negative: LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1
PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B
FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB
pbmc
An object of class Seurat
13714 features across 2638 samples within 1 assay
Active assay: RNA (13714 features, 2000 variable features)
1 dimensional reduction calculated: pca
Seurat提供了几个有用的,决定 PCA 的可视化细胞和feature的方法,包括:【VizDimReduction】、【DimPlot】、【DimHeatmap】
Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
PC_ 1
Positive: CST3, TYROBP, LST1, AIF1, FTL
Negative: MALAT1, LTB, IL32, IL7R, CD2
PC_ 2
Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
Negative: NKG7, PRF1, CST7, GZMB, GZMA
PC_ 3
Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
Negative: PPBP, PF4, SDPR, SPARC, GNG11
PC_ 4
Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
Negative: VIM, IL7R, S100A6, IL32, S100A8
PC_ 5
Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
Negative: LTB, IL7R, CKB, VIM, MS4A7
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
CD79A
TYi
48
LST
HLA-DOA1
HLA-DQB1
HLA-DRA
LINC00926
FCN1
CD79B
APMAP
S100A9
ITC38
TYMP
FCER1G
HOPX
CFD
SRGN
LGALS1
AKR1C3
CLIC3
S100A8
CTSS
品
CD247-
CCL5
IFITM3
FCGR3A
p
GZMH
CCL4
PSAP
SPON2
IFI30
RE
C8L1
CTSW
S100A11
FGFBP2
NPC2
GZMA
GRN
GZMB
L8SP
CST
MALAT11
0.05
.0.1
0.05
0.0
0.10
0.00
0.10
PC_1
PC2
DimPlot(pbmc, reduction = "pca")
10
口
pbmc3k
10
PC_1
可使用【DimHeatmap】对数据集内的异质性的主要数据作简单探索,在决定哪个PC可包含在内被用作下有分析时,此函数显得特别有用。细胞和feature根据PCA的打分值来设置。设置【cells】为一数值绘制出范围两端的“极端”细胞,这可显著地加速大数据集图表的绘制。尽管这是一监督算法分析,发现这仍是用于探索相关feature特征的有价值的工具。
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
PC
MALAT1
LTB
1L32
IL7R
CD2
B2M
ACAP1
CD27
STK17A
CTSW
CD247
GILMAP5
A&
SELL
CTSS
S100A8
LGALS1
CFD
FCER1G
TYMP
S100A9
FCN1
TYROBP
CST3
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
5100A
PC5
PC6
CDM
PC8
PC9
cLEC10
ICB9
AE
WMCA2
PPPRY
UAO1
PC11
TE
PC14
民间
Determine the 'dimensionality' of the dataset
为了克服scRNA-seq数据里面单一特征的广泛技术性噪音,Seurat是基于它们的 PCA 分数聚集细胞的,每一个PC 本质上而言代表了“metafeature”,也就是那些包含相关特征数据集的信息。因此,前几个主要的成分代表了数据集强而有力的压缩。问题是,我们应该囊括多少个成分在内呢?10个? 20个?100个?
在 Macosko的文章内,使用了受JackStraw 程序启发的重新取样测试( resampling test )。随机重新排列了数据集(默认设置为1%),然后再次跑了 PCA,重建特征打分值(feature scores)下的零分布(null distribution),再重复此步骤。鉴定了“显著的”PC作为那些拥有低P-value特征的强有力的富集。
NOTE: This process can take a long time for big datasets, comment out for expediency. More
approximate techniques such as those implemented in ElbowPlot() can be used to reduce
computation time
pbmc <- JackStraw(pbmc, num.replicate = 100)
|| 100% elapsed=02m 48s
|| 100% elapsed=00s
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
pbmc
An object of class Seurat
13714 features across 2638 samples within 1 assay
Active assay: RNA (13714 features, 2000 variable features)
1 dimensional reduction calculated: pca
【JackStrawPlot】函数为每一个PC的P-value的分布与一均匀分布( uniform distribution)的对比提供了可视化的工具(虚线)。“显著的”PC会显示带有较低的P-value的特征的强富集(实线上的虚线)。在此情况下,在第一个10~12 PC之后,有一个显著的断崖式下降。
JackStrawPlot(pbmc, dims = 1:15)
Warning message:
Removed 23496 rows containing missing values (geom_point).
0.3
pC:p-yalue
PC1:9.12e-161
PC2:1.47e-114
[oooDunjleonajoayl
PC3:1e-32
PC4:3.04e-58
0.2
PC5:5.05e-56
PC6:1.57e-55
PC7:4.05e-24
PC8:1.48e-11
PC9:5.22e-12
PC10:6.33e-08
0.1
PC11:6.33e-08
PC12:0.000101
PC13:000253
PC14:0.133
PC15:1
0.0
0.075
0.000
0.100
0.025
0.050
Empirical
另一种启发式方法(heuristic method)可以生成一个“Elbow plot”:主成分的排列是基于差异的百分比,可其差异可通过【ElbowPlot】函数来解释。在此例子中,可观察到转折点位于PC9~10,这意味着真正信号的绝大多数会在第一个 10PC 处被捕捉。
ElbowPlot(pbmc)
uoneInenpjepueis
20
15
鉴定一个数据集的真正维度--对于 使用者而言是一个挑战/未知之事。因此我们建议使用者去考虑着三个方法:
1、带有更多的监督的方法,探索PC 以用于决定异质性的相关来源,然后这可以用于类似与GSEA的共同使用上。
2、基于随机的零模型进行一统计学测试,但这对于大型数据而言是耗时的,而且不会返回清晰的 PC 截点(cutoff)。
3、启发式的方法,且是被广泛使用的,可被立马计算出来的。
在这个例子中,这三种方法都可产生相似的结果,但我们可能已经选择了任何位于PC 7-12的点作为合理的截点。
在这里,我们选择了10,但鼓励使用者考虑以下几点:
1、DC和NK的狂热粉可能认出了这些基因与PC12 、13高度相关,因此决定了稀有免疫亚群(例如MZB1是类浆细胞DC(plasmacytoid DCs)的一个 marker。 然而这些群组太稀少,使得这样大小的数据集在不带有先前的认识下很难从背景噪音中被区分开。
2、我们鼓励使用者选择PC不同的数目重复下游分析(10、15、甚至50!),当你进行观察时,这结果一般不会显著差异过大。
3、我们建议使用者在选择这个参数时宁可多犯错。例如,仅带有5个PC 进行下游分析时也不会显著地、或负面影响结果。
Cluster the cells
Seurat v3 应用了以画图为基础的聚类方法,基于初始策略建立的(Macosko et al)。更为重要的是,距离度量使得聚类分析(基于之前鉴定出的PC)仍然维持一致。然而,我们的方法,即将细胞距离矩阵分区为小聚群,这一能力有所提升。我们的方法是受到近期的一手稿所启示,那个运用了对于scRNA seq数据 [SNN-Cliq, Xu and Su, Bioinformatics, 2015] 和CyTOF数据 [PhenoGraph, Levine et al., Cell, 2015]以画图为基础的聚类方法。简而言之,这些方法将细胞嵌入了图的结构内--如的k最近邻算法图 ( K-nearest neighbor graph****,KNN) ,其带有的edges被画在细胞之间,带有相似的特征表达模式,可被用于将一图划分为高度互通的“准小群体”或“社团”内。
在PhenoGraph里,在PCA的空间里面,我们第一次基于欧几里距离构建了KNN图,基于它们本地近邻之间共有的重叠部分,改善了任何两个细胞之间的边权(edge weights)。这一步通过【FindNeighbors】函数实现,且将先前定义好的数据集(起初10个PC)的维度作为输入。
为了聚类好细胞,我们下一步运用模块优化技术,类似默认的算法(Louvain algorithm)或SLM [SLM, Blondel et al., Journal of Statistical Mechanics],以优化标准模块函数为目标地,迭代地将细胞分组。【FindClusters】函数使用了这一步骤,且包含了一个清晰度的参数,它可以设置下游聚类的间隔距离(granularity),使用上升的数值获得大量的集群。我们发现为越3000下拨的单细胞数据集设置这一参数约 0.4-1.2 ,典型地会返回一个好的结果。最优的分辨率一般会因为更大的数据集而升高。这集群可以通过【Idents】函数被找到。
pbmc <- FindNeighbors(pbmc, dims = 1:10)
Computing nearest neighbor graph
Computing SNN
pbmc <- FindClusters(pbmc, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Ec
Number of nodes: 2638
Number of edges: 96033
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**|
Maximum modularity in 10 random starts: 0.8720
Number of communities: 9
Elapsed time: 0 seconds
Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
1 3 1 2
AAACCGTGTATGCG-1
6
Levels: 0 1 2 3 4 5 6 7 8
Run non-linear dimensional reduction (UMAP/tSNE)
Seurat提供了多种非线性多降维的技巧,如 tSNE 和 UMAP,用于可视化和探索这些数据集。这些算法的目标是学习潜在的数据多样性为了将相似的细胞放置在低维度的空间。在上述以画图为基础的集群中的细胞应该会共存于这些降维图中。作为UMAP 和 tSNE 的输入数据,我们建议使用相同的 PC 作为聚类分群分析的输入数据。
If you haven't installed UMAP, you can do so via reticulate::py_install(packages ='umap-learn')
pbmc <- RunUMAP(pbmc, dims = 1:10)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
11:20:46 UMAP embedding parameters a = 0.9922 b = 1.112
11:20:46 Read 2638 rows and found 10 numeric columns
11:20:46 Using Annoy for neighbor search, n_neighbors = 30
11:20:46 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**|
11:20:46 Writing NN index file to temp file C:\Users\Dell\AppData\Local\Temp\RtmpU1dvCq\file306837978f5
11:20:46 Searching Annoy index using 1 thread, search_k = 3000
11:20:47 Annoy recall = 100%
11:20:47 Commencing smooth kNN distance calibration using 1 thread
11:20:47 Initializing from normalized Laplacian + noise
11:20:48 Commencing optimization for 500 epochs, with 105386 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**|
11:20:53 Optimization finished
note that you can set
label = TRUE
or use the LabelClusters function to help label#individual clusters
DimPlot(pbmc, reduction = "umap")
10
NaiveCD4T
MemorvCD4T
CD14+Mono
三
CD8T
FCGR3A+Mono
NK
DC
Platelet
UMAP1
saveRDS(pbmc, file = "../pbmc_tutorial.rds")
Finding differentially expressed features (cluster biomarkers)
Seurat可帮助寻找那些通过差异表达决定分群的 markers。默认设置下,对比于其他的细胞,其鉴定了单一群落里面的正和负标记(positive and negative markers)(在指定的【ident.1】函数里)。【FindAllMarkers】函数自动化完成所有小集群的过程,但你可以测试小集群的分组,或其他细胞与细胞之间的,或针对所有细胞的分组。
【min.pct】参数或是在细胞的2个分组中的最小百分比处需要一个被检测到的特征,【thresh.test】参数需要一个两组之间存在一定差异表达量(通常而言)的特征。你可以设置这俩为0,但随着时间上的剧烈上升,由于这会检测到大量的特征,这些特征可能是高度无偏的。作为另一种去加快计算力的选择,可设置【max.cells.per.ident】。这会降低每个特征类别的取样,使其细胞数不超过设置的细胞数。然而一般而言,这会造成性能(power)的损失,这速度的提升可能会非常的显著,且大部分的高度差异显著特征将可能仍然上升至最大。
find all markers of cluster 1
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
head(cluster1.markers, n = 5)
p_val avg_logFC pct.1 pct.2 p_val_adj
IL32 1.894810e-92 0.8373872 0.948 0.464 2.598542e-88
LTB 7.953303e-89 0.8921170 0.981 0.642 1.090716e-84
CD3D 1.655937e-70 0.6436286 0.919 0.431 2.270951e-66
IL7R 3.688893e-68 0.8147082 0.747 0.325 5.058947e-64
LDHB 2.292819e-67 0.6253110 0.950 0.613 3.144372e-63
find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s
head(cluster5.markers, n = 5)
p_val avg_logFC pct.1 pct.2 p_val_adj
FCGR3A 7.583625e-209 2.963144 0.975 0.037 1.040018e-204
IFITM3 2.500844e-199 2.698187 0.975 0.046 3.429657e-195
CFD 1.763722e-195 2.362381 0.938 0.037 2.418768e-191
CD68 4.612171e-192 2.087366 0.926 0.036 6.325132e-188
RP11-290F20.3 1.846215e-188 1.886288 0.840 0.016 2.531900e-184
find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
Calculating cluster 0
|| 100% elapsed=01s
Calculating cluster 1
|| 100% elapsed=00s
Calculating cluster 2
|| 100% elapsed=02s
Calculating cluster 3
|| 100% elapsed=01s
Calculating cluster 4
|| 100% elapsed=01s
Calculating cluster 5
|| 100% elapsed=02s
Calculating cluster 6
|| 100% elapsed=02s
Calculating cluster 7
|| 100% elapsed=02s
Calculating cluster 8
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
Registered S3 method overwritten by 'cli':
method from
print.boxx spatstat
A tibble: 18 x 7
Groups: cluster [9]
p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
1 1.96e-107 0.730 0.901 0.594 2.69e-103 0 LDHB
2 1.61e- 82 0.922 0.436 0.11 2.20e- 78 0 CCR7
3 7.95e- 89 0.892 0.981 0.642 1.09e- 84 1 LTB
4 1.85e- 60 0.859 0.422 0.11 2.54e- 56 1 AQP3
5 0. 3.86 0.996 0.215 0. 2 S100A9
6 0. 3.80 0.975 0.121 0. 2 S100A8
7 0. 2.99 0.936 0.041 0. 3 CD79A
8 9.48e-271 2.49 0.622 0.022 1.30e-266 3 TCL1A
9 2.96e-189 2.12 0.985 0.24 4.06e-185 4 CCL5
10 2.57e-158 2.05 0.587 0.059 3.52e-154 4 GZMK
11 3.51e-184 2.30 0.975 0.134 4.82e-180 5 FCGR3A
12 2.03e-125 2.14 1 0.315 2.78e-121 5 LST1
13 7.95e-269 3.35 0.961 0.068 1.09e-264 6 GZMB
14 3.13e-191 3.69 0.961 0.131 4.30e-187 6 GNLY
15 1.48e-220 2.68 0.812 0.011 2.03e-216 7 FCER1A
16 1.67e- 21 1.99 1 0.513 2.28e- 17 7 HLA-DPB1
17 7.73e-200 5.02 1 0.01 1.06e-195 8 PF4
18 3.68e-110 5.94 1 0.024 5.05e-106 8 PPBP
Warning message:
...
is not empty.We detected these problematic arguments:
needs_dots
These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?
Seurat有多种差异表达的测试,它们可以在【test.use】参数内被设置。例如,ROC 测试会给任何独立的maker(范围由0-随机,到1-完美)返回一个“分类能力”。
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
我们涵盖了很多可视化marker表达量的工具。【VlnPlot】显示了集群里面表达量的概率分布,【FeaturePlot】(可用于可视化 tSNE 或 PCA图的特征表达量)是最经常使用的可视化工具。我们建议探索【 RidgePlot】、
【CellScatter
】和【DotPlot
】 作为另一种方法来探索数据集。
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
MS4A1
CD79A
[AUOISSEJDX3
AAETUOISseJdx3
Identity
Identity
you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
NKG7
PF4
30
30
AADUOISSAJDX3
en的uoissaJdx3
10
Identity
Identity
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
CD3E
GNLY
MS4A1
10
寸wweC
10
10
10
10
PC
PC_1
PC1
FCER1A
CD14
FCGR3A
10
10
3N
wwro
0
0+
0
0
0
10
10
10
10
PC_1
PC_1
PC_1
LYZ
PPBP
CD8A
10
10
.0
0
柯d
10
10
10
PC_1
PC_1
PC
【DoHeatmap】对给予的细胞和特征生成一热图。在这个情况下,我们对每个集群画出前20个marker(如果marker量少于20就全部的画上)
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
州I国
HIST1H22
Assigning cell type identity to clusters
幸运的是,在这个数据集中,对于已知的细胞类型,我们可以使用权威的marker去轻易地吻合无偏倚的聚类
CellType
clusterID
Markers
IL7R,CCR7
NaiveCD4+T
1L7R.S100A4
MemoryCD4+
CD14+Mono
CD14,LYZ
B
MS4A1
CD8+T
CD8A
FCGR3A+Mono
FCGR3AMS4A7
NK
GNLY.NKG7
7
DC
FCER1ACST3
Platelet
PPBP
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
10
FCD14天MORO
2:0
三
DC
Platelet
FCGR3A+MonN
CD8
-5
NateCD4.
Meno
CD45
10
-10
UMAP1
saveRDS(pbmc, file = "../pbmc3k_final.rds")