1 Start R
(base) ┌─[shpc_a98ef85f8f@SHPC-Pro-1]─[~/SingleronTest/ligang/data/hg19/pbmc3k]
└──╼ $R
R version 4.0.2 (2020-06-22) -- "Taking Off Again"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
2 Install Seurat
> install.packages("Seurat")
> library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Warning message:
package 'dplyr' was built under R version 4.0.5
> library(Seurat)
Attaching SeuratObject
> library(patchwork)
Warning message:
package 'patchwork' was built under R version 4.0.3
> library(tidyverse)
Registered S3 method overwritten by 'cli':
method from
print.boxx spatstat.geom
-- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.5 v purrr 0.3.4
v tibble 3.1.5 v stringr 1.4.0
v tidyr 1.1.4 v forcats 0.5.1
v readr 2.0.2
-- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Warning messages:
1: package 'ggplot2' was built under R version 4.0.5
2: package 'purrr' was built under R version 4.0.3
3: package 'stringr' was built under R version 4.0.5
4: package 'forcats' was built under R version 4.0.3
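3 Download the data
The Peripheral Blood Mononuclear Cells (PBMC) dataset is provided on the 10X Genomics dataset page. It contains 2,700 single cells sequenced on the Illumina NextSeq 500 platform.
PBMCs are primary cells from a healthy donor with relatively little RNA (around 1 pg RNA/cell). On the Illumina NextSeq 500, 2,700 single cells were detected, with about 69,000 reads per cell.
tar -xvf pbmc3k_filtered_gene_bc_matrices.tar
# The folder contains three files. They all sit in the current working directory, so "./" is enough as the file path.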
barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz
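matrix.mtx: a file in MatrixMarket format; for details see http://math.nist.gov/MatrixMarket/formats.html
Only non-zero values are stored;
comment lines start with %;
the first non-comment line gives the total number of rows, columns, and stored entries;
every following line gives the row index, column index, and value of one entry.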
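To make the layout concrete, here is a minimal sketch that parses a tiny hand-written MatrixMarket file in R. The toy values and the object names (`mtx_lines`, `entries`, `counts`) are made up for illustration; on real data `Read10X()` takes care of this.

```r
library(Matrix)

# A toy MatrixMarket file: 3 genes x 4 cells with 5 non-zero entries (made-up values)
mtx_lines <- c(
  "%%MatrixMarket matrix coordinate integer general",
  "% comment lines start with a percent sign",
  "3 4 5",
  "1 1 4",
  "1 3 10",
  "2 4 1",
  "3 2 6",
  "3 4 2"
)

# Drop comment lines; the first remaining line is the size header
body <- mtx_lines[!startsWith(mtx_lines, "%")]
header <- as.integer(strsplit(body[1], " ")[[1]])  # rows, columns, stored entries

# Every other line is one (row, column, value) triplet
entries <- read.table(text = body[-1], col.names = c("row", "col", "value"))

# Rebuild the sparse count matrix from the triplets
counts <- sparseMatrix(i = entries$row, j = entries$col,
                       x = as.numeric(entries$value), dims = header[1:2])
print(counts)
```

Printing `counts` shows the same dotted layout as the real count matrix, with "." in every position that has no entry in the file.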
4 Read the 10X data
> pbmc.data <- Read10X('./')
> dim(pbmc.data)
[1] 32738 2700
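The raw matrix contains 32,738 genes and 2,700 cells. The figure of 13,714 genes reported for the Seurat object is what remains after CreateSeuratObject() drops genes detected in fewer than 3 cells (min.cells = 3).
What does the count matrix look like?
# First look at the counts of three genes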
> pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
3 x 30 sparse Matrix of class "dgCMatrix"
[[ suppressing 30 column names 'AAACATACAACCAC-1', 'AAACATTGAGCTAC-1', 'AAACATTGATCAGC-1' ... ]]
CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
> summary(colSums(pbmc.data))
Min. 1st Qu. Median Mean 3rd Qu. Max.
548 1758 2197 2367 2763 15844
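The summary above shows the total number of molecules (UMI counts) per cell; to see how many genes are detected in each cell, use colSums(pbmc.data > 0) instead.
In the matrix, "." stands for 0, i.e. no molecules detected. A zero can mean either that the gene is truly not expressed or that its transcripts were simply not captured.
Because most values in single-cell sequencing data are 0, Seurat stores the count matrix as a sparse matrix, which saves a lot of memory.
Let's compare the size of the sparse representation with a dense matrix that stores every 0 explicitly.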
> dense.size <- object.size(as.matrix(pbmc.data))
> dense.size
709591472 bytes
> sparse.size <- object.size(pbmc.data)
> sparse.size
29905192 bytes
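dense.size is the matrix with every zero stored explicitly: 709591472 bytes.
sparse.size is the sparse matrix, where zeros are shown as ".": 29905192 bytes.
The dense form is about 23.7 times larger, so a sparse matrix is a very space-efficient way to store single-cell expression values.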
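The ratio can be checked directly, and the same effect is easy to reproduce on a small random matrix. The sketch below assumes a made-up matrix with 5% non-zero entries; the names `toy` and `size_ratio` are illustrative only.

```r
library(Matrix)

# Ratio of the two sizes reported in the session above
709591472 / 29905192

# Toy comparison on a random matrix with 5% non-zero entries (made-up data)
set.seed(1)
toy <- rsparsematrix(nrow = 2000, ncol = 500, density = 0.05)
dense_bytes  <- as.numeric(object.size(as.matrix(toy)))  # zeros stored explicitly
sparse_bytes <- as.numeric(object.size(toy))             # only non-zero entries stored
size_ratio <- dense_bytes / sparse_bytes
size_ratio
```

The sparser the matrix, the larger the saving, which is why the effect is so pronounced for single-cell counts.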
Initialize the Seurat object with the pbmc data (the CreateSeuratObject() call itself appears in section 5.1):
> pbmc
An object of class Seurat
13714 features across 2700 samples within 1 assay
Active assay: RNA (13714 features, 0 variable features)
> head(pbmc$RNA@data[,1:5])
6 x 5 sparse Matrix of class "dgCMatrix"
AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
AL627309.1 . . .
AP006222.2 . . .
RP11-206L10.2 . . .
RP11-206L10.9 . . .
LINC00115 . . .
NOC2L . . .
AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
AL627309.1 . .
AP006222.2 . .
RP11-206L10.2 . .
RP11-206L10.9 . .
LINC00115 . .
NOC2L . .
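5 Preprocessing
Preprocessing covers filtering cells and genes on QC metrics, normalizing and scaling the data, and selecting highly variable genes.
5.1 QC to select high-quality cells
Three criteria are commonly used:
- 1. The number of unique genes detected in each cell
- Low-quality cells or empty droplets usually contain very few genes
- Cell doublets or multiplets show an abnormally high gene count
- 2. The total number of molecules detected in each cell
- 3. The percentage of mitochondrial counts
- Low-quality or dying cells often show a high mitochondrial fraction
- PercentageFeatureSet() computes the percentage of counts coming from a set of features
- Genes whose names start with MT- are treated as mitochondrial genes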
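The three metrics can be sketched in base R on a toy count matrix. The gene and cell names and all counts below are made up; the column names only mimic the metadata columns Seurat produces.

```r
# Toy count matrix: 4 genes x 3 cells (made-up values), filled column by column
counts <- matrix(c(1, 1, 10, 8,
                   0, 0,  5, 0,
                   3, 3,  0, 6),
                 nrow = 4,
                 dimnames = list(c("MT-CO1", "MT-ND1", "CD3D", "LYZ"),
                                 c("cell1", "cell2", "cell3")))

n_count   <- colSums(counts)        # total molecules per cell
n_feature <- colSums(counts > 0)    # genes detected per cell

# Share of each cell's counts that comes from genes starting with "MT-"
is_mt      <- grepl("^MT-", rownames(counts))
percent_mt <- 100 * colSums(counts[is_mt, , drop = FALSE]) / n_count

qc <- data.frame(nCount = n_count, nFeature = n_feature, percent.mt = percent_mt)
print(qc)
```

In this toy table cell3 has half of its counts in mitochondrial genes, the kind of cell the mitochondrial threshold is meant to catch.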
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
# Show QC metrics for the first 5 cells
> head(pbmc@meta.data, 5)
orig.ident nCount_RNA nFeature_RNA percent.mt
AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
We filter cells using the following criteria:
We filter cells that have unique feature counts over 2,500 or less than 200
We filter cells that have >5% mitochondrial counts
Visualize the three QC metrics before filtering
> pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
> pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
> VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
This produces the violin plots.
Inspect how the number of genes and the mitochondrial percentage relate to the UMI count
> plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
> plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
> lllg <- plot1 + plot2
> lllg
> ggsave('lllg.pdf')
Saving 7 x 7 in image
6 Quality control
Remove cells with more than 2,500 or fewer than 200 detected genes
Remove cells in which mitochondrial counts exceed 5%
> pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
> pbmc
An object of class Seurat
13714 features across 2638 samples within 1 assay
Active assay: RNA (13714 features, 0 variable features)
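The filter is a plain logical condition on the metadata columns. As a rough sketch, the data frame below reuses the five metadata rows shown earlier and adds two made-up cells (`fake_low`, `fake_mt`) that should fail; `meta`, `keep`, and `filtered` are illustrative names.

```r
# Five rows copied from the metadata shown earlier, plus two made-up
# cells (fake_low, fake_mt) that should fail the filter
meta <- data.frame(
  nFeature_RNA = c(779, 1352, 1129, 960, 521, 150, 900),
  percent.mt   = c(3.0177759, 3.7935958, 0.8897363, 1.7430845, 1.2244898, 2.0, 7.5),
  row.names    = c("AAACATACAACCAC-1", "AAACATTGAGCTAC-1", "AAACATTGATCAGC-1",
                   "AAACCGTGCTTCCG-1", "AAACCGTGTATGCG-1", "fake_low", "fake_mt")
)

# Same condition as in the subset() call above
keep <- meta$nFeature_RNA > 200 & meta$nFeature_RNA < 2500 & meta$percent.mt < 5
filtered <- meta[keep, ]

rownames(meta)[!keep]   # cells that would be dropped
```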
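6.1 Normalization
The default method is LogNormalize: each cell's counts are divided by that cell's total, multiplied by a scale factor of 10,000, and then log-transformed. The result is stored in pbmc[["RNA"]]@data.
Total expression per cell before normalization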
> hist(colSums(pbmc$RNA@data),
+ breaks = 100,
+ main = "Total expression before normalisation",
+ xlab = "Sum of expression")
6.2 Total expression per cell after normalization
> pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> hist(colSums(pbmc$RNA@data),
+ breaks = 100,
+ main = "Total expression after normalisation",
+ xlab = "Sum of expression")
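The LogNormalize arithmetic described in 6.1 can be written out by hand. This is a minimal sketch on made-up counts, assuming the log step is the natural log of 1 + x; the names `counts` and `normalized` are illustrative.

```r
# Toy counts: 3 genes x 2 cells (made-up values), filled column by column
counts <- matrix(c(1, 3, 6,
                   0, 5, 5),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2")))

scale_factor <- 10000

# Divide each column by its total, multiply by the scale factor, then take log(1 + x)
normalized <- log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)
normalized

# Undoing the log shows that every cell was scaled to the same total
colSums(expm1(normalized))
```

A count of 0 stays 0 after normalization, so the matrix remains sparse.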
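6.3 Identify highly variable genes
Identify genes whose expression varies strongly between cells; downstream analysis focuses on these genes. Seurat's built-in FindVariableFeatures() computes the mean and variance of every gene and models the mean-variance relationship directly. It returns 2,000 genes by default.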
> pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
# The 10 most highly variable genes
> top10 <- head(VariableFeatures(pbmc), 10) #head(pbmc$RNA@var.features,10)
> top10
[1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" "PF4" "FTH1"
[9] "GNG11" "S100A8"
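The idea of ranking genes by variability can be illustrated with a much simpler statistic than the one Seurat uses. The sketch below ranks made-up genes by their variance-to-mean ratio; this is plain dispersion ranking, not the "vst" method, and all names and values are hypothetical.

```r
set.seed(42)
n_cells <- 200

# Made-up counts: two stable genes and one gene that is high in only half of the cells
counts <- rbind(
  stable1  = rpois(n_cells, lambda = 5),
  stable2  = rpois(n_cells, lambda = 20),
  variable = c(rpois(n_cells / 2, lambda = 1), rpois(n_cells / 2, lambda = 40))
)

gene_mean  <- rowMeans(counts)
gene_var   <- apply(counts, 1, var)
dispersion <- gene_var / gene_mean   # close to 1 for pure Poisson noise

# Genes ordered from most to least variable relative to their mean
ranked <- names(sort(dispersion, decreasing = TRUE))
ranked
```

Dividing by the mean matters because highly expressed genes have a larger raw variance even when they are not biologically variable.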
Plot the variable genes to inspect their distribution
> plot1 <- VariableFeaturePlot(pbmc)
> ggsave('plot1.pdf')
Saving 7 x 7 in image
Warning messages:
1: Transformation introduced infinite values in continuous x-axis
2: Removed 1 rows containing missing values (geom_point).
Plot the variable genes and label the top 10
> plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
When using repel, set xnudge and ynudge to 0 for optimal results
> ggsave('plot2.pdf')
Saving 7 x 7 in image
Warning messages:
1: Transformation introduced infinite values in continuous x-axis
2: Removed 1 rows containing missing values (geom_point).
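7 Scale the data
ScaleData() applies a linear transformation to scale the data.
After scaling, each gene has mean 0 and variance 1 across cells.
The result is stored in pbmc[["RNA"]]@scale.data.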
> all.genes <- rownames(pbmc)
> pbmc <- ScaleData(pbmc, features = all.genes)
Centering and scaling data matrix
|======================================================================| 100%
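The features argument is set because ScaleData() by default only processes the variable genes identified earlier. Either choice leaves the later PCA and clustering unchanged, since they use the variable genes, but it matters for heatmaps, which need scaled values for every gene drawn.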
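The scaling itself is an ordinary per-gene z-score. A minimal base R sketch on made-up values follows; `expr` and `scaled` are illustrative names, and any clipping of extreme values that ScaleData() may apply is left out.

```r
set.seed(1)

# Made-up normalized expression: 3 genes x 6 cells
expr <- matrix(rnorm(18, mean = 2, sd = 1.5), nrow = 3,
               dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:6)))

# scale() works on columns, so transpose to centre and scale each gene across cells
scaled <- t(scale(t(expr)))

round(rowMeans(scaled), 10)   # each gene now has mean 0
apply(scaled, 1, var)         # and variance 1
```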
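Regress out unwanted sources of variation. Note that this call omits features, so it rescales only the variable genes.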
> pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
Regressing out percent.mt
|======================================================================| 100%
Centering and scaling data matrix
|======================================================================| 100%
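8 Linear dimensional reduction
8.1 PCA
Run PCA on the scaled data. By default the highly variable genes identified earlier are used; the features argument lets you choose a different gene set.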
> pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
VizDimLoadings, DimPlot, and DimHeatmap visualize the PCA results from the perspective of genes or cells
The genes with the strongest influence on each principal component
PC_ 1
Positive: CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP
FCER1G, CFD, LGALS1, LGALS2, SERPINA1, S100A8, CTSS, IFITM3, SPI1, CFP
PSAP, IFI30, COTL1, SAT1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD
Negative: MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CTSW, STK17A, CD27
CD247, CCL5, GIMAP5, GZMA, AQP3, CST7, TRAF3IP3, SELL, GZMK, HOPX
MAL, MYC, ITM2A, ETS1, LYAR, GIMAP7, KLRG1, NKG7, ZAP70, BEX2
PC_ 2
Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74
HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB
BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74
Negative: NKG7, PRF1, CST7, GZMA, GZMB, FGFBP2, CTSW, GNLY, B2M, SPON2
CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX
TTC38, CTSC, APMAP, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC
PC_ 3
Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPA1, HLA-DPB1, CD74, MS4A1, HLA-DRB1, HLA-DRA
HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8
PLAC8, BLNK, MALAT1, SMIM14, PLD4, IGLL5, SWAP70, P2RX5, LAT2, FCGR3A
Negative: PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU
HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1
NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, CMTM5, MPP1, MYL9, RP11-367G6.3, GP1BA
PC_ 4
Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, HLA-DPA1, HLA-DRB1
TCL1A, PF4, HLA-DQA2, SDPR, HLA-DRA, LINC00926, PPBP, GNG11, HLA-DRB5, SPARC
GP9, PTCRA, CA2, AP001189.4, CD9, NRGN, RGS18, GZMB, CLU, TUBB1
Negative: VIM, IL7R, S100A6, S100A8, IL32, S100A4, GIMAP7, S100A10, S100A9, MAL
AQP3, CD14, CD2, LGALS2, FYB, GIMAP4, ANXA1, RBP7, CD27, FCN1
LYZ, S100A12, MS4A6A, GIMAP5, S100A11, FOLR3, TRABD2A, AIF1, IL8, TMSB4X
PC_ 5
Positive: GZMB, FGFBP2, S100A8, NKG7, GNLY, CCL4, PRF1, CST7, SPON2, GZMA
GZMH, LGALS2, S100A9, CCL3, XCL2, CD14, CLIC3, CTSW, MS4A6A, GSTP1
S100A12, RBP7, IGFBP7, FOLR3, AKR1C3, TYROBP, CCL5, TTC38, XCL1, APMAP
Negative: LTB, IL7R, CKB, MS4A7, RP11-290F20.3, AQP3, SIGLEC10, VIM, CYTIP, HMOX1
LILRB2, PTGES3, HN1, CD2, FAM110A, CD27, ANXA5, CTD-2006K23.1, MAL, VMO1
CORO1B, TUBA1B, LILRA3, GDI2, TRADD, ATP1A1, IL32, ABRACL, CCDC109B, PPA1
Show the top five genes for each principal component
> print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
PC_ 1
Positive: CST3, TYROBP, LST1, AIF1, FTL
Negative: MALAT1, LTB, IL32, IL7R, CD2
PC_ 2
Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
Negative: NKG7, PRF1, CST7, GZMA, GZMB
PC_ 3
Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPA1
Negative: PPBP, PF4, SDPR, SPARC, GNG11
PC_ 4
Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
Negative: VIM, IL7R, S100A6, S100A8, IL32
PC_ 5
Positive: GZMB, FGFBP2, S100A8, NKG7, GNLY
Negative: LTB, IL7R, CKB, MS4A7, RP11-290F20.3
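The Positive and Negative gene lists are the genes with the largest loadings at either end of a component. As a conceptual sketch, the code below runs base R `prcomp()` on made-up data with one hidden factor; Seurat's RunPCA() uses its own solver, so this only illustrates the idea, and all names are hypothetical.

```r
set.seed(7)
n_cells <- 100

# Made-up scaled expression: a hidden factor drives geneA/geneB up and geneC down,
# while geneD is pure noise
factor_score <- rnorm(n_cells)
expr <- rbind(
  geneA =  2 * factor_score + rnorm(n_cells, sd = 0.3),
  geneB =  2 * factor_score + rnorm(n_cells, sd = 0.3),
  geneC = -2 * factor_score + rnorm(n_cells, sd = 0.3),
  geneD = rnorm(n_cells, sd = 0.3)
)

# prcomp() expects observations (cells) in rows, so transpose
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
loadings_pc1 <- pca$rotation[, "PC1"]

# Genes sorted along PC1: opposite ends correspond to the Positive/Negative lists
sort(loadings_pc1)
```

The sign of a component is arbitrary, so only the fact that geneA and geneB sit opposite geneC is meaningful, not which side is labelled positive.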
Visualize the genes with the strongest influence on each principal component
> VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
> ggsave('pca.pdf')
Saving 7 x 7 in image
Plot the cells on the first two principal components
> DimPlot(pbmc, reduction = "pca",split.by = 'ident')
> ggsave('dimplotpca2.pdf')
Saving 7 x 7 in image
DimHeatmap draws a heatmap for a single principal component, with both cells and genes ordered by their PCA scores. It is useful for exploring heterogeneity in the data and helps decide which principal components to use downstream.
# Note: with its default fast = TRUE, DimHeatmap() draws with base graphics, so ggsave() may save the previous ggplot rather than the heatmap; opening a device with pdf() before the call and closing it with dev.off() is a safer way to export it.
> DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
> ggsave('dimheatmap.pdf')
Saving 7 x 7 in image
> DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
> ggsave('dim.pdf')
Saving 7 x 7 in image
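8.2 Dimensionality of the data
To keep any single gene from dominating, Seurat clusters cells on their PCA scores. The first decision is how many principal components to carry forward. Two approaches are common: a statistical test against a null distribution, which is slow and may not give a clear answer, and a heuristic that is widely used with PCA.
JackStraw()
JackStraw() uses a permutation test against a null distribution. It randomly permutes a subset of the genes (1% by default), reruns PCA to obtain null scores, and compares them with the scores computed earlier to get a significance p-value. The outcome is a p-value for the association of each gene with each principal component, and the components to keep are those enriched for genes with small p-values.
This can take a long time on large datasets; ElbowPlot() is a much faster heuristic alternative.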
> pbmc <- JackStraw(pbmc, num.replicate = 100)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=06m 05s
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
> pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot() compares the distribution of p-values for each principal component with a uniform distribution (dashed line). ScoreJackStraw() must be run first, otherwise JackStrawPlot() stops with an error. Significant components are enriched for genes with small p-values, so their solid curve lies above and to the left of the dashed line. The plot below suggests that keeping 10 principal components for downstream analysis is reasonable.
> JackStrawPlot(pbmc, dims = 1:15)
Warning message:
Removed 23426 rows containing missing values (geom_point).
> ggsave('jackstraw.pdf')
Saving 7 x 7 in image
Warning message:
Removed 23426 rows containing missing values (geom_point).
ElbowPlot
> ElbowPlot(pbmc)
# ggsave() picks the graphics device from the file extension, so the file name needs one
> ggsave('elbowplot.pdf')
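The heuristic approach draws an elbow plot, which ranks the principal components by how much of the variance in the data each one explains. Choose the number of components as needed; here the curve bends at around PC 9, and later components change very little.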
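The quantities behind an elbow plot are easy to compute by hand. The sketch below uses base R `prcomp()` on made-up data with three strong signal dimensions and picks the elbow as the largest drop between consecutive components; that rule is a simple illustration chosen here, not something Seurat does for you.

```r
set.seed(3)
n_cells <- 300

# Made-up data: 20 features, of which the first 3 carry strong signal
signal <- matrix(rnorm(n_cells * 3, sd = 5), ncol = 3)
noise  <- matrix(rnorm(n_cells * 17, sd = 1), ncol = 17)
pca <- prcomp(cbind(signal, noise))

# Percentage of variance explained by each principal component
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
round(pct_var[1:6], 1)

# Position of the largest drop in standard deviation between consecutive PCs
elbow <- which.max(-diff(pca$sdev))
elbow
```

On real data the bend is rarely this sharp, which is why the choice is usually made by eye and then checked by rerunning the downstream steps.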
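Note: identifying the true dimensionality of a dataset is not easy. Besides the two methods above, the official Seurat documentation suggests a more supervised approach: exploring the PCs for relevant sources of heterogeneity, which could be combined with GSEA. Dendritic cell and NK aficionados may recognize that the genes strongly associated with PCs 12 and 13 define rare immune subsets (i.e. MZB1 is a marker for plasmacytoid DCs). Without prior knowledge, such signals are hard to spot.
The official Seurat documentation therefore encourages users to repeat the downstream analysis with different numbers of PCs (10, 15, or even 50). As you will see, the results often do not differ dramatically. When choosing this parameter it is better to err on the high side, because too few dimensions can hurt the results.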
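9 Cell clustering
Seurat v3 uses a graph-based clustering approach built on a K-nearest neighbor (KNN) graph. Edges are drawn between cells with similar gene expression patterns, and the graph is then partitioned into highly interconnected communities.
As in PhenoGraph, a KNN graph is first built from the Euclidean distance in PCA space (using the PCA results computed earlier), and the edge weight between any two cells is then refined by the overlap of their local neighborhoods (Jaccard similarity). This step is performed by FindNeighbors(), which takes the previously chosen PCs as input.
To cluster the cells, a modularity optimization technique then iteratively groups cells together (the Louvain algorithm (default) or SLM [SLM, Blondel et al., Journal of Statistical Mechanics]). FindClusters() implements this step. Its resolution parameter sets the "granularity" of the downstream clustering: a larger value yields more clusters. For datasets of around 3,000 cells, a resolution of 0.4-1.2 usually works well; larger datasets often need a higher resolution and give more clusters.
Use Idents() to see which cluster each cell belongs to.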
> pbmc <- FindNeighbors(pbmc, dims = 1:10)
Computing nearest neighbor graph
Computing SNN
> pbmc <- FindClusters(pbmc, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 2638
Number of edges: 95893
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8735
Number of communities: 9
Elapsed time: 0 seconds
> head(Idents(pbmc), 5)
AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
0 3 2 1
AAACCGTGTATGCG-1
6
Levels: 0 1 2 3 4 5 6 7 8
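The neighbor-overlap weighting used when building the graph can be sketched in a few lines of base R. The coordinates below are made up (two well-separated groups of ten cells), `k` and all object names are illustrative, and details of FindNeighbors() such as pruning of weak edges are not reproduced.

```r
set.seed(11)

# Made-up PCA coordinates: two well-separated groups of 10 cells, 2 PCs each
coords <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
                matrix(rnorm(20, mean = 10), ncol = 2))
rownames(coords) <- paste0("cell", 1:20)

k <- 5
dist_mat <- as.matrix(dist(coords))   # Euclidean distances in PCA space

# For each cell, the names of its k nearest neighbours (position 1 is the cell itself)
knn <- lapply(rownames(coords), function(cell) {
  names(sort(dist_mat[cell, ]))[2:(k + 1)]
})
names(knn) <- rownames(coords)

# Jaccard index: shared neighbours divided by the union of both neighbour sets
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

jaccard(knn[["cell1"]], knn[["cell2"]])    # two cells from the same group
jaccard(knn[["cell1"]], knn[["cell11"]])   # two cells from different groups
```

Cells from different groups share no neighbours and get a weight of 0, which is what lets the community detection step separate them.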
> pbmc <- RunUMAP(pbmc, dims = 1:10)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
08:29:35 UMAP embedding parameters a = 0.9922 b = 1.112
08:29:35 Read 2638 rows and found 10 numeric columns
08:29:35 Using Annoy for neighbor search, n_neighbors = 30
08:29:36 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
08:29:36 Writing NN index file to temp file /tmp/Rtmp1UlrA2/file178084559cfdf3
08:29:36 Searching Annoy index using 1 thread, search_k = 3000
08:29:37 Annoy recall = 100%
08:29:37 Commencing smooth kNN distance calibration using 1 thread
08:29:38 Initializing from normalized Laplacian + noise
08:29:38 Commencing optimization for 500 epochs, with 106338 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
08:29:41 Optimization finished
> DimPlot(pbmc, reduction = "umap")
> ggsave('umap.pdf')
Saving 7 x 7 in image
Add cluster labels
> DimPlot(pbmc, reduction = "umap",label = TRUE)
> LabelClusters(DimPlot(pbmc, reduction = "umap"),id = 'ident')
> ggsave('dim.pdf')
Saving 7 x 7 in image
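At this point you can save the object so that it can be loaded directly next time to adjust the plots.
10 Finding differentially expressed genes (cluster biomarkers)
Seurat finds marker genes for each cluster through differential expression analysis. FindMarkers() does this for one cluster (the ident.1 argument), by default identifying positive and negative markers compared with all other cells. FindAllMarkers() automates the process for every cluster, testing each one against all remaining cells.
min.pct: the minimum fraction of cells in either of the two groups in which a gene must be detected. Skipping rarely expressed genes shortens the run time. Default 0.1.
logfc.threshold: the minimum expression difference between the two groups for a gene to be tested. It can be set to 0, at the cost of a longer run time.
max.cells.per.ident: downsamples each cluster to at most this many cells, which also shortens the run time.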
> cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=06s
> head(cluster1.markers, n = 5)
p_val avg_log2FC pct.1 pct.2 p_val_adj
S100A9 0.000000e+00 5.570063 0.996 0.215 0.000000e+00
S100A8 0.000000e+00 5.477394 0.975 0.121 0.000000e+00
FCN1 0.000000e+00 3.394219 0.952 0.151 0.000000e+00
LGALS2 0.000000e+00 3.800484 0.908 0.059 0.000000e+00
CD14 2.856582e-294 2.815626 0.667 0.028 3.917516e-290
> cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s
> head(cluster5.markers, n = 5)
p_val avg_log2FC pct.1 pct.2 p_val_adj
FCGR3A 8.331882e-208 4.261784 0.975 0.040 1.142634e-203
CFD 1.932644e-198 3.423863 0.938 0.036 2.650429e-194
IFITM3 2.710023e-198 3.876058 0.975 0.049 3.716525e-194
CD68 1.069778e-193 3.013656 0.926 0.035 1.467094e-189
RP11-290F20.3 4.218926e-190 2.722303 0.840 0.016 5.785835e-186
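The columns of these tables can be reproduced for a single gene in base R. The sketch below uses made-up log-normalized values; the fold-change formula reflects my understanding of how avg_log2FC is computed in this Seurat version and should be read as an assumption, not as the package's definition.

```r
set.seed(5)

# Made-up log-normalized expression of one gene in two groups of cells
cluster_cells <- c(rep(0, 5), runif(45, min = 1, max = 3))       # target cluster
other_cells   <- c(rep(0, 80), runif(20, min = 0.5, max = 1.5))  # all other cells

# Fraction of cells in each group where the gene is detected (pct.1, pct.2)
pct_1 <- mean(cluster_cells > 0)
pct_2 <- mean(other_cells > 0)

# Log2 fold change of the mean expression, taken on the non-log scale
# with a pseudocount of 1 (assumed formula, see the note above)
avg_log2fc <- log2(mean(expm1(cluster_cells)) + 1) - log2(mean(expm1(other_cells)) + 1)

# Wilcoxon rank-sum test; exact = FALSE because the zeros create ties
p_val <- wilcox.test(cluster_cells, other_cells, exact = FALSE)$p.value

c(pct.1 = pct_1, pct.2 = pct_2, avg_log2FC = avg_log2fc, p_val = p_val)
```

With min.pct = 0.25 this gene would be tested, because it is detected in 90% of the cells of the target cluster.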
> pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
Calculating cluster 0
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
Calculating cluster 1
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s
Calculating cluster 2
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
Calculating cluster 3
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
Calculating cluster 4
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
Calculating cluster 5
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s
Calculating cluster 6
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03s
Calculating cluster 7
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s
Calculating cluster 8
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s
# The fold-change column is named avg_log2FC in this Seurat version (see the FindMarkers output above), not avg_logFC
> pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
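Selecting the top markers per cluster only works when the weighting column matches a column that exists in the table, which here is avg_log2FC. The same selection can be sketched in base R on a made-up marker table; `markers` and `top2` are illustrative names, and the values are invented.

```r
# Made-up marker table using the column names seen in the FindMarkers output
markers <- data.frame(
  cluster    = c(0, 0, 0, 1, 1, 1),
  gene       = c("g1", "g2", "g3", "g4", "g5", "g6"),
  avg_log2FC = c(0.8, 1.9, 1.2, 5.5, 2.1, 3.4),
  stringsAsFactors = FALSE
)

# Within each cluster, order by fold change and keep the two largest rows
top2 <- do.call(rbind, lapply(split(markers, markers$cluster), function(d) {
  head(d[order(d$avg_log2FC, decreasing = TRUE), ], 2)
}))
top2$gene
```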
Seurat offers several differential expression tests, selected with the test.use argument (for details see the
[DE vignette](https://satijalab.org/seurat/v3.0/de_vignette.html)).
> cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
> head(cluster0.markers, n = 5)
myAUC avg_diff power avg_log2FC pct.1 pct.2
RPS12 0.831 0.5132163 0.662 0.7404146 1.000 0.991
RPS6 0.828 0.4730236 0.656 0.6824288 1.000 0.995
RPL32 0.824 0.4362054 0.648 0.6293113 0.999 0.995
RPS27 0.821 0.5010227 0.642 0.7228229 0.999 0.992
RPS14 0.815 0.4366673 0.630 0.6299777 1.000 0.994
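In the table above the power column equals twice the distance of myAUC from 0.5 (for RPS12, 0.831 gives 0.662). The sketch below computes an AUC from ranks on made-up data and applies the same rescaling; the rank-based formula is a standard way to get an AUC and is not necessarily how Seurat computes it internally.

```r
set.seed(9)

# Made-up expression of one gene in the target cluster and in all other cells
in_cluster  <- rnorm(60, mean = 2)
out_cluster <- rnorm(140, mean = 1)

# AUC = probability that a random in-cluster cell outranks a random other cell,
# obtained here from the rank-sum (Mann-Whitney) statistic
ranks <- rank(c(in_cluster, out_cluster))
n1 <- length(in_cluster)
n2 <- length(out_cluster)
auc <- (sum(ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2) / (n1 * n2)

# Rescale so that 0 means no separation and 1 means perfect separation
power <- 2 * abs(auc - 0.5)
c(auc = auc, power = power)

# The same relation applied to the RPS12 row of the table above
2 * abs(0.831 - 0.5)
```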
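Seurat offers several ways to visualize marker expression
VlnPlot: the expression probability distribution across clusters
FeaturePlot: gene expression drawn on a tSNE or PCA plot
RidgePlot, CellScatter, DotPlot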
> VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
> ggsave('vlen123.pdf')
Saving 7 x 7 in image
> VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
> ggsave('vlenkg73.pdf')
Saving 7 x 7 in image
> FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
+ "CD8A"))
> ggsave('featuregene10.pdf')
Saving 7 x 7 in image
DoHeatmap draws an expression heatmap for the given cells and genes. Here the top 10 markers of each cluster are selected.
> top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
> DoHeatmap(pbmc, features = top10$gene) + NoLegend()
11 Assigning cell type identity to clusters
Cluster ID   Markers          Cell Type
0            IL7R, CCR7       Naive CD4+ T
1            IL7R, S100A4     Memory CD4+
2            CD14, LYZ        CD14+ Mono
3            MS4A1            B
4            CD8A             CD8+ T
5            FCGR3A, MS4A7    FCGR3A+ Mono
6            GNLY, NKG7       NK
7            FCER1A, CST3     DC
8            PPBP             Platelet
> new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono",
+ "NK", "DC", "Platelet")
> names(new.cluster.ids) <- levels(pbmc)
> pbmc <- RenameIdents(pbmc, new.cluster.ids)
> DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
> ggsave('wmap123.pdf')
Saving 7 x 7 in image
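Renaming works as a lookup from old cluster ID to new label. A small base R sketch of that mapping follows, using the cluster IDs of the first five cells shown earlier plus one made-up extra cell; `clusters` and `renamed` are illustrative names, and RenameIdents() itself is not called.

```r
# Cluster IDs of the first five cells from the Idents() output, plus one made-up cell
clusters <- factor(c(0, 3, 2, 1, 6, 0), levels = 0:8)

new_ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T",
             "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new_ids) <- levels(clusters)   # "0" ... "8", same idea as levels(pbmc)

# Look up each cell's new label by its old cluster ID
renamed <- factor(new_ids[as.character(clusters)], levels = new_ids)
as.character(renamed)
```

The order of `new_ids` has to match the order of the cluster levels, since the names are assigned by position.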
12 Finally, save the processed object
> saveRDS(pbmc, file = "pbmc3k_final.rds")