本文是参考学习单细胞转录组基础分析五:细胞再聚类的学习笔记。可能根据学习情况有所改动。
单细胞数据分析中,一般需要对可以细分的细胞再聚类,比如本次分析中的T细胞群体可以细分为Naive T cells、CD8+ T cells、Treg cells、Tmemory cells等。细胞子集的提取使用subset函数,主要依据meta.data的分类项目选择,也可以自定义细胞barcode提取。
提取细胞子集
library(Seurat)
重新降维聚类
因为再聚类的细胞之间差异比较小,所以聚类函数FindClusters()控制分辨率的参数建议调高到resolution = 0.9。
##PCA降维
Cluster差异分析
diff.wilcox = FindAllMarkers(scRNAsub)
SingleR细胞鉴定
Subcluster的细胞同样可以使用SingleR鉴定细胞类型。使用的时候注意调整参考数据库和分类标签,以便鉴定结果更有针对性。上节使用SingleR时使用的参考数据库是人类主要细胞图谱(HumanPrimaryCellAtlasData),采用分类标签是主分类标签(label.main);这次建议使用人类免疫细胞数据(MonacoImmuneData),分类标签采用精细分类标签(label.fine)。希望详细了解SingleR的朋友可以到github看看:
https://github.com/dviraran/singler
##细胞类型鉴定
> library(Seurat)
> library(monocle)
载入需要的程辑包:Matrix
载入程辑包:‘Matrix’
The following object is masked from ‘package:S4Vectors’:
expand
The following objects are masked from ‘package:tidyr’:
expand, pack, unpack
载入需要的程辑包:VGAM
载入需要的程辑包:splines
载入程辑包:‘VGAM’
The following object is masked from ‘package:tidyr’:
fill
载入需要的程辑包:DDRTree
载入需要的程辑包:irlba
Warning messages:
1: 程辑包‘monocle’是用R版本4.0.3 来建造的
2: 程辑包‘Matrix’是用R版本4.0.3 来建造的
3: 程辑包‘VGAM’是用R版本4.0.3 来建造的
4: 程辑包‘DDRTree’是用R版本4.0.3 来建造的
5: 程辑包‘irlba’是用R版本4.0.3 来建造的
> library(tidyverse)
> library(patchwork)
> rm(list=ls())
> dir.create("subcluster")
> scRNA <- readRDS("scRNA.rds")
> ##提取细胞子集
> Cells.sub <- subset([email protected], celltype=="T_cells")
Error in eval(e, x, parent.frame()) : object 'celltype' not found
> scRNAsub <- subset(scRNA, cells=row.names(Cells.sub))
Error in row.names(Cells.sub) : object 'Cells.sub' not found
> library(SingleR)
> refdata <- HumanPrimaryCellAtlasData()
Cannot connect to ExperimentHub server, using 'localHub=TRUE' instead
snapshotDate(): 2021-03-04
see ?celldex and browseVignettes('celldex') for documentation
loading from cache
see ?celldex and browseVignettes('celldex') for documentation
loading from cache
> testdata <- GetAssayData(scRNA, slot="data")
> clusters <- [email protected]$seurat_clusters
> cellpred <- SingleR(test = testdata, ref = refdata, labels = refdata$label.main,
+ method = "cluster", clusters = clusters,
+ assay.type.test = "logcounts", assay.type.ref = "logcounts")
Warning message:
'method="cluster"' is no longer necessary when 'cluster=' is specified
> celltype = data.frame(ClusterID=rownames(cellpred), celltype=cellpred$labels, stringsAsFactors = F)
> dim(celltype)
[1] 10 2
> celltype
ClusterID celltype
1 0 Monocyte
2 1 T_cells
3 2 B_cell
4 3 T_cells
5 4 T_cells
6 5 B_cell
7 6 NK_cell
8 7 T_cells
9 8 T_cells
10 9 Monocyte
> ##提取细胞子集
> Cells.sub <- subset([email protected], celltype=="T_cells")
> scRNAsub <- subset(scRNA, cells=row.names(Cells.sub))
> ##PCA降维
> scRNAsub <- FindVariableFeatures(scRNAsub, selection.method = "vst", nfeatures = 2000)
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> scale.genes <- rownames(scRNAsub)
> scRNAsub <- ScaleData(scRNAsub, features = scale.genes)
Centering and scaling data matrix
|==================================================================================| 100%
> scRNAsub <- RunPCA(scRNAsub, features = VariableFeatures(scRNAsub))
PC_ 1
Positive: RPS12, LTB, CD3E, TRAC, TRBC2, CD3D, IL32, TCF7, CD3G, IL7R
LDHB, ARL4C, CD69, CD247, CD7, NOSIP, CD27, RHOH, SPOCK2, TRBC1
BCL11B, GZMM, SYNE2, CD6, RORA, CTSW, TRABD2A, CCR7, ZAP70, AQP3
Negative: LYZ, FCN1, S100A9, MNDA, FGL2, CST3, VCAN, S100A8, NCF2, SERPINA1
GRN, KLF4, TYMP, CTSS, MS4A6A, CSTA, PSAP, CD36, MPEG1, RNF130
CPVL, TGFBI, CSF3R, SLC7A7, CLEC7A, CD68, AIF1, LST1, LGALS1, S100A12
PC_ 2
Positive: CD79A, MS4A1, BANK1, IGHM, HLA-DQA1, LINC00926, IGHD, CD79B, TCL1A, CD22
HLA-DQA2, HLA-DQB1, BCL11A, CD74, HLA-DRB1, SPIB, FCRL1, HLA-DPA1, HLA-DRA, HLA-DPB1
TNFRSF13C, MEF2C, FAM129C, ARHGAP24, FCRLA, FCER2, PLPP5, HLA-DOB, HVCN1, IGKC
Negative: CD247, S100A4, CTSW, CD7, CD3E, GZMM, IL32, ARL4C, ANXA1, GZMA
CD3D, NKG7, PRF1, CST7, DOK2, CCL5, ZAP70, KLRB1, CD3G, S100A10
MT2A, GNLY, TRAC, GAPDH, IL7R, RORA, KLRD1, APMAP, TRBC1, MATK
PC_ 3
Positive: RPS12, IL7R, LDHB, TRABD2A, CD3D, RCAN3, CD3G, TCF7, TRAC, LTB
MAL, LEF1, VIM, CCR7, PRKCA, CD27, NOSIP, FOSB, PASK, CD5
INPP4B, NELL2, SUSD3, IL6ST, CD40LG, CHRM3-AS2, AQP3, LINC01550, CD3E, BCL11B
Negative: GNLY, GZMB, SPON2, KLRD1, NKG7, KLRF1, FGFBP2, PRF1, CST7, ADGRG1
TTC38, GZMA, CLIC3, FCGR3A, CCL4, HOPX, APOBEC3G, MATK, IL2RB, PTGDR
TRDC, TBX21, CTSW, GZMH, S1PR5, IGFBP7, APMAP, EFHD2, SH2D1B, XCL2
PC_ 4
Positive: S100A12, RAB27A, NCF1, AC020656.1, CYP1B1, RBP7, QPCT, PADI4, MCEMP1, S100A8
IRS2, RETN, TREM1, CLEC4D, VNN3, F5, VCAN, ALOX5AP, PLBD1, ITGAM
BPI, AZI2, BST1, UBE2D1, FNDC3B, VNN2, PGD, FES, AQP9, GAB2
Negative: NOTCH4, CDKN1C, PPP1R17, HES4, TCF7L2, CKB, ABCC3, ZNF703, CDH23, SMIM25
KCNAB3, TNFRSF8, SECTM1, SIGLEC10, DUSP5, MGLL, MS4A4A, PTP4A3, CSF1R, LINC01503
FCGR3A, MAPKAPK3, MTSS1, RNASET2, RHOC, RRAS, SLC35F3, CXCL16, XYLB, ZGRF1
PC_ 5
Positive: RHEX, SLC35F3, CDH1, MYB, LINC00996, KCNK17, IL3RA, NRP1, IGFBP3, LRRC36
CLEC4C, AC002553.1, SMPD3, PACSIN1, DRD4, AR, CIB2, CCDC102B, SERPINF1, PLXNA4
LHFPL2, LGMN, DNASE1L3, DERL3, JCHAIN, PPM1J, PLD4, SLC7A5, MZB1, LDLRAD4
Negative: PPP1R17, CDKN1C, ABCC3, CKB, SIGLEC10, HES4, TCF7L2, FCGR3A, SMIM25, SECTM1
TNFRSF8, HSPA6, ZNF703, MGLL, PTP4A3, MTSS1, MS4A7, RRAS, IFITM3, MAFB
HMOX1, KCNAB3, GPR137B, LINC01503, CSF1R, ABI3, E2F2, CD300E, TSC22D3, MARCKS
> ElbowPlot(scRNAsub, ndims=20, reduction="pca")
> pc.num=1:10
> ##细胞聚类
> scRNAsub <- FindNeighbors(scRNAsub, dims = pc.num)
Computing nearest neighbor graph
Computing SNN
> scRNAsub <- FindClusters(scRNAsub, resolution = 0.9)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 251
Number of edges: 6245
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.7669
Number of communities: 6
Elapsed time: 0 seconds
> table([email protected]$seurat_clusters)
0 1 2 3 4 5
76 50 41 39 28 17
> metadata <- [email protected]
> cell_cluster <- data.frame(cell_ID=rownames(metadata), cluster_ID=metadata$seurat_clusters)
> write.csv(cell_cluster,'subcluster/cell_cluster.csv',row.names = F)
> ##非线性降维
> #tSNE
> scRNAsub = RunTSNE(scRNAsub, dims = pc.num)
> embed_tsne <- Embeddings(scRNAsub, 'tsne')
> write.csv(embed_tsne,'subcluster/embed_tsne.csv')
> plot1 = DimPlot(scRNAsub, reduction = "tsne")
> ggsave("subcluster/tSNE.pdf", plot = plot1, width = 8, height = 7)
> ggsave("subcluster/tSNE.png", plot = plot1, width = 8, height = 7)
> plot1
> ggsave("subcluster/tSNE.pdf", plot = plot1, width = 8, height = 7)
> ggsave("subcluster/tSNE.png", plot = plot1, width = 8, height = 7)
> #UMAP
> scRNAsub <- RunUMAP(scRNAsub, dims = pc.num)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
16:09:00 UMAP embedding parameters a = 0.9922 b = 1.112
16:09:00 Read 251 rows and found 10 numeric columns
16:09:00 Using Annoy for neighbor search, n_neighbors = 30
16:09:00 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:09:01 Writing NN index file to temp file C:\Users\Nano\AppData\Local\Temp\Rtmp8sLN3y\file52412ae23ef
16:09:01 Searching Annoy index using 1 thread, search_k = 3000
16:09:01 Annoy recall = 100%
16:09:01 Commencing smooth kNN distance calibration using 1 thread
16:09:03 Initializing from normalized Laplacian + noise
16:09:03 Commencing optimization for 500 epochs, with 8594 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:09:05 Optimization finished
> embed_umap <- Embeddings(scRNAsub, 'umap')
> write.csv(embed_umap,'subcluster/embed_umap.csv')
> plot2 = DimPlot(scRNAsub, reduction = "umap")
> plot2
> ggsave("subcluster/UMAP.pdf", plot = plot2, width = 8, height = 7)
> ggsave("subcluster/UMAP.png", plot = plot2, width = 8, height = 7)
> #合并tSNE与UMAP
> plotc <- plot1+plot2+ plot_layout(guides = 'collect')
> plotc
> plotc
> ggsave("subcluster/tSNE_UMAP.pdf", plot = plotc, width = 10, height = 5)
> ggsave("subcluster/tSNE_UMAP.png", plot = plotc, width = 10, height = 5)
> diff.wilcox = FindAllMarkers(scRNAsub)
Calculating cluster 0
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05s
Calculating cluster 1
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s
Calculating cluster 2
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s
Calculating cluster 3
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03s
Calculating cluster 4
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03s
Calculating cluster 5
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=06s
> all.markers = diff.wilcox %>% select(gene, everything()) %>% subset(p_val<0.05)
> top10 = all.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
> write.csv(all.markers, "subcluster/diff_genes_wilcox.csv", row.names = F)
> write.csv(top10, "subcluster/top10_diff_genes_wilcox.csv", row.names = F)
> ##细胞类型鉴定
> library(SingleR)
> refdata <- MonacoImmuneData()
snapshotDate(): 2020-10-27
see ?celldex and browseVignettes('celldex') for documentation
Could not check id: EH3496 for updates.
Using previously cached version.
loading from cache
see ?celldex and browseVignettes('celldex') for documentation
loading from cache
Warning message:
Could not check database for updates.
Database source currently unreachable.
This should only be a temporary interruption.
Using previously cached version.
> testdata <- GetAssayData(scRNAsub, slot="data")
> clusters <- [email protected]$seurat_clusters
> cellpred <- SingleR(test = testdata, ref = refdata, labels = refdata$label.fine,
+ method = "cluster", clusters = clusters,
+ assay.type.test = "logcounts", assay.type.ref = "logcounts")
Warning message:
'method="cluster"' is no longer necessary when 'cluster=' is specified
> dim(celltype)
[1] 10 2
> celltype = data.frame(ClusterID=rownames(cellpred), celltype=cellpred$labels, stringsAsFactors = F)
> dim(celltype)
[1] 6 2
> celltype
ClusterID celltype
1 0 Classical monocytes
2 1 Th17 cells
3 2 Naive CD4 T cells
4 3 Naive B cells
5 4 MAIT cells
6 5 Natural killer cells
> write.csv(celltype,"subcluster/celltype_singleR.csv",row.names = F)
> [email protected]$celltype = "NA"
> for(i in 1:nrow(celltype)){
+ [email protected][which([email protected]$seurat_clusters == celltype$ClusterID[i]),'celltype'] <- celltype$celltype[i]}
> p1 = DimPlot(scRNAsub, group.by="celltype", label=T, label.size=5, reduction='tsne')
> p2 = DimPlot(scRNAsub, group.by="celltype", label=T, label.size=5, reduction='umap')
> p3 = plotc <- p1+p2+ plot_layout(guides = 'collect')
> ggsave("subcluster/tSNE_celltype.pdf", p1, width=7 ,height=6)
> ggsave("subcluster/UMAP_celltype.pdf", p2, width=7 ,height=6)
> ggsave("subcluster/celltype.pdf", p3, width=10 ,height=5)
> ggsave("subcluster/celltype.png", p3, width=10 ,height=5)