单细胞转录组功能分析(超详细clusterProfiler,GSEA,GSVA分析)

单细胞分析不可避免的会涉及到鉴定不同细胞类型的功能或比较相同类型在不同状态下的功能差异,因此,功能分析在单细胞分析中相当重要.

功能分析的方法很多,这儿我们主要讲述3种功能分析方法,一是基于筛选的差异基因,采用超几何检验判断上调或下调基因在哪些GO或KEGG或其它定义的通路富集,代表clusterProfiler包。另一种是不筛选差异基因,而是对其根据表达量或与表型的相关度排序,然后判断对应的基因集是否倾向于落在有序列表的顶部或底部,从而判断基因集合对表型差异的影响和筛选有影响的基因子集,代表GSEA富集分析。GSVA是一种非参数,无监督的算法,与GSEA不同,GSVA不需要预先对样本进行分组,可以计算每个样本中特定基因集的富集分数。

下载测试数据集:wget https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
这儿我们使用的是pbmc 3k数据作为测试,其它单细胞数据分析类似。
解压:tar -zxvf pbmc3k_filtered_gene_bc_matrices.tar.gz
单细胞转录数据分析之Scanpy:https://www.jianshu.com/p/e22a947e6c60
单细胞转录组之Scanpy - 轨迹推断/拟时序分析:https://www.jianshu.com/p/0b2ca0e0b544
单细胞空间转录分析之Seurat:https://www.jianshu.com/p/c9a601ced91f
单细胞空间转录分析之Seurat-多样本整合(浅谈空间批次):https://www.jianshu.com/p/609b04096b79
单细胞辅助注释工具-SingleR :https://www.jianshu.com/p/1c4abf05cb3e
加载Seurat相关包:

library(dplyr)
library(Seurat)
library(patchwork)

运行Seurat基本流程,获得细胞簇:

setwd("/home/wucheng/jianshu/function/data")
pbmc.data <- Read10X(data.dir = "/home/wucheng/jianshu/function/data/filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc <- RunUMAP(pbmc, dims = 1:10)
pdf("Umap.pdf")
DimPlot(pbmc, reduction = "umap")
dev.off()
saveRDS(pbmc,"pbmc.rds")
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) #识别每一簇的marker genes
Umap-celltype

一,使用clusterProfiler包,GO和KEGG富集分析
安装clusterProfiler, org.Hs.eg.db包:BiocManager::install('clusterProfiler')
BiocManager::install("org.Hs.eg.db")

加载包

library(ggplot2)
library(clusterProfiler)
library(org.Mm.eg.db) ##加载小鼠
library(org.Hs.eg.db) ##加载人类

具体分析

library(Seurat)
setwd("/home/wucheng/jianshu/function/data")
pbmc <-readRDS("pbmc.rds")
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)  ##这儿选取C5和C0,C3簇比较,做差异分析,识别差异表达基因,可根据实际具体选择
head(cluster5.markers, n = 5) 
> head(cluster5.markers, n = 5)
                      p_val avg_logFC pct.1 pct.2     p_val_adj
FCGR3A        7.583625e-209  2.963144 0.975 0.037 1.040018e-204
IFITM3        2.500844e-199  2.698187 0.975 0.046 3.429657e-195
CFD           1.763722e-195  2.362381 0.938 0.037 2.418768e-191
CD68          4.612171e-192  2.087366 0.926 0.036 6.325132e-188
RP11-290F20.3 1.846215e-188  1.886288 0.840 0.016 2.531900e-184
up <-rownames(cluster5.markers[intersect(which(cluster5.markers [,1]<0.05),which(cluster5.markers [,2]>=0.25)),])
down <-rownames(cluster5.markers[intersect(which(cluster5.markers [,1]<0.05),which(cluster5.markers [,2]<=(-0.25))),])
##识别up ,down genes
gs = bitr(up, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
head(gs)
ego.bp = enrichGO(gene=gs$ENTREZID, OrgDb = org.Hs.eg.db,ont= "BP",pAdjustMethod = "BH",pvalueCutoff= 0.05,qvalueCutoff= 0.2,readable= TRUE)                    
head(ego.bp)
write.csv(ego.bp, file ="Cluster5_up_go.csv") ##输出up 基因富集的Go term 
pdf("Cluster5_up_go.pdf", height =5, width = 10)
print(dotplot(ego.bp, showCategory=10,title="Cluster5 Vs.Cluster03 up gene GoTerm"))
dev.off()
kk <- enrichKEGG(gene= gs$ENTREZID, organism     = 'hsa',pvalueCutoff = 0.05)
write.csv(kk,file ="Cluster5_up_kegg.csv") ##输出up 基因富集的Kegg term
pdf("Cluster5_up_kegg.pdf", height =5, width = 10)
print(dotplot(kk, showCategory=10,title="KEGG_Up_biological")) #展示前十个条目
dev.off()

gs = bitr(down, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
head(gs)
ego.bp = enrichGO(gene=gs$ENTREZID, OrgDb = org.Hs.eg.db,ont= "BP",pAdjustMethod = "BH",pvalueCutoff= 0.05,qvalueCutoff= 0.2,readable= TRUE)                    
head(ego.bp)
write.csv(ego.bp, file ="Cluster5_down_go.csv") ##输出down 基因富集的Go term 
pdf("Cluster5_down_go.pdf", height =5, width = 10)
print(dotplot(ego.bp, showCategory=10,title="Cluster5Vs.Cluster03 down gene GoTerm"))
dev.off()
kk <- enrichKEGG(gene= gs$ENTREZID, organism     = 'hsa',pvalueCutoff = 0.05)
write.csv(kk,file ="Cluster5_down_kegg.csv") ##输出down 基因富集的Kegg term 
pdf("Cluster5_down_kegg.pdf", height =5, width = 10)
print(dotplot(kk, showCategory=10,title="KEGG_biological")) 
dev.off()
GO-KEGG-UP-DOWN

二,GSEA分析
使用R语言来进行GSEA分析可以有两种方法,一个是fgsea包,另一个是使用clusterProfiler包。http://www.gsea-msigdb.org/gsea/downloads.jsp
MSigdb(Molecular Signatures Database)数据库包含了以下9种不同基因的基因,可供下载以及R软件包载入。

  • H:hallmark gene sets---癌症特征基因集合,共50 gene sets。
  • C1:positional gene sets---染色体位置基因集合,共299 gene sets。
  • C2:curated gene sets---包含了已知数据库,文献等基因集信息,包含5529 gene sets。
  • C3:motif gene sets---调控靶基因集合,包括miRNA-target以及转录因子-target调控模式,3735 gene sets。
  • C4:computational gene sets---计算机软件预测出来的基因集合,主要是和癌症相关的基因,858 gene sets。
  • C5:GO gene sets---Gene Ontology对应的基因集合,10192 gene sets。
  • C6:oncogenic signatures---致癌基因集合,大部分来源于NCBI GEO发表芯片数据,189 gene sets。
  • C7:immunologic signatures---免疫相关基因集,4872 gene sets。
  • C8:single cell identitiy gene sets, 302 gene sets
    fgsea包分析
    导入包:
library(msigdbr) #提供MSigdb数据库基因集
library(fgsea)
library(dplyr)
library(tibble)
library(Seurat)

分析代码:

markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25, logfc.threshold = 0)  ##同样选取C5和C0,C3簇比较,做差异分析,这儿输出所有基因,前端可理解为C5上调,后端为C5下调基因
markers$genes = rownames(markers)
cluster.genes<- markers %>% arrange(desc(avg_logFC)) %>% dplyr::select(genes,avg_logFC) #基因按logFC排序
ranks<- deframe(cluster.genes)

mdb_c2 <- msigdbr(species = "Homo sapiens", category = "C2")## 定义基因集,选取C2
fgsea_sets = mdb_c2 [grep("^KEGG",mdb_c2 $gs_name),] %>% split(x = .$gene_symbol, f = .$gs_name)
length(fgsea_sets)
fgseaRes<- fgsea(fgsea_sets, stats = ranks, nperm = 1000) #运行fgsea

p <-ggplot(fgseaRes %>% as_tibble() %>% arrange(desc(NES)) %>% filter(pval < 0.05) %>% head(n= 20), aes(reorder(pathway, NES), NES)) +
  geom_col(aes(fill= NES)) +
  coord_flip() +
  labs(x="KEGG", y="Normalized Enrichment Score",title="KEGG gene sets NES from GSEA") ##输出差异排秩前20的条目
pdf('GSEA-fgsea.pdf',width=8,height=5)
print(p)
dev.off()
pdf('fgsea_KEGG_PRIMARY_IMMUNODEFICIENCY.pdf',width=8,height=5)
plotEnrichment(fgsea_sets[["KEGG_PRIMARY_IMMUNODEFICIENCY"]],ranks) + labs(title="KEGG_PRIMARY_IMMUNODEFICIENCY") #对某一特定通路分析
dev.off()
GSEA-fgsea

fgsea_KEGG_PRIMARY_IMMUNODEFICIENCY

X轴是所有基因的排序,这儿大约1200个。每个黑条是该基因集中的基因,这样我们可以知道基因在排序列表中的位置。如果基因集位于预先排列的基因列表的顶部,则通过某种度量计算出富集分数(enrichment score,ES),ES为正。如果基因集位于预先排列的基因列表的底部,则ES为负。从以上图中可以看出多数基因都落在了峰值之后(绿线峰值)的核心基因集中,表明基因在该数据集中被抑制,也就是低表达。

clusterProfiler包分析

geneset <- read.gmt("c2.cp.kegg.v7.3.entrez.gmt")  
geneset$ont = str_remove(geneset$ont,"HALLMARK_")
markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.1, logfc.threshold = 0)
gs <-bitr(rownames(markers), fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
markers1<-cbind(markers[gs[,1],],gs)
geneList = markers1$avg_logFC
names(geneList) = markers1$ENTREZID
geneList = sort(geneList,decreasing = T)
egmt <- GSEA(geneList, TERM2GENE=geneset,verbose=F)
egmt1<- setReadable(egmt,OrgDb=org.Hs.eg.db, keyType = "ENTREZID")
y=data.frame(egmt2)
pdf("dotplot.pdf")#气泡图,展示geneset被激活还是抑制
dotplot(egmt1,split=".sign")+facet_grid(~.sign)
dev.off()
library(enrichplot)
for(i in seq_along(egmt@result$ID)){
p <- gseaplot2(egmt, geneSetID = i, title = egmt@result$ID[i])
filename <- paste0('GSEA', egmt@result$ID[i], '.png')
ggsave(filename = filename, p, width = 8, height = 5)
}
GSEA_KEGG_LYSOSOME

三, GSVA分析
GSVA要求输入表达矩阵和基因集列表。表达矩阵从seurat对象导入即可
导入相关包

library(ggplot2)
library(dplyr)
library(msigdbr)
library(Seurat)
library(GSVA)
library(pheatmap)
library(patchwork)

具体代码

setwd("/home/wucheng/jianshu/function/data")
pbmc <-readRDS("pbmc.rds")
expr <- as.data.frame(pbmc@assays$RNA@data) #表达矩阵
meta <- [email protected][,c("orig.ident","seurat_clusters")] #类别
m_df = msigdbr(species = "Homo sapiens", category = "C2", subcategory = "CP:KEGG") #选取物种人类
msigdbr_list = split(x = m_df$gene_symbol, f = m_df$gs_name)
expr=as.matrix(expr) 
kegg <- gsva(expr, msigdbr_list, kcdf="Gaussian",method = "gsva",parallel.sz=10) #gsva
pheatmap(kegg, show_rownames=1, show_colnames=0, annotation_col=meta,fontsize_row=5, filename='gsva_heatmap.png', width=15, height=12)#绘制热图
es <- data.frame(t(kegg),stringsAsFactors=F)  #添加到单细胞矩阵中,可视化相关通路的在umap上聚集情况,可理解为一个通路即一个基因
scRNA <- AddMetaData(pbmc, es)
p1 <- FeaturePlot(scRNA, features = "KEGG_PRIMARY_BILE_ACID_BIOSYNTHESIS", reduction = 'umap')
p2 <- FeaturePlot(scRNA, features = "KEGG_ETHER_LIPID_METABOLISM", reduction = 'umap')
p3 <- FeaturePlot(scRNA, features = "KEGG_RIBOSOME", reduction = 'umap')
p4 <- FeaturePlot(scRNA, features = "KEGG_ASTHMA", reduction = 'umap')
plotc = (p1|p2)/(p3|p4)
ggsave('gsva_featureplot.png', plotc, width = 10, height = 8) #输出图片
##每个细胞类别与功能相关热图
meta <- meta %>%arrange(meta$seurat_clusters)
data <- kegg[,rownames(meta)]
group <- factor(meta[,"seurat_clusters"],ordered = F)
data1 <-NULL
for(i in 0:(length(unique(group))-1)){
ind <-which(group==i)
dat <- apply(data[,ind], 1, mean)
data1 <-cbind(data1,dat)
}
colnames(data1) <-c("C0","C1","C2","C3","C4","C5","C6","C7","C8")
result<- t(scale(t(data1)))
result[result>2]=2
result[result<-2]=-2
library(pheatmap)
p <- pheatmap(result[1:20,],
                cluster_rows = F,
                cluster_cols = F,
                show_rownames = T,
                show_colnames = T,
                color =colorRampPalette(c("blue", "white","red"))(100),
                cellwidth = 10, cellheight = 15,
                fontsize = 10)
pdf(("gsva_celltype.pdf"),width = 7,height = 7)
print(p)
dev.off()
gsva_heatmap

gsva_featureplot

gsva_celltype

假设为两组对照实验,我们希望看到一些通路(功能是否显著),则可使用下面方法

DD1 <-  subset(pbmc,idents = c("3","6")) # 对照组放在前面
expr <- as.data.frame(DD1@assays$RNA@data)
meta <- [email protected][,c("seurat_clusters")]
m_df = msigdbr(species = "Homo sapiens", category = "C2")
msigdbr_list = split(x = m_df$gene_symbol, f = m_df$gs_name)
expr=as.matrix(expr)
kegg <- gsva(expr, msigdbr_list, kcdf="Gaussian",method = "gsva",parallel.sz=10)
library(limma) ##limma做差异分析
exprSet <- kegg
group <- factor(meta,levels = c("3","6"),ordered = F)## 分组变成向量,并且限定leves的顺序, levels里面,把对照组放在前面
design <- model.matrix(~group)# 构建比较矩阵
colnames(design) <- levels(group)
fit <- lmFit(exprSet,design)#线性模型拟合
fit1 <- eBayes(fit)#贝叶斯检验
allDiff=topTable(fit1,adjust='fdr',coef=2,number=Inf)
write.table(allDiff,"kegg.txt",col.names=T,row.names=T,sep="\t")
#根据allDiff找出你自己感兴趣的通路,例如
up <- c("REACTOME_SEROTONIN_NEUROTRANSMITTER_RELEASE_CYCLE",'REACTOME_DOPAMINE_NEUROTRANSMITTER_RELEASE_CYCLE','REACTOME_NEUROTRANSMITTER_RELEASE_CYCLE','REACTOME_NEUROTOXICITY_OF_CLOSTRIDIUM_TOXINS','REACTOME_GABA_SYNTHESIS_RELEASE_REUPTAKE_AND_DEGRADATIO')
down <- c('REACTOME_CREB1_PHOSPHORYLATION_THROUGH_NMDA_RECEPTOR_MEDIATED_ACTIVATION_OF_RAS_SIGNALING','REACTOME_RAS_ACTIVATION_UPON_CA2_INFLUX_THROUGH_NMDA_RECEPTOR','REACTOME_LONG_TERM_POTENTIATION','REACTOME_UNBLOCKING_OF_NMDA_RECEPTORS_GLUTAMATE_BINDING_AND_ACTIVATION','REACTOME_CLEC7A_DECTIN_1_INDUCES_NFAT_ACTIVATIO')
TEST <- c(up,down)
p <- allDiff
p$ID <- rownames(p) 
q <- p[TEST,]
group1 <- c(rep("6",5),rep("3",5)) 
df <- data.frame(ID = q$ID, score = q$t,group=group1 )
# 按照score排序
sortdf <- df[order(df$score),]
sortdf$ID <- factor(sortdf$ID, levels = sortdf$ID)#增加通路ID那一列
head(sortdf)
p <- ggplot(sortdf, aes(ID, score,fill=group)) + geom_bar(stat = 'identity') + 
                    coord_flip() + 
                    theme_bw() + #去除背景色
                    theme(panel.grid =element_blank())+
                    theme(panel.border = element_rect(size = 0.6)) 
pdf("gsva_dif.pdf",width=12,height=8)
p
dev.off()
gsva_dif

希望大家关注点赞,谢谢!!!!!!!!!!!!

你可能感兴趣的:(单细胞转录组功能分析(超详细clusterProfiler,GSEA,GSVA分析))