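Enrichment Analysis with clusterProfiler

clusterProfiler is an R package written by Y叔 (Guangchuang Yu), a well-known developer in the bioinformatics community; once you have seen its plots, it is hard not to get hooked on how good they look. Much of this post is translated directly from the official documentation, with my own test data added (fruit fly again).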


clusterProfiler: universal enrichment tool for functional and comparative study

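I. Terminology

(1) Gene sets and pathways: a gene set is an unordered collection of functionally related genes. If the functional relationships between its genes are ignored, a pathway can be interpreted as a gene set.

(2) Gene Ontology (GO): GO defines concepts/classes used to describe gene function. It classifies functions along three aspects: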

  • MF: Molecular Function - molecular activities of gene products
  • CC: Cellular Component - where gene products are active
  • BP: Biological Process - pathways and larger processes made up of the activities of multiple gene products
GO terms are organized in a directed acyclic graph, where an edge between two terms represents a parent-child relationship.

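(3) Kyoto Encyclopedia of Genes and Genomes (KEGG): KEGG is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that fall into 7 broad categories: metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, and drug development.

II. Functional Enrichment Analysis Methods

(1) Over Representation Analysis (ORA)

Over Representation Analysis (ORA) (Boyle et al. 2004) is a widely used approach for determining whether known biological functions or processes are over-represented (= enriched) in an experimentally derived gene list, for example a list of differentially expressed genes (DEGs).

The p-value can be calculated from the hypergeometric distribution. N is the total number of background genes, M is the number of genes annotated to the gene set of interest (e.g. a GO term), n is the size of the gene list of interest, and k is the number of genes in that list that are annotated to the gene set. By default, the background is all genes that have an annotation. P-values should be adjusted for multiple comparisons.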
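To make the roles of N, M, n and k concrete, here is a minimal base-R sketch of the upper-tail probability P(X >= k). The counts are made up for illustration only, and the comparison with fisher.test is there to show that the two formulations agree; it is not meant as a description of how clusterProfiler computes the value internally.

```r
# Hypothetical counts, chosen only to illustrate the calculation.
N <- 6000  # background genes
M <- 50    # background genes annotated to the gene set
n <- 60    # genes in the list of interest
k <- 8     # genes in the list that are annotated to the gene set

# Upper tail P(X >= k). phyper(q, ...) with lower.tail = FALSE gives P(X > q),
# hence the k - 1.
p_hyper <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on the 2x2 table
# (rows: in list / not in list, columns: in set / not in set).
tab <- matrix(c(k, M - k, n - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(hypergeometric = p_hyper, fisher = p_fisher))
```

In a real analysis this p-value is computed once per gene set, which is why the multiple-comparison adjustment matters.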


[Figure: p-value formula]

An example: the test corresponds to the one-sided version of Fisher's exact test.


[Figure: the example from the documentation]

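(2) Gene Set Enrichment Analysis: Functional Class Scoring (FCS)

The usual approach is to pick out the differentially expressed genes of interest and run an enrichment analysis on them. This finds genes whose differences are large, but it misses genes whose differences are small, even though those genes may be regulated in a coordinated way. Gene Set Enrichment Analysis (GSEA) (Subramanian et al. 2005) directly addresses this limitation. All genes can be used in GSEA; it aggregates the per-gene statistics within a gene set, so it can detect situations where all genes in a predefined set change in a small but coordinated way. This matters because many relevant phenotypic differences are likely to show up as small but consistent changes in a set of genes.

Genes are ranked according to the phenotype. Given an a priori defined gene set S (e.g. genes sharing the same DO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or are found primarily at the top or bottom.

GSEA has three key elements:

  • Calculation of an Enrichment Score
    The enrichment score (ES) represents the degree to which the set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when the gene is not in S. The size of the increment depends on the gene statistic (e.g. the correlation of the gene with the phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic (Subramanian et al. 2005).
  • Estimation of Significance Level of ES
    The p-value of the ES is calculated with a permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set on the permuted data, which generates a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
  • Adjustment for Multiple Hypothesis Testing
    When the whole collection of gene sets is evaluated, the significance level is adjusted for multiple hypothesis testing; q-values are calculated for FDR control.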
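The first two elements can be sketched in a few lines of base R. This is a simplified illustration under my own assumptions, not the fgsea or DOSE implementation: the function names enrichment_score and permutation_pvalue are hypothetical, the toy statistics are made up, and normalization of the ES (the NES) is left out.

```r
# Running-sum enrichment score for one gene set (simplified sketch).
# ranked_stats: named numeric vector, already sorted in decreasing order.
enrichment_score <- function(ranked_stats, gene_set, exponent = 1) {
  in_set <- names(ranked_stats) %in% gene_set
  weight <- abs(ranked_stats)^exponent
  # hit: increment proportional to the gene statistic
  step_up <- ifelse(in_set, weight / sum(weight[in_set]), 0)
  # miss: uniform decrement
  step_down <- ifelse(in_set, 0, 1 / sum(!in_set))
  running <- cumsum(step_up - step_down)
  # ES is the maximum deviation from zero along the walk
  running[[which.max(abs(running))]]
}

# Permutation p-value: shuffle the gene labels and recompute the ES each time.
permutation_pvalue <- function(ranked_stats, gene_set, n_perm = 1000) {
  observed <- enrichment_score(ranked_stats, gene_set)
  null_es <- replicate(n_perm, {
    shuffled <- ranked_stats
    names(shuffled) <- sample(names(ranked_stats))
    enrichment_score(shuffled, gene_set)
  })
  # compare with null scores of the same sign; the +1 avoids a p-value of zero
  same_sign <- null_es[sign(null_es) == sign(observed)]
  (sum(abs(same_sign) >= abs(observed)) + 1) / (length(same_sign) + 1)
}

set.seed(1)
toy_stats <- setNames(10:1, paste0("g", 1:10))  # made-up ranked statistics
toy_set <- c("g1", "g2", "g3")                  # set sits at the top of the list
es <- enrichment_score(toy_stats, toy_set)
p <- permutation_pvalue(toy_stats, toy_set, n_perm = 200)
print(c(ES = es, p = p))
```

Because the toy set occupies the top three positions, the running sum climbs to its maximum before any decrement happens; a set sitting at the bottom of the list gives a negative ES instead.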

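(3) Leading edge analysis and core enriched genes

Leading edge analysis reports three statistics: Tags is the percentage of genes in the set that contribute to the enrichment score, List is the position in the ranked list at which the enrichment score is attained, and Signal is the strength of the enrichment signal. It is also useful to extract the core enriched genes that contribute to the enrichment.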
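As a rough illustration of how the three numbers relate, the sketch below computes them from made-up values. The signal formula, tags × (1 − list) × N / (N − set size), is the one I understand the GSEA documentation to use; treat both the formula and every number here as assumptions rather than output from a real run.

```r
# All numbers are hypothetical.
n_total   <- 5000  # genes in the ranked list
set_size  <- 16    # genes in the gene set
peak_rank <- 498   # position in the list where the running sum peaks
n_leading <- 16    # set members at or before the peak

tags      <- n_leading / set_size   # share of the set in the leading edge
list_frac <- peak_rank / n_total    # how far down the list the peak is
signal    <- tags * (1 - list_frac) * (n_total / (n_total - set_size))

cat(sprintf("tags=%.0f%%, list=%.0f%%, signal=%.0f%%\n",
            100 * tags, 100 * list_frac, 100 * signal))
```

A high tags value combined with a low list value means the set members are concentrated near one end of the ranking, which is what pushes the signal up.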

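III. Universal enrichment analysis

clusterProfiler supports the hypergeometric test and gene set enrichment analysis for many ontologies/pathways, but many users want to analyze their own data, including unsupported organisms, unsupported ontologies/pathways, or custom annotations. For this, clusterProfiler provides the enricher function for the hypergeometric test and the GSEA function for gene set enrichment analysis, both of which accept user-defined annotations. They take two additional arguments, TERM2GENE and TERM2NAME. As the names suggest, TERM2GENE is a data.frame whose first column is the term ID and whose second column is the mapped gene; TERM2NAME is a data.frame whose first column is the term ID and whose second column is the term name. TERM2NAME is optional.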
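The shape of the two tables is easier to see with a toy example. Everything below (term IDs, gene IDs, term names) is made up; the enricher call only runs if clusterProfiler is installed, and the cutoffs are relaxed and minGSSize is lowered from its default of 10 only because the toy gene sets are tiny.

```r
# Hypothetical annotation: two made-up terms over made-up gene IDs.
term2gene <- data.frame(
  term = rep(c("TERM_A", "TERM_B"), times = c(5, 5)),  # column 1: term ID
  gene = c(paste0("g", 1:5), paste0("g", 4:8)),        # column 2: gene
  stringsAsFactors = FALSE
)
term2name <- data.frame(
  term = c("TERM_A", "TERM_B"),                        # column 1: term ID
  name = c("made-up process A", "made-up process B"),  # column 2: term name
  stringsAsFactors = FALSE
)
genes_of_interest <- c("g1", "g2", "g3")

print(term2gene)

# Only attempted when the package is available.
if (requireNamespace("clusterProfiler", quietly = TRUE)) {
  res <- clusterProfiler::enricher(genes_of_interest,
                                   TERM2GENE = term2gene,
                                   TERM2NAME = term2name,
                                   minGSSize = 1,
                                   pvalueCutoff = 1, qvalueCutoff = 1)
  print(head(as.data.frame(res)))
}
```

The column names do not matter; what matters is the column order, term first and gene (or name) second.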

(1) Input data

For ORA, all we need is a gene vector, i.e. a vector of gene IDs. These gene IDs can be obtained from a differential expression analysis (e.g. with the DESeq2 package).

For GSEA, we need a ranked list of genes. geneList has 3 features:

  • numeric vector: fold change or other type of numerical variable
  • named vector: every number is named by the corresponding gene ID
  • sorted vector: numbers should be sorted in decreasing order

geneList can be built like this:

d <- read.csv(your_csv_file)
## assume that 1st column is ID (no duplicated allowed)
## 2nd column is fold change
## feature 1: numeric vector
geneList <- d[,2]
## feature 2: named vector
names(geneList) <- as.character(d[,1])
## feature 3: decreasing order
geneList <- sort(geneList, decreasing = TRUE)
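A quick check of the three features can save a confusing error later on. The helper below is my own illustration (check_gene_list is a hypothetical name, not a clusterProfiler function) and the fold changes are made up.

```r
# Returns one TRUE/FALSE per required feature of a GSEA input vector.
check_gene_list <- function(x) {
  c(
    numeric    = is.numeric(x),
    named      = !is.null(names(x)) && !anyDuplicated(names(x)),
    decreasing = !is.unsorted(rev(x))  # reversed vector must be non-decreasing
  )
}

toy <- c(geneB = -1.2, geneA = 2.5, geneC = 0.3)  # hypothetical fold changes
print(check_gene_list(toy))                          # not yet sorted
print(check_gene_list(sort(toy, decreasing = TRUE))) # after sorting
```

The duplicate-name check mirrors the "no duplicates allowed" comment in the snippet above.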

My test data:

> head(mydata,3)
  gene_name     female     male    logFC
1   CG32548 0.02310383 72.43205 11.61428
2   CG15892 0.02624160 57.22716 11.09063
3   CR43803 0.02474626 34.09726 10.42823
> ## gene lists
> # gene vector for ORA
> DEGdata <- subset(mydata,mydata$logFC>2)
> DEgenelist <- as.character(DEGdata$gene_name)
> head(DEgenelist)
[1] "CG32548" "CG15892" "CR43803" "CG15136" "CG4983"  "CG13989"
> # ranked gene list for GSEA
> FCgenelist <- mydata$logFC #numeric vector
> names(FCgenelist) <- as.character(mydata$gene_name) #named vector
> FCgenelist <- sort(FCgenelist,decreasing=T) #decreasing order
> head(FCgenelist)
 CG32548  CG15892  CR43803  CG15136   CG4983  CG13989 
11.61428 11.09063 10.42823 10.34305 10.29130 10.00569

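(2) MSigDb analysis

The Molecular Signatures Database (MSigDB) contains 8 predefined collections of gene sets.


[Figure: from 生信技能树]

You can download the GMT files, parse them with read.gmt, and use the result with enricher() and GSEA(). The R package msigdbr already packages the MSigDB gene sets in a tidy data format that can be used directly with clusterProfiler.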

> ## Molecular Signatures Database
> #library(msigdbr)
> msigdbr_show_species() # supported species
 [1] "Bos taurus"               "Caenorhabditis elegans"  
 [3] "Canis lupus familiaris"   "Danio rerio"             
 [5] "Drosophila melanogaster"  "Gallus gallus"           
 [7] "Homo sapiens"             "Mus musculus"            
 [9] "Rattus norvegicus"        "Saccharomyces cerevisiae"
[11] "Sus scrofa"              
> Dm_msigdbr <- msigdbr(species="Drosophila melanogaster")
> head(Dm_msigdbr, 2) %>% as.data.frame
   gs_id        gs_name gs_cat      gs_subcat human_gene_symbol
1 M12609 AAACCAC_MIR140     C3 MIR:MIR_Legacy             ABCC4
2 M12609 AAACCAC_MIR140     C3 MIR:MIR_Legacy             ABCC4
             species_name entrez_gene gene_symbol
1 Drosophila melanogaster       47905   l(2)03659
2 Drosophila melanogaster       35366      CG9270
                                        sources
1  HomoloGene,PhylomeDB,Ensembl,Panther,OrthoDB
2 HomoloGene,Inparanoid,Ensembl,Panther,OrthoDB
> DmGO <- msigdbr(species="Drosophila melanogaster",category="C5") %>% 
+   dplyr::select(gs_name, entrez_gene, gene_symbol)
> head(DmGO)
# A tibble: 6 x 3
  gs_name                                     entrez_gene gene_symbol
                                                      
1 GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACT~       39097 CG3552     
2 GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACT~       36955 Mtap       
3 GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACT~       33386 GlyP       
4 GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACT~       33386 GlyP       
5 GO_1_4_ALPHA_OLIGOGLUCAN_PHOSPHORYLASE_ACT~       33386 GlyP       
6 GO_1_ACYLGLYCEROPHOSPHOCHOLINE_O_ACYLTRANS~       31899 LPCAT   
> ## universal enrichment analysis
> # TERM2GENE=gmt, TERM2NAME=NA; both are two-column data frames
> # GENE is the gene name, TERM is the GO term ID, NAME is the Description
> em <- enricher(DEgenelist,TERM2GENE=DmGO[,c(1,3)])
> head(em,1)
                                   ID        Description GeneRatio
GO_CILIUM_MOVEMENT GO_CILIUM_MOVEMENT GO_CILIUM_MOVEMENT      8/57
                   BgRatio       pvalue     p.adjust      qvalue
GO_CILIUM_MOVEMENT 46/6055 7.342408e-09 5.903188e-06 5.76089e-06
                                                                    geneID
GO_CILIUM_MOVEMENT CG17564/CG10750/gudu/CG15144/CG15143/Dhc36C/dtr/CG17450
                   Count
GO_CILIUM_MOVEMENT     8
> em1 <- GSEA(FCgenelist,TERM2GENE=DmGO[,c(1,3)])
> head(em1,1)
                 ID Description setSize enrichmentScore     NES
GO_CILIUM GO_CILIUM   GO_CILIUM      16       0.9049122 1.54287
               pvalue    p.adjust qvalues rank
GO_CILIUM 0.001055966 0.001126126      NA  498
                             leading_edge
GO_CILIUM tags=100%, list=10%, signal=90%
                                                                                                core_enrichment
GO_CILIUM CG17564/CG10750/CG9222/TrxT/gudu/CG15144/CG15143/Pkd2/Mks1/Dhc36C/CG6614/dtr/CG17450/B9d2/Erk7/Dhc16F

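IV. Gene Ontology Analysis

(1) Supported organisms

The GO analysis functions groupGO(), enrichGO() and gseGO() support any organism that has an OrgDb available. Bioconductor already provides OrgDb packages for about 20 species, for example org.Hs.eg.db for human, org.Dm.eg.db for fruit fly, org.At.tair.db for Arabidopsis, and org.Mm.eg.db for mouse.

If you have your own GO annotation data (as a data.frame with gene ID in the first column and GO ID in the second), you can run ORA and GSEA with the enricher() and GSEA() functions. If a gene is directly annotated to a GO term (direct annotation), it should also be annotated to that term's ancestor GO nodes (indirect annotation). If you only have direct annotations, you can pass them to the buildGOmap function, which infers the indirect annotations and generates a data.frame suitable for enricher() and GSEA().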
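The idea of inferring indirect annotations can be shown on a toy graph. This is only a concept sketch with made-up term IDs and a hypothetical ancestors_of helper; buildGOmap works against the real GO graph and I am not claiming it is implemented this way.

```r
# Toy DAG (made-up term IDs): each term lists its direct parents.
parents <- list(T3 = c("T2"), T2 = c("T1"), T4 = c("T1"), T1 = character(0))

# All ancestors of a term, found by walking up the parent links.
ancestors_of <- function(term) {
  direct <- parents[[term]]
  unique(c(direct, unlist(lapply(direct, ancestors_of))))
}

# Direct annotations only: gene ID in column 1, term ID in column 2.
direct_anno <- data.frame(gene = c("g1", "g2"), term = c("T3", "T4"),
                          stringsAsFactors = FALSE)

# Add one row per ancestor so every gene is also annotated indirectly.
expanded <- do.call(rbind, lapply(seq_len(nrow(direct_anno)), function(i) {
  terms <- c(direct_anno$term[i], ancestors_of(direct_anno$term[i]))
  data.frame(gene = direct_anno$gene[i], term = terms, stringsAsFactors = FALSE)
}))
expanded <- unique(expanded)
print(expanded)
```

Here g1 is annotated directly to T3 only, but ends up annotated to T3, T2 and T1 after the expansion.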

> columns(org.Dm.eg.db) #keytypes(org.Dm.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
 [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
 [9] "EVIDENCEALL"  "FLYBASE"      "FLYBASECG"    "FLYBASEPROT" 
[13] "GENENAME"     "GO"           "GOALL"        "MAP"         
[17] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PMID"        
[21] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"     
> Dm <- org.Dm.eg.db
> head(keys(org.Dm.eg.db)) # ENTREZID by default
[1] "30970" "30971" "30972" "30973" "30975" "30976"
> length(keys(Dm)) #25009
[1] 25009
> length(keys(Dm,"SYMBOL")) #25009
[1] 25009
> select(org.Dm.eg.db,keys="mle",keytype="SYMBOL", columns=c("ENTREZID"))
'select()' returned 1:1 mapping between keys and columns
  SYMBOL ENTREZID
1    mle    35523

(2) GO classification

In clusterProfiler, groupGO classifies genes based on the distribution of GO terms at a specific level.

groupGO(gene, OrgDb, keyType = "ENTREZID", ont = "CC", level = 2, readable = FALSE)

> ggo <- groupGO(gene=DEgenelist, OrgDb=org.Dm.eg.db,
+                ont="BP", level=2, readable=F)
> head(ggo,3)
                   ID       Description Count GeneRatio geneID
GO:0000003 GO:0000003      reproduction     0     0/929       
GO:0008152 GO:0008152 metabolic process     0     0/929       
GO:0001906 GO:0001906      cell killing     0     0/929

Note: every Count above is 0, most likely because DEgenelist holds gene symbols while groupGO was called with its default keyType = "ENTREZID". Passing keyType="SYMBOL", as in the enrichGO calls further down, should let the genes match.

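(3) GO over-representation test

GO enrichment analysis of a gene set. Given a vector of genes, the enrichGO function returns the enriched GO categories after FDR control. Use the argument readable=TRUE or the setReadable function to map gene IDs to gene symbols, for example ego2 <- setReadable(ego2, OrgDb = org.Hs.eg.db)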

enrichGO(gene, OrgDb, keyType = "ENTREZID", ont = "MF", pvalueCutoff = 0.05, pAdjustMethod = "BH", universe, qvalueCutoff = 0.2, minGSSize = 10, maxGSSize = 500, readable = FALSE, pool = FALSE)

  • universe: background genes. If missing, all genes listed in the database (e.g. the TERM2GENE table) are used as background.
  • qvalueCutoff: q-value cutoff on enrichment tests to report as significant. Tests must pass i) pvalueCutoff on unadjusted p-values, ii) pvalueCutoff on adjusted p-values and iii) qvalueCutoff on q-values to be reported.
  • readable: whether to map gene IDs to gene names
  • pool: if ont='ALL', whether to pool the 3 GO sub-ontologies
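
To see why a term can pass the raw p-value filter and still be dropped, here is a small base-R sketch of the first two filters with made-up p-values. The q-value filter is left out because it needs the separate qvalue package, and the exact comparison used inside clusterProfiler is not something I am asserting here.

```r
pvalue <- c(0.001, 0.008, 0.02, 0.045, 0.5)  # hypothetical raw p-values
cutoff <- 0.05

# Benjamini-Hochberg adjustment, the pAdjustMethod = "BH" used in this post.
p_adjust <- p.adjust(pvalue, method = "BH")

# A term is reported only if both the raw and the adjusted p-value pass.
keep <- pvalue < cutoff & p_adjust < cutoff

print(data.frame(pvalue, p_adjust, keep))
```

The fourth term has a raw p-value below 0.05 but an adjusted p-value above it, so it is filtered out.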
> ## enrichGO
> egoMF <- enrichGO(DEgenelist, OrgDb=org.Dm.eg.db, ont='MF',
+                   pAdjustMethod='BH', pvalueCutoff=0.05, 
+                   qvalueCutoff=0.2, keyType='SYMBOL')
> head(egoMF,1);dim(egoMF)
                   ID                Description GeneRatio  BgRatio
GO:0045503 GO:0045503 dynein light chain binding     7/487 25/12127
                 pvalue   p.adjust     qvalue
GO:0045503 4.128344e-05 0.01080618 0.01041095
                                                 geneID Count
GO:0045503 CG1571/Dhc36C/Dhc16F/Sdic4/btv/Sdic2/CG10859     7
[1] 11  9
> egoall <- enrichGO(DEgenelist, OrgDb=org.Dm.eg.db, ont='ALL',
+                    pAdjustMethod='BH', pvalueCutoff=0.05, 
+                    qvalueCutoff=0.2, keyType='SYMBOL')
> head(egoall,1);dim(egoall)
           ONTOLOGY         ID Description GeneRatio   BgRatio
GO:0005929       CC GO:0005929      cilium    20/492 144/11938
                 pvalue     p.adjust       qvalue
GO:0005929 1.780602e-06 0.0003404754 0.0003124169
                                                                                                                                   geneID
GO:0005929 CG1571/CG12395/Mks1/Dhc36C/dtr/B9d2/Dhc16F/Gas8/TbCMF46/CG14020/IFT57/Sdic4/tilB/unc/CG10958/btv/CG13999/CG31803/Sdic2/CG10859
           Count
GO:0005929    20
[1] 21 10
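> sum(egoall$ONTOLOGY=="BP") # Biological process: the biological pathways or mechanisms the gene products take part in
[1] 0
> sum(egoall$ONTOLOGY=="CC") # Cellular component: where the gene products are located inside or outside the cell
[1] 10
> sum(egoall$ONTOLOGY=="MF") # Molecular function: the molecular-level function of the gene products
[1] 11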

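(4) Reduce redundancy of enriched GO terms

GO is organized in a parent-child structure, so a parent term overlaps to a large degree with all of its child terms, which can lead to redundant results. To address this, clusterProfiler implements the simplify method to reduce the redundant GO terms produced by enrichGO and gseGO. Internally the function calls GOSemSim (Yu et al. 2010) to compute the semantic similarity among GO terms and removes highly similar terms by keeping one representative term.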
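The "keep one representative" idea can be sketched with a made-up similarity matrix. This greedy loop is my own simplification for illustration, not the algorithm simplify actually runs, and the term IDs, p-values and similarities are all hypothetical.

```r
terms <- c("T1", "T2", "T3", "T4")
p_adjust <- c(T1 = 0.001, T2 = 0.004, T3 = 0.02, T4 = 0.03)

# Hypothetical pairwise semantic similarities (symmetric, 1 on the diagonal).
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, dimnames = list(terms, terms))
cutoff <- 0.7

# Visit terms from smallest to largest p.adjust; keep a term only if it is
# not too similar to any term that has already been kept.
kept <- character(0)
for (term in names(sort(p_adjust))) {
  if (!any(sim[term, kept] > cutoff)) {
    kept <- c(kept, term)
  }
}
print(kept)
```

T2 is dropped because it is too similar to T1, and T4 because it is too similar to T3; within each similar pair the term with the smaller p.adjust survives, which is the role select_fun=min and by="p.adjust" play in the real call.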

> egosimp <- simplify(egoMF,cutoff=0.7,by="p.adjust",
+                     select_fun=min,measure="Wang")
> head(egosimp);dim(egosimp)
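> # Approach 1: semantic similarity scores computed from annotation statistics of the terms' common ancestors
> # this covers four methods: Resnik, Lin, Jiang and Schlicker
> # Approach 2: based on the structure of the GO graph: Wang
> # for similarity analysis of sets of GO terms, a combination of the Resnik and Lin methods is commonly used, known as simRel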

(5) GO Gene Set Enrichment Analysis

gseGO( geneList, ont = "BP", OrgDb, keyType = "ENTREZID", exponent = 1, nPerm = 1000, minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, pAdjustMethod = "BH", verbose = TRUE, seed = FALSE, by = "fgsea")

  • exponent: weight of each step
  • nPerm: number of permutations
  • verbose: whether to print messages
  • seed: logical
  • by: one of 'fgsea' or 'DOSE'
> egseGO <- gseGO(FCgenelist, OrgDb=org.Dm.eg.db,
+                 ont='MF',keyType="SYMBOL",
+                 nPerm=1000, minGSSize=100, maxGSSize=500,
+                 pvalueCutoff=0.05, verbose=FALSE, by="fgsea")
> head(egseGO,1);dim(egseGO)
                   ID        Description setSize enrichmentScore
GO:0003674 GO:0003674 molecular_function     323       0.9462411
                NES      pvalue    p.adjust qvalues rank
GO:0003674 1.716837 0.000999001 0.000999001      NA  579
                              leading_edge
GO:0003674 tags=100%, list=11%, signal=95%
core_enrichment
GO:0003674 CG15892/CG15136/CG4983/CG43851/…
[1] 53 11
> head(data.frame(egseGO$ID,egseGO$Description))
   egseGO.ID egseGO.Description
1 GO:0003674 molecular_function
2 GO:0005488            binding

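V. KEGG analysis

The annotation package KEGG.db has not been updated since 2012 and is now quite old. In clusterProfiler, enrichKEGG (KEGG pathways) and enrichMKEGG (KEGG modules) can download the latest online KEGG data for the enrichment analysis. KEGG.db is still supported by setting the use_internal_data argument to TRUE, but this is not recommended. With this feature, the organism is no longer restricted to those supported in earlier versions; it can be any species that has KEGG annotation data in the KEGG database. Pass the abbreviation of the species' scientific name via the organism argument. clusterProfiler provides the search_kegg_organism() function to help search for supported organisms.

(1) ID conversion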

> gene.df <- bitr(DEgenelist,fromType="SYMBOL",toType=c("ENTREZID","ENSEMBL"),
+                 OrgDb = org.Dm.eg.db)
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(DEgenelist, fromType = "SYMBOL", toType = c("ENTREZID",  :
  17.18% of input gene IDs are fail to map...
> head(gene.df)
   SYMBOL ENTREZID     ENSEMBL
1 CG32548    32826 FBgn0052548
2 CG15892  3772603 FBgn0029859
4 CG15136    35033 FBgn0032625

> gene.kegg <- bitr_kegg(gene.df$ENTREZID,fromType="ncbi-geneid",
+                        toType="kegg",organism='dme')
Warning message:
In bitr_kegg(gene.df$ENTREZID, fromType = "ncbi-geneid", toType = "kegg",  :
  98.13% of input gene IDs are fail to map...
> head(gene.kegg)
  ncbi-geneid         kegg
1       35106 Dmel_CG10211
2       35132 Dmel_CG10348
3       35183 Dmel_CG10700

(2) KEGG over-representation test

enrichKEGG(gene, organism = "dme", keyType = "kegg", pvalueCutoff = 0.05,
pAdjustMethod = "BH", universe, minGSSize = 10, maxGSSize = 500,
qvalueCutoff = 0.2, use_internal_data = FALSE)

  • keyType: one of "kegg", 'ncbi-geneid', 'ncbi-proteinid' and 'uniprot'
  • use_internal_data=FALSE: logical, use KEGG.db or the latest online KEGG data
> # use the online data
> ekegg <- enrichKEGG(gene.df$ENTREZID, organism='dme',keyType="ncbi-geneid",
+                     pvalueCutoff=0.05,pAdjustMethod='BH',qvalueCutoff=0.2,
+                     minGSSize=10,maxGSSize=500,use_internal_data=F)
> ekeggx <- setReadable(ekegg,'org.Dm.eg.db','ENTREZID')

enrichKEGG is really slow with the online data, so you can first use createKEGGdb to build a local KEGG.db package. The fruit fly KEGG IDs appear to be the FLYBASECG ID with "Dmel_" in front.
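
If that observation about the prefix holds, converting between the two ID styles is plain string handling. A small base-R sketch, using two of the IDs that appear in the bitr_kegg output above; the conversion rule itself is an assumption based on that output, not something taken from the KEGG documentation.

```r
flybase_cg <- c("CG10211", "CG10348")

# FLYBASECG -> KEGG-style ID (assumed rule: add the "Dmel_" prefix)
kegg_id <- paste0("Dmel_", flybase_cg)

# KEGG-style ID -> FLYBASECG: strip the prefix, anchored at the start
back <- sub("^Dmel_", "", kegg_id)

print(data.frame(flybase_cg, kegg_id, back))
```

Anchoring the pattern with ^ only removes the prefix at the start of each ID and leaves any later occurrence alone.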

remotes::install_github("YuLab-SMU/createKEGGdb")
library(createKEGGdb)
create_kegg_db("dme")
install.packages("KEGG.db_1.0.tar.gz",repos=NULL,type="source")
library(KEGG.db)
> ekegg <- enrichKEGG(gene.kegg$kegg, organism='dme',keyType="kegg",
+                     pvalueCutoff=0.5,pAdjustMethod='BH',qvalueCutoff=0.5,
+                     minGSSize=10,maxGSSize=500,use_internal_data=T)
> head(ekegg,1);dim(ekegg)
               ID Description GeneRatio BgRatio       pvalue   p.adjust     qvalue
dme04145 dme04145   Phagosome      6/43 83/3236 0.0006770752 0.02234348 0.01781777
                                                                               geneID Count
dme04145 Dmel_CG12403/Dmel_CG15148/Dmel_CG15719/Dmel_CG1924/Dmel_CG33497/Dmel_CG33499     6
[1] 12  9
> ekegg@gene <- gsub("Dmel_","",ekegg@gene)
> ekegg@result$geneID <- gsub("Dmel_","",ekegg@result$geneID)
> ekeggx <- setReadable(ekegg,'org.Dm.eg.db','FLYBASECG')
> head(ekeggx,1);dim(ekegg)

(3) KEGG Gene Set Enrichment Analysis

# map symbols to Entrez IDs; symbols that fail to map are dropped by bitr
gene.tx <- bitr(names(FCgenelist),fromType="SYMBOL",toType=c("ENTREZID"),
                OrgDb = org.Dm.eg.db)
colnames(gene.tx)[1] <- "gene_name"
gene.tx <- merge(gene.tx,mydata,by="gene_name")
# take logFC from the merged table so that values and names stay aligned
FClist2 <- gene.tx$logFC #numeric vector
names(FClist2) <- as.character(gene.tx$ENTREZID) #named vector
FClist2 <- sort(FClist2,decreasing=T) #decreasing order

egseKEGG <- gseKEGG(FClist2,organism='dme',keyType="ncbi-geneid",
                    nPerm=1000, minGSSize=10, maxGSSize=500,
                    pvalueCutoff=0.05, pAdjustMethod = "BH")
head(egseKEGG,1);dim(egseKEGG)

(4) KEGG Module over-representation test

For example: mkk <- enrichMKEGG(gene = gene, organism = 'hsa')

(5) KEGG Module Gene Set Enrichment Analysis

For example: mkk2 <- gseMKEGG(geneList = geneList, organism = 'hsa')

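VI. Visualization of Functional Enrichment Result

(1) Bar plot

#library(enrichplot)
# x axis: gene count; y axis: descriptions of the enriched GO terms
# color maps to p.adjust: red for small p-values, blue for large ones
# showCategory sets how many GO terms to show; the default is 10, i.e. the 10 with the smallest p.adjust
barplot(egoMF,showCategory=10)
[Figure: Bar plot]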

(2) Dot plot (bubble chart)

#dotplot(object,x="GeneRatio",color="p.adjust",showCategory=10,
#        size=NULL,split=NULL,font.size=12,title="",...)
# x axis: GeneRatio, the number of genes enriched in the GO term divided by the total number of genes in the list
# y axis: descriptions of the enriched GO terms; showCategory sets how many GO terms to show
dotplot(egoall,showCategory=10)
[Figure: Dot plot 1]
dotplot(egoall,title='Top5 GO terms of each sub-class',
        showCategory=5,split='ONTOLOGY')+ 
  facet_grid(ONTOLOGY~.,scale="free")
[Figure: Dot plot 2]

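(3) Enrichment Map: network of relationships among GO terms

# shows the gene overlap among the enriched GO terms
# each node is an enriched GO term; by default the top 30 enriched GO terms are drawn
# node size is the number of genes enriched in the GO term; node color is p.adjust, red for small and blue for large
# if the differentially expressed genes of two GO terms overlap, the two nodes are connected by an edge
emapplot(egoMF,showCategory=10) 
[Figure: Enrichment Map]
# example from the documentation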
data(gcSample)
xx <- compareCluster(gcSample, fun="enrichKEGG",
                     organism="hsa", pvalueCutoff=0.05)
p1 <- emapplot(xx)
p2 <- emapplot(xx,legend_n=2) 
p3 <- emapplot(xx,pie="count")
p4 <- emapplot(xx,pie="count", pie_scale=1.5, layout="kk")
cowplot::plot_grid(p1, p2, p3, p4, ncol=2, labels=LETTERS[1:4])
[Figure: screenshot from the official documentation]

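(4) Gene-Concept Network: network of GO terms and differentially expressed genes

# shows which genes belong to which enriched GO terms
# gray dots are genes, yellow dots are enriched GO terms
# if a gene is annotated to a GO term, the gene is connected to that term
# the size of a yellow node is the number of enriched genes; by default the top 5 enriched GO terms are drawn
cnetplot(egoMF,showCategory=5)
[Figure: cnetplot]
# circular layout with colored edges
cnetplot(egoall,showCategory=10,foldChange=FCgenelist,circular=TRUE,colorEdge=TRUE)
[Figure: cnetplot-circular]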
p1 <- cnetplot(egoMF,showCategory=3,node_label="category")
p2 <- cnetplot(egoMF,showCategory=3,node_label="gene") 
p3 <- cnetplot(egoMF,showCategory=3,node_label="all") 
p4 <- cnetplot(egoMF,showCategory=3,node_label="none") 
cowplot::plot_grid(p1, p2, p3, p4, ncol=2, labels=LETTERS[1:4])
[Figure: cnetplot4]

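(5) UpSet Plot

upsetplot(egoMF) # emphasizes the gene overlap among different gene sets
upsetplot(egseGO) # for GSEA results, plots the fold change distributions of the different categories
[Figure: upsetplot1]

[Figure: upsetplot2]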

(6) Heatmap-like functional classification

heatplot(egoMF)
heatplot(egoall,foldChange=FCgenelist)
[Figure: heatplot]

(7) GO DAG graph (directed acyclic graph)

#investigate how the significant GO terms are distributed over the GO graph. 
#The goplot function shows subgraph induced by most significant GO terms.
goplot(egoMF,showCategory=5)
[Figure: DAG]

(8) Ridgeline plot for expression distribution of GSEA result

ridgeplot(egseGO)
The test data show no significant results.

(9) Running score and preranked list of GSEA result

p1 <- gseaplot(egseGO,geneSetID=1,by="runningScore",title=egseGO$Description[1])
p2 <- gseaplot(egseGO,geneSetID=1,by="preranked",title=egseGO$Description[1])
p3 <- gseaplot(egseGO,geneSetID=4,title=egseGO$Description[4])
cowplot::plot_grid(p1, p2, p3, ncol=2, labels=LETTERS[1:3])
[Figure: gseaplot]
p4 <- gseaplot2(egseGO,3,title=egseGO$Description[3])
p5 <- gseaplot2(egseGO,2:4,subplots = 1)
p6 <- gseaplot2(egseGO,geneSetID=2:4, pvalue_table=TRUE,
          color = c("#E495A5", "#86B875", "#7DB0DD"), 
          ES_geom = "dot")
cowplot::plot_grid(p4, p5, p6, ncol=1, labels=LETTERS[1:3])
[Figure: gseaplot2]
