实在懒得找meme了
GOrilla, clusterProfiler, topGO 分析 ranked gene list (GSEA) 对比
Genelist preparation
保证了三个工具输入的基因是一致的。
# genelist preparation
library(hgu133plus2.db)
rankedprobe <- row.names(DEGdf)
load("GSE76275_DEGs.Rdata")
## ranked entrezid by pval
rankedentrezt <- AnnotationDbi::select(hgu133plus2.db,
rankedprobe,
"ENTREZID",
"PROBEID")
rankedentrez <- unique(na.omit(rankedentrezt$ENTREZID))
for GOrilla
纠正一个误解,之前以为 GOrilla 要求的 gene list 是按表达量从高到低排序。
This is particularly useful in many typical cases where genomic data may be naturally represented as a ranked list of genes (e.g. by level of expression or of differential expression).
但实际上是依然是基于t检验,也就是基于p值。
All genes were ranked according to how well they differentiate between the two groups using a simple t-test. The top of the list contained the genes that were the best separators between the two groups.
GOrilla 的优势主要体现在:
- 与基于超几何分布的工具相比——阈值灵活
- 与其他 阈值灵活/类似GSEA/利用Kolmogorov-Smirnov检验 的工具相比:
- 专注于GO,可直接生成DAG
- 利用 "mHG " 可得到精确的 p.val
- 速度快
- (似乎放在现在这些都不是什么问题,毕竟 GSEA 2005年发布,GOrilla 2008年发布...不过一布拿到图+表也是很爽的额 (。﹏。)
## ranked symbol (increasing by pval)
rankedsymbol <- AnnotationDbi::select(hgu133plus2.db,
rankedentrez,
"SYMBOL",
"ENTREZID")[,2]
save(rankedsymbol, file = "rankedsymbol_pvbased_forGOrilla.Rdata")
write.table(rankedsymbol, file =
"rankedsymbol_pvbased_forGOrilla.txt",
quote = F, col.names = F, row.names = F)
手动选择 pvalue cutoff
for clusterProfiler
## ranked entrezid with pval (increasing by pval)
matchedprobe <- match(rankedentrez,
rankedentrezt$ENTREZID)
matchedprobe2 <- rankedentrezt[matchedprobe,][,"PROBEID"]
respon.pval <- DEGdf[matchedprobe2,]$P.Value
entrezlist_pval <- respon.pval
names(entrezlist_pval) <- rankedentrez
save(entrezlist_pval, file = "genelist_ENTREZID_pVal.Rdata")
### decreasing entrezid list by pval
entrezlist_dcrpv <- sort(entrezlist_pval, decreasing = T)
save(entrezlist_dcrpv, file = "genelist_ENTREZID_Decr_pVal.Rdata")
### decreasing entrezid list by FC
entrezlist_FC <- DEGdf[matchedprobe2,]$logFC
names(entrezlist_FC) <- rankedentrez
entrezlist_FC <- sort(entrezlist_FC, decreasing = T)
save(entrezlist_FC, file = "genelist_ENTREZID_FC.Rdata")
for topGO
## probeid with p.val list (increasing by pval)
probelist_pval <- respon.pval
names(probelist_pval) <- matchedprobe2
save(probelist_pval, file = "genelist_PROBEID_pVal.Rdata")
GOrilla
Running mode: Single ranked list of genes
很酷炫的一张大图
重复基因会被自动删除
但并不知道哪来的这12个 duplicate genes (+_+)?
table(table(rankedsymbol))
#
# 1
# 22012
Enrichment (N, B, n, b) is defined as follows:
N - is the total number of genes
B - is the total number of genes associated with a specific GO term
n - is the number of genes in the top of the user's input list or in the target set when appropriate
b - is the number of genes in the intersection
Enrichment = (b/n) / (B/N)
缺点是并没有显示富集到了多少个GO term, 强行扒下这个表后,小小操作:
GOrillares <- read.csv("GOrillatable.txt", header = T, sep = "\t")
save(GOrillares, file = "GOrillatable.Rdata")
所以就是有100个 term.
nrow(GOrillares)
# [1] 100
clusterProfiler
GSEA GO-BP
标准的 GSEA.
关于输入的 gene list, y叔是这么说的:
The
geneList
contains three features:
- numeric vector: fold change or other type of numerical variable
- named vector: every number was named by the corresponding gene ID
- sorted vector: number should be sorted in decreasing order
但用 p.val 的效果似乎一般。
## clusterProfiler GSEA GO-BP
library(clusterProfiler)
library(hgu133plus2.db)
load("genelist_ENTREZID_Decr_pVal.Rdata")
### with decreasing p.val
gseaGO_BP <- gseGO(geneList = entrezlist_dcrpv,
OrgDb = hgu133plus2.db,
keyType = "ENTREZID",
ont = "BP",
nPerm = 1000, ## 排列数
minGSSize = 5,
maxGSSize = 500,
pvalueCutoff = 0.95,
verbose = TRUE) ## 不输出结果
gseaGO_BPresult <- gseaGO_BP@result
save(gseaGO_BPresult, file = "cp_gseaGO_BPresult.Rdata")
可能和网络也有关系?几次输入同样的参数,得到的结果并不一样....有时是报错 no term, 迷惑desu
也没什么显著的 term
还是像示例里那样用 decreasing FC
### with decreasing FC
load("genelist_ENTREZID_FC.Rdata")
gseaGO_BP2 <- gseGO(geneList = entrezlist_FC,
OrgDb = hgu133plus2.db,
keyType = "ENTREZID",
ont = "BP",
nPerm = 1000, ## 排列数
minGSSize = 5,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = TRUE) ## 不输出结果
gseaGO_BPresult2 <- gseaGO_BP2@result
save(gseaGO_BPresult2, file = "cp_gseaGO_BPresult2.Rdata")
结果果然正常多了
topGO
GSEA-like GO-BP (Using the genes score )
准确地说,也许不能直接叫做 GSEA 法。topGO 说明书里的说法是:
We will use two types of test statistics: Fisher's exact test which is based on gene counts, and a Kolmogorov-Smirnov like test which computes enrichment based on gene scores.
并且:
We can use both these tests since each gene has a score (representing how di erentially expressed a gene is) and by the means of topDiffGenes functions the genes are categorized into di erentially expressed or not di erentially expressed genes.
正因为有 topDiffGenes
这个函数的存在,相当于无形中给了 target list, 而 allGenes =
相当于 universe gene list. 所以这也许是 topGO
无论用类似 ORA 或 类似 GSEA 法构建的 topGOdata
,算法和检验方法可以各种混用的原因。(maybe <@_@>
## topGO GSEA GO-BP
library(topGO)
library(hgu133plus2.db)
load("genelist_PROBEID_pVal.Rdata")
## function: topDiffGenes
topDiffGenes <- function(allScore) {
return(allScore < 0.01) ## p.val < 0.01
}
sum(topDiffGenes(probelist_pval))
# [1] 12210
## construct a topGOdata object
GSEAGOdata_BP <- new("topGOdata",
ontology = "BP",
allGenes = probelist_pval,
geneSel = topDiffGenes,
nodeSize = 5,
annot = annFUN.db,
affyLib = "hgu133plus2.db")
## Kolmogorov-Smirnov like test - classic
KSres <- runTest(GSEAGOdata_BP, algorithm = "weight01", statistic = "ks")
KSres
#
# Description:
# Ontology: BP
# 'weight01' algorithm with the 'ks' test
# 9106 GO terms scored: 203 terms with p < 0.01
# Annotation data:
# Annotated genes: 16454
# Significant genes: 9542
# Min. no. of genes annotated to a GO: 5
# Nontrivial nodes: 9106
GSEA_BPres <- GenTable(GSEAGOdata_BP,
weight01KS = KSres,
orderBy = "weight01KS",
ranksOf = "weight01KS",
topNodes = 203)
save(GSEA_BPres, file = "topGO_gseaGO_BPresult.Rdata")
结果对比
先找出三种工具结果中共有的 GO term:
nrow(GOrillares) ## GOrilla
# [1] 100
nrow(gseaGO_BPresult2) ## clusterProfiler
# [1] 457
nrow(GSEA_BPres) ## topGO
# [1] 203
GOrillavscp <- intersect(GOrillares$GO.term, gseaGO_BPresult2$ID)
GOrillavstop <- intersect(GOrillares$GO.term, GSEA_BPres$GO.ID)
triple <- intersect(GOrillavscp, GOrillavstop)
triple
# [1] "GO:0050867"
只有一个
row.names(GOrillares) <- GOrillares$GO.term
row.names(GSEA_BPres) <- GSEA_BPres$GO.ID
theonly <- cbind(GOrillares[triple,],
gseaGO_BPresult2[triple,],
GSEA_BPres[triple,])
finetable <- subset(theonly, select = c("Description",
"Enrichment..N..B..n..b.", "P.value","FDR.q.value",
"setSize","enrichmentScore","pvalue", "p.adjust","qvalues",
"Annotated", "Significant","weight01KS"))
finetable
# Description Enrichment..N..B..n..b.
# GO:0050867 positive regulation of cell activation 35.01 (17224,328,12,8)
# P.value FDR.q.value setSize enrichmentScore pvalue
# GO:0050867 7.67e-11 6.28e-08 312 0.2948657 0.00203252
# p.adjust qvalues Annotated Significant weight01KS
# GO:0050867 0.04044123 0.03411746 313 174 0.00383
看起来, GOrilla 的结果和常用的两个R包的结果差的还是挺多的
出于好奇看一下 topGO 和 clusterProfiler 结果的差别
topvscp <- intersect(GSEA_BPres$GO.ID, gseaGO_BPresult2$ID)
length(topvscp)
# [1] 19
大概就是这个亚子。
最后,向大家隆重推荐生信技能树的一系列干货!
- 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
- B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
- 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw