NGS 测序项目,不管是基因组测序,还是转录组测序,通常会得到一个基因列表,记录了基因突变,或者高/低表达量。
对成百上千甚至上万个基因进行解读,往往是困难的,对基因进行分组以帮助对数据的理解就非常有必要。KEGG 富集分析就是一种非常流行的对基因集进行分组的方法。
BiocManager::install("clusterProfiler")
BiocManager::install("org.Hs.eg.db")
clusterProfiler,功能强大的用于富集分析的 R 包
org.Hs.eg.db,用于转换各种基因 ID 的 R 包
suppressMessages(library(clusterProfiler))
suppressMessages(library(org.Hs.eg.db))
假定经过上游分析,得到了如下的基因列表:
x <- c("GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "HCLS1", "SOD2", "HSPA2",
"ORM1", "IGFBP1", "PTHLH", "GPC3", "IGFBP3","TOB1", "MITF", "NDRG1",
"NR1H4", "FGFR3", "PVR", "IL6", "PTPRM", "ERBB2", "NID2", "LAMB1",
"COMP", "PLS3", "MCAM", "SPP1", "LAMC1", "COL4A2", "COL4A1", "MYOC",
"ANXA4", "TFPI2", "CST6", "SLPI", "TIMP2", "CPM", "GGT1", "NNMT",
"MAL", "EEF1A2", "HGD", "TCN2", "CDA", "PCCA", "CRYM", "PDXK",
"STC1", "WARS", "HMOX1", "FXYD2", "RBP4", "SLC6A12", "KDELR3", "ITM2B")
因为 KEGG 富集分析用到的函数enrichKEGG
需要的基因列表必须是 Entrez Gene ID,所以需要先将基因名称转换一下:
trans = bitr(x, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(x, fromType = "SYMBOL", toType = "ENTREZID", OrgDb =
## "org.Hs.eg.db"): 1.79% of input gene IDs are fail to map...
看到警告,提醒有一部分基因 ID 没有转换成功,不管它。
kk <- enrichKEGG(gene = trans$ENTREZID,
organism = 'hsa',
pvalueCutoff = 0.05)
## Reading KEGG annotation online:
##
## Reading KEGG annotation online:
点图:
dotplot(kk, showCategory = 10)
条形图:
barplot(kk, showCategory = 10)
查看 KEGG 富集分析的前几条记录:
head(kk)
## ID Description GeneRatio BgRatio pvalue
## hsa04512 hsa04512 ECM-receptor interaction 6/36 88/8159 1.991210e-06
## hsa04151 hsa04151 PI3K-Akt signaling pathway 9/36 354/8159 1.636401e-05
## hsa04510 hsa04510 Focal adhesion 7/36 201/8159 2.257391e-05
## hsa05146 hsa05146 Amoebiasis 5/36 102/8159 7.668538e-05
## hsa05222 hsa05222 Small cell lung cancer 4/36 92/8159 6.767016e-04
## hsa05134 hsa05134 Legionellosis 3/36 57/8159 1.960214e-03
## p.adjust qvalue geneID
## hsa04512 0.0002867343 0.0002536173 3912/1311/6696/3915/1284/1282
## hsa04151 0.0010835477 0.0009584011 2261/3569/2064/3912/1311/6696/3915/1284/1282
## hsa04510 0.0010835477 0.0009584011 2064/3912/1311/6696/3915/1284/1282
## hsa05146 0.0027606736 0.0024418238 3569/3912/3915/1284/1282
## hsa05222 0.0194890068 0.0172380835 3912/3915/1284/1282
## hsa05134 0.0470451438 0.0416115673 3306/3569/1917
## Count
## hsa04512 6
## hsa04151 9
## hsa04510 7
## hsa05146 5
## hsa05222 4
## hsa05134 3
可以看到,geneID 一列是一些斜线隔开的数字,即 Entrez Gene ID,将其转换成基因名更方便人类阅读。
z = setReadable(kk, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
head(z@result)
## ID Description GeneRatio BgRatio pvalue
## hsa04512 hsa04512 ECM-receptor interaction 6/36 88/8159 1.991210e-06
## hsa04151 hsa04151 PI3K-Akt signaling pathway 9/36 354/8159 1.636401e-05
## hsa04510 hsa04510 Focal adhesion 7/36 201/8159 2.257391e-05
## hsa05146 hsa05146 Amoebiasis 5/36 102/8159 7.668538e-05
## hsa05222 hsa05222 Small cell lung cancer 4/36 92/8159 6.767016e-04
## hsa05134 hsa05134 Legionellosis 3/36 57/8159 1.960214e-03
## p.adjust qvalue
## hsa04512 0.0002867343 0.0002536173
## hsa04151 0.0010835477 0.0009584011
## hsa04510 0.0010835477 0.0009584011
## hsa05146 0.0027606736 0.0024418238
## hsa05222 0.0194890068 0.0172380835
## hsa05134 0.0470451438 0.0416115673
## geneID Count
## hsa04512 LAMB1/COMP/SPP1/LAMC1/COL4A2/COL4A1 6
## hsa04151 FGFR3/IL6/ERBB2/LAMB1/COMP/SPP1/LAMC1/COL4A2/COL4A1 9
## hsa04510 ERBB2/LAMB1/COMP/SPP1/LAMC1/COL4A2/COL4A1 7
## hsa05146 IL6/LAMB1/LAMC1/COL4A2/COL4A1 5
## hsa05222 LAMB1/LAMC1/COL4A2/COL4A1 4
## hsa05134 HSPA2/IL6/EEF1A2 3
至此,我们完成了一个基本的 KEGG 富集分析过程。
[1] clusterProfiler 官网教程:https://yulab-smu.top/biomedical-knowledge-mining-book/clusterprofiler-kegg.html
学生信,计算机基础一定要好。毫不夸张地说,计算机基础决定了你能否入门,而生物学修养决定你能走多远。没有出发,如何走得远?
从零开始学生信交流群,免费咨询,量身定制学习计划