非模式生物如何做富集分析

最近在分析数据水稻的转录组数据，遇到了一系列问题，感觉是之前没有踏踏实实学习转录组数据分析而留下的隐患全部爆雷了。也挺好！这样才能知道不足。这次是水稻的富集分析，参考网上的资料，记录一下学习过程

1. KEGG分析（以水稻为例）

因为网上有 hoptop 师兄的教程，所以先讲它。

1.1 水稻的基因命名

水稻的基因组有两个版本，序列相同，对基因的命名不同，一个是MSU(MSU Rice Genome Annotation Project，又称TIGR)，该项目主要是对是对粳稻（亚种）日本晴（Nipponbare subspecies of rice)的12条染色体序列提供功能注释。因此从Rice Genome Annotation Project 可以下载到基因组序列(all.chrs.con)和注释文件(all.gff3)，目前是第7个版本（2013年），称呼为：RGAP7.0 它的命名规则是这样的：

Each nuclear gene is labeled LOC_OsXXg##### with LOC_Os referring to Oryza sativa locus, XX referring to chromosome 01-12, g referring to gene, and a 5-digit number referring to the gene order on the chromosome. A convention of LOC_Osp#g##### is used for plastidic genes, while LOC_Osm#g##### for mitochondrial genes. The genes (loci) are numbered sequentially along the chromosome or organellar genome in increments of 10 which will allow for insertion of future loci. Sufficient spacing in the numbering system for physical gaps has been provided in the sequence to allow for insertion of new genes in the event the physical gap is filled.

所以注释文件gff3如下图所示：

MSU

第一个基因LOC_Os01g01010 之后紧接着是 LOC_Os01g01019，依次是 30 和 40，每个之间相隔差不多10位。（只不过为什么第一个不是LOC_Os01g00010呢？）

另一个是RAP(The Rice Annotation Project)，RAP-DB 是这样描述自己的：

The Rice Annotation Project (RAP) was conceptualized in 2004 upon the completion of the Oryza sativa ssp. japonica cv. Nipponbare genome sequencing by the International Rice Genome Sequencing Project with the aim of providing the scientific community with an accurate and timely annotation of the rice genome sequence. One of the major objectives of this project is to facilitate a comprehensive analysis of the genome structure and function of rice on the basis of the annotation.

它提供的基因组称之为 IRGSP-1.0 ，命名规则如下：

Since the International Rice Genome Sequencing Project (IRGSP) completed the genome sequencing of Oryza sativa L. ssp. japonica cultivar Nipponbare, it was anticipated to decipher all of the genic regions in the genome. Systematic locus identifiers were assigned to the RAP loci on the IRGSP genome assembly. An ID (OsXXg#######) consists of the species name (Os for Oryza sativa), a two-digit number for chromosomes, the type of an identifier (g for genes), and a seven-digit number that indicates a sequential order of loci in a chromosome. This nomenclature was proposed by the Committee on Gene Symbolization, Nomenclature and Linkage in the First and Second Rice Annotation Project Meetings, modified on the basis of intensive discussions, and then approved by IRGSP/RAP. The RAP annotations with the Os identifiers were submitted to DDBJ/EMBL/Genbank under the accession numbers of AP008207-AP009218, and used in RefSeq. The Os code is used also in UniProtKB/Swiss-Prot as a locus identifier. For details, see the following paper

注释文件如下图所示：

PAP-DB

采用的是后面七位数字编号，同时每个基因相隔100。但是其中也有叫 ID =Os1t0100100-01 ，不知道什么意思。不过和前面的有联系，看似只是 g 改成 t 了，后面加上-01，但是其实有些后面是-02，所以还需要对应ID匹配文本。同时NCBI RefSeq采用的也是这种命名规则。因此就出现了两种ID相互批量转换以及RAP_ID转 Uniprot 的需要了，目前RIGW提供了ID批量转换的功能(PlantGSEA已经挂了），当然你也可以下载对应的命名对应文本文件 Uniprot然后写脚本来提取你提供对应的基因名，脚本参考粳稻参考基因组ID转换，修改之后放在这里那么就可以转换实现上述的需要。

1.2 enrichKEGG()

使用 clusterProfiler R包中的enrichKEGG()函数进行KEGG分析，用法如下：

enrichKEGG(gene,organism="hsa",keyType="kegg",pvalueCutoff=0.05,pAdjustMethod="BH",qvalueCutoff=0.2) 
# gene ：a vector of entrez gene id ,其实也不全是Entrez ID ,依赖于后面的 organism 
# organism ：这个需要到 https://www.genome.jp/kegg/catalog/org_list.html 上去找你的目标物种的缩写

因此如hoptop教程所述，如果不知道该输入什么样的 id给 gene，就可以用
enrichKEGG(gene="abc",organism="你的物种缩写",keyType="kegg") 试出来
有了差异基因，用上面脚本转成想要的基因ID，水稻的KEGG就完成了。

2. GO分析（以番茄为例）

2.1 AnnotationHub

这里就要使用AnnotationHub在线下载注释包orgdb了，参考文章有saber01 生信笔记 Y叔以下是代码，以番茄(solanum lycopersicum)为例：

# install AnnotationHub if (!requireNamespace("BiocManager", quietly = TRUE))
 install.packages("BiocManager")
BiocManager::install("AnnotationHub", version = "3.8")

library(AnnotationHub)
ah <- AnnotationHub() #然后就是更新下载 unique(ah$dataprovider)
unique(ah$rdataclass)
display(ah) # shiny写的网页版搜索界面 # downlaod orgdb,take tomato as example
query(ah," Solanum lycopersicum")
sly <- ah[["AH61992"]]
sly
saveDb(sly,file="sly.orgdb")
laodDb(file="sly.orgdb")

# 查看sly的具体信息，keys与 keytype搭配使用 columns(sly)
length(keys(sly))
head(keys(sly)) # 此时默认keytype="ENTREZID" head(keys(sly,keytype="GENENAME"))
keys(sly,keytype = "ENTREZID", column = "REFSEQ", pattern = "XP_010323781.1")
keys(sly,keytype = "ENTREZID", colums = c("ENTREZID","GO"),sly)
head(select(sly,keys=keys(sly),columns= c("GENENAME","GO"),keytype = "ENTREZID"))

2.2 enrichGO

用法：

enrichGO(gene, orgDb, keyType = "ENTREZID", ont = "MF", pvalueCutoff = 0.05, pAdjustMethod = "BH", pool = FALSE) 
# gene 必须是 Entrez ID; 
# pool : if ont="ALL", whether pool 3 GO sub-ontologies

因此需要将差异基因的命名ID 转化为 Entrez ID，这里就需要用到上面的 sly orgdb 注释包和 bitr 函数，番茄我用的是SL3.0 的参考基因组及其注释文件，之所以不用SL2.5/2.4是因为这两个版本的注释文件对基因的命名采用的是 Sol genomics 的命名法，类似于：SolycXXgXXXXXX，这个命名法在并不通用，只有Uniprot可以搜索到对应的蛋白，而NCBI gene只能搜索到很少一部分的结果，因此我使用的是SL3.0，注释文件gtf中有 gene_name一项是对应了sly orgDb注释包的 SYMBOL 一列，所以在获得read_count数的那步，使用 featureCounts应该使用以下命令：

featureCounts -T 16 -p -t exon -g gene_name -a SL3.gtf -o featureCountx.txt SRRxxxxxxx.bam &
 # 这里 -g 选择的是 gene_name ,而不是 gene_id

因此可以通过这个联系对差异基因的基因名进行基因ID的转换。

sly_SigExpGene_EntrezID <- bitr(SigExp_gene, fromType = "SYMBOL", toType = "ENTREZID",OrgDb = sly)

之后就可以进行GO分析了

3.总结

我分别以两个“非”模式生物（这里的“非”指的是不包括在19个物种的意思）来阐述类似物种如何去做KEGG和GO富集分析，其中关键的问题其实是对于不同基因ID之间的转换，而这个问题对于研究植物的人来说确实是个难题。比如上面如果公司给你的测序结果是以SL2.5/2.4作为参考基因组注释文件就难办了。因为到目前为止我还没找到SolycXXgXXXXXX直接对应其他通用ID如Uniprot_ID或者Entrez ID，有一个解决办法是通过爬虫或者Uniprot API获得对应的匹配文件，然后再找到对应的EntrezID。
以上！