Organizing KEGG pathway annotations

## Obtaining KEGG annotations

Both [eggnog-mapper](https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2) and [interproscan](https://github.com/ebi-pf-team/interproscan/wiki) (tools, or databases) can provide KEGG ORTHOLOGY (KO) annotations, i.e. the K number assigned to each gene or transcript. See each tool's wiki for details.

## Obtaining the KO-to-pathway mapping

Go to the [KEGG website](https://www.genome.jp/kegg/) and click [**KEGG BRITE**](https://www.genome.jp/kegg/brite.html) to open that database, where KEGG's manually curated hierarchy files ([BRITE hierarchy files](https://www.genome.jp/kegg/kegg3b.html)) can be downloaded. The file needed here is the one that maps pathways to KOs: click [KEGG Orthology (KO)](https://www.genome.jp/kegg-bin/get_htext) and download the JSON version.

The script below parses that file into a table that is easier to work with.

```python
import json
import re

# A KO node name looks like "K00844  HK; hexokinase [EC:2.7.1.1]".
# The symbol field may list several names ("E1.1.1.1, adh") and the description
# may itself contain "[", so only a trailing "[EC:...]" is captured as the EC field.
KO_PATTERN = re.compile(r"(\S+)\s+([^;]+);\s+(.*?)\s*(\[EC:[^\]]+\])?\s*$")

with open("ko00001.json") as f:
    ko_map_data = json.load(f)

with open("KEGG_pathway_ko.txt", "w") as oh:
    line = "level1_pathway_id\tlevel1_pathway_name\tlevel2_pathway_id\tlevel2_pathway_name"
    line += "\tlevel3_pathway_id\tlevel3_pathway_name\tko\tko_name\tko_des\tec\n"
    oh.write(line)
    for level1 in ko_map_data["children"]:
        # Level 1 and level 2 names have the form "<id> <name>"
        m = re.match(r"(\S+)\s+(.+)", level1["name"])
        level1_pathway_id = m.group(1)
        level1_pathway_name = m.group(2).strip()
        for level2 in level1["children"]:
            m = re.match(r"(\S+)\s+(.+)", level2["name"])
            level2_pathway_id = m.group(1)
            level2_pathway_name = m.group(2).strip()
            for level3 in level2["children"]:
                # Level 3 names may end with a bracketed tag such as
                # "[PATH:ko00010]", which is dropped here
                m = re.match(r"(\S+)\s+([^\[]*)", level3["name"])
                level3_pathway_id = m.group(1)
                level3_pathway_name = m.group(2).strip()
                # Some level 3 nodes have no KO children
                for ko in level3.get("children", []):
                    m = KO_PATTERN.match(ko["name"])
                    if m is None:
                        continue  # skip names that do not follow the expected layout
                    ko_id, ko_name, ko_des, ec = m.groups()
                    if ec is None:
                        ec = "-"
                    fields = [
                        level1_pathway_id, level1_pathway_name,
                        level2_pathway_id, level2_pathway_name,
                        level3_pathway_id, level3_pathway_name,
                        ko_id, ko_name.strip(), ko_des.strip(), ec,
                    ]
                    oh.write("\t".join(fields) + "\n")
```
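The parsing script relies on each node of the JSON file carrying a `name` and, for non-leaf nodes, a `children` list. As a minimal sketch of that nesting, the snippet below walks a hand-written miniature tree with a recursive generator instead of fixed nested loops. The `mini_tree` dictionary and the `iter_leaves` helper are illustrative assumptions, not part of the KEGG download or of the script above.

```python
# Hand-written miniature that mimics the "name"/"children" nesting the
# parsing script expects; it is a mock, not the real ko00001.json.
mini_tree = {
    "name": "ko00001",
    "children": [{
        "name": "09100 Metabolism",
        "children": [{
            "name": "09101 Carbohydrate metabolism",
            "children": [{
                "name": "00010 Glycolysis / Gluconeogenesis [PATH:ko00010]",
                "children": [
                    {"name": "K00844  HK; hexokinase [EC:2.7.1.1]"},
                    {"name": "K12407  GCK; glucokinase [EC:2.7.1.2]"},
                ],
            }],
        }],
    }],
}


def iter_leaves(node, ancestors=()):
    """Yield (ancestor_names, leaf_name) for every node without children."""
    children = node.get("children")
    if not children:
        yield ancestors, node["name"]
        return
    for child in children:
        yield from iter_leaves(child, ancestors + (node["name"],))


rows = list(iter_leaves(mini_tree))
for ancestors, leaf in rows:
    # ancestors[0] is the root node, so skip it when printing the path
    print(" > ".join(ancestors[1:]), "|", leaf)
```

Unlike the nested loops, this walker also yields a pathway node that has no KO children, so a caller would need to filter on the leaf name (for example, keeping only names that start with `K`).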
Running the parsing script produces `KEGG_pathway_ko.txt`. Next, remove duplicate rows:

```python
import pandas as pd

# Read every column as a string so that IDs such as "00010" keep their leading zeros
data = pd.read_csv("KEGG_pathway_ko.txt", sep="\t", dtype=str)
data = data.drop_duplicates()
data.to_csv("KEGG_pathway_ko_uniq.txt", index=False, sep="\t")
```

The result is `KEGG_pathway_ko_uniq.txt`, which holds the KO-to-pathway mapping together with the pathway classification (KEGG pathways are organized into three levels), as shown below:

```shell
level1_pathway_id	level1_pathway_name	level2_pathway_id	level2_pathway_name	level3_pathway_id	level3_pathway_name	ko	ko_name	ko_des	ec
09100	Metabolism	09101	Carbohydrate metabolism	00010	Glycolysis / Gluconeogenesis	K00844	HK	hexokinase	[EC:2.7.1.1]
09100	Metabolism	09101	Carbohydrate metabolism	00010	Glycolysis / Gluconeogenesis	K12407	GCK	glucokinase	[EC:2.7.1.2]
09100	Metabolism	09101	Carbohydrate metabolism	00010	Glycolysis / Gluconeogenesis	K00845	glk	glucokinase	[EC:2.7.1.2]
```

## Merging the results

Because everything is now in tabular form, it is easy to merge these mappings for downstream analysis. For example, the KEGG annotations can be summarized by pathway category or by level, or a specific gene set can be tested for KEGG pathway enrichment.
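As a rough sketch of such a merge, the snippet below joins a gene-to-KO table with the KO-to-pathway table and counts distinct genes per pathway. The `gene2ko` table, its column names `gene` and `ko`, and the gene IDs are hypothetical stand-ins for whatever the annotation tool produced; `ko2pathway` reuses a few columns in the layout of `KEGG_pathway_ko_uniq.txt`, built inline so the example runs on its own.

```python
import pandas as pd

# Hypothetical gene-to-KO annotation (e.g. extracted from an annotation tool's
# output); the gene IDs and column names are made up for illustration.
gene2ko = pd.DataFrame({
    "gene": ["gene1", "gene2", "gene3", "gene4"],
    "ko": ["K00844", "K00845", "K00844", None],  # gene4 has no KO assigned
})

# A few rows in the layout of KEGG_pathway_ko_uniq.txt (only the columns used here)
ko2pathway = pd.DataFrame({
    "level2_pathway_name": ["Carbohydrate metabolism"] * 3,
    "level3_pathway_name": ["Glycolysis / Gluconeogenesis"] * 3,
    "ko": ["K00844", "K12407", "K00845"],
})

# Inner join: keep only genes whose KO appears in at least one pathway
merged = gene2ko.dropna(subset=["ko"]).merge(ko2pathway, on="ko", how="inner")

# Number of distinct genes per level 2 / level 3 pathway
summary = (
    merged.groupby(["level2_pathway_name", "level3_pathway_name"])["gene"]
    .nunique()
    .reset_index(name="n_genes")
)
print(summary)
```

With real data, `ko2pathway` would instead be loaded with `pd.read_csv("KEGG_pathway_ko_uniq.txt", sep="\t", dtype=str)`. Because one KO can belong to several pathways, a gene is counted once in each pathway its KO maps to.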
