Bioinformatics: Querying GO with the Python Package GOATOOLS

GO Programming Exercise

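There are some write-ups online about using the GOATOOLS package, but they tend to skip steps, which is hard on students who are just getting started in bioinformatics, so I am posting a few worked examples for reference.
The key point is to treat a GO term as a data type and to look up the methods it supports in the obo_parser.py file. My examples are very simple, and I hope they are of some help.
Source of the exercises: https://link.springer.com/content/pdf/10.1007%2F978-1-4939-3743-1.pdf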
Here we will use the Python package - GOATOOLS to query the GO. This package can read the GO structure stored in OBO format, which is available from the GO website. After loading this file, it is convenient to traverse the GO hierarchy, search for particular GO terms, and find out which other terms they are related to and how.
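To make the OBO format less abstract, here is a minimal sketch of reading it with the standard library only. It is not how GOATOOLS parses the file; the function name parse_obo and the dict layout are made up for illustration, and the two stanzas are abridged to the fields that show up later in this post.

```python
# Abridged OBO stanzas; real entries in go-basic.obo carry more fields.
SAMPLE_OBO = """\
[Term]
id: GO:0048527
name: lateral root development
namespace: biological_process
is_a: GO:0048528 ! post-embryonic root development

[Term]
id: GO:2000220
name: regulation of pseudohyphal growth
namespace: biological_process
relationship: regulates GO:0007124 ! pseudohyphal growth
"""


def parse_obo(text):
    """Hypothetical helper: map each term ID to a dict of its fields."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {"is_a": [], "relationship": []}
            continue
        if line.startswith("["):  # some other stanza type, e.g. [Typedef]
            current = None
            continue
        if not line or current is None:
            continue
        key, _, value = line.partition(": ")
        value = value.split(" ! ")[0]  # drop the trailing "! comment"
        if key == "id":
            current["id"] = value
            terms[value] = current
        elif key == "is_a":
            current["is_a"].append(value)
        elif key == "relationship":
            rel, target = value.split()[:2]
            current["relationship"].append((rel, target))
        else:
            current[key] = value
    return terms


terms = parse_obo(SAMPLE_OBO)
growth_ids = [tid for tid, t in terms.items() if "growth" in t["name"]]
regulators = [tid for tid, t in terms.items()
              if ("regulates", "GO:0007124") in t["relationship"]]
print(terms["GO:0048527"]["is_a"])
print(growth_ids, regulators)
```

The is_a lines give the parent links, while relationship lines carry typed links such as regulates; GOATOOLS only loads the latter when asked to.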

You can install goatools with pip. The GOATOOLS package contains the function obo_parser.GODag() to load the GO file. Each GO term in the resulting object is an instance of the GOTerm class, which contains many useful attributes, including:

  • GOTerm.name: textual definition;
  • GOTerm.namespace: the ontology the term belongs to (i.e., MF, BP, CC);
  • GOTerm.parents: list of parent terms;
  • GOTerm.children: list of children terms;
  • GOTerm.level: shortest distance to the root node.
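To see what "a GO term as a data type" means in practice, the sketch below builds a tiny stand-in class with the same attribute names as the list above. ToyTerm, its add_parent method and the T:n identifiers are all made up for illustration; this is not the GOATOOLS GOTerm class, only a picture of how parents, children and level relate to each other.

```python
class ToyTerm:
    """Hypothetical stand-in for a GO term (not the GOATOOLS class)."""

    def __init__(self, item_id, name, namespace):
        self.item_id = item_id
        self.name = name
        self.namespace = namespace
        self.parents = []   # direct parents (ToyTerm objects)
        self.children = []  # direct children (ToyTerm objects)

    def add_parent(self, parent):
        # Keep both directions of the edge in sync.
        self.parents.append(parent)
        parent.children.append(self)

    @property
    def level(self):
        """Shortest distance to the root; a root term has level 0."""
        if not self.parents:
            return 0
        return 1 + min(p.level for p in self.parents)


root = ToyTerm("T:0", "root process", "BP")
a = ToyTerm("T:1", "development", "BP")
b = ToyTerm("T:2", "organ development", "BP")
c = ToyTerm("T:3", "root development", "BP")
a.add_parent(root)
b.add_parent(a)
c.add_parent(b)
c.add_parent(root)  # a shortcut edge, so the shortest path to the root is 1 step

print(c.name, c.namespace, c.level)
print([p.item_id for p in c.parents])
```

Because level takes the minimum over all parents, a term with several routes to the root reports the shortest one, which is why c has level 1 here even though one of its parents has level 2.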

Exercise A

  • Download the GO basic file in OBO format (go-basic.obo), and load it using the function obo_parser.GODag() from the package GOATOOLS.
  • Download link for go-basic.obo (right-click the link and choose "Save link as")
  • Answer the following questions:
    • What is the name of the GO term GO:0048527?
    • What are the immediate parent(s) of the term GO:0048527?
    • What are the immediate children of the term GO:0048527?
    • Recursively find all the parent and child terms of the term GO:0048527. Hint: use your
      solutions to the previous two questions, with a recursive loop.
    • How many GO terms have the word “growth” in their name?
    • What is the lowest common ancestor term of GO:0048527 and GO:0097178?
    • Which GO terms regulate GO:0007124 (pseudohyphal growth)? Hint: load the relationship
      tags and look for terms which define regulation.
from goatools import obo_parser

# Path to the downloaded ontology file; change it to match your machine.
obo_file = 'D:/edit/biology informatics/go-basic.obo'

# Load the ontology once and reuse it for every query below.
godag = obo_parser.GODag(obo_file)
term1 = godag.query_term('GO:0048527')

print("The  the immediate parent(s):")
print(term1.parents)
print("The  the immediate children:")
print(term1.children)
print("all the parent and child terms:")
print(term1.get_all_parents())
print(term1.get_all_children())

# Collect the IDs of all terms whose name contains "growth".
reader = obo_parser.OBOReader(obo_file)
x = [rec.id for rec in reader if "growth" in rec.name]
print("GO terms have the word “growth”")
print(x)
print("The number of GO terms have the word “growth”")
print(len(x))

# Common ancestors of the two terms: intersect their ancestor sets.
parent1 = term1.get_all_parents()
term2 = godag.query_term('GO:0097178')
parent2 = term2.get_all_parents()
sameparents = parent1 & parent2
print("sameparent")
print(sameparents)

# Re-read the file with relationship tags and keep the terms that carry
# a "regulates" relationship pointing at GO:0007124. The relations
# positively_regulates and negatively_regulates are separate keys and
# are not checked here.
reader = obo_parser.OBOReader(obo_file, optional_attrs=['relationship'])
print("regulate GO:0007124")
for rec1 in reader:
    if 'GO:0007124' in rec1.relationship.get('regulates', set()):
        print(rec1)

The output is as follows:

The  the immediate parent(s):
{GOTerm('GO:0048528'):
  id:GO:0048528
  name:post-embryonic root development
  namespace:biological_process
  _parents: 2 items
    GO:0048364
    GO:0090696
  parents: 2 items
    GO:0090696	level-03	depth-04	post-embryonic plant organ development [biological_process]
    GO:0048364	level-04	depth-04	root development [biological_process]
  children: 1 items
    GO:0048527	level-05	depth-06	lateral root development [biological_process]
  level:4
  depth:5
  is_obsolete:False
  alt_ids: 0 items}
The  the immediate children:
set()
all the parent and child terms:
{'GO:0009791', 'GO:0048364', 'GO:0090696', 'GO:0048528', 'GO:0032502', 'GO:0099402', 'GO:0008150', 'GO:0048856', 'GO:0032501'}
set()
GO terms have the word “growth”
['GO:0000190', 'GO:0000191', 'GO:0000192', 'GO:0000193', 'GO:0000194', 'GO:0000195', 'GO:0000903', 'GO:0001402', 'GO:0001403', 'GO:0001404', 'GO:0001544', 'GO:0001545', 'GO:0001546', 'GO:0001547', 'GO:0001555', 'GO:0001557', 'GO:0001558', 'GO:0001559', 'GO:0001560', 'GO:0001571', 'GO:0001616', 'GO:0001617', 'GO:0001832', 'GO:0003135', 'GO:0003141', 'GO:0003241', 'GO:0003243', 'GO:0003244', 'GO:0003245', 'GO:0003246', 'GO:0003247', 'GO:0003248', 'GO:0003268', 'GO:0003302', 'GO:0003416', 'GO:0003417', 'GO:0003418', 'GO:0003419', 'GO:0003420', 'GO:0003421', 'GO:0003422', 'GO:0003423', 'GO:0003424', 'GO:0003425', 'GO:0003426', 'GO:0003427', 'GO:0003428', 'GO:0003429', 'GO:0003430', 'GO:0003431', 'GO:0003432', 'GO:0003434', 'GO:0003435', 'GO:0003436', 'GO:0003437', 'GO:0004903', 'GO:0005006', 'GO:0005007', 'GO:0005008', 'GO:0005010', 'GO:0005017', 'GO:0005018', 'GO:0005019', 'GO:0005021', 'GO:0005024', 'GO:0005025', 'GO:0005026', 'GO:0005072', 'GO:0005104', 'GO:0005105', 'GO:0005111', 'GO:0005114', 'GO:0005131', 'GO:0005154', 'GO:0005155', 'GO:0005156', 'GO:0005159', 'GO:0005160', 'GO:0005161', 'GO:0005163', 'GO:0005171', 'GO:0005172', 'GO:0005520', 'GO:0007117', 'GO:0007118', 'GO:0007119', 'GO:0007124', 'GO:0007125', 'GO:0007150', 'GO:0007173', 'GO:0007174', 'GO:0007175', 'GO:0007176', 'GO:0007179', 'GO:0007180', 'GO:0007181', 'GO:0007285', 'GO:0007295', 'GO:0007426', 'GO:0007446', 'GO:0008083', 'GO:0008084', 'GO:0008259', 'GO:0008543', 'GO:0008582', 'GO:0009825', 'GO:0009826', 'GO:0009831', 'GO:0009860', 'GO:0009932', 'GO:0010075', 'GO:0010080', 'GO:0010081', 'GO:0010082', 'GO:0010083', 'GO:0010448', 'GO:0010449', 'GO:0010450', 'GO:0010451', 'GO:0010465', 'GO:0010568', 'GO:0010570', 'GO:0010573', 'GO:0010574', 'GO:0010575', 'GO:0010640', 'GO:0010641', 'GO:0010642', 'GO:0014815', 'GO:0014843', 'GO:0015058', 'GO:0016049', 'GO:0016520', 'GO:0016608', 'GO:0016942', 'GO:0017015', 'GO:0017052', 'GO:0017134', 'GO:0019838', 'GO:0021811', 'GO:0021875', 'GO:0021899', 
'GO:0021907', 'GO:0022003', 'GO:0022026', 'GO:0030252', 'GO:0030307', 'GO:0030308', 'GO:0030353', 'GO:0030372', 'GO:0030373', 'GO:0030426', 'GO:0030427', 'GO:0030447', 'GO:0030448', 'GO:0030511', 'GO:0030512', 'GO:0030616', 'GO:0030617', 'GO:0030618', 'GO:0030715', 'GO:0030947', 'GO:0030948', 'GO:0030949', 'GO:0031384', 'GO:0031385', 'GO:0031770', 'GO:0031994', 'GO:0031995', 'GO:0032455', 'GO:0032584', 'GO:0032601', 'GO:0032605', 'GO:0032643', 'GO:0032646', 'GO:0032683', 'GO:0032686', 'GO:0032723', 'GO:0032726', 'GO:0032902', 'GO:0032903', 'GO:0032904', 'GO:0032905', 'GO:0032906', 'GO:0032907', 'GO:0032908', 'GO:0032909', 'GO:0032910', 'GO:0032911', 'GO:0032912', 'GO:0032913', 'GO:0032914', 'GO:0032915', 'GO:0032916', 'GO:0033665', 'GO:0033666', 'GO:0033667', 'GO:0034713', 'GO:0034714', 'GO:0035001', 'GO:0035264', 'GO:0035265', 'GO:0035266', 'GO:0035318', 'GO:0035463', 'GO:0035464', 'GO:0035465', 'GO:0035559', 'GO:0035602', 'GO:0035603', 'GO:0035604', 'GO:0035607', 'GO:0035625', 'GO:0035728', 'GO:0035729', 'GO:0035766', 'GO:0035768', 'GO:0035790', 'GO:0035791', 'GO:0035793', 'GO:0035842', 'GO:0035924', 'GO:0035980', 'GO:0036095', 'GO:0036119', 'GO:0036120', 'GO:0036165', 'GO:0036168', 'GO:0036170', 'GO:0036171', 'GO:0036177', 'GO:0036178', 'GO:0036180', 'GO:0036187', 'GO:0036267', 'GO:0036323', 'GO:0036324', 'GO:0036325', 'GO:0036332', 'GO:0036363', 'GO:0036364', 'GO:0036365', 'GO:0036366', 'GO:0036454', 'GO:0036458', 'GO:0038004', 'GO:0038005', 'GO:0038029', 'GO:0038033', 'GO:0038044', 'GO:0038045', 'GO:0038084', 'GO:0038085', 'GO:0038086', 'GO:0038087', 'GO:0038088', 'GO:0038089', 'GO:0038090', 'GO:0038091', 'GO:0038167', 'GO:0038168', 'GO:0038180', 'GO:0040007', 'GO:0040008', 'GO:0040009', 'GO:0040010', 'GO:0040014', 'GO:0040015', 'GO:0040018', 'GO:0040036', 'GO:0040037', 'GO:0042057', 'GO:0042058', 'GO:0042059', 'GO:0042065', 'GO:0042066', 'GO:0042547', 'GO:0042567', 'GO:0042568', 'GO:0042702', 'GO:0042814', 'GO:0042815', 'GO:0043183', 'GO:0043184', 
'GO:0043185', 'GO:0043567', 'GO:0043568', 'GO:0043569', 'GO:0043929', 'GO:0043930', 'GO:0044110', 'GO:0044112', 'GO:0044116', 'GO:0044117', 'GO:0044119', 'GO:0044121', 'GO:0044123', 'GO:0044125', 'GO:0044126', 'GO:0044128', 'GO:0044130', 'GO:0044133', 'GO:0044135', 'GO:0044137', 'GO:0044139', 'GO:0044140', 'GO:0044142', 'GO:0044144', 'GO:0044146', 'GO:0044148', 'GO:0044151', 'GO:0044153', 'GO:0044180', 'GO:0044181', 'GO:0044182', 'GO:0044294', 'GO:0044295', 'GO:0044344', 'GO:0044408', 'GO:0044412', 'GO:0045189', 'GO:0045311', 'GO:0045420', 'GO:0045421', 'GO:0045422', 'GO:0045570', 'GO:0045571', 'GO:0045572', 'GO:0045741', 'GO:0045742', 'GO:0045743', 'GO:0045886', 'GO:0045887', 'GO:0045926', 'GO:0045927', 'GO:0045967', 'GO:0046620', 'GO:0046621', 'GO:0046622', 'GO:0048008', 'GO:0048009', 'GO:0048010', 'GO:0048012', 'GO:0048175', 'GO:0048176', 'GO:0048177', 'GO:0048178', 'GO:0048406', 'GO:0048407', 'GO:0048408', 'GO:0048588', 'GO:0048589', 'GO:0048630', 'GO:0048631', 'GO:0048632', 'GO:0048633', 'GO:0048638', 'GO:0048639', 'GO:0048640', 'GO:0048689', 'GO:0048768', 'GO:0050431', 'GO:0051124', 'GO:0051210', 'GO:0051211', 'GO:0051394', 'GO:0051395', 'GO:0051396', 'GO:0051510', 'GO:0051511', 'GO:0051512', 'GO:0051513', 'GO:0051514', 'GO:0051515', 'GO:0051516', 'GO:0051517', 'GO:0051518', 'GO:0051519', 'GO:0051520', 'GO:0051521', 'GO:0051522', 'GO:0051523', 'GO:0051524', 'GO:0051819', 'GO:0051827', 'GO:0051831', 'GO:0051853', 'GO:0051854', 'GO:0051857', 'GO:0052019', 'GO:0052024', 'GO:0052108', 'GO:0052171', 'GO:0052184', 'GO:0052186', 'GO:0052512', 'GO:0052513', 'GO:0055017', 'GO:0055021', 'GO:0055022', 'GO:0055023', 'GO:0060123', 'GO:0060124', 'GO:0060125', 'GO:0060243', 'GO:0060258', 'GO:0060396', 'GO:0060397', 'GO:0060398', 'GO:0060399', 'GO:0060400', 'GO:0060416', 'GO:0060419', 'GO:0060420', 'GO:0060421', 'GO:0060437', 'GO:0060447', 'GO:0060499', 'GO:0060507', 'GO:0060560', 'GO:0060595', 'GO:0060682', 'GO:0060724', 'GO:0060726', 'GO:0060727', 'GO:0060728', 
'GO:0060736', 'GO:0060737', 'GO:0060763', 'GO:0060787', 'GO:0060797', 'GO:0060798', 'GO:0060799', 'GO:0060801', 'GO:0060822', 'GO:0060825', 'GO:0060826', 'GO:0060835', 'GO:0060851', 'GO:0060878', 'GO:0061033', 'GO:0061049', 'GO:0061050', 'GO:0061051', 'GO:0061052', 'GO:0061112', 'GO:0061117', 'GO:0061313', 'GO:0061335', 'GO:0061387', 'GO:0061388', 'GO:0061389', 'GO:0061390', 'GO:0061391', 'GO:0061850', 'GO:0061913', 'GO:0061914', 'GO:0061916', 'GO:0061917', 'GO:0062031', 'GO:0070018', 'GO:0070019', 'GO:0070020', 'GO:0070021', 'GO:0070022', 'GO:0070123', 'GO:0070186', 'GO:0070195', 'GO:0070783', 'GO:0070784', 'GO:0070785', 'GO:0070786', 'GO:0070848', 'GO:0070849', 'GO:0070851', 'GO:0071363', 'GO:0071364', 'GO:0071378', 'GO:0071559', 'GO:0071560', 'GO:0071604', 'GO:0071634', 'GO:0071635', 'GO:0071636', 'GO:0071774', 'GO:0072690', 'GO:0075013', 'GO:0075014', 'GO:0075065', 'GO:0075066', 'GO:0075067', 'GO:0075068', 'GO:0075305', 'GO:0075309', 'GO:0075337', 'GO:0075338', 'GO:0075339', 'GO:0075340', 'GO:0080034', 'GO:0080092', 'GO:0080112', 'GO:0080113', 'GO:0080117', 'GO:0080186', 'GO:0080189', 'GO:0080190', 'GO:0090010', 'GO:0090012', 'GO:0090013', 'GO:0090033', 'GO:0090080', 'GO:0090214', 'GO:0090243', 'GO:0090269', 'GO:0090270', 'GO:0090271', 'GO:0090272', 'GO:0090287', 'GO:0090288', 'GO:0090360', 'GO:0090361', 'GO:0090362', 'GO:0090667', 'GO:0090668', 'GO:0090723', 'GO:0090724', 'GO:0090725', 'GO:0097076', 'GO:0097317', 'GO:0097318', 'GO:0097321', 'GO:0098867', 'GO:0098868', 'GO:0099126', 'GO:0100040', 'GO:0100041', 'GO:0100042', 'GO:0100064', 'GO:1900238', 'GO:1900428', 'GO:1900429', 'GO:1900430', 'GO:1900431', 'GO:1900432', 'GO:1900433', 'GO:1900434', 'GO:1900435', 'GO:1900436', 'GO:1900437', 'GO:1900438', 'GO:1900439', 'GO:1900440', 'GO:1900441', 'GO:1900442', 'GO:1900443', 'GO:1900444', 'GO:1900445', 'GO:1900456', 'GO:1900460', 'GO:1900461', 'GO:1900462', 'GO:1900741', 'GO:1900742', 'GO:1900743', 'GO:1900746', 'GO:1900747', 'GO:1900748', 'GO:1901048', 
'GO:1901388', 'GO:1901389', 'GO:1901390', 'GO:1901392', 'GO:1901393', 'GO:1901394', 'GO:1901395', 'GO:1901396', 'GO:1901397', 'GO:1901398', 'GO:1901399', 'GO:1901400', 'GO:1902178', 'GO:1902202', 'GO:1902203', 'GO:1902204', 'GO:1902352', 'GO:1902547', 'GO:1902548', 'GO:1902727', 'GO:1902728', 'GO:1902733', 'GO:1903547', 'GO:1903548', 'GO:1903549', 'GO:1903844', 'GO:1903845', 'GO:1903846', 'GO:1904046', 'GO:1904740', 'GO:1904741', 'GO:1904847', 'GO:1904848', 'GO:1904849', 'GO:1904857', 'GO:1904858', 'GO:1904859', 'GO:1905251', 'GO:1905252', 'GO:1905253', 'GO:1905254', 'GO:1905282', 'GO:1905283', 'GO:1905284', 'GO:1905313', 'GO:1905427', 'GO:1905613', 'GO:1905614', 'GO:1905615', 'GO:1905942', 'GO:1905943', 'GO:1905944', 'GO:1990089', 'GO:1990090', 'GO:1990265', 'GO:1990270', 'GO:1990314', 'GO:1990418', 'GO:1990761', 'GO:1990812', 'GO:1990835', 'GO:1990864', 'GO:2000217', 'GO:2000218', 'GO:2000219', 'GO:2000220', 'GO:2000221', 'GO:2000222', 'GO:2000313', 'GO:2000314', 'GO:2000315', 'GO:2000387', 'GO:2000388', 'GO:2000544', 'GO:2000545', 'GO:2000546', 'GO:2000583', 'GO:2000584', 'GO:2000585', 'GO:2000586', 'GO:2000587', 'GO:2000588', 'GO:2000603', 'GO:2000604', 'GO:2000605', 'GO:2000699', 'GO:2000702', 'GO:2000703', 'GO:2000704', 'GO:2001112', 'GO:2001113', 'GO:2001114', 'GO:2001201', 'GO:2001202', 'GO:2001203']
The number of GO terms have the word “growth”
663
optional_attrs(relationship)
sameparent
{'GO:0008150'}
regulate GO:0007124
GO:2000220	regulation of pseudohyphal growth [biological_process]

Answers:

• What is the name of the GO term GO:0048527?

lateral root development (see the children entry of GO:0048528 in the output above)

• What are the immediate parent(s) of the term GO:0048527?

GO:0048528 (post-embryonic root development), whose own parents are GO:0048364 and GO:0090696

• What are the immediate children of the term GO:0048527?

None

• Recursively find all the parent and child terms of the term GO:0048527. Hint: use your solutions to the previous two questions, with a recursive loop.

All the parent terms are {'GO:0009791', 'GO:0048364', 'GO:0090696', 'GO:0048528', 'GO:0032502', 'GO:0099402', 'GO:0008150', 'GO:0048856', 'GO:0032501'},
and there are no child terms.

• How many GO terms have the word “growth” in their name?

663

• What is the lowest common ancestor term of GO:0048527 and GO:0097178?

GO:0008150

• Which GO terms regulate GO:0007124 (pseudohyphal growth)? Hint: load the relationship tags and look for terms which define regulation.

GO:2000220
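The hint about a recursive loop can be sketched without GOATOOLS at all. The snippet below is only an illustration: the two edges in PARENTS are copied from the printed output above, everything higher up is deliberately left out, and the helper names all_ancestors and invert are made up.

```python
# Direct-parent edges taken from the printed output above; the edges
# above GO:0048364 and GO:0090696 are omitted on purpose.
PARENTS = {
    "GO:0048527": ["GO:0048528"],
    "GO:0048528": ["GO:0048364", "GO:0090696"],
}


def all_ancestors(term_id, edges):
    """Recursively collect every term reachable from term_id via edges."""
    found = set()
    for next_id in edges.get(term_id, []):
        found.add(next_id)
        found |= all_ancestors(next_id, edges)
    return found


def invert(edges):
    """Turn a child -> parents mapping into a parent -> children mapping."""
    inverted = {}
    for child, parents in edges.items():
        for parent in parents:
            inverted.setdefault(parent, []).append(child)
    return inverted


ancestors = all_ancestors("GO:0048527", PARENTS)
descendants = all_ancestors("GO:0048364", invert(PARENTS))
print(sorted(ancestors))
print(sorted(descendants))
```

The same function answers both halves of the question: run it on the parent edges to get ancestors, and on the inverted edges to get descendants.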

Exercise B

The GOATOOLS package also includes functions to visualize the GO graphs. For instance, it is possible to depict the location of a particular GO term in the ontology using the method GOTerm.draw_lineage().

  • Visualize the GO term GO:0097190. From the figure, what is the name of this term?
  • Using this figure, what is the most specific term that is in the parent terms of both
    GO:0097191 (extrinsic apoptotic signaling pathway) and GO:0038034 (signal
    transduction in absence of ligand)? This is referred to as the lowest common ancestor.

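Exercise B requires installing some extra packages and software, which is a hassle. I could not install pygraphviz directly from PyCharm, so here is an installation method for everyone.
"In memory of pygraphviz, which finally installed successfully after a whole day"
Code: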

from goatools import obo_parser

# Load the ontology once (with relationship tags) and reuse it below.
g = obo_parser.GODag('D:/edit/biology informatics/go-basic.obo', optional_attrs=['relationship'])

# Draw the lineage of GO:0097190 (needs pygraphviz).
rec = g.query_term('GO:0097190')
g.draw_lineage([rec])

# Common ancestors of GO:0097191 and GO:0038034.
term1 = g.query_term('GO:0097191')
term2 = g.query_term('GO:0038034')
parent1 = term1.get_all_parents()
parent2 = term2.get_all_parents()
sameparents = parent1 & parent2
print("sameparent")
print(sameparents)

This generates an image; try it yourself.
Answer:
The most specific term that is in the parent terms of both GO:0097191 (extrinsic apoptotic signalling pathway) and GO:0038034 (signal transduction in absence of ligand) is GO:0007165.
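The intersection of two ancestor sets usually holds several terms, so one more step is needed to pick the most specific one. A minimal sketch of that step, assuming "most specific" means the common ancestor with the greatest depth (longest path to the root); the toy ontology and the function names are made up for illustration.

```python
from functools import lru_cache

# Hypothetical toy ontology: child -> direct parents.
PARENTS = {
    "root": [],
    "signaling": ["root"],
    "signal transduction": ["signaling"],
    "pathway A": ["signal transduction"],
    "pathway B": ["signal transduction", "signaling"],
}


def ancestors(term):
    """All terms above `term`, collected recursively."""
    out = set()
    for parent in PARENTS[term]:
        out.add(parent)
        out |= ancestors(parent)
    return out


@lru_cache(maxsize=None)
def depth(term):
    """Longest distance from `term` to the root."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


def lowest_common_ancestor(a, b):
    """Deepest term shared by the ancestor sets of a and b."""
    common = ancestors(a) & ancestors(b)
    return max(common, key=depth)


print(lowest_common_ancestor("pathway A", "pathway B"))
```

With GOATOOLS the same idea would apply to the sameparents set, ranking its members by the depth of each term; if two common ancestors tie on depth, this sketch returns just one of them.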

Exercise C

An alternative to GOATOOLS and OBO files is to retrieve information relating to a specific term from a web service. One such service is the EMBL-EBI QuickGO, which can provide descriptive information about GO terms in OBO-XML format via the following URL:

http://www.ebi.ac.uk/QuickGO/GTerm?id=&format=oboxml
  • Write a Python function get_oboxml(go_id) to
    • Use urllib package to request the OBO-XML;
    • Parse the XML result using the xmltodict package to convert it into easy-to-use
      dict;
  • Use the function to find the name and description of the GO term GO:0048527 (lateral root development). Hint: print out the dictionary returned by the function or create a visualization using the Python package visualisedictionary to study the structure.

Without further ado, here is the code:

from urllib.request import urlopen
import xmltodict

def get_oboxml(go_id):
    """
    Retrieve the OBO-XML for a given Gene Ontology term,
    using EMBL-EBI's QuickGO browser.

    Input: go_id - a valid Gene Ontology ID, e.g. GO:0048527.
    """
    quickgo_url = "http://ebi.ac.uk/QuickGO/GTerm?id=" + go_id + "&format=oboxml"
    oboxml = urlopen(quickgo_url)
    # Check the response
    if oboxml.getcode() == 200:
        # Convert the XML payload into a nested dict
        obodict = xmltodict.parse(oboxml.read())
        return obodict
    else:
        raise ValueError("Couldn't receive OBOXML from QuickGO. Check URL and try again.")

# Print the returned dictionary to study its structure
print(get_oboxml('GO:0048527'))
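Once the XML is in hand, the name and description still have to be pulled out of it. The sketch below does this offline with the standard library's xml.etree.ElementTree instead of xmltodict. The sample document is hand-written: the element names (term, id, name, def, defstr) are an assumption about the OBO-XML layout, and the definition text is a placeholder, not the real GO definition.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; layout and definition text are assumptions.
SAMPLE = """<obo>
  <term>
    <id>GO:0048527</id>
    <name>lateral root development</name>
    <def><defstr>placeholder definition text</defstr></def>
  </term>
</obo>"""


def name_and_definition(xml_text):
    """Return (name, definition) of the first term in the document."""
    root = ET.fromstring(xml_text)
    term = root.find("term")
    return term.findtext("name"), term.findtext("def/defstr")


print(name_and_definition(SAMPLE))
```

If the real response uses different element names, only the two path strings passed to findtext need to change; printing the parsed dictionary first, as the hint suggests, is the quickest way to find them.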

Exercise D: Retrieve GO Annotations

In this exercise, we will learn how to parse a GAF file (GO Annotation File) downloaded from the UniProt-GOA database using an iterator from the BioPython package (Bio.UniProt.GOA.gafiterator):

from Bio.UniProt.GOA import gafiterator
import gzip

fname = "gene_association.goa_arabidopsis.gz"
with gzip.open(fname, "rt") as fp:
	for annotation in gafiterator(fp):
		print(annotation['DB_Object_ID'])

A GAF file is a tab-delimited file containing 17 fields including:

  • DB: the protein database;
  • DB_Object_ID: protein ID;
  • Qualifier: annotation qualifier (such as NOT);
  • GO_ID: GO term;
  • Evidence: evidence code.
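For a feel of what the iterator hands back, here is a minimal sketch that splits one GAF line by hand. It is not Biopython's implementation; the key names in GAF_FIELDS are chosen to match the ones used in this post, and the sample record (P00001, GENE1 and so on) is entirely made up.

```python
# The 17 GAF columns, in order; key spellings are this sketch's own choice.
GAF_FIELDS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence", "With", "Aspect", "DB_Object_Name",
    "Synonym", "DB_Object_Type", "Taxon_ID", "Date", "Assigned_By",
    "Annotation_Extension", "Gene_Product_Form_ID",
]


def parse_gaf_line(line):
    """Hypothetical helper: turn one tab-delimited GAF line into a dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GAF_FIELDS):
        raise ValueError("expected 17 tab-separated fields, got %d" % len(values))
    record = dict(zip(GAF_FIELDS, values))
    # Several qualifiers may be joined with "|"; store them as a list.
    record["Qualifier"] = [q for q in record["Qualifier"].split("|") if q]
    return record


# Made-up record for illustration only.
sample = "\t".join([
    "UniProtKB", "P00001", "GENE1", "NOT", "GO:0048527", "PMID:0000000",
    "IMP", "", "P", "hypothetical growth protein", "", "protein",
    "taxon:3702", "20200101", "UniProt", "", "",
]) + "\n"

rec = parse_gaf_line(sample)
print(rec["DB_Object_ID"], rec["Qualifier"], rec["GO_ID"], rec["Evidence"])
```

Lines starting with "!" in a real GAF file are header comments and would need to be skipped before calling a helper like this.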
  1. Find the total number of annotations for Arabidopsis thaliana with NOT qualifiers. What is this as a percentage of the total number of annotations for this species?
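from Bio.UniProt.GOA import gafiterator
import gzip

total = 0      # all annotations
not_count = 0  # annotations whose Qualifier field contains NOT

fname = "D:/edit/biology informatics/untitled1/goa_arabidopsis.gaf.gz"
# Open in text mode ("rt"), as in the example above.
with gzip.open(fname, "rt") as fp:
    for annotation in gafiterator(fp):
        total += 1
        # Look only at the Qualifier field, so that NOT appearing inside
        # a symbol or protein name is not counted by mistake.
        if 'NOT' in annotation['Qualifier']:
            not_count += 1
print(total, not_count)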

168540
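The question also asks for a percentage. A minimal sketch of that last step, assuming each annotation is a dict whose Qualifier value is a list of strings (check how your Biopython version presents this field); the four toy records are made up and not_percentage is a hypothetical helper name.

```python
def not_percentage(annotations):
    """Return (total, negated, percentage of annotations qualified with NOT)."""
    total = 0
    negated = 0
    for ann in annotations:
        total += 1
        if "NOT" in ann["Qualifier"]:
            negated += 1
    pct = 100.0 * negated / total if total else 0.0
    return total, negated, pct


# Made-up records for illustration only.
toy = [
    {"DB_Object_ID": "P1", "Qualifier": ["NOT"]},
    {"DB_Object_ID": "P2", "Qualifier": []},
    {"DB_Object_ID": "P3", "Qualifier": ["NOT", "contributes_to"]},
    {"DB_Object_ID": "P4", "Qualifier": []},
]

total, negated, pct = not_percentage(toy)
print(total, negated, "%.1f%%" % pct)
```

Passing the gafiterator object itself instead of the toy list should work the same way, since the function only iterates once.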

  2. How many genes (of Arabidopsis thaliana) have the annotation GO:0048527 (lateral root development)?

1044
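No code is shown for this question, so here is one possible way to count, as a sketch only. It assumes the annotations are dicts with DB_Object_ID, GO_ID and a list-valued Qualifier, counts each gene product once, and skips NOT-qualified rows; whether the figure above was obtained this way is not known. The toy records and the helper name genes_with_term are made up.

```python
def genes_with_term(annotations, go_id):
    """Distinct DB_Object_IDs annotated with go_id, ignoring NOT rows."""
    return {
        ann["DB_Object_ID"]
        for ann in annotations
        if ann["GO_ID"] == go_id and "NOT" not in ann.get("Qualifier", [])
    }


# Made-up records for illustration only.
toy = [
    {"DB_Object_ID": "P1", "GO_ID": "GO:0048527", "Qualifier": []},
    {"DB_Object_ID": "P1", "GO_ID": "GO:0048527", "Qualifier": []},  # same gene, second evidence row
    {"DB_Object_ID": "P2", "GO_ID": "GO:0048527", "Qualifier": ["NOT"]},
    {"DB_Object_ID": "P3", "GO_ID": "GO:0009791", "Qualifier": []},
    {"DB_Object_ID": "P4", "GO_ID": "GO:0048527", "Qualifier": []},
]

hits = genes_with_term(toy, "GO:0048527")
print(len(hits), sorted(hits))
```

Counting rows instead of distinct IDs gives a larger number whenever one gene has several evidence rows for the same term, so it is worth stating which of the two a reported figure refers to.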

  3. Generate a list of annotated proteins which have the word “growth” in their name.
from Bio.UniProt.GOA import gafiterator
import gzip

total = 0
growth = 'growth'
list1 = []  # annotations whose protein name mentions "growth"

fname = "D:/edit/biology informatics/untitled1/goa_arabidopsis.gaf.gz"
# Open in text mode ("rt"), as in the example above.
with gzip.open(fname, "rt") as fp:
    for annotation in gafiterator(fp):
        total += 1
        # DB_Object_Name holds the protein name.
        if growth in annotation['DB_Object_Name']:
            list1.append(annotation)

print(total, len(list1))
for m in list1:
    print(m)
