Algorithm literature reading: dbCAN2: a meta server for automated carbohydrate-active enzyme annotation

Journal

Nucleic Acids Research (19.160/Q1)

dbCAN2: a meta server for automated carbohydrate-active enzyme annotation

ABSTRACT

Complex carbohydrates of plants are the main food sources of animals and microbes, and serve as promising renewable feedstock for biofuel and biomaterial production. Carbohydrate active enzymes (CAZymes) are the most important enzymes for complex carbohydrate metabolism. With an increasing number of plant and plant-associated microbial genomes and metagenomes being sequenced, there is an urgent need for automatic tools for genomic data mining of CAZymes. We developed the dbCAN web server in 2012 to provide a public service for automated CAZyme annotation for newly sequenced genomes. Here, dbCAN2 (http://cys.bios.niu.edu/dbCAN2) is presented as an updated meta server, which integrates three state-of-the-art tools for CAZome (all CAZymes of a genome) annotation: (i) HMMER search against the dbCAN HMM (hidden Markov model) database; (ii) DIAMOND search against the CAZy pre-annotated CAZyme sequence database and (iii) Hotpep search against the conserved CAZyme short peptide database. Combining the three outputs and removing CAZymes found by only one tool can significantly improve the CAZome annotation accuracy. In addition, dbCAN2 now also accepts nucleotide sequence submission, and offers the service to predict physically linked CAZyme gene clusters (CGCs), which will be a very useful online tool for identifying putative polysaccharide utilization loci (PULs) in microbial genomes or metagenomes.

INTRODUCTION

Importance of complex carbohydrates 

Carbohydrates are one of the four major classes of large biopolymers found in all cells together with nucleic acids, proteins, and lipids. Carbohydrates include monosaccharides, oligosaccharides, and polysaccharides. Hybrid biopolymers with carbohydrates covalently linked to other biopolymers, such as glycoproteins and glycolipids, are called glycoconjugates. Complex carbohydrates and glycoconjugates are synthesized, degraded, and modified by carbohydrate active enzymes (CAZymes) in all organisms (1). Particularly, plants use photosynthesis to convert carbon dioxide and water into sugars, which are further turned into carbohydrates such as starches and celluloses with the help of CAZymes. Therefore, CAZymes are vitally important for plants and plant-associated animals and microbes, and not surprisingly CAZyme genes are particularly abundant in genomes of plants and plant-degrading microbes (2,3).

Importance of CAZymes

In addition to their significance in bioenergy and agricultural industries (4), CAZymes are also extremely important for human health (5). This is because humans and other animals depend on bacteria living in the digestive tract to degrade various indigestible carbohydrates and salvage nutrients (6). It has been shown that the genomes of animal gut bacteria encode hundreds of carbohydrate-degrading GH (glycoside hydrolase) genes, in contrast to only 17 digestive GH genes encoded in the human genome (7). Recent research has suggested that altering the dietary carbohydrate composition has a profound impact on the gut microbiota structure, which in turn influences human health (8,9).

CAZy database

Since the 1990s, over 360 CAZyme families have been defined and classified by the CAZy database (10), forming six major classes: glycosyltransferases [GTs], glycoside hydrolases [GHs], polysaccharide lyases [PLs], carbohydrate esterases [CEs], carbohydrate-binding modules [CBMs] and enzymes with auxiliary activities [AAs]. CAZy also assigns GenBank proteins to CAZyme families, and these CAZy pre-annotated proteins are the foundation for sequence similarity-based CAZyme annotation.

Methods for CAZyme annotation

Owing to the importance of CAZymes, newly sequenced genomes are often analyzed for putative CAZymes (collectively named the CAZome). Two approaches to CAZome annotation exist in the literature:

(A) Users contact the CAZy database team for collaboration, and the team performs semi-automatic CAZome annotation for the users (11); as expert manual curation is involved, CAZy annotation is regarded as the gold standard method.

(B) Users run automatic tools such as HMMER (12) or BLAST (13) by themselves for CAZome annotation on their own computers or on the web (see below). Before 2012, BLAST was often used to search against CAZy pre-annotated proteins on users’ own computers.

In 2010, CAT (CAZyme Analysis Toolkit) was developed as a web server, which allows users to run both BLAST and HMMER searches remotely on the CAT web server (14). The HMMER search is run against Pfam HMMs (hidden Markov models) that are associated with CAZy pre-annotated CAZymes.

In 2012, we developed dbCAN, a database of HMMs for CAZyme family-specific signature domains (4). Unlike CAT, for each CAZyme family we retrieved its signature domains from CAZy pre-annotated members, by searching against the CDD (the conserved domain database of NCBI) and by manual literature curation; we then built our own HMMs for most CAZyme families instead of using Pfam HMMs.

We update dbCAN almost once a year, by creating HMMs for CAZyme families and subfamilies newly created in the CAZy database (Figure 1). Users can download our HMMs and run HMMER locally for automated CAZome annotation. We also provide a Perl script to help parse the HMMER output, which returns CAZyme signature domains, their boundaries, E-values, and HMM domain coverage. Such domain-based annotation is particularly useful for CAZymes, as they tend to be modular proteins with multiple CAZyme domains and sometimes domain repeats (e.g. multiple CBMs of the same family).

Figure 1. dbCAN is updated every year and now has 575 HMMs. X-axis: year; Y-axis: number of HMMs of families (blue) and subfamilies (red).

To help users who do not have programming experience, we also developed a web server that allows users to submit protein sequences and run HMMER on our server to identify CAZymes. With the CAT website no longer maintained since 2013 and eventually obsolete in 2017, dbCAN has become the only web server that is still actively updated and offers an online CAZyme annotation service.

In 2017, a new tool named Hotpep (15) was introduced. It annotates CAZymes by searching against the PPR (peptide pattern recognition) library for conserved short peptide motifs (16) present in different CAZyme families. In the PPR library, each CAZyme family has a set of 6-mer peptides that are conserved in that family, and Hotpep scans new proteins for the presence of these peptides in order to assign the query proteins to existing CAZyme families.
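
The peptide-scanning idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hotpep's actual implementation: the library `TOY_PPR_LIBRARY`, the peptide strings in it, and the functions `scan_peptides` and `best_family` are all hypothetical, and the real tool also uses peptide frequencies, which are left out here.

```python
from collections import Counter

# Hypothetical toy library: family name -> set of conserved 6-mer peptides.
# The peptides are invented for illustration; a real PPR library is far larger.
TOY_PPR_LIBRARY = {
    "GH5": {"WNEPRG", "GEFGVP", "NEPRGA"},
    "GT2": {"DDGSTD", "VIIPAY"},
}

def scan_peptides(protein, library, k=6):
    """Count, per family, how many distinct conserved k-mers occur in the protein."""
    # All overlapping k-mers of the query protein (empty if the protein is shorter than k).
    kmers = {protein[i:i + k] for i in range(len(protein) - k + 1)}
    hits = Counter()
    for family, peptides in library.items():
        n = len(kmers & peptides)
        if n:
            hits[family] = n
    return hits

def best_family(protein, library, k=6):
    """Return the family with the most conserved-peptide hits, or None if there are none."""
    hits = scan_peptides(protein, library, k)
    return hits.most_common(1)[0][0] if hits else None
```

For example, `best_family("MKWNEPRGAAGEFGVPLL", TOY_PPR_LIBRARY)` assigns the made-up query to the toy family with the most matching peptides.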

Importance of automated CAZyme annotation

It should be mentioned that approach B is actually also included in approach A, but can be fully automated and carried out by the users themselves. Benchmarking automated CAZyme annotation against CAZomes already annotated by CAZy typically found >90% accuracy for model bacterial genomes (3). Clearly, as more and more genomes and metagenomes become available, such automated CAZome annotation has a clear advantage over annotation by CAZy through collaboration, in that users can quickly obtain the candidate CAZyme gene list by themselves as part of their bioinformatics pipeline for genome annotation.

Indeed, the popularity of automated CAZome annotation is reflected in the citations of the two approaches. Specifically, ∼100 papers have been published since 2012 with CAZomes annotated by collaboration with CAZy (according to http://www.cazy.org/Genomes.html). By comparison, more than 300 papers have been published since 2012 using dbCAN for CAZome annotation (according to Google Scholar: https://scholar.google.com/scholar?cites=5112424923296812233, counting only papers that used the tool for finding CAZymes), and more than 100 papers have been published since 2012 using CAT for CAZome annotation (according to Google Scholar: https://scholar.google.com/scholar?cites=12948408578800903520, also counting only papers that used the tool for finding CAZymes).

Lastly, the availability of dbCAN HMMs has also enabled other bioinformatics tools to incorporate a CAZyme annotation step into their data analysis workflows, e.g., MOCAT2 (17), DemaDb (18), proGenomes (19) and SACCHARIS (20).

NEW FUNCTIONS AND UPDATES

Figure 2 shows the overall design of dbCAN2, an updated meta server version of the dbCAN server, which has the following new functions: (i) it allows submission of DNA sequences in addition to protein sequences; (ii) it integrates three state-of-the-art tools/databases for automated CAZyme annotation; (iii) it can identify transcription factors (TFs), transporters (TCs), and further CAZyme gene clusters (CGCs) using CGC-Finder (3); (iv) it combines the results from the three tools, allows visualization of the overlaps as a Venn diagram and of the detailed results as graphs, and offers an easy way to download results as text files.

Figure 2. Overall design of dbCAN2 meta server. GCPU (gene cluster plot utility) and CGC-Finder (CAZyme gene cluster finder) are two tools developed for dbCAN2.

DNA sequence submission

In addition to protein submission, dbCAN2 now also accepts nucleotide sequences, e.g. the complete or draft genomes and metagenomes of prokaryotes. Protein sequences are predicted by calling Prodigal (21) if the query is genomes, or FragGeneScan (22) if the query is short DNAs from metagenomes or mRNAs or coding sequences of proteins. As eukaryotic gene prediction is more complex and often needs additional input data (e.g. transcriptome data), users should perform gene predictions for eukaryotic genomes elsewhere and only submit protein sequences to dbCAN2.

Meta server of three tools/databases

The dbCAN web server (http://csbl.bmb.uga.edu/dbCAN/) currently provides HMMER search against dbCAN HMM database, and also DIAMOND (23) search against CAZy pre-annotated CAZyme sequence database. However, the results from the two tools are presented on two separate pages and not integrated at any level. In dbCAN2, we have added the third tool: Hotpep search against the PPR short peptide library. We have also systematically compared the outputs of the three tools against the CAZy pre-annotated CAZomes (i.e. as the gold standard sets) of three bacterial genomes and three eukaryotic genomes (Supplementary Table S1), in order to: (i) find the best parsing thresholds (e.g. E-value) for each tool, (ii) evaluate the annotation performance of the three tools and (iii) find the best way to aggregate the three outputs to achieve the best annotation performance.

The accuracy is calculated as an F-score = 2 × (Recall × Precision)/(Recall + Precision) for the three tools on each examined genome, following the method presented in our previous papers (2,3). We removed unclassified CAZymes (e.g. GH0) and families not in the PPR library when calculating F-scores. Supplementary Table S1 presents the best parsing thresholds that we selected to use for the web server: (i) for HMMER + dbCAN, we use E-value <1e-15 and coverage >0.35; (ii) for DIAMOND + CAZy, we use E-value <1e-102 and (iii) for Hotpep + PPR, we use the number of conserved peptide hits >6 and the sum of conserved peptide frequencies >2.6. Table 1 shows that DIAMOND + CAZy has the highest F-score (0.89) for bacteria but the lowest F-score for eukaryotes (0.84); in contrast, Hotpep + PPR has the highest F-score (0.94) for eukaryotes but the lowest F-score for bacteria (0.80). HMMER + dbCAN performs very well for both eukaryotes (0.86) and bacteria (0.88) and has a slightly higher overall F-score than the other two tools (Supplementary Table S1). In terms of running time, DIAMOND runs the fastest, followed by Hotpep and HMMER.
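
The F-score formula above can be written out as a short Python function. This is a minimal sketch that assumes predictions and the gold standard are given as sets of protein IDs; the function name `f_score` is an assumption made for illustration and is not taken from the paper or its scripts.

```python
def f_score(predicted, gold):
    """Return (precision, recall, F-score) for two collections of protein IDs.

    F-score = 2 * (Recall * Precision) / (Recall + Precision)
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty inputs so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With a gold set of four proteins and a prediction that recovers three of them plus one false positive, precision and recall are both 0.75, and so is the F-score.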

More importantly, we found that the best performance of automated CAZyme annotation is achieved by aggregating the outputs of the three tools and keeping candidates found by at least two tools. Table 1 shows that the F-score can be increased to 0.93 when keeping proteins found by at least two tools.
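
The "found by at least two tools" rule amounts to a simple vote over the three result sets. The sketch below assumes each tool's output has already been filtered by its threshold and reduced to a set of protein IDs; the function name `consensus_cazymes` and its `min_tools` parameter are hypothetical names chosen for this example.

```python
from collections import Counter

def consensus_cazymes(hmmer_ids, diamond_ids, hotpep_ids, min_tools=2):
    """Keep protein IDs reported by at least `min_tools` of the three tools."""
    votes = Counter()
    for ids in (hmmer_ids, diamond_ids, hotpep_ids):
        votes.update(set(ids))  # set() so that each tool votes once per protein
    return {pid for pid, n in votes.items() if n >= min_tools}
```

Setting `min_tools=1` would return the union of the three outputs, and `min_tools=3` the intersection, which correspond to the outer and central regions of the Venn diagram.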

Table 1

a Twenty-four CAZyme families are classified into 207 subfamilies by phylogenetic clustering and CAZy expert curation (10).

b Three hundred and forty-two CAZyme families are classified into 7036 groups by PPR (15,16).

c The time is in seconds and was calculated on the Escherichia coli K-12 MG1655 proteome (4140 proteins).

The detailed calculations on accuracy and speed are available in Supplementary Table S1. No correspondence has been established between PPR groups and CAZy subfamilies, and in the dbCAN web server we only report the CAZy subfamily annotation, whenever it is available.

However, the above F-score calculation only considered whether a protein is found by any of the three tools. When considering whether a protein is assigned to the correct family or families, we found that the F-scores for all three tools dropped slightly (Supplementary Table S2), with Hotpep + PPR dropping the most (to 0.86 for eukaryotes and 0.70 for bacteria) and HMMER + dbCAN dropping the least (to 0.85 for eukaryotes and 0.82 for bacteria). Additionally, proteins can have multiple CAZyme domains, and it is also interesting to know where the domain boundaries are. Figure 3 shows two example CAZyme proteins found by all three tools. Both proteins have multiple CAZyme domains according to dbCAN annotation (Figure 3A). According to the HMMER + dbCAN output (Figure 3C), AT1G11720.1 is annotated as CBM53(154–237) + CBM53(329–423) + CBM53(496–584) + GT5(595–1038) and YP_002573728.1 as GH9(36–466) + CBM3(491–576) + CBM3(724–804) + CBM3(923–1003) + GH48(1134–1753), i.e. all the CAZyme domains and domain repeats and their positions are reported (Table 1). However, according to both Hotpep + PPR and DIAMOND + CAZy, AT1G11720.1 is annotated as GT5 + CBM53 and YP_002573728.1 as GH9 + GH48 + CBM3, i.e. the proteins are assigned to the multiple families correctly, though without reporting domain repeats and positions (Table 1).
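
The difference between domain-level and family-level annotation can be made concrete with a small parser. The sketch below assumes the architecture is written as in the text, `FAMILY(start-end) + FAMILY(start-end)`; the regular expression and the function names `parse_architecture` and `families` are assumptions for illustration and do not reproduce the dbCAN parsing script.

```python
import re

# Family name (letters + digits, optional _subfamily), then (start-end);
# both an ASCII hyphen and an en dash are accepted between the coordinates.
DOMAIN_RE = re.compile(r"([A-Za-z]+\d+(?:_\d+)?)\((\d+)[-–](\d+)\)")

def parse_architecture(text):
    """Parse 'FAM(start-end) + FAM(start-end)' into (family, start, end) tuples."""
    return [(fam, int(start), int(end)) for fam, start, end in DOMAIN_RE.findall(text)]

def families(domains):
    """Collapse a domain list to the set of families; repeats and positions are lost."""
    return {fam for fam, _, _ in domains}

# Domain architecture of AT1G11720.1 as quoted in the text above.
at1g11720 = parse_architecture(
    "CBM53(154-237) + CBM53(329-423) + CBM53(496-584) + GT5(595-1038)"
)
```

Calling `families(at1g11720)` gives the family-level view (GT5 and CBM53) that corresponds to what the text reports for Hotpep + PPR and DIAMOND + CAZy, while the list itself keeps the three CBM53 repeats and all boundaries.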

Figure 3. Comparison of annotation results for multi-domain CAZymes using three different tools. (A) Two example proteins (AT1G11720.1 and YP_002573728.1) are illustrated with their CAZyme domain architecture based on dbCAN search. (B) DIAMOND search result for the two proteins showing the best CAZy protein hit; (C) HMMER search result against dbCAN HMM database, from which (A) is derived; (D) Hotpep search result against PPR library; Frequency means the sum of conserved peptide frequencies and Hits means the number of conserved peptide hits (15).

It should be mentioned that DIAMOND + CAZy has a much higher risk than the other two tools of giving a wrong CAZyme family annotation. For example, if a query protein only has a GT5 domain and has AAD30251.1 as its best CAZy hit, transferring the family assignment of AAD30251.1 (GT5 + CBM53) to the query would be wrong (because the query has no CBM53). However, such mistakes will not happen in HMMER and Hotpep searches, as they are conserved domain-based and motif-based methods.

CAZyme gene clusters (CGCs)

Another important new function of dbCAN2 is that it allows identification of CGCs, when the genomic locations of all genes of the query genome are given. In the literature, CGCs are also known as polysaccharide utilization loci (PULs), which are defined as physically linked genes specializing in the degradation of various complex carbohydrates (24). Most experimentally characterized PULs are found in Bacteroidetes genomes (25), but PULs have also been reported in Proteobacteria and Firmicutes from various carbohydrate-rich environments (26). The PULDB of CAZy initially focused on susCD (starch utilization system C and D transporters) associated PULs, and more recently expanded to present CAZyme clusters (three or more CAZyme genes clustered in the genome) on its website (25). However, PULDB focuses on Bacteroidetes genomes and does not allow online genome submissions for PUL predictions. Recently, we defined CGCs as a more general term for PULs (3), which must contain three classes of signature genes: at least one CAZyme gene, one transporter (TC) gene, and one transcription factor (TF) gene. Between two adjacent signature genes, a certain number of non-signature genes can be inserted. We have developed a Python program (CGC-Finder) that can automatically identify CGCs (3).
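
The CGC definition above (required signature gene classes plus a limit on intervening non-signature genes) can be sketched as a single pass over an ordered gene list. This is not the CGC-Finder source code: the function `find_cgcs`, its input format, and its parameter names `max_gap` and `required` are assumptions for illustration, and the sketch ignores contig boundaries and strands.

```python
def find_cgcs(genes, max_gap=2, required=("CAZyme", "TC", "TF")):
    """Find candidate CAZyme gene clusters on one contig.

    genes: list of (gene_id, gene_class) in genomic order; gene_class is
           'CAZyme', 'TC', 'TF', or None for a non-signature gene.
    max_gap: maximum number of non-signature genes allowed between two
             adjacent signature genes.
    required: signature classes that a cluster must contain.
    Returns a list of clusters, each a list of gene IDs from the first to
    the last signature gene (intervening non-signature genes included).
    """
    runs, current = [], []  # each run holds the indices of linked signature genes
    for i, (_, gene_class) in enumerate(genes):
        if gene_class is None:
            continue
        # Start a new run when too many non-signature genes separate the neighbors.
        if current and i - current[-1] - 1 > max_gap:
            runs.append(current)
            current = []
        current.append(i)
    if current:
        runs.append(current)

    clusters = []
    for run in runs:
        classes = {genes[i][1] for i in run}
        if set(required) <= classes:
            clusters.append([gene_id for gene_id, _ in genes[run[0]:run[-1] + 1]])
    return clusters
```

Passing `required=("CAZyme", "TC")` relaxes the rule so that a transcription factor is no longer needed, and raising `max_gap` merges signature genes that lie further apart into one cluster.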

In the dbCAN2 job submission page, we provide the ‘Find CAZyme gene clusters’ option. When users submit a protein query file, they must also provide a gene position file in order to predict CGCs. This gene position file is not required if users submit a nucleotide query file, because the gene prediction programs can generate the gene position file internally. With protein sequences, our server will predict TFs and TCs by DIAMOND search against TF and TC databases (explained in (3)), and then CGC-Finder will be called to locate genes of CAZymes, TFs, TCs in the genome, and identify CGCs.

Web design

For the job submission page, we have options that allow users to specify whether they would: (i) use one of the three tools or all three tools for CAZyme annotation; (ii) use protein or nucleotide sequences as input; (iii) use CGC-Finder to predict CGCs. As shown in Figure 2, if nucleotide sequences are submitted, gene prediction programs will first be called to predict protein-coding genes and then the protein sequences will be used for CAZyme annotation. If the CGC-Finder option is selected, TFs and TCs will also be predicted and the gene location file will be used to predict CGCs.

For the result page (Figure 4), five tabs are shown, each with a data table: (i) HMMER result table; (ii) DIAMOND result table; (iii) Hotpep result table; (iv) Overview table; (v) CGC-Finder table. Above the tabs, a Venn diagram is shown to illustrate the overlaps among the outputs of the three tools (Figure 4A). Clicking on any number in the diagram opens a pop-out window displaying the protein IDs in that region.

Figure 4. Screenshots of dbCAN2 result pages. (A) Venn diagram to show overlaps among the results of the three tools; (B) CGC-Finder result tab; (C) Overview tab combining results from the three tools and SignalP; (D) genomic location plot of an example CGC (signature genes are in red, green and blue colors, while non-signature genes are in gray); (E) detailed information of an example CGC.

The Overview tab combines the results of the three CAZyme annotation tools plus the SignalP (27) prediction result (Figure 4C). The number of tools that find a CAZyme protein is also shown as a column, in addition to the CAZyme family assignment (for DIAMOND and Hotpep) and domain assignment (for HMMER). Users can sort the table by the number-of-tools column and easily filter out proteins found by only one tool to get the most accurate CAZyme list.

The CGC-Finder tab presents the CGCs identified in the query genome/proteome, with columns such as the genomic locations of the CGC and the three classes of signature genes in the CGCs (Figure 4B). The default parameters for running CGC-Finder are: (i) at least one CAZyme gene and one TC gene and (ii) the number of non-signature genes that are allowed to be inserted between two adjacent signature genes is ≤2. The two parameters can be changed underneath the CGC table to rerun CGC-Finder, and then the CGC-Finder tab will be updated to display the new CGC list.

Clicking on each CGC opens a new page showing the CGC genomic context plot using GCPU (gene cluster plotting utility), a Python script we developed to plot the genes in the CGCs as arrows in different colors (Figure 4D). Below the plot is a table (Figure 4E), which shows the detailed genomic location of each member gene in the CGC, including the distance of a signature gene from its upstream signature gene (Upstream distance) and the distance from its downstream signature gene (Downstream distance), as well as their best DIAMOND hits in the CAZy, TF and TC databases.

In all five tabs and the individual CGC page, links to tab-delimited plain text files are provided so that users can conveniently download them and open them on their local computers in an Excel spreadsheet for further analysis. The Venn diagram and the CGC plot can also be downloaded as image files (e.g. SVG and PDF) and further edited by the users using Illustrator.

Lastly, we also provide a web page for each CAZyme protein to plot its dbCAN domains and PPR conserved peptides in the sequence. We also allow users to download a master script to run all tools as well as the CGC-Finder program on their local computers.

CONCLUSIONS

dbCAN2 is a web server for automated carbohydrate-active enzyme annotation. It is an updated version of the original dbCAN web server, and has the following new features:

(1) dbCAN2 allows submission of nucleotide sequences: genomic sequences of prokaryotic draft genomes and metagenomes;

(2) dbCAN2 integrates three state-of-the-art tools/databases for automated CAZyme annotation: (i) HMMER for determining annotated CAZyme domain boundaries according to the dbCAN CAZyme domain HMM database; (ii) DIAMOND for fast BLAST-like hits in the CAZy database; (iii) Hotpep for short conserved motifs in the PPR library;

(3) dbCAN2 can also identify transcription factors (TFs), transporters (TCs), and further CAZyme gene clusters (CGCs) using CGC-Finder if users submit protein sequences plus a gene location file, or a genomic DNA sequence file;

(4) dbCAN2 combines the results from the three tools and allows visualization of the overlaps as a Venn diagram and of the detailed results as graphs. The dbCAN2 meta server will be updated once a year to use the most recent CAZy database, dbCAN HMM database and Hotpep peptide database.
