Algorithm literature reading 3: WGDI (bioRxiv version)

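WGDI, a recent collinearity analysis tool, has not been formally published yet; at the time of writing only the bioRxiv version can be found online. bioRxiv is a preprint server rather than a journal, and manuscripts posted there have not gone through peer review. So the WGDI paper discussed here is a preprint.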

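bioRxiv (pronounced "bio-archive") is a free online archive and distribution service for preprints. It is operated by Cold Spring Harbor Laboratory (CSHL), a not-for-profit research and educational institution. By posting a preprint on bioRxiv, authors can make their findings available to the scientific community immediately and receive feedback on a draft manuscript before submitting it to a journal. Reference: http://lib.cpu.edu.cn/f9/33/c1197a129331/page.htm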

OK, on to the paper.

WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes

Keywords: Polyploidy; Dotplot; Collinearity; Inference of hierarchical and event-related gene collinearity; Ancestral chromosomal karyotype

Abstract

Evidence of whole-genome duplications (WGDs) and subsequent karyotype changes has been detected in most major lineages of life on Earth. To clarify the complex resulting multiple-layered patterns of gene collinearity in genome analyses there is a need for convenient and accurate toolkits. To meet this need, we introduce here WGDI (Whole-Genome Duplication Integrated analysis), a Python-based command-line tool that facilitates comprehensive analysis of recursive polyploidizations and cross-species genome alignments. WGDI supports three main workflows (polyploid inference, hierarchical inference of genomic homology, and ancestral chromosomal karyotyping) that can improve detection of WGD and characterization of related events. It incorporates a more sensitive and accurate collinearity detection algorithm than previous software, and can accelerate WGD-related karyotype research. As a freely available toolkit at GitHub (https://github.com/SunPengChuan/wgdi), WGDI outperforms similar tools in terms of efficiency, flexibility and scalability. In an illustrative example of its application, WGDI convincingly clarified karyotype evolution in Aquilegia coerulea and Vitis vinifera following WGDs and rejected the hypothesis that Aquilegia contributed as a parental lineage to the allopolyploid origin of core dicots.

Background

There is clear evidence that whole-genome duplication (WGD), or polyploidy, and the accompanying change in karyotype, have repeatedly occurred in diverse eukaryotic lineages (Van de Peer, Mizrachi et al. 2017). It is recognized as a prominent evolutionary process, especially in plants (Soltis and Soltis 2016, Landis, Soltis et al. 2018). Thus, identifying WGDs, their dates and locations in evolutionary history, and ancestral karyotypes, is crucial for thorough understanding of how eukaryotes have diversified and adapted to different environments (Fawcett, Maere et al. 2009, Mabry, Brose et al. 2020). To date, three main types of methods have been used to detect WGD: Ks-based, gene tree-based, and synteny-based methods (Rabier, Ta et al. 2014, Mabry, Brose et al. 2020). Previous studies have shown that Ks-based or gene tree-based methods alone can be potentially misleading (Hahn 2007, Vanneste, Van de Peer et al. 2013, Ruprecht, Lohaus et al. 2017, Tiley, Barker et al. 2018, Nakatani and McLysaght 2019, Zwaenepoel, Li et al. 2019). In contrast, synteny-based methods are more reliable, conclusive and thus currently serve as the gold standard for inferring WGD.

With the increasing availability of assembled genomes, a number of methods have been developed to identify conserved syntenic blocks in eukaryotes, i.e. preserved co-localization of homologous genes on chromosomes. Early software packages developed for this purpose, such as ADHoRe and DiagHunter, often relied on clustering of neighboring matching gene pairs. In contrast, more recent packages use dynamic programming to build chains of pairwise collinear genes. Examples include ColinearScan, Cyntenator, McScan, MCScanX and JCVI. However, these software packages lack sufficient compatibility with the Windows platform, and the algorithms they incorporate lack sufficient sensitivity to collinearity, and hence accuracy. Due to the high complexity of extant genomes after often recursive WGDs and subsequent genome reconfiguration (often with extensive chromosomal rearrangement and massive gene loss), capacities for detailed downstream analysis are essential, such as generation of homologous gene dotplots or circles displaying collinear genes or blocks, Ks calculations, Ks peak fits, exploration of ancestral karyotype evolution, and comparison of syntenic genes. No previous software or toolkit provided all these capacities and/or the ability to integrate different software, thus hindering WGD research and frequently leading to erroneous interpretation of ancient polyploidization patterns.

To facilitate WGD analysis, here we introduce a convenient toolkit called WGDI (Whole-Genome Duplication Integrated analysis), a convenient Python-based command-line toolkit with the following advantages. It incorporates a more sensitive collinearity algorithm than previous packages, which can improve the resolution and accuracy of synteny block analyses. It also provides integrated capacities for nearly all current WGD-related and bioinformatic analyses, including (among others) inter- and intra-genomic dotplot comparison, collinearity detection, Ks estimation and peak fitting, ancestral karyotype evolution, and inference of synteny trees. Moreover, parameters for all these analyses can be very simply adjusted.

Results

Architecture of the WGDI package

The complete WGDI source code is freely available at GitHub (https://github.com/SunPengChuan/wgdi) and can be deployed in Windows, Linux, or macOS operating systems. WGDI is written in python3 and can be installed via pip or conda. WGDI supports three main workflows: (1) analysis and inference of polyploidy using homologous dotplots, collinearity and Ks distributions, and homologous gene trees; (2) hierarchical inference of genomic homology resulting from recursive paleopolyploidization; (3) subgenomic and ancestral chromosomal karyotyping and analysis of evolutionary scenarios. WGDI has multiple subroutines, and users only need to simply modify the configuration file and enter their names (e.g., ‘wgdi -d your.conf’) to execute them. The subroutine parameters and functions of WGDI are shown in Fig. 1. WGDI outputs may include vector diagrams (e.g. in SVG format) that are suitable for direct publication. Detailed function descriptions and parameter settings are available at https://wgdi.readthedocs.io/en/latest/usage.html.
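The configuration-file workflow described above can be sketched in Python as follows. This is only an illustration: the section name `dotplot`, the key names, and all file names are assumptions on my part and should be checked against the usage documentation linked above. The only thing taken from the paper is the command form `wgdi -d your.conf`.

```python
import configparser
import io

# All file names are hypothetical; the section and key names are assumptions
# to be checked against the usage documentation, not a verified option list.
config = configparser.ConfigParser()
config["dotplot"] = {
    "blast": "sp1_sp2.blast.txt",
    "gff1": "sp1.gff",
    "gff2": "sp2.gff",
    "lens1": "sp1.lens",
    "lens2": "sp2.lens",
    "repeat_number": "10",
    "score": "100",
    "savefig": "sp1_sp2.dotplot.svg",
}

# Render the configuration as text instead of touching the disk.
buffer = io.StringIO()
config.write(buffer)
conf_text = buffer.getvalue()
print(conf_text)

# Save conf_text as your.conf, then run the subroutine named in the paper.
command = ["wgdi", "-d", "your.conf"]
print(" ".join(command))
```

The point of the design is that each subroutine reads its own section of a plain-text file, so rerunning an analysis with different parameters only means editing that file.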

Note: packaging a tool for the command line is nicer than packaging it with a GUI, but that only suits algorithm-type tools. As soon as portable, UI-driven operation is needed, GUI packaging is still required, and cross-platform packaging then becomes unavoidable.

Fig. 1

More sensitive collinearity detection

The synteny blocks extracted by WGDI are obviously accurate for two reasons. First, the settings of homologous genes are more flexible. WGDI only considers pairs of homologous genes in analyses, and retains homologous gene pairs related to polyploidization. This greatly reduces confounding effects of homology in large families (Supplementary Fig. 1). Second, WGDI scores and ranks homologous genes, following clear rules described in the homologous dotplot part of the Methods section, generating a dotplot with sets of red, blue, and gray dots with scores declining from high to low. For example, in block a−c shown in Fig. 2a, two dots are designated b1 and b2. If the homologous genes are not ranked and scored, the final scores will be the same, as illustrated by results obtained using the dynamic programming algorithm. In contrast, if they are ranked the gene with the higher homology (indicated here by the b2 dot) can be clearly identified.
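To make the ranking-and-scoring idea concrete, here is a minimal sketch. The actual rules are in the Methods section of the paper, which is not reproduced here, so the rule below is an assumption: for each query gene, hits are sorted by BLAST bit score, the best `red_count` hits are red, the next `blue_count` are blue, and the remainder (up to `repeat_number` hits) are gray. The function name and the `red_count` and `blue_count` parameters are hypothetical; the grade values reuse the `grading=50,25,10` setting quoted later in the paper.

```python
from collections import defaultdict


def grade_hits(blast_hits, grading=(50, 25, 10), red_count=1,
               blue_count=4, repeat_number=10):
    """Assign a colour and a grade score to each homologous hit.

    blast_hits: iterable of (query_gene, subject_gene, bit_score).
    The colour rule is an assumption for illustration, not the paper's
    exact rule: best hits red, next hits blue, the rest gray.
    """
    by_query = defaultdict(list)
    for query, subject, bit_score in blast_hits:
        by_query[query].append((subject, bit_score))

    graded = {}
    for query, hits in by_query.items():
        # Strongest homology first; keep at most repeat_number hits per gene.
        hits.sort(key=lambda hit: hit[1], reverse=True)
        for rank, (subject, _) in enumerate(hits[:repeat_number]):
            if rank < red_count:
                colour, score = "red", grading[0]
            elif rank < red_count + blue_count:
                colour, score = "blue", grading[1]
            else:
                colour, score = "gray", grading[2]
            graded[(query, subject)] = (colour, score)
    return graded


hits = [("g1", "h1", 310.0), ("g1", "h2", 120.0), ("g1", "h3", 95.0)]
graded = grade_hits(hits, blue_count=1)
print(graded)
```

With grades attached, two dots that look equivalent in an unranked dotplot carry different weights when blocks are chained.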

Fig. 2

In addition, when the maximum gaps parameter is increased, the program will assign less highly homologous genes to the blocks, and the blocks will become longer.  When the parameter is small, the error rate for ends of blocks will be reduced, but the blocks will be terminated earlier and shortened. After scoring and ranking homologous genes, the search range for homologous genes within blocks is also subject to a penalty rule, i.e. the range is positively related to the strength of homology (Fig. 2b). In this manner, the ends of blocks and maximum gap value can be optimized, and extracted blocks lengthened, without reducing sensitivity.
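The b1/b2 example and the gap rule can be sketched with a small dynamic programming chain. This is not WGDI's implementation. It assumes forward collinearity only, no gap penalty on the score, and a hypothetical reach rule in which the allowed gap after a dot is `max_gap * grade / max_grade`, so that the search range grows with the strength of homology. The coordinates and grades are invented for illustration.

```python
def best_chain(dots, max_gap=25, max_grade=50):
    """Return (score, names) of the best forward collinear chain.

    dots: list of (name, x, y, grade), where x and y are gene orders on
    the two chromosomes. The reach rule is an assumption for illustration.
    """
    dots = sorted(dots, key=lambda d: (d[1], d[2]))
    best = [d[3] for d in dots]      # best chain score ending at each dot
    prev = [None] * len(dots)        # back-pointers for traceback
    for j, (_, xj, yj, gj) in enumerate(dots):
        for i in range(j):
            _, xi, yi, gi = dots[i]
            reach = max_gap * gi / max_grade  # stronger homology, wider search
            if 0 < xj - xi <= reach and 0 < yj - yi <= reach:
                if best[i] + gj > best[j]:
                    best[j] = best[i] + gj
                    prev[j] = i
    k = max(range(len(dots)), key=lambda idx: best[idx])
    score, names = best[k], []
    while k is not None:
        names.append(dots[k][0])
        k = prev[k]
    return score, names[::-1]


coords = {"a": (1, 1), "b1": (2, 3), "b2": (3, 2), "c": (4, 4)}
unranked = {name: 1 for name in coords}
ranked = {"a": 50, "b1": 10, "b2": 50, "c": 50}


def make_dots(grades, skip=()):
    return [(n, x, y, grades[n]) for n, (x, y) in coords.items() if n not in skip]


# Unranked: the chain through b1 and the chain through b2 score the same.
via_b1 = best_chain(make_dots(unranked, skip=("b2",)), max_grade=1)
via_b2 = best_chain(make_dots(unranked, skip=("b1",)), max_grade=1)
# Ranked: the chain through the better homolog b2 wins clearly.
ranked_best = best_chain(make_dots(ranked))
print(via_b1, via_b2, ranked_best)
```

In the unranked case, which of b1 or b2 ends up in the block depends only on tie-breaking; once grades are used, the choice is determined by homology.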

To evaluate the algorithm’s performance, we compared synteny blocks extracted by WGDI with those extracted by two other commonly used tools: MCScanX and JCVI (v1.1.12). The three tools were tested with the same datasets, Human/Chimpanzee (Homo sapiens/Pan troglodytes). The WGDI parameters were set to repeat_number=10, mg=25,25, muplite=1, grading=50,25,10, and score>100. MCScanX and JCVI parameters were set to default values. The number of synteny blocks extracted between H. sapiens chromosome 12 and P. troglodytes chromosome 13 by both WGDI and JCVI was 3, while that of MCScanX was 4. According to the dotplot on the left in Fig. 3, the explored region is rich in repeated genes. To illustrate capabilities of the three software packages for extracting synteny blocks in this region, we show a partial list of collinear genes in the figure, which we renamed according to their orders along the chromosome (the last five digits of each gene’s id indicate its position). Clearly, WGDI extracted more collinear genes than the other two packages, although JCVI performed better than MCScanX. In addition, both blast scores and gene arrangements suggested that synteny blocks extracted by JCVI and MCScanX included fewer or less accurate sets of homologous genes than those extracted by WGDI. We also compared synteny blocks in the region where the chromosome is inverted and drew similar conclusions (Supplementary Fig. 2).

Note: WGDI tends to extract more collinear gene pairs, and the authors take this as evidence of greater accuracy.

Fig. 3
Supplementary Fig. 2

Examples of WGDI application

Polyploid inference

Hierarchical inference of genomic homology

Ancestral chromosomal karyotype

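Note: this part mainly uses a few examples to demonstrate WGDI's advantages and to lay out the whole downstream analysis workflow. Interested readers should go through it carefully. It is valuable material that is hard to find elsewhere online, now or probably later; the paper more or less has this niche to itself.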

Discussion

Polyploidization is recognized as an important driving force for the evolution of species. It plays an important role in the evolution of species and formation of new species. Gene collinearity is an important way to study species polyploidization. Although several tools for analyzing multiplication have emerged in recent years, there have been few substantial improvements in the algorithm for collinearity extraction, and the downstream evolution analysis program provided no distinction between collinearity fragments caused by multiplication. This incompleteness of functionality has reduced the usefulness of existing collinearity detection tools. WGDI is particularly useful for identifying polyploidy events, and the inference of hierarchy and event-related gene collinearity proposed by this tool helps the actual phylogeny of plants affected by recursive polyploidization. Also, many biological analyses implemented in WGDI are unique. WGDI outperforms similar tools in terms of efficiency, flexibility, scalability.

Note: the authors first lay out the shortcomings of similar tools and then highlight WGDI's advantages, repeatedly, in detail, and point by point.

In addition, WGDI is highly useful and effective for reconstructing ancestral chromosomal karyotype of current species. WGD causes rapid genome reorganization and structural variations to produce the new chromosomal karyotype of the following lineage or species. Such karyotype evolution is regarded as an important factor for evaluating the phylogenetic position of one disputed lineage and inferring the genome structure of an extinct species. However, two models of fission and fusion obviously cannot explain how chromosomes of the ancestral genome evolved into the current karyotypes. For example, the ancestral karyotypes of all angiosperms, monocots, and core dicots have numerous large unlabeled (blank) regions. The shared synteny or synteny breaks during karyotype evolution are critical characters for such inference and a more rigorous framework to perform such analyses is badly needed in addition to the simple fission-fusion model. WGDI can accurately extract syntenic blocks and thus facilitate reconstruction of the evolutionary process of ancestral chromosomes.

Note: this paragraph is really well written; it is clearly the work of an expert.

Conclusion

WGDI outperforms similar tools in terms of efficiency, flexibility and scalability in WGD evolutionary analyses. This new toolkit implements a dynamic programming-based collinearity extraction algorithm and incorporates multiple computer programs for visualization and analysis. Inference of hierarchical and event-related gene collinearity helps to resolve the actual phylogeny of plants affected by recursive polyploidization. WGDI is freely available for public use via GitHub (https://github.com/SunPengChuan/wgdi). WGDI also makes use of conda environments and the bioconda platform, which allows hassle-free installation and upgrading of known-compatible and known-functional tools.

References: 49 in total.

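Summary: a very well-written paper. It is well argued, short and to the point, and has the polish of an expert. Impressive, and a lot to learn from.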
