Versatile and open software for comparing large genomes
Abstract
The newest version of MUMmer easily handles comparisons of large eukaryotic genomes at varying evolutionary distances, as demonstrated by applications to multiple genomes. Two new graphical viewing tools provide alternative ways to analyze genome alignments. The new system is the first version of MUMmer to be released as open-source software. This allows other developers to contribute to the code base and freely redistribute the code. The MUMmer sources are available at http://www.tigr.org/software/mummer.
[Annotator's note: on the open-source release, something to learn from here. Is "open source above all" simply how people in IT think?]
Background
Genome sequence comparison has been an important method for understanding gene function and genome evolution since the early days of gene sequencing. The pairwise sequence-comparison methods implemented in BLAST [1] and FASTA [2] have proved invaluable in discovering the evolutionary relationships and functions of thousands of proteins from hundreds of different species. The most commonly used application of these sequence-analysis programs is for comparing a single gene (either a DNA sequence or the protein translation of that sequence) to a large database of other genes. The results of such protein and nucleotide database searches have been used in recent years as the basis for assigning function to most of the newly discovered genes emerging from genome projects. In recent years, an important new sequence-analysis task has emerged: comparing an entire genome with another. Until 1999, each new genome published was so distant from all previous genomes that aligning them would not yield interesting results. With the publication of the second strain of Helicobacter pylori [3] in 1999, following the publication of the first strain [4] in 1997, the scientific world had its first chance to look at two complete bacterial genomes whose DNA sequences lined up very closely. Comparison of these genomes revealed an overall genomic structure that was very similar, but showed evidence of two large inversion events centered on the replication origin. The comparison also made it clear that a new type of bioinformatics program was needed, one that could efficiently compare two megabase-scale sequences, something that BLAST cannot do. In response to this need, TIGR released MUMmer 1.0, the first system that could perform genome comparisons of this scale [5]. The first two releases of MUMmer had over 1,600 site licensees, a number that has grown since moving to an open-source license in May 2003.
The number of pairs of closely related genomes has increased dramatically in recent years, with a corresponding increase in the number of scientific studies of genome structure and evolution, facilitated by new software that permits the comparisons of these genomes. As of mid-2003, there are more than 150 complete published genomes, with over 380 prokaryotic genome projects and 240 eukaryotic projects under way. Many of these involve species that are closely related to published genomes. The published databases already include 33 species for which at least one other closely related species has been sequenced; for a detailed list see [6]. More distantly related pairs of species, for example, Plasmodium falciparum and P. yoelii, fail to show DNA sequence similarity but do show large-scale similarity when their translated protein sequences are aligned, as described in earlier studies [7,8].
Related to the growing number of closely related species that have been sequenced is a rapid growth in the number of known species whose genomes are similar but have undergone significant rearrangement. The human and mouse genomes, for example, are both available in draft form, and the chromosomes of either species can be aligned with the other at the DNA level. Various lines of evidence in the past have pointed to massive genome rearrangements separating the species, and the latest analysis indicates that the mouse genome can be split into 217 large segments that can be rearranged to produce the same gene order as in the human genome [9]. This very large-scale similarity interrupted by rearrangements places additional demands on genome-comparison programs: essentially, one must produce all pairs of similar regions in the sequences (in the form of local alignments), not merely a single 'best' or longest global alignment of the entire sequences.
In addition to the need for whole-genome alignment programs, another need has become evident recently - a means of reliably evaluating and comparing genome assemblies. The explosion of genome sequencing has brought with it an explosion in genome-assembly programs, with several new assemblers either under development or recently released [10-12]. Unlike the previous generation of assemblers (TIGR Assembler [13], phrap [14], and CAP3 [15]), these second-generation assemblers are designed to handle large eukaryotic genomes. Assembly of large genomes is a major technical challenge, and once an assembly has been produced, evaluating it can be almost as difficult. Debates over the relative quality of assemblies produced by different assemblers are ongoing, and whole-genome comparison algorithms represent a critical tool in these analyses. Different assemblies of the same data should be nearly 100% identical, making the comparison problem analogous to the problem of comparing closely related species. Assembly differences may represent errors in one of the algorithms, and are useful for providing insights into the strengths and weaknesses of different methods. The large-scale comparison problem also occurs for assemblies delivered by the same software but from different inputs; for example, assemblies at threefold (3×) coverage and sixfold (6×) coverage of the same genome. With larger eukaryotic projects, multiple assemblies are run at different stages of the project, and comparisons of the successive assemblies provide a map showing how to transfer any analyses (such as gene predictions) from one assembly to another.
A third use for rapid, large-scale alignment programs has come up in our own applications. As part of our annotation 'pipeline' at TIGR, we routinely rebuild a database containing the results of all-against-all BLAST searches for all known proteins. Each time a new genome is added to the public archives, many thousands of searches need to be re-run to incorporate the newly sequenced genes. Because of the size of the archive, these additional searches take a relatively long time. A rapid method for identifying potential hits is used as a pre-screen as follows: for each new gene that is being added to the database, we use the high-speed method (MUMmer) to determine if it has any potential hits. If it does not, then it can be omitted from subsequent BLAST searches. If a new genome has a large number of novel proteins, this pre-screening step can substantially reduce the time required to search it against the database.
The new MUMmer system, version 3.0, addresses all of the above uses and more, including new graphical modules for viewing assembly comparisons and for looking at more distantly related species alignments. In addition, the implementations of all the fundamental search operations are now either optimal or nearly optimal, in the sense of running in time proportional to the sum of their input and output sizes. Other parts of the code have also been rewritten to improve their efficiency.
[Annotator's note: lesson learned. A single algorithm can support several SCI-indexed papers, as long as each release brings a genuine optimization breakthrough. Impressive.]
What may be the most significant change with MUMmer 3.0 is that it is now an open-source system. All code is publicly available without restriction on its use or redistribution, and we encourage others to add to the code base and distribute their own improvements. The modularity of the code base makes it easily extendable as well. Others can build on our matching algorithm, for example, and create their own clustering and extension steps.
[Annotator's note: going open source can itself be presented as a highlight of a new release.]
Results
Since the development of MUMmer 1.0 in 1999, several other programs for large-scale genome comparison have been developed, for example, SSAHA [16], AVID [17], MGA [18], BLASTZ [19], and LAGAN [20] (see also [21] for a review). Most of these programs follow an anchor-based approach, which can be divided into three phases: computation of potential anchors; computation of a colinear sequence of non-overlapping potential anchors - these anchors form the basis of the alignment; and alignment of the gaps in between the anchors. The traditional methods to compute potential anchors, that is, maximal matches of some length l or longer, use a generate-and-test approach. In a first step, all matches of some fixed length k < l, called k-mers, are generated using a method based on hashing (adopted from [22]). Each such k-mer is checked to see whether it can be extended to a maximal exact match of length at least l. The extension is done by pairwise character comparisons, and thus the run time of this approach depends not only on the number of potential anchors, but also on their lengths. This can be illustrated by an example where all maximal matches of length 20 or larger between two different strains of Escherichia coli (strain K12, 4,639,221 base-pairs (bp) and strain O157:H7, 5,528,445 bp) are computed. With k = 10, a typical choice for k, the hashing approach first generates 4.99 × 10^7 k-mers and then performs 1.66 × 10^7 character comparisons to determine all 46,629 maximal matches of length 20 or larger. Thus, less than 0.1% of the generated k-mers are extended to maximal matches of the specified length. For this reason, the generate-and-test approach leads to long running times, if the sequences under consideration share long substrings.
[Annotator's note: this paragraph analyzes the drawbacks of the traditional hashing approach and where they come from.]
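The generate-and-test idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of any of the cited programs: the function name, the tuple layout (reference start, query start, length), and the way comparisons are counted are all hypothetical choices made for the example.

```python
from collections import defaultdict


def maximal_matches_by_hashing(ref, qry, k, min_len):
    """Generate-and-test: seed with shared k-mers, then extend each seed.

    Returns (matches, seeds, comparisons), where matches is a sorted list of
    (ref_start, qry_start, length) tuples with length >= min_len.
    """
    assert k <= min_len
    index = defaultdict(list)  # k-mer -> start positions in the reference
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)

    seeds = 0        # number of generated k-mer hits
    comparisons = 0  # number of single-character comparisons during testing
    found = set()
    for j in range(len(qry) - k + 1):
        for i in index.get(qry[j:j + k], ()):
            seeds += 1
            # Only test a seed that is the leftmost k-mer of its match,
            # i.e. one that cannot be extended to the left.
            if i > 0 and j > 0:
                comparisons += 1
                if ref[i - 1] == qry[j - 1]:
                    continue
            # Extend to the right one character at a time.
            length = k
            while i + length < len(ref) and j + length < len(qry):
                comparisons += 1
                if ref[i + length] != qry[j + length]:
                    break
                length += 1
            if length >= min_len:
                found.add((i, j, length))
    return sorted(found), seeds, comparisons


if __name__ == "__main__":
    matches, seeds, comparisons = maximal_matches_by_hashing(
        "TTACGTACGGA", "CCACGTACGTT", k=3, min_len=6)
    print(matches, seeds, comparisons)
```

The extension loop is where the cost grows with match length: every additional matching character costs one more comparison, which is the dependence the paragraph above points out.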
Recognizing this disadvantage of the hashing approach, MUMmer 1.0 was the first software system to use suffix trees to find potential anchors for an alignment. Suffix trees have been studied for almost three decades in computer science (see [23] for an overview). A suffix tree is a data structure for representing all the substrings of a string, whether that string is a DNA sequence, a protein sequence, or plain text. Suffix trees have the following nice features which make them an important data structure for large-scale genome analysis: a suffix tree for a string S of length n can be represented in space proportional to n; fast algorithms have been designed that can construct a suffix tree in time proportional to n [24,25]; given the suffix tree of S and a query string Q of length m, there are algorithms to compute all unique maximal matches between S and Q of any specified minimum length (the potential anchors) in time proportional to m. All maximal matches, unique or not, can be found in near-optimal time. Note especially that, unlike the hashing approaches, the run-time of the suffix-tree algorithms does not depend on the length of the maximal matches.
[Annotator's note: this paragraph reviews the principles and advantages of the suffix-tree data structure once more, and brings out those advantages by contrasting them with the weaknesses of the hashing approach.]
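The "index the reference once, then stream the query against it" pattern can be illustrated with a much simpler stand-in for a suffix tree. The sketch below swaps in a naively built sorted suffix list plus binary search. It does not reach the linear time and space bounds quoted above, and its prefix comparison still costs time proportional to the match length. All function names are hypothetical. It reports the longest reference match starting at each query position; a maximal-match finder would additionally drop hits that can be extended to the left.

```python
from bisect import bisect_left


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_suffix_index(ref):
    """Sorted (suffix, start) pairs; quadratic space, for illustration only."""
    return sorted((ref[i:], i) for i in range(len(ref)))


def longest_match_at_each_position(index, qry, min_len):
    """For each query position, find the longest match anywhere in the reference.

    The suffix sharing the longest prefix with a query suffix is always one of
    the two neighbours of its insertion point in the sorted suffix list.
    """
    suffixes = [s for s, _ in index]
    hits = []
    for j in range(len(qry)):
        tail = qry[j:]
        pos = bisect_left(suffixes, tail)
        best_len, best_start = 0, -1
        for p in (pos - 1, pos):
            if 0 <= p < len(suffixes):
                n = common_prefix_len(suffixes[p], tail)
                if n > best_len:
                    best_len, best_start = n, index[p][1]
        if best_len >= min_len:
            hits.append((best_start, j, best_len))
    return hits


if __name__ == "__main__":
    idx = build_suffix_index("TTACGTACGGA")
    print(longest_match_at_each_position(idx, "CCACGTACGTT", min_len=5))
```

As in the text, only the reference is indexed; the query is read position by position and never stored in the index.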
Details of the suffix-tree algorithms incorporated in earlier versions of MUMmer have been described in [5,7]. Here we will focus on novel developments. MUMmer is among the fastest programs for large-scale alignment; one recent test reported times for MUMmer that ranged from 4 to 110 times faster than AVID, BLASTZ, and LAGAN [20]. At its default settings, MUMmer is less sensitive at detecting matches than these programs; however, we have added several command-line options to MUMmer 3.0 that permit the detection of much weaker matches than the system would find otherwise. Note that the modularity of MUMmer, and its availability as open-source code, means that others can now build a hybrid system using, for example, the suffix-tree matching algorithm in MUMmer and the match extension program code from LAGAN or AVID.
[Annotator's note: the lower default sensitivity reminds me of the claim that MCScanX is less sensitive and accurate than the original MCScan, yet much more efficient. A toolkit with a clear advantage is worth publishing (provided you have the money, of course). The hybrid-system idea at the end of the paragraph is also a good one.]
Additional features added to MUMmer 3.0 are a new Java viewer, DisplayMUMs; a new graphical output program to generate images in fig-format or PDF, showing the alignment of a set of contigs to a reference chromosome; and new options to find non-unique matches. These will be described below.
[Annotator's note: the new graphical output program is impressive.]
Optimized suffix-tree data structure and suffix-tree library
The most significant technical improvement in MUMmer 3.0 is a complete rewrite of the suffix-tree code, based on the compact suffix-tree representation of [26]. This representation was also used in the repeat analysis tool REPuter [27]. However, REPuter could only accommodate sequences up to 134 million bp (Mbp). For MUMmer 3.0 the implementation was improved so that it allows sequences up to 250 Mbp on a PC with 4 gigabytes (GB) of real memory, at the cost of a slightly larger space usage per base pair. For example, one can construct the suffix tree for human chromosome 2 (237.6 Mbp, the largest human chromosome) using 15.4 bytes per base-pair. For processing DNA sequences less than 134 Mbp in length, MUMmer can be compiled so that it uses only about 12.5 bytes per bp [26]. Since suffix trees for DNA sequences are typically larger than for protein sequences, the bytes per base-pair ratio is even better for the latter.
MUMmer now requires approximately 25% less memory than release 2.1 and it runs slightly faster. Compared to the initial release in 1999, the system is more than twice as fast and uses less than half the memory. As in MUMmer 2.1, release 3.0 streams the query sequence against the suffix tree of the reference sequence. Thus the total space requirement of MUMmer is the size of the suffix tree plus the size of the reference and the query sequences. Table 1 shows run times and memory requirements for MUMmer release 2.1 and 3.0, when computing maximal matches for different pairs of genomes or chromosomes.
Timings were done on a Linux-based computer with a 2.4 GHz Pentium processor. The human-mouse comparison was run only with MUMmer 3.0. Mbp, millions of base pairs; MB, megabytes. A suffix tree is constructed only for the reference genome.
With release 3.0, MUMmer now has the ability to run a multi-contig query against a multi-contig reference. Previously this was available by using the Nucmer package, but not directly within the core mummer program. In Table 1, for example, the genome sequence of Aspergillus fumigatus consisted (at the time of this study) of 19 scaffolds that were aligned to 248 contigs from A. nidulans. This comparison was handled by a simple call to the mummer program in release 3.0, but in release 2.1 the reference sequence needs to be first collapsed into a single contig and, after matching, the coordinates have to be re-mapped (by Nucmer) to the correct contig locations. Both releases handle multi-contig query files. Table 1 also shows the times for aligning the 22.2 Mbp chromosome 2L of the fruit fly Drosophila melanogaster to an interim assembly (before the genome project was complete) of D. pseudoobscura. In this case the query sequence, consisting of 4,653 scaffolds containing approximately 150 Mbp of sequence, was much longer than the reference. The program required 485 MB of total memory, approximately 310 MB for the suffix tree and the rest to hold the input sequences.
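The memory figures quoted in this section can be cross-checked with simple arithmetic. The sketch below assumes decimal megabytes and one byte of storage per sequence character; neither assumption is stated in the text, and the variable names are only illustrative.

```python
MB = 1_000_000  # assumption: decimal megabytes


def suffix_tree_mb(reference_bp, bytes_per_bp):
    """Suffix-tree size in MB for a reference of the given length."""
    return reference_bp * bytes_per_bp / MB


# Human chromosome 2 at 15.4 bytes per base pair (figures from the text).
chr2_tree_mb = suffix_tree_mb(237.6e6, 15.4)

# The longest sequence supported by the compact build, at 12.5 bytes per bp.
compact_tree_mb = suffix_tree_mb(134e6, 12.5)

# Fly comparison: suffix tree plus reference plus query, assuming one byte
# per base for the two input sequences.
fly_tree_mb = 310.0
fly_reference_mb = 22.2e6 / MB
fly_query_mb = 150e6 / MB
estimated_total_mb = fly_tree_mb + fly_reference_mb + fly_query_mb

# Bytes per base pair implied by a 310 MB tree over a 22.2 Mbp reference.
implied_bytes_per_bp = fly_tree_mb * MB / 22.2e6

if __name__ == "__main__":
    print(f"chr2 suffix tree: {chr2_tree_mb:.0f} MB")
    print(f"compact build at 134 Mbp: {compact_tree_mb:.0f} MB")
    print(f"fly total estimate: {estimated_total_mb:.1f} MB")
    print(f"implied bytes per bp: {implied_bytes_per_bp:.2f}")
```

Under these assumptions the chromosome 2 tree comes to roughly 3.66 GB, which fits in the 4 GB machine mentioned above, and the fly estimate of about 482 MB lands close to the reported total of 485.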
Non-unique maximal matches
Previous versions of MUMmer emphasized maximal unique matches (MUMs) as prospective anchors for an alignment. MUMs are unique in that they occur exactly once in each of the genomes. In some cases, the uniqueness constraint will prevent MUMmer from finding all matches for a repetitive substring. For example, if the reference genome has two exact copies of a particular string and the query has just one copy, then earlier versions of MUMmer would generally miss one of the matching copies, depending on the surrounding sequence. To overcome this problem, the new MUMmer system can find all maximal matches - including non-unique ones - between two input sequences simply by providing a command-line option to the program mummer. Other command-line options allow the user to produce MUMs that are unique in both the query and the reference sequence or MUMs that are unique only in the reference sequence.
Although the algorithm to produce all maximal matches is more complicated than the algorithm to produce unique maximal matches, it still runs in nearly optimal time, where optimal time would be proportional to the sum of the sizes of the input strings and the number of matches found. The run times to produce any of the three types of maximal matches are generally similar. Note, however, that when the program is directed to find all non-unique matches, including short ones, the size of the output can be extremely large, and the time to create the output file will be the dominant part of the computation.
[Annotator's note: worth remembering. When all non-unique matches are requested, writing the output dominates the run time.]
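The three kinds of maximal match discussed in this section can be made concrete with a brute-force sketch. It is quadratic and only suitable for toy inputs; it is not how mummer computes them, and the function names are hypothetical. The example reproduces the situation described above: the reference holds two copies of a string, the query holds one, and a uniqueness filter drops the copy whose surrounding sequence does not help.

```python
def maximal_matches(ref, qry, min_len):
    """All maximal exact matches as (ref_start, qry_start, length), brute force."""
    out = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            if ref[i] != qry[j]:
                continue
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue  # extendable to the left, so not maximal here
            n = 0
            while i + n < len(ref) and j + n < len(qry) and ref[i + n] == qry[j + n]:
                n += 1
            if n >= min_len:
                out.append((i, j, n))
    return out


def count_occurrences(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    return sum(text.startswith(pattern, p)
               for p in range(len(text) - len(pattern) + 1))


def classify_matches(ref, qry, min_len):
    """Return (all maximal matches, unique in reference, unique in both)."""
    all_matches = maximal_matches(ref, qry, min_len)
    unique_in_ref = [m for m in all_matches
                     if count_occurrences(ref, ref[m[0]:m[0] + m[2]]) == 1]
    unique_in_both = [m for m in unique_in_ref
                      if count_occurrences(qry, qry[m[1]:m[1] + m[2]]) == 1]
    return all_matches, unique_in_ref, unique_in_both


if __name__ == "__main__":
    reference = "GGACGTACTTACGTACCC"  # two copies of ACGTAC
    query = "CGACGTACTG"              # one copy, flanks match the first copy
    for label, found in zip(("all", "unique in ref", "unique in both"),
                            classify_matches(reference, query, min_len=5)):
        print(label, found)
```

In this toy case the first copy survives the uniqueness filter only because its flanking bases extend the match into a string that occurs once in the reference; the second copy is reported only when all maximal matches are requested.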
Distant matches
One of the criticisms that has been leveled at MUMmer 1.0 was that it only finds exact matches, whereas in practice we often want to find approximate matches, that is, matches between sequences that are less than 100% identical. We addressed this concern in release 2.1, with the introduction of the Nucmer and Promer packages built on top of MUMmer. These have been substantially improved in the 3.0 release, and now exhibit performance only marginally slower than the basic search itself. The speed-up of Nucmer and Promer compared to release 2.1 is approximately 10-fold.
[Annotator's note: the authors openly criticize their own earlier release, which is a good practice to learn from.]
Both Nucmer and Promer produce a collection of local alignments using the algorithm described below. The difference between the two programs is that Nucmer constructs nucleotide alignments between two sets of DNA sequences, while Promer constructs amino acid alignments. Each set of sequences is a collection of one or more sequences from the same genome, for example, a collection of contigs produced by a genome assembler. Promer first translates both reference and query in all six frames, finds all matches in the amino acid sequences, and then maps the matches back to the original DNA coordinate system. For the extension step below, Promer uses a standard amino acid substitution matrix (BLOSUM62 is the default) to score mismatches.
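The six-frame translation and the mapping back to DNA coordinates described for Promer can be sketched as follows. This is an illustrative reconstruction, not Promer's code: it assumes the standard genetic code, 0-based half-open coordinates, and a frame numbering of +1..+3 and -1..-3, and every function name is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset):
    """Translate complete codons starting at offset; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[p:p + 3], "X")
                   for p in range(offset, len(dna) - 2, 3))


def six_frames(dna):
    """Return {frame: protein} for frames +1..+3 (forward) and -1..-3 (reverse)."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna, offset)
        frames[-(offset + 1)] = translate(rc, offset)
    return frames


def protein_to_dna_span(frame, aa_start, aa_len, dna_len):
    """Map an amino-acid match to [start, end) on the forward DNA strand."""
    offset = abs(frame) - 1
    start = offset + 3 * aa_start
    end = start + 3 * aa_len
    if frame > 0:
        return start, end
    # Reverse frames were read on the reverse complement; flip the interval.
    return dna_len - end, dna_len - start


if __name__ == "__main__":
    dna = "ATGGCCTAA"
    print(six_frames(dna))
    print(protein_to_dna_span(-1, 2, 1, len(dna)))
```

Matches found between translated frames are expressed in amino-acid positions, so a helper such as `protein_to_dna_span` is what lets them be reported in the original nucleotide coordinate system.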
The Nucmer/Promer alignment algorithm is as follows. First, both programs run MUMmer to find all exact matches longer than a specified length l, measured in nucleotides for Nucmer and amino acids for Promer. Second, the matches are clustered in preparation for extending them. Two matches are joined into the same cluster if they are separated by no more than g nucleotides (Nucmer) or amino acids (Promer). Then from each cluster, the maximum-length colinear chain of matches is extracted and processed further if the combined length of its matches is at least c nucleotides/amino acids. (Note that a chain can consist of a single matching region if l > c.) The parameters l, g, and c can all be set on the command line. The chain matches are then extended using an implementation of the Smith-Waterman dynamic programming algorithm [28], which is applied to the regions between the exact matches and also to the boundaries of the chains, which may be extended outward. This 'match and extend' step in the algorithm is essentially the same as that used by FASTA [29], BLAST [30], and many other sequence-alignment programs.
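The clustering and chaining steps of the algorithm above can be sketched as follows, with `max_gap` playing the role of g and `min_cluster` the role of c. This is a simplified illustration under stated assumptions, not the Nucmer implementation: two matches are linked when they lie within `max_gap` of each other in both sequences, the chain is found with a quadratic dynamic program, and the Smith-Waterman extension step is omitted entirely. All names are hypothetical.

```python
def interval_gap(a_start, a_len, b_start, b_len):
    """Distance between two intervals, 0 if they touch or overlap."""
    return max(0, max(a_start, b_start) - min(a_start + a_len, b_start + b_len))


def cluster_matches(matches, max_gap):
    """Group (ref_start, qry_start, length) matches linked by small gaps."""
    parent = list(range(len(matches)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in range(len(matches)):
        ra, qa, la = matches[a]
        for b in range(a + 1, len(matches)):
            rb, qb, lb = matches[b]
            if (interval_gap(ra, la, rb, lb) <= max_gap
                    and interval_gap(qa, la, qb, lb) <= max_gap):
                parent[find(a)] = find(b)

    clusters = {}
    for idx, match in enumerate(matches):
        clusters.setdefault(find(idx), []).append(match)
    return [sorted(c) for c in clusters.values()]


def best_colinear_chain(cluster):
    """Chain with the largest combined match length, increasing in both sequences."""
    cluster = sorted(cluster)
    if not cluster:
        return [], 0
    score = [m[2] for m in cluster]
    prev = [-1] * len(cluster)
    for b, (rb, qb, lb) in enumerate(cluster):
        for a in range(b):
            ra, qa, la = cluster[a]
            if ra + la <= rb and qa + la <= qb and score[a] + lb > score[b]:
                score[b] = score[a] + lb
                prev[b] = a
    end = max(range(len(cluster)), key=score.__getitem__)
    total = score[end]
    chain = []
    while end != -1:
        chain.append(cluster[end])
        end = prev[end]
    return chain[::-1], total


def candidate_chains(matches, max_gap, min_cluster):
    """Chains whose combined match length is at least min_cluster."""
    chains = []
    for cluster in cluster_matches(matches, max_gap):
        chain, total = best_colinear_chain(cluster)
        if total >= min_cluster:
            chains.append(chain)
    return sorted(chains)


if __name__ == "__main__":
    exact = [(0, 0, 30), (40, 45, 30), (80, 60, 25), (80, 90, 40), (5000, 5000, 50)]
    print(candidate_chains(exact, max_gap=100, min_cluster=65))
```

In the example, the match at (80, 60) falls in the same cluster as its neighbours but is left out of the chain because it is not colinear with (40, 45), and the isolated match at position 5000 forms its own cluster, which is discarded for falling below `min_cluster`.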
When two species are very similar, such as the two isolates of the Bacillus anthracis Ames strain sequenced at TIGR [31-33], then MUMmer is ideally suited for aligning the genomes. In that comparison of anthrax isolates, only four single-nucleotide differences separated the two 5.3 Mbp main chromosomes from one another. Similarly, in our comparison of a clinical isolate of Mycobacterium tuberculosis to a laboratory strain [31], MUMmer quickly found the approximately 1,100 SNPs and a handful of IS elements that distinguished the strains. However, when the species being compared are more distant, Nucmer and Promer provide much more detailed and more useful alignments than MUMmer alone. In the examples described below, we show how each of the programs described here may be run for genomes at different evolutionary distances.
Fly versus fly
The 130 Mbp genome of D. melanogaster is largely complete, with the six main chromosome arms containing only a few gaps. Recently, the Human Genome Sequencing Center at Baylor College of Medicine completed the shotgun sequencing of D. pseudoobscura, a closely related species with a genome of approximately the same size. These two species are close enough that almost all genes are shared, and exons show a high level of sequence identity. However, they are sufficiently distant that intergenic regions and introns do not align well, and there have been hundreds of large-scale chromosomal rearrangements since the species diverged. Thus, one cannot simply align each chromosome arm to its counterpart. Complicating matters further, the D. pseudoobscura shotgun assembly consists of thousands of scaffolds and contigs. To facilitate comparison, the first computational task is to align all the scaffolds to each of the D. melanogaster arms. (The comprehensive analysis of D. pseudoobscura, organized by the sequencing center scientists and their collaborators, will appear in a future paper. The description here is primarily intended to illustrate the use and capabilities of Nucmer.)
We ran the Nucmer program with a minimum match length of 25, which was adequate to capture virtually all matching exons. Because matching genes are much longer, we required cluster chains to contain at least 100 matching nucleotides. To account for long introns and to allow the program to cluster together multiple genes, we allowed the gap between exact matches to be as long as 3,000 bp. At the time of our analysis (before completion of the sequencing project), the D. pseudoobscura assembly contained 4,653 scaffolds spanning 150 Mbp. We ran Nucmer separately to align the full set of scaffolds to each D. melanogaster chromosome arm. Using these settings, the program takes about 6 minutes per arm and uses approximately 490 MB of memory on a 2.8 GHz desktop Pentium 4 PC running Linux.
Fly versus mosquito
When the two species are more distantly related, the only means of detecting large-scale similarity is through comparisons on the amino acid level. One example of this phenomenon arose during our comparison of the genomes of the malaria mosquito, Anopheles gambiae, and the fruit fly D. melanogaster. Because Anopheles was the second insect genome to be sequenced, the only available species for comparison was fruit fly. Our detailed analysis, done jointly with colleagues at the European Molecular Biology Laboratory in Heidelberg, was based on a combination of BLAST and MUMmer analysis [34]. These two species diverged about 250 million years ago, and they have an average protein sequence identity of 56%, less than that shared between humans and pufferfish. Although the two insects have the same number of chromosomes, the Anopheles genome is approximately twice as large, and the gene order has been almost completely shuffled, as our alignments revealed. Only small, but numerous, regions of 'microsynteny' remain: we reported 948 regions, the largest containing 8 genes in Anopheles and 31 in Drosophila. An interesting finding, though, was that despite extensive shuffling, each chromosome arm had a clear predominance of homologs on a single arm in the other species, indicating that intrachromosome gene shuffling was the primary force affecting gene order (see Figure 7 of [34]).
Fungus versus fungus
In a current application, we are using both Nucmer and Promer to compare two related fungal genomes, Aspergillus fumigatus (a human pathogen) and A. nidulans (a non-pathogenic model organism). Shotgun sequencing of these two genomes has been completed, and A. fumigatus is in the process of being completely finished; that is, all gaps are being closed. (A. fumigatus is a joint sequencing project of TIGR and The Sanger Institute, while A. nidulans is being sequenced at the Whitehead/MIT Genome Center.) At the time of our most recent comparison, the A. fumigatus genome had progressed to the point where it was assembled into 19 scaffolds spanning 28 Mbp, and the A. nidulans genome was assembled into 238 contigs spanning 30 Mbp. For this comparison, we first ran Nucmer and found that most of the two genomes mapped onto one another quite clearly: there are sufficient matches to reveal large segments of similarity in a simple dot plot. There has been extensive rearrangement of the chromosomes, but large-scale synteny is still present. For example, the largest contig (A1058) in A. fumigatus, at 2.9 Mbp, representing an essentially complete chromosome, maps onto five different scaffolds in A. nidulans. If one looks only at the Nucmer alignment of the largest of these, a 2.1 Mbp scaffold containing 10 contigs, it appears to be rearranged into multiple segments, but the matches are so scattered that it is difficult to tell how many segments there are (Figure 1, left-hand side).
在当前的应用中,我们同时使用 Nucmer 和 Promer 来比较两种相关的真菌基因组,即烟曲霉(一种人类病原体)和构巢曲霉(一种非致病性模式生物)。这两个基因组的鸟枪法测序已经完成,烟曲霉正在进行最终的完成工作;也就是说,所有的缺口(gap)都在被填补。(A. fumigatus 是 TIGR 和 Sanger 研究所的联合测序项目,而 A. nidulans 正在 Whitehead/MIT 基因组中心进行测序。)在我们最近一次比较时,A. fumigatus 基因组已经被组装成 19 个跨度为 28 Mbp 的 scaffolds,而构巢曲霉基因组被组装成 238 个跨度为 30 Mbp 的 contigs。对于这个比较,我们首先运行 Nucmer,发现两个基因组的大部分都能非常清楚地相互映射:有足够的匹配以在一个简单的点图中揭示大段的相似性。染色体发生了广泛的重排,但仍然存在大规模的同线性。例如,烟曲霉中最大的 contig (A1058) 为 2.9 Mbp,代表基本完整的染色体,映射到构巢曲霉中的五个不同 scaffolds 上。如果只看其中最大的一个(一个包含 10 个 contig 的 2.1 Mbp scaffold)的 Nucmer 比对,它似乎被重新排列成多个片段,但匹配非常分散,很难判断有多少片段(图 1,左侧)。
Figure 1. Dot-plot alignments of a 2.9 Mbp chromosome of A. fumigatus (x-axis) to a 2.1 Mbp scaffold of A. nidulans (y-axis). Left: nucleotide-based alignment with Nucmer. Right: amino-acid-based alignment with Promer. Aligned segments are represented as dots or lines, up to 3,000 bp long in the Nucmer alignment and up to 9,500 bp in the Promer alignment. These alignments were generated by the mummerplot script and the Unix program gnuplot.
烟曲霉的 2.9 Mbp 染色体(x 轴)与构巢曲霉的 2.1 Mbp scaffold(y 轴)的点图比对。 左:与 Nucmer 的基于核苷酸的比对。 右图:与 Promer 的基于氨基酸的比对。 对齐的片段表示为点或线,在 Nucmer 对齐中最长为 3,000 bp,在 Promer 对齐中最长为 9,500 bp。 这些对齐是由 mummerplot 脚本和 Unix 程序 gnuplot 生成的。
The syntenic alignment is much more clearly visible, however, if we use Promer instead. The simplest summary is just the number of bases included in the alignments: if we look at the Nucmer alignment between the scaffolds, the total number of matching bases is 81 kbp. In contrast, the Promer alignment covers 1.87 Mbp of A1058, beginning at nucleotide position 1,000,000 and continuing to the end of the chromosome. A graphical illustration is shown in Figure 1, which displays both the Promer and Nucmer alignments between the 2.1 Mbp scaffold from A. nidulans and scaffold A1058 of A. fumigatus. As the figure makes clear, the amino-acid-based alignment covers much more of the sequence of both species, and is therefore much more useful for determining homologous relationships between genes and chromosomal relationships.
但是,如果我们改用 Promer,则同线对齐会更加清晰可见。 最简单的总结就是比对中包含的碱基数量:如果我们查看scaffolds之间的 Nucmer 比对,匹配碱基的总数为 81 kbp。 相比之下,Promer 比对覆盖 A1058 的 1.87 Mbp,从核苷酸位置 1,000,000 开始,一直到染色体末端。 图 1 显示了图形说明,显示了来自构巢曲霉的 2.1 Mbp scaffolds和烟曲霉的scaffolds A1058 之间的 Promer 和 Nucmer 比对。 如图所示,基于氨基酸的比对涵盖了两个物种的更多序列,因此对于确定基因之间的同源关系和染色体关系更有用。
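The "number of bases included in the alignments" summary used above can be sketched in a few lines. The following is a minimal illustration, not MUMmer's own code: it assumes alignments are given as 1-based inclusive coordinate pairs on one sequence, possibly overlapping and possibly listed in reverse orientation, and it counts each covered position once. The function name `covered_bases` and the toy coordinates are hypothetical; the two totals at the end are the figures quoted in the text.

```python
def covered_bases(intervals):
    """Count distinct positions covered by 1-based inclusive intervals.

    Intervals may overlap and may be given as (end, start); each covered
    position is counted only once.
    """
    total = 0
    cur_start = cur_end = None
    # Normalize orientation, then sweep the intervals in sorted order.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the current merged block: close it, open a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlaps the current block: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


# Hypothetical coordinates, for illustration only.
toy_alignments = [(100, 250), (200, 400), (900, 700), (1000, 1000)]
print(covered_bases(toy_alignments))

# Totals quoted in the text for the A1058 comparison.
NUCMER_BASES = 81_000      # 81 kbp of matching bases (Nucmer)
PROMER_BASES = 1_870_000   # 1.87 Mbp covered (Promer)
print(round(PROMER_BASES / NUCMER_BASES, 1))
```

With the quoted totals, the Promer alignment covers roughly 23 times as many bases as the Nucmer alignment.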
Human versus human
One of the most challenging computational tasks one can perform today is the cross-comparison of mammalian genomes. The human and mouse genomes are sufficiently complete that much ongoing research is based on mappings between these two species. As shown in Table 1, MUMmer 3.0 can compare human and mouse chromosomes in a matter of minutes. The table shows the time (7 minutes 10 seconds, on a 2.4 GHz Pentium processor) required to align mouse chromosome 16 (Mm16) to human chromosome 21 (Hs21). These two were chosen because nearly all of Hs21 maps to one end of Mm16; in fact, researchers have developed a mouse model of Down syndrome that has an extra copy of this part of Mm16. We ran a benchmark test of MUMmer 3.0 in which we compared the human genome (version of 3 January 2003, downloaded from GenBank) to itself by computing all maximal matches of length at least 300 between each chromosome and all the others. The resulting 631,975 matches allow one to identify both large- and small-scale interchromosomal duplications. Note that the run-times reported in [6] are only for the match-finding part of MUMmer. The times for processing clusters and performing alignments in the gaps between matches are omitted, as these vary widely depending on the parameters used.
当今最具挑战性的计算任务之一是哺乳动物基因组的交叉比较。人类和小鼠的基因组已经足够完整,因此许多正在进行的研究都是基于这两个物种之间的映射。如表1所示,MUMmer 3.0可以在几分钟内比较人类和小鼠的染色体。该表显示了将小鼠16号染色体(Mm16)对齐到人类21号染色体(Hs21)所需的时间(在2.4 GHz奔腾处理器上为7分10秒)。选择这两个是因为几乎所有的Hs21映射到Mm16的一端;事实上,研究人员已经开发出了一种唐氏综合症小鼠模型,它有一个额外的Mm16的这个部分的副本。我们对MUMmer 3.0进行了基准测试,将人类基因组(2003年1月3日版本,从GenBank下载)与自身进行比较,计算每个染色体与所有其他染色体之间长度至少为300的所有最大匹配。由此产生的631,975个匹配使得人们可以识别大的和小的染色体间重复。注意,[6]中报告的运行时仅用于MUMmer的匹配查找部分。处理集群和在匹配之间的间隙中执行对齐的时间被省略,因为这些时间根据所使用的参数变化很大。
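To make "all maximal matches of length at least 300" concrete, here is a small brute-force sketch of what a maximal exact match is: an identical substring shared by two sequences that cannot be extended in either direction. This is only an illustration under simplifying assumptions (forward strand only, quadratic time). MUMmer itself finds such matches with a suffix tree, which is what makes chromosome-scale inputs feasible. The function name `maximal_matches` and the toy sequences are hypothetical.

```python
def maximal_matches(ref, qry, min_len):
    """Return (ref_pos, qry_pos, length) for every maximal exact match.

    Positions are 0-based. A match is maximal when it cannot be extended
    to the left or to the right without introducing a mismatch.
    """
    n, m = len(ref), len(qry)
    matches = []
    # prev[j] holds the length of the longest common suffix of
    # ref[:i-1] and qry[:j]; tracking the longest suffix guarantees
    # that every reported match is already left-maximal.
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if ref[i - 1] == qry[j - 1]:
                cur[j] = prev[j - 1] + 1
                length = cur[j]
                # Right-maximal: a sequence ends here or the next characters differ.
                at_end = i == n or j == m
                if length >= min_len and (at_end or ref[i] != qry[j]):
                    matches.append((i - length, j - length, length))
        prev = cur
    return matches


# Toy sequences, for illustration only.
print(maximal_matches("ACGTACGT", "TACGA", 3))
```

The same definition applies in the benchmark, with `min_len` set to 300 and whole chromosomes as input.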
For this test, we needed a maximum of about 4 GB of memory. As we did not have a PC available with this amount of memory, we used a Sun-Sparc computer running the Solaris operating system, with 64 GB of memory and a 950 MHz processor.
对于这个测试,我们最多需要大约 4 GB 的内存。 由于我们没有具有这种内存量的 PC,因此我们使用了运行 Solaris 操作系统的 Sun-Sparc 计算机,具有 64 GB 内存和 950 MHz 处理器。
We ran the alignment as follows. Each human chromosome was used as a reference, and the rest of the genome was used as a query and streamed against it. To avoid duplication, we only included chromosomes in the query if they had not already been compared; thus we first used chromosome 1 as a reference, and streamed the other 23 chromosomes against it. Then we used chromosome 2 as a reference, and streamed chromosomes 3-22, X, and Y against that, and so on.
我们按如下方式进行比对。每条人类染色体都被用作参考,基因组的其余部分被用作查询并针对它进行流式比对。为了避免重复,我们只在查询中包含尚未比较过的染色体;因此,我们首先使用 1 号染色体作为参考,并将其他 23 条染色体作为查询与其比对。然后我们使用 2 号染色体作为参考,并将 3-22 号、X 和 Y 染色体作为查询与其比对,以此类推。
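The reference/query scheme described above can be written down as a short schedule generator. This is a sketch of the pairing logic only, not of how the runs were actually scripted; the names `CHROMOSOMES` and `comparison_schedule` are hypothetical.

```python
# The 24 human chromosomes in the order used in the text.
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y"]


def comparison_schedule(chromosomes):
    """Yield (reference, queries) so that each unordered pair is compared once.

    Each chromosome serves as the reference for all chromosomes that come
    after it; the last one has nothing left to be compared against.
    """
    for idx, ref in enumerate(chromosomes[:-1]):
        yield ref, chromosomes[idx + 1:]


schedule = list(comparison_schedule(CHROMOSOMES))
for ref, queries in schedule[:2]:
    print(f"reference chr{ref}: {len(queries)} query chromosomes")

total_pairs = sum(len(queries) for _, queries in schedule)
print(total_pairs)
```

Under this scheme 24 chromosomes give 24 × 23 / 2 = 276 chromosome pairs, spread over 23 reference runs.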
The total length of all human chromosomes for this test was 2,839 Mbp. The time required to build all the suffix trees was 4.7 hours. The space requirement for the suffix tree was remarkably constant, with about 15.5 bytes per base-pair (with only one exception). The total query time was 101.5 hours, and memory usage never exceeded 3.9 GB (see [6] for details). Thus, in approximately 4.5 days on a single processor, we matched the human genome against itself. This could easily be divided up among multiple computers, with each chromosome handled separately, bringing the time down to just 11 hours.
该测试的所有人类染色体的总长度为 2,839 Mbp。 构建所有后缀树所需的时间为 4.7 小时。 后缀树的空间需求非常稳定,每个碱基对大约 15.5 字节(只有一个例外)。 总查询时间为 101.5 小时,内存使用量从未超过 3.9 GB(详见 [6])。 因此,在单个处理器上大约 4.5 天后,我们将人类基因组与其自身进行了匹配。 这可以很容易地在多台计算机之间分配,每个染色体单独处理,将时间缩短到 11 小时。
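The figures in this paragraph can be checked with simple arithmetic. The sketch below uses the numbers quoted in the text (15.5 bytes per base pair, 4.7 hours of suffix-tree construction, 101.5 hours of query time). The length of the largest chromosome, about 245 Mbp, is an assumption introduced here for illustration and does not come from the text.

```python
BYTES_PER_BASE = 15.5   # suffix-tree space per base pair (from the text)
BUILD_HOURS = 4.7       # time to build all suffix trees (from the text)
QUERY_HOURS = 101.5     # total query time (from the text)

# Assumption for illustration only: approximate length of the largest
# human chromosome. This value is not given in the text.
ASSUMED_LARGEST_CHROMOSOME_BP = 245_000_000


def suffix_tree_bytes(n_bases, bytes_per_base=BYTES_PER_BASE):
    """Estimate suffix-tree memory for a reference of n_bases base pairs."""
    return n_bases * bytes_per_base


total_hours = BUILD_HOURS + QUERY_HOURS
total_days = total_hours / 24
peak_gb = suffix_tree_bytes(ASSUMED_LARGEST_CHROMOSOME_BP) / 1e9

print(f"total time: {total_hours:.1f} h = {total_days:.1f} days")
print(f"estimated peak suffix-tree size: {peak_gb:.1f} GB")
```

The total comes to about 106 hours, or roughly 4.4 days, in line with the "approximately 4.5 days" stated above. Under the assumed chromosome length, the estimated tree size of about 3.8 GB is also compatible with the reported 3.9 GB ceiling.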
Graphical viewers
Because the text-format output of MUMmer 3.0 is often voluminous, we have developed two graphical viewers, one for the purpose of comparing two genome assemblies or near-identical sequences, and the other for comparing more distantly related genomes, such as two distinct species. The first viewer, DisplayMUMs, is an open-source, platform-independent Java program. It has been tested on a variety of Unix/Linux platforms and also runs on Apple Macintosh (OS X) or Microsoft Windows computers. The program, which takes as input the results of running MUMmer, allows the user to align and view the results of two different assemblies of the same or very closely related genomes and to tile one set of contigs onto the other. This provides a powerful graphical front end for assembly comparison, a function that is frequently used in the process of assembling and finishing genomes. It allows a user to visualize the tiling of sequence reads onto an assembly in order to understand why contigs might not have properly merged together. Alternatively, one can compare the output of different genome assemblers on the same data, a task that can be quite bewildering when the genome is large and the assemblers disagree.
由于 MUMmer 3.0 的文本格式输出通常很庞大,我们开发了两个图形查看器——(这句话学习),一个用于比较两个基因组组装或几乎相同的序列,另一个用于比较更远相关的基因组,例如两个不同的物种。第一个查看器 DisplayMUMs 是一个开源的、独立于平台的 Java 程序。它已经在各种 Unix/Linux 平台上进行了测试,也可以在 Apple Macintosh (OS X) 或 Microsoft Windows 计算机上运行。该程序将运行 MUMmer 的结果作为输入,允许用户对齐和查看相同或非常密切相关的基因组的两个不同组装的结果,并将一组contigs平铺到另一组上。这为组装比较提供了强大的图形前端,该功能在组装和完成基因组的过程中经常使用。它允许用户将序列读取的平铺可视化到组件上,以了解为什么重叠群可能没有正确合并在一起。或者,可以比较不同基因组组装器在相同数据上的输出,当基因组很大并且组装器不同意时,这项任务可能会非常令人困惑。
DisplayMUMs creates a stand-alone display, illustrated in Figure 2. It contains three main areas. The upper area can show a variety of types of information, including zoomed-in nucleotide alignments. The central panel shows a summary of the alignment, with the reference shown as a gray bar. The matches of the queries to the reference are shown as green (forward) and red (reverse) rectangles, with gaps indicated in gray. A second gray bar shows the gaps in blue, which may seem redundant but is useful when the scale is zoomed out; for example, if the sequence has only one small gap and the scale shows 1 Mbp, then the small gap will be invisible in the upper bar but will still be visible on the lower bar. The lower panel shows the tiling of all the query sequences on the reference, with green and red colors indicating the forward and reverse matching substrings, respectively. As Figure 2 shows, some sequences might match for only a small portion of their length, while others will match across their entire length. DisplayMUMs has many other features, including mouse-over and searching functions, all of which are documented in the software. As this example makes clear, its primary purpose is to improve the utility of MUMmer for genome-assembly analysis.
DisplayMUMs 创建一个独立的显示,如图 2 所示。它包含三个主要区域。上部区域可以显示各种类型的信息,包括放大的核苷酸比对。中央面板显示对齐的摘要,参考显示为灰色条。查询与参考的匹配显示为绿色(正向)和红色(反向)矩形,间隙以灰色表示。第二个灰色条以蓝色显示间隙,这可能看起来多余,但在缩小比例时很有用;例如,如果序列只有一个小间隙且刻度显示为 1 Mbp,则小间隙将在上条中不可见,但在下条中仍可见。下面的面板显示参考上所有查询序列的平铺,绿色和红色分别表示正向和反向匹配子字符串。如图 2 所示,某些序列可能仅匹配其一小部分长度,而其他序列将匹配其整个长度。 DisplayMUMs 具有许多其他功能,包括鼠标悬停和搜索功能,所有这些功能都记录在软件中。正如这个例子所表明的,它的主要目的是提高 MUMmer 在基因组组装分析中的实用性。
Figure 2. Sample display from DisplayMUMs, showing whole-genome alignment of individual shotgun reads (query sequences) to a contig from the Staphylococcus epidermidis genome. The display illustrates how exact matches of the tiling reads can be seen against the contig consensus. Green and red colors in the query sequences indicate alignment on the forward and reverse strands, respectively.
DisplayMUM 的示例显示,显示了单个鸟枪法读数(查询序列)与表皮葡萄球菌基因组的 contig 的全基因组比对。 该显示说明了如何根据重叠群共识看到平铺读数的完全匹配。 查询序列中的绿色和红色分别表示正向和反向链上的比对。
The second viewer, MapView, creates a picture of the mapping between two species based on Nucmer or Promer output. The motivation for creating this viewer was the rapidly increasing number of genome projects that are undertaken to enhance our understanding of another, already completed genome. In these projects, the second genome may have only faint DNA sequence similarity to the first, and in some cases the similarity may be detectable only through protein sequence alignments, such as those produced by Promer. A good example of such a project is the recent effort to sequence D. pseudoobscura mentioned above. The primary motivation for this project is to improve the annotation of D. melanogaster, and MUMmer is one of the tools being used to map the newly assembled D. pseudoobscura onto it. Because the reference genome is well annotated, we included in the viewer the option to display the locations of the genes (and their identifiers) along with the mapping at either the DNA or amino acid sequence level. A MapView snapshot of this alignment is shown in Figure 3, which makes it clear that the amino acid conservation between these two species closely matches the annotated exon structure. This viewer can be used to highlight areas of a genome where exons might have been missed in previous analyses.
第二个查看器,MapView,根据Nucmer或Promer的输出,创建两个物种之间的映射图。创建这个查看器的动机是迅速增加的基因组项目,这些项目是为了加强我们对另一个已经完成的基因组的理解。在这些项目中,第二个基因组可能与第一个基因组只有微弱的DNA序列相似,在某些情况下,这种相似性可能只能通过蛋白质序列比对来检测,例如由Promer产生的比对。这种项目的一个很好的例子是上面提到的最近对D. pseudoobscura的测序工作。这个项目的主要动机是改善D. melanogaster的注释,而MUMmer是用来将新组装的D. pseudoobscura映射到其上的工具之一。由于参考基因组有很好的注释,我们在查看器中加入了显示基因位置(及其标识符)的选项,以及在DNA或氨基酸序列水平的映射。图3是通过MapView进行比对的快照,它清楚地表明这两个物种之间的氨基酸保存与注释的外显子结构密切相关。该查看器可用于突出显示先前分析中可能遗漏外显子的基因组区域。
The MapView program can produce output in three formats: fig (for viewing with the Unix xfig program), PostScript, or PDF. The most flexible format, fig, allows for unlimited scrolling and zooming, and for export to a wide range of additional formats. This makes it easy to view the mapping between a large collection of contigs and a large chromosome.
MapView 程序可以生成三种格式的输出:fig(用于使用 Unix xfig 程序查看)、PostScript 或 PDF。 最灵活的格式 fig 允许无限滚动和缩放,并可以导出为多种其他格式。 这使得查看大量重叠群和大染色体之间的映射变得容易——(做ui的优势之一)。
Figure 3. Sample display created by the MapView program, showing a 185 kbp slice of D. melanogaster chromosome 2L and its alignment to D. pseudoobscura. The alignment, generated by Promer, shows all regions of conserved amino acid sequence. The blue rectangle spanning the figure represents the reference (D. melanogaster), with annotated genes shown above it. Alternative splice variants of the same gene are stacked vertically. Exons are shown as boxes, with intervening introns connecting them. The 5' and 3' UTRs are colored pink and blue to indicate the gene's direction of translation. Promer matches are shown twice, once just below the reference genome, where all matches are collapsed into red boxes, and in a larger display showing the separate matches within each contig, where the contigs are colored differently to indicate contig boundaries. The vertical position of the matches indicates their percent identity, ranging from 50% at the bottom of the display to 100% just below the red rectangles.
MapView 程序创建的示例显示,显示了黑腹果蝇染色体 2L 的 185 kbp 切片及其与 D. pseudoobscura 的对齐。 Promer 生成的比对显示了保守氨基酸序列的所有区域。跨越该图的蓝色矩形代表参考(D. melanogaster),上面显示了注释基因。同一基因的可变剪接变体垂直堆叠。外显子显示为方框,中间有内含子连接它们。 5' 和 3' UTR 是粉红色和蓝色的,以指示基因的翻译方向。 Promer 匹配显示两次,一次在参考基因组下方,所有匹配都折叠成红色框,在更大的显示中显示每个 contig 内的单独匹配,其中 contig 用不同的颜色表示 contig 边界。匹配的垂直位置表示它们的一致性百分比,范围从显示屏底部的 50% 到红色矩形下方的 100%。
Conclusions
As the examples above show, the capabilities of MUMmer 3.0 enable a researcher to compare virtually any two genomes, or collections of genomic sequences, using computers widely available today. Bacterial genomes and relatively small eukaryotes can be aligned on a standard desktop computer, while larger genomes may require larger, server-class machines. With the state-of-the-art representation of the suffix-tree data structure, the memory usage of MUMmer 3.0 is close to the minimum possible, while retaining optimal or near-optimal worst-case run time, depending on the match algorithm used. The additional features in MUMmer 3.0 allow one to find non-unique and non-exact matches, greatly enhancing the flexibility of the system. Finally, by making the system open source, we hope to encourage others to expand upon and improve the code base, which is freely available to all.
如上例所示,MUMmer 3.0 的功能使研究人员能够使用当今广泛使用的计算机比较几乎任何两个基因组或基因组序列集合。 细菌基因组和相对较小的真核生物可以在标准台式计算机上对齐,而更大的基因组可能需要更大的服务器级机器。 使用最先进的后缀树数据结构表示,MUMmer 3.0 的内存使用量接近可能的最小值,同时根据所使用的匹配算法保持最佳或接近最佳的最坏情况运行时间。 MUMmer 3.0 中的附加功能允许人们找到非唯一和非完全匹配,大大增强了系统的灵活性。 最后,通过使系统开源,我们希望鼓励其他人扩展和改进代码库,该代码库可供所有人免费使用。
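Personal note: this is a well-written and well-argued summary. I have little to add, since the authors' conclusions are already thorough. Impressive work, and well worth studying.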