Algorithm Literature Reading 6: MCscan (Algorithm Description Edition)

Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps

Journal: Genome Research

2021 impact factor / JCR quartile: 9.043 / Q1

Figure 1

Large-scale (segmental or whole) genome duplication has been recurring in angiosperm evolution. Subsequent gene loss and rearrangements further affect gene copy numbers and fractionate ancestral gene linkages across multiple chromosomes. The fragmented “multiple-to-multiple” correspondences resulting from this distinguishing feature of angiosperm evolution complicates comparative genomic studies. Using a robust computational framework that combines information from multiple orthologous and duplicated regions to construct local syntenic networks, we show that a shared ancient hexaploidy event (or perhaps two roughly concurrent genome fusions) can be inferred based on the sequences from several divergent plant genomes. This “paleo-hexaploidy” clearly preceded the rosid–asterid split, but it remains equivocal whether it also affected monocots. The model resulting from our multi-alignments lays the foundation for approximating the number and arrangement of genes in the last universal common ancestor of angiosperms. Comparative analysis of inferred homologous genes derived from this model shows patterns of preferential gene retention or loss after polyploidy and reveals large variability of nucleotide substitution rates among plant nuclear genomes.

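Abstract

Large-scale (segmental or whole-genome) duplication has recurred throughout angiosperm evolution. Later gene loss and rearrangement change gene copy numbers and scatter ancestral gene linkages across multiple chromosomes, and the resulting fragmented “multiple-to-multiple” correspondences complicate comparative genomics. Using a robust computational framework that combines information from multiple orthologous and duplicated regions into local syntenic networks, the authors infer a shared ancient hexaploidy event (or possibly two roughly concurrent genome fusions) from the sequences of several divergent plant genomes. This “paleo-hexaploidy” clearly predates the rosid-asterid split; whether it also affected monocots remains unresolved. The model built from the multi-alignments is a basis for approximating the number and arrangement of genes in the last common ancestor of angiosperms. Comparing the homologs inferred from this model shows preferential gene retention or loss after polyploidy and large variation in nucleotide substitution rates among plant nuclear genomes.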

Ancient genome duplications are evident for many lineages of fungi (Kellis et al. 2004), animals (Jaillon et al. 2004), and plants (Bowers et al. 2003), offering opportunities for the evolution of new (Spillane et al. 2007) or modified (Hittinger and Carroll 2007) gene functions, altering gene dosages, and creating new gene arrangements. Traces from past whole-genome duplication events can often be detected from pairwise syntenic segments, including two sets of retained paralogs that have maintained relative genomic locations on syntenic chromosomes. In angiosperms, genome duplications are recurring in many lineages (Bowers et al. 2003), generating large numbers of paralogous loci.

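Ancient genome duplications are evident in many lineages of fungi, animals, and plants. They offer opportunities for new or modified gene functions to evolve, alter gene dosage, and create new gene arrangements. Traces of past whole-genome duplications can often be detected as pairwise syntenic segments, including two sets of retained paralogs that have kept their relative positions on syntenic chromosomes. In angiosperms, genome duplication recurs in many lineages and generates large numbers of paralogous loci.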

Gene loss at duplicated loci effectively fractionates ancestral linkage patterns and reduces the density of continuous stretches of “paleologous” gene pairs, which are the remaining signatures of paleo-polyploidy (Thomas et al. 2006). Depending on the level of gene loss, the remaining signatures of duplication are sometimes so eroded that the homologous segments can no longer be identified based only on similarity to one another. The problem is multiplied when the species in question has undergone several genome duplications, with recent duplications tending to obscure synteny from more ancient events as is found in most angiosperm genomes. Such highly degenerate duplicated segments have been referred to as “ghost duplications” and can often be resolved by comparison to an appropriate “outgroup” genome that did not experience polyploidy or undergo massive gene loss (Van de Peer 2004). For example, “bridging” of ghost duplications using outgroups has clarified the history of polyploidy in both Saccharomyces and Tetraodon (Jaillon et al. 2004; Kellis et al. 2004; Scannell et al. 2007).

Gene loss at duplicated loci fractionates ancestral linkage patterns and lowers the density of continuous runs of “paleologous” gene pairs, the remaining signature of paleo-polyploidy. Depending on how much gene loss has occurred, this signature can be so eroded that homologous segments can no longer be identified from their similarity to each other alone. The problem is compounded when a species has undergone several genome duplications, because recent duplications tend to obscure synteny from older events, as in most angiosperm genomes. Such highly degenerate duplicated segments are called “ghost duplications” and can often be resolved by comparison with a suitable “outgroup” genome that experienced neither the polyploidy nor massive gene loss. For example, bridging ghost duplications through outgroups clarified the history of polyploidy in Saccharomyces (yeast) and Tetraodon (pufferfish).

Continuous stretches of duplicate genes can be computationally deduced through synteny, using some variants of clustering approaches (Vandepoele et al. 2002; Hampson et al. 2005) or more specifically using dynamic programming with a customized scoring scheme if conserved gene order (collinearity) is also considered (Haas et al. 2004; Wang et al. 2006). Traditional methods for deduction of synteny based on “best-in-genome” criteria (Miller et al. 2007), uncovering one-to-one best matching regions during pairwise genome comparisons, are relatively straightforward in vertebrates yet difficult in angiosperms because of additional challenges that are more prominent in angiosperm genomes (Tang et al. 2008). These challenges include frequent genome duplications and convoluted genome shuffling (rearrangements, chromosomal fusions and fissions), such as the extensive rearrangement that has occurred in Arabidopsis within the past 5 million years (Kuittinen et al. 2004).

Continuous runs of duplicated genes can be inferred computationally from synteny, using variants of clustering approaches or, when conserved gene order (collinearity) is also considered, dynamic programming with a customized scoring scheme (Haas et al. 2004; Wang et al. 2006, i.e., ColinearScan). Traditional synteny inference based on “best-in-genome” criteria, which finds one-to-one best-matching regions in pairwise genome comparisons, is relatively straightforward in vertebrates but difficult in angiosperms, where additional challenges are more prominent. These include frequent genome duplications and convoluted genome shuffling (rearrangements, chromosomal fusions and fissions), such as the extensive rearrangement in Arabidopsis within the past 5 million years.

One approach for the computational de-convolution of paleopolyploidy for deduction of ancestral gene orders is a bottom-up approach in which one attempts to resolve one duplication event at a time, starting with the most recent one. This is exemplified by studies in Arabidopsis and Paramecium where the most recently duplicated segments are merged to generate hypothetical intermediate profiles that are further recursively merged (Bowers et al. 2003; Aury et al. 2006).

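One way to computationally deconvolve paleopolyploidy and deduce ancestral gene orders is bottom-up: resolve one duplication event at a time, starting with the most recent. Studies in Arabidopsis and Paramecium illustrate this, merging the most recently duplicated segments into hypothetical intermediate profiles that are then merged recursively.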

Herein, we elaborate on an alternative top-down approach (Tang et al. 2008) that is conceptually more attractive in that it only requires one cycle of deduction: first searching for pairwise synteny information and then combining the resulting pairs to form a multi-way correspondence among all structurally similar chromosomal segments. The efficacy of the top-down approach, however, depends on the searching strategy because of the degenerate synteny resulting from post-duplication gene loss. In particular, a top-down search strategy can incorporate “ghost duplications” (Van de Peer 2004), which are not discernible using a bottom-up approach based on information from only one species.

Here the authors elaborate on an alternative top-down approach (Tang et al. 2008, the Science paper the authors recommend citing) that is conceptually more attractive because it needs only one cycle of deduction: first search for pairwise synteny, then combine the resulting pairs into a multi-way correspondence among all structurally similar chromosomal segments (multiple dynamic programming). Its effectiveness depends on the search strategy, because post-duplication gene loss leaves synteny degenerate. In particular, a top-down search can incorporate “ghost duplications”, which a bottom-up approach using information from only one species cannot discern.

New angiosperm genome sequences (Table 1) promise to qualitatively improve our deductions about the evolution of angiosperm gene repertoire and arrangement. Arabidopsis (Arabidopsis Genome Initiative 2000), rice (Oryza sativa) (International Rice Genome Sequencing Project 2005), poplar (Populus trichocarpa) (Tuskan et al. 2006), grapevine (Vitis vinifera) (Jaillon et al. 2007), and papaya (Carica papaya) (Ming et al. 2008) have been sequenced, and more are in the pipeline. Indeed, Arabidopsis thaliana, a leading botanical model, is now known to be a relatively difficult system from which to deduce ancient gene orders. For example, many Carica segments show collinearity with three or four Arabidopsis segments, showing that two genome duplications have affected the Arabidopsis lineage since its divergence from Carica (Ming et al. 2008). Individual Arabidopsis genome segments correspond to only one Carica segment, showing that Carica has not duplicated since its divergence from Arabidopsis. Both Vitis and Carica have only one duplication event (γ), while α and β occurred in the Arabidopsis lineage after its divergence from the Carica lineage (Ming et al. 2008; Tang et al. 2008).

New angiosperm genome sequences (Table 1) promise to qualitatively improve deductions about the evolution of the angiosperm gene repertoire and arrangement. Arabidopsis, rice (Oryza sativa), poplar (Populus trichocarpa), grapevine (Vitis vinifera), and papaya (Carica papaya) have been sequenced, and more are in the pipeline. Arabidopsis thaliana, a leading botanical model, is now known to be a relatively difficult system from which to deduce ancient gene orders. For example, many Carica segments are collinear with three or four Arabidopsis segments, showing that two genome duplications have affected the Arabidopsis lineage since it diverged from Carica. Individual Arabidopsis segments correspond to only one Carica segment, showing that Carica has not duplicated since that divergence. Vitis and Carica each have only one duplication event (γ), whereas α and β occurred in the Arabidopsis lineage after it diverged from the Carica lineage.

Some newly sequenced genomes have less complicated genome structure and thus may represent better models for comparative genomics than Arabidopsis. In this study, we exploit fragmentary conservation of plant gene orders from multiple genomes along with a new top-down algorithm MCscan, to improve deductions about the course of angiosperm genome structural evolution.

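Some newly sequenced genomes have a less complicated structure and may therefore be better models for comparative genomics than Arabidopsis. This study exploits fragmentary conservation of plant gene orders across multiple genomes, together with a new top-down algorithm, MCscan, to improve deductions about the course of angiosperm genome structural evolution.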

Results

MCscan: Algorithm for multiple gene order alignments

When several genomes and subgenomes (resulting from ancient duplication events) are compared simultaneously, synteny and collinearity between all possible pairs of genomes are tedious to enumerate because chromosomal homology is “transitive.” For example, if there are corresponding chromosomal regions in three genomes A, B, and C, comparisons between the genomes would reveal three pairwise synteny blocks (A-B, B-C, A-C), whereas it could be better represented as a single multiple synteny block (A-B-C). To solve this problem, we implemented a novel algorithm, MCscan, that exploits this transitivity property of collinearity to perform multiple alignments by incorporating pairwise synteny that is derived from shared evolutionary events.

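When several genomes and subgenomes (resulting from ancient duplication events) are compared at once, enumerating synteny and collinearity between all possible pairs is tedious because chromosomal homology is “transitive”. For example, corresponding regions in three genomes A, B, and C yield three pairwise synteny blocks (A-B, B-C, A-C) that are better represented as a single multiple synteny block (A-B-C). MCscan exploits this transitivity of collinearity to build multiple alignments from pairwise synteny that derives from shared evolutionary events.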

The algorithm involves a four-stage pipeline illustrated in Figure 1, with each individual stage described in further detail in Methods.

The algorithm is a four-stage pipeline (Figure 1); each stage is described in more detail in Methods.

Figure 1

We first use a sequence similarity search program to detect matchings among genes in all possible pairs of chromosomes and scaffolds and in both transcriptional directions. This is followed by the “pairwise collinearity” stage, in which the neighboring matches are chained along using dynamic programming. The pairwise collinear blocks are combined in the “multi-collinearity” stage, by fixing one gene order as reference and then heuristically stacking the pairwise synteny tracks one after another. In this step, we need to use a “reference” gene order as the basis for stacking the tracks; we then describe the aligned synteny blocks as “threaded by the reference order,” a procedure inspired by TBA aligner (Blanchette et al. 2004). Once the multi-syntenic blocks are identified, we can classify the segments and index them to different evolutionary events, mainly duplications and divergence.

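First, a sequence similarity search program detects matches among genes in all possible pairs of chromosomes and scaffolds, in both transcriptional directions. Next comes the “pairwise collinearity” stage, in which neighboring matches are chained by dynamic programming. The pairwise collinear blocks are then combined in the “multi-collinearity” stage by fixing one gene order as the reference and heuristically stacking the pairwise synteny tracks one after another. The aligned synteny blocks are described as “threaded by the reference order”, a procedure inspired by the TBA aligner (Blanchette et al. 2004). Once the multi-syntenic blocks are identified, the segments can be classified and indexed to different evolutionary events, mainly duplications and divergence.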

As a result, MCscan condenses the combinatorial matches between multiple chromosomal segments resulting from divergence and recursive duplication events and creates a view of the multiply-aligned segments.

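As a result, MCscan condenses the combinatorial matches among multiple chromosomal segments produced by divergence and recursive duplication, and presents a view of the multiply-aligned segments.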

Patterns of synteny conservation

Using the top-down algorithm MCscan, we have aligned large portions of the five sequenced genomes (Arabidopsis, Carica, Populus, Vitis, and Oryza) based on synteny. A total of 61% of the Arabidopsis genes have preserved their ancestral locations based on cross-species synteny (Table 2), versus 44%, 51%, and 46% of Carica, Populus, and Vitis genes, respectively.

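Using the top-down algorithm MCscan, large portions of the five sequenced genomes (Arabidopsis, Carica, Populus, Vitis, and Oryza) were aligned based on synteny. In total, 61% of Arabidopsis genes have preserved their ancestral locations according to cross-species synteny (Table 2), versus 44%, 51%, and 46% of Carica, Populus, and Vitis genes, respectively.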

The variation in frequencies of aligned genes might be due to different levels of synteny conservation in different species. However, it is also correlated with the degree of contiguity of the respective sequences (Table 1), with a higher percentage of genes explained by synteny in the genomes with higher N50. Indeed, if most genes are in small or unanchored scaffolds, it would be very difficult for MCscan to detect them as syntenic, even if they do remain in their ancestral locations.

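The variation in the fraction of aligned genes may reflect different levels of synteny conservation among species. It also correlates with the contiguity of each assembly (Table 1): genomes with higher N50 have a higher percentage of genes explained by synteny. If most genes sit on small or unanchored scaffolds, MCscan will struggle to detect them as syntenic even when they remain in their ancestral locations.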

Alignments with gene order preserved across four eudicot species show clear triplicated structure in many local regions. Each triplicated branch contains orthologous segments from up to four Arabidopsis regions, one Carica region, two Populus regions, and one Vitis region, supporting the hypothesis that this genome triplication (γ) occurred in a common ancestor of all four species; Populus has one duplication event (p) in its salicoid lineage, and Arabidopsis has two duplications (α and β) in its crucifer lineage. The multiple alignments were threaded by Vitis as the reference order (Supplemental Data 1), since Vitis appeared to have the most close-to-ancestral karyotype among the genomes that we investigated (Jaillon et al. 2007). This is likely to change in the future when we include additional genomes; however, using Vitis as the current “reference” would produce the best solution so far.

Alignments with gene order preserved across the four eudicot species show a clear triplicated structure in many local regions. Each triplicated branch contains orthologous segments from up to four Arabidopsis regions, one Carica region, two Populus regions, and one Vitis region, supporting the hypothesis that the genome triplication (γ) occurred in a common ancestor of all four species. Populus has one duplication event (p) in its salicoid lineage, and Arabidopsis has two (α and β) in its crucifer lineage. The multiple alignments were threaded with Vitis as the reference order, since Vitis appears to have the karyotype closest to the ancestral one among the genomes studied (Jaillon et al. 2007). This may change as more genomes are added, but Vitis as the current “reference” gives the best solution so far.

The triplication of gene loci is also evident from Table 2. For example, we found that 88 aligned loci in Carica have multiplicity levels of three (triplication (γ)), with only one aligned locus exceeding a multiplicity of 3; 54 aligned loci in Populus have the expected multiplicity level of 6 (triplication (γ) × duplication (p)), but only three loci exceed 6. The loci that exceed the expected multiplicity level are likely produced by additional small-scale (single gene or segmental) duplications in each lineage.

The triplication of gene loci is also evident in Table 2. For example, 88 aligned loci in Carica have a multiplicity of 3 (the γ triplication), and only one aligned locus exceeds 3; 54 aligned loci in Populus have the expected multiplicity of 6 (γ triplication × p duplication), and only three exceed 6. Loci above the expected multiplicity probably arose from additional small-scale (single-gene or segmental) duplications in each lineage.

Table 2

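Further circumscribing the γ duplication event

... (this section further infers and characterizes the triplication event)

Comparisons of γ paleologs show that triplicate subgenomes are mostly homogeneous

... (results from a further analysis)

See Figure 3. Sharp-eyed readers will have spotted the text I highlighted in yellow. Experts really are experts: the trees were built with the NJ method and only 100 bootstrap replicates. (Note: unless you are a big name, do not do this. I tried it, and the reviewer said that 1000 replicates is the norm these days and also asked for ML trees.) This is actually the second paper I have read that uses only 100 bootstrap replicates; the previous one was ...

Figure 3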

Discussion

By exploiting fragmentary conservation of plant gene orders, together with a new top-down multi-alignment approach, limitations of Arabidopsis for comparative genomics are mitigated by using new angiosperm genome sequences to qualitatively improve our deductions about the tempo and modes of evolution of angiosperm genes and genomes.

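By exploiting fragmentary conservation of plant gene orders together with a new top-down multi-alignment approach, the limitations of Arabidopsis for comparative genomics are mitigated: new angiosperm genome sequences qualitatively improve deductions about the tempo and modes of evolution of angiosperm genes and genomes.

Figure 4

The rest of the Discussion is beyond my current understanding, so I do not cover it here; interested readers can work through it in the paper.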

Methods

Gene set and sequence homology search

Protein sequences from Arabidopsis, Carica, Populus, Vitis, and Oryza genome annotations were used (Table 1). A few annotated moss (Physcomitrella patens) genes (JGI annotation version 1.1) were also used as the outgroup in gene tree analysis. Carica, Populus, and Vitis gene names were renamed according to their incremental position on the chromosomes or scaffolds (see Supplemental Data 4 for a conversion table to original gene identifiers). In case the original gene identifiers are subject to future changes, the conversion table will be updated accordingly to ensure easy translation. If a gene had more than one transcript, only the first transcript in the annotation was considered. Each genome was compared against itself and other genomes using BLASTP (Altschul et al. 1990), retrieving the best five hits meeting an E-value threshold of 1 × 10^-5.

Protein sequences from the Arabidopsis, Carica, Populus, Vitis, and Oryza genome annotations were used (Table 1). A few annotated moss (Physcomitrella patens) genes (JGI annotation version 1.1) served as the outgroup in gene tree analysis. Carica, Populus, and Vitis genes were renamed by their incremental position on chromosomes or scaffolds (Supplemental Data 4 gives a conversion table to the original identifiers), and the table will be updated if the original identifiers change. For genes with more than one transcript, only the first transcript in the annotation was considered. Each genome was compared against itself and the other genomes with BLASTP (Altschul et al. 1990), keeping the best five hits that met an E-value threshold of 1 × 10^-5.
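The hit-filtering step can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes BLASTP was run with 12-column tabular output (query in column 1, subject in column 2, E-value in column 11), and the function name `top_hits`, the removal of self-hits, and the merging of repeated query-subject rows are all assumptions made here.

```python
from collections import defaultdict


def top_hits(blast_lines, max_hits=5, max_evalue=1e-5):
    """Keep up to `max_hits` best subjects per query from tabular BLAST lines.

    Assumptions (not stated in the paper): self-hits are dropped, and when a
    query-subject pair appears on several rows only its smallest E-value counts.
    """
    best_evalue = defaultdict(dict)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject or evalue > max_evalue:
            continue
        previous = best_evalue[query].get(subject)
        if previous is None or evalue < previous:
            best_evalue[query][subject] = evalue

    result = {}
    for query, subjects in best_evalue.items():
        # Rank subjects by ascending E-value and keep the top few.
        ranked = sorted(subjects.items(), key=lambda item: item[1])
        result[query] = ranked[:max_hits]
    return result
```

Calling `top_hits(open("all_vs_all.blast"))` (a hypothetical file name) would return a dictionary mapping each query gene to a list of `(subject, evalue)` tuples, which is the kind of input the later collinearity stage needs.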

Pairwise gene order alignments

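The syntenic regions were grouped to form multiple alignments using a novel algorithm MCscan (multiple collinearity scan). We first took whole-genome BLASTP results and computed strictly collinear segments for all possible pairs of chromosomes and scaffolds. A pairwise alignment procedure was implemented using an empirical scoring scheme similar to that of Haas et al. (2004). The default scoring scheme (configurable) is min(-log10 E, 50) match score for one gene pair, and -1 gap penalty for each 10-kb distance between any two consecutive gene pairs. The score for each pairwise collinear chain is then calculated via dynamic programming through the following recurrence condition, assuming that two gene pairs, u and v, are on the path where u precedes v:

$$\mathrm{Score}(v) = \max\Big(\mathrm{MatchScore}(v),\ \max_{u}\big[\mathrm{Score}(u) + \mathrm{MatchScore}(v) + \mathrm{GapPenalty}(u, v)\big]\Big)$$

Syntenic regions were grouped into multiple alignments with a new algorithm, MCscan (multiple collinearity scan). Whole-genome BLASTP results were first used to compute strictly collinear segments for all possible pairs of chromosomes and scaffolds. The pairwise alignment procedure uses an empirical scoring scheme similar to that of Haas et al. (2004). By default (configurable), a gene pair receives a match score of min(-log10 E, 50), and each 10 kb of distance between two consecutive gene pairs incurs a gap penalty of -1. The score of each pairwise collinear chain is then computed by dynamic programming with the recurrence above, where gene pairs u and v lie on the path and u precedes v.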
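The chaining step described here can be illustrated with a small dynamic program. This is a sketch under stated assumptions rather than MCscan's actual implementation: anchors are `(pos1, pos2, evalue)` tuples with base-pair positions on two chromosomes, only the forward orientation is handled, the distance between consecutive pairs is taken as the larger of the two coordinate gaps, and the penalty is counted in whole 10-kb units. The recurrence used is Score(v) = max(MatchScore(v), max over u of Score(u) + MatchScore(v) + GapPenalty(u, v)).

```python
import math


def match_score(evalue):
    """min(-log10 E, 50); an E-value of 0 is capped at 50."""
    if evalue <= 0:
        return 50.0
    return min(-math.log10(evalue), 50.0)


def gap_penalty(u, v, unit=10_000):
    """-1 per full 10-kb unit between consecutive anchors (distance choice is an assumption)."""
    dist = max(v[0] - u[0], v[1] - u[1])
    return -(dist // unit)


def best_chain(anchors):
    """Return (score, chain) for the highest-scoring strictly collinear chain."""
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return 0.0, []
    score = [0.0] * n
    prev = [None] * n
    for j, v in enumerate(anchors):
        score[j] = match_score(v[2])  # a chain may start at v
        for i in range(j):
            u = anchors[i]
            if u[0] < v[0] and u[1] < v[1]:  # u precedes v on both chromosomes
                cand = score[i] + match_score(v[2]) + gap_penalty(u, v)
                if cand > score[j]:
                    score[j], prev[j] = cand, i
    end = max(range(n), key=score.__getitem__)
    chain, k = [], end
    while k is not None:  # trace back through the stored predecessors
        chain.append(anchors[k])
        k = prev[k]
    chain.reverse()
    return score[end], chain
```

With the paper's default cutoff, only chains whose score exceeds 300 would be reported. A full implementation would also repeat the search for the reverse orientation and extract more than one chain per chromosome pair.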

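Tandem matches <50 kb apart are collapsed using a representative pair that has the smallest BLASTP E-value. This threshold did not purge all tandems (we still found a very few long-distance tandems in our clustered ancestral loci); however, this is a reasonable trade-off since increasing the threshold would remove some of the intra-chromosomal WGD duplicates. All pairwise segments with scores above 300 are reported. Each pairwise segment consists of two distinct genomic locations with aligned, collinear genes as anchors.

Tandem matches less than 50 kb apart are collapsed into a representative pair with the smallest BLASTP E-value. This threshold does not purge every tandem (a very few long-distance tandems remained in the clustered ancestral loci), but it is a reasonable trade-off because raising the threshold would remove some intra-chromosomal WGD duplicates. All pairwise segments scoring above 300 are reported. Each pairwise segment consists of two distinct genomic locations anchored by aligned, collinear genes.

The expected number of occurrences of a pairwise collinearity pattern can be estimated with a formula similar to the one used by Wang et al. (2006) in ColinearScan,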

where N is the number of matching gene pairs (by BLASTP or BLAT, etc.) between two chromosomal regions defined by the syntenic block; m is the number of collinear gene pairs in the identified block; L1 and L2 are respective lengths of the two chromosomal regions; and l1i and l2i are distances between two adjacent collinear gene pairs in the syntenic block. The expectation is multiplied by two since there are two possible orientation configurations between two collinear segments. This is only an approximation to a more rigorous yet computationally expensive permutation test (Van de Peer 2004) and Monte Carlo methods (Hampson et al. 2005); however, computational experiments and analytical results (Wang et al. 2006) suggest that this gives a reasonable estimate for the significance of the syntenic blocks. All the pairwise alignments that we reported are significant at E < 1 × 10^-10.

Here N is the number of matching gene pairs (from BLASTP, BLAT, etc.) between the two chromosomal regions defined by the syntenic block; m is the number of collinear gene pairs in the block; L1 and L2 are the lengths of the two regions; and l1i and l2i are the distances between adjacent collinear gene pairs in the block. The expectation is multiplied by two because two collinear segments can have two relative orientations. This only approximates the more rigorous but computationally expensive permutation test (Van de Peer 2004) and Monte Carlo methods (Hampson et al. 2005), but computational experiments and analytical results (Wang et al. 2006) suggest that it gives a reasonable estimate of the significance of syntenic blocks. All reported pairwise alignments are significant at E < 1 × 10^-10.
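To make the variable definitions concrete, here is a small sketch of how such an expectation could be computed. The exact expression is not reproduced in these notes, so the form used below, E ≈ 2 · ∏ over the m - 1 adjacent gaps of N · (l1i / L1) · (l2i / L2), is an assumption assembled from the definitions of N, m, L1, L2, l1i, and l2i; check it against the paper before relying on it. The function name `expected_occurrences` is hypothetical.

```python
def expected_occurrences(n_matches, gaps1, gaps2, len1, len2):
    """Assumed form: E = 2 * prod_i [ N * (l1_i / L1) * (l2_i / L2) ].

    gaps1 and gaps2 hold the distances between adjacent collinear gene pairs
    on each region, so both lists have m - 1 entries for a block of m pairs.
    """
    if len(gaps1) != len(gaps2):
        raise ValueError("both regions must have the same number of gaps")
    expectation = 2.0  # two possible relative orientations
    for l1, l2 in zip(gaps1, gaps2):
        expectation *= n_matches * (l1 / len1) * (l2 / len2)
    return expectation
```

Under this assumed form, a block with tightly spaced anchors in long regions yields a very small expectation, which is why a cutoff such as E < 1 × 10^-10 selects dense, long chains.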

Multiple gene order alignments

Pairwise syntenic matches were clustered into multi-way anchors through a Markov clustering algorithm MCL (Enright et al. 2002), in order to simplify the correspondences among multiple loci. Multiple chromosomal regions threaded by consecutive ancestral loci are recovered and aligned using a heuristic that constructs the multiple alignments progressively by aligning one closest related region at a time by dynamic programming. We then use a reference genome to report all the multiple blocks. Notice that when we use a “reference” as the basis, we lose symmetry. For example, let us assume A-B-C as a multiple alignment, formed by syntenic regions A, B, and C. If we allow the blocks to be threaded by A, B, or C, we can find this block three times; however, the resulting multiple alignment may be slightly different because of the order in which we stack A, B, and C. We found that the “once a gap, always a gap” rule applies to the multiple alignment of gene orders, in that the order of progressive stacking does affect the resulting alignment. Therefore, we implement a refinement procedure to ameliorate this effect by iteratively realigning each segment, allowing the falsely placed gaps to be corrected and the gap placement to be further optimized.

Pairwise syntenic matches were clustered into multi-way anchors with the Markov clustering algorithm MCL (Enright et al. 2002) to simplify the correspondences among multiple loci. Chromosomal regions threaded by consecutive ancestral loci are recovered and aligned with a heuristic that builds the multiple alignment progressively, adding the most closely related region one at a time by dynamic programming. A reference genome is then used to report all the multiple blocks, which breaks symmetry. For example, if A-B-C is a multiple alignment of syntenic regions A, B, and C, and blocks may be threaded by A, B, or C, the block is found three times, but the resulting alignments may differ slightly depending on the stacking order. The “once a gap, always a gap” rule applies to multiple alignment of gene orders: the order of progressive stacking affects the result. A refinement procedure therefore iteratively realigns each segment so that falsely placed gaps can be corrected and gap placement further optimized.
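The idea of turning pairwise anchors into multi-way anchors can be illustrated with a much simpler stand-in for MCL. The sketch below uses plain connected components (union-find), which is not Markov clustering: it merges everything that is linked by any chain of matches and has no inflation step to split weakly connected groups. The gene names and the function name `cluster_anchors` are made up for the example.

```python
def cluster_anchors(pairs):
    """Group genes linked by pairwise syntenic matches (connected components).

    A simplified stand-in for MCL, used only to show how A-B and B-C matches
    collapse into one multi-way anchor A-B-C.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(group) for group in groups.values())
```

For example, `cluster_anchors([("At1", "Cp1"), ("Cp1", "Vv1"), ("Pt7", "Vv9")])` groups `At1`, `Cp1`, and `Vv1` into one multi-way anchor and leaves `Pt7` and `Vv9` as a separate pair.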

Clustering the multiply-aligned genomic regions


If we consider “gene retention at the ancestral locus” as the ancestral state and “gene loss” as derived, then each aligned chromosomal segment can be described as a vector of binary characters. We could then search for hierarchical clustering based on “Camin-Sokal parsimony” since genes that had been lost are highly unlikely to re-emerge at original paleologous locations, that is, reversal to the ancestral state is prohibited (Camin and Sokal 1965). Using this simplistic parsimony principle, syntenic genomic regions in multiple alignment blocks can be clustered, using the “mix” program in the PHYLIP package (Retief 2000) with 0/1-coded chromosomal regions within each block as input.

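If “gene retention at the ancestral locus” is treated as the ancestral state and “gene loss” as derived, each aligned chromosomal segment can be described as a vector of binary characters. Hierarchical clustering can then be sought under “Camin-Sokal parsimony”, since lost genes are highly unlikely to re-emerge at their original paleologous locations; that is, reversal to the ancestral state is prohibited (Camin and Sokal 1965). Under this simple parsimony principle, syntenic regions in multiple alignment blocks can be clustered with the “mix” program in the PHYLIP package (Retief 2000), using the 0/1-coded chromosomal regions within each block as input.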
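The binary coding can be sketched as follows. This is an illustration only: the block representation (one list per region, with `None` where the gene is missing at an aligned locus), the choice of `1` for retention and `0` for loss, and the helper names are assumptions. Which symbol the parsimony program treats as ancestral should be checked in its documentation, and the coding flipped if needed.

```python
def retention_matrix(block):
    """Map each region to a 0/1 string: 1 = gene retained at the locus, 0 = lost."""
    return {
        region: "".join("1" if gene is not None else "0" for gene in genes)
        for region, genes in block.items()
    }


def to_phylip(matrix):
    """Write the matrix in a PHYLIP-style layout with 10-character names."""
    names = sorted(matrix)
    n_chars = len(matrix[names[0]])
    lines = [f"{len(names)} {n_chars}"]
    for name in names:
        lines.append(f"{name[:10]:<10}{matrix[name]}")
    return "\n".join(lines)


# Toy block: two regions aligned over four ancestral loci (names are made up).
example_block = {
    "Vv_chr1": ["v1", "v2", None, "v4"],
    "At_chr2": ["a1", None, None, "a4"],
}
example_matrix = retention_matrix(example_block)
```

`to_phylip(example_matrix)` produces a header line with the number of regions and characters followed by one padded row per region, which is the general shape of input that discrete-character programs in PHYLIP expect.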

MCscan implementation and availability

The multi-aligned plant gene orders, the implemented algorithm, and the C++ source code are publicly available (http://chibba.agtec.uga.edu/duplication/mcscan/). The program uses only two input files (a file containing BLASTP results and a file describing gene coordinates) and outputs both pairwise syntenic blocks and the multi-aligned gene orders threaded by a reference genome. There are several parameters to configure according to the user’s need. For example, a stricter significance cutoff would reduce sensitivity but increase specificity for the uncovered syntenic blocks.

The multi-aligned plant gene orders, the algorithm implementation, and the C++ source code are publicly available (http://chibba.agtec.uga.edu/duplication/mcscan/). The program takes only two input files, one with BLASTP results and one describing gene coordinates, and outputs both pairwise syntenic blocks and multi-aligned gene orders threaded by a reference genome. Several parameters can be configured as needed; for example, a stricter significance cutoff lowers sensitivity but raises specificity for the syntenic blocks found.

Comparison between Vitis and Solanum, Musa

For Solanum, we downloaded 195 nucleotide (nt) sequences for tomato (Solanum lycopersicum) from NCBI (September 2007) that were >= 100 kb, discarding one chloroplast sequence from analysis, for a total of 25 Mb (representing ~2.5% of the tomato genome). We retrieved 53,792 TIGR Solanum unigenes (S. lycopersicum TIGR transcript assembly version 5), mapped them to the collected BACs (BLASTN E-value < 1 × 10^-6), and took the best hit that had at least 200-bp alignment length and at least 97% identity. This should accommodate minor sequencing errors or cultivar differences between the ESTs and BACs, if any. If multiple unigenes fell within 300 bp of each other on the tomato sequence, only the longest hit was retained. This was to resolve cases in which the unigenes were not assembled completely or correctly for a gene and the real gene was represented by more than one unigene. A total of 2243 Solanum unigenes, 4.2% of the total, were anchored to BACs. Solanum unigenes were assigned their base-pair locations within the BACs, and we used these mapped unigenes as tentative gene models on these Solanum BACs. The mapped unigenes were then searched for homology against the Vitis proteins using BLASTX (E < 1 × 10^-5). We analyzed synteny of Vitis chromosomal regions and 17 banana (Musa acuminata) BACs in a similar procedure.

For Solanum, 195 tomato (Solanum lycopersicum) nucleotide sequences of at least 100 kb were downloaded from NCBI (September 2007); one chloroplast sequence was discarded, leaving 25 Mb in total (about 2.5% of the tomato genome). A set of 53,792 TIGR Solanum unigenes (S. lycopersicum TIGR transcript assembly version 5) was mapped to the collected BACs (BLASTN E-value < 1 × 10^-6), keeping the best hit with at least 200 bp of alignment and at least 97% identity, which should tolerate minor sequencing errors or cultivar differences between the ESTs and BACs. When several unigenes fell within 300 bp of each other on the tomato sequence, only the longest hit was kept, to handle genes that were incompletely or incorrectly assembled and therefore represented by more than one unigene. In total, 2243 Solanum unigenes (4.2%) were anchored to BACs, assigned base-pair positions within them, and used as tentative gene models. These were then searched against the Vitis proteins with BLASTX (E < 1 × 10^-5). Synteny between Vitis chromosomal regions and 17 banana (Musa acuminata) BACs was analyzed in a similar way.

Synonymous substitution (Ks) and fourfold degenerate site transversion (4DTV) calculation

For each pair of homologs, we aligned their protein sequences using CLUSTALW (Thompson et al. 1994) and converted the protein alignment to DNA alignment using PAL2NAL (Suyama et al. 2006). Some homologous genes could not produce reliable CLUSTALW alignment for various reasons and were discarded from further analysis. Ks values were calculated using the Nei-Gojobori algorithm (Nei and Gojobori 1986) implemented in the PAML package (Yang 1997). We repeated the Ks calculation using other algorithms and found that the differences are small, systematic biases that do not affect major conclusions. We calculated 4DTV values between gene pairs using in-house Perl scripts. 4DTV values are calculated for gene pairs having >= 10 fourfold degenerate sites. Fourfold degenerate sites are the third positions of codons for amino acid residues G, A, T, P, V and of the fourfold degenerate codon families of R, S, and L. Raw 4DTV values are then corrected for possible multiple transversions at the same site using this formula:

$$\mathrm{4DTV}_{\mathrm{corrected}} = -\tfrac{1}{2}\,\ln\left(1 - 2 \times \mathrm{4DTV}\right)$$

For each pair of homologs, protein sequences were aligned with CLUSTALW (Thompson et al. 1994) and the protein alignment was converted to a DNA alignment with PAL2NAL (Suyama et al. 2006). Homologs that could not be aligned reliably were discarded. Ks values were calculated with the Nei-Gojobori algorithm (Nei and Gojobori 1986) as implemented in the PAML package (Yang 1997); repeating the calculation with other algorithms gave only small, systematic differences that do not affect the main conclusions. 4DTV values between gene pairs were computed with in-house Perl scripts, only for pairs with at least 10 fourfold degenerate sites. Raw 4DTV values were then corrected for possible multiple transversions at the same site with the formula above.
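The 4DTV calculation is easy to sketch. The authors used in-house Perl scripts that are not shown, so the Python below is only one plausible implementation: it assumes two codon-aligned, gap-free coding sequences of equal frame, counts a site only when both codons share the same fourfold degenerate two-base prefix, and applies the correction -1/2 · ln(1 - 2 · 4DTV). The function names are hypothetical.

```python
import math

# Two-base codon prefixes whose third position is fourfold degenerate:
# Ala (GC), Gly (GG), Pro (CC), Thr (AC), Val (GT), Arg (CG), Leu (CT), Ser (TC).
FOURFOLD_PREFIXES = {"GC", "GG", "CC", "AC", "GT", "CG", "CT", "TC"}
PURINES = {"A", "G"}


def raw_4dtv(seq1, seq2, min_sites=10):
    """Fraction of shared fourfold degenerate sites that differ by a transversion.

    Returns None when fewer than `min_sites` comparable sites are found.
    """
    sites = transversions = 0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if c1[:2] != c2[:2] or c1[:2] not in FOURFOLD_PREFIXES:
            continue
        if c1[2] not in "ACGT" or c2[2] not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if (c1[2] in PURINES) != (c2[2] in PURINES):
            transversions += 1  # purine <-> pyrimidine change
    if sites < min_sites:
        return None
    return transversions / sites


def corrected_4dtv(raw):
    """Apply -1/2 * ln(1 - 2 * raw); undefined (None) when raw >= 0.5."""
    if raw is None or raw >= 0.5:
        return None
    return -0.5 * math.log(1.0 - 2.0 * raw)
```

The correction grows without bound as the raw value approaches 0.5, so saturated pairs are returned as `None` here rather than as a very large number; that handling is a design choice of this sketch, not something stated in the paper.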

Finite mixture models of genome duplications based on Ks distribution


The actual distribution of Ks between paleologs can be modeled as mixtures of log-transformed exponentials and normals, representing single gene duplications and whole genome duplications, respectively. Since we have identified the paralogs that show segmental correspondence with most of the single gene duplications excluded, the actual distributions can be described as mixtures of log-normal components that represent multiple rounds of genome duplications, using the EMMIX software (http://www.maths.uq.edu.au/gjm/emmix/emmix.html). Ks values that are <0.005 were discarded to avoid fitting a component to infinity (Cui et al. 2006), and the mixed populations were modeled with one to five components. We selected one best mixture model for each paleolog distribution on the basis of Bayesian information criterion (BIC) and an additional restriction on the mean/variance structure for Ks (Cui et al. 2006).

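The distribution of Ks between paleologs can be modeled as a mixture of log-transformed exponentials and normals, representing single-gene duplications and whole-genome duplications, respectively. Because the identified paralogs show segmental correspondence and most single-gene duplications are excluded, the distributions can be described as mixtures of log-normal components representing multiple rounds of genome duplication, fitted with the EMMIX software (http://www.maths.uq.edu.au/gjm/emmix/emmix.html). Ks values below 0.005 were discarded to avoid fitting a component to infinity (Cui et al. 2006), and the mixed populations were modeled with one to five components. The best mixture model for each paleolog distribution was chosen by the Bayesian information criterion (BIC), with an additional restriction on the mean/variance structure for Ks (Cui et al. 2006).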

Acknowledgments

We appreciate financial support from the U.S. National Science Foundation (MCB-0450260 to A.H.P. and J.E.B., DBI-0421803 to R.M. and A.H.P.), the University of Hawaii to M.A., and the U.S. Department of Defense W81XWH0520013 to M.A. We thank Guojun Li for helpful discussions on the synteny deduction algorithm.

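The authors acknowledge financial support from the U.S. National Science Foundation (MCB-0450260 to A.H.P. and J.E.B.; DBI-0421803 to R.M. and A.H.P.), the University of Hawaii (to M.A.), and the U.S. Department of Defense (W81XWH0520013 to M.A.), and thank Guojun Li for helpful discussions on the synteny deduction algorithm.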

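Summary: impressive work. It upgrades ColinearScan's dynamic programming to dynamic programming for multiple alignment. The quality of the writing is something I could spend a lifetime learning from. A great paper!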
