MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity
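2021 impact factor / JCR quartile: 16.971 / Q1
Nucleic Acids Research (this journal has a very high impact factor; it is sometimes ranked in tier 1 and at other times placed in tier 2 by the Chinese Academy of Sciences, and is grade B in ncst) (Figure 1). This journal seems to be relatively receptive to database and web-server papers?
Abstract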
MCScan is an algorithm able to scan multiple genomes or subgenomes in order to identify putative homologous chromosomal regions, and align these regions using genes as anchors. The MCScanX toolkit implements an adjusted MCScan algorithm for detection of synteny and collinearity that extends the original software by incorporating 14 utility programs for visualization of results and additional downstream analyses. Applications of MCScanX to several sequenced plant genomes and gene families are shown as examples. MCScanX can be used to effectively analyze chromosome structural changes, and reveal the history of gene family expansions that might contribute to the adaptation of lineages and taxa. An integrated view of various modes of gene duplication can supplement the traditional gene tree analysis in specific families. The source code and documentation of MCScanX are freely available at http://chibba.pgml.uga.edu/mcscan2/.
(Attentive readers will notice that the original author of MCscan wrote the "s" in lowercase, whereas it is capitalized here. This is of course immaterial.)
Introduction
Comparative genomic studies often rely on the accurate identification of homology (genes that share a common evolutionary origin) within or across genomes. Homologous genes are further classified as either orthologous, if they were separated by a speciation event, or paralogous, if they were separated by a gene duplication event. Recently, comparisons between related eukaryotic genomes have revealed various degrees to which homologous genes remain on corresponding chromosomes (synteny) and in conserved orders (collinearity) during evolution (1). Over evolutionary time, genomes have been shaped and dynamically restructured by several forces such as whole-genome duplication (WGD), segmental duplication, inversions and translocations (2–5). These forces have acted in various combinations and to differing degrees to result in taxonomic groups with different modes of genome structure modification and gene family expansion. For example, angiosperm (flowering plant) genomes appear more volatile than mammalian genomes (1). Angiosperm genomes show remarkable fluctuations in size and organization, even among close relatives, and all examined angiosperms have undergone one or more ancient WGD (6). In contrast, karyotype evolution among major vertebrate lineages appears to have been slower, with a single whole-genome duplication event ~500 million years ago (4,7). However, hundreds of invertebrates are paleopolyploids (8) and their rates of chromosomal rearrangement have been suggested to be almost twice that of vertebrates (1,9,10). Further, there is also a remarkable lack of synteny and a high rate of rearrangement in the parasitic and pathogenic protistan phylum Apicomplexa compared to what is seen in vertebrates (11).
Traditionally, synteny was identified via the clustering of neighboring matching gene pairs, as implemented in various programs including ADHoRe (12), TEAM (13), LineUp (14), the Max-gap Clusters by Multiple Sequence Comparison (MCMuSeC) (15) and OrthoCluster (16). However, detection of synteny is often complicated by gene loss, tandem duplications, gene transpositions and chromosomal rearrangements, any of which may produce artifacts. Collinearity, a more specific form of synteny, requires conserved gene order. More recent methods apply dynamic programming to chains of pair-wise collinear genes, and often specify a certain scoring scheme that rewards the adjacent collinear gene pairs (or ‘anchor genes’) and penalizes the distance between anchor genes. This class of methods has been implemented in software tools such as DAGchainer (17), ColinearScan (18), MCScan (19), SyMAP (20), FISH (21) and CYNTENATOR (22). In addition to algorithmic differences, synteny and collinearity detection tools often differ in application ranges, inputs, presentation of results and/or computational costs.
Although pair-wise collinear relationships among chromosomal regions have been widely studied, the multi-alignment (alignment of three or more regions) of collinear chromosomal regions (referred to as collinear blocks) is more important as it can reveal ancient WGD events (19,23) and complex chromosomal duplication/rearrangement relationships (24). Collinear blocks are comprised of anchor genes which are located at collinear positions and non-anchor genes which are assumed to have experienced gene gains, losses or transposition. Further, anchor genes are more likely to be homologs (25) and tend to be under stronger purifying selection than non-anchor genes (26). Patterns of synteny and collinearity can provide insight into the evolutionary history of a genome, and inform on potentially useful downstream analyses. However, although graphic interfaces for visualizing synteny and collinearity may be incorporated, many available software packages for synteny and collinearity detection do not directly provide downstream analysis tools. Further, genes may be duplicated by mechanisms other than whole-genome duplication, such as tandem, proximal and/or dispersed duplications, each of which may make different contributions to evolution (11,27). In addition, analysis of gene family evolution may require that it be placed in the context of genome evolution. To analyze the evolution of a genome, it may be helpful to correlate gene family analysis with different duplication modes for a more integrated view. To our knowledge, only the MicroSyn package (28) provides analysis of collinearity within gene families, but it cannot superimpose such analysis on a context of whole-genome collinearity.
MCScan is able to identify collinear blocks in genomes or subgenomes and then conduct multi alignments of collinear blocks using collinear genes as anchors (19,23). MCScan is also customizable for genomes of different sizes and with different average intergenic distances. Using MCScan, a Plant Genome Duplication Database (PGDD) has been constructed and is publicly available at http://chibba.pgml.uga.edu/duplication/. The MCScan software package and PGDD database have been applied to a variety of research areas such as genome duplication and evolution (11,29–36), annotation of newly sequenced genomes (37) and the evolution of gene families (38–48).
Building on the MCScan algorithm, here we describe a software package named MCScanX for synteny and collinearity detection, visualization and diverse downstream analyses. Compared with MCScan, the usage of MCScanX has been greatly simplified. To more clearly show how frequently chromosomal regions are duplicated, multi-alignments of collinear blocks against reference chromosomes can be viewed through a web browser with various highlighted features (e.g. tandem arrays, coverage statistics). The overall pattern of synteny and collinearity between or among genomes can be visualized by up to four types of plots. Compared with existing synteny and collinearity detection tools, a distinct feature of MCScanX is that diverse tools for evolutionary analyses of synteny and collinearity are incorporated, aiding efforts to construct gene families using collinearity information, infer gene duplication modes and enrichments, characterize collinear genes with nucleotide substitution rates, detect collinear tandem arrays, perform statistical analyses of duplication depths and collinear orthologs, and analyze collinearity within gene families. MCScanX enables rapid and convenient conversion of synteny and collinearity information into evolutionary insights.
Materials and Methods
Gene set and homology search (the same or a similar heading as in the MCScan algorithm paper)
Whole-genome protein sequences and gene positions for Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera, Glycine max, Oryza sativa and Brachypodium distachyon were retrieved from Phytozome v7.0 (http://www.phytozome.net/). Whole-genome protein sequences and gene positions for Sorghum bicolor and Zea mays were retrieved from EnsemblPlants (http://plants.ensembl.org/index.html) and MaizeSequence Release 5b.60 (http://www.maizesequence.org/index.html), respectively. If a gene had more than one transcript, only the first transcript in the annotation was used. To search for homology, the protein-coding genes from each genome were compared against themselves and against those of the other genomes using BLASTP (49). For each protein sequence, the best five non-self hits in each target genome that met an E-value threshold of 10^-5 were reported.
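The hit-selection rule described above (best five non-self hits per target genome, E-value threshold 10^-5) can be sketched in a few lines. This is only an illustration under stated assumptions: the rows are taken to be already parsed into (query, subject, E-value) tuples, `species_of` is a hypothetical helper that maps a gene ID to its genome, and "met the threshold" is read as E-value <= 10^-5.

```python
from collections import defaultdict


def best_hits(rows, species_of, max_hits=5, evalue_cutoff=1e-5):
    """Keep, per query and target genome, the best non-self hits under the cutoff.

    rows: iterable of (query, subject, evalue) tuples from BLASTP tabular output.
    species_of: hypothetical helper mapping a gene ID to a genome label.
    """
    grouped = defaultdict(list)
    for query, subject, evalue in rows:
        # Drop self hits and hits that do not meet the E-value threshold.
        if query == subject or evalue > evalue_cutoff:
            continue
        grouped[(query, species_of(subject))].append((evalue, subject))

    kept = []
    for (query, _genome), hits in grouped.items():
        hits.sort()  # smallest E-value first; ties broken by subject ID
        kept.extend((query, subject, evalue) for evalue, subject in hits[:max_hits])
    return kept
```

Grouping by (query, target genome) before truncating is what keeps the cap per genome rather than per query overall.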
MCScanX algorithm
The MCScanX algorithm is a modified version of MCScan (19). Whole-genome BLASTP results are used to compute collinear blocks for all possible pairs of chromosomes and scaffolds. First, BLASTP matches are sorted according to gene positions. To avoid high numbers of local collinear gene pairs due to tandem arrays, if consecutive BLASTP matches have a common gene and its paired genes are separated by fewer than five genes, these matches are collapsed using a representative pair with the smallest BLASTP E-value. Then, dynamic programming is employed to find the highest scoring paths (i.e. chains of collinear gene pairs) using the following scoring schema, assuming that two gene pairs, u and v, are on the path where u precedes v,
\[
\mathrm{Score}(v) = \max
\begin{cases}
\mathrm{MatchScore}(v) \\
\mathrm{Score}(u) + \mathrm{MatchScore}(v) + \mathrm{GapPenalty} \times \mathrm{NumberofGaps}(u,v)
\end{cases}
\]
where by default MatchScore(v) = 50 for one gene pair, GapPenalty = -1, and NumberofGaps(u,v), the maximum number of intervening genes between u and v, should be fewer than 25. Non-overlapping chains with scores over 250 (i.e. involving at least 5 collinear gene pairs) are reported. In a pair of collinear blocks, there are two distinct genomic locations with aligned collinear genes as anchors.
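As a rough illustration of this kind of chaining, the sketch below runs a simple quadratic dynamic program over matching gene pairs given as gene-rank coordinates. It is not the MCScanX implementation. It assumes the recurrence Score(v) = max(MatchScore, Score(u) + MatchScore + GapPenalty x NumberofGaps(u,v)), takes NumberofGaps as the larger of the two counts of intervening genes, and handles only the forward orientation. The function name `best_chain` is made up for the example.

```python
def best_chain(pairs, match_score=50, gap_penalty=-1, max_gaps=25):
    """Return (score, chain) for the highest-scoring collinear chain.

    pairs: (rank_a, rank_b) gene ranks of matching gene pairs on two
    chromosomes; only the forward orientation is considered here.
    """
    pairs = sorted(set(pairs))
    if not pairs:
        return 0, []
    score = [match_score] * len(pairs)   # a chain may start at any pair
    prev = [None] * len(pairs)
    for v, (av, bv) in enumerate(pairs):
        for u in range(v):
            au, bu = pairs[u]
            if au >= av or bu >= bv:
                continue                 # u must precede v on both chromosomes
            gaps = max(av - au - 1, bv - bu - 1)
            if gaps >= max_gaps:
                continue                 # too many intervening genes
            candidate = score[u] + match_score + gap_penalty * gaps
            if candidate > score[v]:
                score[v], prev[v] = candidate, u
    # Trace back from the best-scoring end point.
    node = max(range(len(pairs)), key=score.__getitem__)
    best = score[node]
    chain = []
    while node is not None:
        chain.append(pairs[node])
        node = prev[node]
    return best, chain[::-1]
```

With the default parameters, each gap between consecutive anchors lowers the chain score, so a chain of five anchors containing any gap falls just short of 250.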
The expected number of occurrences (E-value) of a pair of collinear blocks is estimated using the formula introduced by Wang et al. (18) for ColinearScan. In that formula, N is the number of matching gene pairs between the two chromosomal regions defined by the pair of collinear blocks, m is the number of anchors in the pair of collinear blocks, L1 and L2 are the respective lengths of the two chromosomal regions, and l1i and l2i are the distances (in nucleotides) between two adjacent anchors in the pair of collinear blocks. The default E-value cutoff of MCScanX is 10^-5.
Multiple chromosomal regions threaded by consecutive ancestral loci are progressively aligned against reference chromosomes, where each genome being tested is used as a reference successively, according to the following procedure: (i) any reference chromosome is scanned from start to end, and empty tracks are placed alongside the reference chromosome to hold potential aligned collinear blocks; (ii) collinear blocks are progressively aligned against reference chromosomes pinpointed by anchors and assigned to the nearest empty tracks (once a track region is filled, it cannot be assigned collinear blocks again). In aligned collinear blocks, only symbols of anchor genes are shown while un-matched positions (gaps) between anchors (regardless of numbers of intervening genes) are denoted by ‘||’; (iii) at each locus of reference chromosomes, the number of tracks occupied by collinear blocks is recorded to reflect the duplication depth.
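The track-filling idea in steps (i) to (iii) can be illustrated with a small sketch. It assumes that each collinear block has already been reduced to an inclusive interval of gene ranks on the reference chromosome and that blocks are placed in the order given. The real program works from anchors, and `place_blocks` is a name invented for this example.

```python
def place_blocks(ref_length, blocks):
    """Assign blocks to tracks and count duplication depth per reference locus.

    ref_length: number of genes on the reference chromosome.
    blocks: (start, end) inclusive, 0-based gene-rank intervals on the reference.
    Returns (track index per block, depth per locus).
    """
    tracks = []        # each track is a list of booleans, True where occupied
    assignment = []
    for start, end in blocks:
        # Use the nearest (lowest-numbered) track that is free over the span.
        for index, track in enumerate(tracks):
            if not any(track[start:end + 1]):
                break
        else:
            track = [False] * ref_length
            tracks.append(track)
            index = len(tracks) - 1
        for pos in range(start, end + 1):
            track[pos] = True  # a filled region cannot be assigned again
        assignment.append(index)
    # Duplication depth: number of occupied tracks at each reference locus.
    depth = [sum(track[pos] for track in tracks) for pos in range(ref_length)]
    return assignment, depth
```

For instance, `place_blocks(10, [(0, 4), (3, 7), (6, 9), (0, 2)])` puts the second block on a new track because it overlaps the first, while the third block fits back onto the first track.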
Classification of duplicate gene origins
Genes within a single genome can be classified as singletons, dispersed duplicates, proximal duplicates, tandem duplicates and segmental/WGD duplicates depending on their copy number and genomic distribution. The following procedure is used to assign gene classes: (i) All genes are initially classified as ‘singletons’ and assigned gene ranks according to their order of appearance along chromosomes; (ii) BLASTP results are evaluated and the genes with BLASTP hits to other genes are re-labeled as ‘dispersed duplicates’; (iii) In any BLASTP hit, the two genes are re-labeled as ‘proximal duplicates’ if they have a difference of gene rank<20 (configurable); (iv) In any BLASTP hit, the two genes are re-labeled as ‘tandem duplicates’ if they have a difference of gene rank = 1; (v) MCScanX is executed. The anchor genes in collinear blocks are re-labeled as ‘WGD/segmental’. So, if a gene appears in multiple BLASTP hits, it will be assigned a unique class according to the order of priority: WGD/segmental>tandem>proximal>dispersed.
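The labelling procedure and its priority order can be sketched as follows. This is a simplified illustration, not the code of the package. It assumes that gene ranks are compared only between genes on the same chromosome and that the set of anchor genes has already been obtained from a collinearity run. The names `classify` and `PRIORITY` are made up for the example.

```python
PRIORITY = {
    "singleton": 0,
    "dispersed": 1,
    "proximal": 2,
    "tandem": 3,
    "WGD/segmental": 4,
}


def classify(genes, hits, anchors, proximal_window=20):
    """Label each gene with its highest-priority duplication class.

    genes: {gene_id: (chromosome, rank)}
    hits: iterable of (gene_a, gene_b) BLASTP hit pairs within one genome
    anchors: gene IDs that are anchors in collinear blocks
    """
    label = {gene: "singleton" for gene in genes}

    def promote(gene, new_label):
        # Only ever move a gene up the priority order.
        if PRIORITY[new_label] > PRIORITY[label[gene]]:
            label[gene] = new_label

    for a, b in hits:
        if a == b or a not in genes or b not in genes:
            continue
        (chrom_a, rank_a), (chrom_b, rank_b) = genes[a], genes[b]
        new_label = "dispersed"
        if chrom_a == chrom_b:
            distance = abs(rank_a - rank_b)
            if distance == 1:
                new_label = "tandem"
            elif distance < proximal_window:
                new_label = "proximal"
        promote(a, new_label)
        promote(b, new_label)

    for gene in anchors:
        if gene in label:
            label[gene] = "WGD/segmental"
    return label
```

Because `promote` never lowers a label, the order in which hits are read does not change the result.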
Detection of orthologous gene pairs using OrthoMCL
Whole-genome protein sequences from Arabidopsis, Populus, Vitis, Glycine, Oryza, Brachypodium, Sorghum and Zea were merged and searched against themselves for homology using BLASTP with an E-value cutoff of 10^-5. Default parameters of OrthoMCL (50) were used. The combination of OrthoMCL intermediate files ‘orthologs.txt’ and ‘coorthologs.txt’ (generated by orthomclDumpPairsFiles) was used as the whole set of ortholog pairs.
Enrichment analysis
Enrichment analysis is performed using Fisher’s exact test. The P-value is calculated for the null hypothesis that there is no association between membership of a gene family and a particular gene duplication mode, and is corrected for multiple comparisons by the total number of duplication modes (i.e. Bonferroni correction). A P-value cutoff of 0.05 is used to suggest putative enrichment of certain gene duplication modes.
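A minimal sketch of such a test, using only the Python standard library, is given below. It assumes a one-sided test for enrichment (the upper tail of the hypergeometric distribution); the text does not state whether the test is one- or two-sided. The function names are made up for the example.

```python
from math import comb


def enrichment_p(family_in_mode, family_size, genome_in_mode, genome_size):
    """One-sided Fisher's exact P-value for enrichment of a duplication mode.

    Probability of seeing at least `family_in_mode` genes of the given mode
    in a family of `family_size` genes, when `genome_in_mode` of the
    `genome_size` genes in the genome belong to that mode.
    """
    upper = min(family_size, genome_in_mode)
    total = comb(genome_size, family_size)
    tail = sum(
        comb(genome_in_mode, k)
        * comb(genome_size - genome_in_mode, family_size - k)
        for k in range(family_in_mode, upper + 1)
    )
    return tail / total


def bonferroni(p_value, n_modes):
    """Bonferroni correction: multiply by the number of modes, cap at 1."""
    return min(1.0, p_value * n_modes)
```

A family would then be flagged as putatively enriched for a mode when `bonferroni(enrichment_p(...), n_modes)` falls below 0.05.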
Computing Ka and Ks
Non-synonymous (Ka) and synonymous (Ks) substitution rates are estimated by Nei-Gojobori statistics (51), available through the ‘Bio::Align::DNAStatistics’ module of the BioPerl package (http://www.bioperl.org/wiki/Module:Bio::Align::DNAStatistics). Note that the ‘Bio::Align::DNAStatistics’ module may generate invalid Ka or Ks (i.e. non-digital output) for some homologous gene pairs due to mis-alignments.
Gene family examples
Lists of published Arabidopsis gene families were obtained from TAIR (http://www.arabidopsis.org/browse/genefamily/index.jsp). Only families with more than nine genes were considered in order to have enough statistical power to detect enrichment of duplication modes. Arabidopsis disease resistance gene homologs were downloaded from the NIBLRRS Project website (http://niblrrs.ucdavis.edu).
Execution of the MCScanX package
MCScanX is freely available at http://chibba.pgml.uga.edu/mcscan2. All programs in the MCScanX package should be executed using command line arguments on Mac OS or Linux systems. On Mac OS, Xcode (http://developer.apple.com/xcode/) should be installed prior to the installation of MCScanX package. On Linux systems, the Java SE Development Kit (JDK) and ‘libpng’ should be installed before the installation of MCScanX package. To list available command line options, the user can simply type the name of a program without any options.
Results
Structure of the MCScanX package
The MCScanX package consists of two main components: (i) three core programs that implement an adjusted MCScan algorithm to generate pairwise and multiple alignments of collinear blocks and (ii) 12 downstream analysis programs for displaying and analyzing identified synteny and collinearity output by the core programs. The structure of the MCScanX package is shown in Figure 1. Compared with the previous version (0.8) of MCScan, there are numerous improvements in MCScanX. First, preprocessing of BLASTP input has been pipelined into the execution of core programs. Next, in MCScan, each gene was assigned a family ID to identify tandem genes, where the family ID has to be pre-computed using the Markov Clustering Algorithm (MCL) software (52). In MCScanX, tandem genes are assessed by gene rank according to chromosomal positions and thus, execution of MCL is no longer required. The aforementioned two improvements have made the installation and execution of MCScanX easier and more efficient. Furthermore, multi-alignments of collinear blocks, which are output as HTML files in MCScanX, can be easily and clearly viewed. In addition, numerous visualization and downstream analysis tools are incorporated into the MCScanX package, greatly enhancing the biological applications of the MCScan algorithm. In the following, we describe in detail each program in the MCScanX package.
The first core program, named MCScanX, can generate both pair-wise and multiple alignments of collinear blocks, similar to the previous MCScan version (0.8). However, MCScanX takes only a simplified GFF format file and a BLASTP tabular file as inputs. The simplified GFF file should contain the gene locations (which include chromosome, gene symbol, start and end) for the genomes to be compared. The BLASTP input file is one BLASTP output or combined multiple BLASTP outputs in tabular format (option ‘-m8’ in BLAST and ‘-outfmt 6’ in BLAST+) for all protein sequences in the species of interest. Note that when MCScanX is applied to multiple species, it may be useful to guard against over-enrichment of gene pairs from closely related species and we recommend that the BLASTP input file include the combined BLASTP outputs of pairwise genome comparisons and self-genome comparisons with a cutoff of best hits instead of a single BLASTP output of pooled protein sequences from different species. Alternatively, the BLASTP input can be replaced by a tab-delimited file containing pair-wise homologous relationships detected by third party software. In this case, the user needs to implement MCScanX_h (the second core program). In addition, MCScanX_h can generate statistics on numbers of collinear homolog pairs and their percentages (relative to the numbers of input homolog pairs).
We also adopted an adjusted MCScan algorithm. Matches among genes are first sorted according to chromosomal positions for all possible pairs of chromosomes and scaffolds, and in both transcriptional directions. Adjacent collinear genes are chained using dynamic programming (see ‘Materials and Methods’ section), outputting pairwise collinear blocks and tandem gene pairs to ‘.collinearity’ and ‘.tandem’ files respectively. Note that during the chaining of collinear genes, distances between genes are calculated in terms of differences in gene ranks. Use of differences in gene ranks provides relative gene distances, which can mitigate the effects of different gene densities (per unit physical DNA) among species on collinearity detection. Next, multiple chromosomal regions threaded by consecutive anchor loci are progressively aligned against ‘reference’ genomes. Because there could be many intervening/non-anchor genes between consecutive anchor genes, especially for divergent genomes, the alignment of non-anchor genes is highly flexible and could clutter the view of results. Thus, in MCScanX, the alignment among non-anchor genes is discarded in the output and non-anchor genes (mismatches) are simply denoted by ‘||’ in the multi-alignment of gene orders. As a result, the layout of multiple alignments is less affected by alignment parameters and anchor genes and duplication depths can be easily discerned in the resulting multiple alignments.
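To make the notion of gene rank concrete, the sketch below derives ranks from a simplified gene-position file. It assumes tab-delimited lines with four columns in the order chromosome, gene, start, end, following the description in the text. The exact column order expected by the program should be checked against its documentation. `gene_ranks` is a name invented for the example.

```python
from collections import defaultdict


def gene_ranks(gff_lines):
    """Map gene -> (chromosome, rank) from simplified gene-position lines.

    Each line is assumed to be 'chromosome<TAB>gene<TAB>start<TAB>end'.
    Ranks are 0-based and follow the order of genes along each chromosome.
    """
    by_chrom = defaultdict(list)
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, gene, start, end = line.split("\t")[:4]
        # Use the leftmost coordinate so that strand does not affect order.
        by_chrom[chrom].append((min(int(start), int(end)), gene))

    ranks = {}
    for chrom, genes in by_chrom.items():
        for rank, (_position, gene) in enumerate(sorted(genes)):
            ranks[gene] = (chrom, rank)
    return ranks
```

Two genes that are 10 ranks apart count as equally distant whether they sit in a gene-dense or a gene-poor region, which is the property the paragraph above relies on.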
The results of MCScanX multiple alignments are presented in HTML format with variously colored features that can be displayed using a web browser. An example is shown in Figure 2. In a reference chromosome, both anchor and non-anchor genes are shown, while in aligned collinear blocks only anchor genes are shown. Along the reference chromosome, duplication depth (i.e. number of aligned collinear blocks) is shown at each locus to indicate how frequently chromosomal regions are duplicated, and tandem genes are highlighted in red. In principle, all aligned collinear blocks can be also references. Note that in certain cases, in a specific alignment (e.g. A–B–C), an anchor locus is lost in the reference chromosome (A) and in turn cannot be shown in aligned collinear blocks (B and C) due to the non-reciprocity of the employed algorithm. To study differential gene loss, the user is suggested to analyze the results using the gene or genome of interest as the reference (i.e. the alignments B–A–C and C–A–B can show that the anchor locus exists between B and C but is lost in A) to ensure that complete chromosomal neighborhoods and matching segments are observed.
The third core program, named duplicate_gene_classifier, can classify the duplicate genes of a single species into WGD/segmental, tandem, proximal and dispersed duplicates. WGD/segmental duplicates are inferred from the anchor genes in collinear blocks. Tandem duplicates are defined as paralogs that are adjacent to each other on chromosomes, which are suggested to arise from illegitimate chromosomal recombination (27). Proximal duplicates are paralogs near each other, but interrupted by several other genes (e.g. separated by fewer than 20 genes, configurable). Proximal duplicates are inferred to result from localized transposon activities (53), or from ancient tandem arrays interrupted by more recent gene insertions. Dispersed duplicates are paralogs that are neither near each other on chromosomes, nor do they show conserved synteny (54). Distant single gene translocations mediated by transposons may explain the wide spread of dispersed duplicates (27), often via pack-MULEs (55), helitrons (56), or CACTA elements (37) in plant genomes, or through ‘retropositions’ (57). Inferences about the mechanism(s) responsible for duplication of genes may reveal unusual evolutionary characteristics for particular lineages. The duplicate_gene_classifier program, which incorporates the MCScanX procedure, takes in the same input files as MCScanX, and returns statistics of duplicate gene origins and a file showing the likely origin of each gene.
Once the outputs of the core programs are generated, various visualization and downstream analysis tools can be applied. To display synteny and collinearity, four types of plots can be generated: dual synteny plot (Figure 3A), circle plot (Figure 3B), dot plot (Figure 3C) and bar plot (Figure 3D), using the Java programs dual_synteny_plotter, circle_plotter, dot_plotter and bar_plotter, respectively. The ‘.collinearity’ file generated by MCScanX can be annotated with non-synonymous (Ka) and synonymous (Ks) substitution rates using the Perl program add_ka_and_ks_to_collinearity.pl. Gene families constructed based on collinear relationships (instead of BLAST hits) can be generated from the ‘.collinearity’ file using the Perl program group_collinear_genes. It may be interesting to see how frequently chromosomal regions are duplicated within or across species for understanding species-specific or shared evolutionary events, and the program dissect_multiple_alignment can compute the number of intra- and inter-species collinear blocks at each locus of reference genomes and show statistics on gene numbers at different duplication depths. To avoid high numbers of local collinear gene pairs generated by MCScanX due to tandem arrays, tandem matches are collapsed using a representative pair with the smallest BLASTP E-value during MCScanX execution. However, a tandem array at an ancestral locus may imply positional gene family expansion (58). Thus, a tool named detect_collinear_tandem_arrays is provided for detection of collinear tandem arrays.
(Blogger's note: this is also an opportunity for follow-up publications built around such integrated packages. For example, during my master's studies I came across a website dedicated to visualizing MCScanX output. It fully reproduced the plots in MCScanX, could export vector graphics, and was published as an SCI paper: https://www.hindawi.com/journals/bmri/2016/7823429/. The site itself, http://bio.njfu.edu.cn/vgsc-web/service, later became unreachable.)
The MCScanX package provides a variety of tools for analyzing gene family evolution based on the synteny and collinearity identified by MCScanX. The program origin_enrichment_analysis can detect potential enrichment of duplicate gene origins for gene families, based on the classification of whole-genome duplicate genes (the output of duplicate_gene_classifier). The program detect_collinearity_within_gene_families outputs all collinear gene pairs among gene family members. The program family_circle_plotter can detect all collinear gene pairs within a gene family and plot them using a genomic circle. The program family_tree_plotter, with a Newick-format tree (direct results from most phylogenetic software) and ‘.collinearity’ and ‘.tandem’ files (generated by MCScanX) as inputs, can graphically annotate a phylogenetic tree with collinear and tandem relationships.
(Blogger's note: the genomic circle plot is also a common form of visualization in gene family papers. Unfortunately, as far as I know, very few gene family papers seem to use the plotting programs of MCScanX itself; they use other visualization tools instead. I imagine the developers of the toolkit are not entirely pleased to see this!)
Estimation of the number of WGD events.
MCScan version 0.8 was used to estimate the number of WGD events of Arabidopsis, Carica, Populus and Vitis, through analysis of the duplication depths of their collinear blocks using Vitis as the reference genome (19,23). To facilitate this analysis using the output of MCScanX, the tool dissect_multiple_alignment is provided. When the user applies the MCScanX package, the BLASTP and GFF inputs should be restricted to a single genome for self-genome comparison or to two genomes for cross-genome comparison. Alternatively, the BLASTP outputs of a self-genome comparison and a cross-genome comparison may be merged to cover both comparisons. However, self-genome comparison may not be as sensitive as cross-genome comparison due to the differential loss of functionally redundant genes, sometimes in a complementary fashion (19). Although the determination of an exact number of WGD events may be heuristic, the output of dissect_multiple_alignment can give a reasonable estimate. Note that a duplication depth x indicates that there are x and x+1 aligned collinear blocks in the target genome using cross-genome and self-genome comparisons, respectively. For example, dissect_multiple_alignment was applied to both self-genome and cross-genome comparisons between Arabidopsis and Vitis. Using Arabidopsis and Vitis as references, the maximum duplication depths of Arabidopsis collinear blocks are 7 (self-genome comparison, so the maximum number of aligned Arabidopsis collinear blocks is 8) and 11 (cross-genome comparison, so the maximum number of aligned Arabidopsis collinear blocks is 11), respectively, suggesting that the lineage experienced at least three WGD events to achieve this duplication depth, i.e. a triplication WGD event (γ) × two duplication WGD events (β) and (α) (6,19,23). By applying dissect_multiple_alignment to the self-genome comparison of Vitis, the maximum duplication depth of Vitis collinear blocks is 4. However, the gene numbers at levels 3 and 4 (297 and 6, respectively) are much smaller than at level 2 (6993). A whole-genome triplication (WGT) plus small-scale chromosomal duplications is the simplest explanation for this duplication pattern (19,23). Note that analysis of duplication depths of collinear blocks can generate good estimates of relatively recent WGD events. Very ancient WGD events often do not result in discernible collinear blocks in extant species due to extensive chromosome rearrangement, loss or gain of chromosomal segments, loss or transposition of duplicate genes, horizontal gene transfers, etc. A recent study, through analyzing the phylogenetic trees of cross-species gene families, reported two ancestral WGD events, one for seed plants and one for angiosperms (59).
(Blogger's note: Vitis is used as the reference genome because its karyotype is the closest to the ancestral one.)
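The duplication-depth idea can be illustrated with a toy calculation: for each gene position on the reference, count how many collinear blocks cover it, then tabulate how many genes sit at each depth. This is a minimal sketch under the assumption that each block is reduced to an inclusive interval of gene ranks on the reference. The function names and the example intervals are invented, and the real tool's input and output formats are not reproduced here.

```python
from collections import Counter


def duplication_depths(n_genes, blocks):
    """Depth at each reference gene rank (0-based).

    blocks: list of (start_rank, end_rank), inclusive, on the reference.
    Uses a difference array so each block costs O(1) to register.
    """
    diff = [0] * (n_genes + 1)
    for start, end in blocks:
        diff[start] += 1
        diff[end + 1] -= 1
    depths, running = [], 0
    for rank in range(n_genes):
        running += diff[rank]
        depths.append(running)
    return depths


def genes_per_depth(depths):
    """Number of reference genes observed at each duplication depth."""
    return dict(sorted(Counter(depths).items()))


def aligned_block_count(depth, self_comparison):
    """Depth x means x+1 aligned blocks for self-genome comparison
    (the locus itself plus x copies) and x blocks for cross-genome."""
    return depth + 1 if self_comparison else depth


# Toy reference with 10 genes and three overlapping blocks.
depths = duplication_depths(10, [(0, 4), (2, 6), (3, 3)])
print(genes_per_depth(depths))
print(aligned_block_count(7, self_comparison=True),
      aligned_block_count(11, self_comparison=False))
```

The last line mirrors the Arabidopsis figures quoted above: depth 7 in the self-genome comparison corresponds to 8 aligned blocks, and depth 11 in the cross-genome comparison corresponds to 11.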
Detection of collinear orthologs.
Detection of collinear orthologs is important for understanding gene evolution. The comparison between collinear orthologs and all orthologs can reveal how well gene orders are conserved (or conversely, how frequently chromosomes are rearranged) between species. Limited only by the state of a genome’s annotation and the assumption that sufficient sequence similarity is present for detection, a complete set of orthologs for a set of species can be generated by third-party software such as OrthoMCL (50). We ran OrthoMCL to find ortholog pairs among Arabidopsis, Populus, Vitis, Glycine, Oryza, Brachypodium, Sorghum and Zea. The ortholog pairs identified by OrthoMCL were regarded as the whole set of orthologs, and were then used as the input of MCScanX_h. Besides standard MCScanX output, MCScanX_h generated statistics on the numbers of collinear ortholog pairs and all ortholog pairs, and the percentages of collinear ortholog pairs between any two of the selected angiosperm genomes (Table 1). As expected, gene order is better conserved within monocots and within eudicots than between monocots and eudicots. Within eudicots, Vitis shows the highest level of collinearity with the other three species, suggesting that Vitis most closely resembles the gene order of the eudicot ancestral genome, due in part to the lack of recent WGDs (60).
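The statistic behind Table 1 is a simple ratio: of all ortholog pairs between two genomes, what fraction are also collinear? A minimal sketch follows, assuming both sets are available as plain gene-name pairs. The function name and gene names are invented, and pair order is ignored.

```python
def collinear_ortholog_stats(ortholog_pairs, collinear_pairs):
    """Return (n_collinear_orthologs, n_orthologs, percent_collinear)."""
    orthologs = {frozenset(pair) for pair in ortholog_pairs}
    # Only collinear pairs that are also in the ortholog set count.
    collinear = {frozenset(pair) for pair in collinear_pairs} & orthologs
    percent = 100.0 * len(collinear) / len(orthologs) if orthologs else 0.0
    return len(collinear), len(orthologs), percent


orthologs = [("a1", "v1"), ("a2", "v2"), ("a3", "v3"), ("a4", "v4")]
collinear = [("v1", "a1"), ("a2", "v2"), ("a9", "v9")]
print(collinear_ortholog_stats(orthologs, collinear))
```

A low percentage for a species pair points to frequent rearrangement since their divergence, which is how the monocot versus eudicot contrast in the text is read.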
Differences in duplicate gene origins among angiosperms.
Using self-genome BLASTP outputs and the tool duplicate_gene_classifier, we classified the origins of duplicate genes for Arabidopsis, Populus, Vitis, Glycine, Oryza, Brachypodium, Sorghum and Zea, respectively. The results are shown in Table 2. The collinear blocks in the self-genome comparisons result from segmental or whole-genome duplications. Most collinear blocks within these flowering plant genomes were derived from WGDs, as indicated by their high coverage throughout the genome and by supporting Ks evidence (19).
WGDs have had different impacts on the gene repertoires of the investigated taxa. Strikingly, ~76.0% of Glycine genes were duplicated and retained from WGD events, versus only 14.5% of Oryza genes. The proportions of genes involved in WGD events may reflect the relative timing of the most recent WGD event, as well as the level of gene retention following the WGD. For example, Vitis, with only 15.0% of genes created by WGD (actually WGT), was inferred to have undergone the γ WGT event, which likely predated the divergence of most eudicots >100 million years ago (19,23). Other eudicot lineages have experienced lineage-specific WGDs in addition to the shared γ event. Twenty-seven percent of Arabidopsis genes appear to have been created through WGD; the lineage has experienced the α and β WGD events since its divergence from other members of the Brassicales clade (6,23). Populus, with 51.6% of genes created by WGD, was inferred to have undergone an additional WGD event in the Salicoid lineage (23). Glycine, with the highest proportion of WGD genes, was reported to have experienced two additional WGD events, with the most recent occurring 13 million years ago (61). A total of 29.2% of Zea genes were created through WGD; Zea experienced a lineage-specific WGD after its divergence from Sorghum (15.2% of genes created by WGD) (62,63). Although tandem genes are volatile after gene duplication, those retained may indicate functional significance. We find that tandem genes account for about 1–3% of genes in each genome, smaller than the ~10% reported by Rizzon et al. (64). This difference is due to the algorithm of duplicate_gene_classifier, which treats the tandem duplicates located at ancestral loci as WGD duplicates. Proximal duplicates account for larger proportions of genes in the genomes with fewer WGD duplicates, e.g. 5.4% of Oryza genes and 6.7% of Vitis genes were created by proximal duplications, while in other genomes the numbers of proximal duplicates are comparable to those of tandem duplicates.
(Blogger's notes: just as the authors say, the Zea and Sorghum figures are explained by the most recent WGD event. Also, because duplicate_gene_classifier counts tandem duplicates at ancestral loci as WGD duplicates, I suspect that for a study focused on tandem duplication, results from a tool other than MCScanX may serve you better.)
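The remark that tandem duplicates at ancestral loci are counted as WGD duplicates follows from a priority ordering among the origin labels. The sketch below is one way to express such an ordering and is not the tool's actual code. The label names, the `proximal_window` parameter and its default of 20 genes, and the rule that a gene keeps the highest-priority label it ever receives are all assumptions made for illustration.

```python
PRIORITY = {"singleton": 0, "dispersed": 1, "proximal": 2,
            "tandem": 3, "WGD/segmental": 4}


def classify_duplicates(ranks, blast_pairs, collinear_pairs,
                        proximal_window=20):
    """Assign one origin label per gene.

    ranks: gene -> (chromosome, rank along the chromosome).
    blast_pairs: homologous gene pairs from a self-genome BLASTP.
    collinear_pairs: anchor pairs of collinear blocks.
    proximal_window: hypothetical cutoff on rank difference.
    """
    label = {gene: "singleton" for gene in ranks}

    def promote(gene, new_label):
        # A gene only ever moves up the priority ladder.
        if gene in label and PRIORITY[new_label] > PRIORITY[label[gene]]:
            label[gene] = new_label

    for a, b in blast_pairs:
        if a == b or a not in ranks or b not in ranks:
            continue
        promote(a, "dispersed")
        promote(b, "dispersed")
        (chrom_a, rank_a), (chrom_b, rank_b) = ranks[a], ranks[b]
        if chrom_a == chrom_b:
            distance = abs(rank_a - rank_b)
            if distance == 1:
                promote(a, "tandem")
                promote(b, "tandem")
            elif distance < proximal_window:
                promote(a, "proximal")
                promote(b, "proximal")

    for a, b in collinear_pairs:
        promote(a, "WGD/segmental")
        promote(b, "WGD/segmental")
    return label


ranks = {"g1": ("chr1", 1), "g2": ("chr1", 2), "g3": ("chr1", 10),
         "g4": ("chr2", 5), "g5": ("chr2", 50), "g6": ("chr1", 300),
         "g7": ("chr1", 11)}
labels = classify_duplicates(
    ranks,
    blast_pairs=[("g1", "g2"), ("g3", "g1"), ("g4", "g5")],
    collinear_pairs=[("g5", "g6")],
)
print(labels)
```

Because "WGD/segmental" sits at the top of the ladder, a tandem gene that is also a collinear anchor is reported as a WGD duplicate, which is the kind of effect that would lower the tandem percentage relative to other studies.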
Detection of collinear tandem arrays.
In the MCScanX package, tandem arrays are defined as clusters of consecutive tandem duplicates. In detect_collinear_tandem_arrays, tandem arrays are first determined according to successive gene ranks in all chromosomes. Collinear gene pairs are then searched against these tandem arrays. If any gene of a collinear pair is located within a tandem array, the gene is replaced by the tandem array and then reported. If a tandem array is located at an anchor locus of a collinear block, it is termed a collinear tandem array. Collinear tandem arrays can indicate positional gene family expansions (58), which could be important for forming large gene families, or serve as an alternative path to increasing gene copy number in genomes that experienced fewer WGD events. For example, we applied detect_collinear_tandem_arrays to a comparison of the Arabidopsis and Vitis genomes. A total of 1160 pairs of collinear tandem arrays were detected between Arabidopsis and Vitis, of which only 68 (5.9%) pairs have equal numbers of tandem duplicates in each species, while 54.3% of pairs have more tandem duplicates in Vitis than in Arabidopsis. In conjunction with the finding above that Vitis has more proximal duplicates than other species, we suggest that tandem and proximal duplications contributed relatively more to the expansion of the Vitis genome than to that of other eudicots that experienced more WGDs in their evolutionary histories.
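The two steps described above (cluster consecutive tandem duplicates, then replace any collinear anchor gene by its array) can be sketched as follows. This is a simplified illustration and not the tool's implementation: the data structures, function names and gene names are assumptions, and two genes are chained into one array only when they are a listed tandem pair at successive ranks on the same chromosome.

```python
def build_tandem_arrays(tandem_pairs, ranks):
    """Cluster consecutive tandem duplicates into arrays (tuples of genes).

    ranks maps gene -> (chromosome, rank).
    """
    pair_set = {frozenset(pair) for pair in tandem_pairs}
    genes = sorted({g for pair in tandem_pairs for g in pair},
                   key=lambda g: ranks[g])
    arrays = []
    for gene in genes:
        if arrays:
            prev = arrays[-1][-1]
            same_chrom = ranks[prev][0] == ranks[gene][0]
            adjacent = ranks[gene][1] - ranks[prev][1] == 1
            if same_chrom and adjacent and frozenset((prev, gene)) in pair_set:
                arrays[-1].append(gene)
                continue
        arrays.append([gene])
    return [tuple(array) for array in arrays if len(array) > 1]


def collinear_tandem_arrays(collinear_pairs, arrays):
    """Replace anchor genes by their tandem array and report each
    distinct (array_or_gene, array_or_gene) combination once."""
    owner = {gene: array for array in arrays for gene in array}
    seen, reported = set(), []
    for a, b in collinear_pairs:
        if a in owner or b in owner:
            record = (owner.get(a, (a,)), owner.get(b, (b,)))
            if record not in seen:
                seen.add(record)
                reported.append(record)
    return reported


ranks = {"At1": ("At_chr1", 1), "At2": ("At_chr1", 2),
         "Vv1": ("Vv_chr1", 1), "Vv2": ("Vv_chr1", 2), "Vv3": ("Vv_chr1", 3)}
arrays = build_tandem_arrays(
    [("At1", "At2"), ("Vv1", "Vv2"), ("Vv2", "Vv3")], ranks)
pairs = collinear_tandem_arrays([("At1", "Vv2"), ("At2", "Vv3")], arrays)
print(pairs)
```

Comparing the lengths of the two arrays in each reported record gives the kind of tally quoted in the text, such as how many pairs have more tandem duplicates in one species than in the other.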
Analysis of gene family evolution.
While MCScanX can detect synteny and collinearity using whole-genome homology and gene positional information, it is also of interest to analyze collinearity within a gene family, toward clarifying gene family evolution (65). We used the Arabidopsis MADS-box gene family as an example to illustrate the usefulness of MCScanX for analyzing the history of gene family expansion. Using the tool detect_collinearity_within_gene_families, we detected 14 collinear gene pairs among the members of the MADS-box gene family. The inferred collinear relationships of the MADS-box gene family members can be displayed and placed within the context of whole-genome collinearity using a genomic circle generated by family_circle_plotter (Figure 4). Next, a phylogenetic tree was constructed for the MADS-box gene family using the PhyML package (66). The Newick tree was then used as the input of family_tree_plotter. A plot that showed the phylogenetic tree together with the collinear and tandem relationships for the MADS-box gene family was generated (Figure 5). The overlay of positional history over the gene clades reveals interesting characteristics of the MADS-box gene family. We note that the clade with many collinear relationships (WGD or segmentally duplicated) appears to be the MIKCc-type (67). In contrast, the remaining clades of MADS-box genes appear to favor dispersed duplications (27,68).
The tool origin_enrichment_analysis, which detects potential enrichments of duplicate gene origins, was applied to 126 published Arabidopsis gene families of 10 or more genes, available at TAIR (http://www.arabidopsis.org/). We found that 46 (36.5%) gene families were enriched for at least one of the four types of origins at α = 0.05. For example, disease resistance gene homologs and the cytochrome P450 gene family are enriched for dispersed and proximal duplicates, while the cytoplasmic ribosomal protein gene family and C2H2 zinc finger proteins are enriched for WGD duplicates, as previously noted (68).
(Blogger's note: the cutoff of 10 or more genes is what the authors described earlier. To keep the tests statistically meaningful, only gene families with more than 9 members are analyzed.)
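The passage does not spell out which statistic underlies the enrichment call, so the sketch below uses one common choice as an assumption: a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test) asking whether a family contains more genes of a given origin than expected from the genome-wide proportion. The function name and all counts in the example are invented.

```python
from math import comb


def hypergeom_upper_tail(k, family_size, origin_total, genome_total):
    """P(X >= k) for X ~ Hypergeometric(genome_total, origin_total, family_size).

    k: family genes with the origin of interest.
    family_size: genes in the family.
    origin_total: genes in the genome with that origin.
    genome_total: all classified genes in the genome.
    """
    denominator = comb(genome_total, family_size)
    upper = min(family_size, origin_total)
    numerator = sum(
        comb(origin_total, i) * comb(genome_total - origin_total, family_size - i)
        for i in range(k, upper + 1)
    )
    return numerator / denominator


# Hypothetical family: 12 of its 15 genes are WGD duplicates, in a genome
# where 7000 of 27000 genes carry that label.
p_value = hypergeom_upper_tail(12, 15, 7000, 27000)
print(p_value, p_value < 0.05)
```

With 126 families and several origin types tested, a real analysis would also need to decide how to handle multiple testing. The text only states the threshold α = 0.05, so that choice is left open here.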
Comparison with other synteny and collinearity tools
Existing tools for synteny and collinearity detection mainly include ADHoRe (12), TEAM (13), LineUp (14), MCMuSeC (15), OrthoCluster (16), DiagHunter (69), DAGChainer (17), ColinearScan (18), MCScan (19), SyMAP (20), FISH (21), Cyntenator (22), MicroSyn (28) and Cinteny (70), of which OrthoCluster, ADHoRe and SyMAP have been upgraded to OrthoClusterDB (71), i-ADHoRe 3 (72) and SyMAP 3.4 (73), respectively. We summarized the functions of synteny and collinearity detection tools with regard to five elements: graphic visualization, operation on multiple (>2) genomes, multi-alignments, evolutionary analyses of synteny and collinearity (e.g. estimating WGD events, gene-order conservation and duplicate gene origins, constructing collinear gene groups/families, etc.) and analyses of gene families. The functional comparison of the different tools is shown in Table 3. If there were multiple versions of a tool, we used the latest one for comparison. Seven tools output synteny or collinearity information as plain text, while the other tools provide graphic visualization options, though the types and numbers of plots vary among tools. As for the data scale, most tools published in the past 4 years can operate on multiple genomes. Five tools can perform multi-alignments of collinear blocks. MicroSyn is focused on collinearity analysis within gene families. i-ADHoRe 3 provides several post-processing programs for dissecting multi-alignments of collinear blocks, in addition to detecting and visualizing synteny and collinearity. Among these synteny and collinearity detection tools, 11 tools cover no more than two functions, and OrthoClusterDB, MicroSyn and i-ADHoRe 3 cover three functions. MCScanX, with all five functions, can perform more biological analyses than any other synteny or collinearity detection tool.
MCScanX is unique in providing multiple programs for evolutionary analysis of synteny and collinearity, which are a necessary step towards biological discovery. Further, MCScanX has connected collinearity analyses between whole-genome and gene family scales. To our knowledge, the following biological analyses implemented in MCScanX are not yet available in other synteny and collinearity detection tools: constructing gene families using collinearity information, inferring gene duplication modes and enrichments, detecting collinear tandem arrays, performing statistical analyses of duplication depths and collinear orthologs and annotating phylogenetic trees with collinearity and tandems.
For synteny and collinearity detection tools, effective identification of collinear gene pairs is the basis for collinear block construction and downstream analyses. It is therefore informative to evaluate quantitatively how well MCScanX identifies collinear gene pairs. Two widely used tools, MCScan and i-ADHoRe 3, were chosen as competitors. Since a benchmark for assessing synteny and collinearity tools has not been established (72), we compared their performances by applying them to the Arabidopsis thaliana genome. Note that a higher number of detected collinear gene pairs does not simply indicate better performance, as true and false positives must be simultaneously considered and well balanced (69). A total of 5794 collinear gene pairs (i.e. WGD duplicate gene pairs) in the Arabidopsis genome, including 3822 α, 1451 β and 521 γ pairs profiled using an integrated phylogenomic approach by Bowers et al. (6), were regarded as the whole set of collinear gene pairs. The performances of MCScan, MCScanX and i-ADHoRe 3 were evaluated by power (i.e. sensitivity), defined as the ratio between the numbers of true positives and all collinear gene pairs, and by precision, defined as the ratio between the numbers of true positives and all positives (i.e. true positives + false positives). When MCScan and MCScanX were compared, the same parameters were used. Based on the default parameters of MCScanX (match size = 5, max gaps = 25), MCScan and MCScanX identified 4134 and 4225 collinear gene pairs, of which 3375 and 3407 were true positives, respectively. Power was 0.58 and 0.59, and precision was 0.82 and 0.81 for MCScan and MCScanX, respectively. These statistics suggest that MCScan and MCScanX are generally comparable in detecting collinear gene pairs, while MCScanX has a slightly higher power and a slightly lower precision. Based on its default parameters, i-ADHoRe 3 identified 6233 non-overlapping collinear gene pairs, of which 3459 were true positives. Its power and precision were 0.60 and 0.55, respectively. However, a direct comparison between MCScanX and i-ADHoRe 3 using their respective default parameters was not reasonable because i-ADHoRe 3 output many more positives. We therefore executed MCScan and MCScanX using a more relaxed set of parameters (match size = 3, max gaps = 50), which output 5554 and 6110 positives, respectively. Based on the new parameters, power was 0.65 and 0.67, and precision was 0.68 and 0.64 for MCScan and MCScanX, respectively. The new statistics suggest that in terms of identification of collinear gene pairs, MCScan and MCScanX each perform better than i-ADHoRe 3 and remain comparable to one another, with MCScan having higher precision and MCScanX having higher power. The small difference between MCScan and MCScanX arises because, to make MCScanX easier and more efficient to run, pre-processing of the BLASTP input was pipelined into the execution of the main programs and the dependency on MCL was dropped. In MCScan, cross-family BLASTP hits are removed based on MCL output, while in MCScanX, all non-self BLASTP hits are considered, leading to an enlarged pool of BLASTP hits. MCL may generate 5–20% incorrect families and its performance is affected by the inflation value (a parameter of the MCL algorithm used to control the granularity/tightness of protein clusters) (52). So the cross-family BLASTP hits based on MCL gene families do contain some collinear gene pairs, though the proportion of collinear gene pairs is smaller in cross-family BLASTP hits than in within-family BLASTP hits. This results in marginally higher power and lower precision for MCScanX than for MCScan, though their performances in identifying collinear gene pairs are very similar. Since MCScan was successfully applied to the distantly related apicomplexans (11), we believe that MCScanX is also applicable to a wide range of organisms besides angiosperms.
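The power and precision figures for the default-parameter runs can be re-derived from the counts quoted in the paragraph above. The short check below only restates those definitions as arithmetic; the function and variable names are chosen for this example.

```python
def power_and_precision(true_positives, positives, reference_pairs):
    """power = TP / all reference collinear pairs; precision = TP / all positives."""
    return true_positives / reference_pairs, true_positives / positives


REFERENCE_PAIRS = 3822 + 1451 + 521  # alpha + beta + gamma pairs from Bowers et al.

# (true positives, all positives) with each tool's default-parameter run.
runs = {
    "MCScan": (3375, 4134),
    "MCScanX": (3407, 4225),
    "i-ADHoRe 3": (3459, 6233),
}

results = {}
for tool, (tp, pos) in runs.items():
    power, precision = power_and_precision(tp, pos, REFERENCE_PAIRS)
    results[tool] = (round(power, 2), round(precision, 2))
    print(f"{tool}: power={power:.2f} precision={precision:.2f}")
```

The rounded values reproduce the reported 0.58/0.82, 0.59/0.81 and 0.60/0.55, and the three pair classes do sum to the stated 5794 reference pairs.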
Discussion
Synteny and collinearity information is important for elucidating the evolutionary histories of both genomes and gene families. Although many synteny and collinearity tools are available, their output files are often difficult to read and downstream evolutionary analysis programs are rarely provided. For this reason, users often have to write additional programs or reformat the synteny and collinearity output files in order to use third-party evolutionary analysis tools. This incompleteness of functionality has reduced the usefulness of existing synteny and collinearity detection tools. A distinguishing feature of MCScanX is that diverse tools for evolutionary analyses of synteny and collinearity are incorporated, which enables rapid and convenient conversion of synteny and collinearity information into evolutionary insights. In addition, many biological analyses implemented in MCScanX are unique. MCScanX can be used to effectively analyze chromosome structural changes and evolution, annotate new genomes and reveal the history of gene family expansions.
(Blogger's note: regrettably, most people still seem to take only the output of the MCScanX algorithm and then write their own scripts or use other tools for the downstream analysis and visualization that the authors promote. To put it jokingly, the headline feature ended up shining for an empty room.)
In conclusion, MCScanX is a toolkit that implements an adjusted MCScan algorithm for detection of synteny and collinearity and incorporates 14 computer programs for visualizing and analyzing identified synteny and collinearity. The usefulness of the MCScanX toolkit has been demonstrated through a series of real data applications and comparison with other synteny and collinearity detection tools. MCScanX is freely available at http://chibba.pgml.uga.edu/mcscan2/.
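Summary: another rare, classic algorithm and software package, well done. The authors themselves acknowledge in the paper that MCScanX's precision is slightly lower than MCScan's, but its usability and efficiency are high and it ships with a very complete set of downstream analysis tools, which makes it a very nice collinearity analysis package. In short, one more algorithm toolkit paper worth learning from.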