Algorithm Literature Reading 10: Six Modes of Gene Duplication (Another Major Work from the MCScanX Authors)

Modes of Gene Duplication Contribute Differently to Genetic Novelty and Redundancy, but Show Parallels across Divergent Angiosperms （2011）


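PLoS ONE (3.240/Q2)

Chinese Academy of Sciences journal ranking: Tier 3 (a multidisciplinary journal; many people online call it a low-quality journal?) (Figure 1)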

Figure 1

Abstract
Background: Both single gene and whole genome duplications (WGD) have recurred in angiosperm evolution. However, the evolutionary effects of different modes of gene duplication, especially regarding their contributions to genetic novelty or redundancy, have been inadequately explored.


Results: In Arabidopsis thaliana and Oryza sativa (rice), species that deeply sample botanical diversity and for which expression data are available from a wide range of tissues and physiological conditions, we have compared expression divergence between genes duplicated by six different mechanisms (WGD, tandem, proximal, DNA based transposed, retrotransposed and dispersed), and between positional orthologs. Both neo-functionalization and genetic redundancy appear to contribute to retention of duplicate genes. Genes resulting from WGD and tandem duplications diverge slowest in both coding sequences and gene expression, and contribute most to genetic redundancy, while other duplication modes contribute more to evolutionary novelty. WGD duplicates may more frequently be retained due to dosage amplification, while inferred transposon mediated gene duplications tend to reduce gene expression levels. The extent of expression divergence between duplicates is discernibly related to duplication modes, different WGD events, amino acid divergence, and putatively neutral divergence (time), but the contribution of each factor is heterogeneous among duplication modes. Gene loss may retard inter-species expression divergence. Members of different gene families may have non-random patterns of origin that are similar in Arabidopsis and rice, suggesting the action of pan-taxon principles of molecular evolution.


Conclusion: Gene duplication modes differ in contribution to genetic novelty and redundancy, but show some parallels in taxa separated by hundreds of millions of years of evolution.


Introduction

Whole-genome duplications (WGDs) have occurred in the lineages of plants [1], animals [2,3] and fungi [4,5], with possible consequences including evolution of novel or modified gene functions [6,7,8,9], and/or provision of ‘‘buffer capacity’’ [10,11] or genetic redundancy that increases genetic robustness [12,13,14,15,16,17]. Genome duplication may also increase opportunities for nonreciprocal recombination [18,19,20], permitting or causing duplicated genes to evolve in concert for a period of time. Rapid DNA loss and restructuring of low-copy DNA [21,22,23,24], retrotransposon activation [25,26,27] and epigenetic changes [28,29,30,31,32,33] following WGD may further provide materials for evolutionary change.


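Background note, taken from Baidu Baike: https://baike.baidu.com/item/%E5%8F%8D%E8%BD%AC%E5%BD%95%E8%BD%AC%E5%BA%A7%E5%AD%90/3764001?fr=aladdin

Retrotransposon (also called retroposon): a mobile element that transposes through an RNA intermediate, which is reverse-transcribed into DNA before insertion. This transposition process is called retrotransposition.

Retrotransposition occurs in eukaryotes and involves both retroviruses, which can freely infect host cells, and DNA sequences that transpose through an RNA intermediate. Apart from retroviruses, retrotransposons fall into two classes. One is the viral superfamily: these elements encode reverse transcriptase or integrase and can transpose autonomously by a mechanism similar to that of retroviruses, but they cannot spread by independent infection as retroviruses do. The other is the nonviral superfamily: these elements do not themselves encode a transposase or integrase and transpose using enzyme systems already present in the cell. Both superfamilies derive from cellular transcripts. The clear difference between them is that members of the viral superfamily carry long terminal repeats (LTRs) at both ends of their DNA, a characteristic structure of retroviral DNA genomes, whereas members of the nonviral superfamily have no LTRs. Members of the viral superfamily also encode a transposase, an integrase, or both, and can therefore transpose autonomously; members of the nonviral superfamily produce no biologically active enzymes and cannot. All retrotransposons nevertheless share one feature: they generate short direct repeats at their insertion sites.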

Genes may be duplicated by several mechanisms in addition to WGDs, which have been collectively referred to as small scale duplications [34] or single gene duplications [35,36]. Tandem duplicates are consecutive in the genome while proximal duplicates are near one another but separated by a few genes. These two gene duplication modes are presumed to arise through unequal crossing over [36] or localized transposon activities [37]. Dispersed duplicates are neither adjacent to each other in the genome nor within homeologous chromosome segments [38]. Distant single gene transposition may explain the widespread existence of dispersed duplicates within and among genomes [36]. Distant single gene transposition duplication (referred to as distantly transposed duplication) may occur by DNA based or RNA based mechanisms [35]. DNA transposons such as packmules (rice) [39], helitrons (maize) [40], and CACTA elements (sorghum) [27] may relocate duplicated genes or gene segments to new chromosomal positions (referred to as DNA based transposed duplication). RNA based transposed duplication, often referred to as retrotransposition, typically creates a single-exon retrocopy from a multi-exon parental gene, by reverse transcription of a spliced messenger RNA. It is presumed that the retrocopy duplicates only the transcribed sequence of the parental gene, detached from the parental promoter. The new retrogene is often deposited in a novel chromosomal environment with new (i.e. non-ancestral) neighboring genes and, having lost its native promoter, is only likely to survive as a functional gene if a new promoter is acquired [41,42].

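The positional definitions above (tandem duplicates are consecutive, proximal duplicates are nearby but separated by a few genes) can be sketched as a small classifier. This is only an illustration: the `Gene` record, the function name and the `max_gap` cutoff are hypothetical, since the passage does not say how many genes count as "a few".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    rank: int  # 0-based order of the gene along its chromosome


def positional_mode(a: Gene, b: Gene, max_gap: int = 10) -> str:
    """Label a homologous gene pair using genomic position only.

    max_gap is a hypothetical cutoff on the rank difference; the text
    only says "a few genes" and gives no number.
    """
    if a.chrom != b.chrom:
        return "distant"
    gap = abs(a.rank - b.rank)
    if gap == 0:
        raise ValueError("a gene cannot be paired with itself")
    if gap == 1:
        return "tandem"    # consecutive in the genome
    if gap <= max_gap:
        return "proximal"  # nearby, with gap - 1 genes in between
    # WGD, transposed and dispersed pairs cannot be told apart by
    # position alone; that needs collinearity (synteny) evidence.
    return "distant"
```

Pairs labelled "distant" here would still need collinearity information to be assigned to the WGD, transposed or dispersed modes.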

Classical population genetic theory suggests that a likely consequence of gene duplication is reversion to single copy (singleton), unless at least one gene copy evolves new function [8]. More recently, the subfunctionalization model, which proposes that duplicated gene copies might both be retained if they partition the functions of the ancestral gene between them, has described an important modification of the classical model [9,43]. Some studies also show evidence to support the value of genetic redundancy per se [10,12,13,14,15,16,17,44,45] or dosage balance [34,46,47,48].


The angiosperms (flowering plants) are an outstanding model in which to elucidate the consequences of gene duplication. All angiosperms are now thought to be paleopolyploids [49], many of which underwent multiple WGDs [50,51]. Traces of past WGDs can often be detected from pairwise syntenic alignments through software such as ColinearScan [52] and multiple alignments using MCScan [53]. Arabidopsis, selected as the first angiosperm genome to be sequenced due to its small genome size and minimal DNA sequence duplication, has experienced two ‘recent’ WGDs, i.e. since its divergence from other members of the Brassicales clade (α and β), and a more ancient triplication (γ) shared with most if not all eudicots [49,51,53]. Likewise, rice appears to have experienced at least two WGDs, one shared with most if not all cereals (ρ), and another more ancient event (σ) [54]. Single gene duplications in angiosperms are also widespread [36,55,56].


One avenue for systematic investigation of functional divergence between duplicate genes is comparison of their spatiotemporal expression profiles, comparing degrees of divergence with proxies of duplication age such as synonymous substitution rates (Ks) between duplicate genes. In Arabidopsis, the rate of protein sequence evolution is asymmetric in >20% of duplicate pairs and functional diversification of surviving duplicate genes has been proposed to be a major feature of the long-term evolution of polyploids [57]. Arabidopsis genes created by large-scale duplication events are more evolutionarily conserved in gene expression than those created by small-scale duplication or those that do not lie in duplicate segments, and the time since duplication is correlated with functional divergence of genes [58]. Further, there may be also a strong positive correlation between expression divergence and non-synonymous mutation (Ka) in Arabidopsis, and the different modes (segmental, tandem and dispersed) of duplication may affect patterns of expression divergence [38]. Arabidopsis duplicated genes show greater expression diversity than singleton genes across closely related species and allopolyploids [59]. In rice, expression correlation is significantly higher for gene pairs from WGDs or tandem duplications than dispersed duplications, and expression divergence is closely related to divergence time [60].


Though many studies have investigated the functional divergence and retention of duplicate genes, conclusions are often contradictory, e.g. gene retention has been attributed to either neofunctionalization [6,7] or genetic redundancy [12,13,14,15,16,17], and expression divergence between duplicate genes has been suggested to be either time dependent [58,60] or selection dependent [38]. The fates of duplicate genes may be influenced by different modes of gene duplication, which have been suggested to retain genes in a biased manner [36]. With much richer expression and annotation data available now than for most prior studies, and improved ability to discern various mechanisms of gene duplication, we find merit in re-examining some existing hypotheses and exploring some new hypotheses regarding the consequences of gene duplication. Here, we related multiple types of genomic data to gene expression divergence in two angiosperm species, Arabidopsis and Oryza (rice), to formally test possible evolutionary patterns (hypotheses). A far richer volume of analyzed microarray data than was available in prior studies improves the robustness of statistical analyses.


Results

A total of 4,566 Affymetrix Arabidopsis Genome ATH1 Arrays and 508 Affymetrix GeneChip Rice Genome Arrays were used to generate the expression profiles of 22,810 Arabidopsis genes and 27,910 rice genes. We classified gene duplications into six modes: WGD, tandem, proximal, DNA based transposed, retrotransposed and dispersed duplication, according to the procedure shown in Figure 1 and described in methods. Note that in this study, a gene may have up to five potential duplication relationships, depending on the number of BLASTP hits. For WGD duplicates, redundant duplication relationships were removed using co-linearity restrictions. If a gene was created by single gene duplications, all possible duplication relationships were considered. However, redundant duplication relationships in single gene duplications did not enlarge the gene set created by each duplication mode. In a distantly transposed duplication, one duplicate gene is the parental (ancestral) copy while the other is the transposed (derived) copy, at a novel locus. Dispersed duplications, which we cannot attribute to specific mechanisms, are regarded as a control group. The number of pairs of duplicate genes and number of unique genes (i.e. number of created genes) in each mode of duplication is summarized in Table 1. A total of 2,981 α, 1,161 β and 417 γ WGD duplicate pairs in Arabidopsis; and 1,712 ρ and 568 σ WGD duplicate pairs in rice, have expression profiles. In this study, the degree of similarity between the expression profiles of a pair of genes across all experiments is measured by the Pearson’s correlation coefficient (r). To express in positive values the evolution of gene expression between duplicates or orthologs, we use the term “expression divergence”, measured by 1 − r [61,62].

Notes on the paragraph above: "redundant duplication relationships were removed using co-linearity restrictions" means that if a gene pair lies in a collinear block it is treated as a WGD duplicate, and the other five duplication modes are no longer considered for it. "All possible duplication relationships were considered" means that a gene may derive from more than one duplication mode rather than from exactly one.
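As a small worked illustration of the 1 − r measure described above, the sketch below computes the expression divergence of two expression profiles. The function names and the list-of-floats profile format are assumptions for illustration, not the authors' code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def expression_divergence(x, y):
    """Expression divergence 1 - r: 0 for identical patterns, up to 2."""
    return 1.0 - pearson_r(x, y)
```

Two genes whose profiles rise and fall together give a divergence near 0, while perfectly opposite profiles give 2.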

Figure 1

Table 1

Gene duplication modes contribute differentially to genetic novelty and redundancy

Expression divergence between duplicate genes was compared across modes of duplication (Figure 2). The trends of expression divergence between duplicates in Arabidopsis and rice are very similar: DNA based transposed duplication ≈ retrotransposed duplication > dispersed duplication > proximal duplication > WGD ≈ tandem duplication (both ANOVA model involving all duplication modes and Tukey’s HSD test between adjacent duplication modes are significant at α = 0.05). Although retrotransposed duplications have a little higher average expression divergence than DNA based transposed duplications, the difference is not significant (P-value > 0.05). WGDs result in a little higher expression divergence than tandem duplications in Arabidopsis but the difference is not significant in rice.


Figure 2

Despite the relatively fast evolution of gene expression shown by distantly transposed duplications, a tendency toward co-expression between genes duplicated by all modes can be observed by comparison with 10,000 randomly selected gene pairs (Figure 2). Furthermore, we used r < 0.371 and r < 0.621 (95% quantile of the r values obtained from random gene pairs) as criteria for determining that two duplicate genes have diverged in expression in Arabidopsis and rice respectively [57,63]. The proportions of divergent expression between genes duplicated by different modes are shown in Table 2. All these data suggest that the extent of expression divergence of retained duplicates is affected by the duplication mechanism: WGD and tandem duplicates are more likely to maintain their original expression patterns, proximal duplications show intermediate divergence, and distantly transposed duplications tend to have the biggest changes of gene expression profiles.

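The divergence criterion above (a cutoff taken as the 95% quantile of r among random gene pairs) can be sketched as follows. The helper names and the linear-interpolation quantile are assumptions; the passage does not say which quantile estimator was used.

```python
import math


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty sequence, 0 <= q <= 1."""
    if not values:
        raise ValueError("values must be non-empty")
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def fraction_diverged(duplicate_r, threshold):
    """Share of duplicate pairs whose correlation falls below the cutoff."""
    return sum(r < threshold for r in duplicate_r) / len(duplicate_r)
```

In use, `quantile(random_pair_r, 0.95)` would give the species-specific cutoff (0.371 in Arabidopsis and 0.621 in rice according to the text), and `fraction_diverged` would then give the kind of proportion reported per duplication mode.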

Table 2

Computationally, genetic redundancy may be inferred from simultaneous conservation in protein sequences that determine molecular functions, and expression patterns which determine biological processes [64,65]. WGD and tandem duplicates tend to be simultaneously conserved in protein sequences (using 25% quartile of Ka of all duplicate pairs, i.e. <0.329 in Arabidopsis and <0.383 in rice, as criteria) and in gene expression (using r ≥ 0.371 in Arabidopsis and r ≥ 0.621 in rice as criteria), while distantly transposed and dispersed duplicates have a random association (assuming that conservation in protein sequences and gene expression were independent in the pooled duplicate genes) between these parameters, and proximal duplicates fall in between (Table 3).

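The comparison described above can be illustrated with a minimal sketch: the observed share of pairs in one duplication mode that are conserved in both sequence and expression is set against the share expected if the two kinds of conservation were independent in the pooled duplicates. The function names and the `(Ka, r)` tuple format are assumptions for illustration.

```python
def conserved_both(pairs, ka_cut, r_cut):
    """Observed fraction of (Ka, r) pairs conserved in sequence and expression."""
    return sum(ka < ka_cut and r >= r_cut for ka, r in pairs) / len(pairs)


def expected_if_independent(pooled, ka_cut, r_cut):
    """Fraction expected if both kinds of conservation were independent.

    pooled holds (Ka, r) for the duplicate pairs of all modes together.
    """
    n = len(pooled)
    p_seq = sum(ka < ka_cut for ka, _ in pooled) / n    # sequence conserved
    p_expr = sum(r >= r_cut for _, r in pooled) / n     # expression conserved
    return p_seq * p_expr
```

A mode whose observed fraction clearly exceeds the expectation would be a candidate contributor to redundancy; a mode close to the expectation shows the "random association" the text describes.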

Table 3

Expression levels differ between the genes created by different duplication modes (Figure 3). WGD and dispersed duplicates have higher gene expression levels than tandem, proximal and distantly transposed duplications (2-sample t-tests are significant at α = 0.05). The higher expression of WGD duplicates is consistent with their retention due to dosage amplification, a theory which has been proven in yeast [47,66,67]. Potentially transposon mediated gene duplications including tandem, proximal and distantly transposed duplications tend to be associated with lower gene expression levels than other duplication modes (Figure 3). Dispersed duplication, with unclear genetic mechanisms so far, is associated with gene expression levels comparable to WGD.


Figure 3

Expression divergence following polyploidy

Since its divergence from other Brassicales, Arabidopsis experienced two WGDs (α and β), while sharing a more ancient genome triplication (γ) with all rosids and perhaps all eudicots [49,51,53]. Rice has experienced two WGDs: the ρ event shared with all Poaceae, and the more ancient σ event [54]. Although expression divergence has been compared between WGD and single gene duplications [38,58,60], the combinational effects of different WGD events on expression divergence have not been addressed. We propose that WGD events themselves, together with the subsequent ‘adaptation’ of the resulting genome to the newly-duplicated state, may accelerate evolution, contributing to variation in expression divergence sometimes attributed to time (usually measured by Ks) alone [58,60].


To further investigate the combinational effects of multiple WGD events, we compared the expression divergence of duplicates from different WGD events (Figure 4). Not surprisingly, expression divergence between the WGD duplicates of more ancient events tends to be larger: γ duplicates > β duplicates > α duplicates in Arabidopsis, and σ duplicates > ρ duplicates in rice (both ANOVA model involving all WGD events and Tukey’s HSD test between adjacent WGD events are significant at α = 0.05). Next, we fitted a curve between expression divergence and Ks for each WGD event using a smooth spline with 10 degrees of freedom available in R packages (Figure 4). We found no significant correlation between expression divergence and Ks within the more ancient Arabidopsis β duplicates (r = 0.036, P-value = 0.241) or γ duplicates (r = −0.008, P-value = 0.883), or rice σ duplicates (r = 0.045, P-value = 0.307) but correlations are significant within the most recent Arabidopsis α duplicates (r = 0.126, P-value = 1.364 × 10⁻¹¹) and rice ρ duplicates (r = 0.105, P-value = 2.054 × 10⁻⁵). Further, we conducted a power analysis for these correlations. We found that at α = 0.05, the non-significant correlations (β, γ and σ duplicates) did not have higher power than conventionally desired (>0.8) while significant correlations (α and ρ duplicates) had power greater than 0.98, confirming that the relationship between expression divergence and Ks differs among different WGD events.

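The text does not say how the power analysis was carried out. One common approximation, shown here purely as a sketch, uses the Fisher z-transform of the correlation coefficient; the function name is hypothetical, and treating the reported numbers of duplicate pairs with expression profiles (for example 2,981 and 1,161 in Arabidopsis) as the sample sizes is an assumption.

```python
import math
from statistics import NormalDist


def correlation_power(r, n, alpha=0.05):
    """Approximate power of a two-sided test of rho = 0.

    Uses the Fisher z approximation: atanh(r_hat) is roughly normal
    with mean atanh(r) and standard deviation 1 / sqrt(n - 3).
    """
    if n <= 3 or not -1 < r < 1:
        raise ValueError("need n > 3 and -1 < r < 1")
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    delta = abs(math.atanh(r)) * math.sqrt(n - 3)
    # Probability that the test statistic lands in either rejection tail.
    return normal.cdf(delta - z_crit) + normal.cdf(-delta - z_crit)
```

Under these assumptions, a true correlation of 0.126 with 2,981 pairs gives power well above 0.98, while a true correlation of 0.036 with 1,161 pairs gives power far below 0.8, which is in line with the pattern the paragraph reports.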

Figure 4

WGD events themselves influence gene expression divergence, with more ancient WGD duplicated genes likely to have greater expression divergence than more recent duplications, even if both have similar Ks (Figure 5). To support this hypothesis statistically, we coded the α, β and γ events by 1, 2 and 3 in Arabidopsis and the ρ and σ events by 1 and 2 in rice. Then different linear regression models of expression divergence on Ks and/or WGD codes were fit in Arabidopsis and rice respectively. All regression models and their coefficients were statistically significant. For both Arabidopsis and rice, the model which counts both Ks and the number of WGD events that duplicate genes underwent results in the highest adjusted R² and lowest Akaike information criterion (AIC) (Table 4) with significant nonzero slopes of all coefficients, supporting the hypothesis that WGD events themselves, in addition to Ks, can lead to increased expression divergence between duplicates.

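The model comparison above can be illustrated with a small ordinary least squares routine that reports adjusted R² and AIC, so that a Ks-only model can be set against a Ks plus WGD-code model. This is a sketch, not the authors' analysis: the function name is hypothetical and the AIC is the Gaussian form up to an additive constant, which is enough for comparing models fitted to the same data.

```python
import math


def ols_fit(X, y):
    """Least squares with intercept; returns (coefficients, adj_r2, aic).

    X is a list of predictor rows (without the intercept column).
    """
    n = len(y)
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda k: abs(M[k][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ValueError("singular design matrix")
        M[col], M[piv] = M[piv], M[col]
        for k in range(p):
            if k != col:
                f = M[k][col] / M[col][col]
                M[k] = [a - f * b for a, b in zip(M[k], M[col])]
    coef = [M[i][p] / M[i][i] for i in range(p)]
    fitted = [sum(b * v for b, v in zip(coef, r)) for r in rows]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)
    adj_r2 = 1 - (rss / (n - p)) / (tss / (n - 1))
    # p regression coefficients plus one error-variance parameter.
    aic = n * math.log(rss / n) + 2 * (p + 1)
    return coef, adj_r2, aic
```

Fitting `ols_fit([[ks] for ks in ks_values], divergence)` and `ols_fit([[ks, code] for ks, code in zip(ks_values, wgd_codes)], divergence)` and keeping the model with the lower AIC mirrors the comparison summarized in Table 4; the variable names in this usage line are placeholders.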

Figure 5

Table 4

Selection after WGD events may constrain expression divergence of some duplicates. To examine this question, we studied the 25% of WGD duplicate pairs with most conserved expression at each WGD event. At a P-value threshold of 0.05 by Fisher’s exact test (corrected for multiple tests), specific GO terms/Pfam domains were associated with conserved expression at each WGD event, and some recurred across different WGD events, e.g. transcription factor activity (GO:0003700) and ribosome (GO:0005840) for Arabidopsis α and γ and rice ρ events; protein biosynthesis (GO:0006412) for Arabidopsis α and β and rice ρ events (Table S1). In contrast, WGD duplicates with divergent expression (25% of pairs with highest d values at each event) showed little or no enrichment of specific GO terms/Pfam domains and functional terms did not recur between different WGD events.

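The enrichment test mentioned above can be sketched as a one-sided Fisher's exact test, which is the upper tail of a hypergeometric distribution. The function names are hypothetical, and the Bonferroni adjustment is an assumption: the text says only that P-values were corrected for multiple tests, not how.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact P-value for over-representation.

    k: conserved pairs annotated with the term
    n: conserved pairs in total
    K: all pairs annotated with the term
    N: all pairs
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def bonferroni(p, n_tests):
    """Bonferroni-adjusted P-value (assumed correction, see lead-in)."""
    return min(1.0, p * n_tests)
```

A term would be called enriched among the conserved pairs of one WGD event when its adjusted P-value falls below 0.05.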

Expression divergence between Arabidopsis and rice

In that most angiosperms share most genes, changes in expression may be fundamental to angiosperm biodiversity. Previous studies have associated duplicated genes with greater expression diversity than singletons in closely related species of both animals [68] and plants [59]. However, it has been difficult to extend such comparisons to more distant species such as Arabidopsis, a eudicot, and rice, a monocot, due to greater difficulty discerning orthology or paralogy. To facilitate the comparison of gene expression data generated by different microarray platforms, we adopted a conceptual framework of comparing co-expression patterns across species [69] (see Methods). Further, we restricted our study to 2,012 gene pairs suggested both by DNA sequence similarity and by synteny/collinearity to be orthologs between Arabidopsis and rice, downloaded from the PGDD database [51,53]. The comparison of expression divergence between different types of orthologs shows the following trend: duplicate-duplicate > singleton-duplicate > singleton-singleton (Figure 6), with P-values of 0.049 between duplicate-duplicate and singleton-duplicate and 0.010 between singleton-duplicate and singleton-singleton using two-sample t-tests. This finding supports that singletons are more conserved in expression than duplicated genes, consistent with the hypothesis that one consequence of gene duplication is increased expression diversity.


Figure 6

Expression divergence may be correlated with both Ks and Ka

Divergence in coding sequences can be denoted by Ks, which indicates putatively-neutral mutations that are synonymous at the amino acid level, or by Ka, which indicates altered amino acids suggestive of the action of selection on gene function. The correlations between expression divergence and coding sequence divergence in angiosperms have been widely discussed [38,58,60] but conclusions were inconsistent: Casneuf et al. and Li et al. suggested that Ks is closely correlated with gene expression divergence, while Ganko et al. found little correlation. Since microarray data contain a high level of noise and previous studies often relied on small sets of microarray data or only one species, our analysis of ‘‘all arrays’’ and two highly-divergent species may have broader inference space.


The distributions of Ka or Ks differ markedly for different gene duplication modes, but are relatively consistent in Arabidopsis and rice (Figure 7). Tandem/proximal and WGD duplicates have qualitatively lower Ks (putatively reflecting younger age) than distantly transposed (DNA and RNA) or dispersed duplicates, the distinction being much clearer in the small genome of Arabidopsis (Figure 7A) than the 3× larger and more repeat-rich genome of rice (Figure 7B). Within these qualitative distinctions, quantitative differences among the categories are also evident and largely consistent, with relative Ks (putatively age) of duplications following the trend of: dispersed > distantly transposed > WGD > proximal > tandem (both ANOVA model involving all duplication modes and Tukey’s HSD test between adjacent duplication modes are significant at α = 0.05). Retrotransposed duplicates differ slightly in the two taxa, being similar to DNA based transposed duplicates in Arabidopsis, and to dispersed duplicates in rice. The trend of Ka shows the same qualitative distinction as that of Ks (Figure 7C and 7D), but differing in the quantitative trend with amino-acid altering mutation frequencies being retrotransposed > dispersed > DNA based transposed > proximal ≈ WGD ≈ tandem (both ANOVA model involving all duplication modes and Tukey’s HSD test between adjacent duplication modes are significant at α = 0.05). WGD duplicates are more functionally constrained, with higher Ks but equal or lower Ka than proximal duplicates. These data do not show the conventional L-shaped distribution for dispersed and distantly transposed duplicates, because the filters employed in gene selection focus this analysis only on genes that have survived a long time, implying that the genes serve important functions.


Figure 7

Relationships between coding sequence divergence and expression divergence are heterogeneous, and differ among gene duplication modes. For WGD duplicates, expression divergence is significantly correlated with both Ka and Ks in both Arabidopsis and rice, although the strength of the correlations is progressively weaker for more ancient duplications and in some cases reaches non-significance (Table 5). Expression divergence is also significantly correlated with both Ka and Ks among proximal duplicates. Tandem duplicates differ in the two taxa, with those of rice resembling WGD genes with expression divergence significantly correlated with both Ka and Ks, and those of Arabidopsis resembling distantly transposed duplications with marginal and sometimes non-significant correlation.


Table 5

While age and functional divergence are more closely related to expression divergence in WGD genes than those resulting from other duplication modes, this does not reflect a lack of expression divergence among other gene duplicates. Indeed, proximal duplication is associated with higher expression divergence than WGD, despite its smaller average Ks. Likewise, DNA based transposed duplication is associated with higher expression divergence than dispersed duplication, despite smaller Ks (Table 6).


Table 6

In partial summary, expression divergence between duplicate genes may be affected by duplication modes, as well as by the ‘age’ (Ks) of the duplicated genes, i.e. gene expression divergence may differ among duplication modes at the same Ks or Ka levels. To further validate this claim, we fit a smooth spline curve between expression divergence and Ks or Ka for each duplication mode (Figure 8). While these curves fluctuate markedly, at fixed Ks or Ka levels distantly transposed duplications (for example) are generally associated with higher expression divergence between duplicates than WGD or tandem duplications.

Figure 8

DNA methylation of the promoter regions has little impact on expression divergence

Epigenetic mechanisms such as DNA methylation have been suggested to potentially differentiate newly arisen duplicate genes [32,70] as well as orthologous genes across closely related species [59]. Transcriptional silencing has often been associated with DNA methylation in promoter regions [71,72]. Using data on genome-wide DNA methylation status for both Arabidopsis and rice [73], we examined whether DNA methylation status in promoter regions is related to expression divergence between duplicates or between orthologs. This comparison carries an inherent assumption that methylation patterns are relatively static and generally apply to all of the microarray studies. A gene promoter region was considered to be methylated if two or more adjacent probes are methylated within the region [72]. Proportions of pairs of duplicates that differ in DNA methylation status in promoter regions, separated by gene duplication modes, are summarized in Table 7. Distantly transposed duplications appear somewhat more likely to differ in DNA methylation status than other duplication modes. However, the duplicate genes that differ in DNA methylation status in promoter regions do not have more divergent expression than those that have the same DNA methylation status, within any duplication mode (negative data are not shown). Likewise, different methylation status among orthologs also showed no significant relationship to expression divergence, although we confirmed that singletons are a little more likely to be methylated in promoter regions than duplicates (Table 8), as proposed by others [59]. These analyses suggest that the mechanisms by which DNA methylation status affects expression divergence between homologous genes may be complicated, and direct association may not be informative for unraveling such mechanisms.
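
The promoter methylation rule described above (two or more adjacent methylated probes) can be sketched as follows. This is only an illustration: the function names and the boolean-list input format are assumptions, not part of the original analysis.

```python
def promoter_is_methylated(probe_calls):
    """Return True if at least two adjacent probes are methylated.

    probe_calls: list of booleans, one per probe, ordered along the
    promoter region (hypothetical input format).
    """
    return any(a and b for a, b in zip(probe_calls, probe_calls[1:]))


def differs_in_methylation(calls_a, calls_b):
    """True if the two genes of a pair differ in promoter methylation status."""
    return promoter_is_methylated(calls_a) != promoter_is_methylated(calls_b)


# Toy example: gene A has two adjacent methylated probes, gene B does not.
gene_a = [False, True, True, False]
gene_b = [True, False, True, False]
print(differs_in_methylation(gene_a, gene_b))
```

Under this rule, isolated methylated probes never make a promoter count as methylated, which is why gene B in the toy example is treated as unmethylated.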

Table 7

Table 8

Gene family members may have non-random patterns of origin

The diversity of gene duplication mechanisms and patterns of gene expression divergence raise questions about how gene families expand and how their members have been retained in the history of evolution. WGD duplicates are differentially retained across different gene functional classifications [10,34,57,74]. However, we suggest that gene families may be more informative units than functional terms for investigating patterns of gene origin, as duplication relationships in gene families are clearer. Based on our findings above, both functional divergence and redundancy may contribute to retention of duplicate genes. Furthermore, because the degrees of functional diversification are not equal across gene families and gene duplication modes add additional heterogeneity to patterns of functional divergence, it is possible that gene family members may have non-random patterns of origin, e.g. the gene families with high functional diversification may be enriched with distantly transposed duplications while those families contributing to genetic redundancy are likely to be enriched with WGD duplications.

To examine these questions, we investigated the gene duplication modes of 126 Arabidopsis and 24 rice published gene families of 10 or more genes, available at TAIR (http://www.arabidopsis.org/) and Michigan State University (http://rice.plantbiology.msu.edu/) respectively. By using Bonferroni-corrected Fisher’s exact test, we found that 64 (50.8%) Arabidopsis gene families and 19 (79.2%) rice gene families are enriched for at least one gene duplication mode at α = 0.05 (Table S2). For example, DNA based transposed duplications are enriched in disease resistance gene homologs and the cytochrome P450 gene family (Figure 9 A–C). Disease resistance gene homologs, most of which have nucleotide binding site-leucine rich repeat (NBS-LRR) domains, express at different levels and tissue specificities, and function in diverse biological processes in Arabidopsis [75]. P450s also express in many tissues in a tissue specific manner and are involved in diverse metabolic processes [76,77]. The cytochrome P450 family also shows enrichment for DNA based transposed duplications in rice. Thus, these two gene families may have achieved functional and expression diversity through some combination of transposition activity and retention of distantly transposed duplicates. Interestingly, these two families are also enriched with proximal duplications, again often associated with greater expression divergence than WGD despite generally similar coding sequence divergence.
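
A minimal sketch of the enrichment test for one gene family and one duplication mode. It assumes a one-sided Fisher's exact test (the upper tail of the hypergeometric distribution) and a Bonferroni factor equal to the number of family-by-mode tests. The paper does not state either detail, so both are assumptions, and the counts are made up.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test for enrichment.

    k: genes in the family that belong to the duplication mode
    n: size of the family
    K: genes in the genome that belong to the duplication mode
    N: total number of genes considered
    Returns P(X >= k) under the hypergeometric null.
    """
    upper = min(n, K)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / comb(N, n)


def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


# Toy numbers (hypothetical): a 20-gene family with 14 WGD genes,
# in a genome of 1,000 genes of which 200 are WGD duplicates.
p = fisher_enrichment_p(k=14, n=20, K=200, N=1000)
p_adj = bonferroni(p, n_tests=126 * 6)  # assumed: 126 families x 6 modes
print(p_adj < 0.05)
```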

WGD duplicates are enriched in other gene families, such as the cytoplasmic ribosomal protein gene family, and C2H2 zinc finger proteins (Figure 9 D–F). In Arabidopsis, a large number of ribosomal genes are co-regulated [78]. C2H2 zinc finger proteins have been shown to be involved in some basic biological processes such as transcriptional regulation, RNA metabolism and chromatin-remodeling [79]. Furthermore, C2H2 zinc finger proteins are enriched with retained WGD duplicates in both Arabidopsis and rice. Our analyses suggest that gene family members may have common non-random patterns of origin, that recur independently in different evolutionary lineages (such as monocots, and dicots, studied here), and that such patterns may result from specific biological functions and evolutionary needs.

Figure 9

Discussion

In two species that sample a wide range of tissues and physiological conditions in major angiosperm lineages diverged by about 140–170 million years [80] and affected by at least 5 different genome duplication events, we have compared expression divergence between positional orthologs and between genes duplicated by several additional mechanisms. Both neo-functionalization and genetic redundancy can result in retention of duplicate genes. WGD duplicates generally are more frequently associated with genetic redundancy than genes resulting from other duplication modes, partly due to dosage amplification. Tandem duplications also contribute to genetic redundancy, while other duplication modes are more frequently associated with evolutionary novelty. Potentially transposon mediated gene duplications tend to reduce gene expression levels. Expression divergence between duplicates is discernibly related to duplication modes, WGD events, Ka, Ks, and possibly the DNA methylation status of their promoter regions. However, the contribution of each factor is heterogeneous among duplication modes, and new factors as well as combinatorial effects of different factors are worth further investigation. Gene loss may retard inter-species expression divergence, as singletons are generally more conserved in gene expression than duplicates. Members of different gene families have non-random patterns of origin, and such patterns may be similar between Arabidopsis and rice.

The use of large volumes of data and inclusion of as many genes as possible may help to mitigate factors specific to particular developmental states, noise associated with microarray data, and bias reflecting features specific to particular gene families. For example, we have found that the correlations between expression divergence and Ks are not consistent within gene duplication modes (Figure 5 and 8). For WGD duplicates, significant correlations only exist in those generated by recent WGD events - if only relatively ‘young’ WGD duplicates are studied, the correlations may be overestimated. Moreover, such correlations are not uniformly distributed among Ks levels - at low Ks levels (<1), all duplication modes may show correlations.

We find evidence for duplicate gene retention by both neofunctionalization and genetic redundancy, seemingly at opposite ends of the spectrum of possible fates of duplicated gene pairs. Genetic redundancy has clear biological significance, i.e. provision of buffering capacity [10,11] and/or dosage balance [34,46,47,48], and seems most closely related to WGD or tandem duplicates. The origins of genetic novelty, of clear biological significance in occupation of new niches or adaptation to new environments, may lie more with the greater expression divergence and more independent evolution of distantly transposed and dispersed duplications. Proximal duplication is more balanced in its contributions to genetic novelty and redundancy than other gene duplication modes.

Detailed delineation of gene duplication modes reveals some new trends. Prior studies classified genes into as few as two types (anchors generated by polyploidy, and non-anchors generated by single gene duplication [58]), or as many as three types (segmental, tandem and dispersed: [38]). In this study, we have attempted to distinguish DNA/RNA based transposed from dispersed duplication, and proximal from tandem duplication. DNA based transposed duplications tend to evolve faster in expression while having smaller Ks than dispersed duplicates. Tandem duplicates diverge slower in gene expression than proximal duplicates. Proximal duplicates tend to diverge faster in expression than WGD duplicates, though concerted evolution [20] may homogenize their coding sequences.

The factors that affect expression divergence are complex

Our analyses suggest that it may be inappropriate to make generalizations about levels and patterns of expression divergence across gene duplication modes. Ks, putatively a proxy for age, seems to be related to expression divergence only within a subset of duplication modes and largely only among younger duplicates. Ka, putatively a proxy for functional change, also shows statistically significant and heterogeneous relationships to expression divergence. The level of these correlations is very low, even in recent WGD duplicates.

Although expression divergence between duplicates is often significantly correlated with coding sequence divergence, it is well known that gene expression is also regulated by other genomic regions such as promoters, 5' UTRs, and 3' UTRs. The correlations between expression divergence and nucleotide substitution rates (μ) of different genomic regions for pairs of duplicates are summarized in Table S3. WGD duplicates show significant correlations between expression divergence and nucleotide substitution rates in all three regions. These correlations become marginal and often non-significant among tandem duplicates. Expression divergence of proximal duplicates is more closely associated with divergence in promoters, 5' UTRs and 3' UTRs than coding sequences. Expression divergence of DNA based transposed duplicates seems to be most related to Ka and μ of 3' UTRs. Expression divergence of dispersed duplicates is very slightly correlated with Ka but not with other substitution rates. Retrotransposed duplication is least related to any type of sequence divergence, consistent with its general separation of a gene from its native regulatory elements.

In partial summary, expression divergence between duplicate genes may be affected by different and multiple genetic factors depending on the causal duplication mechanism. For pairs of orthologs between Arabidopsis and rice, expression divergence seems only correlated with Ka (Table 5 and Table S3). Single gene duplications including translocated and tandem/proximal duplications have been suggested to be much more prone to promoter disruption than WGD [58]. We examined this hypothesis using >45% sequence identity as criterion for determining duplicated (non-disrupted) promoter regions, finding proximal duplicates to have higher proportions of duplicated promoter regions than WGD duplicates (Table 9). This finding seems to contradict the greater expression divergence of proximal duplicates than WGD duplicates. Thus, we note that each of the investigated genetic/epi-genetic factors may only explain a small portion of the variation of expression divergence between duplicate genes, and perhaps only for certain duplication modes. New factors that may affect expression divergence and how different factors work together are worth investigation.

Reader's notes: the "translocated" duplications mentioned here are what we usually call segmental duplications. The >45% sequence identity threshold for calling a promoter region duplicated (non-disrupted) is a setting worth learning from, although it feels rather low.

Table 9

Possible non-random associations between duplication mode and population size

WGD is often associated with speciation in plants [81,82]. If ancestral polyploidy was attendant with speciation, new species would have likely initially faced very small Ne (i.e. effective population size), weak selection, high drift and high mutational load. This could put a premium on buffering, but allow little chance for beneficial mutations. On the other hand, small-scale duplications may have been only infrequently associated with speciation, if at all. Thus they might be more likely to arise in established populations with larger Ne and more efficient selection, all putting a greater premium on evolutionary novelty to attain fixation. A hypothesis worthy of further investigation is that nonrandom associations between duplication mode and population size have shaped which specific genes and functional variations are retained.

Methods

Genome annotation

Genome annotations were obtained from TAIR (http://www.arabidopsis.org) for Arabidopsis, and from the Rice Genome Annotation Project data (http://rice.plantbiology.msu.edu) for rice. Gene structures were retrieved using ENSEMBL Biomart (http://plants.ensembl.org/biomart/martview).

Gene expression data

To reliably assess the expression divergence between duplicates or between orthologs, we used as many publicly available microarray datasets as possible, all of which were obtained from NCBI’s GEO (http://www.ncbi.nlm.nih.gov/geo/). At the time of retrieval, 6,009 samples existed for the Affymetrix Arabidopsis ATH1 Genome Array (GEO platform GPL198), of which 800 were not available and a total of 5,209 CEL files were downloaded. 550 CEL files for the Affymetrix GeneChip Rice Genome Array (GEO platform GPL2020) were downloaded, of which 13 were removed due to incorrect array types. For both Arabidopsis and rice raw expression data, RMA normalization was performed using the RMAExpress software (http://rmaexpress.bmbolstad.com) across the entire dataset. Outliers were detected using the arrayQualityMetrics [83] Bioconductor package, which implements three different statistical tests to identify outliers. A total of 443 and 29 samples were detected as outliers and removed in Arabidopsis and rice respectively. Thus, 4,566 and 508 samples remained for Arabidopsis and rice, respectively. The annotation files (Release 30) of these two arrays were downloaded from the Affymetrix website (http://www.affymetrix.com), containing 22,810 Arabidopsis genes and 27,910 rice genes. For a gene, there may be multiple probe sets or multiple types of probe sets available on the array. However, a general rule for selection of a probe set that best represents the gene’s expression profile has not been resolved yet [84,85]. In this study, inclusion or exclusion of "sub-optimal" probe sets with suffix "_s_at" or "_x_at" that are suspected of potential cross-hybridization (may be not sub-optimal in practice according to ref. [84,85]) had only trivial effects. Thus, to survey as many genes as possible, all types of probe sets were considered, and for a gene with multiple probe sets, we used the first probe set according to alphabetic sorting to represent its expression profile.
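
The probe set selection rule at the end of the paragraph above (first probe set in alphabetical order per gene) can be sketched as below. The probe set IDs, gene names and the dictionary input are hypothetical and only illustrate the rule.

```python
def representative_probe_sets(probe_to_gene):
    """For each gene, keep the alphabetically first probe set ID.

    probe_to_gene: dict mapping probe set ID -> gene ID
    (hypothetical input format).
    Returns a dict mapping gene ID -> chosen probe set ID.
    """
    chosen = {}
    for probe_id in sorted(probe_to_gene):
        gene = probe_to_gene[probe_id]
        # setdefault keeps the first (alphabetically smallest) probe set seen
        chosen.setdefault(gene, probe_id)
    return chosen


# Toy annotation: GeneA has two probe sets, GeneB has one.
annotation = {
    "200_s_at": "GeneA",
    "100_at": "GeneA",
    "300_x_at": "GeneB",
}
print(representative_probe_sets(annotation))
```

Note that plain string sorting is used here, so the choice depends on the character order of the IDs rather than on any quality ranking of the probe sets.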

Analysis of expression data

Similarity between the expression profiles of two duplicate genes within species was initially measured by either Pearson’s (denoted by PCC or r) or Spearman’s correlation coefficient. Note that all replicate chips were retained and correlations were computed across all individual chips. These two measures generated highly consistent results, and thus we only showed the statistics measured by Pearson’s correlation coefficient. The expression divergence between two duplicate genes or orthologs was measured by 1-r [61,62].
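
As a small worked example of the 1-r measure, the sketch below computes Pearson's r for two expression profiles and converts it to a divergence. The profiles are made-up numbers, and the function names are illustrative only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def expression_divergence(x, y):
    """Expression divergence between two genes, defined as 1 - r."""
    return 1.0 - pearson(x, y)


# Toy profiles across four arrays (hypothetical values).
copy_1 = [1.0, 2.0, 3.0, 4.0]
copy_2 = [2.0, 4.0, 6.0, 8.0]   # same pattern, higher level
copy_3 = [4.0, 3.0, 2.0, 1.0]   # opposite pattern
print(expression_divergence(copy_1, copy_2))
print(expression_divergence(copy_1, copy_3))
```

Because r ranges from -1 to 1, this divergence ranges from 0 (identical pattern) to 2 (opposite pattern). It is insensitive to differences in overall expression level, as the first pair illustrates.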

Orthologous gene pairs compared between Arabidopsis and rice were restricted to 2,012 pairs of orthologs located at corresponding loci in paired syntenic blocks between Arabidopsis and rice as identified by MCScan [53], and having expression profiles on the arrays. To assess the expression conservation (EC) for a pair of Arabidopsis-rice orthologs, we adopted a conceptual framework of comparing co-expression patterns across species [69] implemented in several other studies similar to ours [86,87,88,89,90]. In this study, the framework can be described as:

1) The expression matrices, A and B, in Arabidopsis and rice respectively, are restricted to genes for which orthology relationships have been identified and ordered accordingly (i.e., equivalent rows of the two matrices correspond to the expression profiles of a pair of orthologs):

where ai and bi are the vectors of expression profiles for any pair i of orthologs for Arabidopsis and rice, respectively, and k is the number of orthologous gene pairs.

2) A and B are then converted into two pair-wise correlation matrices, RA and RB, by computing the PCCs between the expression profile of each gene and that of any other gene in each species separately:

3) The expression conservation for an orthologous gene pair i is computed as:
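
The formulas for steps 1 to 3 did not survive in this copy. The block below is a reconstruction from the surrounding definitions and should be read as an assumption, not as the authors' exact notation. In particular, writing EC(i) as the PCC between the i-th rows of the two correlation matrices follows the usual form of this co-expression comparison framework and is not confirmed by the text here.

```latex
% Step 1: expression matrices with ortholog pairs in matching rows
A = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{pmatrix}, \qquad
B = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{pmatrix}

% Step 2: pair-wise correlation matrices within each species
R^A = \bigl[\, r(a_i, a_j) \,\bigr]_{k \times k}, \qquad
R^B = \bigl[\, r(b_i, b_j) \,\bigr]_{k \times k}

% Step 3: expression conservation of ortholog pair i
% (R^A_i and R^B_i denote the i-th rows of R^A and R^B)
EC(i) = r\bigl( R^A_i,\; R^B_i \bigr)
```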

Its corresponding expression divergence is 1 - EC(i).
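
The three steps can be sketched in plain Python as follows. This assumes that EC(i) is the Pearson correlation between row i of RA and row i of RB, with the diagonal entries left in. The paper text here does not confirm either detail, and the toy matrices are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(profiles):
    """Step 2: gene-by-gene PCC matrix within one species."""
    return [[pearson(p, q) for q in profiles] for p in profiles]


def expression_conservation(A, B):
    """Step 3: EC(i) for every ortholog pair i (rows of A and B match)."""
    RA, RB = correlation_matrix(A), correlation_matrix(B)
    return [pearson(RA[i], RB[i]) for i in range(len(A))]


# Toy data: 3 ortholog pairs. The two species have different numbers of
# arrays, which is fine because only the co-expression patterns are compared.
A = [[1, 2, 3, 4], [2, 1, 4, 3], [4, 3, 2, 1]]
B = [[1, 2, 3], [1, 3, 2], [3, 2, 1]]
ec = expression_conservation(A, B)
divergence = [1 - value for value in ec]
print(divergence)
```

The design point of this framework is that profiles from different species are never correlated directly. Each gene is described by its correlations with the other orthologous genes in its own species, and those descriptions are then compared.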

Identification of different modes of gene duplications

The populations of potential gene duplications in Arabidopsis or rice were identified using BLASTP. Only the top five non-self protein matches that met a threshold of E < 10^-10 were considered. Genes without BLASTP hits that met a threshold of E < 10^-10 were deemed singletons. Pairs of WGD duplicates were downloaded from the PGDD database [51,53]. Pairs of α, β, γ duplicates in Arabidopsis and pairs of ρ, σ duplicates in rice were obtained from published lists [49,54]. Single gene duplications were derived by excluding pairs of WGD duplicates from the population of gene duplications. Tandem duplications were defined as being adjacent to each other on the same chromosome. Proximal duplications were defined as non-tandem genes within 20 annotated genes of each other on the same chromosome [38].
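
The tandem and proximal definitions can be illustrated with a short sketch. It assumes that each gene is described by its chromosome and its rank in the gene order, and that "within 20 annotated genes" means a rank difference of at most 20. That reading and the function name are assumptions.

```python
def classify_local_pair(pos_a, pos_b, max_gap=20):
    """Classify a pair of homologous genes as tandem, proximal or neither.

    pos_a, pos_b: (chromosome, gene_rank) tuples, where gene_rank is the
    position of the gene in the ordered list of annotated genes
    (hypothetical input format).
    """
    chrom_a, rank_a = pos_a
    chrom_b, rank_b = pos_b
    if chrom_a != chrom_b:
        return None
    distance = abs(rank_a - rank_b)
    if distance == 1:
        return "tandem"      # adjacent on the same chromosome
    if 1 < distance <= max_gap:
        return "proximal"    # nearby, but separated by other genes
    return None


print(classify_local_pair(("Chr1", 100), ("Chr1", 101)))
print(classify_local_pair(("Chr1", 100), ("Chr1", 112)))
print(classify_local_pair(("Chr1", 100), ("Chr2", 101)))
```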

The remaining single gene duplications (after deducting tandem and proximal duplications) were searched for distant single gene-transposed duplications. To accomplish this aim, genes at ancestral chromosomal positions need to be discerned by aligning syntenic blocks within and between species [53,55]. Angiosperm syntenic blocks were downloaded from the Plant Genome Duplication Database (PGDD), available at http://chibba.agtec.uga.edu/duplication. At the time of retrieval, PGDD provided syntenic blocks within and between 10 species including Arabidopsis thaliana, Carica papaya, Prunus persica, Populus trichocarpa, Medicago truncatula, Glycine max, Vitis vinifera, Brachypodium distachyon, Oryza sativa, Sorghum bicolor, Zea mays [51,53]. An Arabidopsis or rice gene locus was regarded as ancestral if the resident gene along with any of its homologous genes (paralogs/orthologs) occur at corresponding loci within any pair of syntenic blocks in PGDD. Using this criterion, the population of Arabidopsis/rice genes was divided into two subsets: genes at ancestral loci and genes that were transposed. For a pair of distantly transposed duplicate genes, we required that one copy was at its ancestral locus and the other was at a non-ancestral locus, named the parental copy and transposed copy respectively. If the parental copy has more than two exons and the transposed copy is intronless, we inferred that this pair of duplicate genes occurred by retrotransposition (RNA based transposition). If both copies have a single exon, the pair of duplicates was unclassified. For other cases of a pair of distantly transposed duplicate genes, we inferred that the duplication occurred by DNA based transposition. The remaining single gene duplications in the population, i.e. after deducting WGD, tandem, proximal, DNA based transposed and retrotransposed duplications from the BLASTP output, were classified as dispersed duplications. 
After pairs of duplicate genes in each duplication mode were identified, we assigned a unique origin to each duplicated gene, according to the following order of priority: WGD > tandem > proximal > retrotransposed > DNA based transposed > dispersed.
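
The exon-count rule for distantly transposed pairs and the priority order for assigning one origin per gene can be sketched as below. The labels, function names and the "singleton" fallback for genes with no duplication mode are illustrative assumptions.

```python
# Priority order quoted in the text, highest priority first.
PRIORITY = ["WGD", "tandem", "proximal", "retrotransposed",
            "DNA transposed", "dispersed"]


def classify_transposed_pair(parent_exons, transposed_exons):
    """Classify a parental/transposed pair from the exon counts of each copy."""
    if parent_exons == 1 and transposed_exons == 1:
        return "unclassified"        # both copies have a single exon
    if parent_exons > 2 and transposed_exons == 1:
        return "retrotransposed"     # intronless copy of a multi-exon parent
    return "DNA transposed"          # all other cases


def assign_origin(modes):
    """Pick the unique origin of a gene from all modes it takes part in."""
    for mode in PRIORITY:
        if mode in modes:
            return mode
    return "singleton"  # assumed label for genes with no duplication mode


print(classify_transposed_pair(parent_exons=5, transposed_exons=1))
print(assign_origin({"dispersed", "proximal"}))
```

With this ordering, a gene that appears in both a WGD pair and a tandem pair is counted once, as WGD.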

Reader's note: the distantly transposed pairs left unclassified (both copies single-exon) were presumably also counted as dispersed duplications.

GO/Pfam enrichment analysis

GO/Pfam enrichment analysis was performed using Fisher’s exact test. The P-value was calculated for the null hypothesis that there is no association between a subset of genes and a particular functional/domain category and was corrected with the total number of terms to account for multiple comparisons.

Assessing DNA sequence divergence

Coding sequence divergence between a pair of genes was denoted by either non-synonymous (Ka) or synonymous (Ks) substitution rates. Protein sequences were aligned using ClustalW [91] with default parameters. The protein alignment was then converted to DNA alignment using the "Bio::Align::Utilities" module of the BioPerl package (http://www.bioperl.org/). Ka and Ks were estimated by Nei-Gojobori statistics [92], available through the "Bio::Align::DNAStatistics" module of the BioPerl package. Note that the "Bio::Align::DNAStatistics" module may generate invalid Ka or Ks for some duplicate gene pairs due to mis-alignments, which were ruled out from related analysis. All levels of valid Ka or Ks values were considered in related statistical analyses. Because distributions of Ka or Ks were centered at low levels (~1.0), in related figures, to improve their clarity, we only displayed Ka or Ks values between 0 and 2.0.

The promoter region of a gene was restricted to a maximum of 1,000 bp upstream of the transcription start site (TSS) or less if the nearest adjacent upstream gene is closer than 1,000 bp. For a pair of genes, the divergence of promoter sequences was indicated by their Jukes-Cantor nucleotide substitution rate (μ) [93], which is available through the "Bio::Align::DNAStatistics" module of the BioPerl package. The divergence in 5' UTR and 3' UTR is also measured by nucleotide substitution rates (μ). Note that the "Bio::Align::DNAStatistics" module may not output μ if the distance between two input nucleotide sequences is too near or too far. Duplicate gene pairs lacking estimation of μ in the promoter region, 5' UTR or 3' UTR were removed from related analysis.
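
For reference, the Jukes-Cantor distance is d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites in the alignment. The sketch below is a standalone illustration, not the BioPerl implementation. It shows why no estimate exists for very divergent sequences (the logarithm is undefined once p reaches 0.75); how BioPerl treats near-identical sequences is not reproduced here.

```python
from math import log


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned nucleotide sequences.

    Columns with gaps or ambiguous bases are skipped. Returns None when
    no estimate is possible (no comparable sites, or p >= 0.75).
    """
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        return None
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        return None  # saturated: the correction is undefined
    return -0.75 * log(1 - 4 * p / 3)


# Toy alignments (hypothetical sequences).
print(jukes_cantor_distance("ACGTACGT", "ACGTACGA"))
print(jukes_cantor_distance("AAAA", "CCCC"))
```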

DNA methylation data and its analysis

Arabidopsis and rice genome-wide DNA methylation data were obtained from GEO (accession number: GSE21152) [73]. We chose this study, which provided DNA methylation for both Arabidopsis and rice, because the systematic errors between species should be smaller than in data from separate studies. A gene methylated in the promoter region is defined by the presence of two or more adjacent methylated probes within the promoter DNA sequence [59,72].

Gene families

Lists of published gene families were obtained from TAIR (http://www.arabidopsis.org/browse/genefamily/index.jsp) for Arabidopsis, and from the Rice Genome Annotation Project data (http://rice.plantbiology.msu.edu/annotation_community_families.shtml) for rice. Only families with more than nine genes were considered. Arabidopsis disease resistance gene homologs were downloaded from the NIBLRRS Project website (http://niblrrs.ucdavis.edu/). The Rice Cytochrome P450 gene family was downloaded from the Cytochrome P450 homepage [94].

Summary: a very strongly argued, well-written paper that is worth studying, yet it was only published in PLoS One?
