Integrating Coexpression Networks with GWAS to Prioritize Causal Genes in Maize

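0. Overview

These notes cover a paper from Chad L. Myers at the University of Minnesota, published in The Plant Cell, on integrating coexpression networks with GWAS to identify causal genes in maize.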


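1. Abstract

  • Across many species, GWAS has identified a large number of trait-associated loci. However, because of linkage disequilibrium (LD) around each identified locus, we do not know which genes are actually causal. The problem is especially acute in nonhuman, nonmodel species, where functional annotation is comparatively sparse and little information is available for prioritizing candidate genes;
  • The authors developed a method called Camoco to prioritize candidate genes from large-scale maize GWAS. Notably, they found that its performance depends on the type of coexpression network: a network built from population-level RNA-Seq data performed better than the other networks;
  • Two candidate genes identified by Camoco were validated with mutant experiments. The study shows that coexpression networks provide a strong basis for prioritizing causal genes from GWAS, but also that this strategy depends on the gene expression data used.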

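2. Introduction

  • GWAS is a powerful tool for understanding the genetic basis of trait variation. In maize, for example, GWAS has identified roughly 40 loci associated with flowering time, 89 loci for plant height, 36 loci for leaf length, 32 loci for resistance to southern leaf blight, and 26 loci for kernel protein. Although many trait-associated loci have been identified statistically, pinpointing the causal genes and giving a biological interpretation of the functional alleles at these loci remains a major challenge;
  • LD, which gives GWAS its power, is also the main obstacle to identifying causal genes. The markers identified by GWAS often lie outside gene boundaries and may be far from the true causal SNP. In maize, LD ranges from about 1 kb to 1 Mb, and the range may be larger in other crops. In addition, growing evidence indicates that gene regulatory regions play an important role in functional variation, so causal variants can fall outside gene boundaries. Several QTL reported in maize consist of noncoding sequence. Together, these factors mean that even when a locus is strongly associated with a trait, many plausible candidate genes must be considered before the true causal gene can be identified;
  • For crops with incomplete gene annotation, narrowing down the candidate list is very important. In maize, for example, only about 1% of genes have functional annotation from mutants;
  • Gene expression offers a simple and informative strategy for functional annotation. Expression profiles across different genetic backgrounds or tissues can help establish a gene's biological function. Comparing the expression of two genes (coexpression) measures their joint response. Coexpression analysis has been used successfully to identify functionally related genes and has been applied to interpreting Arabidopsis GWAS results;
  • Because coexpression provides a global measure of functional relationships, it can be a powerful tool for interpreting GWAS candidate loci. In particular, the authors expect that expression variation among different genes in the same biological process is associated with the given phenotype. Therefore, if the variation captured by GWAS SNPs is encoded by coregulated genes, the overlap between the two data sets will be nonrandom. Not every functional relationship is reflected in coexpression, but these data still offer valuable clues. Coexpression has also been used as a basis for understanding GWAS in human and mouse;
  • The authors developed a computational framework, Camoco (Coanalysis of molecular components), that integrates gene coexpression networks with GWAS to prioritize candidate genes. Camoco evaluates candidate SNPs from a conventional GWAS and then identifies a set of high-confidence candidate genes;
  • The authors applied Camoco to maize, focusing on the accumulation of 17 elements (Al, As, B, Ca, Cd, Fe, K, Mg, Mn, Mo, Na, Ni, Rb, S, Se, Sr, and Zn) in the maize kernel. Plants must take up all elements except carbon and oxygen from the soil, which makes the plant ionome a key component for understanding plant environmental response, grain nutritional quality, and plant physiology;
  • The authors evaluated the utility of three different types of coexpression networks and demonstrated Camoco's effectiveness on simulated data. The study also shows that functional modules in the coexpression networks are consistent with GWAS candidate SNPs. High-confidence candidate genes were identified, and the approach was validated with single-gene mutants.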

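3. Results

3.1 Camoco: A framework for integrating GWAS results and comparing coexpression networks

  • Camoco takes as input a set of trait-associated SNPs and a table of gene expression values, and outputs high-confidence candidate genes near GWAS signals that are strongly coexpressed with one another. As the figure shows, the framework has three components:
  • Figure 1A: SNP-to-gene mapping. The user specifies a window size and a maximum number of candidate genes per SNP, which together determine all candidate loci;
  • Figure 1B: construction and analysis of the coexpression network;
  • Figure 1C: Camoco's overlap algorithm, which uses two network scoring metrics: subnetwork density and subnetwork locality. Density measures the average interaction strength over all pairs of genes near GWAS signals: it is the mean interaction score within the subnetwork, normalized by the size of the subnetwork. Locality is defined as follows: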

Subnetwork locality measures the proportion of significant (Z > 3) coexpression interactions among genes within a GWAS-derived subnetwork (local interactions) as compared with the number of global interactions with other genes in the genome (global interactions).

  • How locality is computed:

Specifically, locality is obtained by first fitting a linear regression between all genes’ local degree (among the subnetwork of interest) and their global degree and measuring the mean of the residual for genes in the subnetwork (Equation 2).

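  • Figure 1D: the coexpression strength of a subnetwork is assessed by comparing the coexpression among the GWAS-identified genes with the coexpression observed in random networks.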


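  • Density and locality can be computed over the whole subnetwork or on a per-gene basis, which decomposes each gene's contribution to the subnetwork and allows candidate genes to be prioritized.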
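The comparison against random networks in Figure 1D can be sketched as an empirical p-value. This is a minimal illustration, not Camoco's implementation: `empirical_p_value` and `score_fn` are hypothetical names, random gene sets are drawn uniformly with the same size as the candidate set, and the `+ 1` correction (so that the p-value is never exactly zero) is a choice made for this sketch.

```python
import random


def empirical_p_value(score_fn, candidates, all_genes, n_random=1000, seed=0):
    """Fraction of random gene sets scoring at least as high as the candidates.

    score_fn  -- any function mapping a list of genes to a number
                 (for example a density or locality score); hypothetical.
    all_genes -- list of every gene in the network (the sampling background).
    """
    rng = random.Random(seed)
    observed = score_fn(candidates)
    size = len(candidates)
    # Count random sets of the same size whose score reaches the observed one.
    hits = sum(
        score_fn(rng.sample(all_genes, size)) >= observed
        for _ in range(n_random)
    )
    # Add-one correction: an assumption of this sketch, not taken from the paper.
    return (hits + 1) / (n_random + 1)
```

Any subnetwork score can be passed as `score_fn`, so the same routine would serve both density and locality under these assumptions.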

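3.2 Generating coexpression networks from diverse transcriptional data

  • The authors reasoned that the source of the expression profiles could strongly affect how useful a coexpression network is for interpreting the genetic variation captured by GWAS. In this part they built several coexpression networks and evaluated Camoco on each. Three data sets were used:
    • 503 diverse inbred lines (maize pan-genome);
    • gene expression data across different tissues and developmental time points;
    • data from the ionomics GWAS research program.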

3.3 Accounting for cis gene interactions

  • Genes in cis show stronger coexpression than genes in trans.


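3.4 Evaluation of the Camoco Framework

  • Ideally, the loci identified by GWAS would point directly at causal genes, and all of those genes would be strongly coexpressed. In practice, a large fraction of SNPs fall outside genes, either because they affect regulatory sequence or because they are in LD with the functionally important allele.
  • In this part the authors evaluated two main challenges that affect SNP-to-gene mapping:
    • the number of functionally related genes in the subnetwork, measured by the MCR (missing candidate gene rate), i.e., 1 minus the fraction of candidate genes identified by GWAS;
    • each significant SNP picks up many noncausal genes, measured by the FCR (false candidate gene rate).
  • To evaluate Camoco's performance, the authors simulated GWAS results using GO functional annotations.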

3.5 Simulated GWAS Data Sets Show Robust Coexpression Signal to MCR and FCR

  • As the figure shows, the strength of coexpression among GO term genes decreases as MCR increases.


  • The start position of each gene in a GO term is used as an input SNP.

Subnetwork density and locality were calculated for the simulated candidate genes corresponding to each SNP-to-gene mapping combination, in each network, to evaluate the decay of coexpression signal as FCR increases.


  • Different window sizes were used.

3.6 High-Priority Candidate Causal Genes under Ionomic GWAS Loci

  • GWAS data for 17 ionomic traits
  • Three different coexpression networks

3.7 Genotypically Diverse Networks Support Stronger Candidate Gene Discoveries Than Tissue Atlases

The ZmRoot coexpression network proved to be the strongest input, discovering genes for 15 of the 17 elements (absent in Ni and Rb) for a total of 335 HPO genes, ranging from 1 to 126 per trait

4. Discussion

5. Methods

5.1 Construction and quality control of coexpression networks

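  • ZmPAN: 24,756 genes were used to build the network. A Pearson correlation coefficient was computed for every pair of genes, yielding roughly 306 million edges. The correlations were then Fisher transformed and standardized (Z-score) to allow cross-network comparison. Z >= 3 was considered globally significant. To assess the overall reliability of the network, the authors used the following checks:
    • the z-scores among genes in annotated maize GO terms were compared with those of 1000 random terms containing the same number of genes;
    • the degree distribution.
  • ZmSAM: built in the same way;
  • ZmRoot: built in the same way.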
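The scoring pipeline described for ZmPAN (Pearson correlation, Fisher transformation, then standardization to z-scores) can be sketched in a few lines. This is only an illustration on toy data: the function names are hypothetical, the clipping of |r| before the Fisher transform is an assumption added so that perfectly correlated pairs stay finite, and a pure-Python loop over all pairs would not scale to roughly 306 million edges.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_z_scores(expr, clip=0.999999):
    """Map each gene pair to a standardized, Fisher-transformed correlation.

    expr -- dict: gene name -> list of expression values across samples.
    clip -- assumed bound on |r| so that atanh stays finite.
    """
    pairs = list(combinations(sorted(expr), 2))
    fisher = []
    for g1, g2 in pairs:
        r = pearson(expr[g1], expr[g2])
        r = max(-clip, min(clip, r))
        fisher.append(math.atanh(r))  # Fisher transformation
    mu, sd = mean(fisher), pstdev(fisher)
    # Standardize across all pairs so scores are comparable between networks.
    return {pair: (f - mu) / sd for pair, f in zip(pairs, fisher)}


def significant_edges(z_scores, threshold=3.0):
    """Edges passing the global significance cutoff (Z >= 3 in the text)."""
    return [pair for pair, z in z_scores.items() if z >= threshold]
```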

5.2 SNP-to-Gene Mapping and Effective Loci

  • Two parameters: candidate window size and maximum number of flanking genes.

Candidate genes were ranked by the absolute value of their distance to the center of their parental effective locus. Algorithms implementing the SNP-to-gene mapping used here are accessible through the Camoco command line interface.
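To make the two parameters concrete, here is a rough sketch of SNP-to-gene mapping for a single SNP. It is not the Camoco command line interface or its algorithm: `candidate_genes` is a hypothetical helper, it assumes all genes are on the same chromosome as the SNP, it applies the flanking-gene limit separately upstream and downstream, and it skips the merging of nearby SNPs into effective loci.

```python
def candidate_genes(snp_pos, genes, window=50_000, max_flank=2):
    """Return candidate gene names for one SNP, nearest first.

    genes     -- list of (name, start, end) tuples on the SNP's chromosome.
    window    -- maximum distance (bp) searched on each side of the SNP.
    max_flank -- maximum number of genes kept on each side (assumed meaning).
    """

    def distance(gene):
        _, start, end = gene
        if start <= snp_pos <= end:
            return 0  # the SNP lies inside the gene
        return min(abs(snp_pos - start), abs(snp_pos - end))

    overlapping = [g for g in genes if distance(g) == 0]
    upstream = sorted(
        (g for g in genes if g[2] < snp_pos and distance(g) <= window),
        key=distance,
    )[:max_flank]
    downstream = sorted(
        (g for g in genes if g[1] > snp_pos and distance(g) <= window),
        key=distance,
    )[:max_flank]
    ranked = sorted(overlapping + upstream + downstream, key=distance)
    return [name for name, _, _ in ranked]
```

Widening `window` or raising `max_flank` admits more genes per SNP, which is exactly how false candidates enter the analysis.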

5.3 Calculating subnetwork density and locality

  • Subnetwork density is computed from the coexpression score between each pair of genes i and j in the subnetwork and the number of pairwise gene comparisons.
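One way to read the density description is sketched below. The exact form is an assumption of this sketch: it takes the mean pairwise z-score and scales it by the square root of the number of pairs, which treats the scores as roughly standard normal (expected value 0, standard deviation 1). The function name and the dictionary layout are hypothetical.

```python
import math
from itertools import combinations


def subnetwork_density(z_scores, genes):
    """Mean pairwise coexpression score, scaled by the number of pairs.

    z_scores -- dict mapping a sorted (gene_i, gene_j) tuple to its z-score.
    genes    -- genes in the subnetwork (at least two).
    """
    pairs = list(combinations(sorted(genes), 2))
    scores = [z_scores[pair] for pair in pairs]
    n_pairs = len(scores)
    mean_score = sum(scores) / n_pairs
    # Assumed normalization: mean / (1 / sqrt(n_pairs)).
    return mean_score * math.sqrt(n_pairs)
```

Under this assumed scaling, a larger subnetwork with the same mean score gets a higher density, because the same average over more pairs is less likely to arise by chance.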

To quantify network locality, both local and global degree are calculated for each gene within a subnetwork where local degree is the number of interactions to other genes in the subnetwork and global degree is the total number of interactions a gene has. To account for degree bias, where genes with a high global degree are more likely to have more local interactions, a linear regression is calculated on local degree using global degree (designated local ~ global), and regression residuals for each gene are analyzed:

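  • The quantities that appear in the equations are:
  • the number of genes in the subnetwork;
  • the coexpression network score;
  • the number of genes in the coexpression network;
  • the residual of gene j in the regression model;
  • the number of interactions within the subnetwork that pass the threshold (local degree);
  • the number of threshold-passing interactions between gene j and all genes in the genome (global degree).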
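The locality calculation can be sketched as follows. This follows one reading of the description and is not taken from Camoco's source: the regression of local degree on global degree is fitted over all genes in the network (ordinary least squares with an intercept), and the score is the mean residual of the subnetwork genes. The function name and input layout are hypothetical.

```python
from statistics import mean


def locality(sig_edges, all_genes, subnetwork):
    """Mean regression residual of local degree on global degree.

    sig_edges  -- iterable of (gene_a, gene_b) pairs passing the Z threshold.
    all_genes  -- every gene in the network (global degrees must vary).
    subnetwork -- genes of interest (for example GWAS candidates).
    Returns (mean residual, per-gene residuals for the subnetwork).
    """
    subnetwork = set(subnetwork)
    local = {g: 0 for g in all_genes}
    glob = {g: 0 for g in all_genes}
    for a, b in sig_edges:
        glob[a] += 1
        glob[b] += 1
        if b in subnetwork:
            local[a] += 1  # a has an edge into the subnetwork
        if a in subnetwork:
            local[b] += 1
    xs = [glob[g] for g in all_genes]
    ys = [local[g] for g in all_genes]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    # Positive residual: more local edges than the global degree predicts.
    residuals = {
        g: local[g] - (intercept + slope * glob[g]) for g in subnetwork
    }
    return mean(residuals.values()), residuals
```

The per-gene residuals are what would allow individual candidates to be ranked, while the mean summarizes the subnetwork as a whole.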

5.4 Simulating GWAS Using GO Terms

  • GO terms with 50-100 genes and significant coexpression were selected for analysis; MCR and FCR were then used to simulate noise.

5.5 Simulating MCR

The effects of MCR were evaluated by subjecting GO terms with significant coexpression (P ≤ 0.05; described above) to varying levels of MCRs. True GO term genes were replaced with random genes at varying rates (MCR: 0, 10, 20, 50, 80, 90, and 100%). The effect of MCR was evaluated by assessing the number of GO terms that retained significant coexpression (compared with 1000 randomizations) at each level of MCR.
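The replacement step can be sketched as below. `apply_mcr` is a hypothetical helper; it assumes replacement genes are drawn uniformly from background genes outside the GO term and that the number replaced is the term size times the MCR, rounded to the nearest integer.

```python
import random

# MCR levels listed in the text, expressed as fractions.
MCR_LEVELS = [0.0, 0.1, 0.2, 0.5, 0.8, 0.9, 1.0]


def apply_mcr(term_genes, background, mcr, seed=0):
    """Replace a fraction `mcr` of a GO term's genes with random genes.

    term_genes -- true genes annotated to the GO term.
    background -- all genes in the network (must contain enough non-term genes).
    """
    rng = random.Random(seed)
    term_genes = list(term_genes)
    term_set = set(term_genes)
    n_replace = round(len(term_genes) * mcr)
    # Keep a random subset of the true genes ...
    kept = rng.sample(term_genes, len(term_genes) - n_replace)
    # ... and fill the gap with genes that are not part of the term.
    pool = [g for g in background if g not in term_set]
    return kept + rng.sample(pool, n_replace)
```

Scoring the returned gene set at each level in `MCR_LEVELS` and checking whether it stays significant against random sets reproduces the shape of the experiment, under the stated assumptions.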

5.6 Adding False Candidate Genes by Expanding SNP-to-Gene Mapping Parameters

To determine how false candidates due to imperfect SNP-to-gene mapping affected the ability to detect coexpressed candidate genes linked to a GWAS trait, GO terms with significantly coexpressed genes were reassessed after incorporating false candidate genes. Each gene in a GO term was treated as an SNP and remapped to a set of candidate genes using the different SNP-to-gene mapping parameters (all combinations of 50, 100, and 500 kb and one, two, or five flanking genes). Effective FCR at each SNP-to-gene mapping parameter setting was calculated by dividing the number of true GO genes with candidates identified after SNP-to-gene mapping. Since varying SNP-to-gene mapping parameters changes the number of candidate genes considered within a term, each term was considered independently for each parameter combination.
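The bookkeeping for this experiment can be sketched as follows. The parameter grid comes from the text; `effective_fcr` is a hypothetical helper, and it assumes the rate is reported as the fraction of mapped candidates that are not true GO term genes (one minus the ratio of true genes to candidates).

```python
from itertools import product

# All nine SNP-to-gene mapping settings named in the text.
WINDOW_SIZES_BP = [50_000, 100_000, 500_000]
FLANK_LIMITS = [1, 2, 5]
PARAMETER_GRID = list(product(WINDOW_SIZES_BP, FLANK_LIMITS))


def effective_fcr(term_genes, candidates):
    """Fraction of mapped candidate genes that are not true GO term genes.

    term_genes -- genes annotated to the GO term (treated as causal).
    candidates -- genes returned by SNP-to-gene mapping at one setting.
    """
    candidates = set(candidates)
    if not candidates:
        raise ValueError("no candidate genes were mapped")
    true_found = candidates & set(term_genes)
    # Assumed definition: 1 - (true genes among candidates / all candidates).
    return 1 - len(true_found) / len(candidates)
```

Each GO term would be remapped once per entry in `PARAMETER_GRID`, and its effective FCR recorded separately for that setting.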
