A comprehensive catalogue of regulatory variants in the cattle transcriptome

Liu, Shuli, et al. "A comprehensive catalogue of regulatory variants in the cattle transcriptome." bioRxiv (2020).

Abstract

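Characterizing the genetic regulatory variants of the livestock transcriptome is essential for interpreting the molecular mechanisms underlying traits of economic value and for increasing the rate of genetic gain through artificial selection. Here, we build a cattle Genotype-Tissue Expression atlas (cGTEx, http://cgtex.roslin.ed.ac.uk/) for the research community, based on 11,642 RNA-Seq datasets from publicly available data representing more than 100 cattle tissues. We describe the landscape of the transcriptome across tissues and report many thousands of cis- and trans- genetic associations with gene expression and alternative splicing for 24 major tissues. We evaluate the specificity/similarity of these genetic regulatory effects across tissues and functionally annotate them using a combination of multi-omics data. Finally, we link gene expression in different tissues to 43 economically important traits using a large transcriptome-wide association study (TWAS), providing novel biological insights into the molecular regulatory mechanisms of agronomic traits in cattle.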

Introduction

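① GWAS signals for agronomic traits are significantly enriched in the regulatory regions of genes expressed in trait-relevant tissues in cattle, but attempts to dissect variation in gene expression have generally been small in terms of the number of individuals and tissues;

② By analyzing 11,642 publicly available cattle RNA-Seq datasets, we describe more than 100 different tissues and cell types, and make the results freely and easily accessible to the research community through a web portal (http://cgtex.roslin.ed.ac.uk/).

③ We present a standard pipeline for uniformly integrating the 11,642 public RNA-Seq datasets and for identifying cis- and trans- expression and splicing quantitative trait loci (eQTLs and sQTLs) in 24 important cattle tissues.

--- The latter was achieved by calling variants directly from RNA-Seq reads and imputing them to sequence level using data from the 1000 Bull Genomes Project [7], as has been done previously with human data;

--- Next, we conducted in silico analyses to annotate eQTLs and sQTLs with a variety of publicly available omics data, including DNA methylation, chromatin states, and chromatin conformation characteristics;

--- Finally, through a transcriptome-wide association study (TWAS), we integrated the gene expression results with a large GWAS of 27,214 bulls for 43 cattle traits, and identified 442 previously unknown genes associated with the 43 traits.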

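Results

1. Data summary

① We analyzed 11,642 publicly available RNA-Seq datasets from 8,642 samples using a uniform quality-control pipeline, yielding about 200 billion clean reads.

② Single/paired reads, clean read number, read length, sex, age, and mapping rate across samples show that the quality of the publicly available data is acceptable for the analyses shown here.

③ We also analyzed 144 publicly available whole-genome bisulfite sequencing (WGBS) datasets from 21 cattle tissues to study the tissue specificity of DNA methylation and to functionally annotate eQTLs and sQTLs.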

Variation in gene expression across individuals and tissues

① TPM (transcripts per million): a measure of gene expression level that normalizes for both sequencing depth and gene length.

② Only 61 genes were not expressed in any of the samples. About half of those (54.10%) were located in unplaced scaffolds, with significantly (P < 0.05, 1-sided) shorter gene length, fewer exons, higher CG density, and lower sequence conservation than expressed genes

③ Similarly, we detected more alternative splicing events as the number of clean reads increased.

④ Furthermore, 27% of them were housekeeping RNAs, including snRNAs, snoRNAs and rRNAs, which are known to play important roles in RNA splicing.

⑤ Genes without splicing events were significantly engaged in the integral component of membrane and the G-protein coupled receptor signaling pathway.
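The TPM normalization in item ① can be illustrated with a minimal sketch. This shows the general definition only; the paper obtained TPM from StringTie, and the function name and the toy counts and lengths below are made up for illustration.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to TPM (transcripts per million).

    counts and lengths_bp are parallel lists: one read count and one
    gene length (in base pairs) per gene, for a single sample.
    """
    # Step 1: correct for gene length (reads per kilobase).
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    # Step 2: correct for sequencing depth so the values sum to one million.
    scale = sum(rpk) / 1e6
    if scale == 0:
        return [0.0 for _ in rpk]
    return [r / scale for r in rpk]


# Hypothetical toy sample: counts proportional to gene length give equal TPM.
example = tpm([100, 200, 300], [1000, 2000, 3000])
```

Because the length correction is applied before the depth correction, TPM values always sum to one million within a sample, which is what makes them comparable across samples.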

2. Tissue specificity of gene expression

① Tissue-specificity of gene expression was conserved across cattle and human (Fig. 2a-b), and the function of genes with tissue-specific expression accurately reflected the known biology of the tissues

② We also calculated tissue-specificity of promoter DNA methylation and alternative splicing (Methods)

③ We found that, based on tissue-specificity, gene expression level was significantly (FDR < 0.05) and negatively correlated with DNA methylation level in promoter (Fig. 2c), and positively correlated with splicing ratios

3. Discovery of expression and splicing QTLs

① Consistent with findings in humans, significant variants (eVariants) were centered on the transcript start sites (TSS) of the tested genes.

② We found that 46% (range 14.5-73.9%) of eGenes had more than one independent SNP associated with expression (Fig. 3c), indicating widespread allelic heterogeneity of gene expression.

③ Allele-specific expression (ASE) analysis showed that variants with significant ASE were significantly overrepresented among cis-eQTLs, and that their effect sizes were significantly correlated.

④ Patterns and biological mechanisms underlying tissue specificity/similarity of cis-QTLs provide insights into pleiotropic regulatory effects on phenotypes.

⑤ We therefore conducted a non-model-based pairwise analysis using significant eGene-eVariant pairs in one tissue to estimate the proportion of non-null associations of these pairs in another tissue (Methods).

⑥ We speculated that the large number of trans-eQTLs in cattle could be due to the high selection intensity for economically important traits and the modest effective population size. This has led to a complex inter-chromosomal pattern of linkage disequilibrium (LD) and gene co-expression. We showed this might be the case, as we found significantly higher LD for cis-eQTL & trans-eQTL pairs than cis-eQTL & random-SNP pairs (on matched chromosomes) across all tissues (Fig. 4a).
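The LD comparison in item ⑥ depends on a pairwise LD measure between SNPs. A common choice for unphased data is the squared correlation of genotype dosages. The sketch below assumes that estimator (the notes do not say which one the authors used), and the function name is hypothetical.

```python
def ld_r2(geno_a, geno_b):
    """Squared Pearson correlation between two genotype vectors.

    Each vector holds one alternate-allele count (0, 1 or 2) per individual,
    in the same individual order. The two SNPs may sit on different
    chromosomes, as in the cis-eQTL vs. trans-eQTL comparison.
    """
    n = len(geno_a)
    if n != len(geno_b) or n < 2:
        raise ValueError("need two equally long vectors with >= 2 individuals")
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_a * var_b)
```

Averaging this value over cis-eQTL & trans-eQTL pairs, and separately over cis-eQTL & random-SNP pairs, would give the two quantities being compared.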

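4. Functional annotation of QTLs

① We employed multiple layers of biological data to better define the molecular mechanisms of genetic regulatory effects.

② As expected, cis-e/sQTLs were significantly enriched in functional elements (e.g., 3'UTRs and open chromatin regions).

③ Hypomethylated regions across 13 tissues were enriched for cis-e/sQTLs relative to other tissues.

④ Topologically associated domains (TADs) enable chromatin interactions between distal regulatory regions and target promoters.

⑤ By examining Hi-C data from cattle lung tissue, we obtained TADs and significant Hi-C contacts, which are conserved across tissues [15]. Compared with random eGene-SNP pairs at matched distances, we observed a significantly higher percentage of eGene-eVariant pairs within TADs in most tissues (Fig. 5f). For example, APCS and its cis-eQTL peak (144 kb upstream of the TSS) were encompassed by a TAD and linked by a significant Hi-C contact, allowing its expression to be regulated distally (> 2 kb from the TSS).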
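The TAD comparison in item ⑤ reduces to asking, for each eGene-eVariant pair, whether the gene's TSS and the variant fall inside the same domain. A minimal sketch of that check follows; the data layout, function names and coordinates are all hypothetical.

```python
def same_tad(tads, chrom, pos_a, pos_b):
    """Return True if both positions lie inside one TAD on `chrom`.

    tads is a list of (chromosome, start, end) tuples with half-open
    intervals [start, end).
    """
    for tad_chrom, start, end in tads:
        if tad_chrom == chrom and start <= pos_a < end and start <= pos_b < end:
            return True
    return False


def fraction_in_tad(tads, pairs):
    """Fraction of (chromosome, tss, variant_pos) pairs sharing a TAD."""
    if not pairs:
        return 0.0
    hits = sum(same_tad(tads, chrom, tss, snp) for chrom, tss, snp in pairs)
    return hits / len(pairs)


# Hypothetical coordinates: one TAD, one pair inside it and one spanning its edge.
toy_tads = [("chr3", 1_000_000, 1_400_000)]
toy_pairs = [("chr3", 1_300_000, 1_156_000), ("chr3", 1_300_000, 1_500_000)]
toy_fraction = fraction_in_tad(toy_tads, toy_pairs)
```

Running the same function on distance-matched random eGene-SNP pairs gives the background fraction for the comparison.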

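5. eQTLs and complex trait associations

① The primary goal of this study is to provide a resource for elucidating the genetic and biological basis of agronomic traits in cattle. We therefore evaluated the e/sQTLs detected in each tissue for associations with four different agronomic traits.

② TWAS enhanced our ability to detect causal genes and to better understand the biological basis of these traits.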

Methods

Quantification of gene expression

① We downloaded 11,642 RNA-Seq runs (by July, 2019) from SRA (n = 11,513, https://www.ncbi.nlm.nih.gov/sra/) and BIGD databases (n = 129, https://bigd.big.ac.cn/bioproject/).

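② We first removed adaptors and low quality reads using Trimmomatic (v0.39) with parameters: adapters/TruSeq3SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36.

③ We filtered out samples with ≤ 500K clean reads, leaving 7,680 samples, and mapped clean reads to the ARS-UCD1.2 cattle reference genome [24] using the single or paired mapping modules of STAR (v2.7.0) with parameters outFilterMismatchNmax 3, outFilterMultimapNmax 10 and outFilterScoreMinOverLread 0.66.

④ We retained 7,264 samples with uniquely mapping rates ≥ 60% (mean 91.07%; range 60.44%-100%; mapping details in Table S1).

⑤ We then obtained normalized expression (TPM) of 27,608 Ensembl (v96) annotated genes using StringTie (v2.1.1), and extracted their raw read counts using featureCounts (v1.5.2).

⑥ Finally, we clustered the 7,264 samples based on log2(TPM + 1) using the hierarchical clustering method implemented in the R package dendextend, with distance = (1 - r), where r is the Pearson correlation coefficient. We excluded samples with obvious clustering errors (e.g., a sample labeled as liver that did not cluster with other liver samples), resulting in 7,180 samples for subsequent analysis.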
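The distance used for sample clustering in item ⑥ can be written out as a short sketch. The paper performed the clustering itself in R with dendextend; the code below only builds the (1 - r) distance matrix on log2(TPM + 1), and its function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def correlation_distance_matrix(tpm_by_sample):
    """Pairwise distance = 1 - r between samples on log2(TPM + 1).

    tpm_by_sample is a list of per-sample TPM lists over the same genes.
    """
    logged = [[math.log2(v + 1.0) for v in sample] for sample in tpm_by_sample]
    n = len(logged)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - pearson(logged[i], logged[j])
            dist[i][j] = dist[j][i] = d
    return dist
```

A sample labeled as liver that sits far, by this distance, from every other liver sample is the kind of clustering error that was excluded.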

Quantification of alternative splicing

We used Leafcutter (v0.2.9) to identify and quantify variable alternative splicing events of genes by leveraging information of junction reads (i.e., reads spanning introns) that were obtained from the STAR alignment. (The paper describes the Leafcutter processing steps and scripts; they are not listed here.)

Genotyping and imputation

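① Following the best-practices pipeline (default settings) recommended for the Genome Analysis Toolkit (GATK) (v4.0.8.1) [28], we called genotypes at known genomic variants from the 1000 Bull Genomes Project in each of the 7,180 high-quality RNA-Seq samples.

② Finally, we obtained 6,123 samples that were successfully genotyped and imputed. We filtered out variants with MAF < 0.05 and those with dosage R-squared (DR2) < 0.8, resulting in 3,824,444 SNPs for QTL mapping.

③ We then measured the genotype concordance rate between WGS SNPs and RNA-Seq/imputed SNPs across 13 tissues. Next, we extracted 153,913 LD-independent SNPs using plink (v1.90) [30] (--indep-pairwise 1000 5 0.2) and used them to run PCA on all 6,123 samples in EIGENSOFT (v7.2.1) [31]. We also used these independent SNPs to calculate the identity-by-state (IBS) distance between samples and thereby remove duplicated individuals.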
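Two of the steps above are simple enough to sketch directly: the MAF/DR2 variant filter from item ② and the IBS distance from item ③. The thresholds come from the text; the function names and the 0/1/2 genotype coding are assumptions made for illustration, and the real analysis was run with plink rather than code like this.

```python
def maf(genotypes):
    """Minor allele frequency from alternate-allele counts (0/1/2)."""
    freq = sum(genotypes) / (2.0 * len(genotypes))
    return min(freq, 1.0 - freq)


def passes_filter(genotypes, dr2, min_maf=0.05, min_dr2=0.8):
    """Keep a variant unless MAF < min_maf or imputation DR2 < min_dr2."""
    return maf(genotypes) >= min_maf and dr2 >= min_dr2


def ibs_distance(sample_a, sample_b):
    """IBS distance = 1 - mean proportion of alleles shared identical by state.

    Each sample is a list of 0/1/2 genotypes over the same SNPs. At one SNP
    two individuals share 2 - |a - b| of their 2 alleles by state.
    """
    shared = sum(2 - abs(a - b) for a, b in zip(sample_a, sample_b))
    return 1.0 - shared / (2.0 * len(sample_a))
```

Two RNA-Seq samples from the same animal should have an IBS distance close to zero, which is how duplicated individuals can be flagged.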

Allele specific expression (ASE)

We conducted ASE analysis using the GATK ASEReadCounter tool (v4.0.8.1)

DNA methylation analysis of WGBS data

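① We downloaded (in July 2019) 144 publicly available cattle WGBS datasets from 21 different tissues (Table S2).

② We first used FastQC (v0.11.2) and Trim Galore v0.4.0 (--max_n 15 --quality 20 --length 20 -e 0.1) to assess read quality and to filter low-quality reads, respectively.

③ We then mapped clean reads to the same cattle reference genome (ARS-UCD1.2) using Bismark (v0.14.5) [32] with default parameters.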

④ After deduplication of reads, we extracted methylation levels of cytosines using the bismark_methylation_extractor (--ignore_r2 6) function.

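⑤ Coverage of all WGBS data, calculated from clean reads, ranged from 5-fold to 47-fold, with a mean of 27.6-fold. We ultimately retained CpG sites covered by at least 5 reads for subsequent analyses. We clustered samples based on the DNA methylation levels of shared CpGs using hierarchical clustering and t-SNE.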
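The per-site coverage filter in item ⑤ can be sketched as follows. The input layout (methylated and unmethylated read counts per CpG) and the function name are assumptions for illustration, not the actual Bismark output format.

```python
def methylation_levels(cpg_counts, min_reads=5):
    """Per-CpG methylation level, keeping only sufficiently covered sites.

    cpg_counts maps (chromosome, position) to a tuple of
    (methylated_reads, unmethylated_reads). Sites covered by fewer than
    min_reads reads in total are dropped.
    """
    levels = {}
    for site, (meth, unmeth) in cpg_counts.items():
        depth = meth + unmeth
        if depth >= min_reads:
            levels[site] = meth / depth
    return levels


# Hypothetical counts: the second site has only 3 reads and is removed.
toy_levels = methylation_levels({
    ("chr1", 100): (8, 2),
    ("chr1", 250): (3, 0),
    ("chr1", 400): (0, 5),
})
```

Sample clustering would then use only the CpGs that survive this filter in every sample being compared.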

Identification of TAD and significant Hi-C contacts

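① To discover potential chromatin interactions between distal eVariants and their target eGenes, we identified TADs and Hi-C contacts from Hi-C data of cattle lung tissue.

② We trimmed adapter sequences and low-quality reads using Trim Galore (v0.4.0) (--max_n 15 --quality 20 --length 20 -e 0.1), yielding about 820 million clean reads.

③ We then mapped clean reads to the cattle reference genome (ARS-UCD1.2 [24]) using BWA. We applied HiCExplorer v3.4.1 to build a Hi-C contact matrix at 10 kb resolution, and identified TADs using hicFindTADs.

④ We retained TADs with FDR < 0.01 for linking eQTLs to eGenes. We further employed HiC-Pro (v2.11.4) to call Hi-C contacts from the Hi-C data at 10 kb resolution.

⑤ Briefly, HiC-Pro aligned clean reads to the cattle reference genome using Bowtie2 (v2.3.5) [34]. After building a contact matrix, HiC-Pro generated intra- and inter-chromosomal maps and normalized them using the original ICE normalization algorithm. We considered Hi-C contacts with FDR < 0.05 as significant.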

Tissue-specificity analysis of gene expression, alternative splicing and DNA methylation

① To quantify tissue-specific expression of genes, we computed a t-statistic for each gene in each of 114 tissues.

② We scaled the log2-transformed expression (i.e., log2TPM) of genes to have a mean of zero and variance of one within each tissue.

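③ We then fitted a linear model for each gene in each tissue using the least-squares method.

④ To detect tissue-specific alternative splicing, we used Leafcutter to analyze differential intron excision by comparing samples from the target tissue with samples from the remaining tissues.

⑤ We used the Benjamini-Hochberg method (FDR) to control for multiple testing.

⑥ For DNA methylation, we focused on DNA methylation levels of gene promoters (from 1,500 bp upstream to 500 bp downstream of the TSS) and gene bodies (from TSS to TES), calculated with a weighted methylation method using the roimethstat function in MethPipe (v3.4.3).

⑦ We computed a t-statistic for the promoter of each gene using the same method as in the tissue-specific expression analysis.

⑧ We also detected tissue-specific methylation regions in genome-wide mode using SMART2 with parameters -t DeNovoDMR -MR 0.5 -AG 1.0 -MS 0.5 -ED 0.2 -SM 0.6 -CD 500 -CN 5 -SL 20 -PD 0.05.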
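The t-statistic in items ① to ③ can be illustrated with a simplified sketch. It assumes the linear model has a single 0/1 predictor marking the target tissue and no further covariates; neither detail is given in these notes. Under that assumption the least-squares t-statistic of the slope equals the pooled-variance two-sample t-statistic, which is what the code computes. The function name is hypothetical.

```python
import math


def tissue_specificity_t(expression, tissues, target):
    """t-statistic of one gene for target-tissue vs. all other samples.

    expression: scaled log2 expression of a single gene, one value per sample.
    tissues: tissue label of each sample, in the same order.
    """
    inside = [e for e, t in zip(expression, tissues) if t == target]
    outside = [e for e, t in zip(expression, tissues) if t != target]
    n1, n0 = len(inside), len(outside)
    if n1 == 0 or n0 == 0 or n1 + n0 < 3:
        raise ValueError("need samples in both groups and at least 3 in total")
    mean1 = sum(inside) / n1
    mean0 = sum(outside) / n0
    ss1 = sum((e - mean1) ** 2 for e in inside)
    ss0 = sum((e - mean0) ** 2 for e in outside)
    pooled_var = (ss1 + ss0) / (n1 + n0 - 2)  # residual variance of the fit
    if pooled_var == 0:
        raise ValueError("no residual variance; t-statistic is undefined")
    return (mean1 - mean0) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n0))
```

A large positive value marks a gene as specifically expressed in the target tissue. Applying the same function to promoter methylation levels corresponds to item ⑦.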

Covariate analysis for QTL discovery

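To account for hidden batch effects and other technical/biological sources of transcriptome-wide variation in gene expression, we estimated latent covariates in each tissue using the probabilistic estimation of expression residuals (PEER) method.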

cis-eQTL mapping

Meta-analysis of cis-eQTLs of muscle samples from three sub-species

We then conducted a meta-analysis to integrate cis-eQTL results from three sub-species using the METAL tool [45].
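METAL offers both a sample-size-weighted and an inverse-variance-weighted scheme, and these notes do not record which one was used. As an illustration only, the sketch below implements the inverse-variance-weighted fixed-effects version for one eGene-eVariant pair. The function name and inputs are hypothetical.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effects meta-analysis of one association across studies.

    betas and standard_errors hold the per-study effect estimate and its
    standard error (here, one entry per sub-species).
    Returns (combined_beta, combined_se, z_score).
    """
    weights = [1.0 / (se * se) for se in standard_errors]
    total_weight = sum(weights)
    combined_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    combined_se = math.sqrt(1.0 / total_weight)
    return combined_beta, combined_se, combined_beta / combined_se
```

Studies with smaller standard errors get more weight, so a sub-species with many samples dominates the combined estimate.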

cis-sQTL mapping

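In each of these 24 tissues, we applied a linear regression model, implemented in FastQTL [41], to test associations between genotypes (MAF > 1%) within 1 Mb up- and downstream of target intron clusters and the corresponding intron excision ratios.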
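The test described above has two parts: choosing the variants in the cis window, and regressing the intron excision ratio on genotype. The following is a minimal sketch of both, not FastQTL itself: it omits covariates and the permutation scheme, and the function names and data layout are illustrative assumptions.

```python
import math


def cis_variants(variants, cluster_start, cluster_end,
                 window=1_000_000, min_maf=0.01):
    """Select variants within `window` bp of an intron cluster with MAF > min_maf.

    variants is a list of (position, genotypes) with 0/1/2 genotypes.
    """
    selected = []
    for pos, genos in variants:
        freq = sum(genos) / (2.0 * len(genos))
        in_window = cluster_start - window <= pos <= cluster_end + window
        if in_window and min(freq, 1.0 - freq) > min_maf:
            selected.append((pos, genos))
    return selected


def ols_slope(genotypes, phenotype):
    """Least-squares slope of phenotype on genotype and its t-statistic."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotype))
    beta = sxy / sxx
    alpha = mean_p - beta * mean_g
    rss = sum((p - alpha - beta * g) ** 2 for g, p in zip(genotypes, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / se if se > 0 else math.copysign(math.inf, beta)
    return beta, t_stat
```

The MAF filter also guarantees that every selected variant is polymorphic, so `sxx` is never zero when `ols_slope` is applied to the output of `cis_variants`.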

trans-eQTL mapping

WGCNA co-expression network analysis and estimation of π1 statistics

We applied weighted gene co-expression network analysis (WGCNA) to obtain the gene co-expression network in each of the 24 tissues used in eQTL mapping. We estimated the sharing of cis-eQTLs and cis-sQTLs among tissues using the π1 statistic, as described by the human GTEx Consortium (2015).
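The π1 statistic is the estimated proportion of non-null tests: eGene-eVariant pairs significant in one tissue are looked up in a second tissue, and π1 is computed from their p-values there. The sketch below is a simplified single-threshold version of Storey's estimator. The choice of λ = 0.5 and the function name are assumptions, and published analyses typically use the qvalue package, which smooths over several thresholds.

```python
def pi1(p_values, lam=0.5):
    """Estimate the proportion of non-null tests from a list of p-values.

    Null p-values are uniform, so the fraction above `lam`, rescaled by
    1 / (1 - lam), estimates the null proportion pi0; pi1 = 1 - pi0.
    """
    m = len(p_values)
    if m == 0 or not 0.0 <= lam < 1.0:
        raise ValueError("need at least one p-value and 0 <= lam < 1")
    pi0 = sum(1 for p in p_values if p > lam) / (m * (1.0 - lam))
    return 1.0 - min(1.0, pi0)
```

A π1 near 1 means that almost all cis-QTLs found in the discovery tissue also have an effect in the second tissue. A value near 0 points to tissue-specific regulation.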

TWAS analyses

① To associate gene expression in a tissue with complex traits, we conducted TWAS analysis using S-PrediXcan [48] by prioritizing GWAS summary statistics for 43 agronomic traits of economic importance in cattle, including reproduction (n = 11), production (milk-relevant; n = 6), body type (n = 17), and health (immune/metabolic-relevant; n = 9).

② For body conformation (type), reproduction, and production traits, we conducted a single-marker GWAS by fitting a linear mixed model in 27,214 U.S. Holstein bulls as described previously.

③ We built a nested cross-validated elastic net prediction model using genotype and expression data.

④ We visualized the Manhattan plots of P-values of all tested genes using ggplot2 (v3.3.2) in R.
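S-PrediXcan combines the SNP weights of an expression prediction model (here, the elastic net model of item ③) with GWAS z-scores to produce a gene-level z-score. The sketch below follows the summary-statistic formula described for S-PrediXcan, Z_g ≈ Σ_l w_l (σ_l / σ_g) z_l with σ_g² = wᵀΓw, where Γ is the SNP covariance matrix from a reference panel. The function name and the plain-list inputs are illustrative and are not the tool's API.

```python
import math


def spredixcan_z(weights, gwas_z, snp_cov):
    """Gene-level association z-score from summary statistics.

    weights: prediction-model weight of each SNP for this gene.
    gwas_z:  GWAS z-score (beta / se) of each SNP for the trait.
    snp_cov: covariance matrix of SNP genotypes (list of lists), same order.
    """
    n = len(weights)
    # Variance of the predicted expression: w' * Gamma * w.
    var_gene = sum(weights[i] * snp_cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    if var_gene <= 0:
        raise ValueError("predicted expression has no variance")
    sigma_gene = math.sqrt(var_gene)
    return sum(w * math.sqrt(snp_cov[i][i]) / sigma_gene * z
               for i, (w, z) in enumerate(zip(weights, gwas_z)))
```

With a single SNP in the model the gene-level z-score equals that SNP's GWAS z-score (up to the sign of its weight), which makes a convenient sanity check.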

Other downstream bioinformatic analysis

① We used Genomic Association Tester (GAT v1.3.4) with 10,000 permutations to estimate the functional enrichment of QTLs in particular genomic regions, chromatin states, methylation elements, and WGCNA co-expression modules. We considered enrichments with FDR (adjusted P-values with Benjamini-Hochberg method) < 0.05 as significant.

② We used the R package, ClusterProfiler, to annotate function of genes based on the Gene ontology database from Bioconductor (org.Bt.eg.db v3.11.4). We considered GO terms with FDR < 0.05 as significant.

③ We obtained complex traits/diseases that were associated with a particular gene in humans using the Region PheWAS function in GeneATLAS database (http://geneatlas.roslin.ed.ac.uk/) with set region of ±50kb and P-value threshold of 10^-8.
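The FDR thresholds in items ① and ② rely on Benjamini-Hochberg adjusted P-values. A minimal sketch of the standard adjustment follows, for illustration only; the analyses themselves used the implementations in GAT and R.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Enrichments or GO terms whose adjusted value is below 0.05 are called significant.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```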

Data availability statement

All raw sequencing data analyzed in this study are publicly available from the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/). All processed results and script code are available at https://cgtex.roslin.ed.ac.uk/.