A walkthrough of the Roadmap paper

Integrative analysis of 111 reference human epigenomes

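Because everything here reflects my own understanding and may not be correct, I usually copy the original text alongside my notes. It makes the post longer, but it helps comprehension. This is my first time writing an article here, so please forgive any mistakes.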

The key paper of the Roadmap project: https://www.nature.com/articles/nature14248#Sec41

Computational tools and methods:https://www.nature.com/articles/nature14316

Intro

We integrate information about histone marks, DNA methylation, DNA accessibility and RNA expression to infer high-resolution maps of regulatory elements annotated jointly across a total of 127 reference epigenomes spanning diverse cell and tissue types.

In addition, we study the role of regulatory regions in human disease by relating our epigenomic annotations to genetic variants associated with common traits and disorders.

Specific highlights of our findings are given below.

  • Histone mark combinations show distinct levels of DNA methylation and accessibility, and predict differences in RNA expression levels that are not reflected in either accessibility or methylation.
  • Megabase-scale regions with distinct epigenomic signatures show strong differences in activity, gene density and nuclear lamina associations, suggesting distinct chromosomal domains.
  • Approximately 5% of each reference epigenome shows enhancer and promoter signatures, which are twofold enriched for evolutionarily conserved non-exonic elements on average.
  • Epigenomic data sets can be imputed at high resolution from existing data, completing missing marks in additional cell types, and providing a more robust signal even for observed data sets.
  • Dynamics of epigenomic marks in their relevant chromatin states allow a data-driven approach to learn biologically meaningful relationships between cell types, tissues and lineages.
  • Enhancers with coordinated activity patterns across tissues are enriched for common gene functions and human phenotypes, suggesting that they represent coordinately regulated modules.
  • Regulatory motifs are enriched in tissue-specific enhancers, enhancer modules and DNA accessibility footprints, providing an important resource for gene-regulatory studies.
  • Genetic variants associated with diverse traits show epigenomic enrichments in trait-relevant tissues, providing an important resource for understanding the molecular basis of human disease.

Reference epigenome mapping across tissues and cell types

The REMCs generated a total of 2,805 genome-wide data sets, including 1,821 histone modification data sets, 360 DNA accessibility data sets, 277 DNA methylation data sets, and 166 RNA-seq data sets, encompassing a total of 150.21 billion mapped sequencing reads corresponding to 3,174-fold coverage of the human genome.

Here, we focus on a subset of 1,936 data sets comprising 111 reference epigenomes, which we define as having a core set of five histone modification marks

The five marks consist of: histone H3 lysine 4 trimethylation (H3K4me3), associated with promoter regions; H3K4me1, associated with enhancer regions; H3 lysine 36 trimethylation (H3K36me3), associated with transcribed regions; H3 lysine 27 trimethylation (H3K27me3), associated with Polycomb repression; and H3 lysine 9 trimethylation (H3K9me3), associated with heterochromatin regions.

Selected epigenomes also contain a subset of additional epigenomic marks, including: acetylation marks H3K27ac and H3K9ac, associated with increased activation of enhancer and promoter regions.

We computed several quality control measures (the paper lists the quality control metrics for the data sets):

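  • the number of distinct uniquely mapped reads;
  • the fraction of mapped reads overlapping areas of enrichment (the classic FRiP, the fraction of reads in peaks);
  • genome-wide strand cross-correlation, a cross-correlation quality metric; for background see: https://www.baidu.com/link?url=xS6iz8-hJ0J6sNoxpxVgRsPeU52ZMXCGanmaWmyjM04Kw1xxiwqiAhpIwvNJS910&wd=&eqid=c0e6730200024bea000000065e7dd6f5
    The larger the NSC (normalized strand cross-correlation coefficient), the better the enrichment. An NSC below 1.1 indicates weak enrichment, and a value of 1 indicates no enrichment.
    An NSC somewhat below 1.05 points to a low signal-to-noise ratio or very few peaks. This may be real biology, for example a factor with only a few binding sites in a particular tissue type, or it may simply reflect poor data quality.
  • inter-replicate correlation;
  • multidimensional scaling of data sets from different production centres (comparing data produced by different centres);
  • correlation across pairs of data sets;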
  • consistency between assays carried out in multiple mapping centres
  • read mapping quality for bisulfite-treated reads
  • agreement with imputed data
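To make the FRiP metric in the list above concrete, here is a minimal sketch of how it could be computed. It assumes a single chromosome, reads summarized by one coordinate each, and peaks given as half-open `(start, end)` intervals; the function names and the toy numbers are hypothetical and are not taken from the paper's pipeline.

```python
from bisect import bisect_right


def merge_intervals(peaks):
    """Merge overlapping half-open (start, end) intervals into a sorted list."""
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak."""
    if not read_positions:
        return 0.0
    merged = merge_intervals(peaks)
    starts = [start for start, _ in merged]
    in_peaks = 0
    for pos in read_positions:
        # Index of the last merged peak starting at or before this read.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < merged[i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions)


# Toy data (hypothetical): 8 reads, 3 peaks of which two overlap.
reads = [5, 12, 18, 40, 55, 90, 120, 130]
peaks = [(10, 20), (15, 30), (100, 125)]
score = frip(reads, peaks)
print(f"FRiP = {score:.3f}")
```

Merging the peaks first avoids counting a read twice when peaks overlap; a real pipeline would work per chromosome on aligned read files rather than on plain integer lists.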

Outlier data sets were flagged, removed or replaced, and lower-coverage data sets were combined where possible (see Methods).

Roadmap_ref_ChromStates.jpg

Chromatin states, DNA methylation and DNA accessibility

15-state model

As a foundation for integrative analysis, we used a common set of combinatorial chromatin states across all 111 epigenomes, plus 16 additional epigenomes generated by the ENCODE project (127 epigenomes in total), using the core set of five histone modification marks that were common to all.

We trained a 15-state model consisting of 8 active states and 7 repressed states that were recurrently recovered and showed distinct levels of DNA methylation, DNA accessibility, regulator binding and evolutionary conservation.

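The authors model chromatin states for the 127 epigenomes and tune parameters (for example shift-size, see Methods). They use 60 high-quality epigenomes as the training set to build the 15-state model and then apply it to the remaining data sets (there is also an expanded 18-state model).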

multiple-statesDynamic.jpg

Detailed description of the states:

15-states_model.jpg

Enhancer and promoter regions show enrichment for evolutionarily conserved non-exonic regions (panel f of the figure above).

Enhancer and promoter states covered approximately 5% of each reference epigenome on average, and showed enrichment for evolutionarily conserved non-exonic regions

Evolutionary conservation analysis of chromatin states in each cell type for conserved elements (GERP), using all conserved elements (a,b), or only non-exonic conserved elements (c,d) for both the 15-state model (a,c) and the 18-state model (b,d)

GERP_overlap.jpg

On the robustness of the 15-state model:

model_robust.jpg

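The 15-state model above was learned jointly across the 111 reference epigenomes. To assess its robustness, the authors applied ChromHMM to each of the 111 reference epigenomes separately, learning 15 states each, and then clustered the resulting 1,680 state emission probability vectors (presumably 111 × 15 + 15 from the joint model). The independently learned states cluster very cleanly (still dominated by the same 15 states), with some variation between data sets within a state. See the Methods for details: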

The trained model was then used to compute the posterior probability of each state for each genomic bin in each reference epigenome. The regions were labelled using the state with the maximum posterior probability.
(Maximum posterior probability: for each sample, here a genomic bin, the model gives the posterior probability of each of the 15 states given the observed data, and the state with the highest posterior is taken as the predicted state of that bin.)
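The labelling rule quoted above amounts to an argmax over the per-bin posterior. The sketch below illustrates only that step, not how ChromHMM computes the posteriors; the three state names are a subset chosen for illustration and the probabilities are made up.

```python
# Illustrative subset of state names; the real model has 15 states.
STATE_NAMES = ["TssBiv", "Enh", "Quies"]


def label_bins(posteriors, state_names):
    """Assign each bin the state with the highest posterior probability."""
    labels = []
    for probs in posteriors:
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(state_names[best])
    return labels


# Hypothetical posteriors: one row per 200 bp bin, one column per state.
posteriors = [
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.05, 0.05, 0.90],
]
labels = label_bins(posteriors, STATE_NAMES)
print(labels)
```

Each row sums to 1 because it is a probability distribution over states for a single bin; ties, if any, go to the first state in the list.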

There are also two new clusters:

This analysis revealed two new clusters (red crosses) which are not represented in the 15 states of the jointly learned model: ‘HetWk’, a cluster showing weak enrichment for H3K9me3; and ‘Rpts’, a cluster showing H3K9me3 along with a diversity of other marks, and enriched in specific types of repetitive

Relationship between different modalities

We used chromatin states to study the relationship between histone modification patterns, RNA expression levels, DNA methylation and DNA accessibility.

We found low DNA methylation and high accessibility in promoter states, high DNA methylation and low accessibility in transcribed states, and intermediate DNA methylation and accessibility in enhancer states.

relationship.jpg

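As can be seen, for highly expressed genes the differences in DNA methylation are more pronounced (c), and highly expressed genes lie more often near strong enhancers (H3K27ac + H3K4me1).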

Chromatin states sometimes captured differences in RNA expression that are missed by DNA methylation or accessibility. For example, TxFlnk, Enh, TssBiv and BivFlnk states show similar distributions of DNA accessibility but widely differing enrichments for expressed genes. In other words, chromatin states can capture finer differences in RNA expression that DNA methylation or accessibility miss; likewise, two states may have comparable methylation levels yet differ widely in accessibility and expression.

enrichment.jpg

In addition, the authors find that intermediate methylation may represent a special chromatin state:

Intermediate methylation signatures were equally strong within tissue samples, peripheral blood and purified cell types, suggesting that intermediate methylation is not simply reflecting differential methylation between cell types, but probably reflects a stable state of cell-to-cell variability within a population of cells of the same type.

Epigenomic differences during lineage specification

Next the authors examine the dynamics of DNA methylation across cell lineages.

We next studied the relationship between DNA methylation dynamics and histone modifications across 95 epigenomes with methylation data, extending previous studies that focused on individual lineages.

distribution_me.jpg

We also studied DNA methylation changes in three different systems.

First, we studied DNA methylation changes during embryonic stem (ES) cell differentiation. We identified regions that lost methylation (differentially methylated regions (DMRs)) upon differentiation of ES cells (E003) to mesodermal (E013), endodermal (E011) and ectodermal (E012) lineages (Fig. 4h). Each lineage showed a largely distinct set of 2,200–4,400 DMRs that are enriched for distinct transcription factor binding events (Fig. 4h, right column), consistent with their distinct developmental regulation. Upon further differentiation, ectodermal DMRs remained hypomethylated in three neural progenitor populations, despite the usage of distinct human ES cell (hESC) lines, and mesodermal and endodermal DMRs remained highly methylated (Fig. 4h), highlighting the lineage-specific nature of changes in DNA methylation during early differentiation.

DMRs.jpg

Panel h shows the enrichment of specific transcription factors in specific DMRs and at specific developmental stages.

Second, we studied DNA methylation changes associated with breast epithelia differentiation

We found differences in nearest-gene enrichments, and differences in motif density (luminal DMRs show greater motif density for 51 transcription factors and lower density for 0 transcription factors).

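Having explored the dynamics of DMRs, the authors go on to ask what drives these dynamics and the differential methylation: tissue environment or developmental origin.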

Third, we asked whether tissue environment or developmental origin is the primary driving factor in DNA methylation differences observed in more differentiated cell types using epigenomes from skin cell types (keratinocytes E057/058, melanocytes E059/E061 and fibroblasts E055/056) that share a common tissue environment but possess distinct embryonic origins (surface ectoderm, neural crest and mesoderm, respectively).

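The authors find that these cells, although they share a tissue environment, overlap very little in their methylation and histone modification profiles; instead, each is more similar to cells of the same developmental origin. For example, the DMRs shared by keratinocytes and breast cells, both derived from surface ectoderm, point to a common regulatory network with shared signalling pathways and structural components.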

Keratinocytes shared 1,392 (18%) of DMRs with surface ectoderm derived breast cell types (hypergeometric P value < 10^-26), and 97% of these were hypomethylated. These shared DMRs were enriched for regulatory elements and cell-type-relevant genes, suggesting a common gene-regulatory network and shared signalling pathways and structural components. These results suggest that common developmental origin can be a primary determinant of global DNA methylation patterns, and sometimes supersedes the immediate tissue environment in which they are found.
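For readers unfamiliar with the hypergeometric P value mentioned in the quote, the sketch below shows the kind of overlap test it refers to: given a background of N regions, K DMRs in one cell type and n DMRs in another, how likely is an overlap of at least k by chance? The quoted text does not give the background size or the other set size, so the numbers here are toy values and the function name is hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K successes, n draws)."""
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms drop out of the sum automatically.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy example (hypothetical numbers): 20 candidate regions, 5 DMRs in
# cell type A, 5 DMRs in cell type B, and 3 of them shared.
p_value = hypergeom_sf(k=3, N=20, K=5, n=5)
print(f"P(overlap >= 3) = {p_value:.4f}")
```

Exact integer arithmetic keeps the tail sum accurate even for very small P values; for genome-scale counts a library implementation working in log space would be faster.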

Most variable states and distinct chromosomal domains

Next the authors examine the variability of each chromatin state across cells and tissues.

We next sought to characterize the overall variability of each chromatin state across the full range of cell and tissue types.

coverage.jpg

As can be seen, Quies is the most constitutive, while states such as EnhG and TxFlnk are relatively tissue specific.

Frequency matrix of switching between states

We next studied the relative frequency with which different chromatin states switch to other states across different tissues and cell types

relative_frequency.jpg

This revealed a relative switching enrichment between active states and repressed states, consistent with activation and repression of regulatory regions. The only exception was significant switching between transcribed states and active promoter and enhancer states, possibly due to alternative usage of promoters and enhancers embedded within transcribed elements.
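As a rough illustration of what a switching-frequency matrix involves, the sketch below counts, for every pair of epigenomes, how often the same genomic bin carries two different states, and normalizes the counts. This is a simplified stand-in under my own assumptions: the paper's exact normalization and enrichment calculation are not described in the excerpt, and the state labels and data are toy values.

```python
from collections import Counter
from itertools import combinations


def switch_frequencies(labels_by_epigenome):
    """Relative frequency of each unordered state pair among all switches.

    labels_by_epigenome: list of equal-length lists, one state label per bin.
    """
    counts = Counter()
    for labels_a, labels_b in combinations(labels_by_epigenome, 2):
        for state_a, state_b in zip(labels_a, labels_b):
            if state_a != state_b:
                counts[tuple(sorted((state_a, state_b)))] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


# Toy data (hypothetical): 3 epigenomes, 4 bins each.
epigenomes = [
    ["Enh", "Quies", "Tx", "TssA"],
    ["Quies", "Quies", "Tx", "TssA"],
    ["Enh", "Quies", "Enh", "TssA"],
]
freqs = switch_frequencies(epigenomes)
for pair, freq in sorted(freqs.items()):
    print(pair, freq)
```

A fuller analysis would compare these observed frequencies with what the genome-wide abundance of each state predicts, which is what turns raw counts into an enrichment.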

We found that enhancers and promoters maintained their identity, except for a small subset of regions switching between enhancer signatures and promoter signatures. Luciferase assays showed that these regions indeed possess both enhancer and promoter activity, consistent with their epigenomic marks.

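The authors find that switching between active regulatory regions and repressed regions is clearly enriched. There is also switching from transcribed regions to active promoter and enhancer states, possibly because some promoters and enhancers are embedded within transcribed regions.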

For details see this article: Conserved role of intragenic DNA methylation in regulating alternative promoters: https://doi.org/10.1038/nature09165 (a Nature paper, worth reading)

In addition, some regions switch between promoter activity and enhancer activity.

For details see: Integrative analysis of haplotype-resolved epigenomes across human tissues (already read; notes to be tidied up later).

https://www.nature.com/articles/nature14217#article-info

The highlight of that article is its allelic biased enhancer-gene pairs.

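By integrating enhancer and mRNA expression data, candidate target genes of enhancers can be obtained through co-expression analysis. For a given co-expressed enhancer-gene pair there are at least three possible relationship models: (1) causal, where changes in enhancer expression cause differential expression of the gene; (2) reactive, where the gene lies upstream of the enhancer; (3) co-responsive, where both the enhancer and the gene respond to changes in some other molecule. The discussion here takes the first model and brings in eQTL analysis. The rationale is as follows: a single-nucleotide polymorphism (SNP) that affects enhancer activity will affect the expression of the enhancer's downstream target gene, which makes that SNP (or a nearby SNP in linkage) an eQTL for the target gene. For such co-expressed enhancer-gene pairs, Hi-C data are used to assess whether the causal relationship reflects direct regulation.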

Source: http://www.360doc.com/content/19/0821/17/65172408_856276298.shtml

On enhancers and their clinical value, see: http://www.360doc.com/content/18/0413/16/45954995_745357589.shtml

A related data-mining article: https://www.sohu.com/a/230491180_177233

Motif clustering: also mentioned in that article's methods, and similar to this question: https://www.biostars.org/p/140532/ where the poster wants to cluster sequences first and then find characteristic motifs.

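HaploSeq allows clinicians to determine whether two mutations lie on the same chromosome or on different chromosomes, which helps with risk assessment. It quickly determines which genetic variants co-occur on the same chromosomal segment and therefore come from the same parent.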

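Reference article: HapCUT, a key technique for Hi-C phasing: http://wap.sciencenet.cn/blog-2970729-1175790.html?mobile=1 It points out that traditional phasing methods can phase only a subset of heterozygous variants and cannot build genome-scale haplotype blocks: http://www.360doc.com/content/19/0423/15/52645714_830825887.shtml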

Reference: Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing: https://www.nature.com/articles/nbt.2728#article-info

Compartment clusters

The authors use 2 Mb bins and examine how the states are distributed and overlap at this resolution; the individual clusters are shown in panel d of the figure above.

While chromatin states were defined at nucleosome resolution (200 bp), we also studied the overall co-occurrence of chromatin states across tissues at a larger resolution (2 Mb) to recognize higher-order properties

It can be seen that the part with active enhancer regions (c1-c6) separates clearly from the remaining clusters, consistent with the identification of two large chromatin conformation compartments, and each compartment can be further divided by state into several subdivisions.

These subdivisions were based on average state density across a large diversity of cell types and showed strong differences in gene density,
CpG island occupancy, lamina association and cytogenetic bands (Fig. 5d), suggesting that they represent stable chromosomal features.

How the heat map in the figure is computed: the state fraction averaged over all samples (each column is a bin, each row is a state).
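A minimal sketch of this kind of calculation, assuming per-bin state labels are already available for each epigenome: group the 200 bp bins into larger windows and compute the fraction of each state per window, pooled over epigenomes. With 200 bp bins a 2 Mb window holds 10,000 bins; the toy example uses 2 bins per window, and the function name and data are hypothetical.

```python
from collections import Counter


def state_density(labels_by_epigenome, bins_per_window):
    """Fraction of each state per large window, pooled across epigenomes.

    Because every epigenome contributes the same number of bins to a window,
    pooling the counts equals averaging the per-epigenome fractions.
    """
    n_bins = len(labels_by_epigenome[0])
    densities = []
    for start in range(0, n_bins, bins_per_window):
        counts = Counter()
        for labels in labels_by_epigenome:
            counts.update(labels[start:start + bins_per_window])
        total = sum(counts.values())
        densities.append({state: n / total for state, n in counts.items()})
    return densities


# Toy data (hypothetical): 2 epigenomes, 4 bins, windows of 2 bins.
epigenomes = [
    ["Enh", "Enh", "Quies", "Quies"],
    ["Enh", "Quies", "Quies", "Quies"],
]
density = state_density(epigenomes, bins_per_window=2)
print(density)
```

The resulting per-window state profiles are what one would then cluster to obtain groups of windows with similar composition.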

Relationships between marks and lineages

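Next the authors perform hierarchical clustering of the tissues and cell types based on histone marks. One interesting observation: ES-derived cells still mostly cluster with ES and iPS cells rather than with the tissues they are differentiating into, which suggests that, relative to somatic tissue, they remain closer to a pluripotent state.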

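Besides using a tree to measure the similarity of cells and tissues, the authors also consider other approaches, such as a similarity matrix and an MDS plot (a dimensionality-reduction method similar to PCA; MDS works from pairwise distances, whereas PCA works from the covariance or correlation structure, and Euclidean distance is a natural fit for measuring similarity here). They also compare the results obtained with the signal of different marks: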

similarity_matrix.jpg
MDS_similarity.jpg

To save screen space, only part of the figures is shown here.

With the approaches above, different marks differ in the similarities they capture: for example, immune cell similarities and pluripotent cell similarities are each captured by analyses of different marks.

Imputation and completion of epigenomic data sets

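Imputation and completion: not every epigenome has data for every mark. Here the authors presumably use the correlation between different marks within each cell type, together with the distribution of the same mark across cell types, to predict the missing marks and thereby complete the signal tracks.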

Of course, they also compare annotations and captured cell type relationships between imputed and observed data. The agreement is good, which suggests that imputation and completion are reliable.

On chromatin states: with a 25-state model, enhancer states can be subdivided further, revealing more about gene-expression regulation and human disease.

Enhancer modules and their putative regulators

We clustered enhancer-only elements (Enh, EnhBiv, EnhG) into 226 enhancer modules of coordinated activity, promoter-only elements into 82 promoter modules and promoter/enhancer ‘dyadic’ elements into 129 modules, enabling us to distinguish ubiquitously active, lineage-restricted and tissue specific modules for each group.

Module analysis of regulatory elements is also common in bioinformatics. In cancer studies, for example, one often works with signatures, which can be gene sets or mutation sets depending on the biological context. Members that share a module or signature tend to take part in the same biological function. Here the authors cluster enhancer and promoter elements into modules based on their distribution and activity across cell lines and tissues.

This step builds on the completion from the previous step; more complete data may give better statistical power when testing GO terms.

regulatory_modules.jpg

The figure above shows proximal gene enrichments for each module using gene ontology (GO) biological process (b) and human phenotypes (c), that is, a GO analysis of the genes nearest to each module.

The genome sequence of enhancers in the same module showed substantial enrichment for sequence motifs associated with diverse transcription factors. Motif analysis of the enhancers within each module reveals enrichment for many TF motifs, implying that they are co-regulated sets; upstream regulators might be identified on this basis.

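The next step is to work out which motifs correspond to activating TFs and which to repressive TFs. Doing this well requires combining gene expression data to find enhancer-gene pair patterns. If a module's regulators happen to be tissue-restricted, those regulators can be used to define the module.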

Linking-regulators-tissue.jpg

Impact of DNA sequence and genetic variation

Approaching the end, the analysis moves to the finer sequence level: which variants (SNPs) and alleles are associated with disease.

Motifs in the sequence can be used to predict the marks:

Using the area under the receiver operating curve (AUROC), we found between 71% predictive power for H3K4me1 peaks and 98% for H3K4me3 peaks (average of 85% across six marks and methylation-depleted regions). Prediction performance is measured with the ROC curve and its AUC.
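As a reminder of what AUROC measures, here is a small sketch using its rank interpretation: the probability that a randomly chosen positive example (say, a region that is a peak) receives a higher score than a randomly chosen negative one, with ties counted as one half. The scores and labels are toy values, not results from the paper, and the pairwise loop is written for clarity rather than speed.

```python
def auroc(scores, labels):
    """AUROC as P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): motif-based scores for 5 regions, 1 = peak.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0]
value = auroc(scores, labels)
print(f"AUROC = {value:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is how the 71% and 98% figures in the quote should be read.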

As an example of a boundary enrichment, H3K4me3 peaks were flanked by motifs consisting of a continuous stretch of A and T followed by a G and C, which may have a role in nucleosome positioning or recruiting promoter-associated transcription factors, such as nuclear receptors. Enhancer and promoter predictive motifs were enriched in high-resolution DNase hypersensitive sites. This example describes the motif features at the boundaries of H3K4me3 peaks.

Second, we studied how sequence variants between the two alleles of the same individual can lead to allelic biases in histone modifications, DNA methylation and transcript levels. On allelic bias, see the corresponding paper https://www.nature.com/articles/nature14217#article-info; I took notes on its methods, and a more detailed haplotype methodology paper is also referenced at the end of this post.

For allele-biased genes, the corresponding findings are: allelic epigenomic modifications in promoters (71%) and Hi-C-linked enhancers (69%).

Trait-associated variants enrich in tissue-specific marks

This section uses a typical GWAS analysis. According to previous studies, many disease-associated SNPs fall within regulatory elements.

For example, variants associated with metabolic disease are enriched in liver enhancer marks.

trait.jpg

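In the figure above, each row is a disease with its PubMed ID and each column is a cell line; the colour scale presumably reflects the enrichment of the associated variants.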

Appendix: enhancer and promoter databases

Many related databases came up while I was reading the references, so I have collected them below for now.

FANTOM:https://fantom.gsc.riken.jp/

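The full name is Functional Annotation of the Mammalian Genome. It is an international research project founded in 2000, originally aimed at functionally annotating full-length mouse cDNA sequences. As it developed, its scope kept expanding at the transcriptome level. The main technology used in the project is Cap Analysis of Gene Expression (CAGE), developed by RIKEN, whose advantage is higher sensitivity in measuring gene expression levels.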

Question: FANTOM5 Promoter Atlas https://www.biostars.org/p/101956/

Locating enhancers with FANTOM5: http://www.360doc.com/content/19/0821/17/65172408_856276298.shtml

CAGE-TSSchip: promoter-based expression profiling using the 5'-leading label of capped transcripts:https://genomebiology.biomedcentral.com/articles/10.1186/gb-2007-8-3-r42

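Sometimes elements with clear enhancer marks also show enrichment of CAGE signal, indicating that they have potential promoter activity as well, for example cis-regulatory elements with dynamic signatures (cREDS); see the article Integrative analysis of haplotype-resolved epigenomes across human tissues for details.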

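The vast majority of genes have two or more transcription start sites, and different start sites subject a gene to regulation by different 5' untranslated regions (5'UTRs).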

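Different 5'UTR sequences may contain entirely different functional elements, so different start sites make a gene's expression respond to entirely different signals. The same gene may be controlled by different promoters, leading to expression differences that may contribute to certain diseases.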

CAGE-seq (Cap Analysis of Gene Expression and deep Sequencing) can identify all TSSs in mRNA; it does so by identifying capping sites.

VISTA enhancer browser: https://enhancer.lbl.gov/z (a data set of in vivo validated enhancer activity)

Reference: https://www.cnblogs.com/yahengwang/p/11228108.html

Multi-omics joint analysis with Matrix eQTL: https://www.jianshu.com/p/6e6d54d7483e (an R package worth exploring)

RepeatMasker:https://www.jianshu.com/p/50ce4bcd1972

A promoter-level mammalian expression atlas:https://www.nature.com/articles/nature13182#article-info

A map of the cis-regulatory sequences in the mouse genome: https://www.nature.com/articles/nature11243#additional-information This article involves Shannon-entropy-based analysis, which I will check out later.
