2020-06-04 EnrichedHeatmap可视化表观遗传学数据

image.png

当我第一眼看到时，惊艳到了！！！如同登徒子见了久违的天人。与ComplexHeatmap包开发者为同一人，都是顾大神的杰作。

将原文摘录如下：

此图展现的是来自Roadmap数据集的表观遗传信号的整体关联。以下是这些复杂关系的方法和配置，此文重点放在可视化部分，数据处理简要介绍。

（1）加载R包和预处理的R对象

library(EnrichedHeatmap)
library(GetoptLong)
library(circlize)
library(RColorBrewer)

download.file("https://jokergoo.github.io/supplementary/EnrichedHeatmap-supplementary/roadmap_normalized_matrices.RData",
    destfile = "roadmap_normalized_matrices.RData")
load("roadmap_normalized_matrices.RData")
file.remove("roadmap_normalized_matrices.RData")

(2) 数据概况

Roadmap数据集(http://egg2.wustl.edu/roadmap/web_portal/) 涵盖了不同的人类细胞类型和组织，并已统一处理过。Roadmap包含DNA甲基化、基因表达的RNA测序和不同组蛋白修饰的ChIP测序数据。

如我们所见，相比于组蛋白修饰，基因表达与DNA甲基化表现出强的关联，因而整体的整合分析以基因表达和甲基化为中心，关联其它的组蛋白修饰信号。也就是先看基因表达和甲基化关联的区域，然后再在这些关联的区域检测组蛋白修饰信号是关联还是不关联。

我们仅用了Roadmap中的27个样本，它们高质量的匹配了表达和甲基化数据，并在表达和甲基化数据上都有一致的亚群。

SAMPLE包含了样本的注释信息，而COLOR表示注释的对应的颜色。

SAMPLE

##        id        group    sample_type  subgroup
## E016 E016          ESC PrimaryCulture subgroup1
## E003 E003          ESC PrimaryCulture subgroup1
## E024 E024          ESC PrimaryCulture subgroup1
## E007 E007     ES-deriv     ESCDerived subgroup1
## E013 E013     ES-deriv     ESCDerived subgroup1
## E012 E012     ES-deriv     ESCDerived subgroup1
## E011 E011     ES-deriv     ESCDerived subgroup1
## E004 E004     ES-deriv     ESCDerived subgroup1
## E005 E005     ES-deriv     ESCDerived subgroup1
## E006 E006     ES-deriv     ESCDerived subgroup1
## E050 E050 HSC_&_B-cell    PrimaryCell subgroup2
## E112 E112       Thymus  PrimaryTissue subgroup2
## E071 E071        Brain  PrimaryTissue subgroup2
## E100 E100       Muscle  PrimaryTissue subgroup2
## E104 E104        Heart  PrimaryTissue subgroup2
## E095 E095        Heart  PrimaryTissue subgroup2
## E105 E105        Heart  PrimaryTissue subgroup2
## E065 E065        Heart  PrimaryTissue subgroup2
## E109 E109    Digestive  PrimaryTissue subgroup2
## E106 E106    Digestive  PrimaryTissue subgroup2
## E079 E079    Digestive  PrimaryTissue subgroup2
## E094 E094    Digestive  PrimaryTissue subgroup2
## E097 E097        Other  PrimaryTissue subgroup2
## E066 E066        Other  PrimaryTissue subgroup2
## E098 E098        Other  PrimaryTissue subgroup2
## E096 E096        Other  PrimaryTissue subgroup2
## E113 E113        Other  PrimaryTissue subgroup2

COLOR

## $group
##          ESC     ES-deriv HSC_&_B-cell       Thymus        Brain       Muscle        Heart 
##    "#924965"    "#4178AE"    "#678C69"    "#DAB92E"    "#C5912B"    "#C2655D"    "#D56F80" 
##    Digestive        Other 
##    "#C58DAA"    "#999999" 
## 
## $sample_type
## PrimaryCulture     ESCDerived    PrimaryCell  PrimaryTissue 
##      "#8cd2c7"      "#bfb9db"      "#faf7b4"      "#f57f73" 
## 
## $subgroup
## subgroup1 subgroup2 
## "#A6CEE3" "#1F78B4"

27个样本被分成了两个亚群，一个是胚胎干细胞，另一个是其它（包括原代组织和成熟细胞）。这两个亚群的样本有截然不同的表达和甲基化数据集。

在此，我们在基因TSS上游5kb和下游10kb附近富集的不同的表观基因组信息进行可视化。所有的表观基因组信号都已经归一化，储存在R对象中。这是一些对象的定义：

mat_neg_cr: a normalized matrix for correlated regions (CRs, The definition of CR is introduced in following sections) showing significant negative correlation between methylation and gene expression. The value in the matrix is whether a window is covered by negative CRs (values are 0 or 1).
meth_mat_corr: a normalized matrix for all CRs. The value in the matrix is the mean correlation for CRs overlapped to a window.
meth_mat_mean: a normalized matrix for mean methylation across all samples.
meth_mat_diff: a normalized matrix for mean methylation difference between two subgroups.
mat_cgi: a normalized matrix for CpG islands. The value in the matrix is whether a window is covered by CGIs.
hist_mat_corr_list: a list of normalized matrices for the correlation between histone modification signals and gene expression. Each matrix corresponds to one type of histone modification.
hist_mat_mean_list: a list of normalized matrices for the mean histone modification signals across all samples.
hist_mat_diff_list: a list of normalized matrices for the histone modification signal difference between two subgroups.

Other R objects are

gene: A GRanges object which contains positions of genes.
tss: A GRanges object which contains positions of gene TSS.
gene_symbol a mapping between Ensembl IDs and gene symbols.
expr: gene expression matrix (Values are measured by log2(RPKM + 1)).

（3）对基因进行排序和分类

先根据表达量分成高低两类：

expr_mean = rowMeans(expr[, SAMPLE$subgroup == "subgroup1"]) - 
    rowMeans(expr[, SAMPLE$subgroup == "subgroup2"])
expr_split = ifelse(expr_mean > 0, "high", "low")
expr_split = factor(expr_split, levels = c("high", "low"))

然后根据甲基化提取TSS上游20%和下游40%的两类：

set.seed(123)
upstream_index = length(attr(meth_mat_mean, "upstream_index"))
meth_split = kmeans(meth_mat_mean[, seq(round(upstream_index*0.8), round(upstream_index*1.4))], 
    centers = 2)$cluster
x = tapply(rowMeans(meth_mat_mean[, seq(round(upstream_index*0.8), round(upstream_index*1.4))]), 
    meth_split, mean)
od = structure(order(x), names = names(x))
meth_split = paste0("cluster", od[as.character(meth_split)])

把表达和甲基化的基因信息放到一起：

combined_split = paste(meth_split, expr_split, sep = "|")

tb = table(combined_split)

tb

## combined_split
## cluster1|high  cluster1|low cluster2|high  cluster2|low 
##           306           940            25           561

tb["cluster2|high"]/sum(tb)

## cluster2|high 
##    0.01364629

去掉cluster2|high 组

l = combined_split != "cluster2|high"
tss = tss[l]
expr = expr[l, ]
hist_mat_corr_list = lapply(hist_mat_corr_list, function(x) x[l, ])
hist_mat_mean_list = lapply(hist_mat_mean_list, function(x) x[l, ])
hist_mat_diff_list = lapply(hist_mat_diff_list, function(x) x[l, ])
mat_neg_cr = mat_neg_cr[l, ]
mat_cgi = mat_cgi[l, ]
meth_mat_corr = meth_mat_corr[l, ]
meth_mat_mean = meth_mat_mean[l, ]
meth_mat_diff = meth_mat_diff[l, ]
expr_split = expr_split[l]
meth_split = meth_split[l]
combined_split = combined_split[l]
n_row_cluster = length(unique(combined_split))

计算基因的行序：

merge_row_order = function(l_list) {
    do.call("c", lapply(l_list, function(l) {
        if(sum(l) == 0) return(integer(0))
        if(sum(l) == 1) return(which(l))
        dend1 = as.dendrogram(hclust(dist_by_closeness(mat_neg_cr[l, ])))
        dend1 = reorder(dend1, -enriched_score(mat_neg_cr[l, ]))
        od = order.dendrogram(dend1)
        which(l)[od]
    }))
}

row_order = merge_row_order(list(
    combined_split == "cluster1|high",
    combined_split == "cluster1|low",
    combined_split == "cluster2|low"
))

组织热图

表达热图

dend1 = as.dendrogram(hclust(dist(t(expr[, SAMPLE$subgroup == "subgroup1"]))))
hc1 = as.hclust(reorder(dend1, colMeans(expr[, SAMPLE$subgroup == "subgroup1"])))
expr_col_od1 = hc1$order
dend2 = as.dendrogram(hclust(dist(t(expr[, SAMPLE$subgroup == "subgroup2"]))))
hc2 = as.hclust(reorder(dend2, colMeans(expr[, SAMPLE$subgroup == "subgroup2"])))
expr_col_od2 = hc2$order
expr_col_od = c(which(SAMPLE$subgroup == "subgroup1")[expr_col_od1], 
              which(SAMPLE$subgroup == "subgroup2")[expr_col_od2])

把样本注释放到顶端：

ht_list = Heatmap(expr, name = "expr", show_row_names = FALSE,
    show_column_names = FALSE, width = unit(4, "cm"), show_column_dend = FALSE, 
    cluster_columns = FALSE, column_order = expr_col_od,
    top_annotation = HeatmapAnnotation(df = SAMPLE[, -1], col = COLOR, 
        annotation_name_side = "left"),
    column_title = "Expression", column_title_gp = gpar(fontsize = 12),
    show_row_dend = FALSE, use_raster = TRUE)

提取用T检验最显著前20个基因

library(genefilter)
df = rowttests(expr, factor(SAMPLE$subgroup))
top_genes = rownames(df[order(df$p.value)[1:20], ])

为这些基因添加文本注释

index =  which(rownames(expr) %in% top_genes)
labels = gene_symbol[rownames(expr)[index]]
ht_list = rowAnnotation(sig_gene = anno_mark(at = index, labels = labels,
        side = "left", labels_gp = gpar(fontsize = 10), 
        extend = unit(c(1, 0), "cm"))) + ht_list

右边添加基因长度的行注释

gl = width(gene[names(tss)])
gl[gl > quantile(gl, 0.95)] = quantile(gl, 0.95)
ht_list = ht_list + rowAnnotation(gene_len = anno_points(gl, size = unit(1, "mm"), 
        gp = gpar(col = "#00000040"), 
        axis_param = list(at = c(0, 1e5, 2e5), labels = c("0bp", "100bp", "200bp")), 
    width = unit(1.5, "cm")))

加入表明CGI到TSS富集的热图

axis_name = c("-5kb", "TSS", "10kb")
ht_list = ht_list + EnrichedHeatmap(mat_cgi, col = c("white", "darkorange"), name = "CGI",
    column_title = "CGI", column_title_gp = gpar(fontsize = 12),
    top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = "darkorange", 
        lty = 1:n_row_cluster), axis_param = list(side = "right", facing = "inside"))), 
    axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)

加入甲基化和表达的热图

bg_col = brewer.pal(8, "Set2")
cor_col_fun = colorRamp2(c(-1, 0, 1), c("darkgreen", "white", "red"))
ht_list = ht_list + EnrichedHeatmap(meth_mat_corr, col = cor_col_fun, name = "meth_corr", 
    top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(pos_col = "red", 
            neg_col = "darkgreen", lty = 1:n_row_cluster), 
        axis_param = list(side = "right", facing = "inside"))), 
    column_title = "meth_corr", column_title_gp = gpar(fontsize = 12, fill = bg_col[1]),
        axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)

添加表明所有样本平均甲基化的热图

meth_col_fun = colorRamp2(c(0, 0.5, 1), c("blue", "white", "red"))
ht_list = ht_list + EnrichedHeatmap(meth_mat_mean, col = meth_col_fun, name = "meth_mean", 
    column_title = "meth_mean", column_title_gp = gpar(fontsize = 12, fill = bg_col[1]),
    top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = "red", 
        lty = 1:n_row_cluster), axis_param = list(side = "right", facing = "inside"))),
    axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)

定义函数为阳性和阴性差异的亚群添加对称的颜色

generate_diff_color_fun = function(x) {
    q = quantile(x, c(0.05, 0.95))
    max_q = max(abs(q))
    colorRamp2(c(-max_q, 0, max_q), c("#3794bf", "#FFFFFF", "#df8640"))
}

ht_list = ht_list + EnrichedHeatmap(meth_mat_diff, name = "meth_diff", 
    col = generate_diff_color_fun(meth_mat_diff),
    column_title = "meth_diff", column_title_gp = gpar(fontsize = 12, fill = bg_col[1]),
    top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(pos_col = "#df8640", 
            neg_col = "#3794bf", lty = 1:n_row_cluster), 
        axis_param = list(side = "right", facing = "inside"))),
    axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)

17个热图太长了，分成两行。

ht_list_2 = NULL
ht_list_1 = NULL
mark_name = names(hist_mat_corr_list)
for(i in seq_along(hist_mat_corr_list)) {
    # heatmaps for the 2nd, 3th and 4th histone modifications are assigned to a new `ht_list`
    if(i == 2) {
        ht_list_1 = ht_list
        ht_list = NULL
    }

    ht_list = ht_list + EnrichedHeatmap(hist_mat_corr_list[[i]], col = cor_col_fun, 
        name = qq("@{mark_name[i]}_corr"), column_title = qq("@{mark_name[i]}_corr"), 
        column_title_gp = gpar(fontsize = 12, fill = bg_col[i+1]),
        top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(pos_col = "red",
                neg_col = "darkgreen", lty = 1:n_row_cluster), 
            axis_param = list(side = "right", facing = "inside"))), 
        axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)

    ht_list = ht_list + EnrichedHeatmap(hist_mat_mean_list[[i]], 
        col = colorRamp2(c(0, quantile(hist_mat_mean_list[[i]], 0.95)), c("white", "purple")), 
        name = qq("@{mark_name[i]}_mean"), column_title = qq("@{mark_name[i]}_mean"), 
        column_title_gp = gpar(fontsize = 12, fill = bg_col[i+1]),
        top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(col = "purple", 
            lty = 1:n_row_cluster), axis_param = list(side = "right", facing = "inside"))),
        axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)

    ht_list = ht_list + EnrichedHeatmap(hist_mat_diff_list[[i]], 
        col = generate_diff_color_fun(hist_mat_diff_list[[i]]), 
        name = qq("@{mark_name[i]}_diff"), column_title = qq("@{mark_name[i]}_diff"), 
        column_title_gp = gpar(fontsize = 12, fill = bg_col[i+1]), 
        top_annotation = HeatmapAnnotation(lines = anno_enriched(gp = gpar(pos_col = "#df8640", 
                neg_col = "#3794bf", lty = 1:n_row_cluster), 
            axis_param = list(side = "right", facing = "inside"))),
        axis_name = axis_name, axis_name_gp = gpar(fontsize = 8), use_raster = TRUE)
}
ht_list_2 = ht_list

为两个热图定义同样的分类和排序

split = as.vector(combined_split)
split[combined_split == "cluster1|high"] = "cluster1"
split[combined_split == "cluster1|low"] = "cluster2"
split[combined_split == "cluster2|low"] = "cluster3"

对每个热图，最左边添加一列热图代表各亚群的差异表达。

ht_list_1 = Heatmap(expr_split, show_row_names = FALSE, name = "expr_diff", 
  col = c("high" = "red", "low" = "darkgreen"), 
  show_column_names = FALSE, width = unit(2, "mm")) + ht_list_1

绘制热图

ht_list_1 = draw(ht_list_1, 
    cluster_rows = FALSE, row_order = row_order, show_row_dend = FALSE,
    row_split = split, heatmap_legend_side = "bottom", ht_gap = unit(2, "mm"))

add_boxplot_of_gene_length = function(ht_list) {

    row_order_list = row_order(ht_list)
    lt = lapply(row_order_list, function(ind) gl[ind])
    bx = boxplot(lt, plot = FALSE)$stats
    n = length(row_order_list)

    decorate_annotation("gene_len", slice = 1, {
        rg = range(bx)
        rg[1] = rg[1] - (rg[2] - rg[1])*0.1
        rg[2] = rg[2] + (rg[2] - rg[1])*0.1
        pushViewport(viewport(y = unit(1, "npc") + unit(1, "mm"), just = "bottom", 
            height = unit(2, "cm"), yscale = rg, xscale = c(0.5, n + 0.5)))
        grid.rect()
        for(i in 1:n) {
            grid.boxplot(pos = i, lt[[i]], gp = gpar(lty = i), outline = FALSE)
        }
        grid.text("Gene length", y = unit(1, "npc") + unit(2.5, "mm"), 
            gp = gpar(fontsize = 12), just = "bottom")
        upViewport() 
    })
}

add_boxplot_of_gene_length(ht_list_1)

ht_list_2 = Heatmap(expr_split, show_row_names = FALSE, name = "expr_diff", 
    col = c("high" = "red", "low" = "darkgreen"), 
    show_column_names = FALSE, width = unit(2, "mm")) + ht_list_2
ht_list_2 = draw(ht_list_2,
    cluster_rows = FALSE, row_order = row_order, show_row_dend = FALSE,
    row_split = split, heatmap_legend_side = "bottom", ht_gap = unit(2, "mm"))

提取注释图形。

add_anno_enriched = function(ht_list, name, ri, ci) {
    pushViewport(viewport(layout.pos.row = ri, layout.pos.col = ci))
    extract_anno_enriched(ht_list, name, newpage = FALSE)
    upViewport()
}

pushViewport(viewport(layout = grid.layout(nr = 3, nc = 6)))
add_anno_enriched(ht_list_1, "meth_corr",     1, 1)
add_anno_enriched(ht_list_1, "meth_mean",     1, 2)
add_anno_enriched(ht_list_1, "meth_diff",     1, 3)
add_anno_enriched(ht_list_1, "CGI",           1, 4)
add_anno_enriched(ht_list_1, "H3K4me3_corr",  2, 1)
add_anno_enriched(ht_list_1, "H3K4me3_mean",  2, 2)
add_anno_enriched(ht_list_1, "H3K4me3_diff",  2, 3)
add_anno_enriched(ht_list_2, "H3K4me1_corr",  2, 4)
add_anno_enriched(ht_list_2, "H3K4me1_mean",  2, 5)
add_anno_enriched(ht_list_2, "H3K4me1_diff",  2, 6)
add_anno_enriched(ht_list_2, "H3K27ac_corr",  3, 1)
add_anno_enriched(ht_list_2, "H3K27ac_mean",  3, 2)
add_anno_enriched(ht_list_2, "H3K27ac_diff",  3, 3)
add_anno_enriched(ht_list_2, "H3K27me3_corr", 3, 4)
add_anno_enriched(ht_list_2, "H3K27me3_mean", 3, 5)
add_anno_enriched(ht_list_2, "H3K27me3_diff", 3, 6)
upViewport()

image.png

图形的解读

Generally, genes in cluster 1 and 2 have high expression, long gene length (annotation “Gene length”) and low methylation over TSS (heatmap “meth_mean”) which correspond well with the enrichment of CpG islands over TSS (heatmap “CGI”), while genes in cluster 3 have low expression, short gene length, and intermediate mean methylation with almost none CGIs overlapping TSS. There is enrichment for significant negative CRs (negCRs) downstream of TSS in cluster 1 and cluster 2 (solid and dashed green lines in annotation of “meth_corr” heatmap, the peaks of the enrichment locate at approximately +2kb of TSS.) while for cluster 3 genes, the enrichment of negCRs is very close to TSS. By associating the heatmap “CGI”, “meth_corr”, “meth_mean” and “meth_diff” together, we can make the conclusion that for genes in cluster 1 and cluster 2, negCRs are enriched at the downstream border of CGI over TSS with high methylation variability, and even for cluster 3 genes, there is also a trend that the negCRs are enriched at close downstream of TSS. It might give hypotheses that when the transcription machine moves into the gene body from TSS, these exist some mechanism that blocks this process and reflects on the changes of methylations.

H3K4me3 is a histone mark which is enriched at active TSS or promoters. Heatmap “H3K4me3_mean” shows strong enrichment of the mean signal over TSS for cluster 1 and cluster 2 genes with high expression. Such enrichment corresponds very well to the low TSS-methylation. Interestingly, strong positive correlation to expression dominates in cluster 1 and the signals are significantly higher in embryonic cells (heatmap “H3K4me3_diff”). The peak for the enrichment of correlation signals in cluster 1 (solid red line in annotation of heatmap “H3K4me3_corr”) is broader than the mean signals while it is very similar as the enrichment peak for negCRs. For cluster 2 genes, the positive correlated regions are enriched at downstream border of H3K4me3 peaks while directly at the H3K4me3 peaks shows negative correlation although the correlation signals are weak and signal difference is small. Surprisingly, strong positive correlations dominate cluster 3 although the mean signals selves are very weak.

H3K4me1 is an active mark enriched at enhancers and promoter flanking regions. Nevertheless, it shows negative correlation at the TSS (solid and dashed green lines in annotation of heatmap “H3K4me1_corr”), especially strong for cluster 1. The peak for the negative correlation enrichment correlates well with CGI and low TSS-methylation, however the signals selves are low at TSS (heatmap “H3K4me1_mean”). Flanking TSS is dominated by positive correlations and the signal difference is comparably big in cluster 1 (solid brown line in annotation of heatmap “H3K4me1_diff”).

H3K27ac is also an active mark enriched in both active enhancers and promoters, and it generally shows positive correlations to expression in all three clusters (heatmap “H3K27ac_corr”). Interestingly the mean signals are the strongest in cluster 2 and mature cells have significantly higher signal intensity than embryonic cells (dashed blue line in annotation of heatmap “H3K27ac_diff”). The peak for the correlation signal enrichment is comparably broader than other marks.

H3K27me3 is a repressive mark and it generally shows negative correlation around TSS at relatively low level, excluding cluster 1 where there are no dominant correlation patterns (heatmap “H3K27me3_corr”). The signals selves are lower and sparser compared to other marks.