Learning ComplexHeatmap for complex heatmaps: 8. UpSet plot

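Compared with the traditional approach, the Venn diagram, the UpSet plot offers an efficient way to visualize the intersections of multiple sets. It is implemented in R in the UpSetR package. Here we re-implement UpSet plots with the ComplexHeatmap package, with some improvements.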

8.1 Input data

To represent multiple sets, the variable can be:

  1. A list of sets, where each set is a vector, e.g.:
list(set1 = c("a", "b", "c"),
     set2 = c("b", "c", "d", "e"),
     ...)
  2. A binary matrix or data frame, where rows are elements and columns are sets, e.g.:
  set1 set2 set3
h    1    1    1
t    1    0    1
j    1    0    0
u    1    0    1
w    1    0    0
...

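For example, row t means that t is in set1, not in set2, and in set3. The matrix is also valid if it is a logical matrix.

If the variable is a data frame, only the binary columns (containing only 0 and 1) and the logical columns are used.

Both formats can be used to make UpSet plots. Users can also call list_to_matrix() to convert a list into a binary matrix.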

lt = list(set1 = c("a", "b", "c"),
          set2 = c("b", "c", "d", "e"))
list_to_matrix(lt)
##   set1 set2
## a    1    0
## b    1    1
## c    1    1
## d    0    1
## e    0    1
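
To make the conversion concrete, here is a minimal base R sketch of what list_to_matrix() computes for a list of sets. The helper name sets_to_binary() is hypothetical and this is not the package's implementation; it only mimics the output shown above.

```r
# Hypothetical helper (not part of ComplexHeatmap): one row per element,
# one column per set, 1 where the element is in the set.
sets_to_binary <- function(lt, universe = unique(unlist(lt, use.names = FALSE))) {
  m <- do.call(cbind, lapply(lt, function(s) as.integer(universe %in% s)))
  rownames(m) <- universe
  m
}

lt <- list(set1 = c("a", "b", "c"),
           set2 = c("b", "c", "d", "e"))
bm <- sets_to_binary(lt)
bm
```

Passing a longer or shorter `universe` to this helper adds all-zero rows or drops elements, which is the same idea as the universal set discussed next.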

You can also set the universal set in list_to_matrix():

list_to_matrix(lt, universal_set = letters[1:10])
##   set1 set2
## a    1    0
## b    1    1
## c    1    1
## d    0    1
## e    0    1
## f    0    0
## g    0    0
## h    0    0
## i    0    0
## j    0    0

If the universal set does not completely cover the input sets, elements that are not in the universal set are removed:

list_to_matrix(lt, universal_set = letters[1:4])
##   set1 set2
## a    1    0
## b    1    1
## c    1    1
## d    0    1
  3. The sets can be genomic regions, in which case they can only be represented as a list of GRanges/IRanges objects:
list(set1 = GRanges(...),
     set2 = GRanges(...),
     ...)

8.2 Mode

For example, with three sets (A, B, C), all combinations of selecting elements that are in or not in each set are encoded as follows:

A B C
1 1 1
1 1 0
1 0 1
0 1 1
1 0 0
0 1 0
0 0 1

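A 1 means the set is selected and a 0 means it is not. For example, 1 1 0 means selecting sets A and B but not set C. Note that 0 0 0 is absent because the background set is not of interest here. In the rest of this chapter, we refer to A, B and C as sets and to each combination as a combination set. The whole binary matrix is called the combination matrix.

The UpSet plot visualizes the size of each combination set. Given the binary code of each combination set, we next need to define how its size is calculated. There are three modes:

  1. distinct mode: 1 means in that set and 0 means not in that set, so 1 1 0 means elements that are in both A and B but not in C (setdiff(intersect(A, B), C)). Under this mode, the seven combination sets are the seven partitions of the Venn diagram and are mutually exclusive.

  2. intersect mode: 1 means in that set and 0 is not taken into account, so 1 1 0 means elements that are in both A and B, whether or not they are in C (intersect(A, B)). Under this mode, the seven combination sets can overlap.

  3. union mode: 1 means in that set and 0 is not taken into account. With multiple 1s, the relationship is OR, so 1 1 0 means elements that are in A or B, whether or not they are in C (union(A, B)). Under this mode, the seven combination sets can overlap.

The three modes are illustrated in the following figure: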

image
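
The three definitions can be checked on a small example with base R set operations. The sets below are made up for illustration; the code evaluates the combination set 1 1 0 under each mode.

```r
A <- c("a", "b", "c", "d")
B <- c("b", "c", "e")
C <- c("c", "d", "f")

# Combination set "1 1 0" (A and B selected, C not selected)
distinct_110  <- setdiff(intersect(A, B), C)  # in A and B, but not in C
intersect_110 <- intersect(A, B)              # in A and B, C is ignored
union_110     <- union(A, B)                  # in A or B, C is ignored

# Size of the same combination set under the three modes
lengths(list(distinct = distinct_110,
             intersect = intersect_110,
             union = union_110))
```

The same code always gives the smallest size in distinct mode and the largest in union mode, which is why the mode has to be stated whenever combination sizes are reported.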

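8.3 Make the combination matrix

The function make_comb_mat() generates the combination matrix and calculates the sizes of the sets and the combination sets. The input can be a single variable or name-value pairs: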

set.seed(123)
lt = list(a = sample(letters, 5),
          b = sample(letters, 10),
          c = sample(letters, 15))
m1 = make_comb_mat(lt)
m1
## A combination matrix with 3 sets and 7 combinations.
##   ranges of combination set size: c(1, 8).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Combination sets are:
##   a b c code size
##   x x x  111    2
##   x x    110    1
##   x   x  101    1
##     x x  011    4
##   x      100    1
##     x    010    3
##       x  001    8
## 
## Sets are:
##   set size
##     a    5
##     b   10
##     c   15
m2 = make_comb_mat(a = lt$a, b = lt$b, c = lt$c)
m3 = make_comb_mat(list_to_matrix(lt))

m1, m2 and m3 are identical.
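
As a rough illustration of what the distinct mode counts, the sizes can be reproduced in base R by giving each element a binary code and tallying the codes. The toy sets below are made up for this sketch, and make_comb_mat() itself is not called.

```r
# Toy sets, made up for this sketch
lt_toy <- list(a = c("x", "y", "z", "u"),
               b = c("y", "z", "w"),
               c = c("z", "v"))
universe <- unique(unlist(lt_toy, use.names = FALSE))

# Binary membership matrix: rows are elements, columns are sets
bm <- do.call(cbind, lapply(lt_toy, function(s) as.integer(universe %in% s)))

# Each element gets one code such as "110" (in a and b, not in c)
codes <- apply(bm, 1, paste, collapse = "")

# Distinct mode: every element falls in exactly one combination set
members <- split(universe, codes)
sizes <- lengths(members)
sizes
```

Because the combination sets are mutually exclusive in this mode, the sizes add up to the number of distinct elements.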

The mode is controlled by the mode argument:

m1 = make_comb_mat(lt) # the default mode is `distinct`
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")

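UpSet plots under the different modes are demonstrated in later sections.

When there are too many sets, they can be pre-filtered by set size (min_set_size and top_n_sets). min_set_size sets the minimal size a set must have, and top_n_sets keeps the given number of largest sets.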

m1 = make_comb_mat(lt, min_set_size = 6)
m2 = make_comb_mat(lt, top_n_sets = 2)

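Subsetting the sets affects the calculation of the combination set sizes, which is why it needs to be controlled at the combination matrix generation step. Combination sets, in contrast, can be subsetted directly by subsetting the matrix: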

m = make_comb_mat(lt)
m[1:4]
## A combination matrix with 3 sets and 4 combinations.
##   ranges of combination set size: c(1, 4).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Combination sets are:
##   a b c code size
##   x x x  111    2
##   x x    110    1
##   x   x  101    1
##     x x  011    4
## 
## Sets are:
##   set size
##     a    5
##     b   10
##     c   15

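make_comb_mat() also allows specifying the universal set, so that the complement set, which contains the elements not belonging to any set, is also considered.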

m = make_comb_mat(lt, universal_set = letters)
m
## A combination matrix with 3 sets and 8 combinations.
##   ranges of combination set size: c(1, 8).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Combination sets are:
##   a b c code size
##   x x x  111    2
##   x x    110    1
##   x   x  101    1
##     x x  011    4
##   x      100    1
##     x    010    3
##       x  001    8
##          000    6
## 
## Sets are:
##          set size
##            a    5
##            b   10
##            c   15
##   complement    6

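The universal set can be smaller than the union of all sets. In that case, only the intersection of each set with the universal set is considered.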

m = make_comb_mat(lt, universal_set = letters[1:10])
m
## A combination matrix with 3 sets and 5 combinations.
##   ranges of combination set size: c(1, 3).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Combination sets are:
##   a b c code size
##   x x    110    1
##   x   x  101    1
##     x x  011    2
##       x  001    3
##          000    3
## 
## Sets are:
##          set size
##            a    2
##            b    3
##            c    6
##   complement    3
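
The effect of a universal set can be checked by hand with base R set operations. This sketch uses made-up sets; it restricts each set to the universal set and takes the complement as whatever is left over.

```r
lt_toy <- list(a = c("a", "c", "k"),
               b = c("c", "d", "m"))
universal_set <- letters[1:6]   # "a" to "f"

# Only the part of each set inside the universal set is kept
restricted <- lapply(lt_toy, intersect, universal_set)

# The complement holds elements of the universal set that are in no set
complement <- setdiff(universal_set, unlist(restricted))

lengths(restricted)
complement
```

Here "k" and "m" fall outside the universal set and are dropped, so both set sizes shrink from 3 to 2.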

If you already know the size of the complement set, you can set the complement_size argument directly.

m = make_comb_mat(lt, complement_size = 5)
m
## A combination matrix with 3 sets and 8 combinations.
##   ranges of combination set size: c(1, 8).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Combination sets are:
##   a b c code size
##   x x x  111    2
##   x x    110    1
##   x   x  101    1
##     x x  011    4
##   x      100    1
##     x    010    3
##       x  001    8
##          000    5
## 
## Sets are:
##          set size
##            a    5
##            b   10
##            c   15
##   complement    5

When the input is a matrix that contains elements not belonging to any set, these elements are treated as the complement set.

x = list_to_matrix(lt, universal_set = letters)
m = make_comb_mat(x)
m
## A combination matrix with 3 sets and 8 combinations.
##   ranges of combination set size: c(1, 8).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Combination sets are:
##   a b c code size
##   x x x  111    2
##   x x    110    1
##   x   x  101    1
##     x x  011    4
##   x      100    1
##     x    010    3
##       x  001    8
##          000    6
## 
## Sets are:
##          set size
##            a    5
##            b   10
##            c   15
##   complement    6

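Next we demonstrate a second example, where the sets are genomic regions. When the sets are genomic regions, the size is calculated as the sum of the widths of the regions in each set (that is, the total number of base pairs).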

library(circlize)
library(GenomicRanges)
lt2 = lapply(1:4, function(i) generateRandomBed())
lt2 = lapply(lt2, function(df) GRanges(seqnames = df[, 1], 
    ranges = IRanges(df[, 2], df[, 3])))
names(lt2) = letters[1:4]
m2 = make_comb_mat(lt2)
m2
## A combination matrix with 4 sets and 15 combinations.
##   ranges of combination set size: c(184941701, 199900416).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Top 8 combination sets are:
##   a b c d code      size
##       x x 0011 199900416
##   x       1000 199756519
##   x   x x 1011 198735008
##   x x x x 1111 197341532
##   x x x   1110 197137160
##   x x   x 1101 194569926
##   x     x 1001 194462988
##   x   x   1010 192670258
## 
## Sets are:
##   set       size
##     a 1566783009
##     b 1535968265
##     c 1560549760
##     d 1552480645

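We do not recommend using the number of regions for the intersection of two sets of genomic regions. There are two reasons:
1. The value is not symmetric: the number of intersected regions measured in set1 is not always the same as the number measured in set2, so it is hard to assign a value to the intersection of set1 and set2.
2. If a long region in set1 overlaps a long region in set2 by only a few base pairs, does it make sense to say the two regions are common to both sets?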
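
The asymmetry in the first point can be seen with a toy example in base R. Regions are stored as closed start/end intervals on a single chromosome, and the helper names count_overlapping() and shared_bp() are hypothetical; in practice GenomicRanges would be used. The sketch assumes that regions within one set do not overlap each other.

```r
set1 <- data.frame(start = 1,         end = 100)
set2 <- data.frame(start = c(10, 60), end = c(20, 70))

# Number of regions in `query` that overlap at least one region in `subject`
count_overlapping <- function(query, subject) {
  hits <- vapply(seq_len(nrow(query)), function(i) {
    any(query$start[i] <= subject$end & query$end[i] >= subject$start)
  }, logical(1))
  sum(hits)
}

# Number of base pairs covered by both sets (closed intervals)
shared_bp <- function(x, y) {
  lo <- outer(x$start, y$start, pmax)
  hi <- outer(x$end, y$end, pmin)
  sum(pmax(hi - lo + 1, 0))
}

count_overlapping(set1, set2)  # measured from set1
count_overlapping(set2, set1)  # measured from set2
shared_bp(set1, set2)
shared_bp(set2, set1)
```

The region counts differ depending on which set is taken as the query, while the number of shared base pairs is the same in both directions.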

The universal set also works for sets that are genomic regions.

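8.4 Utility functions

make_comb_mat() returns a matrix, which is also of class comb_mat. Several utility functions can be applied to a comb_mat object:

  • set_name(): the set names.
  • comb_name(): the combination set names. The names are formatted as strings of binary bits. For example, for three sets A, B and C, the combination set named "101" corresponds to selecting set A, not selecting set B, and selecting set C.
  • set_size(): the set sizes.
  • comb_size(): the combination set sizes.
  • comb_degree(): the degree of a combination set is the number of sets that are selected.
  • t(): transposes the combination matrix. By default make_comb_mat() generates a matrix where sets are on rows and combination sets are on columns, and so they are on the UpSet plot. Transposing the combination matrix switches the positions of sets and combination sets on the UpSet plot.
  • extract_comb(): extracts the elements in a given combination set. Its usage is explained later.
  • Functions for subsetting the matrix.

A quick example: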

m = make_comb_mat(lt)
set_name(m)
## [1] "a" "b" "c"
comb_name(m)
## [1] "111" "110" "101" "011" "100" "010" "001"
set_size(m)
##  a  b  c 
##  5 10 15
comb_size(m)
## 111 110 101 011 100 010 001 
##   2   1   1   4   1   3   8
comb_degree(m)
## 111 110 101 011 100 010 001 
##   3   2   2   2   1   1   1
t(m)
## A combination matrix with 3 sets and 7 combinations.
##   ranges of combination set size: c(1, 8).
##   mode for the combination size: distinct.
##   sets are on columns
## 
## Combination sets are:
##   a b c code size
##   x x x  111    2
##   x x    110    1
##   x   x  101    1
##     x x  011    4
##   x      100    1
##     x    010    3
##       x  001    8
## 
## Sets are:
##   set size
##     a    5
##     b   10
##     c   15
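
Because the combination set names are plain binary strings, they can be decoded without any special helper. The following base R sketch uses the names and set names printed above and recovers the degree and the selected sets for each code.

```r
codes <- c("111", "110", "101", "011", "100", "010", "001")
sets  <- c("a", "b", "c")

# One logical vector per code: TRUE where the bit is "1"
bits <- lapply(strsplit(codes, ""), function(ch) ch == "1")
names(bits) <- codes

degree   <- vapply(bits, sum, integer(1))        # number of selected sets
selected <- lapply(bits, function(b) sets[b])    # which sets are selected

degree
selected[["101"]]
```

The degrees agree with the comb_degree() output, and "101" decodes to sets a and c.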

For extract_comb(), valid combination set names are those returned by comb_name(). Note that the elements in a combination set depend on the mode set in make_comb_mat().

extract_comb(m, "101")
## [1] "j"

And an example for sets that are genomic regions:

# `lt2` was generated in the previous section 
m2 = make_comb_mat(lt2)
set_size(m2)
##          a          b          c          d 
## 1566783009 1535968265 1560549760 1552480645
comb_size(m2)
##      1111      1110      1101      1011      0111      1100      1010      1001 
## 197341532 197137160 194569926 198735008 191312455 192109618 192670258 194462988 
##      0110      0101      0011      1000      0100      0010      0001 
## 191359036 184941701 199900416 199756519 187196837 192093895 191216619

Now extract_comb() returns the genomic regions in the corresponding combination set.

extract_comb(m2, "1010")
## GRanges object with 5063 ranges and 0 metadata columns:
##          seqnames            ranges strand
##             <Rle>         <IRanges>  <Rle>
##      [1]     chr1     255644-258083      *
##      [2]     chr1     306114-308971      *
##      [3]     chr1   1267493-1360170      *
##      [4]     chr1   2661311-2665736      *
##      [5]     chr1   3020553-3030645      *
##      ...      ...               ...    ...
##   [5059]     chrY 56286079-56286864      *
##   [5060]     chrY 57049541-57078332      *
##   [5061]     chrY 58691055-58699756      *
##   [5062]     chrY 58705675-58716954      *
##   [5063]     chrY 58765097-58776696      *
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Using comb_size() and comb_degree(), we can filter the combination matrix:

m = make_comb_mat(lt)
# combination set size >= 4
m[comb_size(m) >= 4]
## A combination matrix with 3 sets and 2 combinations.
##   ranges of combination set size: c(4, 8).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Combination sets are:
##   a b c code size
##     x x  011    4
##       x  001    8
## 
## Sets are:
##   set size
##     a    5
##     b   10
##     c   15
# combination set degree == 2
m[comb_degree(m) == 2]
## A combination matrix with 3 sets and 3 combinations.
##   ranges of combination set size: c(1, 4).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Combination sets are:
##   a b c code size
##   x x    110    1
##   x   x  101    1
##     x x  011    4
## 
## Sets are:
##   set size
##     a    5
##     b   10
##     c   15

For the complement set, the name of this special combination set consists only of zeros.

m2 = make_comb_mat(lt, universal_set = letters)
comb_name(m2) # see the last element
## [1] "111" "110" "101" "011" "100" "010" "001" "000"
comb_degree(m2)
## 111 110 101 011 100 010 001 000 
##   3   2   2   2   1   1   1   0

If universal_set was set in make_comb_mat(), extract_comb() can be applied to the complement set.

m2 = make_comb_mat(lt, universal_set = letters)
extract_comb(m2, "000")
## [1] "a" "b" "f" "p" "u" "z"
m2 = make_comb_mat(lt, universal_set = letters[1:10])
extract_comb(m2, "000")
## [1] "a" "b" "f"

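When universal_set is set, extract_comb() also works for sets of genomic regions.

In the previous examples we demonstrated using a one-dimensional index, such as:

m[comb_degree(m) == 2]

Since the combination matrix is essentially a matrix, indices can also be applied to both dimensions. In the default settings, sets are on rows and combination sets are on columns, so indices on the first dimension of the matrix correspond to sets and indices on the second dimension correspond to combination sets: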

# by set names
m[c("a", "b", "c"), ]
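# by numeric indices
m[3:1, ]

New empty sets can be added to the combination matrix by:

# `d` is the new empty set
m[c("a", "b", "c", "d"), ]

Note that when the specified indices do not cover all non-empty sets in the original combination matrix, the combination matrix is re-calculated, because this affects the values in the combination sets:

# if `c` is a non-empty set
m[c("a", "b"),]

Subsetting on the second dimension, which corresponds to the combination sets, works similarly: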

# reorder
m[, 5:1]
# take a subset
m[, 1:3]
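# by character indices
m[, c("110", "101", "011")]

New empty combination sets can also be added by setting character indices:

m[, c(comb_name(m), "100")]

Indices can be set on both dimensions simultaneously only when the set indices cover all non-empty sets: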

m[3:1, 5:1]
# this will throw an error because `c` is a non-empty set
m[c("a", "b"), 5:1]

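If the combination matrix was transposed, the margins for the set indices and the combination set indices need to be swapped.

tm = t(m)
tm[rev(comb_name(tm)), rev(set_name(tm))]

If only the indices for the combination sets are set as a one-dimensional index, this works automatically for both transposed and non-transposed matrices: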

m[1:5]
tm[1:5]

8.5 Make the plot

Making the UpSet plot is straightforward: users just pass the combination matrix to the UpSet() function:

m = make_comb_mat(lt)
UpSet(m)
image

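By default the sets are ordered by size and the combination sets are ordered by degree (the number of sets that are selected).

The orders are controlled by set_order and comb_order: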

UpSet(m, set_order = c("a", "b", "c"), comb_order = order(comb_size(m)))
image

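The color of the dots, the size of the dots and the line width of the segments are controlled by comb_col, pt_size and lwd. comb_col is a vector corresponding to the combination sets. In the following code, since comb_degree(m) returns a vector of integers, we simply use it as an index into the color vector.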

UpSet(m, pt_size = unit(5, "mm"), lwd = 3,
    comb_col = c("red", "blue", "black")[comb_degree(m)])
image
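
The indexing trick used for comb_col is plain R vector indexing. A minimal sketch, with the degree values copied from the comb_degree(m) output shown earlier and a variable name (palette_by_degree) chosen only for this example:

```r
# Degrees as printed by comb_degree(m) earlier in this chapter
degree <- c("111" = 3, "110" = 2, "101" = 2, "011" = 2,
            "100" = 1, "010" = 1, "001" = 1)

# Position 1 is used for degree 1, position 2 for degree 2, and so on
palette_by_degree <- c("red", "blue", "black")
comb_col <- palette_by_degree[degree]
comb_col
```

Each combination set thus gets one color: red for single sets, blue for pairs, and black for the three-way combination.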

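The colors for the background (the rectangles and the dots representing sets that are not selected) are controlled by bg_col and bg_pt_col. The length of bg_col can be 1 or 2.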

UpSet(m, comb_col = "#0000FF", bg_col = "#F0F0FF", bg_pt_col = "#CCCCFF")
image
UpSet(m, comb_col = "#0000FF", bg_col = c("#F0F0FF", "#FFF0F0"), bg_pt_col = "#CCCCFF")
image

Transposing the combination matrix switches the sets to columns and the combination sets to rows.

UpSet(t(m))
image

As introduced above, if the combination sets are subsetted, the subset of the matrix can be visualized as well:

UpSet(m[comb_size(m) >= 4])
UpSet(m[comb_degree(m) == 2])
image

The following compares the different modes in make_comb_mat():

m1 = make_comb_mat(lt) # the default mode is `distinct`
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")
UpSet(m1)
UpSet(m2)
UpSet(m3)
image

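For a plot that includes the complement set, there is an additional column showing that this complement set does not overlap with any of the sets (all dots are gray).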

m2 = make_comb_mat(lt, universal_set = letters)
UpSet(m2)
image

Remember that if you already know the size of the complement set, you can assign it directly with the complement_size argument of make_comb_mat().

m2 = make_comb_mat(lt, complement_size = 10)
UpSet(m2)
image

For the case where the universal set is smaller than the union of all sets:

m2 = make_comb_mat(lt, universal_set = letters[1:10])
UpSet(m2)
image

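In some cases you may have a complement set but do not want to show it, especially when the input to make_comb_mat() is a matrix that already contains the complement set. You can then filter by the combination degree.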

x = list_to_matrix(lt, universal_set = letters)
m2 = make_comb_mat(x)
m2 = m2[comb_degree(m2) > 0]
UpSet(m2)
image

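8.6 UpSet plots as heatmaps

In the UpSet plot, the major component is the combination matrix, and on the two sides are barplots representing the sizes of the sets and of the combination sets. It is therefore straightforward to implement it as a "heatmap", where the heatmap is custom-drawn with dots and segments and the two barplots are annotations constructed by anno_barplot().

The default top annotation is: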

HeatmapAnnotation("Intersection\nsize" = anno_barplot(comb_size(m), 
        border = FALSE, gp = gpar(fill = "black"), height = unit(3, "cm")), 
    annotation_name_side = "left", annotation_name_rot = 0)

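This top annotation is wrapped in upset_top_annotation(), which only contains the UpSet top barplot annotation. Most arguments of upset_top_annotation() go directly to anno_barplot(), for example to set the colors of the bars: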

UpSet(m, top_annotation = upset_top_annotation(m, 
    gp = gpar(col = comb_degree(m))))
image

To control the data range and the axis:

UpSet(m, top_annotation = upset_top_annotation(m, 
    ylim = c(0, 15),
    bar_width = 1,
    axis_param = list(side = "right", at = c(0, 5, 10, 15),
        labels = c("zero", "five", "ten", "fifteen"))))
image

To control the annotation name:

UpSet(m, top_annotation = upset_top_annotation(m, 
    annotation_name_rot = 90,
    annotation_name_side = "right",
    axis_param = list(side = "right")))
image

The settings are very similar for the right annotation:

UpSet(m, right_annotation = upset_right_annotation(m, 
    ylim = c(0, 30),
    gp = gpar(fill = "green"),
    annotation_name_side = "top",
    axis_param = list(side = "top")))
image

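upset_top_annotation() and upset_right_annotation() automatically recognize whether the sets are on rows or on columns.

upset_top_annotation() and upset_right_annotation() contain only one barplot annotation. If users want to add more annotations, they need to manually construct a HeatmapAnnotation object with multiple annotations.

To add more annotations on top: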

UpSet(m, top_annotation = HeatmapAnnotation(
    degree = as.character(comb_degree(m)),
    "Intersection\nsize" = anno_barplot(comb_size(m), 
        border = FALSE, 
        gp = gpar(fill = "black"), 
        height = unit(2, "cm")
    ), 
    annotation_name_side = "left", 
    annotation_name_rot = 0))
image

To add more annotations on the right:

UpSet(m, right_annotation = rowAnnotation(
    "Set size" = anno_barplot(set_size(m), 
        border = FALSE, 
        gp = gpar(fill = "black"), 
        width = unit(2, "cm")
    ),
    group = c("group1", "group1", "group2")))
image

To move the right annotation to the left of the combination matrix, use upset_left_annotation():

UpSet(m, left_annotation = upset_left_annotation(m))
image

To add numbers on top of the bars:

UpSet(m, top_annotation = upset_top_annotation(m, add_numbers = TRUE),
    right_annotation = upset_right_annotation(m, add_numbers = TRUE))
image

The object returned by UpSet() is actually a Heatmap class object, so you can add it to other heatmaps and annotations with + or %v%.

ht = UpSet(m)
class(ht)
## [1] "Heatmap"
## attr(,"package")
## [1] "ComplexHeatmap"
ht + Heatmap(1:3, name = "foo", width = unit(5, "mm")) + 
    rowAnnotation(bar = anno_points(1:3))
image
ht %v% Heatmap(rbind(1:7), name = "foo", row_names_side = "left", 
        height = unit(5, "mm")) %v% 
    HeatmapAnnotation(bar = anno_points(1:7),
        annotation_name_side = "left")
image

To add multiple UpSet plots:

m1 = make_comb_mat(lt, mode = "distinct")
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")
UpSet(m1, row_title = "distinct mode") %v%
    UpSet(m2, row_title = "intersect mode") %v%
    UpSet(m3, row_title = "union mode")
image

Or first transpose all the combination matrices and add them horizontally:

m1 = make_comb_mat(lt, mode = "distinct")
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")
UpSet(t(m1), column_title = "distinct mode") +
    UpSet(t(m2), column_title = "intersect mode") +
    UpSet(t(m3), column_title = "union mode")
image

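The three combination matrices are actually the same, so plotting them three times is redundant. With the functionality in the ComplexHeatmap package, we can directly add three barplot annotations.

top_ha = HeatmapAnnotation(
    "distinct" = anno_barplot(comb_size(m1), 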
        gp = gpar(fill = "black"), height = unit(2, "cm")), 
    "intersect" = anno_barplot(comb_size(m2), 
        gp = gpar(fill = "black"), height = unit(2, "cm")), 
    "union" = anno_barplot(comb_size(m3), 
        gp = gpar(fill = "black"), height = unit(2, "cm")), 
    gap = unit(2, "mm"), annotation_name_side = "left", annotation_name_rot = 0)
# the same for using m2 or m3
UpSet(m1, top_annotation = top_ha)
image

It is similar when the combination matrix is transposed:

right_ha = rowAnnotation(
    "distinct" = anno_barplot(comb_size(m1), 
        gp = gpar(fill = "black"), width = unit(2, "cm")), 
    "intersect" = anno_barplot(comb_size(m2), 
        gp = gpar(fill = "black"), width = unit(2, "cm")), 
    "union" = anno_barplot(comb_size(m3), 
        gp = gpar(fill = "black"), width = unit(2, "cm")), 
    gap = unit(2, "mm"), annotation_name_side = "bottom")
# the same for using m2 or m3
UpSet(t(m1), right_annotation = right_ha)
image

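In the original UpSet implementation, the combination set sizes are also drawn on top of the bars. Here we do not directly support this, but the sizes can be added manually with the decorate_annotation() function. See the following example: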

ht = draw(UpSet(m))
od = column_order(ht)
cs = comb_size(m)
decorate_annotation("intersection_size", {
    grid.text(cs[od], x = seq_along(cs), y = unit(cs[od], "native") + unit(2, "pt"), 
        default.units = "native", just = "bottom", gp = gpar(fontsize = 8))
})
image

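There are several reasons why we do not directly support adding the combination set sizes to the plot:
1. Adding new text means adding several new arguments to the function, e.g. arguments for graphic parameters, rotation, position and the margin to the bars, which would make the function heavy.
2. The ylim of the barplot annotation needs to be calculated properly so that the text does not exceed the annotation region.
3. Using decorate_annotation() is more flexible: it can add not only the sizes but also customized text.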

8.7 Example with the movies dataset

The UpSetR package also provides a movies dataset, which contains 17 genres for 3883 movies. First load the dataset.

movies = read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), 
    header = TRUE, sep = ";")
head(movies) # `make_comb_mat()` automatically ignores the first two columns
##                                 Name ReleaseDate Action Adventure Children
## 1                   Toy Story (1995)        1995      0         0        1
## 2                     Jumanji (1995)        1995      0         1        1
## 3            Grumpier Old Men (1995)        1995      0         0        0
## 4           Waiting to Exhale (1995)        1995      0         0        0
## 5 Father of the Bride Part II (1995)        1995      0         0        0
## 6                        Heat (1995)        1995      1         0        0
##   Comedy Crime Documentary Drama Fantasy Noir Horror Musical Mystery Romance
## 1      1     0           0     0       0    0      0       0       0       0
## 2      0     0           0     0       1    0      0       0       0       0
## 3      1     0           0     0       0    0      0       0       0       1
## 4      1     0           0     1       0    0      0       0       0       0
## 5      1     0           0     0       0    0      0       0       0       0
## 6      0     1           0     0       0    0      0       0       0       0
##   SciFi Thriller War Western AvgRating Watches
## 1     0        0   0       0      4.15    2077
## 2     0        0   0       0      3.20     701
## 3     0        0   0       0      3.02     478
## 4     0        0   0       0      2.73     170
## 5     0        0   0       0      3.01     296
## 6     0        1   0       0      3.88     940

To make the same UpSet plot as in the corresponding UpSetR example:

m = make_comb_mat(movies, top_n_sets = 6)
m
## A combination matrix with 6 sets and 39 combinations.
##   ranges of combination set size: c(1, 1028).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Top 8 combination sets are:
##   Action Comedy Drama Horror Romance Thriller   code size
##                     x                         001000 1028
##               x                               010000  698
##                            x                  000100  216
##        x                                      100000  206
##                                             x 000001  183
##               x     x                         011000  180
##               x                    x          010010  160
##                     x              x          001010  158
## 
## Sets are:
##          set size
##       Action  503
##       Comedy 1200
##        Drama 1603
##       Horror  343
##      Romance  471
##     Thriller  492
##   complement    2
m = m[comb_degree(m) > 0]
UpSet(m)
image

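The following code makes it look more similar to the original plot. The code is a bit long, but most of it customizes the annotations and the row/column orders.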

ss = set_size(m)
cs = comb_size(m)
ht = UpSet(m, 
    set_order = order(ss),
    comb_order = order(comb_degree(m), -cs),
    top_annotation = HeatmapAnnotation(
        "Genre Intersections" = anno_barplot(cs, 
            ylim = c(0, max(cs)*1.1),
            border = FALSE, 
            gp = gpar(fill = "black"), 
            height = unit(4, "cm")
        ), 
        annotation_name_side = "left", 
        annotation_name_rot = 90),
    left_annotation = rowAnnotation(
        "Movies Per Genre" = anno_barplot(-ss, 
            baseline = 0,
            axis_param = list(
                at = c(0, -500, -1000, -1500),
                labels = c(0, 500, 1000, 1500),
                labels_rot = 0),
            border = FALSE, 
            gp = gpar(fill = "black"), 
            width = unit(4, "cm")
        ),
        set_name = anno_text(set_name(m), 
            location = 0.5, 
            just = "center",
            width = max_text_width(set_name(m)) + unit(4, "mm"))
    ), 
    right_annotation = NULL,
    show_row_names = FALSE)
ht = draw(ht)
od = column_order(ht)
decorate_annotation("Genre Intersections", {
    grid.text(cs[od], x = seq_along(cs), y = unit(cs[od], "native") + unit(2, "pt"), 
        default.units = "native", just = c("left", "bottom"), 
        gp = gpar(fontsize = 6, col = "#404040"), rot = 45)
})
image

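In the movies dataset there is also a column AvgRating, which gives the rating of each movie. Next we split all the movies into five groups based on the ratings.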

genre = c("Action", "Romance", "Horror", "Children", "SciFi", "Documentary")
rating = cut(movies$AvgRating, c(0, 1, 2, 3, 4, 5))
m_list = tapply(seq_len(nrow(movies)), rating, function(ind) {
    m = make_comb_mat(movies[ind, genre, drop = FALSE])
    m[comb_degree(m) > 0]
})
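
To see what the grouping step does, here are cut() and tapply() applied to the AvgRating values of the six movies printed by head(movies) above. Only base R is used; the real code then builds one combination matrix per group instead of returning the row indices.

```r
# AvgRating of the first six movies, copied from the head(movies) output
avg_rating <- c(4.15, 3.20, 3.02, 2.73, 3.01, 3.88)

# Intervals are right-closed: (0,1], (1,2], ..., (4,5]
rating_group <- cut(avg_rating, c(0, 1, 2, 3, 4, 5))
as.character(rating_group)

# Row indices of the movies that fall in each rating group
idx <- tapply(seq_along(avg_rating), rating_group, function(ind) ind)
idx[["(3,4]"]]
```

Four of the six movies land in the (3,4] group, which matches the later observation that most ratings are in the middle of the range.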

The combination matrices in m_list might have different combination sets:

sapply(m_list, comb_size)
## $`(0,1]`
## 010000 001000 000100 000001 
##      1      2      1      1 
## 
## $`(1,2]`
## 101010 100110 110000 101000 100100 100010 001010 100000 010000 001000 000100 
##      1      1      1      4      5      5      8     14      7     38     14 
## 000010 000001 
##      3      2 
## 
## $`(2,3]`
## 101010 110000 101000 100100 100010 010100 010010 001010 000110 100000 010000 
##      4      8      2      6     35      3      1     27      7    126     99 
## 001000 000100 000010 000001 
##    142     77     27      9 
## 
## $`(3,4]`
## 110010 101010 100110 110000 101000 100010 011000 010100 010010 001100 001010 
##      1      6      1     20      6     45      3      4      4      1     11 
## 000110 100000 010000 001000 000100 000010 000001 
##      5    176    276     82    122     66     87 
## 
## $`(4,5]`
## 110010 101010 110000 101000 100010 100000 010000 001000 000100 000010 000001 
##      1      1      4      1      6     23     38      4      4     10     28

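To compare multiple groups with UpSet plots, we need to normalize all the matrices so that they have the same sets and the same combination sets. normalize_comb_mat() basically adds zeros for the combination sets that were not there before.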

m_list = normalize_comb_mat(m_list)
sapply(m_list, comb_size)
##        (0,1] (1,2] (2,3] (3,4] (4,5]
## 110001     0     1     0     1     0
## 100101     0     1     4     6     1
## 100011     0     0     0     1     1
## 110000     0     5     6     0     0
## 100100     0     4     2     6     1
## 100010     0     1     8    20     4
## 100001     0     5    35    45     6
## 010100     0     0     0     1     0
## 010010     0     0     3     4     0
## 010001     0     0     7     5     0
## 000110     0     0     0     3     0
## 000101     0     8    27    11     0
## 000011     0     0     1     4     0
## 100000     0    14   126   176    23
## 010000     1    14    77   122     4
## 001000     1     2     9    87    28
## 000100     2    38   142    82     4
## 000010     1     7    99   276    38
## 000001     0     3    27    66    10
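
The idea behind the normalization can be sketched in base R with named size vectors. This is only an illustration on made-up numbers, not how normalize_comb_mat() is implemented: take the union of all combination names and fill in zero wherever a group lacks a combination.

```r
# Made-up combination sizes for two groups with different combination sets
sizes <- list(
  g1 = c("110" = 3, "100" = 5),
  g2 = c("100" = 2, "010" = 7, "001" = 1)
)

# Union of all combination names, in order of first appearance
all_codes <- unique(unlist(lapply(sizes, names)))

# Start from zeros and overwrite the combinations that exist in each group
normalized <- sapply(sizes, function(s) {
  out <- setNames(numeric(length(all_codes)), all_codes)
  out[names(s)] <- s
  out
})
normalized
```

After this step every group has one value per combination, so the sizes line up column by column as in the table above.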

We calculate the range for the two barplots:

max_set_size = max(sapply(m_list, set_size))
max_comb_size = max(sapply(m_list, comb_size))

Finally, we add the five UpSet plots vertically:

ht_list = NULL
for(i in seq_along(m_list)) {
    ht_list = ht_list %v%
        UpSet(m_list[[i]], row_title = paste0("rating in ", names(m_list)[i]),
            set_order = NULL, comb_order = NULL,
            top_annotation = upset_top_annotation(m_list[[i]], ylim = c(0, max_comb_size)),
            right_annotation = upset_right_annotation(m_list[[i]], ylim = c(0, max_set_size)))
}
ht_list
image.png

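After comparing the five UpSet plots, we can see that most movies are rated between 2 and 4. Horror movies tend to have lower ratings and romance movies tend to have higher ratings.

Instead of directly comparing the sizes of the combination sets, we can also compare their relative fractions of the full set. In the following code, we remove the (0, 1] group because it contains too few movies.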

m_list = m_list[-1]
max_set_size = max(sapply(m_list, set_size))
rel_comb_size = sapply(m_list, function(m) {
    s = comb_size(m)
    # because the combination matrix is generated under "distinct" mode
    # the sum of `s` is the size of the full set
    s/sum(s)
})
ht_list = NULL
for(i in seq_along(m_list)) {
    ht_list = ht_list %v%
        UpSet(m_list[[i]], row_title = paste0("rating in ", names(m_list)[i]),
            set_order = NULL, comb_order = NULL,
            top_annotation = HeatmapAnnotation(
                "Relative\nfraction" = anno_barplot(
                    rel_comb_size[, i],
                    ylim = c(0, 0.5),
                    gp = gpar(fill = "black"),
                    border = FALSE,
                    height = unit(2, "cm")
                ), 
                annotation_name_side = "left",
                annotation_name_rot = 0),
            right_annotation = upset_right_annotation(m_list[[i]], 
                ylim = c(0, max_set_size))
        )
}
ht_list
image

Now the trend is clearer: horror movies are rated low and documentaries are rated high.

Next we split the movies by year:

year = floor(movies$ReleaseDate/10)*10
m_list = tapply(seq_len(nrow(movies)), year, function(ind) {
    m = make_comb_mat(movies[ind, genre, drop = FALSE])
    m[comb_degree(m) > 0]
})
m_list = normalize_comb_mat(m_list)
max_set_size = max(sapply(m_list, set_size))
max_comb_size = max(sapply(m_list, comb_size))
ht_list1 = NULL
for(i in 1:5) {
    ht_list1 = ht_list1 %v%
        UpSet(m_list[[i]], row_title = paste0(names(m_list)[i], "s"),
            set_order = NULL, comb_order = NULL,
            top_annotation = upset_top_annotation(m_list[[i]], ylim = c(0, max_comb_size),
                height = unit(2, "cm")),
            right_annotation = upset_right_annotation(m_list[[i]], ylim = c(0, max_set_size)))
}

ht_list2 = NULL
for(i in 6:10) {
    ht_list2 = ht_list2 %v%
        UpSet(m_list[[i]], row_title = paste0(names(m_list)[i], "s"),
            set_order = NULL, comb_order = NULL,
            top_annotation = upset_top_annotation(m_list[[i]], ylim = c(0, max_comb_size),
                height = unit(2, "cm")),
            right_annotation = upset_right_annotation(m_list[[i]], ylim = c(0, max_set_size)))
}
grid.newpage()
pushViewport(viewport(x = 0, width = 0.5, just = "left"))
draw(ht_list1, newpage = FALSE)
popViewport()
pushViewport(viewport(x = 0.5, width = 0.5, just = "left"))
draw(ht_list2, newpage = FALSE)
popViewport()
image

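Now we can see that most of the movies were produced in the 1990s and that the two major genres are action and romance.

Similarly, if we change the top annotation to the relative fraction of the full set (code not shown):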

image

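Finally, we can add the statistics of year, rating and number of watches for each combination set as boxplot annotations to the right of the UpSet plot.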

m = make_comb_mat(movies[, genre])
m = m[comb_degree(m) > 0]
comb_elements = lapply(comb_name(m), function(nm) extract_comb(m, nm))
years = lapply(comb_elements, function(ind) movies$ReleaseDate[ind])
rating = lapply(comb_elements, function(ind) movies$AvgRating[ind])
watches = lapply(comb_elements, function(ind) movies$Watches[ind])

UpSet(t(m)) + rowAnnotation(years = anno_boxplot(years),
    rating = anno_boxplot(rating),
    watches = anno_boxplot(watches),
    gap = unit(2, "mm"))
image

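We can see that movies of the genre "SciFi + Children" were produced quite long ago, but their ratings are not bad. Movies of the genre "Action + Children" have the lowest ratings.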

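8.8 Example with genomic regions

H3K4me3 ChIP-seq peaks from six Roadmap samples are visualized with an UpSet plot. The six samples are:

  • ESC, E016
  • ES-derived, E004
  • ES-derived, E006
  • Brain, E071
  • Muscle, E100
  • Heart, E104

First read the files and convert them to GRanges objects.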

file_list = c(
    "ESC" = "data/E016-H3K4me3.narrowPeak.gz",
    "ES-deriv1" = "data/E004-H3K4me3.narrowPeak.gz",
    "ES-deriv2" = "data/E006-H3K4me3.narrowPeak.gz",
    "Brain" = "data/E071-H3K4me3.narrowPeak.gz",
    "Muscle" = "data/E100-H3K4me3.narrowPeak.gz",
    "Heart" = "data/E104-H3K4me3.narrowPeak.gz"
)
library(GenomicRanges)
peak_list = lapply(file_list, function(f) {
    df = read.table(f)
    GRanges(seqnames = df[, 1], ranges = IRanges(df[, 2], df [, 3]))
})

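Make the combination matrix. Note that the sizes of the sets and the combination sets are now total base pairs, that is, the sum of the region widths. We only keep combination sets with more than 500kb.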

m = make_comb_mat(peak_list)
m = m[comb_size(m) > 500000]
UpSet(m)
image

We can format the axis labels nicely by setting axis_param:

UpSet(m, 
    top_annotation = upset_top_annotation(
        m,
        axis_param = list(at = c(0, 1e7, 2e7),
            labels = c("0Mb", "10Mb", "20Mb")),
        height = unit(4, "cm")
    ),
    right_annotation = upset_right_annotation(
        m,
        axis_param = list(at = c(0, 2e7, 4e7, 6e7),
            labels = c("0Mb", "20Mb", "40Mb", "60Mb"),
            labels_rot = 0),
        width = unit(4, "cm")
    ))
image
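
Rather than typing the labels by hand, they can be derived from the break positions. A small base R sketch, where the helper name mb_labels() is made up for illustration:

```r
# Hypothetical helper: turn base-pair positions into "Mb" labels
mb_labels <- function(at) paste0(at / 1e6, "Mb")

at_top   <- c(0, 1e7, 2e7)
at_right <- c(0, 2e7, 4e7, 6e7)

mb_labels(at_top)
mb_labels(at_right)
```

The results could then be supplied as axis_param = list(at = at_top, labels = mb_labels(at_top)), which keeps the breaks and their labels in sync if the breaks change.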

For each set of genomic regions, we can associate more information with it, such as the mean methylation or the distance to the nearest TSS.

subgroup = c("ESC" = "group1",
    "ES-deriv1" = "group1",
    "ES-deriv2" = "group1",
    "Brain" = "group2",
    "Muscle" = "group2",
    "Heart" = "group2"
)
comb_sets = lapply(comb_name(m), function(nm) extract_comb(m, nm))
comb_sets = lapply(comb_sets, function(gr) {
    # we just randomly generate dist_to_tss and mean_meth
    gr$dist_to_tss = abs(rnorm(length(gr), mean = runif(1, min = 500, max = 2000), sd = 1000))
    gr$mean_meth = abs(rnorm(length(gr), mean = 0.1, sd = 0.1))
    gr
})
UpSet(m, 
    top_annotation = upset_top_annotation(
        m,
        axis_param = list(at = c(0, 1e7, 2e7),
            labels = c("0Mb", "10Mb", "20Mb")),
        height = unit(4, "cm")
    ),
    right_annotation = upset_right_annotation(
        m,
        axis_param = list(at = c(0, 2e7, 4e7, 6e7),
            labels = c("0Mb", "20Mb", "40Mb", "60Mb"),
            labels_rot = 0),
        width = unit(4, "cm")
    ),
    left_annotation = rowAnnotation(group = subgroup[set_name(m)], show_annotation_name = FALSE),
    bottom_annotation = HeatmapAnnotation(
        dist_to_tss = anno_boxplot(lapply(comb_sets, function(gr) gr$dist_to_tss), outline = FALSE),
        mean_meth = sapply(comb_sets, function(gr) mean(gr$mean_meth)),
        annotation_name_side = "left"
    )
)
image