R Data Visualization: Set Visualization with UpSetR

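Introduction

In the previous section, we covered how to draw Venn diagrams to show the overlap between sets.

However, as the number of sets grows, the relationships shown in a Venn diagram become increasingly complex, and it is hard to read the information at a glance.

Today we look at how to plot set relationships when the number of sets is large.

We will use the UpSetR package to draw an UpSet plot.

The plot consists of three subplots:

  1. A bar chart of intersection sizes (top)
  2. A horizontal bar chart of set sizes (bottom left)
  3. A matrix of set overlaps (bottom right). Each column of the matrix represents one intersection combination and corresponds to a position on the x-axis of the top bar chart; each row represents a set and corresponds to a position on the y-axis of the set-size bar chart

A single plot like this can show the overlap among many sets, and the intersection information is easy to read off.

So how do we draw such a plot?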

Basics

1. Installation and loading

install.packages("UpSetR")

library(UpSetR)

We use the example data that ships with the package

movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), 
    header = T, sep = ";")

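2. Data

Before plotting, we need to know the expected input format.

UpSetR provides two conversion functions, fromList and fromExpression, for formatting data

  • fromList takes a list (each element represents one set) and converts it to a data frame, for example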
listInput <- list(
        one = c(1, 2, 3, 5, 7, 8, 11, 12, 13), 
        two = c(1, 2, 4, 5, 10), 
        three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))
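  • fromExpression takes a named vector in which each name is a set, or a combination of sets joined by &, and each value is the number of elements that belong to exactly that combination, for example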
expressionInput <- c(
        one = 2, two = 1, three = 2, 
        `one&two` = 1, `one&three` = 4, 
        `two&three` = 1, `one&two&three` = 2)

From the data above, we can draw the plot as follows

upset(fromList(listInput), order.by = "freq")
# upset(fromExpression(expressionInput), order.by = "freq")
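
To make the input format concrete, here is a minimal base-R sketch (not UpSetR's own implementation) of the kind of table fromList produces: one row per element and one 0/1 column per set. It also counts how many elements fall into exactly each combination, which reproduces the values in expressionInput. The names elements, membership, combo, and exclusive are illustrative only.

```r
listInput <- list(
  one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
  two = c(1, 2, 4, 5, 10),
  three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))

# All distinct elements across the three sets
elements <- sort(unique(unlist(listInput)))

# One row per element, one 0/1 column per set
membership <- as.data.frame(
  lapply(listInput, function(s) as.integer(elements %in% s)))

# Label each element with the exact combination of sets it belongs to
combo <- apply(membership, 1, function(r) {
  paste(names(membership)[r == 1], collapse = "&")
})

# Number of elements in exactly each combination
exclusive <- table(combo)
print(exclusive)
```

Both inputs therefore describe the same data, which is why the two upset() calls above can be used interchangeably.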

3. Plotting a subset of the sets

Here we set nsets = 6 to restrict the plot to the 6 largest sets

upset(movies, nsets = 6, 
      number.angles = 30, 
      point.size = 3.5, 
      line.size = 2, 
      mainbar.y.label = "Genre Intersections", 
      sets.x.label = "Movies Per Genre", 
      text.scale = c(1.3, 1.3, 1, 1, 2, 0.75))

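Other arguments adjust the appearance of the plot. For example, number.angles sets the angle of the numbers above the bars; point.size and line.size set the size of the points and lines in the matrix plot; mainbar.y.label and sets.x.label set the axis labels of the intersection-size bar chart and the set-size bar chart; text.scale takes 6 values that set the size of all text labels in the plot.

The values of text.scale are given in this order:

  • Axis title and tick labels of the intersection-size bar chart (two values)
  • Axis title and tick labels of the set-size bar chart (two values)
  • Set names
  • Numbers above the bars showing intersection sizes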
We can also specify which sets to display

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      mb.ratio = c(0.55, 0.45)
      )

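mb.ratio controls the proportion of the plot height given to the top bar chart and the bottom matrix

4. Sorting

The order.by argument sorts the intersections.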

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      mb.ratio = c(0.55, 0.45),
      order.by = "freq",
      decreasing = TRUE
      )

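According to the UpSetR documentation, freq is sorted in descending order (largest to smallest) by default; decreasing = TRUE states this explicitly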

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      mb.ratio = c(0.55, 0.45),
      order.by = "degree",
      decreasing = FALSE
      )

Here decreasing = FALSE sorts by degree in ascending order, so intersections involving fewer sets come first

Both criteria can also be given at once

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      mb.ratio = c(0.55, 0.45),
      order.by = c("degree", "freq"),
      decreasing = c(TRUE, FALSE)
      )
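
As a rough illustration of the two sort keys (a sketch of the idea, not UpSetR's internal code), the following base-R snippet takes the seven combinations from the fromExpression example, derives each one's degree as the number of sets it involves, and orders them by degree descending, then frequency ascending. The data frame inter and its column names are made up for this example.

```r
# Combinations and their sizes, taken from the fromExpression example
inter <- data.frame(
  combo = c("one", "two", "three", "one&two",
            "one&three", "two&three", "one&two&three"),
  freq = c(2, 1, 2, 1, 4, 1, 2),
  stringsAsFactors = FALSE)

# Degree = number of sets that take part in the combination
inter$degree <- lengths(strsplit(inter$combo, "&", fixed = TRUE))

# Degree descending first, then frequency ascending within each degree
by_degree_freq <- inter[order(-inter$degree, inter$freq), ]
print(by_degree_freq)
```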

To keep the sets in the order given in the sets argument, set keep.order = TRUE

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      mb.ratio = c(0.55, 0.45),
      order.by = c("degree", "freq"),
      decreasing = c(TRUE, FALSE),
      keep.order = TRUE
      )

To display combinations whose intersection is empty, set the empty.intersections argument

upset(movies, 
      sets = c("Action", "Comedy", "Drama", 
               "Mystery", "Thriller", "Romance", "War"),
      empty.intersections = "on"
      )

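Queries

Queries are passed through the queries argument, which takes a nested list representing multiple queries. Each query contains four fields:

  • query: the query to run
  • params: a list of query parameters
  • color: the color used in the plot for elements that satisfy the query
  • active: if TRUE, the intersection-size bar is overlaid with a bar representing the query; if FALSE, a jittered point is placed on the bar

For example:

1. Built-in intersection query

We use the built-in intersection query, intersects, to find a specific intersection and color it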

upset(movies, queries = list(
  list(
    query = intersects, 
    params = list("Drama", "Comedy", "Action"), 
    color = "orange", 
    active = T), 
  list(
    query = intersects, 
    params = list("Drama"), 
    color = "red", 
    active = F), 
  list(
    query = intersects,
    params = list("Action", "Drama"), 
    active = T)
  )
  )

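2. Built-in element query

We use elements to run an element query, which shows how elements with the given attribute values are distributed across the intersections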

upset(movies, 
      queries = list(
        list(
          query = elements, 
          params = list("AvgRating",  3.5, 4.1), 
          color = "blue", 
          active = T), 
        list(
          query = elements, 
          params = list("ReleaseDate", 1980, 1990, 2000), 
          color = "red", 
          active = F)
        )
      )
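
To illustrate what an element query selects, here is a small base-R sketch on made-up data. It assumes the query matches elements whose attribute equals one of the listed values, which is how the params in the example above read; toy and selected are hypothetical names.

```r
# Made-up movie table for illustration only
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  AvgRating = c(3.5, 2.0, 4.1, 3.9),
  ReleaseDate = c(1980, 1995, 2000, 1990))

# Counterpart of params = list("AvgRating", 3.5, 4.1):
# keep the rows whose AvgRating is one of the listed values
selected <- toy[toy$AvgRating %in% c(3.5, 4.1), ]
print(selected)
```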

3. Using an expression

We can pass a filter expression to the expression argument to extract a subset of the query results.

upset(movies, 
      queries = list(
        list(
          query = intersects, 
          params = list("Action", "Drama"), 
          active = T), 
        list(
          query = elements, 
          params = list("ReleaseDate", 1980, 1990, 2000), 
          color = "red", 
          active = F)), 
      expression = "AvgRating > 3 & Watches > 100"
      )
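
The filter expression is an ordinary R logical expression over the data's columns. A minimal sketch of how such a string can be evaluated against a data frame, using made-up data (this shows the idea and is not necessarily how UpSetR evaluates it internally):

```r
# Made-up data with the two columns used in the expression
toy <- data.frame(
  AvgRating = c(2.5, 3.4, 4.2),
  Watches = c(500, 80, 150))

expr <- "AvgRating > 3 & Watches > 100"

# Parse the string and evaluate it with the columns of toy in scope
keep <- eval(parse(text = expr), envir = toy)
filtered <- toy[keep, ]
print(filtered)
```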

4. Custom queries

A query function is applied to every row of the data. We can define a query function like this

Myfunc <- function(row, release, rating) {
  # TRUE when the row's release year is in `release` and its rating exceeds `rating`
  (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}

It selects movies whose release year is in release and whose average rating is above a given value

Run the query

upset(movies, 
      queries = list(
        list(
          query = Myfunc, 
          params = list(c(1970, 1980, 1990, 1999, 2000), 2.5), 
          color = "blue", 
          active = T)
        )
      )
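
To see what such a function returns, it can be tried on a small made-up data frame with apply, which passes each row as a named vector. This is only a sketch of the row-wise idea; toy and hits are hypothetical names.

```r
Myfunc <- function(row, release, rating) {
  (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}

# Made-up rows for illustration only
toy <- data.frame(
  ReleaseDate = c(1970, 1985, 2000),
  AvgRating = c(3.0, 4.0, 2.0))

# Apply the query function to every row, with the same params as above
hits <- apply(toy, 1, Myfunc,
              release = c(1970, 1980, 1990, 1999, 2000),
              rating = 2.5)
print(unname(hits))
```

Only the first row passes both conditions: the second has a release year outside the list, and the third has a rating below the threshold.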

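5. Adding a query legend

The query.legend argument sets the position of the query legend, either top or bottom

Within a query, use query.name to set its name; if it is not set, a name is generated automatically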

upset(movies, 
      query.legend = "top", 
      queries = list(
        list(
          query = intersects, 
          params = list("Drama", "Comedy", "Action"), 
          color = "orange", active = T, 
          query.name = "Funny action"), 
        list(
          query = intersects, 
          params = list("Drama"), 
          color = "red", active = F), 
        list(
          query = intersects, 
          params = list("Action", "Drama"), 
          active = T, 
          query.name = "Emotional action")
        )
      )

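Attribute plots

The attribute.plots argument draws attribute plots. It contains 3 fields:

  • gridrows: the amount of space given to the attribute plots. The UpSet plot is 100 X 100 by default; with gridrows set to 50, the whole figure becomes 150 X 100
  • plots: a list of plots, each element with 4 parameters:
    • plot: a function that returns a ggplot object
    • x: the x-axis variable
    • y: the y-axis variable
    • queries: whether to overlay the plot data with the existing queries
  • ncols: the number of columns

1. Built-in plot functions

We use the package's built-in histogram function to draw histograms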

upset(movies, 
      main.bar.color = "black", 
      queries = list(
        list(
          query = intersects, 
          params = list("Drama"), 
          active = T)
        ), 
      attribute.plots = list(
        gridrows = 50, 
        plots = list(
          list(
            plot = histogram, 
            x = "ReleaseDate", 
            queries = F), 
          list(
            plot = histogram,
            x = "AvgRating", 
            queries = T)
          ), 
        ncols = 2
        )
      )

Use the scatter_plot function to draw scatter plots

upset(movies, 
      main.bar.color = "black", 
      queries = list(
        list(
          query = intersects, 
          params = list("Drama"), 
          color = "red", 
          active = F), 
        list(
          query = intersects, 
          params = list("Drama", "Comedy", "Action"), 
          color = "orange", 
          active = T)
        ), 
      attribute.plots = list(
        gridrows = 45, 
        plots = list(
          list(
            plot = scatter_plot, 
            x = "ReleaseDate", 
            y = "AvgRating", 
            queries = T), 
          list(plot = scatter_plot, 
               x = "AvgRating", 
               y = "Watches", 
               queries = F)
          ), 
        ncols = 2), 
      query.legend = "bottom"
      )

2. Custom plot functions

We first define two ggplot2-based functions, one for a scatter plot and one for a density plot

library(ggplot2)  # provides ggplot(), aes_string(), geom_point(), theme(), unit()

my_scatter <- function(data, x, y) {
  p <- ggplot(data, aes_string(x, y, colour = "color")) +
    geom_point() +
    scale_colour_identity() +
    theme(
      plot.margin = unit(c(0, 0, 0, 0), "cm")
    )
  p
}

my_density <- function(data, x, y) {
  data$decades <- data[, y] %/% 10 * 10
  data <- data[which(data$decades >= 1970), ]
  p <- ggplot(data, aes_string(x)) +
    geom_density(aes(fill = factor(decades)), alpha = 0.3) +
    theme(
      plot.margin = unit(c(0, 0, 0, 0), "cm"), 
      legend.key.size = unit(0.4, "cm")
    )
  p
}
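
The expression data[, y] %/% 10 * 10 in my_density rounds each year down to the start of its decade: integer division by 10 drops the last digit, and multiplying by 10 restores the scale. A quick check with made-up years:

```r
years <- c(1975, 1989, 2001)

# %/% binds tighter than *, so this is (years %/% 10) * 10
decades <- years %/% 10 * 10
print(decades)
```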

Then use them in the attribute plots

upset(movies, 
      main.bar.color = "black", 
      queries = list(
        list(
          query = intersects, 
          params = list("Drama"), 
          color = "red", active = F), 
        list(
          query = intersects, 
          params = list("Action", "Drama"), 
          active = T),
        list(
          query = intersects, 
          params = list("Drama", "Comedy", "Action"), 
          color = "orange", active = T)
        ), 
      attribute.plots = list(
        gridrows = 45, 
        plots = list(
          list(
            plot = my_scatter, 
            x = "ReleaseDate", 
            y = "AvgRating", 
            queries = T),
          list(
            plot = my_density,
            x = "AvgRating",
            y = "ReleaseDate",
            queries = F)
          ),
        ncols = 2)
      )

3. Box plots

To draw box plots, use the boxplot.summary argument. At most two variables can be plotted at once.

upset(movies, boxplot.summary = c("AvgRating", "ReleaseDate"))

Of course, the same can be done with a custom plot function

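Set metadata

The set.metadata argument sets metadata for the sets. It contains 3 fields:

  • data: a data frame whose first column holds the set names and whose remaining columns hold the set attributes
  • ncols: the number of columns
  • plots: also a list, each element with 4 fields: column, type, assign, and colors
    • column: the name of the column in data to plot
    • type: the kind of plot to draw. For a numeric column it can be hist or heat; for a boolean column, bool draws a boolean heat map; for a categorical (string) column it can be heat or text; to draw within the matrix, use matrix_rows
    • assign: the number of columns allotted to this metadata plot. If two plots are drawn and given 20 and 10, the UpSet plot becomes 100 X 130
    • colors: the colors of the metadata plot. For a bar plot, the color is applied to the whole metadata plot; for heat or bool, a vector of colors can be given; for a factor, colors does not apply and the plot uses a color gradient; for text, one color can be set per unique string, and colors are assigned automatically if omitted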

1. Bar plot

We add a metadata attribute to each set by assigning every set (movie genre) a random average Rotten Tomatoes score

sets <- names(movies[3:19])
avgRottenTomatoesScore <- round(runif(17, min = 0, max = 90))
metadata <- as.data.frame(cbind(sets, avgRottenTomatoesScore))
names(metadata) <- c("sets", "avgRottenTomatoesScore")

To draw a bar plot, the corresponding column must be numeric

> str(metadata)
'data.frame':   17 obs. of  2 variables:
 $ sets                  : Factor w/ 17 levels "Action","Adventure",..: 1 2 3 4 5 6 7 8 12 9 ...
 $ avgRottenTomatoesScore: Factor w/ 12 levels "13","16","21",..: 6 10 12 5 1 1 3 2 11 11 ...

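We can see that the score column is a factor (in R 4.0 and later, where stringsAsFactors defaults to FALSE, it is a character column instead), so it must be converted first. The conversion below works in both cases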

metadata$avgRottenTomatoesScore <- as.numeric(as.character(metadata$avgRottenTomatoesScore))
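
The two-step conversion matters because calling as.numeric directly on a factor returns the internal level codes rather than the printed values. A short check with made-up scores, plus an alternative construction, not used in the original code, that keeps the column numeric by building the data frame without cbind:

```r
f <- factor(c("13", "90", "21"))

codes  <- as.numeric(f)                # level codes, not the scores
values <- as.numeric(as.character(f))  # the actual scores

# Building the data frame directly keeps the score column numeric
metadata_alt <- data.frame(
  sets = c("Action", "Comedy", "Drama"),
  avgRottenTomatoesScore = c(13, 90, 21))

print(codes)
print(values)
```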

Now we can draw the metadata plot

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "hist", 
            column = "avgRottenTomatoesScore", 
            assign = 20)
          )
        )
      )

2. Heat map

We extend the metadata by giving each set a city attribute, making sure the column is a character vector rather than a factor

Cities <- sample(c("Boston", "NYC", "LA"), 17, replace = T)
metadata <- cbind(metadata, Cities)
metadata$Cities <- as.character(metadata$Cities)

We draw two heat maps, one with colors specified and one without

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "heat",
            column = "Cities", 
            assign = 10, 
            colors = c(
              Boston = "green", 
              NYC = "navy",
              LA = "purple")
            ), 
          list(
            type = "heat", 
            column = "avgRottenTomatoesScore", 
            assign = 10)
          )
        )
      )

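As you can see, the heat map without specified colors uses a gray gradient

Boolean heat map

We add an accepted column to the metadata, with values 0 or 1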

accepted <- round(runif(17, min = 0, max = 1))
metadata <- cbind(metadata, accepted)

The setup is similar to the above

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "bool", 
            column = "accepted", 
            assign = 5, 
            colors = c("#FF3333", "#006400")
            )
          )
        )
      )

If we change bool to heat

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "heat", 
            column = "accepted", 
            assign = 5, 
            colors = c("#FF3333", "#006400")
            )
          )
        )
      )

the 0/1 boolean data is treated as numeric and drawn with a color gradient

3. Text

For the city metadata, text may be a better fit than a heat map

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "text", 
            column = "Cities", 
            assign = 10, 
            colors = c(
              Boston = "green", 
              NYC = "navy",        
              LA = "purple")
            )
          )
        )
      )

4. Applying metadata in the matrix

Sometimes we want the metadata reflected directly in the UpSet plot. Setting type = "matrix_rows" colors the matrix rows by city

upset(movies, 
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "hist", 
            column = "avgRottenTomatoesScore", 
            assign = 20), 
          list(
            type = "matrix_rows", 
            column = "Cities", 
            colors = c(
              Boston = "green", 
              NYC = "navy", 
              LA = "purple"),
            alpha = 0.5)
          )
        )
      )

Summary

Finally, we combine all of these plots into one

upset(movies, 
      # Queries
      queries = list(
        list(
          query = intersects, 
          params = list("Drama"), 
          color = "red", 
          active = F), 
        list(
          query = intersects, 
          params = list("Action", "Drama"), 
          active = T), 
        list(
          query = intersects,
          params = list("Drama", "Comedy", "Action"), 
          color = "orange", 
          active = T)), 
      # Set metadata plots
      set.metadata = list(
        data = metadata, 
        plots = list(
          list(
            type = "hist", 
            column = "avgRottenTomatoesScore", 
            assign = 20), 
          list(
            type = "bool", 
            column = "accepted",
            assign = 5, 
            colors = c("#FF3333", "#006400")), 
          list(
            type = "text", 
            column = "Cities",
            assign = 5, 
            colors = c(
              Boston = "green", 
              NYC = "navy", 
              LA = "purple")), 
          list(
            type = "matrix_rows", 
            column = "Cities", 
            colors = c(
              Boston = "green", 
              NYC = "navy", 
              LA = "purple"), 
            alpha = 0.5)
          )
        ), 
      # Attribute plots
      attribute.plots = list(
        gridrows = 45, 
        plots = list(
          list(
            plot = my_scatter, 
            x = "ReleaseDate", 
            y = "AvgRating", 
            queries = T), 
          list(plot = my_density, 
               x = "AvgRating", 
               y = "ReleaseDate", 
               queries = F)), 
        ncols = 2), 
      query.legend = "bottom"
      )

Code:
https://github.com/dxsbiocc/learn/blob/main/R/plot/upset_plot.R
