Introduction
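In the previous section we showed how to draw Venn diagrams to display the overlaps between sets.
However, as the number of sets grows, the relationships in a Venn diagram become increasingly complex, and it is hard to read the information at a glance.
Today we look at how to plot these relationships when there are many sets.
We will use the UpSetR package to draw an UpSet plot.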
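An UpSet plot consists of three panels:
- a bar chart of intersection sizes (top)
- a horizontal bar chart of set sizes (bottom left)
- a matrix of set overlaps (bottom right): each column is one intersection combination and lines up with the x-axis of the top bar chart; each row is one set and lines up with the y-axis of the set-size bar chart
A plot like this shows the overlaps among many sets at once and makes the intersection information easy to read.
So how do we draw one?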
Basics
1. Installation and loading
install.packages("UpSetR")
library(UpSetR)
We use the example data that ships with the package
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
header = TRUE, sep = ";")
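2. Data
Before plotting, we need to know the expected input format.
UpSetR provides two conversion functions, fromList and fromExpression, for getting data into that format.
- fromList takes a list in which each element is one set and converts it into a data frame, for example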
listInput <- list(
one = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
two = c(1, 2, 4, 5, 10),
three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))
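- fromExpression takes a named vector. Each name is a set, or several sets joined by &, and each value is the number of elements that belong to exactly that combination and to no other set, for example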
expressionInput <- c(
one = 2, two = 1, three = 2,
`one&two` = 1, `one&three` = 4,
`two&three` = 1, `one&two&three` = 2)
Either input can then be plotted as follows
upset(fromList(listInput), order.by = "freq")
# upset(fromExpression(expressionInput), order.by = "freq")
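To make the two input formats concrete, the following base R sketch rebuilds by hand the 0/1 membership table that a list of sets corresponds to, and then counts how many elements fall into each exclusive combination. It does not call UpSetR; the names membership, combo and combo_counts are illustrative only, and the package's own implementation may differ.

```r
listInput <- list(
  one   = c(1, 2, 3, 5, 7, 8, 11, 12, 13),
  two   = c(1, 2, 4, 5, 10),
  three = c(1, 5, 6, 7, 8, 9, 10, 12, 13))

# All distinct elements across the sets
elements <- sort(unique(unlist(listInput)))

# One row per element, one 0/1 column per set
membership <- as.data.frame(
  lapply(listInput, function(s) as.integer(elements %in% s)))
rownames(membership) <- elements

# Label each element with the exact combination of sets it belongs to
combo <- apply(membership, 1, function(r) {
  paste(names(listInput)[r == 1], collapse = "&")
})

# Number of elements exclusive to each combination
combo_counts <- table(combo)
print(combo_counts)
```

The resulting counts agree with the values written out in expressionInput, which is why the two inputs describe the same plot.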
3. Plotting a subset of the sets
Here we set nsets = 6 to restrict the plot to the six largest sets
upset(movies, nsets = 6,
number.angles = 30,
point.size = 3.5,
line.size = 2,
mainbar.y.label = "Genre Intersections",
sets.x.label = "Movies Per Genre",
text.scale = c(1.3, 1.3, 1, 1, 2, 0.75))
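Other parameters adjust the appearance of the plot. number.angles sets the angle of the numbers above the bars; point.size and line.size set the size of the points and lines in the matrix; mainbar.y.label and sets.x.label set the axis labels of the intersection-size bar chart and the set-size bar chart; text.scale takes six values that scale all the text labels on the plot.
The values of text.scale are given in this order:
- axis title of the intersection-size bar chart
- tick labels of the intersection-size bar chart
- axis title of the set-size bar chart
- tick labels of the set-size bar chart
- set names
- numbers above the bars showing the intersection sizes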
We can also specify exactly which sets to show
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45)
)
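mb.ratio controls the height ratio between the top bar chart and the matrix below it
4. Sorting
We can set the order.by parameter to sort the intersections.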
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = "freq",
decreasing = TRUE
)
With order.by = "freq", intersections are sorted by size; decreasing = TRUE puts the largest first
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = "degree",
decreasing = FALSE
)
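With order.by = "degree", intersections are sorted by the number of sets they involve; decreasing = FALSE puts the lowest degree first
Both keys can also be given together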
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = c("degree", "freq"),
decreasing = c(TRUE, FALSE)
)
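As a rough illustration of what ordering by two keys means, the base R sketch below sorts a small made-up table of intersections by degree (the number of sets involved) in decreasing order and breaks ties by frequency in increasing order. The data frame inter and its values are hypothetical, and this is not necessarily how UpSetR orders intersections internally.

```r
# Hypothetical intersections: degree = number of sets, freq = size
inter <- data.frame(
  combo  = c("A", "B", "A&B", "A&C", "A&B&C"),
  degree = c(1, 1, 2, 2, 3),
  freq   = c(40, 25, 10, 12, 3),
  stringsAsFactors = FALSE)

# Primary key: degree, decreasing; secondary key: freq, increasing
ord <- order(-inter$degree, inter$freq)
sorted <- inter[ord, ]
print(sorted$combo)
```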
To keep the sets in the order given in the sets parameter, set keep.order = TRUE
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
mb.ratio = c(0.55, 0.45),
order.by = c("degree", "freq"),
decreasing = c(TRUE, FALSE),
keep.order = TRUE
)
To show combinations whose intersection is empty, set the empty.intersections parameter
upset(movies,
sets = c("Action", "Comedy", "Drama",
"Mystery", "Thriller", "Romance", "War"),
empty.intersections = "on"
)
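Queries
Queries are run through the queries parameter, which takes a nested list describing one or more queries. Each query has four fields:
- query: the query to run
- params: a list of parameters for the query
- color: the color used in the plot for elements that match the query
- active: if TRUE, the bar of the matching intersection is overlaid with the query color; if FALSE, a jittered point is placed on the bar instead
The examples below cover each kind of query.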
1. Built-in intersection query
The built-in intersects query finds specific intersections and colors them
upset(movies, queries = list(
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T),
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T)
)
)
2. Built-in element query
The elements query shows how elements with given attribute values are distributed across the intersections
upset(movies,
queries = list(
list(
query = elements,
params = list("AvgRating", 3.5, 4.1),
color = "blue",
active = T),
list(
query = elements,
params = list("ReleaseDate", 1980, 1990, 2000),
color = "red",
active = F)
)
)
3. Using an expression
We can pass a filter expression through the expression parameter to keep only a subset of the query results.
upset(movies,
queries = list(
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = elements,
params = list("ReleaseDate", 1980, 1990, 2000),
color = "red",
active = F)),
expression = "AvgRating > 3 & Watches > 100"
)
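The effect of such a filter can be previewed outside the plot. The sketch below applies the same expression string to a small made-up data frame in base R. The data frame toy and its values are hypothetical, and evaluating the string with parse() and eval() is just one way to mimic the filter, not a description of UpSetR's internals.

```r
# Hypothetical stand-in for a few rows of the movies table
toy <- data.frame(
  Name      = c("m1", "m2", "m3", "m4"),
  AvgRating = c(2.5, 3.4, 4.2, 3.1),
  Watches   = c(500, 80, 150, 101),
  stringsAsFactors = FALSE)

expr <- "AvgRating > 3 & Watches > 100"

# Evaluate the expression with the data frame's columns in scope
keep <- eval(parse(text = expr), envir = toy)
filtered <- toy[keep, ]
print(filtered$Name)
```

Only rows that satisfy both conditions remain, which is the subset the queries would then be drawn from.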
4. Custom queries
A query function is applied to every row of the data. We can define one like this
Myfunc <- function(row, release, rating) {
data <- (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}
It selects movies whose release year is in release and whose average rating is above a given value
Run the query
upset(movies,
queries = list(
list(
query = Myfunc,
params = list(c(1970, 1980, 1990, 1999, 2000), 2.5),
color = "blue",
active = T)
)
)
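To see what a row-wise query function returns, the sketch below calls the same function on each row of a small made-up data frame in base R. The data frame toy is hypothetical, and the explicit loop over rows only imitates the idea of applying the function to every row; how UpSetR passes rows to the function internally is not shown here.

```r
# Hypothetical toy data; the real movies table has many more columns
toy <- data.frame(
  ReleaseDate = c(1970, 1985, 1999, 2000),
  AvgRating   = c(3.1, 4.0, 2.0, 2.6))

# Same logic as the query function above, returning its result explicitly
Myfunc <- function(row, release, rating) {
  (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}

# Apply the function to each row, treating the row as a named vector
keep <- vapply(seq_len(nrow(toy)), function(i) {
  row <- unlist(toy[i, ])
  unname(Myfunc(row, c(1970, 1980, 1990, 1999, 2000), 2.5))
}, logical(1))

print(toy[keep, ])
```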
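5. Adding a query legend
The query.legend parameter sets the position of the query legend, top or bottom
Within a query, query.name sets its name; if it is not set, a name is generated automatically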
upset(movies,
query.legend = "top",
queries = list(
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange", active = T,
query.name = "Funny action"),
list(
query = intersects,
params = list("Drama"),
color = "red", active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T,
query.name = "Emotional action")
)
)
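Attribute plots
The attribute.plots parameter draws attribute plots and has three fields:
- gridrows: the amount of space given to the attribute plots. The UpSet plot is laid out on a 100 x 100 grid by default; setting gridrows to 50 makes the whole figure 150 x 100
- plots: a list of plots, each with four parameters:
  - plot: a function that returns a ggplot object
  - x: the x-axis variable
  - y: the y-axis variable
  - queries: whether to overlay the plot with data from the existing queries
- ncols: the number of columns used to arrange the attribute plots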
1. Built-in plot functions
We use the histogram function that ships with the package to draw histograms
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
active = T)
),
attribute.plots = list(
gridrows = 50,
plots = list(
list(
plot = histogram,
x = "ReleaseDate",
queries = F),
list(
plot = histogram,
x = "AvgRating",
queries = T)
),
ncols = 2
)
)
Use the scatter_plot function to draw scatter plots
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T)
),
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = scatter_plot,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(plot = scatter_plot,
x = "AvgRating",
y = "Watches",
queries = F)
),
ncols = 2),
query.legend = "bottom"
)
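2. Custom plot functions
We first define two ggplot2-based functions, one for a scatter plot and one for a density plot
library(ggplot2)  # needed for ggplot(), aes_string(), theme() and unit()
my_scatter <- function(data, x, y) {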
p <- ggplot(data, aes_string(x, y, colour = "color")) +
geom_point() +
scale_colour_identity() +
theme(
plot.margin = unit(c(0, 0, 0, 0), "cm")
)
p
}
my_density <- function(data, x, y) {
data$decades <- data[, y] %/% 10 * 10
data <- data[which(data$decades >= 1970), ]
p <- ggplot(data, aes_string(x)) +
geom_density(aes(fill = factor(decades)), alpha = 0.3) +
theme(
plot.margin = unit(c(0, 0, 0, 0), "cm"),
legend.key.size = unit(0.4, "cm")
)
p
}
Then use them in the attribute plots
upset(movies,
main.bar.color = "black",
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red", active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange", active = T)
),
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = my_scatter,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(
plot = my_density,
x = "AvgRating",
y = "ReleaseDate",
queries = F)
),
ncols = 2)
)
3. Drawing box plots
To draw box plots, use the boxplot.summary parameter; at most two variables can be shown at once.
upset(movies, boxplot.summary = c("AvgRating", "ReleaseDate"))
The same can of course be done with a custom plot function
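Set metadata
The set.metadata parameter attaches metadata to the sets and has three fields:
- data: a data frame whose first column holds the set names and whose remaining columns hold attributes of the sets
- ncols: the number of columns
- plots: a list in which each element has four fields, column, type, assign and colors:
  - column: the name of the column in data to plot
  - type: the kind of plot. For a numeric column this can be hist or heat; for a Boolean column, bool draws a Boolean heatmap; for a categorical (string) column this can be heat or text; to show the metadata inside the matrix, use matrix_rows
  - assign: the number of grid columns allotted to this metadata plot. If two metadata plots are drawn and given 20 and 10 columns, the UpSet plot becomes 100 x 130
  - colors: the colors of the metadata plot. For a bar plot the color applies to the whole plot; for heat or bool a vector of colors can be given; for a factor column there is no colors parameter and the plot uses a color gradient; for text one color can be set per unique string, and colors are assigned automatically if none are given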
1. Bar plot
We attach a metadata attribute to each set by giving every set (genre) a random average Rotten Tomatoes score
sets <- names(movies[3:19])
avgRottenTomatoesScore <- round(runif(17, min = 0, max = 90))
metadata <- as.data.frame(cbind(sets, avgRottenTomatoesScore))
names(metadata) <- c("sets", "avgRottenTomatoesScore")
To draw a bar plot, the corresponding column must be numeric
> str(metadata)
'data.frame': 17 obs. of 2 variables:
$ sets : Factor w/ 17 levels "Action","Adventure",..: 1 2 3 4 5 6 7 8 12 9 ...
$ avgRottenTomatoesScore: Factor w/ 12 levels "13","16","21",..: 6 10 12 5 1 1 3 2 11 11 ...
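The score column is a factor here, so it has to be converted first. On R 4.0 and later, where stringsAsFactors defaults to FALSE, both columns show up as chr instead; the same conversion works in either case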
metadata$avgRottenTomatoesScore <- as.numeric(as.character(metadata$avgRottenTomatoesScore))
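The detour through as.character() in the conversion above matters. The short base R sketch below, using made-up scores, shows that calling as.numeric() directly on a factor returns the internal level codes rather than the original numbers.

```r
# Hypothetical scores stored as a factor
scores <- factor(c("45", "13", "88", "13"))

# Wrong: returns the internal level codes, not the scores
codes <- as.numeric(scores)

# Right: go through the character labels first
values <- as.numeric(as.character(scores))

print(codes)
print(values)
```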
Now we can draw the metadata plot
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20)
)
)
)
2. Heatmaps
Next we add a city attribute to the set metadata, making sure the column is a character vector rather than a factor
Cities <- sample(c("Boston", "NYC", "LA"), 17, replace = T)
metadata <- cbind(metadata, Cities)
metadata$Cities <- as.character(metadata$Cities)
We draw two heatmaps, one with colors specified and one without
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "heat",
column = "Cities",
assign = 10,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")
),
list(
type = "heat",
column = "avgRottenTomatoesScore",
assign = 10)
)
)
)
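As the plot shows, the heatmap without specified colors uses a gray gradient
Boolean heatmap
We add an accepted column to the metadata, with values 0 and 1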
accepted <- round(runif(17, min = 0, max = 1))
metadata <- cbind(metadata, accepted)
It is set up in the same way as above
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "bool",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")
)
)
)
)
If we change bool to heat
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "heat",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")
)
)
)
)
the 0/1 values are treated as numeric and drawn with a color gradient
3. Text
For the city metadata, showing text may be more suitable than a heatmap
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "text",
column = "Cities",
assign = 10,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")
)
)
)
)
4. Applying metadata in the matrix
Sometimes we want the metadata to appear directly in the UpSet matrix. Setting type = "matrix_rows" colors the matrix rows by city
upset(movies,
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20),
list(
type = "matrix_rows",
column = "Cities",
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple"),
alpha = 0.5)
)
)
)
Putting it all together
Finally, we combine these plots in one figure
upset(movies,
# queries
queries = list(
list(
query = intersects,
params = list("Drama"),
color = "red",
active = F),
list(
query = intersects,
params = list("Action", "Drama"),
active = T),
list(
query = intersects,
params = list("Drama", "Comedy", "Action"),
color = "orange",
active = T)),
# metadata plots
set.metadata = list(
data = metadata,
plots = list(
list(
type = "hist",
column = "avgRottenTomatoesScore",
assign = 20),
list(
type = "bool",
column = "accepted",
assign = 5,
colors = c("#FF3333", "#006400")),
list(
type = "text",
column = "Cities",
assign = 5,
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple")),
list(
type = "matrix_rows",
column = "Cities",
colors = c(
Boston = "green",
NYC = "navy",
LA = "purple"),
alpha = 0.5)
)
),
# attribute plots
attribute.plots = list(
gridrows = 45,
plots = list(
list(
plot = my_scatter,
x = "ReleaseDate",
y = "AvgRating",
queries = T),
list(plot = my_density,
x = "AvgRating",
y = "ReleaseDate",
queries = F)),
ncols = 2),
query.legend = "bottom"
)
Code:
https://github.com/dxsbiocc/learn/blob/main/R/plot/upset_plot.R
Parameter details