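In genomics research, it is common to explore the expression profile of one gene, or of a set of genes, involved in a pathway of interest. Here I introduce ggpubr, a practical and polished package that produces publication-quality plots and can directly apply the color palettes associated with specific journals, which makes exploratory data analysis (EDA) easier for life scientists.
Does a figure like this look professional to you? Would you like to make one just like it? Below I demonstrate how, step by step. To be clear up front, all of these plots can be created with the very flexible ggplot2 R package. However, customizing a ggplot requires syntax that can look opaque to beginners, which raises the barrier for researchers without advanced R programming skills. ggpubr is a wrapper around ggplot2 that provides easy-to-use functions for creating ggplot2-based, publication-ready plots. We will use ggpubr functions to visualize gene expression profiles from the TCGA genomic datasets.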
1. The ggpubr package: it can be installed from CRAN with the following command.
install.packages("ggpubr")
Alternatively, install the latest development version from GitHub.
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/ggpubr")
Then load the package.
library(ggpubr)
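2. TCGA data
The Cancer Genome Atlas (TCGA) is a publicly available resource containing clinical and genomic data for 33 cancer types. The data include gene expression, CNV profiles, SNP genotypes, DNA methylation, miRNA profiles, exome sequencing, and other data types. The RTCGA package, developed by Marcin et al., offers a convenient way to access the clinical and genomic data available in TCGA. For installation instructions, consult the Bioconductor repository or see my other article, 《RTCGA:TCGA数据挖掘的终极利器》 ("RTCGA: the ultimate tool for TCGA data mining"). The R code below requires the core RTCGA package as well as the clinical and mRNA gene expression data packages.
To see which data types are available for each cancer type, use the following commands:
library(RTCGA)
infoTCGA()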
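3. Gene expression data
The expressionsTCGA() function in the RTCGA package makes it easy to extract the expression values of genes of interest from one or more cancer types. In the R code below, we first extract the mRNA expression of five genes of interest (GATA3, PTEN, XBP1, ESR1, and MUC1) from three datasets: breast invasive carcinoma (BRCA), ovarian serous cystadenocarcinoma (OV), and lung squamous cell carcinoma (LUSC).
library(RTCGA)
library(RTCGA.mRNA)
expr <- expressionsTCGA(BRCA.mRNA, OV.mRNA, LUSC.mRNA,
                        extract.cols = c("GATA3", "PTEN", "XBP1", "ESR1", "MUC1"))
expr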
To display the number of samples in each dataset, type the following.
nb_samples <- table(expr$dataset)
nb_samples
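We can simplify the dataset names by removing the "mRNA" tag. This can be done with the base R function gsub().
expr$dataset <- gsub(pattern = ".mRNA", replacement = "", x = expr$dataset, fixed = TRUE)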
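One detail about gsub() is worth a quick illustration: by default its pattern is a regular expression, so an unescaped dot matches any character. The following is a minimal sketch on a made-up vector (the labels are illustrative and not read from TCGA) showing two ways to remove only a literal ".mRNA" suffix.

```r
# Toy dataset labels; illustrative values only
dataset <- c("BRCA.mRNA", "OV.mRNA", "LUSC.mRNA", "BRCA.mRNA")

# Escape the dot and anchor the pattern at the end of the string,
# so that only a literal ".mRNA" suffix is removed
clean <- gsub(pattern = "\\.mRNA$", replacement = "", x = dataset)
print(clean)

# For these labels, matching the pattern as a plain string gives the same result
clean_fixed <- gsub(".mRNA", "", dataset, fixed = TRUE)
print(identical(clean, clean_fixed))
```

Either form avoids accidentally matching strings such as "XmRNA" in the middle of a longer label.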
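Let's also simplify the patient barcode column. The R code below changes the barcodes to BRCA1, BRCA2, ..., OV1, OV2, ..., and so on.
# One way to do this: number the samples within each dataset,
# then prepend the dataset name
expr$bcr_patient_barcode <- paste0(
  expr$dataset,
  ave(seq_along(expr$dataset), expr$dataset, FUN = seq_along)
)
expr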
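The dataset needed for this demonstration has also been prepared and made available online for download. This data is required to practice the R code provided in this tutorial. If you run into problems installing the RTCGA packages, you can simply load the data as follows: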
expr
Create box plots of the gene expression profiles, colored by group (here, dataset/cancer type):
library(ggpubr)
# GATA3
ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco")
# PTEN
ggboxplot(expr, x = "dataset", y = "PTEN",
          title = "PTEN", ylab = "Expression",
          color = "dataset", palette = "jco")
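The palette argument selects the color palette. I plan to write a separate article that covers palettes systematically. For now, it is enough to know that ggpubr can directly use the scientific journal palettes from the ggsci package, for example "npg", "aaas", "lancet", "jco", and "ucscgb". The code above uses "jco", the palette modeled on the Journal of Clinical Oncology, which looks clean and elegant.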
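Instead of repeating the same R code for each gene, you can create a list of plots in one call, as follows:
# Create a list of plots
p <- ggboxplot(expr, x = "dataset",
               y = c("GATA3", "PTEN", "XBP1"),
               title = c("GATA3", "PTEN", "XBP1"),
               ylab = "Expression",
               color = "dataset", palette = "jco")
# View the GATA3 plot
p$GATA3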
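Note that when the argument y contains multiple variables (here, multiple gene names), the arguments title, xlab, and ylab can also be character vectors of the same length as y. To add p-values and significance levels to the box plots, you can do the following:
# Pairs of groups to compare
my_comparisons <- list(c("BRCA", "OV"), c("OV", "LUSC"))
ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco") +
  stat_compare_means(comparisons = my_comparisons)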
For each gene, you can compare the groups as follows:
compare_means(c(GATA3, PTEN, XBP1) ~ dataset, data = expr)
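To make the comparison step more concrete, here is a minimal base R sketch on simulated data (the `toy` data frame and its values are made up, not TCGA measurements). It runs Wilcoxon rank-sum tests between groups, which as far as I know is also the default method used by compare_means(); treat that correspondence as an assumption and check the ggpubr documentation for your version.

```r
set.seed(123)

# Simulated expression values for three groups; not real TCGA data
toy <- data.frame(
  dataset = rep(c("BRCA", "OV", "LUSC"), each = 30),
  GATA3   = c(rnorm(30, mean = 3), rnorm(30, mean = -1), rnorm(30, mean = -0.8))
)

# All pairwise Wilcoxon rank-sum tests, with Holm correction for multiple testing
res <- pairwise.wilcox.test(toy$GATA3, toy$dataset, p.adjust.method = "holm")
print(res$p.value)

# A single two-group comparison, here BRCA versus OV
two <- subset(toy, dataset %in% c("BRCA", "OV"))
p_brca_ov <- wilcox.test(GATA3 ~ dataset, data = two)$p.value
print(p_brca_ov)
```

The matrix returned in `res$p.value` has one cell per pair of groups, which is the same information a table of pairwise comparisons reports row by row.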
To select which items (here, cancer types) to display, or to remove specific items from the plot, use the arguments select or remove, as follows:
# Select BRCA and OV cancer types
ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco",
          select = c("BRCA", "OV"))
# or remove BRCA
ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco",
          remove = "BRCA")
To change the order of the datasets on the x axis, use the argument order. For example, order = c("LUSC", "OV", "BRCA"):
# Order data sets
ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco",
          order = c("LUSC", "OV", "BRCA"))
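In ggplot2-based plots, the order of a categorical axis follows the levels of the underlying factor. The short base R sketch below, on a made-up vector, shows the difference between the default (alphabetical) levels and an explicit ordering; it illustrates the concept only and makes no claim about how ggpubr applies the order argument internally.

```r
# Made-up group labels
dataset <- c("BRCA", "OV", "LUSC", "OV", "BRCA", "BRCA")

# Default factor levels are sorted alphabetically
f_default <- factor(dataset)
print(levels(f_default))

# Explicit levels fix the display order
f_custom <- factor(dataset, levels = c("LUSC", "OV", "BRCA"))
print(levels(f_custom))

# Tabulations (and categorical plot axes) follow the level order
counts <- table(f_custom)
print(counts)
```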
To create a horizontal plot, use the argument rotate = TRUE.
ggboxplot(expr, x = "dataset", y = "GATA3",
          title = "GATA3", ylab = "Expression",
          color = "dataset", palette = "jco",
          rotate = TRUE)
To combine the three gene expression plots into a multi-panel plot, use the argument combine = TRUE.
ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          ylab = "Expression",
          color = "dataset", palette = "jco")
You can also merge the three plots with the argument merge = TRUE or merge = "asis".
ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          merge = TRUE,
          ylab = "Expression",
          palette = "jco")
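In the plot above, it is easy to visually compare the expression levels of the different genes within each cancer type. But you may want to put the genes (the y variables) on the x axis in order to compare expression levels across the groups (here, cancer types). In that case, the y variables (i.e., the genes) become the x tick labels and the x variable (i.e., the dataset) becomes the grouping variable. To do this, use the argument merge = "flip".
ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          merge = "flip",
          ylab = "Expression",
          palette = "jco")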
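One way to think about merge = "flip" is that the gene columns are stacked into a long table, so that the gene name becomes the x variable and the dataset becomes the grouping variable. The base R sketch below shows that reshaping on made-up values (the `toy` data frame is hypothetical); it is a conceptual illustration and not a description of how ggpubr is implemented.

```r
# Made-up wide table: one row per sample, one column per gene
toy <- data.frame(
  dataset = c("BRCA", "BRCA", "OV", "OV"),
  GATA3   = c(2.9, 3.1, -1.0, -0.8),
  PTEN    = c(1.5, 1.4, 0.2, 0.3),
  XBP1    = c(2.0, 2.2, -0.5, -0.4)
)
genes <- c("GATA3", "PTEN", "XBP1")

# Stack the gene columns: one column of values, one column naming the gene
long <- stack(toy[genes])
names(long) <- c("expression", "gene")

# stack() appends the columns one after another, so repeat the group labels to match
long$dataset <- rep(toy$dataset, times = length(genes))

# Median expression per gene (x axis) and dataset (grouping variable)
med <- aggregate(expression ~ gene + dataset, data = long, FUN = median)
print(med)
```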
You may want to add jittered points to the box plots. Each point corresponds to an individual observation. To add jittered points, use the argument add = "jitter", as shown below. To customize the added elements, specify the argument add.params.
ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          color = "dataset", palette = "jco",
          ylab = "Expression",
          add = "jitter",                               # Add jittered points
          add.params = list(size = 0.1, jitter = 0.2)   # Point size and the amount of jittering
)
Note that when using ggboxplot(), sensible values for the argument add are one of c("jitter", "dotplot"). If you choose add = "dotplot" and the dot plot is very dense, you can adjust the dot size and the bin width. You can add and adjust a dot plot as follows.
ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          color = "dataset", palette = "jco",
          ylab = "Expression",
          add = "dotplot",                                  # Add dotplot
          add.params = list(binwidth = 0.1, dotsize = 0.3)
)
You may want to label the samples with the n highest or lowest values in a box plot. In that case, you can use the following arguments:
label: the name of the column containing the point labels.
label.select: can take two formats:
  a character vector specifying some labels to show;
  a list containing one or a combination of the following components:
    top.up and top.down: display the labels of the points with the highest/lowest values. For example, label.select = list(top.up = 10, top.down = 4).
    criteria: to filter by the x and y variable values, use, for example: label.select = list(criteria = "`y` > 3.9 & `y` < 5 & `x` %in% c('BRCA', 'OV')").
ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          color = "dataset", palette = "jco",
          ylab = "Expression",
          add = "jitter",                                  # Add jittered points
          add.params = list(size = 0.1, jitter = 0.2),     # Point size and the amount of jittering
          label = "bcr_patient_barcode",                   # column containing point labels
          label.select = list(top.up = 2, top.down = 2),   # Select some labels to display
          font.label = list(size = 9, face = "italic"),    # label font
          repel = TRUE                                     # Avoid label text overplotting
)
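To illustrate the idea behind top.up and top.down, the base R sketch below picks the labels of the two highest and two lowest values from a made-up table (`toy`, its labels, and its values are hypothetical). Whether ggpubr applies the selection per panel or per group is not stated here, so read this as the general idea rather than the package's exact rule.

```r
# Made-up samples with a label and an expression value
toy <- data.frame(
  label = paste0("S", 1:8),
  y     = c(0.3, 4.2, -1.5, 2.0, 5.1, -2.2, 1.1, 3.3),
  stringsAsFactors = FALSE
)

# Row indices sorted from the highest to the lowest value
ord <- order(toy$y, decreasing = TRUE)

# Labels of the two highest and the two lowest values
top_up   <- toy$label[head(ord, 2)]
top_down <- toy$label[tail(ord, 2)]
print(top_up)
print(top_down)
```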
More complex label selections can be specified as follows.
label.select.criteria <- list(criteria = "`y` > 3.9 & `x` %in% c('BRCA', 'OV')")
ggboxplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          color = "dataset", palette = "jco",
          ylab = "Expression",
          label = "bcr_patient_barcode",                  # column containing point labels
          label.select = label.select.criteria,           # Select some labels to display
          font.label = list(size = 9, face = "italic"),   # label font
          repel = TRUE                                    # Avoid label text overplotting
)
The R code below draws violin plots with box plots inside.
ggviolin(expr, x = "dataset",
         y = c("GATA3", "PTEN", "XBP1"),
         combine = TRUE,
         color = "dataset", palette = "jco",
         ylab = "Expression",
         add = "boxplot")
Instead of adding a box plot inside the violin plot, you can add the median + interquartile range as follows.
ggviolin(expr, x = "dataset",
         y = c("GATA3", "PTEN", "XBP1"),
         combine = TRUE,
         color = "dataset", palette = "jco",
         ylab = "Expression",
         add = "median_iqr")
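When using the function ggviolin(), sensible values for the argument add include: "mean", "mean_se", "mean_sd", "mean_ci", "mean_range", "median", "median_iqr", "median_mad", "median_range". You can also add "jitter" points and a "dotplot" inside the violin plot, as described earlier.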
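For readers who want to see the numbers behind an option such as "median_iqr", here is a minimal base R sketch that computes the median, the quartiles, and the interquartile range per group on simulated data (the `toy` data frame is made up). How exactly ggpubr turns these quantities into the drawn range is not covered here, so this only shows the underlying statistics.

```r
set.seed(1)

# Simulated expression values for two groups; not real TCGA data
toy <- data.frame(
  dataset = rep(c("BRCA", "OV"), each = 50),
  GATA3   = c(rnorm(50, mean = 3), rnorm(50, mean = -1))
)

# Per-group median, quartiles, and interquartile range
summ <- do.call(rbind, lapply(split(toy$GATA3, toy$dataset), function(v) {
  q <- quantile(v, probs = c(0.25, 0.5, 0.75), names = FALSE)
  data.frame(q25 = q[1], median = q[2], q75 = q[3], iqr = IQR(v))
}))
print(summ)
```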
To draw a stripchart, type the following.
ggstripchart(expr, x = "dataset",
             y = c("GATA3", "PTEN", "XBP1"),
             combine = TRUE,
             color = "dataset", palette = "jco",
             size = 0.1, jitter = 0.2,
             ylab = "Expression",
             add = "median_iqr",
             add.params = list(color = "gray"))
For a dot plot, use the code below.
ggdotplot(expr, x = "dataset",
          y = c("GATA3", "PTEN", "XBP1"),
          combine = TRUE,
          color = "dataset", palette = "jco",
          fill = "white",
          binwidth = 0.1,
          ylab = "Expression",
          add = "median_iqr",
          add.params = list(size = 0.9))
To visualize the distribution as a density plot, use the function ggdensity(), as follows.
# Basic density plot
ggdensity(expr,
          x = c("GATA3", "PTEN", "XBP1"),
          y = "..density..",
          combine = TRUE,        # Combine the 3 plots
          xlab = "Expression",
          add = "median",        # Add median line.
          rug = TRUE             # Add marginal rug
)
# Change color and fill by dataset
ggdensity(expr,
          x = c("GATA3", "PTEN", "XBP1"),
          y = "..density..",
          combine = TRUE,        # Combine the 3 plots
          xlab = "Expression",
          add = "median",        # Add median line.
          rug = TRUE,            # Add marginal rug
          color = "dataset",
          fill = "dataset",
          palette = "jco"
)
# Merge the 3 plots
# and use y = "..count.." instead of "..density.."
ggdensity(expr,
          x = c("GATA3", "PTEN", "XBP1"),
          y = "..count..",
          merge = TRUE,          # Merge the 3 plots
          xlab = "Expression",
          add = "median",        # Add median line.
          rug = TRUE,            # Add marginal rug
          palette = "jco"        # Change color palette
)
# color and fill by x variables
ggdensity(expr,
          x = c("GATA3", "PTEN", "XBP1"),
          y = "..count..",
          color = ".x.", fill = ".x.",   # color and fill by x variables
          merge = TRUE,          # Merge the 3 plots
          xlab = "Expression",
          add = "median",        # Add median line.
          rug = TRUE,            # Add marginal rug
          palette = "jco"        # Change color palette
)
# Facet by "dataset"
ggdensity(expr,
          x = c("GATA3", "PTEN", "XBP1"),
          y = "..count..",
          color = ".x.", fill = ".x.",
          facet.by = "dataset",  # Split by "dataset" into multi-panel
          merge = TRUE,          # Merge the 3 plots
          xlab = "Expression",
          add = "median",        # Add median line.
          rug = TRUE,            # Add marginal rug
          palette = "jco"        # Change color palette
)
To visualize the distribution as a histogram, use the function gghistogram(), as follows.
# Basic histogram plot
gghistogram(expr,
            x = c("GATA3", "PTEN", "XBP1"),
            y = "..density..",
            combine = TRUE,      # Combine the 3 plots
            xlab = "Expression",
            add = "median",      # Add median line.
            rug = TRUE           # Add marginal rug
)
# Change color and fill by dataset
gghistogram(expr,
            x = c("GATA3", "PTEN", "XBP1"),
            y = "..density..",
            combine = TRUE,      # Combine the 3 plots
            xlab = "Expression",
            add = "median",      # Add median line.
            rug = TRUE,          # Add marginal rug
            color = "dataset",
            fill = "dataset",
            palette = "jco"
)
# Merge the 3 plots
# and use y = "..count.." instead of "..density.."
gghistogram(expr,
            x = c("GATA3", "PTEN", "XBP1"),
            y = "..count..",
            merge = TRUE,        # Merge the 3 plots
            xlab = "Expression",
            add = "median",      # Add median line.
            rug = TRUE,          # Add marginal rug
            palette = "jco"      # Change color palette
)
# color and fill by x variables
gghistogram(expr,
            x = c("GATA3", "PTEN", "XBP1"),
            y = "..count..",
            color = ".x.", fill = ".x.",   # color and fill by x variables
            merge = TRUE,        # Merge the 3 plots
            xlab = "Expression",
            add = "median",      # Add median line.
            rug = TRUE,          # Add marginal rug
            palette = "jco"      # Change color palette
)
# Facet by "dataset"
gghistogram(expr,
            x = c("GATA3", "PTEN", "XBP1"),
            y = "..count..",
            color = ".x.", fill = ".x.",
            facet.by = "dataset",  # Split by "dataset" into multi-panel
            merge = TRUE,        # Merge the 3 plots
            xlab = "Expression",
            add = "median",      # Add median line.
            rug = TRUE,          # Add marginal rug
            palette = "jco"      # Change color palette
)
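The examples above switch the y axis between a density scale and a count scale. As a small aside, the base R sketch below shows how the two scales relate for a histogram, using simulated values (not TCGA data): the density of a bin is its count divided by the number of observations times the bin width.

```r
set.seed(42)
x <- rnorm(200)   # simulated values, not TCGA data

# Compute the histogram without drawing it
h <- hist(x, breaks = 20, plot = FALSE)
binwidth <- diff(h$breaks)

# Density scale: count divided by (number of observations * bin width)
dens_from_counts <- h$counts / (length(x) * binwidth)
print(all.equal(dens_from_counts, h$density))

# On the density scale the bar areas add up to 1;
# on the count scale the bar heights add up to the number of observations
print(sum(h$density * binwidth))
print(sum(h$counts))
```

The count scale is convenient when groups of very different sizes should keep their relative weight, while the density scale makes the shapes of the distributions directly comparable.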
# Basic ECDF plot
ggecdf(expr,
       x = c("GATA3", "PTEN", "XBP1"),
       combine = TRUE,
       xlab = "Expression", ylab = "F(expression)")
# Change color by dataset
ggecdf(expr,
       x = c("GATA3", "PTEN", "XBP1"),
       combine = TRUE,
       xlab = "Expression", ylab = "F(expression)",
       color = "dataset", palette = "jco")
# Merge the 3 plots and color by x variables
ggecdf(expr,
       x = c("GATA3", "PTEN", "XBP1"),
       merge = TRUE,
       xlab = "Expression", ylab = "F(expression)",
       color = ".x.", palette = "jco")
# Merge the 3 plots and color by x variables
# facet by "dataset" into multi-panel
ggecdf(expr,
       x = c("GATA3", "PTEN", "XBP1"),
       merge = TRUE,
       xlab = "Expression", ylab = "F(expression)",
       color = ".x.", palette = "jco",
       facet.by = "dataset")
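For readers unfamiliar with the empirical cumulative distribution function (ECDF) that ggecdf() plots, here is a minimal base R sketch on a handful of made-up values. F(t) is simply the proportion of observations less than or equal to t.

```r
x <- c(1.2, 0.4, 2.5, 0.4, 3.1)   # made-up expression values

# ecdf() returns a step function that can be evaluated at any point
Fn <- ecdf(x)

# Proportion of observations less than or equal to a given value
print(Fn(0.4))
print(Fn(2.5))

# The same quantity computed by hand
print(mean(x <= 2.5))
```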
# Basic QQ plot
ggqqplot(expr,
         x = c("GATA3", "PTEN", "XBP1"),
         combine = TRUE, size = 0.5)
# Change color by dataset
ggqqplot(expr,
         x = c("GATA3", "PTEN", "XBP1"),
         combine = TRUE,
         color = "dataset", palette = "jco",
         size = 0.5)
# Merge the 3 plots and color by x variables
ggqqplot(expr,
         x = c("GATA3", "PTEN", "XBP1"),
         merge = TRUE,
         color = ".x.", palette = "jco")
# Merge the 3 plots and color by x variables
# facet by "dataset" into multi-panel
ggqqplot(expr,
         x = c("GATA3", "PTEN", "XBP1"),
         merge = TRUE, size = 0.5,
         color = ".x.", palette = "jco",
         facet.by = "dataset")
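A quantile-quantile (QQ) plot compares the sample values with the quantiles expected under a normal distribution. The base R sketch below, on simulated values only, computes those pairs without drawing anything and uses their correlation as a rough indicator of how close the points lie to a straight line. The correlation is my own illustrative summary, not something ggqqplot() reports.

```r
set.seed(7)
x <- rnorm(300, mean = 2, sd = 0.5)   # simulated, approximately normal values

# qq$x holds the theoretical normal quantiles, qq$y the sample values
qq <- qqnorm(x, plot.it = FALSE)

# For approximately normal data the points fall close to a straight line
r <- cor(qq$x, qq$y)
print(r)

# A skewed sample bends away from the line, which lowers the correlation
skewed <- rexp(300)
r_skewed <- cor(qqnorm(skewed, plot.it = FALSE)$x, skewed)
print(r_skewed)
```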
After seeing the demonstrations above, are you eager to try it yourself? Stay tuned for my follow-up articles!