参考学习资料:https://academic.oup.com/bioinformatics/article/34/14/2515/4917355
之前看了技能树的系列推文和曾老师的推荐的文献,对ce-RNA有了初步的认识,为这个教程的理解奠定了理论基础。
1 安装包GDCRNATools
及相关依赖包
rm(list = ls())
options(stringsAsFactors = F)
#if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
#BiocManager::install("GDCRNATools", version = "devel")这个版本的需要R版本'4.0'
## try http:// if https:// URLs are not supported if (!requireNamespace("BiocManager", quietly=TRUE))
BiocManager::install("GDCRNATools")
BiocManager::install("DT")
library(GDCRNATools)
library(DT)
2 准备测试数据
在GDCRNATools
中有些函数被设计来帮助其他人有效的下载和处理GDC数据。当然用户也可以用他们自己的数据从UCSC Xena GDC hub
, TCGAbiolinks
(Colaprico et al. 2016)等处获取的,或者TCGA-Assembler
(Zhu, Qiu, and Ji 2014)都可以。
具体见case study
library(DT)
### load RNA counts data
data(rnaCounts)
### load miRNAs counts data data(mirCounts)
data(mirCounts)
2.1Normalization of HTSeq-Counts data
####### Normalization of RNAseq data #######
rnaExpr <- gdcVoomNormalization(counts = rnaCounts, filter = FALSE)
####### Normalization of miRNAs data #######
mirExpr <- gdcVoomNormalization(counts = mirCounts, filter = FALSE)
2.2 Parse metadata
####### Parse and filter RNAseq metadata #######
metaMatrix.RNA <- gdcParseMetadata(project.id = 'TCGA-CHOL',
data.type = 'RNAseq',
write.meta = FALSE)
metaMatrix.RNA <- gdcFilterDuplicate(metaMatrix.RNA)
metaMatrix.RNA <- gdcFilterSampleType(metaMatrix.RNA)
datatable(as.data.frame(metaMatrix.RNA[1:5,]), extensions =
'Scroller',options = list(scrollX = TRUE, deferRender = TRUE,
scroller = TRUE))
样本信息输出列表:
3 ceRNAs network analysis
3.1 Identication of dierentially expressed genes (DEGs)
DEGAll <- gdcDEAnalysis(counts = rnaCounts,
group = metaMatrix.RNA$sample_type,
comparison = 'PrimaryTumor-SolidTissueNormal',
method = 'limma')
datatable(as.data.frame(DEGAll),
options = list(scrollX = TRUE, pageLength = 5))
差异基因列表:
所有差异基因分类获取
### All DEGs
deALL <- gdcDEReport(deg = DEGAll, gene.type = 'all')
### DE long-noncoding
deLNC <- gdcDEReport(deg = DEGAll, gene.type = 'long_non_coding')
### DE protein coding genes
dePC <- gdcDEReport(deg = DEGAll, gene.type = 'protein_coding')
3.2 ceRNAs network analysis of DEGs
> ceOutput <- gdcCEAnalysis(lnc = rownames(deLNC), pc = rownames(dePC),
+ lnc.targets = 'starBase',
+ pc.targets = 'starBase',
+ rna.expr = rnaExpr,
+ mir.expr = mirExpr)
Step 1/3: Hypergenometric test done !
Step 2/3: Correlation analysis done !
Step 3/3: Regulation pattern analysis done !
> datatable(as.data.frame(ceOutput),
+ options = list(scrollX = TRUE, pageLength = 5))
3.3 Export ceRNAs network to Cytoscape
ceOutput2 <- ceOutput[ceOutput$hyperPValue<0.01
& ceOutput$corPValue<0.01 & ceOutput$regSim != 0,]
### Export edges
edges <- gdcExportNetwork(ceNetwork = ceOutput2, net = 'edges')
datatable(as.data.frame(edges),
options = list(scrollX = TRUE, pageLength = 5))
导出数据,进一步通过Cytoscape进行可视化
### Export nodes
nodes <- gdcExportNetwork(ceNetwork = ceOutput2, net = 'nodes')
datatable(as.data.frame(nodes),
options = list(scrollX = TRUE, pageLength = 5))
Case study: TCGA-CHOL
下载RNA 和miRNA表达矩阵及临床信息:
####### Download RNAseq data #######
gdcRNADownload(project.id = 'TCGA-CHOL',
data.type = 'RNAseq',
write.manifest = FALSE,
method = 'gdc-client',
directory = rnadir)
####### Download mature miRNA data #######
gdcRNADownload(project.id = 'TCGA-CHOL',
data.type = 'miRNAs',
write.manifest = FALSE,
method = 'gdc-client',
directory = mirdir)
####### Download clinical data #######
clinicaldir <- paste(project, 'Clinical', sep='/')
gdcClinicalDownload(project.id = 'TCGA-CHOL',
write.manifest = FALSE,
method = 'gdc-client',
directory = clinicaldir)
自动下载网络可能不好,下载很慢,也可以手动下载:
用户可以从GDC cart下载 manifest file
- Step1: Download
GDC Data Transfer Tool
on the GDC website - Step2: Add data to the GDC cart, then download manifest file and metadata of the cart
- Step3: Download data using
gdcRNADownload()
orgdcClinicalDownload()
functions by providing the manifest file
Data organization and DE analysis
临床信息来源可以是metadata file (.json) 从下载步骤自动下载的, 或者是
project.id
和data.type
通过gdcParseMetadata()
函数从manifest file获取的诸如age, stage and gender
等。只有一个样本最终会被保留,如果测序次数不只一次的情况下,过滤函数
gdcFilterDuplicate()
做的这件事。样本既不是Primary Tumor (code: 01) 也不是Solid Tissue Normal (code: 11) 将会被
gdcFilterSampleType()
函数过滤掉。
Parse metadata
####### Parse RNAseq metadata #######
metaMatrix.RNA <- gdcParseMetadata(project.id = 'TCGA-CHOL',
data.type = 'RNAseq',
write.meta = FALSE)
####### Filter duplicated samples in RNAseq metadata #######
metaMatrix.RNA <- gdcFilterDuplicate(metaMatrix.RNA)
####### Filter non-Primary Tumor and non-Solid Tissue Normal samples in RNAseq metadata #######
metaMatrix.RNA <- gdcFilterSampleType(metaMatrix.RNA)
####### Parse miRNAs metadata #######
metaMatrix.MIR <- gdcParseMetadata(project.id = 'TCGA-CHOL',
data.type = 'miRNAs',
write.meta = FALSE)
####### Filter duplicated samples in miRNAs metadata #######
metaMatrix.MIR <- gdcFilterDuplicate(metaMatrix.MIR)
####### Filter non-Primary Tumor and non-Solid Tissue Normal samples in miRNAs metadata #######
metaMatrix.MIR <- gdcFilterSampleType(metaMatrix.MIR)
Merge raw counts data
gdcRNAMerge()
合并raw counts data of RNAseq到一个表达矩阵行是Ensembl id
列是samples
. miRNAs的5p
和3p
分别来自isoform quantication文件和数据库miRBase v21. 如果数据样本来自不同的样本和不同的文件夹设置参数specify organized=FALSE
此外设置specify organized=TRUE
.
####### Merge RNAseq data #######
rnaCounts <- gdcRNAMerge(metadata = metaMatrix.RNA,
path = rnadir,
organized = FALSE, # if the data are in separate folders
data.type = 'RNAseq')
####### Merge miRNAs data #######
mirCounts <- gdcRNAMerge(metadata = metaMatrix.MIR,
path = mirdir, # the folder in which the data
organized = FALSE, # if the data are in separate folders
data.type = 'miRNAs')
Merge clinical data
设置参数key.info=TRUE
, 仅common clinical信息可以被保留否则所有的clinical information从XML文件中被提取。
####### Merge clinical data #######
clinicalDa <- gdcClinicalMerge(path = clinicaldir, key.info = TRUE)
clinicalDa[1:6,5:10]
TMM normalization and voom transformation
edgeR
(Robinson, McCarthy, and Smyth 2010) 中的函数TMM
将raw counts
进行normalized
然后通过limma
(Ritchie et al. 2015)函数进一步转换。低表达基因(logcpm < 1 in more than half of the samples)
默认会被过滤掉。所有的基因可以通过设置gdcVoomNormalization()
的参数filter=TRUE
进行保留。
####### Normalization of RNAseq data #######
rnaExpr <- gdcVoomNormalization(counts = rnaCounts, filter = FALSE)
####### Normalization of miRNAs data #######
mirExpr <- gdcVoomNormalization(counts = mirCounts, filter = FALSE)
Dierential gene expression analysis
通常,人们对在不同组之间差异表达的基因感兴趣(eg. Primary Tumor vs. Solid Tissue Normal)。一种简易包装函数gdcDEAnalysis()
来自limma
, edgeR
和 DESeq2
用于获取差异基因(DEGs)或差异miRNAs。 注意,DESeq2
对于单核处理器可能会很慢。 如果使用了DESeq2
,则可以使用nCore
参数指定多个内核。 鼓励用户查阅每种方法的vignette
帮助文档,以获取更多详细信息。
DEGAll <- gdcDEAnalysis(counts = rnaCounts,
group = metaMatrix.RNA$sample_type,
comparison = 'PrimaryTumor-SolidTissueNormal',
method = 'limma')
所有DEGs,DE长非编码基因,DE蛋白编码基因和DE miRNA都可以通过在gdcDEReport()
中设置geneType
参数来分别报告。 报告中输出了基于“ Ensembl 90”注释的Gene symbols和biotypes。
data(DEGAll)
### All DEGs
deALL <- gdcDEReport(deg = DEGAll, gene.type = 'all')
### DE long-noncoding
deLNC <- gdcDEReport(deg = DEGAll, gene.type = 'long_non_coding')
### DE protein coding genes
dePC <- gdcDEReport(deg = DEGAll, gene.type = 'protein_coding')
DEG visualization
火山图和条形图分别通过gdcVolcanoPlot()
和gdcBarPlot()
函数以不同的方式可视化DE分析结果。可以通过gdcHeatmap()
函数分析和绘制DEGs表达矩阵的分层聚类。
Volcano plot,Barplot
a1 <- gdcVolcanoPlot(DEGAll)
a2 <- gdcBarPlot(deg = deALL, angle = 45, data.type = 'RNAseq')
library(patchwork)
a1+a2
Heatmap
热图是基于gplots
包中的heatmap.2()
函数生成的。
degName = rownames(deALL)
gdcHeatmap(deg.id = degName, metadata = metaMatrix.RNA, rna.expr = rnaExpr)
Competing endogenous RNAs network analysis
Three criteria are used to determine the competing endogenous interactions between lncRNA-mRNA pairs:
- The lncRNA and mRNA must share signicant number of miRNAs
- Expression of lncRNA and mRNA must be positively correlated
- Those common miRNAs should play similar roles in regulating the expression of lncRNA and mRNA
Hypergeometric test
超几何检验以测试lncRNA和mRNA是否显著共享许多miRNA。
新开发的算法spongeScan
(Furi’o-Tar’i et al. 2016) 用于预测充当ceRNA的lncRNA中的MREs。 使用starBase v2.0
(Li et al. 2014),miRcode
(Jeggari, Marks, and Larsson 2012) and mirTarBase release 7.0
(Chou et al. 2017)等数据库收集预测的和经过实验验证的miRNA-mRNA和/或 miRNA-lncRNA相互作用。 这些数据库中的基因ID已更新为人类基因组的最新Ensembl 90注释,而miRNA名称已更新为新版本的miRBase 21标识符。 用户还可以提供自己的miRNA-lncRNA和miRNA-mRNA相互作用数据集。
here m is the number of shared miRNAs, N is the total number of miRNAs in the database, n is the number of miRNAs targeting the lncRNA, K is the number of miRNAs targeting the protein coding gene.
Pearson correlation analysis
皮尔逊相关系数是两个变量之间线性关联强度的度量。 众所周知,miRNA是基因表达的负调控因子。 如果更多的常见miRNA被lncRNA占据,则它们中的更少将与靶mRNA结合,从而增加mRNA的表达水平。 因此,在ceRNA对中lncRNA和mRNA的表达应呈正相关。
Two methods are used to measure the regulatory role of miRNAs on the lncRNA and mRNA:
-
Regulation similarity
-
Sensitivity correlation
看到公式头就晕,数学太差不想深究。暂时搁置。
ceRNAs network analysis
miRNA和lncRNA-mRNA共享的超几何检验的表达相关性分析及调控模式分析均已在gdcCEAnalysis()
函数。
ceRNAs network analysis using internal databases
用户可以使用内部整合的miRNA-mRNA(starBase v2.0, miRcode, and mirTarBase v7.0)
和miRNA-lncRNA(starBase v2.0, miRcode, spongeScan)
交互的数据库来进行ceRNAs网络分析。
ceOutput <- gdcCEAnalysis(lnc = rownames(deLNC),
pc = rownames(dePC),
lnc.targets = 'starBase',
pc.targets = 'starBase',
rna.expr = rnaExpr,
mir.expr = mirExpr)
ceRNAs network analysis using user-provided datasets
gdcCEAnalysis()
还可以获取用户提供的miRNA-mRNA和miRNA-lncRNA相互作用数据集,例如TargetScan
,miRanda
和Diana Tools
等预测的miRNA-靶标相互作用,用于ceRNAs网络分析。
### load miRNA-lncRNA interactions
data(lncTarget)
### load miRNA-mRNA interactions
data(pcTarget)
pcTarget[1:3]
ceOutput <- gdcCEAnalysis(lnc = rownames(deLNC),
pc = rownames(dePC),
lnc.targets = lncTarget,
pc.targets = pcTarget,
rna.expr = rnaExpr,
mir.expr = mirExpr)
Network visulization in Cytoscape
lncRNA-miRNA-mRNA相互作用可以通过gdcExportNetwork()
报告,并在Cytoscape
中可视化。 应将edges
作为网络导入,将节点作为特征表导入。
ceOutput2 <- ceOutput[ceOutput$hyperPValue<0.01 &
ceOutput$corPValue<0.01 & ceOutput$regSim != 0,]
edges <- gdcExportNetwork(ceNetwork = ceOutput2, net = 'edges')
nodes <- gdcExportNetwork(ceNetwork = ceOutput2, net = 'nodes')
write.table(edges, file='edges.txt', sep='\t', quote=F)
write.table(nodes, file='nodes.txt', sep='\t', quote=F)
Correlation plot
gdcCorPlot(gene1 = 'ENSG00000251165',
gene2 = 'ENSG00000091831',
rna.expr = rnaExpr,
metadata = metaMatrix.RNA)
Correlation plot on a local webpage
只需单击每个下拉框(in the GUI window)的基因,即可轻松操作基于shiny
软件包的交互式绘图功能shinyCorPlot()
。 通过运行使用shinyCorPlot()
功能,将弹出一个本地网页,并自动显示lncRNA和mRNA之间的相关图。
shinyCorPlot(gene1 = rownames(deLNC),
gene2 = rownames(dePC),
rna.expr = rnaExpr,
metadata = metaMatrix.RNA)
这个图很棒,可以随意挑选任意搭配。
Other downstream analyses
在GDCRNATools
软件包中开发了下游分析,例如单变量生存分析和功能富集分析,以促进ceRNA网络中基因的鉴定,这些基因在预后中起重要作用或在重要途径中起作用。
Univariate survival analysis
提供了两种方法来执行单变量生存分析:Cox Proportional-Hazards (CoxPH)
模型和基于survival
包的 Kaplan Meier (KM)
分析。 CoxPH
模型将表达值视为连续变量,而KM
分析通过用户定义的阈值(例如中位数或均值)将患者分为高表达和低表达组。gdcSurvivalAnalysis()
将基因列表作为输入,并报告hazard ratio,95%的置信区间,并测试每种基因对总体存活率的影响。
CoxPH analysis
####### CoxPH analysis #######
survOutput <- gdcSurvivalAnalysis(gene = rownames(deALL),
method = 'coxph',
rna.expr = rnaExpr,
metadata = metaMatrix.RNA)
KM analysis
####### KM analysis #######
survOutput <- gdcSurvivalAnalysis(gene = rownames(deALL),
method = 'KM',
rna.expr = rnaExpr,
metadata = metaMatrix.RNA,
sep = 'median')
KM plot
gdcKMPlot(gene = 'ENSG00000136193',
rna.expr = rnaExpr,
metadata = metaMatrix.RNA,
sep = 'median')
KM plot on a local webpage by shinyKMPlot
The shinyKMPlot()
function is also a simple shiny
app which allow users view KM plots (based on the R package survminer
.) of all genes of interests on a local webpackage conveniently.
shinyKMPlot(gene = rownames(deALL), rna.expr = rnaExpr,
metadata = metaMatrix.RNA)
这里应该可以批量导出生存曲线了,也是挺不错的小工具。
Functional enrichment analysis
gdcEnrichAnalysis()
can perform Gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Disease Ontology (DO) functional enrichment analyses of a list of genes simultaneously. GO and KEGG analyses are based on the R/Bioconductor packages clusterProlier
(Yu et al. 2012) and DOSE
(Yu et al. 2015). Redundant GO terms can be removed by specifying simplify=TRUE
in the gdcEnrichAnalysis()
function which uses the simplify()
function in the clusterProlier
(Yu et al. 2012) package.
enrichOutput <- gdcEnrichAnalysis(gene = rownames(deALL), simplify = TRUE)
这一步做了5种富集分析,耗时有点长。
Barplot
data(enrichOutput)
gdcEnrichPlot(enrichOutput, type = 'bar', category = 'GO', num.terms = 10)
Bubble plot
gdcEnrichPlot(enrichOutput, type='bubble', category='GO', num.terms = 10)
View pathway maps on a local webpage
shinyPathview()
allows users view and download pathways of interests by simply selecting the pathway terms on a local webpage.
library(pathview)
deg <- deALL$logFC
names(deg) <- rownames(deALL)
pathways <- as.character(enrichOutput$Terms[enrichOutput$Category=='KEGG'])
pathways
shinyPathview(deg, pathways = pathways, directory = 'pathview')
> pathways
[1] "hsa05414~Dilated cardiomyopathy (DCM)"
[2] "hsa05410~Hypertrophic cardiomyopathy (HCM)"
[3] "hsa05412~Arrhythmogenic right ventricular cardiomyopathy (ARVC)"
[4] "hsa04512~ECM-receptor interaction"
[5] "hsa04510~Focal adhesion"
[6] "hsa04360~Axon guidance"
[7] "hsa04270~Vascular smooth muscle contraction"
[8] "hsa05205~Proteoglycans in cancer"
[9] "hsa04022~cGMP-PKG signaling pathway"
[10] "hsa00480~Glutathione metabolism"
sessionInfo
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] shiny_1.4.0 GDCRNATools_1.6.0 pathview_1.26.0
[4] org.Hs.eg.db_3.10.0 AnnotationDbi_1.48.0 IRanges_2.20.0
[7] S4Vectors_0.24.0 Biobase_2.46.0 BiocGenerics_0.32.0
loaded via a namespace (and not attached):
[1] backports_1.1.5 Hmisc_4.2-0
[3] fastmatch_1.1-0 BiocFileCache_1.10.0
[5] plyr_1.8.4 igraph_1.2.4.1
[7] lazyeval_0.2.2 splines_3.6.1
[9] BiocParallel_1.20.0 GenomeInfoDb_1.22.0
[11] ggplot2_3.2.1 urltools_1.7.3
[13] digest_0.6.23 htmltools_0.4.0
[15] GOSemSim_2.12.0 rsconnect_0.8.15
[17] viridis_0.5.1 GO.db_3.10.0
[19] gdata_2.18.0 magrittr_1.5
[21] checkmate_1.9.4 memoise_1.1.0
[23] cluster_2.1.0 limma_3.42.0
[25] Biostrings_2.54.0 readr_1.3.1
[27] annotate_1.64.0 graphlayouts_0.5.0
[29] matrixStats_0.55.0 askpass_1.1
[31] enrichplot_1.6.0 prettyunits_1.0.2
[33] colorspace_1.4-1 blob_1.2.0
[35] rappdirs_0.3.1 ggrepel_0.8.1
[37] xfun_0.10 dplyr_0.8.3
[39] crayon_1.3.4 RCurl_1.95-4.12
[41] jsonlite_1.6 graph_1.64.0
[43] genefilter_1.68.0 zeallot_0.1.0
[45] zoo_1.8-6 survival_2.44-1.1
[47] glue_1.3.1 survminer_0.4.6
[49] GenomicDataCommons_1.10.0 polyclip_1.10-0
[51] gtable_0.3.0 zlibbioc_1.32.0
[53] XVector_0.26.0 DelayedArray_0.12.0
[55] Rgraphviz_2.30.0 scales_1.1.0
[57] DOSE_3.12.0 DBI_1.0.0
[59] edgeR_3.28.0 Rcpp_1.0.3
[61] viridisLite_0.3.0 xtable_1.8-4
[63] progress_1.2.2 htmlTable_1.13.2
[65] gridGraphics_0.4-1 foreign_0.8-72
[67] bit_1.1-14 europepmc_0.3
[69] km.ci_0.5-2 Formula_1.2-3
[71] DT_0.11 htmlwidgets_1.5.1
[73] httr_1.4.1 fgsea_1.12.0
[75] gplots_3.0.1.1 RColorBrewer_1.1-2
[77] acepack_1.4.1 pkgconfig_2.0.3
[79] XML_3.98-1.20 farver_2.0.1
[81] nnet_7.3-12 dbplyr_1.4.2
[83] locfit_1.5-9.1 ggplotify_0.0.4
[85] tidyselect_0.2.5 rlang_0.4.2
[87] reshape2_1.4.3 later_1.0.0
[89] munsell_0.5.0 tools_3.6.1
[91] generics_0.0.2 RSQLite_2.1.2
[93] broom_0.5.2 ggridges_0.5.1
[95] stringr_1.4.0 fastmap_1.0.1
[97] yaml_2.2.0 knitr_1.25
[99] bit64_0.9-7 tidygraph_1.1.2
[101] survMisc_0.5.5 caTools_1.17.1.2
[103] purrr_0.3.3 KEGGREST_1.26.0
[105] ggraph_2.0.0 nlme_3.1-141
[107] mime_0.7 KEGGgraph_1.46.0
[109] DO.db_2.9 xml2_1.2.2
[111] biomaRt_2.42.0 compiler_3.6.1
[113] rstudioapi_0.10 png_0.1-7
[115] curl_4.2 ggsignif_0.6.0
[117] tibble_2.1.3 tweenr_1.0.1
[119] geneplotter_1.64.0 stringi_1.4.3
[121] lattice_0.20-38 Matrix_1.2-17
[123] KMsurv_0.1-5 vctrs_0.2.0
[125] pillar_1.4.2 lifecycle_0.1.0
[127] BiocManager_1.30.9 triebeard_0.3.0
[129] data.table_1.12.6 cowplot_1.0.0
[131] bitops_1.0-6 httpuv_1.5.2
[133] GenomicRanges_1.38.0 qvalue_2.18.0
[135] R6_2.4.1 latticeExtra_0.6-28
[137] promises_1.1.0 KernSmooth_2.23-16
[139] gridExtra_2.3 gtools_3.8.1
[141] MASS_7.3-51.4 assertthat_0.2.1
[143] SummarizedExperiment_1.16.0 rjson_0.2.20
[145] openssl_1.4.1 DESeq2_1.26.0
[147] GenomeInfoDbData_1.2.2 hms_0.5.2
[149] clusterProfiler_3.14.1 grid_3.6.1
[151] rpart_4.1-15 tidyr_1.0.0
[153] rvcheck_0.1.6 ggpubr_0.2.4
[155] ggforce_0.3.1 base64enc_0.1-3
这个教程还是非常有实用价值的。很多函数需要进一步理解,当然在学习这个教程之前最好还是先学习一下TCGA教程,不然有些不太容易理解。