《生物信息学生R入门教程》读书笔记 Chapter 7

这一章来介绍NGS的下游分析

富集分析

简单的讲,富集分析就是一种统计分析的手段,用来筛选功能相类的一组基因是否富集中差异表达的基因中。当人们拿到了差异表达的基因时,很多时候因为差异表达的基因数量很多,面对这么多的基因,人们不知道如何找到合适的突破口进行下游的验证实验。于是从一堆差异表达的基因中找出有意义的基因进行RT-PCR验证以及基因敲除验证就需要使用到富集分析了

选择正确的库(library, 或者说是universe gene id)。正确的库的大小的选择决定着分析的结果是否可靠

1.GO富集

示例代码:

library(clusterProfiler)
#加载人类的库
library(org.Hs.eg.db)
data(geneList, package="DOSE")
#筛选genelist
gene <- names(geneList)[abs(geneList) > 2]
#转换geneID
gene.df <- bitr(gene, fromType = "ENTREZID",
        toType = c("ENSEMBL", "SYMBOL"),
        OrgDb = org.Hs.eg.db)
#功能富集
ego <- enrichGO(gene          = gene,
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                readable      = TRUE)

2.KEGG

#kegg富集
kk <- enrichKEGG(gene         = gene,
                 organism     = 'hsa',
                 pvalueCutoff = 0.05)

热图分析

热图(heatmap)。热图是一种三维数据转二维的表达手段,它通过使用不同的颜色来表达本应该在第三维上显示的数据。热图的方便的于,它可以在一张二维图像上快速的通过颜色的差异来了解组内的一致性及差异组间的差别
热图的软件包有很多,常用的有pheatmap, ggplot2以及ComplexHeatmap

以ComplexHeatmap:

## 首先, 我们先生成一个随机数据。
set.seed(1)
mat <- matrix(runif(36), nrow=6, 
              dimnames=list(paste("row", letters[1:6], sep="_"), 
                            paste("col", LETTERS[1:6], sep="_")))
library(ComplexHeatmap)

当然ComplexHeatmap还有一些高级用法
生成两个热图,并对行列进行注释:

ha_column <- 
  HeatmapAnnotation(
    df = data.frame(## data.frame, 行数与mat的列数一致。
      type1 = sample(c("a", "b"), 
                     ncol(mat), 
                     replace = TRUE)),
    col = list(type1 = c("a" = "red", "b" = "blue")) 
    ## 颜色list, 名字与df中的列字一致,
    ##           元素名与df列中的出现的元素一致。
)
ha_row <- rowAnnotation( ## 对行进行注释
  df = data.frame(type2 = sample(c("A", "B"),
                                 nrow(mat), 
                                 replace = TRUE)),
  col = list(type2 = c("A" = "green", "B" = "cyan")),
  width = unit(1, "cm") ##指定宽度,必须是unit,参见grid::unit
)

## 生成两个Heatmap实例。这里我们使用了相同的数据。
## 在实际应用中,它应该是行数一致的不同数据。
ht1 <- Heatmap(mat, name = "ht1", #name将出现的图例中
               column_title = "Heatmap 1", #title将出现在标题中
               top_annotation = ha_column) #ha_column将出现在热图上方(也可以是下方)
ht2 <- Heatmap(mat, name = "ht2", 
               column_title = "Heatmap 2")
ht_list <- ht1 + ht2 + ha_row
draw(ht_list)
#把图例调来左边
draw(ht_list, heatmap_legend_side = "left", 
     show_annotation_legend = FALSE)

当然利用HeatmapAnnotation()函数来进行加工也可以,可以在热图的基础上加点图,条形图
1.row_anno_points():点图
2.row_anno_barplot():条形图
3.row_anno_boxplot():箱线图
4.row_anno_histogram():直方图
5.row_anno_density():密度图

ha_mix_top <- HeatmapAnnotation(
  points = anno_points(runif(ncol(mat))), ## points
  barplot = anno_barplot(runif(ncol(mat)), axis=TRUE), ## barplot
  histogram = anno_histogram(mat, 
                             gp = gpar(fill = 1:6)), ## histogram
  density_line = anno_density(mat, type = "line", 
                              gp = gpar(col = 1:6)), ## density
  violin = anno_density(mat, type = "violin", gp = gpar(fill = 6:1)), ## violin
  box = anno_boxplot(mat), ## boxplot
  heatmap = anno_density(mat, type = "heatmap"), ## heatmap
  annotation_height = unit(8, "cm") ## annotation height
)
Heatmap(mat, name = "mat", 
        top_annotation = ha_mix_top)

如果想只对个别的行或者列标示名称,其它的不标注

mat = matrix(rnorm(10000), nr = 1000)
rownames(mat) = sprintf("%.2f", rowMeans(mat))
subset = sample(1000, 20)
labels = rownames(mat)[subset]
Heatmap(mat, show_row_names = FALSE, show_row_dend = FALSE, show_column_dend = FALSE) + 
rowAnnotation(link = row_anno_link(at = subset, labels = labels),
  width = unit(1, "cm") + max_text_width(labels))

突变分析

常用于TCGA分析数据
我们先构建对象

mat <- read.table(textConnection(
",s1,s2,s3
g1,snv1;indel,snv1;snv2,indel
g2,,snv2;indel,snv1
g3,snv1,,indel;snv1;snv2"), 
row.names = 1, header = TRUE, 
sep = ",", stringsAsFactors = FALSE)
mat <- as.matrix(mat)
mat

数据结构长这样


画热图来查看各种突变的占比:

library(ComplexHeatmap)
col <- c(snv1 = "#008000", snv2 = "#008000", indel = "blue")
oncoPrint(mat, get_type = function(x) strsplit(x, ";")[[1]],
    alter_fun = list(
      indel = function(x, y, w, h) 
        grid.rect(x, y, w*0.9, h*0.9, 
                  gp = gpar(fill = col["indel"], 
                            col = NA)) , #must be draw first
        snv1 = function(x, y, w, h) 
          grid.rect(x, y-.15*h, w*0.9, h*0.2, 
                    gp = gpar(fill = col["snv1"], 
                              col = NA)),
        snv2 = function(x, y, w, h) 
          grid.rect(x, y+.15*h, w*0.9, h*0.2, 
                    gp = gpar(fill = col["snv2"], 
                              col = NA))
    ), col = col)

上面的例子有两个重要的概念,一个是get_type,一个是alter_fun。两者都接受一个函数,其中get_type是如何将mat中的每个单元如何分解成突变类型,alter_fun是如何对每一种突变类型进行绘图。 get_type接受一个参数x,这个x将会是mat中的每一个值。比如这里的mat[g1, s1]就是"snv1;indel"。而alter_fun将会是一个list of function,list中的每个元素名称都将是可能出现在get_type得到的值中。list中的每一个函数有四个参数,x,y,w,h,它们分别指heatmap中对应mat中的每个单元格的x, y的坐标以及单元格的高度和宽度。

接下来我们看下如何画瀑布图:

#建立对象
mat <- read.delim(file.path(system.file("extdata", "tcga_lung_adenocarcinoma_provisional_ras_raf_mek_jnk_signalling.txt",
                                       package = "ComplexHeatmap")), 
                 stringsAsFactors=FALSE, row.names=1)
mat[is.na(mat)] <- "  "
mat <-  mat[, -ncol(mat)]
mat = t(as.matrix(mat))

##假设有三种突变类型“MUT”,“AMP”,“HOMDEL”
#写出类型
alter_fun <- list(
    background = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "#CCCCCC", col = NA))
    },
    HOMDEL = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "blue", col = NA))
    },
    AMP = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h-unit(0.5, "mm"), gp = gpar(fill = "red", col = NA))
    },
    MUT = function(x, y, w, h) {
        grid.rect(x, y, w-unit(0.5, "mm"), h*0.33, gp = gpar(fill = "#008000", col = NA))
    }
)
#定义颜色
col = c("MUT" = "#008000", "AMP" = "red", "HOMDEL" = "blue")
#读取行名信息
sample_order <- scan(system.file("extdata", "sample_order.txt",
                                package = "ComplexHeatmap"), 
                    what = "character")
#构建对象
ht <- oncoPrint(mat, get_type = function(x) strsplit(x, ";")[[1]],
                alter_fun = alter_fun, col = col, 
                row_order = NULL, column_order = sample_order,
                remove_empty_columns = TRUE,
                column_title = "OncoPrint for TCGA Lung Adenocarcinoma, genes in Ras Raf MEK JNK signalling",
                heatmap_legend_param = 
                  list(title = "Alternations", 
                       at = c("AMP", "HOMDEL", "MUT"), 
                       labels = c("Amplification", 
                                  "Deep deletion", 
                                  "Mutation"),
                       nrow = 1, title_position = "leftcenter"))
#画图
draw(ht, heatmap_legend_side = "bottom")


这里有另外一位大佬写的,挺不错的,欢迎大家参阅
https://www.jianshu.com/p/deb17d0bcd58

基因网络

在实验中,我们常常会将RNAseq的结果和ChIPseq的结果结合起来,以利于提高下游的验证率,在这个过程中,可以使用GeneNetworkBuilder来查找高验证率的目标基因。GeneNetworkBuilder需要两个实验输入,一个是全基因 组的表达图谱,一个是目标蛋白的目标基因。GeneNetworkbuilder还需要该物种的调控网络数据,该数据是为了联接实验数据迷失的目标基因。

library(GeneNetworkBuilder)
library(simpIntLists)
i <- findInteractionList("human", "Official")
i <- lapply(i, function(.ele) cbind(from=as.character(.ele$name), to=as.character(.ele$interactors)))
i <- do.call(rbind, i)
set.seed(123)
## generate a random ChIP-seq binding table
rootgene <- sample(i[, 1], 1)
TFbindingTable <- i[i[, 1] == rootgene, ]
interactionmap <- i
# build network
sifNetwork<-buildNetwork(TFbindingTable=TFbindingTable, 
                        interactionmap=interactionmap, level=2)
ID=unique(as.character(sifNetwork))
## create a random expression data
expressionData <- data.frame(ID=ID,
                             logFC=sample(-3:3, length(ID), replace=TRUE),
                             P.Value=runif(n=length(ID), max=0.25))
## filter network
cifNetwork<-filterNetwork(rootgene=rootgene, sifNetwork=sifNetwork, 
                    exprsData=expressionData, mergeBy="ID",
                    miRNAlist=character(0), 
                    tolerance=1, cutoffPVal=0.01, cutoffLFC=1)
## polish network
gR<-polishNetwork(cifNetwork)
## browse network
# browseNetwork(gR) # interactive htmlwidget; 
## or plot by Rgraphviz
library(Rgraphviz)
plotNetwork<-function(gR, layouttype="dot", ...){
    if(!is(gR,"graphNEL")) stop("gR must be a graphNEL object")
    if(!(GeneNetworkBuilder:::inList(layouttype, c("dot", "neato", "twopi", "circo", "fdp")))){
        stop("layouttype must be dot, neato, twopi, circo or fdp")
    }
    g1<-Rgraphviz::layoutGraph(gR, layoutType=layouttype, ...)
    nodeRenderInfo(g1)$col <- nodeRenderInfo(gR)$col
    nodeRenderInfo(g1)$fill <- nodeRenderInfo(gR)$fill
    renderGraph(g1)
}
plotNetwork(gR)

这个包的官方文档:
http://www.bioconductor.org/packages/release/bioc/vignettes/GeneNetworkBuilder/inst/doc/GeneNetworkBuilder_vignettes.html

查看Chip-seq信号强度

当拿到ChIPseq的结果后,我们可以使用众多手段来查看reads的真实情况,比如说使用IGV, UCSC genome browser等。但是,当大家需要把这个track生成图片发表时,这些工具提供的图片输出有时候无法达到发表的要求,于是很多软件包就因此而生。最常使用到的软件包是Gviz。然而,我想大家也看出来了,本文夹私的现象非常严重,所以,我重点介绍一下trackViewer软件包。
trackViewer软件包不但可以输出高品质的track图,还可以生成高颜值的lollipop plot。甚至,还在推广一种名为Dandelion plot的表示SNP以及Indel数据的可视化图形。

library(trackViewer)
gr <- parse2GRanges("chr3:108,476,000-108,485,000")
gr # interesting genomic locations

library(GenomicFeatures)# load GenomicFeatures can create TxDb from UCSC
if(interactive()){
    mm8KG <- makeTxDbFromUCSC(genome="mm8", tablename="knownGene")
    saveDb(mm8KG, "mm8KG.sqlite")
}else{## mm8KG was saved as sqlite file
    mm8KG <- loadDb("data/mm8KG.sqlite")
}
library(org.Mm.eg.db) # load annotation database

## create the gene model tracks information
trs <- geneModelFromTxdb(mm8KG, org.Mm.eg.db, gr=gr)
## import data from bedGraph/bigWig/BED ... files, see ?importScore for details
CLIP <- importScore("data/CLIP.bedGraph", format="bedGraph", ranges=gr)
#bedGraph文件格式
control <- importScore("data/control.bedGraph", format="bedGraph", ranges=gr)
knockdown <- importScore("data/knockdown.bedGraph", format="bedGraph", ranges=gr)
## create styles by preset theme
optSty <- optimizeStyle(trackList(trs, knockdown, control, CLIP), theme="col")
trackList <- optSty$tracks
viewerStyle <- optSty$style
## adjust the styles for this track
### rename the trackList for each track
names(trackList)[1:2] <- paste0("Sort1: ", names(trackList)[1:2])
names(trackList)[3] <- "RNA-seq TDP-43 KD"
names(trackList)[4] <- "RNA-seq control"
### change the lab positions for gene model track to bottomleft
setTrackStyleParam(trackList[[1]], "ylabpos", "bottomleft")
setTrackStyleParam(trackList[[2]], "ylabpos", "bottomleft")
### change the color of gene model track
setTrackStyleParam(trackList[[1]], "ylabgp", list(cex=1, col="red"))
setTrackStyleParam(trackList[[2]], "ylabgp", list(cex=1, col="green"))
### remove the xaxis
setTrackViewerStyleParam(viewerStyle, "xaxis", FALSE)
### add a scale bar in CLIP track
setTrackXscaleParam(trackList[[5]], "draw", TRUE)
## plot the tracks
vp <- viewTracks(trackList, gr=gr, viewerStyle=viewerStyle)
### add guide lines to show the range of CLIP-seq signal
addGuideLine(c(108481252, 108481887), vp=vp)
### add arrow mark to show the alternative splicing event
addArrowMark(list(x=c(108483570, 108483570), 
                  y=c(3, 4)), 
             label=c("Inclusive\nexon", ""), 
             col=c("blue", "cyan"), 
             vp=vp, quadrant=1)

trackViewer官方文档:
http://www.bioconductor.org/packages/release/bioc/vignettes/trackViewer/inst/doc/trackViewer.html

motif分析

motif一般利用MEME Suite或者Homer来进行motif搜索,然后利用motifStack包进行可视化
motifStack官方文档:
http://www.bioconductor.org/packages/release/bioc/vignettes/motifStack/inst/doc/motifStack_HTML.html

library(motifStack)
pcm <- read.table(file.path(find.package("motifStack"), 
                            "extdata", "bin_SOLEXA.pcm"))
pcm <- pcm[,3:ncol(pcm)]
rownames(pcm) <- c("A","C","G","T")
motif <- new("pcm", mat=as.matrix(pcm), name="bin_SOLEXA")
plot(motif)
## plot a RNA sequence logo
rna <- pcm
rownames(rna)[4] <- "U"
motif <- new("pcm", mat=as.matrix(rna), name="RNA_motif")
plot(motif)

其中,这个包:https://www.plob.org/article/15953.html也有提到用法

或者我之前推送的:https://www.jianshu.com/p/e7983c461635也有部分讲解

其中pcm结尾的文件是个打分矩阵,可在JASPAR数据库 (http://jaspar.genereg.net/) 下载

你可能感兴趣的:(《生物信息学生R入门教程》读书笔记 Chapter 7)