差异表达基因检测

写在前面

要帮朋友做RNA-seq的分析，cases vs. controls总共4个样本（2 vs. 2)，看到文献（实验设计比较类似）里用的是EBSeq较为频繁，所以用这个来做。
但从文献来看，EBSeq对样本量依赖较大。

EBSeq

TPR relatively independent of sample size and presence of outliers.

Poor FDR control in most situations, relatively unaffected by outliers.

Medium computational time requirement, increases slightly with sample size.

EBSeq的输入数据是原始的read count，可以通过featureCounts、HTSeq-count等软件包获得。

命令参考链接 http://www.bioconductor.org/packages/release/bioc/html/EBSeq.html

安装

数据不算大，PC应该够用了，所以在window下安装。EBSeq依赖的安装包blockmodeling一直无法成功安装，隔了两天再看，才想到要去仔细看报错文件，发现其中有特殊字符无法识别，索性直接改了，就能直接加载啦。

source("http://bioconductor.org/biocLite.R")
biocLite("EBSeq")
require(EBSeq)

载入需要的程辑包：blockmodeling
Error: package or namespace load failed for ‘blockmodeling’:
 attachNamespace()里算'blockmodeling'时.onAttach失败了，详细内容：
  调用: parse(con)
  错误: 16:28: 意外的INCOMPLETE_STRING
15:         journal = "Social Networks",
16:         author = as.person("Ale
                               ^
错误: 无法载入程辑包‘blockmodeling’
In addition: Warning message:
In parse(con) :
  invalid input found on input connection 'C:/Users/cc/Documents/R/win-library/3.5/blockmodeling/CITATION'

library(blockmodeling)
#出现同样的报错

修改文件'C:/Users/cc/Documents/R/win-library/3.5/blockmodeling/CITATION'的内容，其中对应报错的行有Aleš Žiberna——带帽子的字符，删掉或者改成别的常见字符就OK了。当然如果担心以后出类似问题，改起来琐碎，可以直接设置R的识别语言。
Note:搜索时使用关键词INCOMPLETE_STRING，而非最后的invalid input found on input connection。搜索方向决定了解决问题的效率。

library(EBSeq)
载入需要的程辑包：gplots
载入程辑包：‘gplots’
The following object is masked from ‘package:stats’:
    lowess
载入需要的程辑包：testthat

实战

Example

#example
data(GeneMat)
head(GeneMat)
#[,1] [,2] [,3] [,4] [,5] [,6] [,7]  [,8] [,9] [,10]
#Gene_1 1879 2734 2369 2636 2188 9743 9932 10099 9829  9831
#Gene_2   24   40   22   27   31  118  108   144  117   113

Sizes=MedianNorm(GeneMat) #EBSeq requires the library size factor ls for each sample s. Here, ls may be obtained via the function MedianNorm, which reproduces the median normalization approach in DESeq

Conditions=as.factor(rep(c("C1","C2"),each=5)) #设置样本类型
Conditions
#[1] C1 C1 C1 C1 C1 C2 C2 C2 C2 C2
#Levels: C1 C2

EBOut = EBTest(GeneMat,Conditions=as.factor(rep(c("C1","C2"),each=5)),sizeFactors=Sizes, maxround=5)
EBDERes=GetDEResults(EBOut, FDR=0.05)
summary(EBDERes)

str(EBDERes$DEfound) #a list of genes identified with 5% FDR
head(EBDERes$PPMat) # two columns PPEE and PPDE, corresponding to the posterior probabilities of being EE or DE for each gene
str(EBDERes$Status) #EBDERes$Status contains each gene’s status called by EBSeq

实例

paste KD_In1.count  KD_In2.count EGFP_In1.count EGFP_In2.count  |awk '{print $1"\t"$2"\t"$4"\t"$6"\t"$8}'|sort -k 1  >combined.count
head combined.count
tail combined.count
sed -i '1,2d'combined.count  #删除1-2行
sed -i '53800,$d'  combined.count  #删除53800-last 行
sed -i '1igene\tKD1\tKD2\tEGFP1\tEGFP2' combined.count   #增加首行

data=read.table("combined.count",header=T,row.names=1)
head(data)
#                EGFP1 EGFP2 KD1 KD2
#ENSMUSG00000000001 281 243   295   233
#ENSMUSG00000000003   0   0     0     0
#ENSMUSG00000000028  20  25    25    12

data=as.matrix(data)

#pre-filtering 这里仅保留了readcounts总和>=10的基因
keep <- rowSums(data[,1:4])>=10
data <- data[keep,]

#跑出结果
require(EBSeq)
Sizes=MedianNorm(data)
EBOut = EBTest(data,Conditions=as.factor(rep(c("EGFP","KD"),each=2)),sizeFactors=Sizes, maxround=15)  #maxround迭代次数，一般要求最后结果趋于收敛比较可信

EBDERes=GetDEResults(EBOut, FDR=0.05)
summary(EBDERes)

str(EBDERes$DEfound) #a list of genes identified with 5% FDR
head(EBDERes$PPMat) # two columns PPEE and PPDE, corresponding to the posterior probabilities of being EE or DE for each gene
#                        PPEE         PPDE
#ENSMUSG00000000001 1.0000000 3.203313e-17
#ENSMUSG00000000003        NA           NA

str(EBDERes$Status) #EBDERes$Status contains each gene’s status called by EBSeq

#输出DEG(FDR<0.05)
write.table(EBDERes$DEfound,file="DEG_FDR005.txt",quote=F,sep="\t",row.names=F,col.names=F)

#计算 Fold Change (FC) of the raw data as well as the posterior FC of the normalized data.
GeneFC=PostFC(EBOut)
str(GeneFC)
PlotPostVsRawFC(EBOut,GeneFC)  #FC vs.Posterior FC的图
write.table(GeneFC,file="FC.txt",quote=F,sep="\t")  #注意$Direction

如何确定maxround?
查看参数值：EBOut$Alpha, EBOut$P, EBOut$Beta，最后两个值相差<0.01就差不多了。

关于差异表达(RNA-seq)

标准化方法不同+检验不同=多种组合/软件，用之前需要结合自己的样本量来考虑，多参考有相似实验设计的文献，常用的方法都跑一下，自己评估下结果差异，再做定夺。（研究本来就是充满了不确定性，一切都只能用“可能性”来定义，所以，采用同样参数仍然无法完全重复出文献中的结果也是常见。即使你心有芥蒂...）

针对配合meRIP-seq的RNA-seq，因为朋友需要看了几篇文献，基本都是2 vs. 2的样本量，文献里的作法也有差异。

直接取了fold change >2 (2017Cancer Cell-m6A Demethylase ALKBH5 Maintains Tumorigenicity of Glioblastoma Stem-like Cells by Sustaining FOXM1 Expression and Cell Proliferation Program)
EBSeq (2017Cancer Cell-FTO Plays an Oncogenic Role in Acute Myeloid Leukemia as a N6-Methyladenosine RNA Demethylase）
DESeq2

差异表达基因检测 | EBSeq

写在前面

安装

实战

Example

实例

关于差异表达(RNA-seq)

你可能感兴趣的:(差异表达基因检测 | EBSeq)