RNASEQ分析入门笔记8-使用DESeq2进行表达差异分析

基因的差异表达分析,通常使用R中的软件包,包括:DESeq2,edgeR,limma等,今天介绍DESeq2的分析流程:

1、在R中安装DESeq2软件包

source ("http://bioconductor.org/biocLite.R") # 调入安装工具Bioconductor
biocLite("DESeq2") # 安装DESeq2
library("DESeq2") # 测试是否安装成功

如果安装出错,请移步前文:《MAC安装DESeq2报错及解决方案》,作为一个生信小白,我花了3天时间才把安装DESeq2的问题解决,一路全是坑啊!


2、导入分析数据

我们还是使用《RNASEQ分析入门笔记7-HTSeq定量基因表达水平》中得到的数据进行分析,首先回顾一下:

上文中,我们得到了raw_count_filt2矩阵,格式如下(执行:tail (raw_count_filt2, n=5):

control1 control2 rep1 rep2
ENSMUSG00000110420 0 0 0 0
ENSMUSG00000110421 0 0 0 1
ENSMUSG00000110422 1 0 1 2
ENSMUSG00000110423 0 0 0 0
ENSMUSG00000110424 26 20 48 24

可以看到,该矩阵实际只有4列,是由整数组成的,而最前面的一列是行名,可以直接从保存的.csv文件中导入:

raw_count_filt2 <- read.csv ("raw_count_filt2.csv")  # 从保存的.csv文件导入数据,并赋值给raw_count_filt2

3、加载DESeq2并设置样品信息

library(DESeq2) # 加载DESeq2包
countData <- raw_count_filt2 # 表达矩阵
condition <- factor(c("control","control","KD","KD")) # 定义condition
colData <- data.frame(row.names=colnames(countData), condition) # 样品信息矩阵

condition:

[1] control control KD KD
Levels: KD control

colData:

condition
control1 control
control2 control
rep1 KD
rep2 KD

4、构建dds矩阵

dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), design= ~ condition ) # 构建dds矩阵
head(dds) # 查看dds矩阵的前6行

class: DESeqDataSet
dim: 6 4
metadata(1): version
assays(1): counts
rownames(6): ENSMUSG00000000001 ENSMUSG00000000003 ...
ENSMUSG00000000037 ENSMUSG00000000049
rowData names(0):
colnames(4): control1 control2 rep1 rep2
colData names(1): condition

dds2 <- DESeq(dds) # 对dds进行Normalize

运行成功会有如下提示:

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

resultsNames(dds) # 查看结果的名称

[1] "Intercept" "condition_control_vs_KD"

res <- results(dds) # 使用results()函数获取结果,并赋值给res
head (res, n=5) # 查看res矩阵的前5行

输出的res矩阵有6列,分别是:baseMean,log2FoldChange,lfcSE,pvalue,padj:

log2 fold change (MLE): condition control vs KD
Wald test p-value: condition control vs KD
DataFrame with 5 rows and 6 columns

baseMean log2FoldChange lfcSE stat pvalue padj
ENSMUSG00000000001 2359.75391 -0.08461265 0.2081281 -0.4065413 0.68434494 0.9040654
ENSMUSG00000000003 0.00000 NA NA NA NA NA
ENSMUSG00000000028 1027.82363 0.03638541 0.2036829 0.1786375 0.85822232 0.9618582
ENSMUSG00000000031 65.37333 0.94588315 0.4428518 2.1358912 0.03268828 0.2511131
ENSMUSG00000000037 69.75843 0.11587980 0.4400431 0.2633374 0.79229054 0.9422796
mcols(res,use.names= TRUE) # 查看res矩阵每一列的含义

DataFrame with 6 rows and 2 columns

type description
baseMean intermediate mean of normalized counts for all samples
log2FoldChange results log2 fold change (MLE): condition control vs KD
lfcSE results standard error: condition control vs KD
stat results Wald statistic: condition control vs KD
pvalue results Wald test p-value: condition control vs KD
padj results BH adjusted p-values
summary(res) # 对res矩阵进行总结

out of 28335 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 445, 1.6%
LFC < 0 (down) : 625, 2.2%
outliers [1] : 0, 0%
low counts [2] : 12683, 45%
(mean count < 18)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

其中,445个基因表达上调,625个基因表达下降,可靠程度是:p-value < 0.1


5、提取差异分析结果

table(res$padj<0.05) # 取padj小于0.05的数据,得到743行
FALSE TRUE
14909 743
res <- res[order(res$padj),] # 按照padj的大小将res重新排列
diff_gene_deseq2 <- subset(res,padj < 0.05 & (log2FoldChange >1 | log2FoldChange < -1)) # 获取padj小于0.05,表达倍数取以2为对数后绝对值大于1的差异表达基因,赋值给diff_gene_deseq2
head (diff_gene_deseq2, n=5) # 查看diff_gene_deseq2矩阵的前5行

log2 fold change (MLE): condition control vs KD
Wald test p-value: condition control vs KD
DataFrame with 5 rows and 6 columns

baseMean log2FoldChange lfcSE stat pvalue padj
ENSMUSG00000003309 548.1926 3.231612 0.2658125 12.157487 5.234431e-34 8.192931e-30
ENSMUSG00000046323 404.1894 3.067051 0.2628220 11.669689 1.820880e-31 1.425021e-27
ENSMUSG00000001123 341.8542 2.797485 0.2766499 10.112006 4.887336e-24 2.549886e-20
ENSMUSG00000018569 485.4839 3.136032 0.3312999 9.465839 2.912140e-21 9.116163e-18
ENSMUSG00000023906 951.9460 2.382308 0.2510718 9.488553 2.342631e-21 9.116163e-18
diff_gene_deseq2 <- row.names(diff_gene_deseq2) # 提取diff_gene_deseq2的行名
head (diff_gene_deseq2, n=5)

[1] "ENSMUSG00000003309"
[2] "ENSMUSG00000046323"
[3] "ENSMUSG00000001123"
[4] "ENSMUSG00000018569"
[5] "ENSMUSG00000023906"

resdata <- merge (as.data.frame(res),as.data.frame(counts(dds,normalize=TRUE)),by="row.names",sort=FALSE)
head (resdata,n=5)
Row.names baseMean log2FoldChange lfcSE stat pvalue padj control1 control2 rep1 rep2
1 ENSMUSG00000003309 548.1926 3.231612 0.2658125 12.157487 5.234431e-34 8.192931e-30 1073.9977 908.7239 125.93083 84.11803
2 ENSMUSG00000046323 404.1894 3.067051 0.2628220 11.669689 1.820880e-31 1.425021e-27 723.4828 721.5049 97.10329 74.66657
3 ENSMUSG00000001123 341.8542 2.797485 0.2766499 10.112006 4.887336e-24 2.549886e-20 679.8243 515.6735 84.96538 86.95347
4 ENSMUSG00000018569 485.4839 3.136032 0.3312999 9.465839 2.912140e-21 9.116163e-18 1113.9140 630.6325 127.44807 69.94083
5 ENSMUSG00000023906 951.9460 2.382308 0.2510718 9.488553 2.342631e-21 9.116163e-18 1862.3445 1333.5250 351.99943 259.91526
write.csv(resdata, file="control_vs_akap95.csv") # 将结果写入control_vs_akap95.csv文件

你可能感兴趣的:(RNASEQ分析入门笔记8-使用DESeq2进行表达差异分析)