DESeq2: Detailed Usage

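1. Building the DESeq2 object

(1) Building a DESeqDataSet from a SummarizedExperiment object

dds <- DESeqDataSet(se, design = ~ cell + dex)
  • se is a RangedSummarizedExperiment. The row information rowRanges(se) holds the gene ranges, the column information colData(se) holds the sample table (four cell lines, each with a control sample and a dex-treated sample), and assays(se) holds the raw read counts.
  • dds is a DESeqDataSet. The main difference from se is that the count matrix in the assay slot is accessed through the DESeq2 accessor function counts(dds).
  • dds carries a design formula, available as design(dds). The design ~ cell + dex tests for the effect of dexamethasone treatment while accounting for differences between cell lines.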

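(2) Building a DESeqDataSet from a count matrix (countData) and a sample table (colData)

dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ Group)
  • countData is the count matrix, with genes in rows and samples in columns. Values must be non-negative integers.
  • colData is the sample table, with one row per sample. Its row names must match the column names of countData, in the same order. Its columns hold the grouping variables.
  • The design formula usually has the form ~ batch + condition, where batch and condition are columns of colData and are factors. Put the variable of primary interest last. If the batch of each sample (or any other nuisance variable) was recorded, include it in the design: the model then estimates the batch effect and accounts for it when testing the condition effect. This is the recommended way to handle batch effects. Modelling batch in the design does not remove it from the data matrix; if a downstream step really needs batch-corrected values, limma's removeBatchEffect can be used.
  • By default R orders factor levels alphabetically, and the first level is the reference. To set the reference explicitly, use either
    colData$Group <- relevel(colData$Group, ref = "WTF")
    or
    colData$Group <- factor(colData$Group, levels = c("WTF", "WTM", "MF", "MMF"))
  • Designs can contain multiple variables, e.g. ~ group + condition, and interactions (answering: is the condition effect different across genotypes?), e.g. ~ genotype + treatment + genotype:treatment. By default, the functions in this package use the last variable in the formula to build results tables and plots.
  • design(dds) <- value, where value is a formula used for estimating dispersions and fitting negative binomial GLMs.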
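The input requirements above can be checked before building the object. The following is a minimal sketch on made-up toy data: the object names count_data, col_data and inputs_ok are hypothetical, and the DESeq2 call is only attempted when the package is installed.

```r
# Hypothetical toy data: 4 genes x 4 samples (all names are made up).
count_data <- matrix(
  c( 10L, 12L, 30L,  28L,
      0L,  1L,  0L,   2L,
    100L, 90L, 95L, 110L,
      5L,  7L, 20L,  25L),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3", "s4"))
)
col_data <- data.frame(
  Group = factor(c("WTF", "WTF", "MMF", "MMF")),
  row.names = c("s1", "s2", "s3", "s4")
)

# Alphabetical ordering would make "MMF" the reference; set "WTF" explicitly.
col_data$Group <- relevel(col_data$Group, ref = "WTF")

# Sanity checks: integer, non-negative counts and matching sample order.
inputs_ok <- is.integer(count_data) &&
  all(count_data >= 0) &&
  identical(rownames(col_data), colnames(count_data))

# Only attempt the DESeq2 call when the package is available.
if (inputs_ok && requireNamespace("DESeq2", quietly = TRUE)) {
  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = count_data, colData = col_data, design = ~ Group
  )
}
```

Checking the sample order explicitly is worthwhile because a mismatch between colData rows and countData columns would silently assign samples to the wrong groups.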

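2. Filtering low-abundance genes

dds <- dds[rowSums(counts(dds)) > 1, ]

Alternatively, filter the matrix before building dds:

countData <- count[rowSums(count) > 1, ]
  • Separately from this pre-filter, DESeq2 applies independent filtering in results(): genes whose average expression across all samples is too low are excluded from multiple-testing adjustment, which reduces false negatives. DESeq2 filters on the mean of normalized counts rather than on a fixed CPM cutoff.
  • edgeR-style filtering instead keeps genes whose expression exceeds min.CPM in two or more samples.
  • Try different cutoffs to find what works best for the data.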
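The two filtering styles mentioned above can be compared on a small matrix. This is a sketch on made-up data; the cutoffs min_count and min_samples are assumptions chosen for illustration, not recommended values.

```r
# Hypothetical count matrix: 5 genes x 6 samples.
counts_mat <- matrix(
  c( 0,  0,  1,  0,  0,  0,
    15, 20, 12, 30, 25, 18,
     2, 11,  0, 14,  1,  3,
     0,  0,  0,  0,  0,  0,
    50,  3, 40,  2, 60,  1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("s", 1:6))
)

min_count   <- 10  # assumed per-sample cutoff
min_samples <- 3   # assumed minimum number of samples, e.g. the smallest group size

# Total-count filter used in this section.
keep_total <- rowSums(counts_mat) > 1

# Stricter filter: at least min_count reads in at least min_samples samples.
keep_strict <- rowSums(counts_mat >= min_count) >= min_samples

filtered <- counts_mat[keep_strict, , drop = FALSE]
rownames(filtered)
```

Here gene3 passes the total-count filter but fails the stricter one, because only two samples reach the per-sample cutoff.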



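3. Two data transformations

(1) The variance stabilizing transformation (VST)

vsd <- vst(object = dds, blind = TRUE)
  • The sample table gains one column, sizeFactor: see names(colData(vsd)) and colData(vsd)$sizeFactor.
  • The gene table, names(rowData(vsd)), gains four columns.
  • vst quickly estimates the dispersion trend and applies a variance stabilizing transformation. It derives the VST from the fitted dispersion-mean relationship and then transforms the count data (divided by the normalization factors), giving a matrix of values that is approximately homoskedastic (constant variance along the range of mean values). Many common exploratory methods for multidimensional data, such as clustering and PCA, work well on homoskedastic data. As a rule of thumb, rlog suits data sets with fewer than 30 samples and vst suits larger ones, because rlog is slow.

(2) The regularized-logarithm transformation (rlog)

rld <- rlog(object = dds, blind = FALSE)
  • The sample table gains one column, sizeFactor, identical to the one in vsd.
  • The gene table gains seven columns.
  • rlog transforms the count data to the log2 scale in a way that minimizes differences between samples for rows with small counts and normalizes for library size. rlog is more robust when the size factors vary widely.

(3) Usage

  • blind: whether the transformation ignores the experimental design. blind = TRUE ignores the design and is meant for sample quality assurance (QA). blind = FALSE takes the design into account and is meant for downstream analysis.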

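(4) Why transform? To make all genes contribute roughly equally.

For raw RNA-seq counts, the variance grows with the mean. If PCA is run directly on size-factor-normalized read counts, counts(dds, normalized = TRUE), the result typically depends only on the few most highly expressed genes, because they show the largest absolute differences between samples. One strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount: log2(counts(dds, normalized = TRUE) + 1). With that approach, however, genes with very low counts tend to dominate the result. As a solution, DESeq2 offers transformations for count data that stabilize the variance across the mean. One of them is the regularized-logarithm transformation, rlog. For genes with high counts, rlog gives results similar to an ordinary log2 transformation. For genes with lower counts, the values are shrunk towards the gene's average across all samples.
Several kinds of data can be used for PCA plots or clustering: counts, CPM, log2(counts+1), log2(CPM+1), vst, rlog, and so on.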

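4. Normalization in DESeq2

(1) Computing the size factors (sizeFactor)

dds <- estimateSizeFactors(dds)
  • colData(dds) gains a sizeFactor column, which corrects for sequencing depth and library composition. The values are the same as the sizeFactor in vsd and rld.

(2) Normalized data

normalized_counts <- counts(dds, normalized = TRUE)
  • Dividing the raw counts of each sample by that sample's size factor gives the normalized counts: read counts / sizeFactor.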
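The division by size factors can be reproduced by hand. The sketch below implements the median-of-ratios idea in base R on made-up counts; it is an illustration of the principle under simplified assumptions, not a copy of the DESeq2 internals, and the object names are hypothetical.

```r
# Hypothetical counts: sample s2 is sequenced 2x and s3 3x as deeply as s1.
base_expr  <- c(gene1 = 10, gene2 = 100, gene3 = 5, gene4 = 50)
counts_mat <- cbind(s1 = base_expr, s2 = 2 * base_expr, s3 = 3 * base_expr)

# Per-gene log geometric mean across samples (a pseudo-reference sample).
log_geo_means <- rowMeans(log(counts_mat))

# Size factor of a sample: median ratio of its counts to the reference,
# using only genes with a finite reference and a positive count.
size_factors <- apply(counts_mat, 2, function(cnts) {
  use <- is.finite(log_geo_means) & cnts > 0
  exp(median((log(cnts) - log_geo_means)[use]))
})

# Normalized counts: divide each column by its size factor.
normalized_counts <- sweep(counts_mat, 2, size_factors, "/")
size_factors
```

Because the three toy samples differ only in depth, the normalized columns come out identical, which is the behaviour the size factors are meant to achieve.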

5. Differential expression analysis

(1) In one step

dds <- DESeq(dds)

(2) Usage

DESeq(object, test = c("Wald", "LRT"), fitType = c("parametric", "local", "mean"), sfType = c("ratio", "poscounts", "iterate"), betaPrior, full = design(object), reduced, quiet = FALSE, minReplicatesForReplace = 7, modelMatrixType, useT = FALSE, minmu = 0.5, parallel = FALSE, BPPARAM = bpparam())

  • test is either "Wald" (Wald significance tests) or "LRT" (a likelihood ratio test on the difference in deviance between a full and a reduced model formula).

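(3) Step by step

  • Estimate the size factors
dds <- estimateSizeFactors(dds)
  • Estimate the dispersion of each gene
dds <- estimateDispersions(dds)
  • Run the statistical test for differential expression
dds <- nbinomWaldTest(dds)

DESeq2 assumes that the counts of each gene follow a negative binomial distribution, which has two key parameters: the mean and the dispersion α. The dispersion describes the relationship between the mean and the variance (variance = mean + α × mean²).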
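The mean-variance relationship of the negative binomial model can be made concrete with a few numbers. This sketch uses an assumed dispersion of 0.1 purely for illustration; nb_variance is a hypothetical helper, and the simulation relies on base R's rnbinom(), whose size parameter corresponds to 1/α in the mean parameterization.

```r
# Variance implied by a negative binomial with mean mu and dispersion alpha.
nb_variance <- function(mu, alpha) mu + alpha * mu^2

mu    <- c(10, 100, 1000)
alpha <- 0.1                    # assumed dispersion, for illustration only

nb_var      <- nb_variance(mu, alpha)
poisson_var <- mu               # Poisson special case (alpha = 0)

# Simulated check at mu = 100.
set.seed(1)
sim <- rnbinom(1e5, mu = 100, size = 1 / alpha)
c(theoretical = nb_variance(100, alpha), simulated = var(sim))
```

The gap between nb_var and poisson_var widens quickly with the mean, which is why a Poisson model underestimates the variability of highly expressed genes.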

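6. Getting the results

(1) Default behaviour

res <- results(dds)
  • By default the comparison uses the last variable in the design formula and contrasts its last level against its first (reference) level.
  • Returns a results table res (a DESeqResults object, which behaves like a data frame) with six columns: baseMean, log2FoldChange, lfcSE, stat, pvalue, padj.
  • baseMean is the mean over all samples of the read counts corrected by the size factors (counts/sizeFactor): baseMean = apply(normalized_counts, 1, mean)
  • log2FoldChange tells how much the expression of the gene changed; it is the estimated effect size. It is the base-2 logarithm of the fold change, so fold change = 2^log2FoldChange. A manual approximation, assuming samples 1 to 4 are controls and 5 to 8 are treated, is log2FoldChange = apply(normalized_counts, 1, function(t) log2(mean(t[5:8]) / mean(t[1:4]))). The two are not exactly equal. The difference in expression between groups has two components: variation between samples themselves (biological replicates differ to some extent) and the part of real interest, the difference caused by the grouping or experimental condition. A log2 fold change computed directly from normalized counts contains both, whereas DESeq2 accounts for sample-level variation in the model, so the value it reports differs from the manual calculation.
  • lfcSE (log fold change standard error) is the standard error of the log2FoldChange estimate; the effect size estimate carries uncertainty.
  • stat is the Wald statistic, i.e. log2FoldChange divided by lfcSE.
  • pvalue and padj are the raw p-value and the p-value adjusted for multiple testing. The set of genes with an adjusted p-value below 0.1 should contain no more than 10% false positives.
  • Filter on adjusted p-values, not raw p-values, to obtain FDR control. A 10% FDR is common because RNA-seq experiments are often exploratory, and having 90% true positives in the gene set is acceptable.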
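How stat, pvalue and padj relate to each other can be sketched in base R. The numbers below are made up, and the sketch ignores steps that DESeq2 adds on top (such as independent filtering and outlier handling, which can set values to NA); it only illustrates the Wald statistic and the Benjamini-Hochberg adjustment.

```r
# Hypothetical estimates for four genes.
log2fc <- c(2.0, -1.5, 0.3, 0.1)
lfc_se <- c(0.5,  0.5, 0.3, 0.4)

stat   <- log2fc / lfc_se                  # Wald statistic
pvalue <- 2 * pnorm(-abs(stat))            # two-sided p-value, standard normal
padj   <- p.adjust(pvalue, method = "BH")  # Benjamini-Hochberg adjustment

data.frame(log2fc, lfc_se, stat, pvalue, padj)
```

The adjusted p-values are never smaller than the raw ones, so filtering on padj is always the more conservative choice.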

(2) Comparing any two groups

resultsNames(dds)
res <- results(dds, name = "Group_MMF_vs_WTF")
res <- results(dds, contrast = c("Group", "MMF", "WTF"))  # the last element is the reference (denominator) level

(3) Usage

results(object, contrast, name, lfcThreshold = 0, altHypothesis = c("greaterAbs", "lessAbs", "greater", "less"), listValues = c(1, -1), cooksCutoff, independentFiltering = TRUE, alpha = 0.1, filter, theta, pAdjustMethod = "BH", filterFun, format = c("DataFrame", "GRanges", "GRangesList"), test, addMLE = FALSE, tidy = FALSE, parallel = FALSE, BPPARAM = bpparam(), minmu = 0.5)

  • contrast
    Specifies which comparison to extract from the object to build a results table. One of:
    a) a character vector with exactly three elements: the name of a factor in the design formula, the name of the numerator level for the fold change, and the name of the denominator level for the fold change (simplest case)
    b) a list of 2 character vectors: the names of the fold changes for the numerator, and the names of the fold changes for the denominator. These names should be elements of resultsNames(object). If the list is length 1, a second element is added which is the empty character vector, character(). (More general case; can be used to combine interaction terms and main effects.)
    c) a numeric contrast vector with one element for each element in resultsNames(object) (most general case)
    If specified, the name argument is ignored.
  • name
    The name of the individual effect (coefficient) for building a results table. Use this argument rather than contrast for continuous variables, individual effects or individual interaction terms. The value provided to name must be an element of resultsNames(object).
  • lfcThreshold
    A non-negative value which specifies a log2 fold change threshold. The default value is 0, corresponding to a test that the log2 fold changes are equal to zero.
  • independentFiltering
    Logical, whether independent filtering should be applied automatically.
  • alpha
    The significance cutoff used for optimizing the independent filtering (by default 0.1). If the adjusted p-value cutoff (FDR) will be a value other than 0.1, alpha should be set to that value.
  • Example: two conditions, three genotypes, with an interaction term. The interaction term answers: is the condition effect different across genotypes?
dds <- makeExampleDESeqDataSet(n=100,m=18)
dds$genotype <- factor(rep(rep(c("I","II","III"),each=3),2))
design(dds) <- ~ genotype + condition + genotype:condition
dds$condition
[1] A A A A A A A A A B B B B B B B B B
Levels: A B
dds$genotype
[1] I   I   I   II  II  II  III   III   III   I   I   I   II  II  II  III  III  III
Levels: I   II   III
dds <- DESeq(dds)
resultsNames(dds)
[1] "Intercept"              "genotype_II_vs_I" 
[3] "genotype_III_vs_I"      "condition_B_vs_A" 
[5] "genotypeII.conditionB"  "genotypeIII.conditionB"
# the condition effect for genotype I (the main effect)
results(dds, contrast=c("condition","B","A"))
log2 fold change (MLE): condition B vs A 
Wald test p-value: condition B vs A
# the condition effect for genotype III.
# this is the main effect *plus* the interaction term
# (the extra condition effect in genotype III compared to genotype I).
results(dds, contrast=list( c("condition_B_vs_A","genotypeIII.conditionB") ))
log2 fold change (MLE): condition_B_vs_A+genotypeIII.conditionB effect 
Wald test p-value: condition_B_vs_A+genotypeIII.conditionB effect
# the interaction term for condition effect in genotype III vs genotype I.
# this tests if the condition effect is different in III compared to I
results(dds, name="genotypeIII.conditionB")
log2 fold change (MLE): genotypeIII.conditionB 
Wald test p-value: genotypeIII.conditionB 
# the interaction term for condition effect in genotype III vs genotype II.
# this tests if the condition effect is different in III compared to II
results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB"))
log2 fold change (MLE): genotypeIII.conditionB vs genotypeII.conditionB 
Wald test p-value: genotypeIII.conditionB vs genotypeII.conditionB
# Note that a likelihood ratio could be used to test if there are any
# differences in the condition effect between the three genotypes.
  • Using a grouping variable
# This is a useful construction when users just want to compare
# specific groups which are combinations of variables.
dds$group <- factor(paste0(dds$genotype, dds$condition))
design(dds) <- ~ group
dds <- DESeq(dds)
resultsNames(dds)
# the condition effect for genotypeIII
results(dds, contrast=c("group", "IIIB", "IIIA"))
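For the numeric form of contrast (option c in the argument description), the weights can be derived from the model matrix instead of being written by hand. The sketch below uses only base R on the genotype-by-condition design from the example; grid, mm and pick are hypothetical helper names, and it assumes the coefficient order of model.matrix matches resultsNames(dds), as it does in the output shown above.

```r
# All genotype x condition combinations, with the same factor levels as above.
grid <- expand.grid(
  genotype  = factor(c("I", "II", "III")),
  condition = factor(c("A", "B"))
)
mm <- model.matrix(~ genotype + condition + genotype:condition, data = grid)

# Model-matrix row for one combination of genotype and condition.
pick <- function(g, cond) mm[grid$genotype == g & grid$condition == cond, ]

# Condition effect (B vs A) within genotype III: one weight per coefficient.
contrast_III <- pick("III", "B") - pick("III", "A")
contrast_III

# With a fitted object, this would be passed as:
# results(dds, contrast = unname(contrast_III))
```

The non-zero weights fall on the condition main effect and the genotype III interaction term, matching the list-style contrast used earlier for the same question.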

(4) Quick summary of the results

summary(res)
summary(res, alpha = 0.1)
  • alpha: the adjusted p-value cutoff. If not set, this defaults to the alpha argument which was used in results to set the target FDR for independent filtering, or if independent filtering was not performed, to 0.1.

(5) Ordering and filtering

resOrdered <- res[order(res$pvalue), ]  # ascending order (decreasing = FALSE is the default)
sum(res$padj < 0.1, na.rm = TRUE)  # number of genes with padj below 0.1
diff_gene <- subset(res, padj < 0.1 & abs(log2FoldChange) > 1)
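The same filter can be tried on a plain data frame to see how NA values behave and to split the hits by direction. The table below is made up; res_df, sig, up and down are hypothetical names.

```r
# Hypothetical results table; the NA padj mimics a gene removed by filtering.
res_df <- data.frame(
  gene           = paste0("gene", 1:6),
  log2FoldChange = c(2.5, -1.8, 0.4, 1.2, -3.0, 1.5),
  padj           = c(0.001, 0.03, 0.02, 0.20, NA, 0.09),
  stringsAsFactors = FALSE
)

# subset() drops rows where the condition evaluates to NA.
sig  <- subset(res_df, padj < 0.1 & abs(log2FoldChange) > 1)
up   <- sig$gene[sig$log2FoldChange > 0]   # up-regulated genes
down <- sig$gene[sig$log2FoldChange < 0]   # down-regulated genes
```

Note that gene5 has the largest fold change but is not selected, because its adjusted p-value is missing.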

(6) Two stricter ways to select significant genes

  • Lower the false discovery rate threshold (the threshold on padj in the results table)
  • Raise the log2 fold change threshold (from 0, using the lfcThreshold argument of results)
res.05 <- results(dds, alpha = 0.05)
# alpha is the padj cutoff; the default is alpha = 0.1.
resLFC <- results(dds, lfcThreshold = 1)
# Raises the log2 fold change threshold; genes that do not meet it get a p-value of 1.

7. LFC shrinkage with lfcShrink

(1) Introduction

lfcShrink adds shrunken log2 fold changes (LFC) and SE to a results table from a DESeq run without LFC shrinkage. For consistency with results, the column name lfcSE is used here although what is returned is a posterior SD. Three shrinkage estimators for LFC are available via type (see the vignette for more details on the estimators). The apeglm publication demonstrates that 'apeglm' and 'ashr' outperform the original 'normal' shrinkage estimator.

The shrunken fold changes are useful for ranking genes by effect size and for visualization.
Unshrunken log2 fold change estimates do not account for the large dispersion observed at low read counts.
As with the shrinkage of dispersion estimates, LFC shrinkage uses information from all genes to generate more accurate estimates.
Use the shrunken values when selecting differential genes by LFC, and when supplying fold changes to functional analyses such as GSEA.

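(2) Application

resultsNames(dds)
[1] "Intercept"  "condition_treated_vs_untreated"
resLFC <- lfcShrink(dds, coef = "condition_treated_vs_untreated", type = "apeglm")
# or, with a contrast (apeglm requires coef, so a different type is used here):
resLFC <- lfcShrink(dds, contrast = c("condition", "treated", "untreated"), type = "normal")
  • Using apeglm for effect size shrinkage improves on the earlier estimator. The values in resLFC are more compact than those in res. type can be apeglm, ashr, and so on.
names(resLFC)
[1] "baseMean"  "log2FoldChange"  "lfcSE"  "pvalue"  "padj"
  • Compared with res, the stat column is missing (as shown here for type = "apeglm"). Shrinkage only affects the LFC values; the p-values are unchanged, so the number of significant genes does not change.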

(3) Usage

lfcShrink(dds, coef, contrast, res, type = c("normal", "apeglm", "ashr"), lfcThreshold = 0, svalue = FALSE, returnList = FALSE, format = c("DataFrame", "GRanges", "GRangesList"), apeAdapt = TRUE, apeMethod = "nbinomCR", parallel = FALSE, BPPARAM = bpparam(), quiet = FALSE, ...)

  • coef
    The name or number of the coefficient (LFC) to shrink; consult resultsNames(dds) after running DESeq(dds). Note: only coef or contrast can be specified, not both. apeglm requires use of coef. For normal, one of coef or contrast must be provided.
  • contrast
    see argument description in results. only coef or contrast can be specified, not both.
  • type
    "normal" is the original DESeq2 shrinkage estimator; "apeglm" is the adaptive t prior shrinkage estimator from the ’apeglm’ package; "ashr" is the adaptive shrinkage estimator from the ’ashr’ package, using a fitted mixture of normal prior.

8. Likelihood ratio test (LRT)

ddsLRT <- DESeq(dds, test="LRT", reduced=~1)
resLRT <- results(ddsLRT)

9. Enabling multiple cores

library("BiocParallel")
register(MulticoreParam(4))
# Registers 4 cores up front; pass parallel = TRUE to the functions that should use them.

10. Plotting results

(1) Sample-to-sample distance heatmap (overall similarity)

library(pheatmap)
library(RColorBrewer)
rld
sampleDist <- dist(t(assay(rld)))  # Euclidean distances between samples; t() transposes so that samples are rows
# rlog-transformed data are used so that all genes contribute roughly equally
# rlog data can also be used for heatmaps of selected genes across samples
# Poisson distances (PoiClaClu package) must be computed on the raw count matrix
# poisd <- PoissonDistance(t(counts(dds)))
sampleDistMatrix <- as.matrix(sampleDist)  # matrix of distances between samples
rownames(sampleDistMatrix) <- paste0(rld$cell, "-", rld$dex)
colnames(sampleDistMatrix) <- NULL
head(sampleDistMatrix)  # inspect the distance matrix
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDist,
         clustering_distance_cols = sampleDist,
         color = colors)

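(2) Multidimensional scaling (MDS), also called principal coordinates analysis (PCoA)

library(ggplot2)
# Converts the distances between samples into 2-D coordinates while preserving those distances as closely as possible
# Classical MDS on Euclidean distances gives the same configuration as PCA
# Useful when only a distance matrix is available rather than the expression matrix
mdsdata <- data.frame(cmdscale(sampleDistMatrix))
# cmdscale: classical multidimensional scaling
mdsdata  # the MDS coordinates
mds <- cbind(mdsdata, as.data.frame(colData(rld)))
mds  # coordinates and sample annotation, bound by column
ggplot(data = mds, aes(X1, X2, color = cell, shape = dex)) +
  geom_point(size = 3)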
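That classical MDS preserves Euclidean distances can be checked on a tiny example. The points below are made up and already two-dimensional, so a 2-D embedding reproduces the distances exactly; with real expression data the 2-D coordinates are only an approximation.

```r
# Hypothetical samples as points in a 2-D space (rows = samples).
pts <- rbind(a = c(0, 0), b = c(3, 0), c = c(0, 4), d = c(3, 4))
d   <- dist(pts)                      # Euclidean distances

coords <- cmdscale(d, k = 2)          # classical MDS coordinates

# Largest discrepancy between original and reconstructed distances.
max_err <- max(abs(as.vector(dist(coords)) - as.vector(d)))
max_err
```

The coordinates themselves may be rotated or reflected relative to pts, since only the pairwise distances are determined.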

(3) Principal component analysis (PCA)

library(ggrepel)  # provides geom_text_repel
pcadata <- plotPCA(vsd, intgroup = c("Batch", "Group"), returnData = TRUE)
percentVar <- round(100 * attr(pcadata, "percentVar"), 1)
ggplot(pcadata, aes(PC1, PC2, color = Group, shape = Batch)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  geom_text_repel(aes(PC1, PC2, color = Group, label = name), size = 3) +
  theme_bw()
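The coordinates and the percentVar attribute returned by plotPCA can be approximated by hand with prcomp. The sketch below runs on a simulated matrix standing in for assay(vsd); the matrix, the group effect and the ntop value are all assumptions made for illustration.

```r
set.seed(42)
# Hypothetical transformed matrix: 200 genes x 6 samples (stand-in for assay(vsd)).
mat <- matrix(rnorm(200 * 6, mean = 8), nrow = 200,
              dimnames = list(paste0("gene", 1:200), paste0("s", 1:6)))
mat[1:20, 4:6] <- mat[1:20, 4:6] + 3   # simulated group effect in 20 genes

ntop   <- 50                            # assumed number of top-variance genes
rv     <- apply(mat, 1, var)            # per-gene variance across samples
select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]

pca         <- prcomp(t(mat[select, ]))      # samples as rows
percent_var <- pca$sdev^2 / sum(pca$sdev^2)  # fraction of variance per PC
pca_df      <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2],
                          sample = colnames(mat))
round(100 * percent_var[1:2], 1)
```

Restricting the PCA to the most variable genes keeps the plot focused on genes that actually differ between samples.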

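(4) Inspecting the counts of a specific gene with plotCounts()

topGene <- rownames(res)[which.min(res$padj)]  # the gene with the smallest padj
plotCounts(dds, gene = topGene, intgroup = c("dex"))  # plots the normalized counts of this gene
# Scatter plots of this gene's counts in each sample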
data <- plotCounts(dds, gene=topGene, intgroup=c("dex","cell"), returnData=TRUE)
ggplot(data, aes(x=dex, y=count, color=cell)) +  
 scale_y_log10() +
 geom_point(position=position_jitter(width=.1,height=0), size=3)  
ggplot(data, aes(x=dex, y=count, fill=dex)) +
 scale_y_log10() +
 geom_dotplot(binaxis="y", stackdir="center")
ggplot(data, aes(x=dex, y=count, color=cell, group=cell)) +
 scale_y_log10() + geom_point(size=3) + geom_line()

(5) MA plot

plotMA(res, ylim = c(-5, 5))
# "M" for minus, because a log ratio is equal to log minus log, and "A" for average
# M is the log2 fold change between the compared groups (y axis)
# A is the mean of normalized counts (x axis)
plotMA(res, alpha = 0.1, main = "", xlab = "mean of normalized counts", ylim = c(-5, 5))
# alpha is the padj significance threshold; the default is alpha = 0.1
# Each gene is represented with a dot. Genes with an adjusted p value below a threshold (here 0.1, the default) are shown in red.
plotMA(resLFC, ylim = c(-5, 5))  # results obtained with a raised log2 fold change threshold
topGene <- rownames(resLFC)[which.min(resLFC$padj)]
with(resLFC[topGene, ], {
 points(baseMean, log2FoldChange, col = "dodgerblue", cex = 2, lwd = 2)
 text(baseMean, log2FoldChange, topGene, pos = 2, col = "dodgerblue")
})  # labels one specific gene

(6) Removing hidden batch effects

library("sva")
dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)
svseq$sv
par(mfrow=c(2,1),mar=c(3,5,3,1))
stripchart(svseq$sv[,1] ~ dds$cell,vertical=TRUE,main="SV1")
abline(h=0)
stripchart(svseq$sv[,2] ~ dds$cell,vertical=TRUE,main="SV2")
abline(h=0)
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + dex
ddssva <- DESeq(ddssva)
