A complete transcriptome log: from raw data to DEG functional enrichment

Foreword:
This post is a beginner's complete learning log. It is for reference only; feedback and discussion are welcome!

1. rawdata

1.1 QC

fastqc -o out_path filename
multiqc ./

1.2 Data volume statistics

Tool: iTools
Usage:

iTools Fqtools stat -InFq <1.fq> -InFq <2.fq>  -OutStat

2. cleandata

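2.1 Filtering with fastp

# -q 15: a base counts as qualified when its Phred quality is >= 15
# -u 40: discard reads in which more than 40% of the bases are unqualified
# -n 5:  discard reads with more than 5 N bases
# -l 80: discard reads shorter than 80 bases
fastp -q 15 -u 40 -n 5 -l 80 \
  -i read1 -o read1_clean \
  -I read2 -O read2_clean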

2.2 FastQC on the filtered reads

3. alignment

3.1 hisat2

1) Build the index
2) Align
Parameters:

hisat2 -t -x -1 -2 -S

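3) SAM to BAM, then sort

samtools view -b in.sam > out.bam
samtools sort -o out.sorted.bam out.bam  # coordinate sort (the default), which StringTie needs; "samtools sort -n in.bam out.prefix" is legacy syntax and sorts by read name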

3.2 RNA-seq specific QC

qualimap is used here to check where in the genome the reads come from: intron, exon, intergenic and so on

qualimap rnaseq 
-bam  
-gtf 
-pe  -s --java-mem-size=8G

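4. Differential expression analysis

Note: two pipelines both work here, StringTie + ballgown and featureCounts + DESeq2. featureCounts + DESeq2 is currently the mainstream choice, but StringTie can also test for differences at the transcript level.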

5. StringTie

5.1 Assemble transcripts

stringtie in.bam -G ref.gtf -o out.gtf -p 8

StringTie runs fast, about 30 min per sample.

5.2 Merge transcripts

First create mergelist.txt:

ls *sample.gtf >mergelist.txt

Then run the merge, which takes about 5 min:

stringtie --merge -p 8 
-G 
-o   # the merged GTF is the only output file
mergelist.txt

5.3 Compare the assembled transcripts with the reference

gffcompare -r gtf 
-G -o merged 
stringtie_merged.gtf

Output files:

merged.stringtie_merged.gtf.refmap

merged.stringtie_merged.gtf.tmap

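5.4 Re-estimate transcript abundances

# -e: only quantify the transcripts given in -G, no novel assembly
# -G: the merged GTF from step 5.2
# -b: directory for the ballgown table files, one folder per sample
#     (alternatively -B, which writes them next to the -o output file)
# -A: gene abundance table; must be a file name, not a directory
stringtie -e -p 8 \
  -G stringtie_merged.gtf \
  -o sample.gtf \
  -b file_for_ballgown/sample/ \
  -A file_for_abundance/sample.txt \
  sample_sorted.bam

The output files feed directly into the ballgown analysis in the next step.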

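6. Differential expression with ballgown

Problem: the output does not make it obvious whether a differential gene is up- or down-regulated.

From the Biostars forum:
ballgown and FPKM are not well suited to differential expression analysis;
use DESeq2 for DE analysis, and use ballgown to inspect expression levels (FPKM).
FPKM is considered inferior for sample comparisons;
well-maintained gene-level solutions: DESeq2 or edgeR.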

Script:

setwd( )

library(ballgown)
library(RSkittleBrewer)
library(genefilter)
library(dplyr)
library(devtools)

pheno_data <- read.csv("geuvadis_phenodata.csv", sep = ",", header = TRUE) # phenotype table
bg <- ballgown(dataDir = , samplePattern = "sample", pData=pheno_data) # dataDir: folder holding the per-sample ballgown tables
bg_filter <- subset(bg, "rowVars(texpr(bg))>1", genomesubset=TRUE) # drop low-variance transcripts: keep those whose FPKM variance across samples is > 1
######################## test for group differences ###########################
result_tran <- stattest(bg_filter, feature = "transcript", covariate ="stage", adjustvars = c("idv"), getFC=TRUE,meas="FPKM") # transcripts that differ between groups
result_gene <- stattest(bg_filter, feature = "gene", covariate ="stage", adjustvars = c("idv"), getFC=TRUE,meas="FPKM") # genes that differ between groups
result_tran <- data.frame(geneNames=ballgown::geneNames(bg_filter), geneIDs = ballgown::geneIDs(bg_filter), result_tran) # add gene names and IDs to the transcript results

indices <- match(result_gene$id, texpr(bg, 'all')$gene_id) # add gene names to the gene results, looked up in bg
gene_names_for_result <- texpr(bg, 'all')$gene_name[indices]
result_gene <- data.frame(geneNames=gene_names_for_result, result_gene)

result_tran <- arrange(result_tran, pval) # sort by p-value
result_gene <- arrange(result_gene, pval)
write.csv(result_tran, "result_tran.csv",row.names=FALSE) # save
write.csv(result_gene, "result_gene.csv",row.names=FALSE)

indices <- match(result_gene$id, texpr(bg_filter, 'all')$gene_id) # same gene-name lookup, this time in bg_filter
gene_names_for_result <- texpr(bg_filter, 'all')$gene_name[indices]
result_gname <- data.frame(geneNames=gene_names_for_result, result_gene)
write.csv(result_gname, "result_gname_filter.csv",row.names=FALSE)
######################### call DEG/DET ##########################
result_DET <- subset(result_tran, result_tran$qval<0.05) # keep q-value < 0.05, i.e. called differential
result_DEG <- subset(result_gene, result_gene$qval<0.05)
write.csv(result_DET,"DET.csv",row.names=FALSE)
write.csv(result_DEG,"DEG.csv",row.names=FALSE)
######################## plot FPKM distributions #############################
tropical <- c('darkorange', 'dodgerblue','hotpink', 'limegreen', 'yellow')
palette(tropical)
fpkm <- texpr(bg, meas = "FPKM")
fpkm <- log2(fpkm+1)
pdf("FPKM.pdf")
boxplot(fpkm, col=as.numeric(factor(pheno_data$stage)), las=2, ylab='log2(FPKM+1)') # factor() so colours also work when stage is read as character
dev.off()

################### per-sample distribution of a single transcript #####################
# Look up one transcript and its gene (console output shown as comments)
ballgown::transcriptNames(bg)[12]
#          12
# "NM_012227"
ballgown::geneNames(bg)[12]
#       12
# "GTPBP6"
# Draw the boxplot
stage <- factor(pheno_data$stage)
plot(fpkm[12,] ~ stage, border=c(1,2),
    main=paste(ballgown::geneNames(bg)[12], ' : ',
    ballgown::transcriptNames(bg)[12]), pch=19, xlab="stage",
    ylab='log2(FPKM+1)')
points(fpkm[12,] ~ jitter(as.numeric(stage)),
    col=as.numeric(stage))
############ all transcripts at one gene locus #########################
# plotTranscripts draws the transcript structures of the gene with the given id
# The sample argument picks the sample whose expression is shown; here gene index 1729, sample=ERR188234
plotTranscripts(ballgown::geneIDs(bg)[1729], bg,
    main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234'))
plotMeans('MSTRG.56', bg_filter, groupvar="stage",legend=FALSE)

Script references:

https://www.jianshu.com/p/1f5d13cc47f8
The HISAT + StringTie + Ballgown protocol paper

7. featureCounts + DESeq2

7.1 featureCounts

featureCounts 
-p -t exon -g gene_id 
-a Sscrofa11.1.gtf 
-o  
bamfile

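Output: two files

1. A .txt file with many columns: Ensembl gene ID, position, the number of reads assigned, etc.
2. A .summary file: small, much like a log, showing how many reads were assigned and, for the unassigned ones, the reason.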

7.2 DESeq2

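7.2.1 Basic principles

1.1 Overview

Full name: DESeq2 package for differential analysis of count data;

It fits negative binomial generalized linear models, and along the way estimates dispersion and log fold change;

The negative binomial is a discrete distribution that suits the distribution of sequencing read counts;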

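1.2 Building dds

The input must be raw read counts; processed values such as FPKM/TPM are not accepted, because the package applies its own normalization;

Building dds requires a design formula, which tells the package how the data were generated and what the basic experimental design is; the model then takes all listed variables into account;
In general: design =~ variable1 + variable2 + ...;
With a single variable: design=~ condition;
Many medical analyses add age, sex and so on: design=~sex+disease+condition;
Several variables are allowed, but unless told otherwise, results() reports log2FC and p-values for the last variable in the design formula, comparing its last level against the reference level;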
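Which level counts as the reference is decided by the factor's level order, which is alphabetical by default. The following is a minimal base-R sketch of putting the control level first before building dds; the level names `untrt` and `trt` are made up for illustration and are not from this project.

```r
# Hypothetical sample labels; in a real analysis these come from colData.
condition <- factor(c("trt", "trt", "untrt", "untrt"))

# Levels are sorted alphabetically, so "trt" would be the reference by default.
default_ref <- levels(condition)[1]

# relevel() moves the chosen level to the front, making it the reference.
condition <- relevel(condition, ref = "untrt")
new_ref <- levels(condition)[1]

print(c(default = default_ref, releveled = new_ref))
```

With the control level first, a fold change for this factor reads as treatment over control without needing an explicit contrast.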

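1.3 Functions and calculations

1.3.1 Normalization: the DESeq function

Sequencing depth differs between samples. The simplest normalization is counts per million (CPM) = raw read count ÷ total read count x 1,000,000;

This is easily distorted by genes that are very highly expressed and differentially expressed between samples: switching such genes on or off changes the total number of molecules in the cell, so after normalization these genes may no longer look different, while genes that were unchanged now appear to differ;

RPKM, FPKM and TPM additionally normalize for gene or transcript length but still scale by a per-sample total, so they are all affected in the same way;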
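The distortion can be reproduced with a toy example. This is a base-R sketch with invented counts, not data from this project: three genes are truly unchanged, and one gene is strongly switched on in the second sample.

```r
# Toy count matrix: 4 genes x 2 samples (values are invented).
# geneD is switched on strongly in sample s2; geneA-C are truly unchanged.
counts <- matrix(c(100,  100,
                   200,  200,
                   300,  300,
                   400, 3400),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                                 c("s1", "s2")))

# CPM = raw count / library size * 1e6, computed per sample (column).
lib_size <- colSums(counts)
cpm <- t(t(counts) / lib_size) * 1e6

print(round(cpm))
```

Although geneA has 100 reads in both samples, its CPM drops from 100000 to 25000 because geneD inflates the library size of s2.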

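The DESeq2 approach:

Size factor (SF): first compute, for each gene, the geometric mean of its counts across all samples; the SF of a sample is the median, over all genes, of the ratio between the gene's count in that sample and the gene's geometric mean. Because a geometric mean is used, only genes with non-zero counts in every sample can enter the calculation. The method is also known as RLE (relative log expression).

It accounts not only for sequencing depth but also for the skew in the count distribution caused by extremely highly expressed or very strongly differentially expressed genes.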
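The median-of-ratios idea can be written out in a few lines of base R. This is a sketch on invented counts that follows the description above; it is meant to show the arithmetic, not to replace estimateSizeFactors().

```r
# Toy count matrix (invented values), genes in rows, samples in columns.
counts <- matrix(c(100,  100,
                   200,  200,
                   300,  300,
                   400, 3400,
                     0,   10),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(paste0("gene", LETTERS[1:5]), c("s1", "s2")))

# Per-gene geometric mean across samples, computed on the log scale.
log_geo_mean <- rowMeans(log(counts))

# A gene with a zero in any sample gets -Inf here and is skipped.
usable <- is.finite(log_geo_mean)

# Size factor of a sample = median over usable genes of count / geometric mean.
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col[usable]) - log_geo_mean[usable]))
})

# Normalized counts: divide each sample (column) by its size factor.
normalized <- t(t(counts) / size_factors)
print(size_factors)
```

In this toy case both size factors come out as 1: the median ignores the single strongly changed gene, so the unchanged genes keep equal normalized counts in both samples.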

What the DESeq function does:

Three steps: estimation of size factors (estimateSizeFactors), estimation of dispersion (estimateDispersions), Negative Binomial GLM fitting and Wald statistics (nbinomWaldTest);

They can be run one at a time or all at once; either way the result is a DESeqDataSet object that results() can use.

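1.3.2 Scaling: rlog/vst

The name is my own and may not be accurate; I use it for applying vst to dds before PCA.

vst, in full: quickly estimate the dispersion trend and apply a variance-stabilizing transformation;
With fewer than 30 samples use rlog; with more than 30 use vst;
A similar route: fast.prcomp from gmodels, with TPM or TMM-normalized values as input;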
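To show what the transform-then-PCA step does, here is a base-R sketch on simulated counts. It uses log2(x + 1) purely as a simple stand-in for vst()/rlog(), and prcomp() instead of plotPCA(); the data and group labels are invented.

```r
set.seed(1)
# Invented counts: 50 genes x 6 samples, two groups of three.
group <- rep(c("A", "F"), each = 3)
counts <- matrix(rpois(50 * 6, lambda = 100), nrow = 50,
                 dimnames = list(paste0("gene", 1:50), paste0(group, 1:3)))
# Make the first 10 genes higher in group F so the groups separate.
counts[1:10, group == "F"] <- counts[1:10, group == "F"] * 4

# log2(x + 1) is only a simple stand-in for vst()/rlog() here.
logmat <- log2(counts + 1)

# prcomp() expects samples in rows, so transpose the gene x sample matrix.
pca <- prcomp(t(logmat))
pct_var <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

print(pca$x[, 1:2])
print(pct_var)
```

The samples of each group land on opposite sides of PC1, which is the pattern to look for in the real PCA plot.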

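1.3.3 Shrinkage: lfcShrink

It shrinks the log2 fold change without changing the total number of significantly differential genes; the authors strongly recommend this newer feature.

Why use lfcShrink? log2FC estimates do not account for the large dispersion we observe with low read counts. Two kinds of data therefore benefit most: data with a high share of low-expression genes, and data that are highly dispersed.

However, I only used it for the MA plot, not for the differential analysis, because:

lfcShrink leaves the p-values and q-values unchanged but alters the fold changes, narrowing their range, so DEG selection with a fold-change cutoff gives a different, usually smaller, set. Given how this data set looks, the DEG analysis here is done without shrinkage.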

1.3.4 p-value and q-value

The authors' advice:
Need to filter on adjusted p-values, not p-values, to obtain FDR control. 10% FDR is common because RNA-seq experiments are often exploratory and having 90% true positives in the gene set is ok.
In short: filter the results on padj.

In fact, the package uses alpha for the padj cutoff in several places, with a default of alpha=0.1;
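The difference between filtering on raw and adjusted p-values is easy to see with base R's p.adjust(). The p-values below are invented; "BH" (Benjamini-Hochberg) is the adjustment method DESeq2's results() uses by default.

```r
# Invented raw p-values for five genes.
pvalue <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjusted p-values.
padj <- p.adjust(pvalue, method = "BH")

# Number of genes passing each cutoff.
n_raw05 <- sum(pvalue < 0.05)  # raw p-value, no FDR control
n_fdr10 <- sum(padj < 0.1)     # 10% FDR
n_fdr05 <- sum(padj < 0.05)    # 5% FDR

print(data.frame(pvalue, padj))
```

Four genes pass a raw p < 0.05 filter, but only two remain at padj < 0.05, which is why the cutoff should be applied to padj.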

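1.3.5 MA plot

The MA plot, also called a mean-difference plot or Bland-Altman plot, gives an overview of the distribution of the estimated model coefficients;

X axis: the "A" (average); Y axis: the "M" (minus) – subtraction of log values is equivalent to the log of the ratio;

M is the log fold change, which measures the change in expression and shows whether a gene is up or down; A is the mean count of each gene;

summary(res) shows a high proportion of low counts, so most genes are not highly expressed and their points cluster around 0 (log2(1)=0, i.e. a fold change of 1, no change), giving an overview of the distribution of the estimated coefficients.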
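The two axes can be computed by hand. This base-R sketch uses invented group means and the classic definition of M and A; note that DESeq2's plotMA() puts the mean of normalized counts (baseMean) on the x axis rather than the average of the logs.

```r
# Invented normalized mean counts per gene in two groups.
untrt <- c(geneA = 8, geneB = 64,  geneC = 512, geneD = 16)
trt   <- c(geneA = 8, geneB = 256, geneC = 128, geneD = 16)

# M: difference of logs = log of the ratio (log2 fold change, trt over untrt).
M <- log2(trt) - log2(untrt)

# A: average log2 expression of the two groups.
A <- (log2(trt) + log2(untrt)) / 2

print(data.frame(A, M))
```

geneA and geneD sit at M = 0 (no change), geneB at M = 2 (four-fold up) and geneC at M = -2 (four-fold down); plotting M against A gives the MA plot.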

1.4 Meaning of each column in the DESeq(dds) results table:

baseMean: is just the average of the normalized count values, dividing by size factors, taken over all samples in the DESeqDataSet; that is, the mean of normalized counts over all samples, not only the control group;
log2FoldChange: the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment with dexamethasone in comparison to untreated samples; it is not computed simply from the normalized counts, because zeros and other effects have to be taken into account; the result is log2fc(trt/untrt), so pay attention to which level is control and which is treatment;
lfcSE: the standard error estimate for the log2 fold change estimate (the effect size estimate has an uncertainty associated with it);
p value: statistical test, the result of this test is reported as a p value. Remember that a p value indicates the probability that a fold change as strong as the observed one, or even stronger, would be seen under the situation described by the null hypothesis;

p value is sometimes NA: Sometimes a subset of the p values in res will be NA (not available); This is DESeq's way of reporting that all counts for this gene were zero, and hence no test was applied. In addition, p values can be assigned NA if the gene was excluded from analysis because it contained an extreme count outlier. For more information, see the outlier detection section of the DESeq2 vignette.

padj: adjusted p-value;

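1.5 Other issues met in practice

1.5.1 Pre-analysis

This is where you get to know your data and choose suitable analysis methods;

Start with three plots: PCA, correlation heatmap, and clustering dendrogram;

1.5.2 Batch effects

1. Batch effect in this project:

design =~ batch + condition

2. Batch effects in general:

They can be removed with limma removeBatchEffect, sva ComBat and similar tools;

For differential analysis, however, ballgown, DESeq2 and similar packages advise against removing the batch beforehand and recommend including it as a covariate instead;

If you want to run a differential expression analysis on data with a known batch problem, add the batch factor when building the model matrix. With ballgown I used adjustvars = c("idv").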

7.2.2 Hands-on

After many rounds of analysis and adjustment, this is the code I ended up using:

(1) Install the package

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("DESeq2")

(2) Import the data for a pairwise comparison

setwd( )
colData <- read.table('', header=TRUE, row.names=1)
readscount <- read.table('', header=TRUE, row.names=1)
condition <- factor(c(rep("A",1),rep("F",1)))
batch <- factor(c(rep(),rep(),rep(),
                  rep(),rep(),rep()))
library(DESeq2) 
dds <- DESeqDataSetFromMatrix(readscount, colData, design =~ batch + condition)
keep <- rowSums(counts(dds) >= 10) >= 3  
dds <- dds[keep, ]

（3）PCA

vsdata <- vst(dds, blind=FALSE)  # variance-stabilizing transformation
assay(vsdata) <- limma::removeBatchEffect(assay(vsdata), vsdata$batch)  # remove the batch effect, for visualization only
plotPCA(vsdata, intgroup = "condition")

(4) Differential analysis

dds_norm <- DESeq(dds, minReplicatesForReplace = Inf) # size factors, dispersions, GLM fit; Inf turns off outlier replacement
dds_norm$condition   # check the level order: the default comparison is last level over first (trt/untrt); otherwise set contrast in results()
res <- results(dds_norm, contrast = c("condition","A","F"), cooksCutoff = FALSE) # contrast gives A over F; alpha=0.05 can be set to match the padj cutoff; cooksCutoff = FALSE skips Cook's-distance outlier filtering because too many genes were flagged
summary(res)  
#resOrdered <- res[order(res$pvalue), ] # sort by p-value
sum(res$padj<0.05, na.rm = TRUE)
res_data <- merge(as.data.frame(res),
              as.data.frame(counts(dds_norm,normalized=TRUE)),
              by="row.names",sort=FALSE)
up_DEG <- subset(res_data, padj < 0.05 & log2FoldChange > 1)
down_DEG <- subset(res_data, padj < 0.05 & log2FoldChange < -1)
write.csv(up_DEG, "up.csv")
write.csv(down_DEG, "down.csv")

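(5) Inspect Cook's distances: if there are abnormal samples, do not use cooksCutoff; the same applies when thousands of outliers are flagged. (This step is entirely optional.)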

par(mar=c(8,5,2,2))
boxplot(log10(assays(dds_norm)[["cooks"]]), range=0, las=2)

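(6) lfcShrink & MA plot

library(apeglm)  
resultsNames(dds_norm)  # list the coefficients available for shrinkage; the shrunken table is tighter, lacks the stat column, and keeps padj while changing the fold changes
res_shrink <- lfcShrink(dds_norm, coef="condition_A_vs_F", type="apeglm") # apeglm is the most recommended method; coef can be given by position in resultsNames(dds_norm) (here the 5th, coef=5) or by name; apeglm does not accept contrast, so coef is required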
pdf("MAplot.pdf", width = 6, height = 6) 
plotMA(res_shrink, ylim=c(-10,10), alpha=0.1, main="MA plot")
dev.off()

(7) Volcano plot; a significant column with the values down, up and stable must first be added based on the log fold change

library(ggplot2)
voldata <-read.csv(file = "allDEGs.csv",header = TRUE, row.names =1,sep = ",")
pdf("volcano.pdf", width = 6, height = 5)
ggplot(data=voldata, aes(x=log2FoldChange,y= -1*log10(padj))) +
  geom_point(aes(color=significant)) +
  scale_color_manual(values=c("#546de5", "#d2dae2","#ff4757")) + 
  labs(title="Volcano Plot", x=expression(log[2](FC)), y=expression(-log[10](padj))) +
  geom_hline(yintercept=1.3,linetype=4) +  
  geom_vline(xintercept=c(-1,1),linetype=4) +
  theme_bw() + theme(panel.grid = element_blank())  
dev.off()
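The significant column that the volcano plot colours by is not produced by DESeq2 and has to be added by hand. A minimal base-R sketch of one way to do it, assuming the same cutoffs as in step (4) (padj < 0.05 and |log2FoldChange| > 1); the small table is invented and stands in for the exported results.

```r
# Invented results table; in practice this would be as.data.frame(res).
voldata <- data.frame(
  log2FoldChange = c(2.5, -1.8, 0.3, 3.0, NA),
  padj           = c(0.001, 0.02, 0.5, 0.2, NA),
  row.names      = paste0("gene", 1:5)
)

# Start with everything "stable", then overwrite genes passing both cutoffs.
# which() drops NA comparisons, so genes with NA padj stay "stable".
voldata$significant <- "stable"
is_up   <- which(voldata$padj < 0.05 & voldata$log2FoldChange > 1)
is_down <- which(voldata$padj < 0.05 & voldata$log2FoldChange < -1)
voldata$significant[is_up]   <- "up"
voldata$significant[is_down] <- "down"

# Level order down, stable, up matches the order of the three colours
# given to scale_color_manual().
voldata$significant <- factor(voldata$significant,
                              levels = c("down", "stable", "up"))
print(table(voldata$significant))
```

gene4 has a large fold change but padj = 0.2, so it stays stable; both conditions must hold for a gene to be coloured.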

Notes:

Official DESeq2 vignette: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilt

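8. Enrichment analysis (Gene Ontology Enrichment Analysis)

Enrichment analysis falls into two types:

A: enrichment of differential genes (no expression values needed, only gene names)
B: gene set enrichment (regardless of differential status; needs expression values for all genes)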
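Type A is commonly based on a hypergeometric test: given how many background genes belong to a term, is the overlap with the DEG list larger than chance? A base-R sketch with invented numbers for a single term:

```r
# Invented numbers for one GO term.
N <- 20000  # annotated background genes
K <- 200    # background genes annotated to the term
n <- 500    # DEGs submitted
k <- 15     # DEGs that fall in the term

# P(X >= k) when drawing n genes without replacement from the background.
p_enrich <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on the 2x2 table
# (rows: in term / not in term; columns: DEG / not DEG).
tab <- matrix(c(k, n - k, K - k, N - K - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(hypergeometric = p_enrich, fisher = p_fisher))
```

By chance about 500 x 200 / 20000 = 5 DEGs would hit the term, so 15 hits gives a small p-value. Only the gene lists enter the test, which is why no expression values are needed.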

8.1 DEG enrichment: ID conversion

For human or mouse, no conversion is needed.

Method:
Online web tool g:Profiler https://biit.cs.ut.ee/gprofiler/convert

8.2 GO enrichment and plotting

Tool: clusterProfiler, the R package by Guangchuang Yu ("Y叔");

Code:

setwd('')
library(clusterProfiler)
library(DOSE)
library(org.Ss.eg.db)
library(ggplot2)
library(stringr)
###get the ENTREZID for the next analysis
keytypes(org.Ss.eg.db) 
gene_All <- read.csv(file = "", header = T)
gene_Alias <- gene_All[ ,2]
gene_ID <- bitr(gene = gene_Alias, fromType = "ALIAS", 
              toType = c("SYMBOL","ENTREZID"),
              OrgDb = org.Ss.eg.db) 
###Go classification
go_res <- enrichGO(gene = gene_ID$ENTREZID, 
                   OrgDb = "org.Ss.eg.db", 
                   ont = "all",
                   pvalueCutoff = 0.9,
                   qvalueCutoff = 0.9)
go_dose <- DOSE::setReadable(go_res, OrgDb = 'org.Ss.eg.db', keyType = 'ENTREZID')
write.csv(go_dose, 'goresult.csv')
pdf('goresult.pdf')
barplot(go_dose, split = "ONTOLOGY", font.size = 10, 
        title="DEGs GO enrichment") + 
  facet_grid(ONTOLOGY~., scale = "free") + 
  scale_x_discrete(labels = function(x) str_wrap(x, width=50)) 
dev.off()

KEGG enrichment:

kegg <- enrichKEGG(gene = gene_ID$ENTREZID,
                   organism = 'ssc',
                   keyType = "kegg",
                   pvalueCutoff = 0.9,
                   qvalueCutoff = 0.9,
                   pAdjustMethod = "BH",
                   minGSSize = 10, 
                   maxGSSize = 500)
kegg[1:30]
pdf('keggresult.pdf')
barplot(kegg, showCategory = 20, font.size = 10, xlab = "Gene Counts",
        title = "kegg") + 
  scale_size(range = c(2, 12)) + 
  scale_x_discrete(labels = function(kegg) str_wrap(kegg, width = 50)) 
dev.off()

8.3 Allenricher GO+KEGG

Run on Linux:

perl /software/AllEnricher-v1.0/AllEnricher 
-l id.txt 
-s ssc -v v20190612 
-o /Allenricher_GO_KEGG/ 
-r /bin/Rscript 
-i KEGG+GO

# The gene-number and padj cutoffs for enrichment can be changed in the AllEnricher scripts.

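8.4 Gene set enrichment: GSEA (clusterProfiler package)

Interpreting the results

Enrichment Score: the idea is that when the members of a gene set are randomly scattered through the ranked gene list from the expression matrix, the resulting ES is relatively small; when they are not randomly distributed, the ES is relatively large.
Entries with |NES|>1, p-value<0.05 and FDR<=25% are generally considered meaningful;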
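The enrichment score itself is a running sum along the ranked list. The following base-R sketch uses an invented ranked list and a hypothetical gene set, with hits weighted by the absolute statistic (weight exponent 1); it leaves out the permutation step that turns ES into NES and a p-value.

```r
# Invented ranked list: genes sorted by decreasing statistic (e.g. log2FC).
stat <- c(g1 = 3.0, g2 = 2.0, g3 = 1.5, g4 = 0.5,
          g5 = -0.5, g6 = -1.0, g7 = -2.0, g8 = -3.0)
gene_set <- c("g1", "g2", "g4")  # hypothetical gene set

in_set <- names(stat) %in% gene_set

# Step up at each hit, weighted by |stat|; step down equally at each miss.
hit_step  <- in_set * abs(stat) / sum(abs(stat[in_set]))
miss_step <- (!in_set) / sum(!in_set)
running   <- cumsum(hit_step - miss_step)

# ES is the running-sum value with the largest absolute deviation from zero.
ES <- running[which.max(abs(running))]

print(round(running, 3))
```

Here the set members sit near the top of the list, so the running sum climbs early and peaks at g2; a set scattered at random would keep the sum close to zero.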