RNA-Seq data analysis

数据准备

数据来源

测序下机数据

直接进行质控和后续分析;

已发表数据

以“The PP2A-interactor TIP41 modulates ABA responses in Arabidopsis thaliana”,这篇文章为例;

示例文章

获取bioproject号;


bioproject

NCBI数据库搜索


NCBI 搜索

获取SRR号


SRA experiments
SRA
SRA号

SRR号如下:
SRR6281250 【tip41ABA2】tip41_1 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep2
SRR6281251 【tip41ABA3】tip41_1 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep3
SRR6281252 【tip41Con1】tip41_1 12 d old in vitro on germination media_rep1
SRR6281253 【tip41Con2】tip41_1 12 d old in vitro on germination media_rep2
SRR6281254 【ColABA2】Col-0 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep2
SRR6281255 【ColABA3】Col-0 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep3
SRR6281256 【ColCon3】Col-0 12 d old in vitro on germination media_rep3
SRR6281257 【ColABA1】Col-0 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep1
SRR6281258 【ColCon1】Col-0 12 d old in vitro on germination media_rep1
SRR6281259 【ColCon2】Col-0 12 d old in vitro on germination media_rep2
SRR6281260 【tip41Con3】tip41_1_12 d old in vitro on germination media_rep3
SRR6281261 【tip41ABA1】tip41_1 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep1

数据下载 【sratoolkit prefetch】

vim prefetch.sh
写一个for循环

for ((i = 50; i <= 61; i++)) 
do prefetch SRR62812$i 2> prefetch.log
done

运行程序
nohup bash prefetch.sh &

数据拆分 【sratoolkit fastq-dump】

SRR数据下载完成后需要进行拆分成fastq格式才能用于后续分析,数据拆分使用NCBI sratoolkit工具包中的fastq-dump程序;
nohup fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files *.sra 2> fastq-dump.log &

质控 【fastp】

使用fastp对原始数据进行质控;

vim fastp.sh

for file in `ls ../01_data/*_1.fastq | perl -lpe 's/_1.fastq//'`;
do fastp -i ${file}_1.fastq -I ${file}_2.fastq -o ${file}_R1.clean.fq.gz -O ${file}_R2.clean.fq.gz -j ${file}.json -h ${file}.html 2> ${file}.log;
done

nohup bash fastp.sh 2> fastp.log &


参考基因组准备

对于有参转录组分析,需要准备参考基因组文件(fasta file) 和基因组注释文件(gff3/gtf file),在比对前需将gff3格式转换成gtf格式;
gff3转换gtf
gffread TAIR10_GFF3_genes.gff -T -o TAIR10_gtf_genes.gtf

比对 【hisat2】

这一步需要将质控后的数据比对到参考基因组上,进行比对的软件有很多,在此使用hisat2,生成sam文件;比对之前需要对参考基因组建立索引;

vim hisat2.sh

genome=../03_genome/TAIR10_chr_all.fa
genome_prefix=../03_genome/TAIR10_chr_all
hisat2-build $genome $genome_prefix 1>hisat2-build.log 2>hisat2-build.err
for sample in `ls ../02_fastp/*_R1.clean.fq.gz | perl -lpe 's/_R1.clean.fq.gz//'`;
do hisat2 --new-summary -p 2 -x $genome_prefix -1 ${sample}_R1.clean.fq.gz -2 ${sample}_R2.clean.fq.gz ${sample}.sam 2> ${sample}.err;
done

nohup bash hisat2.sh 2> hisat2.log &

排序去重 【samtools sort】

将生成的sam文件转换成bam文件,并排序去重,用于后续counts数计算;

for file in *.sam;do 
    samtools view -b $file > ${file%.sam}.bam
    samtools sort ${file%.sam}.bam > ${file%.sam}.sorted.bam
    samtools index ${file%.sam}.sorted.bam
done

counts数计算 【Rsubread】

使用Rsubread包计算counts数;
vim featurecounts.sh

for file in ../05_sorted/*.sorted.bam; do 
        Rscript run-featurecounts.R ../05_sorted/$file ../03_genome/TAIR10_gtf_genes.gtf 10 $file
done

vim run-featurecounts.R

library(Rsubread)
library(limma)
library(edgeR)
args <- commandArgs( trailingOnly = TRUE )
bamFile <- args[1]
gtfFile <- args[2]
nthreads <- args[3]
outFilePref <- args[4]
outStatsFilePath <- paste(outFilePref, '.stat', sep = '');
outCountsFilePath <- paste(outFilePref, '.count', sep = '');
fCountsList = featureCounts(bamFile, annot.ext=gtfFile, isGTFAnnotationFile=TRUE, nthreads=nthreads, isPairedEnd=TRUE)
dgeList = DGEList(counts=fCountsList$counts, genes=fCountsList$annotation)
fpkm = rpkm(dgeList, dgeList$genes$Length)
tpm = exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
write.table(fCountsList$stat, outStatsFilePath, sep="\t", col.names=FALSE, row.names=FALSE, quote=FALSE)
featureCounts = cbind(fCountsList$annotation[,1], fCountsList$counts, fpkm, tpm)
colnames(featureCounts) = c('gene_id', 'counts', 'fpkm','tpm')
write.table(featureCounts, outCountsFilePath, sep="\t", col.names=TRUE, row.names=FALSE, quote=FALSE)

nohup bash featurecounts.sh 2> featurecounts.log &

生成表达矩阵 【reshape2】

将所有的counts数合并成矩阵,用于后续分析;
vim matrix.sh

perl -lne 'if ($ARGV=~/(.*).count/){print "$1\t$_"}' *.count > matrix.count

awk -F"\t" '{print $1"\t"$2"\t"$3}' matrix.count > gene.count
awk -F"\t" '{print $1"\t"$2"\t"$5}' matrix.count > gene.tpm

vim matrix.R

a=read.csv('gene.count',header=F,sep="\t")
colnames(a)=c('sample','gene','count')
library(reshape2)
counts=dcast(a,formula=gene~sample)
write.table(counts,file="gene_count.txt",sep="\t",quote=FALSE,row.names=FALSE)

Rscript matrix.R


查看样本相关系数

library(pheatmap)
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/difference_expression/")
tpmTable <- read.table("gene_tpm_sort.txt", check.names = F, sep = "\t", row.names = 1, 
                       header = T)
ddCor <- cor(tpmTable, method = "spearman")
pheatmap(file = "allSampleWithNum.cor.pdf", ddCor, clustering_method = "average", 
         display_numbers = T)
allSampleWithNum.cor

差异表达分析

准备counts文件

head -n 1 gene_count.txt

gene tip41ABA2 tip41ABA3 tip41Con1 tip41Con2 ColABA2 ColABA3 ColCon3 ColABA1 ColCon1 ColCon2 tip41Con3 tip41ABA1

awk -F"\t" '{print $1"\t"$10"\t"$11"\t"$8"\t"$9"\t"$6"\t"$7"\t"$4"\t"$5"\t"$12"\t"$13"\t"$2"\t"$3}' gene_count.txt > gene_count_sort.txt

head -n 1 gene_count.txt > 1.txt
cat gene_tpm.txt | head -n 33324 | tail -n +2 >> 1.txt
awk -F"\t" '{print $1"\t"$10"\t"$11"\t"$8"\t"$9"\t"$6"\t"$7"\t"$4"\t"$5"\t"$12"\t"$13"\t"$2"\t"$3}' 1.txt > gene_tpm_sort.txt
awk -F"\t" '{print $1"\t"$5"\t"$6"\t"$7"\t"$11"\t"$12"\t"$13}' gene_count_sort.txt > gene_ColABA_vs_tip41ABA_count.txt
awk -F"\t" '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7}' gene_count_sort.txt > gene_ColCon_vs_ColABA_count.txt
awk -F"\t" '{print $1"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13}' gene_count_sort.txt > gene_tip41Con_vs_tip41ABA_count.txt
awk -F"\t" '{print $1"\t"$2"\t"$3"\t"$4"\t"$8"\t"$9"\t"$10}' gene_count_sort.txt > gene_ColCon_vs_tip41Con_count.txt
#cut -f1,2,3,4,8,9,10 gene_count_sort.txt > gene_ColCon_vs_tip41Con_count.txt

差异表达分析

【ColCon_vs_ColABA】

#载入DESeq2包
library(DESeq2)
#【ColCon_vs_ColABA】
#设置工作目录
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/difference_expression/ColCon_vs_ColABA/")
#读取数据
raw <- read.table("gene_ColCon_vs_ColABA_count.txt", header = T, row.names = 1)
#矩阵化
Raw <- as.matrix(raw)
#Use DESeq2 to identify DE genes
raw_condition <- factor(c(rep("ColCon",3),rep("ColABA",3)))
coldata <- data.frame(row.names=colnames(raw), raw_condition)
dds <- DESeqDataSetFromMatrix(countData=raw, colData=coldata, design=~raw_condition)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
summary(resOrdered)
# out of 24871 with nonzero total read count
# adjusted p-value < 0.1
# LFC > 0 (up)       : 4304, 17%
# LFC < 0 (down)     : 4916, 20%
# outliers [1]       : 12, 0.048%
# low counts [2]     : 3315, 13%
# (mean count < 2)
# [1] see 'cooksCutoff' argument of ?results
# [2] see 'independentFiltering' argument of ?results

#查看有显著差异的基因的个数
print(sum(resOrdered$padj < 0.01 & abs(resOrdered$log2FoldChange)>1, na.rm=TRUE))
# [1] 2406
#output DE results
res <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), 
                 by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
#过滤结果
resdata_filter <- subset(resdata,resdata$padj<0.01 & abs(resdata$log2FoldChange) >1)
#保存结果
write.table(resdata_filter, file="diffexpr_ColCon_vs_ColABA_0.01.txt",sep="\t",
            row.names=F)

PCA分析

输入的矩阵不同,得出的结论也会不同。由于基因的表达水平在不同样本间本身就存在一定的差异,所以无论是采用原始的还是归一化之后的表达量矩阵,效果都不理想。针对这一问题,DESeq2提出了两种count值的转换算法,rlog和VST转换。

dds <- estimateSizeFactors(dds)
raw <- SummarizedExperiment(counts(dds, normalized=FALSE), colData=colData(dds))
nor <-SummarizedExperiment(counts(dds, normalized=TRUE), colData=colData(dds))
#vst转换
vsd <- vst(dds)
#rlog转换
rld <- rlog(dds)
#两种转换本质上是在降低生物学重复之间的差异,使得样本聚类和PCA分析的效果更好。转换之后的表达量数据可以采用assay函数进行提取;VST和rlog转换之后,生物学重复之间更佳的接近,不同分组也区分的较为明显
head(assay(rld[, 1:6]))
#             ColCon1   ColCon2   ColCon3   ColABA1   ColABA2   ColABA3
# AT1G01010  8.308275  8.318305  7.905503  7.940777  7.976028  7.990062
# AT1G01020  7.349027  7.310091  7.511672  7.180577  7.253919  7.214220
# AT1G01030  6.437458  6.479145  6.413880  6.140799  6.081422  6.110596
# AT1G01040 10.296109 10.303349 10.415850 10.340620 10.284638 10.326584
# AT1G01046  3.061996  3.039576  3.187465  2.977824  2.929619  2.972257
# AT1G01050  9.849337  9.843198  9.789624  9.783000  9.791312  9.907184

#对于raw count data,采用rlog或者VST转换之后的数据去进行PCA和聚类分析,效果会更好。
plotPCA(vsd, intgroup=c("raw_condition"), returnData=TRUE)
# PC1          PC2  group raw_condition    name
# ColCon1 -34.69931   9.36822458 ColCon        ColCon ColCon1
# ColCon2 -25.90069   4.80286144 ColCon        ColCon ColCon2
# ColCon3 -26.81507 -15.38832746 ColCon        ColCon ColCon3
# ColABA1  29.88549   3.38267261 ColABA        ColABA ColABA1
# ColABA2  27.76088  -0.09886219 ColABA        ColABA ColABA2
# ColABA3  29.76870  -2.06656898 ColABA        ColABA ColABA3
plotPCA(vsd, intgroup=c("raw_condition"))

【ColCon_vs_tip41Con】

#设置工作目录
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/difference_expression/ColCon_vs_tip41Con/")
#读取数据
raw <- read.table("gene_ColCon_vs_tip41Con_count.txt", header = T, row.names = 1)
#矩阵化
Raw <- as.matrix(raw)
#Use DESeq2 to identify DE genes
raw_condition <- factor(c(rep("ColCon",3),rep("tip41Con",3)))
coldata <- data.frame(row.names=colnames(raw), raw_condition)
dds <- DESeqDataSetFromMatrix(countData=raw, colData=coldata, design=~raw_condition)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
summary(resOrdered)
#查看有显著差异的基因的个数
print(sum(resOrdered$padj < 0.01 & abs(resOrdered$log2FoldChange)>1, na.rm=TRUE))
# [1] 2406
#output DE results
res <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), 
                 by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
#过滤结果
resdata_filter <- subset(resdata,resdata$padj<0.01 & abs(resdata$log2FoldChange) >1)
#保存结果
write.table(resdata_filter, file="diffexpr_ColCon_vs_ColABA_0.01.txt",
            sep="\t",row.names=F)

【ColABA_vs_tip41ABA】

#设置工作目录
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/difference_expression/ColABA_vs_tip41ABA/")
#读取数据
raw <- read.table("gene_ColABA_vs_tip41ABA_count.txt", header = T, row.names = 1)
#矩阵化
Raw <- as.matrix(raw)
#Use DESeq2 to identify DE genes
raw_condition <- factor(c(rep("ColABA",3),rep("tip41ABA",3)))
coldata <- data.frame(row.names=colnames(raw), raw_condition)
dds <- DESeqDataSetFromMatrix(countData=raw, colData=coldata, design=~raw_condition)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
summary(resOrdered)
#查看有显著差异的基因的个数
print(sum(resOrdered$padj < 0.01 & abs(resOrdered$log2FoldChange)>1, na.rm=TRUE))
# [1] 2406
#output DE results
res <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), 
                 by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
#过滤结果
resdata_filter <- subset(resdata,resdata$padj<0.01 & abs(resdata$log2FoldChange) >1)
#保存结果
write.table(resdata_filter, file="diffexpr_ColABA_vs_tip41ABA_0.01.txt",
            sep="\t",row.names=F)

【tip41Con_vs_tip41ABA】

#设置工作目录
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/difference_expression/tip41Con_vs_tip41ABA/")
#读取数据
raw <- read.table("gene_tip41Con_vs_tip41ABA_count.txt", header = T, row.names = 1)
#矩阵化
Raw <- as.matrix(raw)
#Use DESeq2 to identify DE genes
raw_condition <- factor(c(rep("tip41Con",3),rep("tip41ABA",3)))
coldata <- data.frame(row.names=colnames(raw), raw_condition)
dds <- DESeqDataSetFromMatrix(countData=raw, colData=coldata, design=~raw_condition)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
summary(resOrdered)
#查看有显著差异的基因的个数
print(sum(resOrdered$padj < 0.01 & abs(resOrdered$log2FoldChange)>1, na.rm=TRUE))
# [1] 2406
#output DE results
res <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), 
                 by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
#过滤结果
resdata_filter <- subset(resdata,resdata$padj<0.01 & abs(resdata$log2FoldChange) >1)
#保存结果
write.table(resdata_filter, file="diffexpr_tip41Con_vs_tip41ABA_0.01.txt",
            sep="\t",row.names=F)

韦恩图绘制

  1. gplots包中的venn函数
  2. venneuler包中的venneuler函数
  3. venn包中的venn函数
  4. venndiagram包中的venn.diagram函数
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/venn/")
install.packages("VennDiagram")
library(VennDiagram)

### venn.diagram (图形输出到文件)
# 显示当前工作目录
listA <- read.table("gene_list_ColABA_vs_tip41ABA.txt", header = FALSE)
listB <- read.table("gene_list_ColCon_vs_ColABA.txt", header = FALSE)
listC <- read.table("gene_list_ColCon_vstip41Con_.txt", header = FALSE)
listD <- read.table("gene_list_tip41Con_vs_tip41ABA.txt", header = FALSE)
ColABA_vs_tip41ABA <- listA$V1
ColCon_vs_ColABA <- listB$V1
ColCon_vstip41Con <- listC$V1
tip41Con_vs_tip41ABA <- listD$V1
length(ColABA_vs_tip41ABA); length(ColCon_vs_ColABA); 
length(ColCon_vstip41Con); length(tip41Con_vs_tip41ABA)

# 四个样品的韦恩图
venn.diagram(list(ColABA_vs_tip41ABA = ColABA_vs_tip41ABA, 
                  ColCon_vs_ColABA = ColCon_vs_ColABA, 
                  ColCon_vstip41Con = ColCon_vstip41Con, 
                  tip41Con_vs_tip41ABA = tip41Con_vs_tip41ABA), 
             fill = c("yellow", "red", "cyan", "forestgreen"), 
             cex = 1.5, filename = "Col_vs_tip41_venn4.png", 
             output = TRUE, imagetype = "png", 
             height = 100, width = 100,
             resolution = 100, 
             compression = "lzw")
#保存为pdf
p <- venn.diagram(list(ColABA_vs_tip41ABA = ColABA_vs_tip41ABA, 
                       ColCon_vs_ColABA = ColCon_vs_ColABA, 
                       ColCon_vstip41Con = ColCon_vstip41Con, 
                       tip41Con_vs_tip41ABA = tip41Con_vs_tip41ABA), 
                  fill = c("yellow", "red", "cyan", "forestgreen"), 
                  cex = 1.5, filename = NULL, output = TRUE)

pdf("Col_vs_tip41_venn4.pdf")
grid.draw(p)

# 绘图参数
venn.diagram(
  x = list(set1, set2, set3),
  category.names = c("Set 1" , "Set 2 " , "Set 3"),
  filename = 'venn2.png',
  output=TRUE,
# 输出
  imagetype="png" ,  # 类型(tiff png svg)
#height = 1000 ,   # 高度
#width = 1000 ,   # 宽度
  resolution = 400,  # 分辨率
  compression = "lzw",  # 压缩算法
# 圈
  lwd = 5,  # 圈线条粗细 1 2 3 4 5
  lty = 1,  # 线条类型, 1 实线, 2 虚线, blank 无线条
  fill = color,  # 填充色
  col = c("red", 'green', 'blue'),  # 线条色
# 数字 number
  cex = 2,  # 数字大小
  fontface = "bold",  # 加粗
  fontfamily = "sans",  # 字体
# 标签 category
  cat.cex = 2,  # 字体大小
  cat.col = c("red", 'green', 'blue'),  # 字体色
  cat.fontface = "bold",  # 加粗
  cat.default.pos = "outer",  # 位置, outer 内 text 外
  cat.pos = c(-27, 27, 135),  # 位置,用圆的度数
  cat.dist = c(0.055, 0.055, 0.085),  # 位置,离圆的距离
  cat.fontfamily = "sans",  # 字体
  rotation = 1  # 1 2 3 旋转确定大打头数据集
)
Venn plot

基因表达量分布热图

R绘制热图方法

  1. heatmap()
  2. gplots包heatmap.2()
  3. pheatmap包pheatmap()
  4. ComplexHeatmap包
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/heatmap/")
m <- read.table("DEgene_tpm_ColABA_vs_tip41ABA.txt", header = T, row.names = 1)
class(m)
x <- as.matrix(m)
heatmap(x)
# 样品名和基因名互换t()函数转置
heatmap(t(x))
heatmap(x, col = c("green", "red"))
# 随机生成渐变色
colorRampPalette(c("red", "black", "green"))(nrow(x))
yanse <- colorRampPalette(c("green", "black", "red"))(nrow(x))
heatmap(x, col = yanse)
# 设置颜色条
heatmap(x, col = yanse, ColSideColors = colorRampPalette(c("red", "black", "green"))
        (ncol(x)))
# 聚类设置 Rowv(按行聚类);Colv(按列聚类)
heatmap(x, col = yanse, Colv = NA, Rowv = NA)
library(pheatmap)
pheatmap(x)
?pheatmap
colnames(x)
annotation_col <- data.frame(CellType = factor(rep(
  c("ColCon", "ColABA", "tip41Con", "tip41ABA"), each = 3)))
# 行号替换为样品名(样品和分组信息对应)
rownames(annotation_col) <- colnames(x)
# 对列分组
pheatmap(x, annotation_col = annotation_col)
# display_numbers显示数值
pheatmap(x, annotation_col = annotation_col, display_numbers = T)
# number_format = "0.2f" 设置数值格式(保留两位小数)
pheatmap(x, annotation_col = annotation_col, display_numbers = T, number_format = "0.2f")
pheatmap(x, annotation_col = annotation_col, display_numbers = T, number_format = "0.1f")
heatmap

火山图

setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/Volcano/")
g <- read.table("diffexpr_ColCon_vs_ColABA_all.txt", header = T, row.names = 1)
head(g)
# 去掉缺失行
g <- na.omit(g)
plot(g$log2FoldChange, g$padj)
plot(g$log2FoldChange, -1*log10(g$padj))
plot(g$log2FoldChange, -1*log10(g$padj), xlim = c(-10, 10), ylim = c(0, 100))
g <- transform(g, padj = -1*log10(g$padj))
down <- g[g$log2FoldChange <= -1,]
up <- g[g$log2FoldChange >=1,]
no <- g[g$log2FoldChange > -1 & g$log2FoldChange < 1,]

plot(no$log2FoldChange, no$padj, xlim = c(-10,10), 
     ylim = c(0,100), col = "blue", pch = 16, cex = 0.8,
     main = "Gene Expression", xlab = "log2FoldChange",
     ylab = "-log10(Qvalue)")

points(up$log2FoldChange, up$padj, col = "red", pch = 16, cex = 0.8)
points(down$log2FoldChange, down$padj, col = "green", pch = 16, cex = 0.8)
Volcano plot

COG

Clusters of Orthologous Groups of proteins
http://www.ncbi.nlm.nih.gov/COG/
http://www.ncbi.nlm.nih.gov/COG/old/xognitor.html
http://www.nlm.nih.gov/COG/old/COGhelp.html
ftp://ftp.ncbi.nih.gov/pub/COG/
COG注释作用:

  1. 通过已知蛋白对未知序列进行功能注释;
  2. 通过查看指定的COG编号对应的protein数目,存在及缺失,从而能推导特定的代谢途径是否存在;
  3. 每个COG编号是一类蛋白,将query序列和比对上的COG编号的proteins进行多序列比对,能确定保守位点,分析其进化关系。

使用eggNOG-mapper在线工具进行COG和GO注释,需要准备差异基因氨基酸序列;


EggNOG

输入数据和邮箱在线提交即可;


eggNOG-mapper

运行完成后下载

out.emapper.annotations 
文件;

emapper annotation

处理结果文件,提取COG category信息,用于分析和作图;


COG category

COG功能注释图

设置工作目录

setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/COG/")
m <- read.table("cog.class.annot.txt", header = T, sep = "\t")
head(m)
layout(matrix(c(1,2), nrow = 1), widths = c(20,13))
layout.show(2)
par(mar = c(3,4,4,1) + 0.1)
# 设定顺序
class <- c("J","A","K","L","B","D","V","T","M","Z","W","U","O","C","G","E","F","H","I","P","Q","S")
# 将m中Code列转化为因子
t <- factor(as.character(m$Code), levels = class)
# 用新的因子数据对原始数据进行排序
m <- m[order(t),]
head(m)
barplot(m$Gene.Number, space = F, col = rainbow(22), ylab = "Number of Genes", names.arg = m$Code)
# 添加分组信息
l <- c(0, 5, 13, 21, 22)
id <- c("INFORMATION STORAGE\nAND PROCESSING", "CELLULAR PROCESSES\nAND SIGNALING", "METABOLISM", "POORLY\n CHAPACTERIZED")
id
abline(v = l[c(-1,-5)])
for (i in 2:length(l)) {
  text((l[i-1] + l[i])/2, max(m[,3])*1.1, id[i-1], cex = 0.8, xpd = T)
}

par(mar = c(2,0,2,1) + 0.1)
# axes 不显示坐标轴
plot(0,0,type = "n", xlim = c(0,1), ylim = c(0,26), bty = "n", axes = F, xlab = "", ylab = "")
for ( i in 1:length(class)){
  text(0,26-i+0.5, paste(m$Code[i], m$Functional.Categories[i]), pos = 4, cex = 1, pty = T)
}

# 添加标题
title(main = "COG Function Classification")
COG Function Classification

GO enrichment analysis

http://eggnog5.embl.de/#/
https://github.com/eggnogdb/eggnog-mapper/wiki
http://current.geneontology.org/ontology/go.obo

GO号获取

基因组功能注释文件

https://www.arabidopsis.org/download_files/GO_and_PO_Annotations/Gene_Ontology_Annotations/ATH_GO_GOSLIM.txt.gz

eggNOG-mapper

http://eggnog-mapper.embl.de/job_status?jobname=MM_z9vko3fw

geneontology.org

http://current.geneontology.org/products/pages/downloads.html
http://current.geneontology.org/annotations/tair.gaf.gz

从ATH_GO_GOSLIM.txt提取差异基因GO号;

awk -F"\t" '{print $1"\t"$5"\t"$6"\t"$8}' ATH_GO_GOSLIM.txt > 1.txt
cat 1.txt | grep -f DEgene_list_GOenrichment.txt > 2.txt
awk -F"\t" '{print $4"\t"$2"\t"$3}' 2.txt > 3.txt
sort -k3,4 3.txt | uniq -c > 4.txt
awk -F" " '{print $1}' 4.txt > 5.txt
awk -F"\t" '{print $1}' 4.txt | awk -F" " '{print $2}' > 6.txt
awk -F"\t" '{print $2"\t"$3}' 4.txt > 7.txt

cat 5.txt | awk '{sum+=$1} END {print "Sum = ", sum}'
awk '{print $1"\t""39933"}' 5.txt > 55.txt
echo "$1 $2" | awk -F"\t" '{printf("%.2f\n", ($1/$2)*100)}' 55.txt > 555.txt

sed -i 's/F/Molecular function/g' 6.txt
sed -i 's/C/Cellular component/g' 6.txt
sed -i 's/P/Biological process/g' 6.txt

paste 6.txt 7.txt 555.txt > 8.txt

sed -i '1i\Ontology\tTerm\tGO_ID\tInput_number\tPercentage' 8.txt
mv 8.txt DEgene_GO_term.txt

R包获取GO号和对应注释信息;

library(GO.db)
goterms <- Term(GOTERM)
GOlist = as.data.frame(goterms)
View(GOlist)
keytypes(GO.db)
#[1] "DEFINITION" "GOID"       "ONTOLOGY"   "TERM"      
columns(GO.db)
#[1] "DEFINITION" "GOID"       "ONTOLOGY"   "TERM"      
keys(GO.db, keytype = "GOID")[1:3]
#[1] "GO:0000001" "GO:0000002" "GO:0000003"
k <- keys(GO.db, keytype = "GOID")[1:3]
select(GO.db, keys = k, columns = c("TERM","ONTOLOGY"),
+        keytype="GOID")
'select()' returned 1:1 mapping between keys and columns
#        GOID                             TERM ONTOLOGY
#1 GO:0000001        mitochondrion inheritance       BP
#2 GO:0000002 mitochondrial genome maintenance       BP
#3 GO:0000003                     reproduction       BP

write.table(golist, file = "GOlist.txt", sep = "\t", row.names = F)

GO功能注释条形图

setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/GO_enrichment/")

go <- read.table("DEgene_GO_term.txt", header = T, sep = "\t")
head(go)
View(go)
library(ggplot2)
# 排序
go_sort <- go[order(go$Ontology, -go$Percentage),]
m <- go_sort[go_sort$Ontology == "Molecular function",][1:10,]
c <- go_sort[go_sort$Ontology == "Cellular component",][1:10,]
b <- go_sort[go_sort$Ontology == "Biological process",][1:10,]
# 合并向量
slimgo <- rbind(b, c, m)
slimgo
# 将Term列转化为因子
slimgo$Term = factor(slimgo$Term, levels = slimgo$Term)

colnames(slimgo)
PP <- ggplot(data = slimgo, mapping = aes(x = Term, y = Percentage, fill = Ontology))
PP
PP + geom_bar(stat = "identity")
# 翻转坐标轴
PP + geom_bar(stat = "identity") + coord_flip()
PP + geom_bar(stat = "identity") + coord_flip() + 
  scale_x_discrete(limits = rev(levels(slimgo$Term)))

# 去掉图例
PP + geom_bar(stat = "identity") + coord_flip() + 
  scale_x_discrete(limits = rev(levels(slimgo$Term))) +
  guides(fill = FALSE)
# 切换黑白背景主题
PP + geom_bar(stat = "identity") + coord_flip() + 
  scale_x_discrete(limits = rev(levels(slimgo$Term))) +
  guides(fill = FALSE) + theme_bw()
GO term

KEGG enrichment analysis

针对非模式材料,使用KAAS在线工具进行KEGG注释;准备参考基因组的蛋白序列文件(fasta格式),上传数据,填写邮箱,提交即可;
https://www.genome.jp/kaas-bin/kaas_main

KAAS

上传protein文件
下载htext结果文件
q00001.keg

编写python脚本,提取需要的信息;

python3 kegg-to-geneID.py -i q00001.keg -o TAIR10_kegg.txt
awk -F"\t" '{print $5"\t""ko:"$3"\t"$4}' TAIR10_kegg.txt > TAIR10_kegg_pathway_annotation.txt

ClusterProfiler对差异基因进行富集分析

setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/KEGG_enrichment/")
### ClusterProfiler进行富集分析
#安装clusterProfiler包
BiocManager::install("clusterProfiler")
library(clusterProfiler)
### 准备pathway注释文件
data <- read.table("TAIR10_KEGG_annotation.txt", header = T, sep = "\t")
go2gene <- data[, c(2, 1)]
go2name <- data[, c(2, 3)]
### 准备差异基因列表文件
genelist <- read.table("DEgene_listKEGG_enrichment.txt", header = T)
### 差异基因ID向量
gene <- as.factor(genelist$Gene_ID)
### sorted vector: number should be sorted in decreasing order
gene <- sort(gene, decreasing = TRUE)
### ORA(Over-Representation Analysis)
kk <- enricher(gene,TERM2GENE = go2gene,TERM2NAME = go2name, pvalueCutoff  = 0.05,
               pAdjustMethod  = "BH",
               minGSSize = 5,
               qvalueCutoff  = 0.5)
write.table(kk, file = "diff-keggenricher.txt", quote =F, sep = "\t", row.names = F)

GSEA富集分析

### GSEA(Gene Set Enrichment Analysis)
### ## assume 1st column is ID
## 2nd column is FC
## head(geneList)
## Entrez.ID Fold.Change

## feature 1: numeric vector
glist <- genelist[,2];head(glist)

## feature 2: named vector
names(glist) <- as.character(genelist[,1]);head(glist)

## feature 3: decreasing order
glist <- sort(glist,decreasing = T); head(glist)

#gsea <- GSEA(glist, TERM2GENE=c5, verbose=FALSE, pvalueCutoff = 0.8)

kk <- GSEA(glist, TERM2GENE = go2gene, TERM2NAME = go2name, nPerm  = 1000,
           minGSSize = 10,
           maxGSSize = 500,
           pvalueCutoff = 0.05,
           pAdjustMethod     = "BH")

KEGG功能注释图(气泡图)

library(ggplot2)
pathway <- read.csv("diff-keggenricher.csv", header = T, sep = ",")
# 查看列名
colnames(pathway)
# [1] "ID"          "Description" "GeneRatio"   "BgRatio"     "pvalue"      "p.adjust"   
# [7] "qvalue"      "geneID"      "Count"       "All_unigene" "richFactor" 
# 映射数据
PP <- ggplot(data = pathway, mapping = aes(x = richFactor, y = Description))
# 绘制点图
PP + geom_point()
# 将点的大小映射到Count列
PP + geom_point(mapping = aes(size = Count))
# qvalue映射到color
PP + geom_point(mapping = aes(size = Count, color = qvalue))
# 修改颜色
PP + geom_point(mapping = aes(size = Count, color = qvalue)) + 
  scale_color_gradient(low = "green", high = "red")
# 修改标题
PP + geom_point(mapping = aes(size = Count, color = qvalue)) + 
  scale_color_gradient(low = "green", high = "red") + 
  labs(title = "Top20 of Pathway Enrichment", x = "Rich Factor", 
       y = "Pathway Name", color = "-log10(qvalue)", size = "Gene Number") + theme_bw()
barplot
Circle plot
dot plot

以上。


Huazhong Agricultural University

你可能感兴趣的:(RNA-Seq data analysis)