RNA-seq分析简洁版

  • 前面RNA-seq分析:从软件安装到富集分析部分已经把转录组全部流程走完了一遍,这次利用RNA-seq(2)-2:下载数据中下载的肝癌数据进行分析,不在赘述细节,所以有看细节的还是请去这里。
  • 这部分主要是代码和部分结果图,但进行了部分修正,比如KEGG 可视化部分,用了前clusterprofiler部分的结果,所以这部分不包括Gage包。
  • 所以从质量比对开始进行
-----------------------------------------------------------------

3 sra到fastq格式转换并进行质量控制

3.1 数据解压:用samtools中的fastq-dump将sra格式转为fastq格式
  • 备注:用时5个小时
for ((i=2;i<=5;i++));do fastq-dump --gzip --split-3 -A SRR31621$i.sra -O .;done
3.2 用fastqc进行质量控制
#将所有的数据进行质控,得到zip的压缩文件和html文件
fastqc -o .  *.fastq.gz
3.3 3 质控结果解读(为保持紧凑,只放一张图)
RNA-seq分析简洁版_第1张图片
SRR316214.png

4 下载参考基因组及基因注释

RNA-seq(4):下载参考基因组及基因注释部分已经下载

5 序列比对:Hisat2

5.1 开始比对:用hisat2,得到SAM文件(5个小时)
  • 我的fastq文件在/mnt/f/rna_seq/data
  • 我的reference在/mnt/f/rna_seq/data/reference
  • 我的index在/mnt/f/rna_seq/data/reference/index/hg19
  • 比对后得到的bam文件会存放在/mnt/f/rna_seq/aligned
 source ~/miniconda3/bin/activate
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/data$ cd ..
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq$
for ((i=2;i<=5;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR31621${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR31621${i}.sra_2.fastq.gz -S SRR31621${i}.sam ;done
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq$ for ((i=2;i<=5;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR31621${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR31621${i}.sra_2.fastq.gz -S SRR31621${i}.sam ;done
Time loading forward index: 00:00:05
Time loading reference: 00:00:01
Multiseed full-index search: 02:11:39
101339131 reads; of these:
  101339131 (100.00%) were paired; of these:
    7840870 (7.74%) aligned concordantly 0 times
    87196406 (86.04%) aligned concordantly exactly 1 time
    6301855 (6.22%) aligned concordantly >1 times
    ----
    7840870 pairs aligned concordantly 0 times; of these:
      333400 (4.25%) aligned discordantly 1 time
    ----
    7507470 pairs aligned 0 times concordantly or discordantly; of these:
      15014940 mates make up the pairs; of these:
        8711709 (58.02%) aligned 0 times
        5441734 (36.24%) aligned exactly 1 time
        861497 (5.74%) aligned >1 times
95.70% overall alignment rate
5.2 SAM文件转换为BAM文件,并对bam文件进行sort,最后建立索引:SAMtools
Sam 文件转换为bam格式(用时1.5hrs)
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/aligned$ for ((i=2;i<=5;i++));do samtools view -S /mnt/f/rna_seq/SRR31621${i}.sam -b > SRR31621${i}.bam;done
对bam文件进行排序,默认染色体位置(用时1.5hrs)
for ((i=2;i<=5;i++));do samtools sort SRR31621${i}.bam
-o SRR31621${i}_sorted.bam;done
建立索引(用时40mins)
for ((i=2;i<=5;i++));do samtools index SRR31621${i}_sorted.bam;done

6 reads计数,合并矩阵并进行注释

6.1 bam文件按reads name排序(用时1h)
for ((i=2;i<=5;i++));do samtools sort -n SRR31621${i}.b
am -o SRR31621${i}_nsorted.bam;done
6.2 2 reads计数,得到表达矩阵htseq-count(用时很久很久,久到不忍写,8hrs)

注释文件如果已经解压,则不再需要重复

cd ../data/matrix
gunzip /mnt/f/rna_seq/data/reference/annotation/hg19/gencode.v19.annotation.gt.gz && rm -rf gencode.v19.annotation.gtf.gz
for ((i=2;i<=5;i++));do htseq-count -r name -f bam /mnt/f/rna_seq/aligned/SRR31621${i}_nsorted.bam /mnt/f/rna_seq/data/reference/annotation/hg19/gencode.v19.an
notation.gtf > SRR31621${i}.count; done
ls -al *.count
-rwxrwxrwx 1 root root 1197426 Aug  7 16:25 SRR316212.count
-rwxrwxrwx 1 root root 1186189 Aug  7 17:16 SRR316213.count
-rwxrwxrwx 1 root root 1200305 Aug  7 22:51 SRR316214.count
-rwxrwxrwx 1 root root 1187596 Aug  7 23:32 SRR316215.count
6.3 3 合并表达矩阵并进行基因名注释(R中进行)

Tumor:SRR316214,SRR316215
Adjacent Normal Liver:SRR316212,SRR316213

setwd("F:/rna_seq/data/matrix")
options(stringsAsFactors = FALSE)
control1<-read.table("SRR316212.count",sep = "\t",col.names = c("gene_id","control1"))
control2<-read.table("SRR316213.count",sep = "\t",col.names = c("gene_id","control2"))
treat1<-read.table("SRR316214.count",sep = "\t",col.names = c("gene_id","treat1"))
treat2<-read.table("SRR316215.count",sep = "\t",col.names = c("gene_id","treat2"))
raw_count <- merge(merge(control1, control2, by="gene_id"), merge(treat1, treat2, by="gene_id"))
head(raw_count)
raw_count_filt <- raw_count[-1:-5,]
ENSEMBL <- gsub("\\.\\d*", "", raw_count_filt$gene_id) 
row.names(raw_count_filt) <- ENSEMBL
head(raw_count_filt)
> head(raw_count_filt)
                           gene_id control1 control2 treat2.x treat2.y
ENSG00000000003 ENSG00000000003.10     4004      781      756      756
ENSG00000000005  ENSG00000000005.5        1        0        0        0
ENSG00000000419  ENSG00000000419.8      776      140      165      165
ENSG00000000457  ENSG00000000457.9      624      144      240      240
ENSG00000000460 ENSG00000000460.12      260       52      105      105
ENSG00000000938  ENSG00000000938.8      161       59       16       16

7 DEseq2筛选差异表达基因并注释

这次我换了Annotation包进行注释

7.1 载入数据(countData和colData)

# 这一步很关键,要明白condition这里是因子,不是样本名称;小鼠数据有对照组和处理组,各两个重复
condition <- factor(c(rep("control",2),rep("treat",2)), levels = c("control","treat"))
condition
colData <- data.frame(row.names=colnames(mycounts), condition)
> colData
         condition
control1   control
control2   control
treat1       treat
treat2       treat

7.2 构建dds对象,开始DESeq流程

dds <- DESeqDataSetFromMatrix(countData=mycounts, 
                              colData=colData, 
                              design= ~ condition)
dds = DESeq(dds)

7.3 总体结果查看

接下来,我们要查看treat versus control的总体结果,并根据p-value进行重新排序。利用summary命令统计显示一共多少个genes上调和下调(FDR0.1)

res = results(dds, contrast=c("condition", "control", "treat"))
res = res[order(res$pvalue),]
head(res)
summary(res)
write.csv(res,file="All_results.csv")
table(res$padj<0.01)
> summary(res)
out of 30981 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 3928, 13% 
LFC < 0 (down)   : 3283, 11% 
outliers [1]     : 0, 0% 
low counts [2]   : 10317, 33% 
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> table(res$padj<0.01)
FALSE  TRUE 
16192  4472

可见,上调基因和下调的数目都很多,padj<0.01的有4472,即使padj<0.001的也有3089个。

7.4 提取差异表达genes(DEGs)

先看下padj(p值经过多重校验校正后的值)小于0.01,表达倍数取以2为对数后大于1或者小于-1的差异表达基因。代码如下

diff_gene_deseq2 <-subset(res, padj < 0.01 & abs(log2FoldChange) > 1)
dim(diff_gene_deseq2)
head(diff_gene_deseq2)
write.csv(diff_gene_deseq2,file= "HCC_DEG_0.05_2.csv")
> dim(diff_gene_deseq2)
[1] 3780    6
> head(diff_gene_deseq2)
log2 fold change (MLE): condition control vs treat 
Wald test p-value: condition control vs treat 
DataFrame with 6 rows and 6 columns
                 baseMean log2FoldChange     lfcSE      stat        pvalue          padj
                                  
ENSG00000130649 78059.342       7.190959 0.2110024  34.07998 1.460506e-254 3.017990e-250
ENSG00000268925  5666.055       7.534595 0.2573299  29.27991 1.869559e-188 1.931628e-184
ENSG00000105697  8791.483       6.611356 0.2338215  28.27523 6.970959e-176 4.801597e-172
ENSG00000167244  8988.099       6.767261 0.2582403  26.20528 2.313301e-151 1.195051e-147
ENSG00000162366  4037.960      -7.076854 0.2704122 -26.17062 5.742578e-151 2.373293e-147
ENSG00000169715  3803.597       6.238596 0.2403373  25.95767 1.489813e-148 5.130914e-145

把padj<0.001,|log2FC|>1有2938个

8 探索分析结果:Data visulization

#MA plot
plotMA(res,ylim=c(-2,2))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
  points(baseMean, log2FoldChange, col="dodgerblue", cex=6, lwd=2)
  text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})

#shirnked
res_order<-res[order(row.names(res)),]
res = res_order
res.shrink <- lfcShrink(dds, contrast = c("condition","treat","control"), res=res)
plotMA(res.shrink, ylim = c(-5,5))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
  points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
  text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})
#identify
idx <- identify(res$baseMean, res$log2FoldChange)
rownames(res)[idx]
#plotcounts
plotCounts(dds, gene=which.min(res$padj), intgroup="condition", returnData=TRUE)
plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=FALSE)
#boxplot
plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=TRUE) %>% 
  ggplot(aes(condition, count)) + geom_boxplot(aes(fill=condition)) + scale_y_log10() + ggtitle("CYP2E1")
#pointplot
d <- plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=TRUE)
ggplot(d, aes(x=condition, y=count)) + 
  geom_point(aes(color= condition),size= 4, position=position_jitter(w=0.5,h=0)) + 
  scale_y_log10(breaks=c(25,100,400))+ ggtitle("CYP2E1")

##3最小padj
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition", 
                returnData=TRUE)
ggplot(d, aes(x=condition, y=count)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) + 
  scale_y_log10(breaks=c(25,100,400))
#PCA
vsdata <- vst(dds, blind=FALSE)
plotPCA(vsdata, intgroup="condition")
#beatifule pca plot
pcaData <- plotPCA(vsdata, intgroup=c("condition"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=condition)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()
library("pheatmap")
select<-order(rowMeans(counts(dds, normalized = TRUE)),
              decreasing = TRUE)[1:20]
df <- as.data.frame(colData(dds)[,c("condition","sizeFactor")])
# this gives log2(n + 1)
ntd <- normTransform(dds)
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df)
#vsd
vsd <- vst(dds, blind=FALSE)
pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
         cluster_cols=FALSE, annotation_col=df)
##sample to sample heatmap
sampleDists <- dist(t(assay(vsd)))
library("RColorBrewer")
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

这部分主要结果部分的一些可视化,具体参考详细的部分即可,只放一张图


RNA-seq分析简洁版_第2张图片
Rplot.jpeg

9 DEGs的富集分析(功能注释)ClusterProfiler包

##enrichment analysis using clusterprofiler package created by yuguangchuang
library(clusterProfiler)
library(DOSE)
library(stringr)
library(org.Hs.eg.db)
#get the ENTREZID for the next analysis
sig.gene= diff_gene_deseq2
head(sig.gene)
gene<-rownames(sig.gene)
head(gene)
gene.df<-bitr(gene, fromType = "ENSEMBL", 
              toType = c("SYMBOL","ENTREZID"),
              OrgDb = org.Hs.eg.db)

head(gene.df)
> head(gene.df)
          ENSEMBL   SYMBOL ENTREZID
1 ENSG00000130649   CYP2E1     1571
3 ENSG00000105697     HAMP    57817
4 ENSG00000167244     IGF2     3481
5 ENSG00000162366 PDZK1IP1    10158
6 ENSG00000169715     MT1E     4493
7 ENSG00000074276    CDHR2    54825

GO enrichment

#Go enrichment
ego_cc<-enrichGO(gene       = gene.df$ENSEMBL,
                 OrgDb      = org.Hs.eg.db,
                 keyType    = 'ENSEMBL',
                 ont        = "CC",
                 pAdjustMethod = "BH",
                 pvalueCutoff = 0.01,
                 qvalueCutoff = 0.05)
ego_bp<-enrichGO(gene       = gene.df$ENSEMBL,
                 OrgDb      = org.Hs.eg.db,
                 keyType    = 'ENSEMBL',
                 ont        = "BP",
                 pAdjustMethod = "BH",
                 pvalueCutoff = 0.01,
                 qvalueCutoff = 0.05)
barplot(ego_bp,showCategory = 18,title="The GO_BP enrichment analysis of all DEGs ")+ 
  scale_size(range=c(2, 12))+
  scale_x_discrete(labels=function(ego_bp) str_wrap(ego_bp,width = 25))

KEGG enrichment

library(stringr)
kk<-enrichKEGG(gene      =gene.df$ENTREZID,
               organism = 'hsa',
               pvalueCutoff = 0.05)
> kk[1:30]
               ID                                          Description GeneRatio  BgRatio
hsa04610 hsa04610                  Complement and coagulation cascades   38/1407  79/7431
hsa04933 hsa04933 AGE-RAGE signaling pathway in diabetic complications   40/1407  99/7431
hsa00071 hsa00071                               Fatty acid degradation   23/1407  44/7431
hsa04974 hsa04974                     Protein digestion and absorption   35/1407  90/7431
hsa05146 hsa05146                                           Amoebiasis   36/1407  96/7431
hsa04978 hsa04978                                   Mineral absorption   23/1407  51/7431
hsa04115 hsa04115                                p53 signaling pathway   29/1407  72/7431
hsa05144 hsa05144                                              Malaria   22/1407  49/7431
hsa00982 hsa00982                    Drug metabolism - cytochrome P450   28/1407  72/7431
hsa00980 hsa00980         Metabolism of xenobiotics by cytochrome P450   29/1407  76/7431
hsa03320 hsa03320                               PPAR signaling pathway   28/1407  74/7431
hsa05204 hsa05204                              Chemical carcinogenesis   30/1407  82/7431
mykegg<-barplot(kk,showCategory = 20, title="The KEGG enrichment analysis of all DEGs")+
  scale_size(range=c(2, 12))+
  scale_x_discrete(labels=function(kk) str_wrap(kk,width = 25))
 dotplot(kk,showCategory = 20, title="The KEGG enrichment analysis of all DEGs")+
  scale_size(range=c(2, 12))+
  scale_x_discrete(labels=function(kk) str_wrap(kk,width = 25))

用cowplot拼起来

library(cowplot)
plot_grid(mygobp,mykegg,labels = c("A","B"),align = "h")
RNA-seq分析简洁版_第3张图片
Rplot01.jpeg

10 KEGG信号通路的可视化(pathview包)

备注,前面详细部分用gageData重新进行了富集分析,这部分直接调用ClusterProfiler包的富集结果
  • 选取enrichKEGG结果pvalue排名第一的hsa4610:Complement and coagulation cascades和排名第7的hsa04115:p53 signaling pathway
library("pathview")
foldchanges = sig.gene$log2FoldChange
names(foldchanges)= gene.df$ENTREZID
head(foldchanges)
pathview(gene.data = foldchanges, pathway.id = "hsa04610", species="hsa")

------------------------------------------------------------------------------

hsa4610:Complement and coagulation cascades
RNA-seq分析简洁版_第4张图片
hsa04610.pathview.png

------------------------------------------------------------------------------

hsa04115:p53 signaling pathway

RNA-seq分析简洁版_第5张图片
hsa04115.pathview.png

11 把counts结果,clusterprofiler部分的symbol name 和DEGs全部合并到一个表格

备注,这部分的主要问题是没用可以merge用的ID,先看下
> head(mycounts)
                control1 control2 treat1 treat2
ENSG00000000003     4004      781   4229    756
ENSG00000000005        1        0      0      0
ENSG00000000419      776      140   1180    165
ENSG00000000457      624      144   1271    240
ENSG00000000460      260       52    610    105
ENSG00000000938      161       59     57     16
> head(gene.df)
          ENSEMBL   SYMBOL ENTREZID
1 ENSG00000130649   CYP2E1     1571
3 ENSG00000105697     HAMP    57817
4 ENSG00000167244     IGF2     3481
5 ENSG00000162366 PDZK1IP1    10158
6 ENSG00000169715     MT1E     4493
7 ENSG00000074276    CDHR2    54825
> head(sig.gene)
log2 fold change (MLE): condition control vs treat 
Wald test p-value: condition control vs treat 
DataFrame with 6 rows and 6 columns
                  ENSEMBL log2FoldChange     lfcSE      stat        pvalue          padj
                                  
ENSG00000130649 78059.342       7.190959 0.2110024  34.07998 1.460506e-254 3.017990e-250
ENSG00000268925  5666.055       7.534595 0.2573299  29.27991 1.869559e-188 1.931628e-184
ENSG00000105697  8791.483       6.611356 0.2338215  28.27523 6.970959e-176 4.801597e-172
ENSG00000167244  8988.099       6.767261 0.2582403  26.20528 2.313301e-151 1.195051e-147
ENSG00000162366  4037.960      -7.076854 0.2704122 -26.17062 5.742578e-151 2.373293e-147
ENSG00000169715  3803.597       6.238596 0.2403373  25.95767 1.489813e-148 5.130914e-145

所以执行以下代码:

head(sig.gene)
head(gene.df)
ENSEMBL<-rownames(sig.gene)
sig.gene<-cbind(ENSEMBL, sig.gene)
colnames(sig.gene)[1]<-c("ENSEMBL")

head(mycounts)
ENSEMBL<-rownames(mycounts)
mycounts<-cbind(ENSEMBL, mycounts)
colnames(mycounts)[1]<-c("ENSEMBL")

merge1<-merge(sig.gene,mycounts,by="ENSEMBL")
merge2<-merge(merge1,gene.df, by="ENSEMBL")
head(merge2)
write.csv(DEG_symbole,file="hcc_DEGs_last_results.csv")

结果如下:

> head(merge2)
DataFrame with 6 rows and 13 columns
          ENSEMBL   baseMean log2FoldChange     lfcSE      stat       pvalue         padj  control1  control2    treat1    treat2
                            
1 ENSG00000000938   70.51956       2.169409 0.4610591  4.705274 2.535248e-06 2.632581e-05       161        59        57        16
2 ENSG00000001460   77.93853      -1.496365 0.4247974 -3.522537 4.274371e-04 2.517116e-03        68        18       268        68
3 ENSG00000001561  382.90350      -1.466940 0.3397462 -4.317751 1.576269e-05 1.347065e-04       416        69      1780       227
4 ENSG00000001626   70.41775       4.713865 0.5812437  8.109965 5.063455e-16 1.944819e-14       226        61        16         1
5 ENSG00000001630  165.86393      -2.283850 0.4094980 -5.577194 2.444286e-08 3.665365e-07        84        28       442       207
6 ENSG00000002549 1683.37557       1.519578 0.2216429  6.855972 7.082930e-12 1.708432e-10      4166      1104      2171       481
       SYMBOL    ENTREZID
   
1         FGR        2268
2       STPG1       90529
3       ENPP4       22875
4        CFTR        1080
5     CYP51A1        1595
6        LAP3       51056

至此,这部分结果可以用来做其他很多下游分析了。

你可能感兴趣的:(RNA-seq分析简洁版)