Author : yujia
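Contents:
- Overview
- Building the index and quantifying transcripts with salmon
- Read alignment and quantification with subread
- Differential expression analysis with DESeq2
- Summary
- Practice data: from Arabidopsis thaliana, 16 samples in total, treated in 4 groups (0day, 1day, 2day, 3day)
- Goal: get familiar with two RNA-seq differential expression workflows (the salmon workflow and the subread workflow)
- Data location: /public/study/mRNAseq/tair/
- Software location: /usr/local/bin/miniconda3/bin/
- Project source: Jimmy's 生信菜鸟团 blog: http://www.bio-info-trainee.com/2809.html (a hands-on plant transcriptome project)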
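salmon is an RNA-seq analysis tool that quantifies transcripts quickly without a conventional sequence alignment step. Its workflow has two steps: 1. build an index; 2. quantify the reads (quantification). Using salmon therefore keeps quantification fast and simple. The commands are as follows:
The basic syntax for building an index with salmon is: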
salmon index -t athal.fa.gz -i athal_index
# index: build an index
# -t: path to the .fa file
# -i: path where the index is stored
So our command is:
/usr/local/bin/miniconda3/bin/salmon index -t /public/study/mRNAseq/tair/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz -i /trainee/home/yjxiang/practice/index_file
After it runs, the following files are produced:
-rw-rw-r-- 1 yjxiang yjxiang 14K Aug 13 14:40 duplicate_clusters.tsv
-rw-rw-r-- 1 yjxiang yjxiang 751M Aug 13 14:40 hash.bin
-rw-rw-r-- 1 yjxiang yjxiang 357 Aug 13 14:40 header.json
-rw-rw-r-- 1 yjxiang yjxiang 115 Aug 13 14:40 indexing.log
-rw-rw-r-- 1 yjxiang yjxiang 412 Aug 13 14:40 quasi_index.log
-rw-rw-r-- 1 yjxiang yjxiang 116 Aug 13 14:40 refInfo.json
-rw-rw-r-- 1 yjxiang yjxiang 7.8M Aug 13 14:40 rsd.bin
-rw-rw-r-- 1 yjxiang yjxiang 247M Aug 13 14:40 sa.bin
-rw-rw-r-- 1 yjxiang yjxiang 63M Aug 13 14:40 txpInfo.bin
-rw-rw-r-- 1 yjxiang yjxiang 96 Aug 13 14:40 versionInfo.json
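The index only needs to be built once, so it can be convenient to guard the command against accidental rebuilds. The following is a minimal sketch, not part of the original workflow: the function name `build_index_if_missing` and the `SALMON` variable are hypothetical, and treating `versionInfo.json` (one of the files in the listing above) as the sign of a finished index is an assumption that may not hold for other salmon versions.

```shell
#!/bin/bash
# Hypothetical helper: build the salmon index only when it is not there yet.
# SALMON can be overridden (for example SALMON=echo) to preview the command.
SALMON=${SALMON:-/usr/local/bin/miniconda3/bin/salmon}

build_index_if_missing() {
    local fasta=$1 index_dir=$2
    # Assumption: a non-empty versionInfo.json marks a completed index.
    if [ -s "${index_dir}/versionInfo.json" ]; then
        echo "Index already present in ${index_dir}, skipping"
    else
        "$SALMON" index -t "$fasta" -i "$index_dir"
    fi
}
```

For a dry run, setting `SALMON=echo` before the call prints the command that would be executed instead of running it.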
The basic syntax for quantification with salmon is:
salmon quant -i athal_index -l A -1 samp_1.fastq.gz -2 samp_2.fastq.gz -p 8 -o quants/sample_quant
# quant: the salmon subcommand that performs quantification
# -i: tells salmon where to find the index
# -l A: tells salmon to determine the library type of the sequencing reads automatically
# -1 and -2: tell salmon where to find the left and right reads for this sample
# -p 8: tells salmon to use 8 threads
# -o: specifies the directory where salmon's quantification results should be written
Since we have 16 samples in total and each one has to be quantified, we write a bash script to process them in a batch:
#!/bin/bash
# This script is stored in /trainee/home/yjxiang/practice
# The FASTQ paths below are relative, so run it from the directory that holds the reads.
index=/trainee/home/yjxiang/practice/index_file
for samp in ERR1698{194..209}
do
    echo "Processing sample ${samp}"
    /usr/local/bin/miniconda3/bin/salmon quant -i "$index" -l A \
        -1 "${samp}_1.fastq.gz" \
        -2 "${samp}_2.fastq.gz" \
        -p 6 -o "/trainee/home/yjxiang/practice/quants/${samp}_quant"
done
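Before launching 16 quantification jobs, it can help to confirm that both read files exist for every sample. A minimal sketch, assuming the FASTQ files follow the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming used in the script above; the function name `check_pairs` and its directory argument are hypothetical.

```shell
#!/bin/bash
# Hypothetical pre-flight check: report samples whose paired FASTQ files are incomplete.
# Usage: check_pairs <fastq_dir> <sample> [<sample> ...]
check_pairs() {
    local dir=$1 missing=0 samp mate
    shift
    for samp in "$@"; do
        for mate in 1 2; do
            if [ ! -s "${dir}/${samp}_${mate}.fastq.gz" ]; then
                echo "Missing or empty: ${dir}/${samp}_${mate}.fastq.gz"
                missing=$((missing + 1))
            fi
        done
    done
    echo "${missing} file(s) missing"
    # Non-zero exit status when anything is missing, so callers can stop early.
    [ "$missing" -eq 0 ]
}
```

Called as `check_pairs . ERR1698{194..209}` from the directory holding the reads, it returns a non-zero status if any file is absent, so the batch run can be skipped until the inputs are complete.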
Running the script produces the following results:
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:28 ERR1698194_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:38 ERR1698195_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:39 ERR1698196_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:39 ERR1698197_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:40 ERR1698198_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:40 ERR1698199_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:41 ERR1698200_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:42 ERR1698201_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:43 ERR1698202_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:43 ERR1698203_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:44 ERR1698204_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:44 ERR1698205_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:45 ERR1698206_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:46 ERR1698207_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:46 ERR1698208_quant
drwxrwxr-x 5 yjxiang yjxiang 4.0K Aug 13 15:47 ERR1698209_quant
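Each folder holds the quant results for the corresponding sample. Taking sample ERR1698209 as an example, the folder contains the files below; quant.sf is the quantification result we are after: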
drwxrwxr-x 2 yjxiang yjxiang 4.0K Aug 13 15:47 aux_info
-rw-rw-r-- 1 yjxiang yjxiang 307 Aug 13 15:47 cmd_info.json
-rw-rw-r-- 1 yjxiang yjxiang 551 Aug 13 15:47 lib_format_counts.json
drwxrwxr-x 2 yjxiang yjxiang 4.0K Aug 13 15:47 libParams
drwxrwxr-x 2 yjxiang yjxiang 4.0K Aug 13 15:00 logs
-rw-rw-r-- 1 yjxiang yjxiang 1.8M Aug 13 15:47 quant.sf
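With 16 result folders, a small manifest of sample names and quant.sf paths makes it easy to spot a failed run and to hand the files to downstream tools. A minimal sketch, assuming the `<sample>_quant` directory naming shown above; the function name `make_manifest` and the `sample,quant_sf` header are hypothetical choices.

```shell
#!/bin/bash
# Hypothetical helper: print "sample,path" for every <sample>_quant directory
# that contains a non-empty quant.sf, and warn on stderr about the others.
make_manifest() {
    local quants_dir=$1 d samp
    echo "sample,quant_sf"
    for d in "${quants_dir}"/*_quant; do
        [ -d "$d" ] || continue          # also skips the literal pattern when nothing matches
        samp=$(basename "$d" _quant)     # ERR1698209_quant -> ERR1698209
        if [ -s "${d}/quant.sf" ]; then
            echo "${samp},${d}/quant.sf"
        else
            echo "Warning: no quant.sf in ${d}" >&2
        fi
    done
}
```

For example, `make_manifest /trainee/home/yjxiang/practice/quants > manifest.csv` would write one line per successfully quantified sample (the output file name is only an illustration).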
Take a look at quant.sf; the commonly used TPM and NumReads values are both in it:
$ head -n 5 quant.sf
Name Length EffectiveLength TPM NumReads
ATMG00010.1 462 301.089 0.000000 0.000000
ATMG00030.1 324 166.891 0.000000 0.000000
ATMG00040.1 948 786.477 0.000000 0.000000
ATMG00050.1 396 236.034 0.000000 0.000000
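For a quick overview of a whole quant.sf file rather than its first rows, the columns shown above can be summarised with awk. A minimal sketch: the function name `summarize_quant` is hypothetical, and counting a transcript as "expressed" when TPM > 0 is an arbitrary illustrative threshold, not a recommendation from the original workflow.

```shell
#!/bin/bash
# Hypothetical helper: summarise one quant.sf file.
# Columns (whitespace-separated): Name Length EffectiveLength TPM NumReads
summarize_quant() {
    awk '
        NR == 1 { next }                 # skip the header line
        { total++; reads += $5 }         # count transcripts, sum NumReads
        $4 > 0 { expressed++ }           # assumption: "expressed" means TPM > 0
        END {
            printf "transcripts=%d expressed=%d reads=%.1f\n", total, expressed, reads
        }
    ' "$1"
}
```

Running `summarize_quant quant.sf` inside a sample folder prints a single line with the number of transcripts, how many have a non-zero TPM, and the total of the NumReads column.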
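That completes the salmon workflow. The resulting quant files can then be used for downstream differential expression analysis with packages such as DESeq2, edgeR, and limma. Next, let's see how the subread tool quantifies RNA-seq data.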