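Transcriptome (3): Understanding FASTQ sequencing data

Learning goals: The previous post downloaded the RNA-seq data SRR3589956.sra through SRR3589962.sra. This post extracts them with sratoolkit.2.6.3, looks at the FASTQ format, checks data quality with fastqc, visualizes the data in IGV, and shows how to run these steps in batch.
References: http://www.biotrainee.com/thread-1831-1-1.html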
http://fbb84b26.wiz03.com/share/s/3XK4IC0cm4CL22pU-r1HPcQQ2irG2836uQYm2iZAyh1Zwf3_

1. Using sratoolkit

Run fastq-dump -h to see the help.

fastq-dump [options] <path> [<path>...]  # basic usage

Common options:

INPUT
  -A|--accession <accession>       Replaces accession derived from <path> in
                                   filename(s) and deflines (only for single
                                   table dump)
  --table <table-name>             Table name within cSRA object, default is
                                   "SEQUENCE"

OUTPUT
  -O|--outdir <path>               Output directory, default is working
                                   directory '.' )
  -Z|--stdout                      Output to stdout, all split data become
                                   joined into single stream
  --gzip                           Compress output using gzip  # fastqc can read gzip-compressed files directly
  --bzip2                          Compress output using bzip2  # compresses better than gzip, but more slowly

Multiple File Options              Setting these options will produce more
                                     than 1 file, each of which will be suffixed
                                     according to splitting criteria.
  --split-files                    Dump each read into separate file. Files
                                   will receive suffix corresponding to read
                                   number
  --split-3                        Legacy 3-file splitting for mate-pairs:
                                   First biological reads satisfying dumping
                                   conditions are placed in files *_1.fastq and
                                   *_2.fastq If only one biological read is
                                   present it is placed in *.fastq Biological
                                   reads 3 and above are ignored.
 

Batch extraction:

# Extract SRR3589956 through SRR3589962 as gzip-compressed FASTQ, one run per iteration
for i in $(seq 56 62)
do
    /opt/NfsDir/BioDir/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --gzip --split-3 -O /opt/NfsDir/UserDir/qin/qin/Data/RNAseq/ -A SRR35899${i}.sra
done
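
Before launching a long batch job it can help to print the commands first and check them by eye. The following is a minimal dry-run sketch of the same loop. FASTQ_DUMP and OUTDIR are placeholder names introduced here for illustration; they stand in for the installation-specific paths used above.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the fastq-dump command for each run instead of executing it.
# FASTQ_DUMP and OUTDIR are placeholder values (assumptions), not the paths from the post.
FASTQ_DUMP="fastq-dump"
OUTDIR="./fastq"

build_cmds() {
    local i
    for i in $(seq 56 62); do
        # Same options as the loop above: gzip output, split paired reads.
        echo "$FASTQ_DUMP --gzip --split-3 -O $OUTDIR -A SRR35899${i}.sra"
    done
}

build_cmds
```

Once the seven printed lines look right, the output can be piped to bash (build_cmds | bash) or the echo can be dropped so that each command runs directly.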

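The gzip-compressed files can be read directly from the shell without unpacking them first, using tools such as zgrep, zcat, zless, and zdiff. For example: zcat SRR3589956_1.fastq.gz | head -n 4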
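
The same idea extends to quick sanity checks, such as counting reads without decompressing to disk. The sketch below builds a tiny made-up FASTQ file so that it is self-contained; count_reads is a helper name invented for this example. It uses gzip -dc, which behaves like zcat on Linux and also works on systems where zcat expects a .Z suffix.

```shell
#!/usr/bin/env bash
# Build a tiny gzip-compressed FASTQ file (2 made-up reads) so the sketch is self-contained.
tmpdir=$(mktemp -d)
demo="$tmpdir/demo.fastq.gz"
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nJJJJ\n' | gzip -c > "$demo"

# count_reads FILE: each read takes 4 lines, so reads = lines / 4.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

count_reads "$demo"             # read count of the demo file
gzip -dc "$demo" | head -n 4    # first record, equivalent to zcat FILE | head -n 4
```

On real data, count_reads SRR3589956_1.fastq.gz and count_reads SRR3589956_2.fastq.gz should report the same number for a properly paired run.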

2. Checking sequencing quality in batch with fastqc

Reference: http://www.biotrainee.com/thread-324-1-1.html

Format: each read in a FASTQ file normally takes 4 lines, as follows:

@DJB775P1:248:D0MDGACXX:7:1202:12362:49613 1:Y:18:ATCACG # Line 1: header line starting with @. The fields are: instrument name/run id/flowcell id/flowcell lane/tile number within the flowcell lane/'x'-coordinate of the cluster within the tile/'y'-coordinate of the cluster within the tile/the member of a pair, 1 or 2/Y if the read is filtered, N otherwise/0 when none of the control bits are on, otherwise it is an even number/index sequence
TGCTTACTCTGCGTTGATACCACTGCTTAGATCGGAAGAGCACACGTCTGAA # Line 2: the sequence
+ # Line 3: separator line starting with +
JJJJJIIJJJJJJHIHHHGHFFFFFFCEEEEEDBD?DDDDDDBDDDABDDCA # Line 4: base qualities, Phred+33 encoded
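
To make the Phred+33 encoding concrete: each quality character's ASCII code minus 33 is the Phred score Q, and Q corresponds to an error probability of 10^(-Q/10). The sketch below decodes a few characters taken from the quality line above; phred33 is a helper name made up for this example.

```shell
#!/usr/bin/env bash
# phred33 QUAL: print the Phred score of each character in a Phred+33 quality string.
phred33() {
    local qual=$1 i ascii out=""
    for (( i = 0; i < ${#qual}; i++ )); do
        # printf with a leading single quote yields the character's ASCII code.
        printf -v ascii '%d' "'${qual:i:1}"
        out+="$(( ascii - 33 )) "
    done
    echo "${out% }"
}

phred33 'JIH?'   # prints: 41 40 39 30
```

So J (ASCII 74) means Q41, while ? (ASCII 63) means Q30, an error probability of 1 in 1000 for that base.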

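fastqc usage:

fastqc SRR3589956_1.fastq.gz
fastqc seqfile1 seqfile2 .. seqfileN
Common options:
-o: output directory
--extract: whether to unzip the output file automatically; the default is --noextract
-t: number of threads; depends on the machine, and each thread needs 250 MB of memory
-c: contaminants; the sequencing run may contain contamination, for example reads from another species
-a: adapters
-q: quiet mode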
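
As an illustration of how these options combine, here is a small sketch that assembles a fastqc call and prints it rather than running it. QC_DIR, THREADS, and fastqc_cmd are names introduced for this example, and the second input file name assumes the run is paired-end.

```shell
#!/usr/bin/env bash
# Sketch: assemble a fastqc call from the options listed above and print it.
# QC_DIR and THREADS are illustrative values (assumptions), not from the post.
QC_DIR="./qc_out"
THREADS=4

fastqc_cmd() {
    # -o output directory, -t threads, -q quiet; remaining arguments are input files.
    echo "fastqc -o $QC_DIR -t $THREADS -q $*"
}

fastqc_cmd SRR3589956_1.fastq.gz SRR3589956_2.fastq.gz
```

When running the printed command for real, create the output directory first (mkdir -p ./qc_out), since fastqc does not create it.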

Each input produces two output files: an HTML report and a .zip archive.

Looking at the QC result for SRR3589956: why is a chunk missing in the middle?

[Figure: FastQC result for SRR3589956]

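Viewing QC results in batch with MultiQC
# First generate the QC results for every gzip-compressed file
ls *gz | while read id; do /opt/NfsDir/BioDir/fastqc/FastQC/fastqc -t 4 "$id"; done
# Then summarize them with multiqc
multiqc *fastqc.zip --pdf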
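
Before summarizing, it can be worth confirming that every input actually produced a FastQC archive. The sketch below lists inputs with no matching *_fastqc.zip; missing_qc is a helper name made up for this example, and the demo uses empty placeholder files in a temporary directory rather than real data.

```shell
#!/usr/bin/env bash
# missing_qc DIR: list *.fastq.gz files in DIR that have no matching *_fastqc.zip.
missing_qc() {
    local dir=$1 fq base
    for fq in "$dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue                 # glob matched nothing
        base=$(basename "$fq" .fastq.gz)
        [ -e "$dir/${base}_fastqc.zip" ] || echo "$fq"
    done
}

# Demo with empty placeholder files: a_1 has a QC archive, a_2 does not.
demo_dir=$(mktemp -d)
touch "$demo_dir/a_1.fastq.gz" "$demo_dir/a_2.fastq.gz" "$demo_dir/a_1_fastqc.zip"
missing_qc "$demo_dir"      # lists only a_2.fastq.gz
```

An empty result from missing_qc . means every FASTQ file in the current directory has a QC archive for multiqc to pick up.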