Getting to know the data
This is human whole-exome sequencing (WES) data. To increase depth, the sample was sequenced on two lanes.
drwxr-xr-x 4 huangsiyuan grp3 220 Oct 17 09:18 CL100072545_L01_44/
drwxr-xr-x 4 huangsiyuan grp3 220 Oct 11 14:56 CL100072545_L02_44/
huangsiyuan 21:21:01 ~/learn_wes/data_ren
$ lsx CL100072545_L01_44_1.fq.gz
@CL100072545L1C001R001_6/1
TTTTTCTGTGAATGTTTCTTTTCCCAGCTTCCCTGAAAGCAACCATGGCT
+
BF@DEF@GFB::9EFDGEFFAE?D;CC=FEFFEF6@9:1C:FCFEFC:AF
--------------------------------------------------
$ lsx CL100072545_L01_44_2.fq.gz
@CL100072545L1C001R001_6/2
CTCTGGGATGATTGGAATTGATCCTGTAGCTGTTTTCCGATGGGCAATTC
+
>F>FB9FCFF6DCF=EBFFFEFFDFFFEFFGFCGFFD?8FEFEDEFEFFF
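The reads in the fq1 and fq2 files correspond one to one: each pair holds the two ends of one paired-end fragment. The read length in this run is 50 bp.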
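Both observations can be spot-checked from the shell. The following is a minimal sketch, not part of the original workflow: the function names read_lengths and pairs_match are hypothetical, it assumes gzip, awk, sort, and cmp are available, and it assumes read names end in /1 and /2 as in the records shown above. It only inspects the first n records (default 10000), so it is a sanity check rather than a full validation.

```shell
# Print the distinct read lengths seen in the first n records of a gzipped FASTQ.
# $1 = FASTQ file, $2 = number of records to inspect (optional)
read_lengths() {
    rl_n=${2:-10000}
    gzip -dc "$1" | head -n $((rl_n * 4)) \
        | awk 'NR % 4 == 2 { print length($0) }' | sort -nu
}

# Succeed only if the read names of fq1 and fq2 agree, in order,
# once the trailing /1 and /2 are stripped.
# $1 = fq1, $2 = fq2, $3 = number of records to inspect (optional)
pairs_match() {
    pm_n=${3:-10000}
    pm_tmp=$(mktemp)
    gzip -dc "$1" | head -n $((pm_n * 4)) \
        | awk 'NR % 4 == 1 { sub(/\/1$/, ""); print }' > "$pm_tmp"
    gzip -dc "$2" | head -n $((pm_n * 4)) \
        | awk 'NR % 4 == 1 { sub(/\/2$/, ""); print }' | cmp -s - "$pm_tmp"
    pm_status=$?
    rm -f "$pm_tmp"
    return $pm_status
}
```

For example, read_lengths CL100072545_L01_44_1.fq.gz lists every read length found in the sampled records, and pairs_match CL100072545_L01_44_1.fq.gz CL100072545_L01_44_2.fq.gz && echo paired reports whether the sampled names line up.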
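Quality control
Until now I have always used Trimmomatic; this time I am switching to SOAPnuke. SOAPnuke is a FASTQ filtering tool developed in-house by BGI. Its main functions are adapter filtering, low-quality filtering, and filtering of reads with a high proportion of N. The basic filtering functions live in the filter module, which is suitable for most raw FASTQ data straight off the sequencer.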
$ git clone https://github.com/BGI-flexlab/SOAPnuke.git
$ cd SOAPnuke/
$ ls
ChangeLog COPYING Makefile Readme.md src/
$ make
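# This builds the 2.X version; the example below still uses version 1.5.6
# Run ./SOAPnuke filter -h to see the help text
For detailed usage, see the post "fastq数据质控过滤软件-soapnuke,fastp" (FASTQ quality-control and filtering tools: SOAPnuke and fastp).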
~/learn_wes/soft/SOAPnuke1.5.6 filter -n 0.1 --qualRate 0.5 --lowQual 12 -Q 2 -E 35 -G \
-1 ~/learn_wes/data_ren/CL100072545_L01_44_1.fq.gz \
-2 ~/learn_wes/data_ren/CL100072545_L01_44_2.fq.gz \
-f AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA \
-r AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG \
-M 2 -o ~/learn_wes/data_ren/ \
-C L01_44_1.fq.gz \
-D L01_44_2.fq.gz
# In version 1.5.6, these parameters have the following meanings
-n, --nRate : N rate threshold (default: [0.05])
-q, --qualRate : low quality rate (default: [0.5])
-l, --lowQual : low quality threshold (default: [5])
-Q, --qualSys : quality system 1:illumina, 2:sanger (default: [ 1 ])
-E, --cutAdaptor: cut sequence from adaptor index,unless performed -f/-r also in use
discard the read when the adaptor index of the read is less than INT
-G, --sanger : set clean data qualtiy system to sanger (default: illumina)
-f, --adapter1 : 3' adapter sequence of fq1 file
-r, --adapter2 : 5' adapter sequence of fq2 file [only for PE reads]
-M, --misMatch : the max mismatch number when match the adapter (default: [1])
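To make the -n, --qualRate, and --lowQual thresholds concrete, here is a rough sketch of the decision they describe, applied to a single read. This is not SOAPnuke's code: classify_read is a hypothetical helper, the quality string is assumed to be Phred+33 (the sanger system selected by -Q 2), and the exact boundary behaviour (whether a base at exactly the threshold counts as low quality, and whether the rates are compared with > or >=) is an assumption.

```shell
# Classify one read against an N-rate limit and a low-quality limit.
# $1 = sequence, $2 = Phred+33 quality string,
# $3 = max N rate, $4 = low-quality threshold, $5 = max low-quality rate
# Prints "n_rate", "low_qual", "pass", or "empty" for a zero-length read.
classify_read() {
    printf '%s\n%s\n' "$1" "$2" | awk -v nrate="$3" -v lowq="$4" -v qrate="$5" '
        # Build a character -> ASCII code table, since awk has no ord().
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        NR == 1 { seq = $0 }
        NR == 2 {
            len = length(seq); n = 0; low = 0
            if (len == 0) { print "empty"; next }
            for (i = 1; i <= len; i++) {
                if (substr(seq, i, 1) == "N") n++
                # Assumed rule: quality <= threshold counts as low quality.
                if (ord[substr($0, i, 1)] - 33 <= lowq) low++
            }
            if (n / len > nrate) print "n_rate"
            else if (low / len > qrate) print "low_qual"
            else print "pass"
        }'
}
```

With the settings from the command above, a call would look like classify_read "$seq" "$qual" 0.1 12 0.5. A read is rejected when more than 10% of its bases are N, or when more than half of its bases have quality 12 or lower.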
# Besides the clean data, the run also produces 8 quality report files, which is a thoughtful touch
Base_distributions_by_read_position_1.txt
Base_distributions_by_read_position_2.txt
Base_quality_value_distribution_by_read_position_1.txt
Base_quality_value_distribution_by_read_position_2.txt
Basic_Statistics_of_Sequencing_Quality.txt
Distribution_of_Q20_Q30_bases_by_read_position_1.txt
Distribution_of_Q20_Q30_bases_by_read_position_2.txt
Statistics_of_Filtered_Reads.txt
A look at the 8 quality report files
Base_distributions_by_read_position_1.txt
Base_distributions_by_read_position_2.txt
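These two files store, for fq1 and fq2 respectively, the proportion of each base (A, T, G, C, plus N) at every read position, before and after filtering, aggregated over all reads.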
Pos A C G T N Clean A Clean C Clean G Clean T Clean N
1 21.60% 37.75% 25.16% 15.48% 0.01% 21.60% 37.75% 25.16% 15.48% 0.01%
2 19.07% 12.22% 20.45% 48.25% 0.01% 19.07% 12.22% 20.45% 48.25% 0.00%
3 24.01% 22.27% 22.08% 31.63% 0.00% 24.02% 22.27% 22.08% 31.63% 0.00%
4 37.14% 19.32% 19.02% 24.51% 0.01% 37.14% 19.32% 19.02% 24.51% 0.00%
5 34.61% 18.92% 21.84% 24.63% 0.00% 34.61% 18.92% 21.84% 24.63% 0.00%
Base_quality_value_distribution_by_read_position_1.txt
Base_quality_value_distribution_by_read_position_2.txt
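For fq1 and fq2 respectively, the distribution of quality values at each read position, before and after filtering. Each cell is the number of bases with that quality value at that position.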
Pos Q0 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Q33 Q34 Q35 Q36 Q37 Q38 Q39 Q40 Q41 Mean Median Lower quartile Upper quartile 10thpercentile 90thpercentile
1 6506 0 0 0 5634 8034 10520 10780 14115 14977 25066 30361 30236 57415 62044 77249 133447 113520 147473 227156 354561 431683 418184 624258 607520 709456 811158 845320 798457 1107895 1135951 1248682 1316381 2005659 2227114 3966695 9120155 39936510 1676084 0 0 0 34.86 37.00 35.00 37.00 29.00 37.00
...
Clean Quality Value Distribute
Pos Q0 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Q33 Q34 Q35 Q36 Q37 Q38 Q39 Q40 Q41 Mean Median Lower quartile Upper quartile 10thpercentile 90thpercentile
1 4825 0 0 0 5532 8013 10489 10738 14075 14905 24958 30242 30078 57214 61856 77000 133019 113168 146928 226490 353320 430205 416898 622189 605782 707357 808742 843046 796149 1104558 1132645 1244984 1312412 1999461 2220267 3955090 9096624 39860838 1673117 0 0 0 34.86 37.00 35.00 37.00 29.00 37.00
...
Basic_Statistics_of_Sequencing_Quality.txt
Basic sequencing-quality statistics, again comparing fq1 and fq2 before and after filtering. The items are:
Item
Read length
Total number of reads
Number of filtered reads (%)
Total number of bases
Number of filtered bases (%)
Reads related to Adapter and Trimmed (%)
Number of base A (%)
Number of base C (%)
Number of base G (%)
Number of base T (%)
Number of base N (%)
Number of base calls with quality value of 20 or higher (Q20+) (%)
Number of base calls with quality value of 30 or higher (Q30+) (%)
Distribution_of_Q20_Q30_bases_by_read_position_1.txt
Distribution_of_Q20_Q30_bases_by_read_position_2.txt
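For fq1 and fq2 respectively, the percentage of bases at each read position with quality Q20 or higher and Q30 or higher, before and after filtering.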
Position in reads    Percentage of Q20+ bases    Percentage of Q30+ bases    Percentage of Clean Q20+    Percentage of Clean Q30+
1                    98.61%                      89.07%                      98.62%                      89.08%
2                    97.95%                      87.35%                      97.95%                      87.36%
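As a reminder of what these cutoffs mean: a Phred quality score Q corresponds to an estimated base-call error probability of 10^(-Q/10). The small helper below (phred_to_error is a hypothetical name, and awk is assumed to be available) does the conversion.

```shell
# Convert a Phred quality score to its estimated error probability.
# $1 = quality score Q; prints 10^(-Q/10)
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_error 20   # prints 0.01
phred_to_error 30   # prints 0.001
```

So a Q20+ base has at most a 1 in 100 chance of being wrong, and a Q30+ base at most 1 in 1000.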
Statistics_of_Filtered_Reads.txt
This file holds information about the reads that were filtered out, so you can see clearly why they were removed.
Item Total
Total filtered reads (%) 326084
Reads with adapter (%) 6016
Reads with low quality (%) 86372
Reads with low mean quality (%) 0
Reads with duplications (%) 0
Read with n rate exceed: (%) 233696
Read with small insert size: (%) 0
Reads with PolyA (%) 0 ......
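A quick consistency check on the numbers shown above: the three non-zero categories should add up to the reported total if each filtered read is counted under exactly one reason. The sketch below hard-codes the values from the table; the variable names are my own, and the remaining rows hidden behind the "......" are assumed to be zero.

```shell
# Values copied from Statistics_of_Filtered_Reads.txt as shown above.
total=326084
adapter=6016
low_quality=86372
n_rate=233696

# Sum of the per-reason counts.
sum=$((adapter + low_quality + n_rate))
echo "sum of reasons: $sum, reported total: $total"

if [ "$sum" -eq "$total" ]; then
    echo "consistent"
else
    echo "mismatch"
fi

# Share of filtered reads removed because of a high N rate, in percent.
awk -v n="$n_rate" -v t="$total" 'BEGIN { printf "N-rate share: %.1f%%\n", 100 * n / t }'
```

The counts do add up, and reads exceeding the N-rate limit account for roughly 72% of everything filtered in this lane, far more than adapters or low quality.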
Personally, I find Basic_Statistics_of_Sequencing_Quality.txt and Statistics_of_Filtered_Reads.txt the most important.