Whole-Exome Sequencing Data Analysis Notes (2): Data Quality Control

Getting familiar with the data

This is human whole-exome sequencing data. To increase depth, the library was sequenced on two lanes.

drwxr-xr-x 4 huangsiyuan grp3 220 Oct 17 09:18 CL100072545_L01_44/
drwxr-xr-x 4 huangsiyuan grp3 220 Oct 11 14:56 CL100072545_L02_44/
huangsiyuan 21:21:01 ~/learn_wes/data_ren

$ lsx CL100072545_L01_44_1.fq.gz
@CL100072545L1C001R001_6/1
TTTTTCTGTGAATGTTTCTTTTCCCAGCTTCCCTGAAAGCAACCATGGCT
+
BF@DEF@GFB::9EFDGEFFAE?D;CC=FEFFEF6@9:1C:FCFEFC:AF
--------------------------------------------------
$ lsx CL100072545_L01_44_2.fq.gz
@CL100072545L1C001R001_6/2
CTCTGGGATGATTGGAATTGATCCTGTAGCTGTTTTCCGATGGGCAATTC
+
>F>FB9FCFF6DCF=EBFFFEFFDFFFEFFGFCGFFD?8FEFEDEFEFFF

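The reads in the fq1 and fq2 files correspond one to one: each pair holds the two ends of the same fragment from paired-end sequencing. The read length in this run is 50 bp.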
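The pairing and read-length claims can be checked programmatically. Below is a minimal sketch, assuming the usual four-line FASTQ record layout and the `/1` and `/2` name suffixes seen above; the helper names `read_fastq` and `check_pairs` are made up for illustration, and the two records are the ones printed earlier.

```python
import io
from itertools import zip_longest

def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual

def check_pairs(fq1, fq2):
    """Return the set of read lengths seen; raise if the files are out of sync."""
    lengths = set()
    for r1, r2 in zip_longest(read_fastq(fq1), read_fastq(fq2)):
        if r1 is None or r2 is None:
            raise ValueError("the two files hold different numbers of reads")
        name1, seq1, _ = r1
        name2, seq2, _ = r2
        same_fragment = name1[:-2] == name2[:-2]
        if not (name1.endswith("/1") and name2.endswith("/2") and same_fragment):
            raise ValueError(f"unpaired reads: {name1} vs {name2}")
        lengths.update((len(seq1), len(seq2)))
    return lengths

# First record of each file, copied from the listing above.
fq1 = io.StringIO(
    "@CL100072545L1C001R001_6/1\n"
    "TTTTTCTGTGAATGTTTCTTTTCCCAGCTTCCCTGAAAGCAACCATGGCT\n"
    "+\n"
    "BF@DEF@GFB::9EFDGEFFAE?D;CC=FEFFEF6@9:1C:FCFEFC:AF\n"
)
fq2 = io.StringIO(
    "@CL100072545L1C001R001_6/2\n"
    "CTCTGGGATGATTGGAATTGATCCTGTAGCTGTTTTCCGATGGGCAATTC\n"
    "+\n"
    ">F>FB9FCFF6DCF=EBFFFEFFDFFFEFFGFCGFFD?8FEFEDEFEFFF\n"
)
read_lengths = check_pairs(fq1, fq2)
print(read_lengths)
```

For the real files, the same functions should work on handles opened with `gzip.open(path, "rt")` instead of `io.StringIO`.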

Data quality control

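Until now I have always used Trimmomatic; this time I am switching to SOAPnuke. SOAPnuke is a FASTQ filtering tool developed in-house by BGI. Its main functions are adapter filtering, low-quality filtering, and filtering of reads with a high proportion of N bases. The basic filtering functions live in the filter module, which is suitable for most raw sequencer output in FASTQ format.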

$ git clone https://github.com/BGI-flexlab/SOAPnuke.git
$ cd SOAPnuke/
$ ls
ChangeLog  COPYING  Makefile  Readme.md  src/
$ make
# This builds version 2.X; the example below still uses version 1.5.6
# Run ./SOAPnuke filter -h to see the help text

For detailed usage, see this post: "FASTQ quality control and filtering software: SOAPnuke, fastp"

~/learn_wes/soft/SOAPnuke1.5.6 filter -n 0.1 --qualRate 0.5 --lowQual 12 -Q 2 -E 35 -G \
-1 ~/learn_wes/data_ren/CL100072545_L01_44_1.fq.gz \
-2 ~/learn_wes/data_ren/CL100072545_L01_44_2.fq.gz \
-f AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA \
-r AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG \
-M 2 -o ~/learn_wes/data_ren/ \
-C L01_44_1.fq.gz \
-D L01_44_2.fq.gz
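Since the sample was sequenced on two lanes, the same command has to be run once per lane. The sketch below only builds the argument lists and does not execute anything. It reuses the flags from the command above unchanged. Two things are assumptions rather than facts from this run: the L02 input files are assumed to follow the same naming pattern as L01, and each lane gets its own output directory (the hypothetical `clean/L01` and `clean/L02`), because the report files have fixed names and would presumably overwrite each other in a shared directory.

```python
import os
import shlex

ADAPTER_FQ1 = "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"
ADAPTER_FQ2 = "AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"

def build_filter_cmd(lane,
                     soapnuke="~/learn_wes/soft/SOAPnuke1.5.6",
                     data_dir="~/learn_wes/data_ren",
                     out_root="~/learn_wes/data_ren/clean"):  # out_root is hypothetical
    """Build the SOAPnuke 1.5.6 filter command for one lane as an argv list."""
    data_dir = os.path.expanduser(data_dir)
    prefix = f"CL100072545_{lane}_44"  # assumed to hold for every lane
    return [
        os.path.expanduser(soapnuke), "filter",
        "-n", "0.1", "--qualRate", "0.5", "--lowQual", "12",
        "-Q", "2", "-E", "35", "-G",
        "-1", os.path.join(data_dir, f"{prefix}_1.fq.gz"),
        "-2", os.path.join(data_dir, f"{prefix}_2.fq.gz"),
        "-f", ADAPTER_FQ1,
        "-r", ADAPTER_FQ2,
        "-M", "2",
        "-o", os.path.join(os.path.expanduser(out_root), lane),
        "-C", f"{lane}_44_1.fq.gz",
        "-D", f"{lane}_44_2.fq.gz",
    ]

cmds = {lane: build_filter_cmd(lane) for lane in ("L01", "L02")}
for cmd in cmds.values():
    print(shlex.join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the paths have been confirmed.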

# In version 1.5.6, these options mean the following
-n, --nRate     :  N rate threshold (default: [0.05])
-q, --qualRate  :  low quality rate (default: [0.5])
-l, --lowQual   :  low quality threshold (default: [5])
-Q, --qualSys   :  quality system 1:illumina, 2:sanger (default: [ 1 ])
-E, --cutAdaptor:  cut sequence from adaptor index,unless performed -f/-r also in use
                          discard the read when the adaptor index of the read is less than INT
-G, --sanger    :  set clean data qualtiy system to sanger (default: illumina)
-f, --adapter1  :  3' adapter sequence of fq1 file
-r, --adapter2  :  5' adapter sequence of fq2 file [only for PE reads]
-M, --misMatch  :  the max mismatch number when match the adapter (default: [1])
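To make the thresholds in the command concrete, here is a rough sketch of what `-n 0.1`, `--lowQual 12` and `--qualRate 0.5` amount to for a single read. It is an approximation, not SOAPnuke's actual code: it assumes that `-Q 2` (sanger) means Phred+33 encoding, that a base counts as low quality when its score is at or below the threshold, and that a read is dropped when a rate strictly exceeds its limit. Adapter handling (`-f`, `-r`, `-E`, `-M`) is not modeled, and `filter_reason` is a made-up name.

```python
def filter_reason(seq, qual, n_rate=0.1, low_qual=12, qual_rate=0.5, offset=33):
    """Return why a read would be dropped, or None if it passes both checks."""
    # Rule 1: too many N bases (compare with -n / --nRate).
    n_fraction = seq.upper().count("N") / len(seq)
    if n_fraction > n_rate:
        return "n_rate"
    # Rule 2: too many low-quality bases (compare with --lowQual and --qualRate).
    # Boundary handling (<= and >) is an assumption.
    low_bases = sum(1 for ch in qual if ord(ch) - offset <= low_qual)
    if low_bases / len(qual) > qual_rate:
        return "low_quality"
    return None

# The first fq1 read shown earlier: no N bases, no base at or below Q12.
read1_seq = "TTTTTCTGTGAATGTTTCTTTTCCCAGCTTCCCTGAAAGCAACCATGGCT"
read1_qual = "BF@DEF@GFB::9EFDGEFFAE?D;CC=FEFFEF6@9:1C:FCFEFC:AF"
real_read = filter_reason(read1_seq, read1_qual)

# Two synthetic reads, invented only to trigger each rule.
many_n = filter_reason("N" * 10 + "A" * 40, "F" * 50)        # 20% N
poor_quality = filter_reason("A" * 50, "#" * 30 + "F" * 20)  # 60% of bases at Q2

print(real_read, many_n, poor_quality)
```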

# Besides the clean data, the run also produces 8 quality report files, which is quite thoughtful
Base_distributions_by_read_position_1.txt
Base_distributions_by_read_position_2.txt
Base_quality_value_distribution_by_read_position_1.txt
Base_quality_value_distribution_by_read_position_2.txt
Basic_Statistics_of_Sequencing_Quality.txt
Distribution_of_Q20_Q30_bases_by_read_position_1.txt
Distribution_of_Q20_Q30_bases_by_read_position_2.txt
Statistics_of_Filtered_Reads.txt

A look at the 8 quality report files

Base_distributions_by_read_position_1.txt
Base_distributions_by_read_position_2.txt
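These two files store, for fq1 and fq2 respectively, the percentage of each base (A, C, G, T, plus N) at every read position before and after filtering, aggregated over all reads.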

Pos          A       C       G       T     N    Clean A Clean C Clean G Clean T Clean N
1       21.60%  37.75%  25.16%  15.48%   0.01%  21.60%  37.75%  25.16%  15.48%   0.01%
2       19.07%  12.22%  20.45%  48.25%   0.01%  19.07%  12.22%  20.45%  48.25%   0.00%
3       24.01%  22.27%  22.08%  31.63%   0.00%  24.02%  22.27%  22.08%  31.63%   0.00%
4       37.14%  19.32%  19.02%  24.51%   0.01%  37.14%  19.32%  19.02%  24.51%   0.00%
5       34.61%  18.92%  21.84%  24.63%   0.00%  34.61%  18.92%  21.84%  24.63%   0.00%
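A table like this is easier to read once it is reduced to one number per position, for example GC content. The following is a small sketch that parses the five rows shown above and computes GC% for the clean data; it assumes the column order of the header (raw A, C, G, T, N followed by clean A, C, G, T, N), and `parse_base_distribution` is a hypothetical helper name.

```python
REPORT_ROWS = """\
1       21.60%  37.75%  25.16%  15.48%   0.01%  21.60%  37.75%  25.16%  15.48%   0.01%
2       19.07%  12.22%  20.45%  48.25%   0.01%  19.07%  12.22%  20.45%  48.25%   0.00%
3       24.01%  22.27%  22.08%  31.63%   0.00%  24.02%  22.27%  22.08%  31.63%   0.00%
4       37.14%  19.32%  19.02%  24.51%   0.01%  37.14%  19.32%  19.02%  24.51%   0.00%
5       34.61%  18.92%  21.84%  24.63%   0.00%  34.61%  18.92%  21.84%  24.63%   0.00%
"""

def parse_base_distribution(text):
    """Map position -> (raw, clean), each a dict of base -> percentage."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip the header and blank lines
        values = [float(f.rstrip("%")) for f in fields[1:11]]
        raw = dict(zip("ACGTN", values[:5]))
        clean = dict(zip("ACGTN", values[5:]))
        rows[int(fields[0])] = (raw, clean)
    return rows

rows = parse_base_distribution(REPORT_ROWS)
gc_clean = {pos: clean["C"] + clean["G"] for pos, (raw, clean) in rows.items()}
for pos, gc in gc_clean.items():
    print(f"position {pos}: GC = {gc:.2f}%")
```

The same function can be pointed at the full file contents to get the whole curve across all 50 positions.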

Base_quality_value_distribution_by_read_position_1.txt
Base_quality_value_distribution_by_read_position_2.txt
The distribution of base quality values at each read position in fq1 and fq2, before and after filtering.

Pos         Q0          Q1          Q2          Q3          Q4          Q5          Q6          Q7          Q8          Q9          Q10         Q11         Q12         Q13         Q14         Q15         Q16         Q17         Q18         Q19         Q20         Q21         Q22         Q23         Q24         Q25         Q26         Q27         Q28         Q29         Q30         Q31         Q32         Q33         Q34         Q35         Q36         Q37         Q38         Q39         Q40         Q41         Mean    Median  Lower quartile  Upper quartile  10thpercentile  90thpercentile
1           6506        0           0           0           5634        8034        10520       10780       14115       14977       25066       30361       30236       57415       62044       77249       133447      113520      147473      227156      354561      431683      418184      624258      607520      709456      811158      845320      798457      1107895     1135951     1248682     1316381     2005659     2227114     3966695     9120155     39936510    1676084     0           0           0           34.86       37.00       35.00       37.00       29.00       37.00    
...
Clean Quality Value Distribute
Pos         Q0          Q1          Q2          Q3          Q4          Q5          Q6          Q7          Q8          Q9          Q10         Q11         Q12         Q13         Q14         Q15         Q16         Q17         Q18         Q19         Q20         Q21         Q22         Q23         Q24         Q25         Q26         Q27         Q28         Q29         Q30         Q31         Q32         Q33         Q34         Q35         Q36         Q37         Q38         Q39         Q40         Q41         Mean    Median  Lower quartile  Upper quartile  10thpercentile  90thpercentile
1           4825        0           0           0           5532        8013        10489       10738       14075       14905       24958       30242       30078       57214       61856       77000       133019      113168      146928      226490      353320      430205      416898      622189      605782      707357      808742      843046      796149      1104558     1132645     1244984     1312412     1999461     2220267     3955090     9096624     39860838    1673117     0           0           0           34.86       37.00       35.00       37.00       29.00       37.00    
...
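Each row of this table is a histogram of quality scores at one read position, so summary figures can be recomputed from it. As a sanity check, the sketch below takes the raw (pre-filter) counts for position 1 and derives the share of bases at Q20 or above and at Q30 or above. The results should agree with the 98.61% and 89.07% that Distribution_of_Q20_Q30_bases_by_read_position_1.txt reports for position 1. The function name `percent_at_or_above` is made up for illustration.

```python
# Position 1 of the raw table above: Pos, counts for Q0..Q41, then six summary columns.
ROW = (
    "1 6506 0 0 0 5634 8034 10520 10780 14115 14977 25066 30361 30236 57415 "
    "62044 77249 133447 113520 147473 227156 354561 431683 418184 624258 "
    "607520 709456 811158 845320 798457 1107895 1135951 1248682 1316381 "
    "2005659 2227114 3966695 9120155 39936510 1676084 0 0 0 "
    "34.86 37.00 35.00 37.00 29.00 37.00"
)

fields = ROW.split()
counts = [int(x) for x in fields[1:43]]  # index in this list == quality score

def percent_at_or_above(counts, threshold):
    """Percentage of bases whose quality score is >= threshold."""
    return 100.0 * sum(counts[threshold:]) / sum(counts)

total_bases = sum(counts)
q20 = percent_at_or_above(counts, 20)
q30 = percent_at_or_above(counts, 30)
print(f"bases at position 1: {total_bases}")
print(f"Q20+: {q20:.2f}%  Q30+: {q30:.2f}%")
```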

Basic_Statistics_of_Sequencing_Quality.txt
Basic statistics of sequencing quality, again comparing fq1 and fq2 before and after filtering. It includes these items:

Item                                                              
Read length                                                       
Total number of reads                                             
Number of filtered reads (%)                                     
Total number of bases                                            
Number of filtered bases (%)                                     
Reads related to Adapter and Trimmed (%)                           
Number of base A (%)                                             
Number of base C (%)                                             
Number of base G (%)                                             
Number of base T (%)                                             
Number of base N (%)                                             
Number of base calls with quality value of 20 or higher (Q20+) (%)
Number of base calls with quality value of 30 or higher (Q30+) (%)

Distribution_of_Q20_Q30_bases_by_read_position_1.txt
Distribution_of_Q20_Q30_bases_by_read_position_2.txt
For fq1 and fq2 respectively, the percentage of bases at Q20 or above and at Q30 or above at each read position, before and after filtering.

Position in reads   Percentage of Q20+ bases    Percentage of Q30+ bases    Percentage of Clean Q20+    Percentage of Clean Q30+
1       98.61%  89.07%  98.62%  89.08%
2       97.95%  87.35%  97.95%  87.36%

Statistics_of_Filtered_Reads.txt
This file holds information on the reads that were filtered out, so you can see clearly why they were removed.

Item                                        Total  
Total filtered reads (%)                    326084
Reads with adapter (%)                      6016
Reads with low quality (%)                  86372
Reads with low mean quality (%)             0
Reads with duplications (%)                 0
Read with n rate exceed: (%)                233696
Read with small insert size: (%)            0
Reads with PolyA (%)                        0  ......
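A quick way to read this table is to confirm that the per-reason counts add up to the total and to see which reason dominates. The sketch below uses only the rows shown above (the table is truncated after PolyA, so any elided rows are not included). For this run the listed reasons already account for the full total.

```python
TOTAL_FILTERED = 326084

# Counts copied from the rows shown above.
FILTERED_BY_REASON = {
    "Reads with adapter": 6016,
    "Reads with low quality": 86372,
    "Reads with low mean quality": 0,
    "Reads with duplications": 0,
    "Read with n rate exceed": 233696,
    "Read with small insert size": 0,
    "Reads with PolyA": 0,
}

listed_sum = sum(FILTERED_BY_REASON.values())
top_reason = max(FILTERED_BY_REASON, key=FILTERED_BY_REASON.get)

print(f"sum of listed reasons: {listed_sum} (total: {TOTAL_FILTERED})")
for reason, n in FILTERED_BY_REASON.items():
    if n:
        print(f"{reason}: {n} ({100 * n / TOTAL_FILTERED:.1f}% of filtered reads)")
print(f"most common reason: {top_reason}")
```

Here the N-rate filter removes the most reads, followed by the low-quality filter, with adapter contamination a distant third.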

Personally, I think Basic_Statistics_of_Sequencing_Quality.txt and Statistics_of_Filtered_Reads.txt are the most important of the eight.
