测序数据基本信息统计 | reads,coverage,depth

00 写在前面

测序后公司交付数据时，一般会提供质控后的clean data和后续的基础分析结果。因为可能需要自己来进行数据的预处理，记得一定要拿回raw data，同时弄清楚公司处理数据时每步用到的软件版本及具体参数。

拿到数据后，先对数据进行质量评估，除了跑下Fastqc看下测序质量外，还需要统计测序reads数目、mapping ratio、coverage、depth。现在一般human全外显子组要求测序深度在100X，实际深度一般会在100X以上，human重测序一般测序深度30X。

关于测序深度和覆盖度，这里按下图定义。

depth and coverage

01 fq文件中到底有多少条reads

这里使用软件readfq来统计，成熟的小工具最大的好处是，因为功能单一，作者会着重优化计算速度，一般处理速度会比较快。具体有多快，参考作者介绍

It could deal with ~4M reads (1G base) in less than 40 seconds, ~50M reads (14G base) in less than 5 minutes!!

下载安装

#download https://github.com/billzt/readfq
gcc -lz -o kseq_fastq_base kseq_fastq_base.c
kseq_fastq_base [input.fq]

#FASTQ files could be gzipped.
#The statistic results would be printed on STDOUT.

计算

#01_read.sh
#! /bin/bash
cd FQDIR/  #fq所在位置
readfq="/../../biosoft/readfq-master/readfq-master/kseq_fastq_base"
ls *.fq >fq.list
for i in $(cat fq.list)
do 
  printf $i "\t" >>fq.reads_num
  .$readfq $i >>fq.reads_num
done

结果文件如下，最后整理成表格即可。

$ more fq.reads_num
sample1_1.fq Num reads:3603393    Num Bases: 540508950
sample1_2.clean.fq Num reads:3603393    Num Bases: 540508950
sample2_1.clean.fq Num reads:13328100   Num Bases: 1999215000
sample2_2.clean.fq Num reads:13328100   Num Bases: 1999215000

02 有多少reads比对到参考基因组上了？

这里从比对后得到的BAM文件开始，利用软件samtools中的flagstat 或者idxstats统计，参考samtools命令详解http://www.cnblogs.com/emanlee/p/4316581.html

#02_mappedreads.sh
#! /bin/bash
cd bamDIR/  #bam文件所在位置
ls *.bam >bam.list
for i in $(cat bam.list)
do 
  printf $i >>bam.mappedreads
  samtools flagstat $i | sed -n '5p'  >>bam.mappedreads
done

#02_reads.sh
#! /bin/bash
cd bamDIR/  #bam文件所在位置
ls *.bam >bam.list
for sample in $(cat bam.list)
do
  export total_reads=$(samtools idxstats $bam_dir/$sample.final.bam |awk -F '\t' '{s+=$3}END{print s}')
  echo $sample number_of_reads $total_reads
done

#命令解析
$ samtools flagstat sample1.bam
3826122 + 0 in total (QC-passed reads + QC-failed reads) #总共的reads数
0 + 0 secondary
1658 + 0 supplementary
343028 + 0 duplicates
3824649 + 0 mapped (99.96% : N/A) #总体reads的匹配率
3824464 + 0 paired in sequencing #总共的reads数
1912442 + 0 read1 #reads1中的reads数
1912022 + 0 read2 #reads2中的reads数
3808606 + 0 properly paired (99.59% : N/A) #完美匹配的reads数：比对到同一条参考序列，并且两条reads之间的距离符合设置的阈值
3821518 + 0 with itself and mate mapped #paired reads中两条都比对到参考序列上的reads数
1473 + 0 singletons (0.04% : N/A)  #单独一条匹配到参考序列上的reads数，和上一个相加，则是总的匹配上的reads数。
5882 + 0 with mate mapped to a different chr#paired reads中两条分别比对到两条不同的参考序列的reads数
4273 + 0 with mate mapped to a different chr (mapQ>=5) #paired reads中两条分别比对到两条不同的参考序列的reads数

03 计算coverage和depth

这里从比对后得到的BAM文件开始，利用软件统计每个碱基被测序到的次数，再写脚本统计coverage和depth.这里介绍3种方法

03-1 samtools的`mpileup` + custome perl ：~4h per WES sample

利用samtools的mpileup计算每个碱基的测序深度,由于遍历每个reads,所以非常慢,100X的全外样本耗时~4h。

# 01_mpileup.sh
#! /bin/bash
cd bamDIR/  #bam文件所在位置
ls *.bam >bam.list
for sample in $(cat bam.list)
do
    samtools mpileup -f reference.fa $bam_dir/$sample >mpileup/$sample.mpileup
    echo $sample is ready!
done

Perl脚本计算exon、50bp、100bp、150bp侧翼序列的覆盖度及depth；其中CCDS.20160908.exon.hg19.bed为targets.bed，捕获文件，重点是必须要有前3列，TAB分隔。（脚本来自生信技能树JIMMY的分享）

# 02_exon_dep.pl
#perl 02_exon_dep.sh CCDS.20160908.exon.hg19.bed  *.mplieup output
open FH,"@ARGV[0]";
#chr1    10085036        10097869        Cops5   0       -       10085036        10097869        255,0,0
while(){
  chomp;
  @F=split;
  $start=$F[1];$end=$F[2];$chr=$F[0]; $chr=~s/chr//;
  foreach ($start..$end){$hash{"$chr:$_"}=1;}
  foreach ($start-150..$start-101){$exon150{"$chr:$_"}=1;}
  foreach ($start-100..$start-51){$exon100{"$chr:$_"}=1;}
  foreach ($start-50..$start-1){$exon50{"$chr:$_"}=1;}
  foreach ($end+1..$end+50){$exon50{"$chr:$_"}=1;}
  foreach ($end+51..$end+100){$exon100{"$chr:$_"}=1;}
  foreach ($end+101..$end+150){$exon150{"$chr:$_"}=1;}
}
close FH;

$tmp=0;
$tmp++ foreach keys %hash;
$exon_length=$tmp;
$tmp=0;
$tmp++ foreach keys %exon50;
$exon50_length=$tmp;
$tmp=0;
$tmp++ foreach keys %exon100;
$exon100_length=$tmp;
$tmp=0;
$tmp++ foreach keys %exon150;
$exon150_length=$tmp;

open FH,$ARGV[1] or die "you need to give us a mpileup file!";
#chrM    1       G       4       ^7.^,.^I.^],    GGGC
open R,">>$ARGV[2]";
while(){
  @F=split;
  $chr=$F[0];$pos=$F[1];$depth=$F[3];$chr=~s/chr//;
  if (exists $hash{"$chr:$pos"}){
      $c_sum++ if $depth >0;$d_sum+=$depth ;
        }
        else{
                if (exists $exon50{"$chr:$pos"}){
                    $exon50_c_sum++ if $depth >0;$exon50_d_sum+=$depth ;
                }
                else{
                        if (exists $exon100{"$chr:$pos"}){
                           $exon100_c_sum++ if $depth >0;$exon100_d_sum+=$depth ;
                        }
                        else{
                                if (exists $exon150{"$chr:$pos"}){
                                   $exon150_c_sum++ if $depth >0;$exon150_d_sum+=$depth ;
                                }
                                else{
                                    $other_c_sum++ if $depth >0;$other_d_sum+=$depth ;
                                }
                        }
                }
        }
}
close FH;
if($exon_length>0){$coverage=$c_sum/$exon_length;};
if($c_sum>0){$avg_depth=$d_sum/$c_sum;}

if($exon50_length>0){$exon50_coverage=$exon50_c_sum/$exon50_length;}
if($exon50_c_sum>0){$exon50_avg_depth=$exon50_d_sum/$exon50_c_sum;}

if($exon100_length>0){$exon100_coverage=$exon100_c_sum/$exon100_length;}
if($exon100_c_sum>0){$exon100_avg_depth=$exon100_d_sum/$exon100_c_sum;}
if($exon150_length>0){$exon150_coverage=$exon150_c_sum/$exon150_length;}
if($exon150_c_sum>0){$exon150_avg_depth=$exon150_d_sum/$exon150_c_sum;}
#$other_coverage=$other_c_sum/(3000000000-90362533);
#$other_avg_depth=$other_d_sum/$other_c_sum;

print R $ARGV[0],"\t","exon","\t","$c_sum\t$d_sum\t$coverage\t$avg_depth\n";
print R $ARGV[0],"\t","50bp","\t","$exon50_c_sum\t$exon50_d_sum\t$exon50_coverage\t$exon50_avg_depth\n";
print R $ARGV[0],"\t","100bp","\t","$exon100_c_sum\t$exon100_d_sum\t$exon100_coverage\t$exon100_avg_depth\n";
print R $ARGV[0],"\t","150bp","\t","$exon150_c_sum\t$exon150_d_sum\t$exon150_coverage\t$exon150_avg_depth\n";
#print "$other_c_sum\t$other_d_sum\t$other_coverage\t$other_avg_depth\n";

03-2 bedtools的`genomecov` + custome perl: 比samtools快

#! /bin/bash
cd bamDIR/  #bam文件所在位置
ls *.bam >bam.list
for sample in $(cat bam.list)
do
  bedtools genomecov -ibam $bam_dir/$sample -d >$sample.dep_base
  echo depth_base_is_calculated_for $sample
done

统计脚本同03-2中02_exon_dep.pl。

03-3 bam2bedGraph + bedtools: 最快

bam2bedGraph 统计每个碱基的depth，并利用bedtools求overlapped region。

# 01_bam2bedGraph_bedtools.sh
#! /bin/bash
cd bamDIR/  #bam文件所在位置
ls *.bam >bam.list
for sample in $(cat bam.list)
do
  $bam2bed -o $sample.bam2bed $bam_dir/$sample  
  #chr start end depth #zero-based start and a one-based end
  
  cat $sample.bam2bed.bedGraph | awk '{print $1"\t"$2+1"\t"$3"\t"$4}' >$sample.bedGraph
  bedtools intersect -a $sample.bedGraph -b probes.bed >$sample.intersect
  #chr overlapped_start overlapped_end depth 
  echo $sample ok
done


#02_cov_depth.sh：实际测到的区域长度及覆盖度
#! /bin/bash
#cal the length and depth of overlapped regions
for sample in $(cat bam.list)
do
  perl -alne '{$len=$F[2]-$F[1]+1;$tmp+=$len}END{print $tmp}' $sample.intersect >>intersect.length
  perl -alne '{$len=$F[3]*($F2]-$F[1]+1);$tmp+=$len}END{print $tmp}' $sample.intersect >>intersect.depth
done
paste bam.list intersect.length intersect.depth >intersect.cov_depth
#intersect.cov_depth: sampleID intersect.length intersect.depth

#bed_length：理论应该测到的区域（捕获区域）长度
#probes.bed  
#chr start end name
perl -alne '{$len=$F[2]-$F[1]+1;$tmp+=$len}END{print $tmp}' probes.bed


#coverage=intersect.length/bed_length
#depth=intersect.depth/intersect.length

04 写在最后

基础的统计可以用来锻炼自己的编程能力，着急的话可以去GitHub上搜索，有许多相似的小工具，而且一般都很注重速度的优化，可以快速得到结果；而且关于原理部分不太理解的话，直接联系作者也能得到蛮及时的回复。（个人体验，仅供参考）