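00 Before you start
When a sequencing company delivers data, it usually provides QC-filtered clean data plus basic downstream analysis results. Because you may need to redo the preprocessing yourself, make sure you also get the raw data, and find out which software version and exact parameters the company used at each processing step.
Once you have the data, assess its quality first. Besides running FastQC to check sequencing quality, you should also tally the number of reads, the mapping ratio, coverage, and depth. Human whole-exome sequencing (WES) is currently expected to reach about 100X, and the delivered depth is usually above 100X; human whole-genome resequencing is typically done at about 30X.
Sequencing depth and coverage are defined here as shown in the figure below.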
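01 How many reads are in the fq file?
The counting here is done with readfq. The big advantage of a mature small tool is that, because it does one thing only, its author tends to focus on optimizing speed, so it usually runs fast. As for how fast, the author's description says: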
It could deal with ~4M reads (1G base) in less than 40 seconds, ~50M reads (14G base) in less than 5 minutes!!
Download and install
# download from https://github.com/billzt/readfq
gcc -o kseq_fastq_base kseq_fastq_base.c -lz # -lz goes after the source file so that zlib links correctly
kseq_fastq_base [input.fq]
# FASTQ files can be gzipped.
# The statistics are printed to STDOUT.
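Counting
# 01_read.sh
#! /bin/bash
cd FQDIR/ # directory containing the fq files
readfq="/../../biosoft/readfq-master/readfq-master/kseq_fastq_base"
ls *.fq >fq.list
for i in $(cat fq.list)
do
printf '%s\t' "$i" >>fq.reads_num # file name plus a tab, no newline
."$readfq" "$i" >>fq.reads_num
done
The result file looks like this; the last step is to tidy it into a table.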
$ more fq.reads_num
sample1_1.fq Num reads:3603393 Num Bases: 540508950
sample1_2.clean.fq Num reads:3603393 Num Bases: 540508950
sample2_1.clean.fq Num reads:13328100 Num Bases: 1999215000
sample2_2.clean.fq Num reads:13328100 Num Bases: 1999215000
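The tidying step can be sketched with awk. This is a minimal sketch, assuming every line of fq.reads_num has the shape shown above (`<file> Num reads:<N> Num Bases: <M>`); the function name `reads_table` is hypothetical and not part of readfq.

```shell
#!/bin/bash
# Hypothetical helper: turn readfq-style lines into "file<TAB>reads<TAB>bases".
# Assumes each input line looks like: <file> Num reads:<N> Num Bases: <M>
reads_table() {
    awk 'BEGIN { OFS = "\t"; print "file", "reads", "bases" }
         {
             file = $1
             line = $0
             sub(/^.*Num reads:[ \t]*/, "", line)   # drop everything up to the read count
             reads = line
             sub(/[^0-9].*$/, "", reads)            # keep only the leading digits
             sub(/^.*Num Bases:[ \t]*/, "", line)   # drop everything up to the base count
             bases = line
             sub(/[^0-9].*$/, "", bases)
             print file, reads, bases
         }' "$@"
}
```

Usage: `reads_table fq.reads_num > fq.reads_num.tsv`. With no file argument the function reads from standard input.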
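02 How many reads mapped to the reference genome?
Starting from the BAM files produced by alignment, use samtools flagstat or samtools idxstats to get the counts. For reference, see this detailed guide to the samtools commands: http://www.cnblogs.com/emanlee/p/4316581.html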
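# 02_mappedreads.sh
#! /bin/bash
cd bamDIR/ # directory containing the bam files
ls *.bam >bam.list
for i in $(cat bam.list)
do
printf '%s\t' "$i" >>bam.mappedreads
# keep the first "mapped (" line; its line number differs between samtools versions
samtools flagstat "$i" | grep -m1 ' mapped (' >>bam.mappedreads
done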
# 02_reads.sh
#! /bin/bash
cd bamDIR/ # directory containing the indexed bam files
ls *.bam >bam.list
for sample in $(cat bam.list)
do
# idxstats column 3 is the number of mapped reads per reference sequence
mapped_reads=$(samtools idxstats "$sample" | awk -F '\t' '{s+=$3} END {print s}')
echo "$sample number_of_mapped_reads $mapped_reads"
done
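# Reading the output
$ samtools flagstat sample1.bam
3826122 + 0 in total (QC-passed reads + QC-failed reads) # total alignment records: 3824464 paired reads + 1658 supplementary
0 + 0 secondary
1658 + 0 supplementary
343028 + 0 duplicates
3824649 + 0 mapped (99.96% : N/A) # mapped records and the overall mapping rate
3824464 + 0 paired in sequencing # reads that are paired in sequencing (read1 + read2)
1912442 + 0 read1 # number of read1
1912022 + 0 read2 # number of read2
3808606 + 0 properly paired (99.59% : N/A) # properly paired reads: both mates map to the same reference sequence and the distance between them is within the set threshold
3821518 + 0 with itself and mate mapped # paired reads for which both mates are mapped
1473 + 0 singletons (0.04% : N/A) # reads that are mapped while their mate is not; adding the previous line gives 3822991, and adding the 1658 supplementary alignments gives the 3824649 mapped above
5882 + 0 with mate mapped to a different chr # paired reads whose mate maps to a different reference sequence
4273 + 0 with mate mapped to a different chr (mapQ>=5) # same as the previous line, restricted to mapping quality >= 5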
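The mapping ratio can also be pulled out of the flagstat text directly. This is a minimal sketch, assuming the line layout shown above (an "in total" line and a "mapped (" line); the function name `mapping_ratio` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read `samtools flagstat` text on stdin and print
# mapped<TAB>total<TAB>percent.
mapping_ratio() {
    awk '/ in total /          { total = $1 }
         / mapped \(/ && !seen { mapped = $1; seen = 1 }   # first "mapped (" line only
         END {
             if (total > 0)
                 printf "%d\t%d\t%.2f\n", mapped, total, 100 * mapped / total
         }'
}
```

Usage: `samtools flagstat sample1.bam | mapping_ratio`. Taking only the first "mapped (" line is a deliberate choice so that any later line that also contains "mapped (" is ignored.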
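03 Calculating coverage and depth
Starting again from the aligned BAM files, use a tool to count how many times each base was sequenced, then write a script to compute coverage and depth. Three methods are described here.
03-1 samtools mpileup + custom Perl: ~4 h per WES sample
samtools mpileup computes the sequencing depth of every base. Because it walks through every read, it is very slow: a 100X WES sample takes ~4 h.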
# 01_mpileup.sh
#! /bin/bash
cd bamDIR/ # directory containing the bam files
ls *.bam >bam.list
mkdir -p mpileup # output directory
for sample in $(cat bam.list)
do
samtools mpileup -f reference.fa "$sample" >"mpileup/$sample.mpileup"
echo "$sample is ready!"
done
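A Perl script then computes coverage and depth for the exons and for their 50 bp, 100 bp, and 150 bp flanking regions. CCDS.20160908.exon.hg19.bed is the targets.bed (capture) file; what matters is that it has at least the first three columns, TAB-separated. (The script was shared by JIMMY of 生信技能树.)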
# 02_exon_dep.pl
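# usage: perl 02_exon_dep.pl CCDS.20160908.exon.hg19.bed sample.mpileup output
open FH,$ARGV[0] or die "you need to give us a bed file!";
#chr1 10085036 10097869 Cops5 0 - 10085036 10097869 255,0,0
# note: $start is used as the first exon position, although BED starts are 0-based
while(<FH>){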
chomp;
@F=split;
$start=$F[1];$end=$F[2];$chr=$F[0]; $chr=~s/chr//;
foreach ($start..$end){$hash{"$chr:$_"}=1;}
foreach ($start-150..$start-101){$exon150{"$chr:$_"}=1;}
foreach ($start-100..$start-51){$exon100{"$chr:$_"}=1;}
foreach ($start-50..$start-1){$exon50{"$chr:$_"}=1;}
foreach ($end+1..$end+50){$exon50{"$chr:$_"}=1;}
foreach ($end+51..$end+100){$exon100{"$chr:$_"}=1;}
foreach ($end+101..$end+150){$exon150{"$chr:$_"}=1;}
}
close FH;
$tmp=0;
$tmp++ foreach keys %hash;
$exon_length=$tmp;
$tmp=0;
$tmp++ foreach keys %exon50;
$exon50_length=$tmp;
$tmp=0;
$tmp++ foreach keys %exon100;
$exon100_length=$tmp;
$tmp=0;
$tmp++ foreach keys %exon150;
$exon150_length=$tmp;
open FH,$ARGV[1] or die "you need to give us a mpileup file!";
#chrM 1 G 4 ^7.^,.^I.^], GGGC
open R,">>$ARGV[2]";
while(<FH>){
@F=split;
$chr=$F[0];$pos=$F[1];$depth=$F[3];$chr=~s/chr//;
if (exists $hash{"$chr:$pos"}){
$c_sum++ if $depth >0;$d_sum+=$depth ;
}
else{
if (exists $exon50{"$chr:$pos"}){
$exon50_c_sum++ if $depth >0;$exon50_d_sum+=$depth ;
}
else{
if (exists $exon100{"$chr:$pos"}){
$exon100_c_sum++ if $depth >0;$exon100_d_sum+=$depth ;
}
else{
if (exists $exon150{"$chr:$pos"}){
$exon150_c_sum++ if $depth >0;$exon150_d_sum+=$depth ;
}
else{
$other_c_sum++ if $depth >0;$other_d_sum+=$depth ;
}
}
}
}
}
close FH;
if($exon_length>0){$coverage=$c_sum/$exon_length;};
if($c_sum>0){$avg_depth=$d_sum/$c_sum;}
if($exon50_length>0){$exon50_coverage=$exon50_c_sum/$exon50_length;}
if($exon50_c_sum>0){$exon50_avg_depth=$exon50_d_sum/$exon50_c_sum;}
if($exon100_length>0){$exon100_coverage=$exon100_c_sum/$exon100_length;}
if($exon100_c_sum>0){$exon100_avg_depth=$exon100_d_sum/$exon100_c_sum;}
if($exon150_length>0){$exon150_coverage=$exon150_c_sum/$exon150_length;}
if($exon150_c_sum>0){$exon150_avg_depth=$exon150_d_sum/$exon150_c_sum;}
#$other_coverage=$other_c_sum/(3000000000-90362533);
#$other_avg_depth=$other_d_sum/$other_c_sum;
print R $ARGV[0],"\t","exon","\t","$c_sum\t$d_sum\t$coverage\t$avg_depth\n";
print R $ARGV[0],"\t","50bp","\t","$exon50_c_sum\t$exon50_d_sum\t$exon50_coverage\t$exon50_avg_depth\n";
print R $ARGV[0],"\t","100bp","\t","$exon100_c_sum\t$exon100_d_sum\t$exon100_coverage\t$exon100_avg_depth\n";
print R $ARGV[0],"\t","150bp","\t","$exon150_c_sum\t$exon150_d_sum\t$exon150_coverage\t$exon150_avg_depth\n";
#print "$other_c_sum\t$other_d_sum\t$other_coverage\t$other_avg_depth\n";
03-2 bedtools genomecov + custom Perl: faster than samtools
#! /bin/bash
cd bamDIR/ # directory containing the bam files
ls *.bam >bam.list
for sample in $(cat bam.list)
do
bedtools genomecov -ibam "$sample" -d >"$sample.dep_base" # -d: depth at every position
echo "depth_base_is_calculated_for $sample"
done
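The summary script is the same as 02_exon_dep.pl in 03-1, with one change: genomecov -d writes three columns (chromosome, position, depth), so the depth must be read from $F[2] instead of $F[3].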
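For a quick check without the Perl script, the per-base table can be summarized over all positions in the file with awk. This is a minimal sketch, assuming a three-column input (chromosome, 1-based position, depth) and following the Perl script's convention that mean depth is averaged over covered bases only. The function name `depth_summary` is hypothetical, and the numbers are not restricted to the capture targets.

```shell
#!/bin/bash
# Hypothetical helper: summarise a three-column per-base depth table
# (chr, position, depth), e.g. the *.dep_base files written above.
# Prints: positions<TAB>covered_positions<TAB>coverage<TAB>mean_depth_of_covered
depth_summary() {
    awk '{ n++; if ($3 > 0) { covered++; sum += $3 } }
         END {
             if (n > 0 && covered > 0)
                 printf "%d\t%d\t%.4f\t%.2f\n", n, covered, covered / n, sum / covered
         }' "$@"
}
```

Usage: `depth_summary sample1.bam.dep_base`, where the file name is only an example.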
03-3 bam2bedGraph + bedtools: fastest
bam2bedGraph computes the depth of every base (written as bedGraph intervals), and bedtools then finds the regions that overlap the capture targets.
# 01_bam2bedGraph_bedtools.sh
#! /bin/bash
cd bamDIR/ # directory containing the bam files
# $bam2bed must hold the path to the bam2bedGraph executable
ls *.bam >bam.list
for sample in $(cat bam.list)
do
"$bam2bed" -o "$sample.bam2bed" "$sample"
# chr start end depth # zero-based start and one-based end
awk 'BEGIN{OFS="\t"} {print $1, $2+1, $3, $4}' "$sample.bam2bed.bedGraph" >"$sample.bedGraph"
bedtools intersect -a "$sample.bedGraph" -b probes.bed >"$sample.intersect"
# chr overlapped_start overlapped_end depth
echo "$sample ok"
done
# 02_cov_depth.sh: length and depth of the regions actually covered
#! /bin/bash
# calculate the length and summed depth of the overlapped regions
for sample in $(cat bam.list)
do
perl -alne '{$len=$F[2]-$F[1]+1;$tmp+=$len}END{print $tmp}' "$sample.intersect" >>intersect.length
perl -alne '{$len=$F[3]*($F[2]-$F[1]+1);$tmp+=$len}END{print $tmp}' "$sample.intersect" >>intersect.depth
done
paste bam.list intersect.length intersect.depth >intersect.cov_depth
# intersect.cov_depth: sampleID intersect.length intersect.depth
# bed_length: length of the regions that should have been sequenced (the capture regions)
# probes.bed
# chr start end name
perl -alne '{$len=$F[2]-$F[1]+1;$tmp+=$len}END{print $tmp}' probes.bed
# coverage=intersect.length/bed_length
# depth=intersect.depth/intersect.length
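The two formulas in the comments above can be applied to every sample in one pass. This is a minimal sketch, assuming intersect.cov_depth has the three columns listed above and that bed_length is passed in as the first argument; the function name `cov_depth_table` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "sample<TAB>covered_length<TAB>summed_depth" lines and
# print coverage (covered_length / bed_length) and depth (summed_depth / covered_length).
# $1 is the capture-region length; any remaining arguments are input files.
cov_depth_table() {
    local bed_length=$1
    shift
    awk -v bed="$bed_length" '
         BEGIN { OFS = "\t"; print "sample", "coverage", "depth" }
         bed > 0 && $2 > 0 { printf "%s\t%.4f\t%.2f\n", $1, $2 / bed, $3 / $2 }' "$@"
}
```

Usage: store the number printed by the probes.bed one-liner in a variable, then run `cov_depth_table "$bed_length" intersect.cov_depth`. Samples with a covered length of zero are skipped rather than reported with a division error.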
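04 Closing remarks
Basic statistics like these are good practice for your programming skills. If you are in a hurry, search GitHub: there are many similar small tools, most of which pay close attention to speed, so you get results quickly. And if you do not fully understand how one works, contacting the author directly usually gets a fairly prompt reply. (Personal experience, for reference only.)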