一、体细胞突变 SNV/INDEL(Somatic SNVs + Indels)的介绍
GATK官网介绍:Somatic short variant discovery (SNVs + Indels)
单个样本/成对样本都可以进行这一步
官方给出了两套流程(都在一个GIT下面):
Somatic short variants tumor-normal pair
Somatic short variants PON creation
git clone https://github.com/gatk-workflows/gatk4-somatic-snvs-indels
共有四套wdl:
mutect2-normal-normal.wdl
mutect2.wdl (single tumor-normal pair or on a single tumor sample用于一对N/T样品或者单个tumour情况)
mutect2_nio.wdl
mutect2_pon.wdl(Create a Mutect2 panel of normals 用于创建多个normal的模型)
二、WDL中各个task的介绍
这次学习采用的是mutect2.wdl流程。总共包含了14个task,一一来学习一下:
GATK官网:(How to) Call somatic mutations using GATK4 Mutect2
总共包含了14个task:
1. CramToBam
忽略
2. SplitIntervals
功能:将Bed进行拆分,由于我采用的bed并不大,这一步忽略。
gatk --java-options "-Xmx5000m" SplitIntervals \
-R ${ref_fasta} \
${"-L " + intervals} \
-scatter 50 \
-O interval-files \
${split_intervals_extra_args}
-scatter这个参数表示拆分成多少个区域。
Mutect2这一步主要的参考是3、9、10几步,主要看一下。
GATK官网介绍: (How to) Call somatic mutations using GATK4 Mutect2
3. M2
这一步可以根据成对/单样品来进行分析,下面实例是成对,包含了以下3个步骤:
1)GetSampleName
GetSampleName is a BETA tool and is not yet ready for use in production
这一步忽略
2)Mutect2
gatk --java-options "-Xmx5000m" Mutect2 \
-R ${ref_fasta} \
-I ${tumor_bam} -tumor tumor_NAME \
-I ${normal_bam} -normal normal_NAME \
--germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz \
${"-pon " + pon} \
${"-L " + intervals} \
${"--alleles " + gga_vcf} \
-O "\${output_vcf}" \
${true='--bam-output bamout.bam' false='' make_bamout} \
${true='--f1r2-tar-gz f1r2.tar.gz' false='' run_ob_filter} \
${m2_extra_args}
--germline-resource:Population vcf of germline sequencing containing allele fractions
GATK官网问答介绍:MuTect2 beta --germline_resource for build b37,vcf下载地址
只有b37和hg38版本,没有hg19版本,hg19版本下载(但是来源并不保证),该链接来自于官网的问答
-pon是多个N组成的PON文件,只有一对样品的情况可以忽略
-L 区域文件,可以是一个Bed
--alleles:FeatureInput The set of alleles at which to genotype when --genotyping_mode is GENOTYPE_GIVEN_ALLELES 忽略
--bam-output,-bamout:String File to which assembled haplotypes should be written
输出bam
3)GetPileupSummaries
Warning: GetPileupSummaries is a BETA tool and is not yet ready for use in production
GATK官网有关GetPileupSummaries的介绍:GetPileupSummaries
功能:统计,得到一个六列的表格,计算的是给定的VCF中的位点的count数
These must be created, even if they remain empty, as cromwell doesn't support optional output
这一步的结果是必须存在的,或者可以直接touch两个文件
touch tumor-pileups.table
touch normal-pileups.table
tumor:
gatk --java-options "-Xmx${command_mem}m" GetPileupSummaries \
-R ${ref_fasta} \
-I ${tumor_bam} ${"--interval-set-rule INTERSECTION -L " + intervals} \
-V ${variants_for_contamination} -L ${variants_for_contamination} \
-O tumor-pileups.table
normal:
gatk --java-options "-Xmx${command_mem}m" GetPileupSummaries \
-R ${ref_fasta} \
-I ${normal_bam} ${"--interval-set-rule INTERSECTION -L " + intervals} \
-V ${variants_for_contamination} -L ${variants_for_contamination} \
-O normal-pileups.table
成对样品分别做这一步,如果没有N样品,忽略掉第二步;
-V必须参数 输入的是vcf格式 GATK官网下载
The tool requires a common germline variant sites VCF, e.g. derived from the gnomAD resource, with population allele frequencies (AF) in the INFO field. This resource must contain only biallelic SNPs and can be an eight-column sites-only VCF. The tool ignores the filter status of the variant calls in this germline resource.
small_exac_common_3_hg19.vcf 示例如下:
$ grep -v "#" small_exac_common_3_hg19.vcf | head
chr1 17365 . C G 826621 InbreedingCoeff_Filter AF=0.136;REMAP_ALIGN=FP
chr1 17385 . G A 592354 InbreedingCoeff_Filter AF=0.122;REMAP_ALIGN=FP
chr1 30548 . T G 1923.65 VQSRTrancheSNP99.60to99.80 AF=0.081;REMAP_ALIGN=FP
chr1 69761 . A T 2041400 PASS AF=0.113;REMAP_ALIGN=FP
chr1 139213 . A G 1634390 PASS AF=0.233;REMAP_ALIGN=FP
chr1 139233 . C A 1489200 PASS AF=0.231;REMAP_ALIGN=FP
chr1 567783 rs142895724 T C 120608 PASS AF=0.120;REMAP_ALIGN=FP
chr1 567825 . C T 96811.60 PASS AF=0.066;REMAP_ALIGN=FP
chr1 721757 rs189147642 T A 349942 PASS AF=0.051;REMAP_ALIGN=FP
chr1 739142 rs2340527 T A 53624.30 PASS AF=0.073;REMAP_ALIGN=FP
-L支持多个参数输入 Although the sites (-L) and variants (-V) resources will often be identical, this need not be the case. 如果-V和-L区域一样,可以省略-L。
GetPileupSummaries这一步得到:
$ head samplename_T.pileups.table
contig position ref_count alt_count other_alt_count allele_frequency
chr1 8064578 1140 1179 1 0.144
chr1 65311214 2290 0 0 0.061
chr1 65311262 2299 1 1 0.151
chr1 97981395 1101 987 1 0.192
chr1 98165091 1956 0 2 0.086
chr2 29416615 2375 0 0 0.091
chr2 29416794 1634 0 2 0.129
chr2 47693788 2203 1 0 0.088
chr2 47703500 898 859 2 0.114
4. MergeVCFs
合并vcf 忽略
5. MergeBamOuts
忽略
6. MergeStats
忽略
7. MergePileupSummaries
忽略
8. LearnReadOrientationModel
忽略
9. CalculateContamination
gatk --java-options "-Xmx${command_mem}m" \
CalculateContamination -I ${tumor_pileups} \
-O contamination.table --tumor-segmentation segments.table \
${"-matched " + normal_pileups}
--tumor-segmentation,-segments:File The output table containing segmentation of the tumor by minor allele fraction
-matched 如果没有N样品,不需要这个参数
The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls.(来源:https://software.broadinstitute.org/gatk/documentation/article?id=11136)
得到:
segments.table
contamination.table
10. Filter
gatk --java-options "-Xmx${command_mem}m" FilterMutectCalls -V ${unfiltered_vcf} \
-R ${ref_fasta} \
-O ${output_vcf} \
${"--contamination-table " + contamination_table} \
11. FilterAlignmentArtifacts
gatk --java-options "-Xmx${command_mem}m" FilterAlignmentArtifacts \
-V ${input_vcf} \
-I ${bam} \
--bwa-mem-index-image ${realignment_index_bundle} \
${realignment_extra_args} \
-O ${output_vcf}
12. oncotate_m2
13. SumFloats
功能:Calculates sum of a list of floats
14. Funcotate
三、执行命令
1. 小panel的测试:
time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L Covered.bed -O Mutect2.vcf.gz --bam-output Mutect2.bam
time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta -I samplename_T.recal.bam -L Covered.bed -V small_exac_common_3_hg19.vcf -O samplename_T.pileups.table
time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta -I samplename_N.recal.bam -L Covered.bed -V small_exac_common_3_hg19.vcf -O samplename_N.pileups.table
time gatk --java-options "-Xmx5000m" CalculateContamination -I samplename_T.pileups.table -matched samplename_N.pileups.table -O contamination.table --tumor-segmentation segments.table
time gatk --java-options "-Xmx5000m" FilterMutectCalls -V Mutect2.vcf.gz -R ucsc.hg19.fasta -O Mutect2.filter.vcf --contamination-table contamination.table
得到:
Mutect2.vcf.gz
Mutect2.vcf.gz.tbi
Mutect2.bam
Mutect2.bai
samplename_T.pileups.table
samplename_N.pileups.table
segments.table
contamination.table
Mutect2FilteringStats.tsv
Mutect2.filter.vcf
Mutect2.filter.vcf.idx
运行时间:
$ grep real Mutect2.sh.e
real 28m7.830s (Mutect2这一步运行时间较长)
real 2m36.866s
real 3m6.991s
real 1m7.443s
real 1m8.615s
最后得到:
$ grep -v "#" Mutect2.filter.vcf | head -3
chr1 8064650 . AAAAC A . artifact_in_normal;str_contraction DP=4156;ECNT=1;NLOD=652.40;N_ART_LOD=12.45;POP_AF=8.590e-03;P_CONTAM=0.00;P_GERMLINE=-1.004e+03;RPA=4,3;RU=AAAC;STR;TLOD=15.08 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/0:2301,9:0.068:2310:1186,3:1115,6:30,30:294,302:60:62:0 0/1:1249,9:7.890e-03:1258:669,5:580,4:30,30:190,316:60:23:0:0.010,0.010,7.154e-03:1.469e-03,6.913e-04,0.998
chr1 162749855 . TTC T . artifact_in_normal;str_contraction DP=3631;ECNT=1;NLOD=535.21;N_ART_LOD=34.95;POP_AF=1.000e-06;P_CONTAM=0.00;P_GERMLINE=-8.309e+02;RPA=8,7;RU=TC;STR;TLOD=33.52 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/0:2120,35:0.060:2155:1027,19:1093,16:30,30:293,308:60:40:0 0/1:1151,29:0.022:1180:404,7:747,22:30,30:195,213:60:44:0:0.020,0.020,0.025:9.423e-03,7.577e-04,0.990
chr1 201754410 . CTTTTTT C,CT,CTT,CTTT,CTTTT,CTTTTT,CTTTTTTT . artifact_in_normal;germline_risk;multiallelic DP=3984;ECNT=1;NLOD=472.90,322.80,52.53,-5.898e+02,-1.771e+03,-2.565e+03,391.62;N_ART_LOD=5.75,17.32,37.19,173.03,780.25,1811.50,28.21;POP_AF=7.803e-04,2.467e-03,0.010,0.058,0.545,1.000e-06,1.000e-06;P_CONTAM=0.00;P_GERMLINE=-4.905e+02,-3.158e+02,-2.000e+01,0.00,0.00,0.00,-3.794e+02;RPA=17,11,12,13,14,15,16,18;RU=T;STR;TLOD=8.13,49.36,85.34,219.24,582.57,881.43,28.23 GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/0:287,9,18,34,134,500,898,50:0.017,0.023,0.032,0.077,0.237,0.406,0.042:1930:146,4,8,13,54,250,442,32:141,5,10,21,80,250,456,18:30,30,30,30,30,30,30,30:287,316,299,297,286,297,291,265:60,60,60,60,60,60,60:28,38,37,46,45,42,40:0 0/1/2/3/4/5/6/7:134,10,32,63,143,358,467,40:6.001e-03,0.024,0.045,0.107,0.287,0.387,0.027:1247:62,5,13,32,55,134,201,21:72,5,19,31,88,224,266,19:30,30,30,30,30,30,30,30:181,195,179,207,197,193,190,190:60,60,60,60,60,60,60:24,36,35,46,37,41,36:0:0.768,0.758,0.777:8.021e-03,0.061,0.931
其表头有对FILTER和FORMAT列的解释,如:
##FILTER=
##FILTER=
##FORMAT=
##FORMAT=
... ...
2. WES的测试:
time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L all_exon.bed -O Mutect2.vcf.gz --bam-output Mutect2.bam
time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta -I samplename_T.recal.bam -L all_exon.bed -V small_exac_common_3_hg19.vcf -O samplename_T.pileups.table
time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta -I samplename_N.recal.bam -L all_exon.bed -V small_exac_common_3_hg19.vcf -O samplename_N.pileups.table
time gatk --java-options "-Xmx5000m" CalculateContamination -I samplename_T.pileups.table -matched samplename_N.pileups.table -O contamination.table --tumor-segmentation segments.table
time gatk --java-options "-Xmx5000m" FilterMutectCalls -V Mutect2.vcf.gz -R ucsc.hg19.fasta -O Mutect2.filter.vcf --contamination-table contamination.table
所用时间:
$ grep real wes_mutect.sh.e
real 139m0.688s
real 19m26.328s
real 14m53.911s
real 1m8.589s
real 1m9.738s
3. 小panel拆分Bed的测试:
从上面时间上来看,这一步所用时间还是很长,这次拆分bed来测试一下,会比上面增加三步(拆分/mutect2 call vcf/合并vcf)
1)拆分
time gatk --java-options "-Xmx5000m" SplitIntervals -R ucsc.hg19.fasta -L Covered.bed -scatter 30 -O interval-files
real 1m7.952s
2)mutect2 call vcf
time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L interval-files/0000-scattered.intervals -O interval-files/0000-scattered.intervals.vcf --bam-output interval-files/0000-scattered.intervals.Mutect2.bam
time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L interval-files/0001-scattered.intervals -O interval-files/0001-scattered.intervals.vcf --bam-output interval-files/0001-scattered.intervals.Mutect2.bam
... ...
... ...
$ grep real *
Mutect2_00002.sh.e86950:real 2m18.126s
Mutect2_00003.sh.e86951:real 3m15.701s
Mutect2_00004.sh.e86952:real 2m53.857s
... ...
... ...
这一步是并行运行,按照最长时间计算是:3m15.701s
3)合并vcf
GATK3中采用:CombineVariants
java -XX:ParallelGCThreads=20 -Xmx40g -Djava.io.tmpdir=swp -jar /share/nas2/genome/biosoft/GATK/GenomeAnalysisTK.jar -T CombineVariants -R bwa_Ref/ucsc.hg19.fasta -V Chr_Result/chr1.swp.vcf -V Chr_Result/chr10.swp.vcf -V Chr_Result/chr11.swp.vcf -V Chr_Result/chr12.swp.vcf -V Chr_Result/chr13.swp.vcf -V Chr_Result/chr14.swp.vcf -V Chr_Result/chr15.swp.vcf -V Chr_Result/chr16.swp.vcf -V Chr_Result/chr17.swp.vcf -V Chr_Result/chr18.swp.vcf -V Chr_Result/chr19.swp.vcf -V Chr_Result/chr2.swp.vcf -V Chr_Result/chr20.swp.vcf -V Chr_Result/chr21.swp.vcf -V Chr_Result/chr22.swp.vcf -V Chr_Result/chr3.swp.vcf -V Chr_Result/chr4.swp.vcf -V Chr_Result/chr5.swp.vcf -V Chr_Result/chr6.swp.vcf -V Chr_Result/chr7.swp.vcf -V Chr_Result/chr8.swp.vcf -V Chr_Result/chr9.swp.vcf -V Chr_Result/chrX.swp.vcf --disable_auto_index_creation_and_locking_when_reading_rods -o samplename_T.swp.all --genotypemergeoption UNSORTED
GATK4中CombineVariants已经没有,有三个合并vcf的功能:
MergeVcfs (Picard) Combines multiple variant files into a single variant file
GatherVcfs (Picard) Gathers multiple VCF files from a scatter operation into a single VCF file
对于合并不同染色体的vcf,我们采用GatherVcfs,CatVariants, MergeVcfs or GatherVcfs
GatherVcfsCloud (BETA Tool) Gathers multiple VCF files from a scatter operation into a single VCF file
命令:
time gatk --java-options "-Xmx5000m" GatherVcfs -I interval-files/0000-scattered.intervals.vcf -I interval-files/0001-scattered.intervals.vcf -I interval-files/0002-scattered.intervals.vcf -I interval-files/0003-scattered.intervals.vcf -I interval-files/0004-scattered.intervals.vcf -I interval-files/0005-scattered.intervals.vcf -I interval-files/0006-scattered.intervals.vcf -I interval-files/0007-scattered.intervals.vcf -I interval-files/0008-scattered.intervals.vcf -I interval-files/0009-scattered.intervals.vcf -I interval-files/0010-scattered.intervals.vcf -I interval-files/0011-scattered.intervals.vcf -I interval-files/0012-scattered.intervals.vcf -I interval-files/0013-scattered.intervals.vcf -I interval-files/0014-scattered.intervals.vcf -I interval-files/0015-scattered.intervals.vcf -I interval-files/0016-scattered.intervals.vcf -I interval-files/0017-scattered.intervals.vcf -I interval-files/0018-scattered.intervals.vcf -I interval-files/0019-scattered.intervals.vcf -I interval-files/0020-scattered.intervals.vcf -I interval-files/0021-scattered.intervals.vcf -I interval-files/0022-scattered.intervals.vcf -I interval-files/0023-scattered.intervals.vcf -I interval-files/0024-scattered.intervals.vcf -I interval-files/0025-scattered.intervals.vcf -I interval-files/0026-scattered.intervals.vcf -I interval-files/0027-scattered.intervals.vcf -I interval-files/0028-scattered.intervals.vcf -I interval-files/0029-scattered.intervals.vcf -O gather.raw.vcf
real 0m15.819s
bed不拆分和拆分得到的vcf基本一致,少量位点不同。
时间上,不经过bed分割的mutect2需要28m,经过bed分割的只需要4m 提速了6倍。
-L参数 Intervals and interval lists的介绍
之后的步骤和bed不拆分流程相同。
这一步的学习基本完成。
下一个step是对转录组SNP/INDEL(RNAseq SNPs + Indels)的学习