GATK Best Practices, step 3: Somatic SNVs + Indels

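I. Introduction to somatic short variant discovery (SNVs + Indels)

GATK documentation: Somatic short variant discovery (SNVs + Indels)

This step can be run on a single (tumor-only) sample or on a matched tumor-normal pair.

The GATK team provides two workflows, both in the same Git repository: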
Somatic short variants tumor-normal pair
Somatic short variants PON creation

git clone https://github.com/gatk-workflows/gatk4-somatic-snvs-indels

The repository contains four WDL files:
mutect2-normal-normal.wdl
mutect2.wdl (runs on a single tumor-normal pair or on a single tumor sample)
mutect2_nio.wdl
mutect2_pon.wdl (creates a Mutect2 panel of normals from multiple normal samples)

II. The tasks in the WDL

These notes follow the mutect2.wdl workflow, which contains 14 tasks. Each one is covered in turn below.
GATK documentation: (How to) Call somatic mutations using GATK4 Mutect2

1. CramToBam

Skipped.

2. SplitIntervals

Purpose: splits the BED intervals into smaller pieces. My BED file is small, so I skip this step.

gatk --java-options "-Xmx5000m" SplitIntervals \
-R ${ref_fasta} \
${"-L " + intervals} \
-scatter 50 \
-O interval-files \
${split_intervals_extra_args}

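-scatter sets the number of interval files to split into.

For the Mutect2 workflow, the tasks that matter most are 3, 9, and 10, so these notes focus on them.
GATK documentation: (How to) Call somatic mutations using GATK4 Mutect2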

3. M2

This task can run in paired or tumor-only mode. The example below uses a matched pair and consists of three steps:

1) GetSampleName

GetSampleName is a BETA tool and is not yet ready for use in production
Skipped.

2) Mutect2

gatk --java-options "-Xmx5000m" Mutect2 \
-R ${ref_fasta} \
-I ${tumor_bam} -tumor tumor_NAME \
-I ${normal_bam} -normal normal_NAME \
--germline-resource   af-only-gnomad.raw.sites.hg19.vcf.gz \
${"-pon " + pon} \
${"-L " + intervals} \
${"--alleles " + gga_vcf} \
-O "${output_vcf}" \
${true='--bam-output bamout.bam' false='' make_bamout} \
${true='--f1r2-tar-gz f1r2.tar.gz' false='' run_ob_filter} \
${m2_extra_args}

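--germline-resource: Population vcf of germline sequencing containing allele fractions
GATK forum thread: MuTect2 beta --germline_resource for build b37 (includes the VCF download location)
Official resources exist only for b37 and hg38, not for hg19. An hg19 version can be downloaded from a link posted in that forum thread, but its provenance is not guaranteed.

-pon is the panel of normals (PON) built from multiple normal samples. It can be omitted when only a single pair is available.

-L is the intervals file; a BED file is accepted.

--alleles (FeatureInput): The set of alleles at which to genotype when --genotyping_mode is GENOTYPE_GIVEN_ALLELES. Not used here.

--bam-output, -bamout (String): File to which assembled haplotypes should be written
This writes the output BAM.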
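The paired and tumor-only modes differ only in whether the normal BAM and -normal are passed. A minimal dry-run sketch of that choice follows; build_mutect2_cmd is a hypothetical helper (not part of GATK) that only prints the command, and it uses just the flags shown above.

```shell
# Hypothetical helper: print a Mutect2 command for paired or tumor-only mode.
# $1 reference, $2 tumor BAM, $3 tumor sample name, $4 output VCF,
# $5/$6 optional normal BAM and normal sample name.
build_mutect2_cmd() {
    m2_cmd="gatk --java-options \"-Xmx5000m\" Mutect2 -R $1 -I $2 -tumor $3"
    # Add the normal sample only when one is supplied (paired mode).
    if [ -n "${5:-}" ]; then
        m2_cmd="$m2_cmd -I $5 -normal $6"
    fi
    m2_cmd="$m2_cmd -O $4"
    printf '%s\n' "$m2_cmd"
}

paired_cmd=$(build_mutect2_cmd ucsc.hg19.fasta samplename_T.recal.bam samplename_T Mutect2.vcf.gz samplename_N.recal.bam samplename_N)
tumor_only_cmd=$(build_mutect2_cmd ucsc.hg19.fasta samplename_T.recal.bam samplename_T Mutect2.vcf.gz)
printf '%s\n' "$paired_cmd"
printf '%s\n' "$tumor_only_cmd"
```

The printed line can be reviewed and then pasted into a job script. Optional arguments such as --germline-resource, -pon, and -L would be appended in the same way.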

3) GetPileupSummaries

Warning: GetPileupSummaries is a BETA tool and is not yet ready for use in production

GATK tool documentation: GetPileupSummaries

Purpose: tabulates read counts at the sites listed in the supplied VCF and writes a six-column table.

These must be created, even if they remain empty, as cromwell doesn't support optional output
The output files of this step must exist, so the WDL first creates both of them with touch:

touch tumor-pileups.table
touch normal-pileups.table

Tumor:

gatk --java-options "-Xmx${command_mem}m" GetPileupSummaries  \
-R ${ref_fasta}  \
-I ${tumor_bam} ${"--interval-set-rule INTERSECTION -L " + intervals} \
-V ${variants_for_contamination} -L ${variants_for_contamination}  \
-O tumor-pileups.table

Normal:

gatk --java-options "-Xmx${command_mem}m" GetPileupSummaries  \
-R ${ref_fasta}  \
-I ${normal_bam} ${"--interval-set-rule INTERSECTION -L " + intervals} \
-V ${variants_for_contamination} -L ${variants_for_contamination}  \
-O normal-pileups.table

For a matched pair, run this step once per sample. Without a normal sample, skip the second command.

-V is a required argument and takes a VCF, which is available for download from the GATK website.
The tool requires a common germline variant sites VCF, e.g. derived from the gnomAD resource, with population allele frequencies (AF) in the INFO field. This resource must contain only biallelic SNPs and can be an eight-column sites-only VCF. The tool ignores the filter status of the variant calls in this germline resource.
An excerpt of small_exac_common_3_hg19.vcf:

$ grep -v "#" small_exac_common_3_hg19.vcf | head
chr1    17365   .       C       G       826621  InbreedingCoeff_Filter  AF=0.136;REMAP_ALIGN=FP
chr1    17385   .       G       A       592354  InbreedingCoeff_Filter  AF=0.122;REMAP_ALIGN=FP
chr1    30548   .       T       G       1923.65 VQSRTrancheSNP99.60to99.80      AF=0.081;REMAP_ALIGN=FP
chr1    69761   .       A       T       2041400 PASS    AF=0.113;REMAP_ALIGN=FP
chr1    139213  .       A       G       1634390 PASS    AF=0.233;REMAP_ALIGN=FP
chr1    139233  .       C       A       1489200 PASS    AF=0.231;REMAP_ALIGN=FP
chr1    567783  rs142895724     T       C       120608  PASS    AF=0.120;REMAP_ALIGN=FP
chr1    567825  .       C       T       96811.60        PASS    AF=0.066;REMAP_ALIGN=FP
chr1    721757  rs189147642     T       A       349942  PASS    AF=0.051;REMAP_ALIGN=FP
chr1    739142  rs2340527       T       A       53624.30        PASS    AF=0.073;REMAP_ALIGN=FP

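-L may be given more than once. Although the sites (-L) and variants (-V) resources will often be identical, this need not be the case. In the version used for these notes, -L could be omitted when it covers the same sites as -V; check this against your GATK version, since later releases may require -L.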
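Since the resource must contain only biallelic SNPs with AF in the INFO field, a quick sanity check before running GetPileupSummaries can save a failed job. The sketch below is an assumption-laden illustration: count_bad_sites is a hypothetical helper, and the two test records are copied from the excerpt above into a temporary file.

```shell
# Hypothetical helper: count records that are NOT biallelic SNPs carrying AF in INFO.
# A multi-allelic ALT such as "T,G" fails the single-base test on column 5.
count_bad_sites() {
    awk -F'\t' '
        /^#/ { next }                                   # skip header lines
        $4 !~ /^[ACGT]$/ || $5 !~ /^[ACGT]$/ || $8 !~ /(^|;)AF=/ { bad++ }
        END { print bad + 0 }
    ' "$1"
}

# Two records from the excerpt above, written tab-separated to a temp file.
resource_vcf=$(mktemp)
printf 'chr1\t17365\t.\tC\tG\t826621\tInbreedingCoeff_Filter\tAF=0.136;REMAP_ALIGN=FP\n' > "$resource_vcf"
printf 'chr1\t69761\t.\tA\tT\t2041400\tPASS\tAF=0.113;REMAP_ALIGN=FP\n' >> "$resource_vcf"

bad_sites=$(count_bad_sites "$resource_vcf")
printf 'records violating the requirement: %s\n' "$bad_sites"
```

On a real resource file the same function would be called with the path to the uncompressed VCF; a gzipped file would need to be piped through a decompressor first.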

GetPileupSummaries produces:

$ head samplename_T.pileups.table
contig  position        ref_count       alt_count       other_alt_count allele_frequency
chr1    8064578 1140    1179    1       0.144
chr1    65311214        2290    0       0       0.061
chr1    65311262        2299    1       1       0.151
chr1    97981395        1101    987     1       0.192
chr1    98165091        1956    0       2       0.086
chr2    29416615        2375    0       0       0.091
chr2    29416794        1634    0       2       0.129
chr2    47693788        2203    1       0       0.088
chr2    47703500        898     859     2       0.114
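To get a feel for the table, the observed alt-allele fraction at each site can be computed from the three count columns. This is only an exploratory sketch (alt_fraction is a hypothetical helper, and it is not how CalculateContamination itself works); the two data rows are taken from the output above.

```shell
# Hypothetical helper: print contig, position and alt_count / total depth per site.
alt_fraction() {
    awk -F'\t' '
        /^#/ || $1 == "contig" { next }                 # skip comments and the column header
        {
            depth = $3 + $4 + $5                        # ref + alt + other_alt
            if (depth > 0) printf "%s\t%s\t%.3f\n", $1, $2, $4 / depth
        }
    ' "$1"
}

# Two rows from the table above, written tab-separated to a temp file.
pileup=$(mktemp)
printf 'contig\tposition\tref_count\talt_count\tother_alt_count\tallele_frequency\n' > "$pileup"
printf 'chr1\t8064578\t1140\t1179\t1\t0.144\n' >> "$pileup"
printf 'chr1\t65311214\t2290\t0\t0\t0.061\n' >> "$pileup"

fractions=$(alt_fraction "$pileup")
printf '%s\n' "$fractions"
```

For the first row this gives 1179 / 2320, about 0.508, which is what a heterozygous germline site would look like; the second row has no alt reads at all.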

4. MergeVCFs

Merges the per-interval VCFs. Skipped.

5. MergeBamOuts

Skipped.

6. MergeStats

Skipped.

7. MergePileupSummaries

Skipped.

8. LearnReadOrientationModel

Skipped.

9. CalculateContamination

gatk --java-options "-Xmx${command_mem}m"  \
CalculateContamination -I ${tumor_pileups} \
-O contamination.table --tumor-segmentation segments.table  \
${"-matched " + normal_pileups}

--tumor-segmentation, -segments (File): The output table containing segmentation of the tumor by minor allele fraction
-matched takes the pileup table of the matched normal. Omit it when there is no normal sample.

The tool takes the summary table from GetPileupSummaries and gives the fraction contamination. This estimation informs downstream filtering by FilterMutectCalls. (Source: https://software.broadinstitute.org/gatk/documentation/article?id=11136)

Outputs:

segments.table
contamination.table
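Because the WDL may leave an empty placeholder in place of the normal pileup table, a wrapper script has to decide whether to pass -matched. The sketch below shows one way to do that with a non-empty-file test; build_contamination_cmd is a hypothetical helper that only prints the command, and the temporary files stand in for real pileup tables.

```shell
# Hypothetical helper: print a CalculateContamination command.
# $1 tumor pileup table, $2 optional normal pileup table.
build_contamination_cmd() {
    cc_cmd="gatk CalculateContamination -I $1 -O contamination.table --tumor-segmentation segments.table"
    # [ -s FILE ] is true only for an existing, non-empty file,
    # so a placeholder created with touch does not trigger -matched.
    if [ -n "${2:-}" ] && [ -s "$2" ]; then
        cc_cmd="$cc_cmd -matched $2"
    fi
    printf '%s\n' "$cc_cmd"
}

# Stand-in files: a non-empty normal table and an empty placeholder.
tumor_table=$(mktemp)
normal_table=$(mktemp)
empty_table=$(mktemp)
printf 'contig\tposition\n' > "$tumor_table"
printf 'contig\tposition\n' > "$normal_table"

paired_cc=$(build_contamination_cmd "$tumor_table" "$normal_table")
placeholder_cc=$(build_contamination_cmd "$tumor_table" "$empty_table")
printf '%s\n' "$paired_cc"
printf '%s\n' "$placeholder_cc"
```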

10. Filter

gatk --java-options "-Xmx${command_mem}m" FilterMutectCalls -V ${unfiltered_vcf} \
-R ${ref_fasta} \
-O ${output_vcf} \
${"--contamination-table " + contamination_table}

11. FilterAlignmentArtifacts

gatk --java-options "-Xmx${command_mem}m" FilterAlignmentArtifacts \
-V ${input_vcf} \
-I ${bam} \
--bwa-mem-index-image ${realignment_index_bundle} \
${realignment_extra_args} \
-O ${output_vcf}

12. oncotate_m2

13. SumFloats

Purpose: Calculates sum of a list of floats

14. Funcotate

III. Running the commands

1. Test on a small panel:

time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L Covered.bed -O Mutect2.vcf.gz --bam-output Mutect2.bam
time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta  -I  samplename_T.recal.bam -L Covered.bed -V small_exac_common_3_hg19.vcf  -O samplename_T.pileups.table
time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta  -I  samplename_N.recal.bam -L Covered.bed -V small_exac_common_3_hg19.vcf  -O samplename_N.pileups.table
time gatk --java-options "-Xmx5000m" CalculateContamination -I samplename_T.pileups.table -matched samplename_N.pileups.table -O contamination.table  --tumor-segmentation segments.table
time gatk --java-options "-Xmx5000m" FilterMutectCalls -V Mutect2.vcf.gz -R  ucsc.hg19.fasta -O Mutect2.filter.vcf --contamination-table contamination.table

Outputs:

Mutect2.vcf.gz
Mutect2.vcf.gz.tbi
Mutect2.bam
Mutect2.bai
samplename_T.pileups.table
samplename_N.pileups.table
segments.table
contamination.table
Mutect2FilteringStats.tsv
Mutect2.filter.vcf
Mutect2.filter.vcf.idx

Run times:

$ grep real Mutect2.sh.e
real    28m7.830s (the Mutect2 step takes the longest)
real    2m36.866s
real    3m6.991s
real    1m7.443s
real    1m8.615s

The final filtered VCF:

$ grep -v "#" Mutect2.filter.vcf | head -3
chr1    8064650 .       AAAAC   A       .       artifact_in_normal;str_contraction      DP=4156;ECNT=1;NLOD=652.40;N_ART_LOD=12.45;POP_AF=8.590e-03;P_CONTAM=0.00;P_GERMLINE=-1.004e+03;RPA=4,3;RU=AAAC;STR;TLOD=15.08      GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/0:2301,9:0.068:2310:1186,3:1115,6:30,30:294,302:60:62:0   0/1:1249,9:7.890e-03:1258:669,5:580,4:30,30:190,316:60:23:0:0.010,0.010,7.154e-03:1.469e-03,6.913e-04,0.998
chr1    162749855       .       TTC     T       .       artifact_in_normal;str_contraction      DP=3631;ECNT=1;NLOD=535.21;N_ART_LOD=34.95;POP_AF=1.000e-06;P_CONTAM=0.00;P_GERMLINE=-8.309e+02;RPA=8,7;RU=TC;STR;TLOD=33.52        GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB 0/0:2120,35:0.060:2155:1027,19:1093,16:30,30:293,308:60:40:0        0/1:1151,29:0.022:1180:404,7:747,22:30,30:195,213:60:44:0:0.020,0.020,0.025:9.423e-03,7.577e-04,0.990
chr1    201754410       .       CTTTTTT C,CT,CTT,CTTT,CTTTT,CTTTTT,CTTTTTTT     .       artifact_in_normal;germline_risk;multiallelic   DP=3984;ECNT=1;NLOD=472.90,322.80,52.53,-5.898e+02,-1.771e+03,-2.565e+03,391.62;N_ART_LOD=5.75,17.32,37.19,173.03,780.25,1811.50,28.21;POP_AF=7.803e-04,2.467e-03,0.010,0.058,0.545,1.000e-06,1.000e-06;P_CONTAM=0.00;P_GERMLINE=-4.905e+02,-3.158e+02,-2.000e+01,0.00,0.00,0.00,-3.794e+02;RPA=17,11,12,13,14,15,16,18;RU=T;STR;TLOD=8.13,49.36,85.34,219.24,582.57,881.43,28.23       GT:AD:AF:DP:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:ORIGINAL_CONTIG_MISMATCH:SA_MAP_AF:SA_POST_PROB     0/0:287,9,18,34,134,500,898,50:0.017,0.023,0.032,0.077,0.237,0.406,0.042:1930:146,4,8,13,54,250,442,32:141,5,10,21,80,250,456,18:30,30,30,30,30,30,30,30:287,316,299,297,286,297,291,265:60,60,60,60,60,60,60:28,38,37,46,45,42,40:0     0/1/2/3/4/5/6/7:134,10,32,63,143,358,467,40:6.001e-03,0.024,0.045,0.107,0.287,0.387,0.027:1247:62,5,13,32,55,134,201,21:72,5,19,31,88,224,266,19:30,30,30,30,30,30,30,30:181,195,179,207,197,193,190,190:60,60,60,60,60,60,60:24,36,35,46,37,41,36:0:0.768,0.758,0.777:8.021e-03,0.061,0.931

The VCF header explains the FILTER and FORMAT fields, for example:

##FILTER=
##FILTER=
##FORMAT=
##FORMAT=
... ...
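A quick way to see which filters dominate is to tally the FILTER column (column 7) of the filtered VCF. The sketch below is illustrative: filter_counts is a hypothetical helper, and the three test records are abbreviated versions of the rows shown above (INFO replaced by ".", sample columns dropped).

```shell
# Hypothetical helper: count how often each filter name appears in column 7.
# A record with several filters ("a;b") contributes one count to each.
filter_counts() {
    awk -F'\t' '
        /^#/ { next }                                   # skip header lines
        {
            n = split($7, names, ";")
            for (i = 1; i <= n; i++) count[names[i]]++
        }
        END { for (name in count) printf "%s\t%d\n", name, count[name] }
    ' "$1" | sort
}

# Abbreviated records based on the three rows above.
filtered_vcf=$(mktemp)
printf 'chr1\t8064650\t.\tAAAAC\tA\t.\tartifact_in_normal;str_contraction\t.\n' > "$filtered_vcf"
printf 'chr1\t162749855\t.\tTTC\tT\t.\tartifact_in_normal;str_contraction\t.\n' >> "$filtered_vcf"
printf 'chr1\t201754410\t.\tCTTTTTT\tC,CT\t.\tartifact_in_normal;germline_risk;multiallelic\t.\n' >> "$filtered_vcf"

counts=$(filter_counts "$filtered_vcf")
printf '%s\n' "$counts"
```

On the real file the call would be filter_counts Mutect2.filter.vcf; records that passed all filters show up under PASS.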

2. Test on WES data:

time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L all_exon.bed -O Mutect2.vcf.gz --bam-output Mutect2.bam
time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta  -I samplename_T.recal.bam -L all_exon.bed -V small_exac_common_3_hg19.vcf -O samplename_T.pileups.table
time gatk --java-options "-Xmx5000m" GetPileupSummaries -R ucsc.hg19.fasta  -I  samplename_N.recal.bam -L all_exon.bed -V small_exac_common_3_hg19.vcf -O samplename_N.pileups.table
time gatk --java-options "-Xmx5000m" CalculateContamination -I samplename_T.pileups.table -matched samplename_N.pileups.table -O contamination.table  --tumor-segmentation segments.table
time gatk --java-options "-Xmx5000m" FilterMutectCalls -V Mutect2.vcf.gz -R  ucsc.hg19.fasta -O Mutect2.filter.vcf --contamination-table contamination.table

Run times:

$ grep real wes_mutect.sh.e
real    139m0.688s
real    19m26.328s
real    14m53.911s
real    1m8.589s
real    1m9.738s

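3. Test on the small panel with a split BED:

The timings above show that Mutect2 is still slow, so this test splits the BED file first. That adds three steps to the run: splitting the intervals, calling variants with Mutect2 on each interval file, and merging the VCFs.

1) Split the intervals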

time gatk --java-options "-Xmx5000m" SplitIntervals -R ucsc.hg19.fasta -L  Covered.bed -scatter 30 -O interval-files

real 1m7.952s

2) Call variants with Mutect2 on each interval file

time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L interval-files/0000-scattered.intervals -O interval-files/0000-scattered.intervals.vcf --bam-output interval-files/0000-scattered.intervals.Mutect2.bam
time gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L interval-files/0001-scattered.intervals -O interval-files/0001-scattered.intervals.vcf --bam-output interval-files/0001-scattered.intervals.Mutect2.bam
... ...
... ...

$ grep real *
Mutect2_00002.sh.e86950:real    2m18.126s
Mutect2_00003.sh.e86951:real    3m15.701s
Mutect2_00004.sh.e86952:real    2m53.857s
... ...
... ...

These jobs run in parallel, so the wall-clock time is that of the slowest job: 3m15.701s.
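Writing one Mutect2 command per interval file by hand is error-prone, so the per-shard commands can be generated with a loop. This is a dry-run sketch: emit_scatter_cmds is a hypothetical helper that only prints the commands, using the same file names and flags as above; how the printed lines are submitted in parallel depends on the local scheduler.

```shell
# Hypothetical helper: print one Mutect2 command per scattered interval file.
# $1: directory holding the *-scattered.intervals files written by SplitIntervals.
emit_scatter_cmds() {
    for shard in "$1"/*-scattered.intervals; do
        [ -e "$shard" ] || continue                     # no match: the glob stays literal
        printf 'gatk --java-options "-Xmx5000m" Mutect2 -R ucsc.hg19.fasta -I samplename_T.recal.bam -tumor samplename_T -I samplename_N.recal.bam -normal samplename_N --germline-resource af-only-gnomad.raw.sites.hg19.vcf.gz -L %s -O %s.vcf --bam-output %s.Mutect2.bam\n' "$shard" "$shard" "$shard"
    done
}

# Stand-in directory with two empty shard files, just to exercise the loop.
shard_dir=$(mktemp -d)
touch "$shard_dir/0000-scattered.intervals" "$shard_dir/0001-scattered.intervals"

scatter_cmds=$(emit_scatter_cmds "$shard_dir")
printf '%s\n' "$scatter_cmds"
```

Each printed line can be written to its own job script (for example one script per shard) and submitted to the cluster.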

3) Merge the VCFs

GATK3 used CombineVariants:

java -XX:ParallelGCThreads=20 -Xmx40g -Djava.io.tmpdir=swp -jar /share/nas2/genome/biosoft/GATK/GenomeAnalysisTK.jar -T CombineVariants -R bwa_Ref/ucsc.hg19.fasta -V Chr_Result/chr1.swp.vcf -V Chr_Result/chr10.swp.vcf -V Chr_Result/chr11.swp.vcf -V Chr_Result/chr12.swp.vcf -V Chr_Result/chr13.swp.vcf -V Chr_Result/chr14.swp.vcf -V Chr_Result/chr15.swp.vcf -V Chr_Result/chr16.swp.vcf -V Chr_Result/chr17.swp.vcf -V Chr_Result/chr18.swp.vcf -V Chr_Result/chr19.swp.vcf -V Chr_Result/chr2.swp.vcf -V Chr_Result/chr20.swp.vcf -V Chr_Result/chr21.swp.vcf -V Chr_Result/chr22.swp.vcf -V Chr_Result/chr3.swp.vcf -V Chr_Result/chr4.swp.vcf -V Chr_Result/chr5.swp.vcf -V Chr_Result/chr6.swp.vcf -V Chr_Result/chr7.swp.vcf -V Chr_Result/chr8.swp.vcf -V Chr_Result/chr9.swp.vcf -V Chr_Result/chrX.swp.vcf  --disable_auto_index_creation_and_locking_when_reading_rods -o samplename_T.swp.all --genotypemergeoption UNSORTED

GATK4 no longer includes CombineVariants. It offers three tools for merging VCFs:
MergeVcfs (Picard) Combines multiple variant files into a single variant file
GatherVcfs (Picard) Gathers multiple VCF files from a scatter operation into a single VCF file
GatherVcfsCloud (BETA Tool) Gathers multiple VCF files from a scatter operation into a single VCF file
To merge VCFs from different chromosomes or intervals we use GatherVcfs. See also the GATK forum thread: CatVariants, MergeVcfs or GatherVcfs

Command:

time gatk  --java-options "-Xmx5000m" GatherVcfs -I interval-files/0000-scattered.intervals.vcf -I interval-files/0001-scattered.intervals.vcf -I interval-files/0002-scattered.intervals.vcf -I interval-files/0003-scattered.intervals.vcf -I interval-files/0004-scattered.intervals.vcf -I interval-files/0005-scattered.intervals.vcf -I interval-files/0006-scattered.intervals.vcf -I interval-files/0007-scattered.intervals.vcf -I interval-files/0008-scattered.intervals.vcf -I interval-files/0009-scattered.intervals.vcf -I interval-files/0010-scattered.intervals.vcf -I interval-files/0011-scattered.intervals.vcf -I interval-files/0012-scattered.intervals.vcf -I interval-files/0013-scattered.intervals.vcf -I interval-files/0014-scattered.intervals.vcf -I interval-files/0015-scattered.intervals.vcf -I interval-files/0016-scattered.intervals.vcf -I interval-files/0017-scattered.intervals.vcf -I interval-files/0018-scattered.intervals.vcf -I interval-files/0019-scattered.intervals.vcf -I interval-files/0020-scattered.intervals.vcf -I interval-files/0021-scattered.intervals.vcf -I interval-files/0022-scattered.intervals.vcf -I interval-files/0023-scattered.intervals.vcf -I interval-files/0024-scattered.intervals.vcf -I interval-files/0025-scattered.intervals.vcf -I interval-files/0026-scattered.intervals.vcf -I interval-files/0027-scattered.intervals.vcf -I interval-files/0028-scattered.intervals.vcf -I interval-files/0029-scattered.intervals.vcf -O gather.raw.vcf

real 0m15.819s
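The 30 -I arguments follow a regular pattern, so the GatherVcfs command can be assembled rather than typed out. The sketch below is a dry run: gather_inputs is a hypothetical helper that assumes the shard naming used above (zero-padded, starting at 0000) and only prints the resulting command.

```shell
# Hypothetical helper: emit " -I interval-files/NNNN-scattered.intervals.vcf" for each shard.
# $1: number of shards (the -scatter value). Counting upward keeps the inputs in
# shard order, which matters because GatherVcfs expects its inputs in genomic order.
gather_inputs() {
    i=0
    while [ "$i" -lt "$1" ]; do
        printf ' -I interval-files/%04d-scattered.intervals.vcf' "$i"
        i=$((i + 1))
    done
}

gather_cmd="gatk --java-options \"-Xmx5000m\" GatherVcfs$(gather_inputs 30) -O gather.raw.vcf"
printf '%s\n' "$gather_cmd"
```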

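The VCFs from the unsplit and split runs are essentially the same; only a small number of sites differ.
In terms of time, Mutect2 on the unsplit BED took about 28m, while the split run took about 4m40s in total (1m08s to split, 3m16s for the slowest Mutect2 job, 16s to gather), roughly a 6x speedup.
For the -L argument, see the GATK article: Intervals and interval lists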
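The speedup figure can be reproduced from the `real` times reported above. The sketch below converts the "XmY.YYYs" strings to seconds and divides; to_seconds is a hypothetical helper, and the four timings are the ones listed in these notes.

```shell
# Hypothetical helper: convert a time(1) "real" value such as 28m7.830s to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# Unsplit run: the single Mutect2 job.
unsplit=$(to_seconds 28m7.830s)

# Split run: SplitIntervals + slowest parallel Mutect2 job + GatherVcfs.
split_total=$(awk -v a="$(to_seconds 1m7.952s)" \
                  -v b="$(to_seconds 3m15.701s)" \
                  -v c="$(to_seconds 0m15.819s)" \
                  'BEGIN { printf "%.3f\n", a + b + c }')

speedup=$(awk -v u="$unsplit" -v s="$split_total" 'BEGIN { printf "%.1f\n", u / s }')
printf 'unsplit: %ss, split: %ss, speedup: %sx\n' "$unsplit" "$split_total" "$speedup"
```

This prints 1687.830 s for the unsplit run, 279.472 s for the split run, and a speedup of 6.0x. The comparison assumes enough cores to run all shards at once.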

The remaining steps are the same as in the unsplit run.

That largely wraps up this step.
The next step covers RNAseq SNPs + Indels.