Detecting Copy Number Variation with GATK's CNV Tools

Author's note

This post records the whole process of testing GATK's CNV tools. The outline is as follows; some sections are still to be filled in.

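Overview

GATK is a widely accepted tool for calling point mutations. While reading its help output I happened to notice that it also ships tools for detecting CNVs, so I gave them a try. This use is fairly niche, and I do not recommend it.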

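Official publication

The tools are in BETA and I found no publication for them. A database of normals was built, however, and the reference for it is https://www.nature.com/articles/nature15393.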

Official manual

https://gatk.broadinstitute.org/hc/en-us/articles/360035531092?id=11682

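Principle

(To be expanded when I have time.)
In essence it still counts read depth, runs a hypothesis test, keeps the significant intervals, and then merges adjacent segments.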
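
The "keep significant intervals, then merge neighbours" idea can be sketched in a few lines of shell. This is a toy illustration only, not GATK's actual algorithm: a fixed cutoff on the log2 copy ratio stands in for the hypothesis test, and the function name, the 0.3 default cutoff, and the four-column input layout (contig, start, end, log2 ratio) are all assumptions made for this example.

```shell
#!/usr/bin/env bash
# Toy sketch (not GATK's implementation): flag bins whose log2 copy ratio
# passes a cutoff, then merge touching bins with the same direction.
# Input on stdin, tab-separated (hypothetical layout):
#   contig  start  end  log2_ratio
merge_significant_bins() {
    local cutoff="${1:-0.3}"   # hypothetical |log2 ratio| cutoff
    awk -v OFS='\t' -v cut="$cutoff" '
        # print the currently open segment, if there is one
        function emit() { if (state != "") print chr, s, e, state, n }
        {
            cur = ($4 >= cut) ? "gain" : (($4 <= -cut) ? "loss" : "")
            # extend the open segment when this bin touches it,
            # lies on the same contig, and points in the same direction
            if (cur != "" && cur == state && $1 == chr && $2 <= e + 1) {
                e = $3; n++
                next
            }
            emit()
            state = cur; chr = $1; s = $2; e = $3; n = 1
        }
        END { emit() }
    '
}
```

Each output row is contig, start, end, direction (gain or loss), and the number of merged bins. Bins below the cutoff close the open segment, so two gains separated by a neutral bin are reported as separate segments.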

Installation

Installing gatk is all that is needed; usage is similar to Mutect2.

Run script

# 1.1 Prepare the interval (region) file
gatk PreprocessIntervals -L panel_regions.bed -R ucsc.hg19.fasta -O output/test.list --bin-length 267 --interval-merging-rule OVERLAPPING_ONLY
# Annotate the intervals with GC content
gatk AnnotateIntervals -L output/test.list -O output/test.anno.list -R ucsc.hg19.fasta --interval-merging-rule OVERLAPPING_ONLY --sequence-dictionary ~/GATK/hg19Ref/ucsc.hg19.dict
# 1.2 Collect read counts for the test sample
# Note: passing anno.list to -L fails with
#   Query interval "@HD  VN:1.5" is not valid for this input
# One workaround is to delete the meaningless lines that start with "@".
gatk CollectReadCounts -I input/test.HQ.bam -L output/test.list --interval-merging-rule OVERLAPPING_ONLY -O output/test.counts.hdf5
# 2 Build the panel of normals (PoN)
# 2.1 Collect read counts from each normal BAM
# The interval list must be identical to the one used for the test sample in step 1.2.
for i in $(cat samplelist)
do
gatk CollectReadCounts -I input/18080706-1/${i}.HQ.bam -L liqian.list --interval-merging-rule OVERLAPPING_ONLY -O PON/${i}.hdf5
done
# 2.2 Create the PoN file
gatk CreateReadCountPanelOfNormals -I PON/pon1.hdf5 -I PON/pon2.hdf5 -I PON/pon3.hdf5 -I PON/pon4.hdf5 -I PON/pon5.hdf5 -I PON/pon6.hdf5 -I PON/pon7.hdf5 -I PON/pon8.hdf5 -I PON/pon9.hdf5 -I PON/pon10.hdf5 -I PON/pon11.hdf5 --minimum-interval-median-percentile 5.0 -O PON/11_normal.hdf5
# 3 Denoise
gatk DenoiseReadCounts -I output/test.counts.hdf5 --count-panel-of-normals PON/11_normal.hdf5 --standardized-copy-ratios output/test.standardCR.tsv --denoised-copy-ratios output/test.denoisedCR.tsv
# 4 Plot the standardized and denoised copy ratios
gatk PlotDenoisedCopyRatios --standardized-copy-ratios output/test.standardCR.tsv --denoised-copy-ratios output/test.denoisedCR.tsv --sequence-dictionary ~/GATK/hg19Ref/ucsc.hg19.dict --minimum-contig-length 46709983 --output output/ --output-prefix test
# 5 Collect allelic counts (ref/alt read counts per site)
gatk CollectAllelicCounts -L output/test.list -I input/test.HQ.bam -R ~/GATK/hg19Ref/ucsc.hg19.fasta -O output/test_T_clean.allelicCounts.tsv
# 6 Segmentation
gatk ModelSegments --denoised-copy-ratios output/test.denoisedCR.tsv --allelic-counts output/test_T_clean.allelicCounts.tsv --output output --output-prefix test
# 7 Call the copy ratio of each segment
gatk CallCopyRatioSegments --input output/test.cr.seg --output output/test.cr.call.seg --calling-copy-ratio-z-score-threshold 2.0 --neutral-segment-copy-ratio-upper-bound 1.1 --neutral-segment-copy-ratio-lower-bound 0.9
# 8.1 Plot the final modeled segments
gatk PlotModeledSegments --denoised-copy-ratios output/test.denoisedCR.tsv --allelic-counts output/test.hets.tsv --segments output/test.modelFinal.seg --sequence-dictionary ~/GATK/hg19Ref/ucsc.hg19.dict --minimum-contig-length 46709983 --output output/ --output-prefix test
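
Two parts of the script are tedious by hand: typing one `-I` per normal in step 2.2, and deleting the `@` header lines mentioned in step 1.2. The helpers below are a minimal sketch of how both could be automated. The function names are made up for this post and are not part of GATK, and the sketch assumes the sample list holds one sample name per line and that the per-sample counts live in `PON/<name>.hdf5`, as in the loop of step 2.1.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the function names are not part of GATK.

# Print one "-I <dir>/<sample>.hdf5" pair per non-empty line of a sample list.
# $1: sample list file, $2: directory with the .hdf5 counts (default: PON)
pon_input_args() {
    local list="$1" dir="${2:-PON}" name
    # "|| [ -n ... ]" keeps a last line that lacks a trailing newline
    while IFS= read -r name || [ -n "$name" ]; do
        if [ -n "$name" ]; then
            printf -- '-I %s/%s.hdf5 ' "$dir" "$name"
        fi
    done < "$list"
}

# Drop SAM-style header lines (those starting with "@") from an interval file.
strip_header_lines() {
    grep -v '^@' "$1"
}

# Usage sketch (needs gatk on PATH; the substitution is left unquoted
# on purpose so that the arguments are split into separate words):
#   gatk CreateReadCountPanelOfNormals $(pon_input_args samplelist) \
#       --minimum-interval-median-percentile 5.0 -O PON/11_normal.hdf5
#   strip_header_lines output/test.anno.list > output/test.anno.noheader.list
```

Relying on word splitting is only safe while sample names contain no spaces; with such names a bash array would be the more robust choice.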

Interpreting the results

(To be added.)
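
As a starting point for reading the output of step 7, the sketch below keeps only the segments that were not called neutral. It rests on assumptions that should be checked against your own file: that `output/test.cr.call.seg` is tab-separated, begins with `@` header lines, has a column named `CALL`, and marks neutral segments with `0`. The function name is invented for this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the column header plus every segment whose
# CALL value is not "0". The CALL column is located by name, so the
# column order does not matter.
nonneutral_segments() {
    awk -F'\t' '
        /^@/ { next }                  # skip SAM-style header lines
        !col {                         # first remaining line holds the column names
            for (i = 1; i <= NF; i++) if ($i == "CALL") col = i
            if (!col) { print "no CALL column found" > "/dev/stderr"; exit 1 }
            print
            next
        }
        $col != "0"                    # keep segments not called neutral
    ' "$1"
}

# Usage sketch:
#   nonneutral_segments output/test.cr.call.seg
```

If the file has no `CALL` column the function prints a message to stderr and exits with status 1, which usually means the wrong `.seg` file was passed in.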

Visualization

[Figure: image.png]

References

NOTE
