变异检测准确性评估软件hap.py使用

变异检测准确性评估软件hap.py使用

1. hap.py简介

Hap.py 是illumina官方开发的在单倍型水平上比较二倍体基因型的工具。可用于针对金标准变异数据集(例如NA12878样本)对检测的变异结果进行基准测试,以判断检测结果的准确性, 也可以用于比较和评估两个不同变异检测软件检测结果的差异。

2. hap.py 的使用

安装就不多说了,怕麻烦的同学建议直接使用docker镜像,省时省力。docker pull pkrusche/hap.py
下面直接上基本使用命令行吧。

hap.py 
	hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gz \
    hap.py/example/happy/NA12878-GATK3-chr21.vcf.gz \
    -f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz \
    -r hap.py/example/happy/hg38.chr21.fa \
    --threads 5 \
    -o gatk-all

其中,

  1. -r hap.py/example/happy/hg38.chr21.fa 用于指定使用的参考序列。 也可以通过设置环境变量HGREF来指定export HGREF=hap.py/example/happy/hg38.chr21.fa , 其中hap.py/example/happy/hg38.chr21.fa 需要换成你实际使用的fasta序列
  2. hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gzhap.py/example/happy/NA12878-GATK3-chr21.vcf.gz 为待比较的两个VCF文件,替换为你实际VCF文件即可。
  3. -f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz 指定需要比较的区间,即只有该bed区间内的位点用于比较。
  4. --threads 5 指定线程
  5. -o out_prefix 指定输出文件前缀

3. 结果说明

结果包含以下几个文件:

Output File Contents
gatk-all.summary.csv Summary statistics
gatk-all.extended.csv Extended statistics
gatk-all.roc.all.csv All precision / recall data points that were calculated
gatk-all.vcf.gz Annotated VCF according to https://github.com/ga4gh/benchmarking-tools/tree/master/doc/ref-impl
gatk-all.vcf.gz.tbi VCF Tabix Index
gatk-all.metrics.json JSON file containing all computed metrics and tables.
gatk-all.roc.Locations.INDEL.csv ROC for ALL indels only.
gatk-all.roc.Locations.SNP.csv ROC for ALL SNPs only.
gatk-all.roc.Locations.INDEL.PASS.csv ROC for PASSing indels only.
gatk-all.roc.Locations.SNP.PASS.csv ROC for PASSing SNPs only.

其中两个比较重要的文件为:gatk-all.summary.csvgatk-all.vcf.gz
1)gatk-all.summary.csv包含直接的统计结果,包含变异位点数,精确度,召回率,F1值等统计信息,可以直观的看到两者之间的差异。gatk-all.extended.csv 是所有的统计信息,包含了gatk-all.summary.csv里面的信息,以及其他拓展信息。相关字段说明如下(就不中文解释了,有疑问可以一起讨论):

Stratification Column Description
Type Variant type (SNP / INDEL)
Subtype Variant Subtype (ti/tv/indel length, see above
Subset Subset of the genome/stratification region
Filter Variant filters: PASS, SEL, ALL, or a particular filter from the query VCF
Genotype Genotype of benchmarked variants (het / homalt / hetalt)
QQ.Field Which field from the original VCF was used to produce QQ values in truth and query
QQ QQ threshold for ROC values
Metric Column Description
METRIC.Recall Recall for truth variant representation = TRUTH.TP / (TRUTH.TP + TRUTH.FN)
METRIC.Precision Precision of query variants = QUERY.TP / (QUERY.TP + QUERY.FP)
METRIC.Frac_NA Fraction of non-assessed query calls = QUERY.UNK / QUERY.TOTAL
METRIC.F1_Score Harmonic mean of precision and recall = 2METRIC.RecallMetric.Precision/(METRIC.Recall + METRIC.Precision)
TRUTH.TOTAL Total number of truth variants
TRUTH.TP Number of true-positive calls in truth representation (counted via the truth sample column)
TRUTH.FN Number of false-negative calls = calls in truth without matching query call
QUERY.TOTAL Total number of query calls
QUERY.TP Number of true positive calls in query representation (counted via the query sample column)
QUERY.FP Number of false-positive calls in the query file (mismatched query calls within the confident regions)
QUERY.UNK Number of query calls outside the confident regions
FP.gt Number of genotype mismatches (alleles match, but different zygosity)
FP.al Number of allele mismatches (variants matched by position and not by haplotype)
TRUTH.TOTAL.TiTv_ratio Transition / Transversion ratio for all truth variants
TRUTH.TOTAL.het_hom_ratio Het/Hom ratio for all truth variants
TRUTH.FN.TiTv_ratio Transition / Transversion ratio for false-negative variants
TRUTH.FN.het_hom_ratio Het/Hom ratio for false-negative variants
TRUTH.TP.TiTv_ratio Transition / Transversion ratio for true positive variants
TRUTH.TP.het_hom_ratio Het/Hom ratio for true positive variants
QUERY.FP.TiTv_ratio Transition / Transversion ratio for false positive variants
QUERY.FP.het_hom_ratio Het/Hom ratio for false-positive variants
QUERY.TOTAL.TiTv_ratio Transition / Transversion ratio for all query variants
QUERY.TOTAL.het_hom_ratio Het/Hom ratio for all query variants
QUERY.TP.TiTv_ratio Transition / Transversion ratio for true positive variants (query representation)
QUERY.TP.het_hom_ratio Het/Hom ratio for true positive variants (query representation)
QUERY.UNK.TiTv_ratio Transition / Transversion ratio for unknown variants
QUERY.UNK.het_hom_ratio Het/Hom ratio for unknown variants
Subset.Size When using stratification regions, this gives the number of nucleotides contained in the current subset
Subset.IS_CONF.Size This gives the number of confident bases (-f regions) in the current subset

2)gatk-all.vcf.gz 是VCF格式的文件,文件最后两列(TRUTH和QUERY列)包含两个比较VCF样本的位点对比的情况。
变异检测准确性评估软件hap.py使用_第1张图片
其中涉及GT:BD:BK:BI:BVT:BLT:QQ等多个字段

field Description
GT Genotype
BD Decision for call (TP/FP/FN/N)
BK Sub-type for decision (match/mismatch type)
BI Additional comparison information
QQ Variant quality for ROC creation
BVT High-level variant type (SNP/INDEL)
BLT High-level location type (het/homref/hetalt/homalt/nocall)

hap.py的基本用法就酱了,喜欢可以点赞加收藏留着吃灰去吧。

参考:
https://github.com/Illumina/hap.py
https://github.com/Illumina/hap.py/blob/master/doc/happy.md

你可能感兴趣的:(生物信息,python,经验分享)