Hap.py 是illumina官方开发的在单倍型水平上比较二倍体基因型的工具。可用于针对金标准变异数据集(例如NA12878样本)对检测的变异结果进行基准测试,以判断检测结果的准确性, 也可以用于比较和评估两个不同变异检测软件检测结果的差异。
安装就不多说了,怕麻烦的同学建议直接使用docker镜像,省时省力。docker pull pkrusche/hap.py
下面直接上基本使用命令行吧。
hap.py
hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gz \
hap.py/example/happy/NA12878-GATK3-chr21.vcf.gz \
-f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz \
-r hap.py/example/happy/hg38.chr21.fa \
--threads 5 \
-o gatk-all
其中,
-r hap.py/example/happy/hg38.chr21.fa
用于指定使用的参考序列。 也可以通过设置环境变量HGREF来指定export HGREF=hap.py/example/happy/hg38.chr21.fa
, 其中hap.py/example/happy/hg38.chr21.fa
需要换成你实际使用的fasta序列hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gz
和 hap.py/example/happy/NA12878-GATK3-chr21.vcf.gz
为待比较的两个VCF文件,替换为你实际VCF文件即可。 -f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz
指定需要比较的区间,即只有该bed区间内的位点用于比较。--threads 5
指定线程-o out_prefix
指定输出文件前缀结果包含以下几个文件:
Output File | Contents |
---|---|
gatk-all.summary.csv | Summary statistics |
gatk-all.extended.csv | Extended statistics |
gatk-all.roc.all.csv | All precision / recall data points that were calculated |
gatk-all.vcf.gz | Annotated VCF according to https://github.com/ga4gh/benchmarking-tools/tree/master/doc/ref-impl |
gatk-all.vcf.gz.tbi | VCF Tabix Index |
gatk-all.metrics.json | JSON file containing all computed metrics and tables. |
gatk-all.roc.Locations.INDEL.csv | ROC for ALL indels only. |
gatk-all.roc.Locations.SNP.csv | ROC for ALL SNPs only. |
gatk-all.roc.Locations.INDEL.PASS.csv | ROC for PASSing indels only. |
gatk-all.roc.Locations.SNP.PASS.csv | ROC for PASSing SNPs only. |
其中两个比较重要的文件为:gatk-all.summary.csv
和 gatk-all.vcf.gz
。
1)gatk-all.summary.csv
包含直接的统计结果,包含变异位点数,精确度,召回率,F1值等统计信息,可以直观的看到两者之间的差异。gatk-all.extended.csv
是所有的统计信息,包含了gatk-all.summary.csv
里面的信息,以及其他拓展信息。相关字段说明如下(就不中文解释了,有疑问可以一起讨论):
Stratification Column | Description |
---|---|
Type | Variant type (SNP / INDEL) |
Subtype | Variant Subtype (ti/tv/indel length, see above |
Subset | Subset of the genome/stratification region |
Filter | Variant filters: PASS, SEL, ALL, or a particular filter from the query VCF |
Genotype | Genotype of benchmarked variants (het / homalt / hetalt) |
QQ.Field | Which field from the original VCF was used to produce QQ values in truth and query |
QQ threshold for ROC values |
Metric Column | Description |
---|---|
METRIC.Recall | Recall for truth variant representation = TRUTH.TP / (TRUTH.TP + TRUTH.FN) |
METRIC.Precision | Precision of query variants = QUERY.TP / (QUERY.TP + QUERY.FP) |
METRIC.Frac_NA | Fraction of non-assessed query calls = QUERY.UNK / QUERY.TOTAL |
METRIC.F1_Score | Harmonic mean of precision and recall = 2METRIC.RecallMetric.Precision/(METRIC.Recall + METRIC.Precision) |
TRUTH.TOTAL | Total number of truth variants |
TRUTH.TP | Number of true-positive calls in truth representation (counted via the truth sample column) |
TRUTH.FN | Number of false-negative calls = calls in truth without matching query call |
QUERY.TOTAL | Total number of query calls |
QUERY.TP | Number of true positive calls in query representation (counted via the query sample column) |
QUERY.FP | Number of false-positive calls in the query file (mismatched query calls within the confident regions) |
QUERY.UNK | Number of query calls outside the confident regions |
FP.gt | Number of genotype mismatches (alleles match, but different zygosity) |
FP.al | Number of allele mismatches (variants matched by position and not by haplotype) |
TRUTH.TOTAL.TiTv_ratio | Transition / Transversion ratio for all truth variants |
TRUTH.TOTAL.het_hom_ratio | Het/Hom ratio for all truth variants |
TRUTH.FN.TiTv_ratio | Transition / Transversion ratio for false-negative variants |
TRUTH.FN.het_hom_ratio | Het/Hom ratio for false-negative variants |
TRUTH.TP.TiTv_ratio | Transition / Transversion ratio for true positive variants |
TRUTH.TP.het_hom_ratio | Het/Hom ratio for true positive variants |
QUERY.FP.TiTv_ratio | Transition / Transversion ratio for false positive variants |
QUERY.FP.het_hom_ratio | Het/Hom ratio for false-positive variants |
QUERY.TOTAL.TiTv_ratio | Transition / Transversion ratio for all query variants |
QUERY.TOTAL.het_hom_ratio | Het/Hom ratio for all query variants |
QUERY.TP.TiTv_ratio | Transition / Transversion ratio for true positive variants (query representation) |
QUERY.TP.het_hom_ratio | Het/Hom ratio for true positive variants (query representation) |
QUERY.UNK.TiTv_ratio | Transition / Transversion ratio for unknown variants |
QUERY.UNK.het_hom_ratio | Het/Hom ratio for unknown variants |
Subset.Size | When using stratification regions, this gives the number of nucleotides contained in the current subset |
Subset.IS_CONF.Size | This gives the number of confident bases (-f regions) in the current subset |
2)gatk-all.vcf.gz
是VCF格式的文件,文件最后两列(TRUTH和QUERY列)包含两个比较VCF样本的位点对比的情况。
其中涉及GT:BD:BK:BI:BVT:BLT:QQ等多个字段
field | Description |
---|---|
GT | Genotype |
BD | Decision for call (TP/FP/FN/N) |
BK | Sub-type for decision (match/mismatch type) |
BI | Additional comparison information |
Variant quality for ROC creation | |
BVT | High-level variant type (SNP/INDEL) |
BLT | High-level location type (het/homref/hetalt/homalt/nocall) |
hap.py的基本用法就酱了,喜欢可以点赞加收藏留着吃灰去吧。
参考:
https://github.com/Illumina/hap.py
https://github.com/Illumina/hap.py/blob/master/doc/happy.md