Extract high quality variants from VCF files

Here shows a convenient way to extract the high quality variant sites form multiple VCF files using a simple Python script.
The script just screen the whole VCF file, select the high quality variations and generate a new VCF file that only include the high quality variations sites.

The low quality variations as follows were excluded:
  1. variant sites with sequencing depth "lower than min_depth" and "bigger than max_depth"
  2. heterozygous variant sites

Here's the python script:

# -*- coding:utf-8 -*-
import sys

min_depth = sys.argv[1]
max_depth = sys.argv[2]

for line in sys.stdin:
    if line.startswith("#"):
        print(line.strip())
    if line.startswith("Chr"):
        gt = line.split()[9].split(":")[0]
        gt_1 = gt.split("/")[0]
        gt_2 = gt.split("/")[1]
        dp = int(line.split()[9].split(":")[2])
        if gt_1 == gt_2 and dp >= int(min_depth) and dp <= int(max_depth):
            print(line.strip())

Usually the VCF file is in the format of "*.vcf.gz". Let's say your store all your VCF files in one folder, your just need to run the commond line as follows:

for sample in *.vcf.gz; do zcat $sample | python3 select_hq_vcf.py   > sample.hq.vcf; done

你可能感兴趣的:(Extract high quality variants from VCF files)