「生信练习题」从SnpEff注释得到的VCF中过滤4DTV位点

在2019年的一篇NG文章"Resequencing of 414 cultivated and wild watermelon accessions identifies selection for fruit quality traits"的方法部分,作者提到他们使用位于四倍兼并位点( fourfold degenerated sites)上的SNP构建系统发育树。原本的19,725,853个SNP经过这一过滤,最终只剩下89,914 个SNP。

Phylogenetic and population analyses. Bi-allelic SNPs with a missing data rate less than 15% and a minor allele count greater than three were kept for population genomic analyses. Additionally, only SNPs at fourfold degenerated sites (89,914 SNPs) were used to construct a neighbor-joining phylogenetic tree using MEGA7 with 500 bootstraps61.

所谓四倍兼并位点(4DTV), 指无论怎么突变,最终氨基酸都一样的位点,例如"TCA/TCT/TCC/TCG"中的第三位就是一个4DTV,只要在这个位点的碱基,无论怎么突变最终都是丝氨酸(Ser)。

我稍微检索了一下,发现并没有相应的脚本,于是就有了这道练习题。

给定一个用SnpEff注释的VCF文件,请过滤出其中的位于四倍兼并位点上的SNP。

我花了一个晚上时间写了代码,使用方法为python calc_4dTv_in_eff_vcf.py input.vcf output.vcf ref.fa,三个参数分别是输入的vcf,输出的vcf文件名,以及参考基因组序列。

需要注意的是,通常SnpEff会对一个位点注释多个信息,而我只选择其中第一条注释信息进行过滤。

#!/usr/bin/env python3

from sys import argv
from pysam import VariantFile
from pysam import FastaFile

file_in = argv[1]
file_out = argv[2]
fafile = argv[3]

codon = set(["TC", "CT", "CC", "CG", "AC", "GT", "GC", "GG"])
rev_dict = dict(A='T',T='A', C='G', G='C')

bcf_in = VariantFile(file_in)
bcf_out = VariantFile(file_out, "w", header = bcf_in.header)
fa_in = FastaFile(fafile)

for rec in bcf_in.fetch():
    # no annotation available
    if len(rec.info.items()) == 0:
        continue
    ann = rec.info['ANN']
    info = rec.info['ANN'][0].split('|')
    # only use synonymouse variants
    if info[1] != "synonymous_variant":
        continue
    # only the 3rd position can be 4dTv
    if int(info[9][2:-3]) % 3 != 0:
        continue

    # determine the strand by the REF column and mutation
    # if the ref is not same as the mutation site
    if rec.ref == info[9][-3]:
        pre = fa_in.fetch(rec.chrom, rec.pos-3, rec.pos-1)
        pre = pre.upper()
    else:
        tmp = fa_in.fetch(rec.chrom, rec.pos, rec.pos+2)
        tmp = tmp.upper()
        pre = rev_dict[tmp[1]] + rev_dict[tmp[0]]
    if pre not in codon:
        continue
    bcf_out.write(rec)


你可能感兴趣的:(「生信练习题」从SnpEff注释得到的VCF中过滤4DTV位点)