在2019年的一篇NG文章"Resequencing of 414 cultivated and wild watermelon accessions identifies selection for fruit quality traits"的方法部分,作者提到他们使用位于四倍兼并位点( fourfold degenerated sites)上的SNP构建系统发育树。原本的19,725,853个SNP经过这一过滤,最终只剩下89,914 个SNP。
Phylogenetic and population analyses. Bi-allelic SNPs with a missing data rate less than 15% and a minor allele count greater than three were kept for population genomic analyses. Additionally, only SNPs at fourfold degenerated sites (89,914 SNPs) were used to construct a neighbor-joining phylogenetic tree using MEGA7 with 500 bootstraps61.
所谓四倍兼并位点(4DTV), 指无论怎么突变,最终氨基酸都一样的位点,例如"TCA/TCT/TCC/TCG"中的第三位就是一个4DTV,只要在这个位点的碱基,无论怎么突变最终都是丝氨酸(Ser)。
我稍微检索了一下,发现并没有相应的脚本,于是就有了这道练习题。
给定一个用SnpEff注释的VCF文件,请过滤出其中的位于四倍兼并位点上的SNP。
我花了一个晚上时间写了代码,使用方法为python calc_4dTv_in_eff_vcf.py input.vcf output.vcf ref.fa
,三个参数分别是输入的vcf,输出的vcf文件名,以及参考基因组序列。
需要注意的是,通常SnpEff会对一个位点注释多个信息,而我只选择其中第一条注释信息进行过滤。
#!/usr/bin/env python3
from sys import argv
from pysam import VariantFile
from pysam import FastaFile
file_in = argv[1]
file_out = argv[2]
fafile = argv[3]
codon = set(["TC", "CT", "CC", "CG", "AC", "GT", "GC", "GG"])
rev_dict = dict(A='T',T='A', C='G', G='C')
bcf_in = VariantFile(file_in)
bcf_out = VariantFile(file_out, "w", header = bcf_in.header)
fa_in = FastaFile(fafile)
for rec in bcf_in.fetch():
# no annotation available
if len(rec.info.items()) == 0:
continue
ann = rec.info['ANN']
info = rec.info['ANN'][0].split('|')
# only use synonymouse variants
if info[1] != "synonymous_variant":
continue
# only the 3rd position can be 4dTv
if int(info[9][2:-3]) % 3 != 0:
continue
# determine the strand by the REF column and mutation
# if the ref is not same as the mutation site
if rec.ref == info[9][-3]:
pre = fa_in.fetch(rec.chrom, rec.pos-3, rec.pos-1)
pre = pre.upper()
else:
tmp = fa_in.fetch(rec.chrom, rec.pos, rec.pos+2)
tmp = tmp.upper()
pre = rev_dict[tmp[1]] + rev_dict[tmp[0]]
if pre not in codon:
continue
bcf_out.write(rec)