Building a phylogenetic tree from SNPs or indels shows the relationships among the individuals in a population: individuals with similar variants tend to fall into the same cluster of the tree. A good tree gives you a rough classification of the population (how many clusters it contains; individuals of the same subspecies, or close relatives, usually form one cluster), and this is an important part of population genetics. The core principle is to extract the genotype at every SNP site and count the differences between individuals across those sites, which yields the "distance" used by the tree-building algorithms.
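To make the idea of "distance" concrete, below is a minimal Python 3 sketch that computes a pairwise p-distance, i.e. the proportion of SNP sites at which two individuals differ, from a tiny made-up genotype table. It only illustrates the principle; the tools used later in this post do this at scale and with proper models.

# Minimal sketch: pairwise p-distance from SNP genotypes (hypothetical data).
# Keys are samples, values are the alleles observed at each SNP site.
genotypes = {
    "sample_A": ["A", "A", "T", "G", "C"],
    "sample_B": ["A", "G", "T", "G", "C"],
    "sample_C": ["A", "G", "T", "T", "C"],
}

def p_distance(seq1, seq2):
    # Proportion of sites at which the two samples carry different alleles
    diffs = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return diffs / len(seq1)

names = sorted(genotypes)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        d = p_distance(genotypes[names[i]], genotypes[names[j]])
        print(names[i], names[j], round(d, 2))

Samples with smaller distances end up closer together in the tree, which is exactly how the clusters of related individuals form.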
In practice a population often contains hundreds of samples, so the raw SNP call set (and the resulting file) can be very large. Feeding the unfiltered file straight into tree building not only introduces substantial error (low-quality sites distort the distance calculation), but most tree-building software also cannot handle such a large input: it consumes a lot of memory and the computation is heavy, so a single tree can take weeks or months to finish.
So the first step is to filter the raw SNP-calling result.
Filtering the raw SNP VCF file
The classic filtering approach: keep only the relatively reliable variant sites.
Only include SNPs with MAF >= 0.05 and a genotyping rate of at least 90% (i.e. at most 10% missing data); a sketch of these two checks follows the command:
~/biosoft/plink --vcf All_Gm_combine.vcf --maf 0.05 --geno 0.1 --recode vcf-iid --out All_test_11 --allow-extra-chr
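For intuition, this is roughly what --maf 0.05 --geno 0.1 test at every site. The sketch below (Python 3, hypothetical genotype calls for one VCF site) is not PLINK's actual implementation, only the logic of the two thresholds.

# One VCF site across six samples (hypothetical genotype calls).
site_genotypes = ["0/0", "0/1", "1/1", "./.", "0/0", "0/1"]

called = [g for g in site_genotypes if g != "./."]
missing_rate = 1 - len(called) / len(site_genotypes)

# Allele counts over the called genotypes (biallelic site assumed).
alt_alleles = sum(g.count("1") for g in called)
total_alleles = 2 * len(called)
maf = min(alt_alleles, total_alleles - alt_alleles) / total_alleles

# --maf 0.05 keeps the site only if MAF >= 0.05;
# --geno 0.1 keeps it only if the missing rate is <= 0.1.
keep = maf >= 0.05 and missing_rate <= 0.1
print(f"MAF={maf:.2f}, missing={missing_rate:.2f}, keep={keep}")

Here the site would be dropped: one of six samples (about 17%) is missing, even though the MAF passes.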
Next, prune the markers for linkage disequilibrium (LD) and extract the pruned set (a sketch of the r^2 statistic it thresholds follows the commands below).
The key parameter, briefly:
--indep-pairwise <window size (SNPs)> <step size (SNPs)> <r^2 threshold>
~/biosoft/plink --vcf All_test_11.vcf --allow-extra-chr --indep-pairwise 50 10 0.2 --out test_12
~/biosoft/plink --allow-extra-chr --extract test_12.prune.in --make-bed --out test_12.prune.in --recode vcf-iid --vcf All_test_11.vcf
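--indep-pairwise 50 10 0.2 slides a window of 50 SNPs along each chromosome in steps of 10 SNPs and, within every window, prunes SNPs until no pair has a squared genotype correlation (r^2) above 0.2. The sketch below (Python 3, hypothetical 0/1/2 dosage codes) shows one common way to compute that r^2 statistic; PLINK's internal computation may differ in detail.

# r^2 between two SNPs, measured as the squared Pearson correlation of their
# alt-allele dosages (0, 1 or 2 per sample). Hypothetical data.
from statistics import mean

def r_squared(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return (cov * cov) / (var_x * var_y)

snp1 = [0, 1, 2, 0, 1, 2, 0, 2]
snp2 = [0, 1, 2, 0, 1, 1, 0, 2]
print(round(r_squared(snp1, snp2), 2))  # above 0.2, so one of the pair would be pruned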
For the actual tree building I used two different tools, which apply different algorithms.
Method 1: MEGA X
MEGA is a classic tree-building package: it covers everything from alignment to tree construction and offers a good range of parameter options. It also provides GUI and command-line versions for all the major platforms, which makes it convenient to use.
Download:
https://www.megasoftware.net/
First, convert the VCF file into PHYLIP format.
Converting the format into PHYLIP
python2 vcf2phylip.py -i All_edit_Gm_tab.vcf
The script used (vcf2phylip.py):
'''
The script converts a collection of SNPs in VCF format into a PHYLIP, FASTA,
NEXUS, or binary NEXUS file for phylogenetic analysis. The code is optimized
to process VCF files with sizes >1GB. For small VCF files the algorithm slows
down as the number of taxa increases (but is still fast).
'''
__author__ = "Edgardo M. Ortiz"
__credits__ = "Juan D. Palacio-Mejía"
__version__ = "1.5"
__email__ = "[email protected]"
__date__ = "2018-04-24"
import sys
import os
import argparse
def main():
parser = argparse.ArgumentParser(description="Converts SNPs in VCF format into an alignment for phylogenetic analysis")
parser.add_argument("-i", "--input", action="store", dest="filename", required=True,
help="Name of the input VCF file")
parser.add_argument("-m", "--min-samples-locus", action="store", dest="min_samples_locus", type=int, default=4,
help="Minimum of samples required to be present at a locus, default=4 since is the minimum for phylogenetics.")
parser.add_argument("-o", "--outgroup", action="store", dest="outgroup", default="",
help="Name of the outgroup in the matrix. Sequence will be written as first taxon in the alignment.")
parser.add_argument("-p", "--phylip-disable", action="store_true", dest="phylipdisable", default=False,
help="A PHYLIP matrix is written by default unless you enable this flag")
parser.add_argument("-f", "--fasta", action="store_true", dest="fasta", default=False,
help="Write a FASTA matrix, disabled by default")
parser.add_argument("-n", "--nexus", action="store_true", dest="nexus", default=False,
help="Write a NEXUS matrix, disabled by default")
parser.add_argument("-b", "--nexus-binary", action="store_true", dest="nexusbin", default=False,
help="Write a binary NEXUS matrix for analysis of biallelic SNPs in SNAPP, disabled by default")
args = parser.parse_args()
filename = args.filename
min_samples_locus = args.min_samples_locus
outgroup = args.outgroup
phylipdisable = args.phylipdisable
fasta = args.fasta
nexus = args.nexus
nexusbin = args.nexusbin
# Dictionary of IUPAC ambiguities for nucleotides
# '*' means deletion for GATK (and other software?)
# Deletions are ignored when making the consensus
amb = {("A","A"):"A",
("A","C"):"M",
("A","G"):"R",
("A","N"):"A",
("A","T"):"W",
("C","A"):"M",
("C","C"):"C",
("C","G"):"S",
("C","N"):"C",
("C","T"):"Y",
("G","A"):"R",
("G","C"):"S",
("G","G"):"G",
("G","N"):"G",
("G","T"):"K",
("N","A"):"A",
("N","C"):"C",
("N","G"):"G",
("N","N"):"N",
("N","T"):"T",
("T","A"):"W",
("T","C"):"Y",
("T","G"):"K",
("T","N"):"T",
("T","T"):"T",
("*","*"):"-",
("A","*"):"A",
("*","A"):"A",
("C","*"):"C",
("*","C"):"C",
("G","*"):"G",
("*","G"):"G",
("T","*"):"T",
("*","T"):"T",
("N","*"):"N",
("*","N"):"N"}
# Dictionary for translating biallelic SNPs into SNAPP
# 0 is homozygous reference
# 1 is heterozygous
# 2 is homozygous alternative
gen_bin = {"./.":"?",
".|.":"?",
"0/0":"0",
"0|0":"0",
"0/1":"1",
"0|1":"1",
"1/0":"1",
"1|0":"1",
"1/1":"2",
"1|1":"2"}
# Process header of VCF file
with open(filename) as vcf:
# Create a list to store sample names
sample_names = []
# Keep track of longest sequence name for padding with spaces in the output file
len_longest_name = 0
# Look for the line in the VCF header with the sample names
for line in vcf:
if line.startswith("#CHROM"):
# Split line into fields
broken = line.strip("\n").split("\t")
# If the minimum-samples-per-locus parameter is larger than the number of
# species in the alignment make it the same as the number of species
if min_samples_locus > len(broken[9:]):
min_samples_locus = len(broken[9:])
# Create a list of sample names and the keep track of the longest name length
for i in range(9, len(broken)):
name_sample = broken[i].replace("./","") # GATK adds "./" to sample names
sample_names.append(name_sample)
len_longest_name = max(len_longest_name, len(name_sample))
break
vcf.close()
# Output filename will be the same as input file, indicating the minimum of samples specified
outfile = filename.replace(".vcf",".min"+str(min_samples_locus))
# We need to create an intermediate file to hold the sequence data
# vertically and then transpose it to create the matrices
if fasta or nexus or not phylipdisable:
temporal = open(outfile+".tmp", "w")
# if binary NEXUS is selected also create a separate temporal
if nexusbin:
temporalbin = open(outfile+".bin.tmp", "w")
##################
# PROCESS VCF FILE
index_last_sample = len(sample_names)+9
# Start processing SNPs of VCF file
with open(filename) as vcf:
# Initialize line counter
snp_num = 0
snp_accepted = 0
snp_shallow = 0
snp_multinuc = 0
snp_biallelic = 0
while True:
# Load large chunks of file into memory
vcf_chunk = vcf.readlines(100000)
if not vcf_chunk:
break
# Now process the SNPs one by one
for line in vcf_chunk:
if not line.startswith("#") and line.strip("\n") != "": # pyrad sometimes produces an empty line after the #CHROM line
# Split line into columns
broken = line.strip("\n").split("\t")
for g in range(9,len(broken)):
if broken[g] in [".", ".|."]:
broken[g] = "./."
# Keep track of number of genotypes processed
snp_num += 1
# Print progress every 500000 lines
if snp_num % 500000 == 0:
print str(snp_num)+" genotypes processed"
# Check if the SNP has the minimum of samples required
if (len(broken[9:]) - ''.join(broken[9:]).count("./.")) >= min_samples_locus:
# Check that ref genotype is a single nucleotide and alternative genotypes are single nucleotides
if len(broken[3]) == 1 and (len(broken[4])-broken[4].count(",")) == (broken[4].count(",")+1):
# Add to running sum of accepted SNPs
snp_accepted += 1
# If nucleotide matrices are requested
if fasta or nexus or not phylipdisable:
# Create a dictionary for genotype to nucleotide translation
# each SNP may code the nucleotides in a different manner
nuc = {str(0):broken[3], ".":"N"}
for n in range(len(broken[4].split(","))):
nuc[str(n+1)] = broken[4].split(",")[n]
# Translate genotypes into nucleotides and the obtain the IUPAC ambiguity
# for heterozygous SNPs, and append to DNA sequence of each sample
site_tmp = ''.join([(amb[(nuc[broken[i][0]], nuc[broken[i][1]])]) for i in range(9, index_last_sample)])
# Write entire row of single nucleotide genotypes to temporary file
temporal.write(site_tmp+"\n")
# Write binary NEXUS for SNAPP if requested
if nexusbin:
# Check taht the SNP only has two alleles
if len(broken[4]) == 1:
# Add to running sum of biallelic SNPs
snp_biallelic += 1
# Translate genotype into 0 for homozygous ref, 1 for heterozygous, and 2 for homozygous alt
binsite_tmp = ''.join([(gen_bin[broken[i][0:3]]) for i in range(9, index_last_sample)])
# Write entire row to temporary file
temporalbin.write(binsite_tmp+"\n")
else:
# Keep track of loci rejected due to multinucleotide genotypes
snp_multinuc += 1
# Keep track of loci rejected due to exceeded missing data
snp_shallow += 1
else:
# Keep track of loci rejected due to exceeded missing data
snp_shallow += 1
# Print useful information about filtering of SNPs
print str(snp_num) + " genotypes processed in total"
print "\n"
print str(snp_shallow) + " genotypes were excluded because they exceeded the amount of missing data allowed"
print str(snp_multinuc) + " genotypes passed missing data filter but were excluded for not being SNPs"
print str(snp_accepted) + " SNPs passed the filters"
if nexusbin:
print str(snp_biallelic) + " SNPs were biallelic and selected for binary NEXUS"
print "\n"
vcf.close()
if fasta or nexus or not phylipdisable:
temporal.close()
if nexusbin:
temporalbin.close()
#######################
# WRITE OUTPUT MATRICES
if not phylipdisable:
output_phy = open(outfile+".phy", "w")
header_phy = str(len(sample_names))+" "+str(snp_accepted)+"\n"
output_phy.write(header_phy)
if fasta:
output_fas = open(outfile+".fasta", "w")
if nexus:
output_nex = open(outfile+".nexus", "w")
header_nex = "#NEXUS\n\nBEGIN DATA;\n\tDIMENSIONS NTAX=" + str(len(sample_names)) + " NCHAR=" + str(snp_accepted) + ";\n\tFORMAT DATATYPE=DNA" + " MISSING=N" + " GAP=- ;\nMATRIX\n"
output_nex.write(header_nex)
if nexusbin:
output_nexbin = open(outfile+".bin.nexus", "w")
header_nexbin = "#NEXUS\n\nBEGIN DATA;\n\tDIMENSIONS NTAX=" + str(len(sample_names)) + " NCHAR=" + str(snp_biallelic) + ";\n\tFORMAT DATATYPE=SNP" + " MISSING=?" + " GAP=- ;\nMATRIX\n"
output_nexbin.write(header_nexbin)
# Store index of outgroup in list of sample names
idx_outgroup = "NA"
# Write outgroup as first sequence in alignment if the name is specified
if outgroup in sample_names:
idx_outgroup = sample_names.index(outgroup)
if fasta or nexus or not phylipdisable:
with open(outfile+".tmp") as tmp_seq:
seqout = ""
# This is where the transposing happens
for line in tmp_seq:
seqout += line[idx_outgroup]
# Write FASTA line
if fasta:
output_fas.write(">"+sample_names[idx_outgroup]+"\n"+seqout+"\n")
# Pad sequences names and write PHYLIP or NEXUS lines
padding = (len_longest_name + 3 - len(sample_names[idx_outgroup])) * " "
if not phylipdisable:
output_phy.write(sample_names[idx_outgroup]+padding+seqout+"\n")
if nexus:
output_nex.write(sample_names[idx_outgroup]+padding+seqout+"\n")
# Print current progress
print "Outgroup, "+outgroup+", added to the matrix(ces)."
if nexusbin:
with open(outfile+".bin.tmp") as bin_tmp_seq:
seqout = ""
# This is where the transposing happens
for line in bin_tmp_seq:
seqout += line[idx_outgroup]
# Write line of binary SNPs to NEXUS
padding = (len_longest_name + 3 - len(sample_names[idx_outgroup])) * " "
output_nexbin.write(sample_names[idx_outgroup]+padding+seqout+"\n")
# Print current progress
print "Outgroup, "+outgroup+", added to the binary matrix."
# Write sequences of the ingroup
for s in range(0, len(sample_names)):
if s != idx_outgroup:
if fasta or nexus or not phylipdisable:
with open(outfile+".tmp") as tmp_seq:
seqout = ""
# This is where the transposing happens
for line in tmp_seq:
seqout += line[s]
# Write FASTA line
if fasta:
output_fas.write(">"+sample_names[s]+"\n"+seqout+"\n")
# Pad sequences names and write PHYLIP or NEXUS lines
padding = (len_longest_name + 3 - len(sample_names[s])) * " "
if not phylipdisable:
output_phy.write(sample_names[s]+padding+seqout+"\n")
if nexus:
output_nex.write(sample_names[s]+padding+seqout+"\n")
# Print current progress
print "Sample "+str(s+1)+" of "+str(len(sample_names))+", "+sample_names[s]+", added to the nucleotide matrix(ces)."
if nexusbin:
with open(outfile+".bin.tmp") as bin_tmp_seq:
seqout = ""
# This is where the transposing happens
for line in bin_tmp_seq:
seqout += line[s]
# Write line of binary SNPs to NEXUS
padding = (len_longest_name + 3 - len(sample_names[s])) * " "
output_nexbin.write(sample_names[s]+padding+seqout+"\n")
# Print current progress
print "Sample "+str(s+1)+" of "+str(len(sample_names))+", "+sample_names[s]+", added to the binary matrix."
if not phylipdisable:
output_phy.close()
if fasta:
output_fas.close()
if nexus:
output_nex.write(";\nEND;\n")
output_nex.close()
if nexusbin:
output_nexbin.write(";\nEND;\n")
output_nexbin.close()
if fasta or nexus or not phylipdisable:
os.remove(outfile+".tmp")
if nexusbin:
os.remove(outfile+".bin.tmp")
print "\nDone!\n"
if __name__ == "__main__":
main()
Then use the built-in converter in MEGA X to turn the PHYLIP file into a MEGA-format file.
Build the tree in MEGA X with its commonly used settings; since my samples are all different subspecies of the same species, an NJ (neighbor-joining) tree is sufficient.
Finally a .nwk file is produced, which can then be polished with other visualization tools.
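If you want a quick look at the exported tree before polishing it elsewhere, Biopython can read the Newick file directly. The sketch below assumes Biopython is installed and uses a hypothetical output file name; substitute whatever MEGA X actually wrote.

# Quick inspection of the Newick tree exported by MEGA X.
from Bio import Phylo

tree = Phylo.read("All_samples_NJ.nwk", "newick")   # hypothetical file name
print(tree.count_terminals(), "samples in the tree")
Phylo.draw_ascii(tree)   # rough text rendering; use iTOL, ggtree, etc. for final figures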
Method 2: SNPhylo
Besides MEGA, SNPhylo is also well suited to building a phylogenetic tree from a variant file that contains many individuals. It has a built-in filtering step that filters according to the parameters you choose (or its defaults), and at the end it also outputs a plot of the tree (not particularly pretty, so some post-processing is still recommended). The plotting relies on R packages, so if you want SNPhylo to draw the figure for you, the required R packages must be installed as well.
Download:
http://chibba.pgml.uga.edu/snphylo/
Its algorithm, workflow, and usage, as documented in the manual:
Usage:
snphylo.sh -v VCF_file [-p Maximum_PLCS (5)] [-c Minimum_depth_of_coverage (5)]|-H HapMap_file [-p Maximum_PNSS (5)]|-s Simple_SNP_file [-p Maximum_PNSS (5)]|-d GDS_file [-l LD_threshold (0.1)] [-m MAF_threshold (0.1)] [-M Missing_rate (0.1)] [-o Outgroup_sample_name] [-P Prefix_of_output_files (snphylo.output)] [-b [-B The_number_of_bootstrap_samples (100)]] [-a The_number_of_the_last_autosome (22)] [-r] [-A] [-h]
Options:
-A: Perform multiple alignment by MUSCLE
-b: Performs (non-parametric) bootstrap analysis and generate a tree
-h: Show help and exit
-r: Skip the step removing low quality data (-p and -c option are ignored)
Acronyms:
PLCS: The percent of Low Coverage Sample
PNSS: The percent of Sample which has no SNP information
LD: Linkage Disequilibrium
MAF: Minor Allele Frequency
Simple SNP File Format:
#Chrom Pos SampleID1 SampleID2 SampleID3 ...
1 1000 A A T ...
1 1002 G C G ...
...
2 2000 G C G ...
2 2002 A A T ...
Once the tool is installed and the input file is in place, a single command starts the run:
~/biosoft/SNPhylo/snphylo.sh -v Gpan.prune.in.vcf -A -b -B 300
When the run finishes, the following files are produced:
-rw-r----- 1 21230309 domain users 98K Sep 4 09:08 snphylo.output.bs.png
-rw-r----- 1 21230309 domain users 27K Sep 4 09:08 snphylo.output.bs.tree
-rw-r----- 1 21230309 domain users 4.9M Sep 1 20:04 snphylo.output.fasta
-rw-r----- 1 21230309 domain users 488M Sep 1 19:48 snphylo.output.filtered.vcf
-rw-r----- 1 21230309 domain users 31M Sep 1 20:03 snphylo.output.gds
-rw-r----- 1 21230309 domain users 72K Sep 1 20:03 snphylo.output.id.txt
-rw-r----- 1 21230309 domain users 95K Sep 3 19:37 snphylo.output.ml.png
-rw-r----- 1 21230309 domain users 25K Sep 3 19:37 snphylo.output.ml.tree
-rw-r----- 1 21230309 domain users 257K Sep 3 19:37 snphylo.output.ml.txt
-rw-r----- 1 21230309 domain users 5.4M Sep 1 20:49 snphylo.output.phylip.txt
-rw-r----- 1 21230309 domain users 20K Sep 1 19:16 snphylo.tar.gz
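The *.bs.tree and *.ml.tree files are plain Newick, so they can be re-rooted and re-exported before plotting. Below is a small sketch with Biopython, assuming it is installed; "outgroup_sample" is a placeholder for one of your actual sample IDs (SNPhylo can also handle rooting itself via its -o option).

# Re-root the SNPhylo bootstrap tree on an outgroup and save it for plotting.
from Bio import Phylo

tree = Phylo.read("snphylo.output.bs.tree", "newick")
tree.root_with_outgroup("outgroup_sample")   # placeholder sample name
Phylo.write(tree, "snphylo.output.bs.rooted.nwk", "newick")
print("Rooted tree with", tree.count_terminals(), "samples written")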