Call种内微生物SNP用Parsnp

parsnp可进行核心基因组对齐、call核心基因组SNP、构建SNP进化树。

Github: https://github.com/marbl/parsnp
Manual: https://harvest.readthedocs.io/en/latest/content/parsnp/tutorial.html

文章:The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes
中文:用于快速核心基因组比对和数千种种内微生物基因组可视化的 Harvest 套件
杂志:Genome Biology
时间:2014
被引:913

文章知识点:
1 核心基因组是全基因组的自己,这些同源序列可用于多基因组序列对齐
2 核心基因垂直传递,信噪比更强
3 核心基因SNP是研究进化关系最可靠的突变
4 核心基因组SNP是目前研究近源微生物从大标准方法
5 core-genome SNP typing有三种方法:
read mapping: 成本最低
k-mer analyses
whole-genome alignment:原方法,检测SNP INDEL金标准,依赖组装

Parsnp原理:
1 基于suffix graph data structure快速检测maximal unique matches (MUMs)
MUM是很多pairwise/multiple基因组对齐的基础
2 Gingr可视化比对结构
3 输入要求:同亚种或ANI>=97%,基于MUMi distance指定纳入标准
4 Directed Acyclic Graph (DAG) 数据结构,也称Compressed Suffix Graph (CSG)
5 multi-MUMs检测locally collinear blocks (LCBs),构建核心基因组对齐的基础
6 LCB作为anchor,MUSCLE定位gap
7 以上结果包含核心基因组所有的:SNP, INDEL, 结构变异
8 确定核心基因组SNP的方法:(1) repetitive sequence; (2) small LCB size; (3) poor alignment quality; (4) poor base quality; and (5) possible recombination。FreeBayes推测碱基质量,PhiPack检测碱基重排。
9 FastTree2构建进化树

方法比较:

引用:
1 Genome sequence of a clinical Salmonella Enteritidis sequence type 11 strain from South Africa. Journal of Global Antimicrobial Resistance. 2019
2 OutbreakFinder: a visualization tool for rapid detection of bacterial strain clusters based on optimized multidimensional scaling. PeerJ. 2019

安装

conda create -n parsnp
conda install parsnp
parsnp --version
#|--Parsnp 1.5.6--|
#For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
#parsnp 1.5.6

使用

parsnp \
-c -p 4 -n mafft \
-r ./T2103096416.fna \
-d ./input/*.fna \
-o ./parsnp/

参数
-d Input genomes
-g
-r
-o OUTPUT_DIR
-c, --curated (c)urated genome directory
--use-fasttree Use fasttree instead of RaxML
--vcf Generate VCF file.
-p THREADS, --threads THREADS
-n {mafft,muscle,fsa,prank}, --alignment-program
-u, --unaligned Ouput unaligned regions

过程

19:51:26 - INFO - <>
19:51:26 - INFO - No genbank file provided for reference annotations, skipping..
19:51:26 - INFO - Running Parsnp multi-MUM search and libMUSCLE aligner...
19:51:45 - INFO - Reconstructing core genome phylogeny...
19:51:45 - INFO - Aligned 4 genomes in 18.35 seconds
19:51:45 - INFO - Parsnp finished! All output available in ./parsnp/

结果


tree 树文件
xmfa 基因组对齐结果
Ini 运行配置

xmfa文件格式同mauve对齐结果文件,但丢弃了unalign的序列

>1:0-109 + cluster58 :p1
>2:1399845-1399954 + cluster58 s11:p323
>3:2214877-2214986 - cluster58 s41:p1147
>4:2220684-2220793 - cluster58 s51:p849
=
>1:138-164 + cluster59 s1:p138
>2:1399997-1400023 + cluster59 s11:p475
>3:2214808-2214834 - cluster59 s41:p1013
>4:2220615-2220641 - cluster59 s51:p715
=

Gingr可视化

Manual: https://harvest.readthedocs.io/en/latest/content/gingr.html

只有MAC Linux可视化软件才能用,无命令行模式,使用测试数据直接打开如下:

Harvest处理parsnap结果

Manual:https://harvest.readthedocs.io/en/latest/content/harvest/quickstart.html#download-install-run

下载解压可直接使用

wget -c https://github.com/marbl/harvest-tools/releases/download/v1.2/harvesttools-Linux64-v1.2.tar.gz/
tar -zxvf harvesttools-Linux64-v1.2.tar.gz
/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/harvesttools-Linux64-v1.2/harvesttools -h

参数:

harvesttools options:
   -i 
   -b ,,""
   -B 
   -f 
   -F 
   -g 
   -a 
   -m 
   -M 
   -n 
   -N 
   --midpoint-reroot (reroot the tree at its midpoint after loading)
   -o 
   -S 
   -u 0/1 (update the branch values to reflect genome length)
   -v 
   -V 
   -x 
   -X 
   -h (show this help)
   -q (quiet mode)

利用ggr文件获取snp fasta

/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/harvesttools-Linux64-v1.2/harvesttools \
-i parsnp.ggr \
-S parsnp.snps
# 过程
Loading parsnp.ggr...
Writing parsnp.snps...
# remove enter
python3 /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/analysis/oral/hgt/db_blast/trim_enter.py \
parsnp.snps parsnp.snps2

结果:

利用ggr文件获取snp vcf

/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/harvesttools-Linux64-v1.2/harvesttools \
-i parsnp.ggr \
-V parsnp.vcf

结果:

##FILTER=
##FILTER=
##FILTER=
##FILTER=
##FILTER= 20 indels">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  MF3KT15001B.fna.ref        BF3KT15003B.fna MF3KT15004.fna  PF3KT15003.fna
MF3KT15001B.Scaf2       48063   ACAGTGGGAT.AACTTACCCG   A       G       40      PASS       NA      GT      0       1       0       0
MF3KT15001B.Scaf2       48079   CCCGACTGGC.AACAAATTGG   A       G       40      PASS       NA      GT      0       1       0       0
MF3KT15001B.Scaf6       74973   AGTTGCTTGG.TTTAGAGTTT   T       C       40      PASS       NA      GT      0       1       1       1
MF3KT15001B.Scaf6       102844  ACATCATGAT.CTTCAAGTTC   C       G       40      PASS       NA      GT      0       1       0       0

更多:
细菌基因组重测序:使用snippy一条命令从sra到vcf
如何对细菌基因组做SNP calling

你可能感兴趣的:(Call种内微生物SNP用Parsnp)