Overview
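Average nucleotide identity (ANI) and average amino acid identity (AAI) are two basic metrics commonly used in comparative genomics. TETRA and Mash are also often used to support species classification. As a rule of thumb, two genomes with ANI > 95%, TETRA > 0.99, AAI > 95% and Mash distance < 0.05 are considered to belong to the same species.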
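The cutoffs above can be combined into a small helper. This is a minimal sketch, assuming the thresholds exactly as quoted in this post; the function names are made up for illustration, and requiring all supplied metrics to agree is one possible policy, not something the cited tools prescribe.

```python
# Rule-of-thumb thresholds quoted in the overview (assumed, not tool defaults).
ANI_MIN = 95.0    # percent
AAI_MIN = 95.0    # percent
TETRA_MIN = 0.99  # correlation coefficient
MASH_MAX = 0.05   # Mash distance


def same_species_votes(ani=None, aai=None, tetra=None, mash=None):
    """Return {metric: bool} for every metric that was supplied."""
    votes = {}
    if ani is not None:
        votes["ANI"] = ani > ANI_MIN
    if aai is not None:
        votes["AAI"] = aai > AAI_MIN
    if tetra is not None:
        votes["TETRA"] = tetra > TETRA_MIN
    if mash is not None:
        votes["Mash"] = mash < MASH_MAX
    return votes


def same_species(**metrics):
    """True only if at least one metric is given and all of them agree."""
    votes = same_species_votes(**metrics)
    return bool(votes) and all(votes.values())


# Hypothetical values for illustration only.
close_pair = same_species(ani=99.1, tetra=0.999, mash=0.01)
distant_pair = same_species(ani=83.9, tetra=0.95)
print(close_pair, distant_pair)
```

When the metrics disagree, `same_species_votes` shows which one is the outlier, which is usually more informative than the single boolean.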
Motivating paper
Title: Insights on the Evolutionary Genomics of the Blautia Genus: Potential New Species and Genetic Content Among Lineages
Topic: comparative genomics of the genus Blautia
Journal: Front Microbiol
Year: 2021
The paper combines three methods to delineate species groups/clusters.
1. AAI
AAI: CompareM
GitHub: https://github.com/dparks1134/CompareM
1.1 Installation: use a Python 3.6 conda environment
conda create -n python3.6 python=3.6
conda activate python3.6
conda install comparem
comparem -h
comparem aai_wf -h
Input can be given in either of two ways:
i) a text file in which each line gives the location of a genome
ii) a directory containing all genomes to be compared
1.2 Usage:
comparem aai_wf --cpus 16 --tmp_dir ./ -x fna genome out_comparem
Parameters:
--cpus: number of threads
--tmp_dir: location for temporary files
-x: file extension of the genome files
genome: input directory (containing genome_A.fna, genome_B.fna, genome_B2.fna)
1.3 Results:
The documentation does not say which genome acts as reference and which as query. The output columns are:
Identifier of the first genome
Number of genes in the first genome
Identifier of the second genome
Number of genes in the second genome
Number of orthologous genes identified between the two genomes
The mean amino acid identity (AAI) of orthologous genes
The standard deviation of the AAI across orthologous genes
The orthologous fraction (OF) between the two genomes, defined as the number of orthologous genes divided by the minimum number of genes in either genome
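The eight columns above can be read into named records and the OF recomputed from its definition. This is a sketch under assumptions: the table is taken to be tab-separated with one header line and the columns in the order listed, and the sample row holds made-up numbers. Check the real output file before relying on it; CompareM may also report OF on a different scale, such as a percentage.

```python
import csv
import io
from collections import namedtuple

AaiRow = namedtuple(
    "AaiRow",
    "genome_a genes_a genome_b genes_b orthologs mean_aai std_aai reported_of",
)


def read_aai_summary(handle):
    """Parse a tab-separated AAI table (header line assumed) into AaiRow records."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    rows = []
    for f in reader:
        if len(f) < 8:
            continue  # ignore blank or truncated lines
        rows.append(AaiRow(f[0], int(f[1]), f[2], int(f[3]),
                           int(f[4]), float(f[5]), float(f[6]), float(f[7])))
    return rows


def orthologous_fraction(row):
    """OF as defined above: orthologs / min(gene counts), as a fraction 0..1."""
    return row.orthologs / min(row.genes_a, row.genes_b)


# Toy table with invented values, only to exercise the parser.
toy = io.StringIO(
    "#Genome A\tGenes in A\tGenome B\tGenes in B\tOrthologs\tMean AAI\tStd AAI\tOF\n"
    "genome_A\t3000\tgenome_B\t2800\t2100\t96.5\t8.1\t75.0\n"
)
rows = read_aai_summary(toy)
of_value = orthologous_fraction(rows[0])
print(rows[0].genome_a, rows[0].genome_b, rows[0].mean_aai, of_value)
```

A high mean AAI with a low OF means the identity is supported by few shared genes, so both numbers are worth reading together.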
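Reinstalling and running CompareM later failed with: OSError: [Errno 122] Disk quota exceeded
The cause was a full home directory; the cluster allocates very little home space. Deleting large files fixes it:
du -h ./
# 102M
rm -r .cache
du -h ./
# 196K
Note that the CompareM GitHub page says the tool is no longer maintained and recommends newer tools.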
AAI: EzAAI
Website: http://leb.snu.ac.kr/ezaai
GitHub: https://github.com/lebsnu/ezaai
On Windows, download the jar file, then install the dependencies with conda:
conda activate xgenome
conda install prodigal mmseqs2
2. ANI: FastANI
Title: High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries
Journal: Nature Communications
Year: 2018
Citations: 978 (Google Scholar, 2021-11-09)
GitHub: https://github.com/ParBLiSS/FastANI
2.1 Installation: use the Python 3.6 conda environment
conda install fastANI
fastANI --help
2.2 Usage
fastANI -t 16 \
-q genome/genome_B.fasta \
-r genome/genome_B2.fasta \
-o out_fastani
Parameters:
-q: query genome
-r: reference genome
--rl: reference list
--ql: query list
-t: threads
2.3 Results:
- genome_B.fasta and genome_B2.fasta are copies of the same file
cat out_fastani
genome/genome_B.fasta genome/genome_B2.fasta 100 612 612
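Notes:
The columns are: query genome, reference genome, ANI value, number of fragments mapped as orthologous matches, and total number of query fragments.
The ANI is 100: out of the 612 sequence fragments from genome B, all 612 were aligned to B2 as orthologous matches.
Swapping reference and query can give a slightly different ANI.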
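Because the two directions can differ, one common way to get a single value per pair is to average them. The sketch below assumes whitespace-free genome paths and the five-column output described above; the two sample lines are the AF34-13 / AF67-21pH9TA records from the matrix run in section 2.4, and the helper names are illustrative.

```python
from collections import defaultdict


def parse_fastani(lines):
    """Yield (query, reference, ani, mapped, total) from FastANI output lines."""
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        query, ref, ani, mapped, total = fields
        yield query, ref, float(ani), int(mapped), int(total)


def reciprocal_mean_ani(records):
    """Average ANI over both directions of each unordered genome pair."""
    by_pair = defaultdict(list)
    for query, ref, ani, _mapped, _total in records:
        if query == ref:
            continue  # self-comparisons are not informative
        by_pair[tuple(sorted((query, ref)))].append(ani)
    return {pair: sum(vals) / len(vals) for pair, vals in by_pair.items()}


sample = [
    "genome/AF34-13.fna\tgenome/AF67-21pH9TA.fna\t98.9514\t1065\t1150",
    "genome/AF67-21pH9TA.fna\tgenome/AF34-13.fna\t98.9199\t1070\t1222",
]
records = list(parse_fastani(sample))
mean_ani = reciprocal_mean_ani(records)
aligned_fraction = records[0][3] / records[0][4]  # mapped / total query fragments
print(mean_ani, round(aligned_fraction, 3))
```

For this pair the mean is about 98.9357, which agrees to roughly three decimals with the value FastANI writes for the same pair in its matrix file.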
2.4 Computing an ANI matrix for multiple genomes
1 Prepare an input file listing the paths of all genomes
cat genome_ref.tsv
genome/AF34-13.fna
genome/AF67-21pH9TA.fna
genome/AF81-08TA.fna
genome/AM27-31LB.fna
genome/AM49-4BH.fna
genome/AM53-13BH.fna
genome/AM59-24XD.fna
genome/OM05-6BH.fna
genome/OM05-9BH.fna
genome/OM07-10AC.fna
genome/OM07-9AC.fna
genome/TF01-11.fna
genome/TF08-3AC.fna
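A list file like the one above can be generated rather than typed by hand. This is a small sketch assuming all genomes sit in one directory and share the `.fna` suffix; `write_genome_list` is a made-up helper name, and the demo runs against a temporary directory with dummy files.

```python
import tempfile
from pathlib import Path


def write_genome_list(genome_dir, list_path, suffix=".fna"):
    """Write one genome path per line, sorted, and return the paths."""
    paths = sorted(str(p) for p in Path(genome_dir).glob("*" + suffix))
    Path(list_path).write_text("".join(p + "\n" for p in paths))
    return paths


# Demo on throwaway files so the example does not touch real data.
with tempfile.TemporaryDirectory() as tmp:
    genome_dir = Path(tmp) / "genome"
    genome_dir.mkdir()
    for name in ("TF01-11.fna", "AF34-13.fna", "notes.txt"):
        (genome_dir / name).write_text(">contig1\nACGT\n")
    list_file = Path(tmp) / "genome_ref.tsv"
    listed = write_genome_list(genome_dir, list_file)
    listed_names = [Path(p).name for p in listed]
    written_lines = list_file.read_text().splitlines()

print(listed_names)
```

The same file can be passed to both `--ql` and `--rl` for an all-against-all comparison.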
2 Run
fastANI -t 4 \
--ql genome_ref.tsv \
--rl genome_ref.tsv \
-o out_matrix.txt \
--matrix
3 Results
head out_matrix.txt
genome/AF34-13.fna genome/AF34-13.fna 100 1148 1150
genome/AF34-13.fna genome/TF01-11.fna 99.1324 1044 1150
genome/AF34-13.fna genome/AF81-08TA.fna 99.0977 1023 1150
genome/AF34-13.fna genome/AF67-21pH9TA.fna 98.9514 1065 1150
genome/AF34-13.fna genome/AM27-31LB.fna 98.1427 982 1150
genome/AF34-13.fna genome/TF08-3AC.fna 83.9318 619 1150
genome/AF67-21pH9TA.fna genome/AF67-21pH9TA.fna 100 1221 1222
genome/AF67-21pH9TA.fna genome/TF01-11.fna 99.5597 1074 1222
genome/AF67-21pH9TA.fna genome/AF81-08TA.fna 99.3177 1045 1222
genome/AF67-21pH9TA.fna genome/AF34-13.fna 98.9199 1070 1222
head out_matrix.txt.matrix
13
genome/AF34-13.fna
genome/AF67-21pH9TA.fna 98.935646
genome/AF81-08TA.fna 99.104561 99.317062
genome/AM27-31LB.fna 98.067337 98.105423 98.159103
genome/AM49-4BH.fna NA NA NA NA
genome/AM53-13BH.fna NA NA NA NA 98.027405
genome/AM59-24XD.fna NA NA NA NA 80.823174 NA
genome/OM05-6BH.fna NA NA NA NA 97.929825 98.926117 NA
genome/OM05-9BH.fna NA NA NA NA 97.908569 98.953743 NA 99.978661
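Pairs shown as NA are those for which FastANI reported no value. According to the FastANI README, no output is produced when the ANI is much below 80%; such pairs are better compared at the amino acid level. Two genomes with ANI above 95% can be considered the same bacterial species.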
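The lower-triangular matrix can be turned into species groups by linking every pair at or above the 95% cutoff. The sketch below uses single-linkage grouping with a small union-find, which is one simple choice and not something FastANI does itself. The input is the first seven genome rows of the matrix shown above; the leading count line is read but not relied on.

```python
def parse_lower_triangle(lines):
    """Parse a PHYLIP-style lower-triangular matrix into names and pair values."""
    rows = [ln.split() for ln in lines if ln.strip()]
    body = rows[1:]  # rows[0] holds the genome count
    names = [r[0] for r in body]
    ani = {}
    for i, r in enumerate(body):
        for j, value in enumerate(r[1:]):
            if value != "NA":
                ani[(names[j], names[i])] = float(value)
    return names, ani


def species_clusters(names, ani, threshold=95.0):
    """Single-linkage groups: genomes joined by any pair with ANI >= threshold."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), value in ani.items():
        if value >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


matrix_excerpt = """7
genome/AF34-13.fna
genome/AF67-21pH9TA.fna\t98.935646
genome/AF81-08TA.fna\t99.104561\t99.317062
genome/AM27-31LB.fna\t98.067337\t98.105423\t98.159103
genome/AM49-4BH.fna\tNA\tNA\tNA\tNA
genome/AM53-13BH.fna\tNA\tNA\tNA\tNA\t98.027405
genome/AM59-24XD.fna\tNA\tNA\tNA\tNA\t80.823174\tNA
""".splitlines()

names, ani = parse_lower_triangle(matrix_excerpt)
clusters = species_clusters(names, ani)
for group in clusters:
    print(group)
```

Single linkage can chain distant genomes together through intermediates, so borderline groups deserve a look at the individual pairwise values.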
2.5 plot mapping
Method: https://github.com/ParBLiSS/FastANI
R script: https://github.com/ParBLiSS/FastANI/tree/master/scripts
Usage:
./fastANI -q B_quintana.fna -r B_henselae.fna --visualize -o fastani.out
Rscript scripts/visualize.R B_quintana.fna B_henselae.fna fastani.out.visual
3. Mash
Website: https://mash.readthedocs.io/en/latest/
Papers:
Title: Mash Screen: high-throughput sequence containment estimation for genome discovery
Journal: Genome Biol.
Year: 2019
Citations: 75 (Google Scholar, 2021-11-24)
Title: Mash: fast genome and metagenome distance estimation using MinHash
Journal: Genome Biol.
Year: 2016
Citations: 1214
Installation
conda activate mash
conda install mash
mash --version
# 2.3
Parameters:
Mash version 2.3
Type 'mash --license' for license and copyright information.
Usage:
mash [options] [arguments ...]
Commands:
bounds Print a table of Mash error bounds.
dist Estimate the distance of query sequences to references.
info Display information about sketch files.
paste Create a single sketch file from multiple sketch files.
screen Determine whether query sequences are within a larger mixture of
sequences.
sketch Create sketches (reduced representations for fast operations).
taxscreen Create Kraken-style taxonomic report based on mash screen.
triangle Estimate a lower-triangular distance matrix.
Usage
mash dist ../AAI/input/TF04-14/*.fna
Progress messages:
Sketching ../AAI/input/TF04-14/TF04-14_BGI.fna (provide sketch file made with "mash sketch" to skip)...done.
Results (printed to the terminal):
../AAI/input/TF04-14/TF04-14_BGI.fna ../AAI/input/TF04-14/TF04-14_Illumina.fna 0 0 1000/1000
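With a shell glob, the first file in sorted order is taken as the reference and the files that sort later are treated as queries. The output columns are: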
Reference-ID
Query-ID
Mash-distance
P-value
Matching-hashes
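The five columns can be parsed and the distance recomputed from the matching-hashes column. This sketch uses the Mash distance formula from the 2016 paper, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard estimate; k = 21 is assumed here as the default k-mer size, so adjust it if the sketches were built with a different `-k`.

```python
import math


def parse_mash_dist(line):
    """Split one `mash dist` output line into a dictionary."""
    ref, query, dist, pvalue, hashes = line.split()
    shared, total = (int(x) for x in hashes.split("/"))
    return {
        "reference": ref,
        "query": query,
        "distance": float(dist),
        "p_value": float(pvalue),
        "shared": shared,
        "sketch_size": total,
    }


def mash_distance(jaccard, k=21):
    """Mash distance from a Jaccard estimate; k=21 is an assumed default."""
    if jaccard <= 0:
        return 1.0  # no shared hashes: maximal distance
    return math.log((1 + jaccard) / (2 * jaccard)) / k


line = ("../AAI/input/TF04-14/TF04-14_BGI.fna\t"
        "../AAI/input/TF04-14/TF04-14_Illumina.fna\t0\t0\t1000/1000")
record = parse_mash_dist(line)
jaccard = record["shared"] / record["sketch_size"]
recomputed = mash_distance(jaccard)
print(record["distance"], recomputed)
```

Here all 1000 hashes match, so the Jaccard estimate is 1 and the distance is 0, consistent with the two assemblies coming from the same strain.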
4. TETRA (tetranucleotide signature)
Website: https://imedea.uib-csic.es/jspecies/download.html
Introduction to TETRA: https://help.ezbiocloud.net/tetra-nucleotide-analysis-tna/
JSpeciesWS: a web server for prokaryotic species circumscription based on pairwise genome comparison. Bioinformatics. 2016
Shifting the genomic gold standard for the prokaryotic species definition. PNAS. 2009. This paper describes the JSpecies package, which computes both ANI and TETRA. Its two main points:
1 ANI can replace DDH (DNA-DNA hybridization) for defining species
2 The tetranucleotide signature computed by JSpecies correlates with ANI and can support species definition
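To make the tetranucleotide signature concrete, here is a minimal sketch of the usual recipe: count 2-, 3- and 4-mers on both strands, turn each tetranucleotide count into a z-score against its expectation from the overlapping trinucleotides, then take the Pearson correlation of two genomes' z-score vectors. This is an illustrative reimplementation under those assumptions, not the JSpecies code, and it runs on random toy sequences rather than real genomes.

```python
import itertools
import math
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")
TETRAMERS = ["".join(p) for p in itertools.product("ACGT", repeat=4)]


def count_kmers(seq, k):
    """Count k-mers on the sequence and on its reverse complement."""
    counts = {}
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def tetra_zscores(seq):
    """Z-score of each of the 256 tetranucleotides (0.0 where undefined)."""
    seq = seq.upper()
    di, tri, tetra = (count_kmers(seq, k) for k in (2, 3, 4))
    zscores = []
    for t in TETRAMERS:
        n23, n123, n234 = di.get(t[1:3], 0), tri.get(t[:3], 0), tri.get(t[1:], 0)
        if n23 == 0:
            zscores.append(0.0)
            continue
        expected = n123 * n234 / n23
        variance = expected * (n23 - n123) * (n23 - n234) / (n23 * n23)
        observed = tetra.get(t, 0)
        zscores.append((observed - expected) / math.sqrt(variance)
                       if variance > 0 else 0.0)
    return zscores


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def tetra_correlation(seq_a, seq_b):
    return pearson(tetra_zscores(seq_a), tetra_zscores(seq_b))


rng = random.Random(0)
toy = "".join(rng.choice("ACGT") for _ in range(5000))  # toy sequence, not a genome
toy_revcomp = toy.translate(COMPLEMENT)[::-1]
self_corr = tetra_correlation(toy, toy)
strand_corr = tetra_correlation(toy, toy_revcomp)
print(self_corr, strand_corr)
```

Counting both strands makes the signature independent of which strand was assembled, which is why a sequence and its reverse complement correlate perfectly.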
For those less comfortable with Java, bioconda offers an alternative:
pyani Github: https://github.com/widdowquinn/pyani
Title: Genomics and taxonomy in diagnostics for food security: soft-rotting enterobacterial plant pathogens
Journal: Analytical Methods
Year: 2016
Citations: 418 (Google Scholar, 2021-11-24)
Installation
conda activate xgenome
conda install pyani
# pyani includes both ANI and TETRA methods
Parameters:
-m {ANIm,ANIb,ANIblastall,TETRA}, --method {ANIm,ANIb,ANIblastall,TETRA}
-g, --graphics Generate heatmap of ANI
-i INDIRNAME, --indir INDIRNAME
-o OUTDIRNAME, --outdir OUTDIRNAME
ANI based on BLAST+ (ANIb)
ANI based on MUMmer (ANIm)
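Usage
The output directory is created automatically; there is no need to create it beforehand. In the command below, $i is the name of a sample subdirectory (for example TF04-14).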
average_nucleotide_identity.py \
-m TETRA \
-i ./AAI/input/$i/ \
-o ./tetra/out/$i/
Results
A correlation matrix covering all genomes in the input directory:
cat TETRA_correlations.tab
TF04-14_BGI TF04-14_Illumina
TF04-14_BGI 1.0 1.0
TF04-14_Illumina 1.0 1.0
The resulting matrix is fully symmetric.
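The table can be loaded and checked programmatically. This sketch assumes a whitespace-separated square table whose header row lists the column names, as in the output above; the 0.99 cutoff is the TETRA rule of thumb quoted in the overview, and the helper names are illustrative.

```python
def read_square_table(lines):
    """Read a labelled square table into {row_name: {col_name: value}}."""
    lines = [ln for ln in lines if ln.strip()]
    columns = lines[0].split()
    table = {}
    for ln in lines[1:]:
        fields = ln.split()
        table[fields[0]] = dict(zip(columns, (float(v) for v in fields[1:])))
    return table


def is_symmetric(table, tol=1e-9):
    """Check table[a][b] == table[b][a] within a tolerance."""
    return all(abs(table[a][b] - table[b][a]) <= tol
               for a in table for b in table[a])


def pairs_above(table, threshold=0.99):
    """List each unordered pair whose correlation exceeds the threshold."""
    names = sorted(table)
    return [(a, b, table[a][b])
            for i, a in enumerate(names) for b in names[i + 1:]
            if table[a][b] > threshold]


tetra_text = """\tTF04-14_BGI\tTF04-14_Illumina
TF04-14_BGI\t1.0\t1.0
TF04-14_Illumina\t1.0\t1.0
""".splitlines()

tetra_table = read_square_table(tetra_text)
symmetric = is_symmetric(tetra_table)
hits = pairs_above(tetra_table)
print(symmetric, hits)
```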
Further reading:
[1] Shifting the genomic gold standard for the prokaryotic species definition. PNAS, 2009.
[2] High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications, 2018.