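Introduction

Average nucleotide identity (ANI) and average amino acid identity (AAI) are the two basic metrics most commonly used in comparative genomics. The tetranucleotide signature correlation (TETRA) and the Mash distance are also often used to support species assignment. Two genomes are generally considered to belong to the same species when ANI > 95%, TETRA > 0.99, AAI > 95%, and Mash distance < 0.05.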
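
The cutoffs above can be collected into a small helper that reports, for each metric supplied, whether a genome pair falls on the same-species side. This is only a sketch: the function and dictionary names are invented for illustration, and the cutoffs are rules of thumb rather than hard limits.

```python
# Cutoffs quoted in the introduction; ">" means "larger is more similar".
THRESHOLDS = {
    "ani": (">", 95.0),    # percent
    "aai": (">", 95.0),    # percent
    "tetra": (">", 0.99),  # correlation coefficient
    "mash": ("<", 0.05),   # Mash distance
}


def same_species_votes(metrics):
    """Return {metric: True/False} for every metric that was supplied."""
    votes = {}
    for name, value in metrics.items():
        op, cutoff = THRESHOLDS[name]
        votes[name] = value > cutoff if op == ">" else value < cutoff
    return votes


# TETRA and Mash values reported for the TF04-14 pair in this post
votes = same_species_votes({"tetra": 1.0, "mash": 0.0})
print(votes)
```

Keeping one vote per metric, instead of a single yes/no answer, makes disagreements between methods visible.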

Motivating article

Title: Insights on the Evolutionary Genomics of the Blautia Genus: Potential New Species and Genetic Content Among Lineages
Topic: comparative genomics of the genus Blautia
Journal: Front Microbiol
Year: 2021

Below, three methods are used jointly to assign species groups/clusters.

1. AAI

AAI: CompareM

GitHub: https://github.com/dparks1134/CompareM

1.1 Installation: use a Python 3.6 conda environment

conda create -n python3.6 python=3.6
conda activate python3.6
conda install comparem
comparem -h
comparem aai_wf -h

Input options:
i) a text file in which each line gives the location of a genome
ii) a directory containing all genomes to be compared

1.2 Usage:

comparem aai_wf --cpus 16 --tmp_dir ./ -x fna genome out_comparem

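Parameters:
--cpus: number of threads
--tmp_dir: location for temporary files
-x: file extension of the genome files
genome: input directory (containing genome_A.fna, genome_B.fna, genome_B2.fna)
out_comparem: output directory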

[Figure: program run log]

1.3 Results:

aai_summary.tsv

The documentation does not mention separate reference and query roles. The columns are:

Identifier of the first genome
Number of genes in the first genome
Identifier of the second genome
Number of genes in the second genome
Number of orthologous genes identified between the two genomes
The mean amino acid identity (AAI) of orthologous genes
The standard deviation of the AAI across orthologous genes
The orthologous fraction (OF) between the two genomes, defined as the number of orthologous genes divided by the minimum number of genes in either genome
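
As a rough sketch, the table can be read in Python using the column order listed above. The field names, the assumption that the header line starts with `#`, and the demo row are all invented for illustration; check them against a real aai_summary.tsv before relying on this.

```python
import csv
import io

# Field names are hypothetical labels for the eight columns described above.
COLUMNS = ["genome_a", "genes_a", "genome_b", "genes_b",
           "orthologs", "mean_aai", "std_aai", "of"]


def read_aai_summary(handle):
    """Yield one dict per genome pair from an aai_summary.tsv-style table."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and a '#'-prefixed header (assumed)
        rec = dict(zip(COLUMNS, row))
        for key in ("genes_a", "genes_b", "orthologs"):
            rec[key] = int(rec[key])
        for key in ("mean_aai", "std_aai", "of"):
            rec[key] = float(rec[key])  # stored as written by the tool
        yield rec


def orthologous_fraction(orthologs, genes_a, genes_b):
    """OF as defined above: orthologs / min(gene counts)."""
    return orthologs / min(genes_a, genes_b)


# Invented demo content, NOT real CompareM output.
demo = (
    "#Genome A\tGenes in A\tGenome B\tGenes in B\tOrthologs\tMean AAI\tStd AAI\tOF\n"
    "genome_A\t4000\tgenome_B\t3800\t3040\t96.50\t5.20\t80.00\n"
)
records = list(read_aai_summary(io.StringIO(demo)))
for rec in records:
    of = orthologous_fraction(rec["orthologs"], rec["genes_a"], rec["genes_b"])
    print(rec["genome_a"], rec["genome_b"], rec["mean_aai"], of)
```

Recomputing OF from the gene counts avoids guessing whether the file stores it as a fraction or a percentage.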

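A later reinstall and run failed with: OSError: [Errno 122] Disk quota exceeded
The GitHub page also notes that CompareM is no longer maintained and recommends newer tools.

The cause was a full home directory. The home quota on the cluster is tiny, and deleting the large files fixed the problem.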

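# total size of the current directory before cleaning
du -sh ./
# 102M
rm -r .cache
# total size after removing the cache
du -sh ./
# 196K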

AAI: EzAAI

Website: http://leb.snu.ac.kr/ezaai
GitHub: https://github.com/lebsnu/ezaai

On Windows, download the jar file, then install the dependencies with conda:

conda activate xgenome
conda install prodigal mmseqs2

2. ANI: FastANI

Title: High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries
Journal: Nature Communications
Year: 2018
Citations: 978 (Google Scholar, 2021-11-09)

GitHub: https://github.com/ParBLiSS/FastANI

2.1 Installation: use a Python 3.6 conda environment

conda install fastANI
fastANI --help

2.2 Usage

fastANI -t 16 \
-q genome/genome_B.fasta \
-r genome/genome_B2.fasta \
-o out_fastani

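Parameters:
-q: query genome
-r: reference genome
--rl: file listing the reference genomes, one path per line
--ql: file listing the query genomes, one path per line
-t: number of threads
-o: output file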

2.3 Results:

genome_B.fasta and genome_B2.fasta are copies of the same file.

cat out_fastani
genome/genome_B.fasta   genome/genome_B2.fasta  100 612 612

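Notes:
The columns are: query genome, reference genome, ANI (%), number of fragments aligned as orthologous matches, and total number of query fragments.
Here the ANI is 100: out of the total 612 sequence fragments from genome B, all 612 were aligned as orthologous matches to B2.
Swapping reference and query can give a slightly different ANI value.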
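
A minimal sketch of how one output row could be split into named fields, and how the two directions of a comparison could be averaged. The function names are invented for illustration, and the parser assumes paths contain no whitespace. The two directional values are the AF34-13 / AF67-21pH9TA rows from the pairwise output in section 2.4.

```python
def parse_fastani_line(line):
    """Split one fastANI output row into named fields."""
    query, ref, ani, matched, total = line.split()
    return {
        "query": query,
        "ref": ref,
        "ani": float(ani),
        "matched": int(matched),
        "total": int(total),
        # share of query fragments that found an orthologous match
        "aligned_fraction": int(matched) / int(total),
    }


def reciprocal_mean(ani_ab, ani_ba):
    """Average the ANI of A->B and B->A to get one symmetric value."""
    return (ani_ab + ani_ba) / 2.0


row = parse_fastani_line("genome/genome_B.fasta   genome/genome_B2.fasta  100 612 612")
print(row["ani"], row["aligned_fraction"])  # 100.0 1.0

# AF34-13 -> AF67-21pH9TA gave 98.9514, the reverse direction 98.9199
print(f"{reciprocal_mean(98.9514, 98.9199):.5f}")  # 98.93565
```

The averaged value is close to the 98.935646 entry that the .matrix file in section 2.4 lists for the same pair, which suggests the matrix stores a mean of both directions.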

2.4 Computing an ANI matrix for multiple genomes

1 Prepare an input file listing the paths of all genomes

cat genome_ref.tsv
genome/AF34-13.fna
genome/AF67-21pH9TA.fna
genome/AF81-08TA.fna
genome/AM27-31LB.fna
genome/AM49-4BH.fna
genome/AM53-13BH.fna
genome/AM59-24XD.fna
genome/OM05-6BH.fna
genome/OM05-9BH.fna
genome/OM07-10AC.fna
genome/OM07-9AC.fna
genome/TF01-11.fna
genome/TF08-3AC.fna

2 Run

fastANI -t 4 \
--ql genome_ref.tsv \
--rl genome_ref.tsv \
-o out_matrix.txt \
--matrix

3 Results

head out_matrix.txt
genome/AF34-13.fna      genome/AF34-13.fna      100     1148    1150
genome/AF34-13.fna      genome/TF01-11.fna      99.1324 1044    1150
genome/AF34-13.fna      genome/AF81-08TA.fna    99.0977 1023    1150
genome/AF34-13.fna      genome/AF67-21pH9TA.fna 98.9514 1065    1150
genome/AF34-13.fna      genome/AM27-31LB.fna    98.1427 982     1150
genome/AF34-13.fna      genome/TF08-3AC.fna     83.9318 619     1150
genome/AF67-21pH9TA.fna genome/AF67-21pH9TA.fna 100     1221    1222
genome/AF67-21pH9TA.fna genome/TF01-11.fna      99.5597 1074    1222
genome/AF67-21pH9TA.fna genome/AF81-08TA.fna    99.3177 1045    1222
genome/AF67-21pH9TA.fna genome/AF34-13.fna      98.9199 1070    1222

head out_matrix.txt.matrix
13
genome/AF34-13.fna
genome/AF67-21pH9TA.fna 98.935646
genome/AF81-08TA.fna    99.104561       99.317062
genome/AM27-31LB.fna    98.067337       98.105423       98.159103
genome/AM49-4BH.fna     NA      NA      NA      NA
genome/AM53-13BH.fna    NA      NA      NA      NA      98.027405
genome/AM59-24XD.fna    NA      NA      NA      NA      80.823174       NA
genome/OM05-6BH.fna     NA      NA      NA      NA      97.929825       98.926117       NA
genome/OM05-9BH.fna     NA      NA      NA      NA      97.908569       98.953743       NA      99.978661

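FastANI does not report a value for genome pairs whose ANI is far below about 80%; such pairs appear as NA in the matrix. Two genomes with ANI above 95% can be considered the same bacterial species.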
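
To turn the matrix into species groups, one option is single-linkage grouping at the 95% cutoff. The sketch below is not part of FastANI; the function names are invented, and it assumes the layout shown above (first line holds the genome count, then one lower-triangular row per genome, NA for missing values). The embedded text is the ten-line excerpt printed by head, so only 9 of the 13 genomes are grouped.

```python
def read_lower_triangle(text):
    """Parse a fastANI .matrix file into (names, {(i, j): ani}) with i > j."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names, values = [], {}
    for i, line in enumerate(lines[1:]):  # lines[0] holds the genome count
        fields = line.split()
        names.append(fields[0])
        for j, cell in enumerate(fields[1:]):
            if cell != "NA":
                values[(i, j)] = float(cell)
    return names, values


def species_groups(names, values, cutoff=95.0):
    """Single linkage: genomes connected by any ANI > cutoff share a group."""
    parent = list(range(len(names)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (i, j), ani in values.items():
        if ani > cutoff:
            parent[find(i)] = find(j)
    groups = {}
    for idx, name in enumerate(names):
        groups.setdefault(find(idx), []).append(name)
    return sorted(groups.values())


excerpt = """13
genome/AF34-13.fna
genome/AF67-21pH9TA.fna 98.935646
genome/AF81-08TA.fna    99.104561       99.317062
genome/AM27-31LB.fna    98.067337       98.105423       98.159103
genome/AM49-4BH.fna     NA      NA      NA      NA
genome/AM53-13BH.fna    NA      NA      NA      NA      98.027405
genome/AM59-24XD.fna    NA      NA      NA      NA      80.823174       NA
genome/OM05-6BH.fna     NA      NA      NA      NA      97.929825       98.926117       NA
genome/OM05-9BH.fna     NA      NA      NA      NA      97.908569       98.953743       NA      99.978661
"""
names, values = read_lower_triangle(excerpt)
groups = species_groups(names, values)
for group in groups:
    print(group)
```

On this excerpt the grouping gives three clusters: the four AF/AM27 genomes, the four genomes around AM49-4BH, and AM59-24XD on its own. Single linkage can chain loosely related genomes together, so stricter criteria may be preferable on larger data sets.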

2.5 Plotting the mappings

Method: https://github.com/ParBLiSS/FastANI
R script: https://github.com/ParBLiSS/FastANI/tree/master/scripts
Usage:

./fastANI -q B_quintana.fna -r B_henselae.fna --visualize -o fastani.out
Rscript scripts/visualize.R B_quintana.fna B_henselae.fna fastani.out.visual

3. Mash

Documentation: https://mash.readthedocs.io/en/latest/

Papers:
Title: Mash Screen: high-throughput sequence containment estimation for genome discovery
Journal: Genome Biol.
Year: 2019
Citations: 75 (Google Scholar, 2021-11-24)

Title: Mash: fast genome and metagenome distance estimation using MinHash
Journal: Genome Biol.
Year: 2016
Citations: 1214

Installation

conda activate mash
conda install mash
mash --version
# 2.3

Help output:

Mash version 2.3
Type 'mash --license' for license and copyright information.
Usage:
  mash  [options] [arguments ...]
Commands:
  bounds     Print a table of Mash error bounds.
  dist       Estimate the distance of query sequences to references.
  info       Display information about sketch files.
  paste      Create a single sketch file from multiple sketch files.
  screen     Determine whether query sequences are within a larger mixture of
             sequences.
  sketch     Create sketches (reduced representations for fast operations).
  taxscreen  Create Kraken-style taxonomic report based on mash screen.
  triangle   Estimate a lower-triangular distance matrix.

Usage

mash dist ../AAI/input/TF04-14/*.fna

Log:

Sketching ../AAI/input/TF04-14/TF04-14_BGI.fna (provide sketch file made with "mash sketch" to skip)...done.

Result (printed to the terminal):

../AAI/input/TF04-14/TF04-14_BGI.fna    ../AAI/input/TF04-14/TF04-14_Illumina.fna       0       0       1000/1000

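mash dist treats the first file as the reference and the remaining files as queries. Because the shell expands *.fna in sorted order, the file whose name sorts later becomes the query. The output columns are:
Reference-ID
Query-ID
Mash-distance
P-value
Matching-hashes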
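
The five columns can be parsed with a few lines of Python, and the distance can be recomputed from the matching-hashes column using the formula from the Mash paper, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard estimate. This is only a sketch: the function names are invented, k = 21 is assumed because it is Mash's default k-mer size, and the parser assumes paths contain no whitespace.

```python
import math


def parse_mash_dist(line):
    """Split one `mash dist` output row into named fields."""
    ref, query, dist, pval, hashes = line.split()
    shared, sketch = (int(x) for x in hashes.split("/"))
    return {"ref": ref, "query": query, "distance": float(dist),
            "p_value": float(pval), "shared": shared, "sketch_size": sketch}


def mash_distance(shared, sketch_size, k=21):
    """Mash distance from the Jaccard estimate j = shared / sketch_size."""
    j = shared / sketch_size
    if j == 0:
        return 1.0  # no shared hashes: maximal distance
    # max() also turns a negative zero into a plain 0.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


rec = parse_mash_dist(
    "../AAI/input/TF04-14/TF04-14_BGI.fna    "
    "../AAI/input/TF04-14/TF04-14_Illumina.fna       0       0       1000/1000"
)
print(rec["distance"], mash_distance(rec["shared"], rec["sketch_size"]))  # 0.0 0.0
```

Since the Mash distance approximates 1 - ANI, a distance below 0.05 corresponds roughly to the 95% ANI cutoff.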

4. TETRA (tetranucleotide signature)

Website: https://imedea.uib-csic.es/jspecies/download.html

Introduction to TETRA: https://help.ezbiocloud.net/tetra-nucleotide-analysis-tna/

JSpeciesWS: a web server for prokaryotic species circumscription based on pairwise genome comparison. Bioinformatics. 2016
Shifting the genomic gold standard for the prokaryotic species definition. PNAS. 2009. This paper describes the JSpecies package, which computes both ANI and TETRA.
1 ANI can replace DDH (DNA-DNA hybridization) for defining species
2 The tetranucleotide signature computed by JSpecies correlates with ANI and can support species definition
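
To make the idea concrete, the sketch below counts all 256 tetranucleotides on both strands and correlates the two frequency profiles. It is a deliberate simplification: JSpecies and pyani correlate z-scores (observed versus expected counts), not raw frequencies, so the numbers will differ from real TETRA values. The function names and both toy sequences are invented for illustration.

```python
from itertools import product
from math import sqrt

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def tetra_frequencies(seq):
    """Relative frequency of each of the 256 tetranucleotides on both strands."""
    seq = seq.upper()
    counts = dict.fromkeys(KMERS, 0)
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for i in range(len(strand) - 3):
            kmer = strand[i:i + 4]
            if kmer in counts:  # skips windows containing N or other symbols
                counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in KMERS]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy sequences, far too short for a meaningful signature
seq_a = "ATGCGTACGTTAGCCTAGGATCCGATCGATTACGGCATCG" * 5
seq_b = "TTGACCGGTATACCGGTTAAGGCCTTAACGCGTATGCAAT" * 5
freq_a = tetra_frequencies(seq_a)
freq_b = tetra_frequencies(seq_b)
print(pearson(freq_a, freq_a))
print(pearson(freq_a, freq_b))
```

Counting both strands makes the profile independent of which strand a contig happens to be written on.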

I am not comfortable with Java, so I tried a bioconda package instead.

pyani GitHub: https://github.com/widdowquinn/pyani

Title: Genomics and taxonomy in diagnostics for food security: soft-rotting enterobacterial plant pathogens
Journal: Analytical Methods
Year: 2016
Citations: 418 (Google Scholar, 2021-11-24)

Installation

conda activate xgenome
conda install pyani
# includes both the ANI and the TETRA methods

Parameters:

-m {ANIm,ANIb,ANIblastall,TETRA}, --method {ANIm,ANIb,ANIblastall,TETRA}
-g, --graphics        Generate heatmap of ANI
-i INDIRNAME, --indir INDIRNAME
-o OUTDIRNAME, --outdir OUTDIRNAME

ANI based on BLAST+ (ANIb)
ANI based on MUMmer (ANIm)

Usage
The output directory is created automatically; there is no need to create it beforehand.

average_nucleotide_identity.py \
    -m TETRA \
    -i ./AAI/input/$i/ \
    -o ./tetra/out/$i/
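
The command above uses a shell variable `$i`, presumably the name of one input subdirectory such as TF04-14. As a hedged sketch, the same per-directory commands could be assembled in Python; the function name is invented, only the `-m`, `-i`, and `-o` options listed above are used, and nothing is executed here.

```python
from pathlib import Path


def tetra_commands(input_root, output_root):
    """Build one average_nucleotide_identity.py command per input subdirectory."""
    commands = []
    for sub in sorted(p for p in Path(input_root).iterdir() if p.is_dir()):
        commands.append([
            "average_nucleotide_identity.py",
            "-m", "TETRA",
            "-i", str(sub),
            "-o", str(Path(output_root) / sub.name),  # left for pyani to create
        ])
    return commands
```

Each command could then be run with `subprocess.run(cmd, check=True)`, for example over `tetra_commands("./AAI/input", "./tetra/out")`.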

Results

Correlation matrix for all genomes in the input directory:

cat TETRA_correlations.tab
        TF04-14_BGI     TF04-14_Illumina
TF04-14_BGI     1.0     1.0
TF04-14_Illumina        1.0     1.0

The result is a fully symmetric matrix.
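
A small sketch for reading the table, checking its symmetry, and listing pairs above the 0.99 cutoff. The function names are invented for illustration, and the parser assumes genome names contain no whitespace.

```python
def read_correlations(text):
    """Parse a square table with a header row and a label at the start of each row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cols = lines[0].split()
    table = {}
    for line in lines[1:]:
        fields = line.split()
        table[fields[0]] = dict(zip(cols, map(float, fields[1:])))
    return table


def is_symmetric(table, tol=1e-9):
    """True if table[a][b] equals table[b][a] for every pair."""
    return all(abs(table[a][b] - table[b][a]) <= tol
               for a in table for b in table)


def pairs_above(table, cutoff=0.99):
    """List each unordered pair once if its correlation exceeds the cutoff."""
    names = list(table)
    return [(a, b, table[a][b])
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if table[a][b] > cutoff]


tab = (
    "\tTF04-14_BGI\tTF04-14_Illumina\n"
    "TF04-14_BGI\t1.0\t1.0\n"
    "TF04-14_Illumina\t1.0\t1.0\n"
)
table = read_correlations(tab)
print(is_symmetric(table))
print(pairs_above(table))
```

For the two TF04-14 assemblies the only pair has a correlation of 1.0, well above the 0.99 cutoff quoted in the introduction.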

Further reading:
[1] Shifting the genomic gold standard for the prokaryotic species definition. PNAS, 2009
[2] High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications, 2018