资源介绍:GTDB-Tk v2: memory friendly classification with the genome taxonomy database | Bioinformatics | Oxford Academic (oup.com)
GTDB - Genome Taxonomy Database (ecogenomic.org)
大家自己看吧,不在解释了,直接上安装和配置,然后再使用。
github地址: GitHub - Ecogenomics/GTDBTk: GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
国内地址:
基本都能看懂吧,这里估计full_tree需要410GB的内存,建议512的大服务器了。这里明确了64个线程1小时估计能跑1000个基因组,不算快吧。
Domain |
Memory |
Storage |
Time |
---|---|---|---|
Archaea |
~45 GB |
~85 GB |
~1 hour / 1,000 genomes @ 64 CPUs |
Bacteria |
~65GB (410 GB when using –full_tree) |
~85 GB |
~1 hour / 1,000 genomes @ 64 CPUs |
环境依赖:
看起来都是我们经常用的几个软件,所以可以借鉴开发自己的分析流程。
Software |
Version |
Reference |
---|---|---|
Prodigal |
>= 2.6.2 |
Hyatt D, et al. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11:119. doi: 10.1186/1471-2105-11-119. |
HMMER |
>= 3.1b2 |
Eddy SR. 2011. Accelerated profile HMM searches. PLOS Comp. Biol., 7:e1002195. |
pplacer |
>= 1.1 |
Matsen FA, et al. 2010. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11:538. |
FastANI |
>= 1.32 |
Jain C, et al. 2019. High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries. Nat. Communications, doi: 10.1038/s41467-018-07641-9. |
FastTree |
>= 2.1.9 |
Price MN, et al. 2010. FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One, 5, e9490. |
Mash |
>= 2.2 |
Ondov BD, et al. 2016. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132. doi: doi: 10.1186/s13059-016-0997-x. |
使用conda方式安装
##建议大家带上版本号,因为这个数据库一直在更新,方便分辨是哪个版本
conda create -n gtdbtk-2.1.1 -c conda-forge -c bioconda gtdbtk=2.1.1
#mamba的安装方式建议大家搜索一下吧,反正就是另外一个命令,安装方式还挺快,不过得先安装mamba
# using mamba (alternative)
mamba create -n gtdbtk-2.1.1 -c conda-forge -c bioconda gtdbtk=2.1.1
###自动下载数据库
download-db.sh
##默认会下载到安装环境相对路径下:/{conda}/envs/gtdbtk/share/gtdbtk-1.0.2/db/
##这个是自动下载的路径,建议修改,我这里配置的时候会下载出错!
wget -c https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/gtdbtk_r89_data.tar.gz
##下面这几个路径是对的,下载的时候记得看清版本
#这两个路径是默认的最新版本
https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_data.tar.gz
https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/auxillary_files/gtdbtk_data.tar.gz
#这里是直接下载指定版本
https://data.gtdb.ecogenomic.org/releases/release214/214.0/auxillary_files/gtdbtk_r214_data.tar.gz
https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz
https://data.gtdb.ecogenomic.org/releases/release202/202.0/auxillary_files/gtdbtk_r202_data.tar.gz
https://data.gtdb.ecogenomic.org/releases/release95/95.0/auxillary_files/gtdbtk_r95_data.tar.gz
###如果安装环境目录空间有限,建议使用自定义目录统一管理:
#先激活环境
# Activate the GTDB-Tk conda environment
conda activate gtdbtk-2.1.1
#再设置GTDBTK数据库的路径,大家只需要将对应版本的数据库下载到自定义路径下并解压就好了。
# Set the environment variable to the directory containing the GTDB-Tk reference data
conda env config vars set GTDBTK_DATA_PATH="/path/to/unarchived/gtdbtk/data";
数据下载秘籍
修改/anaconda3/envs/gtdbtk/bin/download-db.sh中的第一行GTDBTK_DATA_PATH到指定目录,修改下面的第6-8行N_FILES_IN_TAR,DB_URL,TARGET_TAR_NAME到指定版本的参数,不知道参数的就不要修改了。还有个判断目录为空的地方建议注释掉,手动删除目标文件夹的除了下载数据库压缩文件以外的所有目录和文件,当然也可以全部删除后重新开始下载
下载文件比较大,建议wget后面加上-c参数以断点续传,防止以外中断重头开始下载:
pip安装方式本质上与conda没啥区别,不过还是建议大家上conda,多个环境单独配置比较好一点
docker方式需要大家简单了解一下docker的用法了,有docker经验的应该会喜欢这个,所有输入输出文件需要进行目录挂载。
官方文档参考:Installing GTDB-Tk — GTDB-Tk 2.3.2 documentation (ecogenomics.github.io)
安装完成后可以使用下面命令检查安装是否正确
#激活环境
conda activate GTDB-TK2.1.1
#执行检查命令
gtdbtk check_install
Create a multiple sequence alignment based on the AR53/BAC120 marker set.
gtdbtk align --identify_dir identify_output/ --out_dir align_output --cpus 3
Determine taxonomic classification of genomes.
#这个就是分类了,就是物种注释
gtdbtk classify --align_dir align_3lines/ --batchfile 3lines_batchfile.tsv --out_dir 3classify_ani --mash_db mash_db_dir/ --cpus 20
The classify workflow consists of four steps: ani_screen
, identify
, align
, and classify
.
gtdbtk classify_wf --genome_dir genomes/ --out_dir classify_wf_out --cpus 3
The convert_to_itol command will remove internal labels from Newick tree, making it suitable for visualization in iTOL
gtdbtk convert_to_itol --input some_tree.tree --output itol.tree
This workflow consists of five steps: identify
, align
, infer
, root
, and decorate
.
gtdbtk de_novo_wf --genome_dir genomes/ --outgroup_taxon p__Undinarchaeota --archaea --out_dir de_novo_wf --cpus 3
gtdbtk de_novo_wf --genome_dir genomes/ --outgroup_taxon p__Chloroflexota --bacteria --taxa_filter p__Firmicutes --out_dir de_novo_output
#Skip GTDB reference genomes ( requires --custom_taxonomy_file for outgrouping)
gtdbtk de_novo_wf --genome_dir genomes/ --outgroup_taxon p__Customphylum --bacteria --custom_taxonomy_file custom_taxonomy.tsv --out_dir de_novo_output
#Use a subset of GTDB reference genomes (p__Firmicutes) and outgroup on a custom Phylum (p__Customphylum)
gtdbtk de_novo_wf --genome_dir genomes/ --taxa_filter p__Firmicutes --outgroup_taxon p__Customphylum --bacteria --custom_taxonomy_file custom_taxonomy.tsv --out_dir de_novo_output
Decorate a tree with the GTDB-Tk taxonomy.
gtdbtk decorate --input_tree input.tree --output_tree output.tree
The export_msa will export the untrimmed archaeal or bacterial MSA used in the reference data.
gtdbtk export_msa --domain arc --output /tmp/msa.faa
Identify marker genes in genome(s). The following heuristic is used to establish the translation table used by Prodigal: use table 11 unless the coding density using table 4 is 5% higher than when using table 11 and the coding density under table 4 is >70%. Distinguishing between tables 4 and 25 is challenging so GTDB-Tk does not attempt to distinguish between these two tables. If you know the correct translation table for your genomes this can be provided to GTDB-Tk in the –batchfile.
gtdbtk identify --genome_dir genomes/ --out_dir identify_output --cpus 3
Infer tree from multiple sequence alignment.
gtdbtk infer --msa_file msa.faa --out_dir infer_out
Root a tree using an outgroup.
gtdbtk root --input_tree input.tree --outgroup_taxon p__Nanoarchaeota --output_tree output.tree
The trim_msa command will trim a MSA given a user-specified mask file, or the archaeal/bacterial mask present in the reference data.
gtdbtk trim_msa --untrimmed_msa msa.faa --output msa_trim.faa --mask_file mask.txt
###输入输出文件格式:
#### msa.faa
>genome_a
AKLAK
#### mask.txt
01011
#### msa_trim.faa
>genome_a
KAK
大家随时可参考这里查看输入输出文件格式,每类文件都有一个示例
Files — GTDB-Tk 2.3.2 documentation (ecogenomics.github.io)https://ecogenomics.github.io/GTDBTk/files/index.html
The following files are output by GTDB-Tk throughout various stages of the pipeline:
下面的示例是使用classify_wf进行物种分类,基于jupyter notebook的示例,需要安装配置jupyter,当然大家可以使用命令执行也是一样效果。
This jupyter notebook demonstrates how to run the classify_wf
command on a set of test genomes.
For a full list of commands see: * gtdbtk -h
, or * GTDB-Tk commands
This example will use the following two genomes that we will refer to as: * Genome A: GCF_003947435.1
[GTDB / NCBI] * Genome B: GCA_002011125.1
[GTDB / NCBI]
[5]:
# Create the directory. !mkdir -p /tmp/gtdbtk && cd /tmp/gtdbtk # Obtain the genomes. !mkdir -p /tmp/gtdbtk/genomes !wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/947/435/GCF_003947435.1_ASM394743v1/GCF_003947435.1_ASM394743v1_genomic.fna.gz -O /tmp/gtdbtk/genomes/genome_a.fna.gz !wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/011/125/GCA_002011125.1_ASM201112v1/GCA_002011125.1_ASM201112v1_genomic.fna.gz -O /tmp/gtdbtk/genomes/genome_b.fna.gz
Note that the workflow can be run as a single command classify_wf
, however, each step will be run individually in this notebook. It can sometimes be useful to run the steps individually when processing large pipelines.
[ ]:
!ls -l /tmp/gtdbtk/genomes
[2]:
!gtdbtk identify --genome_dir /tmp/gtdbtk/genomes --out_dir /tmp/gtdbtk/identify --extension gz --cpus 2
[2022-04-11 11:48:59] INFO: GTDB-Tk v2.0.0 [2022-04-11 11:48:59] INFO: gtdbtk identify --genome_dir /tmp/gtdbtk/genomes --out_dir /tmp/gtdbtk/identify --extension gz --cpus 2 [2022-04-11 11:48:59] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207 [2022-04-11 11:48:59] INFO: Identifying markers in 2 genomes with 2 threads. [2022-04-11 11:48:59] TASK: Running Prodigal V2.6.3 to identify genes. [2022-04-11 11:49:10] INFO: Completed 2 genomes in 10.94 seconds (5.47 seconds/genome). [2022-04-11 11:49:10] TASK: Identifying TIGRFAM protein families. [2022-04-11 11:49:16] INFO: Completed 2 genomes in 5.78 seconds (2.89 seconds/genome). [2022-04-11 11:49:16] TASK: Identifying Pfam protein families. [2022-04-11 11:49:16] INFO: Completed 2 genomes in 0.42 seconds (4.81 genomes/second). [2022-04-11 11:49:16] INFO: Annotations done using HMMER 3.1b2 (February 2015). [2022-04-11 11:49:16] TASK: Summarising identified marker genes. [2022-04-11 11:49:16] INFO: Completed 2 genomes in 0.05 seconds (40.91 genomes/second). [2022-04-11 11:49:16] INFO: Done.
The called genes and marker information can be found under each genomes respeective intermediate files directory, as shown below.
[4]:
!ls /tmp/gtdbtk/identify/identify/intermediate_results/marker_genes/genome_a.fna/
genome_a.fna_pfam_tophit.tsv genome_a.fna_protein.gff.sha256 genome_a.fna_pfam_tophit.tsv.sha256 genome_a.fna_tigrfam.out genome_a.fna_pfam.tsv genome_a.fna_tigrfam.out.sha256 genome_a.fna_pfam.tsv.sha256 genome_a.fna_tigrfam_tophit.tsv genome_a.fna_protein.faa genome_a.fna_tigrfam_tophit.tsv.sha256 genome_a.fna_protein.faa.sha256 genome_a.fna_tigrfam.tsv genome_a.fna_protein.fna genome_a.fna_tigrfam.tsv.sha256 genome_a.fna_protein.fna.sha256 prodigal_translation_table.tsv genome_a.fna_protein.gff prodigal_translation_table.tsv.sha256
However, it is sometimes more useful to just read the summary files which detail markers identified from either the archaeal 53, or bacterial 120 marker set.
[5]:
!cat /tmp/gtdbtk/identify/identify/gtdbtk.ar53.markers_summary.tsv
name number_unique_genes number_multiple_genes number_multiple_unique_genes number_missing_genes list_unique_genes list_multiple_genes list_multiple_unique_genes list_missing_genes genome_a.fna 44 0 0 9 PF00410.20,PF00466.21,PF00687.22,PF00827.18,PF00900.21,PF01000.27,PF01015.19,PF01090.20,PF01200.19,PF01280.21,PF04919.13,PF07541.13,TIGR00037,TIGR00111,TIGR00134,TIGR00279,TIGR00291,TIGR00323,TIGR00335,TIGR00405,TIGR00448,TIGR00483,TIGR00491,TIGR00967,TIGR00982,TIGR01008,TIGR01012,TIGR01018,TIGR01020,TIGR01028,TIGR01046,TIGR01052,TIGR01213,TIGR01952,TIGR02236,TIGR02390,TIGR03626,TIGR03627,TIGR03628,TIGR03671,TIGR03672,TIGR03674,TIGR03676,TIGR03680 TIGR00064,TIGR00373,TIGR00522,TIGR01171,TIGR02338,TIGR02389,TIGR03629,TIGR03670,TIGR03673 genome_b.fna 50 0 0 3 PF00410.20,PF00466.21,PF00687.22,PF00827.18,PF00900.21,PF01000.27,PF01015.19,PF01090.20,PF01200.19,PF01280.21,PF04919.13,PF07541.13,TIGR00037,TIGR00064,TIGR00111,TIGR00134,TIGR00279,TIGR00291,TIGR00323,TIGR00335,TIGR00373,TIGR00405,TIGR00448,TIGR00483,TIGR00491,TIGR00522,TIGR00967,TIGR00982,TIGR01008,TIGR01012,TIGR01018,TIGR01020,TIGR01028,TIGR01046,TIGR01052,TIGR01952,TIGR02236,TIGR02338,TIGR02389,TIGR02390,TIGR03626,TIGR03628,TIGR03629,TIGR03670,TIGR03671,TIGR03672,TIGR03673,TIGR03674,TIGR03676,TIGR03680 TIGR01171,TIGR01213,TIGR03627
The align step will align all identified markers, determine the most likely domain and output a concatenated MSA.
[7]:
!gtdbtk align --identify_dir /tmp/gtdbtk/identify --out_dir /tmp/gtdbtk/align --cpus 2
[2022-04-11 11:59:14] INFO: GTDB-Tk v2.0.0 [2022-04-11 11:59:14] INFO: gtdbtk align --identify_dir /tmp/gtdbtk/identify --out_dir /tmp/gtdbtk/align --cpus 2 [2022-04-11 11:59:14] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207 [2022-04-11 11:59:15] INFO: Aligning markers in 2 genomes with 2 CPUs. [2022-04-11 11:59:16] INFO: Processing 2 genomes identified as archaeal. [2022-04-11 11:59:16] INFO: Read concatenated alignment for 3,412 GTDB genomes. [2022-04-11 11:59:16] TASK: Generating concatenated alignment for each marker. [2022-04-11 11:59:16] INFO: Completed 2 genomes in 0.01 seconds (139.73 genomes/second). [2022-04-11 11:59:16] TASK: Aligning 52 identified markers using hmmalign 3.1b2 (February 2015). [2022-04-11 11:59:17] INFO: Completed 52 markers in 0.86 seconds (60.66 markers/second). [2022-04-11 11:59:17] TASK: Masking columns of archaeal multiple sequence alignment using canonical mask. [2022-04-11 11:59:21] INFO: Completed 3,414 sequences in 4.19 seconds (815.22 sequences/second). [2022-04-11 11:59:21] INFO: Masked archaeal alignment from 13,540 to 10,153 AAs. [2022-04-11 11:59:21] INFO: 0 archaeal user genomes have amino acids in <10.0% of columns in filtered MSA. [2022-04-11 11:59:21] INFO: Creating concatenated alignment for 3,414 archaeal GTDB and user genomes. [2022-04-11 11:59:23] INFO: Creating concatenated alignment for 2 archaeal user genomes. [2022-04-11 11:59:23] INFO: Done.
It is important to pay attention to the output, if a genome had a low number of markers identified it will be excluded from the analysis at this step. A warning will appear if that is the case.
Depending on the domain, a prefixed file of either ar53
or bac120
will appear containing the MSA of the user genomes and the GTDB genomes, or just the user genomes (gtdbtk.ar53.msa.fasta.gz
and gtdbtk.ar53.user_msa.fasta.gz
respectively.)
[8]:
!ls /tmp/gtdbtk/align/align/
gtdbtk.ar53.filtered.tsv gtdbtk.ar53.user_msa.fasta.gz gtdbtk.ar53.msa.fasta.gz intermediate_results
The classify step will place the genomes into a reference tree, then determine their most likely classification.
[9]:
!gtdbtk classify --genome_dir /tmp/gtdbtk/genomes --align_dir /tmp/gtdbtk/align --out_dir /tmp/gtdbtk/classify -x gz --cpus 2
[2022-04-11 12:02:06] INFO: GTDB-Tk v2.0.0 [2022-04-11 12:02:06] INFO: gtdbtk classify --genome_dir /tmp/gtdbtk/genomes --align_dir /tmp/gtdbtk/align --out_dir /tmp/gtdbtk/classify -x gz --cpus 2 [2022-04-11 12:02:06] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207 [2022-04-11 12:02:07] TASK: Placing 2 archaeal genomes into reference tree with pplacer using 2 CPUs (be patient). [2022-04-11 12:02:07] INFO: pplacer version: v1.1.alpha19-0-g807f6f3 [2022-04-11 12:07:06] INFO: Calculating RED values based on reference tree. [2022-04-11 12:07:06] TASK: Traversing tree to determine classification method. [2022-04-11 12:07:06] INFO: Completed 2 genomes in 0.00 seconds (18,558.87 genomes/second). [2022-04-11 12:07:06] TASK: Calculating average nucleotide identity using FastANI (v1.32). [2022-04-11 12:07:08] INFO: Completed 4 comparisons in 1.61 seconds (2.49 comparisons/second). [2022-04-11 12:07:08] INFO: 2 genome(s) have been classified using FastANI and pplacer. [2022-04-11 12:07:08] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode. [2022-04-11 12:07:08] INFO: Done.
The two main files output are the summary files (gtdbtk.ar53.summary.tsv
, and gtdbtk.bac120.summary.tsv
respectively). Classification of the genomes are present in the summary file.
[10]:
!ls /tmp/gtdbtk/classify
classify gtdbtk.ar53.summary.tsv gtdbtk.log gtdbtk.warnings.log