EukCC2是一个基于python编写的用于评估真核生物MAGs完整度和污染度的软件。可以评估binning后的单个bin或者bins目录。
其原理是基于动态变化的单拷贝标记基因集(SCMGs),包括基础真核生物、真菌、原生动物及植物的SCMGs。
SCMGs的原理在于基于特异物种谱系通用的单个拷贝的基因,所以可以比较预测的基因数量来评估完整性,而额外的SCMGs则是污染序列。
(base) [yut@node02]$ mamba create -n eukcc -c conda-forge -c bioconda "eukcc>=2"
$ conda activate eukcc
(eukcc) [yut@node02]$ eukcc --version
EukCC version 2.1.0
# pip install eukcc
参考https://eukcc.readthedocs.io/en/latest/quickstart.html
一次性为EukCC配置数据库(注意事项,版本 1 的数据库不兼容。因此,在升级到 EukCC2 后,请务必升级数据库)。
获取数据库非常简单:
mkdir eukccdb
cd eukccdb
wget http://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.2.tar.gz
tar -xzvf eukcc2_db_ver_1.1.tar.gz
# 在.bashrc配置全局变量
export EUKCC2_DB=/path/to/.../eukcc2_db_ver_1.1
参考http://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/
EukCC有两种模式。可以在一个单独的bin上或在一个bins的文件夹上运行EukCC。
在单个bin上运行EukCC给了最多的选择,可以根据你的需要调整参数。对于大多数元基因组工作流程来说,在一个bin的文件夹上运行EukCC可能是最简单的事情。
假设已将 $EUKCC2_DB 设置到正确位置。如果没有,请使用 --db 标志将数据库传递给 EukCC。
eukcc single --out outfolder --threads 8 bin.fa
EukCC将在8个线程上运行。你可以将核苷酸fastas或蛋白质组传递到EukCC。它会自动检测是否需要预测蛋白质。
默认情况下,它不会使用多于一个线程来将基因组放置在参考树中,以节省内存。
(eukcc) [yutao@myosin eukcc]$ time eukcc single --out yeast_out -t 50 Yeast_GCF_000149845.2_SJ5_genomic.fna
23-09-2022 16:08:01: EukCC version 2.1.0
23-09-2022 16:08:01: Set sequence type to DNA
23-09-2022 16:08:22: Doing a first pass placement in the base database
23-09-2022 16:08:22: Searching for marker genes in base database
23-09-2022 16:08:23: Found 21 marker genes, placing them in the tree
23-09-2022 16:09:11: Genome belongs to clade: fungi (Best TaxID: 4894)
23-09-2022 16:09:11: Searching for marker genes in fungi database
23-09-2022 16:09:12: Found 11 marker genes, placing them in the tree using epa-ng
23-09-2022 16:09:27: Automatically locating best SCMG set
23-09-2022 16:09:57: Searching fasta for selected markers
23-09-2022 16:10:44: Completeness: 100.0
23-09-2022 16:10:44: Contamination: 0.0
23-09-2022 16:10:44: Max silent contamination: 4.59
23-09-2022 16:10:44: Wrote output to: /home/test/eukcc/yeast_out/eukcc.csv
real 2m44.545s
user 13m59.558s
sys 0m58.519s
$ cat
(eukcc) [yutao@myosin yeast_out]$ cat eukcc.csv
fasta completeness contamination
/home/test/eukcc/Yeast_GCF_000149845.2_SJ5_genomic.fna 100.0 0.0
EukCC假定文件夹包含后缀为.fa的文件。如果不是这样,请使用–suffix参数。在目录模式下,EukCC也会尝试自动优化bins。
eukcc folder --out outfolder --threads 8 bins
# --suffix SUFFIX Suffix (default: .fa)
(eukcc) [yut@myosin eukcc]$ time eukcc folder --out yeast_bins1 -t 50 yeast_bins1
23-09-2022 16:33:58: EukCC version 2.1.0
23-09-2022 16:33:58: Found 2 bins
23-09-2022 16:34:19: Searching for marker genes in base database
23-09-2022 16:34:21: Found 15 marker genes, placing them in the tree using epa-ng
23-09-2022 16:34:43: Genome belongs to clade: fungi (Best TaxID: 4894)
23-09-2022 16:34:43: Searching for marker genes in fungi database
23-09-2022 16:34:44: Found 10 marker genes, placing them in the tree using epa-ng
23-09-2022 16:34:57: Automatically locating best SCMG set
23-09-2022 16:35:16: Searching fasta for selected markers
23-09-2022 16:36:00: Completeness: 71.0
23-09-2022 16:36:00: Contamination: 0.0
23-09-2022 16:36:00: Max silent contamination: 100.0
23-09-2022 16:36:00: Searching for marker genes in base database
23-09-2022 16:36:02: Found 6 marker genes, placing them in the tree using epa-ng
23-09-2022 16:36:17: Genome belongs to clade: fungi (Best TaxID: 4894)
23-09-2022 16:36:17: Searching for marker genes in fungi database
23-09-2022 16:36:18: Found 1 marker genes, placing them in the tree using epa-ng
23-09-2022 16:36:20: Automatically locating best SCMG set
23-09-2022 16:36:38: Searching fasta for selected markers
23-09-2022 16:37:10: Completeness: 29.0
23-09-2022 16:37:10: Contamination: 0.0
23-09-2022 16:37:10: Max silent contamination: 100.0
23-09-2022 16:37:10: Found 1 large bins to merge with
23-09-2022 16:37:10: For bin 1.bin.fa we found 1 merging combinations
23-09-2022 16:37:10: Created 0 merged bins
real 3m13.709s
bins(yeast) | eukcc comp(%) | eukcc cont(%) | merge |
---|---|---|---|
7.8 + 3.5M | 71 + 29 | 0 | no |
12M+64k | 100+0 | 0 | no |
综上,单纯的folder并没有将拆分的2部分酵母合并在一起
如果EukCC在文件夹模式下运行,它可以尝试再合并两个更多的bins,以创建一个增加完整性的精炼/合并版本。
为此,你可以而且应该将pair reads信息传递给EukCC。因此,只有由至少100个(默认)reads连接的bins才被考虑合并。这极大地提高了速度和准确性。
Preparing your linked reads:
如果你有paired-end read数据,你应该创建一个排序过的比对。如果你有多个读文件,你可以创建多个BAM文件。
为此,你将需要用来创建这个bins的contigs。或者,将所有的bins合并成一个伪组装基因组文件。
cat binfolder/*.fa > pseudo_contigs.fasta
bwa index pseudo_contigs.fasta
bwa mem -t 8 pseudo_contigs.fasta reads_1.fastq.gz reads_2.fastq.gz |
samtools view -q 20 -Sb - |
samtools sort -@ 8 -O bam - -o alignment.bam
samtools index alignment.bam
然后,您可以使用EukCC提供的脚本创建一个bin链接表
binlinks.py --ANI 99 --within 1500 \
--out linktable.csv binfolder alignment.bam
如果你有多个bam文件,把它们都传递给脚本(例如*.bam)。
您将获得一个三列文件(bin 1,bin 2,links)。
你可以像这样在相同的文件夹上启动EukCC:
eukcc folder \
--out outfolder \
--threads 8 \
--links linktable.csv \
binfolder
EukCC将对所有的bins单独运行。然后,它将识别中等质量的bins,这些bins至少有50%的完整度,但还没有超过100%的改进率。然后,它将识别出至少有100个paid-end reads与这些中等质量仓相连接的bin。如果合并后质量得分上升,这个bin将被合并。
合并的箱子可以在输出文件夹中找到。
合并超过两个bins。因此,将–n_combine设置为任何高于1的值都是试验性的,还不建议使用。我们在合并两个bins时有很好的结果。
我在地衣研究ERP123954的基础上创建了示例数据来测试。使用CONCOCT创建了分组,但任何没有原核生物偏见的分组都可以。
wget ...
gunzip eukcc_example_folder_GT57.zip
# Use at least a couple threads to speed it up
eukcc folder --threads 6 \
--out output \
--links eukcc_example_folder_GT57/links.csv \
--n_combine 1 \
eukcc_example_folder_GT57/bins
(eukcc) [yut@io02 TestData]$ ll fungi_genome/
总用量 97M
-rw-r--r-- 1 yut lzdx 52M 11月 9 21:03 SX22-37_unicycler_for_only_trimgalore_assembly.fasta
-rw-r--r-- 1 yut lzdx 45M 11月 9 21:03 SX22-37_unicycler_for_trimgalore_and_kraken2_filter_by_PlusPFP_assembly.fasta
(eukcc) [yut@io02 TestData]$ time eukcc folder --out eukcc_out --threads 8 fungi_genome/ --suffix .fasta
09-11-2023 21:05:13: EukCC version 2.1.0
09-11-2023 21:05:13: Found 2 bins
09-11-2023 21:10:50: Searching for marker genes in base database
09-11-2023 21:10:53: Found 23 marker genes, placing them in the tree using epa-ng
NCBI database not present yet (first time used?)
Downloading taxdump.tar.gz from NCBI FTP site (via HTTP)...
Done. Parsing...
Loading node names...
2541725 names loaded.
313200 synonyms loaded.
Loading nodes...
2541725 nodes loaded.
Linking nodes...
Tree is loaded.
Updating database: /datanode02/yut/.etetoolkit/taxa.sqlite ...
2541000 generating entries...
Uploading to /datanode02/yut/.etetoolkit/taxa.sqlite
Inserting synonyms: 310000
Inserting taxid merges: 70000
Inserting taxids: 2540000
/datanode02/yut/Software/Miniconda3/envs/eukcc/lib/python3.10/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 13349 was translated into 2813651
warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
/datanode02/yut/Software/Miniconda3/envs/eukcc/lib/python3.10/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 134006 was translated into 479712
warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
/datanode02/yut/Software/Miniconda3/envs/eukcc/lib/python3.10/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 134013 was translated into 223615
warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
/datanode02/yut/Software/Miniconda3/envs/eukcc/lib/python3.10/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 453998 was translated into 2049356
warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
/datanode02/yut/Software/Miniconda3/envs/eukcc/lib/python3.10/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 2498572 was translated into 1280935
warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
/datanode02/yut/Software/Miniconda3/envs/eukcc/lib/python3.10/site-packages/ete3/ncbi_taxonomy/ncbiquery.py:243: UserWarning: taxid 91191 was translated into 3075448
warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
09-11-2023 21:13:43: Genome belongs to clade: fungi (Best TaxID: 36048)
09-11-2023 21:13:43: Searching for marker genes in fungi database
09-11-2023 21:13:45: Found 10 marker genes, placing them in the tree using epa-ng
09-11-2023 21:14:07: Automatically locating best SCMG set
09-11-2023 21:14:36: Searching fasta for selected markers
09-11-2023 21:15:19: Completeness: 96.62
09-11-2023 21:15:19: Contamination: 9.02
09-11-2023 21:15:19: Max silent contamination: 100.0
09-11-2023 21:15:20: Searching for marker genes in base database
09-11-2023 21:15:22: Found 19 marker genes, placing them in the tree using epa-ng
09-11-2023 21:16:05: Genome belongs to clade: fungi (Best TaxID: 147538)
09-11-2023 21:16:05: Searching for marker genes in fungi database
09-11-2023 21:16:08: Found 10 marker genes, placing them in the tree using epa-ng
09-11-2023 21:16:29: Automatically locating best SCMG set
09-11-2023 21:16:57: Searching fasta for selected markers
09-11-2023 21:17:35: Completeness: 88.72
09-11-2023 21:17:35: Contamination: 0.75
09-11-2023 21:17:35: Max silent contamination: 100.0
09-11-2023 21:17:35: Found 2 large bins to merge with
09-11-2023 21:17:35: For bin SX22-37_unicycler_for_only_trimgalore_assembly.fasta we found 1 merging combinations
09-11-2023 21:17:35: For bin SX22-37_unicycler_for_trimgalore_and_kraken2_filter_by_PlusPFP_assembly.fasta we found 1 merging combinations
09-11-2023 21:17:35: Created 0 merged bins
real 12m32.333s
user 50m2.338s
sys 2m0.765s
(eukcc) [yut@io02 TestData]$ column -t eukcc_out/eukcc.csv
bin completeness contamination
SX22-37_unicycler_for_only_trimgalore_assembly.fasta 96.62 9.02
SX22-37_unicycler_for_trimgalore_and_kraken2_filter_by_PlusPFP_assembly.fasta 88.72 0.75
(eukcc) [yut@node02 eukccdb]$ eukcc single --help
usage: eukcc single [-h] [--out OUT] [--db DB] [--threads THREADS] [--threads_epa THREADS_EPA] [--DNA] [--AA] [--taxids TAXIDS [TAXIDS ...]] [--set_size SET_SIZE] [--use_placement USE_PLACEMENT] [--set_number_species SET_NUMBER_SPECIES]
[--marker_prevalence MARKER_PREVALENCE] [--max_set_size MAX_SET_SIZE] [--select_best_guess] [--select_species] [--use_ncbi_tree] [--gmes] [--ignore_tree] [--simple] [--clade CLADE] [--rerun] [--no_dynamic_root] [--extra]
[--keep]
fasta
positional arguments:
fasta Estimate quality of this single bin (fasta file) # 必须参数
options:
-h, --help show this help message and exit
--out OUT, -o OUT Path to output folder (Default: .)
--db DB Path to EukCC DB
--threads THREADS, -t THREADS
Number of threads to use (Default: 1)
--threads_epa THREADS_EPA
Number of threads to use for epa-ng, recommended: 1 (Default: 1)
--DNA The fasta file contains DNA sequenes
--AA The fasta file contains amino acid sequences
--taxids TAXIDS [TAXIDS ...]
Taxids to use as set starting point
--set_size SET_SIZE Minimal number of marker genes to use (Default: 20)
--use_placement USE_PLACEMENT
Path to previous result file, to use exact same marker gene set
--set_number_species SET_NUMBER_SPECIES
Minimal number of species to define a set. Reduce this if no sets can be found (Default: 3)
--marker_prevalence MARKER_PREVALENCE
Percentage of species in which markers should be found (Default: 95)
--max_set_size MAX_SET_SIZE
Maximal number of marker genes used, set to 0 to include all possible marker genes (Default: 500)
--select_best_guess Use best guess to select marker gene set (Default)
--select_species Use species count to select best marker gene set (Default: best guess)
--use_ncbi_tree Instead of using the EukCC phylogenetic tree, rely on NCBI taxids (default: False)
--gmes Use GeneMark-ES instead of metaeuk (much slower) (default: False)
--ignore_tree Advanced option, mainly for debugging. Can ignore the tree if genomes are knwon via taxids for example
--simple Use global DB instead of clade specific dbs (faster, not suitable for protozoa)
--clade CLADE Define clade as base, fungi, protozoa or plants
--rerun, -r Rerun and remove any previously computed data in the target folder
--no_dynamic_root Do not re-root tree dynamically, to get best set detection (default: True)
--extra Produce extra output files.
--keep Keep workdir after the run.
(eukcc) [yut@node02 eukccdb]$ eukcc folder --help
usage: eukcc folder [-h] [--links LINKS] [--min_links MIN_LINKS] [--prefix PREFIX] [--suffix SUFFIX] [--out OUT] [--db DB] [--improve_ratio IMPROVE_RATIO] [--improve_percent IMPROVE_PERCENT] [--n_combine N_COMBINE] [--threads THREADS]
[--threads_epa THREADS_EPA] [--marker_prevalence MARKER_PREVALENCE]
binfolder
positional arguments:
binfolder Run script on bins in this folder # 必须参数
options:
-h, --help show this help message and exit
--links LINKS Path to a link table generated with bamlinks.py. If suuplied paired reads will be used to refine bins (Recommended)
--min_links MIN_LINKS
Number of paired reads matching between bins for merging to happen (default: 100)
--prefix PREFIX Prefix to add for merged bins (default: merged.)
--suffix SUFFIX Suffix (default: .fa) # 输入bins文件名的后缀,默认.fa
--out OUT, -o OUT Path to output folder (Default: .)
--db DB Path to EukCC DB # 未设置EUKCC2_DB环境变量时需要指定该参数
--improve_ratio IMPROVE_RATIO
Ratio of completeness to contamination change (Default: 5)
--improve_percent IMPROVE_PERCENT
A merger must increase completeness at least by n percent (Default: 10)
--n_combine N_COMBINE
How many small bins should be merged into a medium sized bin (Default: 1)
--threads THREADS, -t THREADS
Number of threads to use (Default: 1)
--threads_epa THREADS_EPA
Number of threads to use for epa-ng, recommended: 1 (Default: 1)
--marker_prevalence MARKER_PREVALENCE
Percentage of species in which markers should be found (Default: 95)
Saary, Paul, Alex L. Mitchell, and Robert D. Finn. “Estimating the quality of eukaryotic genomes recovered from metagenomic analysis with EukCC.” Genome biology 21.1 (2020): 1-21.
EukCC GitHub