Introduction to DAS Tool

DAS Tool

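DAS Tool is an automated method that integrates the results of multiple binning algorithms to obtain an optimized, non-redundant set of bins from a single assembly. Compared with other methods, it can reconstruct more near-complete genomes from soil metagenomes [1][2].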


Installation

DAS Tool can be installed from the Bioconda repository:

conda install -c bioconda das_tool

Usage

Basic usage

Example 1: Run DAS Tool on the binning results of MetaBAT, MaxBin, CONCOCT, and tetraESOM.

$ ./DAS_Tool -i \
sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,\
sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv,\
sample_data/sample.human.gut_metabat_scaffolds2bin.tsv,\
sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv \
        -l concoct,maxbin,metabat,tetraESOM \
        -c sample_data/sample.human.gut_contigs.fa \
        -o sample_output/DASToolRun1

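Here -i specifies the bin tables produced by the different binning tools; -l specifies the labels, that is, the name of the tool that produced each binning result, in the same order as the tables; -c specifies the contigs used for binning, in FASTA format; and -o specifies the prefix of the output files.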
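The value of -i has to reach DAS Tool as a single comma-separated argument. In a shell, any whitespace next to a comma or at the start of a continuation line splits it into several arguments. One way to avoid that is to build the list with a small helper. The following is a minimal sketch in POSIX sh; join_by_comma is a hypothetical helper name, not part of DAS Tool, and the command is only printed, not executed.

```shell
#!/bin/sh
# join_by_comma is a hypothetical helper, not part of DAS Tool.
# It joins all of its arguments with commas and no surrounding whitespace.
# The body runs in a subshell so the changed IFS does not leak out.
join_by_comma() (
    IFS=','
    printf '%s\n' "$*"
)

# File names taken from Example 1.
bins_arg=$(join_by_comma \
    sample_data/sample.human.gut_concoct_scaffolds2bin.tsv \
    sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv \
    sample_data/sample.human.gut_metabat_scaffolds2bin.tsv \
    sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv)

# Print the command instead of running it, so the sketch has no side effects.
echo ./DAS_Tool -i "$bins_arg" \
    -l concoct,maxbin,metabat,tetraESOM \
    -c sample_data/sample.human.gut_contigs.fa \
    -o sample_output/DASToolRun1
```

Removing the echo would run the command for real, provided DAS_Tool and the sample data are present in the current directory.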

Input files

Bins

Comma-separated list of bin tables

   -i, --bins                 methodA.scaffolds2bin,...,methodN.scaffolds2bin

Each table lists scaffold IDs and bin IDs separated by a tab ("\t"), for example:

Scaffold_1	bin.01
Scaffold_8	bin.01
Scaffold_42	bin.02
Scaffold_49	bin.03
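Before handing a table to DAS Tool, it can help to check that it really has this shape. The sketch below is one possible check, not something DAS Tool provides: check_scaffolds2bin is a hypothetical helper, and the rule that a scaffold may appear only once per table is an assumption made for this example.

```shell
#!/bin/sh
# check_scaffolds2bin is a hypothetical helper, not part of DAS Tool.
# It returns non-zero if any line does not consist of exactly two non-empty,
# tab-separated fields, or (assumption) if a scaffold ID occurs more than once.
check_scaffolds2bin() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected scaffold<TAB>bin\n", NR > "/dev/stderr"
            bad = 1
        }
        seen[$1]++ {
            printf "line %d: scaffold %s listed more than once\n", NR, $1 > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

A call such as `check_scaffolds2bin methodA.scaffolds2bin && echo ok` prints ok only when every line passes both checks.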

Contigs

Contigs in FASTA format

-c, --contigs              contigs.fa

This is the assembly file that was used for binning, for example:

>Scaffold_1
ATCATCGTCCGCATCGACGAATTCGGCGAACGAGTACCCCTGACCATCTCCGATTA...
>Scaffold_2
GATCGTCACGCAGGCTATCGGAGCCTCGACCCGCAAGCTCTGCGCCTTGGAGCAGG...
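Since the bin tables refer to scaffolds by ID, a quick consistency check is to list IDs from a table that have no matching header in the contig file. The following is a rough sketch: missing_scaffolds is a hypothetical helper, and comparing only the part of the FASTA header before the first whitespace is an assumption of this example, not documented DAS Tool behaviour.

```shell
#!/bin/sh
# missing_scaffolds is a hypothetical helper, not part of DAS Tool.
# Usage: missing_scaffolds contigs.fa table.tsv
# Prints every scaffold ID from the table that has no header in the FASTA file.
missing_scaffolds() {
    awk -F '\t' '
        # First file (FASTA): remember each header up to the first whitespace.
        NR == FNR {
            if (substr($0, 1, 1) == ">") {
                id = substr($0, 2)
                sub(/[ \t].*/, "", id)
                known[id] = 1
            }
            next
        }
        # Second file (table): report IDs that were never seen as a header.
        !($1 in known) { print $1 }
    ' "$1" "$2"
}
```

An empty result means every scaffold in the table was found in the contig file.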

(Optional) Proteins

Protein sequences predicted in advance

--proteins                 proteins.faa

The format is as follows:

>Scaffold_1_1
MPRKNKKLPRHLLVIRTSAMGDVAMLPHALRALKEAYPEVKVTVATKSLFHPFFEG...
>Scaffold_1_2
MANKIPRVPVREQDPKVRATNFEEVCYGYNVEEATLEASRCLNCKNPRCVAACPVN...
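In these headers the scaffold ID is followed by an underscore and a gene number (>scaffoldID_geneNo), so the scaffold a protein belongs to can be recovered by dropping the last numeric suffix. A minimal sketch of that mapping is shown here; protein_scaffold_ids is a hypothetical helper name and is not part of DAS Tool.

```shell
#!/bin/sh
# protein_scaffold_ids is a hypothetical helper, not part of DAS Tool.
# It reads a protein FASTA file and prints the unique scaffold IDs obtained
# by removing the trailing _geneNo suffix from each header.
protein_scaffold_ids() {
    awk '/^>/ {
        id = substr($1, 2)          # drop ">" and anything after whitespace
        sub(/_[0-9]+$/, "", id)     # Scaffold_1_2 -> Scaffold_1
        print id
    }' "$1" | sort -u
}
```

Comparing this list with the headers of the contig file is a quick way to see whether a protein file was predicted from the same assembly.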

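Output files

The output files include:

  • A summary of the binning results, including quality and completeness estimates (_DASTool_Summary.txt).
  • The scaffold-to-bin table of the bins selected by DAS Tool (_DASTool_scaffolds2bin.txt).
  • Optional
    • With --write_bin_evals 1 (default: 1), quality and completeness estimates for each input bin set (_[method].eval).
    • With --create_plots 1 (default: 1), plots showing the number of high-quality bins and the score distribution for each method (_DASTool_hqBins.pdf, _DASTool_scores.pdf).
    • With --write_bins 1 (default: 0), the bins in FASTA format (DASTool_Bins).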
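To get a quick overview of a finished run, the scaffolds per bin in the output table can be counted. The sketch below assumes that _DASTool_scaffolds2bin.txt uses the same two-column, tab-separated layout as the input tables; bin_sizes is a hypothetical helper, not part of DAS Tool.

```shell
#!/bin/sh
# bin_sizes is a hypothetical helper, not part of DAS Tool.
# Assumption: the file has two tab-separated columns, scaffold ID and bin ID.
# It prints one line per bin: bin ID, a tab, and the number of scaffolds.
bin_sizes() {
    awk -F '\t' '
        { n[$2]++ }
        END { for (b in n) printf "%s\t%d\n", b, n[b] }
    ' "$1" | sort
}
```

For the first run in this post it would be called as `bin_sizes sample_output/DASToolRun1_DASTool_scaffolds2bin.txt`.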

Detailed options

DAS_Tool -i methodA.scaffolds2bin,...,methodN.scaffolds2bin
         -l methodA,...,methodN -c contigs.fa -o myOutput

   -i, --bins                 Comma separated list of tab separated scaffolds to bin tables.
   -c, --contigs              Contigs in fasta format.
   -o, --outputbasename       Basename of output files.
   -l, --labels               Comma separated list of binning prediction names. (optional)
   --search_engine            Engine used for single copy gene identification [blast/diamond/usearch].
                              (default: usearch)
   --write_bin_evals          Write evaluation for each input bin set [0/1]. (default: 1)
   --create_plots             Create binning performance plots [0/1]. (default: 1)
   --write_bins               Export bins as fasta files  [0/1]. (default: 0)
   --proteins                 Predicted proteins in prodigal fasta format (>scaffoldID_geneNo).
                              Gene prediction step will be skipped if given. (optional)
   --score_threshold          Score threshold until selection algorithm will keep selecting bins [0..1].
                              (default: 0.5)
   --duplicate_penalty        Penalty for duplicate single copy genes per bin (weight b).
                              Only change if you know what you're doing. [0..3]
                              (default: 0.6)
   --megabin_penalty          Penalty for megabins (weight c). Only change if you know what you're doing. [0..3]
                              (default: 0.5)
   --db_directory             Directory of single copy gene database. (default: install_dir/db)
   --resume                   Use existing predicted single copy gene files from a previous run [0/1]. (default: 0)
   --debug                    Write debug information to log file.
   -t, --threads              Number of threads to use. (default: 1)
   -v, --version              Print version number and exit.
   -h, --help                 Show this message.

Example 2: Run DAS Tool again with different parameters. Use the proteins predicted in Example 1 to skip the gene prediction step, disable writing of bin evaluations, set the number of threads to 2 and score threshold to 0.6. Output files will start with the prefix DASToolRun2:

$ ./DAS_Tool -i sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,\
sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv,\
sample_data/sample.human.gut_metabat_scaffolds2bin.tsv,\
sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv \
             -l concoct,maxbin,metabat,tetraESOM \
             -c sample_data/sample.human.gut_contigs.fa \
             -o sample_output/DASToolRun2 \
             --proteins sample_output/DASToolRun1_proteins.faa \
             --write_bin_evals 0 \
             --threads 2 \
             --score_threshold 0.6

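Preparing input files

Not every binning tool writes its results as a tab-separated file of scaffold IDs and bin IDs. DAS Tool therefore ships a helper script, Fasta_to_Scaffolds2Bin, which converts a set of bins in FASTA format into a "scaffolds2bin" table that can be used as DAS Tool input.

Usage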

$ src/Fasta_to_Scaffolds2Bin.sh -h
Fasta_to_Scaffolds2Bin: Converts genome bins in fasta format to scaffolds-to-bin table.

Usage: Fasta_to_Scaffolds2Bin.sh -e fasta > my_scaffolds2bin.tsv

   -e, --extension            Extension of fasta files. (default: fasta)
   -i, --input_folder         Folder with bins in fasta format. (default: ./)
   -h, --help                 Show this message.

Example

$ ls /maxbin/output/folder
maxbin.001.fasta   maxbin.002.fasta   maxbin.003.fasta...

$ src/Fasta_to_Scaffolds2Bin.sh -i /maxbin/output/folder -e fasta > maxbin.scaffolds2bin.tsv

$ head maxbin.scaffolds2bin.tsv
NODE_10_length_127450_cov_375.783524	maxbin.001
NODE_27_length_95143_cov_427.155298	maxbin.001
NODE_51_length_78315_cov_504.322425	maxbin.001
NODE_84_length_66931_cov_376.684775	maxbin.001
NODE_87_length_65653_cov_460.202156	maxbin.001
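To make the conversion concrete, here is a rough sketch of what such a script has to do: for every FASTA file in a folder, emit one line per sequence header, pairing the scaffold ID with the file name minus its extension. This is an illustration written for this post, not the code of Fasta_to_Scaffolds2Bin.sh; the function name fasta_to_table and its argument order are assumptions.

```shell
#!/bin/sh
# fasta_to_table is a hypothetical re-implementation for illustration only;
# it is not the code of Fasta_to_Scaffolds2Bin.sh.
# Usage: fasta_to_table folder extension
# The body runs in a subshell so the loop variables stay local.
fasta_to_table() (
    for f in "$1"/*."$2"; do
        [ -e "$f" ] || continue        # the glob matched no files
        bin=$(basename "$f" ".$2")     # maxbin.001.fasta -> maxbin.001
        # Print "scaffoldID<TAB>binID" for every header line.
        awk -v bin="$bin" '/^>/ { print substr($1, 2) "\t" bin }' "$f"
    done
)
```

Called as `fasta_to_table /maxbin/output/folder fasta > maxbin.scaffolds2bin.tsv`, it would produce a table of the same shape as the one shown above.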

Problems

  1. The directory DASTool_output/ has to be created manually; otherwise nothing is written when the run finishes.
  2. A strange error appeared:
mv: cannot stat ‘DASTool_output/_proteins.faa.scg’: No such file or directory
mv: cannot stat ‘DASTool_output/_proteins.faa.scg’: No such file or directory
rm: cannot remove ‘DASTool_output/_proteins.faa.findSCG.b6’: No such file or directory
rm: cannot remove ‘DASTool_output/_proteins.faa.scg.candidates.faa’: No such file or directory
rm: cannot remove ‘DASTool_output/_proteins.faa.all.b6’: No such file or directory

The run succeeded after adding --search_engine diamond.
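For the first problem, the output directory can be created from the -o prefix before DAS Tool is started. The sketch below shows one way to do this in POSIX sh; ensure_output_dir is a hypothetical helper, not part of DAS Tool. It treats a prefix ending in "/" as the directory itself, because dirname would otherwise reduce "DASTool_output/" to ".".

```shell
#!/bin/sh
# ensure_output_dir is a hypothetical helper, not part of DAS Tool.
# Given an output prefix, it creates the directory the prefix points into.
ensure_output_dir() {
    case "$1" in
        */)  dir=$1 ;;                 # "DASTool_output/": the directory itself
        */*) dir=$(dirname "$1") ;;    # "sample_output/DASToolRun1": parent directory
        *)   dir=. ;;                  # bare prefix: current directory
    esac
    mkdir -p "$dir"
}
```

Running `ensure_output_dir DASTool_output/` before the DAS_Tool command removes the need to create the directory by hand.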

DAS_Tool -i MetaBat.scaffolds2bin.tsv,MaxBin.scaffolds2bin.tsv,CONCOCT.scaffolds2bin.tsv -l MetaBat,MaxBin,CONCOCT -c …/scaffold.fa -o DASTool_output/

srun -p small -n 4 --pty /bin/bash DAS_Tool -i MetaBat.scaffolds2bin.tsv,MaxBin.scaffolds2bin.tsv,CONCOCT.scaffolds2bin.tsv -l MetaBat,MaxBin,CONCOCT -c …/scaffold.fa -o DASTool_output/

checkM MaxBin scaffold_gene.faa
CONCOCT MetaBat scaffold_gene.gff
DAS_Tool scaffold.bam scaffold_gene.gtf
Mariana_TY42_1_paired.fq.gz scaffold.bam.bai scaffold.res
Mariana_TY42_1_unpaired.fq.gz scaffold.depth scaffold.res.summary
Mariana_TY42_2_paired.fq.gz scaffold.fa
Mariana_TY42_2_unpaired.fq.gz scaffold_gene_count.fa


  1. DAS Tool for Genome Reconstruction from Metagenomes. https://www.baidu.com/link?url=JbN0z_QhZbcz05SXOmXghq4KtVaCf00Tbp6YBX3qm3O6AB-yyFw2gN9XISe880jE3sylTvZ4mTI3k-XvDwzTg9D8mefZI0koVLxEVn_M6gk_jaRX6x8BXgfeRqsWaQmH&wd=&eqid=f554c6c4000ce067000000065eca2021

  2. Christian M. K. Sieber, Alexander J. Probst, Allison Sharrar, Brian C. Thomas, Matthias Hess, Susannah G. Tringe & Jillian F. Banfield (2018). Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nature Microbiology. https://doi.org/10.1038/s41564-018-0171-1
