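After sequencing and assembly, how complete is the resulting microbial genome, and how much contamination does it contain?
CheckM is a tool for assessing the quality of genomes recovered from isolates, single cells, and metagenomes.
CheckM estimates genome completeness and contamination from single-copy housekeeping (marker) genes that are specific to the genome's phylogenetic lineage.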
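To make the idea concrete, here is a minimal Python sketch of a marker-set-based estimate, following the approach described in the CheckM paper as I understand it. It is an illustration under stated assumptions, not CheckM's actual code: the function name `estimate_quality`, the marker names, and the copy counts are all invented for the example. Each set of co-located markers contributes the fraction of its genes that were found (completeness) and its number of extra copies relative to its size (contamination), and both values are averaged over all sets.

```python
def estimate_quality(marker_sets, copies):
    """Return (completeness %, contamination %) for one genome.

    marker_sets: list of sets of marker gene IDs (co-located marker sets).
    copies: dict mapping marker gene ID -> number of copies found in the genome.
    This is a simplified sketch; real CheckM finds markers with HMMs and
    applies further corrections that are not modelled here.
    """
    completeness = 0.0
    contamination = 0.0
    for markers in marker_sets:
        # Markers of this set seen at least once.
        found = sum(1 for m in markers if copies.get(m, 0) >= 1)
        # Every copy beyond the first counts as possible contamination.
        extra = sum(copies.get(m, 0) - 1 for m in markers if copies.get(m, 0) > 1)
        completeness += found / len(markers)
        contamination += extra / len(markers)
    n_sets = len(marker_sets)
    return 100.0 * completeness / n_sets, 100.0 * contamination / n_sets


# Hypothetical toy data: three marker sets, gene G missing, gene C duplicated.
toy_sets = [{"A", "B"}, {"C"}, {"D", "E", "F", "G"}]
toy_copies = {"A": 1, "B": 1, "C": 2, "D": 1, "E": 1, "F": 1}
completeness, contamination = estimate_quality(toy_sets, toy_copies)
print(f"completeness={completeness:.2f}% contamination={contamination:.2f}%")
# prints: completeness=91.67% contamination=33.33%
```

Averaging per set rather than per gene means that losing a cluster of neighbouring markers is not counted as several independent losses.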
Paper:
http://genome.cshlp.org/content/early/2015/05/14/gr.186072.114.abstract
Source code:
https://github.com/Ecogenomics/CheckM
Show the help for the taxonomy workflow: checkm taxonomy_wf -h
usage: checkm taxonomy_wf [-h] [--ali] [--nt] [-g] [--individual_markers]
                          [--skip_adj_correction]
                          [--skip_pseudogene_correction]
                          [--aai_strain AAI_STRAIN] [-a ALIGNMENT_FILE]
                          [--ignore_thresholds] [-e E_VALUE] [-l LENGTH]
                          [-c COVERAGE_FILE] [-f FILE] [--tab_table]
                          [-x EXTENSION] [-t THREADS] [-q] [--tmpdir TMPDIR]
                          {life,domain,phylum,class,order,family,genus,species}
                          taxon bin_folder out_folder

Runs taxon_set, analyze, qa

positional arguments:
  {life,domain,phylum,class,order,family,genus,species}
                        taxonomic rank
  taxon                 taxon of interest
  bin_folder            folder containing bins (fasta format)
  out_folder            folder to write output files

optional arguments:
  -h, --help            show this help message and exit
  --ali                 generate HMMER alignment file for each bin
  --nt                  generate nucleotide gene sequences for each bin
  -g, --genes           bins contain genes as amino acids instead of nucleotide contigs
  --individual_markers  treat marker as independent (i.e., ignore co-located set structure)
  --skip_adj_correction
                        do not exclude adjacent marker genes when estimating contamination
  --skip_pseudogene_correction
                        skip identification and filtering of pseudogenes
  --aai_strain AAI_STRAIN
                        AAI threshold used to identify strain heterogeneity (default: 0.9)
  -a, --alignment_file ALIGNMENT_FILE
                        produce file showing alignment of multi-copy genes and their AAI identity
  --ignore_thresholds   ignore model-specific score thresholds
  -e, --e_value E_VALUE
                        e-value cut off (default: 1e-10)
  -l, --length LENGTH   percent overlap between target and query (default: 0.7)
  -c, --coverage_file COVERAGE_FILE
                        file containing coverage of each sequence; coverage information added to table type 2 (see coverage command)
  -f, --file FILE       print results to file (default: stdout)
  --tab_table           print tab-separated values table
  -x, --extension EXTENSION
                        extension of bins (other files in folder are ignored) (default: fna)
  -t, --threads THREADS
                        number of threads (default: 1)
  -q, --quiet           suppress console output
  --tmpdir TMPDIR       specify an alternative directory for temporary files

Example: checkm taxonomy_wf domain Bacteria ./bins ./output
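When the workflow has to be run for many folders, it can help to assemble the command in a script. The sketch below builds the argument list in Python using only options that appear in the help text above. The helper name `build_taxonomy_wf_cmd` and the output file name `qa.tsv` are hypothetical, and the snippet only prints the command; running it requires CheckM to be installed and on the PATH.

```python
import shlex

# Ranks accepted by `checkm taxonomy_wf`, copied from the help text.
RANKS = ("life", "domain", "phylum", "class", "order", "family", "genus", "species")


def build_taxonomy_wf_cmd(rank, taxon, bin_folder, out_folder,
                          extension="fna", threads=1, table_file=None):
    """Build the argv list for `checkm taxonomy_wf` (hypothetical helper)."""
    if rank not in RANKS:
        raise ValueError(f"unknown rank: {rank!r}")
    cmd = ["checkm", "taxonomy_wf", "-x", extension, "-t", str(threads)]
    if table_file is not None:
        # Write a tab-separated table to a file instead of stdout.
        cmd += ["--tab_table", "-f", table_file]
    cmd += [rank, taxon, bin_folder, out_folder]
    return cmd


cmd = build_taxonomy_wf_cmd("domain", "Bacteria", "./bins", "./output",
                            threads=4, table_file="qa.tsv")
print(shlex.join(cmd))
# prints: checkm taxonomy_wf -x fna -t 4 --tab_table -f qa.tsv domain Bacteria ./bins ./output
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when folder names contain spaces.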
[2019-10-17 02:51:12] INFO: CheckM v1.0.18
[2019-10-17 02:51:12] INFO: checkm taxonomy_wf domain Bacteria . out
[2019-10-17 02:51:12] INFO: [CheckM - taxon_set] Generate taxonomic-specific marker set.
[2019-10-17 02:51:15] INFO: Marker set for Bacteria contains 104 marker genes arranged in 58 sets.
[2019-10-17 02:51:15] INFO: Marker set inferred from 5449 reference genomes.
[2019-10-17 02:51:15] INFO: Marker set written to: out/Bacteria.ms
[2019-10-17 02:51:15] INFO: { Current stage: 0:00:03.175 || Total: 0:00:03.175 }
[2019-10-17 02:51:15] INFO: [CheckM - analyze] Identifying marker genes in bins.
[2019-10-17 02:51:15] INFO: Identifying marker genes in 1 bins with 1 threads:
Finished processing 1 of 1 (100.00%) bins.
[2019-10-17 02:52:05] INFO: Saving HMM info to file.
[2019-10-17 02:52:05] INFO: { Current stage: 0:00:49.981 || Total: 0:00:53.156 }
[2019-10-17 02:52:05] INFO: Parsing HMM hits to marker genes:
Finished parsing hits for 1 of 1 (100.00%) bins.
[2019-10-17 02:52:05] INFO: Aligning marker genes with multiple hits in a single bin:
Finished processing 1 of 1 (100.00%) bins.
[2019-10-17 02:52:05] INFO: { Current stage: 0:00:00.343 || Total: 0:00:53.500 }
[2019-10-17 02:52:05] INFO: Calculating genome statistics for 1 bins with 1 threads:
Finished processing 1 of 1 (100.00%) bins.
[2019-10-17 02:52:05] INFO: { Current stage: 0:00:00.246 || Total: 0:00:53.747 }
[2019-10-17 02:52:05] INFO: [CheckM - qa] Tabulating genome statistics.
[2019-10-17 02:52:05] INFO: Calculating AAI between multi-copy marker genes.
[2019-10-17 02:52:05] INFO: Reading HMM info from file.
[2019-10-17 02:52:05] INFO: Parsing HMM hits to marker genes:
Finished parsing hits for 1 of 1 (100.00%) bins.
-------------------------------------------------------------------------------------------------------------------------------------------------------
Bin Id          Marker lineage   # genomes   # markers   # marker sets   0   1     2   3   4   5+   Completeness   Contamination   Strain heterogeneity
-------------------------------------------------------------------------------------------------------------------------------------------------------
GCA_900517465   Bacteria         5449        104         58              3   101   0   0   0   0    97.07          0.00            0.00
-------------------------------------------------------------------------------------------------------------------------------------------------------
[2019-10-17 02:52:06] INFO: { Current stage: 0:00:00.330 || Total: 0:00:54.077 }
In this example, completeness is 97.07% and contamination is 0.00%, which is a good result.
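With many bins it is easier to read these two columns programmatically. The following sketch parses a tab-separated table such as the one produced with --tab_table and keeps the bins that pass a quality filter. It assumes the column headers match those printed in the table above. The function name `passing_bins` is hypothetical, and the thresholds (completeness of at least 90%, contamination of at most 5%) are illustrative choices, not values taken from this post.

```python
import csv
import io


def passing_bins(handle, min_completeness=90.0, max_contamination=5.0):
    """Return (bin id, completeness, contamination) for bins that pass.

    handle: open text file with a tab-separated CheckM qa table.
    The thresholds are illustrative defaults, not CheckM settings.
    """
    keep = []
    for row in csv.DictReader(handle, delimiter="\t"):
        comp = float(row["Completeness"])
        cont = float(row["Contamination"])
        if comp >= min_completeness and cont <= max_contamination:
            keep.append((row["Bin Id"], comp, cont))
    return keep


# The single row from the run above, rebuilt as tab-separated text.
header = ["Bin Id", "Marker lineage", "# genomes", "# markers", "# marker sets",
          "0", "1", "2", "3", "4", "5+",
          "Completeness", "Contamination", "Strain heterogeneity"]
row = ["GCA_900517465", "Bacteria", "5449", "104", "58",
       "3", "101", "0", "0", "0", "0", "97.07", "0.00", "0.00"]
sample = "\t".join(header) + "\n" + "\t".join(row) + "\n"

good = passing_bins(io.StringIO(sample))
print(good)
```

For a real run, replace the `io.StringIO` sample with `open(path, newline="")` on the file written via -f.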
CheckM has many other functions, for example calculating the percentage of reads that map to each bin of a metagenomic binning result.
checkm coverage -h
usage: checkm coverage [-h] [-x EXTENSION] [-r] [-a MIN_ALIGN]
                       [-e MAX_EDIT_DIST] [-m MIN_QC] [-t THREADS] [-q]
                       bin_dir output_file bam_files [bam_files ...]

Calculate coverage of sequences.

positional arguments:
  bin_dir               directory containing bins (fasta format)
  output_file           print results to file
  bam_files             BAM files to parse

optional arguments:
  -h, --help            show this help message and exit
  -x, --extension EXTENSION
                        extension of bins (other files in directory are ignored) (default: fna)
  -r, --all_reads       use all reads to estimate coverage instead of just those in proper pairs
  -a, --min_align MIN_ALIGN
                        minimum alignment length as percentage of read length (default: 0.98)
  -e, --max_edit_dist MAX_EDIT_DIST
                        maximum edit distance as percentage of read length (default: 0.02)
  -m, --min_qc MIN_QC   minimum quality score (in phred) (default: 15)
  -t, --threads THREADS
                        number of threads (default: 1)
  -q, --quiet           suppress console output

Example: checkm coverage ./bins coverage.tsv example_1.bam example_2.bam
checkm coverage -x fa bins/ ~/cov.tsv sorted.bam
Sequence Id   Bin Id     Sequence length (bp)   Bam Id   Coverage     Mapped reads
k141_14826    unbinned   522                    test     0.000000     0
k141_8961     unbinned   539                    test     18.918367    68
k141_66057    unbinned   597                    test     8.040201     32
k141_28930    unbinned   978                    test     50.460123    329
k141_25826    unbinned   1090                   test     8.669725     63
k141_63496    unbinned   510                    test     2.352941     8
k141_25829    unbinned   1023                   test     12.167155    83
k141_63494    unbinned   558                    test     6.182796     23
k141_63495    unbinned   1399                   test     448.277341   4181
k141_63493    unbinned   703                    test     6.827881     32
k141_63490    unbinned   841                    test     9.096314     51
k141_66058    bin.39     6959                   test     18.923983    878
k141_28933    unbinned   1195                   test     12.175732    97
k141_63498    unbinned   557                    test     5.655296     21
k141_63499    unbinned   614                    test     8.550489     35
k141_14821    unbinned   1130                   test     82.559292    622
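The per-sequence rows above can be summed per bin to see how the mapped reads are distributed. The sketch below computes, for each Bin Id, the number of mapped reads and its share of all reads mapped to the assembly, that is (reads mapped to bin) / (total number of reads mapped to assembly). It assumes the coverage file is tab-separated with the header shown above and contains a single BAM file. The function name `percent_mapped_per_bin` is hypothetical, and only four rows of the table are used as sample data.

```python
import csv
import io
from collections import defaultdict


def percent_mapped_per_bin(handle):
    """Map each Bin Id to (mapped reads, % of all mapped reads).

    handle: open text file in the single-BAM `checkm coverage` layout.
    Unbinned sequences are kept so that they count toward the total.
    """
    reads = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        reads[row["Bin Id"]] += int(row["Mapped reads"])
    total = sum(reads.values())
    return {
        bin_id: (n, 100.0 * n / total if total else 0.0)
        for bin_id, n in reads.items()
    }


# Four rows taken from the table above, rebuilt as tab-separated text.
sample = (
    "Sequence Id\tBin Id\tSequence length (bp)\tBam Id\tCoverage\tMapped reads\n"
    "k141_8961\tunbinned\t539\ttest\t18.918367\t68\n"
    "k141_28930\tunbinned\t978\ttest\t50.460123\t329\n"
    "k141_66058\tbin.39\t6959\ttest\t18.923983\t878\n"
    "k141_14821\tunbinned\t1130\ttest\t82.559292\t622\n"
)

result = percent_mapped_per_bin(io.StringIO(sample))
for bin_id, (n_reads, pct) in sorted(result.items()):
    print(f"{bin_id}\t{n_reads}\t{pct:.2f}")
```

This reproduces only the mapped-reads ratio; the other columns reported by CheckM's profile command are not modelled here.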
checkm profile ~/cov.tsv
# This reports % mapped reads: (reads mapped to bin) / (total number of reads mapped to assembly)
[2019-10-17 03:25:40] INFO: CheckM v1.0.18
[2019-10-17 03:25:40] INFO: checkm profile /data/home/liufei/cov.tsv
[2019-10-17 03:25:40] INFO: [CheckM - profile] Calculating percentage of reads mapped to each bin.
[2019-10-17 03:25:40] INFO: Determining number of reads mapped to each bin.
---------------------------------------------------------------------------------------------------------------------
Bin Id   Bin size (Mbp)   test: mapped reads   test: % mapped reads   test: % binned populations   test: % community
---------------------------------------------------------------------------------------------------------------------
bin.1    0.22             586765               0.59                   2.15                         1.36
bin.10   0.86             542830               0.54                   0.52                         0.33
bin.11   0.29             2445260              2.45                   6.93                         4.39
bin.12   0.24             70318                0.07                   0.24                         0.15
bin.13   0.33             815391