Assessing microbial genome completeness with CheckM

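After sequencing and assembly, how complete is the resulting microbial genome sequence, and how much contamination does it contain?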

CheckM is a tool for assessing the quality of genomes recovered from isolates, single cells, and metagenomes.

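CheckM estimates the completeness and contamination of a genome from single-copy housekeeping (marker) genes that are chosen according to the genome's phylogenetic lineage.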
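
The idea can be illustrated with a minimal sketch. It assumes the set-based estimate described in the CheckM paper: each collocated marker set contributes the fraction of its genes that were found, and extra copies of a gene count toward contamination. The gene names and copy counts are made up for illustration, and the real tool additionally selects lineage-specific markers and applies corrections that are not modeled here.

```python
from typing import Dict, List, Tuple


def estimate_quality(marker_sets: List[List[str]],
                     copy_counts: Dict[str, int]) -> Tuple[float, float]:
    """Return (completeness %, contamination %) from marker gene copy counts."""
    completeness = 0.0
    contamination = 0.0
    for marker_set in marker_sets:
        # Genes of this set that were found at least once.
        present = sum(1 for g in marker_set if copy_counts.get(g, 0) >= 1)
        # Copies beyond the first one are treated as contamination.
        extra = sum(max(copy_counts.get(g, 0) - 1, 0) for g in marker_set)
        completeness += present / len(marker_set)
        contamination += extra / len(marker_set)
    n_sets = len(marker_sets)
    return 100.0 * completeness / n_sets, 100.0 * contamination / n_sets


# Hypothetical toy data: three marker sets, five marker genes in total.
toy_sets = [["geneA", "geneB"], ["geneC"], ["geneD", "geneE"]]
toy_counts = {"geneA": 1, "geneB": 1, "geneC": 2, "geneD": 1}  # geneE is missing

comp, cont = estimate_quality(toy_sets, toy_counts)
print(f"completeness={comp:.2f}% contamination={cont:.2f}%")
```

In the toy data, the missing geneE lowers completeness and the duplicated geneC raises contamination, which mirrors how the 0 and 2+ columns of the CheckM table should be read.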

Paper:

http://genome.cshlp.org/content/early/2015/05/14/gr.186072.114.abstract

Source code:

https://github.com/Ecogenomics/CheckM

checkm taxonomy_wf -h prints the help message:


usage: checkm taxonomy_wf [-h] [--ali] [--nt] [-g] [--individual_markers]
                          [--skip_adj_correction]
                          [--skip_pseudogene_correction]
                          [--aai_strain AAI_STRAIN] [-a ALIGNMENT_FILE]
                          [--ignore_thresholds] [-e E_VALUE] [-l LENGTH]
                          [-c COVERAGE_FILE] [-f FILE] [--tab_table]
                          [-x EXTENSION] [-t THREADS] [-q] [--tmpdir TMPDIR]
                          {life,domain,phylum,class,order,family,genus,species}
                          taxon bin_folder out_folder

Runs taxon_set, analyze, qa

positional arguments:
  {life,domain,phylum,class,order,family,genus,species}
                        taxonomic rank
  taxon                 taxon of interest
  bin_folder            folder containing bins (fasta format)
  out_folder            folder to write output files

optional arguments:
  -h, --help            show this help message and exit
  --ali                 generate HMMER alignment file for each bin
  --nt                  generate nucleotide gene sequences for each bin
  -g, --genes           bins contain genes as amino acids instead of nucleotide contigs
  --individual_markers  treat marker as independent (i.e., ignore co-located set structure)
  --skip_adj_correction
                        do not exclude adjacent marker genes when estimating contamination
  --skip_pseudogene_correction
                        skip identification and filtering of pseudogenes
  --aai_strain AAI_STRAIN
                        AAI threshold used to identify strain heterogeneity (default: 0.9)
  -a, --alignment_file ALIGNMENT_FILE
                        produce file showing alignment of multi-copy genes and their AAI identity
  --ignore_thresholds   ignore model-specific score thresholds
  -e, --e_value E_VALUE
                        e-value cut off (default: 1e-10)
  -l, --length LENGTH   percent overlap between target and query (default: 0.7)
  -c, --coverage_file COVERAGE_FILE
                        file containing coverage of each sequence; coverage information added to table type 2 (see coverage command)
  -f, --file FILE       print results to file (default: stdout)
  --tab_table           print tab-separated values table
  -x, --extension EXTENSION
                        extension of bins (other files in folder are ignored) (default: fna)
  -t, --threads THREADS
                        number of threads (default: 1)
  -q, --quiet           suppress console output
  --tmpdir TMPDIR       specify an alternative directory for temporary files

Example: checkm taxonomy_wf domain Bacteria ./bins ./output
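
When the workflow is launched from a script, it can help to assemble the command line in one place. The following is only a sketch: the helper name build_taxonomy_wf_cmd and the file name qa.tsv are hypothetical, while the flags (-x, -t, --tab_table, -f) and the positional arguments are taken from the help text above.

```python
import shlex
from typing import List, Optional

# Ranks accepted by taxonomy_wf according to the help text.
VALID_RANKS = {"life", "domain", "phylum", "class",
               "order", "family", "genus", "species"}


def build_taxonomy_wf_cmd(rank: str, taxon: str, bin_folder: str,
                          out_folder: str, extension: str = "fna",
                          threads: int = 1,
                          table_file: Optional[str] = None) -> List[str]:
    """Build the argument list for `checkm taxonomy_wf` (hypothetical helper)."""
    if rank not in VALID_RANKS:
        raise ValueError(f"unknown taxonomic rank: {rank}")
    cmd = ["checkm", "taxonomy_wf", "-x", extension, "-t", str(threads)]
    if table_file is not None:
        # Write the result table as tab-separated values to a file.
        cmd += ["--tab_table", "-f", table_file]
    cmd += [rank, taxon, bin_folder, out_folder]
    return cmd


cmd = build_taxonomy_wf_cmd("domain", "Bacteria", "./bins", "./output",
                            extension="fa", threads=8, table_file="qa.tsv")
print(" ".join(shlex.quote(part) for part in cmd))

# To actually run it (requires CheckM to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```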

[2019-10-17 02:51:12] INFO: CheckM v1.0.18
[2019-10-17 02:51:12] INFO: checkm taxonomy_wf domain Bacteria . out
[2019-10-17 02:51:12] INFO: [CheckM - taxon_set] Generate taxonomic-specific marker set.
[2019-10-17 02:51:15] INFO: Marker set for Bacteria contains 104 marker genes arranged in 58 sets.
[2019-10-17 02:51:15] INFO: Marker set inferred from 5449 reference genomes.
[2019-10-17 02:51:15] INFO: Marker set written to: out/Bacteria.ms
[2019-10-17 02:51:15] INFO: { Current stage: 0:00:03.175 || Total: 0:00:03.175 }
[2019-10-17 02:51:15] INFO: [CheckM - analyze] Identifying marker genes in bins.
[2019-10-17 02:51:15] INFO: Identifying marker genes in 1 bins with 1 threads:
   Finished processing 1 of 1 (100.00%) bins.
[2019-10-17 02:52:05] INFO: Saving HMM info to file.
[2019-10-17 02:52:05] INFO: { Current stage: 0:00:49.981 || Total: 0:00:53.156 }
[2019-10-17 02:52:05] INFO: Parsing HMM hits to marker genes:
   Finished parsing hits for 1 of 1 (100.00%) bins.
[2019-10-17 02:52:05] INFO: Aligning marker genes with multiple hits in a single bin:
   Finished processing 1 of 1 (100.00%) bins.
[2019-10-17 02:52:05] INFO: { Current stage: 0:00:00.343 || Total: 0:00:53.500 }
[2019-10-17 02:52:05] INFO: Calculating genome statistics for 1 bins with 1 threads:
   Finished processing 1 of 1 (100.00%) bins.
[2019-10-17 02:52:05] INFO: { Current stage: 0:00:00.246 || Total: 0:00:53.747 }
[2019-10-17 02:52:05] INFO: [CheckM - qa] Tabulating genome statistics.
[2019-10-17 02:52:05] INFO: Calculating AAI between multi-copy marker genes.
[2019-10-17 02:52:05] INFO: Reading HMM info from file.
[2019-10-17 02:52:05] INFO: Parsing HMM hits to marker genes:
   Finished parsing hits for 1 of 1 (100.00%) bins.
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Bin Id          Marker lineage   # genomes   # markers   # marker sets   0    1    2   3   4   5+   Completeness   Contamination   Strain heterogeneity
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 GCA_900517465      Bacteria         5449        104            58        3   101   0   0   0   0       97.07            0.00               0.00
-----------------------------------------------------------------------------------------------------------------------------------------------------------
[2019-10-17 02:52:06] INFO: { Current stage: 0:00:00.330 || Total: 0:00:54.077 }

In this example, Completeness is 97.07% and Contamination is 0.00%, so the assembly looks good.
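
With many bins, reading the table by eye becomes tedious. The sketch below filters a tab-separated result table, assuming that the file written with --tab_table and -f uses the same column names as the table printed above. The cut-offs of 90% completeness and 5% contamination are example values, not CheckM defaults, and the sample text simply repeats the row from this run.

```python
import csv
import io
from typing import List, TextIO

# Sample content rebuilt from the table above (assumed column names).
SAMPLE_QA_TSV = (
    "Bin Id\tMarker lineage\t# genomes\t# markers\t# marker sets\t"
    "0\t1\t2\t3\t4\t5+\tCompleteness\tContamination\tStrain heterogeneity\n"
    "GCA_900517465\tBacteria\t5449\t104\t58\t"
    "3\t101\t0\t0\t0\t0\t97.07\t0.00\t0.00\n"
)


def passing_bins(handle: TextIO, min_completeness: float = 90.0,
                 max_contamination: float = 5.0) -> List[str]:
    """Return the ids of bins that meet both example thresholds."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        if completeness >= min_completeness and contamination <= max_contamination:
            kept.append(row["Bin Id"])
    return kept


print(passing_bins(io.StringIO(SAMPLE_QA_TSV)))
```

For a real run, replace the io.StringIO object with an open file handle on the table written by -f.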

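CheckM offers many other functions as well, such as calculating the percentage of reads that map to each bin of a metagenomic binning result.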

checkm coverage  -h
usage: checkm coverage [-h] [-x EXTENSION] [-r] [-a MIN_ALIGN]
                       [-e MAX_EDIT_DIST] [-m MIN_QC] [-t THREADS] [-q]
                       bin_dir output_file bam_files [bam_files ...]

Calculate coverage of sequences.

positional arguments:
  bin_dir               directory containing bins (fasta format)
  output_file           print results to file
  bam_files             BAM files to parse

optional arguments:
  -h, --help            show this help message and exit
  -x, --extension EXTENSION
                        extension of bins (other files in directory are ignored) (default: fna)
  -r, --all_reads       use all reads to estimate coverage instead of just those in proper pairs
  -a, --min_align MIN_ALIGN
                        minimum alignment length as percentage of read length (default: 0.98)
  -e, --max_edit_dist MAX_EDIT_DIST
                        maximum edit distance as percentage of read length (default: 0.02)
  -m, --min_qc MIN_QC   minimum quality score (in phred) (default: 15)
  -t, --threads THREADS
                        number of threads (default: 1)
  -q, --quiet           suppress console output

Example: checkm coverage ./bins coverage.tsv example_1.bam example_2.bam

checkm coverage -x fa bins/ ~/cov.tsv sorted.bam

Sequence Id     Bin Id  Sequence length (bp)    Bam Id  Coverage        Mapped reads
k141_14826      unbinned        522     test    0.000000        0
k141_8961       unbinned        539     test    18.918367       68
k141_66057      unbinned        597     test    8.040201        32
k141_28930      unbinned        978     test    50.460123       329
k141_25826      unbinned        1090    test    8.669725        63
k141_63496      unbinned        510     test    2.352941        8
k141_25829      unbinned        1023    test    12.167155       83
k141_63494      unbinned        558     test    6.182796        23
k141_63495      unbinned        1399    test    448.277341      4181
k141_63493      unbinned        703     test    6.827881        32
k141_63490      unbinned        841     test    9.096314        51
k141_66058      bin.39  6959    test    18.923983       878
k141_28933      unbinned        1195    test    12.175732       97
k141_63498      unbinned        557     test    5.655296        21
k141_63499      unbinned        614     test    8.550489        35
k141_14821      unbinned        1130    test    82.559292       622

checkm profile  ~/cov.tsv
# Reports % mapped reads: (reads mapped to bin)/(total number of reads mapped to assembly)
[2019-10-17 03:25:40] INFO: CheckM v1.0.18
[2019-10-17 03:25:40] INFO: checkm profile /data/home/liufei/cov.tsv
[2019-10-17 03:25:40] INFO: [CheckM - profile] Calculating percentage of reads mapped to each bin.
[2019-10-17 03:25:40] INFO: Determining number of reads mapped to each bin.
----------------------------------------------------------------------------------------------------------------------
  Bin Id   Bin size (Mbp)   test: mapped reads   test: % mapped reads   test: % binned populations   test: % community
----------------------------------------------------------------------------------------------------------------------
  bin.1         0.22              586765                 0.59                      2.15                     1.36
  bin.10        0.86              542830                 0.54                      0.52                     0.33
  bin.11        0.29             2445260                 2.45                      6.93                     4.39
  bin.12        0.24               70318                 0.07                      0.24                     0.15
  bin.13        0.33              815391
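
The "% mapped reads" column follows the ratio quoted above: reads mapped to a bin divided by all reads mapped to the assembly, unbinned sequences included. A minimal worked example with made-up read counts (the bin names and numbers are hypothetical, and the two normalized columns are not reproduced here):

```python
from typing import Dict


def percent_mapped_reads(reads_per_bin: Dict[str, int],
                         unbinned_reads: int) -> Dict[str, float]:
    """Percentage of all assembly-mapped reads that fall into each bin."""
    total = sum(reads_per_bin.values()) + unbinned_reads
    if total == 0:
        raise ValueError("no reads mapped to the assembly")
    return {bin_id: 100.0 * n / total for bin_id, n in reads_per_bin.items()}


# Hypothetical counts: 1000 mapped reads in total, 100 of them unbinned.
toy_reads = {"bin.A": 600, "bin.B": 300}
toy_percent = percent_mapped_reads(toy_reads, unbinned_reads=100)
for bin_id, pct in toy_percent.items():
    print(f"{bin_id}: {pct:.2f}% of mapped reads")
```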
