细菌基因组测序完,想看看样本有没有被其他的菌污染?
人的转录组测序完,想快速看看人、微生物的序列的比例?
元/宏基因组测序完,想快速获得样本中物种的丰度信息?
REFERENCE
Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments.Genome Biology 2014, 15:R46.
Kraken 1
Kraken 1是2014年Wood DE在Genome Biology发表的宏基因组序列分类软件,能够快速对宏基因样品中的reads进行分类。Kraken在序列比对环节基于精确k-mer匹配和精简数据库的方法,采取精确匹配,其核心是Kraken有一种特殊数据库,用以预先计算序列中包含的特殊的Kmer序列。 下面是来自kraken官网关于各分类器的测评结果:Kraken速度很快,精度较低,适用于做微生物检测的预处理。通过一些实际的数据测试发现:与Metaphlan2相比,Kraken速度较快,获得的物种数目较多,相对应的假阳性率也较高。
Usage: kraken [options]
Options:
--db NAME Name for Kraken DB
(default: none)
--threads NUM Number of threads (default: 1)
--fasta-input Input is FASTA format
--fastq-input Input is FASTQ format
--fastq-output Output in FASTQ format
--gzip-compressed Input is gzip compressed
--bzip2-compressed Input is bzip2 compressed
--quick Quick operation (use first hit or hits)
--min-hits NUM In quick op., number of hits req'd for classification
NOTE: this is ignored if --quick is not specified
--unclassified-out FILENAME
Print unclassified sequences to filename
--classified-out FILENAME
Print classified sequences to filename
--out-fmt FORMAT Format for [un]classified sequence output. supported
options are: {legacy, paired, interleaved}
--output FILENAME Print output to filename (default: stdout); "-" will
suppress normal output
--only-classified-output
Print no Kraken output for unclassified sequences
--preload Loads DB into memory before classification
--paired The two filenames provided are paired-end reads
--check-names Ensure each pair of reads have names that agree
with each other; ignored if --paired is not specified
--help Print this message
--version Print version information
If none of the *-input or *-compressed flags are specified, and the
file is a regular file, automatic format detection is attempted.
$ kraken --threads 40 --db minikraken_20171013_4GB --preload --
paired --fastq-input --gzip-compressed ${B}_1.fastq.gz ${B}_2.fastq.gz | kraken-report --db minikraken_20171013_4GB > "$B"_kraken.tab
$ less -S "$B"_kraken.tab
50.30 27375076 27375076 U 0 unclassified
49.70 27047187 13633 - 1 root
49.35 26857009 348 - 131567 cellular organisms
49.35 26855582 105532 D 2 Bacteria
33.57 18270135 0 - 1783270 FCB group
33.57 18269977 5 - 68336 Bacteroidetes/Chlorobi group
33.57 18269761 107058 P 976 Bacteroidetes
33.34 18144208 69 C 200643 Bacteroidia
33.34 18144105 961382 O 171549 Bacteroidales
30.47 16580771 0 F 815 Bacteroidaceae
30.47 16580771 2380340 G 816 Bacteroides
22.89 12454849 9623614 S 821 Bacteroides vulgatus
5.20 2831235 2831235 - 435590 Bacteroides vulgatus ATCC 8482
0.73 396849 0 S 357276 Bacteroides dorei
0.73 396849 396849 - 997877 Bacteroides dorei CL03T12C01
0.60 326487 326471 S 28116 Bacteroides ovatus
0.00 16 16 - 1379690 Bacteroides ovatus V975
0.42 227354 141795 S 818 Bacteroides thetaiotaomicron
0.16 85559 85559 - 226186 Bacteroides thetaiotaomicron VPI-5482
0.38 206113 206113 S 1796613 Bacteroides caecimuris
0.33 180286 180286 S 47678 Bacteroides caccae
0.29 156976 68554 S 817 Bacteroides fragilis
0.10 55847 55847 - 862962 Bacteroides fragilis 638R
0.03 17859 17859 - 295405 Bacteroides fragilis YCH46
0.03 14716 14716 - 272559 Bacteroides fragilis NCTC 9343
如上是Kraken的结果,可以看出它没有估算出物种的丰度。
Bracken
这时可以使用另一款软件Bracken (Bayesian Reestimation of Abundance with KrakEN),它是一种从元基因组数据中计算物种丰度的高度准确的统计方法。
$ bracken -h
Usage: bracken -d MY_DB -i INPUT -o OUTPUT -r READ_LEN -l LEVEL -t THRESHOLD
MY_DB location of Kraken database
INPUT Kraken REPORT file to use for abundance estimation
OUTPUT file name for Bracken default output
READ_LEN read length to get all classifications for (default: 100)
LEVEL level to estimate abundance at [options: D,P,C,O,F,G,S] (default: S)
THRESHOLD number of reads required PRIOR to abundance estimation to perform reestimation (default: 0)
$ bracken -d minikraken_20171013_4GB/ -i $B\_kraken.tab -t 10 -o $B.out
#获得如下结果:
name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
Uncultured phage WW-nAnB strain 2 1449896 S 388 113 501 0.00007
Uncultured phage WW-nAnB strain 3 1449897 S 349 131 480 0.00007
Aureimonas sp. AU20 1349819 S 108 39 147 0.00002
Phaeobacter piscinae 1580596 S 21 4 25 0.00000
Sinorhizobium sp. CCBAU 05631 794846 S 25 12 37 0.00001
Mucilaginibacter sp. BJC16-A31 1234841 S 31 0 31 0.00000
Arcanobacterium phocae 131112 S 34 0 34 0.00000
Kineococcus radiotolerans 131568 S 82 3 85 0.00001
Actinomyces radingae 131110 S 314 5 319 0.00005
Sediminicola sp. YIK13 1453352 S 19 0 19 0.00000
Methylotenera mobilis 359408 S 42 1 43 0.00001
Stenotrophomonas rhizophila 216778 S 76 52 128 0.00002
Acholeplasma oculi 35623 S 12 0 12 0.00000
Dictyoglomus turgidum 513050 S 12 0 12 0.00000
Chelatococcus sp.
NOTE: Kraken 2 is the newest version of Kraken (See Kraken 2's Webpage for details). Kraken 1 will continue to be available via the Kraken 1 Github page, but it is no longer being supported.
Kraken 2
与Kraken 1相比,Kraken 2有了很大的改进:
1.更快速的构建数据库
2.数据库的占用存储空间更少
- 更快的分类速度
还能支持 16S Databases包括Greengenes, SILVA,和 RDP.
$ kraken2
Need to specify input filenames!
Usage: kraken2 [options]
Options:
--db NAME Name for Kraken 2 DB
(default: none)
--threads NUM Number of threads (default: 1)
--quick Quick operation (use first hit or hits)
--unclassified-out FILENAME
Print unclassified sequences to filename
--classified-out FILENAME
Print classified sequences to filename
--output FILENAME Print output to filename (default: stdout); "-" will
suppress normal output
--confidence FLOAT Confidence score threshold (default: 0.0); must be
in [0, 1].
--minimum-base-quality NUM
Minimum base quality used in classification (def: 0,
only effective with FASTQ input).
--report FILENAME Print a report with aggregrate counts/clade to file
--use-mpa-style With --report, format report output like Kraken 1's
kraken-mpa-report
--report-zero-counts With --report, report counts for ALL taxa, even if
counts are zero
--memory-mapping Avoids loading database into RAM
--paired The filenames provided have paired-end reads
--use-names Print scientific names instead of just taxids
--gzip-compressed Input files are compressed with gzip
--bzip2-compressed Input files are compressed with bzip2
--help Print this message
--version Print version information
If none of the *-compressed flags are specified, and the filename provided
is a regular file, automatic format detection is attempted.
$ kraken2\
--db minikraken2_v2_8GB_201904_UPDATE/ \
--threads 20 \
--report report \
--gzip-compressed --paired \
${B}_1.fastq.gz ${B}_2.fastq.gz
感谢您的阅读,欢迎点赞、评论和转发!!
扫描或长按下方二维码,即可关注公众号: 基因的生物信息学分析
相关阅读
细菌基因组:结核杆菌测序耐药位点分析
一文搞定细菌基因组De Novo测序分析
肠道菌群:16S测序分析流程解读
肠道菌群:宏基因组测序分析流程解读(上)