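Metagenomics: Assembly with MEGAHIT and Evaluation with QUAST

MEGAHIT

Many assemblers are available. Three are mentioned here: MEGAHIT, SOAPdenovo, and metaSPAdes.

MEGAHIT download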
https://github.com/voutcn/megahit

git clone https://github.com/voutcn/megahit.git
cd megahit
make

Download links for the other two assemblers

SOAPdenovo download
http://sourceforge.net/projects/soapdenovo2/files/SOAPdenovo2/
metaSPAdes download
http://spades.bioinf.spbau.ru/release3.11.0/

Download the evaluation tool QUAST

git clone https://github.com/ablab/quast.git -b release_4.5
export PYTHONPATH=$(pwd)/quast/libs/

Test data

cd megahit/
Bacterial_F1_1.pe.fq
Bacterial_F1_2.pe.fq

Start the assembly

megahit -1 Bacterial_F1_1.pe.fq -2 Bacterial_F1_2.pe.fq -o combined
2019-09-26 16:31:55 - MEGAHIT v1.2.8
2019-09-26 16:31:55 - Using megahit_core with POPCNT and BMI2 support
2019-09-26 16:31:55 - Convert reads to binary library
2019-09-26 16:31:55 - INFO  sequence/io/sequence_lib.cpp  :   77 - Lib 0 (/home/ZQK/Data/megahit_data/Bacterial_F1_1.pe.fq,/home/ZQK/Data/megahit_data/Bacterial_F1_2.pe.fq): pe, 160126 reads, 250 max length
2019-09-26 16:31:55 - INFO  utils/utils.h                 :  152 - Real: 0.3188 user: 0.2668    sys: 0.0560     maxrss: 22892
2019-09-26 16:31:55 - k-max reset to: 141
2019-09-26 16:31:55 - Start assembly. Number of CPU threads 56
2019-09-26 16:31:55 - k list: 21,29,39,59,79,99,119,141
2019-09-26 16:31:55 - Memory used: 304044370329
2019-09-26 16:31:55 - Extract solid (k+1)-mers for k = 21
2019-09-26 16:31:56 - Build graph for k = 21
2019-09-26 16:31:57 - Assemble contigs from SdBG for k = 21
2019-09-26 16:32:00 - Local assembly for k = 21
2019-09-26 16:32:00 - Extract iterative edges from k = 21 to 29
2019-09-26 16:32:01 - Build graph for k = 29
2019-09-26 16:32:01 - Assemble contigs from SdBG for k = 29
2019-09-26 16:32:02 - Local assembly for k = 29
2019-09-26 16:32:02 - Extract iterative edges from k = 29 to 39
2019-09-26 16:32:02 - Build graph for k = 39
2019-09-26 16:32:02 - Assemble contigs from SdBG for k = 39
2019-09-26 16:32:03 - Local assembly for k = 39
2019-09-26 16:32:03 - Extract iterative edges from k = 39 to 59
2019-09-26 16:32:03 - Build graph for k = 59
2019-09-26 16:32:04 - Assemble contigs from SdBG for k = 59
2019-09-26 16:32:04 - Local assembly for k = 59
2019-09-26 16:32:04 - Extract iterative edges from k = 59 to 79
2019-09-26 16:32:05 - Build graph for k = 79
2019-09-26 16:32:05 - Assemble contigs from SdBG for k = 79
2019-09-26 16:32:05 - Local assembly for k = 79
2019-09-26 16:32:06 - Extract iterative edges from k = 79 to 99
2019-09-26 16:32:06 - Build graph for k = 99
2019-09-26 16:32:06 - Assemble contigs from SdBG for k = 99
2019-09-26 16:32:07 - Local assembly for k = 99
2019-09-26 16:32:07 - Extract iterative edges from k = 99 to 119
2019-09-26 16:32:07 - Build graph for k = 119
2019-09-26 16:32:08 - Assemble contigs from SdBG for k = 119
2019-09-26 16:32:08 - Local assembly for k = 119
2019-09-26 16:32:09 - Extract iterative edges from k = 119 to 141
2019-09-26 16:32:09 - Build graph for k = 141
2019-09-26 16:32:09 - Assemble contigs from SdBG for k = 141
2019-09-26 16:32:10 - Merging to output final contigs
2019-09-26 16:32:10 - 177 contigs, total 70612 bp, min 200 bp, max 470 bp, avg 398 bp, N50 445 bp
2019-09-26 16:32:10 - ALL DONE. Time elapsed: 15.146550 seconds
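
Passing the same file to both -1 and -2 is an easy mistake to make with paired-end input. The following is a minimal sketch of a pre-flight check, assuming uncompressed FASTQ files with four lines per record. The function names count_reads and check_pair are hypothetical helpers introduced here for illustration; they are not part of MEGAHIT.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of MEGAHIT); assume uncompressed 4-line FASTQ.

# Print the number of reads in a FASTQ file (line count divided by 4).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Succeed only if both mate files exist, are non-empty, are not the same path,
# and contain the same number of reads.
check_pair() {
    local r1="$1" r2="$2"
    if [ ! -s "$r1" ] || [ ! -s "$r2" ]; then
        echo "missing or empty input file" >&2
        return 1
    fi
    if [ "$r1" = "$r2" ]; then
        echo "R1 and R2 point to the same file" >&2
        return 1
    fi
    if [ "$(count_reads "$r1")" != "$(count_reads "$r2")" ]; then
        echo "R1 and R2 contain different numbers of reads" >&2
        return 1
    fi
    return 0
}

# Possible usage before the assembly:
# check_pair Bacterial_F1_1.pe.fq Bacterial_F1_2.pe.fq &&
#     megahit -1 Bacterial_F1_1.pe.fq -2 Bacterial_F1_2.pe.fq -o combined
```

The check only compares paths and read counts. It does not verify that the reads are in the same order in both files.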

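For ease of demonstration, the test file contains only a small part of the original data. The original author's run took 15 min, while on my server it took only 4 min. The original data was analyzed with three mainstream assemblers, and their run time and memory consumption were compared.
[Figure 1: run time and memory consumption of the three assemblers on the original data]

View the results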

less combined/final.contigs.fa
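
Paging through the FASTA file shows the sequences but not the overall shape of the assembly. As a rough sketch, the contig count and the total, minimum, and maximum lengths can be recomputed with awk and compared with the summary line at the end of the MEGAHIT log. The function name fasta_summary is a hypothetical helper, and the output format is only modeled on that log line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize contig lengths in a (possibly multi-line) FASTA file.
fasta_summary() {
    awk '
        # Record one finished contig of length l.
        function record(l) {
            n++
            total += l
            if (n == 1 || l < min) min = l
            if (l > max) max = l
        }
        # A header line closes the previous contig, if any.
        /^>/ { if (len > 0) record(len); len = 0; next }
        # Sequence lines accumulate into the current contig length.
        { len += length($0) }
        END {
            if (len > 0) record(len)
            printf "%d contigs, total %d bp, min %d bp, max %d bp\n", n, total, min, max
        }
    ' "$1"
}

# Possible usage:
# fasta_summary combined/final.contigs.fa
```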

Evaluate the assembly

Run QUAST

cd assembly
mkdir quast-evaluation
cd quast-evaluation
ln -fs ../combined/final.contigs.fa megahit.contigs.fa
../../quast/quast.py megahit.contigs.fa -o megahit-report
cat megahit-report/report.txt

Download the metaSPAdes result, evaluate it, and compare

curl -LO https://osf.io/h29jk/download
mv download metaspades.contigs.fa.gz
gunzip metaspades.contigs.fa.gz

../../quast/quast.py metaspades.contigs.fa -o metaspades-report
cat metaspades-report/report.txt

# look at the two reports in parallel
paste *report/report.txt

The results are as follows:

Assembly                    megahit.contigs    metaspades.contigs
# contigs (>= 0 bp)         7904               4112              
# contigs (>= 1000 bp)      2763               1843              
# contigs (>= 5000 bp)      582                583               
# contigs (>= 10000 bp)     191                244               
# contigs (>= 25000 bp)     18                 43                
# contigs (>= 50000 bp)     2                  17                
Total length (>= 0 bp)      13222363           12090326          
Total length (>= 1000 bp)   11149439           11320830          
Total length (>= 5000 bp)   5893043            7955570           
Total length (>= 10000 bp)  3186708            5596677           
Total length (>= 25000 bp)  663719             2500084           
Total length (>= 50000 bp)  112488             1603525           
# contigs                   3847               2280              
Largest contig              61397              261464            
Total length                11895322           11615922          
GC (%)                      46.29              46.27             
N50                         4924               9303              
N75                         2524               3937              
L50                         594                266               
L75                         1455               754               
# N's per 100 kbp           0.00               0.00

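metaSPAdes gives the better N50 and N75 (9303 vs. 4924 bp and 3937 vs. 2524 bp). If computing resources are available and time is not a constraint, metaSPAdes is recommended. But without a server with terabyte-scale memory, and with a tight project schedule, use MEGAHIT directly to get results.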
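
N50 is the contig length such that contigs of that length or longer make up at least half of the total assembly length. The sketch below recomputes it from a FASTA file under that definition. The function name n50 and its optional minimum-length argument are hypothetical. QUAST, as far as I know, ignores contigs shorter than 500 bp by default, so passing 500 as the second argument should be closer to what the report shows, though exact agreement is not guaranteed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: n50 <fasta> [min_len]
# Prints the N50 of all contigs whose length is at least min_len (default 0).
n50() {
    local fasta="$1" min_len="${2:-0}"
    # Step 1: print one length per contig, handling multi-line sequences.
    awk -v minlen="$min_len" '
        /^>/ { if (len > 0 && len >= minlen) print len; len = 0; next }
        { len += length($0) }
        END { if (len > 0 && len >= minlen) print len }
    ' "$fasta" |
    # Step 2: sort lengths from longest to shortest.
    sort -rn |
    # Step 3: walk down the list until the running sum reaches half the total.
    awk '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (2 * run >= total) { print lens[i]; exit }
            }
        }
    '
}

# Possible usage:
# n50 megahit.contigs.fa 500
# n50 metaspades.contigs.fa 500
```

N75 follows the same pattern with the threshold changed from half to three quarters of the total length.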
