Assessing Genome Assembly Quality with QUAST

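QUAST is a widely used tool for assessing genome assembly quality. Without a reference, it computes basic contig statistics such as N50. With a reference genome (reference-based mode), it aligns the contigs to the reference and reports genome fraction, duplication ratio, misassemblies, unaligned sequence, mismatches, and related metrics. MetaQUAST, released later, evaluates metagenome assemblies by comparing them against closely related reference genomes.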
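Since N50 is the headline reference-free statistic, a small worked example may help. The sketch below uses the standard definition (the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length). The function name `n50` and the contig lengths are made up for illustration; this is not QUAST's own code, and QUAST additionally filters out short contigs before computing its statistics.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input: no contigs, no N50


# Illustrative contig lengths (made up, not from a real assembly).
contig_lengths = [100, 80, 60, 40, 20]
print(n50(contig_lengths))  # total 300, half 150; 100 + 80 >= 150, so prints 80
```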

Papers:
Paper 1: QUAST: quality assessment tool for genome assemblies. Bioinformatics 2013
Citations: 3510
Paper 2: MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 2016
Citations: 233

Resources:
Homepage: http://bioinf.spbau.ru/quast
github: https://github.com/ablab/quast
sourceforge: http://quast.sourceforge.net/
sourceforge: http://quast.sourceforge.net/quast
quast5 updates: QUAST v.5.1.0 release notes (public version)
quast5 GitHub download: quast_5.1.0rc1
quast5 manual: http://quast.sourceforge.net/docs/manual.html
metaquast homepage: http://bioinf.spbau.ru/metaquast
metaquast sourceforge: http://quast.sourceforge.net/metaquast

Download and installation:

The release ships runnable scripts, so there is nothing to install. Love it.

wget -c https://github.com/ablab/quast/releases/download/quast_5.1.0rc1/quast-5.1.0rc1.tar.gz
tar -zxvf quast-5.1.0rc1.tar.gz
cd quast-5.1.0rc1
python quast.py --help
python quast.py --version
# QUAST v5.1.0rc1, 6260eff0

Running:

# Run on the bundled test data
python quast.py test_data/contigs_1.fasta \
           test_data/contigs_2.fasta \
        -r test_data/reference.fasta.gz \
        -g test_data/genes.txt \
        -1 test_data/reads1.fastq.gz -2 test_data/reads2.fastq.gz \
        -o quast_test_output

# Real-world use: QUAST with a reference genome
quast_route="/software/quast-5.1.0rc1"
python $quast_route/quast.py AF04-12.fna \
-r ../Prokka/bgi/AF04-12/AF04-12.fna \
-g ../Prokka/bgi/AF04-12/AF04-12.gff \
--fragmented \
-t 4 -o ./AF04-12/

# A homemade batch loop: run QUAST for every strain, illumina assemblies vs bgi references
for i in $(cat 76_strain_id.list);
do
    python $quast_route/quast.py Prokka/illumina/$i/$i.fna \
    -r Prokka/bgi/$i/$i.fna \
    -g Prokka/bgi/$i/$i.gff \
    --fragmented --silent \
    -t 2 -o QUAST/illumina/$i/
    echo -e "\033[32m $i done...\033[0m"
done

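input contigs: positional argument(s), one FASTA file per assembly
-r: reference genome (FASTA)
-g: genomic features of the reference (here a GFF file)
--fragmented: the reference is fragmented; QUAST tries to detect misassemblies caused by the fragmentation and marks them as fake
-1/-2: forward/reverse reads
-t: number of threads
-o: output dir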
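When running many strains, it can help to assemble the command in one place and check the inputs before launching anything. The sketch below only uses flags that appear in the commands above; the helper names `build_quast_cmd` and `missing_inputs` are hypothetical, and the paths simply follow the directory layout used in this post. It prints the command as a dry run rather than executing it.

```python
import shlex
from pathlib import Path


def build_quast_cmd(quast_py, contigs, reference, features, out_dir, threads=2):
    """Assemble a QUAST command line from the options described above."""
    return [
        "python", str(quast_py), str(contigs),
        "-r", str(reference),   # reference genome (FASTA)
        "-g", str(features),    # genomic features (GFF)
        "--fragmented",         # reference is fragmented
        "-t", str(threads),     # number of threads
        "-o", str(out_dir),     # output directory
    ]


def missing_inputs(*paths):
    """Return the input paths that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).is_file()]


strain = "AF04-12"
cmd = build_quast_cmd(
    "/software/quast-5.1.0rc1/quast.py",
    f"Prokka/illumina/{strain}/{strain}.fna",
    f"Prokka/bgi/{strain}/{strain}.fna",
    f"Prokka/bgi/{strain}/{strain}.gff",
    f"QUAST/illumina/{strain}/",
)
print(shlex.join(cmd))  # dry run: show the command instead of executing it
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` after confirming that `missing_inputs(...)` returns an empty list for the contig, reference, and GFF paths.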

The Python used here does not have the matplotlib module installed, so no PDF report is produced. That does not matter: report.txt is all we need.

Output files:

report.txt      summary table
report.tsv      tab-separated version, for parsing, or for spreadsheets (Google Docs, Excel, etc)
report.tex      LaTeX version
report.pdf      PDF version, includes all tables and plots for some statistics
report.html     everything in an interactive HTML file
icarus.html     Icarus main menu with links to interactive viewers
contigs_reports/        [only if a reference genome is provided]
  misassemblies_report  detailed report on misassemblies
  unaligned_report      detailed report on unaligned and partially unaligned contigs
k_mer_stats/            [only if --k-mer-stats is specified]
  kmers_report          detailed report on k-mer-based metrics
reads_stats/            [only if reads are provided]
  reads_report          detailed report on mapped reads statistics
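Because report.tsv is meant for parsing, here is a minimal sketch of loading it into a nested dictionary. It assumes the layout in which the first row lists the assembly names and each following row starts with a metric name; the function name `parse_report_tsv` and the miniature sample table are made up for illustration, and real reports contain many more rows.

```python
import csv
import io


def parse_report_tsv(handle):
    """Parse a QUAST report.tsv into {assembly_name: {metric: value}}."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    assemblies = header[1:]  # the first column holds the metric names
    table = {name: {} for name in assemblies}
    for row in body:
        metric, values = row[0], row[1:]
        for name, value in zip(assemblies, values):
            table[name][metric] = value  # values are kept as strings
    return table


# Made-up miniature report, for illustration only.
sample = "Assembly\tcontigs_1\tcontigs_2\n# contigs\t3\t5\nN50\t3980\t2400\n"
report = parse_report_tsv(io.StringIO(sample))
print(report["contigs_1"]["N50"])  # prints 3980
```

For a real run, the same function could be called on an open file, for example `open("quast_test_output/report.tsv", newline="")`.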

Output file: report.html


See the manual for more detail: http://quast.sourceforge.net/docs/manual.html

  • Genome fraction (%)
    is the percentage of aligned bases in the reference genome.
  • N's per 100 kbp
    is the average number of uncalled bases (N's) per 100000 assembly bases.
  • mismatches per 100 kbp
    is the average number of mismatches per 100000 aligned bases. True SNPs and sequencing errors are not distinguished and are counted equally.
  • indels per 100 kbp
    is the average number of indels per 100000 aligned bases. Several consecutive single nucleotide indels are counted as one indel.
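The definitions above reduce to simple ratios, which a short worked example makes concrete. The function names and all numbers below are made up for illustration; QUAST derives the underlying counts from its own alignments, so this only shows the arithmetic, not how the counts are obtained.

```python
def per_100_kbp(event_count, base_count):
    """Scale a raw count (mismatches, indels, N's) to a rate per 100,000 bases."""
    return 100_000 * event_count / base_count


def genome_fraction(aligned_reference_bases, reference_length):
    """Percentage of reference bases covered by aligned contigs."""
    return 100 * aligned_reference_bases / reference_length


# Made-up numbers: a 5 Mbp reference of which 4.75 Mbp is covered,
# and 120 mismatches across 4.8 Mbp of aligned bases.
print(genome_fraction(4_750_000, 5_000_000))  # prints 95.0
print(per_100_kbp(120, 4_800_000))            # prints 2.5
```

Note the different denominators: N's are counted per assembly bases, while mismatches and indels are counted per aligned bases.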

Batch processing: collecting the results

## Collect the QUAST results into one table
task="bgi"
# Write the header row once; ">" truncates the file, so re-running does not append duplicates
sed -n '1p' QUAST/${task}/AF04-12/transposed_report.tsv > QUAST/${task}_quast.txt

for i in $(cat 76_strain_id.list);
do
    # Row 2 of each transposed report holds that strain's metrics
    sed -n '2p' QUAST/${task}/$i/transposed_report.tsv >> QUAST/${task}_quast.txt
    echo -e "\033[32m $i done... \033[0m"
done
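The same collection step can be sketched in Python, which makes it easy to report strains whose QUAST run produced no output. This assumes the directory layout used above (one `transposed_report.tsv` per strain directory); the function name `merge_transposed_reports` is hypothetical. Unlike the `sed -n '2p'` loop, it keeps every data row, so it also covers runs that evaluated more than one assembly.

```python
from pathlib import Path


def merge_transposed_reports(quast_root, strain_ids, out_path):
    """Concatenate per-strain transposed_report.tsv files under a single header.

    Returns the list of strain ids whose report was missing or empty.
    """
    missing = []
    merged = []
    for strain in strain_ids:
        report = Path(quast_root) / strain / "transposed_report.tsv"
        if not report.is_file():
            missing.append(strain)
            continue
        lines = [ln for ln in report.read_text().splitlines() if ln.strip()]
        if not lines:
            missing.append(strain)
            continue
        if not merged:
            merged.append(lines[0])  # header row, taken once
        merged.extend(lines[1:])     # one data row per assembly
    Path(out_path).write_text("\n".join(merged) + "\n")
    return missing
```

With the strain ids read from 76_strain_id.list, a call such as `merge_transposed_reports("QUAST/bgi", ids, "QUAST/bgi_quast.txt")` would mirror the shell loop, and the returned list shows which strains need to be rerun.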

More:
How to read QUAST results
