Assessing genome assembly and completeness with BUSCO

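BUSCO stands for Benchmarking Universal Single-Copy Orthologs. It is an open-source Python tool that assesses the completeness of genome assemblies and annotations based on evolutionarily informed expectations of gene content, that is, by comparison against reference sets of orthologs.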

Literature:
Paper: BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015
Citations: 4695
Book chapter: BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods in Molecular Biology 2019

Abstract:
Motivation: Genomics has revolutionized biological research, but quality assessment of the resulting assembled sequences is complicated and remains mostly limited to technical measures like N50.
Results: We propose a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content. We implemented the assessment procedure in open-source software, with sets of Benchmarking Universal Single-Copy Orthologs, named BUSCO.
Few methods exist for assessing genome assemblies; BUSCO is open source and easy to use.

Methods:
Official site: https://busco.ezlab.org/
MANUAL: https://busco.ezlab.org/busco_userguide.html

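Installing with conda:
conda: https://anaconda.org/bioconda/busco
Any one of the following commands is enough; this route may install v4.1.2. The plain bioconda channel is the usual choice, since the labelled channels appear to hold flagged or archived builds.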

conda install -c bioconda busco
conda install -c bioconda/label/broken busco
conda install -c bioconda/label/cf201901 busco 

To install the latest version (v5.1.2 at the time of writing) from bioconda, follow the manual:

# Add the conda-forge channel if it is not configured yet
conda config --show 
conda config --add channels conda-forge
# Install BUSCO into its own conda environment
conda create -n busco
conda activate busco
conda install -c bioconda -c conda-forge busco=5.1.2
busco --help
busco --version
# BUSCO 5.1.2

Databases:
Many other posts download the plant reference dataset, but that link no longer seems to work:

# BUSCO dataset for plants
wget -c https://busco.ezlab.org/datasets/embryophyta_odb9.tar.gz

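OrthoDB (https://www.orthodb.org/?page=filelist) also seems to host a lot of data.

The manual gives the source of the lineage datasets:
https://busco-data.ezlab.org/v5/data/

This is the database for the latest version, v5, so it is the right place.

The lineage datasets themselves are listed under https://busco-data.ezlab.org/v5/data/lineages/

The listing was current as of this month (2021) and offers datasets for many lineages to choose from. Here I download bacteria_odb10 and inspect it: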

wget -c https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
tar -zxvf bacteria_odb10.2020-03-06.tar.gz
cd bacteria_odb10
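
Before pointing BUSCO at the extracted directory, a quick sanity check can help. The sketch below counts the HMM profiles in a lineage directory so the number can be compared with the "number of BUSCOs" that BUSCO reports for that dataset. It assumes the profiles live as `*.hmm` files under a `hmms/` subdirectory, which is how the dataset appeared to be laid out but is not guaranteed for every release; `count_busco_hmms` is a hypothetical helper name, not part of BUSCO.

```shell
# Sketch: count HMM profiles in an extracted lineage dataset.
# Assumption: profiles are stored as <dataset>/hmms/*.hmm.
count_busco_hmms() {
    # $1: extracted lineage directory, e.g. bacteria_odb10
    if [ ! -d "$1/hmms" ]; then
        echo "no hmms/ directory under $1" >&2
        return 1
    fi
    # One line per profile file; strip the padding some wc versions add
    find "$1/hmms" -type f -name '*.hmm' | wc -l | tr -d ' '
}

# Example (path is a placeholder):
# count_busco_hmms /database/BUSCO/bacteria_odb10
```

If the count differs from the number of BUSCOs reported in the run summary, the download or extraction is worth repeating.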

Using BUSCO:

Automated lineage selection mode, as described in the manual:

busco -m MODE -i INPUT -o OUTPUT --auto-lineage
# or ignoring eukaryotes to save runtime, if compatible with your experimental goal:
busco -m MODE -i INPUT -o OUTPUT --auto-lineage-prok
# or ignoring non-eukaryotes to save runtime, if compatible with your experimental goal:
busco -m MODE -i INPUT -o OUTPUT --auto-lineage-euk

Targeted lineage mode, which the manual recommends:

db_busco="/database/BUSCO/bacteria_odb10"
busco --in AF04-12.fna \
--lineage_dataset $db_busco \
--out ./output/ \
-m genome --offline

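This run fails with an error.

As the error message indicates, the value passed to --out must not contain a slash. Writing somewhere else would mean changing the configuration file, which is best avoided to be safe. Removing the slash is enough to make the run work. For batch processing, just move into a newly created directory before each run and back out afterwards.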

busco --in AF04-12.fna \
--lineage_dataset $db_busco \
--out output \
-m genome --offline
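
The batch idea mentioned above could be sketched as follows. The function changes into a working directory inside a subshell and uses the bare sample name as `--out`, so no slash ever reaches BUSCO. The `busco` flags are the ones from the command above; `run_busco_batch`, the working-directory layout, and the example paths are illustrative assumptions. The lineage path should be absolute, because the working directory changes before BUSCO starts.

```shell
# Sketch: run BUSCO for several genomes without a slash in --out.
# run_busco_batch is a hypothetical helper, not part of BUSCO.
run_busco_batch() {
    local lineage_dir="$1"   # absolute path to the lineage dataset
    local work_root="$2"     # directory that will hold one folder per sample
    shift 2
    local genome sample genome_abs
    mkdir -p "$work_root"
    for genome in "$@"; do
        # Sample name = file name without directory and last extension
        sample="$(basename "$genome")"
        sample="${sample%.*}"
        # Resolve the genome to an absolute path before changing directory
        genome_abs="$(cd "$(dirname "$genome")" && pwd)/$(basename "$genome")"
        (
            # The subshell keeps the caller's working directory untouched
            cd "$work_root" || exit 1
            busco --in "$genome_abs" \
                  --lineage_dataset "$lineage_dir" \
                  --out "$sample" \
                  -m genome --offline
        ) || echo "BUSCO failed for $sample" >&2
    done
}

# Example (paths are placeholders):
# run_busco_batch /database/BUSCO/bacteria_odb10 BUSCO/illumina genomes/*.fna
```

With this layout each sample ends up in its own folder under the working directory, named after the genome file.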

Results:

full_table.tsv

# BUSCO version is: 5.1.2
# The lineage dataset is: bacteria_odb10 (Creation date: 2020-03-06, number of genomes: 4085, number of BUSCOs: 124)
# Busco id   Status    Sequence           Gene Start  Gene End  Strand  Score   Length  OrthoDB url                                   Description
4421at2      Complete  AF04-12.Scaf40_36  46725       51011     +       1675.3  1205    https://www.orthodb.org/v10?query=4421at2     DNA-directed RNA polymerase subunit beta'
9601at2      Complete  AF04-12.Scaf40_35  42874       46686     +       1169.7  804     https://www.orthodb.org/v10?query=9601at2     DNA-directed RNA polymerase subunit beta
26038at2     Complete  AF04-12.Scaf8_42   54773       58477     +       212.5   371     https://www.orthodb.org/v10?query=26038at2    phosphoribosylformylglycinamidine synthase
91428at2     Complete  AF04-12.Scaf45_20  22437       25052     +       540.6   530     https://www.orthodb.org/v10?query=91428at2    alanine--tRNA ligase
95696at2     Complete  AF04-12.Scaf4_63   73584       75617     +       714.7   504     https://www.orthodb.org/v10?query=95696at2    excinuclease ABC subunit B
143460at2    Complete  AF04-12.Scaf1_51   58613       60415     +       512.5   441     https://www.orthodb.org/v10?query=143460at2   GTP-binding protein
182107at2    Complete  AF04-12.Scaf17_16  11979       13760     +       709.2   491     https://www.orthodb.org/v10?query=182107at2   elongation factor 4
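
To get a quick overview of such a table, the status column can be tallied with awk. This is a minimal sketch that assumes the real file is tab-separated, that comment lines start with `#`, and that the second column holds the status; `busco_status_counts` is a hypothetical helper name. Each BUSCO id is counted once per status, on the assumption that a duplicated BUSCO may occupy several rows.

```shell
# Sketch: tally BUSCO ids per status from a full_table.tsv.
busco_status_counts() {
    # $1: path to full_table.tsv
    awk -F'\t' '
        /^#/ || NF < 2 { next }          # skip comment lines and blank lines
        !seen[$1 FS $2]++ { n[$2]++ }    # count each id once per status
        END { for (s in n) printf "%s\t%d\n", s, n[s] }
    ' "$1" | sort
}

# Example (path is a placeholder):
# busco_status_counts output/run_bacteria_odb10/full_table.tsv
```

The output is one line per status with its count, sorted by status name.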

missing_busco_list.tsv

POG091H008J
POG091H00BL
POG091H00TK
............... (for this run there are actually no entries here, haha)

short_summary.txt

# BUSCO version is: 5.1.2
# The lineage dataset is: bacteria_odb10 (Creation date: 2020-03-06, number of genomes: 4085, number of BUSCOs: 124)
# Summarized benchmarking in BUSCO notation for file /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/analysis/platform/test/AF04-12.fna
# BUSCO was run in mode: genome
# Gene predictor used: prodigal

        ***** Results: *****

        C:100.0%[S:97.6%,D:2.4%],F:0.0%,M:0.0%,n:124
        124     Complete BUSCOs (C)
        121     Complete and single-copy BUSCOs (S)
        3       Complete and duplicated BUSCOs (D)
        0       Fragmented BUSCOs (F)
        0       Missing BUSCOs (M)
        124     Total BUSCO groups searched

Dependencies and versions:
        hmmsearch: 3.1
        prodigal: 2.6.3
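
The one-line notation in the summary follows directly from the counts below it: C is S plus D, and every percentage is relative to n. The sketch below recomputes the line from the four counts as a worked example. `busco_notation` is a hypothetical helper, and one-decimal rounding is an assumption that happens to match this run; BUSCO's own rounding may differ in edge cases.

```shell
# Sketch: rebuild the BUSCO notation line from raw counts.
busco_notation() {
    # Arguments: single-copy, duplicated, fragmented, missing counts
    awk -v s="$1" -v d="$2" -v f="$3" -v m="$4" 'BEGIN {
        n = s + d + f + m                 # total BUSCO groups searched
        if (n == 0) { exit 1 }            # avoid dividing by zero
        printf "C:%.1f%%[S:%.1f%%,D:%.1f%%],F:%.1f%%,M:%.1f%%,n:%d\n",
               100 * (s + d) / n, 100 * s / n, 100 * d / n,
               100 * f / n, 100 * m / n, n
    }'
}

busco_notation 121 3 0 0
# prints C:100.0%[S:97.6%,D:2.4%],F:0.0%,M:0.0%,n:124
```

For the run above, 121 single-copy and 3 duplicated BUSCOs out of 124 give S = 97.6% and D = 2.4%, which add up to C = 100.0%.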

Merging BUSCO results:

## Summarize BUSCO results across strains
task="illumina"
# Write the header with ">" so that re-running does not append a second header
echo -e "id\tc\ts\td\tf\tm" > BUSCO/${task}_busco.txt

# Read one strain id per line from the list file
while read -r i
do
    summary="BUSCO/$task/$i/run_bacteria_odb10/short_summary.txt"
    # The first field of each matching line is the count
    c=$(awk '/Complete BUSCOs/ {print $1}' "$summary")
    s=$(awk '/Complete and single-copy BUSCOs/ {print $1}' "$summary")
    d=$(awk '/Complete and duplicated BUSCOs/ {print $1}' "$summary")
    f=$(awk '/Fragmented BUSCOs/ {print $1}' "$summary")
    m=$(awk '/Missing BUSCOs/ {print $1}' "$summary")
    echo -e "$i\t$c\t$s\t$d\t$f\t$m" >> BUSCO/${task}_busco.txt
    echo -e "\033[32m $i done... \033[0m"
done < 76_strain_id.list

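Visualization:
This step needs a plotting script. The commands below are how the official site does it; you could also put something together yourself. I did not do this step myself.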

mkdir -p BUSCO_summaries
cp XX1/short_summary.*.lineage_odb10.XX1.txt BUSCO_summaries/.
cp XX2/short_summary.*.lineage_odb10.XX2.txt BUSCO_summaries/.
cp XX3/short_summary.*.lineage_odb10.XX3.txt BUSCO_summaries/.

python3 scripts/generate_plot.py -wd BUSCO_summaries
python3 scripts/generate_plot.py -wd /full/path/to/my/folder/BUSCO_summaries

More:
BUSCO - assembly quality assessment
