基因组和转录组组装质量评估利器BUSCO(以大肠杆菌基因组质量评估为例)

基因组和转录组组装质量评估利器——BUSCO(Benchmarking Universal Single-Copy Orthologs)


一,BUSCO简介

BUSCO软件2015年发表在Bioinformatics杂志上,谷歌学术显示该文章至今已经被引用1192次,足以说明该软件的认可度。
基因组和转录组组装质量评估利器BUSCO(以大肠杆菌基因组质量评估为例)_第1张图片
谷歌学术

该软件利用OrthoDB数据库提供的保守的单拷贝同源基因作为基准,评价组装的基因组或者转录组的完整性。


二,BUSCO软件的安装

  1. 手动安装
    头疼! 真需要请移步上的一个教程BUSCO使用笔记

2.BUSCO虚拟机安装
软件作者为了大家使用方面,直接将配置好的BUSCO软件打包成 了一个虚拟镜像,将这个虚拟镜像包BUSCO_UBUNTU_VM.ova下载下来,安装在VirtualBox或者是VMware中。虚拟机安装安装虚拟镜像的方法可以去上述软件的官网去学习,或者是百度一下教程茫茫多。后续我也可以分享一下过程,这次不是重点。


3.使用Miniconda安装BUSCO
Miniconda实在不失为生信入门者的必备良品!使用请看Miniconda的安装与使用(以转录组分析软件为例).

#软件库里搜索busco
czh@ubuntu:~$ conda search busco                     [11:22下午]
Loading channels: done
# Name                  Version           Build  Channel             
busco                       1.2          py27_0  bioconda            
busco                       1.2          py27_0  anaconda/cloud/bioconda
busco                       1.2          py27_1  bioconda            
busco                       1.2          py27_1  anaconda/cloud/bioconda
busco                       1.2          py34_0  bioconda            
busco                       1.2          py34_0  anaconda/cloud/bioconda
busco                       1.2          py34_1  bioconda            
busco                       1.2          py34_1  anaconda/cloud/bioconda
busco                       1.2          py35_0  bioconda            
busco                       1.2          py35_0  anaconda/cloud/bioconda
busco                       1.2          py35_1  bioconda            
busco                       1.2          py35_1  anaconda/cloud/bioconda
busco                       2.0          py27_0  bioconda            
busco                       2.0          py27_0  anaconda/cloud/bioconda
busco                       2.0          py34_0  bioconda            
busco                       2.0          py34_0  anaconda/cloud/bioconda
busco                       2.0          py35_0  bioconda            
busco                       2.0          py35_0  anaconda/cloud/bioconda
busco                       2.0          py36_0  bioconda            
busco                       2.0          py36_0  anaconda/cloud/bioconda
busco                     2.0.1          py27_0  bioconda            
busco                     2.0.1          py27_0  anaconda/cloud/bioconda
busco                     2.0.1          py34_0  bioconda            
busco                     2.0.1          py34_0  anaconda/cloud/bioconda
busco                     2.0.1          py35_0  bioconda            
busco                     2.0.1          py35_0  anaconda/cloud/bioconda
busco                     2.0.1          py36_0  bioconda            
busco                     2.0.1          py36_0  anaconda/cloud/bioconda
busco                     3.0.1          py35_0  bioconda            
busco                     3.0.1          py35_0  anaconda/cloud/bioconda
busco                     3.0.1          py36_0  bioconda            
busco                     3.0.1          py36_0  anaconda/cloud/bioconda
busco                     3.0.2          py35_4  bioconda            
busco                     3.0.2          py35_4  anaconda/cloud/bioconda
busco                     3.0.2          py35_5  bioconda            
busco                     3.0.2          py35_5  anaconda/cloud/bioconda
busco                     3.0.2          py35_6  bioconda            
busco                     3.0.2          py35_6  anaconda/cloud/bioconda
busco                     3.0.2          py35_7  bioconda            
busco                     3.0.2          py35_7  anaconda/cloud/bioconda
busco                     3.0.2          py36_4  bioconda            
busco                     3.0.2          py36_4  anaconda/cloud/bioconda
busco                     3.0.2          py36_5  bioconda            
busco                     3.0.2          py36_5  anaconda/cloud/bioconda
busco                     3.0.2          py36_6  bioconda            
busco                     3.0.2          py36_6  anaconda/cloud/bioconda
busco                     3.0.2          py36_7  bioconda            
busco                     3.0.2          py36_7  anaconda/cloud/bioconda

从软件库里搜索到BUSCO多个版本,可以根据需要安装指定版本,若不指定版本,默认安装最新版本。

czh@ubuntu:~$ conda create -n busco busco 

安装成功后,激活busco这个虚拟环境。

czh@ubuntu:~$ source activate busco
#一般输入软件名 + -h 或者 --help,出现帮助文档说明软件安装成功
(busco) czh@ubuntu:~$ busco -h
No command 'busco' found, did you mean:
 Command 'btsco' from package 'bluez-btsco' (universe)
busco: command not found
#报错!好像busco未安装

翻阅一下BUSCO的软件使用手册BUSCO_userguide.pdf,查看一下软件使用的基本命令。

基因组和转录组组装质量评估利器BUSCO(以大肠杆菌基因组质量评估为例)_第2张图片
软件帮助文档

BUSCO是python写的软件,使用该软件是调用run_BUSCO.py这个文件。

#查找run_BUSCO.py
(busco) czh@ubuntu:~$ which run_BUSCO.py
/home/czh/miniconda3/envs/busco/bin/run_BUSCO.py
(busco) czh@ubuntu:~$ run_BUSCO.py -h
usage: python BUSCO.py -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]

Welcome to BUSCO 3.0.2: the Benchmarking Universal Single-Copy Ortholog assessment tool.
For more detailed usage information, please review the README file provided with this distribution and the BUSCO user guide.

optional arguments:
  -i FASTA FILE, --in FASTA FILE
                        Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set.
  -c N, --cpu N         Specify the number (N=integer) of threads/cores to use.
  -o OUTPUT, --out OUTPUT
                        Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. WARNING: do not provide a path
  -e N, --evalue N      E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
  -m MODE, --mode MODE  Specify which BUSCO analysis mode to run.
                        There are three valid modes:
                        - geno or genome, for genome assemblies (DNA)
                        - tran or transcriptome, for transcriptome assemblies (DNA)
                        - prot or proteins, for annotated gene sets (protein)
  -l LINEAGE, --lineage_path LINEAGE
                        Specify location of the BUSCO lineage data to be used.
                        Visit http://busco.ezlab.org for available lineages.
  -f, --force           Force rewriting of existing files. Must be used when output files with the provided name already exist.
  -r, --restart         Restart an uncompleted run. Not available for the protein mode
  -sp SPECIES, --species SPECIES
                        Name of existing Augustus species gene finding parameters. See Augustus documentation for available options.
  --augustus_parameters AUGUSTUS_PARAMETERS
                        Additional parameters for the fine-tuning of Augustus run. For the species, do not use this option.
                        Use single quotes as follow: '--param1=1 --param2=2', see Augustus documentation for available options.
  -t PATH, --tmp_path PATH
                        Where to store temporary files (Default: ./tmp/)
  --limit REGION_LIMIT  How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
  --long                Optimization mode Augustus self-training (Default: Off) adds considerably to the run time, but can improve results for some non-model organisms
  -q, --quiet           Disable the info logs, displays only errors
  -z, --tarzip          Tarzip the output folders likely to contain thousands of files
  --blast_single_core   Force tblastn to run on a single core and ignore the --cpu argument for this step only. Useful if inconsistencies when using multiple threads are noticed
  -v, --version         Show this version and exit
  -h, --help            Show this help message and exit

三,使用BUSCO评估4个大肠杆菌基因组组装质量

1.下载大肠杆菌基因组数据
从NCBI数据库中下载了Escherichia coli IAI39(Complete),E. coli UMN026 (Chromosome),E. coli IMT2125(Scaffold)和E. coli TW10722 (Contig)四个不同组装水平的基因组核酸序列。

基因组和转录组组装质量评估利器BUSCO(以大肠杆菌基因组质量评估为例)_第3张图片
为了凸显BUSCO评估作用,特意下载了不同的组装水平的基因组

#文件下载后放在test文件夹中,进入test文件夹并查看文件夹中的文件
(busco) czh@ubuntu:~$ cd '/home/czh/Desktop/test' 
(busco) czh@ubuntu:~/Desktop/test$ ls
Escherichia_coli_IAI39.gz    Escherichia_coli_TW10722.gz
Escherichia_coli_IMT2125.gz  Escherichia_coli_UMN026.gz
#下载的基因组文件为压缩文件,解压文件
(busco) czh@ubuntu:~/Desktop/test$ gzip -d *.gz
(busco) czh@ubuntu:~/Desktop/test$ ls
Escherichia_coli_IAI39    Escherichia_coli_TW10722
Escherichia_coli_IMT2125  Escherichia_coli_UMN026
#busco要求fasta格式文件,虽然文件里面是fasta格式,但没有fasta后缀,busco不识别。
(busco) czh@ubuntu:~/Desktop/test$(busco) czh@ubuntu:~/Desktop/test$ rename 's/$/\.fasta/'  *
(busco) czh@ubuntu:~/Desktop/test$ ls
Escherichia_coli_IAI39.fasta    Escherichia_coli_TW10722.fasta
Escherichia_coli_IMT2125.fasta  Escherichia_coli_UMN026.fasta

2.下载恰当的OrthoDB数据库数据
现在需要从BUSCO官网中的OrthoDB数据库中下载合适的数据文件。

基因组和转录组组装质量评估利器BUSCO(以大肠杆菌基因组质量评估为例)_第4张图片
OrthoDB

大肠杆菌的分类地位:细菌界(Bacteria),变形菌门(Proteobacteria), γ-变形菌纲(Gammaproteobacteria),肠杆菌目(Enterobacteriales), 肠杆菌科(Enterobacteriaceae),埃希氏菌属(Escherichia)。
基因组和转录组组装质量评估利器BUSCO(以大肠杆菌基因组质量评估为例)_第5张图片
细菌保守的单拷贝同源序列数据库

因此可以选择下载Enterobacteriales odb9, Gammaproteobacteria obd9 和Proteobacteria odb9。
我之前把细菌界的所有odb9数据库都下载并解压好了,放置在自己新建的busco_database文件夹中。

(busco) czh@ubuntu:~/Desktop/test$ cd '/home/czh/Desktop/busco_database' 
(busco) czh@ubuntu:~/Desktop/busco_database$ ls
actinobacteria_odb9.tar.gz      enterobacteriales_odb9.tar.gz
bacillales_odb9.tar.gz          firmicutes_odb9.tar.gz
bacteria_odb9                   gammaproteobacteria_odb9
bacteria_odb9.tar.gz            gammaproteobacteria_odb9.tar.gz
bacteroidetes_odb9.tar.gz       lactobacillales_odb9.tar.gz
betaproteobacteria_odb9.tar.gz  proteobacteria_odb9
clostridia_odb9.tar.gz          proteobacteria_odb9.tar.gz
cyanobacteria_odb9.tar.gz       rhizobiales_odb9.tar.gz
deltaepsilonsub_odb9.tar.gz     spirochaetes_odb9.tar.gz
enterobacteriales_odb9          tenericutes_odb9.tar.gz
#查看帮助文档,了解软件运行的基本命令
(busco) czh@ubuntu:~/Desktop/test$ run_BUSCO.py -h
usage: python BUSCO.py -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
# -i参数:指定输入文件,-l参数:指定比对的obd9数据库,-o参数:指定输出文件夹, -m参数:指定待评估的数据的类型,-m geno 为基因组,-m tran为转录组。
  1. 对4个大肠杆菌基因组进行组装质量评估
    有4个菌需要评估,在此使用for...do...done循环函数进行,数据库使用肠杆菌科enterobacteriales_odb9。
(busco) czh@ubuntu:~/Desktop/test$ for i in Escherichia_coli_TW10722.fasta Escherichia_coli_IAI39.fasta Escherichia_coli_UMN026.fasta Escherichia_coli_IMT2125.fasta; do /home/czh/miniconda3/envs/busco/bin/run_BUSCO.py -i $i -l '/home/czh/Desktop/busco_database/enterobacteriales_odb9' -o $i -m geno -c 2; done
INFO    ****************** Start a BUSCO 3.0.2 analysis, current time: 10/12/2018 23:54:48 ******************
INFO    Configuration loaded from /home/czh/miniconda3/envs/busco/bin/../config/config.ini
INFO    Init tools...
INFO    Check dependencies...
INFO    Check input file...
INFO    To reproduce this run: python /home/czh/miniconda3/envs/busco/bin/run_BUSCO.py -i Escherichia_coli_TW10722.fasta -o Escherichia_coli_TW10722.fasta -l /home/czh/Desktop/busco_database/enterobacteriales_odb9/ -m genome -c 2 -sp E_coli_K12
INFO    Mode is: genome
INFO    The lineage dataset is: enterobacteriales_odb9 (prokaryota)
INFO    Temp directory is ./tmp/
INFO    ****** Phase 1 of 2, initial predictions ******
INFO    ****** Step 1/3, current time: 10/12/2018 23:54:48 ******
INFO    Create blast database...
INFO    [makeblastdb]   Building a new DB, current time: 10/12/2018 23:54:48
INFO    [makeblastdb]   New DB name:   /home/czh/Desktop/test/tmp/Escherichia_coli_TW10722.fasta_3135827084
INFO    [makeblastdb]   New DB title:  Escherichia_coli_TW10722.fasta
INFO    [makeblastdb]   Sequence type: Nucleotide
INFO    [makeblastdb]   Keep MBits: T
INFO    [makeblastdb]   Maximum file size: 1000000000B
INFO    [makeblastdb]   Adding sequences from FASTA; added 405 sequences in 0.165066 seconds.
INFO    [makeblastdb]   1 of 1 task(s) completed at 10/12/2018 23:54:48
INFO    Running tblastn, writing output to /home/czh/Desktop/test/run_Escherichia_coli_TW10722.fasta/blast_output/tblastn_Escherichia_coli_TW10722.fasta.tsv...
INFO    [tblastn]   1 of 1 task(s) completed at 10/12/2018 23:55:10
INFO    ****** Step 2/3, current time: 10/12/2018 23:55:10 ******
INFO    Maximum number of candidate contig per BUSCO limited to: 3
INFO    Getting coordinates for candidate regions...
INFO    Pre-Augustus scaffold extraction...
INFO    Running Augustus prediction using E_coli_K12 as species:
INFO    [augustus]  Please find all logs related to Augustus errors here: /home/czh/Desktop/test/run_Escherichia_coli_TW10722.fasta/augustus_output/augustus.log
INFO    [augustus]  101 of 1008 task(s) completed at 10/12/2018 23:57:03
INFO    [augustus]  202 of 1008 task(s) completed at 10/12/2018 23:58:47
INFO    [augustus]  303 of 1008 task(s) completed at 10/13/2018 00:00:36
INFO    [augustus]  404 of 1008 task(s) completed at 10/13/2018 00:02:02
INFO    [augustus]  504 of 1008 task(s) completed at 10/13/2018 00:03:48
INFO    [augustus]  605 of 1008 task(s) completed at 10/13/2018 00:05:14
INFO    [augustus]  706 of 1008 task(s) completed at 10/13/2018 00:06:57
INFO    [augustus]  807 of 1008 task(s) completed at 10/13/2018 00:08:41
INFO    [augustus]  908 of 1008 task(s) completed at 10/13/2018 00:10:14
INFO    [augustus]  1008 of 1008 task(s) completed at 10/13/2018 00:11:48
INFO    Extracting predicted proteins...
INFO    ****** Step 3/3, current time: 10/13/2018 00:12:00 ******
INFO    Running HMMER to confirm orthology of predicted proteins:
INFO    [hmmsearch] 101 of 1002 task(s) completed at 10/13/2018 00:12:03
INFO    [hmmsearch] 201 of 1002 task(s) completed at 10/13/2018 00:12:07
INFO    [hmmsearch] 301 of 1002 task(s) completed at 10/13/2018 00:12:12
INFO    [hmmsearch] 401 of 1002 task(s) completed at 10/13/2018 00:12:15
INFO    [hmmsearch] 502 of 1002 task(s) completed at 10/13/2018 00:12:20
INFO    [hmmsearch] 602 of 1002 task(s) completed at 10/13/2018 00:12:24
INFO    [hmmsearch] 702 of 1002 task(s) completed at 10/13/2018 00:12:29
INFO    [hmmsearch] 802 of 1002 task(s) completed at 10/13/2018 00:12:34
INFO    [hmmsearch] 902 of 1002 task(s) completed at 10/13/2018 00:12:38
INFO    [hmmsearch] 1002 of 1002 task(s) completed at 10/13/2018 00:12:42
INFO    Results:
INFO    C:98.0%[S:97.6%,D:0.4%],F:1.5%,M:0.5%,n:781
INFO    765 Complete BUSCOs (C)
INFO    762 Complete and single-copy BUSCOs (S)
INFO    3 Complete and duplicated BUSCOs (D)
INFO    12 Fragmented BUSCOs (F)
INFO    4 Missing BUSCOs (M)
INFO    781 Total BUSCO groups searched
..........

4.评估结果与解读

使用肠杆菌科的保守单拷贝同源基因数据库对四个大肠杆菌基因组组装评估如下: Escherichia_coli_IAI39.fasta (Complete)

C:99.9%[S:99.9%,D:0.0%],F:0.1%,M:-0.0%,n:781
780 Complete BUSCOs (C)
780 Complete and single-copy BUSCOs (S)
0   Complete and duplicated BUSCOs (D)
1   Fragmented BUSCOs (F)
0   Missing BUSCOs (M)
781 Total BUSCO groups searched

Escherichia_coli_UMN026.fasta(Chromosome)

C:99.9%[S:99.9%,D:0.0%],F:0.1%,M:-0.0%,n:781
780 Complete BUSCOs (C)
780 Complete and single-copy BUSCOs (S)
0   Complete and duplicated BUSCOs (D)
1   Fragmented BUSCOs (F)
0   Missing BUSCOs (M)
781 Total BUSCO groups searched

Escherichia_coli_IMT2125.fasta(Scaffold)

C:97.8%[S:97.8%,D:0.0%],F:2.2%,M:0.0%,n:781
764 Complete BUSCOs (C)
764 Complete and single-copy BUSCOs (S)
0   Complete and duplicated BUSCOs (D)
17  Fragmented BUSCOs (F)
0   Missing BUSCOs (M)
781 Total BUSCO groups searched

Escherichia_coli_TW10722.fasta(Contig)

C:98.0%[S:97.6%,D:0.4%],F:1.5%,M:0.5%,n:781
765 Complete BUSCOs (C)
762 Complete and single-copy BUSCOs (S)
3   Complete and duplicated BUSCOs (D)
12  Fragmented BUSCOs (F)
4   Missing BUSCOs (M)
781 Total BUSCO groups searched

结果解读

上有一个教程BUSCO组装质量评估软件对BUSCO结果进行了比较好的中文解释 。
大体如下:

  1. C : 多少个BUSCO测试基因被覆盖。
  2. S : 多少个基因经过比对发现是单拷贝。
  3. D : 多少个基因经过比对发现是多拷贝。
  4. F : 多少个基因经过比对覆盖不完全,只是部分比对上。
  5. M : 没有得到比对结果的基因数。
  6. Total : 总共测试的基因条目。
Genome C S D F M Total
Genome Complete single-copy duplicated Fragment Miss Total
Escherichia_coli_IAI39.fasta (Complete) 99.9% 99.9% 0.0% 0.1% 0.0% 781
Escherichia_coli_UMN026.fasta(Chromosome) 99.9% 99.9% 0.0% 0.1% 0.0% 781
Escherichia_coli_IMT2125.fasta(Scaffold) 97.8% 97.8% 0.0% 2.2% 0.0% 781
Escherichia_coli_TW10722.fasta(Contig) 98.0% 97.6% 0.4% 1.5% 0.5% 781

Complete和Chromosome组装水平的基因组完整度99.9%,而Scaffold和Contig的基因组完整度分别为97.8%和98%,但contig基因组中发现理论上应该是单拷贝基因的被检测到有多个拷贝,可能是组装错误造成的,因此contig的组装质量并不高于Scaffold。

那么也就是说一般来看,按照我个人的看法。S似乎越大越好,M越小越好,说明组装的越完整。但是D与F这两个数值越大不见得就是好的,因为组装饿错误可能会带来这两个值的增大。究竟应该如何评判,我觉得不能仅仅只是通过这一个软件来判定。比如还可以借助QUAST和常规指标N50、总的核酸量、对角线图等等多个评判标准来进行(引自BUSCO - 组装质量评估)。

参考资料

  1. BUSCO组装质量评估软件
  2. BUSCO官方网站
  3. BUSCO使用笔记

你可能感兴趣的:(基因组和转录组组装质量评估利器BUSCO(以大肠杆菌基因组质量评估为例))