使用Unicycler进行细菌基因组组装

二代测序技术的迅速发展大幅度的降低了高通量测序价格且提高了测序质量,同时高通量测序相关的数据处理和分析的软件也是越来越好用。SPAdes在二代测序小基因组的组装上表现优异,使用方便,真是爱不释手。Unicycler在二代测序小基因组装上则是依赖于SPAdes,在此基础上进行了优化;此外在二代三代测序数据混合组装和纯三代测序数据组装上也表现较好;该项目在github上被标星210次。今天使用SPAdes和Unicycler对一个350bp小片段建库测序1G数据量的细菌菌株进行组装,熟悉一下流程并且看看两个软件的差别。


1. 数据质控

公司给的建库测序的数据分为raw_data和clean_data, 我先直接使用fastqc对clean_data的数据进行质控。双端测序数据是N46-1_add_BDMS190627350-1a_1.clean.fq.gz和N46-1_add_BDMS190627350-1a_2.clean.fq.gz。

(denovo_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ fastqc '/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz' '/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz'  -o '/home/czh/Desktop/01_denovo_genome/N46_DENOVO/fastqc_out'

使用Unicycler进行细菌基因组组装_第1张图片
正向序列
使用Unicycler进行细菌基因组组装_第2张图片
反向序列

fastqc结果看,测序长度是150bp,序列数量约为8M,数据量约为1.2G,序列GC含量和duplication水平不是特别完美,总体来看双端序列质量很好,数据的质量和数量满足下游分析。

2. 使用spades对Illumina双端序列进行组装。
(denovo_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ spades.py -1 N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 N46-1_add_BDMS190627350-1a_2.clean.fq.gz --careful -o spades_out -t 12 -m 24
Command line: /home/czh/miniconda3/envs/denovo_genome/bin/spades.py -1  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz  -2  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz  --careful   -o  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out    -t  12  -m  24  

System information:
  SPAdes version: 3.13.1
  Python version: 3.7.3
  OS: Linux-5.0.0-23-generic-x86_64-with-debian-buster-sid

Output dir: /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out
Mode: read error correction and assembling
Debug mode is turned OFF

Dataset parameters:
  Multi-cell mode (you should set '--sc' flag if input data was obtained with MDA (single-cell) technology or --meta flag if processing metagenomic dataset)
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz']
      right reads: ['/home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz']
      interlaced reads: not specified
      single reads: not specified
      merged reads: not specified
Read error correction parameters:
  Iterations: 1
  PHRED offset will be auto-detected
  Corrected reads will be compressed
Assembly parameters:
  k: automatic selection based on read length
  Repeat resolution is enabled
  Mismatch careful mode is turned ON
  MismatchCorrector will be used
  Coverage cutoff is turned OFF
Other parameters:
  Dir for temp files: /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/tmp
  Threads: 12
  Memory limit (in Gb): 24


======= SPAdes pipeline started. Log can be found here: /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/spades.log


===== Read error correction started. 


== Running read error correction tool: /home/czh/miniconda3/envs/denovo_genome/bin/spades-hammer /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/corrected/configs/config.info

  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  75)   Starting BayesHammer, built from N/A, git revision N/A
  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  76)   Loading config from /home/czh/Desktop/01_denovo_genome/N46_DENOVO/spades_out/corrected/configs/config.info
  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  78)   Maximum # of threads to use (adjusted due to OMP capabilities): 12
  0:00:00.000     4M / 4M    INFO    General                 (memory_limit.cpp          :  49)   Memory limit set to 24 Gb
  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  86)   Trying to determine PHRED offset
  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  92)   Determined value is 33


耗时大约70分钟。

3. 使用unicycler对Illumina双端序列进行组装。
(denovo_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ unicycler -1 N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 N46-1_add_BDMS190627350-1a_2.clean.fq.gz -o unicycler_out -t 12
Starting Unicycler (2019-11-03 22:46:05)
    Welcome to Unicycler, an assembly pipeline for bacterial genomes. Since you
provided only short reads, Unicycler will essentially function as a SPAdes-
optimiser. It will try many k-mer sizes, choose the best based on contig length
and graph connectivity, and scaffold the graph using SPAdes repeat resolution.
    For more information, please see https://github.com/rrwick/Unicycler

Command: /home/czh/miniconda3/envs/denovo_genome/bin/unicycler -1 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz -o /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out -t 12

Unicycler version: v0.4.8
Using 12 threads

The output directory already exists:
  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out

Dependencies:
  Program         Version           Status  
  spades.py       3.13.1            good    
  racon                             not used
  makeblastdb     2.9.0+            good    
  tblastn         2.9.0+            good    
  bowtie2-build   2.3.5             good    
  bowtie2         2.3.5             good    
  samtools        1.9               good    
  java            11.0.1-internal   good    
  pilon           1.23              good    
  bcftools                          not used


SPAdes read error correction (2019-11-03 22:47:16)
    Unicycler uses the SPAdes read error correction module to reduce the number
of errors in the short read before SPAdes assembly. This can make the assembly
faster and simplify the assembly graph structure.

Command: /home/czh/miniconda3/envs/denovo_genome/bin/spades.py -1 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_1.clean.fq.gz -2 /home/czh/Desktop/01_denovo_genome/N46_DENOVO/N46-1_add_BDMS190627350-1a_2.clean.fq.gz -o /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/read_correction --threads 12 --only-error-correction

Corrected reads:
  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/corrected_1.fastq.gz
  /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/corrected_2.fastq.gz


Choosing k-mer range for assembly (2019-11-03 23:47:00)
    Unicycler chooses a k-mer range for SPAdes based on the length of the input
reads. It uses a wide range of many k-mer sizes to maximise the chance of
finding an ideal assembly.

SPAdes maximum k-mer: 127
Median read length: 150
K-mer range: 27, 47, 63, 77, 89, 99, 107, 115, 121, 127


SPAdes assemblies (2019-11-03 23:48:07)
    Unicycler now uses SPAdes to assemble the short reads. It scores the
assembly graph for each k-mer using the number of contigs (fewer is better) and
the number of dead ends (fewer is better). The score function is 1/(c*(d+2)),
where c is the contig count and d is the dead end count.

K-mer   Contigs   Dead ends   Score   
   27                           failed
   47                           failed
   63                           failed
   77                           failed
   89                           failed
   99                           failed
  107                           failed
  115                           failed
  121                           failed
  127       198           0   2.53e-03 <-best

Read depth filter: removed 216 contigs totalling 72842 bp
Deleting /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/spades_assembly/


Determining graph multiplicity (2019-11-04 00:07:15)
    Multiplicity is the number of times a sequence occurs in the underlying
sequence. Single-copy contigs (those with a multiplicity of one, occurring only
once in the underlying sequence) are particularly useful.

Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/001_best_spades_graph.gfa


Cleaning graph (2019-11-04 00:07:15)
    Unicycler now performs various cleaning procedures on the graph to remove
overlaps and simplify the graph structure. The end result is a graph ready for
bridging.

Graph overlaps removed

Removed zero-length segments:
    111, 112, 116, 117, 122, 123, 125, 126, 134, 135, 140, 143, 144, 145, 146,
149, 151, 153, 158, 161, 169, 170, 171, 172, 179, 181, 183

Removed zero-length segments:
    110, 165

Removed zero-length segments:
    195

Merged small segments:
    184, 185, 186, 187, 192, 196, 197

Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/002_overlaps_removed.gfa

    Unicycler now selects a set of anchor contigs from the single-copy contigs.
These are the contigs which will be connected via bridges to form the final
assembly.

43 anchor segments (5,368,430 bp) out of 162 total segments (5,423,141 bp)


Creating SPAdes contig bridges (2019-11-04 00:07:15)
    SPAdes uses paired-end information to perform repeat resolution (RR) and
produce contigs from the assembly graph. SPAdes saves the graph paths
corresponding to these contigs in the contigs.paths file. When one of these
paths contains two or more anchor contigs, Unicycler can create a bridge from
the path.

                                                                                                                                         Bridge
Start                                                            Path                                                           End     quality
  -35                                                            -58                                                            44         18.2
  -19       109 -> 114 -> 95 -> -136 -> 106 -> 112 -> 123 -> 80 -> 130 -> 65 -> 136 -> 110 -> -145 -> -103 -> -109 -> 131       43         12.0
    2                                                            -42                                                            37          9.3
    3                                                            102                                                            10         62.2
    5                                                     -96 -> 152 -> -96                                                     -12        37.8
    8                                                            -102                                                           -15        61.8
   10                                                      -98 -> 46 -> -98                                                     1           7.3
   16   -150 -> 113 -> 133 -> -103 -> -109 -> 131 -> 45 -> 132 -> 109 -> 114 -> 124 -> 129 -> 83 -> 121 -> -117 -> -88 -> 156   -28         7.5
   22                                                            -57                                                            -40        11.8
   37                                                            -42                                                            4           8.7


Creating loop unrolling bridges (2019-11-04 00:07:15)
    When a SPAdes contig path connects an anchor contig with the middle contig
of a simple loop, Unicycler concludes that the sequences are contiguous (i.e.
the loop is not a separate piece of DNA). It then uses the read depth of the
middle and repeat contigs to guess the number of times to traverse the loop and
makes a bridge.

                                  Loop count   Loop count    Loop    Bridge
Start   Repeat   Middle     End    by repeat    by middle   count   quality
    2      -42       37       4         1.07         0.97       1      40.5
    5      -96      152     -12         0.67         0.60       1      36.4
    9       97      147      12         0.82         0.71       1      40.1
   10      -98       46       1         0.12         0.88       1      37.5


Applying bridges (2019-11-04 00:07:15)
    Unicycler now applies to the graph in decreasing order of quality. This
ensures that when multiple, contradictory bridges exist, the most supported
option is used.

Bridge type   Start -> end   Path                                       Quality
SPAdes           3 -> 10     102                                         62.210
SPAdes           8 -> -15    -102                                        61.764
SPAdes           5 -> -12    -96, 152, -96                               37.778
SPAdes         -35 -> 44     -58                                         18.167
SPAdes         -19 -> 43     109, 114, 95, -136, 106, 112, 123, 80,      11.994
                             130, 65, 136, 110, -145, -103, -109, 131          
SPAdes          22 -> -40    -57                                         11.829
loop             2 -> 4      -42, 37, -42                                40.460
loop             9 -> 12     97, 147, 97                                 40.120
loop            10 -> 1      -98, 46, -98                                37.498

Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/003_bridges_applied.gfa


Bridged assembly graph (2019-11-04 00:07:16)
    The assembly is now mostly finished and no more structural changes will be
made. Ideally the assembly graph should now have one contig per replicon and no
erroneous contigs (i.e a complete assembly). If there are more contigs, then
the assembly is not complete.

Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/004_final_clean.gfa

Component   Segments   Links   Length      N50         Longest segment   Status    
        1        131     184   5,430,795   1,298,586         1,645,591   incomplete


Polishing assembly with Pilon (2019-11-04 00:07:16)
    Unicycler now conducts multiple rounds of Pilon in an attempt to repair any
remaining small-scale errors with the assembly.

Aligning reads to find appropriate insert size range...
Insert size 1st percentile:  190
Insert size 99th percentile: 513

Pilon polish round 1
Total number of changes: 29

Pilon polish round 2
Total number of changes: 4

Pilon polish round 3
No Pilon changes

Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/005_polished.gfa


Assembly complete (2019-11-04 02:13:11)
Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/assembly.gfa
Saving /home/czh/Desktop/01_denovo_genome/N46_DENOVO/add_out/assembly.fasta

总共耗时3小时左右,比较慢。

4. 使用QUAST评估两个软件组装结果。
(qc_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ quast -o quast_out(qc_genome) czh@ubuntu:~/Desktop/01_denovo_genome/N46_DENOVO$ quast -o quast_out/ -t 8 spades_assembly.fasta unicycler_assembly.fasta 
/home/czh/miniconda3/envs/qc_genome/bin/quast -o quast_out/ -t 8 spades_assembly.fasta unicycler_assembly.fasta

Version: 5.0.2

System information:
  OS: Linux-5.0.0-23-generic-x86_64-with-debian-buster-sid (linux_64)
  Python version: 3.6.7
  CPUs number: 16

Started: 2019-11-04 06:14:57

Logging to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/quast.log

CWD: /home/czh/Desktop/01_denovo_genome/N46_DENOVO
Main parameters: 
  MODE: default, threads: 8, minimum contig length: 500, minimum alignment length: 65, \
  ambiguity: one, threshold for extensive misassembly size: 1000

Contigs:
  Pre-processing...
  1  spades_assembly.fasta ==> spades_assembly
  2  unicycler_assembly.fasta ==> unicycler_assembly

2019-11-04 06:15:06
Running Basic statistics processor...
  Contig files: 
    1  spades_assembly
    2  unicycler_assembly
  Calculating N50 and L50...
    1  spades_assembly, N50 = 472283, L50 = 5, Total length = 5427777, GC % = 57.74, # N's per 100 kbp =  15.92
    2  unicycler_assembly, N50 = 1298586, L50 = 2, Total length = 5419702, GC % = 57.76, # N's per 100 kbp =  0.00
  Drawing Nx plot...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/Nx_plot.pdf
  Drawing cumulative plot...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/cumulative_plot.pdf
  Drawing GC content plot...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/GC_content_plot.pdf
  Drawing spades_assembly GC content plot...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/spades_assembly_GC_content_plot.pdf
  Drawing unicycler_assembly GC content plot...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/unicycler_assembly_GC_content_plot.pdf
  Drawing Coverage histogram (bin size: 24x)...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/coverage_histogram.pdf
  Drawing spades_assembly coverage histogram (bin size: 24x)...
    saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/basic_stats/spades_assembly_coverage_histogram.pdf
Done.

NOTICE: Genes are not predicted by default. Use --gene-finding or --glimmer option to enable it.

2019-11-04 06:15:08
Creating large visual summaries...
This may take a while: press Ctrl-C to skip this step..
  1 of 2: Creating Icarus viewers...
  2 of 2: Creating PDF with all tables and plots...
Done

2019-11-04 06:15:08
RESULTS:
  Text versions of total report are saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/report.txt, report.tsv, and report.tex
  Text versions of transposed total report are saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/transposed_report.txt, transposed_report.tsv, and transposed_report.tex
  HTML version (interactive tables and plots) is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/report.html
  PDF version (tables and plots) is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/report.pdf
  Icarus (contig browser) is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/icarus.html
  Log is saved to /home/czh/Desktop/01_denovo_genome/N46_DENOVO/quast_out/quast.log

Finished: 2019-11-04 06:15:08
Elapsed time: 0:00:11.039108
NOTICEs: 1; WARNINGs: 0; non-fatal ERRORs: 0

Thank you for using QUAST!
使用Unicycler进行细菌基因组组装_第3张图片

quast评估结果可以看出unicycler组装结果优于spades,unicycler应该是矫正和剔除了长度太短和一些冗余的contig,减少了contig的数量,提高了N50的长度,但基因组总长度小于spades组装。


5. unicycler调用spades初步组装后对contigs的处理。

区分单拷贝的contigs和重复的contigs,选择合适的单拷贝contigs进行下一步处理。


使用Unicycler进行细菌基因组组装_第4张图片

剔除contig间的重叠区域,为了尽可能的环化contigs。

使用Unicycler进行细菌基因组组装_第5张图片

单拷贝contigs桥接图

使用Unicycler进行细菌基因组组装_第6张图片

最后使用 Pilon将短序列mapping到组装好的基因组上进行矫正。


unicycler相较于spades耗时较长,组装结果较好,这只是一个菌基因组的组装比较结果,可能在其他菌株unicycler在组装并不会有较大提升,反而会降低N50,不过N50不是评估基因组的唯一标准,过于苛求更长的N50可能会导致错误组装,可能会误导后续基因注释和功能研究。

你可能感兴趣的:(使用Unicycler进行细菌基因组组装)