pb-assembly=0.0.6 parameter settings

Input

[General]
input_fofn=input.fofn
input_type=raw
pa_DBdust_option=true
pa_fasta_filter_option=streamed-median

input_type: can be raw or preads. If preads is specified, the pipeline skips the entire 0-rawreads pre-assembly stage.

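pa_fasta_filter_option: defaults to streamed-internal-median. It decides which subread to keep when a ZMW has several subreads. "pass": no filtering, every subread is kept; "streamed-median": the median-length subread is selected; "streamed-internal-median": when a ZMW has fewer than 3 subreads the longest one is selected, otherwise the median-length subread is selected.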
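
The selection rule described above can be sketched in Python. This is a simplified illustration of the behaviour as described in these notes, not FALCON's actual implementation. The function name select_subreads is hypothetical, and picking the upper-middle element for an even number of subreads is an assumption.

```python
def select_subreads(lengths, mode="streamed-internal-median"):
    """Return the subread lengths kept for one ZMW (hypothetical helper)."""
    if not lengths:
        return []
    if mode == "pass":
        return list(lengths)  # no filtering: keep every subread
    ordered = sorted(lengths)
    median = ordered[len(ordered) // 2]  # assumption: upper middle on even counts
    if mode == "streamed-median":
        return [median]
    if mode == "streamed-internal-median":
        # fewer than 3 subreads: keep the longest; otherwise the median-length one
        return [ordered[-1]] if len(ordered) < 3 else [median]
    raise ValueError(f"unknown filter mode: {mode}")
```

For example, a ZMW with subreads of 3, 10 and 7 kb keeps only the 7 kb subread under the default mode, while "pass" keeps all three.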

Data Partitioning

# large genomes
pa_DBsplit_option=-x500 -s200
ovlp_DBsplit_option=-x500 -s200

# small genomes (<10Mb)
pa_DBsplit_option = -x500 -s50
ovlp_DBsplit_option = -x500 -s50

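These settings are passed to DBsplit, which splits the data into multiple blocks; all subsequent computation operates on these blocks. -x sets the minimum read length kept in the database, and -s controls the size of each DB block in Mbp.

If pa_fasta_filter_option=pass was set above, add the -a option to pa_DBsplit_option.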
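
To get a feel for how -x and -s interact, the block count can be approximated from the read lengths. The sketch below is a rough estimate only: the function name estimate_blocks is hypothetical, and the real DBsplit places block boundaries on whole reads, so the actual count may differ slightly.

```python
import math

def estimate_blocks(read_lengths, min_len=500, block_mbp=200):
    """Rough block count for -x<min_len> -s<block_mbp> (illustrative only)."""
    # reads shorter than the -x threshold do not contribute to any block
    kept_bases = sum(n for n in read_lengths if n >= min_len)
    return max(1, math.ceil(kept_bases / (block_mbp * 1_000_000)))
```

With 50,000 reads of 10 kb each (500 Mbp in total), -s200 gives about 3 blocks and -s50 gives about 10.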

Repeat Masking

pa_HPCTANmask_option=
pa_REPmask_code=0,300;0,300;0,300

Repeat masking occurs in two phases, Tandem and Interspersed. Tandem repeat masking is run with a modified version of daligner called datander and thus uses a similar parameter set. Whatever settings you use for pre-assembly daligner overlapping in the next section (pa_daligner_option) will be used here for tandem repeat masking. You can supply additional arguments for tandem repeat masking that will be passed to HPC.TANmask with the pa_HPCTANmask_option.

The second phase of masking deals with interspersed repeats and can be run in up to 3 iterations, specified with the pa_REPmask_code option. Each iteration needs a group size and a coverage, written as group,coverage pairs separated by semicolons as seen above.
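
The group,coverage format can be made concrete with a small parser. This is only an illustration of the string layout described above; parse_repmask_code is a hypothetical helper and not part of FALCON.

```python
def parse_repmask_code(code):
    """Split 'group,coverage;group,coverage;...' into a list of int pairs."""
    rounds = []
    for chunk in code.split(";"):
        group, coverage = (int(x) for x in chunk.split(","))
        rounds.append((group, coverage))
    if len(rounds) > 3:
        raise ValueError("pa_REPmask_code takes at most 3 iterations")
    return rounds
```

Applied to the value shown above, parse_repmask_code("0,300;0,300;0,300") yields three identical (0, 300) rounds.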

For information and theory on how to set up your rounds of repeat masking, consult this blog post.

Pre-assembly

genome_size=1000000000
seed_coverage=30
length_cutoff=-1    
pa_HPCdaligner_option=-v -B128 -M24
pa_daligner_option=-e0.8 -l2000 -k18 -h480  -w8 -s100
falcon_sense_option=--output-multi --min-idt 0.70 --min-cov 3 --max-n-read 400
falcon_sense_greedy=False

During pre-assembly, the PacBio subreads are aligned and error correction is performed. The longest subreads are chosen as seed reads and all shorter reads are aligned to them and consensus sequences are generated from the alignments. These consensus sequences are called pre-assembled reads or preads and generally have accuracy greater than 99% or QV20.

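To have the seed-read length cutoff calculated automatically, set length_cutoff=-1 and provide genome_size and seed_coverage; the cutoff is then derived from those two values. We generally recommend 20-40x seed coverage.
Alternatively, if you do not know the genome size, are unsure what seed_coverage to use, or simply want to use all reads above a specific length, you can set that limit manually with length_cutoff.

Note that whatever length_cutoff is set to is also a limit for falcon-unzip: no read shorter than the cutoff is used for phasing. For assembly alone, a high length_cutoff is unlikely to do harm unless you expect a specific feature such as micro-chromosomes or short circular plasmids. If you plan to unzip, however, a high cutoff artificially limits your phasing dataset, and a lower length_cutoff may serve you better. Most of the computation happens during pre-assembly, so if compute time matters to you, raising length_cutoff improves efficiency at the price of the trade-off just described.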
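
One way to picture how a length cutoff relates to genome_size and seed_coverage is to walk through the reads from longest to shortest until the requested number of seed bases has been collected. The sketch below illustrates that idea; auto_length_cutoff is a hypothetical name, and FALCON's own calculation may differ in its details.

```python
def auto_length_cutoff(read_lengths, genome_size, seed_coverage):
    """Length of the shortest read needed to reach the target seed bases."""
    target = genome_size * seed_coverage  # total bases wanted in seed reads
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target:
            return length
    raise ValueError("not enough bases to reach the requested seed coverage")
```

A lower seed_coverage therefore produces a higher cutoff, fewer seed reads and less pre-assembly compute, which is the trade-off against phasing data discussed above.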

Overlap options for daligner are set with the pa_HPCdaligner_option and pa_daligner_option flags. Previous versions of FALCON had a single parameter. This is now split into two flags: one that affects requested resources (pa_HPCdaligner_option) and one that affects the overlap search (pa_daligner_option). For pa_HPCdaligner_option, the -v parameter is passed to the LAsort and LAmerge programs, while the -B and -M parameters are passed to the daligner sub-commands.

To understand the theory and how to configure daligner see this blog post and this command reference guide.

For daligner, in general we recommend the following:

-e: average correlation rate (average sequence identity)

0.70 (low quality data) - 0.80 (high quality data). A higher value will help prevent haplotype collapse.

-l: minimum length of overlap

1000 (shorter library) - 5000 (longer library)

-k: kmer size

14 (low quality data) - 18 (high quality data)

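Lower values of -k are more sensitive, at the cost of more disk space, higher memory consumption and longer run times, and work best on lower-quality data. Conversely, larger k-mer values are more specific, use fewer system resources and run faster, but are only suitable for high-quality data.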

You can configure basic pre-assembly consensus calling options with the falcon_sense_option flag.
--output-multi: necessary for generating proper fasta headers
--min-idt: minimum alignment identity
--min-cov: minimum coverage necessary
--max-n-read: maximum number of reads used for calling consensus to make the preads

By default, -fo are the parameters passed to LA4Falcon. The option falcon_sense_greedy changes this parameter set to -fog which essentially attempts to maintain relative information between reads that have been broken due to regions of low quality.

Pread overlapping

ovlp_HPCdaligner_option=-v -M24 -l500
ovlp_daligner_option=-e.96 -s1000 -h60

The second phase, overlapping of the error-corrected reads, occurs in a similar fashion to the overlapping performed in the pre-assembly; however, no repeat masking is performed and no consensus is called. Overlaps are identified and fed into the final assembly. The parameter options work the same way as described above in the pre-assembly section.

Recommendation for preads:

-e: average correlation rate (average sequence identity)

0.93 (inbred) - 0.96 (outbred)

-l: minimum length of overlap

1800 (poor preassembly, short/low quality library) - 6000 (long, high quality library)

-k: kmer size

18 (low quality) - 24 (most cases)

Final Assembly

# experiment with "--min-idt" to collapse (98-99) or split haplotypes (up to 99.9) during contig assembly
# if you plan to unzip, collapse first using ~98, lower for very divergent haplotypes
# ignore indels looks at only substitutions in overlaps, allows higher overlap stringency to reduce repeat-induced errors
overlap_filtering_setting = --max-diff 400 --max-cov 400 --min-cov 2 --n-core 24 --min-idt 99.9 --ignore-indels

overlap_filtering_setting=--max-diff 100 --max-cov 100 --min-cov 2
fc_ovlp_to_graph_option=
length_cutoff_pr=1000

The option overlap_filtering_setting sets the criteria for filtering pread overlaps. --max-diff filters overlaps whose coverage differs between the 5' and 3' ends by more than the specified value. --max-cov filters highly represented overlaps, typically caused by contaminants or repeats, and --min-cov specifies a minimum overlap coverage.

Setting --min-cov too low allows more overlaps to be detected, at the cost of possible additional chimeric contigs or misassemblies.
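
The three coverage filters can be read as a simple per-read decision. The sketch below follows the description given here; keep_read is a hypothetical helper, and applying --max-cov and --min-cov to each end separately is an assumption rather than a statement about FALCON's code.

```python
def keep_read(cov5, cov3, max_diff=100, max_cov=100, min_cov=2):
    """Decide whether a pread's overlaps pass the three coverage filters.

    cov5 and cov3 are the overlap counts at the 5' and 3' ends of the read.
    """
    if abs(cov5 - cov3) > max_diff:
        return False  # coverage too unbalanced between the two ends
    if cov5 > max_cov or cov3 > max_cov:
        return False  # over-represented: likely repeat or contaminant
    if cov5 < min_cov or cov3 < min_cov:
        return False  # too little overlap support
    return True
```

With the defaults above, a read covered 10x and 12x at its ends is kept, while one covered 10x and 150x, or 1x and 10x, is dropped.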

length_cutoff_pr is the minimum length of pre-assembled preads used for the final assembly. Typically, this value is set to allow for approximately 15 to 30-fold coverage of corrected reads in the final assembly.
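
To check whether a candidate length_cutoff_pr lands in that range, the coverage retained at a given cutoff can be computed directly from the pread lengths. This is a minimal sketch; coverage_at_cutoff is a hypothetical helper and the genome size is whatever estimate you are working with.

```python
def coverage_at_cutoff(pread_lengths, cutoff, genome_size):
    """Fold coverage contributed by preads at least `cutoff` bases long."""
    return sum(n for n in pread_lengths if n >= cutoff) / genome_size
```

Raising the cutoff until the result falls between 15 and 30 gives a starting value for length_cutoff_pr.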

Miscellaneous configuration options

Additional configuration options that don't necessarily fit into one of the previous categories are described here.

target=assembly
skip_checks=False
LA4Falcon_preload=false

FALCON can be configured to stop after any of its three stages by setting the target flag to overlapping, pre-assembly or assembly. Each option stops the pipeline at the end of its corresponding stage: 0-rawreads, 1-preads_ovl or 2-asm-falcon respectively. The default is the full assembly pipeline.
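
The mapping from target to the last stage directory can be looked up from a config file with Python's standard configparser. This is only a convenience sketch: last_stage and STAGE_FOR_TARGET are hypothetical names, and the mapping simply restates the paragraph above.

```python
import configparser

# target value -> directory of the last stage that is run
STAGE_FOR_TARGET = {
    "overlapping": "0-rawreads",
    "pre-assembly": "1-preads_ovl",
    "assembly": "2-asm-falcon",
}

def last_stage(cfg_text):
    """Return the final stage directory implied by the [General] target."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # fall back to the full pipeline when target is not set
    target = parser.get("General", "target", fallback="assembly")
    return STAGE_FOR_TARGET[target]
```

For a config containing target=pre-assembly this returns 1-preads_ovl; without a target line it returns 2-asm-falcon.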

The flag skip_checks disables .las file checks with LAcheck which has been known to cause errors on certain systems in the past.

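The option LA4Falcon_preload passes the -P parameter to LA4Falcon, which loads all reads into memory. On slower file systems this can speed things up significantly, but it greatly increases the memory requirements of the consensus stage.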