Reposted from: https://mp.weixin.qq.com/s/lgJDpwk0vYipARfTorfCkA
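First, not every analysis needs its paired-end reads merged. Transcriptome analysis, for example, does not; merging is most common in amplicon sequencing.
Why merge at all? Second-generation sequencing shears DNA or RNA into fragments of a chosen length, for example 300-400 bp, but can only read a limited length from one end, for example 150 nt. Beyond that length the base quality drops so sharply that the extra bases are essentially useless. That leaves 150-250 bp of the fragment unread, so the same fragment is sequenced a second time from the opposite end. When the fragment is shorter than twice the read length, the two reads overlap, and that overlap is what makes merging possible.
The following is a Clone Manager alignment of the forward and reverse reads from one DNA fragment in a paired-end run. In the figure, blue and red mark the sequence where the two reads match, about 111 bp long, while the sheared DNA/RNA fragment itself is about 189 bp long.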
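The numbers in this example fit together if both reads are 150 nt long. That is an assumption taken from the read length mentioned earlier, not something stated for the figure: two 150 nt reads cover 300 nt, so on a 189 bp fragment 300 - 189 = 111 bp are read twice. A minimal shell sketch of that arithmetic follows; the function name overlap_len is made up for illustration, and it assumes the reads are no longer than the fragment.

```shell
# overlap_len READ_LEN FRAGMENT_LEN
# Prints the expected overlap (bp) between the two reads of a pair,
# or 0 when the fragment is too long for the reads to meet.
# Assumption: both reads have the same length, not longer than the fragment.
overlap_len() {
    ov=$(( 2 * $1 - $2 ))
    if [ "$ov" -lt 0 ]; then
        ov=0
    fi
    echo "$ov"
}

overlap_len 150 189   # prints 111
overlap_len 150 400   # prints 0: no overlap, so the pair cannot be merged
```

The second call shows why fragments at the long end of a 300-400 bp library cannot be merged from 150 nt reads.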
1. Software installation
Download and install FLASH from the command line on Linux.
Download and build it yourself:
wget http://ccb.jhu.edu/software/FLASH/FLASH-1.2.11.tar.gz
tar -zxvf FLASH-1.2.11.tar.gz   # unpack FLASH-1.2.11.tar.gz
cd FLASH-1.2.11/                # enter the FLASH-1.2.11 source directory
make                            # compile; this builds the executable 'flash' in this directory
Or install with conda, then print the built-in help:
conda install -c bioconda flash
flash --help
Usage: flash [OPTIONS] MATES_1.FASTQ MATES_2.FASTQ
flash [OPTIONS] --interleaved-input (MATES.FASTQ | -)
flash [OPTIONS] --tab-delimited-input (MATES.TAB | -)
----------------------------------------------------------------------------
DESCRIPTION
----------------------------------------------------------------------------
FLASH (Fast Length Adjustment of SHort reads) is an accurate and fast tool
to merge paired-end reads that were generated from DNA fragments whose
lengths are shorter than twice the length of reads. Merged read pairs result
in unpaired longer reads, which are generally more desired in genome
assembly and genome analysis processes.
Briefly, the FLASH algorithm considers all possible overlaps at or above a
minimum length between the reads in a pair and chooses the overlap that
results in the lowest mismatch density (proportion of mismatched bases in
the overlapped region). Ties between multiple overlaps are broken by
considering quality scores at mismatch sites. When building the merged
sequence, FLASH computes a consensus sequence in the overlapped region.
More details can be found in the original publication
(http://bioinformatics.oxfordjournals.org/content/27/21/2957.full).
Limitations of FLASH include:
- FLASH cannot merge paired-end reads that do not overlap.
- FLASH is not designed for data that has a significant amount of indel
errors (such as Sanger sequencing data). It is best suited for Illumina
data.
----------------------------------------------------------------------------
MANDATORY INPUT
----------------------------------------------------------------------------
The most common input to FLASH is two FASTQ files containing read 1 and read 2
of each mate pair, respectively, in the same order.
Alternatively, you may provide one FASTQ file, which may be standard input,
containing paired-end reads in either interleaved FASTQ (see the
--interleaved-input option) or tab-delimited (see the --tab-delimited-input
option) format. In all cases, gzip compressed input is autodetected. Also,
in all cases, the PHRED offset is, by default, assumed to be 33; use the
--phred-offset option to change it.
----------------------------------------------------------------------------
OUTPUT
----------------------------------------------------------------------------
The default output of FLASH consists of the following files:
- out.extendedFrags.fastq The merged reads.
- out.notCombined_1.fastq Read 1 of mate pairs that were not merged.
- out.notCombined_2.fastq Read 2 of mate pairs that were not merged.
- out.hist Numeric histogram of merged read lengths.
- out.histogram Visual histogram of merged read lengths.
FLASH also logs informational messages to standard output. These can also be
redirected to a file, as in the following example:
$ flash reads_1.fq reads_2.fq 2>&1 | tee flash.log
In addition, FLASH supports several features affecting the output:
- Writing the merged reads directly to standard output (--to-stdout)
- Writing gzip compressed output files (-z) or using an external
compression program (--compress-prog)
- Writing the uncombined read pairs in interleaved FASTQ format
(--interleaved-output)
- Writing all output reads to a single file in tab-delimited format
(--tab-delimited-output)
----------------------------------------------------------------------------
OPTIONS
----------------------------------------------------------------------------
-m, --min-overlap=NUM The minimum required overlap length between two
reads to provide a confident overlap. Default:
10bp.
-M, --max-overlap=NUM Maximum overlap length expected in approximately
90% of read pairs. It is by default set to 65bp,
which works well for 100bp reads generated from a
180bp library, assuming a normal distribution of
fragment lengths. Overlaps longer than the maximum
overlap parameter are still considered as good
overlaps, but the mismatch density (explained below)
is calculated over the first max_overlap bases in
the overlapped region rather than the entire
overlap. Default: 65bp, or calculated from the
specified read length, fragment length, and fragment
length standard deviation.
-x, --max-mismatch-density=NUM
Maximum allowed ratio between the number of
mismatched base pairs and the overlap length.
Two reads will not be combined with a given overlap
if that overlap results in a mismatched base density
higher than this value. Note: Any occurence of an
'N' in either read is ignored and not counted
towards the mismatches or overlap length. Our
experimental results suggest that higher values of
the maximum mismatch density yield larger
numbers of correctly merged read pairs but at
the expense of higher numbers of incorrectly
merged read pairs. Default: 0.25.
-O, --allow-outies Also try combining read pairs in the "outie"
orientation, e.g.
Read 1: <-----------
Read 2:       ------------>
as opposed to only the "innie" orientation, e.g.
Read 1:       <------------
Read 2: ----------->
FLASH uses the same parameters when trying each
orientation. If a read pair can be combined in
both "innie" and "outie" orientations, the
better-fitting one will be chosen using the same
scoring algorithm that FLASH normally uses.
This option also causes extra .innie and .outie
histogram files to be produced.
-p, --phred-offset=OFFSET
The smallest ASCII value of the characters used to
represent quality values of bases in FASTQ files.
It should be set to either 33, which corresponds
to the later Illumina platforms and Sanger
platforms, or 64, which corresponds to the
earlier Illumina platforms. Default: 33.
-r, --read-len=LEN
-f, --fragment-len=LEN
-s, --fragment-len-stddev=LEN
Average read length, fragment length, and fragment
standard deviation. These are convenience parameters
only, as they are only used for calculating the
maximum overlap (--max-overlap) parameter.
The maximum overlap is calculated as the overlap of
average-length reads from an average-size fragment
plus 2.5 times the fragment length standard
deviation. The default values are -r 100, -f 180,
and -s 18, so this works out to a maximum overlap of
65 bp. If --max-overlap is specified, then the
specified value overrides the calculated value.
If you do not know the standard deviation of the
fragment library, you can probably assume that the
standard deviation is 10% of the average fragment
length.
--cap-mismatch-quals Cap quality scores assigned at mismatch locations
to 2. This was the default behavior in FLASH v1.2.7
and earlier. Later versions will instead calculate
such scores as max(|q1 - q2|, 2); that is, the
absolute value of the difference in quality scores,
but at least 2. Essentially, the new behavior
prevents a low quality base call that is likely a
sequencing error from significantly bringing down
the quality of a high quality, likely correct base
call.
--interleaved-input Instead of requiring files MATES_1.FASTQ and
MATES_2.FASTQ, allow a single file MATES.FASTQ that
has the paired-end reads interleaved. Specify "-"
to read from standard input.
--interleaved-output Write the uncombined pairs in interleaved FASTQ
format.
-I, --interleaved Equivalent to specifying both --interleaved-input
and --interleaved-output.
-Ti, --tab-delimited-input
Assume the input is in tab-delimited format
rather than FASTQ, in the format described below in
'--tab-delimited-output'. In this mode you should
provide a single input file, each line of which must
contain either a read pair (5 fields) or a single
read (3 fields). FLASH will try to combine the read
pairs. Single reads will be written to the output
file as-is if also using --tab-delimited-output;
otherwise they will be ignored. Note that you may
specify "-" as the input file to read the
tab-delimited data from standard input.
-To, --tab-delimited-output
Write output in tab-delimited format (not FASTQ).
Each line will contain either a combined pair in the
format 'tag seq qual' or an uncombined
pair in the format 'tag seq_1 qual_1
seq_2 qual_2'.
-o, --output-prefix=PREFIX
Prefix of output files. Default: "out".
-d, --output-directory=DIR
Path to directory for output files. Default:
current working directory.
-c, --to-stdout Write the combined reads to standard output. In
this mode, with FASTQ output (the default) the
uncombined reads are discarded. With tab-delimited
output, uncombined reads are included in the
tab-delimited data written to standard output.
In both cases, histogram files are not written,
and informational messages are sent to standard
error rather than to standard output.
-z, --compress Compress the output files directly with zlib,
using the gzip container format. Similar to
specifying --compress-prog=gzip and --suffix=gz,
but may be slightly faster.
--compress-prog=PROG Pipe the output through the compression program
PROG, which will be called as `PROG -c -',
plus any arguments specified by --compress-prog-args.
PROG must read uncompressed data from standard input
and write compressed data to standard output when
invoked as noted above.
Examples: gzip, bzip2, xz, pigz.
--compress-prog-args=ARGS
A string of additional arguments that will be passed
to the compression program if one is specified with
--compress-prog=PROG. (The arguments '-c -' are
still passed in addition to explicitly specified
arguments.)
--suffix=SUFFIX, --output-suffix=SUFFIX
Use SUFFIX as the suffix of the output files
after ".fastq". A dot before the suffix is assumed,
unless an empty suffix is provided. Default:
nothing; or 'gz' if -z is specified; or PROG if
--compress-prog=PROG is specified.
-t, --threads=NTHREADS Set the number of worker threads. This is in
addition to the I/O threads. Default: number of
processors. Note: if you need FLASH's output to
appear deterministically or in the same order as
the original reads, you must specify -t 1
(--threads=1).
-q, --quiet Do not print informational messages.
-h, --help Display this help and exit.
-v, --version Display version.
Run `flash --help | less' to prevent this text from scrolling by.
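The DESCRIPTION section of the help text says that FLASH tries every overlap at or above the minimum length and keeps the one with the lowest mismatch density. The sketch below only illustrates that idea; it is not FLASH's implementation. It assumes read 2 has already been reverse-complemented, and it ignores quality scores, 'N' bases, --max-overlap and the "outie" orientation. When two overlaps tie, it simply keeps the longer one. The names merge_pair and NOT_COMBINED are hypothetical.

```shell
# merge_pair READ1 READ2_REVCOMP [MIN_OVERLAP] [MAX_DENSITY]
# READ2_REVCOMP is read 2 already reverse-complemented, so both
# sequences are on the same strand. Defaults mirror -m 10 and -x 0.25.
merge_pair() {
    awk -v a="$1" -v b="$2" -v min="${3:-10}" -v maxd="${4:-0.25}" 'BEGIN {
        la = length(a); lb = length(b)
        maxk = (la < lb) ? la : lb
        best = -1; bestd = 2
        # Overlap of length k: last k bases of a against first k bases of b.
        for (k = maxk; k >= min; k--) {
            mm = 0
            for (i = 1; i <= k; i++)
                if (substr(a, la - k + i, 1) != substr(b, i, 1)) mm++
            d = mm / k                       # mismatch density of this overlap
            if (d < bestd) { bestd = d; best = k }
        }
        if (best < 0 || bestd > maxd) { print "NOT_COMBINED"; exit }
        # Simplification: keep the bases of read 1 inside the overlap
        # instead of building a quality-based consensus.
        print a substr(b, best + 1)
    }'
}

# Toy 22 bp fragment covered by two 14 nt reads that share 6 bases.
merge_pair ATGCGTACCTGAAG CTGAAGTCCATGGA 5   # prints ATGCGTACCTGAAGTCCATGGA
```

Real reads should of course be merged with flash itself; the sketch is only meant to make the term "mismatch density" concrete.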
2. Usage
flash read1.fq read2.fq -p 33 -r 250 -f 500 -s 100 -o output
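Main parameters:
-m minimum overlap length required to merge a pair; default 10 bp;
-M maximum overlap length expected in about 90% of read pairs; default 65 bp. Longer overlaps are still accepted, but the mismatch density is then computed over only the first -M bases;
-x maximum mismatch density allowed in the overlap (number of mismatched bases / overlap length); default 0.25;
-p PHRED quality offset, 33 or 64; default 33;
-r average read length;
-f average fragment length, i.e. the insert size of the sequencing library;
-s standard deviation of the fragment length; -r, -f and -s are used only to calculate -M when -M is not given;
-o prefix of the output files;
-z write gzip-compressed output files;
-t number of worker threads; defaults to the number of processors. FLASH is multithreaded and fast, but use -t 1 if the output must keep the order of the input reads;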
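According to the help text, when -M is not given FLASH derives it from -r, -f and -s as the overlap of average-length reads on an average-size fragment plus 2.5 times the standard deviation. A small shell sketch of that formula, using integer arithmetic (so half base pairs are truncated, which is an assumption about rounding); the function name max_overlap is made up for illustration:

```shell
# max_overlap READ_LEN FRAGMENT_LEN FRAGMENT_LEN_STDDEV
# (2 * r - f) is the overlap of average reads on an average fragment;
# 2.5 * s is written as 5 * s / 2 to stay in integer arithmetic.
max_overlap() {
    echo $(( 2 * $1 - $2 + 5 * $3 / 2 ))
}

max_overlap 100 180 18    # prints 65, the documented default
max_overlap 250 500 100   # prints 250, for -r 250 -f 500 -s 100
```

With -r 250 -f 500 the average pair does not overlap at all (2 * 250 - 500 = 0), so only the shorter fragments in such a library can be merged.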
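By default FLASH writes five result files; the log is a sixth file that exists only if you redirect FLASH's messages into it:
output.extendedFrags.fastq  the merged fragment sequences;
output.flash.log  the log, recording the parameters and merge statistics (FLASH prints these to standard output; capture them with, for example, flash ... 2>&1 | tee output.flash.log);
output.hist  numeric histogram of the merged read lengths;
output.histogram  visual histogram of the merged read lengths;
output.notCombined_1.fastq  read 1 of the pairs that could not be merged;
output.notCombined_2.fastq  read 2 of the pairs that could not be merged;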
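A quick way to see how many pairs were merged is to count the records in these files. The sketch below assumes ordinary 4-line FASTQ records, plain or gzip-compressed; the helper name count_reads is hypothetical and not part of FLASH.

```shell
# count_reads FILE...
# Prints "FILE N" for each file, where N is the number of FASTQ records
# (line count divided by 4). Files ending in .gz are decompressed on the fly.
count_reads() {
    for f in "$@"; do
        case "$f" in
            *.gz) n=$(gzip -dc "$f" | wc -l) ;;
            *)    n=$(wc -l < "$f") ;;
        esac
        echo "$f $(( n / 4 ))"
    done
}

# Example call, once the FLASH output exists:
# count_reads output.extendedFrags.fastq output.notCombined_1.fastq
```

Dividing the count for the extendedFrags file by the number of input pairs gives the fraction of pairs that merged.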
Batch merging
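for id in *_R1.fastq.gz
do
    [ -e "$id" ] || continue    # skip if no R1 file matched
    sample=${id%_*}             # sampleA_R1.fastq.gz -> sampleA
    mkdir -p "$sample"
    # -O also tries the "outie" orientation; -z gzips the output
    flash -O -m 10 -M 100 -x 0.25 -z -o "$sample" -d "./$sample" \
        "${sample}_R1.fastq.gz" "${sample}_R2.fastq.gz"
done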
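The loop gets each sample name from the shell expansion ${id%_*}, which removes the shortest suffix that starts with an underscore. That works for names like sampleA_R1.fastq.gz but not for every naming scheme, so it is worth checking on your own file names first. The file names below are made-up examples, and sample_name is a hypothetical helper that wraps the same expansion:

```shell
# sample_name FILE
# Applies the same ${id%_*} expansion that the loop uses.
sample_name() {
    id=$1
    echo "${id%_*}"
}

sample_name sampleA_R1.fastq.gz       # prints sampleA
sample_name S1_L001_R1.fastq.gz       # prints S1_L001
sample_name sampleA_R1_001.fastq.gz   # prints sampleA_R1, so the R2 file would not be found
```

For names of the third kind, the suffix to strip has to be spelled out, for example ${id%_R1_001.fastq.gz}.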