Lecture 2——Basics of data processing

本文图片来自于学习视频——新一代测序技术数据分析第二讲

Lecture 2——Basics of data processing

Review Lecutre 1

Lecture 2——Basics of data processing_第1张图片

Outline

Date analysis workflow
Sequence qualify evaluation
Phred scores
NGS error rates
Alignment
Smith-Waterman algorithm
Theories on short reads alignment
Suffix free, indexing, and Burrows-Wheeler transformation
Comparison of different aligners
Data formats
FASTQ, SAM, pileups, VCF
Data visualization
Genome Browsers, IGV,…

Date analysis workflow

HiSeq 2000 200G run
Image data: 32TB
Intensity Data: 2TB
Base call/quality score data: 250GB
Alignment output: 6TB(3TB), 1.2TB after intermediate files removed
Lecture 2——Basics of data processing_第2张图片
Major steps for secondary analysis
Raw data —— QC Filter —— Alignment —— Annotation

Sequence quality
Base quality
For every nucleotide
Reported by the sequencer
Mapping quality (alignment quality)
For every read
Reported by the aligners
Consensus quality (variant call quality)
For every genomic locus
Reported by the variant callers
Quality scores
Phred scores
Published in 1998
Initially developed for human genome project
Widely used to characterize the quality of DNA sequence
Q= -10log10§
Q = 10; P = 0.1; acc = 90%
Q = 20; P = 0.01; acc = 99%

Sequence alignment
A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity
Helps in inferring functional, structural, or evolutionary relationships between the sequences
Goal: find out the best matching sequences
Global vs Local
Lecture 2——Basics of data processing_第3张图片
Alignment theories
Scoring matrix, or penalty scheme
Protein: PAM and BLOSUM
DNA/RNA
Match = 1
Mismatch = 0
Gap
d = 3 (gap opening)
e = 0.1 (gap extension)
Lecture 2——Basics of data processing_第4张图片
Global alignment
must account for all characters of each sequence
Needleman-Wunsch algorithm
Local alignment
accounts for only a continuous portion of each sequence
Smith-Waterman algorithm
Searching can start/end anywhere

Fast alignment for short reads
Short reads aligner
Major challenge: Going through 1 trillion times (reads) dynamic programming is not practical
Strategy: making a dictionary (index)
Problem: Making a 50-nt index is too huge: 450 = 1.3*1030

Things to consider
Features: Short and massive amounts
Cost
Speed, Resources required (memory)
Alignment quality
Gaps allowed?
Information considered
Base sequence quality considered?
Accuracy

Short reads aligner strategies
Three common strategies
Hash table
Seed-extend paradigm
Lecture 2——Basics of data processing_第5张图片
Space allowance
Suffix/Prefix tree
Suffix array
Burrows-Wheeler transformation
==Merge sorting ==(not commonly used)

Hash table - based algorithm
Algorithms
Hashing reads
Eland, MAQ, Mosaik…
Hashing reference genome
BFAST, Mosaik, SOAP
Hash table - space allowance
Perfect match is straightforward, but not useful to identify genetic variants
Solution: using multiple indices that allow mismatches
Lecture 2——Basics of data processing_第6张图片
More than one way to build mask
Allow 1-nt mismatch m: seed length;
w: weight (number of counted nt)
k: number of allowed mismatches
n: number of indices
Lecture 2——Basics of data processing_第7张图片
What’s the best mask design?
The seed weight w too small—— too many false positives that slow down the mapping process
The seed weight w too higher—— more seeds needed to achieve full sensitivity —— more memory
Optimal mask design: Lin et al. Bioinformatics (2008): ZOOM! Zillions of oligos mapped
mismatch k = 2
Lecture 2——Basics of data processing_第8张图片
Suffix/prefix tree
Problems of hash table based strategy:
Alignment to multiple identical copies of a substring in the reference mast be performed for each copy.
当基因组中重复序列过多时,使得align的速度变慢
Suffix/prefix tree (trie) can handle this well
fast query, O(n), where n is the length of the query sequence.
Nodes: suffix array interval of …
Substring form node to root

Burrows-Wheeler Transformation (BWA, bowtie use)
Developed in 1994
It is an algorithm developed for data compression, such as bzip2
在这里插入图片描述
All the letters remain the same value
Order changed
If the original string had several substrings that occurred often, the transformed string will have several places where a single character is repeated multiple times in a row
Easier for compression
Lecture 2——Basics of data processing_第9张图片
Burrows M, Wheeler DJ A block sorting lossless data compression algorithm. 1994
https://baike.baidu.com/item/Burrows-Wheeler%E5%8F%98%E6%8D%A2/752836?fr=aladdin

Two questions
Why is it reversible?
T = acaacg$
BWT(T) = gc$aaac
In any language that has a sort() function, usually, you cannot find an unsort()
How can this strange transformation help me for sequence alignment?
Lecture 2——Basics of data processing_第10张图片
Query using BWT (FM index)
To match Q in T using BWT(T), repeatedly apply rule:
top = LF(top, qc); bot = LF(bot, qc)
Where qc is the next character in Q(right-to-left) and LF(i,qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc

Checkpointing in FM Index
Solution: per-calculate cumulative counts for A/C/G/T up to periodic checkpoints in BWT
Rows to Reference Positions
Bowtie marks every 32nd row by default(configurable)

FM Index is Small
Entire FM Index on DNA reference consists of:
BWT(same size as T)
Checkpoints(~15% size of T)
SA sample(~50% size of T)
Total: ~1.65x the size of T

Alignment tools that uses BWT
BOWTIE
Langmead et al., Genome Biology, 2009
Super efficient, super fast
Biggest problem: Do not allow gaps
Used by many other popular downstream tools, like TOPHAT
BWA
Li H and Durbin R. (2009) Bioinformatics
Allow gaps
Two versions
BWA-short: shorter than 200bp
BWA-SW: longer sequences, up to 100kb
Sequencing error is considered

Evaluations for alignment
Straight Smith-Waterman scores
BFAST
A phred-scale mapping quality calculation
Used in MAQ
Modified in BOWTIE and BWA
A posterior probability p
z: read
x: reference sequence
u: mismatch position
Base quality score of
A: 20 (p=0.01) B: 10 (p=0.1)
The probability that this read is “perfect match” is both A and B are not sequencing error
p=0.01*0.1=0.001
Every other read will have very low p
So, if mapping is unique, p ~= 1
Q(u|x,z) = -10log10[1-p(u|x,z)]
如果mismatch多于2个,输出结果不好
Lecture 2——Basics of data processing_第11张图片
Accuracy:
Sensitivity(recall);
Specificify
Selectivity
Other factors may be included in the evaluation, but sometime should be evaluated independently.
Number of mismatches allowed
Number of gaps allowed
Unique mapping
Comparison among short read aligners
David M. et al. SHRiMP2, Bioinformatics, 2001
Most commonly used aligners: BFAST, BWA, Bowtie vs. SHRiMP2
Lecture 2——Basics of data processing_第12张图片

File format

Files

FASTQ
Raw sequences
Format variations:
.qseq (Illumina)
.csfasta + .qual (SOLiD)
SAM/BAM
Sequence alignment
.export —— all alignments, not sorted, including not aligned reads (Illumina)
.sorted —— unique aligned reads (Illumina)
VCF
Variants
.pg —— personal genome SNP (UCSC GB)
Lecture 2——Basics of data processing_第13张图片
Lecture 2——Basics of data processing_第14张图片
SAM/BAM
Li et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009
Reference document: http://samtools.sourceforge.net/SAM1.pdf
BAM is a binary (indexable, more compact) representation of SAM
Two components
Lecture 2——Basics of data processing_第15张图片
The headers
@HD: File information
format version, sorting order of the alignment
@SQ: Reference sequence information
Reference name, length, assembly identifier, species,…
@RG: Reads group information
RG ID, sequencing center, date, flow order, library, …
@PG: Program
Name and command for alignment program,…
@CO: Additional text comments
Lecture 2——Basics of data processing_第16张图片
Lecture 2——Basics of data processing_第17张图片
1-base coordinate system
SAM, GFF, Wiggle files
First base of a seq is 1
A region is a closed interval
Region between 3nd and 7th nucleotide: [3,7]
0-base coordinate system
BAM, BED, PSL files
First base of a seq is 0
A region is a half closed half open interval
Region between 3nd and 7th nucleotide: [2,7]
Col 6—— CIGAR operators
M: match/mismatch
I: insertion
D: deletion (8M2I4M1D3M)
H: hard-clipped( unaligned, can only be first and/or last operation)
S: soft-clipped (can have H operations between them and the ends of the CIGAR string)
P: padding
N: skip
Len = M + I + S(H)
Col 9—— inferred fragment size(signed observed template length)
Only for pair-end ( 0 for single-end)
Number of bases from the leftmost mapped base to the rightmost mapped base
Leftmost read(+), rightmost read(-)
SAM/BAM——optional fields
Format
TAG:TYPE:VALUE(6 types: A,i, f, Z, H, B)
Example: NM:I:2
35 pre-defined fields in the v 1.4-r962 version
Allow users to define
Selected fields
Lecture 2——Basics of data processing_第18张图片
Lecture 2——Basics of data processing_第19张图片
BAM( and BAI-indexing)
Binary format of SAM, compressed in the BGZF format
Can be indexed to achieve fast retrieval of alignments overlapping a specified region without going through the whole alignments
must be sorted by the reference ID and then the leftmost coordinate before indexing.
Updated information: http://samtools.sourceforge.net/SAM1.pdf
Lecture 2——Basics of data processing_第20张图片

你可能感兴趣的:(bioinformatics)