Sequence alignment

Concept

  In computer science, a sequence need not be contiguous, whereas a string is contiguous. In biology, the former is called a gapped sequence and the latter is simply called a sequence.
 The sequence similarity problem arises in search engines, command-line tools, and genome analysis.
Paradigm one: similar sequences imply similar organisms. For example, how should microorganisms be classified?
Paradigm two: similar sequences imply similar structures and similar functions (proteins, DNA, RNA, and so on).
  The basic method for sequence alignment is dynamic programming.

Example

 Q: Given two sequences U and V, how do we measure their similarity?
Definition: an alignment of U and V inserts spaces (written "") into the sequences so that both have the same length n.
Note: aligning "" with "" (a column with a space in both rows) is forbidden.
Example:
U: cat, V: act

$$\begin{pmatrix} c & a & \text{""} & t \\ \text{""} & a & c & t \end{pmatrix}$$

2 matches, 2 insertions or deletions.
$$\begin{pmatrix} c & a & t \\ a & c & t \end{pmatrix}$$

1 match, 2 mismatches.
Q: How many alignments of U and V are there?
A: A lot!
Q: Which alignment is better?
A: It depends on the model (the scoring function)!
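
To put a number on "a lot", here is a minimal sketch that counts alignments. It assumes the rule stated above (a column with a space in both rows is forbidden), so the last column is one of three kinds: a character over a character, a character over a space, or a space over a character. The function name count_alignments is hypothetical and only serves this illustration.

```python
from functools import lru_cache


def count_alignments(m: int, n: int) -> int:
    """Count alignments of a length-m sequence with a length-n sequence.

    Assumption: a column with a space in both rows is not allowed.
    """

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        # If one sequence is empty, every remaining character of the other
        # must face a space, so exactly one alignment exists.
        if i == 0 or j == 0:
            return 1
        # The last column is char/char, char/space, or space/char.
        return count(i - 1, j - 1) + count(i - 1, j) + count(i, j - 1)

    return count(m, n)


print(count_alignments(3, 3))  # cat vs. act -> 63
```

Even for two sequences of length 3 there are already 63 alignments, and the count grows exponentially with the lengths, which is why enumerating them all is not practical.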

Alignment score

Example one:
Given scoring function

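$$w(x,y)=\begin{cases} 3, & \text{if } x=y \\ -1, & \text{if } x\neq y \\ -3, & \text{if } x=\text{""} \text{ or } y=\text{""} \end{cases}$$

$$s\begin{pmatrix} c & a & \text{""} & t \\ \text{""} & a & c & t \end{pmatrix}=3+3-3-3=0$$

$$s\begin{pmatrix} c & a & t \\ a & c & t \end{pmatrix}=3-1-1=1 \quad \text{better!}$$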
Example two:

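$$w(x,y)=\begin{cases} 3, & \text{if } x=y \\ -1, & \text{if } x\neq y \\ -2, & \text{if } x=\text{""} \text{ or } y=\text{""} \end{cases}$$

$$s\begin{pmatrix} c & a & \text{""} & t \\ \text{""} & a & c & t \end{pmatrix}=3+3-2-2=2 \quad \text{better!}$$

$$s\begin{pmatrix} c & a & t \\ a & c & t \end{pmatrix}=3-1-1=1$$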
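
The two examples can be checked mechanically. The sketch below is one possible way to do it, assuming an alignment is stored as two equal-length lists in which the empty string "" stands for a space. The helper names make_w and score are hypothetical, not part of the original notes.

```python
GAP = ""  # the space symbol, written "" in the text


def make_w(match: int, mismatch: int, gap: int):
    """Build a column scoring function w(x, y) from three constants."""

    def w(x: str, y: str) -> int:
        if x == GAP or y == GAP:
            return gap
        return match if x == y else mismatch

    return w


def score(top, bottom, w) -> int:
    """Return s(alignment): the sum of w over all columns."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == GAP and y == GAP:
            raise ValueError("a column with two spaces is forbidden")
        total += w(x, y)
    return total


gapped = (["c", "a", GAP, "t"], [GAP, "a", "c", "t"])
ungapped = (list("cat"), list("act"))

w1 = make_w(3, -1, -3)  # example one
w2 = make_w(3, -1, -2)  # example two

print(score(*gapped, w1), score(*ungapped, w1))  # 0 1
print(score(*gapped, w2), score(*ungapped, w2))  # 2 1
```

Changing only the gap penalty from -3 to -2 flips which of the two alignments scores higher, which is the point of the two examples.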

Optimal Global Alignment

 Definition: given U, V, and w(), find the optimal global alignment, i.e. the alignment with the maximum score.
 S(U,V): the score of the optimal alignment of U and V.
 s(alignment): the score of a given alignment.
S(U,V) = s(T), where T is an optimal alignment.
 Key observation: the structure of optimal solutions.
(1) Let T be an optimal alignment of cat and act.
s(T) = S(cat,act)
What do we know about the last column of T?

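$$\begin{pmatrix} a \\ \text{""} \end{pmatrix} \quad \begin{pmatrix} a \\ t \end{pmatrix} \quad \begin{pmatrix} t \\ t \end{pmatrix} \quad \begin{pmatrix} t \\ \text{""} \end{pmatrix} \quad \begin{pmatrix} \text{""} \\ t \end{pmatrix}$$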

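Obviously, the first and second columns are impossible: cat ends in t, so the top entry of the last column must be t or "".
If the last column is the third one, (t,t), then T = T1 followed by (t,t), where T1 is an alignment of ca and ac. Is s(T1) = S(ca,ac)? Yes!
Proof: cut and paste. If some alignment of ca and ac scored higher than T1, pasting it in place of T1 would give an alignment of cat and act that scores higher than T, contradicting the optimality of T.
If the last column is the fourth one, (t,""), then T = T2 followed by (t,""), where T2 is an alignment of ca and act. Is s(T2) = S(ca,act)? Yes!
Proof: cut and paste, as before.
The remaining column, ("",t), is handled the same way: the rest of T must be an optimal alignment of cat and ac.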

In summary:
$$S(cat,act)=\max\begin{cases} S(ca,ac)+w(t,t) \\ S(ca,act)+w(t,\text{""}) \\ S(cat,ac)+w(\text{""},t) \end{cases}$$
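
The recurrence translates almost directly into a memoized recursive function. The following is a sketch only: it assumes the scoring function of example one (3, -1, -3), and the name optimal_score is hypothetical. S(i, j) stands for S(U[:i], V[:j]), the optimal score of the first i characters of U against the first j characters of V.

```python
from functools import lru_cache

GAP = ""  # the space symbol, written "" in the text


def w(x: str, y: str) -> int:
    """Scoring function of example one (assumed here for illustration)."""
    if x == GAP or y == GAP:
        return -3
    return 3 if x == y else -1


def optimal_score(u: str, v: str, w) -> int:
    """Return S(u, v) by applying the recurrence top-down with memoization."""

    @lru_cache(maxsize=None)
    def S(i: int, j: int) -> int:
        if i == 0 and j == 0:
            return 0  # two empty prefixes: the empty alignment
        candidates = []
        if i > 0 and j > 0:  # last column is (u[i-1], v[j-1])
            candidates.append(S(i - 1, j - 1) + w(u[i - 1], v[j - 1]))
        if i > 0:  # last column is (u[i-1], "")
            candidates.append(S(i - 1, j) + w(u[i - 1], GAP))
        if j > 0:  # last column is ("", v[j-1])
            candidates.append(S(i, j - 1) + w(GAP, v[j - 1]))
        return max(candidates)

    return S(len(u), len(v))


print(optimal_score("cat", "act", w))  # 1
```

Memoization matters here: without it the same subproblems, such as S(ca,ac), would be recomputed many times.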

V \ U | ""         | c         | ca         | cat
""    | S("","")   | S(c,"")   | S(ca,"")   | S(cat,"")
a     | S("",a)    | S(c,a)    | S(ca,a)    | S(cat,a)
ac    | S("",ac)   | S(c,ac)   | S(ca,ac)   | S(cat,ac)
act   | S("",act)  | S(c,act)  | S(ca,act)  | S(cat,act)

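Note: the computation does not need a decision tree; this table is enough. Each value depends only on the cell diagonally above and to the left, the cell to the left, and the cell above, so the table is filled in row by row.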
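
As an illustration of filling the table row by row, here is a minimal sketch. It assumes the scoring function of example one (3, -1, -3) and the layout of the table above: rows are prefixes of V, columns are prefixes of U. The name build_table is hypothetical.

```python
GAP = ""  # the space symbol, written "" in the text


def w(x: str, y: str) -> int:
    """Scoring function of example one (assumed here for illustration)."""
    if x == GAP or y == GAP:
        return -3
    return 3 if x == y else -1


def build_table(u: str, v: str, w):
    """Return table with table[j][i] = S(u[:i], v[:j])."""
    rows, cols = len(v) + 1, len(u) + 1
    table = [[0] * cols for _ in range(rows)]
    # First row: a prefix of u aligned against nothing but spaces.
    for i in range(1, cols):
        table[0][i] = table[0][i - 1] + w(u[i - 1], GAP)
    # First column: a prefix of v aligned against nothing but spaces.
    for j in range(1, rows):
        table[j][0] = table[j - 1][0] + w(GAP, v[j - 1])
    # Each remaining cell looks at its diagonal, left, and upper neighbours.
    for j in range(1, rows):
        for i in range(1, cols):
            table[j][i] = max(
                table[j - 1][i - 1] + w(u[i - 1], v[j - 1]),  # diagonal
                table[j][i - 1] + w(u[i - 1], GAP),           # left
                table[j - 1][i] + w(GAP, v[j - 1]),           # above
            )
    return table


for row in build_table("cat", "act", w):
    print(row)
# [0, -3, -6, -9]
# [-3, -1, 0, -3]
# [-6, 0, -2, -1]
# [-9, -3, -1, 1]
```

The bottom-right cell holds S(cat,act), which is 1 under this scoring function.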

Algorithm

  1. Define the scoring function s() (about 60% of the workload)
  2. Recursive function
  3. Boundaries
  4. Dynamic programming
  5. Time and space complexity
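
One way to read the last step: for sequences of lengths m and n, filling the table takes O(mn) time, and because each row depends only on the row above it, the score alone can be computed in O(m) space by keeping two rows. The sketch below illustrates that idea under stated assumptions; make_w and optimal_score_two_rows are hypothetical names, and the scoring constants are those of the two examples. It returns only the score, not the alignment itself.

```python
GAP = ""  # the space symbol, written "" in the text


def make_w(match: int, mismatch: int, gap: int):
    """Build a column scoring function w(x, y) from three constants."""

    def w(x: str, y: str) -> int:
        if x == GAP or y == GAP:
            return gap
        return match if x == y else mismatch

    return w


def optimal_score_two_rows(u: str, v: str, w) -> int:
    """Return S(u, v) in O(len(u) * len(v)) time and O(len(u)) space."""
    # Boundary row: prefixes of u aligned against spaces only.
    prev = [0] * (len(u) + 1)
    for i in range(1, len(u) + 1):
        prev[i] = prev[i - 1] + w(u[i - 1], GAP)
    # One pass per character of v; only the previous row is kept.
    for y in v:
        curr = [prev[0] + w(GAP, y)]  # boundary column
        for i in range(1, len(u) + 1):
            curr.append(max(
                prev[i - 1] + w(u[i - 1], y),    # diagonal
                curr[i - 1] + w(u[i - 1], GAP),  # left
                prev[i] + w(GAP, y),             # above
            ))
        prev = curr
    return prev[-1]


print(optimal_score_two_rows("cat", "act", make_w(3, -1, -3)))  # 1
print(optimal_score_two_rows("cat", "act", make_w(3, -1, -2)))  # 2
```

Recovering the optimal alignment, not just its score, would require keeping the full table (or additional bookkeeping) so that the choices can be traced back from the bottom-right cell.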
