I. Overview of the Dynamic Programming (DP) Algorithm
Dynamic programming (DP) is well suited to optimization problems that can be solved in stages: the problem is divided into subproblems, the optimal result of each subproblem is computed, and these results are combined into an optimal result for the original problem. DP is not the same as divide and conquer (DAC): in DAC the subproblems are independent of each other, whereas in DP later subproblems depend on the results of earlier ones. The first important step in DP is to identify the structure of an optimal decision sequence; the second is to construct a recurrence relation over it. DP can be implemented top-down or bottom-up. A plain top-down recursion solves the same subproblems many times, so handling overlapping subproblems (for example, by storing each result the first time it is computed) is essential. Classic examples include the knapsack problem and matrix-chain multiplication.
STEP 1: Analyse the structure of the optimal solution.
STEP 2: Construct the recurrence relation.
STEP 3: Compute the optimal value.
STEP 4: Construct the optimal solution.
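The four steps can be illustrated with the 0/1 knapsack problem mentioned above. This is only a minimal sketch: the function name `knapsack` and the table name `best` are chosen here for illustration and do not come from the notes.

```python
from typing import List, Tuple


def knapsack(weights: List[int], values: List[int], capacity: int) -> Tuple[int, List[int]]:
    """Bottom-up 0/1 knapsack: returns (best value, indices of chosen items)."""
    n = len(weights)
    # Steps 1-2: best[i][c] is the best value using the first i items with capacity c.
    # Recurrence: best[i][c] = max(best[i-1][c], best[i-1][c-w] + v) when item i fits.
    best = [[0] * (capacity + 1) for _ in range(n + 1)]

    # Step 3: fill the table so that every entry only reads entries computed earlier.
    for i in range(1, n + 1):
        w, v = weights[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if w <= c:
                best[i][c] = max(best[i][c], best[i - 1][c - w] + v)

    # Step 4: trace back through the table to recover which items were chosen.
    chosen = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen
```

Because each table entry is computed once and then reused, the overlapping subproblems of the naive recursion are never recomputed.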
II. Sequence alignment algorithms
1. Analysis of individual sequences
Physico-chemical parameters or other biological features
2. Pairwise sequence comparison
Dot plots: a visual representation of the similarities between two sequences.
Here we need a scoring function: a match adds 2 points, a mismatch or gap subtracts 1 point. Fill the scores into the diagram to obtain the representation.
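A dot plot and the scoring rule above can be sketched as follows. The text rendering ('*' for a match, '.' otherwise) and the function names are assumptions made for this example; the ungapped scorer only covers the match/mismatch part of the rule, since gaps need an alignment.

```python
from typing import List


def dot_plot(seq_a: str, seq_b: str) -> List[str]:
    """Return one text row per character of seq_a; '*' marks a match with seq_b."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


def ungapped_score(seq_a: str, seq_b: str, match: int = 2, mismatch: int = -1) -> int:
    """Score two equal-length sequences position by position (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    for row in dot_plot("GAT", "GCAT"):
        print(row)
```

Runs of '*' along a diagonal of the plot correspond to stretches where the two sequences agree.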
DP (described above):
Build a scoring matrix using the recurrence relation
Scan the matrix diagonally using a traceback procedure to recover the alignment
Needleman-Wunsch algorithm: global alignment over the full length of both sequences
Smith-Waterman algorithm: local alignment; ignores badly aligning regions
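The two DP alignment algorithms can be sketched with the scores from these notes (+2 match, -1 mismatch, -1 gap). This is a minimal sketch with a linear gap penalty, not a reference implementation; the function names are chosen here for illustration.

```python
from typing import Tuple

MATCH, MISMATCH, GAP = 2, -1, -1  # scores taken from the notes above


def _sub(x: str, y: str) -> int:
    return MATCH if x == y else MISMATCH


def needleman_wunsch(a: str, b: str) -> Tuple[int, str, str]:
    """Global alignment: returns (score, aligned a, aligned b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + _sub(a[i - 1], b[j - 1]),  # diagonal
                score[i - 1][j] + GAP,                           # gap in b
                score[i][j - 1] + GAP,                           # gap in a
            )

    # Traceback from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + _sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def smith_waterman_score(a: str, b: str) -> int:
    """Local alignment score: cells are floored at 0 and the best cell anywhere wins."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + _sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + GAP,
                score[i][j - 1] + GAP,
            )
            best = max(best, score[i][j])
    return best
```

The only structural difference is the floor at 0 in the local version, which lets an alignment restart after a poorly matching region instead of carrying its penalty forward.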
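3. Multiple sequence alignment: Clustal
a. Sum-of-pairs (SP) function: the alignment with the optimal SP value
Finding the alignment with the optimal SP value is an NP-hard problem. With a model that scores the differences between sequences, the usual ways of tackling it are approximation algorithms, heuristic algorithms, and bringing in additional information.
Approximation algorithms: since no polynomial-time algorithm for the optimal solution is known, find a suboptimal solution instead, then prove a bound on the gap between the optimal and the suboptimal solution.
Heuristic algorithms: since the whole search space cannot be traversed in polynomial time, use an algorithm that covers as much of the space as possible.
Other algorithms
A* algorithm (an artificial intelligence search algorithm)
Information theory
Hidden Markov models
Ant colony optimization
b. Consensus function
c. Tree function (tree alignments)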
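1. Find conserved subregions in the given biological sequences.
2. Infer the evolutionary history of populations from related biological feature sequences.
Research question: given the several existing ways of evaluating a multiple sequence alignment, how to compute the optimal alignment.
Approximation algorithms and randomized algorithms (high speed, high reliability) have been proposed; this paper proposes an approximation algorithm with a guaranteed error bound.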
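The sum-of-pairs (SP) value of a given alignment is simple to compute, even though finding the alignment that optimizes it is hard. The sketch below assumes the same +2/-1/-1 scores used for pairwise alignment and assumes that a gap aligned with a gap scores 0; neither choice is specified in these notes.

```python
from itertools import combinations
from typing import List


def pair_score(x: str, y: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Score one pair of symbols taken from the same alignment column."""
    if x == "-" and y == "-":
        return 0  # assumption: a gap aligned with a gap costs nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment: List[str]) -> int:
    """Sum the pair scores over every column and every pair of rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            total += pair_score(x, y)
    return total
```

For k rows of length n this takes O(n k^2) time; the difficulty lies in searching over all possible alignments, not in scoring one.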
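NP stands for non-deterministic polynomial time. "Non-deterministic" means that the problem can be solved in polynomial time by a non-deterministic machine. Informally, an NP problem is one whose solutions can be "easily checked", where "easily checked" means that a polynomial-time verification algorithm exists. Correspondingly, if every problem in NP is Turing-reducible in polynomial time to some problem, that problem is NP-hard.
Polynomial time: in computational complexity theory, a problem is solvable in polynomial time when its running time m(n) is bounded above by a polynomial in the problem size n. Every abstract machine has an associated complexity class consisting of the problems that machine can solve in polynomial time.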
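To make "easily checked" concrete, here is a polynomial-time verifier for subset sum, a standard NP-complete problem that is used here only as an illustration and is not part of these notes. Checking a proposed answer takes linear time, although no polynomial-time algorithm for finding one is known.

```python
from typing import List


def verify_subset_sum(numbers: List[int], target: int, certificate: List[int]) -> bool:
    """Check a claimed solution; certificate holds distinct indices into numbers."""
    if len(set(certificate)) != len(certificate):
        return False  # an index may be used only once
    if any(i < 0 or i >= len(numbers) for i in certificate):
        return False  # every index must point into the list
    return sum(numbers[i] for i in certificate) == target
```

For example, `verify_subset_sum([3, 34, 4, 12, 5, 2], 9, [2, 4])` accepts because 4 + 5 = 9.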
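Motifs: conserved sequences in biological macromolecules that form the basic structure of characteristic sequences.
Goal: find motifs and their corresponding binding sites in order to study the basic expression process.
Two older methods: linear (string) and matrix representations. Drawback: both assume that the nucleotides within a binding site are independent of each other.
More complex representations, for example hidden Markov models and regular expressions. Drawback: impractical (they need too many parameters and require many binding sites to be known in advance).
New idea: the SPSP representation. This method still has NP complexity, so a new, simpler model called the DPS representation is proposed.
The model is simple and can still find the dependency pattern set (DPS-Finder).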
DPS-Finder: takes a few minutes to discover a length-10 motif from 20 DNA sequences of length 600.
The DPS representation generalizes the string and matrix representations; it can model dependencies between adjacent nucleotides with far fewer parameters than HMMs or regular expressions.
1. DPS representation:
The string format will include some fatal sequences, so:
SPSP representation: a pattern P and a scoring function S
(the S value of unknown motifs is difficult to determine)
Omit the function S; the pattern P alone is enough
A DPS representation P contains a list of pattern sets P_i, 1 ≤ i ≤ L. At most two of them are wildcard pattern sets, each containing 2 to k patterns P_{i,j} of length l_i over the symbols 'A', 'C', 'G' and 'T', where l_i ≤ l_max and the Hamming distance between any two of these patterns is at most d_max. Each of the other pattern sets P_i contains exactly one pattern P_{i,1} of length l_i, and Σ_i l_i = l. A length-l string σ = σ_1 σ_2 … σ_L with |σ_i| = l_i is considered a binding site of P if σ_i ∈ P_i for all 1 ≤ i ≤ L.
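The definition above can be turned into a small membership check. This is a sketch under assumptions: a DPS is modelled as a list of Python sets of strings, the names `is_valid_dps` and `is_binding_site` are made up for this example, and the bound l_max is applied only to the wildcard pattern sets, which is one reading of the definition.

```python
from itertools import combinations
from typing import List, Set


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


def is_valid_dps(pattern_sets: List[Set[str]], k: int, l_max: int, d_max: int) -> bool:
    """Check the structural constraints of the DPS definition."""
    wildcard_sets = 0
    for patterns in pattern_sets:
        if not patterns:
            return False
        lengths = {len(p) for p in patterns}
        if len(lengths) != 1 or any(set(p) - set("ACGT") for p in patterns):
            return False
        if len(patterns) > 1:  # a wildcard pattern set
            wildcard_sets += 1
            if len(patterns) > k or lengths.pop() > l_max:
                return False
            if any(hamming(x, y) > d_max for x, y in combinations(patterns, 2)):
                return False
    return wildcard_sets <= 2


def is_binding_site(pattern_sets: List[Set[str]], site: str) -> bool:
    """True if site splits into segments sigma_i with sigma_i in P_i for every i."""
    pos = 0
    for patterns in pattern_sets:
        length = len(next(iter(patterns)))
        if site[pos:pos + length] not in patterns:
            return False
        pos += length
    return pos == len(site)
```

With the hypothetical motif `[{"AC"}, {"GT", "GA"}, {"TT"}]`, the string `"ACGATT"` is a binding site while `"ACGCTT"` is not. The wildcard set admits exactly the listed pair of dinucleotides, which is how the representation captures dependency between adjacent positions.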
2. Scoring function and problem definition
We calculate the probability (p-value) that P has b or more binding sites in T by chance, based on a background model.
This gives a scoring function value(P).
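One simple way to obtain such a p-value is a binomial tail. The sketch below is not the model from the source: it assumes a uniform, independent background (each nucleotide has probability 1/4) and treats the n candidate positions in T as independent trials, both of which are simplifications introduced here. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb
from typing import List, Set


def match_probability(pattern_sets: List[Set[str]]) -> float:
    """Chance that a random string matches the DPS under a uniform background."""
    q = 1.0
    for patterns in pattern_sets:
        length = len(next(iter(patterns)))
        q *= len(patterns) / 4 ** length  # |P_i| accepted strings out of 4^l_i
    return q


def binomial_tail(n: int, q: float, b: int) -> float:
    """P(X >= b) for X ~ Binomial(n, q): b or more chance matches in n positions."""
    if b <= 0:
        return 1.0
    return sum(comb(n, x) * q ** x * (1 - q) ** (n - x) for x in range(b, n + 1))
```

A smaller tail probability means that the observed number of binding sites is less likely to be a chance effect, so a motif with a lower p-value ranks higher.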
3. DPS-Finder algorithm
An l-factor tree
Discover optimal motifs in the tree using a branch-and-bound approach
Refine a candidate motif P into a motif P' in DPS representation with the minimum p-value(P') using depth-first search
4. HMM
III. Next-generation sequencing: sequencing by synthesis
1. Solexa sequencing (Illumina Genome Analyzer):
DNA sample preparation:
Attach DNA to the surface:
Bridge amplification:
Double-stranded fragments:
Denature the double-stranded molecules:
Complete amplification:
Determine and image the first base, then repeat the cycle:
Align data: