Exact match
Naive exact match
Introduction: exact matching; slow, since every alignment of the pattern against the text is checked.
Input: pattern p and text t, i.e., the pattern string and the string to search in.
Output: the indices of every position where p occurs in t.
Algorithm: two loops. The outer loop slides the alignment across t; the inner loop compares character by character. A match flag records whether every character matched; if so, the alignment's index is appended.
def naive(p, t):
    occurrences = []
    for i in range(len(t) - len(p) + 1):  # loop over alignments
        match = True
        for j in range(len(p)):           # loop over characters
            if t[i+j] != p[j]:            # compare characters
                match = False
                break
        if match:
            occurrences.append(i)         # all chars matched; record
    return occurrences
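For example (made-up strings):
# 'AG' occurs at offsets 0 and 5 of 'AGCTTAG'
print(naive('AG', 'AGCTTAG'))  # [0, 5]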
Boyer-Moore
Introduction: exact matching that skips alignments which cannot possibly match, so it is faster than the naive algorithm.
Input: p, t, and p_bm, a preprocessing object built from p.
Output: the positions where p matches in t.
Algorithm: core idea: maintain a shift; at each alignment take the largest skip the rules allow, and if the alignment matches, append(i).
Details: the outer while loop is used because the number of iterations is not known in advance (so a for loop does not fit). The inner for loop walks the characters of p from right to left; on a mismatch, the bad_character and good_suffix rules each propose a skip, the maximum of these becomes the shift, and we break. On a full match, append the position, take the match_skip shift, and keep scanning, since there may be more occurrences.
def boyer_moore(p, p_bm, t):
    """ Do Boyer-Moore matching. p=pattern, t=text,
        p_bm=BoyerMoore object for p """
    i = 0
    occurrences = []
    while i < len(t) - len(p) + 1:
        shift = 1
        mismatched = False
        for j in range(len(p)-1, -1, -1):  # compare right to left
            if p[j] != t[i+j]:
                skip_bc = p_bm.bad_character_rule(j, t[i+j])
                skip_gs = p_bm.good_suffix_rule(j)
                shift = max(shift, skip_bc, skip_gs)
                mismatched = True
                break
        if not mismatched:
            occurrences.append(i)
            skip_gs = p_bm.match_skip()
            shift = max(shift, skip_gs)
        i += shift
    return occurrences
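The BoyerMoore preprocessing object is not defined in these notes (the course supplies it). A minimal sketch, assuming a fixed DNA alphabet and implementing only the bad character rule; good_suffix_rule and match_skip conservatively return 0, which keeps the matcher correct but gives smaller skips:
class SimpleBoyerMoore:
    """ Hypothetical stand-in for the course's BoyerMoore object.
        Usage: boyer_moore(p, SimpleBoyerMoore(p), t).
        Assumes every character of t is in the alphabet. """
    def __init__(self, p, alphabet='ACGT'):
        self.amap = {c: idx for idx, c in enumerate(alphabet)}
        # bad_char[j][c] = rightmost offset of c in p[:j], or -1 if absent
        self.bad_char = []
        occ = [-1] * len(alphabet)
        for j, c in enumerate(p):
            self.bad_char.append(occ[:])
            occ[self.amap[c]] = j
    def bad_character_rule(self, j, c):
        # shift so text char c lines up with its rightmost occurrence in
        # p to the left of j; if c is absent there, skip past j entirely
        return j - self.bad_char[j][self.amap[c]]
    def good_suffix_rule(self, j):
        return 0  # not implemented in this sketch
    def match_skip(self):
        return 0  # not implemented in this sketch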
Hash Algorithm
Introduction: an exact-matching algorithm that preprocesses the text; suited to matching against a large text t, at the cost of a large index.
Input: p, t, k, where k is the k-mer length used to build the Index (the "hash table").
Output: the match positions.
Algorithm: core: first index t by its k-mers. Because the matching is exact, the query only looks up the first k-mer of p. The Python implementation stores a sorted list of (k-mer, offset) pairs and finds hits with binary search; querying for (kmer, -1) looks odd at first, but since every real offset is >= 0 it guarantees the search lands just before the first entry for that k-mer. Each hit is then verified by comparing the remaining characters of p against t.
import bisect

class Index(object):
    """ Holds a substring index for a text T """
    def __init__(self, t, k):
        """ Create index from all substrings of t of length k """
        self.k = k  # k-mer length (k)
        self.index = []
        for i in range(len(t) - k + 1):       # for each k-mer
            self.index.append((t[i:i+k], i))  # add (k-mer, offset) pair
        self.index.sort()                     # alphabetize by k-mer

    def query(self, p):
        """ Return index hits for first k-mer of p """
        kmer = p[:self.k]  # query with first k-mer
        i = bisect.bisect_left(self.index, (kmer, -1))  # binary search
        hits = []
        while i < len(self.index):  # collect matching index entries
            if self.index[i][0] != kmer:
                break
            hits.append(self.index[i][1])
            i += 1
        return hits
def query_index(p, t, index):
    """ Find all occurrences of p in t using the index, verifying
        the characters after the first k-mer. """
    k = index.k
    offsets = []
    for i in index.query(p):
        if p[k:] == t[i+k:i+len(p)]:  # verify rest of p matches
            offsets.append(i)
    return offsets
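A small usage sketch (made-up strings):
t = 'GCTACGATCTAGAATCTA'
index = Index(t, 4)
print(query_index('TCTA', t, index))  # [7, 14]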
Reduce the size of the hash table.
Increase specificity; reduce the number of collisions (index hits that fail verification).
Approximate match
Pigeonhole principle: if P matches T with up to k mismatches allowed, then splitting P into k+1 segments guarantees that at least one segment matches T exactly (with k=2 mismatches, at most 2 of the 3 segments can contain a mismatch, so one must be mismatch-free).
If |X| = |Y|, then editDistance(X, Y) <= hammingDistance(X, Y).
If |X| != |Y|, then editDistance(X, Y) >= ||X| - |Y||.
Hamming distance: the number of positions at which two equal-length strings differ.
Edit distance: the minimum number of substitutions, insertions, and deletions turning one string into the other.
Hash with pigeonhole principle
Introduction: approximate matching built on the pigeonhole principle: split P into n+1 segments, at least one of which must match T exactly; find that segment's positions, then check that the remaining characters have no more than the allowed number of mismatches.
Input: p, t, n, where n is the number of mismatches allowed.
Output: the match positions, plus the total number of index hits.
Algorithm: split p into n+1 consecutive segments of (roughly) equal length; only that length is needed, so there is no need to construct n+1 explicit pieces that sum to len(p). Build the index on t with the segment length as k. Query each of the n+1 segments against the index, then for each hit: if the implied start of p would fall before the beginning of t, or its end would run past len(t), skip it (continue); otherwise compare the remaining characters of p against t one by one, counting mismatches, and record the alignment if the count stays within n.
def approximate_match(p, t, n):
    segment_length = int(round(len(p) / (n + 1)))
    all_matches = set()
    p_idx = Index(t, segment_length)  # index the text
    idx_hits = 0
    for i in range(n + 1):
        start = i * segment_length
        end = min((i + 1) * segment_length, len(p))
        matches = p_idx.query(p[start:end])
        # Extend matching segments to see if whole p matches
        for m in matches:
            idx_hits += 1
            if m < start or m - start + len(p) > len(t):
                continue  # p would run off either end of t
            mismatches = 0
            for j in range(0, start):       # verify before the segment
                if p[j] != t[m - start + j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            for j in range(end, len(p)):    # verify after the segment
                if p[j] != t[m - start + j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            if mismatches <= n:
                all_matches.add(m - start)
    return list(all_matches), idx_hits
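A usage sketch (made-up strings, allowing one mismatch):
p = 'ACGTGA'
t = 'TTACGAGATTACGTGA'
matches, hits = approximate_match(p, t, 1)
print(sorted(matches))  # [2, 10]: offset 10 is exact, offset 2 has one mismatch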
def editDistance(a, b):
    # base cases: distance to the empty string is the other's length
    if len(a) == 0:
        return len(b)
    if len(b) == 0:
        return len(a)
    # substitution cost for the final characters
    if a[-1] == b[-1]:
        delta = 0
    else:
        delta = 1
    return min(editDistance(a[:-1], b[:-1]) + delta,  # substitute / match
               editDistance(a, b[:-1]) + 1,           # insert into a
               editDistance(a[:-1], b) + 1)           # delete from a
Problem: the recursion is far too slow, mainly because it recomputes the same subproblems over and over.
Dynamic programming
For any pair of prefixes of X and Y, the edit distance is calculated only once; this is why it beats the plain recursion, though it clearly consumes more memory.
def editDistance(x, y):
    # create distance matrix
    D = []
    for i in range(len(x)+1):
        D.append([0] * (len(y)+1))
    # initialize first row and column of matrix
    for i in range(len(x)+1):
        D[i][0] = i
    for i in range(len(y)+1):
        D[0][i] = i
    # fill in the rest of the matrix
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            disHor = D[i][j-1] + 1
            disVer = D[i-1][j] + 1
            if x[i-1] == y[j-1]:
                disDiag = D[i-1][j-1]
            else:
                disDiag = D[i-1][j-1] + 1
            D[i][j] = min(disHor, disVer, disDiag)
    return D[-1][-1]
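A quick sanity check (the classic example):
print(editDistance('kitten', 'sitting'))  # 3: substitute k->s, e->i, insert g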
Approximate match
Introduction: to match P against T approximately, initialize the first row of the matrix to 0, because P may begin its match anywhere in T, so every starting offset costs nothing; then take the smallest value in the last row and trace back from there to recover the alignment.
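A sketch of this variant, following the editDistance code above; row 0 is all zeros and the answer is the minimum of the last row:
def approximate_edit_distance(p, t):
    # D[i][j] = best edit distance of p[:i] vs a substring of t ending at j
    D = [[0] * (len(t)+1) for _ in range(len(p)+1)]
    for i in range(1, len(p)+1):
        D[i][0] = i  # first column: i deletions from p
    for i in range(1, len(p)+1):
        for j in range(1, len(t)+1):
            disHor = D[i][j-1] + 1
            disVer = D[i-1][j] + 1
            disDiag = D[i-1][j-1] + (0 if p[i-1] == t[j-1] else 1)
            D[i][j] = min(disHor, disVer, disDiag)
    return min(D[-1])  # best edit distance of p anywhere within t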
Local alignment and global alignment
Assembly
Overlap graph
Introduction: build a directed graph whose nodes are reads, with an edge from read a to read b whenever a suffix of a matches a prefix of b by at least some minimum length.
Input: a set of reads, plus a minimum overlap length k.
Output: the overlap graph.
Algorithm: core: slide over each read to generate all of its k-mers, then build a table with each k-mer as key and the set of reads containing it as value (a key-value dict). Only reads that share a read's length-k suffix can overlap it by at least k, so only those pairs need to be passed to overlap().
def overlap(a, b, min_length=3):
    """ Return length of longest suffix of 'a' matching
        a prefix of 'b' that is at least 'min_length'
        characters long. If no such overlap exists,
        return 0. """
    start = 0  # start all the way at the left
    while True:
        start = a.find(b[:min_length], start)  # look for b's prefix in a
        if start == -1:  # no more occurrences to right
            return 0
        # found occurrence; check for full suffix/prefix match
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1  # move just past previous match
from collections import defaultdict

def overlap_all_pairs(reads, k):
    # map each k-mer to the set of reads containing it
    index = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            index[read[i:i+k]].add(read)
    # make graph: only reads containing r's length-k suffix can follow r
    graph = defaultdict(set)
    for r in reads:
        for o in index[r[-k:]]:
            if r != o:
                if overlap(r, o, k):
                    graph[r].add(o)
    edges = 0
    for read in graph:
        edges += len(graph[read])
    return (edges, len(graph))
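A small check (made-up reads, each overlapping the next by 3):
reads = ['ABCDEFG', 'EFGHIJ', 'HIJABC']
print(overlap_all_pairs(reads, 3))  # (3, 3): 3 edges, 3 nodes with outgoing edges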
Shortest common superstring
Introduction: try every ordering of the input strings and keep the shortest superstring found; correct, but factorial in the number of strings, so far too slow for real data.
Input: a list of strings ss.
Output: a shortest string containing every string in ss as a substring.
Algorithm: for each permutation of ss, chain the strings left to right, appending only the non-overlapping part of each next string, and keep the shortest result.
import itertools

def scs(ss):
    shortest_sup = None
    for ssperm in itertools.permutations(ss):
        sup = ssperm[0]
        for i in range(len(ssperm) - 1):
            # overlap adjacent strings in this permutation
            olen = overlap(ssperm[i], ssperm[i+1], min_length=1)
            sup += ssperm[i+1][olen:]  # append non-overlapping part
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup  # keep shortest superstring so far
    return shortest_sup
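For example:
print(scs(['ABC', 'BCA', 'CAB']))  # 'ABCAB', length 5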
Greedy shortest common superstring
Introduction: not guaranteed to find the optimal superstring, but fast, and the result is usually not far from optimal.
Input: a list of reads and a minimum overlap length k.
Output: a common superstring (not necessarily the shortest).
Algorithm: repeatedly pick the pair of reads with the largest overlap and merge them into one; when no pair overlaps by at least k, concatenate whatever remains.
def pick_maximal_overlap(reads, k):
    """ Return the pair of reads with the longest suffix/prefix
        overlap of length >= k, along with that overlap length. """
    read_a = None
    read_b = None
    best_olen = 0
    for a, b in itertools.permutations(reads, 2):
        olen = overlap(a, b, min_length=k)
        if olen > best_olen:
            read_a, read_b = a, b
            best_olen = olen
    return read_a, read_b, best_olen

def greedy_scs(reads, k):
    read_a, read_b, olen = pick_maximal_overlap(reads, k)
    while olen > 0:
        reads.remove(read_a)
        reads.remove(read_b)
        reads.append(read_a + read_b[olen:])  # merge the best pair
        read_a, read_b, olen = pick_maximal_overlap(reads, k)
    return ''.join(reads)
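An example where greedy misses the optimum:
print(greedy_scs(['ABCD', 'CDBC', 'BCDA'], 1))  # 'CDBCABCDA', length 9
print(scs(['ABCD', 'CDBC', 'BCDA']))            # 'ABCDBCDA', length 8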
De Bruijn graph
Introduction: each k-mer of the string contributes one edge, from its left (k-1)-mer to its right (k-1)-mer; the nodes are the distinct (k-1)-mers. See the example after the code.
def de_bruijn_ize(st, k):
    """ Return nodes and edges of the De Bruijn graph for string st:
        one edge per k-mer, one node per distinct (k-1)-mer. """
    edges = []
    nodes = set()
    for i in range(len(st) - k + 1):
        edges.append((st[i:i+k-1], st[i+1:i+k]))  # left (k-1)-mer -> right (k-1)-mer
        nodes.add(st[i:i+k-1])
        nodes.add(st[i+1:i+k])
    return nodes, edges
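For example:
nodes, edges = de_bruijn_ize('ACGCGTCG', 3)
print(nodes)  # {'AC', 'CG', 'GC', 'GT', 'TC'} (set order may vary)
print(edges)  # [('AC','CG'), ('CG','GC'), ('GC','CG'), ('CG','GT'), ('GT','TC'), ('TC','CG')]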