DNA Sequence Algorithms

Exact match


Naive exact match

Introduction: exact matching; slow, because every alignment of the pattern against the text is checked.
Input: pattern p and text t, i.e., the pattern string and the string to search in.
Output: the indices of every position at which the pattern occurs.
Algorithm: two loops. The outer loop shifts the alignment along t; the inner loop compares characters one by one. A match flag records whether every character matched; if so, the offset is appended.

def naive(p, t):
    occurrences = []
    for i in range(len(t) - len(p) + 1):  # loop over alignments
        match = True
        for j in range(len(p)):  # loop over characters
            if t[i+j] != p[j]:  # compare characters
                match = False
                break
        if match:
            occurrences.append(i)  # all chars matched; record
    return occurrences
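
A quick check of the expected behavior (0-based offsets):

t = 'AGCTTAGATAGC'
p = 'AG'
print(naive(p, t))  # [0, 5, 9]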

Boyer-Moore

Introduction: exact matching that skips alignments which cannot possibly match, so it is faster than the naive approach.
Input: p, t, and p_bm, a preprocessed (BoyerMoore) representation of p.
Output: the offsets at which p matches t.
Algorithm: core idea: compute a shift at every step, choosing the largest allowed skip; whenever an alignment matches, append(i).
Details: the outer loop is a while loop because the number of iterations is not known in advance. The inner for loop walks the characters of p from right to left. On a mismatch, the bad-character and good-suffix rules each propose a skip; the maximum of the two (and 1) becomes the shift, and the inner loop breaks. On a full match the offset is appended and match_skip() supplies the shift, since more occurrences may follow.

def boyer_moore(p, p_bm, t):
    """ Do Boyer-Moore matching. p=pattern, t=text,
        p_bm=BoyerMoore object for p """
    i = 0
    occurrences = []
    while i < len(t) - len(p) + 1:
        shift = 1
        mismatched = False
        for j in range(len(p)-1, -1, -1):
            if p[j] != t[i+j]:
                skip_bc = p_bm.bad_character_rule(j, t[i+j])
                skip_gs = p_bm.good_suffix_rule(j)
                shift = max(shift, skip_bc, skip_gs)
                mismatched = True
                break
        if not mismatched:
            occurrences.append(i)
            skip_gs = p_bm.match_skip()
            shift = max(shift, skip_gs)
        i += shift
    return occurrences
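
Usage, as a sketch: it assumes a BoyerMoore preprocessing class (e.g., the course's bm_preproc module, not shown here) exposing bad_character_rule, good_suffix_rule, and match_skip:

p = 'TCTA'
t = 'GCTACGATCTAGAATCTA'
p_bm = BoyerMoore(p, alphabet='ACGT')  # assumed constructor signature
print(boyer_moore(p, p_bm, t))         # [7, 14]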


Hash Algorithm

Introduction: exact matching that preprocesses the text into an index; well suited when the text is large, at the cost of a large hash table.
Input: p, t, k, where k is the k-mer length used to build the hash table (the Index).
Output: the offsets of the matches.
Algorithm: first index the text as a Python list of (k-mer, offset) pairs, sorted so that binary search can be used. At query time only the first k-mer of p is looked up; since this is exact matching, any occurrence of p must begin with that k-mer. The -1 in bisect_left is a sentinel: real offsets are all >= 0, so (kmer, -1) sorts immediately before the first entry for that k-mer. Each hit is then verified by comparing the rest of p against the text.

import bisect

class Index(object):
    """ Holds a substring index for a text T """

    def __init__(self, t, k):
        """ Create index from all substrings of t of length k """
        self.k = k  # k-mer length (k)
        self.index = []
        for i in range(len(t) - k + 1):  # for each k-mer
            self.index.append((t[i:i+k], i))  # add (k-mer, offset) pair
        self.index.sort()  # alphabetize by k-mer

    def query(self, p):
        """ Return index hits for first k-mer of p """
        kmer = p[:self.k]  # query with first k-mer
        i = bisect.bisect_left(self.index, (kmer, -1))  # binary search
        hits = []
        while i < len(self.index):  # collect matching index entries
            if self.index[i][0] != kmer:
                break
            hits.append(self.index[i][1])
            i += 1
        return hits

def query_index(p, t, index):
    k = index.k
    offsets = []
    for i in index.query(p):
        if p[k:] == t[i+k: i+len(p)]:
            offsets.append(i)
    return offsets
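
A quick check, querying with 2-mers:

t = 'GCTACGATCTAGAATCTA'
p = 'TCTA'
index = Index(t, 2)
print(query_index(p, t, index))  # [7, 14]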

Reducing the size of the hash table

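One standard approach, sketched below under stated assumptions (SparseIndex is a hypothetical name, not from these notes): store only every n-th k-mer of t, shrinking the table roughly by a factor of n; a query must then try p's k-mer at offsets 0..n-1, because a true occurrence of p can start anywhere relative to the sampled positions (this assumes len(p) >= k + n - 1).

import bisect

class SparseIndex(object):
    """ Like Index, but stores only every n-th k-mer of t """
    def __init__(self, t, k, n):
        self.k, self.n = k, n
        self.index = []
        for i in range(0, len(t) - k + 1, n):  # sample every n-th offset
            self.index.append((t[i:i+k], i))
        self.index.sort()

    def query(self, p):
        """ Candidate start offsets of p (still need verification) """
        hits = set()
        for j in range(self.n):  # try p's k-mer at offsets 0..n-1
            kmer = p[j:j+self.k]
            i = bisect.bisect_left(self.index, (kmer, -1))
            while i < len(self.index) and self.index[i][0] == kmer:
                hits.add(self.index[i][1] - j)  # implied start of p
                i += 1
        return sorted(h for h in hits if h >= 0)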

Improving specificity, reducing the number of collisions

Approximate match


Pigeonhole principle: if P matches T with at most k mismatches, then splitting P into k+1 non-overlapping pieces guarantees that at least one piece matches T exactly, because the k mismatches can fall in at most k of the pieces.
if |X| = |Y|, then EditDistance(X, Y) <= HammingDistance(X, Y)
if |X| != |Y|, then EditDistance(X, Y) >= ||X| - |Y||


Hamming distance


Edit distance

Hashing with the pigeonhole principle

Introduction: approximate matching built on the pigeonhole principle: split P into n+1 pieces, so at least one piece must match exactly; find that piece's positions, then count mismatches over the rest of P, which must not exceed the allowed limit.
Input: p, t, n, where n is the number of mismatches allowed.
Output: the match offsets, plus the total number of index hits.
Algorithm: split p into n+1 segments. Only the segment length matters (roughly len(p)/(n+1); the pieces need not cover p exactly, they only need to be disjoint), so there is no need to generate n+1 random lengths summing to len(p). Build an Index of t with that segment length, query each of the n+1 segments, and for each hit check the implied alignment of the whole p: if the implied start m - start is negative or the alignment would run past the end of t, continue; otherwise compare the remaining characters of p one by one, counting mismatches.

def approximate_match(p, t, n):
    segment_length = int(round(len(p) / (n + 1)))
    all_matches = set()
    p_idx = Index(t, segment_length)  # index of the text t
    idx_hits = 0
    for i in range(n + 1):
        start = i * segment_length
        end = min((i + 1) * segment_length, len(p))
        matches = p_idx.query(p[start:end])
        # Extend matching segments to see if whole p matches
        for m in matches:
            idx_hits += 1
            if m < start or m - start + len(p) > len(t):
                continue
            mismatches = 0
            for j in range(0, start):
                if not p[j] == t[m - start + j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            for j in range(end, len(p)):
                if not p[j] == t[m - start + j]:
                    mismatches += 1
                    if mismatches > n:
                        break

            if mismatches <= n:
                all_matches.add(m - start)
    return list(all_matches), idx_hits
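
For example, allowing up to 2 mismatches:

p = 'AACTTG'
t = 'CACTTAATTTG'
print(approximate_match(p, t, 2))  # ([0, 5], 3) -- set order may vary
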
Recursive edit distance:

def editDistance(a, b):
    # base cases: distance to an empty string is the other's length
    if len(a) == 0:
        return len(b)
    if len(b) == 0:
        return len(a)
    # delta is 0 when the last characters match, 1 otherwise
    if a[-1] == b[-1]:
        delta = 0
    else:
        delta = 1
    return min(editDistance(a[:-1], b[:-1]) + delta,  # substitute/match
               editDistance(a, b[:-1]) + 1,           # insert
               editDistance(a[:-1], b) + 1)           # delete
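
A sanity check (fine on short strings, intractable on long ones):

print(editDistance('kitten', 'sitting'))  # 3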

Problem: the recursion is far too slow, mainly because it recomputes the same subproblems over and over.

Dynamic programming


For any pair of prefixes of X and Y, the edit distance is calculated only once; this is why it is faster than the plain recursion, though it clearly consumes more memory.

def editDistance(x, y):
    # create (len(x)+1) x (len(y)+1) distance matrix
    D = []
    for i in range(len(x)+1):
        D.append([0]*(len(y)+1))
    # initialize first row and column of matrix
    for i in range(len(x)+1):
        D[i][0] = i
    for i in range(len(y)+1):
        D[0][i] = i
    # fill in the rest of the matrix
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            disHor = D[i][j-1] + 1             # insertion
            disVer = D[i-1][j] + 1             # deletion
            if x[i-1] == y[j-1]:
                disDiag = D[i-1][j-1]          # match
            else:
                disDiag = D[i-1][j-1] + 1      # substitution
            D[i][j] = min(disHor, disVer, disDiag)

    return D[-1][-1]
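
The DP version handles longer strings easily:

print(editDistance('GCGTATGC', 'GCTATGC'))  # 1 (one deletion)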

Approximate match

Introduction: to find approximate occurrences of P in T, initialize the first row of the matrix to 0: P may begin its match at any position of T, so no starting offset is penalized. Fill the matrix as usual, then take the smallest value in the last row as the best match's edit distance and trace back from that cell to recover the alignment.

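A minimal sketch of this variant (approximate_edit_match is a hypothetical name; the traceback is omitted and only the best edit distance is returned):

def approximate_edit_match(p, t):
    # first row is all zeros: an occurrence of p may start anywhere in t
    D = [[0] * (len(t) + 1)]
    for i in range(1, len(p) + 1):
        D.append([i] + [0] * len(t))  # first column: i deletions of p
    for i in range(1, len(p) + 1):
        for j in range(1, len(t) + 1):
            delta = 0 if p[i-1] == t[j-1] else 1
            D[i][j] = min(D[i][j-1] + 1,         # insertion
                          D[i-1][j] + 1,         # deletion
                          D[i-1][j-1] + delta)   # match/substitution
    # the best occurrence ends where the last row is smallest
    return min(D[-1])

print(approximate_edit_match('ACT', 'AACTT'))  # 0: exact occurrence
print(approximate_edit_match('ACT', 'AAGTT'))  # 1: one substitution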

Local alignment and global alignment


Assembly


Overlap graph


Introduction: build a graph whose nodes are reads and whose directed edges connect each read to the reads whose prefix overlaps its suffix.
Input: a set of reads.
Output: the overlap graph (here reported as the number of edges and the number of nodes with outgoing edges).
Algorithm: slide a window over every read to generate its k-mers, then build a table whose keys are k-mers and whose values are the reads containing them (a key-value dict). For each read, only the reads that share its length-k suffix can overlap it by at least k characters, so far fewer overlap() calls are needed than with all pairs.

from collections import defaultdict

def overlap(a, b, min_length=3):
    """ Return length of longest suffix of 'a' matching
        a prefix of 'b' that is at least 'min_length'
        characters long.  If no such overlap exists,
        return 0. """
    start = 0  # start all the way at the left
    while True:
        start = a.find(b[:min_length], start)  # look for b's prefix in a
        if start == -1:  # no more occurrences to right
            return 0
        # found occurrence; check for full suffix/prefix match
        if b.startswith(a[start:]):
            return len(a)-start
        start += 1  # move just past previous match

def overlap_all_pairs(reads, k):
    index = defaultdict(set)  # map each k-mer to the reads containing it
    for read in reads:
        for i in range(len(read)-k+1):
            index[read[i:i+k]].add(read)

    # make graph
    graph = defaultdict(set)
    for r in reads:
        for o in index[r[-k:]]:
            if r != o:
                if overlap(r, o, k):
                    graph[r].add(o)

    edges = 0
    for read in graph:
        edges += len(graph[read])
    return edges, len(graph)
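
For example, with a minimum overlap of 3:

reads = ['ABCDEFG', 'EFGHIJ', 'HIJABC']
print(overlap_all_pairs(reads, 3))  # (3, 3): three edges among three nodes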

Shortest common superstring


Introduction: try every ordering of the input strings and keep the shortest superstring; correct, but far too slow (factorial in the number of strings).
Input: a list of strings ss.
Output: the shortest string that contains every string in ss as a substring.
Algorithm: for each permutation of ss, merge consecutive strings by their maximal overlap, and keep the shortest result seen.

import itertools

def scs(ss):
    shortest_sup = None
    for ssperm in itertools.permutations(ss):
        sup = ssperm[0]
        for i in range(len(ssperm)-1):
            olen = overlap(ssperm[i], ssperm[i+1], min_length=1)
            sup += ssperm[i+1][olen:]  # append non-overlapping part
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup  # found a shorter superstring
    return shortest_sup
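
For example:

print(scs(['ABC', 'BCA', 'CAB']))  # 'ABCAB' (length 5)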

Greedy shortest common superstring

Introduction: not guaranteed to find the optimal superstring, but fast, and usually not far from optimal.
Input: a list of reads and a minimum overlap length k.
Output: a (not necessarily shortest) common superstring.
Algorithm: repeatedly pick the pair of reads with the largest overlap and merge them into one read; when no pair overlaps by at least k, concatenate whatever remains.

def pick_maximal_overlap(reads, k):
    read_a = None
    read_b = None
    best_olen = 0
    for a, b in itertools.permutations(reads, 2):
        olen = overlap(a, b, min_length=k)
        if olen > best_olen:
            read_a, read_b = a, b
            best_olen = olen
    return read_a, read_b, best_olen

def greedy_scs(reads, k):
    read_a, read_b, olen = pick_maximal_overlap(reads, k)
    while olen > 0:
        reads.remove(read_a)
        reads.remove(read_b)
        reads.append(read_a + read_b[olen:])
        read_a, read_b, olen = pick_maximal_overlap(reads, k)

    return ''.join(reads)
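
On the same input, the greedy version also reaches a length-5 superstring:

print(greedy_scs(['ABC', 'BCA', 'CAB'], 2))  # 'CABCA'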

De Bruijn graph

def de_bruijn_ize(st, k):
    """ Return the node set and edge list of the De Bruijn
        graph built from the k-mers of st """
    edges = []
    nodes = set()
    for i in range(len(st)-k+1):
        # each k-mer contributes an edge from its left (k-1)-mer
        # to its right (k-1)-mer
        edges.append((st[i:i+k-1], st[i+1:i+k]))
        nodes.add(st[i:i+k-1])
        nodes.add(st[i+1:i+k])
    return nodes, edges
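
For example:

nodes, edges = de_bruijn_ize('ACGCGTCG', 3)
print(nodes)  # {'AC', 'CG', 'GC', 'GT', 'TC'} -- set order may vary
print(edges)  # [('AC', 'CG'), ('CG', 'GC'), ('GC', 'CG'),
              #  ('CG', 'GT'), ('GT', 'TC'), ('TC', 'CG')]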
