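Bioinformatics Algorithms 4 - An algorithm for getting the overlap index and sequence

Basic algorithms for manipulating biological sequences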

Running the examples in Jupyter is recommended; the Python version used here is 3.9.

1. Implementing the algorithm that gets the overlap index and sequence

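# min_length: minimum number of overlapping bases (default 3)
def getOverlapIndexAndSequence(a, b, min_length=3):
    """ Find the longest suffix of 'a' that matches a prefix
        of 'b' and is at least 'min_length' characters long.
        Return (overlap length, overlap sequence); the overlap
        starts at index len(a) - length in 'a'. If no such
        overlap exists, return 0. """
    # Position in a from which to start searching
    start = 0
    while True:
        # Find the next occurrence in a of the first min_length bases (the prefix) of b
        start = a.find(b[:min_length], start)

        # No occurrence left, so there is no overlap: return 0
        if start == -1:
            return 0

        # If the whole suffix a[start:] is a prefix of b, it is the overlap:
        # return its length together with the overlap sequence
        if b.startswith(a[start:]):
            return len(a)-start, a[start:]

        # Otherwise shift right by one base and search again
        start += 1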
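
The function above returns a bare 0 when there is no overlap and a tuple otherwise, and the first element of that tuple is the overlap length rather than the start position in a. As a minimal sketch only, the variant below (the name overlap_start_and_sequence and the (-1, '') sentinel are assumptions made for illustration, not part of the original code) always returns a tuple and reports the start index explicitly:

```python
def overlap_start_and_sequence(a, b, min_length=3):
    """Hypothetical variant: always return (start, overlap).

    start is the index in a where the overlap begins, or -1
    (with an empty string) when no overlap is found.
    """
    start = 0
    while True:
        # Next occurrence in a of the first min_length bases of b
        start = a.find(b[:min_length], start)
        if start == -1:
            return -1, ''
        # The whole suffix a[start:] must be a prefix of b
        if b.startswith(a[start:]):
            return start, a[start:]
        start += 1


start, seq = overlap_start_and_sequence('GGTTACGT', 'CGTGTGC')
print(start, seq, len(seq))  # 5 CGT 3
```

In this example the start index (5) and the overlap length (3) differ, which makes the distinction visible; the overlap length is always len(a) - start.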

2. Testing the algorithm

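getOverlapIndexAndSequence('TTACGT', 'CGTGTGC')
# (3, 'CGT'): overlap length and overlap sequence; here the overlap also starts at index 3 of a (6 - 3 = 3)

getOverlapIndexAndSequence('TTACGT', 'GTGTGC')
# 0: the only shared suffix/prefix is 'GT', which is shorter than min_length=3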
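
One way to double-check results like these is to compare them against a slower but more direct approach. The sketch below is a different technique from the find-based search above: it simply tries every candidate overlap length, longest first. The function name longest_overlap_bruteforce is hypothetical, and the return convention (a tuple on success, 0 otherwise) is chosen here only to mirror the examples above.

```python
def longest_overlap_bruteforce(a, b, min_length=3):
    """Hypothetical cross-check: test each overlap length k, longest first."""
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        # Is the k-base prefix of b also the k-base suffix of a?
        if a.endswith(b[:k]):
            return k, b[:k]
    return 0


print(longest_overlap_bruteforce('TTACGT', 'CGTGTGC'))  # (3, 'CGT')
print(longest_overlap_bruteforce('TTACGT', 'GTGTGC'))   # 0
```

Both calls agree with the results shown above, which gives some confidence in the find-based version for these inputs.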
