后缀树-SuffixTree概念

基本概念

关于 suffix(后缀),suffix tree(后缀树),generalised suffix tree(一般后缀树)以及 suffix link(后缀链接)等等,都可以在如下页面找到明确的定义,不在此一一赘述。

 
看两个例子先

因 suffix link 难于在字符图内表示,故略之。
 
apple: |--apple$--[0]
       |--e$--[4]
       |--le$--[3]
       |--p--|--le$--[2]
             |--ple$--[1]
 
banana: |--a--|--$--[5]
        |     |--na--|--$--[3]
        |     |      |--na$--[1]
        |--banana$--[0]
        |--na--|--$--[4]
        |      |--na$--[2]
 
后缀树有什么用
 
同样是上面那个链接,在 Functionality 一栏,可以看到详细的描述。因为当中的条目,是本系列篇什的行动指南,故录之于此。

A suffix tree for a string S of length n can be built in Θ(n) time, if the alphabet is constant or integer. Otherwise, the construction time depends on the implementation. The costs below are given under the assumption that the alphabet is constant. If it is not, the cost depends on the implementation (see below).
 
若字符集恒定或为整数,则一个长为 n 的字符串 S,其后缀树可于 Θ(n) 的时间内得以构建。否则,构造时间依实现而定。如下给出的开销数据即基于字符集为恒定的假设。若字符集不为恒定,则开销依实现而定 。(偶发现【翻译】是一件及其痛苦和费时的事,决意以后不再翻译!)
 
Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D = {S1,S2,...,SK} of total length n = | n1 | + | n2 | + ... + | nK | . You can:

  • Search for strings:
    • Check if a string P of length m is a substring in O(m) time.
    • Find all z occurrences of the patterns P1,...,Pq of total length m as substrings in O(m + z) time.
    • Search for a regular expression P in time expected sublinear on n.
    • Find for each suffix of a pattern P, the length of the longest match between a prefix of P[i...m] and a substring in D in Θ(m) time. This is termed the matching statistics for P.
  • Find properties of the strings:
    • Find the longest common substrings of the string Si and Sj in Θ(ni + nj) time.
    • Find all maximal pairs, maximal repeats or supermaximal repeats in Θ(n + z) time.
    • Find the Lempel-Ziv decomposition in Θ(n) time.
    • Find the longest repeated substrings in Θ(n) time.
    • Find the most frequently occurring substrings of a minimum length in Θ(n) time.
    • Find the shortest strings from Σ that do not occur in D, in O(n + z) time, if there are z such strings.
    • Find the shortest substrings occurring only once in Θ(n) time.
    • Find, for each i, the shortest substrings of Si not occurring elsewhere in D in Θ(n) time.

The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Θ(n) time. You can then also:

  • Find the longest common prefix between the suffixes Si[p..ni] and Sj[q..nj in Θ(1).
  • Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits.
  • Find all z maximal palindromes in Θ(n), or Θ(gn) time if gaps of length g are allowed, or Θ(kn) if k mismatches are allowed.
  • Find all z tandem repeats in O(nlogn + z), and k-mismatch tandem repeats in O(knlog(n / k) + z).
  • Find the longest substrings common to at least k strings in D for k = 2..K in Θ(n) time.
下一步
 
打算用 C++ 或 Python 来实现构建后缀树的 Ukkonen 方法。

你可能感兴趣的:(tree)