A Suffix Tree is a data-structure that allows many problems on strings (sequences of characters) to be solved quickly. Iftxt=t1t2...ti...tn
is a string, then Ti=titi+1...tn
is the suffix of txt
that starts at position i, e.g.
The suffix tree for `txt
' is a Trie-like or PATRICIA-like data structure that represents the suffixes of txt
.
A given suffix tree can be used to search for a substring, pat[1..m]
in O(m) time. There are n(n+1)/2 substrings in txt[1..n]
so it is rather surprising that a suffix tree can be built in O(n) time. Adding just one character to txt
increases the number of substrings by n+1, but they are not independent. Weiner (1973) gave the first algorithm and McCreight (1976) gave a more readable account for constructing the suffix tree while processing txt
from right to left. Only much later did Ukkonen (1992, 1995) give a left-to-right on-line algorithm, i.e., an algorithm that maintains a suffix tree for txt[1..i]
at each step as i is increased from 1 to n.
If the non-empty suffixes are sorted:
it becomes obvious that some of them (may) share common prefixes. Here there are substrings starting with `i', `m', `p' and `s', but all of those starting `is', in fact start `issi'. Two or more common prefixes share a common path from the root of the suffix tree (as in a PATRICIA tree). Now, a search (sub)string pat
must be a prefix of a suffix of txt
, if it occurs in txt
.
tree substrings tree-->|---mississippi m .. mississippi | |---i-->|---ssi-->|---ssippi i .. ississippi | | | | | |---ppi issip,issipp,issippi | | | |---ppi ip, ipp, ippi | |---s-->|---si-->|---ssippi s .. ssissippi | | | | | |---ppi ssip, ssipp, ssippi | | | |---i-->|---ssippi si .. sissippi | | | |---ppi sip, sipp, sippi | |---p-->|---pi p, pp, ppi | |---i p, pi |
--- Suffix Tree for "mississippi" --- |
Each edge (arc) of the suffix tree is labelled with a substring of txt
which is implemented by pointers to the start and end of the substring, e.g. `ssi' by <3,5>. One of the observation in Ukkonen's algorithm is that an edge, <i,n>, leading to a leaf can be implemented by <i,∞> where `∞', i.e., infinity, means `to the end of the string'.
Suffix Tree Demonstration
Change the Text txt=...
in the HTML FORM below, and click on `go'; experiment with different text strings:
tree
NB. If the string is "short", a simple sort routine is run first to sort the suffices the slow way for comparison with the tree; this is not done if the string is "long".
If the termination of txt
is important, this can be indicated by a special terminating character often denoted by `$' in papers on strings (~zero char in C/Unix).
Building a Suffix Tree, (a) Slowly
We show how a suffix tree might be built "by hand". Three dots, `...
', are used to show the current end of any suffix that will grow as more characters are processed. Starting with the empty suffix tree, consider the string `m':
tree 1 tree-->----m... |
Adding the second character to get `mi' there are now suffixes `mi' and `i':
tree 2 tree-->|---mi... | |---i... |
Next `mis'
tree 3 tree-->|---mis... | |---is... | |---s... |
There is no need to add any more splits for `miss' because `s' is part of `ss'.
tree 4 tree-->|---miss... | |---iss... | |---ss... |
However, with `missi' there must be a split because one `s' is followed by `i', the other by `s'
tree 5 tree-->|---missi... | |---issi... | |---s-->|---si... | |---i... |
The 6th character, `s', brings us to `missis' and no split because both `i's are followed by `s's.
tree 6 tree-->|---missis... | |---issis... | |---s-->|---sis... | |---is... |
`mississ'
tree 7 tree-->|---mississ... | |---ississ... | |---s-->|---siss... | |---iss... |
`mississi'
tree 8 tree-->|---mississi... | |---ississi... | |---s-->|---sissi... | |---issi... |
A lot suddenly happens for `mississip', because it brings the first `p', and causes the third `i' to be followed by `p' where the other two are followed by `ssi'. Consequently one of the `ssi' is followed by `p', the other by `ssip', ditto `si'.
tree 9 tree-->|---mississip... | |---i-->|---ssi-->|---ssip... | | | | | |---p... | | | |---p... | |---s-->|---si-->|---ssip... | | | | | |---p... | | | |---i-->|---ssip... | | | |---p... | |---p... |
By comparison `mississipp' is very quiet
tree 10 tree-->|---mississipp... | |---i-->|---ssi-->|---ssipp... | | | | | |---pp... | | | |---pp... | |---s-->|---si-->|---ssipp... | | | | | |---pp... | | | |---i-->|---ssipp... | | | |---pp... | |---pp... |
`mississippi' is an anti-climax
tree 11 tree-->|---mississippi | |---i-->|---ssi-->|---ssippi | | | | | |---ppi | | | |---ppi | |---s-->|---si-->|---ssippi | | | | | |---ppi | | | |---i-->|---ssippi | | | |---ppi | |---p-->|---pi | |---i |
and we are done. The challenge, to a computer scientist, is to make sure treei is updated to treei+1 efficiently. This can be done (Ukkonen 1992, 1995) so that treen can be built, starting from tree0, in O(n)-time overall.
(b) Faster
The following terminology is adapted from Ukkonen (1995).
- If `x' is a substring of txt then `x' represents the state (i.e., location) in the suffix-tree found by tracing out the characters of x from the root. Note that x might be part-way along an edge of the tree.
- A vertex (node) of the suffix-tree is called an explicit state.
- A substring x=txt[L..R] can be represented by (L,R).
- If `v' is a vertex of the suffix-tree, the pair `(v,x)', equivalently (v,(L,R)), represents the state (location) in the suffix-tree found by tracing out the characters of x from v.
- (v,x) is canonical if v is the last explit state on the path from v to (v,x). NB. (v,empty) is canonical.
- A special vertex called `bottom' is added and is denoted _|_.
When treei-1 is expanded into treei, character txt[i] must be dealt with. This is done during a traversal of the boundary path. Any state on the boundary path before sj is a leaf and could be extended by adding txt[i] to the incoming arc, but this can be done for free by representing arcs to leaves by (L,∞) where `∞' is `infinity'. So it it is only necessary to examine states from the active point sj and prior to the end-point sj' .
"[states from sj and before sj' create entirely new branches that start from states sh, j<=h<j'. ...
They are found along the boundary path of [treei-1] using reference pairs and suffix links." - Ukkonen (1995).
// almost JavaScript (try view-source) function upDate(s, k, i) // (s, (k, i-1)) is the canonical reference pair for the active point { var oldr = root; var (endPoint, r) = test_and_split(s, k, i-1, Txt.charAt(i)); while (!endPoint) { r.addTransition(i, infinity, new State()); if (oldr != root) oldr.sLink = r; // build suffix-link active-path oldr = r; var (s,k) = canonize(s.sLink, k, i-1) (endPoint, r) = test_and_split(s, k, i-1, Txt.charAt(i)) } if(oldr != root) oldr.sLink = s; return new pair(s, k); }//upDate |
Note that r.addTransition(...)
adds an edge from state r
, labelling the edge with a substring. New txt[i]-transitions must be "open" transitions of the form (L,∞).
Where necessary, test_and_split(...)
replaces edges s--->s1
with s--->r--->s1
for a new node r. This makes r=(s,(k,p))
explicit.
function test_and_split(s, k, p, t) { if(k<=p) { // find the t_k transition g'(s,(k',p'))=s' from s // k1 is k' p1 is p' in Ukkonen '95 var ((k1,p1), s1) = s[Txt.charAt(k)]; if (t == Txt.charAt(k1 + p - k + 1)) return new pair(true, s); else { var r = new State() s.addTransition(k1, k1+p-k, r); // s---->r---->s1 r.addTransition( k1+p-k+1, p1, s1); return new pair(false, r) } } else // k > p; ? is there a t-transition from s ? return new pair(s[t] != null, s); }//test_and_split |
Canonize(...)
takes (s,w)=(s,(k,p)) and steps over intermediate nodes by spelling out the characters of w=txt[k..p] for as far as possible.
function canonize(s, k, p) // s--->... { if(p < k) return new pair (s, k); // find the t_k transition g'(s,(k',p'))=s' from s // k1 is k', p1 is p' in Ukk' '95 var ((k1,p1), s1) = s[Txt.charAt(k)]; // s--(k1,p1)-->s1 while(p1-k1 <= p-k) // s--(k1,p1)-->s1--->... { k += p1 - k1 + 1; // remove |(k1,p1)| chars from front of (k,p) s = s1; if(k <= p) { ((k1,p1), s1) = s[Txt.charAt(k)]; // s--(k1,p1)-->s1 } } return new pair(s, k); }//canonize |
The main controlling routine repeatedly takes the next character, updates the sites on the active path and finds and canonizes the new active point:
function ukkonen95()// construct suffix tree for Txt[0..N-1] { var s, k, i; var bt; root = new State(); bt = new State(); // bt (bottom or _|_) // Want to create transitions for all possible chars // from bt to root for (i=0; i < Txt.length; i++) bt.addTransition(i,i, root); root.sLink = bt; s=root; k=0; // NB. k=0, unlike Ukkonen our strings are 0 based for(i=0; i < Txt.length; i++) { var (s,k) = upDate(s, k, i); // follow path from active-point (s,k) = canonize(s, k, i); } }//ukkonen95 |
It relies upon the fact (lemma 2 Ukkonen (1995)) that if (s,(k,i-1)) is a reference pair for the end-point, sj' , of treei-1 then (s,(k,i)) is a reference pair for the active point of treei.
Suffix Tree Applications
Suffix Trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology, and other application areas. Some examples are given below.
String Search
Searching for a substring, pat[1..m]
, in txt[1..n]
, can be solved in O(m) time (after the suffix tree for txt
has been built in O(n) time).
Longest Repeated Substring
Add a special ``end of string'' character, e.g. `$', to txt[1..n]
and build a suffix tree; the longest repeated substring oftxt[1..n]
is indicated by the deepest fork node in the suffix tree, where depth is measured by the number of characterstraversed from the root, i.e., `issi' in the case of `mississippi'. The longest repeated substring can be found in O(n) time using a suffix tree.
Longest Common Substring
The longest common substring of two strings, txt1
and txt2
, can be found by building a generalized suffix tree for txt1
andtxt2
: Each node is marked to indicate if it represents a suffix of txt1
or txt2
or both. The deepest node marked for both txt1
and txt2
represents the longest common substring.
Equivalently, one can build a (basic) suffix tree for the string txt1$txt2#
, where `$' is a special terminator for txt1
and `#' is a special terminator for txt2
. The longest common substring is indicated by the deepest fork node that has both `...$...' and `...#...' (no $) beneath it.
(Try it using the HTML FORM above.)
Note that the `longest common substring problem' is different to the `longest common subsequence problem' which is closely related to the `edit-distance problem': An instance of a subsequence can have gaps where it appears in txt1
and intxt2
, but an instance of a substring cannot have gaps.
Palindromes
A palindrome is a string, P, such that P=reverse(P). e.g. `abba'=reverse(`abba'). e.g. `ississi' is the longest palindrome in `mississippi'.
The longest palindrome of txt[1..n]
can be found in O(n) time, e.g. by building the suffix tree for txt$reverse(txt)#
or by building the generalized suffix tree for txt
and reverse(txt)
.
(Try it.)
From:
http://www.allisons.org/ll/AlgDS/Tree/Suffix/
http://decomplexify.blogspot.com/2014/07/suffix-tree_19.html