Text-editing programs frequently need to find all occurrences of a pattern in the text. Efficient algorithms for this problem, called string matching, can greatly aid the responsiveness of the text-editing program.
Among their many other applications, string-matching algorithms search for particular patterns in DNA sequences. Internet search engines also use them to find Web pages relevant to queries.
The string-matching problem is formalized as follows. We assume that the text is an array T[1..n] of length n and that the pattern is an array P[1..m] of length m ≤ n. We further assume that the elements of P and T are characters drawn from a finite alphabet Σ. For example, we may have Σ = {0, 1} or Σ = {a, b, ..., z}. The character arrays P and T are often called strings of characters.
The concatenation of two strings x and y, denoted xy, has length |x|+|y| and consists of the characters from x followed by the characters from y.
We say that a string w is a prefix of a string x, denoted w ⊏ x, if x = wy for some string y. Similarly, we say that a string w is a suffix of a string x, denoted w ⊐ x, if x = yw for some string y.
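The two relations can be checked directly in code. The following is a minimal sketch; the helper names `is_prefix` and `is_suffix` are ours and are not part of the source notation.

```python
def is_prefix(w: str, x: str) -> bool:
    """True when x = w + y for some string y, i.e. w ⊏ x."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """True when x = y + w for some string y, i.e. w ⊐ x."""
    return x.endswith(w)


# The empty string is both a prefix and a suffix of every string.
print(is_prefix("ab", "abcca"), is_suffix("cca", "abcca"), is_prefix("", "abc"))
```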
Algorithm | Preprocessing time | Matching time |
---|---|---|
Naive | 0 | O((n-m+1)m) |
Rabin-Karp | Θ(m) | O((n-m+1)m) |
Finite automaton | O(m|Σ|) | Θ(n) |
Knuth-Morris-Pratt | Θ(m) | Θ(n) |
The naive algorithm uses a loop that checks the condition P[1..m] = T[s+1..s+m] for each of the n-m+1 valid values of s, that is, for 0 ≤ s ≤ n-m.
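As a rough illustration, the loop might look like this in Python. Strings are 0-indexed here, so the slice `text[s:s + m]` plays the role of T[s+1..s+m]; the function name `naive_matcher` is our own choice.

```python
def naive_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Valid shifts are 0 <= s <= n - m; the range is empty when m > n.
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_matcher("acaabc", "aab"))
```

Each of the n-m+1 comparisons can cost up to m character checks, which is where the O((n-m+1)m) matching time comes from.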
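The Rabin-Karp method translates T into an array of numbers: the entry for each position is the remainder, after division by D, of the number formed by the five (or, in general, m) consecutive items starting at that position. Equal remainders are therefore a precondition for a match, and each position where the remainders agree still has to be verified character by character.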
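A minimal sketch of this idea follows, assuming characters are mapped to numbers with `ord`. The parameter names `radix` and `modulus` are ours, with `modulus` playing the role of D, and the default values are arbitrary illustrative choices.

```python
def rabin_karp_matcher(text: str, pattern: str,
                       radix: int = 256, modulus: int = 101) -> list[int]:
    """Return every shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []  # assumption: an empty pattern is treated as "no matches"
    high = pow(radix, m - 1, modulus)  # weight of the leading character
    p = t = 0
    for i in range(m):
        p = (radix * p + ord(pattern[i])) % modulus
        t = (radix * t + ord(text[i])) % modulus
    shifts = []
    for s in range(n - m + 1):
        # Equal remainders are only a precondition, so verify the window.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Roll the window: drop text[s], append text[s + m].
            t = (radix * (t - ord(text[s]) * high) + ord(text[s + m])) % modulus
    return shifts


print(rabin_karp_matcher("2359023141526739921", "31415", radix=10, modulus=13))
```

The explicit verification step is what keeps the result correct when two different windows happen to share a remainder.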
The finite-automaton method builds a finite automaton that scans the text string T for all occurrences of the pattern P.
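The sketch below shows one way to build and run such an automaton. State q means that the last q characters read match the first q characters of the pattern. The table is built by brute force, which is simpler but slower than the O(m|Σ|) preprocessing listed in the table above; the function names and the `alphabet` parameter are our assumptions, and the alphabet is assumed to contain every character of the pattern.

```python
def build_transition_table(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """delta[q][a] = length of the longest prefix of pattern that is a
    suffix of pattern[:q] + a (brute-force construction)."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            k = min(m, q + 1)
            while k > 0 and not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1
            row[a] = k
        delta.append(row)
    return delta


def automaton_matcher(text: str, pattern: str, alphabet: str) -> list[int]:
    """Scan text once, reporting the shift of every occurrence."""
    delta = build_transition_table(pattern, alphabet)
    m = len(pattern)
    q = 0
    shifts = []
    for i, c in enumerate(text):
        # A character outside the alphabet cannot extend any match.
        q = delta[q].get(c, 0)
        if q == m:
            shifts.append(i - m + 1)
    return shifts


print(automaton_matcher("abababacaba", "ababaca", "abc"))
```

Once the table exists, the scan does constant work per text character, which is the Θ(n) matching time.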
The KMP algorithm is a linear-time string-matching algorithm due to Knuth, Morris, and Pratt.
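The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. Writing P_k for the k-character prefix P[1..k] of the pattern, π[q] is the length of the longest prefix of P that is a proper suffix of P_q:
π[q] = max{k : k < q and P_k ⊐ P_q}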
example:
KMP-MATCHER(T, P)
    n = T.length
    m = P.length
    π = COMPUTE-PREFIX-FUNCTION(P)
    q = 0 // number of characters matched
    for i = 1 to n // scan the text from left to right
        while q > 0 and P[q+1] ≠ T[i]
            q = π[q] // next character does not match
        if P[q+1] == T[i]
            q = q + 1 // next character matches
        if q == m // is all of P matched?
            print "Pattern occurs with shift" i - m
            q = π[q] // look for the next match
COMPUTE-PREFIX-FUNCTION(P)
    m = P.length
    let π[1..m] be a new array
    π[1] = 0
    k = 0 // number of characters matched
    for q = 2 to m // scan the pattern from left to right
        while k > 0 and P[k+1] ≠ P[q]
            k = π[k] // next character does not match
        if P[k+1] == P[q]
            k = k + 1 // next character matches
        π[q] = k
    return π
The two procedures have much in common, because both match a string against the pattern P. KMP-MATCHER matches the text T against P, and the COMPUTE-PREFIX-FUNCTION matches P against itself.
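For readers who want something runnable, here is a sketch of both procedures in Python. It is a translation under one assumption worth stating: Python lists are 0-indexed, so `pi[q]` below holds the value the text calls π[q+1], and `pattern[k]` stands for P[k+1]. The function names are our own.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of pattern[:q + 1]
    that is also a suffix of it."""
    m = len(pattern)
    pi = [0] * m
    k = 0  # number of characters matched
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]  # next character does not match
        if pattern[k] == pattern[q]:
            k += 1  # next character matches
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []  # assumption: an empty pattern is treated as "no matches"
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0  # number of characters matched
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]
        if pattern[q] == text[i]:
            q += 1
        if q == m:  # all of the pattern matched
            shifts.append(i - m + 1)
            q = pi[q - 1]  # look for the next match
    return shifts


print(compute_prefix_function("ababaca"))
print(kmp_matcher("abababacaba", "ababaca"))
```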
The running-time analysis of COMPUTE-PREFIX-FUNCTION(P) is the same as that of KMP-MATCHER(T, P), with the pattern length m in place of the text length n.
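The only tricky part of COMPUTE-PREFIX-FUNCTION(P) is the while loop, which we bound with an aggregate argument on k, the number of characters currently matched.
k starts at 0 and increases by at most 1 in each of the m-1 iterations of the for loop, so its total increase is at most m-1. Each iteration of the while loop strictly decreases k, because π[k] < k, and k never becomes negative.
So the while loop makes at most m-1 iterations in total, and COMPUTE-PREFIX-FUNCTION(P) runs in time Θ(m).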
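The bound can be observed empirically by counting while-loop iterations. The following is an instrumented sketch, 0-indexed, with a counter of our own added purely for illustration.

```python
def prefix_function_with_count(pattern: str) -> tuple[list[int], int]:
    """Return the prefix function together with the total number of
    while-loop iterations spent computing it."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    while_iterations = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]  # each iteration strictly decreases k
            while_iterations += 1
        if pattern[k] == pattern[q]:
            k += 1  # k grows by at most 1 per for-loop iteration
        pi[q] = k
    return pi, while_iterations


for p in ["aaaab", "ababaca", "abcabcabd"]:
    pi, count = prefix_function_with_count(p)
    print(p, pi, count, "<=", len(p) - 1)
```

A single iteration of the for loop may run the while loop several times, as with the final `b` of `aaaab`, but the total over all iterations stays within m-1.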
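Correctness follows by induction on q. First, it is correct for q = 1, where k = 0 and π[1] = 0; we then suppose that it is correct up to q, with k = π[q].
Then π[q+1] = k+1 if P[k+1] == P[q+1]. Otherwise, π[q+1] = k'+1, where k' is the first value in the chain π[k], π[π[k]], ... for which P[k'+1] == P[q+1]; if the chain reaches 0 and P[1] ≠ P[q+1], then π[q+1] = 0.
So π[q+1] = max{k+1 : k < q and P_{k+1} ⊐ P_{q+1}}, or 0 when no such k exists, which agrees with the definition of π.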
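The chain π[q], π[π[q]], ... used in this argument can be inspected directly. The sketch below computes π straight from its definition (slowly, by brute force) and then follows the chain; it keeps the 1-indexed convention of the text by leaving index 0 of the list unused. Both function names are hypothetical.

```python
def prefix_function_by_definition(pattern: str) -> list[int]:
    """pi[q] for q = 1..m, computed directly as
    max{k : k < q and P_k is a suffix of P_q}; pi[0] is unused."""
    m = len(pattern)
    pi = [0] * (m + 1)
    for q in range(1, m + 1):
        # k = 0 always qualifies, so the max is taken over a non-empty set.
        pi[q] = max(k for k in range(q) if pattern[:q].endswith(pattern[:k]))
    return pi


def border_chain(pi: list[int], q: int) -> list[int]:
    """Follow pi[q], pi[pi[q]], ... until the chain reaches 0."""
    chain = [pi[q]]
    while chain[-1] > 0:
        chain.append(pi[chain[-1]])
    return chain


pattern = "ababaca"
pi = prefix_function_by_definition(pattern)
print(pi[1:])
print(border_chain(pi, 5))
```

For this pattern the chain from q = 5 lists exactly the lengths k < 5 for which P_k is a suffix of P_5, which is the set of candidates the inductive step searches through.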
Suppose to the contrary that KMP-MATCHER is incorrect.
Then some substring is mismatched, that is, a valid shift is missed or an invalid one is reported, and the error is due to the prefix function.
So π[q] = max{k : k < q and P_k ⊐ P_q} fails to hold for some q, which contradicts the above proof of COMPUTE-PREFIX-FUNCTION.
The assumption therefore does not hold, and KMP-MATCHER is correct.
Reference: Introduction to Algorithms, 3rd edition