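Knuth-Morris-Pratt, or KMP for short, is an improvement on naive string matching. It matches any pattern against any text in linear time and never degrades to the quadratic worst case.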
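Given two strings, strings and sub_s, the task is to decide whether strings contains sub_s and to return the position where it occurs. The brute-force algorithm works as follows: compare strings[0] with sub_s[0]; if they are equal, compare the next pair of characters. When a mismatch occurs, everything learned from the characters already matched is discarded, and matching restarts by comparing strings[1] with sub_s[0]. This repeats until the main string is exhausted or a full match is found. Because the same characters are compared over and over, this approach is inefficient: its worst-case time complexity is O(nm), where n is the length of strings and m is the length of sub_s.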
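The brute-force procedure described above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation; the function name `brute_force_search` and the convention of returning a 0-based index (or -1 when there is no match) are assumptions made for this example.

```python
def brute_force_search(strings: str, sub_s: str) -> int:
    """Return the 0-based index of the first occurrence of sub_s, or -1."""
    n, m = len(strings), len(sub_s)
    for start in range(n - m + 1):
        j = 0
        # Compare character by character from this starting position.
        while j < m and strings[start + j] == sub_s[j]:
            j += 1
        if j == m:
            return start
        # Mismatch: the characters matched so far are forgotten and the
        # next starting position is tried from scratch.
    return -1


print(brute_force_search("BABCBABCABCAABCABCABCACAB", "ABCABCACAB"))  # 15
```

The nested loops are where the O(nm) worst case comes from: for each of up to n starting positions, up to m characters may be compared.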
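Compared with brute-force matching, KMP introduces a partial match table, from which the number of positions to shift the pattern can be read off. The following example illustrates this: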
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
| sub_s | A | B | C | A | B | C | A | C | A | B |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
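1. First compare the first character of strings with the first character of sub_s. If they do not match, move sub_s one position to the right.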
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
| sub_s |  | A | B | C | A | B | C | A | C | A | B |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
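2. Now A matches A, so the following characters are compared in turn, and a mismatch appears at position 5. The most natural reaction is to move sub_s right by one position (as shown in the table below) and compare again from sub_s[0] against strings[2], that is, from position 3. But then positions that have already been compared must be compared again, which greatly reduces efficiency.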
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
| sub_s |  |  | A | B | C | A | B | C | A | C | A | B |  |  |  |  |  |  |  |  |  |  |  |  |  |
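3. When B and A fail to match at position 5, we already know that the three preceding characters are "ABC". From this information the number of positions to shift sub_s can be computed, so positions that have already been compared need not be compared again, which improves efficiency. Shift = number of matched characters - partial match value of the last matched character.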
4. The partial match table for sub_s in this example is shown below:
| j | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| sub_s[j] | A | B | C | A | B | C | A | C | A | B |
| Partial match value | 0 | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 | 2 |
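5. When B and A fail to match at position 5, the three preceding characters "ABC" have matched. The last matched character, C, has a partial match value of 0, so by the formula shift = matched characters - partial match value, the shift is 3 (3 - 0 = 3).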
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
| sub_s |  |  |  |  | A | B | C | A | B | C | A | C | A | B |  |  |  |  |  |  |  |  |  |  |  |
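The shift rule used in the step above can be written as a one-line function. The sketch below is only an illustration: the name `shift_after_mismatch` is hypothetical, and the partial match values of "ABCABCACAB" are typed in by hand (the 4 in the seventh slot reflects that "ABCA" is both a prefix and a suffix of "ABCABCA").

```python
from typing import Sequence


def shift_after_mismatch(matched: int, table: Sequence[int]) -> int:
    """Shift = matched characters - partial match value of the last matched one."""
    if matched == 0:
        return 1  # nothing matched yet: just move one position to the right
    return matched - table[matched - 1]


# Partial match values of "ABCABCACAB", written out by hand for this example.
table = [0, 0, 0, 1, 2, 3, 4, 0, 1, 2]

print(shift_after_mismatch(3, table))  # 3, because "ABC" matched and C has value 0
```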
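6. B does not match A, so sub_s moves right by one position, giving the alignment shown below. Comparing character by character from position 6, the seven characters "ABCABCA" match, and then A at position 13 fails against C. The last matched character, the seventh A, has a partial match value of 4 (because "ABCA" is both a prefix and a suffix of "ABCABCA"), so sub_s moves right by 3 positions (7 - 4 = 3).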
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
| sub_s |  |  |  |  |  | A | B | C | A | B | C | A | C | A | B |  |  |  |  |  |  |  |  |  |  |
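7. After the shift of 3, sub_s starts at position 9 with "ABCA" already aligned. A at position 13 fails against B, so sub_s shifts by 3 again (4 - 1 = 3) to position 12; there A at position 13 fails against B once more, and sub_s shifts by 1 (1 - 0 = 1) to position 13, as shown below. From position 13, seven characters match before B at position 20 fails against C, giving a shift of 3 (7 - 4 = 3) to position 16, where all ten characters match. In general the comparison continues in this way up to the last character of strings: if a complete match is found the search succeeds, otherwise it fails.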
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| strings | B | A | B | C | B | A | B | C | A | B | C | A | A | B | C | A | B | C | A | B | C | A | C | A | B |
| sub_s |  |  |  |  |  |  |  |  |  |  |  |  | A | B | C | A | B | C | A | C | A | B |  |  |  |
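The whole walk-through can be reproduced with a short Python sketch. It is written in the same "shift the pattern" style as the steps above rather than in the more common two-pointer form; the names `build_table` and `kmp_search`, and the extra list of alignments returned for illustration, are assumptions of this example rather than part of any standard API.

```python
def build_table(pattern: str) -> list:
    """Partial match table: table[i] is the value for pattern[:i + 1]."""
    table = [0] * len(pattern)
    k = 0  # length of the current longest prefix that is also a suffix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to the next shorter candidate
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(strings: str, sub_s: str):
    """Return (0-based match index or -1, 1-based alignments that were tried)."""
    n, m = len(strings), len(sub_s)
    if m == 0:
        return 0, []
    table = build_table(sub_s)
    alignments = []
    start, matched = 0, 0
    while start + m <= n:
        alignments.append(start + 1)  # 1-based, as in the tables above
        while matched < m and strings[start + matched] == sub_s[matched]:
            matched += 1
        if matched == m:
            return start, alignments
        if matched == 0:
            start += 1  # nothing matched: move one position
        else:
            keep = table[matched - 1]
            start += matched - keep  # shift = matched - partial match value
            matched = keep  # these characters are known to match already
    return -1, alignments


print(kmp_search("BABCBABCABCAABCABCABCACAB", "ABCABCACAB"))
# (15, [1, 2, 5, 6, 9, 12, 13, 16])
```

The returned index 15 is 0-based, which corresponds to position 16 in the 1-based tables. Each character of strings is compared at most a constant number of times on average, which is where the linear running time comes from.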
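8. Computing the partial match values
A "prefix" is any leading part of a string that excludes its last character; a "suffix" is any trailing part that excludes its first character. The partial match value is the length of the longest element common to the set of prefixes and the set of suffixes. Take "ABCABCACAB" as an example: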
| sub_s[1..j] | Prefixes | Suffixes | Longest common element: length |
|---|---|---|---|
| A | [ ] | [ ] | [ ]: 0 |
| AB | [A] | [B] | [ ]: 0 |
| ABC | [A, AB] | [BC, C] | [ ]: 0 |
| ABCA | [A, AB, ABC] | [BCA, CA, A] | [A]: 1 |
| ABCAB | [A, AB, ABC, ABCA] | [BCAB, CAB, AB, B] | [AB]: 2 |
| ABCABC | [A, AB, ABC, ABCA, ABCAB] | [BCABC, CABC, ABC, BC, C] | [ABC]: 3 |
| ABCABCA | [A, AB, ABC, ABCA, ABCAB, ABCABC] | [BCABCA, CABCA, ABCA, BCA, CA, A] | [ABCA]: 4 |
| ABCABCAC | [A, AB, ABC, ABCA, ABCAB, ABCABC, ABCABCA] | [BCABCAC, CABCAC, ABCAC, BCAC, CAC, AC, C] | [ ]: 0 |
| ABCABCACA | [A, AB, ABC, ABCA, ABCAB, ABCABC, ABCABCA, ABCABCAC] | [BCABCACA, CABCACA, ABCACA, BCACA, CACA, ACA, CA, A] | [A]: 1 |
| ABCABCACAB | [A, AB, ABC, ABCA, ABCAB, ABCABC, ABCABCA, ABCABCAC, ABCABCACA] | [BCABCACAB, CABCACAB, ABCACAB, BCACAB, CACAB, ACAB, CAB, AB, B] | [AB]: 2 |
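The definition above translates almost word for word into Python using sets. The sketch below follows the definition literally, so it is meant for checking values by hand rather than for speed; the function name `partial_match_value` is an assumption of this example.

```python
def partial_match_value(s: str) -> int:
    """Length of the longest string that is both a prefix and a suffix of s."""
    # Prefixes exclude the last character; suffixes exclude the first one.
    prefixes = {s[:k] for k in range(1, len(s))}
    suffixes = {s[k:] for k in range(1, len(s))}
    common = prefixes & suffixes
    return max((len(c) for c in common), default=0)


sub_s = "ABCABCACAB"
table = [partial_match_value(sub_s[:j]) for j in range(1, len(sub_s) + 1)]
print(table)  # [0, 0, 0, 1, 2, 3, 4, 0, 1, 2]
```

Building every prefix and suffix explicitly costs far more than linear time, so practical implementations usually derive each value from the previous ones instead; the set-based version is still a convenient way to verify a table computed by hand.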