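I spent the last couple of days on the KMP algorithm. I won't repeat what it is here; look it up if it is new to you.
I have uploaded the code to GitHub, where you can download it: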
https://github.com/nemax/KMP-search-algorithm
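We usually compare KMP with the Brute-Force (BF) approach, so where exactly does KMP win? It wins with its overlay function. What is the overlay function? It is a function that computes information about the pattern string itself; the values it produces describe how the pattern overlaps with itself, hence the name. (I have also heard it called the next function, among other names.)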
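Convention: in the marker line under each diagram, x marks the mismatched character, ^ marks the position where this round's comparison starts (after the previous round ended), and * marks a position that is both the start and the mismatch.
A quick illustration: take the pattern abaabc and the main string abaabacdad. In the first round everything matches except the final c:
Main string: a b a a b a c d a d
Pattern:     a b a a b c
Markers:     ^         x
With BF we would start matching all over again. But think about it: failing only at the final c means everything before it matched. So in the main string the two characters just before the mismatch must be ab, and notice that the pattern itself begins with ab. Knowing this, we can skip those two comparisons and jump straight from the situation above to the one below, resuming the comparison at the third character of the pattern, a:
Main string: a b a a b a c d a d
Pattern:           a b a a b c
Markers:               ^ x
With BF, after the failure in the first round we would instead be here:
Main string: a b a a b a c d a d
Pattern:       a b a a b c
Markers:       *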
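See how big the difference is? Look at the starting positions: KMP never moves the main-string pointer backwards, whereas BF has to move it back as soon as a round ends. That is the difference between the two. KMP makes full use of the information in the pattern itself, so it avoids backtracking and avoids unnecessary comparisons. In KMP every character of the pattern has its own next value. As a first approximation: when a mismatch occurs at the pattern character with index i, the next comparison uses the pattern character with index next[i]. The function that computes the next values is the overlay function mentioned above.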
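To make the cost of backtracking concrete, here is a minimal brute-force sketch that also counts character comparisons. The name bf_search and the comparison counter are assumptions made for illustration; they are not part of KMP.h.

```c
#include <assert.h>
#include <string.h>

/* Brute-force search (hypothetical helper, not part of the article's KMP.h).
   Returns the index of the first occurrence of pattern in text, or -1.
   *comparisons receives the number of character comparisons performed. */
int bf_search(const char *text, const char *pattern, int *comparisons) {
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    assert(comparisons != NULL);
    *comparisons = 0;
    for (int start = 0; start + m <= n; start++) {
        int j = 0;
        /* After every failed round the text position falls back to start + 1. */
        while (j < m) {
            (*comparisons)++;
            if (text[start + j] != pattern[j]) break;
            j++;
        }
        if (j == m) return start;
    }
    return -1;
}
```

For the main string abaabacdad and the pattern abaabc this returns -1 after 14 comparisons; the second round alone re-reads a character that the first round had already matched.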
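I have seen roughly two ways of computing next values. One produces values that tell you directly which pattern position to compare after a mismatch; the other produces values that need one fixed extra calculation to get that position. The heart of KMP is the computation of these next values.
Walk through the next function once and you will find that the code is really simple. It is only hard until you get it.
It uses two pointers: k points at the character whose value is currently being computed; the other one, value, is hard to describe in words, but it becomes clear once you read the code.
Let's go over how the next values are computed, starting with the kind that needs the extra calculation. The code is: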
- void get_next_array_origin(char * pattern){
-     //Get the pattern length
-     int length = strlen(pattern);
-     int k = 1,value = 0;
-     //By convention next[0] is always -1
-     next[0] = -1;
-     while(k < length){
-         //Start from the next value of the previous character
-         value = next[k-1];
-         /*value >= 0 means an overlay ending at pattern[value] exists;
-           a mismatch at pattern[value+1] means it cannot be extended,
-           so fall back to the next shorter overlay.
-         */
-         while(value >= 0 && pattern[k] != pattern[value+1]){
-             value = next[value];
-         }
-         /*pattern[k] either extends the overlay (value >= 0) or
-           starts a new one of length 1 (value equals -1).
-         */
-         if(pattern[k] == pattern[value+1]){
-             next[k] = value+1;
-         }
-         //No overlay ends at pattern[k]
-         else{
-             next[k] = -1;
-         }
-         k++;
-         //printf("next[%d] = %d\n",k-1,next[k-1]);
-     }
- }
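Suppose the pattern is written a[0]a[1]..a[k]...a[j-k]...a[j]. If a[0] through a[k] matches a[j-k] through a[j] exactly, then next[j] = k (taking the largest such k smaller than j); we say an overlay has been found, i.e. the pattern overlaps with itself. If no such match exists, next[j] = -1. The special case is next[0] = -1, which is fixed no matter which overlay function is used.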
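Now it is easy. By this definition, what are the values for abaabc?
1. next[0]=-1;
2. Looking only at ab: next[1]=-1;
3. Looking only at aba: next[2]=0, because pattern[0] to pattern[0] is the same as pattern[2-0] to pattern[2];
4. Looking only at abaa: next[3]=0, because pattern[0] to pattern[0] is the same as pattern[3-0] to pattern[3];
5. Looking only at abaab: next[4]=1, because pattern[0] to pattern[1] is the same as pattern[4-1] to pattern[4];
6. Likewise, next[5]=-1.
So,
Pattern:  a  b  a  a  b  c
next:    -1 -1  0  0  1 -1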
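The table above can be double-checked straight from the definition, without the clever fallback loop. The following is a minimal sketch under that assumption; border_end_table and its capacity parameter are hypothetical names, and the quadratic search is meant only for verification on short patterns.

```c
#include <assert.h>
#include <string.h>

/* Fills table[j] with the largest k < j such that pattern[0..k] equals
   pattern[j-k..j], or -1 when no such k exists. Brute force, directly
   from the definition; border_end_table is a hypothetical name. */
void border_end_table(const char *pattern, int *table, int capacity) {
    int length = (int)strlen(pattern);
    assert(length <= capacity);
    for (int j = 0; j < length; j++) {
        table[j] = -1;
        /* Try the longest possible overlay first. */
        for (int k = j - 1; k >= 0; k--) {
            if (memcmp(pattern, pattern + (j - k), (size_t)k + 1) == 0) {
                table[j] = k;
                break;
            }
        }
    }
}
```

Called with abaabc and a six-element array, it fills in -1 -1 0 0 1 -1, the same values as the hand calculation.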
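So how are these values used? As I said earlier, next values computed by this rule do not tell you directly which character to compare after a mismatch; a small calculation is needed.
For example, suppose c mismatches. We know the two main-string characters before the mismatch are ab, and the pattern also starts with ab, so we simply continue by comparing the a at pattern[2]: the leading ab of the pattern is guaranteed to line up with the ab just before the mismatch position in the main string. How is that 2 computed? Unfortunately not from the next value of c itself, but from the next value of the character before it: the b at pattern[4] has next[4] = 1, and 1 + 1 = 2. So the formula is:
index of the next pattern character to compare = next value of the last matched pattern character + 1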
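The formula is a one-liner in code. This is a small sketch, assuming a table in the first style; resume_index is a hypothetical helper, and returning -1 for a mismatch at index 0 is my own convention for "advance the main-string pointer instead".

```c
/* Returns the pattern index to compare next after a mismatch at
   mismatch_index, using a first-style table. Returns -1 when the mismatch
   is at index 0, meaning the main-string pointer has to advance instead. */
int resume_index(const int *first_style, int mismatch_index) {
    if (mismatch_index == 0) {
        return -1;
    }
    /* Next value of the last matched character, plus one. */
    return first_style[mismatch_index - 1] + 1;
}
```

With the table -1 -1 0 0 1 -1 for abaabc, a mismatch at index 5 (the c) gives 2, as in the example above.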
However, there is a better way to compute the next values. The code is:
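- void get_next_array(char * pattern){
-     int length = strlen(pattern);
-     //value is the length of the overlay found so far; -1 before the start
-     int k = 0,value = -1;
-     next[k] = -1;
-     while(k < length){
-         //Fall back until pattern[k] can extend an overlay or none is left
-         while(value >= 0 && pattern[k] != pattern[value]){
-             value = next[value];
-         }
-         k++;
-         value++;
-         //On a mismatch at index k, compare pattern[value] next
-         next[k] = value;
-     }
- }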
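next[0]=-1 stays the same and there are still two pointers, but the rest of the computation is slightly different. The values computed this way tell you directly which pattern index to compare next after a mismatch. See the code for the exact computation.
Pattern:  a  b  a  a  b  c
next:    -1  0  0  1  1  2
If the mismatch is at c, the next comparison uses pattern[next value of c], i.e. pattern[2].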
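The two tables are related by exactly the "+1 of the previous character" rule from the first method, so one can be derived from the other. A minimal sketch, assuming the first-style table is already available; shift_table is a hypothetical name.

```c
/* Derives the second-style table from a first-style one:
   second[i] = first[i-1] + 1 for i >= 1, and second[0] = -1. */
void shift_table(const int *first, int *second, int length) {
    if (length <= 0) {
        return;
    }
    second[0] = -1;
    for (int i = 1; i < length; i++) {
        second[i] = first[i - 1] + 1;
    }
}
```

Applied to -1 -1 0 0 1 -1 it yields -1 0 0 1 1 2, which matches the table above for abaabc.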
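Once you understand it thoroughly, you will see that value keeps moving forward as long as the overlay continues; when it does not, value falls back to the previous position where an overlay could start or continue, and if none can, it finally drops to -1.
This is in fact close to the formula with all the a[...] terms given earlier.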
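To watch the fallback happen, the sketch below records every value the inner while loop could visit from a given starting point. It assumes a second-style table; fallback_chain is a hypothetical helper written only for this illustration.

```c
#include <assert.h>

/* Records the successive candidates visited when falling back from `start`
   through next_table until -1 is reached. Returns how many entries were
   written to out (at most capacity). */
int fallback_chain(const int *next_table, int start, int *out, int capacity) {
    int count = 0;
    int value = start;
    while (count < capacity) {
        out[count++] = value;
        if (value < 0) {
            break;
        }
        /* Every step moves strictly backwards, so the loop terminates. */
        assert(next_table[value] < value);
        value = next_table[value];
    }
    return count;
}
```

For abaabc with the table -1 0 0 1 1 2, starting from value 2 (the state when c is examined) the chain is 2, 0, -1: two candidate overlays are tried and then value gives up.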
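There is still room to improve the next computation. Take the same pattern: if the a at pattern[2] mismatches, the next comparison would use pattern[next[2]] = pattern[0], but pattern[0] is also an a, so it is bound to mismatch as well; in the end the main-string pointer advances by one and the pattern restarts from its beginning. So we can refine the next values to skip such blind jumps; the improved algorithm is get_next_array_enhanced(). The resulting values are used much like the second kind, except that when the next value at the mismatch position is -1, we directly advance the main-string pointer by one and restart matching from the beginning of the pattern.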
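The refinement can also be expressed as a pass over an existing second-style table, which makes the rule easy to see. This is a sketch under that assumption rather than the code from KMP.h; optimize_table is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/* Builds the refined table from a plain second-style table: if the fallback
   character equals the current one, the comparison would fail again, so
   reuse the refined value already computed for the fallback position. */
void optimize_table(const char *pattern, const int *plain, int *optimized) {
    int length = (int)strlen(pattern);
    if (length == 0) {
        return;
    }
    optimized[0] = -1;
    for (int i = 1; i < length; i++) {
        int fallback = plain[i];
        /* In a second-style table every entry after the first is in [0, i). */
        assert(fallback >= 0 && fallback < i);
        if (pattern[i] == pattern[fallback]) {
            optimized[i] = optimized[fallback];
        } else {
            optimized[i] = fallback;
        }
    }
}
```

For abaabc the plain table -1 0 0 1 1 2 becomes -1 0 -1 1 0 2: the entries for pattern[2] and pattern[4] change because the characters they would fall back to are identical to the mismatched ones.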
The complete code:
KMP.h
- #include <string.h>
- #include <stdio.h>
- /*Shared next table. get_next_array() and get_next_array_enhanced()
-   also write next[length], so patterns may be at most 19 characters.
- */
- static int next[20] = {0};
- /*I think this is the worst of the three: the next value does not
-   tell you directly where pattern_index should go, although the
-   position can be worked out from it.
- */
- void get_next_array_origin(char * pattern){
-     //Get the pattern length
-     int length = strlen(pattern);
-     int k = 1,value = 0;
-     //By convention next[0] is always -1
-     next[0] = -1;
-     while(k < length){
-         //Start from the next value of the previous character
-         value = next[k-1];
-         /*value >= 0 means an overlay ending at pattern[value] exists;
-           a mismatch at pattern[value+1] means it cannot be extended,
-           so fall back to the next shorter overlay.
-         */
-         while(value >= 0 && pattern[k] != pattern[value+1]){
-             value = next[value];
-         }
-         /*pattern[k] either extends the overlay (value >= 0) or
-           starts a new one of length 1 (value equals -1).
-         */
-         if(pattern[k] == pattern[value+1]){
-             next[k] = value+1;
-         }
-         //No overlay ends at pattern[k]
-         else{
-             next[k] = -1;
-         }
-         k++;
-         //printf("next[%d] = %d\n",k-1,next[k-1]);
-     }
- }
- /*The second way to calculate next values. It is more convenient
-   because the next value is directly the new value of
-   pattern_index.
- */
- void get_next_array(char * pattern){
-     int length = strlen(pattern);
-     //value is the length of the overlay found so far; -1 before the start
-     int k = 0,value = -1;
-     next[k] = -1;
-     while(k < length){
-         //Fall back until pattern[k] can extend an overlay or none is left
-         while(value >= 0 && pattern[k] != pattern[value]){
-             value = next[value];
-         }
-         k++;
-         value++;
-         //On a mismatch at index k, compare pattern[value] next
-         next[k] = value;
-     }
- }
- /*An improvement on get_next_array(). The former only tells you
-   where to put pattern_index; it does not check whether the
-   character there equals the mismatched one, in which case the
-   comparison is bound to fail again. This version fixes that.
- */
- void get_next_array_enhanced(char * pattern){
-     int length = strlen(pattern);
-     int k = 0,value = -1;
-     next[k] = value;
-     while(k < length){
-         while(value >= 0 && pattern[k] != pattern[value]){
-             value = next[value];
-         }
-         k++;
-         value++;
-         /*If the fallback character equals the current one, a
-           mismatch here would repeat there too, so reuse the next
-           value already worked out for the fallback position.
-           For k == length, pattern[k] is the terminating '\0', so
-           next[length] keeps the plain overlay length.
-         */
-         if(pattern[k] == pattern[value]){
-             next[k] = next[value];
-         }
-         else{
-             next[k] = value;
-         }
-     }
- }
- void KMP_search_origin(char * main,char * pattern){
-     get_next_array_origin(pattern);
-     int main_index = 0,pattern_index = 0;
-     int main_length = strlen(main);
-     int pattern_length = strlen(pattern);
-     int flag = -1;
-     while(main_index < main_length){
-         if(main[main_index] == pattern[pattern_index]){
-             main_index++;
-             pattern_index++;
-             if(pattern_index == pattern_length){
-                 printf("find in place %d\n",main_index - pattern_length);
-                 flag = 1;
-                 /*Continue from the longest overlay of the whole
-                   pattern so that overlapping matches are not missed.
-                 */
-                 pattern_index = next[pattern_length-1]+1;
-             }
-         }
-         else{
-             if(pattern_index == 0){
-                 main_index++;
-             }
-             else{
-                 /*Calculate the next position to compare from
-                   the next value of the last matched character.
-                 */
-                 pattern_index = next[pattern_index-1]+1;
-             }
-         }
-     }
-     if(flag == -1){
-         printf("Sorry,we find nothing.\n");
-     }
- }
- void KMP_search(char * main,char * pattern){
-     get_next_array(pattern);
-     int main_index = 0,pattern_index = 0;
-     int main_length = strlen(main);
-     int pattern_length = strlen(pattern);
-     int flag = -1;
-     while(main_index < main_length){
-         if(main[main_index] == pattern[pattern_index]){
-             main_index++;
-             pattern_index++;
-             if(pattern_index == pattern_length){
-                 printf("find in place %d\n",main_index - pattern_length);
-                 flag = 1;
-                 //next[pattern_length] is the overlay of the whole pattern
-                 pattern_index = next[pattern_length];
-             }
-         }
-         else{
-             if(pattern_index == 0){
-                 main_index++;
-             }
-             else{
-                 /*Easier than before: the next value is exactly
-                   where to continue.
-                 */
-                 pattern_index = next[pattern_index];
-             }
-         }
-     }
-     if(flag == -1){
-         printf("Sorry,we find nothing.\n");
-     }
- }
- void KMP_search_enhanced(char * main,char * pattern){
-     get_next_array_enhanced(pattern);
-     int main_index = 0,pattern_index = 0;
-     int main_length = strlen(main);
-     int pattern_length = strlen(pattern);
-     int flag = -1;
-     while(main_index < main_length){
-         if(main[main_index] == pattern[pattern_index]){
-             main_index++;
-             pattern_index++;
-             if(pattern_index == pattern_length){
-                 printf("find in place %d\n",main_index - pattern_length);
-                 flag = 1;
-                 //next[pattern_length] is the overlay of the whole pattern
-                 pattern_index = next[pattern_length];
-             }
-         }
-         else{
-             /*A next value of -1 means no pattern character can match
-               the current main-string character, so move on to the next
-               character of the main string and restart the pattern.
-             */
-             if(next[pattern_index] == -1){
-                 main_index++;
-                 pattern_index = 0;
-             }
-             else{
-                 pattern_index = next[pattern_index];
-             }
-         }
-     }
-     if(flag == -1){
-         printf("Sorry,we find nothing.\n");
-     }
- }
Test cases:
KMP.c
- #include "KMP.h"
- int main(){
-     printf("=========origin===========\n");
-     KMP_search_origin("abacababa","aba");
-     printf("=========origin===========\n");
-     printf("=========normal===========\n");
-     KMP_search("abacababa","aba");
-     printf("=========normal===========\n");
-     printf("=========enhanced===========\n");
-     KMP_search_enhanced("abacababa","aba");
-     printf("=========enhanced===========\n");
-     return 0;
- }
next.c
- #include "KMP.h"
- int main(){
-     int i = 0;
-     char * p = "abaabc";
-     int length = strlen(p);
-     printf("=========origin===========\n");
-     get_next_array_origin(p);
-     while(i < length){
-         printf("next[%d] = %d\n",i,next[i]);
-         i++;
-     }
-     printf("=========origin===========\n");
-     printf("=========normal===========\n");
-     get_next_array(p);
-     i = 0;
-     while(i < length){
-         printf("next[%d] = %d\n",i,next[i]);
-         i++;
-     }
-     printf("=========normal===========\n");
-     printf("=========enhanced===========\n");
-     get_next_array_enhanced(p);
-     i = 0;
-     while(i < length){
-         printf("next[%d] = %d\n",i,next[i]);
-         i++;
-     }
-     printf("=========enhanced===========\n");
-     return 0;
- }
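Printing results is fine for a quick look, but a cross-check against the standard library catches mistakes more reliably. Below is a minimal sketch, assuming a non-empty pattern of at most MAX_PATTERN characters. The names kmp_find_all, strstr_find_all and MAX_PATTERN are hypothetical, and the sketch builds its own second-style table locally instead of including KMP.h.

```c
#include <assert.h>
#include <string.h>

#define MAX_PATTERN 64   /* assumed upper bound for this sketch */

/* Stores up to capacity match positions in out; returns the match count. */
int kmp_find_all(const char *text, const char *pattern, int *out, int capacity) {
    int fail[MAX_PATTERN + 1];
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    int count = 0;
    assert(m > 0 && m <= MAX_PATTERN);

    /* Second-style table, including the extra entry fail[m]. */
    fail[0] = -1;
    for (int k = 0, value = -1; k < m; ) {
        while (value >= 0 && pattern[k] != pattern[value]) {
            value = fail[value];
        }
        k++;
        value++;
        fail[k] = value;
    }

    for (int i = 0, j = 0; i < n; ) {
        if (text[i] == pattern[j]) {
            i++;
            j++;
            if (j == m) {               /* full match ending just before i */
                if (count < capacity) out[count] = i - m;
                count++;
                j = fail[m];            /* keep overlapping matches */
            }
        } else if (j == 0) {
            i++;
        } else {
            j = fail[j];
        }
    }
    return count;
}

/* Reference implementation: repeated strstr, advancing by one each time. */
int strstr_find_all(const char *text, const char *pattern, int *out, int capacity) {
    int count = 0;
    assert(pattern[0] != '\0');
    for (const char *hit = strstr(text, pattern); hit != NULL;
         hit = strstr(hit + 1, pattern)) {
        if (count < capacity) out[count] = (int)(hit - text);
        count++;
    }
    return count;
}
```

For the test input above (abacababa and aba) both functions report the positions 0, 4 and 6. An overlapping case such as abab inside ababab, with matches at 0 and 2, is worth adding to any test of the search functions.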