字符串系列之最长回文子串


Translated to ENGLISH

源于这两篇文章: 
http://blog.csdn.net/ggggiqnypgjg/article/details/6645824
http://zhuhongcheng.wordpress.com/2009/08/02/a-simple-linear-time-algorithm-for-finding-longest-palindrome-sub-string/

这个算法看了三天,终于理解了,在这里记录一下自己的思路,免得以后忘了又要想很久- -.

首先用一个非常巧妙的方式,将所有可能的奇数/偶数长度的回文子串都转换成了奇数长度:在每个字符的两边都插入一个特殊的符号。比如 abba 变成 #a#b#b#a#, aba变成 #a#b#a#。 为了进一步减少编码的复杂度,可以在字符串的开始加入另一个特殊字符,这样就不用特殊处理越界问题,比如$#a#b#a#。

下面以字符串12212321为例,经过上一步,变成了 S[] = "$#1#2#2#1#2#3#2#1#";

然后用一个数组 P[i] 来记录以字符S[i]为中心的最长回文子串向左/右扩张的长度(包括S[i]),比如S和P的对应关系:
S  #  1  #  2  #  2  #  1  #  2  #  3  #  2  #  1  #
P  1  2  1  2  5  2  1  4  1  2  1  6  1  2  1  2  1
(p.s. 可以看出,P[i]-1正好是原字符串中回文串的总长度)

那么怎么计算P[i]呢?该算法增加两个辅助变量(其实一个就够了,两个更清晰)id和mx,其中id表示最大回文子串中心的位置,mx则为id+P[id],也就是最大回文子串的边界。

然后可以得到一个非常神奇的结论,这个算法的关键点就在这里了:如果mx > i,那么P[i] >= MIN(P[2 * id - i], mx - i)。就是这个串卡了我非常久。实际上如果把它写得复杂一点,理解起来会简单很多:
//记j = 2 * id - i,也就是说 j 是 i 关于 id 的对称点。
if (mx - i > P[j]) 
    P[i] = P[j];
else /* P[j] >= mx - i */
    P[i] = mx - i; // P[i] >= mx - i,取最小值,之后再匹配更新。

当然光看代码还是不够清晰,还是借助图来理解比较容易。

当 mx - i > P[j] 的时候,以S[j]为中心的回文子串包含在以S[id]为中心的回文子串中,由于 i 和 j 对称,以S[i]为中心的回文子串必然包含在以S[id]为中心的回文子串中,所以必有 P[i] = P[j],见下图。
字符串系列之最长回文子串_第1张图片

当 P[j] >= mx - i 的时候,以S[j]为中心的回文子串不一定完全包含于以S[id]为中心的回文子串中,但是基于对称性可知,下图中两个绿框所包围的部分是相同的,也就是说以S[i]为中心的回文子串,其向右至少会扩张到mx的位置,也就是说 P[i] >= mx - i。至于mx之后的部分是否对称,就只能老老实实去匹配了。
字符串系列之最长回文子串_第2张图片

对于 mx <= i 的情况,无法对 P[i]做更多的假设,只能P[i] = 1,然后再去匹配了。

于是代码如下:
//输入,并处理得到字符串s
int p[1000], mx = 0, id = 0;
memset(p, 0,  sizeof(p));
for (i = 1; s[i] != '\0'; i++) {
    p[i] = mx > i ? min(p[2*id-i], mx-i) : 1;
     while (s[i + p[i]] == s[i - p[i]]) p[i]++;
     if (i + p[i] > mx) {
        mx = i + p[i];
        id = i;
    }
}
//找出p[i]中最大的

OVER.

#UPDATE@2013-08-21 14:27

@zhengyuee 同学指出,由于 P[id] = mx,所以 S[id-mx] != S[id+mx],那么当 P[j] > mx - i 的时候,可以肯定 P[i] = mx - i ,不需要再继续匹配了。不过在具体实现的时候即使不考虑这一点,也只是多一次匹配(必然会fail),但是却要多加一个分支,所以上面的代码就不改了。


Longest Palindromic Substring Part II

November 20, 2011 in string

Given a string S, find the longest palindromic substring in S.

Note:
This is Part II of the article: Longest Palindromic Substring. Here, we describe an algorithm (Manacher’s algorithm) which finds the longest palindromic substring in linear time. Please read Part I for more background information.

In my previous post we discussed a total of four different methods, among them there’s a pretty simple algorithm with O(N2) run time and constant space complexity. Here, we discuss an algorithm that runs in O(N) time and O(N) space, also known as Manacher’s algorithm.

Hint:
Think how you would improve over the simpler O(N2) approach. Consider the worst case scenarios. The worst case scenarios are the inputs with multiple palindromes overlapping each other. For example, the inputs: “aaaaaaaaa” and “cabcbabcbabcba”. In fact, we could take advantage of the palindrome’s symmetric property and avoid some of the unnecessary computations.

An O(N) Solution (Manacher’s Algorithm):
First, we transform the input string, S, to another string T by inserting a special character ‘#’ in between letters. The reason for doing so will be immediately clear to you soon.

For example: S = “abaaba”, T = “#a#b#a#a#b#a#”.

To find the longest palindromic substring, we need to expand around each Ti such that Ti-d … Ti+d forms a palindrome. You should immediately see that d is the length of the palindrome itself centered at Ti.

We store intermediate result in an array P, where P[ i ] equals to the length of the palindrome centers at Ti. The longest palindromic substring would then be the maximum element in P.

Using the above example, we populate P as below (from left to right):

T = # a # b # a # a # b # a #
P = 0 1 0 3 0 1 6 1 0 3 0 1 0

Looking at P, we immediately see that the longest palindrome is “abaaba”, as indicated by P6 = 6.

Did you notice by inserting special characters (#) in between letters, both palindromes of odd and even lengths are handled graciously? (Please note: This is to demonstrate the idea more easily and is not necessarily needed to code the algorithm.)

Now, imagine that you draw an imaginary vertical line at the center of the palindrome “abaaba”. Did you notice the numbers in P are symmetric around this center? That’s not only it, try another palindrome “aba”, the numbers also reflect similar symmetric property. Is this a coincidence? The answer is yes and no. This is only true subjected to a condition, but anyway, we have great progress, since we can eliminate recomputing part of P[ i ]‘s.

Let us move on to a slightly more sophisticated example with more some overlapping palindromes, where S = “babcbabcbaccba”.


Above image shows T transformed from S = “babcbabcbaccba”. Assumed that you reached a state where table P is partially completed. The solid vertical line indicates the center (C) of the palindrome “abcbabcba”. The two dotted vertical line indicate its left (L) and right (R) edges respectively. You are at index i and its mirrored index around C is i’. How would you calculate P[ i ] efficiently?

Assume that we have arrived at index i = 13, and we need to calculate P[ 13 ] (indicated by the question mark ?). We first look at its mirrored index i’ around the palindrome’s center C, which is index i’ = 9.


The two green solid lines above indicate the covered region by the two palindromes centered at i and i’. We look at the mirrored index of i around C, which is index i’. P[ i' ] = P[ 9 ] = 1. It is clear that P[ i ] must also be 1, due to the symmetric property of a palindrome around its center.

As you can see above, it is very obvious that P[ i ] = P[ i' ] = 1, which must be true due to the symmetric property around a palindrome’s center. In fact, all three elements after C follow the symmetric property (that is, P[ 12 ] = P[ 10 ] = 0, P[ 13 ] = P[ 9 ] = 1, P[ 14 ] = P[ 8 ] = 0).


Now we are at index i = 15, and its mirrored index around C is i’ = 7. Is P[ 15 ] = P[ 7 ] = 7?

Now we are at index i = 15. What’s the value of  P[ i ] ? If we follow the symmetric property, the value of  P[ i ] should be the same as P[ i' ] = 7. But this is wrong. If we expand around the center at T15, it forms the palindrome “a#b#c#b#a”, which is actually shorter than what is indicated by its symmetric counterpart. Why?


Colored lines are overlaid around the center at index i and i’. Solid green lines show the region that must match for both sides due to symmetric property around C. Solid red lines show the region that might not match for both sides. Dotted green lines show the region that crosses over the center.

It is clear that the two substrings in the region indicated by the two solid green lines must match exactly. Areas across the center (indicated by dotted green lines) must also be symmetric. Notice carefully that P[ i ' ] is 7 and it expands all the way across the left edge (L) of the palindrome (indicated by the solid red lines), which does not fall under the symmetric property of the palindrome anymore. All we know is  P[ i ]  ≥ 5, and to find the real value of P[ i ] we have to do character matching by expanding past the right edge (R). In this case, since P[ 21 ] ≠ P[ 1 ], we conclude that P[ i ] = 5.

Let’s summarize the key part of this algorithm as below:

if P[ i' ] ≤ R – i,
then P[ i ] ← P[ i' ]
else P[ i ] ≥ P[ i' ]. (Which we have to expand past the right edge (R) to find P[ i ].

See how elegant it is? If you are able to grasp the above summary fully, you already obtained the essence of this algorithm, which is also the hardest part.

The final part is to determine when should we move the position of C together with R to the right, which is easy:

If the palindrome centered at i does expand past R, we update C to i, (the center of this new palindrome), and extend R to the new palindrome’s right edge.

In each step, there are two possibilities. If P[ i ] ≤ R – i, we set P[ i ] to P[ i' ] which takes exactly one step. Otherwise we attempt to change the palindrome’s center to i by expanding it starting at the right edge, R. Extending R (the inner while loop) takes at most a total of N steps, and positioning and testing each centers take a total of N steps too. Therefore, this algorithm guarantees to finish in at most 2*N steps, giving a linear time solution.

Note:
This algorithm is definitely non-trivial and you won’t be expected to come up with such algorithm during an interview setting. However, I do hope that you enjoy reading this article and hopefully it helps you in understanding this interesting algorithm. You deserve a pat if you have gone this far! 

Further Thoughts:

  • In fact, there exists a sixth solution to this problem — Using suffix trees. However, it is not as efficient as this one (run time O(N log N) and more overhead for building suffix trees) and is more complicated to implement. If you are interested, read Wikipedia’s article about Longest Palindromic Substring.
  • What if you are required to find the longest palindromic subsequence? (Do you know the difference between substring and subsequence?)

Useful Links:
» Manacher’s Algorithm O(N) 时间求字符串的最长回文子串 (Best explanation if you can read Chinese)
» A simple linear time algorithm for finding longest palindrome sub-string
» Finding Palindromes
» Finding the Longest Palindromic Substring in Linear Time
» Wikipedia: Longest Palindromic Substring

 字符串系列之最长回文子串

0人收藏此文章, 我要收藏发表于1年前(2012-06-24 14:05) , 已有 1028次阅读 ,共 0个评论

问题描述:

    给定一个字符串S=A1A2...An,要求找出其最长回文子串(Longest Palindromic Substring)。所谓回文子串就是S的某个子串Ai...Aj为回文。例如,对字符串S=abcdcbeba,它的回文子串有:bcdcb,cdc,beb,满足题目要求的最长回文子串为bcdcb。

推理思路:

1.由于回文可能由奇数个字符组成,也可能由偶数个字符组成。对奇数回文的处理比较直观,只需要以某个字符为中心,依次向两边扩展即可。因此,我们可以通过如下方式把对偶数回文的处理转换成对奇数回文的处理:在字符边界添加特殊符号。例如,对字符串aba,预处理后变成#a#b#a#;对字符串abba,预处理后变成#a#b#b#a。可以看出,不管是奇数回文,还是偶数回文,在与处理后都变成奇数回文。在找出与预处理后字符串的最长回文后,只需要去除所有的#即为源字符串的最长回文。

2.对寻找字符串某类子串的问题,最简单直观的想法就是穷举出所有子串一一进行判别。这里也不例外,当然时间复杂度也很高,为O(n^3)。

3.对该问题,我们可以进行一定程度的简化处理。既然回文是一种特殊的字符串,我们可以以源字符串的每个字符为中心,依次寻找出最长回文子串P0, P1,...,Pn。这些最长回文子串中的最长串Pi = max(P1, P2,...,Pn)即为所求。请看源码:

01 string find_lps_native(const string &str)
02 {
03     int center = 0, max_len = 0;
04     for(int i = 1; i < str.length()-1; ++i)
05     {
06         int j = 1;
07  
08         //以str[i]为中心,依次向两边扩展,寻找最长回文Pi
09         while(i+j < str.length() && i-j >= 0 && str[i+j] == str[i-j])
10             ++j;
11         --j;
12         if(j > 1 && j > max_len)
13         {
14             center = i;
15             max_len = j;
16         }
17     }
18  
19     return str.substr(center-max_len, (max_len << 1) + 1);
20 }

4.可以看出,上面做法的复杂度为O(n^2)。相比穷举字符串的做法,已经降低了一个量级的复杂度。但是仔细想想,上面的算法还有改进空间吗?当然有!而且改进后能够把复杂度降低到O(n)!这就是大名鼎鼎的  Manacher’s Algorithm。请看下文分析:

    举例说明:对字符串S=abcdcba而言,最长回文子串是以d为中心,半径为3的子串。当我们采用上面的做法分别求出以S[1]=a, S[2]=b, S[3]=c, S[4]=d为中心的最长回文子串后,对S[5]=c,S[6]=b...还需要一一进行扩展求吗?答案是NO。因为我们已经找到以d为中心,半径为3的回文了,S[5]与S[3],S[6]与S[2]...,以S[4]为对称中心。因此,在以S[5],S[6]为中心扩展找回文串时,可以利用已经找到的S[3],S[2]的相关信息直接进行一定步长的偏移,这样就减少了比较的次数(回想一下KMP中next数组的思想)。优化的思想找到了,我们先看代码:

01 string find_lps_advance(const string &str)
02 {
03     //find radius of all characters
04     vector<int> p(str.length(), 0);
05     int idx = 1, max = 0;
06     for(int i = 1; i < str.length()-1; ++i)
07     {
08         if(max > i)
09         {
10             p[i] = p[(idx << 1) - i] < (max - i) ? p[(idx << 1) - i]:(max - i);
11         }
12         while(str[i+p[i]+1] == str[i-p[i]-1])
13             p[i] += 1;
14         if(i + p[i] > max)
15         {
16             idx = i;
17             max = i+p[i];
18         }
19     }
20  
21     // find the character which has max radius
22     int center = 0, radius = 0;
23     for(int i = 0; i < p.size(); ++i)
24     {
25         if(p[i] > radius)
26         {
27             center = i;
28             radius = p[i];
29         }
30     }
31  
32     return str.substr(center-radius, (radius << 1) + 1);
33 }

    这里进行简单的解释:上述代码中有三个主要变量,它们代表的意义分别是:

     p:以S[i]为中心的最长回文串的半径为p[i]。

     idx:已经找出的最长回文子串的起始位置。

     max:已经找出的最长回文子串的结束位置。

     算法的主要思想是:先找出所有的p[i],最大的p[i]即为所求。在求p[j] (j>i)时,利用已经求出的p[i]减少比较次数。

     代码中比较关键的一句是:

1 p[i] = p[(idx << 1) - i] < (max - i) ? p[(idx << 1) - i]:(max - i);

    在求p[i]时,如果max>i,则表明已经求出的最长回文中包含了p[i],那么与p[i]关于idx对称的p[ (idx << 1) - i]的最长回文子串可以提供一定的信息。看了两幅图大概就明白什么意思了:

字符串系列之最长回文子串_第3张图片

字符串系列之最长回文子串_第4张图片

    求二者的最小值是因为当前能够获取的信息都来自max的左侧,需要进一步比较,求出以S[i]为中心的最长回文串。

5.除了上述的几种做法外,还可以利用动态规划以及后缀树来进行求解。下次进行介绍。

    结束行文之前,补充一句,对于字符串类的问题,建议多画一画,寻找其中的规律。

 

参考文献:1.http://www.akalin.cx/longest-palindrome-linear-time

 


你可能感兴趣的:(字符串系列之最长回文子串)