Zero-width assertions -- Lookahead/Lookbehind, Positive/Negative

http://www.vaikan.com/regular-expression-to-match-string-not-containing-a-word/

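We often need to find text that does not contain a certain string. The first thing many programmers try in a regular expression is

^(hede)

to filter out the string "hede", but this is wrong: it matches text that starts with "hede".

Another attempt is:

[^hede]

but this regular expression means something else entirely: it matches a single character that is not 'h', 'e', or 'd'. So what regular expression selects text that does not contain the complete string "hede"?

In fact, the claim that regular expressions do not support inverse matching is not one hundred percent correct. For this problem,

we can use a negative lookahead to simulate inverse matching and solve it:

^((?!hede).)*$

The expression above selects text that does not contain the string "hede". As I said above, this is not something regular expressions are "good at", but it does work.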
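A minimal sketch of this filter in Python, using the standard re module. The sample lines are made up for illustration and are not from the original article.

```python
import re

# Whole-line pattern from the text: no position may be followed by "hede".
NOT_HEDE = re.compile(r'^((?!hede).)*$')

# Hypothetical sample data.
lines = ["hoho", "hihi", "haha", "hede", "ABhedeCD", ""]

# Keep only the lines that do not contain "hede" (the empty line qualifies).
kept = [line for line in lines if NOT_HEDE.match(line)]
print(kept)

# For contrast, [^hede] only tests one character at a time:
# it happily finds a character in a line that does contain "hede".
single = re.search(r'[^hede]', "ABhedeCD").group()
print(single)
```

Note that . does not match a newline by default, so this sketch tests one line at a time.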

Explanation

A string consists of n characters. Before and after each character there is an empty string.

So a string of n characters has n+1 empty strings. Consider the string "ABhedeCD":

    +--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+
S = |e1| A |e2| B |e3| h |e4| e |e5| d |e6| e |e7| C |e8| D |e9|
    +--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+---+--+
index    0      1      2      3      4      5      6      7
behind <-------------e3---------------------------------------> ahead

 

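Every position labeled e is an empty string.

The expression (?!hede). looks ahead to check that the text immediately ahead is not "hede",

and if it is not (some other characters follow), the . (dot) matches the next character.

This kind of "look" in a regular expression is also called a zero-width assertion,

because it does not consume any characters; it only tests a condition.

In the example above, every empty string checks whether the text ahead of it is not "hede",

 

and if it is not, the . (dot) matches that one character. The expression (?!hede). does this only once,

so we wrap it in a group with parentheses

and apply * (star) to repeat it zero or more times: ((?!hede).)*

 

So the anchored expression ^((?!hede).)*$ fails to match the string "ABhedeCD",

because at position e3 the assertion (?!hede) fails: "hede" lies directly ahead, which means the string contains the forbidden text.

In regular expressions, ?! is the negative lookahead, and it solves the "does not contain" matching problem.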
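The position-by-position check described above can be sketched in Python. The variable names are my own; indexes 0 to 8 correspond to the empty strings e1 to e9 in the diagram.

```python
import re

s = "ABhedeCD"
lookahead = re.compile(r'(?!hede)')

# Try the bare assertion at each of the n+1 positions (indexes 0..8).
# Collect the indexes where it fails, i.e. where "hede" lies directly ahead.
failed = [i for i in range(len(s) + 1) if lookahead.match(s, i) is None]
print(failed)

# Without anchors, the repeated group simply stops in front of "hede".
prefix = re.match(r'((?!hede).)*', s).group()
print(prefix)

# With ^ and $ the whole string has to be covered, so the match fails.
anchored = re.match(r'^((?!hede).)*$', s)
print(anchored)
```

Only index 2 (position e3) fails, which is exactly where the repetition has to stop.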

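Zero-width assertions:
(?=exp) matches a position that is followed by exp (checks the text after the current position)
(?<=exp) matches a position that is preceded by exp (checks the text before the current position)
(?!exp) matches a position that is not followed by exp (checks the text after the current position)
(?<!exp) matches a position that is not preceded by exp (checks the text before the current position)

<<< [behind] current position [ahead] >>>

before the expression <--- expression ---> after the expression

(?=subexpression)      Zero-width positive lookahead assertion. It asserts that the text [after] the current position matches exp. Lookahead Positive

(?<=subexpression)    Zero-width positive lookbehind assertion. It asserts that the text [before] the current position matches exp. Lookbehind Positive

! means "not", that is, the text must not match; it is also zero-width and consumes nothing.

(?!subexpression)       Zero-width negative lookahead assertion. It asserts that the text [after] this position does not match exp. Lookahead Negative !

(?<!subexpression)     Zero-width negative lookbehind assertion. It asserts that the text [before] this position does not match exp. Lookbehind Negative !

< -- lookbehind

no < -- lookahead

! -- Negative

= -- Positive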
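As a quick check of the four forms, here is a small Python sketch. The sample text "a1 b2" and the patterns are chosen for illustration only.

```python
import re

text = "a1 b2"

# Positive lookahead: a letter that is followed by 1.
pos_ahead = re.findall(r'[a-z](?=1)', text)
# Negative lookahead: a letter that is not followed by 1.
neg_ahead = re.findall(r'[a-z](?!1)', text)
# Positive lookbehind: a digit that is preceded by b.
pos_behind = re.findall(r'(?<=b)\d', text)
# Negative lookbehind: a digit that is not preceded by b.
neg_behind = re.findall(r'(?<!b)\d', text)

print(pos_ahead, neg_ahead, pos_behind, neg_behind)
```

In every case only the letter or digit is returned; the text tested by the assertion is never part of the match.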

((?!regex).)*    : matches a run of characters that does not contain the string "regex".

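Zero-width assertions

The next four constructs find things before or after certain content (without including that content). Like \b, ^, and $, they specify a position,

and that position must satisfy a condition (the assertion), which is why they are called zero-width assertions. Examples explain this best:

An assertion states a fact that should be true. In a regular expression, matching continues only when the assertion holds.

(?=exp), the zero-width positive lookahead assertion, asserts that the text after the current position matches exp.

For example, \b\w+(?=ing\b) matches the front part of words ending in ing (everything except the ing),

so when searching I'm singing while you're dancing. it matches sing and danc.

(?<=exp), the zero-width positive lookbehind assertion, asserts that the text before the current position matches exp.

For example, (?<=\bre)\w+\b matches the rest of words starting with re (everything except the re),

so when searching reading a book it matches ading.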
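Both examples can be tried in Python's re module, which accepts these patterns as written. The variable names are my own.

```python
import re

# Positive lookahead: the part of each word in front of a final "ing".
stems = re.findall(r'\b\w+(?=ing\b)', "I'm singing while you're dancing.")
print(stems)

# Positive lookbehind: the part of each word after a leading "re".
rest = re.findall(r'(?<=\bre)\w+\b', "reading a book")
print(rest)
```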

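Suppose you want to insert a comma between every three digits of a long number (counting from the right, of course).

You can find the part that needs commas in front of it and inside it like this:

((?<=\d)\d{3})+\b, which finds 234567890 when applied to 1234567890.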
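A short Python sketch of this example. The first search uses the pattern from the text; the re.sub call that actually inserts the commas is my own follow-up and is not part of the original notes.

```python
import re

number = "1234567890"

# Pattern from the text: the part of the number that needs commas.
part = re.search(r'((?<=\d)\d{3})+\b', number).group()
print(part)

# Assumed follow-up: match the empty positions that are preceded by a digit
# and followed by a multiple of three digits, and put a comma at each one.
with_commas = re.sub(r'(?<=\d)(?=(\d{3})+\b)', ',', number)
print(with_commas)
```

The substitution pattern consists only of assertions, so it matches positions and replaces nothing.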

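The following example uses both kinds of assertion together:

(?<=\s)\d+(?=\s) matches digits delimited by whitespace (again, the whitespace itself is not part of the match). For example, in

abc 123 def

it matches 123.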
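In Python this looks as follows. The second input is an extra case I added to show that digits at the very start or end of the string are not matched, because no whitespace character stands next to them.

```python
import re

pattern = re.compile(r'(?<=\s)\d+(?=\s)')

# The example from the text.
middle = pattern.findall("abc 123 def")
print(middle)

# Added case: only the middle number has whitespace on both sides.
edges = pattern.findall("123 456 789")
print(edges)
```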

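Negative zero-width assertions

Earlier we covered how to match a character that is not a given character or not in a given character class (negation).

But what if we only want to make sure a certain character does not appear, without matching anything in its place?

For example, to find words that contain the letter q not followed by the letter u, we might try:

\b\w*q[^u]\w*\b, which matches words containing a q followed by a character that is not u.

But with more testing (or a sharp enough eye) you will find that

when q is the last letter of the word, as in Iraq or Benq, this expression goes wrong.

That is because [^u] always has to match one character, so if q is the last character of the word,

the [^u] matches the word separator after q (a space, a period, or something else),

and the following \w*\b then matches the next word, so \b\w*q[^u]\w*\b matches the whole of Iraq fighting.

A negative zero-width assertion solves this, because it only matches a position and consumes no characters.

Now we can solve the problem like this: \b\w*q(?!u)\w*\b.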
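The difference between the two patterns is easy to see in Python. The input "quit" is an extra case I added.

```python
import re

text = "Iraq fighting"

# The negated character class consumes the space and runs into the next word.
with_class = re.findall(r'\b\w*q[^u]\w*\b', text)
print(with_class)

# The negative lookahead consumes nothing, so the match stops after "Iraq".
with_lookahead = re.findall(r'\b\w*q(?!u)\w*\b', text)
print(with_lookahead)

# Added case: a q that is followed by u must not match at all.
quit_result = re.findall(r'\b\w*q(?!u)\w*\b', "quit")
print(quit_result)
```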

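The zero-width negative lookahead assertion (?!exp) asserts that the text after this position does not match exp.

For example, \d{3}(?!\d) matches three digits that are not followed by another digit;

\b((?!abc)\w)+\b matches words that do not contain the consecutive string abc.

Similarly, we can use

the zero-width negative lookbehind assertion (?<!exp) to assert that the text before this position does not match exp:

(?<![a-z])\d{7} matches seven digits that are not preceded by a lowercase letter.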
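A small Python sketch of these three patterns. The input strings are invented for illustration.

```python
import re

# Three digits not followed by another digit: only the last three qualify.
last_three = re.search(r'\d{3}(?!\d)', "12345").group()
print(last_three)

# Words without "abc" inside. finditer with group(0) is used because
# findall would return the contents of the capturing group instead.
words = [m.group(0) for m in re.finditer(r'\b((?!abc)\w)+\b', "xabcx fine")]
print(words)

# Seven digits that are not preceded by a lowercase letter.
numbers = re.findall(r'(?<![a-z])\d{7}', "a1234567 b 7654321")
print(numbers)
```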

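A more complex example: (?<=<(\w+)>).*(?=<\/\1>) matches the content inside a simple HTML tag that has no attributes. Because the lookbehind contains \w+, it only works in engines that allow variable-length lookbehind, such as .NET.

(?<=<(\w+)>) specifies this prefix:

a word enclosed in angle brackets (for example <b>),

followed by .* (any string), and finally a suffix (?=<\/\1>).

Note the \/ in the suffix, which uses the character escaping mentioned earlier;

\1 is a backreference to the first captured group, the text matched by the earlier (\w+),

so if the prefix is actually <b>, the suffix is </b>.

The whole expression matches the content between <b> and </b> (again, not including the prefix and suffix themselves).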
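Python's re module does not accept this pattern, because its lookbehind must have a fixed width. The sketch below first shows the rejection and then uses a different technique: the tags are consumed by the match and the content is read from a capturing group. The lazy .*? and the sample HTML are my own choices, not part of the original example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# The lookbehind contains \w+, so Python refuses to compile the pattern.
try:
    re.compile(r'(?<=<(\w+)>).*(?=</\1>)')
    lookbehind_accepted = True
except re.error:
    lookbehind_accepted = False
print(lookbehind_accepted)

# Swapped-in technique: match the tags too, and take group 2 as the content.
contents = [m.group(2) for m in re.finditer(r'<(\w+)>(.*?)</\1>', html)]
print(contents)
```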

 

Lookahead and Lookbehind Zero-Length Assertions

Lookahead and

lookbehind, collectively called

lookaround, are zero-length assertions

just like the start of line ^, end of line $, and start and end of word anchors.

The difference is that lookaround actually matches characters, but then gives up the match,

returning only the result: match or no match. That is why they are called

assertions.

They do not consume characters in the string, but only assert whether a match is possible or not.

Lookaround allows you to create regular expressions that are impossible to create without them,

or that would get very longwinded without them. 

Positive and Negative Lookahead

Negative lookahead is indispensable if you want to match something not followed by something else.

When explaining character classes, this tutorial explained why you cannot use a negated character class

to match a q not followed by a u.

Negative lookahead provides the solution: q(?!u)

The negative lookahead construct is the pair of parentheses,

with the opening parenthesis[(] followed by a question mark [?] and an exclamation point[!]

Inside the lookahead, we have the trivial regex u.

 

Positive lookahead works just the same. q(?=u) 

matches a q that is followed by a u, without making the u part of the match.

The positive lookahead construct is a pair of parentheses, 

with the opening parenthesis [(] followed by a question mark [?] and an equals sign [=]

You can use any valid regular expression inside the lookahead (but not lookbehind, as explained below).

If it contains capturing groups then those groups will capture as normal and backreferences to them will work normally,

even outside the lookahead.

(The only exception is Tcl, which treats all groups inside lookahead as non-capturing.)

The lookahead itself is not a capturing group.

It is not included in the count towards numbering the backreferences.

If you want to store the match of the regex inside a lookahead,

you have to put capturing parentheses around the regex inside the lookahead, like this: 

(?=(regex))

The other way around will not work,

because the lookahead will already have discarded the regex match by the time the capturing group is to store its match. 
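A brief Python sketch of both placements of the capturing parentheses. The pattern u\w and the word quit are only illustrative choices.

```python
import re

# Capturing group inside the lookahead: the group keeps what was seen.
inside = re.search(r'q(?=(u\w))', "quit")
print(inside.group(0))  # the overall match is still only the q
print(inside.group(1))  # the text the lookahead looked at

# The other way around: the group wraps the lookahead and captures nothing,
# because the lookahead has zero length.
outside = re.search(r'q((?=u\w))', "quit")
print(repr(outside.group(1)))
```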

Regex Engine Internals

First, let's see how the engine applies q(?!u) to the string Iraq.

The first token in the regex is the literal q.

As we already know, this causes the engine to traverse the string until the q in the string is matched.

The position in the string is now the void after the string.

The next token is the lookahead.

The engine takes note that it is inside a lookahead construct now,

and begins matching the regex inside the lookahead.

So the next token is u.

This does not match the void after the string.

The engine notes that the regex inside the lookahead failed.

Because the lookahead is negative, this means that the lookahead has successfully matched at the current position.

At this point, the entire regex has matched, and q is returned as the match.

Let's try applying the same regex to quit

q matches q.

The next token is the u inside the lookahead.

The next character is the u.

These match.

The engine advances to the next character: i.

However, it is done with the regex inside the lookahead.

The engine notes success, and discards the regex match.

This causes the engine to step back in the string to u.

Because the lookahead is negative, the successful match inside it causes the lookahead to fail.

Since there are no other permutations of this regex, the engine has to start again at the beginning.

Since q cannot match anywhere else, the engine reports failure.

Let's take one more look inside, to make sure you understand the implications of the lookahead.

Let's apply q(?=u)i to quit.

The lookahead is now positive and is followed by another token.

Again, q matches q and u matches u.

Again, the match from the lookahead must be discarded, so the engine steps back from i in the string to u.

The lookahead was successful, so the engine continues with i.

But i cannot match u.

So this match attempt fails.

All remaining attempts fail as well, because there are no more q's in the string.
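The three walkthroughs above can be confirmed in Python. The last pattern, q(?=u)u, is an extra case I added to show that the token after a lookahead has to match the same character the lookahead just inspected.

```python
import re

# q(?!u) on Iraq: the q at the end is not followed by u, so it matches.
iraq = re.search(r'q(?!u)', "Iraq")
print(iraq.span())

# q(?!u) on quit: the only q is followed by u, so there is no match.
quit_negative = re.search(r'q(?!u)', "quit")
print(quit_negative)

# q(?=u)i on quit: after the lookahead the engine is back at u, and i fails.
quit_positive = re.search(r'q(?=u)i', "quit")
print(quit_positive)

# Added case: matching u after the lookahead succeeds.
qu = re.search(r'q(?=u)u', "quit").group()
print(qu)
```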

 

Positive and Negative Lookbehind

Lookbehind has the same effect, but works backwards.

It tells the regex engine to temporarily step backwards in the string,

to check if the text inside the lookbehind can be matched there. 

(?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind.

It doesn't match cab, but matches the b (and only the b) in bed or debt.

(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt.
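These two patterns can be checked against the three words in Python. The dictionary names are my own.

```python
import re

words = ["cab", "bed", "debt"]

# Negative lookbehind: a b that is not preceded by a.
negative = {w: bool(re.search(r'(?<!a)b', w)) for w in words}
print(negative)

# Positive lookbehind: a b that is preceded by a.
positive = {w: bool(re.search(r'(?<=a)b', w)) for w in words}
print(positive)

# Only the b itself is part of the match, not the a before it.
cab_span = re.search(r'(?<=a)b', "cab").span()
print(cab_span)
```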

 

The construct for positive lookbehind is (?<=text):

a pair of parentheses, with the opening parenthesis followed by a question mark,

"less than" symbol, and an equals sign.

Negative lookbehind is written as (?<!text),

using an exclamation point instead of an equals sign.

More Regex Engine Internals

Let's apply (?<=a)b to thingamabob.

The engine starts with the lookbehind and the first character in the string.

In this case, the lookbehind tells the engine to step back one character, and see if a can be matched there.

The engine cannot step back one character because there are no characters before the t.

So the lookbehind fails, and the engine starts again at the next character, the h.

(Note that a negative lookbehind would have succeeded here.)

Again, the engine temporarily steps back one character to check if an "a" can be found there.

It finds a t, so the positive lookbehind fails again.

The lookbehind continues to fail until the regex reaches the m in the string.

The engine again steps back one character, and notices that the a can be matched there.

The positive lookbehind matches.

Because it is zero-length, the current position in the string remains at the m.

The next token is b, which cannot match here.

The next character is the second a in the string.

The engine steps back, and finds out that the m does not match a.

The next character is the first b in the string.

The engine steps back and finds out that a satisfies the lookbehind.

b matches b, and the entire regex has been matched successfully.

It matches one character: the first b in the string.
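The walkthrough can be traced in Python. The sketch tries the bare lookbehind at every position, relying on the fact that Pattern.match with a start position can still look at the characters before that position.

```python
import re

s = "thingamabob"

# Positions where the lookbehind (?<=a) alone succeeds:
# before the m (index 6) and before the first b (index 8).
lookbehind = re.compile(r'(?<=a)')
ok_positions = [i for i in range(len(s) + 1) if lookbehind.match(s, i)]
print(ok_positions)

# The full regex matches only where b follows: the first b in the string.
m = re.search(r'(?<=a)b', s)
print(m.span(), m.group())
```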

 

Important Notes About Lookbehind

The good news is that you can use lookbehind anywhere in the regex, not only at the start.

If you want to find a word not ending with an "s", you could use \b\w+(?<!s)\b.

This is definitely not the same as \b\w+[^s]\b.

When applied to John's, the former matches John and the latter matches John' (including the apostrophe).

I will leave it up to you to figure out why.

(Hint: \b matches between the apostrophe and the s).

The latter also doesn't match single-letter words like "a" or "I".

The correct regex without using lookbehind is \b\w*[^s\W]\b 

(star instead of plus, and \W in the character class).

Personally, I find the lookbehind easier to understand.

The last regex, which works correctly, has a double negation (the \W in the negated character class).

Double negations tend to be confusing to humans. Not to regex engines, though.

(Except perhaps for Tcl, which treats negated shorthands in negated character classes as an error.)
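The three patterns can be compared side by side in Python. The results dictionary is my own scaffolding; the inputs John's and a come from the text.

```python
import re

patterns = {
    "lookbehind": r'\b\w+(?<!s)\b',
    "naive class": r'\b\w+[^s]\b',
    "fixed class": r'\b\w*[^s\W]\b',
}

# Apply each pattern to the two inputs discussed in the text.
results = {
    name: (re.findall(p, "John's"), re.findall(p, "a"))
    for name, p in patterns.items()
}
for name, found in results.items():
    print(name, found)
```

The naive character class swallows the apostrophe and misses the single-letter word, while the other two patterns agree.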

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind,

because they cannot apply a regular expression backwards.

The regular expression engine needs to be able to figure out how many characters to step back before checking the lookbehind.

When evaluating the lookbehind, the regex engine determines the length of the regex inside the lookbehind,

steps back that many characters in the subject string, and then applies the regex inside the lookbehind

from left to right just as it would with a normal regex.

Many regex flavors, including those used by Perl and Python, only allow fixed-length strings.

You can use literal text, character escapes, Unicode escapes other than \X, and character classes.

You cannot use quantifiers or backreferences.

You can use alternation, but only if all alternatives have the same length.

These flavors evaluate lookbehind by first stepping back through the subject string for as many characters as the lookbehind needs,

and then attempting the regex inside the lookbehind from left to right.
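For Python's re module, the fixed-length rule can be observed directly: patterns that break it are rejected when they are compiled. The helper compiles() is a hypothetical name introduced for this sketch.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Alternatives of the same length are accepted.
same_length = compiles(r'(?<=ab|cd)x')
# Alternatives of different lengths are rejected.
different_length = compiles(r'(?<=a|bc)x')
# An unbounded quantifier inside the lookbehind is rejected.
with_plus = compiles(r'(?<=a+)x')

print(same_length, different_length, with_plus)
```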

PCRE is not fully Perl-compatible when it comes to lookbehind.

While Perl requires alternatives inside lookbehind to have the same length,

PCRE allows alternatives of variable length. 

PHP, Delphi, R, and Ruby also allow this.

Each alternative still has to be fixed-length.

Each alternative is treated as a separate fixed-length lookbehind.

Java takes things a step further by allowing finite repetition.

You still cannot use the star or plus, but you can use the question mark and the curly braces with the max parameter specified.

Java determines the minimum and maximum possible lengths of the lookbehind.

The lookbehind in the regex (?<!ab{2,4}c{3,5}d)test has 5 possible lengths.

It can be from 7 through 11 characters long.

When Java (version 6 or later) tries to match the lookbehind, it first steps back the minimum number of characters (7 in this example)

in the string and then evaluates the regex inside the lookbehind as usual, from left to right.

If it fails, Java steps back one more character and tries again.

If the lookbehind continues to fail, Java continues to step back until the lookbehind either matches

or it has stepped back the maximum number of characters (11 in this example).

This repeated stepping back through the subject string kills performance

when the number of possible lengths of the lookbehind grows.

Keep this in mind. Don't choose an arbitrarily large maximum number of repetitions to work around the lack of infinite quantifiers inside lookbehind.
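The stepping strategy described above can be emulated roughly in Python. This is only a sketch of the idea, not Java's actual implementation: the function lookbehind_matches and its parameters are hypothetical, and slicing the text hides the surrounding context from anchors such as \b.

```python
import re

def lookbehind_matches(inner, text, pos, min_len, max_len):
    """Try every allowed length, shortest first, ending at position pos."""
    for length in range(min_len, max_len + 1):
        start = pos - length
        if start < 0:
            break  # cannot step back past the start of the string
        # Evaluate the inner regex left to right on the stepped-back slice.
        if re.fullmatch(inner, text[start:pos]):
            return True
    return False

inner = r'ab{2,4}c{3,5}d'

# "abbbcccd" is 8 characters long: length 7 fails, length 8 succeeds.
text = "abbbcccdtest"
found = lookbehind_matches(inner, text, text.index("test"), 7, 11)
print(found)

# Too little text in front of "test": nothing to step back to.
short = lookbehind_matches(inner, "xtest", 1, 7, 11)
print(short)
```

Each extra possible length costs one more attempt at every position, which is why a large maximum repetition count hurts performance.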

Java 4 and 5 have bugs that cause lookbehind with alternation or variable quantifiers to fail

when it should succeed in some situations. These bugs were fixed in Java 6.

The only regex engines that allow you to use a full regular expression inside lookbehind,

including infinite repetition and backreferences, are the JGsoft engine and the .NET framework RegEx classes.

These regex engines really apply the regex inside the lookbehind backwards,

going through the regex inside the lookbehind and through the subject string from right to left.

They only need to evaluate the lookbehind once, regardless of how many different possible lengths it has.

Finally, flavors like Tcl and JavaScript engines that predate ECMAScript 2018 do not support lookbehind at all,

even though they do support lookahead.

Lookaround Is Atomic

The fact that lookaround is zero-length automatically makes it atomic.

As soon as the lookaround condition is satisfied, the regex engine forgets about everything inside the lookaround.

It will not backtrack inside the lookaround to try different permutations.

The only situation in which this makes any difference is when you use capturing groups inside the lookaround.

Since the regex engine does not backtrack into the lookaround, it will not try different permutations of the capturing groups.

For this reason, the regex (?=(\d+))\w+\1 never matches 123x12.

First the lookaround captures 123 into \1.

\w+ then matches the whole string and backtracks until it matches only 1.

Finally, \w+ fails since \1 cannot be matched at any position.

Now, the regex engine has nothing to backtrack to, and the overall regex fails.

The backtracking steps created by \d+ have been discarded.

It never gets to the point where the lookahead captures only 12.

Obviously, the regex engine does try further positions in the string.

If we change the subject string, the regex (?=(\d+))\w+\1 does match 56x56 in 456x56.

If you don't use capturing groups inside lookaround, then all this doesn't matter.

Either the lookaround condition can be satisfied or it cannot be.

In how many ways it can be satisfied is irrelevant.
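Python's re module behaves the same way for this example, as the following sketch shows.

```python
import re

pattern = re.compile(r'(?=(\d+))\w+\1')

# The lookahead captures 123 and is never asked to try 12 instead.
first = pattern.search("123x12")
print(first)

# Starting at the second character, the lookahead captures 56,
# and \w+ can backtrack until \1 matches the final 56.
second = pattern.search("456x56")
print(second.group(0), second.group(1))
```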

 
