正则表达式中的“量词”入门介绍

正则表达式中的量词可以用来指明某个字符串匹配的次数。将在以下描述“贪心量词”(Greedy)、“厌恶量词”(reluctant)、“占有量词”(possessive)这三种量词。(真的不知道怎么翻译)。乍一看量词X?(贪心量词)、X??(厌恶量词) 和X?+(占有量词)好像作用也差不多,因为它们的匹配规则都是匹配“X” 一次或者零次,即X出现一次或者一次都不出现。其实它们有着细微的差别,在本文中最后一部分会说明。

正则表达式中的“量词”入门介绍_第1张图片
1.png

让我们用贪心量词来创建三种不同的正则表达式:a?、a*、a+、。看看如果用空字符串来测匹配会得到什么结果。

先给出以下测试代码(直接使用终端编译运行即可):

public class RegexTestHarness {
    public static void main(String[] args){
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {

            Pattern pattern =
                    Pattern.compile(console.readLine("%nEnter your regex: "));

            Matcher matcher =
                    pattern.matcher(console.readLine("Enter input string to search: "));

            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text" +
                                " \"%s\" starting at " +
                                "index %d and ending at index %d.%n",
                        matcher.group(),
                        matcher.start(),
                        matcher.end());
                found = true;
            }
            if(!found){
                console.format("No match found.%n");
            }
        }
    }
}

Enter your regex: a?
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a*
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a+
Enter input string to search:
No match found.

零长度匹配

在上面的例子中,前两个例子可以匹配成功是因为表达式a?和a*允许字符串中不出现‘a’字符。你会看到开始和结束的下标都是0。空字符串""没有长度,因此这个正则在开始位置(即下标为0)即匹配成功。像这一类的匹配称之为“零长度匹配”。零长度匹配会在以下三种情况出现:
1.一个空字符串匹配。
2.和字符串的开端匹配,即下标为0的地方匹配。(开端即是空字符串)
3.和字符串结束的位置匹配。(结束即是空字符串)
4.任意两个字符之间,如"bc",b和c之间即存在一个空字符串""。

用“foo”这个字符串作为例子,下标的位置对应关系为

正则表达式中的“量词”入门介绍_第2张图片
Paste_Image.png

即index=0和index=3的地方会匹配。

零长度匹配是非常容易辨别出来,因为他们开始的位置和结束的位置是同一下标。

然我们再看几个列子,输入一个“a”字符。
Enter your regex: a?
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a*
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a+
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

以上三个量词都能找到字符“a”,但是前两个例子在下标为1处匹配,也就是字符的结尾处。记住,匹配器查找到下标0和1之间的“a”,该程序会一直匹配到没有匹配为止。

接下来输入"ababaaaab",看下会得到什么输出。输出如下:
Enter your regex: a?
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a*
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.

读者可以自己推敲为什么会得出以上结果。

如果要限制某个字符出现的次数,可以使用大括号"{}"。如:
匹配“aaa”
Enter your regex: a{3}
Enter input string to search: aa
No match found.

Enter your regex: a{3}
Enter input string to search: aaa
I found the text "aaa" starting at index 0 and ending at index 3.

Enter your regex: a{3}
Enter input string to search: aaaa
I found the text "aaa" starting at index 0 and ending at index 3.

对于第三个实例,要注意的是,当匹配了前三个a,后面的匹配和前面3个a没有任何关系,正则会继续和“aaa”后面的内容继续尝试匹配。

被量词修饰的子表达式 如:
Enter your regex: (dog){3}
Enter input string to search: dogdogdogdogdogdog
I found the text "dogdogdog" starting at index 0 and ending at index 9.
I found the text "dogdogdog" starting at index 9 and ending at index 18.

Enter your regex: dog{3}
Enter input string to search: dogdogdogdogdogdog
No match found.

对于第二个例子,正则表达式匹配的内容应该是"do",后面紧跟3个"g",因此第二个例子无法匹配。

再看多一个例子:
Enter your regex: [abc]{3}
Enter input string to search: abccabaaaccbbbc
I found the text "abc" starting at index 0 and ending at index 3.
I found the text "cab" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.
I found the text "ccb" starting at index 9 and ending at index 12.
I found the text "bbc" starting at index 12 and ending at index 15.

Enter your regex: abc{3}
Enter input string to search: abccabaaaccbbbc
No match found.

贪婪模式和厌恶模式和占有模式的区别
贪婪模式之所以被称为贪婪模式,是因为贪婪模式会尽可能的去匹配更多的内容,如果匹配不成功,将会进行回溯,直至匹配成功或者不成功。
看看下面例子:
Enter your regex: .*foo // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.
第一个例子采用贪婪模式,.
部分和整个字符串"xfooxxxxxxfoo"匹配,接着正则中foo部分和字符串"xfooxxxxxxfoo"的剩余部分匹配,即空字串"",发现匹配不成功。开始回溯, .*与"xfooxxxxxxfo"匹配,正则中的foo部分和"xfooxxxxxxfo"剩余部分进行匹配,即"o",发现不匹配,继续回溯。重复上诉过程,直到匹配成功。由于是贪婪模式,一旦成功,将不会继续匹配,匹配终止。

第二个例子采用的是厌恶模式(非贪婪模式),刚好和贪婪模式相反,一开始只会和字符串开始位置进行匹配,此例中,即和空字符串""匹配,匹配成功后,正则中的foo部分和字符串中的开头三个字符"xfo"匹配,发现匹配不成功。.*?开始和第一个字符匹配,即"x",匹配成功,接着正则中的foo和字符串中的"foo"匹配。至此整个正则第一次匹配成功。接着继续匹配,接下来的匹配内容为"xxxxxxfoo",采用相同的规则继续匹配,第二次匹配成功的字符串为"xxxxxxfoo"。直至整个字符串被消耗完毕才终止匹配。

第三个例子是占有模式。该模式只进行一次匹配。不进行回溯尝试,在次例中,.*+与"xfooxxxxxxfoo"匹配,正则中的foo和空字符串""匹配,匹配失败。将不进行回溯尝试。匹配结束。

以上内容大部分是翻译The Java™ Tutorials中关于正则的教程

你可能感兴趣的:(正则表达式中的“量词”入门介绍)