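RE2: A C++ Regular Expression Library in Practice

Introduction to RE2

RE2 is an efficient, principled regular expression library, implemented in C++ by Rob Pike and Russ Cox, two well-known engineers at Google. Both are also leading figures behind the Go language, and Go's regexp package is a Go implementation of RE2.

RE2 is a fast, safe, thread-friendly alternative to the backtracking regular expression engines used in PCRE, Perl, and Python. At the time of writing, RE2 supports Linux and most Unix platforms but not Windows (though you can hack it yourself if necessary).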

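Features of RE2

Backtracking engines are typically full of features and convenient syntactic sugar, but even small inputs can force them into exponential running time. RE2 uses automata theory to guarantee that regular expression searches run in time linear in the size of the input. RE2 implements memory limits, so that searches can be constrained to a fixed amount of memory. RE2 is engineered to use a small, fixed C++ stack footprint no matter what input or regular expression it must process, which makes it useful in multithreaded environments where thread stacks cannot grow arbitrarily large.

On large inputs, RE2 is often much faster than backtracking engines. Its use of automata theory lets it apply optimizations that the other engines cannot.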
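The difference between exponential and linear behavior can be seen with a toy matcher. The sketch below is not RE2's implementation; it is a minimal illustration that supports only literals, '.', and the postfix operators '?' and '*'. All names (Item, Parse, Backtrack, Simulate, PathologicalPattern) are hypothetical. Backtrack tries one alternative at a time, while Simulate tracks the set of all live pattern positions at once, which is the basic idea behind automaton-based matching.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One pattern item: a literal (or '.' for any character) plus a quantifier:
// 0 (exactly one), '?' (zero or one) or '*' (zero or more).
struct Item {
    char ch;
    char quant;
};

// Parses the toy syntax: literals, '.', and postfix '?' / '*'.
std::vector<Item> Parse(const std::string& pattern) {
    std::vector<Item> items;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        Item it{pattern[i], 0};
        if (i + 1 < pattern.size() &&
            (pattern[i + 1] == '?' || pattern[i + 1] == '*')) {
            it.quant = pattern[++i];
        }
        items.push_back(it);
    }
    return items;
}

bool Accepts(const Item& it, char c) { return it.ch == '.' || it.ch == c; }

// Backtracking full match; `steps` counts recursive calls.
bool Backtrack(const std::vector<Item>& p, std::size_t i,
               const std::string& s, std::size_t j, long& steps) {
    ++steps;
    if (i == p.size()) return j == s.size();
    const Item& it = p[i];
    const bool can_eat = j < s.size() && Accepts(it, s[j]);
    if (it.quant == '?') {
        return (can_eat && Backtrack(p, i + 1, s, j + 1, steps)) ||
               Backtrack(p, i + 1, s, j, steps);
    }
    if (it.quant == '*') {
        return (can_eat && Backtrack(p, i, s, j + 1, steps)) ||
               Backtrack(p, i + 1, s, j, steps);
    }
    return can_eat && Backtrack(p, i + 1, s, j + 1, steps);
}

// Marks position i as live, plus every position reachable by skipping
// items that are allowed to match nothing ('?' and '*').
void AddState(const std::vector<Item>& p, std::size_t i,
              std::vector<bool>& set) {
    while (!set[i]) {
        set[i] = true;
        if (i == p.size() || p[i].quant == 0) break;
        ++i;
    }
}

// State-set full match: one pass over the text, |pattern| work per character.
bool Simulate(const std::vector<Item>& p, const std::string& s, long& steps) {
    std::vector<bool> cur(p.size() + 1, false), next;
    AddState(p, 0, cur);
    for (char c : s) {
        next.assign(p.size() + 1, false);
        for (std::size_t i = 0; i < p.size(); ++i) {
            ++steps;
            if (!cur[i] || !Accepts(p[i], c)) continue;
            AddState(p, p[i].quant == '*' ? i : i + 1, next);
        }
        cur.swap(next);
    }
    return cur[p.size()];
}

// Builds a?a?...a?aa...a with n optional and n required 'a' items.
std::string PathologicalPattern(int n) {
    std::string pattern;
    for (int k = 0; k < n; ++k) pattern += "a?";
    pattern.append(static_cast<std::size_t>(n), 'a');
    return pattern;
}
```

Matching PathologicalPattern(n) against a string of n 'a' characters makes Backtrack explore a number of paths that grows exponentially with n, while Simulate performs a number of steps proportional to the pattern length times the text length.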

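Unlike most automata-based engines, RE2 implements almost all of the common Perl and PCRE features and syntactic sugar. It finds the leftmost-first match, the same match that Perl would find, and it can return submatch information. The one significant exception is that RE2 drops support for backreferences¹ and generalized zero-width assertions, because they cannot be implemented efficiently.

For those who want a simpler syntax, RE2 has a POSIX mode that accepts only the POSIX egrep operators and implements leftmost-longest overall matching.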

¹ Technical note: there's a difference between submatches and backreferences. Submatches let you find out what certain subexpressions matched after the match is over, so that you can find out, after matching dogcat against (cat|dog)(cat|dog), that \1 is dog and \2 is cat. Backreferences let you use those subexpressions during the match, so that (cat|dog)\1 matches catcat and dogdog but not catdog or dogcat.

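RE2 supports submatch extraction, but not backreferences.

If you absolutely need backreferences and generalized assertions, RE2 is not for you, but you might be interested in irregexp, Google Chrome's regular expression engine.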
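The distinction between submatches and backreferences from the technical note can be sketched in code. Because RE2 rejects \1, this illustration uses the C++ standard library's std::regex instead, whose default ECMAScript grammar supports backreferences. The function names ExtractPair and IsDoubled are hypothetical.

```cpp
#include <regex>
#include <string>

// Submatch extraction: report what each group matched after the match is over.
// Returns "<group 1>,<group 2>", or an empty string if there is no match.
std::string ExtractPair(const std::string& text) {
    static const std::regex re("(cat|dog)(cat|dog)");
    std::smatch m;
    if (!std::regex_match(text, m, re)) return "";
    return m[1].str() + "," + m[2].str();
}

// Backreference: \1 must repeat, during the match, whatever group 1 matched.
bool IsDoubled(const std::string& text) {
    static const std::regex re(R"((cat|dog)\1)");
    return std::regex_match(text, re);
}
```

ExtractPair is the kind of query RE2 can answer; IsDoubled is the kind it deliberately cannot express.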

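Getting Started with RE2

Installation

You can download a release tarball and unpack it to install. This section describes another way to install, building from the source repository:

You need Mercurial SCM and a C++ compiler (g++ or a clone).

Download the code and install it: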


    hg clone http://re2.googlecode.com/hg re2
    cd re2
    make test
    make testinstall
    sudo make install

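On BSD systems, use gmake instead of make.

Using the RE2 Library

To build a C++ application with RE2, include the re2/re2.h header in your code and link with -lre2, plus -lpthread in a multithreaded environment.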

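Syntax

In POSIX mode, RE2 accepts standard POSIX (egrep) regular expression syntax. In Perl mode, RE2 accepts most Perl operators. The only excluded ones are those that require backtracking (with its potential for exponential running time) to implement: backreferences (submatches are still supported) and generalized assertions. RE2 defaults to Perl mode.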

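C++ High-Level Interface

There are two basic operations:

  • RE2::FullMatch: requires the regular expression to match the entire input text.
  • RE2::PartialMatch: looks for a match within a substring of the input text. In POSIX mode it returns the leftmost-longest match; in Perl mode it returns the same match that Perl would have chosen.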

For example:

vi re2_high_interface_test.cc

#include <cassert>
#include <iostream>
#include <re2/re2.h>

int
main(void)
{
    assert(RE2::FullMatch("hello", "h.*o"));
    assert(!RE2::FullMatch("hello", "e"));

    assert(RE2::PartialMatch("hello", "h.*o"));
    assert(RE2::PartialMatch("hello", "e"));

    std::cout << "Ok" << std::endl;
    return 0;
}

Compile the program:

g++ -o re2_high_interface_test re2_high_interface_test.cc -lre2

Run re2_high_interface_test. The program runs normally and prints Ok.

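Submatch Extraction

Both matching functions take additional arguments in which submatches are stored. Each argument can be a pointer to a string, to an integer type, or to a StringPiece. A StringPiece is a pointer to the original input text together with a length. It behaves like a string but does not carry its own storage. As with any pointer, you must be careful not to use a StringPiece once the original text has been deleted or has gone out of scope.

Example:

vi re2_submatch_ex_test.cc

#include <cassert>
#include <iostream>
#include <string>
#include <re2/re2.h>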

int
main(void)
{
    int i;
    std::string s;
    assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", &s, &i));
    assert(s == "ruby");
    assert(i == 1234);

    // Fails: "ruby" cannot be parsed as an integer.
    assert(!RE2::FullMatch("ruby", "(.+)", &i));

    // Success; does not extract the number.
    assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", &s));

    // Success; skips NULL argument.
    assert(RE2::FullMatch("ruby:1234", "(\\w+):(\\d+)", (void*)NULL, &i));

    // Fails: integer overflow keeps value from being stored in i.
    assert(!RE2::FullMatch("ruby:123456789123", "(\\w+):(\\d+)", &s, &i));

    std::cout << "Ok" << std::endl;
    return 0;
}

Compile the program:

g++ -o re2_submatch_ex_test re2_submatch_ex_test.cc -lre2

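Precompiled Regular Expressions

The examples above compile the regular expression on every call. Instead, you can compile it once into an RE2 object and reuse that object on each call.

Example:

vi re2_prec_re_test.cc

#include <cassert>
#include <iostream>
#include <string>
#include <re2/re2.h>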

int
main(void)
{
    int i;
    std::string s;
    RE2 re("(\\w+):(\\d+)");
    assert(re.ok());  // compiled; if not, see re.error();

    assert(RE2::FullMatch("ruby:1234", re, &s, &i));
    assert(RE2::FullMatch("ruby:1234", re, &s));
    assert(RE2::FullMatch("ruby:1234", re, (void*)NULL, &i));
    assert(!RE2::FullMatch("ruby:123456789123", re, &s, &i));

    std::cout << "Ok" << std::endl;
    return 0;
}

Compile the program:

g++ -o re2_prec_re_test re2_prec_re_test.cc -lre2

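Options

The RE2 constructor takes an optional second argument that changes RE2's default options. For example, the predefined Quiet option suppresses the error message that is normally printed when a regular expression fails to parse:

vi re2_options_test.cc

#include <cassert>
#include <iostream>
#include <re2/re2.h>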

int
main(void)
{
    RE2 re("(ab", RE2::Quiet);  // don't write to stderr for parser failure
    assert(!re.ok());  // can check re.error() for details

    std::cout << "Ok" << std::endl;
    return 0;
}

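Compile the program:

g++ -o re2_options_test re2_options_test.cc -lre2

Other useful predefined options are Latin1 (disables UTF-8) and POSIX (uses POSIX syntax and leftmost-longest matching).

You can also declare your own RE2::Options object and configure it as you like. All the options are listed in re2/re2.h.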

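Unicode Normalization

RE2 operates on Unicode code points and makes no attempt at normalization. For example, the regular expression /ü/ (U+00FC, u with diaeresis) does not match the input "ü" (U+0075 U+0308, u followed by a combining diaeresis). Normalization is a long, involved topic. The simplest solution, if you need such matches, is to normalize both the regular expression and the input in a preprocessing step before using RE2. For more details on the general topic, see http://www.unicode.org/reports/tr15/.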
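
Why the two spellings of "ü" cannot match each other is visible at the byte level. The sketch below does not use RE2. It encodes both code point sequences as UTF-8 and applies a toy preprocessing step that knows exactly one composition. The names Utf8 and ComposeUDiaeresis are hypothetical, and a real program would use a full normalization library such as ICU rather than this single-rule stand-in.

```cpp
#include <cstddef>
#include <string>

// Encodes one Unicode code point (U+0000..U+10FFFF) as UTF-8.
std::string Utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Toy preprocessing step: rewrites u + U+0308 as the single code point U+00FC.
// Real normalization (NFC) covers far more than this one rule.
std::string ComposeUDiaeresis(std::string text) {
    const std::string decomposed = Utf8(0x75) + Utf8(0x308);
    const std::string composed = Utf8(0xFC);
    for (std::size_t pos = text.find(decomposed); pos != std::string::npos;
         pos = text.find(decomposed, pos + composed.size())) {
        text.replace(pos, decomposed.size(), composed);
    }
    return text;
}
```

Applying the same preprocessing to both the pattern and the input before handing them to a matcher makes the two spellings compare equal.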

Additional Tips and Tricks

For advanced usage, such as constructing your own argument lists, using RE2 as a lexer, or parsing hex, octal, and C-radix numbers, see re2.h.

Backtracking vs. Non-Backtracking

The illustrations for this section come from the slides of the talk "sregex: matching Perl 5 regexes on data streams".

RE2 Wrappers

An Inferno wrapper is at http://code.google.com/p/inferno-re2/.

A Python wrapper is at http://github.com/facebook/pyre2/.

A Ruby wrapper is at http://github.com/axic/rre2/.

An Erlang wrapper is at http://github.com/tuncer/re2/.

A Perl wrapper is at http://search.cpan.org/~dgl/re-engine-RE2-0.05/lib/re/engine/RE2.pm.

An Eiffel wrapper is at http://sourceforge.net/projects/eiffelre2/.

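Syntax Supported by RE2

This section lists the regular expression syntax accepted by RE2. It also lists syntax accepted by PCRE, Perl, and Vim. Syntax that RE2 does not support is marked (NOT SUPPORTED).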

 
Single characters:
. any character, including newline (s=true)
[xyz] character class
[^xyz] negated character class
\d Perl character class
\D negated Perl character class
[:alpha:] ASCII character class
[:^alpha:] negated ASCII character class
\pN Unicode character class (one-letter name)
\p{Greek} Unicode character class
\PN negated Unicode character class (one-letter name)
\P{Greek} negated Unicode character class
 
Composites:
xy x followed by y
x|y x or y (prefer x)
 
Repetitions:
x* zero or more x, prefer more
x+ one or more x, prefer more
x? zero or one x, prefer one
x{n,m} n or n+1 or ... or m x, prefer more
x{n,} n or more x, prefer more
x{n} exactly n x
x*? zero or more x, prefer fewer
x+? one or more x, prefer fewer
x?? zero or one x, prefer zero
x{n,m}? n or n+1 or ... or m x, prefer fewer
x{n,}? n or more x, prefer fewer
x{n}? exactly n x
x{} (≡ x*) (NOT SUPPORTED) VIM
x{-} (≡ x*?) (NOT SUPPORTED) VIM
x{-n} (≡ x{n}?) (NOT SUPPORTED) VIM
x= (≡ x?) (NOT SUPPORTED) VIM
 
Possessive repetitions:
x*+ zero or more x, possessive (NOT SUPPORTED)
x++ one or more x, possessive (NOT SUPPORTED)
x?+ zero or one x, possessive (NOT SUPPORTED)
x{n,m}+ n or ... or m x, possessive (NOT SUPPORTED)
x{n,}+ n or more x, possessive (NOT SUPPORTED)
x{n}+ exactly n x, possessive (NOT SUPPORTED)
 
Grouping:
(re) numbered capturing group
(?P<name>re) named & numbered capturing group
(?<name>re) named & numbered capturing group (NOT SUPPORTED)
(?'name're) named & numbered capturing group (NOT SUPPORTED)
(?:re) non-capturing group
(?flags) set flags within current group; non-capturing
(?flags:re) set flags during re; non-capturing
(?#text) comment (NOT SUPPORTED)
(?|x|y|z) branch numbering reset (NOT SUPPORTED)
(?>re) possessive match of re (NOT SUPPORTED)
re@> possessive match of re (NOT SUPPORTED) VIM
%(re) non-capturing group (NOT SUPPORTED) VIM
 
Flags:
i case-insensitive (default false)
m multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
s let . match \n (default false)
U ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false)
Flag syntax is xyz (set) or -xyz (clear) or xy-z (set xy, clear z).
 
Empty strings:
^ at beginning of text or line (m=true)
$ at end of text (like \z not \Z) or line (m=true)
\A at beginning of text
\b at word boundary (\w on one side and \W, \A, or \z on the other)
\B not a word boundary
\G at beginning of subtext being searched (NOT SUPPORTED) PCRE
\G at end of last match (NOT SUPPORTED) PERL
\Z at end of text, or before newline at end of text (NOT SUPPORTED)
\z at end of text
(?=re) before text matching re (NOT SUPPORTED)
(?!re) before text not matching re (NOT SUPPORTED)
(?<=re) after text matching re (NOT SUPPORTED)
(?<!re) after text not matching re (NOT SUPPORTED)
re& before text matching re (NOT SUPPORTED) VIM
re@= before text matching re (NOT SUPPORTED) VIM
re@! before text not matching re (NOT SUPPORTED) VIM
re@<= after text matching re (NOT SUPPORTED) VIM
re@<! after text not matching re (NOT SUPPORTED) VIM
\zs sets start of match (= \K) (NOT SUPPORTED) VIM
\ze sets end of match (NOT SUPPORTED) VIM
\%^ beginning of file (NOT SUPPORTED) VIM
\%$ end of file (NOT SUPPORTED) VIM
\%V on screen (NOT SUPPORTED) VIM
\%# cursor position (NOT SUPPORTED) VIM
\%'m mark m position (NOT SUPPORTED) VIM
\%23l in line 23 (NOT SUPPORTED) VIM
\%23c in column 23 (NOT SUPPORTED) VIM
\%23v in virtual column 23 (NOT SUPPORTED) VIM
 
Escape sequences:
\a bell (≡ \007)
\f form feed (≡ \014)
\t horizontal tab (≡ \011)
\n newline (≡ \012)
\r carriage return (≡ \015)
\v vertical tab character (≡ \013)
\* literal *, for any punctuation character *
\123 octal character code (up to three digits)
\x7F hex character code (exactly two digits)
\x{10FFFF} hex character code
\C match a single byte even in UTF-8 mode
\Q...\E literal text ... even if ... has punctuation
 
\1 backreference (NOT SUPPORTED)
\b backspace (NOT SUPPORTED) (use \010)
\cK control char ^K (NOT SUPPORTED) (use \001 etc)
\e escape (NOT SUPPORTED) (use \033)
\g1 backreference (NOT SUPPORTED)
\g{1} backreference (NOT SUPPORTED)
\g{+1} backreference (NOT SUPPORTED)
\g{-1} backreference (NOT SUPPORTED)
\g{name} named backreference (NOT SUPPORTED)
\g<name> subroutine call (NOT SUPPORTED)
\g'name' subroutine call (NOT SUPPORTED)
\k<name> named backreference (NOT SUPPORTED)
\k'name' named backreference (NOT SUPPORTED)
\lX lowercase X (NOT SUPPORTED)
\ux uppercase x (NOT SUPPORTED)
\L...\E lowercase text ... (NOT SUPPORTED)
\K reset beginning of $0 (NOT SUPPORTED)
\N{name} named Unicode character (NOT SUPPORTED)
\R line break (NOT SUPPORTED)
\U...\E upper case text ... (NOT SUPPORTED)
\X extended Unicode sequence (NOT SUPPORTED)
 
\%d123 decimal character 123 (NOT SUPPORTED) VIM
\%xFF hex character FF (NOT SUPPORTED) VIM
\%o123 octal character 123 (NOT SUPPORTED) VIM
\%u1234 Unicode character 0x1234 (NOT SUPPORTED) VIM
\%U12345678 Unicode character 0x12345678 (NOT SUPPORTED) VIM
 
Character class elements:
x single character
A-Z character range (inclusive)
\d Perl character class
[:foo:] ASCII character class foo
\p{Foo} Unicode character class Foo
\pF Unicode character class F (one-letter name)
 
Named character classes as character class elements:
[\d] digits (≡ \d)
[^\d] not digits (≡ \D)
[\D] not digits (≡ \D)
[^\D] not not digits (≡ \d)
[[:name:]] named ASCII class inside character class (≡ [:name:])
[^[:name:]] named ASCII class inside negated character class (≡ [:^name:])
[\p{Name}] named Unicode property inside character class (≡ \p{Name})
[^\p{Name}] named Unicode property inside negated character class (≡ \P{Name})
 
Perl character classes:
\d digits (≡ [0-9])
\D not digits (≡ [^0-9])
\s whitespace (≡ [\t\n\f\r ])
\S not whitespace (≡ [^\t\n\f\r ])
\w word characters (≡ [0-9A-Za-z_])
\W not word characters (≡ [^0-9A-Za-z_])
 
\h horizontal space (NOT SUPPORTED)
\H not horizontal space (NOT SUPPORTED)
\v vertical space (NOT SUPPORTED)
\V not vertical space (NOT SUPPORTED)
 
ASCII character classes:
[:alnum:] alphanumeric (≡ [0-9A-Za-z])
[:alpha:] alphabetic (≡ [A-Za-z])
[:ascii:] ASCII (≡ [\x00-\x7F])
[:blank:] blank (≡ [\t ])
[:cntrl:] control (≡ [\x00-\x1F\x7F])
[:digit:] digits (≡ [0-9])
[:graph:] graphical (≡ [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~])
[:lower:] lower case (≡ [a-z])
[:print:] printable (≡ [ -~] == [ [:graph:]])
[:punct:] punctuation (≡ [!-/:-@[-`{-~])
[:space:] whitespace (≡ [\t\n\v\f\r ])
[:upper:] upper case (≡ [A-Z])
[:word:] word characters (≡ [0-9A-Za-z_])
[:xdigit:] hex digit (≡ [0-9A-Fa-f])
 
Unicode character class names--general category:
C other
Cc control
Cf format
Cn unassigned code points (NOT SUPPORTED)
Co private use
Cs surrogate
L letter
LC cased letter (NOT SUPPORTED)
L& cased letter (NOT SUPPORTED)
Ll lowercase letter
Lm modifier letter
Lo other letter
Lt titlecase letter
Lu uppercase letter
M mark
Mc spacing mark
Me enclosing mark
Mn non-spacing mark
N number
Nd decimal number
Nl letter number
No other number
P punctuation
Pc connector punctuation
Pd dash punctuation
Pe close punctuation
Pf final punctuation
Pi initial punctuation
Po other punctuation
Ps open punctuation
S symbol
Sc currency symbol
Sk modifier symbol
Sm math symbol
So other symbol
Z separator
Zl line separator
Zp paragraph separator
Zs space separator
 
Unicode character class names--scripts:
Arabic Arabic
Armenian Armenian
Balinese Balinese
Bengali Bengali
Bopomofo Bopomofo
Braille Braille
Buginese Buginese
Buhid Buhid
Canadian_Aboriginal Canadian Aboriginal
Carian Carian
Cham Cham
Cherokee Cherokee
Common characters not specific to one script
Coptic Coptic
Cuneiform Cuneiform
Cypriot Cypriot
Cyrillic Cyrillic
Deseret Deseret
Devanagari Devanagari
Ethiopic Ethiopic
Georgian Georgian
Glagolitic Glagolitic
Gothic Gothic
Greek Greek
Gujarati Gujarati
Gurmukhi Gurmukhi
Han Han
Hangul Hangul
Hanunoo Hanunoo
Hebrew Hebrew
Hiragana Hiragana
Inherited inherit script from previous character
Kannada Kannada
Katakana Katakana
Kayah_Li Kayah Li
Kharoshthi Kharoshthi
Khmer Khmer
Lao Lao
Latin Latin
Lepcha Lepcha
Limbu Limbu
Linear_B Linear B
Lycian Lycian
Lydian Lydian
Malayalam Malayalam
Mongolian Mongolian
Myanmar Myanmar
New_Tai_Lue New Tai Lue (aka Simplified Tai Lue)
Nko Nko
Ogham Ogham
Ol_Chiki Ol Chiki
Old_Italic Old Italic
Old_Persian Old Persian
Oriya Oriya
Osmanya Osmanya
Phags_Pa 'Phags Pa
Phoenician Phoenician
Rejang Rejang
Runic Runic
Saurashtra Saurashtra
Shavian Shavian
Sinhala Sinhala
Sundanese Sundanese
Syloti_Nagri Syloti Nagri
Syriac Syriac
Tagalog Tagalog
Tagbanwa Tagbanwa
Tai_Le Tai Le
Tamil Tamil
Telugu Telugu
Thaana Thaana
Thai Thai
Tibetan Tibetan
Tifinagh Tifinagh
Ugaritic Ugaritic
Vai Vai
Yi Yi
 
Vim character classes:
\i identifier character (NOT SUPPORTED) VIM
\I \i except digits (NOT SUPPORTED) VIM
\k keyword character (NOT SUPPORTED) VIM
\K \k except digits (NOT SUPPORTED) VIM
\f file name character (NOT SUPPORTED) VIM
\F \f except digits (NOT SUPPORTED) VIM
\p printable character (NOT SUPPORTED) VIM
\P \p except digits (NOT SUPPORTED) VIM
\s whitespace character (≡ [ \t]) (NOT SUPPORTED) VIM
\S non-white space character (≡ [^ \t]) (NOT SUPPORTED) VIM
\d digits (≡ [0-9]) VIM
\D not \d VIM
\x hex digits (≡ [0-9A-Fa-f]) (NOT SUPPORTED) VIM
\X not \x (NOT SUPPORTED) VIM
\o octal digits (≡ [0-7]) (NOT SUPPORTED) VIM
\O not \o (NOT SUPPORTED) VIM
\w word character VIM
\W not \w VIM
\h head of word character (NOT SUPPORTED) VIM
\H not \h (NOT SUPPORTED) VIM
\a alphabetic (NOT SUPPORTED) VIM
\A not \a (NOT SUPPORTED) VIM
\l lowercase (NOT SUPPORTED) VIM
\L not lowercase (NOT SUPPORTED) VIM
\u uppercase (NOT SUPPORTED) VIM
\U not uppercase (NOT SUPPORTED) VIM
\_x \x plus newline, for any x (NOT SUPPORTED) VIM
 
Vim flags:
\c ignore case (NOT SUPPORTED) VIM
\C match case (NOT SUPPORTED) VIM
\m magic (NOT SUPPORTED) VIM
\M nomagic (NOT SUPPORTED) VIM
\v verymagic (NOT SUPPORTED) VIM
\V verynomagic (NOT SUPPORTED) VIM
\Z ignore differences in Unicode combining characters (NOT SUPPORTED) VIM
 
Magic:
(?{code}) arbitrary Perl code (NOT SUPPORTED) PERL
(??{code}) postponed arbitrary Perl code (NOT SUPPORTED) PERL
(?n) recursive call to regexp capturing group n (NOT SUPPORTED)
(?+n) recursive call to relative group +n (NOT SUPPORTED)
(?-n) recursive call to relative group -n (NOT SUPPORTED)
(?C) PCRE callout (NOT SUPPORTED) PCRE
(?R) recursive call to entire regexp (≡ (?0)) (NOT SUPPORTED)
(?&name) recursive call to named group (NOT SUPPORTED)
(?P=name) named backreference (NOT SUPPORTED)
(?P>name) recursive call to named group (NOT SUPPORTED)
(?(cond)true|false) conditional branch (NOT SUPPORTED)
(?(cond)true) conditional branch (NOT SUPPORTED)
(*ACCEPT) make regexps more like Prolog (NOT SUPPORTED)
(*COMMIT) (NOT SUPPORTED)
(*F) (NOT SUPPORTED)
(*FAIL) (NOT SUPPORTED)
(*MARK) (NOT SUPPORTED)
(*PRUNE) (NOT SUPPORTED)
(*SKIP) (NOT SUPPORTED)
(*THEN) (NOT SUPPORTED)
(*ANY) set newline convention (NOT SUPPORTED)
(*ANYCRLF) (NOT SUPPORTED)
(*CR) (NOT SUPPORTED)
(*CRLF) (NOT SUPPORTED)
(*LF) (NOT SUPPORTED)
(*BSR_ANYCRLF) set \R convention (NOT SUPPORTED) PCRE
(*BSR_UNICODE) (NOT SUPPORTED) PCRE
 

Further Reading

  1. "perlre - Perl regular expressions" http://perldoc.perl.org/perlre.html

  2. "Implementing Regular Expressions" http://swtch.com/~rsc/regexp

  3. The re1 project: http://code.google.com/p/re1

  4. The re2 project: http://code.google.com/p/re2

  5. sregex: A non-backtracking regex engine matching on data streams

  6. sregex: matching Perl 5 regexes on data streams: http://agentzh.org/misc/slides/yapc-na-2013-sregex.pdf

References

  • Official RE2 documentation: http://code.google.com/p/re2/wiki/

  • sregex: matching Perl 5 regexes on data streams: http://agentzh.org/misc/slides/yapc-na-2013-sregex.pdf
