Perl 5 excels at processing text; Perl 6 was designed to process languages, and it has many built-in data types for that purpose:
Regex Match Grammar AST Macro
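Good news for anyone who has learned Perl 5: the default regex mode in Perl 6 is close to Perl 5's /xms mode. Whitespace in a pattern is ignored, the dot matches newlines, and line boundaries get their own anchors (^^ and $$).
Perl 6 uses the smartmatch operator ~~ to perform a match:
> if "string" ~~ / \w+ / { say 'string matches \w+' }
A regex can be written in several ways: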
> if "str" ~~ m/\w+/ { say "str match words" }
> if "str" ~~ rx/\w+/ { say "str match word" }
> if "str" ~~ m{\w+} { say "str match word" }
> if "str" ~~ m<\w+> { say "str match word" }
> if "str" ~~ m[\w+] { say "str match word" }
In Perl 6 regexes, whitespace is ignored, \s also matches newlines, and the dot . matches any character, including a newline:
> if "a\nb" ~~ / ... / { say "dot could match any char" }
> if " \t\n" ~~ / ^ \s+ $ / { say '\s could match \t \n' }
After every match, Perl 6 stores the resulting match object in the variable $/:
if 'abcdef' ~~ / de / {
# the tilde coerces the match to a string
say ~$/; # de
say $/.prematch; # abc
say $/.postmatch; # f
say $/.from; # 3
say $/.to; # 5
};
Perl 6 still uses (..) for capturing, but backreference indices now start at 0:
> if "hello hello" ~~ / (\w+) <ws> $0 / { say "match two same word" }
Captured values are now stored in an array rather than in a series of separate variables:
> if "hello" ~~ / (\w+) / { say "match $/[0]" }
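As a minimal sketch of how positional captures can be read back, the snippet below matches a date-like string and pulls the pieces out of $/. The sample string and the pattern are made up for illustration.

```raku
# Positional captures live in the match object $/ and are numbered from 0.
# The input string is an illustrative assumption, not taken from the text above.
if "2024-05-17" ~~ / (\d+) '-' (\d+) '-' (\d+) / {
    say ~$0;              # $0 is shorthand for $/[0]  -> 2024
    say ~$/[1];           # second capture as a string -> 05
    say $/.list.elems;    # number of positional captures -> 3
}
```

Note that the hyphen has to be quoted: in a Perl 6 regex, any character that is not alphanumeric or an underscore must be quoted or escaped to be matched literally.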
The following Perl 5 character class shorthands are still valid:
\d and \D
'ab42' ~~ /\d/ and say ~$/; # 4
'ab42' ~~ /\D/ and say ~$/; # a
Perl 6 character class shorthands match across the whole Unicode range:
"\x[0035]" ~~ /\d/ and say "match"; # match (U+0035)
"\x[07C2]" ~~ /\d/ and say "match"; # match (U+07C2)
"\x[0E53]" ~~ /\d/ and say "match"; # match (U+0E53)
\w and \W
"abc123ABC_" ~~ /^\w+$/ and say "match"; # match
\h and \H
\v and \V
"\x[000A]" ~~ /\v/ and say "match"; # match (U+000A)
"\x[000B]" ~~ /\v/ and say "match"; # match (U+000B)
"\x[000C]" ~~ /\v/ and say "match"; # match (U+000C)
"\x[0085]" ~~ /\v/ and say "match"; # match (U+0085)
"\x[2029]" ~~ /\v/ and say "match"; # match (U+2029)
\n and \N
\n matches a logical newline; on Windows it also matches the two-character CR LF sequence.
\t matches a tab (U+0009)
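The Unicode behaviour of the shorthands can be checked with a short sketch like the one below. It builds single-character strings from code points with the "\x[...]" escape and counts how many of them match. The code points are the ones listed above, and the variable names are made up for illustration.

```raku
# Single-character strings built from code points (names are illustrative).
my @digits   = "\x[0035]", "\x[07C2]", "\x[0E53]";   # decimal digits from three scripts
my @vertical = "\x[000A]", "\x[000B]", "\x[2029]";   # vertical whitespace characters

# grep with a regex keeps only the elements that match it.
say @digits.grep(/ ^ \d $ /).elems;      # 3
say @vertical.grep(/ ^ \v $ /).elems;    # 3
say ?("a\tb" ~~ / a \t b /);             # True
```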
<:L> Letter
<:LC> Cased_Letter
<:Lu> Uppercase_Letter
<:Ll> Lowercase_Letter
<:Lt> Titlecase_Letter
<:Lm> Modifier_Letter
<:Lo> Other_Letter
<:M> Mark
<:Mn> Nonspacing_Mark
<:Mc> Spacing_Mark
<:Me> Enclosing_Mark
<:N> Number
<:Nd> Decimal_Number (also Digit)
<:Nl> Letter_Number
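Every property has a negated form that matches its complement: <:!L> <:!LC> …
Several operators are allowed inside a character class:
+ | - & ^
+ set union
| set union
- set difference
& set intersection
^ symmetric difference (XOR): in exactly one of the two sets, not both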
<:Ll+:Number>
<+ :Lowercase_Letter + :Number>
<[a..c123]>
<[\d] - [13579]>
<[02468]>
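A minimal sketch of the + and - operators in use is shown below. It tests a handful of made-up sample characters against a union of two Unicode properties and against a difference of two enumerated classes. Only + and - are exercised here.

```raku
# Sample characters are illustrative assumptions.
my @chars = 'a', 'Z', '4', '7', '_';
for @chars -> $c {
    # union: lowercase letter OR decimal digit
    my $lower-or-digit = ?($c ~~ / ^ <:Ll + :Nd> $ /);
    # difference: ASCII digits minus the odd ones, i.e. even ASCII digits
    my $even-digit     = ?($c ~~ / ^ <[0..9] - [13579]> $ /);
    say "'$c' lower-or-digit: $lower-or-digit, even digit: $even-digit";
}
```

The second class is written with <[0..9]> rather than \d on purpose: \d covers every Unicode decimal digit, so <[\d] - [13579]> is a larger set than <[02468]>.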
+ \w+ one or more
* \w* zero or more
? \w? zero or one match
**min..max \w**3..5
**min..* \w**4..*
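The range quantifier is the one most likely to be unfamiliar, so here is a small sketch of it. The input strings are made up for illustration.

```raku
# ** takes either an exact count or a min..max range; it is greedy by default.
say ~$/ if 'aaaaaa' ~~ / a ** 2..4 /;    # aaaa
say ~$/ if 'aaaaaa' ~~ / a ** 3 /;       # aaa
say ?('ab' ~~ / ^ a ** 3..* /);          # False: fewer than three "a"s
```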
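To match characters literally there is no need for \Q..\E; just write them as a quoted string:
'[[]]' ~~ / '[[]]' / and say "match"; # match
'{()}' ~~ / '{()}' / and say "match"; # match (single quotes: "{...}" would interpolate a closure)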
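Besides parentheses, which group and capture, there are two ways to group without capturing: square brackets and quoted strings:
/ f[oo]* / # will match "f", "foo", "foooo"
/ f'oo'* / # same as above
/ f"oo"* / # same as above
/f|fo|foo/ # | tries to match the longest alternative
/f||fo||foo/ # || tries the alternatives in order; the first one that matches wins
/<[a..z]>+ & [...]/ # & requires both sides to match the same text
/<[a..z]>+ && [...]/ # && does the same, trying the sides from left to right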
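The difference between the two kinds of alternation is easiest to see on the same input. The following sketch uses the alternatives from the lines above against the sample string 'foo'.

```raku
# | uses longest-token matching, so the longest alternative wins.
'foo' ~~ / f | fo | foo /;
say ~$/;    # foo

# || tries the alternatives in the order written and stops at the first success.
'foo' ~~ / f || fo || foo /;
say ~$/;    # f
```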
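^ matches the start of the string
^^ matches the start of a line
$ matches the end of the string
$$ matches the end of a line
<< matches a left word boundary
>> matches a right word boundary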
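A short sketch of the anchors on a multi-line string follows. The sample text and the variable name $text are assumptions for illustration.

```raku
my $text = "one\ntwo\nthree";

say ?($text ~~ / ^ two /);            # False: "two" is not at the start of the string
say ?($text ~~ / ^^ two $$ /);        # True: "two" fills a whole line
say ~$/ if $text ~~ / << t\w+ >> /;   # two: the first whole word starting with "t"
```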
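In Perl 6, a variable interpolated into a regex is matched as a literal string; its contents are never reinterpreted as regex syntax: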
my $foo = "ab*c";
my @bar = <one two three>;
/$foo @bar/ # exactly the same as / 'ab*c' [one|two|three] /
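The literal behaviour can be checked with a sketch like the one below. It also shows the <$foo> form, which is not covered in the text above: it asks for the string to be compiled as regex source instead of being matched literally. The test strings are made up.

```raku
my $foo = "ab*c";
my @bar = <one two three>;

say ?("ab*c"  ~~ / $foo /);      # True: the "*" is matched literally
say ?("abbbc" ~~ / $foo /);      # False: "b*" is not treated as a quantifier
say ?("abbbc" ~~ / <$foo> /);    # True: <$foo> interprets the string as a regex

say ~$/ if "say two" ~~ / @bar /;    # two: an array acts as an alternation of its elements
```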
$foo ~~ m:i/ foo / # ignore case: matches "foo", "FOO", ...
$foo ~~ m:P5/[a-z]/ # use Perl 5 regex syntax
$foo ~~ m:g/ foo / # global: match as many times as possible
$foo ~~ m:s/ foo / # whitespace in the pattern is significant
$foo ~~ m:ratchet/foo|ddd/ # don't do any backtracking
m:pos($p)/ pattern / # match at position $p
There are other modifiers as well:
:basechar Ignore accents and other marks
:continue Continue matching from where the previous match ended
:bytes dot matches bytes
:codes dot matches codepoints
:chars dot matches "characters" at the current Unicode level
There are also modifiers that select matches by count or by position:
$_ = "foo bar baz blat";
m:3x/ a / # matches three times: the "a" in "bar", "baz" and "blat"
m:nth(3)/\w+ / # matches "baz"
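Here is a minimal runnable sketch of a few of these modifiers, using the same sample string as above. The variable name @all is an assumption for illustration.

```raku
$_ = "foo bar baz blat";

my @all = m:g/ a /;              # :g collects every match in a list
say @all.elems;                  # 3

say ~$/ if m:nth(3)/ \w+ /;      # baz: the third word
say ?("FOO" ~~ m:i/ foo /);      # True: case is ignored
```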
Modifiers can also be placed inside the regex, at the start of a group:
/ a [ :i foo ] z/ # matches "afooz", "aFooz",...
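The :sigspace modifier is very useful: whitespace in the pattern becomes significant and matches whitespace in the input:
m:sigspace/One small step/ # roughly the same as /One\s+small\s+step/
mm/One small step/ # shorthand for the line above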
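A quick sketch of :sigspace in action, with made-up input strings:

```raku
# With :sigspace, each run of whitespace in the pattern must match whitespace
# between the words of the input; the exact amount and kind does not matter.
say ?("One  small\tstep" ~~ m:sigspace/One small step/);    # True
say ?("Onesmallstep"     ~~ m:sigspace/One small step/);    # False
```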
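A named regex can be declared once and then called from other regexes:
my regex identifier { \w+ }
/ <identifier> / # equivalent to / \w+ /, with the match also captured as $<identifier>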
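The sketch below shows the named regex being called and its capture being read back by name. The input string is an illustrative assumption.

```raku
my regex identifier { \w+ }

# '$' is quoted so that it is matched as a literal dollar sign.
if 'my $count = 1' ~~ / '$' <identifier> / {
    say ~$<identifier>;    # count
}
```

Calling the regex as <identifier> stores its match under the same name, so $<identifier> (short for $/<identifier>) retrieves it.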
<alpha> a single alphabetic character
<digit> a single digit
<ident> an identifier
<sp> a single space character
<ws> an arbitrary amount of whitespace
<dot> a period (same as '.')
<lt> a less-than character (same as '<')
<gt> a greater-than character (same as '>')
<null> matches nothing (useful in alternations that may be empty)
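Lookahead and lookbehind
<before ...> zero-width lookahead: succeeds if ... matches at this position
<after ...> zero-width lookbehind: succeeds if ... matches just before this position
/ foo <before \d+> / # matches "foo" only when it comes before digits, which is why the assertion is named "before" even though it is written after foo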
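The naming becomes clearer with a small sketch. The input strings are made up for illustration.

```raku
# <before ...> checks what follows without consuming it.
say ~$/ if 'foo42' ~~ / foo <before \d+> /;       # foo: the digits are not part of the match
say ?('foobar' ~~ / foo <before \d+> /);           # False: "foo" is not before digits

# <after ...> checks what precedes the current position.
say ~$/ if 'price: 99' ~~ / <after ': '> \d+ /;   # 99
```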