Pattern |
Description |
\m \M \b |
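Start-of-word, end-of-word and word-boundary positions. regex uses \m for the start of a word and \M for the end of a word. \b matches at a word boundary but cannot tell whether it is a start or an end. |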
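The difference between the three anchors is easiest to see side by side. The following is a minimal sketch, assuming the third-party regex module is installed; the sample text is only an illustration.

```python
import regex  # third-party module: pip install regex

text = "hello big world"

# \m matches only at the start of a word, so this picks each first letter.
first_letters = regex.findall(r"\m\w", text)

# \M matches only at the end of a word, so this picks each last letter.
last_letters = regex.findall(r"\w\M", text)

# \b matches at both kinds of boundary and cannot tell them apart.
boundaries = [m.start() for m in regex.finditer(r"\b", text)]

print(first_letters)  # ['h', 'b', 'w']
print(last_letters)   # ['o', 'g', 'd']
print(boundaries)     # [0, 5, 6, 9, 10, 15]
```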
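(?flags-flags:...) scoped (?flags-flags) global |
Scoped control: (?i:...) turns case-insensitive matching on for the enclosed subpattern, and (?-i:...) turns it off. Several flags can simply be written next to each other, as in (?is-f:...): flags to the left of the minus sign are turned on, flags to the right are turned off.
>>> regex.search(r"(?i:good)", "GOOD")
<regex.Match object; span=(0, 4), match='GOOD'>
Global control: (?si-f)good |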
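To make the scope of a flag visible, the sketch below compares strings that differ only in the part outside the scoped group. It assumes the regex module is installed; the pattern names are hypothetical.

```python
import regex

# (?i:...) makes only the enclosed part case-insensitive.
scoped = regex.compile(r"(?i:good) day")
print(bool(scoped.search("GOOD day")))  # True
print(bool(scoped.search("GOOD DAY")))  # False: "day" is still case-sensitive

# (?-i:...) switches the flag off again inside a case-insensitive pattern.
mixed = regex.compile(r"(?i)good (?-i:day)")
print(bool(mixed.search("GOOD day")))  # True
print(bool(mixed.search("good DAY")))  # False
```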
lookaround |
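Lookarounds are supported in conditional patterns:
>>> regex.match(r'(?(?=\d)\d+|\w+)', '123abc')
<regex.Match object; span=(0, 3), match='123'>
>>> regex.match(r'(?(?=\d)\d+|\w+)', 'abc123')
<regex.Match object; span=(0, 6), match='abc123'>
This is not quite the same as putting a lookaround in the first branch of a pair of alternatives:
>>> print(regex.match(r'(?:(?=\d)\d+\b|\w+)', '123abc'))  # branch 1 fails, so branch 2 is tried
<regex.Match object; span=(0, 6), match='123abc'>
>>> print(regex.match(r'(?(?=\d)\d+\b|\w+)', '123abc'))  # branch 1 fails, branch 2 is not tried
None |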
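(?p) POSIX matching (leftmost longest) |
Normal matching:
>>> regex.search(r'Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 2), match='Mr'>
>>> regex.search(r'one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 7), match='oneself'>
POSIX matching:
>>> regex.search(r'(?p)Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 3), match='Mrs'>
>>> regex.search(r'(?p)one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 17), match='oneselfsufficient'> |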
[[a-z]--[aeiou]] |
V0: simple sets, compatible with the re module. V1: nested sets with extended functionality; the set shown contains 'a' to 'z' and excludes 'a', 'e', 'i', 'o' and 'u'. For example: regex.search(r'(?V1)[[a-z]--[aeiou]]+', 'abcde') or regex.search(r'[[a-z]--[aeiou]]+', 'abcde', flags=regex.V1) |
(?(DEFINE)...) |
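Defines named groups for later use. If there is no group called "DEFINE", then ... is ignored, except that any groups defined within it can still be called and the normal rules for numbering groups still apply. For example:
>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant) (?&item)', '5 elephants')
<regex.Match object; span=(0, 11), match='5 elephants'>
# Fixed patterns at both ends, with other content in between
>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant)[\u4E00-\u9FA5](?&item)', '123哈哈dog')
<regex.Match object; span=(0, 8), match='123哈哈dog'>
|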
\K |
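Keeps the part of the match after the position where \K occurred and discards the part before it.
>>> m = regex.search(r'(\w\w\K\w\w\w)', 'abcdef')  # keeps 'cde', discards 'ab'
>>> m[0]
'cde'
>>> m[1]
'abcde'
>>> m = regex.search(r'(?r)(\w\w\K\w\w\w)', 'abcdef')  # reversed: keeps 'bc', discards 'def'
>>> m[0]
'bc'
>>> m[1]
'bcdef' |
(?r) reverse search |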
>>> regex.findall(r".", "abc")
['a', 'b', 'c']
>>> regex.findall(r"(?r).", "abc")
['c', 'b', 'a']
Note: the result of a reverse search is not necessarily the reverse of a forward search:
>>> regex.findall(r"..", "abcde")
['ab', 'cd']
>>> regex.findall(r"(?r)..", "abcde")
['de', 'bc'] |
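A common use of reverse search is to locate the last occurrence of something without scanning all matches first. The following is a minimal sketch under the assumption that the regex module is installed; the log string is made up for illustration.

```python
import regex

log = "id=7 retry id=42 retry id=108"

# Forward search returns the leftmost match.
first = regex.search(r"id=(\d+)", log)

# (?r) searches backwards from the end, so it returns the rightmost match.
last = regex.search(r"(?r)id=(\d+)", log)

print(first[1])  # 7
print(last[1])   # 108
```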
expandf |
Subscripting can be used to get all the captures of a repeated capture group:
>>> m = regex.match(r"(\w)+", "abc")
>>> m.expandf("{1}")
'c'
{1} is equivalent to {1[-1]}: later captures overwrite earlier ones, so {1} gives 'c'.
>>> m.expandf("{1[0]} {1[1]} {1[2]}")
'a b c'
>>> m.expandf("{1[-1]} {1[-2]} {1[-3]}")
'c b a'
With a named group:
>>> m = regex.match(r"(?P<letter>\w)+", "abc")
>>> m.expandf("{letter}")
'c'
>>> m.expandf("{letter[0]} {letter[1]} {letter[2]}")
'a b c'
>>> m.expandf("{letter[-1]} {letter[-2]} {letter[-3]}")
'c b a'
>>> m = regex.match(r"(\w+) (\w+)", "foo bar")
>>> m.expandf("{0} => {2} {1}")
'foo bar => bar foo'
>>> m = regex.match(r"(?P<word1>\w+) (?P<word2>\w+)", "foo bar")
>>> m.expandf("{word2} {word1}")
'bar foo'
The same works on match objects returned by search(). |
capturesdict() groupdict() captures() |
capturesdict() is a combination of groupdict() and captures(): groupdict() returns a dict mapping each group name to its last capture; captures() returns a list of all the captures of a group; capturesdict() returns a dict mapping each group name to the list of all its captures.
>>> m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
>>> m.groupdict()
{'word': 'three', 'digits': '3'}
>>> m.captures("word")
['one', 'two', 'three']
>>> m.captures("digits")
['1', '2', '3']
>>> m.capturesdict()
{'word': ['one', 'two', 'three'], 'digits': ['1', '2', '3']} |
Ways to access groups |
(1) By subscript or slice:
>>> m = regex.search(r"(?P<before>.*?)(?P<num>\d+)(?P<after>.*)", "pqr123stu")
>>> m["before"]
'pqr'
>>> len(m)
4
>>> m[:]
('pqr123stu', 'pqr', '123', 'stu')
(2) By group name with group("name"):
>>> m.group('num')
'123'
(3) By group number:
>>> m.group(0)
'pqr123stu'
>>> m.group(1)
'pqr' |
subf subfn |
subf and subfn are alternatives to sub and subn respectively. When passed a replacement string, they treat it as a format string.
>>> regex.subf(r"(\w+) (\w+)", "{0} => {2} {1}", "foo bar")
'foo bar => bar foo'
>>> regex.subf(r"(?P<word1>\w+) (?P<word2>\w+)", "{word2} {word1}", "foo bar")
'bar foo' |
partial |
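Partial matching: match, search, fullmatch and finditer all support partial matches, enabled with the partial keyword argument. Match objects have a partial attribute, which is True for a partial match and False for a complete match.
>>> regex.search(r'\d{4}', '12', partial=True)
<regex.Match object; span=(0, 2), match='12', partial=True>
>>> regex.search(r'\d{4}', '123', partial=True)
<regex.Match object; span=(0, 3), match='123', partial=True>
>>> regex.search(r'\d{4}', '1234', partial=True)  # complete match: no partial in the repr
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True)
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True).partial  # complete match
False
>>> regex.search(r'\d{4}', '145', partial=True).partial  # partial match
True
>>> regex.search(r'\d{4}', '1245', partial=True).partial  # complete match
False |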
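Partial matching is typically useful for validating input while it is still being typed. The sketch below is one possible way to use it, assuming the regex module is installed; the PIN pattern and the classify helper are hypothetical names introduced for illustration.

```python
import regex

PIN = regex.compile(r"\d{4}")  # hypothetical 4-digit PIN format

def classify(entered: str) -> str:
    """Classify text typed so far as complete, incomplete, or invalid."""
    m = PIN.fullmatch(entered, partial=True)
    if m is None:
        return "invalid"  # the text can never become a valid PIN
    return "incomplete" if m.partial else "complete"

print(classify("12"))     # incomplete
print(classify("1234"))   # complete
print(classify("12a"))    # invalid
print(classify("12345"))  # invalid
```

With fullmatch, a fifth digit makes the input invalid, whereas the search calls in the table above would still find the first four digits.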
(?P<name>...) duplicate group names allowed |
Group names may be duplicated; later captures overwrite earlier ones.
With optional groups:
>>> # Both groups capture, the second capture 'overwriting' the first.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']
>>> # Only the second group captures.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['second']
>>> # Only the first group captures.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or ")
>>> m.group("item")
'first'
>>> m.captures("item")
['first']
With mandatory groups:
>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)?", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']
>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['', 'second']
>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or ")
>>> m.group("item")
''
>>> m.captures("item")
['first', ''] |
detach_string |
A match object holds a reference to the searched string through its string attribute. The detach_string method "detaches" that string, making it available for garbage collection, which can save valuable memory if the string is very large.
>>> m = regex.search(r"\w+", "Hello world")
>>> print(m.group())
Hello
>>> print(m.string)
Hello world
>>> m.detach_string()
>>> print(m.group())
Hello
>>> print(m.string)
None |
(?0), (?1), (?2) |
(?R) or (?0) tries to match the entire regex recursively. (?1), (?2), etc. try to match the corresponding capture group (group 1, group 2, and so on). (?&name) tries to match the named capture group. For example, (Tarzan|Jane) loves (?1) is equivalent to (Tarzan|Jane) loves (?:Tarzan|Jane).
>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Tarzan loves Jane").groups()
('Tarzan',)
>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Jane loves Tarzan").groups()
('Jane',)
>>> m = regex.search(r"(\w)(?:(?R)|(\w?))\1", "kayak")
>>> m.group(0, 1, 2)
('kayak', 'k', None) |
Fuzzy matching |
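Recursion is what lets a pattern match nested structures such as balanced parentheses. The following is a minimal sketch, assuming the regex module is installed; the first_balanced helper and the sample expressions are hypothetical.

```python
import regex

# "(" followed by any mix of non-parenthesis runs or nested
# whole-pattern matches via (?R), followed by ")".
BALANCED = regex.compile(r"\((?:[^()]++|(?R))*\)")

def first_balanced(text):
    """Return the first balanced parenthesised group in text, or None."""
    m = BALANCED.search(text)
    return m[0] if m else None

print(first_balanced("f(a, g(b, c)) + h(d)"))  # (a, g(b, c))
print(first_balanced("(a(b"))                  # None
```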
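There are three types of error, plus a code that stands for any of them:
- Insertion: “i”
- Deletion: “d”
- Substitution: “s”
- Any type of error: “e”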
Examples:
- foo match “foo” exactly
- (?:foo){i} match “foo”, permitting insertions
- (?:foo){d} match “foo”, permitting deletions
- (?:foo){s} match “foo”, permitting substitutions
- (?:foo){i,s} match “foo”, permitting insertions and substitutions
- (?:foo){e} match “foo”, permitting errors
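If a certain type of error is specified, then any type not specified is not permitted. In the following examples the item is omitted and only the fuzziness is written: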
- {d<=3} permit at most 3 deletions, but no other types
- {i<=1,s<=2} permit at most 1 insertion and at most 2 substitutions, but no deletions
- {1<=e<=3} permit at least 1 and at most 3 errors
- {i<=2,d<=2,e<=3} permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions
It’s also possible to state the costs of each type of error and the maximum permitted total cost. Examples:
- {2i+2d+1s<=4} each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
- {i<=1,d<=1,s<=1,2i+2d+1s<=4} at most 1 insertion, at most 1 deletion, at most 1 substitution; each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
A test can also be placed on the characters that are substituted or inserted. Examples:
- {s<=2:[a-z]} at most 2 substitutions, which must be in the character set [a-z].
- {s<=2,i<=3:\d} at most 2 substitutions, at most 3 insertions, which must be digits.
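By default, fuzzy matching searches for the first match that meets the given constraints. The ENHANCEMATCH flag, (?e), makes it attempt to improve the fit (i.e. reduce the number of errors) of the match it has found. The BESTMATCH flag, (?b), makes it search for the best match instead.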
- regex.search("(dog){e}", "cat and dog")[1] returns "cat" because that matches "dog" with 3 errors (an unlimited number of errors is permitted).
- regex.search("(dog){e<=1}", "cat and dog")[1] returns " dog" (with a leading space) because that matches "dog" with 1 error, which is within the limit.
- regex.search("(?e)(dog){e<=1}", "cat and dog")[1] returns "dog" (without a leading space) because the fuzzy search matches " dog" with 1 error, which is within the limit, and the (?e) flag then makes it attempt a better fit.
Match objects have a fuzzy_counts attribute, which gives the total number of substitutions, insertions and deletions:
>>> # A 'raw' fuzzy match:
>>> regex.fullmatch(r"(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 1)
>>> # 0 substitutions, 0 insertions, 1 deletion.
>>> # A better match might be possible if the ENHANCEMATCH flag is used:
>>> regex.fullmatch(r"(?e)(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 0)
>>> # 0 substitutions, 0 insertions, 0 deletions.
Match objects also have a fuzzy_changes attribute, which gives a tuple of the positions of the substitutions, insertions and deletions:
>>> m = regex.search('(fuu){i<=2,d<=2,e<=5}', 'anaconda foo bar')
>>> m
<regex.Match object; span=(7, 10), match='a f', fuzzy_counts=(0, 2, 2)>
>>> m.fuzzy_changes
([], [7, 8], [10, 11]) |
\L<name> |
Named lists. The old approach: p = regex.compile(r"first|second|third|fourth|fifth"). If the list is large, parsing the resulting regex can take considerable time, and care must also be taken that the strings are properly escaped and properly ordered, for example "cats" before "cat". The new approach: the order of the items is irrelevant, because they are treated as a set.
>>> option_set = ["first", "second", "third", "fourth", "fifth"]
>>> p = regex.compile(r"\L<options>", options=option_set)
The named_lists attribute:
>>> print(p.named_lists)
# Python 3
{'options': frozenset({'fifth', 'first', 'fourth', 'second', 'third'})}
# Python 2
{'options': frozenset(['fifth', 'fourth', 'second', 'third', 'first'])} |
Set operators: sets and nested sets |
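As a usage illustration of the error limits described above, the sketch below checks whether a candidate string is within a given number of errors of a target word. It assumes the regex module is installed; close_enough is a hypothetical helper, not part of the library.

```python
import regex

def close_enough(word, candidate, max_errors=1):
    """Return a match if candidate equals word within max_errors errors."""
    # regex.escape keeps any special characters in word literal.
    pattern = "(?:" + regex.escape(word) + "){e<=" + str(max_errors) + "}"
    # BESTMATCH asks for the match with the fewest errors.
    return regex.fullmatch(pattern, candidate, flags=regex.BESTMATCH)

exact = close_enough("hello", "hello")
typo = close_enough("hello", "helo")
far = close_enough("hello", "help")

print(exact.fuzzy_counts)      # (0, 0, 0)
print(sum(typo.fuzzy_counts))  # 1
print(far)                     # None
```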
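Because the items of a named list are matched as literal strings, they need no escaping even when they contain regex metacharacters. The following is a minimal sketch, assuming the regex module is installed; the keyword list is made up for illustration.

```python
import regex

# Items containing metacharacters ("+", ".") are passed as-is.
keywords = ["first", "third", "C++", "a.b"]
p = regex.compile(r"\L<options>", options=keywords)

found = p.findall("first C++ axb a.b")
print(found)  # ['first', 'C++', 'a.b']

# "axb" is not matched: the "." in "a.b" is a literal dot here.
print(p.fullmatch("axb"))  # None
```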
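Version 1 behaviour only. Set operators have been added, and a set can include nested sets. The operators, in order of increasing precedence, are: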
- || for union (“x||y” means “x or y”)
- ~~ (double tilde) for symmetric difference (“x~~y” means “x or y, but not both”)
- && for intersection (“x&&y” means “x and y”)
- -- (double dash) for difference (“x--y” means “x but not y”)
Implicit union, i.e. simple juxtaposition as in [ab], has the highest precedence. Thus [ab&&cd] is the same as [[a||b]&&[c||d]]. Examples:
- [ab] # Set containing ‘a’ and ‘b’
- [a-z] # Set containing ‘a’ .. ‘z’
- [[a-z]--[qw]] # Set containing ‘a’ .. ‘z’, but not ‘q’ or ‘w’
- [a-z--qw] # Same as above
- [\p{L}--QW] # Set containing all letters except ‘Q’ and ‘W’
- [\p{N}--[0-9]] # Set containing all numbers except ‘0’ .. ‘9’
- [\p{ASCII}&&\p{Letter}] # Set containing all characters which are ASCII and letter
|
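The four operators can be compared on one short string. This is a minimal sketch, assuming the regex module is installed; the sets are written with explicit nested brackets and the (?V1) flag, since set operators are version 1 behaviour.

```python
import regex

text = "abcdxyz"

union = regex.findall(r"(?V1)[[a-b]||[y-z]]", text)         # in either set
symmetric = regex.findall(r"(?V1)[[a-c]~~[b-d]]", text)     # in exactly one set
intersection = regex.findall(r"(?V1)[[a-c]&&[b-d]]", text)  # in both sets
difference = regex.findall(r"(?V1)[[a-d]--[bc]]", text)     # in first, not second

print(union)         # ['a', 'b', 'y', 'z']
print(symmetric)     # ['a', 'd']
print(intersection)  # ['b', 'c']
print(difference)    # ['a', 'd']
```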
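Start and end indexes |
Match objects have additional methods that return information about all the successful matches of a repeated capture group. These methods are: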
- matchobject.captures([group1, ...])
- matchobject.starts([group])
- matchobject.ends([group])
- matchobject.spans([group])
>>> m = regex.search(r"(\w{3})+", "123456789")
>>> m.group(1)
'789'
>>> m.captures(1)
['123', '456', '789']
>>> m.start(1)
6
>>> m.starts(1)
[0, 3, 6]
>>> m.end(1)
9
>>> m.ends(1)
[3, 6, 9]
>>> m.span(1)
(6, 9)
>>> m.spans(1)
[(0, 3), (3, 6), (6, 9)] |
\G |
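A search anchor. It matches at the position where each search started or continued, and can be used for contiguous matches or in negative variable-length lookbehinds to limit how far back the lookbehind goes:
>>> regex.findall(r"\w{2}", "abcd ef")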
['ab', 'cd', 'ef']
>>> regex.findall(r"\G\w{2}", "abcd ef")
['ab', 'cd']
|
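One practical use of \G is reading a run of contiguous items and stopping at the first thing that breaks the run. The following is a minimal sketch, assuming the regex module is installed; the input string is made up for illustration.

```python
import regex

data = "1,2,3;4,5"

# Without \G, findall skips over the ";" and keeps going.
all_numbers = regex.findall(r"(\d+),?", data)

# With \G, each match must begin exactly where the previous one ended,
# so matching stops at the ";".
leading_run = regex.findall(r"\G(\d+),?", data)

print(all_numbers)  # ['1', '2', '3', '4', '5']
print(leading_run)  # ['1', '2', '3']
```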
(?|...|...) branch reset |
Capture group numbers are reused across all the alternatives, but groups with different names get different group numbers.
>>> regex.match(r"(?|(first)|(second))", "first").groups()
('first',)
>>> regex.match(r"(?|(first)|(second))", "second").groups()
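('second',)
Note: there is only one group. |
Timeout |
The matching methods and functions support timeouts. The timeout (in seconds) applies to the entire operation:
>>> from time import sleep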
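Branch reset is convenient when several alternative formats should fill the same group numbers. The sketch below contrasts it with an ordinary non-capturing group, assuming the regex module is installed; the two number-pair formats are hypothetical.

```python
import regex

# Ordinary alternation: every capture group gets its own number.
plain = regex.compile(r"(?:(\d+)-(\d+)|(\d+)/(\d+))")

# Branch reset: both alternatives share group numbers 1 and 2.
reset = regex.compile(r"(?|(\d+)-(\d+)|(\d+)/(\d+))")

print(plain.match("3/4").groups())  # (None, None, '3', '4')
print(reset.match("3/4").groups())  # ('3', '4')
print(reset.match("1-2").groups())  # ('1', '2')
```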
>>>
>>> def fast_replace(m):
... return 'X'
...
>>> def slow_replace(m):
... sleep(0.5)
... return 'X'
...
>>> regex.sub(r'[a-z]', fast_replace, 'abcde', timeout=2)
'XXXXX'
>>> regex.sub(r'[a-z]', slow_replace, 'abcde', timeout=2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python37\lib\site-packages\regex\regex.py", line 276, in sub
endpos, concurrent, timeout)
TimeoutError: regex timed out |