Python regular expressions: the regex module

Contents

1. Two behaviours for compatibility with the re module

2. Case-insensitive matches in Unicode

3. Flags

4. Groups

5. Other features (see the table below)


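Reference: the official page of the regex extension module, regex 2020.5.7

The regex module is a regular expression implementation that is backwards-compatible with the standard "re" module, but offers additional functionality.

The re module's zero-width match behaviour changed in Python 3.7, and this module follows that behaviour when compiled for Python 3.7.

1. Two behaviours for compatibility with the re module

  • Version 0 (old behaviour, compatible with the re module):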

    Please note that the re module's behaviour may change over time, and the regex author endeavours to match that behaviour in version 0.

    • Indicated by the VERSION0 or V0 flag, or (?V0) in the pattern.
    • Zero-width matches are not handled correctly in the re module before Python 3.7. The behaviour in those earlier versions is:
      • .split won’t split a string at a zero-width match.
      • .sub will advance by one character after a zero-width match.
    • Inline flags apply to the entire pattern, and they can’t be turned off.
    • Only simple sets are supported.
    • Case-insensitive matches in Unicode use simple case-folding by default.
  • Version 1 (new behaviour, possibly different from the re module):

    • Indicated by the VERSION1 or V1 flag, or (?V1) in the pattern.
    • Zero-width matches are handled correctly.
    • Inline flags apply to the end of the group or pattern, and they can be turned off.
    • Nested sets and set operations are supported.
    • Case-insensitive matches in Unicode use full case-folding by default.

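If no version is specified, the regex module defaults to regex.DEFAULT_VERSION.

2. Case-insensitive matches in Unicode

The regex module supports both simple and full case-folding for case-insensitive matches in Unicode. Full case-folding can be turned on with the FULLCASE or F flag, or with (?f) in the pattern. Note that this flag affects how the IGNORECASE flag works; the FULLCASE flag does not itself turn on case-insensitive matching.

  • In the version 0 behaviour, the flag is off by default.
  • In the version 1 behaviour, the flag is on by default.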
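
A minimal sketch of the difference, assuming the third-party regex package is installed. The German word pair is only an illustrative choice: under simple case-folding "ß" does not fold to "ss", so only the variants that use full case-folding should match.

```python
import regex

text = "stra\N{LATIN SMALL LETTER SHARP S}e"  # "straße"

# Version 0 + IGNORECASE: simple case-folding, "ss" does not equal "ß".
v0_simple = regex.fullmatch(r"strasse", text, flags=regex.V0 | regex.I)

# Version 0 + IGNORECASE + FULLCASE: full case-folding switched on explicitly.
v0_full = regex.fullmatch(r"strasse", text, flags=regex.V0 | regex.I | regex.F)

# Version 1 + IGNORECASE: full case-folding is the default.
v1_default = regex.fullmatch(r"strasse", text, flags=regex.V1 | regex.I)

# FULLCASE alone does not make the match case-insensitive.
fullcase_only = regex.fullmatch(r"strasse", text, flags=regex.V0 | regex.F)

print(v0_simple, v0_full, v1_default, fullcase_only)
```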

3. Flags

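There are 2 kinds of flag: scoped and global. Scoped flags can apply to only part of a pattern and can be turned on or off; global flags apply to the entire pattern and can only be turned on.

The scoped flags are: FULLCASE, IGNORECASE, MULTILINE, DOTALL, VERBOSE, WORD.

The global flags are: ASCII, BESTMATCH, ENHANCEMATCH, LOCALE, POSIX, REVERSE, UNICODE, VERSION0, VERSION1.

If none of the ASCII, LOCALE or UNICODE flags is specified, the default is UNICODE if the regex pattern is a Unicode string and ASCII if it is a bytestring.

  • The ENHANCEMATCH flag makes fuzzy matching attempt to improve the fit of the next match that it finds.

  • The BESTMATCH flag makes fuzzy matching search for the best match instead of the next match.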
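
The default choice between UNICODE and ASCII described above can be checked with a short sketch (assuming the regex package is installed; the sample text is made up for illustration):

```python
import regex

text = "caf\u00e9 na\u00efve"  # "café naïve"

# str pattern, no flag given: UNICODE is the default, so accented letters count as \w.
unicode_words = regex.findall(r"\w+", text)

# str pattern with an explicit ASCII flag: accented letters no longer count as \w.
ascii_words = regex.findall(r"\w+", text, flags=regex.ASCII)

# bytes pattern, no flag given: ASCII is the default.
byte_words = regex.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words, ascii_words, byte_words)
```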

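4. Groups

All capture groups have a group number, starting from 1. Groups with the same group name have the same group number, and groups with different group names have different group numbers.

The same name can be used by more than one group, with later captures "overwriting" earlier captures. All the captures of the group are available from the captures method of the match object.

Group numbers are reused across the different branches of a branch reset, e.g. (?|(first)|(second)) has only group 1. If the capture groups have different group names then they will, of course, have different group numbers, e.g. (?|(?P<foo>first)|(?P<bar>second)) has group 1 ("foo") and group 2 ("bar").

The regex (\s+)(?|(?P<foo>[A-Z]+)|(\w+)) (?P<foo>[0-9]+) has 2 groups:

  • (\s+) is group 1.
  • (?P<foo>[A-Z]+) is group 2, also called "foo".
  • (\w+) is group 2 because of the branch reset.
  • (?P<foo>[0-9]+) is group 2 because it's called "foo".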
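
The numbering above can be observed on a match object. This is a small sketch, assuming the regex package is installed; the two input strings are made up for illustration.

```python
import regex

pattern = regex.compile(r"(\s+)(?|(?P<foo>[A-Z]+)|(\w+)) (?P<foo>[0-9]+)")

upper = pattern.match(" ABC 123")  # first branch of the branch reset is used
lower = pattern.match(" abc 123")  # second branch of the branch reset is used

# Group 2 ("foo") was captured twice; group() gives the last capture,
# captures() gives all of them in order.
print(upper.group("foo"), upper.captures("foo"))
print(lower.group(2), lower.captures(2))
```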

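5. Other features (see the table below)

Pattern    Description

\m

\M

\b

Start of word, end of word, word boundary

regex uses \m for the start of a word and \M for the end of a word.

\b is a word boundary, but it cannot distinguish the start of a word from the end of a word.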
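
A short sketch of the three anchors side by side (assuming the regex package is installed; the sample sentence is arbitrary):

```python
import regex

text = "hello big world"

# \m matches only where a word starts, \M only where a word ends.
starts = [m.start() for m in regex.finditer(r"\m", text)]
ends = [m.start() for m in regex.finditer(r"\M", text)]

# \b matches at both kinds of position.
boundaries = [m.start() for m in regex.finditer(r"\b", text)]

# Capitalise the first letter of every word by anchoring on \m.
title = regex.sub(r"\m\w", lambda m: m.group().upper(), text)

print(starts, ends, boundaries, title)
```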

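(?flags-flags:...)  scoped

(?flags-flags)  global

Scoped control:

(?i:...) turns case-insensitive matching on, and (?-i:...) turns it off.

Several flags can simply be written together, as in (?is-f:...): flags to the left of the minus sign are turned on, and flags to the right of it are turned off.

>>> regex.search(r"(?i:good)", "GOOD")
<regex.Match object; span=(0, 4), match='GOOD'>

Global control:

(?si-f)good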
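
To see that a scoped flag affects only its own group, here is a minimal sketch (assuming the regex package is installed; the patterns and test strings are invented for illustration):

```python
import regex

# Scoped on: only "good" is matched case-insensitively.
scoped_on = regex.compile(r"(?i:good) day")

# Scoped off: the whole pattern ignores case, except for "day".
scoped_off = regex.compile(r"(?i)good (?-i:day)")

samples = ("GOOD day", "GOOD DAY")
results = {
    "scoped_on": [bool(scoped_on.fullmatch(s)) for s in samples],
    "scoped_off": [bool(scoped_off.fullmatch(s)) for s in samples],
}
print(results)
```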

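lookaround

Lookarounds are supported in conditional patterns:

>>> regex.match(r'(?(?=\d)\d+|\w+)', '123abc')
<regex.Match object; span=(0, 3), match='123'>
>>> regex.match(r'(?(?=\d)\d+|\w+)', 'abc123')
<regex.Match object; span=(0, 6), match='abc123'>

This is not quite the same as putting a lookaround in the first branch of a pair of alternatives:

>>> print(regex.match(r'(?:(?=\d)\d+\b|\w+)', '123abc'))   # if branch 1 fails, the 2nd branch is tried
<regex.Match object; span=(0, 6), match='123abc'>
>>> print(regex.match(r'(?(?=\d)\d+\b|\w+)', '123abc'))    # if branch 1 fails, the 2nd branch is not tried
None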

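(?p)

POSIX matching (leftmost longest)

Normal matching:
>>> regex.search(r'Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 2), match='Mr'>
>>> regex.search(r'one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 7), match='oneself'>

POSIX matching:
>>> regex.search(r'(?p)Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 3), match='Mrs'>
>>> regex.search(r'(?p)one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 17), match='oneselfsufficient'>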

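[[a-z]--[aeiou]]

V0: simple sets, compatible with the re module

V1: nested sets, an enhancement; the set below contains 'a' to 'z' but excludes 'a', 'e', 'i', 'o' and 'u'

e.g.:

>>> regex.search(r'(?V1)[[a-z]--[aeiou]]+', 'abcde')
<regex.Match object; span=(1, 4), match='bcd'>

>>> regex.search(r'[[a-z]--[aeiou]]+', 'abcde', flags=regex.V1)
<regex.Match object; span=(1, 4), match='bcd'>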

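(?(DEFINE)...)

Defines named groups for later use: if there is no group called "DEFINE", then ... is ignored during matching, but any groups defined inside it can still be called, and the normal rules for numbering groups still apply.

e.g.:

>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant) (?&item)', '5 elephants')
<regex.Match object; span=(0, 11), match='5 elephants'>

# Fixed patterns at both ends, with other content in between
>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant)[\u4E00-\u9FA5](?&item)', '123哈哈dog')
<regex.Match object; span=(0, 8), match='123哈哈dog'>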

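\K

Keeps the part of the match that follows the position where \K occurs and discards the part that precedes it.

>>> m = regex.search(r'(\w\w\K\w\w\w)', 'abcdef')   # keeps 'cde', discards 'ab'
>>> m[0]
'cde'
>>> m[1]
'abcde'

>>> m = regex.search(r'(?r)(\w\w\K\w\w\w)', 'abcdef')   # reverse search: keeps 'bc', discards 'def'
>>> m[0]
'bc'
>>> m[1]
'bcdef'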

 

(?r)  Reverse searching
>>> regex.findall(r".", "abc")
['a', 'b', 'c']
>>> regex.findall(r"(?r).", "abc")
['c', 'b', 'a']

Note: the result of a reverse search is not necessarily the reverse of the result of a forward search.

>>> regex.findall(r"..", "abcde")
['ab', 'cd']
>>> regex.findall(r"(?r)..", "abcde")
['de', 'bc']

expandf

Subscripts can be used to get all the captures of a repeated capture group:

>>> m = regex.match(r"(\w)+", "abc")
>>> m.expandf("{1}")   # later captures overwrite earlier ones, so {1} is the same as {1[-1]}
'c'
>>> m.expandf("{1[0]} {1[1]} {1[2]}")
'a b c'
>>> m.expandf("{1[-1]} {1[-2]} {1[-3]}")
'c b a'

With a named group:
>>> m = regex.match(r"(?P<letter>\w)+", "abc")
>>> m.expandf("{letter}")
'c'
>>> m.expandf("{letter[0]} {letter[1]} {letter[2]}")
'a b c'
>>> m.expandf("{letter[-1]} {letter[-2]} {letter[-3]}")
'c b a'

>>> m = regex.match(r"(\w+) (\w+)", "foo bar")
>>> m.expandf("{0} => {2} {1}")
'foo bar => bar foo'

>>> m = regex.match(r"(?P<word1>\w+) (?P<word2>\w+)", "foo bar")
>>> m.expandf("{word2} {word1}")
'bar foo'

The same works with match objects returned by search().

capturesdict()

groupdict()

captures()

capturesdict() is a combination of groupdict() and captures():

groupdict(): returns a dict whose keys are the group names and whose values are the last capture of each group

captures(): returns a list of all the captures of a group

capturesdict(): returns a dict whose keys are the group names and whose values are lists of all the captures of each group

>>> m = regex.match(r"(?:(?P<word>\w+) (?P<digits>\d+)\n)+", "one 1\ntwo 2\nthree 3\n")
>>> m.groupdict()
{'word': 'three', 'digits': '3'}
>>> m.captures("word")
['one', 'two', 'three']
>>> m.captures("digits")
['1', '2', '3']
>>> m.capturesdict()
{'word': ['one', 'two', 'three'], 'digits': ['1', '2', '3']}

Ways of accessing groups

(1) By subscripting and slicing:
>>> m = regex.search(r"(?P<before>.*?)(?P<num>\d+)(?P<after>.*)", "pqr123stu")
>>> m["before"]
'pqr'
>>> len(m)
4
>>> m[:]
('pqr123stu', 'pqr', '123', 'stu')

(2) By name, with group("name"):
>>> m.group('num')
'123'

(3) By group number:
>>> m.group(0)
'pqr123stu'
>>> m.group(1)
'pqr'

subf

subfn

subf and subfn are alternatives to sub and subn respectively. When passed a replacement string, they treat it as a format string.

>>> regex.subf(r"(\w+) (\w+)", "{0} => {2} {1}", "foo bar")
'foo bar => bar foo'
>>> regex.subf(r"(?P<word1>\w+) (?P<word2>\w+)", "{word2} {word1}", "foo bar")
'bar foo'
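
For comparison, the same swap written with sub, subf and subfn. This is only a sketch, assuming the regex package is installed; the input string is invented.

```python
import regex

text = "foo bar, spam eggs"
pattern = r"(\w+) (\w+)"

# sub uses a backslash-style replacement template.
swapped_sub = regex.sub(pattern, r"\2 \1", text)

# subf treats the replacement as a format string.
swapped_subf = regex.subf(pattern, "{2} {1}", text)

# subfn does the same and also reports the number of substitutions made.
result, count = regex.subfn(pattern, "{2} {1}", text)

print(swapped_sub, swapped_subf, result, count)
```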

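partial

Partial matches: match, search, fullmatch and finditer all support partial matching, which is enabled with the partial keyword argument. Match objects have a partial attribute, which is True for a partial match and False for a complete match.

>>> regex.search(r'\d{4}', '12', partial=True)
<regex.Match object; span=(0, 2), match='12', partial=True>
>>> regex.search(r'\d{4}', '123', partial=True)
<regex.Match object; span=(0, 3), match='123', partial=True>
>>> regex.search(r'\d{4}', '1234', partial=True)   # complete match: no partial in the repr
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True)
<regex.Match object; span=(0, 4), match='1234'>
>>> regex.search(r'\d{4}', '12345', partial=True).partial   # complete match
False
>>> regex.search(r'\d{4}', '145', partial=True).partial     # partial match
True
>>> regex.search(r'\d{4}', '1245', partial=True).partial    # complete match
False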
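
One use for partial matching is validating input while it is still being typed. The sketch below assumes the regex package is installed; the 4-digit format and the helper name classify are hypothetical, chosen only for illustration.

```python
import regex

FOUR_DIGITS = regex.compile(r"\d{4}")

def classify(text):
    """Classify input typed so far as 'complete', 'incomplete' or 'invalid'."""
    m = FOUR_DIGITS.fullmatch(text, partial=True)
    if m is None:
        return "invalid"        # can never become a match by typing more
    if m.partial:
        return "incomplete"     # could still become a match
    return "complete"

for typed in ("", "12", "1234", "12a", "12345"):
    print(repr(typed), classify(typed))
```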

   

(?P<name>...)

Duplicate group names are allowed

Group names may be repeated, with later captures overwriting earlier captures.
Optional groups:
>>> # Both groups capture, the second capture 'overwriting' the first.
>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']

>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['second']

>>> m = regex.match(r"(?P<item>\w+)? or (?P<item>\w+)?", "first or ")
>>> m.group("item")
'first'
>>> m.captures("item")
['first']

Mandatory groups:
>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['first', 'second']

>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", " or second")
>>> m.group("item")
'second'
>>> m.captures("item")
['', 'second']

>>> m = regex.match(r"(?P<item>\w*) or (?P<item>\w*)", "first or ")
>>> m.group("item")
''
>>> m.captures("item")
['first', '']

 

detach_string

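A match object holds a reference to the string that was searched, via its string attribute. The detach_string method "detaches" that string, making it available for garbage collection, which can save valuable memory if the string is very large.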

>>> m = regex.search(r"\w+", "Hello world")
>>> print(m.group())
Hello
>>> print(m.string)
Hello world
>>> m.detach_string()
>>> print(m.group())
Hello
>>> print(m.string)
None

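(?0), (?1), (?2)

(?R) or (?0) tries to match the entire regex recursively.
(?1), (?2), etc., try to match the relevant capture group (group 1, group 2, and so on). For example, (Tarzan|Jane) loves (?1) is equivalent to (Tarzan|Jane) loves (?:Tarzan|Jane).
(?&name) tries to match the named capture group.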

>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Tarzan loves Jane").groups()
('Tarzan',)
>>> regex.match(r"(Tarzan|Jane) loves (?1)", "Jane loves Tarzan").groups()
('Jane',)

>>> m = regex.search(r"(\w)(?:(?R)|(\w?))\1", "kayak")
>>> m.group(0, 1, 2)
('kayak', 'k', None)
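
A common use of group recursion is matching nested brackets. The following is a sketch rather than a general parser, assuming the regex package is installed; the helper name is_balanced is made up for this example.

```python
import regex

# Group 1 is one parenthesised block; (?1) re-enters group 1 for each nested block.
balanced = regex.compile(r"(\((?:[^()]|(?1))*\))")

def is_balanced(text):
    """Return True if text is a single, fully balanced parenthesised block."""
    return balanced.fullmatch(text) is not None

for sample in ("()", "(a(b)c)", "((()))", "(a(b)c", "(()"):
    print(sample, is_balanced(sample))
```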

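Fuzzy matching

The three types of error, plus a catch-all for any of them, are:

  • Insertion, indicated by "i"
  • Deletion, indicated by "d"
  • Substitution, indicated by "s"
  • Any type of error, indicated by "e"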

Examples:

  • foo match “foo” exactly
  • (?:foo){i} match “foo”, permitting insertions
  • (?:foo){d} match “foo”, permitting deletions
  • (?:foo){s} match “foo”, permitting substitutions
  • (?:foo){i,s} match “foo”, permitting insertions and substitutions
  • (?:foo){e} match “foo”, permitting errors

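If a certain type of error is specified, then any type that is not specified is not permitted. The following examples omit the item and show only the fuzziness: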

  • {d<=3} permit at most 3 deletions, but no other types
  • {i<=1,s<=2} permit at most 1 insertion and at most 2 substitutions, but no deletions
  • {1<=e<=3} permit at least 1 and at most 3 errors
  • {i<=2,d<=2,e<=3} permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions

It’s also possible to state the costs of each type of error and the maximum permitted total cost.

Examples:

  • {2i+2d+1s<=4} each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4
  • {i<=1,d<=1,s<=1,2i+2d+1s<=4} at most 1 insertion, at most 1 deletion, at most 1 substitution; each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4

A test can also be applied to a character that is substituted or inserted.

Examples:

  • {s<=2:[a-z]} at most 2 substitutions, which must be in the character set [a-z].
  • {s<=2,i<=3:\d} at most 2 substitutions, at most 3 insertions, which must be digits.

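By default, fuzzy matching searches for the first match that meets the given constraints. The ENHANCEMATCH flag, (?e), makes it attempt to improve the fit (i.e. reduce the number of errors) of the match that it has found.

The BESTMATCH flag makes it search for the best match instead.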

  • regex.search("(dog){e}", "cat and dog")[1] returns "cat" because that matches "dog" with 3 errors (an unlimited number of errors is permitted).
  • regex.search("(dog){e<=1}", "cat and dog")[1] returns " dog" (with a leading space) because that matches "dog" with 1 error, which is within the limit.
  • regex.search("(?e)(dog){e<=1}", "cat and dog")[1] returns "dog" (without a leading space) because the fuzzy search matches " dog" with 1 error, which is within the limit, and the (?e) then makes it attempt a better fit.

Match objects have a fuzzy_counts attribute, which gives the total number of substitutions, insertions and deletions:

>>> # A 'raw' fuzzy match:
>>> regex.fullmatch(r"(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 1)
>>> # 0 substitutions, 0 insertions, 1 deletion.

>>> # A better match might be possible if the ENHANCEMATCH flag used:
>>> regex.fullmatch(r"(?e)(?:cats|cat){e<=1}", "cat").fuzzy_counts
(0, 0, 0)
>>> # 0 substitutions, 0 insertions, 0 deletions.

 

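Match objects also have a fuzzy_changes attribute, which gives a tuple of the positions of the substitutions, insertions and deletions:

>>> m = regex.search('(fuu){i<=2,d<=2,e<=5}', 'anaconda foo bar')
>>> m
<regex.Match object; span=(7, 10), match='a f', fuzzy_counts=(0, 2, 2)>
>>> m.fuzzy_changes
([], [7, 8], [10, 11])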
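
A small sketch of how the error constraints behave, assuming the regex package is installed. The word "hello" and its misspellings are invented test data, and fullmatch is used so that the whole input has to be accounted for.

```python
import regex

# At most one error of any type.
one_error = regex.compile(r"(?:hello){e<=1}")

# At most one substitution, and no insertions or deletions.
one_substitution = regex.compile(r"(?:hello){s<=1}")

substituted = one_error.fullmatch("hallo")      # one letter replaced
shortened = one_error.fullmatch("helo")         # one letter missing
too_many = one_error.fullmatch("hxllx")         # two letters replaced
wrong_kind = one_substitution.fullmatch("helo") # a deletion is not permitted here

# fuzzy_counts is (substitutions, insertions, deletions).
print(substituted.fuzzy_counts, shortened.fuzzy_counts, too_many, wrong_kind)
```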

\L

Named lists

The old way:

p = regex.compile(r"first|second|third|fourth|fifth"). If the list is large, parsing the resulting regex can take considerable time, and care must also be taken that the strings are properly escaped and properly ordered, for example, "cats" before "cat".

The new way: the order of the items is irrelevant; they are treated as a set.

>>> option_set = ["first", "second", "third", "fourth", "fifth"]
>>> p = regex.compile(r"\L<options>", options=option_set)

The named_lists attribute:

>>> print(p.named_lists)
# Python 3
{'options': frozenset({'fifth', 'first', 'fourth', 'second', 'third'})}
# Python 2
{'options': frozenset(['fifth', 'fourth', 'second', 'third', 'first'])}
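
A brief sketch of a named list in use, assuming the regex package is installed. The list name words and its contents are hypothetical; "a.b" is included on purpose to show that items are matched literally, without manual escaping.

```python
import regex

# Hypothetical word list; the order of the items does not matter.
words = ["cat", "cats", "a.b"]
p = regex.compile(r"\L<words>", words=words)

listed = p.fullmatch("cats") is not None     # an item of the list
literal = p.fullmatch("a.b") is not None     # matched as literal text
wildcard = p.fullmatch("axb") is not None    # "." is not a wildcard here
unlisted = p.fullmatch("dog") is not None    # not in the list

print(listed, literal, wildcard, unlisted)
print(p.named_lists["words"] == frozenset(words))
```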

Set operators

Sets and nested sets

Version 1 behaviour only

Set operators have been added, and a set can contain nested sets.

The operators, in order of increasing precedence, are:

  • || for union ("x||y" means "x or y")
  • ~~ (double tilde) for symmetric difference ("x~~y" means "x or y, but not both")
  • && for intersection ("x&&y" means "x and y")
  • -- (double dash) for difference ("x--y" means "x but not y")

Implicit union, i.e. simple juxtaposition as in [ab], has the highest precedence. Thus [ab&&cd] is the same as [[a||b]&&[c||d]].

eg:

  • [ab]  # Set containing ‘a’ and ‘b’
  • [a-z]  # Set containing ‘a’ .. ‘z’
  • [[a-z]--[qw]]  # Set containing ‘a’ .. ‘z’, but not ‘q’ or ‘w’
  • [a-z--qw]  # Same as above
  • [\p{L}--QW]  # Set containing all letters except ‘Q’ and ‘W’
  • [\p{N}--[0-9]]  # Set containing all numbers except ‘0’ .. ‘9’
  • [\p{ASCII}&&\p{Letter}] # Set containing all characters which are ASCII and letter
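
A few of these sets tried out under the version 1 behaviour. This is a sketch, assuming the regex package is installed; the input strings are arbitrary.

```python
import regex

# Difference: lowercase letters that are not vowels.
consonant_runs = regex.findall(r"(?V1)[[a-z]--[aeiou]]+", "education")

# Symmetric difference: in exactly one of [a-c] and [b-d].
one_side_only = regex.findall(r"(?V1)[[a-c]~~[b-d]]", "abcd")

# Intersection: characters that are both ASCII and letters.
ascii_letters = regex.findall(r"(?V1)[\p{ASCII}&&\p{Letter}]+", "na\u00efve caf\u00e9 42")

print(consonant_runs, one_side_only, ascii_letters)
```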
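
Start and end indexes

Match objects have additional methods which return information on all the successful matches of a repeated capture group. These methods are: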

  • matchobject.captures([group1, ...])
  • matchobject.starts([group])
  • matchobject.ends([group])
  • matchobject.spans([group]) 
>>> m = regex.search(r"(\w{3})+", "123456789")
>>> m.group(1)
'789'
>>> m.captures(1)
['123', '456', '789']
>>> m.start(1)
6
>>> m.starts(1)
[0, 3, 6]
>>> m.end(1)
9
>>> m.ends(1)
[3, 6, 9]
>>> m.span(1)
(6, 9)
>>> m.spans(1)
[(0, 3), (3, 6), (6, 9)]
   

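\G

A search anchor. It matches at the position where each search started or continued, and can be used for contiguous matches or in negative variable-length lookbehinds to limit how far back the lookbehind goes: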

 

>>> regex.findall(r"\w{2}", "abcd ef")
['ab', 'cd', 'ef']
>>> regex.findall(r"\G\w{2}", "abcd ef")
['ab', 'cd']
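
The same idea applied to a slightly larger case: reading key=value; pairs only for as long as they follow one another without a gap. This is a sketch assuming the regex package is installed; the input format is invented.

```python
import regex

text = "a=1;b=2; c=3;"

# Without \G every pair in the string is found, wherever it is.
all_pairs = regex.findall(r"(\w+)=(\w+);", text)

# With \G each match must start exactly where the previous one ended,
# so scanning stops at the space before "c=3;".
leading_pairs = regex.findall(r"\G(\w+)=(\w+);", text)

print(all_pairs, leading_pairs)
```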
   
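(?|...|...)   Branch reset

Capture group numbers are reused across all the alternatives, but groups with different names will have different group numbers.

>>> regex.match(r"(?|(first)|(second))", "first").groups()
('first',)
>>> regex.match(r"(?|(first)|(second))", "second").groups()
('second',)

Note: there is only one group.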
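
For contrast, the sketch below puts an ordinary non-capturing alternation next to a branch reset (assuming the regex package is installed; the fruit patterns are made up). With the branch reset the number is always in group 1, whichever alternative matched.

```python
import regex

plain = regex.compile(r"(?:(\d+) apples|(\d+) oranges)")
reset = regex.compile(r"(?|(\d+) apples|(\d+) oranges)")

# Ordinary alternation: two separate groups, only one of which is filled.
plain_groups = plain.match("7 oranges").groups()

# Branch reset: both alternatives share group 1.
reset_groups = reset.match("7 oranges").groups()

print(plain_groups, reset_groups)
```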

Timeout

The matching methods and functions support timeouts. The timeout (in seconds) applies to the entire operation:

>>> from time import sleep
>>>
>>> def fast_replace(m):
...     return 'X'
...
>>> def slow_replace(m):
...     sleep(0.5)
...     return 'X'
...
>>> regex.sub(r'[a-z]', fast_replace, 'abcde', timeout=2)
'XXXXX'
>>> regex.sub(r'[a-z]', slow_replace, 'abcde', timeout=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python37\lib\site-packages\regex\regex.py", line 276, in sub
    endpos, concurrent, timeout)
TimeoutError: regex timed out

 
