Python: the re module (regular expressions)

1. Metacharacters
(1) \b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

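In other words, \b matches an empty string only at the start or end of a run of letters, digits, or underscores. Formally, it is the boundary between a \w and a \W character, or between a \w character and the start or end of the string (the case where the word sits at the very beginning or end of the string).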

By default Unicode alphanumerics are the ones used in Unicode patterns, but this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.

In practice: because '\b' in an ordinary Python string literal is the backspace character, it is converted before re ever sees the pattern. Write the pattern as a raw string (r'\b') or double the backslash ('\\b').

Examples:

re.findall(r'\b9', ' the i love 9 the 9applethe')
['9', '9']
re.findall(r'\bthe', ' the i thelove 9 the 9applethe')
['the', 'the', 'the']
re.findall(r'\bthe\b', ' the i thelove 9 the 9applethe')
['the', 'the']
re.findall(r'the\b', ' the i thelove 9 the 9applethe')
['the', 'the', 'the']
re.findall(r'\b@', ' the i thelove 9 @ the @9applethe')
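[]     # '@' is not a word character and each '@' here follows a space, so there is no boundary in front of it
re.findall('the\\b', ' the i thelove 9 the 9applethe')
['the', 'the', 'the']  # doubled backslash instead of a raw string
re.findall(r'\b_', ' the i thelove 9 _ the _9applethe')
['_', '_']  # the underscore counts as a word character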
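
The note about raw strings, character classes, and the ASCII flag can be checked with a short sketch. The sample strings and variable names below are made up for illustration; only standard re calls are used.

```python
import re

text = 'the cat sat on the mat'

# Without the r prefix, '\b' is a single backspace character (U+0008),
# so this pattern means "backspace followed by the", not a word boundary.
plain = re.findall('\bthe', text)
raw = re.findall(r'\bthe', text)
escaped = re.findall('\\bthe', text)   # same pattern as the raw string
print(plain, raw, escaped)

# Inside a character class, \b means backspace even in a raw string.
in_class = re.findall(r'[\b]', 'a\bb')
print(in_class)

# With re.ASCII only [a-zA-Z0-9_] are word characters, so the accented
# letter (written here as \u00e9) becomes a boundary.
unicode_words = re.findall(r'\b\w+\b', 'caf\u00e9 bar')
ascii_words = re.findall(r'\b\w+\b', 'caf\u00e9 bar', re.ASCII)
print(unicode_words, ascii_words)
```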

(2) \B

Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so word characters in Unicode patterns are Unicode alphanumerics or the underscore, although this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used.

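In short: \B matches an empty string only where it is not at the start or end of a word (a run of letters, digits, or underscores); it is the opposite of \b.
As with \b, where the word boundaries fall depends on the flags and locale in effect.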

Examples:

re.findall(r'\Bthe', ' the i thelove 9 _ the _9applethe')
['the']
re.findall(r'\Bthe\B', ' the i thelove 9 _ the _9applethe')
[]
re.findall(r'\Bthe\B', ' the i thelove 9 aathebb _ the _9applethe')
['the']
re.findall(r'_\B', ' the i thelove 9 aathebb _ the _9applethe')
['_']
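
As a rough check of the r'py\B' claim quoted above, and of the idea that \b and \B are opposites, here is a small sketch. The word list comes from the quoted documentation; the variable names and the 'ab cd' sample are illustrative.

```python
import re

words = ['python', 'py3', 'py2', 'py', 'py.', 'py!']

# Keep the words in which "py" is followed by another word character.
inside = [w for w in words if re.search(r'py\B', w)]
print(inside)

# Every position in a string is matched by exactly one of \b and \B.
text = 'ab cd'
b_positions = [m.start() for m in re.finditer(r'\b', text)]
B_positions = [m.start() for m in re.finditer(r'\B', text)]
print(b_positions, B_positions)
```

Taken together, the two position lists cover every index from 0 to len(text) with no overlap.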

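(3) '.'
Wildcard: matches any single character except a newline (\n); with the re.DOTALL flag it matches a newline as well. It always consumes exactly one character, so it never matches the empty string.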
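
A minimal sketch of how the dot treats newlines, with and without re.DOTALL (the sample strings are illustrative):

```python
import re

text = 'a\nb c'

# By default the dot skips the newline but does match the space.
default_dot = re.findall(r'.', text)
# With re.DOTALL (alias re.S) the newline is matched too.
dotall_dot = re.findall(r'.', text, re.DOTALL)
print(default_dot, dotall_dot)

# The same difference inside a longer pattern.
no_flag = re.findall(r'a.b', 'a\nb')
with_flag = re.findall(r'a.b', 'a\nb', re.S)
print(no_flag, with_flag)

# The dot needs one character to consume, so an empty string gives nothing.
on_empty = re.findall(r'.', '')
print(on_empty)
```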
(4) The group and groups methods

1) Match.group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1…99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.

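Summary: group() is called on a match object and returns one or more subgroups of the match. With a single argument it returns a string; with several arguments it returns a tuple with one item per argument. With no argument it defaults to 0 and returns the whole match; a non-zero argument returns the string matched by the corresponding parenthesized group. A negative argument, or one larger than the number of groups, raises IndexError. A group that did not take part in the match gives None. A group that matched several times returns only its last match; earlier matches are overwritten.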
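
The rules for group() can be tried on a small made-up pattern. The date-like string and the variable names here are assumptions chosen for illustration, not part of the documentation quoted above.

```python
import re

# Two required groups and one optional group that will not take part.
m = re.match(r'(\d+)-(\d+)(?:-(\d+))?', '2024-05')

whole = m.group()        # no argument: same as group(0)
first = m.group(1)       # one argument: a string
pair = m.group(1, 2)     # several arguments: a tuple
missing = m.group(3)     # group exists but did not participate
print(whole, first, pair, missing)

# A group number beyond those in the pattern raises IndexError.
try:
    m.group(4)
    raised = False
except IndexError:
    raised = True
print(raised)

# A repeated group keeps only its last repetition.
last_repeat = re.match(r'(\w)+', 'abc').group(1)
print(last_repeat)
```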

2) Match.groups(default=None)
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.

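Summary: groups() returns a tuple holding the match of every subgroup. The default argument is the value used for groups that did not take part in the match; it is None unless specified (see Example 2).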

Example 1: basic usage

re.search('(?P<fuck>ab)+cd', 'abababcdc').group()
'abababcd'    # whole match
re.search('(?P<fuck>ab)+cd', 'abababcdc').group(1)
'ab'    # a group that repeats keeps only its last match
re.search('(?P<fuck>(ab)+)cd', 'abababcdc').group()
'abababcd'  # an extra pair of parentheses leaves the whole match unchanged
re.search('(?P<fuck>(ab)+)cd', 'abababcdc').groups()
('ababab', 'ab')  # but the group results are quite different
re.search('(?P<fuck>((ab)+)+)cd', 'abababcdc').groups()
('ababab', 'ababab', 'ab')  # groups() has one element per pair of capturing parentheses
re.search('(?P<fuck>((ab)+)+)cd', 'abababcdc').group('fuck')
'ababab'  # a group can also be fetched by name

Example 2: the default argument of groups

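re.search(r'\w+(\.?)\w+(\.?)\d+', 'www.baidu123').groups('shit')
('.', '')   # the ? is inside the group, so matching zero times still counts as a match and the group is an empty string
re.search(r'\w+(\.?)([a-z]+)(\.)?\d+', 'www.baidu123').groups('shit')
('.', 'baidu', 'shit')   # the ? is outside the group; the group did not take part in the match, so the supplied default is shown
re.search(r'\w+(\.?)[a-z]+(\.)?\d+', 'www.baidu123').groups()
('.', None)  # without an argument the default is None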
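
One place the default argument is handy is an optional field. The sketch below is only an illustration: the function name split_host_port, the host:port format, and the fallback '80' are all assumptions, not anything from the re documentation.

```python
import re

# Hypothetical format: a host name, optionally followed by ":port".
HOST_PORT = re.compile(r'(\w+)(?::(\d+))?')

def split_host_port(value, default_port='80'):
    """Return (host, port) as strings, or None if value does not fit."""
    m = HOST_PORT.fullmatch(value)
    if m is None:
        return None
    # The port group is filled with default_port when it did not participate.
    return m.groups(default_port)

print(split_host_port('localhost:8080'))
print(split_host_port('localhost'))
print(split_host_port('bad host'))
```

Note that the default is applied to every group that did not participate, so this only works cleanly when a single group is optional.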

2. Using the "?" symbol

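import re
print(re.match(r'industr(?P<name>y)(aaa)(\1)', 'industryaaay').group(2))  # grouping; (?P<name>...) is a named group (the name here is a placeholder) and \1 refers back to the first group
print(re.match(r'b', 'bbc').group())  # match looks only at the start of the string and returns the first match, or None if there is none
print(re.search(r'b', 'abcb').group())  # search returns only the first match, or None if there is none
print(re.findall(r'b', 'abcb'))  # findall returns every match as a list
print(re.findall(r'industr(?P<name>y)(aaa)(\1)', 'industryaaay'))  # when the pattern has groups, findall returns only the groups
print(re.findall(r'industr(?:yaaa)', 'industryaaay'))  # whole match is returned; (?:...) is a non-capturing group
print(re.match(r'industr(?:yaaa)+', 'industryaaayaaa').group())  # whole match is returned; (?:...) is a non-capturing group
print(re.findall(r'industr(?=yaaa)', 'industryaaafff'))  # (?=...) is a lookahead: it matches only if yaaa follows, and yaaa is not part of the result
print(re.findall(r'industr(?!syaaa)', 'industryaaafff'))  # (?!...) is a negative lookahead: it matches only if syaaa does NOT follow
print(re.search(r'a(?<=yaaa)bcdef', 'ssyaaabcdef').group())  # (?<=...) is a lookbehind: it matches only if the text just before this point is yaaa
print(re.search(r'a(?<!zaaa)bcdef', 'ssyaaabcdef').group())  # (?<!...) is a negative lookbehind: it matches only if the text just before this point is NOT zaaa (an illustrative value)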

# result
aaa
b
b
['b', 'b']
[('y', 'aaa', 'y')]
['industryaaa']
industryaaayaaa
['industr']
['industr']
abcdef
abcdef
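
Lookarounds test the surrounding text without consuming it, which is why the results above contain only 'industr' or 'abcdef'. A small sketch on made-up strings, which also guards against the None that match and search return on failure:

```python
import re

css = 'width: 10px; margin: 2em; height: 30px'
# Lookahead: digits that are followed by "px"; "px" is not in the result.
px_values = re.findall(r'\d+(?=px)', css)
print(px_values)

bill = 'cost $15, qty 3, fee $7'
# Lookbehind: digits that come right after a dollar sign.
dollar_amounts = re.findall(r'(?<=\$)\d+', bill)
# Negative lookbehind: whole numbers that do not come after a dollar sign.
plain_numbers = re.findall(r'(?<!\$)\b\d+', bill)
print(dollar_amounts, plain_numbers)

# match() and search() return None on failure, so calling .group()
# directly would raise AttributeError; check the result first.
m = re.match(r'b', 'abc')
first_b = m.group() if m is not None else None
print(first_b)
```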

3. Other functions

print(re.split(r'\d+', 'hello 12abc 34def'))
# result
['hello ', 'abc ', 'def']
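
Two variations on re.split that may be useful, sketched on the same sample string: a capturing group keeps the separators in the result, and maxsplit limits the number of splits.

```python
import re

s = 'hello 12abc 34def'

# With a capturing group, the matched separators are kept in the list.
kept = re.split(r'(\d+)', s)
# maxsplit=1 splits at the first separator only.
limited = re.split(r'\d+', s, maxsplit=1)
print(kept, limited)

# A separator at the very end leaves an empty string as the last item.
trailing = re.split(r'\d+', 'a1b2')
print(trailing)
```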

re.subn(r'\d+', 'ppp', 'sdfsdf123sfkj4dfjkd5')
# result
('sdfsdfpppsfkjpppdfjkdppp', 3)
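
re.subn is re.sub plus a replacement count. The sketch below shows re.sub itself with a backreference, a function as the replacement, and the count argument; the sample string 'a1b22' is made up.

```python
import re

s = 'a1b22'

# re.sub returns only the new string (no count).
replaced = re.sub(r'\d+', 'ppp', s)
# \1 in the replacement refers to the first captured group.
wrapped = re.sub(r'(\d+)', r'<\1>', s)
# The replacement may be a function that receives each match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), s)
# count limits how many matches are replaced.
first_only = re.sub(r'\d+', 'ppp', s, count=1)
print(replaced, wrapped, doubled, first_only)
```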

a = re.compile(r'\d+')
a.findall('asdsf343sdf45')
# result
['343', '45']
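
A compiled pattern offers the same operations as the module-level functions, as methods, and can carry flags. A brief sketch; the variable names and the second sample string are illustrative.

```python
import re

number = re.compile(r'\d+')
s = 'asdsf343sdf45'

found = number.findall(s)
first_span = number.search(s).span()   # (start, end) of the first match
masked = number.sub('#', s)
print(found, first_span, masked)

# Flags are given once, at compile time.
word = re.compile(r'python', re.IGNORECASE)
variants = word.findall('Python PYTHON python')
print(variants)
```

Compiling mainly pays off when the same pattern is reused many times or passed around as an object.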
