Python: Pattern Matching and Regular Expressions

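Summary of regular expression symbols:
? matches the preceding group zero or one time;
* matches the preceding group zero or more times;
+ matches the preceding group one or more times;
{n} matches the preceding group exactly n times;
{n,} matches the preceding group n or more times;
{,m} matches the preceding group zero to m times;
{n,m} matches the preceding group at least n and at most m times;
^spam means the string must begin with spam;
spam$ means the string must end with spam;
. matches any character except a newline;
\d, \w, \s match a digit, a word character, and a whitespace character, respectively;
\D, \W, \S match any character that is not a digit, not a word character, and not a whitespace character, respectively;
[abc] matches any one character inside the brackets;
[^abc] matches any one character that is not inside the brackets.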
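
The quantifier rows of the summary can be checked quickly. This is a minimal sketch, not part of the original notes: it uses re.fullmatch(), which succeeds only when the whole string matches, and the `checks` table and its test strings are made up for illustration.

```python
import re

# Each entry: (pattern, text, whether the WHOLE text is expected to match).
checks = [
    (r'(Ha)?',     '',       True),   # ? allows zero occurrences
    (r'(Ha)?',     'HaHa',   False),  # ... but at most one
    (r'(Ha)*',     'HaHaHa', True),   # * allows any number of repeats
    (r'(Ha)+',     '',       False),  # + needs at least one
    (r'(Ha){2}',   'HaHa',   True),   # exactly 2
    (r'(Ha){2,}',  'HaHaHa', True),   # 2 or more
    (r'(Ha){,2}',  'Ha',     True),   # anywhere from 0 to 2
    (r'(Ha){,2}',  'HaHaHa', False),  # 3 is over the upper bound
    (r'(Ha){1,2}', 'HaHa',   True),   # between 1 and 2
]

# Compare the actual outcome of each check with the expected one.
results = [bool(re.fullmatch(pattern, text)) == expected
           for pattern, text, expected in checks]
print(all(results))
```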

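I. Steps for using regular expressions in Python:
1. Import the regular expression module with import re;
2. Create a Regex object with re.compile() (remember to use a raw string);
3. Pass the string you want to search to the Regex object's search() method, which returns a Match object (or None if nothing matches);
4. Call the Match object's group() method to get the actual matched text as a string.
Web-based regular expression tester: http://regexpal.com/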

import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242')
print('Phone number found: ' + mo.group())

Phone number found: 415-555-4242

II. Matching more patterns with regular expressions

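1. Grouping with parentheses: the first pair of parentheses in the regular expression string is group 1, the second pair is group 2, and so on. Passing an integer (1, 2, ...) to the Match object's group() method returns the corresponding part of the matched text. Passing 0, or no argument at all, returns the entire matched text.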

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242')
print('Phone number found: ' + mo.group(1))
print(mo.group(2))
print(mo.group(3))

Phone number found: 415
555
4242
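
As a small supplement, the sketch below shows that group(0) and group() return the same full match, and uses the Match object's groups() method to get all groups at once. The notes above do not mention groups(); it is a standard method of the re module. The variable names are illustrative.

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
mo = phone_regex.search('My number is 415-555-4242')

whole = mo.group(0)               # identical to mo.group()
area, prefix, line = mo.groups()  # tuple of all groups, unpacked

print(whole == mo.group())
print(area, prefix, line)
```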

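2. Matching multiple alternatives with the pipe: use "|" when you want to match any one of several expressions. If more than one alternative occurs in the searched text, the first occurrence is returned as the Match object.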

heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey.')
mo2 = heroRegex.search('Tina Fey and Batman.')
print(mo1.group())
print(mo2.group())

Batman
Tina Fey

  To match any one of 'Batman', 'Batmobile', 'Batcopter', and 'Batbat', write the shared prefix once and put the alternatives in parentheses.

heroRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo3 = heroRegex.search('Batmobile lost a wheel.')
print(mo3.group(1))

mobile

3. Optional matching with the question mark ?: the ? character marks the group before it as optional in the pattern.

batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())
print(mo2.group())

Batman
Batwoman

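4. Matching zero or more times with the star *: the group before the * can occur any number of times in the text. It can be absent entirely or repeated over and over.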

batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo2 = batRegex.search('The Adventures of Batwoman')
mo3 = batRegex.search('The Adventures of Batwowowowowoman')
print(mo1.group())
print(mo2.group())
print(mo3.group())

Batman
Batwoman
Batwowowowowoman

5. Matching one or more times with the plus +: the + character requires the group before it to appear at least once.

batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batman')
mo2 = batRegex.search('The Adventures of Batwoman')
mo3 = batRegex.search('The Adventures of Batwowowowowoman')
print(mo1 is None)
print(mo2.group())
print(mo3.group())

True
Batwoman
Batwowowowowoman

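6. Matching specific repetitions with curly braces {}: match the preceding group a specific number of times, or a number of times within a range.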

haRegex = re.compile(r'(Ha){3}')
baRegex = re.compile(r'(Ba){3,5}')
mo1 = haRegex.search('HaHaHa')
mo2 = haRegex.search('ha')
mo3 = baRegex.search('sdfiBaBaBaBa3fdf')
print(mo1.group())
print(mo2 is None)
print(mo3.group())

HaHaHa
True
BaBaBaBa

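7. Greedy and non-greedy matching: Python regular expressions are greedy by default, meaning that in ambiguous cases they match the longest possible string. The non-greedy form of the curly braces, written with a question mark after the closing brace, matches the shortest possible string.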

greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
print(mo1.group())
print(mo2.group())

HaHaHaHaHa
HaHaHa

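8. The search() and findall() methods: search() returns a Match object for the first match in the searched string, whereas findall() returns every match in the searched string. If the regular expression has no groups, findall() returns a list of matched strings; if it has groups, findall() returns a list of tuples, with one string per group in each tuple.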

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
phones = phoneNumRegex.findall('Cell:415-983-8876 Work:425-098-7723')
phoneNum1Regex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
phones1 = phoneNum1Regex.findall('Cell:415-983-8876 Work:425-098-7723')
print(phones)
print(phones1)

['415-983-8876', '425-098-7723']
[('415', '983', '8876'), ('425', '098', '7723')]

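9. Character classes:
    (1) \d: any digit from 0 to 9;
    (2) \D: any character that is not a digit from 0 to 9;
    (3) \w: any letter, digit, or underscore (think of it as matching "word" characters);
    (4) \W: any character that is not a letter, digit, or underscore;
    (5) \s: a space, tab, or newline (think of it as matching "whitespace" characters);
    (6) \S: any character that is not a space, tab, or newline.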

xmasRegex = re.compile(r'\d+\s\w+')
xmax = xmasRegex.findall('12 drummers,11 pipers,10 lords,9 ladies,8 maids,7 swans,6 geese,5 rings, 4 birds,3 hens,2 doves,1 partridge')
print(xmax)

['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']

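  10. Making your own character classes: square brackets [] define a custom character class, and a hyphen - gives a range of letters or digits. Ordinary regular expression symbols are not interpreted as such inside the brackets. Placing a caret ^ right after the opening bracket gives a negative character class, which matches every character that is not in the class.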

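# [aeiouAEIOU] matches vowels; the negated class matches everything else,
# including spaces and punctuation, not only consonants.
vowelRegex = re.compile(r'[aeiouAEIOU]')
c = vowelRegex.findall('RoboCop eats baby food.BABY FOOD.')
nonVowelRegex = re.compile(r'[^aeiouAEIOU]')
n = nonVowelRegex.findall('RoboCop eats baby food.BABY FOOD.')
print(c)
print(n)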

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']
['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', 'B', 'B', 'Y', ' ', 'F', 'D', '.']

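  11. The caret ^ and the dollar sign $: a ^ at the start of a regular expression means the match must occur at the beginning of the searched text; a $ at the end of a regular expression means the text must end with this pattern.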
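
The notes give no example for the anchors, so here is a minimal sketch. The pattern names and test strings are chosen for illustration only. Using ^ and $ together requires the entire string to match the pattern.

```python
import re

begins_with_hello = re.compile(r'^Hello')
ends_with_digit = re.compile(r'\d$')
whole_string_digits = re.compile(r'^\d+$')

# search() returns None when the anchored pattern cannot match.
r1 = begins_with_hello.search('Hello world!') is not None
r2 = begins_with_hello.search('He said Hello.') is not None
r3 = ends_with_digit.search('Your number is 42') is not None
r4 = whole_string_digits.search('1234567890') is not None
r5 = whole_string_digits.search('12345xyz67890') is not None

print(r1, r2, r3, r4, r5)
```

Only the first, third, and fourth searches succeed: the second string does not start with Hello, and the last one contains non-digit characters between its start and end.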
  12. The wildcard character: the dot . matches any character except a newline.

atRegex = re.compile(r'.at')
at = atRegex.findall('The cat in the hat sat on the flat mat.')
print (at)

['cat', 'hat', 'sat', 'lat', 'mat']

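13. Matching everything with dot-star (.*): the dot means "any single character except a newline" and the star means "zero or more of the preceding", so together they match any run of text.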

nameRegex = re.compile(r'First Name:(.*) Last Name:(.*)')
mo = nameRegex.search('First Name:Al Last Name:Sweigart')
print(mo.group(1))
print(mo.group(2))

Al
Sweigart
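
Because dot-star is greedy, it is worth seeing next to its non-greedy form (.*?), which follows the same rule described for curly braces in item 7. The sketch below also uses re.DOTALL, a standard flag that the notes do not cover, to let the dot match newlines as well. The sample strings are illustrative.

```python
import re

text = '<To serve man> for dinner.>'
greedy = re.compile(r'<.*>').search(text).group()      # runs to the LAST '>'
nongreedy = re.compile(r'<.*?>').search(text).group()  # stops at the FIRST '>'

multiline = 'Serve the public trust.\nProtect the innocent.'
# Without re.DOTALL the dot stops at the newline.
no_newline = re.compile(r'.*').search(multiline).group()
# With re.DOTALL the dot also matches '\n', so the match spans both lines.
with_newline = re.compile(r'.*', re.DOTALL).search(multiline).group()

print(greedy)
print(nongreedy)
print(no_newline)
print(with_newline == multiline)
```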

14. Case-insensitive matching: pass re.I (short for re.IGNORECASE) as the second argument to re.compile().

robocop = re.compile(r'robocop',re.I)
r = robocop.search('RoboCop is part man,part machine,all cop.')
print (r.group())

RoboCop

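15. Replacing strings with the sub() method: a Regex object's sub() method takes two arguments. The first is the replacement string that takes the place of each match. The second is the string to be searched with the regular expression. sub() returns a new string with the replacements applied.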

namesRegex = re.compile(r'Agent \w+')
names = namesRegex.sub('CENSORED','Agent Alice gave the secret documents to Agent Bob.')
print (names)

CENSORED gave the secret documents to CENSORED.
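
The replacement string can also reuse part of the match. The sketch below goes slightly beyond the notes: in the first argument to sub(), \1 stands for the text of group 1, which is standard re behavior. The pattern and variable names are illustrative.

```python
import re

# Group 1 captures only the first letter of the agent's name.
agent_regex = re.compile(r'Agent (\w)\w*')

# \1 in the replacement is filled in with group 1 of each match.
masked = agent_regex.sub(
    r'\1****',
    'Agent Alice gave the secret documents to Agent Bob.')
print(masked)
```

This keeps the first letter of each name and masks the rest, instead of replacing the whole match with a fixed word.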

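16. Managing complex regular expressions: pass re.VERBOSE as the second argument to re.compile() so that whitespace and # comments inside the pattern are ignored. This lets you spread a pattern over several annotated lines.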

phoneRegex = re.compile(r'''
(\d{3}|\(\d{3}\))?              # area code, optionally in parentheses
(\s|-|\.)?                      # separator
\d{3}                           # first 3 digits
(\s|-|\.)                       # separator
\d{4}                           # last 4 digits
(\s*(ext|x|ext\.)\s*\d{2,5})?   # extension
''', re.VERBOSE)
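
A verbose phone-number pattern of this kind might be used as follows. This is a sketch under assumptions: the pattern is wrapped in one extra outer group so that findall() returns the whole number as the first item of each tuple, and the sample text is invented.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code, optionally in parentheses
    (\s|-|\.)?                      # separator
    \d{3}                           # first 3 digits
    (\s|-|\.)                       # separator
    \d{4}                           # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?   # extension
)''', re.VERBOSE)

text = 'Call 415-555-4242 or (415) 555-1234 ext 99.'

# With groups present, findall() returns one tuple per match;
# index 0 is the outer group, i.e. the complete phone number.
numbers = [groups[0] for groups in phone_regex.findall(text)]
print(numbers)
```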
