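A regular expression (often abbreviated in code as regex, regexp, or RE) is a concept from computer science. Regular expressions are typically used to search for, and replace, text that matches a given pattern (rule).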
Use cases
To use regular expressions in Python, import the re module:
import re
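Let's start with two examples to get a feel for what regular expressions can do.
For example, given the list:
li = ["Chinese", "China", "English", "Britain", "Canada", "New Zealand"]
find the strings that start with Ch.
# Method 1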
li = ["Chinese", "China", "English", "Britain", "Canada", "New Zealand"]
lt = []
for i in li:
    if i[0:2] == "Ch":
        lt.append(i)
print(lt)
# Method 2
li = ["Chinese", "China", "English", "Britain", "Canada", "New Zealand"]
print([i for i in li if i[0:2]=="Ch"])
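Both methods above rely on slicing. The same filter can also be written with a regular expression; the sketch below is one possible version, and the name `starts_with_ch` is introduced here purely for illustration.

```python
import re

li = ["Chinese", "China", "English", "Britain", "Canada", "New Zealand"]

# re.match only matches at the beginning of a string, so the pattern "Ch"
# selects exactly the strings that start with "Ch".
starts_with_ch = [s for s in li if re.match(r"Ch", s)]
print(starts_with_ch)  # ['Chinese', 'China']
```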
Suppose a = "ab23fd5g67" and we want to extract 23, 5, and 67.
With a regular expression:
>>> import re
>>> a = "ab23fd5g67"
>>> m = r'[0-9]+'
>>> num = re.findall(m,a)
>>> num
['23', '5', '67']
>>>
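The introduction also mentions replacement. As a minimal sketch, the same pattern can be passed to re.sub to replace each run of digits; the replacement string "#" and the names `masked` and `numbers` are arbitrary choices made for this illustration.

```python
import re

a = "ab23fd5g67"
pattern = r"[0-9]+"

# re.sub replaces every non-overlapping match of the pattern.
masked = re.sub(pattern, "#", a)
print(masked)  # ab#fd#g#

# findall returns strings; convert them if actual numbers are needed.
numbers = [int(n) for n in re.findall(pattern, a)]
print(numbers)  # [23, 5, 67]
```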
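Character | Function |
---|---|
. | Matches any single character except \n |
[ ] | Matches any one of the characters listed inside the brackets |
\d | Matches a digit (0-9) |
\D | Matches a non-digit (the negation of \d) |
\w | Matches a word character: A-Z, a-z, 0-9, _ (for str patterns in Python 3, also other Unicode letters and digits) |
\W | The negation of \w |
\s | Matches a whitespace character, such as a space or a tab (\t) |
\S | The negation of \s |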
For example:
.
import re
m = re.match('.','s')
print(m.group()) #s
m = re.match('.','sad')
print(m.group()) #s
m = re.match('.','\n')
print(m.group()) #AttributeError: 'NoneType' object has no attribute 'group'
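The last call fails because "." does not match a newline, so re.match returns None. The sketch below shows two ways to handle this: testing the result before calling group(), and passing the re.DOTALL flag, which makes "." match newlines as well. It is only an illustration of the behaviour described above.

```python
import re

# By default "." does not match "\n", so match() returns None.
m = re.match(r".", "\n")
print(m.group() if m else "no match")  # no match

# With re.DOTALL, "." matches any character, including "\n".
m_dotall = re.match(r".", "\n", re.DOTALL)
print(repr(m_dotall.group()))  # '\n'
```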
[ ]
import re
#If the first letter of sad is lowercase, the pattern needs a lowercase s
m = re.match('s','sad')
print(m.group()) #s
#If the first letter of Sad is uppercase, the pattern needs an uppercase S
m = re.match('S','Sad')
print(m.group()) #S
m = re.match('[sS]','sad')
print(m.group()) #s
m = re.match('[sS]','Sad')
print(m.group()) #S
m = re.match('[0-9]','666Sad')
print(m.group()) #6
\d, \D
m = re.match(r'\d','666Sad')
print(m.group()) #6
m = re.match(r'\D','Sad')
print(m.group()) #S
\w, \W
m = re.match(r'\w','121ios')
print(m.group()) #1
m = re.match(r'\W','$qw')
print(m.group()) #$
\s, \S
m = re.match(r'\s'," askda")
print(m.group()) #a single space
m = re.match(r'\S',"askda")
print(m.group()) #a
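One detail worth checking in Python 3: for str patterns, \w is Unicode-aware by default, so it matches more than A-Z, a-z, 0-9 and _. The sketch below compares the default behaviour with the re.ASCII flag; the sample string is made up for this illustration.

```python
import re

sample = "正则 regex_101"

# Default: \w also matches Unicode letters such as Chinese characters.
unicode_words = re.findall(r"\w+", sample)
print(unicode_words)  # ['正则', 'regex_101']

# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", sample, re.ASCII)
print(ascii_words)  # ['regex_101']
```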
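Character | Function |
---|---|
* | Matches the preceding character zero or more times (it may be absent, or repeated any number of times) |
+ | Matches the preceding character one or more times (it must appear at least once) |
? | Matches the preceding character zero times or once (it is optional) |
{m} | Matches the preceding character exactly m times |
{m,} | Matches the preceding character at least m times |
{m,n} | Matches the preceding character between m and n times |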
m = re.match('[A-Z][a-z]*',"AasAkda")
print(m.group()) #Aas
m = re.match('[A-Z]+[a-z]',"AAasAkda")
print(m.group()) #AAa
m = re.match('[A-Z]+[i]?[a-z]*',"AAasAkda")
print(m.group()) #AAas
m = re.match('[A-Z]+[i]?[a-z]{3}',"AAasffAkda")
print(m.group()) #AAasf
m = re.match('[A-Z]+[i]?[a-z]{1,}',"AAasffAkda")
print(m.group()) #AAasff
m = re.match('[A-Z]+[i]?[a-z]{1,3}',"AAasffAkda")
print(m.group()) #AAasf
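To compare the quantifiers side by side, the sketch below applies each one to the same two inputs. The helper `matched` is hypothetical and exists only to avoid calling group() on None; note that `*` and `?` succeed with an empty match when the character is absent, while `+` and the counted forms fail.

```python
import re

patterns = ["a*", "a+", "a?", "a{2}", "a{2,}", "a{1,3}"]

def matched(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = re.match(pattern, text)
    return m.group() if m else None

for p in patterns:
    print(p, repr(matched(p, "aaaab")), repr(matched(p, "b")))
# a* 'aaaa' ''
# a+ 'aaaa' None
# a? 'a' ''
# a{2} 'aa' None
# a{2,} 'aaaa' None
# a{1,3} 'aaa' None
```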
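Character | Function |
---|---|
^ | Matches the start of the string |
$ | Matches the end of the string |
\b | Matches a word boundary |
\B | Matches a position that is not a word boundary |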
m = re.match(r'^\w+\s\w+\s\w+$',"ci ty university")
print(m.group()) #ci ty university
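The example above only exercises ^ and $. As a small sketch of \b and \B, the code below looks for "cat" in a made-up sentence; the sample text and variable names are assumptions for illustration only.

```python
import re

text = "cat concat category"

# \bcat\b: "cat" as a whole word only.
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)  # ['cat']

# \Bcat: "cat" that does not start a word, i.e. the one inside "concat".
inside = [m.start() for m in re.finditer(r"\Bcat", text)]
print(inside)  # [7]

# \bcat: "cat" at the start of a word ("cat" and "category").
at_start = [m.start() for m in re.finditer(r"\bcat", text)]
print(at_start)  # [0, 11]
```

Raw strings matter here: in a normal string literal "\b" is a backspace character, so the pattern should be written as r"\b".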
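Character | Function |
---|---|
| | Matches either the expression on its left or the one on its right |
(ab) | Treats the characters inside the parentheses as a group |
\num | Refers back to the string matched by group number num |
(?P<name>) | Gives the group the alias name |
(?P=name) | Refers back to the string matched by the group whose alias is name |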
Matching a number from 0 to 100:
m = re.match(r'0$|100$|[1-9]\d?$',"18")
print(m.group()) #18
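To see how the alternation behaves on other inputs, the sketch below wraps the same three alternatives in a helper. It uses re.fullmatch instead of the $ anchors, which is a substitution made here rather than the original approach; the name `is_0_to_100` is hypothetical.

```python
import re

# The same three alternatives as above; fullmatch requires the whole
# string to match, so the trailing "$" anchors are not needed.
PATTERN = re.compile(r"0|100|[1-9]\d?")

def is_0_to_100(s):
    """Return True if `s` is the decimal form of an integer in 0..100."""
    return PATTERN.fullmatch(s) is not None

for s in ["0", "7", "18", "100", "101", "007", "-1"]:
    print(s, is_0_to_100(s))
# 0 True
# 7 True
# 18 True
# 100 True
# 101 False
# 007 False
# -1 False
```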
text = "hello world"
pat = "(.*)"
res = re.match(pat,text)
print(res.group()) #hello world
print(res.group(1)) #hello world
print(res.groups()) #('hello world',)
text = "hello world"
pat = r"(\w*)\s(\w*)"
res = re.match(pat,text)
print(res.group()) #hello world
print(res.group(1)) #hello
print(res.group(2)) #world
print(res.groups()) #('hello', 'world')
text = "<span><h1>hello world!</h1></span>"
pat = r"<(.+)><(.+)>(.*)</\2></\1>"
res = re.match(pat,text)
print(res.group()) #<span><h1>hello world!</h1></span>
print(res.groups()) #('span', 'h1', 'hello world!')
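The table also lists named groups, which the examples above do not use. The sketch below rewrites the numbered back-reference example with (?P<name>...) and (?P=name). It assumes a sample string of nested tags consistent with the ('span', 'h1', 'hello world!') output shown above; the group names outer, inner, and body are chosen for illustration.

```python
import re

text = "<span><h1>hello world!</h1></span>"

# (?P<outer>...) and (?P<inner>...) name the groups;
# (?P=inner) and (?P=outer) must repeat whatever those groups matched.
pat = r"<(?P<outer>\w+)><(?P<inner>\w+)>(?P<body>.*)</(?P=inner)></(?P=outer)>"

res = re.match(pat, text)
print(res.group("outer"))  # span
print(res.group("inner"))  # h1
print(res.group("body"))   # hello world!

# A closing tag that does not repeat the opening name makes the match fail.
bad = re.match(pat, "<span><h1>hello world!</h2></span>")
print(bad)  # None
```

Named groups are still numbered, so res.group(1) and res.group("outer") return the same string.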
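The difference between re.match() and re.search():
- re.match() only matches at the beginning of the string; if that fails, it returns None.
- re.search() scans through the whole string until it finds the first match; if there is none, it returns None.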
print(re.search('yun','Aliyun is a yun.').group()) #yun
print(re.findall('yun','Aliyun is a yun.')) #['yun', 'yun']
print(next(re.finditer('yun','Aliyun is a yun.')).group()) #yun
for i in re.finditer('yun','Aliyun is a yun.'):
    print(i.group())
# yun
# yun
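The difference between re.match() and re.search() can be seen directly on the same sample string. This is a minimal sketch; the variable names are chosen for illustration.

```python
import re

text = "Aliyun is a yun."

# match() only tries at position 0, where the text starts with "Ali".
at_start = re.match("yun", text)
print(at_start)  # None

# search() scans forward and reports the first match it finds.
first = re.search("yun", text)
print(first.group(), first.span())  # yun (3, 6)

# Both return None on failure, so test the result before calling group().
missing = re.search("xyz", text)
print(missing.group() if missing else "no match")  # no match
```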