Python: Regular Expressions

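Regular expressions are a topic in their own right, and nearly every programming language supports them.
Regular expressions are used to match or extract text from strings. The core syntax is largely the same wherever it is used, although individual engines differ in some details. Let's start with an example:
Checking whether an email address submitted by a user is well-formed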

import re
yx = '[email protected]'
yxx = re.findall(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.com$', yx)
print(yxx)
Output: [email protected]
yx = '[email protected]'
yxx = re.findall(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.com$', yx)
print(yxx)
Output: [email protected]
yx = '[email protected]'
yxx = re.findall(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.com$', yx)
print(yxx)
Output: []  # no match
yx = '13900003333163.com'
yxx = re.findall(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.com$', yx)
print(yxx)
Output: []  # no match: the string contains no @

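In the example above, the r prefix on the pattern makes it a raw string, so Python itself does not interpret the backslashes, while \. is the regular-expression escape that makes the dot match a literal '.'. These are two different levels of escaping.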
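The check described above can be sketched as a small helper. The function name is_valid_email, the simplified pattern, and the sample addresses are all made up for illustration; real email validation is considerably more involved.

```python
import re

# Simplified, hypothetical pattern: letters/digits, "@", letters/digits, literal ".com".
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9]+@[a-zA-Z0-9]+\.com')

def is_valid_email(address):
    """Return True if the whole string fits the simplified pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None

# Made-up sample addresses, for illustration only.
samples = ['user123@example.com', 'user123@example.cn',
           'user123example.com', 'user123@examplexcom']
results = {s: is_valid_email(s) for s in samples}
print(results)

# The raw-string prefix only changes how Python reads the literal:
same_pattern = (r'\.' == '\\.')
print(same_pattern)
```

Using fullmatch means the whole string has to fit the pattern, so no explicit ^ or $ is needed in this sketch.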

The re module

To use regular expressions in Python, import the re module.

import re

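Common functions in re

1. findall: finds all non-overlapping substrings that match the pattern and returns them as a list. If nothing matches, it returns an empty list.

re.findall(pattern, string)  # (regex pattern, string to search)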

import re
a = 'hello world python hello world python hello world python'
b = re.findall('llo', a)
print(b)
Output: ['llo', 'llo', 'llo']
b = re.findall('123', a)
print(b)
Output: []  # the string contains no digits, so nothing matches and an empty list is returned
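findall becomes more useful once the pattern is more than a literal string. A minimal sketch, using a made-up sample string:

```python
import re

text = 'order 12, order 345, order 6'   # hypothetical sample data

numbers = re.findall(r'\d+', text)      # every run of one or more digits
missing = re.findall(r'xyz', text)      # pattern that never occurs

print(numbers)  # ['12', '345', '6']
print(missing)  # []
```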

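2. match: re.match tries to match the pattern at the start of the string. On success it returns a match object (which holds information about the match); if the pattern does not match at the start, match() returns None.
Note: match returns at most one match.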

import re
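a = 'hello world python hello world python hello world python'
b = re.match('llo', a)
print(b)
Output: None  # 'llo' is not at the start of the string; calling b.group() here would raise AttributeError

a = 'hello world python hello world python hello world python'
b = re.match('hel', a)
print(b)
Output: <re.Match object; span=(0, 3), match='hel'>  # span is the index range covered by the match
print(b.group())
Output: hel  # group() returns the matched text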
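Because match returns None when it fails, it is safer to check the result before calling group(). A minimal sketch with a made-up string:

```python
import re

text = 'hello world'

m = re.match(r'hel', text)
if m is not None:
    matched = m.group()   # the matched text
    span = m.span()       # (start, end) indices of the match
    print(matched, span)

no_match = re.match(r'llo', text)   # 'llo' is not at the start
print(no_match)
```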

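3. search: re.search scans the whole string and, on success, returns a match object (which holds information about the match).
Note: search also returns at most one match. It stops at the first position that fits the pattern and does not keep looking.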

import re
a = 'hello world python hello world python hello world python'
b = re.search('llo', a)
print(b)
Output: <re.Match object; span=(2, 5), match='llo'>

a = 'hello world python hello world python hello world python'
b = re.search('234', a)
print(b)
Output: None
# group() returns the matched text, just as with match.

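Difference between re.match and re.search:
re.match only tries to match at the start of the string; if the start does not fit the pattern, the match fails and None is returned.
re.search scans the whole string and returns the first match; it returns None only if no position matches.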
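The difference is easy to see side by side. A small sketch with a made-up string:

```python
import re

text = 'hello world python'

at_start = re.match(r'world', text)    # 'world' is not at index 0
anywhere = re.search(r'world', text)   # found further into the string

print(at_start)          # None
print(anywhere.span())   # (6, 11)
```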

4. sub: re.sub replaces every substring that matches the pattern.

import re
a = 'hello world hello python hello java'
b = re.sub('hello', 'hi', a)  # (pattern, replacement, string)
print(b)
Output: hi world hi python hi java
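re.sub also accepts a count argument and group references in the replacement, and re.subn reports how many replacements were made. A short sketch; the date string is made up:

```python
import re

text = 'hello world hello python hello java'

first_only = re.sub('hello', 'hi', text, count=1)   # replace only the first match
replaced, n = re.subn('hello', 'hi', text)          # also return the number of replacements

# \1, \2, \3 in the replacement refer to the captured groups.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', '2024-01-15')

print(first_only)
print(replaced, n)
print(swapped)
```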

When a pattern passed to findall contains parentheses (a capture group), only the text inside the parentheses is returned.

import re
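# The tags below are a minimal reconstruction; only the href values are known from the output.
a = '''
<div>
    <a href="http://app.tanwan.com/htmlcode/9093.html"></a>
    <a href="https://app.tanwan.com/htmlcode/9094.html"></a>
    <a href="https://app.tanwan.com/htmlcode/9095.html"></a>
    <a href="https://app.tanwan.com/htmlcode/9096.html"></a>
    <a href="https://app.tanwan.com/htmlcode/9097.html"></a>
    <a href="http://app.tanwan.com/htmlcode/9098.html"></a>
    <a href="https://app.tanwan.com/htmlcode/9099.html"></a>
    <a href="https://app.tanwan.com/htmlcode/9100.html"></a>
</div>
'''

b = re.findall('href="(.*?)"', a)  # only the text inside () is returned; href="..." serves as the surrounding marker
for i in b:
    print(i)
Output: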
http://app.tanwan.com/htmlcode/9093.html
https://app.tanwan.com/htmlcode/9094.html
https://app.tanwan.com/htmlcode/9095.html
https://app.tanwan.com/htmlcode/9096.html
https://app.tanwan.com/htmlcode/9097.html
http://app.tanwan.com/htmlcode/9098.html
https://app.tanwan.com/htmlcode/9099.html
https://app.tanwan.com/htmlcode/9100.html
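What findall returns depends on the number of capture groups in the pattern. A small sketch; the HTML fragment is made up for illustration:

```python
import re

html = '<a href="/a.html">A</a><a href="/b.html">B</a>'   # hypothetical markup

no_group = re.findall(r'href=".*?"', html)             # no group: whole matches
one_group = re.findall(r'href="(.*?)"', html)          # one group: only the group text
two_groups = re.findall(r'href="(.*?)">(.*?)<', html)  # several groups: tuples

print(no_group)
print(one_group)
print(two_groups)
```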

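Metacharacters

Characters that carry a special meaning of their own are called metacharacters, for example: * , $ , [] , . and so on.

Single-character matching

Figure 1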
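As a quick sketch, here are some of the standard single-character metacharacters in Python's re module, tried on made-up strings:

```python
import re

sample = 'a1 b2_c!'   # hypothetical sample string

digits = re.findall(r'\d', sample)        # \d: any digit
word_chars = re.findall(r'\w', sample)    # \w: letter, digit or underscore
spaces = re.findall(r'\s', sample)        # \s: whitespace
non_word = re.findall(r'\W', sample)      # \W: anything \w does not match
any_char = re.findall(r'.', 'a\nb')       # . : any character except a newline

print(digits, word_chars, spaces, non_word, any_char)
```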

Quantifier metacharacters

Figure 2
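A short sketch of the standard quantifiers, each applied to the same made-up string:

```python
import re

text = 'a ab abb abbb'   # hypothetical sample string

star = re.findall(r'ab*', text)          # *: zero or more b
plus = re.findall(r'ab+', text)          # +: one or more b
optional = re.findall(r'ab?', text)      # ?: zero or one b
exactly = re.findall(r'ab{2}', text)     # {2}: exactly two b
between = re.findall(r'ab{2,3}', text)   # {2,3}: two or three b

print(star, plus, optional, exactly, between)
```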

Boundary metacharacters (anchors)

Figure 3
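Anchors match positions rather than characters. A minimal sketch of ^, $ and \b on made-up strings:

```python
import re

text = 'python is fun, I like python'   # hypothetical sample string

at_start = re.findall(r'^python', text)         # ^: start of the string
at_end = re.findall(r'python$', text)           # $: end of the string
not_at_start = re.findall(r'^fun', text)        # 'fun' is not at the start
whole_word = re.findall(r'\bis\b', 'this is it')  # \b: word boundary

print(at_start, at_end, not_at_start, whole_word)
```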

Group matching

Figure 4
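Groups let you pull individual parts out of a match. A small sketch using numbered groups, named groups and a backreference; the sample strings and group names are made up:

```python
import re

m = re.search(r'(\d{3})-(\d{4})', 'call 555-1234 now')
whole = m.group(0)    # the entire match
first = m.group(1)    # first group
both = m.groups()     # all groups as a tuple

# Named groups use the (?P<name>...) syntax.
named = re.search(r'(?P<area>\d{3})-(?P<number>\d{4})', 'call 555-1234 now')
area = named.group('area')

# \1 refers back to whatever the first group matched (here: a repeated word).
repeated = re.findall(r'\b(\w+) \1\b', 'the the cat sat sat')

print(whole, first, both, area, repeated)
```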

Alternation: |
import re
a = 'hello world python hello world python hello world python'
b = re.findall('hello|python', a)
print(b)
Output: ['hello', 'python', 'hello', 'python', 'hello', 'python']
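When only part of a pattern should alternate, the alternatives need to be wrapped in a group. A sketch with a made-up string; note how a capturing group changes what findall returns:

```python
import re

text = 'gray grey griy'   # hypothetical sample string

non_capturing = re.findall(r'gr(?:a|e)y', text)   # (?:...) groups without capturing
capturing = re.findall(r'gr(a|e)y', text)         # (...) makes findall return the group only

print(non_capturing)
print(capturing)
```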

Character classes:
a = 'hello world python hello world python hello world python'
b = re.findall('[he]', a)  # the whole bracket expression stands for a single character
print(b)
Output: ['h', 'e', 'h', 'h', 'e', 'h', 'h', 'e', 'h']

a = 'abbj abbc abbd'
b = re.findall('abb[abd]', a)
print(b)
Output: ['abbd']

Comparing [] with []{n}

import re
a = '1209809wiahiwqdi821e01'
b = re.findall('[a-zA-Z0-9]{5}', a)
print(b)
Output: ['12098', '09wia', 'hiwqd', 'i821e']

a = '1209809wiahiwqdi821e01'
b = re.findall('[a-zA-Z0-9]', a)
print(b)
Output: ['1', '2', '0', '9', '8', '0', '9', 'w', 'i', 'a', 'h', 'i', 'w', 'q', 'd', 'i', '8', '2', '1', 'e', '0', '1']

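Negated character class: ^

import re
a = '1209809 alsj rpokgtp'
b = re.findall('[^a-zA-Z0-9]', a)  # [^...] matches any character NOT listed in the brackets. A ^ placed before the brackets anchors the match at the start of the string; as the first character inside the brackets it negates the class.
print(b)
Output: [' ', ' ']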
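The two roles of ^ can be compared directly. A minimal sketch on the same kind of string:

```python
import re

text = '1209809 alsj rpokgtp'

leading_digits = re.findall(r'^[0-9]+', text)   # ^ outside []: anchor at the start
non_digits = re.findall(r'[^0-9 ]+', text)      # ^ inside []: anything except digits and spaces

print(leading_digits)
print(non_digits)
```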

Escaping metacharacters: \  In a pattern, \ escapes a metacharacter so that it is matched literally.

\*  ,   \.  ,   \+  ,   \?   and so on
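The effect of escaping shows up when the same text is searched with and without the backslash. A short sketch with made-up strings; re.escape is the standard helper that escapes a literal string for you:

```python
import re

text = 'pi is 3.14, not 3114'   # hypothetical sample string

unescaped = re.findall(r'3.14', text)            # . matches any character
escaped = re.findall(r'3\.14', text)             # \. matches only a literal dot
auto = re.findall(re.escape('3.14'), text)       # re.escape adds the backslash
literal_plus = re.findall(r'1\+1', '1+1=2')      # \+ matches a literal plus sign

print(unescaped, escaped, auto, literal_plus)
```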

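Greedy and non-greedy matching

Quantifiers are greedy by default: they match as much as possible while still letting the overall pattern succeed. In non-greedy mode they match as little as possible.

import re
a = 'abbbbbbbbc'
b = re.findall('ab*', a)  # with 'ab*c' instead, the c after * would also have to match
b1 = re.findall('ab*?', a)  # * is the quantifier; the ? after it switches it to non-greedy mode
print(b)
print(b1)
Output:
['abbbbbbbb']  # greedy: takes as many characters as possible, so * consumes every b
['a']  # non-greedy: takes as few characters as possible, and * may match zero times

A quantifier on its own is greedy; appending ? to it makes it non-greedy.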
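The difference matters most when a pattern has a closing delimiter that occurs more than once. A minimal sketch on a made-up HTML-like string:

```python
import re

text = '<b>one</b><b>two</b>'   # hypothetical sample string

greedy = re.findall(r'<b>(.*)</b>', text)    # .* runs to the last </b>
lazy = re.findall(r'<b>(.*?)</b>', text)     # .*? stops at the first </b>
lazy_plus = re.findall(r'ab+?', 'abbb')      # +? still needs at least one b

print(greedy)
print(lazy)
print(lazy_plus)
```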
