Python Web Scraping 3: Matching with Regular Expressions

2. Non-greedy matching with (.*?)

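\d matches one digit character

\w matches one letter, digit, or underscore

\s matches one whitespace character, such as a newline, tab, or ordinary space

\S matches one non-whitespace character

\n matches one newline, the character produced by pressing Enter once

\t matches one tab, the character produced by pressing Tab once (it is a single character, not a run of spaces)

. matches any one character except a newline

* matches zero or more repetitions of the preceding expression

+ matches one or more repetitions of the preceding expression

? placed after * or + makes the match non-greedy (as few characters as possible); most often seen as .*?

() captures the expression inside the parentheses as a group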
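The symbols listed above can be tried out on a small string. The following is a minimal sketch; the sample string and variable names are made up for illustration and do not come from any real page.

```python
import re

# Hypothetical sample text: contains a tab (\t), spaces and a newline (\n).
sample = 'id_42\tprice: 7 yuan\nend'

digits = re.findall(r'\d', sample)          # every single digit
numbers = re.findall(r'\d+', sample)        # runs of one or more digits
words = re.findall(r'\w+', sample)          # runs of letters, digits, underscores
blanks = re.findall(r'\s', sample)          # tab, space and newline all count
pair = re.findall(r'(\w+): (\d+)', sample)  # two groups give one tuple per match

print(digits)   # ['4', '2', '7']
print(numbers)  # ['42', '7']
print(words)    # ['id_42', 'price', '7', 'yuan', 'end']
print(blanks)   # ['\t', ' ', ' ', '\n']
print(pair)     # [('price', '7')]
```

Writing patterns as raw strings (r'...') keeps Python from interpreting the backslashes before the re module sees them.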

Example 1:

import re

# "文本A" and "文本B" are the fixed text before and after each value we want
res = '文本A百度新闻文本B，新闻标题文本A新闻财经文本B，文本A搜狗新闻文本B新闻网址'
p_source = '文本A(.*?)文本B'
source = re.findall(p_source, res)
print(source)  # ['百度新闻', '新闻财经', '搜狗新闻']
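To see why the ? matters, it may help to run the same kind of pattern with and without it. This is a small sketch on a shortened version of the string from Example 1; the variable names greedy and lazy are only illustrative.

```python
import re

res = '文本A百度新闻文本B，新闻标题文本A新闻财经文本B'

# Greedy: .* runs on to the LAST "文本B", swallowing everything in between.
greedy = re.findall('文本A(.*)文本B', res)

# Non-greedy: .*? stops at the FIRST "文本B" after each "文本A".
lazy = re.findall('文本A(.*?)文本B', res)

print(greedy)  # ['百度新闻文本B，新闻标题文本A新闻财经']
print(lazy)    # ['百度新闻', '新闻财经']
```

For scraping, the non-greedy form is usually the one wanted, because a page repeats the same markers once per item.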

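Example 2:

import re

# The <p> tags are assumed stand-ins for whatever fixed tags surround the text on the real page
res = '<p>网易新闻 2020年12月27日 18:37</p>'
p_info = '<p>(.*?)</p>'
info = re.findall(p_info, res)
print(info)  # ['网易新闻 2020年12月27日 18:37']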
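The captured text in Example 2 still mixes the news source with the date and time. One possible way to separate them is to use several groups in one pattern, as sketched below. The single-space separator and the date layout are assumptions based on the sample string; a real page may use a different separator.

```python
import re

# Text shaped like the result of Example 2 (assumed layout: source, date, time).
info_text = '网易新闻 2020年12月27日 18:37'

# Three groups: source name, date, time. findall returns one tuple per match.
p_parts = r'(.*?)\s+(\d{4}年\d{1,2}月\d{1,2}日) (\d{1,2}:\d{2})'
parts = re.findall(p_parts, info_text)

print(parts)  # [('网易新闻', '2020年12月27日', '18:37')]

source, date, time_of_day = parts[0]
print(source)  # 网易新闻
```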

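3. Non-greedy matching with .*?

import re

# .*? without parentheses skips text that changes from page to page and is not needed.
# The <h3> tags are assumed stand-ins; with nothing fixed after (.*?), the group would match an empty string.
res = '<h3>文本C<变化的网址>文本D新闻标题</h3>'
p_title = '<h3>文本C.*?文本D(.*?)</h3>'
title = re.findall(p_title, res)
print(title)  # ['新闻标题']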

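import re

# Tag names are assumed stand-ins for the fixed markup around the title
res = '<h3><a target="_blank">阿里巴巴 代码竞赛现全球首位AI评委能为代码质量打分</a></h3>'
p_title = '<h3>.*?>(.*?)</a>'
title = re.findall(p_title, res)
print(title)
# .*?> skips the content we do not need; (.*?) captures the content we are looking for
# ['阿里巴巴 代码竞赛现全球首位AI评委能为代码质量打分']

res2 = '<a href="https://www.baidu.com/">'
p_href = '<a href="(.*?)"'
href = re.findall(p_href, res2)
print(href)  # ['https://www.baidu.com/']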
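The two ideas in this section, .*? as filler and (.*?) as the captured part, can be combined in a single pattern so that each link and its title come back together. The sketch below uses a made-up page fragment; the tag names, URLs and titles are all hypothetical.

```python
import re

# Hypothetical page fragment with two results (tags, URLs and titles are invented).
page = ('<h3><a href="https://example.com/a" target="_blank">标题一</a></h3>'
        '<h3><a href="https://example.com/b" target="_blank">标题二</a></h3>')

# Group 1 captures the URL, .*?> skips the remaining attributes,
# group 2 captures the link text.
p_link = '<h3><a href="(.*?)".*?>(.*?)</a>'
links = re.findall(p_link, page)

for href, title in links:
    print(href, title)
```

With two groups, re.findall returns a list of (href, title) tuples, which keeps each URL paired with its own title instead of relying on two separate lists staying in step.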

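4. The re.S flag: letting . match newlines

By default . does not match a newline, so (.*?) and .*? cannot span line breaks. Passing re.S (also spelled re.DOTALL) as the third argument removes that limit:

re.findall(pattern, text, re.S)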

import re

res = '''文本A
百度新闻文本B'''
p_source = '文本A(.*?)文本B'
source = re.findall(p_source, res, re.S)
print(source)  # ['\n百度新闻']
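The effect of the flag is easiest to see by running one pattern twice on text that contains a line break. A minimal sketch, with illustrative variable names:

```python
import re

res = '文本A\n百度新闻文本B'
p_source = '文本A(.*?)文本B'

without_flag = re.findall(p_source, res)     # . cannot cross the newline
with_flag = re.findall(p_source, res, re.S)  # . may match the newline too

print(without_flag)  # []
print(with_flag)     # ['\n百度新闻']

# re.S is the short name of re.DOTALL; both refer to the same flag.
print(re.S == re.DOTALL)  # True
```

Because page source usually has line breaks inside tags, passing re.S is a reasonable default when a pattern that looks correct returns an empty list.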

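import re

# The <h3> and <a> tags are assumed stand-ins; the URL is the one shown in the output
res = '''<h3>
<a href="http://baijiahao.baidu.com/s?id=163111&wfr=spider&for=pc"
data-click="{
英文&数字
}"
target="_blank"
>
阿里巴巴 代码竞赛现全球首位
</a>
</h3>'''
p_href = '<h3>.*?<a href="(.*?)"'
p_title = '<h3>.*?>(.*?)</a>'
href = re.findall(p_href, res, re.S)
title = re.findall(p_title, res, re.S)
print(href)   # ['http://baijiahao.baidu.com/s?id=163111&wfr=spider&for=pc']
print(title)  # ['\n阿里巴巴 代码竞赛现全球首位\n']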
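Text captured with re.S tends to carry the line breaks that surrounded it in the page source. A simple way to drop them, sketched here on an illustrative literal, is str.strip():

```python
# A captured title with a newline on each side (the literal is illustrative).
title = ['\n阿里巴巴 代码竞赛现全球首位\n']

# str.strip() removes leading and trailing whitespace, including newlines,
# and leaves the space inside the title untouched.
title = [t.strip() for t in title]

print(title)  # ['阿里巴巴 代码竞赛现全球首位']
```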

5. Additional notes

(1) The sub() function: cleaning up text captured by a regular expression

# re.sub(pattern to replace, replacement, original string)

import re

# <em> is assumed here as the tag left inside the captured title
title = ['<em>阿里巴巴</em>代码竞赛全球首位AI评委能为代码质量打分']
title[0] = re.sub('<em>', '', title[0])
title[0] = re.sub('</em>', '', title[0])
print(title[0])  # 阿里巴巴代码竞赛全球首位AI评委能为代码质量打分

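import re

# <em> is assumed here as the tag left inside the captured title
title = ['<em>阿里巴巴</em>代码竞赛全球首位AI评委能为代码质量打分']
title[0] = re.sub('<.*?>', '', title[0])
print(title[0])  # 阿里巴巴代码竞赛全球首位AI评委能为代码质量打分

# <.*?> matches anything of the form <...>
# '' is the replacement text
# the title[0] on the left receives the cleaned title
# the title[0] passed as the last argument is the original title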
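When many titles need the same cleaning, the substitution can be wrapped in a small helper. The sketch below is one possible arrangement; the names TAG and clean_titles and the sample titles are hypothetical. It also shows why the ? in <.*?> matters.

```python
import re

# Compile once, reuse for every title. <.*?> matches one tag at a time.
TAG = re.compile('<.*?>')

def clean_titles(titles):
    """Return new titles with tags removed and outer whitespace stripped."""
    return [TAG.sub('', t).strip() for t in titles]

# Hypothetical raw titles as they might come back from re.findall.
raw = ['\n<em>阿里巴巴</em>代码竞赛全球首位AI评委能为代码质量打分\n', '<b>标题二</b>']
cleaned = clean_titles(raw)
print(cleaned)

# Without the ?, <.*> is greedy and removes everything from the first < to the last >.
greedy = re.sub('<.*>', '', '<em>阿里巴巴</em>代码竞赛')
print(greedy)  # 代码竞赛
```

The greedy version deletes the company name along with the tags, which is rarely what a cleaning step should do.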

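(2) Square brackets []: metacharacters such as * . + ? placed inside them lose their special meaning

import re

company = '* 华能信托'
company1 = re.sub('[*]', '', company)
print(company1)  # prints " 华能信托": the * is removed, the space after it remains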
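Square brackets are one of several ways to match a literal metacharacter. The sketch below compares them; the company string is a made-up example.

```python
import re

company = '*华能信托*(上海)'  # hypothetical input

in_brackets = re.sub('[*]', '', company)       # * inside [] is a literal asterisk
escaped = re.sub(r'\*', '', company)           # a backslash escape does the same
several = re.sub('[*()]', '', company)         # [] can list several characters at once
auto = re.sub(re.escape('*'), '', company)     # re.escape builds the escape for you

print(in_brackets)  # 华能信托(上海)
print(escaped)      # 华能信托(上海)
print(several)      # 华能信托上海
print(auto)         # 华能信托(上海)

# A bare * is not a valid pattern: there is nothing before it to repeat.
try:
    re.sub('*', '', company)
    bare_ok = True
except re.error:
    bare_ok = False
print(bare_ok)  # False
```

A few characters keep a special role even inside brackets, notably ^ at the start, - between two characters, ] and the backslash, so those still need escaping there.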
