Regex, XPath, and bs4 Usage

Regex:
Single-character matching
. matches any character except a newline
\d matches a digit
\D matches a non-digit
\w matches a word character [a-zA-Z0-9_]
\W matches a non-word character
\s matches a whitespace character: space, \n, \t, ...
\S matches a non-whitespace character
^ matches at the start of the string
$ matches at the end of the string
[0-9] is equivalent to \d (matches 0-9)
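A quick illustration of these character classes (a minimal sketch; the sample string is made up):

import re

sample = 'Order 42, shipped 2019-07-30\tOK'
print(re.findall(r'\d', sample))      # every digit: ['4', '2', '2', '0', ...]
print(re.findall(r'\w+', sample))     # runs of word characters: ['Order', '42', 'shipped', ...]
print(re.findall(r'\s', sample))      # whitespace characters: the spaces and the tab
print(re.findall(r'^Order', sample))  # ^ anchors the match to the start of the string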
Multi-character matching (greedy)
* matches the preceding character zero or more times
+ matches the preceding character one or more times (at least once)
? matches the preceding character 0 or 1 times
{n,m} matches the preceding character n to m times
Multi-character matching (non-greedy)
*?
+?
??
Other
() grouping
| logical OR
\ escape character
Methods in the re module
re.compile(): build a regular expression object
re.match(): match from the start of the string; single match; returns a result as soon as one is found, otherwise returns None
re.search(): search the whole string; single match; returns a result as soon as one is found, otherwise returns None
re.findall(): find every part of the string that matches the pattern; returns a list
re.finditer(): find every part of the string that matches the pattern; returns an iterable object
re.sub(): replace substrings according to a regular expression
re.split(): split a string according to a regular expression
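A short sketch of the methods above on an invented sample string, including the difference between greedy .* and non-greedy .*?:

import re

text = '<b>first</b> and <b>second</b>'

pattern = re.compile(r'<b>(.*?)</b>')          # non-greedy: stops at the first </b>
print(pattern.findall(text))                   # ['first', 'second']
print(re.findall(r'<b>(.*)</b>', text))        # greedy: ['first</b> and <b>second']

print(re.match(r'<b>', text))                  # matches only at the start; returns a Match object
print(re.search(r'second', text).group())      # 'second', found anywhere in the string
for m in re.finditer(r'<b>(.*?)</b>', text):   # iterator of Match objects
    print(m.group(1))
print(re.sub(r'</?b>', '', text))              # 'first and second'
print(re.split(r'\s+and\s+', text))            # ['<b>first</b>', '<b>second</b>']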
Regex in practice
import re
from urllib import request


def get_rank_data(url='http://top.hengyan.com/dianji/default.aspx?p=1'):
    # Build the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    }

    # Parameters of request.urlopen():
    #   url: the target url
    #   data=None: None by default, meaning a GET request; if data is not None the request is a POST
    #   timeout: timeout for the request, in seconds
    #   cafile=None, capath=None, cadefault=False: certificate-related parameters
    #   context=None: can be used to skip certificate verification

    # urlopen() by itself cannot attach request headers:
    # response = request.urlopen(url=url, timeout=10)

    # Attach request headers via a Request object
    req = request.Request(url=url, headers=headers)
    response = request.urlopen(req, timeout=10)

    # Response status code
    code = response.status
    # The url that was actually requested
    url = response.url
    print(code, url)

    b_content = response.read()
    # bytes -> str: decode
    # str -> bytes: encode
    # print(b_content)
    html = b_content.decode('utf-8')
    # print(html)

    # File operations
    # Common open() modes:
    #   r / rb: read (text / binary)
    #   w / wb: write, truncating the file first (text / binary)
    #   a / ab: append (text / binary)
    #   adding '+' to any of them opens the file for both reading and writing
    with open('hengyan.html', 'w') as file:
        file.write(html)

    # Parse the data with regular expressions
    # re.S flag: lets . match newline characters as well
    # NOTE: the HTML tags inside the original patterns were lost when the post was
    # rendered; the <ul class="...">, <li> and <td> fragments below are placeholders
    # and must be replaced with the real markup of the ranking page.
    pattern = re.compile(r'<ul class="...">(.*?)</ul>', re.S)
    ul_str = re.findall(pattern, html)[0]

    pattern1 = re.compile(r'<li.*?>(.*?)</li>', re.S)
    li_strs = re.findall(pattern1, ul_str)[1:]

    for li_str in li_strs:
        # print(li_str)
        # One capture group per column of the ranking row (six fields in total)
        pattern = re.compile(
            r'<td.*?>(.*?)</td>'
            r'.*?<td.*?>(.*?)</td>'
            r'.*?<td.*?>(.*?)</td>'
            r'.*?<td.*?>(.*?)</td>'
            r'.*?<td.*?>(.*?)</td>'
            r'.*?<td.*?>(.*?)</td>',
            re.S
        )
        data = re.findall(pattern=pattern, string=li_str)[0]
        print(data)

    # Extract the next page
    if '下一页' in html:
        # There is still a next page
        # Placeholder pattern: the tag wrapping the current page number was lost in rendering
        pattern = re.compile(r'<span class="...">(.*?)</span>', re.S)
        current_page = int(re.findall(pattern, html)[0])
        next_page = current_page + 1
        # Build the url of the next page
        next_page_url = re.sub(r'\d+', str(next_page), url)
        print(next_page_url)
        get_rank_data(next_page_url)
    else:
        print('数据提取完毕')

if __name__ == '__main__':
    get_rank_data()

Author: 某某某的洛先生
Source: CSDN
Original post: https://blog.csdn.net/cc576795555/article/details/98338862
Copyright notice: this is an original article by the blogger; please include a link to the original when reposting.

XPath:
XPath (XML Path Language) is a language for finding information in XML documents; it can be used to traverse the elements and attributes of an XML document.
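Before the full spider below, a minimal sketch of how etree.HTML() and xpath() are used; the HTML snippet and its class name are made up for the example:

from lxml import etree

html = '''
<div class="threadlist_title">
  <a href="/p/111">post one</a>
  <a href="/p/222">post two</a>
</div>
'''

content = etree.HTML(html)  # parse the HTML string into an element tree
hrefs = content.xpath('//div[@class="threadlist_title"]/a/@href')
texts = content.xpath('//div[@class="threadlist_title"]/a/text()')
print(hrefs)   # ['/p/111', '/p/222']
print(texts)   # ['post one', 'post two']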
import requests
import re
import time
import urllib.parse
from lxml import etree

class MeiNv():
    def __init__(self):
        self.url = 'http://www.tieba.baidu.com/category/40076/?page='
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0'
        }

    # List page
    def loadpage(self, url):
        response = requests.get(url=url, headers=self.headers)
        html = response.content.decode('utf-8')

        with open('baidu.html', 'w') as f:
            f.write(html)

        # Convert the html string into an element tree (xml)
        content = etree.HTML(html)
        # print(content)
        url_list = content.xpath(
            '//div[@class="threadlist_lz clearfix"]/div[@class="threadlist_title pull_left j_th_tit "]/a/@href'
        )
        # print(url_list)
        for detail_url in url_list:
            full_url = 'http://tieba.baidu.com' + detail_url
            self.detail(full_url)

    # Detail page
    def detail(self, url):
        response = requests.get(url=url)
        html = response.content.decode('utf-8')
        content = etree.HTML(html)
        img_list = content.xpath(
            '//img[@class="BDE_Image"]/@src'
        )
        for img in img_list:
            self.download(img)

    # Download
    def download(self, url):
        response = requests.get(url=url)
        # Image bytes: no need to decode
        self.save(response.content, url)

    # Save
    def save(self, content, img_url):
        filename = 'tieba/' + img_url[-10:] + '.jpg'
        print('正在下载' + filename)
        with open(filename, 'wb') as f:
            f.write(content)

    def main(self):
        kw = input('请输入网址')
        start = int(input('输入起始页'))
        end = int(input('输入终止页'))
        for i in range(start, end + 1):
            # Build the url of each list page
            full_url = self.url + 'f?' + 'kw=' + kw + '&' + 'pn=' + str((i - 1) * 50)
            self.loadpage(full_url)


if __name__ == '__main__':
    mn = MeiNv()
    mn.main()

bs4:
Like lxml, Beautiful Soup is an HTML/XML parser, and its main job is likewise to parse and extract HTML/XML data.
lxml only traverses the document locally, while Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so the time and memory overhead are much higher and its performance is lower than lxml's.
BeautifulSoup is quite simple to use for parsing HTML and has a very friendly API; it supports CSS selectors, the HTML parser from the Python standard library, and lxml's XML parser.
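A minimal sketch of the select() CSS-selector calls used in the spider below; the HTML fragment is invented, using the same class names the spider expects:

from bs4 import BeautifulSoup

html = '''
<ul class="all-img-list cf">
  <li><div class="book-mid-info"><h4><a href="/book/1">Book One</a></h4>
      <p><a>Author A</a><a>Fantasy</a></p></div></li>
</ul>
'''

bs = BeautifulSoup(html, 'lxml')                        # parse with the lxml parser
for li in bs.select('ul[class="all-img-list cf"] li'):  # CSS attribute selector, exact class value
    title = li.select('div[class="book-mid-info"] h4 a')[0].get_text()
    author = li.select('div[class="book-mid-info"] p a')[0].get_text()
    print(title, author)                                # Book One Author A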
import requests
from bs4 import BeautifulSoup
import urllib.parse
import jsonpath
import json
import re
class QiDianSpider():
    def __init__(self):
        self.url = 'https://www.address.com/all?page=1'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0'
        }

    def loadpage(self, url):
        response = requests.get(url=url, headers=self.headers)
        bs = BeautifulSoup(response.text, 'lxml')
        li_list = bs.select('ul[class="all-img-list cf"] li')
        for li in li_list:
            title = li.select('div[class="book-mid-info"] h4 a')[0].get_text()
            href = urllib.parse.urljoin(response.url, li.select('div[class="book-mid-info"] h4 a')[0].get('href'))
            author = li.select('div[class="book-mid-info"] p a')[0].get_text()
            type1 = li.select('div[class="book-mid-info"] p a')[1].get_text()
            type2 = li.select('div[class="book-mid-info"] p a')[2].get_text()
            id = re.search(r'(\d+)', href).group(1)
            print(id)

            dict = {
                'title': title,
                'author': author,
                'type': type1 + '.' + type2,
                'href': href,
                'id': id
            }
            # print(dict)
            self.loaddetail(id, dict)

    def loaddetail(self, bookid, dict):
        response = requests.get(url='https://book.qidian.com/ajax/book/category?_csrfToken=asYDuKBW3fwHjeBdQNcX1GFeE2B9KcEe6dJyt&bookId=' + bookid, headers=self.headers)
        html = response.content.decode('utf-8')

        # Pull the volume list ('vs') and total chapter count ('cnt') out of the JSON response
        data = json.loads(html)
        vs = jsonpath.jsonpath(data, '$..vs')
        count = sum(jsonpath.jsonpath(data, '$..cnt'))
        dict['vs'] = vs[0]
        # Append one JSON record per book
        with open('qidian.html', 'a') as f:
            f.write(json.dumps(dict, ensure_ascii=False) + '\n')

    def start(self):
        self.loadpage(self.url)

if __name__ == '__main__':
    qds = QiDianSpider()
    qds.start()
