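Lately a certain short-video platform keeps showing me the "four refined pleasures of life": sipping tea, playing the qin, gazing at mountains, and listening to the rain. So today we will use Python to record the scenery we see while "gazing at mountains", so that we can quickly revisit it later.
Enough preamble. Next, prepare the required environment: python3, requests, lxml.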
# Send the request, imitating a browser
response = requests.get(url, headers=spider_header)
# Set the encoding explicitly so the returned HTML string is not garbled
response.encoding = "utf-8"
# Parse the returned HTML string with etree so it can be queried with XPath below
html = etree.HTML(response.text)
# Get the list of image entries shown on the page
lis = html.xpath("/html/body/section/div/div/div[2]/div[2]/div/ul/li/a")
# Temporary list holding the links shown on the first page
lis_hrefs = []
for li in lis:
    lis_hrefs.append(f'{base_url}{li.xpath("@href")[0]}')
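The loop above builds each link by concatenating base_url and the href as plain strings, which works as long as every href is a root-relative path. As a hedged alternative, the standard library's urllib.parse.urljoin also copes with hrefs that are already absolute. The helper name to_absolute and the sample hrefs below are made up for illustration; they are not taken from the site.

```python
from urllib.parse import urljoin

base_url = "https://www.keaitupian.cn"


def to_absolute(hrefs, base=base_url):
    """Resolve each href against base; absolute URLs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


# Hypothetical hrefs, for illustration only
sample = ["/meinv/1.html", "https://www.keaitupian.cn/meinv/2.html"]
print(to_absolute(sample))
```

In the crawler this would replace the string concatenation: collect the raw @href values first, then pass the list through to_absolute.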
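On the detail page, all we need is the src attribute of each img element, which we extract with XPath syntax. With the analysis done, the next step is to iterate over the collection of cover links obtained in step two, open each detail page, and extract the src attribute of every img, which is the download URL. The code is as follows:
# Loop over the links in the temporary list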
for href in lis_hrefs:
    res = requests.get(href, headers=spider_header)
    res.encoding = "utf-8"
    res_html = etree.HTML(res.text)
    # Get every image URL on the detail page
    images = res_html.xpath("/html/body/section/div/div/article/p/a/img/@src")
    header = res_html.xpath("/html/body/section/div/div/header/h1/text()")[0]
    root_path = os.getcwd()
    file_path = os.path.join(root_path, header)
    # Create the folder; exist_ok avoids an error when the script is re-run
    os.makedirs(file_path, exist_ok=True)
    for image in images:
        # Derive the file name to save under (raw string, escaped dot)
        file_name = re.findall(r"\d+\.jpg", image)[0]
        # The image link redirects; requests follows the redirect
        response = requests.get(image, headers=down_header, allow_redirects=True)
        with open(os.path.join(file_path, file_name), "wb") as f:
            f.write(response.content)
        time.sleep(0.5)
    time.sleep(1)
    print(f"Home page: first page of images for {header} saved!")
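The download loop takes the file name from the image URL with a regular expression and uses the page title as the folder name. A minimal sketch of a slightly more defensive variant follows, assuming the file name is simply the last path segment of the URL and that the title may contain characters that are not allowed in folder names. The helper names image_file_name, safe_dir_name, and prepare_dir, and the example URL, are hypothetical.

```python
import os
import re
from urllib.parse import urlparse


def image_file_name(image_url):
    """Return the last path segment of the URL, ignoring any query string."""
    return os.path.basename(urlparse(image_url).path)


def safe_dir_name(title):
    """Replace characters that are invalid in Windows file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()


def prepare_dir(root, title):
    """Create the target folder if it does not exist and return its path."""
    path = os.path.join(root, safe_dir_name(title))
    os.makedirs(path, exist_ok=True)
    return path


# Hypothetical image URL, for illustration only
print(image_file_name("https://example.com/uploads/2022/04/123.jpg?x=1"))
```

Unlike re.findall(...)[0], this does not raise an IndexError when an image is not a .jpg, and prepare_dir can be called repeatedly without failing on an existing folder.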
Note
The image links involve a redirect.
This is explained in a comment in the step-five code above.
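For context on the redirect: requests follows redirects for GET requests by default, so allow_redirects=True mostly documents the intent. To see which hops were taken, a requests Response exposes them through its history attribute. The helper below is a small sketch with a hypothetical name, redirect_chain; it only reads the history and url attributes.

```python
def redirect_chain(response):
    """List the URLs visited, from the original request to the final one.

    With a requests.Response, `history` holds the intermediate responses
    (oldest first) and `url` is the final address after redirects.
    """
    return [hop.url for hop in response.history] + [response.url]


# Usage with requests (needs network access; `image` is a scraped image link):
# resp = requests.get(image, headers=down_header)  # GET follows redirects by default
# print(resp.status_code, redirect_chain(resp))
```

If the returned list has a single entry, no redirect happened for that link.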
import requests
from lxml import etree
import re
import time
import os
url = "https://www.keaitupian.cn/meinv/"
base_url = "https://www.keaitupian.cn"
spider_header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Referer": "https://www.keaitupian.cn",
    "Host": "www.keaitupian.cn"
}
down_header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}
# Send the request, imitating a browser
response = requests.get(url, headers=spider_header)
# Set the encoding explicitly so the returned HTML string is not garbled
response.encoding = "utf-8"
# Parse the returned HTML string with etree so it can be queried with XPath below
html = etree.HTML(response.text)
# Get the list of image entries shown on the page
lis = html.xpath("/html/body/section/div/div/div[2]/div[2]/div/ul/li/a")
# Temporary list holding the links shown on the first page
lis_hrefs = []
for li in lis:
    lis_hrefs.append(f'{base_url}{li.xpath("@href")[0]}')
# Loop over the links in the temporary list
for href in lis_hrefs:
    res = requests.get(href, headers=spider_header)
    res.encoding = "utf-8"
    res_html = etree.HTML(res.text)
    # Get every image URL on the detail page
    images = res_html.xpath("/html/body/section/div/div/article/p/a/img/@src")
    header = res_html.xpath("/html/body/section/div/div/header/h1/text()")[0]
    root_path = os.getcwd()
    file_path = os.path.join(root_path, header)
    # Create the folder; exist_ok avoids an error when the script is re-run
    os.makedirs(file_path, exist_ok=True)
    for image in images:
        # Derive the file name to save under (raw string, escaped dot)
        file_name = re.findall(r"\d+\.jpg", image)[0]
        # The image link redirects; requests follows the redirect
        response = requests.get(image, headers=down_header, allow_redirects=True)
        with open(os.path.join(file_path, file_name), "wb") as f:
            f.write(response.content)
        time.sleep(0.5)
    time.sleep(1)
    print(f"Home page: first page of images for {header} saved!")
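The code above only saves the first page and does not handle pagination. If you need to save further pages, you can work it out yourself or contact me by private message.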
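As a starting point for pagination, one common approach is to generate the listing-page URLs up front and run the same scraping loop on each of them. The sketch below assumes the later pages follow a fixed URL template. The pattern index_{page}.html is purely hypothetical and was not checked against the site, so inspect the real pagination links before using it; the function name page_urls is also made up.

```python
def page_urls(first_page, template, last_page):
    """Yield listing-page URLs: the first page, then template-built pages 2..last_page."""
    yield first_page
    for n in range(2, last_page + 1):
        yield template.format(page=n)


# HYPOTHETICAL pattern: check the site's real pagination links before using it.
url = "https://www.keaitupian.cn/meinv/"
template = url + "index_{page}.html"
for page_url in page_urls(url, template, last_page=3):
    print(page_url)
```

Each yielded URL would then take the place of url in the listing request, with a time.sleep between pages as in the original loop.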