On a whim, I decided to try writing a crawler to scrape images off the web.
After thinking it over, the task roughly splits into a few steps:
1. Find the URL to scrape 2. Extract the image links 3. Save the images
With requests, the flow is:
First send a request, get the response text, and pull the image URLs out of it?
Then join the partial path into a full URL (or take the full URL directly), send a second request with wget/requests, and write the response bytes to a binary file, which gives you the image.
The URLs appearing below are used purely as examples for this experiment; nothing else was done with them.
import requests
url = "https://pic.netbian.com/e/search/result/index.php?page=0&searchid=2202"
resq = requests.get(url=url)
print(resq.text)
You'll find that the response text of this request doesn't directly reveal the image URLs on the current page.
The request headers are also a problem: if the site has anti-crawler checks, you need to set your own headers to imitate a real machine's request, so just reuse the headers a desktop browser sends (heh).
print(resq.request.headers)
# {'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
Just open any website, inspect the request headers in the browser's dev tools, and copy the corresponding User-Agent.
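For example, a minimal sketch of resending the request with a browser-style header (the User-Agent string below is just one copied from desktop Chrome; any real browser's UA works):
import requests
url = "https://pic.netbian.com/e/search/result/index.php?page=0&searchid=2202"
# browser-style User-Agent copied from desktop Chrome; swap in whatever your browser sends
head = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}
resq = requests.get(url=url, headers=head)
print(resq.request.headers)  # now reports the browser UA instead of python-requests/...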
Reading code and articles from more experienced people, I found that most of them use etree from the "lxml" library. lxml is an XML and HTML parser; its main job is parsing and extracting data from XML and HTML.
Call etree.HTML() from lxml with the response text as the text parameter, and etree.HTML returns an HTML element object?
import requests
from lxml import etree
url = "https://pic.netbian.com/e/search/result/index.php?page=0&searchid=2202"
resq = requests.get(url=url)
aaa = etree.HTML(text=resq.text)
print(aaa)  # <Element html at 0x...>
Then chain a .xpath() onto the etree.HTML result and extract the image paths with an XPath expression.
The expression: //div[@class="slist"]/ul[@class="clearfix"]/li
This locates the li tags holding the images; then pull the partial link (the src attribute) out of each li and join it onto the site root to get the full image URL.
import requests
from lxml import etree
url = "https://pic.netbian.com/e/search/result/index.php?page=0&searchid=2202"
resq = requests.get(url=url)
aaa = etree.HTML(text=resq.text).xpath('//div[@class="slist"]/ul[@class="clearfix"]/li')
# aaa is a list of all the li elements matching the expression
aa = (aaa[0].xpath("./a/img/@src"))[0]
# the xpath call returns ['/uploads/allimg/230125/214305-167465418565ee.jpg']; [0] takes the string
img_url = "https://pic.netbian.com" + aa
print(img_url)
# https://pic.netbian.com/uploads/allimg/230125/214305-167465418565ee.jpg
bbb = requests.get(url=img_url).content
with open(file='tupian.jpg', mode='wb') as f:
    f.write(bbb)
print("image saved")
You can also use wget to download the image directly.
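A minimal sketch with the third-party wget package (pip install wget); note this assumes the site serves the image without custom headers, since as far as I know wget.download doesn't accept request headers:
import wget
img_url = "https://pic.netbian.com/uploads/allimg/230125/214305-167465418565ee.jpg"
wget.download(img_url, out="tupian.jpg")  # streams the file straight to disk, no manual open/write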
import time
import requests
import os
from lxml import etree  # parses the page text into an element object

class PaChong:
    def __init__(self, title_url, search_url):
        self.title_url = title_url
        self.search_url = search_url
        self.head = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}  # ,"Accept-Language": "zh-CN,zh;q=0.9"}
        self.img_url_dict = {}

    def save_img_url(self, first_xpath, second_xpath):
        resq = requests.get(url=self.search_url, headers=self.head)
        li_list = etree.HTML(text=resq.text).xpath(first_xpath)
        for li in li_list:
            try:
                img_src = li.xpath(second_xpath)[0]
            except Exception:
                print("failed to match the image link field")
                continue  # skip this li, otherwise img_src below would be undefined
            img_url = self.title_url + img_src  # join into the full url
            img_name = img_url.split('/')[-1]
            self.img_url_dict[img_name] = img_url

    def save_img_request(self):
        save_path = self.save_path()
        for img_name, img_url in self.img_url_dict.items():
            try:
                img_data = requests.get(url=img_url, headers=self.head).content
            except Exception as e:
                print(f"failed to fetch the url, error: {e}")
                continue  # skip this image, otherwise img_data below would be undefined
            img_path = os.path.join(save_path, img_name)
            self.save_img(filename=img_path, write_content=img_data)

    def save_img(self, filename, write_content):
        with open(file=filename, mode="wb") as f:
            f.write(write_content)
        print(filename, "saved successfully")

    def save_path(self):
        save_path = "./save_image"
        if not os.path.exists(save_path):
            os.mkdir(save_path)
        else:
            print(f"{save_path} already exists, no need to create it")
        return save_path

    def save_all(self, first_xpath, second_xpath):
        tic = time.time()
        self.save_img_url(first_xpath=first_xpath, second_xpath=second_xpath)
        self.save_img_request()
        toc = time.time()
        print(f"saving the images took {(toc - tic):0.2f} seconds in total")

if __name__ == '__main__':
    title_url = "https://pic.netbian.com"
    search_url = "https://pic.netbian.com/e/search/result/index.php?page=0&searchid=2202"
    first_xpath = '//div[@class="slist"]/ul[@class="clearfix"]/li'
    second_xpath = "./a/img/@src"
    cl = PaChong(title_url=title_url, search_url=search_url)
    cl.save_all(first_xpath=first_xpath, second_xpath=second_xpath)
This is just a simple little example, and the code still has plenty of gaps and rough edges:
(1) Saving with with open() is probably slower than using wget?
(2) What gets saved are only the thumbnails served to logged-out visitors; to get the best version you actually have to log in, open an image's page, and click to download the "HD original".
(3) The request headers could be refined further: besides identifying the client, there is also a parameter saying which link you came from, e.g. {"Referer": "https://xxx.com/"}, which hides the crawler's identity better.
(4) This URL is only the first page of the "lol" search results; you can for-loop over the page= number in "index.php?page=0" in the URL, as sketched below.
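A minimal sketch of that pagination loop, reusing the PaChong class above (looping over 3 pages is an arbitrary example):
if __name__ == '__main__':
    title_url = "https://pic.netbian.com"
    first_xpath = '//div[@class="slist"]/ul[@class="clearfix"]/li'
    second_xpath = "./a/img/@src"
    for page in range(3):  # arbitrary example: fetch pages 0, 1 and 2
        search_url = f"https://pic.netbian.com/e/search/result/index.php?page={page}&searchid=2202"
        cl = PaChong(title_url=title_url, search_url=search_url)
        cl.save_all(first_xpath=first_xpath, second_xpath=second_xpath)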
Corrections and pointers from the experts are welcome!