1. Overview of Web Crawlers
What is a crawler:
Writing a program that simulates a browser and then sends it out to crawl/scrape data from the internet.
Simulation: the browser itself is a natural, primitive crawling tool.
Types of crawlers:
General-purpose crawler: scrapes an entire page of data; the fetching system (crawler program) of a search engine.
Focused crawler: scrapes a specific portion of a page; always built on top of a general-purpose crawler.
Incremental crawler: monitors a site for updates so that only the newly published data is scraped.
Risk analysis
Use crawlers responsibly.
Where the risk comes from:
The crawler interferes with the normal operation of the target website;
The crawler scrapes specific types of data or information protected by law.
Avoiding the risk:
Strictly follow the robots protocol set by the website;
When working around anti-crawling measures, optimize your code so it does not disturb the site's normal operation;
When using or distributing the scraped content, review it first; if it contains users' personal information, private data, or someone else's trade secrets, stop immediately and delete it.
Anti-crawling mechanisms
Counter-anti-crawling strategies
robots.txt protocol: a plain-text protocol that declares which data may and may not be crawled.
Commonly used request headers
User-Agent: identifies the client sending the request
Connection: close
Content-Type
How do you tell whether a page contains dynamically loaded data?
Local search vs. global search (search the raw page source first, then search all captured packets).
What is the first step before crawling an unfamiliar site?
Determine whether the data you want is dynamically loaded!!! (see the sketch below)
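A minimal sketch of the "local search" check, assuming a hypothetical target URL and keyword: if the keyword is missing from the raw HTML that requests returns, the data is loaded dynamically (via AJAX/JS) and has to be captured from a separate request.
import requests

url = 'https://www.example.com/page'   # hypothetical target page
keyword = 'target data'                # hypothetical piece of data you expect to see

headers = {'User-Agent': 'Mozilla/5.0'}
page_text = requests.get(url, headers=headers).text

# local search: is the data already in the raw page source?
if keyword in page_text:
    print('found in raw HTML: the data is statically loaded')
else:
    print('not in raw HTML: the data is dynamically loaded')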
2. Basic Usage of the requests Module
The requests module
Concept: a module based on network requests; it is used to simulate a browser sending requests
Coding workflow:
specify the url
send the request
get the response data (the scraped data)
persist the result
import requests
url = 'https://www.sogou.com'
# the return value is a response object
response = requests.get(url=url)
# .text returns the response data as a string
data = response.text
with open('./sogou.html', "w", encoding='utf-8') as f:
    f.write(data)
Build a simple web collector on top of Sogou
Fix the garbled-text (encoding) problem
Deal with UA detection
import requests
wd = input('输入key:')
url = 'https://www.sogou.com/web'
# the dynamic request parameters
params = {
    'query': wd
}
# params encapsulates the query-string parameters of the request url
# headers implements UA spoofing to get past the UA-check anti-crawling mechanism
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# manually set the response encoding to fix garbled Chinese characters
response.encoding = 'utf-8'
data = response.text
filename = wd + '.html'
with open(filename, "w", encoding='utf-8') as f:
    f.write(data)
print(wd, "下载成功")
1. Scrape movie details from Douban
Analysis
When you scroll to the bottom of the page, an AJAX request is fired and it returns a batch of movie data
Dynamically loaded data: data that is fetched by an additional, separate request
AJAX-generated dynamic data
JS-generated dynamic data
import requests
limit = input("排行榜前多少的数据:::")
url = 'https://movie.douban.com/j/chart/top_list'
params = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "0",
    "limit": limit
}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# .json() returns the deserialized object
data_list = response.json()
with open('douban.txt', "w", encoding='utf-8') as f:
    for i in data_list:
        name = i['title']
        score = i['score']
        f.write(name + ":" + score + "\n")
print("成功")
2. Scrape KFC store location data
import requests
url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
params = {
    "cname": "",
    "pid": "",
    "keyword": "青岛",
    "pageIndex": "1",
    "pageSize": "10"
}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.post(url=url, params=params, headers=headers)
# .json() returns the deserialized object
data_list = response.json()
with open('kedeji.txt', "w", encoding='utf-8') as f:
    for i in data_list["Table1"]:
        name = i['storeName']
        addres = i['addressDetail']
        f.write(name + "," + addres + "\n")
print("成功")
3. Scrape data from the drug administration (NMPA) site
import requests
url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
with open('化妆品.txt', "w", encoding="utf-8") as f:
    for page in range(1, 5):
        params = {
            "on": "true",
            "page": str(page),
            "pageSize": "12",
            "productName": "",
            "conditionType": "1",
            "applyname": "",
            "applysn": ""
        }
        response = requests.post(url=url, params=params, headers=headers)
        data_dic = response.json()
        for item in data_dic["list"]:
            id = item['ID']
            # each company's detail data comes from another POST request keyed by its ID
            post_url = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"
            post_data = {
                "id": id
            }
            response2 = requests.post(url=post_url, params=post_data, headers=headers)
            data_dic2 = response2.json()
            title = data_dic2["epsName"]
            name = data_dic2['legalPerson']
            f.write(title + ":" + name + "\n")
3. Data Parsing
Parsing: extracting data according to specified rules
Purpose: this is what makes a focused crawler possible
Workflow of a focused crawler:
specify the url
send the request
get the response data
parse the data
persist the result
Data parsing approaches:
regular expressions
bs4
xpath
pyquery (extension)
What is the general principle behind data parsing?
Data parsing operates on the page source (a collection of HTML tags)
What is the core purpose of HTML?
To display data
How does HTML display data?
The data HTML displays is always placed inside HTML tags or in their attributes
General principle:
1. locate the tag
2. take its text or take an attribute
1. Regex parsing
1. Scrape image posts from Qiushibaike
Scrape a single image
import requests
url = "https://pic.qiushibaike.com/system/pictures/12330/123306162/medium/GRF7AMF9GKDTIZL6.jpg"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
# .content returns the response data as bytes
img_data = response.content
with open('./123.jpg', "wb") as f:
    f.write(img_data)
print("成功")
Scrape a single page
import re
import os
import requests
dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
url = "https://www.qiushibaike.com/imgrank/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
img_text = requests.get(url, headers=headers).text
# regex that captures the src attribute of every thumbnail image on the page
ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
img_list = re.findall(ex, img_text, re.S)
for src in img_list:
    src = "https:" + src
    img_name = src.split('/')[-1]
    img_path = dir_name + "/" + img_name
    # request the image url to get the binary image data
    response = requests.get(src, headers=headers).content
    with open(img_path, "wb") as f:
        f.write(response)
    print("成功")
Scrape multiple pages
import re
import os
import requests
dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
for i in range(1, 5):
    url = f"https://www.qiushibaike.com/imgrank/page/{i}/"
    print(f"正在爬取第{i}页的图片")
    img_text = requests.get(url, headers=headers).text
    # regex that captures the src attribute of every thumbnail image on the page
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    img_list = re.findall(ex, img_text, re.S)
    for src in img_list:
        src = "https:" + src
        img_name = src.split('/')[-1]
        img_path = dir_name + "/" + img_name
        # request the image url to get the binary image data
        response = requests.get(src, headers=headers).content
        with open(img_path, "wb") as f:
            f.write(response)
print("成功")
2. bs4 parsing
Environment setup
pip install bs4
How bs4 parsing works
Instantiate a BeautifulSoup object (call it soup) and load the page source to be parsed into it,
then call the object's attributes and methods to locate tags and extract data
How do you instantiate a BeautifulSoup object?
BeautifulSoup(fp, 'lxml'): parses the data of a locally stored HTML file
BeautifulSoup(page_text, 'lxml'): parses page source fetched from the internet
Locating tags
soup.tagName: locates the first tagName tag; only the first match is returned
Locating by attribute
soup.find('div', class_='s'): returns the div tag whose class is "s"
find_all: same usage as find, but returns a list of all matches
Locating with CSS selectors
select('selector'): returns a list
tag, class, id, and hierarchy selectors (> means one level down, a space means any number of levels)
Extracting data
Taking text
tag.string: only the text directly inside the tag
tag.text: all the text inside the tag, including descendants
Taking an attribute
soup.find("a", id='tt')['href']
A minimal sketch of these calls follows below.
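A small self-contained sketch of the calls above, run against a made-up inline HTML snippet (the tags, classes, and ids are invented purely for illustration; the 'lxml' parser assumes the lxml package is installed):
from bs4 import BeautifulSoup

# a made-up HTML snippet just to exercise the API
html = '''
<div class="s">
    <ul>
        <li><a id="tt" href="/chapter/1">Chapter 1</a></li>
        <li><a href="/chapter/2">Chapter 2</a></li>
    </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.div)                           # first <div> tag in the document
print(soup.find('div', class_='s'))       # locate by attribute
print(soup.find_all('a'))                 # list of all <a> tags
print(soup.select('.s > ul > li > a'))    # CSS selector, > = one level down
print(soup.select('.s a'))                # space = any number of levels
print(soup.find('a', id='tt').string)     # text directly inside the tag
print(soup.find('div', class_='s').text)  # all text, including descendants
print(soup.find('a', id='tt')['href'])    # take an attribute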
1. Scrape the text of Romance of the Three Kingdoms
http://www.shicimingju.com/book/sanguoyanyi.html
Scrape the chapter titles + chapter contents
1. Parse the chapter titles & each chapter's detail-page url from the home page
from bs4 import BeautifulSoup
import requests
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
page_text = requests.get(url, headers=headers).text
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select(".book-mulu a")
with open('./sanguo.txt', 'w', encoding='utf-8') as f:
    for a in a_list:
        new_url = "http://www.shicimingju.com" + a["href"]
        mulu = a.text
        print(mulu)
        # request each chapter's detail page and parse the chapter content out of it
        new_page_text = requests.get(new_url, headers=headers).text
        new_soup = BeautifulSoup(new_page_text, 'lxml')
        neirong = new_soup.find('div', class_='chapter_content').text
        f.write(mulu + ":" + neirong + "\n")
3. xpath parsing
Environment setup
pip install lxml
How xpath parsing works
Instantiate an etree object and load the page source data into it,
then call the object's xpath method with different xpath expressions to locate tags and extract data
Instantiating an etree object
tree = etree.parse(fileName): for a locally stored HTML file
tree = etree.HTML(page_text): for page source fetched from the internet
The xpath method always returns a list
Locating tags
tree.xpath("")
A leading / in an xpath expression means the tag must be located starting from the root node
A leading // means the tag can be located from anywhere in the document
A non-leading // means any number of intermediate levels
A non-leading / means exactly one level down
Locating by attribute: //div[@class='ddd']
Locating by index: //div[@class='ddd']/li[3]  # indexing starts at 1
Locating by index: //div[@class='ddd']//li[2]  # indexing starts at 1
Extracting data
Taking text:
tree.xpath("//p[1]/text()"): only the text directly inside the tag
tree.xpath("//div[@class='ddd']/li[2]//text()"): all the text inside the tag
Taking an attribute:
tree.xpath('//a[@id="feng"]/@href')
A minimal sketch of these expressions follows below.
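A small self-contained sketch of the expressions above, run against a made-up inline HTML snippet (tags and attributes invented purely for illustration):
from lxml import etree

# a made-up HTML snippet just to exercise the expressions
html = '''
<div class="ddd">
    <p>intro</p>
    <li><a id="feng" href="/a/1">item one</a></li>
    <li>item <span>two</span></li>
    <li>item three</li>
</div>
'''
tree = etree.HTML(html)

print(tree.xpath('//div[@class="ddd"]'))                # locate by attribute
print(tree.xpath('//div[@class="ddd"]/li[3]'))          # index, starting at 1
print(tree.xpath('//p[1]/text()'))                      # text directly inside the tag
print(tree.xpath('//div[@class="ddd"]/li[2]//text()'))  # all text inside the tag
print(tree.xpath('//a[@id="feng"]/@href'))              # take an attribute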
1. Scrape job listings from BOSS直聘
from lxml import etree
import requests
import time
url = 'https://www.zhipin.com/job_detail/?query=python&city=101120200&industry=&position='
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    'cookie':'__zp__pub__=; lastCity=101120200; __c=1594792470; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1594713563,1594713587,1594792470; __l=l=%2Fwww.zhipin.com%2Fqingdao%2F&r=&friend_source=0&friend_source=0; __a=26925852.1594713563.1594713586.1594792470.52.3.39.52; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1594801318; __zp_stoken__=c508aZxdfUB9hb0Q8ORppIXd7JTdDTF96U3EdCDgIHEscYxUsVnoqdH9VBxY5GUtkJi5wfxggRDtsR0dAT2pEDDRRfWsWLg8WUmFyWQECQlYFSV4SCUQqUB8yfRwAUTAyZBc1ABdbRRhyXUY%3D'
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//*[@id="main"]/div/div[2]/ul/li')
for li in li_list:
    # extract the relevant data from the partial page source represented by li
    # when an xpath expression is used inside a loop like this, it must start with ./ or .//
    detail_url = 'https://www.zhipin.com' + li.xpath('.//span[@class="job-name"]/a/@href')[0]
    job_title = li.xpath('.//span[@class="job-name"]/a/text()')[0]
    company = li.xpath('.//div[@class="info-company"]/div/h3/a/text()')[0]
    # request the detail page and parse the job description out of it
    detail_page_text = requests.get(detail_url, headers=headers).text
    tree = etree.HTML(detail_page_text)
    job_desc = tree.xpath('//div[@class="text"]/text()')
    # join the list into a single string
    job_desc = ''.join(job_desc)
    print(job_title, company, job_desc)
    time.sleep(5)
2. Scrape Qiushibaike
Scrape the author and the post text. Note that authors can be anonymous or registered users
from lxml import etree
import requests
url = "https://www.qiushibaike.com/text/page/4/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@class="col1 old-style-col1"]/div')
print(div_list)
for div in div_list:
    # authors are either anonymous or registered users, so two xpath branches are joined with |
    author = div.xpath('.//div[@class="author clearfix"]//h2/text() | .//div[@class="author clearfix"]/span[2]/h2/text()')[0]
    content = div.xpath('.//div[@class="content"]/span//text()')
    content = ''.join(content)
    print(author, content)
3. Scrape images from a wallpaper site
from lxml import etree
import requests
import os
dir_name = "./img2"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
for i in range(1, 6):
    # the first page has no index suffix in its url
    if i == 1:
        url = "http://pic.netbian.com/4kmeinv/"
    else:
        url = f"http://pic.netbian.com/4kmeinv/index_{i}.html"
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_src = "http://pic.netbian.com/" + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/b/text()')[0]
        # fix garbled Chinese characters in the image name
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        response = requests.get(img_src).content
        img_path = dir_name + "/" + f"{img_name}.jpg"
        with open(img_path, "wb") as f:
            f.write(response)
    print(f"第{i}页成功")
4. IP Proxies
Proxy server
Forwards your requests, which lets you change the source IP of each request
Proxy anonymity levels
Transparent: the server knows you are using a proxy and also knows your real IP
Anonymous: the server knows you are using a proxy but does not know your real IP
Elite (high anonymity): the server neither knows you are using a proxy nor knows your real IP
Proxy types
http: this type of proxy can only forward HTTP requests
https: can only forward HTTPS requests
Sites offering free proxy IPs
快代理
西祠代理
goubanjia
代理精灵 (recommended): http://http.zhiliandaili.cn/
What do you do when your IP gets banned while crawling?
Use a proxy (see the sketch below)
Build a proxy pool
Use a dial-up server
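A minimal sketch of how a proxy is passed to requests; the proxy address below is a made-up placeholder, substitute one obtained from a provider such as those listed above:
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
# placeholder proxy address; the dict key matches the protocol of the target url
proxies = {'https': 'https://1.2.3.4:8888'}

# the request is forwarded through the proxy, so the target site sees the proxy's IP
page_text = requests.get('https://www.baidu.com/s?wd=ip',
                         headers=headers, proxies=proxies).text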
import requests
import random
from lxml import etree
# the proxy pool, stored as a list of dicts
all_ips = []
proxy_url = "http://t.11jsq.com/index.php/api/entry?method=proxyServer.generate_api_url&packid=1&fa=0&fetch_key=&groupid=0&qty=5&time=1&pro=&city=&port=1&format=html&ss=5&css=&dt=1&specialTxt=3&specialJson=&usertype=15"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
proxy_page_text = requests.get(url=proxy_url, headers=headers).text
tree = etree.HTML(proxy_page_text)
proxy_list = tree.xpath('//body//text()')
for ip in proxy_list:
    dic = {'https': ip}
    all_ips.append(dic)
# scrape the free proxy IPs listed on kuaidaili
free_proxies = []
for i in range(1, 3):
    url = f"http://www.kuaidaili.com/free/inha/{i}/"
    page_text = requests.get(url, headers=headers, proxies=random.choice(all_ips)).text
    tree = etree.HTML(page_text)
    # note: a tbody copied from browser devtools often does not exist in the raw HTML;
    # drop it from the expression if the xpath returns nothing
    tr_list = tree.xpath('//*[@id="list"]/table/tbody/tr')
    for tr in tr_list:
        ip = tr.xpath("./td/text()")[0]
        port = tr.xpath("./td[2]/text()")[0]
        dic = {
            "ip": ip,
            "port": port
        }
        print(dic)
        free_proxies.append(dic)
    print(f"第{i}页")
print(len(free_proxies))
5. Handling Cookies
Video-parsing API endpoints
https://www.wocao.xyz/index.php?url=
https://2wk.com/vip.php?url=
https://api.47ks.com/webcloud/?v-
Video-parsing sites
牛巴巴 http://mv.688ing.com/
爱片网 https://ap2345.com/vip/
全民解析 http://www.qmaile.com/
Back to the topic
Why do cookies need to be handled?
They keep the client's state with the server
Requests may have to carry a cookie, so how do you deal with cookie-based anti-crawling?
# manual handling
Capture the cookie with a packet-capture tool and put it into headers
# automatic handling
Use the session mechanism
Use case: cookies that change dynamically
session object: used almost exactly like the requests module. If a request made through the session produces a cookie, the cookie is automatically stored in the session and carried on subsequent requests
Scrape data from Xueqiu
import requests
s = requests.Session()
main_url = "https://xueqiu.com"  # request the home page first so the session obtains the cookie
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
params = {
    "size": "8",
    '_type': "10",
    "type": "10"
}
s.get(main_url, headers=headers)
url = 'https://stock.xueqiu.com/v5/stock/hot_stock/list.json?size=8&_type=10&type=10'
page_text = s.get(url, headers=headers).json()
print(page_text)
6. CAPTCHA Recognition
Online CAPTCHA-solving platforms
- 打码兔
- 云打码
- 超级鹰: http://www.chaojiying.com/about.html
1. Register and log in (verify your identity in the user center)
2. After logging in
Create a software entry: 软件ID -> generates a software id
Download the sample code: 开发文档 -> python -> download
Demo of the platform's sample code
import requests
from hashlib import md5

class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

chaojiying = Chaojiying_Client('超级鹰用户名', '超级鹰用户名的密码', '96001')
im = open('a.jpg', 'rb').read()
print(chaojiying.PostPic(im, 1902)['pic_str'])
Recognize the CAPTCHA on the gushiwen (古诗文网) login page
zbb.py
import requests
from hashlib import md5

class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

def www(path, type):
    chaojiying = Chaojiying_Client('5423', '521521', '906630')
    im = open(path, 'rb').read()
    return chaojiying.PostPic(im, type)['pic_str']
requests.py (note: naming this script requests.py shadows the requests library and breaks import requests; give it another name in practice)
import requests
from lxml import etree
from zbb import www
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
# parse out the CAPTCHA image url and download the image
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = requests.get(img_url, headers=headers).content
with open('./111.jpg', 'wb') as f:
    f.write(img_data)
# send the image to the Chaojiying platform (code type 1004) and print the result
img_text = www('./111.jpg', 1004)
print(img_text)
7. Simulated Login
Why does a crawler need to simulate logging in?
Some data is only shown after you have logged in
The gushiwen site
Anti-crawling mechanisms involved
1. CAPTCHA
2. Dynamic request parameters: the corresponding request parameters change on every request
Capturing them dynamically: usually the dynamic request parameters are hidden in the source of the front-end page
3. The cookie is issued along with the CAPTCHA image request
A really annoying one
import requests
from lxml import etree
from zbb import www
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
# get the cookie: use a session so any cookie produced along the way is retained
s = requests.Session()
# s_url = "https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx"
# s.get(s_url, headers=headers)
# fetch the CAPTCHA
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
img_url = "https://so.gushiwen.cn/" + tree.xpath('//*[@id="imgCode"]/@src')[0]
# download the CAPTCHA image through the session so its cookie is stored in s
img_data = s.get(img_url, headers=headers).content
with open('./111.jpg', 'wb') as f:
    f.write(img_data)
img_text = www('./111.jpg', 1004)
print(img_text)
# capture the dynamic request parameters hidden in the page source
__VIEWSTATE = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
__VIEWSTATEGENERATOR = tree.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')[0]
# the url the login button posts to: captured with a packet-capture tool
login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
data = {
    "__VIEWSTATE": __VIEWSTATE,
    "__VIEWSTATEGENERATOR": __VIEWSTATEGENERATOR,  # changes on every request
    "from": "http://so.gushiwen.cn/user/collect.aspx",
    "email": "[email protected]",
    "pwd": "zxy521",
    "code": img_text,
    "denglu": "登录"
}
main_page_text = s.post(login_url, headers=headers, data=data).text
with open('main.html', 'w', encoding='utf-8') as fp:
    fp.write(main_page_text)
8. Asynchronous Crawling with a Thread Pool
Fetch the first ten pages of Qiushibaike asynchronously with a thread pool
import requests
from multiprocessing.dummy import Pool
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
# collect the urls into a list
urls = []
for i in range(1, 11):
    urls.append(f'https://www.qiushibaike.com/8hr/page/{i}/')
# the request function the pool will run; map allows it to take only one argument
def get_request(url):
    return requests.get(url, headers=headers).text
# instantiate a pool of 10 threads and map the request function over the urls
pool = Pool(10)
response_text_list = pool.map(get_request, urls)
print(response_text_list)
9. Single Thread + Multi-Task Async Coroutines
1. Introduction
Coroutine: an object
# A coroutine can be thought of as a special function. If a function definition is decorated with the async keyword, calling it does not execute the function body immediately; instead it returns a coroutine object.
Task object (task)
# A task object is a further wrapper around a coroutine object. Through the task object you can inspect the running state of the coroutine.
# Task objects ultimately have to be registered with an event loop.
Binding a callback
# A callback is bound to a task object; it runs only after the task's special function has finished executing
Event loop object
# An object that loops endlessly; think of it as a container that holds multiple task objects (blocks of code waiting to be executed).
Where the asynchrony comes in
# Once the event loop starts, it executes the task objects in order,
# but when a task hits a blocking operation the loop does not wait; it moves straight on to the next task
await: suspends the operation and gives up the CPU
Single task
from time import sleep
import asyncio
# callback function:
# its default argument is the task object
def callback(task):
    print('i am callback!!1')
    # result() returns the return value of the special function wrapped by the task
    print(task.result())
async def get_request(url):
    print('正在请求:', url)
    sleep(2)
    print('请求结束:', url)
    return 'hello bobo'
# create a coroutine object
c = get_request('www.1.com')
# wrap it in a task object
task = asyncio.ensure_future(c)
# bind the callback to the task
task.add_done_callback(callback)
# create an event loop object
loop = asyncio.get_event_loop()
# register the task with the event loop and start the loop
loop.run_until_complete(task)
2. Multi-task async coroutines
import asyncio
from time import sleep
import time
start = time.time()
urls = [
    'http://localhost:5000/a',
    'http://localhost:5000/b',
    'http://localhost:5000/c'
]
# the code awaiting execution must not use modules that do not support async
# every blocking operation inside this function must be decorated with the await keyword
async def get_request(url):
    print('正在请求:', url)
    # sleep(2)  # time.sleep is not async-aware, so asyncio.sleep is used instead
    await asyncio.sleep(2)
    print('请求结束:', url)
    return 'hello bobo'
tasks = []  # holds all the task objects
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print(time.time() - start)
Notes:
1. Store the task objects in a list and register that list with the event loop; when registering, the list has to be wrapped with the wait method.
2. Inside the special function wrapped by a task, code from modules that do not support async must not appear, otherwise the whole asynchronous effect is broken; and every blocking operation inside that function must be decorated with the await keyword.
3. Code using the requests module must not appear inside the special function, because requests does not support async
3. aiohttp
A network-request module that supports asynchronous operation
- Environment setup: pip install aiohttp
import asyncio
import time
import aiohttp
from lxml import etree
urls = [
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
    'http://localhost:5000/bobo',
]
# requests cannot be used here: it does not support async, so it would kill the asynchronous effect
async def req(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url) as response:
            # response.read() returns bytes; response.text() returns a string
            page_text = await response.text()
            return page_text
# detail: put async in front of every with, and await in front of every blocking step
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    name = tree.xpath('//p/text()')[0]
    print(name)
if __name__ == '__main__':
    start = time.time()
    tasks = []
    for url in urls:
        c = req(url)
        task = asyncio.ensure_future(c)
        task.add_done_callback(parse)
        tasks.append(task)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    print(time.time() - start)
10. selenium
Concept
A module for browser automation.
Environment setup:
install the selenium module
What does selenium have to do with crawling?
It is a convenient way to get data that a page loads dynamically
Scraping with the requests module: what you see is not necessarily what you get
selenium: what you see is what you get
It can also be used to simulate logging in
Basic usage:
Chrome driver download address:
http://chromedriver.storage.googleapis.com/index.html
Mapping table between chromedriver versions and Chrome versions:
https://blog.csdn.net/huilan_same/article/details/51896672
Action chains
A sequence of browser actions
Headless browser
A browser without a visible UI
PhantomJS (a headless-Chrome sketch follows below)
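A minimal headless-Chrome sketch (PhantomJS is no longer maintained, so headless Chrome is the usual choice today; the driver path below is a placeholder, point it at your own chromedriver):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# run Chrome without a visible window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

bro = webdriver.Chrome(executable_path=r'C:\path\to\chromedriver.exe',
                       options=chrome_options)
bro.get('https://www.baidu.com')
# page_source still contains the fully rendered page even without a UI
print(bro.page_source[:200])
bro.quit()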
1. Basic operations: JD.com example
from selenium import webdriver
from time import sleep
# 1. instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
# 2. have the browser send the request
url = 'https://www.jd.com'
bro.get(url)
# 3. locate a tag
search_input = bro.find_element_by_id('key')
# 4. interact with the located tag
search_input.send_keys('华为')
# 5. a sequence of actions: click the search button
btn = bro.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
btn.click()
sleep(2)
# 6. execute js code: scroll to the bottom of the page
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
sleep(3)
# 7. close the browser
bro.quit()
2. Scrape the drug administration data with selenium
from selenium import webdriver
from lxml import etree
from time import sleep
page_text_list = []
# instantiate a browser object
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'http://125.35.6.84:81/xk/'
bro.get(url)
# must wait until the page has finished loading
sleep(2)
# page_source is the source of the page currently open in the browser
page_text = bro.page_source
page_text_list.append(page_text)
# the button has to be visible in the window before it can be clicked, so scroll first
jsCode = 'window.scrollTo(0,document.body.scrollHeight)'
bro.execute_script(jsCode)
# open the next two pages as well
for i in range(2):
    bro.find_element_by_id('pageIto_next').click()
    sleep(2)
    page_text = bro.page_source
    page_text_list.append(page_text)
for p in page_text_list:
    tree = etree.HTML(p)
    li_list = tree.xpath('//*[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)
sleep(2)
bro.quit()
3. Action chains
from lxml import etree
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains
# instantiate a browser object
page_text_list = []
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
url = 'https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro.get(url)
# if the target tag lives inside an iframe sub-page, you must switch_to that frame before locating it
bro.switch_to.frame('iframeResult')
div_tag = bro.find_element_by_id('draggable')
# 1. instantiate an action-chain object
action = ActionChains(bro)
action.click_and_hold(div_tag)
for i in range(5):
    # perform() makes the queued actions execute immediately
    action.move_by_offset(17, 0).perform()
    sleep(0.5)
# release the mouse button
action.release()
sleep(3)
bro.quit()
4. Dealing with anti-selenium detection
Many sites such as Taobao block selenium-driven browsers
In a normal browser, typing window.navigator.webdriver in the console returns undefined
In a browser opened by selenium it returns true
from selenium import webdriver
from selenium.webdriver import ChromeOptions
option = ChromeOptions()
# excludeSwitches removes the automation flag that exposes selenium
option.add_experimental_option('excludeSwitches', ['enable-automation'])
# instantiate a browser object with that option
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe', options=option)
bro.get('https://www.taobao.com/')
5. Simulate logging in to 12306
from selenium import webdriver
from selenium.webdriver import ActionChains
from PIL import Image  # used for cropping the screenshot; install with pip install pillow
from zbb import www
from time import sleep
bro = webdriver.Chrome(executable_path=r'C:\Users\zhui3\Desktop\chromedriver.exe')
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
sleep(5)
# switch to the account-login tab
zhdl = bro.find_element_by_xpath('/html/body/div[2]/div[2]/ul/li[2]/a')
zhdl.click()
sleep(1)
username = bro.find_element_by_id('J-userName')
username.send_keys('181873')
pwd = bro.find_element_by_id('J-password')
pwd.send_keys('zx1')
# capture (crop out) the CAPTCHA image
bro.save_screenshot('main.png')
# locate the tag of the CAPTCHA image
code_img_ele = bro.find_element_by_xpath('//*[@id="J-loginImg"]')
location = code_img_ele.location  # the CAPTCHA image's top-left coordinates within the full page
size = code_img_ele.size  # the width and height of the CAPTCHA image
# rectangle to crop: (left, upper, right, lower)
rangle = (
    int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))
i = Image.open('main.png')
frame = i.crop(rangle)
frame.save('code.png')
# use the CAPTCHA-solving platform to recognize the click positions
result = www('./code.png', 9004)
# x1,y1|x2,y2|x3,y3 ==> [[x1,y1],[x2,y2],[x3,y3]]
all_list = []  # each element is one point's coordinates, measured relative to the CAPTCHA image
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)
action = ActionChains(bro)
# click each returned point, offset from the CAPTCHA image element
for l in all_list:
    x = l[0]
    y = l[1]
    action.move_to_element_with_offset(code_img_ele, x, y).click().perform()
    sleep(2)
btn = bro.find_element_by_xpath('//*[@id="J-login"]')
btn.click()
action.release()
sleep(3)
bro.quit()