Implement a crawler based on the JWT login mode that scrapes the data behind https://login3.scrape.center/login and persists it to MySQL. The target table is spider_books; field names are self-chosen. The stored fields include: book title, author, local path of the cover image, score, introduction, tags, price, publisher, publication date, page count, and ISBN.
Inspecting the login, we find that the request URL is https://login3.scrape.center/api/login. It is sent via Ajax, and the response status code is 200.
The response is JSON data containing a token field, whose value looks like this:
token: "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoxLCJ1c2VybmFtZSI6ImFkbWluIiwiZXhwIjoxNjY3MjQ3NTcyLCJlbWFpbCI6ImFkbWluQGFkbWluLmNvbSIsIm9yaWdfaWF0IjoxNjY3MjA0MzcyfQ.sk-QJ7U5qezyn9bzr_l7naDHfHaKmYbmuNhjsi5VGy4"
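As an aside, a JWT consists of three base64url-encoded segments separated by dots (header.payload.signature), so we can peek at the claims without verifying the signature. A minimal sketch using only the standard library, applied to the token captured above:

import base64
import json

token = 'eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJ1c2VyX2lkIjoxLCJ1c2VybmFtZSI6ImFkbWluIiwiZXhwIjoxNjY3MjQ3NTcyLCJlbWFpbCI6ImFkbWluQGFkbWluLmNvbSIsIm9yaWdfaWF0IjoxNjY3MjA0MzcyfQ.sk-QJ7U5qezyn9bzr_l7naDHfHaKmYbmuNhjsi5VGy4'
payload_b64 = token.split('.')[1]           # the middle segment is the payload
payload_b64 += '=' * (-len(payload_b64) % 4)  # restore base64 padding
print(json.loads(base64.urlsafe_b64decode(payload_b64)))
# {'user_id': 1, 'username': 'admin', 'exp': 1667247572, 'email': 'admin@admin.com', 'orig_iat': 1667204372}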
With this JWT in hand, how do we fetch the subsequent data? Observing the follow-up requests, as shown in the figure, we find that the Ajax requests that fetch data carry an extra Authorization field in their Request Headers, whose value is the string jwt followed by the JWT obtained above; the response is again JSON data.
So the crawl boils down to two steps:
1. Simulate the login request with the necessary credentials and extract the JWT from the result.
2. Add an Authorization field to the Request Headers of subsequent requests, with the JWT as its value.
In the code below we define the login endpoint and the data endpoint as LOGIN_URL and INDEX_URL respectively, then simulate the login with a POST request; since the submitted data is JSON, it is passed via the json parameter. We then extract the JWT from the response. In the second step we construct the Request Headers, set the Authorization field to carry the JWT, and can then fetch the data successfully.
import os
import urllib.request
import pymysql
import requests
from urllib.parse import urljoin  # join relative paths onto the base URL

# Constants
BASE_URL = 'https://login3.scrape.center/'
LOGIN_URL = urljoin(BASE_URL, '/api/login')
INDEX_URL = urljoin(BASE_URL, '/api/book')
USERNAME = 'admin'
PASSWORD = 'admin'
# requests.post(url, json=..., headers=...) returns a Response object
response_login = requests.post(LOGIN_URL, json={
    'username': USERNAME,
    'password': PASSWORD
})
data = response_login.json()
# print('Response JSON', data)
jwt = data.get('token')
# print('JWT', jwt)
headers = {
'Authorization': f'jwt {jwt}'
}
response_index = requests.get(INDEX_URL, params={
'limit': 18,
'offset': 0
}, headers=headers)
# print('Response Status', response_index.status_code)
# print('Response URL', response_index.url)
# print('Response Data', response_index.json())
# The book list in response_index.json() is stored under the 'results' key
books = response_index.json()['results']
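Note that this request only fetches the first page (18 books). The index API paginates with the usual limit/offset parameters, so a sketch for collecting every page could look like the following; fetch_all_books is a hypothetical helper, and looping until an empty results list comes back is an assumption about the API's behavior:

def fetch_all_books(headers, limit=18):
    # Page through the index API until no more results are returned (assumed behavior)
    all_books = []
    offset = 0
    while True:
        resp = requests.get(INDEX_URL, params={'limit': limit, 'offset': offset}, headers=headers)
        results = resp.json().get('results')
        if not results:
            break
        all_books.extend(results)
        offset += limit
    return all_books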
# urllib.request.Request(url, data=None, headers={}) builds a request object,
# and urllib.request.urlopen() sends it. A quick demo:
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
# Download a cover image and return its local path ("null" when there is no cover)
def save_image(url, name):
    if url is not None:
        req_image = urllib.request.Request(url)
        image = urllib.request.urlopen(req_image).read()
        # Save path: image/name.jpg; make sure the directory exists first
        os.makedirs("image", exist_ok=True)
        path = "image\\" + name + ".jpg"
        with open(path, "wb") as f_image:
            f_image.write(image)
    else:
        path = "null"
    return path
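For example, save_image(books[0]['cover'], books[0]['name']) downloads the first cover and returns a path like image\<name>.jpg. A book name containing characters that are illegal in file names would need extra sanitizing, which is omitted here.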
# Connect to MySQL and obtain a cursor via db.cursor()
db = pymysql.connect(host='localhost', user='root', password='****', port=3306, db='spiders')
cursor = db.cursor()
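The task leaves the column definitions up to us, so here is one possible schema for spider_books, a sketch whose column names simply mirror those used in the INSERT statement below; the column types and sizes are assumptions:

# One possible schema for spider_books; column types are assumptions
cursor.execute('''
    CREATE TABLE IF NOT EXISTS spider_books (
        id INT AUTO_INCREMENT PRIMARY KEY,
        b_name VARCHAR(255),
        author VARCHAR(255),
        cover VARCHAR(255),
        score VARCHAR(16),
        intro TEXT,
        label VARCHAR(255),
        price VARCHAR(32),
        publish_house VARCHAR(255),
        publish_date VARCHAR(32),
        page VARCHAR(16),
        isbn VARCHAR(32)
    )
''')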
for i in range(len(books)):
    book_id = books[i]['id']
    book_name = books[i]['name'] if books[i]['name'] is not None else "null" + str(i + 1)
    author = ""
    author_list = books[i]['authors']
    if author_list is not None:
        for j in range(len(author_list)):
            # strip() removes the extra whitespace around each author name
            author += author_list[j].strip() + "/"
    else:
        author = "null"
    image_url = books[i]['cover']
    image_path = save_image(image_url, book_name)
    score = books[i]['score']
    # Fetch this book's detail API (str() guards against a numeric id)
    req = requests.get(INDEX_URL + "/" + str(book_id) + "/", headers=headers)
    html = req.json()
    introduce = html['introduction']
    # Concatenate the tags into a single space-separated string
    tag = ""
    tag_list = html['tags']
    if tag_list is not None:
        for j in range(len(tag_list)):
            tag += tag_list[j] + " "
    else:
        tag = "null"
    price = html['price']
    publisher = html['publisher']
    published_time = html['published_at']
    page_num = html['page_number']
    isbn_num = html['isbn']
    # Parameterized INSERT statement (the table required by the task is spider_books)
    sql = 'INSERT INTO spider_books(b_name,author,cover,score,intro,label,price,publish_house,publish_date,page,isbn) ' \
          'values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
    # print(book_name, author, image_path, score, introduce, tag, price)
    # print(publisher, published_time, page_num, isbn_num)
    try:
        # Execute the SQL statement
        cursor.execute(sql, (
            book_name, author, image_path, score, introduce, tag, price, publisher, published_time, page_num, isbn_num))
        # Commit the transaction
        db.commit()
        print('Inserted successfully')
    except Exception as err:
        # Roll back on failure
        db.rollback()
        print(err)
        print('Insert failed')
db.close()