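Fields scraped from Douban Books: title, author, publisher, publication date, list price, and rating.
Pages scraped: the first 3 pages of results.
URL: the pages differ only in the value of start={}. Each page lists 20 books, so the offset increases by 20 per page, and changing this value is all it takes to scrape multiple pages.
Page 1: https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T
Page 2: https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T
Page 3: https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=40&type=T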
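The offset pattern in the URLs above can be sketched as a small helper. This is only an illustration: `page_url`, `BASE_URL`, and `PAGE_SIZE` are names made up here, and the step of 20 is read off the three URLs rather than taken from any Douban documentation.

```python
# Hypothetical helper: build the tag-page URL for a zero-based page index.
BASE_URL = "https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start={}&type=T"
PAGE_SIZE = 20  # assumption: the offsets in the URLs above step by 20


def page_url(page_index):
    """Return the URL of the given results page (0 = first page)."""
    return BASE_URL.format(page_index * PAGE_SIZE)


# The first three pages, matching the URLs listed above.
urls = [page_url(i) for i in range(3)]
for u in urls:
    print(u)
```

Keeping the page size in one constant means that scraping more pages only requires changing the argument to `range`.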
Scraping with bs4 (BeautifulSoup)
# Book tag: 编程 (programming)
# Scrape 3 pages of results:
# title, author, publisher, publication date, price, rating
# https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T
# https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T
# https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=40&type=T
# https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=60&type=T
import pandas as pd
import requests
from bs4 import BeautifulSoup
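# One list per output column; each scraped book appends one value to every list
t_list = []  # titles
z_list = []  # authors
c_list = []  # publishers
n_list = []  # publication dates
p_list = []  # prices
s_list = []  # ratings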
for i in range(3):  # 3 pages: start = 0, 20, 40
    url = f"https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start={i * 20}&type=T"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.39"
    }
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    div = soup.find('div', class_="article")
    div2 = div.find('div', attrs={"id": "subject_list"})
    ul = div2.find('ul')
    li_all = ul.find_all('li')
    for li in li_all:
        info = li.find('div', class_="info")
        title = info.find('a').get("title")  # title
        # Publication line, e.g. "author / publisher / date / price"
        xx = [part.strip() for part in li.find('div', class_="pub").text.strip().split("/")]
        a = li.find('div', class_="star clearfix")
        try:
            score = a.find('span', class_="rating_nums").text  # rating
        except AttributeError:
            score = None  # no rating shown for this book
        # Reset per book so an unexpected field count never reuses the previous book's values
        zuozhe = chu = time = price = None
        if len(xx) == 5:  # two author fields / publisher / date / price
            zuozhe = xx[0] + " " + xx[1]
            chu = xx[2]
            time = xx[3].replace('月', '').replace('年', '-')
            price = xx[4].replace("元", "").replace("CNY ", "")
        elif len(xx) == 4:
            if title == "动手学深度学习(PyTorch版)":
                # Special case: this entry has two author fields and no price
                zuozhe = xx[0] + " " + xx[1]
                chu = xx[2]
                time = xx[3].replace('月', '').replace('年', '-')
                price = None
            else:  # author / publisher / date / price
                zuozhe = xx[0]
                chu = xx[1]
                time = xx[2].replace('月', '').replace('年', '-')
                price = xx[3].replace("元", "").replace("CNY ", "")
        t_list.append(title)
        z_list.append(zuozhe)
        c_list.append(chu)
        n_list.append(time)
        p_list.append(price)
        s_list.append(score)
# Save to an Excel file (writing .xlsx requires an Excel engine such as openpyxl to be installed)
data = pd.DataFrame(data={"书名":t_list,"作者":z_list,"出版社":c_list,"出版时间":n_list,"价格":p_list,"评分":s_list})
data.to_excel("豆瓣读书1.xlsx",index=False)
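The scraping loop decides which field is which by counting the "/"-separated parts of the publication line, which is why one title needs its own branch. A possible alternative, sketched here and not what the script above does, is to read the fields from the right. `parse_pub` and `DATE_RE` are hypothetical names, the sample strings are invented, and the sketch assumes dates look like "2020-5", "2020-5-1", "2020年" or "2020年5月".

```python
import re

# Assumption: a part is a date if it looks like "2020-5", "2020-5-1",
# "2020年" or "2020年5月". Anything else in last position is taken as the price.
DATE_RE = re.compile(r"^\d{4}(-\d{1,2}){1,2}$|^\d{4}年(\d{1,2}月)?$")


def parse_pub(pub_text):
    """Split a publication line into (author, publisher, date, price)."""
    parts = [p.strip() for p in pub_text.strip().split("/")]
    price = None
    # If the last part is not a date, treat it as the price.
    if parts and not DATE_RE.match(parts[-1]):
        price = parts.pop().replace("元", "").replace("CNY", "").strip()
    date = parts.pop() if parts else None
    if date is not None:
        # Normalize "2020年5月" to "2020-5", as the scraping loop does.
        date = date.replace("月", "").replace("年", "-").rstrip("-")
    publisher = parts.pop() if parts else None
    # Whatever remains is the author plus any translators.
    author = " ".join(parts) if parts else None
    return author, publisher, date, price


# Invented sample lines covering the three shapes handled in the loop.
with_translator = parse_pub("Author A / Translator B / Example Press / 2020-5 / 59.00元")
plain = parse_pub("Author A / Example Press / 2020年5月 / CNY 59.00")
no_price = parse_pub("Author A / Translator B / Example Press / 2020-5")
print(with_translator, plain, no_price, sep="\n")
```

Reading from the right avoids hard-coding a title, at the cost of relying on the date pattern: a date written in a form the regular expression does not cover would be mistaken for a price.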
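# Save to a MySQL database
import pymysql
db = pymysql.connect(user="root",password="123456",host="localhost",port=3306,charset="utf8mb4",database="python")
cursor = db.cursor()
# Create the table (assumed schema: one text column per scraped field; adjust names and types as needed)
sql_create = """
create table if not exists books (
    title varchar(255),
    author varchar(255),
    publisher varchar(255),
    pub_time varchar(50),
    price varchar(50),
    score varchar(10)
)
"""
cursor.execute(sql_create)
# The %s placeholders must not be quoted: the driver escapes and quotes the values itself
sql_insert = 'insert into books values(%s,%s,%s,%s,%s,%s)'
try:
    for k in range(len(t_list)):
        cursor.execute(sql_insert,(t_list[k],z_list[k],c_list[k],n_list[k],p_list[k],s_list[k]))
    db.commit()
except pymysql.MySQLError:
    db.rollback()
    raise
finally:
    db.close()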
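The database step works by passing each row's values as query parameters instead of formatting them into the SQL string. The sketch below shows the same insert, commit, and rollback pattern using Python's built-in sqlite3 module, chosen only so that it runs without a MySQL server; it is a stand-in, not the PyMySQL code. Note that sqlite3 uses ? placeholders where PyMySQL uses %s. The table layout and the sample rows are invented for illustration.

```python
import sqlite3

# Invented sample rows standing in for the scraped lists.
titles = ["Book A", "Book B"]
publishers = ["Example Press", "Sample House"]
scores = ["9.1", None]  # None is stored as NULL

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
cur = conn.cursor()
cur.execute("create table if not exists books (title text, publisher text, score text)")

# Combine the parallel lists into one tuple per book.
rows = list(zip(titles, publishers, scores))
try:
    # One parameter tuple per row; the driver handles quoting and escaping.
    cur.executemany("insert into books values (?, ?, ?)", rows)
    conn.commit()
except sqlite3.Error:
    conn.rollback()
    raise

saved = cur.execute("select title, publisher, score from books order by title").fetchall()
print(saved)
conn.close()
```

Zipping the parallel lists into row tuples also guards against the easy mistake of inserting the same index on every iteration.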