Python Web Scraper (Scraping with Bs4; Saving to CSV, Excel, and a Database)

1. What the scraper collects:

        Fields scraped from Douban Books: title, author, publisher, publication date, list price, and rating.

        Pages scraped: the first 3 pages of the tag listing.

        URL: only the value of the start={} query parameter differs between pages, so changing it in steps of 20 fetches successive pages.

        Page 1: https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T
        Page 2: https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T
        Page 3: https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=40&type=T

        The pages are parsed with Bs4 (BeautifulSoup).
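The pagination rule above can be sketched as a small helper; the base URL and the step of 20 come straight from the listing pages (the helper name `page_urls` is my own, not from the original script):

```python
# Build the listing-page URLs: only the start parameter changes, in steps of 20.
BASE = "https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start={}&type=T"

def page_urls(pages, per_page=20):
    """Return the URL for each of the first `pages` listing pages."""
    return [BASE.format(i * per_page) for i in range(pages)]

for u in page_urls(3):
    print(u)
```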

2. Scraper code:

# Book tag: 编程 (programming)
# Scrape 3 pages of:
# title, author, publisher, publication date, price, rating
# https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T
# https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T
# https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=40&type=T
import pandas as pd
import requests
from bs4 import BeautifulSoup

t_list = []  # titles
z_list = []  # authors
c_list = []  # publishers
n_list = []  # publication dates
p_list = []  # prices
s_list = []  # ratings
for i in range(3):  # first 3 pages, 20 books per page
    url = f"https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start={i * 20}&type=T"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.39"
    }
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    div = soup.find('div', class_="article")
    div2 = div.find('div', attrs={"id": "subject_list"})
    ul = div2.find('ul')
    li_all = ul.find_all('li')
    for li in li_all:
        info = li.find('div', class_="info")
        title = info.find('a').get("title")   # book title
        xx = li.find('div', class_="pub").text.strip().split("/")
        a = li.find('div', class_="star clearfix")
        try:
            score = a.find('span', class_="rating_nums").text    # rating
        except AttributeError:
            score = None   # books with too few ratings show no score
        # Defaults, in case the pub field splits into an unexpected number of parts
        zuozhe = chu = time = price = None
        if len(xx) == 5:
            # translated book: author / translator / publisher / date / price
            zuozhe = xx[0] + xx[1]
            chu = xx[2]
            time = xx[3].replace('月', '').replace('年', '-')
            price = xx[4].replace("元", "").replace("CNY ", "")
        if len(xx) == 4:
            if title == "动手学深度学习(PyTorch版)":
                # special case: author / translator / publisher / date, no price listed
                zuozhe = xx[0] + xx[1]
                chu = xx[2]
                time = xx[3].replace('月', '').replace('年', '-')
                price = None
            else:
                # author / publisher / date / price
                zuozhe = xx[0]
                chu = xx[1]
                time = xx[2].replace('月', '').replace('年', '-')
                price = xx[3].replace("元", "").replace("CNY ", "")
        t_list.append(title)
        z_list.append(zuozhe)
        c_list.append(chu)
        n_list.append(time)
        p_list.append(price)
        s_list.append(score)


# Save to an Excel file
data = pd.DataFrame(data={"Title": t_list, "Author": z_list, "Publisher": c_list,
                          "PubDate": n_list, "Price": p_list, "Rating": s_list})
data.to_excel("douban_books.xlsx", index=False)
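The article's title also mentions CSV; the same DataFrame can be written out with `to_csv`. A minimal sketch (a small stand-in DataFrame keeps it self-contained; in the scraper above you would call `data.to_csv(...)` on the scraped DataFrame directly):

```python
import pandas as pd

# Stand-in for the scraped DataFrame, so the example runs on its own.
data = pd.DataFrame({"Title": ["Fluent Python"], "Rating": ["9.5"]})

# utf-8-sig writes a BOM so Excel displays non-ASCII text (e.g. Chinese) correctly.
data.to_csv("douban_books.csv", index=False, encoding="utf-8-sig")
```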

# Save to a MySQL database
import pymysql

db = pymysql.connect(user="root", password="123456", host="localhost",
                     port=3306, charset="utf8", db="python")
cursor = db.cursor()
# Create the table; the column list matches the six scraped fields
sql_create = """CREATE TABLE IF NOT EXISTS books (
    title VARCHAR(255), author VARCHAR(255), publisher VARCHAR(255),
    pub_date VARCHAR(50), price VARCHAR(50), rating VARCHAR(50))"""
cursor.execute(sql_create)
# Parameterized insert: one %s placeholder per column
sql_insert = "INSERT INTO books VALUES (%s, %s, %s, %s, %s, %s)"
try:
    for k in range(len(t_list)):
        cursor.execute(sql_insert, (t_list[k], z_list[k], c_list[k],
                                    n_list[k], p_list[k], s_list[k]))
    db.commit()
except pymysql.MySQLError:
    db.rollback()
finally:
    db.close()
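If no MySQL server is at hand, the same create-table-and-insert flow can be sketched with the standard-library `sqlite3` module (note that sqlite3 uses `?` placeholders instead of `%s`); the row data and column names below are illustrative stand-ins for the scraped lists:

```python
import sqlite3

# Stand-in row for the six scraped fields: title, author, publisher, date, price, rating.
rows = [("Fluent Python", "Luciano Ramalho", "O'Reilly", "2015-8", "39.99", "9.5")]

conn = sqlite3.connect(":memory:")   # in-memory DB; pass a filename to persist
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS books (
    title TEXT, author TEXT, publisher TEXT,
    pub_date TEXT, price TEXT, rating TEXT)""")
cur.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()
print(cur.execute("SELECT COUNT(*) FROM books").fetchone()[0])  # → 1
```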

3. The Excel file produced by the scraper:

[Screenshot: the saved Excel file with the scraped Douban book data]
