29. Multi-page scraping with Selenium, saved to three mainstream databases

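The previous post used scrapy-splash to scrape a single page of JS-loaded content, but scraping just one page with a crawler is like firing an anti-aircraft gun at a mosquito. This post sets aside the fiddler tricks and works out how to scrape multiple pages in the most basic way.

The target page looks simple enough, but its pagination has nothing to do with the URL: click through to another page and the URL stays exactly the same. And if the JS is not loaded, the pagination control does not appear at all. Sob.

Start by creating a new file in PyCharm to practice with.
The code to scrape one page is as follows: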

# coding:utf-8
from selenium import webdriver
from scrapy.selector import Selector

url = "http://www.zjzfcg.gov.cn/purchaseNotice/index.html?categoryId=3001"
driver = webdriver.Chrome()
driver.get(url)
data = driver.page_source
response = Selector(text=data)  # must be passed as text=data; a bare Selector(data) raises: 'str' object has no attribute 'text'
infodata = response.css(".items p")
for infoline in infodata:
    city = infoline.css(".warning::text").extract()[0].replace("[", "").replace("·", "").strip()
    issuescate = infoline.css(".warning .limit::text").extract()[0]
    title = infoline.css("a .underline::text").extract()[0].replace("]", "")
    publish_date = infoline.css(".time::text").extract()[0].replace("[", "").replace("]", "")
    print(city+"--"+title+"--"+issuescate+"--"+publish_date)
driver.close()
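
Each field above is read with extract()[0], which raises IndexError as soon as one row lacks the expected element, and the cleanup is a chain of replace calls. Below is a minimal sketch of two helpers that make both steps more forgiving. The names first_or_default and clean_field are made up for this illustration, and the sample strings are invented rather than taken from the site:

```python
def first_or_default(values, default=""):
    """Return the first item of a list such as Selector.css(...).extract(), or default if it is empty."""
    return values[0] if values else default


def clean_field(text, unwanted="[]·"):
    """Delete every character listed in `unwanted`, then trim surrounding whitespace."""
    for ch in unwanted:
        text = text.replace(ch, "")
    return text.strip()


# Invented sample values shaped like the scraped fields.
city = clean_field(first_or_default([" [杭州市 · "]), unwanted="[·")
publish_date = clean_field(first_or_default(["[2019-01-04]"]))
missing = clean_field(first_or_default([]))  # an empty match no longer crashes
print(city, publish_date, repr(missing))
```

Inside the loop, the list passed to first_or_default would be something like infoline.css(".time::text").extract(). Scrapy selectors also provide extract_first(), which returns None instead of raising when nothing matches.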

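The scraped output looks like this:

Now we can think about turning pages.
You can locate the element with webdriver's find_element_by_css_selector (written find_element(By.CSS_SELECTOR, ...) in current Selenium releases) and then call its click method. OK, let's run a small test: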

# -*- coding: utf-8 -*-
# @AuThor  : frank_lee

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import time


class ZfCaigou():
    """Click test: open the notice list and click "next page" eight times."""
    def __init__(self):
        super(ZfCaigou, self).__init__()
        # the actual URL
        self.url = 'http://www.zjzfcg.gov.cn/purchaseNotice/index.html?categoryId=3001'
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 30)  # timeout in seconds
        self.zoom = 1

    def open(self):
        self.driver.get(self.url)
        self.driver.maximize_window()
        time.sleep(5)
        i = 0
        while i < 8:
            # find_element(By.CSS_SELECTOR, ...) works in Selenium 3 and 4;
            # find_element_by_css_selector was removed in newer Selenium 4 releases
            self.driver.find_element(
                By.CSS_SELECTOR,
                'div.paginationjs-pages > ul > li.paginationjs-next.J-paginationjs-next a').click()
            i += 1

        time.sleep(3)


if __name__ == '__main__':
    z = ZfCaigou()
    z.open()

This does indeed make the dynamic page load its next pages.
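
The test relies on fixed time.sleep calls and never uses the WebDriverWait it creates. The idea behind an explicit wait is plain polling: retry a check until it succeeds or a deadline passes. Here is a pure-Python sketch of that idea. wait_until is a hypothetical helper written for this post, not Selenium's own implementation:

```python
import time


def wait_until(predicate, timeout=10.0, poll=0.5):
    """Call predicate() until it returns something truthy, then return that value.

    Raises TimeoutError if nothing truthy was returned before `timeout` seconds passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)


# Hypothetical usage with the driver from the class above
# ("css selector" is the string value of By.CSS_SELECTOR):
#   links = wait_until(lambda: driver.find_elements("css selector", "li.paginationjs-next a"))
#   links[0].click()
```

An empty list from find_elements is falsy, so the loop keeps polling until the next-page link actually exists instead of guessing how long the page needs.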

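So how do we scrape the content that loads after each page turn?
Just combine the two snippets above: put the parsing code from the first exercise inside the while loop of the second, before the click.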

# -*- coding: utf-8 -*-
# @AuThor  : frank_lee

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import time
from scrapy.selector import Selector


class ZfCaigou():
    """Scrape several pages of the notice list by parsing each page and then clicking "next page"."""
    def __init__(self):
        self.url = 'http://www.zjzfcg.gov.cn/purchaseNotice/index.html?categoryId=3001'
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 30)  # timeout in seconds
        self.zoom = 1

    def get_info(self):
        self.driver.get(self.url)
        self.driver.maximize_window()
        time.sleep(5)
        i = 0
        while i < 8:  # the page count is arbitrary; you could also define total_page and use self.total_page here
            time.sleep(2)
            data = self.driver.page_source
            response = Selector(text=data)  # must be passed as text=data; a bare Selector(data) raises: 'str' object has no attribute 'text'
            infodata = response.css(".items p")
            for infoline in infodata:
                city = infoline.css(".warning::text").extract()[0].replace("[", "").replace("·", "").strip()
                issuescate = infoline.css(".warning .limit::text").extract()[0]
                title = infoline.css("a .underline::text").extract()[0].replace("]", "")
                publish_date = infoline.css(".time::text").extract()[0].replace("[", "").replace("]", "")
                print(city + "--" + title + "--" + issuescate + "--" + publish_date)
            # find_element(By.CSS_SELECTOR, ...) works in Selenium 3 and 4;
            # find_element_by_css_selector was removed in newer Selenium 4 releases
            self.driver.find_element(
                By.CSS_SELECTOR,
                'div.paginationjs-pages > ul > li.paginationjs-next.J-paginationjs-next a').click()
            i += 1
            time.sleep(3)
        time.sleep(3)
        self.driver.close()


if __name__ == '__main__':
    z = ZfCaigou()
    z.get_info()
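
Two things in that loop could be tightened: the page count is hard-coded, and "next page" is clicked once more after the last page has already been scraped. The sketch below separates the paging logic from Selenium so it can be tried without a browser. scrape_pages and its three callback parameters are hypothetical names, and the fake pages are invented stand-ins:

```python
def scrape_pages(get_page_source, parse_page, go_next, total_pages):
    """Parse `total_pages` pages in order and return all parsed rows in one list.

    get_page_source() returns the current page's HTML, parse_page(html) returns
    a list of rows, and go_next() moves the browser to the following page.
    """
    rows = []
    for page_index in range(total_pages):
        rows.extend(parse_page(get_page_source()))
        if page_index < total_pages - 1:  # do not click "next" after the final page
            go_next()
    return rows


# Stand-in for the browser: three fake "pages" of comma-separated titles.
fake_pages = ["a,b", "c", "d,e"]
state = {"page": 0, "clicks": 0}


def fake_source():
    return fake_pages[state["page"]]


def fake_next():
    state["page"] += 1
    state["clicks"] += 1


titles = scrape_pages(fake_source, lambda text: text.split(","), fake_next, total_pages=3)
print(titles, state["clicks"])
```

With the class above, get_page_source would be lambda: self.driver.page_source, parse_page would wrap the Selector loop, and go_next would do the click followed by a short wait.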

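With the information we want scraped, the next step is saving it to a database.

1. Saving to MongoDB

Add to the initializer:

# Save to MongoDB (needs "import pymongo" at the top of the file); delete if not wanted
self.client = pymongo.MongoClient(host="localhost", port=27017)
self.db = self.client['zfcaigou']
# end of MongoDB section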

Add inside the for loop of get_info:

# build the record for MongoDB; delete if you do not want to save it
result = {
    "city": city,
    "issuescate": issuescate,
    "title": title,
    "publish_date": publish_date,
}
self.save_to_mongo(result)
# end

Then write one more custom method:

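def save_to_mongo(self, result):
    # insert_one replaces Collection.insert, which was removed in PyMongo 4
    if self.db['caigou'].insert_one(result).acknowledged:
        print("Saved successfully, hehe")

Dead simple, right? And watching the screen fill up with "Saved successfully" is a joy!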
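
One thing to watch: a plain insert adds a new document every time, so running the scraper twice stores each notice twice. Below is a hedged sketch of one way around that, using PyMongo's update_one with upsert=True. Treating title plus publish_date as the identity of a notice is an assumption made for this example, and save_unique is a hypothetical name:

```python
def save_unique(collection, result, key_fields=("title", "publish_date")):
    """Insert `result`, or overwrite the stored document that has the same key fields.

    `collection` is expected to be a PyMongo collection such as self.db['caigou'].
    Returns the query that was used to look for an existing document.
    """
    query = {field: result[field] for field in key_fields}
    # upsert=True: update the matching document, or insert one if nothing matches
    collection.update_one(query, {"$set": result}, upsert=True)
    return query
```

In save_to_mongo this would be called as save_unique(self.db['caigou'], result) in place of the insert.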

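2. Saving to MySQL

Add to the initializer:

# Save to MySQL (needs "import pymysql" at the top of the file); delete if not wanted
# Note: this reuses the name self.db, so rename one of them if the MongoDB code is enabled too
self.db = pymysql.connect(host="localhost", user="root", password="", database="test")
self.cursor = self.db.cursor()
# create the table if it does not exist yet
sql = """create table if not exists caigou (
  city varchar(30) not null ,
  issuescate varchar(30) not null,
  title varchar(200) not null,
  publish_date varchar(50) not null
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
"""
try:
    # execute the SQL statement
    self.cursor.execute(sql)
    # commit to the database
    self.db.commit()
except pymysql.MySQLError:
    # roll back if an error occurs
    self.db.rollback()
# end of MySQL section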

Add inside the for loop of get_info:

# save to MySQL, optional
insert_sql = """
    insert into caigou(city, issuescate, title, publish_date) values(%s, %s, %s, %s);
    """
result = {
    "city": city,
    "issuescate": issuescate,
    "title": title,
    "publish_date": publish_date
}
try:
    # execute the SQL statement
    self.cursor.execute(insert_sql, (result["city"], result["issuescate"], result["title"], result["publish_date"]))
    # commit to the database
    self.db.commit()
except pymysql.MySQLError:
    # roll back if an error occurs
    self.db.rollback()

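This way, as long as the MySQL connection works, everything is automated. The conservative approach used to be creating the table locally by hand and having the Python code only run the insert statements; this exercise adds the create-table statement as well.

The data was inserted successfully, (#^^.^#).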
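
Committing after every single row works, but it gets slow when each page holds many rows. The sketch below inserts one page of results with a single executemany call and one commit. It relies only on the DB-API methods that pymysql connections share with other drivers. The placeholder parameter is an addition so the same function can be tried against the standard library's sqlite3, which uses ? where pymysql uses %s. insert_rows is a hypothetical name and the demo rows are invented:

```python
import sqlite3

COLUMNS = ("city", "issuescate", "title", "publish_date")


def insert_rows(conn, rows, placeholder="%s"):
    """Insert a list of result dicts into caigou in one batch, then commit once."""
    marks = ", ".join([placeholder] * len(COLUMNS))
    sql = "insert into caigou ({}) values ({})".format(", ".join(COLUMNS), marks)
    params = [tuple(row[col] for col in COLUMNS) for row in rows]
    cursor = conn.cursor()
    try:
        cursor.executemany(sql, params)
        conn.commit()
    except Exception:
        conn.rollback()  # undo the partial batch, then let the caller see the error
        raise
    return len(params)


# Try it without a MySQL server: an in-memory SQLite table with the same columns.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "create table caigou (city text, issuescate text, title text, publish_date text)")
demo_rows = [
    {"city": "杭州市", "issuescate": "采购公告", "title": "示例标题一", "publish_date": "2019-01-04"},
    {"city": "宁波市", "issuescate": "采购公告", "title": "示例标题二", "publish_date": "2019-01-05"},
]
inserted = insert_rows(demo_conn, demo_rows, placeholder="?")
stored = demo_conn.execute("select count(*) from caigou").fetchone()[0]
print(inserted, stored)
```

With the pymysql connection from the initializer, the call would be insert_rows(self.db, page_rows), where page_rows is the list of result dicts collected for the current page.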

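3. Saving to Redis

Add to the initializer (with "import redis" and "import json" at the top of the file):

self.pool = redis.ConnectionPool(host='localhost', port=6379)
self.myredis = redis.Redis(connection_pool=self.pool)
self.keyName = 'ZfCaigou'

Add inside the for loop, after the result dict has been built:

self.myredis.lpush(self.keyName, json.dumps(result))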
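
Since lpush stores each record as a JSON string at the head of a list, reading the data back means fetching the list and decoding every item, newest first. A small sketch of that round trip follows. push_result and read_results are hypothetical names, and ensure_ascii=False is an optional tweak that keeps Chinese text readable when inspecting Redis by hand:

```python
import json


def push_result(client, key, result):
    """LPUSH one result dict onto the list at `key` as a JSON string."""
    client.lpush(key, json.dumps(result, ensure_ascii=False))


def read_results(client, key):
    """Read the whole list back (LRANGE 0 -1) and decode each JSON item into a dict."""
    return [json.loads(item) for item in client.lrange(key, 0, -1)]
```

With the client from the initializer this would be push_result(self.myredis, self.keyName, result), and later read_results(self.myredis, self.keyName) to load everything back.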

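Note that before saving to any of these databases you must first start the corresponding server, or at least confirm that its GUI client can connect. Only then can you expect the data to be saved successfully.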
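
A quick way to check this before launching the browser is to test whether anything is listening on each server's port. The sketch below uses only the standard library. Ports 27017 and 6379 come from the snippets above, while 3306 is MySQL's default port and is assumed here because the connect call does not name one. port_open and check_services are hypothetical names:

```python
import socket

# 27017 and 6379 appear in the code above; 3306 is the MySQL default (assumed).
DEFAULT_PORTS = {"MongoDB": 27017, "MySQL": 3306, "Redis": 6379}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def check_services(host="localhost", ports=DEFAULT_PORTS):
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, is_up in check_services().items():
        print(name, "reachable" if is_up else "NOT reachable")
```

An open port only shows that something is listening there; it does not prove that the credentials or database names in the scraper are correct.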

For the full code, see: https://github.com/hfxjd9527/caigou
