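Scrapy + Selenium: a worked example of scraping dynamically loaded page data

Example: scraping five sections of NetEase News

url: https://news.163.com/

Analysis:

The home page has no dynamically loaded data, so the URLs of the five sections can be extracted from it directly. The news titles on each section page are loaded dynamically, so Selenium is needed there to obtain the titles and the detail-page URLs. The data on each news detail page is not loaded dynamically, so the article content can be scraped directly. The workflow for using Selenium inside Scrapy is as follows: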

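  • Instantiate a browser object in the spider class and keep it as an attribute of the spider
  • Implement the browser automation in the downloader middleware
  • Override closed(self, reason) in the spider class and close the browser object inside it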
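
The three steps above can be sketched without Scrapy or Selenium installed. The classes below (FakeBrowser, SketchSpider, SketchMiddleware) and the example.com URLs are hypothetical stand-ins that only mirror the control flow; they are not the real library classes.

```python
class FakeBrowser:
    """Stand-in for a Selenium WebDriver (hypothetical, for illustration only)."""

    def __init__(self):
        self.page_source = ""
        self.is_open = True

    def get(self, url):
        # A real driver would load the page and run its JavaScript here.
        self.page_source = f"<html><body>rendered {url}</body></html>"

    def quit(self):
        self.is_open = False


class SketchSpider:
    """Step 1: the spider owns the browser as an attribute."""

    def __init__(self):
        self.driver = FakeBrowser()
        # Pages whose content is loaded dynamically and needs rendering.
        self.model_urls = ["https://example.com/section"]

    def closed(self, reason):
        # Step 3: release the browser when the crawl ends.
        self.driver.quit()


class SketchMiddleware:
    """Step 2: the middleware swaps in browser-rendered HTML for selected URLs."""

    def process_response(self, url, response_body, spider):
        if url in spider.model_urls:
            spider.driver.get(url)
            return spider.driver.page_source
        return response_body


spider = SketchSpider()
middleware = SketchMiddleware()
rendered = middleware.process_response(
    "https://example.com/section", "<html></html>", spider)
untouched = middleware.process_response(
    "https://example.com/article", "<html>static</html>", spider)
spider.closed("finished")
```

Only the URLs listed in model_urls pay the cost of a browser round trip; every other response passes through unchanged.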

Code:

First, run the following commands in a terminal, in order, to create a new project and spider:

  • scrapy startproject wangyiPro
  • cd wangyiPro
  • scrapy genspider wangyi www.xxx.com

Next, write the spider file in the spiders folder:

import scrapy
from wangyiPro.items import WangyiproItem
from selenium import webdriver


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['xxx.com']
    start_urls = ['https://news.163.com/']
    model_urls = []
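    # Note: Selenium 4.10 and later no longer accept executable_path; they take service=Service(path) instead
    driver = webdriver.Chrome(executable_path=r'D:\pycharm\Scrapy\chromedriver.exe')  # instantiate the browser object (raw string keeps the backslashes literal)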

    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        index = [3, 4, 6, 7, 8]  # indices of the five sections in the navigation list
        for i in index:
            model_url = li_list[i].xpath('./a/@href').extract_first()
            self.model_urls.append(model_url)
        for url in self.model_urls:
            yield scrapy.Request(url=url, callback=self.parse_model)

    def parse_model(self, response):
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()  # the h3 step in this XPath must not be omitted, otherwise an error is raised
            detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
            if detail_url:
                item = WangyiproItem()
                item['title'] = title

                yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})  # pass the item to the callback through the request's meta

    def parse_detail(self, response):
        content = response.xpath('//*[@id="endText"]/p/text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content

        yield item

    def closed(self, reason):  # close the browser object once the spider has finished crawling
        self.driver.quit()
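
One detail of the spider above is worth illustrating: a statement in a class body runs when the class is defined, so a driver created there is launched as soon as the module is imported. The sketch below shows the difference with a hypothetical FakeDriver stand-in that only records launches; moving the driver into __init__ is an alternative arrangement, not what the original spider does.

```python
launches = []


class FakeDriver:
    """Stand-in for webdriver.Chrome (hypothetical); records each launch."""

    def __init__(self):
        launches.append("launched")

    def quit(self):
        launches.append("quit")


class EagerSpider:
    # Runs while the class body is executed, i.e. at import time.
    driver = FakeDriver()


launches_after_class_definition = len(launches)  # no spider instance exists yet


class LazySpider:
    def __init__(self):
        # Runs only when a spider instance is actually created.
        self.driver = FakeDriver()

    def closed(self, reason):
        self.driver.quit()


launches_before_instance = len(launches)
spider = LazySpider()
launches_after_instance = len(launches)
spider.closed("finished")
```

In a real Scrapy spider, an overridden __init__ should accept *args and **kwargs and pass them to super().__init__() so that spider arguments keep working.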

Implement the browser automation in middlewares.py:

from time import sleep
from scrapy.http import HtmlResponse


class WangyiproDownloaderMiddleware(object):

    def process_request(self, request, spider):

        return None

    def process_response(self, request, response, spider):
        if request.url in spider.model_urls:
            driver = spider.driver
            driver.get(request.url)
            sleep(2)
            driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')  # inject JS: scroll to the bottom of the page so that more news items are loaded
            sleep(1)
            page_text = driver.page_source
            return HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
        else:
            return response

    def process_exception(self, request, exception, spider):

        pass
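
The fixed sleep(2) and sleep(1) calls in the middleware either waste time or are too short on a slow connection. A common alternative is to poll for a condition until a timeout expires, which is the idea behind Selenium's WebDriverWait. The helper below is a library-free sketch of that polling loop; wait_until and news_list_loaded are hypothetical names, and the "page" is simulated by a counter.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or the timeout passes.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Simulated page whose news list "appears" on the third check.
checks = {"count": 0}


def news_list_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


loaded = wait_until(news_list_loaded, timeout=2.0, interval=0.01)
```

With a real driver, the predicate would check for the news list elements in the page, and the wait would end as soon as they are present.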

Implement persistent storage in pipelines.py (one copy of the data is written to a txt file, another to a MySQL database):

import pymysql


class WangyiproPipeline(object):

    def open_spider(self, spider):
        self.fp = open('wangyi.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(item['title'] + ':' + item['content'] + '\n')
        return item


class MysqlPipeline(object):  # stores items in MySQL; the database and table must be created in MySQL beforehand

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='your MySQL password', db='spider',
                                    charset='utf8')
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        sql = 'insert into wangyi values (%s, %s)'  # placeholders: the driver escapes the values
        try:
            self.cursor.execute(sql, (item['title'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item
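
News text often contains quotation marks, which is where building the INSERT statement with string formatting breaks down. The sketch below reproduces the problem and the parameterized fix using the standard-library sqlite3 module so that it runs without a MySQL server. The two-column table layout is an assumption (the original does not show the schema), and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one column for the title, one for the content.
conn.execute("CREATE TABLE wangyi (title TEXT, content TEXT)")

title = 'He said "hello"'
content = "It's a test"

# String formatting produces invalid SQL as soon as the text contains a quote.
broken_sql = 'insert into wangyi values ("%s","%s")' % (title, content)
try:
    conn.execute(broken_sql)
    formatting_failed = False
except sqlite3.Error:
    formatting_failed = True

# Placeholders let the driver handle quoting, so any text is stored intact.
conn.execute("INSERT INTO wangyi VALUES (?, ?)", (title, content))
conn.commit()
rows = conn.execute("SELECT title, content FROM wangyi").fetchall()
conn.close()
```

Passing the values separately from the SQL text also protects against SQL injection from scraped content.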

Finally, remember to enable the item pipelines and the downloader middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
   'wangyiPro.pipelines.WangyiproPipeline': 300,
   'wangyiPro.pipelines.MysqlPipeline': 301
}
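
The integers in these settings are priorities. For item pipelines, each item passes through the pipelines from the lowest number to the highest, so with the values above the txt pipeline runs before the MySQL one. The snippet below is a small sketch that derives that order from the dictionary; pipeline_order is a hypothetical name used only for illustration.

```python
ITEM_PIPELINES = {
    'wangyiPro.pipelines.WangyiproPipeline': 300,
    'wangyiPro.pipelines.MysqlPipeline': 301,
}

# Items visit pipelines in ascending order of their priority value.
pipeline_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

This ordering is also why each process_item must return the item: the next pipeline in line receives whatever the previous one returned.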
