Scrapy (Scrapy + Scrapy-splash): Getting-Started Notes on Scraping JS-Rendered Pages

This article is intended only as personal notes.

If you are not yet familiar with Scrapy, see:

Scrapy official site

Scrapy official documentation

Scrapy documentation (Chinese translation)

scrapy-splash project repository on GitHub

My personal ScrapyDemo project

Preparation
  • First complete a basic Scrapy project

  • Install Docker

    • On Windows, download and run the installer

    • On macOS, download and run the installer (I tried installing via brew, but setup and startup were complicated, so I ended up using the installer instead)

    • On CentOS 7, run:

      yum install docker
      
    • On RHEL, run:

      yum install --setopt=obsoletes=0 docker-ce-17.03.2.ce-1.el7.centos.x86_64 docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch
    
  • Install scrapy-splash

      pip install scrapy-splash
    
  • Start the Docker service

    • On CentOS 7:

      service docker start
      
    • On Windows, just launch the Docker app

    • On macOS, just launch the Docker app

  • Pull the Splash image

      docker pull scrapinghub/splash
    
  • Run the image

      docker run -p 8050:8050 scrapinghub/splash
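
Before wiring Splash into Scrapy, you can confirm the container is up by requesting Splash's `/render.html` HTTP endpoint, which returns the target page's HTML after JavaScript has run. A minimal sketch of building such a request URL (the helper name `splash_render_url` is just for illustration):

```python
from urllib.parse import urlencode

def splash_render_url(url, wait=0.5, splash="http://localhost:8050"):
    """Build a GET URL for Splash's /render.html endpoint.

    /render.html returns the page HTML after JavaScript has executed;
    'wait' tells Splash how many seconds to wait before rendering.
    """
    return f"{splash}/render.html?{urlencode({'url': url, 'wait': wait})}"

# Opening this URL in a browser (or with curl) should return rendered HTML:
print(splash_render_url("http://example.com"))
```

If the returned page contains the content your browser shows but a plain `scrapy fetch` does not, Splash is rendering correctly.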
    
  • Configure the Splash service (all of the following goes in settings.py):

    • Add the Splash server address:

      SPLASH_URL = 'http://localhost:8050'

    • Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:

        DOWNLOADER_MIDDLEWARES = {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        }
      
    • Enable SplashDeduplicateArgsMiddleware:

        SPIDER_MIDDLEWARES = {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        }
      
    • Set a custom DUPEFILTER_CLASS:

        DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
      
    • Set a custom cache storage backend:

        HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
      
  • Example

      import json, scrapy
      from scrapy_splash import SplashRequest

      class MySpider(scrapy.Spider):
          name = 'example'
          allowed_domains = ['example.com']
          start_urls = ["http://example.com", "http://example.com/foo"]

          def start_requests(self):
              for url in self.start_urls:
                  # wait 0.5s so the page's JS has time to run before rendering
                  yield SplashRequest(url, self.parse, args={'wait': 0.5})

          def parse(self, response):
              # response.body is the JS-rendered HTML returned by Splash
              # ...
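
Beyond a simple `wait`, Splash can also run a Lua script through its `execute` endpoint; the script is passed in the request args as `lua_source`. A minimal sketch (the `execute_args` helper is purely illustrative; in a spider you would pass its result via `SplashRequest(url, self.parse, endpoint='execute', args=execute_args())`):

```python
# A typical Splash Lua script: load the page, wait, return the rendered HTML.
LUA_SOURCE = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(args.wait)
  return splash:html()
end
"""

def execute_args(wait=0.5):
    # args dict for SplashRequest(..., endpoint='execute', args=execute_args())
    return {"lua_source": LUA_SOURCE, "wait": wait}
```

This is useful when a page needs interaction (clicks, scrolling) before its content appears, since the Lua script can drive the browser step by step.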
    
