scrapy-splash

scrapy-splash is a third-party library (package) used together with Scrapy to crawl pages whose content is rendered dynamically by JavaScript.
Installation
pip install scrapy-splash
Usage
This works best alongside the previous post on installing Docker, so I'll assume you have already read it.
With the Docker daemon running, pull the Splash image (note: this command runs on the host, not inside a container):
docker pull scrapinghub/splash
Then start the Splash service:
docker run -p 8050:8050 scrapinghub/splash
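Once the container is up, you can sanity-check Splash by opening its render.html HTTP endpoint in a browser. As a rough sketch (assuming the default port mapping from the docker run command above; splash_render_url is a hypothetical helper, not part of scrapy-splash), the request URL can be built like this:

```python
from urllib.parse import urlencode

# Assumed local Splash address from the docker run command above.
SPLASH_URL = "http://localhost:8050"

def splash_render_url(target_url, wait=0.5):
    """Build a URL for Splash's render.html endpoint, which returns
    the JavaScript-rendered HTML of the target page."""
    query = urlencode({"url": target_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"

print(splash_render_url("http://example.com"))
```

Opening the printed URL should return the rendered HTML of the target page, which confirms the service is reachable before wiring it into Scrapy.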
Configure the Splash integration (all of the following goes in settings.py):

1) Add the Splash server address:

SPLASH_URL = 'http://localhost:8050'
2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
3) Enable SplashDeduplicateArgsMiddleware in SPIDER_MIDDLEWARES:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
4) Set a custom DUPEFILTER_CLASS:

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
5) Set a custom cache storage backend:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Using it in a scrapy.Spider:

# -*- coding: utf-8 -*-

import scrapy
from scrapy import Selector
from scrapy_splash import SplashRequest


class DmozSpider(scrapy.Spider):
    name = "bcch"
    allowed_domains = ["bcch.ahnw.gov.cn"]  # domain only, no scheme
    start_urls = [
        "http://bcch.ahnw.gov.cn/default.aspx",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # The response body here is the JS-rendered HTML returned by Splash.
        resp_sel = Selector(response)
        self.log(resp_sel.xpath('//title/text()').getall())
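Conceptually, SplashRequest rewrites the request so that it goes to Splash's HTTP API instead of directly to the site, carrying the real target URL and the args (like 'wait') in a JSON payload. A simplified, stdlib-only sketch of that idea (to_splash_request is a hypothetical illustration, not the actual scrapy-splash internals):

```python
import json

def to_splash_request(url, splash_url="http://localhost:8050", wait=0.5):
    """Simplified sketch of what SplashRequest effectively produces:
    a POST to Splash's render.html endpoint; Splash then fetches and
    renders the target page on Scrapy's behalf."""
    return {
        "url": splash_url.rstrip("/") + "/render.html",
        "method": "POST",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"url": url, "wait": wait}),
    }

req = to_splash_request("http://bcch.ahnw.gov.cn/default.aspx")
print(req["url"])  # the request targets Splash, not the site itself
```

This is why SPLASH_URL and the middlewares above are required: they perform this rewriting transparently, and the spider's parse callback receives the rendered HTML as if it came from the original URL.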

It is easy to use, but for readers who have never touched Docker, the setup is still a little cumbersome.
