A Concise scrapy-splash Tutorial for CentOS

I. Environment Setup

1. Install scrapy-splash

pip install scrapy-splash

2. Install Docker (on CentOS the package comes from yum; apt install docker.io is the Debian/Ubuntu equivalent)

yum install docker

Then start the daemon:

systemctl start docker

3. Run the Splash container

Download the scrapy-splash code (optional; running the container below does not require it): https://github.com/scrapy-plugins/scrapy-splash.git

cd scrapy-splash

Then run:

docker run -p 8050:8050 scrapinghub/splash

Or, to specify a maximum timeout:

docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300

4. In settings.py, set SPLASH_URL = 'http://172.17.0.1:8050/'

5. Start the spider: scrapy crawl getdata
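
To confirm that Splash is reachable before crawling, you can query its render.html endpoint directly. A minimal sketch, assuming the port mapping from the docker run command above and using a placeholder target URL:

import urllib.parse
import urllib.request

# Splash publishes its HTTP API on the port mapped by `docker run -p 8050:8050`.
splash = 'http://172.17.0.1:8050'
params = urllib.parse.urlencode({'url': 'http://example.com', 'wait': 2})

# render.html returns the page HTML after JavaScript has executed
with urllib.request.urlopen(splash + '/render.html?' + params) as resp:
    html = resp.read().decode('utf-8')

print(html[:200])  # first 200 characters of the rendered page

If this prints HTML, both the container and the SPLASH_URL address are working.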


References: API docs and tutorials

https://splash-cn-doc.readthedocs.io/zh_CN/latest/scrapy-splash-toturial.html

https://splash-cn-doc.readthedocs.io/zh_CN/latest/api.html#render-html

https://github.com/scrapy-plugins/scrapy-splash

https://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.common.touch_actions


II. Creating a scrapy-splash Project

1. Configure settings.py

SPLASH_URL should point at the Docker bridge IP; on the host, ifconfig docker0 reports inet addr: 172.17.0.1.

DOWNLOADER_MIDDLEWARES = {
    # Engine side
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # Downloader side
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

SPLASH_URL = 'http://172.17.0.1:8050/'
# SPLASH_URL = 'http://192.168.59.103:8050/'

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

2. In your spider, use yield SplashRequest() instead of yield scrapy.Request
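
A minimal spider sketch showing the substitution. The spider name matches the scrapy crawl getdata command from part I; the start URL and the 2-second wait are illustrative, not from the original:

import scrapy
from scrapy_splash import SplashRequest

class GetDataSpider(scrapy.Spider):
    name = 'getdata'

    def start_requests(self):
        # SplashRequest routes the request through Splash (render.html by
        # default); 'wait' gives the page's JavaScript time to finish.
        yield SplashRequest('http://example.com', callback=self.parse,
                            args={'wait': 2})

    def parse(self, response):
        # response.text holds the JavaScript-rendered HTML from Splash
        yield {'title': response.css('title::text').get()}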
