27. A First Look at scrapy-splash

Before using scrapy-splash, let's first create a plain Scrapy project and print out the page it fetches, just to highlight how good scrapy-splash is, hehe.
The target site we'll use for scrapy-splash practice looks like this:



If we use Scrapy alone, the printed page looks like this:



So there's basically nothing to work with there.

Ta-da, time for today's star to take the stage.

Restart Docker:

sudo service docker start

Run the Docker container in detached (daemon) mode, so the Splash service keeps running even after the connection to the remote server is closed:

docker run -d -p 8050:8050 scrapinghub/splash

If the command runs without errors you should generally be fine; you can also open localhost:8050 in a browser to check that it looks like this:



Then configure the settings.py file by adding the following:

# scrapy-splash configuration:

# URL of the rendering service
SPLASH_URL = 'http://localhost:8050'

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Spider middlewares
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Splash-aware dedupe filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
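With this configuration in place, every SplashRequest is forwarded to the rendering service at SPLASH_URL. As a rough sketch (the values here are illustrative, this snippet is not part of the project), the render.html endpoint effectively gets called with a URL built like this:

```python
from urllib.parse import urlencode

# Sketch: the render.html request that scrapy-splash effectively sends
# to the Splash server configured in SPLASH_URL, passing the target
# page and a wait time as query arguments.
SPLASH_URL = 'http://localhost:8050'
params = {
    'url': 'http://www.zjzfcg.gov.cn/purchaseNotice/index.html?categoryId=3001',
    'wait': 10,  # seconds Splash waits for JavaScript rendering to finish
}
render_url = SPLASH_URL + '/render.html?' + urlencode(params)
print(render_url)
```

Opening such a URL in a browser (with the Splash container running) is a quick way to see the rendered HTML that Scrapy will receive.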

The spider file zfcaigou.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from caigou.items import CaigouItem
# from caigou.items import ZfcaigouItemLoad, CaigouItem

class ZfcaigouSpider(scrapy.Spider):
    name = 'zfcaigou'
    allowed_domains = ['www.zjzfcg.gov.cn']
    start_urls = ['http://www.zjzfcg.gov.cn/purchaseNotice/index.html?categoryId=3001']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse,
                                args={'wait': 10}, endpoint='render.html')

    def parse(self, response):
        # print(response.body.decode("utf-8"))
        infodata = response.css(".items p")
        for infoline in infodata:
            caigouitem = CaigouItem()
            # Strip the "[", "·" decoration and whitespace around the city name
            caigouitem['city'] = infoline.css(".warning::text").extract()[0].replace("[", "").replace("·", "").strip()
            caigouitem['issuescate'] = infoline.css(".warning .limit::text").extract()[0]
            # Drop the trailing "]" from the notice title
            caigouitem['title'] = infoline.css("a .underline::text").extract()[0].replace("]", "")
            # Drop the square brackets around the publish date
            caigouitem['publish_date'] = infoline.css(".time::text").extract()[0].replace("[", "").replace("]", "")
            yield caigouitem
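The chained replace()/strip() calls in parse() can be pulled out into small helpers, which makes the cleanup logic easy to test on its own. A minimal sketch (the helper names and sample strings are made up for illustration):

```python
def clean_city(raw):
    """Mirror the cleanup chain used for the city field:
    drop the '[' and '·' decoration and surrounding whitespace."""
    return raw.replace("[", "").replace("·", "").strip()

def clean_date(raw):
    """Mirror the publish_date cleanup: drop the square brackets."""
    return raw.replace("[", "").replace("]", "")

print(clean_city("[·浙江省 "))     # -> 浙江省
print(clean_date("[2018-01-01]"))  # -> 2018-01-01
```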

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CaigouItem(scrapy.Item):
    city = scrapy.Field()
    issuescate = scrapy.Field()
    title = scrapy.Field()
    publish_date = scrapy.Field()
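If you want to persist the scraped items instead of just yielding them, a small item pipeline can write each one out as line-delimited JSON. This is only a sketch and is not part of the repository; the class name and output filename are made up, and you would have to register it in ITEM_PIPELINES in settings.py yourself:

```python
import json

class JsonLinesPipeline:
    """Hypothetical pipeline: write each scraped item as one JSON line.
    Enable it via ITEM_PIPELINES in settings.py if you want to use it."""

    def open_spider(self, spider):
        # One output file per crawl; ensure_ascii=False keeps Chinese readable
        self.fh = open('caigou.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fh.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fh.close()
```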

Apart from the network being a bit slow, the content was scraped successfully.


Project repository: https://github.com/hfxjd9527/caigou
