process_item(item, spider)
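This method is called for every item pipeline component. It must either return an item object or raise a DropItem exception. Dropped items are not processed by any further pipeline components.

In addition, a pipeline class may also implement the following methods:

open_spider(spider)
This method is called when the spider is opened.

close_spider(spider)
This method is called when the spider is closed.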
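As a rough illustration of when these hooks fire, the sketch below defines a pipeline that counts items between open_spider and close_spider. The class name ItemCountPipeline, the stub spider object, and the manual calls at the bottom are assumptions made for this example only; in a real project Scrapy invokes the hooks itself.

```python
from types import SimpleNamespace


class ItemCountPipeline:
    """Hypothetical pipeline: counts the items seen during one spider run."""

    def open_spider(self, spider):
        # Called once when the spider is opened: set up per-run state.
        self.count = 0

    def process_item(self, item, spider):
        # Called for every item: update the state and pass the item on.
        self.count += 1
        return item

    def close_spider(self, spider):
        # Called once when the spider is closed: build a short summary.
        self.summary = "%s: %d items" % (spider.name, self.count)


# Manual walk-through of the call order (Scrapy normally does this itself).
spider = SimpleNamespace(name="demo")  # stand-in for a real spider object
pipeline = ItemCountPipeline()
pipeline.open_spider(spider)
for item in ({"id": 1}, {"id": 2}):
    pipeline.process_item(item, spider)
pipeline.close_spider(spider)
print(pipeline.summary)
```

Setting up state in open_spider rather than in __init__ keeps the setup and teardown tied to the spider run.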
from scrapy.exceptions import DropItem


class PricePipeline(object):
    """Apply VAT to prices that exclude it; drop items that have no price."""

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                # The price excludes VAT, so multiply by the VAT factor.
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

Note: VAT stands for Value Added Tax.
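import json


class JsonWriterPipeline(object):
    """Write each item to items.jl as one JSON object per line."""

    def __init__(self):
        # Text mode: json.dumps() returns str, which a file opened
        # with 'wb' rejects on Python 3.
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Close the file once the spider has finished.
        self.file.close()

Note: JsonWriterPipeline is only meant to show how to write an item pipeline. If you want to save the scraped items to a JSON file, use Feed exports instead.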
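For comparison, a Feed exports configuration that produces a similar JSON Lines file might look like the fragment below. It assumes a Scrapy version that supports the FEEDS setting (2.1 or later); the file name items.jl is simply carried over from the example above.

```python
# settings.py (fragment): let Feed exports write the items,
# instead of a hand-written pipeline.
FEEDS = {
    "items.jl": {
        "format": "jsonlines",  # one JSON object per line
        "encoding": "utf8",
    },
}
```

With a setting like this in place, no custom writer pipeline is needed for the export itself.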
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    """Drop any item whose id has already been seen."""

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
ITEM_PIPELINES = {
    'myproject.pipeline.PricePipeline': 300,
    'myproject.pipeline.JsonWriterPipeline': 800,
}

The integer values assigned to the classes in this setting determine the order in which they run: items go through the pipelines from the lowest number to the highest.
These values are customarily set in the 0-1000 range.
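The ordering rule, together with the effect of DropItem, can be modelled in a few lines of plain Python. This is a simplified sketch and not Scrapy's actual implementation: DropItem here is a local stand-in for scrapy.exceptions.DropItem, and Tag, Reject and run_pipelines are hypothetical names invented for the illustration.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption)."""


class Tag:
    """Hypothetical component: records its label on the item."""

    def __init__(self, label):
        self.label = label

    def process_item(self, item, spider):
        item["trace"].append(self.label)
        return item


class Reject:
    """Hypothetical component: drops every item it receives."""

    def process_item(self, item, spider):
        raise DropItem("rejected")


def run_pipelines(item, pipelines, spider=None):
    """Pass item through the components in ascending order of their number.

    Returns the final item, or None if some component dropped it.
    """
    for component, _order in sorted(pipelines.items(), key=lambda kv: kv[1]):
        try:
            item = component.process_item(item, spider)
        except DropItem:
            return None  # later components never see a dropped item
    return item


# 300 runs before 800, regardless of the order in the dict.
kept = run_pipelines({"trace": []}, {Tag("json"): 800, Tag("price"): 300})
print(kept["trace"])

# A drop at 500 stops the item before it reaches 800.
seen = []
dropped = run_pipelines(
    {"trace": seen}, {Tag("price"): 300, Reject(): 500, Tag("json"): 800}
)
print(dropped, seen)
```

Leaving gaps between the numbers (300, 500, 800) makes it easy to slot in another component later without renumbering the rest.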
Author: 曾是土木人 (http://blog.csdn.net/php_fly)
Original post: http://blog.csdn.net/php_fly/article/details/19571121
Reference: Item Pipeline