The Item Pipeline is mainly used to collect the Items scraped by a spider and write them to a database or file.
After the spider produces an item, the item is handed to the item pipeline for this follow-up processing.
The pipeline class is registered by its import path in the settings, and the Scrapy framework then instantiates and calls it. For the framework to call it correctly,
the pipeline class must implement a small set of methods with the signatures the framework expects; as a user you only need to implement those methods.
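Before looking at the full example, here is a minimal sketch of that interface. The hook names (process_item, open_spider, close_spider, from_crawler) are Scrapy's own; the class name MinimalPipeline is made up for illustration only:

class MinimalPipeline(object):
    def process_item(self, item, spider):
        # called once per item; must return the item or raise DropItem
        return item

    def open_spider(self, spider):
        # optional: called when the spider starts
        pass

    def close_spider(self, spider):
        # optional: called when the spider finishes
        pass

    @classmethod
    def from_crawler(cls, crawler):
        # optional: build the instance from the crawler's settings
        return cls()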
The file below implements a concrete item pipeline class that post-processes the scraped news items and writes them to a file. What each method does is described in its docstring.
1. File: pipelines.py
Notes:
1. The constructor (__init__) is completely free-form; it is not limited to particular parameters, as long as the from_crawler class method can call it to build an instance.
2. The methods invoked by the framework have fixed signatures (this is what guarantees the framework can call them correctly).
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem


class News2FileFor163Pipeline(object):
    """
    pipeline: process items given by the spider
    """
    def __init__(self, filepath, filename):
        """
        init for the pipeline class
        """
        self.fullname = filepath + '/' + filename
        self.id = 0

    def process_item(self, item, spider):
        """
        process each item produced by the spider.
        example: check whether the item is valid or raise DropItem.
        example: do some processing before writing into a database.
        example: check whether the item already exists and drop it.
        """
        # drop items with missing fields
        for element in ("url", "source", "title", "editor", "time", "content"):
            if item[element] is None:
                raise DropItem("invalid item, url: %s" % str(item["url"]))
        self.fs.write("news id: %s" % self.id)
        self.fs.write("\n")
        self.id += 1
        self.fs.write("url: %s" % item["url"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        self.fs.write("source: %s" % item["source"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        self.fs.write("title: %s" % item["title"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        # keep only the part after the first ':' in the editor field
        self.fs.write("editor: %s" % item["editor"][0].strip()
                      .encode('UTF-8').split(':')[1])
        self.fs.write("\n")
        # keep the date and time parts of the scraped timestamp
        time_string = item["time"][0].strip().split()
        datetime = time_string[0] + ' ' + time_string[1]
        self.fs.write("time: %s" % datetime.encode('UTF-8'))
        self.fs.write("\n")
        # join the content paragraphs into a single line
        content = ""
        for para in item["content"]:
            content += para.strip().replace('\n', '').replace('\t', '')
        self.fs.write("content: %s" % content.encode('UTF-8'))
        self.fs.write("\n")
        return item

    def open_spider(self, spider):
        """
        called when the spider is opened.
        do something before the pipeline starts processing items.
        example: apply settings or open a connection to the database.
        """
        self.fs = open(self.fullname, 'w+')

    def close_spider(self, spider):
        """
        called when the spider is closed.
        do something after the pipeline has processed all items.
        example: close the database connection.
        """
        self.fs.flush()
        self.fs.close()

    @classmethod
    def from_crawler(cls, crawler):
        """
        return a pipeline instance.
        example: initialize the pipeline from the crawler's settings and components.
        """
        return cls(crawler.settings.get('ITEM_FILE_PATH'),
                   crawler.settings.get('ITEM_FILE_NAME'))
2. File: settings.py (register the pipeline so that Scrapy calls it)
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'NewsSpiderMan.pipelines.News2FileFor163Pipeline': 300,
}
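Since from_crawler reads ITEM_FILE_PATH and ITEM_FILE_NAME from the settings, these two keys also need to be defined in settings.py. The values below are placeholders only; point them at your own output location:

# custom settings read by News2FileFor163Pipeline.from_crawler()
ITEM_FILE_PATH = '/tmp/news'
ITEM_FILE_NAME = 'news_163.txt'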
If you are scraping a large amount of data, using an item pipeline to process it and write it into a database is the way to go.
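Purely as an illustration (not part of the project above), a database-backed pipeline could look like the sketch below, using Python's built-in sqlite3; the database file, table and column names are made up for the example:

import sqlite3

from scrapy.exceptions import DropItem


class News2SQLitePipeline(object):
    """Hypothetical pipeline that stores news items in a SQLite database."""

    def open_spider(self, spider):
        # create the connection and table before any item arrives
        self.conn = sqlite3.connect('news.db')
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS news "
            "(url TEXT PRIMARY KEY, title TEXT, content TEXT)")

    def process_item(self, item, spider):
        if not item.get('url'):
            raise DropItem("item without url")
        # INSERT OR IGNORE silently drops duplicates on the url primary key
        self.conn.execute(
            "INSERT OR IGNORE INTO news (url, title, content) VALUES (?, ?, ?)",
            (item['url'][0], item['title'][0], ''.join(item['content'])))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()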