[Feed exports] - Feed Export Configuration Explained

  • Data can be exported to a file by adding an option to the crawl command:
    scrapy runspider toscrape-css -o quotes.json
  • What the saved data looks like:
[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
...
]
  • Notice the unreadable sequences such as \u201c and \u201d in the data. They appear because no output encoding was set, so non-ASCII characters are saved as \uXXXX escape sequences. This is a good moment to sort out the parameters that control how exported data is saved.
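The \uXXXX escapes are simply JSON's default ASCII-safe serialization; Python's own json module shows the same behavior, and disabling it corresponds to setting FEED_EXPORT_ENCODING = 'utf-8' in Scrapy:

```python
import json

# One item with typographic (curly) quotes, as scraped from the site.
item = {"text": "\u201cIt is our choices...\u201d", "author": "J.K. Rowling"}

# Default serialization escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(item)

# ensure_ascii=False writes the characters as-is; this is the effect
# FEED_EXPORT_ENCODING = 'utf-8' has on Scrapy's feed exports.
readable = json.dumps(item, ensure_ascii=False)
```

Both strings decode back to the same item; only the on-disk representation differs.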

Feed exports parameters explained

  • FEED_URI (location of the output file)
  • FEED_FORMAT (data format)
  • FEED_STORAGES (additional storage backends, i.e. where the data goes)
  • FEED_STORAGES_BASE (built-in storage backends)
  • FEED_EXPORTERS (additional export formats)
  • FEED_EXPORTERS_BASE (built-in export formats)
  • FEED_STORE_EMPTY (whether to export empty feeds; off by default)
  • FEED_EXPORT_ENCODING (encoding of the output file)
  • FEED_EXPORT_FIELDS (which fields to export, and in what order)
  • FEED_EXPORT_INDENT (indentation for pretty-printed output)

Each parameter is explained below (the most practical ones are worth mastering):
1、FEED_URI

Specifies where the output file is stored, and its name. Supported targets include:

Local file

D://tmp/filename.csv

FTP

ftp://user:[email protected]/path/to/filename.csv
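FEED_URI also supports %(name)s (the spider name) and %(time)s (a timestamp) placeholders, so each run can write to its own file. A minimal settings sketch (the directory is only an illustration):

```python
# settings.py -- illustrative values; adjust the path to your project
FEED_URI = 'file:///tmp/exports/%(name)s/%(time)s.json'
FEED_FORMAT = 'json'
```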

2、FEED_FORMAT

Specifies the output format. The supported formats are (with examples for each):

  • JSON
  • FEED_FORMAT: json
  • Exporter used: JsonItemExporter

This uses JsonItemExporter under the hood. Example:

[{"name": "Color TV", "price": "1200"},
{"name": "DVD player", "price": "200"}]

Note: JSON is not recommended for large exports, because the exporter keeps the whole object in memory. For large datasets, prefer jsonlines or write the data out in chunks.

  • JSON lines
  • FEED_FORMAT: jsonlines
  • Exporter used: JsonLinesItemExporter

This uses JsonLinesItemExporter. Example:

{"name": "Color TV", "price": "1200"}
{"name": "DVD player", "price": "200"}
  • CSV
  • FEED_FORMAT: csv
  • Exporter used: CsvItemExporter
  • To specify columns to export and their order use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it is important for CSV because unlike many other export formats CSV uses a fixed header.

This uses CsvItemExporter. Example:

product,price
Color TV,1200
DVD player,200

The first line holds the field names; each following line is one record.
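CSV writes its header once, so every item must map onto the same columns in the same order — which is exactly what FEED_EXPORT_FIELDS pins down. The same constraint, sketched with the standard csv module:

```python
import csv
import io

# fieldnames plays the role of FEED_EXPORT_FIELDS: it fixes both
# which columns appear and their order in the header row.
fieldnames = ["product", "price"]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow({"product": "Color TV", "price": "1200"})
writer.writerow({"product": "DVD player", "price": "200"})
output = buf.getvalue()
```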

  • XML
  • FEED_FORMAT: xml
  • Exporter used: XmlItemExporter

This uses XmlItemExporter. Example:

<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <name>Color TV</name>
    <price>1200</price>
  </item>
  <item>
    <name>DVD player</name>
    <price>200</price>
  </item>
</items>

Pickle and Marshal are also available, but are not covered here.

3、Storage backends

FEED_STORAGES

Defaults to {}. To add a backend, use the URL scheme as the key and the path to the storage class as the value.

FEED_STORAGES_BASE

Built-in storage backends; the defaults are:

{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
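To add a backend of your own, map a URL scheme onto a storage class in FEED_STORAGES; the module and class below are hypothetical placeholders:

```python
# settings.py -- 'myproject.storages.WebDavFeedStorage' is a made-up example
FEED_STORAGES = {
    'webdav': 'myproject.storages.WebDavFeedStorage',
}
```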

4、Export formats

FEED_EXPORTERS

Defaults to {}. Defines additional exporters, keyed by format name, with the exporter class path as the value.

FEED_EXPORTERS_BASE

The built-in exporters are:

{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
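An extra format is registered the same way in FEED_EXPORTERS, keyed by the name you would then use in FEED_FORMAT; the exporter class path below is a made-up example:

```python
# settings.py -- 'myproject.exporters.TsvItemExporter' is hypothetical
FEED_EXPORTERS = {
    'tsv': 'myproject.exporters.TsvItemExporter',
}
```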

5、Encoding and output control

FEED_EXPORT_ENCODING

Encoding of the exported file. Defaults to None; utf-8 is the usual choice.

FEED_EXPORT_FIELDS

Sets which fields to export, and their order. Example:

FEED_EXPORT_FIELDS = ["foo", "bar", "baz"]

FEED_EXPORT_INDENT

Defaults to 0. With a value of 0 or a negative number, each item is written on a new line; with a value greater than 0, each nesting level is indented by that many spaces.
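The effect mirrors the indent argument of Python's json.dumps, which is a quick way to preview what a given setting will produce:

```python
import json

item = {"name": "Color TV", "price": "1200"}

# indent=None: the most compact, single-line representation.
compact = json.dumps(item)

# indent=4: every nesting level indented by four spaces,
# matching FEED_EXPORT_INDENT = 4.
pretty = json.dumps(item, indent=4)
```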

Usage example

# -*- coding: utf-8 -*-
import scrapy


class QuotesItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['toscrape.com']
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'quotes.jsonlines',
    }

    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category, ]

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            # item = dict(text=text, author=author)
            item = QuotesItem()
            item['text'] = text
            item['author'] = author
            yield item

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Run from the same directory on the command line:

(this saves data to the file defined in the spider)

scrapy runspider Quotes_Spider.py -a category=love

(this saves data to the file specified on the command line)

scrapy runspider Quotes_Spider.py -a category=love -o new_quotes.json
