- Export data to a file by adding an optional argument when running the spider command:
scrapy runspider toscrape-css -o quotes.json
- What the saved data looks like:
[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
...
]
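- Notice that the data contains sequences such as \u201c and \u201d that are hard to read. This happens because no export encoding was set when the data was saved, so non-ASCII characters are written as \uXXXX escape sequences. This is a good moment to sort out the settings that control how data is saved.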
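The escaping is easy to reproduce with Python's standard json module. The sketch below is only an illustration and does not use Scrapy; the `quote` dictionary is a made-up sample, and `ensure_ascii` stands in for what happens when no export encoding is set.

```python
import json

# Hypothetical sample item containing curly quotes (U+201C and U+201D).
quote = {"text": "\u201cHello\u201d", "author": "Nobody"}

# Default behaviour: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(quote)

# With ensure_ascii=False the characters are kept as-is; the file must
# then be written with an encoding such as UTF-8.
readable = json.dumps(quote, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same data; only the representation differs.
assert json.loads(escaped) == json.loads(readable)
```

Either form is valid JSON, so the escaped file is not corrupted; it is just inconvenient to read by eye.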
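Feed exports settings in detail
- FEED_URI (where the output file goes)
- FEED_FORMAT (data format)
- FEED_STORAGES (additional storage backends, i.e. where the data is stored)
- FEED_STORAGES_BASE (built-in storage backends, i.e. where the data is stored)
- FEED_EXPORTERS (additional exporters)
- FEED_EXPORTERS_BASE (built-in exporters)
- FEED_STORE_EMPTY (whether to export empty feeds; off by default)
- FEED_EXPORT_ENCODING (encoding of the output file)
- FEED_EXPORT_FIELDS (which fields to export, and in what order)
- FEED_EXPORT_INDENT (indentation for pretty-printed output)
The settings are explained below (the ones marked in bold above are the most practical and worth mastering):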
1. FEED_URI
Sets where the output file is stored and what it is called. Supported targets include:
Local file
D://tmp/filename.csv
FTP
ftp://user:pass@ftp.example.com/path/to/filename.csv
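2. FEED_FORMAT
Sets the output data format. The supported formats are listed below, each with an example:
- JSON
  - FEED_FORMAT: json
  - Exporter used: JsonItemExporter
The exporter that does the work is JsonItemExporter. Example output: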
[{"name": "Color TV", "price": "1200"},
{"name": "DVD player", "price": "200"}]
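Note: the json format is not recommended for large amounts of data, because the whole array has to be held in memory as a single object. For large feeds, use jsonlines instead, or split the output into several files.
- JSON lines
  - FEED_FORMAT: jsonlines
  - Exporter used: JsonLinesItemExporter
The exporter that does the work is JsonLinesItemExporter. Example output: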
{"name": "Color TV", "price": "1200"}
{"name": "DVD player", "price": "200"}
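To make the memory difference concrete, here is a minimal sketch of reading both formats with the standard library. The two strings stand in for exported files, and `iter_jsonlines` is a hypothetical helper name, not part of Scrapy.

```python
import io
import json

# Stand-ins for the contents of an exported .json and .jsonlines file.
json_feed = '[{"name": "Color TV", "price": "1200"},\n{"name": "DVD player", "price": "200"}]'
jsonlines_feed = '{"name": "Color TV", "price": "1200"}\n{"name": "DVD player", "price": "200"}\n'

# JSON: the whole array is parsed into memory in one go.
all_items = json.load(io.StringIO(json_feed))


def iter_jsonlines(fileobj):
    """Yield one item per non-empty line, so only one item is parsed at a time."""
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)


# list() is used here only to compare the results; a real consumer
# would normally process the items one by one inside a for loop.
streamed = list(iter_jsonlines(io.StringIO(jsonlines_feed)))
print(streamed[0]["name"])
```

Because every line of a JSON lines file is a complete JSON document, the file can also be appended to or split at line boundaries without breaking it.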
- CSV
  - FEED_FORMAT: csv
  - Exporter used: CsvItemExporter
  - To specify the columns to export and their order, use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it is important for CSV because, unlike many other export formats, CSV uses a fixed header.
The exporter that does the work is CsvItemExporter. Example output:
product,price
Color TV,1200
DVD player,200
The first line holds the names of the exported fields, and each following line is one record.
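Why a fixed header matters can be shown with the standard csv module. This is a sketch, not a description of what CsvItemExporter does internally; the `items` list is sample data and `fields` is an illustrative name playing the role of FEED_EXPORT_FIELDS.

```python
import csv
import io

# Sample items whose keys arrive in different orders.
items = [
    {"price": "1200", "product": "Color TV"},
    {"product": "DVD player", "price": "200"},
]

# The column list is fixed up front, like FEED_EXPORT_FIELDS.
fields = ["product", "price"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
writer.writeheader()      # the header is written once, before any row
writer.writerows(items)   # every row follows the column order in `fields`

csv_text = buffer.getvalue()
print(csv_text)
```

Since the header is written before the first row, the set of columns has to be known in advance, which is why an explicit field list is more important for CSV than for JSON.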
- XML
  - FEED_FORMAT: xml
  - Exporter used: XmlItemExporter
The exporter that does the work is XmlItemExporter. Example output:
<items>
  <item>
    <name>Color TV</name>
    <price>1200</price>
  </item>
  <item>
    <name>DVD player</name>
    <price>200</price>
  </item>
</items>
The remaining formats, Pickle and Marshal, are not covered here.
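3. Storage backends
FEED_STORAGES
Defaults to {}. To add a storage backend, use the URL scheme as the key and the path to the storage class as the value.
FEED_STORAGES_BASE
The built-in storage backends. The default is: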
{
'': 'scrapy.extensions.feedexport.FileFeedStorage',
'file': 'scrapy.extensions.feedexport.FileFeedStorage',
'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
's3': 'scrapy.extensions.feedexport.S3FeedStorage',
'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
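The idea of "URL scheme as key" can be sketched with the standard library. This is a simplified illustration rather than Scrapy's actual lookup code; `storage_class_path` is a hypothetical helper, and the FTP host in the usage lines is a placeholder.

```python
from urllib.parse import urlparse

# Copied from the defaults listed above.
FEED_STORAGES_BASE = {
    "": "scrapy.extensions.feedexport.FileFeedStorage",
    "file": "scrapy.extensions.feedexport.FileFeedStorage",
    "stdout": "scrapy.extensions.feedexport.StdoutFeedStorage",
    "s3": "scrapy.extensions.feedexport.S3FeedStorage",
    "ftp": "scrapy.extensions.feedexport.FTPFeedStorage",
}
FEED_STORAGES = {}  # project-level additions, empty by default


def storage_class_path(uri):
    """Return the dotted path of the storage class for a feed URI (illustration only)."""
    # Project-level entries take precedence over the built-in ones.
    storages = {**FEED_STORAGES_BASE, **FEED_STORAGES}
    scheme = urlparse(uri).scheme  # "" for a plain relative file name
    return storages[scheme]


print(storage_class_path("quotes.json"))
print(storage_class_path("ftp://ftp.example.com/path/to/filename.csv"))
```

A plain file name has an empty scheme, which is why the table contains an entry for the key ''.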
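4. Export formats
FEED_EXPORTERS
Defaults to {}. Defines additional export formats, with the format name as the key and the path to the exporter class as the value.
FEED_EXPORTERS_BASE
The built-in export formats are: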
{
'json': 'scrapy.exporters.JsonItemExporter',
'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
'jl': 'scrapy.exporters.JsonLinesItemExporter',
'csv': 'scrapy.exporters.CsvItemExporter',
'xml': 'scrapy.exporters.XmlItemExporter',
'marshal': 'scrapy.exporters.MarshalItemExporter',
'pickle': 'scrapy.exporters.PickleItemExporter',
}
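How the two dictionaries combine can be sketched in plain Python. The merge function below is an illustration of the idea, not Scrapy's own code, and `myproject.exporters.TsvItemExporter` is a hypothetical class path. Assigning None to a format in FEED_EXPORTERS disables the built-in exporter for that format.

```python
# Copied from the defaults listed above.
FEED_EXPORTERS_BASE = {
    "json": "scrapy.exporters.JsonItemExporter",
    "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
    "jl": "scrapy.exporters.JsonLinesItemExporter",
    "csv": "scrapy.exporters.CsvItemExporter",
    "xml": "scrapy.exporters.XmlItemExporter",
    "marshal": "scrapy.exporters.MarshalItemExporter",
    "pickle": "scrapy.exporters.PickleItemExporter",
}

# Project-level setting: add one format and switch another off.
FEED_EXPORTERS = {
    "tsv": "myproject.exporters.TsvItemExporter",  # hypothetical custom exporter
    "pickle": None,                                # disable the built-in pickle exporter
}


def effective_exporters(base, extra):
    """Merge base and project exporters, dropping formats set to None."""
    merged = {**base, **extra}
    return {fmt: path for fmt, path in merged.items() if path is not None}


exporters = effective_exporters(FEED_EXPORTERS_BASE, FEED_EXPORTERS)
print(sorted(exporters))
```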
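5. Encoding and output options
FEED_EXPORT_ENCODING
The encoding of the output file. Defaults to None, in which case JSON output uses \uXXXX escape sequences and the other formats use UTF-8. It is usually set to utf-8.
FEED_EXPORT_FIELDS
Sets which fields are exported and in what order. Example: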
FEED_EXPORT_FIELDS = ["foo", "bar", "baz"]
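FEED_EXPORT_INDENT
Defaults to 0. When the value is 0 or negative, each item is written on a new line. A value greater than 0 indents every nesting level by that many spaces.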
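Put together, the three options might appear in a project's settings.py as follows. The values are examples only; the field names match the quotes data shown at the top of this post.

```python
# settings.py (fragment)

# Write real characters instead of \uXXXX escapes in JSON output.
FEED_EXPORT_ENCODING = "utf-8"

# Export only these fields, in this order.
FEED_EXPORT_FIELDS = ["author", "text"]

# Pretty-print with 4 spaces per nesting level.
FEED_EXPORT_INDENT = 4
```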
3 Usage example
# -*- coding: utf-8 -*-
import scrapy


class QuotesItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['toscrape.com']
    # Per-spider feed export settings: write UTF-8 output to quotes.jsonlines
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'quotes.jsonlines',
    }

    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        # category is passed on the command line with -a category=...
        self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category, ]

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            # item = dict(text=text, author=author)
            item = QuotesItem()
            item['text'] = text
            item['author'] = author
            yield item

        # Follow the pagination link, if there is one
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
In the same directory, run the following from the command line:
(saves the data to the file defined in the spider)
scrapy runspider Quotes_Spider.py -a category=love
(saves the data to the file given on the command line)
scrapy runspider Quotes_Spider.py -a category=love -o new_quotes.json