[TOC]
Scrapy Study Notes
IPython
JupyterNotebook
Anaconda
1. Basic Concepts
1.1 Command line tool
1.2 Spiders
1.3 Selectors
1.4 Items (data model objects)
1.5 Item Loaders (convenience API for populating Items)
1.6 Scrapy Shell (interactive shell)
IPython
bpython
standard python shell
jupyter notebook
1.7 Item Pipeline (pipeline processing of scraped items)
Data conversion
Data filtering
Data storage
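A minimal pipeline sketch covering these three uses (the PricePipeline name and the price field are assumptions, not from any real project); it still has to be activated via ITEM_PIPELINES in the project settings:
from scrapy.exceptions import DropItem

class PricePipeline:                          # hypothetical pipeline
    def process_item(self, item, spider):
        if not item.get('price'):             # filtering: discard items without a price
            raise DropItem('Missing price in %r' % item)
        item['price'] = float(item['price'])  # conversion: normalize the value
        return item                           # hand the item to the next pipeline (e.g. storage)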
1.8 Feed exports
Serialization formats:
JSON: JsonItemExporter
JSON Lines: JsonLinesItemExporter
CSV: CsvItemExporter
XML: XmlItemExporter
Storages:
Local filesystem: FileFeedStorage
FTP: FTPFeedStorage
S3 (requires botocore or boto): S3FeedStorage
Standard output: StdoutFeedStorage
Storage backends:
file:///tmp/export.csv
ftp://user:[email protected]/path/to/export.csv
s3://mybucket/path/to/export.csv
s3://aws_key:aws_secret@mybucket/path/to/export.csv
stdout:
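For example, exporting all scraped items as CSV to the local filesystem can be configured like this (pre-2.1 style FEED_URI/FEED_FORMAT settings; newer releases use the FEEDS dict, and the field names below are assumptions). The same can be done ad hoc with scrapy crawl myspider -o export.csv:
# settings.py -- feed export to the local filesystem
FEED_URI = 'file:///tmp/export.csv'
FEED_FORMAT = 'csv'
FEED_EXPORT_FIELDS = ['name', 'price']   # optional: fix the column order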
1.9 Requests and Responses
Spiders generate Requests; the resulting Responses are returned to the spider that issued them.
Request -> Downloader (receives the request, executes it and returns a response) -> Response
Request objects:
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])
Request line (url + method), request headers, request body
Request subclasses:
FormRequest objects:
class scrapy.http.FormRequest(url[, formdata, ...])
formdata (dict or iterable of tuples)
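A sketch of how these are typically built inside a spider (URLs, form field names and callback names are made up):
import scrapy

class ExampleSpider(scrapy.Spider):           # hypothetical spider
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # plain GET request with a callback and per-request metadata
        yield scrapy.Request('http://example.com/page2',
                             callback=self.parse_page2, meta={'page': 2})
        # POST request built from form data, e.g. a login form
        yield scrapy.FormRequest('http://example.com/login',
                                 formdata={'user': 'john', 'pass': 'secret'},
                                 callback=self.after_login)

    def parse_page2(self, response):
        self.logger.info('Got %s (page %s)', response.url, response.meta['page'])

    def after_login(self, response):
        self.logger.info('Login returned status %s', response.status)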
Response objects:
class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])
Status line (url + status), response headers, response body
Response subclasses:
TextResponse objects:
class scrapy.http.TextResponse(url[, encoding[, ...]])
text: response.text = response.body.decode(response.encoding)
HtmlResponse objects:
class scrapy.http.HtmlResponse(url[, ...])
XmlResponse objects:
class scrapy.http.XmlResponse(url[, ...])
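For HTML pages the downloader returns an HtmlResponse, so in a callback (or in scrapy shell) attributes like these are available:
response.status                          # e.g. 200
response.headers.get('Content-Type')     # headers are a caseless mapping, values are bytes
response.text                            # body decoded with response.encoding (TextResponse and subclasses only)
response.xpath('//title/text()')         # selector shortcut on TextResponse and subclasses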
1.10 Link Extractors
class scrapy.linkextractors.LinkExtractor
LxmlLinkExtractor:
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=False, unique=True, process_value=None, strip=True)
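A sketch of extracting and following links in a callback (the allow pattern is an assumption):
import scrapy
from scrapy.linkextractors import LinkExtractor

class FollowLinksSpider(scrapy.Spider):       # hypothetical spider
    name = 'followlinks'
    start_urls = ['http://example.com/']

    def parse(self, response):
        extractor = LinkExtractor(allow=r'/category/')    # only links whose URL matches the regex
        for link in extractor.extract_links(response):    # returns Link objects with .url and .text
            yield scrapy.Request(link.url, callback=self.parse)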
1.11 Settings
SCRAPY_SETTINGS_MODULE = myproject.settings
Settings precedence:
Command line options (most precedence)
scrapy crawl myspider -s LOG_FILE=scrapy.log
Settings per-spider
class MySpider(scrapy.Spider):
name = 'myspider'
custom_settings = {
'SOME_SETTING': 'some value',
}
Project settings module
settings.py
Default settings per-command
default_settings
Default global settings (less precedence)
scrapy.settings.default_settings
DEFAULT_REQUEST_HEADERS
{
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES_BASE
{
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
DOWNLOAD_HANDLERS_BASE
{
'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}
ITEM_PIPELINES = {
'mybot.pipelines.validate.ValidateMyItem': 300,
'mybot.pipelines.validate.StoreMyItem': 800,
}
MEMDEBUG_NOTIFY = ['[email protected]']
MEMUSAGE_NOTIFY_MAIL = ['[email protected]']
1.12 Exceptions
DropItem:
exception scrapy.exceptions.DropItem
CloseSpider:
exception scrapy.exceptions.CloseSpider(reason='cancelled')
DontCloseSpider:
exception scrapy.exceptions.DontCloseSpider
IgnoreRequest:
exception scrapy.exceptions.IgnoreRequest
NotConfigured:
exception scrapy.exceptions.NotConfigured
NotSupported:
exception scrapy.exceptions.NotSupported
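Where the two most common ones typically get raised, as a sketch (pipeline and spider names are made up):
import scrapy
from scrapy.exceptions import DropItem, CloseSpider

class RequireIdPipeline:                      # hypothetical pipeline
    def process_item(self, item, spider):
        if not item.get('id'):
            raise DropItem('Missing id in %r' % item)   # item is discarded, item_dropped fires
        return item

class BanAwareSpider(scrapy.Spider):          # hypothetical spider
    name = 'banaware'
    start_urls = ['http://example.com/']
    handle_httpstatus_list = [403]            # let 403 responses reach the callback

    def parse(self, response):
        if response.status == 403:
            raise CloseSpider('banned')       # ask the engine to close this spider
        yield {'id': response.url}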
2. Built-in Services
2.1 Logging
Log levels:
logging.CRITICAL - for critical errors (highest severity)
logging.ERROR - for regular errors
logging.WARNING - for warning messages
logging.INFO - for informational messages
logging.DEBUG - for debugging messages (lowest severity)
How to log messages:
import logging
logging.warning("This is a warning")
logging.log(logging.WARNING, "This is a warning")
logger = logging.getLogger()
logger.warning("This is a warning")
Logging from Spiders:
import scrapy
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['https://scrapinghub.com']
def parse(self, response):
self.logger.info('Parse function called on %s', response.url)
import logging
import scrapy
logger = logging.getLogger('mycustomlogger')
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['https://scrapinghub.com']
def parse(self, response):
logger.info('Parse function called on %s', response.url)
Logging configuration:
Command-line options:
--logfile FILE Overrides LOG_FILE
--loglevel/-L LEVEL Overrides LOG_LEVEL
--nolog Sets LOG_ENABLED to False
scrapy.utils.log module:
import logging
from scrapy.utils.log import configure_logging
configure_logging(install_root_handler=False)
logging.basicConfig(
filename='log.txt',
format='%(levelname)s: %(message)s',
level=logging.INFO
)
2.2 Stats Collection
Stats Collector
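The stats collector is reachable through the crawler; a sketch of recording custom stats from a spider (the stat keys are made up):
import scrapy

class StatsSpider(scrapy.Spider):             # hypothetical spider
    name = 'statsspider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        self.crawler.stats.inc_value('custom/pages_seen')        # increment a counter
        self.crawler.stats.set_value('custom/last_url', response.url)
        yield {'url': response.url}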
2.3 Sending e-mail
smtplib
Twisted nonblocking IO
from scrapy.mail import MailSender
mailer = MailSender()
mailer = MailSender.from_settings(settings)
mailer.send(to=["[email protected]"], subject="Some subject", body="Some body", cc=["[email protected]"])
2.4 Telnet Console
No
2.5 Web Service
https://github.com/scrapy-plugins/scrapy-jsonrpc
3. Solving Specific Problems
3.1 Frequently Asked Questions
https://docs.scrapy.org/en/latest/faq.html
Does Scrapy crawl in breadth-first or depth-first order?
3.2 Debugging Spiders
No
3.3 Spiders Contracts
Custom Contracts:
SPIDER_CONTRACTS = {
'myproject.contracts.ResponseCheck': 10,
'myproject.contracts.ItemValidate': 10,
}
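Built-in contracts are written in the callback's docstring and checked with scrapy check <spider>; for example (the URL and field names are illustrative):
import scrapy

class ProductSpider(scrapy.Spider):           # hypothetical spider
    name = 'products'

    def parse(self, response):
        """Parse a product page.

        @url http://example.com/product/1
        @returns items 1 1
        @returns requests 0 0
        @scrapes name price
        """
        yield {'name': response.css('h1::text').extract_first(),
               'price': response.css('.price::text').extract_first()}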
3.4 Common Practices
No
3.5 Broad Crawls
Increase concurrency:
CONCURRENT_REQUESTS = 100
Increase Twisted IO thread pool maximum size:
REACTOR_THREADPOOL_MAXSIZE = 20
Reduce log level:
LOG_LEVEL = 'INFO'
Disable cookies:
COOKIES_ENABLED = False
Disable retries:
RETRY_ENABLED = False
Reduce download timeout:
DOWNLOAD_TIMEOUT = 15
Disable redirects:
REDIRECT_ENABLED = False
3.6 Using Firefox for scraping
No
3.7 Using Firebug for scraping
No
3.8 Debugging memory leaks
Too Many Requests?
Too many spiders?
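Live object references can be inspected with scrapy.utils.trackref; the telnet console exposes the same report as prefs(). A sketch:
from scrapy.utils.trackref import print_live_refs, get_oldest

print_live_refs()                      # counts of live Requests, Responses, Items, Spiders, Selectors
oldest = get_oldest('HtmlResponse')    # oldest live object of a tracked class, or None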
3.9 Downloading and processing files and images
Using the Files Pipeline:
FilesPipeline
Using the Images Pipeline:
ImagesPipeline
Enabling your Media Pipeline:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
FILES_STORE = '/path/to/valid/dir'
IMAGES_STORE = '/path/to/valid/dir'
IMAGES_STORE = 's3://bucket/images'
IMAGES_STORE = 'gs://bucket/images/'
FILES_EXPIRES = 120 # 120 days of delay for files expiration
IMAGES_EXPIRES = 30 # 30 days of delay for images expiration
IMAGES_THUMBS = {
'small': (50, 50),
'big': (270, 270),
}
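With the pipelines enabled, items only need the conventional URL/result field pairs; a sketch (the extra name field is an assumption):
import scrapy

class ProductItem(scrapy.Item):               # hypothetical item
    name = scrapy.Field()
    file_urls = scrapy.Field()                # input: URLs for FilesPipeline to download
    files = scrapy.Field()                    # output: populated with download results
    image_urls = scrapy.Field()               # input: URLs for ImagesPipeline to download
    images = scrapy.Field()                   # output: populated with download results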
3.10 Deploying Spiders
Scrapyd (open source)
Scrapy Cloud (cloud-based)
3.11 AutoThrottle extension
AUTOTHROTTLE_ENABLED Default: False
AUTOTHROTTLE_START_DELAY Default: 5.0
AUTOTHROTTLE_MAX_DELAY Default: 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY Default: 1.0
AUTOTHROTTLE_DEBUG Default: False
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
DOWNLOAD_DELAY
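A sketch of how these settings fit together in settings.py (the target concurrency value is an arbitrary choice; DOWNLOAD_DELAY acts as a lower bound on the computed delay):
# settings.py -- enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0     # aim for ~2 concurrent requests per remote site
AUTOTHROTTLE_DEBUG = False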
3.12 Benchmarking
scrapy bench
3.13 Jobs: pausing and resuming crawls
No
4. Extending Scrapy
4.1 Architecture overview
Scrapy Engine
Scheduler
Downloader
Spiders
Item Pipeline
- Event-driven networking:
Scrapy is written with Twisted, a popular event-driven networking framework for Python.
Thus, it is implemented using non-blocking (aka asynchronous) code for concurrency.
4.2 Downloader Middleware
Activating a downloader middleware:
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.CustomDownloaderMiddleware': 543,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
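A sketch of what the CustomDownloaderMiddleware referenced above could look like (the header name and behavior are made up):
class CustomDownloaderMiddleware:
    def process_request(self, request, spider):
        # called for every request before it reaches the downloader;
        # returning None lets processing continue down the middleware chain
        request.headers.setdefault('X-Custom', 'value')
        return None

    def process_response(self, request, response, spider):
        # called for every response on its way back to the engine/spider
        spider.logger.debug('%s -> %s', request.url, response.status)
        return response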
4.3 Spider Middleware
Activating a spider middleware:
SPIDER_MIDDLEWARES = {
'myproject.middlewares.CustomSpiderMiddleware': 543,
}
SPIDER_MIDDLEWARES = {
'myproject.middlewares.CustomSpiderMiddleware': 543,
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
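And a sketch of the CustomSpiderMiddleware referenced above:
class CustomSpiderMiddleware:
    def process_spider_input(self, response, spider):
        # called for each response before it is handed to the spider callback
        return None

    def process_spider_output(self, response, result, spider):
        # called with whatever the callback yielded (items and requests)
        for element in result:
            yield element                 # pass everything through unchanged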
4.4 Extensions
Loading & activating extensions:
EXTENSIONS = {
'scrapy.extensions.corestats.CoreStats': 500,
'scrapy.extensions.telnet.TelnetConsole': 500,
}
Disabling an extension:
EXTENSIONS = {
'scrapy.extensions.corestats.CoreStats': None,
}
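A minimal extension is just a class with a from_crawler classmethod, usually hooking into signals; a simplified version of the spider open/close logging example from the docs:
from scrapy import signals

class SpiderOpenCloseLogging:                 # register it in EXTENSIONS to activate
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        spider.logger.info('opened spider %s', spider.name)

    def spider_closed(self, spider):
        spider.logger.info('closed spider %s', spider.name)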
4.5 Core API
Crawler API:
class scrapy.crawler.Crawler(spidercls, settings)
class scrapy.crawler.CrawlerRunner(settings=None)
class scrapy.crawler.CrawlerProcess(settings=None, install_root_handler=True)
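CrawlerProcess is the usual entry point for running Scrapy from a plain script rather than via the scrapy command; a sketch (the spider is hypothetical):
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):            # hypothetical spider
    name = 'quotes'
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield {'url': response.url}

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(QuotesSpider)
process.start()                               # blocks here until the crawl is finished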
Settings API:
scrapy.settings.SETTINGS_PRIORITIES:
SETTINGS_PRIORITIES = {
'default': 0,
'command': 10,
'project': 20,
'spider': 30,
'cmdline': 40,
}
SpiderLoader API:
class scrapy.spiderloader.SpiderLoader
Signals API:
class scrapy.signalmanager.SignalManager(sender=_Anonymous)
Stats Collector API:
class scrapy.statscollectors.StatsCollector
4.6 Signals
scrapy.signals.engine_started()
scrapy.signals.engine_stopped()
scrapy.signals.item_scraped(item, response, spider)
scrapy.signals.item_dropped(item, response, exception, spider)
scrapy.signals.spider_opened(spider)
scrapy.signals.spider_closed(spider, reason)
scrapy.signals.spider_idle(spider)
scrapy.signals.spider_error(failure, response, spider)
scrapy.signals.request_scheduled(request, spider)
scrapy.signals.request_dropped(request, spider)
scrapy.signals.response_received(response, request, spider)
scrapy.signals.response_downloaded(response, request, spider)
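A sketch of connecting a handler to one of these signals from inside a spider:
import scrapy
from scrapy import signals

class SignalSpider(scrapy.Spider):            # hypothetical spider
    name = 'signalspider'
    start_urls = ['http://example.com/']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider, reason):
        spider.logger.info('spider closed: %s', reason)

    def parse(self, response):
        yield {'url': response.url}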
4.7 Item Exporters
BaseItemExporter:
class scrapy.exporters.BaseItemExporter(fields_to_export=None, export_empty_fields=False, encoding='utf-8', indent=0)
XmlItemExporter:
class scrapy.exporters.XmlItemExporter(file, item_element='item', root_element='items', **kwargs)
CsvItemExporter:
class scrapy.exporters.CsvItemExporter(file, include_headers_line=True, join_multivalued=', ', **kwargs)
PickleItemExporter:
class scrapy.exporters.PickleItemExporter(file, protocol=0, **kwargs)
PprintItemExporter:
class scrapy.exporters.PprintItemExporter(file, **kwargs)
JsonItemExporter:
class scrapy.exporters.JsonItemExporter(file, **kwargs)
JsonLinesItemExporter:
class scrapy.exporters.JsonLinesItemExporter(file, **kwargs)
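Exporters are typically driven from an item pipeline; a sketch using JsonLinesItemExporter (the pipeline name and output file are arbitrary):
from scrapy.exporters import JsonLinesItemExporter

class JsonLinesExportPipeline:                # hypothetical pipeline, enable via ITEM_PIPELINES
    def open_spider(self, spider):
        self.file = open('items.jl', 'wb')    # exporters expect a binary file object
        self.exporter = JsonLinesItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()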