[TOC]
Scrapy Study Notes
IPython
JupyterNotebook
Anaconda
1. Basic Concepts
1.1 Command line tool
1.2 Spiders
1.3 Selectors
1.4 Items (data model objects)
1.5 Item Loaders (convenience API for populating Items)
1.6 Scrapy Shell (interactive shell)
IPython
bpython
standard python shell
jupyter notebook
1.7 Item Pipeline (pipeline processing of scraped items)
Data conversion
Data filtering
Data storage
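A minimal pipeline sketch covering these three uses (the PricePipeline name and the price field are assumptions, not from any real project); it still has to be activated via ITEM_PIPELINES in the project settings:
from scrapy.exceptions import DropItem

class PricePipeline:                          # hypothetical pipeline
    def process_item(self, item, spider):
        if not item.get('price'):             # filtering: discard items without a price
            raise DropItem('Missing price in %r' % item)
        item['price'] = float(item['price'])  # conversion: normalize the value
        return item                           # hand the item to the next pipeline (e.g. storage)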
1.8 Feed exports
Serialization formats:
JSON: JsonItemExporter
JSON Lines: JsonLinesItemExporter
CSV: CsvItemExporter
XML: XmlItemExporter
Storages:
Local filesystem: FileFeedStorage
FTP: FTPFeedStorage
S3 (requires botocore or boto): S3FeedStorage
Standard output: StdoutFeedStorage
Storage backends:
file:///tmp/export.csv
ftp://user:[email protected]/path/to/export.csv
s3://mybucket/path/to/export.csv
s3://aws_key:aws_secret@mybucket/path/to/export.csv
stdout:
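For example, exporting all scraped items as CSV to the local filesystem can be configured like this (pre-2.1 style FEED_URI/FEED_FORMAT settings; newer releases use the FEEDS dict, and the field names below are assumptions). The same can be done ad hoc with scrapy crawl myspider -o export.csv:
# settings.py -- feed export to the local filesystem
FEED_URI = 'file:///tmp/export.csv'
FEED_FORMAT = 'csv'
FEED_EXPORT_FIELDS = ['name', 'price']   # optional: fix the column order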
1.9 Requests and Responses
Spiders generate Requests; the resulting Responses are returned to the spider that issued them.
Request -> Downloader (receives the request, executes it and returns a response) -> Response
Request objects:
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])
Request line (url + method), request headers, request body
Request subclasses:
FormRequest objects:
class scrapy.http.FormRequest(url[, formdata, ...])
formdata (dict or iterable of tuples)
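A sketch of how these are typically built inside a spider (URLs, form field names and callback names are made up):
import scrapy

class ExampleSpider(scrapy.Spider):           # hypothetical spider
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # plain GET request with a callback and per-request metadata
        yield scrapy.Request('http://example.com/page2',
                             callback=self.parse_page2, meta={'page': 2})
        # POST request built from form data, e.g. a login form
        yield scrapy.FormRequest('http://example.com/login',
                                 formdata={'user': 'john', 'pass': 'secret'},
                                 callback=self.after_login)

    def parse_page2(self, response):
        self.logger.info('Got %s (page %s)', response.url, response.meta['page'])

    def after_login(self, response):
        self.logger.info('Login returned status %s', response.status)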
Response objects:
class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])
Status line (url + status), response headers, response body
Response subclasses:
TextResponse objects:
class scrapy.http.TextResponse(url[, encoding[, ...]])
text: response.text = response.body.decode(response.encoding)
HtmlResponse objects:
class scrapy.http.HtmlResponse(url[, ...])
XmlResponse objects:
class scrapy.http.XmlResponse(url[, ...])
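For HTML pages the downloader returns an HtmlResponse, so in a callback (or in scrapy shell) attributes like these are available:
response.status                          # e.g. 200
response.headers.get('Content-Type')     # headers are a caseless mapping, values are bytes
response.text                            # body decoded with response.encoding (TextResponse and subclasses only)
response.xpath('//title/text()')         # selector shortcut on TextResponse and subclasses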
1.10 Link Extractors
class scrapy.linkextractors.LinkExtractor
LxmlLinkExtractor:
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=False, unique=True, process_value=None, strip=True)
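A sketch of extracting and following links in a callback (the allow pattern is an assumption):
import scrapy
from scrapy.linkextractors import LinkExtractor

class FollowLinksSpider(scrapy.Spider):       # hypothetical spider
    name = 'followlinks'
    start_urls = ['http://example.com/']

    def parse(self, response):
        extractor = LinkExtractor(allow=r'/category/')    # only links whose URL matches the regex
        for link in extractor.extract_links(response):    # returns Link objects with .url and .text
            yield scrapy.Request(link.url, callback=self.parse)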
1.11 Settings
SCRAPY_SETTINGS_MODULE = myproject.settings
Settings precedence:
Command line options (most precedence)
scrapy crawl myspider -s LOG_FILE=scrapy.log
Settings per-spider
class MySpider(scrapy.Spider):
name = 'myspider'
custom_settings = {
'SOME_SETTING': 'some value',
}
Project settings module
settings.py
Default settings per-command
default_settings
Default global settings (less precedence)
scrapy.settings.default_settings
DEFAULT_REQUEST_HEADERS
{
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES_BASE
{
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
DOWNLOAD_HANDLERS_BASE
{
'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}
ITEM_PIPELINES = {
'mybot.pipelines.validate.ValidateMyItem': 300,
'mybot.pipelines.validate.StoreMyItem': 800,
}
MEMDEBUG_NOTIFY = ['[email protected]']
MEMUSAGE_NOTIFY_MAIL = ['[email protected]']
1.12 Exceptions
DropItem:
exception scrapy.exceptions.DropItem
CloseSpider:
exception scrapy.exceptions.CloseSpider(reason='cancelled')
DontCloseSpider:
exception scrapy.exceptions.DontCloseSpider
IgnoreRequest:
exception scrapy.exceptions.IgnoreRequest
NotConfigured:
exception scrapy.exceptions.NotConfigured
NotSupported:
exception scrapy.exceptions.NotSupported
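Where the two most common ones typically get raised, as a sketch (pipeline and spider names are made up):
import scrapy
from scrapy.exceptions import DropItem, CloseSpider

class RequireIdPipeline:                      # hypothetical pipeline
    def process_item(self, item, spider):
        if not item.get('id'):
            raise DropItem('Missing id in %r' % item)   # item is discarded, item_dropped fires
        return item

class BanAwareSpider(scrapy.Spider):          # hypothetical spider
    name = 'banaware'
    start_urls = ['http://example.com/']
    handle_httpstatus_list = [403]            # let 403 responses reach the callback

    def parse(self, response):
        if response.status == 403:
            raise CloseSpider('banned')       # ask the engine to close this spider
        yield {'id': response.url}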
2. Built-in Services
2.1 Logging
Log levels:
logging.CRITICAL - for critical errors (highest severity)
logging.ERROR - for regular errors
logging.WARNING - for warning messages
logging.INFO - for informational messages
logging.DEBUG - for debugging messages (lowest severity)
How to log messages:
import logging
logging.warning("This is a warning")
logging.log(logging.WARNING, "This is a warning")
logger = logging.getLogger()
logger.warning("This is a warning")
Logging from Spiders:
import scrapy
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['https://scrapinghub.com']
def parse(self, response):
self.logger.info('Parse function called on %s', response.url)
import logging
import scrapy
logger = logging.getLogger('mycustomlogger')
class MySpider(scrapy.Spider):
name = 'myspider'
start_urls = ['https://scrapinghub.com']
def parse(self, response):
logger.info('Parse function called on %s', response.url)
Logging configuration:
Command-line options:
--logfile FILE Overrides LOG_FILE
--loglevel/-L LEVEL Overrides LOG_LEVEL
--nolog Sets LOG_ENABLED to False
scrapy.utils.log module:
import logging
from scrapy.utils.log import configure_logging
configure_logging(install_root_handler=False)
logging.basicConfig(
filename='log.txt',
format='%(levelname)s: %(message)s',
level=logging.INFO
)
2.2 Stats Collection
Stats Collector
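The stats collector is reachable through the crawler; a sketch of recording custom stats from a spider (the stat keys are made up):
import scrapy

class StatsSpider(scrapy.Spider):             # hypothetical spider
    name = 'statsspider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        self.crawler.stats.inc_value('custom/pages_seen')        # increment a counter
        self.crawler.stats.set_value('custom/last_url', response.url)
        yield {'url': response.url}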
2.3 Sending e-mail
smtplib
Twisted nonblocking IO
from scrapy.mail import MailSender
mailer = MailSender()
mailer = MailSender.from_settings(settings)
mailer.send(to=["[email protected]"], subject="Some subject", body="Some body", cc=["[email protected]"])
2.4 Telnet Console
No
2.5 Web Service
https://github.com/scrapy-plugins/scrapy-jsonrpc
3. Solving Specific Problems
3.1 Frequently Asked Questions
https://docs.scrapy.org/en/latest/faq.html
Does Scrapy crawl in breadth-first or depth-first order?
3.2 Debugging Spiders
No
3.3 Spiders Contracts
Custom Contracts:
SPIDER_CONTRACTS = {
'myproject.contracts.ResponseCheck': 10,
'myproject.contracts.ItemValidate': 10,
}
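Built-in contracts are written in the callback's docstring and checked with scrapy check <spider>; for example (the URL and field names are illustrative):
import scrapy

class ProductSpider(scrapy.Spider):           # hypothetical spider
    name = 'products'

    def parse(self, response):
        """Parse a product page.

        @url http://example.com/product/1
        @returns items 1 1
        @returns requests 0 0
        @scrapes name price
        """
        yield {'name': response.css('h1::text').extract_first(),
               'price': response.css('.price::text').extract_first()}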
3.4 Common Practices
No
3.5 Broad Crawls
Increase concurrency:
CONCURRENT_REQUESTS = 100
Increase Twisted IO thread pool maximum size:
REACTOR_THREADPOOL_MAXSIZE = 20
Reduce log level:
LOG_LEVEL = 'INFO'
Disable cookies:
COOKIES_ENABLED = False
Disable retries:
RETRY_ENABLED = False
Reduce download timeout:
DOWNLOAD_TIMEOUT = 15
Disable redirects:
REDIRECT_ENABLED = False
3.6 Using Firefox for scraping
No
3.7 Using Firebug for scraping
No
3.8 Debugging memory leaks
Too Many Requests?
Too many spiders?
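Live object references can be inspected with scrapy.utils.trackref; the telnet console exposes the same report as prefs(). A sketch:
from scrapy.utils.trackref import print_live_refs, get_oldest

print_live_refs()                      # counts of live Requests, Responses, Items, Spiders, Selectors
oldest = get_oldest('HtmlResponse')    # oldest live object of a tracked class, or None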
3.9 Downloading and processing files and images
Using the Files Pipeline:
FilesPipeline
Using the Images Pipeline:
ImagesPipeline
Enabling your Media Pipeline:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
FILES_STORE = '/path/to/valid/dir'
IMAGES_STORE = '/path/to/valid/dir'
IMAGES_STORE = 's3://bucket/images'
IMAGES_STORE = 'gs://bucket/images/'
FILES_EXPIRES = 120 # 120 days of delay for files expiration
IMAGES_EXPIRES = 30 # 30 days of delay for images expiration
IMAGES_THUMBS = {
'small': (50, 50),
'big': (270, 270),
}
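With the pipelines enabled, items only need the conventional URL/result field pairs; a sketch (the extra name field is an assumption):
import scrapy

class ProductItem(scrapy.Item):               # hypothetical item
    name = scrapy.Field()
    file_urls = scrapy.Field()                # input: URLs for FilesPipeline to download
    files = scrapy.Field()                    # output: populated with download results
    image_urls = scrapy.Field()               # input: URLs for ImagesPipeline to download
    images = scrapy.Field()                   # output: populated with download results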
3.10 Deploying Spiders
Scrapyd (open source)
Scrapy Cloud (cloud-based)
3.11 AutoThrottle extension
AUTOTHROTTLE_ENABLED Default: False
AUTOTHROTTLE_START_DELAY Default: 5.0
AUTOTHROTTLE_MAX_DELAY Default: 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY Default: 1.0
AUTOTHROTTLE_DEBUG Default: False
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
DOWNLOAD_DELAY
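A sketch of how these settings fit together in settings.py (the target concurrency value is an arbitrary choice; DOWNLOAD_DELAY acts as a lower bound on the computed delay):
# settings.py -- enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0     # aim for ~2 concurrent requests per remote site
AUTOTHROTTLE_DEBUG = False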
3.12 Benchmarking
scrapy bench
3.13 Jobs: pausing and resuming crawls
No
4. Extending Scrapy
4.1 Architecture overview
Scrapy Engine
Scheduler
Downloader
Spiders
Item Pipeline
- Event-driven networking:
Scrapy is written with Twisted, a popular event-driven networking framework for Python.
Thus, it is implemented using non-blocking (aka asynchronous) code for concurrency.
4.2 Downloader Middleware
Activating a downloader middleware:
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.CustomDownloaderMiddleware': 543,
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
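A sketch of what the CustomDownloaderMiddleware referenced above could look like (the header name and behavior are made up):
class CustomDownloaderMiddleware:
    def process_request(self, request, spider):
        # called for every request before it reaches the downloader;
        # returning None lets processing continue down the middleware chain
        request.headers.setdefault('X-Custom', 'value')
        return None

    def process_response(self, request, response, spider):
        # called for every response on its way back to the engine/spider
        spider.logger.debug('%s -> %s', request.url, response.status)
        return response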
4.3 Spider Middleware
Activating a spider middleware:
SPIDER_MIDDLEWARES = {
'myproject.middlewares.CustomSpiderMiddleware': 543,
}
SPIDER_MIDDLEWARES = {
'myproject.middlewares.CustomSpiderMiddleware': 543,
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
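And a sketch of the CustomSpiderMiddleware referenced above:
class CustomSpiderMiddleware:
    def process_spider_input(self, response, spider):
        # called for each response before it is handed to the spider callback
        return None

    def process_spider_output(self, response, result, spider):
        # called with whatever the callback yielded (items and requests)
        for element in result:
            yield element                 # pass everything through unchanged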
4.4 Extensions
Loading & activating extensions:
EXTENSIONS = {
'scrapy.extensions.corestats.CoreStats': 500,
'scrapy.extensions.telnet.TelnetConsole': 500,
}
Disabling an extension:
EXTENSIONS = {
'scrapy.extensions.corestats.CoreStats': None,
}
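A minimal extension is just a class with a from_crawler classmethod, usually hooking into signals; a simplified version of the spider open/close logging example from the docs:
from scrapy import signals

class SpiderOpenCloseLogging:                 # register it in EXTENSIONS to activate
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        spider.logger.info('opened spider %s', spider.name)

    def spider_closed(self, spider):
        spider.logger.info('closed spider %s', spider.name)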
4.5 Core API
Crawler API:
class scrapy.crawler.Crawler(spidercls, settings)
class scrapy.crawler.CrawlerRunner(settings=None)
class scrapy.crawler.CrawlerProcess(settings=None, install_root_handler=True)
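CrawlerProcess is the usual entry point for running Scrapy from a plain script rather than via the scrapy command; a sketch (the spider is hypothetical):
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):            # hypothetical spider
    name = 'quotes'
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield {'url': response.url}

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(QuotesSpider)
process.start()                               # blocks here until the crawl is finished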
Settings API:
scrapy.settings.SETTINGS_PRIORITIES:
SETTINGS_PRIORITIES = {
'default': 0,
'command': 10,
'project': 20,
'spider': 30,
'cmdline': 40,
}
SpiderLoader API:
class scrapy.spiderloader.SpiderLoader
Signals API:
class scrapy.signalmanager.SignalManager(sender=_Anonymous)
Stats Collector API:
class scrapy.statscollectors.StatsCollector
4.6 Signals
scrapy.signals.engine_started()
scrapy.signals.engine_stopped()
scrapy.signals.item_scraped(item, response, spider)
scrapy.signals.item_dropped(item, response, exception, spider)
scrapy.signals.spider_opened(spider)
scrapy.signals.spider_closed(spider, reason)
scrapy.signals.spider_idle(spider)
scrapy.signals.spider_error(failure, response, spider)
scrapy.signals.request_scheduled(request, spider)
scrapy.signals.request_dropped(request, spider)
scrapy.signals.response_received(response, request, spider)
scrapy.signals.response_downloaded(response, request, spider)
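A sketch of connecting a handler to one of these signals from inside a spider:
import scrapy
from scrapy import signals

class SignalSpider(scrapy.Spider):            # hypothetical spider
    name = 'signalspider'
    start_urls = ['http://example.com/']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider, reason):
        spider.logger.info('spider closed: %s', reason)

    def parse(self, response):
        yield {'url': response.url}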
4.7 Item Exporters
BaseItemExporter:
class scrapy.exporters.BaseItemExporter(fields_to_export=None, export_empty_fields=False, encoding='utf-8', indent=0)
XmlItemExporter:
class scrapy.exporters.XmlItemExporter(file, item_element='item', root_element='items', **kwargs)
CsvItemExporter:
class scrapy.exporters.CsvItemExporter(file, include_headers_line=True, join_multivalued=', ', **kwargs)
PickleItemExporter:
class scrapy.exporters.PickleItemExporter(file, protocol=0, **kwargs)
PprintItemExporter:
class scrapy.exporters.PprintItemExporter(file, **kwargs)
JsonItemExporter:
class scrapy.exporters.JsonItemExporter(file, **kwargs)
JsonLinesItemExporter:
class scrapy.exporters.JsonLinesItemExporter(file, **kwargs)
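Exporters are typically driven from an item pipeline; a sketch using JsonLinesItemExporter (the pipeline name and output file are arbitrary):
from scrapy.exporters import JsonLinesItemExporter

class JsonLinesExportPipeline:                # hypothetical pipeline, enable via ITEM_PIPELINES
    def open_spider(self, spider):
        self.file = open('items.jl', 'wb')    # exporters expect a binary file object
        self.exporter = JsonLinesItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()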