Scrapy Study Notes (Part 1)

[TOC]

Scrapy Study Notes

IPython
Jupyter Notebook
Anaconda

1. Basic Concepts

1.1. Command line tool
1.2. Spiders
1.3. Selectors (for extracting text/data from responses)
1.4. Item (the scraped-data model object)
1.5. Item Loaders (an enhanced API for populating Item objects; combined sketch below)
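A minimal sketch of how Item and Item Loaders fit together; ProductItem, ProductSpider, the URL and the selectors are hypothetical, and scrapy.loader.processors has moved to itemloaders.processors in newer Scrapy versions:
	import scrapy
	from scrapy.loader import ItemLoader
	from scrapy.loader.processors import MapCompose, TakeFirst  # itemloaders.processors in newer Scrapy

	class ProductItem(scrapy.Item):
	    # Hypothetical fields; one Field per attribute to scrape
	    name = scrapy.Field()
	    price = scrapy.Field()

	class ProductLoader(ItemLoader):
	    # Strip whitespace on the way in, keep only the first value on the way out
	    default_input_processor = MapCompose(str.strip)
	    default_output_processor = TakeFirst()

	class ProductSpider(scrapy.Spider):
	    name = 'products'
	    start_urls = ['http://example.com/products']  # placeholder URL

	    def parse(self, response):
	        loader = ProductLoader(item=ProductItem(), response=response)
	        loader.add_css('name', 'h1::text')                           # hypothetical selectors
	        loader.add_xpath('price', '//span[@class="price"]/text()')
	        yield loader.load_item()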
1.6. Scrapy Shell (interactive shell)
IPython
bpython
standard python shell
jupyter notebook
1.7. Item Pipeline (pipeline processing of scraped Items)
data conversion
data filtering
data storage (see the pipeline sketch below)
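A minimal pipeline sketch covering those three roles (convert, filter, store); the PricePipeline name, the price field and the output path are hypothetical:
	import json
	from scrapy.exceptions import DropItem

	class PricePipeline(object):
	    """Hypothetical pipeline: converts the price field, drops invalid items, stores the rest."""

	    def open_spider(self, spider):
	        # Called when the spider starts: open the storage target
	        self.file = open('items.jl', 'w')

	    def close_spider(self, spider):
	        self.file.close()

	    def process_item(self, item, spider):
	        if not item.get('price'):
	            raise DropItem('Missing price in %s' % item)   # filtering
	        item['price'] = float(item['price'])                # conversion
	        self.file.write(json.dumps(dict(item)) + '\n')      # storage (JSON lines)
	        return item
It would be enabled with an ITEM_PIPELINES entry such as {'myproject.pipelines.PricePipeline': 300} (module path hypothetical).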
1.8. Feed exports (configuration sketch at the end of this subsection)
Serialization formats:
	JSON: JsonItemExporter
	JSON Lines: JsonLinesItemExporter
	CSV: CsvItemExporter
	XML: XmlItemExporter
Storages: 
	Local filesystem: FileFeedStorage
	FTP: FTPFeedStorage
	S3 (requires botocore or boto): S3FeedStorage
	Standard output: StdoutFeedStorage
Storage backends:
	file:///tmp/export.csv
	ftp://user:[email protected]/path/to/export.csv
	s3://mybucket/path/to/export.csv
	s3://aws_key:aws_secret@mybucket/path/to/export.csv
	stdout:
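Putting the pieces above together, a feed export might be configured in settings.py like this; the path and format are placeholders, and FEED_URI/FEED_FORMAT are the pre-2.1 setting names (newer Scrapy uses the FEEDS dict instead):
	# settings.py
	FEED_FORMAT = 'jsonlines'             # serialized with JsonLinesItemExporter
	FEED_URI = 'file:///tmp/export.jl'    # any of the storage backends listed above
	FEED_EXPORT_ENCODING = 'utf-8'
The same can be done per run on the command line, e.g. scrapy crawl myspider -o export.json (spider name is a placeholder).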
1.9. Requests and Responses (an example spider follows at the end of this subsection)
Spiders generate Requests; the resulting Responses are returned to the Spider that issued them.
Request -> Downloader (receives and executes the request, returning a response) -> Response

Request objects:
	class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])
	Request line (url + method), request headers, request body
Request subclasses:
	FormRequest objects:
  		class scrapy.http.FormRequest(url[, formdata, ...])
    		formdata (dict or iterable of tuples)
Response objects:
	class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])
	Status line (url + status), response headers, response body
Response subclasses:
	TextResponse objects:
  		class scrapy.http.TextResponse(url[, encoding[, ...]])
    		text: response.text = response.body.decode(response.encoding)
	HtmlResponse objects:
  		class scrapy.http.HtmlResponse(url[, ...])
	XmlResponse objects:
  		class scrapy.http.XmlResponse(url[, ...])
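A small sketch of how a spider typically builds Requests, posts a FormRequest, and carries data between callbacks via meta; the URLs, form fields and selectors are placeholders:
	import scrapy

	class LoginSpider(scrapy.Spider):
	    name = 'login_example'
	    start_urls = ['http://example.com/login']  # placeholder

	    def parse(self, response):
	        # FormRequest submits formdata like an HTML form
	        return scrapy.FormRequest(
	            url='http://example.com/login',
	            formdata={'user': 'john', 'pass': 'secret'},   # placeholder credentials
	            callback=self.after_login,
	            meta={'attempt': 1},                           # arbitrary data carried to the callback
	        )

	    def after_login(self, response):
	        self.logger.info('Login attempt %d returned status %d',
	                         response.meta['attempt'], response.status)
	        yield scrapy.Request('http://example.com/profile', callback=self.parse_profile)

	    def parse_profile(self, response):
	        yield {'title': response.css('title::text').extract_first()}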
1.10. Link Extractors (usage sketch below)
class scrapy.linkextractors.LinkExtractor
LxmlLinkExtractor:
	class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=False, unique=True, process_value=None, strip=True)
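Link extractors are most often used inside CrawlSpider rules; a minimal sketch with placeholder domain and URL patterns:
	from scrapy.spiders import CrawlSpider, Rule
	from scrapy.linkextractors import LinkExtractor

	class FollowSpider(CrawlSpider):
	    name = 'follow_example'
	    allowed_domains = ['example.com']      # placeholder domain
	    start_urls = ['http://example.com']

	    rules = (
	        # Follow category pages without parsing them
	        Rule(LinkExtractor(allow=r'/category/'), follow=True),
	        # Parse item pages with parse_item
	        Rule(LinkExtractor(allow=r'/item/\d+'), callback='parse_item'),
	    )

	    def parse_item(self, response):
	        yield {'url': response.url}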
1.11. Settings (example of reading settings from code at the end of this subsection)
SCRAPY_SETTINGS_MODULE = myproject.settings

Settings precedence (highest to lowest):
Command line options (most precedence)
  scrapy crawl myspider -s LOG_FILE=scrapy.log
Settings per-spider
  class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'SOME_SETTING': 'some value',
    }
Project settings module
  settings.py
Default settings per-command
  default_settings
Default global settings (less precedence)
  scrapy.settings.default_settings

DEFAULT_REQUEST_HEADERS
	{
	'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
	'Accept-Language': 'en',
	}
DOWNLOADER_MIDDLEWARES_BASE	
	{
	'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
	'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
	'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
	'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
	'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
	'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
	'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
	'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
	'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
	'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
	'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
	'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
	'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
	'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
	}
DOWNLOAD_DELAY = 0.25    # 250 ms of delay
DOWNLOAD_HANDLERS_BASE
	{
	'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
	'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
	'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
	's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
	'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
	}
ITEM_PIPELINES = {
	'mybot.pipelines.validate.ValidateMyItem': 300,
	'mybot.pipelines.validate.StoreMyItem': 800,
	}
MEMDEBUG_NOTIFY = ['[email protected]']
MEMUSAGE_NOTIFY_MAIL = ['[email protected]']
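Components normally read the merged settings through the crawler rather than importing settings.py directly; a sketch of a pipeline pulling a custom setting via from_crawler (MYPROJECT_ENDPOINT is a hypothetical setting name):
	class MyPipeline(object):
	    def __init__(self, endpoint):
	        self.endpoint = endpoint

	    @classmethod
	    def from_crawler(cls, crawler):
	        # crawler.settings merges all of the levels above according to their priority
	        return cls(endpoint=crawler.settings.get('MYPROJECT_ENDPOINT', 'http://localhost'))
Inside a spider the merged settings are also available as self.settings, e.g. self.settings.getbool('COOKIES_ENABLED').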
1.12. Exceptions (usage sketch below)
DropItem:
	exception scrapy.exceptions.DropItem
CloseSpider:
	exception scrapy.exceptions.CloseSpider(reason='cancelled')
DontCloseSpider:
	exception scrapy.exceptions.DontCloseSpider
IgnoreRequest:
	exception scrapy.exceptions.IgnoreRequest
NotConfigured:
	exception scrapy.exceptions.NotConfigured
NotSupported:
	exception scrapy.exceptions.NotSupported					
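The two most common of these in day-to-day code are DropItem (raised in pipelines) and CloseSpider (raised in callbacks); a sketch, with a made-up duplicate check and status check:
	from scrapy.exceptions import DropItem, CloseSpider

	class DedupPipeline(object):
	    """Hypothetical pipeline that drops duplicate items by id."""

	    def __init__(self):
	        self.seen = set()

	    def process_item(self, item, spider):
	        if item.get('id') in self.seen:
	            raise DropItem('Duplicate item: %s' % item.get('id'))
	        self.seen.add(item.get('id'))
	        return item

	# In a spider callback, CloseSpider stops the whole crawl with a reason:
	#     if response.status == 403:
	#         raise CloseSpider('blocked_by_site')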

2. Built-in Services

2.1 Logging
Log levels: 
	logging.CRITICAL - for critical errors (highest severity)
	logging.ERROR - for regular errors
	logging.WARNING - for warning messages
	logging.INFO - for informational messages
	logging.DEBUG - for debugging messages (lowest severity)
How to log messages:
	import logging

	logging.warning("This is a warning")

	logging.log(logging.WARNING, "This is a warning")

	logger = logging.getLogger()
	logger.warning("This is a warning")
Logging from Spiders:
	import scrapy
	class MySpider(scrapy.Spider):
	    name = 'myspider'
	    start_urls = ['https://scrapinghub.com']
	    def parse(self, response):
	        self.logger.info('Parse function called on %s', response.url)

	import logging
	import scrapy
	logger = logging.getLogger('mycustomlogger')
	class MySpider(scrapy.Spider):
	    name = 'myspider'
	    start_urls = ['https://scrapinghub.com']
	    def parse(self, response):
	        logger.info('Parse function called on %s', response.url)
Logging configuration:
	Command-line options:
		--logfile FILE			Overrides LOG_FILE
		--loglevel/-L LEVEL		Overrides LOG_LEVEL
		--nolog					Sets LOG_ENABLED to False
scrapy.utils.log module:
	import logging
	from scrapy.utils.log import configure_logging
	configure_logging(install_root_handler=False)
	logging.basicConfig(
	    filename='log.txt',
	    format='%(levelname)s: %(message)s',
	    level=logging.INFO
	)
2.2 Stats Collection (usage sketch below)
Stats Collector
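A sketch of how a component can record custom stats through the shared Stats Collector obtained from the crawler; the stat keys are made up:
	class StatsAwarePipeline(object):
	    def __init__(self, stats):
	        self.stats = stats

	    @classmethod
	    def from_crawler(cls, crawler):
	        # crawler.stats is the shared Stats Collector instance
	        return cls(crawler.stats)

	    def process_item(self, item, spider):
	        self.stats.inc_value('myproject/items_seen')                    # hypothetical stat keys
	        self.stats.max_value('myproject/max_price', item.get('price', 0))
	        return item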
2.3 Sending e-mail
smtplib
Twisted nonblocking IO

from scrapy.mail import MailSender
mailer = MailSender()
mailer = MailSender.from_settings(settings)
mailer.send(to=["[email protected]"], subject="Some subject", body="Some body", cc=["[email protected]"])
2.4 Telnet Console
(no notes)
2.5 Web Service
https://github.com/scrapy-plugins/scrapy-jsonrpc

3. Solving Specific Problems

3.1 Frequently Asked Questions
https://docs.scrapy.org/en/latest/faq.html

Does Scrapy crawl in breadth-first or depth-first order?
3.2 Debugging Spiders
(no notes)
3.3 Spiders Contracts (docstring example below)
Custom Contracts: 
	SPIDER_CONTRACTS = {
	    'myproject.contracts.ResponseCheck': 10,
	    'myproject.contracts.ItemValidate': 10,
	}
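Contracts themselves live in a callback's docstring and are checked with scrapy check; a minimal sketch using the built-in @url, @returns and @scrapes contracts (spider name, URL, selectors and field names are placeholders):
	import scrapy

	class ItemSpider(scrapy.Spider):
	    name = 'contract_example'

	    def parse_item(self, response):
	        """Parse a product page.

	        @url http://example.com/item/1
	        @returns items 1 1
	        @returns requests 0 0
	        @scrapes name price
	        """
	        yield {
	            'name': response.css('h1::text').extract_first(),
	            'price': response.css('.price::text').extract_first(),
	        }
Running scrapy check contract_example downloads the @url page and verifies the callback's output against the contracts.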
3.4 Common Practices
(no notes)
3.5 Broad Crawls
Increase concurrency:
	CONCURRENT_REQUESTS = 100
Increase Twisted IO thread pool maximum size:	
	REACTOR_THREADPOOL_MAXSIZE = 20
Reduce log level:
	LOG_LEVEL = 'INFO'
Disable cookies:
	COOKIES_ENABLED = False
Disable retries:
	RETRY_ENABLED = False
Reduce download timeout:
	DOWNLOAD_TIMEOUT = 15
Disable redirects:
	REDIRECT_ENABLED = False						
3.6 Using Firefox for scraping
(no notes)
3.7 Using Firebug for scraping
(no notes)
3.8 Debugging memory leaks
Too Many Requests?
Too many spiders?
3.9 Downloading and processing files and images
Using the Files Pipeline: 
	FilesPipeline
Using the Images Pipeline: 
	ImagesPipeline
Enabling your Media Pipeline:
	ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
	ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
FILES_STORE = '/path/to/valid/dir'
IMAGES_STORE = '/path/to/valid/dir'
IMAGES_STORE = 's3://bucket/images'
IMAGES_STORE = 'gs://bucket/images/'
FILES_EXPIRES = 120 # 120 days of delay for files expiration
IMAGES_EXPIRES = 30 # 30 days of delay for images expiration
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
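The media pipelines above expect the standard file_urls/files (or image_urls/images) fields on the item; a sketch, with hypothetical item names and selector:
	import scrapy

	class DocumentItem(scrapy.Item):
	    # FilesPipeline reads URLs from file_urls and writes download results into files
	    file_urls = scrapy.Field()
	    files = scrapy.Field()

	class PhotoItem(scrapy.Item):
	    # ImagesPipeline uses image_urls / images instead
	    image_urls = scrapy.Field()
	    images = scrapy.Field()

	# In a spider callback (selector is a placeholder):
	#     yield PhotoItem(image_urls=[response.urljoin(src)
	#                                 for src in response.css('img::attr(src)').extract()])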
3.10 Deploying Spiders
Scrapyd (open source)
Scrapy Cloud (cloud-based)
3.11 AutoThrottle extension
AUTOTHROTTLE_ENABLED 				Default: False
AUTOTHROTTLE_START_DELAY			Default: 5.0
AUTOTHROTTLE_MAX_DELAY				Default: 60.0
AUTOTHROTTLE_TARGET_CONCURRENCY		Default: 1.0
AUTOTHROTTLE_DEBUG					Default: False
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
DOWNLOAD_DELAY
3.12 Benchmarking
scrapy bench
3.13 Jobs: pausing and resuming crawls
(no notes)

4. Extending Scrapy

4.1 Architecture overview
  • Data flow: (data flow diagram from the docs; image not included in these notes)
  • Components:
	Scrapy Engine
	Scheduler
	Downloader
	Spiders
	Item Pipeline
  • Event-driven networking:
	Scrapy is written with Twisted, a popular event-driven networking framework for Python. 
	Thus, it is implemented using non-blocking (aka asynchronous) code for concurrency.
4.2 Downloader Middleware (custom middleware sketch below)
Activating a downloader middleware:
	DOWNLOADER_MIDDLEWARES = {
	    'myproject.middlewares.CustomDownloaderMiddleware': 543,
	}
	DOWNLOADER_MIDDLEWARES = {
		'myproject.middlewares.CustomDownloaderMiddleware': 543,
		'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
	}
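A minimal custom downloader middleware sketch, corresponding to an entry like 'myproject.middlewares.CustomDownloaderMiddleware' above; the class name, User-Agent strings and logging are illustrative only:
	import random

	class RandomUserAgentMiddleware(object):
	    """Hypothetical middleware that sets a random User-Agent on each request."""

	    USER_AGENTS = [
	        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
	        'Mozilla/5.0 (X11; Linux x86_64)',
	    ]

	    def process_request(self, request, spider):
	        # Returning None lets the request continue through the middleware chain
	        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
	        return None

	    def process_response(self, request, response, spider):
	        spider.logger.debug('Got %s for %s', response.status, request.url)
	        return response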
4.3 Spider Middleware (custom middleware sketch below)
Activating a spider middleware: 
	SPIDER_MIDDLEWARES = {
	    'myproject.middlewares.CustomSpiderMiddleware': 543,
	}
	SPIDER_MIDDLEWARES = {
		'myproject.middlewares.CustomSpiderMiddleware': 543,
		'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
	}
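And a spider middleware sketch: process_spider_output filters what a callback yields before it reaches the engine; the class name and the title check are made up:
	import scrapy

	class RequireTitleMiddleware(object):
	    """Hypothetical spider middleware: drops scraped results that lack a 'title' field."""

	    def process_spider_output(self, response, result, spider):
	        for element in result:
	            # Let Requests pass through untouched; filter scraped items/dicts
	            if not isinstance(element, scrapy.Request) and not element.get('title'):
	                spider.logger.debug('Dropping result without title from %s', response.url)
	                continue
	            yield element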
4.4 Extensions (example extension below)
Loading & activating extensions:
	EXTENSIONS = {
	    'scrapy.extensions.corestats.CoreStats': 500,
	    'scrapy.extensions.telnet.TelnetConsole': 500,
	}
Disabling an extension:
	EXTENSIONS = {
	    'scrapy.extensions.corestats.CoreStats': None,
	}
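A small extension sketch following the docs' usual pattern: built in from_crawler, disabled via NotConfigured, and hooked into signals; the MYEXT_ENABLED setting name and class name are taken as assumptions here:
	from scrapy import signals
	from scrapy.exceptions import NotConfigured

	class SpiderOpenCloseLogging(object):
	    @classmethod
	    def from_crawler(cls, crawler):
	        # Extensions are expected to disable themselves when not configured
	        if not crawler.settings.getbool('MYEXT_ENABLED'):
	            raise NotConfigured
	        ext = cls()
	        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
	        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
	        return ext

	    def spider_opened(self, spider):
	        spider.logger.info('opened spider %s', spider.name)

	    def spider_closed(self, spider):
	        spider.logger.info('closed spider %s', spider.name)
It would then be enabled with an EXTENSIONS entry (e.g. 'myproject.extensions.SpiderOpenCloseLogging': 500, module path hypothetical) plus MYEXT_ENABLED = True.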
4.5 Core API
Crawler API:
	class scrapy.crawler.Crawler(spidercls, settings)
	class scrapy.crawler.CrawlerRunner(settings=None)
	class scrapy.crawler.CrawlerProcess(settings=None, install_root_handler=True)
Settings API:
	scrapy.settings.SETTINGS_PRIORITIES: 
		SETTINGS_PRIORITIES = {
		    'default': 0,
		    'command': 10,
		    'project': 20,
		    'spider': 30,
		    'cmdline': 40,
		}
SpiderLoader API:
	class scrapy.spiderloader.SpiderLoader
Signals API:
	class scrapy.signalmanager.SignalManager(sender=_Anonymous)
Stats Collector API:
	class scrapy.statscollectors.StatsCollector			
4.6 Signals (in-spider connection example below)
scrapy.signals.engine_started()
scrapy.signals.engine_stopped()
scrapy.signals.item_scraped(item, response, spider)
scrapy.signals.item_dropped(item, response, exception, spider)
scrapy.signals.spider_opened(spider)
scrapy.signals.spider_closed(spider, reason)
scrapy.signals.spider_idle(spider)
scrapy.signals.spider_error(failure, response, spider)
scrapy.signals.request_scheduled(request, spider)
scrapy.signals.request_dropped(request, spider)
scrapy.signals.response_received(response, request, spider)
scrapy.signals.response_downloaded(response, request, spider)
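Handlers are attached through crawler.signals.connect; the usual in-spider pattern from the docs looks roughly like this (spider name and URL are placeholders):
	import scrapy
	from scrapy import signals

	class AnnouncingSpider(scrapy.Spider):
	    name = 'announcing'
	    start_urls = ['http://example.com']   # placeholder

	    @classmethod
	    def from_crawler(cls, crawler, *args, **kwargs):
	        spider = super(AnnouncingSpider, cls).from_crawler(crawler, *args, **kwargs)
	        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
	        return spider

	    def spider_closed(self, spider):
	        spider.logger.info('Spider closed: %s', spider.name)

	    def parse(self, response):
	        yield {'url': response.url}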
4.7 Item Exporters (pipeline usage sketch below)
BaseItemExporter:
	class scrapy.exporters.BaseItemExporter(fields_to_export=None, export_empty_fields=False, encoding='utf-8', indent=0)
XmlItemExporter:
	class scrapy.exporters.XmlItemExporter(file, item_element='item', root_element='items', **kwargs)
CsvItemExporter:
	class scrapy.exporters.CsvItemExporter(file, include_headers_line=True, join_multivalued=', ', **kwargs)
PickleItemExporter:
	class scrapy.exporters.PickleItemExporter(file, protocol=0, **kwargs)
PprintItemExporter:
	class scrapy.exporters.PprintItemExporter(file, **kwargs)
JsonItemExporter:
	class scrapy.exporters.JsonItemExporter(file, **kwargs)				
JsonLinesItemExporter:
	class scrapy.exporters.JsonLinesItemExporter(file, **kwargs)
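Exporters can also be driven by hand from a pipeline when the built-in feed exports are not flexible enough; a sketch writing one JSON-lines file per spider (the file name pattern is arbitrary):
	from scrapy.exporters import JsonLinesItemExporter

	class PerSpiderExportPipeline(object):
	    def open_spider(self, spider):
	        self.file = open('%s_items.jl' % spider.name, 'wb')   # exporters expect a binary file
	        self.exporter = JsonLinesItemExporter(self.file)
	        self.exporter.start_exporting()

	    def close_spider(self, spider):
	        self.exporter.finish_exporting()
	        self.file.close()

	    def process_item(self, item, spider):
	        self.exporter.export_item(item)
	        return item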

Reposted from: https://my.oschina.net/shaochuan/blog/2875665
