Command format: scrapy <command> [options] [args]
command | Purpose | Scope |
---|---|---|
crawl | Start crawling with a given spider | Project only |
check | Check the spider code (runs contract checks) | Project only |
list | List all available spiders in the current project, one per line | Project only |
edit | Edit a spider from the command window | Project only |
parse | Fetch a URL and parse it with the specified spider method | Project only |
bench | Run a quick benchmark of crawl speed | Global |
fetch | Fetch a URL using the Scrapy downloader | Global |
genspider | Generate a new spider from a pre-defined template | Global |
runspider | Run a self-contained spider (without creating a project) | Global |
settings | Get Scrapy settings values | Global |
shell | Access a URL from an interactive command-line shell | Global |
startproject | Create a new project | Global |
version | Print the Scrapy version | Global |
view | Open a URL in the browser, showing the content as Scrapy actually sees it | Global |
1. Create a project: startproject

scrapy startproject myproject [project_dir]

This creates a new Scrapy project named myproject under the project_dir path. If project_dir is not specified, it defaults to the same name as myproject.
C:\Users\m1812>scrapy startproject mytestproject
New Scrapy project 'mytestproject', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Users\m1812\mytestproject
You can start your first spider with:
cd mytestproject
scrapy genspider example example.com
C:\Users\m1812>cd mytestproject
C:\Users\m1812\mytestproject>tree
Folder PATH listing
Volume serial number is 5680-D4D0
C:.
└─mytestproject
├─spiders
│ └─__pycache__
└─__pycache__
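The tree command above only lists directories. For reference, a freshly generated project typically also contains roughly the following files (assuming the default templates; newer Scrapy versions additionally generate a middlewares.py):

```
mytestproject/
├── scrapy.cfg              # deploy / project configuration
└── mytestproject/
    ├── __init__.py
    ├── items.py            # item definitions
    ├── pipelines.py        # item pipelines
    ├── settings.py         # project settings
    └── spiders/
        └── __init__.py
```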
2. Generate a spider: genspider

Inside the project directory created above:

scrapy genspider mydomain mydomain.com
C:\Users\m1812\mytestproject>scrapy genspider baidu www.baidu.com
Created spider 'baidu' using template 'basic' in module:
mytestproject.spiders.baidu
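The generated spiders/baidu.py from the basic template looks roughly like this (the exact template output differs slightly between Scrapy versions):

```python
# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["www.baidu.com"]
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # the template leaves the parsing logic to you
        pass
```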
Let's look at the detailed usage of genspider:
C:\Users\m1812\mytestproject>scrapy genspider -h
Usage
=====
scrapy genspider [options] <name> <domain>
Generate new spider using pre-defined templates
Options
=======
--help, -h show this help message and exit
--list, -l List available templates
--edit, -e Edit spider after creating it
--dump=TEMPLATE, -d TEMPLATE
Dump template to standard output
--template=TEMPLATE, -t TEMPLATE
Uses a custom template.
--force If the spider already exists, overwrite it with the
template
Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: DEBUG)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
--pdb enable pdb on failure
Using a template: -t TEMPLATE

Available template types:
C:\Users\m1812\mytestproject>scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
Let's test using a template:
C:\Users\m1812\mytestproject>scrapy genspider -t crawl zhihu www.zhihu.com
Created spider 'zhihu' using template 'crawl' in module:
mytestproject.spiders.zhihu
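The crawl template produces a CrawlSpider with link-following rules. The generated spiders/zhihu.py looks roughly like this (again, template details vary a bit by version):

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZhihuSpider(CrawlSpider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']

    # follow links matching the pattern and hand responses to parse_item
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        return i
```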
3. Run a spider: crawl

scrapy crawl <spider>
C:\Users\m1812\mytestproject>scrapy crawl zhihu
2019-04-06 15:14:18 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: mytestproject)
2019-04-06 15:14:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mytestproject.spiders', 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'mytestproject', 'SPIDER_MODULES': ['mytestproject.spiders']}
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 15:14:18 [scrapy.core.engine] INFO: Spider opened
2019-04-06 15:14:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 15:14:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 15:14:23 [scrapy.core.engine] DEBUG: Crawled (400) (referer: None)
2019-04-06 15:14:28 [scrapy.core.engine] DEBUG: Crawled (400) (referer: None)
2019-04-06 15:14:28 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://www.zhihu.com/>: HTTP status code is not handled or not allowed
2019-04-06 15:14:28 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 15:14:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 527,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 813,
'downloader/response_count': 2,
'downloader/response_status_count/400': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 6, 7, 14, 28, 947408),
'log_count/DEBUG': 3,
'log_count/INFO': 8,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 4, 6, 7, 14, 18, 593508)}
2019-04-06 15:14:28 [scrapy.core.engine] INFO: Spider closed (finished)
Zhihu requires certain request headers before it will serve the page, so the 400 status code here indicates the request failed.
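One common fix is to send a browser-like User-Agent. A minimal sketch, assuming the zhihu spider above (the header value is just an example, and Zhihu may require more than this):

```python
# mytestproject/spiders/zhihu.py (excerpt)
import scrapy


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    start_urls = ['http://www.zhihu.com/']

    # per-spider settings that override settings.py for this spider only
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    }

    def parse(self, response):
        self.logger.info('status: %s', response.status)
```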
4. Check code: check
scrapy check [-l]
C:\Users\m1812\mytestproject>scrapy check
----------------------------------------------------------------------
Ran 0 contracts in 0.000s
OK
If we deliberately break the code, here by deleting one of the quotation marks around the URL, and run it again, the error is detected:
C:\Users\m1812\mytestproject>scrapy check
Traceback (most recent call last):
File "C:\Users\m1812\Anaconda3\Scripts\scrapy-script.py", line 5, in
sys.exit(scrapy.cmdline.execute())
File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\cmdline.py", line 141, in execute
cmd.crawler_process = CrawlerProcess(settings)
File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 238, in __init__
super(CrawlerProcess, self).__init__(settings)
File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 129, in __init__
self.spider_loader = _get_spider_loader(settings)
File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 325, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 45, in from_settings
return cls(settings)
File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 23, in __init__
self._load_all_spiders()
File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 32, in _load_all_spiders
for module in walk_modules(name):
File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\utils\misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "C:\Users\m1812\Anaconda3\lib\importlib\__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 986, in _gcd_import
File "", line 969, in _find_and_load
File "", line 958, in _find_and_load_unlocked
File "", line 673, in _load_unlocked
File "", line 661, in exec_module
File "", line 767, in get_code
File "", line 727, in source_to_code
File "", line 222, in _call_with_frames_removed
File "C:\Users\m1812\mytestproject\mytestproject\spiders\zhihu.py", line 10
start_urls = [http://www.zhihu.com/']
This command is not particularly useful in everyday work.
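Note that check is really meant for spider contracts; "Ran 0 contracts" above simply means none were defined. A minimal sketch of a contract, written as a docstring on the callback (the URL and field names are just examples):

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        """Parse the quotes listing page.

        @url http://quotes.toscrape.com/
        @returns items 1 20
        @returns requests 0 5
        @scrapes text author
        """
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
```

With such a docstring in place, scrapy check fetches the @url and verifies the callback's output against the @returns and @scrapes constraints.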
5. List available spiders in the project: list
scrapy list
C:\Users\m1812\mytestproject>scrapy list
baidu
zhihu
6. Edit a spider: edit

scrapy edit <spider>

This does not seem to work on Windows, and it is rarely needed anyway; editing the spider in an IDE such as PyCharm works fine.
7. Fetch a URL: fetch

This is a global command: scrapy fetch [options] <url>

Detailed usage:
C:\Users\m1812\mytestproject>scrapy fetch -h
Usage
=====
scrapy fetch [options] <url>
Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging
Options
=======
--help, -h show this help message and exit
--spider=SPIDER use this spider
--headers print response HTTP headers instead of body
--no-redirect do not handle HTTP 3xx status codes and print response
as-is
Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: DEBUG)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
--pdb enable pdb on failure
Let's test fetching Baidu. Note that the URL must include the http:// prefix:
C:\Users\m1812>scrapy fetch http://www.baidu.com
2019-04-06 15:44:51 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 15:44:51 [scrapy.utils.log] INFO: Overridden settings: {}
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Spider opened
2019-04-06 15:44:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 15:44:51 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 15:44:51 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 15:44:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1476,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 6, 7, 44, 51, 989960),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 4, 6, 7, 44, 51, 759268)}
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Spider closed (finished)
百度一下,你就知道 ©2017 Baidu 使用百度前必读 意见反馈 京ICP证030173号

(The page body is printed to stdout; in the Windows console it appears as mojibake because the page's UTF-8 content is decoded with the local code page.)
Try it without log output:
C:\Users\m1812>scrapy fetch --nolog http://www.baidu.com
百度一下,你就知道 ©2017 Baidu 使用百度前必读 意见反馈 京ICP证030173号
Get the headers:
C:\Users\m1812>scrapy fetch --nolog --headers http://www.baidu.com
> User-Agent: Scrapy/1.3.3 (+http://scrapy.org)
> Accept-Language: en
> Accept-Encoding: gzip,deflate
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
< Date: Sat, 06 Apr 2019 07:48:42 GMT
< Server: bfe/1.0.8.18
< Content-Type: text/html
< Last-Modified: Mon, 23 Jan 2017 13:28:12 GMT
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Pragma: no-cache
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
There are several other options as well, such as --no-redirect, which disables handling of HTTP 3xx redirects.
8. Open a URL in the browser as Scrapy sees it: view

This is a global command: scrapy view [options] <url>

It opens the URL in a browser and shows the content exactly as Scrapy sees it. Sometimes a spider sees a page differently from a regular browser request, so this is a handy way to check whether the information the spider receives matches what you expect.
C:\Users\m1812>scrapy view http://www.baidu.com
2019-04-06 16:01:45 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:01:45 [scrapy.utils.log] INFO: Overridden settings: {}
2019-04-06 16:01:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:01:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:01:46 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:01:46 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 16:01:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1476,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 6, 8, 1, 46, 435330),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 4, 6, 8, 1, 46, 78537)}
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Spider closed (finished)
Testing this on Taobao, much of the page fails to load, which shows that Taobao loads its content with Ajax; a plain request cannot retrieve that information.
9. Access a URL in an interactive shell: shell

This is a global command: scrapy shell [options] [url]
C:\Users\m1812>scrapy shell http://www.baidu.com
2019-04-06 16:11:41 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:11:41 [scrapy.utils.log] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2019-04-06 16:11:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:11:42 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:11:42 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:11:42 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2019-04-06 16:11:42 [traitlets] DEBUG: Using default logger
2019-04-06 16:11:42 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler
[s] item {}
[s] request
[s] response <200 http://www.baidu.com>
[s] settings
[s] spider
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]: scrapy
Out[1]:
In [2]: request
Out[2]:
In [3]: response
Out[3]: <200 http://www.baidu.com>
In [4]: view(response)
Out[4]: True
In [5]: response.text
Out[5]: '\r\n 百度一下,你就知道 \r\n'
In [6]: response.headers
Out[6]:
{b'Cache-Control': b'private, no-cache, no-store, proxy-revalidate, no-transform',
b'Content-Type': b'text/html',
b'Date': b'Sat, 06 Apr 2019 08:11:42 GMT',
b'Last-Modified': b'Mon, 23 Jan 2017 13:28:12 GMT',
b'Pragma': b'no-cache',
b'Server': b'bfe/1.0.8.18',
b'Set-Cookie': b'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/'}
In [7]: response.css('title::text').extract_first()
Out[7]: '百度一下,你就知道'
In [8]: exit()
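The shell also supports XPath selectors, and fetch() lets you load a different URL without leaving the session. A rough sketch of such a session (outputs abbreviated, not taken from the transcript above):

```
In [1]: response.xpath('//title/text()').extract_first()
Out[1]: '百度一下,你就知道'

In [2]: fetch('http://quotes.toscrape.com')   # re-fetches and updates response

In [3]: response.css('small.author::text').extract_first()
Out[3]: 'Albert Einstein'
```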
10. Parse a URL with a specified spider method: parse

scrapy parse <url> [options]

Here we test with the spider from the previous article, which crawls quotes.toscrape.com.
C:\Users\m1812>cd quotetutorial
C:\Users\m1812\quotetutorial>scrapy parse http://quotes.toscrape.com -c parse
2019-04-06 16:24:23 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-06 16:24:23 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['quotetutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'BOT_NAME': 'quotetutorial'}
2019-04-06 16:24:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled item pipelines:
['quotetutorial.pipelines.QuotetutorialPipeline',
'quotetutorial.pipelines.MongoPipeline']
2019-04-06 16:24:24 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:24:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:24:24 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:24:24 [scrapy.core.engine] DEBUG: Crawled (404) (referer: None)
2019-04-06 16:24:25 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2019-04-06 16:24:25 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 16:24:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 444,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 6, 8, 24, 25, 485334),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 4, 6, 8, 24, 24, 258282)}
2019-04-06 16:24:25 [scrapy.core.engine] INFO: Spider closed (finished)
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items ------------------------------------------------------------
[{'author': 'Albert Einstein',
  'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
  'text': '“The world as we have created it is a process of our thinking. It '
          'cannot be changed without changing our thinking.”'},
 {'author': 'J.K. Rowling',
  'tags': ['abilities', 'choices'],
  'text': '“It is our choices, Harry, that show what we truly are, far more '
          'than our abilities.”'},
 {'author': 'Albert Einstein',
  'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
  'text': '“There are only two ways to live your life. One is as though '
          'nothing is a miracle. The other is as though everything is a '
          'miracle.”'},
 {'author': 'Jane Austen',
  'tags': ['aliteracy', 'books', 'classic', 'humor'],
  'text': '“The person, be it gentleman or lady, who has not pleasure in a '
          'good novel, must be intolerably stupid.”'},
 {'author': 'Marilyn Monroe',
  'tags': ['be-yourself', 'inspirational'],
  'text': "“Imperfection is beauty, madness is genius and it's better to be "
          'absolutely ridiculous than absolutely boring.”'},
 {'author': 'Albert Einstein',
  'tags': ['adulthood', 'success', 'value'],
  'text': '“Try not to become a man of success. Rather become a man of '
          'value.”'},
 {'author': 'André Gide',
  'tags': ['life', 'love'],
  'text': '“It is better to be hated for what you are than to be loved for '
          'what you are not.”'},
 {'author': 'Thomas A. Edison',
  'tags': ['edison', 'failure', 'inspirational', 'paraphrased'],
  'text': "“I have not failed. I've just found 10,000 ways that won't work.”"},
 {'author': 'Eleanor Roosevelt',
  'tags': ['misattributed-eleanor-roosevelt'],
  'text': '“A woman is like a tea bag; you never know how strong it is until '
          "it's in hot water.”"},
 {'author': 'Steve Martin',
  'tags': ['humor', 'obvious', 'simile'],
  'text': '“A day without sunshine is like, you know, night.”'}]
# Requests -----------------------------------------------------------------
[ ]
The Scraped Items and the follow-up Requests are printed.
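For reference, the parse callback that produced those items looks roughly like this (a sketch of the spider from the previous article, with pagination omitted; the exact selectors there may differ):

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # each quote block carries its text, author and tags
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
```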
11. Get Scrapy settings values: settings

scrapy settings [options]
C:\Users\m1812\quotetutorial>scrapy settings -h
Usage
=====
scrapy settings [options]
Get settings values
Options
=======
--help, -h show this help message and exit
--get=SETTING print raw setting value
--getbool=SETTING print setting value, interpreted as a boolean
--getint=SETTING print setting value, interpreted as an integer
--getfloat=SETTING print setting value, interpreted as a float
--getlist=SETTING print setting value, interpreted as a list
Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: DEBUG)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
--pdb enable pdb on failure
Test it with a setting from this project (MONGO_URI):
C:\Users\m1812\quotetutorial>scrapy settings --get MONGO_URI
localhost
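MONGO_URI is not a built-in Scrapy setting; it is defined in this project's settings.py and read by the MongoDB pipeline. A minimal sketch of how such a pipeline might pick it up (class and setting names assumed from the pipeline list shown earlier):

```python
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings is the same source that `scrapy settings --get` reads
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB', 'quotetutorial'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db['quotes'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```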
12. Run a spider file: runspider

Unlike crawl, runspider runs a spider file directly by its filename (xxx.py), and you need to change into the directory containing the file first.

scrapy runspider <spider_file.py>
C:\Users\m1812\quotetutorial>cd quotetutorial
C:\Users\m1812\quotetutorial\quotetutorial>dir
Volume in drive C has no label.
Volume Serial Number is 5680-D4D0

Directory of C:\Users\m1812\quotetutorial\quotetutorial
2019/04/05 22:44 .
2019/04/05 22:44 ..
2019/04/05 20:04 364 items.py
2019/04/05 19:16 1,887 middlewares.py
2019/04/05 22:35 1,431 pipelines.py
2019/04/05 22:44 3,292 settings.py
2019/04/05 22:02 spiders
2017/03/10 23:31 0 __init__.py
2019/04/06 14:33 __pycache__
5 File(s)          6,974 bytes
4 Dir(s)  28,533,673,984 bytes free
C:\Users\m1812\quotetutorial\quotetutorial>cd spiders
C:\Users\m1812\quotetutorial\quotetutorial\spiders>dir
Volume in drive C has no label.
Volume Serial Number is 5680-D4D0

Directory of C:\Users\m1812\quotetutorial\quotetutorial\spiders
2019/04/05 22:02 .
2019/04/05 22:02 ..
2019/04/05 22:02 914 quotes.py
2017/03/10 23:31 161 __init__.py
2019/04/05 22:02 __pycache__
2 File(s)          1,075 bytes
3 Dir(s)  28,533,673,984 bytes free
C:\Users\m1812\quotetutorial\quotetutorial\spiders>scrapy runspider quotes.py
The result is the same as running crawl.
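Because runspider does not need a project, it is handy for one-off scripts. A minimal self-contained sketch (save it under an assumed filename such as standalone_quotes.py and run scrapy runspider standalone_quotes.py):

```python
import scrapy


class StandaloneQuotesSpider(scrapy.Spider):
    """A project-less spider: everything it needs lives in this one file."""
    name = 'standalone_quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
```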
13. Show the version: version

Prints the Scrapy version; with -v it also prints the versions of the related dependency libraries.
C:\Users\m1812\quotetutorial>scrapy version -v
Scrapy : 1.3.3
lxml : 3.6.4.0
libxml2 : 2.9.4
cssselect : 1.0.1
parsel : 1.2.0
w3lib : 1.17.0
Twisted : 17.5.0
Python : 3.5.2 |Anaconda 4.2.0 (64-bit)| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.2j 26 Sep 2016)
Platform : Windows-10-10.0.17134-SP0
14. Benchmark crawl speed: bench
C:\Users\m1812>scrapy bench
2019-04-06 16:43:34 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:43:34 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:43:37 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:43:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:38 [scrapy.extensions.logstats] INFO: Crawled 61 pages (at 3660 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:39 [scrapy.extensions.logstats] INFO: Crawled 109 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:40 [scrapy.extensions.logstats] INFO: Crawled 157 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:41 [scrapy.extensions.logstats] INFO: Crawled 205 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:42 [scrapy.extensions.logstats] INFO: Crawled 245 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:43 [scrapy.extensions.logstats] INFO: Crawled 285 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:44 [scrapy.extensions.logstats] INFO: Crawled 317 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:45 [scrapy.extensions.logstats] INFO: Crawled 357 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:46 [scrapy.extensions.logstats] INFO: Crawled 389 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:47 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2019-04-06 16:43:47 [scrapy.extensions.logstats] INFO: Crawled 429 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:48 [scrapy.extensions.logstats] INFO: Crawled 445 pages (at 960 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 182101,
'downloader/request_count': 445,
'downloader/request_method_count/GET': 445,
'downloader/response_bytes': 1209563,
'downloader/response_count': 445,
'downloader/response_status_count/200': 445,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2019, 4, 6, 8, 43, 48, 395684),
'log_count/INFO': 18,
'request_depth_max': 16,
'response_received_count': 445,
'scheduler/dequeued': 445,
'scheduler/dequeued/memory': 445,
'scheduler/enqueued': 8901,
'scheduler/enqueued/memory': 8901,
'start_time': datetime.datetime(2019, 4, 6, 8, 43, 37, 309871)}
2019-04-06 16:43:48 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)
That works out to roughly 2,400 pages per minute on this machine (445 pages in about 11 seconds). Note that bench spawns a local HTTP server and crawls it, so it mostly measures Scrapy's own overhead rather than real-world network crawl speed.