Python 3 Scraping Notes 14: Scrapy Commands

Command format: scrapy <command> [options] [args]

Command        Purpose                                                          Scope
crawl          Run a spider and start a crawl                                   Project-only
check          Run contract checks on the spiders (also catches syntax errors)  Project-only
list           List all available spiders in the current project, one per line  Project-only
edit           Edit a spider from the command line                              Project-only
parse          Fetch a URL and parse it with the specified spider callback      Project-only
bench          Run a quick benchmark of crawl speed                             Global
fetch          Fetch a URL using the Scrapy downloader                          Global
genspider      Generate a new spider from a pre-defined template                Global
runspider      Run a self-contained spider (without creating a project)         Global
settings       Get Scrapy settings values                                       Global
shell          Interactive console for a given URL                              Global
startproject   Create a new project                                             Global
version        Print the Scrapy version                                         Global
view           Open a URL in the browser, showing it as Scrapy sees it          Global

1. Create a project: startproject

scrapy startproject myproject [project_dir]
Creates a new Scrapy project named myproject in the project_dir directory. If project_dir is not given, it defaults to a directory with the same name as myproject.

C:\Users\m1812>scrapy startproject mytestproject
New Scrapy project 'mytestproject', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Users\m1812\mytestproject

You can start your first spider with:
    cd mytestproject
    scrapy genspider example example.com
C:\Users\m1812>cd mytestproject

C:\Users\m1812\mytestproject>tree
文件夹 PATH 列表
卷序列号为 5680-D4D0
C:.
└─mytestproject
    ├─spiders
    │  └─__pycache__
    └─__pycache__
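
The tree command above only lists directories; the full layout that startproject generates looks roughly like this (the exact set of files may differ slightly between Scrapy versions):

mytestproject/
    scrapy.cfg            # deploy/configuration file
    mytestproject/        # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders live
            __init__.py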

2. Generate a spider: genspider

Inside the project directory created above:
scrapy genspider mydomain mydomain.com

C:\Users\m1812\mytestproject>scrapy genspider baidu www.baidu.com
Created spider 'baidu' using template 'basic' in module:
  mytestproject.spiders.baidu



Let's look at genspider's detailed usage:

C:\Users\m1812\mytestproject>scrapy genspider -h
Usage
=====
  scrapy genspider [options] <name> <domain>

Generate new spider using pre-defined templates

Options
=======
--help, -h              show this help message and exit
--list, -l              List available templates
--edit, -e              Edit spider after creating it
--dump=TEMPLATE, -d TEMPLATE
                        Dump template to standard output
--template=TEMPLATE, -t TEMPLATE
                        Uses a custom template.
--force                 If the spider already exists, overwrite it with the
                        template

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

Using a template: -t TEMPLATE
Available template types:

C:\Users\m1812\mytestproject>scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

Test generating a spider from a template:

C:\Users\m1812\mytestproject>scrapy genspider -t crawl zhihu www.zhihu.com
Created spider 'zhihu' using template 'crawl' in module:
  mytestproject.spiders.zhihu
Compare it with the baidu spider generated earlier to see what the crawl template adds; a rough sketch of both generated files follows.
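
For reference, this is roughly what the two generated files look like (contents come from the built-in templates and may differ slightly between Scrapy versions):

# spiders/baidu.py -- generated from the 'basic' template
import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        pass

# spiders/zhihu.py -- generated from the 'crawl' template
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ZhihuSpider(CrawlSpider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']

    # the crawl template adds link-following rules on top of the basic spider
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item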

3. Run a spider: crawl

scrapy crawl <spider>

C:\Users\m1812\mytestproject>scrapy crawl zhihu
2019-04-06 15:14:18 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: mytestproject)
2019-04-06 15:14:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mytestproject.spiders', 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'mytestproject', 'SPIDER_MODULES': ['mytestproject.spiders']}
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 15:14:18 [scrapy.core.engine] INFO: Spider opened
2019-04-06 15:14:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 15:14:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 15:14:23 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://www.zhihu.com/robots.txt> (referer: None)
2019-04-06 15:14:28 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://www.zhihu.com/> (referer: None)
2019-04-06 15:14:28 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://www.zhihu.com/>: HTTP status code is not handled or not allowed
2019-04-06 15:14:28 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 15:14:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 527,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 813,
 'downloader/response_count': 2,
 'downloader/response_status_count/400': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 7, 14, 28, 947408),
 'log_count/DEBUG': 3,
 'log_count/INFO': 8,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 7, 14, 18, 593508)}
2019-04-06 15:14:28 [scrapy.core.engine] INFO: Spider closed (finished)

Zhihu requires certain request headers (such as a browser-like User-Agent) before it will serve the page, which is why the status code here indicates a failed request (400).
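
To make the request succeed, one option (a minimal sketch, assuming a typical browser User-Agent string) is to adjust the project's settings.py:

# settings.py -- illustrative values only
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0 Safari/537.36')
# the log above shows robots.txt also being fetched and rejected;
# disable robots handling only if you accept the implications
ROBOTSTXT_OBEY = False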

4. Check the code: check

scrapy check [-l] <spider>

C:\Users\m1812\mytestproject>scrapy check

----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK

If you deliberately break the code (here one quotation mark was deleted from the URL in zhihu.py) and run the command again, the error is caught:

C:\Users\m1812\mytestproject>scrapy check
Traceback (most recent call last):
  File "C:\Users\m1812\Anaconda3\Scripts\scrapy-script.py", line 5, in 
    sys.exit(scrapy.cmdline.execute())
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\cmdline.py", line 141, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 238, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 129, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 325, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 45, in from_settings
    return cls(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 23, in __init__
    self._load_all_spiders()
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 32, in _load_all_spiders
    for module in walk_modules(name):
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\utils\misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "C:\Users\m1812\Anaconda3\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "", line 986, in _gcd_import
  File "", line 969, in _find_and_load
  File "", line 958, in _find_and_load_unlocked
  File "", line 673, in _load_unlocked
  File "", line 661, in exec_module
  File "", line 767, in get_code
  File "", line 727, in source_to_code
  File "", line 222, in _call_with_frames_removed
  File "C:\Users\m1812\mytestproject\mytestproject\spiders\zhihu.py", line 10
    start_urls = [http://www.zhihu.com/']

In everyday use this command isn't needed much; it is built around spider contracts, which is why the run above reports "Ran 0 contracts" (a sketch of a contract follows below).
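
A contract is declared in a callback's docstring and tells check which URL to fetch and what the callback should return. A hedged sketch (the @url/@returns/@scrapes lines are standard Scrapy contracts; the URL and numbers here are just examples):

    def parse(self, response):
        """Parse a quotes listing page.

        @url http://quotes.toscrape.com/
        @returns items 1 10
        @returns requests 0
        @scrapes text author tags
        """
        ...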

5. List the spiders available in the project: list

scrapy list

C:\Users\m1812\mytestproject>scrapy list
baidu
zhihu

6. Edit a spider: edit

scrapy edit <spider>
It doesn't seem to work on Windows, and it's rarely needed anyway; editing the spider in an IDE such as PyCharm works just as well.

7. Fetch a URL: fetch

This is a global command: scrapy fetch [options] <url>
Detailed usage:

C:\Users\m1812\mytestproject>scrapy fetch -h
Usage
=====
  scrapy fetch [options] <url>

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h              show this help message and exit
--spider=SPIDER         use this spider
--headers               print response HTTP headers instead of body
--no-redirect           do not handle HTTP 3xx status codes and print response
                        as-is

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

Let's fetch Baidu's homepage. Note that the URL must include the http:// prefix.

C:\Users\m1812>scrapy fetch http://www.baidu.com
2019-04-06 15:44:51 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 15:44:51 [scrapy.utils.log] INFO: Overridden settings: {}
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Spider opened
2019-04-06 15:44:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 15:44:51 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 15:44:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 15:44:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1476,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 7, 44, 51, 989960),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 7, 44, 51, 759268)}
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Spider closed (finished)

(The raw HTML of the page is printed to stdout. The title 百度一下,你就知道 appears as mojibake in this Windows console because the UTF-8 bytes are rendered with the GBK codepage.)

Try it again without log output:

C:\Users\m1812>scrapy fetch --nolog http://www.baidu.com

(The same HTML is printed again, with the same garbled title.)

Fetch the response headers instead of the body:

C:\Users\m1812>scrapy fetch --nolog --headers http://www.baidu.com
> User-Agent: Scrapy/1.3.3 (+http://scrapy.org)
> Accept-Language: en
> Accept-Encoding: gzip,deflate
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
< Date: Sat, 06 Apr 2019 07:48:42 GMT
< Server: bfe/1.0.8.18
< Content-Type: text/html
< Last-Modified: Mon, 23 Jan 2017 13:28:12 GMT
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Pragma: no-cache
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/

There are several other options as well, for example --no-redirect, which disables handling of HTTP 3xx redirects.
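
Because fetch writes the body to stdout, a handy trick (plain shell redirection, not a Scrapy option) is to save the page for offline inspection:

scrapy fetch --nolog http://www.baidu.com > baidu.html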

8. Open a URL in the browser as Scrapy sees it: view

This is a global command: scrapy view [options] <url>
It downloads the URL and opens it in your browser, showing the content exactly as Scrapy received it. A spider sometimes sees a different page than a regular browser does, so this is a quick way to check whether what the spider sees matches what you expect.

C:\Users\m1812>scrapy view http://www.baidu.com
2019-04-06 16:01:45 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:01:45 [scrapy.utils.log] INFO: Overridden settings: {}
2019-04-06 16:01:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:01:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:01:46 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:01:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 16:01:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1476,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 8, 1, 46, 435330),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 8, 1, 46, 78537)}
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Spider closed (finished)

Try it on Taobao: much of the page fails to load, which shows that Taobao loads its content asynchronously with Ajax, so an ordinary request cannot retrieve that information.
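
A related debugging trick: the same "see what Scrapy sees" check can be done from inside a callback with scrapy.utils.response.open_in_browser (a brief sketch; drop it into any spider's parse method while debugging):

from scrapy.utils.response import open_in_browser

def parse(self, response):
    # opens the response Scrapy downloaded in the local browser, like `scrapy view`
    open_in_browser(response)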


9. Interactive console for a URL: shell

This is a global command: scrapy shell [options] <url>

C:\Users\m1812>scrapy shell http://www.baidu.com
2019-04-06 16:11:41 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:11:41 [scrapy.utils.log] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2019-04-06 16:11:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:11:42 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:11:42 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:11:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
2019-04-06 16:11:42 [traitlets] DEBUG: Using default logger
2019-04-06 16:11:42 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET http://www.baidu.com>
[s]   response   <200 http://www.baidu.com>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: scrapy
Out[1]: <module 'scrapy' from '...'>

In [2]: request
Out[2]: <GET http://www.baidu.com>

In [3]: response
Out[3]: <200 http://www.baidu.com>

In [4]: view(response)
Out[4]: True

In [5]: response.text
Out[5]: '\r\n ... 百度一下,你就知道 ... 关于百度 About Baidu ... ©2017 Baidu 使用百度前必读  意见反馈 京ICP证030173号 ... \r\n'
(the full HTML source is returned; the markup is omitted in this excerpt)

In [6]: response.headers
Out[6]: {b'Cache-Control': b'private, no-cache, no-store, proxy-revalidate, no-transform', b'Content-Type': b'text/html', b'Date': b'Sat, 06 Apr 2019 08:11:42 GMT', b'Last-Modified': b'Mon, 23 Jan 2017 13:28:12 GMT', b'Pragma': b'no-cache', b'Server': b'bfe/1.0.8.18', b'Set-Cookie': b'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/'}

In [7]: response.css('title::text').extract_first()
Out[7]: '百度一下,你就知道'

In [8]: exit()
The In [4] call (view(response)) opens the downloaded response in the local browser.
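
In a fresh shell session you can also move on to other pages without restarting, for example (illustrative only, output omitted):

In [1]: fetch('http://quotes.toscrape.com')
In [2]: response.css('.quote .text::text').extract_first()
In [3]: view(response)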

10. Parse a URL with a specified spider callback: parse

scrapy parse <url> [options]
Here we test with the spider from the previous note, which scrapes quotes.toscrape.com; the -c option names the spider callback to run on the response (here the parse method).

C:\Users\m1812>cd quotetutorial

C:\Users\m1812\quotetutorial>scrapy parse http://quotes.toscrape.com -c parse
2019-04-06 16:24:23 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-06 16:24:23 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['quotetutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'BOT_NAME': 'quotetutorial'}
2019-04-06 16:24:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled item pipelines:
['quotetutorial.pipelines.QuotetutorialPipeline',
 'quotetutorial.pipelines.MongoPipeline']
2019-04-06 16:24:24 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:24:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:24:24 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:24:24 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-06 16:24:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-04-06 16:24:25 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 16:24:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 444,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2701,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 8, 24, 25, 485334),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 8, 24, 24, 258282)}
2019-04-06 16:24:25 [scrapy.core.engine] INFO: Spider closed (finished)

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'author': 'Albert Einstein',
  'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
  'text': '“The world as we have created it is a process of our thinking. It '
          'cannot be changed without changing our thinking.”'},
 {'author': 'J.K. Rowling',
  'tags': ['abilities', 'choices'],
  'text': '“It is our choices, Harry, that show what we truly are, far more '
          'than our abilities.”'},
 {'author': 'Albert Einstein',
  'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
  'text': '“There are only two ways to live your life. One is as though '
          'nothing is a miracle. The other is as though everything is a '
          'miracle.”'},
 {'author': 'Jane Austen',
  'tags': ['aliteracy', 'books', 'classic', 'humor'],
  'text': '“The person, be it gentleman or lady, who has not pleasure in a '
          'good novel, must be intolerably stupid.”'},
 {'author': 'Marilyn Monroe',
  'tags': ['be-yourself', 'inspirational'],
  'text': "“Imperfection is beauty, madness is genius and it's better to be "
          'absolutely ridiculous than absolutely boring.”'},
 {'author': 'Albert Einstein',
  'tags': ['adulthood', 'success', 'value'],
  'text': '“Try not to become a man of success. Rather become a man of '
          'value.”'},
 {'author': 'André Gide',
  'tags': ['life', 'love'],
  'text': '“It is better to be hated for what you are than to be loved for '
          'what you are not.”'},
 {'author': 'Thomas A. Edison',
  'tags': ['edison', 'failure', 'inspirational', 'paraphrased'],
  'text': "“I have not failed. I've just found 10,000 ways that won't work.”"},
 {'author': 'Eleanor Roosevelt',
  'tags': ['misattributed-eleanor-roosevelt'],
  'text': '“A woman is like a tea bag; you never know how strong it is until '
          "it's in hot water.”"},
 {'author': 'Steve Martin',
  'tags': ['humor', 'obvious', 'simile'],
  'text': '“A day without sunshine is like, you know, night.”'}]

# Requests  -----------------------------------------------------------------
[]

The command prints the Scraped Items produced by the callback and the follow-up Requests it yielded (none here).
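
For context, the callback that produced these items (from the previous note's quotetutorial project) looks roughly like this; the CSS selectors and field names are inferred from the output above rather than copied from that note:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # one item per quote block on the page
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').extract_first(),
                'author': quote.css('.author::text').extract_first(),
                'tags': quote.css('.tags .tag::text').extract(),
            }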

11. Get Scrapy settings values: settings

scrapy settings [options]

C:\Users\m1812\quotetutorial>scrapy settings -h
Usage
=====
  scrapy settings [options]

Get settings values

Options
=======
--help, -h              show this help message and exit
--get=SETTING           print raw setting value
--getbool=SETTING       print setting value, interpreted as a boolean
--getint=SETTING        print setting value, interpreted as an integer
--getfloat=SETTING      print setting value, interpreted as a float
--getlist=SETTING       print setting value, interpreted as a list

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

Test it:

C:\Users\m1812\quotetutorial>scrapy settings --get MONGO_URI
localhost
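
MONGO_URI is not a built-in Scrapy setting; it comes from this project's settings.py and is read by the MongoPipeline listed in the crawl log earlier. A hedged sketch of how the two pieces usually fit together (the database and collection names are assumptions):

# settings.py
MONGO_URI = 'localhost'
MONGO_DB = 'quotestutorial'

# pipelines.py
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # reads the same values that `scrapy settings --get MONGO_URI` prints
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['quotes'].insert_one(dict(item))
        return item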

12. Run a spider file: runspider

Unlike crawl, runspider runs a spider file directly by its filename (xxx.py), so you first have to change into the directory that contains the file.
scrapy runspider <spider_file.py>

C:\Users\m1812\quotetutorial>cd quotetutorial

C:\Users\m1812\quotetutorial\quotetutorial>dir
 驱动器 C 中的卷没有标签。
 卷的序列号是 5680-D4D0

 C:\Users\m1812\quotetutorial\quotetutorial 的目录

2019/04/05  22:44    <DIR>          .
2019/04/05  22:44    <DIR>          ..
2019/04/05  20:04               364 items.py
2019/04/05  19:16             1,887 middlewares.py
2019/04/05  22:35             1,431 pipelines.py
2019/04/05  22:44             3,292 settings.py
2019/04/05  22:02    <DIR>          spiders
2017/03/10  23:31                 0 __init__.py
2019/04/06  14:33    <DIR>          __pycache__
               5 个文件          6,974 字节
               4 个目录 28,533,673,984 可用字节

C:\Users\m1812\quotetutorial\quotetutorial>cd spiders

C:\Users\m1812\quotetutorial\quotetutorial\spiders>dir
 驱动器 C 中的卷没有标签。
 卷的序列号是 5680-D4D0

 C:\Users\m1812\quotetutorial\quotetutorial\spiders 的目录

2019/04/05  22:02    <DIR>          .
2019/04/05  22:02    <DIR>          ..
2019/04/05  22:02               914 quotes.py
2017/03/10  23:31               161 __init__.py
2019/04/05  22:02    <DIR>          __pycache__
               2 个文件          1,075 字节
               3 个目录 28,533,673,984 可用字节
C:\Users\m1812\quotetutorial\quotetutorial\spiders>scrapy runspider quotes.py

The result is the same as running the spider with crawl.
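
runspider is handiest for quick, self-contained spiders that don't belong to any project. A minimal sketch (file name and selectors are illustrative):

# standalone_quotes.py -- run with: scrapy runspider standalone_quotes.py
import scrapy

class StandaloneQuotesSpider(scrapy.Spider):
    name = 'standalone_quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').extract_first(),
                'author': quote.css('.author::text').extract_first(),
            }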

13. Show the version: version

Shows Scrapy's version information; with -v it also prints the versions of the related dependency libraries.

C:\Users\m1812\quotetutorial>scrapy version -v
Scrapy    : 1.3.3
lxml      : 3.6.4.0
libxml2   : 2.9.4
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.17.0
Twisted   : 17.5.0
Python    : 3.5.2 |Anaconda 4.2.0 (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.2j  26 Sep 2016)
Platform  : Windows-10-10.0.17134-SP0

14. Benchmark crawl speed: bench

bench spins up a simple local HTTP server and crawls it as fast as possible for a fixed period, so the result is a rough upper bound for this machine rather than a real-site crawl rate.

C:\Users\m1812>scrapy bench
2019-04-06 16:43:34 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:43:34 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:43:37 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:43:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:38 [scrapy.extensions.logstats] INFO: Crawled 61 pages (at 3660 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:39 [scrapy.extensions.logstats] INFO: Crawled 109 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:40 [scrapy.extensions.logstats] INFO: Crawled 157 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:41 [scrapy.extensions.logstats] INFO: Crawled 205 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:42 [scrapy.extensions.logstats] INFO: Crawled 245 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:43 [scrapy.extensions.logstats] INFO: Crawled 285 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:44 [scrapy.extensions.logstats] INFO: Crawled 317 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:45 [scrapy.extensions.logstats] INFO: Crawled 357 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:46 [scrapy.extensions.logstats] INFO: Crawled 389 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:47 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2019-04-06 16:43:47 [scrapy.extensions.logstats] INFO: Crawled 429 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:48 [scrapy.extensions.logstats] INFO: Crawled 445 pages (at 960 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 182101,
 'downloader/request_count': 445,
 'downloader/request_method_count/GET': 445,
 'downloader/response_bytes': 1209563,
 'downloader/response_count': 445,
 'downloader/response_status_count/200': 445,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 4, 6, 8, 43, 48, 395684),
 'log_count/INFO': 18,
 'request_depth_max': 16,
 'response_received_count': 445,
 'scheduler/dequeued': 445,
 'scheduler/dequeued/memory': 445,
 'scheduler/enqueued': 8901,
 'scheduler/enqueued/memory': 8901,
 'start_time': datetime.datetime(2019, 4, 6, 8, 43, 37, 309871)}
2019-04-06 16:43:48 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

That works out to roughly 2000 pages per minute on this machine.
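
The global -s option works here as well, so you can experiment with how settings affect throughput; for example (CONCURRENT_REQUESTS is a standard Scrapy setting, the value is just an example):

scrapy bench -s CONCURRENT_REQUESTS=32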
