Scrapy is a web scraping framework implemented in Python. This article gives an overview of Scrapy, explains how to install it, and walks through a simple scrapy shell example that fetches a page title, to give a first feel for how Scrapy is used.
Scrapy is well suited to crawling websites and extracting structured data. Compared with Apache Nutch, which is aimed at general-purpose search, it is smaller and more flexible. An overview is given in the table below:
Item | Description |
---|---|
Official site | https://scrapy.org/ |
Open source / closed source | Open source |
Source repository | https://github.com/scrapy/scrapy |
Implementation language | Python |
Current stable version | 2.0.1 (2020/03) |
Scrapy can be installed directly with pip:
Command: pip install scrapy
This article uses an environment where Python 3 and Python 2 coexist, so pip3 is used for the installation. The installation log is shown below:
liumiaocn:scrapy liumiao$ pip3 install scrapy
Collecting scrapy
Downloading
... (output omitted)
Successfully built protego PyDispatcher zope.interface
Installing collected packages: six, pycparser, cffi, cryptography, pyasn1, pyasn1-modules, attrs, service-identity, protego, cssselect, pyOpenSSL, w3lib, PyDispatcher, incremental, constantly, Automat, PyHamcrest, zope.interface, idna, hyperlink, Twisted, lxml, parsel, queuelib, scrapy
Successfully installed Automat-20.2.0 PyDispatcher-2.0.5 PyHamcrest-2.0.2 Twisted-20.3.0 attrs-19.3.0 cffi-1.14.0 constantly-15.1.0 cryptography-2.8 cssselect-1.1.0 hyperlink-19.0.0 idna-2.9 incremental-17.5.0 lxml-4.5.0 parsel-1.5.2 protego-0.1.16 pyOpenSSL-19.1.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 pycparser-2.20 queuelib-1.5.0 scrapy-2.0.1 service-identity-18.1.0 six-1.14.0 w3lib-1.21.0 zope.interface-5.0.1
liumiaocn:scrapy liumiao$
liumiaocn:scrapy liumiao$ scrapy -h
Scrapy 2.0.1 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy -h" to see more info about a command
liumiaocn:scrapy liumiao$
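The installation can also be verified from Python itself. A minimal check (the printed version simply reflects whatever pip installed, 2.0.1 in the log above):
# Sanity check: import Scrapy and print its version
import scrapy

print(scrapy.__version__)  # e.g. 2.0.1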
A crawler is, at its core, processing HTML. The simplest way to confirm what Scrapy can do is through scrapy shell, which provides an interactive console for scraping pages and is also useful for debugging extraction logic.
Run the following example command:
Command: scrapy shell https://scrapy.org/
liumiaocn:scrapy liumiao$ scrapy shell https://scrapy.org/
2020-03-28 05:38:09 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
2020-03-28 05:38:09 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.5 (default, Nov 1 2019, 02:16:32) - [Clang 11.0.0 (clang-1100.0.33.8)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Darwin-19.2.0-x86_64-i386-64bit
2020-03-28 05:38:09 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-03-28 05:38:09 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'LOGSTATS_INTERVAL': 0}
2020-03-28 05:38:09 [scrapy.extensions.telnet] INFO: Telnet Password: 5e36afd357190e93
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-03-28 05:38:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-03-28 05:38:09 [scrapy.core.engine] INFO: Spider opened
2020-03-28 05:38:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapy.org/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x...>
[s] item {}
[s] request <GET https://scrapy.org/>
[s] response <200 https://scrapy.org/>
[s] settings <scrapy.settings.Settings object at 0x...>
[s] spider <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
Type response.css('title') and press Enter; the title element appears in the output:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Scrapy | A Fast and Powerful ...'>]
>>>
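What is printed above is a SelectorList, not the extracted data itself. Selector methods such as .get() (an alias for extract_first()) pull out the matched content, and a ::text pseudo-element restricts the match to the text node. A short sketch continuing the same shell session (the output string simply reflects the page title at the time of writing):
>>> # ::text limits the match to the text inside <title>
>>> response.css('title::text').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
>>> # the equivalent XPath query
>>> response.xpath('//title/text()').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'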
The full title element, tags included, can be retrieved with extract_first():
>>> response.css('title').extract_first()
'<title>Scrapy | A Fast and Powerful Scraping and Web Crawling Framework</title>'
>>>
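The interactive steps above carry over directly to a standalone spider. Below is a minimal sketch (the file name title_spider.py and the spider name "title" are chosen here for illustration) that fetches the same page and yields its title; it can be run without creating a project:
Command: scrapy runspider title_spider.py
# title_spider.py: a minimal self-contained spider
import scrapy

class TitleSpider(scrapy.Spider):
    name = "title"                        # arbitrary spider name
    start_urls = ["https://scrapy.org/"]  # page(s) to fetch

    def parse(self, response):
        # same selector as in the shell session above
        yield {"title": response.css("title::text").get()}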