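Scrapy is a mature crawling framework that fetches web pages and extracts structured data from them, and many companies already run it in production. For more background, see the official website: www.scrapy.org.
We install it step by step following the official installation guide at http://doc.scrapy.org/en/latest/intro/install.html, which lists these requirements: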
- Requirements
- Python 2.5, 2.6, 2.7 (3.x is not yet supported)
- Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of this Twisted bug)
- w3lib
- lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
- simplejson (not required if using Python 2.6 or above)
- pyopenssl (for HTTPS support. Optional, but highly recommended)
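The version constraints in this list can be written down as a small check. This is only a sketch: the helper names `python_supported` and `needs_simplejson` are made up for illustration and are not part of Scrapy.

```python
import sys

# Hypothetical helpers that encode the requirements quoted above.


def python_supported(version):
    """Return True if a (major, minor, ...) tuple is Python 2.5, 2.6 or 2.7."""
    return (2, 5) <= tuple(version[:2]) <= (2, 7)


def needs_simplejson(version):
    """simplejson is only listed as required below Python 2.6."""
    return tuple(version[:2]) < (2, 6)


if __name__ == "__main__":
    print("supported by this Scrapy release: %s" % python_supported(sys.version_info))
    print("simplejson needed: %s" % needs_simplejson(sys.version_info))
```

Run with the interpreter you intend to use for Scrapy; the check looks only at the major and minor version numbers.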
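The rest of this post records the process from installing Python through installing Scrapy, and finishes by running a fetch command to verify the installation.
Preparation
Operating system: RHEL 5
Python version: Python-2.7.2
zope.interface version: zope.interface-3.8.0
Twisted version: Twisted-11.1.0
libxml2 version: libxml2-2.7.4
w3lib version: w3lib-1.0
Scrapy version: Scrapy-0.14.0.2841
Installation and configuration
1. Install zlib
First check whether zlib is already installed on your system. zlib is a data-compression library that the Scrapy stack depends on. I am using RHEL 5 and check for it like this: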
- [root@localhost scrapy]# rpm -qa zlib
- zlib-1.2.3-3
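It is installed by default on my system; if it is on yours too, skip this step. If not, download it from http://www.zlib.net/ and install it. Assuming you downloaded zlib-1.2.5.tar.gz, the commands are: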
- [root@localhost scrapy]# tar -xvzf zlib-1.2.5.tar.gz
- [root@localhost scrapy]# cd zlib-1.2.5
- [root@localhost zlib-1.2.5]# ./configure
- [root@localhost zlib-1.2.5]# make
- [root@localhost zlib-1.2.5]# make install
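2. Install Python
My system already had Python 2.4. Following the official requirements and recommendations, I chose Python-2.7.2, which is available from:
http://www.python.org/download/ (a proxy may be needed)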
http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
I downloaded the Python source code, compiled it, and installed it as follows:
- [root@localhost scrapy]# tar -zvxf Python-2.7.2.tgz
- [root@localhost scrapy]# cd Python-2.7.2
- [root@localhost Python-2.7.2]# ./configure
- [root@localhost Python-2.7.2]# make
- [root@localhost Python-2.7.2]# make install
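By default, the interpreter is installed as /usr/local/bin/python and its libraries go to /usr/local/lib/python2.7.
If no other Python was previously installed on your system, run it from the command line: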
- [root@localhost scrapy]# python
- Python 2.7.2 (default, Dec 5 2011, 22:04:07)
- [GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
- Type "help", "copyright", "credits" or "license" for more information.
- >>>
This shows that the newly installed Python is ready to use.
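Since zlib was installed before building Python, it may be worth confirming that the new interpreter can actually use it; a Python built without the zlib headers ends up without the `zlib` module. A minimal sketch, where `zlib_round_trip` is an illustrative name:

```python
import sys
import zlib


def zlib_round_trip(payload):
    """Compress and decompress payload; True if it comes back unchanged."""
    return zlib.decompress(zlib.compress(payload)) == payload


if __name__ == "__main__":
    print("Python %d.%d.%d" % sys.version_info[:3])
    print("zlib library version: %s" % zlib.ZLIB_VERSION)
    print("round trip ok: %s" % zlib_round_trip(b"scrapy" * 100))
```

If `import zlib` fails here, the usual remedy is to install the zlib development files and rebuild Python.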
If your system has another version of Python (mine had 2.4), replace /usr/bin/python with a symbolic link to the new interpreter:
- [root@localhost python2.7]# mv /usr/bin/python /usr/bin/python.bak
- [root@localhost python2.7]# ln -s /usr/local/bin/python /usr/bin/python
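After this, running python starts the new version. Be aware that yum on RHEL 5 runs on the system Python 2.4, so its script may need to be pointed at the old interpreter (saved above as /usr/bin/python.bak).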
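To see which interpreter the name `python` now resolves to, and where a symbolic link ends up, something like the following can help. It is a sketch only; `find_on_path` and `describe` are illustrative helpers written with the standard `os` module so that they also run on Python 2.7.

```python
import os


def find_on_path(name, path=None):
    """Return the first executable called name on the search path, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def describe(name="python"):
    """Describe where name is found and what it points to after resolving links."""
    found = find_on_path(name)
    if found is None:
        return "%s: not found on PATH" % name
    return "%s -> %s" % (found, os.path.realpath(found))


if __name__ == "__main__":
    print(describe("python"))
```

`os.path.realpath` follows the symbolic link, so the right-hand side of the output is the actual binary that will run.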
3. Install setuptools
setuptools is a tool for managing Python modules; skip this step if it is already installed. If you need to install it, see these links:
http://pypi.python.org/pypi/setuptools/0.6c11#installation-instructions
http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea
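However, after installing Python-2.7.2 you will find a setup.py script in the extracted Python source directory. It builds and installs modules that belong to Python itself (it does not install setuptools); run it like this: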
- [root@localhost Python-2.7.2]# python setup.py install
After it runs, the relevant Python modules are installed under /usr/local/lib/python2.7/site-packages.
4. Install zope.interface
Download links:
http://pypi.python.org/pypi/zope.interface/3.8.0
http://pypi.python.org/packages/source/z/zope.interface/zope.interface-3.8.0.tar.gz#md5=8ab837320b4532774c9c89f030d2a389
Install it as follows:
- [root@localhost scrapy]# tar -xvzf zope.interface-3.8.0.tar.gz
- [root@localhost scrapy]# cd zope.interface-3.8.0
- [root@localhost zope.interface-3.8.0]# python setup.py build
- [root@localhost zope.interface-3.8.0]# python setup.py install
After installation, zope and zope.interface-3.8.0-py2.7.egg-info appear under /usr/local/lib/python2.7/site-packages.
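The PyPI download links in this guide carry an `#md5=` fragment, which can be used to check a downloaded archive before unpacking it. The sketch below assumes the archive is already on disk; `expected_md5`, `file_md5`, and `matches` are illustrative names.

```python
import hashlib


def expected_md5(url):
    """Return the hex digest that follows '#md5=' in url, or None if absent."""
    marker = "#md5="
    if marker not in url:
        return None
    return url.rsplit(marker, 1)[1].lower()


def file_md5(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, url):
    """True if the file's MD5 equals the digest embedded in the URL."""
    expected = expected_md5(url)
    return expected is not None and file_md5(path) == expected
```

For example, `matches("zope.interface-3.8.0.tar.gz", url)` with the second link above as `url` compares the local file against the published digest. MD5 guards against corrupted downloads, not against deliberate tampering.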
5. Install Twisted
Download links:
http://twistedmatrix.com/trac/
http://pypi.python.org/packages/source/T/Twisted/Twisted-11.1.0.tar.bz2#md5=972f3497e6e19318c741bf2900ffe31c
Install it as follows:
- [root@localhost scrapy]# bzip2 -d Twisted-11.1.0.tar.bz2
- [root@localhost scrapy]# tar -xvf Twisted-11.1.0.tar
- [root@localhost scrapy]# cd Twisted-11.1.0
- [root@localhost Twisted-11.1.0]# python setup.py install
After installation, twisted and Twisted-11.1.0-py2.7.egg-info appear under /usr/local/lib/python2.7/site-packages.
6. Install w3lib
Download links:
http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21
Install it as follows:
- [root@localhost scrapy]# tar -xvzf w3lib-1.0.tar.gz
- [root@localhost scrapy]# cd w3lib-1.0
- [root@localhost w3lib-1.0]# python setup.py install
After installation, w3lib and w3lib-1.0-py2.7.egg-info appear under /usr/local/lib/python2.7/site-packages.
7. Install libxml2
Download links:
http://download.chinaunix.net/download.php?id=28497&ResourceID=6095
http://download.chinaunix.net/down.php?id=28497&ResourceID=6095&site=1
Alternatively, you can find the archive for the version you need at http://xmlsoft.org.
Install it as follows:
- [root@localhost scrapy]# tar -xvzf libxml2-2.7.4.tar.gz
- [root@localhost scrapy]# cd libxml2-2.7.4
- [root@localhost libxml2-2.7.4]# ./configure
- [root@localhost libxml2-2.7.4]# make
- [root@localhost libxml2-2.7.4]# make install
8. Install pyOpenSSL
This step is optional. The package can be downloaded from:
https://launchpad.net/pyopenssl
Pick whichever version you need. I skipped this step.
9. Install Scrapy
Download links:
http://scrapy.org/download/
http://pypi.python.org/pypi/Scrapy
http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.0.2841.tar.gz#md5=fe63c5606ca4c0772d937b51869be200
Install it as follows:
- [root@localhost scrapy]# tar -xvzf Scrapy-0.14.0.2841.tar.gz
- [root@localhost scrapy]# cd Scrapy-0.14.0.2841
- [root@localhost Scrapy-0.14.0.2841]# python setup.py install
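Before running Scrapy itself, a quick import check can show whether each package from the previous steps is visible to the new interpreter. This is a sketch under assumptions: the import names in `REQUIRED` and `OPTIONAL` are the ones these packages are generally known by (for instance, pyOpenSSL is imported as `OpenSSL`, and the libxml2 Python bindings as `libxml2`), and `check_modules` is an illustrative helper.

```python
import importlib

# Import names assumed for the packages installed in the steps above.
REQUIRED = ["zope.interface", "twisted", "w3lib", "libxml2", "scrapy"]
OPTIONAL = ["OpenSSL"]  # provided by pyOpenSSL, skipped in this guide


def check_modules(names):
    """Map each module name to True if it can be imported, else False."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in sorted(check_modules(REQUIRED + OPTIONAL).items()):
        print("%-16s %s" % (name, "ok" if ok else "MISSING"))
```

A module reported as MISSING usually means that its setup.py was run with a different interpreter, or that its bindings were not built.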
Verifying the installation
Scrapy is now installed. We can verify it with the following command:
- [root@localhost scrapy]# scrapy
- Scrapy 0.14.0.2841 - no active project
-
- Usage:
- scrapy <command> [options] [args]
-
- Available commands:
- fetch Fetch a URL using the Scrapy downloader
- runspider Run a self-contained spider (without creating a project)
- settings Get settings values
- shell Interactive scraping console
- startproject Create new project
- version Print Scrapy version
- view Open URL in browser, as seen by Scrapy
-
- Use "scrapy <command> -h" to see more info about a command
The output lists a fetch command, which downloads a given web page. First look at its help text:
- [root@localhost scrapy]# scrapy fetch --help
- Usage
- =====
- scrapy fetch [options] <url>
-
- Fetch a URL using the Scrapy downloader and print its content to stdout. You
- may want to use --nolog to disable logging
-
- Options
- =======
- --help, -h show this help message and exit
- --spider=SPIDER use this spider
- --headers print response HTTP headers instead of body
-
- Global Options
- --------------
- --logfile=FILE log file. if omitted stderr will be used
- --loglevel=LEVEL, -L LEVEL
- log level (default: DEBUG)
- --nolog disable logging completely
- --profile=FILE write python cProfile stats to FILE
- --lsprof=FILE write lsprof profiling stats to FILE
- --pidfile=FILE write process ID to FILE
- --set=NAME=VALUE, -s NAME=VALUE
- set/override setting (may be repeated)
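The usage line and options shown in this help text can be assembled into a command programmatically, which is handy when fetching from a script. The sketch below uses only flags that appear in the help output (`--nolog`, `--headers`, `--loglevel`); the function names are illustrative, and `fetch_to_file` assumes the `scrapy` command is on the PATH.

```python
import subprocess


def build_fetch_command(url, nolog=True, headers=False, loglevel=None):
    """Build the argument list for: scrapy fetch [options] <url>."""
    command = ["scrapy", "fetch"]
    if nolog:
        command.append("--nolog")  # disable logging completely
    if headers:
        command.append("--headers")  # print response headers instead of body
    if loglevel is not None:
        command.append("--loglevel=%s" % loglevel)
    command.append(url)
    return command


def fetch_to_file(url, destination):
    """Run scrapy fetch and write its stdout to destination; return the exit code."""
    with open(destination, "wb") as out:
        return subprocess.call(build_fetch_command(url), stdout=out)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.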
Following the help text, pass a URL to fetch a page:
- [root@localhost scrapy]# scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
- 2011-12-05 23:40:04+0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: scrapybot)
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled item pipelines:
- 2011-12-05 23:40:05+0800 [default] INFO: Spider opened
- 2011-12-05 23:40:05+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
- 2011-12-05 23:40:07+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
- 2011-12-05 23:40:07+0800 [default] INFO: Closing spider (finished)
- 2011-12-05 23:40:07+0800 [default] INFO: Dumping spider stats:
- {'downloader/request_bytes': 227,
- 'downloader/request_count': 1,
- 'downloader/request_method_count/GET': 1,
- 'downloader/response_bytes': 22676,
- 'downloader/response_count': 1,
- 'downloader/response_status_count/200': 1,
- 'finish_reason': 'finished',
- 'finish_time': datetime.datetime(2011, 12, 5, 15, 40, 7, 918833),
- 'scheduler/memory_enqueued': 1,
- 'start_time': datetime.datetime(2011, 12, 5, 15, 40, 5, 5749)}
- 2011-12-05 23:40:07+0800 [default] INFO: Spider closed (finished)
- 2011-12-05 23:40:07+0800 [scrapy] INFO: Dumping global stats:
- {'memusage/max': 17711104, 'memusage/startup': 17711104}
- [root@localhost scrapy]# ll install.html
- -rw-r--r-- 1 root root 22404 Dec 5 23:40 install.html
- [root@localhost scrapy]#
As the log shows, we have successfully fetched a web page.
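The spider stats in the log also tell us how long the fetch took. The `start_time` and `finish_time` values read 15:40 while the log lines read 23:40+0800, so they appear to be in UTC. A short worked calculation with the values copied from the log:

```python
from datetime import datetime, timedelta

# Values copied from the "Dumping spider stats" section of the log above.
start_time = datetime(2011, 12, 5, 15, 40, 5, 5749)
finish_time = datetime(2011, 12, 5, 15, 40, 7, 918833)

elapsed = finish_time - start_time
elapsed_seconds = elapsed.total_seconds()

# Shift by the +0800 offset shown in the log timestamps.
local_finish = finish_time + timedelta(hours=8)

print("elapsed: %.3f s" % elapsed_seconds)  # elapsed: 2.913 s
print("finish (UTC+8): %s" % local_finish.strftime("%Y-%m-%d %H:%M:%S"))  # 2011-12-05 23:40:07
```

The shifted finish time matches the "Closing spider (finished)" line at 23:40:07, so the whole run took a little under three seconds.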
From here, you can follow the official guide to put Scrapy to further use; the tutorial is at http://doc.scrapy.org/en/latest/intro/tutorial.html.