[Python Notes] Installing the Scrapy Web Scraping Framework

Scrapy is a lightweight crawler framework implemented in Python that uses Twisted for asynchronous fetching. For an introduction to Scrapy, see the official document Scrapy at a Glance; for how practitioners rate it, see the Quora thread "Is there a better crawler than Scrapy?".

This post covers two ways to install Scrapy: 1) via pip; 2) by building from source.

1. Installing Scrapy via pip
Installing Scrapy with pip is also the approach recommended by the Scrapy Installation Guide.

1.1 Installing pip
pip is an open-source Python package manager. According to its official documentation, installing pip takes only two steps: 1) download get-pip.py; 2) run python get-pip.py. On my machine, however, step 2 fails with the following error:

$ python get-pip.py   
Collecting pip
  Could not find any downloads that satisfy the requirement pip
  No distributions at all found for pip
It is unclear whether this is related to the corporate network setup (does the network block access to the PyPI sources? to be verified). As a workaround, I downloaded the pip source tarball (from here), unpacked it, and installed it manually with python setup.py build followed by python setup.py install.
As long as the easy_install tool is already present in the Python environment, installing pip from source generally succeeds, so I will not go into the details here.

1.2 Installing Scrapy
With pip installed, run:
pip install Scrapy
Normally pip fetches and builds Scrapy's other Python dependencies automatically. However, pip may delegate dependency fetching to setuptools, and the latter can fail because of OpenSSL issues. In that case, we can bypass setuptools and have pip fetch the failing package directly:
pip install --cert /home/slvher/tools/https-ca/https/cacert.pem -U -i https://pypi.python.org/simple six
The meanings of these options are documented in the pip user guide, so I will not repeat them here.
Once the installation finally completes, running pip install Scrapy again produces output like the following:
Requirement already satisfied (use --upgrade to upgrade): Scrapy in ./tools/Python-2.7.8/lib/python2.7/site-packages/Scrapy-0.24.4-py2.7.egg
Requirement already satisfied (use --upgrade to upgrade): Twisted>=10.0.0 in ./tools/Python-2.7.8/lib/python2.7/site-packages/Twisted-14.0.2-py2.7-linux-x86_64.egg (from Scrapy)
Requirement already satisfied (use --upgrade to upgrade): w3lib>=1.8.0 in ./tools/Python-2.7.8/lib/python2.7/site-packages/w3lib-1.10.0-py2.7.egg (from Scrapy)
Requirement already satisfied (use --upgrade to upgrade): queuelib in ./tools/Python-2.7.8/lib/python2.7/site-packages/queuelib-1.2.2-py2.7.egg (from Scrapy)
Requirement already satisfied (use --upgrade to upgrade): lxml in ./tools/Python-2.7.8/lib/python2.7/site-packages (from Scrapy)
Requirement already satisfied (use --upgrade to upgrade): pyOpenSSL in ./tools/Python-2.7.8/lib/python2.7/site-packages/pyOpenSSL-0.14-py2.7.egg (from Scrapy)
Requirement already satisfied (use --upgrade to upgrade): cssselect>=0.9 in ./tools/Python-2.7.8/lib/python2.7/site-packages/cssselect-0.9.1-py2.7.egg (from Scrapy)
Requirement already satisfied (use --upgrade to upgrade): six>=1.5.2 in ./tools/Python-2.7.8/lib/python2.7/site-packages/six-1.8.0-py2.7.egg (from Scrapy)
Requirement already satisfied (use --upgrade to upgrade): zope.interface>=3.6.0 in ./tools/Python-2.7.8/lib/python2.7/site-packages/zope.interface-4.1.2-py2.7-linux-x86_64.egg (from Twisted>=10.0.0->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): cryptography>=0.2.1 in ./tools/Python-2.7.8/lib/python2.7/site-packages/cryptography-0.7.1-py2.7-linux-x86_64.egg (from pyOpenSSL->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): setuptools in ./tools/Python-2.7.8/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg (from zope.interface>=3.6.0->Twisted>=10.0.0->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): cffi>=0.8 in ./tools/Python-2.7.8/lib/python2.7/site-packages/cffi-0.8.6-py2.7-linux-x86_64.egg (from cryptography>=0.2.1->pyOpenSSL->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): enum34 in ./tools/Python-2.7.8/lib/python2.7/site-packages/enum34-1.0.4-py2.7.egg (from cryptography>=0.2.1->pyOpenSSL->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): pyasn1 in ./tools/Python-2.7.8/lib/python2.7/site-packages/pyasn1-0.1.7-py2.7.egg (from cryptography>=0.2.1->pyOpenSSL->Scrapy)
Requirement already satisfied (use --upgrade to upgrade): pycparser in ./tools/Python-2.7.8/lib/python2.7/site-packages/pycparser-2.10-py2.7.egg (from cffi>=0.8->cryptography>=0.2.1->pyOpenSSL->Scrapy)
In the output above, ./tools/Python-2.7.8/ is the Python installation path on my machine. As you can see, both Scrapy's direct dependencies and their transitive dependencies are installed as *.egg packages under the site-packages directory.
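The presence of these dependencies can also be probed programmatically. Below is a small sketch (written for a modern Python 3 interpreter, so it is not a verbatim match for the Python 2.7.8 environment above; the helper name is my own, not part of Scrapy or pip) that checks whether each top-level module is importable:

```python
import importlib.util

def check_dependencies(names):
    """Map each top-level module name to True/False by importability."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Import names for the direct dependencies listed in the pip output above.
deps = ["twisted", "w3lib", "queuelib", "lxml", "OpenSSL", "cssselect", "six"]
for name, present in check_dependencies(deps).items():
    print("%-10s %s" % (name, "installed" if present else "MISSING"))
```

This only tells you a module is importable, not which version is installed, but it is a quick first check before re-running pip.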

Installing Scrapy via pip is the least effort, but on some networks pip may be unable to reach certain package sources, so the next section describes installing Scrapy from source.

2. Building and Installing Scrapy from Source

2.1 Fetch and unpack the Scrapy source tarball
The source tarball can be obtained from GitHub or the official site; details omitted.

2.2 Build and install
cd into the source directory and run python setup.py build; once that succeeds, run python setup.py install.

A few additional notes:
a. Scrapy depends on Python packages such as Twisted, lxml, pyOpenSSL, cssselect, w3lib, queuelib, and six, so the install command fetches and builds each dependency in turn (a dependency that is already installed at a version meeting Scrapy's minimum requirement is not re-fetched). The installation may therefore be fairly slow.

b. If the installation fails with "ImportError: libffi.so.6: cannot open shared object file: No such file or directory", first make sure the libffi library is installed, then make sure it is on the shared-library search path. The latter can be done with export LD_LIBRARY_PATH=libffi_install_path in the current shell; this only affects that terminal, so the rest of the Scrapy installation should be performed in the same terminal.
To learn more about libffi, see its official site.
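The two halves of note (b) can be sketched from Python as well: ask the loader whether it can see libffi, and build an environment with the extra directory prepended to LD_LIBRARY_PATH for child processes. The helper function and the path below are illustrative, not part of any library:

```python
import os
from ctypes.util import find_library

def prepend_library_path(directory, env=None):
    """Return a copy of the environment with `directory` prepended to
    LD_LIBRARY_PATH (affects processes launched with this environment)."""
    env = dict(os.environ if env is None else env)
    old = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = directory + (os.pathsep + old if old else "")
    return env

# None here means the dynamic loader cannot locate libffi.
print("libffi found at:", find_library("ffi"))
env = prepend_library_path("/usr/local/lib")  # illustrative install path
print(env["LD_LIBRARY_PATH"])
```

Note that changing os.environ inside a running interpreter does not affect that interpreter's own already-loaded libraries; the environment must be set before the build commands are launched, which is why the shell export in note (b) is the practical fix.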

c. The install also builds the transitive dependency cryptography-0.7.1 (which pyOpenSSL requires). Building it may fail with "openssl/ecdh.h: No such file or directory" if the system's default OpenSSL is too old, because old OpenSSL releases simply do not ship the ecdh.h header.
Workaround: install a newer OpenSSL (on my machine, openssl-1.0.1j), add its include directory to gcc's header search path by running "export C_INCLUDE_PATH=/home/slvher/tools/openssl-1.0.1j/include/" in the shell, then re-run python setup.py install.
After this builds and installs successfully, everything looks done, but a big trap lies ahead: the Python interpreter itself was built against the system's old OpenSSL (otherwise the openssl/ecdh.h error would never have occurred), whereas cryptography has just been built against the new one.
The trap shows up as soon as we start using Scrapy (say, by running "scrapy startproject tutorial"): the interpreter loads the old OpenSSL by default, so when Scrapy imports the cryptography modules, it fails with an error like:

py_install_path/site-packages/cryptography-0.7.1-py2.7-linux-x86_64.egg/cryptography/_Cryptography_cffi_70441dc9x8be47966.so: undefined symbol: EC_KEY_new_by_curve_name
The cause of this error is that the old OpenSSL loaded by the interpreter lacks certain functions. The fix seems simple: use LD_PRELOAD to swap in the new OpenSSL at interpreter startup. But after doing the export LD_PRELOAD and running "scrapy startproject tutorial" again, the interpreter simply dumps core. gdb shows it crashing in a function called while loading the hashlib module, and the reason is easy to understand: hashlib is built along with Python and also depends on the old OpenSSL, so forcibly swapping OpenSSL versions hits implementation incompatibilities between old and new, killing the process.
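A quick way to see which OpenSSL the running interpreter itself was built against (the root of the mismatch described above) is to ask the stdlib ssl module:

```python
# Report the OpenSSL the interpreter is linked against. If this shows an
# old version while cryptography was built against a newer one, the
# undefined-symbol error above can occur.
import ssl

print(ssl.OPENSSL_VERSION)         # human-readable version string
print(ssl.OPENSSL_VERSION_NUMBER)  # same version as a packed integer
```

Comparing this output with the version used to build cryptography tells you immediately whether the two halves agree.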

d. The real fix for the problem above is to rebuild Python itself, configuring it to link against the newer OpenSSL.
Rebuilding Python against a newer OpenSSL has its own pitfalls; see the note listed as reference 5 below for details.

After installing Scrapy by either of the two methods above, we still need to verify that it actually works.

3. Verifying that Scrapy Works
Following the example in the Scrapy Tutorial, we can verify that Scrapy is usable by running:

$/home/slvher/tools/Python-2.7.8/bin/scrapy startproject gre
The output looks like this:
/home/slvher/tools/Python-2.7.8/lib/python2.7/site-packages/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. 
Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
  verifyHostname, VerificationError = _selectVerifyImplementation()
New Scrapy project 'gre' created in:
    /home/slvher/Projects/python-cases/scrapy/gre
You can start your first spider with:
    cd gre
    scrapy genspider example example.com
With that, we can finally start using Scrapy happily. ^_^
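The same sanity check can be done from inside Python without creating a project. This is a small sketch of my own (not part of the Scrapy tutorial); it merely reports whether the scrapy package is importable and, if so, at what version:

```python
# Hedged sketch: probe Scrapy importability from the current interpreter.
# Returns (True, version) on success, (False, reason) otherwise.
def check_scrapy():
    try:
        import scrapy
    except ImportError as exc:
        return (False, str(exc))
    return (True, getattr(scrapy, "__version__", "unknown"))

ok, info = check_scrapy()
print("Scrapy importable:", ok, "-", info)
```

If this reports an ImportError mentioning cryptography or an undefined OpenSSL symbol, you are hitting the mismatch described in section 2.2 rather than a missing Scrapy install.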

[References]
1. Scrapy at a Glance
2. Scrapy Installation Guide
3. pip documentation
4. libffi
5. Notes on building Python from source against a custom, newer OpenSSL

========================= EOF =====================

