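I installed pyspider on CentOS 7.6 after installing Python 3.6.0 and ran into a few pitfalls along the way, which I am recording here.
GitHub: https://github.com/binux/pyspider
Official documentation: http://docs.pyspider.org/en/latest/
The official docs include an installation guide, but the actual installation did not go as smoothly as they suggest.
Kancloud documentation (Chinese translation): https://www.kancloud.cn/manbuheiniu/pyspider/
Following the official pip install pyspider command, the install runs for a while and then fails with this error: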
Traceback (most recent call last):
File "/tmp/pip-build-jbqzapjv/pycurl/setup.py", line 223, in configure_unix
stdout=subprocess.PIPE, stderr=subprocess.PIPE)
File "/usr/local/python3/lib/python3.6/subprocess.py", line 707, in __init__
restore_signals, start_new_session)
File "/usr/local/python3/lib/python3.6/subprocess.py", line 1326, in _execute_child
raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'curl-config'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-build-jbqzapjv/pycurl/setup.py", line 913, in
ext = get_extension(sys.argv, split_extension_source=split_extension_source)
File "/tmp/pip-build-jbqzapjv/pycurl/setup.py", line 582, in get_extension
ext_config = ExtensionConfiguration(argv)
File "/tmp/pip-build-jbqzapjv/pycurl/setup.py", line 99, in __init__
self.configure()
File "/tmp/pip-build-jbqzapjv/pycurl/setup.py", line 227, in configure_unix
raise ConfigurationError(msg)
__main__.ConfigurationError: Could not run curl-config: [Errno 2] No such file or directory: 'curl-config'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-jbqzapjv/pycurl/
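The traceback shows that the pycurl build could not find curl-config. My reading was that the curl shipped with the system is too old and that pyspider needs curl 7.43 or later, so I built curl from source, which also installs curl-config: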
wget https://curl.haxx.se/download/curl-7.43.0.tar.gz
tar -zxvf curl-7.43.0.tar.gz
cd curl-7.43.0
./configure
make && make install
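Since the failure comes down to curl-config not being on the PATH, a quick pre-flight check before rerunning pip can save a round trip. This is a minimal sketch; the helper name find_build_tool is my own and not part of pip or pycurl:

```python
import shutil


def find_build_tool(name):
    """Return the full path of an executable found on PATH, or None if missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_build_tool("curl-config")
    if path is None:
        print("curl-config not found; building pycurl will likely fail")
    else:
        print("curl-config found at", path)
```

If it reports that curl-config is missing, pip install pyspider can be expected to stop at the same pycurl step.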
After that, pip install pyspider runs through and prints the following:
Installing collected packages: MarkupSafe, Jinja2, itsdangerous, Werkzeug, click, Flask, chardet, cssselect, lxml, pycurl, idna, certifi, urllib3, requests, Flask-Login, u-msgpack-python, six, tblib, jsmin, PyYAML, defusedxml, wsgidav, tornado, pyquery, pyspider
Running setup.py install for pycurl ... done
Running setup.py install for Flask-Login ... done
Running setup.py install for jsmin ... done
Running setup.py install for PyYAML ... done
Running setup.py install for tornado ... done
Running setup.py install for pyspider ... done
Successfully installed Flask-1.0.3 Flask-Login-0.4.1 Jinja2-2.10.1 MarkupSafe-1.1.1 PyYAML-5.1.1 Werkzeug-0.15.4 certifi-2019.3.9 chardet-3.0.4 click-7.0 cssselect-1.0.3 defusedxml-0.6.0 idna-2.8 itsdangerous-1.1.0 jsmin-2.2.2 lxml-4.3.4 pycurl-7.43.0.2 pyquery-1.4.0 pyspider-0.3.10 requests-2.22.0 six-1.12.0 tblib-1.4.0 tornado-4.5.3 u-msgpack-python-2.5.1 urllib3-1.25.3 wsgidav-3.0.0
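At this point pyspider is installed, but one problem remains: because curl was reinstalled, the original libcurl symlinks no longer point to a usable library, and import pycurl fails in Python. The fix: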
# Remove the old library links
rm -f /usr/lib64/libcurl.so.4*
# The newly installed libcurl lives in /usr/local/lib/; take a look
ll /usr/local/lib/ | grep curl
# Create symlinks in lib64 that point to the new libcurl.so
ln -s /usr/local/lib/libcurl.so.4.3.0 /usr/lib64/libcurl.so.4.3.0
ln -s /usr/local/lib/libcurl.so.4.3.0 /usr/lib64/libcurl.so.4
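To confirm that the relinking worked, it helps to import pycurl and look at which libcurl it reports. A small sketch, assuming pycurl.version has its usual shape of space-separated name/version tokens; the helper libcurl_version is my own:

```python
def libcurl_version(pycurl_version_string):
    """Extract the libcurl version from a string shaped like pycurl.version."""
    for token in pycurl_version_string.split():
        if token.startswith("libcurl/"):
            return token[len("libcurl/"):]
    return None


if __name__ == "__main__":
    try:
        import pycurl
    except ImportError as exc:
        # This is the failure mode described above
        print("pycurl still fails to import:", exc)
    else:
        print("pycurl is linked against libcurl", libcurl_version(pycurl.version))
```

If the import still fails, the symlinks in /usr/lib64 are the first thing to recheck.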
Once installed, the pyspider executable sits in the bin directory of the Python installation. Start it:
/usr/local/python3/bin/pyspider
This produces the following error:
File "/usr/local/python3/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/usr/local/python3/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/usr/local/python3/lib/python3.6/site-packages/pyspider/run.py", line 384, in webui
app.run(host=host, port=port)
File "/usr/local/python3/lib/python3.6/site-packages/pyspider/webui/app.py", line 59, in run
from .webdav import dav_app
File "/usr/local/python3/lib/python3.6/site-packages/pyspider/webui/webdav.py", line 216, in
dav_app = WsgiDAVApp(config)
File "/usr/local/python3/lib/python3.6/site-packages/wsgidav/wsgidav_app.py", line 135, in __init__
_check_config(config)
File "/usr/local/python3/lib/python3.6/site-packages/wsgidav/wsgidav_app.py", line 119, in _check_config
raise ValueError("Invalid configuration:\n - " + "\n - ".join(errors))
ValueError: Invalid configuration:
- Deprecated option 'domaincontroller': use 'http_authenticator.domain_controller' instead.
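A search turned up the answer at https://blog.csdn.net/SiHann/article/details/88239892: put simply, the installed wsgidav is too new and has to be downgraded. Rather exasperating: pyspider's dependencies are too old one moment and too new the next, which does not feel very stable. So: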
python -m pip install wsgidav==2.4.1
After the downgrade, pyspider starts successfully. I also upgraded pip itself:
python -m pip install --upgrade pip
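The wsgidav mismatch can also be spotted before launching pyspider. The sketch below looks up the installed version and applies a rule inferred only from the two versions seen in this post (3.0.0 fails, 2.4.1 works); the cutoff at major version 3 and both helper names are assumptions of mine:

```python
def installed_version(package):
    """Return the installed version of a package as a string, or None if absent."""
    try:
        from importlib.metadata import version, PackageNotFoundError  # Python 3.8+
    except ImportError:
        # Fallback for older interpreters such as Python 3.6 (needs setuptools)
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_known_good(wsgidav_version):
    """Assumed rule: wsgidav 2.x works with pyspider 0.3.10, 3.x and later do not."""
    major = int(wsgidav_version.split(".")[0])
    return major < 3


if __name__ == "__main__":
    found = installed_version("wsgidav")
    if found is None:
        print("wsgidav is not installed")
    elif is_known_good(found):
        print("wsgidav", found, "should be fine")
    else:
        print("wsgidav", found, "is probably too new; try wsgidav==2.4.1")
```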
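PhantomJS is a headless browser that executes JavaScript and renders HTML pages. It is practically a must for scraping the many sites that now render their content dynamically.
Download it from the official site: http://phantomjs.org/download.html
First install the dependency libfonts: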
yum -y install libfonts
Unpack the archive, move it to its final location, and put it on the PATH:
tar -xjvf phantomjs-2.1.1-linux-x86_64.tar.bz2
mv phantomjs-2.1.1-linux-x86_64 /usr/local/phantomjs
ln -s /usr/local/phantomjs/bin/phantomjs /usr/local/bin/phantomjs
# Check that the installation worked
phantomjs -v
From now on, starting pyspider also reports that phantomjs has started:
phantomjs fetcher running on port 25555
[I 190614 16:52:20 result_worker:49] result_worker starting...
[I 190614 16:52:21 processor:211] processor starting...
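Besides reading the log, the ports can be probed directly. A minimal sketch, assuming everything runs on the local machine; 25555 is the fetcher port from the log line above and 5000 is the web UI port mentioned in this post, while port_is_open is a helper of my own:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike
        return False


if __name__ == "__main__":
    for name, port in (("phantomjs fetcher", 25555), ("webui", 5000)):
        state = "up" if port_is_open("127.0.0.1", port) else "down"
        print(name, "on port", port, "is", state)
```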
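Once pyspider is installed, most tutorials simply have you run pyspider and open port 5000 to start trying it out.
Later I found a few things worth setting up to make later use easier. When started like this:
/usr/local/python3/bin/pyspider
it creates the following database files relative to the directory it was launched from:
project.db
result.db
scheduler.1d
scheduler.1h
scheduler.all
task.db
If it is launched from a different path next time, the earlier projects and settings are gone! So a fixed startup shortcut is needed.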
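The path dependence comes from a relative data path being resolved against the current working directory. The sketch below only illustrates that resolution; it assumes pyspider's default data path is ./data, and data_dir_for is a hypothetical helper rather than anything from pyspider:

```python
import os


def data_dir_for(start_dir, data_path="./data"):
    """Resolve a relative data path against the directory pyspider is started from."""
    return os.path.normpath(os.path.join(start_dir, data_path))


if __name__ == "__main__":
    # Same command, two different working directories, two different data sets
    for start_dir in ("/home/pyspider", "/root"):
        print(start_dir, "->", data_dir_for(start_dir))
```

Always changing into one fixed directory before launching, as the start script does, keeps the resolved path stable.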
Here is what I did: create the directory /home/pyspider and put a startup script start.sh in it:
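#!/bin/sh
# Always run from the script's own directory so the same data files are used
cd "$(dirname "$0")" || exit 1
# Start only if no "pyspider all" process is running yet; matching on
# "pyspider all" avoids counting this script's own /home/pyspider/ path
if [ "$(ps -ef | grep 'pyspider all' | grep -v 'grep' | wc -l)" -lt 1 ]; then
    nohup pyspider all &
    echo "pyspider started"
fi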
The stop script stop.sh:
ps -ef | grep pyspider | grep -v 'grep' | awk '{print $2}' | xargs -r kill -9
Then grant execute permission:
chmod +x start.sh
chmod +x stop.sh
From now on, running start.sh is all it takes.
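The grep pipelines in both scripts come down to picking PIDs out of ps -ef output while skipping the grep process itself. The same filtering in Python, as a sketch for checking what would be matched before killing anything; the sample output is made up for illustration and matching_pids is my own helper:

```python
def matching_pids(ps_output, keyword):
    """Return PIDs from `ps -ef` style output whose command mentions keyword."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (its PID column is not numeric)
        if len(fields) < 8 or not fields[1].isdigit():
            continue
        command = " ".join(fields[7:])
        # Skip the grep helper itself, like `grep -v grep` does
        if keyword in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


# Made-up sample in the column layout of `ps -ef`
SAMPLE = """\
UID        PID  PPID  C STIME TTY          TIME CMD
root      1234     1  0 16:52 pts/0    00:00:03 python3.6 /usr/local/python3/bin/pyspider all
root      1300  1100  0 16:53 pts/0    00:00:00 grep pyspider
root      1100  1000  0 16:40 pts/0    00:00:00 -bash
"""

print(matching_pids(SAMPLE, "pyspider"))
```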
Of course, once the database is switched to MySQL later on, data no longer goes missing when the launch path changes.
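For the MySQL switch, pyspider can read its database settings from a JSON config file. The sketch below generates one; the mysql+taskdb://user:password@host:port/database URL shape follows my reading of the pyspider deployment docs and should be checked against them, and the user name, password, host and database names are placeholders:

```python
import json


def build_config(user, password, host="127.0.0.1", port=3306):
    """Build a pyspider config dict pointing all three databases at MySQL."""
    template = "mysql+{kind}://{user}:{password}@{host}:{port}/{kind}"
    return {
        kind: template.format(
            kind=kind, user=user, password=password, host=host, port=port
        )
        for kind in ("taskdb", "projectdb", "resultdb")
    }


if __name__ == "__main__":
    # Placeholder credentials; replace before use
    print(json.dumps(build_config("pyspider_user", "change-me"), indent=2))
```

Saved as config.json, the output would be passed to pyspider through its config option (pyspider -c config.json all, assuming that flag is available in the installed version).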