Notes on installing pyspider

I installed pyspider on CentOS 7.6 after setting up Python 3.6.0 and ran into a few pitfalls along the way, which are recorded here.

References

 

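GitHub repository: https://github.com/binux/pyspider

Official documentation: http://docs.pyspider.org/en/latest/

The official documentation includes an installation guide, but the actual installation did not go as smoothly as it suggests.

Kancloud documentation (Chinese translation): https://www.kancloud.cn/manbuheiniu/pyspider/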

 

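Main features

  • Scripts are written in Python, and you can use any HTML parsing package you like (pyquery is built in)
  • Web UI for writing and debugging scripts, starting and stopping them, monitoring execution status, viewing activity history, and retrieving results
  • Supports MySQL, MongoDB, and SQLite
  • Supports RabbitMQ and Redis as the message queue
  • Can crawl JavaScript-rendered pages
  • Components are replaceable; supports single-machine and distributed deployment with task coordination, as well as Docker deployment
  • Powerful scheduling control, task tracking, and monitoring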

 

Working through the pitfalls

  • Pitfall 1: the curl problem

Running the official pip install pyspider command fails after a short while with the following error:

    Traceback (most recent call last):
      File "/tmp/pip-build-jbqzapjv/pycurl/setup.py", line 223, in configure_unix
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
      File "/usr/local/python3/lib/python3.6/subprocess.py", line 707, in __init__
        restore_signals, start_new_session)
      File "/usr/local/python3/lib/python3.6/subprocess.py", line 1326, in _execute_child
        raise child_exception_type(errno_num, err_msg)
    FileNotFoundError: [Errno 2] No such file or directory: 'curl-config'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-jbqzapjv/pycurl/setup.py", line 913, in <module>
        ext = get_extension(sys.argv, split_extension_source=split_extension_source)
      File "/tmp/pip-build-jbqzapjv/pycurl/setup.py", line 582, in get_extension
        ext_config = ExtensionConfiguration(argv)
      File "/tmp/pip-build-jbqzapjv/pycurl/setup.py", line 99, in __init__
        self.configure()
      File "/tmp/pip-build-jbqzapjv/pycurl/setup.py", line 227, in configure_unix
        raise ConfigurationError(msg)
    __main__.ConfigurationError: Could not run curl-config: [Errno 2] No such file or directory: 'curl-config'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-jbqzapjv/pycurl/

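Put simply, the pycurl build cannot find curl-config, the helper script that ships with libcurl's development files (the libcurl-devel package on CentOS). I took this to mean that the system curl was too old, so I built curl 7.43.0 from source, which also installs curl-config:

wget https://curl.haxx.se/download/curl-7.43.0.tar.gz
tar -zxvf curl-7.43.0.tar.gz
cd curl-7.43.0
./configure
make && make install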
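
Before rerunning pip, it can be worth confirming that curl-config is now on the PATH, since that is the command the pycurl build tried to run in the traceback above. The following is a minimal Python sketch rather than part of the original steps; the function name curl_config_version is made up for illustration.

```python
import shutil
import subprocess


def curl_config_version(executable="curl-config"):
    """Return the output of `curl-config --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        # This is the situation that made the pycurl build fail.
        return None
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6 as well
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = curl_config_version()
    if version is None:
        print("curl-config not found on PATH")
    else:
        print("found:", version)
```

If this reports that curl-config is missing, pip install pyspider will most likely fail in the same way again.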

After that, running pip install pyspider again succeeds and prints the following:

Installing collected packages: MarkupSafe, Jinja2, itsdangerous, Werkzeug, click, Flask, chardet, cssselect, lxml, pycurl, idna, certifi, urllib3, requests, Flask-Login, u-msgpack-python, six, tblib, jsmin, PyYAML, defusedxml, wsgidav, tornado, pyquery, pyspider
  Running setup.py install for pycurl ... done
  Running setup.py install for Flask-Login ... done
  Running setup.py install for jsmin ... done
  Running setup.py install for PyYAML ... done
  Running setup.py install for tornado ... done
  Running setup.py install for pyspider ... done
Successfully installed Flask-1.0.3 Flask-Login-0.4.1 Jinja2-2.10.1 MarkupSafe-1.1.1 PyYAML-5.1.1 Werkzeug-0.15.4 certifi-2019.3.9 chardet-3.0.4 click-7.0 cssselect-1.0.3 defusedxml-0.6.0 idna-2.8 itsdangerous-1.1.0 jsmin-2.2.2 lxml-4.3.4 pycurl-7.43.0.2 pyquery-1.4.0 pyspider-0.3.10 requests-2.22.0 six-1.12.0 tblib-1.4.0 tornado-4.5.3 u-msgpack-python-2.5.1 urllib3-1.25.3 wsgidav-3.0.0

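pyspider is now installed, but one problem remains: because curl was reinstalled, the original libcurl library symlink is lost, and import pycurl fails in Python. The fix:

# Remove the old library links
rm -f /usr/lib64/libcurl.so.4*

# The newly installed libcurl is in /usr/local/lib/; take a look
ll /usr/local/lib/ | grep curl

# Create symlinks in /usr/lib64 that point to the new libcurl.so
ln -s /usr/local/lib/libcurl.so.4.3.0 /usr/lib64/libcurl.so.4.3.0
ln -s /usr/local/lib/libcurl.so.4.3.0 /usr/lib64/libcurl.so.4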
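
A quick way to check the relinked library is to import pycurl and print its version string, which also names the libcurl it was loaded against. This is a minimal sketch, assuming it is run with the same Python interpreter that pyspider was installed into; libcurl_version is an illustrative name.

```python
def libcurl_version():
    """Return pycurl's version string, or None if pycurl cannot be imported."""
    try:
        import pycurl
    except ImportError as exc:
        # A missing or unloadable libcurl.so.4 also surfaces as ImportError.
        print("pycurl import failed:", exc)
        return None
    return pycurl.version


if __name__ == "__main__":
    print(libcurl_version())
```

If the import still fails, the message printed from the exception usually names the shared library that could not be loaded.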

 

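  • Pitfall 2: startup failure

After installation, the pyspider executable is placed in the bin directory of the Python installation. Start it:

/usr/local/python3/bin/pyspider

This produces the following error: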

  File "/usr/local/python3/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/python3/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/python3/lib/python3.6/site-packages/pyspider/run.py", line 384, in webui
    app.run(host=host, port=port)
  File "/usr/local/python3/lib/python3.6/site-packages/pyspider/webui/app.py", line 59, in run
    from .webdav import dav_app
  File "/usr/local/python3/lib/python3.6/site-packages/pyspider/webui/webdav.py", line 216, in <module>
    dav_app = WsgiDAVApp(config)
  File "/usr/local/python3/lib/python3.6/site-packages/wsgidav/wsgidav_app.py", line 135, in __init__
    _check_config(config)
  File "/usr/local/python3/lib/python3.6/site-packages/wsgidav/wsgidav_app.py", line 119, in _check_config
    raise ValueError("Invalid configuration:\n  - " + "\n  - ".join(errors))
ValueError: Invalid configuration:
  - Deprecated option 'domaincontroller': use 'http_authenticator.domain_controller' instead.

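A search turned up the answer at https://blog.csdn.net/SiHann/article/details/88239892. Put simply, the installed wsgidav (3.0.0) is too new for pyspider 0.3.10 and has to be downgraded. pyspider's dependencies feel a little shaky. So:

python -m pip install wsgidav==2.4.1

After the downgrade, pyspider starts successfully.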
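
To see which wsgidav version is actually installed before and after the downgrade, a small Python check can help. This is a sketch under one stated assumption: the cutoff "anything below 3.0 is fine" is inferred only from the one failing version (3.0.0) and the one working version (2.4.1) seen above. Both function names are hypothetical.

```python
def wsgidav_is_compatible(version):
    """Assumption: releases below 3.0 still accept the 'domaincontroller' option."""
    major = int(version.split(".")[0])
    return major < 3


def installed_wsgidav_version():
    """Return the installed wsgidav version string, or None if it is not installed."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        # Fallback for older interpreters such as Python 3.6 (needs setuptools).
        import pkg_resources
        try:
            return pkg_resources.get_distribution("wsgidav").version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version("wsgidav")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_wsgidav_version()
    if found is None:
        print("wsgidav is not installed")
    else:
        print(found, "ok" if wsgidav_is_compatible(found) else "too new for pyspider 0.3.10")
```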

 

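Upgrade pip

python -m pip install --upgrade pip

Install PhantomJS

PhantomJS is a headless browser that executes JavaScript and renders the resulting HTML. It is essential for crawling the many sites that render their content dynamically.

Download it from the official site: http://phantomjs.org/download.html

First install the dependency. I installed libfonts; the PhantomJS download page itself names Fontconfig (the fontconfig package on CentOS) as the library it relies on, so install that too if phantomjs fails to start.

yum -y install libfonts

Unpack the archive, move it to the target location, and put it on the PATH: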

tar -xjvf phantomjs-2.1.1-linux-x86_64.tar.bz2 
mv phantomjs-2.1.1-linux-x86_64 /usr/local/phantomjs
ln -s /usr/local/phantomjs/bin/phantomjs /usr/local/bin/phantomjs

# Check that the installation succeeded
phantomjs -v

From now on, starting pyspider prints a message showing that phantomjs has started:

phantomjs fetcher running on port 25555
[I 190614 16:52:20 result_worker:49] result_worker starting...
[I 190614 16:52:21 processor:211] processor starting...
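
To confirm from another terminal that the components are really listening, a plain TCP connection test is enough. The sketch below assumes the defaults seen in this post: port 25555 for the phantomjs fetcher (from the log above) and port 5000 for the web UI. The helper name port_is_open is illustrative.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in (("phantomjs fetcher", 25555), ("web UI", 5000)):
        state = "up" if port_is_open("127.0.0.1", port) else "down"
        print(name, port, state)
```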

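Quick start/stop setup

Once pyspider is installed, most tutorials simply run pyspider and open port 5000 to start trying it out.

I later found a few points that are worth setting up to make later use easier:

  • The pyspider executable lives in Python's bin directory; in my case:
/usr/local/python3/bin/pyspider
  • The default database is SQLite, stored in a data directory under whatever directory the command was started from. It contains these data files:
project.db
result.db
scheduler.1d
scheduler.1h
scheduler.all
task.db

If pyspider is started from a different path next time, the earlier projects and settings are gone, because a fresh data directory is created there. A fixed launch setup avoids this.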
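
The effect is easy to see by resolving the relative data directory against two different working directories. This is only an illustration of the path arithmetic, not pyspider code; the name data_dir_for is made up, and "./data" is assumed from the behaviour described above. posixpath is used so the result is the same on any platform.

```python
import posixpath


def data_dir_for(cwd, data_path="./data"):
    """Resolve a relative data path against the directory pyspider was started from."""
    return posixpath.normpath(posixpath.join(cwd, data_path))


if __name__ == "__main__":
    # Started from two different directories, pyspider sees two different databases.
    print(data_dir_for("/home/pyspider"))
    print(data_dir_for("/root"))
```

Each start directory gets its own set of SQLite files, which is why always starting from one fixed directory keeps the projects in place.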

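My approach: create the /home/pyspider directory and put a start script, start.sh, in it:

#!/bin/sh
# Run from the script's own directory so data/ always ends up in the same place
cd "$(dirname "$0")" || exit 1
# Start pyspider only if it is not already running
if ! pgrep -f 'pyspider all' > /dev/null;
then
    nohup pyspider all &
    echo "pyspider started"
fi

The stop script, stop.sh: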

ps -ef|grep -v 'grep'|grep pyspider | awk '{print $2}' | xargs kill -9

Then grant execute permission:

chmod +x start.sh
chmod +x stop.sh

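From then on, running start.sh is all that is needed.

Of course, once the database is switched to MySQL, changing the start path no longer causes data to go missing.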
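
As a rough idea of what the MySQL switch involves, the sketch below generates a config.json using the database URL scheme from pyspider's deployment documentation (mysql+taskdb://, mysql+projectdb://, mysql+resultdb://). The user name, password, host, and the choice of one database per store are placeholder assumptions, and the function name mysql_config is made up for illustration.

```python
import json


def mysql_config(user, password, host="127.0.0.1", port=3306):
    """Build pyspider database settings that point all three stores at MySQL."""
    template = "mysql+{kind}://{user}:{password}@{host}:{port}/{kind}"
    return {
        kind: template.format(kind=kind, user=user, password=password, host=host, port=port)
        for kind in ("taskdb", "projectdb", "resultdb")
    }


if __name__ == "__main__":
    # Placeholder credentials; replace with real ones before use.
    config = mysql_config("spider", "secret")
    with open("config.json", "w") as fh:
        json.dump(config, fh, indent=2)
    print(json.dumps(config, indent=2))
```

The resulting file would then be passed at startup, for example pyspider -c config.json all. Passwords containing special characters would need URL escaping, which this sketch does not handle.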

 
