Deployment
===========

Since pyspider has various components, you can simply run `pyspider` to start a standalone instance that needs no third-party services. Alternatively, you can use MySQL or MongoDB together with RabbitMQ to deploy a distributed crawl cluster.

To deploy pyspider in a production environment, running each component in its own process and storing data in a database service is more reliable and flexible.

Installation
------------

To run each pyspider component in its own process, you need at least one database service. pyspider currently supports [MySQL](http://www.mysql.com/), [MongoDB](http://www.mongodb.org/) and [PostgreSQL](http://www.postgresql.org/). You can choose any one of them.

You also need a message queue service to connect the components together. You can use [RabbitMQ](http://www.rabbitmq.com/), [Beanstalk](http://kr.github.io/beanstalkd/) or [Redis](http://redis.io/) as the message queue.

`pip install --allow-all-external pyspider[all]`

> Even if you have already installed pyspider using `pip`, installing with `pyspider[all]` is necessary to pull in the requirements for MySQL/MongoDB/RabbitMQ.

If you are using Ubuntu, try:

```
apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml
```

to install binary packages.

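After installing with `pyspider[all]`, it can be useful to confirm that the optional database and message queue drivers are actually importable before deploying. The following is a minimal sketch, not part of pyspider itself. The import names in `OPTIONAL_MODULES` are assumptions about which packages back each service and may differ between pyspider versions; the helpers `is_importable` and `missing_drivers` are hypothetical names introduced here. It requires Python 3.4 or later.

```python
import importlib.util

# Assumed import names for the optional drivers (illustrative only;
# check the requirements of your installed pyspider version).
OPTIONAL_MODULES = {
    "MySQL": "mysql.connector",
    "MongoDB": "pymongo",
    "PostgreSQL (via SQLAlchemy)": "sqlalchemy",
    "RabbitMQ": "amqp",
    "Redis": "redis",
}


def is_importable(module_name):
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. "mysql" for "mysql.connector")
        # raises ImportError instead of returning None.
        return False


def missing_drivers(modules=None):
    """Return the sorted service names whose driver module is not importable."""
    if modules is None:
        modules = OPTIONAL_MODULES
    return sorted(name for name, mod in modules.items() if not is_importable(mod))


if __name__ == "__main__":
    missing = missing_drivers()
    if missing:
        print("Drivers not found for: " + ", ".join(missing))
    else:
        print("All checked drivers are importable.")
```

You only need the drivers for the database and message queue you actually choose, so a missing entry for an unused service is not a problem.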
Deployment
----------

**This document is based on MySQL + RabbitMQ**

### config.json

Although you can specify the parameters on the command line, a config file is a better choice.

```
{
  "taskdb": "mysql+taskdb://username:password@host:port/taskdb",
  "projectdb": "mysql+projectdb://username:password@host:port/projectdb",
  "resultdb": "mysql+resultdb://username:password@host:port/resultdb",
  "message_queue": "amqp://username:password@host:port/%2F",
  "webui": {
    "username": "some_name",
    "password": "some_passwd",
    "need-auth": true
  }
}
```

You can get the complete list of options by running `pyspider --help`, and `pyspider webui --help` for a subcommand. The `"webui"` key in the JSON holds the configuration for the `webui` subcommand. You can add parameters for other components in the same way.

#### Database Connection URI

`"taskdb"`, `"projectdb"` and `"resultdb"` use a database connection URI in the format below:

```
mysql:
    mysql+type://user:passwd@host:port/database
sqlite:
    # relative path
    sqlite+type:///path/to/database.db
    # absolute path
    sqlite+type:////path/to/database.db
    # memory database
    sqlite+type://
mongodb:
    mongodb+type://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]
    more: http://docs.mongodb.org/manual/reference/connection-string/
sqlalchemy:
    sqlalchemy+postgresql+type://user:passwd@host:port/database
    sqlalchemy+mysql+mysqlconnector+type://user:passwd@host:port/database
    more: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html
local:
    local+projectdb://filepath,filepath

type:
    should be one of `taskdb`, `projectdb`, `resultdb`.
```

#### Message Queue URL

You can use a connection URL to specify the message queue:

```
rabbitmq:
    amqp://username:password@host:5672/%2F
    Refer: https://www.rabbitmq.com/uri-spec.html
beanstalk:
    beanstalk://host:11300/
redis:
    redis://host:6379/db
    redis://host1:port1,host2:port2,...,hostn:portn (for redis 3.x in cluster mode)
builtin:
    None
```

> Hint for postgresql: you need to create the database with utf8 encoding yourself. pyspider will not create the database for you.

Running
-------

You should run each component separately using its subcommand. You may add `&` after a command to run it in the background, and use [screen](http://linux.die.net/man/1/screen) or [nohup](http://linux.die.net/man/1/nohup) to keep it from exiting when your ssh session ends. **It's recommended to manage components with [Supervisor](http://supervisord.org/).**

```
# start **only one** scheduler instance
pyspider -c config.json scheduler

# phantomjs
pyspider -c config.json phantomjs

# start as many fetcher / processor / result_worker instances as you need
pyspider -c config.json --phantomjs-proxy="localhost:25555" fetcher
pyspider -c config.json processor
pyspider -c config.json result_worker

# start webui, set `--scheduler-rpc` if scheduler is not running on the same host as webui
pyspider -c config.json webui
```

Running with Docker
-------------------

[Running pyspider with Docker](Running-pyspider-with-Docker)

Deployment of demo.pyspider.org
-------------------------------

[Deployment of demo.pyspider.org](Deployment-demo.pyspider.org)