Scrapy crawlers: managing spiders with scrapyd-client

Introduction

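Scrapyd is a service that runs as a daemon and executes Scrapy spiders. It exposes an HTTP/JSON API for deploying, deleting, starting, and stopping spiders. Scrapyd can manage multiple projects, and each project can have multiple versions, but only the latest version is used to run spiders.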
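As a rough illustration of that HTTP/JSON interface, the sketch below maps the four operations named above to Scrapyd endpoints: addversion.json is the upload endpoint, and the other three appear in the curl examples later in this article. ENDPOINTS and build_call are hypothetical helper names, and nothing is sent over the network.

```python
from urllib.parse import urlencode, urljoin

# Operation name -> (HTTP method, Scrapyd endpoint).
# The operation names on the left are illustrative labels, not Scrapyd terms.
ENDPOINTS = {
    "publish": ("POST", "addversion.json"),
    "delete": ("POST", "delversion.json"),
    "start": ("POST", "schedule.json"),
    "stop": ("POST", "cancel.json"),
}


def build_call(base_url, action, **params):
    """Return (method, url, form_body) for a Scrapyd operation without sending it."""
    method, path = ENDPOINTS[action]
    url = urljoin(base_url.rstrip("/") + "/", path)
    return method, url, urlencode(params)


method, url, body = build_call(
    "http://10.11.2.102:6800", "start",
    project="douban", spider="tongcheng_pipeline",
)
print(method, url, body)
```

The returned triple can be handed to any HTTP client; the curl commands in the management section send the same form bodies with -d.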

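Scrapyd-client is a tool built specifically for deploying Scrapy projects to Scrapyd. It also offers some management functions, but they are less complete than what Scrapyd itself provides, so it is best used only for deployment.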

Note:
Recent versions of scrapyd-client provide both the scrapyd-deploy and scrapyd-client commands; older versions may provide only scrapyd-deploy.

Installation

source activate scrapy
pip install scrapyd
pip install scrapyd-client

Deployment

1. Edit scrapy.cfg
Edit scrapy.cfg in the project root directory.

cd scrapy/douban
vim scrapy.cfg
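# "test" is the name (alias) of this deploy target
[deploy:test]
url = http://10.11.2.102:6800/
# project name
project = douban
# username and password for accessing the web service, if authentication is enabled
#username=
#password=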
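To check which deploy targets a scrapy.cfg defines, the file can be read with Python's standard configparser. This is a minimal sketch, assuming a target named test as passed to scrapyd-deploy in the next step; the [settings] entry and the deploy_targets helper are illustrative assumptions, not part of the original project.

```python
import configparser

# Sample scrapy.cfg content. The [settings] line is an assumed,
# typical Scrapy entry and is not taken from the article.
SAMPLE_CFG = """
[settings]
default = douban.settings

[deploy:test]
url = http://10.11.2.102:6800/
project = douban
"""


def deploy_targets(cfg_text):
    """Return {target_name: options} for every [deploy:<name>] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    targets = {}
    for section in parser.sections():
        if section.startswith("deploy:"):
            name = section.split(":", 1)[1]
            targets[name] = dict(parser[section])
    return targets


targets = deploy_targets(SAMPLE_CFG)
print(targets["test"]["url"], targets["test"]["project"])
```

The name after "deploy:" is what scrapyd-deploy expects as its first argument, so the section name and the command line have to agree.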


# start scrapyd
scrapyd

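Once running, scrapyd starts a web interface for monitoring spider runs. Scrapyd as described here does not support user authentication, but authentication can be added through an nginx reverse proxy or by other means.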
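If authentication is placed in front of scrapyd, clients have to send credentials with every request. The sketch below assumes the proxy uses HTTP Basic authentication (one possible choice, not stated in the original) and only builds the header; basic_auth_header and the example credentials are hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header for the given credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials for illustration only.
headers = basic_auth_header("user", "pass")
print(headers)
```

With curl, the equivalent is the -u user:pass option added to the commands used in the management section.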

2. Deploy with scrapyd-deploy

# deploy the scrapy project
scrapyd-deploy test -p douban -v v1

Where:
test is the deploy target name
douban is the project name
v1 is the version
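When -v is omitted, scrapyd-deploy uses the current Unix timestamp as the version, which would explain the numeric versions such as 1516115564 in the listversions.json output further down. The sketch below shows how such a version string can be produced and decoded; default_version and version_to_datetime are hypothetical helpers, not scrapyd-client functions.

```python
import time
from datetime import datetime, timezone


def default_version(now=None):
    """Return a timestamp-style version string (seconds since the epoch)."""
    return str(int(time.time() if now is None else now))


def version_to_datetime(version):
    """Decode a timestamp-style version into a UTC datetime."""
    return datetime.fromtimestamp(int(version), tz=timezone.utc)


print(default_version())
print(version_to_datetime("1516115564").isoformat())
```

A label like v1 cannot be decoded this way, so mixing explicit labels with timestamp versions makes it harder to tell which deployment is the newest.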

3. Manage spiders

# list all projects
scrapyd-client -t http://10.11.2.102:6800 projects
or
curl http://10.11.2.102:6800/listprojects.json
{"status": "ok", "projects": ["default", "douban"], "node_name": "yanggd-QiTianM4650-D089"}
# list the spiders in a project
curl "http://10.11.2.102:6800/listspiders.json?project=douban"
{"status": "ok", "spiders": ["douban_login", "fanghua", "langyabang", "movieTop250", "movieTop250_crawlspider", "movieTop250_login_crawlspider", "tongcheng_pipeline"], "node_name": "yanggd-QiTianM4650-D089"}
# list the versions of a project
curl "http://10.11.2.102:6800/listversions.json?project=douban"
{"status": "ok", "versions": ["1516115564", "1516199516", "1516265513", "v1"], "node_name": "yanggd-QiTianM4650-D089"}
# delete a version
curl http://10.11.2.102:6800/delversion.json -d "project=douban&version=1516115564"
{"status": "ok", "node_name": "yanggd-QiTianM4650-D089"}
# schedule a spider run
curl http://10.11.2.102:6800/schedule.json -d "project=douban&spider=tongcheng_pipeline&jobid=tongcheng_pipeline"
{"status": "ok", "jobid": "tongcheng_pipeline", "node_name": "yanggd-QiTianM4650-D089"}
# check the status of spider jobs (piped through json.tool for pretty-printing)
curl "http://10.11.2.102:6800/listjobs.json?project=douban" | python -m json.tool
{"status": "ok", "running": [{"start_time": "2018-01-22 19:45:14.376731", "pid": 28067, "id": "tongcheng_pipeline", "spider": "tongcheng_pipeline"}], "finished": [], "pending": [], "node_name": "yanggd-QiTianM4650-D089"}
# stop a spider job
curl http://10.11.2.102:6800/cancel.json -d "project=douban&job=tongcheng_pipeline"
{"status": "ok", "prevstate": null, "node_name": "yanggd-QiTianM4650-D089"}
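The listjobs.json response groups jobs under pending, running, and finished, so a script that waits for a job has to search all three lists. The sketch below does this for the sample response shown above; job_state is a hypothetical helper, and fetching the JSON (for example with curl or urllib) is left out so the example runs offline.

```python
import json

# The listjobs.json response from the example above, verbatim.
SAMPLE_LISTJOBS = (
    '{"status": "ok", "running": [{"start_time": "2018-01-22 19:45:14.376731", '
    '"pid": 28067, "id": "tongcheng_pipeline", "spider": "tongcheng_pipeline"}], '
    '"finished": [], "pending": [], "node_name": "yanggd-QiTianM4650-D089"}'
)


def job_state(listjobs_text, job_id):
    """Return 'pending', 'running', or 'finished' for job_id, or None if absent."""
    data = json.loads(listjobs_text)
    if data.get("status") != "ok":
        raise RuntimeError(f"scrapyd returned status {data.get('status')!r}")
    for state in ("pending", "running", "finished"):
        for job in data.get(state, []):
            if job.get("id") == job_id:
                return state
    return None


print(job_state(SAMPLE_LISTJOBS, "tongcheng_pipeline"))
```

Passing an explicit jobid to schedule.json, as done above, makes this lookup straightforward because the id is known in advance.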
