Deploying a Scrapy spider on CentOS 7 (with scrapy_splash), 2019-03-10

1. Set up the Python environment; for details see 《python3安装(centos)》.

2. Install docker:

yum install -y docker

3. Configure a domestic (China) registry mirror:

Go to the docker configuration directory (/etc/docker/ by default) and edit daemon.json there:

vim /etc/docker/daemon.json

Write the following content:

{
  "registry-mirrors": [
    "https://kfwkfulq.mirror.aliyuncs.com",
    "https://2lqq34jg.mirror.aliyuncs.com",
    "https://pee6w651.mirror.aliyuncs.com",
    "https://registry.docker-cn.com",
    "http://hub-mirror.c.163.com"
  ],
  "dns": ["8.8.8.8", "8.8.4.4"]
}
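A malformed daemon.json will keep the docker daemon from starting, so it is worth validating the JSON before moving on. A minimal check, shown here against a temp copy for illustration (in practice, run `python3 -m json.tool` directly on /etc/docker/daemon.json):

```shell
# Validate daemon.json syntax with python's json.tool
# (demo uses a temp copy; in practice point it at /etc/docker/daemon.json)
cat > /tmp/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://registry.docker-cn.com"],
  "dns": ["8.8.8.8", "8.8.4.4"]
}
EOF
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "valid JSON"
```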

4. Start docker:

systemctl start docker

5. Pull the splash image:

docker pull scrapinghub/splash

6. Run splash:

docker run -d -p 8050:8050 scrapinghub/splash

(On an Alibaba Cloud server, remember to open port 8050 in the security group.)
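To point a spider at this splash instance, the usual scrapy_splash settings go into the project's settings.py. This is a sketch following the settings documented in the scrapy-splash README; the server address is a placeholder:

```python
# settings.py -- scrapy_splash wiring (server address is a placeholder)
SPLASH_URL = 'http://your-server-ip:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Dedup filter that is aware of splash request arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```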

7. Configure MySQL to avoid "Too many connections" errors (skip this if you have already done it; for details see 《解决Mysql错误Too many connections的方法》):

vim /etc/my.cnf

Add the following under [mysqld] (restart MySQL afterwards for the changes to take effect):

max_connections=1000

wait_timeout=100

interactive_timeout=100

max_allowed_packet=15M

8. Install scrapyd:

pip install scrapyd --upgrade

9. Create symlinks:

ln -s /usr/local/python3/bin/scrapy /usr/bin/scrapy

ln -s /usr/local/python3/bin/scrapyd /usr/bin/scrapyd

ln -s /usr/local/python3/bin/twist /usr/bin/twist

ln -s /usr/local/python3/bin/twistd /usr/bin/twistd

10. Edit the scrapyd configuration file to allow remote connections:

vim /usr/local/python3/lib/python3.6/site-packages/scrapyd/default_scrapyd.conf

Change the bind address to:

bind_address = 0.0.0.0

11. Start scrapyd (it listens on port 6800; again, mind the Alibaba Cloud security group):

nohup scrapyd &

12. Create a directory for the Scrapy spider logs (this path is referenced in the spider's settings.py):

mkdir /var/log/spider/log


(The following steps are performed on your local machine.)

13. Install scrapyd-client:

pip install scrapyd-client

14. Edit scrapy.cfg and set the deploy target to the remote scrapyd address.
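A minimal deploy section in scrapy.cfg might look like the following sketch; the target name, server address, and project name are all placeholders to replace with your own:

```ini
[deploy:myserver]
url = http://your-server-ip:6800/
project = myproject
```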

15. In the spider's settings.py, configure sys_evn and LOG_FILE_DIR (the LOG_FILE variable in settings.py must be hardcoded; its path is the log directory created on the server earlier, /var/log/spider/log).
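A hardcoded LOG_FILE for this step might look like the sketch below; the filename itself is an arbitrary example, only the directory comes from the server setup above:

```python
# settings.py -- LOG_FILE must be an absolute, hardcoded path pointing into
# the log directory created on the server (filename is an example)
LOG_FILE = "/var/log/spider/log/myspider.log"
LOG_LEVEL = "INFO"  # optional: keep the log volume reasonable
```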

16. Run the scrapyd-deploy command in the same directory as scrapy.cfg:

scrapyd-deploy

If the command is not found (on Windows), create a scrapyd-deploy.bat file in the environment's Scripts\ directory with the following content:

@echo off

"D:\anaconda3-5.0.1\envs\py36\python.exe" "D:\anaconda3-5.0.1\envs\py36\Scripts\scrapyd-deploy" %1 %2 %3 %4 %5 %6 %7 %8 %9

(Adjust the paths to match your actual installation.)

17. Use curl to call scrapyd to start and stop spiders.
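These curl calls boil down to two POSTs against scrapyd's JSON API: schedule.json to start a spider and cancel.json to stop a job. As a sketch, the equivalent curl command lines can be composed like this (the host, project, spider, and job id below are placeholders, not values from this deployment):

```python
# Sketch: compose curl invocations for scrapyd's schedule.json / cancel.json
# endpoints. Host, project, spider, and job id are placeholders.
from urllib.parse import urlencode

HOST = "http://your-server-ip:6800"

def schedule_cmd(project: str, spider: str) -> str:
    """curl command that starts a spider via scrapyd's schedule.json."""
    body = urlencode({"project": project, "spider": spider})
    return f"curl {HOST}/schedule.json -d {body.replace('&', ' -d ')}"

def cancel_cmd(project: str, job: str) -> str:
    """curl command that cancels a running job via scrapyd's cancel.json.
    The job id is returned in schedule.json's response."""
    body = urlencode({"project": project, "job": job})
    return f"curl {HOST}/cancel.json -d {body.replace('&', ' -d ')}"

print(schedule_cmd("myproject", "myspider"))
print(cancel_cmd("myproject", "6487ec79947edab326d6db28a2d865"))
```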
