scrapyrt is a framework that exposes Scrapy spiders as a real-time crawling HTTP API. Our production deployment of scrapyrt has always run on the default Python 3.6.8, but Playwright-based browser automation requires at least Python 3.7, and upgrading the default Python 3.6.8 on Alibaba Cloud CentOS 8 causes a lot of trouble. We therefore need to move scrapyrt into Docker, where it runs on Python 3.8.
1.1 The official scrapyrt Docker image
scrapyrt does provide an official Docker image, but it is five years old and based on Python 2.7 by default. Image address: https://hub.docker.com/r/scrapinghub/scrapyrt/tags
The official installation commands are:
docker pull scrapinghub/scrapyrt
docker run -p 9080:9080 -tid -v /home/user/quotesbot:/scrapyrt/project scrapinghub/scrapyrt
1.2 Building our own image
Download the latest scrapyrt source code from GitHub: https://github.com/scrapinghub/scrapyrt (HTTP API for Scrapy spiders).
Modify the Dockerfile as follows:
# To build:
# > docker build -t scrapyrt .
#
# To start as a daemon, with port 9080 of the API exposed as 9080 on the host
# and the host's directory ${PROJECT_DIR} mounted as /scrapyrt/project:
#
# > docker run -p 9080:9080 -tid -v ${PROJECT_DIR}:/scrapyrt/project scrapyrt
#FROM python:3.10-slim-buster
FROM python:3.8.16-slim-bullseye
RUN mkdir -p /scrapyrt/src /scrapyrt/project
RUN mkdir -p /var/log/scrapyrt
RUN pip install --upgrade pip
# Switch apt to the Aliyun mirror for faster installs from within China
RUN sed -i 's@http://deb.debian.org@http://mirrors.aliyun.com@g' /etc/apt/sources.list
RUN apt-get clean
RUN apt-get update && apt-get install -y nodejs
# Install Playwright plus its Chromium browser and system dependencies
RUN pip install playwright==1.28.0 -i https://mirrors.aliyun.com/pypi/simple
RUN playwright install --with-deps chromium
# Install scrapyrt from the local source tree, then the project's own packages
ADD . /scrapyrt/src
RUN pip install /scrapyrt/src -i https://mirrors.aliyun.com/pypi/simple
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt -i https://mirrors.aliyun.com/pypi/simple
WORKDIR /scrapyrt/project
ENTRYPOINT ["scrapyrt", "-i", "0.0.0.0"]
EXPOSE 9080
Add the packages your project needs to requirements.txt; the ones my business requires are:
Scrapy==2.5.1
elasticsearch==7.17.4
elasticsearch-dsl==7.4.0
PyExecJS==1.5.1
PyMySQL==1.0.2
requests==2.27.1
redis==4.3.4
gerapy-selenium==0.0.3
prometheus-client==0.14.1
pyOpenSSL==22.0.0
Next, build the image from this source tree, either locally or on a CentOS host with a Docker environment. The commands below build the image as hushaoren/myscrapyrt:1.4:
[root@dev-data-node001 scrapyrt-master]# pwd
/root/scrapyrt-master
[root@dev-data-node001 scrapyrt-master]# ls
artwork Dockerfile docs LICENSE README.rst requirements-dev.txt requirements.txt scrapyrt setup.cfg setup.py tests
[root@dev-data-node001 scrapyrt-master]# docker build -t hushaoren/myscrapyrt:1.4 .
Then list all images (note that an image created with docker tag shares the same image ID as its source image):
[root@dev-data-node001 scrapyrt-master]# docker images --all
REPOSITORY TAG IMAGE ID CREATED SIZE
hushaoren/myscrapyrt 1.4 0c23df3b64d8 2 minutes ago 1.23GB
python 3.8.16-slim-bullseye 25238eab133c 2 weeks ago 124MB
Next, log in to Docker Hub and push the hushaoren/myscrapyrt image:
[root@dev-data-node001 scrapyrt-master]# docker login
Authenticating with existing credentials...
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
[root@dev-data-node001 scrapyrt-master]# docker push hushaoren/myscrapyrt:1.4
Upload the realtime-python-crawler project to a directory on the development CentOS host:
[root@dev-data-node001 realtime-python-crawler]# pwd
/root/scrapyrt_project/realtime-python-crawler
[root@dev-data-node001 realtime-python-crawler]# ls
config.json logs main_redis.py realtime_python_crawler scrapy.cfg
main.py __pycache__ requirements.txt scrapyrt_settings.py
Finally, the most important commands: pull the image, then create and start the container.
docker pull hushaoren/myscrapyrt:1.4
docker run -p 9080:9080 -e TZ="Asia/Shanghai" -tid -v /root/scrapyrt_project/realtime-python-crawler:/scrapyrt/project hushaoren/myscrapyrt:1.4 -S scrapyrt_settings
Port 9080 is exposed on the host.
-v maps the host's realtime-python-crawler project directory onto /scrapyrt/project in the container.
The d in -tid runs the container in the background (a separate -d would be redundant).
-S loads scrapyrt_settings.py, a custom settings module that overrides scrapyrt's default_settings.py (see the sketch below). The ENTRYPOINT already passes -i 0.0.0.0, so it does not need repeating here.
-e TZ="Asia/Shanghai" sets the container's time zone to match the host's.
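Below is a minimal sketch of what scrapyrt_settings.py can look like. The option names are taken from scrapyrt's default_settings.py; verify them against the scrapyrt version baked into your image.

# scrapyrt_settings.py -- minimal sketch of a custom settings module
LOG_DIR = 'logs'        # write scrapyrt logs inside the mounted project dir
TIMEOUT_LIMIT = 1000    # maximum number of seconds a single crawl may run
DEBUG = False           # disable debug mode in production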
Send a test request from Postman to verify that the service responds.
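The same check can also be scripted. scrapyrt's API is GET /crawl.json with spider_name and url query parameters; the spider name "quotes" and target URL below are placeholders, so substitute a spider from your own project:

import requests

# Smoke-test the scrapyrt container: GET /crawl.json schedules the named
# spider against the given start URL and returns the scraped items as JSON.
resp = requests.get(
    "http://localhost:9080/crawl.json",
    params={"spider_name": "quotes", "url": "http://quotes.toscrape.com/"},
)
print(resp.status_code)
print(resp.json())   # {"status": "ok", "items": [...], ...} on success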
Finally, the log files generated inside Docker are also synced to the host directory /root/scrapyrt_project/realtime-python-crawler via the volume mount, so they can still be collected by Filebeat.
Things to note after migrating to Docker:
1) Python code that dynamically looks up the machine's public IP will only get 127.0.0.1 inside the container. The fix is to write the host's public IP into a configuration file and read it from there (see the sketch after this list).
2) Whenever the realtime-python-crawler project needs a new pip package, the image must be rebuilt and the steps in section 1.2 repeated (the key line is RUN pip install -r /tmp/requirements.txt).
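As a minimal sketch of note 1, the host's public IP can live in the project's existing config.json and be read at startup; the "host_ip" key is an assumption, so adapt it to your actual config layout:

import json

# Read the host's public IP from config.json instead of resolving it at
# runtime (which yields 127.0.0.1 inside the container). The "host_ip"
# key is hypothetical -- use whatever key your config file defines.
with open("config.json", encoding="utf-8") as f:
    HOST_IP = json.load(f)["host_ip"]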