magical_spider: a lightweight web scraping framework

magical_spider is a lightweight crawler/scraping solution that works with any web site.

Project URL: https://github.com/lixi5338619/magical_spider


Usage

1. Configure settings.py: set the driver path, and check whether the .exe suffix is needed on your OS (a sketch follows this list).
2. Start the Flask service.
3. Run a test.
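
The exact contents of settings.py are defined in the repository; the following is only a minimal sketch of the driver-path setting, assuming a single constant (the name CHROMEDRIVER_PATH and the paths are illustrative):

# settings.py (illustrative sketch; check the repository for the real option names)
import platform

# Path to the chromedriver binary used by the Flask service.
# On Windows the binary carries the .exe suffix; on Linux/macOS it does not.
if platform.system() == 'Windows':
    CHROMEDRIVER_PATH = r'C:\tools\chromedriver.exe'
else:
    CHROMEDRIVER_PATH = '/usr/local/bin/chromedriver'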

Refer to the demo folder for example code; the run flow relies mainly on runflow.py.

# runflow.py
import requests

host = 'http://127.0.0.1:5000'

def magical_start(project_name, base_url='http://www.lxspider.com'):
    # 1. create a browser instance and obtain its session_id
    result = requests.post(f'{host}/create', data={'name': project_name, 'url': base_url}).json()
    session_id, process_url = result['session_id'], result['process_url']
    return session_id, process_url

def magical_request(session_id, process_url, request_url, request_type='get', formdata=None):
    # 2. send an XHR request through the managed browser
    data = {'session_id': session_id, 'process_url': process_url,
            'request_url': request_url, 'request_type': request_type}
    if formdata is not None:
        # optional form body, used for POST requests (see the POST example below)
        data['formdata'] = formdata
    result = requests.post(f'{host}/xhr', data=data).json()
    return result['result']

def magical_close(session_id, process_url, process_name):
    # 3. close the browser instance and release the task
    close_data = {'session_id': session_id, 'process_url': process_url, 'process_name': process_name}
    requests.post(f'{host}/close', data=close_data).json()

Making a GET request (magical_request returns the response body, so the example prints its length)

from demo.runflow import magical_start, magical_request, magical_close

project_name = 'cnipa'
base_url = 'https://www.cnipa.gov.cn'

session_id, process_url = magical_start(project_name, base_url)

print(len(magical_request(session_id, process_url, 'https://www.cnipa.gov.cn/col/col57/index.html')))

magical_close(session_id, process_url, project_name)

Making a POST request (the form data is serialized to a JSON string and forwarded through the formdata field)

from demo.runflow import magical_start, magical_request, magical_close
import json

project_name = 'chinadrugtrials'
base_url = 'http://www.chinadrugtrials.org.cn'

session_id, process_url = magical_start(project_name, base_url)

data = {"id": "", "ckm_index": "", "sort": "desc", "sort2": "", "rule": "CTR", "secondLevel": "0",
        "currentpage": "2", "keywords": "", "reg_no": "", "indication": "", "case_no": "",
        "drugs_name": "", "drugs_type": "", "appliers": "", "communities": "", "researchers": "",
        "agencies": "", "state": ""}
formdata = json.dumps(data)

print(magical_request(session_id=session_id, process_url=process_url,
                      request_url='http://www.chinadrugtrials.org.cn/clinicaltrials.searchlist.dhtml',
                      request_type='post', formdata=formdata))

magical_close(session_id, process_url, project_name)

Notes

1. The service's index page lets you view and manage the currently running tasks, and also shows the system's memory and disk usage.
2. The demo folder contains the task-flow helper runflow.py, along with case studies and single-task and multi-task examples (a rough multi-request sketch follows this list).
3. Task names must be unique. If a run fails partway through, you may need to close the task manually on the index page.
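
A rough sketch of a multi-request flow built on the runflow.py helpers above (the project name and URL list are illustrative, not taken from the demo folder):

# Multi-request sketch using the helpers from runflow.py; the URLs are hypothetical examples.
from demo.runflow import magical_start, magical_request, magical_close

project_name = 'cnipa_batch'   # must not clash with an already running task
base_url = 'https://www.cnipa.gov.cn'

session_id, process_url = magical_start(project_name, base_url)
try:
    for url in ['https://www.cnipa.gov.cn/col/col57/index.html',
                'https://www.cnipa.gov.cn/col/col61/index.html']:
        html = magical_request(session_id, process_url, url)
        print(url, len(html))
finally:
    # always release the browser so the task does not linger on the index page
    magical_close(session_id, process_url, project_name)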


Linux deployment

1. Install Chrome (choose the install location yourself)

yum install https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm

2. Check the Chrome version

google-chrome --version

3. Install the matching version of chromedriver_linux64

For example, my Chrome version is 104.0.5112.79:

wget https://npm.taobao.org/mirrors/chromedriver/104.0.5112.79/chromedriver_linux64.zip

4. Unzip the archive

unzip chromedriver_linux64.zip

5. Grant execute permission

chmod 777 chromedriver

6. Update the chromedriver path in the project's settings.py

7. Install the Python dependencies, then start the Flask project

Python dependencies: flask, sqlite3, selenium, websockets, opencv-python, numpy (note: sqlite3 ships with the Python standard library)

Start Flask with: python3 server.py

8. Open the server port for external access

9. Run the project tests
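
Before running the demos, a quick way to confirm the Flask service is reachable is to request its index page (a sketch; adjust the host and port to your server):

# Reachability check for the Flask service (sketch).
import requests

host = 'http://127.0.0.1:5000'   # replace with your server's address and port
resp = requests.get(host, timeout=5)
print(resp.status_code)          # 200 means the index page is responding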


Recommended content

GitHub, 爬虫逆向知识站 (crawler reverse-engineering knowledge site), 爬虫逆向工具站 (crawler reverse-engineering tools site)
