Crawling kuaidaili with pyspider and storing the results in MongoDB

pyspider is a crawler framework written by a Chinese developer. Having used it, I find that compared with Scrapy its strength is the web UI: you operate everything in the browser and see the effect at a glance. Its weakness is also the web UI: there is no autocompletion, and it is inconvenient for large projects. As the saying goes, it succeeds by the web UI and fails by it too.
A quick note on installation: `pip install pyspider` is all you need.
For how to get it running on Python 3.7, see
https://blog.csdn.net/weixin_43486804/article/details/104183799
First, take a look at the page we will be crawling. If anyone asks why I keep crawling this site: the data crawled from it (free proxies) can itself be reused later.
[Figure 1: screenshot of the kuaidaili free proxy list page]

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2020-02-05 16:09:11
# Project: pyspider_ip

from pyspider.libs.base_handler import *
import pymongo

class Handler(BaseHandler):
    crawl_config = {
    }

    def __init__(self):
        # Sorry, my database has no password set, so I can't share the real address
        self.myclient = pymongo.MongoClient("mongodb://47.×××.××.××:27017/")
        self.mydb = self.myclient["ip"]
        # self.mydb.authenticate("lihang","980207",mechanism='SCRAM-SHA-1')
        print("connected to MongoDB")
        self.mycol = self.mydb['agent']
        
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.kuaidaili.com/free/inha/1', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)  # follow each link on the index page

    @config(priority=2)
    def detail_page(self, response):
        res ={
            "url": response.url,
            "title": response.doc('title').text(),
            "ip": response.doc("table > tbody:nth-child(2) > tr > td[data-title*='IP']").text().split(" "),
            "port": response.doc("table > tbody:nth-child(2) > tr > td[data-title*='PORT']").text().split(" "),
            "time": response.doc("table > tbody:nth-child(2) > tr > td[data-title*='响应速度']").text().split(" "),
            "name": response.doc("table > tbody:nth-child(2) > tr > td[data-title*='位置']").text().split(" ")
        }
        # The selectors above return parallel lists; re-pair them into one
        # document per proxy before inserting.
        ress = []
        for i in range(len(res['ip'])):
            ress.append({"ip": res['ip'][i], "port": res['port'][i], "name": res['name'][i], "time": res['time'][i]})
        self.mycol.insert_many(ress)
        return ress
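A side note on the commented-out `authenticate()` call above: `Database.authenticate()` was removed in recent PyMongo versions, and the usual way now is to put the credentials in the connection URI itself. A minimal sketch of building such a URI (the user name, password, and host below are placeholders, not real values):

```python
from urllib.parse import quote_plus

# Placeholder credentials -- substitute your own. Special characters in
# the password must be percent-escaped before going into the URI.
user = "myuser"
password = "p@ssw0rd"
host = "127.0.0.1"

uri = "mongodb://%s:%s@%s:27017/" % (quote_plus(user), quote_plus(password), host)
# uri == "mongodb://myuser:p%40ssw0rd@127.0.0.1:27017/"
```

Passing this `uri` to `pymongo.MongoClient(uri)` then authenticates on connect, with no separate `authenticate()` call.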

[Figure 2: the crawl results shown in the pyspider web UI]

I won't go over how to write the CSS selectors themselves. One small shortcoming of `response.doc()`'s CSS selectors compared with XPath is that you cannot get each matched element in order: the text of all matching elements is merged into a single space-separated string. That is why I use `split()` afterwards to break each column back into a list. Then I override `__init__` to add the MongoDB connection, and the spider is ready to run.
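To illustrate the split trick above: since `.text()` hands back each column as one space-joined string, splitting every column and zipping them back together yields one document per proxy. A small self-contained sketch with made-up sample values standing in for the selector output:

```python
# Simulated .text() output: the texts of all matched <td> cells,
# joined by spaces, one string per column.
ip_text = "1.2.3.4 5.6.7.8"
port_text = "8080 3128"
name_text = "Beijing Shanghai"
time_text = "0.5s 1.2s"

# Split each column back into a list, then zip the columns into rows.
rows = [
    {"ip": ip, "port": port, "name": name, "time": time}
    for ip, port, name, time in zip(
        ip_text.split(" "), port_text.split(" "),
        name_text.split(" "), time_text.split(" ")
    )
]
# rows[0] == {"ip": "1.2.3.4", "port": "8080", "name": "Beijing", "time": "0.5s"}
```

Using `zip` this way is equivalent to the index loop in `detail_page`, just without manual indexing; if one column has a missing cell, `zip` silently truncates to the shortest list, so the parallel lists must stay aligned.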
