pyspider is a crawler framework written by a Chinese developer. Having used it, I'd say its advantage over scrapy is the web UI: you can see the effect of your code at a glance. Its flaw is also the web UI: there is no autocompletion, and it is awkward for large projects. The web UI is its making and its undoing.
A quick word on installing it; one command is enough:
pip install pyspider
For how to get it configured under Python 3.7, see:
https://blog.csdn.net/weixin_43486804/article/details/104183799
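Once installed, everything starts with a single command, and by default the web UI should come up at http://localhost:5000:
pyspider all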
First, take a look at the page we are going to crawl (kuaidaili's free proxy list). And if anyone asks why I keep crawling this site: the data it yields can be put to further use.
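As a taste of that reuse, here is a minimal sketch of pulling a stored proxy back out of MongoDB and routing a requests call through it. This is my own illustration, not part of the original spider: the host is the same masked placeholder used below, the field names match what the spider stores, and it assumes the collection already holds records. The spider that actually fills the collection follows.

import pymongo
import requests

# placeholder address: same masked host as in the spider below
client = pymongo.MongoClient("mongodb://47.×××.××.××:27017/")
col = client["ip"]["agent"]

# grab any stored record and build a requests-style proxy mapping
rec = col.find_one()
proxy = "http://{}:{}".format(rec["ip"], rec["port"])
resp = requests.get("https://httpbin.org/ip",
                    proxies={"http": proxy, "https": proxy},
                    timeout=5)
print(rec["name"], rec["time"], resp.status_code)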
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2020-02-05 16:09:11
# Project: pyspider_ip

from pyspider.libs.base_handler import *
import pymongo


class Handler(BaseHandler):
    crawl_config = {
    }

    def __init__(self):
        # sorry, the database has no password set, so I can't give you the real address
        self.myclient = pymongo.MongoClient("mongodb://47.×××.××.××:27017/")
        self.mydb = self.myclient["ip"]
        # mydb.authenticate("lihang","980207",mechanism='SCRAM-SHA-1')
        print("connected")
        self.mycol = self.mydb['agent']

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.kuaidaili.com/free/inha/1', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # follow every absolute link found on the listing page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        # .text() merges the text of every matched cell into one
        # space-separated string, so split(" ") turns it back into a list
        res = {
            "url": response.url,
            "title": response.doc('title').text(),
            "ip": response.doc("table > tbody:nth-child(2) > tr > td[data-title*='IP']").text().split(" "),
            "port": response.doc("table > tbody:nth-child(2) > tr > td[data-title*='PORT']").text().split(" "),
            "time": response.doc("table > tbody:nth-child(2) > tr > td[data-title*='响应速度']").text().split(" "),
            "name": response.doc("table > tbody:nth-child(2) > tr > td[data-title*='位置']").text().split(" ")
        }
        # zip the parallel columns back into one record per proxy
        ress = []
        for i in range(len(res['ip'])):
            ress.append({"ip": res['ip'][i], "port": res['port'][i],
                         "name": res['name'][i], "time": res['time'][i]})
        self.mycol.insert_many(ress)
        return ress
I won't go over how to write the CSS selectors. One small drawback of response.doc()'s CSS selection compared with XPath is that it doesn't hand you each matching element in order; .text() just merges the text of all matches into a single space-separated string. That is why I call split() afterwards to break each column back into an array. Beyond that, I overrode the __init__ method to hook in the MongoDB connection; with that in place, just run it.
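To make that concrete, here is a minimal sketch using plain pyquery (the library behind response.doc()); the HTML snippet is made up for illustration. It shows the space-joined .text() behavior, the split(" ") fix, and an ordered alternative via .items().

from pyquery import PyQuery as pq

# toy HTML standing in for kuaidaili's table (made-up values)
doc = pq('<table><tbody>'
         '<tr><td data-title="IP">1.2.3.4</td><td data-title="PORT">8080</td></tr>'
         '<tr><td data-title="IP">5.6.7.8</td><td data-title="PORT">3128</td></tr>'
         '</tbody></table>')

# .text() flattens every matched cell into one space-separated string ...
print(doc("td[data-title='IP']").text())             # 1.2.3.4 5.6.7.8
# ... so split(" ") recovers the individual values
print(doc("td[data-title='IP']").text().split(" "))  # ['1.2.3.4', '5.6.7.8']

# ordered alternative: iterate over the matches with .items()
print([td.text() for td in doc("td[data-title='IP']").items()])

If you'd rather keep elements in document order from the start, this .items() pattern is the same one index_page already uses for the links.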