HAipproxy: using a high-availability, low-latency, high-anonymity IP proxy pool implemented in Python

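This post only covers using HAipproxy from the Scrapy crawling framework, and assumes you have already deployed the server side. See the haipproxy documentation for detailed usage instructions.

If you run into problems during deployment, leave a comment and I will help when I see it.

Download and unpack the source: https://github.com/SpiderClub/haipproxy/archive/master.zip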

There are two ways to call the IP proxy pool:

  1. The Python client
  2. A squid second-level proxy
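
For comparison, the second option needs no client code in the crawler: every request is pointed at the squid instance, which forwards it through the pool. A minimal sketch, assuming squid listens on 127.0.0.1:3128 (an assumed address, use whatever your deployment exposes) and using a hypothetical middleware name, SquidProxyMiddleware:

```python
class SquidProxyMiddleware:
    """Hypothetical downloader middleware: route every request through squid."""

    # Assumed address of the squid second-level proxy; adjust to your deployment.
    SQUID_URL = "http://127.0.0.1:3128"

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware picks the proxy up from request.meta['proxy'].
        request.meta["proxy"] = self.SQUID_URL
        return None  # let Scrapy continue processing the request
```

With this approach there is no per-proxy feedback from the crawler; which upstream proxy handles a request is left entirely to the squid side.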

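I use the first option:

Copy the client, config, and utils packages into your crawler project, then fix any import errors that result.

First, define the proxy downloader middleware in middlewares.py: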

import time
import redis
from scrapy import signals
from .client.py_cli import ProxyFetcher


class JobboleProxyMiddleware(object):

    def __init__(self, redis_host, redis_port, redis_db, redis_password):
        args = dict(host=redis_host, port=redis_port, password=redis_password, db=redis_db)
        self.scheme = 'http'
        # Redis keys used to record the success count and the completion timestamps
        self.success_req = 'http:success:request'
        self.cur_time = 'http:success:time'
        self.request_count = 0
        self.fetcher = ProxyFetcher(self.scheme, strategy='greedy', redis_args=args)
        # Reuse the same connection arguments for the statistics connection
        self.conn = redis.StrictRedis(**args)
        self.proxy = self.get_next_proxy()
        self.start = time.time() * 1000

    @classmethod
    def from_crawler(cls, crawler):
        s = cls(
            redis_host=crawler.settings.get("REDIS_HOST"),
            redis_port=crawler.settings.get("REDIS_PORT"),
            redis_db=crawler.settings.get("REDIS_DB"),
            redis_password=crawler.settings.get("REDIS_PASSWORD"),
        )
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        self.start = time.time() * 1000
        self.request_count += 1
        if self.request_count == 1:
            spider.logger.info("First request: no proxy is set!")
        else:
            # Attach the proxy to the request object
            spider.logger.info("this is request ip:" + self.proxy)
            request.meta['proxy'] = self.proxy
        return None

    def process_response(self, request, response, spider):
        end = time.time() * 1000
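        # If the response status is not 200, re-schedule the current request
        if response.status != 200:
            self.fetcher.proxy_feedback('failure', self.proxy)
            spider.logger.info('Current ip is blocked! The proxy is {}'.format(self.proxy))
            # Switch the current request to the next proxy
            self.proxy = self.get_next_proxy()
            request.meta['proxy'] = self.proxy
            # Bypass the duplicate filter, otherwise the scheduler drops the re-scheduled request
            request.dont_filter = True
            return request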
        else:
            spider.logger.info('Request succeeded! The proxy is {}'.format(self.proxy))
            # if you use greedy strategy, you must feedback
            self.fetcher.proxy_feedback('success', self.proxy, int(end - self.start))
            # not considering transaction
            self.conn.incr(self.success_req, 1)
            self.conn.rpush(self.cur_time, int(end / 1000))
            return response

    def process_exception(self, request, exception, spider):
        spider.logger.error('Request failed! The proxy is {}. Exception: {}'.format(self.proxy, exception))
        # it's important to feedback, otherwise you may use the bad proxy next time
        self.fetcher.proxy_feedback('failure', self.proxy)
        # Switch the current request to the next proxy
        self.proxy = self.get_next_proxy()
        request.meta['proxy'] = self.proxy
        # Bypass the duplicate filter, otherwise the scheduler drops the re-scheduled request
        request.dont_filter = True
        return request

    def get_next_proxy(self):
        # Fetch a usable proxy from the pool
        return self.fetcher.get_proxy()

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
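
One thing to note about the middleware above: a request that keeps failing is re-scheduled with a new proxy every time, with no upper bound. A minimal sketch of a retry cap, kept as a plain function so it can be tried without Scrapy; the meta key proxy_retry_times and the default limit of 5 are assumptions for illustration, not part of HAipproxy:

```python
MAX_PROXY_RETRIES = 5  # assumed limit, tune it for your crawl


def should_retry(meta, max_retries=MAX_PROXY_RETRIES):
    """Increment the retry counter stored in a request's meta dict.

    Returns True while the request may be re-scheduled with a new proxy,
    and False once the limit has been exceeded.
    """
    retries = meta.get("proxy_retry_times", 0) + 1  # hypothetical meta key
    meta["proxy_retry_times"] = retries
    return retries <= max_retries


# Simulate one request failing four times with a limit of two retries.
meta = {}
attempts = [should_retry(meta, max_retries=2) for _ in range(4)]
print(attempts)  # [True, True, False, False]
```

Inside process_response and process_exception, one could call should_retry(request.meta) before returning the request, and give up on the request once it returns False (returning the response, or None in process_exception so that Scrapy continues its normal exception handling).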

Then configure the proxy-related settings in settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'

DOWNLOADER_MIDDLEWARES = {
    'jobbole.middlewares.JobboleProxyMiddleware': 543,
}

REDIS_HOST = "your Redis server address"
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_PASSWORD = "123456"
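
Rather than hard-coding the Redis password, settings.py can also read the connection values from the environment. A small sketch; the environment variable names and the fallback values are assumptions, chosen only for illustration:

```python
import os

# Hypothetical environment variable names; the fallbacks are placeholder defaults.
REDIS_HOST = os.environ.get("REDIS_HOST", "127.0.0.1")
REDIS_PORT = int(os.environ.get("REDIS_PORT", "6379"))  # environment values are strings
REDIS_DB = int(os.environ.get("REDIS_DB", "0"))
REDIS_PASSWORD = os.environ.get("REDIS_PASSWORD", "")
```

The middleware's from_crawler reads these names through crawler.settings.get, so it does not need to change.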

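That covers the use of HAipproxy. The code is commented in detail, so I will not explain it further here.

If anything is unclear, feel free to point it out in the comments!

Next, I plan to write up a complete crawler project built on the HAipproxy proxy pool.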
