This post only covers using HAipproxy inside the Scrapy crawler framework; it assumes you have already deployed the server side. See the haipproxy docs for detailed usage instructions.
If you run into deployment problems, leave a comment and I'll help when I see it.
Download the source and unpack it: https://github.com/SpiderClub/haipproxy/archive/master.zip
There are two ways to call the IP proxy pool. I used the first: copy the client, config, and utils packages into your crawler project, then fix any resulting import errors.
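Before wiring the client into Scrapy, it's worth a quick standalone check that the copied packages import cleanly and Redis is reachable. A minimal sketch, assuming the Redis connection values below are placeholders for your own deployment (the 'robin'/'greedy' strategy names come from the haipproxy docs):

# standalone smoke test for the haipproxy client, run from your project root
from client.py_cli import ProxyFetcher

# placeholders: substitute your own Redis host/password/db
args = dict(host='127.0.0.1', port=6379, password='123456', db=0)
# 'greedy' prefers the best-scoring proxies; 'robin' rotates round-robin
fetcher = ProxyFetcher('http', strategy='greedy', redis_args=args)
proxy = fetcher.get_proxy()
print('got proxy:', proxy)
# report usage back so the pool can score the proxy (required for 'greedy')
fetcher.proxy_feedback('success', proxy, 1000)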
First, define the proxy downloader middleware in the middlewares.py module:
import time

import redis
from scrapy import signals

from .client.py_cli import ProxyFetcher


class JobboleProxyMiddleware(object):
    def __init__(self, redis_host, redis_port, redis_db, redis_password):
        args = dict(host=redis_host, port=redis_port, password=redis_password, db=redis_db)
        self.scheme = 'http'
        # Redis keys for simple success statistics
        self.success_req = 'http:success:request'
        self.cur_time = 'http:success:time'
        self.request_count = 0
        self.fetcher = ProxyFetcher(self.scheme, strategy='greedy', redis_args=args)
        self.conn = redis.StrictRedis(host=redis_host, port=redis_port, db=redis_db, password=redis_password)
        self.proxy = self.get_next_proxy()
        self.start = time.time() * 1000

    @classmethod
    def from_crawler(cls, crawler):
        s = cls(
            redis_host=crawler.settings.get("REDIS_HOST"),
            redis_port=crawler.settings.get("REDIS_PORT"),
            redis_db=crawler.settings.get("REDIS_DB"),
            redis_password=crawler.settings.get("REDIS_PASSWORD"),
        )
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # note: this shared timer is only approximate once requests run concurrently
        self.start = time.time() * 1000
        self.request_count += 1
        if self.request_count == 1:
            spider.logger.info("First request is sent without a proxy!")
        else:
            # attach the proxy to the request
            spider.logger.info("this is request ip:" + self.proxy)
            request.meta['proxy'] = self.proxy
        return None

    def process_response(self, request, response, spider):
        end = time.time() * 1000
        # if the response status is not 200, reschedule the current request
        if response.status != 200:
            self.fetcher.proxy_feedback('failure', self.proxy)
            spider.logger.info('Current ip is blocked! The proxy is {}'.format(self.proxy))
            # switch the current request to the next proxy
            self.proxy = self.get_next_proxy()
            request.meta['proxy'] = self.proxy
            # without this, the scheduler's dupefilter would drop the retried request
            request.dont_filter = True
            return request
        else:
            spider.logger.info('Request succeeded! The proxy is {}'.format(self.proxy))
            # if you use the greedy strategy, you must feed back the response time
            self.fetcher.proxy_feedback('success', self.proxy, int(end - self.start))
            # not using a transaction here; the stats are best-effort
            self.conn.incr(self.success_req, 1)
            self.conn.rpush(self.cur_time, int(end / 1000))
            return response

    def process_exception(self, request, exception, spider):
        spider.logger.error('Request failed! The proxy is {}. Exception: {}'.format(self.proxy, exception))
        # it's important to feed back failures, otherwise you may get the same bad proxy next time
        self.fetcher.proxy_feedback('failure', self.proxy)
        # switch the current request to the next proxy
        self.proxy = self.get_next_proxy()
        request.meta['proxy'] = self.proxy
        request.dont_filter = True
        return request

    def get_next_proxy(self):
        # fetch one usable proxy from the pool
        return self.fetcher.get_proxy()

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
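As a side note, the success_req and cur_time keys the middleware increments can be read back at any time to gauge throughput. A small sketch, assuming the same placeholder Redis values as in the settings below:

# read back the stats the middleware writes (keys defined in __init__ above)
import redis

# placeholders: substitute your own REDIS_HOST / REDIS_PASSWORD
conn = redis.StrictRedis(host='127.0.0.1', port=6379, db=0, password='123456')
total = int(conn.get('http:success:request') or 0)
stamps = conn.lrange('http:success:time', 0, -1)  # unix seconds, one per success
print('successful proxied requests:', total)
if len(stamps) >= 2:
    span = int(stamps[-1]) - int(stamps[0])
    if span > 0:
        print('avg success rate: {:.2f} req/s'.format(len(stamps) / span))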
Then add the proxy-related settings to the settings.py module:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'
DOWNLOADER_MIDDLEWARES = {
'jobbole.middlewares.JobboleProxyMiddleware': 543,
}
REDIS_HOST = "你的redis服务器地址"
REDIS_PORT = 6379
REDIS_DB = 0
REDIS_PASSWORD = "123456"
That's all there is to using HAipproxy; the code above is commented in detail, so I won't elaborate further.
If you have questions, point them out in the comments!
Next, I'll write up a complete crawler project built on the HAipproxy proxy pool...