This series of articles is a simple tutorial on Python web-scraping techniques. Writing it helps me consolidate my own knowledge, and if it also happens to be useful to you, all the better.
The Python version used is 3.7.4.
This article mainly covers the Download Middlewares (downloader middleware) module of the Scrapy framework. Downloader middleware is mainly used to get around a site's anti-scraping measures. The main contents are:
Downloader middleware sits between the engine and the downloader. In this middleware we can set proxies, swap out request headers, and so on, in order to defeat anti-scraping measures. To write a downloader middleware, you can implement two methods in the middleware class. One is process_request(self, request, spider), which is executed before a request is sent; the other is process_response(self, request, response, spider), which is executed before the downloaded data reaches the engine.
process_request(self, request, spider) is executed before the downloader sends the request; this is typically where you set a random proxy, random request headers, and so on. If process_request returns a Response object, that response is returned directly without hitting the downloader, and the process_response() methods of the enabled middlewares are then called for every response. process_response(self, request, response, spider) is executed while the data downloaded by the downloader is being passed back to the engine. There is also a process_exception(self, request, exception, spider) method, which is called when an exception is raised during the download.
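A minimal skeleton pulling these hooks together (the class name is a placeholder; the return-value semantics follow Scrapy's documented middleware contract):

class DemoDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Runs before the request reaches the downloader.
        # Return None to continue processing, a Response to short-circuit
        # the download, a Request to reschedule, or raise IgnoreRequest.
        return None

    def process_response(self, request, response, spider):
        # Runs on the way back to the engine; must return a Response,
        # a Request, or raise IgnoreRequest.
        return response

    def process_exception(self, request, exception, spider):
        # Runs when the downloader or a process_request() raises an
        # exception; return None to fall through to other middlewares.
        return None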
When a crawler visits a page frequently with the same request headers, the server can easily spot it and block those requests. So we should randomize the request headers before each visit to keep the crawler from being caught. This can be done in a downloader middleware: before each request is sent to the server, pick a header at random, so we are not always using the same one. The development steps are as follows. First, the spider code:
import scrapy

class HttpbinSpider(scrapy.Spider):
    name = 'httpbin'
    allowed_domains = ['httpbin.org']
    start_urls = ['https://httpbin.org/user-agent']

    def parse(self, response):
        print(response.text)
        # dont_filter=True bypasses the duplicate filter so the same URL
        # can be requested over and over again
        yield scrapy.Request(self.start_urls[0], dont_filter=True)
The middlewares.py code is as follows:

import random
class UseragentDemoDownloaderMiddleware(object):
    # A pool of real-world User-Agent strings to pick from
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux i686; rv:64.0) Gecko/20100101 Firefox/64.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0',
        'Mozilla/5.0 (X11; Linux i586; rv:63.0) Gecko/20100101 Firefox/63.0',
        'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0',
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
In the settings.py configuration file, enable the downloader middleware:

DOWNLOADER_MIDDLEWARES = {
    'useragent_demo.middlewares.UseragentDemoDownloaderMiddleware': 543,
}
Run the spider and you will see a different User-Agent printed for each request.
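For reference, a run looks roughly like this (assuming the project is named useragent_demo, as in the middleware path above); httpbin.org/user-agent simply echoes the header we sent back as JSON:

$ scrapy crawl httpbin
{"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"}
{"user-agent": "Mozilla/5.0 (X11; Linux i686; rv:64.0) Gecko/20100101 Firefox/64.0"}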
Scrapy itself provides a User-Agent middleware class, UserAgentMiddleware:

scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
The user agent is set through the USER_AGENT configuration option:

settings.py

#...
# UserAgentMiddleware is enabled by default
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
#...
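Since USER_AGENT is an ordinary setting, it can also be overridden for a single spider via custom_settings, a standard Scrapy mechanism; a minimal sketch (the spider name and UA string are just illustrations):

import scrapy

class UaOverrideSpider(scrapy.Spider):
    name = 'ua_override'
    # Per-spider override of the project-wide USER_AGENT; the built-in
    # UserAgentMiddleware picks this value up automatically.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    }
    start_urls = ['https://httpbin.org/user-agent']

    def parse(self, response):
        print(response.text)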
Proxy IPs were covered many times in the earlier basic tutorials of this series, so I won't repeat that here. Scrapy provides a proxy-server middleware class, HttpProxyMiddleware:

scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware

It sets the proxy server for a request via request.meta['proxy'], and falls back to the environment variables http_proxy, https_proxy, and no_proxy, in that order. There are two kinds of proxies we can use: open proxies (just an IP and port, usable without a username or password) and private proxies, which require a username and password to access.
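To make that concrete, a minimal sketch of both entry points (the proxy address is a placeholder and not guaranteed to be alive):

import scrapy

class ProxyMetaDemoSpider(scrapy.Spider):
    # Hypothetical spider, only to show where meta['proxy'] goes.
    name = 'proxy_meta_demo'

    def start_requests(self):
        # Option 1: per-request proxy via meta; the built-in
        # HttpProxyMiddleware (enabled by default) honours this key.
        yield scrapy.Request('http://httpbin.org/ip',
                             meta={'proxy': 'http://202.109.157.47:9000'})

    def parse(self, response):
        print(response.text)

# Option 2: process-wide, via an environment variable read at startup:
#   $ export http_proxy=http://202.109.157.47:9000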
We will use the response of http://httpbin.org/ip to develop the test example.
The middleware class code is as follows:
import random

class IpProxyDownloaderMiddleware(object):
    # A pool of open proxies (IP:port, no credentials needed)
    PROXIES = [
        '202.109.157.47:9000',
        '115.28.148.192:8118',
    ]

    def process_request(self, request, spider):
        # Pick a random proxy for every outgoing request
        proxy = random.choice(self.PROXIES)
        print(proxy)
        # Note the scheme prefix -- see the error discussed below
        request.meta['proxy'] = "http://" + proxy
The spider code is as follows:
import scrapy

class IpproxySpider(scrapy.Spider):
    name = 'ipproxy'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.text)
        yield scrapy.Request(self.start_urls[0], dont_filter=True)
Enable the middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    # 'useragent_demo.middlewares.UseragentDemoDownloaderMiddleware': 543,
    'useragent_demo.middlewares.IpProxyDownloaderMiddleware': 544,
}
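With a working proxy from the pool, a successful run looks roughly like this (httpbin.org/ip echoes the caller's IP, so the origin should match the chosen proxy; the sample proxies above may well be dead by the time you try them):

$ scrapy crawl ipproxy
202.109.157.47:9000
{"origin": "202.109.157.47"}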
Some readers, however, may run into an error like this:
[scrapy.core.scraper] ERROR: Error downloading <GET http://httpbin.org/ip>
Traceback (most recent call last):
File "D:\Python\Python37\lib\site-packages\twisted\internet\defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "D:\Python\Python37\lib\site-packages\twisted\python\failure.py", line 512, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "D:\Python\Python37\lib\site-packages\scrapy\core\downloader\middleware.py", line 44, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
File "D:\Python\Python37\lib\site-packages\scrapy\utils\defer.py", line 45, in mustbe_deferred
result = f(*args, **kw)
File "D:\Python\Python37\lib\site-packages\scrapy\core\downloader\handlers\__init__.py", line 71, in download_request
return handler.download_request(request, spider)
File "D:\Python\Python37\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 68, in download_request
return agent.download_request(request)
File "D:\Python\Python37\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 332, in download_request
method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
File "D:\Python\Python37\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 253, in request
proxyEndpoint = self._getEndpoint(self._proxyURI)
File "D:\Python\Python37\lib\site-packages\twisted\web\client.py", line 1715, in _getEndpoint
return self._endpointFactory.endpointForURI(uri)
File "D:\Python\Python37\lib\site-packages\twisted\web\client.py", line 1593, in endpointForURI
raise SchemeNotSupported("Unsupported scheme: %r" % (uri.scheme,))
twisted.web.error.SchemeNotSupported: Unsupported scheme: b''
This error occurs because the proxy was set in the middleware without a transport scheme. When you set the proxy in a downloader middleware, the address must include the scheme name, "http://" or "https://"; adding it resolves the problem.
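In other words, inside process_request:

# Wrong: no scheme, so Twisted raises SchemeNotSupported
request.meta['proxy'] = '202.109.157.47:9000'
# Right: include the transport scheme
request.meta['proxy'] = 'http://202.109.157.47:9000'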
To use a private proxy (one that requires a username and password), modify the middleware code from the example above as follows:
import base64

class IpProxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Your private proxy's IP and port, and its credentials
        proxy = 'ip:port'
        user_password = 'username:password'
        request.meta['proxy'] = "http://" + proxy
        # Base64-encode the credentials (bytes in, bytes out) ...
        b64_user_password = base64.b64encode(user_password.encode('utf-8'))
        # ... and send them in the Proxy-Authorization header
        request.headers['Proxy-Authorization'] = 'Basic ' + b64_user_password.decode('utf-8')
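As an aside, recent Scrapy versions let the built-in HttpProxyMiddleware split credentials embedded in the proxy URL and set the Proxy-Authorization header itself, so the same effect may be achievable without building the header by hand; a hedged sketch (verify against your Scrapy version; 'username', 'password', 'ip' and 'port' are placeholders):

class IpProxyEmbeddedCredsMiddleware(object):
    def process_request(self, request, spider):
        # HttpProxyMiddleware extracts user:password from this URL and
        # adds the Proxy-Authorization header on our behalf (behaviour
        # depends on the Scrapy version -- verify before relying on it).
        request.meta['proxy'] = 'http://username:password@ip:port'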