Scrapy middleware setup (random User-Agent and random proxy)

Switching a small number of User-Agents
Method 1

 # settings.py
USER_AGENT = ''                  # project-wide User-Agent string
DEFAULT_REQUEST_HEADERS = {}     # default headers merged into every request
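
For reference, a minimal sketch of what those two settings can look like when filled in (the User-Agent string and header values below are illustrative, not required values):

 # settings.py -- illustrative values
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36'
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}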

Method 2

 # in the spider
yield scrapy.Request(url, callback=self.parse, headers={})  # per-request headers override the defaults
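
A minimal sketch of Method 2 inside a spider (the spider name, URL, and User-Agent value are assumptions for illustration):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'  # illustrative name

    def start_requests(self):
        # per-request headers override the project-wide defaults
        headers = {'User-Agent': 'Mozilla/5.0 (compatible; DemoSpider/1.0)'}  # illustrative UA
        yield scrapy.Request('http://example.com/', callback=self.parse, headers=headers)

    def parse(self, response):
        self.logger.info('got %s with status %d', response.url, response.status)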

Large-scale User-Agent switching (middleware)
1. Get a User-Agent
   # Option 1: create a useragents.py holding a large list of User-Agents and pick one at random with the random module (a sketch follows after this step)
   # Option 2: install the fake_useragent module (sudo pip3 install fake_useragent)
       from fake_useragent import UserAgent
       ua_obj = UserAgent()
       ua = ua_obj.random
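
A minimal sketch of Option 1, assuming a useragents.py module kept next to middlewares.py (the file name, the list name USER_AGENT_LIST, and the UA strings are illustrative):

 # useragents.py -- illustrative entries, fill with real User-Agent strings
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
]

 # then, e.g. inside a middleware's process_request:
import random
from .useragents import USER_AGENT_LIST
ua = random.choice(USER_AGENT_LIST)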
2. Create the middleware class in middlewares.py
  from fake_useragent import UserAgent

  class RandomUserAgentDownloaderMiddleware(object):
      def __init__(self):
          # build the UserAgent object once, not on every request
          self.ua = UserAgent()

      def process_request(self, request, spider):
          # overwrite the User-Agent header of each outgoing request
          request.headers['User-Agent'] = self.ua.random
3. Register this downloader middleware in settings.py
  DOWNLOADER_MIDDLEWARES = {'Baidu.middlewares.RandomUserAgentDownloaderMiddleware': 200}  # value is the priority; lower runs earlier in process_request

Random proxy

1. Create a proxies.py in the same directory as middlewares.py to hold the proxy list

proxy_list = [
    "http://127.0.0.1:8888",
    "http://1.1.1.1:8888"
]

2. In middlewares.py, import the proxy list from proxies.py and add the following code

from .proxies import proxy_list
import random

class RandomProxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # attach a randomly chosen proxy to the intercepted request
        proxy = random.choice(proxy_list)
        request.meta["proxy"] = proxy
        print(proxy)

    # a proxy may turn out to be dead after a few tries, so handle download exceptions
    def process_exception(self, request, exception, spider):
        # hand the request back to the scheduler so it runs through process_request
        # again and picks a different proxy; dont_filter keeps the dupefilter
        # from discarding the retried request
        request.dont_filter = True
        return request
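
Note that the middleware above retries unconditionally, so a pool where every proxy is dead would loop forever. A minimal variant of process_exception that caps retries with a counter in request.meta (the meta key proxy_retry_times and the class attribute MAX_PROXY_RETRIES are assumptions for illustration, not Scrapy built-ins):

    MAX_PROXY_RETRIES = 3  # illustrative limit

    def process_exception(self, request, exception, spider):
        retries = request.meta.get('proxy_retry_times', 0)  # hypothetical meta key
        if retries >= self.MAX_PROXY_RETRIES:
            spider.logger.warning("giving up on %s after %d proxy failures", request.url, retries)
            return None  # fall back to Scrapy's normal error handling
        request.meta['proxy_retry_times'] = retries + 1
        request.dont_filter = True
        return request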

3. Register the middlewares in settings.py

DOWNLOADER_MIDDLEWARES = {
   'Baidu.middlewares.BaiduDownloaderMiddleware': 543,
   'Baidu.middlewares.RandomUserAgentDownloaderMiddleware':200,
   'Baidu.middlewares.RandomProxyDownloaderMiddleware':250
}
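
To sanity-check that both middlewares are applied, one option is to point a throwaway spider at httpbin.org, which echoes back the request headers and the origin IP (the spider name is an assumption):

import scrapy

class CheckSpider(scrapy.Spider):
    name = 'check'  # illustrative name
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        # the echoed "User-Agent" header and the "origin" field show which UA and proxy were used
        self.logger.info(response.text)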
