Scrapy: a first look at middleware (rotating the User-Agent, rotating the IP, changing cookies)

  • Middleware
    • Randomly rotating the User-Agent
    • Randomly rotating the IP
    • Changing cookies

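Middleware

Spider middleware
Location: between the spider and the engine.
Purpose: intercepts requests that have not yet been deduplicated, as well as responses; it can also intercept items.

Downloader middleware
Location: between the downloader and the engine.
Purpose: intercepts every request and every response in bulk.
Why intercept requests:
  • To spoof the User-Agent (UA), so that requests present different identities
  • To change the IP address a request is sent from (by routing it through a proxy)
Why intercept responses:
  • To modify the response data or replace the response object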
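
The two interception points described above can be sketched roughly as follows. This is a minimal illustration, not Scrapy's own code: `FakeRequest` is a hypothetical stand-in for `scrapy.Request` so the snippet runs without Scrapy installed, and the class name `SketchDownloaderMiddleware` and the placeholder User-Agent value are assumptions made for the example.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request, used only in this sketch."""

    def __init__(self, url):
        self.url = url
        self.headers = {}
        self.meta = {}


class SketchDownloaderMiddleware:
    """Illustrative downloader middleware; the class name is made up."""

    def process_request(self, request, spider):
        # Called for each request on its way from the engine to the downloader.
        request.headers['User-Agent'] = 'sketch-agent/0.1'  # placeholder value
        return None  # returning None lets the request continue down the chain

    def process_response(self, request, response, spider):
        # Called for each response on its way back to the engine.
        # The response could be inspected or replaced here before returning it.
        return response


mw = SketchDownloaderMiddleware()
req = FakeRequest('https://example.com')
result = mw.process_request(req, spider=None)
print(result, req.headers)
```

In a real project the middleware only takes effect after it is registered in the DOWNLOADER_MIDDLEWARES setting, keyed by the import path of the class in your own project.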

Randomly rotating the User-Agent

import random


class MiddlewareDownloaderMiddleware(object):
    # Build the User-Agent pool
    user_agents_list = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Opera/8.0 (Windows NT 5.1; U; en)',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ',
    ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for each request (the header name uses a hyphen, not an underscore)
        request.headers['User-Agent'] = random.choice(self.user_agents_list)
        return None

Once the UA pool is in place, every request is sent with a randomly chosen User-Agent.

Randomly rotating the IP

def process_request(self, request, spider):
    # Set inside process_request of the downloader middleware.
    # Format: request.meta['proxy'] = 'http://<ip address>' or 'https://<ip address>'
    request.meta['proxy'] = 'http://223.199.24.230'
    print(request.meta['proxy'])
    return None

You can also build an IP pool and pick a proxy from it at random for each request.
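
A proxy pool can follow the same pattern as the UA pool. The sketch below is one possible shape, under stated assumptions: the class name `RandomProxyMiddleware` is made up, `FakeRequest` is a hypothetical stand-in for `scrapy.Request` so the snippet runs on its own, and the addresses come from a reserved documentation range, so they are placeholders rather than working proxies.

```python
import random


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request, used only in this sketch."""

    def __init__(self, url):
        self.url = url
        self.headers = {}
        self.meta = {}


class RandomProxyMiddleware:
    """Illustrative middleware that picks a proxy at random per request."""

    # Placeholder addresses (reserved documentation range), not real proxies.
    proxy_pool = [
        'http://192.0.2.10:8080',
        'http://192.0.2.11:8080',
        'http://192.0.2.12:8080',
    ]

    def process_request(self, request, spider):
        # Scrapy reads the proxy to use from request.meta['proxy'].
        request.meta['proxy'] = random.choice(self.proxy_pool)
        return None


mw = RandomProxyMiddleware()
req = FakeRequest('https://example.com')
mw.process_request(req, spider=None)
print(req.meta['proxy'])
```

Keeping the pool as a class attribute mirrors the UA example; a real crawler would probably load and refresh the list from a proxy provider instead of hard-coding it.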

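Changing cookies

Original: https://blog.csdn.net/fuck487/article/details/84617194
Three ways to change cookies:
Method 1: set the cookie in the settings file
  • When COOKIES_ENABLED is commented out, Scrapy falls back to its default value, True, so the built-in cookie middleware is active.
  • When COOKIES_ENABLED is False, the cookie middleware is disabled and Scrapy sends the Cookie header configured in settings (for example in DEFAULT_REQUEST_HEADERS) as-is.
  • When COOKIES_ENABLED is True, the cookie middleware manages cookies itself: the Cookie header from settings is no longer relied on, and the custom cookies set on each request are used instead.

So to use custom cookies, COOKIES_ENABLED must be True. If it is False and no Cookie header is supplied in settings either, pages that require cookies may fail to load.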
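
For the settings-based approach, a settings.py fragment might look roughly like this. It is a sketch only: the cookie names and values are placeholders, and it assumes the cookie is supplied through the Cookie entry of DEFAULT_REQUEST_HEADERS.

```python
# settings.py (fragment)

# Disable the cookie middleware so the Cookie header below is sent unchanged.
COOKIES_ENABLED = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # Placeholder cookie string; paste the value copied from your browser here.
    'Cookie': 'name1=value1; name2=value2',
}
```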

Method 2: set the cookie in middlewares
Add the cookies in the process_request function of the downloader middleware. The cookies must be a dict that maps each cookie name to its value, not a single 'cookie' key holding the whole string.

    def process_request(self, request, spider):
        # Pick a random User-Agent for each request
        request.headers['User-Agent'] = random.choice(self.user_agents_list)
        # One entry per cookie: name -> value
        request.cookies = {
            'bid': 'hj4uGSMjAuc',
            'douban-fav-remind': '1',
            'trc_cookie_storage': 'taboola%2520global%253Auser-id%3D5f44a1c4-23d4-4138-9b9a-3b731c88e3f7-tuct4b44e0b',
            'll': '"108318"',
            '__gads': 'ID=e218d585da77f9d6:T=1581139603:S=ALNI_MailAZtg45yojaAjqe9-kUjOG41DA',
            '_pk_ref.100001.8cb4': '%5B%22%22%2C%22%22%2C1582346889%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DbS9RsvNfJlPDlQHuH7b2QAAjy9U19zAetSwrWiIIlaVvmiko4QHqEjS_-rJNlvZZ%26wd%3D%26eqid%3Df1c82df50022eb9a000000035e50b284%22%5D',
            '_pk_id.100001.8cb4': 'bf0ff7d7ac0b8210.1573482065.4.1582346889.1581139600.',
            '_pk_ses.100001.8cb4': '*',
            '__utma': '30149280.1093727161.1573482079.1573482079.1582346890.2',
            '__utmc': '30149280',
            '__utmz': '30149280.1582346890.2.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic',
            '__utmt': '1',
            '__utmb': '30149280.1.10.1582346890',
        }
        return None
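
Cookies copied from a browser usually arrive as one 'name=value; name=value' string, while Scrapy expects a dict of cookie names to values. A small helper along these lines can do the conversion; `parse_cookie_header` is a hypothetical name introduced for this sketch, not part of Scrapy.

```python
def parse_cookie_header(raw):
    """Turn a raw 'a=1; b=2' cookie string into a {name: value} dict."""
    cookies = {}
    for pair in raw.split(';'):
        # Split on the first '=' only, since cookie values may contain '='.
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


raw = 'bid=hj4uGSMjAuc; douban-fav-remind=1; ll="108318"'
cookies = parse_cookie_header(raw)
print(cookies)
```

The resulting dict can then be assigned to request.cookies in a middleware or passed as the cookies argument of a Request.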

Method 3: in the spider file, override the start_requests method and pass the cookies as an argument to Scrapy's Request

    # Override start_requests (requires: from scrapy import Request)
    def start_requests(self):
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0"}
        # Cookies to send with the request
        cookies = {
            'uuid': '66a0f5e7546b4e068497.1542881406.1.0.0',
            '_lxsdk_cuid': '1673ae5bfd3c8-0ab24c91d32ccc8-143d7240-144000-1673ae5bfd4c8',
            '__mta': '222746148.1542881402495.1542881402495.1542881402495.1',
            'ci': '20',
            'rvct': '20%2C92%2C282%2C281%2C1',
            '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
            '_lxsdk_s': '1674f401e2a-d02-c7d-438%7C%7C35'}

        # detailUrl and item are assumed to be defined elsewhere in the spider.
        # Request the detail page and register the callback; dont_filter=True skips
        # the duplicate-request filter; meta passes data on to the callback.
        yield Request(detailUrl, headers=headers, cookies=cookies, callback=self.detail_parse, meta={'myItem': item}, dont_filter=True)

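Finally, make sure COOKIES_ENABLED is True in settings (or left at its default). The cookies argument of Request is handled by the cookie middleware, so with COOKIES_ENABLED set to False these cookies would not be sent.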
