Spider middleware
——sits between the spider and the engine
Role: intercepts requests that have not yet been deduplicated, intercepts responses, and can also intercept items
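For reference, a minimal spider-middleware sketch following Scrapy's standard interface (the class name is made up; the two hooks are where responses going in and items/requests coming out can be inspected or dropped):

class MySpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        # Every response on its way into the spider passes through here
        return None

    def process_spider_output(self, response, result, spider):
        # Every item and request the spider yields passes through here,
        # before the engine (and, for requests, the dupe filter) sees it
        for i in result:
            yield i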
Downloader middleware
——sits between the engine and the downloader
Role: intercepts all requests and responses in bulk
Why intercept requests:
———to spoof the UA, so each request carries a different identity
———to change the IP a request goes out from (via a proxy)
Why intercept responses:
———to tamper with the response data, or to replace the response object altogether
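A hedged sketch of the response side: process_response must return a Response (or a Request to retry), so returning a freshly built HtmlResponse replaces what the downloader fetched. The trigger condition and the render_with_browser helper below are illustrative assumptions, not part of the original post:

from scrapy.http import HtmlResponse

class TamperResponseMiddleware(object):
    def process_response(self, request, response, spider):
        # Illustrative condition: detect a page that needs re-rendering
        if b'window._NEEDS_JS' in response.body:
            # render_with_browser is a hypothetical helper (e.g. a headless
            # browser call) that returns the fully rendered page bytes
            new_body = render_with_browser(request.url)
            return HtmlResponse(url=request.url, body=new_body,
                                encoding='utf-8', request=request)
        return response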
On the request side, the downloader middleware builds a UA pool and rotates it:

import random

class MiddlewareDownloaderMiddleware(object):
    # Build a UA pool
    user_agents_list = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Opera/8.0 (Windows NT 5.1; U; en)',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
    ]

    def process_request(self, request, spider):
        # Pick a random UA for each outgoing request
        # (the header name is 'User-Agent', not 'User_Agent')
        request.headers['User-Agent'] = random.choice(self.user_agents_list)
        return None
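The middleware only takes effect once it is registered in settings; a minimal sketch, assuming the project module is named middleware (inferred from the class name) and Scrapy's template priority of 543:

# settings.py -- enable the custom downloader middleware
# (adjust the dotted path to your project's actual module layout)
DOWNLOADER_MIDDLEWARES = {
    'middleware.middlewares.MiddlewareDownloaderMiddleware': 543,
}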
Besides rotating the UA, the same process_request hook can also reroute the request through a proxy:

    def process_request(self, request, spider):
        # Route the request through a proxy; the format is
        # request.meta['proxy'] = 'http://<ip>:<port>' (or 'https://...')
        request.meta['proxy'] = 'http://223.199.24.230'
        print(request.meta['proxy'])
        return None
You can also build an IP pool and rotate proxies at random, as sketched below.
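A minimal sketch of that idea (the proxy addresses are placeholders; keeping separate pools for http and https targets is one common layout):

import random

class RandomProxyMiddleware(object):
    # Illustrative proxy pools -- replace with live proxies
    PROXY_HTTP = ['http://203.0.113.10:8080', 'http://203.0.113.11:3128']
    PROXY_HTTPS = ['https://203.0.113.12:8080', 'https://203.0.113.13:3128']

    def process_request(self, request, spider):
        # Match the proxy scheme to the target URL's scheme
        if request.url.startswith('https:'):
            request.meta['proxy'] = random.choice(self.PROXY_HTTPS)
        else:
            request.meta['proxy'] = random.choice(self.PROXY_HTTP)
        return None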
Source: https://blog.csdn.net/fuck487/article/details/84617194
Three ways to set cookies:
——Method 1: set the cookie in the settings file
When COOKIES_ENABLED is left commented out, it defaults to True, i.e. Scrapy's cookie middleware is enabled and manages cookies automatically.
When COOKIES_ENABLED is False, the cookie middleware is disabled, and Scrapy sends the raw Cookie header defined in settings as-is.
When COOKIES_ENABLED is True, Scrapy overrides the Cookie header from settings and uses the custom cookies you supply instead.
So when using custom cookies, COOKIES_ENABLED must be True; setting it to False without supplying a cookie in settings may cause page fetches to fail.
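A minimal settings.py sketch of method 1 (the cookie value is a placeholder):

# settings.py -- send a fixed Cookie header with every request.
# With COOKIES_ENABLED = False the cookie middleware is off, so the raw
# header below is sent exactly as written.
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'Cookie': 'name1=value1; name2=value2',  # placeholder cookie string
}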
——Method 2: set the cookie in middlewares
Add the cookie in the downloader middleware's process_request hook (for request.cookies to take effect, this middleware must run before the built-in CookiesMiddleware, i.e. with a priority below 700, and COOKIES_ENABLED must be True):
    def process_request(self, request, spider):
        # Rotate the UA as before
        request.headers['User-Agent'] = random.choice(self.user_agents_list)
        # Cookie string copied from the browser (one long 'name=value; ...' header)
        cookie_str = (
            'bid=hj4uGSMjAuc; douban-fav-remind=1; '
            'trc_cookie_storage=taboola%2520global%253Auser-id%3D5f44a1c4-23d4-4138-9b9a-3b731c88e3f7-tuct4b44e0b; '
            'll="108318"; __gads=ID=e218d585da77f9d6:T=1581139603:S=ALNI_MailAZtg45yojaAjqe9-kUjOG41DA; '
            '_pk_ref.100001.8cb4=%5B%22%22%2C%22%22%2C1582346889%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DbS9RsvNfJlPDlQHuH7b2QAAjy9U19zAetSwrWiIIlaVvmiko4QHqEjS_-rJNlvZZ%26wd%3D%26eqid%3Df1c82df50022eb9a000000035e50b284%22%5D; '
            '_pk_id.100001.8cb4=bf0ff7d7ac0b8210.1573482065.4.1582346889.1581139600.; '
            '_pk_ses.100001.8cb4=*; __utma=30149280.1093727161.1573482079.1573482079.1582346890.2; __utmc=30149280; '
            '__utmz=30149280.1582346890.2.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmt=1; __utmb=30149280.1.10.1582346890'
        )
        # request.cookies expects a {name: value} dict, not the raw header
        # string, so split the browser string into individual cookies
        request.cookies = dict(
            pair.split('=', 1) for pair in cookie_str.split('; ')
        )
        return None
——Method 3: in the spider file, override the start_requests method and pass cookies through the cookies parameter of scrapy's Request
from scrapy import Request

# Override the start_requests method
def start_requests(self):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0"}
    # Specify the cookies
    cookies = {
        'uuid': '66a0f5e7546b4e068497.1542881406.1.0.0',
        '_lxsdk_cuid': '1673ae5bfd3c8-0ab24c91d32ccc8-143d7240-144000-1673ae5bfd4c8',
        '__mta': '222746148.1542881402495.1542881402495.1542881402495.1',
        'ci': '20',
        'rvct': '20%2C92%2C282%2C281%2C1',
        '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
        '_lxsdk_s': '1674f401e2a-d02-c7d-438%7C%7C35'}
    # Request the detail page (detailUrl and item come from the surrounding
    # spider code); callback names the method that parses the response,
    # dont_filter=True skips the scheduler's duplicate filter, and meta
    # passes data on to the callback
    yield Request(detailUrl, headers=headers, cookies=cookies, callback=self.detail_parse, meta={'myItem': item}, dont_filter=True)
Finally, make sure COOKIES_ENABLED is True in settings (the default when the line is commented out); if it is set to False, the cookies passed to Request are ignored.