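Scrapy: Dynamically Setting the User-Agent and an IP Proxy Pool

User-Agent pool and IP proxy pool

1. User-Agent pool

The UA pool, also called the user-agent pool, adds a User-Agent field to the HTTP request headers so that requests sent to the server look as if they come from a browser. It acts as a disguise and is one of the more important ways to get past anti-crawling checks.

When requests arrive quickly, most servers check the User-Agent first. Scrapy's default User-Agent identifies the crawler as Scrapy (it has the form Scrapy/VERSION (+https://scrapy.org)), so it needs to be overridden with a browser header such as: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1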
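For a single fixed browser header, Scrapy's `USER_AGENT` setting in settings.py is enough. A minimal sketch, reusing the example string above:

```python
# settings.py
# Replace Scrapy's default User-Agent with a browser-like one.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
)
```

Every request then carries the same header, which is why a rotating list is usually the better option.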

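Ideally the User-Agent changes randomly and automatically.

For each page, one entry is picked at random from a predefined list of user-agent strings.

Add the following to settings.py: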

DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in User-Agent middleware.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'qiubai_proj.middlewares.RotateUserAgentMiddleware': 400,
}

Also add the USER_AGENT_LIST setting to settings.py:

USER_AGENT_LIST = [
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
      "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
      "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

In middlewares.py, add the middleware class RotateUserAgentMiddleware:

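import random
# Older Scrapy releases used the scrapy.contrib path:
# from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

# middlewares.py sits in the same package as settings.py, so import relatively.
from .settings import USER_AGENT_LIST


class RotateUserAgentMiddleware(UserAgentMiddleware):
    '''
    User-Agent middleware (runs as a downloader middleware).
    '''

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request.
        user_agent = random.choice(USER_AGENT_LIST)
        if user_agent:
            # setdefault only fills the header in when it is not already set.
            request.headers.setdefault('User-Agent', user_agent)
            print(f"User-Agent:{user_agent}")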

After running the spider, the chosen User-Agent is printed for each request.
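One detail of the middleware is worth spelling out: `setdefault` leaves an existing header alone, so a request that already carries a User-Agent is not rotated. The sketch below shows this with a plain dict standing in for `request.headers`; `POOL` and `apply_user_agent` are hypothetical names used only for illustration.

```python
import random

# Hypothetical two-entry pool, for illustration only.
POOL = ["UA-one", "UA-two"]


def apply_user_agent(headers, pool=POOL):
    """Mimic the middleware: set User-Agent only if the header is absent."""
    headers.setdefault("User-Agent", random.choice(pool))
    return headers


fresh = apply_user_agent({})
preset = apply_user_agent({"User-Agent": "already-set"})
print(fresh["User-Agent"] in POOL)   # True
print(preset["User-Agent"])          # already-set
```

If every request should be rotated regardless, assigning with `request.headers['User-Agent'] = user_agent` in the middleware would overwrite the existing value instead.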

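The USER_AGENT_LIST shown above can be kept in the settings.py configuration file.

2. IP proxy pool

When scraping sites with Python, you easily run into anti-crawling restrictions, and using IP proxies is one of the main ways around them. Many IP proxies can be found online, but stable ones are costly, so it is worth building your own proxy pool from free proxies.

Some free IP proxy sites: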

http://www.thebigproxylist.com/

http://www.xicidaili.com/

telnet: checks whether a given port on a remote server is reachable

telnet <IP address> <port>
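The same reachability check can be scripted. A minimal sketch using Python's standard `socket` module; `port_is_open` is a hypothetical helper name, and a successful TCP connection only shows that the port accepts connections, not that the proxy actually forwards traffic.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, `port_is_open("187.65.49.137", 3128)` plays the role of `telnet 187.65.49.137 3128`.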

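A ready-made txt file can be downloaded directly from
http://www.thebigproxylist.com/

After downloading it, try crawling the Baidu home page through different proxies.

The API endpoint this proxy site provides for fetching the proxy list is:

http://www.thebigproxylist.com/members/proxy-api.php?output=all&user=list&pass=8a544b2637e7a45d1536e34680e11adf

The endpoint can only be fetched by a request that looks like it comes from a browser, so the script below uses the requests library with a browser User-Agent header.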

# -*- coding: utf-8 -*-
__author__ = 'zhougy'
__date__ = '2018/9/7 2:32 PM'

import queue
import threading
import time
from threading import Lock

import requests

# Serializes writes to the output file across checker threads.
g_lock = Lock()

# Number of proxy-checking worker threads.
n_thread = 10

# Browser-like headers so the proxy-list site accepts the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko)"
                  " Chrome/68.0.3440.106 Safari/537.36",
}


def fetch_web_data(url, proxy=None, timeout=10):
    '''
    Fetch url (optionally through a proxy) and return the body text, or None on any error.
    '''
    try:
        r = requests.get(url, timeout=timeout, headers=headers, proxies=proxy)
        return r.text
    except Exception as e:
        print(f"fetch_web_data failed for url: {url} ({e})")
        return None


def write_ip_pair(ip_pair):
    '''
    Append a working IP:port pair to proxy_ip_list_<date>.txt.
    :param ip_pair: proxy address in "ip:port" form
    :return: None
    '''
    proxy_file_name = "proxy_ip_list_%s.txt" % time.strftime("%Y.%m.%d")
    with open(proxy_file_name, "a+", encoding="utf-8") as f:
        f.write(f"{ip_pair}\n")

class IpProxyCheckThread(threading.Thread):
    '''Worker that takes proxy entries off the queue and keeps the ones that work.'''

    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.__queue = queue

    def run(self):
        while True:
            data = self.__queue.get()
            try:
                ip_port_pair = data.split(",")[0].strip()
                if not ip_port_pair:
                    # Skip blank lines in the downloaded list.
                    continue
                print(f"checking ip {ip_port_pair}")
                proxy = {
                    "http": ip_port_pair,
                }
                url = "http://httpbin.org/get?x=2&y=4"
                result = fetch_web_data(url, proxy=proxy, timeout=15)
                if result is None:
                    print(f"ip {ip_port_pair} failed the check, discarding")
                    continue
                print(f"ip {ip_port_pair} passed the check, usable")
                with g_lock:
                    write_ip_pair(ip_port_pair)
            finally:
                # Mark the item done so that mq.join() in process() can return.
                self.__queue.task_done()


class FetchProxyListThread(threading.Thread):
    '''Downloads the proxy list and puts one entry per line on the queue.'''

    def __init__(self, url, mq):
        threading.Thread.__init__(self)
        self.__url = url
        self.__mq = mq

    def run(self):
        data = fetch_web_data(self.__url)
        if data is None:
            # Download failed; there is nothing to queue.
            return
        print(data)
        for ip_pool in data.split("\n"):
            self.__mq.put(ip_pool)


def process():
    mq = queue.Queue()

    thread_list = []
    for i in range(n_thread):
        t = IpProxyCheckThread(mq)
        # Daemon workers loop forever; they exit together with the main thread.
        t.daemon = True
        thread_list.append(t)

    for t in thread_list:
        t.start()

    url = "http://www.thebigproxylist.com/members/proxy-api.php?output=all&user=list&pass=8a544b2637e7a45d1536e34680e11adf"
    fth = FetchProxyListThread(url, mq)
    fth.start()

    # Wait until the list has been queued, then until every entry has been checked.
    # Joining the worker threads themselves would block forever, since they never return.
    fth.join()
    mq.join()


if __name__ == "__main__":
    process()

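3. Adding the IP proxy pool to Scrapy

The day05 section described how to obtain a set of IPs with good network connectivity.

(1) Put the usable IP:port pairs filtered out earlier into a list (they can be read from the file and loaded into a list)

Create a file my_proxies.py with contents roughly like the following (the entries can be changed at any time):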

PROXY =[
"187.65.49.137:3128",
"108.61.246.98:8088",
"167.99.197.73:8080",
]
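Rather than hard-coding the entries, the list could be built from the files the checker script writes. A minimal sketch, assuming those files hold one ip:port per line and live in the working directory; `load_proxies` is a hypothetical helper, not part of the original project.

```python
import glob


def load_proxies(pattern="proxy_ip_list_*.txt"):
    """Collect unique ip:port entries from every file matching pattern."""
    proxies = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                entry = line.strip()
                # Skip blank lines and entries already collected.
                if entry and entry not in proxies:
                    proxies.append(entry)
    return proxies


# Empty list if no matching file exists in the working directory.
PROXY = load_proxies()
```

Because the default pattern is relative, the result depends on the directory the crawl is started from; passing an absolute pattern avoids that.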

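(2) Add the IP proxy pool middleware to middlewares.py

import random
from . import my_proxies


class MyProxyMidleware(object):

    def process_request(self, request, spider):
        # Pick a random proxy for each request. Scrapy's documented form for
        # the proxy meta key includes the scheme, e.g. http://host:port.
        request.meta['proxy'] = "http://" + random.choice(my_proxies.PROXY)

(3) Register the middleware in the settings file. If DOWNLOADER_MIDDLEWARES is already defined there, add this entry to the existing dict, because a second assignment would replace the first.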

DOWNLOADER_MIDDLEWARES = {
     'qiubai_proj.middlewares.MyProxyMidleware':300,
}

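(4) Start Scrapy and check the effect of the IP proxy pool

scrapy crawl qiubai

Note:

When the IP proxy pool is enabled, the network quality of each proxy matters a great deal: a poor proxy is likely to slow the crawl down.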
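One way to limit the damage from bad proxies is to stop using an address after it has failed several times. The sketch below is only an illustration of that idea: `ProxyPool`, `mark_failed`, and the `max_failures` threshold are hypothetical and not part of Scrapy.

```python
import random


class ProxyPool:
    """Hypothetical helper: hands out random proxies and drops repeat offenders."""

    def __init__(self, proxies, max_failures=3):
        # Failure counter per proxy; a proxy is live while it has an entry here.
        self._failures = {p: 0 for p in proxies}
        self._max_failures = max_failures

    def get(self):
        """Return a random live proxy, or None when the pool is empty."""
        if not self._failures:
            return None
        return random.choice(list(self._failures))

    def mark_failed(self, proxy):
        """Record a failure; remove the proxy once it reaches max_failures."""
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures:
            del self._failures[proxy]

    def __len__(self):
        return len(self._failures)
```

In a downloader middleware, `get()` could supply the address in `process_request`, and `mark_failed()` could be called from the `process_exception` hook for the proxy that was used.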
