User-Agent Pools and IP Proxy Pools
A UA pool (user-agent pool) rotates the User-Agent field in the HTTP request headers so that requests appear to come from real browsers rather than from a crawler. This disguise is one of the basic techniques for dealing with anti-crawler checks.
When requests arrive quickly, most servers check the User-Agent first. Scrapy's default User-Agent identifies itself as Scrapy (e.g. Scrapy/1.1), so it should be replaced with a real browser string such as: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1
Better still, the User-Agent should be switched automatically and at random: for each page, pick one entry from a predefined list of user-agent strings.
Add the following to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'qiubai_proj.middlewares.RotateUserAgentMiddleware': 400,
}
Also add the USER_AGENT_LIST setting in settings.py:
USER_AGENT_LIST = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
In middlewares.py, add the user-agent rotation middleware class RotateUserAgentMiddleware:
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from qiubai_proj.settings import USER_AGENT_LIST

class RotateUserAgentMiddleware(UserAgentMiddleware):
    '''
    User-agent rotation middleware (sits in the downloader middleware chain).
    '''
    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request.
        user_agent = random.choice(USER_AGENT_LIST)
        if user_agent:
            request.headers.setdefault('User-Agent', user_agent)
            print(f"User-Agent:{user_agent}")
After running the spider, the User-Agent chosen for each request is printed to the console. The USER_AGENT_LIST above lives in settings.py, so it can be extended without touching the middleware code.
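To double-check from the server side, a throwaway spider can request httpbin.org/headers a few times and log the User-Agent the server actually received. This is only a rough sketch under the setup above; the spider name check_ua and the use of httpbin are my own choices, not part of the project:
import json
import scrapy

class CheckUASpider(scrapy.Spider):
    # Hypothetical verification spider: httpbin echoes back the request headers,
    # so each response shows which User-Agent the rotation middleware sent.
    name = "check_ua"
    start_urls = ["http://httpbin.org/headers"] * 3

    def parse(self, response):
        ua = json.loads(response.text)["headers"].get("User-Agent")
        self.logger.info("Server saw User-Agent: %s", ua)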
When scraping websites with Python, it is easy to run into anti-crawler mechanisms, and one important way around them is to use IP proxies. Plenty of IP proxies can be found online, but stable ones are expensive, so it is worth building your own proxy pool out of free proxies.
Two sites that publish free proxy lists:
http://www.thebigproxylist.com/
http://www.xicidaili.com/
telnet can be used to check whether a given port on a remote server is reachable:
telnet <IP address> <port>
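If telnet is not installed, the same check can be done from Python with the socket module; a minimal sketch (the function name check_port and the sample address are just for illustration):
import socket

def check_port(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(check_port("127.0.0.1", 3128))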
A ready-made txt list can also be downloaded directly from
http://www.thebigproxylist.com/
After downloading it, try fetching the Baidu homepage through a few different proxies from the list.
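For instance, a one-off check with requests might look like the sketch below (the proxy address is a placeholder; substitute entries from the downloaded list):
import requests

# Placeholder proxy address, not a real working proxy.
proxies = {"http": "http://1.2.3.4:8080"}
try:
    r = requests.get("http://www.baidu.com", proxies=proxies, timeout=10)
    print(r.status_code, len(r.text))
except requests.RequestException as e:
    print("proxy failed:", e)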
The proxy site above also provides an API endpoint that returns the full proxy list:
http://www.thebigproxylist.com/members/proxy-api.php?output=all&user=list&pass=8a544b2637e7a45d1536e34680e11adf
The site only responds to requests with a browser-like User-Agent, so the fetch below spoofs one and is built on the requests library:
# -*- coding: utf-8 -*-
__author__ = 'zhougy'
__date__ = '2018/9/7 2:32 PM'

import time
import requests
import threading
from threading import Lock
import queue

g_lock = Lock()
n_thread = 10

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko)"
                  " Chrome/68.0.3440.106 Safari/537.36",
}

def fetch_web_data(url, proxy=None, timeout=10):
    try:
        r = requests.get(url, timeout=timeout, headers=headers, proxies=proxy)
        return r.text
    except Exception as e:
        print(f"fetch_web_data failed for url: {url}, error: {e}")
        return None

def write_ip_pair(ip_pair):
    '''
    Append a usable ip:port pair to proxy_ip_list_<date>.txt.
    :param ip_pair:
    :return:
    '''
    proxy_file_name = "proxy_ip_list_%s.txt" % (time.strftime("%Y.%m.%d", time.localtime(time.time())))
    with open(proxy_file_name, "a+", encoding="utf-8") as f:
        f.write(f"{ip_pair}\n")

class IpProxyCheckThread(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.__queue = queue

    def run(self):
        global g_lock
        while True:
            data = self.__queue.get()
            try:
                ip_port_pair = data.split(",")[0]
                print(f"checking ip {ip_port_pair}")
                proxy = {
                    "http": ip_port_pair,
                }
                url = "http://httpbin.org/get?x=2&y=4"
                data = fetch_web_data(url, proxy=proxy, timeout=15)
                if data is None:
                    print(f"ip {ip_port_pair} failed the check, discarded")
                    continue
                print(f"ip {ip_port_pair} passed the check, usable")
                g_lock.acquire()
                write_ip_pair(ip_port_pair)
                g_lock.release()
            finally:
                # Mark the queue item as processed so mq.join() can return.
                self.__queue.task_done()

class FetchProxyListThread(threading.Thread):
    def __init__(self, url, mq):
        threading.Thread.__init__(self)
        self.__url = url
        self.__mq = mq

    def run(self):
        data = fetch_web_data(self.__url)
        print(data)
        if data is None:
            return
        ip_pool_list = data.split("\n")
        [self.__mq.put(ip_pool) for ip_pool in ip_pool_list if ip_pool.strip()]

def process():
    mq = queue.Queue()
    thread_list = []
    for i in range(n_thread):
        t = IpProxyCheckThread(mq)
        t.daemon = True
        thread_list.append(t)
    [t.start() for t in thread_list]
    url = "http://www.thebigproxylist.com/members/proxy-api.php?output=all&user=list&pass=8a544b2637e7a45d1536e34680e11adf"
    fth = FetchProxyListThread(url, mq)
    fth.start()
    fth.join()
    # The checker threads run forever as daemons; wait until every queued
    # address has been checked, then let the process exit.
    mq.join()

if __name__ == "__main__":
    process()
The day05 section above described how to obtain a set of proxies with good connectivity.
(1) Put the filtered, usable IP:port pairs into a list (they can also be read from the checker's output file and loaded into a list; a sketch of that follows the example below).
Create a my_proxies.py file with roughly the following content (it can be updated dynamically):
PROXY =[
"187.65.49.137:3128",
"108.61.246.98:8088",
"167.99.197.73:8080",
]
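Rather than pasting addresses by hand, my_proxies.py can load the latest proxy_ip_list_*.txt written by the checker script above. A rough sketch, assuming the file sits in the working directory (the glob pattern and helper name are my own assumptions):
import glob

def load_proxies():
    # Pick the most recent proxy_ip_list_YYYY.MM.DD.txt and read one ip:port per line.
    files = sorted(glob.glob("proxy_ip_list_*.txt"))
    if not files:
        return []
    with open(files[-1], encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

PROXY = load_proxies()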
(2) Add the IP proxy pool middleware to middlewares.py:
import random
from . import my_proxies

class MyProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route each request through a randomly chosen proxy from the pool.
        request.meta['proxy'] = "http://" + random.choice(my_proxies.PROXY)
(3) Register the middleware in DOWNLOADER_MIDDLEWARES in the settings file (alongside the user-agent middleware added earlier):
DOWNLOADER_MIDDLEWARES = {
    'qiubai_proj.middlewares.MyProxyMiddleware': 300,
}
(4) Run the spider and check that the proxy pool takes effect:
scrapy crawl qiubai
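To confirm that traffic really goes through the pool, a throwaway spider can fetch httpbin.org/ip and compare the reported origin with your own address; a minimal sketch (the spider name and endpoint are my own choices):
import json
import scrapy

class CheckProxySpider(scrapy.Spider):
    # Hypothetical verification spider: httpbin.org/ip returns the client address
    # it sees, which should be the proxy's IP rather than your real one.
    name = "check_proxy"
    start_urls = ["http://httpbin.org/ip"]

    def parse(self, response):
        self.logger.info("Origin IP seen by server: %s", json.loads(response.text)["origin"])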
Note:
If the IP proxy pool is enabled, the network quality of the proxies matters a great deal; poor proxies can slow the crawl down considerably.