Python Web Scraping: Spoofing a Browser User-Agent

【Overview】Notes from my process of learning web scraping
【Topic】How a Python crawler disguises itself as a browser via the User-Agent
【Analysis】
1. We build a custom request object in order to get around a site's anti-crawler checks.
2. Anti-crawler check 1: the server decides whether the visitor is a real browser by inspecting the User-Agent header.
3. Countermeasure 1: disguise the request as coming from a browser by supplying a browser User-Agent.
【Note】
Wrap the request with request.Request() and then fetch the page with urlopen(). Calling urlopen() on a bare URL is not enough to build a complete request, because it gives you no way to set custom headers; instead, pass a headers argument to request.Request(). That headers dictionary is where the User-Agent string is stored, and it is sent along when the request is made.
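For reference, here is a minimal sketch of that two-step pattern; the URL and the Chrome User-Agent string are just illustrative values taken from the full example below:

from urllib import request

url = 'http://www.baidu.com/'

# A bare urlopen(url) would send urllib's default User-Agent ("Python-urllib/3.x"),
# which is easy for a server to flag as a non-browser client.
# Wrapping the URL in a Request lets us attach a browser-like User-Agent instead.
req = request.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
})
print(req.get_header('User-agent'))   # the header that will be sent
html = request.urlopen(req).read().decode()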
【Code】

from urllib import request
import re
import random
url = r'http://www.baidu.com/'
header1 = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"}
header2 = {"User-Agent":"Mozilla/5.0 (Linux; Android 5.1.1; vivo X6S A Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/6.2 TBS/044207 Mobile Safari/537.36 MicroMessenger/6.7.3.1340(0x26070332) NetType/4G Language/zh_CN Process/tools"}
header3 = {"User-Agent":"Mozilla/5.0 (Linux; Android 8.1; EML-AL00 Build/HUAWEIEML-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.143 Crosswalk/24.53.595.0 XWEB/358 MMWEBSDK/23 Mobile Safari/537.36 MicroMessenger/6.7.2.1340(0x2607023A) NetType/4G Language/zh_CN"}
header4 = {"User-Agent":"Mozilla/5.0 (Linux; U; Android 8.0.0; zh-CN; MHA-AL00 Build/HUAWEIMHA-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 UCBrowser/12.1.4.994 Mobile Safari/537.36"}
header5 = {"User-Agent":"Mozilla/5.0 (Linux; U; Android 8.1.0; zh-cn; BLA-AL00 Build/HUAWEIBLA-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.9 Mobile Safari/537.36"}
# Pool of candidate User-Agent headers (desktop Chrome and several mobile browsers)
list1 = [header1, header2, header3, header4, header5]
# Pick one at random; valid indexes run from 0 to len(list1) - 1
i = random.randint(0, len(list1) - 1)
req = request.Request(url, headers=list1[i])
# Extract the page title from the returned HTML
pat = r'<title>(.*?)</title>'
response = request.urlopen(req).read().decode()
data = re.findall(pat, response)
print(data[0])

# Output:
# 百度一下,你就知道
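As a variation on the random indexing above, random.choice() picks an entry directly and avoids off-by-one mistakes. The helper below (the function name fetch_title and its signature are my own, not part of the original) is a sketch of how the whole flow could be packaged for reuse:

from urllib import request
import random
import re

def fetch_title(url, user_agents):
    """Fetch url using a randomly chosen User-Agent and return the page title."""
    headers = random.choice(user_agents)        # pick one header dict at random
    req = request.Request(url, headers=headers)
    html = request.urlopen(req).read().decode()
    match = re.search(r'<title>(.*?)</title>', html)
    return match.group(1) if match else None

# Usage with the header dicts defined above:
# print(fetch_title('http://www.baidu.com/', list1))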

