Some websites don't like being accessed by programs, so they check where a request comes from. If the source doesn't look like a normal visit, they simply cut you off. So, to keep our crawler working for us, the code needs a small improvement — a "disguise" — so the request looks like an ordinary person clicking through an ordinary browser.
Checking the documentation shows that Request takes a headers parameter; by setting it, we can masquerade as a browser. There are two ways to set the headers:
(1) Pass the headers argument when instantiating the Request object
(2) Add headers to an existing Request object with the add_header() method
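Both approaches can be sketched with a minimal Request object; nothing is actually sent here, and the URL and User-Agent string are placeholders:

```python
import urllib.request

ua = "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"

# Approach 1: pass a headers dict to the Request constructor
req1 = urllib.request.Request('http://example.com', headers={'User-Agent': ua})

# Approach 2: create the Request first, then call add_header()
req2 = urllib.request.Request('http://example.com')
req2.add_header('User-Agent', ua)

# urllib stores header names in capitalized form internally,
# so both requests end up carrying the same User-Agent
print(req1.get_header('User-agent'))
print(req2.get_header('User-agent'))
```

Either way the header travels with the request; the constructor form is more compact, while add_header() lets you attach headers conditionally after the Request exists.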
headers = {
    'Host': "fanyi.youdao.com",
    'Referer': "http://fanyi.youdao.com/?keyfrom=dict2.index",
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
}
The code:
import urllib.request
import urllib.parse
import json
# URL that the browser requests
url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
headers = {
    'Host': "fanyi.youdao.com",
    'Referer': "http://fanyi.youdao.com/?keyfrom=dict2.index",
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36",
}
content = input("Enter the text to translate: ")
data={}
# data['i']='I love you'
data['i']=content
data['from']='AUTO'
data['to']='AUTO'
data['smartresult']='dict'
data['client']='fanyideskweb'
data['doctype']='json'
data['version']='2.1'
data['keyfrom']='fanyi.web'
data['action']='FY_BY_CLICKBUTTION'
data['typoResult']='false'
# Convert the data dict into a URL-encoded byte string with urllib.parse.urlencode()
data = urllib.parse.urlencode(data).encode('utf-8')
# Approach 1: pass the headers dict to the Request constructor
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')
target = json.loads(html)
print("Translation: %s" % target['translateResult'][0][0]['tgt'])
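As an aside, what urlencode() produces can be checked in isolation; a small sketch with placeholder values (variable names here are illustrative):

```python
import urllib.parse

params = {'i': 'I love you', 'from': 'AUTO', 'to': 'AUTO'}
query = urllib.parse.urlencode(params)
print(query)  # → i=I+love+you&from=AUTO&to=AUTO

# urlopen() expects POST data as bytes, hence the extra encode() step
body = query.encode('utf-8')
print(type(body))
```

Note how the space in "I love you" becomes "+" and the key=value pairs are joined with "&" — exactly the format a browser sends for a form submission.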
# Approach 2: build the Request first, then attach headers with add_header()
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://fanyi.youdao.com/?keyfrom=dict2.index')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')
target = json.loads(html)
print("Translation: %s" % target['translateResult'][0][0]['tgt'])
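The nested indexing `['translateResult'][0][0]['tgt']` follows the shape of the JSON the server sends back. A sketch with a hand-written sample response (the sample string is illustrative, not a real server reply):

```python
import json

# Illustrative sample of the response shape the code above expects
html = '{"errorCode":0,"translateResult":[[{"src":"I love you","tgt":"我爱你"}]]}'
target = json.loads(html)

# translateResult is a list of lists of dicts: one outer entry per paragraph,
# one inner entry per sentence; the translated text sits under the 'tgt' key
print(target['translateResult'][0][0]['tgt'])  # → 我爱你
```

If the site rejects the disguise, the response may not be valid JSON at all, so json.loads() would raise an exception rather than return this structure.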