On the difference between urllib.request.Request and urllib.request.urlopen:
Compared with urllib.request.urlopen(), urllib.request.Request is a further wrapper around the request.
If you just need to fetch a page, you can call urllib.request.urlopen() directly.
If you need to customize the headers or the data, use urllib.request.Request:
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
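For example, a plain GET with urlopen() versus a Request that carries a custom User-Agent (a minimal sketch; httpbin.org is used here only as a throwaway test endpoint, not part of the original code):

import urllib.request

# Simple case: open the page directly (GET request, default headers)
with urllib.request.urlopen('http://httpbin.org/get') as resp:
    print(resp.status)

# Customized case: wrap the URL in a Request so headers (and data) can be set
req = urllib.request.Request('http://httpbin.org/get',
                             headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode('utf-8'))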
Below is the code for scraping Youdao Translate:
import urllib.request
import urllib.parse
import json
import time

while True:
    content = input('Enter the text to translate (type "q" to quit): ')
    if content == 'q':
        break

    url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"

    data = {}
    data['i'] = content
    data['from'] = 'AUTO'
    data['to'] = 'AUTO'
    data['smartresult'] = 'dict'
    data['client'] = 'fanyideskweb'
    data['salt'] = '15821157689747'
    data['sign'] = 'd5a392995c28c285198043f7111d1d00'
    data['ts'] = '1582115768974'
    data['bv'] = 'ec579abcd509567b8d56407a80835950'
    data['doctype'] = 'json'
    data['version'] = '2.1'
    data['keyfrom'] = 'fanyi.web'
    data['action'] = 'FY_BY_CLICKBUTTION'
    data = urllib.parse.urlencode(data).encode('utf-8')  # encode the form data as bytes

    # Method 1: build the headers dict first (to disguise the script as a browser), then pass it to Request
    head = {}
    head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'
    req = urllib.request.Request(url, data, head)

    # Method 2: create the Request first, then add the header (note the argument order)
    # req = urllib.request.Request(url, data)
    # req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400')

    response = urllib.request.urlopen(req)  # send the request; with data it is a POST, without data a GET
    html = response.read().decode('utf-8')  # decode the response body
    target = json.loads(html)  # json.loads() converts the JSON string into a dict
    print('Translation result: %s' % target['translateResult'][0][0]['tgt'])
    time.sleep(5)  # pause between requests
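A sample interaction (illustrative only; the actual translation Youdao returns may differ):

Enter the text to translate (type "q" to quit): hello
Translation result: 你好
Enter the text to translate (type "q" to quit): q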
Appendix: on Youdao's JS anti-crawler code — even after you get everything running, you still need to change the URL of the POST request you captured.
The URL I captured: http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule
It needs to be changed to http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule
In other words, remove the _o.
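In code this is just a string replacement on the captured URL (the URL you capture from your browser's developer tools may differ):

# URL copied from the POST request in the browser; the _o endpoint is the one guarded by the JS anti-crawler check
captured_url = "http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
# Drop the "_o" so the request goes to the endpoint that does not enforce the JS-generated signature
url = captured_url.replace("translate_o", "translate")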