Python 3: Notes on Youdao Translate's Anti-Scraping Measures

Long story short!
Today I started learning web scraping from a 2016 tutorial that scrapes the Youdao Dictionary API. Following it today (June 30, 2018) kept failing: Youdao has since added anti-scraping measures, and every request frustratingly returns the JSON {"errorCode":50}. The code is as follows:
import urllib.request
import urllib.parse

# URL obtained from a packet capture, not the URL in the browser's address bar
url = "http://fanyi.youdao.com/translate_o"

# Spoofed browser headers
headers = {"Accept": "application/json, text/javascript, */*; q=0.01",
           "X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
           "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
           "Accept-Language": "zh-CN,zh;q=0.9",
           }

# User input
key = input("Enter the English text to translate: ")

# Form data sent to the web server
formdata = {
        "i": key,
        "from": "AUTO",
        "to": "AUTO",
        "smartresult": "dict",
        "client": "fanyideskweb",
        "doctype": "json",
        "version": "2.1",
        "keyfrom": "fanyi.web",
        "action": "FY_BY_CLICKBUTTION",
        "typoResult": "false",
        }

# urlencode the form data into bytes
data = urllib.parse.urlencode(formdata).encode("utf-8")

# If Request() is given a data argument, the request is a POST; otherwise it is a GET
request = urllib.request.Request(url, data=data, headers=headers)

print(urllib.request.urlopen(request).read().decode("utf-8"))
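
As the comment above says, urllib.request picks the HTTP method based on whether data is present. A quick illustration, reusing url and data from the script:

req_get = urllib.request.Request(url)               # no data, so this is a GET
req_post = urllib.request.Request(url, data=data)   # data present, so this is a POST
print(req_get.get_method())    # GET
print(req_post.get_method())   # POST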

So I read a fellow poster's thread: Scraper Post 1.
His code adds a salt: the POST form gains two dynamic parameters, "salt" and "sign", where sign is an MD5 hash computed from a timestamp, the input text, and a fixed string.
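
Pulled out on its own, the signing logic looks like the sketch below. This is a minimal sketch: the helper name make_salt_and_sign is mine, and the fixed string was dug out of Youdao's minified JS at the time, so it may change whenever they update the site.

import time
import random
import hashlib

def make_salt_and_sign(text):
    # salt: millisecond timestamp plus a one-digit random offset
    salt = str(int(time.time() * 1000) + random.randint(1, 10))
    # sign: MD5 over client string + input text + salt + fixed string
    raw = "fanyideskweb" + text + salt + "ebSeFb%=XZ%T[KZ)c(sy!"
    return salt, hashlib.md5(raw.encode("utf-8")).hexdigest()

salt, sign = make_salt_and_sign("hello")
print(salt, sign)  # values vary with the timestamp

The full script with these parameters added: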
import urllib.request
import urllib.parse
import time
import random
import hashlib

# URL obtained from a packet capture, not the URL in the browser's address bar
url = "http://fanyi.youdao.com/translate_o"

# Spoofed browser headers
headers = {"Accept": "application/json, text/javascript, */*; q=0.01",
           "X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
           "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
           "Accept-Language": "zh-CN,zh;q=0.9",
           }

# User input
key = input("Enter the English text to translate: ")

# Youdao anti-scraping parameters
s = "fanyideskweb"             # client string
d = "ebSeFb%=XZ%T[KZ)c(sy!"    # fixed string from Youdao's JS
n = key                        # text to translate
r = str(int(time.time() * 1000) + random.randint(1, 10))  # salt
sign = hashlib.md5((s + n + r + d).encode("utf-8")).hexdigest()
print(r)
print("------")
print(sign)

# Form data sent to the web server
formdata = {
        "i": key,
        "from": "AUTO",
        "to": "AUTO",
        "smartresult": "dict",
        "client": "fanyideskweb",
        "salt": r,
        "sign": sign,
        "doctype": "json",
        "version": "2.1",
        "keyfrom": "fanyi.web",
        "action": "FY_BY_CLICKBUTTION",
        "typoResult": "false",
        }

# urlencode the form data into bytes
data = urllib.parse.urlencode(formdata).encode("utf-8")

# If Request() is given a data argument, the request is a POST; otherwise it is a GET
request = urllib.request.Request(url, data=data, headers=headers)

print(urllib.request.urlopen(request).read().decode("utf-8"))


Even with the salt and sign added to the form data, the server still returned the same {"errorCode":50} JSON. Frustrated for the second time: is learning to scrape really this hard?
So I read yet another poster's thread: Scraper Post 2.
It turns out the anti-scraping mechanism had been updated again: Youdao now also checks dynamic cookies. So the code became:
import urllib.request
import urllib.parse
import time
import random
import hashlib

# URL obtained from a packet capture, not the URL in the browser's address bar
url = "http://fanyi.youdao.com/translate_o"

# Spoofed browser headers
headers = {"Accept": "application/json, text/javascript, */*; q=0.01",
           "X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
           "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
           "Accept-Language": "zh-CN,zh;q=0.9",
           "Cookie": "[email protected]; OUTFOX_SEARCH_USER_ID_NCOO=851738882.3048133; fanyi-ad-id=46607; fanyi-ad-closed=1; JSESSIONID=aaa9sqX_qF7z497yhbprw; ___rl__test__cookies=" + str(int(time.time() * 1000))}

# User input
key = input("Enter the English text to translate: ")

# Youdao anti-scraping parameters
s = "fanyideskweb"             # client string
d = "ebSeFb%=XZ%T[KZ)c(sy!"    # fixed string from Youdao's JS
n = key                        # text to translate
r = str(int(time.time() * 1000) + random.randint(1, 10))  # salt
sign = hashlib.md5((s + n + r + d).encode("utf-8")).hexdigest()
print(r)
print("------")
print(sign)

# Form data sent to the web server
formdata = {
        "i": key,
        "from": "AUTO",
        "to": "AUTO",
        "smartresult": "dict",
        "client": "fanyideskweb",
        "salt": r,
        "sign": sign,
        "doctype": "json",
        "version": "2.1",
        "keyfrom": "fanyi.web",
        "action": "FY_BY_CLICKBUTTION",
        "typoResult": "false",
        }

# urlencode the form data into bytes
data = urllib.parse.urlencode(formdata).encode("utf-8")

# If Request() is given a data argument, the request is a POST; otherwise it is a GET
request = urllib.request.Request(url, data=data, headers=headers)

print(urllib.request.urlopen(request).read().decode("utf-8"))
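
A note on the Cookie header: most of it is a static string copied verbatim from my own browser capture (yours will differ), and the only dynamic piece is ___rl__test__cookies, which is just the current time in milliseconds. Isolated, the construction is simply:

import time

# Static part copied from a browser capture; substitute your own values
static_cookies = "[email protected]; JSESSIONID=aaa9sqX_qF7z497yhbprw"
# Dynamic part: current timestamp in milliseconds
cookie = static_cookies + "; ___rl__test__cookies=" + str(int(time.time() * 1000))
print(cookie)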

I thought that would finally do it, but the server still returned the same {"errorCode":50} JSON. Complete despair....
On the principle that anything a browser can access I can simulate, I didn't give up.
Finally I compared the headers and POST forms from two captures taken at different times with different input text, and one parameter caught my attention: "Referer": "http://fanyi.youdao.com/",
I added it to the headers and tried again,
and it finally worked. My guess is that Youdao's server added a check on this field, though it's also possible I had simply mixed something up somewhere earlier.
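
If you want to do that comparison programmatically rather than by eye, a throwaway sketch like this works (headers_a and headers_b are hypothetical stand-ins for two exported captures):

# Two captures exported as dicts; these values are made up for illustration
headers_a = {"Accept": "application/json", "Referer": "http://fanyi.youdao.com/"}
headers_b = {"Accept": "application/json"}

# Print every key whose value differs or that is missing from one side
for k in headers_a.keys() | headers_b.keys():
    if headers_a.get(k) != headers_b.get(k):
        print(k, "->", headers_a.get(k), "|", headers_b.get(k))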
The final code:
import urllib.request
import urllib.parse
import time
import random
import hashlib

# URL obtained from a packet capture, not the URL in the browser's address bar
url = "http://fanyi.youdao.com/translate_o"

# Spoofed browser headers
headers = {"Accept": "application/json, text/javascript, */*; q=0.01",
           "X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
           "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
           "Accept-Language": "zh-CN,zh;q=0.9",
           "Referer": "http://fanyi.youdao.com/",
           "Cookie": "[email protected]; OUTFOX_SEARCH_USER_ID_NCOO=851738882.3048133; fanyi-ad-id=46607; fanyi-ad-closed=1; JSESSIONID=aaa9sqX_qF7z497yhbprw; ___rl__test__cookies=" + str(int(time.time() * 1000))}

# User input
key = input("Enter the English text to translate: ")

# Youdao anti-scraping parameters
s = "fanyideskweb"             # client string
d = "ebSeFb%=XZ%T[KZ)c(sy!"    # fixed string from Youdao's JS
n = key                        # text to translate
r = str(int(time.time() * 1000) + random.randint(1, 10))  # salt
sign = hashlib.md5((s + n + r + d).encode("utf-8")).hexdigest()
print(r)
print("------")
print(sign)

# Form data sent to the web server
formdata = {
        "i": key,
        "from": "AUTO",
        "to": "AUTO",
        "smartresult": "dict",
        "client": "fanyideskweb",
        "salt": r,
        "sign": sign,
        "doctype": "json",
        "version": "2.1",
        "keyfrom": "fanyi.web",
        "action": "FY_BY_CLICKBUTTION",
        "typoResult": "false",
        }

# urlencode the form data into bytes
data = urllib.parse.urlencode(formdata).encode("utf-8")

# If Request() is given a data argument, the request is a POST; otherwise it is a GET
request = urllib.request.Request(url, data=data, headers=headers)

print(urllib.request.urlopen(request).read().decode("utf-8"))
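
Once it works, it makes more sense to parse the JSON than to print it raw. A sketch that would replace the final print line above, assuming the response keeps the shape it had at the time (an "errorCode" field plus "translateResult" as a list of lists of {"src", "tgt"} segments); check your own output before relying on these keys:

import json

response = urllib.request.urlopen(request).read().decode("utf-8")
result = json.loads(response)

if result.get("errorCode") == 0:
    # Observed response shape; may change without notice
    for row in result["translateResult"]:
        for seg in row:
            print(seg["src"], "->", seg["tgt"])
else:
    print("Request rejected, errorCode =", result.get("errorCode"))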


Apologies: this is my first post, so I may not have explained everything clearly, and some of my understanding may be wrong. Corrections from the experts are welcome.

