Long story short!
I started learning web scraping today with a tutorial that scrapes the Youdao Dictionary translation API. The tutorial is from 2016, and to my surprise, following it today (June 30, 2018) kept failing: Youdao has since added an anti-scraping mechanism, and every request just returns the JSON {"errorCode":50}, which was pretty frustrating. The code is as follows:
import urllib.request
import urllib.parse

# URL obtained from packet capture, not the URL shown in the browser address bar
url = "http://fanyi.youdao.com/translate_o"

# Fake the browser headers
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
    "Accept-Language": "zh-CN,zh;q=0.9",
}

# Get the user's input
key = input("Enter the English text to translate: ")

# Form data sent to the web server
formdata = {
    "i": key,
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_CLICKBUTTION",
    "typoResult": "false",
}

# urlencode the form data
data = urllib.parse.urlencode(formdata).encode("utf-8")
# print(data)

# If Request() gets a data argument the request is a POST; without it, a GET
request = urllib.request.Request(url, data=data, headers=headers)

print(urllib.request.urlopen(request).read().decode("utf-8"))
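Since the script only dumps the raw response, here is a small sketch of checking for the rejection programmatically instead of eyeballing the output; it would replace the final print above (json is in the standard library):

import json

response_text = urllib.request.urlopen(request).read().decode("utf-8")
result = json.loads(response_text)
if "errorCode" in result:
    # the anti-scraping rejection looks like {"errorCode": 50}
    print("Request was rejected, errorCode =", result["errorCode"])
else:
    print("Got a real translation response:", result)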
So I read a fellow poster's thread: 爬虫一
His code adds an encryption salt, i.e. two dynamic parameters in the POST form: "salt", derived from a millisecond timestamp, and "sign", an MD5 hash of the client string, the input text, the salt, and a fixed string.
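Pulled out on its own, that salt/sign computation is only a few lines. A standalone sketch (the helper name here is mine, purely for illustration; the fixed string is the one that post reported from Youdao's page script at the time and may have changed since), which the full listing below then inlines:

import time
import random
import hashlib

def make_salt_and_sign(text):
    # salt: current time in milliseconds plus a small random offset, as the page's JS does
    salt = str(int(time.time() * 1000) + random.randint(1, 10))
    # sign: MD5 of client id + input text + salt + fixed secret string
    secret = "ebSeFb%=XZ%T[KZ)c(sy!"
    sign = hashlib.md5(("fanyideskweb" + text + salt + secret).encode("utf-8")).hexdigest()
    return salt, sign

print(make_salt_and_sign("hello"))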
import urllib.request
import urllib.parse
import time
import random
import hashlib

# URL obtained from packet capture, not the URL shown in the browser address bar
url = "http://fanyi.youdao.com/translate_o"

# Fake the browser headers
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
    "Accept-Language": "zh-CN,zh;q=0.9",
}

# Get the user's input
key = input("Enter the English text to translate: ")

# Youdao's anti-scraping parameters
s = "fanyideskweb"
d = "ebSeFb%=XZ%T[KZ)c(sy!"
n = key
r = str(int(time.time() * 1000) + random.randint(1, 10))
sign = hashlib.md5((s + n + r + d).encode("utf-8")).hexdigest()
print(r)
print("------")
print(sign)

# Form data sent to the web server
formdata = {
    "i": key,
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": r,
    "sign": sign,
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_CLICKBUTTION",
    "typoResult": "false",
}

# urlencode the form data
data = urllib.parse.urlencode(formdata).encode("utf-8")
# print(data)

# If Request() gets a data argument the request is a POST; without it, a GET
request = urllib.request.Request(url, data=data, headers=headers)

print(urllib.request.urlopen(request).read().decode("utf-8"))
Even after adding the salt to the form data, it still returned the same {"errorCode":50} JSON. Frustrated for the second time: is learning to write a scraper really this hard?
So I read yet another poster's thread: 爬虫二
It turns out the anti-scraping mechanism had been updated again: Youdao now also requires a dynamic cookie. So the code turned into this:
import urllib.request
import urllib.parse
import time
import random
import hashlib

# URL obtained from packet capture, not the URL shown in the browser address bar
url = "http://fanyi.youdao.com/translate_o"

# Fake the browser headers; the Cookie is copied from a browser capture,
# with only the trailing ___rl__test__cookies timestamp generated fresh
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cookie": "[email protected]; OUTFOX_SEARCH_USER_ID_NCOO=851738882.3048133; fanyi-ad-id=46607; fanyi-ad-closed=1; JSESSIONID=aaa9sqX_qF7z497yhbprw; ___rl__test__cookies=" + str(int(time.time() * 1000)),
}

# Get the user's input
key = input("Enter the English text to translate: ")

# Youdao's anti-scraping parameters
s = "fanyideskweb"
d = "ebSeFb%=XZ%T[KZ)c(sy!"
n = key
r = str(int(time.time() * 1000) + random.randint(1, 10))
sign = hashlib.md5((s + n + r + d).encode("utf-8")).hexdigest()
print(r)
print("------")
print(sign)

# Form data sent to the web server
formdata = {
    "i": key,
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": r,
    "sign": sign,
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_CLICKBUTTION",
    "typoResult": "false",
}

# urlencode the form data
data = urllib.parse.urlencode(formdata).encode("utf-8")
# print(data)

# If Request() gets a data argument the request is a POST; without it, a GET
request = urllib.request.Request(url, data=data, headers=headers)

print(urllib.request.urlopen(request).read().decode("utf-8"))
I thought that would finally do it, but it still returned the {"errorCode":50} JSON. Total despair...
Still, on the principle that if a browser can access it, I can simulate it, I didn't give up.
Finally, I compared the headers and POST forms captured at two different times with two different input texts, and the header "Referer": "http://fanyi.youdao.com/" caught my attention. I added it to my headers and tried again, and it finally worked. My guess is that Youdao's server now validates this header as well, though it is also possible I had simply mixed something up earlier.
Final code:
import urllib.request
import urllib.parse
import time
import random
import hashlib

# URL obtained from packet capture, not the URL shown in the browser address bar
url = "http://fanyi.youdao.com/translate_o"

# Fake the browser headers, now including Referer; the Cookie is copied from a
# browser capture, with only the trailing ___rl__test__cookies timestamp generated fresh
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Referer": "http://fanyi.youdao.com/",
    "Cookie": "[email protected]; OUTFOX_SEARCH_USER_ID_NCOO=851738882.3048133; fanyi-ad-id=46607; fanyi-ad-closed=1; JSESSIONID=aaa9sqX_qF7z497yhbprw; ___rl__test__cookies=" + str(int(time.time() * 1000)),
}

# Get the user's input
key = input("Enter the English text to translate: ")

# Youdao's anti-scraping parameters
s = "fanyideskweb"
d = "ebSeFb%=XZ%T[KZ)c(sy!"
n = key
r = str(int(time.time() * 1000) + random.randint(1, 10))
sign = hashlib.md5((s + n + r + d).encode("utf-8")).hexdigest()
print(r)
print("------")
print(sign)

# Form data sent to the web server
formdata = {
    "i": key,
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": r,
    "sign": sign,
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_CLICKBUTTION",
    "typoResult": "false",
}

# urlencode the form data
data = urllib.parse.urlencode(formdata).encode("utf-8")
# print(data)

# If Request() gets a data argument the request is a POST; without it, a GET
request = urllib.request.Request(url, data=data, headers=headers)

print(urllib.request.urlopen(request).read().decode("utf-8"))
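For what it's worth, once the request succeeds the translation itself can be pulled out instead of printing the whole JSON blob. A rough sketch that could replace the last print above, assuming the success response carries a translateResult field made of lists of {"src": ..., "tgt": ...} objects; if the actual JSON differs, adjust accordingly:

import json

response_text = urllib.request.urlopen(request).read().decode("utf-8")
result = json.loads(response_text)
if "translateResult" in result:
    # each inner entry pairs a source segment ("src") with its translation ("tgt")
    for group in result["translateResult"]:
        for item in group:
            print(item.get("src"), "->", item.get("tgt"))
else:
    # still blocked, or the response format differs from what is assumed here
    print("Unexpected response:", result)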
Apologies in advance: this is my first post, so there are probably places where I haven't explained things clearly or where my understanding is wrong. Corrections from the experts are very welcome.