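Lagou (拉勾网) has fairly solid anti-scraping measures. For most sites, sending the User-Agent and Referer request headers is enough to get the data, but Lagou also requires Cookie information, and a timestamp is involved as well.
If the Cookie is not set, the server returns the following message (the msg field reads "You are operating too frequently, please try again later"):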
{"status":false,"msg":"您操作太频繁,请稍后再访问","clientIp":"124.167.153.75","state":2402}
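A scraper can check for this response before trying to use the data. The snippet below is a minimal sketch: the helper name `is_blocked` is hypothetical, and it assumes that only the blocked response carries `"status": false`, which is inferred from the single sample above rather than from any documented API.

```python
import json

# Sample body copied from the blocked response shown above.
BLOCKED_BODY = (
    '{"status":false,"msg":"您操作太频繁,请稍后再访问",'
    '"clientIp":"124.167.153.75","state":2402}'
)


def is_blocked(body: str) -> bool:
    """Return True if the JSON body looks like the rate-limit response.

    Assumption: a blocked response is a JSON object with "status" set to false.
    """
    payload = json.loads(body)
    return isinstance(payload, dict) and payload.get("status") is False


if is_blocked(BLOCKED_BODY):
    print("Request was rejected; the Cookie is probably missing or stale.")
```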
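In addition, the job data shown on the Lagou page is not part of the page itself; it is loaded through a separate request.
So the first request is used to obtain the cookies, which are then set in the headers of the second request.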
print(response.getheaders())
[('Server', 'openresty'), ('Date', 'Sat, 06 Jun 2020 01:37:19 GMT'), ('Content-Type', 'text/html;charset=UTF-8'), ('Transfer-Encoding', 'chunked'), ('Connection', 'close'), ('Vary', 'Accept-Encoding'), ('Set-Cookie', 'JSESSIONID=ABAAAECABIEACCABE16C25AFA311FF4A564C96560281062; Path=/; HttpOnly'), ('REQUEST_ID', 'c29a16c5-2915-40df-b354-525c872c7c78'), ('Content-Language', 'en-US'), ('Set-Cookie', 'SEARCH_ID=24d200665f8a4bc09446a129b7a13994; Version=1; Max-Age=86400; Expires=Sun, 07-Jun-2020 01:37:18 GMT; Path=/'), ('Set-Cookie', 'user_trace_token=20200606093719-f0d22007-0f32-4a31-a21a-5702e7c1ed9a; Max-Age=31536000; Path=/; Domain=.lagou.com; '), ('Set-Cookie', 'X_HTTP_TOKEN=42daf4b72327b2819347041951bf5e71415983ed09; Max-Age=31536000; Path=/; Domain=.lagou.com; '), ('Cache-Control', 'no-cache')]
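Only the `name=value` part of each `Set-Cookie` header belongs in the `Cookie` request header; attributes such as `Path` and `Max-Age` are instructions for the client and are not sent back. As a rough sketch of that step, the hypothetical helper `build_cookie_header` below works on a list of `(name, value)` tuples like the one printed above (shortened here to three entries):

```python
def build_cookie_header(response_headers):
    """Join the name=value part of every Set-Cookie header into one Cookie value."""
    pairs = [
        value.split(';', 1)[0].strip()   # drop Path, Max-Age, Domain, ...
        for name, value in response_headers
        if name.lower() == 'set-cookie'  # header names are case-insensitive
    ]
    return '; '.join(pairs)


# Shortened copy of the headers printed above.
sample_headers = [
    ('Server', 'openresty'),
    ('Set-Cookie', 'JSESSIONID=ABAAAECABIEACCABE16C25AFA311FF4A564C96560281062; Path=/; HttpOnly'),
    ('Set-Cookie', 'SEARCH_ID=24d200665f8a4bc09446a129b7a13994; Version=1; Max-Age=86400; Expires=Sun, 07-Jun-2020 01:37:18 GMT; Path=/'),
]

print(build_cookie_header(sample_headers))
# JSESSIONID=ABAAAECABIEACCABE16C25AFA311FF4A564C96560281062; SEARCH_ID=24d200665f8a4bc09446a129b7a13994
```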
'''
@Description: urllib
@Author: sikaozhifu
@Date: 2020-06-05 20:06:42
@LastEditTime: 2020-06-06 09:28:23
@LastEditors: Please set LastEditors
'''
from urllib import request
from urllib import parse
# First URL: the search results page, requested only to obtain the cookies
url_start = 'https://www.lagou.com/jobs/list_java%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=sug&fromSearch=true&suginput=java%E5%B7%A5%E7%A8%8B%E5%B8%88'
# Second URL: the Ajax endpoint that returns the job data
url_parse = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
data = {
    'first': 'true',
    'pn': str(1),
    'kd': 'java工程师'
}
headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
}
req = request.Request(url_start, headers=headers)
response = request.urlopen(req)
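# Build the Cookie header from the Set-Cookie response headers
cookies = []
for name, value in response.getheaders():
    if name == 'Set-Cookie':
        # Keep only the name=value pair; drop attributes such as Path and Max-Age
        cookies.append(value.split(';')[0])
# Join with '; ' so there is no trailing separator to strip
cookie = '; '.join(cookies)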
headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'Cookie': cookie,
    'Referer':
    'https://www.lagou.com/jobs/list_java%E5%B7%A5%E7%A8%8B%E5%B8%88?labelWords=sug&fromSearch=true&suginput=java%E5%B7%A5%E7%A8%8B%E5%B8%88'
}
# Send the second request as a POST with the form data
req = request.Request(url_parse, data=parse.urlencode(data).encode('utf-8'), headers=headers, method='POST')
resp = request.urlopen(req)
print(resp.read().decode('utf-8'))
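As an alternative to copying cookies by hand, the standard library can carry them from the first request to the second automatically. The following is only a sketch of that idea, not the approach used above: `make_opener` and `fetch_positions` are hypothetical names, and whether Lagou accepts requests made this way has not been verified here.

```python
from http.cookiejar import CookieJar
from urllib import parse, request

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36')


def make_opener():
    """Create an opener that stores cookies and replays them on later requests."""
    jar = CookieJar()
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    # Default headers sent with every request made through this opener.
    opener.addheaders = [('User-Agent', USER_AGENT)]
    return opener, jar


def fetch_positions(opener, url_start, url_parse, data):
    """Visit the search page to collect cookies, then POST to the Ajax endpoint."""
    # First request: only the Set-Cookie headers matter, so the body is discarded.
    opener.open(url_start).close()
    body = parse.urlencode(data).encode('utf-8')
    req = request.Request(url_parse, data=body,
                          headers={'Referer': url_start}, method='POST')
    # Second request: the cookie jar adds the Cookie header automatically.
    with opener.open(req) as resp:
        return resp.read().decode('utf-8')
```

With the variables from the script above, the call would be `opener, jar = make_opener()` followed by `print(fetch_positions(opener, url_start, url_parse, data))`.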