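Many websites require submitting a form to log in or to perform other operations. The POST method of the requests library can handle this: by inspecting the form's source and reverse engineering the request, you can fill in the form and retrieve the page data. As an exercise, this example fetches Python-related job postings from Lagou (拉勾网).

Open Lagou and press F12 to open the browser's developer tools; the site turns out to use Ajax. Click the Network tab and select the XHR filter: the Headers pane shows the request URL, and the Response pane shows that the returned data is JSON. Because the JSON string is long and complex, use the Preview pane to inspect it; it matches the job postings shown on the page. All postings sit under content → positionResult → result. After turning the page, the request URL does not change, but the request method is POST and one of the submitted form fields, pn, changes with the page number. That is enough to build the crawler. The code is as follows: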

import json
import time

import pymongo
import requests

# Connect to a local MongoDB instance; postings are stored in mydb.lagou
client = pymongo.MongoClient('localhost', 27017)
mydb = client['mydb']
lagou = mydb['lagou']

cookie = 'replace this with your own cookie'
# Full set of request headers, copied from the browser's developer tools
headers = {
    'cookie': cookie,
    'origin': "https://www.lagou.com",
    'x-anit-forge-code': "0",
    'accept-encoding': "gzip, deflate, br",
    'accept-language': "zh-CN,zh;q=0.8,",
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    'content-type': "application/x-www-form-urlencoded; charset=UTF-8",
    'accept': "application/json, text/javascript, */*; q=0.01",
    'referer': "https://www.lagou.com/jobs/list_Pyhon?labelWords=&fromSearch=true&suginput=",
    'x-requested-with': "XMLHttpRequest",
    'connection': "keep-alive",
    'x-anit-forge-token': "None",
}
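

def get_page(url, params):
    # Request the first page to find out how many postings match the search
    html = requests.post(url, data=params, headers=headers)
    json_data = json.loads(html.text)
    total_count = json_data['content']['positionResult']['totalCount']
    # 15 postings per page; round up so a partial last page is included,
    # and fetch at most 30 pages
    page_number = min((total_count + 14) // 15, 30)
    get_info(url, page_number)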


def get_info(url, page):
    for pn in range(1, page + 1):
        # Only the 'pn' (page number) field changes between requests
        params = {
            'first': 'true',
            'pn': str(pn),
            'kd': 'Python'
        }
        try:
            html = requests.post(url, data=params, headers=headers)
            json_data = json.loads(html.text)
            results = json_data['content']['positionResult']['result']
            for result in results:
                # Keep only the fields of interest from each posting
                infos = {
                    'businessZones': result['businessZones'],
                    'city': result['city'],
                    'companyFullName': result['companyFullName'],
                    'companyLabelList': result['companyLabelList'],
                    'companySize': result['companySize'],
                    'district': result['district'],
                    'education': result['education'],
                    'financeStage': result['financeStage'],
                    'firstType': result['firstType'],
                    'formatCreateTime': result['formatCreateTime'],
                    'gradeDescription': result['gradeDescription'],
                    'imState': result['imState'],
                    'industryField': result['industryField'],
                    'positionAdvantage': result['positionAdvantage'],
                    'salary': result['salary'],
                    'workYear': result['workYear'],
                }
                lagou.insert_one(infos)
            # Pause between page requests to avoid hitting the site too often
            time.sleep(2)
        except requests.exceptions.ConnectionError:
            # Skip pages that fail to connect
            pass


if __name__ == '__main__':
    url = 'https://www.lagou.com/jobs/positionAjax.json'
    params = {
        'first': 'true',
        'pn': '1',
        'kd': 'Python'
    }
    get_page(url, params)
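
The page-count arithmetic and the per-page form fields can be pulled out into two small helpers, which makes them easy to check without touching the network. This is only a minimal sketch: the names `page_count` and `build_params` are hypothetical, and the page size of 15 and the 30-page cap are simply the values used in the script, not documented limits. Note that this version rounds up, so a final, partially filled page is still requested.

```python
def page_count(total_count, per_page=15, max_pages=30):
    """Number of pages to request: ceiling division, capped at max_pages."""
    pages = -(-total_count // per_page)  # ceiling division without floats
    return min(pages, max_pages)


def build_params(keyword, pn):
    """Form fields for one results page; only 'pn' changes between pages."""
    return {'first': 'true', 'pn': str(pn), 'kd': keyword}


if __name__ == '__main__':
    print(page_count(100))            # pages needed for 100 postings
    print(build_params('Python', 3))  # form data for the third page
```

Keeping these two pieces separate from the request loop means the search keyword and page size can be changed in one place.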
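
Lagou applies anti-scraping measures: requests sent through a simple proxy or with only ordinary headers are blocked with the message “您的操作过于频繁,请稍后再试” ("Your operations are too frequent, please try again later"). In testing, sending the complete set of headers avoided the block. The scraped data is stored in a MongoDB database.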
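
When a request is blocked, the reply will not contain the usual job data, so it can help to check the shape of the parsed response before indexing into it. The sketch below rests on an assumption that the site does not confirm: that a blocked reply is still JSON but lacks the content → positionResult → result path. `extract_results` is a hypothetical helper name, and the `blocked` dictionary is a made-up example, not a captured response.

```python
def extract_results(json_data):
    """Return the list of postings, or None if the expected path is missing."""
    try:
        return json_data['content']['positionResult']['result']
    except (KeyError, TypeError):
        # Assumption: a blocked or malformed reply does not contain this path.
        return None


if __name__ == '__main__':
    ok = {'content': {'positionResult': {'result': [{'city': 'Beijing'}]}}}
    blocked = {'success': False, 'msg': 'too many requests'}  # made-up shape
    print(extract_results(ok))
    print(extract_results(blocked))
```

A crawler could treat a `None` result as a signal to stop or wait longer, instead of failing with a `KeyError` partway through a run.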