Scraping Zhilian Zhaopin job listings with Python 3.6 (handling dynamic loading)

I needed job listing data from Zhilian Zhaopin for work, so I wrote a small scraper.

1. Getting to know the site

Zhilian no longer requires you to log in before browsing, so the cookie can be dropped from the request headers and the pages are still accessible. The listings, however, are loaded dynamically, so open the browser's developer console (Network/XHR tab) and locate the request that actually fetches the data.



From the request found in the console we get the URL, which can be opened directly to fetch the JSON data.
Before doing that, build the request headers.
A quick note on how the URL is composed (a short sketch of assembling it from these parts follows the list):
kw: the search keyword
cityId: the city ID
kt: no idea why, but it has to be 3; other values return poorly matched results
the remaining parameters don't matter much
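As a sketch of how these parts fit together, the query string can also be built from a parameter dict and encoded by requests instead of pasting the full captured URL. Note that the tracking parameters in the real URL (at, rt, _v, x-zp-page-request-id) are dropped here as an assumption; the hard-coded URL below keeps them all.

# A minimal sketch of assembling the search URL from its parts. The tracking
# parameters from the captured URL (at, rt, _v, x-zp-page-request-id) are
# left out here as an assumption; the hard-coded URL below keeps them.
import requests

params = {
    'start': 0,          # page offset, 60 results per page
    'pageSize': 60,
    'cityId': 763,       # city ID
    'kw': 'python',      # search keyword
    'kt': 3,             # has to be 3, other values hurt relevance
}
probe = requests.get('https://fe-api.zhaopin.com/c/i/sou', params=params,
                     headers={'User-Agent': 'Mozilla/5.0'})
print(probe.url)         # requests encodes the query string for us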

# Imports used throughout the script
import requests
import pandas as pd

# Based on the URL of the first page, scrape listings for the keyword "python"
url = r'https://fe-api.zhaopin.com/c/i/sou?pageSize=60&cityId=763&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22jl%22:%22489%22,%22kw%22:%22%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E5%B8%88%22,%22kt%22:%223%22%7D&at=9c5682b1a4f54de89c899fb7efc7e359&rt=54eaf1be1b8845c089439d53365ea5dd&_v=0.84300214&x-zp-page-request-id=280f6d80d733447fbebafab7b8158873-1541403039080-617179'
# Build the request headers so the request is not flagged by anti-scraping checks
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
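Before writing the full loop it can help to fire a single request and look at the shape of the JSON. This quick sanity check is not part of the original script; it just uses the url and headers defined above and prints the keys that the parsing code in the next section relies on.

# Quick sanity check (not in the original script): one request, no cookie,
# to confirm the endpoint answers with JSON and to inspect its structure.
response = requests.get(url, headers=headers)
results = response.json()['data']['results']   # the list of job postings
print(len(results))                            # up to 60 postings per page
if results:
    print(results[0].keys())                   # e.g. company, salary, jobName, city, ...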
2. Scraping

Send the request with requests.get and work with the JSON returned in the response.
The parsing logic is shown in the code below.

# Loop over the page offsets (60 results per page), send each request and parse the JSON
jobs = []  # collects one DataFrame per page
for i in range(0, 20001, 60):
    url = 'https://fe-api.zhaopin.com/c/i/sou?start='+str(i)+r'&pageSize=60&cityId=763&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22p%22:5,%22jl%22:%22489%22,%22kw%22:%22%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E5%B8%88%22,%22kt%22:%223%22%7D&at=17a95e7000264c3898168b11c8f17193&rt=57a342d946134b66a264e18fc60a17c6&_v=0.02365098&x-zp-page-request-id=a3f1b317599f46338d56e5d080a05223-1541300804515-144155'
    response = requests.get(url, headers = headers)
    print('Downloading:', 'https://fe-api.zhaopin.com/c/i/sou?start='+str(i)+'&pageSize=60', '......')
    results = response.json()['data']['results']   # parse the JSON once per page
    name = 'python'
    company = [r['company']['name'] for r in results]
    size = [r['company']['size']['name'] for r in results]
    company_type = [r['company']['type']['name'] for r in results]
    positionURL = [r['positionURL'] for r in results]
    workingExp = [r['workingExp']['name'] for r in results]
    eduLevel = [r['eduLevel']['name'] for r in results]
    salary = [r['salary'] for r in results]
    jobName = [r['jobName'] for r in results]
    welfare = [r['welfare'] for r in results]
    city = [r['city']['items'][0]['name'] for r in results]
    createDate = [r['createDate'] for r in results]
    jobs.append(pd.DataFrame({'name':name,'company':company,'size':size,'type':company_type,
                              'positionURL':positionURL,'workingExp':workingExp,'eduLevel':eduLevel,
                              'salary':salary,'jobName':jobName,'welfare':welfare,
                              'city':city,'createDate':createDate}))
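The loop above always walks offsets up to 20000 with no pause between requests. As one possible refinement (not from the original post), the requests can be throttled and the loop stopped as soon as a page comes back empty. The shortened URL in this sketch is an assumption; the full query string from the loop above can be substituted.

# A possible refinement (not in the original): pause between requests and stop
# once a page comes back empty instead of always walking to offset 20000.
# The shortened URL here is an assumption; the full query string above also works.
import time

base = 'https://fe-api.zhaopin.com/c/i/sou?start={}&pageSize=60&cityId=763&kw=python&kt=3'
for start in range(0, 20001, 60):
    resp = requests.get(base.format(start), headers=headers)
    page = resp.json()['data']['results']
    if not page:                 # an empty page means there are no more postings
        break
    print('offset', start, '->', len(page), 'postings')
    time.sleep(1)                # pause so the server is not hammered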

Finally, export the data to an Excel file; it could also be stored in a database.
First, concatenate the postings from all pages:
jobs2 = pd.concat(jobs)

Then export the combined data to an Excel file:
jobs2.to_excel(r'G:\python.xlsx', index = False)
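For the database option mentioned above, one lightweight sketch is to let pandas write the combined DataFrame into SQLite via to_sql. The file name zhaopin.db and table name jobs are arbitrary examples, not part of the original script.

# Sketch of the database option: write the combined DataFrame into a local
# SQLite file with pandas' to_sql. File and table names are arbitrary examples.
import sqlite3

with sqlite3.connect(r'G:\zhaopin.db') as conn:
    jobs2.to_sql('jobs', conn, if_exists='replace', index=False)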

Done. The code above could certainly be optimized; I borrowed parts from others and, since the boss needed the data urgently, left it as is. I'll clean it up and write it better when I have time.
