URL: https://search.51job.com/list/040000,000000,0000,00,9,99,%20,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
Simplified: https://search.51job.com/list/040000,000000,0000,00,9,99,%20,2,1.html?
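The simplified address is the full search URL with its query string dropped. Below is a minimal sketch of that step using the standard library's `urllib.parse`. The helper name `strip_query` is hypothetical, and the query shown is only the first few parameters of the full URL. Note that `urlunsplit` also drops the trailing `?`.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query='', fragment=''))


# Shortened version of the search URL: same path, first few query parameters only
full_url = ('https://search.51job.com/list/'
            '040000,000000,0000,00,9,99,%20,2,1.html'
            '?lang=c&stype=&postchannel=0000&workyear=99')

base_url = strip_query(full_url)
print(base_url)
```

The path, including the `%20` placeholder and the comma-separated fields, is left untouched; only the part after `?` is removed.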
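By comparison, the data collected here is better suited to practicing data analysis.
The scraping method and steps are the same as in the earlier post on Zhaopin (智联招聘); both use the third-party library requests.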
import requests
import re
import os
import time
class Spider(object):
    page_count = 0

    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
        global path, page_count
        path = './前程无忧招聘网/'
        if not os.path.exists(path):
            os.mkdir(path)
        self.path = path
        with open(path + "前程无忧招聘.json", "a", encoding='utf-8') as fp:
            fp.write('{')
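
    def response(self, url, headers):
        """
        Send a GET request to the server and return the response.
        """
        try:
            # headers must be passed by keyword: the second positional
            # argument of requests.get is params, not headers
            response = requests.get(url, headers=headers)
            response.encoding = response.apparent_encoding
            return response
        except requests.RequestException:
            print('Request failed')
            return None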

    def parse(self, response):
        """
        Extract the job titles we need to visit and their detail-page links.
        """
        res = re.findall(r'\s+', response.text, re.S)
        return res

    def parse_S(self, response):
        try:
            price = re.findall('(.*?)', response.text, re.S)[0]
        except:
            price = None
        try:
            company_name = re.findall('class="cname">.*?title="(.*?)" class.*?
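
The extraction approach used in parse_S, a lazy capture group with a fallback to None when nothing matches, can be sketched on its own. The HTML snippet, the helper `first_match`, and the `salary` class name are illustrative assumptions, not the site's actual markup; the company pattern reuses only the part of the regex that is visible in the code above.

```python
import re


def first_match(pattern, text):
    """Return the first captured group, or None if the pattern does not match."""
    found = re.findall(pattern, text, re.S)
    return found[0] if found else None


# Hypothetical markup, made up for illustration only
sample_html = '''
<p class="cname">
  <a target="_blank" title="Example Co." class="catn" href="#">Example Co.</a>
</p>
'''

# Visible portion of the company-name pattern; re.S lets .*? cross line breaks
company_pattern = r'class="cname">.*?title="(.*?)" class'

company_name = first_match(company_pattern, sample_html)
missing = first_match(r'class="salary">(.*?)<', sample_html)
print(company_name, missing)
```

Checking whether the result list is empty avoids a bare `except` around the `[0]` lookup, so unrelated errors are not silently swallowed.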
If you spot anything that could be improved or was done poorly, please point it out so we can discuss it.