Scraping the 51job (前程无忧) recruitment site with Python

Straight to the code. Compared with the Zhaopin (智联招聘) data from the previous post, 51job exposes far more data to scrape.

URL: https://search.51job.com/list/040000,000000,0000,00,9,99,%20,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
Simplified: https://search.51job.com/list/040000,000000,0000,00,9,99,%20,2,1.html?
Because there is more of it, the data collected here is better suited to practicing data analysis.
The scraping method and steps are the same as in the Zhaopin post. Both rely on the third-party library requests.
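
To crawl more than one results page, the simplified URL above can be turned into a template. The following is only a minimal sketch: it assumes that the last number before `.html` is the page index, which the post does not state, and the names `BASE` and `build_page_urls` are made up for illustration.

```python
# Assumption: the final number before ".html" in the simplified URL is the
# page index. The post does not confirm this; check it against the live site.
BASE = "https://search.51job.com/list/040000,000000,0000,00,9,99,%20,2,{page}.html?"


def build_page_urls(first_page, last_page):
    """Return the list-page URLs for first_page..last_page (inclusive)."""
    return [BASE.format(page=page) for page in range(first_page, last_page + 1)]


urls = build_page_urls(1, 3)
for url in urls:
    print(url)
```

Each generated URL could then be passed to the request method of the spider shown next.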

import requests
import re
import os
import time


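class Spider(object):
    page_count = 0

    def __init__(self):
        # Browser-like User-Agent so the request is not rejected outright
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
        # path is kept as a module-level global; self.path holds the same value
        global path
        path = './前程无忧招聘网/'
        # Create the output directory if it does not exist yet
        if not os.path.exists(path):
            os.mkdir(path)
        self.path = path
        # Open the output file in append mode and write the opening brace
        with open(path + "前程无忧招聘.json", "a", encoding='utf-8') as fp:
            fp.write('{')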

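    def response(self, url, headers):
        """
        Send a GET request to the server and return the response.
        """
        try:
            # headers must be passed by keyword: the second positional
            # argument of requests.get is params, not headers
            response = requests.get(url, headers=headers)
            response.encoding = response.apparent_encoding
            return response
        except requests.RequestException:
            print('Request failed')
            return None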

    def parse(self, response):
        """
        Extract the job titles we need and the links to their detail pages.
        """
        res = re.findall(r'\s+', response.text,
                         re.S)
        return res

    def parse_S(self, response):
        try:
            price = re.findall('
(.*?)', response.text, re.S)[0]
        except:
            price = None
        try:
            company_name = re.findall('class="cname">.*?title="(.*?)" class.*?
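
The repeated pattern in `parse_S` (take the first regex match, fall back to `None` when nothing matches) can be factored into a small helper. This is a rough sketch rather than the original code: the helper name `first_match` and the sample markup are invented for illustration, and only the visible part of the `cname` pattern from the listing above is reused.

```python
import re


def first_match(pattern, text, default=None):
    """Return the first capture group of pattern in text, or default if absent."""
    match = re.search(pattern, text, re.S)
    return match.group(1) if match else default


# Hypothetical sample markup, invented for illustration only.
sample_html = '<p class="cname"><a title="Example Co." class="name">Example Co.</a></p>'

# Visible fragment of the company-name pattern from the listing above.
company_name = first_match(r'class="cname">.*?title="(.*?)" class', sample_html)
# A field that is not present in the sample falls back to None.
missing = first_match(r'class="salary">(.*?)<', sample_html)

print(company_name, missing)
```

Using `re.search` with an explicit check avoids the bare `except:` around `re.findall(...)[0]`, which would also hide unrelated errors.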

If you spot anything that could be improved or that I got wrong, please point it out so we can discuss it.
