Boss直聘: a multiprocess requests crawler that writes to MySQL

  • After studying web crawling for quite a while, I used the requests library to crawl Python job listings from Boss直聘, parsed them, and wrote the results into a MySQL database.

First, lay out the overall plan:

  1. Fetch the codes of the cities to crawl.
  2. Combine each city code with the code for Python jobs to build list-page URLs, request them, and collect the URLs of individual job postings.
  3. Crawl each job-posting URL, parse out the key fields, and write them to the database.

Enough talk, on to the code.
First, import the required modules: lxml's etree is used for XPath parsing, jsonpath for pulling values out of JSON, multiprocessing for the process pool, and pymysql as the MySQL interface.

import requests
from lxml import etree
import time
import re
import random
import json
import jsonpath
from bs4 import BeautifulSoup
from multiprocessing import Manager,Pool
import multiprocessing
import pymysql

With the imports in place: Boss直聘 builds its URLs from a city code plus a job-category code, so step one of the plan is to fetch the city codes. You can fetch every city, or only provincial capitals or popular cities; here I take all provincial capitals. Adjust the jsonpath expressions to suit your needs.
The code is as follows:

def city_code_spider():
    global city_code_queue
    url = 'https://www.zhipin.com/common/data/city.json'
    r = requests.get(url,headers=headers,verify=False)
    r.encoding = r.apparent_encoding
    result = json.loads(r.text)
    city_list = jsonpath.jsonpath(result,'$.data.cityList')
    result_list = []
    for i in range(len(city_list[0])):
        province_list = jsonpath.jsonpath(city_list,'$.*'+str([i]))
        province_name = jsonpath.jsonpath(city_list,'$.*'+ str([i])+'.name')[0]
        #To include every city, use the loop below instead:
        #for j in range(len(province_list[0]['subLevelModelList'])):
        #Provincial capitals only (the first city under each province)
        for j in [0]:
            city_dic = {}
            city_name = jsonpath.jsonpath(city_list,'$.*'+str([i])+'.subLevelModelList'+str([j])+'.name')[0]
            city_code = jsonpath.jsonpath(city_list,'$.*'+str([i])+'.subLevelModelList'+str([j])+'.code')[0]
            city_dic['province_name'] = province_name
            city_dic['city_name'] = city_name
            city_dic['city_code'] = city_code
            #Put city_dic into the queue; later stages will get() it
            city_code_queue.put(city_dic)
            #print(city_dic)
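
To make the jsonpath expressions above easier to follow, here is a rough sketch of the JSON shape the code assumes, inferred purely from those expressions (placeholder names and codes, not real data), together with the equivalent plain-dict indexing:

#A minimal sketch of the structure city_code_spider() expects from city.json (assumed, simplified)
sample = {
    "data": {
        "cityList": [
            {
                "name": "ProvinceName",
                "subLevelModelList": [
                    {"name": "CapitalCity", "code": 100010000},   #capital listed first
                    {"name": "OtherCity", "code": 100010100},
                ],
            },
        ]
    }
}
#jsonpath '$.data.cityList' returns a one-element list wrapping the province list,
#so '$.*[i]...' boils down to plain indexing on city_list[0][i]:
city_list = [sample["data"]["cityList"]]
province_name = city_list[0][0]["name"]                          #'$.*[0].name'
city_code = city_list[0][0]["subLevelModelList"][0]["code"]      #'$.*[0].subLevelModelList[0].code'
print(province_name, city_code)
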
  • With the city codes collected, the next step in the plan is to gather the URL of each individual job posting in every city.
  • The code is as follows:
def get_end_url(city_code_queue,end_url_queue,headers,proxies):
    global f
    if not get_end_url_exit:
        while not city_code_queue.empty():
            city_code_list = city_code_queue.get()
            print(city_code_list)
            city_code = city_code_list['city_code']
            print(city_code)
            #List-page URL for Python jobs in the provincial capital; the category code for Python on Boss直聘 is p100109, other job categories can be looked up and substituted
            url = 'https://www.zhipin.com/c'+ str(city_code) + '-p100109/'
            #Boss直聘 shows at most 10 pages of listings per city, so we only crawl 10 pages.
            for page in range(1,11):   
                params = {
                        'page':str(page),
                        'ka':'page-'+ str(page)
                }
                User_Agent_list = [
                "Mozilla/5.0 (Windows NT 6.1; rv2.0.1) Gecko/20100101 Firefox/4.0.1",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
                "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11"
                ]
                #Pick a random User-Agent to reduce the chance of being blocked
                User_Agent = random.choice(User_Agent_list)
                get_end_url_headers = headers.copy()
                get_end_url_headers.update({'User-Agent':User_Agent})
                #print(get_end_url_headers)
                #Wrap nearly everything in try/except to keep the program robust and stop it from dying halfway through
                try:
                #Always set a timeout so a hung request raises an error instead of blocking forever
                    r = requests.get(url,headers=get_end_url_headers,params = params,proxies=proxies,verify=False,timeout=15)
                    r.encoding = r.apparent_encoding
                    print(r.status_code)
                except Exception as e:
                    print(e)
                    city_code_queue.put(city_code_list)
                    continue
                try:
                    result = etree.HTML(r.text)
                    #Parsing section
                    #Extract the job ids
                    jobids = result.xpath('//li//h3[@class="name"]/a/@data-jobid')
                    #Extract the tail of each detail-page URL
                    end_urls = result.xpath('//li//div[@class="info-primary"]//h3/a/@href')
                    count = 0
                    for end_url in end_urls:
                        job_id = jobids[count]
                        count+=1
                        end_url_dic = {}
                        end_url_dic['province_name'] = city_code_list['province_name']
                        end_url_dic['city_name'] = city_code_list['city_name']
                        end_url_dic['city_code'] = city_code_list['city_code']
                        end_url_dic['end_url'] = end_url
                        end_url_dic['job_id'] = job_id
                        end_urls_list.append(end_url)
                        end_url_queue.put(end_url_dic)
                        print(end_url_dic)
                        f.write(str(end_url_dic))
                except Exception as E:
                    print(E)
                time.sleep(5)   
    print('city_code_queue is empty: ' + str(city_code_queue.empty()))
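
The timeout plus try/except pattern appears at both request sites; if you prefer, the same idea can be pulled into a small helper. A minimal sketch (the helper name and retry/backoff values are my own, not part of the original code):

def fetch_with_retry(url, headers, proxies, params=None, retries=3, timeout=15):
    #Hypothetical helper: GET with a timeout, retrying a few times with a growing pause;
    #returns None if every attempt fails so the caller can requeue the task
    for attempt in range(retries):
        try:
            r = requests.get(url, headers=headers, params=params,
                             proxies=proxies, verify=False, timeout=timeout)
            r.encoding = r.apparent_encoding
            return r
        except Exception as e:
            print('attempt %d failed: %s' % (attempt + 1, e))
            time.sleep(2 * (attempt + 1))
    return None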
  • Once the job-detail URLs are in the queue, the last step of the plan is to crawl each detail page, parse it, and write the result to the database.
  • The code is as follows:
def get_job_detail(end_url_queue,headers,proxies):
    print('get_job_detail started')
    global error_get_end_url_list
    while not end_url_queue.empty():
        User_Agent_list = [
                "Mozilla/5.0 (Windows NT 6.1; rv2.0.1) Gecko/20100101 Firefox/4.0.1",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
                "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11"
                ]
        User_Agent = random.choice(User_Agent_list)
        get_job_detail_headers = headers.copy()
        get_job_detail_headers.update({'User-Agent':User_Agent})
        get_end_url_list = end_url_queue.get()
        get_end_url = get_end_url_list['end_url']
        detail_url = 'https://www.zhipin.com' + get_end_url
        try:
            r = requests.get(url=detail_url,headers=get_job_detail_headers,proxies=proxies,timeout=15,verify=False)
            r.encoding = r.apparent_encoding
            print(r.url)
            #print(r.status_code)
            result = etree.HTML(r.text)
        except Exception as e:
        #Some detail URLs fail to load; add the failing item to error_get_end_url_list and put it back on end_url_queue to be retried. After 3 failed attempts it is dropped.
            print(e)
            print(detail_url)
            error_get_end_url_list.append(get_end_url_list)
            if error_get_end_url_list.count(get_end_url_list) >= 3:
                continue
            else:
                end_url_queue.put(get_end_url_list)
                continue
        #Parsing section
        #Job title
        try:
            job_name = result.xpath('//div[@class="name"]/h1')[0].text
        except:
            job_name = ''
        try:
            job_id = get_end_url_list['job_id']
        except:
            job_id = ''
        #Posting date
        try:
            job_times = result.xpath('//div[@class="job-author"]/span/text()')[0]
            job_time = job_times.split('发布于')[1]
        except:
            job_time = ''
        #Salary
        try:
            job_salary = result.xpath('//div[@class="info-primary"]/div[@class="name"]/span')[0].text.strip()
        except:
            job_salary = ''
        #Work location
        try:
            job_address = result.xpath('//div[@class="info-primary"]/p/text()')[0]
        except:
            job_address =''
        #Required experience
        try:
            job_experience = result.xpath('//div[@class="info-primary"]/p/text()')[1]
        except:
            job_experience = ''
        #Required education
        try:
            job_education = result.xpath('//div[@class="info-primary"]/p/text()')[2]
        except:
            job_education = ''
        #Job description
        try:
            job_descriptions = result.xpath('//div[@class="detail-content"]/div[@class="job-sec"]/div[@class="text"]/text()')
            job_description = ''
            for i in range(len(job_descriptions)):
                content = job_descriptions[i].strip()
                job_description += content
        except:
            job_description = ''
        #Team introduction
        try:
            team_introduces = result.xpath('//div[@class="detail-content"]/div[@class="job-sec"]/div[@class="job-tags"]/span/text()')
            team_introduce = ','.join(str(x) for x in team_introduces)
        except:
            team_introduce = ''
        #Company introduction
        try:
            company_introduce = result.xpath('//div[@class="detail-content"]/div[@class="job-sec company-info"]/div[@class="text"]')[0].text.strip()
        except:
            company_introduce = ''

        #Write to the database
        params = [job_id,job_name,job_time,job_salary,job_address,job_experience,job_education,job_description]
        try:
            conn = pymysql.connect(host='localhost',port=3306,db='Boss_spider',user='root',passwd='591777',charset='utf8')
            cs1 = conn.cursor()
            count = cs1.execute("insert into job_detail(job_id,job_name,job_time,job_salary,job_address,job_experience,job_education,job_description) values(%s,%s,%s,%s,%s,%s,%s,%s)",params)
            print(count)
            conn.commit()
            cs1.close()
            conn.close()
        except Exception as e:
            print(e)
        print(job_id,job_name,job_time)
        time.sleep(1)
    print('get_job_detail finished')
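
For reference, the INSERT statement above implies a job_detail table with a job_id plus seven text fields. A minimal sketch of a matching schema, created through pymysql (the column types and lengths are my own guesses, not taken from the article):

import pymysql

#Hypothetical schema matching the INSERT above; adjust types and lengths as needed
create_sql = """
CREATE TABLE IF NOT EXISTS job_detail (
    id INT AUTO_INCREMENT PRIMARY KEY,
    job_id VARCHAR(64),
    job_name VARCHAR(255),
    job_time VARCHAR(64),
    job_salary VARCHAR(64),
    job_address VARCHAR(255),
    job_experience VARCHAR(64),
    job_education VARCHAR(64),
    job_description TEXT
) DEFAULT CHARSET=utf8;
"""
conn = pymysql.connect(host='localhost', port=3306, db='Boss_spider',
                       user='root', passwd='591777', charset='utf8')
with conn.cursor() as cur:
    cur.execute(create_sql)
conn.commit()
conn.close()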

The main code is as follows:

if __name__ == "__main__":
    requests.packages.urllib3.disable_warnings()
    end_urls_list = []
    get_end_url_exit = False
    #Exit flag for get_job_detail
    get_job_detail_EXIT = False   
    error_get_end_url_list = []
    headers = {
        "Host":"www.zhipin.com",
        "Connection":"keep-alive",
        "Cache-Control":"max-age=0",
        "Upgrade-Insecure-Requests":"1",
        "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language":"zh-CN,zh;q=0.9,en;q=0.8"
    }
    #Proxy settings below. You can run without a proxy, but then slow down and expect to get blocked more easily; if you skip it, remove the proxies argument from the requests calls. Using one is still recommended; safety first.
    #Proxy IP
    proxyHost = "http-dyn.abuyun.com"
    proxyPort = "9020"
    #Proxy tunnel credentials
    proxyUser = "**************"
    proxyPass = "**************"
    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
      "host" : proxyHost,
      "port" : proxyPort,
      "user" : proxyUser,
      "pass" : proxyPass
    }
    proxies = {
    "http"  : proxyMeta,
    "https" : proxyMeta
    }
    f = open('job_list.txt','a')
    #Queue of city codes
    city_code_queue = Manager().Queue()
    #Queue of job-detail URL tails
    end_url_queue = Manager().Queue()
    #Crawl the city-code JSON, push every city code onto the queue for the next stage
    city_code_spider()
    #Create the process pools
    p1=Pool(2)
    p2= Pool(3)
    for i in range(2):
        p1.apply_async(get_end_url,(city_code_queue,end_url_queue,headers,proxies))
        print(1)
    while end_url_queue.empty():
        pass
    print('continuing to the next stage')

    for i in range(3):
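        #Note: Pool.apply blocks until each call returns, so these three workers run one after another; apply_async would run them concurrently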
        p2.apply(get_job_detail,(end_url_queue,headers,proxies))
        print(2)
    while not end_url_queue.empty():
        pass
    get_job_detail_EXIT = True
    print('------start--------')
    while not city_code_queue.empty():
        pass
    print('the queues are empty')
    p1.close()
    p2.close()
    p1.join()
    p2.join()
    print('-------end------')

  • Part of the crawled results is shown below:
    (screenshot: sample of the crawled job records)

A few takeaways:

  1. The requests library is powerful, but remember to pass a timeout so a request cannot hang forever;
  2. To make the program more robust, make good use of try/except/finally so a bug does not kill the run halfway through;
  3. I used multiple processes here only because I had already used multithreading when crawling Lagou (拉勾网) jobs; for IO-bound work like crawling, multithreading is usually the better fit (Python's GIL notwithstanding, it still pays off for IO-bound tasks). A minimal threaded sketch follows this list.
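
As a rough illustration of point 3, here is a minimal thread-pool sketch using concurrent.futures (my own variant, not code from the project); it would replace the Pool calls in the main block and assumes the functions and queues defined above:

from concurrent.futures import ThreadPoolExecutor
import time

#Hypothetical thread-based variant of the main block; threads share memory,
#so a plain queue.Queue would also work in place of Manager().Queue()
with ThreadPoolExecutor(max_workers=5) as pool:
    for _ in range(2):
        pool.submit(get_end_url, city_code_queue, end_url_queue, headers, proxies)
    #mirror the original main block: wait until some detail URLs exist
    while end_url_queue.empty():
        time.sleep(0.5)
    for _ in range(3):
        pool.submit(get_job_detail, end_url_queue, headers, proxies)
#leaving the with-block waits for all submitted tasks to finish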

Going forward I will keep crawling more sites and do some data visualization and analysis, so stay tuned. My GitHub is below; it contains the full code for all of my crawlers. If you find it helpful, please give it a star, and feel free to raise issues anytime!
GitHub https://github.com/wangyeqiang/Crawl
