Scraping Boss直聘 Data with Python and selenium (for learning purposes only)

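A note before we start: I recently needed to analyze industry data, and while searching I found that many bloggers had written about this, but much of what they wrote no longer works. So I wrote this article, and I hope it helps.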

Note: BOSS直聘 loads its data with JavaScript, so we use selenium.

1. Install selenium and bs4 with pip, and download chromedriver.exe

Install commands
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple selenium
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple bs4

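Download chromedriver.exe from the chromedriver download page (mind the version: it must match your installed Chrome).
Note: if the download fails, you will need to track down the cause and the fix yourself.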
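
Before moving on, it can help to confirm that both packages installed correctly. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of selenium or bs4.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "selenium" and "bs4" are the import names of the two packages installed above.
    missing = missing_packages(["selenium", "bs4"])
    if missing:
        print("Not installed yet:", ", ".join(missing))
    else:
        print("selenium and bs4 are both importable")
```

If either name is reported as missing, rerunning the matching pip command is the first thing to try.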

2. Finding the data source

Note: this article uses the search keyword 'python' as its data source.

Go to the BOSS直聘 website.
Enter python and click confirm.
On the results page, click the next page to see the URL format.
The URL format turns out to be https://www.zhipin.com/c101280600/?query=python&page=2&ka=-2
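
As a rough illustration of that format, the sketch below assembles a results URL from a city code, a keyword, and a page number. The function name `build_search_url` is hypothetical, and the `ka=page-N` value follows the form used in the script in section 3; whether the site actually requires the `ka` parameter is not something this article establishes.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.zhipin.com"


def build_search_url(city_code, query, page):
    """Build a search-results URL for one city, keyword and page number."""
    # urlencode also escapes keywords that contain spaces or non-ASCII text.
    params = urlencode({"query": query, "page": page, "ka": "page-{}".format(page)})
    return "{}/c{}/?{}".format(BASE_URL, city_code, params)


print(build_search_url(101280600, "python", 2))
# https://www.zhipin.com/c101280600/?query=python&page=2&ka=page-2
```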

Then, in Chrome, press F12 and then F5 to capture the related requests.
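You can see that citysites.json is the file that carries the city data, and that each city's code is the number after the c in the URL format above. So we can copy it out and change it to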

citys = [{"name":"北京","code":101010100,"url":"/beijing/"},{"name":"上海","code":101020100,"url":"/shanghai/"},{"name":"广州","code":101280100,"url":"/guangzhou/"},{"name":"深圳","code":101280600,"url":"/shenzhen/"},{"name":"杭州","code":101210100,"url":"/hangzhou/"},{"name":"天津","code":101030100,"url":"/tianjin/"},{"name":"西安","code":101110100,"url":"/xian/"},{"name":"苏州","code":101190400,"url":"/suzhou/"},{"name":"武汉","code":101200100,"url":"/wuhan/"},{"name":"厦门","code":101230200,"url":"/xiamen/"},{"name":"长沙","code":101250100,"url":"/changsha/"},{"name":"成都","code":101270100,"url":"/chengdu/"},{"name":"郑州","code":101180100,"url":"/zhengzhou/"},{"name":"重庆","code":101040100,"url":"/chongqing/"},{"name":"佛山","code":101280800,"url":"/foshan/"},{"name":"合肥","code":101220100,"url":"/hefei/"},{"name":"济南","code":101120100,"url":"/jinan/"},{"name":"青岛","code":101120200,"url":"/qingdao/"},{"name":"南京","code":101190100,"url":"/nanjing/"},{"name":"东莞","code":101281600,"url":"/dongguan/"}]
At this point we have the data for each city.
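
If you would rather keep the copied text as JSON than paste it into the script as a Python literal, something like the following might work. It is a sketch that assumes the copied text is a JSON array of objects with `name` and `code` fields, as in the list above; the real citysites.json may wrap that array in other keys. The name `city_codes` is hypothetical.

```python
import json

# A two-city excerpt in the same shape as the list copied above.
raw = '[{"name":"北京","code":101010100,"url":"/beijing/"},{"name":"深圳","code":101280600,"url":"/shenzhen/"}]'


def city_codes(json_text):
    """Map each city name to its numeric code."""
    return {city["name"]: city["code"] for city in json.loads(json_text)}


codes = city_codes(raw)
print(codes["深圳"])  # 101280600
```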

3. Code implementation

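from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Start Chrome (a regular window; no headless option is set here).
# Selenium 4 takes the driver path through a Service object; on Selenium 3 use webdriver.Chrome('chromedriver.exe').
driver = webdriver.Chrome(service=Service('chromedriver.exe'))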

# City list (copied from citysites.json)
citys = [{"name":"北京","code":101010100,"url":"/beijing/"},{"name":"上海","code":101020100,"url":"/shanghai/"},{"name":"广州","code":101280100,"url":"/guangzhou/"},{"name":"深圳","code":101280600,"url":"/shenzhen/"},{"name":"杭州","code":101210100,"url":"/hangzhou/"},{"name":"天津","code":101030100,"url":"/tianjin/"},{"name":"西安","code":101110100,"url":"/xian/"},{"name":"苏州","code":101190400,"url":"/suzhou/"},{"name":"武汉","code":101200100,"url":"/wuhan/"},{"name":"厦门","code":101230200,"url":"/xiamen/"},{"name":"长沙","code":101250100,"url":"/changsha/"},{"name":"成都","code":101270100,"url":"/chengdu/"},{"name":"郑州","code":101180100,"url":"/zhengzhou/"},{"name":"重庆","code":101040100,"url":"/chongqing/"},{"name":"佛山","code":101280800,"url":"/foshan/"},{"name":"合肥","code":101220100,"url":"/hefei/"},{"name":"济南","code":101120100,"url":"/jinan/"},{"name":"青岛","code":101120200,"url":"/qingdao/"},{"name":"南京","code":101190100,"url":"/nanjing/"},{"name":"东莞","code":101281600,"url":"/dongguan/"}]

# Scrape each city
for city in citys:

    # Only fetch the first ten pages
    urls = ['https://www.zhipin.com/c{}/?query=python&page={}&ka=page-{}'.format(city['code'],i,i) for i in range(1,11)]

    for url in urls:

        driver.get(url)

        # Get the page source and parse it
        html = driver.page_source
        bs = BeautifulSoup(html,'html.parser')

        job_all = bs.find_all('div', {"class": "job-primary"})
        # print(job_all)

        for job in job_all:
            # Job title
            job_name = job.find('span', {"class": "job-name"}).get_text()
            # Job location
            job_place = job.find('span', {'class': "job-area"}).get_text()
            # Company
            job_company = job.find('div', {'class': 'company-text'}).find('h3', {'class': "name"}).get_text()
            # Salary
            job_salary = job.find('span', {'class': 'red'}).get_text()
            # Education requirement (the last two characters of the job-limit text)
            job_education = job.find('div',{'class':'job-limit'}).find('p').get_text()[-2:]
            # Job tag
            job_label = job.find('a', {'class': 'false-link'}).get_text()

            # Note: the CSV should be UTF-8 encoded. If a file is in another encoding, open it in Notepad and use Save As to change the encoding to UTF-8 with BOM.
            with open('job.csv','a+',encoding='UTF-8-SIG') as fh:

                # Replace commas in the job title so they are not read as field separators
                fh.write(job_name.replace(',','、') + "," + job_place + "," + job_company + "," + job_salary + "," + job_education +','+ job_label + "\n")

                # Confirm that the data was fetched and written
                print('工作:' + job_name + ",地区:" + job_place + ",公司:" + job_company + ",薪资:" + job_salary + ',学历:'+ job_education +",标签:" + job_label,end="\n")

# Close the browser to free memory
driver.quit()

Note: the .py file must be in the same directory as chromedriver.exe, and you must not be logged in to BOSS直聘.
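
The script builds each CSV line by hand and only replaces commas in the job title, so a comma in any other field would shift the columns. One possible alternative is the standard library's `csv.writer`, which quotes such fields automatically. The sketch below is not the script's own method; `write_jobs`, `FIELDS`, and the sample row are made up for illustration.

```python
import csv
import io

FIELDS = ["job_name", "job_place", "job_company", "job_salary", "job_education", "job_label"]


def write_jobs(fh, jobs):
    """Write one CSV row per job dict; csv.writer quotes fields that contain commas."""
    writer = csv.writer(fh)
    for job in jobs:
        writer.writerow([job[field] for field in FIELDS])


# Made-up sample row, only to show the quoting behaviour.
sample = [{
    "job_name": "Python开发,后端",
    "job_place": "深圳",
    "job_company": "示例公司",
    "job_salary": "15-25K",
    "job_education": "本科",
    "job_label": "Python",
}]

buffer = io.StringIO()
write_jobs(buffer, sample)
print(buffer.getvalue())
# "Python开发,后端",深圳,示例公司,15-25K,本科,Python
```

To write to a real file instead of the in-memory buffer, open it with `open('job.csv', 'a', encoding='UTF-8-SIG', newline='')`; the csv module documentation recommends `newline=''` for files passed to `csv.writer`.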

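When the run finishes, a job.csv file appears in the same directory, ready to be queried and analyzed (if you run into encoding problems, you will need to sort them out yourself).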
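
As a starting point for that analysis, here is a small sketch that counts how often each value appears in one column, using only the standard library. It assumes the column order the script writes (title, location, company, salary, education, tag) and no header row. The function name `count_by_column` and the demo rows are made up, and the demo writes to a temporary file rather than touching a real job.csv.

```python
import csv
import os
import tempfile
from collections import Counter


def count_by_column(path, column_index):
    """Count how often each value appears in one column of a header-less CSV."""
    counts = Counter()
    # Reading with UTF-8-SIG drops the BOM at the start of the file.
    with open(path, newline="", encoding="UTF-8-SIG") as fh:
        for row in csv.reader(fh):
            if len(row) > column_index:
                counts[row[column_index]] += 1
    return counts


# Made-up rows in the same column order as job.csv, written to a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "job.csv")
with open(demo_path, "w", newline="", encoding="UTF-8-SIG") as fh:
    csv.writer(fh).writerows([
        ["Python开发", "深圳", "示例公司A", "15-25K", "本科", "Python"],
        ["数据分析", "北京", "示例公司B", "10-15K", "大专", "数据分析"],
        ["后端开发", "深圳", "示例公司C", "20-30K", "本科", "Python"],
    ])

education_counts = count_by_column(demo_path, 4)
print(education_counts.most_common())
# [('本科', 2), ('大专', 1)]
```

Calling `count_by_column('job.csv', 1)` on the real file would give a per-location count in the same way.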
