Hello guys, I am William Lee, and this is Enginploy Episode Two. Today, I am going to find some jobs for you guys. Let’s rock and roll!
The first thing I need to do is list the steps:
BTW, here is my repository:
These are amazing tools you don’t want to miss.

| Tool | Concept | Usage | Link |
| --- | --- | --- | --- |
| Python 3.4 | Programming Language | Code Stuff | https://docs.python.org/3/ |
| MongoDB 4.0 | NoSQL Database | Store Stuff | https://docs.mongodb.com/v4.0/ |
| Requests | Python Lib | Handle HTTP Stuff | http://docs.python-requests.org/en/master/ |
| Beautiful Soup 4 | Python Lib | Parse HTML Stuff | https://www.crummy.com/software/BeautifulSoup/bs4/doc/ |
| Pymongo | Python Lib | Handle Mongo Stuff | http://api.mongodb.com/python/current/tutorial.html |
| Fake UserAgent | Python Lib | Generate Random User-Agent | https://github.com/hellysmile/fake-useragent |
Honestly, Fake UserAgent is not a necessary tool. You can make a list of User-Agent strings and pick one at random for each request. Use it or not, it’s up to you.
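If you want to go the do-it-yourself way, here is a minimal sketch using only the standard library. The `USER_AGENTS` list and the `random_headers()` helper are names I made up for illustration, and the strings in the list are just sample browser strings you can swap for your own.

```python
import random

# Hypothetical pool of User-Agent strings; replace or extend it as you like.
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0',
]


def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}


# Call it once per request to rotate the User-Agent.
headers = random_headers()
print(headers['User-Agent'])
```

The returned dict can be passed straight to `requests.get(url, headers=headers)`.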
Okay, let’s follow the list above and finish the mission step by step.
import pymongo
# get companies from mongo
mongo_client = pymongo.MongoClient('mongodb://localhost:27017/')
db = mongo_client['enginploy']
company_36kr = db['company_36kr']
companies = company_36kr.find()
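This is almost a standard code template. It connects to the local MongoDB server, selects the enginploy database, and queries every document in the company_36kr collection.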
As the README of the GitHub repository says, we need to install fake-useragent manually, like this:
pip install fake-useragent
Then import it into our script:
from fake_useragent import UserAgent
ua = UserAgent()
Now we can change the header value whenever we want:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
headers['User-Agent'] = ua.random
Easy Peasy, guys : )
In this step, we still use Requests and Beautiful Soup to process details:
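from urllib.parse import urlencode

import pymongo
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


def main():
    # the company query step is omitted here; it provides `companies`
    URL = r'https://www.zhipin.com/job_detail/?{}'
    ua = UserAgent()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    # connect once, outside the loop, and reuse the collection
    mongo_client = pymongo.MongoClient('mongodb://localhost:27017/')
    db = mongo_client['enginploy']
    job_boss = db['job_boss']
    for company in companies:
        # build the search URL for this company
        param = urlencode({
            'position': '',
            'industry': '',
            'scity': 100010000,
            'query': company['name']
        })
        real_url = URL.format(param)
        # use a fresh random User-Agent for every request
        headers['User-Agent'] = ua.random
        r = requests.get(real_url, headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        # every job entry sits in a div with class "job-primary"
        jobs = soup.find_all('div', class_='job-primary')
        print('Get', company['name'])
        for job in jobs:
            loot(job, job_boss)

# the details of loot() are left out here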
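To see what the search URL looks like before sending any request, the URL-building part can be tried on its own. This is only a sketch: `build_search_url()` is a helper name I made up, and it reuses the same query parameters as the loop in main().

```python
from urllib.parse import urlencode

URL = 'https://www.zhipin.com/job_detail/?{}'


def build_search_url(company_name):
    """Build the job search URL for one company name."""
    # urlencode() percent-encodes the values, so Chinese names are safe to use
    param = urlencode({
        'position': '',
        'industry': '',
        'scity': 100010000,
        'query': company_name,
    })
    return URL.format(param)


url = build_search_url('阿博茨科技')
print(url)
```

Printing the URL first is a cheap way to check the parameters in a browser before the crawler starts hitting the site.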
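The main() function above just does a few things: for each company, it builds the search URL, requests the page with a random User-Agent, parses the job entries with Beautiful Soup, and hands each one to loot() to be stored in the job_boss collection.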
Let’s see what the loot() function looks like:
import hashlib


def loot(job, job_boss):
    # pull the job fields out of one "job-primary" block
    job_name = job.select(
        'div.info-primary > h3 > a > div.job-title')[0].string
    job_salary = job.select('div.info-primary > h3 > a > span.red')[0].string
    job_tags = job.select('div.info-primary > p')[0].contents
    job_tags = rm_vline(job_tags)
    # pull the company fields
    company_name = job.select('div.info-company > div > h3 > a')[0].string
    company_tags = job.select('div.info-company > div > p')[0].contents
    company_tags = rm_vline(company_tags)
    job_publish = job.select('div.info-publis > p')[0].string
    # hash name, salary, and company so the same job always gets the same key
    hash_raw = '{},{},{}'.format(job_name, job_salary, company_name)
    m = hashlib.md5()
    m.update(hash_raw.encode('utf-8'))
    hashcode = m.hexdigest()
    info = {
        'job_name': job_name,
        'job_salary': job_salary,
        'job_tags': job_tags,
        'company_name': company_name,
        'company_tags': company_tags,
        'job_publish': job_publish,
        'hashcode': hashcode
    }
    # upsert: replace the record with this hashcode, or insert it if it is new
    job_boss.find_one_and_replace({'hashcode': info['hashcode']}, info, upsert=True)


def rm_vline(tags):
    # drop the <em class="vline"></em> separators and keep the text pieces
    return [str(x).replace(r'\'', '') for x in tags
            if x is not None and str(x) != '<em class="vline"></em>']
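The hashcode is what keeps the collection free of duplicates when the crawler runs more than once. Here is a small sketch of that idea in isolation, assuming the hash is meant to cover the job name, salary, and company name, as hash_raw suggests. `job_hashcode()` and `upsert()` are made-up names, and the plain dict only stands in for the MongoDB collection.

```python
import hashlib


def job_hashcode(job_name, job_salary, company_name):
    """Return a stable MD5 hex digest for one job posting."""
    hash_raw = '{},{},{}'.format(job_name, job_salary, company_name)
    # md5 works on bytes, so the string has to be encoded first
    return hashlib.md5(hash_raw.encode('utf-8')).hexdigest()


# A plain dict keyed by hashcode, standing in for the job_boss collection.
store = {}


def upsert(info):
    """Insert the record, or overwrite the one with the same hashcode."""
    store[info['hashcode']] = info


first = {'job_name': 'Java', 'job_salary': '10k-20k', 'company_name': '乐行科技'}
first['hashcode'] = job_hashcode('Java', '10k-20k', '乐行科技')
upsert(first)
upsert(dict(first))  # crawling the same job again does not add a second record

other = {'job_name': 'JAVA工程师', 'job_salary': '15k-30k', 'company_name': '乐行科技'}
other['hashcode'] = job_hashcode('JAVA工程师', '15k-30k', '乐行科技')
upsert(other)

print(len(store))
```

One trade-off of hashing these three fields: if a company changes the salary of a posting, it will be stored as a new job instead of updating the old one.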
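The loot() function works in a few steps: it selects the job and company fields with CSS selectors, cleans the tag lists with rm_vline(), computes an MD5 hashcode to identify the job, and upserts the record into MongoDB by that hashcode.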
Now, we can count how many jobs have been collected.
> db.job_boss.count()
About 725 jobs, not bad.
And here are some samples:
产品助理 5k-7k 圣贝拉母婴月子会所 ['珠海 ', '3-5年', '大专'] ['医疗健康', 'A轮', '100-499人'] 发布于12月11日
设计师 8k-12k 圣贝拉母婴月子会所 ['珠海 ', '1-3年', '本科'] ['医疗健康', 'A轮', '100-499人'] 发布于12月13日
大数据研发总监 35k-65k 阿博茨科技 ['北京 海淀区 清河', '5-10年', '本科'] ['互联网', 'A轮', '100-499人'] 发布于11月21日
搜索算法专家 30k-60k 阿博茨科技 ['北京 海淀区 清河', '3-5年', '本科'] ['互联网', 'A轮', '100-499人'] 发布于05月04日
Java高级开发工程师 20k-35k 葡萄智学 ['北京 海淀区 五道口', '3-5年', '本科'] ['在线教育', '天使轮', '100-499人'] 发布于09月13日
Java技术经理/架构师 25k-45k 葡萄智学 ['北京 海淀区 五道口', '3-5年', '本科'] ['在线教育', '天使轮', '100-499人'] 发布于12月04日
搜索研发(高级)工程师 15k-30k 小熊博望 ['北京 海淀区 西二旗', '1-3年', '本科'] ['互联网', '未融资', '100-499人'] 发布于12月17日
后端研发(高级)工程师 15k-30k 小熊博望 ['北京 海淀区 西二旗', '1-3年', '本科'] ['互联网', '未融资', '100-499人'] 发布于12月17日
核算会计 3k-5k 融易算 ['青岛 ', '经验不限', '大专'] ['企业服务', '天使轮', '500-999人'] 发布于11月16日
会计主管 6k-10k 融易算 ['青岛 ', '3-5年', '大专'] ['企业服务', '天使轮', '500-999人'] 发布于11月16日
JAVA工程师 15k-30k 乐行科技 ['广州 天河区 五山', '3-5年', '本科'] ['互联网', 'A轮', '100-499人'] 发布于10月15日
产品运营 12k-20k 乐行科技 ['广州 天河区 五山', '1-3年', '本科'] ['互联网', 'A轮', '100-499人'] 发布于11月19日
Java 10k-20k 乐行科技 ['广州 天河区 五山', '1-3年', '本科'] ['互联网', 'A轮', '100-499人'] 发布于07月04日
We made it! Now we have a bunch of rich companies that have announced they received plenty of money, and we also know their jobs. In the next episode, we will try to draw a chart and summarize some patterns. Thanks for reading, and see you soon.