ENGINPLOY Ep2 - Find Some Jobs

Hello guys, I am William Lee, and this is Enginploy Episode Two. Today, I am going to find some jobs for you guys. Let’s rock and roll!

The first thing I need to do is list the steps:

  1. Query all the company names that were collected in Episode One.
  2. Prepare a Python library called fake-useragent, which generates random User-Agent values.
  3. Collect job information and store it in MongoDB.
  4. Check how many jobs we have collected.

BTW, here is my repository:
https://github.com/william8188/enginploy


Tools

These are amazing tools you don’t want to miss.

Tool             | Concept              | Usage                      | Link
Python 3.4       | Programming Language | Code Stuff                 | https://docs.python.org/3/
MongoDB 4.0      | NoSQL Database       | Store Stuff                | https://docs.mongodb.com/v4.0/
Requests         | Python Lib           | Handle HTTP Stuff          | http://docs.python-requests.org/en/master/
Beautiful Soup 4 | Python Lib           | Parse HTML Stuff           | https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Pymongo          | Python Lib           | Handle Mongo Stuff         | http://api.mongodb.com/python/current/tutorial.html
Fake UserAgent   | Python Lib           | Generate Random User-Agent | https://github.com/hellysmile/fake-useragent

Honestly, Fake UserAgent is not a necessary tool. You can make a list of User-Agent strings yourself and pick one at random for each request. Whether to use it or not is up to you.
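
The do-it-yourself alternative could look roughly like this. It is only a sketch: the USER_AGENTS pool and the random_headers() helper are made-up names for illustration, and the strings in the pool are just examples.

```python
import random

# Hypothetical pool of User-Agent strings; swap in whichever browsers you like.
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
]


def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}


headers = random_headers()
print(headers['User-Agent'])
```

Calling random_headers() before every request gives the same effect as ua.random, just with a pool you maintain by hand.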


Find Some Jobs

Okay, let’s follow the list above and finish the mission step by step.

1. Query all company names

import pymongo

# get companies from mongo
mongo_client = pymongo.MongoClient('mongodb://localhost:27017/')
db = mongo_client['enginploy']
company_36kr = db['company_36kr']
# find() returns a lazy cursor, so read it into a list before closing the connection
companies = list(company_36kr.find())
mongo_client.close()

This is almost a standard code template. It does the following:

  • Create a MongoDB client object and open a connection
  • Select a database
  • Run a query
  • Close the client connection at the end
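
The same steps can also be written with the client as a context manager, so the connection is closed even if the query fails. This is a sketch under a few assumptions: every company document has a 'name' field (it is used that way later in this episode), and fetch_company_names() and load_company_names() are hypothetical helper names, not something from the repository.

```python
def fetch_company_names(collection):
    """Read every document right away and return only the 'name' values."""
    # Turn the cursor into a list while the connection is still open.
    return [doc['name'] for doc in collection.find({}, {'name': 1, '_id': 0})]


def load_company_names(uri='mongodb://localhost:27017/'):
    """Open a connection, query the names, and close the connection."""
    import pymongo  # imported here so the helper above works without a database

    # MongoClient can be used in a with-statement; it closes itself on exit.
    with pymongo.MongoClient(uri) as client:
        return fetch_company_names(client['enginploy']['company_36kr'])
```

Returning plain strings instead of a cursor means nothing depends on the connection once the function returns.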

2. Prepare fake-useragent

As the README of the GitHub repository says, we need to install fake-useragent manually, like this:

pip install fake-useragent

Then import it into our script:

from fake_useragent import UserAgent

ua = UserAgent()

Now we can change the User-Agent header value whenever we want:

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
headers['User-Agent'] = ua.random

Easy Peasy, guys : )

3. Collect job information and store it in MongoDB

In this step, we still use Requests and Beautiful Soup to process details:

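from urllib.parse import urlencode

import pymongo
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent


def main():

    # the company query from step 1 is omitted here; it provides `companies`

    URL = r'https://www.zhipin.com/job_detail/?{}'
    ua = UserAgent()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    # open one MongoDB connection for the whole run instead of one per company
    mongo_client = pymongo.MongoClient('mongodb://localhost:27017/')
    db = mongo_client['enginploy']
    job_boss = db['job_boss']
    for company in companies:
        # build the search URL for this company name
        param = urlencode({
            'position': '',
            'industry': '',
            'scity': 100010000,
            'query': company['name']
        })
        real_url = URL.format(param)
        # use a fresh random User-Agent for every request
        headers['User-Agent'] = ua.random
        r = requests.get(real_url, headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        # every job posting sits in a div with the class "job-primary"
        jobs = soup.find_all('div', class_='job-primary')
        print('Get', company['name'])
        for job in jobs:
            loot(job, job_boss)
    mongo_client.close()

# loot() is shown further down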
            

The code above does just a few things:

  • Change the User-Agent header
  • Request the URLs and receive the HTML text
  • Extract the content we want
  • Create a MongoDB client
  • Call the loot() function
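
To see what the request URL ends up looking like, the URL-building part can be tried on its own without touching the network. A small sketch: build_search_url() is a hypothetical helper name, and the parameters are simply the ones used in main().

```python
from urllib.parse import parse_qs, urlencode, urlsplit

URL = 'https://www.zhipin.com/job_detail/?{}'


def build_search_url(company_name):
    """Return the search URL for one company name."""
    param = urlencode({
        'position': '',
        'industry': '',
        'scity': 100010000,
        'query': company_name,
    })
    return URL.format(param)


url = build_search_url('乐行科技')
print(url)

# Decoding the query string gives the original company name back.
query = parse_qs(urlsplit(url).query)
print(query['query'][0])
```

urlencode() percent-encodes the Chinese company name as UTF-8, so there is no need to escape it by hand.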

Let’s see what the loot() function looks like:

import hashlib


def loot(job, job_boss):
    # job title, salary, and job tags (location, experience, education)
    job_name = job.select(
        'div.info-primary > h3 > a > div.job-title')[0].string
    job_salary = job.select('div.info-primary > h3 > a > span.red')[0].string
    job_tags = job.select('div.info-primary > p')[0].contents
    job_tags = rm_vline(job_tags)
    # company name and company tags (industry, funding round, size)
    company_name = job.select('div.info-company > div > h3 > a')[0].string
    company_tags = job.select('div.info-company > div > p')[0].contents
    company_tags = rm_vline(company_tags)
    job_publish = job.select('div.info-publis > p')[0].string
    # hash title, salary, and company name into a key that identifies the job
    hash_raw = '{},{},{}'.format(job_name, job_salary, company_name)
    hashcode = hashlib.md5(hash_raw.encode()).hexdigest()
    info = {
        'job_name': job_name,
        'job_salary': job_salary,
        'job_tags': job_tags,
        'company_name': company_name,
        'company_tags': company_tags,
        'job_publish': job_publish,
        'hashcode': hashcode
    }
    # replace the stored job with the same hashcode, or insert it if there is none
    job_boss.find_one_and_replace({'hashcode': info['hashcode']}, info, upsert=True)
    print(info['job_name'], info['job_salary'], info['company_name'],
          info['job_tags'], info['company_tags'], info['job_publish'])


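def rm_vline(tags):
    # .contents mixes text nodes with the vertical-line separator tags;
    # keep only the non-empty text nodes (Beautiful Soup strings are str subclasses),
    # which matches the sample output at the end of this episode
    return [str(x) for x in tags if isinstance(x, str) and str(x) != '']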
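
To see why the tag lists need cleaning, here is a small sketch that can be run without the website. The HTML snippet is made up: it assumes the separators between the tags are empty <em> elements, which is a guess about the page structure, and text_nodes() is a hypothetical name for the filtering idea.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up snippet shaped like one tag line of a job card (an assumption, not copied from the site).
html = '<p>北京 海淀区 清河<em class="vline"></em>5-10年<em class="vline"></em>本科</p>'

soup = BeautifulSoup(html, 'html.parser')
contents = soup.p.contents  # text nodes and <em> tag nodes, in document order


def text_nodes(nodes):
    """Keep only non-empty text nodes and drop the separator tags."""
    return [str(n) for n in nodes if isinstance(n, NavigableString) and str(n) != '']


print(len(contents))
print(text_nodes(contents))
```

The .contents list holds five nodes here, three strings and two separator tags, and only the three strings survive the filter.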

The loot() function performs these steps:

  • Extract the text fields
  • Compute a hashcode for every job
  • Upsert the job into MongoDB
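
The hashcode is what keeps the collection free of duplicates when the scraper runs more than once. The sketch below shows the idea with a plain dict standing in for the job_boss collection, so it is not the MongoDB call itself. job_hashcode() and upsert() are hypothetical helper names, and the second publish date is invented for the illustration.

```python
import hashlib


def job_hashcode(job_name, job_salary, company_name):
    """Return the MD5 hex digest of the three fields that identify a job."""
    hash_raw = '{},{},{}'.format(job_name, job_salary, company_name)
    return hashlib.md5(hash_raw.encode()).hexdigest()


def upsert(store, info):
    """Insert info, or replace the entry that has the same hashcode."""
    store[info['hashcode']] = info


store = {}  # in-memory stand-in for the job_boss collection
for publish in ('发布于12月11日', '发布于12月13日'):
    info = {
        'job_name': '产品助理',
        'job_salary': '5k-7k',
        'company_name': '圣贝拉母婴月子会所',
        'job_publish': publish,
    }
    info['hashcode'] = job_hashcode(
        info['job_name'], info['job_salary'], info['company_name'])
    upsert(store, info)

print(len(store))  # the second record replaced the first one
```

Because the publish date is not part of the hash, a job that is re-posted with a new date overwrites its old record instead of being stored twice.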

4. Check how many jobs we have collected

Now, we can count how many jobs have been collected.

> db.job_boss.count()
725

That is 725 jobs, not bad.

And here are some samples:

产品助理 5k-7k 圣贝拉母婴月子会所 ['珠海  ', '3-5年', '大专'] ['医疗健康', 'A轮', '100-499人'] 发布于12月11日
设计师 8k-12k 圣贝拉母婴月子会所 ['珠海  ', '1-3年', '本科'] ['医疗健康', 'A轮', '100-499人'] 发布于12月13日
大数据研发总监 35k-65k 阿博茨科技 ['北京 海淀区 清河', '5-10年', '本科'] ['互联网', 'A轮', '100-499人'] 发布于11月21日
搜索算法专家 30k-60k 阿博茨科技 ['北京 海淀区 清河', '3-5年', '本科'] ['互联网', 'A轮', '100-499人'] 发布于05月04日
Java高级开发工程师 20k-35k 葡萄智学 ['北京 海淀区 五道口', '3-5年', '本科'] ['在线教育', '天使轮', '100-499人'] 发布于09月13日
Java技术经理/架构师 25k-45k 葡萄智学 ['北京 海淀区 五道口', '3-5年', '本科'] ['在线教育', '天使轮', '100-499人'] 发布于12月04日
搜索研发(高级)工程师 15k-30k 小熊博望 ['北京 海淀区 西二旗', '1-3年', '本科'] ['互联网', '未融资', '100-499人'] 发布于12月17日
后端研发(高级)工程师 15k-30k 小熊博望 ['北京 海淀区 西二旗', '1-3年', '本科'] ['互联网', '未融资', '100-499人'] 发布于12月17日
核算会计 3k-5k 融易算 ['青岛  ', '经验不限', '大专'] ['企业服务', '天使轮', '500-999人'] 发布于11月16日
会计主管 6k-10k 融易算 ['青岛  ', '3-5年', '大专'] ['企业服务', '天使轮', '500-999人'] 发布于11月16日
JAVA工程师 15k-30k 乐行科技 ['广州 天河区 五山', '3-5年', '本科'] ['互联网', 'A轮', '100-499人'] 发布于10月15日
产品运营 12k-20k 乐行科技 ['广州 天河区 五山', '1-3年', '本科'] ['互联网', 'A轮', '100-499人'] 发布于11月19日
Java 10k-20k 乐行科技 ['广州 天河区 五山', '1-3年', '本科'] ['互联网', 'A轮', '100-499人'] 发布于07月04日

Mission Completed

We made it! Now we have a bunch of rich companies that have announced they received plenty of money, and we also know what jobs they offer. In the next episode, we will try to draw a chart and summarize some patterns. Thanks for reading, and see you soon.
