ENGINPLOY Ep1 - Find Some Rich Companies

Hello guys, I am William Lee. Today I am bringing you a whole new blog with a whole new experience.

As you can see, this is ENGINPLOY, which redefines employment with computer technology. I named it ENGINPLOY because "engineering of employment" is too long. ENGINPLOY is a series of episodes, and each of them has its own main mission. Episode 1’s main mission is Find Some Rich Companies.


Tools

Okay, first things first. I gotta check out what tools I should use.

Tool             | Concept              | Usage              | Link
Python 3.4       | Programming Language | Code Stuff         | https://docs.python.org/3/
MongoDB 4.0      | NoSQL Database       | Store Stuff        | https://docs.mongodb.com/v4.0/
Requests         | Python Lib           | Handle HTTP Stuff  | http://docs.python-requests.org/en/master/
Beautiful Soup 4 | Python Lib           | Parse HTML Stuff   | https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Pymongo          | Python Lib           | Handle Mongo Stuff | http://api.mongodb.com/python/current/tutorial.html

I really, really recommend that you guys practice a lot with the tools above. They are so fashionable and so cool that I feel like I am living in the future. However, I won’t cover how to install them in your development or production environment. You are on your own now. Click the links above and start your journey.


GitHub Repository

The second thing I need to do is prepare a not-so-famous GitHub repository. : )

And here is my link:
https://github.com/william8188/enginploy

After that, I clone this repository with a simple command:

git clone git@github.com:william8188/enginploy.git

Pretty easy, right?
If you guys don’t have a keypair (public/private key) for this journey, you should check this link out:
https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/


Find Some Rich Companies

All right, let’s get started on the main mission. Where to find the rich companies? The 36kr website!

1. Check robots.txt

First, let’s check out its robots.txt:
https://www.36kr.com/robots.txt

# robots.txt
User-agent: *
Disallow: /users
Disallow: /xiaozhi
Disallow: /asynces
Disallow: /goods

I am an honest, humble gentleman. I swear that I shall follow this robots exclusion protocol and only take public information legally. You guys shall make this promise, too.

Therefore, I cannot visit URLs under /users, /xiaozhi, /asynces, or /goods.
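
If you want the program to keep this promise for you, here is a minimal sketch using the standard library's urllib.robotparser. It parses the rules shown above from a string, so it does not hit the network. The helper names build_parser and is_allowed are my own for illustration, and the /users/1 URL is just a made-up example of a disallowed path.

```python
from urllib.robotparser import RobotFileParser

# The rules copied from the robots.txt shown above.
ROBOTS_TXT = """\
# robots.txt
User-agent: *
Disallow: /users
Disallow: /xiaozhi
Disallow: /asynces
Disallow: /goods
"""


def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent='*'):
    """Return True if the rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
search_url = 'https://www.36kr.com/search/articles/36%E6%B0%AA%E9%A6%96%E5%8F%91'
print(is_allowed(parser, search_url))
# Hypothetical URL under a disallowed prefix.
print(is_allowed(parser, 'https://www.36kr.com/users/1'))
```

In a real crawler you would probably call set_url() and read() on the parser to download the live file instead of pasting the rules in.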

2. Find A Pattern

After a couple of minutes, I finally figure out a pattern that tells me which companies are really freaking rich. Here are the steps:

  1. Visit this URL https://www.36kr.com/search/articles/36%E6%B0%AA%E9%A6%96%E5%8F%91
  2. Get titles like “36氪首发 | 做零售门店营销工具小程序,「企迈云商」获5000万元A轮融资” (roughly: “36Kr first release | Building marketing-tool mini programs for retail stores, 「企迈云商」 raises a 50 million yuan Series A”)
  3. Figure out the company’s name by locating the positions of the “「” and “」” characters.
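
Step 3 can be sketched with plain string methods. This is only an illustration of the idea of locating the two bracket positions; the function name extract_company_name is my own, and it only looks at the first pair of brackets in a title.

```python
def extract_company_name(title):
    """Return the text between the first 「 and the next 」, or None."""
    start = title.find('「')
    end = title.find('」', start + 1)
    if start == -1 or end == -1:
        # No complete pair of corner brackets in this title.
        return None
    return title[start + 1:end]


title = '36氪首发 | 做零售门店营销工具小程序,「企迈云商」获5000万元A轮融资'
print(extract_company_name(title))
```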

Coding Stuff

1. Visit The URL By Requests

import requests

URL = r'https://www.36kr.com/search/articles/36%E6%B0%AA%E9%A6%96%E5%8F%91'
r = requests.get(URL)
print(r.text)

Once I print the text of the response object, I can see I have got a bunch of HTML stuff.
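
By the way, the percent-encoded tail of that URL is just the UTF-8 encoding of the search keyword 36氪首发. Here is a small sketch that builds the same URL from the keyword with the standard library, assuming the site keeps the /search/articles/ path used above. The name build_search_url is my own.

```python
from urllib.parse import quote

SEARCH_BASE = 'https://www.36kr.com/search/articles/'


def build_search_url(keyword):
    """Percent-encode the keyword as UTF-8 and append it to the search path."""
    return SEARCH_BASE + quote(keyword, safe='')


url = build_search_url('36氪首发')
print(url)
```

The resulting string can be handed to requests.get() just like the hard-coded URL, which makes it easy to try other keywords later.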

2. Handle HTML By Beautiful Soup 4

from bs4 import BeautifulSoup
import re

html_stuff = r.text
soup = BeautifulSoup(html_stuff, 'html.parser')
script_list = soup.find_all('script', string=re.compile('window.initialState='))
print(len(script_list))
print(script_list[0])

Yep, I get the HTML string that I need, and I am gonna cut off the useless characters.

3. Convert String To JSON

Because I get the HTML string like this:

<script>window.initialState={"searchResultData":{"code":0,"data" ... script>

I am pretty sure this string hides a complete JSON document. Let’s convert it:

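# Strip the <script> wrapper and the assignment prefix so only JSON remains.
# (The replace() arguments are reconstructed from the sample string above.)
html_string = str(script_list[0])
html_string = html_string.replace('<script>window.initialState=', '')
html_string = html_string.replace('</script>', '')
json_string = html_string
print(json_string)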
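
A slightly more defensive way to pull the JSON out is sketched below. It assumes the object literal starts right after window.initialState= and lets json.JSONDecoder.raw_decode stop at the end of the object, so a trailing semicolon or closing tag does not matter. The function name and the sample string are made up for illustration.

```python
import json

MARKER = 'window.initialState='


def extract_initial_state(script_text):
    """Parse the JSON object that follows MARKER inside script_text."""
    start = script_text.find(MARKER)
    if start == -1:
        raise ValueError('marker not found in script text')
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever comes after it.
    state, _end = json.JSONDecoder().raw_decode(script_text, start + len(MARKER))
    return state


# Hypothetical, shortened stand-in for the real script tag.
sample = '<script>window.initialState={"searchResultData":{"code":0,"data":{}}};</script>'
state = extract_initial_state(sample)
print(state['searchResultData']['code'])
```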

4. Observe JSON Structure

I copy the JSON to this website:
https://www.json.cn/

And I can clearly see how to get the titles.

{
    "searchResultData":{
        "code":0,
        "data":{
            "searchResult":{
                "code":0,
                "data":{
                    "items":[
                        {
                            "id":5169947,
                            "title":"36氪首发 | 做零售门店营销工具小程序,「企迈云商」获5000万元A轮融资",
                            "project_id":"1",
                        
                        ...
                        

5. Get The Titles

import json

json_dict = json.loads(json_string)
items = json_dict['searchResultData']['data']['searchResult']['data']['items']
titles = []
for item in items:
    titles.append(item['title'])
    print(item['title'])

After that, I get the result like this:

36氪首发 | 做零售门店营销工具小程序,「企迈云商」获5000万元A轮融资
36氪首发 |「新声信息技术」完成A轮融资,将建设多家新兴产业引领中心
36氪首发 | 母婴经济正当时,高端月子护理品牌「圣贝拉」获 5000 万元 A 轮融资
36氪首发 |「阿博茨科技」宣布完成 3000 万美元 B 轮融资,人工智能在金融领域落地加速
36氪首发 | 「iFaster 甄快」获 1000 万天使轮融资,想用快充解决方案切入电动车充电市场
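
The chain of five dictionary keys breaks with a KeyError as soon as the site changes its structure. Here is a small sketch of a more forgiving lookup. The helper names dig and extract_titles are my own, and the sample dictionary only mirrors the structure shown in step 4.

```python
def dig(data, *keys, default=None):
    """Walk nested dicts key by key; return default if any key is missing."""
    current = data
    for key in keys:
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current


def extract_titles(state):
    """Collect the title of every search result item, skipping odd ones."""
    items = dig(state, 'searchResultData', 'data', 'searchResult',
                'data', 'items', default=[])
    return [item['title'] for item in items if 'title' in item]


# Shortened stand-in for the parsed JSON.
sample_state = {
    'searchResultData': {'code': 0, 'data': {'searchResult': {'code': 0, 'data': {
        'items': [
            {'id': 5169947,
             'title': '36氪首发 | 做零售门店营销工具小程序,「企迈云商」获5000万元A轮融资'},
            {'id': 0},  # hypothetical item without a title
        ]}}}}
}
print(extract_titles(sample_state))
print(extract_titles({}))
```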

6. Refine Company’s Name

I wanna use a regex to extract the company’s name, because it is so easy to code.

# Capture the text between the corner brackets 「 and 」.
PATTERN = r'.*「(.+)」.*'
infos = []
for title in titles:
    # Match once and reuse the result.
    res = re.match(PATTERN, title)
    if res:
        infos.append(
            {
                'name': res.group(1),
                'title': title
            }
        )

Pretty close. Now I have to store this information in MongoDB. But before that, a hash code should be calculated.
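
One thing to watch: a greedy (.+) behaves oddly when a title contains more than one pair of brackets. Here is a sketch of a non-greedy variant with re.findall that returns every bracketed name. The two-company title is a made-up example, and extract_names is my own helper name.

```python
import re

# Non-greedy: stop at the first closing bracket after each opening one.
BRACKETED = re.compile(r'「(.+?)」')


def extract_names(title):
    """Return every name wrapped in 「」 in the title, in order."""
    return [name.strip() for name in BRACKETED.findall(title)]


real_title = '36氪首发 | 「iFaster 甄快」获 1000 万天使轮融资,想用快充解决方案切入电动车充电市场'
made_up_title = '36氪首发 | 「甲」与「乙」合并'  # hypothetical title with two names

print(extract_names(real_title))
print(extract_names(made_up_title))
# For comparison, a greedy search spans from the first 「 to the last 」.
print(re.search(r'「(.+)」', made_up_title).group(1))
```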

7. Calculate Hash Code

import hashlib

for info in infos:
    m = hashlib.md5()
    m.update(info['title'].encode())
    hashcode = m.hexdigest()
    info['hashcode'] = hashcode
    print(info['name'], info['hashcode'])

Print!

兰渡文化 f0ff5f576ca8f049514ed9a4e27c6a82
畅行智能 ffcc09b6431364a01f06f373f0aa26f3
捍宇医疗 0c7eb712f09cb5d2d4a8cc089d89b185
葡萄智学 40d7f2a903e448f6ed683817a181392c
锐吉科技 8c0bec2c846974cb59f13adcc51f6795
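
The point of the hash code is that the same title always gives the same value, so it can serve as a key for spotting duplicates. Here is a small sketch of that idea, with helper names of my own (title_hashcode, dedupe_by_hashcode). MD5 is fine for this kind of fingerprinting, but it is not meant for anything security related.

```python
import hashlib


def title_hashcode(title):
    """Return the MD5 hex digest of the UTF-8 encoded title."""
    return hashlib.md5(title.encode('utf-8')).hexdigest()


def dedupe_by_hashcode(infos):
    """Keep the first info per title hash and attach the hash to it."""
    seen = set()
    unique = []
    for info in infos:
        code = title_hashcode(info['title'])
        if code in seen:
            continue
        seen.add(code)
        unique.append(dict(info, hashcode=code))
    return unique


# Made-up records: the first two share the same title.
sample_infos = [
    {'name': 'A', 'title': 'title one'},
    {'name': 'A', 'title': 'title one'},
    {'name': 'B', 'title': 'title two'},
]
unique_infos = dedupe_by_hashcode(sample_infos)
print(len(unique_infos))
```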

8. Store To MongoDB

import pymongo

mongo_client = pymongo.MongoClient('mongodb://localhost:27017/')
db = mongo_client['enginploy']
company_36kr = db['company_36kr']
for info in infos:
    company_36kr.find_one_and_replace(
        {'hashcode': info['hashcode']}, info, upsert=True)

When I open a terminal to check out data in MongoDB, I find the result like this:

> db.company_36kr.find()
{ "_id" : ObjectId("5c29e56338c7150eb0a59fcd"), "name" : "企迈云商", "hashcode" : "116f850e060823ae38425b493ef6f7b2", "title" : "36氪首发 | 做零售门店营销工具小程序,「企迈云商」获5000万元A轮融资" }
{ "_id" : ObjectId("5c29e56338c7150eb0a59fcf"), "name" : "新声信息技术", "hashcode" : "3321713e18ddd40a35b9109dd3ec0a35", "title" : "36氪首发 |「新声信息技术」完成A轮融资,将建设多家新兴产业引领中心" }
{ "_id" : ObjectId("5c29e56338c7150eb0a59fd1"), "name" : "圣贝拉", "hashcode" : "3565af0ab172c112a78be8ebee309f8c", "title" : "36氪首发 | 母婴经济正当时,高端月子护理品牌「圣贝拉」获 5000 万元 A 轮融资" }

Yep, I made it.
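
As a variation on step 8, here is a sketch that builds the filter and update documents for an upsert with update_one and $set instead of replacing the whole document. The helper build_upsert is my own name. The pymongo calls are left as comments because they need a running MongoDB, and the unique index is an optional extra that is not part of the original code.

```python
def build_upsert(info):
    """Return (filter, update) documents for upserting one company record."""
    flt = {'hashcode': info['hashcode']}
    update = {'$set': {'name': info['name'], 'title': info['title']}}
    return flt, update


# Sample record taken from the MongoDB output above.
sample_info = {
    'name': '企迈云商',
    'hashcode': '116f850e060823ae38425b493ef6f7b2',
    'title': '36氪首发 | 做零售门店营销工具小程序,「企迈云商」获5000万元A轮融资',
}
flt, update = build_upsert(sample_info)
print(flt)

# Usage with the collection from step 8 (assumes a running MongoDB):
# company_36kr.create_index('hashcode', unique=True)
# for info in infos:
#     flt, update = build_upsert(info)
#     company_36kr.update_one(flt, update, upsert=True)
```

With upsert=True, the hashcode from the filter is written into a newly inserted document, so each title still ends up stored once.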


Mission Completed

Now I have finished the main mission of this episode. I feel so happy : )

When the idea of ENGINPLOY came up, I spent a whole afternoon setting up the environment, reading the documents I needed, and debugging again and again. Dude, come on. Nothing is easy to carry through to the end.

This was ENGINPLOY Episode One - Find Some Rich Companies. I hope you guys enjoyed this whole new blog. I will continue to deliver more ENGINPLOY content. I am William Lee. Thanks for reading, and see you next time.

BTW, the Chinese version will come out later. Don’t miss it.
