Hello guys, I am William Lee. Today, I am going to bring you a whole new blog with a whole new experience.
As you can see, this is ENGINPLOY, which redefines employment with computer technology. I named it ENGINPLOY because "engineering of employment" is too long. ENGINPLOY is a series of episodes, and each of them has its own main mission. Episode 1’s main mission is Find Some Rich Companies.
Okay, first things first. I gotta check out what tools I should use.
Tool | Concept | Usage | Link |
---|---|---|---|
Python 3.4 | Programming Language | Code Stuff | https://docs.python.org/3/ |
MongoDB 4.0 | NoSQL Database | Store Stuff | https://docs.mongodb.com/v4.0/ |
Requests | Python Lib | Handle HTTP Stuff | http://docs.python-requests.org/en/master/ |
Beautiful Soup 4 | Python Lib | Parse HTML Stuff | https://www.crummy.com/software/BeautifulSoup/bs4/doc/ |
Pymongo | Python Lib | Handle Mongo Stuff | http://api.mongodb.com/python/current/tutorial.html |
Beyond listing these tools, I really, really recommend that you guys practice with them a lot. They are so fashionable and so cool that I feel like I am living in the future. However, I won’t cover how to install them in your development or production environment. You are on your own there. Click the links above and start your journey.
The second thing I need to do is prepare a not-so-famous GitHub repository. : )
And here is my link:
https://github.com/william8188/enginploy
After that, I clone this repository with a simple command:
git clone git@github.com:william8188/enginploy.git
Pretty easy, right?
If you guys don’t have a key pair (public/private key) for this journey, check this link out:
https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/
All right, let’s get started on the main mission. Where can I find the rich companies? The 36kr website!
First, let’s check out its robots.txt:
https://www.36kr.com/robots.txt
# robots.txt
User-agent: *
Disallow: /users
Disallow: /xiaozhi
Disallow: /asynces
Disallow: /goods
I am an honest, humble gentleman. I swear that I shall follow this robots exclusion protocol and only take public information legally. You guys shall make this promise, too.
Therefore, I cannot visit the /users, /xiaozhi, /asynces, or /goods URLs.
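If you guys would rather let the code keep the promise for you, here is a minimal sketch using Python's standard urllib.robotparser. The rules are pasted in as text so it runs offline. The helper names build_parser and is_allowed are my own, not part of any library, and the /users/1 URL is just a made-up example of a blocked path.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, copied in as text so the sketch runs offline.
ROBOTS_TXT = """\
# robots.txt
User-agent: *
Disallow: /users
Disallow: /xiaozhi
Disallow: /asynces
Disallow: /goods
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """True if the loaded rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
SEARCH_URL = "https://www.36kr.com/search/articles/36%E6%B0%AA%E9%A6%96%E5%8F%91"
print(is_allowed(parser, SEARCH_URL))
print(is_allowed(parser, "https://www.36kr.com/users/1"))  # hypothetical blocked URL
```

Calling is_allowed before every request is a cheap way to avoid stepping on a disallowed path by accident.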
After a couple of minutes, I finally figure out a pattern for getting information that shows which companies are really freaking rich. Here is how:
import requests
URL = r'https://www.36kr.com/search/articles/36%E6%B0%AA%E9%A6%96%E5%8F%91'
r = requests.get(URL)
print(r.text)
Once I print the text from the response object, I know I have got a bunch of HTML stuff.
from bs4 import BeautifulSoup
import re
html_stuff = r.text
soup = BeautifulSoup(html_stuff, 'html.parser')
script_list = soup.find_all('script', string=re.compile('window.initialState='))
print(len(script_list))
print(script_list[0])
Yep, I get the HTML string that I need, and I’m gonna cut off the useless characters.
Because I get the HTML string like this:
<script>window.initialState={"searchResultData":{"code":0,"data" ... </script>
I’m pretty sure this string hides a complete JSON document. Let’s convert it:
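html_string = str(script_list[0])
# Strip the <script> wrapper and the JavaScript assignment so only the JSON remains
html_string = html_string.replace('<script>window.initialState=', '')
html_string = html_string.replace('</script>', '')
json_string = html_string
print(json_string)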
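Cutting text around the JSON depends on the exact wrapper the page uses. As an alternative sketch, the standard json module can parse one JSON value and ignore whatever trails it, so a closing tag or a semicolon does no harm. The function name extract_initial_state and the sample string are my own inventions for illustration; the real page content will differ.

```python
import json

MARKER = "window.initialState="


def extract_initial_state(script_text, marker=MARKER):
    """Return the JSON value that follows `marker` in script_text.

    Raises ValueError if the marker is missing or the JSON is broken.
    """
    start = script_text.index(marker) + len(marker)
    payload = script_text[start:].lstrip()
    # raw_decode parses a single JSON value and reports where it stopped,
    # so anything after the value is simply left alone.
    data, _end = json.JSONDecoder().raw_decode(payload)
    return data


# Made-up sample shaped like the script tag shown above.
sample = ('<script>window.initialState={"searchResultData":'
          '{"code":0,"data":{"items":[{"id":1,"title":"demo"}]}}};</script>')
state = extract_initial_state(sample)
print(state["searchResultData"]["code"])
```

With Beautiful Soup, the text to pass in would be str(script_list[0]), the same string used in the step above.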
I copy the JSON to this website:
https://www.json.cn/
And I can clearly see how to get the titles.
{
  "searchResultData":{
    "code":0,
    "data":{
      "searchResult":{
        "code":0,
        "data":{
          "items":[
            {
              "id":5169947,
              "title":"36氪首发 | 做零售门店营销工具小程序,「企迈云商」获5000万元A轮融资",
              "project_id":"1",
              ...
import json
json_dict = json.loads(json_string)
items = json_dict['searchResultData']['data']['searchResult']['data']['items']
titles = []
for item in items:
    titles.append(item['title'])
    print(item['title'])
After that, I get the result like this:
36氪首发 | 做零售门店营销工具小程序,「企迈云商」获5000万元A轮融资
36氪首发 |「新声信息技术」完成A轮融资,将建设多家新兴产业引领中心
36氪首发 | 母婴经济正当时,高端月子护理品牌「圣贝拉」获 5000 万元 A 轮融资
36氪首发 |「阿博茨科技」宣布完成 3000 万美元 B 轮融资,人工智能在金融领域落地加速
36氪首发 | 「iFaster 甄快」获 1000 万天使轮融资,想用快充解决方案切入电动车充电市场
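The chain of five keys above breaks with a KeyError as soon as the site changes its JSON layout. Here is a small defensive sketch. The helpers dig and collect_titles are hypothetical names of mine, and the sample dictionary is made up to mirror the structure shown earlier.

```python
def dig(data, *keys, default=None):
    """Follow keys through nested dicts; return default if any step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def collect_titles(state):
    """Return every item title found under the search result, or an empty list."""
    items = dig(state, 'searchResultData', 'data', 'searchResult', 'data', 'items',
                default=[])
    return [item['title'] for item in items if 'title' in item]


# Made-up sample with the same nesting as the JSON above.
sample_state = {
    'searchResultData': {
        'data': {'searchResult': {'data': {'items': [{'id': 1, 'title': 'demo'}]}}}
    }
}
print(collect_titles(sample_state))
print(collect_titles({}))
```

An empty list is easier to handle in the later steps than a crash halfway through the run.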
I wanna use a regex to pull out the company’s name, because it is so easy to code.
# Company names are wrapped in 「」 brackets in the titles
PATTERN = '.*「(.+)」.*'
infos = []
for title in titles:
    res = re.search(PATTERN, title)
    # Skip titles that carry no bracketed name
    if res:
        infos.append(
            {
                'name': res.group(1),
                'title': title
            }
        )
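One thing to watch: when a title holds two bracketed names, the greedy pattern above keeps only the last one. A hedged alternative is a non-greedy pattern with re.findall, which returns every name. The function extract_names is my own helper, and the two-company title below is invented for the demo, not taken from 36kr.

```python
import re

# Non-greedy: stop at the first closing bracket after each opening one.
NAME_PATTERN = re.compile(r'「(.+?)」')


def extract_names(title):
    """Return all names wrapped in 「」 brackets, in order of appearance."""
    return NAME_PATTERN.findall(title)


real_title = '36氪首发 | 「iFaster 甄快」获 1000 万天使轮融资,想用快充解决方案切入电动车充电市场'
made_up_title = '「甲公司」收购「乙公司」'  # invented example with two names

print(extract_names(real_title))
print(extract_names(made_up_title))
# For comparison, the greedy pattern keeps only the last name:
print(re.search('.*「(.+)」.*', made_up_title).group(1))
```

For titles with a single company the two patterns give the same name, so this only matters for the odd title that mentions more than one.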
Pretty close. Now I have to store this information in MongoDB. But before that, a hash code should be calculated for each title, so it can serve as the lookup key when saving.
import hashlib
for info in infos:
    # The MD5 of the title gives each article a stable fingerprint
    m = hashlib.md5()
    m.update(info['title'].encode())
    hashcode = m.hexdigest()
    info['hashcode'] = hashcode
    print(info['name'], info['hashcode'])
Print!
兰渡文化 f0ff5f576ca8f049514ed9a4e27c6a82
畅行智能 ffcc09b6431364a01f06f373f0aa26f3
捍宇医疗 0c7eb712f09cb5d2d4a8cc089d89b185
葡萄智学 40d7f2a903e448f6ed683817a181392c
锐吉科技 8c0bec2c846974cb59f13adcc51f6795
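The hashing step can also be wrapped in a tiny function. This is only a sketch: title_hash is my own name, and the encoding is spelled out as UTF-8, which is what str.encode() uses by default in Python 3. MD5 works here as a fingerprint for spotting the same title twice, not as any kind of security feature.

```python
import hashlib


def title_hash(title):
    """Return the MD5 hex digest of the title's UTF-8 bytes."""
    return hashlib.md5(title.encode('utf-8')).hexdigest()


print(title_hash('hello'))  # → 5d41402abc4b2a76b9719d911017c592

# The same title always gives the same hash, so a re-crawled
# article maps onto the record that is already stored.
title = '36氪首发 |「新声信息技术」完成A轮融资,将建设多家新兴产业引领中心'
print(title_hash(title) == title_hash(title))
```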
import pymongo
mongo_client = pymongo.MongoClient('mongodb://localhost:27017/')
db = mongo_client['enginploy']
company_36kr = db['company_36kr']
for info in infos:
    # Replace the document with the same hashcode, or insert it if none exists
    company_36kr.find_one_and_replace(
        {'hashcode': info['hashcode']}, info, upsert=True)
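To see why the upsert keeps the collection free of duplicates, here is a sketch you can run without a MongoDB server. upsert_companies and FakeCollection are hypothetical names of mine. The fake only imitates the one pymongo method used here, replace_one(filter, replacement, upsert=True), and is nowhere near a full stand-in for a real collection.

```python
def upsert_companies(collection, infos):
    """Insert or replace one document per info, keyed by its hashcode.

    `collection` is anything with a pymongo-style
    replace_one(filter, replacement, upsert=True) method.
    Returns the number of infos processed.
    """
    count = 0
    for info in infos:
        collection.replace_one({'hashcode': info['hashcode']}, info, upsert=True)
        count += 1
    return count


class FakeCollection:
    """In-memory stand-in, only for trying the sketch without a server."""

    def __init__(self):
        self.docs = {}

    def replace_one(self, filter, replacement, upsert=False):
        key = filter['hashcode']
        # Replace an existing document, or insert when upsert is on.
        if key in self.docs or upsert:
            self.docs[key] = dict(replacement)


fake = FakeCollection()
sample_infos = [
    {'hashcode': 'abc', 'name': 'demo', 'title': 'first crawl'},
    {'hashcode': 'abc', 'name': 'demo', 'title': 'second crawl'},  # same key again
]
print(upsert_companies(fake, sample_infos), len(fake.docs))
```

With a real server, the same function should accept the company_36kr collection from the step above, since pymongo collections provide replace_one.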
When I open a terminal to check out data in MongoDB, I find the result like this:
> db.company_36kr.find()
{ "_id" : ObjectId("5c29e56338c7150eb0a59fcd"), "name" : "企迈云商", "hashcode" : "116f850e060823ae38425b493ef6f7b2", "title" : "36氪首发 | 做零售门店营销工具小程序,「企迈云商」获5000万元A轮融资" }
{ "_id" : ObjectId("5c29e56338c7150eb0a59fcf"), "name" : "新声信息技术", "hashcode" : "3321713e18ddd40a35b9109dd3ec0a35", "title" : "36氪首发 |「新声信息技术」完成A轮融资,将建设多家新兴产业引领中心" }
{ "_id" : ObjectId("5c29e56338c7150eb0a59fd1"), "name" : "圣贝拉", "hashcode" : "3565af0ab172c112a78be8ebee309f8c", "title" : "36氪首发 | 母婴经济正当时,高端月子护理品牌「圣贝拉」获 5000 万元 A 轮融资" }
Yep, I made it.
Now I have finished the main mission of this episode. I feel so happy : )
When the idea of ENGINPLOY came to me, I spent a whole afternoon setting up the environment, reading the documentation I needed, and debugging again and again. Dude, come on. Nothing is easy to see through to the end.
This is ENGINPLOY Episode One - Find Some Rich Companies. I hope you guys enjoy this whole new blog. I will continue to deliver more ENGINPLOY content. I am William Lee. Thanks for reading, and see you next time.
BTW, the Chinese version will come out later, don’t miss it.