广州-小护士

ENGINPLOY Ep1 - Find Some Rich Companies

Hello guys, I am William Lee. Today I am bringing you a brand-new blog series and a brand-new experience.

This is ENGINPLOY, a project that redefines job hunting with computer technology. I named it ENGINPLOY because “engineering of employment” is too long. ENGINPLOY is a series of episodes, and each of them has its own main mission. Episode 1’s main mission is Find Some Rich Companies.

Tools

Okay, first things first. I need to decide which tools to use.

Tool	Concept	Usage	Link
Python 3.4	Programming Language	Code Stuff	https://docs.python.org/3/
MongoDB 4.0	NoSQL Database	Store Stuff	https://docs.mongodb.com/v4.0/
Requests	Python Lib	Handle HTTP Stuff	http://docs.python-requests.org/en/master/
Beautiful Soup 4	Python Lib	Parse HTML Stuff	https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Pymongo	Python Lib	Handle Mongo Stuff	http://api.mongodb.com/python/current/tutorial.html

I really recommend that you practice a lot with the tools above. They are so modern and so cool that I feel like I am living in the future. However, I won’t cover how to install them in your development or production environment. You are on your own for that: click the links above and start your journey.

GitHub Repository

The second thing I need to do is prepare a not-so-famous GitHub repository. : )

And here is my link:
https://github.com/william8188/enginploy

After that, I clone the repository with a simple command:

git clone git@github.com:william8188/enginploy.git

Pretty easy, right?
If you don’t have an SSH key pair (public/private key) for this journey, check out this link:
https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/

Find Some Rich Companies

All right, let’s get started on the main mission. Where can we find rich companies? The 36kr website!

1. Check robots.txt

First, let’s check out its robots.txt:
https://www.36kr.com/robots.txt

# robots.txt
User-agent: *
Disallow: /users
Disallow: /xiaozhi
Disallow: /asynces
Disallow: /goods

I am an honest, humble gentleman. I swear that I shall follow this robots exclusion protocol and only take public information legally. You guys should make this promise, too.

Therefore, I must not visit URLs under /users, /xiaozhi, /asynces, or /goods.
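
If you would rather let Python do the checking, here is a minimal sketch using the standard library's urllib.robotparser. The rules are pasted in from the listing above so the check runs offline; the helper names build_parser and is_allowed are my own, and the /users/123 path is just a made-up example.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, pasted in so the check runs without network access.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /users",
    "Disallow: /xiaozhi",
    "Disallow: /asynces",
    "Disallow: /goods",
]


def build_parser(lines):
    """Return a RobotFileParser loaded from an iterable of robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def is_allowed(parser, url, user_agent="*"):
    """True if the given user agent may fetch the URL under the parsed rules."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_LINES)
search_url = "https://www.36kr.com/search/articles/36%E6%B0%AA%E9%A6%96%E5%8F%91"
print(is_allowed(parser, search_url))                        # True
print(is_allowed(parser, "https://www.36kr.com/users/123"))  # False
```

Against the live site you could call parser.set_url(...) and parser.read() instead of pasting the rules, so that a changed robots.txt is picked up.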

2. Find A Pattern

After a couple of minutes, I finally figured out a pattern that shows which companies are really freaking rich: the article titles announce funding rounds, so they tell me who has just raised money. Here are the steps:

Visit this URL: https://www.36kr.com/search/articles/36%E6%B0%AA%E9%A6%96%E5%8F%91
Get titles like “36氪首发 | 做零售门店营销工具小程序，「企迈云商」获5000万元A轮融资”
Work out the company’s name by locating the positions of the “「” and “」” characters.
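
The last step can be sketched with plain string positions, before any regex gets involved. This is only an illustration; the function name company_name is hypothetical, and it assumes the first 「 in a title opens the company name.

```python
def company_name(title, left="「", right="」"):
    """Return the text between the first 「 and the next 」, or None if absent."""
    start = title.find(left)
    if start == -1:
        return None
    end = title.find(right, start + len(left))
    if end == -1:
        return None
    return title[start + len(left):end]


title = "36氪首发 | 做零售门店营销工具小程序，「企迈云商」获5000万元A轮融资"
print(company_name(title))  # 企迈云商
```

Titles without the bracket pair simply give None, so they are easy to skip.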

Coding Stuff

1. Visit The URL By Requests

import requests

URL = r'https://www.36kr.com/search/articles/36%E6%B0%AA%E9%A6%96%E5%8F%91'
r = requests.get(URL)
print(r.text)

Once I print the text of the response object, I can see that I have got a bunch of HTML.
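
For anything longer than a quick experiment, I would wrap the request so that a hanging connection or an HTTP error does not go unnoticed. The sketch below is one way to do that with the timeout argument and raise_for_status() from Requests; the helper name fetch_html, the 10-second default and the optional session parameter are my own assumptions.

```python
def fetch_html(url, session=None, timeout=10):
    """Return the response body as text, raising on HTTP errors or timeouts."""
    if session is None:
        # Imported lazily so the helper can also be exercised with a stand-in session.
        import requests
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx answers into an exception
    return response.text


# Usage (needs network access):
#   html_stuff = fetch_html('https://www.36kr.com/search/articles/36%E6%B0%AA%E9%A6%96%E5%8F%91')
```

Passing your own requests.Session also lets you reuse one connection across several pages.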

2. Handle HTML By Beautiful Soup 4

from bs4 import BeautifulSoup
import re

html_stuff = r.text
soup = BeautifulSoup(html_stuff, 'html.parser')
script_list = soup.find_all('script', string=re.compile('window.initialState='))
print(len(script_list))
print(script_list[0])

Yep, I get the script element that I need, and now I am going to cut off the useless characters.

3. Convert String To JSON

The script element I get looks like this:

<script>window.initialState={"searchResultData":{"code":0,"data" ... </script>

I am pretty sure this string hides a complete JSON document. Let’s extract it:

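html_string = str(script_list[0])
# Strip the <script> wrapper and the JavaScript assignment so only the JSON remains.
html_string = html_string.replace('<script>window.initialState=', '')
json_string = html_string.replace('</script>', '')
print(json_string)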
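
An alternative sketch that does not depend on cutting off the exact wrapper text: json.JSONDecoder.raw_decode parses one JSON value starting at a given index and ignores whatever follows it. The helper name extract_initial_state and the sample string are made up for illustration; the real page may differ in details such as a trailing semicolon.

```python
import json

MARKER = "window.initialState="


def extract_initial_state(script_text, marker=MARKER):
    """Parse the JSON value that follows the marker inside script_text."""
    start = script_text.find(marker)
    if start == -1:
        raise ValueError("marker not found")
    index = start + len(marker)
    # raw_decode does not skip leading whitespace, so do that by hand.
    while index < len(script_text) and script_text[index].isspace():
        index += 1
    # raw_decode stops after one JSON value, so a trailing ';' or
    # '</script>' does not need to be removed first.
    obj, _end = json.JSONDecoder().raw_decode(script_text, index)
    return obj


sample = '<script>window.initialState={"searchResultData":{"code":0}};</script>'
print(extract_initial_state(sample))  # {'searchResultData': {'code': 0}}
```

The result is already a Python dict, so no separate json.loads call is needed with this approach.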

4. Observe JSON Structure

I copy the JSON to this website:
https://www.json.cn/

Now I can clearly see how to get the titles.

{
    "searchResultData":{
        "code":0,
        "data":{
            "searchResult":{
                "code":0,
                "data":{
                    "items":[
                        {
                            "id":5169947,
                            "title":"36氪首发 | 做零售门店营销工具小程序，「企迈云商」获5000万元A轮融资",
                            "project_id":"1",
                        
                        ...

5. Get The Titles

import json

json_dict = json.loads(json_string)
items = json_dict['searchResultData']['data']['searchResult']['data']['items']
titles = []
for item in items:
    titles.append(item['title'])
    print(item['title'])

After that, I get a result like this:

36氪首发 | 做零售门店营销工具小程序，「企迈云商」获5000万元A轮融资
36氪首发 |「新声信息技术」完成A轮融资，将建设多家新兴产业引领中心
36氪首发 | 母婴经济正当时，高端月子护理品牌「圣贝拉」获 5000 万元 A 轮融资
36氪首发 |「阿博茨科技」宣布完成 3000 万美元 B 轮融资，人工智能在金融领域落地加速
36氪首发 | 「iFaster 甄快」获 1000 万天使轮融资，想用快充解决方案切入电动车充电市场
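
The long chain of keys raises a KeyError as soon as the site changes its JSON layout. Here is a small defensive sketch, assuming the same key path as above; the helper names dig and collect_titles are hypothetical.

```python
ITEMS_PATH = ("searchResultData", "data", "searchResult", "data", "items")


def dig(data, path, default=None):
    """Follow a sequence of dict keys, returning default if any key is missing."""
    current = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def collect_titles(state):
    """Return the titles of all items that have one; empty list if the path is absent."""
    items = dig(state, ITEMS_PATH) or []
    return [item["title"] for item in items
            if isinstance(item, dict) and "title" in item]


state = {"searchResultData": {"data": {"searchResult": {"data": {"items": [
    {"id": 1, "title": "first"}, {"id": 2}, {"id": 3, "title": "third"},
]}}}}}
print(collect_titles(state))  # ['first', 'third']
print(collect_titles({}))     # []
```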

6. Refine Company’s Name

I want to use a regex to extract the company’s name, because it is so easy to code.

# Capture the text between the corner brackets 「 and 」.
PATTERN = r'.*「(.+)」.*'
infos = []
for title in titles:
    res = re.search(PATTERN, title)
    if res:  # skip titles without a 「company」 marker
        infos.append(
            {
                'name': res.group(1),
                'title': title
            }
        )
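
One thing to keep in mind: the pattern above yields a single name per title. If a title ever mentions two bracketed names, a non-greedy pattern with findall picks up each of them. This is a sketch under that assumption; the two-company title is invented for the example.

```python
import re

# Non-greedy: each match stops at the nearest closing bracket.
NAME_PATTERN = re.compile(r"「(.+?)」")


def company_names(title):
    """Return every bracketed name in the title, in order of appearance."""
    return NAME_PATTERN.findall(title)


print(company_names("36氪首发 |「新声信息技术」完成A轮融资，将建设多家新兴产业引领中心"))
# ['新声信息技术']
print(company_names("「甲公司」与「乙公司」完成合并"))
# ['甲公司', '乙公司']
```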

Pretty close. Now I need to store this information in MongoDB. But before that, I calculate a hash code for each title, which serves as the lookup key when storing.

7. Calculate Hash Code

import hashlib

for info in infos:
    m = hashlib.md5()
    m.update(info['title'].encode())
    hashcode = m.hexdigest()
    info['hashcode'] = hashcode
    print(info['name'], info['hashcode'])

Print!

兰渡文化 f0ff5f576ca8f049514ed9a4e27c6a82
畅行智能 ffcc09b6431364a01f06f373f0aa26f3
捍宇医疗 0c7eb712f09cb5d2d4a8cc089d89b185
葡萄智学 40d7f2a903e448f6ed683817a181392c
锐吉科技 8c0bec2c846974cb59f13adcc51f6795
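
Since the hash depends only on the title, it can also be used to drop duplicate titles before anything reaches the database. A minimal sketch, with hypothetical helper names title_hash and dedupe_by_hash; MD5 is used here purely as a fingerprint, not for anything security related.

```python
import hashlib


def title_hash(title):
    """Hex MD5 digest of the UTF-8 encoded title."""
    return hashlib.md5(title.encode("utf-8")).hexdigest()


def dedupe_by_hash(infos):
    """Keep the first info dict for each distinct title, adding its hashcode."""
    seen = set()
    unique = []
    for info in infos:
        key = title_hash(info["title"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(dict(info, hashcode=key))
    return unique


sample = [
    {"name": "A", "title": "same title"},
    {"name": "A", "title": "same title"},
    {"name": "B", "title": "other title"},
]
print([info["name"] for info in dedupe_by_hash(sample)])  # ['A', 'B']
```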

8. Store To MongoDB

import pymongo

mongo_client = pymongo.MongoClient('mongodb://localhost:27017/')
db = mongo_client['enginploy']
company_36kr = db['company_36kr']
for info in infos:
    company_36kr.find_one_and_replace(
        {'hashcode': info['hashcode']}, info, upsert=True)
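
The same upsert idea can be wrapped in a small function that also reports how many documents were new. This sketch uses pymongo's replace_one with upsert=True and reads upserted_id from the result; the function name upsert_companies is my own, and the unique index in the usage comment is a suggestion, not something the code above sets up.

```python
def upsert_companies(collection, infos):
    """Upsert each info dict keyed by 'hashcode'; return how many were newly inserted."""
    inserted = 0
    for info in infos:
        result = collection.replace_one(
            {"hashcode": info["hashcode"]},  # match an existing document by hash
            info,                            # full replacement document
            upsert=True,                     # insert when nothing matches
        )
        if result.upserted_id is not None:
            inserted += 1
    return inserted


# Example wiring (assumes pymongo is installed and MongoDB runs on localhost):
#   import pymongo
#   client = pymongo.MongoClient('mongodb://localhost:27017/')
#   collection = client['enginploy']['company_36kr']
#   collection.create_index('hashcode', unique=True)
#   print(upsert_companies(collection, infos))
```

Running it twice on the same infos should report zero new documents the second time, which is a quick way to confirm that the hash really prevents duplicates.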

When I open a terminal to check the data in MongoDB, I find a result like this:

> db.company_36kr.find()
{ "_id" : ObjectId("5c29e56338c7150eb0a59fcd"), "name" : "企迈云商", "hashcode" : "116f850e060823ae38425b493ef6f7b2", "title" : "36氪首发 | 做零售门店营销工具小程序，「企迈云商」获5000万元A轮融资" }
{ "_id" : ObjectId("5c29e56338c7150eb0a59fcf"), "name" : "新声信息技术", "hashcode" : "3321713e18ddd40a35b9109dd3ec0a35", "title" : "36氪首发 |「新声信息技术」完成A轮融资，将建设多家新兴产业引领中心" }
{ "_id" : ObjectId("5c29e56338c7150eb0a59fd1"), "name" : "圣贝拉", "hashcode" : "3565af0ab172c112a78be8ebee309f8c", "title" : "36氪首发 | 母婴经济正当时，高端月子护理品牌「圣贝拉」获 5000 万元 A 轮融资" }

Yep, I made it.

Mission Completed

Now I have finished the main mission of this episode. I feel so happy : )

When the idea of ENGINPLOY came up, I spent a whole afternoon setting up the environment, reading the documentation I needed, and debugging again and again. Dude, come on. Nothing comes easy.

This was ENGINPLOY Episode One - Find Some Rich Companies. I hope you guys enjoyed this brand-new blog. I will continue to deliver more ENGINPLOY content. I am William Lee. Thanks for reading, and see you next time.

BTW, the Chinese version will come out later. Don’t miss it.
