Author: kevin
- First decide which fields we want to scrape and define them in items.py ahead of time, for example:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class HospitalItem(scrapy.Item):
    # Hospital name
    name = scrapy.Field()
    # Year established
    establish_date = scrapy.Field()
    # Hospital level
    level = scrapy.Field()
    # Type of medical institution
    hosp_type = scrapy.Field()
    # Nature of operation
    serv_property = scrapy.Field()
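Most settings in settings.py can be left unchanged, except for the one that controls whether the crawler obeys robots.txt (ROBOTSTXT_OBEY in Scrapy). Many sites use a robots.txt file to define rules for crawlers; if your spider cannot fetch pages, you can try setting it to False.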
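As a small illustration of the change described above: in projects generated by `scrapy startproject`, the template settings.py sets `ROBOTSTXT_OBEY = True`. The fragment below is only a sketch of the edited line, not the author's full settings file.

```python
# settings.py (fragment)
# Do not fetch and apply robots.txt rules before crawling.
# Only change this if the spider is being blocked and crawling is permitted.
ROBOTSTXT_OBEY = False
```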
Then create a new spider .py file in the project's spiders directory and define a dedicated spider class in it. A few points to note:
It is a good idea to add headers like the following so the crawler looks like a browser, then pass them through the headers option when calling Request:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
}
def parse_dir_contents(self, response):
    item = HospitalItem()
    item['name'] = response.xpath(
    item['establish_date'] = response.xpath(
If the scraped text needs to be processed with a regular expression (for example, to strip newline characters), you can do something like the following:
item['qq_acc'] = response.xpath(
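The cleaning step itself can be tried outside Scrapy. The sketch below uses only the standard `re` module; `clean_text` and the sample string are hypothetical names and data for illustration, not taken from the original spider. Scrapy selectors also offer a `.re()` method for applying a pattern directly to extracted text.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse newlines, tabs and repeated spaces into single spaces, then trim."""
    return re.sub(r"\s+", " ", raw).strip()


# Hypothetical value as it might come back from an XPath text() query
raw = "\n  QQ: 12345\r\n\t(online)  \n"
print(clean_text(raw))  # QQ: 12345 (online)
```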
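If you need to scrape multiple levels, for example a top-level directory whose entries each link to a page with detailed information, first collect all target URLs from the directory page, then iterate over them and use a callback to call the function dedicated to scraping the detail page. A concrete example: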
def parse(self, response):
    hospitals = response.xpath(
        '//div[@class="seek_int_left"]/h3/a/@href').extract()  # collect the URL of every entry in the directory
    for hospital in hospitals:  # iterate over each URL
        url = str(hospital)
        yield Request(url, callback=self.parse_dir_contents)  # the callback scrapes the second-level page
    urlhead = "http://yyk.qqyy.com/search_dplmhks0i"
    urltail = ".html"
    if self.i < 3018:  # move on to the next directory page
        real = urlhead + str(self.i) + urltail
        self.i = self.i + 1
        yield Request(real, headers=self.headers)
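The page-URL construction used above can be looked at in isolation. This is a minimal sketch: the URL prefix, suffix and the upper bound 3018 come from the spider, while the function name `directory_urls` and the starting page number 1 are assumptions, since the initial value of `self.i` is not shown.

```python
from typing import Iterator

URL_HEAD = "http://yyk.qqyy.com/search_dplmhks0i"
URL_TAIL = ".html"


def directory_urls(start: int = 1, stop: int = 3018) -> Iterator[str]:
    """Yield one directory-page URL per page number in start .. stop - 1."""
    for page in range(start, stop):
        yield f"{URL_HEAD}{page}{URL_TAIL}"


urls = list(directory_urls())
print(urls[0])    # first directory page
print(len(urls))  # number of directory pages that would be requested
```

A list built this way could also be assigned to a spider's `start_urls` instead of chaining one request per page inside `parse`.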
df = df.replace('\n','', regex=True)
If every exported value needs to be wrapped in double quotes, first convert all columns to string type, then pass the quoting parameter when exporting:
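A minimal sketch of such an export, assuming pandas is installed. The DataFrame contents and column names are made-up sample data; `quoting=csv.QUOTE_ALL` is the standard-library constant that `DataFrame.to_csv` accepts.

```python
import csv

import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"name": ["Hospital A", "Hospital B"], "level": [3, 2]})

# Convert every column to string, then quote every field on export.
# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.astype(str).to_csv(index=False, quoting=csv.QUOTE_ALL)
print(csv_text)
```

To write a file instead, pass a path as the first argument to `to_csv`.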
pip3 install fuzzywuzzy[speedup]
# fuzz compares two strings with each other
from fuzzywuzzy import fuzz
# process compares one string against several other strings
from fuzzywuzzy import process
fuzz.ratio("this is a test", "this is a fun") #output 74
fuzz.partial_ratio("this is a test", "test a is this") #output 57
fuzz.token_sort_ratio("this is a test", "is this a test") #output 100
fuzz.token_set_ratio("this is a test", "this is is a test") #output 100
choices = ['fuzzy fuzzy was a bear', 'is this a test', 'THIS IS A TEST!!']
process.extract("this is a test", choices, scorer=fuzz.ratio)
# output:
[('THIS IS A TEST!!', 100),
 ('is this a test', 86),
 ('fuzzy fuzzy was a bear', 33)]
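To see where scores such as 74 and 100 come from, the sketch below reimplements two of the scorers with the standard library only. It is an approximation for illustration, not fuzzywuzzy's own code: fuzzywuzzy falls back to `difflib.SequenceMatcher` when the speedup package is missing, but results with python-Levenshtein installed can differ slightly for some inputs. The function names here are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity as a 0-100 integer: 2 * matched characters / total characters."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_sort_ratio(a: str, b: str) -> int:
    """Sort the whitespace-separated tokens first, so word order is ignored."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return simple_ratio(norm_a, norm_b)


# "this is a " (10 characters) matches, so 2 * 10 / (14 + 13) = 0.74
print(simple_ratio("this is a test", "this is a fun"))       # 74
print(token_sort_ratio("this is a test", "is this a test"))  # 100
```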
For data in a pandas DataFrame, we can do fuzzy matching in a similar way. Suppose we want to find addresses in df that may be duplicates:
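# keep only the rows that have an address
lookups_addr = df[df.addr.notnull()].addr
# for every address take its two best matches; because lookups_addr is a Series,
# each match is a (matched, score, key) tuple, where key is the index label
res = [(lookup_a,) + item
       for lookup_a in lookups_addr
       for item in process.extract(lookup_a, lookups_addr, limit=2)]
df1 = pd.DataFrame(res, columns=["lookup", "matched", "score", "name"])
# likely duplicates: very similar but not identical
df1[(df1.score < 100) & (df1.score > 90)]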
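The same near-duplicate search can be sketched without pandas or fuzzywuzzy. This version swaps in `difflib` for the scorer and compares each unordered pair once, so an address is never matched against itself; the sample addresses and the helper names are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def score(a: str, b: str) -> int:
    """Similarity of two strings as a 0-100 integer."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def near_duplicates(values, low=90, high=100):
    """Return (a, b, score) for every unordered pair with low < score < high."""
    pairs = []
    for a, b in combinations(values, 2):
        s = score(a, b)
        if low < s < high:
            pairs.append((a, b, s))
    return pairs


# Hypothetical sample addresses
addresses = [
    "12 Renmin Road, Shanghai",
    "12 Renmin Road Shanghai",
    "88 Zhongshan Avenue, Nanjing",
]
for pair in near_duplicates(addresses):
    print(pair)
```

Comparing all pairs is quadratic in the number of addresses, so for a large table it may be worth grouping by city or postcode first.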
I used ECharts for visualization; it is an open-source JavaScript library from Baidu.
There are lots of good templates to choose from: http://echarts.baidu.com/examples/
geo: {
    map: 'china',
    label: {
        emphasis: {
            show: false
        }
    },
    roam: false,
    itemStyle: {
        normal: {
            areaColor: '#404448',
            borderColor: '#111'
        },
        emphasis: {
            areaColor: '#2a333d'
        }
    },
    silent: true // do not respond to mouse clicks on the map
},
series: [
    {
        name: 'Top 50',
        type: 'effectScatter', // here we choose 'effectScatter'
        coordinateSystem: 'geo', // either 'geo' or 'bmap', depending on what you specified above in setOption
        data: convertData(top50), // point data
        symbolSize: function (val) {
            // we can change the symbol size based on its value;
            // if the range is too large, we can take the
            // square root or even the cube root to reduce it
            return Math.sqrt(val[2]) / 10;
        },
        showEffectOn: 'render',
        rippleEffect: {
            brushType: 'stroke'
        },
        hoverAnimation: true,
        label: {
            normal: {
                formatter: '{b}',
                position: 'right',
                show: false
            }
        },
        itemStyle: {
            normal: {
                color: '#891d14',
                shadowBlur: 5,
                shadowColor: '#333'
            }
        },
        zlevel: 1
    }
]
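The effect of the square-root scaling in `symbolSize` is easy to check with a few numbers. The sketch below is written in Python for consistency with the earlier examples; it mirrors the arithmetic of the JavaScript callback, and the sample points (longitude, latitude, value) are made up.

```python
import math


def symbol_size_sqrt(val, divisor=10.0):
    """Same arithmetic as the symbolSize callback: val is [lng, lat, value]."""
    return math.sqrt(val[2]) / divisor


def symbol_size_cbrt(val, divisor=1.0):
    """Cube-root variant, which compresses a wide value range even further."""
    return val[2] ** (1.0 / 3.0) / divisor


# Hypothetical points whose values span four orders of magnitude
points = [[121.47, 31.23, 100], [116.40, 39.90, 10000], [113.26, 23.13, 1000000]]
for p in points:
    print(p[2], symbol_size_sqrt(p), round(symbol_size_cbrt(p), 2))
```

Values from 100 to 1,000,000 map to sizes from 1 to 100 with the square root, and to roughly 4.6 to 100 with the cube root, which keeps the smallest points visible.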