这是优达学城Udacity“数据分析师”纳米学位的一个项目,跟大家分享,希望大家多多交流。
Map Area
Las Vegas, Nevada, USA
拉斯维加斯是美国的赌城,娱乐之都,早就听说过,也非常想去,所以做项目时就想为此地做一下。
一、地图遇到的问题
发现有以下问题:
- 街道名称缩写、不正确;
- 不正确的邮政编码;
导入必要的库,生成原文件10%样本的文件sample.osm;
#导入必要的库
import xml.etree.cElementTree as ET
import pprint
import re
from collections import defaultdict
import csv
import cerberus
import schema
import codecs
import string
import schema
#!/usr/bin/env python
# -*- coding: utf-8 -*-
OSM_FILE = "las-vegas_nevada.osm" # Replace this with your osm file
SAMPLE_FILE = "sample.osm"
k = 10 # Parameter: take every k-th top level element
def get_element(osm_file, tags=('node', 'way', 'relation')):
"""Yield element if it is the right type of tag
Reference:
http://stackoverflow.com/questions/3095434/inserting-newlines-in-xml-file-generated-via-xml-etree-elementtree-in-python
"""
context = iter(ET.iterparse(osm_file, events=('start', 'end')))
_, root = next(context)
for event, elem in context:
if event == 'end' and elem.tag in tags:
yield elem
root.clear()
with open(SAMPLE_FILE, 'wb') as output:
output.write('\n')
output.write('\n ')
# Write every kth top level element
for i, element in enumerate(get_element(OSM_FILE)):
if i % k == 0:
output.write(ET.tostring(element, encoding='utf-8'))
output.write(' ')
看数据中有多少标签
file_name = 'las-vegas_nevada.osm'
def count_tags(filename):
tags = {}
for _,elem in ET.iterparse(filename):
if elem.tag in tags:
tags[elem.tag] +=1
else:
tags[elem.tag] =1
return tags
def test():
tags = count_tags('las-vegas_nevada.osm')
pprint.pprint(tags)
if __name__ == "__main__":
test()
{'bounds': 1,
'member': 4333,
'nd': 1241838,
'node': 1048101,
'osm': 1,
'relation': 556,
'tag': 489437,
'way': 112116}
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'" \?%#$@\,\.\t\r\n]')
def key_type(element, keys):
if element.tag == "tag":
for tag in element.iter("tag"):
if lower.search(tag.attrib['k']):
keys['lower'] +=1
elif lower_colon.search(tag.attrib['k']):
keys['lower_colon'] += 1
elif problemchars.search(tag.attrib['k']):
keys["problemchars"] += 1
else :
keys["other"] += 1
return keys
def process_map(filename):
keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
for _, element in ET.iterparse(filename):
keys = key_type(element, keys)
return keys
keys = process_map(file_name)
pprint.pprint(keys)
{'lower': 316935, 'lower_colon': 165174, 'other': 7327, 'problemchars': 1}
我们编写了两个函数key_type()和process_map(),这两个函数使用正则表达式匹配来分离我们的键值。
我们将我们的数据集分为4组:lower,lower_colon,problemchars,other。
1、街道名称缩写、不正确
一些街道名称被缩写,例如“Marco St”,更新为全拼;
一些街道名称不正确,例如“S Maryland Parkway Suite A-5-262”,进行了更正;
首先打印所有的街道名称,查看全部类型的街道名称
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
expected = ["Avenue", "Commons", "Court", "Drive", "Lane", "Parkway",
"Place", "Road", "Square", "Street", "Trail"]
def audit_street_type(street_types, street_name):
m = street_type_re.search(street_name)
if m:
street_type = m.group()
if street_type not in expected:
street_types[street_type].add(street_name)
def is_street_name(elem):
return (elem.attrib['k'] == "addr:street") #Function to check value to be a street name
def audit(osmfile):
osm_file = open(osmfile, "r")
street_types = defaultdict(set)
for event, elem in ET.iterparse(osm_file, events=("start",)):
if elem.tag == "node" or elem.tag == "way":
for tag in elem.iter("tag"):
if is_street_name(tag):
audit_street_type(street_types, tag.attrib['v'])
return street_types
sort_street_types = audit(file_name)
pprint.pprint(dict(sort_street_types))
{'1': set(['Spanish Ridge Ave., Suite 1', 'West Gary Avenue #1']),
'10': set(['S. Valley View Blvd. Ste 10']),
'100': set(['S Eastern Ave #100']),
'103': set(['South Pecos Road Suite 103']),
'106': set(['S Grand Canyon Dr #106']),
'107': set(['S Decatur Blvd #107']),
'11': set(['Chandler Ave #11']),
'110': set(['South Jones Boulevard Suite 110']),
'111-559': set(['8465 W Sahara Avenue Suite 111-559']),
'115': set(['North Hualapai Way #115']),
'1170': set(['Fremont Street Ste. 1170']),
'120': set(['E SILVERADO RANCH BLVD Suite 120', 'Silverado Ranch Blvd #120']),
'15': set(['E Sahara Blvd #15']),
'150': set(['West Horizon Ridge Parkway, STE 150']),
'16': set(['Polaris Ave #16']),
'170': set(['Sky Pointe Dr #170']),
'207': set(['E Sahara Ave #207']),
'275': set(['Camino Al Norte #275']),
'2D': set(['Spring Mountain Rd #2D']),
'3230': set(['W Sahara Ave #3230']),
'500': set(['Howard Hughes Parkway, Suite 500']),
'705': set(['West Ali Baba Lane #705']),
'790': set(['Howard Hughes Pkwy #790']),
'93': set(['Highway 93']),
'A': set(['E Tropicana Ave #A', 'West Post Road A']),
'A-5-262': set(['S Maryland Parkway Suite A-5-262']),
'AVE': set(['W PACIFIC AVE']),
'Alicante': set(['Via Alicante']),
'Apache': set(['S Fort Apache', 'S. Fort Apache']),
'Ave': set(['4760 S Polaris Ave',
'E Cheyenne Ave',
'E Sahara Ave',
'East Tropicana Ave',
'S Eastern Ave',
'S. Eastern Ave',
'W Cactus Ave',
'W Sahara Ave',
'W Twain Ave',
'W Washington Ave',
'W. Sahara Ave']),
'Ave.': set(['200 Hoover Ave.',
'6601 W. Twain Ave.',
'East Twain Ave.',
'Glendale Ave.',
'Hoover Ave.',
'W. Arby Ave.',
'West Sahara Ave.']),
'B': set(['Avenue B', 'West Sahara Avenue #B']),
'B12': set(['W Sahara Ave #B12']),
'Blvd': set(['Adams Blvd',
'Buchanan Blvd',
'Carmen Blvd',
'S Casino Center Blvd',
'S Rainbow Blvd',
'S Valley View Blvd',
'Thomas Ryan Blvd',
'W Charleston Blvd']),
'Blvd.': set(['E. Flamingo Blvd.',
'N. Las Vegas Blvd.',
'S Rainbow Blvd.',
'West Lake Mead Blvd.']),
'Bonneville': set(['Bonneville']),
'Boulevard': set(['Adams Boulevard',
'Centennial Center Boulevard',
'Decatur Boulevard',
'Del Webb Boulevard',
'E. Charleston Boulevard',
'East Charleston Boulevard',
'East Lake Mead Boulevard',
'East Silverado Ranch Boulevard',
'Lamb Boulevard',
'Las Vegas Boulevard',
'Montelago Boulevard',
'N Casino Center Boulevard',
'North Arroyo Grande Boulevard',
'North Decatur Boulevard',
'North Jones Boulevard',
'North Lamb Boulevard',
'North Las Vegas Boulevard',
'North Nellis Boulevard',
'North Rainbow Boulevard',
'North Rampart Boulevard',
'Nu Wav Kaiv Boulevard',
'Rainbow Boulevard',
'S Las Vegas Boulevard',
'South Casino Center Boulevard',
'South Decatur Boulevard',
'South Hollywood Boulevard',
'South Jones Boulevard',
'South Las Vegas Boulevard',
'South Martin L King Boulevard',
'South Moapa Valley Boulevard',
'South Nellis Boulevard',
'South Rainbow Boulevard',
'South Rampart Boulevard',
'South Valley View Boulevard',
'Sun City Boulevard',
'Valley View Boulevard',
'Wayne Newton Boulevard',
'West Charleston Boulevard',
'West Lake Mead Boulevard',
'West Oakey Boulevard']),
'Buckskin': set(['W Buckskin']),
'Canto': set(['Via Bel Canto']),
'Charleston': set(['W Charleston']),
'Cir': set(['Sego Glen Cir']),
'Circle': set(['Citadel Circle',
'Inner Circle',
'Mall Ring Circle',
'Village Center Circle']),
'Dr': set(['Club House Dr',
'Corporate Park Dr',
'S Highland Dr',
'Village Walk Dr']),
'East': set(['Town Square East']),
'Experience': set(['Fremont Street Experience']),
'G': set(['Avenue G']),
'Highway': set(['Boulder Highway',
'Nevada Highway',
'North Boulder Highway',
'South Boulder Highway',
'Valley of Fire Highway']),
'Lavender': set(['W Lavender']),
'Ln': set(['Linda Ln']),
'Ln.': set(['Norman Rockwell Ln.']),
'M2': set(['W Craig Rd #M2']),
'Mt.': set(['Spring Mt.']),
'North': set(['Las Vegas Boulevard North']),
'Pkwy': set(['3547 S Maryland Pkwy', 'Grand Central Pkwy']),
'Prado': set(['Paseo del Prado']),
'Rd': set(['2560 E Sunset Rd',
'El Camino Rd',
'Hillpointe Rd',
'Losee Rd',
'Paradise Rd',
'S Fort Apache Rd',
'S Pecos Rd',
'W Craig Rd',
'W Warm Springs Rd']),
'Rd.': set(['E Sunset Rd.']),
'Rd5': set(['Airport Rd5']),
'S': set(['Las Vegas Blvd S',
'Las Vegas Boulevard S',
'Silverado Ranch Boulevard;Las Vegas Boulevard S']),
'S.': set(['Las Vegas Boulevard S.']),
'Sahara': set(['W Sahara']),
'South': set(['Las Vegas Blvd South',
'Las Vegas Boulevard South',
'Market Place South']),
'St': set(['Fremont St', 'Marco St', 'S 3rd St', 'S Main St']),
'St.': set(['5070 Arville St.']),
'Vegas': set(['Wynn Las Vegas']),
'Way': set(['6823 W. Ponderosa Way',
'Brandywine Way',
'Conestoga Way',
'Crystal Water Way',
'Donovan Way',
'Nevada Way',
'North Tenaya Way',
'SLS Way',
'Westminster Way',
'Wiesner Way']),
'apache': set(['South Fort apache']),
'ave': set(['W Sahara ave']),
'blvd': set(['N Rainbow blvd',
'S Las Vegas blvd',
'W Charleston blvd',
'W Lake Mead blvd']),
'blvd.': set(['S Las Vegas blvd.']),
'ln': set(['Rockwell ln'])}
根据以上结果,修复街道名称缩写的问题
#创建修正街道名称的字典
mapping = { "Ct": "Court",
"St": "Street",
"st": "Street",
"St.": "Street",
"St,": "Street",
"ST": "Street",
"street": "Street",
"Street.": "Street",
"Ave": "Avenue",
"Ave.": "Avenue",
"ave": "Avenue",
"AVE":"Avenue",
"Rd.": "Road",
"rd.": "Road",
"Rd": "Road",
"Hwy": "Highway",
"HIghway": "Highway",
"Pkwy": "Parkway",
"Pl": "Place",
"place": "Place",
"Sedgwick": "Sedgwick Street",
"Sq.": "Square",
"Newbury": "Newbury Street",
"Cir":"Circle",
"S Maryland Parkway Suite A-5-262":"S Maryland Parkway",
"Spanish Ridge Ave., Suite 1":"Spanish Ridge Avenue",
"S Eastern Ave #100":"S Eastern Avenue",
"South Pecos Road Suite 103":"South Pecos Road",
"Chandler Ave #11":"Chandler Ave",
}
# 更正街道名称
def update_name(name, mapping):
for key in mapping:
if key in name:
name = string.replace(name,key,mapping[key])
return name
for street_type, ways in sort_street_types.iteritems():
for name in ways:
update_name(name, mapping)
2、不正确的邮政编码
def audit_zipcode(invalid_zipcodes, zipcode):
threeDigits = zipcode[0:3]
if len(zipcode) != 6:
invalid_zipcodes[threeDigits].add(zipcode)
elif not threeDigits.isdigit():
invalid_zipcodes[threeDigits].add(zipcode)
elif threeDigits != '891':
invalid_zipcodes[threeDigits].add(zipcode)
def is_zipcode(elem):
return (elem.attrib['k'] == "addr:postcode")
def audit_zip(osmfile):
osm_file = open(osmfile, "r")
invalid_zipcodes = defaultdict(set)
for event, elem in ET.iterparse(osm_file, events=("start",)):
if elem.tag == "node" or elem.tag == "way":
for tag in elem.iter("tag"):
if is_zipcode(tag):
audit_zipcode(invalid_zipcodes,tag.attrib['v'])
return invalid_zipcodes
sort_zipcode = audit_zip(file_name)
pprint.pprint(dict(sort_zipcode))
{'645': set(['6451112']),
'890': set(['89002',
'89005',
'89011',
'89012',
'89014',
'89014-2132',
'89015',
'89019',
'89025',
'89030',
'89031',
'89040',
'89044',
'89052',
'89070',
'89074',
'89081',
'89086']),
'891': set(['89101',
'89102',
'89102-4370',
'89103',
'89104',
'89104-1307',
'89105',
'89106',
'89107',
'89108',
'89108-7049',
'89109',
'89109-1907',
'89110',
'89113',
'89115',
'89117',
'89118',
'89119',
'89119-1001',
'89119-6304',
'89120',
'89121',
'89122',
'89123',
'89128',
'89128-6634',
'89129',
'89129-7260',
'89130',
'89131',
'89134',
'89135',
'89135-1020',
'89135-1038',
'89139',
'89142',
'89144',
'89145',
'89146',
'89146-2977',
'89147',
'89147-4111',
'89147-8491',
'89148',
'89149',
'89154',
'89156',
'89161',
'89166',
'89169',
'89178',
'89179',
'89183',
'89191']),
'892': set(['8929']),
'NV ': set(['NV 89014',
'NV 89030',
'NV 89031',
'NV 89052',
'NV 89101',
'NV 89107',
'NV 89109',
'NV 89117',
'NV 89119',
'NV 89123',
'NV 89124',
'NV 89129',
'NV 89134',
'NV 89135',
'NV 89142',
'NV 89191']),
'Nev': set(['Nevada 89113'])}
def update_zip(zipcode):
if zipcode[0:3] == 'NV ':
zipcode = zipcode[3:]
return zipcode
elif zipcode[0:6] == 'Nevada':
zipcode = zipcode[7:]
return zipcode
elif zipcode[0:3] != '891':
zipcode = 'None'
return zipcode
elif zipcode[0:3] == '891' and len(zipcode) >6:
zipcode = zipcode[0:5]
return zipcode
for street_type, ways in sort_zipcode.iteritems():
for name in ways:
if update_zip(name):
update_zip(name)
接下来,我们将清理文件并分隔数据;
tags_clean 函数用来设置id,key,value和type的值,在把街道名称和邮编的更改添加到字典时使用;
OSM_PATH = file_name
NODES_PATH = "nodes.csv"
NODE_TAGS_PATH = "nodes_tags.csv"
WAYS_PATH = "ways.csv"
WAY_NODES_PATH = "ways_nodes.csv"
WAY_TAGS_PATH = "ways_tags.csv"
NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
WAY_NODES_FIELDS = ['id', 'node_id', 'position']
SCHEMA = schema.Schema
LOWER_COLON = re.compile(r'^([a-z]|_)+:([a-z]|_)+')
PROBLEMCHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
def tags_clean(id , tag ):
node_tagss = {}
node_tagss['id'] = int(id)
if tag.attrib['k'] == "addr:street":
node_tagss['value'] = update_name(tag.attrib['v'], mapping)
elif tag.attrib['k'] == "addr:postcode":
node_tagss['value'] = update_zip(tag.attrib['v'])
else:
node_tagss['value'] = tag.attrib['v']
if ":" not in tag.attrib['k']:
node_tagss['key'] = tag.attrib['k']
node_tagss['type'] = 'regular'
else:
pcolon = tag.attrib['k'].index(":") + 1
node_tagss['key'] = tag.attrib['k'][pcolon:]
node_tagss['type'] = tag.attrib['k'][:pcolon - 1]
return node_tagss
def shape_element(element, node_attr_fields=NODE_FIELDS, way_attr_fields=WAY_FIELDS,
problem_chars=PROBLEMCHARS, default_tag_type='regular'):
"""Clean and shape node or way XML element to Python dict"""
node_attribs = {}
way_attribs = {}
way_nodes = []
tags = [] # Handle secondary tags the same way for both node and way elements
node_tagss = {}
# YOUR CODE HERE
if element.tag == 'node':
for node in NODE_FIELDS:
node_attribs[node] = element.attrib[node]
node_attribs['id']= int(node_attribs['id'])
node_attribs['uid']= int(node_attribs['uid'])
node_attribs['changeset']= int(node_attribs['changeset'])
node_attribs['lon']= float(node_attribs['lon'])
node_attribs['lat']= float(node_attribs['lat'])
for tag in element.iter("tag"):
tag_clean ={}
if PROBLEMCHARS.search(tag.attrib['k']) == None:
node_tagss = tags_clean(node_attribs['id'] , tag )
tags.append(node_tagss)
#tag_clean = clean(tag)
#if tag_clean:
# tags.append(tag_clean)
if node_attribs:
return {'node': node_attribs, 'node_tags': tags}
else:
return None
elif element.tag == 'way':
for way in WAY_FIELDS:
way_attribs[way] = element.attrib[way]
way_attribs['id']= int(way_attribs['id'])
way_attribs['uid']= int(way_attribs['uid'])
way_attribs['changeset']= int(way_attribs['changeset'])
for tag in element.iter("tag"):
tag_clean ={}
if PROBLEMCHARS.search(tag.attrib['k']) == None:
node_tagss = tags_clean(way_attribs['id'] , tag )
tags.append(node_tagss)
#tag_clean = clean(tag)
#if tag_clean:
# tags.append(tag_clean)
count =0
for nodes in element.iter("nd"):
wnd = {}
wnd['id'] = int(way_attribs['id'])
wnd['node_id'] = int(nodes.attrib['ref'])
wnd['position'] = count
count += 1
way_nodes.append(wnd)
if way_attribs:
return {'way': way_attribs, 'way_nodes': way_nodes, 'way_tags': tags}
else:
return None
# ================================================== #
# Helper Functions #
# ================================================== #
def get_element(osm_file, tags=('node', 'way', 'relation')):
"""Yield element if it is the right type of tag"""
context = ET.iterparse(osm_file, events=('start', 'end'))
_, root = next(context)
for event, elem in context:
if event == 'end' and elem.tag in tags:
yield elem
root.clear()
def validate_element(element, validator, schema=SCHEMA):
"""Raise ValidationError if element does not match schema"""
if validator.validate(element, schema) is not True:
field, errors = next(validator.errors.iteritems())
message_string = "\nElement of type '{0}' has the following errors:\n{1}"
error_strings = (
"{0}: {1}".format(k, v if isinstance(v, str) else ", ".join(v))
for k, v in errors.iteritems()
)
raise cerberus.ValidationError(
message_string.format(field, "\n".join(error_strings))
)
class UnicodeDictWriter(csv.DictWriter, object):
"""Extend csv.DictWriter to handle Unicode input"""
def writerow(self, row):
super(UnicodeDictWriter, self).writerow({
k: (v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in row.iteritems()
})
def writerows(self, rows):
for row in rows:
self.writerow(row)
# ================================================== #
# Main Function #
# ================================================== #
def process_map(file_in, validate):
"""Iteratively process each XML element and write to csv(s)"""
with codecs.open(NODES_PATH, 'w') as nodes_file, \
codecs.open(NODE_TAGS_PATH, 'w') as nodes_tags_file, \
codecs.open(WAYS_PATH, 'w') as ways_file, \
codecs.open(WAY_NODES_PATH, 'w') as way_nodes_file, \
codecs.open(WAY_TAGS_PATH, 'w') as way_tags_file:
nodes_writer = UnicodeDictWriter(nodes_file, NODE_FIELDS)
node_tags_writer = UnicodeDictWriter(nodes_tags_file, NODE_TAGS_FIELDS)
ways_writer = UnicodeDictWriter(ways_file, WAY_FIELDS)
way_nodes_writer = UnicodeDictWriter(way_nodes_file, WAY_NODES_FIELDS)
way_tags_writer = UnicodeDictWriter(way_tags_file, WAY_TAGS_FIELDS)
nodes_writer.writeheader()
node_tags_writer.writeheader()
ways_writer.writeheader()
way_nodes_writer.writeheader()
way_tags_writer.writeheader()
validator = cerberus.Validator()
for element in get_element(file_in, tags=('node', 'way')):
el = shape_element(element)
if el:
if validate is True:
validate_element(el, validator)
if element.tag == 'node':
nodes_writer.writerow(el['node'])
node_tags_writer.writerows(el['node_tags'])
elif element.tag == 'way':
ways_writer.writerow(el['way'])
way_nodes_writer.writerows(el['way_nodes'])
way_tags_writer.writerows(el['way_tags'])
if __name__ == '__main__':
# Note: Validation is ~ 10X slower. For the project consider using a small
# sample of the map when validating.
process_map(OSM_PATH, validate=False)
创建las-vegas_nevada.db数据库,将各csv文件导入数据库中
二、探索数据
以下部分是对拉斯维加斯数据集的探索
文件大小
- las-vegas_nevada.osm : 215 MB
- las-vegas_nevada.db :120MB
- nodes.csv : 84.2 MB
- nodes_tags.csv : 2.28 MB
- ways.csv : 6.46 MB
- ways_nodes.csv : 29.7 MB
- ways_tags.csv : 14.3 MB
node节点数量
sqlite> SELECT count(*)
...> FROM nodes;
1048101
道路数量
sqlite> SELECT count(*)
...> FROM ways;
112116
用户(去重)数量
sqlite> SELECT count(distinct(user.uid))
...> FROM (SELECT uid
...> FROM nodes
...> UNION all
...> SELECT uid
...> FROM ways ) user;
1024
前10名贡献用户
sqlite> SELECT user.user ,count(*)
...> FROM ( SELECT user
...> FROM nodes
...> UNION all
...> SELECT user
...> FROM ways) user
...> GROUP BY user.user
...> ORDER BY count(*)
...> DESC
...> LIMIT 10 ;
alimamo|251497
tomthepom|121172
woodpeck_fixbot|70826
alecdhuse|66528
abellao|55639
gMitchellD|44635
robgeb|41070
nmixter|40093
TheDutchMan13|39304
Tom_Holland|33379
只提交过一次的用户数量
sqlite> SELECT count(*)
...> FROM
...> (SELECT e.user, count(*)
...> FROM ( SELECT user
...> FROM nodes
...> UNION all
...> SELECT user
...> FROM ways ) e
...> GROUP BY e.user
...> HAVING count(*)=1) ;
244
三、其他数据探索
十大便利设施
sqlite> SELECT value, count(*)
...> FROM nodes_tags
...> WHERE key ='amenity'
...> GROUP BY value
...> ORDER BY count(*) DESC
...> LIMIT 10 ;
restaurant|460
place_of_worship|294
fuel|282
fast_food|267
fountain|266
school|208
shelter|122
toilets|85
cafe|78
bar|72
翻译成汉语:
餐馆| 460
礼拜场所| 294
燃料| 282
快餐| 267
喷泉|266
学校|208
住所|122
厕所|85
网吧|78
酒吧|72
果然在哪都是‘民以食为天’,餐馆第一,快餐第四;
好奇心,究竟赌场等场所有多少?
把上面的查询去掉限制前10,发现有:
casino|10
nightclub|10
theatre|10
cinema|8
stripclub|3
即:
赌场|10
牙医|10
夜总会|10
影院|10
电影|8
脱衣舞俱乐部|3
十大美食
sqlite> SELECT nodes_tags.value, count(*)
...> FROM nodes_tags
...> JOIN (SELECT DISTINCT(id)
...> FROM nodes_tags
...> WHERE value= 'restaurant') i
...> ON nodes_tags.id = i.id
...> WHERE nodes_tags.key = 'cuisine'
...> GROUP BY nodes_tags.value
...> ORDER BY count(*) DESC
...> LIMIT 10;
mexican|41
pizza|31
american|20
italian|16
steak_house|16
burger|14
chinese|12
japanese|9
asian|8
buffet|7
墨西哥餐厅、披萨、美国餐厅占前三,此处墨西哥餐厅偏多;
四、关于数据集的其他想法
改进或分析数据的额外建议,实施改进的益处及一些预期的问题。
- 建议:现在的地图数据是不完整的,看到很多用户只提交了一次,建议应该想办法增加用户的活跃度,使用用户激励手段,比如积分,贡献排行榜,游戏化任务,勋章,高贡献用户特权等;
- 好处:增加数据量,丰富地图数据;
- 问题:过程较为复杂,需要人员、资金充足;
建议:openstreetmap组织进行数据清洗和更新,可以官方组织工程师进行,也可以通过游戏化的方式发布由用户进行;
好处:数据会较为干净,用户使用更加精确;地图相关工作者使用地图时可以不用或者少量进行数据清洗,使用更高效;
问题:需要资金,工程师团队,对于非盈利性组织来说,有很大难度;
- 建议:贡献量较少的用户较容易提供不规范的信息,我们可以先从贡献度少的用户开始清理;
- 好处:会提高数据的一致性和准确性;
- 问题:降低新用户的积极性;
- 建议:数据的格式不统一,可以规范一下数据输入的格式;
- 好处:可以提高数据的规范性;
- 坏处:规则不够灵活,难以适应不同地区的情况;
- 建议:由于本数据并不全面,我们可以通过外部的API,比如高德地图、谷歌地图、PokemonGo等,完善和校准本地图中的数据;
- 好处:可以填充信息,校准数据;
- 问题:数据可能重复度大,或者数据格式不同导入困难等
而且涉及商业合作,互为竞争对手,合作难度估计较大;