对于我自己来说,该需求源自于分析Movielens-1m数据集的用户数据:
UserID::Gender::Age::Occupation::Zip-code
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
6::F::50::9::55117
我希望根据Zip-code计算出用户所在的州,然后在地图上显示每个州的用户数量。
那么应该这样写代码:
import pandas as pd
import geopandas as gpd
import plotly.express as px
from uszipcode import SearchEngine
# 创建 SearchEngine 实例
search = SearchEngine()
# 读取用户数据集
data = pd.read_csv('./users.dat', sep='::', engine='python',
names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'])
data = data.dropna(subset=['Zip-code'])
def get_state_name(zipcode):
result = search.by_zipcode(zipcode)
if result is None:
return None
else:
state_abbr = result.state
return state_abbr
data['STATE_ABBR'] = data['Zip-code'].apply(get_state_name)
# 计算每个Zip-code的用户数量
zip_counts = data['STATE_ABBR'].value_counts()
zip_counts_df = zip_counts.reset_index() # 将Series转换为DataFrame
zip_counts_df.columns = ['STATE_ABBR', 'COUNT'] # 重新命名列
# 读取美国地图的shapefile
usa_map = gpd.read_file('./shapefile/USA_States.shp')
# 将Zip-code数据与地图数据进行合并
# 简称合并使用STATE_ABBR,全称合并使用STATE_NAME
zip_geo = pd.merge(usa_map, zip_counts_df, on='STATE_ABBR')
# 绘制地图
fig = px.choropleth(zip_geo,
locations='STATE_ABBR',
locationmode='USA-states',
color='COUNT',
scope='usa',
hover_data=['COUNT'],
color_continuous_scale='Reds',
range_color=(0, zip_geo['COUNT'].max()),
labels={'STATE_ABBR': 'User Count'})
fig.update_layout(title_text='Movielens User Distribution by State')
fig.show()
在上面的代码中,USA_States.shp
可以在efrainmaps(https://www.efrainmaps.es/english-version/free-downloads/united-states/)下载。
效果如下,鼠标悬停到某个州,可以显示出州名称和对应的用户数量:
如果不希望显示州简称,可以创建州的简称与全称的映射,然后将Zip-code映射到州的全称,再显示地图:
# 创建州的简称与全称的映射
# 该映射字典涵盖了50个州、哥伦比亚特区、5个美国领土以及3个军邮邮编简称。
state_name_dict = {
"AL": "Alabama",
"AK": "Alaska",
"AZ": "Arizona",
"AR": "Arkansas",
"CA": "California",
"CO": "Colorado",
"CT": "Connecticut",
"DE": "Delaware",
"FL": "Florida",
"GA": "Georgia",
"HI": "Hawaii",
"ID": "Idaho",
"IL": "Illinois",
"IN": "Indiana",
"IA": "Iowa",
"KS": "Kansas",
"KY": "Kentucky",
"LA": "Louisiana",
"ME": "Maine",
"MD": "Maryland",
"MA": "Massachusetts",
"MI": "Michigan",
"MN": "Minnesota",
"MS": "Mississippi",
"MO": "Missouri",
"MT": "Montana",
"NE": "Nebraska",
"NV": "Nevada",
"NH": "New Hampshire",
"NJ": "New Jersey",
"NM": "New Mexico",
"NY": "New York",
"NC": "North Carolina",
"ND": "North Dakota",
"OH": "Ohio",
"OK": "Oklahoma",
"OR": "Oregon",
"PA": "Pennsylvania",
"RI": "Rhode Island",
"SC": "South Carolina",
"SD": "South Dakota",
"TN": "Tennessee",
"TX": "Texas",
"UT": "Utah",
"VT": "Vermont",
"VA": "Virginia",
"WA": "Washington",
"WV": "West Virginia",
"WI": "Wisconsin",
"WY": "Wyoming",
"DC": "District of Columbia",
"AS": "American Samoa",
"GU": "Guam",
"MP": "Northern Mariana Islands",
"PR": "Puerto Rico",
"UM": "United States Minor Outlying Islands",
"VI": "Virgin Islands",
"AA": "Armed Forces Americas",
"AE": "Armed Forces Europe",
"AP": "Armed Forces Pacific"
}
def get_state_name(zipcode):
result = search.by_zipcode(zipcode)
if result is None:
return None
else:
state_abbr = result.state
state_name = state_name_dict.get(state_abbr, None)
return state_name
data['STATE_NAME'] = data['Zip-code'].apply(get_state_name)
# 后续代码同上,注意要将STATE_ABBR替换为STATE_NAME