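The data comes from Problem C of the 14th 认证杯 (Certification Cup) contest. Download link: http://www.tzmcm.cn/shiti.html
First, we have an Excel file whose contents look like this:
Then we start processing the data, beginning with the usual first step in Python: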
import pandas as pd
df = pd.read_csv('data.csv')
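Next, we notice that the data contains many locations where no shared bikes are parked, and these are still recorded as entries of their own, so we need to remove them.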
df = pd.read_csv('data1.csv')
df = df[df['total_cars'] > 0]
print(df)
Printing the result shows 564371 rows of valid data in total.
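To make the filtering step concrete, here is a minimal sketch on made-up data. Only the column name total_cars comes from the article; the station column and all values are invented for illustration.

```python
import pandas as pd

# Made-up sample: only the column name total_cars follows the article,
# the station labels and counts are invented.
sample = pd.DataFrame({
    "station": ["A", "B", "C", "D"],
    "total_cars": [3, 0, 7, 0],
})

# A boolean mask keeps only the rows where the condition is True.
mask = sample["total_cars"] > 0
filtered = sample[mask]

# Comparing lengths before and after shows how many rows were dropped.
rows_removed = len(sample) - len(filtered)
print(filtered)
print("rows kept:", len(filtered), "rows removed:", rows_removed)
```

Checking the row count before and after, as done here, is a quick way to confirm the filter removed what was expected.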
Next, we can group and aggregate the data by latitude/longitude, by time, or by vehicle ID.
First we group by latitude and plot the result:
The code is:
cars_by_lat=df.groupby("latitude").agg({'total_cars':'sum'})
#print(cars_by_lat)
plt.plot(cars_by_lat,color='k',alpha=0.5)
plt.show()
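It looks pretty good, but there is a big problem that shows up as soon as you print the result. Here is the printed output:
We can see that the header now spans two rows. This happens because groupby turns latitude into the index instead of keeping it as a regular column. We need to flatten the header back into one row with reset_index. The correct code is: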
cars_by_lat=df.groupby("latitude").agg({'total_cars':'sum'}).reset_index()
print(cars_by_lat)
plt.plot(cars_by_lat['latitude'],cars_by_lat['total_cars'],color='k',alpha=0.5)
plt.show()
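The difference between the two versions can be seen on a tiny made-up table. This is only a sketch: the values are invented, and the column names follow the article.

```python
import pandas as pd

# Made-up sample with two rows sharing the same latitude.
sample = pd.DataFrame({
    "latitude": [32.05, 32.05, 32.10],
    "total_cars": [2, 3, 4],
})

# After groupby + agg, latitude becomes the index, not a column.
grouped = sample.groupby("latitude").agg({"total_cars": "sum"})
print(grouped.columns.tolist())  # latitude is missing from the columns
print(grouped.index.name)        # it lives in the index instead

# reset_index moves latitude back into a regular column.
flat = grouped.reset_index()
print(flat.columns.tolist())
print(flat)
```

With latitude back as a column, it can be passed explicitly as the x-axis, as in the corrected plotting call.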
Grouping by time:
cars_by_time = df.groupby('timestamp').agg({'total_cars': 'sum'}).reset_index()
cars_by_time['timestamp'] = cars_by_time['timestamp'].apply(pd.Timestamp)
#print(cars_by_time)
#cars_by_time.plot(color='k',alpha=0.5)
cars_by_time.set_index('timestamp').sort_index().rolling('60min').mean().plot(color='deepskyblue')
plt.show()
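The rolling('60min').mean() call smooths the series with a time-based window. A minimal sketch on made-up timestamps shows what it computes; the values are invented, and the window behaviour described in the comments is the pandas default (each window covers the 60 minutes up to and including the current row).

```python
import pandas as pd

# Made-up, unevenly spaced observations.
ts = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 00:20",
        "2021-01-01 00:40", "2021-01-01 01:10",
    ]),
    "total_cars": [10, 20, 30, 40],
})

# A time-based window needs a sorted DatetimeIndex.
indexed = ts.set_index("timestamp").sort_index()

# Each value is the mean of all rows within the preceding 60 minutes,
# including the current row. At 01:10 the 00:00 row has left the window.
smoothed = indexed.rolling("60min").mean()
print(smoothed)
```

Because the window is defined in time rather than in number of rows, it still works when the sampling interval is irregular.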
Next, we compute statistics for each time period.
We first need to process the timestamps and convert them to local time:
timestamps = pd.DatetimeIndex(cars_by_time['timestamp'])
print(timestamps)
timestamps = timestamps.tz_convert('Asia/Jerusalem')
print(timestamps)
cars_by_time['local_time'] = timestamps
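tz_convert changes the time zone, that is, it converts the timestamps to local time (the timestamps must already be timezone-aware for this to work). We then add a new column and store the local time in it.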
cars_by_time['weekday'] = cars_by_time['local_time'].dt.day_name()
cars_by_time['hour'] = cars_by_time['local_time'].dt.hour
This code derives two columns from the local time: the day of the week, returned as a name, and the hour of the day. The result is:
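Since the actual output depends on the dataset, here is a small self-contained sketch of the same conversion on made-up timestamps. The timestamps are invented and assumed to be in UTC; only the time zone name and the column names follow the article.

```python
import pandas as pd

# Made-up timestamps, parsed as timezone-aware UTC values.
utc = pd.to_datetime(
    ["2021-07-02 22:30:00", "2021-07-03 09:00:00"], utc=True
)

# tz_convert keeps the same instant but expresses it in local time.
local = pd.DatetimeIndex(utc).tz_convert("Asia/Jerusalem")

demo = pd.DataFrame({"local_time": local})
demo["weekday"] = demo["local_time"].dt.day_name()
demo["hour"] = demo["local_time"].dt.hour
print(demo)
```

The first timestamp falls on a Friday evening in UTC but on Saturday in local time, which is why the conversion has to happen before extracting the weekday and hour. If the timestamps were timezone-naive, tz_localize would be needed first, since tz_convert raises an error on naive values.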
Now we can plot. This chart is a bit more involved, and I could not find a way to adjust the y-axis with plain plt, so we use Seaborn, a higher-level Python plotting library.
For usage details, see this blog post, which explains things thoroughly: https://blog.csdn.net/qq_40195360/article/details/86605860?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522162752697816780366571121%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=162752697816780366571121&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~top_positive~default-1-86605860.first_rank_v2_pc_rank_v29&utm_term=seaborn&spm=1018.2226.3001.4187
Code:
cars_by_time['usage_rate'] = (45 - cars_by_time['total_cars']) / 45
sns.barplot(x='hour',y='usage_rate',data=cars_by_time,alpha=0.5)
plt.show()
sns.barplot(x='weekday', y='usage_rate', data=cars_by_time,alpha=0.3)
plt.show()
Resulting plots:
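By default, sns.barplot draws the mean of y for each x category. The sketch below computes those same per-hour means with plain pandas on made-up data, so the bar heights can be checked numerically. The constant 45 is taken from the article's formula; treating it as a total capacity is an assumption, and the name CAPACITY is hypothetical.

```python
import pandas as pd

# 45 comes from the article's formula; its meaning as a total
# capacity is an assumption, and the name CAPACITY is hypothetical.
CAPACITY = 45

# Made-up observations: two per hour.
demo = pd.DataFrame({
    "hour": [8, 8, 9, 9],
    "total_cars": [45, 27, 9, 18],
})

# Share of bikes not parked, same formula as in the article.
demo["usage_rate"] = (CAPACITY - demo["total_cars"]) / CAPACITY

# Per-hour mean, which is what the bar heights represent by default.
mean_rate_by_hour = demo.groupby("hour")["usage_rate"].mean()
print(mean_rate_by_hour)
```

Printing the grouped means alongside the chart is a simple cross-check that the plot shows what is intended.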
Full code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#sns.set_style("whitegrid")
#warnings.filterwarnings("ignore")
df = pd.read_csv('data1.csv')
df = df[df['total_cars'] > 0]
#print(df)
#print(df.info())
# Total number of bikes parked at each latitude
cars_by_lat=df.groupby("latitude").agg({'total_cars':'sum'}).reset_index()
#plt.plot(cars_by_lat['latitude'],cars_by_lat['total_cars'],color='k',alpha=0.5)
#plt.show()
# Total number of bikes parked across all stations at each timestamp
cars_by_time = df.groupby('timestamp').agg({'total_cars': 'sum'}).reset_index()
#print(cars_by_time)
cars_by_time['timestamp'] = cars_by_time['timestamp'].apply(pd.Timestamp)
#plt.style.use("fivethirtyeight")
#print(cars_by_time)
#cars_by_time.plot(color='k',alpha=0.5)
#cars_by_time.set_index('timestamp').sort_index().rolling('60min').mean().plot(color='deepskyblue')
#plt.show()
a=pd.DataFrame(cars_by_time.set_index('timestamp').sort_index().rolling('60min').mean())
#print(a.describe())
timestamps = pd.DatetimeIndex(cars_by_time['timestamp'])
print(timestamps)
timestamps = timestamps.tz_convert('Asia/Jerusalem')
print(timestamps)
cars_by_time['local_time'] = timestamps
cars_by_time['weekday'] = cars_by_time['local_time'].dt.day_name()
print(cars_by_time['weekday'])
cars_by_time['hour'] = cars_by_time['local_time'].dt.hour
print(cars_by_time['hour'])
cars_by_time['rate'] = (45 - cars_by_time['total_cars']) / 45
sns.barplot(x='hour',y='rate',data=cars_by_time,alpha=0.5)
plt.show()
sns.barplot(x='weekday', y='rate', data=cars_by_time,alpha=0.3)
plt.show()