The shop data comes from the Tianchi Koubei merchant customer-flow forecasting competition; only a subset of the data is used here.
https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100066.333.10.W7qorD&raceId=231591
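Meaning of each field in “shop_payNum_new.csv”:
Field        Sample      Description
pay_num      1           Customer flow (number of people who paid with Alipay at the shop each day)
shop_id      1           Shop ID, corresponding to shop_info
time_stamp   2016/1/1    Payment time
cate_2_name  fast food   Second-level category name of the shop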
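b) Experiment requirements:
Load the data:
import pandas as pd
import matplotlib.pyplot as plt
# Use the first column as the index and parse it as dates;
# the code below relies on the index being a DatetimeIndex.
shop_payNum_new = pd.read_csv('dataset/shop_payNum_new.csv', parse_dates=True, index_col=0)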
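The load step can be checked on a few in-memory rows before running it on the real file. This is only a sketch: the rows below are made up, ISO-style dates are used for simplicity, and the column order (time_stamp first) is an assumption implied by index_col=0 rather than something the dataset description confirms.

```python
import io
import pandas as pd

# Made-up rows in the ASSUMED column order (time_stamp first); not real data.
sample_csv = io.StringIO(
    "time_stamp,pay_num,shop_id,cate_2_name\n"
    "2016-10-01,12,1,fast food\n"
    "2016-10-02,15,1,fast food\n"
    "2016-11-01,9,2,convenience store\n"
)

sample = pd.read_csv(sample_csv, parse_dates=True, index_col=0)

# With parse_dates=True the index column is parsed into a DatetimeIndex,
# so attributes such as .month and .day can be used for filtering.
october = sample[sample.index.month == 10]
print(type(sample.index).__name__)
print(len(october))
```

If the index does not come back as a DatetimeIndex on the real file, the month and day filters used in the tasks below will not work, so this is worth checking first.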
Code:
# Keep only convenience stores, then only the October rows
convenience_store = shop_payNum_new[shop_payNum_new['cate_2_name'] == 'convenience store']
Oct_payNum = convenience_store[convenience_store.index.month == 10]
# Draw one line chart of daily customer flow per shop
shop_id = Oct_payNum['shop_id'].drop_duplicates(keep='first', inplace=False)
for i in shop_id:
    tmp = Oct_payNum[Oct_payNum['shop_id'] == i]
    tmp.plot(y='pay_num', title='convenience store and shop_id = ' + str(i), kind='line')
    plt.xlabel('day')
    plt.show()
Approach:
Code:
cate_2_name = shop_payNum_new['cate_2_name'].drop_duplicates(keep='first', inplace=False)
for i in cate_2_name:
    tmp = shop_payNum_new[shop_payNum_new['cate_2_name'] == i]
    Oct_payNum = tmp[tmp.index.month == 10]
    # Average pay_num only; the text column cate_2_name cannot be averaged
    day_payNum = Oct_payNum[['pay_num']].groupby(Oct_payNum.index.day).mean()
    day_payNum.plot(y='pay_num', title=str(i) + ' pay_num', kind='line')
    plt.xlabel('day')
    plt.show()
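The per-day averaging in the loop above can be illustrated on a tiny hand-made table. The values and shop ids below are hypothetical and only show how grouping by the day of the month averages across shops in one category.

```python
import pandas as pd

# Hypothetical data: two shops in one category over two October days.
idx = pd.to_datetime(["2016-10-01", "2016-10-01", "2016-10-02", "2016-10-02"])
demo = pd.DataFrame(
    {
        "pay_num": [10, 20, 30, 50],
        "shop_id": [1, 2, 1, 2],
        "cate_2_name": ["fast food"] * 4,
    },
    index=idx,
)

# Select the numeric column first, then average over shops for each day of month.
daily_mean = demo["pay_num"].groupby(demo.index.day).mean()
print(daily_mean.to_dict())
```

Selecting pay_num before aggregating keeps the text column out of the mean, which recent pandas versions would otherwise reject.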
Approach:
Code:
fast_food = shop_payNum_new[shop_payNum_new['cate_2_name'] == 'fast food']
shop = fast_food[fast_food['shop_id'] == 196]
# Total pay_num per calendar month for shop 196
month_payNum = shop[['pay_num']].groupby(shop.index.month).sum()
month_payNum.plot(y='pay_num', title='fast food && shop id = 196', kind='bar')
plt.xlabel('month')
plt.show()
Approach:
Code:
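# `shop` is the fast food shop with id 196 selected in the previous task
Oct_payNum = shop[shop.index.month == 10]
# '%w' gives the weekday as a string, '0' (Sunday) through '6' (Saturday)
week_payNum = Oct_payNum[['pay_num']].groupby(Oct_payNum.index.strftime('%w')).mean()
week_payNum.plot(y='pay_num', title='fast food week pay_num', kind='bar')
plt.xlabel('week')
plt.show()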
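Because the weekday labels on the bar chart depend on the numbering convention, a small check on made-up counts may help when reading the x axis. The sketch below compares the '%w' labels used above with the DatetimeIndex.dayofweek attribute, which numbers the days differently; the counts are hypothetical.

```python
import pandas as pd

# Hypothetical counts for one week starting on Sunday, 2016-10-02.
idx = pd.date_range("2016-10-02", periods=7, freq="D")
demo = pd.Series([5, 1, 2, 3, 4, 6, 7], index=idx, name="pay_num")

# '%w' labels are strings: '0' is Sunday, '1' is Monday, ..., '6' is Saturday.
by_w = demo.groupby(demo.index.strftime("%w")).mean()

# dayofweek labels are integers: 0 is Monday, ..., 6 is Sunday.
by_dayofweek = demo.groupby(demo.index.dayofweek).mean()

print(by_w.to_dict())
print(by_dayofweek.to_dict())
```

With '%w' the chart starts on Sunday; switching to dayofweek would start it on Monday without changing the averages.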
Approach:
Code:
shop.plot(y = 'pay_num', kind='hist', title= 'fast food && shop id = 196', alpha = 0.5)
plt.show()
Approach:
Code:
shop.plot(y = 'pay_num', title = 'fast food && shop id = 196', kind = 'kde')
plt.show()
Approach:
Code:
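# All categories, October only (this rebinds the name `shop` used in earlier tasks)
shop = shop_payNum_new[shop_payNum_new.index.month == 10]
# Share of October customer flow contributed by each second-level category
shop_rate = shop.groupby('cate_2_name')[['pay_num']].sum() / shop['pay_num'].sum()
shop_rate['pay_num'].plot(kind='pie', autopct='%.2f')
plt.title("Oct")
plt.show()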
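The share computation behind the pie chart can be verified by hand on a few made-up rows. The numbers below are hypothetical; the point is only that each category total is divided by the overall total, so the shares sum to 1.

```python
import pandas as pd

# Hypothetical October rows for two categories.
demo = pd.DataFrame(
    {
        "pay_num": [30, 10, 60],
        "cate_2_name": ["fast food", "fast food", "convenience store"],
    }
)

# Category total divided by the grand total gives each category's share.
share = demo.groupby("cate_2_name")["pay_num"].sum() / demo["pay_num"].sum()
print(share.to_dict())
```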
Approach:
Meaning of the first 9 fields in “pima.csv”:
(1) Number of times pregnant
(2) Plasma glucose concentration a 2 hours in an oral glucose tolerance test
(3) Diastolic blood pressure (mm Hg)
(4) Triceps skin fold thickness (mm)
(5) 2-Hour serum insulin (mu U/ml)
(6) Body mass index (weight in kg/(height in m)^2)
(7) Diabetes pedigree function
(8) Age (years)
(9) Class variable (0 or 1)
b) Experiment requirements:
Load the data:
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import andrews_curves
pima = pd.read_csv('dataset/pima.csv', header = None)
Code:
pima.columns = ['Number of times pregnant',
                'Plasma glucose concentration a 2 hours in an oral glucose tolerance test',
                'Diastolic blood pressure (mm Hg)', 'Triceps skin fold thickness (mm)',
                '2-Hour serum insulin (mu U/ml)', 'Body mass index', 'Diabetes pedigree function',
                'Age (years)', 'Class variable']
PregnantNum = pima[['Age (years)', 'Number of times pregnant']]
means = PregnantNum['Number of times pregnant'].mean()
# Red: at or above the mean number of pregnancies; blue: below the mean
ax = PregnantNum[PregnantNum['Number of times pregnant'] >= means].plot(
    x='Age (years)', y='Number of times pregnant', kind='scatter', c='red',
    ax=None, label='Number of times pregnant >= mean')
# Draw the second group on the same axes
PregnantNum[PregnantNum['Number of times pregnant'] < means].plot(
    x='Age (years)', y='Number of times pregnant', kind='scatter', c='blue',
    ax=ax, label='Number of times pregnant < mean')
plt.show()
Approach:
Code:
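# Color each point by class: 1 is green, 0 is red
color = {1: 'green', 0: 'red'}
# scatter_matrix is called from pandas.plotting; the top-level pd.scatter_matrix
# is not available in current pandas releases
pd.plotting.scatter_matrix(pima.iloc[:, [1, 2, 3]], figsize=(9, 9), diagonal='kde', marker='o',
                           s=40, alpha=0.4, c=pima['Class variable'].apply(lambda x: color[x]))
plt.show()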
Approach:
Code:
andrews_curves(pima, 'Class variable', color=['b', 'green'])
plt.show()
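For reading the Andrews curves plot, it may help to see how a single row becomes a curve. The sketch below uses the usual textbook definition, f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., and assumes the plotting function follows the same form; the helper name andrews_curve and the sample row are hypothetical and not part of pandas.

```python
import math


def andrews_curve(row, t):
    """Evaluate the Andrews curve of one observation at angle t (radians)."""
    # The first feature is the constant term, scaled by 1/sqrt(2).
    value = row[0] / math.sqrt(2.0)
    # The remaining features alternate between sin and cos terms,
    # with the frequency increasing by one after each sin/cos pair.
    for k, x in enumerate(row[1:], start=1):
        harmonic = (k + 1) // 2
        if k % 2 == 1:
            value += x * math.sin(harmonic * t)
        else:
            value += x * math.cos(harmonic * t)
    return value


# Hypothetical three-feature row evaluated at a few angles in [-pi, pi].
row = [1.0, 2.0, 3.0]
points = [andrews_curve(row, t) for t in (-math.pi, 0.0, math.pi)]
print(points)
```

Rows with similar feature values give curves that stay close together, which is why the plot is drawn with one color per class.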
Approach: