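Variable descriptions
Variable | Meaning | Notes |
---|---|---|
ReceiptID | Receipt number | |
Value | Payment amount | |
pmethod | Payment method | 1 = cash, 2 = credit card, 3 = electronic payment, 4 = other |
sex | Sex | 1 = male, 2 = female |
homeown | Home ownership | 1 = yes, 2 = no, 3 = unknown |
income | Income | |
age | Age | |
Others | Other columns | Quantity purchased of each product category |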
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Use a font with Chinese glyphs so that Chinese text renders in plots
plt.rcParams["font.sans-serif"] = ["SimHei"]
# Keep the minus sign rendering correctly with that font
plt.rcParams["axes.unicode_minus"] = False
# Load the data (sheet_name=1 selects the second worksheet)
data = pd.read_excel('C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/Simulated Coles Data.xlsx',sheet_name=1)
data.info()
RangeIndex: 58000 entries, 0 to 57999
Data columns (total 50 columns):
ReceiptID 58000 non-null int64
Value 58000 non-null float64
pmethod 58000 non-null int64
sex 58000 non-null int64
homeown 58000 non-null int64
income 57999 non-null float64
age 57999 non-null float64
PostCode 48208 non-null object
nchildren 57998 non-null float64
fruit 58000 non-null object
freshmeat 58000 non-null int64
dairy 58000 non-null int64
MozerallaCheese 58000 non-null int64
cannedveg 57999 non-null float64
cereal 57991 non-null float64
frozenmeal 58000 non-null int64
frozendessert 58000 non-null int64
PizzaBase 57999 non-null float64
TomatoSauce 58000 non-null int64
frozen fish 58000 non-null int64
bread 58000 non-null int64
milk 57999 non-null float64
softdrink 58000 non-null int64
fruitjuice 58000 non-null int64
confectionery 57999 non-null float64
fish 58000 non-null int64
vegetables 58000 non-null int64
icecream 58000 non-null int64
energydrink 58000 non-null int64
tea 58000 non-null int64
coffee 58000 non-null int64
laundryPowder 58000 non-null int64
householCleaners 58000 non-null int64
corn chips 58000 non-null int64
Frozen yogurt 58000 non-null int64
Chocolate 58000 non-null int64
Olive Oil 58000 non-null int64
Baby Food 58000 non-null int64
Napies 58000 non-null int64
banana 58000 non-null int64
cat food 58000 non-null int64
dog food 58000 non-null int64
mince 58000 non-null int64
sunflower Oil 58000 non-null int64
chicken 58000 non-null int64
vitamins 58000 non-null int64
deodorants 58000 non-null int64
dishwashingliquid 58000 non-null int64
onions 58000 non-null int64
lettuce 58000 non-null int64
dtypes: float64(9), int64(39), object(2)
memory usage: 22.1+ MB
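Conclusion
PostCode has by far the most missing values (9,792 of 58,000 rows, about 17%) and contributes little to the later analysis, so the column is dropped.
# Output file path
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Drop the PostCode column
data.drop(['PostCode'], axis=1, inplace=True)
# Write the result to the output file
data.to_excel(outputfile)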
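The decision to drop PostCode rests on its share of missing values. A minimal sketch of how that share could be computed per column is shown here on a toy frame; the values and the 40% cut-off are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names echo the dataset
toy = pd.DataFrame({
    "PostCode": ["2000", None, None, "3000"],
    "income": [50000.0, np.nan, 62000.0, 58000.0],
    "age": [34.0, 41.0, 29.0, 55.0],
})

# Fraction of missing entries per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cut-off: drop columns missing in more than 40% of rows
threshold = 0.4
to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)
```

Whether a column is worth keeping also depends on how it is used later, so a threshold like this is only a starting point.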
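Impute income with the mode
# Output file path
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the most frequent value
data['income'] = data['income'].fillna(data['income'].mode()[0])
# Write the result to the output file
data.to_excel(outputfile)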
Impute age with the mean
# Output file path
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the column mean
data['age'] = data['age'].fillna(data['age'].mean())
# Write the result to the output file
data.to_excel(outputfile)
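Impute nchildren with the previous value (forward fill)
# Output file path
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill each missing value with the last valid value above it
# (ffill() replaces the deprecated fillna(method='pad'))
data['nchildren'] = data['nchildren'].ffill()
# Write the result to the output file
data.to_excel(outputfile)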
Impute cannedveg with the mode
# Output file path
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the most frequent value
data['cannedveg'] = data['cannedveg'].fillna(data['cannedveg'].mode()[0])
# Write the result to the output file
data.to_excel(outputfile)
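Impute cereal with the previous value (forward fill)
# Output file path
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill each missing value with the last valid value above it
# (ffill() replaces the deprecated fillna(method='pad'))
data['cereal'] = data['cereal'].ffill()
# Write the result to the output file
data.to_excel(outputfile)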
Impute PizzaBase with the mode
# Output file path
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the most frequent value
data['PizzaBase'] = data['PizzaBase'].fillna(data['PizzaBase'].mode()[0])
# Write the result to the output file
data.to_excel(outputfile)
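Impute milk with the mode
# Output file path
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the most frequent value
data['milk'] = data['milk'].fillna(data['milk'].mode()[0])
# Write the result to the output file
data.to_excel(outputfile)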
Impute confectionery with the mode
# Output file path
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the most frequent value
data['confectionery'] = data['confectionery'].fillna(data['confectionery'].mode()[0])
# Write the result to the output file
data.to_excel(outputfile)
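The three imputation strategies used in the steps above (mode, mean, and previous value) can be compared side by side on a small example. This is only a sketch on a toy frame with made-up values, not the dataset itself.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names echo the dataset
toy = pd.DataFrame({
    "income": [50000.0, np.nan, 50000.0, 62000.0],
    "age": [30.0, 40.0, np.nan, 50.0],
    "nchildren": [2.0, np.nan, 0.0, 1.0],
})

filled = toy.copy()
# Mode: the most frequent value (mode() returns a Series, so take element 0)
filled["income"] = filled["income"].fillna(filled["income"].mode()[0])
# Mean: the average of the non-missing values
filled["age"] = filled["age"].fillna(filled["age"].mean())
# Forward fill: copy the last valid value above the gap
filled["nchildren"] = filled["nchildren"].ffill()
```

Forward fill depends on row order and leaves a gap in the very first row unfilled, so it suits ordered data better than independent receipts.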
# Reload the dataset with the missing values filled in
data = pd.read_excel('C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls')
data.info()
RangeIndex: 58000 entries, 0 to 57999
Data columns (total 50 columns):
Unnamed: 0 58000 non-null int64
ReceiptID 58000 non-null int64
Value 58000 non-null float64
pmethod 58000 non-null int64
sex 58000 non-null int64
homeown 58000 non-null int64
income 58000 non-null float64
age 58000 non-null float64
nchildren 58000 non-null int64
fruit 58000 non-null object
freshmeat 58000 non-null int64
dairy 58000 non-null int64
MozerallaCheese 58000 non-null int64
cannedveg 58000 non-null int64
cereal 58000 non-null int64
frozenmeal 58000 non-null int64
frozendessert 58000 non-null int64
PizzaBase 58000 non-null int64
TomatoSauce 58000 non-null int64
frozen fish 58000 non-null int64
bread 58000 non-null int64
milk 58000 non-null int64
softdrink 58000 non-null int64
fruitjuice 58000 non-null int64
confectionery 58000 non-null int64
fish 58000 non-null int64
vegetables 58000 non-null int64
icecream 58000 non-null int64
energydrink 58000 non-null int64
tea 58000 non-null int64
coffee 58000 non-null int64
laundryPowder 58000 non-null int64
householCleaners 58000 non-null int64
corn chips 58000 non-null int64
Frozen yogurt 58000 non-null int64
Chocolate 58000 non-null int64
Olive Oil 58000 non-null int64
Baby Food 58000 non-null int64
Napies 58000 non-null int64
banana 58000 non-null int64
cat food 58000 non-null int64
dog food 58000 non-null int64
mince 58000 non-null int64
sunflower Oil 58000 non-null int64
chicken 58000 non-null int64
vitamins 58000 non-null int64
deodorants 58000 non-null int64
dishwashingliquid 58000 non-null int64
onions 58000 non-null int64
lettuce 58000 non-null int64
dtypes: float64(3), int64(46), object(1)
memory usage: 22.1+ MB
data.describe()
| | Unnamed: 0 | ReceiptID | Value | pmethod | sex | homeown | income | age | nchildren | freshmeat | ... | cat food | dog food | mince | sunflower Oil | chicken | vitamins | deodorants | dishwashingliquid | onions | lettuce |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | ... | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 | 58000.000000 |
mean | 28999.500000 | 629000.500000 | 77.224161 | 2.411672 | 1.597828 | 1.301379 | 74851.023481 | 39.695309 | 1.255259 | 0.185948 | ... | 0.147448 | 0.196362 | 0.250655 | 0.271276 | 0.650897 | 0.201845 | 0.251552 | 0.351345 | 0.221931 | 0.743172 |
std | 16743.302143 | 16743.302143 | 57.524626 | 0.883101 | 0.490341 | 0.510445 | 23939.928881 | 11.566665 | 1.058244 | 0.389068 | ... | 0.354555 | 0.397249 | 0.433394 | 0.444622 | 0.476691 | 0.401380 | 0.433909 | 0.477395 | 0.415549 | 0.436887 |
min | 0.000000 | 600001.000000 | 0.929691 | 1.000000 | 1.000000 | 1.000000 | 6000.230000 | 10.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 14499.750000 | 614500.750000 | 29.504596 | 2.000000 | 1.000000 | 1.000000 | 65589.607777 | 33.752039 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 28999.500000 | 629000.500000 | 63.506446 | 2.000000 | 2.000000 | 1.000000 | 70169.178875 | 37.775684 | 1.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
75% | 43499.250000 | 643500.250000 | 116.000000 | 3.000000 | 2.000000 | 2.000000 | 75364.282452 | 42.204893 | 2.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 |
max | 57999.000000 | 658000.000000 | 1967.696760 | 4.000000 | 2.000000 | 3.000000 | 650235.250000 | 95.000000 | 11.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 49 columns
Descriptive statistics
Convert the data to a DataFrame
inputfile= 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
outputfile= 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
import numpy as np
from pandas import DataFrame,Series
from sklearn.cluster import KMeans
from sklearn.cluster import Birch
df=DataFrame(data)
df.head()
| | Unnamed: 0 | ReceiptID | Value | pmethod | sex | homeown | income | age | nchildren | fruit | ... | cat food | dog food | mince | sunflower Oil | chicken | vitamins | deodorants | dishwashingliquid | onions | lettuce |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 600001 | 78.0 | 2 | 2 | 2 | 83167.0 | 72.0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
1 | 1 | 600002 | 120.0 | 3 | 2 | 1 | 15151.0 | 78.0 | 3 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
2 | 2 | 600003 | 198.0 | 3 | 1 | 1 | 21367.0 | 42.0 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
3 | 3 | 600004 | 190.0 | 2 | 1 | 1 | 56528.0 | 59.0 | 3 | 0 | ... | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
4 | 4 | 600005 | 173.0 | 3 | 1 | 1 | 43373.0 | 39.0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
5 rows × 50 columns
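Customer segmentation on annual income, age, and transaction amount
# Build a sub-table of these three variables, indexed by receipt
# (each ReceiptID is unique, so the default mean aggregation keeps the raw values)
avi = df.pivot_table(index='ReceiptID',
                     values=['age','income','Value'])
avi.head()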
ReceiptID | Value | age | income |
---|---|---|---|
600001 | 78.0 | 72.0 | 83167.0 |
600002 | 120.0 | 78.0 | 15151.0 |
600003 | 198.0 | 42.0 | 21367.0 |
600004 | 190.0 | 59.0 | 56528.0 |
600005 | 173.0 | 39.0 | 43373.0 |
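# Function that assigns a customer-value label to one row
def avi_func(x):
    # x holds the row's deviations from the column means:
    # flag '1' if above the mean, '0' otherwise
    level = x.apply(lambda v: '1' if v > 0 else '0')
    # Concatenate the Value flag and the income flag, e.g. '10'
    level = level.Value + level.income
    d = {"11": "High-value customer",
         "01": "High-potential-value customer",
         "10": "Medium-value customer",
         "00": "General-value customer"}
    return d[level]
# Centre each column on its mean, then label every row
avi['label'] = avi.apply(lambda x: x - x.mean()).apply(avi_func, axis=1)
avi.head()
ReceiptID | Value | age | income | label |
---|---|---|---|---|
600001 | 78.0 | 72.0 | 83167.0 | High-value customer |
600002 | 120.0 | 78.0 | 15151.0 | Medium-value customer |
600003 | 198.0 | 42.0 | 21367.0 | Medium-value customer |
600004 | 190.0 | 59.0 | 56528.0 | Medium-value customer |
600005 | 173.0 | 39.0 | 43373.0 | Medium-value customer |
# Visualize high-value customers (green) against all others (red)
import matplotlib.pyplot as plt
import seaborn as sns
avi.loc[avi.label == 'High-value customer', 'Color'] = 'g'
avi.loc[avi.label != 'High-value customer', 'Color'] = 'r'
avi.plot.scatter('age', 'income', c=avi.Color, figsize=(10, 8))
plt.title('Distribution of high-value customers')
plt.show()
# Visualize high-potential-value customers (green) against all others (red)
import matplotlib.pyplot as plt
import seaborn as sns
avi.loc[avi.label == 'High-potential-value customer', 'Color'] = 'g'
avi.loc[avi.label != 'High-potential-value customer', 'Color'] = 'r'
avi.plot.scatter('age', 'income', c=avi.Color, figsize=(10, 8))
plt.title('Distribution of high-potential-value customers')
plt.show()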
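High-potential-value customers are mostly aged 20 to 40 with annual incomes between 10k and 20k, so the supermarket could run promotions aimed at this group.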
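The four-way labelling above compares Value and income with their column means. As an alternative to the row-wise function, the same rule can be sketched with vectorized comparisons and `numpy.select`. The toy values and the English segment names below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Value/age/income sub-table (values are made up)
toy = pd.DataFrame(
    {"Value": [78.0, 120.0, 30.0, 20.0],
     "age": [72.0, 78.0, 35.0, 28.0],
     "income": [83167.0, 15151.0, 90000.0, 20000.0]},
    index=pd.Index([600001, 600002, 600003, 600004], name="ReceiptID"),
)

# Boolean flags: is the row above the column mean?
high_value = toy["Value"] > toy["Value"].mean()
high_income = toy["income"] > toy["income"].mean()

# The first matching condition wins; rows matching none get the default
toy["segment"] = np.select(
    [high_value & high_income, ~high_value & high_income, high_value & ~high_income],
    ["high value", "high potential value", "medium value"],
    default="general value",
)
```

This avoids calling a Python function once per row, which tends to matter on tables with tens of thousands of receipts.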
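Group the products into the following categories
Meat (M)
Grains and oils (G)
Fruit and vegetables (V)
Beverages (D)
Snacks (S)
Other food (O)
Household goods (L)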
# Sum the item columns that belong to the same category
df['M']=df.freshmeat+df.dairy+df['MozerallaCheese']+df.cannedveg+df.cereal+df.frozenmeal+df.frozendessert+df.fish+df['frozen fish']+df.chicken +df.mince
df['G']=df.bread+df.PizzaBase+df.TomatoSauce+df['Olive Oil']+df['sunflower Oil']
df['V']=df.vegetables+df.banana+df.onions+df.lettuce
df['D']=df.milk+df.softdrink+df.fruitjuice+df.energydrink +df['Frozen yogurt']+df.tea+df.coffee
df['S']=df.icecream+df.confectionery+df['corn chips']+df.Chocolate
df['O']=df.vitamins+df['Baby Food']+df['cat food']+df['dog food']
df['L']=df.Napies+df.deodorants+df.dishwashingliquid+df.laundryPowder+df.householCleaners
# Build the per-receipt spending sub-table
goods = df.pivot_table(index='ReceiptID',
                       values=['Value','M','G','V','D','S','O','L'])
goods.head()
ReceiptID | D | G | L | M | O | S | V | Value |
---|---|---|---|---|---|---|---|---|
600001 | 4 | 2 | 2 | 2 | 1 | 2 | 4 | 78.0 |
600002 | 1 | 3 | 2 | 7 | 0 | 1 | 1 | 120.0 |
600003 | 3 | 2 | 1 | 4 | 1 | 1 | 3 | 198.0 |
600004 | 3 | 1 | 2 | 3 | 2 | 0 | 2 | 190.0 |
600005 | 4 | 1 | 1 | 5 | 0 | 1 | 0 | 173.0 |
# Append a row holding the column totals
goods.loc['Total'] = goods.sum()
goods
ReceiptID | D | G | L | M | O | S | V | Value |
---|---|---|---|---|---|---|---|---|
600001 | 4.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 4.0 | 7.800000e+01 |
600002 | 1.0 | 3.0 | 2.0 | 7.0 | 0.0 | 1.0 | 1.0 | 1.200000e+02 |
600003 | 3.0 | 2.0 | 1.0 | 4.0 | 1.0 | 1.0 | 3.0 | 1.980000e+02 |
600004 | 3.0 | 1.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 | 1.900000e+02 |
600005 | 4.0 | 1.0 | 1.0 | 5.0 | 0.0 | 1.0 | 0.0 | 1.730000e+02 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
657997 | 1.0 | 2.0 | 1.0 | 5.0 | 2.0 | 0.0 | 3.0 | 1.940000e+02 |
657998 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.940000e+02 |
657999 | 3.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 1.420000e+02 |
658000 | 4.0 | 1.0 | 2.0 | 4.0 | 0.0 | 2.0 | 3.0 | 7.200000e+01 |
Total | 137158.0 | 140957.0 | 101269.0 | 209125.0 | 59680.0 | 70548.0 | 133975.0 | 4.479001e+06 |
58001 rows × 8 columns
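Meat (M) is the most purchased category, with 209,125 items in total, so the supermarket should keep these products well stocked. Note that M also aggregates the most item columns (11), which may partly explain its lead.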
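Category totals depend on how many item columns each category combines. The sketch below, on made-up data with a shortened and purely illustrative category map, computes the totals from a dictionary and also an average per item column for a fairer comparison.

```python
import pandas as pd

# Toy one-hot purchases (made-up values); column names follow the dataset
toy = pd.DataFrame({
    "chicken": [1, 0, 1],
    "mince": [1, 1, 0],
    "bread": [1, 1, 1],
    "Olive Oil": [0, 1, 0],
    "banana": [0, 0, 1],
})

# Hypothetical, shortened category map; the analysis itself uses many more columns
categories = {
    "M": ["chicken", "mince"],
    "G": ["bread", "Olive Oil"],
    "V": ["banana"],
}

# Row-wise sum over each category's columns gives items bought per receipt
per_receipt = pd.DataFrame(
    {code: toy[cols].sum(axis=1) for code, cols in categories.items()}
)
totals = per_receipt.sum()

# Average per item column, so categories with more columns are not favoured
n_columns = pd.Series({code: len(cols) for code, cols in categories.items()})
per_item = totals / n_columns
```

Keeping the category map in one dictionary also makes it easier to review which product was assigned to which category.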
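import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# apriori expects a one-hot table (0/1 or boolean values), so `data` is assumed
# here to contain only the product columns in that form
frequent_itemsets = apriori(data, min_support=0.3, use_colnames=True)
frequent_itemsets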
| | support | itemsets |
---|---|---|
0 | 0.483372 | (Baby Food) |
1 | 0.446236 | (Chocolate) |
2 | 0.466486 | (Frozen yogurt) |
3 | 0.444063 | (Olive Oil) |
4 | 0.466952 | (PizzaBase) |
... | ... | ... |
242 | 0.385125 | (cereal, bread, milk, lettuce) |
243 | 0.306058 | (cereal, bread, milk, vegetables) |
244 | 0.326049 | (bread, milk, lettuce, chicken) |
245 | 0.323065 | (bread, fruit, milk, lettuce) |
246 | 0.302211 | (cereal, milk, lettuce, chicken) |
247 rows × 2 columns
By support, the frequent single items include Baby Food (0.483), PizzaBase (0.467), Frozen yogurt (0.466), and others.
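Support is the share of receipts that contain an item or itemset. On a one-hot table this is simply a column mean, as the following sketch on made-up baskets illustrates.

```python
import pandas as pd

# Toy basket table (made-up values): one row per receipt, 1 = item bought
toy = pd.DataFrame({
    "bread": [1, 1, 1, 0],
    "milk": [1, 1, 0, 0],
    "lettuce": [1, 0, 1, 1],
})
baskets = toy.astype(bool)

# Support of a single item = share of receipts that contain it
item_support = baskets.mean()

# Support of an itemset = share of receipts that contain every item in it
pair_support = (baskets["bread"] & baskets["milk"]).mean()
```

An itemset is called frequent when its support reaches the chosen minimum, which is 0.3 in the analysis above.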
# Derive association rules, keeping those with lift of at least 1
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(100)
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
---|---|---|---|---|---|---|---|---|---|
0 | (Baby Food) | (banana) | 0.483372 | 0.759711 | 0.376483 | 0.778868 | 1.025216 | 0.009260 | 1.086632 |
1 | (banana) | (Baby Food) | 0.759711 | 0.483372 | 0.376483 | 0.495561 | 1.025216 | 0.009260 | 1.024163 |
2 | (bread) | (Baby Food) | 0.827394 | 0.483372 | 0.407117 | 0.492047 | 1.017946 | 0.007177 | 1.017077 |
3 | (Baby Food) | (bread) | 0.483372 | 0.827394 | 0.407117 | 0.842242 | 1.017946 | 0.007177 | 1.094121 |
4 | (cereal) | (Baby Food) | 0.762471 | 0.483372 | 0.376294 | 0.493519 | 1.020991 | 0.007736 | 1.020033 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | (laundryPowder) | (bread) | 0.494377 | 0.827394 | 0.411688 | 0.832740 | 1.006461 | 0.002643 | 1.031963 |
96 | (bread) | (vegetables) | 0.827394 | 0.585156 | 0.488650 | 0.590590 | 1.009286 | 0.004496 | 1.013272 |
97 | (vegetables) | (bread) | 0.585156 | 0.827394 | 0.488650 | 0.835077 | 1.009286 | 0.004496 | 1.046587 |
98 | (cereal) | (chicken) | 0.762471 | 0.650907 | 0.496516 | 0.651193 | 1.000439 | 0.000218 | 1.000820 |
99 | (chicken) | (cereal) | 0.650907 | 0.762471 | 0.496516 | 0.762806 | 1.000439 | 0.000218 | 1.001413 |
100 rows × 9 columns
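The metric columns in the rules table follow from three supports using the standard definitions. The helper below is a hypothetical function written for illustration (it is not part of mlxtend); applied to the supports of row 0, (Baby Food) → (banana), it should agree with that row up to rounding.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Compute rule metrics from the supports of A, C, and A together with C."""
    confidence = support_ac / support_a               # P(C | A)
    lift = confidence / support_c                     # P(C | A) / P(C)
    leverage = support_ac - support_a * support_c     # P(A, C) - P(A) * P(C)
    conviction = (1 - support_c) / (1 - confidence)   # undefined if confidence == 1
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports taken from row 0 of the rules table
m = rule_metrics(support_a=0.483372, support_c=0.759711, support_ac=0.376483)
```

A lift close to 1, as in the rows shown, means the two sides of the rule occur together about as often as they would if they were independent.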
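Summary
The supermarket's current customers are mainly relatively affluent middle-aged women, and its sales are dominated by food, led by meat. There is still room to grow in the middle-class market, so more promotions could be aimed at that group.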