[Python Data Mining] Market Basket Analysis

Market Basket Analysis

Variable descriptions

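Variable    Meaning           Notes
ReceiptID   Receipt number
Value       Amount paid
pmethod     Payment method    1 = cash, 2 = credit card, 3 = electronic payment, 4 = other
sex         Sex               1 = male, 2 = female
homeown     Home ownership    1 = yes, 2 = no, 3 = unknown
income      Income
age         Age
Others      Others            Quantity purchased of each product type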

Data import

# Jupyter magic: render figures inline in the notebook
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Chinese-language plotting setup: use the SimHei font and render minus signs correctly
plt.rcParams["font.sans-serif"] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False

# Load the data from the second sheet of the workbook
data = pd.read_excel('C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/Simulated Coles Data.xlsx', sheet_name=1)
data.info()

RangeIndex: 58000 entries, 0 to 57999
Data columns (total 50 columns):
ReceiptID            58000 non-null int64
Value                58000 non-null float64
pmethod              58000 non-null int64
sex                  58000 non-null int64
homeown              58000 non-null int64
income               57999 non-null float64
age                  57999 non-null float64
PostCode             48208 non-null object
nchildren            57998 non-null float64
fruit                58000 non-null object
freshmeat            58000 non-null int64
dairy                58000 non-null int64
MozerallaCheese      58000 non-null int64
cannedveg            57999 non-null float64
cereal               57991 non-null float64
frozenmeal           58000 non-null int64
frozendessert        58000 non-null int64
PizzaBase            57999 non-null float64
TomatoSauce          58000 non-null int64
frozen fish          58000 non-null int64
bread                58000 non-null int64
milk                 57999 non-null float64
softdrink            58000 non-null int64
fruitjuice           58000 non-null int64
confectionery        57999 non-null float64
fish                 58000 non-null int64
vegetables           58000 non-null int64
icecream             58000 non-null int64
energydrink          58000 non-null int64
tea                  58000 non-null int64
coffee               58000 non-null int64
laundryPowder        58000 non-null int64
householCleaners     58000 non-null int64
corn chips           58000 non-null int64
Frozen yogurt        58000 non-null int64
Chocolate            58000 non-null int64
Olive Oil            58000 non-null int64
Baby Food            58000 non-null int64
Napies               58000 non-null int64
banana               58000 non-null int64
cat food             58000 non-null int64
dog food             58000 non-null int64
mince                58000 non-null int64
sunflower Oil        58000 non-null int64
chicken              58000 non-null int64
vitamins             58000 non-null int64
deodorants           58000 non-null int64
dishwashingliquid    58000 non-null int64
onions               58000 non-null int64
lettuce              58000 non-null int64
dtypes: float64(9), int64(39), object(2)
memory usage: 22.1+ MB

Conclusion

  • There are 58,000 observations in total, and some columns have missing values
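
To see which columns are incomplete and by how much, the non-null counts reported by `info()` can be turned into per-column missing counts. Here is a minimal sketch on a small made-up frame; the `toy` data and its values are assumptions for illustration and are not taken from the Coles data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, 58000.0],
    "age": [34.0, 41.0, np.nan, 29.0],
    "PostCode": ["2000", None, None, "3000"],
    "bread": [1, 0, 1, 1],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Running the same two lines on `data` would list the incomplete columns with the largest gap first.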

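Data preprocessing

PostCode has many missing values (only 48,208 of the 58,000 rows are populated) and contributes little to the later analysis, so the column is dropped outright.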

# Output file name
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Drop the PostCode column
data.drop(['PostCode'], axis=1, inplace=True)
# Write to the output file
data.to_excel(outputfile)

Impute income with the mode

# Output file name
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the most frequent value
data['income'] = data['income'].fillna(data['income'].mode()[0])
# Write to the output file
data.to_excel(outputfile)

Impute age with the mean

# Output file name
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the column mean
data['age'] = data['age'].fillna(data['age'].mean())
# Write to the output file
data.to_excel(outputfile)

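Impute nchildren with the previous value (forward fill)

# Output file name
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill each gap with the value from the previous row; ffill() replaces the deprecated fillna(method='pad')
data['nchildren'] = data['nchildren'].ffill()
# Write to the output file
data.to_excel(outputfile)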

Impute cannedveg with the mode

# Output file name
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the most frequent value
data['cannedveg'] = data['cannedveg'].fillna(data['cannedveg'].mode()[0])
# Write to the output file
data.to_excel(outputfile)

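Impute cereal with the previous value (forward fill)

# Output file name
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill each gap with the value from the previous row; ffill() replaces the deprecated fillna(method='pad')
data['cereal'] = data['cereal'].ffill()
# Write to the output file
data.to_excel(outputfile)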
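
The imputation steps in this section follow three patterns: mode, mean, and forward fill. The sketch below shows all three side by side on a small made-up frame, so the effect of each is easy to check by eye. The `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy columns with gaps; values are illustrative only.
toy = pd.DataFrame({
    "income": [50000.0, 50000.0, np.nan, 72000.0],
    "age": [30.0, np.nan, 40.0, 50.0],
    "nchildren": [2.0, np.nan, 0.0, np.nan],
})

# Mode imputation: most frequent value (the first one if there are ties).
toy["income"] = toy["income"].fillna(toy["income"].mode()[0])
# Mean imputation: average of the observed values.
toy["age"] = toy["age"].fillna(toy["age"].mean())
# Forward fill: copy the previous row's value into the gap.
toy["nchildren"] = toy["nchildren"].ffill()

print(toy)
```

Note that a forward fill cannot fill a gap in the very first row, since there is no previous value to copy.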

Impute PizzaBase with the mode

# Output file name
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the most frequent value
data['PizzaBase'] = data['PizzaBase'].fillna(data['PizzaBase'].mode()[0])
# Write to the output file
data.to_excel(outputfile)

Impute milk with the mode

# Output file name
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the most frequent value
data['milk'] = data['milk'].fillna(data['milk'].mode()[0])
# Write to the output file
data.to_excel(outputfile)

Impute confectionery with the mode

# Output file name
outputfile = 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
# Fill missing values with the most frequent value
data['confectionery'] = data['confectionery'].fillna(data['confectionery'].mode()[0])
# Write to the output file
data.to_excel(outputfile)
# Reload the saved data, now with the missing values filled in
data = pd.read_excel('C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls')
data.info()

RangeIndex: 58000 entries, 0 to 57999
Data columns (total 50 columns):
Unnamed: 0           58000 non-null int64
ReceiptID            58000 non-null int64
Value                58000 non-null float64
pmethod              58000 non-null int64
sex                  58000 non-null int64
homeown              58000 non-null int64
income               58000 non-null float64
age                  58000 non-null float64
nchildren            58000 non-null int64
fruit                58000 non-null object
freshmeat            58000 non-null int64
dairy                58000 non-null int64
MozerallaCheese      58000 non-null int64
cannedveg            58000 non-null int64
cereal               58000 non-null int64
frozenmeal           58000 non-null int64
frozendessert        58000 non-null int64
PizzaBase            58000 non-null int64
TomatoSauce          58000 non-null int64
frozen fish          58000 non-null int64
bread                58000 non-null int64
milk                 58000 non-null int64
softdrink            58000 non-null int64
fruitjuice           58000 non-null int64
confectionery        58000 non-null int64
fish                 58000 non-null int64
vegetables           58000 non-null int64
icecream             58000 non-null int64
energydrink          58000 non-null int64
tea                  58000 non-null int64
coffee               58000 non-null int64
laundryPowder        58000 non-null int64
householCleaners     58000 non-null int64
corn chips           58000 non-null int64
Frozen yogurt        58000 non-null int64
Chocolate            58000 non-null int64
Olive Oil            58000 non-null int64
Baby Food            58000 non-null int64
Napies               58000 non-null int64
banana               58000 non-null int64
cat food             58000 non-null int64
dog food             58000 non-null int64
mince                58000 non-null int64
sunflower Oil        58000 non-null int64
chicken              58000 non-null int64
vitamins             58000 non-null int64
deodorants           58000 non-null int64
dishwashingliquid    58000 non-null int64
onions               58000 non-null int64
lettuce              58000 non-null int64
dtypes: float64(3), int64(46), object(1)
memory usage: 22.1+ MB
data.describe()
Unnamed: 0 ReceiptID Value pmethod sex homeown income age nchildren freshmeat ... cat food dog food mince sunflower Oil chicken vitamins deodorants dishwashingliquid onions lettuce
count 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000 ... 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000 58000.000000
mean 28999.500000 629000.500000 77.224161 2.411672 1.597828 1.301379 74851.023481 39.695309 1.255259 0.185948 ... 0.147448 0.196362 0.250655 0.271276 0.650897 0.201845 0.251552 0.351345 0.221931 0.743172
std 16743.302143 16743.302143 57.524626 0.883101 0.490341 0.510445 23939.928881 11.566665 1.058244 0.389068 ... 0.354555 0.397249 0.433394 0.444622 0.476691 0.401380 0.433909 0.477395 0.415549 0.436887
min 0.000000 600001.000000 0.929691 1.000000 1.000000 1.000000 6000.230000 10.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 14499.750000 614500.750000 29.504596 2.000000 1.000000 1.000000 65589.607777 33.752039 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 28999.500000 629000.500000 63.506446 2.000000 2.000000 1.000000 70169.178875 37.775684 1.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000
75% 43499.250000 643500.250000 116.000000 3.000000 2.000000 2.000000 75364.282452 42.204893 2.000000 0.000000 ... 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000 1.000000 1.000000 0.000000 1.000000
max 57999.000000 658000.000000 1967.696760 4.000000 2.000000 3.000000 650235.250000 95.000000 11.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 49 columns

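Descriptive statistics

  • The mean of sex is about 1.60, above 1.5 (1 = male, 2 = female), so female customers account for more transactions than male customers
  • The mean customer age is about 39.7, and the middle 50% of customers are between roughly 34 and 42, so middle-aged customers make up a large share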
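
Because sex only takes the values 1 and 2 in this data (the minimum is 1 and the maximum is 2), its mean translates directly into the share of female customers. A short worked calculation, using the mean from the `describe()` output above:

```python
# sex is coded 1 = male, 2 = female, so mean = 1 * p_male + 2 * p_female.
# With p_male + p_female = 1 this gives p_female = mean - 1.
mean_sex = 1.597828  # mean of sex from the describe() output

p_female = mean_sex - 1
p_male = 1 - p_female
print(f"female share: {p_female:.1%}, male share: {p_male:.1%}")
```

So roughly 60% of the receipts belong to female customers and 40% to male customers.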

Customer analysis

Convert the data to a DataFrame

inputfile= 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
outputfile= 'C:/Users/QYH123/Music/iTunes/作业/Python/数据挖掘/期中作业/data.xls'
import numpy as np
from pandas import DataFrame,Series
from sklearn.cluster import KMeans
from sklearn.cluster import Birch
df=DataFrame(data)
df.head()
Unnamed: 0 ReceiptID Value pmethod sex homeown income age nchildren fruit ... cat food dog food mince sunflower Oil chicken vitamins deodorants dishwashingliquid onions lettuce
0 0 600001 78.0 2 2 2 83167.0 72.0 0 0 ... 0 0 0 0 1 1 0 1 1 1
1 1 600002 120.0 3 2 1 15151.0 78.0 3 0 ... 0 0 1 1 0 0 1 0 0 1
2 2 600003 198.0 3 1 1 21367.0 42.0 0 1 ... 0 1 0 1 1 0 0 0 1 1
3 3 600004 190.0 2 1 1 56528.0 59.0 3 0 ... 1 1 0 1 1 0 0 1 0 1
4 4 600005 173.0 3 1 1 43373.0 39.0 2 0 ... 0 0 0 0 1 0 0 1 0 0

5 rows × 50 columns

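Group customers by annual income, age, and transaction value (a rule-based split around the column means, not a fitted clustering model)

# Build a sub-table of these three variables, one row per receipt
avi = df.pivot_table(index='ReceiptID',
                    values=['age','income','Value'])
avi.head()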
Value age income
ReceiptID
600001 78.0 72.0 83167.0
600002 120.0 78.0 15151.0
600003 198.0 42.0 21367.0
600004 190.0 59.0 56528.0
600005 173.0 39.0 43373.0
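# Function that assigns a customer-value label to one row
def avi_func(x):
    # x holds the row's deviations from the column means: '1' = above the mean, '0' = not above
    level = x.apply(lambda v: '1' if v > 0 else '0')
    # Two-character key: first digit for Value, second digit for income
    level = level.Value + level.income
    d = {"11": "high-value customer",
         "01": "high-potential customer",
         "10": "medium-value customer",
         "00": "ordinary customer"}
    return d[level]
# Center every column on its mean, then label each row
avi['label'] = avi.apply(lambda x: x - x.mean()).apply(avi_func, axis=1)
avi.head()
Value age income label
ReceiptID
600001 78.0 72.0 83167.0 high-value customer
600002 120.0 78.0 15151.0 medium-value customer
600003 198.0 42.0 21367.0 medium-value customer
600004 190.0 59.0 56528.0 medium-value customer
600005 173.0 39.0 43373.0 medium-value customer
# Plot the high-value customers (green) against everyone else (red)
avi.loc[avi['label'] == 'high-value customer', 'Color'] = 'g'
avi.loc[avi['label'] != 'high-value customer', 'Color'] = 'r'
avi.plot.scatter('age', 'income', c=avi.Color, figsize=(10, 8))
plt.title('Distribution of high-value customers')
plt.show()

[Figure 1: scatter plot of age against income, high-value customers in green]

# Plot the high-potential customers (green) against everyone else (red)
avi.loc[avi['label'] == 'high-potential customer', 'Color'] = 'g'
avi.loc[avi['label'] != 'high-potential customer', 'Color'] = 'r'
avi.plot.scatter('age', 'income', c=avi.Color, figsize=(10, 8))
plt.title('Distribution of high-potential customers')
plt.show()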

[Figure 2: scatter plot of age against income, high-potential customers in green]

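High-potential customers are mostly aged 20 to 40, and the supermarket could run promotions aimed at this group. The original reading of the plot puts their annual income at 10k to 20k, but by the labelling rule these customers all have an income above the mean (about 74,851), so that range is probably a misreading of the axis scale.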
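
The labelling rule used in this section compares Value and income with their column means and maps the four combinations to four segments. The same rule can be written without a row-wise `apply`, which is usually faster on 58,000 rows. The sketch below is one possible vectorized version on made-up data; the `toy` frame, the `segment` column, and the English segment names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy customers; values are illustrative only.
toy = pd.DataFrame(
    {"Value": [40.0, 150.0, 30.0, 200.0],
     "income": [90000.0, 95000.0, 20000.0, 25000.0]},
    index=[1, 2, 3, 4],
)

# Compare each column with its own mean, as the row-wise function does.
high_value = toy["Value"] > toy["Value"].mean()
high_income = toy["income"] > toy["income"].mean()

# np.select picks the first matching condition for every row.
toy["segment"] = np.select(
    [high_value & high_income, ~high_value & high_income, high_value & ~high_income],
    ["high-value customer", "high-potential customer", "medium-value customer"],
    default="ordinary customer",
)
print(toy)
```

Written this way, it is easy to see that a high-potential customer is one who spends less than average while earning more than average.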

Product analysis

Products are grouped into the following categories

Meat (M)

  • freshmeat
  • dairy
  • MozerallaCheese
  • cannedveg
  • cereal
  • frozenmeal
  • frozendessert
  • fish
  • frozen fish
  • chicken
  • mince

Grains and oils (G)

  • bread
  • PizzaBase
  • TomatoSauce
  • Olive Oil
  • sunflower Oil

Fruit and vegetables (V)

  • vegetables
  • banana
  • fruit
  • onions
  • lettuce

Drinks (D)

  • milk
  • softdrink
  • fruitjuice
  • Frozen yogurt
  • energydrink
  • tea
  • coffee

Snacks (S)

  • icecream
  • confectionery
  • corn chips
  • Chocolate

Other food (O)

  • Baby Food
  • cat food
  • dog food
  • vitamins

Household goods (L)

  • Napies
  • deodorants
  • dishwashingliquid
  • laundryPowder
  • householCleaners
# Sum the item columns that belong to the same category
df['M']=df.freshmeat+df.dairy+df['MozerallaCheese']+df.cannedveg+df.cereal+df.frozenmeal+df.frozendessert+df.fish+df['frozen fish']+df.chicken +df.mince
df['G']=df.bread+df.PizzaBase+df.TomatoSauce+df['Olive Oil']+df['sunflower Oil']
# fruit is listed under V but is stored as object dtype, so it is left out of this sum
df['V']=df.vegetables+df.banana+df.onions+df.lettuce
df['D']=df.milk+df.softdrink+df.fruitjuice+df.energydrink +df['Frozen yogurt']+df.tea+df.coffee
df['S']=df.icecream+df.confectionery+df['corn chips']+df.Chocolate
df['O']=df.vitamins+df['Baby Food']+df['cat food']+df['dog food']
df['L']=df.Napies+df.deodorants+df.dishwashingliquid+df.laundryPowder+df.householCleaners
# Build a per-receipt table of category counts and amount paid
goods = df.pivot_table(index='ReceiptID',
                    values=['Value','M','G','V','D','S','O','L'])
goods.head()
D G L M O S V Value
ReceiptID
600001 4 2 2 2 1 2 4 78.0
600002 1 3 2 7 0 1 1 120.0
600003 3 2 1 4 1 1 3 198.0
600004 3 1 2 3 2 0 2 190.0
600005 4 1 1 5 0 1 0 173.0
# Append a row holding the column totals
goods.loc['Total']=goods.sum()
goods
D G L M O S V Value
ReceiptID
600001 4.0 2.0 2.0 2.0 1.0 2.0 4.0 7.800000e+01
600002 1.0 3.0 2.0 7.0 0.0 1.0 1.0 1.200000e+02
600003 3.0 2.0 1.0 4.0 1.0 1.0 3.0 1.980000e+02
600004 3.0 1.0 2.0 3.0 2.0 0.0 2.0 1.900000e+02
600005 4.0 1.0 1.0 5.0 0.0 1.0 0.0 1.730000e+02
... ... ... ... ... ... ... ... ...
657997 1.0 2.0 1.0 5.0 2.0 0.0 3.0 1.940000e+02
657998 2.0 1.0 2.0 2.0 1.0 1.0 0.0 1.940000e+02
657999 3.0 2.0 1.0 1.0 1.0 0.0 3.0 1.420000e+02
658000 4.0 1.0 2.0 4.0 0.0 2.0 3.0 7.200000e+01
Total 137158.0 140957.0 101269.0 209125.0 59680.0 70548.0 133975.0 4.479001e+06

58001 rows × 8 columns

The meat category (M) has the highest purchase count (209,125), so the supermarket should keep these products well stocked.
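
The category totals can also be built from a category-to-columns mapping instead of one long sum per category, which makes the grouping easier to audit and change. This is a minimal sketch on made-up data; the `toy` frame, the shortened `categories` mapping, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy one-hot purchases; item names are a subset of the real columns, values are made up.
toy = pd.DataFrame({
    "freshmeat": [1, 1, 1],
    "chicken": [1, 1, 0],
    "bread": [1, 1, 1],
    "Olive Oil": [0, 1, 0],
    "milk": [1, 0, 0],
})

# Category -> item columns (only a few items per category for brevity).
categories = {
    "M": ["freshmeat", "chicken"],
    "G": ["bread", "Olive Oil"],
    "D": ["milk"],
}

# Sum the item columns of each category row by row, then total per category.
per_receipt = pd.DataFrame({cat: toy[cols].sum(axis=1) for cat, cols in categories.items()})
totals = per_receipt.sum().sort_values(ascending=False)
print(totals)
```

Sorting the totals puts the best-selling category first, which is the comparison the conclusion above relies on.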

Association analysis

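import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# apriori expects a one-hot (0/1 or boolean) table of items only, so drop the receipt and
# customer columns and coerce the rest to booleans (an assumed cleaning step)
non_items = ['Unnamed: 0', 'ReceiptID', 'Value', 'pmethod', 'sex', 'homeown', 'income', 'age', 'nchildren']
basket = data.drop(columns=non_items).apply(pd.to_numeric, errors='coerce').fillna(0) > 0
frequent_itemsets = apriori(basket, min_support=0.3, use_colnames=True)
frequent_itemsets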
support itemsets
0 0.483372 (Baby Food)
1 0.446236 (Chocolate)
2 0.466486 (Frozen yogurt)
3 0.444063 (Olive Oil)
4 0.466952 (PizzaBase)
... ... ...
242 0.385125 (cereal, bread, milk, lettuce)
243 0.306058 (cereal, bread, milk, vegetables)
244 0.326049 (bread, milk, lettuce, chicken)
245 0.323065 (bread, fruit, milk, lettuce)
246 0.302211 (cereal, milk, lettuce, chicken)

247 rows × 2 columns

Judging by support, the frequent single items include Baby Food (0.483), PizzaBase (0.467), and Frozen yogurt (0.466), among others.
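
Support is simply the fraction of baskets that contain every item of an itemset. The sketch below counts it by brute force for all 1-item and 2-item sets on a few made-up baskets. It illustrates the definition only and does not include the candidate pruning that makes Apriori efficient. The `baskets` data and the `support` helper are assumptions for illustration.

```python
from itertools import combinations

# Toy baskets; contents are made up for illustration.
baskets = [
    {"bread", "milk", "cereal"},
    {"bread", "milk"},
    {"bread", "lettuce"},
    {"milk", "cereal"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

items = sorted(set().union(*baskets))
min_support = 0.5

# Enumerate all 1- and 2-item sets and keep the frequent ones.
frequent = {
    combo: support(combo, baskets)
    for size in (1, 2)
    for combo in combinations(items, size)
    if support(combo, baskets) >= min_support
}
for combo, s in frequent.items():
    print(combo, s)
```

With a threshold of 0.5, lettuce drops out because it appears in only one of the four baskets.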

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head(100)
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
0 (Baby Food) (banana) 0.483372 0.759711 0.376483 0.778868 1.025216 0.009260 1.086632
1 (banana) (Baby Food) 0.759711 0.483372 0.376483 0.495561 1.025216 0.009260 1.024163
2 (bread) (Baby Food) 0.827394 0.483372 0.407117 0.492047 1.017946 0.007177 1.017077
3 (Baby Food) (bread) 0.483372 0.827394 0.407117 0.842242 1.017946 0.007177 1.094121
4 (cereal) (Baby Food) 0.762471 0.483372 0.376294 0.493519 1.020991 0.007736 1.020033
... ... ... ... ... ... ... ... ... ...
95 (laundryPowder) (bread) 0.494377 0.827394 0.411688 0.832740 1.006461 0.002643 1.031963
96 (bread) (vegetables) 0.827394 0.585156 0.488650 0.590590 1.009286 0.004496 1.013272
97 (vegetables) (bread) 0.585156 0.827394 0.488650 0.835077 1.009286 0.004496 1.046587
98 (cereal) (chicken) 0.762471 0.650907 0.496516 0.651193 1.000439 0.000218 1.000820
99 (chicken) (cereal) 0.650907 0.762471 0.496516 0.762806 1.000439 0.000218 1.001413

100 rows × 9 columns
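
The columns in the rules table are all derived from three supports. As a check, the sketch below recomputes confidence, lift, leverage, and conviction for rule 0, (Baby Food) -> (banana), from the supports in the table. The variable names are mine, and the formulas are the standard definitions of these metrics.

```python
# Values for rule 0, (Baby Food) -> (banana), taken from the rules table.
support_a = 0.483372   # antecedent support
support_c = 0.759711   # consequent support
support_ac = 0.376483  # support of antecedent and consequent together

confidence = support_ac / support_a               # P(consequent | antecedent)
lift = confidence / support_c                     # confidence relative to the baseline
leverage = support_ac - support_a * support_c     # gap from statistical independence
conviction = (1 - support_c) / (1 - confidence)

print(round(confidence, 4), round(lift, 4), round(leverage, 4), round(conviction, 4))
```

A lift close to 1, as in this rule, means the two items are bought together about as often as would be expected if they were independent.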

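Summary
The supermarket's customers are currently mainly relatively affluent middle-aged women, and its main sales are food, led by the meat category. There is still room to grow in the middle-class market, so more promotions could be aimed at that customer group.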
