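1. Project Overview
Adventure Works Cycles is the fictional company behind the Adventure Works sample database. It manufactures metal and composite bicycles and sells them in bicycle markets worldwide. The company has two sales channels. It initially relied on distributors and met its revenue target in 2018. In 2019, to keep pace with the internet era, it began acquiring online customers through its own website as a way into the Chinese market.
The company has four product lines:
- bicycles manufactured by Adventure Works Cycles;
- bicycle components, such as wheels, pedals, or brake assemblies;
- bicycle apparel purchased from suppliers for resale to Adventure Works Cycles customers;
- bicycle accessories purchased from suppliers for resale to Adventure Works Cycles customers.
The current requirement: on December 5, 2019, the online business manager must report November 2019 bicycle sales to the CEO, so the data team needs to deliver an analysis report on the online bicycle business for November.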
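2. Requirements Analysis and Implementation
2.1 Data Sources
The analysis draws mainly on three tables:
a. dw_customer_order, an aggregate table by date, region, and product: gives the sales overview at the overall, regional, and product levels.
b. ods_customer, the daily new-user table: used to build user profiles.
c. ods_sales_orders, the order detail table: used for user behavior analysis.
2.2 Analysis Framework
- Overall: bicycle sales performance from January to November 2019;
- By region: November sales volume for each region, and November sales volume for the TOP 10 cities;
- By product: November sales volume by product category and by individual product;
- Best sellers: the November TOP 10 products by sales volume and the November TOP 10 products by sales growth;
- By user: the November age distribution and product preferences of each age group, and the November male/female ratio and product preferences by gender.
2.3 Analysis Workflow
- Process the data in Jupyter Notebook;
- Save the processed data to a MySQL database;
- Connect PowerBI to the MySQL database for data visualization;
- Present the final results in a PPT deck.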
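3. Analysis Process
3.1 Data Processing in Jupyter Notebook (code excerpts)
# Import modules
import pandas as pd
import numpy as np
import pymysql
pymysql.install_as_MySQLdb() # make PyMySQL act as a drop-in replacement for MySQLdb; this one call is all that is needed
from sqlalchemy import create_engine
from datetime import datetime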
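3.1.1 Overall Bicycle Sales Performance
3.1.1.1 Read the source data from the database: dw_customer_order
# Read table dw_customer_order: daily product sales for each city
# Create the database engine (replace the placeholders with real connection details)
engine = create_engine('mysql://username:password@host:port/database?charset=gbk',echo=False)
datafrog = engine
gather_customer_order = pd.read_sql_query("select * from dw_customer_order",con = datafrog)
# Inspect the first 5 rows to check that the data was read correctly
gather_customer_order.head()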
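A connection URL of this form breaks when the password contains characters such as `@`, `:` or `/`. The sketch below shows one way to percent-encode the credentials before building the URL. The helper name `build_mysql_url` and all credentials are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, database, charset="gbk"):
    """Hypothetical helper: assemble a SQLAlchemy-style MySQL URL string."""
    # quote_plus percent-encodes characters such as '@', ':' and '/'
    return (
        f"mysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}?charset={charset}"
    )


# Made-up credentials, for illustration only
url = build_mysql_url("analyst", "p@ss:word/1", "localhost", 3306, "adventure_ods")
print(url)
```

The resulting string can be passed to `create_engine` in place of the hand-written URL.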
# Derive a create_year_month field from create_date, for the month-level analysis
gather_customer_order['create_year_month'] = gather_customer_order['create_date'].apply(lambda x:x.strftime('%Y-%m'))
gather_customer_order['create_year_month'].head()
Out:
0 2019-01
1 2019-01
2 2019-01
3 2019-01
4 2019-01
Name: create_year_month, dtype: object
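When the date column is a true datetime column, the same year-month key can be derived without a Python-level `apply`. This is a minimal sketch on made-up rows. It assumes `create_date` can be parsed by `pd.to_datetime`, which the source does not confirm for the real table.

```python
import pandas as pd

# Made-up sample rows standing in for dw_customer_order
orders = pd.DataFrame({"create_date": ["2019-01-03", "2019-01-28", "2019-02-14"]})

# Parse to datetime64 first so the vectorised .dt accessor is available
orders["create_date"] = pd.to_datetime(orders["create_date"])
orders["create_year_month"] = orders["create_date"].dt.strftime("%Y-%m")

print(orders["create_year_month"].tolist())
```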
# Keep only the rows whose product category cplb_zw is '自行车' (bicycles) and reassign the result to gather_customer_order
gather_customer_order = gather_customer_order.loc[gather_customer_order['cplb_zw']=='自行车']
gather_customer_order.head()
3.1.1.2 Overall Bicycle Sales Volume
# Aggregate monthly order count and sales amount: group by month, sum order_num and sum_amount, sort by month in ascending order, and reset the index
overall_sales_performance = gather_customer_order.groupby('create_year_month').agg({'order_num':'sum','sum_amount':'sum'}).sort_index(ascending = True).reset_index()
overall_sales_performance.head()
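# Month-over-month (MoM) change in monthly bicycle order volume, to see the trend over the past year
# MoM compares the current month with the previous one, e.g. 2019-02 sales against 2019-01 sales
order_num_diff = list(overall_sales_performance.order_num.diff(1)/overall_sales_performance.order_num.shift(1))
order_num_diff.pop(0) # drop the first element (NaN, because the first month has no previous month)
order_num_diff.append(0) # append 0 so the list keeps its original length
"""
MoM change can be expressed in two ways.
MoM growth rate = (current period - previous period) / previous period * 100%: how much the current period grew over the previous one.
MoM development rate = current period / previous period * 100%: the ratio of the current level to the previous level, showing how the figure changed between the two periods.
"""
# Add the MoM values as a DataFrame column; shift(1) realigns them with their months and the first month is filled with 0
overall_sales_performance = pd.concat([overall_sales_performance,pd.DataFrame({'order_num_diff':order_num_diff}).shift(1).fillna(0)],axis=1)
overall_sales_performance.head()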
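The pop/append/shift sequence above gives the same values as pandas' built-in `pct_change()`. The sketch below computes both the growth rate and the development rate on made-up monthly totals. The column name `order_num_ratio` is introduced here for illustration only.

```python
import pandas as pd

# Made-up monthly totals, for illustration only
monthly = pd.DataFrame({
    "create_year_month": ["2019-01", "2019-02", "2019-03"],
    "order_num": [100, 120, 90],
})

# Growth rate: (current - previous) / previous; the first month has no previous value
monthly["order_num_diff"] = monthly["order_num"].pct_change().fillna(0)
# Development rate: current / previous, i.e. growth rate + 1 (NaN for the first month)
monthly["order_num_ratio"] = monthly["order_num"] / monthly["order_num"].shift(1)

print(monthly)
```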
# MoM change in monthly bicycle sales amount
sum_amount_diff = list(overall_sales_performance.sum_amount.diff(1)/overall_sales_performance.sum_amount.shift(1))
sum_amount_diff.pop(0) # drop the first element
sum_amount_diff.append(0) # append 0 to the end of the list
sum_amount_diff
# Add the MoM values as a DataFrame column
overall_sales_performance = pd.concat([overall_sales_performance,pd.DataFrame(sum_amount_diff,columns=['sum_amount_diff']).shift(1).fillna(0)],axis=1)
overall_sales_performance.head()
# Rename the order-volume MoM column to order_diff and the sales-amount MoM column to amount_diff
# Sort by month in ascending order
overall_sales_performance.rename(columns = {"order_num_diff":"order_diff", "sum_amount_diff":"amount_diff"},inplace=True)
overall_sales_performance = overall_sales_performance.sort_values('create_year_month')
# View the first 5 rows: monthly bicycle order volume, sales amount, and MoM change
overall_sales_performance.head(5)
3.1.2 Bicycle Sales by Region, November 2019
3.1.2.1 Source data dw_customer_order: clean and filter the October and November data
# Filter the October and November bicycle data
gather_customer_order['create_year_month'].isin(['2019-10','2019-11'])
gather_customer_order_10_11 = gather_customer_order.loc[gather_customer_order['create_year_month'].isin(['2019-10','2019-11'])]
gather_customer_order_10_11.head()
3.1.2.2 Bicycle Sales Volume by Region, November 2019
# Group by region and month; sum order volume and sales amount
gather_customer_order_10_11_group = gather_customer_order_10_11.groupby(['chinese_territory','create_year_month']).agg({'order_num':'sum','sum_amount':'sum'}).reset_index()
gather_customer_order_10_11_group.head()
# Collect the distinct regions
region_list = gather_customer_order_10_11_group['chinese_territory'].unique()
# pct_change() returns the relative change from the previous row; applied per region it gives the October-to-November MoM change
# Per-region results are collected in lists and concatenated once (Series.append was removed in pandas 2.0)
order_parts = []
amount_parts = []
for i in region_list:
    region_rows = gather_customer_order_10_11_group.loc[gather_customer_order_10_11_group['chinese_territory']==i]
    order_parts.append(region_rows['order_num'].pct_change())
    amount_parts.append(region_rows['sum_amount'].pct_change())
gather_customer_order_10_11_group['order_diff'] = pd.concat(order_parts).fillna(0)
gather_customer_order_10_11_group['amount_diff'] = pd.concat(amount_parts).fillna(0)
# MoM change in bicycle sales quantity and sales amount for each region, October to November
gather_customer_order_10_11_group.head()
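The per-region loop can also be written as a single grouped operation. The sketch below runs on made-up regional totals, with English region names as stand-ins. It is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Made-up regional totals for two months, for illustration only
regional = pd.DataFrame({
    "chinese_territory": ["East", "East", "South", "South"],
    "create_year_month": ["2019-10", "2019-11", "2019-10", "2019-11"],
    "order_num": [200, 220, 50, 60],
})

# Sort so that within each region October precedes November
regional = regional.sort_values(["chinese_territory", "create_year_month"])
# groupby(...).pct_change() restarts the comparison for every region
regional["order_diff"] = (
    regional.groupby("chinese_territory")["order_num"].pct_change().fillna(0)
)
print(regional)
```

Because the result keeps the original row index, it can be assigned back as a column directly.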
3.1.2.3 MoM Change for the TOP 10 Cities by Bicycle Sales Volume, November 2019
# Filter the November bicycle transactions
gather_customer_order_11 = gather_customer_order.loc[gather_customer_order['create_year_month'] == '2019-11']
# Group gather_customer_order_11 by city (chinese_city) and sum the sales quantity order_num
gather_customer_order_city_11 = gather_customer_order_11.groupby(['chinese_city']).agg({'order_num':'sum'}).reset_index()
# TOP 10 cities by November bicycle sales quantity
gather_customer_order_city_head = gather_customer_order_city_11.sort_values(by = 'order_num',ascending = False).head(10)
# Keep the October and November rows of gather_customer_order_10_11 for those ten cities, stored as gather_customer_order_10_11_head
gather_customer_order_10_11_head = gather_customer_order_10_11[gather_customer_order_10_11['chinese_city'].isin(gather_customer_order_city_head.chinese_city)]
# Sum sales quantity and sales amount for the TOP 10 cities, by city and month
gather_customer_order_city_10_11 = gather_customer_order_10_11_head.groupby(['chinese_city','create_year_month']).agg({'order_num':'sum','sum_amount':'sum'}).reset_index()
# Compute the MoM change for the TOP 10 cities
city_top_list = list(gather_customer_order_city_10_11['chinese_city'].unique())
# Per-city results are collected in lists and concatenated once (Series.append was removed in pandas 2.0)
order_parts = []
amount_parts = []
for i in city_top_list:
    city_rows = gather_customer_order_city_10_11.loc[gather_customer_order_city_10_11['chinese_city']==i]
    order_parts.append(city_rows['order_num'].pct_change())
    amount_parts.append(city_rows['sum_amount'].pct_change())
# order_diff is the MoM change in sales quantity, amount_diff the MoM change in sales amount
gather_customer_order_city_10_11['order_diff'] = pd.concat(order_parts).fillna(0)
gather_customer_order_city_10_11['amount_diff'] = pd.concat(amount_parts).fillna(0)
gather_customer_order_city_10_11.head(5)
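The select-then-filter step for the top cities can be sketched compactly with `nlargest` and `isin`. The city names and numbers below are made up, and the example keeps a top 2 rather than a top 10 to stay small.

```python
import pandas as pd

# Made-up city totals for November, for illustration only
city_sales = pd.DataFrame({
    "chinese_city": ["A", "B", "C", "D"],
    "order_num": [40, 75, 10, 60],
})
# Made-up October/November rows for the same cities
two_months = pd.DataFrame({
    "chinese_city": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "create_year_month": ["2019-10"] * 4 + ["2019-11"] * 4,
    "order_num": [35, 70, 12, 50, 40, 75, 10, 60],
})

# nlargest(n, column) is shorthand for sorting in descending order and taking head(n)
top_cities = city_sales.nlargest(2, "order_num")["chinese_city"]
# isin keeps both months, but only for the selected cities
top_two_months = two_months[two_months["chinese_city"].isin(top_cities)]
print(top_two_months)
```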
3.1.3 Bicycle Sales by Product, November 2019
3.1.3.1 Segment Sales Volume
# Total bicycles sold per month: group gather_customer_order by month and sum the sales quantity, stored as gather_customer_order_group_month
gather_customer_order_group_month = gather_customer_order.groupby('create_year_month').agg({'order_num':'sum'}).reset_index()
# Use pd.merge to join the bicycle sales table (gather_customer_order) with the monthly totals (gather_customer_order_group_month), stored as order_num_proportion
order_num_proportion = pd.merge(left = gather_customer_order,
                                right = gather_customer_order_group_month,
                                how = 'inner',
                                on = 'create_year_month')
# Each row's sales quantity as a share of that month's total bicycle sales, stored in the new column 'order_proportion'
order_num_proportion['order_proportion'] = order_num_proportion['order_num_x']/order_num_proportion['order_num_y']
# Rename the monthly total order_num_y to sum_month_order
order_num_proportion = order_num_proportion.rename(columns = {'order_num_y':'sum_month_order'})
order_num_proportion.head()
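The group-then-merge pattern used for the monthly share can also be expressed with `transform`, which broadcasts each month's total back onto the original rows. This is a sketch on made-up rows, offered as an alternative rather than as the notebook's method.

```python
import pandas as pd

# Made-up rows, for illustration only
sales = pd.DataFrame({
    "create_year_month": ["2019-10", "2019-10", "2019-11", "2019-11"],
    "product_name": ["Road-150 Red", "Mountain-200 Black"] * 2,
    "order_num": [30, 10, 45, 15],
})

# transform('sum') returns the group total aligned to every original row
sales["sum_month_order"] = sales.groupby("create_year_month")["order_num"].transform("sum")
sales["order_proportion"] = sales["order_num"] / sales["sum_month_order"]
print(sales)
```

This avoids the `_x`/`_y` suffixes that `merge` adds when both tables carry an `order_num` column.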
3.1.3.2 Road / Mountain / Touring Bicycle Segments
Road bicycle segment sales volume
gather_customer_order_road = gather_customer_order[gather_customer_order['cpzl_zw'] == '公路自行车']
# Sales quantity of each road bicycle model ('product_name') per month, stored as gather_customer_order_road_month
gather_customer_order_road_month = gather_customer_order_road.groupby(by = ['create_year_month','product_name']).agg({'order_num':'sum'}).reset_index()
gather_customer_order_road_month['cpzl_zw'] = '公路自行车'
# Total road bicycles sold per month, stored as gather_customer_order_road_month_sum (remember to reset the index)
gather_customer_order_road_month_sum = gather_customer_order_road_month.groupby('create_year_month').agg({'order_num':'sum'}).reset_index()
# Merge gather_customer_order_road_month with the monthly totals on the key 'create_year_month'
# Used to compute each model's share
gather_customer_order_road_month = pd.merge(left = gather_customer_order_road_month,
                                            right = gather_customer_order_road_month_sum,
                                            how = 'inner',
                                            on = 'create_year_month')
gather_customer_order_road_month.head()
Mountain bicycle segment sales volume
# Same steps as for road bicycles: filter mountain bicycles into gather_customer_order_Mountain, sum sales per model, sum the monthly totals, then merge; the result is used to compare MoM change across product sub-categories
gather_customer_order_Mountain = gather_customer_order[gather_customer_order['cpzl_zw'] == '山地自行车']
# Sales quantity of each mountain bicycle model per month
gather_customer_order_Mountain_month = gather_customer_order_Mountain.groupby(by = ['create_year_month','product_name']).agg({'order_num':'sum'}).reset_index()
gather_customer_order_Mountain_month['cpzl_zw'] = '山地自行车'
# Total mountain bicycles sold per month
gather_customer_order_Mountain_month_sum = gather_customer_order_Mountain_month.groupby('create_year_month').agg({'order_num':'sum'}).reset_index()
gather_customer_order_Mountain_month_sum.head()
# Merge gather_customer_order_Mountain_month with the monthly totals
# Used to compute each model's share
gather_customer_order_Mountain_month = pd.merge(left = gather_customer_order_Mountain_month,
                                                right = gather_customer_order_Mountain_month_sum,
                                                how = 'inner',
                                                on = 'create_year_month')
gather_customer_order_Mountain_month.head()
Touring bicycle segment sales volume
gather_customer_order_tour = gather_customer_order[gather_customer_order['cpzl_zw'] == '旅游自行车']
# Sales quantity of each touring bicycle model per month
gather_customer_order_tour_month = gather_customer_order_tour.groupby(by = ['create_year_month','product_name']).agg({'order_num':'sum'}).reset_index()
gather_customer_order_tour_month['cpzl_zw'] = '旅游自行车'
gather_customer_order_tour_month_sum = gather_customer_order_tour_month.groupby('create_year_month').agg({'order_num':'sum'}).reset_index()
gather_customer_order_tour_month = pd.merge(left = gather_customer_order_tour_month,
                                            right = gather_customer_order_tour_month_sum,
                                            how = 'inner',
                                            on = 'create_year_month')
gather_customer_order_tour_month.head()
# Stack the monthly sales of road, mountain, and touring bicycles vertically
gather_customer_order_month = pd.concat([gather_customer_order_road_month,gather_customer_order_Mountain_month,gather_customer_order_tour_month])
# Add the column 'order_num_proportio': each model's sales as a share of its sub-category's monthly total
gather_customer_order_month['order_num_proportio'] = gather_customer_order_month['order_num_x']/gather_customer_order_month['order_num_y']
# Rename order_num_x (the model's sales in the month) to order_month_product
# Rename order_num_y (the monthly total of the model's sub-category) to sum_order_month
gather_customer_order_month = gather_customer_order_month.rename(columns = {'order_num_x':'order_month_product','order_num_y':'sum_order_month'})
gather_customer_order_month.head()
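The road, mountain, and touring blocks repeat the same steps, so they can be folded into one grouped computation. The sketch below runs on made-up rows; the category values are English stand-ins for the `cpzl_zw` labels and the product names are hypothetical.

```python
import pandas as pd

# Made-up rows; category values are English stand-ins for the cpzl_zw labels
bikes = pd.DataFrame({
    "create_year_month": ["2019-11"] * 5,
    "cpzl_zw": ["road", "road", "mountain", "mountain", "touring"],
    "product_name": ["R1", "R2", "M1", "M2", "T1"],
    "order_num": [30, 10, 20, 20, 5],
})

# One pass over all sub-categories: product totals per month ...
by_product = (
    bikes.groupby(["create_year_month", "cpzl_zw", "product_name"], as_index=False)
    .agg(order_month_product=("order_num", "sum"))
)
# ... and the matching sub-category total for each row
by_product["sum_order_month"] = (
    by_product.groupby(["create_year_month", "cpzl_zw"])["order_month_product"].transform("sum")
)
by_product["order_num_proportio"] = by_product["order_month_product"] / by_product["sum_order_month"]
print(by_product)
```

The output has the same columns as `gather_customer_order_month` without the three separate filter, merge, and concat steps.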
MoM change for November 2019
# To compute the November MoM change, first filter the October and November data
gather_customer_order_month_10_11 = gather_customer_order_month[gather_customer_order_month.create_year_month.isin(['2019-10','2019-11'])]
# Sort the October and November rows by product and month
gather_customer_order_month_10_11 = gather_customer_order_month_10_11.sort_values(by = ['product_name','create_year_month'])
product_name = list(gather_customer_order_month_10_11.product_name.drop_duplicates())
# MoM change in sales quantity for each product
# Per-product results are collected in a list and concatenated once (Series.append was removed in pandas 2.0)
order_parts = []
for i in product_name:
    product_rows = gather_customer_order_month_10_11.loc[gather_customer_order_month_10_11['product_name']==i]
    order_parts.append(product_rows['order_month_product'].pct_change())
gather_customer_order_month_10_11['order_num_diff'] = pd.concat(order_parts).fillna(0)
# Keep the November rows
gather_customer_order_month_11 = gather_customer_order_month_10_11[gather_customer_order_month_10_11['create_year_month'] == '2019-11']
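Filling missing MoM values with 0 means a product with no October sales reports 0% growth, which is indistinguishable from a flat month. The sketch below uses made-up products and a wide pivot to flag such cases instead. The column `is_new` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up rows: product "N1" has no October sales
two_months = pd.DataFrame({
    "product_name": ["P1", "P1", "P2", "P2", "N1"],
    "create_year_month": ["2019-10", "2019-11", "2019-10", "2019-11", "2019-11"],
    "order_month_product": [50, 60, 40, 30, 8],
})

# One row per product, one column per month; missing months become NaN
wide = two_months.pivot(index="product_name", columns="create_year_month",
                        values="order_month_product")
wide["order_num_diff"] = (wide["2019-11"] - wide["2019-10"]) / wide["2019-10"]
# Flag products without an October baseline instead of reporting 0 growth
wide["is_new"] = wide["2019-10"].isna()
print(wide)
```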
Cumulative product sales, January to November 2019
# Filter the January to November 2019 bicycle data, stored as gather_customer_order_month_1_11
gather_customer_order_month_1_11 = gather_customer_order_month[gather_customer_order_month.create_year_month.isin(['2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06','2019-07', '2019-08', '2019-09', '2019-10', '2019-11'])]
# Cumulative sales per product for January to November 2019
gather_customer_order_month_1_11_sum = gather_customer_order_month_1_11.groupby(by = 'product_name').order_month_product.sum().reset_index()
# Rename the column to sum_order_1_11: cumulative product sales for January to November
gather_customer_order_month_1_11_sum = gather_customer_order_month_1_11_sum.rename(columns = {'order_month_product':'sum_order_1_11'})
gather_customer_order_month_1_11_sum.head()
Bicycle product sales, MoM change, and cumulative sales for November 2019
gather_customer_order_month_11 = pd.merge(left = gather_customer_order_month_11,
                                          right = gather_customer_order_month_1_11_sum,
                                          how = 'left',
                                          on = 'product_name')
gather_customer_order_month_11.head()
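3.1.4 Best-Seller Analysis, November 2019
3.1.4.1 November TOP 10 products by sales volume: quantity and MoM change
# Filter the November data
gather_customer_order_11 = gather_customer_order.loc[gather_customer_order['create_year_month'] == '2019-11']
Compute the TOP 10 products
# Count the sales records per product (count() counts rows; it does not add up order_num). A trailing \ continues a statement on the next line
# Sort by that count in descending order and take the TOP 10 products
customer_order_11_top10 = gather_customer_order_11.groupby(by = 'product_name').order_num.count().reset_index().sort_values(by = 'order_num',ascending = False).head(10)
# Names of the TOP 10 products
list(customer_order_11_top10['product_name'])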
Out:
['Mountain-200 Silver',
'Mountain-200 Black',
'Road-150 Red',
'Road-750 Black',
'Road-550-W Yellow',
'Road-250 Black',
'Road-350-W Yellow',
'Road-250 Red',
'Touring-1000 Blue',
'Mountain-400-W Silver']
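Ranking by `count()` and ranking by `sum()` can disagree when one row represents more than one order. The sketch below illustrates the difference on made-up rows. Whether this affects the real table depends on how `order_num` is populated, which the source does not state.

```python
import pandas as pd

# Made-up aggregated rows: order_num is assumed to be a per-row order count
daily = pd.DataFrame({
    "product_name": ["P1", "P1", "P2"],
    "order_num": [1, 1, 5],
})

by_rows = daily.groupby("product_name")["order_num"].count()  # number of rows
by_units = daily.groupby("product_name")["order_num"].sum()   # total of order_num

print(by_rows.idxmax(), by_units.idxmax())
```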
Compute the TOP 10 sales volume and MoM change
customer_order_month_10_11 = gather_customer_order_month_10_11[['create_year_month','product_name','order_month_product','cpzl_zw','order_num_diff']]
customer_order_month_10_11 = customer_order_month_10_11[customer_order_month_10_11['product_name'].\
isin(list(customer_order_11_top10['product_name']))]
# Label value '本月TOP10销量' means "TOP 10 by sales volume this month"
customer_order_month_10_11['category'] = '本月TOP10销量'
customer_order_month_10_11.head()
3.1.4.2 November TOP 10 products by growth: quantity and MoM change
customer_order_month_11 = gather_customer_order_month_10_11.loc[gather_customer_order_month_10_11['create_year_month'] == '2019-11'].\
sort_values(by = 'order_num_diff',ascending = False).head(10)
customer_order_month_11_top10_seep = gather_customer_order_month_10_11.loc[gather_customer_order_month_10_11['product_name'].isin(list(customer_order_month_11['product_name']))]
customer_order_month_11_top10_seep = customer_order_month_11_top10_seep[['create_year_month','product_name','order_month_product','cpzl_zw','order_num_diff']]
# Label value '本月TOP10增速' means "TOP 10 by growth this month"
customer_order_month_11_top10_seep['category'] = '本月TOP10增速'
# axis = 0 concatenates along rows (stacks vertically); axis = 1 concatenates along columns
hot_products_11 = pd.concat([customer_order_month_10_11,customer_order_month_11_top10_seep],axis = 0)
hot_products_11.tail()
3.1.5 User Behavior Analysis
3.1.5.1 User Age Analysis
# sales_customer_order_11 is built in a step not shown in this excerpt
# Convert sales_customer_order_11['birth_year'] to an integer type
sales_customer_order_11['birth_year'] = sales_customer_order_11['birth_year'].fillna(0).astype('int32')
# Compute customer age
sales_customer_order_11['customer_age'] = 2020 - sales_customer_order_11['birth_year']
# Age bands, version 1
listBins = [30,34,39,44,49,54,59,64]
listLabels = ['30-34','35-39','40-44','45-49','50-54','55-59','60-64']
# Add the 'age_level' band column
sales_customer_order_11['age_level'] = pd.cut(sales_customer_order_11['customer_age'],bins=listBins, labels=listLabels, include_lowest = True)
# Keep only the bicycle orders; copy() so new columns are added to an independent DataFrame
df_customer_order_bycle = sales_customer_order_11.loc[sales_customer_order_11['cplb_zw'] == '自行车'].copy()
# Per-row share of all bicycle orders, stored in df_customer_order_bycle['age_level_rate']; summing it within a group gives that group's share
df_customer_order_bycle['age_level_rate'] = 1/df_customer_order_bycle.shape[0]
# Three coarser age groups labelled '<=35', '35-45', '>=45'; with right=False the intervals are [0,35), [35,45), [45,63)
df_customer_order_bycle['age_level2'] = pd.cut(df_customer_order_bycle.customer_age,bins=[0,35,45,63],right=False,labels=['<=35','35-45','>=45'])
# Count the order records in each age group
age_level2_count = df_customer_order_bycle.groupby(by = 'age_level2').sales_order_key.count().reset_index()
age_level2_count
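The two `pd.cut` calls treat bin edges differently, which decides where boundary ages such as 35, 45, and 63 end up. The sketch below applies the same bins and labels to a few made-up ages so the boundary behaviour can be inspected directly.

```python
import pandas as pd

ages = pd.Series([29, 30, 34, 35, 44, 45, 62, 63])  # made-up ages

# Right-closed bins (the default); include_lowest=True also admits the first edge, 30
fine = pd.cut(ages, bins=[30, 34, 39, 44, 49, 54, 59, 64],
              labels=['30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64'],
              include_lowest=True)
# Left-closed bins: [0, 35), [35, 45), [45, 63)
coarse = pd.cut(ages, bins=[0, 35, 45, 63], right=False,
                labels=['<=35', '35-45', '>=45'])

print(pd.DataFrame({'age': ages, 'age_level': fine, 'age_level2': coarse}))
```

Ages outside every interval (29 in the first scheme, 63 in the second) come back as NaN and therefore drop out of any grouped count.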
3.1.5.2 User Gender
# Count the bicycle order records by gender
gender_count = df_customer_order_bycle.groupby(by = 'gender').cplb_zw.count().reset_index()
# Attach each age group's count and derive the per-row share within the age group
df_customer_order_bycle = pd.merge(df_customer_order_bycle,age_level2_count,on = 'age_level2').rename(columns = {'sales_order_key_y':'age_level2_count'})
df_customer_order_bycle['age_level2_rate'] = 1/df_customer_order_bycle['age_level2_count']
# Attach each gender's count and derive the per-row share within the gender
df_customer_order_bycle = pd.merge(df_customer_order_bycle,gender_count,on = 'gender').rename(columns = {'cplb_zw_y':'gender_count'})
df_customer_order_bycle['gender_rate'] = 1/df_customer_order_bycle['gender_count']
df_customer_order_bycle.head()
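The per-row rate columns exist so that a reporting tool can sum them into shares. When the shares are needed inside pandas, `value_counts(normalize=True)` and `pd.crosstab` give them directly. This is a sketch on made-up rows.

```python
import pandas as pd

# Made-up order rows, for illustration only
orders = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "age_level2": ["<=35", "35-45", "35-45", "35-45"],
})

# Share of orders per gender and per age group
gender_share = orders["gender"].value_counts(normalize=True)
age_share = orders["age_level2"].value_counts(normalize=True)

# Row-normalised cross-tabulation: age-group mix within each gender
mix = pd.crosstab(orders["gender"], orders["age_level2"], normalize="index")
print(gender_share, age_share, mix, sep="\n\n")
```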
3.2 Save the Processed Data to the MySQL Database
3.2.1 Overall Bicycle Sales Performance
- pt_overall_sale_performance_1
3.2.2 Bicycle Sales by Region, November 2019
- pt_bicy_november_territory_2
- pt_bicy_november_october_city_3
3.2.3 Bicycle Sales by Product, November 2019
- pt_bicycle_product_sales_month_4
- pt_bicycle_product_sales_order_month_4
- pt_bicycle_product_sales_order_month_11
3.2.4 Best-Seller Analysis, November 2019
- pt_hot_products_november
3.2.5 User Behavior Analysis
- pt_user_behavior_november
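The write step itself is not shown in the excerpts. A minimal sketch of how a result table could be saved with `DataFrame.to_sql` follows. It uses an in-memory SQLite database as a stand-in so that it runs without a MySQL server, and the sample values are made up. With MySQL, the SQLAlchemy engine would be passed as `con` instead.

```python
import sqlite3

import pandas as pd

# Made-up result table, for illustration only
overall_sales_performance = pd.DataFrame({
    "create_year_month": ["2019-10", "2019-11"],
    "order_num": [100, 110],
    "order_diff": [0.0, 0.1],
})

# In-memory SQLite stands in for MySQL here
con = sqlite3.connect(":memory:")
overall_sales_performance.to_sql(
    "pt_overall_sale_performance_1", con=con, if_exists="replace", index=False
)

# Read the table back to confirm the write
check = pd.read_sql_query("select * from pt_overall_sale_performance_1", con=con)
print(check)
con.close()
```

`if_exists="replace"` drops and recreates the table on each run, which suits a report that is rebuilt from scratch.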
3.3 Connect PowerBI to the MySQL Database for Data Visualization
Click to view: visualization report
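3.4 Present the Analysis Report in PPT
3.4.1 Overall Sales Performance
Findings:
- Over the past 11 months, November had the highest bicycle sales volume at 3,316 units, up 7.1% from October;
- Over the past 11 months, November also had the highest bicycle sales amount at 61.9 million yuan, up 8.7% from October; sales amount and sales volume follow the same trend.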
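3.4.2 Sales by Region
Findings:
- In November, East China had the highest bicycle sales volume of the 8 regions;
- South China grew 23.6% from October, the fastest of all regions;
- The TOP 10 cities together hold a 13.41% market share (city share = city sales volume / total sales volume).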
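3.4.3 Product Sales Performance
Findings:
- In November, road bicycles held the largest share of sales;
- Compared with October, touring bicycles grew fastest.
Findings:
- Among road bicycles in November, every model except Road-350-W Yellow rose month over month;
- Road-650 grew 14.29% from October, the fastest;
- Road-150 Red had the highest sales share, about 19.63%.
Findings:
- Among mountain bicycles in November, every model except Mountain-200 Black rose month over month;
- Mountain-500 Silver grew fastest, at 19.51%;
- Mountain-200 Silver had the largest sales share.
Findings:
- Among touring bicycles in November, every model except Touring-2000 Blue and Touring-3000 Blue rose month over month;
- Touring-1000 Yellow grew fastest from October, at 27.18%;
- Touring-1000 Blue had the largest sales share, at 32.52%.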
3.4.4 Best-Seller Analysis
Findings:
- In November, Mountain-200 Silver sold the most, at 395 units, up 10.64% from October;
- In November, Touring-1000 Yellow grew fastest, up 27.18% from October.
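3.4.5 User Behavior Analysis
Findings:
- By age band, customers aged 35-39 make up the largest share of buyers, at 29%; the share then declines steadily with age;
- In the cross-analysis of age (over 30) and market segment, road bicycles account for the largest share of purchases and touring bicycles the smallest;
- Men and women buy bicycles in almost equal proportions;
- In the cross-analysis of gender and market segment, both men and women buy road bicycles most often and touring bicycles least often.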