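I. Summarizing and organizing the data in MySQL
1. Summarize payment information by order
2. Summarize order information by user ID
(The SQL statement is too long to show here.)
3. Summarize order information by product
(The SQL statement is too long to show here.)
Splitting the data into three wide tables along the dimensions of people, goods, and place makes data exploration easier.
II. Data exploration with Python
1. Load the required Python libraries and connect to the local MySQL database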
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
sns.set(style="ticks", color_codes=True, font_scale=1.5)
color = sns.color_palette()
sns.set_style('darkgrid')
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
%matplotlib inline
mysql_con = create_engine('mysql+pymysql://root:[email protected]:3306/onlineshop', echo=False)
SKU: product + size + color
SPU: standardized product
2. Product-based data exploration
df_products_summary = pd.read_sql("""
SELECT
products.id AS product_id,
products.created_at,
products.published_at,
LOWER(products.product_type) AS product_type,
products.title AS product_title,
skus.id AS sku_id,
skus.product_style,
skus.sku,
skus.price AS sku_price
FROM
onlineshop.products AS products
LEFT JOIN
onlineshop.products_skus AS skus ON products.id = skus.product_id;
""", con=mysql_con)
df_products_summary.head()
details = rstr(df_products_summary)
display(details.sort_values(by='missing ration', ascending=False))
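The helper `rstr` called above is not defined anywhere in this post. A minimal sketch of what such a helper might look like, assuming it returns one row per column with a `missing ration` column (spelled as in the calls above) holding the percentage of missing values; the other column names here are illustrative:

```python
import pandas as pd

def rstr(df):
    """Return one row per column of df with basic data-quality statistics."""
    n_rows = df.shape[0]
    summary = pd.DataFrame({
        "types": df.dtypes,
        "counts": df.count(),        # non-null values per column
        "distincts": df.nunique(),   # distinct non-null values per column
        "nulls": df.isnull().sum(),  # missing values per column
        # Percentage of missing values; name kept as spelled in the calls above.
        "missing ration": df.isnull().sum() / n_rows * 100,
    })
    return summary

# Tiny illustrative frame (made-up values).
demo = pd.DataFrame({"a": [1, 2, None, 4], "b": ["x", "y", "y", "y"]})
details_demo = rstr(demo)
print(details_demo.sort_values(by="missing ration", ascending=False))
```

In a notebook, `display(...)` can be used in place of `print(...)`.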
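Dimensions to explore
-- Non-standard products: size, color
-- Drops (batches): sales performance, with different styles for each of the four seasons
-- Quarterly trends, with different styles for each of the four seasons
(1) Data cleaning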
df_products_summary["product_type"].unique()
There are spelling errors in product_type. Also remove gift cards and unpublished products.
df_products_summary.loc[df_products_summary['product_type'] == "hooide",'product_type']="hoodie"
df_products_summary.loc[df_products_summary['product_type'] == "tousers",'product_type']="trousers"
def fix_type(product_type):
    # Correct known misspellings and fold the dress lengths into one "dress" type.
    if product_type == "hooide":
        return "hoodie"
    elif product_type == "tousers":
        return "trousers"
    elif product_type in ["maxi","mini","midi"]:
        return "dress"
    else:
        return product_type
df_products_summary['product_type'] = df_products_summary['product_type'].apply(fix_type)
df_products_summary = df_products_summary[df_products_summary["product_type"]!="gift card"]
## Interesting point: 23 products have never been published.
df_products_summary.loc[df_products_summary["published_at"].notnull(), "published_at"].nunique()
df_products_summary = df_products_summary[df_products_summary["product_type"]!=""]
Check the data again
df_products_summary = df_products_summary[df_products_summary["published_at"].notnull()]
details = rstr(df_products_summary)
display(details.sort_values(by='missing ration', ascending=False))
(2) By product type, count the distinct product_style and sku_id values and compute the mean sku_price
product_type_grouped = df_products_summary.groupby('product_type')
product_type_agg = product_type_grouped.agg({'product_id': pd.Series.nunique,
'product_style': pd.Series.nunique,
'sku_id':pd.Series.nunique,
'sku_price':pd.Series.mean})
product_type_agg
print("having {} products".format(df_products_summary["product_id"].nunique()))
print("having {} product style ".format(df_products_summary["product_style"].nunique()))
#colors
fig =plt.figure(figsize=(20,5))
f1 = fig.add_subplot(121)
product_type_agg["product_id"].sort_values(ascending=True).plot(kind='bar')
f1 = fig.add_subplot(122)
product_type_agg["sku_price"].sort_values(ascending=True).plot(kind='bar')
plt.figure(figsize=(16,9))
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
grouped_product_type = df_products_summary.groupby('product_type')
product_type_agg = grouped_product_type.agg({'product_id': pd.Series.nunique,
'sku_id':pd.Series.nunique,
'sku_price':pd.Series.mean})
(product_type_agg['product_id']/product_type_agg['product_id'].sum()).plot(kind='pie',autopct='%1.1f%%',
colors = colors,
startangle=120,
pctdistance=0.85)
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.axis('equal')
plt.show()
3. Exploration by product drop (batch)
drop_grouped = df_products_summary.groupby('publish_drop')
drop_grouped_agg = drop_grouped.agg({'product_id': pd.Series.nunique,
'product_type':pd.Series.nunique,
'product_style':pd.Series.nunique,
'sku_id':pd.Series.nunique,
'sku_price':pd.Series.mean})
drop_grouped_agg.rename(columns ={'product_id':'total_products',
'product_type':"different_type",
'sku_id':'total_sku',
'product_style':"total_spu",
'sku_price':"mean price"}, inplace = True)
drop_grouped_agg
fig = plt.figure(figsize=(25, 5))
f1 = fig.add_subplot(131)
ax = drop_grouped_agg['total_products'].plot(kind="bar")
f1 = fig.add_subplot(132)
ax2 = drop_grouped_agg['total_sku'].plot(kind="bar")
f1 = fig.add_subplot(133)
ax2 = drop_grouped_agg['mean price'].plot()
Across drops, the mean price rises steadily from spring to winter.
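The `publish_drop` column used for grouping in this section is not created in any of the code shown. One way it might be derived, assuming a drop corresponds to the season in which a product was published; the month-to-season mapping and the `year-season` label format are assumptions for illustration, not the author's definition:

```python
import pandas as pd

# Assumed mapping from calendar month to season (northern hemisphere).
SEASON_BY_MONTH = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}

def add_publish_drop(df):
    """Return a copy of df with a hypothetical 'publish_drop' label per row.

    Assumes 'published_at' has no missing values (unpublished products
    were already filtered out during cleaning).
    """
    out = df.copy()
    published = pd.to_datetime(out["published_at"])
    out["publish_drop"] = (published.dt.year.astype(str) + "-"
                           + published.dt.month.map(SEASON_BY_MONTH))
    return out

# Made-up rows for illustration.
demo = pd.DataFrame({"product_id": [1, 2, 3],
                     "published_at": ["2018-03-15", "2018-07-01", "2018-12-24"]})
demo = add_publish_drop(demo)
print(demo)
```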
4. Customer-based data exploration
Inspect the data
df_customer_summary = pd.read_sql("""
select * from customer_summary_view
""", con=mysql_con)
df_customer_summary.shape
df_customer_summary.describe()
details = rstr(df_customer_summary)
display(details.sort_values(by='missing ration', ascending=False))
Customers who have made a purchase account for 30% of registered users.
fig = plt.figure(figsize=(25, 5))
f1 = fig.add_subplot(131)
ax = register_grouped_agg['customer_id']["nunique"].plot()
ax.set_title("customer monthly grow")
f1 = fig.add_subplot(132)
ax = register_grouped_agg['customer_id']["nunique"].cumsum().plot()
ax.set_title("cumulative customer grow")
#f1 = fig.add_subplot(133)
#ax = register_grouped_agg['total_orders']["sum"].plot()
#ax.set_title("customer contribution")
The total number of customers grows every month, but the number of new customers added each month is falling.
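The `register_grouped_agg` frame plotted above is not built in the code shown. A rough sketch of how it could be produced, assuming `customer_summary_view` exposes a registration timestamp and an order count per customer; the column names `created_at` and `total_orders` are assumptions, and the rows are made up:

```python
import pandas as pd

# Made-up rows standing in for customer_summary_view.
demo_customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "created_at": pd.to_datetime(["2018-01-05", "2018-01-20", "2018-02-11",
                                  "2018-02-28", "2018-03-03"]),
    "total_orders": [2, 0, 1, 0, 0],
})

# Group customers by registration month (year-month).
demo_customers["register_month"] = demo_customers["created_at"].dt.strftime("%Y-%m")
register_grouped_agg = demo_customers.groupby("register_month").agg(
    {"customer_id": ["nunique"], "total_orders": ["sum"]}
)

new_per_month = register_grouped_agg["customer_id"]["nunique"]
print(new_per_month)           # new registrations per month
print(new_per_month.cumsum())  # cumulative registrations
```

Passing a list of functions per column gives two-level column names, which is why the plotting code indexes with `['customer_id']["nunique"]`.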
5. Place: web traffic
df_traffic_summary = pd.read_sql("""
select * from traffic
""", con=mysql_con)
df_traffic_summary.describe()
df_traffic_summary.head()
df_traffic_summary['month_group'] = df_traffic_summary['date_day'].apply(lambda x: x.strftime('%Y-%m'))
traffic_grouped = df_traffic_summary.groupby('month_group')
traffic_grouped_agg = traffic_grouped.agg({'page_views': ['mean', 'sum'],
                                           'sessions': ['mean', 'sum'],
                                           'avg_session_in_s': ['mean']})
traffic_grouped_agg
-- Tracking (instrumentation) data based on page views
traffic_grouped_agg["page_views"]["sum"].plot()
fig = plt.figure(figsize=(25, 5))
fig.add_subplot(131)
traffic_grouped_agg["sessions"]["sum"].plot()
fig.add_subplot(132)
traffic_grouped_agg["sessions"]["sum"].cumsum().plot()
fig.add_subplot(133)
traffic_grouped_agg["avg_session_in_s"]["mean"].plot()
Lifetime Value (LTV) vs Customer Acquisition Cost (CAC)
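Customer Acquisition Cost (CAC)
Can customer lifetime value (LTV) exceed [per-customer acquisition cost (CAC) + per-customer operating cost (COC)]?
1. Decide on the period to calculate over (month, quarter, year)
2. Cost invested / number of customers gained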
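The two steps above can be sketched as a small function. The monthly spend and customer counts below are made-up numbers, used only to show the arithmetic:

```python
def customer_acquisition_cost(total_spend, new_customers):
    """CAC for one period: acquisition spend divided by customers gained."""
    if new_customers <= 0:
        raise ValueError("new_customers must be positive")
    return total_spend / new_customers

# Illustrative monthly figures: (acquisition spend, new customers).
monthly = {"2018-01": (5000.0, 250), "2018-02": (6000.0, 200)}
cac_by_month = {month: customer_acquisition_cost(spend, gained)
                for month, (spend, gained) in monthly.items()}
print(cac_by_month)
```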
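Customer Lifetime Value (CLV)
Roughly how much value will a customer bring us over their lifetime? It is really a metric that helps us adjust our sales strategy.
Here is the question:
Customer registers => customer purchases => ???
After the purchase, is the customer active or inactive?
Business model, revenue model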
- B2B Ecommerce
- B2C Ecommerce
- C2C Ecommerce
- C2B Ecommerce
Lifecycle questions:
- Contractual customers
- Non-contractual customers
AAARRR Model
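When we talk about value, do we account for profit and margin?
Average customer lifespan
For how many years (or months) will a customer keep purchasing?
From e-commerce experience, this is typically 1-3 years.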
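One common back-of-the-envelope estimate multiplies average order value, purchase frequency, lifespan, and margin. This is a simplification and not necessarily the formula intended here; every number below is made up, including the CAC and COC values used for the comparison:

```python
def simple_clv(avg_order_value, orders_per_year, lifespan_years, gross_margin=1.0):
    """Rough CLV: value per order x orders per year x years x margin share."""
    return avg_order_value * orders_per_year * lifespan_years * gross_margin

clv = simple_clv(avg_order_value=60.0, orders_per_year=2,
                 lifespan_years=2, gross_margin=0.5)

# Compare against per-customer acquisition and operating cost (illustrative).
cac, coc = 30.0, 10.0
print(clv, clv > cac + coc)
```

Setting `gross_margin` below 1.0 is one way to bring profit into the estimate rather than counting revenue alone.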
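Customer retention?
Did a new customer complete the behavior you hoped for within some future period?
Retention may change...
Acquisition cost may change...
It may be better to segment customers first and then analyze each segment separately.
Churn rate
Churn rate is a metric: the share of customers lost over a period.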
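A minimal sketch of one way to compute it, assuming "active" means the customer placed at least one order in the period. That definition and the customer IDs below are assumptions for illustration:

```python
def churn_rate(active_at_start, active_at_end):
    """Share of customers active at the start who are no longer active at the end."""
    start = set(active_at_start)
    if not start:
        raise ValueError("no customers at the start of the period")
    lost = start - set(active_at_end)
    return len(lost) / len(start)

# Customers 1, 4 and 5 do not come back; customer 6 is new and is ignored.
rate = churn_rate([1, 2, 3, 4, 5], [2, 3, 6])
print(rate)
```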
Cohort analysis
Repeat purchase rate, and cohort analysis of historical data.
Retention analysis that groups users by the time of their first action removes the effect of user growth on engagement data.
Order period: the order date grouped by year-month
Cohort group: the customer's first purchase date grouped by year-month
Cohort period: an integer representing the customer's stage in their "lifetime". The number is the count of months elapsed since the first purchase.
Create the cohort_group customer grouping from each customer's first purchase date
df_order_summary.set_index('customer_id', inplace = True)
df_order_summary.head()
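The steps that load `df_order_summary` and build the `cohorts` frame are not shown in this post. A sketch of how `cohorts` could be built from order-level data, assuming one row per order with `order_id`, `customer_id`, and `created_at` columns; the column names and the rows are assumptions:

```python
import pandas as pd

# Made-up orders for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "customer_id": [10, 10, 11, 12, 12, 13],
    "created_at": pd.to_datetime(["2018-01-03", "2018-02-10", "2018-01-20",
                                  "2018-02-05", "2018-03-01", "2018-03-15"]),
})

# Order period: year-month of each order.
orders["order_period"] = orders["created_at"].dt.strftime("%Y-%m")
# Cohort group: year-month of each customer's first order.
first_order = orders.groupby("customer_id")["created_at"].transform("min")
orders["cohort_group"] = first_order.dt.strftime("%Y-%m")

# Distinct customers and orders per (cohort_group, order_period).
cohorts = (orders.groupby(["cohort_group", "order_period"])
                 .agg(total_customers=("customer_id", "nunique"),
                      total_orders=("order_id", "nunique")))

# Cohort period: 1 for the cohort's first month with orders, 2 for the next, ...
# A month in which the cohort placed no orders has no row and is not counted.
cohorts["cohort_period"] = cohorts.groupby(level=0).cumcount() + 1
print(cohorts)
```

The result is indexed by `cohort_group` and `order_period`, so it can be re-indexed by `cohort_group` and `cohort_period` for the retention heatmap.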
cohorts.reset_index(inplace=True)
cohorts.set_index(['cohort_group', 'cohort_period'], inplace=True)
cohort_sizes = cohorts.groupby(level=0)['total_customers'].first()
user_retention = cohorts['total_customers'].unstack(0).divide(cohort_sizes, axis = 1)
plt.figure(figsize=(16,9))
ax = sns.heatmap(user_retention, annot=True,cmap="YlGnBu", fmt='.0%')
ax.set_ylabel('Cohort Period', fontsize = 15)
ax.set_xlabel('Cohort Group', fontsize = 15)
ax.set_title('Retention Rates Across Cohorts', fontsize = 20)
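Customer retention is very low: fewer than 10% of customers come back in later periods.
After each new product launch, the repeat purchase rate rises somewhat.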
cohorts.reset_index(inplace=True)
cohorts.set_index(['cohort_group', 'cohort_period'], inplace=True)
unstacked_order = cohorts['total_customers'].unstack(0)
plt.figure(figsize=(16,9))
ax = sns.heatmap(unstacked_order, annot=True,cmap='Blues', fmt='g')
ax.set_ylabel('Cohort Period', fontsize = 15)
ax.set_xlabel('Cohort Group', fontsize = 15)
-- How many new products should be launched? Adjust using the SKU data, by size and product type.
Product Performance
df_orders_with_products = pd.read_sql("""
SELECT
orders.id as order_id,
orders.created_at,
orders.closed_at,
orders.cancelled_at,
orders.financial_status,
orders.fulfillment_status,
orders.processed_at,
orders.total_price,
items.product_style,
items.product_id,
items.quantity
FROM
onlineshop.orders orders
LEFT JOIN
onlineshop.orders_items items ON orders.id = items.order_id;
""", con=mysql_con)
df_orders_with_products.head()
df_orders_with_products['order_period'] = df_orders_with_products['created_at'].apply(lambda x: x.strftime('%Y-%m'))
df_products_summary.head()
product_grouped = df_products_summary.groupby('product_style')
product_grouped_agg = product_grouped.agg({
'sku_id': pd.Series.nunique,
'product_id': pd.Series.nunique,
'product_type':'first',
'created_at':'first',
'published_at':'first',
'publish_drop':'first'})
product_grouped_agg.rename(columns ={'sku_id':'total_skus','product_id':'total_products','created_at':'product_created_at'}, inplace = True)
product_info = product_grouped_agg.reset_index()
product_info.shape
product_info[product_info["total_products"]>1]
df_product_performance= pd.merge(df_orders_with_products, product_info,how="left", on='product_style')
details = rstr(df_product_performance)
display(details.sort_values(by='missing ration', ascending=False))
df_pfer = df_product_performance.groupby(["publish_drop","order_period"])
perfromance_agg = df_pfer.agg({
'order_id': pd.Series.nunique,
'product_id':pd.Series.nunique,
'product_style':pd.Series.nunique,
'quantity':pd.Series.sum,
'total_skus':pd.Series.sum})
perfromance_agg.rename(columns ={'order_id':'total_orders','product_id':'total_products'}, inplace = True)
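def cohort_period(df):
    # Number each drop's order periods 1, 2, 3, ... in chronological order.
    df['cohort_period'] = np.arange(len(df)) + 1
    return df
# group_keys=False keeps the original index, so reset_index() below does not hit a duplicated 'publish_drop' level on pandas 2.x.
product_cohorts = perfromance_agg.groupby(level=0, group_keys=False).apply(cohort_period)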
product_cohorts.head()
product_cohorts.reset_index(inplace=True)
product_cohorts.set_index(['publish_drop', 'cohort_period'], inplace=True)
unstacked_order = product_cohorts['total_orders'].unstack(0)
plt.figure(figsize=(16,9))
ax = sns.heatmap(unstacked_order, annot=True,cmap="YlGnBu", fmt='g',linewidths=.3)
ax.set_ylabel('Cohort Period', fontsize = 15)
ax.set_xlabel('Cohort Group', fontsize = 15)
ax.set_title('Product Drop Sales', fontsize = 20)
df_type_performance = df_product_performance.groupby(["order_period","product_type"])
type_perfromance_agg = df_type_performance.agg({
'order_id': pd.Series.nunique,
'product_id':pd.Series.nunique,
'quantity':pd.Series.sum,
'total_skus':pd.Series.sum})
type_perf = type_perfromance_agg.reset_index()
all_types = type_perf["product_type"].unique()
for p_type in all_types:
    type_perf[type_perf["product_type"]==p_type].plot(x="order_period",y="quantity",title=p_type)
type_perfromance_agg.reset_index(inplace=True)
type_perfromance_agg.set_index(['order_period', 'product_type'], inplace=True)
unstacked_order = type_perfromance_agg['order_id'].unstack(0)
plt.figure(figsize=(16,9))
ax = sns.heatmap(unstacked_order, annot=True,cmap="YlGnBu", fmt='g',linewidths=.3)
ax.set_ylabel('Product Type', fontsize = 15)
ax.set_xlabel('Order Period', fontsize = 15)
ax.set_title('Product Type Heatmap', fontsize = 20)
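Problems with the data itself
- published_at is earlier than created_at??
- SPU and SKU versus the products actually sold and their attributes: operations staff split products up, so one color sells well while another sells poorly
- The data warehouse design needs some thought, so that changes like these can be detected as far as possible.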
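The first issue can be checked directly. A small sketch on made-up rows, using the same `created_at` and `published_at` column names as the product query:

```python
import pandas as pd

# Made-up products; the third one is unpublished.
demo = pd.DataFrame({
    "product_id": [1, 2, 3],
    "created_at": pd.to_datetime(["2018-01-10", "2018-02-01", "2018-03-05"]),
    "published_at": pd.to_datetime(["2018-01-12", "2018-01-25", None]),
})

# Rows whose publication timestamp precedes creation. Comparisons with NaT
# evaluate to False, so unpublished products are not flagged.
suspicious = demo[demo["published_at"] < demo["created_at"]]
print(suspicious)
```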
Customer retention: back to the business questions
- Who are your best customers?
- How can a company offer the best product and make the most money?
- How to segment profitable customers?
- How much budget is needed to acquire customers?
RFM
- A customer buys every day for three weeks and then disappears for several months: alive or not?
- A customer buys only once per quarter and also bought last quarter: alive or not?
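A minimal sketch of computing recency, frequency, and monetary value per customer. The rows, the `total_price` column, and the snapshot date are assumptions for illustration. Customer 1 loosely mimics the first case above (a burst of purchases, then silence) and customer 3 the second (one purchase per quarter):

```python
import pandas as pd

# Made-up orders for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 3, 3],
    "created_at": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03",
                                  "2018-03-20", "2018-01-15", "2018-04-15"]),
    "total_price": [20.0, 30.0, 25.0, 100.0, 40.0, 60.0],
})
snapshot = pd.Timestamp("2018-05-01")  # assumed analysis date

by_customer = orders.groupby("customer_id")
last_order = by_customer["created_at"].max()
rfm = pd.DataFrame({
    "recency_days": (snapshot - last_order).dt.days,  # days since last order
    "frequency": by_customer["created_at"].count(),   # number of orders
    "monetary": by_customer["total_price"].sum(),     # total spend
})
print(rfm)
```

Frequency alone would rank customer 1 highest, but the large recency value suggests that customer may no longer be active, which is the ambiguity the two questions point at.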
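Thoughts on improvement:
What specific characteristics do the users with the shortest lifecycles (for example, 10 days) have, and why did they stop using the product?
What characterizes the users with the longest lifecycles?
For the lifecycle length shared by the largest number of users, how can we address their pain points and extend their lifecycle?
Which produces more business value: the users who stay longest (the 80/20 rule) or the largest group of users (the long-tail theory)?
The following characteristics were found.
Among churned users, 40% had not completed their profile;
new users who did not import their address-book contacts were 20% more likely to churn than those who did;
new users who added fewer than 3 friends during their first week had a churn probability above one half a month later;
in the month before churning, users' interaction rate was far below the app average.
Further discussion: are there any problems here? The more you think about it, the more unsettling it gets.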
- Not all customers have the same CLV
- CLV is a forward-looking concept, you can’t know how much it is
- What we are really interested in is Residual Lifetime Value (RLV), not past spend
- Comparing a future, uncertain quantity (CLV) to a current, certain one (CPA)
Individual-level estimates:
Readings:
https://www.datascience.com/blog/intro-to-predictive-modeling-for-customer-lifetime-value
Churn prediction is a classification problem... or perhaps a survival model?
Customer lifetime value prediction is a regression problem.
Random Forest can be used for both churn classification and CLV regression.
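A rough sketch of both uses with scikit-learn, on synthetic data. The features (recency, frequency, monetary) and the rules that generate the targets are invented purely so the models have something to fit; they are not taken from the shop data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
n = 200
# Synthetic features: recency (days), frequency (orders), monetary (spend).
X = np.column_stack([rng.integers(1, 365, n),
                     rng.integers(1, 20, n),
                     rng.uniform(10, 500, n)])
# Synthetic targets, constructed only for illustration.
churned = (X[:, 0] > 180).astype(int)          # churn label: long time since last order
future_spend = 0.5 * X[:, 2] + 5.0 * X[:, 1]   # value: grows with past spend and frequency

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, churned)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, future_spend)

churn_proba = clf.predict_proba(X[:5])[:, 1]   # estimated churn probability
clv_pred = reg.predict(X[:5])                  # estimated future spend
print(churn_proba, clv_pred)
```

The predictions here are made on the training rows only to show the API. A real evaluation would hold out customers or a later time window.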