提到购物篮分析,就不得不说到一个无数次被提及的故事——啤酒与尿不湿的故事,这个经典案例常被拿到各种会场、餐桌和文章中,以至于听到耳朵都要磨成茧。购物篮分析,就是分析一段时间内客户购物篮中商品的规律,通过数据分析和数据挖掘,发现用户的消费习惯,尤其是哪些商品经常会一起购买,从而更好用于商品的排列、采购、推广和营销,以此来满足消费者的需要,提高公司收益。
在做购物篮分析时,往往是以订单、产品为最小颗粒度的数据,来分析用户在购买A产品以后,对B产品带来的影响有多大。
购物篮分析的实现,是基于条件概率,也就是贝叶斯公式。在实际应用时,主要会牵扯到3个指标:置信度、支持度和提升度。
将信息展示如下,N代表数量:
指标名称 | 指标说明 | 计算公式 | 举例 |
---|---|---|---|
产品A的订单数 | 有购买过产品A的订单数量 | N(A) | 400 |
产品B的订单数 | 有购买过产品B的订单数量 | N(B) | 300 |
同时购买产品A和B的订单数 | 同时购买过产品A和B的订单数量 | N(A∩B) | 200 |
总订单数 | 所有订单数量 | N(I) | 1000 |
支持度 | 支持的程度,一般用百分比表示。本例中,A和B的支持度,即A和B同时出现的频率,如果A和B一起出现的频率非常小,那么就说明了A和B之间的联系并不大 | A和B的支持度=N(A∩B)/ N(I) | 200/1000=20% |
置信度 | 揭示了A出现时,B是否一定会出现,如果出现则其大概有多大的可能出现 | P(B|A)=P(A∩B) /P(A)=[N(A∩B)/N(I)]/[N(A)/N(I)]=N(A∩B)/N(A) | 200/400=50% |
提升度 | 提升度反映了关联规则中的A与B的相关性,提升度>1且越高表明正相关性越高,提升度<1且越低表明负相关性越高,提升度=1表明没有相关性,即相互独立。 | P(A→B)=P(B|A)/P(B) =[N(A∩B)/N(I)]/[N(A)/N(I)]/[N(B)/N(I)]=N(A∩B)*N(I)/N(A)/N(B) | 法1:0.5/(300/1000)=0.5/0.3≈1.67 法2:150*1000/400/300≈1.67 |
为方便演示,构建product_order作为测试表,表数据截图如下:
构建数据的代码如下:
--构建测试数据
DROP TABLE IF EXISTS "product_order";
CREATE TABLE "product_order" (
"order_id" varchar(100) COLLATE "pg_catalog"."default",
"product" varchar(50) COLLATE "pg_catalog"."default",
"qty" int8
)
;
INSERT INTO "product_order" VALUES ('1', 'A', 1);
INSERT INTO "product_order" VALUES ('1', 'B', 2);
INSERT INTO "product_order" VALUES ('1', 'C', 3);
INSERT INTO "product_order" VALUES ('1', 'D', 5);
INSERT INTO "product_order" VALUES ('2', 'A', 4);
INSERT INTO "product_order" VALUES ('2', 'C', 2);
INSERT INTO "product_order" VALUES ('3', 'D', 3);
实现的SQL代码如下:
--构建两两组合表
SELECT A.order_id,
A.product product_a,
b.product product_b
FROM
product_order
A LEFT JOIN product_order b ON A.order_id = b.order_id;
计算出两两组合的订单数量,计算结果如下图所示:
实现代码如下:
-- 求两两组合的订单数量
SELECT
product_a,
product_b,
COUNT ( DISTINCT order_id ) AS buy_together
FROM
(
SELECT A
.order_id,
A.product product_a,
b.product product_b
FROM
product_order
A LEFT JOIN product_order b ON A.order_id = b.order_id
) T
GROUP BY
product_a,
product_b;
-- 求每个商品的订单数量
SELECT
product,
COUNT ( DISTINCT order_id ) product_cnt
FROM
product_order
GROUP BY
product;
实现代码如下图:
--求所有订单数量
SELECT COUNT
( DISTINCT order_id ) total_order_cnt
FROM
product_order;
实现代码如下:
--求拼接所有基础数据
SELECT
t1.product_a,
t1.product_b,
t1.buy_together,
t2.product_a_cnt,
t3.product_b_cnt,
t4.total_order_cnt
FROM
(
SELECT
product_a,
product_b,
COUNT ( DISTINCT order_id ) AS buy_together
FROM
(
SELECT A
.order_id,
A.product product_a,
b.product product_b
FROM
product_order
A LEFT JOIN product_order b ON A.order_id = b.order_id
) T
GROUP BY
product_a,
product_b
) t1
LEFT JOIN ( SELECT product, COUNT ( DISTINCT order_id ) product_a_cnt FROM product_order GROUP BY product ) t2 ON t1.product_a = t2.product
LEFT JOIN ( SELECT product, COUNT ( DISTINCT order_id ) product_b_cnt FROM product_order GROUP BY product ) t3 ON t1.product_b = t3.product
LEFT JOIN ( SELECT COUNT ( DISTINCT order_id ) total_order_cnt FROM product_order ) t4 ON 1 = 1;
实现代码如下:
-- 排除两两如何中产品相同的数据
SELECT
t1.product_a,
t1.product_b,
t1.buy_together,
t2.product_a_cnt,
t3.product_b_cnt,
t4.total_order_cnt
FROM
(
SELECT
product_a,
product_b,
COUNT ( DISTINCT order_id ) AS buy_together
FROM
(
SELECT A
.order_id,
A.product product_a,
b.product product_b
FROM
product_order
A LEFT JOIN product_order b ON A.order_id = b.order_id
) T
GROUP BY
product_a,
product_b
) t1
LEFT JOIN ( SELECT product, COUNT ( DISTINCT order_id ) product_a_cnt FROM product_order GROUP BY product ) t2 ON t1.product_a = t2.product
LEFT JOIN ( SELECT product, COUNT ( DISTINCT order_id ) product_b_cnt FROM product_order GROUP BY product ) t3 ON t1.product_b = t3.product
LEFT JOIN ( SELECT COUNT ( DISTINCT order_id ) total_order_cnt FROM product_order ) t4 ON 1 = 1
--增加筛选条件
WHERE
t1.product_a <> t1.product_b;
利用得到的基础数据,增加对应的衍生指标。
--完整SQL,进行购物篮分析
SELECT
*,
round(buy_together * 1.0 / total_order_cnt,4) AS support,
round(buy_together * 1.0 / product_a_cnt,4) AS confidence,
round(( buy_together * 1.0 / product_a_cnt ) / ( product_b_cnt *1.0/ total_order_cnt ),4) AS promote
-- round(buy_together*total_order_cnt*1.0/product_a_cnt/product_b_cnt,4) promote_b
FROM
(
SELECT
t1.product_a,
t1.product_b,
t1.buy_together,
t2.product_a_cnt,
t3.product_b_cnt,
t4.total_order_cnt
FROM
(
SELECT
product_a,
product_b,
COUNT ( DISTINCT order_id ) AS buy_together
FROM
(
SELECT A
.order_id,
A.product product_a,
b.product product_b
FROM
product_order
A LEFT JOIN product_order b ON A.order_id = b.order_id
) T
GROUP BY
product_a,
product_b
) t1
LEFT JOIN ( SELECT product, COUNT ( DISTINCT order_id ) product_a_cnt FROM product_order GROUP BY product ) t2 ON t1.product_a = t2.product
LEFT JOIN ( SELECT product, COUNT ( DISTINCT order_id ) product_b_cnt FROM product_order GROUP BY product ) t3 ON t1.product_b = t3.product
LEFT JOIN ( SELECT COUNT ( DISTINCT order_id ) total_order_cnt FROM product_order ) t4 ON 1 = 1
WHERE
t1.product_a <> t1.product_b
) main
ORDER BY
promote DESC;
上述1中的写法在数据量大的话执行起来可能会比较慢,因此这里给出优化的一个版本:
-- 使用 WITH 语句执行购物篮分析查询
WITH basket_analysis_base AS (
SELECT
product_a,
product_b,
COUNT(DISTINCT order_id) AS buy_together
FROM
(
-- 只需修改此处即可
-- begin
SELECT
A.order_id,
A.product AS product_a,
b.product AS product_b
FROM
product_order A
LEFT JOIN product_order b ON A.order_id = b.order_id
-- end
) T
GROUP BY
product_a,
product_b
WHERE
product_a <> product_b
),
product_counts AS (
SELECT
product,
COUNT(DISTINCT order_id) AS cnt
FROM
product_order
GROUP BY
product
)
SELECT
t1.product_a,
t1.product_b,
t1.buy_together,
t2a.cnt AS product_a_cnt,
t2b.cnt AS product_b_cnt,
t3.cnt AS total_order_cnt,
ROUND(t1.buy_together * 1.0 / t3.cnt, 4) AS support,
ROUND(t1.buy_together * 1.0 / t2a.cnt, 4) AS confidence,
ROUND((t1.buy_together * 1.0 / t2a.cnt) / (t2b.cnt * 1.0 / t3.cnt), 4) AS promote
FROM
basket_analysis_base t1
LEFT JOIN product_counts t2a ON t1.product_a = t2a.product
LEFT JOIN product_counts t2b ON t1.product_b = t2b.product
CROSS JOIN (SELECT COUNT(DISTINCT order_id) AS cnt FROM product_order) t3
ORDER BY
promote DESC;