用来作练习的两张表的结构及部分数据如下:可以看到,1个用户可以有很多个订单,每个订单里拥有很多个商品,
题目一:每个用户有多少个订单
思路:在orders表中按照user_id分组,组内统计订单个数(去重)
select user_id, count(distinct order_id) as ord_cnt
from orders
group by user_id
order by ord_cnt desc
limit 20;
题目二:每个用户平均每个订单有多少商品
思路:先根据order_products_prior表算出每个订单有多少商品,按照order_id分组统计商品数,
selct order_id, count(product_id) as prod_cnt
from order_products_prior
group by order_id
再根据orders表和上述结果作join,算出每个用户(用户是重复的)的每个订单有多少商品,然后按照user_id分组,组内计算每个用户的商品数和订单数,求平均
select t1.user_id, sum(t2.prod_cnt)/count(t1.order_id)
from orders t1
join(
selct order_id, count(product_id) as prod_cnt
from order_products_prior
group by order_id
) t2 on t1.order_id = t2.order_id
group by t1.user_id
limit 20;
或者在作join后使用avg函数去求平均值,需要注意的是order_id要没有重复才可用avg
select t1.user_id, avg(t2.prod_cnt)
from orders t1
join(
selct order_id, count(product_id) as prod_cnt
from order_products_prior
group by order_id
) t2 on t1.order_id = t2.order_id
group by t1.user_id
limit 20;
题目三:每个用户在一周中购买订单数量的分布,形如user_id,dow_0,dow_1,…,dow_6
思路:对orders表,先按user_id分组,然后使用case when语句增加dow_0, dow_, …, dow_6这7列字段,对每个新增字段比如dow_0,遍历每行记录作判断,如果该行的order_dow值为0,增对应的dow_0字段的值为1,否则为0;最后组内统计每个新增字段的加和值
select user_id,
sum(case order_dow when '0' then 1 else 0 end) as dow_0,
sum(case order_dow when '1' then 1 else 0 end) as dow_1,
sum(case order_dow when '2' then 1 else 0 end) as dow_2,
sum(case order_dow when '3' then 1 else 0 end) as dow_3,
sum(case order_dow when '4' then 1 else 0 end) as dow_4,
sum(case order_dow when '5' then 1 else 0 end) as dow_5,
sum(case order_dow when '6' then 1 else 0 end) as dow_6
from orders
group by user_id
limit 20;
或者使用count,这时不满足判定条件的不赋值
select user_id,
count(case order_dow when '0' then 1 end) as dow_0,
count(case order_dow when '1' then 1 end) as dow_1,
count(case order_dow when '2' then 1 end) as dow_2,
count(case order_dow when '3' then 1 end) as dow_3,
count(case order_dow when '4' then 1 end) as dow_4,
count(case order_dow when '5' then 1 end) as dow_5,
count(case order_dow when '6' then 1 end) as dow_6
from orders
group by user_id
limit 20;
这里再设想一下,如果orders表中同一天内有重复的order_id,那么then就应该改为then order_id,最后的聚合函数改为count(distinct case…),不过这种情况不太可能会发生,一个用户每次购买的订单,都会有唯一的订单id
题目四:(把days_since_prior_order当作date,会有重复)计算每个用户平均一天(购买那几天)购买商品数量
day1: 2个订单: 订单1有 10个product,订单2有15个product,则共有25个
day2:一个订单 ,12个product
day3:没有购买
user_id平均一天购买商品数为 (25+12)/2
思路:因为orders表中days_since_prior_order字段有的是空字符’’,但是该条记录是一次购买行为,所以要先处理空字符’’,为计算每个用户(user_id有重复的)的购买天数作准备
select user_id, if(days_since_prior_order='','0.0',days_since_prior_order) as dt
from orders
然后求出每个用户(去重)的购买天数
select t.user_id, count(distinct t.dt) as day_cnt
from (
select user_id, if(days_since_prior_order='','0.0',days_since_prior_order) as dt
from orders
) t
group by t.user_id
order by cast(t.user_id as int)
limit 20;
接着根据order_products_prior表计算出每个订单有多少商品
select order_id, count(product_id) as prod_cnt
from order_products_prior
group by order_id
最后将处理空字符后的结果与上述结果根据order_id作join,得到user_id(重复),order_id, dt, prod_cnt,再根据user_id分组,组内计算每个用户购买的商品总数和购买天数
select t1.user_id, sum(t2.prod_cnt)/count(distinct t1.dt) as avg_day_prod_cnt
from (
select user_id, order_id, if(days_since_prior_order='','0.0',days_since_prior_order) as dt
from orders
) t1 join(
select order_id, count(product_id) as prod_cnt
from order_products_prior
group by order_id
) t2 on t1.order_id=t2.order_id
group by t1.user_id
limit 20;
或者使用avg函数处理去重后的天数问题,在上述join后按user_id,dt分组,组内求出商品总数,最后按user_id分组,对prod_cnt使用avg函数
select t3.user_id, avg(t3.prod_cnt) as avg_day_prod_cnt
from(
select t1.user_id, t1.dt, sum(t2.prod_cnt) as prod_cnt
from (
select user_id, order_id, if(days_since_prior_order='','0.0',days_since_prior_order) as dt
from orders
) t1 join(
select order_id, count(product_id) as prod_cnt
from order_products_prior
group by order_id
) t2 on t1.order_id=t2.order_id
group by t1.user_id, t1.dt
)t3
group by t3.user_id
limit 20;
题目五:每个用户最喜爱购买的三个product是什么,最终表结构可以是3个列,或者是一个字符串
user_id | top1 | top2 | top3
123 | prod_23 | prod_12 | prod_35
先根据orders表和order_products_prior表根据order_id作join,得出每个用户(user_id重复)每个订单每个商品,再根据user_id,product_id分组,计算每组的个数即用户购买该商品的次数(user_id, product_id, prod_buy_cnt)
select t1.user_id, t2.product_id, count(1) as prod_buy_cnt
from orders t1
join order_products_prior t2
on t1.order_id=t2.order_id
group by t1.user_id,t2.product_id
接下来涉及到组内排序,这里先介绍row_number() over()排序功能
row_number() over()
在使用 row_number() over()函数时候,over()里头的分组以及排序的执行晚于 where group by order by 的执行。
partition by 用于给结果集分组,如果没有指定那么它把整个结果集作为一个分组,它和聚合函数不同的地方在于它能够返回一个分组中的多条记录,而聚合函数一般只有一个反映统计值的记录。
例如:employee,根据部门分组排序
对查询结果进行排序:(无分组)
row_number() over()返回的是排序后的顺序编号(从1开始)
rank() over()
rank() over() 是跳跃排序,有两个第二名时接下来就是第四名(同样是在各个分组内)
dense_rank() over()
dense_rank() over()是连续排序,有两个第二名时仍然跟着第三名。相比之下row_number是没有重复值的
再回到我们的题目上,接下来使用row_number对一个用户的不同商品,按照购买次数进行降序排列
select t3.user_id, t3.product_id, t3.prod_buy_cnt,
row_number() over (partirion by t3.user_id order by t3.prod_buy_cnt desc) as row_num
from(
select t1.user_id, t2.product_id, count(1) as prod_buy_cnt
from orders t1
join order_products_prior t2
on t1.order_id=t2.order_id
group by t1.user_id,t2.product_id
) t3
最后对上述结果取row_num<4,并按user_id分组后,使用collect_list函数将每个user_id购买数量最多的前三个product_id及其对应的row_num用分隔符’_'拼接后放到一个数组里
select t.user_id, collect_list(concat_ws('_', t.product_id, cast(t.row_num as string))) as top_3_prods
from
(
select t3.user_id, t3.product_id, t3.prod_buy_cnt,
row_number() over (partirion by t3.user_id order by t3.prod_buy_cnt desc) as row_num
from(
select t1.user_id, t2.product_id, count(1) as prod_buy_cnt
from orders t1
join order_products_prior t2
on t1.order_id=t2.order_id
group by t1.user_id,t2.product_id
) t3
)t
where t.row_num<4
group by t.user_id
limit 20;