date | user | age | programid | Playtime |
---|---|---|---|---|
20190421 | u1 | 30 | a | 4min |
20190421 | u1 | 30 | b | 10min |
20190421 | u2 | 27 | a | 2min |
20190422 | u3 | 35 | c | 3min |
20190422 | u2 | 27 | d | 1min |
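When marking with `case when`, first consider whether the rule can be expressed directly as a formula.
When using `group by`, remember that grouping is evaluated before window functions.
Deduplication with `group by` or `distinct`: it can usually be done in the outer query; deduplicating inside a subquery makes the query less efficient.
Implementing `not in` logic.
`row_number()` marking: the second layer reverses the order of the marks, then `case when` + `row_number()` marks the extreme value; the third layer aggregates and filters out the unmarked nulls, which yields the extreme value. This approach keeps the complete data until the outermost aggregation, but three nested layers plus window functions produce too many jobs.
Note that the source data is not in a clean format.
Create a new table, rename the `Playtime` column to `playtime`, and convert its type from `string` to `int`.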
-- Create the table
create table test(
date bigint,
user string,
age int,
programid string,
playtime int
)
row format delimited
fields terminated by '\t'
stored as textfile
;
-- Load the data: strip the trailing 'min' from Playtime and cast the rest to int
insert into test
select date, user, age, programid
, cast(substr(Playtime, 1, length(Playtime) - 3) as int)
from test_old
;
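The `substr` conversion above relies on every value ending in exactly the three characters `min`. A slightly more defensive variant, sketched here as an alternative and not the author's method, extracts the leading digits with Hive's `regexp_extract` instead. It assumes the same source table `test_old` and that any value without leading digits should end up as `NULL`.

```sql
-- Sketch: take the leading run of digits from Playtime and cast it to int.
-- regexp_extract returns an empty string when nothing matches,
-- and casting an empty string to int gives NULL.
insert into test
select date, user, age, programid
, cast(regexp_extract(Playtime, '^([0-9]+)', 1) as int)
from test_old
;
```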
The result is as follows:
date | user | age | programid | playtime |
---|---|---|---|---|
20190421 | u1 | 30 | a | 4 |
20190421 | u1 | 30 | b | 10 |
20190421 | u2 | 27 | a | 2 |
20190422 | u3 | 35 | c | 3 |
20190422 | u2 | 27 | d | 1 |
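This is a summary statistic, so the final output is a single row.
The outermost query therefore needs no `group by`; aggregate functions alone are enough.
Correspondingly, the inner query has to produce three attributes: user, age, and total playtime. So the inner query uses grouped aggregation.
-- Inner query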
select user
, age
, sum(playtime) sumtime
from test
group by user, age
;
-- Summary
select count(user)
, round(avg(age), 2)
, round(avg(sumtime), 2)
from(
select user
, age
, sum(playtime) sumtime
from test
group by user, age
) temp
;
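This is a typical mark-then-aggregate problem; users still need to be aggregated in the inner query.
The point to watch is how to do the marking.
-- Inner query: additionally mark the age band
select user
, max(case
when age between 0 and 9 then 0
when age between 10 and 19 then 1
when age between 20 and 29 then 2
when age between 30 and 39 then 3
-- the remaining age bands are marked by hand in the same way; omitted here
end) flag
, sum(playtime) sumtime
from test
group by user, age
;
-- Marking by hand is tedious; an approach borrowed from an expert computes the band directly
select user
, max(floor(age / 10)) flag
, sum(playtime) sumtime
from test
group by user, age
;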
Note: `case when` is the most commonly used marking method. It is simple, but may not be the best choice.
-- Summary: complete the statistics
select flag
, count(user)
, avg(sumtime)
from(
select user
, age
, floor(age / 10) flag
, sum(Playtime) sumtime
from test
group by user, age
) temp
group by flag
;
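The numeric `flag` produced by `floor(age / 10)` is compact but not very readable in a report. As a minimal sketch, the same value can be turned into a label such as `30-39`. The aliases `age_band`, `user_cnt`, and `avg_time` are illustrative names and do not appear in the original.

```sql
-- Sketch: same grouping as above, with a readable age-band label instead of a numeric flag.
select age_band
, count(user) user_cnt
, avg(sumtime) avg_time
from(
select user
, concat(cast(floor(age / 10) * 10 as string), '-',
         cast(floor(age / 10) * 10 + 9 as string)) age_band
, sum(playtime) sumtime
from test
group by user, age
) temp
group by age_band
;
```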
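Finding an extreme value (a maximum or minimum) is a very common task.
The usual SQL pattern is to mark in the inner query and aggregate in the outer query.
Finding an extreme value often needs one more layer after the aggregation to summarize the aggregated result, but a three-layer query of mark -> aggregate -> summarize produces a large number of jobs.
So in practice one either uses an intermediate table or optimizes the SQL logic.
Here the inner query totals the playtime and does the marking with a window function; the outer query does a simple aggregation to pick the extreme value.
The example code uses a window function together with grouping.
-- Inner query: playtime for the same program watched by a user at different times is aggregated here
-- sum(playtime) must be written out explicitly, otherwise the window function cannot be applied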
select user
, programid
, sum(playtime)
, case row_number() over(partition by user order by sum(Playtime) desc)
when 1 then programid end love
from test
group by user, programid
;
-- Summary (job3)
select user
, max(love)
from(
select user
, programid
, sum(Playtime)
, case row_number() over(partition by user order by sum(Playtime) desc)
when 1 then programid end love
from test
group by user, programid
) temp
group by user
;
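A common alternative to the `case` + `max` combination is to keep the row number as a plain column and filter on it in the outer query. This is only a sketch of that variant, not the author's solution; the alias `rn` is an illustrative name. Like the query above, it picks one program arbitrarily when a user has a tie for the longest total playtime.

```sql
-- Sketch: rank each user's programs by total playtime, then keep rank 1.
select user
, programid
from(
select user
, programid
, row_number() over(partition by user order by sum(playtime) desc) rn
from test
group by user, programid
) temp
where rn = 1
;
```

This replaces the outer `group by` with a simple filter, so the outer layer does no further aggregation.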
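This is typical `not in` logic.
The common way to solve it is `left join` + `is null`.
-- Find the users who have a record with playtime of at most 5
-- Do not deduplicate in the inner query, otherwise extra jobs are generated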
select user
from test
where playtime <= 5
;
-- Left outer join, then keep the rows where the right side is null (job 1)
select test.user
from test
left join(
select user
from test
where playtime <= 5
) temp
on test.user=temp.user
where temp.user is null
group by test.user
;
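For this particular condition the same set of users can also be obtained without a join, because "no record with playtime of at most 5" is equivalent to "the smallest playtime is greater than 5". The sketch below shows this alternative and is not the author's method. It assumes `playtime` is never `NULL`; a user whose playtimes are all `NULL` would be returned by the join version but not by this one.

```sql
-- Sketch: a user qualifies only if even their smallest playtime exceeds 5.
select user
from test
group by user
having min(playtime) > 5
;
```

On the sample data above, every user has at least one record with a playtime of 5 or less, so both this query and the `left join` version return no rows.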