目录
一、基于行的窗口函数——行的起点~行的终点
二、基于值的窗口函数——值的起点~值的终点
三、基于分区的窗口函数
四、基于缺省的窗口函数
五、lead与lag
六、first_value和last_value
七、排名函数——rank(113)、dense_rank(112)、row_number(123)
八、NTILE分桶
九、窗口函数练习题
(一)练习一
1.统计每个用户截至每次下单的累积下单总额
2.统计每个用户截至每次下单的当月累积下单总额
3.统计每个用户每次下单距离上次下单相隔的天数(首次下单按0天算)
4.查询所有下单记录以及每个用户的每个下单记录所在月份的首/末次下单日期
5.为每个用户的所有下单记录按照订单金额进行排名
(二)练习二
1.查询所有用户的下单次数
2.查询每个用户下单的次数
3.查询每个用户下单总额
4.查询每个用户每月下单总额
5.每个人截止到当前日期所下订单总额——每天的累加和
6.rows between and用法
7.ntile用例
注意:between and中间不同颜色的关键词,对应and后面颜色的关键词,且只能在对应颜色范围内选择and后面是什么关键词!
between 后面是起点行:
起点位于当前行之前:
unbounded preceding:负无穷,整个表的第一行
[num] preceding:当前行的前num行,起点[num] preceding>=终点[num] preceding
例如:前面5行作为起点,前面2行作为终点是可行的;
前面2行作为起点,前面5行作为终点是不可行的
起点位于当前行:
current row:当前行
起点位于当前行之后:
[num] following:当前行的后num行,起点[num] following<=终点[num] following
例如:后面5行作为起点,后面2行作为终点是不可行的;
后面2行作为起点,后面5行作为终点是可行的
and后面是终点行:
终点位于当前行之前:
[num] preceding
current row
[num] following
unbounded following:正无穷,整个表的最后一行
终点位于当前行:
current row
[num] following
unbounded following
终点位于当前行之后:
[num] following
unbounded following
计算历史截止到当前的销售额(不区分user_id):
上图单词的含义由行变为值即可,关键词选取规则与行的规则相同,此处省略。
order by [column]:基于行中的order by是指定行的顺序,只有在每行的顺序固定的前提下,对每行进行聚合才是有意义的。基于值的order by目的不是排序,而是比表名基于哪个字段的值来划分窗口。
当窗口是[num] preceding或[num] following来定义时:
order by指定的column必须是[整数]型,才能够进行num的加减!窗口不涉及到加减num的操作,column字段类型随意。字符串按照字符大小比较。
between 后面是起点行:
起点位于当前值之前:
unbounded preceding:负无穷,第一行某个字段的值
[num] preceding:当前值的减去num
起点位于当前行:
current row:当前字段对应的值
起点位于当前行之后:
[num] following:当前值的加上num
and后面是终点行:
终点位于当前值之前:
[num] preceding
current row
[num] following
unbounded following:正无穷,最后一行某个字段的值
终点位于当前行:
current row
[num] following
unbounded following
终点位于当前行之后:
[num] following
unbounded following
统计截止到每个下单日期的销售额(注意下单日期相同的也要相加)
每个用户,截止到每次下单时,历史下单总额
功能:lead函数用于获取当前行之后的行的值。lag函数用于获取当前行之前的行的值。
注意:lead和lag不支持自定义窗口。没有range between and和rows between and。
只需考虑在取行数时是否需要分区和排序。
统计每一次下单记录的 上一次和下一次的下单日期,从而计算每次下单的时间间隔
功能:获取窗口内某一列的第一个值/最后一个值,需要指定字段。
boolean:是否要跳过null值,如果first_value第一行是null值,true就跳过,找下一行;false就不跳过,返回null值;last_value的boolean最后一行是null值是否跳过。
注意:排名函数也不支持自定义窗口。
NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值。
如果切片不均匀,默认增加第一个切片的分布
查询想要每位顾客购买金额前1/3的交易记录
select t.name, order_date, cost
from (
select name,
order_date,
cost,
ntile(3) over (partition by name order by cost desc) as rk
from t_window) t
where t.rk = 1;
-- 数据源
create table order_info
(
order_id string, --订单id
user_id string, -- 用户id
user_name string, -- 用户姓名
order_date string, -- 下单日期
order_amount int -- 订单金额
);
insert overwrite table order_info
values ('1', '1001', '小元', '2022-01-01', '10'),
('2', '1002', '小海', '2022-01-02', '15'),
('3', '1001', '小元', '2022-02-03', '23'),
('4', '1002', '小海', '2022-01-04', '29'),
('5', '1001', '小元', '2022-01-05', '46'),
('6', '1001', '小元', '2022-04-06', '42'),
('7', '1002', '小海', '2022-01-07', '50'),
('8', '1001', '小元', '2022-01-08', '50'),
('9', '1003', '小辉', '2022-04-08', '62'),
('10', '1003', '小辉', '2022-04-09', '62'),
('11', '1004', '小猛', '2022-05-10', '12'),
('12', '1003', '小辉', '2022-04-11', '75'),
('13', '1004', '小猛', '2022-06-12', '80'),
('14', '1003', '小辉', '2022-04-13', '94');
select order_id,
user_id,
user_name,
order_date,
order_amount,
sum(order_amount) over (partition by user_id order by order_date
rows between unbounded preceding and current row ) sum_so_far
from order_info;
select order_id,
user_id,
user_name,
order_date,
order_amount,
sum(order_amount) over (partition by user_id ,month(order_date) order by order_date
rows between unbounded preceding and current row ) sum_so_far
from order_info;
select order_id,
user_id,
user_name,
order_date,
order_amount,
datediff(order_date, (lag(order_date, 1, order_date) over
(partition by user_id order by order_date))) as diff
from order_info;
select order_id,
user_id,
user_name,
order_date,
order_amount,
first_value(order_date) over-- 不需要管理null值,默认false
(partition by user_id ,substring(order_date, 1, 7) order by order_date) as first_date,
last_value(order_date) over
(partition by user_id ,substring(order_date, 1, 7) order by order_date
rows between unbounded preceding and unbounded following) as last_date
from order_info;
注意:这里选取每月最后一次下单日期需要考虑正无穷,选取每月第一次下单日期的rows between and可写可不写。
select order_id,
user_id,
user_name,
order_date,
order_amount,
-- 注意排名函数的()内没有值
rank() over (partition by user_id order by order_amount desc ) as rk,
dense_rank() over (partition by user_id order by order_amount desc ) as drk,
row_number() over (partition by user_id order by order_amount desc ) as rn
from order_info;
数据源——t_window表
select name, count(*) over () from t_window where substring(order_date, 1, 7) = '2015-04';
-- 对名字进行去重的两种方法 select distinct name, count(*) over () from t_window where substring(order_date, 1, 7) = '2015-04'; select name, count(*) over () from t_window where substring(order_date, 1, 7) = '2015-04' group by name;
select distinct name, count(*) over (partition by name) as num from t_window where substring(order_date, 1, 7) = '2015-04';
select name, order_date, cost, sum(cost) over (partition by name) cost_sum from t_window;
-- order by之前
select name, order_date, cost, sum(cost) over (partition by month(order_date)) from t_window;
-- order by之后
select name, order_date, cost, sum(cost) over (partition by month(order_date) order by order_date) from t_window;
select name, order_date, cost, sum(cost) over (partition by name order by order_date) from t_window;
select name , order_date , cost , sum(cost) over () as row1-- 全表数据cost的总和,661 , sum(cost) over (partition by name) as row2-- 个人累计消费总和 , sum(cost) over (partition by name order by order_date rows between unbounded preceding and unbounded following) as row22 , sum(cost) over (partition by name order by order_date) as row3-- 个人截止当前日期累计消费总和 , sum(cost) over (partition by name order by order_date rows between unbounded preceding and current row ) as row4-- 效果同row3 , sum(cost) over (partition by name order by order_date rows between 1 preceding and current row ) as row5-- 当前消费额与上一次消费额的总和 , sum(cost) over (partition by name order by order_date rows between 1 preceding and 1 following ) as row6-- 上一次和下一次的消费总和 , sum(cost) over (partition by name order by order_date rows between current row and unbounded following) as row7-- 当前消费到最后一次消费总和 , sum(cost) over (partition by name order by order_date rows between current row and 3 following) as row8-- 距离当前日期最近的后三次消费总和 from t_window;
select name , order_date , cost , ntile(3) over () as row1 , ntile(3) over (partition by name) as row2-- mark 四条数据,4%3=1,多出来的一条又回到第一 -- ,ntile(3) over (order by order_date) as row3-- 排完序以后再切片 , ntile(3) over (order by name) as row3-- 按照name的首字母进行排序 , ntile(3) over (partition by name order by cost desc ) as rows4 from t_window;