分析函数 over(partition by 列名 order by 列名 rows between 开始位置 and 结束位置)
over()
函数中包括三个函数:包括分区partition by 列名
、排序order by 列名
、指定窗口范围rows between 开始位置 and 结束位置
。① LEAD(col,n,DEFAULT)
② LAG(col,n,DEFAULT)
③ FIRST_VALUE
④ LAST_VALUE
1.使用标准的聚合函数COUNT、SUM、MIN、MAX、AVG
2.使用PARTITION BY
语句,使用一个或者多个原始数据类型的列
3.使用PARTITION BY与ORDER BY
语句,使用一个或者多个数据类型的分区或者排序列
4.使用窗口规范,窗口规范支持以下格式:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
5.窗口范围说明:
我们常使用的窗口范围是ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
(表示从起点到当前行),常用该窗口来计算累加。
PRECEDING
:往前
FOLLOWING
:往后
CURRENT ROW
:当前行
UNBOUNDED
:起点(一般结合PRECEDING,FOLLOWING使用)
UNBOUNDED PRECEDING
表示该窗口最前面的行(起点)
UNBOUNDED FOLLOWING
:表示该窗口最后面的行(终点)
比如说:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
(表示从起点到当前行)
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING
(表示往前2行到往后1行)
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
(表示往前2行到当前行)
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
(表示当前行到终点)
当ORDER BY
后面缺少窗口从句条件(即 分析函数 over(partition by 列名 order by 列名)
),窗口规范默认是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
当ORDER BY
和窗口从句都缺失(即分析函数 over(partition by 列名)
), 窗口规范默认是 ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
注意:OVER
从句支持以下函数, 但是并不支持和窗口一起使用它们:
Ranking
函数: Rank, NTile, DenseRank, CumeDist, PercentRank.
Lead
和 Lag
函数(即 Ranking
函数 不能和 Lead
、Lag
函数一起使用)
①RANK
:从1开始,按照顺序按照值排序时产生一个自增编号,值相等时会重复,会产生空位(如:1、2、3、3、3、6)
②ROW_NUMBER
:从1开始,按照顺序,按照值排序时产生一个自增编号,不会重复(如:1、2、3、4、5、6)
③DENSE_RANK
:从1开始,按照值排序时产生一个自增编号,值相等时会重复,不会产生空位(如:1、2、3、3、3、4)
④ CUME_DIST
:小于等于当前值的行数/分组内总行数。比如,统计小于等于当前薪水的人数,所占总人数的比例
⑤PERCENT_RANK
:分组内当前行的RANK值-1/分组内总行数-1
⑥ NTILE(n)
:用于将分组数据按照顺序切分成n片,返回当前切片值,如果切片不均匀,默认增加第一个切片的分布。NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
⑦Distinct
: 去重。如COUNT(DISTINCT a) OVER (PARTITION BY c)
本小节主要讲解COUNT、SUM、MIN、MAX、AVG
的用法。
-------------------------------------------------------------------------
//订单表order : name, order_date, cost
Jan,2020-01-01,87
Jan,2020-01-02,95
Jan,2020-03-03,68
Jan,2020-05-01,68
Ben,2020-04-01,94
Ben,2020-01-02,56
Ben,2020-04-03,84
Ben,2020-05-01,84
Dan,2020-02-01,64
Dan,2020-03-02,86
Dan,2020-04-03,84
Dan,2020-04-01,84
Tim,2020-03-01,65
Tim,2020-02-02,85
Tim,2020-01-03,78
Tim,2020-04-01,78
Bob,2020-02-01,67
Bob,2020-03-02,95
Bob,2020-04-03,70
Bob,2020-05-01,70
------------------------------------------------------------------------
create table order_test(
name string,
order_date string,
cost int)
row format delimited fields terminated by ',';
-------------------------------------------------------------------------
load data local inpath '/tmp/order.txt' into table order;
//COUNT、SUM、MIN、MAX、AVG
//以 sum函数举例
select
name,
order_date,
cost,
--① over():所有的数据求和。sum_01是一样的。
--求和范围:order_test表的所有数据
sum(cost) over() as sum_01,
--② over(partition by) :按照name分组,对分组相加。组内的sum_02是一个值。
--求和范围:以name分组,每个组内求和
sum(cost) over(partition by name) as sum_02,
--③ over(parition by order by):按照name分组,对分组按照时间升序累加。组内sum_03是一个变化的累加值
--求和范围:默认为从起点到当前行。以name分组,每个组内按order_date累计求和。注意和②的区别
sum(cost) over(partition by name order by order_date) as sum_03,
--④ between unbounded preceding and current row : 同③,从起点到当前行。(order by 省略窗口范围的默认范围)
sum(cost) over(partition by name order by order_date rows between unbounded preceding and current row) as sum_04,
--⑤ rows between n preceding and current row: 以name分组,当前行和前面n行做聚合。聚合的行数为: n+1(当前行)。 示例为 n=1的情况
sum(cost) over(partition by name order by order_date rows between 1 preceding and current row) as sum_05,
--⑥ rows between n1 preceding and n2 following:以name分组,当前行 + 前n1行 +后n2行做聚合。聚合的行数为:n1+n2+1(当前行) 示例n1 n2 =1
sum(cost) over(partition by name order by order_date rows between 1 preceding and 1 following) as sum_06,
--⑦rows between current row and unbounded following:以name分组,当前行+后面的所有行
sum(cost) over(partition by name order by order_date rows between current row and unbounded following) as sum_07
from order_test;
执行上面的sql:
select
name,
order_date,
cost,
sum(cost) over() as sum_01,
sum(cost) over(partition by name) as sum_02,
sum(cost) over(partition by name order by order_date) as sum_03,
sum(cost) over(partition by name order by order_date rows between unbounded preceding and current row) as sum_04,
sum(cost) over(partition by name order by order_date rows between 1 preceding and current row) as sum_05,
sum(cost) over(partition by name order by order_date rows between 1 preceding and 1 following) as sum_06,
sum(cost) over(partition by name order by order_date rows between current row and unbounded following) as sum_07
from order_test;
查询结果如下:
name order_date cost sum_01 sum_02 sum_03 sum_04 sum_05 sum_06 sum_07
Ben 2020-01-02 56 1562 318 56 56 56 150 318
Ben 2020-04-01 94 1562 318 150 150 150 234 262
Ben 2020-04-03 84 1562 318 234 234 178 262 168
Ben 2020-05-01 84 1562 318 318 318 168 168 84
Bob 2020-02-01 67 1562 302 67 67 67 162 302
Bob 2020-03-02 95 1562 302 162 162 162 232 235
Bob 2020-04-03 70 1562 302 232 232 165 235 140
Bob 2020-05-01 70 1562 302 302 302 140 140 70
Dan 2020-02-01 64 1562 318 64 64 64 150 318
Dan 2020-03-02 86 1562 318 150 150 150 234 254
Dan 2020-04-01 84 1562 318 234 234 170 254 168
Dan 2020-04-03 84 1562 318 318 318 168 168 84
Jan 2020-01-01 87 1562 318 87 87 87 182 318
Jan 2020-01-02 95 1562 318 182 182 182 250 231
Jan 2020-03-03 68 1562 318 250 250 163 231 136
Jan 2020-05-01 68 1562 318 318 318 136 136 68
Tim 2020-01-03 78 1562 306 78 78 78 163 306
Tim 2020-02-02 85 1562 306 163 163 163 228 228
Tim 2020-03-01 65 1562 306 228 228 150 228 143
Tim 2020-04-01 78 1562 306 306 306 143 143 78
为了便于理解,我们以ben
的四条数据,详细说明sum_01 至 sum_07的计算结果
name order_date cost sum_01 sum_02 sum_03 sum_04 sum_05 sum_06 sum_07
Ben 2020-01-02 56 1562 318 56 56 56 150 318
Ben 2020-04-01 94 1562 318 150 150 150 234 262
Ben 2020-04-03 84 1562 318 234 234 178 262 168
Ben 2020-05-01 84 1562 318 318 318 168 168 84
①sum_01
-——sum(cost) over() as sum_01
sum_01 = 1562
是表总所有人的消费总额
②sum_02
—— sum(cost) over(partition by name) as sum_02
sum_02 按name分组,是ben的消费总额 sum_02 =318 = 56+94+84+84
③sum_03
——sum(cost) over(partition by name order by order_date) as sum_03
按照name分组,对分组按照时间升序累加。默认为从起点到当前行
sum_03 = 56 : 起点到当前行,两行重合, 只有一条消费记录:56
sum_03 = 150: 第一行加上第二行: 56+94 = 150
sum_03 = 234: 从第一行加到第三行 56+94+84 = 234
sum_03 = 318:同理累加
④sum_04
——between unbounded preceding and current row
同③,从起点到当前行。(order by 省略窗口范围的默认范围)
⑤sum_05
——rows between n preceding and current row
:
以name分组,当前行和前面n行做聚合。聚合的行数为: n+1(当前行)。 示例为 n=1的情况。
sum_05 = 56:此时当前行是第1行,前面没有行。总共就一行。
sum_05 = 150:此时当前行是第二行94,前面一行是56,相加 = 150
sum_05 = 178: 此时当前行是第三行84,前面一行是84,相加 = 178
sum_05 = 168:同理可得。
⑥ sum_06
——rows between n1 preceding and n2 following
:
以name分组,当前行 + 前n1行 +后n2行做聚合。聚合的行数为:n1+n2+1(当前行) 示例n1,n2 =1
sum_06 = 150:此时当前行是第一行56,没有前一行,但是有后面(第二行 94):56+94 = 150
sum_06 = 234:此时当前行是第二行94,前面一行56,后面一行84, 94+56+84 = 234
sum_06 = 262:同理
sum_06 = 168:同理
⑦sum_07
——rows between current row and unbounded following
:
以name分组,当前行+后面的所有行
sum_07 = 318:当前行是第一行56,后面所有行:96,84,84。56+96+84+84 = 318
sum_07 = 262:当前行是第二行94,后面所有行:84,84。 94+84+84 = 262
sum_07 = 168:同理
sum_07 = 84: 同理。
用法小结:
sum
为例,演示了聚合函数的用法,其他COUNT、AVG,MIN,MAX
,和SUM
用法一样。
结果和ORDER BY相关,默认为升序
如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY,则将分组内所有值累加;
关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:无界限(起点或终点)
UNBOUNDED PRECEDING:表示从前面的起点
UNBOUNDED FOLLOWING:表示到后面的终点
select
name,
order_date,
cost,
- - row_number() 自然顺序
row_number() over(partition by name order by order_date) as rn,
- -rank(): 相同重复,留下空位,排名总数不变
rank() over(partition by name order by order_date) as rk,
- - dens_rank():相同重复,不留空位,排名总数减少
dense_rank() over(partition by name order by order_date) as den_rk
from order_test
## 由于示例测试数据没有用户在同一天购买,所以测试效果不明显。关键在于理解三个函数的不同点。
select
name,
order_date,
cost,
- - 全局数据切片,切成3片
ntile(3) over() as n1,
- - 按照name 分组,然后组内数据切成3份
ntile(3) over(partition by name) as n2,
- - 全局按照cost(升序)排序,将数据切成3份
ntile(3) over(order by cost) as n3,
- -按照name分组,组内按照cost升序排列,将数据切成3份
ntile(3) over(partition by name order by cost) as n4
from order_test;
查询结果:
name order_date cost n1 n2 n3 n4
Ben 2020-01-02 56 3 1 1 1
Ben 2020-05-01 84 2 3 2 1
Ben 2020-04-03 84 2 2 2 2
Ben 2020-04-01 94 3 1 3 3
Bob 2020-02-01 67 1 3 1 1
Bob 2020-04-03 70 1 1 1 1
Bob 2020-05-01 70 1 2 2 2
Bob 2020-03-02 95 1 1 3 3
Dan 2020-02-01 64 2 1 1 1
Dan 2020-04-01 84 2 3 2 1
Dan 2020-04-03 84 2 1 2 2
Dan 2020-03-02 86 2 2 3 3
Jan 2020-05-01 68 3 1 1 1
Jan 2020-03-03 68 3 1 1 1
Jan 2020-01-01 87 3 3 3 2
Jan 2020-01-02 95 3 2 3 3
Tim 2020-03-01 65 2 1 1 1
Tim 2020-04-01 78 1 3 2 1
Tim 2020-01-03 78 1 2 2 2
Tim 2020-02-02 85 1 1 3 3
小于等于当前值的行数/分组内总行数。比如,统计小于等于当前薪水的人数,所占总人数的比例
示例
todo
分组内当前行的RANK值-1/分组内总行数-1
示例
todo
row_number() + rand()
① 先群求出表的总记录数
with data_cnt (
select
count(*) as cnt
from t
)
② row_number() + rand()
select
*
from
(
select
*
row_number() over(
partition by t.date,
order by rand()
) rand_rk
from t
)t1
where t1.rand_rk / date_cnt.cnt < x% (目标百分比)
用于统计窗口内往上(向后)第n行值,第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
示例:统计顾客上一次 和 上两次购买的时间
select
name,
order_date,
cost,
- - 计算每个顾客上一次购买的时间,如果没有默认为1900-01-01
lag(order_date,1,"1900-01-01") over(partition by name order by order_date) as last_time_01,
- - 计算每个顾客上两次购买的时间,如果没有默认为null
lag(order_date,2) over(partition by name order by order_date) as last_time_02
from order_test;
执行结果:
name order_date cost last_time_01 last_time_02
Ben 2020-01-02 56 1900-01-01 NULL
Ben 2020-04-01 94 2020-01-02 NULL
Ben 2020-04-03 84 2020-04-01 2020-01-02
Ben 2020-05-01 84 2020-04-03 2020-04-01
Bob 2020-02-01 67 1900-01-01 NULL
Bob 2020-03-02 95 2020-02-01 NULL
Bob 2020-04-03 70 2020-03-02 2020-02-01
Bob 2020-05-01 70 2020-04-03 2020-03-02
Dan 2020-02-01 64 1900-01-01 NULL
Dan 2020-03-02 86 2020-02-01 NULL
Dan 2020-04-01 84 2020-03-02 2020-02-01
Dan 2020-04-03 84 2020-04-01 2020-03-02
Jan 2020-01-01 87 1900-01-01 NULL
Jan 2020-01-02 95 2020-01-01 NULL
Jan 2020-03-03 68 2020-01-02 2020-01-01
Jan 2020-05-01 68 2020-03-03 2020-01-02
Tim 2020-01-03 78 1900-01-01 NULL
Tim 2020-02-02 85 2020-01-03 NULL
Tim 2020-03-01 65 2020-02-02 2020-01-03
Tim 2020-04-01 78 2020-03-01 2020-02-02
用于统计窗口内往下(向前)第n行值,第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)。 记忆 lag(落后) ——> current(当前) ——>lead(领先)
示例:统计顾客下一次 和 下两次购买的时间
select
name,
order_date,
cost,
- - 计算每个顾客下一次购买的时间,如果没有默认为1900-01-01
lead(order_date,1,"1900-01-01") over(partition by name order by order_date) as next_time_01,
- - 计算每个顾客下两次购买的时间,如果没有默认为null
lead(order_date,2) over(partition by name order by order_date) as next_time_02
from order_test;
执行结果:
name order_date cost next_time_01 next_time_02
Ben 2020-01-02 56 2020-04-01 2020-04-03
Ben 2020-04-01 94 2020-04-03 2020-05-01
Ben 2020-04-03 84 2020-05-01 NULL
Ben 2020-05-01 84 1900-01-01 NULL
Bob 2020-02-01 67 2020-03-02 2020-04-03
Bob 2020-03-02 95 2020-04-03 2020-05-01
Bob 2020-04-03 70 2020-05-01 NULL
Bob 2020-05-01 70 1900-01-01 NULL
Dan 2020-02-01 64 2020-03-02 2020-04-01
Dan 2020-03-02 86 2020-04-01 2020-04-03
Dan 2020-04-01 84 2020-04-03 NULL
Dan 2020-04-03 84 1900-01-01 NULL
Jan 2020-01-01 87 2020-01-02 2020-03-03
Jan 2020-01-02 95 2020-03-03 2020-05-01
Jan 2020-03-03 68 2020-05-01 NULL
Jan 2020-05-01 68 1900-01-01 NULL
Tim 2020-01-03 78 2020-02-02 2020-03-01
Tim 2020-02-02 85 2020-03-01 2020-04-01
Tim 2020-03-01 65 2020-04-01 NULL
Tim 2020-04-01 78 1900-01-01 NULL
取分组内排序后,截止到当前行,最后一个值
示例: 求每个用户第一次 和最后一次购买的时间。
select
name,
order_date,
cost,
- - ① 第一次购买时间
first_value(order_date) over(partition by name order by order_date) as first_time_01,
- - ② 最后一次购买时间
last_value(order_date) over(partition by name order by order_date) as last_time_01,
- - ③ 使用 last_value + order by desc 并不能取到最小值(第一次购买时间),
- - 一定要注意范围: 分组后,从起始行到当前行。 desc后 分组起始第一行都是最大时间,所以实际取得的是最后一次购买时间
last_value(order_date) over(partition by name order by order_date desc) as fisrt_time_02,
- - ④ 使用 first_value + order by desc 取到最大值(最后一次购买时间)
- - 一定要注意范围: 分组后,从起始行到当前行
first_value(order_date) over(partition by name order by order_date desc) as last_time_02,
- - ⑤ 使用 row_number() 取 rn = 1 可以获得最小值,但是需要再嵌套一层
row_number() over(partition by name order by order_date) as rn
from order_test;
-------------------------纯sql-------------------------------------
select
name,
order_date,
cost,
first_value(order_date) over(partition by name order by order_date) as first_time_01,
last_value(order_date) over(partition by name order by order_date) as last_time_01,
last_value(order_date) over(partition by name order by order_date desc) as fisrt_time_02,
first_value(order_date) over(partition by name order by order_date desc) as last_time_02,
row_number() over(partition by name order by order_date) as rn
from order_test;
执行结果:
name order_date cost first_time_01 last_time_01 first_time_02 last_time_02 rn
Ben 2020-05-01 84 2020-01-02 2020-05-01 2020-05-01 2020-05-01 4
Ben 2020-04-03 84 2020-01-02 2020-04-03 2020-04-03 2020-05-01 3
Ben 2020-04-01 94 2020-01-02 2020-04-01 2020-04-01 2020-05-01 2
Ben 2020-01-02 56 2020-01-02 2020-01-02 2020-01-02 2020-05-01 1
Bob 2020-05-01 70 2020-02-01 2020-05-01 2020-05-01 2020-05-01 4
Bob 2020-04-03 70 2020-02-01 2020-04-03 2020-04-03 2020-05-01 3
Bob 2020-03-02 95 2020-02-01 2020-03-02 2020-03-02 2020-05-01 2
Bob 2020-02-01 67 2020-02-01 2020-02-01 2020-02-01 2020-05-01 1
Dan 2020-04-03 84 2020-02-01 2020-04-03 2020-04-03 2020-04-03 4
Dan 2020-04-01 84 2020-02-01 2020-04-01 2020-04-01 2020-04-03 3
Dan 2020-03-02 86 2020-02-01 2020-03-02 2020-03-02 2020-04-03 2
Dan 2020-02-01 64 2020-02-01 2020-02-01 2020-02-01 2020-04-03 1
Jan 2020-05-01 68 2020-01-01 2020-05-01 2020-05-01 2020-05-01 4
Jan 2020-03-03 68 2020-01-01 2020-03-03 2020-03-03 2020-05-01 3
Jan 2020-01-02 95 2020-01-01 2020-01-02 2020-01-02 2020-05-01 2
Jan 2020-01-01 87 2020-01-01 2020-01-01 2020-01-01 2020-05-01 1
Tim 2020-04-01 78 2020-01-03 2020-04-01 2020-04-01 2020-04-01 4
Tim 2020-03-01 65 2020-01-03 2020-03-01 2020-03-01 2020-04-01 3
Tim 2020-02-02 85 2020-01-03 2020-02-02 2020-02-02 2020-04-01 2
Tim 2020-01-03 78 2020-01-03 2020-01-03 2020-01-03 2020-04-01 1
特别注意:
first_value() + order by desc
这样永远取的都是第一行,且desc倒序,第一行为最大值。rows between ... and ...
),即不能指定窗口的范围。所以order by 后面不能指定范围,那么对应的默认范围就是 rows between unbounded preceding and current row
,从起始位置到当前行。
明确窗口的范围: 分组后,从起点行到当前行。 last_value 取范围内的最后一条数据。
原数数据:
name order_date cost
Ben 2020-01-02 56
Ben 2020-04-03 84
Ben 2020-05-01 84
Ben 2020-04-01 94
partition by name order by order_date desc 后的数据:
name order_date cost
Ben 2020-05-01 84 (第一条数据)
Ben 2020-04-03 84 (第二条数据)
Ben 2020-04-01 94 (第三条数据)
Ben 2020-01-02 56 (第四条数据)
last_value(order_date) :取窗口范围内的最后一条数据的时间。
第一行数据:起始行就是当前行,整个窗口范围就只有一条数据,last_value(order_date) 只能取 2020-05-01
得到:
name order_date cost first_time_02
Ben 2020-05-01 84 2020-05-01
第二行数据:起始行第一行,当前行第二行。窗口范围 第一行到第二行, last_value(order_date)去最后一条数据时间 2020-04-03
得到:
name order_date cost first_time_02
Ben 2020-05-01 84 2020-05-01
Ben 2020-04-03 84 2020-04-03
其他同理可得。
todo
hive wiki
https://blog.csdn.net/scgaliguodong123_/article/details/60135385
Hive面试/经典练习题