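Common clauses such as GROUP BY and DISTRIBUTE BY cannot express computations like per-group ranking or moving averages, because GROUP BY returns a single row per group rather than one row per input record. Hive 0.11 introduced windowing and analytics functions: with the OVER clause, a function is evaluated over a partition and a window frame, so you get group-level analysis while still producing one result row for every input row.
The syntax is:
Function(arg1, ..., argn) OVER ([PARTITION BY <...>] [ORDER BY <...>] [window_clause])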
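Windowing functions | Description |
---|---|
LEAD(col, n, DEFAULT) | Returns the value of col from the row n rows after the current row within the window. col is the column name; n is the offset (optional, default 1); DEFAULT is the value returned when that row falls beyond the end of the window (NULL if not specified) |
LAG(col, n, DEFAULT) | Returns the value of col from the row n rows before the current row within the window. col is the column name; n is the offset (optional, default 1); DEFAULT is the value returned when that row falls before the start of the window (NULL if not specified) |
FIRST_VALUE(col, false) | Returns the first value in the partition after ordering, up to the current row. col is the column to analyze; the second argument is a boolean that says whether to skip NULL values (default false) |
LAST_VALUE(col, false) | Returns the last value in the partition after ordering, up to the current row. The arguments have the same meaning as for FIRST_VALUE |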
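
As a minimal sketch of LAG and LEAD, the query below looks one row back and one row ahead within each partition. The inline rows and the aliases prev_cost and next_cost are made up for illustration; 0 is passed as the default so the first and last rows do not return NULL.

```sql
-- Hypothetical inline data: three rows for key 1.
select key, product_code, cost,
       -- value from the previous row by cost; 0 when there is no previous row
       lag(cost, 1, 0)  over (partition by key order by cost) as prev_cost,
       -- value from the next row by cost; 0 when there is no next row
       lead(cost, 1, 0) over (partition by key order by cost) as next_cost
from (
  select 1 as key, 'UK' as product_code, 20 as cost
  union all
  select 1 as key, 'US' as product_code, 10 as cost
  union all
  select 1 as key, 'EU' as product_code, 5 as cost
) t;
```

For the row with cost 10, prev_cost is 5 and next_cost is 20; the row with cost 5 gets prev_cost 0, and the row with cost 20 gets next_cost 0.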
Example: the last value within a partition
select key,product_code,cost,
count(*) over (partition by key) as count,
LAST_VALUE(cost) over (partition by key) as last_value_1,
LAST_VALUE(cost) over (partition by key order by cost desc) as last_value_2
from
(
select key,product_code,cost
from
(
select 1 as key,'UK' as product_code,20 as cost
union
select 1 as key,'US' as product_code,10 as cost
union
select 1 as key,'EU' as product_code,5 as cost
union
select 2 as key,'UK' as product_code,3 as cost
union
select 2 as key,'EU' as product_code,6 as cost
)unioned
distribute by key,product_code,cost
)mid
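Result: the last column shows the last value up to the current row after ordering, which is simply the current row's own cost. The second-to-last column has no ORDER BY, so its window is the whole partition (unbounded on both sides) and the row order inside it is undefined; the value it returns depends on the order in which the rows happen to arrive and should not be relied on.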
key,product_code,cost,count,last_value_1,last_value_2
1 UK 20 3 10 20
1 US 10 3 10 10
1 EU 5 3 10 5
2 EU 6 2 3 6
2 UK 3 2 3 3
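
If the goal is the true last value of the whole partition, one option is to keep the ORDER BY and widen the frame explicitly. The sketch below reuses the sample rows from the example; the alias last_value_all is made up for illustration.

```sql
-- Same sample rows as above; the explicit frame covers the whole partition.
select key, product_code, cost,
       last_value(cost) over (
         partition by key
         order by cost desc
         rows between unbounded preceding and unbounded following
       ) as last_value_all
from (
  select 1 as key, 'UK' as product_code, 20 as cost
  union all
  select 1 as key, 'US' as product_code, 10 as cost
  union all
  select 1 as key, 'EU' as product_code, 5 as cost
  union all
  select 2 as key, 'UK' as product_code, 3 as cost
  union all
  select 2 as key, 'EU' as product_code, 6 as cost
) t;
```

With the frame spanning the entire partition, every row of key 1 returns 5 and every row of key 2 returns 3, the smallest cost under the descending sort.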
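Function | Description |
---|---|
ROW_NUMBER() | Numbers the rows of a partition sequentially, starting from 1, with no duplicates and no gaps |
RANK() | Returns the rank of a row within its partition; equal values share a rank and leave gaps in the sequence, e.g. 1,2,2,4,5 |
DENSE_RANK() | Returns the rank of a row within its partition; equal values share a rank and no gaps are left, e.g. 1,2,2,3,4 |
CUME_DIST() | (number of rows with a value less than or equal to the current value) / (total rows in the partition); for example, the share of employees whose salary is less than or equal to the current salary |
PERCENT_RANK() | (RANK of the current row within the partition - 1) / (total rows in the partition - 1) |
Note: all of the functions above are evaluated within the partition. For both the ranks and the ratios, the numerator and the denominator are partition-level figures.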
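
To see how the ranking functions differ when there is a tie, here is a small sketch on made-up rows where two products share the same cost. The inline data (including the 'JP' row) and the column aliases are illustrative only.

```sql
-- Hypothetical rows for key 1 with a tie at cost = 10.
select key, product_code, cost,
       row_number()   over (partition by key order by cost desc) as rn,
       rank()         over (partition by key order by cost desc) as rnk,
       dense_rank()   over (partition by key order by cost desc) as dense_rnk,
       cume_dist()    over (partition by key order by cost desc) as cume,
       percent_rank() over (partition by key order by cost desc) as pct_rnk
from (
  select 1 as key, 'UK' as product_code, 20 as cost
  union all
  select 1 as key, 'US' as product_code, 10 as cost
  union all
  select 1 as key, 'JP' as product_code, 10 as cost
  union all
  select 1 as key, 'EU' as product_code, 5 as cost
) t;
```

The two tied rows get rn 2 and 3 (in arbitrary order), rnk 2 and 2, and dense_rnk 2 and 2; the last row gets rnk 4 but dense_rnk 3.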
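OVER clause
OVER can be used with the standard aggregate functions: count, sum, min, max, avg
OVER is usually combined with PARTITION BY and ORDER BY
It also accepts a window specification, in the following formats: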
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
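
A sliding average is a typical use of an explicit window specification. The sketch below averages each row with its immediate neighbours in cost order; the inline rows and the alias moving_avg are made up for illustration.

```sql
-- Hypothetical rows for key 1; the frame is the previous, current and next row.
select key, product_code, cost,
       avg(cost) over (
         partition by key
         order by cost
         rows between 1 preceding and 1 following
       ) as moving_avg
from (
  select 1 as key, 'UK' as product_code, 20 as cost
  union all
  select 1 as key, 'US' as product_code, 10 as cost
  union all
  select 1 as key, 'EU' as product_code, 5 as cost
) t;
```

At the edges of the partition the frame simply contains fewer rows: the row with cost 5 averages 5 and 10 (7.5), and the row with cost 20 averages 10 and 20 (15).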
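Note: use ORDER BY with care, because it changes the default window range
ORDER BY and window clause | Behavior |
---|---|
ORDER BY present, no window clause | The window defaults to everything up to the current row: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW |
No ORDER BY and no window clause | The window is unrestricted, with no lower or upper bound: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
Example: the difference between having and not having ORDER BY (the default window range differs)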
select key,product_code,cost,
count(*) over (partition by key) as count,
sum(cost) over (partition by key) as total_costs,
sum(cost) over (partition by key order by cost) as total_current_costs
from
(
select key,product_code,cost
from
(
select 1 as key,'UK' as product_code,20 as cost
union
select 1 as key,'US' as product_code,10 as cost
union
select 1 as key,'EU' as product_code,5 as cost
union
select 2 as key,'UK' as product_code,3 as cost
union
select 2 as key,'EU' as product_code,6 as cost
)unioned
distribute by key,product_code,cost
)mid
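Result: the last two columns differ. total_costs covers the whole partition, while total_current_costs is a running total up to the current row.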
key,product_code,cost,count,total_costs,total_current_costs
1 EU 5 3 35 5
1 US 10 3 35 15
1 UK 20 3 35 35
2 UK 3 2 9 3
2 EU 6 2 9 9
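
Because the default frame with ORDER BY is a RANGE frame, rows that tie on the sort key are all included together. The sketch below contrasts that default with an explicit ROWS frame on made-up rows that contain a tie; the data and the aliases range_sum and rows_sum are illustrative only.

```sql
-- Hypothetical rows for key 1 with a tie at cost = 10.
select key, product_code, cost,
       -- default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       sum(cost) over (partition by key order by cost) as range_sum,
       -- explicit physical frame: counts rows one by one
       sum(cost) over (
         partition by key
         order by cost
         rows between unbounded preceding and current row
       ) as rows_sum
from (
  select 1 as key, 'UK' as product_code, 10 as cost
  union all
  select 1 as key, 'US' as product_code, 10 as cost
  union all
  select 1 as key, 'EU' as product_code, 5 as cost
) t;
```

Both tied rows get range_sum 25, while rows_sum is 15 for one of them and 25 for the other; which tied row comes first is not defined.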
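4. Other supported features
DISTINCT | Since Hive 2.1.0, DISTINCT is supported in aggregate functions (sum, count, avg), but not together with ORDER BY or a window specification: count(distinct a) over (partition by c). Since Hive 2.2.0, DISTINCT also works with ORDER BY and a window specification: count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following) |
Aggregate functions | Since Hive 2.1.0, aggregate functions can be used inside the OVER clause: select rank() over(order by sum(b)) |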
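
As a sketch of the second row, the query below ranks groups by their aggregated total in a single statement. It assumes Hive 2.1.0 or later; the inline table with columns a and b, and the aliases total_b and rnk, are hypothetical.

```sql
-- Hypothetical data: two groups, 'x' (total 3) and 'y' (total 10).
select a,
       sum(b) as total_b,
       -- the aggregate is computed per group first, then ranked across groups
       rank() over (order by sum(b)) as rnk
from (
  select 'x' as a, 1 as b
  union all
  select 'x' as a, 2 as b
  union all
  select 'y' as a, 10 as b
) t
group by a;
```

Here group x (total 3) ranks first and group y (total 10) ranks second. On older Hive versions the same result would need the GROUP BY in a subquery and the rank in an outer query.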
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-WINDOWclause