Hive Windowing and Analytics Functions (WindowingAndAnalytics)

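Ordinary GROUP BY and DISTRIBUTE BY statements cannot express calculations such as per-group ranking or moving averages, because GROUP BY returns a single row per group rather than one row per input record. Hive 0.11 introduced windowing support: with the OVER clause, the analysis is done per partition and per window frame while still producing one result row for every input row.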

The general syntax is:

Function(arg1, ..., argn) OVER ([PARTITION BY <...>] [ORDER BY <...>] [window_clause])
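As a rough illustration of how this differs from GROUP BY, the sketch below attaches a per-key total to every row instead of collapsing each key into one row. The inline CTE named demo is a hypothetical stand-in holding the same five rows used in the examples later in this article; the CTE and the SELECT without FROM assume Hive 0.13 or later.

```sql
-- Hypothetical inline data: the same five rows as the article's examples.
WITH demo AS (
  SELECT 1 AS key, 'UK' AS product_code, 20 AS cost
  UNION ALL SELECT 1 AS key, 'US' AS product_code, 10 AS cost
  UNION ALL SELECT 1 AS key, 'EU' AS product_code, 5 AS cost
  UNION ALL SELECT 2 AS key, 'UK' AS product_code, 3 AS cost
  UNION ALL SELECT 2 AS key, 'EU' AS product_code, 6 AS cost
)
-- One output row per input row, each carrying the total for its key.
-- Rows with key 1 get key_total = 35, rows with key 2 get key_total = 9.
-- A GROUP BY key query over the same data would return only two rows.
SELECT key, product_code, cost,
       sum(cost) OVER (PARTITION BY key) AS key_total
FROM demo;
```

All five rows survive, which is what makes per-row comparisons against a group-level value possible.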
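  1. Windowing functions
Function    Description
LEAD(col, n, DEFAULT)    Returns the value of col from the n-th row after the current row within the window. col is the column name, n is the offset (optional, default 1), and DEFAULT is the value returned when that row falls outside the window (optional, NULL if omitted).
LAG(col, n, DEFAULT)    Returns the value of col from the n-th row before the current row within the window. col is the column name, n is the offset (optional, default 1), and DEFAULT is the value returned when that row falls outside the window (optional, NULL if omitted).
FIRST_VALUE(col, false)    Returns the first value of col within the window frame of the sorted partition (with ORDER BY and no explicit frame, that is up to the current row). The first parameter is the column to analyze; the second is a boolean that controls whether NULL values are skipped (default false).
LAST_VALUE(col, false)    Returns the last value of col within the window frame of the sorted partition (with ORDER BY and no explicit frame, that is up to the current row). The parameters have the same meaning as for FIRST_VALUE.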
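A minimal sketch of LAG and LEAD with an explicit offset and default value. The CTE demo and the aliases prev_cost and next_cost are hypothetical names chosen for illustration; the data is the same five rows used in the article's examples, and the inline CTE assumes Hive 0.13 or later.

```sql
-- Hypothetical inline data: the same five rows as the article's examples.
WITH demo AS (
  SELECT 1 AS key, 'UK' AS product_code, 20 AS cost
  UNION ALL SELECT 1 AS key, 'US' AS product_code, 10 AS cost
  UNION ALL SELECT 1 AS key, 'EU' AS product_code, 5 AS cost
  UNION ALL SELECT 2 AS key, 'UK' AS product_code, 3 AS cost
  UNION ALL SELECT 2 AS key, 'EU' AS product_code, 6 AS cost
)
-- Within each key, rows are ordered by cost ascending.
-- prev_cost looks one row back, next_cost looks one row ahead.
-- The third argument (0) is returned when no such row exists.
-- key 1: cost 5 -> (0, 10), cost 10 -> (5, 20), cost 20 -> (10, 0)
-- key 2: cost 3 -> (0, 6),  cost 6  -> (3, 0)
SELECT key, product_code, cost,
       lag(cost, 1, 0)  OVER (PARTITION BY key ORDER BY cost) AS prev_cost,
       lead(cost, 1, 0) OVER (PARTITION BY key ORDER BY cost) AS next_cost
FROM demo;
```

Subtracting prev_cost from cost in the same SELECT would give the step between consecutive rows, a common use of LAG.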

Example: the last value within a partition

-- last_value_1 has no ORDER BY, last_value_2 orders by cost descending
select key, product_code, cost,
       count(*) over (partition by key) as count,
       LAST_VALUE(cost) over (partition by key) as last_value_1,
       LAST_VALUE(cost) over (partition by key order by cost desc) as last_value_2
from
(
  select key, product_code, cost
  from
  (
    select 1 as key, 'UK' as product_code, 20 as cost
    union
    select 1 as key, 'US' as product_code, 10 as cost
    union
    select 1 as key, 'EU' as product_code, 5 as cost
    union
    select 2 as key, 'UK' as product_code, 3 as cost
    union
    select 2 as key, 'EU' as product_code, 6 as cost
  ) unioned
  distribute by key, product_code, cost
) mid

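Result: the last column shows the last value up to the current row after sorting. The second-to-last column has no ORDER BY, so its window defaults to the whole partition (UNBOUNDED on both sides); which row counts as "last" then depends on the arbitrary order in which rows reach the function, so the value looks surprising and is not guaranteed to be stable.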

key     product_code    cost    count   last_value_1    last_value_2
1       UK              20      3       10              20
1       US              10      3       10              10
1       EU              5       3       10              5
2       EU              6       2       3               6
2       UK              3       2       3               3
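If the goal is the last value of the whole sorted partition rather than the last value up to the current row, one option is to combine ORDER BY with an explicit frame. The sketch below assumes the same five rows in a hypothetical inline CTE named demo (Hive 0.13 or later); the alias partition_last is also just illustrative.

```sql
-- Hypothetical inline data: the same five rows as the article's examples.
WITH demo AS (
  SELECT 1 AS key, 'UK' AS product_code, 20 AS cost
  UNION ALL SELECT 1 AS key, 'US' AS product_code, 10 AS cost
  UNION ALL SELECT 1 AS key, 'EU' AS product_code, 5 AS cost
  UNION ALL SELECT 2 AS key, 'UK' AS product_code, 3 AS cost
  UNION ALL SELECT 2 AS key, 'EU' AS product_code, 6 AS cost
)
-- ORDER BY fixes the row order, and the explicit frame extends it
-- to the end of the partition, so every row sees the same last value.
-- With cost sorted descending, that is the smallest cost per key:
-- 5 for every row of key 1 and 3 for every row of key 2.
SELECT key, product_code, cost,
       last_value(cost) OVER (
         PARTITION BY key
         ORDER BY cost DESC
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS partition_last
FROM demo;
```

Unlike last_value_1 above, this result does not depend on the order in which rows happen to arrive.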
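  2. Analytics functions
Function    Description
ROW_NUMBER()    Numbers the rows of a partition sequentially starting from 1, in the given order, with no duplicates and no gaps.
RANK()    Returns the rank of each row within its partition. Ties share the same rank and leave gaps in the sequence, for example 1,2,2,4,5.
DENSE_RANK()    Returns the rank of each row within its partition. Ties share the same rank and leave no gaps in the sequence, for example 1,2,2,3,4.
CUME_DIST()    (Number of rows with a value less than or equal to the current value) / (total number of rows in the partition). For example, the share of employees whose salary is less than or equal to the current salary.
PERCENT_RANK()    (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)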

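Remember: all of the functions above operate within a partition. Both the ranks and the numerators and denominators of the ratios are computed per partition.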
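The differences between these functions show up mainly when values tie. The sketch below uses a hypothetical single partition of four rows in which two rows share cost 10; the CTE name ranked_demo and the aliases are illustrative only, and the inline CTE assumes Hive 0.13 or later.

```sql
-- Hypothetical data: one partition, with a tie on cost = 10.
WITH ranked_demo AS (
  SELECT 1 AS key, 'EU' AS product_code, 5 AS cost
  UNION ALL SELECT 1 AS key, 'US' AS product_code, 10 AS cost
  UNION ALL SELECT 1 AS key, 'JP' AS product_code, 10 AS cost
  UNION ALL SELECT 1 AS key, 'UK' AS product_code, 20 AS cost
)
-- Expected values by cost (ascending):
--   cost 5  : rank 1, dense_rank 1, cume_dist 0.25, percent_rank 0.0
--   cost 10 : rank 2, dense_rank 2, cume_dist 0.75, percent_rank 1/3
--   cost 20 : rank 4, dense_rank 3, cume_dist 1.0,  percent_rank 1.0
-- row_number gives the tied rows 2 and 3 in an arbitrary order.
SELECT key, product_code, cost,
       row_number()   OVER (PARTITION BY key ORDER BY cost) AS rn,
       rank()         OVER (PARTITION BY key ORDER BY cost) AS rnk,
       dense_rank()   OVER (PARTITION BY key ORDER BY cost) AS dense_rnk,
       cume_dist()    OVER (PARTITION BY key ORDER BY cost) AS cume,
       percent_rank() OVER (PARTITION BY key ORDER BY cost) AS pct_rnk
FROM ranked_demo;
```

Here cume_dist for cost 10 is 3/4 because three of the four rows have a cost of 10 or less, and percent_rank is (2 - 1) / (4 - 1).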

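  • The OVER clause

  • Works with the standard aggregate functions: count, sum, min, max, avg

  • Usually combined with PARTITION BY and ORDER BY

  • Supports a window specification in the following formats: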

    (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
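An explicit window specification is what makes a moving average possible. The sketch below averages each row with its immediate neighbours; the CTE demo and the alias moving_avg are hypothetical names, the data is the same five rows used in the article's examples, and the inline CTE assumes Hive 0.13 or later.

```sql
-- Hypothetical inline data: the same five rows as the article's examples.
WITH demo AS (
  SELECT 1 AS key, 'UK' AS product_code, 20 AS cost
  UNION ALL SELECT 1 AS key, 'US' AS product_code, 10 AS cost
  UNION ALL SELECT 1 AS key, 'EU' AS product_code, 5 AS cost
  UNION ALL SELECT 2 AS key, 'UK' AS product_code, 3 AS cost
  UNION ALL SELECT 2 AS key, 'EU' AS product_code, 6 AS cost
)
-- The frame covers the previous row, the current row and the next row.
-- At the edges of a partition the frame simply has fewer rows.
-- key 1 (5, 10, 20): avg(5,10) = 7.5, avg(5,10,20) = 35/3, avg(10,20) = 15
-- key 2 (3, 6):      avg(3,6) = 4.5 for both rows
SELECT key, product_code, cost,
       avg(cost) OVER (
         PARTITION BY key
         ORDER BY cost
         ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
       ) AS moving_avg
FROM demo;
```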

  • Note: use ORDER BY with care, because it changes the default window frame.

ORDER BY and window clause    Default frame
ORDER BY present, no window clause    The frame ends at the current row: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
No ORDER BY, no window clause    The frame is unbounded on both sides: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

Example: the difference between having and not having ORDER BY (the default window frame differs)

-- total_costs has no ORDER BY, total_current_costs orders by cost
select key, product_code, cost,
       count(*) over (partition by key) as count,
       sum(cost) over (partition by key) as total_costs,
       sum(cost) over (partition by key order by cost) as total_current_costs
from
(
  select key, product_code, cost
  from
  (
    select 1 as key, 'UK' as product_code, 20 as cost
    union
    select 1 as key, 'US' as product_code, 10 as cost
    union
    select 1 as key, 'EU' as product_code, 5 as cost
    union
    select 2 as key, 'UK' as product_code, 3 as cost
    union
    select 2 as key, 'EU' as product_code, 6 as cost
  ) unioned
  distribute by key, product_code, cost
) mid

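Result: the last two columns differ. total_costs covers the whole partition, while total_current_costs is a running total that ends at the current row.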

key     product_code    cost    count   total_costs     total_current_costs
1       EU              5       3       35              5
1       US              10      3       35              15
1       UK              20      3       35              35
2       UK              3       2       9               3
2       EU              6       2       9               9
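The default frame with ORDER BY is a RANGE frame, and that matters when values tie: rows with the same ORDER BY value are peers and enter the frame together. The sketch below contrasts the default with an explicit ROWS frame, using a hypothetical partition in which two rows share cost 10. The CTE name tie_demo and the aliases are illustrative, and the inline CTE assumes Hive 0.13 or later.

```sql
-- Hypothetical data: one partition, with a tie on cost = 10.
WITH tie_demo AS (
  SELECT 1 AS key, 'EU' AS product_code, 5 AS cost
  UNION ALL SELECT 1 AS key, 'US' AS product_code, 10 AS cost
  UNION ALL SELECT 1 AS key, 'JP' AS product_code, 10 AS cost
  UNION ALL SELECT 1 AS key, 'UK' AS product_code, 20 AS cost
)
-- running_range uses the default frame (RANGE ... CURRENT ROW):
--   both cost-10 rows see 5 + 10 + 10 = 25, giving 5, 25, 25, 45.
-- running_rows counts physical rows instead:
--   the totals are 5, 15, 25, 45, and which tied row gets 15
--   depends on the arbitrary order of the tied rows.
SELECT key, product_code, cost,
       sum(cost) OVER (PARTITION BY key ORDER BY cost) AS running_range,
       sum(cost) OVER (
         PARTITION BY key
         ORDER BY cost
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_rows
FROM tie_demo;
```

In the article's own data no two rows in a key share a cost, which is why the running total there increases on every row.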

4. Other supported features

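distinct    Hive 2.1.0 and later support DISTINCT inside aggregate functions (sum, count, avg) used with OVER, but not together with ORDER BY or a window clause: count(distinct a) over (partition by c). Hive 2.2.0 adds support for DISTINCT together with ORDER BY and a window clause: count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)
Aggregate functions    Hive 2.1.0 and later allow aggregate functions inside the OVER clause: select rank() over(order by sum(b))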
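A small sketch of both features, assuming Hive 2.1.0 or later. The table t and its rows are hypothetical; only the column names a, b and c are taken from the snippets above. The GROUP BY a in the second query is an assumption added so that sum(b) has a group to aggregate over.

```sql
-- DISTINCT inside a windowed aggregate (no ORDER BY, no window clause).
-- Partition g1 contains the values x and y, so distinct_a is 2 on each row.
WITH t AS (
  SELECT 'x' AS a, 1 AS b, 'g1' AS c
  UNION ALL SELECT 'x' AS a, 4 AS b, 'g1' AS c
  UNION ALL SELECT 'y' AS a, 2 AS b, 'g1' AS c
)
SELECT a, b, c,
       count(DISTINCT a) OVER (PARTITION BY c) AS distinct_a
FROM t;

-- Aggregate function inside the OVER clause.
-- sum(b) is 5 for x and 2 for y, so y ranks 1 and x ranks 2.
WITH t AS (
  SELECT 'x' AS a, 1 AS b, 'g1' AS c
  UNION ALL SELECT 'x' AS a, 4 AS b, 'g1' AS c
  UNION ALL SELECT 'y' AS a, 2 AS b, 'g1' AS c
)
SELECT a,
       sum(b) AS total_b,
       rank() OVER (ORDER BY sum(b)) AS rank_by_total
FROM t
GROUP BY a;
```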

Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-WINDOWclause
