Hive Ramblings (2): Analytic Functions and Window Functions


If you repost this article, please credit the source at the top. Thanks.



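Window and analytic functions have been supported since Hive 0.11. They scan multiple input rows to compute a result for each row, and are usually combined with OVER, PARTITION BY, ORDER BY, and a windowing clause. Unlike a traditional GROUP BY aggregation, which returns a single result per group, an analytic function returns a result for every row, output alongside that row.
The syntax is: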

Function(arg1, ..., argn) OVER ([PARTITION BY <...>] [ORDER BY <...>] [<window_clause>])

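Window functions

The OVER clause

OVER with the standard aggregate functions COUNT, SUM, MIN, MAX, AVG
OVER with PARTITION BY, using one or more partitioning columns of any primitive data type
OVER with PARTITION BY and ORDER BY, using one or more partitioning and/or ordering columns of any data type
OVER with a window specification. The window specification supports the following formats: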

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

When ORDER BY is specified without a window clause, the window specification defaults to:

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

When both ORDER BY and the window clause are missing, the window specification defaults to:

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
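
The difference between these defaults is easiest to see on a column that contains ties. The following is a minimal Python sketch, not Hive code: the function names are hypothetical, and the input list is assumed to be one partition already sorted by the ORDER BY column.

```python
def range_running_sum(values):
    # RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    # the frame ends at the last row that ties with the current value.
    return [sum(v for v in values if v <= cur) for cur in values]


def rows_running_sum(values):
    # ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    # the frame ends at the current physical row.
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out


def whole_partition_sum(values):
    # No ORDER BY and no window clause: every row sees the whole partition.
    return [sum(values)] * len(values)


sales = [1, 1, 2, 3, 5, 5, 6]  # illustrative values, sorted ascending
print(range_running_sum(sales))    # [2, 2, 4, 7, 17, 17, 23]
print(rows_running_sum(sales))     # [1, 2, 4, 7, 12, 17, 23]
print(whole_partition_sum(sales))  # [23, 23, 23, 23, 23, 23, 23]
```

The two running sums only disagree on rows that share an ORDER BY value with a later row.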

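Analytic functions

Hive 2.1.0 and later support DISTINCT

DISTINCT is supported in aggregate functions (SUM, COUNT, AVG), but not together with ORDER BY or a window specification.
count(distinct a) over (partition by c)

Hive 2.1.0 and later support aggregate functions inside the OVER clause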

select rank() over(order by sum(b))

Hive 2.2.0 adds support for DISTINCT together with ORDER BY and a window specification

count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)

Understanding window and analytic functions through examples

COUNT, SUM, MIN, MAX, AVG examples

-- Create the table
create table orders(
    user_id string,
    device_id string,
    user_type string,
    price float,
    sales int);

-- Sample data to load (orders.txt)
zhangsa test1   new     67.1    2
lisi    test2   old     43.32   1
wanger  test3   new     88.88   3
liliu   test4   new     66.0    1
tom     test5   new     54.32   1
tomas   test6   old     77.77   2
tomson  test7   old     88.44   3
tom1    test8   new     56.55   6
tom2    test9   new     88.88   5
tom3    test10  new     66.66   5

-- Window function example
select
    user_id,
    user_type,
    sales,
    -- Default frame: from the partition start through the current row, including all rows tied with it on sales
    sum(sales) over(partition by user_type order by sales asc) as sales_1,
    -- The same frame written out explicitly; same result as sales_1
    sum(sales) over(partition by user_type order by sales asc range between unbounded preceding and current row) as sales_2,
    -- From the partition start through the current physical row; differs from sales_1 when sales has ties
    sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) as sales_3,
    -- Current row plus the 3 preceding rows
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) as sales_4,
    -- Rows whose sales value lies in [sales - 3, sales]
    sum(sales) over(partition by user_type order by sales asc range between 3 preceding and current row) as sales_5,
    -- Current row plus the 3 preceding rows and the 1 following row
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) as sales_6,
    -- Rows whose sales value lies in [sales - 3, sales + 1]
    sum(sales) over(partition by user_type order by sales asc range between 3 preceding and 1 following) as sales_7,
    -- Current row plus all following rows
    sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) as sales_8,
    -- Rows whose sales value is >= the current row's sales (ties included)
    sum(sales) over(partition by user_type order by sales asc range between current row and unbounded following) as sales_9,
    -- All rows in the partition
    sum(sales) over(partition by user_type) as sales_10
from
    orders
order by
    user_type,
    sales,
    user_id;

The query returns:

user_id user_type sales sales_1 sales_2 sales_3 sales_4 sales_5 sales_6 sales_7 sales_8 sales_9 sales_10
liliu new 1 2 2 2 2 2 4 4 22 23 23
tom new 1 2 2 1 1 2 2 4 23 23 23
zhangsa new 2 4 4 4 4 4 7 7 21 21 23
wanger new 3 7 7 7 7 7 12 7 19 19 23
tom2 new 5 17 17 17 15 15 21 21 11 16 23
tom3 new 5 17 17 12 11 15 16 21 16 16 23
tom1 new 6 23 23 23 19 19 19 19 6 6 23
lisi old 1 1 1 1 1 1 3 3 6 6 6
tomas old 2 3 3 3 3 3 6 6 5 5 6
tomson old 3 6 6 6 6 6 6 6 3 3 6

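Notes

Results depend on ORDER BY, which sorts ascending by default.
If ORDER BY is present but no frame is specified, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: from the start of the partition through the current row, including every row that ties with it on the ORDER BY value.
If ORDER BY is omitted, the function is computed over all rows in the partition.
PRECEDING: before the current row
FOLLOWING: after the current row
CURRENT ROW: the current row
UNBOUNDED: no bound (the partition start or end)
UNBOUNDED PRECEDING: from the start of the partition
UNBOUNDED FOLLOWING: through the end of the partition
COUNT, AVG, MIN, and MAX are used the same way as SUM.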
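
With a numeric offset, ROWS counts physical rows while RANGE compares ORDER BY values. The following Python sketch emulates both for the frame `3 PRECEDING AND 1 FOLLOWING` (columns sales_6 and sales_7 above). It is an illustration under assumptions, not Hive's implementation: the function names are hypothetical, and the list is the `new` partition's sales, assumed sorted ascending.

```python
def rows_frame_sum(values, preceding, following):
    # ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING:
    # the frame is a slice of physical row positions around row i.
    out = []
    for i in range(len(values)):
        lo = max(0, i - preceding)
        hi = min(len(values), i + following + 1)
        out.append(sum(values[lo:hi]))
    return out


def range_frame_sum(values, preceding, following):
    # RANGE BETWEEN <preceding> PRECEDING AND <following> FOLLOWING:
    # the frame holds every row whose value is in [cur - preceding, cur + following].
    return [
        sum(v for v in values if cur - preceding <= v <= cur + following)
        for cur in values
    ]


new_sales = [1, 1, 2, 3, 5, 5, 6]  # sales of user_type = 'new', ascending
print(rows_frame_sum(new_sales, 3, 1))   # [2, 4, 7, 12, 16, 21, 19]
print(range_frame_sum(new_sales, 3, 1))  # [4, 4, 7, 7, 21, 21, 19]
```

The same numbers appear in the sales_6 and sales_7 columns once the result rows are read in window order (tom, liliu, zhangsa, wanger, tom3, tom2, tom1).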

FIRST_VALUE and LAST_VALUE examples

select
    user_id,
    user_type,
    sales,
    ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
    first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
    first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
    last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
    last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from
    orders
order by
    user_type,
    sales;

The query returns:

user_id user_type sales row_num max_sales_user min_sales_user curr_last_min_user curr_last_max_user
tom new 1 1 tom1 tom tom liliu
liliu new 1 2 tom1 tom tom liliu
zhangsa new 2 3 tom1 tom zhangsa zhangsa
wanger new 3 4 tom1 tom wanger wanger
tom3 new 5 5 tom1 tom tom3 tom2
tom2 new 5 6 tom1 tom tom3 tom2
tom1 new 6 7 tom1 tom tom1 tom1
lisi old 1 1 tomson lisi lisi lisi
tomas old 2 2 tomson lisi tomas tomas
tomson old 3 3 tomson lisi tomson tomson
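
LAST_VALUE does not return the partition's final row here because the query has an ORDER BY but no window clause, so the default RANGE frame stops at the current row and its ties. The following Python sketch reproduces the min_sales_user and curr_last_max_user columns for the `new` partition. The function names are hypothetical, and the order of tied rows is an assumption taken from row_num in the result above; Hive does not guarantee it.

```python
def first_value_default_frame(rows):
    # The default frame always starts at the partition start,
    # so FIRST_VALUE is the first row in window order.
    return [rows[0][0] for _ in rows]


def last_value_default_frame(rows):
    # The default frame ends at the last row tied with the current row,
    # so LAST_VALUE is the last user_id whose sales <= current sales.
    out = []
    for _, cur in rows:
        frame = [user_id for user_id, sales in rows if sales <= cur]
        out.append(frame[-1])
    return out


# (user_id, sales) for user_type = 'new', in ascending window order
new_rows = [("tom", 1), ("liliu", 1), ("zhangsa", 2), ("wanger", 3),
            ("tom3", 5), ("tom2", 5), ("tom1", 6)]
print(first_value_default_frame(new_rows))
# ['tom', 'tom', 'tom', 'tom', 'tom', 'tom', 'tom']
print(last_value_default_frame(new_rows))
# ['liliu', 'liliu', 'zhangsa', 'wanger', 'tom2', 'tom2', 'tom1']
```

To get the partition's true last row, the frame would have to be extended explicitly, for example with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.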

LEAD and LAG

select
    user_id,
    device_id,
    sales,
    ROW_NUMBER() OVER(ORDER BY sales) AS row_num,
    lead(device_id) over (order by sales) as default_after_one_line,
    lag(device_id) over (order by sales) as default_before_one_line,
    lead(device_id,2) over (order by sales) as after_two_line,
    lag(device_id,2,'abc') over (order by sales) as before_two_line
from
    orders
order by
    sales;

The query returns:

user_id device_id sales row_num default_after_one_line default_before_one_line after_two_line before_two_line
lisi test2 1 3 test6 test4 test1 test5
liliu test4 1 2 test2 test5 test6 abc
tom test5 1 1 test4 NULL test2 abc
zhangsa test1 2 5 test7 test6 test3 test2
tomas test6 2 4 test1 test2 test7 test4
wanger test3 3 7 test10 test7 test9 test1
tomson test7 3 6 test3 test1 test10 test6
tom2 test9 5 9 test8 test10 NULL test3
tom3 test10 5 8 test9 test3 test8 test7
tom1 test8 6 10 NULL test9 NULL test10
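
LEAD and LAG simply look a fixed number of rows ahead or behind in window order, falling back to a default (NULL unless one is given) when they run off the end. A minimal Python sketch of that behavior, with hypothetical function names, `None` standing in for NULL, and the device list taken in row_num order from the result above:

```python
def lead(values, offset=1, default=None):
    # Value `offset` rows after the current row, or `default` past the end.
    n = len(values)
    return [values[i + offset] if i + offset < n else default for i in range(n)]


def lag(values, offset=1, default=None):
    # Value `offset` rows before the current row, or `default` before the start.
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


# device_id in window order (row_num 1..10)
devices = ["test5", "test4", "test2", "test6", "test1",
           "test7", "test3", "test10", "test9", "test8"]
print(lead(devices)[-2:])          # ['test8', None]
print(lag(devices, 2, "abc")[:3])  # ['abc', 'abc', 'test5']
```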

RANK, ROW_NUMBER, DENSE_RANK

select
user_id,user_type,sales,
RANK() over (partition by user_type order by sales desc) as r,
ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
orders;

The query returns:

user_id user_type sales r rn dr
tom1 new 6 1 1 1
tom3 new 5 2 2 2
tom2 new 5 2 3 2
wanger new 3 4 4 3
zhangsa new 2 5 5 4
tom new 1 6 6 5
liliu new 1 6 7 5
tomson old 3 1 1 1
tomas old 2 2 2 2
lisi old 1 3 3 3
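
The three functions differ only in how they treat ties: RANK repeats a number and then skips ahead, ROW_NUMBER never repeats, and DENSE_RANK repeats without leaving gaps. A small Python sketch of that rule (the function name is hypothetical, and the input is assumed to be one partition already in window order):

```python
def rankings(values):
    # Returns one (rank, row_number, dense_rank) tuple per input row.
    out = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a real value
    for row_number, v in enumerate(values, start=1):
        if v != prev:
            rank = row_number  # RANK jumps to the current position
            dense += 1         # DENSE_RANK advances by exactly one
            prev = v
        out.append((rank, row_number, dense))
    return out


new_sales_desc = [6, 5, 5, 3, 2, 1, 1]  # user_type = 'new', sales descending
for row in rankings(new_sales_desc):
    print(row)
# (1, 1, 1)
# (2, 2, 2)
# (2, 3, 2)
# (4, 4, 3)
# (5, 5, 4)
# (6, 6, 5)
# (6, 7, 5)
```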

NTILE

select
    user_type,sales,
    -- Split each partition into 2 buckets
    NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
    -- Split each partition into 3 buckets
    NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
    -- Split each partition into 4 buckets
    NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
    -- Split all rows into 4 buckets
    NTILE(4) OVER(ORDER BY sales) AS all_nt4
from
    orders
order by
    user_type,
    sales;

The query returns:

user_type sales nt2 nt3 nt4 all_nt4
new 1 1 1 1 1
new 1 1 1 1 1
new 2 1 1 2 2
new 3 1 2 2 3
new 5 2 2 3 4
new 5 2 3 3 3
new 6 2 3 4 4
old 1 1 1 1 1
old 2 1 2 2 2
old 3 2 3 3 2
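
The bucket numbers above are consistent with the usual NTILE rule: rows are dealt out in window order, and when the row count does not divide evenly, the earlier buckets each get one extra row. The Python sketch below encodes that rule. It is an assumption inferred from the output shown here rather than from Hive's source, and the function name is hypothetical.

```python
def ntile(n_rows, buckets):
    # Bucket number for each of n_rows rows, in window order.
    base, extra = divmod(n_rows, buckets)
    out = []
    for b in range(1, buckets + 1):
        size = base + (1 if b <= extra else 0)  # first `extra` buckets are larger
        out.extend([b] * size)
    return out


print(ntile(7, 2))   # [1, 1, 1, 1, 2, 2, 2]
print(ntile(7, 3))   # [1, 1, 1, 2, 2, 3, 3]
print(ntile(7, 4))   # [1, 1, 2, 2, 3, 3, 4]
print(ntile(3, 4))   # [1, 2, 3]
print(ntile(10, 4))  # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
```

When there are fewer rows than buckets, as with the 3-row `old` partition and NTILE(4), the trailing buckets stay empty.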

Get the user IDs in the top 20% by sales

select
    user_id
from
(
    select
        user_id,
        NTILE(5) OVER(ORDER BY sales desc) AS nt
    from
        orders
)A
where nt=1;

The result:

+----------+
| user_id |
+----------+
| tom1 |
| tom3 |
+----------+

CUME_DIST, PERCENT_RANK

select
    user_id,user_type,sales,
    -- No PARTITION BY: all rows form a single partition
    CUME_DIST() OVER(ORDER BY sales) AS cd1,
    -- Partitioned by user_type
    CUME_DIST() OVER(PARTITION BY user_type ORDER BY sales) AS cd2
from
    orders;

The query returns:

+----------+------------+--------+------+----------------------+--+
| user_id | user_type | sales | cd1 | cd2 |
+----------+------------+--------+------+----------------------+--+
| liliu | new | 1 | 0.3 | 0.2857142857142857 |
| tom | new | 1 | 0.3 | 0.2857142857142857 |
| zhangsa | new | 2 | 0.5 | 0.42857142857142855 |
| wanger | new | 3 | 0.7 | 0.5714285714285714 |
| tom2 | new | 5 | 0.9 | 0.8571428571428571 |
| tom3 | new | 5 | 0.9 | 0.8571428571428571 |
| tom1 | new | 6 | 1.0 | 1.0 |
| lisi | old | 1 | 0.3 | 0.3333333333333333 |
| tomas | old | 2 | 0.5 | 0.6666666666666666 |
| tomson | old | 3 | 0.7 | 1.0 |
+----------+------------+--------+------+----------------------+--+
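
CUME_DIST is the fraction of rows in the partition whose ORDER BY value is less than or equal to the current row's value. A minimal Python sketch of that definition, with a hypothetical function name and the sales values taken from the table above:

```python
def cume_dist(values):
    # (number of rows with value <= current value) / (rows in partition)
    n = len(values)
    return [sum(1 for v in values if v <= cur) / n for cur in values]


all_sales = [1, 1, 1, 2, 2, 3, 3, 5, 5, 6]  # all 10 rows, no PARTITION BY
print(cume_dist(all_sales))
# [0.3, 0.3, 0.3, 0.5, 0.5, 0.7, 0.7, 0.9, 0.9, 1.0]

new_sales = [1, 1, 2, 3, 5, 5, 6]           # user_type = 'new'
print(cume_dist(new_sales)[0] == 2 / 7)     # True
```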

select
    user_type,sales,
    -- Total number of rows in the partition
    SUM(1) OVER(PARTITION BY user_type) AS s,
    -- RANK value over all rows
    RANK() OVER(ORDER BY sales) AS r,
    PERCENT_RANK() OVER(ORDER BY sales) AS pr,
    -- PERCENT_RANK within each partition
    PERCENT_RANK() OVER(PARTITION BY user_type ORDER BY sales) AS prg
from
    orders;

The query returns:

+------------+--------+----+-----+---------------------+---------------------+--+
| user_type | sales | s | r | pr | prg |
+------------+--------+----+-----+---------------------+---------------------+--+
| new | 1 | 7 | 1 | 0.0 | 0.0 |
| new | 1 | 7 | 1 | 0.0 | 0.0 |
| new | 2 | 7 | 4 | 0.3333333333333333 | 0.3333333333333333 |
| new | 3 | 7 | 6 | 0.5555555555555556 | 0.5 |
| new | 5 | 7 | 8 | 0.7777777777777778 | 0.6666666666666666 |
| new | 5 | 7 | 8 | 0.7777777777777778 | 0.6666666666666666 |
| new | 6 | 7 | 10 | 1.0 | 1.0 |
| old | 1 | 3 | 1 | 0.0 | 0.0 |
| old | 2 | 3 | 4 | 0.3333333333333333 | 0.5 |
| old | 3 | 3 | 6 | 0.5555555555555556 | 1.0 |
+------------+--------+----+-----+---------------------+---------------------+--+
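
PERCENT_RANK is computed as (RANK - 1) / (rows in partition - 1), which is why the pr column follows directly from r with 10 rows in total. A short Python sketch of the formula; the function name is hypothetical, and returning 0.0 for a single-row partition is an assumption made to avoid dividing by zero.

```python
def percent_rank(values):
    # (rank - 1) / (n - 1), where rank = 1 + number of strictly smaller values.
    n = len(values)
    if n <= 1:
        return [0.0] * n  # assumed behavior for a single-row partition
    ranks = [1 + sum(1 for v in values if v < cur) for cur in values]
    return [(r - 1) / (n - 1) for r in ranks]


new_sales = [1, 1, 2, 3, 5, 5, 6]  # user_type = 'new'
print([round(p, 4) for p in percent_rank(new_sales)])
# [0.0, 0.0, 0.3333, 0.5, 0.6667, 0.6667, 1.0]
```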
