转载请在文章起始处注明出处，谢谢。

文章转载自文章

在Hive 0.11之后支持的，扫描多个输入的行计算每行的结果。通常和OVER,PARTITION BY, ORDER BY, WINDOWING配合使用。和传统的分组结果不一样，传统的结果每组中只有一个结果。分析函数的结果会出现多次，和每条记录都连接输出。
语法形式如下：

Function(arg1,....argn) OVER([PARTITION BY<...>] [ORDER BY<...>] [window_clause])

窗口函数

OVER从句

使用标准的聚合函数COUNT,SUM,MIN,MAX,AVG
使用PARTITION BY语句，使用一个或者多个原始数据类型的列
使用PARTITION BY与ORDER BY语句，使用一个或者多个数据类型的分区或者拍序列
使用窗口规范，窗口规范支持一下格式：

(ROW | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROW | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROW | RANGE) BETWEEN [num] PRECEDING AND (UNBOUNDED | [num]) FOLLOWING

当ORDER BY后面缺少窗口从句条件，窗口规范默认是

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

当ORDER BY和窗口从句都缺失，窗口规范默认是：

ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

分析函数

Hive2.1.0及以后版本支持distinct

在聚合函数(sum, count, avg)中支持distinct，但是在order by或者窗口限制中不支持。
conut(distinct a) over(partition by c)

Hive2.1.0以后支持在OVER从句中支持聚合函数

select rank() over(order by sum(b))

Hive2.2.0中在使用ORDER BY和窗口限制时支持distinct

count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)

通过实例深入理解窗口函数和分析函数

COUNT、SUM、MIN、MAX、AVG案例分析

## 创建数据表
create table orders(
    user_id string,
    device_id string,
    user_type string,
    price float,
    sales int);

## 添加数据orders.txt
zhangsa test1   new     67.1    2
lisi    test2   old     43.32   1
wanger  test3   new     88.88   3
liliu   test4   new     66.0    1
tom     test5   new     54.32   1
tomas   test6   old     77.77   2
tomson  test7   old     88.44   3
tom1    test8   new     56.55   6
tom2    test9   new     88.88   5
tom3    test10  new     66.66   5

## 开窗函数案例
select
    user_id,
    user_type,
    sales,
    -- 默认从起点到当前所有重复行
    sum(sales) over(partition by user_type order by sales asc) as sales_1,
    -- 从起点到当前所有重复行与sales_1结果相同
    sum(sales) over(partition by user_type order by sales asc range between unbounded preceding and current row) as sales_2,
    -- 从起点到当前行，结果与sale_1结果不同
    sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) as sales_3,
    -- 当前行加上往前3行
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) as sales_4,
    -- 当前范围往上加3行
    sum(sales) over(partition by user_type order by sales asc range between 3 preceding and current row) as sales_5,
    -- 当前行+往前3行+往后1行
    sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) as sales_6,
    --
    sum(sales) over(partition by user_type order by sales asc range between 3 preceding and 1 following) as sales_7,
    -- 当前行+之后所有行
    sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) as sales_8,
    --
    sum(sales) over(partition by user_type order by sales asc range between current row and unbounded following) as sales_9,
    -- 分组内所有行
    sum(sales) over(partition by user_type) as sales_10
from
    orders
order by
    user_type,
    sales,
    user_id;

上述查询结果如下：

user_id	user_type	sales	sales_1	sales_2	sales_3	sales_4	sales_5	sales_6	sales_7	sales_8	sales_9	sales_10
liliu	new	1	2	2	2	2	2	4	4	22	23	23
tom	new	1	2	2	1	1	2	2	4	23	23	23
zhangsa	new	2	4	4	4	4	4	7	7	21	21	23
wanger	new	3	7	7	7	7	7	12	7	19	19	23
tom2	new	5	17	17	17	15	15	21	21	11	16	23
tom3	new	5	17	17	12	11	15	16	21	16	16	23
tom1	new	6	23	23	23	19	19	19	19	6	6	23
lisi	old	1	1	1	1	1	1	3	3	6	6	6
tomas	old	2	3	3	3	3	3	6	6	5	5	6
tomson	old	3	6	6	6	6	6	6	6	3	3	6

注意

结果和ORDER BY相关,默认为升序
如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY，则将分组内所有值累加;
PRECEDING：往前
FOLLOWING：往后
CURRENT ROW：当前行
UNBOUNDED：无界限（起点或终点）
UNBOUNDED PRECEDING：表示从前面的起点
UNBOUNDED FOLLOWING：表示到后面的终点
其他COUNT、AVG，MIN，MAX，和SUM用法一样。

FIRST_VALUE和LAST_VALUE案例分析

select
    user_id,
    user_type,
    sales,
    ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
    first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
    first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
    last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
    last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from
    orders
order by
    user_type,
    sales;

上述查询结果如下:

user_id	user_type	sales	row_num	max_sales_user	min_sales_user	curr_last_min_user	curr_last_max_user
tom	new	1	1	tom1	tom	tom	liliu
liliu	new	1	2	tom1	tom	tom	liliu
zhangsa	new	2	3	tom1	tom	zhangsa	zhangsa
wanger	new	3	4	tom1	tom	wanger	wanger
tom3	new	5	5	tom1	tom	tom3	tom2
tom2	new	5	6	tom1	tom	tom3	tom2
tom1	new	6	7	tom1	tom	tom1	tom1
lisi	old	1	1	tomson	lisi	lisi	lisi
tomas	old	2	2	tomson	lisi	tomas	tomas
tomson	old	3	3	tomson	lisi	tomson	tomson

LEAD与LAG

select
    user_id,
    device_id,
    sales,
    ROW_NUMBER() OVER(ORDER BY sales) AS row_num,
    lead(device_id) over (order by sales) as default_after_one_line,
    lag(device_id) over (order by sales) as default_before_one_line,
    lead(device_id,2) over (order by sales) as after_two_line,
    lag(device_id,2,'abc') over (order by sales) as before_two_line
from
    orders
order by
    sales;

上述查询结果如下

user_id	device_id	sales	row_num	default_after_one_line	default_before_one_line	after_two_line	before_two_line
lisi	test2	1	3	test6	test4	test1	test5
liliu	test4	1	2	test2	test5	test6	abc
tom	test5	1	1	test4	NULL	test2	abc
zhangsa	test1	2	5	test7	test6	test3	test2
tomas	test6	2	4	test1	test2	test7	test4
wanger	test3	3	7	test10	test7	test9	test1
tomson	test7	3	6	test3	test1	test10	test6
tom2	test9	5	9	test8	test10	NULL	test3
tom3	test10	5	8	test9	test3	test8	test7
tom1	test8	6	10	NULL	test9	NULL	test10

RANK、ROW_NUMBER、DENSE_RANK

select
user_id,user_type,sales,
RANK() over (partition by user_type order by sales desc) as r,
ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
DENSE_RANK() over (partition by user_type order by sales desc) as dr
from
orders;

上述查询结果如下

user_id	user_type	sales	r	rn	dr
tom1	new	6	1	1	1
tom3	new	5	2	2	2
tom2	new	5	2	3	2
wanger	new	3	4	4	3
zhangsa	new	2	5	5	4
tom	new	1	6	6	5
liliu	new	1	6	7	5
tomson	old	3	1	1	1
tomas	old	2	2	2	2
lisi	old	1	3	3	3

NTILE

select
    user_type,sales,
    --分组内将数据分成2片
    NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
    --分组内将数据分成3片
    NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
    --分组内将数据分成4片
    NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
    --将所有数据分成4片
    NTILE(4) OVER(ORDER BY sales) AS all_nt4
from
    orders
order by
    user_type,
    sales;

上述查询结果如下

user_type	sales	nt2	nt3	nt4	all_nt4
new	1	1	1	1	1
new	1	1	1	1	1
new	2	1	1	2	2
new	3	1	2	2	3
new	5	2	2	3	4
new	5	2	3	3	3
new	6	2	3	4	4
old	1	1	1	1	1
old	2	1	2	2	2
old	3	2	3	3	2

求取sale前20%的用户ID

select
    user_id
from
(
    select
        user_id,
        NTILE(5) OVER(ORDER BY sales desc) AS nt
    from
        orders
)A
where nt=1;

结果如下

+----------+
| user_id |
+----------+
| tom1 |
| tom3 |
+----------+

CUME_DIST、PERCENT_RANK

select
    user_id,user_type,sales,
    --没有partition,所有数据均为1组
    CUME_DIST() OVER(ORDER BY sales) AS cd1,
    --按照user_type进行分组
    CUME_DIST() OVER(PARTITION BY user_type ORDER BY sales) AS cd2
from
    orders;

上述结果如下

+----------+------------+--------+------+----------------------+--+
| user_id | user_type | sales | cd1 | cd2 |
+----------+------------+--------+------+----------------------+--+
| liliu | new | 1 | 0.3 | 0.2857142857142857 |
| tom | new | 1 | 0.3 | 0.2857142857142857 |
| zhangsa | new | 2 | 0.5 | 0.42857142857142855 |
| wanger | new | 3 | 0.7 | 0.5714285714285714 |
| tom2 | new | 5 | 0.9 | 0.8571428571428571 |
| tom3 | new | 5 | 0.9 | 0.8571428571428571 |
| tom1 | new | 6 | 1.0 | 1.0 |
| lisi | old | 1 | 0.3 | 0.3333333333333333 |
| tomas | old | 2 | 0.5 | 0.6666666666666666 |
| tomson | old | 3 | 0.7 | 1.0 |
+----------+------------+--------+------+----------------------+--+

select
    user_type,sales,
    --分组内总行数
    SUM(1) OVER(PARTITION BY user_type) AS s,
    --RANK值
    RANK() OVER(ORDER BY sales) AS r,
    PERCENT_RANK() OVER(ORDER BY sales) AS pr,
    --分组内
    PERCENT_RANK() OVER(PARTITION BY user_type ORDER BY sales) AS prg
from
    orders;

上述结果如下

+------------+--------+----+-----+---------------------+---------------------+--+
| user_type | sales | s | r | pr | prg |
+------------+--------+----+-----+---------------------+---------------------+--+
| new | 1 | 7 | 1 | 0.0 | 0.0 |
| new | 1 | 7 | 1 | 0.0 | 0.0 |
| new | 2 | 7 | 4 | 0.3333333333333333 | 0.3333333333333333 |
| new | 3 | 7 | 6 | 0.5555555555555556 | 0.5 |
| new | 5 | 7 | 8 | 0.7777777777777778 | 0.6666666666666666 |
| new | 5 | 7 | 8 | 0.7777777777777778 | 0.6666666666666666 |
| new | 6 | 7 | 10 | 1.0 | 1.0 |
| old | 1 | 3 | 1 | 0.0 | 0.0 |
| old | 2 | 3 | 4 | 0.3333333333333333 | 0.5 |
| old | 3 | 3 | 6 | 0.5555555555555556 | 1.0 |
+------------+--------+----+-----+---------------------+---------------------+--+

Hive碎碎念（2）：分析函数和窗口函数

转载请在文章起始处注明出处，谢谢。

文章转载自文章

窗口函数

分析函数

Hive2.1.0及以后版本支持distinct

Hive2.1.0以后支持在OVER从句中支持聚合函数

Hive2.2.0中在使用ORDER BY和窗口限制时支持distinct

通过实例深入理解窗口函数和分析函数

COUNT、SUM、MIN、MAX、AVG案例分析

上述查询结果如下：

注意

FIRST_VALUE和LAST_VALUE案例分析

上述查询结果如下:

LEAD与LAG

上述查询结果如下

RANK、ROW_NUMBER、DENSE_RANK

上述查询结果如下

NTILE

上述查询结果如下

求取sale前20%的用户ID

结果如下

CUME_DIST、PERCENT_RANK

上述结果如下

上述结果如下

你可能感兴趣的:(Hive碎碎念（2）：分析函数和窗口函数)

user_type	sales	nt2	nt3	nt4	all_nt4
new	1	1	1	1	1
new	1	1	1	1	1
new	2	1	1	2	2
new	3	1	2	2	3
new	5	2	2	3	4
new	5	2	3	3	3
new	6	2	3	4	4
old	1	1	1	1	1
old	2	1	2	2	2
old	3	2	3	3	2

user_type	sales	nt2	nt3	nt4	all_nt4
new	1	1	1	1	1
new	1	1	1	1	1
new	2	1	1	2	2
new	3	1	2	2	3
new	5	2	2	3	4
new	5	2	3	3	3
new	6	2	3	4	4
old	1	1	1	1	1
old	2	1	2	2	2
old	3	2	3	3	2

user_type	sales	nt2	nt3	nt4	all_nt4
new	1	1	1	1	1
new	1	1	1	1	1
new	2	1	1	2	2
new	3	1	2	2	3
new	5	2	2	3	4
new	5	2	3	3	3
new	6	2	3	4	4
old	1	1	1	1	1
old	2	1	2	2	2
old	3	2	3	3	2