Hive window functions: ntile, lag, lead, first_value, last_value

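Contents

    • 1. Sample data
    • 2. ntile(n)
      • 2.1 Example
    • 3. lag, lead, first_value, last_value
      • 3.1 Example
        • 3.1.1 Question 1: how to get the last pv value in each partition
        • 3.1.2 Question 2: what happens without order by?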


For other window functions, see:
Window functions (sum, avg, max, min)
Window functions (row_number, rank, dense_rank)


1. Sample data

id,crtime,pv
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4
cookie2,2015-04-10,2
cookie2,2015-04-11,3
cookie2,2015-04-12,5
cookie2,2015-04-13,6
cookie2,2015-04-14,3
cookie2,2015-04-15,9
cookie2,2015-04-16,7

2. ntile(n)

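ntile(n) splits the ordered rows of each partition into n buckets that are as equal in size as possible, and returns the bucket number (starting at 1) for each row. If the rows cannot be divided evenly, the leftover rows are handed out one per bucket, starting from the first bucket.
For example, if 1 row is left over, it goes to the first bucket.
If 2 rows are left over, they go to the first and second buckets, one each.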
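
The distribution rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Hive's implementation; the helper names ntile_sizes and ntile_labels are made up for this example.

```python
def ntile_sizes(row_count, n):
    """Return how many rows land in each of the n buckets."""
    base, extra = divmod(row_count, n)
    # The first `extra` buckets each receive one additional row.
    return [base + 1 if i < extra else base for i in range(n)]


def ntile_labels(row_count, n):
    """Return the 1-based bucket number assigned to each ordered row."""
    labels = []
    for bucket, size in enumerate(ntile_sizes(row_count, n), start=1):
        labels.extend([bucket] * size)
    return labels


print(ntile_sizes(7, 3))   # [3, 2, 2]
print(ntile_labels(7, 4))  # [1, 1, 2, 2, 3, 3, 4]
```

With 7 rows and 4 buckets the remainder is 3, so buckets 1 to 3 hold two rows each and bucket 4 holds one.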

2.1 Example

select id,crtime,pv,
ntile(2) over(partition by id order by crtime) n2, -- 2 buckets
ntile(3) over(partition by id order by crtime) n3, -- 3 buckets
ntile(4) over(partition by id order by crtime) n4, -- 4 buckets
ntile(5) over(partition by id order by crtime) n5  -- 5 buckets
from nt;
->
id		crtime			pv		n2		n3		n4		n5
cookie1 2015-04-10      1       1       1       1       1
cookie1 2015-04-11      5       1       1       1       1
cookie1 2015-04-12      7       1       1       2       2
cookie1 2015-04-13      3       1       2       2       2
cookie1 2015-04-14      2       2       2       3       3
cookie1 2015-04-15      4       2       3       3       4
cookie1 2015-04-16      4       2       3       4       5
cookie2 2015-04-10      2       1       1       1       1
cookie2 2015-04-11      3       1       1       1       1
cookie2 2015-04-12      5       1       1       2       2
cookie2 2015-04-13      6       1       2       2       2
cookie2 2015-04-14      3       2       2       3       3
cookie2 2015-04-15      9       2       3       3       4
cookie2 2015-04-16      7       2       3       4       5

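As the output shows, cookie1 has 7 rows. With 2 buckets, 7/2 leaves a remainder of 1, which goes to bucket 1, so there are four 1s and three 2s;
with 3 buckets, 7/3 leaves a remainder of 1, which goes to bucket 1, so there are three 1s, two 2s and two 3s;
with 4 buckets, 7/4 leaves a remainder of 3, which goes to buckets 1, 2 and 3, so there are two 1s, two 2s, two 3s and one 4;
with 5 buckets, 7/5 leaves a remainder of 2, which goes to buckets 1 and 2, so there are two 1s, two 2s, one 3, one 4 and one 5.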

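Requirement: for each cookie, what is the total pv over the first third of its days?
Approach: use ntile(3) to split each cookie's days into three buckets, then sum pv over the rows whose ntile value is 1.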

select t.id,sum(t.pv) spv from
(select id,crtime,pv,ntile(3) over(partition by id order by crtime) nt3 from nt) t 
where t.nt3 = 1
group by t.id;
->
id		spv	
cookie1 13
cookie2 10

3. lag, lead, first_value, last_value

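These functions are often used on time series. lag and lead do not accept a window frame clause (rows between ...); first_value and last_value do.
lag(col,n,default): returns the value of col from the row n rows before the current row within the partition.

  • col: the column name; n: how many rows to look back, 1 if omitted; default: the value returned when no such row exists, null if omitted.

lead(col,n,default): returns the value of col from the row n rows after the current row within the partition.

  • col: the column name; n: how many rows to look ahead, 1 if omitted; default: the value returned when no such row exists, null if omitted.

first_value(col): returns the first value in the window frame. With order by and no explicit frame, the frame runs from the first row of the partition up to the current row, so this is the first value of the sorted partition.
last_value(col): returns the last value in the window frame. With order by and no explicit frame, the frame ends at the current row, so when the sort key is unique within the partition (as crtime is here) this is just the current row's own value.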
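
To make the offset and default arguments concrete, here is a small Python model of lag and lead over one already-sorted partition. It is a sketch of the semantics described above, not Hive code; the functions lag and lead below are plain Python helpers written for this illustration.

```python
def lag(values, n=1, default=None):
    """Value n rows before each position, or default when no such row exists."""
    return [values[i - n] if i - n >= 0 else default
            for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value n rows after each position, or default when no such row exists."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]


crtime = ["2015-04-10", "2015-04-11", "2015-04-12", "2015-04-13"]
print(lag(crtime, 1, "a"))   # ['a', '2015-04-10', '2015-04-11', '2015-04-12']
print(lead(crtime, 2, "b"))  # ['2015-04-12', '2015-04-13', 'b', 'b']
```

The first row has nothing before it, so lag falls back to the default; the last two rows have nothing two rows ahead, so lead does the same.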

3.1 Example

select *,
lag(crtime,1,'a') over(partition by id order by crtime) lagc,
lead(crtime,2,'b') over(partition by id order by crtime) leadc,
first_value(pv) over(partition by id order by crtime) fpv,
last_value(pv) over(partition by id order by crtime) lpv 
from nt;
->
id		crtime			pv		lagc			leadc			fpv		lpv
cookie1 2015-04-10      1       a       		2015-04-12      1       1
cookie1 2015-04-11      5       2015-04-10      2015-04-13      1       5
cookie1 2015-04-12      7       2015-04-11      2015-04-14      1       7
cookie1 2015-04-13      3       2015-04-12      2015-04-15      1       3
cookie1 2015-04-14      2       2015-04-13      2015-04-16      1       2
cookie1 2015-04-15      4       2015-04-14      b       		1       4
cookie1 2015-04-16      4       2015-04-15      b       		1       4
cookie2 2015-04-10      2       a       		2015-04-12      2       2
cookie2 2015-04-11      3       2015-04-10      2015-04-13      2       3
cookie2 2015-04-12      5       2015-04-11      2015-04-14      2       5
cookie2 2015-04-13      6       2015-04-12      2015-04-15      2       6
cookie2 2015-04-14      3       2015-04-13      2015-04-16      2       3
cookie2 2015-04-15      9       2015-04-14      b       		2       9
cookie2 2015-04-16      7       2015-04-15      b       		2       7
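
As a typical time-series use, lag can be subtracted from the current value to get the day-over-day change in pv. The sketch below runs the query against Python's built-in sqlite3 module as a stand-in for Hive (an assumption made so the example is runnable; it needs SQLite 3.25 or later). The column alias pv_change is invented for this example.

```python
import sqlite3

# pv per day for cookie1, copied from the sample data.
rows = [
    ("cookie1", "2015-04-10", 1),
    ("cookie1", "2015-04-11", 5),
    ("cookie1", "2015-04-12", 7),
    ("cookie1", "2015-04-13", 3),
    ("cookie1", "2015-04-14", 2),
    ("cookie1", "2015-04-15", 4),
    ("cookie1", "2015-04-16", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table nt (id text, crtime text, pv integer)")
conn.executemany("insert into nt values (?, ?, ?)", rows)

# lag(pv) is null on the first day, so the difference is null there too.
query = """
select crtime, pv,
       pv - lag(pv) over (partition by id order by crtime) as pv_change
from nt
order by crtime
"""
result = conn.execute(query).fetchall()
for crtime, pv, pv_change in result:
    print(crtime, pv, pv_change)
```

The first row prints None for pv_change because there is no previous day to compare with.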

3.1.1 Question 1: how to get the last pv value in each partition

select *,
first_value(pv) over(partition by id order by crtime desc) newpv 
from nt;
->
id		crtime			pv		newpv
cookie1 2015-04-16      4       4
cookie1 2015-04-15      4       4
cookie1 2015-04-14      2       4
cookie1 2015-04-13      3       4
cookie1 2015-04-12      7       4
cookie1 2015-04-11      5       4
cookie1 2015-04-10      1       4
cookie2 2015-04-16      7       7
cookie2 2015-04-15      9       7
cookie2 2015-04-14      3       7
cookie2 2015-04-13      6       7
cookie2 2015-04-12      5       7
cookie2 2015-04-11      3       7
cookie2 2015-04-10      2       7
The rows now come out in descending crtime order, though. To list them in ascending order, add order by id,crtime to the outer query:

select *,
first_value(pv) over(partition by id order by crtime desc) newpv 
from nt
order by id,crtime;
->
id		crtime			pv		newpv
cookie1 2015-04-10      1       4
cookie1 2015-04-11      5       4
cookie1 2015-04-12      7       4
cookie1 2015-04-13      3       4
cookie1 2015-04-14      2       4
cookie1 2015-04-15      4       4
cookie1 2015-04-16      4       4
cookie2 2015-04-10      2       7
cookie2 2015-04-11      3       7
cookie2 2015-04-12      5       7
cookie2 2015-04-13      6       7
cookie2 2015-04-14      3       7
cookie2 2015-04-15      9       7
cookie2 2015-04-16      7       7
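
Another option is to keep the ascending order and give last_value an explicit frame that covers the whole partition. The sketch below compares the default frame with the widened one. It runs against Python's built-in sqlite3 module as a stand-in for Hive (an assumption made so the example is runnable; it needs SQLite 3.25 or later). The rows between clause is standard SQL, but check it against your Hive version before relying on it. The aliases lpv_default and lpv_full are invented for this example.

```python
import sqlite3

# pv per day for cookie2, copied from the sample data.
rows = [
    ("cookie2", "2015-04-10", 2),
    ("cookie2", "2015-04-11", 3),
    ("cookie2", "2015-04-12", 5),
    ("cookie2", "2015-04-13", 6),
    ("cookie2", "2015-04-14", 3),
    ("cookie2", "2015-04-15", 9),
    ("cookie2", "2015-04-16", 7),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table nt (id text, crtime text, pv integer)")
conn.executemany("insert into nt values (?, ?, ?)", rows)

# lpv_default uses the default frame (partition start to current row);
# lpv_full extends the frame to the end of the partition.
query = """
select crtime, pv,
       last_value(pv) over (partition by id order by crtime) as lpv_default,
       last_value(pv) over (partition by id order by crtime
                            rows between unbounded preceding
                                     and unbounded following) as lpv_full
from nt
order by crtime
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

In this run lpv_default simply repeats each row's own pv, while lpv_full is 7 on every row, the pv of the latest day.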

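3.1.2 Question 2: what happens without order by?

Without order by in the over clause, the order of rows within each partition is not guaranteed: crtime comes out neither ascending nor descending, so the results of lag, lead, first_value and last_value depend on whatever order the engine happens to process the rows in.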

select *,
lag(pv) over(partition by id) lagc,  -- defaults to the previous row; null when there is no previous row
lead(pv) over(partition by id) leadc -- defaults to the next row; null when there is no next row
from nt;
->
id		crtime			pv		lagc	leadc
cookie1 2015-04-10      1       NULL    4
cookie1 2015-04-16      4       1       4
cookie1 2015-04-15      4       4       2
cookie1 2015-04-14      2       4       3
cookie1 2015-04-13      3       2       7
cookie1 2015-04-12      7       3       5
cookie1 2015-04-11      5       7       NULL
cookie2 2015-04-16      7       NULL    9
cookie2 2015-04-15      9       7       3
cookie2 2015-04-14      3       9       6
cookie2 2015-04-13      6       3       5
cookie2 2015-04-12      5       6       3
cookie2 2015-04-11      3       5       2
cookie2 2015-04-10      2       3       NULL

select *,
first_value(pv) over(partition by id) fpv, -- first value of the partition
last_value(pv) over(partition by id) lpv   -- last value of the partition
from nt;
->
id		crtime			pv		fpv		lpv 
cookie1 2015-04-10      1       1       5
cookie1 2015-04-16      4       1       5
cookie1 2015-04-15      4       1       5
cookie1 2015-04-14      2       1       5
cookie1 2015-04-13      3       1       5
cookie1 2015-04-12      7       1       5
cookie1 2015-04-11      5       1       5
cookie2 2015-04-16      7       7       2
cookie2 2015-04-15      9       7       2
cookie2 2015-04-14      3       7       2
cookie2 2015-04-13      6       7       2
cookie2 2015-04-12      5       7       2
cookie2 2015-04-11      3       7       2
cookie2 2015-04-10      2       7       2
