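OVER(): specifies the data window that the analytic function operates on; the window may change from row to row.
CURRENT ROW: the current row
n PRECEDING: the n rows before the current row
n FOLLOWING: the n rows after the current row
UNBOUNDED: no bound. UNBOUNDED PRECEDING starts the window at the first row of the partition; UNBOUNDED FOLLOWING ends it at the last row of the partition
LAG(col,n,default_val): the value of col from the nth row before the current row, or default_val if there is no such row
LEAD(col,n,default_val): the value of col from the nth row after the current row, or default_val if there is no such row
NTILE(n): distributes the rows of an ordered partition into n groups numbered from 1 and, for each row, returns the number of the group that the row belongs to. Note: n must be of type int.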
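The frame keywords above can be combined in a ROWS BETWEEN clause. The following is a minimal sketch, assuming the business(name, orderdate, cost) table that is created later in this section; the column aliases prev_cur_sum, neighbor_sum, and rest_sum are illustrative names only.

```sql
-- Sketch only: aliases are hypothetical, table is the business table from this section.
select
  name,
  orderdate,
  cost,
  -- previous row plus the current row
  sum(cost) over(partition by name order by orderdate
                 rows between 1 preceding and current row) as prev_cur_sum,
  -- previous row, current row, and next row
  sum(cost) over(partition by name order by orderdate
                 rows between 1 preceding and 1 following) as neighbor_sum,
  -- current row through the last row of the partition
  sum(cost) over(partition by name order by orderdate
                 rows between current row and unbounded following) as rest_sum
from
  business;
```

Each window is evaluated per customer in order of orderdate, so the first row of a customer has no preceding row and prev_cur_sum there equals cost.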
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
[test@hadoop102 datas]$ vi business.txt
-- Create the table in Hive
0: jdbc:hive2://hadoop102:10000> create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Load the data
0: jdbc:hive2://hadoop102:10000> load data local inpath "/opt/module/datas/business.txt" into table business;
(1) Query the customers who made a purchase in April 2017 and the total number of such customers
select
name,
count(*) over()
from
business
where
substring(orderdate,1,7) = '2017-04'
group by
name;
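The window function in query (1) runs after GROUP BY, so count(*) over() counts the grouped rows, that is, the distinct customers. As a contrast, here is a sketch of the same query without GROUP BY; the alias april_orders is an illustrative name.

```sql
-- Sketch only: without GROUP BY the window spans every April order row,
-- so the count is the number of orders, not the number of customers.
select
  name,
  count(*) over() as april_orders
from
  business
where
  substring(orderdate,1,7) = '2017-04';
```

With the sample data, April 2017 contains one order from jack and four from mart, so this variant counts 5 on every row, while the grouped query counts 2.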
(2) Query each customer's purchase details together with that customer's monthly purchase total
select
name,
orderdate,
cost,
sum(cost) over(partition by month(orderdate), name)
from
business;
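Note that month(orderdate) returns only the month number, so orders from the same month of different years would fall into one partition. The sample data covers 2017 only, so the result is the same here. A sketch that partitions by year and month instead, with month_total as an illustrative alias:

```sql
-- Sketch only: partition by the 'yyyy-MM' prefix so that different years stay separate.
select
  name,
  orderdate,
  cost,
  sum(cost) over(partition by substring(orderdate,1,7), name) as month_total
from
  business;
```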
(3) For the scenario above, accumulate each customer's cost in order of date
select
name ,
orderdate,
cost,
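-- Running total of cost per customer within each month, ordered by order date
sum (cost) over(partition by month(orderdate) ,name order by orderdate ),
-- Running total of cost per customer, ordered by order date
sum (cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row),
-- Same result as the previous column here: with ORDER BY and no frame, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which differs only when a customer has duplicate orderdate values
sum (cost) over(partition by name order by orderdate )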
from
business;
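The difference between a ROWS frame and a RANGE frame only shows up when the ORDER BY key has ties. The sketch below forces ties by ordering on the year-month prefix instead of the full date; the aliases rows_sum and range_sum are illustrative names.

```sql
-- Sketch only: ordering by 'yyyy-MM' makes several rows of one customer tie.
select
  name,
  orderdate,
  cost,
  -- ROWS: accumulates one physical row at a time; the order among tied rows is not guaranteed
  sum(cost) over(partition by name order by substring(orderdate,1,7)
                 rows between unbounded preceding and current row) as rows_sum,
  -- RANGE: all rows that tie with the current row are included together
  sum(cost) over(partition by name order by substring(orderdate,1,7)
                 range between unbounded preceding and current row) as range_sum
from
  business;
```

Under the RANGE frame, all orders of a customer within the same month receive the same cumulative value, because tied rows are peers of one another.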
(4) Query each customer's previous purchase date
select
name,
orderdate,
cost,
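-- Third argument: if there is no previous row, fill with 1970-01-01
lag(orderdate,1,'1970-01-01') over(partition by name order by orderdate) as time1,
-- Third argument omitted: NULL is returned when there is no row two positions back
lag(orderdate,2) over (partition by name order by orderdate) as time2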
from business;
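LEAD works the same way in the opposite direction. A minimal sketch that returns each customer's next purchase date; the alias next_time and the filler value '9999-12-31' are assumptions chosen for illustration.

```sql
-- Sketch only: '9999-12-31' is an arbitrary filler for a customer's last order.
select
  name,
  orderdate,
  cost,
  lead(orderdate,1,'9999-12-31') over(partition by name order by orderdate) as next_time
from business;
```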
(5) Query the orders in the earliest 20% by order date (with 14 rows, NTILE(5) puts the 3 earliest orders in group 1)
select
*
from
(select
name,
orderdate,
cost,
ntile(5) over (order by orderdate) sorted
from business) t
where
sorted = 1;
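To see how NTILE spreads the rows, the groups can be counted. This is a sketch under the same table assumptions; tile_no and orders_in_tile are illustrative aliases.

```sql
-- Sketch only: count how many orders fall into each of the 5 groups.
select
  tile_no,
  count(*) as orders_in_tile
from
  (select
     ntile(5) over (order by orderdate) as tile_no
   from business) t
group by
  tile_no;
```

When the row count does not divide evenly, the lower-numbered groups receive the extra rows, so 14 rows should split into groups of 3, 3, 3, 3, and 2.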