Select 窗口函数
over():指定分析函数的数据窗口大小,这个数据窗口大小随行的变化而变化。
CURRENT ROW:当前行
n PRECEDING:往前n行(包含其间所有)
n FOLLOWING:往后n行(包含其间所有)
UNBOUNDED PRECEDING:表示从前面的起点开始
UNBOUNDED FOLLOWING:表示到后面的终点开始
LAG(col,n):往前第n行数据
LEAD(col,n):往后第n行数据
NTILE(n):分n组,并给分组编号
示例准备 --------------------------------------
business.txt
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
hive> create table business(name string, orderdate string, cost int)
row format delimited
fields terminated by ',';
hive> load data local inpath '/home/user000/data/business.txt' into table business;
查询在2017年4月购买过的顾客及总人数
错误:hive> select name,count(1) from business where substring(orderdate,1,7) = '2017-04' group by name;
+-------+------+--+
| name | _c1 |
+-------+------+--+
| jack | 1 |
| mart | 4 |
+-------+------+--+
正确:hive> select name,count(1) over() from business where substring(orderdate,1,7) = '2017-04' group by name;
+-------+---------+--+
| name | _wcol0 |
+-------+---------+--+
| mart | 2 |
| jack | 2 |
+-------+---------+--+
解释:group by name后窗口是2个,所以加一个over指定窗口,然后count就能得到正确答案。
如果不加,每个name下有各自的记录,count是分别计算每个name下的记录。
查询顾客的购买明细及月购买总额
hive(default)> select *,sum(cost) over(partition by month(orderdate)) from business;
hive(default)> select *,sum(cost) over(distribute by month(orderdate)) from business;
+----------------+---------------------+----------------+---------+--+
| business.name | business.orderdate | business.cost | _wcol0 |
+----------------+---------------------+----------------+---------+--+
| jack | 2017-01-01 | 10 | 205 |
| jack | 2017-01-08 | 55 | 205 |
| tony | 2017-01-07 | 50 | 205 |
| jack | 2017-01-05 | 46 | 205 |
| tony | 2017-01-04 | 29 | 205 |
| tony | 2017-01-02 | 15 | 205 |
| jack | 2017-02-03 | 23 | 23 |
| mart | 2017-04-13 | 94 | 341 |
| jack | 2017-04-06 | 42 | 341 |
| mart | 2017-04-11 | 75 | 341 |
| mart | 2017-04-09 | 68 | 341 |
| mart | 2017-04-08 | 62 | 341 |
| neil | 2017-05-10 | 12 | 12 |
| neil | 2017-06-12 | 80 | 80 |
+----------------+---------------------+----------------+---------+--+
解释:partition by orderdate会根据日期分区,使用partition by month(orderdate)按月分区。
上述的场景,要将cost按照日期进行累加
hive(default)> select *,sum(cost) over(sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW) from business;
+----------------+---------------------+----------------+---------+--+
| business.name | business.orderdate | business.cost | _wcol0 |
+----------------+---------------------+----------------+---------+--+
| jack | 2017-01-01 | 10 | 10 |
| tony | 2017-01-02 | 15 | 25 |
| tony | 2017-01-04 | 29 | 54 |
| jack | 2017-01-05 | 46 | 100 |
| tony | 2017-01-07 | 50 | 150 |
| jack | 2017-01-08 | 55 | 205 |
| jack | 2017-02-03 | 23 | 228 |
| jack | 2017-04-06 | 42 | 270 |
| mart | 2017-04-08 | 62 | 332 |
| mart | 2017-04-09 | 68 | 400 |
| mart | 2017-04-11 | 75 | 475 |
| mart | 2017-04-13 | 94 | 569 |
| neil | 2017-05-10 | 12 | 581 |
| neil | 2017-06-12 | 80 | 661 |
+----------------+---------------------+----------------+---------+--+
按名字统计,每个人的累加:
+----------------+---------------------+----------------+---------+--+
| business.name | business.orderdate | business.cost | _wcol0 |
+----------------+---------------------+----------------+---------+--+
| jack | 2017-01-01 | 10 | 10 |
| jack | 2017-01-05 | 46 | 56 |
| jack | 2017-01-08 | 55 | 111 |
| jack | 2017-02-03 | 23 | 134 |
| jack | 2017-04-06 | 42 | 176 |
| mart | 2017-04-08 | 62 | 62 |
| mart | 2017-04-09 | 68 | 130 |
| mart | 2017-04-11 | 75 | 205 |
| mart | 2017-04-13 | 94 | 299 |
| neil | 2017-05-10 | 12 | 12 |
| neil | 2017-06-12 | 80 | 92 |
| tony | 2017-01-02 | 15 | 15 |
| tony | 2017-01-04 | 29 | 44 |
| tony | 2017-01-07 | 50 | 94 |
+----------------+---------------------+----------------+---------+--+
查询顾客上次的购买时间
hive(default)> select name,orderdate,lag(orderdate,1) over(distribute by name sort by orderdate) from business;
+-------+-------------+-------------+--+
| name | orderdate | _wcol0 |
+-------+-------------+-------------+--+
| jack | 2017-01-01 | NULL |
| jack | 2017-01-05 | 2017-01-01 |
| jack | 2017-01-08 | 2017-01-05 |
| jack | 2017-02-03 | 2017-01-08 |
| jack | 2017-04-06 | 2017-02-03 |
| mart | 2017-04-08 | NULL |
| mart | 2017-04-09 | 2017-04-08 |
| mart | 2017-04-11 | 2017-04-09 |
| mart | 2017-04-13 | 2017-04-11 |
| neil | 2017-05-10 | NULL |
| neil | 2017-06-12 | 2017-05-10 |
| tony | 2017-01-02 | NULL |
| tony | 2017-01-04 | 2017-01-02 |
| tony | 2017-01-07 | 2017-01-04 |
+-------+-------------+-------------+--+
lead示例:
hive(default)> select name,orderdate,lead(orderdate,1) over(distribute by name sort by orderdate) from business;
+-------+-------------+-------------+--+
| name | orderdate | _wcol0 |
+-------+-------------+-------------+--+
| jack | 2017-01-01 | 2017-01-05 |
| jack | 2017-01-05 | 2017-01-08 |
| jack | 2017-01-08 | 2017-02-03 |
| jack | 2017-02-03 | 2017-04-06 |
| jack | 2017-04-06 | NULL |
| mart | 2017-04-08 | 2017-04-09 |
| mart | 2017-04-09 | 2017-04-11 |
| mart | 2017-04-11 | 2017-04-13 |
| mart | 2017-04-13 | NULL |
| neil | 2017-05-10 | 2017-06-12 |
| neil | 2017-06-12 | NULL |
| tony | 2017-01-02 | 2017-01-04 |
| tony | 2017-01-04 | 2017-01-07 |
| tony | 2017-01-07 | NULL |
+-------+-------------+-------------+--+
查询前20%的订单信息
ntile(5),然后取第一组,就是20%了。
hive(default)> select name,orderdate,cost,ntile(5) over(sort by orderdate) from business;
+-------+-------------+-------+---------+--+
| name | orderdate | cost | _wcol0 |
+-------+-------------+-------+---------+--+
| jack | 2017-01-01 | 10 | 1 |
| tony | 2017-01-02 | 15 | 1 |
| tony | 2017-01-04 | 29 | 1 |
| jack | 2017-01-05 | 46 | 2 |
| tony | 2017-01-07 | 50 | 2 |
| jack | 2017-01-08 | 55 | 2 |
| jack | 2017-02-03 | 23 | 3 |
| jack | 2017-04-06 | 42 | 3 |
| mart | 2017-04-08 | 62 | 3 |
| mart | 2017-04-09 | 68 | 4 |
| mart | 2017-04-11 | 75 | 4 |
| mart | 2017-04-13 | 94 | 4 |
| neil | 2017-05-10 | 12 | 5 |
| neil | 2017-06-12 | 80 | 5 |
+-------+-------------+-------+---------+--+
hive(default)> select * from
(select name,orderdate,cost,ntile(5) over(sort by orderdate) gid from business ) t
where gid = 1;
+---------+--------------+---------+--------+--+
| t.name | t.orderdate | t.cost | t.gid |
+---------+--------------+---------+--------+--+
| jack | 2017-01-01 | 10 | 1 |
| tony | 2017-01-02 | 15 | 1 |
| tony | 2017-01-04 | 29 | 1 |
+---------+--------------+---------+--------+--+
RANK() DENSE_RANK() ROW_NUMBER()
RANK():排序相同时会重复,总数不会变,例如排序结果是 1 1 3 4
DENSE_RANK():排序相同时,总数会减少,例如排序结果是 1 1 2 3
ROW_NUMBER():行号