Hive窗口函数 over()

有以下数据:

Jack,2017-01-01,10
Tony,2017-01-02,15
Jack,2017-02-03,23
Tony,2017-01-04,29
Jack,2017-01-05,46
Jack,2017-04-06,42
Tony,2017-01-07,50
Jack,2017-01-08,55
Mark,2017-04-08,62
Mart,2017-04-09,68
Meil,2017-05-10,12
Mart,2017-04-11,75
Meil,2017-06-12,80
Mart,2017-04-13,94

需求:

1、查询2017-04购买的顾客总人数

2、顾客购买明细及月份总额

3、上述场景,将cost按日期累加

4、查询顾客上次购买时间

5、查询前20%购买的订单信息


一、建表并导入数据:

-- 建表
create table business(
    name string,
    orderdate string,
    cost int)
row format delimited
fields terminated by ",";

--导入数据
load data local inpath "/usr/local/src/test4/hive/business.txt" into table business;

查询表:

Hive窗口函数 over()_第1张图片

二、需求分析

1、查询2017-04购买的顾客总人数

a、首先想到使用聚合函数count()

-- 先求出2017-04这月一共有多少条记录
select count(*) from business where substr(orderdate,1,7) = "2017-04";

res:
+------+--+
| _c0  |
+------+--+
| 5    |
+------+--+

b、现在按照顾客进行分组 

select name,count(*) from business where substr(orderdate,1,7) = "2017-04" group by name;

res:
+-------+------+--+
| name  | _c1  |
+-------+------+--+
| Jack  | 1    |
| Mark  | 1    |
| Mart  | 3    |
+-------+------+--+

数据被分成了三组:

Hive窗口函数 over()_第2张图片                     Hive窗口函数 over()_第3张图片

使用over()函数:over只对聚合函数起作用,count分别对上面三个组内进行计数,over统计一共有多少个组(有一个count进行累加一次)

select name,count(*) over() total_num from business where substr(orderdate,1,7) = "2017-04" group by name;

res:
+-------+------------+--+
| name  | total_num  |
+-------+------------+--+
| Mart  | 3          |
| Mark  | 3          |
| Jack  | 3          |
+-------+------------+--+

2.查询顾客购买明细及月份总额

a、首先选出所有明细信息:

select * from business;

res:
+----------------+---------------------+----------------+--+
| business.name  | business.orderdate  | business.cost  |
+----------------+---------------------+----------------+--+
| Jack           | 2017-01-01          | 10             |
| Tony           | 2017-01-02          | 15             |
| Jack           | 2017-02-03          | 23             |
| Tony           | 2017-01-04          | 29             |
| Jack           | 2017-01-05          | 46             |
| Jack           | 2017-04-06          | 42             |
| Tony           | 2017-01-07          | 50             |
| Jack           | 2017-01-08          | 55             |
| Mark           | 2017-04-08          | 62             |
| Mart           | 2017-04-09          | 68             |
| Meil           | 2017-05-10          | 12             |
| Mart           | 2017-04-11          | 75             |
| Meil           | 2017-06-12          | 80             |
| Mart           | 2017-04-13          | 94             |
+----------------+---------------------+----------------+--+

b、求总额:(这是所有数据的总和,因为没有分组(group by),所以over()的针对的是每一条数据)

select *, sum(cost) over() from business; 

res:
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| Mart           | 2017-04-13          | 94             | 661           |
| Meil           | 2017-06-12          | 80             | 661           |
| Mart           | 2017-04-11          | 75             | 661           |
| Meil           | 2017-05-10          | 12             | 661           |
| Mart           | 2017-04-09          | 68             | 661           |
| Mark           | 2017-04-08          | 62             | 661           |
| Jack           | 2017-01-08          | 55             | 661           |
| Tony           | 2017-01-07          | 50             | 661           |
| Jack           | 2017-04-06          | 42             | 661           |
| Jack           | 2017-01-05          | 46             | 661           |
| Tony           | 2017-01-04          | 29             | 661           |
| Jack           | 2017-02-03          | 23             | 661           |
| Tony           | 2017-01-02          | 15             | 661           |
| Jack           | 2017-01-01          | 10             | 661           |
+----------------+---------------------+----------------+---------------+--+

c、针对四月份的数据,我们需要进行求总额,

思路:分区或者分组,但是使用group by date,只能查询date,(select date ,name group by date)其它字段不能查询

解决:使用窗口函数,并对窗口函数进行分区over(distribute by())  或者over(partition by())

select *,sum(cost) over(distribute by month(orderdate)) from business;

res:
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| Jack           | 2017-01-01          | 10             | 205           |
| Jack           | 2017-01-08          | 55             | 205           |
| Tony           | 2017-01-07          | 50             | 205           |
| Jack           | 2017-01-05          | 46             | 205           |
| Tony           | 2017-01-04          | 29             | 205           |
| Tony           | 2017-01-02          | 15             | 205           |
| Jack           | 2017-02-03          | 23             | 23            |
| Mart           | 2017-04-13          | 94             | 341           |
| Jack           | 2017-04-06          | 42             | 341           |
| Mart           | 2017-04-11          | 75             | 341           |
| Mart           | 2017-04-09          | 68             | 341           |
| Mark           | 2017-04-08          | 62             | 341           |
| Meil           | 2017-05-10          | 12             | 12            |
| Meil           | 2017-06-12          | 80             | 80            |
+----------------+---------------------+----------------+---------------+--+

3、上述场景,将cost按时间累加

分析:

 

a、先按照购买时间进行排序

select * from business sort by orderdate;

res:
+----------------+---------------------+----------------+--+
| business.name  | business.orderdate  | business.cost  |
+----------------+---------------------+----------------+--+
| Jack           | 2017-01-01          | 10             |
| Tony           | 2017-01-02          | 15             |
| Tony           | 2017-01-04          | 29             |
| Jack           | 2017-01-05          | 46             |
| Tony           | 2017-01-07          | 50             |
| Jack           | 2017-01-08          | 55             |
| Jack           | 2017-02-03          | 23             |
| Jack           | 2017-04-06          | 42             |
| Mark           | 2017-04-08          | 62             |
| Mart           | 2017-04-09          | 68             |
| Mart           | 2017-04-11          | 75             |
| Mart           | 2017-04-13          | 94             |
| Meil           | 2017-05-10          | 12             |
| Meil           | 2017-06-12          | 80             |
+----------------+---------------------+----------------+--+
-- 参数讲解
    -- sort by orderdate:按照购买日期进行排序
    -- UNBOUNDED PRECEDING:从起点开始
    -- CURRENT ROW:到当前行
-- 计算从开始到当前时间的总花费

select *,sum(cost) over(sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT 
ROW) from business;

res:
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| Jack           | 2017-01-01          | 10             | 10            |
| Tony           | 2017-01-02          | 15             | 25            |
| Tony           | 2017-01-04          | 29             | 54            |
| Jack           | 2017-01-05          | 46             | 100           |
| Tony           | 2017-01-07          | 50             | 150           |
| Jack           | 2017-01-08          | 55             | 205           |
| Jack           | 2017-02-03          | 23             | 228           |
| Jack           | 2017-04-06          | 42             | 270           |
| Mark           | 2017-04-08          | 62             | 332           |
| Mart           | 2017-04-09          | 68             | 400           |
| Mart           | 2017-04-11          | 75             | 475           |
| Mart           | 2017-04-13          | 94             | 569           |
| Meil           | 2017-05-10          | 12             | 581           |
| Meil           | 2017-06-12          | 80             | 661           |
+----------------+---------------------+----------------+---------------+--+

row函数:

current row:当前行

n  PRECEDING:往前n行

n  FOLLOWING:往后n行

UNBOUNDED:起点

UNBOUNDED PRECEDING:从前面起点

UNBOUNDED FOLLOWING:到后面终点

LAG(col,n):往前的第n行

LEAD(col,n):往后的第n行

--参数讲解
    -- sort by orderdate:按照时间排序
    -- 1 preceding:当前行的前1行
    -- 1 following:当前行的后一行
-- 计算相邻三行的值(第一行计算当前行 + 后一行;  最后一行计算当前行 + 前一行)
select *,sum(cost) over(sort by orderdate rows between 1 preceding and 1 following) from business;

res:
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| Jack           | 2017-01-01          | 10             | 25            |
| Tony           | 2017-01-02          | 15             | 54            |
| Tony           | 2017-01-04          | 29             | 90            |
| Jack           | 2017-01-05          | 46             | 125           |
| Tony           | 2017-01-07          | 50             | 151           |
| Jack           | 2017-01-08          | 55             | 128           |
| Jack           | 2017-02-03          | 23             | 120           |
| Jack           | 2017-04-06          | 42             | 127           |
| Mark           | 2017-04-08          | 62             | 172           |
| Mart           | 2017-04-09          | 68             | 205           |
| Mart           | 2017-04-11          | 75             | 237           |
| Mart           | 2017-04-13          | 94             | 181           |
| Meil           | 2017-05-10          | 12             | 186           |
| Meil           | 2017-06-12          | 80             | 92            |
+----------------+---------------------+----------------+---------------+--+

 demo2:

-- 参数详解:
    -- distribute by name:按名字进行分区
    --  sort by orderdate:在每个分区中按照时间进行排序
    --  UNBOUNDED PRECEDING and current row:从起点行到当前行
-- 计算每个人一共的总花费

select *,sum(cost) over(distribute by name sort by orderdate rows between UNBOUNDED PRECEDING and current row) from business;

res:
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| Jack           | 2017-01-01          | 10             | 10            |
| Jack           | 2017-01-05          | 46             | 56            |
| Jack           | 2017-01-08          | 55             | 111           |
| Jack           | 2017-02-03          | 23             | 134           |
| Jack           | 2017-04-06          | 42             | 176           |
| Mark           | 2017-04-08          | 62             | 62            |
| Mart           | 2017-04-09          | 68             | 68            |
| Mart           | 2017-04-11          | 75             | 143           |
| Mart           | 2017-04-13          | 94             | 237           |
| Meil           | 2017-05-10          | 12             | 12            |
| Meil           | 2017-06-12          | 80             | 92            |
| Tony           | 2017-01-02          | 15             | 15            |
| Tony           | 2017-01-04          | 29             | 44            |
| Tony           | 2017-01-07          | 50             | 94            |
+----------------+---------------------+----------------+---------------+--+

demo3:

--参数讲解:
    -- sort by orderdate:按照时间排序
    -- current row and unbounded following:当前行到终点行

select *,sum(cost) over(sort by orderdate rows between current row and unbounded following) from business;

res:
+----------------+---------------------+----------------+---------------+--+
| business.name  | business.orderdate  | business.cost  | sum_window_0  |
+----------------+---------------------+----------------+---------------+--+
| Jack           | 2017-01-01          | 10             | 661           |
| Tony           | 2017-01-02          | 15             | 651           |
| Tony           | 2017-01-04          | 29             | 636           |
| Jack           | 2017-01-05          | 46             | 607           |
| Tony           | 2017-01-07          | 50             | 561           |
| Jack           | 2017-01-08          | 55             | 511           |
| Jack           | 2017-02-03          | 23             | 456           |
| Jack           | 2017-04-06          | 42             | 433           |
| Mark           | 2017-04-08          | 62             | 391           |
| Mart           | 2017-04-09          | 68             | 329           |
| Mart           | 2017-04-11          | 75             | 261           |
| Mart           | 2017-04-13          | 94             | 186           |
| Meil           | 2017-05-10          | 12             | 92            |
| Meil           | 2017-06-12          | 80             | 80            |
+----------------+---------------------+----------------+---------------+--+

四、查询顾客上次购买时间,以及下次购买时间(电商网站常用于求页面跳转的前后时间)

分析:lag(clo,n):返回的是当前行的第前n行

-- 参数详解:
    -- distribute by name:按照姓名分组
    -- sort by orderdate:按照时间排序
    -- lag(orderdate,1):返回当前orderdate行的前一行
    -- lead(orderdate,1):返回当前orderdate行的后一行

select *,
    lag(orderdate,1) over(distribute by name sort by orderdate),
    lead(orderdate,1) over(distribute by name sort by orderdate)
from business;

res:
+----------------+---------------------+----------------+---------------+----------------+--+
| business.name  | business.orderdate  | business.cost  | lag_window_0  | lead_window_1  |
+----------------+---------------------+----------------+---------------+----------------+--+
| Jack           | 2017-01-01          | 10             | NULL          | 2017-01-05     |
| Jack           | 2017-01-05          | 46             | 2017-01-01    | 2017-01-08     |
| Jack           | 2017-01-08          | 55             | 2017-01-05    | 2017-02-03     |
| Jack           | 2017-02-03          | 23             | 2017-01-08    | 2017-04-06     |
| Jack           | 2017-04-06          | 42             | 2017-02-03    | NULL           |
| Mark           | 2017-04-08          | 62             | NULL          | NULL           |
| Mart           | 2017-04-09          | 68             | NULL          | 2017-04-11     |
| Mart           | 2017-04-11          | 75             | 2017-04-09    | 2017-04-13     |
| Mart           | 2017-04-13          | 94             | 2017-04-11    | NULL           |
| Meil           | 2017-05-10          | 12             | NULL          | 2017-06-12     |
| Meil           | 2017-06-12          | 80             | 2017-05-10    | NULL           |
| Tony           | 2017-01-02          | 15             | NULL          | 2017-01-04     |
| Tony           | 2017-01-04          | 29             | 2017-01-02    | 2017-01-07     |
| Tony           | 2017-01-07          | 50             | 2017-01-04    | NULL           |
+----------------+---------------------+----------------+---------------+----------------+--+

5、查询前20%购买的订单信息

分析:可以按照时间分成五等份,然后返回其中的第一份

NTILE(n):将数据等分成n份

select *, ntile(5) over(sort by orderdate) from business;

+----------------+---------------------+----------------+---------+--+
| business.name  | business.orderdate  | business.cost  | sorted  |
+----------------+---------------------+----------------+---------+--+
| Jack           | 2017-01-01          | 10             | 1       |
| Tony           | 2017-01-02          | 15             | 1       |
| Tony           | 2017-01-04          | 29             | 1       |
| Jack           | 2017-01-05          | 46             | 2       |
| Tony           | 2017-01-07          | 50             | 2       |
| Jack           | 2017-01-08          | 55             | 2       |
| Jack           | 2017-02-03          | 23             | 3       |
| Jack           | 2017-04-06          | 42             | 3       |
| Mark           | 2017-04-08          | 62             | 3       |
| Mart           | 2017-04-09          | 68             | 4       |
| Mart           | 2017-04-11          | 75             | 4       |
| Mart           | 2017-04-13          | 94             | 4       |
| Meil           | 2017-05-10          | 12             | 5       |
| Meil           | 2017-06-12          | 80             | 5       |
+----------------+---------------------+----------------+---------+--+
-- 下面语句报错,因为 ntile、sum、agg等函数不能放在where后面当做查询条件
select *, ntile(5) over(sort by orderdate) as sorted from business where sorted = 1;

-- 下面语句报错,因为having必须跟在group by 语句后面
select *, ntile(5) over(sort by orderdate) as sorted from business having sorted = 1;


-- 所以使用了子查询,将上一步查询的结果放在子句中
select name,orderdate,cost from (
    select *,ntile(5) over(order by orderdate) sorted from business
) t
where sorted = 1;
-- Tips:子查询不能使用select *

+---------+--------------+---------+-----------+--+
| t.name  | t.orderdate  | t.cost  | t.sorted  |
+---------+--------------+---------+-----------+--+
| Jack    | 2017-01-01   | 10      | 1         |
| Tony    | 2017-01-02   | 15      | 1         |
| Tony    | 2017-01-04   | 29      | 1         |
+---------+--------------+---------+-----------+--+

 

你可能感兴趣的:(Hive)