Given the following data (name, order date, cost):
Jack,2017-01-01,10
Tony,2017-01-02,15
Jack,2017-02-03,23
Tony,2017-01-04,29
Jack,2017-01-05,46
Jack,2017-04-06,42
Tony,2017-01-07,50
Jack,2017-01-08,55
Mark,2017-04-08,62
Mart,2017-04-09,68
Meil,2017-05-10,12
Mart,2017-04-11,75
Meil,2017-06-12,80
Mart,2017-04-13,94
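Requirements:
1. Count the distinct customers who made a purchase in 2017-04
2. Show each customer's purchase details together with the monthly total
3. For the scenario above, accumulate cost by date
4. Find each customer's previous purchase date
5. Find the orders in the first 20% of purchases
I. Create the table and load the data:
-- Create the table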
create table business(
name string,
orderdate string,
cost int)
row format delimited
fields terminated by ",";
-- Load the data
load data local inpath "/usr/local/src/test4/hive/business.txt" into table business;
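II. Requirement analysis
1. Count the distinct customers who made a purchase in 2017-04
a. The first idea is the aggregate function count()
-- First count how many records fall in 2017-04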
select count(*) from business where substr(orderdate,1,7) = "2017-04";
res:
+------+--+
| _c0 |
+------+--+
| 5 |
+------+--+
b. Now group by customer
select name,count(*) from business where substr(orderdate,1,7) = "2017-04" group by name;
res:
+-------+------+--+
| name | _c1 |
+-------+------+--+
| Jack | 1 |
| Mark | 1 |
| Mart | 3 |
+-------+------+--+
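The rows fall into three groups (Jack, Mark, Mart).
Using over(): with an empty over(), the window is the whole result set. Window functions are evaluated after group by, so count(*) over() counts the rows that remain after grouping, i.e. the number of groups (3), and repeats that value on every row.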
select name,count(*) over() total_num from business where substr(orderdate,1,7) = "2017-04" group by name;
res:
+-------+------------+--+
| name | total_num |
+-------+------------+--+
| Mart | 3 |
| Mark | 3 |
| Jack | 3 |
+-------+------------+--+
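As a cross-check of the value 3, the logic can be sketched in plain Python. This is only an illustration, not how Hive executes the query; `ROWS`, `april_names` and `total_num` are names made up for this sketch. In HiveQL, count(distinct name) with the same where clause should give the same number.

```python
# Sample data from the top of the article: (name, orderdate, cost).
ROWS = [
    ("Jack", "2017-01-01", 10), ("Tony", "2017-01-02", 15),
    ("Jack", "2017-02-03", 23), ("Tony", "2017-01-04", 29),
    ("Jack", "2017-01-05", 46), ("Jack", "2017-04-06", 42),
    ("Tony", "2017-01-07", 50), ("Jack", "2017-01-08", 55),
    ("Mark", "2017-04-08", 62), ("Mart", "2017-04-09", 68),
    ("Meil", "2017-05-10", 12), ("Mart", "2017-04-11", 75),
    ("Meil", "2017-06-12", 80), ("Mart", "2017-04-13", 94),
]

# where substr(orderdate,1,7) = "2017-04" ... group by name:
# one entry per customer who bought in April.
april_names = {name for name, orderdate, _ in ROWS if orderdate[:7] == "2017-04"}

# count(*) over() after the grouping: the number of groups.
total_num = len(april_names)
print(sorted(april_names), total_num)
```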
2. Customer purchase details and monthly totals
a. First select all the detail rows:
select * from business;
res:
+----------------+---------------------+----------------+--+
| business.name | business.orderdate | business.cost |
+----------------+---------------------+----------------+--+
| Jack | 2017-01-01 | 10 |
| Tony | 2017-01-02 | 15 |
| Jack | 2017-02-03 | 23 |
| Tony | 2017-01-04 | 29 |
| Jack | 2017-01-05 | 46 |
| Jack | 2017-04-06 | 42 |
| Tony | 2017-01-07 | 50 |
| Jack | 2017-01-08 | 55 |
| Mark | 2017-04-08 | 62 |
| Mart | 2017-04-09 | 68 |
| Meil | 2017-05-10 | 12 |
| Mart | 2017-04-11 | 75 |
| Meil | 2017-06-12 | 80 |
| Mart | 2017-04-13 | 94 |
+----------------+---------------------+----------------+--+
b. Compute the total: there is no group by and over() is empty, so the window covers every row and the grand total (661) is repeated on each detail row.
select *, sum(cost) over() from business;
res:
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| Mart | 2017-04-13 | 94 | 661 |
| Meil | 2017-06-12 | 80 | 661 |
| Mart | 2017-04-11 | 75 | 661 |
| Meil | 2017-05-10 | 12 | 661 |
| Mart | 2017-04-09 | 68 | 661 |
| Mark | 2017-04-08 | 62 | 661 |
| Jack | 2017-01-08 | 55 | 661 |
| Tony | 2017-01-07 | 50 | 661 |
| Jack | 2017-04-06 | 42 | 661 |
| Jack | 2017-01-05 | 46 | 661 |
| Tony | 2017-01-04 | 29 | 661 |
| Jack | 2017-02-03 | 23 | 661 |
| Tony | 2017-01-02 | 15 | 661 |
| Jack | 2017-01-01 | 10 | 661 |
+----------------+---------------------+----------------+---------------+--+
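c. Compute the total per month (April, for example) next to each detail row.
Idea: partition or group the rows. With group by, however, only the grouping key and aggregates can be selected (e.g. select month(orderdate), sum(cost) ... group by month(orderdate)); the other detail columns cannot.
Solution: use a window function and partition the window with over(distribute by ...) or over(partition by ...).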
select *,sum(cost) over(distribute by month(orderdate)) from business;
res:
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| Jack | 2017-01-01 | 10 | 205 |
| Jack | 2017-01-08 | 55 | 205 |
| Tony | 2017-01-07 | 50 | 205 |
| Jack | 2017-01-05 | 46 | 205 |
| Tony | 2017-01-04 | 29 | 205 |
| Tony | 2017-01-02 | 15 | 205 |
| Jack | 2017-02-03 | 23 | 23 |
| Mart | 2017-04-13 | 94 | 341 |
| Jack | 2017-04-06 | 42 | 341 |
| Mart | 2017-04-11 | 75 | 341 |
| Mart | 2017-04-09 | 68 | 341 |
| Mark | 2017-04-08 | 62 | 341 |
| Meil | 2017-05-10 | 12 | 12 |
| Meil | 2017-06-12 | 80 | 80 |
+----------------+---------------------+----------------+---------------+--+
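To verify the monthly totals (205, 23, 341, 12, 80), here is a small Python sketch of what the partitioned sum does: compute one total per partition and attach it to every row of that partition. The names `ROWS`, `month_total` and `with_total` are illustrative. One assumption differs from the query: the sketch keys on year-month rather than on month(orderdate) alone, which gives the same result here only because all the data is from 2017.

```python
from collections import defaultdict

# Sample data from the top of the article: (name, orderdate, cost).
ROWS = [
    ("Jack", "2017-01-01", 10), ("Tony", "2017-01-02", 15),
    ("Jack", "2017-02-03", 23), ("Tony", "2017-01-04", 29),
    ("Jack", "2017-01-05", 46), ("Jack", "2017-04-06", 42),
    ("Tony", "2017-01-07", 50), ("Jack", "2017-01-08", 55),
    ("Mark", "2017-04-08", 62), ("Mart", "2017-04-09", 68),
    ("Meil", "2017-05-10", 12), ("Mart", "2017-04-11", 75),
    ("Meil", "2017-06-12", 80), ("Mart", "2017-04-13", 94),
]

# One running bucket per partition key (year-month, e.g. "2017-04").
month_total = defaultdict(int)
for _, orderdate, cost in ROWS:
    month_total[orderdate[:7]] += cost

# Every detail row keeps its columns and gains its partition's total.
with_total = [(n, d, c, month_total[d[:7]]) for n, d, c in ROWS]
for row in with_total:
    print(row)
```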
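3. For the scenario above, accumulate cost by date
Analysis:
a. First sort by purchase date (note: at query level, sort by only guarantees the order within each reducer; use order by when a global order is required)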
select * from business sort by orderdate;
res:
+----------------+---------------------+----------------+--+
| business.name | business.orderdate | business.cost |
+----------------+---------------------+----------------+--+
| Jack | 2017-01-01 | 10 |
| Tony | 2017-01-02 | 15 |
| Tony | 2017-01-04 | 29 |
| Jack | 2017-01-05 | 46 |
| Tony | 2017-01-07 | 50 |
| Jack | 2017-01-08 | 55 |
| Jack | 2017-02-03 | 23 |
| Jack | 2017-04-06 | 42 |
| Mark | 2017-04-08 | 62 |
| Mart | 2017-04-09 | 68 |
| Mart | 2017-04-11 | 75 |
| Mart | 2017-04-13 | 94 |
| Meil | 2017-05-10 | 12 |
| Meil | 2017-06-12 | 80 |
+----------------+---------------------+----------------+--+
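-- Parameters
-- sort by orderdate: order the rows of the window by purchase date
-- UNBOUNDED PRECEDING: the frame starts at the first row
-- CURRENT ROW: the frame ends at the current row
-- Computes the running total of cost from the first row up to the current row
select *,sum(cost) over(sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW) from business;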
res:
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| Jack | 2017-01-01 | 10 | 10 |
| Tony | 2017-01-02 | 15 | 25 |
| Tony | 2017-01-04 | 29 | 54 |
| Jack | 2017-01-05 | 46 | 100 |
| Tony | 2017-01-07 | 50 | 150 |
| Jack | 2017-01-08 | 55 | 205 |
| Jack | 2017-02-03 | 23 | 228 |
| Jack | 2017-04-06 | 42 | 270 |
| Mark | 2017-04-08 | 62 | 332 |
| Mart | 2017-04-09 | 68 | 400 |
| Mart | 2017-04-11 | 75 | 475 |
| Mart | 2017-04-13 | 94 | 569 |
| Meil | 2017-05-10 | 12 | 581 |
| Meil | 2017-06-12 | 80 | 661 |
+----------------+---------------------+----------------+---------------+--+
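The frame "unbounded preceding to current row" is simply a prefix sum over the date-ordered rows. A minimal Python sketch, assuming the sample data above (`ROWS`, `ordered` and `running` are illustrative names), reproduces the last column of the table:

```python
from itertools import accumulate

# Sample data from the top of the article: (name, orderdate, cost).
ROWS = [
    ("Jack", "2017-01-01", 10), ("Tony", "2017-01-02", 15),
    ("Jack", "2017-02-03", 23), ("Tony", "2017-01-04", 29),
    ("Jack", "2017-01-05", 46), ("Jack", "2017-04-06", 42),
    ("Tony", "2017-01-07", 50), ("Jack", "2017-01-08", 55),
    ("Mark", "2017-04-08", 62), ("Mart", "2017-04-09", 68),
    ("Meil", "2017-05-10", 12), ("Mart", "2017-04-11", 75),
    ("Meil", "2017-06-12", 80), ("Mart", "2017-04-13", 94),
]

# Order by orderdate; ISO dates sort correctly as strings.
ordered = sorted(ROWS, key=lambda r: r[1])

# Prefix sums: element i is the sum of cost over rows 0..i.
running = list(accumulate(cost for _, _, cost in ordered))
print(running)
```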
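Window frame clause (rows between ... and ...):
CURRENT ROW: the current row
n PRECEDING: n rows before the current row
n FOLLOWING: n rows after the current row
UNBOUNDED: no bound (the frame extends to the edge of the partition)
UNBOUNDED PRECEDING: from the first row of the partition
UNBOUNDED FOLLOWING: to the last row of the partition
Related window functions (not part of the frame clause):
LAG(col,n): the value of col from the row n rows before the current row
LEAD(col,n): the value of col from the row n rows after the current row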
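-- Parameters
-- sort by orderdate: order by date
-- 1 preceding: the row before the current row
-- 1 following: the row after the current row
-- Sums each row with its two neighbours (the first row is current + next only; the last row is current + previous only)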
select *,sum(cost) over(sort by orderdate rows between 1 preceding and 1 following) from business;
res:
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| Jack | 2017-01-01 | 10 | 25 |
| Tony | 2017-01-02 | 15 | 54 |
| Tony | 2017-01-04 | 29 | 90 |
| Jack | 2017-01-05 | 46 | 125 |
| Tony | 2017-01-07 | 50 | 151 |
| Jack | 2017-01-08 | 55 | 128 |
| Jack | 2017-02-03 | 23 | 120 |
| Jack | 2017-04-06 | 42 | 127 |
| Mark | 2017-04-08 | 62 | 172 |
| Mart | 2017-04-09 | 68 | 205 |
| Mart | 2017-04-11 | 75 | 237 |
| Mart | 2017-04-13 | 94 | 181 |
| Meil | 2017-05-10 | 12 | 186 |
| Meil | 2017-06-12 | 80 | 92 |
+----------------+---------------------+----------------+---------------+--+
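A sliding frame can be pictured as a slice that is clipped at the edges of the partition. The sketch below is an illustration only; `frame_sum` is a hypothetical helper, not a Hive function. It reproduces the sums in the table above.

```python
# Sample data from the top of the article: (name, orderdate, cost).
ROWS = [
    ("Jack", "2017-01-01", 10), ("Tony", "2017-01-02", 15),
    ("Jack", "2017-02-03", 23), ("Tony", "2017-01-04", 29),
    ("Jack", "2017-01-05", 46), ("Jack", "2017-04-06", 42),
    ("Tony", "2017-01-07", 50), ("Jack", "2017-01-08", 55),
    ("Mark", "2017-04-08", 62), ("Mart", "2017-04-09", 68),
    ("Meil", "2017-05-10", 12), ("Mart", "2017-04-11", 75),
    ("Meil", "2017-06-12", 80), ("Mart", "2017-04-13", 94),
]

# Costs in orderdate order.
costs = [cost for _, _, cost in sorted(ROWS, key=lambda r: r[1])]


def frame_sum(values, i, preceding, following):
    """Sum values in the frame [i - preceding, i + following], clipped to the list."""
    lo = max(0, i - preceding)
    hi = min(len(values), i + following + 1)
    return sum(values[lo:hi])


# rows between 1 preceding and 1 following
sliding = [frame_sum(costs, i, 1, 1) for i in range(len(costs))]
print(sliding)
```

Changing the last two arguments gives other frames, for example `frame_sum(costs, i, 2, 0)` for "2 preceding to current row".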
demo2:
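-- Parameters:
-- distribute by name: partition by customer name
-- sort by orderdate: order by date within each partition
-- UNBOUNDED PRECEDING and current row: from the first row of the partition to the current row
-- Computes each customer's running total of cost (the last row per customer holds that customer's overall total)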
select *,sum(cost) over(distribute by name sort by orderdate rows between UNBOUNDED PRECEDING and current row) from business;
res:
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| Jack | 2017-01-01 | 10 | 10 |
| Jack | 2017-01-05 | 46 | 56 |
| Jack | 2017-01-08 | 55 | 111 |
| Jack | 2017-02-03 | 23 | 134 |
| Jack | 2017-04-06 | 42 | 176 |
| Mark | 2017-04-08 | 62 | 62 |
| Mart | 2017-04-09 | 68 | 68 |
| Mart | 2017-04-11 | 75 | 143 |
| Mart | 2017-04-13 | 94 | 237 |
| Meil | 2017-05-10 | 12 | 12 |
| Meil | 2017-06-12 | 80 | 92 |
| Tony | 2017-01-02 | 15 | 15 |
| Tony | 2017-01-04 | 29 | 44 |
| Tony | 2017-01-07 | 50 | 94 |
+----------------+---------------------+----------------+---------------+--+
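Adding a partition means the running total restarts for every customer. A rough Python equivalent, assuming the sample data above (`ROWS`, `totals` and `result` are illustrative names), keeps one accumulator per name:

```python
# Sample data from the top of the article: (name, orderdate, cost).
ROWS = [
    ("Jack", "2017-01-01", 10), ("Tony", "2017-01-02", 15),
    ("Jack", "2017-02-03", 23), ("Tony", "2017-01-04", 29),
    ("Jack", "2017-01-05", 46), ("Jack", "2017-04-06", 42),
    ("Tony", "2017-01-07", 50), ("Jack", "2017-01-08", 55),
    ("Mark", "2017-04-08", 62), ("Mart", "2017-04-09", 68),
    ("Meil", "2017-05-10", 12), ("Mart", "2017-04-11", 75),
    ("Meil", "2017-06-12", 80), ("Mart", "2017-04-13", 94),
]

# distribute by name sort by orderdate: group by name, then order by date.
ordered = sorted(ROWS, key=lambda r: (r[0], r[1]))

totals = {}   # one running total per customer
result = []
for name, orderdate, cost in ordered:
    totals[name] = totals.get(name, 0) + cost
    result.append((name, orderdate, cost, totals[name]))

for row in result:
    print(row)
```

After the loop, `totals` holds each customer's overall spend, which is the value on that customer's last row.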
demo3:
-- Parameters:
-- sort by orderdate: order by date
-- current row and unbounded following: from the current row to the last row
select *,sum(cost) over(sort by orderdate rows between current row and unbounded following) from business;
res:
+----------------+---------------------+----------------+---------------+--+
| business.name | business.orderdate | business.cost | sum_window_0 |
+----------------+---------------------+----------------+---------------+--+
| Jack | 2017-01-01 | 10 | 661 |
| Tony | 2017-01-02 | 15 | 651 |
| Tony | 2017-01-04 | 29 | 636 |
| Jack | 2017-01-05 | 46 | 607 |
| Tony | 2017-01-07 | 50 | 561 |
| Jack | 2017-01-08 | 55 | 511 |
| Jack | 2017-02-03 | 23 | 456 |
| Jack | 2017-04-06 | 42 | 433 |
| Mark | 2017-04-08 | 62 | 391 |
| Mart | 2017-04-09 | 68 | 329 |
| Mart | 2017-04-11 | 75 | 261 |
| Mart | 2017-04-13 | 94 | 186 |
| Meil | 2017-05-10 | 12 | 92 |
| Meil | 2017-06-12 | 80 | 80 |
+----------------+---------------------+----------------+---------------+--+
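The frame "current row to unbounded following" is a suffix sum: what is still to come, including the current row. One way to sketch it in Python (illustrative names, not Hive internals) is to accumulate over the reversed list and flip the result back:

```python
from itertools import accumulate

# Sample data from the top of the article: (name, orderdate, cost).
ROWS = [
    ("Jack", "2017-01-01", 10), ("Tony", "2017-01-02", 15),
    ("Jack", "2017-02-03", 23), ("Tony", "2017-01-04", 29),
    ("Jack", "2017-01-05", 46), ("Jack", "2017-04-06", 42),
    ("Tony", "2017-01-07", 50), ("Jack", "2017-01-08", 55),
    ("Mark", "2017-04-08", 62), ("Mart", "2017-04-09", 68),
    ("Meil", "2017-05-10", 12), ("Mart", "2017-04-11", 75),
    ("Meil", "2017-06-12", 80), ("Mart", "2017-04-13", 94),
]

# Costs in orderdate order.
costs = [cost for _, _, cost in sorted(ROWS, key=lambda r: r[1])]

# Prefix sums of the reversed list, reversed again, are the suffix sums.
remaining = list(accumulate(reversed(costs)))[::-1]
print(remaining)
```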
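4. Find each customer's previous and next purchase date (e-commerce sites often use this to get the times of the previous and next page view)
Analysis: lag(col,n) returns the value of col from the row n rows before the current row, and lead(col,n) from the row n rows after it. As the result shows, both return NULL when the partition has no such row.
-- Parameters:
-- distribute by name: partition by customer name
-- sort by orderdate: order by date
-- lag(orderdate,1): orderdate of the previous row
-- lead(orderdate,1): orderdate of the next row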
select *,
lag(orderdate,1) over(distribute by name sort by orderdate),
lead(orderdate,1) over(distribute by name sort by orderdate)
from business;
res:
+----------------+---------------------+----------------+---------------+----------------+--+
| business.name | business.orderdate | business.cost | lag_window_0 | lead_window_1 |
+----------------+---------------------+----------------+---------------+----------------+--+
| Jack | 2017-01-01 | 10 | NULL | 2017-01-05 |
| Jack | 2017-01-05 | 46 | 2017-01-01 | 2017-01-08 |
| Jack | 2017-01-08 | 55 | 2017-01-05 | 2017-02-03 |
| Jack | 2017-02-03 | 23 | 2017-01-08 | 2017-04-06 |
| Jack | 2017-04-06 | 42 | 2017-02-03 | NULL |
| Mark | 2017-04-08 | 62 | NULL | NULL |
| Mart | 2017-04-09 | 68 | NULL | 2017-04-11 |
| Mart | 2017-04-11 | 75 | 2017-04-09 | 2017-04-13 |
| Mart | 2017-04-13 | 94 | 2017-04-11 | NULL |
| Meil | 2017-05-10 | 12 | NULL | 2017-06-12 |
| Meil | 2017-06-12 | 80 | 2017-05-10 | NULL |
| Tony | 2017-01-02 | 15 | NULL | 2017-01-04 |
| Tony | 2017-01-04 | 29 | 2017-01-02 | 2017-01-07 |
| Tony | 2017-01-07 | 50 | 2017-01-04 | NULL |
+----------------+---------------------+----------------+---------------+----------------+--+
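The lag/lead behaviour, including the NULLs at the edges of each partition, can be mimicked in Python by indexing into each customer's date list. This is a sketch under the sample data above; `ROWS`, `ordered` and `result` are illustrative names, and `None` stands in for NULL.

```python
from itertools import groupby

# Sample data from the top of the article: (name, orderdate, cost).
ROWS = [
    ("Jack", "2017-01-01", 10), ("Tony", "2017-01-02", 15),
    ("Jack", "2017-02-03", 23), ("Tony", "2017-01-04", 29),
    ("Jack", "2017-01-05", 46), ("Jack", "2017-04-06", 42),
    ("Tony", "2017-01-07", 50), ("Jack", "2017-01-08", 55),
    ("Mark", "2017-04-08", 62), ("Mart", "2017-04-09", 68),
    ("Meil", "2017-05-10", 12), ("Mart", "2017-04-11", 75),
    ("Meil", "2017-06-12", 80), ("Mart", "2017-04-13", 94),
]

# Partition by name, order by date inside each partition.
ordered = sorted(ROWS, key=lambda r: (r[0], r[1]))

result = []
for name, group in groupby(ordered, key=lambda r: r[0]):
    dates = [orderdate for _, orderdate, _ in group]
    for i, orderdate in enumerate(dates):
        prev_date = dates[i - 1] if i > 0 else None               # lag(orderdate,1)
        next_date = dates[i + 1] if i + 1 < len(dates) else None  # lead(orderdate,1)
        result.append((name, orderdate, prev_date, next_date))

for row in result:
    print(row)
```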
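5. Find the orders in the first 20% of purchases
Analysis: split the rows, ordered by date, into five buckets and return the first bucket
NTILE(n): splits the ordered rows into n buckets that are as equal in size as possible and returns each row's bucket number (1 to n)
select *, ntile(5) over(sort by orderdate) sorted from business;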
+----------------+---------------------+----------------+---------+--+
| business.name | business.orderdate | business.cost | sorted |
+----------------+---------------------+----------------+---------+--+
| Jack | 2017-01-01 | 10 | 1 |
| Tony | 2017-01-02 | 15 | 1 |
| Tony | 2017-01-04 | 29 | 1 |
| Jack | 2017-01-05 | 46 | 2 |
| Tony | 2017-01-07 | 50 | 2 |
| Jack | 2017-01-08 | 55 | 2 |
| Jack | 2017-02-03 | 23 | 3 |
| Jack | 2017-04-06 | 42 | 3 |
| Mark | 2017-04-08 | 62 | 3 |
| Mart | 2017-04-09 | 68 | 4 |
| Mart | 2017-04-11 | 75 | 4 |
| Mart | 2017-04-13 | 94 | 4 |
| Meil | 2017-05-10 | 12 | 5 |
| Meil | 2017-06-12 | 80 | 5 |
+----------------+---------------------+----------------+---------+--+
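With 14 rows and 5 buckets the split cannot be even, which is why the table shows bucket sizes 3, 3, 3, 3, 2: the leftover rows go to the first buckets. The sketch below imitates that numbering in Python; `ntile` here is a hypothetical helper written for this illustration. It also picks out bucket 1, the "first 20%" (3 rows rather than 2.8).

```python
# Sample data from the top of the article: (name, orderdate, cost).
ROWS = [
    ("Jack", "2017-01-01", 10), ("Tony", "2017-01-02", 15),
    ("Jack", "2017-02-03", 23), ("Tony", "2017-01-04", 29),
    ("Jack", "2017-01-05", 46), ("Jack", "2017-04-06", 42),
    ("Tony", "2017-01-07", 50), ("Jack", "2017-01-08", 55),
    ("Mark", "2017-04-08", 62), ("Mart", "2017-04-09", 68),
    ("Meil", "2017-05-10", 12), ("Mart", "2017-04-11", 75),
    ("Meil", "2017-06-12", 80), ("Mart", "2017-04-13", 94),
]


def ntile(n_rows, buckets):
    """Return a bucket number (1..buckets) for each of n_rows ordered rows."""
    base, extra = divmod(n_rows, buckets)
    numbers = []
    for b in range(1, buckets + 1):
        # The first `extra` buckets each take one additional row.
        size = base + (1 if b <= extra else 0)
        numbers.extend([b] * size)
    return numbers


ordered = sorted(ROWS, key=lambda r: r[1])
tiles = ntile(len(ordered), 5)

# Keep only the rows that landed in bucket 1.
first_bucket = [row for row, tile in zip(ordered, tiles) if tile == 1]
print(tiles)
print(first_bucket)
```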
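-- The following statement fails: where is evaluated before window functions are computed, so the alias sorted does not exist yet and cannot be used as a filter condition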
select *, ntile(5) over(sort by orderdate) as sorted from business where sorted = 1;
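-- The following statement also fails: having is meant for filtering grouped rows and, like where, is evaluated before window functions are computed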
select *, ntile(5) over(sort by orderdate) as sorted from business having sorted = 1;
-- So use a subquery: compute the window function in the inner query, then filter on its result in the outer query
select name,orderdate,cost from (
select *,ntile(5) over(order by orderdate) sorted from business
) t
where sorted = 1;
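-- Note: the result below also lists t.sorted, so it appears to come from selecting every column of t rather than only name, orderdate, cost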
+---------+--------------+---------+-----------+--+
| t.name | t.orderdate | t.cost | t.sorted |
+---------+--------------+---------+-----------+--+
| Jack | 2017-01-01 | 10 | 1 |
| Tony | 2017-01-02 | 15 | 1 |
| Tony | 2017-01-04 | 29 | 1 |
+---------+--------------+---------+-----------+--+