Contents
1. What is the over() clause?
2. The window range of the over() clause
2.1 window clause
2.2 Default
2.3 order by
2.4 partition by
2.5 partition by + order by
3. Examples
3.1 Data preparation
3.2 Default case: over()
3.3 partition by
3.4 order by
3.5 partition by + order by
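The over() clause can be pictured as a windowing clause: it opens a window that contains multiple records, and it opens one such window for every row. In the figure below there are 5 records, one per row. over() opens a window on top of each record: window w1 for record r1, containing only r1 itself; window w2 for r2, containing r1 and r2; window w3 for r3, containing r1, r2, and r3; and so on.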
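It follows that a query using over() can return not only each record's own columns but also an aggregate over all the records in that record's window, which is why over() is usually combined with aggregate functions.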
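The idea can be tried outside Hive. The following is a minimal sketch, assuming Python's built-in sqlite3 module with a SQLite build that supports window functions (3.25 or later); the table t and its values are hypothetical stand-ins for records r1 to r5. It shows that every row is kept and that each row also receives an aggregate over its own window.

```python
import sqlite3

# Hypothetical five-row table standing in for records r1..r5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)])

# The window of each row runs from the first row to the current row,
# matching the w1, w2, w3, ... picture described above.
frame = "ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW"
rows = conn.execute(
    f"SELECT id, v, COUNT(*) OVER ({frame}) AS window_size, "
    f"SUM(v) OVER ({frame}) AS window_sum FROM t ORDER BY id"
).fetchall()

for row in rows:
    print(row)  # (id, v, rows in the window, sum of v over the window)
```

Unlike group by, which would collapse the five rows into one, the query returns five rows, and the window grows by one record per row.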
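So how does over() open its windows, that is, which records belong to each record's window? This is defined inside the parentheses of the over() clause.
First, look at a figure:
current row denotes the row currently being evaluated, 1 preceding the row before it, 1 following the row after it, unbounded preceding the first row, and unbounded following the last row. (Note that "first row" and "last row" here are not necessarily the first and last rows of the whole result set; they depend on the context, for example on the partition the current row belongs to.)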
The window range of over() can be defined inside its parentheses with a window clause. The window clause takes one of the following forms:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
For example, select *, sum(column_name) over(rows between unbounded preceding and unbounded following) from table_name returns all columns of every row, opens for each row a window spanning the first row through the last row, and sums column_name over all records in that window. Each output row therefore carries its own columns plus the aggregate value of its window.
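To see how the frame bounds change the result, here is a small sketch, again assuming Python's sqlite3 (SQLite 3.25 or later) and a hypothetical table t. It compares the full frame from the example above with a three-row moving frame built from 1 preceding and 1 following.

```python
import sqlite3

# Hypothetical table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)])

rows = conn.execute("""
    SELECT id,
           -- full frame: every row sees all five records
           SUM(v) OVER (ROWS BETWEEN UNBOUNDED PRECEDING
                        AND UNBOUNDED FOLLOWING) AS total,
           -- moving frame: previous row, current row, next row
           SUM(v) OVER (ORDER BY id
                        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving3
    FROM t
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

The first and last rows have only two records in their moving frame, because there is no row before r1 and no row after r5.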
If nothing is written inside over(), the default window range is: rows between unbounded preceding and unbounded following
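If over() contains an order by, for example over(order by date), the default window, after sorting by date, runs from the first row to the current row. The Hive manual documents this default as range between unbounded preceding and current row, which also includes any rows that tie with the current row on date. When the date values are all distinct, over(order by date) therefore gives the same result as over(order by date rows between unbounded preceding and current row).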
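The difference between a range frame and a rows frame only shows up when the order by column has ties. The sketch below is an illustration under assumptions: it runs in Python's sqlite3 (SQLite 3.25 or later), whose default frame follows the same SQL-standard rule, and the table sales with its duplicated date is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, d TEXT, cost INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, "2015-01-01", 10),
    (2, "2015-01-02", 15),  # ties with id 3 on d
    (3, "2015-01-02", 5),
    (4, "2015-01-03", 20),
])

def running(over_clause):
    """Return the windowed sum for each row, listed in id order."""
    sql = f"SELECT SUM(cost) OVER ({over_clause}) FROM sales ORDER BY id"
    return [r[0] for r in conn.execute(sql)]

default_frame = running("ORDER BY d")
range_frame = running(
    "ORDER BY d RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW")
rows_frame = running(
    "ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW")

print(default_frame)  # [10, 30, 30, 50]: both 2015-01-02 rows see each other
print(range_frame)    # same as the default
print(rows_frame)     # the tied rows get different running sums
```

With a rows frame, which of the two tied rows comes first is not determined by order by d alone, so adding a tie-breaking column to the order by is a reasonable precaution when a strict row-by-row running total is wanted.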
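If over() contains a partition by (similar to group by, it groups rows by column value), for example over(partition by month(date)), the default window of each row is all the records in the group the row belongs to. Note that partition by cannot be combined with a window clause on its own; it must be accompanied by order by, as discussed below.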
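With both partition by and order by, rows are grouped first and then sorted, that is, sorted within each group. As before, if order by is not followed by a window clause, the default window of each row runs from the first row of its group to the current row. The Hive manual documents this default as a range frame, so when the orderdate values within a group are distinct, over(partition by (month(orderdate)) order by orderdate) and over(partition by (month(orderdate)) order by orderdate rows between unbounded preceding and current row) give the same result.
To make this easier to understand, the rest of the article works through concrete examples.
We prepare an orders table with the fields name, orderdate, and cost. The data is as follows: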
jack,2015-01-01,10
tony,2015-01-02,15
jack,2015-02-03,23
tony,2015-01-04,29
jack,2015-01-05,46
jack,2015-04-06,42
tony,2015-01-07,50
jack,2015-01-08,55
mart,2015-04-08,62
mart,2015-04-09,68
neil,2015-05-10,12
mart,2015-04-11,75
neil,2015-06-12,80
mart,2015-04-13,94
Create a table named orders in Hive and load the data into it.
CREATE TABLE orders(
name varchar(30),
orderdate date,
cost int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile;
load data local inpath '/home/grid/data/orders.txt' overwrite into table orders;
Query the January orders together with the January sales total (the two HQL statements below return the same result):
-- default case
select *,sum(cost) over() as jan_total from orders where month(orderdate) = 1;
-- with an explicit window clause
select *,sum(cost) over(rows between unbounded preceding and unbounded following) as jan_total
from orders
where month(orderdate) = 1;
Result:
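The January total can be cross-checked outside Hive. This is a minimal sketch, assuming Python's built-in sqlite3 (SQLite 3.25 or later) loaded with the same 14 rows; SQLite has no month() function, so strftime('%m', ...) is used in its place, and the columns are listed explicitly instead of using *.

```python
import sqlite3

ORDERS = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ORDERS)

# Stand-in for Hive's month(orderdate) = 1.
JAN = "CAST(strftime('%m', orderdate) AS INTEGER) = 1"

def january(over_clause):
    sql = (f"SELECT name, orderdate, cost, SUM(cost) OVER ({over_clause}) "
           f"AS jan_total FROM orders WHERE {JAN} ORDER BY orderdate")
    return conn.execute(sql).fetchall()

implicit = january("")  # over()
explicit = january("ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING")

for row in implicit:
    print(row)  # every January row carries jan_total = 205
```

The where filter is applied before the window is evaluated, so the window only ever sees the six January rows.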
Query all orders together with the sales total of each month:
select *, sum(cost) over(partition by month(orderdate)) as month_total from orders;
Result:
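The per-month totals can be reproduced the same way. A sketch under the same assumptions as before: Python's sqlite3 (SQLite 3.25 or later), the 14 rows from the data preparation step, and strftime('%m', ...) standing in for Hive's month().

```python
import sqlite3

ORDERS = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ORDERS)

rows = conn.execute("""
    SELECT orderdate, cost,
           SUM(cost) OVER (
               PARTITION BY CAST(strftime('%m', orderdate) AS INTEGER)
           ) AS month_total
    FROM orders
    ORDER BY orderdate
""").fetchall()

# Collapse to one entry per month ("01", "02", ...) for a compact view.
totals = {orderdate[5:7]: month_total for orderdate, _, month_total in rows}
print(totals)
```

All 14 rows are returned; rows from the same month simply repeat the same month_total.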
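Query the January orders together with the cumulative January sales (the two HQL statements below return the same result here, because every orderdate is distinct):
-- default case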
select *,sum(cost) over(order by orderdate) as jan_increment
from orders
where month(orderdate) = 1;
-- with an explicit window clause
select
*,
sum(cost) over(order by orderdate rows between unbounded preceding and current row) as jan_increment
from orders
where month(orderdate) = 1;
Result:
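The running January total can also be checked with a short sketch, assuming Python's sqlite3 (SQLite 3.25 or later) and strftime('%m', ...) as a stand-in for Hive's month(). It runs both forms of the query and compares them.

```python
import sqlite3

ORDERS = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ORDERS)

JAN = "CAST(strftime('%m', orderdate) AS INTEGER) = 1"

def jan_increment(over_clause):
    sql = (f"SELECT orderdate, cost, SUM(cost) OVER ({over_clause}) "
           f"FROM orders WHERE {JAN} ORDER BY orderdate")
    return conn.execute(sql).fetchall()

default = jan_increment("ORDER BY orderdate")
explicit = jan_increment(
    "ORDER BY orderdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW")

for row in default:
    print(row)  # (orderdate, cost, running total up to this date)
```

The last row's running total equals the January total of 205, as expected for a frame that ends at the current row.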
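Compute the cumulative sales within each month (the two HQL statements below return the same result here, because the orderdate values within each month are distinct):
-- default case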
select
orderdate,
cost,
sum(cost) over(partition by(month(orderdate)) order by orderdate) as month_inc
from orders;
-- with an explicit window clause
select
orderdate,
cost,
sum(cost) over(partition by(month(orderdate)) order by orderdate rows between unbounded preceding and current row) as month_inc
from orders;
Result:
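As a final cross-check, the per-month running totals can be reproduced in a sketch that assumes Python's sqlite3 (SQLite 3.25 or later), with strftime('%m', ...) replacing Hive's month(). The running sum restarts at the first order of every month.

```python
import sqlite3

ORDERS = [
    ("jack", "2015-01-01", 10), ("tony", "2015-01-02", 15),
    ("jack", "2015-02-03", 23), ("tony", "2015-01-04", 29),
    ("jack", "2015-01-05", 46), ("jack", "2015-04-06", 42),
    ("tony", "2015-01-07", 50), ("jack", "2015-01-08", 55),
    ("mart", "2015-04-08", 62), ("mart", "2015-04-09", 68),
    ("neil", "2015-05-10", 12), ("mart", "2015-04-11", 75),
    ("neil", "2015-06-12", 80), ("mart", "2015-04-13", 94),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ORDERS)

MONTH = "CAST(strftime('%m', orderdate) AS INTEGER)"

def month_inc(frame):
    sql = (f"SELECT orderdate, cost, SUM(cost) OVER ("
           f"PARTITION BY {MONTH} ORDER BY orderdate {frame}) "
           f"FROM orders ORDER BY orderdate")
    return conn.execute(sql).fetchall()

default = month_inc("")
explicit = month_inc("ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW")

for row in default:
    print(row)  # (orderdate, cost, running total within the month)
```

The last row of each month carries that month's full total, which matches the totals from the partition by example.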