Hive Window Functions: Syntax Explained with Examples

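1. When should you use a window function?
Window functions are usually combined with aggregate functions. An ordinary aggregation (GROUP BY) returns fewer rows than it reads, but sometimes we want to show the detail rows and the aggregated value side by side. Window functions exist for that case.

For example, the output below keeps every order row and appends the total cost of that row's month in the last column: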

+-------+-------------+-------+---------------+
| name  |  orderdate  | cost  | sum_window_0  |
+-------+-------------+-------+---------------+
| jack  | 2017-01-01  | 10    | 205           |
| jack  | 2017-01-08  | 55    | 205           |
| tony  | 2017-01-07  | 50    | 205           |
| jack  | 2017-01-05  | 46    | 205           |
| tony  | 2017-01-04  | 29    | 205           |
| tony  | 2017-01-02  | 15    | 205           |
| jack  | 2017-02-03  | 23    | 23            |
| mart  | 2017-04-13  | 94    | 341           |
| jack  | 2017-04-06  | 42    | 341           |
| mart  | 2017-04-11  | 75    | 341           |
| mart  | 2017-04-09  | 68    | 341           |
| mart  | 2017-04-08  | 62    | 341           |
| neil  | 2017-05-10  | 12    | 12            |
| neil  | 2017-06-12  | 80    | 80            |
+-------+-------------+-------+---------------+
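
A query of roughly the following shape would produce output like the table above. This is only a sketch: it assumes the business table used in the later sections has the columns name, orderdate (a 'yyyy-MM-dd' string) and cost, and the alias month_total is made up for illustration (an unnamed window column shows up as sum_window_0).

```sql
-- Assumed table: business(name string, orderdate string, cost int)
-- Keep every order row and add the total cost of that row's month.
-- month() ignores the year, which is fine here because all rows are from 2017.
select name, orderdate, cost,
       sum(cost) over(partition by month(orderdate)) as month_total
from business;
```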

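2. Window function syntax:

UDAF() OVER (PARTITION BY col1, col2 ORDER BY col3 window_clause(ROWS BETWEEN .. AND ..)) AS column_alias

Note: PARTITION BY can take several columns. ORDER BY can also take several columns, although the examples here sort by a single one.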

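3. What over() does
over() defines the range of rows (the window) that the aggregate function works on. With an empty over(), the window is the entire result set. The aggregate is evaluated once for every row, so each row gets its own result.

For example: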

select name, orderdate, cost, sum(cost) over()
from business; 

4. The partition by clause:
The partition by clause splits the rows into partitions, and the aggregate is then computed separately within each partition.

For example:

select name, orderdate, cost, sum(cost) over(partition by name)
from business;

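5. The order by clause:

It does two things:
(1) sorts the rows inside each partition;
(2) determines which rows are aggregated: when no window clause is given, the window runs from the start of the partition to the current row (Hive's default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows with the same sort value as the current row are included as well).

For example: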

select name, orderdate, cost, sum(cost) over(partition by name order by orderdate)
from business;
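
To see the implicit window at work, the sketch below puts the query above next to a version that spells the frame out. Assuming the same business table, the two columns should hold identical values; the aliases running_default and running_explicit are made up for this comparison.

```sql
-- Same running total twice: once with the implicit frame, once written out.
select name, orderdate, cost,
       sum(cost) over(partition by name order by orderdate) as running_default,
       sum(cost) over(partition by name order by orderdate
                      range between UNBOUNDED PRECEDING and CURRENT ROW) as running_explicit
from business;
```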

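6. The window clause
CURRENT ROW: the current row
n PRECEDING: the n rows before the current row
n FOLLOWING: the n rows after the current row
UNBOUNDED: the boundary of the partition; UNBOUNDED PRECEDING means starting from the first row, UNBOUNDED FOLLOWING means running to the last row

The partition by clause splits the data into partitions. To carve out a finer, per-row window inside each partition, add a window clause.

For example: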

select name, orderdate,cost, sum(cost) 
over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW)
from business;
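
The frame does not have to start at the beginning of the partition. As a sketch on the same business table (the alias sum_3rows is made up for this example), the query below sums the previous row, the current row and the next row for each customer:

```sql
-- Sliding window: one row back, the current row, one row ahead.
select name, orderdate, cost,
       sum(cost) over(partition by name order by orderdate
                      rows between 1 PRECEDING and 1 FOLLOWING) as sum_3rows
from business;
```

At the first and last row of a partition the frame simply contains fewer rows.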

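7. A few notes:

(1) When both are present, order by must come after partition by;
(2) The rows clause must come after the order by clause;
(3) (partition by … order by …) can be replaced with (distribute by … sort by …).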
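
As an illustration of note (3), here is the per-customer running total written with distribute by and sort by. It is a sketch on the same assumed business table and should return the same result as the partition by … order by form.

```sql
-- distribute by / sort by used in place of partition by / order by
select name, orderdate, cost,
       sum(cost) over(distribute by name sort by orderdate) as running_total
from business;
```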

Example 1
Combining date format conversion with window functions.
We have the following user visit data:
userId visitDate visitCount
u01 2017/1/21 5
u02 2017/1/23 6
u03 2017/1/22 8
u04 2017/1/20 3
u01 2017/1/23 6
u01 2017/2/21 8
U02 2017/1/23 6
U01 2017/2/22 4
Use SQL to compute each user's cumulative visit count, as in the table below:
userId month subtotal cumulative
u01 2017-01 11 11
u01 2017-02 12 23
u02 2017-01 12 12
u03 2017-01 8 8
u04 2017-01 3 3

--Create the table
create table test01_visit (
    userId string COMMENT 'user ID',
    visitDate string COMMENT 'visit date',
    visitCount string COMMENT 'visit count'
) COMMENT 'Exercise 1: user visit data'
row format delimited fields terminated by '\t'
location '/warehouse/test/test01';
--Insert the data
insert into table test01_visit values('u01','2017/1/21','5');
insert into table test01_visit values('u02','2017/1/23','6');
insert into table test01_visit values('u03','2017/1/22','8');
insert into table test01_visit values('u04','2017/1/20','3');
insert into table test01_visit values('u01','2017/1/23','6');
insert into table test01_visit values('u01','2017/2/21','8');
insert into table test01_visit values('U02','2017/1/23','6');
insert into table test01_visit values('U01','2017/2/22','4');

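--Compute each user's cumulative visit count

--Convert the date format (MM is the month; lowercase mm would mean minutes)
select from_unixtime(unix_timestamp(visitDate,'yyyy/MM/dd'),'yyyy-MM') FROM test01_visit;

--Normalize the case of userId

select lower(userId) FROM test01_visit;

--Visits per user per month
--(visitCount is stored as a string, so cast it before summing)

select
    lower(userId) userId,
    from_unixtime(unix_timestamp(visitDate,'yyyy/MM/dd'),'yyyy-MM') visitMonth,
    sum(cast(visitCount as int)) visit_month
from test01_visit
group by lower(userId),from_unixtime(unix_timestamp(visitDate,'yyyy/MM/dd'),'yyyy-MM');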

--Cumulative visits per user

select 
    t1.userId userId,
    t1.visitMonth visitMonth,
    t1.visit_month count_month,
    sum(t1.visit_month) over (partition by userId order by visitMonth rows between UNBOUNDED PRECEDING AND CURRENT ROW) sum_month
from 
(
    --monthly subtotal per user; MM is the month, and visitCount is cast because it is stored as a string
    select
        lower(userId) userId,
        from_unixtime(unix_timestamp(visitDate,'yyyy/MM/dd'),'yyyy-MM') visitMonth,
        sum(cast(visitCount as int)) visit_month
    from test01_visit
    group by lower(userId),from_unixtime(unix_timestamp(visitDate,'yyyy/MM/dd'),'yyyy-MM')
) t1;

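Example 2
A classic Hive exercise: computing PV (page views, i.e. the number of visits) and UV (unique visitors)

--There are 500,000 JD.com shops. Every time a customer views any product of any shop, one visit log record is produced. The visit log is stored in a table named Visit,
--where user_id is the visitor's user id and shop is the name of the visited shop. Compute:
--1) the UV (number of unique visitors) of each shop
--2) the top 3 visitors of each shop by visit count. Output the shop name, visitor id and visit count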
--User table
drop table if exists test02_user;
create table test02_user(
    user_id string COMMENT 'user id',
    user_name string COMMENT 'user nickname'
)
row format delimited fields terminated by '\t'
location '/warehouse/test/test02/user';
--Shop table
drop table if exists test02_shoop;
create table test02_shoop(
    shoop_id string COMMENT 'shop id',
    shoop_name string COMMENT 'shop name'
)
row format delimited fields terminated by '\t'
location '/warehouse/test/test02/shoop';
--Visit log table
drop table if exists test02_Visit;
create table test02_Visit(
    shoop_name string COMMENT 'shop name',
    user_id string COMMENT 'user id',
    visit_time string COMMENT 'visit time'
)
row format delimited fields terminated by '\t'
location '/warehouse/test/test02';
--Insert the visit log data
insert into table test02_Visit values ('huawei','1001','2017-02-10');
insert into table test02_Visit values ('icbc','1001','2017-02-10');
insert into table test02_Visit values ('huawei','1001','2017-02-10');
insert into table test02_Visit values ('apple','1001','2017-02-10');
insert into table test02_Visit values ('huawei','1001','2017-02-10');
insert into table test02_Visit values ('huawei','1002','2017-02-10');
insert into table test02_Visit values ('huawei','1002','2017-02-10');
insert into table test02_Visit values ('huawei','1001','2017-02-10');
insert into table test02_Visit values ('huawei','1003','2017-02-10');
insert into table test02_Visit values ('huawei','1004','2017-02-10');
insert into table test02_Visit values ('huawei','1005','2017-02-10');
insert into table test02_Visit values ('icbc','1002','2017-02-10');
insert into table test02_Visit values ('jingdong','1006','2017-02-10');
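--UV (unique visitors) and PV (total visits) of each shop
--count(*) alone counts every visit, which is PV; UV needs count(distinct user_id)
select 
    shoop_name,
    count(distinct user_id) uv,
    count(*) pv
from test02_Visit
group by shoop_name;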
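--Top 3 visitors of each shop by visit count. Output the shop name, visitor id and visit count
--(rank() gives tied visitors the same rank, so a shop can return more than three rows)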
select 
    rank,
    shoop_name,
    user_id,
    t2.visit_count
from 
(
    select 
        shoop_name,
        user_id,
        rank() over(partition by shoop_name order by t1.visit_count desc) rank,
        t1.visit_count visit_count
    from
    (
        select 
            shoop_name,
            user_id,
            count(*) visit_count
        from test02_Visit
        group by shoop_name,user_id
    ) t1
) t2 
where rank<=3
order by shoop_name,rank;
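
If exactly three rows per shop are wanted even when visit counts tie, one option is to swap rank() for row_number(). The following is a sketch of that variant, not part of the original exercise; the alias rn and the user_id tie-breaker are assumptions made here so that the ordering is deterministic.

```sql
-- Variant: at most three rows per shop, ties broken by user_id.
select shoop_name, user_id, visit_count
from
(
    select
        shoop_name,
        user_id,
        visit_count,
        row_number() over(partition by shoop_name
                          order by visit_count desc, user_id) as rn
    from
    (
        select shoop_name, user_id, count(*) as visit_count
        from test02_Visit
        group by shoop_name, user_id
    ) t1
) t2
where rn <= 3
order by shoop_name, rn;
```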
