hive sql系列(二)

每天分享一个sql,帮助大家找到sql的快乐

需求:统计每个人每个月访问量和累计月访问量

建表语句

CREATE TABLE visit( 
    userId string, 
    visitDate string,
    visitCount int
)
ROW format delimited FIELDS TERMINATED BY "\t";

数据

INSERT INTO TABLE visit VALUES
( 'u01', '2017/1/21', 5 ),
( 'u02', '2017/1/23', 6 ),
( 'u03', '2017/1/22', 8 ),
( 'u04', '2017/1/20', 3 ),
( 'u01', '2017/1/23', 6 ),
( 'u01', '2017/2/21', 8 ),
( 'u02', '2017/1/23', 6 ),
( 'u01', '2017/2/22', 4 );

-- 数据核对情况
u01 2017/1  5+6=11
u01 2017/2  8+4=12
u02 2017/1  6+6=12
u03 2017/1  8
u04 2017/1  3

实现

select
    tmp2.userId,
    tmp2.month,
    tmp2.total_month,
    sum(tmp2.total_month) over(partition by userId order by tmp2.month rows between unbounded preceding and current row)
from
    (select
        tmp1.userId,
        tmp1.month,
        sum(tmp1.visitCount) as total_month
    from
        (select
            userId,
            date_format(replace(visitDate,"/","-"),"yyyy-MM") as month,
            visitCount
        from
            visit
        )tmp1
    group by tmp1.userId,tmp1.month
    )tmp2
;

结果

Total MapReduce CPU Time Spent: 4 seconds 760 msec
OK
tmp2.userid	tmp2.month	tmp2.total_month	sum_window_0
u01	2017-01	11	11
u01	2017-02	12	23
u02	2017-01	12	12
u03	2017-01	8	8
u04	2017-01	3	3
Time taken: 31.629 seconds, Fetched: 5 row(s)

分析

1、从需求来看,需要计算每个月的访问结果和截至当月累计和,并一起展示,必然用到开窗,这个在hive sql系列一讲过了
2、累计效果需要用到rows范围
3、这里日期格式不能直接用,需要用replace函数,这个函数作用最易于想到
4、这里实现步骤,先对日期转格式,在开窗直接计算


扩展

select date_format("2019/12/12","yyyy-MM");
select split("2019/12/12","/");
select replace("2019/12/12","/","-");
select month(replace("2019/12/12","/","-"));
select date_format(replace("2019/12/12","/","-"),"yyyy-MM");
--从开始到当前月累计
sum(tmp2.total_month) over(order by tmp2.month rows between unbounded preceding and current row)

知识点

rows是行数的意思,后面跟的是函数的范围
between是行数在什么范围
unbounded是行数的起点,这里可以将unbounded替换成1,那就是相邻上一个月的意思
preceding是前面的意思
current row是当前行的意思
连一起的意思是参与计算的行数是从起点开始直到当前行,这就是开窗的数据范围,再加上窗口之上的sum函数,
就是统计从起点到当前的和

 

你可能感兴趣的:(hive,大数据,sql,hive)