在大数据面试中,Hive知识的考察大部分会问级联求和,业务场景虽然有很多种,比如说,年收入,月收入统计;访客访问次数年统计,月统计。等等。但是基本根源知识是级联求和,本文就以访客访问统计为例。
根据访客的每日访问信息,进行累计访问:
有如下访客访问次数统计表 t_access_times
需要输出报表:t_access_times_accumulate
vi /home/chunsoft/t_access_times.dat
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
jdbc:hive2://172.16.208.128:10000>create table t_access_times(username string,month string,access_time int)
row format delimited fields terminated by ',';
jdbc:hive2://172.16.208.128:10000>
load data local inpath '/home/chunsoft/t_access_times.dat' into table t_access_times;
jdbc:hive2://172.16.208.128:10000>
select username,month,sum(access_time) as access_time from t_access_times group by username,month;
+-----------+----------+---------+--+
| username | month | access_time |
+-----------+----------+---------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+---------+--+
select A.*,B.* from
(select username,month,sum(access_time) as access_time from t_access_times group by username,month) A
inner join
(select username,month,sum(access_time) as access_time from t_access_times group by username,month) B
on
A.username=B.username
+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username | a.month | a.access_time | b.username | b.month | b.access_time |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+-----------+-------------+----------+-----------+--+
select A.username,A.month,max(A.access_time) access_time,sum(B.access_time) accumulate
from
(select username,month,sum(access_time) as access_time from t_access_times group by username,month) A
inner join
(select username,month,sum(access_time) as access_time from t_access_times group by username,month) B
on
A.username=B.username
where B.month<=A.month
group by A.username,A.month
order by A.username,A.month
+-------------+----------+---------+-------------+--+
| a.username | a.month | access_time | accumulate |
+-------------+----------+---------+-------------+--+
| A | 2015-01 | 33 | 33 |
| A | 2015-02 | 10 | 43 |
| B | 2015-01 | 30 | 30 |
| B | 2015-02 | 15 | 45 |
这里面的max(A.access_time) 起到一个聚合作用,保留一条结果。