Hive经典面试题——级联求和(访客访问统计报表)

在大数据面试中,Hive知识的考察大部分会问级联求和,业务场景虽然有很多种,比如说,年收入,月收入统计;访客访问次数年统计,月统计。等等。但是基本根源知识是级联求和,本文就以访客访问统计为例。

1、基本需求

根据访客的每日访问信息,进行累计访问:

  • 输入数据:

有如下访客访问次数统计表 t_access_times

为了减轻计算复杂度,去掉了天的信息只留下了年月。
Hive经典面试题——级联求和(访客访问统计报表)_第1张图片

  • 输出数据:

需要输出报表:t_access_times_accumulate
Hive经典面试题——级联求和(访客访问统计报表)_第2张图片

2、实现步骤

2.1 原始数据

vi /home/chunsoft/t_access_times.dat

A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5

2.2 创建表

jdbc:hive2://172.16.208.128:10000>create table t_access_times(username string,month string,access_time int)
row format delimited fields terminated by ',';

2.3 将数据加载到表中

jdbc:hive2://172.16.208.128:10000>
load data local inpath '/home/chunsoft/t_access_times.dat' into table t_access_times;

2.4 计算单个用户的月访问次数

jdbc:hive2://172.16.208.128:10000>
select username,month,sum(access_time) as access_time from t_access_times group by username,month;

+-----------+----------+---------+--+
| username  |  month   | access_time  |
+-----------+----------+---------+--+
| A         | 2015-01  | 33      |
| A         | 2015-02  | 10      |
| B         | 2015-01  | 30      |
| B         | 2015-02  | 15      |
+-----------+----------+---------+--+

2.4 上面的表自己跟自己进行inner join方便求总访问

select A.*,B.* from
(select username,month,sum(access_time) as access_time from t_access_times group by username,month) A
inner join
(select username,month,sum(access_time) as access_time from t_access_times group by username,month) B
on
A.username=B.username

+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username  | a.month  | a.access_time  | b.username  | b.month  | b.access_time  |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A           | 2015-01  | 33        | A           | 2015-01  | 33        |
| A           | 2015-01  | 33        | A           | 2015-02  | 10        |
| A           | 2015-02  | 10        | A           | 2015-01  | 33        |
| A           | 2015-02  | 10        | A           | 2015-02  | 10        |
| B           | 2015-01  | 30        | B           | 2015-01  | 30        |
| B           | 2015-01  | 30        | B           | 2015-02  | 15        |
| B           | 2015-02  | 15        | B           | 2015-01  | 30        |
| B           | 2015-02  | 15        | B           | 2015-02  | 15        |
+-------------+----------+-----------+-------------+----------+-----------+--+

2.5 根据当前月份的大小来判断累计求和并排序

select A.username,A.month,max(A.access_time) access_time,sum(B.access_time) accumulate
from
  (select username,month,sum(access_time) as access_time from t_access_times group by username,month) A
  inner join
  (select username,month,sum(access_time) as access_time from t_access_times group by username,month) B
  on
  A.username=B.username
where B.month<=A.month
group by A.username,A.month
order by A.username,A.month

+-------------+----------+---------+-------------+--+
| a.username  | a.month  | access_time  | accumulate  |
+-------------+----------+---------+-------------+--+
| A           | 2015-01  | 33      | 33          |
| A           | 2015-02  | 10      | 43          |
| B           | 2015-01  | 30      | 30          |
| B           | 2015-02  | 15      | 45          |

这里面的max(A.access_time) 起到一个聚合作用,保留一条结果。

你可能感兴趣的:(big-data)