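In production, report statistics for almost any metric involve multi-dimensional analysis. Daily active users, daily session count, daily returning visitors, daily new users, average daily visit duration per user, visit depth, and so on all need to be analyzed across different dimensions and from different angles. Writing a separate SQL statement for each of these dimensional requirements (one GROUP BY aggregation at a time) is tedious and repetitive.
So how can this be solved?
- Create one unified target table for the dimensional aggregation results; this table should contain all of the dimension fields
- Use Hive's advanced aggregation functions to compute every required dimension combination in a single SQL statement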
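The first point can be sketched as a table definition. This is only an illustration: the table name `cube` and the column names follow the queries used in this section (with the hour column spelled `hour_segment`), while the column types and the ORC storage format are assumptions, not something the source specifies.

```sql
-- Hypothetical DDL for the unified aggregation result table.
-- `cube` is a reserved keyword in Hive 1.2+, so the name is backtick-quoted.
CREATE TABLE IF NOT EXISTS `cube` (
    province        STRING,   -- dimension columns: NULL means "aggregated over this dimension"
    city            STRING,
    district        STRING,
    device_type     STRING,
    os_name         STRING,
    app_version     STRING,
    release_channel STRING,
    hour_segment    STRING,
    dau_cnt         BIGINT    -- metric column: daily active users
)
STORED AS ORC;
```

Every dimension appears as a column, so a single table can hold the results of all dimension combinations.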
| Province | City | District | Phone model | OS | App version | Download channel | Hour segment | Total DAU |
|---|---|---|---|---|---|---|---|---|
| Jiangxi | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1000 |
| Jiangsu | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1500 |
| Henan | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1800 |
| …… | | | | | | | | |
| Jiangxi | Jiujiang | NULL | NULL | NULL | NULL | NULL | NULL | 800 |
| Jiangxi | Ganzhou | NULL | NULL | NULL | NULL | NULL | NULL | 600 |
| Jiangxi | Nanchang | NULL | NULL | NULL | NULL | NULL | NULL | 450 |
| Jiangxi | …… | NULL | NULL | NULL | NULL | NULL | NULL | 550 |
| Jiangsu | Nantong | NULL | NULL | NULL | NULL | NULL | NULL | 660 |
| Jiangsu | Suzhou | NULL | NULL | NULL | NULL | NULL | NULL | 540 |
| Jiangsu | Xuzhou | NULL | NULL | NULL | NULL | NULL | NULL | 400 |
| Jiangsu | …… | NULL | NULL | NULL | NULL | NULL | NULL | 320 |
| NULL | NULL | NULL | MI6 | NULL | NULL | NULL | NULL | 1500 |
| NULL | NULL | NULL | MI8 | NULL | NULL | NULL | NULL | 2200 |
| NULL | NULL | NULL | MATE10 | NULL | NULL | NULL | NULL | 1800 |
| NULL | NULL | NULL | IPHONE6 | NULL | NULL | NULL | NULL | 1200 |
| NULL | NULL | NULL | …… | NULL | NULL | NULL | NULL | …… |
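-- Read the province-level daily active users from the aggregated table:
-- keep rows where only province is set and every other dimension is NULL.
-- (cube is a reserved keyword in Hive 1.2+, hence the backticks.)
SELECT
    province,
    dau_cnt
FROM `cube`
WHERE province IS NOT NULL
  AND coalesce(city, district, device_type, os_name, ....) IS NULL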
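The same pattern extends to any other level of the result table. As a hedged sketch, the following reads the province + city level; it assumes the remaining dimension columns are exactly the six listed here and that they share one type (for example STRING), so that `coalesce` accepts them together.

```sql
-- Province + city level: both are set, every other dimension is NULL.
-- The column list inside coalesce() is an assumption based on the
-- dimension columns used elsewhere in this section.
SELECT
    province,
    city,
    dau_cnt
FROM `cube`
WHERE province IS NOT NULL
  AND city IS NOT NULL
  AND coalesce(district, device_type, os_name,
               app_version, release_channel, hour_segment) IS NULL;
```

One caveat: if a dimension can be genuinely NULL in the source data, such rows cannot be told apart from aggregated rows by this filter alone.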
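The aggregated table above can contain a very large number of rows.
For example, when daily active users are computed for the dimension combination (province, city, district, phone model, app version, download channel, hour segment), the number of result rows can be as high as: cardinality of province * cardinality of city * cardinality of district * ……
Cardinality: the number of distinct values of a dimension field.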
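To get a feel for how large the result can grow, the cardinality of each dimension can be measured directly on the source table. This is a minimal sketch that assumes `t_src` has the dimension columns used in this section; the `*_card` aliases are made up for illustration.

```sql
-- Cardinality (distinct-value count) of each dimension in the source table.
SELECT
    count(DISTINCT province)        AS province_card,
    count(DISTINCT city)            AS city_card,
    count(DISTINCT district)        AS district_card,
    count(DISTINCT device_type)     AS device_type_card,
    count(DISTINCT os_name)         AS os_name_card,
    count(DISTINCT app_version)     AS app_version_card,
    count(DISTINCT release_channel) AS release_channel_card,
    count(DISTINCT hour_segment)    AS hour_segment_card
FROM t_src;
```

The product of these values is only an upper bound for the finest-grained combination, since many combinations never occur together in real data (a city belongs to one province, for example). When every dimension may also be rolled up to NULL, as in a full cube, the bound becomes the product of (cardinality + 1) over all dimensions.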
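-- WITH CUBE: aggregate over every possible combination of the grouped dimensions.
-- (cube is a reserved keyword in Hive 1.2+, hence the backticks.)
INSERT INTO TABLE `cube`
SELECT
    province,
    city,
    district,
    device_type,
    os_name,
    app_version,
    release_channel,
    hour_segment,
    count(DISTINCT guid) AS dau_cnt
FROM t_src
GROUP BY
    province,
    city,
    district,
    device_type,
    os_name,
    app_version,
    release_channel,
    hour_segment
WITH CUBE
;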
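What `WITH CUBE` expands to is easier to see on a reduced example. The sketch below uses only two of the dimensions (an arbitrary choice for illustration) and shows the explicit `GROUPING SETS` form that produces the same result.

```sql
-- WITH CUBE over two dimensions ...
SELECT province, device_type, count(DISTINCT guid) AS dau_cnt
FROM t_src
GROUP BY province, device_type
WITH CUBE;

-- ... is equivalent to listing all 2^2 = 4 grouping sets explicitly.
SELECT province, device_type, count(DISTINCT guid) AS dau_cnt
FROM t_src
GROUP BY province, device_type
GROUPING SETS (
    (province, device_type),
    (province),
    (device_type),
    ()
);
```

With n grouped dimensions, a cube produces 2^n grouping sets, so the eight dimensions used in this section yield 256 combinations, many of which may never be queried.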
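-- GROUPING SETS: aggregate only over the listed dimension combinations.
-- (cube is a reserved keyword in Hive 1.2+, hence the backticks.)
INSERT INTO TABLE `cube`
SELECT
    province,
    city,
    district,
    device_type,
    os_name,
    app_version,
    release_channel,
    hour_segment,
    count(DISTINCT guid) AS dau_cnt
FROM t_src
GROUP BY
    province,
    city,
    district,
    device_type,
    os_name,
    app_version,
    release_channel,
    hour_segment
GROUPING SETS(
    (),                           -- grand total
    (province),                   -- per province
    (province, city),             -- per province and city
    (province, city, district),   -- per province, city and district
    (device_type)                 -- per device type
)
;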
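Each grouping set behaves like a separate GROUP BY query whose results are unioned together, with the dimensions that are not grouped filled in as NULL. As a rough sketch of that equivalence for just two grouping sets, `(province)` and `(device_type)`, assuming both columns are STRING:

```sql
-- Hand-written equivalent of GROUPING SETS ((province), (device_type)).
SELECT
    province,
    CAST(NULL AS STRING) AS device_type,   -- not grouped in this branch
    count(DISTINCT guid) AS dau_cnt
FROM t_src
GROUP BY province

UNION ALL

SELECT
    CAST(NULL AS STRING) AS province,      -- not grouped in this branch
    device_type,
    count(DISTINCT guid) AS dau_cnt
FROM t_src
GROUP BY device_type;
```

The `GROUPING SETS` form expresses the same thing in a single statement, without repeating the query once per combination.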
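-- WITH ROLLUP: aggregate along the hierarchy province > city > district,
-- i.e. (province, city, district), (province, city), (province) and ().
-- (cube is a reserved keyword in Hive 1.2+, hence the backticks.)
INSERT INTO TABLE `cube`
SELECT
    province,
    city,
    district,
    NULL AS device_type,        -- dimensions outside the rollup are not grouped,
    NULL AS os_name,            -- so they cannot be selected directly and are
    NULL AS app_version,        -- written as NULL instead
    NULL AS release_channel,
    NULL AS hour_segment,
    count(DISTINCT guid) AS dau_cnt
FROM t_src
GROUP BY province, city, district
WITH ROLLUP
;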
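`WITH ROLLUP` can likewise be read as shorthand for a fixed list of grouping sets that drops one dimension at a time from the right. A minimal sketch of the equivalent explicit form, selecting only the three rolled-up dimensions for brevity:

```sql
-- Equivalent of GROUP BY province, city, district WITH ROLLUP.
SELECT
    province,
    city,
    district,
    count(DISTINCT guid) AS dau_cnt
FROM t_src
GROUP BY province, city, district
GROUPING SETS (
    (province, city, district),
    (province, city),
    (province),
    ()
);
```

For n dimensions a rollup yields n + 1 grouping sets instead of the 2^n of a cube, which makes it a good fit when the dimensions form a natural hierarchy such as province > city > district.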