Hive: Ant Forest

Introduction

Ant Forest is a worthwhile Hive practice project.
Without further ado, let's get started.

Data Structures

The table below is the running log of the low-carbon energy each user collected in Ant Forest each day.

user_low_carbon

user_id data_dt low_carbon

user ID, date, carbon emissions reduced (g)

u_001	2017/1/1	10
u_001	2017/1/2	150
u_001	2017/1/2	110
u_001	2017/1/2	10
u_001	2017/1/4	50
u_001	2017/1/4	10
u_001	2017/1/6	45
u_001	2017/1/6	90
u_002	2017/1/1	10
u_002	2017/1/2	150
u_002	2017/1/2	70
u_002	2017/1/3	30
u_002	2017/1/3	80
u_002	2017/1/4	150
u_002	2017/1/5	101
u_002	2017/1/6	68
u_003	2017/1/1	20
u_003	2017/1/2	10
u_003	2017/1/2	150
u_003	2017/1/3	160
u_003	2017/1/4	20
u_003	2017/1/5	120
u_003	2017/1/6	20
u_003	2017/1/7	10
u_003	2017/1/7	110
u_004	2017/1/1	110
u_004	2017/1/2	20
u_004	2017/1/2	50
u_004	2017/1/3	120
u_004	2017/1/4	30
u_004	2017/1/5	60
u_004	2017/1/6	120
u_004	2017/1/7	10
u_004	2017/1/7	120
u_005	2017/1/1	80
u_005	2017/1/2	50
u_005	2017/1/2	80
u_005	2017/1/3	180
u_005	2017/1/4	180
u_005	2017/1/4	10
u_005	2017/1/5	80
u_005	2017/1/6	280
u_005	2017/1/7	80
u_005	2017/1/7	80
u_006	2017/1/1	40
u_006	2017/1/2	40
u_006	2017/1/2	140
u_006	2017/1/3	210
u_006	2017/1/3	10
u_006	2017/1/4	40
u_006	2017/1/5	40
u_006	2017/1/6	20
u_006	2017/1/7	50
u_006	2017/1/7	240
u_007	2017/1/1	130
u_007	2017/1/2	30
u_007	2017/1/2	330
u_007	2017/1/3	30
u_007	2017/1/4	530
u_007	2017/1/5	30
u_007	2017/1/6	230
u_007	2017/1/7	130
u_007	2017/1/7	30
u_008	2017/1/1	160
u_008	2017/1/2	60
u_008	2017/1/2	60
u_008	2017/1/3	60
u_008	2017/1/4	260
u_008	2017/1/5	360
u_008	2017/1/6	160
u_008	2017/1/7	60
u_008	2017/1/7	60
u_009	2017/1/1	70
u_009	2017/1/2	70
u_009	2017/1/2	70
u_009	2017/1/3	170
u_009	2017/1/4	270
u_009	2017/1/5	70
u_009	2017/1/6	70
u_009	2017/1/7	70
u_009	2017/1/7	70
u_010	2017/1/1	90
u_010	2017/1/2	90
u_010	2017/1/2	90
u_010	2017/1/3	90
u_010	2017/1/4	90
u_010	2017/1/4	80
u_010	2017/1/5	90
u_010	2017/1/5	90
u_010	2017/1/6	190
u_010	2017/1/7	90
u_010	2017/1/7	90
u_011	2017/1/1	110
u_011	2017/1/2	100
u_011	2017/1/2	100
u_011	2017/1/3	120
u_011	2017/1/4	100
u_011	2017/1/5	100
u_011	2017/1/6	100
u_011	2017/1/7	130
u_011	2017/1/7	100
u_012	2017/1/1	10
u_012	2017/1/2	120
u_012	2017/1/2	10
u_012	2017/1/3	10
u_012	2017/1/4	50
u_012	2017/1/5	10
u_012	2017/1/6	20
u_012	2017/1/7	10
u_012	2017/1/7	10
u_013	2017/1/1	50
u_013	2017/1/2	150
u_013	2017/1/2	50
u_013	2017/1/3	150
u_013	2017/1/4	550
u_013	2017/1/5	350
u_013	2017/1/6	50
u_013	2017/1/7	20
u_013	2017/1/7	60
u_014	2017/1/1	220
u_014	2017/1/2	120
u_014	2017/1/2	20
u_014	2017/1/3	20
u_014	2017/1/4	20
u_014	2017/1/5	250
u_014	2017/1/6	120
u_014	2017/1/7	270
u_014	2017/1/7	20
u_015	2017/1/1	10
u_015	2017/1/2	20
u_015	2017/1/2	10
u_015	2017/1/3	10
u_015	2017/1/4	20
u_015	2017/1/5	70
u_015	2017/1/6	10
u_015	2017/1/7	80
u_015	2017/1/7	60

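The Ant Forest plant exchange table records how much carbon reduction is needed to claim each plant. The plant names in the data are 梭梭树 (saxaul), 沙柳 (sand willow), 樟子树 (Mongolian pine) and 胡杨 (Euphrates poplar).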

plant_carbon

plant_id plant_name low_carbon

plant ID, plant name, carbon required to claim the plant (g)

p001	梭梭树	17
p002	沙柳	19
p003	樟子树	146
p004	胡杨	215

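Questions
1. Ant Forest plant claim statistics
Problem: assume low-carbon data (user_low_carbon) has been recorded since 2017/1/1, and that before 2017/10/1 every user who met the requirement claimed one p004 胡杨 (Euphrates poplar) and spent all remaining energy on p002 沙柳 (sand willow). Find the top 10 users by the cumulative number of p002 沙柳 claimed as of 2017/10/1, and how many more 沙柳 each of them claimed than the next-ranked user.
The result should look like this (user, 沙柳 count, lead over the next user):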

u_007	66	3
u_013	63	10
u_008	53	7
u_005	46	1
u_010	45	1
u_014	44	5
u_011	39	2
u_009	37	5
u_006	32	9
u_002	23	1

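2. Ant Forest low-carbon user ranking analysis
Problem: query the daily records in the user_low_carbon table that satisfy the following condition:

In 2017, the user had a run of three or more consecutive days
on each of which the day's total carbon reduction (low_carbon) exceeded 100g.
Return the user_low_carbon records that fall on those days.

For example, u_002 qualifies because on each of the four consecutive days 2017/1/2 to 2017/1/5 its daily total is strictly greater than 100g.
The expected output for all users is: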

u_002	2017/1/2	150
u_002	2017/1/2	70
u_002	2017/1/3	30
u_002	2017/1/3	80
u_002	2017/1/4	150
u_002	2017/1/5	101
u_005	2017/1/2	50
u_005	2017/1/2	80
u_005	2017/1/3	180
u_005	2017/1/4	180
u_005	2017/1/4	10
u_008	2017/1/4	260
u_008	2017/1/5	360
u_008	2017/1/6	160
u_008	2017/1/7	60
u_008	2017/1/7	60
u_009	2017/1/2	70
u_009	2017/1/2	70
u_009	2017/1/3	170
u_009	2017/1/4	270
u_010	2017/1/4	90
u_010	2017/1/4	80
u_010	2017/1/5	90
u_010	2017/1/5	90
u_010	2017/1/6	190
u_010	2017/1/7	90
u_010	2017/1/7	90
u_011	2017/1/1	110
u_011	2017/1/2	100
u_011	2017/1/2	100
u_011	2017/1/3	120
u_013	2017/1/2	150
u_013	2017/1/2	50
u_013	2017/1/3	150
u_013	2017/1/4	550
u_013	2017/1/5	350
u_014	2017/1/5	250
u_014	2017/1/6	120
u_014	2017/1/7	270
u_014	2017/1/7	20
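
As a sanity check on the u_002 example, the daily totals can be recomputed outside Hive. The following is a minimal Python sketch, not part of the original solution; the rows are copied from the sample data above, and the helper name `daily_totals` is made up for illustration.

```python
from collections import defaultdict
from datetime import date

# u_002's rows from the sample data: (year, month, day, low_carbon)
U002_ROWS = [
    (2017, 1, 1, 10), (2017, 1, 2, 150), (2017, 1, 2, 70),
    (2017, 1, 3, 30), (2017, 1, 3, 80), (2017, 1, 4, 150),
    (2017, 1, 5, 101), (2017, 1, 6, 68),
]

def daily_totals(rows):
    """Sum low_carbon per calendar day (hypothetical helper)."""
    totals = defaultdict(int)
    for y, m, d, carbon in rows:
        totals[date(y, m, d)] += carbon
    return dict(totals)

totals = daily_totals(U002_ROWS)
# Days whose total is strictly greater than 100, in date order
qualifying_days = sorted(day for day, total in totals.items() if total > 100)

for day in sorted(totals):
    mark = "qualifies" if totals[day] > 100 else "-"
    print(day.isoformat(), totals[day], mark)
```

2017/1/5 qualifies with a total of 101 while 2017/1/6 (68) does not, so the run for u_002 is exactly the four days 2017/1/2 to 2017/1/5.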

Reference Answers and Explanation

Both questions are solved here with Hive's HQL.

Preparation

Create the tables
create table plant_carbon (plant_id string,plant_name string,low_carbon int) row format delimited fields terminated by '\t' stored as textfile;

create table user_low_carbon (user_id string,data_dt string,low_carbon int) row format delimited fields terminated by '\t' stored as textfile;
Load the data
load data local inpath '/opt/data/plant_carbon.txt' into table plant_carbon;

load data local inpath '/opt/data/user_low_carbon.txt' into table user_low_carbon;

Question 1 Walkthrough

  1. Total the carbon each user accumulated within the given period
select user_id,sum(low_carbon)low_carbons from user_low_carbon
where  unix_timestamp(data_dt,'yyyy/MM/dd')>= unix_timestamp('2017/01/01','yyyy/MM/dd')
and unix_timestamp(data_dt,'yyyy/MM/dd')< unix_timestamp('2017/10/01','yyyy/MM/dd')
GROUP BY user_id  ORDER BY low_carbons DESC
  2. Look up the carbon required for one 胡杨 (p004)
 select low_carbon from plant_carbon where plant_id='p004'
  3. Look up the carbon required for one 沙柳 (p002)
select low_carbon from plant_carbon where plant_id='p002'
  4. For the top 11 users, deduct the carbon for one 胡杨 and compute how many 沙柳 the remainder buys. The sort sits in the outer query because an ORDER BY inside a subquery does not guarantee the row order seen by the outer LIMIT.
select t1.user_id,floor((t1.low_carbons-t2.low_carbon)/t3.low_carbon) plant_count  from 
(select user_id,sum(low_carbon)low_carbons from user_low_carbon
where  unix_timestamp(data_dt,'yyyy/MM/dd')>= unix_timestamp('2017/01/01','yyyy/MM/dd')
and unix_timestamp(data_dt,'yyyy/MM/dd')< unix_timestamp('2017/10/01','yyyy/MM/dd')
GROUP BY user_id)t1,
(select low_carbon from plant_carbon where plant_id='p004')t2,
(select low_carbon from plant_carbon where plant_id='p002')t3
order by plant_count desc limit 11
  5. The final query: compare each user with the next-ranked user, then keep the top 10
select user_id,plant_count,plant_count-LEAD(plant_count,1,0) over(order by plant_count DESC) less_count
from (select t1.user_id,floor((t1.low_carbons-t2.low_carbon)/t3.low_carbon) plant_count  from 
(select user_id,sum(low_carbon)low_carbons from user_low_carbon
where  unix_timestamp(data_dt,'yyyy/MM/dd')>= unix_timestamp('2017/01/01','yyyy/MM/dd')
and unix_timestamp(data_dt,'yyyy/MM/dd')< unix_timestamp('2017/10/01','yyyy/MM/dd')
GROUP BY user_id)t1,
(select low_carbon from plant_carbon where plant_id='p004')t2,
(select low_carbon from plant_carbon where plant_id='p002')t3
order by plant_count desc limit 11)t4
order by plant_count desc limit 10
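
The floor((total - p004) / p002) expression used above can be checked by hand for a couple of users. This is only an illustrative Python sketch: the function name is made up, the two costs come from plant_carbon, and the totals 1470 (u_007) and 659 (u_002) were obtained by summing the sample rows.

```python
POPLAR_COST = 215  # p004 胡杨, from plant_carbon
WILLOW_COST = 19   # p002 沙柳, from plant_carbon

def willows_after_one_poplar(total_carbon):
    """Mirror floor((total - p004) / p002) with integer floor division."""
    return (total_carbon - POPLAR_COST) // WILLOW_COST

# Totals obtained by summing each user's rows in the sample data
u007 = willows_after_one_poplar(1470)
u002 = willows_after_one_poplar(659)
print(u007, u002)  # 66 23
```

Both values agree with the expected result table for Question 1.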
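Note

Because LEAD() is used here, the 11th user must be kept in step 4. Without that row, LEAD() returns its default value 0 for the last user, and u_002's less_count would come out as 23 instead of 1.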
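
The effect of dropping the 11th row can be reproduced with a small Python stand-in for LEAD(plant_count, 1, 0). This is a sketch under stated assumptions: `lead_diff` is a made-up helper, the first ten counts come from the expected result table, and the 11th value (22) is u_004's count computed from the sample data.

```python
def lead_diff(counts, default=0):
    """For counts sorted in descending order, return each count minus the
    next one, using `default` when there is no next row (like LEAD(x, 1, default))."""
    result = []
    for i, count in enumerate(counts):
        nxt = counts[i + 1] if i + 1 < len(counts) else default
        result.append(count - nxt)
    return result

# plant_count of the top 11 users in descending order
top11 = [66, 63, 53, 46, 45, 44, 39, 37, 32, 23, 22]

with_11th = lead_diff(top11)[:10]     # window over 11 rows, then keep 10
without_11th = lead_diff(top11[:10])  # window over only 10 rows

print(with_11th[-1], without_11th[-1])  # 1 23
```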

Result:

[Image 1: screenshot of the query output]

Question 2 Walkthrough

Question 2 is considerably harder than Question 1, so stay patient!

Basic approach

  1. Compute each user's daily carbon reduction, keeping only the days whose total exceeds 100
select user_id, regexp_replace(data_dt,'/','-') data_dt,sum(low_carbon) sum_day_low_carbon from 
user_low_carbon where year(regexp_replace(data_dt,'/','-'))=2017 
group by user_id,data_dt having sum_day_low_carbon>100
  2. For each qualifying day, look up the dates of the previous row, the row two back, the next row, and the row two ahead
select user_id,today,
LAG(today,1,'1970-1-1') OVER(PARTITION BY user_id ORDER BY today ) yesterday,
LAG(today,2,'1970-1-1') OVER(PARTITION BY user_id ORDER BY today ) before_yesterday,
lead(today,1,'1970-1-1') OVER(PARTITION BY user_id ORDER BY today ) tomorrow,
lead(today,2,'1970-1-1') OVER(PARTITION by user_id order by today) after_tomorrow from 
(select user_id, regexp_replace(data_dt,'/','-') today,sum(low_carbon) sum_day_low_carbon from 
user_low_carbon where year(regexp_replace(data_dt,'/','-')) = 2017 
group by user_id,data_dt having sum_day_low_carbon>100)t1
  3. Compute the day differences between today and each of those four dates
select user_id,today,DATEDIFF(today,yesterday)t_d_y,DATEDIFF(today,before_yesterday)t_d_b,DATEDIFF(today,tomorrow)t_d_t,DATEDIFF(today,after_tomorrow)t_d_a
from(select user_id,today, LAG(today,1,'1970-1-1')over(PARTITION by user_id order by today) yesterday,LAG(today,2,'1970-1-1')over(PARTITION by user_id order by today) before_yesterday,
LEAD(today,1,'1970-1-1')over(PARTITION by user_id order by today) tomorrow,LEAD(today,2,'1970-1-1')over(PARTITION by user_id order by today) after_tomorrow from 
(select user_id, regexp_replace(data_dt,'/','-') today,sum(low_carbon) sum_day_low_carbon from user_low_carbon where year(regexp_replace(data_dt,'/','-'))=2017 group by user_id,data_dt having sum_day_low_carbon>100)t1)t2
  4. Keep the days that are the first, middle, or last day of three consecutive days
SELECT user_id, REGEXP_REPLACE(today,'-','/') data_dt from (
select user_id,today,DATEDIFF(today,yesterday)t_d_y,DATEDIFF(today,before_yesterday)t_d_b,
DATEDIFF(today,tomorrow)t_d_t,DATEDIFF(today,after_tomorrow)t_d_a from(select user_id,today ,
LAG(today,1,'1970-1-1')over(PARTITION by user_id order by today) yesterday,
LAG(today,2,'1970-1-1')over(PARTITION by user_id order by today) before_yesterday,
LEAD(today,1,'1970-1-1')over(PARTITION by user_id order by today) tomorrow,
LEAD(today,2,'1970-1-1')over(PARTITION by user_id order by today) after_tomorrow from 
(select user_id, regexp_replace(data_dt,'/','-') today,sum(low_carbon) sum_day_low_carbon from 
user_low_carbon where year(regexp_replace(data_dt,'/','-'))=2017 
group by user_id,data_dt having sum_day_low_carbon>100)t1)t2)t3
where (t_d_t=-1 AND t_d_a=-2) OR  (t_d_t=-1 AND t_d_y=1) OR (t_d_y=1 AND t_d_b=2);
  5. Join back to the original table to get the final result
SELECT t5.user_id user_id, t5.data_dt data_dt, t5.low_carbon low_carbon FROM 
(SELECT user_id, REGEXP_REPLACE(today,'-','/') data_dt from 
(select user_id,today,DATEDIFF(today,yesterday)t_d_y,DATEDIFF(today,before_yesterday)t_d_b,
DATEDIFF(today,tomorrow)t_d_t,DATEDIFF(today,after_tomorrow)t_d_a from
(select user_id,today ,
LAG(today,1,'1970-1-1')over(PARTITION by user_id order by today) yesterday,
LAG(today,2,'1970-1-1')over(PARTITION by user_id order by today) before_yesterday,
LEAD(today,1,'1970-1-1')over(PARTITION by user_id order by today) tomorrow,
LEAD(today,2,'1970-1-1')over(PARTITION by user_id order by today) after_tomorrow from 
(select user_id, regexp_replace(data_dt,'/','-') today,sum(low_carbon) sum_day_low_carbon from 
user_low_carbon where year(regexp_replace(data_dt,'/','-'))=2017 
group by user_id,data_dt having sum_day_low_carbon>100)t1)t2)t3
where (t_d_t=-1 AND t_d_a=-2) OR  (t_d_t=-1 AND t_d_y=1) OR (t_d_y=1 AND t_d_b=2))t4 
left join user_low_carbon t5 on t4.user_id=t5.user_id and t4.data_dt=t5.data_dt
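
To see why the three-part WHERE condition works, the LAG/LEAD and DATEDIFF logic can be replayed on one user's qualifying days. The Python sketch below is only an illustration of that condition, not a replacement for the query; the function name is hypothetical, and the input is u_011's days whose total exceeds 100 in the sample data.

```python
from datetime import date

EPOCH = date(1970, 1, 1)  # same sentinel the queries pass to LAG/LEAD

def days_in_three_day_runs(days):
    """Keep each day that is the first, middle, or last of three consecutive days.

    `days` must be sorted and contain only days whose total exceeded 100.
    """
    kept = []
    for i, today in enumerate(days):
        prev1 = days[i - 1] if i >= 1 else EPOCH
        prev2 = days[i - 2] if i >= 2 else EPOCH
        next1 = days[i + 1] if i + 1 < len(days) else EPOCH
        next2 = days[i + 2] if i + 2 < len(days) else EPOCH
        t_d_y = (today - prev1).days  # DATEDIFF(today, yesterday)
        t_d_b = (today - prev2).days  # DATEDIFF(today, before_yesterday)
        t_d_t = (today - next1).days  # DATEDIFF(today, tomorrow)
        t_d_a = (today - next2).days  # DATEDIFF(today, after_tomorrow)
        if ((t_d_t == -1 and t_d_a == -2)
                or (t_d_t == -1 and t_d_y == 1)
                or (t_d_y == 1 and t_d_b == 2)):
            kept.append(today)
    return kept

# u_011's days with a total above 100 in the sample data
u011 = [date(2017, 1, 1), date(2017, 1, 2), date(2017, 1, 3), date(2017, 1, 7)]
kept_days = days_in_three_day_runs(u011)
print([d.isoformat() for d in kept_days])
```

2017/1/1 is kept as the first day of a run, 2017/1/2 as the middle day, and 2017/1/3 as the last day, while the isolated 2017/1/7 is dropped. This matches the u_011 rows in the expected output.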
Result:

[Image 2: screenshot of the query output]
