Hive (7): Ant Financial Written Test Questions Explained

Contents

  • Background
  • Problem 1
    • Solution steps
      • Data preparation
      • Step-by-step queries
  • Problem 2
    • Step-by-step solution
      • Solution 1
      • Solution 2

Background

The following table records each user's daily low-carbon energy collection in Ant Forest.

table_name: user_low_carbon
user_id  data_dt  low_carbon
User ID  Date     Carbon emissions reduced (g)

Data:

u_001	2017/1/1	10
u_001	2017/1/2	150
u_001	2017/1/2	110
u_001	2017/1/2	10
u_001	2017/1/4	50
u_001	2017/1/4	10
u_001	2017/1/6	45
u_001	2017/1/6	90
u_002	2017/1/1	10
u_002	2017/1/2	150
u_002	2017/1/2	70
u_002	2017/1/3	30
u_002	2017/1/3	80
u_002	2017/1/4	150
u_002	2017/1/5	101
u_002	2017/1/6	68
u_003	2017/1/1	20
u_003	2017/1/2	10
u_003	2017/1/2	150
u_003	2017/1/3	160
u_003	2017/1/4	20
u_003	2017/1/5	120
u_003	2017/1/6	20
u_003	2017/1/7	10
u_003	2017/1/7	110
u_004	2017/1/1	110
u_004	2017/1/2	20
u_004	2017/1/2	50
u_004	2017/1/3	120
u_004	2017/1/4	30
u_004	2017/1/5	60
u_004	2017/1/6	120
u_004	2017/1/7	10
u_004	2017/1/7	120
u_005	2017/1/1	80
u_005	2017/1/2	50
u_005	2017/1/2	80
u_005	2017/1/3	180
u_005	2017/1/4	180
u_005	2017/1/4	10
u_005	2017/1/5	80
u_005	2017/1/6	280
u_005	2017/1/7	80
u_005	2017/1/7	80
u_006	2017/1/1	40
u_006	2017/1/2	40
u_006	2017/1/2	140
u_006	2017/1/3	210
u_006	2017/1/3	10
u_006	2017/1/4	40
u_006	2017/1/5	40
u_006	2017/1/6	20
u_006	2017/1/7	50
u_006	2017/1/7	240
u_007	2017/1/1	130
u_007	2017/1/2	30
u_007	2017/1/2	330
u_007	2017/1/3	30
u_007	2017/1/4	530
u_007	2017/1/5	30
u_007	2017/1/6	230
u_007	2017/1/7	130
u_007	2017/1/7	30
u_008	2017/1/1	160
u_008	2017/1/2	60
u_008	2017/1/2	60
u_008	2017/1/3	60
u_008	2017/1/4	260
u_008	2017/1/5	360
u_008	2017/1/6	160
u_008	2017/1/7	60
u_008	2017/1/7	60
u_009	2017/1/1	70
u_009	2017/1/2	70
u_009	2017/1/2	70
u_009	2017/1/3	170
u_009	2017/1/4	270
u_009	2017/1/5	70
u_009	2017/1/6	70
u_009	2017/1/7	70
u_009	2017/1/7	70
u_010	2017/1/1	90
u_010	2017/1/2	90
u_010	2017/1/2	90
u_010	2017/1/3	90
u_010	2017/1/4	90
u_010	2017/1/4	80
u_010	2017/1/5	90
u_010	2017/1/5	90
u_010	2017/1/6	190
u_010	2017/1/7	90
u_010	2017/1/7	90
u_011	2017/1/1	110
u_011	2017/1/2	100
u_011	2017/1/2	100
u_011	2017/1/3	120
u_011	2017/1/4	100
u_011	2017/1/5	100
u_011	2017/1/6	100
u_011	2017/1/7	130
u_011	2017/1/7	100
u_012	2017/1/1	10
u_012	2017/1/2	120
u_012	2017/1/2	10
u_012	2017/1/3	10
u_012	2017/1/4	50
u_012	2017/1/5	10
u_012	2017/1/6	20
u_012	2017/1/7	10
u_012	2017/1/7	10
u_013	2017/1/1	50
u_013	2017/1/2	150
u_013	2017/1/2	50
u_013	2017/1/3	150
u_013	2017/1/4	550
u_013	2017/1/5	350
u_013	2017/1/6	50
u_013	2017/1/7	20
u_013	2017/1/7	60
u_014	2017/1/1	220
u_014	2017/1/2	120
u_014	2017/1/2	20
u_014	2017/1/3	20
u_014	2017/1/4	20
u_014	2017/1/5	250
u_014	2017/1/6	120
u_014	2017/1/7	270
u_014	2017/1/7	20
u_015	2017/1/1	10
u_015	2017/1/2	20
u_015	2017/1/2	10
u_015	2017/1/3	10
u_015	2017/1/4	20
u_015	2017/1/5	70
u_015	2017/1/6	10
u_015	2017/1/7	80
u_015	2017/1/7	60

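The Ant Forest plant exchange table records the carbon reduction required to claim each plant.

table_name: plant_carbon
plant_id  plant_name  low_carbon
Plant ID  Plant name  Carbon required to claim the plant

Data: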

p001	梭梭树	17
p002	沙柳	19
p003	樟子树	146
p004	胡杨	215

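Problem 1

Ant Forest plant claim statistics
Question: Assume that low-carbon data (user_low_carbon) has been recorded since January 1, 2017, and that every user who met the requirement before October 1, 2017 claimed one p004 (胡杨, Euphrates poplar)
and spent all remaining energy on p002 (沙柳, sand willow).
Find the top 10 users by the cumulative number of p002 sand willows claimed as of October 1, and for each of them, how many more sand willows they claimed than the next-ranked user.
The result should look like the following table:

user_id  plant_count less_count (how many more sand willows than the next-ranked user)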
u_101    1000         100
u_088    900          400
u_103    500

Solution steps

Data preparation

1. Create the tables

create table user_low_carbon(user_id String,data_dt String,low_carbon int) row format delimited fields terminated by '\t';
create table plant_carbon(plant_id string,plant_name String,low_carbon int) row format delimited fields terminated by '\t';

2. Load the data

load data local inpath "/opt/module/data/user_low_carbon.txt" into table user_low_carbon;
load data local inpath "/opt/module/data/plant_carbon.txt" into table plant_carbon;

3. Enable local mode

set hive.exec.mode.local.auto=true;

Step-by-step queries

1. Compute each user's total low-carbon amount before 2017/10/1

select
    user_id,
    sum(low_carbon) sum_low_carbon
from
    user_low_carbon
where
    date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM')<'2017-10'
group by
    user_id
order by
    sum_low_carbon desc
limit 11;

This result is aliased as t1.

2. Get the energy required for one Euphrates poplar (p004)

select low_carbon from plant_carbon where plant_id='p004';

This result is aliased as t2.

3. Get the energy required for one sand willow (p002)

select low_carbon from plant_carbon where plant_id='p002';

This result is aliased as t3.

4. Compute how many sand willows each user can claim
Abbreviated version:

select
    user_id,
    floor((sum_low_carbon-t2.low_carbon)/t3.low_carbon) plant_count
from
    t1,t2,t3;

This result is aliased as t4.
Note: floor rounds down to the nearest integer.
Full version:

select
 user_id,
 floor((sum_low_carbon-t2.low_carbon)/t3.low_carbon) plant_count
from
 (select
  user_id,
  sum(low_carbon) sum_low_carbon
 from
  user_low_carbon
 where
  date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM')<'2017-10'
 group by
  user_id
 order by
  sum_low_carbon desc
limit 11)t1,
 (select low_carbon from plant_carbon where plant_id='p004')t2,
 (select low_carbon from plant_carbon where plant_id='p002')t3;
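
As a quick check of the formula, here is a minimal Python sketch of the same arithmetic. The costs 215 and 19 come from plant_carbon; the function name willow_count is made up for illustration, and the total of 1470 g is u_007's sample rows summed by hand, so treat it as an assumption rather than query output.

```python
import math

POPLAR_COST = 215  # low_carbon of p004 in plant_carbon
WILLOW_COST = 19   # low_carbon of p002 in plant_carbon


def willow_count(total_low_carbon):
    """Mirror floor((sum_low_carbon - t2.low_carbon) / t3.low_carbon)."""
    return math.floor((total_low_carbon - POPLAR_COST) / WILLOW_COST)


# u_007's sample rows add up to 1470 g (hand-summed, an assumption).
# (1470 - 215) / 19 is a little over 66, so the floor is 66.
print(willow_count(1470))
```

The formula assumes the user can afford the poplar in the first place. A total below 215 would give a negative count, which does not appear to happen with the sample data.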

5. Sort by the number of sand willows claimed, and bring the next row's plant_count onto the current row

select
    user_id,
    plant_count,
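    lead(plant_count,1) over(order by plant_count desc) lead_plant_count
from
    t4
order by
    plant_count desc
limit 10;

This result is aliased as t5.
Note: lead is a window function that returns the value from the n-th row after the current row. With no default argument it returns NULL when there is no such row, which is clearer than a date-like string default on a numeric column.
Full version: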

select
    user_id,
    plant_count,
    lead(plant_count,1) over(order by plant_count desc) lead_plant_count
from
    (select
 user_id,
 floor((sum_low_carbon-t2.low_carbon)/t3.low_carbon) plant_count
from
 (select
  user_id,
  sum(low_carbon) sum_low_carbon
 from
  user_low_carbon
 where
  date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM')<'2017-10'
 group by
  user_id
 order by
  sum_low_carbon desc
limit 11)t1,
 (select low_carbon from plant_carbon where plant_id='p004')t2,
 (select low_carbon from plant_carbon where plant_id='p002')t3)t4
order by
    plant_count desc
limit 10;

6. Compute the difference in the number of sand willows

select
    user_id,
    plant_count,
    (plant_count-lead_plant_count) plant_count_diff
from
    t5;

Full version:

select
    user_id,
    plant_count,
    (plant_count-lead_plant_count) plant_count_diff
from
    (select
    user_id,
    plant_count,
    lead(plant_count,1) over(order by plant_count desc) lead_plant_count
from
    (select
 user_id,
 floor((sum_low_carbon-t2.low_carbon)/t3.low_carbon) plant_count
from
 (select
  user_id,
  sum(low_carbon) sum_low_carbon
 from
  user_low_carbon
 where
  date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM')<'2017-10'
 group by
  user_id
 order by
  sum_low_carbon desc
limit 11)t1,
 (select low_carbon from plant_carbon where plant_id='p004')t2,
 (select low_carbon from plant_carbon where plant_id='p002')t3)t4
 order by
    plant_count desc
limit 10)t5;
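
To make the sort, lead, and subtract steps concrete, the following Python sketch replays them on four users. The per-user totals are hand-summed from the sample rows (an assumption, not Hive output), and all variable names are illustrative only.

```python
POPLAR_COST, WILLOW_COST = 215, 19  # p004 and p002 costs from plant_carbon

# Per-user totals, hand-summed from the sample rows (assumption; four users only).
totals = {"u_005": 1100, "u_007": 1470, "u_008": 1240, "u_013": 1430}

# t4: sand willows each user can claim after paying for one poplar.
counts = {u: (t - POPLAR_COST) // WILLOW_COST for u, t in totals.items()}

# t5: order by plant_count desc and pair each row with the next row (lead).
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

result = []
for i, (user, count) in enumerate(ranked):
    # lead(plant_count, 1) is NULL on the last row; None plays that role here.
    nxt = ranked[i + 1][1] if i + 1 < len(ranked) else None
    diff = count - nxt if nxt is not None else None
    result.append((user, count, diff))

for row in result:
    print(row)
```

In the Hive query, t1 keeps 11 users and the outer query keeps 10, so the tenth row still has a following row to compare with. The last row in this sketch has no successor and therefore gets None.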

Problem 2

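Ant Forest low-carbon user ranking analysis
Question: Query the daily transaction records in the user_low_carbon table that satisfy the following condition:
in 2017, the user has three or more consecutive days on each of which the carbon reduction (low_carbon) exceeds 100g.
The query must return the matching transaction records from user_low_carbon.
For example, the qualifying records for user u_002 are listed below, because on each of the four consecutive days from 2017/1/2 to 2017/1/5 the daily total is greater than or equal to 100g: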

seq(key) user_id data_dt  low_carbon
xxxxx10    u_002  2017/1/2  150
xxxxx11    u_002  2017/1/2  70
xxxxx12    u_002  2017/1/3  30
xxxxx13    u_002  2017/1/3  80
xxxxx14    u_002  2017/1/4  150
xxxxx14    u_002  2017/1/5  101

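The data was already loaded in Problem 1, so we go straight to the step-by-step solution.

Step-by-step solution

Solution 1

1. Filter to 2017 and to days whose total low-carbon amount reaches 100g (the query uses >= 100, following the "greater than or equal to" wording of the example)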

select
    user_id,
    date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
    user_low_carbon
where
    substring(data_dt,1,4)='2017'
group by
    user_id,data_dt
having
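    sum(low_carbon)>=100;

This result is aliased as t1.
Next, decide whether a row belongs to a run of three or more consecutive dates. Solution 1 compares each row's date with the dates of the two preceding and the two following rows. The row qualifies if the differences are 2 and 1 (two preceding rows), 1 and -1 (one row on each side), or -1 and -2 (two following rows).
2. Bring the dates of the two preceding and the two following rows onto the current row
Abbreviated version: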

select
    user_id,
    data_dt,
    lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2,
    lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1,
    lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1,
    lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2
from
    t1;
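
As a rough illustration of what lag and lead with a default value produce, here is a small Python sketch over u_002's four qualifying dates (2017/1/2 to 2017/1/5, as given in the problem statement). The helper name shifted is hypothetical.

```python
from datetime import date

DEFAULT = date(1970, 1, 1)  # same sentinel as the '1970-01-01' default in the query


def shifted(dates, i, offset):
    """Return dates[i + offset], or DEFAULT when that row does not exist."""
    j = i + offset
    return dates[j] if 0 <= j < len(dates) else DEFAULT


# Qualifying dates of one user, already sorted (order by data_dt).
dates = [date(2017, 1, 2), date(2017, 1, 3), date(2017, 1, 4), date(2017, 1, 5)]

rows = [
    {
        "data_dt": d,
        "lag2": shifted(dates, i, -2),   # lag(data_dt, 2, default)
        "lag1": shifted(dates, i, -1),   # lag(data_dt, 1, default)
        "lead1": shifted(dates, i, 1),   # lead(data_dt, 1, default)
        "lead2": shifted(dates, i, 2),   # lead(data_dt, 2, default)
    }
    for i, d in enumerate(dates)
]

for row in rows:
    print(row)
```

The sentinel date is far from any 2017 date, so a difference against it can never equal 1, 2, -1, or -2 and a missing neighbour never counts as consecutive.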

Full version (with t1 expanded):

select
    user_id,
    data_dt,
    lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2,
    lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1,
    lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1,
    lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2
from
    (select
    user_id,
    date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
    user_low_carbon
where
    substring(data_dt,1,4)='2017'
group by
    user_id,data_dt
having
    sum(low_carbon)>=100)t1;

This result is aliased as t2.
3. Compute the difference in days between the current date and the dates of the neighbouring rows
Abbreviated version:

select
    user_id,
    data_dt,
    datediff(data_dt,lag2) lag2_diff,
    datediff(data_dt,lag1) lag1_diff,
    datediff(data_dt,lead1) lead1_diff,
    datediff(data_dt,lead2) lead2_diff
from
    t2;

This result is aliased as t3.
Full version:

select
    user_id,
    data_dt,
    datediff(data_dt,lag2) lag2_diff,
    datediff(data_dt,lag1) lag1_diff,
    datediff(data_dt,lead1) lead1_diff,
    datediff(data_dt,lead2) lead2_diff
from
    (select
    user_id,
    data_dt,
    lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2,
    lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1,
    lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1,
    lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2
from
    (select
    user_id,
    date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
    user_low_carbon
where
    substring(data_dt,1,4)='2017'
group by
    user_id,data_dt
having
    sum(low_carbon)>=100)t1)t2;

4. Keep the users and dates that fall in a run of three or more consecutive qualifying days
Abbreviated version:

select
    user_id,
    data_dt
from
    t3
where
    (lag2_diff=2 and lag1_diff=1) 
    or 
    (lag1_diff=1 and lead1_diff=-1) 
    or 
    (lead1_diff=-1 and lead2_diff=-2);

This result is aliased as t4.
Full version:

select
    user_id,
    data_dt
from
    (select
    user_id,
    data_dt,
    datediff(data_dt,lag2) lag2_diff,
    datediff(data_dt,lag1) lag1_diff,
    datediff(data_dt,lead1) lead1_diff,
    datediff(data_dt,lead2) lead2_diff
from
    (select
    user_id,
    data_dt,
    lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2,
    lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1,
    lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1,
    lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2
from
    (select
    user_id,
    date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
    user_low_carbon
where
    substring(data_dt,1,4)='2017'
group by
    user_id,data_dt
having
    sum(low_carbon)>=100)t1)t2)t3
where
    (lag2_diff=2 and lag1_diff=1) 
    or 
    (lag1_diff=1 and lead1_diff=-1) 
    or 
    (lead1_diff=-1 and lead2_diff=-2);

5. Convert the date format of the original table

select
  user_id,
  date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
  low_carbon
from 
  user_low_carbon;

This result is aliased as t6.

6. Join with the converted table

select
    t6.user_id,
    t6.data_dt,
    t6.low_carbon
from
    (select
    user_id,
    data_dt
from
    (select
    user_id,
    data_dt,
    datediff(data_dt,lag2) lag2_diff,
    datediff(data_dt,lag1) lag1_diff,
    datediff(data_dt,lead1) lead1_diff,
    datediff(data_dt,lead2) lead2_diff
from
    (select
    user_id,
    data_dt,
    lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2,
    lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1,
    lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1,
    lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2
from
    (select
    user_id,
    date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
    user_low_carbon
where
    substring(data_dt,1,4)='2017'
group by
    user_id,data_dt
having
    sum(low_carbon)>=100)t1)t2)t3
where
    (lag2_diff=2 and lag1_diff=1) 
    or 
    (lag1_diff=1 and lead1_diff=-1) 
    or 
    (lead1_diff=-1 and lead2_diff=-2))t4
join
    (select
  user_id,
  date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
  low_carbon
from 
  user_low_carbon) t6
on
    t4.user_id = t6.user_id and t4.data_dt = t6.data_dt;
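
For readers who want to trace the whole of Solution 1 without a cluster, here is a minimal Python sketch run on u_002's sample rows only. It follows the same stages (daily totals, neighbour differences, join back to the raw rows), but the helper names parse and diff are hypothetical, and None stands in for the '1970-01-01' default.

```python
from collections import defaultdict
from datetime import datetime

# u_002's rows from the sample data: (user_id, data_dt, low_carbon).
records = [
    ("u_002", "2017/1/1", 10), ("u_002", "2017/1/2", 150),
    ("u_002", "2017/1/2", 70), ("u_002", "2017/1/3", 30),
    ("u_002", "2017/1/3", 80), ("u_002", "2017/1/4", 150),
    ("u_002", "2017/1/5", 101), ("u_002", "2017/1/6", 68),
]


def parse(text):
    """Turn '2017/1/2' into a date object."""
    return datetime.strptime(text, "%Y/%m/%d").date()


def diff(days, i, offset):
    """datediff(current, neighbour), or None when the neighbour is missing."""
    j = i + offset
    return (days[i] - days[j]).days if 0 <= j < len(days) else None


# t1: daily totals per user, keeping 2017 days whose total is >= 100.
daily = defaultdict(int)
for user, day, grams in records:
    daily[(user, parse(day))] += grams
days_by_user = defaultdict(list)
for (user, day), total in daily.items():
    if day.year == 2017 and total >= 100:
        days_by_user[user].append(day)

# t2 to t4: keep a day when it sits in a run of three consecutive days.
keep = set()
for user, days in days_by_user.items():
    days.sort()
    for i, day in enumerate(days):
        lag2, lag1 = diff(days, i, -2), diff(days, i, -1)
        lead1, lead2 = diff(days, i, 1), diff(days, i, 2)
        if ((lag2 == 2 and lag1 == 1)
                or (lag1 == 1 and lead1 == -1)
                or (lead1 == -1 and lead2 == -2)):
            keep.add((user, day))

# Final join: return the raw rows whose (user, date) survived the filter.
result = [r for r in records if (r[0], parse(r[1])) in keep]
for row in result:
    print(row)
```

On this input the sketch returns the six rows dated 2017/1/2 through 2017/1/5, the same rows listed in the problem statement.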

Note: do not use a reserved keyword as a table alias. I used user as an alias and the query failed.

  • Overall, this solution is methodical but tedious. Solution 2 below is more compact:

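Solution 2

Use the property of an arithmetic sequence: sort the qualifying dates, number them 1, 2, 3, and so on, then subtract each row's number from its date. Consecutive dates yield the same result.

date     rank date minus rank (month-day)
2017/1/2 1 1-1
2017/1/3 2 1-1
2017/1/4 3 1-1
2017/1/5 4 1-1
2017/1/6 5 1-1
2017/1/8 6 1-2
2017/1/9 7 1-2

Look at the last column: three or more rows sharing the same value form a run of consecutive days.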
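
The trick can be checked in a few lines of Python on the seven example dates above. This is only an illustrative sketch; the variable names are not from the original queries.

```python
from datetime import date, timedelta
from itertools import groupby

# The seven example dates, already sorted.
dates = [date(2017, 1, d) for d in (2, 3, 4, 5, 6, 8, 9)]

# Rank starts at 1; subtracting it maps every run of consecutive
# dates onto one shared "anchor" date.
anchors = [d - timedelta(days=rank) for rank, d in enumerate(dates, start=1)]

# Count how many rows share each anchor.
runs = [(anchor, len(list(group))) for anchor, group in groupby(anchors)]

for anchor, size in runs:
    print(anchor, size, "consecutive" if size >= 3 else "too short")
```

The first five dates share the anchor 2017-01-01 and so form a run of five. The last two share 2017-01-02 and are too short to qualify.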

1. Filter to 2017 and to days whose total low-carbon amount reaches 100g

select
    user_id,
    date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
    user_low_carbon
where
    substring(data_dt,1,4)='2017'
group by
    user_id,data_dt
having
    sum(low_carbon)>=100;

This result is aliased as t1.
2. Sort by date within each user and give every row a sequence number
Abbreviated version:

select
    user_id,
    data_dt,
    rank() over(partition by user_id order by data_dt) rk
from
    t1;

This result is aliased as t2.
Full version:

select
    user_id,
    data_dt,
    rank() over(partition by user_id order by data_dt) rk
from
    (select
    user_id,
    date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
    user_low_carbon
where
    substring(data_dt,1,4)='2017'
group by
    user_id,data_dt
having
    sum(low_carbon)>=100)t1;

3. Subtract the rank value from the date

select
    user_id,
    data_dt,
    date_sub(data_dt,rk) data_sub_rk
from
    (select
    user_id,
    data_dt,
    rank() over(partition by user_id order by data_dt) rk
from
    (select
    user_id,
    date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
    user_low_carbon
where
    substring(data_dt,1,4)='2017'
group by
    user_id,data_dt
having
    sum(low_carbon)>=100)t1)t2;

This result is aliased as t3.

4. Find the users with three or more consecutive days over 100g

select
    user_id
from
    (select
    user_id,
    data_dt,
    date_sub(data_dt,rk) data_sub_rk
from
    (select
    user_id,
    data_dt,
    rank() over(partition by user_id order by data_dt) rk
from
    (select
    user_id,
    date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
    user_low_carbon
where
    substring(data_dt,1,4)='2017'
group by
    user_id,data_dt
having
    sum(low_carbon)>=100)t1)t2)t3
group by 
    user_id,data_sub_rk
having
    count(*)>=3;
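
The query above stops at user_id, while the problem asks for the matching transaction records. One possible way to finish (my own assumption, not part of the original solution) is to remember which dates belong to a qualifying run and join them back to the raw rows. The Python sketch below does this for the sample rows of u_001 and u_002; all names are illustrative.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Sample rows of u_001 and u_002: (user_id, data_dt, low_carbon).
records = [
    ("u_001", "2017/1/1", 10), ("u_001", "2017/1/2", 150),
    ("u_001", "2017/1/2", 110), ("u_001", "2017/1/2", 10),
    ("u_001", "2017/1/4", 50), ("u_001", "2017/1/4", 10),
    ("u_001", "2017/1/6", 45), ("u_001", "2017/1/6", 90),
    ("u_002", "2017/1/1", 10), ("u_002", "2017/1/2", 150),
    ("u_002", "2017/1/2", 70), ("u_002", "2017/1/3", 30),
    ("u_002", "2017/1/3", 80), ("u_002", "2017/1/4", 150),
    ("u_002", "2017/1/5", 101), ("u_002", "2017/1/6", 68),
]


def parse(text):
    """Turn '2017/1/2' into a date object."""
    return datetime.strptime(text, "%Y/%m/%d").date()


# t1: 2017 days per user whose daily total is >= 100.
daily = defaultdict(int)
for user, day, grams in records:
    daily[(user, parse(day))] += grams
days_by_user = defaultdict(list)
for (user, day), total in daily.items():
    if day.year == 2017 and total >= 100:
        days_by_user[user].append(day)

# t2 and t3: rank the days per user and subtract the rank from the date.
anchor_of = {}
for user, days in days_by_user.items():
    for rank, day in enumerate(sorted(days), start=1):
        anchor_of[(user, day)] = day - timedelta(days=rank)

# Step 4: group by (user, anchor) and keep groups with at least 3 days.
run_sizes = Counter((user, anchor) for (user, _), anchor in anchor_of.items())
users = sorted({user for (user, _), size in run_sizes.items() if size >= 3})

# Extra step (assumption): join the qualifying days back to the raw rows.
kept_days = {key for key, anchor in anchor_of.items()
             if run_sizes[(key[0], anchor)] >= 3}
result = [r for r in records if (r[0], parse(r[1])) in kept_days]

print(users)
for row in result:
    print(row)
```

Here u_001 has only two qualifying days (2017/1/2 and 2017/1/6), which are not consecutive, so only u_002's rows from 2017/1/2 to 2017/1/5 are returned.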
