The table below records each user's daily transaction log of low-carbon energy collected in Ant Forest.
table_name:user_low_carbon
user_id data_dt low_carbon
(user, date, carbon emissions reduced in g)
Data:
u_001 2017/1/1 10
u_001 2017/1/2 150
u_001 2017/1/2 110
u_001 2017/1/2 10
u_001 2017/1/4 50
u_001 2017/1/4 10
u_001 2017/1/6 45
u_001 2017/1/6 90
u_002 2017/1/1 10
u_002 2017/1/2 150
u_002 2017/1/2 70
u_002 2017/1/3 30
u_002 2017/1/3 80
u_002 2017/1/4 150
u_002 2017/1/5 101
u_002 2017/1/6 68
u_003 2017/1/1 20
u_003 2017/1/2 10
u_003 2017/1/2 150
u_003 2017/1/3 160
u_003 2017/1/4 20
u_003 2017/1/5 120
u_003 2017/1/6 20
u_003 2017/1/7 10
u_003 2017/1/7 110
u_004 2017/1/1 110
u_004 2017/1/2 20
u_004 2017/1/2 50
u_004 2017/1/3 120
u_004 2017/1/4 30
u_004 2017/1/5 60
u_004 2017/1/6 120
u_004 2017/1/7 10
u_004 2017/1/7 120
u_005 2017/1/1 80
u_005 2017/1/2 50
u_005 2017/1/2 80
u_005 2017/1/3 180
u_005 2017/1/4 180
u_005 2017/1/4 10
u_005 2017/1/5 80
u_005 2017/1/6 280
u_005 2017/1/7 80
u_005 2017/1/7 80
u_006 2017/1/1 40
u_006 2017/1/2 40
u_006 2017/1/2 140
u_006 2017/1/3 210
u_006 2017/1/3 10
u_006 2017/1/4 40
u_006 2017/1/5 40
u_006 2017/1/6 20
u_006 2017/1/7 50
u_006 2017/1/7 240
u_007 2017/1/1 130
u_007 2017/1/2 30
u_007 2017/1/2 330
u_007 2017/1/3 30
u_007 2017/1/4 530
u_007 2017/1/5 30
u_007 2017/1/6 230
u_007 2017/1/7 130
u_007 2017/1/7 30
u_008 2017/1/1 160
u_008 2017/1/2 60
u_008 2017/1/2 60
u_008 2017/1/3 60
u_008 2017/1/4 260
u_008 2017/1/5 360
u_008 2017/1/6 160
u_008 2017/1/7 60
u_008 2017/1/7 60
u_009 2017/1/1 70
u_009 2017/1/2 70
u_009 2017/1/2 70
u_009 2017/1/3 170
u_009 2017/1/4 270
u_009 2017/1/5 70
u_009 2017/1/6 70
u_009 2017/1/7 70
u_009 2017/1/7 70
u_010 2017/1/1 90
u_010 2017/1/2 90
u_010 2017/1/2 90
u_010 2017/1/3 90
u_010 2017/1/4 90
u_010 2017/1/4 80
u_010 2017/1/5 90
u_010 2017/1/5 90
u_010 2017/1/6 190
u_010 2017/1/7 90
u_010 2017/1/7 90
u_011 2017/1/1 110
u_011 2017/1/2 100
u_011 2017/1/2 100
u_011 2017/1/3 120
u_011 2017/1/4 100
u_011 2017/1/5 100
u_011 2017/1/6 100
u_011 2017/1/7 130
u_011 2017/1/7 100
u_012 2017/1/1 10
u_012 2017/1/2 120
u_012 2017/1/2 10
u_012 2017/1/3 10
u_012 2017/1/4 50
u_012 2017/1/5 10
u_012 2017/1/6 20
u_012 2017/1/7 10
u_012 2017/1/7 10
u_013 2017/1/1 50
u_013 2017/1/2 150
u_013 2017/1/2 50
u_013 2017/1/3 150
u_013 2017/1/4 550
u_013 2017/1/5 350
u_013 2017/1/6 50
u_013 2017/1/7 20
u_013 2017/1/7 60
u_014 2017/1/1 220
u_014 2017/1/2 120
u_014 2017/1/2 20
u_014 2017/1/3 20
u_014 2017/1/4 20
u_014 2017/1/5 250
u_014 2017/1/6 120
u_014 2017/1/7 270
u_014 2017/1/7 20
u_015 2017/1/1 10
u_015 2017/1/2 20
u_015 2017/1/2 10
u_015 2017/1/3 10
u_015 2017/1/4 20
u_015 2017/1/5 70
u_015 2017/1/6 10
u_015 2017/1/7 80
u_015 2017/1/7 60
Ant Forest plant redemption table, which records the carbon reduction required to claim each plant.
table_name: plant_carbon
plant_id plant_name low_carbon
(plant ID, plant name, carbon required to redeem the plant)
Data:
p001 梭梭树 17
p002 沙柳 19
p003 樟子树 146
p004 胡杨 215
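Ant Forest plant claim statistics
Problem: assume low-carbon data (user_low_carbon) has been recorded since January 1, 2017, and that every user who qualified before October 1, 2017 claimed one p004-胡杨 (poplar) and spent all of the remaining energy on p002-沙柳 (sand willow).
Report the top 10 users by the cumulative number of p002-沙柳 claimed as of October 1, along with how many more 沙柳 each of them claimed than the next-ranked user.
The result should have the following layout:
user_id plant_count less_count (how many more 沙柳 than the next-ranked user)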
u_101 1000 100
u_088 900 400
u_103 500 …
1. Create the tables
create table user_low_carbon(user_id String,data_dt String,low_carbon int) row format delimited fields terminated by '\t';
create table plant_carbon(plant_id string,plant_name String,low_carbon int) row format delimited fields terminated by '\t';
2. Load the data
load data local inpath "/opt/module/data/user_low_carbon.txt" into table user_low_carbon;
load data local inpath "/opt/module/data/plant_carbon.txt" into table plant_carbon;
3. Enable local mode
set hive.exec.mode.local.auto=true;
1. Compute each user's total carbon reduction before 2017/10/1
select
user_id,
sum(low_carbon) sum_low_carbon
from
user_low_carbon
where
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM')<'2017-10'
group by
user_id
order by
sum_low_carbon desc
limit 11;t1
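To make the filter and aggregation in step 1 concrete, here is a minimal Python sketch of the same logic. The function name `totals_before_cutoff` and the row dated 2017/10/1 are made up for illustration; the other rows come from the sample data. Comparing the 'yyyy-MM' string against '2017-10' in the query has the same effect as comparing the full date against October 1, which is what the sketch does.

```python
from collections import defaultdict
from datetime import datetime

CUTOFF = datetime(2017, 10, 1)

def totals_before_cutoff(rows):
    """Sum low_carbon per user for rows dated strictly before CUTOFF."""
    totals = defaultdict(int)
    for user_id, data_dt, low_carbon in rows:
        # data_dt looks like '2017/1/2'; strptime accepts unpadded month/day.
        if datetime.strptime(data_dt, "%Y/%m/%d") < CUTOFF:
            totals[user_id] += low_carbon
    # Highest totals first, like "order by sum_low_carbon desc".
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# A few rows from the sample data plus one hypothetical row on the cutoff date.
rows = [
    ("u_001", "2017/1/1", 10),
    ("u_001", "2017/1/2", 150),
    ("u_002", "2017/1/1", 10),
    ("u_002", "2017/10/1", 999),  # hypothetical, excluded by the filter
]
print(totals_before_cutoff(rows))  # [('u_001', 160), ('u_002', 10)]
```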
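2. Look up the carbon required for p004 (胡杨)
select low_carbon from plant_carbon where plant_id='p004';t2
3. Look up the carbon required for p002 (沙柳)
select low_carbon from plant_carbon where plant_id='p002';t3
The label after each semicolon (t1, t2, t3) is the alias under which that result is used in the later steps.
4. Compute the number of 沙柳 each user can claim
Short form: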
select
user_id,
floor((sum_low_carbon-t2.low_carbon)/t3.low_carbon) plant_count
from
t1,t2,t3;t4
This result is aliased as t4.
Note: floor rounds down to the nearest integer.
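As a worked example of the formula in this step, the Python sketch below recomputes plant_count for user u_007 from the sample data. The function name `willows_claimed` is made up for illustration; the costs 215 and 19 are the p004 and p002 values from plant_carbon.

```python
import math

# Carbon costs taken from the plant_carbon sample data.
POPLAR_COST = 215   # p004
WILLOW_COST = 19    # p002

def willows_claimed(total_carbon: int) -> int:
    """Mirror floor((sum_low_carbon - t2.low_carbon) / t3.low_carbon)."""
    return math.floor((total_carbon - POPLAR_COST) / WILLOW_COST)

# u_007's rows from user_low_carbon, all dated before 2017/10/1.
u_007 = [130, 30, 330, 30, 530, 30, 230, 130, 30]
total = sum(u_007)
print(total, willows_claimed(total))  # 1470 66
```

After paying 215 for the 胡杨, u_007 has 1255 left, and 1255 / 19 is about 66.05, so floor yields 66.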
Full version:
select
user_id,
floor((sum_low_carbon-t2.low_carbon)/t3.low_carbon) plant_count
from
(select
user_id,
sum(low_carbon) sum_low_carbon
from
user_low_carbon
where
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM')<'2017-10'
group by
user_id
order by
sum_low_carbon desc
limit 11)t1,
(select low_carbon from plant_carbon where plant_id='p004')t2,
(select low_carbon from plant_carbon where plant_id='p002')t3;t4
5. Sort by the number of 沙柳 claimed and bring the next row's plant_count into the current row
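select
user_id,
plant_count,
lead(plant_count,1) over(order by plant_count desc) lead_plant_count
from
t4
order by
plant_count desc
limit 10;t5
This result is aliased as t5.
Note: lead is a window function that returns the value from the n-th row after the current row. It is called without a third argument here, so the last row, which has no successor, gets NULL.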
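The behavior of lead over a descending sort can be sketched in a few lines of Python. This is only an illustration of the window semantics, not how Hive implements it; the helper name `with_lead` is hypothetical, and the three counts were derived from the sample data with floor((total - 215) / 19).

```python
from typing import List, Optional, Tuple

def with_lead(rows: List[Tuple[str, int]]) -> List[Tuple[str, int, Optional[int]]]:
    """Sort by plant_count descending and attach the next row's plant_count.

    The last row has no successor, so its lead value is None,
    the analogue of SQL NULL when lead() is given no default.
    """
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    result = []
    for i, (user_id, count) in enumerate(ordered):
        nxt = ordered[i + 1][1] if i + 1 < len(ordered) else None
        result.append((user_id, count, nxt))
    return result

# plant_count values for three users from the sample data.
sample = [("u_013", 63), ("u_007", 66), ("u_008", 53)]
for user_id, count, nxt in with_lead(sample):
    diff = None if nxt is None else count - nxt
    print(user_id, count, nxt, diff)
```

With this input, u_007 leads u_013 by 3 and u_013 leads u_008 by 10, while the last row has no difference to report.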
Full version:
select
user_id,
plant_count,
lead(plant_count,1) over(order by plant_count desc) lead_plant_count
from
(select
user_id,
floor((sum_low_carbon-t2.low_carbon)/t3.low_carbon) plant_count
from
(select
user_id,
sum(low_carbon) sum_low_carbon
from
user_low_carbon
where
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM')<'2017-10'
group by
user_id
order by
sum_low_carbon desc
limit 11)t1,
(select low_carbon from plant_carbon where plant_id='p004')t2,
(select low_carbon from plant_carbon where plant_id='p002')t3)t4
order by
plant_count desc
limit 10;t5
6. Subtract the next row's plant_count to get the lead over the next-ranked user
select
user_id,
plant_count,
(plant_count-lead_plant_count) plant_count_diff
from
t5;
Full version:
select
user_id,
plant_count,
(plant_count-lead_plant_count) plant_count_diff
from
(select
user_id,
plant_count,
lead(plant_count,1) over(order by plant_count desc) lead_plant_count
from
(select
user_id,
floor((sum_low_carbon-t2.low_carbon)/t3.low_carbon) plant_count
from
(select
user_id,
sum(low_carbon) sum_low_carbon
from
user_low_carbon
where
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM')<'2017-10'
group by
user_id
order by
sum_low_carbon desc
limit 11)t1,
(select low_carbon from plant_carbon where plant_id='p004')t2,
(select low_carbon from plant_carbon where plant_id='p002')t3)t4
order by
plant_count desc
limit 10)t5;
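Ant Forest low-carbon user ranking analysis
Problem: query the daily transaction records in user_low_carbon that satisfy the following condition:
in 2017, the user reduced carbon emissions (low_carbon) by more than 100 g per day on three or more consecutive days.
The query must return the matching transaction records from user_low_carbon.
For example, the qualifying records for user u_002 are listed below, because the daily total is at least 100 g on each of the four consecutive days 2017/1/2 through 2017/1/5: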
seq(key) user_id data_dt low_carbon
xxxxx10 u_002 2017/1/2 150
xxxxx11 u_002 2017/1/2 70
xxxxx12 u_002 2017/1/3 30
xxxxx13 u_002 2017/1/3 80
xxxxx14 u_002 2017/1/4 150
xxxxx15 u_002 2017/1/5 101
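The u_002 example can be checked by hand. The short Python sketch below sums that user's sample rows per day and keeps the days that reach 100 g; it only verifies the arithmetic and is not part of the Hive solution.

```python
from collections import defaultdict

# u_002's rows from the user_low_carbon sample data: (data_dt, low_carbon).
u_002 = [
    ("2017/1/1", 10), ("2017/1/2", 150), ("2017/1/2", 70),
    ("2017/1/3", 30), ("2017/1/3", 80), ("2017/1/4", 150),
    ("2017/1/5", 101), ("2017/1/6", 68),
]

# Sum the reductions per day, like "group by user_id, data_dt".
daily_total = defaultdict(int)
for data_dt, low_carbon in u_002:
    daily_total[data_dt] += low_carbon

# Keep the days whose total reaches 100 g, like "having sum(low_carbon) >= 100".
qualifying = [d for d, total in daily_total.items() if total >= 100]
print(dict(daily_total))
print(qualifying)  # ['2017/1/2', '2017/1/3', '2017/1/4', '2017/1/5']
```

The daily totals are 10, 220, 110, 150, 101 and 68, so only 2017/1/1 and 2017/1/6 fall below the threshold.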
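The data was already loaded for the previous question, so the solution proceeds step by step.
1. Keep the 2017 dates on which a user's daily total is at least 100 g (the queries use >= 100, matching the example)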
select
user_id,
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
user_low_carbon
where
substring(data_dt,1,4)='2017'
group by
user_id,data_dt
having
sum(low_carbon)>=100;t1
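Next, determine whether three or more qualifying dates are consecutive. Solution 1 compares each row's date with the two preceding rows, the two following rows, or one row on each side, and checks whether the differences are exactly 1 and 2 days (or -1 and -2).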
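The neighbor comparison can be sketched in Python before turning to the SQL. This is an illustration under the assumption that the dates of one user are already sorted and unique; the function name `in_run_of_three` is hypothetical, and out-of-range neighbors are represented by None where the queries use the default date '1970-01-01'.

```python
from datetime import date
from typing import List, Optional

def in_run_of_three(days: List[date]) -> List[date]:
    """Return the dates that sit in a run of at least 3 consecutive days."""
    def diff(i: int, j: int) -> Optional[int]:
        # Day difference days[i] - days[j]; None when j is out of range.
        if 0 <= j < len(days):
            return (days[i] - days[j]).days
        return None

    kept = []
    for i in range(len(days)):
        lag2, lag1 = diff(i, i - 2), diff(i, i - 1)
        lead1, lead2 = diff(i, i + 1), diff(i, i + 2)
        if ((lag2 == 2 and lag1 == 1)               # preceded by two consecutive days
                or (lag1 == 1 and lead1 == -1)      # one consecutive day on each side
                or (lead1 == -1 and lead2 == -2)):  # followed by two consecutive days
            kept.append(days[i])
    return kept

days = [date(2017, 1, d) for d in (2, 3, 4, 5, 7, 8)]
print(in_run_of_three(days))
```

Here January 2 through 5 are kept, while January 7 and 8 form a run of only two days and are dropped.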
2. Bring the dates of the two preceding rows and the two following rows into the current row
Short form:
select
user_id,
data_dt,
lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2,
lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1,
lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1,
lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2
from
t1;
Full version (with t1 expanded):
select
user_id,
data_dt,
lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2,
lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1,
lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1,
lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2
from
(select
user_id,
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
user_low_carbon
where
substring(data_dt,1,4)='2017'
group by
user_id,data_dt
having
sum(low_carbon)>=100)t1;t2
3. Compute the day difference between the current date and each neighboring date
select
user_id,
data_dt,
datediff(data_dt,lag2) lag2_diff,
datediff(data_dt,lag1) lag1_diff,
datediff(data_dt,lead1) lead1_diff,
datediff(data_dt,lead2) lead2_diff
from
t2;t3
Full version:
select
user_id,
data_dt,
datediff(data_dt,lag2) lag2_diff,
datediff(data_dt,lag1) lag1_diff,
datediff(data_dt,lead1) lead1_diff,
datediff(data_dt,lead2) lead2_diff
from
(select
user_id,
data_dt,
lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2,
lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1,
lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1,
lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2
from
(select
user_id,
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
user_low_carbon
where
substring(data_dt,1,4)='2017'
group by
user_id,data_dt
having
sum(low_carbon)>=100)t1)t2;t3
4. Keep the dates that belong to a run of at least three consecutive days
select
user_id,
data_dt
from
t3
where
(lag2_diff=2 and lag1_diff=1)
or
(lag1_diff=1 and lead1_diff=-1)
or
(lead1_diff=-1 and lead2_diff=-2);t4
Full version:
select
user_id,
data_dt
from
(select
user_id,
data_dt,
datediff(data_dt,lag2) lag2_diff,
datediff(data_dt,lag1) lag1_diff,
datediff(data_dt,lead1) lead1_diff,
datediff(data_dt,lead2) lead2_diff
from
(select
user_id,
data_dt,
lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2,
lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1,
lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1,
lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2
from
(select
user_id,
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
user_low_carbon
where
substring(data_dt,1,4)='2017'
group by
user_id,data_dt
having
sum(low_carbon)>=100)t1)t2)t3
where
(lag2_diff=2 and lag1_diff=1)
or
(lag1_diff=1 and lead1_diff=-1)
or
(lead1_diff=-1 and lead2_diff=-2);t4
5. Convert the dates of the raw table to the same format
select
user_id,
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
low_carbon
from
user_low_carbon;t6
6. Join with the converted raw table
select
t6.user_id,
t6.data_dt,
t6.low_carbon
from
(select
user_id,
data_dt
from
(select
user_id,
data_dt,
datediff(data_dt,lag2) lag2_diff,
datediff(data_dt,lag1) lag1_diff,
datediff(data_dt,lead1) lead1_diff,
datediff(data_dt,lead2) lead2_diff
from
(select
user_id,
data_dt,
lag(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lag2,
lag(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lag1,
lead(data_dt,1,'1970-01-01') over(partition by user_id order by data_dt) lead1,
lead(data_dt,2,'1970-01-01') over(partition by user_id order by data_dt) lead2
from
(select
user_id,
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
user_low_carbon
where
substring(data_dt,1,4)='2017'
group by
user_id,data_dt
having
sum(low_carbon)>=100)t1)t2)t3
where
(lag2_diff=2 and lag1_diff=1)
or
(lag1_diff=1 and lead1_diff=-1)
or
(lead1_diff=-1 and lead2_diff=-2))t4
join
(select
user_id,
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt,
low_carbon
from
user_low_carbon) t6
on
t4.user_id = t6.user_id and t4.data_dt = t6.data_dt;
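Note: do not use a reserved keyword as a table alias. I used user as an alias and the query failed.
Solution 2: use an arithmetic progression. When dates are consecutive, the date and its rank both increase by 1 per row, so the date minus the rank stays constant.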
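data_dt rk data_dt-rk
2017/1/2 1 2017/1/1
2017/1/3 2 2017/1/1
2017/1/4 3 2017/1/1
2017/1/5 4 2017/1/1
2017/1/6 5 2017/1/1
2017/1/8 6 2017/1/2
2017/1/9 7 2017/1/2
Count the values in the last column: a value that occurs three or more times marks a run of consecutive days.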
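The table above can be reproduced with a small Python sketch of the date-minus-rank trick. It assumes the dates of one user are sorted and unique, as they are after the filtering step; the function name `consecutive_groups` is made up for illustration.

```python
from collections import Counter
from datetime import date, timedelta
from typing import List

def consecutive_groups(days: List[date]) -> Counter:
    """Count how many dates share each (date - rank) anchor.

    Consecutive dates map to the same anchor because the date and
    the rank both grow by one per step.
    """
    anchors = Counter()
    for rk, d in enumerate(days, start=1):
        anchors[d - timedelta(days=rk)] += 1
    return anchors

# The dates from the table above.
days = [date(2017, 1, d) for d in (2, 3, 4, 5, 6, 8, 9)]
groups = consecutive_groups(days)
runs = [anchor for anchor, n in groups.items() if n >= 3]
print(groups)
print(runs)  # [datetime.date(2017, 1, 1)]
```

The anchor 2017-01-01 occurs five times and 2017-01-02 only twice, so only the first group counts as a run of three or more days.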
1. Keep the 2017 dates on which a user's daily total is at least 100 g
select
user_id,
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
user_low_carbon
where
substring(data_dt,1,4)='2017'
group by
user_id,data_dt
having
sum(low_carbon)>=100;t1
2. Sort by date and assign each row a rank
Short form:
select
user_id,
data_dt,
rank() over(partition by user_id order by data_dt) rk
from
t1;t2
Full version:
select
user_id,
data_dt,
rank() over(partition by user_id order by data_dt) rk
from
(select
user_id,
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
user_low_carbon
where
substring(data_dt,1,4)='2017'
group by
user_id,data_dt
having
sum(low_carbon)>=100)t1;t2
3. Subtract the rank from the date
select
user_id,
data_dt,
date_sub(data_dt,rk) data_sub_rk
from
(select
user_id,
data_dt,
rank() over(partition by user_id order by data_dt) rk
from
(select
user_id,
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
user_low_carbon
where
substring(data_dt,1,4)='2017'
group by
user_id,data_dt
having
sum(low_carbon)>=100)t1)t2;t3
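4. Keep the users who reached 100 g on at least three consecutive days
Note: this query returns only user_id, and a user with several qualifying runs appears more than once. To return the transaction records the problem asks for, join the qualifying dates back to the raw table as in step 6 of solution 1.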
select
user_id
from
(select
user_id,
data_dt,
date_sub(data_dt,rk) data_sub_rk
from
(select
user_id,
data_dt,
rank() over(partition by user_id order by data_dt) rk
from
(select
user_id,
date_format(regexp_replace(data_dt,'/','-'),'yyyy-MM-dd') data_dt
from
user_low_carbon
where
substring(data_dt,1,4)='2017'
group by
user_id,data_dt
having
sum(low_carbon)>=100)t1)t2)t3
group by
user_id,data_sub_rk
having
count(*)>=3;