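Contents
Category 1: Computing cumulative values
Category 2: Pivoting rows into columns with case ... when ... [collect_list and collect_set]
Category 3: Expanding one row into many rows (Lateral View with the UDTF explode, combined with split)
Category 4: Extracting fields with substr(..., ..., ...)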
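Problem 1: Given the data below, compute for each user and each month: the highest single-month visit count up to that month, the cumulative visit count up to that month, and the visit count for that month.
Data:
1. Prepare the data:
[hdp@hdp02 demo]$ vim demo04.txt //write the data into demo04.txt
hive (demodb01)> create table demo01(name string, dates string, Hz int) row format delimited fields terminated by ","; //create the table
hive (demodb01)> load data local inpath "/home/hdp/demo/demo04.txt" into table demo01; //load the data
hive (demodb01)> select * from demo01; //verify with a query
2. Requirements analysis: group by name and date to get the visit total for each month, then for each user compute the highest single-month count and the cumulative total up to each month. (A running, multi-condition accumulation like this calls for either a self-join with a filter condition or window functions.)
3. Run the queries and get the result:
Method 1: the traditional SQL approach:
(1) Group by name and date, compute the visit total for each month, and save it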
create table demo01_1 as select name,dates, sum(hz) as s_hz
from demo01 group by name,dates;
(2) Self-join the table and save the result as a view
create view demo01_2view as
select t1.name aname,t1.dates adates,t1.s_hz ashz,t2.name bname,t2.dates bdates,t2.s_hz bshz
from demo01_1 t1 join demo01_1 t2 on t1.name=t2.name;
(3) Filter and aggregate, then output the result
select t.aname name,t.adates dates,t.ashz hz,max(bshz) sm_hz,sum(bshz) s_hz
from demo01_2view t
where t.adates>=t.bdates
group by t.aname,t.adates,t.ashz;
name dates hz sm_hz s_hz
A 2015-01 33 33 33
A 2015-02 10 33 43
A 2015-03 38 38 81
B 2015-01 30 30 30
B 2015-02 15 30 45
B 2015-03 44 44 89
Method 2: using Hive window functions
-- Aggregate to one row per user and month first, then apply the windows;
-- a window function cannot be nested inside another window function.
select a.name,a.dates,a.hz,
max(a.hz) over(partition by a.name order by a.dates) sm_hz,
sum(a.hz) over(partition by a.name order by a.dates) s_hz
from
(select name,substr(dates,1,7) dates,sum(hz) hz
from demo01 group by name,substr(dates,1,7)) a;
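To make the window semantics concrete outside Hive, here is a minimal plain-Python sketch of what a running max and a running sum per user compute. It assumes the monthly totals are already aggregated (the rows are copied from the Method 1 result table above); the function name `running_stats` is made up for this illustration.

```python
from itertools import accumulate, groupby

# Monthly totals per user, copied from the Method 1 result table above.
monthly = [
    ("A", "2015-01", 33), ("A", "2015-02", 10), ("A", "2015-03", 38),
    ("B", "2015-01", 30), ("B", "2015-02", 15), ("B", "2015-03", 44),
]

def running_stats(rows):
    """Return (name, month, hz, running max, running sum) for every row."""
    out = []
    # Sorting by (name, month) plays the role of PARTITION BY name ORDER BY month.
    for name, group in groupby(sorted(rows), key=lambda r: r[0]):
        group = list(group)
        hz = [r[2] for r in group]
        # accumulate(hz, max) is the running max; accumulate(hz) is the running sum.
        for r, mx, total in zip(group, accumulate(hz, max), accumulate(hz)):
            out.append((name, r[1], r[2], mx, total))
    return out

result = running_stats(monthly)
for row in result:
    print(*row)
```

Each partition is scanned once, which is the main practical difference from the self-join of Method 1.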
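Problem 2: Given the data below, write Hive HQL that computes each store's sales for the month and its cumulative sales up to that month
Data:
1. Prepare the data
[hdp@hdp02 demo]$ vim demo05.txt //write the data into demo05.txt
hive> create table demo02 (Store string,Smonth string ,amount decimal(8,2)) row format delimited fields terminated by ','; //create the table
hive> load data local inpath '/home/hdp/demo/demo05.txt' into table demo02; //load the data
hive> select * from demo02; //verify with a query
2. Requirements analysis: compute each store's sales for each month and its cumulative sales up to that month
3. Run the queries and get the result:
Method 1:
(1) Compute each store's monthly sales and save it as a view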
hive> create view demo02_1view as
select store,smonth,sum(amount) as s_amount from demo02 group by store,smonth;
(2) Self-join the view and save the result as a view
hive> create view demo02_2view as
select a.store astore,a.smonth asmonth,a.s_amount as_amount,
b.store bstore,b.smonth bsmonth,b.s_amount bs_amount
from demo02_1view a inner join demo02_1view b on a.store=b.store;
(3) Filter and aggregate, then output the result
hive> select astore,asmonth,as_amount,sum(bs_amount) as s_amount
from demo02_2view
where asmonth>=bsmonth
group by astore,asmonth,as_amount;
astore asmonth as_amount s_amount
a 01 350.00 350.00
a 02 5000.00 5350.00
a 03 600.00 5950.00
b 01 7800.00 7800.00
b 02 2500.00 10300.00
c 01 470.00 470.00
c 02 630.00 1100.00
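The self-join in Method 1 can be pictured as pairing every month of a store with each month that is not later, then summing the paired amounts. The plain-Python sketch below mimics that join, filter, and group-by; it assumes the monthly amounts from the as_amount column above as input, and `cumulative_by_self_join` is a name invented for this illustration.

```python
from collections import defaultdict
from decimal import Decimal

# Monthly sales per store, copied from the as_amount column of the result above.
monthly = [
    ("a", "01", Decimal("350.00")), ("a", "02", Decimal("5000.00")),
    ("a", "03", Decimal("600.00")), ("b", "01", Decimal("7800.00")),
    ("b", "02", Decimal("2500.00")), ("c", "01", Decimal("470.00")),
    ("c", "02", Decimal("630.00")),
]

def cumulative_by_self_join(rows):
    """Emulate: join on store, keep pairs with a.month >= b.month, sum b.amount."""
    totals = defaultdict(Decimal)  # Decimal() starts every group at 0
    for a_store, a_month, a_amount in rows:
        for b_store, b_month, b_amount in rows:
            if a_store == b_store and a_month >= b_month:
                totals[(a_store, a_month, a_amount)] += b_amount
    return sorted((s, m, amt, total) for (s, m, amt), total in totals.items())

result = cumulative_by_self_join(monthly)
for row in result:
    print(*row)
```

Because every month is paired with all earlier months of the same store, the work grows quadratically with the number of months per store.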
Method 2: using Hive window functions
select a.store,a.smonth,max(a.s1) s1,max(a.s2) s2 from
(select store,smonth,
sum(amount) over(partition by store,smonth) s1,
sum(amount) over(partition by store order by smonth asc) s2
from demo02) a group by a.store,a.smonth;
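Hive's collect functions are collect_list and collect_set. (They gather many rows into a single value.)
Both gather the values of one column within each group into an array and return it; the difference is that collect_list keeps duplicates while collect_set removes them.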
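The difference between the two functions is easiest to see on a group that contains a duplicate. The sketch below is plain Python, not Hive: the sample rows are hypothetical, and the two helper functions only borrow the Hive names to show the grouping behaviour. Hive does not guarantee the element order of the returned array, whereas this sketch simply keeps input order.

```python
from collections import defaultdict

# Hypothetical (id, course) rows; the repeated values are made up for illustration.
rows = [(1, "a"), (1, "b"), (1, "a"), (2, "c"), (2, "c"), (2, "d")]

def collect_list(rows):
    """Group values by key and keep every value, duplicates included."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)

def collect_set(rows):
    """Group values by key and keep each distinct value once."""
    groups = defaultdict(list)
    for key, value in rows:
        if value not in groups[key]:
            groups[key].append(value)
    return dict(groups)

as_list = collect_list(rows)
as_set = collect_set(rows)
print(as_list)
print(as_set)
```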
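Problem 1: Given data in the following format, write Hive HQL that produces the result below:
The data records which of the courses a, b, c, d, e, f the students with id 1, 2, 3 have taken:
Data: columns id, course
1. Prepare the data
[hdp@hdp02 demo]$ vi demo08.txt //write the data into demo08.txt
hive> create table demo05(id int, course string comment 'course')
row format delimited fields terminated by ','; //create the Hive table
hive> load data local inpath '/home/hdp/demo/demo08.txt' into table demo05; //load the data
hive> select * from demo05; //check the data
2. Requirements analysis: pivot rows into columns with case ... when ... [collect_list and collect_set gather a group's rows into an array]
3. Run the queries and show the results
Method 1: gather each student's courses with collect_list() and join them into one string with concat_ws(): concat_ws(',',collect_list(course))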
select a.id,
(case when a.c like '%a%' then 1 else 0 end)a ,
(case when a.c like '%b%' then 1 else 0 end)b ,
(case when a.c like '%c%' then 1 else 0 end)c ,
(case when a.c like '%d%' then 1 else 0 end)d,
(case when a.c like '%e%' then 1 else 0 end)e ,
(case when a.c like '%f%' then 1 else 0 end)f
from (select id,concat_ws(',',collect_list(course)) c from demo05 group by id) a;
a.id a b c d e f
1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0
Method 2: using the aggregate functions sum() or max()
select id,
sum(case when course like '%a%' then 1 else 0 end) a,
sum(case when course like '%b%' then 1 else 0 end) b,
sum(case when course like '%c%' then 1 else 0 end) c,
sum(case when course like '%d%' then 1 else 0 end) d,
sum(case when course like '%e%' then 1 else 0 end) e,
sum(case when course like '%f%' then 1 else 0 end) f
from demo05 group by id;
id a b c d e f
1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0
Method 3: using the if(...) function
select id,
max(if(course like '%a%',1,0)) a,
max(if(course like '%b%',1,0)) b,
max(if(course like '%c%',1,0)) c,
max(if(course like '%d%',1,0)) d,
max(if(course like '%e%',1,0)) e,
max(if(course like '%f%',1,0)) f
from demo05 group by id;
id a b c d e f
1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0
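Methods 2 and 3 both build one 0/1 indicator per course and then aggregate it per student. The plain-Python sketch below shows that idea and why the choice between sum and max matters when a row is duplicated. It assumes every course value is a single letter (so equality stands in for the `like` match) and reconstructs the input rows from the result table above, since the raw listing is not shown; `pivot` is a hypothetical helper name.

```python
# (id, course) rows reconstructed from the result table above.
rows = [(1, c) for c in "abce"] + [(2, c) for c in "acdf"] + [(3, c) for c in "abce"]
COURSES = "abcdef"

def pivot(rows, agg=max):
    """For each id, aggregate a 0/1 indicator per course over that id's rows."""
    ids = sorted({sid for sid, _ in rows})
    return {
        sid: [agg([1 if course == col else 0 for s, course in rows if s == sid])
              for col in COURSES]
        for sid in ids
    }

table = pivot(rows)
for sid, flags in table.items():
    print(sid, *flags)

# With a duplicated row, sum counts it twice while max still yields 0/1.
with_dup = rows + [(1, "a")]
print(pivot(with_dup, agg=sum)[1][0], pivot(with_dup, agg=max)[1][0])
```

On clean data both aggregates give the same table; max is the safer choice if the same (id, course) pair can appear more than once.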
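Problem 2: Given the data below, find the student IDs of all students whose math (shuxue) score is higher than their Chinese (yuwen) score
Data:
1. Prepare the data
[hdp@hdp02 demo]$ vim demo06.txt //write the data into demo06.txt
hive> load data local inpath '/home/hdp/demo/demo06.txt' into table demo03; //load the data
hive> select * from demo03; //check the data
2. Requirements analysis: find the student IDs of all students whose math score is higher than their Chinese score
3. Run the queries and get the result:
(1) Use case ... when ... to turn each course name into its own column, then output the result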
select * from
(select sid,
sum(case course when "yuwen" then score else 0 end) as yuwen,
sum(case course when "shuxue" then score else 0 end) as shuxue,
sum(case course when "yingyu" then score else 0 end) as yingyu
from demo03 group by sid) t1
where t1.shuxue > t1.yuwen;
t1.sid t1.yuwen t1.shuxue t1.yingyu
1 43 55 0
2 77 88 0
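A lateral view first uses a UDTF to split one row into multiple rows, then combines those rows into a virtual table that can be given an alias.
It mainly works around the restrictions on calling a UDTF directly in select: such a query can contain only a single UDTF, and no other columns or additional UDTFs.
Syntax:
LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*; -- virtual view
- lateral view is written before the UDTF and joins each source row to the rows the UDTF produces from it.
- udtf(expression): the UDTF to apply, for example explode().
- tableAlias: the name of the virtual table produced by the UDTF.
- columnAlias: the names of the virtual table's columns. If the UDTF yields one column, give one name; if it yields several, list all the names in column order, separated by commas.
select explode(split("a-d-e", "-")); -- one output row per element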
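As a rough mental model, explode turns an array into one row per element, and a lateral view repeats the source row next to each of those elements. The plain-Python sketch below illustrates both; the helper names and the two sample rows are assumptions for this illustration, and Python's `str.split` takes a literal separator whereas Hive's split takes a regular expression (the two agree for "-").

```python
def explode(array):
    """Yield one single-column row per array element."""
    for element in array:
        yield (element,)

def lateral_view_explode(rows, make_array):
    """Pair every source row with each element produced from it."""
    for row in rows:
        for (element,) in explode(make_array(row)):
            yield row + (element,)

# Counterpart of: select explode(split("a-d-e", "-"));
standalone = list(explode("a-d-e".split("-")))
print(standalone)

# Two sample rows of (id, name, favors), expanded to one row per favor.
people = [(4, "liushishi", "a-d-e"), (3, "huanglei", "c-d-e")]
joined = list(lateral_view_explode(people, lambda r: r[2].split("-")))
for row in joined:
    print(*row)
```

A source row whose array is empty produces no output rows in this model, which matches a plain lateral view; Hive offers LATERAL VIEW OUTER to keep such rows.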
Problem 1: For each hobby, find the two oldest people (hobby, age, name)
Data:
hive (demodb01)> select * from demo06;
OK
id name age favors
1 huangxiaoming 45 a-c-d-f
2 huangzitao 36 b-c-d-e
3 huanglei 41 c-d-e
4 liushishi 22 a-d-e
5 liudehua 39 e-f-d
6 liuyifei 35 a-d-e
Steps:
① Expand each row into one row per hobby
select a.id as id,a.name as name,a.age as age,favor_view.favor
from demo06 a
LATERAL VIEW explode(split(a.favors, "-")) favor_view as favor;
② For each hobby, list the people in descending order of age (hobby, age, name, rank)
select b.favor, b.age ,b.name,
row_number() over(partition by b.favor order by b.age desc) rank
from
(select a.id id,a.name name, a.age age, favor_view.favor favor
from demo06 a
lateral view explode(split(a.favors,"-")) favor_view as favor) b;
③ Final query: the two oldest people for each hobby (hobby, age, name)
select * from
( select b.favor favor, b.age age ,b.name name,
row_number() over(partition by b.favor order by b.age desc) rank
from
( select a.id id,a.name name, a.age age, favor_view.favor favor
from demo06 a
lateral view explode(split(a.favors,"-")) favor_view as favor) b) c
where c.rank<=2;
c.favor c.age c.name c.rank
a 45 huangxiaoming 1
a 35 liuyifei 2
b 36 huangzitao 1
c 45 huangxiaoming 1
c 41 huanglei 2
d 45 huangxiaoming 1
d 41 huanglei 2
e 41 huanglei 1
e 39 liudehua 2
f 45 huangxiaoming 1
f 39 liudehua 2
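The three steps (expand, rank within each hobby, keep rank <= 2) can be checked against the demo06 rows with a short plain-Python sketch. This is only an illustration of the logic, not how Hive executes it, and `top_n_per_favor` is a made-up name.

```python
from collections import defaultdict

# The demo06 rows shown above: (id, name, age, favors).
people = [
    (1, "huangxiaoming", 45, "a-c-d-f"),
    (2, "huangzitao", 36, "b-c-d-e"),
    (3, "huanglei", 41, "c-d-e"),
    (4, "liushishi", 22, "a-d-e"),
    (5, "liudehua", 39, "e-f-d"),
    (6, "liuyifei", 35, "a-d-e"),
]

def top_n_per_favor(rows, n=2):
    """Expand favors, rank people by age within each favor, keep the top n."""
    by_favor = defaultdict(list)
    for _id, name, age, favors in rows:
        for favor in favors.split("-"):  # lateral view explode(split(...))
            by_favor[favor].append((age, name))
    result = []
    for favor in sorted(by_favor):
        # row_number() over(partition by favor order by age desc)
        ranked = sorted(by_favor[favor], key=lambda p: p[0], reverse=True)
        for rank, (age, name) in enumerate(ranked[:n], start=1):
            result.append((favor, age, name, rank))
    return result

result = top_n_per_favor(people)
for row in result:
    print(*row)
```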
Problem 1: For each year, find the day with the highest temperature (date, highest temperature)
Data: 2010012325 means the temperature on 2010-01-23 was 25 degrees.
2014010114
2001010212
2008010516
2007010619
2007010712
2007010812
2007010999
2007011023
2010010114
2010010216
2010010317
2010010410
2010010506
2015010649
2015010722
1. Prepare the data
[hdp@hdp02 demo]$ vi demo07.txt //write the data into the file
hive (demodb01)> create table demo04(data string) row format delimited fields terminated by "," lines terminated by '\n'; //create the table
hive (demodb01)> load data local inpath "/home/hdp/demo/demo07.txt" into table demo04; //load the data
hive (demodb01)> select * from demo04; //check the data
2. Requirements analysis: for each year, find the day with the highest temperature (date, highest temperature)
3. Run the queries and get the result:
create view demo04_1view as
select substr(data,1,8) t,
substr(data,1,4) y,
substr(data,5,2) m,
substr(data,7,2) d,
cast(substr(data,9) as int) as temperature -- cast so max() compares numbers, not strings
from demo04;
create view demo04_2view as
select a.y y,max(a.temperature) temperature from demo04_1view a
group by a.y;
select a.* from demo04_1view a join demo04_2view b
on a.y=b.y and a.temperature=b.temperature;
a.t a.y a.m a.d a.temperature
20010105 2001 01 05 29
20070109 2007 01 09 99
20080103 2008 01 03 37
20100103 2010 01 03 17
20120107 2012 01 07 32
20130109 2013 01 09 29
20140103 2014 01 03 17
20150109 2015 01 09 99
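The fixed-width slicing and the per-year maximum can be tried on the sample readings listed in the problem statement. The plain-Python sketch below parses the temperature as an integer (an assumption of this sketch, so that the comparison is numeric rather than lexical) and keeps only the first day that reaches the maximum, whereas the join above returns every tied day. The listed sample is only an excerpt, so its maxima differ from the full result table; `hottest_day_per_year` is a made-up name.

```python
# Sample readings from the problem statement: yyyyMMdd followed by the temperature.
readings = [
    "2014010114", "2001010212", "2008010516", "2007010619", "2007010712",
    "2007010812", "2007010999", "2007011023", "2010010114", "2010010216",
    "2010010317", "2010010410", "2010010506", "2015010649", "2015010722",
]

def hottest_day_per_year(lines):
    """Return {year: (yyyyMMdd, temperature)} for the hottest day of each year."""
    best = {}
    for line in lines:
        # substr(data,1,8), substr(data,1,4) and substr(data,9) in 0-based slices.
        day, year, temp = line[:8], line[:4], int(line[8:])
        if year not in best or temp > best[year][1]:
            best[year] = (day, temp)
    return dict(sorted(best.items()))

hottest = hottest_day_per_year(readings)
for year, (day, temp) in hottest.items():
    print(year, day, temp)

# Why the numeric parse matters: as strings, "9" sorts after "10".
print(max(["9", "10"]), max([9, 10]))
```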