1.现有这么一批数据,现要求出:每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数
三个字段的意思:
用户名uid,月份month,访问次数count
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,11
最后结果展示:
用户 月份 最大访问次数 总访问次数 当月访问次数
A 2015-01 33 33 33
A 2015-02 33 43 10
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 30 45 15
B 2015-03 44 89 44
建表与导入数据
create table if not exists interview01(userId string,month string,count int) row format delimited fields terminated by ",";
load data local inpath "/home/hadoop/hive_data/interview01" into table interview01;
辅助SQL:
先按userid和month分组,求出总次数counts
select userid,month,sum(count)counts
from interview01
group by userid,month
最终SQL:
使用窗口函数求截止到当月的最大访问次数,直接从上表里面查
select userid,month,
max(counts) over(distribute by userid sort by month)maxcount,
sum(counts) over(distribute by userid sort by month)sumcount,counts
from
(select userid,month,sum(count)counts
from interview01
group by userid,month)tmp;
2.求:所有数学课程成绩 大于 语文课程成绩的学生的学号
// 建表语句:
CREATE TABLE `course` (
`id` int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`sid` int(11) DEFAULT NULL,
`course` varchar(255) DEFAULT NULL,
`score` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
// 插入数据
// 字段解释:id, 学号, 课程, 成绩
INSERT INTO `course` VALUES (1, 1, 'yuwen', 43);
INSERT INTO `course` VALUES (2, 1, 'shuxue', 55);
INSERT INTO `course` VALUES (3, 2, 'yuwen', 77);
INSERT INTO `course` VALUES (4, 2, 'shuxue', 88);
INSERT INTO `course` VALUES (5, 3, 'yuwen', 98);
INSERT INTO `course` VALUES (6, 3, 'shuxue', 65);
辅助SQL:
这题可以看做为一到mysql题,因为每个字段都只有一个成绩,所以这里我们要对没有的字段进行补0操作
select
sid,case course
when "yuwen" then score else 0 end yuwen,
case course when "shuxue" then score else 0 end shuxue
from course
然后我们从上表中操作语文和数学的最大值,然后作为一个表,直接进行查询
select
sid,shuxue,yuwen
from
(select
sid,sum(yuwen) yuwen,max(shuxue) shuxue
from
(select
sid,case course
when "yuwen" then score else 0 end yuwen,
case course when "shuxue" then score else 0 end shuxue
from course) a
group by sid) b where b.shuxue>b.yuwen;
结果
+------+--------+-------+
| sid | shuxue | yuwen |
+------+--------+-------+
| 1 | 55 | 43 |
| 2 | 88 | 77 |
+------+--------+-------+
3.比如:2010012325表示在2010年01月23日的气温为25度。现在要求使用hive,
计算每一年出现过的最大气温的日期+温度。
2014010114
2014010216
2014010317
2014010410
2014010506
2012010609
2012010732
2012010812
2012010919
2012011023
2001010116
2001010212
2001010310
2001010411
2001010529
2013010619
2013010722
2013010812
2013010929
2013011023
2008010105
2008010216
2008010337
2008010414
2008010516
2007010619
2007010712
2007010812
2007010999
2007011023
2010010114
2010010216
2010010317
2010010410
2010010506
2015010649
2015010722
2015010812
2015010999
2015011023
建表与导入数据
create table if not exists interview03(data string);
load data local inpath "/home/hadoop/hive_data/interview03" into table interview03;
辅助SQL:
先求每一年的最大温度
select
substr(data,1,4) year,max(cast(substr(data,-2,2) as int))
max_temp
from interview03
group by substr(data,1,4)
然后将上表与原表连接,连接条件是年份与气温符合的,然后再找出那天是哪一天
select
a.year year,a.max_temp max_temp,substr(b.data,1,8) day
from
(select
substr(data,1,4) year,max(cast(substr(data,-2,2) as int))
max_temp
from interview03
group by substr(data,1,4)
)a join
interview03 b on
a.year=substr(b.data,1,4) and a.max_temp=cast(substr(b.data,-2,2) as int);
结果
year max_temp day
2001 29 20010105
2007 99 20070109
2008 37 20080103
2010 17 20100103
2012 32 20120107
2013 29 20130109
2014 17 20140103
2015 99 20150109
4.现有一份以下格式的数据:表示有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门
id course
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e
编写Hive的HQL语句来实现以下结果:
表中的1表示选修,表中的0表示未选修
id a b c d e f
1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0
建表与导入数据
create table if not exists interview04(id int,course string) row format delimited fields terminated by ",";
load data local inpath "/home/hadoop/hive_data/interview04" into table interview04;
首先得到所有备选课程
select collect_set(trim(course))courses from interview04;
然后求每一个学员选修的课程
select
id,collect_set(trim(course)) id_course
from interview04
group by id;
这里要使用笛卡尔积关联(a表中的每一天数据关联b表所有数据),从原表中查询course字段放入set集合中,然后与上表关联,得到已选课程与备选课程
首先修改笛卡尔积参数
set hive.strict.checks.cartesian.product=false;
set hive.mapred.mode=nonstrict;
辅助SQL:
select
b.id id,b.id_course id_course,a.courses courses
from
(select collect_set(trim(course))courses from interview04
) a join
(select
id,collect_set(trim(course)) id_course
from interview04
group by id
) b;
id id_course courses
1 ["a","b","c","e"] ["a","b","c","e","d","f"]
2 ["a","c","d","f"] ["a","b","c","e","d","f"]
3 ["a","b","c","e"] ["a","b","c","e","d","f"]
最终SQL:
判断a表与b表中元素的包含关系即可
select
id,
if(array_contains(id_course,courses[0]),1,0) a,
if(array_contains(id_course,courses[1]),1,0) b,
if(array_contains(id_course,courses[2]),1,0) c,
if(array_contains(id_course,courses[3]),1,0) e,
if(array_contains(id_course,courses[4]),1,0) d,
if(array_contains(id_course,courses[5]),1,0) f
from
(select
b.id id,b.id_course id_course,a.courses courses
from
(select collect_set(trim(course))courses from interview04
) a join
(select
id,collect_set(trim(course)) id_course
from interview04
group by id
) b
) c;
最终结果
id a b c e d f
1 1 1 1 1 0 0
2 1 0 1 0 1 1
3 1 1 1 1 0 0
5.现有如下格式的一份数据,需求:编写Hive的HQL语句求出每个店铺的当月销售额和累计到当月的总销售额
uid month sale
a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250
建表与导入数据
create table if not exists interview05(uid string,month string,sale int) row format delimited fields terminated by ",";
load data local inpath "/home/hadoop/hive_data/interview05" into table interview05;
最终SQL:
与第一题类似,先求总销售额,再进行开窗算出累计到当月的总销售额
select uid,month,
sum(sales) over(distribute by uid sort by month)sumsales,sales
from
(select uid,month,sum(sale)sales
from interview05
group by uid,month)tmp;
结果:
uid month sumsales sales
a 01 350 350
a 02 5350 5000
a 03 5950 600
b 01 7800 7800
b 02 10300 2500
c 01 470 470
c 02 1100 630