This chapter reposts a set of Hive SQL practice exercises. Working through them is good practice for writing business queries.
The resources for this article are available in my GitHub project: https://github.com/SeanYanxml/bigdata/. PS: if you find the project useful, feel free to give it a star.
Student data (columns: Sno, Sname, Sex, Sage, Sdept):
95001,李勇,男,20,CS
95002,刘晨,女,19,IS
95003,王敏,女,22,MA
95004,张立,男,19,IS
95005,刘刚,男,18,MA
95006,孙庆,男,23,CS
Course data (course.txt; columns: Cno, Cname):
1,数据库
2,数学
3,信息系统
4,操作系统
5,数据结构
6,数据处理
Grade data (sc.txt; columns: Sno, Cno, Grade):
95001,1,81
95001,2,85
95001,3,88
95001,4,70
95002,2,90
-- Create the course table
create table course (Cno int,Cname string) row format delimited fields terminated by ',' stored as textfile;
load data local inpath '/Users/Sean/Desktop/course.txt' into table course;
-- Create the student table
create table student (Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',' stored as textfile;
load data local inpath '/Users/Sean/Desktop/student' into table student;
-- Create the grade table (sc)
create table sc (Sno int,Cno int,Grade int) row format delimited fields terminated by ',' stored as textfile;
load data local inpath '/Users/Sean/Desktop/sc.txt' into table sc;
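Query the student number and name of every student:
hive> select Sno,Sname from student;
Query the names of students who are enrolled in at least one course:
hive> select distinct Sname from student inner join sc on student.Sno=sc.Sno;
Query the total number of students:
hive> select count(distinct Sno) as student_count from student;
Compute the average grade for course 1:
hive> select avg(Grade) from sc where Cno=1;
Compute the average grade for each course:
hive> select Cno,avg(Grade) from sc group by Cno;
Query the highest grade for course 1:
hive> select Grade from sc where Cno=1 sort by Grade desc limit 1;
Count the number of enrollments for each course:
hive> select Cno,count(1) from sc group by Cno;
Query the student numbers of students enrolled in more than 3 courses, first with a subquery and then with the equivalent HAVING form:
hive> select Sno from (select Sno,count(Cno) CountCno from sc group by Sno)a where a.CountCno>3;
hive> select Sno from sc group by Sno having count(Cno)>3;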
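Order By: in strict mode (hive.mapred.mode=strict), an order by clause must be followed by a limit clause; in non-strict mode this is not required. The reason is that a single reducer has to sort the final result, and if the output contains too many rows, that one reducer takes a very long time.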
hive> set hive.mapred.mode=strict;
hive> select Sno from student order by Sno;
FAILED: Error in semantic analysis: 1:33 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'Sno'
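The failure above can be avoided in two ways. The following is a minimal sketch against the student table from this article; the limit value of 10 is an arbitrary assumption chosen for illustration.

```sql
-- Option 1: stay in strict mode and bound the result with LIMIT.
-- The value 10 is an illustrative assumption, not a recommendation.
set hive.mapred.mode=strict;
select Sno from student order by Sno limit 10;

-- Option 2: switch back to non-strict mode when a full global sort
-- is really needed and the result is known to be small.
set hive.mapred.mode=nonstrict;
select Sno from student order by Sno;
```

Option 1 keeps the safety check in place, which is usually the better choice on large tables because the single sorting reducer only has to emit a bounded number of rows.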
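Sort By: sorting usually happens inside each reducer. The difference between "order by" and "sort by" is that the former guarantees a totally ordered output, while the latter, when there is more than one reducer, only guarantees that the output of each reducer is sorted. When using sort by, the number of reducers can be set with set mapred.reduce.tasks=<number>. If sort by is used without a distribution column, rows are assigned to reducers at random.
Distribute By: partitions the rows across reducers according to the specified columns.
The following statement sends rows to different reducers by Sex, sorts them by Sage within each reducer, and writes the result to separate output files.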
hive> set mapred.reduce.tasks=2;
hive> insert overwrite local directory '/home/hadoop/out'
select * from student distribute by Sex sort by Sage;
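To make the difference between order by, sort by, and distribute by concrete, here is a small sketch against the same student table. It assumes two reducers and non-strict mode; which rows land on which reducer in the sort by case is not something the query controls.

```sql
-- Assumptions: non-strict mode and two reducers, as in the example above.
set hive.mapred.mode=nonstrict;
set mapred.reduce.tasks=2;

-- Global order: a single reducer produces one fully sorted result.
select Sno, Sname, Sage from student order by Sage;

-- Per-reducer order: each reducer's output is sorted by Sage,
-- but the combined output is not necessarily sorted.
select Sno, Sname, Sage from student sort by Sage;

-- Rows with the same Sdept go to the same reducer,
-- and each reducer sorts its rows by Sage.
select Sno, Sname, Sdept, Sage from student distribute by Sdept sort by Sage;
```

With only six sample rows the three results may look alike; the distinction matters once the data is large enough to be spread over several reducers.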
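Query every student together with their enrollment records:
hive> select student.*,sc.* from student join sc on (student.Sno=sc.Sno);
Query each student's name, course name, and grade:
hive> select student.Sname,course.Cname,sc.Grade from student join sc on student.Sno=sc.Sno join course on sc.Cno=course.Cno;
Query the students who took course 2 with a grade above 90:
hive> select student.Sname,sc.Grade from student join sc on student.Sno=sc.Sno
where sc.Cno=2 and sc.Grade>90;
Query every student and the courses they took, keeping students with no enrollment (left outer join):
hive> select student.Sname,sc.Cno from student left outer join sc on student.Sno=sc.Sno;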
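If a student's Sno has no matching row in sc, the output row contains student.Sname with NULL for sc.Cno. With a right outer join, the rows of the right-hand table are kept and the left-hand columns are NULL.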
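The NULL behavior of outer joins can be put to use directly. The sketch below, written against the sample tables in this article, lists students that have no row in sc, and then shows the same "keep every student" join expressed as a right outer join with the table order swapped.

```sql
-- Students with no enrollment: the left outer join keeps them,
-- and their sc columns are NULL, so filter on that.
select student.Sno, student.Sname
from student left outer join sc on student.Sno = sc.Sno
where sc.Sno is null;

-- Same preserved side as the left outer join in the text,
-- written as a right outer join with student on the right.
select student.Sname, sc.Cno
from sc right outer join student on student.Sno = sc.Sno;
```

With the sample rows listed in this article, only students 95001 and 95002 appear in sc, so the first query is expected to return the remaining four students.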
Joins are evaluated before the WHERE clause. To restrict the output of a join, put the filter condition in the WHERE clause, or put it in the JOIN clause.
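For inner joins the two placements give the same rows, but for outer joins they do not. The following sketch illustrates the difference using the student and sc tables from this article, with course 2 as an arbitrary example filter.

```sql
-- Filter in WHERE: applied after the join. Students without a course-2 row
-- have sc.Cno = NULL, the comparison is not true, and they are dropped.
select student.Sname, sc.Grade
from student left outer join sc on student.Sno = sc.Sno
where sc.Cno = 2;

-- Filter in ON: applied while matching rows. Every student is kept,
-- and students without a course-2 row get NULL for Grade.
select student.Sname, sc.Grade
from student left outer join sc on (student.Sno = sc.Sno and sc.Cno = 2);
```

The first form effectively turns the outer join back into an inner join, which is a common source of unexpectedly missing rows.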
LEFT SEMI JOIN: Hive versions before 0.13 do not implement IN/EXISTS subqueries; such subqueries can be rewritten with LEFT SEMI JOIN.
Rewrite the following subquery as a LEFT SEMI JOIN:
SELECT a.key, a.value
FROM a
WHERE a.key IN
(SELECT b.key
FROM b);
can be rewritten as:
SELECT a.key, a.value
FROM a LEFT SEMI JOIN b ON (a.key = b.key);
Query the names of the students in the same department as 刘晨:
hive> select s1.Sname from student s1 left semi join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
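For comparison, the same question can be written as an IN subquery. This is a sketch that assumes Hive 0.13 or later, where IN subqueries in the WHERE clause are supported; on older versions only the LEFT SEMI JOIN form applies.

```sql
-- Assumption: Hive 0.13+ (IN subqueries in WHERE are available).
-- Returns the names of students whose department matches 刘晨's department.
select s1.Sname
from student s1
where s1.Sdept in (select s2.Sdept from student s2 where s2.Sname = '刘晨');
```

One restriction to keep in mind for the LEFT SEMI JOIN form is that the right-hand table may only be referenced in the ON clause, not in the SELECT list or the WHERE clause, which is why the name filter on s2 sits inside ON.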
The "not equal" pitfall in Hive WHERE clauses
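One well-known pitfall with "not equal" filters, which may be the one this heading refers to, is NULL handling: a comparison such as Sdept != 'CS' evaluates to NULL rather than true when Sdept is NULL, so those rows are silently filtered out. The sketch below assumes, hypothetically, that some student rows have a NULL Sdept; the sample data in this article does not contain any.

```sql
-- Rows where Sdept is NULL are NOT returned:
-- NULL != 'CS' evaluates to NULL, and WHERE keeps only rows that are true.
select Sno, Sname from student where Sdept != 'CS';

-- Keep the NULL rows explicitly.
select Sno, Sname from student where Sdept != 'CS' or Sdept is null;

-- Alternative using the null-safe equality operator <=>,
-- which returns false (not NULL) when exactly one side is NULL.
select Sno, Sname from student where not (Sdept <=> 'CS');
```

The second and third queries return the same rows; which one reads better is a matter of taste.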
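[1]. Hive的HQL语句及数据倾斜解决方案 (Hive HQL statements and solutions for data skew)
[2]. SQL与HQL练习题 (SQL and HQL practice exercises)
[3]. 黑猴子的家:Hive 查询之 where 语句 (黑猴子的家: Hive queries, the WHERE clause)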