course.txt
1,数据库
2,数学
3,信息系统
4,操作系统
5,数据结构
6,数据处理
sc.txt
95001,1,81
95001,2,85
95001,3,88
95001,4,70
95002,2,90
95002,3,80
95002,4,71
95002,5,60
95003,1,82
95003,3,90
95003,5,100
95004,1,80
95004,2,92
95004,4,91
95004,5,70
95005,1,70
95005,2,92
95005,3,99
95005,6,87
95006,1,72
95006,2,62
95006,3,100
95006,4,59
95006,5,60
95006,6,98
95007,3,68
95007,4,91
95007,5,94
95007,6,78
95008,1,98
95008,3,89
95008,6,91
95009,2,81
95009,4,89
95009,6,100
95010,2,98
95010,5,90
95010,6,80
95011,1,81
95011,2,91
95011,3,81
95011,4,86
95012,1,81
95012,3,78
95012,4,85
95012,6,98
95013,1,98
95013,2,58
95013,4,88
95013,5,93
95014,1,91
95014,2,100
95014,4,98
95015,1,91
95015,3,59
95015,4,100
95015,6,95
95016,1,92
95016,2,99
95016,4,82
95017,4,82
95017,5,100
95017,6,58
95018,1,95
95018,2,100
95018,3,67
95018,4,78
95019,1,77
95019,2,90
95019,3,91
95019,4,67
95019,5,87
95020,1,66
95020,2,99
95020,5,93
95021,2,93
95021,5,91
95021,6,99
95022,3,69
95022,4,93
95022,5,82
95022,6,100
students.txt
95001,李勇,男,20,CS
95002,刘晨,女,19,IS
95003,王敏,女,22,MA
95004,张立,男,19,IS
95005,刘刚,男,18,MA
95006,孙庆,男,23,CS
95007,易思玲,女,19,MA
95008,李娜,女,18,CS
95009,梦圆圆,女,18,MA
95010,孔小涛,男,19,CS
95011,包小柏,男,18,MA
95012,孙花,女,20,CS
95013,冯伟,男,21,CS
95014,王小丽,女,19,CS
95015,王君,男,18,MA
95016,钱国,男,21,MA
95017,王风娟,女,18,IS
95018,王一,女,19,IS
95019,邢小丽,女,19,IS
95020,赵钱,男,21,IS
95021,周二,男,17,MA
95022,郑明,男,20,MA
Use the following statements to create the tables.
create table student(Sno int, Sname string, Sex string, Sage int, Sdept string) row format delimited fields terminated by ',' stored as textfile;
create table course(Cno int, Cname string) row format delimited fields terminated by ',' stored as textfile;
create table sc(Sno int, Cno int, Grade int) row format delimited fields terminated by ',' stored as textfile;
Use the following commands to load the data:
load data local inpath '/root/sql_learn/students.txt' overwrite into table student;
load data local inpath '/root/sql_learn/sc.txt' overwrite into table sc;
load data local inpath '/root/sql_learn/course.txt' overwrite into table course;
select Sno,Sname from student;
select distinct Sname from student inner join sc on(student.Sno=sc.Sno);
select count(1) from student;
select avg(Grade) from sc group by Cno having Cno=1;
select Cname,avg(Grade) as average
from course inner join sc on (course.Cno=sc.Cno)
group by course.Cname,sc.Cno;
A trick learned here: every column after select must either appear in the group by clause or be wrapped in an aggregate function.
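To make the rule concrete, here is a small sketch against the course and sc tables above. The first statement is left commented out because Hive is expected to reject it (Cname is neither grouped nor aggregated). Wrapping Cname in max() is only one illustrative workaround, chosen as an assumption for this example.

```sql
-- Expected to fail: Cname is not in group by and is not aggregated.
-- select Cname, avg(Grade) as average
-- from course inner join sc on (course.Cno=sc.Cno)
-- group by sc.Cno;

-- Accepted: group only by Cno and aggregate Cname instead.
-- Each Cno maps to a single Cname, so max() simply returns that name.
select sc.Cno, max(Cname) as Cname, avg(Grade) as average
from course inner join sc on (course.Cno=sc.Cno)
group by sc.Cno;
```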
select Grade from sc where Cno=1 order by Grade desc limit 1;
After looking into it, I found that order by runs with only a single reducer; the following statement can be used to improve efficiency.
select Grade from sc where Cno=1 distribute by Grade sort by Grade desc limit 1;
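The idea is that sort by sorts within each partition: distribute by hashes the rows on Grade, and rows with the same hash value are sent to the same reducer.
Note, however, that if you sort with sort by and set mapred.reduce.tasks>1, sort by only guarantees that each reducer's output is sorted; it does not guarantee a global order. Unlike order by, sort by is not affected by the hive.mapred.mode property. With sort by you can choose the number of reducers (via set mapred.reduce.tasks=n); merge-sorting the reducers' outputs afterwards yields the complete sorted result.
Map output files are unevenly sized.
Reduce output files are unevenly sized.
There are too many small files.
Files are excessively large.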
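As a rough sketch of the sort by behaviour described above, the statements below ask for two reducers and sort the grades inside each reducer. The value 2 and the choice of Cno as the distribution key are assumptions picked purely for illustration.

```sql
-- Assumption: two reducers, chosen only for illustration.
set mapred.reduce.tasks=2;

-- Rows with the same Cno go to the same reducer; each reducer's
-- output is sorted by Cno, then by Grade descending.
-- The combined output is not guaranteed to be globally sorted.
select Sno, Cno, Grade
from sc
distribute by Cno
sort by Cno, Grade desc;

-- For a single, globally sorted result, order by is still needed.
select Sno, Cno, Grade
from sc
order by Cno, Grade desc;
```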
select Cno,count(1) from sc group by Cno;
select Sno from sc group by Sno having count(1)>3;
select * from student order by Sno asc;
select * from student order by Sex, Sage asc;
select student.*,sc.Cno,sc.Grade from student inner join sc on(student.Sno=sc.Sno);
select student.Sname,course.Cname,sc.Grade
from student inner join sc on(student.Sno=sc.Sno) inner join course on(course.Cno=sc.Cno);
select Sname from student inner join sc on(student.Sno=sc.Sno) where sc.Cno=2;
select student.*,Cno from student left outer join sc on(student.Sno=sc.Sno);
Rewrite the following subquery:
SELECT a.key, a.value
FROM a
WHERE a.key in
(SELECT b.key
FROM B);
The rewritten SQL queries are as follows:
select a.key,a.value from a left outer join b on(a.key=b.key) where b.key is not null;
select a.key,a.value from a left semi join b on(a.key=b.key);
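The same rewrite can be tried on the tables from this document. As an illustrative sketch (the question "which students have a record for course 2" is an assumption chosen for the example), both statements below express an IN-style lookup against sc.

```sql
-- left semi join version: sc may only be referenced inside the on clause,
-- not in select or where.
select student.Sno, student.Sname
from student left semi join sc
on (student.Sno=sc.Sno and sc.Cno=2);

-- left outer join version: keep only the rows that found a match in sc.
select student.Sno, student.Sname
from student left outer join sc
on (student.Sno=sc.Sno and sc.Cno=2)
where sc.Sno is not null;
```

One difference worth keeping in mind: the outer-join form returns a left row once per matching row on the right, so it can produce duplicates when the right table repeats a key, whereas left semi join returns each left row at most once.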
select s2.* from student as s1 inner join student as s2 on(s2.Sdept=s1.Sdept) where s1.Sname='刘晨';
Compare the following carefully:
select * from student s1 left join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
select * from student s1 right join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
select * from student s1 inner join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
select * from student s1 left semi join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
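select s1.Sname from student s1 right semi join student s2 on s1.Sdept=s2.Sdept and s2.Sname='刘晨';
Note: Hive only provides left semi join, so the right semi join statement above is expected to be rejected by the parser.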
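One way to see why the placement of the Sname filter matters is to move it from on to where. The following is a sketch on the same student table; the column aliases name1 and name2 are introduced here only for readability.

```sql
-- Filter in on: every s1 row is kept; s2 columns are NULL when
-- no student named '刘晨' exists in the same department.
select s1.Sname as name1, s2.Sname as name2
from student s1 left join student s2
on s1.Sdept=s2.Sdept and s2.Sname='刘晨';

-- Filter in where: rows whose s2 side is NULL are dropped after the join,
-- so the result matches what the inner join returns.
select s1.Sname as name1, s2.Sname as name2
from student s1 left join student s2
on s1.Sdept=s2.Sdept
where s2.Sname='刘晨';
```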
After writing all of these, I have started to get a feel for HQL. Not bad.