任务实现:
1、创建表dept,emp和salgrade
#emp.txt
7369,SMITH,CLERK,7902,1980-12-17,800.00,,20
7499,ALLEN,SALESMAN,7698,1981-2-20,1600.00,300.00,30
7521,WARD,SALESMAN,7698,1981-2-22,1250.00,500.00,30
7566,JONES,MANAGER,7839,1981-4-2,2975.00,,20
7654,MARTIN,SALESMAN,7698,1981-9-28,1250.00,1400.00,30
7698,BLAKE,MANAGER,7839,1981-5-1,2850.00,,30
7782,CLARK,MANAGER,7839,1981-6-9,2450.00,,10
7839,KING,PRESIDENT,,1981-11-17,5000.00,,10
7844,TURNER,SALESMAN,7698,1981-9-8,1500.00,0.00,30
7900,JAMES,CLERK,7698,1981-12-3,950.00,,30
7902,FORD,ANALYST,7566,1981-12-3,3000.00,,20
7934,MILLER,CLERK,7782,1982-1-23,1300.00,,10
将数据dept.txt导入到表dept
#dept.txt
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
将数据salgrade.txt导入到表salgrade
#salgrade.txt
1,700,1200
2,1201,1400
3,1401,2000
4,2001,3000
5,3001,9999
SELECT [ALL|DISTINCT] 字段列表(字段1 别名,....)
FROM 表1 别名, 表2 别名, ....
WHERE 条件 ….
GROUP BY 分组字段 HAVING(组约束条件)
ORDER BY 排序字段1 Asc | Desc, 字段2 Asc|Desc, .....
[CLUSTER BY 字段 | [DISTRIBUTE BY字段] [SORT BY字段]]
LIMIT M,N;
DISTRIBUTE BY 相同字段值会分到一个分区
CLUSTER BY 相当于前两个
任务实现:
1.查询emp中所有的部门编号
任务实现:
1.将部门编号不为10的所有员工按员工编号升序排列
2.将所有员工先按部门编号升序,当部门一样时,再按姓名降序排
任务实现:
1.查看emp表中平均薪水是多少并对其四舍五入保留两位小数显示
2.统计emp表中有多少个不重复部门
group by…having分组查询
group by按照其后的字段分组,可使用多个字段进行分组。通常配合having使用,having后面的条件是对组的约束。同时SELECT子句中的字段必须是分组中的字段或分组函数
任务实现: 查询emp表平均薪水大于2000的部门编号、平均薪水
任务实现:
1.统计获救与死亡情况
select survived,count(survived) from tidanic group by survived;
2.统计舱位分布情况
select pclass,count(pclass) from tidanic group by pclass;
3.统计港口登船人员分布情况
select embarked,count(embarked) from tidanic group by embarked;
任务实现: 查询emp表薪水大于2500的员工姓名及所在部门名称
left outer join:左边表的所有记录连接结果会被返回,右边表没有符合on条件的,连接的右边的列为Null
right outer join:右边表的所有记录连接结果会被返回,左边表没有符合on条件的,连接的左边的列为Null
full outer join:返回两个表所有的记录连接结果
left semi join:左半开边连接,只返回满足连接条件的左边记录,类似MySQL in
注意:On后面只支持=,优化的方式就是大表放在最后面去连接,因为前面的表会被缓存,只有最后一个会通过扫描的方式,left semi join:一旦满足条件就返回左边记录,不进行连接,类似于in。。。exists
任务实现:
查询emp表中的员工姓名,薪水,如果薪水小于2000标记为low,如果薪水在2000和5000之间标记为middle,如果薪水大于5000标记为high
1.统计各个性别存活数
select sex,count(*) as sexcount from tidanic where survived=1 group by sex
2.计算存活总数
select count(*) as allcount from tidanic where survived=1
3.统计性别与生存率的关系
select a.sex,a.sexcount/b.allcount as count from (select sex,count(*) as sexcount from tidanic where survived=1 group by sex) a join (select count(*) as allcount from tidanic where survived=1) b on 1=1;
同理:
4.统计客舱等级与生存率的关系
select a.pclass,a.sexcount/b.allcount as count from (select pclass,count(*) as sexcount from tidanic where survived=1 group by pclass) a join (select count(*) as allcount from tidanic where survived=1) b on 1=1;
5.统计登船港口与生存率的关系
select a.embarked,a.sexcount/b.allcount as count from (select embarked,count(*) as sexcount from tidanic where survived=1 group by embarked) a join (select count(*) as allcount from tidanic where survived=1) b on 1=1;
任务实现:
查询emp表平均薪水大于2000的部门编号、平均薪水并按照部门编号排序,不包括10部门
SELECT deptno,avg(sal) as avg_sal FROM emp WHERE deptno<>10 GROUP BY deptno HAVING avg(sal)>2000 ORDER BY deptno;
可以在执行HQL语句前加explain,可查看执行顺序
SQL执行顺序
FROM ... WHERE ... GROUP BY ... AVG SUM 等聚合函数 ... HAVING ... SELECT... ORDER BY ...
Hive执行顺序
FROM ... WHERE ... SELECT ... GROUP BY ... HAVING ... ORDER BY ...
Hive的执行顺序也是MapReduce的执行顺序
map阶段:
reduce阶段: