[WITH CommonTableExpression (, CommonTableExpression)*]
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
ALL | DISTINCT选项指定是否应返回重复的行,默认值为ALL
注意:select distinct county,state 表示多个字段整体去重
where子句中禁止使用聚合函数:Where用于确定结果集,而聚合函数使用前提为结果集已确定,由此可以看出where先执行,group by后执行
分区裁剪:where子句或join中的on子句如果存在分区字段的过滤,则仅访问查询符合条件的分区,裁剪掉没必要访问的分区
group by 用于结合聚合函数,查询结果选择的字段:要么是GROUP BY分组的字段;要么是被聚合函数应用的字段,避免歧义
having子句,用以解决where无法使用聚合的问题,因为他在group by后执行,结果集已确定,所以having子句可以使用聚合函数
limit子句用于约束SELECT语句返回的行数,第一个参数偏移量从0开始
总结:执行顺序:from>where>group(含聚合)>having>order>select
提示:聚合语句要比having子句优先执行(这就是使用having的原因),而where子句在查询过程中执行优先级别优先于聚合语句(sum,min,max,avg,count)。
ORDER BY col_name [ASC|DESC] [NULLS FIRST|LAST]
1)对输出结果进行全局排序,底层只有一个reducetask执行
2)全局排序结果输出行数太多,排序时间过长,建议少使用order by或者搭配limit使用
CLUSTER BY col_name
1)set mapreduce.job.reduces =N;开启N个reducetask如果不手动设置,则Hive根据数据大小在编译期间自动分为随机组
2)使用hash散列根据分组字段对结果集分组,同样根据该字段来组内排序(只能正序)
3)CLUSTER BY因无法满足根据一字段分组,再根据另一字段排序,引出DISTRIBUTE BY 和SORT BY
DISTRIBUTE BY +SORT BY
1)DISTRIBUTE BY负责分,SORT BY负责分组内排序
2)分组和排序可以是不同的字段
3)CLUSTER BY=DISTRIBUTE BY +SORT BY(字段一样)
总结:
select_statement UNION [ALL | DISTINCT] select_statement UNION [ALL | DISTINCT] select_statement ...
1)使用DISTINCT关键字与只使用UNION默认值效果一样,都会删除重复行
2)每个select_statement返回的列的数量和名称必须相同
3)如果想对单个select使用子句需要使用子查询,对整体使用子句则放在整体最后,如下
-- 针对单个select使用子句过滤或限制
SELECT sno,sname FROM (select sno,sname from student_local LIMIT 2) subq1
UNION
SELECT sno,sname FROM (select sno,sname from student_hdfs LIMIT 3) subq2;
-- 针对整体使用子句
select sno,sname from student_local
UNION
select sno,sname from student_hdfs
order by sno desc;
from子句中子查询
1)必须给子查询一个名称
2)子查询返回结果中的列在外部查询中可用
where子句中子查询
分为不相关子查询和相关子查询
-- 不相关子查询,相当于IN、NOT IN,子查询只能选择一个列。
-- (1)执行子查询,其结果不被显示,而是传递给外部查询,作为外部查询的条件使用。
-- (2)执行外部查询,并显示整个结果。(子查询不需要起别名)
SELECT *
FROM student_hdfs
WHERE student_hdfs.num IN (select num from student_local limit 2);
-- 相关子查询,指EXISTS和NOT EXISTS子查询
-- 子查询的WHERE子句中支持对父查询的引用
SELECT A
FROM T1
WHERE EXISTS (SELECT B FROM T2 WHERE T1.X = T2.Y);
-- 选择语句中的CTE
with q1 as (select sno,sname,sage from student where sno = 95002)
select *
from q1;
-- from风格
with q1 as (select sno,sname,sage from student where sno = 95002)
from q1
select *;
-- chaining CTEs 链式
with q1 as ( select * from student where sno = 95002),
q2 as ( select sno,sname,sage from q1)
select * from (select sno from q2) a;
-- union案例
with q1 as (select * from student where sno = 95002),
q2 as (select * from student where sno = 95004)
select * from q1 union all select * from q2;
--视图,CTAS和插入语句中的CTE
-- insert
create table s1 like student;
with q1 as ( select * from student where sno = 95002)
from q1
insert overwrite table s1
select *;
select * from s1;
-- ctas
create table s2 as
with q1 as ( select * from student where sno = 95002)
select * from q1;
-- view
create view v1 as
with q1 as ( select * from student where sno = 95002)
select * from q1;
select * from v1;
join_table:
table_reference [INNER] JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)
table_reference:
table_factor
| join_table
table_factor:
tbl_name [alias]
| table_subquery alias
| ( table_references )
join_condition:
ON expression
左半开连接,满足连接条件的情况下只返回左边表的记录
等价于inner join之后只返回左表的记录,只是比inner更高效
select * from employee e left semi join employee_address e_addr on e.id=e.addr.id
等价于
select e.* from employee e inner join employee_address e_addr on e.id=e_addr.id;
交叉连接crossjoin,就是无条件的inner join,后面可以跟where子句进行过滤,或者on条件过滤