When the data volume is small,
simply use order by. Because it produces a total ordering, the whole sort has to run in a single reducer
from records
select year,temperature
order by year asc,temperature desc;
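If the data volume is large and a total ordering is not required, it is enough to sort the rows inside each reducer. The query below uses distribute by to send all rows with the same year to the same reducer, then uses sort by to order the rows within each reducer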
from records
select year,temperature
distribute by year
sort by year asc,temperature desc;
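The behaviour of distribute by followed by sort by can be modelled in a few lines of Python. This is only a sketch: the sample rows are made up, years are treated as integers, and `year % num_reducers` is a stand-in for Hive's real partitioner, not its actual hash function.

```python
from collections import defaultdict

def distribute_and_sort(rows, num_reducers=2):
    """Model of: distribute by year sort by year asc, temperature desc."""
    reducers = defaultdict(list)
    for year, temperature in rows:
        # distribute by year: rows with the same year land on the same reducer.
        # (year % num_reducers is an assumed stand-in for the real partitioner.)
        reducers[year % num_reducers].append((year, temperature))
    # sort by: each reducer sorts only its own rows; no global order results.
    return {
        reducer_id: sorted(part, key=lambda r: (r[0], -r[1]))
        for reducer_id, part in reducers.items()
    }

# Hypothetical sample rows: (year, temperature)
rows = [(1950, 0), (1949, 111), (1950, 22), (1949, 78), (1950, -11)]
result = distribute_and_sort(rows)
```

Each reducer's output is sorted, but concatenating the outputs of all reducers does not give a globally sorted result, which is the difference from order by.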
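When the distribute by and sort by columns are the same and the default ascending order is enough, cluster by is a shorthand for both: rows with the same key go to the same reducer and are sorted in ascending order within it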
from records
select year,temperature
cluster by year;
--------------------------------------------------
from records
select year,temperature
cluster by year,temperature;
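Joins
Inner join
Hive compiles a join query into a MapReduce job based on the join condition: the mappers emit the join column as the output key, and the value is the record from either of the two tables being joined. The reducers then carry out the actual join.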
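The map/shuffle/reduce flow described above can be sketched in Python. This is a simplified model, not Hive's implementation; the function names and the sample rows for records are hypothetical, and the join key is assumed to be the first column of each row.

```python
from collections import defaultdict
from itertools import product

def map_phase(table_name, rows):
    # Mapper: emit (join key, (source table tag, row)).
    for row in rows:
        yield row[0], (table_name, row)

def reduce_side_join(left_rows, right_rows):
    # Shuffle: group the mapper output by join key.
    groups = defaultdict(lambda: {"records": [], "records2": []})
    for key, (tag, row) in map_phase("records", left_rows):
        groups[key][tag].append(row)
    for key, (tag, row) in map_phase("records2", right_rows):
        groups[key][tag].append(row)
    # Reducer: for each key, pair every left row with every right row.
    joined = []
    for key in sorted(groups):
        sides = groups[key]
        for left, right in product(sides["records"], sides["records2"]):
            joined.append(left + right)
    return joined

# Hypothetical sample rows: records is (year, temperature), records2 is (year, name)
records = [("1990", 10), ("1991", 20), ("2000", 30)]
records2 = [("1990", "ruishenh0"), ("1991", "ruishenh1"), ("1992", "ruishenh2")]
joined = reduce_side_join(records, records2)
```

Keys that appear in only one table (2000 and 1992 here) produce no output, which is what makes this an inner join.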
Data preparation
/root/hcr/tmp/sample2.txt data file (fields are separated by a tab)
1990	ruishenh0
1992	ruishenh2
1991	ruishenh1
1993	ruishenh3
1994	ruishenh4
1995	ruishenh5
1996	ruishenh6
1997	ruishenh7
1998	ruishenh8
create table records2 (year string,name string) row format delimited fields terminated by '\t';
load data local inpath '/root/hcr/tmp/sample2.txt' overwrite into table records2;
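join on
select records.*,records2.*
from records join records2 on(records.year=records2.year);
In Hive, a join on clause can combine several conditions, for example a join b on a.id=b.aid and a.type=b.atype
select records.*,records2.*
from records join records2 on(records.year=records2.year and records.quality!=1);
Hive also supports joining more than two tables
select r1.year,r2.name,r2.year,r4.y,r4.standard from records2 r2 join records r1 on (r1.year=r2.year) join records4 r4 on (r4.y=r2.year);
However, this fails with an error when executed. // TODO: find the cause
Hint: in a join clause the largest table should generally be placed last, because Hive buffers the rows of the earlier tables and streams the last one through the reducers;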
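Outer joins
Left outer join: the left table drives the query; columns from the right table are null where no match is found
select * from records r left outer join records2 r2 on r.year=r2.year;
Right outer join: the right table drives the query; columns from the left table are null where no match is found
select * from records r right outer join records2 r2 on r.year=r2.year;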
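The null-filling behaviour of the two outer joins can be illustrated with a small Python model. The sample rows are hypothetical, the join key is assumed to be the first column, and None plays the role of null.

```python
def left_outer_join(left, right, right_width):
    # Every left row is kept; unmatched rows get None for the right columns.
    out = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            out.extend(l + r for r in matches)
        else:
            out.append(l + (None,) * right_width)
    return out

def right_outer_join(left, right, left_width):
    # Every right row is kept; unmatched rows get None for the left columns.
    out = []
    for r in right:
        matches = [l for l in left if l[0] == r[0]]
        if matches:
            out.extend(l + r for l in matches)
        else:
            out.append((None,) * left_width + r)
    return out

# Hypothetical sample rows: records is (year, temperature), records2 is (year, name)
records = [("1990", 10), ("2000", 30)]
records2 = [("1990", "ruishenh0"), ("1992", "ruishenh2")]
left_result = left_outer_join(records, records2, right_width=2)
right_result = right_outer_join(records, records2, left_width=2)
```

In this model the year 2000 survives only in the left outer join and 1992 only in the right outer join.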
Semi join
select * from records2 r left semi join records r2 on r.year=r2.year;
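A left semi join behaves like an IN or EXISTS filter on the left table: a left row is returned at most once, and only the left table's columns appear in the result. A minimal Python model of that behaviour, with made-up sample rows and the join key assumed to be the first column:

```python
def left_semi_join(left, right):
    # Collect the join keys that exist on the right side.
    right_keys = {r[0] for r in right}
    # Keep a left row if its key exists on the right; emit it only once
    # and without any right-side columns.
    return [l for l in left if l[0] in right_keys]

# Hypothetical sample rows: records2 is (year, name), records is (year, temperature)
records2 = [("1990", "ruishenh0"), ("1992", "ruishenh2")]
records = [("1990", 10), ("1990", 25), ("1991", 20)]
semi = left_semi_join(records2, records)
```

An inner join on the same data would return the 1990 row twice, once per matching row in records; the semi join returns it once.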
Map join /*+MAPJOIN(r2)*/
From records r join records2 r2 on r.year=r2.year
-- the hint names the alias of the small table, which is loaded into memory
select /*+MAPJOIN(r2)*/ r2.*,r.*;
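The idea behind a map join is that the small table fits in memory, so each mapper can look up matches directly and no reduce phase is needed for the join. A rough Python sketch of that idea, assuming hypothetical sample rows and the join key in the first column (this is a model, not Hive's implementation):

```python
from collections import defaultdict

def map_join(big_rows, small_rows):
    # Build an in-memory hash table from the small table, keyed by join column.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[0]].append(row)
    # Stream the big table once and probe the hash table for each row.
    for big in big_rows:
        for small in lookup.get(big[0], []):
            # Column order mirrors "select r2.*, r.*".
            yield small + big

# Hypothetical sample rows: records is (year, temperature), records2 is (year, name)
records = [("1990", 10), ("1991", 20), ("2000", 30)]
records2 = [("1990", "ruishenh0"), ("1991", "ruishenh1")]
mapjoin_result = list(map_join(records, records2))
```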
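Subqueries
A subquery is a SELECT statement nested inside another SQL statement. Hive's support for subqueries is very limited: it only allows a subquery in the FROM clause of a SELECT statement.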
from
(
From records r
select r.year,MAX(r.temperature) as max_temperature
where r.temperature !=9999 and (r.quality=0 or r.quality=1 or r.quality=2)
group by r.year
) mt
select mt.year,avg(mt.max_temperature)
group by mt.year ;
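Because the outer query refers to the subquery's columns, the subquery must be given an alias, such as mt above. In addition, the column names returned by the subquery must be unique (for example, it cannot return both records.year and records2.year).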
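The two levels of the query above can be traced with a short Python model: the inner step filters and takes the maximum temperature per year, and the outer step averages those maxima per year. The sample rows are made up for illustration.

```python
from collections import defaultdict

def inner_query(rows):
    # Subquery "mt": filter out bad readings, then take MAX(temperature) per year.
    max_by_year = {}
    for year, temperature, quality in rows:
        if temperature != 9999 and quality in (0, 1, 2):
            current = max_by_year.get(year)
            if current is None or temperature > current:
                max_by_year[year] = temperature
    return sorted(max_by_year.items())

def outer_query(mt_rows):
    # Outer query: AVG(max_temperature) grouped by year.
    groups = defaultdict(list)
    for year, max_temperature in mt_rows:
        groups[year].append(max_temperature)
    return {year: sum(vals) / len(vals) for year, vals in groups.items()}

# Hypothetical sample rows: (year, temperature, quality)
rows = [
    ("1949", 111, 1), ("1949", 78, 1),
    ("1950", 9999, 1), ("1950", 22, 0), ("1950", 0, 3),
]
mt = inner_query(rows)
averages = outer_query(mt)
```

Since the subquery already returns one row per year, the outer average of each group equals that year's maximum in this model.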
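Views
A view in Hive is a virtual table, much like a saved SQL query; it is not materialized. Data cannot be loaded or inserted into the base tables through a view.
Creating a view
create view max_records
as
select r.year,MAX(r.temperature) as max_temperature
From records r
where r.temperature !=9999 and (r.quality=0 or r.quality=1 or r.quality=2)
group by r.year ;
Querying the view
Select * from max_records;
Reproducing the subquery example above: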
select year,avg(max_temperature)
from max_records
group by year;