提示:在做小数据量查询测试时,可以让hive将mrjob提交给本地运行器运行,可以在hive会话中设置如下参数:
hive> set hive.exec.mode.local.auto=true;
select * from t_access;
select count(*) from t_access;
select max(ip) from t_access;
select * from t_access where access_time<'2017-08-06 15:30:20'
select * from t_access where access_time<'2017-08-06 16:30:20' and ip>'192.168.33.3';
join关联查询示例
假如有a.txt文件
a,1
b,2
c,3
d,4
假如有b.txt文件
a,xx
b,yy
d,zz
e,pp
进行各种join查询:
select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
join t_b b
on a.name=b.name
结果:
+--------+--------+--------+--------+--+
| aname | anumb | bname | bnick |
+--------+--------+--------+--------+--+
| a | 1 | a | xx |
| b | 2 | b | yy |
| d | 4 | d | zz |
+--------+--------+--------+--------+--+
select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
left outer join t_b b
on a.name=b.name
结果:
select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
right outer join t_b b
on a.name=b.name
结果:
结果:
left semi join
hive中不支持exist/IN子查询,可以用left semi join来实现同样的效果:
select
a.name as aname,
a.numb as anumb
from t_a a
left semi join t_b b
on a.name=b.name;
结果:
注意: left semi join的 select子句中,不能有右表的字段
group by分组聚合
select dt,count(*),max(ip) as cnt from t_access group by dt;
select dt,count(*),max(ip) as cnt from t_access group by dt having dt>'20170804';
select
dt,count(*),max(ip) as cnt
from t_access
where url='http://www.edu360.cn/job'
group by dt having dt>'20170804';
注意: 一旦有group by子句,那么,在select子句中就不能有 (分组字段,聚合函数) 以外的字段
如果想要【select非group by的字段】,可以参考https://blog.csdn.net/fly910905/article/details/80865000
## 为什么where必须写在group by的前面,为什么group by后面的条件只能用having
因为,where是用于在真正执行查询逻辑之前过滤数据用的
having是对group by聚合之后的结果进行再过滤;
上述语句的执行逻辑:
子查询
select id,name,father
from
(select id,name,family_members['brother'] as father from t_person) tmp
where father is not null;