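Table joins in Hive fall roughly into the following four kinds:
1: join, the equi-join (inner join): a row is returned only when the join key value exists in both tables.
2: left outer join: every row of the left table is output whether or not it has a match in the right table; rows of the right table are output only when they match a row in the left table.
3: right outer join: every row of the right table is output whether or not it has a match in the left table; rows of the left table are output only when they match a row in the right table (the mirror image of left outer join).
4: left semi join: similar to exists.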
Let's walk through these joins with concrete examples:
# Data in the user table:
hive (hive)> select * from user;
OK
id  name
1   lavimer
2   liaozhongmin
3   liaozemin
Time taken: 0.112 seconds
# Data in the post table:
hive (hive)> select * from post;
OK
uid  pid  title
1    1    Thinking in Java
1    2    Thinking in Hadoop
2    3    Thinking in C
4    4    Thinking in Hive
5    5    Thinking in HBase
5    6    Thinking in Pig
5    7    Thinking in Flume
Time taken: 0.11 seconds

I: Equi-join (inner join)
hive (hive)> select s.id,s.name,t.pid,t.title from
           > (select id,name from user) s
           > join
           > (select uid,pid,title from post) t
           > on s.id=t.uid;
The query returns:
id  name          pid  title
1   lavimer       1    Thinking in Java
1   lavimer       2    Thinking in Hadoop
2   liaozhongmin  3    Thinking in C
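To make the matching rule concrete, here is a minimal Python sketch that emulates the equi-join on the sample rows above. It is only an illustration of the semantics, not how Hive executes the query; the helper name `inner_join` is hypothetical.

```python
# Sample rows copied from the user and post tables shown earlier.
users = [(1, "lavimer"), (2, "liaozhongmin"), (3, "liaozemin")]
posts = [
    (1, 1, "Thinking in Java"),
    (1, 2, "Thinking in Hadoop"),
    (2, 3, "Thinking in C"),
    (4, 4, "Thinking in Hive"),
    (5, 5, "Thinking in HBase"),
    (5, 6, "Thinking in Pig"),
    (5, 7, "Thinking in Flume"),
]


def inner_join(users, posts):
    """Emit (id, name, pid, title) for every pair where user.id == post.uid."""
    return [
        (user_id, name, pid, title)
        for (user_id, name) in users
        for (uid, pid, title) in posts
        if user_id == uid
    ]


inner_rows = inner_join(users, posts)
for row in inner_rows:
    print(row)
```

User 3 and the posts with uid 4 and 5 have no partner on the other side, so they drop out of the result.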

II: Left outer join
hive (hive)> select s.id,s.name,t.pid,t.title from
           > (select id,name from user) s
           > left outer join
           > (select uid,pid,title from post) t
           > on s.id=t.uid;
The query returns:
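id  name          pid   title
1   lavimer       1     Thinking in Java
1   lavimer       2     Thinking in Hadoop
2   liaozhongmin  3     Thinking in C
3   liaozemin     NULL  NULL
Note: as the result shows, every row of user is output. The post columns are filled in only where a post's uid matches a user id; otherwise they are NULL (user 3 has no posts).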
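The left outer join can be sketched the same way in Python: every left row is kept, and a row without a match is padded with `None` (standing in for NULL). The helper name `left_outer_join` is hypothetical and the code only illustrates the semantics.

```python
# Sample rows copied from the user and post tables shown earlier.
users = [(1, "lavimer"), (2, "liaozhongmin"), (3, "liaozemin")]
posts = [
    (1, 1, "Thinking in Java"),
    (1, 2, "Thinking in Hadoop"),
    (2, 3, "Thinking in C"),
    (4, 4, "Thinking in Hive"),
    (5, 5, "Thinking in HBase"),
    (5, 6, "Thinking in Pig"),
    (5, 7, "Thinking in Flume"),
]


def left_outer_join(users, posts):
    """Keep every user row; pad with None when no post has a matching uid."""
    rows = []
    for user_id, name in users:
        matches = [(pid, title) for uid, pid, title in posts if uid == user_id]
        if not matches:
            matches = [(None, None)]  # plays the role of NULL, NULL
        for pid, title in matches:
            rows.append((user_id, name, pid, title))
    return rows


left_rows = left_outer_join(users, posts)
for row in left_rows:
    print(row)
```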

III: Right outer join
hive (hive)> select s.id,s.name,t.pid,t.title from
           > (select id,name from user) s
           > right outer join
           > (select uid,pid,title from post) t
           > on s.id=t.uid;
The query returns:
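id    name          pid  title
1     lavimer       1    Thinking in Java
1     lavimer       2    Thinking in Hadoop
2     liaozhongmin  3    Thinking in C
NULL  NULL          4    Thinking in Hive
NULL  NULL          5    Thinking in HBase
NULL  NULL          6    Thinking in Pig
NULL  NULL          7    Thinking in Flume
Note: as the result shows, every row of post is output. The user columns are filled in only where a user id matches the post's uid; otherwise they are NULL (posts 4 to 7 belong to uids 4 and 5, which are not in user).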
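The right outer join mirrors the previous case: this time the loop is driven by the right table, and the user columns are padded. Again a Python sketch of the semantics only, with a hypothetical helper name `right_outer_join`.

```python
# Sample rows copied from the user and post tables shown earlier.
users = [(1, "lavimer"), (2, "liaozhongmin"), (3, "liaozemin")]
posts = [
    (1, 1, "Thinking in Java"),
    (1, 2, "Thinking in Hadoop"),
    (2, 3, "Thinking in C"),
    (4, 4, "Thinking in Hive"),
    (5, 5, "Thinking in HBase"),
    (5, 6, "Thinking in Pig"),
    (5, 7, "Thinking in Flume"),
]


def right_outer_join(users, posts):
    """Keep every post row; pad with None when no user has a matching id."""
    rows = []
    for uid, pid, title in posts:
        matches = [(user_id, name) for user_id, name in users if user_id == uid]
        if not matches:
            matches = [(None, None)]  # plays the role of NULL, NULL
        for user_id, name in matches:
            rows.append((user_id, name, pid, title))
    return rows


right_rows = right_outer_join(users, posts)
for row in right_rows:
    print(row)
```

User 3 never appears here, because it is the right table (post) that decides which rows survive.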

IV: Left semi join
hive (hive)> select s.id,s.name from
           > (select id,name from user) s
           > left semi join
           > (select uid,pid,title from post) t
           > on s.id=t.uid;
The query returns:
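id  name
1   lavimer
2   liaozhongmin
The left semi join is the interesting one. Hive had no in/exists subquery clauses (this holds for older releases; subqueries in the where clause were added in Hive 0.13), yet this kind of operation is still needed, so in Hive such a subquery is expressed as a left semi join instead.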
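The same query can be written without the inline subqueries:
hive (hive)> select user.id,user.name from user
           > left semi join
           > post
           > on (user.id=post.uid);
It corresponds to the following SQL statement:
select id,name from user where id in (select uid from post);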
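As a quick sanity check of that correspondence, the sketch below runs the in form and an equivalent exists form against an in-memory SQLite database (SQLite, not Hive, so this only checks the SQL semantics) and compares both with a plain-Python semi join. The helper name `left_semi_join` is hypothetical; the table name is quoted as "user" purely as a precaution.

```python
import sqlite3

users = [(1, "lavimer"), (2, "liaozhongmin"), (3, "liaozemin")]
posts = [
    (1, 1, "Thinking in Java"),
    (1, 2, "Thinking in Hadoop"),
    (2, 3, "Thinking in C"),
    (4, 4, "Thinking in Hive"),
    (5, 5, "Thinking in HBase"),
    (5, 6, "Thinking in Pig"),
    (5, 7, "Thinking in Flume"),
]

# Load the sample rows into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "user" (id INTEGER, name TEXT)')
conn.execute("CREATE TABLE post (uid INTEGER, pid INTEGER, title TEXT)")
conn.executemany('INSERT INTO "user" VALUES (?, ?)', users)
conn.executemany("INSERT INTO post VALUES (?, ?, ?)", posts)

# The IN form from the text.
in_rows = conn.execute(
    'SELECT id, name FROM "user" WHERE id IN (SELECT uid FROM post) ORDER BY id'
).fetchall()

# The equivalent EXISTS form.
exists_rows = conn.execute(
    'SELECT id, name FROM "user" u '
    "WHERE EXISTS (SELECT 1 FROM post p WHERE p.uid = u.id) ORDER BY id"
).fetchall()


def left_semi_join(users, posts):
    """Keep a user row once if at least one post references its id."""
    post_uids = {uid for uid, _pid, _title in posts}
    return [(user_id, name) for user_id, name in users if user_id in post_uids]


semi_rows = left_semi_join(users, posts)
print(in_rows)
print(exists_rows)
print(semi_rows)
conn.close()
```

Unlike the equi-join, where lavimer appears twice (once per post), the semi join returns each left row at most once. In Hive, the right-hand table of a left semi join may be referenced only in the on clause, which is why the select list above contains user columns only.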