Create the tables
0: jdbc:hive2://localhost:10000> create database myjoin;
No rows affected (3.78 seconds)
0: jdbc:hive2://localhost:10000> use myjoin;
No rows affected (0.419 seconds)
0: jdbc:hive2://localhost:10000> create table a(id int,name string) row format delimited fields terminated by ',';
No rows affected (2.08 seconds)
0: jdbc:hive2://localhost:10000> create table b(id int,name string) row format delimited fields terminated by ',';
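The transcript does not show how the rows were loaded into a and b. One possible way to reproduce the sample data is sketched here, assuming Hive 0.14 or later (where INSERT ... VALUES is available). The file path in the commented alternative is hypothetical.

```sql
-- Assumption: Hive 0.14+; this is one way to get the sample rows in, not necessarily the author's.
use myjoin;

insert into table a values
  (1,'qq'),(2,'ww'),(3,'ee'),(4,'rr'),(5,'tt'),
  (6,'yy'),(7,'aa'),(8,'ss'),(11,'zz');

insert into table b values
  (1,'qq'),(2,'22'),(3,'dd'),(4,'rr'),(6,'fgf'),(7,'as'),
  (9,'23'),(12,'ww'),(34,'3'),(23,'34'),(12,'45'),(26,'4r');

-- Alternative: load a comma-separated text file (hypothetical path),
-- which matches the "fields terminated by ','" table definition.
-- load data local inpath '/tmp/a.txt' into table a;
```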
0: jdbc:hive2://localhost:10000> select * from a
0: jdbc:hive2://localhost:10000> ;
+-------+---------+--+
| a.id  | a.name  |
+-------+---------+--+
| 1     | qq      |
| 2     | ww      |
| 3     | ee      |
| 4     | rr      |
| 5     | tt      |
| 6     | yy      |
| 7     | aa      |
| 8     | ss      |
| 11    | zz      |
+-------+---------+--+
9 rows selected (1.881 seconds)
0: jdbc:hive2://localhost:10000> select * from b;
+-------+---------+--+
| b.id  | b.name  |
+-------+---------+--+
| 1     | qq      |
| 2     | 22      |
| 3     | dd      |
| 4     | rr      |
| 6     | fgf     |
| 7     | as      |
| 9     | 23      |
| 12    | ww      |
| 34    | 3       |
| 23    | 34      |
| 12    | 45      |
| 26    | 4r      |
+-------+---------+--+
12 rows selected (0.147 seconds)
Result of inner join, which is the same as a plain join
0: jdbc:hive2://localhost:10000> select a.*,b.* from a inner join b on a.id = b.id;
INFO : Execution completed successfully
INFO : MapredLocal task succeeded
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1496277833427_0007
INFO : The url to track the job: http://mini2:8088/proxy/application_1496277833427_0007/
INFO : Starting Job = job_1496277833427_0007, Tracking URL = http://mini2:8088/proxy/application_1496277833427_0007/
INFO : Kill Command = /home/hadoop/xxxxxx/hadoop265/bin/hadoop job -kill job_1496277833427_0007
INFO : Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
INFO : 2017-06-01 16:32:03,138 Stage-3 map = 0%, reduce = 0%
INFO : 2017-06-01 16:32:26,221 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 5.05 sec
INFO : MapReduce Total cumulative CPU time: 5 seconds 50 msec
INFO : Ended Job = job_1496277833427_0007
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 1     | qq      | 1     | qq      |
| 2     | ww      | 2     | 22      |
| 3     | ee      | 3     | dd      |
| 4     | rr      | 4     | rr      |
| 6     | yy      | 6     | fgf     |
| 7     | aa      | 7     | as      |
+-------+---------+-------+---------+--+
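full outer join: rows from both tables appear in the result; where the on condition finds no match, the columns of the unmatched side are shown as NULL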
0: jdbc:hive2://localhost:10000> select a.*,b.* from a full outer join b on a.id = b.id;
INFO : Number of reduce tasks not specified. Estimated from input data size: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:2
INFO : Submitting tokens for job: job_1496277833427_0008
INFO : The url to track the job: http://mini2:8088/proxy/application_1496277833427_0008/
INFO : Starting Job = job_1496277833427_0008, Tracking URL = http://mini2:8088/proxy/application_1496277833427_0008/
INFO : Kill Command = /home/hadoop/xxxxxx/hadoop265/bin/hadoop job -kill job_1496277833427_0008
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2017-06-01 16:34:05,413 Stage-1 map = 0%, reduce = 0%
INFO : 2017-06-01 16:35:05,889 Stage-1 map = 0%, reduce = 0%
INFO : 2017-06-01 16:37:35,521 Stage-1 map = 0%, reduce = 0%
INFO : 2017-06-01 16:38:46,061 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 6.52 sec
INFO : 2017-06-01 16:38:49,443 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.17 sec
INFO : 2017-06-01 16:39:25,252 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 12.65 sec
INFO : MapReduce Total cumulative CPU time: 12 seconds 650 msec
INFO : Ended Job = job_1496277833427_0008
+-------+---------+-------+---------+--+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+--+
| 1     | qq      | 1     | qq      |
| 2     | ww      | 2     | 22      |
| 3     | ee      | 3     | dd      |
| 4     | rr      | 4     | rr      |
| 5     | tt      | NULL  | NULL    |
| 6     | yy      | 6     | fgf     |
| 7     | aa      | 7     | as      |
| 8     | ss      | NULL  | NULL    |
| NULL  | NULL    | 9     | 23      |
| 11    | zz      | NULL  | NULL    |
| NULL  | NULL    | 12    | 45      |
| NULL  | NULL    | 12    | ww      |
| NULL  | NULL    | 23    | 34      |
| NULL  | NULL    | 26    | 4r      |
| NULL  | NULL    | 34    | 3       |
+-------+---------+-------+---------+--+
15 rows selected (371.304 seconds)
left semi join takes the place of IN / EXISTS subqueries. The result is just the left half (the a columns) of the inner join result.
select a.* from a left semi join b on a.id = b.id; -- b.* cannot be written before from, otherwise compilation fails (Error while compiling statement: FAILED: SemanticException [Error 10009]: Line 1:11 Invalid table alias 'b' (state=42000,code=10009))
+-------+---------+--+
| a.id  | a.name  |
+-------+---------+--+
| 1     | qq      |
| 2     | ww      |
| 3     | ee      |
| 4     | rr      |
| 6     | yy      |
| 7     | aa      |
+-------+---------+--+
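For comparison, these are the IN and EXISTS forms that the left semi join stands in for. This is a sketch that assumes Hive 0.13 or later, where such subqueries are accepted in the where clause. On the sample tables both should select the same six rows of a as the left semi join.

```sql
-- Assumption: Hive 0.13+ (IN / EXISTS subqueries in WHERE).

-- IN form: keep rows of a whose id appears somewhere in b.
select a.* from a where a.id in (select b.id from b);

-- EXISTS form: the subquery must be correlated with the outer table.
select a.* from a where exists (select 1 from b where a.id = b.id);
```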
Hive has no right semi join.
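If the rows of b that have a match in a are what is needed, one workaround is to swap the two tables so that b sits on the left. A minimal sketch using the tables from this post:

```sql
-- b is now the left table, so only b's columns may appear in the select list.
select b.* from b left semi join a on b.id = a.id;
```

With the sample data this should keep the rows of b whose id is 1, 2, 3, 4, 6 or 7.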
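left semi join is an efficient implementation of IN / EXISTS and is generally cheaper than inner join: it only needs to know whether a matching row exists in b, and it returns no columns from b.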
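One way to check the efficiency claim on a given cluster is to compare the plans Hive produces for the two queries with explain. This is only a sketch; the plan text varies with the Hive version and settings, so no output is shown.

```sql
-- Compare the operator trees of the two equivalent-looking queries.
explain select a.* from a left semi join b on a.id = b.id;
explain select a.* from a inner join b on a.id = b.id;
```

On the sample tables both queries return the same six rows, because none of the matching ids is duplicated in b. If b held several rows with the same matching id, inner join would repeat the row of a once per match, while left semi join would still return it once.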