JOIN用于将两个或多个表中的行链接在一起。Hive支持大多数的SQl JOIN操作,如 INNER JOIN 和 OUTER JOIN。此外,HQL还支持一些特殊的JOIN,如MapJoin 和 Semi-Join。在早期版本中,Hive仅支持equal join。在v2.2.0之后,还支持unequal join。但是,在使用unequal join时要非常小心,除非你知道预期的结果,因为unequal join可能通过生成joined表的笛卡尔积返回许多行。如果要限制join的输出,可以在join之后使用WHERE子句,因为JOIN发生在WHERE子句之前。如果可能,请在join条件上使用过滤条件,而不是在先前筛选数据的条件上使用过滤条件。而且,所有类型的left/right join都是不可交换的,并且总是左/右结合的,而INNER 和 FULL OUTER JOINS 都是可交换和可结合。join的语法如下:
join_table:
table_reference [INNER] JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)
table_reference:
table_factor
| join_table
table_factor:
tbl_name [alias]
| table_subquery alias
| ( table_references )
join_condition:
ON expression
Hive从 0.13.0开始支持隐式join表示法(见HIVE-5558),也就是在from子句中连接一个逗号分隔的表的列表,而可以省略JOIN关键字,如:
SELECT *
FROM table1 t1, table2 t2, table3 t3
WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '02535';
Hive从 0.13.0开始支持非限定列引用(见HIVE-6393),Hive尝试根据join的输入来解决这些问题,如果非限定列引用被解析为有多个表,Hive会将其标记为不明确的引用。如:
CREATE TABLE a (k1 string, v1 string);
CREATE TABLE b (k2 string, v2 string);
SELECT k1, v1, k2, v2
FROM a JOIN b ON k1 = k2;
Hive从 2.2.0 开始支持在ON子句使用复杂表达式(见 HIVE-15211, HIVE-15251),在此之前,Hive不支持非相等条件的join条件。特别是,join条件的语法限制如下:
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
INNER JOIN 或 JOIN 从join表的两侧返回满足join条件的行。JOIN关键字可以用逗号分隔的表名省略,这也就是上面说的隐式join表示法。
示例
> CREATE TABLE IF NOT EXISTS employee_hr
(
name string,
employee_id int,
sin_number string,
start_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
> LOAD DATA INPATH '/tmp/data/employee_hr.txt' OVERWRITE INTO TABLE employee_hr;
> SELECT emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| Lucy | 577-928-094 |
+-----------+------------------+
> SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name != emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 527-948-090 |
| Michael | 647-968-598 |
| Michael | 577-928-094 |
| Will | 547-968-091 |
| Will | 647-968-598 |
| Will | 577-928-094 |
| Shelley | 547-968-091 |
| Shelley | 527-948-090 |
| Shelley | 647-968-598 |
| Shelley | 577-928-094 |
| Lucy | 547-968-091 |
| Lucy | 527-948-090 |
| Lucy | 647-968-598 |
+-----------+------------------+
> SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON
IF(emp.name = 'Will', '1', emp.name) = CASE WHEN emph.name = 'Will' THEN '0' ELSE emph.name END;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Lucy | 577-928-094 |
+-----------+------------------+
> SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
WHERE
emp.name = 'Will';
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Will | 527-948-090 |
+-----------+------------------+
> SELECT emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emp.name = empi.name;
+-----------+-------------------+------------------+
| emp.name | empi.employee_id | emph.sin_number |
+-----------+-------------------+------------------+
| Michael | 2 | 547-968-091 |
| Will | 2 | 527-948-090 |
| Lucy | 4 | 577-928-094 |
+-----------+-------------------+------------------+
> SELECT emp.name
FROM employee emp
JOIN employee emp_b
ON emp.name = emp_b.name;
+-----------+
| emp.name |
+-----------+
| Michael |
| Will |
| Shelley |
| Lucy |
+-----------+
--Implicit join, which support since Hive 0.13.0
> SELECT emp.name, emph.sin_number
FROM employee emp, employee_hr emph
WHERE emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| Lucy | 577-928-094 |
+-----------+------------------+
--Join using different columns will create additional mapreduce
> SELECT emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
+-----------+-------------------+------------------+
| emp.name | empi.employee_id | emph.sin_number |
+-----------+-------------------+------------------+
| Michael | 100 | 547-968-091 |
| Will | 101 | 527-948-090 |
| Lucy | 103 | 577-928-094 |
+-----------+-------------------+------------------+
在多个表之间执行join操作时,会创建Yarn/MapReduce作业来处理HDFS中的数据。每一个作业被称为一个stage。通常,建议将大表放在join语句的最后面以获得更好的性能,同时避免OOM异常。这是因为join序列中的最后一个表通常通过reducer进行流式处理,而其他表默认都会在reducer中进行缓冲。此外,还可以指定一个提示标记 /*+STREAMTABLE (table_name)*/,来建议哪个表应在默认决策中进行流式处理。如下:
--Streaming tables
> SELECT /*+ STREAMTABLE(employee_hr) */
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
除了INNER JOIN以外,Hive还支持OUTER JOIN(外连接)和FULL JOIN(全连接)。下面列出不同join类型的返回结果,假设table_m有m行,table_n 有n行记录
Join 类型 | Join 名称 | 执行逻辑 | 返回的行 |
---|---|---|---|
table_m JOIN table_n | 内连接 | 返回在2个表中所有共有的行 | m ∩ n |
table_m LEFT JOIN table_n | 左外连接 | 这将返回左表中的所有行和右表中匹配的行,如果右表中没有匹配项,则在右表中返回NULL | m |
table_m RIGHT JOIN table_n | 右外连接 | 这将返回右表中的所有行和左表中匹配的行,如果左表中没有匹配项,则在左表中返回NULL | n |
table_m FULL JOIN table_n | 全外连接 | 这将返回两个表中的所有行和两个表中匹配的行,如果左表或右表中没有匹配项,则返回NULL | m + n - m ∩ n |
table_m CROSS JOIN table_n | 交叉连接 | 这将返回两个表中的所有行组合,以生成笛卡尔积 | m * n |
--Left JOIN
SELECT emp.name, emph.sin_number
FROM employee emp
LEFT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| Shelley | NULL |
| Lucy | 577-928-094 |
+-----------+------------------+
--Right JOIN
SELECT emp.name, emph.sin_number
FROM employee emp
RIGHT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| NULL | 647-968-598 |
| Lucy | 577-928-094 |
+-----------+------------------+
--Full OUTER JOIN
SELECT emp.name, emph.sin_number
FROM employee emp
FULL JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Lucy | 577-928-094 |
| Michael | 547-968-091 |
| Shelley | NULL |
| NULL | 647-968-598 |
| Will | 527-948-090 |
+-----------+------------------+
交叉连接是没有join条件的,CROSS JOIN语句可以写成不带不带条件的join或者使用总是为true的条件,如 1=1。这种情况下,可以使用交叉连接join任何数据集。但是,只有在连接没有本质关系的数据时才考虑这种连接,如向表中添加具有行计数的标题。下面示例是交叉连接的三种书写方式:
--CROSS JOIN in different ways
SELECT emp.name, emph.sin_number
FROM employee emp
CROSS JOIN employee_hr emph;
SELECT emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph;
SELECT emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph on 1=1;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Michael | 527-948-090 |
| Michael | 647-968-598 |
| Michael | 577-928-094 |
| Will | 547-968-091 |
| Will | 527-948-090 |
| Will | 647-968-598 |
| Will | 577-928-094 |
| Shelley | 547-968-091 |
| Shelley | 527-948-090 |
| Shelley | 647-968-598 |
| Shelley | 577-928-094 |
| Lucy | 547-968-091 |
| Lucy | 527-948-090 |
| Lucy | 647-968-598 |
| Lucy | 577-928-094 |
+-----------+------------------+
除了内连接、外连接和交叉连接外。Hive还支持一些在关系型数据库中见不到的特殊连接。如 MapJoin 和 Semi-join。
MapJoin意味着只使用map执行join操作,而不使用reduce作业。MapJoin语句会将小表的所有数据读取到内存中,然后广播到所有map中。在map阶段,通过比较大表和小表中的每一行数据与join条件来执行join操作。由于不需要reduce,这种类型的join通常会有更好的性能表现。在Hive的新版本中,如果可能,Hive会在运行时自动将join转换为MapJoin。用户也可以通过提供一个join提示标记 /*+ MAPJOIN(table_name) */ 手动指定广播表。此外,MapJoin可以用于unequal join来提高性能。因为MapJoin 和 WHERE语句都在map阶段执行。下面是一个在CROSS JOIN中使用Mapjoin的例子:
SELECT /*+ MAPJOIN(employee) */ emp.name, emph.sin_number
FROM employee emp
CROSS JOIN employee_hr emph WHERE emp.name <> emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 527-948-090 |
| Michael | 647-968-598 |
| Michael | 577-928-094 |
| Will | 547-968-091 |
| Will | 647-968-598 |
| Will | 577-928-094 |
| Shelley | 547-968-091 |
| Shelley | 527-948-090 |
| Shelley | 647-968-598 |
| Shelley | 577-928-094 |
| Lucy | 547-968-091 |
| Lucy | 527-948-090 |
| Lucy | 647-968-598 |
+-----------+------------------+
Mapjoin的使用有以下的限制:
Bucket MapJoin是MapJoin的一种特殊类型,它可以使用bucket字段(建表时由CLUSTERED BY指定的字段)作为join的条件。相比常规Mapjoin需要获取整个表,Bucket Mapjoin只需要获取请求的bucket数据。要使用Bucket Mapjoin,需要启用一些设置,以确保bucket数是彼此的倍数。如果join的2个表使用相同的bucket数进行排序和分桶,则会执行一个sort-merge join,而不是在内存中缓存所有小表。
> SET hive.optimize.bucketmapjoin = true;
> SET hive.optimize.bucketmapjoin.sortedmerge = true;
> SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
此外,LEFT SEMI JOIN也是MapJoin的一类。它与Hive 0.13.0 之后使用IN/EXISTS的子查询是一样的。但建议用户不要使用这种方式,毕竟这不是标准的SQL。
> SELECT a.name FROM employee a
LEFT SEMI JOIN employee_id b ON a.name = b.name;
+----------+
| a.name |
+----------+
| Michael |
| Will |
| Shelley |
| Lucy |
+----------+
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
书籍 Apache Hive Essentials Second Edition (by Dayong Du) Chapter 4