References:
Hive difference-set operations explained
https://blog.csdn.net/dr_guo/article/details/51182626
Subtracting one set from another in Hive (Hive sets)
http://www.bkjia.com/yjs/942686.html
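In day-to-day work we write all kinds of SQL, and sooner or later we need the intersection, union, or difference of result sets.
Intersection:
A plain JOIN is usually enough.
Union:
UNION ALL is usually enough (it keeps duplicate rows; use UNION if duplicates must be removed).
Difference:
The difference is less straightforward, so this article summarizes the ways to implement it.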
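As a minimal sketch of the intersection and union cases, assume two hypothetical tables t_a and t_b, each with a single id column (these names are illustrative only and are not used elsewhere in this article):

```sql
-- Intersection: ids present in both tables.
-- DISTINCT guards against repeated rows when an id occurs more than once.
SELECT DISTINCT a.id
FROM t_a AS a
JOIN t_b AS b ON (a.id = b.id);

-- Union that keeps duplicates: every row of t_a followed by every row of t_b.
SELECT id FROM t_a
UNION ALL
SELECT id FROM t_b;
```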
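This article is organized as follows:
1. What a difference set is
2. Approaches to set difference in Hive and their drawbacks
3. Scenario
4. Implementing the difference with NOT IN
5. Implementing the difference with LEFT OUTER JOIN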
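What is a difference set:
Definition:
Given two sets A and B, the set of all elements that belong to A but not to B is called the difference of A and B (A minus B), written A - B or A\B:
A - B = {x | x ∈ A and x ∉ B}
Likewise, B - A = {x | x ∈ B and x ∉ A} is the difference of B and A.
For example, if A = {1, 2, 3} and B = {2, 3, 4}, then A - B = {1} and B - A = {4}.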
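Approaches to set difference in Hive, with pros and cons:
There are three main ways to compute a difference in Hive:
1. LEFT OUTER JOIN (recommended, the most common)
2. NOT IN (not recommended)
3. NOT EXISTS (not recommended)
Of the three, LEFT OUTER JOIN is the one to prefer.
Drawback of NOT IN:
A NOT IN subquery can only compare a single column; multiple columns are not supported.
Drawback shared by NOT IN and NOT EXISTS:
They can be planned as a Cartesian product, which Hive rejects by default, so an extra configuration parameter has to be set before the query will run.
Both also appear to be resource-intensive and are restricted in ODPS. The rest of this article therefore concentrates on the usual Hive approach, a left (or right) outer join.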
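Scenario:
Here we set up a scenario to test the approaches above.
Suppose a business requirement involves two tables, reserve_order and finish_order.
When a user buys something, the order is written to the order table, reserve_order.
When the order is completed, it is written to the finished-order table, finish_order.
An order carries the same order number (orderid) in both tables.
Requirement 1:
Orders that were placed but never finished.
Requirement 2:
Orders that were placed and finished, but whose finish date differs from the order date.
For this we create two Hive tables:
Order table: reserve_order
Finished-order table: finish_order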
create table reserve_order (id int COMMENT 'auto increment id', orderid int COMMENT 'the order id' , cal_date string COMMENT 'order id create time');
create table finish_order (id int COMMENT 'auto increment id', orderid int COMMENT 'the order id' , cal_date string COMMENT 'order id finish time');
Next, take a look at the data in the two tables.
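The data listing itself is not included here. As a hypothetical data set that is consistent with the query results shown later in this article (the specific rows are an assumption, not the author's data), the tables could be populated as follows, assuming a Hive version that supports INSERT ... VALUES (0.14 or later):

```sql
-- Hypothetical sample rows; not the original author's data.
INSERT INTO TABLE reserve_order VALUES
  (1, 100, '2018-07-05'),
  (2, 101, '2018-07-05'),
  (3, 102, '2018-07-05');

-- Order 100 is finished on its order date, order 102 a day later,
-- and order 101 is never finished.
INSERT INTO TABLE finish_order VALUES
  (1, 100, '2018-07-05'),
  (2, 102, '2018-07-06');

-- Inspect both tables.
SELECT * FROM reserve_order;
SELECT * FROM finish_order;
```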
Implementing the difference with NOT IN:
This section applies NOT IN to both requirements. First, a problem you may run into:
0: jdbc:hive2://master:10000> SELECT finish.* FROM finish_order AS finish WHERE finish.orderid NOT IN (SELECT reserve.orderid FROM reserve_order AS reserve);
Error: Error while compiling statement: FAILED: SemanticException Cartesian products are disabled for safety reasons. If you know what you are doing, please sethive.strict.checks.cartesian.product to false and that hive.mapred.mode is not set to 'strict' to proceed. Note that if you may get errors or incorrect results if you make a mistake while using some of the unsafe features. (state=42000,code=40000)
Fix:
Set this property to false in the Hive client:
set hive.strict.checks.cartesian.product=false;
Requirement 1:
Orders that were placed but never finished.
The SQL is as follows:
SELECT reserve.*
FROM reserve_order AS reserve
WHERE reserve.orderid NOT IN (SELECT finish.orderid FROM finish_order AS finish);
+-------------+------------------+-------------------+--+
| reserve.id  | reserve.orderid  | reserve.cal_date  |
+-------------+------------------+-------------------+--+
| 2           | 101              | 2018-07-05        |
+-------------+------------------+-------------------+--+
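One further caveat with NOT IN follows from standard SQL three-valued logic, which Hive is expected to follow: if the subquery returns even one NULL orderid, the NOT IN predicate never evaluates to true and the query returns no rows. A defensive sketch, assuming finish_order.orderid may contain NULLs, filters them out inside the subquery:

```sql
-- Same query as above, with NULL orderids excluded from the subquery
-- so that a single NULL cannot empty the whole result.
SELECT reserve.*
FROM reserve_order AS reserve
WHERE reserve.orderid NOT IN (
  SELECT finish.orderid
  FROM finish_order AS finish
  WHERE finish.orderid IS NOT NULL
);
```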
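Requirement 2:
Orders that were placed and finished, but whose finish date differs from the order date.
Orders that were placed and also finished: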
SELECT reserve.*
FROM reserve_order AS reserve
JOIN finish_order AS finish ON (reserve.orderid = finish.orderid);
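Orders that were placed and finished, but whose finish date differs from the order date (attempted with a multi-column NOT IN):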
SELECT reserve.*
FROM reserve_order AS reserve
JOIN finish_order AS finish ON (reserve.orderid = finish.orderid)
WHERE (reserve.orderid, resrve.`date`) NOT IN (SELECT finish.orderid, finish.`date` FROM finish_order AS finish);
0: jdbc:hive2://master:10000> SELECT reserve.* FROM reserve_order AS reserve JOIN finish_order AS finish ON (reserve.orderid = finish.orderid) WHERE (reserve.orderid, resrve.`date`) NOT IN (SELECT finish.orderid, finish.`date` FROM finish_order AS finish);
Error: Error while compiling statement: FAILED: SemanticException Line 1:202 Invalid SubQuery expression 'date' in definition of SubQuery sq_1 [
IN (SELECT finish.orderid, finish.date FROM finish_order AS finish)
] used as sq_1 at Line 0:-1: SubQuery can contain only 1 item in Select List. (state=42000,code=40000)
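As the error shows, Hive only supports IN / NOT IN subqueries that select a single column. (The query also misspells the alias reserve as resrve and refers to `date` where the column is actually cal_date, but compilation already fails at the subquery check.)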
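NOT EXISTS, the third approach listed earlier, is not demonstrated in this article. A minimal sketch for requirement 1, assuming a Hive version with subquery support in the WHERE clause (0.13 or later). Whether the Cartesian-product setting shown above is also needed may depend on the Hive version and was not verified here:

```sql
-- Orders in reserve_order for which no matching row exists in finish_order.
SELECT reserve.*
FROM reserve_order AS reserve
WHERE NOT EXISTS (
  SELECT 1
  FROM finish_order AS finish
  WHERE finish.orderid = reserve.orderid
);
```

Because the comparison lives in the subquery's WHERE clause, further equality conditions could in principle be added there to compare more than one column.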
Implementing the difference with LEFT OUTER JOIN:
Requirement 1:
Orders that were placed but never finished.
Query:
SELECT reserve.*
FROM reserve_order AS reserve
LEFT OUTER JOIN finish_order AS finish ON (reserve.orderid = finish.orderid)
WHERE finish.orderid IS NULL;
or, equivalently (the OUTER keyword is optional):
SELECT reserve.*
FROM reserve_order AS reserve
LEFT JOIN finish_order AS finish ON (reserve.orderid = finish.orderid)
WHERE finish.orderid IS NULL;
+-------------+------------------+-------------------+--+
| reserve.id  | reserve.orderid  | reserve.cal_date  |
+-------------+------------------+-------------------+--+
| 2           | 101              | 2018-07-05        |
+-------------+------------------+-------------------+--+
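A point to note here:
1. To test for NULL, use IS NULL rather than = NULL. A comparison with = NULL never evaluates to true, so the filter would return no rows.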
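On newer Hive releases the difference can also be written directly with the EXCEPT set operator. This is a sketch under the assumption that your Hive version supports EXCEPT (reported to be available from the 2.3 line onward; check the documentation for your distribution). Note that EXCEPT compares entire rows of the two SELECT lists, so this version returns only the distinct orderid values rather than the full reserve_order rows:

```sql
-- orderids that appear in reserve_order but not in finish_order.
SELECT orderid FROM reserve_order
EXCEPT
SELECT orderid FROM finish_order;
```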
Requirement 2:
Orders that were placed and finished, but whose finish date differs from the order date.
SELECT reserve.*
FROM reserve_order AS reserve
JOIN finish_order AS finish ON (reserve.orderid = finish.orderid)
WHERE finish.`cal_date` != reserve.`cal_date`;
+-------------+------------------+-------------------+--+
| reserve.id  | reserve.orderid  | reserve.cal_date  |
+-------------+------------------+-------------------+--+
| 3           | 102              | 2018-07-05        |
+-------------+------------------+-------------------+--+
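The multi-column comparison that NOT IN rejected can be expressed with the same LEFT OUTER JOIN pattern by putting every compared column into the ON clause. The following sketch returns the (orderid, cal_date) pairs that are present in reserve_order but absent from finish_order. This is the general multi-column difference, so unlike requirement 2 it also includes orders that were never finished:

```sql
-- Anti-join on two columns: keep reserve rows with no finish row
-- that matches on both orderid and cal_date.
SELECT reserve.orderid, reserve.cal_date
FROM reserve_order AS reserve
LEFT OUTER JOIN finish_order AS finish
  ON (reserve.orderid = finish.orderid
      AND reserve.cal_date = finish.cal_date)
WHERE finish.orderid IS NULL;
```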