I've been an old SQL boy for years, yet this is the first time I've really noticed the EXISTS operator.
I had always handled this kind of logic with IN / NOT IN, without any real problems.
But EXISTS is supported by every major database, including Spark SQL and Flink SQL,
so it got me curious. Let's dig in~
First, let's ask GPT-4.
SQL EXISTS and IN are both conditional operators used with subqueries, but they differ in the following ways:
How they work: an EXISTS condition is TRUE as soon as the subquery returns at least one row. An IN condition is TRUE when the tested value equals any value in the result set returned by the subquery.
Performance: in most cases EXISTS is faster than IN, because EXISTS only has to check whether any matching row exists, while IN has to scan the whole list and compare against every value in it.
Two examples to illustrate the difference:
EXISTS example:
SELECT *
FROM customers
WHERE EXISTS (
    SELECT *
    FROM orders
    WHERE orders.customer_id = customers.customer_id AND orders.product_id = 123
);
This query returns every customer who ordered product 123. The EXISTS subquery is correlated: for each customers row, it checks whether at least one matching order exists, and if so the WHERE clause keeps that row, returning all of its customers columns.
IN example:
SELECT *
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE orders.product_id = 123
);
This query returns the same customer list. Here the subquery first collects the customer_id of every order for product 123, and the outer WHERE keeps a customers row whenever its customer_id appears in that list.
Let's verify this in Spark with a small local test:

package Function.exists

import org.apache.spark.sql.{DataFrame, SparkSession}

object ExistsTest {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ExistsTest").getOrCreate()

    // Customers: (customer_id, user_name, user_full_name).
    // Ids 4 through 9 deliberately share the user_name "Name4".
    val customers: Seq[(Int, String, String)] = Seq(
      (1, "Name1", "Name1 Full Name"),
      (2, "Name2", "Name2 Full Name"),
      (3, "Name3", "Name3 Full Name"),
      (4, "Name4", "Name4 Full Name"),
      (5, "Name4", "Name5 Full Name"),
      (6, "Name4", "Name6 Full Name"),
      (7, "Name4", "Name7 Full Name"),
      (8, "Name4", "Name8 Full Name"),
      (9, "Name4", "Name9 Full Name"),
      (10, "Name10", "Name10 Full Name"),
      (11, "Name11", "Name11 Full Name"),
      (12, "Name12", "Name12 Full Name"),
      (13, "Name13", "Name13 Full Name")
    )

    // Orders: (customer_id, name, amount, flag, product_id, at).
    // Customers 1-9 ordered product 123, customers 10-13 ordered product 1.
    val orders: Seq[(Int, String, Double, Boolean, Double, String)] = Seq(
      (1, "Name1", 1000.00, true, 123.00, "2014-02-13 00:01:00"),
      (2, "Name2", 2000.00, true, 123.00, "2014-02-13 00:01:00"),
      (3, "Name3", 3000.00, true, 123.00, "2014-02-13 00:01:00"),
      (4, "Name4", 4000.00, true, 123.00, "2014-02-13 00:01:00"),
      (5, "Name4", 5000.00, true, 123.00, "2014-02-13 00:01:00"),
      (6, "Name4", 6000.00, true, 123.00, ""),
      (7, "Name4", 7000.00, true, 123.00, ""),
      (8, "Name4", 8000.00, true, 123.00, "2014-02-13 00:01:00"),
      (9, "Name4", 9000.00, true, 123.00, "2014-02-13 00:01:00"),
      (10, "Name10", 10000.00, true, 1.00, "2014-02-13 00:01:00"),
      (11, "Name11", 11000.00, true, 1.00, "2014-02-13 00:01:00"),
      (12, "Name12", 12000.00, true, 1.00, "2014-02-13 00:01:00"),
      (13, "Name13", 13000.00, true, 1.00, "2014-02-13 00:01:00")
    )

    val customersDF: DataFrame = spark.createDataFrame(customers)
      .toDF("customer_id", "user_name", "user_full_name")
    customersDF.createTempView("customers")

    val ordersDF: DataFrame = spark.createDataFrame(orders)
      .toDF("customer_id", "name", "amount", "flag", "product_id", "at")
    ordersDF.createTempView("orders")

    customersDF.show()
    ordersDF.show()

    // Correlated EXISTS subquery.
    val exists_result = spark.sql(
      """
        |SELECT *, "exists_result" AS logic
        |FROM customers
        |WHERE EXISTS (
        |  SELECT *
        |  FROM orders
        |  WHERE orders.customer_id = customers.customer_id AND orders.product_id = 123
        |)
        |""".stripMargin)
    exists_result.explain()
    exists_result.show()

    // Uncorrelated IN subquery.
    val in_result = spark.sql(
      """
        |SELECT *, "in_result" AS logic
        |FROM customers
        |WHERE customer_id IN (
        |  SELECT customer_id
        |  FROM orders
        |  WHERE orders.product_id = 123
        |)
        |""".stripMargin)
    in_result.explain()
    in_result.show()

    spark.close()
  }
}
Physical plan for exists_result:
== Physical Plan ==
*(2) Project [customer_id#6, user_name#7, user_full_name#8, exists_result AS logic#75]
+- *(2) BroadcastHashJoin [customer_id#6], [customer_id#24], LeftSemi, BuildRight, false
:- *(2) LocalTableScan [customer_id#6, user_name#7, user_full_name#8]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#35]
+- *(1) Project [customer_id#24]
+- *(1) LocalTableScan [customer_id#24, name#25, amount#26, flag#27, product_id#28, at#29]
Physical plan for in_result:
== Physical Plan ==
*(1) Project [customer_id#6, user_name#7, user_full_name#8, in_result AS logic#98]
+- *(1) BroadcastHashJoin [customer_id#6], [customer_id#24], LeftSemi, BuildRight, false
:- *(1) LocalTableScan [customer_id#6, user_name#7, user_full_name#8]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#88]
+- LocalTableScan [customer_id#24]
In Spark 3.1.2 the two plans are almost identical; if anything the IN version is leaner, with one step fewer (it scans customer_id directly instead of projecting it out of the full orders scan).
So why does GPT claim EXISTS is more efficient? The textbook explanation goes like this (sketched in code after the list):
IN: the subquery is executed only once; its result set is then matched against the outer table (conceptually a Cartesian product followed by the filter condition). When the inner table is small, IN is faster.
EXISTS: specifies a subquery that tests for the existence of rows. Execution loops over the outer table and, for each outer row, checks whether the inner table holds a matching record; rows that match enter the result set.
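To make that concrete, here is a toy sketch of the two naive evaluation strategies in plain Scala, with collections standing in for tables. This illustrates the textbook model only; it is not how Catalyst actually executes either query.

// Toy tables: customers as ids, orders as (customer_id, product_id).
val customers = Seq(1, 2, 3, 4, 5)
val orders = Seq((1, 123), (2, 123), (3, 1))

// IN-style: run the subquery once, materialize its result, filter the outer table against it.
val idsWithProduct123 = orders.collect { case (cid, pid) if pid == 123 => cid }.toSet
val inStyle = customers.filter(idsWithProduct123.contains)

// EXISTS-style: loop over the outer table, probing the inner table once per outer row.
val existsStyle = customers.filter { cid =>
  orders.exists { case (ocid, pid) => ocid == cid && pid == 123 }
}

assert(inStyle == existsStyle) // both keep customers 1 and 2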
The conventional wisdom on IN versus EXISTS:
If the subquery returns a small result set while the outer table is large and indexed, prefer IN;
if the outer table is small while the inner table is large and indexed, prefer EXISTS.
The real distinction is that they change the driving order, and that is the key to the performance difference: with EXISTS the outer table is the driving table and is accessed first, while with IN the subquery runs first. Since we want the driving side to return quickly, the indexes and the relative result-set sizes are what matter. Note also that IN does not handle NULL: if the subquery result contains a NULL, a NOT IN comparison evaluates to UNKNOWN and matches nothing.
It is often said that IN hash-joins the outer and inner tables, while EXISTS loops over the outer table and queries the inner table on every iteration. Either way, the long-standing belief that EXISTS is always more efficient than IN is simply inaccurate.
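That NULL behavior deserves a concrete demo. A minimal sketch, assuming a hypothetical nullable_orders view whose customer_id column contains a NULL (the view name and data are made up for illustration):

import org.apache.spark.sql.SparkSession

object NotInNullTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("NotInNullTest").getOrCreate()

    spark.createDataFrame(Seq((1, "a"), (2, "b"), (3, "c")))
      .toDF("customer_id", "name")
      .createTempView("customers")

    // One order row has a NULL customer_id.
    spark.createDataFrame(Seq((Option(1), 123.0), (Option.empty[Int], 123.0)))
      .toDF("customer_id", "product_id")
      .createTempView("nullable_orders")

    // NOT IN against a list containing NULL: `2 NOT IN (1, NULL)` is UNKNOWN
    // under three-valued logic, so this returns NO rows at all.
    spark.sql(
      """SELECT * FROM customers
        |WHERE customer_id NOT IN (SELECT customer_id FROM nullable_orders)
        |""".stripMargin).show()

    // NOT EXISTS simply never matches the NULL row (NULL = x is never TRUE),
    // so customers 2 and 3 come back as expected.
    spark.sql(
      """SELECT * FROM customers
        |WHERE NOT EXISTS (
        |  SELECT 1 FROM nullable_orders o WHERE o.customer_id = customers.customer_id
        |)
        |""".stripMargin).show()

    spark.close()
  }
}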
Out of curiosity, let's also compare NOT EXISTS with = versus != in the correlated predicate:
spark.sql(
"""
|
|SELECT *,"not_EXISTS" as logic
|FROM customers
|WHERE not EXISTS (
| SELECT *
| FROM orders
| WHERE orders.customer_id = customers.customer_id AND orders.product_id = 123
|)
|
|""".stripMargin).explain()
spark.sql(
"""
|
|SELECT *,"!=" as logic
|FROM customers
|WHERE not EXISTS (
| SELECT *
| FROM orders
| WHERE orders.customer_id = customers.customer_id AND orders.product_id != 123
|)
|
|""".stripMargin).explain()
== Physical Plan ==
*(2) Project [customer_id#6, user_name#7, user_full_name#8, not_EXISTS AS logic#121]
+- *(2) BroadcastHashJoin [customer_id#6], [customer_id#24], LeftAnti, BuildRight, false
:- *(2) LocalTableScan [customer_id#6, user_name#7, user_full_name#8]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#145]
+- *(1) Project [customer_id#24]
+- *(1) LocalTableScan [customer_id#24, name#25, amount#26, flag#27, product_id#28, at#29]
== Physical Plan ==
*(2) Project [customer_id#6, user_name#7, user_full_name#8, != AS logic#127]
+- *(2) BroadcastHashJoin [customer_id#6], [customer_id#24], LeftAnti, BuildRight, false
:- *(2) LocalTableScan [customer_id#6, user_name#7, user_full_name#8]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [id=#173]
+- *(1) Project [customer_id#24]
+- *(1) LocalTableScan [customer_id#24, name#25, amount#26, flag#27, product_id#28, at#29]
Hilariously, the two physical plans come out identical: whether the correlated filter is product_id = 123 or product_id != 123, both NOT EXISTS queries compile down to the same LeftAnti BroadcastHashJoin shape.
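Which also hints at the direct route: since Catalyst rewrites NOT EXISTS into a left-anti join anyway, the DataFrame API can express it without a subquery at all. A sketch, reusing the customersDF and ordersDF frames from the test program above:

import org.apache.spark.sql.functions.col

// Equivalent of the NOT EXISTS query: customers with no order for product 123.
val notExistsViaAntiJoin = customersDF.join(
  ordersDF.filter(col("product_id") === 123),
  Seq("customer_id"),
  "left_anti"
)
notExistsViaAntiJoin.explain() // expect the same LeftAnti join as the SQL version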