Solving the slow from_json + regexp_replace combined-expression problem in Spark 3.1.1

The problem is in fact already fixed in Spark 3.4.x; see SPARK-44700 for details. The fix boils down to setting spark.sql.optimizer.collapseProjectAlwaysInline to false (false is already the default).
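If you ever need to set it explicitly, a minimal sketch for spark-shell (the config key is the real SQLConf mentioned above; everything else is illustrative):

// Keep the default: do not force-inline producer expressions when collapsing Projects.
spark.conf.set("spark.sql.optimizer.collapseProjectAlwaysInline", "false")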
But how does Spark 3.4.x actually solve it?
Take the following SQL as an example:
Seq("""{"a":1, "b":0.8}""").toDF("s").write.saveAsTable("t")
// spark.sql.planChangeLog.level warn
val df = sql(
"""
|SELECT j.*
|FROM (SELECT from_json(regexp_replace(s, 'a', 'new_a'), 'new_a INT, b DOUBLE') AS j
| FROM t) tmp
|""".stripMargin)
df.explain(true)
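The commented line above refers to Spark's plan-change logger, which is what produces the rule-application traces quoted below. A small sketch of enabling it (the config key and value are real Spark settings; "warn" makes the traces appear at the default log level):

// Log every plan change made by an optimizer rule at WARN level.
spark.conf.set("spark.sql.planChangeLog.level", "warn")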
In Spark 3.1.1 the following transformations are applied:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
Before:
Project [j#17.new_a AS new_a#19, j#17.b AS b#20]
+- Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)) AS j#17]
   +- Relation[s#18] parquet
After:
Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]
+- Relation[s#18] parquet
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprs ===
Before:
Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]
+- Relation[s#18] parquet
After:
Project [from_json(StructField(new_a,IntegerType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]
+- Relation[s#18] parquet
The final physical plan is:
== Physical Plan ==
Project [from_json(StructField(new_a,IntegerType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).new_a AS new_a#19, from_json(StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)).b AS b#20]
+- *(1) ColumnarToRow
+- FileScan parquet default.t[s#18] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/jiahong.li/xmalaya/github/spark/spark-warehouse/org.apache.spark.sq..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s:string>
Here regexp_replace is evaluated once per extracted column (new_a and b): however many columns there are, that is how many times it runs. In other words, there is a real performance loss.
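Put differently, after CollapseProject and OptimizeJsonExprs the query effectively behaves as if it had been written like this (my hand-translation of the optimized plan above back into SQL, including the per-field schema pruning):

sql(
  """
    |SELECT from_json(regexp_replace(s, 'a', 'new_a'), 'new_a INT').new_a AS new_a,
    |       from_json(regexp_replace(s, 'a', 'new_a'), 'b DOUBLE').b AS b
    |FROM t
    |""".stripMargin).explain(true)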
In Spark 3.4.0 neither of the above transformations (CollapseProject and OptimizeJsonExprs) fires, and the generated physical plan is:
== Physical Plan ==
*(2) Project [j#17.new_a AS new_a#20, j#17.b AS b#21]
+- Project [from_json(StructField(new_a,IntegerType,true), StructField(b,DoubleType,true), regexp_replace(s#18, a, new_a, 1), Some(America/Los_Angeles)) AS j#17]
+- *(1) ColumnarToRow
+- FileScan parquet spark_catalog.default.t[s#18] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/jiahong.li/xmalaya/ultimate-git/spark/spark-warehouse/org...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<s:string>
However many fields the outer projection extracts from j (new_a and b here), regexp_replace is evaluated only once, since it stays inside the single inner Project that produces j.
Let's analyze why. The key lies in the CollapseProject rule:
def apply(plan: LogicalPlan): LogicalPlan = {
  apply(plan, conf.getConf(SQLConf.COLLAPSE_PROJECT_ALWAYS_INLINE))
}
...
def apply(plan: LogicalPlan, alwaysInline: Boolean): LogicalPlan = {
  plan.transformUpWithPruning(_.containsPattern(PROJECT), ruleId) {
    case p1 @ Project(_, p2: Project)
        if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline) =>
      p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList))
    ...
The most important part is the canCollapseExpressions method, which decides whether the two Projects can be merged (the overload called above simply converts the producer project list into an attribute-to-expression map and delegates to the version shown here):
def canCollapseExpressions(
    consumers: Seq[Expression],
    producerMap: Map[Attribute, Expression],
    alwaysInline: Boolean = false): Boolean = {
  consumers
    .filter(_.references.exists(producerMap.contains))
    .flatMap(collectReferences)
    .groupBy(identity)
    .mapValues(_.size)
    .forall {
      case (reference, count) =>
        val producer = producerMap.getOrElse(reference, reference)
        val relatedConsumers = consumers.filter(_.references.contains(reference))
        def cheapToInlineProducer: Boolean = trimAliases(producer) match {
          case e @ (_: CreateNamedStruct | _: UpdateFields | _: CreateMap | _: CreateArray) =>
            var nonCheapAccessSeen = false
            def nonCheapAccessVisitor(): Boolean = {
              try {
                // Returns false on the first access, true on every later one.
                nonCheapAccessSeen
              } finally {
                nonCheapAccessSeen = true
              }
            }
            !relatedConsumers.exists(findNonCheapAccesses(_, reference, e, nonCheapAccessVisitor))
          case other => isCheap(other)
        }
        producer.deterministic && (count == 1 || alwaysInline || cheapToInlineProducer)
    }
}
A producer like JsonToStructs(RegExpReplace(...)), i.e. from_json(regexp_replace(...)), is handled by the case other branch, so the decision comes down to isCheap:
def isCheap(e: Expression): Boolean = e match {
  case _: Attribute | _: OuterReference => true
  case _ if e.foldable => true
  // PythonUDF is handled by the rule ExtractPythonUDFs
  case _: PythonUDF => true
  // Alias and ExtractValue are very cheap.
  case _: Alias | _: ExtractValue => e.children.forall(isCheap)
  case _ => false
}
For from_json(regexp_replace(...)), isCheap falls through to the final case _ => false, so cheapToInlineProducer is false. Meanwhile count is 2 (both new_a and b reference j), and alwaysInline is false by default. The producer is deterministic, but count == 1 || alwaysInline || cheapToInlineProducer evaluates to false, so canCollapseExpressions returns false. Hence in Spark 3.4.0 CollapseProject is not applied to this plan, and the performance loss disappears.
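To recap the decision for this plan, here is a small self-contained sketch of the counting logic in plain Scala (the values come from the analysis above; the simplified types and names are mine, not Spark's):

object CollapseDecision extends App {
  // Both output columns of the upper Project (new_a and b) reference the attribute j.
  val consumerRefs = Seq("j", "j")
  // Facts established above for the producer from_json(regexp_replace(...)):
  val deterministic = true   // regexp_replace and from_json are deterministic
  val alwaysInline  = false  // spark.sql.optimizer.collapseProjectAlwaysInline default
  val cheap         = false  // isCheap(...) fell through to case _ => false

  val counts = consumerRefs.groupBy(identity).map { case (r, rs) => r -> rs.size }
  val canCollapse = counts.values.forall { count =>
    deterministic && (count == 1 || alwaysInline || cheap)
  }
  println(s"count(j) = ${counts("j")}, canCollapse = $canCollapse")
  // prints: count(j) = 2, canCollapse = false
}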
So the difference between the two plans boils down to how the CollapseProject rule behaves in Spark 3.1.1 versus Spark 3.4.0.
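You can observe the difference directly on Spark 3.4.x by flipping the flag and comparing the plans (a spark-shell sketch; the config key is the one discussed above):

// Default: the collapse is rejected, regexp_replace stays in its own Project (evaluated once).
spark.conf.set("spark.sql.optimizer.collapseProjectAlwaysInline", "false")
df.explain(true)

// Force the old behavior: the Projects are merged and regexp_replace is duplicated
// into every extracted field, as in Spark 3.1.1.
spark.conf.set("spark.sql.optimizer.collapseProjectAlwaysInline", "true")
df.explain(true)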