The Optimizer is the phase that runs after the Analyzer has produced the Resolved Logical Plan, applying optimization rules to that plan.
The first batch contains five optimization rules, each of which executes only once.
EliminateSubqueryAliases removes query aliases, i.e. the SubqueryAlias nodes in the logical operator tree. Subquery aliases exist only to provide scoping information for name resolution, so once the Analyzer phase is finished they are no longer needed; the rule simply replaces each SubqueryAlias with its child node.
In the SQL below, the subquery is aliased as t; the Analyzed Logical Plan still contains a SubqueryAlias t node, while the Optimized Logical Plan does not.
explain extended select sum(len) from ( select c1,length(c1) len from t1 group by c1) t;
== Analyzed Logical Plan ==
sum(len): bigint
Aggregate [sum(len#56) AS sum(len)#64L]
+- SubqueryAlias t
+- Aggregate [c1#62], [c1#62, length(c1#62) AS len#56]
+- SubqueryAlias spark_catalog.test.t1
+- HiveTableRelation [`test`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#62], Partition Cols: []]
== Optimized Logical Plan ==
Aggregate [sum(len#56) AS sum(len)#64L]
+- Aggregate [c1#62], [length(c1#62) AS len#56]
+- HiveTableRelation [`test`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#62], Partition Cols: []]
ReplaceExpressions performs expression substitution.
It contains four replacement rules, shown below:
case e: RuntimeReplaceable => e.child
case CountIf(predicate) => Count(new NullIf(predicate, Literal.FalseLiteral))
case BoolOr(arg) => Max(arg)
case BoolAnd(arg) => Min(arg)
RuntimeReplaceable is a trait with many concrete subclasses; each one replaces itself with its child node. For example, the child of Nvl is Coalesce(Seq(left, right)), so during optimization nvl is replaced by that Coalesce expression.
case class Nvl(left: Expression, right: Expression, child: Expression) extends RuntimeReplaceable {
  def this(left: Expression, right: Expression) = {
    this(left, right, Coalesce(Seq(left, right)))
  }
}
explain extended SELECT nvl(c1,c2) FROM VALUES ('v1', 'v12'), ('v2', 'v22'), ('v3', 'v32') AS tab(c1, c2);
Output:
== Analyzed Logical Plan ==
nvl(c1, c2): string
Project [nvl(c1#85, c2#86) AS nvl(c1, c2)#87]
+- SubqueryAlias tab
+- LocalRelation [c1#85, c2#86]
== Optimized Logical Plan ==
LocalRelation [nvl(c1, c2)#87]
bool_or is replaced with max:
explain extended SELECT bool_or(col) FROM
VALUES (true), (false), (false) AS tab(col);
Output:
== Analyzed Logical Plan ==
bool_or(col): boolean
Aggregate [bool_or(col#101) AS bool_or(col)#103]
+- SubqueryAlias tab
+- LocalRelation [col#101]
== Optimized Logical Plan ==
Aggregate [max(col#101) AS bool_or(col)#103]
+- LocalRelation [col#101]
bool_and is replaced with min:
explain extended SELECT bool_and(col) FROM
VALUES (true), (false), (false) AS tab(col);
Output:
== Analyzed Logical Plan ==
bool_and(col): boolean
Aggregate [bool_and(col#112) AS bool_and(col)#114]
+- SubqueryAlias tab
+- LocalRelation [col#112]
== Optimized Logical Plan ==
Aggregate [min(col#112) AS bool_and(col)#114]
+- LocalRelation [col#112]
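The BoolOr → Max and BoolAnd → Min rewrites are sound because booleans order as false < true, so a max over a boolean column computes a logical OR and a min computes a logical AND. A quick illustrative check (plain Python, not Spark code):

```python
# False < True, so max over booleans behaves like a logical OR and min like a
# logical AND -- the property behind the BoolOr -> Max / BoolAnd -> Min rewrites.
vals = [True, False, False]

assert max(vals) == any(vals)   # bool_or semantics
assert min(vals) == all(vals)   # bool_and semantics
```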
ComputeCurrentTime evaluates expressions that depend on the current time. A single SQL statement may contain several such expressions, e.g. CurrentDate and CurrentTimestamp; the rule guarantees that all of them return the same value within one query.
// currentTimestampMicros, currentTime, timezone and instant are captured once
// per query, so every matched expression below receives the same value.
subQuery.transformAllExpressionsWithPruning(transformCondition) {
  case cd: CurrentDate =>
    Literal.create(DateTimeUtils.microsToDays(currentTimestampMicros, cd.zoneId), DateType)
  case CurrentTimestamp() | Now() => currentTime
  case CurrentTimeZone() => timezone
  case localTimestamp: LocalTimestamp =>
    val asDateTime = LocalDateTime.ofInstant(instant, localTimestamp.zoneId)
    Literal.create(localDateTimeToMicros(asDateTime), TimestampNTZType)
}
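The key point is that the clock is read exactly once per query, and that single value is substituted into every time expression. A toy Python model of this behavior (illustrative class names, not Catalyst's API):

```python
import time
from dataclasses import dataclass

# Toy expression nodes -- simplified stand-ins for Spark's expressions.
@dataclass
class Now:            # an unevaluated "current time" expression
    pass

@dataclass
class Literal:
    value: float

def compute_current_time(exprs):
    """Replace every Now() with one shared Literal, mirroring how the rule
    evaluates the clock once per query rather than once per expression."""
    captured = time.time()          # evaluated exactly once
    return [Literal(captured) if isinstance(e, Now) else e for e in exprs]

replaced = compute_current_time([Now(), Literal(1.0), Now()])
# Both Now() occurrences receive the identical timestamp.
assert replaced[0].value == replaced[2].value
```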
CombineUnions merges adjacent Union nodes into a single Union node, as in the following SQL:
explain extended
select c1 from t1
union
select c1 from t1 where length(c1) = 2
union
select c1 from t1 where length(c1) = 3;
The output is below: the Analyzed Logical Plan contains two Unions, while the Optimized Logical Plan contains only one.
== Analyzed Logical Plan ==
c1: string
Distinct
+- Union false, false
:- Distinct
: +- Union false, false
: :- Project [c1#161]
: : +- SubqueryAlias spark_catalog.test.t1
: : +- HiveTableRelation [`test`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#161], Partition Cols: []]
: +- Project [c1#162]
: +- Filter (length(c1#162) = 2)
: +- SubqueryAlias spark_catalog.test.t1
: +- HiveTableRelation [`test`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#162], Partition Cols: []]
+- Project [c1#163]
+- Filter (length(c1#163) = 3)
+- SubqueryAlias spark_catalog.test.t1
+- HiveTableRelation [`test`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#163], Partition Cols: []]
== Optimized Logical Plan ==
Aggregate [c1#161], [c1#161]
+- Union false, false
:- HiveTableRelation [`test`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#161], Partition Cols: []]
:- Filter (isnotnull(c1#162) AND (length(c1#162) = 2))
: +- HiveTableRelation [`test`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#162], Partition Cols: []]
+- Filter (isnotnull(c1#163) AND (length(c1#163) = 3))
+- HiveTableRelation [`test`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#163], Partition Cols: []]
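The flattening itself can be sketched as a small tree rewrite (toy Python node classes with hypothetical names, not Spark's implementation):

```python
from dataclasses import dataclass

# Toy plan nodes; hypothetical names, not Spark's actual classes.
@dataclass
class Scan:
    name: str

@dataclass
class Union:
    children: list

def combine_unions(plan):
    """Flatten directly nested Union nodes into a single Union,
    analogous to the effect of Spark's CombineUnions rule."""
    if not isinstance(plan, Union):
        return plan
    flat = []
    for child in plan.children:
        child = combine_unions(child)
        if isinstance(child, Union):
            flat.extend(child.children)   # absorb the nested union's children
        else:
            flat.append(child)
    return Union(flat)

nested = Union([Union([Scan("a"), Scan("b")]), Scan("c")])
assert combine_unions(nested) == Union([Scan("a"), Scan("b"), Scan("c")])
```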
When an SQL statement contains a subquery, a SubqueryExpression is generated in the logical operator tree. Whenever the OptimizeSubqueries rule encounters a SubqueryExpression, it invokes the Optimizer recursively on that expression's subplan.
The next group of rules performs operator replacement: certain query operators can be rewritten directly in terms of existing operators, avoiding redundant logical conversions.
ReplaceIntersectWithSemiJoin replaces the Intersect operator with a left-semi Join; logically the two are equivalent. Note that this rule applies only to INTERSECT DISTINCT statements, not to INTERSECT ALL. Moreover, duplicate attributes must be eliminated before the rule runs, otherwise the generated join condition would be incorrect.
Example:
create table t1(c1 string) stored as textfile;
create table t2(c1 string) stored as textfile;
load data local inpath '/etc/profile' overwrite into table t1;
load data local inpath '/etc/profile' overwrite into table t2;
Find the values of length 4:
select c1 from t1 where length(c1)=4;
Output:
else
else
else
done
Time taken: 0.064 seconds, Fetched 4 row(s)
explain extended
select c1 from t2 where length(c1)<5
intersect distinct
select c1 from t1 where length(c1)=4;
The output is below: the Analyzed Logical Plan contains an Intersect, which becomes a Join LeftSemi in the Optimized Logical Plan.
== Analyzed Logical Plan ==
c1: string
Intersect false
:- Project [c1#149]
: +- Filter (length(c1#149) < 5)
: +- SubqueryAlias spark_catalog.hzz.t2
: +- HiveTableRelation [`hzz`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#149], Partition Cols: []]
+- Project [c1#150]
+- Filter (length(c1#150) = 4)
+- SubqueryAlias spark_catalog.hzz.t1
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#150], Partition Cols: []]
== Optimized Logical Plan ==
Aggregate [c1#149], [c1#149]
+- Join LeftSemi, (c1#149 <=> c1#150)
:- Filter (isnotnull(c1#149) AND (length(c1#149) < 5))
: +- HiveTableRelation [`hzz`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#149], Partition Cols: []]
+- Filter (isnotnull(c1#150) AND (length(c1#150) = 4))
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#150], Partition Cols: []]
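The equivalence the rule relies on can be checked on plain Python values. (The real rule compares keys with the null-safe `<=>` operator; this toy version ignores NULL handling.)

```python
def intersect_distinct(left, right):
    """Reference semantics of INTERSECT DISTINCT."""
    return sorted(set(left) & set(right))

def left_semi_join_distinct(left, right):
    """The rewritten form: a left-semi join (keep left rows that have a match
    on the right) followed by de-duplication, mirroring the shape produced by
    ReplaceIntersectWithSemiJoin plus the Aggregate for DISTINCT."""
    right_keys = set(right)
    kept = [row for row in left if row in right_keys]   # semi join
    return sorted(set(kept))                            # distinct

l = ["done", "else", "done", "fi"]
r = ["done", "case"]
assert intersect_distinct(l, r) == left_semi_join_distinct(l, r) == ["done"]
```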
ReplaceExceptWithAntiJoin replaces Except with a left-anti Join. Example:
explain extended
select c1 from t2 where length(c1) <=5
except
select c1 from t1 where length(c1)=4;
Output:
== Analyzed Logical Plan ==
c1: string
Except false
:- Project [c1#156]
: +- Filter (length(c1#156) <= 5)
: +- SubqueryAlias spark_catalog.hzz.t2
: +- HiveTableRelation [`hzz`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#156], Partition Cols: []]
+- Project [c1#157]
+- Filter (length(c1#157) = 4)
+- SubqueryAlias spark_catalog.hzz.t1
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#157], Partition Cols: []]
== Optimized Logical Plan ==
Aggregate [c1#156], [c1#156]
+- Join LeftAnti, (c1#156 <=> c1#157)
:- Filter (isnotnull(c1#156) AND (length(c1#156) <= 5))
: +- HiveTableRelation [`hzz`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#156], Partition Cols: []]
+- Filter (isnotnull(c1#157) AND (length(c1#157) = 4))
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#157], Partition Cols: []]
ReplaceDistinctWithAggregate replaces the Distinct operator with an Aggregate that groups by all output columns. Example:
explain extended
select distinct c1 from t1;
Output:
== Analyzed Logical Plan ==
c1: string
Distinct
+- Project [c1#163]
+- SubqueryAlias spark_catalog.hzz.t1
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#163], Partition Cols: []]
== Optimized Logical Plan ==
Aggregate [c1#163], [c1#163]
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#163], Partition Cols: []]
RemoveLiteralFromGroupExpressions removes constants from the GROUP BY clause. In the example below, every grouping expression is a constant, so they are replaced with the literal 0:
explain extended
select sum(length(c1)) from t1 group by 'aa','bb';
== Analyzed Logical Plan ==
sum(length(c1)): bigint
Aggregate [aa, bb], [sum(length(c1#189)) AS sum(length(c1))#191L]
+- SubqueryAlias spark_catalog.hzz.t1
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#189], Partition Cols: []]
== Optimized Logical Plan ==
Aggregate [0], [sum(length(c1#189)) AS sum(length(c1))#191L]
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#189], Partition Cols: []]
RemoveRepetitionFromGroupExpressions removes duplicate expressions from the GROUP BY clause, e.g.:
explain extended
select sum(length(c1)) from t1 group by c1,c1;
Output:
== Analyzed Logical Plan ==
sum(length(c1)): bigint
Aggregate [c1#201, c1#201], [sum(length(c1#201)) AS sum(length(c1))#203L]
+- SubqueryAlias spark_catalog.hzz.t1
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#201], Partition Cols: []]
== Optimized Logical Plan ==
Aggregate [c1#201], [sum(length(c1#201)) AS sum(length(c1))#203L]
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#201], Partition Cols: []]
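The de-duplication is essentially an order-preserving distinct over the grouping expressions. A minimal sketch (toy Python, assuming expressions compare by value; not Spark's implementation):

```python
def remove_repetition(group_exprs):
    """Drop duplicate grouping expressions while preserving their order,
    analogous to RemoveRepetitionFromGroupExpressions."""
    seen, out = set(), []
    for e in group_exprs:
        if e not in seen:
            seen.add(e)
            out.append(e)
    return out

# group by c1, c1  ->  group by c1
assert remove_repetition(["c1", "c1"]) == ["c1"]
```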
The operator optimization rules fall into three broad groups: 1. operator pushdown; 2. operator combining; 3. constant folding and expression simplification.
Operator pushdown covers predicate pushdown and column pruning; operator combining merges adjacent operators of the same kind. The individual rules are listed below:
Optimization rule | Effect |
---|---|
PushProjectionThroughUnion | pushes projections (column pruning) through Union |
ReorderJoin | join reordering; unrelated to CostBasedJoinReorder |
EliminateOuterJoin | eliminates outer joins |
PushPredicateThroughJoin | pushes predicates down into Join operators |
PushDownPredicate | predicate pushdown |
LimitPushDown | pushes Limit operators down |
ColumnPruning | column pruning |
InferFiltersFromConstraints | infers additional filters from constraints |
CollapseRepartition | merges repartition operators |
CollapseProject | merges projection operators |
CollapseWindow | merges Window operators |
CombineFilters | merges Filter operators |
CombineLimits | merges Limit operators |
CombineUnions | merges Union operators |
NullPropagation | null propagation |
FoldablePropagation | propagates foldable expressions |
OptimizeIn | optimizes In predicates |
ConstantFolding | constant folding |
ReorderAssociativeOperator | reorders associative operators |
LikeSimplification | simplifies Like predicates |
BooleanSimplification | simplifies Boolean expressions |
SimplifyConditionals | simplifies conditional expressions |
RemoveDispensableExpressions | removes dispensable expressions |
SimplifyBinaryComparison | simplifies binary comparisons |
PruneFilters | prunes filter conditions |
EliminateSorts | eliminates redundant sorts |
SimplifyCasts | simplifies Cast expressions |
SimplifyCaseConversionExpressions | simplifies case-conversion expressions |
RewriteCorrelatedScalarSubquery | rewrites correlated scalar subqueries |
EliminateSerialization | eliminates serialization |
RemoveAliasOnlyProject | removes alias-only projections |
explain extended
select t1.c1 from t1 join t2
on t1.c1=t2.c1
where t2.c1='done';
From t2.c1 = t1.c1 together with t2.c1 = 'done', the optimizer infers t1.c1 = 'done'.
== Analyzed Logical Plan ==
c1: string
Project [c1#235]
+- Filter (c1#236 = done)
+- Join Inner, (c1#235 = c1#236)
:- SubqueryAlias spark_catalog.hzz.t1
: +- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#235], Partition Cols: []]
+- SubqueryAlias spark_catalog.hzz.t2
+- HiveTableRelation [`hzz`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#236], Partition Cols: []]
== Optimized Logical Plan ==
Project [c1#235]
+- Join Inner, (c1#235 = c1#236)
:- Filter ((c1#235 = done) AND isnotnull(c1#235))
: +- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#235], Partition Cols: []]
+- Filter (isnotnull(c1#236) AND (c1#236 = done))
+- HiveTableRelation [`hzz`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#236], Partition Cols: []]
In the Analyzed Logical Plan, the Filter still contains (1 + (2 * 3)); in the Optimized Logical Plan it has become the concrete value 7.
explain extended
select c1 from t1 where length(c1)> 1+2*3;
== Analyzed Logical Plan ==
c1: string
Project [c1#266]
+- Filter (length(c1#266) > (1 + (2 * 3)))
+- SubqueryAlias spark_catalog.hzz.t1
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#266], Partition Cols: []]
== Optimized Logical Plan ==
Filter (isnotnull(c1#266) AND (length(c1#266) > 7))
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#266], Partition Cols: []]
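Constant folding evaluates literal-only subtrees bottom-up at optimization time. A toy sketch of the idea (hypothetical node classes, not Catalyst's):

```python
from dataclasses import dataclass

# Toy expression nodes; hypothetical names, not Spark's classes.
@dataclass
class Lit:
    value: int

@dataclass
class Add:
    left: object
    right: object

@dataclass
class Mul:
    left: object
    right: object

def fold(expr):
    """Bottom-up constant folding: if both operands reduce to literals,
    evaluate the operator now (cf. the effect of Spark's ConstantFolding)."""
    if isinstance(expr, Lit):
        return expr
    l, r = fold(expr.left), fold(expr.right)
    if isinstance(l, Lit) and isinstance(r, Lit):
        op = (lambda a, b: a + b) if isinstance(expr, Add) else (lambda a, b: a * b)
        return Lit(op(l.value, r.value))
    return type(expr)(l, r)   # keep the node if a side is non-constant

# 1 + (2 * 3) folds to the literal 7, matching the plan above.
assert fold(Add(Lit(1), Mul(Lit(2), Lit(3)))) == Lit(7)
```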
In the following SQL, the always-true condition 1 < 2 can be eliminated.
explain extended
select c1 from t1 where 1 < 2 and length(c1) = 4;
== Analyzed Logical Plan ==
c1: string
Project [c1#272]
+- Filter ((1 < 2) AND (length(c1#272) = 4))
+- SubqueryAlias spark_catalog.hzz.t1
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#272], Partition Cols: []]
== Optimized Logical Plan ==
Filter (isnotnull(c1#272) AND (length(c1#272) = 4))
+- HiveTableRelation [`hzz`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#272], Partition Cols: []]
CheckCartesianProducts detects Cartesian-product joins in the logical operator tree. If such a join exists but the SQL does not explicitly use a cross join, an exception is thrown. When spark.sql.crossJoin.enabled is true, this rule is skipped.
In general, aggregate queries that involve decimal precision handling suffer a significant performance penalty. For fixed-precision Decimal types, the DecimalAggregates rule executes the aggregation over the unscaled Long representation instead, which speeds up the aggregation.
When the logical operator tree contains two adjacent TypedFilter conditions over objects of the same type, the CombineTypedFilters rule merges them into a single filter function.
ConvertToLocalRelation converts local operations on a LocalRelation into a new LocalRelation. For example, VALUES ('v1', 'v12'), ('v2', 'v22'), ('v3', 'v32') AS tab(c1, c2) is a local relation.
explain extended
SELECT c1 FROM VALUES
('v1', 'v12'), ('v2', 'v22'), ('v3', 'v32')
AS tab(c1, c2) where c1='v1';
Output: the Parsed Logical Plan contains an UnresolvedInlineTable, which the Analyzed Logical Plan resolves to a LocalRelation. The Optimized Logical Plan is reduced to a single LocalRelation: the relation together with the operations on top of it is folded into one new LocalRelation.
== Parsed Logical Plan ==
'Project ['c1]
+- 'Filter ('c1 = v1)
+- 'SubqueryAlias tab
+- 'UnresolvedInlineTable [c1, c2], [[v1, v12], [v2, v22], [v3, v32]]
== Analyzed Logical Plan ==
c1: string
Project [c1#323]
+- Filter (c1#323 = v1)
+- SubqueryAlias tab
+- LocalRelation [c1#323, c2#324]
== Optimized Logical Plan ==
LocalRelation [c1#323]
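The effect of the rule can be modeled as eagerly applying the Filter and Project over the in-memory rows, yielding a new local relation (toy Python, not Spark's implementation):

```python
def convert_to_local_relation(rows, predicate, projection):
    """Eagerly apply Filter and Project over in-memory rows, producing a new
    local relation -- the observable effect of ConvertToLocalRelation."""
    return [projection(row) for row in rows if predicate(row)]

# Mirrors the query above: filter c1 = 'v1', then project c1.
rows = [("v1", "v12"), ("v2", "v22"), ("v3", "v32")]
result = convert_to_local_relation(rows,
                                   predicate=lambda r: r[0] == "v1",
                                   projection=lambda r: r[0])
assert result == ["v1"]   # the whole subtree collapses to this one-row relation
```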
PropagateEmptyRelation folds away empty LocalRelations.
explain extended
select t1.c1 from (
SELECT c1 FROM VALUES
('v1', 'v12'), ('v2', 'v22'), ('v3', 'v32') AS tab(c1, c2)
where c1='v4'
)t1 join (
SELECT c1 FROM
VALUES ('v1', 'v12'), ('v2', 'v22'), ('v3', 'v32') AS tab(c1, c2) where c1='v4'
)t2 where t1.c1=t2.c1;
The result is below. In the Analyzed Logical Plan, two subqueries are still joined. In the Optimized Logical Plan, there is only a single LocalRelation, marked <empty>: after optimization both subqueries are empty LocalRelations, and joining them yields an empty LocalRelation as well.
== Analyzed Logical Plan ==
c1: string
Project [c1#337]
+- Filter (c1#337 = c1#339)
+- Join Inner
:- SubqueryAlias t1
: +- Project [c1#337]
: +- Filter (c1#337 = v4)
: +- SubqueryAlias tab
: +- LocalRelation [c1#337, c2#338]
+- SubqueryAlias t2
+- Project [c1#339]
+- Filter (c1#339 = v4)
+- SubqueryAlias tab
+- LocalRelation [c1#339, c2#340]
== Optimized Logical Plan ==
LocalRelation <empty>, [c1#337]
== Physical Plan ==
LocalTableScan <empty>, [c1#337]
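The propagation can be sketched as a rewrite that short-circuits an inner join as soon as either input is empty (toy Python; the real rule also covers other operators and join types):

```python
from dataclasses import dataclass

# Toy plan nodes; hypothetical names, not Spark's classes.
@dataclass
class LocalRelation:
    rows: list

@dataclass
class Join:
    left: object
    right: object

def propagate_empty(plan):
    """If either side of an inner join is an empty LocalRelation, replace the
    whole join with an empty LocalRelation (cf. PropagateEmptyRelation)."""
    if isinstance(plan, Join):
        left, right = propagate_empty(plan.left), propagate_empty(plan.right)
        if (isinstance(left, LocalRelation) and not left.rows) or \
           (isinstance(right, LocalRelation) and not right.rows):
            return LocalRelation([])      # the <empty> relation in the plan above
        return Join(left, right)
    return plan

plan = Join(LocalRelation([]), LocalRelation([("v1",)]))
assert propagate_empty(plan) == LocalRelation([])
```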
Note that the current Optimizer no longer contains an OptimizeCodegen rule.
This batch contains two optimization rules: RewritePredicateSubquery and CollapseProject.