In both Spark 2 and Spark 3, comparing a String with a Decimal column makes Spark implicitly cast both sides to Double. Because the cast wraps the column, the resulting Filter can no longer be pushed down as a data filter. Related community tickets are listed below, followed by a minimal sketch of the coercion:
[SPARK-17913][SQL] compare atomic and string type column may return confusing result
[SPARK-22469][SQL] Accuracy problem in comparison with string and numeric
SPARK-29274: Should not coerce decimal type to double type when it’s join column
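A minimal sketch of the coercion in isolation, assuming a running SparkSession named spark (the exact expression text may vary slightly by version):
// Comparing a decimal value with a string literal: the analyzer promotes both sides to double.
spark.sql("SELECT cast(12.1 as decimal(13,2)) = '12.1'").explain(true)
// The analyzed plan is expected to contain something like:
//   (cast(cast(12.1 as decimal(13,2)) as double) = cast(12.1 as double))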
withTable("t1") {
sql("CREATE TABLE t1 USING PARQUET " +
"SELECT cast(id + 0.1 as decimal(13,2)) as salary FROM range(0, 100)")
sql("select * from t1 where salary = '12.1' ").collect()
}
Because the Filter casts the Decimal column to Double, the equality predicate cannot be pushed down to the data source:
== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (cast(salary#276 as double) = 12.1))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (cast(salary#276 as double) = 12.1)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution..., PartitionFilters: [], PushedFilters: [IsNotNull(salary)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []
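Note how the equality appears only in DataFilters, not in PushedFilters. A quick way to check this is the formatted explain mode (a sketch; assumes Spark 3.0+, where Dataset.explain accepts a mode string):
// The FileScan section of the formatted plan lists PushedFilters explicitly.
spark.sql("select * from t1 where salary = '12.1'").explain("formatted")
// Expected: PushedFilters contains only IsNotNull(salary); the equality is wrapped in
// cast(salary as double), so it cannot be translated into a Parquet filter.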
Detailed plan transformation process:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings ===
 'Project [*]                                               'Project [*]
!+- 'Filter (salary#277 = 12.1)                             +- Filter (cast(salary#277 as double) = cast(12.1 as double))
    +- SubqueryAlias spark_catalog.default.t1               +- SubqueryAlias spark_catalog.default.t1
       +- Relation default.t1[id#276L,salary#277] parquet      +- Relation default.t1[id#276L,salary#277] parquet
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantFolding ===
 Project [id#276L, salary#277]                                      Project [id#276L, salary#277]
!+- Filter (cast(salary#277 as double) = cast(12.1 as double))      +- Filter (cast(salary#277 as double) = 12.1)
    +- Relation default.t1[id#276L,salary#277] parquet                 +- Relation default.t1[id#276L,salary#277] parquet
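The rule-by-rule trace above comes from Spark's plan change logging. A sketch of how to enable it (assuming Spark 3.1+, where the session config key is spark.sql.planChangeLog.level):
// Log every rule that changes the plan; ERROR keeps the output visible
// without touching the global log4j configuration.
spark.conf.set("spark.sql.planChangeLog.level", "ERROR")
spark.sql("select * from t1 where salary = '12.1'").collect()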
sql("select * from t1 where salary = cast('12.1' as decimal) ").collect()
This way of writing the query is wrong: '12.1' is cast to DECIMAL with no precision or scale, i.e. decimal(10,0), which truncates the value to 12 (shown as 12.00 after promotion to the column's decimal(13,2)). The predicate is now pushable, but it compares against the wrong value, so the query returns incorrect results.
== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (salary#276 = 12.00))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (salary#276 = 12.00)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution..., PartitionFilters: [], PushedFilters: [IsNotNull(salary), EqualTo(salary,12.00)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []
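The default precision and scale are easy to verify (a sketch assuming a SparkSession named spark):
// DECIMAL with no explicit precision/scale defaults to decimal(10,0),
// so the fractional part of '12.1' is dropped before the comparison.
spark.sql("SELECT cast('12.1' as decimal)").printSchema()   // decimal(10,0)
spark.sql("SELECT cast('12.1' as decimal)").first()         // [12]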
sql("select * from t1 where salary = cast('12.1' as decimal(13,2)) ").collect()
This is the correct way to write the query: the literal is cast to the column's exact type, decimal(13,2).
== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (salary#276 = 12.10))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (salary#276 = 12.10)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution..., PartitionFilters: [], PushedFilters: [IsNotNull(salary), EqualTo(salary,12.10)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []
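The same formatted explain check now shows the equality in PushedFilters (a sketch):
// PushedFilters is expected to contain EqualTo(salary,12.10) next to IsNotNull(salary),
// so the Parquet source can evaluate the predicate instead of Spark filtering every row after the scan.
spark.sql("select * from t1 where salary = cast('12.1' as decimal(13,2))").explain("formatted")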
Detailed plan transformation process:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.DecimalPrecision ===
 'Project [*]                                               'Project [*]
!+- 'Filter (salary#277 = 12.1)                             +- Filter (cast(salary#277 as decimal(13,2)) = cast(12.1 as decimal(13,2)))
    +- SubqueryAlias spark_catalog.default.t1               +- SubqueryAlias spark_catalog.default.t1
       +- Relation default.t1[id#276L,salary#277] parquet      +- Relation default.t1[id#276L,salary#277] parquet
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantFolding ===
 Project [id#276L, salary#277]                                                    Project [id#276L, salary#277]
!+- Filter (cast(salary#277 as decimal(13,2)) = cast(12.1 as decimal(13,2)))      +- Filter (cast(salary#277 as decimal(13,2)) = 12.10)
    +- Relation default.t1[id#276L,salary#277] parquet                               +- Relation default.t1[id#276L,salary#277] parquet
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.SimplifyCasts ===
 Project [id#276L, salary#277]                                Project [id#276L, salary#277]
!+- Filter (cast(salary#277 as decimal(13,2)) = 12.10)       +- Filter (salary#277 = 12.10)
    +- Relation default.t1[id#276L,salary#277] parquet          +- Relation default.t1[id#276L,salary#277] parquet
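To recap, a sketch comparing the three variants against the table built above (the expected counts follow from salary = id + 0.1 for id in [0, 100); they are assumptions, not output copied from a run):
// string literal: both sides cast to double; returns the expected row here, but no pushdown,
// and double comparison can be lossy in general (see SPARK-22469)
spark.sql("select * from t1 where salary = '12.1'").count()                         // expected: 1
// DECIMAL defaults to decimal(10,0): pushed down, but compares against 12.00
spark.sql("select * from t1 where salary = cast('12.1' as decimal)").count()        // expected: 0
// decimal(13,2): pushed down and correct
spark.sql("select * from t1 where salary = cast('12.1' as decimal(13,2))").count()  // expected: 1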