The Spark 2.3.0 release notes highlight the performance gain brought by the vectorized ORC reader:
Vectorized ORC Reader: [SPARK-16060] Adds support for new ORC reader that substantially improves the ORC scan throughput through vectorization (2-5x). To enable the reader, users can set spark.sql.orc.impl to native.
There is some background to this JIRA (https://issues.apache.org/jira/browse/SPARK-16060). The ORC file format was proposed by Hortonworks as a columnar storage scheme for Hive queries; it extends Facebook's RCFile format to a certain degree and improves on it in several respects.
Its storage layout is as follows. A file is divided into stripes, and each stripe consists of:
IndexData
RowData
StripeFooter
...
* IndexData holds a lightweight index for the stripe: row positions within the stripe together with per-column statistics such as min/max and row counts;
* RowData holds the actual data, stored as a set of streams;
* StripeFooter records the location and encoding of each stream in the stripe;
* The FileFooter contains file-level statistics as well as the location, row count, and column type information of each stripe;
* The Postscript stores the compression parameters and the length of the compressed FileFooter.
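To get a concrete file with this layout, it is enough to write any DataFrame as ORC (a minimal sketch; the session name spark and the output path are assumptions):
spark.range(0, 1000000)
  .selectExpr("id", "cast(id as string) as name")
  .write.orc("/tmp/orc_demo") // each produced file consists of stripes, a FileFooter and a Postscript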
The spark.sql.orc.impl option takes one of two values, native or hive. If hive is chosen for an ORC source, the original org.apache.spark.sql.hive.orc.OrcFileFormat implementation class is loaded directly, while native loads the new org.apache.spark.sql.execution.datasources.orc.OrcFileFormat class instead. The dispatch happens in DataSource.lookupDataSource:
/** Given a provider name, look up the data source class definition. */
def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
  val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
    case name if name.equalsIgnoreCase("orc") &&
        conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
      classOf[OrcFileFormat].getCanonicalName
    case name if name.equalsIgnoreCase("orc") &&
        conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "hive" =>
      "org.apache.spark.sql.hive.orc.OrcFileFormat"
    case name => name
  }
  ...
}
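Switching an existing session to the new reader is therefore just a configuration change (a minimal sketch; the session name spark and the path are assumptions):
spark.conf.set("spark.sql.orc.impl", "native")
val df = spark.read.orc("/tmp/orc_demo") // now resolved to the new OrcFileFormat by lookupDataSource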
When is the vectorized ORC read path actually used?
override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
  val conf = sparkSession.sessionState.conf
  conf.orcVectorizedReaderEnabled && conf.wholeStageEnabled &&
    schema.length <= conf.wholeStageMaxNumFields &&
    schema.forall(_.dataType.isInstanceOf[AtomicType])
}
All of the following conditions must hold (see the sketch after this list):
* spark.sql.orc.enableVectorizedReader is enabled (default true);
* spark.sql.codegen.wholeStage is enabled (default true), and the schema has no more than wholeStageMaxNumFields columns (default 100);
* [key] every column's data type must be an AtomicType.
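The first two switches can be inspected or set on the session directly (a minimal sketch; spark is an assumed SparkSession):
spark.conf.get("spark.sql.orc.enableVectorizedReader") // "true" by default
spark.conf.get("spark.sql.codegen.wholeStage")         // "true" by default
spark.conf.get("spark.sql.codegen.maxFields")          // "100", i.e. wholeStageMaxNumFields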
What counts as an AtomicType can be seen from its definition:
/**
 * An internal type used to represent everything that is not null, UDTs, arrays, structs, and maps.
 */
protected[sql] abstract class AtomicType extends DataType {
  private[sql] type InternalType
  private[sql] val tag: TypeTag[InternalType]
  private[sql] val ordering: Ordering[InternalType]
}
AtomicType covers everything that is not null, a UDT, an array, a struct, or a map. So if the schema contains any column of those types, the query still cannot benefit from this improvement.
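A quick way to check ahead of time whether a schema qualifies is to enumerate the complex types by hand, since AtomicType itself is package-private and cannot be referenced from user code (a minimal sketch; likelyAtomic is a hypothetical helper, not a Spark API):
import org.apache.spark.sql.types._

// `AtomicType` is protected[sql], so an external check has to enumerate the
// complex types instead; UDTs are also excluded, but their class is private to Spark.
def likelyAtomic(dt: DataType): Boolean = dt match {
  case NullType | _: ArrayType | _: MapType | _: StructType => false
  case _ => true
}

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("tags", ArrayType(StringType))))

schema.forall(f => likelyAtomic(f.dataType)) // false: the array column disables the vectorized path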
In OrcFileFormat.buildReaderWithPartitionValues:
if (enableVectorizedReader) {
  val batchReader = new OrcColumnarBatchReader(
    enableOffHeapColumnVector && taskContext.isDefined, copyToSpark)
  // SPARK-23399 Register a task completion listener first to call `close()` in all cases.
  // There is a possibility that `initialize` and `initBatch` hit some errors (like OOM)
  // after opening a file.
  val iter = new RecordReaderIterator(batchReader)
  Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => iter.close()))
  // call initialize
  batchReader.initialize(fileSplit, taskAttemptContext)
  // call initBatch
  batchReader.initBatch(
    reader.getSchema,
    requestedColIds,
    requiredSchema.fields,
    partitionSchema,
    file.partitionValues)
  // hand the batches back as an Iterator[InternalRow]
  iter.asInstanceOf[Iterator[InternalRow]]
} else {
  val orcRecordReader = new OrcInputFormat[OrcStruct]
    .createRecordReader(fileSplit, taskAttemptContext)
  val iter = new RecordReaderIterator[OrcStruct](orcRecordReader)
  Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => iter.close()))
  val fullSchema = requiredSchema.toAttributes ++ partitionSchema.toAttributes
  val unsafeProjection = GenerateUnsafeProjection.generate(fullSchema, fullSchema)
  val deserializer = new OrcDeserializer(dataSchema, requiredSchema, requestedColIds)
  if (partitionSchema.length == 0) {
    iter.map(value => unsafeProjection(deserializer.deserialize(value)))
  } else {
    val joinedRow = new JoinedRow()
    iter.map(value =>
      unsafeProjection(joinedRow(deserializer.deserialize(value), file.partitionValues)))
  }
}
* initialize(): sets up the ORC file reader and the Hadoop configuration;
* initBatch(): allocates the batch and columnarBatch variables (batch receives each vectorized read from the ORC reader, while columnarBatch holds the values converted to Spark's own columnar representation for codegen);
* nextBatch(): drives the iteration; at its core it calls ORC's own vectorized read and converts the values to Spark-defined types where required.
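A rough way to see the effect end to end is to time the same scan under both implementations in spark-shell (a minimal sketch; the path and the time helper are assumptions, and proper measurements should use the benchmark referenced below):
def time[T](body: => T): Long = {
  val start = System.nanoTime(); body; (System.nanoTime() - start) / 1000000 // elapsed ms
}

spark.conf.set("spark.sql.orc.impl", "hive")
val hiveMs = time(spark.read.orc("/tmp/orc_demo").selectExpr("sum(id)").collect())

spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
val nativeMs = time(spark.read.orc("/tmp/orc_demo").selectExpr("sum(id)").collect())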
Reference: org.apache.spark.sql.hive.orc.OrcReadBenchmark
Summary of the results:
For the numeric and String column tests:
Native ORC Vectorized > Native ORC Vectorized with copy > Native ORC MR > Hive built-in ORC
For the partitioned versus non-partitioned tests:
partitioned performance far exceeds non-partitioned performance.