In the earlier article on Hudi's Compaction operation, we saw that completeTableService actually includes a deleteMarker step. Why does this step exist?
This goes back to Spark DataSource V2. Once DataSource V2 was introduced, Hudi's file writes go mainly through the V2TableWriteExec class:
sparkContext.runJob(
  rdd,
  (context: TaskContext, iter: Iterator[InternalRow]) =>
    DataWritingSparkTask.run(writerFactory, context, iter, useCommitCoordinator),
  rdd.partitions.indices,
  (index, result: DataWritingSparkTaskResult) => {
    val commitMessage = result.writerCommitMessage
    messages(index) = commitMessage
    totalNumRowsAccumulator.add(result.numRows)
    batchWrite.onDataWriterCommit(commitMessage)
  }
)
The DataWritingSparkTask.run method looks like this:
while (iter.hasNext) {
  // Count is here.
  count += 1
  dataWriter.write(iter.next())
}

val msg = if (useCommitCoordinator) {
  val coordinator = SparkEnv.get.outputCommitCoordinator
  val commitAuthorized = coordinator.canCommit(stageId, stageAttempt, partId, attemptId)
  if (commitAuthorized) {
    logInfo(s"Commit authorized for partition $partId (task $taskId, attempt $attemptId, " +
      s"stage $stageId.$stageAttempt)")
    dataWriter.commit()
  } else {
    val message = s"Commit denied for partition $partId (task $taskId, attempt $attemptId, " +
      s"stage $stageId.$stageAttempt)"
    logInfo(message)
    // throwing CommitDeniedException will trigger the catch block for abort
    throw new CommitDeniedException(message, stageId, partId, attemptId)
  }
} else {
  logInfo(s"Writer for partition ${context.partitionId()} is committing.")
  dataWriter.commit()
}
As mentioned in a previous article, the core of this flow is the write / commit / close trilogy on the dataWriter.
This brings us to the dataWriter variable. In Spark's native implementation, dataWriter is either a SingleDirectoryDataWriter or a DynamicPartitionDataWriter. Looking at the constructors of these two classes, both take a committer of type FileCommitProtocol, and this committer plays a key role in the write / commit / close operations above:
in task.write, the data is first written under a temporary directory, and then in task.commit the files in that temporary directory are moved into the directory they actually need to land in.
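As a minimal sketch of this two-phase pattern, assuming a HadoopMapReduceCommitProtocol-style committer: newTaskTempFile and commitTask are real FileCommitProtocol methods, but the wrapper class TwoPhaseWriteSketch below is purely illustrative, not Spark's actual writer code.

import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.internal.io.FileCommitProtocol

// Purely illustrative wrapper around the temp-then-move pattern used by
// SingleDirectoryDataWriter / DynamicPartitionDataWriter.
class TwoPhaseWriteSketch(committer: FileCommitProtocol, taskContext: TaskAttemptContext) {

  // write(): ask the committer for a path under the task's temporary directory
  // and stream rows into a file at that path.
  def writeOneFile(): String = {
    val tmpPath = committer.newTaskTempFile(taskContext, None, ".parquet")
    // ... open an OutputWriter on tmpPath and write the InternalRows ...
    tmpPath
  }

  // commit(): the committer moves this task's temp files towards the final
  // output directory; the driver later finalizes the job with commitJob.
  def commitTask(): FileCommitProtocol.TaskCommitMessage =
    committer.commitTask(taskContext)
}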
Now compare with Hudi, where the dataWriter is HoodieBulkInsertDataInternalWriter:
this.bulkInsertWriterHelper = new BulkInsertDataInternalWriterHelper(hoodieTable,
    writeConfig, instantTime, taskPartitionId, taskId, 0, structType, populateMetaFields, arePartitionRecordsSorted);

@Override
public void write(InternalRow record) throws IOException {
  bulkInsertWriterHelper.write(record);
}

@Override
public WriterCommitMessage commit() throws IOException {
  return new HoodieWriterCommitMessage(bulkInsertWriterHelper.getWriteStatuses());
}
The actual writing is done by BulkInsertDataInternalWriterHelper, and it writes directly into the final target directory rather than into a temporary directory.
Why do it this way? What are the advantages and disadvantages?
Advantage: data is written straight to the target directory with no second copy, which improves write efficiency.
Disadvantage: if Spark speculative execution kicks in, the same data can be written to different files, producing duplicate and therefore incorrect data (illustrated in the sketch below).
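To make the duplication concrete, here is a toy sketch. Hudi base file names roughly follow <fileId>_<writeToken>_<instantTime>.parquet, where the write token encodes the task attempt; the helper below is hypothetical and the naming simplified, but it shows why two attempts of the same task leave two data files in the final directory.

// Hypothetical helper; the naming is a simplification of Hudi's
// <fileId>_<writeToken>_<instantTime>.parquet base file name pattern.
def dataFileName(fileId: String, taskPartitionId: Int, attemptNumber: Int, instantTime: String): String =
  s"${fileId}_${taskPartitionId}-${attemptNumber}_${instantTime}.parquet"

// The original attempt and a speculative attempt of the same task write the same rows:
//   dataFileName("f1", 3, 0, "20230723075500")  // f1_3-0_20230723075500.parquet
//   dataFileName("f1", 3, 1, "20230723075500")  // f1_3-1_20230723075500.parquet
// Both files end up in the final partition directory, so the rows are duplicated
// until one of them is reconciled away.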
This is why Hudi introduced the marker mechanism.
While each data file is being written, a marker file is also created under the .hoodie/.temp/<instantTime> directory, for example .hoodie/.temp/202307237055/f1.parquet.marker.CREATE.
The marker file is created in the constructor of HoodieRowCreateHandle:
HoodiePartitionMetadata partitionMetadata =
    new HoodiePartitionMetadata(
        fs,
        instantTime,
        new Path(writeConfig.getBasePath()),
        FSUtils.getPartitionPath(writeConfig.getBasePath(), partitionPath),
        table.getPartitionMetafileFormat());
partitionMetadata.trySave(taskPartitionId);

createMarkerFile(partitionPath, fileName, instantTime, table, writeConfig);
HoodieRowCreateHandle is in turn invoked from the BulkInsertDataInternalWriterHelper.write method.
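The resulting layout of a (direct) marker relative to its data file can be sketched as below; markerPathFor is a hypothetical helper rather than a Hudi API, and for a partitioned table the partition path sits under the instant directory.

import org.apache.hadoop.fs.Path

// Hypothetical helper showing the direct-marker layout described above.
def markerPathFor(basePath: String, instantTime: String,
                  partitionPath: String, dataFileName: String, ioType: String): Path =
  // <basePath>/.hoodie/.temp/<instantTime>/<partitionPath>/<dataFileName>.marker.<IOType>
  new Path(s"$basePath/.hoodie/.temp/$instantTime/$partitionPath/$dataFileName.marker.$ioType")

// markerPathFor("/data/hudi_tbl", "20230723075500", "dt=2023-07-23", "f1.parquet", "CREATE")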
Because these marker files exist, the invalid data files need to be cleaned up once the write finishes (the cleanup runs after the job completes). This cleanup is triggered from batchWrite.commit in V2TableWriteExec, which is HoodieDataSourceInternalBatchWrite.commit:
@Override
public void commit(WriterCommitMessage[] messages) {
  List<HoodieWriteStat> writeStatList = Arrays.stream(messages).map(m -> (HoodieWriterCommitMessage) m)
      .flatMap(m -> m.getWriteStatuses().stream().map(HoodieInternalWriteStatus::getStat))
      .collect(Collectors.toList());
  dataSourceInternalWriterHelper.commit(writeStatList);
}
The call chain is as follows:
HoodieDataSourceInternalBatchWrite.commit
||
\/
dataSourceInternalWriterHelper.commit
||
\/
SparkRDDWriteClient.commitStats
||
\/
SparkRDDWriteClient.commit
||
\/
SparkRDDWriteClient.finalizeWrite
||
\/
HoodieTable.finalizeWrite
||
\/
HoodieTable.reconcileAgainstMarkers
||
\/
HoodieTable.getInvalidDataPaths
||
\/
markers.createdAndMergedDataPaths
In the reconcileAgainstMarkers method, invalid data files are deleted based on the marker files.
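Conceptually the reconciliation boils down to a set difference, which the hypothetical sketch below illustrates (it is not the Hudi implementation):

// Hypothetical illustration of the reconcile idea, not the Hudi implementation:
// files recorded by the markers minus files the driver actually committed are
// the orphan files left behind by failed or speculative task attempts.
def orphanDataPaths(markerDataPaths: Set[String], committedDataPaths: Set[String]): Set[String] =
  markerDataPaths -- committedDataPaths

// Everything returned here had a marker written for it but was never
// acknowledged by the driver, so it can be deleted safely.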
One thing to note:
although multiple files containing the same data may be written on the executor side, only one of them is ultimately acknowledged by the driver. So by intersecting the set of driver-acknowledged files with the set of files recorded by the markers, the valid files are identified, and every other marker-recorded file is a discarded attempt that can be deleted. The code that checks with the driver whether a task's commit is authorized lives in DataWritingSparkTask:
// useCommitCoordinator is true by default
val msg = if (useCommitCoordinator) {
  val coordinator = SparkEnv.get.outputCommitCoordinator
  val commitAuthorized = coordinator.canCommit(stageId, stageAttempt, partId, attemptId)
  if (commitAuthorized) {
    logInfo(s"Commit authorized for partition $partId (task $taskId, attempt $attemptId, " +
      s"stage $stageId.$stageAttempt)")
    dataWriter.commit()
Once the job completes, we know which files were actually written; at that point the marker directory no longer serves much purpose, so it needs to be cleaned up.
Markers are cleaned up from several call paths; for example, SparkRDDWriteClient.commitStats performs the cleanup:
SparkRDDWriteClient.commitStats
||
\/
SparkRDDWriteClient.postCommit
||
\/
WriteMarkers.quietDeleteMarkerDir
quietDeleteMarkerDir simply deletes the marker directory outright.
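What that amounts to can be sketched roughly as follows, assuming the direct-marker layout under .hoodie/.temp/<instantTime>; this is an illustration of the behavior, not the WriteMarkers implementation:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative only: recursively delete the marker directory of one instant,
// swallowing failures ("quiet") so cleanup can never fail the commit itself.
def quietDeleteMarkerDir(basePath: String, instantTime: String): Unit = {
  val markerDir = new Path(s"$basePath/.hoodie/.temp/$instantTime")
  try {
    val fs = markerDir.getFileSystem(new Configuration())
    if (fs.exists(markerDir)) {
      fs.delete(markerDir, true) // recursive delete
    }
  } catch {
    case _: Exception => () // "quiet": ignore cleanup failures
  }
}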
For more on Hudi markers, see "Apache Hudi内核之文件标记机制深入解析" (an in-depth analysis of Apache Hudi's file marker mechanism).