Apache Hudi First Look (Part 11) (Integration with Spark) -- Hudi's Markers Mechanism

Background

In the earlier article on Hudi's Compaction operation, we saw that completeTableService includes a deleteMarker step. Why does that step exist at all?

Analysis

Why marker files exist

This starts with Spark DataSource V2. With DataSource V2 in place, Hudi's file writing mainly goes through the V2TableWriteExec class:

  sparkContext.runJob(
    rdd,
    (context: TaskContext, iter: Iterator[InternalRow]) =>
      DataWritingSparkTask.run(writerFactory, context, iter, useCommitCoordinator),
    rdd.partitions.indices,
    (index, result: DataWritingSparkTaskResult) => {
      val commitMessage = result.writerCommitMessage
      messages(index) = commitMessage
      totalNumRowsAccumulator.add(result.numRows)
      batchWrite.onDataWriterCommit(commitMessage)
    }
  )

The DataWritingSparkTask.run method looks like this:

      while (iter.hasNext) {
        // Count is here.
        count += 1
        dataWriter.write(iter.next())
      }

      val msg = if (useCommitCoordinator) {
        val coordinator = SparkEnv.get.outputCommitCoordinator
        val commitAuthorized = coordinator.canCommit(stageId, stageAttempt, partId, attemptId)
        if (commitAuthorized) {
          logInfo(s"Commit authorized for partition $partId (task $taskId, attempt $attemptId, " +
            s"stage $stageId.$stageAttempt)")
          dataWriter.commit()
        } else {
          val message = s"Commit denied for partition $partId (task $taskId, attempt $attemptId, " +
            s"stage $stageId.$stageAttempt)"
          logInfo(message)
          // throwing CommitDeniedException will trigger the catch block for abort
          throw new CommitDeniedException(message, stageId, partId, attemptId)
        }

      } else {
        logInfo(s"Writer for partition ${context.partitionId()} is committing.")
        dataWriter.commit()
      }

As covered in earlier articles, the write path boils down to the following three steps (a sketch of the underlying contract follows the list):

  1. dataWriter.write
  2. dataWriter.commit/abort
  3. dataWriter.close
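
For orientation, here is a minimal sketch of the Spark 3 connector contract behind these three steps. Both Spark's built-in file writers and Hudi's HoodieBulkInsertDataInternalWriter implement DataWriter&lt;InternalRow&gt;; the class below is a simplified stand-in, not either of those implementations.

import java.io.IOException;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.write.DataWriter;
import org.apache.spark.sql.connector.write.WriterCommitMessage;

// A sketch of the write/commit/abort/close lifecycle that the task loop above drives.
public class SketchDataWriter implements DataWriter<InternalRow> {

  @Override
  public void write(InternalRow record) throws IOException {
    // step 1: buffer or stream the row into an output file
  }

  @Override
  public WriterCommitMessage commit() throws IOException {
    // step 2a: only called once the driver's OutputCommitCoordinator authorizes this attempt;
    // returns a message describing what was written, collected back on the driver
    return new WriterCommitMessage() {};
  }

  @Override
  public void abort() throws IOException {
    // step 2b: called when the commit is denied or the task fails
  }

  @Override
  public void close() throws IOException {
    // step 3: release resources
  }
}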

This is where the dataWriter variable comes in. In Spark's native code paths, dataWriter is a SingleDirectoryDataWriter or a DynamicPartitionDataWriter.
Both constructors take a committer of type FileCommitProtocol, and that committer plays a central role in the write/commit/close steps above:
during task.write, the files are first created under a temporary directory,
and during task.commit, the files in that temporary directory are moved to the final output directory.
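
The snippet below is not Spark's FileCommitProtocol code; it is just a toy sketch of that "write to a staging directory, move on commit" behavior, with made-up local paths.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class TempThenRenameWriter {

  public static void main(String[] args) throws IOException {
    Path finalDir = Paths.get("/tmp/output/part=2023-07-23");
    Path stagingDir = Paths.get("/tmp/output/_temporary/attempt_0");
    Files.createDirectories(finalDir);
    Files.createDirectories(stagingDir);

    // task.write(): the task only ever writes into the staging directory
    Path stagedFile = stagingDir.resolve("part-00000.parquet");
    Files.write(stagedFile, new byte[0]);

    // task.commit(): the staged file is moved into the final directory,
    // so a file only becomes visible once its task attempt is authorized
    Files.move(stagedFile, finalDir.resolve("part-00000.parquet"),
        StandardCopyOption.ATOMIC_MOVE);
  }
}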
By contrast, in Hudi the dataWriter is a HoodieBulkInsertDataInternalWriter:

this.bulkInsertWriterHelper = new BulkInsertDataInternalWriterHelper(hoodieTable,
    writeConfig, instantTime, taskPartitionId, taskId, 0, structType, populateMetaFields, arePartitionRecordsSorted);

@Override
public void write(InternalRow record) throws IOException {
  bulkInsertWriterHelper.write(record);
}

@Override
public WriterCommitMessage commit() throws IOException {
  return new HoodieWriterCommitMessage(bulkInsertWriterHelper.getWriteStatuses());
}

The actual writing is done by BulkInsertDataInternalWriterHelper, and it writes directly into the final target directory rather than a temporary one.
Why do it this way? What are the trade-offs?
Pros: data goes straight to the destination directory, avoiding a second copy and improving write throughput.
Cons: with Spark speculative execution enabled, the same data can be written to different files, leaving duplicate and therefore incorrect data behind.
This is exactly why Hudi introduced the markers mechanism.
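
To make the speculative-execution problem concrete, the sketch below builds the two file names that an original attempt and its speculative copy would write for the same logical file. The fileId_writeToken_instantTime.parquet pattern mirrors Hudi's data file naming, but the helper and the values here are purely illustrative.

public class SpeculativeDuplicateSketch {

  // illustrative: write token embeds partition/stage/attempt ids, so each attempt gets its own file name
  static String dataFileName(String fileId, int partitionId, int stageId, long attemptId, String instantTime) {
    String writeToken = partitionId + "-" + stageId + "-" + attemptId;
    return fileId + "_" + writeToken + "_" + instantTime + ".parquet";
  }

  public static void main(String[] args) {
    String fileId = "f1";
    String instantTime = "20230723055500000"; // hypothetical instant time
    // the original attempt and its speculative copy both write into the same partition path
    System.out.println(dataFileName(fileId, 3, 0, 0, instantTime)); // f1_3-0-0_...parquet
    System.out.println(dataFileName(fileId, 3, 0, 1, instantTime)); // f1_3-0-1_...parquet
    // only one attempt is acknowledged by the driver; without markers the other
    // file would silently remain in the destination directory
  }
}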

When marker files are created

While the real data file is being written, a marker file is created under the .hoodie/.temp/instantTime directory, for example .hoodie/.temp/202307237055/f1.parquet.marker.CREATE.
The marker file is written in the constructor of HoodieRowCreateHandle:

HoodiePartitionMetadata partitionMetadata =
    new HoodiePartitionMetadata(
        fs,
        instantTime,
        new Path(writeConfig.getBasePath()),
        FSUtils.getPartitionPath(writeConfig.getBasePath(), partitionPath),
        table.getPartitionMetafileFormat());
partitionMetadata.trySave(taskPartitionId);

createMarkerFile(partitionPath, fileName, instantTime, table, writeConfig);

This HoodieRowCreateHandle is created and invoked from the BulkInsertDataInternalWriterHelper.write method.
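
A rough sketch of how such a marker path is laid out under .hoodie/.temp/&lt;instantTime&gt;; the helper below is hypothetical and not Hudi's actual WriteMarkers API:

public class MarkerPathSketch {

  static String markerPath(String basePath, String instantTime,
                           String partitionPath, String dataFileName) {
    // markers mirror the data file's partition path and append ".marker.<IOType>"
    return basePath + "/.hoodie/.temp/" + instantTime + "/"
        + partitionPath + "/" + dataFileName + ".marker.CREATE";
  }

  public static void main(String[] args) {
    System.out.println(markerPath("/data/hudi/tbl", "202307237055",
        "2023/07/23", "f1.parquet"));
    // /data/hudi/tbl/.hoodie/.temp/202307237055/2023/07/23/f1.parquet.marker.CREATE
  }
}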

When invalid data files are cleaned up

Because marker files exist, invalid data files can be cleaned up after the write finishes (the cleanup runs once the job completes). It is triggered by the batchWrite.commit call in V2TableWriteExec, i.e. HoodieDataSourceInternalBatchWrite.commit:

@Override
public void commit(WriterCommitMessage[] messages) {
  List<HoodieWriteStat> writeStatList = Arrays.stream(messages).map(m -> (HoodieWriterCommitMessage) m)
      .flatMap(m -> m.getWriteStatuses().stream().map(HoodieInternalWriteStatus::getStat)).collect(Collectors.toList());
  dataSourceInternalWriterHelper.commit(writeStatList);
}

The call chain is as follows:

HoodieDataSourceInternalBatchWrite.commit
      ||
      \/
dataSourceInternalWriterHelper.commit
      ||
      \/
SparkRDDWriteClient.commitStats
      ||
      \/
SparkRDDWriteClient.commit
      ||
      \/
SparkRDDWriteClient.finalizeWrite
      ||
      \/
HoodieTable.finalizeWrite
      ||
      \/
HoodieTable.reconcileAgainstMarkers
      ||
      \/
HoodieTable.getInvalidDataPaths
      ||
      \/
markers.createdAndMergedDataPaths

The reconcileAgainstMarkers method uses the marker files to find and delete the invalid data files.
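
Conceptually, the reconciliation is a set difference: everything the markers say was started, minus everything the driver actually committed. The sketch below illustrates that idea with simplified stand-in types and file names, not Hudi's real signatures.

import java.util.HashSet;
import java.util.Set;

public class ReconcileSketch {

  static Set<String> invalidDataPaths(Set<String> markerDataPaths, Set<String> committedDataPaths) {
    Set<String> invalid = new HashSet<>(markerDataPaths); // every file some attempt started writing
    invalid.removeAll(committedDataPaths);                // keep only the ones the driver never committed
    return invalid;                                       // these get deleted before the commit completes
  }

  public static void main(String[] args) {
    Set<String> markers = new HashSet<>();
    markers.add("2023/07/23/f1_3-0-0_202307237055.parquet");
    markers.add("2023/07/23/f1_3-0-1_202307237055.parquet"); // speculative duplicate
    Set<String> committed = new HashSet<>();
    committed.add("2023/07/23/f1_3-0-0_202307237055.parquet");
    System.out.println(invalidDataPaths(markers, committed)); // only the duplicate remains
  }
}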
One thing to note:
even though the executors may write multiple files containing duplicate data, only one of them is acknowledged by the driver, so comparing the files the driver finally acknowledges against the files recorded by markers is enough to find and delete the discarded ones. The negotiation with the driver over whether a commit is authorized happens in DataWritingSparkTask:

// useCommitCoordinator is true by default
val msg = if (useCommitCoordinator) {
  val coordinator = SparkEnv.get.outputCommitCoordinator
  val commitAuthorized = coordinator.canCommit(stageId, stageAttempt, partId, attemptId)
  if (commitAuthorized) {
    logInfo(s"Commit authorized for partition $partId (task $taskId, attempt $attemptId, " +
      s"stage $stageId.$stageAttempt)")
    dataWriter.commit()

When the markers directory is cleaned up

Once a job completes, we know which files were actually written, and at that point the marker directory has served its purpose, so it has to be removed.
There are several call paths that clean up markers; for example, SparkRDDWriteClient.commitStats does so:

SparkRDDWriteClient.commitStats
      ||
      \/
SparkRDDWriteClient.postCommit
      ||
      \/
WriteMarkers.quietDeleteMarkerDir

quietDeleteMarkerDir simply deletes the marker directory.
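
A sketch of what the "quiet" delete amounts to, assuming the .hoodie/.temp/&lt;instantTime&gt; layout from earlier; the method below uses the Hadoop FileSystem API but is an illustration, not the exact WriteMarkers code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QuietDeleteMarkersSketch {

  static void quietDeleteMarkerDir(Configuration conf, String basePath, String instantTime) {
    Path markerDir = new Path(basePath + "/.hoodie/.temp/" + instantTime);
    try {
      FileSystem fs = markerDir.getFileSystem(conf);
      if (fs.exists(markerDir)) {
        fs.delete(markerDir, true); // recursive delete of all marker files
      }
    } catch (IOException e) {
      // "quiet": swallow and log, the commit itself is already done
      System.err.println("Failed to delete marker dir " + markerDir + ": " + e);
    }
  }

  public static void main(String[] args) {
    quietDeleteMarkerDir(new Configuration(), "/tmp/hudi/tbl", "202307237055");
  }
}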

For more on Hudi markers, see the article 《Apache Hudi内核之文件标记机制深入解析》 (an in-depth look at Hudi's file marker mechanism).
