Spark SQLHadoopMapReduceCommitProtocol: should mapreduce.fileoutputcommitter.algorithm.version be 1 or 2?

Background

This article is based on Spark 3.1.1.
In Spark, mapreduce.fileoutputcommitter.algorithm.version defaults to 1.
This can be seen in SparkHadoopUtil.scala:

  private def appendSparkHadoopConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
    // Copy any "spark.hadoop.foo=bar" spark properties into conf as "foo=bar"
    for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.substring("spark.hadoop.".length), value)
    }
    if (conf.getOption("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version").isEmpty) {
      hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version", "1")
    }
  }
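The defaulting rule above can be tried out in isolation. The following is a minimal sketch, assuming plain Scala Maps as stand-ins for SparkConf and the Hadoop Configuration; the object and method names are hypothetical and are not part of Spark. Unlike the real method, it starts from an empty Hadoop configuration.

```scala
object CommitterDefaultSketch {
  val SparkHadoopPrefix = "spark.hadoop."
  val VersionKey = "mapreduce.fileoutputcommitter.algorithm.version"

  // Simplified model: Maps stand in for SparkConf and Hadoop Configuration (assumption).
  def toHadoopConf(sparkProps: Map[String, String]): Map[String, String] = {
    // Copy "spark.hadoop.foo=bar" into the Hadoop side as "foo=bar"
    val copied = sparkProps.collect {
      case (key, value) if key.startsWith(SparkHadoopPrefix) =>
        key.substring(SparkHadoopPrefix.length) -> value
    }
    // Fall back to "1" only when the user did not set the version explicitly
    if (copied.contains(VersionKey)) copied else copied + (VersionKey -> "1")
  }
}
```

In this model, passing spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 is the only way to end up with version 2; any other input yields 1.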

Discussion

The InsertIntoHadoopFsRelationCommand class calls FileFormatWriter.write, which eventually calls sparkSession.sparkContext.runJob:

      sparkSession.sparkContext.runJob(
        rddWithNonEmptyPartitions,
        (taskContext: TaskContext, iter: Iterator[InternalRow]) => {
          executeTask(
            description = description,
            jobIdInstant = jobIdInstant,
            sparkStageId = taskContext.stageId(),
            sparkPartitionId = taskContext.partitionId(),
            sparkAttemptNumber = taskContext.taskAttemptId().toInt & Integer.MAX_VALUE,
            committer,
            iterator = iter)
        },
        rddWithNonEmptyPartitions.partitions.indices,
        (index, res: WriteTaskResult) => {
          committer.onTaskCommit(res.commitMsg)
          ret(index) = res
        })

executeTask ends by calling dataWriter.write and then the commit method:

  override def commit(): WriteTaskResult = {
    releaseResources()
    val summary = ExecutedWriteSummary(
      updatedPartitions = updatedPartitions.toSet,
      stats = statsTrackers.map(_.getFinalStats()))
    WriteTaskResult(committer.commitTask(taskAttemptContext), summary)
  }

This ends up in HadoopMapReduceCommitProtocol.commitTask, which in turn calls FileOutputCommitter.commitTask:

···
if (algorithmVersion == 1) {
    Path committedTaskPath = getCommittedTaskPath(context);
    if (fs.exists(committedTaskPath)) {
       if (!fs.delete(committedTaskPath, true)) {
         throw new IOException("Could not delete " + committedTaskPath);
       }
    }
    if (!fs.rename(taskAttemptPath, committedTaskPath)) {
      throw new IOException("Could not rename " + taskAttemptPath + " to "
          + committedTaskPath);
    }
    LOG.info("Saved output of task '" + attemptId + "' to " +
        committedTaskPath);
  } else {
    // directly merge everything from taskAttemptPath to output directory
    mergePaths(fs, taskAttemptDirStatus, outputPath);
    LOG.info("Saved output of task '" + attemptId + "' to " +
        outputPath);
···

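What happens here depends on whether algorithmVersion is 1 or 2:

  • With 1, the files produced by the task are moved to another temporary directory, and are moved to the final output directory only after the job completes
  • With 2, the files produced by the task are moved directly to the final output directory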
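The difference between the two versions can be sketched on a local filesystem. This is a simplified illustration, not Hadoop's implementation: it uses java.nio.file instead of the Hadoop FileSystem API, it only handles flat directories, and the object, method and directory names are hypothetical.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object CommitAlgorithmSketch {
  // Names of the entries directly inside dir, sorted (empty if dir does not exist)
  def listNames(dir: Path): List[String] =
    Option(dir.toFile.listFiles()).map(_.toList).getOrElse(Nil).map(_.getName).sorted

  // Move each file from src into dst, one at a time
  def mergeFiles(src: Path, dst: Path): Unit = {
    Files.createDirectories(dst)
    listNames(src).foreach { name =>
      Files.move(src.resolve(name), dst.resolve(name), StandardCopyOption.REPLACE_EXISTING)
    }
  }

  // Task commit: v1 renames the attempt dir to a per-task staging dir,
  // v2 merges the attempt's files straight into the output dir
  def commitTask(version: Int, attemptDir: Path, committedTaskDir: Path, outputDir: Path): Unit =
    if (version == 1) {
      Files.createDirectories(committedTaskDir.getParent)
      Files.move(attemptDir, committedTaskDir)
    } else {
      mergeFiles(attemptDir, outputDir)
    }

  // Job commit: only v1 still has work to do here, moving every
  // committed task dir into the output dir
  def commitJob(version: Int, committedTaskDirs: Seq[Path], outputDir: Path): Unit =
    if (version == 1) committedTaskDirs.foreach(dir => mergeFiles(dir, outputDir))
}
```

With version 1 the output directory stays empty until commitJob runs, which is where the extra job-commit time comes from. With version 2 the files are already visible in the output directory as soon as each task commits.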

As for the trade-offs: 2 performs better than 1, while 1 gives better consistency than 2. Next, let's look at how Spark deals with this:

How Spark handles this

As SPARK-33019 puts it:

Apache Spark provides multiple distributions with Hadoop 2.7 and Hadoop 3.2. spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version depends on the Hadoop version. Apache Hadoop 3.0 switches the default algorithm from v1 to v2 and now there exists a discussion to remove v2. We had better provide a consistent default behavior of v1 across various Spark distributions

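In other words, to keep the default behavior consistent across Spark distributions, the value is forced to v1.
The official Spark documentation also covers this under "Recommended settings for writing to object stores":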

For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety

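For more details, see the article 大数据云上存算分离,我们应该关注什么 (roughly: "Separating storage and compute for big data in the cloud: what should we pay attention to").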

How Hadoop handles this

See MAPREDUCE-7282:

The v2 MR commit algorithm moves files from the task attempt dir into the dest dir on task commit -one by one

It is therefore not atomic

if a task commit fails partway through and another task attempt commits -unless exactly the same filenames are used, output of the first attempt may be included in the final result
if a worker partitions partway through task commit, and then continues after another attempt has committed, it may partially overwrite the output -even when the filenames are the same
Both MR and spark assume that task commits are atomic. Either they need to consider that this is not the case, we add a way to probe for a committer supporting atomic task commit, and the engines both add handling for task commit failures (probably fail job)

Better: we remove this as the default, maybe also warn when it is being used

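The gist: both engines assume task commits are atomic, and v2 task commits are not, so the preferred suggestion is to remove v2 as the default. Using v2 is not recommended.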
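The first failure mode described in the issue can be reproduced with a toy model. This is a minimal sketch, assuming an in-memory set as the destination directory; the object, method and file names are hypothetical, and the failAfter parameter only simulates a task attempt dying partway through its commit.

```scala
import scala.collection.mutable

object PartialCommitSketch {
  // v2-style task commit: files are added to the destination one by one.
  // failAfter = Some(n) simulates the attempt dying after n files were moved.
  // Returns true only if every file of the attempt was moved.
  def commitTaskV2(
      attemptFiles: Seq[String],
      dest: mutable.Set[String],
      failAfter: Option[Int] = None): Boolean = {
    val moved = failAfter.map(n => attemptFiles.take(n)).getOrElse(attemptFiles)
    dest ++= moved // files moved before the failure stay in the destination
    moved.size == attemptFiles.size
  }
}
```

If a first attempt with files named after its own attempt fails after moving one file and a second attempt then commits fully, the destination holds one leftover file from the first attempt plus everything from the second, so part of the data appears twice in the final result.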
That said, the discussion that followed included:

Daryn Sharp Added a comment:
I'm also -1 on changing the default.  It exposes users to new (old but new to them) behavior that may have quirks. This was a 2.7 change from 5 years ago so if it's a high risk issue our customers would have squawked by now. Has this been frequently observed or theorized?

Notably our users won't tolerate the performance regression and SLA misses. I seem to recall jobs that ran for a single-digit minutes followed by a double-digit commit. The v2 commit amortized the commit to under a minute.

I'm not a MR expert. Here's my understanding:

if a task commit fails partway through and another task attempt commits -unless exactly the same filenames are used, output of the first attempt may be included in the final result

Isn't that indicative of a non-deterministic job? Should the risk to a few "bad" jobs outweigh the benefit to the mass majority of jobs? Why not change the committer for at risk jobs?

if a worker partitions partway through task commit, and then continues after another attempt has committed, it may partially overwrite the output -even when the filenames are the same

I don't think this can happen. Tasks request permission from the AM to commit.

---
Steve Loughran added a comment: 

Tasks request permission from the AM to commit.

yes, and then we assume that they continue to completion, rather than pausing for an extended period of time, so by the time the AM/spark driver gets a timeout, it can be assumed to be one of a network failure or the worker has failed/VM/k8s container terminated. The "suspended for a long time and then continues" risk does exist, and is unlikely on a physical cluster, but in a world of VMs, not entirely inconceivable.

I note the MR AM does track its time from last heartbeat to the YARN RM to detect partitions, workers don't.

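What is interesting here: in the ideal case, every task contacts the driver when it commits, so that only one attempt is allowed to commit (other attempts of the same task will not commit), and that is enough to keep task commits correct. But suppose a network problem causes a timeout between the driver and an executor, while the executor running the task had already reached the driver and been told it may commit. That task keeps committing until the driver's notification to remove the executor takes effect, and data inconsistency is possible during that window. (The Spark timeout settings spark.executor.heartbeatInterval, spark.network.timeout and spark.rpc.askTimeout all come into play here.)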
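The window described above can be illustrated with a toy coordinator. This is a minimal sketch of the idea only: the class and its methods are hypothetical simplifications and do not reproduce Spark's actual driver-side coordination code.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for driver-side commit coordination.
class CommitCoordinatorSketch {
  // partition -> attempt currently allowed to commit
  private val authorized = mutable.Map[Int, Int]()

  // The first attempt to ask for a partition wins; later attempts are refused.
  def canCommit(partition: Int, attempt: Int): Boolean = synchronized {
    authorized.get(partition) match {
      case Some(winner) => winner == attempt
      case None =>
        authorized(partition) = attempt
        true
    }
  }

  // Called when the driver decides the authorized attempt is lost,
  // for example after a heartbeat timeout.
  def attemptFailed(partition: Int, attempt: Int): Unit = synchronized {
    if (authorized.get(partition).contains(attempt)) authorized.remove(partition)
  }
}
```

Attempt 0 is granted permission once and does not ask again. If the driver later times it out and calls attemptFailed, attempt 1 is granted permission as well, yet a merely paused attempt 0 may resume and finish its commit. With a non-atomic v2 task commit, both attempts can then write into the output directory.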

Conclusion

So the final conclusion is: v1 is safe but slow; v2 may be unsafe but is fast. v1 is the recommended choice.
