Source Code Analysis of RDD Checkpointing

When RDD#checkpoint is called, the method that runs is the following:

  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   */
  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      // Finally, create a new ReliableRDDCheckpointData. The checkpoint logic itself
      // lives mainly in ReliableRDDCheckpointData#doCheckpoint.
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }

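According to the comment, this method only marks the RDD for checkpointing. The data will be saved under the directory set with SparkContext#setCheckpointDir, and all references to the RDD's parent RDDs will be removed.
The method must be called before any job has been executed on this RDD. Persisting the RDD is strongly recommended; otherwise the data has to be recomputed when the checkpoint is written.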
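
The calling sequence described by that comment can be sketched as follows. This is a minimal illustration, not code from the Spark source: it assumes Spark is on the classpath and runs in local mode, and the object name, helper name, and temporary checkpoint directory are made up for the example.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointUsageSketch {
  // Hypothetical helper: returns isCheckpointed before and after the first action.
  def checkpointStates(sc: SparkContext): (Boolean, Boolean) = {
    // Must be set first; otherwise checkpoint() throws a SparkException.
    sc.setCheckpointDir(Files.createTempDirectory("rdd-checkpoint").toString)

    val rdd = sc.parallelize(1 to 100, 4).map(_ * 2)
    rdd.cache()       // lets the checkpoint job read cached partitions instead of recomputing
    rdd.checkpoint()  // only marks the RDD; nothing is written yet
    val before = rdd.isCheckpointed

    rdd.count()       // first action; the checkpoint is written once this job has finished
    (before, rdd.isCheckpointed)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = checkpointStates(sc)
      println(s"before the action: $before, after the action: $after")
    } finally {
      sc.stop()
    }
  }
}
```

Calling cache() before checkpoint() follows the recommendation in the doc comment; without it the extra checkpoint job would recompute the lineage.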

The checkpoint itself is carried out by ReliableRDDCheckpointData#doCheckpoint:

  /**
   * Materialize this RDD and write its content to a reliable DFS.
   * This is called immediately after the first action invoked on this RDD has completed.
   */
  protected override def doCheckpoint(): CheckpointRDD[T] = {
    // Core step: write the RDD's data into the checkpoint directory
    val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

    // Optionally clean our checkpoint files if the reference is out of scope
    if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
      rdd.context.cleaner.foreach { cleaner =>
        cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
      }
    }

    logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
    newRDD
  }

ReliableCheckpointRDD.writeRDDToCheckpointDirectory is defined as follows:

  /**
   * Write RDD to checkpoint files and return a ReliableCheckpointRDD representing the RDD.
   */
  def writeRDDToCheckpointDirectory[T: ClassTag](
      originalRDD: RDD[T],
      checkpointDir: String,
      blockSize: Int = -1): ReliableCheckpointRDD[T] = {

    val sc = originalRDD.sparkContext

    // Create the output path for the checkpoint
    val checkpointDirPath = new Path(checkpointDir)
    val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
    if (!fs.mkdirs(checkpointDirPath)) {
      throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
    }

    // Save to file, and reload it as an RDD
    val broadcastedConf = sc.broadcast(
      new SerializableConfiguration(sc.hadoopConfiguration))
    // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
    // Core step: run a job that writes every partition to a checkpoint file
    sc.runJob(originalRDD,
      writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

    if (originalRDD.partitioner.nonEmpty) {
      writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
    }

    val newRDD = new ReliableCheckpointRDD[T](
      sc, checkpointDirPath.toString, originalRDD.partitioner)
    if (newRDD.partitions.length != originalRDD.partitions.length) {
      throw new SparkException(
        s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
          s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
    }
    newRDD
  }

The sc.runJob call in this method uses a small currying trick. Let us rewrite the call slightly:

Original:

    // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
    sc.runJob(originalRDD,
      writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

Rewritten, with the partially applied function given a name and an explicit type:

    // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
    val func: (TaskContext, Iterator[T]) => Unit =
      writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf)
    sc.runJob(originalRDD, func)

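The sc.runJob call submits a new job, which computes the RDD again. If the original RDD has cached its result, most of that computation is avoided, and that is why caching is strongly recommended when checkpointing.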
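
The currying trick can be tried in isolation with plain Scala. In the sketch below, writePartition is a hypothetical stand-in for writePartitionToCheckpointFile: its first parameter list carries configuration (including a default argument), and its second parameter list has the shape a job function expects. It only builds a string; no Spark is involved.

```scala
object CurryingSketch {
  // Hypothetical stand-in: two parameter lists, the first one with a default argument.
  def writePartition(path: String, bufferSize: Int = 1024)(
      partitionId: Int, items: Iterator[String]): String =
    s"$path/part-$partitionId: ${items.mkString(",")} (buffer=$bufferSize)"

  def main(args: Array[String]): Unit = {
    // Form 1: apply only the first parameter list and add a trailing underscore.
    val viaUnderscore = writePartition("/tmp/cp") _

    // Form 2: declare the expected function type and let the compiler convert the method.
    val viaType: (Int, Iterator[String]) => String = writePartition("/tmp/cp")

    println(viaUnderscore(0, Iterator("a", "b"))) // /tmp/cp/part-0: a,b (buffer=1024)
    println(viaType(1, Iterator("c")))            // /tmp/cp/part-1: c (buffer=1024)
  }
}
```

Both forms produce the same kind of function value; the second simply makes the type visible, which is what the rewritten call does with `(TaskContext, Iterator[T]) => Unit`.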

The file-writing logic:

  /**
   * Write a RDD partition's data to a checkpoint file.
   */
  def writePartitionToCheckpointFile[T: ClassTag](
      path: String,
      broadcastedConf: Broadcast[SerializableConfiguration],
      blockSize: Int = -1)(ctx: TaskContext, iterator: Iterator[T]) {
    val env = SparkEnv.get
    val outputDir = new Path(path)
    val fs = outputDir.getFileSystem(broadcastedConf.value.value)

    val finalOutputName = ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())
    val finalOutputPath = new Path(outputDir, finalOutputName)
    val tempOutputPath =
      new Path(outputDir, s".$finalOutputName-attempt-${ctx.attemptNumber()}")

    if (fs.exists(tempOutputPath)) {
      throw new IOException(s"Checkpoint failed: temporary path $tempOutputPath already exists")
    }
    val bufferSize = env.conf.getInt("spark.buffer.size", 65536)

    val fileOutputStream = if (blockSize < 0) {
      fs.create(tempOutputPath, false, bufferSize)
    } else {
      // This is mainly for testing purpose
      fs.create(tempOutputPath, false, bufferSize, fs.getDefaultReplication, blockSize)
    }
    val serializer = env.serializer.newInstance()
    val serializeStream = serializer.serializeStream(fileOutputStream)
    Utils.tryWithSafeFinally {
      serializeStream.writeAll(iterator)
    } {
      serializeStream.close()
    }

    if (!fs.rename(tempOutputPath, finalOutputPath)) {
      if (!fs.exists(finalOutputPath)) {
        logInfo(s"Deleting tempOutputPath $tempOutputPath")
        fs.delete(tempOutputPath, false)
        throw new IOException("Checkpoint failed: failed to save output of task: " +
          s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
      } else {
        // Some other copy of this task must've finished before us and renamed it
        logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
        if (!fs.delete(tempOutputPath, false)) {
          logWarning(s"Error deleting ${tempOutputPath}")
        }
      }
    }
  }
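
The method writes to a temporary file named after the task attempt and only then renames it to the final name, so a retried or duplicated task never leaves a half-written checkpoint file under the final name. The sketch below reproduces that pattern on the local file system with java.nio instead of Hadoop's FileSystem, and writes plain text lines instead of using Spark's serializer. Both substitutions, and all names in it, are assumptions made for the illustration.

```scala
import java.io.IOException
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

object TempThenRenameSketch {
  // Hypothetical local-file analogue of writePartitionToCheckpointFile.
  def writePartition(
      outputDir: Path, partitionId: Int, attempt: Int, records: Iterator[String]): Path = {
    val finalName = "part-%05d".format(partitionId)
    val finalPath = outputDir.resolve(finalName)
    val tempPath = outputDir.resolve(s".$finalName-attempt-$attempt")
    if (Files.exists(tempPath)) {
      throw new IOException(s"Checkpoint failed: temporary path $tempPath already exists")
    }

    // Write everything to the temporary file first, closing it even on failure.
    val writer = Files.newBufferedWriter(tempPath, UTF_8)
    try records.foreach(r => writer.write(r + "\n"))
    finally writer.close()

    try {
      // Without REPLACE_EXISTING, the move fails if the final file is already there.
      Files.move(tempPath, finalPath)
    } catch {
      case _: IOException if Files.exists(finalPath) =>
        // Another attempt finished first: keep its output and discard ours.
        Files.deleteIfExists(tempPath)
    }
    finalPath
  }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("checkpoint-sketch")
    val first = writePartition(dir, 3, 0, Iterator("a", "b"))
    val second = writePartition(dir, 3, 1, Iterator("x")) // loses the race, output discarded
    println(first.getFileName)                            // part-00003
    print(new String(Files.readAllBytes(second), UTF_8))  // still the lines a and b
  }
}
```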

The core of writePartitionToCheckpointFile is:

    val serializer = env.serializer.newInstance()
    val serializeStream = serializer.serializeStream(fileOutputStream)
    Utils.tryWithSafeFinally {
      serializeStream.writeAll(iterator)
    } {
      serializeStream.close()
    }

    // Writes the records produced by the iterator into the target directory.
    // The file name comes from:
    ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())

Let us look at the definition of checkpointFileName:

  /**
   * Return the checkpoint file name for the given partition.
   */
  private def checkpointFileName(partitionIndex: Int): String = {
    "part-%05d".format(partitionIndex)
  }
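
A quick way to see what this format string produces is to run it on a few partition indices. The sketch copies the one-line body of checkpointFileName; partitionIndexOf is a hypothetical inverse added only to show that the index can be read back from the name.

```scala
object CheckpointFileNameSketch {
  // Same format string as in the source above.
  def checkpointFileName(partitionIndex: Int): String =
    "part-%05d".format(partitionIndex)

  // Hypothetical inverse, for illustration only.
  def partitionIndexOf(fileName: String): Int =
    fileName.stripPrefix("part-").toInt

  def main(args: Array[String]): Unit = {
    println(checkpointFileName(3))          // part-00003
    println(checkpointFileName(12345))      // part-12345
    println(partitionIndexOf("part-00003")) // 3
  }
}
```

The zero padding means that, for indices below 100000, sorting the file names as strings gives the same order as sorting by partition index.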

 

The part-%05d file name is used again when the RDD is recovered from the checkpoint. I will cover recovery in the next post.

So far we have covered the checkpoint flow itself. But how does a checkpoint get triggered?
The answer is in the SparkContext#runJob method:

  /**
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark.
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

Look at the last three lines: the job we actually asked for is submitted first, and only after it finishes is rdd.doCheckpoint() called.

  /**
   * Performs the checkpointing of this RDD by saving this. It is called after a job using this RDD
   * has completed (therefore the RDD has been materialized and potentially stored in memory).
   * doCheckpoint() is called recursively on the parent RDDs.
   */
  private[spark] def doCheckpoint(): Unit = {
    RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
      if (!doCheckpointCalled) {
        doCheckpointCalled = true
        if (checkpointData.isDefined) {
          checkpointData.get.checkpoint()
        } else {
          dependencies.foreach(_.rdd.doCheckpoint())
        }
      }
    }
  }

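According to the comment, this function runs after a job using this RDD has completed, so the RDD has already been materialized and may be stored in memory. This is, once again, why it is best to cache the RDD. Important things are worth saying three times.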
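
The branch structure of doCheckpoint() can be modelled with a toy class in plain Scala. Node below is a hypothetical stand-in for an RDD: marked plays the role of checkpointData.isDefined, and appending the name to a buffer stands in for checkpointData.get.checkpoint(). It mirrors only the control flow of the version quoted above.

```scala
import scala.collection.mutable.ArrayBuffer

object DoCheckpointSketch {
  // Hypothetical toy model of an RDD with parents and a "marked for checkpoint" flag.
  final class Node(val name: String, val parents: Seq[Node], val marked: Boolean) {
    private var doCheckpointCalled = false

    def doCheckpoint(written: ArrayBuffer[String]): Unit = {
      if (!doCheckpointCalled) {
        doCheckpointCalled = true
        if (marked) {
          written += name                        // stands in for checkpointData.get.checkpoint()
        } else {
          parents.foreach(_.doCheckpoint(written)) // otherwise look further up the lineage
        }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val a = new Node("a", Nil, marked = true)
    val b = new Node("b", Seq(a), marked = true)
    val c = new Node("c", Seq(b), marked = false)

    val written = ArrayBuffer.empty[String]
    c.doCheckpoint(written) // c is not marked, so the call walks up to b
    c.doCheckpoint(written) // second call is a no-op because of doCheckpointCalled
    println(written.mkString(",")) // b
  }
}
```

In this model the walk stops at the first marked node on each path, so a is not visited even though it is marked; that is how the quoted branch reads, since the recursion into dependencies sits in the else branch.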

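The method that finally gets invoked is checkpointData.get.checkpoint(), which in turn calls the doCheckpoint() of ReliableRDDCheckpointData shown earlier.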

That completes the analysis of how an RDD is checkpointed.
