To checkpoint an RDD, you must first call SparkContext's setCheckpointDir to set the location where the checkpoint data will be stored. The checkpoint itself is triggered from SparkContext.runJob, so if you understand how a Job executes end to end, the RDD checkpoint mechanism is relatively easy to follow.
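A minimal usage sketch of the workflow analyzed below (directory and data are illustrative only; the full session is in section 3):

sc.setCheckpointDir("hdfs://CentOS-01:8020/tmp")  // must be set before calling checkpoint()
val rdd = sc.parallelize(1 to 10).map(_ * 2)
rdd.checkpoint()  // only marks the RDD; nothing is written yet
rdd.count()       // the first action runs the job, then SparkContext.runJob triggers the checkpoint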
1. RDD.checkpoint
def checkpoint() {
  if (context.checkpointDir.isEmpty) {
    throw new SparkException("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new RDDCheckpointData(this))
    checkpointData.get.markForCheckpoint()
  }
}
(1) context here is a reference to the SparkContext. When checkpoint is called, it first checks whether checkpointDir has been set (see the sketch after this list);
(2) it then creates an RDDCheckpointData object wrapping the RDD to be checkpointed;
(3) the RDDCheckpointData is marked as MarkedForCheckpoint.
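If setCheckpointDir has not been called, the check in (1) fails and the call throws, as a quick sketch of the expected failure shows:

val rdd = sc.parallelize(1 to 10)
rdd.checkpoint()
// => org.apache.spark.SparkException: Checkpoint directory has not been set in the SparkContext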
2. SparkContext.runJob
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit) {
  // ......
  rdd.doCheckpoint()
}
After the job on the RDD finishes, runJob calls the RDD's doCheckpoint method, which checks whether this RDD, or any RDD it depends on, has been marked for checkpointing, and launches the checkpoint job if so.
2.1. RDD.doCheckpoint
private[spark] def doCheckpoint() {
  if (!doCheckpointCalled) {
    doCheckpointCalled = true
    if (checkpointData.isDefined) {
      checkpointData.get.doCheckpoint()
    } else {
      dependencies.foreach(_.rdd.doCheckpoint())
    }
  }
}
(1) doCheckpointCalled is a boolean flag that is set to true the first time doCheckpoint is called on an RDD, so that an RDD shared by multiple Jobs is not checkpointed more than once.
(2) If a checkpoint has been requested on this rdd, RDDCheckpointData.doCheckpoint is called to perform it;
(3) otherwise doCheckpoint recurses through the dependencies, checkpointing the first RDDs that need it. Note that when the current RDD is checkpointed, its parents are not, even if they were also marked for checkpointing.
(4) checkpoint must be called before the Job containing the RDD runs; otherwise it will never execute, because doCheckpoint is invoked after every Job and sets doCheckpointCalled to true, which makes any checkpoint requested after the Job has run a no-op (see the sketch after this list).
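The sketch below illustrates point (4); the RDD and directory are illustrative, the ordering is what matters:

sc.setCheckpointDir("hdfs://CentOS-01:8020/tmp")
val rdd = sc.parallelize(1 to 10).map(_ * 2)
rdd.count()        // runJob calls rdd.doCheckpoint(), which sets doCheckpointCalled = true
rdd.checkpoint()   // requested too late: the RDD is marked, but ...
rdd.count()        // ... doCheckpoint is now a no-op, so no checkpoint job is ever launched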
2.2. RDDCheckpointData.doCheckpoint
def doCheckpoint() {
  // If it is marked for checkpointing AND checkpointing is not already in progress,
  // then set it to be in progress, else return
  RDDCheckpointData.synchronized {
    if (cpState == MarkedForCheckpoint) {
      cpState = CheckpointingInProgress
    } else {
      return
    }
  }

  // Create the output path for the checkpoint
  val path = new Path(rdd.context.checkpointDir.get, "rdd-" + rdd.id)
  val fs = path.getFileSystem(rdd.context.hadoopConfiguration)
  if (!fs.mkdirs(path)) {
    throw new SparkException("Failed to create checkpoint path " + path)
  }

  // Save to file, and reload it as an RDD
  val broadcastedConf = rdd.context.broadcast(
    new SerializableWritable(rdd.context.hadoopConfiguration))
  rdd.context.runJob(rdd, CheckpointRDD.writeToFile[T](path.toString, broadcastedConf) _)
  val newRDD = new CheckpointRDD[T](rdd.context, path.toString)
  if (newRDD.partitions.size != rdd.partitions.size) {
    throw new SparkException(
      "Checkpoint RDD " + newRDD + "(" + newRDD.partitions.size + ") has different " +
      "number of partitions than original RDD " + rdd + "(" + rdd.partitions.size + ")")
  }

  // Change the dependencies and partitions of the RDD
  RDDCheckpointData.synchronized {
    cpFile = Some(path.toString)
    cpRDD = Some(newRDD)
    rdd.markCheckpointed(newRDD) // Update the RDD's dependencies and partitions
    cpState = Checkpointed
  }
  logInfo("Done checkpointing RDD " + rdd.id + " to " + path + ", new parent is RDD " + newRDD.id)
}
The comments in the code are self-explanatory. In outline:
(1) update the cpState state;
(2) create the output directory for the checkpoint;
(3) launch a Job that writes each partition of the RDD to a file via CheckpointRDD.writeToFile (see the note after this list);
(4) create a CheckpointRDD that reads the checkpointed files back;
(5) update the RDDCheckpointData state and call markCheckpointed to replace the dependencies and partitions of the current RDD (the one being checkpointed).
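Note that step (3) is an ordinary job over the original rdd, so its whole lineage is computed once more just to produce the checkpoint files. For this reason the RDD.checkpoint API documentation recommends persisting the RDD in memory before checkpointing; a usage sketch (not part of the source shown above):

filterRDD.cache()       // keep the computed partitions in memory
filterRDD.checkpoint()
filterRDD.count()       // the action computes filterRDD once; the checkpoint job then reads it from the cache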
2.3. RDD.markCheckpointed
private[spark] def markCheckpointed(checkpointRDD: RDD[_]) {
  clearDependencies()
  partitions_ = null
  deps = null // Forget the constructor argument for dependencies too
}

protected def clearDependencies() {
  dependencies_ = null
}
This clears the dependencies and partitions.
3. Example
3.1. Creating the RDDs
scala> val rdd = sc.parallelize(List(1, 2, 3, 5, 6, 7, 8, 9, 0), 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val mappedRDD = rdd.map(_ * 2)
mappedRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:23
scala> val filterRDD = mappedRDD.filter(_ > 10)
filterRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:25
(1) parallelize creates a ParallelCollectionRDD;
(2) calling map on rdd returns a MapPartitionsRDD (Spark 1.3);
(3) calling filter on mappedRDD also returns a MapPartitionsRDD.
3.2. Setting up the checkpoint
scala> sc.setCheckpointDir("hdfs://CentOS-01:8020/tmp")
scala> filterRDD.checkpoint
scala> filterRDD.toDebugString
res3: String =
(2) MapPartitionsRDD[2] at filter at <console>:25 []
 |  MapPartitionsRDD[1] at map at <console>:23 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:21 []
RDD.toDebugString shows the dependency chain between the RDDs.
3.3. Executing the checkpoint
The checkpoint is only executed once an action triggers it.
scala> filterRDD.count
......
15/05/25 15:29:55 INFO RDDCheckpointData: Done checkpointing RDD 2 to hdfs://CentOS-01:8020/tmp/9e7ecd6d-c327-4938-8fd2-43f57006bb09/rdd-2, new parent is RDD 3
res4: Long = 4
scala> filterRDD.toDebugString
res5: String =
(2) MapPartitionsRDD[2] at filter at <console>:25 []
 |  CheckpointRDD[3] at count at <console>:28 []
The log output from RDD.count shows that the checkpoint data has been written to HDFS.
toDebugString now shows that filterRDD depends on the newly created CheckpointRDD.
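Besides toDebugString, the RDD's public isCheckpointed and getCheckpointFile methods report the same state (a sketch; the exact path comes from the log line above):

filterRDD.isCheckpointed     // Boolean = true
filterRDD.getCheckpointFile  // Some("hdfs://CentOS-01:8020/tmp/.../rdd-2")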
In this example it is filterRDD that was checkpointed, so the files contain filterRDD's results:
scala> val p = filterRDD.dependencies(0).rdd
p: org.apache.spark.rdd.RDD[_] = CheckpointRDD[3] at count at <console>:28
scala> p.collect
......
res7: Array[_] = Array(12, 14, 16, 18)
However, because filterRDD now depends on the newly created CheckpointRDD, any action on filterRDD applies the filter one more time even though the checkpointed data is already filtered, which is an unnecessary extra pass.
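To make the extra pass concrete, a sketch of what an action on filterRDD does after checkpointing (same result, one redundant filter over data that is already filtered):

// lineage after checkpointing:
//   filterRDD (MapPartitionsRDD, applies filter(_ > 10))
//     +-- CheckpointRDD[3] (reads the rdd-2 files, which already hold the filtered values)
filterRDD.collect()   // Array(12, 14, 16, 18): read from HDFS, then filtered again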
3.4. RDD.dependencies and RDD.partitions
private def checkpointRDD: Option[RDD[T]] = checkpointData.flatMap(_.checkpointRDD)

final def dependencies: Seq[Dependency[_]] = {
  checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
    if (dependencies_ == null) {
      dependencies_ = getDependencies
    }
    dependencies_
  }
}

final def partitions: Array[Partition] = {
  checkpointRDD.map(_.partitions).getOrElse {
    if (partitions_ == null) {
      partitions_ = getPartitions
    }
    partitions_
  }
}
Both dependencies and partitions first call checkpointRDD to check whether the RDD has been checkpointed. If it has, the corresponding CheckpointRDD is obtained from the RDDCheckpointData object, and the dependencies and partitions are derived from that RDD instead of the original lineage.
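Applied to the example above, this is why filterRDD's dependencies and partitions change after the checkpoint (a sketch; the values are taken from the session in section 3):

filterRDD.dependencies(0)        // a fresh OneToOneDependency wrapping CheckpointRDD[3]
filterRDD.dependencies(0).rdd    // CheckpointRDD[3]
filterRDD.partitions.size        // 2, guaranteed equal to the original by the check in doCheckpoint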