This article is based on the checkpointing mechanism of Spark Streaming in Spark 2.4.3.
Getting Started
How do you use Spark Streaming (SS) checkpointing? Below is the official example; this article uses it as the starting point.
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
val ssc = new StreamingContext(...) // new context
val lines = ssc.socketTextStream(...) // create DStreams
...
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
ssc
}
// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...
// Start the context
context.start()
context.awaitTermination()
The example shows how to recover a job from checkpoint data, i.e. val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _), and how to enable checkpointing, i.e. ssc.checkpoint(checkpointDirectory).
Internals
Enabling checkpointing
According to the source code, enabling checkpointing requires two conditions to hold: private lazy val shouldCheckpoint = ssc.checkpointDuration != null && ssc.checkpointDir != null, i.e. neither the checkpoint interval nor the checkpoint directory may be null.
The following code shows that, if no interval is configured explicitly, the batch interval is used as checkpointDuration by default:
private[streaming] val checkpointDuration: Duration = {
if (isCheckpointPresent) _cp.checkpointDuration else graph.batchDuration
}
The key to enabling checkpointing is therefore whether checkpointDir has been set, so setting the checkpoint directory in the official example above, ssc.checkpoint(checkpointDirectory), is equivalent to turning the feature on.
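Putting the two conditions together, here is a minimal sketch (the application name, directory path, and 10-second batch interval are made-up values for illustration):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-demo")
val ssc = new StreamingContext(conf, Seconds(10))        // batch interval = 10s
ssc.checkpoint("hdfs://namenode:8020/checkpoints/demo")  // checkpointDir != null
// checkpointDuration was not set explicitly, so it falls back to the batch
// interval (10s), shouldCheckpoint becomes true, and metadata checkpoints are
// written roughly once per batch.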
Metadata checkpointing
According to the official documentation, SS checkpointing comes in two forms, metadata checkpointing and data checkpointing; this section covers metadata checkpointing first. The official description is:
Metadata checkpointing - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application (discussed in detail later). Metadata includes:
- Configuration - The configuration that was used to create the streaming application.
- DStream operations - The set of DStream operations that define the streaming application.
- Incomplete batches - Batches whose jobs are queued but have not completed yet.
In the source code, metadata checkpointing corresponds to the org.apache.spark.streaming.Checkpoint class.
The checkpoint is written to HDFS by the following code:
ssc.graph.updateCheckpointData(time)
checkpointWriter.write(new Checkpoint(ssc, time), clearCheckpointDataLater)
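What the Checkpoint object carries maps directly onto the three kinds of metadata listed in the documentation. The fields below are paraphrased from the 2.4.x source, so treat the exact list as approximate; it may differ slightly between versions:
// Paraphrased fields of org.apache.spark.streaming.Checkpoint
val master = ssc.sc.master                                  // configuration
val framework = ssc.sc.appName
val jars = ssc.sc.jars
val sparkConfPairs = ssc.conf.getAll
val graph = ssc.graph                                       // DStream operations (the DStreamGraph)
val checkpointDir = ssc.checkpointDir
val checkpointDuration = ssc.checkpointDuration
val pendingTimes = ssc.scheduler.getPendingTimes().toArray  // incomplete batches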
updateCheckpointData ultimately calls the def update(time: Time) method of DStreamCheckpointData. If data is received via KafkaUtils.createDirectStream, then DirectKafkaInputDStreamCheckpointData (declared as class DirectKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this)) overrides update and the other related methods:
override def update(time: Time): Unit = {
batchForTime.clear()
generatedRDDs.foreach { kv =>
val a = kv._2.asInstanceOf[KafkaRDD[K, V]].offsetRanges.map(_.toTuple).toArray
batchForTime += kv._1 -> a
}
}
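For comparison, the same per-batch offset information that update stores can also be inspected from user code. A minimal sketch using the kafka-0-10 integration, where stream is assumed to be the direct stream created earlier:
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Each batch RDD of a direct stream exposes its Kafka offset ranges, which is
  // essentially what DirectKafkaInputDStreamCheckpointData persists per batch time.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}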
Data checkpointing
This section covers data checkpointing. The official description is:
Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
The description above raises a few questions:
- When does an RDD actually get checkpointed?
- What is different between StateDStream and a plain DStream when checkpointing?
- How is the RDD dependency chain (lineage) cut?
These questions are examined in detail below.
In the DStream class, def getOrCompute(time: Time): Option[RDD[T]] is where checkpoint is called on the RDD:
if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
newRDD.checkpoint()
logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
}
So the key question is when the checkpointDuration variable gets assigned; note that this is a member of the DStream class (distinct from the checkpointDuration on StreamingContext seen earlier). It is assigned in only two places:
- The initialize method:
// Set the checkpoint interval to be slideDuration or 10 seconds, which ever is larger
if (mustCheckpoint && checkpointDuration == null) {
checkpointDuration = slideDuration * math.ceil(Seconds(10) / slideDuration).toInt
logInfo(s"Checkpoint interval automatically set to $checkpointDuration")
}
However, DStream itself has val mustCheckpoint = false, so by default checkpointDuration is not assigned here.
- The checkpoint method:
def checkpoint(interval: Duration): DStream[T] = {
if (isInitialized) {
throw new UnsupportedOperationException(
"Cannot change checkpoint interval of a DStream after streaming context has started")
}
persist()
checkpointDuration = interval
this
}
So for a plain DStream, its RDDs are checkpointed only if its checkpoint method has been called.
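In other words, stateless streams have to opt in explicitly. A minimal sketch, where lines is assumed to be an existing DStream (e.g. from the Getting Started example) and the 30-second interval is an arbitrary choice that should be a multiple of the stream's slide duration:
import org.apache.spark.streaming.Seconds

val mapped = lines.map(_.length)
mapped.checkpoint(Seconds(30))   // sets checkpointDuration and calls persist()
// From now on, inside getOrCompute, every RDD whose batch time is a multiple of
// 30s is marked with newRDD.checkpoint().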
For StateDStream, which overrides this flag with override val mustCheckpoint = true, checkpointDuration is already assigned in initialize, so calling checkpoint is not strictly required. Calling it still has two benefits: first, you can choose the checkpoint interval yourself; second, because checkpoint calls persist(), the dependency chain of RDDs within the same batch is effectively cut (the cached RDD does not need to be recomputed), while checkpointing itself can be understood as cutting the RDD dependencies across the whole lifetime of the job, since in a stateful stream the RDDs of successive batches also depend on one another, and this determines whether RDDs have to be recomputed after the job is restarted and recovered.
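A sketch with a stateful transformation (word counting is just an example; ssc and lines are assumed to exist as in the Getting Started sample, and the 50-second interval is arbitrary but should be a multiple of the slide duration):
import org.apache.spark.streaming.Seconds

ssc.checkpoint("hdfs://namenode:8020/checkpoints/demo")   // required for stateful streams

val counts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
    Some(values.sum + state.getOrElse(0))
  }

// Optional: without this call, initialize() automatically picks a multiple of the
// slide duration that is at least ~10 seconds, because mustCheckpoint = true for
// the underlying StateDStream.
counts.checkpoint(Seconds(50))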
Because persist() sets the storageLevel variable, the code below is reached, and ultimately the RDD's persist method is called:
// Register the generated RDD for caching and checkpointing
if (storageLevel != StorageLevel.NONE) {
newRDD.persist(storageLevel)
logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
}
if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
newRDD.checkpoint()
logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
}
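To see what cutting the lineage means at the plain-RDD level, here is a standalone sketch outside Spark Streaming (the master, app name, and path are placeholder values):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-checkpoint-demo"))
sc.setCheckpointDir("hdfs://namenode:8020/checkpoints/rdd-demo")

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.persist()      // avoids recomputing the RDD when the separate checkpoint job runs
rdd.checkpoint()   // only marks the RDD; data is written once an action runs
rdd.count()        // triggers the write; afterwards the lineage is truncated
println(rdd.toDebugString)  // the lineage now bottoms out at a ReliableCheckpointRDD
                            // instead of the original parallelize/map chain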