A First Look at Spark Streaming Checkpointing

This article is based on checkpointing as implemented in Spark 2.4.3.

Getting Started

How do you use Spark Streaming's checkpoint? Below is the example from the official documentation, and this article starts from it.

// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)   // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...

// Start the context
context.start()
context.awaitTermination()

The example shows how to recover a job from checkpoint data, i.e. val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _), and how to turn checkpointing on, i.e. ssc.checkpoint(checkpointDirectory).

Internals

Enabling checkpointing

According to the source, enabling checkpointing requires two conditions to hold: private lazy val shouldCheckpoint = ssc.checkpointDuration != null && ssc.checkpointDir != null, i.e. neither the checkpoint interval nor the checkpoint directory may be null.

The code below shows that if no interval is configured, the batch interval is used as checkpointDuration by default:

private[streaming] val checkpointDuration: Duration = {
  if (isCheckpointPresent) _cp.checkpointDuration else graph.batchDuration
}

So the key to enabling checkpointing is whether checkpointDir has been set, which means that setting the checkpoint path with ssc.checkpoint(checkpointDirectory), as in the official example above, is enough to turn the feature on.
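
Putting the two conditions together, here is a minimal sketch (the application name, directory and batch interval are placeholders) of a context for which shouldCheckpoint in JobGenerator evaluates to true:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-demo")
val ssc = new StreamingContext(conf, Seconds(5))    // batch interval = 5 seconds
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")  // sets ssc.checkpointDir
// No checkpoint was recovered, so ssc.checkpointDuration falls back to the
// batch interval (5 seconds); both conditions of shouldCheckpoint now hold.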

Metadata checkpointing

According to the official documentation, Spark Streaming checkpointing comes in two kinds, metadata checkpointing and data checkpointing. This section covers metadata checkpointing first; the official description is:

Metadata checkpointing - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application (discussed in detail later). Metadata includes:

  • Configuration - The configuration that was used to create the streaming application.
  • DStream operations - The set of DStream operations that define the streaming application.
  • Incomplete batches - Batches whose jobs are queued but have not completed yet.

Metadata checkpointing corresponds to the org.apache.spark.streaming.Checkpoint class in the source.
The checkpoint is written to HDFS by the following code:

ssc.graph.updateCheckpointData(time)
checkpointWriter.write(new Checkpoint(ssc, time), clearCheckpointDataLater)
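
To map this back to the three kinds of metadata in the official description, here is an abridged sketch of the fields a Checkpoint instance captures (see org.apache.spark.streaming.Checkpoint for the authoritative list):

// Abridged field list of org.apache.spark.streaming.Checkpoint:
//   master, framework (app name), jars, sparkConfPairs -> "Configuration"
//   graph (the serialized DStreamGraph)                 -> "DStream operations"
//   pendingTimes (batches queued but not yet finished)  -> "Incomplete batches"
//   plus checkpointDir, checkpointDuration and checkpointTime for bookkeeping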

updateCheckpointData eventually calls the def update(time: Time) method of DStreamCheckpointData. If data is received with KafkaUtils.createDirectStream, then class DirectKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this) overrides update and the other related methods:

override def update(time: Time): Unit = {
  batchForTime.clear()
  // For every generated KafkaRDD, store its offset ranges as
  // (topic, partition, fromOffset, untilOffset) tuples keyed by batch time,
  // so only Kafka offsets, not the records themselves, enter the checkpoint.
  generatedRDDs.foreach { kv =>
    val a = kv._2.asInstanceOf[KafkaRDD[K, V]].offsetRanges.map(_.toTuple).toArray
    batchForTime += kv._1 -> a
  }
}
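
On recovery, the same class overrides restore(): roughly, it walks batchForTime in time order and rebuilds one KafkaRDD per batch from the saved offset-range tuples. A simplified sketch (constructor arguments abridged from the real code):

override def restore(): Unit = {
  batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
    // Recreate the batch's RDD from (topic, partition, fromOffset, untilOffset)
    generatedRDDs += t -> new KafkaRDD[K, V](
      context.sparkContext,
      executorKafkaParams,
      b.map(OffsetRange(_)),
      getPreferredHosts,
      false)  // do not reuse cached consumers while restoring
  }
}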

Data checkpointing

This section covers data checkpointing; the official description is:

Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.

The description above raises a few questions:

  1. When does an RDD actually get checkpointed?
  2. What differs between StateDStream and DStream when checkpointing?
  3. How is the RDD dependency chain cut off?

These questions are answered in detail below.

The RDD's checkpoint is invoked inside DStream's def getOrCompute(time: Time): Option[RDD[T]]:

if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
  newRDD.checkpoint()
  logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
}
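
As a worked example (numbers illustrative): with a batch interval of 5 seconds and a checkpointDuration of 30 seconds, (time - zeroTime) is a multiple of 30 seconds only once every six batches, so only those batches' RDDs get marked for checkpointing.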

So the key question is when the checkpointDuration field is assigned; note that this is a member of the DStream class. It is assigned in only two places:

  1. The initialize method
// Set the checkpoint interval to be slideDuration or 10 seconds, which ever is larger
if (mustCheckpoint && checkpointDuration == null) {
  checkpointDuration = slideDuration * math.ceil(Seconds(10) / slideDuration).toInt
  logInfo(s"Checkpoint interval automatically set to $checkpointDuration")
}
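
As a worked example of this formula (numbers illustrative): with a slide duration of 3 seconds, Seconds(10) / slideDuration is about 3.33, math.ceil rounds it up to 4, and the automatic interval becomes 3 seconds * 4 = 12 seconds, the smallest multiple of the slide duration that is at least 10 seconds.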

However, DStream declares val mustCheckpoint = false, so by default checkpointDuration is not assigned here.

  2. The checkpoint method
def checkpoint(interval: Duration): DStream[T] = {
  if (isInitialized) {
    throw new UnsupportedOperationException(
      "Cannot change checkpoint interval of a DStream after streaming context has started")
  }
  persist()
  checkpointDuration = interval
  this
}

So for an ordinary DStream, its RDDs are checkpointed only if the DStream's checkpoint method has been called.

For StateDStream, which overrides override val mustCheckpoint = true, checkpointDuration is already assigned in initialize, so calling checkpoint is not strictly required. Calling it still has two benefits: first, you can choose the checkpoint interval yourself; second, because checkpoint calls persist(), the RDD lineage within a single batch is cut by caching. Checkpointing itself can be seen as cutting the RDD dependencies across the whole lifetime of the job, which matters for stateful streams because each batch's state RDD depends on the previous batch's RDD, and it also determines whether RDDs have to be recomputed after the job is restarted and recovered from a checkpoint.
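
A minimal sketch of the StateDStream case, building on the lines DStream from the Getting Started example (the update function and the 30-second interval are illustrative):

import org.apache.spark.streaming.Seconds

val wordCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .updateStateByKey[Int]((newValues: Seq[Int], state: Option[Int]) =>
    Some(newValues.sum + state.getOrElse(0)))   // internally a StateDStream

// Not strictly required (mustCheckpoint = true already forces an interval),
// but this pins the interval explicitly and, via persist(), caches the state
// RDDs so they are not recomputed within a batch.
wordCounts.checkpoint(Seconds(30))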

Because persist() sets the storageLevel field, the code below (also in getOrCompute) is reached and ultimately calls the RDD's persist method:

// Register the generated RDD for caching and checkpointing
if (storageLevel != StorageLevel.NONE) {
  newRDD.persist(storageLevel)
  logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
}
if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
  newRDD.checkpoint()
  logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
}
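
Finally, regarding question 3: the dependency chain is actually cut at the RDD level. newRDD.checkpoint() only marks the RDD; after the job that computes it finishes, the RDD's data is written to reliable storage and its dependencies are replaced by a checkpoint RDD read back from that storage. A minimal sketch at the plain-RDD level (the SparkContext sc and the path are placeholders):

sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoint")
val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()            // only marks the RDD; data is written by the next job
rdd.count()                 // runs the job and then writes the checkpoint
println(rdd.toDebugString)  // lineage now shows a ReliableCheckpointRDD instead of
                            // the original parallelize/map chain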

