DT大数据梦工厂Spark定制班笔记(006)

Spark Streaming源码解读之Job动态生成和深度思考

图转自http://lqding.blog.51cto.com/9123978/1772958 感谢作者!

DT大数据梦工厂Spark定制班笔记(006)_第1张图片

如前篇所述,Spark Streaming应用在启动时,会先启动receiverTracker控制数据的接受;然后启动JobGenerator去生成Spark Streaming Job。

JobGenerator start实现如下所示 (JobGenerator.scala 82-102)

def start(): Unit = synchronized {
  if (eventLoop != null) return // generator has already been started

  // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
  // See SPARK-10125
  checkpointWriter

  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = {
      jobScheduler.reportError("Error in job generator", e)
    }
  }
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()
  } else {
    startFirstTime()
  }
}
可以看到其先是实例化并启动了事件循环器eventLooop不断接收生成job的消息(由定时器RecurringTimer发送),并使用processEvent进行相应的处理。
然后调用函数startFirstTime()
该函数实现如下:(JobGenerator.scala 192-197),它会相继启动DStreamGraph和定时器。
 
  
 
  
private def startFirstTime() {
  val startTime = new Time(timer.getStartTime())
  graph.start(startTime - graph.batchDuration)
  timer.start(startTime.milliseconds)
  logInfo("Started JobGenerator at " + startTime)
}

定时器则会按时向前面提到的eventLoop发送消息,例如GenerateJobs。
 
  
下面我们看一下processEvent的内容(JobGenerator.scala 180-189行)
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}
在收到GenerateJobs消息后,会调用generateJobs函数生成Spark Streaming Job
generateJobs实现如下 (JobGenerator.scala 243-258行)
 
  
private def generateJobs(time: Time) {
  // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
  // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
  ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}

该函数会完成一些列工作
要求ReceiverTracker将目前已收到的数据进行一次allocate,即将上次batch切分后的数据切分到到本次新的batch里 
要求DStreamGraph复制出一套新的 RDD DAG 的实例。整个DStreamGraph.generateJobs(time)遍历结束的返回值是Seq[Job] 
将第2步生成的本 batch 的 RDD DAG,和第1步获取到的 meta 信息,一同提交给JobScheduler异步执行这里我们提交的是将 (a) time (b) Seq[job] (c) 块数据的meta信息。这三者包装为一个JobSet,然后调用JobScheduler.submitJobSet(JobSet)提交给JobScheduler。这里的向JobScheduler提交过程与JobScheduler接下来在jobExecutor里执行过程是异步分离的,因此本步将非常快即可返回。 
只要提交结束(不管是否已开始异步执行),就马上对整个系统的当前运行状态做一个checkpoint这里做checkpoint也只是异步提交一个DoCheckpoint消息请求,不用等 checkpoint 真正写完成即可返回这里也简单描述一下 checkpoint 包含的内容,包括已经提交了的、但尚未运行结束的JobSet等实际运行时信息。
 
  


你可能感兴趣的:(DT大数据梦工厂Spark定制班笔记(006))