Spark Streaming Source Code: Starting the StreamingContext

Source code directory


1. Program entry point

 // Initialize the StreamingContext
 SparkConf conf = new SparkConf().setAppName("SparkStreaming_demo")
       .set("spark.streaming.stopGracefullyOnShutdown", "true");
 JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(30));

 // Build the processing chain
 Map<String, Object> kafkaParams = new HashMap<>();
 kafkaParams.put("bootstrap.servers", BOOTSTRAP_SERVERS);
 kafkaParams.put("key.deserializer", StringDeserializer.class);
 kafkaParams.put("value.deserializer", StringDeserializer.class);
 kafkaParams.put("group.id", "exampleGroup");
 kafkaParams.put("auto.offset.reset", "latest");
 kafkaParams.put("enable.auto.commit", false);

 Collection<String> topics = Arrays.asList("exampleTopic");

 JavaInputDStream<ConsumerRecord<String, String>> inputDStream = KafkaUtils.createDirectStream(streamingContext,
                LocationStrategies.PreferConsistent(), ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

 JavaDStream<String> transformDStream = inputDStream.transform(new Function<JavaRDD<ConsumerRecord<String, String>>, JavaRDD<String>>() {
      @Override
      public JavaRDD<String> call(JavaRDD<ConsumerRecord<String, String>> v1) throws Exception {
         JavaRDD<String> tempRdd = v1.map(new Function<ConsumerRecord<String, String>, String>() {
             @Override
             public String call(ConsumerRecord<String, String> v1) throws Exception {
                 return v1.value();
             }
         });
         return tempRdd;
     }
 });

 JavaDStream<String> filterDStream = transformDStream.filter(new Function<String, Boolean>() {
     @Override
     public Boolean call(String v1) throws Exception {
         return StringUtils.isNotBlank(v1);
     }
 });

 filterDStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
     @Override
     public void call(JavaRDD<String> javaRDD) throws Exception {
         javaRDD.saveAsTextFile("/home/example/result/");
     }
 });
 
 // Start and run
 streamingContext.start();
 streamingContext.awaitTermination();

This post walks through how the StreamingContext starts and runs.

2. Into the source code

2.1 Following streamingContext.start()

  • Enter org.apache.spark.streaming.StreamingContext.scala
  def start(): Unit = synchronized {
    state match {
      case INITIALIZED =>
        startSite.set(DStream.getCreationSite())
        StreamingContext.ACTIVATION_LOCK.synchronized {
          StreamingContext.assertNoOtherContextIsActive()
          try {
            validate()

            // Start the streaming scheduler in a new thread, so that thread local properties
            // like call sites and job groups can be reset without affecting those of the
            // current thread.
            ThreadUtils.runInNewThread("streaming-start") {
              sparkContext.setCallSite(startSite.get)
              sparkContext.clearJobGroup()
              sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
              savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
              scheduler.start()
            }
            state = StreamingContextState.ACTIVE
            scheduler.listenerBus.post(
              StreamingListenerStreamingStarted(System.currentTimeMillis()))
          } catch {
            case NonFatal(e) =>
              logError("Error starting the context, marking it as stopped", e)
              scheduler.stop(false)
              state = StreamingContextState.STOPPED
              throw e
          }
          StreamingContext.setActiveContext(this)
        }
        logDebug("Adding shutdown hook") // force eager creation of logger
        shutdownHookRef = ShutdownHookManager.addShutdownHook(
          StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
        // Registering Streaming Metrics at the start of the StreamingContext
        assert(env.metricsSystem != null)
        env.metricsSystem.registerSource(streamingSource)
        uiTab.foreach(_.attach())
        logInfo("StreamingContext started")
      case ACTIVE =>
        logWarning("StreamingContext has already been started")
      case STOPPED =>
        throw new IllegalStateException("StreamingContext has already been stopped")
    }
  }

The StreamingContextState is initially INITIALIZED, so the INITIALIZED branch runs: a new thread ("streaming-start") is spawned, the scheduler is started inside it (scheduler.start()), and the StreamingContextState is then updated to ACTIVE.
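As a side illustration, here is a minimal runnable sketch of that state machine, assuming a local master and using an in-memory queueStream only so that validate() finds at least one output operation (the object name StartStateDemo and the local[2] master are made up for this sketch, not part of the example above):

  import scala.collection.mutable
  import org.apache.spark.SparkConf
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object StartStateDemo {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setMaster("local[2]").setAppName("state-demo")
      val ssc = new StreamingContext(conf, Seconds(1))

      // validate() inside start() requires at least one output operation,
      // so register a trivial one on an in-memory queue stream.
      ssc.queueStream(mutable.Queue.empty[RDD[Int]]).print()

      ssc.start()   // INITIALIZED -> ACTIVE: scheduler.start() runs on the "streaming-start" thread
      ssc.start()   // already ACTIVE: only logs "StreamingContext has already been started"
      ssc.stop()    // ACTIVE -> STOPPED
      // ssc.start() would now throw IllegalStateException ("StreamingContext has already been stopped")
    }
  }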

2.2 Following scheduler.start()

  • Enter org.apache.spark.streaming.scheduler.JobScheduler.scala
  def start(): Unit = synchronized {
    if (eventLoop != null) return // scheduler has already been started

    logDebug("Starting JobScheduler")
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    eventLoop.start()

    // attach rate controllers of input streams to receive batch completion updates
    for {
      inputDStream <- ssc.graph.getInputStreams
      rateController <- inputDStream.rateController
    } ssc.addStreamingListener(rateController)

    listenerBus.start()
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)

    val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
      case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
      case _ => null
    }

    executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
      executorAllocClient,
      receiverTracker,
      ssc.conf,
      ssc.graph.batchDuration.milliseconds,
      clock)
    executorAllocationManager.foreach(ssc.addStreamingListener)
    receiverTracker.start()
    jobGenerator.start()
    executorAllocationManager.foreach(_.start())
    logInfo("Started JobScheduler")
  }

When the JobScheduler starts, it:

  1. Creates and starts an EventLoop[JobSchedulerEvent] (eventLoop.start()) to process JobSchedulerEvent events.
  2. Collects every InputDStream in the DStreamGraph together with its rateController and registers each rateController on the listenerBus; in this example the InputDStream is a DirectKafkaInputDStream (a user-level listener sketch follows this list).
  3. Creates and starts the ReceiverTracker (receiverTracker.start()).
  4. Starts the JobGenerator: jobGenerator.start()
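The listenerBus that the rate controllers attach to is the same bus user code can hook into. A hedged sketch (the class name BatchLogListener is made up for illustration; the StreamingListener API itself is public):

  import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

  // A user-defined listener registered on the same listenerBus as the rate controllers.
  class BatchLogListener extends StreamingListener {
    override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
      val info = batchCompleted.batchInfo
      println(s"batch ${info.batchTime}: ${info.numRecords} records, " +
        s"processing delay = ${info.processingDelay.getOrElse(-1L)} ms")
    }
  }

  // streamingContext.addStreamingListener(new BatchLogListener())  // also available on JavaStreamingContext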

2.3 Following receiverTracker.start()

  • Enter org.apache.spark.streaming.scheduler.ReceiverTracker.scala
  def start(): Unit = synchronized {
    if (isTrackerStarted) {
      throw new SparkException("ReceiverTracker already started")
    }

    if (!receiverInputStreams.isEmpty) {
      endpoint = ssc.env.rpcEnv.setupEndpoint(
        "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
      if (!skipReceiverLaunch) launchReceivers()
      logInfo("ReceiverTracker started")
      trackerState = Started
    }
  }
  1. Checks whether receiverInputStreams is empty, where private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
  2. getReceiverInputStreams() walks the inputStreams of the DStreamGraph and keeps only those of type ReceiverInputDStream
  def getReceiverInputStreams(): Array[ReceiverInputDStream[_]] = this.synchronized {
    inputStreams.filter(_.isInstanceOf[ReceiverInputDStream[_]])
      .map(_.asInstanceOf[ReceiverInputDStream[_]])
      .toArray
  }
  3. In this example, KafkaUtils.createDirectStream() created a DirectKafkaInputDStream and added it to the DStreamGraph's inputStreams, but DirectKafkaInputDStream is not a ReceiverInputDStream, so receiverInputStreams is empty and the rest of the method is skipped

Note: the Direct approach differs from the Receiver approach. With a Receiver-based source the stream is a ReceiverInputDStream, so the rest of this method (setting up the ReceiverTrackerEndpoint and calling launchReceivers()) does run; that path is left for a later post. For contrast, see the sketch below.
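A hedged illustration (localhost:9999 is an arbitrary placeholder): a receiver-based source such as socketTextStream yields a ReceiverInputDStream, so with it receiverInputStreams would be non-empty and ReceiverTracker.start() would set up the RPC endpoint and launch receivers.

  import org.apache.spark.storage.StorageLevel

  // ssc is the StreamingContext created earlier; socketTextStream returns a ReceiverInputDStream[String]
  val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)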

2.4 Following jobGenerator.start()

  • Enter org.apache.spark.streaming.scheduler.JobGenerator.scala
  def start(): Unit = synchronized {
    if (eventLoop != null) return // generator has already been started

    // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
    // See SPARK-10125
    checkpointWriter

    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }
    eventLoop.start()

    if (ssc.isCheckpointPresent) {
      restart()
    } else {
      startFirstTime()
    }
  }
  1. Creates and starts an EventLoop[JobGeneratorEvent] (eventLoop.start()); this one processes JobGeneratorEvent events.
  2. Executes startFirstTime(), since no checkpoint is present in this example

Note:
this eventLoop is not the same as the one in 2.2:
the eventLoop here processes JobGeneratorEvent,
while the eventLoop in 2.2 processes JobSchedulerEvent. Both are instances of the same generic event-loop pattern, sketched below.
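A stripped-down sketch of that pattern, assuming nothing beyond the JDK (the real class is org.apache.spark.util.EventLoop, shown in full in 2.6; SimpleEventLoop is a made-up name):

  import java.util.concurrent.LinkedBlockingQueue

  // Minimal analogue of EventLoop: a daemon thread draining a blocking queue and
  // dispatching each event to a handler (onReceive in the real class).
  class SimpleEventLoop[E](name: String)(handler: E => Unit) {
    private val queue = new LinkedBlockingQueue[E]()
    @volatile private var stopped = false

    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit =
        try {
          while (!stopped) handler(queue.take())
        } catch {
          case _: InterruptedException => // exit when stop() interrupts take()
        }
    }

    def start(): Unit = thread.start()
    def post(event: E): Unit = queue.put(event)
    def stop(): Unit = { stopped = true; thread.interrupt() }
  }

  // e.g. one loop per event type, mirroring JobGenerator and JobScheduler:
  // val generatorLoop = new SimpleEventLoop[String]("JobGenerator")(e => println(s"generator: $e"))
  // generatorLoop.start(); generatorLoop.post("GenerateJobs(t)")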

2.5 First, following startFirstTime() from 2.4

  • Enter org.apache.spark.streaming.scheduler.JobGenerator.scala
  /** Starts the generator for the first time */
  private def startFirstTime() {
    val startTime = new Time(timer.getStartTime())
    graph.start(startTime - graph.batchDuration)
    timer.start(startTime.milliseconds)
    logInfo("Started JobGenerator at " + startTime)
  }
  1. Starts the DStreamGraph: graph.start(startTime - graph.batchDuration)
  2. Starts the RecurringTimer: timer.start(startTime.milliseconds)
2.5.1 Following graph.start(startTime - graph.batchDuration)
  • Enter org.apache.spark.streaming.DStreamGraph.scala
  def start(time: Time) {
    this.synchronized {
      require(zeroTime == null, "DStream graph computation already started")
      zeroTime = time
      startTime = time
      outputStreams.foreach(_.initialize(zeroTime))
      outputStreams.foreach(_.remember(rememberDuration))
      outputStreams.foreach(_.validateAtStart)
      inputStreams.par.foreach(_.start())
    }
  }
  1. Initializes the outputStreams
  2. Starts the inputStreams
  • Enter org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.scala
  override def start(): Unit = {
    val c = consumer
    paranoidPoll(c)
    if (currentOffsets.isEmpty) {
      currentOffsets = c.assignment().asScala.map { tp =>
        tp -> c.position(tp)
      }.toMap
    }

    // don't actually want to consume any messages, so pause all partitions
    c.pause(currentOffsets.keySet.asJava)
  }
  1. Creates the KafkaConsumer
  2. Polls once (paranoidPoll), which updates the consumer's positions
  3. Sets currentOffsets to the consumer's current offsets
  4. Pauses consumption on all assigned partitions, since no messages should actually be consumed here

Note: how the KafkaConsumer itself consumes data is left for a later look at the Kafka source; the sketch below shows the equivalent calls against the plain consumer API.
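An illustrative sketch, not Spark source, assuming a broker at localhost:9092 and the topic/group from the entry-point example; it mirrors the poll-then-record-positions-then-pause sequence above using the Kafka 0.10 client directly:

  import java.util.Properties
  import scala.collection.JavaConverters._
  import org.apache.kafka.clients.consumer.KafkaConsumer

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")   // assumption: local broker
  props.put("group.id", "exampleGroup")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("enable.auto.commit", "false")

  val c = new KafkaConsumer[String, String](props)
  c.subscribe(java.util.Arrays.asList("exampleTopic"))
  c.poll(0)                                           // one poll so partitions get assigned (cf. paranoidPoll)
  val currentOffsets = c.assignment().asScala.map(tp => tp -> c.position(tp)).toMap
  c.pause(currentOffsets.keySet.asJava)               // track offsets without consuming records
  println(currentOffsets)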

2.5.2 Following timer.start(startTime.milliseconds)
  • Enter org.apache.spark.streaming.util.RecurringTimer.scala
  /**
   * Start at the given start time.
   */
  def start(startTime: Long): Long = synchronized {
    nextTime = startTime
    thread.start()
    logInfo("Started timer for " + name + " at time " + nextTime)
    nextTime
  }

  private val thread = new Thread("RecurringTimer - " + name) {
    setDaemon(true)
    override def run() { loop }
  }

  /**
   * Repeatedly call the callback every interval.
   */
  private def loop() {
    try {
      while (!stopped) {
        triggerActionForNextInterval()
      }
      triggerActionForNextInterval()
    } catch {
      case e: InterruptedException =>
    }
  }

  private def triggerActionForNextInterval(): Unit = {
    clock.waitTillTime(nextTime)
    callback(nextTime)
    prevTime = nextTime
    nextTime += period
    logDebug("Callback for " + name + " called at time " + prevTime)
  }
  1. Creates and starts a new daemon thread
  2. The new thread repeatedly calls triggerActionForNextInterval in a loop
  3. triggerActionForNextInterval waits until the next batch time and then invokes the callback() supplied when the RecurringTimer was constructed (see the timing sketch after this list), i.e.:
  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
  4. Inside this callback, the EventLoop[JobGeneratorEvent] puts a GenerateJobs event (a JobGeneratorEvent) onto its eventQueue
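For intuition, the nextTime the timer first waits for is the first batch boundary after "now"; a hedged approximation of what timer.getStartTime() computes (the function name nextBatchBoundary is illustrative):

  // Round "now" up to the next multiple of the batch interval (an approximation of
  // RecurringTimer.getStartTime(); the exact Spark implementation may differ slightly).
  def nextBatchBoundary(nowMs: Long, batchMs: Long): Long =
    (nowMs / batchMs + 1) * batchMs

  // nextBatchBoundary(1000123L, 30000L) == 1020000L
  // afterwards nextTime advances by `period` (the batch interval) on every callback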

2.6 Next, following eventLoop.start() from 2.4

  • Enter org.apache.spark.util.EventLoop.scala
  def start(): Unit = {
    if (stopped.get) {
      throw new IllegalStateException(name + " has already been stopped")
    }
    // Call onStart before starting the event thread to make sure it happens before onReceive
    onStart()
    eventThread.start()
  }
  1. Starts the eventThread thread
  2. Which executes its run() method
  private val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) =>
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }
  }
  1. eventThread watches eventLoop.eventQueue, continuously taking events from the queue,
  2. and calls eventLoop.onReceive(event) to handle each one.

Note:
in 2.5.2, triggerActionForNextInterval already posted a JobGeneratorEvent onto the eventQueue via the RecurringTimer's callback(),
so the event taken and handled here is exactly that GenerateJobs (JobGeneratorEvent).

2.7 The definition of onReceive(event)

  • Back in org.apache.spark.streaming.scheduler.JobGenerator.scala
    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }

onReceive() delegates to JobGenerator.processEvent(JobGeneratorEvent) to handle the event.

  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }

JobGenerator.processEvent(JobGeneratorEvent) handles four kinds of events:

  • GenerateJobs
  • ClearMetadata
  • DoCheckpoint
  • ClearCheckpointData

The event taken earlier from eventLoop.eventQueue was GenerateJobs (a JobGeneratorEvent), which matches the first case.

2.8 Following generateJobs(time)

  • Enter org.apache.spark.streaming.scheduler.JobGenerator.scala
  private def generateJobs(time: Time) {
    // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
    // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
    ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
    Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
        PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }
  1. The receiverTracker allocates the received blocks to the current batch
  2. The DStreamGraph generates the Jobs for the current batch from the allocated blocks
  3. If Job generation succeeds, the JobSet is submitted
2.8.1 Following jobScheduler.receiverTracker.allocateBlocksToBatch(time)
  • Enter org.apache.spark.streaming.scheduler.ReceiverTracker.scala
  def allocateBlocksToBatch(batchTime: Time): Unit = {
    if (receiverInputStreams.nonEmpty) {
      receivedBlockTracker.allocateBlocksToBatch(batchTime)
    }
  }
  1. allocateBlocksToBatch only does work when receiverInputStreams is non-empty
  2. In this example the input is a DirectKafkaInputDStream, so there is no ReceiverInputDStream and nothing is allocated
2.8.2 Following graph.generateJobs(time)
  • Enter org.apache.spark.streaming.DStreamGraph.scala
  def generateJobs(time: Time): Seq[Job] = {
    logDebug("Generating jobs for time " + time)
    val jobs = this.synchronized {
      outputStreams.flatMap { outputStream =>
        val jobOption = outputStream.generateJob(time)
        jobOption.foreach(_.setCallSite(outputStream.creationSite))
        jobOption
      }
    }
    logDebug("Generated " + jobs.length + " jobs for time " + time)
    jobs
  }
  1. Iterates over the outputStreams; in this example there is only a ForEachDStream
  2. Each outputStream generates (at most) one Job for the batch

Because ForEachDStream overrides generateJob, it is ForEachDStream.generateJob() that runs:

  • Enter org.apache.spark.streaming.dstream.ForEachDStream.scala
  override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }
  1. Calls its parent's getOrCompute(time) method
  2. If an RDD comes back, builds a jobFunc that wraps foreachFunc(rdd, time) inside createRDDWithLocalProperties (a curried helper)
  3. Returns new Job(time, jobFunc)

In this example, the parent stored in the ForEachDStream instance is a FilteredDStream.

FilteredDStream does not override getOrCompute(time), so the version in its superclass DStream runs:

  • Enter org.apache.spark.streaming.dstream.DStream.scala
  private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
    // If RDD was already generated, then retrieve it from HashMap,
    // or else compute the RDD
    generatedRDDs.get(time).orElse {
      // Compute the RDD if time is valid (e.g. correct time in a sliding window)
      // of RDD generation, else generate nothing.
      if (isTimeValid(time)) {

        val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details. We need to have this call here because
          // compute() might cause Spark jobs to be launched.
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            compute(time)
          }
        }

        rddOption.foreach { case newRDD =>
          // Register the generated RDD for caching and checkpointing
          if (storageLevel != StorageLevel.NONE) {
            newRDD.persist(storageLevel)
            logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
          }
          if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
            newRDD.checkpoint()
            logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
          }
          generatedRDDs.put(time, newRDD)
        }
        rddOption
      } else {
        None
      }
    }
  }
  1. Looks up the RDD for this time in generatedRDDs (a HashMap[Time, RDD[T]])
  2. If it is not there, calls compute(time) to build it
  3. Puts the newly created RDD into generatedRDDs
  • Enter org.apache.spark.streaming.dstream.FilteredDStream.scala
  override def compute(validTime: Time): Option[RDD[T]] = {
    parent.getOrCompute(validTime).map(_.filter(filterFunc))
  }
  1. Calls its parent.getOrCompute(validTime)
  2. Maps over the returned Option, applying filter(filterFunc) to the RDD

Likewise, in this example the parent of the FilteredDStream is a TransformedDStream.

  • Enter org.apache.spark.streaming.dstream.TransformedDStream.scala
  override def compute(validTime: Time): Option[RDD[U]] = {
    val parentRDDs = parents.map { parent => parent.getOrCompute(validTime).getOrElse(
      // Guard out against parent DStream that return None instead of Some(rdd) to avoid NPE
      throw new SparkException(s"Couldn't generate RDD from parent at time $validTime"))
    }
    val transformedRDD = transformFunc(parentRDDs, validTime)
    if (transformedRDD == null) {
      throw new SparkException("Transform function must not return null. " +
        "Return SparkContext.emptyRDD() instead to represent no element " +
        "as the result of transformation.")
    }
    Some(transformedRDD)
  }
  1. Calls parent.getOrCompute(validTime) on each parent
  2. Applies transformFunc(parentRDDs, validTime) to the returned RDDs

In this example, the parent of the TransformedDStream is the DirectKafkaInputDStream.

  • Enter org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.scala
  override def compute(validTime: Time): Option[KafkaRDD[K, V]] = {
    val untilOffsets = clamp(latestOffsets())
    val offsetRanges = untilOffsets.map { case (tp, uo) =>
      val fo = currentOffsets(tp)
      OffsetRange(tp.topic, tp.partition, fo, uo)
    }
    val rdd = new KafkaRDD[K, V](
      context.sparkContext, executorKafkaParams, offsetRanges.toArray, getPreferredHosts, true)

    // Report the record number and metadata of this batch interval to InputInfoTracker.
    val description = offsetRanges.filter { offsetRange =>
      // Don't display empty ranges.
      offsetRange.fromOffset != offsetRange.untilOffset
    }.map { offsetRange =>
      s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
        s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
    }.mkString("\n")
    // Copy offsetRanges to immutable.List to prevent from being modified by the user
    val metadata = Map(
      "offsets" -> offsetRanges.toList,
      StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
    val inputInfo = StreamInputInfo(id, rdd.count, metadata)
    ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

    currentOffsets = untilOffsets
    commitAll()
    Some(rdd)
  }
  1. Computes the OffsetRanges and creates the KafkaRDD
  2. Reports the record count and metadata of this batch to the InputInfoTracker
  3. Advances currentOffsets and calls commitAll(), which flushes to Kafka any offsets previously queued through the CanCommitOffsets.commitAsync API

Note: how the KafkaConsumer commits offsets is left for a later look at the Kafka source. The sketch below shows the usual user-side pattern that feeds commitAll().
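A hedged sketch of that user-side pattern, in the spirit of the spark-streaming-kafka-0-10 integration guide (the stream value stands for a Scala direct-stream equivalent of the Java inputDStream in the entry-point example):

  import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

  // stream: DStream[ConsumerRecord[String, String]] created by KafkaUtils.createDirectStream
  stream.foreachRDD { rdd =>
    // The offset ranges that DirectKafkaInputDStream.compute() baked into this KafkaRDD
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // ... process rdd ...

    // Queue the ranges; they are actually sent to Kafka by commitAll() on a later batch
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }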

To recap:

1. In this example, the DStreamGraph holds the inputStreams (DirectKafkaInputDStream) and the outputStreams (ForEachDStream)

2. While the processing chain was being built, each DStream kept its upstream operation in its parent field, forming the chain 'DirectKafkaInputDStream - TransformedDStream - FilteredDStream - ForEachDStream'

3. The DStreamGraph starts Job creation from the outputStreams
   -- ForEachDStream calls parent.getOrCompute(time) in generateJob()
   -- FilteredDStream calls parent.getOrCompute(time) in compute()
   -- TransformedDStream calls parent.getOrCompute(time) in compute()
   -- DirectKafkaInputDStream creates the KafkaRDD in compute()

4. The calls then unwind back through each parent.getOrCompute(time) call site
   -- TransformedDStream applies transformFunc to the KafkaRDD, producing newRDD1
   -- FilteredDStream applies filterFunc to newRDD1, producing newRDD2
   -- ForEachDStream wraps newRDD2 in a jobFunc that will call foreachFunc(newRDD2, time) when the Job runs; along the way each DStream caches its new RDD in its own generatedRDDs

5. Job creation completes (new Job(time, jobFunc)); a toy model of this recursion is sketched below
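A toy model (entirely made-up names, not Spark code) of that memoized recursion: each node caches the value it produced for a batch time in its own generatedRDDs-like map and asks its parent first.

  import scala.collection.mutable

  abstract class ToyDStream[T] {
    private val generated = mutable.HashMap[Long, T]()    // stands in for generatedRDDs
    protected def compute(time: Long): Option[T]
    final def getOrCompute(time: Long): Option[T] =
      generated.get(time).orElse {
        val r = compute(time)
        r.foreach(generated.put(time, _))                 // memoize the result for this batch time
        r
      }
  }

  // Source node: stands in for DirectKafkaInputDStream producing a KafkaRDD
  class ToySource extends ToyDStream[Seq[String]] {
    protected def compute(time: Long): Option[Seq[String]] = Some(Seq(s"record@$time", ""))
  }

  // Filter node: stands in for FilteredDStream, delegating to its parent first
  class ToyFiltered(parent: ToyDStream[Seq[String]]) extends ToyDStream[Seq[String]] {
    protected def compute(time: Long): Option[Seq[String]] =
      parent.getOrCompute(time).map(_.filter(_.nonEmpty))
  }

  // new ToyFiltered(new ToySource).getOrCompute(30000L)   // Some(List(record@30000))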

2.8.3 Following jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
  • Enter org.apache.spark.streaming.scheduler.JobScheduler.scala
  def submitJobSet(jobSet: JobSet) {
    if (jobSet.jobs.isEmpty) {
      logInfo("No jobs added for time " + jobSet.time)
    } else {
      listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
      jobSets.put(jobSet.time, jobSet)
      jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
      logInfo("Added jobs for time " + jobSet.time)
    }
  }
  1. Wraps each Job in a JobHandler
  2. Submits each JobHandler to the jobExecutor thread pool for execution

jobExecutor is a thread pool:
private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
private val jobExecutor = ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
Each wrapped JobHandler has its run() method executed on a thread from this jobExecutor pool.

Note:
1. This still runs on the driver. A Spark Streaming batch backlog is exactly the set of JobHandlers sitting in this pool's internal work queue, waiting for a thread.

2. spark.streaming.concurrentJobs sets both the corePoolSize and the maximumPoolSize of the pool, so setting it above 1 lets multiple JobHandlers run concurrently. A sketch of such a pool follows.
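A hedged sketch of what ThreadUtils.newDaemonFixedThreadPool amounts to (the helper below is a stand-in, not Spark's implementation): a fixed number of daemon threads in front of an unbounded LinkedBlockingQueue, which is where a backlog accumulates.

  import java.util.concurrent.atomic.AtomicInteger
  import java.util.concurrent.{ExecutorService, Executors, ThreadFactory}

  def daemonFixedThreadPool(nThreads: Int, prefix: String): ExecutorService = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$prefix-${counter.incrementAndGet()}")
        t.setDaemon(true)                    // daemon threads, like "streaming-job-executor"
        t
      }
    }
    // newFixedThreadPool uses an unbounded LinkedBlockingQueue internally,
    // so unprocessed JobHandlers simply queue up here when batches fall behind.
    Executors.newFixedThreadPool(nThreads, factory)
  }

  // val jobExecutor = daemonFixedThreadPool(ssc.conf.getInt("spark.streaming.concurrentJobs", 1), "streaming-job-executor")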

  • Enter org.apache.spark.streaming.scheduler.JobScheduler.scala
  private class JobHandler(job: Job) extends Runnable with Logging {
    import JobScheduler._

    def run() {
      val oldProps = ssc.sparkContext.getLocalProperties
      try {
        ssc.sparkContext.setLocalProperties(SerializationUtils.clone(ssc.savedProperties.get()))
        val formattedTime = UIUtils.formatBatchTime(
          job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
        val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
        val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

        ssc.sc.setJobDescription(
          s"""Streaming job from $batchLinkText""")
        ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
        ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)
        // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
        // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
        ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")

        // We need to assign `eventLoop` to a temp variable. Otherwise, because
        // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
        // it's possible that when `post` is called, `eventLoop` happens to null.
        var _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details.
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            job.run()
          }
          _eventLoop = eventLoop
          if (_eventLoop != null) {
            _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
          }
        } else {
          // JobScheduler has been stopped.
        }
      } finally {
        ssc.sparkContext.setLocalProperties(oldProps)
      }
    }
  }

In the run() method:

  1. The Job is wrapped in a JobStarted event (a JobSchedulerEvent) and posted to the eventQueue
  2. The job runs: job.run()
  3. The Job is wrapped in a JobCompleted event (a JobSchedulerEvent) and posted to the eventQueue
  4. Since eventLoop.start() has already been called, the JobScheduler's EventLoop keeps taking these events from the eventQueue and handles each JobSchedulerEvent via onReceive
  • Enter org.apache.spark.streaming.scheduler.Job.scala
  def run() {
    _result = Try(func())
  }

job.run() executes func(), the function supplied when the Job was created, i.e. the jobFunc built in 2.8.2 inside org.apache.spark.streaming.dstream.ForEachDStream.scala:

  override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }

jobFunc executes foreachFunc(rdd, time); in this example foreachFunc is the body of the VoidFunction passed to foreachRDD:
javaRDD.saveAsTextFile("/home/example/result/");

Internally, saveAsTextFile calls self.context.runJob(self, writeToFile) to run the job;
this runJob is SparkContext.runJob, i.e. the ordinary Spark job-submission path.

At this point, Spark Streaming connects up with core Spark.

Summary

  1. Start the JobScheduler, which internally starts an EventLoop for handling JobSchedulerEvent events
  2. Start the JobGenerator, which internally starts an EventLoop for handling JobGeneratorEvent events
  3. Start the RecurringTimer, which periodically posts GenerateJobs (a JobGeneratorEvent) onto the eventQueue
  4. A dedicated thread keeps taking GenerateJobs from the eventQueue and hands it to the JobGenerator's EventLoop for processing
  5. The DStreamGraph and the previously built processing chain are used to generate the Jobs
  6. Each Job is wrapped in a JobHandler and submitted to a thread pool (on the driver) to queue for execution
  7. Inside JobHandler.run(), a JobStarted event (a JobSchedulerEvent) is posted to the eventQueue to mark the start of the job
  8. Inside JobHandler.run(), job.run() invokes the jobFunc captured when the Job was built, which in turn calls SparkContext.runJob to submit the work (the hand-off to core Spark)
  9. Inside JobHandler.run(), a JobCompleted event (a JobSchedulerEvent) is posted to the eventQueue to mark the completion of the job
