Spark Streaming: How It Works

Introduction to Spark Streaming

Spark Streaming is an extension of the Spark Core API for large-scale, high-throughput, fault-tolerant processing of real-time data streams. It can ingest data from many sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or plain TCP sockets, and process it with complex algorithms expressed through high-level operators such as map, reduce, join, and window. The processed results can be written to file systems, databases, live dashboards, and other sinks.
[Figure 1]
Spark Streaming has the following characteristics:

  • Highly scalable: it can run on hundreds of machines (scales to hundreds of nodes)
  • Low latency: data is processed at second-level latency (achieves low latency)
  • Highly fault tolerant (efficiently recovers from failures)
  • Integrates with parallel batch and interactive processing, such as Spark Core (integrates with batch and interactive processing)

Basic Working Principle of Spark Streaming

Spark Streaming works as follows internally: it receives a live input data stream and splits it into batches, for example packaging every second of collected data into one batch. Each batch is handed to the Spark engine for processing, and the result is itself a stream of data made up of batch-by-batch results.
[Figure 2]

As the figure shows, the input data is divided into batches for processing. Strictly speaking, Spark Streaming is therefore not a true record-at-a-time real-time framework, because it processes data in micro-batches.
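
As a minimal, hedged sketch of this micro-batch model (the application name, hostname and port below are placeholders, not part of the original walk-through): a StreamingContext created with a 1-second batch duration cuts the socket stream into 1-second batches, and each output operation triggers one Spark job per batch.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchWordCount").setMaster("local[2]")
    // every 1 second of received data is packaged into one batch
    val ssc = new StreamingContext(conf, Seconds(1))

    // hypothetical source: a TCP socket on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // output operation: one job is generated for every batch

    ssc.start()
    ssc.awaitTermination()
  }
}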

DStream

Spark Streaming provides a high-level abstraction called DStream (Discretized Stream), which represents a continuous stream of data. A DStream can be created from an input source such as Kafka, Flume, or Kinesis, or by applying high-level operators such as map, reduce, join, and window to other DStreams.

Internally, a DStream is a series of continuously generated RDDs. The RDD is Spark Core's core abstraction: an immutable, distributed dataset. Each RDD in a DStream contains the data of one time interval.
[Figure 3]
An operator applied to a DStream, such as map, is translated under the hood into operations on each RDD of that DStream. Running map on a DStream produces a new DStream: internally, the map is applied to the RDD of each time interval of the input DStream, and each resulting RDD becomes the RDD of the corresponding interval in the new DStream. The underlying RDD transformations are still executed by Spark Core's engine; Spark Streaming is a layer on top of Spark Core that hides these details and exposes a convenient, high-level API to developers.
[Figure 4]
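
To make this map-per-RDD translation concrete, here is a small sketch (it assumes a DStream[String] named lines already exists, for example from socketTextStream): the two results are equivalent, because a DStream operator is applied to the RDD of every batch interval.

// high-level DStream operator: translated into a map over each batch's RDD
val upper1 = lines.map(_.toUpperCase)

// the same thing written explicitly against the per-batch RDD via transform
val upper2 = lines.transform(rdd => rdd.map(_.toUpperCase))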

Architecture Analysis

[Figure 5]

  1. After the client submits the application, the Driver is started; the Driver is the master of the Spark application.
  2. Each application runs multiple Executors; each Executor runs tasks as threads, and a Spark Streaming application contains at least one receiver task.
  3. After receiving data, the Receiver generates Blocks, reports the BlockIds to the Driver, and replicates the blocks to another Executor.
  4. ReceiverTracker maintains the BlockIds reported by the Receivers.
  5. The Driver periodically triggers the JobGenerator, which generates logical RDDs from the DStream lineage, creates a JobSet, and hands it to the JobScheduler.
  6. The JobScheduler schedules the JobSet and hands it to the DAGScheduler; the DAGScheduler generates stages from the logical RDDs, and each stage contains one or more tasks.
  7. The TaskScheduler schedules the tasks onto Executors and tracks their running state.
  8. A batch is complete only when all of its tasks, stages, and the JobSet have finished.

Spark Streaming decomposes the streaming computation into a series of short batch jobs. The batch engine is Spark Core: the input data is divided by batch size (for example, 1 second) into segments (the Discretized Stream), each segment is converted into a Spark RDD (Resilient Distributed Dataset), the Transformation operations on DStreams are translated into Transformation operations on the corresponding RDDs, and the intermediate results of those RDD operations are kept in memory.

Source Code Analysis

StreamingContext initialization

Step 1: StreamingContext
Source: org.apache.spark.streaming.StreamingContext.scala

class StreamingContext private[streaming] (
    _sc: SparkContext,
    _cp: Checkpoint,
    _batchDur: Duration
  ) extends Logging {

  .......
  
  /**
   * The key component DStreamGraph:
   * it records the dependencies between the DStreams of a Spark Streaming
   * application and the operators applied between them
   */
  private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      _cp.graph.setContext(this)
      _cp.graph.restoreCheckpointData()
      _cp.graph
    } else {
      require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(_batchDur)
      newGraph
    }
  }

  //core component involved in job scheduling;
  //underneath it still relies on Spark's core computation engine
  private[streaming] val scheduler = new JobScheduler(this)

JobScheduler is mainly responsible for the following tasks:

  • Initializing and starting the data-receiving components
    Initialization and startup of ReceiverTracker, which manages Receivers: starting and stopping them and maintaining their state.
  • Starting the job-generation components
    Startup of JobGenerator, which generates a Job once per batch interval.
  • Registering and starting the Streaming listeners
  • Job monitoring
  • Backpressure
    The BackPressure mechanism, which controls the data ingestion rate through a RateController (see the configuration sketch after this list).
  • Starting Executor DynamicAllocation
    Dynamic Executor scaling: adding or removing Executors to keep the system running stably or to reduce resource usage.
  • Scheduling Jobs and maintaining their state.
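
As a hedged configuration sketch for the backpressure and dynamic-allocation responsibilities listed above (the key names are the commonly documented ones; verify them against your Spark version):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("SchedulerFeatures")
  // backpressure: let the RateController adapt the ingestion rate to the processing rate
  .set("spark.streaming.backpressure.enabled", "true")
  // hard upper bound on records per second per receiver
  .set("spark.streaming.receiver.maxRate", "10000")
  // streaming-specific executor dynamic allocation (ExecutorAllocationManager)
  .set("spark.streaming.dynamicAllocation.enabled", "true")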

Step 2: After the StreamingContext has been created and initialized, and key components such as DStreamGraph and JobScheduler have been built,
a method such as StreamingContext.socketTextStream is called to create the input DStream.

  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }
  def socketStream[T: ClassTag](
      hostname: String,
      port: Int,
      converter: (InputStream) => Iterator[T],
      storageLevel: StorageLevel
    ): ReceiverInputDStream[T] = {
    //create an instance of SocketInputDStream, a concrete subclass of DStream
    new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
  }

Step 3: SocketInputDStream

private[streaming]
class SocketInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](_ssc) {

  /**
   * Every input DStream has one important method, getReceiver(),
   * which returns the Receiver for this DStream
   */
  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}

Receiver startup
[Figure 6]

Step 1: Once the StreamingContext has been initialized, StreamingContext.start() is called

  def start(): Unit = synchronized {
    state match {
      case INITIALIZED =>
        startSite.set(DStream.getCreationSite())
        StreamingContext.ACTIVATION_LOCK.synchronized {
          StreamingContext.assertNoOtherContextIsActive()
          try {
            validate()

            // Start the streaming scheduler in a new thread, so that thread local properties
            // like call sites and job groups can be reset without affecting those of the
            // current thread.
            ThreadUtils.runInNewThread("streaming-start") {
              sparkContext.setCallSite(startSite.get)
              sparkContext.clearJobGroup()
              sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
              savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
              //core code: start a child thread, both for local initialization
              //and to avoid blocking the main thread; it calls JobScheduler.start()
              scheduler.start()
            }
            state = StreamingContextState.ACTIVE
            scheduler.listenerBus.post(
              StreamingListenerStreamingStarted(System.currentTimeMillis()))
          } catch {
            case NonFatal(e) =>
              logError("Error starting the context, marking it as stopped", e)
              scheduler.stop(false)
              state = StreamingContextState.STOPPED
              throw e
          }
          StreamingContext.setActiveContext(this)
        }
        logDebug("Adding shutdown hook") // force eager creation of logger
        shutdownHookRef = ShutdownHookManager.addShutdownHook(
          StreamingContext.SHUTDOWN_HOOK_PRIORITY)(() => stopOnShutdown())
        // Registering Streaming Metrics at the start of the StreamingContext
        assert(env.metricsSystem != null)
        env.metricsSystem.registerSource(streamingSource)
        uiTab.foreach(_.attach())
        logInfo("StreamingContext started")
      case ACTIVE =>
        logWarning("StreamingContext has already been started")
      case STOPPED =>
        throw new IllegalStateException("StreamingContext has already been stopped")
    }
  }

Step 2: The scheduler.start() method

  def start(): Unit = synchronized {
    if (eventLoop != null) return // scheduler has already been started

    logDebug("Starting JobScheduler")
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    // start the message loop thread that handles the JobScheduler's events
    eventLoop.start()

    // attach rate controllers of input streams to receive batch completion updates
    for {
      inputDStream <- ssc.graph.getInputStreams
      rateController <- inputDStream.rateController
    } ssc.addStreamingListener(rateController)

    listenerBus.start()
    //ReceiverTracker, the core component for data receiving
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)

    val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
      case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
      case _ => null
    }

    executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
      executorAllocClient,
      receiverTracker,
      ssc.conf,
      ssc.graph.batchDuration.milliseconds,
      clock)
    executorAllocationManager.foreach(ssc.addStreamingListener)
    //start it
    receiverTracker.start()
    //JobGenerator was already created when the JobScheduler was constructed
    //start it
    jobGenerator.start()
    executorAllocationManager.foreach(_.start())
    logInfo("Started JobScheduler")
  }

Step 3: receiverTracker.start()

  def start(): Unit = synchronized {
    if (isTrackerStarted) {
      throw new SparkException("ReceiverTracker already started")
    }
   //Receivers are started only if there are receiver input streams
    if (!receiverInputStreams.isEmpty) {
      endpoint = ssc.env.rpcEnv.setupEndpoint(
        "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv)) //endpoint for reporting status
        //the internal launchReceivers() method starts the Receivers
      if (!skipReceiverLaunch) launchReceivers()
      logInfo("ReceiverTracker started")
      trackerState = Started
    }
  }

[Figure 7]
Step 4: The launchReceivers() method

 //every DStream created through the StreamingContext is added to the DStreamGraph's receiver input streams
  private val receiverInputStreams = ssc.graph.getReceiverInputStreams() 

 private def launchReceivers(): Unit = {
     //for every DStream created in the program, call getReceiver() to obtain the set of receivers
    val receivers = receiverInputStreams.map { nis =>
      //one input source (ReceiverInputDStream) produces exactly one Receiver
      val rcvr = nis.getReceiver()
      rcvr.setReceiverId(nis.id)
      rcvr
    }
    
    //run a dummy job so that Receivers get distributed across different executors
    runDummySparkJob()

    logInfo("Starting " + receivers.length + " receivers")
     //send a StartAllReceivers(receivers) message to the message-handling endpoint
    endpoint.send(StartAllReceivers(receivers))
  }

[Figure 8]
Step 5: runDummySparkJob()

  //make sure all nodes are alive and avoid placing all receivers on a single node
  private def runDummySparkJob(): Unit = {
    if (!ssc.sparkContext.isLocal) {
      ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
    }
    assert(getExecutors.nonEmpty)
  }

Step 6: The StartAllReceivers message

  // schedulingPolicy: the receiver scheduling policy
  private val schedulingPolicy = new ReceiverSchedulingPolicy()

    override def receive: PartialFunction[Any, Unit] = {
      // Local messages
      case StartAllReceivers(receivers) =>
        /**
         * scheduleReceivers determines which Executors each receiver may run on
         * receivers: the receivers to start
         * getExecutors: the list of Executors in the cluster
         */
        val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
        for (receiver <- receivers) {
          // scheduledLocations, keyed by receiver id, tells us which Executors can run this Receiver
          val executors = scheduledLocations(receiver.streamId)
          updateReceiverScheduledExecutors(receiver.streamId, executors)
         // save the preferred location of this receiver
          receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
          //loop over the receivers, starting them one at a time
          startReceiver(receiver, executors)
        }
      //restart a Receiver
      case RestartReceiver(receiver) =>
        // Old scheduled executors minus the ones that are not active any more
        //if the Receiver failed, remove it from the candidate list
        //(when the Receiver was scheduled it was given a list of candidate Executors)
        val oldScheduledExecutors = getStoredScheduledExecutors(receiver.streamId)
        //obtain the Executors again
        val scheduledLocations = if (oldScheduledExecutors.nonEmpty) {
            // Try global scheduling again
            oldScheduledExecutors
          } else {
            //if the candidate Executors are exhausted, call rescheduleReceiver to obtain new ones
            val oldReceiverInfo = receiverTrackingInfos(receiver.streamId)
            // Clear "scheduledLocations" to indicate we are going to do local scheduling
            val newReceiverInfo = oldReceiverInfo.copy(
              state = ReceiverState.INACTIVE, scheduledLocations = None)
            receiverTrackingInfos(receiver.streamId) = newReceiverInfo
            schedulingPolicy.rescheduleReceiver(
              receiver.streamId,
              receiver.preferredLocation,
              receiverTrackingInfos,
              getExecutors)
          }
        // Assume there is one receiver restarting at one time, so we don't need to update
        // receiverTrackingInfos
        //call startReceiver again
        startReceiver(receiver, scheduledLocations)
      case c: CleanupOldBlocks =>
        receiverTrackingInfos.values.flatMap(_.endpoint).foreach(_.send(c))
      case UpdateReceiverRateLimit(streamUID, newRate) =>
        for (info <- receiverTrackingInfos.get(streamUID); eP <- info.endpoint) {
          eP.send(UpdateRateLimit(newRate))
        }
      // Remote messages
      case ReportError(streamId, message, error) =>
        reportError(streamId, message, error)
    }

As the comments show, Spark Streaming itself decides which Executors the receivers run on; this is not left to Spark Core's task placement.

Step 7: schedulingPolicy.scheduleReceivers(receivers, getExecutors)

  /**
 * scheduleReceivers first converts the executor list into a Map with entries of the form host -> "host:port".
 * It then iterates over the receivers and assigns a node to each one as follows:
 *
 * Take a receiver from the list and check whether it has a preferredLocation (a preferred host).
 * This method is declared on the Receiver base class and must be implemented by subclasses.
 * SocketReceiver in this example does not override it, so it has no preferredLocation and the
 * Executor that runs the Receiver can be chosen freely.
 *
 * From this we can conclude that one receiver corresponds to one Executor.
   */
    def scheduleReceivers(
      receivers: Seq[Receiver[_]],
      executors: Seq[ExecutorCacheTaskLocation]): Map[Int, Seq[TaskLocation]] = {
    //ExecutorCacheTaskLocation is a subclass of TaskLocation
    if (receivers.isEmpty) {
      return Map.empty
    }

    if (executors.isEmpty) {
       //if no type parameter is given, it defaults to Nothing
      return receivers.map(_.streamId -> Seq.empty).toMap
    }

    //groupBy() groups executors by host, returning Map[host, List[ExecutorCacheTaskLocation]]
    val hostToExecutors = executors.groupBy(_.host)
    //one ArrayBuffer[TaskLocation] per receiver, so the array has receivers.length elements
    val scheduledLocations = Array.fill(receivers.length)(new mutable.ArrayBuffer[TaskLocation])
    /**
     * ExecutorCacheTaskLocation means the data is cached in that executor's memory, i.e. the
     * partition's records are cached there; KafkaRDD, for example, caches partitions in memory,
     * and its toString returns the format executor_$host_$executorId
     */
    val numReceiversOnExecutor = mutable.HashMap[ExecutorCacheTaskLocation, Int]()
    // Set the initial value to 0
    //initialize the count of every key (ExecutorCacheTaskLocation) in the HashMap to 0
    executors.foreach(e => numReceiversOnExecutor(e) = 0)

    // Firstly, we need to respect "preferredLocation". So if a receiver has "preferredLocation",
    // we need to make sure the "preferredLocation" is in the candidate scheduled executor list.
    /**
     * Check whether a Receiver has a preferredLocation (data locality); if it does, a receiver with
     * locality must be placed on an executor in the candidate scheduling list.
     * SocketReceiver does not implement this method and returns None, so the
     * receivers(i).preferredLocation.foreach block is never entered for it.
     */
    for (i <- 0 until receivers.length) {
      // Note: preferredLocation is host but executors are host_executorId
      //the locality hint names only the host, while executors are identified as host_executorId
      receivers(i).preferredLocation.foreach { host =>
        hostToExecutors.get(host) match {
           //if executorsOnHost has a value, it is a List[ExecutorCacheTaskLocation]
          case Some(executorsOnHost) =>
            // preferredLocation is a known host. Select an executor that has the least receivers in
            // this host
            //minBy returns the element of executorsOnHost with the smallest numReceiversOnExecutor count
            val leastScheduledExecutor =
              executorsOnHost.minBy(executor => numReceiversOnExecutor(executor))
            scheduledLocations(i) += leastScheduledExecutor
            numReceiversOnExecutor(leastScheduledExecutor) =
              numReceiversOnExecutor(leastScheduledExecutor) + 1
          case None =>
            // preferredLocation is an unknown host.
            // Note: There are two cases:
            // 1. This executor is not up. But it may be up later.
            // 2. This executor is dead, or it's not a host in the cluster.
            // Currently, simply add host to the scheduled executors.

            // Note: host could be `HDFSCacheTaskLocation`, so use `TaskLocation.apply` to handle
            // this case
            scheduledLocations(i) += TaskLocation(host)
        }
      }
    }

    // For those receivers that don't have preferredLocation, make sure we assign at least one
    // executor to them.
   //for receivers that have no preferredLocation, make sure at least one executor is assigned to them
    for (scheduledLocationsForOneReceiver <- scheduledLocations.filter(_.isEmpty)) {
      // Select the executor that has the least receivers
      //take the entry of numReceiversOnExecutor (a mutable.HashMap[ExecutorCacheTaskLocation, Int])
      //with the smallest value; a map's key/value pairs can be treated as tuples
      val (leastScheduledExecutor, numReceivers) = numReceiversOnExecutor.minBy(_._2)
      scheduledLocationsForOneReceiver += leastScheduledExecutor
      numReceiversOnExecutor(leastScheduledExecutor) = numReceivers + 1
    }

    // Assign idle executors to receivers that have less executors
    //assign idle executors to the receivers that currently have the fewest executors
    val idleExecutors = numReceiversOnExecutor.filter(_._2 == 0).map(_._1)
    for (executor <- idleExecutors) {
      // Assign an idle executor to the receiver that has least candidate executors.
      val leastScheduledExecutors = scheduledLocations.minBy(_.size)
      leastScheduledExecutors += executor
    }
    //zip pairs up elements with the same index from the two collections into tuples
    receivers.map(_.streamId).zip(scheduledLocations).toMap
  }

The scheduling process is as follows:

  • Obtain the address information of all executors.
  • Create numReceiversOnExecutor to record how many Receivers have been assigned to each Executor.
  • Create scheduledLocations to record the Receivers whose preferred location was specified by the user.
  • Schedule the Receivers that specify preferredLocation: for each one, among the Executors on the user-specified host, pick the one currently running the fewest Receivers as the launch location, and update scheduledLocations and numReceiversOnExecutor.
  • Schedule the Receivers without preferredLocation: order the Executors by the number of assigned Receivers, from fewest to most, and assign one Executor to each Receiver.
  • If Executors are left over, add them to the Receivers that have the fewest candidate Executors.

Even if a Receiver specifies a preferredLocation and the corresponding host runs an Executor of this application, the Receiver is not guaranteed to be scheduled onto that Executor.

Step 8: startReceiver(receiver, executors)

    private def startReceiver(
        receiver: Receiver[_],
        //scheduledLocations specifies the physical machines to run on
        scheduledLocations: Seq[TaskLocation]): Unit = {
      //check whether the ReceiverTracker's state still allows starting Receivers
      def shouldStartReceiver: Boolean = {
        // It's okay to start when trackerState is Initialized or Started
        !(isTrackerStopping || isTrackerStopped)
      }

      val receiverId = receiver.streamId
      //if the Receiver should not be started, call onReceiverJobFinish()
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
        return
      }

      val checkpointDirOption = Option(ssc.checkpointDir)
      val serializableHadoopConf =
        new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

      // Function to start the receiver on the worker node
      //startReceiverFunc wraps the work of starting the receiver on a worker node
      val startReceiverFunc: Iterator[Receiver[_]] => Unit =
        (iterator: Iterator[Receiver[_]]) => {
          if (!iterator.hasNext) {
            throw new SparkException(
              "Could not start receiver as object not found.")
          }
          if (TaskContext.get().attemptNumber() == 0) {
            val receiver = iterator.next()
            assert(iterator.hasNext == false)
            // ReceiverSupervisorImpl supervises the Receiver and is also responsible for writing its data
            val supervisor = new ReceiverSupervisorImpl(
              receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
            supervisor.start()//start the Receiver
            supervisor.awaitTermination()
          } else {
            //to restart a receiver, the scheduling above has to run again (a new schedule), not a task retry
            // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
          }
        }

      // Create the RDD using the scheduledLocations to run the receiver in a Spark job
      //wrap the Receiver into an RDD according to its preferredLocation
      val receiverRDD: RDD[Receiver[_]] =
        if (scheduledLocations.isEmpty) {
          ssc.sc.makeRDD(Seq(receiver), 1)
        } else {
          val preferredLocations = scheduledLocations.map(_.toString).distinct
          ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
        }
      //as receiverId shows, the RDD contains exactly one receiver
      receiverRDD.setName(s"Receiver $receiverId")
      ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
      ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))
      //each Receiver is started by its own Job, rather than one job's tasks starting all Receivers
      //an application usually has many Receivers; SparkContext.submitJob is called to launch one Spark job per Receiver
      val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
        receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
      // We will keep restarting the receiver job until ReceiverTracker is stopped
      future.onComplete {
        case Success(_) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
        case Failure(e) =>
          //shouldStartReceiver is true by default
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logError("Receiver has been stopped. Try to restart it.", e)
            logInfo(s"Restarting Receiver $receiverId")
           //if the Receiver fails to start, the ReceiverTrackerEndpoint submits a new Spark job to start it again
            self.send(RestartReceiver(receiver))
          }
      //jobs are submitted on a thread pool so that Receivers can be started concurrently
      }(ThreadUtils.sameThread)
      logInfo(s"Receiver ${receiver.streamId} started")
    }

In startReceiver, the Receiver is wrapped into an RDD according to its preferredLocation and submitted as a Spark job. The Receiver runs as a task, in a thread, on an Executor: the task takes the "Receiver" out of the RDD and runs startReceiverFunc on it, which creates a ReceiverSupervisorImpl object that manages the concrete Receiver.

Step 9: new ReceiverSupervisorImpl(receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption).start()

  def start() {
    //create the BlockGenerator and start it
    onStart()
    //start the Receiver
    startReceiver()
  }

Step 10: The onStart() method of ReceiverSupervisorImpl

  override protected def onStart() {
    //runs on the worker's Executor and is responsible for storing the data after it is received
    registeredBlockGenerators.asScala.foreach { _.start() }
  }

Step 11: startReceiver()

  /** Start receiver */
  def startReceiver(): Unit = synchronized {
    try {
    //verify that the Receiver is allowed to start
      if (onReceiverStart()) {
        logInfo(s"Starting receiver $streamId")
        receiverState = Started
        receiver.onStart()
        logInfo(s"Called receiver $streamId onStart")
      } else {
        // The driver refused us
        stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
      }
    } catch {
      case NonFatal(t) =>
        stop("Error starting receiver " + streamId, Some(t))
    }
  }

Step 12: onReceiverStart()

  //register the Receiver's information with the ReceiverTracker and verify that the Receiver is allowed to start
  override protected def onReceiverStart(): Boolean = {
    val msg = RegisterReceiver(
      streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
    trackerEndpoint.askSync[Boolean](msg)
  }

Step 13: receiver.onStart() (using SocketReceiver from SocketInputDStream as the example)

private[streaming]
class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) with Logging {

  private var socket: Socket = _

  def onStart() {

    logInfo(s"Connecting to $host:$port")
    try {
     //create a socket from the host and port passed in by the program
      socket = new Socket(host, port)
    } catch {
      case e: ConnectException =>
        restart(s"Error connecting to $host:$port", e)
        return
    }
    logInfo(s"Connected to $host:$port")

    // Start the thread that receives data over a connection
    //start a background thread that calls receive()
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // in case restart thread close it twice
    synchronized {
      if (socket != null) {
        socket.close()
        socket = null
        logInfo(s"Closed socket to $host:$port")
      }
    }
  }

  /** Create a socket connection and receive data until receiver is stopped */
 //open the socket and start receiving data
  def receive() {
    try {
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next())   //store each received record
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      onStop()
    }
  }
}

Step 14: The store() method

  /**
   * Store a single item of received data to Spark's memory.
   * These single items will be aggregated together into data blocks before
   * being pushed into Spark's memory.
   */
  def store(dataItem: T) {
    supervisor.pushSingle(dataItem)
  }

Step 15: The supervisor's pushSingle() method
Source: org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.scala

private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)

  def pushSingle(data: Any) {
    //defaultBlockGenerator is the BlockGenerator
    defaultBlockGenerator.addData(data)
  }

Step 16: The BlockGenerator.addData(data) method

 /**
   * Push a single data item into the buffer.
   * called from defaultBlockGenerator.addData(data)
   */
  def addData(data: Any): Unit = {
    if (state == Active) {
      waitToPush()
      synchronized {
        if (state == Active) {
          //append to the ArrayBuffer[Any]
          currentBuffer += data
        } else {
          throw new SparkException(
            "Cannot add data as BlockGenerator has not been started or has been stopped")
        }
      }
    } else {
      throw new SparkException(
        "Cannot add data as BlockGenerator has not been started or has been stopped")
    }
  }

Data Receiving Analysis
[Figure 9]

  1. The Receiver receives the external data stream and hands the records to the BlockGenerator, which stores them in an ArrayBuffer. Before a record is stored, a permit must be acquired (the limit is set by "spark.streaming.receiver.maxRate"; since Spark 1.5 it can be computed automatically by backpressure). The permit represents the maximum ingestion rate: each stored record consumes one permit, and if no permit is available, receiving blocks. (A minimal custom Receiver sketch follows this list.)
  2. The BlockGenerator defines a Timer that, at the configured interval, takes the data out of the ArrayBuffer, packages it into a Block, puts the Block into blocksForPushing (a blocking ArrayBlockingQueue), and clears the ArrayBuffer.
  3. The BlockGenerator's blockPushingThread takes blocks out of the blocking queue and delivers them, via the listener's onPushBlock callback, to the ReceiverSupervisor.
  4. When the ReceiverSupervisor receives the message, it processes the data it carries: it stores the data through the BlockManager and reports the storage result to the ReceiverTracker.
  5. When the ReceiverTracker receives the report, it stores the information in the unallocated-block queue (streamIdToUnallocatedBlockQueues), where it waits until the JobGenerator assigns it to an RDD when generating a Job.
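
A minimal custom Receiver sketch to make the pipeline above concrete (the counter data source is invented for illustration): every store() call goes through supervisor.pushSingle into the BlockGenerator's currentBuffer, is cut into Blocks by the timer, pushed through blocksForPushing, stored by the BlockManager, and reported to the ReceiverTracker.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CounterReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    // receive on a daemon thread so that onStart() returns immediately
    new Thread("Counter Receiver") {
      setDaemon(true)
      override def run(): Unit = {
        var i = 0L
        while (!isStopped()) {
          i += 1
          store(s"record-$i") // -> supervisor.pushSingle -> BlockGenerator.addData
          Thread.sleep(10)
        }
      }
    }.start()
  }

  def onStop(): Unit = {
    // nothing to clean up; the receiving thread checks isStopped()
  }
}

// usage, assuming an existing StreamingContext named ssc:
// val stream = ssc.receiverStream(new CounterReceiver)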

Step 1: After the Receiver has started, it begins receiving data from the external source; ReceiverSupervisor's onStart() method

  private val registeredBlockGenerators = new ConcurrentLinkedQueue[BlockGenerator]()

 override protected def onStart() {
    //runs on the worker's Executor and is responsible for storing the data after it is received
    registeredBlockGenerators.asScala.foreach { _.start() }
  }

  override def createBlockGenerator(
      blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
    // Cleanup BlockGenerators that have already been stopped
    val stoppedGenerators = registeredBlockGenerators.asScala.filter{ _.isStopped() }
    stoppedGenerators.foreach(registeredBlockGenerators.remove(_))

    val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
    registeredBlockGenerators.add(newBlockGenerator)
    newBlockGenerator
  }

Step 2: BlockGenerator's start method

  //default value: 200 milliseconds
  private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
  require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

  //blockIntervalTimer is a timer task that calls the updateCurrentBuffer function every 200 ms (by default)
  private val blockIntervalTimer =
    new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")

  //blockPushingThread is a thread that keeps taking packaged blocks out of the blocking queue
  private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

  //currentBuffer holds the raw records
  @volatile private var currentBuffer = new ArrayBuffer[Any]

  /** Start block generating and pushing threads. */
  def start(): Unit = synchronized {
    if (state == Initialized) {
      state = Active
      //start blockIntervalTimer, which packages the raw data in currentBuffer into blocks
      blockIntervalTimer.start()
      //blockPushingThread handles the blocks in blocksForPushing and calls pushArrayBuffer() for them
      blockPushingThread.start()
      logInfo("Started BlockGenerator")
    } else {
      throw new SparkException(
        s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
    }
  }

 /**
   * Push a single data item into the buffer.
   * called from defaultBlockGenerator.addData(data)
   */
  def addData(data: Any): Unit = {
    if (state == Active) {
     //control the ingestion rate
      waitToPush()
      synchronized {
        if (state == Active) {
          currentBuffer += data
        } else {
          throw new SparkException(
            "Cannot add data as BlockGenerator has not been started or has been stopped")
        }
      }
    } else {
      throw new SparkException(
        "Cannot add data as BlockGenerator has not been started or has been stopped")
    }
  }

Step 3: The updateCurrentBuffer function

  //capacity of blocksForPushing; defaults to 10 and is configurable
  private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
  
  //blocksForPushing is a blocking queue
  private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)

  /** Change the buffer to which single records are added to. */
  //package the data in currentBuffer into a Block and add it to the blocksForPushing blocking queue
  private def updateCurrentBuffer(time: Long): Unit = {
    try {
      var newBlock: Block = null
      synchronized {
        //if the buffer is empty, no block is generated
        if (currentBuffer.nonEmpty) {
          val newBlockBuffer = currentBuffer
          currentBuffer = new ArrayBuffer[Any]
          //create a unique BlockId based on the time
          val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
          listener.onGenerateBlock(blockId)
          //create the Block
          newBlock = new Block(blockId, newBlockBuffer)
        }
      }

      if (newBlock != null) {
        //put the Block into the blocksForPushing queue
        blocksForPushing.put(newBlock)  // put is blocking when queue is full
      }
    } catch {
      case ie: InterruptedException =>
        logInfo("Block updating timer thread was interrupted")
      case e: Exception =>
        reportError("Error in block updating thread", e)
    }
  }

Step 4: The keepPushingBlocks() method, which keeps taking Blocks out of the blocksForPushing queue

  //take the packaged data out of the blocking queue and forward it through defaultBlockGeneratorListener (ReceiverSupervisorImpl)
  //for the subsequent store and replication steps
  private def keepPushingBlocks() {
    logInfo("Started block pushing thread")

    def areBlocksBeingGenerated: Boolean = synchronized {
      state != StoppedGeneratingBlocks
    }

    try {
      // While blocks are being generated, keep polling for to-be-pushed blocks and push them.
      while (areBlocksBeingGenerated) {
        //every 10 ms, poll the block at the head of the blocksForPushing queue
        Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
          //pushBlock: push the Block
          case Some(block) => pushBlock(block)
          case None =>
        }
      }

      // At this point, state is StoppedGeneratingBlock. So drain the queue of to-be-pushed blocks.
      logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
      while (!blocksForPushing.isEmpty) {
        val block = blocksForPushing.take()
        logDebug(s"Pushing block $block")
        pushBlock(block)
        logInfo("Blocks left to push " + blocksForPushing.size())
      }
      logInfo("Stopped block pushing thread")
    } catch {
      case ie: InterruptedException =>
        logInfo("Block pushing thread was interrupted")
      case e: Exception =>
        reportError("Error in block pushing thread", e)
    }
  }

Step 5: The pushBlock(block) method

  private def pushBlock(block: Block) {
    listener.onPushBlock(block.id, block.buffer)
    logInfo("Pushed block " + block.id)
  }

Step 6: BlockGeneratorListener observes the onPushBlock event
Source: org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.scala

  private val defaultBlockGeneratorListener = new BlockGeneratorListener {
    def onAddData(data: Any, metadata: Any): Unit = { }

    def onGenerateBlock(blockId: StreamBlockId): Unit = { }

    def onError(message: String, throwable: Throwable) {
      reportError(message, throwable)
    }

   //when BlockGeneratorListener observes the onPushBlock event, it stores the transferred data slice and reports it to the ReceiverTracker
    def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
      //push the block
      pushArrayBuffer(arrayBuffer, None, Some(blockId))
    }
  }

Step 7: The pushArrayBuffer() method

  def pushArrayBuffer(
      arrayBuffer: ArrayBuffer[_],
      metadataOption: Option[Any],
      blockIdOption: Option[StreamBlockId]
    ) {
    //wrap it into an ArrayBufferBlock
    pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
  }

Step 8: The pushAndReportBlock() method

  //store the data slice
  def pushAndReportBlock(
      receivedBlock: ReceivedBlock,
      metadataOption: Option[Any],
      blockIdOption: Option[StreamBlockId]
    ) {
    val blockId = blockIdOption.getOrElse(nextBlockId)
    val time = System.currentTimeMillis
    //the data is stored as a Block through receivedBlockHandler
    val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
    logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
    val numRecords = blockStoreResult.numRecords
    
    //wrap it into a ReceivedBlockInfo object
    val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
    if (!trackerEndpoint.askSync[Boolean](AddBlock(blockInfo))) {
      throw new SparkException("Failed to add block to receiver tracker.")
    }
    logDebug(s"Reported block $blockId")
  }

Step 9: receivedBlockHandler.storeBlock; ReceivedBlockHandler has two implementations

  private val receivedBlockHandler: ReceivedBlockHandler = {
    if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
      if (checkpointDirOption.isEmpty) {
        throw new SparkException(
          "Cannot enable receiver write-ahead log without checkpoint directory set. " +
            "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
            "See documentation for more details.")
      }
      //this implementation is used when the WAL (write-ahead log) is enabled
      new WriteAheadLogBasedBlockHandler(env.blockManager, env.serializerManager, receiver.streamId,
        receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
    } else {
      //this implementation is used by default
      new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
    }
  }

Step 10: The storeBlock() method of WriteAheadLogBasedBlockHandler

  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

    var numRecords = Option.empty[Long]
    // Serialize the block so that it can be inserted into both
    //serialize the data
    val serializedBlock = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        numRecords = Some(arrayBuffer.size.toLong)
        serializerManager.dataSerialize(blockId, arrayBuffer.iterator)
      case IteratorBlock(iterator) =>
        val countIterator = new CountingIterator(iterator)
        val serializedBlock = serializerManager.dataSerialize(blockId, countIterator)
        numRecords = countIterator.count
        serializedBlock
      case ByteBufferBlock(byteBuffer) =>
        new ChunkedByteBuffer(byteBuffer.duplicate())
      case _ =>
        throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
    }

    // Store the block in block manager
    //store the data in the BlockManager; the default persistence level is a *_SER_2 StorageLevel, so a replica is copied to another Executor for fault tolerance
    val storeInBlockManagerFuture = Future {
      val putSucceeded = blockManager.putBytes(
        blockId,
        serializedBlock,
        effectiveStorageLevel,
        tellMaster = true)
      if (!putSucceeded) {
        throw new SparkException(
          s"Could not store $blockId to block manager with storage level $storageLevel")
      }
    }

    // Store the block in write ahead log
    //write the Block to the write-ahead log
    val storeInWriteAheadLogFuture = Future {
      writeAheadLog.write(serializedBlock.toByteBuffer, clock.getTimeMillis())
    }

    // Combine the futures, wait for both to complete, and return the write ahead log record handle
    val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
    val walRecordHandle = ThreadUtils.awaitResult(combinedFuture, blockStoreTimeout)
    WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
  }

Step 11: The storeBlock() method of BlockManagerBasedBlockHandler

  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

    var numRecords: Option[Long] = None
    //store the data in the BlockManager
    val putSucceeded: Boolean = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        numRecords = Some(arrayBuffer.size.toLong)
        blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
          tellMaster = true)
      case IteratorBlock(iterator) =>
        val countIterator = new CountingIterator(iterator)
        val putResult = blockManager.putIterator(blockId, countIterator, storageLevel,
          tellMaster = true)
        numRecords = countIterator.count
        putResult
      case ByteBufferBlock(byteBuffer) =>
        blockManager.putBytes(
          blockId, new ChunkedByteBuffer(byteBuffer.duplicate()), storageLevel, tellMaster = true)
      case o =>
        throw new SparkException(
          s"Could not store $blockId to block manager, unexpected block type ${o.getClass.getName}")
    }
    if (!putSucceeded) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
    BlockManagerBasedStoreResult(blockId, numRecords)
  }

BlockManagerBasedBlockHandler stores the data on the Receiver's node through the BlockManager interface and, according to the replication factor of the StorageLevel, keeps replicas on other Executors.
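
A short sketch based on the socketTextStream signature shown earlier (it assumes an existing StreamingContext named ssc; host and port are placeholders): the default StorageLevel.MEMORY_AND_DISK_SER_2 keeps a second copy on another Executor, while a non-replicated level trades that fault tolerance for lower memory use.

import org.apache.spark.storage.StorageLevel

// default: serialized, memory+disk, replicated to one other Executor
val replicated = ssc.socketTextStream("localhost", 9999)

// single copy only: less memory, but the block is lost if its Executor dies
val singleCopy = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)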

Step 12: The replicate() method BlockManager uses to store replicas
Source: org.apache.spark.storage.BlockManager.scala

  private def replicate(
      blockId: BlockId,
      data: BlockData,
      level: StorageLevel,
      classTag: ClassTag[_],
      existingReplicas: Set[BlockManagerId] = Set.empty): Unit = {

   ......

    var peersForReplication = blockReplicationPolicy.prioritize(
      blockManagerId,
      initialPeers,
      peersReplicatedTo,
      blockId,
      numPeersToReplicateTo)

   ......
   
  }

Step 13: blockReplicationPolicy.prioritize(); the replication policy uses random sampling

  override def prioritize(
      blockManagerId: BlockManagerId,
      peers: Seq[BlockManagerId],
      peersReplicatedTo: mutable.HashSet[BlockManagerId],
      blockId: BlockId,
      numReplicas: Int): List[BlockManagerId] = {
    //random sampling
    val random = new Random(blockId.hashCode)
    logDebug(s"Input peers : ${peers.mkString(", ")}")
    val prioritizedPeers = if (peers.size > numReplicas) {
      BlockReplicationUtils.getRandomSample(peers, numReplicas, random)
    } else {
      if (peers.size < numReplicas) {
        logWarning(s"Expecting ${numReplicas} replicas with only ${peers.size} peer/s.")
      }
      random.shuffle(peers).toList
    }
    logDebug(s"Prioritized peers : ${prioritizedPeers.mkString(", ")}")
    prioritizedPeers
  }

Step 14: Once the Block is stored and its replicas are made, the result is reported to the ReceiverTracker through trackerEndpoint; see the pushAndReportBlock() method from step 8

//wrap it into a ReceivedBlockInfo object
val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
//send the AddBlock message
if (!trackerEndpoint.askSync[Boolean](AddBlock(blockInfo))) {
   throw new SparkException("Failed to add block to receiver tracker.")
}

Step 15: ReceiverTrackerEndpoint receives the "AddBlock" message

case AddBlock(receivedBlockInfo) =>
  if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
    walBatchingThreadPool.execute(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        if (active) {
          context.reply(addBlock(receivedBlockInfo))
        } else {
          context.sendFailure(
            new IllegalStateException("ReceiverTracker RpcEndpoint already shut down."))
        }
      }
    })
  } else {
    context.reply(addBlock(receivedBlockInfo))
  }

Step 16: The addBlock(receivedBlockInfo) method

  private val receivedBlockTracker = new ReceivedBlockTracker(
    ssc.sparkContext.conf,
    ssc.sparkContext.hadoopConfiguration,
    receiverInputStreamIds,
    ssc.scheduler.clock,
    ssc.isCheckpointPresent,
    Option(ssc.checkpointDir)
  )

  //receivedBlockTracker saves the block information into the streamIdToUnallocatedBlockQueues queue, to be used when generating Jobs
  private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
    receivedBlockTracker.addBlock(receivedBlockInfo)
  }

Step 17: The receivedBlockTracker.addBlock(receivedBlockInfo) method

  //maps each streamId to its Blocks
  private val streamIdToUnallocatedBlockQueues = new mutable.HashMap[Int, ReceivedBlockQueue]
  //maps each Time to its Blocks
  private val timeToAllocatedBlocks = new mutable.HashMap[Time, AllocatedBlocks]
  //if the write-ahead log is enabled, the ReceivedBlockTracker also writes a copy to it
  private val writeAheadLogOption = createWriteAheadLog()

  def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
    try {
      //write-ahead log mechanism
      val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
      if (writeResult) {
        synchronized {
          //append to the queue
          getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
        }
        logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
          s"block ${receivedBlockInfo.blockStoreResult.blockId}")
      } else {
        logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
          s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
      }
      writeResult
    } catch {
      case NonFatal(e) =>
        logError(s"Error adding block $receivedBlockInfo", e)
        false
    }
  }

Step 18: getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo

  private def getReceivedBlockQueue(streamId: Int): ReceivedBlockQueue = {
    streamIdToUnallocatedBlockQueues.getOrElseUpdate(streamId, new ReceivedBlockQueue)
  }

The block information kept in streamIdToUnallocatedBlockQueues is taken out when the next batch generates its Job and used to build that batch's RDD; the registered block information is then moved to timeToAllocatedBlocks.

Job Generation and Computation Analysis
[Figure 10]

Step 1: The JobGenerator periodically generates and submits Jobs; when the JobScheduler starts, it calls JobGenerator's start method

  def start(): Unit = synchronized {
    if (eventLoop != null) return // generator has already been started

    // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
    // See SPARK-10125
    checkpointWriter

    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }
    //create and start the eventLoop; its events are handled by processEvent(event),
    //which dispatches on the event type
    eventLoop.start()

    if (ssc.isCheckpointPresent) {
      restart()
    } else {
      //generate Jobs periodically
      startFirstTime()
    }
  }

Step 2: The startFirstTime() method

  //runs the scheduled task repeatedly at the configured period
  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")


  /**
   * Initialize the start time. At every batch interval, the data received since the previous
   * time (startTime here) is packaged into one batch.
   */
  private def startFirstTime() {
    val startTime = new Time(timer.getStartTime())
    graph.start(startTime - graph.batchDuration)
    timer.start(startTime.milliseconds)
    logInfo("Started JobGenerator at " + startTime)
  }

Step 3: At every batchDuration, a GenerateJobs event is posted to the eventLoop; when the eventLoop receives it, processEvent handles it, in this case by calling the generateJobs() method to generate the Jobs

  /** Processes all events */
  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }

Step 4: generateJobs(time) generates the Jobs

  /**
   * generateJobs is invoked on a timer
   * time: the time period of the batch interval
   */
  private def generateJobs(time: Time) {
    // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
    // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
    ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
    Try {
      //find the ReceiverTracker and call allocateBlocksToBatch to assign the blocks of the current
      //time period to one batch, for which an RDD will be created
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      //call DStreamGraph.generateJobs to generate the Jobs from the defined DStream relationships and operators
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      //once the Jobs have been generated successfully, submit the JobSet through the JobScheduler
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
        PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }

Step 5: jobScheduler.receiverTracker.allocateBlocksToBatch(time)

  def allocateBlocksToBatch(batchTime: Time): Unit = {
    if (receiverInputStreams.nonEmpty) {
      //take out the unallocated block information and assign it to the batch identified by batchTime
      receivedBlockTracker.allocateBlocksToBatch(batchTime)
    }
  }

Step 6: receivedBlockTracker.allocateBlocksToBatch(batchTime)

  def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
    if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
      val streamIdToBlocks = streamIds.map { streamId =>
        (streamId, getReceivedBlockQueue(streamId).clone())
      }.toMap
      //take the unallocated block information out of streamIdToUnallocatedBlockQueues and wrap it into AllocatedBlocks
      val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
      if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
        streamIds.foreach(getReceivedBlockQueue(_).clear())
        //save it into timeToAllocatedBlocks, to be bound to the Job generated for this batch (batchTime)
        timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
        lastAllocatedBatchTime = batchTime
      } else {
        logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
      }
    } else {
      // This situation occurs when:
      // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
      // possibly processed batch job or half-processed batch job need to be processed again,
      // so the batchTime will be equal to lastAllocatedBatchTime.
      // 2. Slow checkpointing makes recovered batch time older than WAL recovered
      // lastAllocatedBatchTime.
      // This situation will only occurs in recovery time.
      logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
    }
  }

Step 7: The graph.generateJobs(time) call from step 4 converts each OutputStream in the DStreamGraph into a Job (if the application has multiple output operators, one batch generates multiple Jobs)

  def generateJobs(time: Time): Seq[Job] = {
    logDebug("Generating jobs for time " + time)
    val jobs = this.synchronized {
      outputStreams.flatMap { outputStream =>
        //call generateJob on each OutputStream to turn it into a Job
        val jobOption = outputStream.generateJob(time)
        jobOption.foreach(_.setCallSite(outputStream.creationSite))
        jobOption
      }
    }
    logDebug("Generated " + jobs.length + " jobs for time " + time)
    jobs
  }

Step 8: outputStream.generateJob(time)

  /**
   * Every output operation goes through ForEachDStream's generateJob method,
   * which ultimately triggers the job submission
   */
  override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
          foreachFunc(rdd, time)
        }
        //create the Job
        Some(new Job(time, jobFunc))
      case None => None
    }
  }

Step 9: parent.getOrCompute(time) calls getOrCompute on the parent DStream to generate the RDD

DStream transformations in a WordCount application:
[Figure 11]
ShuffledDStream, the parent of ForEachDStream in the transformation chain, does not override getOrCompute, so the getOrCompute inherited from the DStream base class is used.

  /**
   * Get the RDD corresponding to the given time; either retrieve it from cache
   * or compute-and-cache it.
   */
   //getOrCompute (compute is similar) is defined in the DStream base class;
   //if a subclass overrides it, the subclass's version runs, otherwise the base class's version runs
  private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
    // If RDD was already generated, then retrieve it from HashMap,
    // or else compute the RDD
    generatedRDDs.get(time).orElse {
      // Compute the RDD if time is valid (e.g. correct time in a sliding window)
      // of RDD generation, else generate nothing.
      if (isTimeValid(time)) {

        val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details. We need to have this call here because
          // compute() might cause Spark jobs to be launched.
          SparkHadoopWriterUtils.disableOutputSpecValidation.withValue(true) {
            //recursively call compute to generate the RDD
            compute(time)
          }
        }

        rddOption.foreach { case newRDD =>
          // Register the generated RDD for caching and checkpointing
          if (storageLevel != StorageLevel.NONE) {
            newRDD.persist(storageLevel)
            logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
          }
          if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
            newRDD.checkpoint()
            logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
          }
          generatedRDDs.put(time, newRDD)
        }
        rddOption
      } else {
        None
      }
    }
  }

Step 10: ShuffledDStream.compute(time) generates the RDD

  override def compute(validTime: Time): Option[RDD[(K, C)]] = {
    //call the parent DStream's compute method; the parent recursively calls its own parent's compute, up to the source DStream
    parent.getOrCompute(validTime) match {
      case Some(rdd) => Some(rdd.combineByKey[C](
          createCombiner, mergeValue, mergeCombiner, partitioner, mapSideCombine))
      case None => None
    }
  }

Step 11: The SocketInputDStream.compute(time) method

SocketInputDStream inherits its compute method from ReceiverInputDStream; this compute generates the source RDD, and the RDD graph is then built by walking back down the DStream lineage.

  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {

      if (validTime < graph.startTime) {
        // If this is called for any time before the start time of the context,
        // then this returns an empty RDD. This may happen when recovering from a
        // driver failure without any write ahead log to recover pre-failure data.
        new BlockRDD[T](ssc.sc, Array.empty)
      } else {
        // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
        // for this batch
        //take the ReceivedBlockInfo out of the batch-allocated block information (timeToAllocatedBlocks)
        val receiverTracker = ssc.scheduler.receiverTracker
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
        
        // Register the input blocks information into InputInfoTracker
        //wrap it into a StreamInputInfo
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
         
        // Create the BlockRDD
        //generate the RDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }

Step 12: createBlockRDD(validTime, blockInfos)

  private[streaming] def createBlockRDD(time: Time, blockInfos: Seq[ReceivedBlockInfo]): RDD[T] = {

    //check whether blockInfos is empty
    if (blockInfos.nonEmpty) {
      val blockIds = blockInfos.map { _.blockId.asInstanceOf[BlockId] }.toArray

      // Are WAL record handles present with all the blocks
      val areWALRecordHandlesPresent = blockInfos.forall { _.walRecordHandleOption.nonEmpty }

      if (areWALRecordHandlesPresent) {
        // If all the blocks have WAL record handle, then create a WALBackedBlockRDD
        val isBlockIdValid = blockInfos.map { _.isBlockIdValid() }.toArray
        val walRecordHandles = blockInfos.map { _.walRecordHandleOption.get }.toArray
        // recover from the WAL
        new WriteAheadLogBackedBlockRDD[T](
          ssc.sparkContext, blockIds, walRecordHandles, isBlockIdValid)
      } else {
        // Else, create a BlockRDD. However, if there are some blocks with WAL info but not
        // others then that is unexpected and log a warning accordingly.
        if (blockInfos.exists(_.walRecordHandleOption.nonEmpty)) {
          if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
            logError("Some blocks do not have Write Ahead Log information; " +
              "this is unexpected and data may not be recoverable after driver failures")
          } else {
            logWarning("Some blocks have Write Ahead Log information; this is unexpected")
          }
        }
        //check whether the data still exists and filter out missing blocks; the master here is the BlockManager master
        val validBlockIds = blockIds.filter { id =>
          ssc.sparkContext.env.blockManager.master.contains(id)
        }
        if (validBlockIds.length != blockIds.length) {
          logWarning("Some blocks could not be recovered as they were not found in memory. " +
            "To prevent such data loss, enable Write Ahead Log (see programming guide " +
            "for more details.")
        }
        //build a BlockRDD from the valid BlockIds
        new BlockRDD[T](ssc.sc, validBlockIds)
      }
    } else {
      // If no block is ready now, creating WriteAheadLogBackedBlockRDD or BlockRDD
      // according to the configuration
      // create an empty RDD backed by the WAL
      if (WriteAheadLogUtils.enableReceiverLog(ssc.conf)) {
        new WriteAheadLogBackedBlockRDD[T](
          ssc.sparkContext, Array.empty, Array.empty, Array.empty)
      } else {
        //create an empty BlockRDD
        new BlockRDD[T](ssc.sc, Array.empty)
      }
    }
  }

Step 13: Back in step 4, the JobSet is submitted through the JobScheduler

//the jobs were created successfully
case Success(jobs) =>
        //get the input data information for the current batch interval from the InputInfoTracker
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        //all Jobs form one JobSet, which is submitted as a batch with JobScheduler's submitJobSet
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))

Step 14: jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))

  private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
  //by default only one Job can be submitted at a time
  private val jobExecutor =
    ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")

  def submitJobSet(jobSet: JobSet) {
    if (jobSet.jobs.isEmpty) {
      logInfo("No jobs added for time " + jobSet.time)
    } else {
      listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
      jobSets.put(jobSet.time, jobSet)
      //each Job is wrapped in a JobHandler and added to the ThreadPoolExecutor's workQueue via execute, where it waits to be scheduled
      //if the thread pool has an idle thread, the Job is picked up
      jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
      logInfo("Added jobs for time " + jobSet.time)
    }
  }

Step 15: new JobHandler(job)

  /**
   * JobHandler is the main task run by the ThreadPoolExecutor's threads; it processes the submitted Job
   */
  private class JobHandler(job: Job) extends Runnable with Logging {
    import JobScheduler._

    def run() {
      val oldProps = ssc.sparkContext.getLocalProperties
      try {
        ssc.sparkContext.setLocalProperties(SerializationUtils.clone(ssc.savedProperties.get()))
        val formattedTime = UIUtils.formatBatchTime(
          job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
        val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
        val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

        ssc.sc.setJobDescription(
          s"""Streaming job from $batchLinkText""")
        ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
        ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)
        // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
        // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
        ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")

        // We need to assign `eventLoop` to a temp variable. Otherwise, because
        // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
        // it's possible that when `post` is called, `eventLoop` happens to null.
        var _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details.
          //the Job's state is managed through the EventLoop, and the Job is started by calling job.run
          SparkHadoopWriterUtils.disableOutputSpecValidation.withValue(true) {
            job.run()
          }
          _eventLoop = eventLoop
          if (_eventLoop != null) {
            _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
          }
        } else {
          // JobScheduler has been stopped.
        }
      } finally {
        ssc.sparkContext.setLocalProperties(oldProps)
      }
    }
  }

Step 16: The job.run() method

  def run() {
    //the DStream's jobFunc function
    _result = Try(func())
  }

Step 17: The DStream's jobFunc function

  private[streaming] def generateJob(time: Time): Option[Job] = {
    getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => {
          val emptyFunc = { (iterator: Iterator[T]) => {} }
          //this triggers the submission of a Spark job; from this point on, processing is the same as in Spark batch processing
          context.sparkContext.runJob(rdd, emptyFunc)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }

The Job generated above is only an abstraction defined by Spark Streaming; it is not the same as a Spark job, which is what actually gets scheduled and produces tasks.
