Spark Streaming — StreamingContext Initialization and Receiver Startup

StreamingContext Initialization

  When a StreamingContext is initialized, it creates two important components, DStreamGraph and JobScheduler, as shown below:

  // An important component initialized here is the DStreamGraph,
  // which records the dependencies between the DStreams of the Spark Streaming application and the operators applied between them
  private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      cp_.graph.setContext(this)
      cp_.graph.restoreCheckpointData()
      cp_.graph
    } else {
      require(batchDur_ != null, "Batch duration for StreamingContext cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(batchDur_)
      newGraph
    }
  }
  // Initialize the JobScheduler, which is responsible for job scheduling; the jobs produced by the JobGenerator are scheduled and submitted through it
  // Under the hood it still relies on the Spark Core engine
  private[streaming] val scheduler = new JobScheduler(this)

  After the StreamingContext has been initialized, let's take a WordCount program as an example and follow the execution. As discussed earlier for Spark Core, a job is triggered by an output operation (i.e. an action). Take the simple print() output operation: it calls print(10), which prints the first 10 elements of each RDD, and internally it calls foreachRDD, which creates a ForEachDStream and calls its register() method; the ForEachDStream's generateJob() method is eventually invoked, and that is where job submission is triggered.
However, the above only wires up what will trigger job submission; it does not yet cover job generation or data reception by the Receivers. Those are set in motion by calling StreamingContext's start() method. So in Spark Streaming, the program does not run unless start() is called, and of course it also does not run without an output operation, because no job would ever be submitted.
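  For orientation during the walkthrough below, here is a minimal WordCount-style driver (a sketch; the application name, the host "localhost" and the port 9999 are placeholder assumptions) showing where the output operation and the start() call sit:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object StreamingWordCount {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("StreamingWordCount")
      // a 5-second batch interval; this becomes the DStreamGraph's batch duration
      val ssc = new StreamingContext(conf, Seconds(5))

      // "localhost" and 9999 stand in for a real data source
      val lines = ssc.socketTextStream("localhost", 9999)
      val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

      // output operation (action): registers a ForEachDStream with the DStreamGraph
      wordCounts.print()

      // nothing runs until start() is called; awaitTermination() keeps the driver alive
      ssc.start()
      ssc.awaitTermination()
    }
  }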
  Let's now take a close look at the start() method:

StreamingContext's start() method
  def start(): Unit = synchronized {
    state match {
      case INITIALIZED =>
        startSite.set(DStream.getCreationSite())
        // Take the lock to ensure that only one StreamingContext is active at a time
        StreamingContext.ACTIVATION_LOCK.synchronized {
          // Assert that no other StreamingContext is currently running
          StreamingContext.assertNoOtherContextIsActive()
          try {
            // Validate that the initialized components are legal, whether checkpointing is configured, and so on
            validate()
            
            // Launch the streaming application in a separate thread
            ThreadUtils.runInNewThread("streaming-start") {
              sparkContext.setCallSite(startSite.get)
              sparkContext.clearJobGroup()
              sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
              // Call JobScheduler's start() method, which takes care of starting the Receivers
              scheduler.start()
            }
            // Update the current state
            state = StreamingContextState.ACTIVE
          } catch {
            case NonFatal(e) =>
              logError("Error starting the context, marking it as stopped", e)
              scheduler.stop(false)
              state = StreamingContextState.STOPPED
              throw e
          }
          StreamingContext.setActiveContext(this)
        }
        // remaining cases omitted
        .............................
    }
  }

  As the code above shows, the real work is delegated to JobScheduler's start() method, so let's look inside it.

JobScheduler's start() method
  // What StreamingContext's start() method really invokes underneath is JobScheduler's start() method
  def start(): Unit = synchronized {
    // If this JobScheduler has already been started, just return (e.g. a restart after a failure)
    if (eventLoop != null) return // scheduler has already been started

    logDebug("Starting JobScheduler")
    // Create an event loop (a message queue) for receiving scheduler events
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    // Start receiving events (these are local, in-process messages)
    eventLoop.start()

    // Attach the rate controllers of the input DStreams
    // This is actually quite important: when receiving data via the Kafka direct approach (or an ordinary Receiver),
    // a maximum receiving rate can be configured, i.e. the input can be throttled
    for {
      inputDStream <- ssc.graph.getInputStreams
      rateController <- inputDStream.rateController
    } ssc.addStreamingListener(rateController)

    listenerBus.start(ssc.sparkContext)
    // Create the ReceiverTracker, the component in charge of data reception
    receiverTracker = new ReceiverTracker(ssc)
    // Track information about the input DStreams' data so that Streaming can monitor it
    inputInfoTracker = new InputInfoTracker(ssc)
    // Start the receiverTracker; this launches the Receivers associated with the input DStreams
    receiverTracker.start()
    // The JobGenerator was already created when the JobScheduler was constructed; it is started here
    jobGenerator.start()
    logInfo("Started JobScheduler")
  }

  Let's look at the code above more closely. First an event loop is created to receive local scheduler messages. Then the rate controllers of the input DStreams are registered, which is where input rate limiting comes in.
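  As an illustration of the pattern behind Spark's internal EventLoop (a simplified, hypothetical sketch, not the actual Spark class), the idea is a blocking queue drained by a single daemon thread that hands each event to an onReceive-style handler:

  import java.util.concurrent.LinkedBlockingQueue

  // Minimal event-loop sketch: post() enqueues an event, a daemon thread dispatches it
  class SimpleEventLoop[E](name: String)(handler: E => Unit) {
    private val queue = new LinkedBlockingQueue[E]()
    @volatile private var stopped = false

    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit = {
        try {
          while (!stopped) handler(queue.take())  // block until an event arrives
        } catch {
          case _: InterruptedException => // interrupted by stop(); exit quietly
        }
      }
    }

    def start(): Unit = thread.start()
    def post(event: E): Unit = queue.put(event)
    def stop(): Unit = { stopped = true; thread.interrupt() }
  }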
  A quick word on Receiver rate limiting: if cluster resources are limited and cannot process data as fast as the Receivers ingest it, data piles up on the Receiver side. To keep that backlog bounded, the receiving rate can be throttled with two settings: spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition.
  The former applies to ordinary Receivers, the latter to Kafka. Since Spark 1.5, however, a backpressure mechanism has been available (notably for the Kafka direct approach), so there is no need to hard-code a rate limit: Spark automatically estimates a reasonable ingestion rate and adjusts it dynamically. Enabling it only requires setting spark.streaming.backpressure.enabled to true.
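  These rate-limiting and backpressure knobs are ordinary Spark configuration entries. A sketch of setting them on a SparkConf, as one might write it in a driver program or in spark-shell (the numeric values are arbitrary examples, not recommendations):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("RateLimitedStreaming")
    // cap each ordinary Receiver at 10,000 records per second
    .set("spark.streaming.receiver.maxRate", "10000")
    // cap each Kafka partition (direct approach) at 2,000 records per second
    .set("spark.streaming.kafka.maxRatePerPartition", "2000")
    // let Spark estimate and adjust the ingestion rate dynamically (Spark 1.5+)
    .set("spark.streaming.backpressure.enabled", "true")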
  Continuing with the code above: two important components, ReceiverTracker and JobGenerator, are then created and started. Let's start with ReceiverTracker's start() method.

ReceiverTracker's start() method

  Since ReceiverTracker's start() method essentially delegates to launchReceivers(), let's look at that method directly:

  private def launchReceivers(): Unit = {
    // Get all the Receivers
    val receivers = receiverInputStreams.map(nis => {
      // For every input DStream created in the application, call getReceiver() to obtain its Receiver
      val rcvr = nis.getReceiver()
      // Set the Receiver's ID
      rcvr.setReceiverId(nis.id)
      rcvr
    })
    runDummySparkJob()

    logInfo("Starting " + receivers.length + " receivers")
    // Send a StartAllReceivers message to the ReceiverTrackerEndpoint;
    // this is just local message passing
    endpoint.send(StartAllReceivers(receivers))
  }

  As the code above shows, the Receivers are launched by sending a local StartAllReceivers message. Let's look at how that message is handled:

 override def receive: PartialFunction[Any, Unit] = {
      // Local messages
      // Start all the Receivers
      case StartAllReceivers(receivers) =>

        // Compute where the Receivers should be launched, i.e. on which executors they will run
        val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
        for (receiver <- receivers) {
          // Get the executors scheduled for this Receiver
          val executors = scheduledLocations(receiver.streamId)
          // Record the scheduled executors in the ReceiverInfo
          updateReceiverScheduledExecutors(receiver.streamId, executors)
          // Record each Receiver's preferred location
          receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
          // Start the Receiver, passing in the executors where it should be launched
          startReceiver(receiver, executors)
        }
        // code omitted
    }

  The key call here is startReceiver(), which launches the Receiver. Note the arguments it takes: the Receiver to be started, and the locations where it should be launched (i.e. on which Worker's executor).
  Let's now focus on the startReceiver() method:

ReceiverTracker's startReceiver() method
private def startReceiver(
        receiver: Receiver[_],
        scheduledLocations: Seq[TaskLocation]): Unit = {
    // Check whether the Receiver has already been started or already been shut down, etc.
      val receiverId = receiver.streamId
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
        return
      }
      // Whether a checkpoint directory has been set
      val checkpointDirOption = Option(ssc.checkpointDir)
      val serializableHadoopConf =
        new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)
      /**
        *   The core logic of starting the Receiver is defined here.
        *   Note: this and everything that follows is only a definition; it is not executed on the Driver.
        *   The function defined here, and everything it triggers,
        *   runs on an executor. To emphasize: a Receiver is never started on the Driver, only on an Executor.
        */
        // The start function that will run on the executor holding the Receiver
      val startReceiverFunc: Iterator[Receiver[_]] => Unit =
        (iterator: Iterator[Receiver[_]]) => {
          if (!iterator.hasNext) {
            throw new SparkException(
              "Could not start receiver as object not found.")
          }
          if (TaskContext.get().attemptNumber() == 0) {
            // Get the Receiver
            val receiver = iterator.next()
            assert(iterator.hasNext == false)
            // Wrap the Receiver in a ReceiverSupervisorImpl and call its start() method to launch it
            val supervisor = new ReceiverSupervisorImpl(
              receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
            // This actually invokes start() of its parent class, ReceiverSupervisor
            supervisor.start()
            supervisor.awaitTermination()
          } else {
            // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
          }
        }

      // An optimization: the Receiver is wrapped in a one-element RDD whose preferred locations are the executors scheduled above, so the receiver task is launched on one of them
      val receiverRDD: RDD[Receiver[_]] =
        if (scheduledLocations.isEmpty) {
          ssc.sc.makeRDD(Seq(receiver), 1)
        } else {
          val preferredLocations = scheduledLocations.map(_.toString).distinct
          ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
        }
      receiverRDD.setName(s"Receiver $receiverId")
      ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
      ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

      // submitJob is what really ships the Receiver start function to executors on the Worker nodes for execution
      val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
        receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
      // We will keep restarting the receiver job until ReceiverTracker is stopped
      // Inspect the outcome of the receiver job and restart the Receiver if necessary
      future.onComplete {
        case Success(_) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
        case Failure(e) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logError("Receiver has been stopped. Try to restart it.", e)
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
      }(submitJobThreadPool)
      logInfo(s"Receiver ${receiver.streamId} started")
    }

  The important part of the code above is that each Receiver is wrapped in a start function, startReceiverFunc, which places the Receiver inside a ReceiverSupervisorImpl and calls its start() method; the Receiver is also packaged into receiverRDD; and, most importantly, both are handed to SparkContext's submitJob, which submits a job that runs on executors on the Worker nodes.
The point to note is that the Receiver is really started on a Worker node's executor, not on the Driver. The Driver only wraps the Receiver and ships it to the executors, where it is started.
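  The same mechanism can be mimicked in user code: a one-element RDD created with a location preference, plus a job submission, ships an arbitrary function to a chosen executor. A rough sketch (the host name "host-a" is a placeholder, and the simpler runJob is used instead of the lower-level submitJob):

  import org.apache.spark.{SparkConf, SparkContext}

  object ShipFunctionDemo {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("ShipFunctionDemo"))

      // one element, one partition, with a location preference ("host-a" is a placeholder)
      val oneElementRdd = sc.makeRDD(Seq("payload" -> Seq("host-a")))

      // the function below is serialized and executed on an executor, preferably on host-a,
      // which is essentially how startReceiverFunc reaches the chosen executor
      val results = sc.runJob(oneElementRdd, (iter: Iterator[String]) => {
        val payload = iter.next()
        // a long-running loop could live here, just as the Receiver's supervisor does
        s"processed $payload on an executor"
      })
      results.foreach(println)
      sc.stop()
    }
  }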
  Let's now see how a Receiver is started on each Worker's executor. The first call is ReceiverSupervisorImpl's start() method.

ReceiverSupervisorImpl's start()

  ReceiverSupervisorImpl does not define a start() method of its own; it is implemented in its parent class, so let's look at ReceiverSupervisor's start() method:

  def start() {
    // Calls ReceiverSupervisorImpl's onStart() method
    onStart()
    startReceiver()
  }

  The start() method contains just two calls: onStart(), shown below, which starts the registered BlockGenerators (analyzed later); and startReceiver(), which starts the Receiver itself.

override protected def onStart() {
    // This starts the BlockGenerators, which are very important: on the worker's executor they
    // take care of storing the received data and of cooperating with the ReceiverTracker.
    // So on the executor, before the Receiver itself is started, the BlockGenerators
    // associated with that Receiver are started first. Here the already-registered BlockGenerators are started.
    registeredBlockGenerators.foreach { _.start() }
  }

  The Receiver itself is then started in startReceiver().

startReceiver() starts the Receiver
// This is where the Receiver gets started
  def startReceiver(): Unit = synchronized {
    try {
      // First send a start-Receiver message to the ReceiverTracker to register the Receiver
      if (onReceiverStart()) {
        logInfo("Starting receiver")
        // Mark the Receiver as started
        receiverState = Started
        // Start the Receiver here
        receiver.onStart()
        logInfo("Called receiver onStart")
      } else {
        // The driver refused us
        stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
      }
    } catch {
      case NonFatal(t) =>
        stop("Error starting receiver " + streamId, Some(t))
    }
  }

  It first sends a message to the ReceiverTracker to register the Receiver. Once registration succeeds, the Receiver is started by calling its onStart() method. We will use the socket receiver as the example here; other Receivers start in much the same way.

// Start the Receiver
  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }
 
 // This mainly establishes a socket connection for receiving data
def receive() {
    var socket: Socket = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next)
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      if (socket != null) {
        socket.close()
        logInfo("Closed socket to " + host + ":" + port)
      }
    }
  }

  As the code makes clear, the Socket Receiver started on a Worker node's executor simply opens a socket connection to the data source, receives data, and stores it via its BlockManager, after which the data flows through the subsequent chain of operators.
  To summarize: this article analyzed the initialization of the StreamingContext and its start() method, which between them create four important components: JobScheduler, DStreamGraph, ReceiverTracker and JobGenerator. We focused on how Receivers are started: each Receiver is wrapped in the start function startReceiverFunc, and SparkContext's submitJob distributes the Receivers to executors on the Worker nodes, where they are launched. The launch goes through ReceiverSupervisor's startReceiver() method, which first sends a registration message to the ReceiverTracker and then calls the Receiver's onStart() method to begin receiving data. We used the Socket Receiver as a simple example.
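  To tie the lifecycle together, here is a sketch of a minimal custom Receiver following the same contract described above (onStart() must return quickly, so receiving runs on its own thread, and store() hands each record to the supervisor's storage machinery; the host and port are placeholders):

  import java.io.{BufferedReader, InputStreamReader}
  import java.net.Socket
  import java.nio.charset.StandardCharsets

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class SimpleSocketReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    // onStart() must not block, so the actual receiving runs on its own daemon thread
    def onStart(): Unit = {
      new Thread("Simple Socket Receiver") {
        setDaemon(true)
        override def run(): Unit = receive()
      }.start()
    }

    // nothing special to clean up: isStopped() will terminate the receiving loop
    def onStop(): Unit = {}

    private def receive(): Unit = {
      try {
        val socket = new Socket(host, port)
        val reader = new BufferedReader(
          new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
        var line = reader.readLine()
        while (!isStopped && line != null) {
          store(line)                 // hand the record over for storage in the BlockManager
          line = reader.readLine()
        }
        reader.close()
        socket.close()
        restart("Trying to connect again")
      } catch {
        case e: java.net.ConnectException =>
          restart(s"Error connecting to $host:$port", e)
        case t: Throwable =>
          restart("Error receiving data", t)
      }
    }
  }

  Such a receiver would be plugged in with ssc.receiverStream(new SimpleSocketReceiver(host, port)), which creates a ReceiverInputDStream whose getReceiver() returns this object, the same getReceiver() call seen in launchReceivers() above.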
