Spark Streaming Source Code Analysis 1: The Receiver Data Reception Flow

Building on the socketTextStream application example shown earlier, this article walks through how Spark Streaming actually runs it, from the source code's point of view.
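
For reference, a minimal version of such an application might look like the following sketch (the object name, host, port, and batch duration are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // when running locally, use at least 2 threads: one for the receiver, one for processing
    val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
    // batch duration of 5 seconds: every 5s the accumulated data becomes one batch job
    val ssc = new StreamingContext(conf, Seconds(5))

    // creates a SocketInputDStream whose Receiver will run inside an executor
    val lines = ssc.socketTextStream("localhost", 9999)

    // transformations on the input DStream
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // an output operation is required to trigger job generation for each batch
    counts.print()

    ssc.start()            // starts JobScheduler, ReceiverTracker, JobGenerator
    ssc.awaitTermination()
  }
}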

1. Initializing StreamingContext

StreamingContext is the main entry point of a Spark Streaming program. Its constructor looks like this:

class StreamingContext private[streaming] (
    _sc: SparkContext,
    _cp: Checkpoint,
    _batchDur: Duration
  ) extends Logging {

  // DStreamGraph holds the dependencies between DStreams and the operators applied between them
  private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      _cp.graph.setContext(this)
      _cp.graph.restoreCheckpointData()
      _cp.graph
    } else {
      require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(_batchDur)
      newGraph
    }
  }

  private val nextInputStreamId = new AtomicInteger(0)

  // checkpoint directory setup
  private[streaming] var checkpointDir: String = {
    if (isCheckpointPresent) {
      sc.setCheckpointDir(_cp.checkpointDir)
      _cp.checkpointDir
    } else {
      null
    }
  }

  // JobScheduler handles job scheduling; JobGenerator generates a job every batch interval, which the JobScheduler then schedules and submits
  private[streaming] val scheduler = new JobScheduler(this)

  private[streaming] val uiTab: Option[StreamingTab] =
    if (conf.getBoolean("spark.ui.enabled", true)) {
      Some(new StreamingTab(this))
    } else {
      None
    }
}

The constructor takes three main parameters:

  1. SparkContext: Spark Streaming ultimately hands its processing off to the SparkContext
  2. Checkpoint: the checkpoint data (used when recovering from a checkpoint)
  3. Duration: the batch interval, i.e. how long data is accumulated for each batch
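
In practice the context is usually built through one of the public constructors, which delegate to the private one above. A short, spark-shell-style sketch (sparkConf and sc are assumed to exist already, and the batch duration is illustrative):

// From a SparkConf: an underlying SparkContext is created internally
val ssc1 = new StreamingContext(sparkConf, Seconds(5))

// From an existing SparkContext
val ssc2 = new StreamingContext(sc, Seconds(5))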

The core components it creates for application processing are:

  1. DStreamGraph: maintains the dependency relationships between DStreams
  2. JobScheduler: the task core of Spark Streaming; it periodically generates Spark jobs, turning the streaming computation into batch processing

Once the StreamingContext has been initialized and has created its associated components such as DStreamGraph and JobScheduler, the application calls methods like StreamingContext.socketTextStream to create an input DStream, applies a series of transformations to it, and finally applies an output operation, which is what triggers the generation and execution of the per-batch jobs. With this setup done, calling StreamingContext.start() launches the Spark Streaming application. Inside start(), two more important components are created: ReceiverTracker and JobGenerator. Most importantly, start() also launches the Receiver associated with each input DStream of the application, running it inside an Executor on a worker node of the Spark cluster. The overall flow from data reception to task partitioning and scheduling is shown below:
[Figure 1: overview of Receiver data reception, task partitioning, and scheduling]

Let's now look at the StreamingContext.start() method that kicks off the streaming job:

// StreamingContext#start(): the entry point for starting the streaming application
def start(): Unit = synchronized {
  // ......
  ThreadUtils.runInNewThread("streaming-start") {
    sparkContext.setCallSite(startSite.get)
    sparkContext.clearJobGroup()
    sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
    savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
    // start the JobScheduler
    scheduler.start()
  }
  state = StreamingContextState.ACTIVE
}

As the code shows, StreamingContext.start() essentially delegates to scheduler.start() to start the JobScheduler. JobScheduler.start() creates and starts an EventLoop[JobSchedulerEvent] (eventLoop.start()) to process JobSchedulerEvent messages, fetches all InputDStreams from the DStreamGraph together with their rateControllers, registers each rateController on the listenerBus, and then creates and starts the two most important components: ReceiverTracker and JobGenerator.

  • JobGenerator: generates the batch jobs and submits them to the cluster for execution. It wraps the RDD operations derived from the DStreams: on a timer it calls DStreamGraph's generateJobs method to produce jobs and to clean up DStream metadata. The DStreamGraph holds all the DStream objects that make up the DStream graph and calls each DStream's generateJob method to build the concrete Job objects, which are then handed to the JobScheduler for scheduling and execution.
  • ReceiverTracker: tells the Executors to launch the corresponding Receiver data receivers, and manages the Receivers while they run.

// JobScheduler#start()
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start()
  receiverTracker = new ReceiverTracker(ssc) // create the ReceiverTracker, which handles data reception
  inputInfoTracker = new InputInfoTracker(ssc)

  val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
    case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
    case _ => null
  }

  executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
    executorAllocClient,
    receiverTracker,
    ssc.conf,
    ssc.graph.batchDuration.milliseconds,
    clock)
  executorAllocationManager.foreach(ssc.addStreamingListener)
  receiverTracker.start() // launch the Receivers associated with the input DStreams
  jobGenerator.start() // start the JobGenerator
  executorAllocationManager.foreach(_.start())
  logInfo("Started JobScheduler")
}

2. ReceiverTracker

ReceiverTracker is the component responsible for launching and managing the individual Receivers. Its main responsibilities are:

  1. Launch the Receivers on Executors and accept their registrations
  2. Receive the various messages sent by Receivers and handle them accordingly
  3. Update the rate at which Receivers accept data (i.e. rate limiting)
  4. Keep watching the Receivers' running state and restart any Receiver that stops; this is the Receiver fault-tolerance mechanism
  5. Stop the Receivers and report the errors they send back

Continuing from the ReceiverTracker.start() call above:

/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  // ......
  if (!receiverInputStreams.isEmpty) {
    // create the RPC endpoint used to communicate with the Receivers
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    // launch the Receivers
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}

// launch the Receivers
private def launchReceivers(): Unit = {
  // get the receivers from the input streams and assign each one its stream id as receiver id
  val receivers = receiverInputStreams.map { nis =>
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  }
  runDummySparkJob()
  logInfo("Starting " + receivers.length + " receivers")
  // send the StartAllReceivers message
  endpoint.send(StartAllReceivers(receivers))
}

ReceiverTracker first sets up an RPC endpoint, ReceiverTrackerEndpoint, which is used to communicate with the Receivers: it listens for and replies to Receiver-related RPC messages. It then calls launchReceivers() to launch all Receivers: launchReceivers() collects all Receivers from receiverInputStreams and sends a StartAllReceivers(receivers) message to the endpoint. ReceiverTrackerEndpoint's receive method handles this message: on StartAllReceivers it applies a scheduling policy to decide where each Receiver should be launched, and then calls startReceiver():

override def receive: PartialFunction[Any, Unit] = {
  // Local messages
  case StartAllReceivers(receivers) =>
    // use the scheduling policy to decide where each receiver should be launched
    val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
    for (receiver <- receivers) {
      val executors = scheduledLocations(receiver.streamId)
      updateReceiverScheduledExecutors(receiver.streamId, executors)
      receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
      startReceiver(receiver, executors) // launch this receiver
    }
}

The default scheduling policy tries to spread the Receivers evenly across the Executors. During scheduling, each Receiver is assigned the location(s) where it should be launched. The procedure is roughly as follows (a simplified sketch of the least-loaded assignment appears after the list):

  1. Collect the host address information of all Executors
  2. Create numReceiversOnExecutor to record how many Receivers have been assigned to each Executor
  3. Create scheduledLocations to record the locations assigned to each Receiver
  4. Schedule the Receivers that specify preferredLocation information: for each such Receiver, among the Executors on its preferred host pick the one that currently hosts the fewest Receivers, and update scheduledLocations and numReceiversOnExecutor accordingly
  5. Schedule the Receivers that do not specify preferredLocation information
  6. Sort the Executors by the number of Receivers already assigned, in ascending order, and assign an Executor to each remaining Receiver
  7. If there are still idle Executors left, add them to the candidate lists of the Receivers that currently have the fewest candidates
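
The following is a simplified, self-contained illustration of the "least-loaded executor first" idea behind steps 4-6. It is not the actual ReceiverSchedulingPolicy code; the types are reduced to plain strings and receiver ids:

import scala.collection.mutable

// Toy model: assign each receiver (by id) to the executor that currently hosts
// the fewest receivers, honoring a preferred host when one is given.
def scheduleReceivers(
    receiverIds: Seq[Int],
    preferredHost: Map[Int, String],   // receiverId -> preferred host (optional entries)
    executors: Seq[String]             // "host:executorId" style labels
  ): Map[Int, String] = {
  val load = mutable.Map(executors.map(_ -> 0): _*)
  receiverIds.map { id =>
    // restrict candidates to executors on the preferred host, if one was specified
    val candidates = preferredHost.get(id) match {
      case Some(host) =>
        val onHost = executors.filter(_.startsWith(host + ":"))
        if (onHost.nonEmpty) onHost else executors
      case None => executors
    }
    val chosen = candidates.minBy(load) // least-loaded candidate
    load(chosen) += 1
    id -> chosen
  }.toMap
}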

Next, startReceiver(receiver, executors) is used to launch the Receiver on the chosen Executors:

/**
 * Start a receiver along with its scheduled executors
 */
private def startReceiver(
    receiver: Receiver[_],
    scheduledLocations: Seq[TaskLocation]): Unit = {
  // ......
  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  // Function to start the receiver on the worker node
  // Function that starts the receiver on the worker node.
  // It is shipped together with the RDD; it creates a ReceiverSupervisorImpl, which handles the received data.
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }

  // Create the RDD using the scheduledLocations as preferred locations, so the receiver runs as a Spark job
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledLocations.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1)
    } else {
      val preferredLocations = scheduledLocations.map(_.toString).distinct
      ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
    }
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

  // Submit the job that starts the receiver to the SparkContext, distributing the receiver to a concrete executor
  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  
  // ......
}

In startReceiver(), the Receiver is wrapped into an RDD, with scheduledLocations used as the RDD's preferred locations (locationPrefs). A regular Spark Core job is then submitted through the SparkContext; its task function is startReceiverFunc, i.e. the code that will run on the Executor. That function creates a ReceiverSupervisorImpl (ReceiverSupervisor is the Executor-side manager of a Receiver, responsible for supervising and managing the Receiver running in that Executor) and calls its start() method, which invokes the receiver's onStart() and then returns immediately. A Receiver's onStart() normally spins up a new thread or thread pool to receive data; KafkaReceiver, for example, creates a thread pool and receives the topics' data in it. After supervisor.start() returns, supervisor.awaitTermination() blocks the task thread so that the task never exits and data can be received continuously.

ReceiverSupervisor on the Executor Side

Next, let's analyze ReceiverSupervisor.start(), the method executed on the Executor to actually launch the receiver:

// ReceiverSupervisor
def start() {
  onStart()
  startReceiver()
}

// onStart() as overridden in ReceiverSupervisorImpl
override protected def onStart() {
  registeredBlockGenerators.asScala.foreach { _.start() }
}

// ReceiverSupervisor method that starts the Receiver
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) { // register with the ReceiverTracker first, reporting that this receiver started successfully
      logInfo(s"Starting receiver $streamId")
      receiverState = Started
      receiver.onStart() // start the receiver and begin receiving data
      logInfo(s"Called receiver $streamId onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}

In ReceiverSupervisor.start(), the overridden onStart() is called first to start all registered BlockGenerators, which are responsible for slicing the received data stream into blocks. Then startReceiver() calls onReceiverStart() to register with the ReceiverTracker and report that this receiver started successfully, and finally calls receiver.onStart() to start the receiver itself and begin consuming the incoming stream. (As the official documentation explains, a user-defined receiver must extend the Receiver class and start a separate thread in its onStart() method to receive data; the onStart() invoked here is exactly that method, so the receiving work runs asynchronously in its own thread and keeps consuming the stream.) For example, the built-in SocketReceiver creates the actual communication Socket in its onStart() and keeps receiving the data sent to it:

def onStart() {
  try {
    socket = new Socket(host, port)
  } catch {
    // ......
  }
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}

def receive() {
  try {
    // read data from the socket input stream
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next()) // call store() to hand every received record over for storage
    }
  } catch {
    // ......
  }
}
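
A user-defined receiver follows the same pattern. The sketch below is illustrative only (the class name, the polling interval, and the data source are made up); it simply shows the onStart()/onStop()/store() contract described above. The only requirement is that received records are passed to store(), which connects the receiver to the BlockGenerator machinery discussed next.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that polls some source in its own thread and stores records
class MyPollingReceiver(pollIntervalMs: Long)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  @volatile private var thread: Thread = _

  override def onStart(): Unit = {
    // onStart() must return quickly, so the receiving loop runs in its own thread
    thread = new Thread("MyPollingReceiver") {
      setDaemon(true)
      override def run(): Unit = {
        while (!isStopped()) {
          fetchRecords().foreach(record => store(record)) // store() hands data to the ReceiverSupervisor
          Thread.sleep(pollIntervalMs)
        }
      }
    }
    thread.start()
  }

  override def onStop(): Unit = {
    // the loop above checks isStopped(), so nothing else is needed here
  }

  // placeholder for whatever source the receiver reads from
  private def fetchRecords(): Seq[String] = Seq.empty
}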

The official documentation also makes clear that once a receiver has received data it must explicitly call store() to have that data persisted. Next we look in detail at store() and at the overridden ReceiverSupervisorImpl.onStart() mentioned above (registeredBlockGenerators.asScala.foreach { _.start() }).

BlockGenerator

BlockGenerator is an important class on the Receiver side: it writes the received streaming records into a buffer and, on a timer, packages that buffer into blocks, which are then stored and reported to the Driver. Its key fields are:

// listener for block-related events: onAddData, onGenerateBlock, onPushBlock
listener: BlockGeneratorListener

// ArrayBuffer that temporarily holds the received records
@volatile private var currentBuffer = new ArrayBuffer[Any]

// queue that buffers the packaged Block objects
private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)

// timer that periodically packages the data in currentBuffer into a Block and pushes it into blocksForPushing
private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")

// thread that takes Blocks out of blocksForPushing, stores them, and reports them to the ReceiverTracker
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

The important pieces here, the blockIntervalTimer and the blockPushingThread, are both started from ReceiverSupervisorImpl.onStart(), which starts every registered BlockGenerator. BlockGenerator.start() looks like this:

/** Start block generating and pushing threads. */
def start(): Unit = synchronized {
  if (state == Initialized) {
    state = Active
    blockIntervalTimer.start()  // start the block-generation timer, which packages the records buffered in currentBuffer into Blocks
    blockPushingThread.start()  // start the block-pushing thread, which pushes the buffered Blocks on for storage and reporting
    logInfo("Started BlockGenerator")
  } else {
    throw new SparkException(
      s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
  }
}

First, the blockIntervalTimer: its callback updateCurrentBuffer packages everything currently in currentBuffer into a single Block and puts it into the blocksForPushing queue:

/** Change the buffer to which single records are added to. */
// On every tick, package all the data buffered in currentBuffer into a block and push that block into blocksForPushing
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    synchronized {
      if (currentBuffer.nonEmpty) {
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        listener.onGenerateBlock(blockId) // a block has been generated
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }

    if (newBlock != null) {
      blocksForPushing.put(newBlock)  // put is blocking when queue is full
    }
  } catch {
    // ......
  }
}

The blockPushingThread then runs keepPushingBlocks(), which takes blocks from the blocksForPushing queue and notifies the listener through listener.onPushBlock(); the listener in turn calls ReceiverSupervisorImpl#pushArrayBuffer(), which hands the block to the BlockManager so it can be stored in memory:

/** Keep pushing blocks to the BlockManager. */
private def keepPushingBlocks() {
  try {
    while (!blocksForPushing.isEmpty) {
      val block = blocksForPushing.take()
      logDebug(s"Pushing block $block")
      // call this class's pushBlock method
      pushBlock(block)
      logInfo("Blocks left to push " + blocksForPushing.size())
    }
  } catch {
    // ......
  }
}

// push a block to the listener
private def pushBlock(block: Block) {
  listener.onPushBlock(block.id, block.buffer)
  logInfo("Pushed block " + block.id)
}

// the listener that ReceiverSupervisorImpl registers with its default BlockGenerator
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }

  def onGenerateBlock(blockId: StreamBlockId): Unit = { }

  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }
  
  // when a block is pushed, call ReceiverSupervisorImpl.pushArrayBuffer()
  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}

// Store the received ArrayBuffer of records as a data block in Spark's memory
def pushArrayBuffer(
  arrayBuffer: ArrayBuffer[_],
  metadataOption: Option[Any],
  blockIdOption: Option[StreamBlockId]
) {
  // delegate to pushAndReportBlock()
  pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
}

// Store the block data and then report it to the Driver
def pushAndReportBlock(
  receivedBlock: ReceivedBlock,
  metadataOption: Option[Any],
  blockIdOption: Option[StreamBlockId]
) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  // this is where the data is actually stored
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  // report the storage result to the Driver
  if (!trackerEndpoint.askSync[Boolean](AddBlock(blockInfo))) {
      throw new SparkException("Failed to add block to receiver tracker.")
  }
  logDebug(s"Reported block $blockId")
}

The actual storage of the block in Spark memory and the report to the Driver both happen in ReceiverSupervisorImpl#pushAndReportBlock(), reached via pushArrayBuffer(). Let's look at the storage step and the reporting step in turn:

  • Block storage: val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)

Blocks are stored by a receivedBlockHandler. When the WAL is enabled (spark.streaming.receiver.writeAheadLog.enable set to true) this is a WriteAheadLogBasedBlockHandler, which allows data to be recovered from the write-ahead log if the application dies; otherwise it is a BlockManagerBasedBlockHandler. Both handlers ultimately use the blockManager to store the block in memory or on disk.
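
As a side note, enabling the WAL path requires both the flag mentioned above and a checkpoint directory (the error message in the code below insists on it). A minimal spark-shell-style configuration sketch, with an illustrative HDFS path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReceiverWithWAL")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // switch to WriteAheadLogBasedBlockHandler

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // required: the WAL files live under the checkpoint directory

The handler selection itself, in ReceiverSupervisorImpl, is as follows: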

private val receivedBlockHandler: ReceivedBlockHandler = {
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, env.serializerManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
// excerpt from the storeBlock method
case ArrayBufferBlock(arrayBuffer) =>
    numRecords = Some(arrayBuffer.size.toLong)
    blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel, tellMaster = true)
case IteratorBlock(iterator) =>
    val countIterator = new CountingIterator(iterator)
    val putResult = blockManager.putIterator(blockId, countIterator, storageLevel, tellMaster = true)
    numRecords = countIterator.count
    putResult
case ByteBufferBlock(byteBuffer) =>
    blockManager.putBytes(blockId, new ChunkedByteBuffer(byteBuffer.duplicate()), storageLevel, tellMaster = true)

  • Reporting to the Driver

After the block is stored, a ReceivedBlockInfo instance is created holding the block's metadata: the streamId (one InputDStream corresponds to one Receiver, and one Receiver corresponds to one streamId), the number of records in the block, the store result, and so on. This receivedBlockInfo is then sent to the ReceiverTracker over RPC in an AddBlock message. When the ReceiverTracker receives AddBlock(blockInfo), it handles it by calling addBlock(receivedBlockInfo):

// ReceiverTracker#AddBlock(blockInfo)
case AddBlock(receivedBlockInfo) =>
  if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
    walBatchingThreadPool.execute(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        if (active) {
          context.reply(addBlock(receivedBlockInfo))
        } else {
          throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
        }
      }
    })
  } else {
    context.reply(addBlock(receivedBlockInfo))
  }

/** Add new blocks for the given stream */
private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  receivedBlockTracker.addBlock(receivedBlockInfo)
}

Ultimately, receivedBlockTracker.addBlock() appends the block's metadata to a ReceivedBlockQueue. These queues live in the streamIdToUnallocatedBlockQueues map, keyed by streamId, with the value being that stream's queue of blocks:

/** Add received block. This event will get written to the write ahead log (if enabled). */
def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  try {
    val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
    if (writeResult) {
      synchronized {
        getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
      }
    }
    // ......
  }
}

/** Get the queue of received blocks belonging to a particular stream */
private def getReceivedBlockQueue(streamId: Int): ReceivedBlockQueue = {
  streamIdToUnallocatedBlockQueues.getOrElseUpdate(streamId, new ReceivedBlockQueue)
}

To summarize, the overall flow for receiving and handling the streaming data is as follows (see the diagrams after the list):

  • The Receiver receives the external data stream and hands the received records to the BlockGenerator, which buffers them in an ArrayBuffer
  • The BlockGenerator defines a timer that, at the configured interval, takes the contents of the ArrayBuffer, packages them into a Block, puts the Block into blocksForPushing (an ArrayBlockingQueue), and clears the ArrayBuffer
  • The BlockGenerator's blockPushingThread takes blocks out of the blocking queue and notifies the ReceiverSupervisor through the listener's onPushBlock callback
  • On receiving the callback, the ReceiverSupervisor processes the block's data: it stores the data via the BlockManager and reports the storage result to the ReceiverTracker
  • The ReceiverTracker records the reported block info in the unallocated-block queues (streamIdToUnallocatedBlockQueues), where it waits until the JobGenerator assigns it to an RDD when generating a job

[Figures 2 and 3: Receiver data reception, block generation, storage, and reporting flow diagrams]
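
A practical note on this flow: the interval used by blockIntervalTimer is controlled by the spark.streaming.blockInterval setting (200 ms by default), and together with the batch duration it determines how many blocks, and hence how many partitions, each batch RDD gets per receiver. A configuration sketch with illustrative values:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("BlockIntervalExample")
  .set("spark.streaming.blockInterval", "500ms") // cut a new block every 500 ms instead of the 200 ms default

// with a 5-second batch, each receiver contributes roughly 5000 ms / 500 ms = 10 blocks (partitions) per batch
val ssc = new StreamingContext(conf, Seconds(5))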

Finally, let's come back to the store() method defined on the Receiver interface. As noted earlier, once a receiver has received data it must explicitly call store() to have it persisted; some of the store() overloads are:

def store(dataItem: T) {
  supervisor.pushSingle(dataItem)
}

def store(dataBuffer: ArrayBuffer[T]) {
  supervisor.pushArrayBuffer(dataBuffer, None, None)
}

def store(dataIterator: Iterator[T]) {
  supervisor.pushIterator(dataIterator, None, None)
}

def store(bytes: ByteBuffer) {
  supervisor.pushBytes(bytes, None, None)
}
// ......

As the code shows, store() actually delegates to supervisor.pushSingle, supervisor.pushArrayBuffer, and so on to hand the data over. Receiver#store() has several overloads, and ReceiverSupervisor has matching pushSingle, pushArrayBuffer, pushIterator, and pushBytes methods:

  • pushSingle: for a single small record
  • pushArrayBuffer: for data in array (ArrayBuffer) form
  • pushIterator: for data in iterator form
  • pushBytes: for block data in ByteBuffer form

For single records, the BlockGenerator has to aggregate the individually stored records into a data block: pushSingle appends each record to the BlockGenerator's currentBuffer, and the BlockGenerator then periodically packages currentBuffer into a Block and goes through pushBlock, pushArrayBuffer, and pushAndReportBlock to store the block and report it to the Driver. The other overloads, for ArrayBuffer, Iterator, and ByteBuffer data, likewise all end up calling ReceiverSupervisorImpl#pushAndReportBlock() to store the block and then report it to the Driver-side ReceiverTracker.
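
To make the single-record path concrete, the toy sketch below mimics what pushSingle, addData, and updateCurrentBuffer do together (buffer the records, then package them on a timer tick). It is an illustration only, not Spark code; rate limiting, the listener callbacks, and error handling are omitted:

import scala.collection.mutable.ArrayBuffer

object SingleRecordPathDemo {
  private var currentBuffer = new ArrayBuffer[Any]
  private val blocks = new ArrayBuffer[Seq[Any]]

  // what supervisor.pushSingle -> BlockGenerator.addData amounts to: append under the lock
  def pushSingle(data: Any): Unit = synchronized {
    currentBuffer += data
  }

  // what the blockIntervalTimer's updateCurrentBuffer does: swap the buffer out as one block
  def onBlockIntervalTick(): Unit = synchronized {
    if (currentBuffer.nonEmpty) {
      blocks += currentBuffer.toSeq
      currentBuffer = new ArrayBuffer[Any]
    }
  }

  def main(args: Array[String]): Unit = {
    (1 to 5).foreach(i => pushSingle(s"record-$i"))
    onBlockIntervalTick()
    println(s"blocks: ${blocks.size}, records in block 0: ${blocks.head.size}")
  }
}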
