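DT大数据梦工厂 (DT Big Data Dream Factory) Spark Customization Class Notes (010)

Spark Streaming source code walkthrough: a thorough study of the full lifecycle of continuous stream data reception

This continues from Lecture 9.

ReceiverSupervisor.scala (lines 127-131)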

def start() {
  onStart()
  startReceiver()
}

 

onStart, ReceiverSupervisor.scala (lines 107-112)

/**
 * Called when supervisor is started.
 * Note that this must be called before the receiver.onStart() is called to ensure
 * things like [[BlockGenerator]]s are started before the receiver starts sending data.
 */
protected def onStart() { }

This hook must run before receiver.onStart() is called.

The implementation is in ReceiverSupervisorImpl.scala (lines 172-174)

override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}

Definition of registeredBlockGenerators

ReceiverSupervisorImpl.scala (lines 95-96)

private val registeredBlockGenerators = new mutable.ArrayBuffer[BlockGenerator]
  with mutable.SynchronizedBuffer[BlockGenerator]

Operations on registeredBlockGenerators, ReceiverSupervisorImpl.scala (lines 194-202)

override def createBlockGenerator(
    blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
  // Cleanup BlockGenerators that have already been stopped
  registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() }

  val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
  registeredBlockGenerators += newBlockGenerator
  newBlockGenerator
}

This function is called when ReceiverSupervisorImpl is instantiated. ReceiverSupervisorImpl.scala (line 112)

private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)

Note the defaultBlockGeneratorListener here; it comes up again later.

ReceiverSupervisorImpl.scala (lines 99-111)

private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }

  def onGenerateBlock(blockId: StreamBlockId): Unit = { }

  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }

  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}

Pay particular attention to the onPushBlock function.

Back in ReceiverSupervisorImpl.scala (lines 172-174), onStart calls start on each BlockGenerator.

BlockGenerator.scala (lines 115-125)

def start(): Unit = synchronized {
  if (state == Initialized) {
    state = Active
    blockIntervalTimer.start()
    blockPushingThread.start()
    logInfo("Started BlockGenerator")
  } else {
    throw new SparkException(
      s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
  }
}

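This starts blockIntervalTimer and blockPushingThread. blockIntervalTimer is a timer that, by default, calls back updateCurrentBuffer every 200 ms. The callback interval is set with the parameter spark.streaming.blockInterval, which is also a performance-tuning knob: too short an interval produces too many small, fragmented blocks, while too long an interval can produce blocks that are too large, so the right value depends on the actual workload. updateCurrentBuffer wraps the data received so far into a block for storage. blockPushingThread repeatedly takes blocks from the blocksForPushing queue, stores them, and reports them to the ReceiverTrackerEndpoint.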
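To make the tuning trade-off concrete, here is a minimal sketch of the arithmetic, assuming data arrives in every block interval so that one block is cut per interval. The object and method names are hypothetical and the 2-second batch interval is only an illustrative value; only the 200 ms default comes from the text above.

```scala
object BlockIntervalSketch {
  /** Approximate number of blocks one receiver produces per batch,
    * assuming every block interval sees at least one record. */
  def blocksPerBatch(batchIntervalMs: Long, blockIntervalMs: Long): Long = {
    require(blockIntervalMs > 0, "block interval must be positive")
    batchIntervalMs / blockIntervalMs
  }

  def main(args: Array[String]): Unit = {
    // Default 200 ms block interval with a hypothetical 2-second batch.
    println(blocksPerBatch(2000L, 200L)) // prints 10
    // A shorter interval yields more, smaller blocks for the same batch.
    println(blocksPerBatch(2000L, 50L)) // prints 40
  }
}
```

Since each block is stored and reported separately, a smaller interval means more blocks to store and track per batch, which is the fragmentation cost mentioned above.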

blockIntervalTimer is a timer, defined as follows

BlockGenerator.scala (lines 105-106)

private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")

It calls updateCurrentBuffer periodically.

BlockGenerator.scala (lines 232-254)

private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    synchronized {
      if (currentBuffer.nonEmpty) {
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        listener.onGenerateBlock(blockId)
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }

    if (newBlock != null) {
      blocksForPushing.put(newBlock) // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}

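If currentBuffer is non-empty, the method keeps a reference to it as newBlockBuffer, replaces currentBuffer with a newly instantiated empty ArrayBuffer, instantiates a Block with newBlockBuffer passed in, and finally puts newBlock into the blocksForPushing queue.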
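The buffer swap described above can be sketched in isolation. This is a simplified stand-in rather than Spark's BlockGenerator: SketchBlock and BufferSwapSketch are hypothetical names, the block id is just the timer timestamp, and listener callbacks and error handling are left out.

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

// Hypothetical block type: an id plus the records collected for it.
final case class SketchBlock(id: Long, buffer: ArrayBuffer[Any])

// Hypothetical, simplified stand-in for BlockGenerator's buffering logic.
final class BufferSwapSketch(queueCapacity: Int) {
  private var currentBuffer = new ArrayBuffer[Any]
  val blocksForPushing = new ArrayBlockingQueue[SketchBlock](queueCapacity)

  /** Receiver side: append one record to the buffer currently being filled. */
  def addData(data: Any): Unit = synchronized { currentBuffer += data }

  /** Timer side: swap buffers under the lock, then enqueue outside the lock. */
  def updateCurrentBuffer(time: Long): Unit = {
    var newBlock: Option[SketchBlock] = None
    synchronized {
      if (currentBuffer.nonEmpty) {
        val filled = currentBuffer
        currentBuffer = new ArrayBuffer[Any] // fresh buffer for new records
        newBlock = Some(SketchBlock(time, filled))
      }
    }
    // put blocks when the queue is full, so it is done outside the lock
    // to avoid stalling addData callers while waiting for queue space.
    newBlock.foreach(b => blocksForPushing.put(b))
  }
}
```

Swapping the reference instead of copying keeps the critical section short: the receiver thread is blocked only for the duration of the pointer swap, not for the enqueue.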

 

blockPushingThread, BlockGenerator.scala (line 109)

private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

keepPushingBlocks() is implemented as follows (BlockGenerator.scala, lines 256-289)

/** Keep pushing blocks to the BlockManager. */
private def keepPushingBlocks() {
  logInfo("Started block pushing thread")

  def areBlocksBeingGenerated: Boolean = synchronized {
    state != StoppedGeneratingBlocks
  }

  try {
    // While blocks are being generated, keep polling for to-be-pushed blocks and push them.
    while (areBlocksBeingGenerated) {
      Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
        case Some(block) => pushBlock(block)
        case None =>
      }
    }

    // At this point, state is StoppedGeneratingBlock. So drain the queue of to-be-pushed blocks.
    logInfo("Pushing out the last " + blocksForPushing.size() + " blocks")
    while (!blocksForPushing.isEmpty) {
      val block = blocksForPushing.take()
      logDebug(s"Pushing block $block")
      pushBlock(block)
      logInfo("Blocks left to push " + blocksForPushing.size())
    }
    logInfo("Stopped block pushing thread")
  } catch {
    case ie: InterruptedException =>
      logInfo("Block pushing thread was interrupted")
    case e: Exception =>
      reportError("Error in block pushing thread", e)
  }
}

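The core of it is this loop

while (areBlocksBeingGenerated) {
  Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
    case Some(block) => pushBlock(block)
    case None =>
  }
}

It polls the blocksForPushing queue with a 10 ms timeout, and each time a block is available it calls pushBlock on it.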
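The poll-then-drain shape of keepPushingBlocks can be tried out on its own. The sketch below is an illustration under assumptions, not Spark code: PollAndDrainSketch, keepPushing, and the stillGenerating flag are hypothetical names standing in for the state check and pushBlock.

```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object PollAndDrainSketch {
  /** Polls `queue` while `stillGenerating()` is true, then drains what is left. */
  def keepPushing[A](queue: ArrayBlockingQueue[A], stillGenerating: () => Boolean)
                    (push: A => Unit): Unit = {
    while (stillGenerating()) {
      // poll returns null on timeout; Option(null) is None, so nothing is pushed.
      Option(queue.poll(10, TimeUnit.MILLISECONDS)).foreach(push)
    }
    // Generation has stopped, so no new elements arrive: drain the remainder.
    while (!queue.isEmpty) push(queue.take())
  }

  def main(args: Array[String]): Unit = {
    val queue = new ArrayBlockingQueue[String](8)
    Seq("b1", "b2", "b3").foreach(b => queue.put(b))
    val pushed = ArrayBuffer[String]()
    // Generation is already "stopped" here, so only the drain phase runs.
    keepPushing(queue, () => false)(b => pushed += b)
    println(pushed.mkString(","))
  }
}
```

The timed poll matters because a plain take() would block forever on an empty queue and never re-check the stop condition.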

BlockGenerator.scala (lines 295-298)

private def pushBlock(block: Block) {
  listener.onPushBlock(block.id, block.buffer)
  logInfo("Pushed block " + block.id)
}

The listener here was already instantiated when the BlockGenerator was initialized (along with ReceiverSupervisorImpl): it is defaultBlockGeneratorListener. See the createBlockGenerator function shown earlier.

So the onPushBlock implementation is as follows (ReceiverSupervisorImpl.scala, lines 108-110)

def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
  pushArrayBuffer(arrayBuffer, None, Some(blockId))
}

ReceiverSupervisorImpl.scala (lines 123-129)

def pushArrayBuffer(
    arrayBuffer: ArrayBuffer[_],
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
}

pushAndReportBlock is where the heavy lifting happens (ReceiverSupervisorImpl.scala, lines 149-163)

/** Store block and report it to driver */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}

First, receivedBlockHandler stores the block.
Second, the storage result, blockInfo, is reported to trackerEndpoint.
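The two steps can be sketched with stand-in types. Everything named here (StoreResult, BlockStore, Tracker, pushAndReport) is hypothetical and much simpler than Spark's ReceivedBlockHandler and RPC endpoint; the point is only the ordering, store first and then report metadata.

```scala
object StoreThenReportSketch {
  // Hypothetical stand-ins; these are not Spark's real types.
  final case class StoreResult(blockId: String, numRecords: Long)

  trait BlockStore {
    def storeBlock(blockId: String, data: Seq[Any]): StoreResult
  }

  trait Tracker {
    def addBlock(result: StoreResult): Boolean
  }

  /** Step 1 persists the data; step 2 reports only the metadata. */
  def pushAndReport(store: BlockStore, tracker: Tracker)
                   (blockId: String, data: Seq[Any]): Boolean = {
    val result = store.storeBlock(blockId, data) // the data stays with the store
    tracker.addBlock(result)                     // the tracker sees metadata only
  }

  def main(args: Array[String]): Unit = {
    val store = new BlockStore {
      def storeBlock(blockId: String, data: Seq[Any]): StoreResult =
        StoreResult(blockId, data.size.toLong)
    }
    val tracker = new Tracker {
      def addBlock(result: StoreResult): Boolean = result.numRecords >= 0
    }
    println(pushAndReport(store, tracker)("block-0", Seq("a", "b")))
  }
}
```

Reporting only after the store call returns means the tracker never learns about a block that was not stored.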

 

receivedBlockHandler is implemented as follows (ReceiverSupervisorImpl.scala, lines 53-66)

private val receivedBlockHandler: ReceivedBlockHandler = {
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}

There are two implementations: WriteAheadLogBasedBlockHandler and BlockManagerBasedBlockHandler.
BlockManagerBasedBlockHandler uses the BlockManager to store the block and returns the block's metadata.
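The selection rule in the listing above reduces to a small decision function. This is a sketch with hypothetical types (Handler, BlockManagerBased, WriteAheadLogBased, choose) and a plain Boolean in place of the configuration lookup; the checkpoint path used in the demo is made up.

```scala
object HandlerChoiceSketch {
  // Hypothetical markers for the two handler kinds.
  sealed trait Handler
  case object BlockManagerBased extends Handler
  final case class WriteAheadLogBased(checkpointDir: String) extends Handler

  /** With the write-ahead log on, a checkpoint directory is mandatory. */
  def choose(walEnabled: Boolean, checkpointDir: Option[String]): Handler =
    if (walEnabled) {
      val dir = checkpointDir.getOrElse(
        throw new IllegalStateException(
          "write-ahead log requires a checkpoint directory"))
      WriteAheadLogBased(dir)
    } else {
      BlockManagerBased
    }

  def main(args: Array[String]): Unit = {
    println(choose(walEnabled = false, checkpointDir = None))
    println(choose(walEnabled = true, checkpointDir = Some("/tmp/checkpoint")))
  }
}
```

Failing fast when the directory is missing mirrors the SparkException in the listing: the misconfiguration surfaces when the supervisor is constructed rather than when the first block is written.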

 

With BlockGenerator covered, we next look at the startReceiver() method called from supervisor.start().

ReceiverSupervisor.scala (lines 143-158)

def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}

Implementation of onReceiverStart (ReceiverSupervisorImpl.scala, lines 181-185)

override protected def onReceiverStart(): Boolean = {
  val msg = RegisterReceiver(
    streamId, receiver.getClass.getSimpleName, host, executorId, endpoint)
  trackerEndpoint.askWithRetry[Boolean](msg)
}

This mainly sends a RegisterReceiver message to trackerEndpoint to register the receiver.

receiver.onStart()

Here we take SocketReceiver as the example.

SocketInputDStream.scala (lines 55-61)

def onStart() {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}

Definition of receive (SocketInputDStream.scala, lines 69-96)

def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while (!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    if (!isStopped()) {
      restart("Socket data stream had no more data")
    } else {
      logInfo("Stopped receiving")
    }
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case NonFatal(e) =>
      logWarning("Error receiving data", e)
      restart("Error receiving data", e)
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}

This opens the socket and stores the data it receives.

Implementation of store, Receiver.scala (lines 118-120)

def store(dataItem: T) {
  supervisor.pushSingle(dataItem)
}

ReceiverSupervisorImpl.scala (lines 118-120)

def pushSingle(data: Any) {
  defaultBlockGenerator.addData(data)
}

BlockGenerator.scala (lines 160-175)

def addData(data: Any): Unit = {
  if (state == Active) {
    waitToPush()
    synchronized {
      if (state == Active) {
        currentBuffer += data
      } else {
        throw new SparkException(
          "Cannot add data as BlockGenerator has not been started or has been stopped")
      }
    }
  } else {
    throw new SparkException(
      "Cannot add data as BlockGenerator has not been started or has been stopped")
  }
}

currentBuffer += data keeps appending incoming records to currentBuffer, which updateCurrentBuffer then periodically cuts into blocks.

