Spark Source Code Analysis: Receiving Data in Spark Streaming

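In the previous post we covered the preparation a Spark Streaming application performs after it starts, namely launching the receivers on the executors.
This post follows the data from the moment it is received to the moment it is stored.
Receiving starts in the receiver's onStart method. We again take SocketReceiver as the example: its onStart method starts a thread, and that thread calls receive to read and handle the incoming data.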

def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
   ...
  } catch {
  ...
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}

As shown here, each received record is handed to the store method. store is overloaded, and its variants look roughly like this:

def store(dataItem: T) {
    supervisor.pushSingle(dataItem)
}


def store(dataBuffer: ArrayBuffer[T]) {
  supervisor.pushArrayBuffer(dataBuffer, None, None)
}

def store(dataBuffer: ArrayBuffer[T], metadata: Any) {
  supervisor.pushArrayBuffer(dataBuffer, Some(metadata), None)
}

def store(dataIterator: Iterator[T]) {
  supervisor.pushIterator(dataIterator, None, None)
}
...

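There are more overloads, but each of them just calls one of the supervisor's pushXXXX methods. In ReceiverSupervisorImpl these methods all eventually end up in ReceiverSupervisorImpl#pushAndReportBlock (pushSingle gets there indirectly: it first buffers records in a BlockGenerator, which later pushes them as a block):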

def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  // Store the block through the handler (into the BlockManager); this is where the write-ahead log mechanism comes in
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)

  val numRecords = blockStoreResult.numRecords

  // Wrap the result in a ReceivedBlockInfo object, which carries the streamId
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)

  // Send an AddBlock message to the ReceiverTracker; this case class carries the block's details (blockInfo)
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}

Note receivedBlockHandler here. It is initialized as follows:

private val receivedBlockHandler: ReceivedBlockHandler = {
 // If the write-ahead log is enabled (it is disabled by default),
 // receivedBlockHandler is a WriteAheadLogBasedBlockHandler;
 // otherwise it is a BlockManagerBasedBlockHandler
 if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
   if (checkpointDirOption.isEmpty) {
     throw new SparkException(
       "Cannot enable receiver write-ahead log without checkpoint directory set. " +
         "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
         "See documentation for more details.")
   }
   new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
     receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
 } else {
   new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
 }
}
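
To make the selection rule concrete, here is a minimal standalone sketch; it is not Spark's code. It assumes the flag comes from the `spark.streaming.receiver.writeAheadLog.enable` setting, and it uses a plain Map in place of SparkConf. `Handler`, `WalHandler`, `MemoryHandler` and `chooseHandler` are hypothetical names introduced only for illustration.

```scala
object HandlerSelection {
  // Hypothetical stand-ins for the two ReceivedBlockHandler subclasses.
  sealed trait Handler
  final case class WalHandler(checkpointDir: String) extends Handler
  case object MemoryHandler extends Handler

  // Setting that turns the receiver write-ahead log on; treated as false when absent.
  val WalKey = "spark.streaming.receiver.writeAheadLog.enable"

  // A plain Map stands in for SparkConf here (assumption for the sketch).
  def chooseHandler(conf: Map[String, String], checkpointDir: Option[String]): Handler = {
    val walEnabled = conf.get(WalKey).exists(_.toBoolean)
    if (walEnabled) {
      // The log-backed handler cannot work without a checkpoint directory.
      val dir = checkpointDir.getOrElse(
        throw new IllegalStateException("write-ahead log requires a checkpoint directory"))
      WalHandler(dir)
    } else {
      MemoryHandler
    }
  }
}
```

With an empty configuration the sketch falls back to `MemoryHandler`, which mirrors the default of the flag being off.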

Depending on whether the write-ahead log is enabled, a different subclass is instantiated. We take the storeBlock method of WriteAheadLogBasedBlockHandler as the example:

def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

   var numRecords = None: Option[Long]
   // Serialize the block so that it can be inserted into both
   // First serialize the data through the BlockManager
   val serializedBlock = block match {
     case ArrayBufferBlock(arrayBuffer) =>
       numRecords = Some(arrayBuffer.size.toLong)
       blockManager.dataSerialize(blockId, arrayBuffer.iterator)
     case IteratorBlock(iterator) =>
       val countIterator = new CountingIterator(iterator)
       val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
       numRecords = countIterator.count
       serializedBlock
     case ByteBufferBlock(byteBuffer) =>
       byteBuffer
     case _ =>
       throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
   }

   // Store the block in block manager
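   // Save the data to the BlockManager. A receiver's default storage level is a _SER_2 one
   // (serialized, with a replica on another executor's BlockManager for fault tolerance).
   // Note that the call below passes effectiveStorageLevel rather than storageLevel: with the
   // write-ahead log enabled, the log provides the fault tolerance, so this handler stores a
   // single serialized copy.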
   val storeInBlockManagerFuture = Future {
     val putResult =
       blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
     if (!putResult.map { _._1 }.contains(blockId)) {
       throw new SparkException(
         s"Could not store $blockId to block manager with storage level $storageLevel")
     }
   }
   // Store the block in write ahead log
   // Write the block to the write-ahead log
   val storeInWriteAheadLogFuture = Future {
     writeAheadLog.write(serializedBlock, clock.getTimeMillis())
   }
   // Combine the futures, wait for both to complete, and return the write ahead log record handle
   val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
   val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
   WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
}

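This method first serializes the block, then writes the serialized data to the BlockManager and to the write-ahead log in two concurrent Futures, and blocks until both have completed (or blockStoreTimeout expires) before returning the log record handle.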
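
The "two writes, one wait" pattern in storeBlock can be tried outside Spark. The following is only a sketch under stated assumptions: an in-memory map stands in for the BlockManager, a queue stands in for the write-ahead log, the queue position stands in for the log record handle, and the 5 second timeout is arbitrary. `ParallelStore` and its members are hypothetical names.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelStore {
  // Hypothetical stand-ins: an in-memory "block manager" and an append-only "log".
  private val blocks = scala.collection.concurrent.TrieMap.empty[String, Array[Byte]]
  private val log = new java.util.concurrent.ConcurrentLinkedQueue[Array[Byte]]()

  // Returns the log position as a stand-in for the log record handle.
  def storeBlock(blockId: String, bytes: Array[Byte]): Int = {
    // Both writes start immediately and run concurrently.
    val storeInMemory = Future { blocks.put(blockId, bytes); () }
    val storeInLog = Future { log.synchronized { log.add(bytes); log.size - 1 } }
    // zip succeeds only when both futures succeed; keep just the log handle.
    val combined = storeInMemory.zip(storeInLog).map(_._2)
    Await.result(combined, 5.seconds)
  }

  def stored(blockId: String): Boolean = blocks.contains(blockId)
  def logSize: Int = log.size
}
```

Because the caller blocks on the combined future, a failure or timeout in either write surfaces as an exception from `storeBlock`, so a block is never reported unless both copies were written.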

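Back to pushAndReportBlock above. So far we have covered its push part; now we turn to the report part.
It wraps the block's details in a blockInfo object and sends an AddBlock message to the ReceiverTracker. The ReceiverTracker handles that message as follows (the last addBlock shown is the one in ReceivedBlockTracker):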

case AddBlock(receivedBlockInfo) =>
    context.reply(addBlock(receivedBlockInfo))

private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  receivedBlockTracker.addBlock(receivedBlockInfo)
}

def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = synchronized {
     try {
       writeToLog(BlockAdditionEvent(receivedBlockInfo))
       getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
       logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
         s"block ${receivedBlockInfo.blockStoreResult.blockId}")
       true
     } catch {
       case e: Exception =>
         logError(s"Error adding block $receivedBlockInfo", e)
         false
     }
}

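The key line here is getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo: it looks up the queue for the block's streamId in a map keyed by streamId (creating the queue on first use) and appends the blockInfo to it. Before that, writeToLog records the addition as a BlockAdditionEvent.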
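
The per-stream queue bookkeeping can be sketched in a few lines of plain Scala. This is an illustration of the data structure only, not the tracker's actual implementation: `BlockQueues`, `BlockInfo`, `queueFor` and `pending` are hypothetical names, and the write to the log is left out.

```scala
import scala.collection.mutable

object BlockQueues {
  // Hypothetical, simplified description of a received block.
  final case class BlockInfo(streamId: Int, blockId: String)

  // One queue of not-yet-processed blocks per input stream.
  private val queuesByStream = mutable.HashMap.empty[Int, mutable.Queue[BlockInfo]]

  // Look up the stream's queue, creating and registering it on first use.
  private def queueFor(streamId: Int): mutable.Queue[BlockInfo] =
    queuesByStream.getOrElseUpdate(streamId, mutable.Queue.empty[BlockInfo])

  // Append the block to its stream's queue; synchronized because many receivers report.
  def addBlock(info: BlockInfo): Boolean = synchronized {
    queueFor(info.streamId) += info
    true
  }

  // Snapshot of the blocks queued for a stream, in arrival order.
  def pending(streamId: Int): Seq[BlockInfo] = synchronized {
    queuesByStream.get(streamId).map(_.toList).getOrElse(Nil)
  }
}
```

Keying the map by stream id keeps blocks from different receivers apart while preserving arrival order within each stream.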

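To summarize, after a Receiver receives data it mainly does two things:
1. It stores the received data in the BlockManager (and in the write-ahead log, if that is enabled).
2. It sends the block's blockInfo to the ReceiverTracker as the payload of an AddBlock message; the ReceiverTracker keeps it in a structure of the form Map[streamId, Queue[blockInfo]].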
