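In the previous post we covered the preparation work a Spark Streaming application does after it starts, namely launching the receivers on the executors.
This post walks through what happens from receiving data to storing it.
Data is first received in the receiver's onStart method. We again use SocketReceiver as the example: its onStart method starts a thread, and that thread calls the receive method, which handles the incoming data: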
def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    ...
  } catch {
    ...
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
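The shape of this loop can be tried out without a socket. The following is only a minimal sketch: `ReceiveLoopSketch`, `InMemorySink` and the `isStopped` callback are hypothetical names invented for illustration, not Spark classes, and a `StringReader` stands in for the socket's input stream.

```scala
import java.io.{BufferedReader, Reader, StringReader}
import scala.collection.mutable.ArrayBuffer

object ReceiveLoopSketch {
  // Hypothetical stand-in for Receiver.store: collects records in memory.
  final class InMemorySink {
    val records = ArrayBuffer.empty[String]
    def store(item: String): Unit = records += item
  }

  // Reads lines until the source is exhausted or isStopped() returns true,
  // handing each line to the sink. The reader is closed in all cases.
  def receive(source: Reader, sink: InMemorySink, isStopped: () => Boolean): Unit = {
    val reader = new BufferedReader(source)
    try {
      val lines = Iterator.continually(reader.readLine()).takeWhile(_ != null)
      while (!isStopped() && lines.hasNext) {
        sink.store(lines.next())
      }
    } finally {
      reader.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val sink = new InMemorySink
    receive(new StringReader("a\nb\nc"), sink, () => false)
    println(sink.records.mkString(","))
  }
}
```

As in the original method, the stop flag is checked before each read, and cleanup sits in `finally` so the source is released even if storing a record throws.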
In receive, every record that arrives is passed to the store method, which saves it. store is overloaded and has many variants, roughly as follows:
def store(dataItem: T) {
  supervisor.pushSingle(dataItem)
}
def store(dataBuffer: ArrayBuffer[T]) {
  supervisor.pushArrayBuffer(dataBuffer, None, None)
}
def store(dataBuffer: ArrayBuffer[T], metadata: Any) {
  supervisor.pushArrayBuffer(dataBuffer, Some(metadata), None)
}
def store(dataIterator: Iterator[T]) {
  supervisor.pushIterator(dataIterator, None, None)
}
...
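As a rough illustration of the delegation these overloads perform, the sketch below funnels several `store` variants into one private method on a supervisor. All types here (`Supervisor`, `Receiver`, `Block` and its subclasses) are simplified, hypothetical stand-ins rather than the Spark classes, and the "report" step is reduced to appending to a buffer.

```scala
import scala.collection.mutable.ArrayBuffer

object StoreOverloadsSketch {
  // Simplified block variants, loosely modelled on the idea of ReceivedBlock.
  sealed trait Block
  final case class ArrayBufferBlock(items: ArrayBuffer[String]) extends Block
  final case class IteratorBlock(items: Iterator[String]) extends Block

  // Hypothetical supervisor: every push variant ends in pushAndReport.
  final class Supervisor {
    val reported = ArrayBuffer.empty[(Block, Option[Any])]

    def pushArrayBuffer(buf: ArrayBuffer[String], metadata: Option[Any]): Unit =
      pushAndReport(ArrayBufferBlock(buf), metadata)

    def pushIterator(it: Iterator[String], metadata: Option[Any]): Unit =
      pushAndReport(IteratorBlock(it), metadata)

    // Single sink for all variants; here it only records what it was given.
    private def pushAndReport(block: Block, metadata: Option[Any]): Unit =
      reported += ((block, metadata))
  }

  // Hypothetical receiver whose store overloads only wrap arguments and delegate.
  final class Receiver(supervisor: Supervisor) {
    def store(buf: ArrayBuffer[String]): Unit =
      supervisor.pushArrayBuffer(buf, None)
    def store(buf: ArrayBuffer[String], metadata: Any): Unit =
      supervisor.pushArrayBuffer(buf, Some(metadata))
    def store(it: Iterator[String]): Unit =
      supervisor.pushIterator(it, None)
  }

  def main(args: Array[String]): Unit = {
    val supervisor = new Supervisor
    val receiver = new Receiver(supervisor)
    receiver.store(ArrayBuffer("a", "b"))
    receiver.store(Iterator("c"))
    println(supervisor.reported.size)
  }
}
```

Keeping the overloads thin means the storage and reporting logic lives in one place, so each new `store` variant only has to decide how to wrap its arguments.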
The remaining overloads follow the same pattern: each calls one of the supervisor's pushXXXX methods. In ReceiverSupervisorImpl, these methods all eventually call ReceiverSupervisorImpl#pushAndReportBlock:
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  // Store the block through the block handler; this is where the write-ahead log mechanism comes in
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  val numRecords = blockStoreResult.numRecords
  // Wrap the result in a ReceivedBlockInfo object, which carries the streamId
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  // Send an AddBlock message to the ReceiverTracker; this case class carries the block's metadata, blockInfo
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
Pay attention to receivedBlockHandler in this method. It is initialized as follows:
private val receivedBlockHandler: ReceivedBlockHandler = {
  // If the write-ahead log is enabled (it is disabled by default),
  // receivedBlockHandler is a WriteAheadLogBasedBlockHandler.
  // If the write-ahead log is not enabled, receivedBlockHandler is a BlockManagerBasedBlockHandler.
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
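The decision made in this initializer can be reduced to a small pure function. The sketch below is an assumption-laden simplification: `BlockHandler`, `BlockManagerOnly`, `WithWriteAheadLog` and `chooseHandler` are hypothetical names, and a plain `IllegalStateException` stands in for `SparkException`.

```scala
object HandlerSelectionSketch {
  // Hypothetical stand-ins for the two handler implementations.
  sealed trait BlockHandler
  case object BlockManagerOnly extends BlockHandler
  final case class WithWriteAheadLog(checkpointDir: String) extends BlockHandler

  // Picks a handler from a boolean flag and an optional checkpoint directory.
  // Enabling the log without a checkpoint directory is treated as a configuration error.
  def chooseHandler(walEnabled: Boolean, checkpointDir: Option[String]): BlockHandler =
    if (walEnabled) {
      val dir = checkpointDir.getOrElse(
        throw new IllegalStateException(
          "write-ahead log enabled but no checkpoint directory set"))
      WithWriteAheadLog(dir)
    } else {
      BlockManagerOnly
    }

  def main(args: Array[String]): Unit = {
    println(chooseHandler(walEnabled = false, checkpointDir = None))
    println(chooseHandler(walEnabled = true, checkpointDir = Some("/tmp/checkpoint")))
  }
}
```

Failing fast at construction time, as the original does, surfaces a missing checkpoint directory before any block is received rather than at the first write.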
Depending on whether the write-ahead log is enabled, a different subclass is instantiated. Here we take the storeBlock method of WriteAheadLogBasedBlockHandler as the example:
def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {
  var numRecords = None: Option[Long]
  // Serialize the block so that it can be inserted into both
  // First serialize the data with the BlockManager
  val serializedBlock = block match {
    case ArrayBufferBlock(arrayBuffer) =>
      numRecords = Some(arrayBuffer.size.toLong)
      blockManager.dataSerialize(blockId, arrayBuffer.iterator)
    case IteratorBlock(iterator) =>
      val countIterator = new CountingIterator(iterator)
      val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
      numRecords = countIterator.count
      serializedBlock
    case ByteBufferBlock(byteBuffer) =>
      byteBuffer
    case _ =>
      throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
  }
  // Store the block in block manager
  // Save the data to the BlockManager. The receiver's default storage level is a _SER, _2 one:
  // data is serialized and a replica is copied to the BlockManager on another executor
  // for fault tolerance. Note that the call below passes effectiveStorageLevel, not storageLevel.
  val storeInBlockManagerFuture = Future {
    val putResult =
      blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
  }
  // Store the block in write ahead log
  // Write the block to the write-ahead log
  val storeInWriteAheadLogFuture = Future {
    writeAheadLog.write(serializedBlock, clock.getTimeMillis())
  }
  // Combine the futures, wait for both to complete, and return the write ahead log record handle
  val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
  val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
  WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
}
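This method first serializes the block. It then writes the serialized data to the BlockManager and to the write-ahead log in two separate futures, waits for both to complete, and returns the handle of the log record.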
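The "two futures, zip, await" pattern can be seen in isolation in the sketch below. It is not the Spark implementation: `MemoryStore` and `MemoryLog` are hypothetical in-memory stand-ins for the BlockManager and the write-ahead log, the log handle is simply the record's index, and the 5-second default timeout is an arbitrary choice.

```scala
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelStoreSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical in-memory stand-in for the BlockManager.
  final class MemoryStore {
    private val blocks = mutable.Map.empty[String, Array[Byte]]
    def put(id: String, bytes: Array[Byte]): Unit = synchronized { blocks(id) = bytes }
    def contains(id: String): Boolean = synchronized { blocks.contains(id) }
  }

  // Hypothetical in-memory stand-in for the write-ahead log.
  final class MemoryLog {
    private val records = mutable.ArrayBuffer.empty[Array[Byte]]
    // Appends a record and returns its index, used here as the record handle.
    def write(bytes: Array[Byte]): Int = synchronized {
      records += bytes
      records.size - 1
    }
  }

  // Starts both writes concurrently, waits for both, and returns the log handle.
  // If either write fails or the timeout expires, Await.result throws.
  def storeBlock(
      id: String,
      bytes: Array[Byte],
      store: MemoryStore,
      log: MemoryLog,
      timeout: FiniteDuration = 5.seconds): Int = {
    val storeFuture = Future { store.put(id, bytes) }
    val logFuture = Future { log.write(bytes) }
    Await.result(storeFuture.zip(logFuture).map(_._2), timeout)
  }

  def main(args: Array[String]): Unit = {
    val store = new MemoryStore
    val log = new MemoryLog
    println(storeBlock("block-0", Array[Byte](1, 2, 3), store, log))
  }
}
```

Zipping the futures means the caller only gets a handle back when both copies exist, which is the property the write-ahead log relies on for recovery.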
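Back to the pushAndReportBlock method. So far we have covered its push part; now we turn to its report part.
It wraps the result in a blockInfo object and sends an AddBlock message to the ReceiverTracker. The ReceiverTracker handles that message as follows: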
// Message handling in the ReceiverTracker's endpoint
case AddBlock(receivedBlockInfo) =>
  context.reply(addBlock(receivedBlockInfo))

// ReceiverTracker: delegates to receivedBlockTracker
private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  receivedBlockTracker.addBlock(receivedBlockInfo)
}

// addBlock of receivedBlockTracker: logs the event, then queues the block info
def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = synchronized {
  try {
    writeToLog(BlockAdditionEvent(receivedBlockInfo))
    getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
    logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
      s"block ${receivedBlockInfo.blockStoreResult.blockId}")
    true
  } catch {
    case e: Exception =>
      logError(s"Error adding block $receivedBlockInfo", e)
      false
  }
}
The most important line in this code is getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo.
It appends the received blockInfo to a queue, and that queue is kept in a map keyed by streamId.
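A per-stream queue map of this kind can be sketched in a few lines. The names below (`BlockQueueSketch`, `BlockInfo`, `Tracker`, `queueFor`, `pending`) are hypothetical, and the sketch assumes the queue is created lazily on first use; it omits the write-to-log step of the original.

```scala
import scala.collection.mutable

object BlockQueueSketch {
  // Hypothetical, trimmed-down block metadata.
  final case class BlockInfo(streamId: Int, blockId: String)

  final class Tracker {
    // One queue of block metadata per stream id.
    private val queues = mutable.HashMap.empty[Int, mutable.Queue[BlockInfo]]

    // Returns the queue for a stream, creating it on first use.
    private def queueFor(streamId: Int): mutable.Queue[BlockInfo] =
      queues.getOrElseUpdate(streamId, mutable.Queue.empty[BlockInfo])

    // Appends the block info to its stream's queue; synchronized because
    // several receivers may report blocks concurrently.
    def addBlock(info: BlockInfo): Boolean = synchronized {
      queueFor(info.streamId) += info
      true
    }

    // Snapshot of the blocks queued for a stream, in arrival order.
    def pending(streamId: Int): List[BlockInfo] = synchronized {
      queues.get(streamId).map(_.toList).getOrElse(Nil)
    }
  }

  def main(args: Array[String]): Unit = {
    val tracker = new Tracker
    tracker.addBlock(BlockInfo(0, "input-0-1"))
    tracker.addBlock(BlockInfo(1, "input-1-1"))
    tracker.addBlock(BlockInfo(0, "input-0-2"))
    println(tracker.pending(0).map(_.blockId))
  }
}
```

Keying by stream id keeps the blocks of each input stream separate and in arrival order, so they can later be handed out per stream.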
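To sum up, after a Receiver receives data it does two main things:
1. It stores the received data in the BlockManager.
2. It sends the blockInfo of the received data to the ReceiverTracker as the argument of an AddBlock message. The ReceiverTracker keeps it in a data structure of the form Map[streamId, BlockQueue[blockInfo]].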