Building on the socketTextStream application example shown above, let's look at how Spark Streaming actually works from the perspective of its source code.
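For reference, a minimal application of this kind might look like the following sketch (the app name, host, port, and batch interval are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SocketWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))             // batch interval
    val lines = ssc.socketTextStream("localhost", 9999)          // input DStream backed by a SocketReceiver
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _) // transformations
    counts.print()                                               // output operation that triggers the batch jobs
    ssc.start()                                                  // everything analyzed below happens from here
    ssc.awaitTermination()
  }
}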
StreamingContext is the main entry point of a Spark Streaming program; its constructor is as follows:
class StreamingContext private[streaming] (
    _sc: SparkContext,
    _cp: Checkpoint,
    _batchDur: Duration
  ) extends Logging {
  // The DStreamGraph holds the DStreams, their dependencies, and the operators applied between them
  private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      _cp.graph.setContext(this)
      _cp.graph.restoreCheckpointData()
      _cp.graph
    } else {
      require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(_batchDur)
      newGraph
    }
  }
  private val nextInputStreamId = new AtomicInteger(0)
  // Checkpoint configuration
  private[streaming] var checkpointDir: String = {
    if (isCheckpointPresent) {
      sc.setCheckpointDir(_cp.checkpointDir)
      _cp.checkpointDir
    } else {
      null
    }
  }
  // JobScheduler handles job scheduling; the JobGenerator generates a job every batch interval,
  // which is then scheduled and submitted through the JobScheduler
  private[streaming] val scheduler = new JobScheduler(this)
  private[streaming] val uiTab: Option[StreamingTab] =
    if (conf.getBoolean("spark.ui.enabled", true)) {
      Some(new StreamingTab(this))
    } else {
      None
    }
}
Its constructor takes three main parameters: a SparkContext (_sc), a Checkpoint (_cp, used when recovering from a checkpoint), and the batch interval (_batchDur: Duration).
The core components it creates for processing the application are the DStreamGraph, which records the DStreams, their dependencies, and the operators applied between them, and the JobScheduler, which schedules and submits the jobs generated for each batch.
When the StreamingContext is created, it initializes these associated components (the DStreamGraph, the JobScheduler, and so on). After that, methods such as StreamingContext.socketTextStream are called to create the input DStreams, a series of transformations is applied to them, and finally an output operation triggers the generation and execution of the individual batch jobs. Once this setup is complete, StreamingContext.start() is called to launch the Spark Streaming application. The start() method creates and starts two other important components, the ReceiverTracker and the JobGenerator, and, most importantly, it launches the Receivers for the application's input DStreams, which run inside Executors on the cluster's worker nodes. The basic flow from data reception to task division and scheduling is shown in the figure below:
Next, let's look at the StreamingContext.start() method that triggers the execution of the streaming job:
// StreamingContext#start(): the entry point for starting a Streaming application
def start(): Unit = synchronized {
  // ......
  ThreadUtils.runInNewThread("streaming-start") {
    sparkContext.setCallSite(startSite.get)
    sparkContext.clearJobGroup()
    sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
    savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
    // Invoke JobScheduler.start()
    scheduler.start()
  }
  state = StreamingContextState.ACTIVE
}
As we can see, StreamingContext.start() mainly calls scheduler.start() to start the JobScheduler. In JobScheduler.start(), an EventLoop[JobSchedulerEvent] is created and started (eventLoop.start()) to process JobSchedulerEvent events. It then obtains all InputDStreams from the DStreamGraph together with their rateControllers, adds the rateControllers to the listenerBus, and finally creates and starts the two most important components: the ReceiverTracker and the JobGenerator.
// JobScheduler#start()
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started
  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()
  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)
  listenerBus.start()
  receiverTracker = new ReceiverTracker(ssc) // Create the ReceiverTracker, which manages data reception
  inputInfoTracker = new InputInfoTracker(ssc)
  val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
    case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
    case _ => null
  }
  executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
    executorAllocClient,
    receiverTracker,
    ssc.conf,
    ssc.graph.batchDuration.milliseconds,
    clock)
  executorAllocationManager.foreach(ssc.addStreamingListener)
  receiverTracker.start() // Start the Receivers associated with the input DStreams
  jobGenerator.start()    // Start the JobGenerator
  executorAllocationManager.foreach(_.start())
  logInfo("Started JobScheduler")
}
ReceiverTracker is the component responsible for launching and managing all the receivers: it creates the RPC endpoint used to communicate with them, decides (via a scheduling policy) on which executors they run, launches them as long-running Spark jobs, and tracks the metadata of the blocks they report back.
Continuing from the ReceiverTracker.start() call above, the analysis proceeds as follows:
/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  // ......
  if (!receiverInputStreams.isEmpty) {
    // Create the RPC endpoint
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    // Launch the receivers
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
// Launch all the receivers
private def launchReceivers(): Unit = {
  // Obtain the receivers from the input streams and assign each receiver its id
  val receivers = receiverInputStreams.map { nis =>
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  }
  runDummySparkJob()
  logInfo("Starting " + receivers.length + " receivers")
  // Send the StartAllReceivers message
  endpoint.send(StartAllReceivers(receivers))
}
ReceiverTracker first creates the ReceiverTrackerEndpoint, the RPC endpoint used to communicate with the receivers and to listen for and reply to receiver-related RPC messages. It then calls launchReceivers() to launch all receivers: launchReceivers() collects every Receiver from receiverInputStreams and sends a StartAllReceivers(receivers) message to the endpoint. The endpoint's receive method handles this message: upon receiving StartAllReceivers it uses a scheduling policy to decide where each receiver should be launched and then calls startReceiver():
override def receive: PartialFunction[Any, Unit] = {
  // Local messages
  case StartAllReceivers(receivers) =>
    // Use the scheduling policy to decide where each receiver should be launched
    val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
    for (receiver <- receivers) {
      val executors = scheduledLocations(receiver.streamId)
      updateReceiverScheduledExecutors(receiver.streamId, executors)
      receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
      startReceiver(receiver, executors) // Launch this receiver
    }
}
The basic receiver scheduling policy tries to spread the receivers as evenly as possible across the available executors, and during scheduling each receiver is assigned the locations where it should be launched; the overall idea is sketched below.
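The actual logic lives in ReceiverSchedulingPolicy.scheduleReceivers(); as a rough illustration of the even-distribution idea only (not the real implementation, which also honors each receiver's preferredLocation), it behaves like a round-robin assignment:
// Simplified illustration only: receivers are spread round-robin over the
// executors, so each executor ends up with roughly the same number of receivers.
def roundRobinSchedule(
    receiverIds: Seq[Int],
    executors: Seq[String]): Map[Int, Seq[String]] = {
  receiverIds.zipWithIndex.map { case (id, i) =>
    id -> Seq(executors(i % executors.length))
  }.toMap
}

// e.g. 4 receivers over 2 executors => 2 receivers per executor:
// roundRobinSchedule(Seq(0, 1, 2, 3), Seq("exec-1@host1", "exec-2@host2"))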
Next, startReceiver(receiver, executors) is used to launch the receiver on the chosen executors:
/**
 * Start a receiver along with its scheduled executors
 */
private def startReceiver(
    receiver: Receiver[_],
    scheduledLocations: Seq[TaskLocation]): Unit = {
  // ......
  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)
  // Function to start the receiver on the worker node.
  // It is submitted together with the RDD below and creates a ReceiverSupervisorImpl
  // that handles the received data.
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }
  // Create the RDD using the scheduledLocations to run the receiver in a Spark job
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledLocations.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1)
    } else {
      val preferredLocations = scheduledLocations.map(_.toString).distinct
      ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
    }
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))
  // Submit the job that starts the receiver to the SparkContext,
  // distributing the receiver to the chosen executor
  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  // ......
}
In startReceiver(), the Receiver is wrapped into an RDD, with scheduledLocations used as the RDD's preferred locations (locationPrefs). A Spark Core job is then submitted through the SparkContext whose task function is startReceiverFunc, i.e. the code that will run on the executor. Inside that function a ReceiverSupervisorImpl is created (ReceiverSupervisor is the executor-side manager of a Receiver, responsible for supervising the Receiver running in the executor) and its start() method is called, which invokes the receiver's onStart() and returns immediately. A receiver's onStart() typically starts a thread or thread pool to receive data; for example, KafkaReceiver creates a thread pool in which the topics' data is consumed. After supervisor.start() returns, supervisor.awaitTermination() blocks the thread so that the task never exits and data can be received continuously.
Next, let's look at ReceiverSupervisor.start(), which is executed on the executor to start the receiver:
// ReceiverSupervisor
def start() {
  onStart()
  startReceiver()
}
// The onStart() method overridden in ReceiverSupervisorImpl
override protected def onStart() {
  registeredBlockGenerators.asScala.foreach { _.start() }
}
// ReceiverSupervisor method that starts the Receiver
def startReceiver(): Unit = synchronized {
  try {
    // onReceiverStart() registers this receiver with the ReceiverTracker,
    // telling the tracker that it has started successfully
    if (onReceiverStart()) {
      logInfo(s"Starting receiver $streamId")
      receiverState = Started
      receiver.onStart() // Start the receiver and begin receiving data
      logInfo(s"Called receiver $streamId onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
In ReceiverSupervisor.start(), the overridden onStart() method is called first, which starts all registered BlockGenerators; these are used to chop the stream of data received by the Receiver into blocks. startReceiver() then calls onReceiverStart() to register with the ReceiverTracker and tell it the receiver has started successfully, and after that calls receiver.onStart() to start the receiver and begin receiving the streaming data. (As the official documentation explains, a user who implements a custom receiver must extend the Receiver interface and start a separate thread in onStart() to receive data; that onStart() is exactly what is invoked here, so data is received asynchronously on that thread.) The built-in SocketReceiver is a good example: its onStart() creates the actual Socket used for communication and then keeps receiving the data sent to it:
def onStart() {
  try {
    socket = new Socket(host, port)
  } catch {
    // ......
  }
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}
def receive() {
  try {
    // Receive data from the socket stream
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next()) // Call store() to store the received records
    }
  } catch {
    // ......
  }
}
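Putting the pieces together, a user-defined receiver follows the same pattern; below is a minimal sketch (CustomLineReceiver and fetchLines() are made up for illustration):
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A minimal custom receiver sketch: start a daemon thread in onStart() and
// hand every received record to store(); the class and data source are placeholders.
class CustomLineReceiver(storageLevel: StorageLevel) extends Receiver[String](storageLevel) {
  override def onStart(): Unit = {
    new Thread("Custom Receiver") {
      setDaemon(true)
      override def run(): Unit = {
        while (!isStopped()) {
          fetchLines().foreach(line => store(line)) // store() hands each record to the ReceiverSupervisor
        }
      }
    }.start()
  }

  override def onStop(): Unit = { /* release whatever onStart() opened */ }

  // Placeholder for reading from some external system
  private def fetchLines(): Seq[String] = Seq.empty
}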
As the official documentation also points out, once a receiver has received data it must explicitly call store() to have the data stored. Next, let's look at the store() method in detail, together with the overridden onStart() method of ReceiverSupervisorImpl mentioned above (registeredBlockGenerators.asScala.foreach { _.start() }).
BlockGenerator is an important class on the receiver side: it writes the received streaming data into a buffer and periodically wraps the buffer into blocks, which are then stored and reported to the driver. Its main fields are as follows:
// Listens for block-related events: onAddData, onGenerateBlock, onPushBlock
listener: BlockGeneratorListener
// ArrayBuffer that temporarily holds the received records
@volatile private var currentBuffer = new ArrayBuffer[Any]
// Queue that buffers the generated Block objects
private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
// Timer that periodically wraps the data in currentBuffer into a Block and pushes it into blocksForPushing
private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
// Thread that takes Blocks out of blocksForPushing, stores them, and reports them to the ReceiverTracker
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }
The key timer blockIntervalTimer and the thread blockPushingThread are both started from ReceiverSupervisorImpl.onStart(), which starts all registered BlockGenerators. The BlockGenerator.start() method is as follows:
/** Start block generating and pushing threads. */
def start(): Unit = synchronized {
  if (state == Initialized) {
    state = Active
    // Start the block-generation timer, which wraps the data buffered in currentBuffer into Blocks
    blockIntervalTimer.start()
    // Start the block-pushing thread, which drains blocksForPushing and pushes each Block for storage and reporting
    blockPushingThread.start()
    logInfo("Started BlockGenerator")
  } else {
    throw new SparkException(
      s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
  }
}
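The interval used by blockIntervalTimer is configurable; a minimal sketch, assuming the standard spark.streaming.blockInterval key (200ms by default):
import org.apache.spark.SparkConf

// Each sealed Block later becomes one partition of the batch's RDD, so
// batchInterval / blockInterval roughly determines the number of tasks per batch.
val conf = new SparkConf()
  .setAppName("block-interval-example")              // placeholder app name
  .set("spark.streaming.blockInterval", "200ms")     // how often currentBuffer is sealed into a Block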
Let's first look at how the blockIntervalTimer works: its callback updateCurrentBuffer wraps all the data currently in currentBuffer into a single Block and puts that Block into the blocksForPushing queue:
/** Change the buffer to which single records are added to. */
// At each block interval, wrap all the data buffered in currentBuffer into a Block
// and push the Block into blocksForPushing
private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    synchronized {
      if (currentBuffer.nonEmpty) {
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        listener.onGenerateBlock(blockId) // A block has been generated
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }
    if (newBlock != null) {
      blocksForPushing.put(newBlock) // put is blocking when queue is full
    }
  } catch {
    // ......
  }
}
Then the blockPushingThread thread runs keepPushingBlocks(), which takes each block from the blocksForPushing queue and notifies the listener via listener.onPushBlock(); the listener in turn calls ReceiverSupervisorImpl#pushArrayBuffer() to hand the block to the BlockManager so that it is stored in memory:
/** Keep pushing blocks to the BlockManager. */
private def keepPushingBlocks() {
  try {
    while (!blocksForPushing.isEmpty) {
      val block = blocksForPushing.take()
      logDebug(s"Pushing block $block")
      // Call this class's pushBlock method
      pushBlock(block)
      logInfo("Blocks left to push " + blocksForPushing.size())
    }
  } catch {
    // ......
  }
}
// Push a block
private def pushBlock(block: Block) {
  listener.onPushBlock(block.id, block.buffer)
  logInfo("Pushed block " + block.id)
}
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }
  def onGenerateBlock(blockId: StreamBlockId): Unit = { }
  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }
  // When a block is pushed, call ReceiverSupervisorImpl.pushArrayBuffer()
  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}
// Store the received ArrayBuffer in Spark's memory as a data block
def pushArrayBuffer(
    arrayBuffer: ArrayBuffer[_],
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  // Delegate to pushAndReportBlock()
  pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
}
// Store the block data and then report it to the driver
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  // This is where the data is actually stored
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  // Report the storage result to the driver
  if (!trackerEndpoint.askSync[Boolean](AddBlock(blockInfo))) {
    throw new SparkException("Failed to add block to receiver tracker.")
  }
  logDebug(s"Reported block $blockId")
}
The actual storage of the data block into Spark's memory and the report to the driver are both performed in ReceiverSupervisorImpl#pushAndReportBlock() (reached via pushArrayBuffer() above). Let's look at block storage and the report to the driver in turn.
Block storage is delegated to the receivedBlockHandler. When the WAL is enabled (spark.streaming.receiver.writeAheadLog.enable is true), this is a WriteAheadLogBasedBlockHandler (with the WAL enabled, data can be recovered from the log after the application crashes); otherwise it is a BlockManagerBasedBlockHandler. Both handlers ultimately use the blockManager to store the block in memory or on disk.
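Enabling the receiver WAL is just a matter of setting the flag above together with a checkpoint directory (otherwise the WriteAheadLogBasedBlockHandler cannot be created); a minimal sketch with placeholder names and paths:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-example")                                      // placeholder
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // turn on the receiver WAL
val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")                // placeholder checkpoint directory
Which handler actually gets used is decided when the ReceiverSupervisorImpl is constructed: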
private val receivedBlockHandler: ReceivedBlockHandler = {
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
        "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
        "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, env.serializerManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
// Part of the storeBlock method
case ArrayBufferBlock(arrayBuffer) =>
  numRecords = Some(arrayBuffer.size.toLong)
  blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel, tellMaster = true)
case IteratorBlock(iterator) =>
  val countIterator = new CountingIterator(iterator)
  val putResult = blockManager.putIterator(blockId, countIterator, storageLevel, tellMaster = true)
  numRecords = countIterator.count
  putResult
case ByteBufferBlock(byteBuffer) =>
  blockManager.putBytes(blockId, new ChunkedByteBuffer(byteBuffer.duplicate()), storageLevel, tellMaster = true)
After storing the block, a ReceivedBlockInfo instance is created with the block's metadata, including the streamId (one InputDStream corresponds to one Receiver, and one Receiver corresponds to one streamId), the number of records in the block, the store result, and so on. This receivedBlockInfo is then sent to the ReceiverTracker over RPC in an AddBlock message. When the ReceiverTracker receives AddBlock(blockInfo), it calls addBlock(receivedBlockInfo) to handle it:
// ReceiverTracker#AddBlock(blockInfo)
case AddBlock(receivedBlockInfo) =>
  if (WriteAheadLogUtils.isBatchingEnabled(ssc.conf, isDriver = true)) {
    walBatchingThreadPool.execute(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        if (active) {
          context.reply(addBlock(receivedBlockInfo))
        } else {
          throw new IllegalStateException("ReceiverTracker RpcEndpoint shut down.")
        }
      }
    })
  } else {
    context.reply(addBlock(receivedBlockInfo))
  }
/** Add new blocks for the given stream */
private def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  receivedBlockTracker.addBlock(receivedBlockInfo)
}
Here, ReceivedBlockTracker.addBlock finally appends the block's metadata to a ReceivedBlockQueue, kept in the streamIdToUnallocatedBlockQueues map, where the key is the streamId and the value is the queue of blocks for that stream:
/** Add received block. This event will get written to the write ahead log (if enabled). */
def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  try {
    val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
    if (writeResult) {
      synchronized {
        getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
      }
    }
    writeResult
  } catch {
    // ......
  }
}
/** Get the queue of received blocks belonging to a particular stream */
private def getReceivedBlockQueue(streamId: Int): ReceivedBlockQueue = {
  streamIdToUnallocatedBlockQueues.getOrElseUpdate(streamId, new ReceivedBlockQueue)
}
To summarize, the overall flow of receiving and processing the streaming data is shown in the diagram below:
Now let's come back to the store() method defined in the Receiver interface. As we have seen, once a receiver has received data it must explicitly call store() to have it stored; some of the store() overloads are shown below:
def store(dataItem: T) {
  supervisor.pushSingle(dataItem)
}
def store(dataBuffer: ArrayBuffer[T]) {
  supervisor.pushArrayBuffer(dataBuffer, None, None)
}
def store(dataIterator: Iterator[T]) {
  supervisor.pushIterator(dataIterator, None, None)
}
def store(bytes: ByteBuffer) {
  supervisor.pushBytes(bytes, None, None)
}
// ......
As the store() overloads show, they simply call supervisor.pushSingle, supervisor.pushArrayBuffer, and so on to hand the data over. Receiver#store() comes in several forms, and ReceiverSupervisor has matching pushSingle, pushArrayBuffer, pushIterator, and pushBytes methods:
For a single record, pushSingle adds the record to the BlockGenerator's currentBuffer so that many individually stored records can be aggregated into one data block; the BlockGenerator then periodically wraps currentBuffer into a Block and goes through pushBlock, pushArrayBuffer, and pushAndReportBlock to store the block and report it to the driver. The other overloads, which take an ArrayBuffer, an Iterator, or a ByteBuffer, also end up calling ReceiverSupervisorImpl#pushAndReportBlock() to store the data block and then report it to the ReceiverTracker on the driver.
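For the single-record path this is easy to see in the source; simplified (rate limiting and error handling omitted, so treat this as a sketch rather than the exact code), it looks roughly like this:
// ReceiverSupervisorImpl (simplified): a single record is just handed to the
// default BlockGenerator and appended to its currentBuffer.
def pushSingle(data: Any): Unit = {
  defaultBlockGenerator.addData(data)
}

// BlockGenerator (simplified): append the record to currentBuffer; the
// blockIntervalTimer shown earlier will later seal this buffer into a Block.
def addData(data: Any): Unit = synchronized {
  if (state == Active) {
    waitToPush()          // rate-limiting hook
    currentBuffer += data
  } else {
    throw new SparkException("Cannot add data as BlockGenerator has not been started or has been stopped")
  }
}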