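This blog series distills cases from real production environments and pairs them with Spark source-code walkthroughs and practical guidance; please keep following the series. Copyright notice: this series on Spark source-code analysis and production practice belongs to the author (Qin Kaixin). Reproduction is prohibited; you are welcome to study it.
- Spark in Production: Spark's built-in RPC communication mechanism and the RpcEnv infrastructure
- Spark in Production: Flow analysis of the Spark event listener bus
- Spark in Production: The underlying architecture of the Spark storage system
- Spark in Production: Execution flow of Spark's multiple underlying MessageLoop threads
- Spark in Production: Stage-division algorithm and optimal task-scheduling details in Spark's two-level scheduling system
- Spark in Production: Spark delay scheduling of tasks and the scheduling Pool architecture
- Spark in Production: AppendOnlyMap, the task-level cache, aggregation, and sorting structure, in detail
- Spark in Production: Design of the ExternalSorter external sorter in the Spark Shuffle process
- Spark in Production: Design of the ShuffleExternalSorter external sorter in the Spark Shuffle process
- Spark in Production: Design of SortShuffleWriter, the in-memory buffer of Spark ShuffleManager
- Spark in Production: Design of UnsafeShuffleWriter, the in-memory buffer of Spark ShuffleManager
- Spark in Production: Design of BypassMergeSortShuffleWriter, the in-memory buffer of Spark ShuffleManager
- Spark in Production: Internals of BlockStoreShuffleReader, a core Spark Shuffle component
- Spark in Production: Internals of SortShuffleManager, the Spark Shuffle manager
- Spark in Production: StreamingContext startup flow and the DStream template source code
- Spark in Production: ReceiverTracker startup and the receiver RDD job submission mechanism, a source-code analysis
- Spark in Production: Spark Streaming's timed conversion of the data stream from batch to Block, a source-code analysis
- Spark in Production: How Spark Streaming's JobGenerator turns data Blocks into RDDs, a source-code analysis
- Spark in Production: Iteration of the Spark Streaming Graph processing chain, a source-code analysis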
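1 What Is ReceiverTracker?
- The Driver needs a tracker, ReceiverTracker, which continuously supervises the launching of Receivers on Executors (for example, it ships a Receiver RDD to an Executor; note that the Receiver is wrapped in an RDD and handed to the Executor) and also manages the metadata of the data Blocks waiting to be assigned to Jobs.
- ReceiverTracker is only a global manager: it is responsible for Block metadata, while the data itself stays on the Executors.
- What actually runs on the Executor is the Receiver. A Receiver runs continuously: it turns the discrete incoming records into batches, then into Blocks, which are finally placed under the management of the storage system, BlockManager. The metadata of each stored Block is then reported to the ReceiverTracker on the Driver.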
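The flow described above, records grouped into Blocks on the Executor with only the Block metadata reported to the Driver, can be modelled in a few lines. This is a simplified sketch and not Spark code: `BlockMeta`, `MiniTracker`, and `MiniExecutorStore` are hypothetical names, the block id format is made up for illustration, and a fixed block size stands in for the timer-driven batching that Spark actually uses.

```scala
import scala.collection.mutable

object BlockMetadataSketch {
  // Hypothetical metadata record: what the driver-side tracker learns about a stored block.
  final case class BlockMeta(streamId: Int, blockId: String, numRecords: Int)

  // Driver side: keeps metadata only, never the records themselves.
  final class MiniTracker {
    private val unallocated = mutable.Map.empty[Int, mutable.Queue[BlockMeta]]

    def addBlock(meta: BlockMeta): Unit =
      unallocated.getOrElseUpdate(meta.streamId, mutable.Queue.empty[BlockMeta]) += meta

    // Hand all pending blocks of a stream to the next job and forget them.
    def allocateBlocks(streamId: Int): Seq[BlockMeta] =
      unallocated.remove(streamId).map(_.toList).getOrElse(Nil)
  }

  // Executor side: groups records into blocks, stores them locally, reports metadata.
  final class MiniExecutorStore(streamId: Int, blockSize: Int, tracker: MiniTracker) {
    private val stored = mutable.Map.empty[String, Vector[String]]
    private var buffer = Vector.empty[String]
    private var nextBlock = 0

    def store(record: String): Unit = {
      buffer :+= record
      if (buffer.size == blockSize) flush()
    }

    // Seal the current buffer as a block; a no-op when the buffer is empty.
    def flush(): Unit = if (buffer.nonEmpty) {
      val blockId = s"input-$streamId-$nextBlock"
      nextBlock += 1
      stored(blockId) = buffer
      tracker.addBlock(BlockMeta(streamId, blockId, buffer.size))
      buffer = Vector.empty
    }

    def read(blockId: String): Option[Vector[String]] = stored.get(blockId)
  }
}
```

The point of the split is that `MiniTracker` never sees a record: a job asks it which blocks to process and then reads the data from wherever it was stored.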
1 What Is a Receiver?
    /**
     * Abstract class of a receiver that can be run on worker nodes to receive external data. A
     * custom receiver can be defined by defining the functions `onStart()` and `onStop()`. `onStart()`
     * should define the setup steps necessary to start receiving data,
     * and `onStop()` should define the cleanup steps necessary to stop receiving data.
     * Exceptions while receiving can be handled either by restarting the receiver with `restart(...)`
     * or stopped completely by `stop(...)`.
     */
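1.1 The Receiver template
- _supervisor: the ReceiverSupervisor. Once started it does the heavy lifting, chiefly managing data storage on the Executor side (analyzed in detail in the next installment).
- store(): as the name suggests, stores the received stream data in batches.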
    @DeveloperApi
    abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {

      /**
       * This method is called by the system when the receiver is started. This function
       * must initialize all resources (threads, buffers, etc.) necessary for receiving data.
       * This function must be non-blocking, so receiving the data must occur on a different
       * thread. Received data can be stored with Spark by calling `store(data)`.
       *
       * If there are errors in threads started here, then following options can be done
       * (i) `reportError(...)` can be called to report the error to the driver.
       * The receiving of data will continue uninterrupted.
       * (ii) `stop(...)` can be called to stop receiving data. This will call `onStop()` to
       * clear up all resources allocated (threads, buffers, etc.) during `onStart()`.
       * (iii) `restart(...)` can be called to restart the receiver. This will call `onStop()`
       * immediately, and then `onStart()` after a delay.
       */
      def onStart(): Unit

      /**
       * This method is called by the system when the receiver is stopped. All resources
       * (threads, buffers, etc.) set up in `onStart()` must be cleaned up in this method.
       */
      def onStop(): Unit

      /** Override this to specify a preferred location (hostname). */
      def preferredLocation: Option[String] = None

      /**
       * Store a single item of received data to Spark's memory.
       * These single items will be aggregated together into data blocks before
       * being pushed into Spark's memory.
       */
      def store(dataItem: T) {
        supervisor.pushSingle(dataItem)
      }

      /** Store an ArrayBuffer of received data as a data block into Spark's memory. */
      def store(dataBuffer: ArrayBuffer[T]) {
        supervisor.pushArrayBuffer(dataBuffer, None, None)
      }

      /**
       * Store an ArrayBuffer of received data as a data block into Spark's memory.
       * The metadata will be associated with this block of data
       * for being used in the corresponding InputDStream.
       */
      def store(dataBuffer: ArrayBuffer[T], metadata: Any) {
        supervisor.pushArrayBuffer(dataBuffer, Some(metadata), None)
      }

      /** Store an iterator of received data as a data block into Spark's memory. */
      def store(dataIterator: Iterator[T]) {
        supervisor.pushIterator(dataIterator, None, None)
      }

      /**
       * Store an iterator of received data as a data block into Spark's memory.
       * The metadata will be associated with this block of data
       * for being used in the corresponding InputDStream.
       */
      def store(dataIterator: java.util.Iterator[T], metadata: Any) {
        supervisor.pushIterator(dataIterator.asScala, Some(metadata), None)
      }

      /** Store an iterator of received data as a data block into Spark's memory. */
      def store(dataIterator: java.util.Iterator[T]) {
        supervisor.pushIterator(dataIterator.asScala, None, None)
      }

      /**
       * Store an iterator of received data as a data block into Spark's memory.
       * The metadata will be associated with this block of data
       * for being used in the corresponding InputDStream.
       */
      def store(dataIterator: Iterator[T], metadata: Any) {
        supervisor.pushIterator(dataIterator, Some(metadata), None)
      }

      /**
       * Store the bytes of received data as a data block into Spark's memory. Note
       * that the data in the ByteBuffer must be serialized using the same serializer
       * that Spark is configured to use.
       */
      def store(bytes: ByteBuffer) {
        supervisor.pushBytes(bytes, None, None)
      }

      /**
       * Store the bytes of received data as a data block into Spark's memory.
       * The metadata will be associated with this block of data
       * for being used in the corresponding InputDStream.
       */
      def store(bytes: ByteBuffer, metadata: Any) {
        supervisor.pushBytes(bytes, Some(metadata), None)
      }

      /** Report exceptions in receiving data. */
      def reportError(message: String, throwable: Throwable) {
        supervisor.reportError(message, throwable)
      }

      /**
       * Restart the receiver. This method schedules the restart and returns
       * immediately. The stopping and subsequent starting of the receiver
       * (by calling `onStop()` and `onStart()`) is performed asynchronously
       * in a background thread. The delay between the stopping and the starting
       * is defined by the Spark configuration `spark.streaming.receiverRestartDelay`.
       * The `message` will be reported to the driver.
       */
      def restart(message: String) {
        supervisor.restartReceiver(message)
      }

      /**
       * Restart the receiver. This method schedules the restart and returns
       * immediately. The stopping and subsequent starting of the receiver
       * (by calling `onStop()` and `onStart()`) is performed asynchronously
       * in a background thread. The delay between the stopping and the starting
       * is defined by the Spark configuration `spark.streaming.receiverRestartDelay`.
       * The `message` and `exception` will be reported to the driver.
       */
      def restart(message: String, error: Throwable) {
        supervisor.restartReceiver(message, Some(error))
      }

      /**
       * Restart the receiver. This method schedules the restart and returns
       * immediately. The stopping and subsequent starting of the receiver
       * (by calling `onStop()` and `onStart()`) is performed asynchronously
       * in a background thread.
       */
      def restart(message: String, error: Throwable, millisecond: Int) {
        supervisor.restartReceiver(message, Some(error), millisecond)
      }

      /** Stop the receiver completely. */
      def stop(message: String) {
        supervisor.stop(message, None)
      }

      /** Stop the receiver completely due to an exception */
      def stop(message: String, error: Throwable) {
        supervisor.stop(message, Some(error))
      }

      /** Check if the receiver has started or not. */
      def isStarted(): Boolean = {
        supervisor.isReceiverStarted()
      }

      /**
       * Check if receiver has been marked for stopping. Use this to identify when
       * the receiving of data should be stopped.
       */
      def isStopped(): Boolean = {
        supervisor.isReceiverStopped()
      }

      /**
       * Get the unique identifier the receiver input stream that this
       * receiver is associated with.
       */
      def streamId: Int = id

      /*
       * =================
       * Private methods
       * =================
       */

      /** Identifier of the stream this receiver is associated with. */
      private var id: Int = -1

      /** Handler object that runs the receiver. This is instantiated lazily in the worker. */
      @transient private var _supervisor: ReceiverSupervisor = null

      /** Set the ID of the DStream that this receiver is associated with. */
      private[streaming] def setReceiverId(_id: Int) {
        id = _id
      }

      /** Attach Network Receiver executor to this receiver. */
      private[streaming] def attachSupervisor(exec: ReceiverSupervisor) {
        assert(_supervisor == null)
        _supervisor = exec
      }

      /** Get the attached supervisor. */
      private[streaming] def supervisor: ReceiverSupervisor = {
        assert(_supervisor != null,
          "A ReceiverSupervisor has not been attached to the receiver yet. Maybe you are starting " +
            "some computation in the receiver before the Receiver.onStart() has been called.")
        _supervisor
      }
    }
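The contract spelled out in the comments above (a non-blocking `onStart()` that starts its own thread, `store(...)` to hand data over, `onStop()` to clean up) can be illustrated without a cluster. The sketch below is a stand-alone model and not Spark's API: `MiniReceiver` and `CountingReceiver` are hypothetical stand-ins for `Receiver` and a custom subclass, and an in-memory buffer plays the role of the supervisor.

```scala
import scala.collection.mutable.ArrayBuffer

object ReceiverContractSketch {
  // Hypothetical stand-in for Receiver[T]: subclasses define onStart/onStop and call store.
  abstract class MiniReceiver[T] {
    private val sink = ArrayBuffer.empty[T]
    @volatile private var stopped = false

    def onStart(): Unit // must return quickly; the receiving happens on another thread
    def onStop(): Unit  // must release whatever onStart set up

    final def store(item: T): Unit = sink.synchronized { sink += item }
    final def isStopped: Boolean = stopped
    final def stop(): Unit = { stopped = true; onStop() }
    final def received: List[T] = sink.synchronized { sink.toList }
  }

  // Emits 0, 1, 2, ... up to limit from a background thread,
  // checking isStopped between items so that stop() takes effect promptly.
  final class CountingReceiver(limit: Int) extends MiniReceiver[Int] {
    private var worker: Option[Thread] = None

    def onStart(): Unit = {
      val t = new Thread(new Runnable {
        def run(): Unit = {
          var i = 0
          while (i < limit && !isStopped) { store(i); i += 1 }
        }
      }, "counting-receiver")
      t.setDaemon(true)
      t.start()
      worker = Some(t)
    }

    // Wait for the background thread to finish before returning.
    def onStop(): Unit = worker.foreach(_.join())
  }
}
```

Because `stop()` may land before the thread has emitted everything, the only guarantee is that `received` is a prefix of the sequence, which is the same reason a real receiver loop is expected to poll `isStopped()`.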
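1.2 Receiver inheritance hierarchy
2 ReceiverTracker in depth
2.1 ReceiverTracker startup flow
- Set up the RPC endpoint, so that the state of the whole Streaming application can be monitored.
- launchReceivers(): sends StartAllReceivers to its own endpoint. The endpoint then wraps each Receiver in an RDD, treats it as a Job, and has SparkContext submit that Job, so the Receiver finally starts on an Executor.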
    /** Start the endpoint and receiver execution thread. */
    def start(): Unit = synchronized {
      if (isTrackerStarted) {
        throw new SparkException("ReceiverTracker already started")
      }

      if (!receiverInputStreams.isEmpty) {
        endpoint = ssc.env.rpcEnv.setupEndpoint(
          "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
        if (!skipReceiverLaunch) launchReceivers()  // <= key step
        logInfo("ReceiverTracker started")
        trackerState = Started
      }
    }
2.2 ReceiverTracker.launchReceivers acts as a messenger
As the messenger, its one job is to send the StartAllReceivers(receivers) message.
    /**
     * Get the receivers from the ReceiverInputDStreams, distributes them to the
     * worker nodes as a parallel collection, and runs them.
     */
    private def launchReceivers(): Unit = {
      val receivers = receiverInputStreams.map { nis =>
        val rcvr = nis.getReceiver()
        rcvr.setReceiverId(nis.id)
        rcvr
      }  // <= key step

      runDummySparkJob()

      logInfo("Starting " + receivers.length + " receivers")
      endpoint.send(StartAllReceivers(receivers))  // <= key step
    }
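The messenger pattern, where a component posts a message to its own endpoint instead of doing the work inline, can be modelled with a plain mailbox. This is a minimal sketch: `MiniEndpoint`, `StartAll`, `Restart`, and `launch` are hypothetical names, and the mailbox is drained on the calling thread rather than by an RPC dispatcher thread as in Spark.

```scala
import scala.collection.mutable

object SelfMessageSketch {
  sealed trait Message
  final case class StartAll(receiverIds: Seq[Int]) extends Message
  final case class Restart(receiverId: Int) extends Message

  final class MiniEndpoint {
    private val mailbox = mutable.Queue.empty[Message]
    val log = mutable.ArrayBuffer.empty[String]

    // send only enqueues; the caller does not wait for the handler to run.
    def send(msg: Message): Unit = mailbox.enqueue(msg)

    // Handle queued messages one at a time, in arrival order.
    def drain(): Unit = while (mailbox.nonEmpty) receive(mailbox.dequeue())

    private def receive(msg: Message): Unit = msg match {
      case StartAll(ids) => ids.foreach(id => log += s"start $id")
      case Restart(id)   => log += s"restart $id"
    }
  }

  // Plays the role of launchReceivers: collect the ids, then post one message to self.
  def launch(endpoint: MiniEndpoint, receiverIds: Seq[Int]): Unit =
    endpoint.send(StartAll(receiverIds))
}
```

Routing the start request through the mailbox means starts and restarts are handled by the same single-threaded handler, so they cannot interleave.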
2.3 ReceiverTracker receives and handles its own messages
    /** RpcEndpoint to receive messages from the receivers. */
    private class ReceiverTrackerEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint {

      private val walBatchingThreadPool = ExecutionContext.fromExecutorService(
        ThreadUtils.newDaemonCachedThreadPool("wal-batching-thread-pool"))

      @volatile private var active: Boolean = true

      override def receive: PartialFunction[Any, Unit] = {
        // Local messages
        case StartAllReceivers(receivers) =>
          // <= key step: recommend the best locations, deciding which Executor each receiver starts on
          val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
          for (receiver <- receivers) {
            val executors = scheduledLocations(receiver.streamId)  // <= key step
            updateReceiverScheduledExecutors(receiver.streamId, executors)
            receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
            // <= key step: loop over all receivers and start each at its best location
            startReceiver(receiver, executors)
          }
        case RestartReceiver(receiver) =>
          // Old scheduled executors minus the ones that are not active any more
          val oldScheduledExecutors = getStoredScheduledExecutors(receiver.streamId)
          val scheduledLocations = if (oldScheduledExecutors.nonEmpty) {
              // Try global scheduling again
              oldScheduledExecutors
            } else {
              val oldReceiverInfo = receiverTrackingInfos(receiver.streamId)
              // Clear "scheduledLocations" to indicate we are going to do local scheduling
              val newReceiverInfo = oldReceiverInfo.copy(
                state = ReceiverState.INACTIVE, scheduledLocations = None)
              receiverTrackingInfos(receiver.streamId) = newReceiverInfo
              schedulingPolicy.rescheduleReceiver(
                receiver.streamId,
                receiver.preferredLocation,
                receiverTrackingInfos,
                getExecutors)
            }
          // Assume there is one receiver restarting at one time, so we don't need to update
          // receiverTrackingInfos
          startReceiver(receiver, scheduledLocations)
        case c: CleanupOldBlocks =>
          receiverTrackingInfos.values.flatMap(_.endpoint).foreach(_.send(c))
        case UpdateReceiverRateLimit(streamUID, newRate) =>
          for (info <- receiverTrackingInfos.get(streamUID); eP <- info.endpoint) {
            eP.send(UpdateRateLimit(newRate))
          }
        // Remote messages
        case ReportError(streamId, message, error) =>
          reportError(streamId, message, error)
      }
      // (remaining members of the endpoint are not shown in this excerpt)
    }
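To make the role of `scheduleReceivers` more concrete, here is a much simplified sketch of location-aware placement: a receiver with a preferred host goes to the least-loaded executor on that host, and everything else goes to the least-loaded executor overall. It is an illustration under stated assumptions and not Spark's actual `ReceiverSchedulingPolicy`; `Executor`, `ReceiverSpec`, and `schedule` are hypothetical names, and each receiver gets at most one candidate here.

```scala
object SchedulingSketch {
  final case class Executor(host: String, id: String)
  final case class ReceiverSpec(streamId: Int, preferredHost: Option[String])

  def schedule(
      receivers: Seq[ReceiverSpec],
      executors: Seq[Executor]): Map[Int, Option[Executor]] = {
    // Running count of receivers placed on each executor.
    var load = executors.map(_ -> 0).toMap

    // Receivers with a preferred host are placed first so they get first pick.
    val ordered = receivers.sortBy(_.preferredHost.isEmpty)

    ordered.map { r =>
      val onPreferredHost = executors.filter(e => r.preferredHost.contains(e.host))
      // Fall back to every executor when the preferred host has none.
      val candidates = if (onPreferredHost.nonEmpty) onPreferredHost else executors
      // Least-loaded executor wins; ties are broken by executor id.
      val chosen =
        if (candidates.isEmpty) None
        else Some(candidates.minBy(e => (load(e), e.id)))
      chosen.foreach(e => load = load.updated(e, load(e) + 1))
      r.streamId -> chosen
    }.toMap
  }
}
```

An empty result for a receiver (`None`) corresponds to the case in `startReceiver` where `scheduledLocations.isEmpty` and the RDD is created without location preferences.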
2.4 ReceiverTracker launches receivers onto Executors
- startReceiverFunc: the func of the receiverRDD, which runs on the Executor. Its body creates and starts a ReceiverSupervisorImpl, the component that actually carries out data reception. ReceiverSupervisorImpl wraps BlockGenerator internally:

      private val registeredBlockGenerators = new ConcurrentLinkedQueue[BlockGenerator]()

- The blockIntervalTimer and blockPushingThread inside BlockGenerator drive the whole batch-to-Block conversion of the real-time data stream.
- startReceiver is the core code path: from the receiverRDD to Job submission, ending in ReceiverSupervisorImpl.start() on the executor side.
    /**
     * Start a receiver along with its scheduled executors
     */
    private def startReceiver(
        receiver: Receiver[_],
        scheduledLocations: Seq[TaskLocation]): Unit = {
      def shouldStartReceiver: Boolean = {
        // It's okay to start when trackerState is Initialized or Started
        !(isTrackerStopping || isTrackerStopped)
      }

      val receiverId = receiver.streamId
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
        return
      }

      val checkpointDirOption = Option(ssc.checkpointDir)
      val serializableHadoopConf =
        new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

      // Function to start the receiver on the worker node
      val startReceiverFunc: Iterator[Receiver[_]] => Unit =
        (iterator: Iterator[Receiver[_]]) => {
          if (!iterator.hasNext) {
            throw new SparkException(
              "Could not start receiver as object not found.")
          }
          if (TaskContext.get().attemptNumber() == 0) {
            val receiver = iterator.next()
            assert(iterator.hasNext == false)
            val supervisor = new ReceiverSupervisorImpl(
              receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
            supervisor.start()
            supervisor.awaitTermination()
          } else {
            // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
          }
        }

      // Create the RDD using the scheduledLocations to run the receiver in a Spark job
      val receiverRDD: RDD[Receiver[_]] =
        if (scheduledLocations.isEmpty) {
          ssc.sc.makeRDD(Seq(receiver), 1)
        } else {
          val preferredLocations = scheduledLocations.map(_.toString).distinct
          ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
        }
      receiverRDD.setName(s"Receiver $receiverId")
      ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
      ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

      val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
        receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
      // We will keep restarting the receiver job until ReceiverTracker is stopped
      future.onComplete {
        case Success(_) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
        case Failure(e) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logError("Receiver has been stopped. Try to restart it.", e)
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
      }(ThreadUtils.sameThread)
      logInfo(s"Receiver ${receiver.streamId} started")
    }
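The `future.onComplete` block boils down to one rule: whether the receiver job succeeded or failed, restart it unless the tracker is stopping or stopped. The sketch below isolates that rule as a pure function so it is easy to reason about. `TrackerState`, `Action`, `RestartJob`, `FinishJob`, and `onJobComplete` are hypothetical names modelled on the listing above, not Spark's own types.

```scala
import scala.util.{Failure, Success, Try}

object RestartSketch {
  sealed trait TrackerState
  case object Initialized extends TrackerState
  case object Started extends TrackerState
  case object Stopping extends TrackerState
  case object Stopped extends TrackerState

  sealed trait Action
  // logAsError records whether the restart follows a failed job.
  final case class RestartJob(receiverId: Int, logAsError: Boolean) extends Action
  final case class FinishJob(receiverId: Int) extends Action

  // Mirrors shouldStartReceiver: starting is fine unless the tracker is stopping or stopped.
  def shouldStart(state: TrackerState): Boolean = state match {
    case Stopping | Stopped => false
    case _                  => true
  }

  // Decide what to do once the receiver job's future completes.
  def onJobComplete(state: TrackerState, receiverId: Int, result: Try[Unit]): Action =
    if (!shouldStart(state)) FinishJob(receiverId)
    else result match {
      case Success(_) => RestartJob(receiverId, logAsError = false)
      case Failure(_) => RestartJob(receiverId, logAsError = true)
    }
}
```

Note that a successful completion still leads to a restart: a receiver job is meant to run for the lifetime of the streaming application, so returning at all is treated as something to recover from.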
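- Architecture diagram of the receiver RDD inside ReceiverTracker
The diagram traces how ReceiverTracker submits the receiver RDD as a Job.
3 Summary
In the end, the ReceiverTracker on the Driver is responsible for starting receivers on different Executors, so that each receiver can continuously store incoming data in batches under BlockManager's management.
Qin Kaixin, Shenzhen, 2018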