When a StreamingContext is initialized, it creates two important components, the DStreamGraph and the JobScheduler, as shown below:
// Here one of the key components, the DStreamGraph, is initialized.
// It holds the whole chain of DStreams of the Spark Streaming application,
// i.e. their dependencies and the operators applied between them.
private[streaming] val graph: DStreamGraph = {
  if (isCheckpointPresent) {
    cp_.graph.setContext(this)
    cp_.graph.restoreCheckpointData()
    cp_.graph
  } else {
    require(batchDur_ != null, "Batch duration for StreamingContext cannot be null")
    val newGraph = new DStreamGraph()
    newGraph.setBatchDuration(batchDur_)
    newGraph
  }
}

// Initialize the JobScheduler, which is responsible for job scheduling:
// the jobs produced by the JobGenerator are scheduled and submitted through it.
// Under the hood it still relies on the Spark Core engine.
private[streaming] val scheduler = new JobScheduler(this)
After the StreamingContext has been initialized, let's follow the execution of a WordCount program. As analyzed earlier for Spark Core, a job is triggered by an output operation (i.e. an action). Take the simple print() operation as an example: it calls print(10), which prints the first 10 elements of each RDD, and internally it goes through foreachRDD, which creates a ForEachDStream and calls its register() method; for each batch the generateJob() method of that output DStream is eventually invoked, and that is where job submission is triggered.
However, the above only wires up job submission; it does not yet cover job generation or data reception by the receivers. Those are triggered by calling the StreamingContext's start() method. This is why a Spark Streaming program does nothing until start() is called, and likewise nothing runs without an output (action) operation, because no job would ever be submitted.
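To make this concrete, here is a minimal WordCount sketch against the public Spark Streaming API (the host, port and batch interval are placeholders): print() registers the output operation, and only ssc.start() actually kicks off data reception and job generation.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    // Batch interval of 5 seconds (an arbitrary choice for this example)
    val ssc = new StreamingContext(conf, Seconds(5))
    // "localhost" and 9999 are placeholders for the real data source
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    // Output operation: registers a ForEachDStream, but nothing runs yet
    wordCounts.print()
    // Only now are the JobScheduler, ReceiverTracker, JobGenerator, etc. started
    ssc.start()
    ssc.awaitTermination()
  }
}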
Let's now focus on the start() method:
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      // Lock to ensure that only one StreamingContext is active in the JVM
      StreamingContext.ACTIVATION_LOCK.synchronized {
        // Assert that no other StreamingContext is currently active
        StreamingContext.assertNoOtherContextIsActive()
        try {
          // Validate the initialized components and the checkpoint settings, if any
          validate()
          // Start the streaming application in a separate thread
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            // Call JobScheduler.start(), which in turn launches the receivers
            scheduler.start()
          }
          // Update the current state
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
    // Part of the code is omitted
    ...
  }
}
As the code above shows, the real work is delegated to JobScheduler.start(); let's look inside that method.
// StreamingContext.start() really delegates to JobScheduler.start()
def start(): Unit = synchronized {
  // If this JobScheduler has already been started (e.g. after a failure restart), just return
  if (eventLoop != null) return // scheduler has already been started

  logDebug("Starting JobScheduler")
  // Create an event loop for receiving messages
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  // Start receiving (local) messages
  eventLoop.start()

  // Register the rate controllers of the input DStreams.
  // This is quite important: whether data comes in through the Kafka direct
  // approach or through an ordinary receiver, a maximum ingestion rate can be
  // configured, i.e. the input can be throttled.
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start(ssc.sparkContext)
  // Create the ReceiverTracker, the component responsible for data reception
  receiverTracker = new ReceiverTracker(ssc)
  // Track metadata about the input DStreams so the streaming job can be monitored
  inputInfoTracker = new InputInfoTracker(ssc)
  // Start the ReceiverTracker, which launches the receivers of the input DStreams
  receiverTracker.start()
  // The JobGenerator was created together with the JobScheduler; start it here
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
Let's walk through this code. First an event loop is created to receive local messages. Then the rate controllers of the input DStreams are registered, which is where input rate limiting comes into play.
A quick word on receiver rate limiting: if cluster resources are limited and the data cannot be processed as fast as the receivers ingest it, data piles up on the receiver side. To keep the backlog under control, the ingestion rate can be capped with two parameters: spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition.
The former applies to ordinary receivers, the latter to Kafka. Since Spark 1.5, however, a backpressure mechanism is available (including for the Kafka direct approach), so there is no need to hard-code a receiver rate: Spark estimates a reasonable ingestion rate on its own and adjusts it dynamically. It is enabled simply by setting spark.streaming.backpressure.enabled to true.
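A minimal sketch of how these settings could be applied on a SparkConf (the numeric values are placeholders, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("RateLimitedApp")
  // Cap each ordinary receiver at 10,000 records per second (placeholder value)
  .set("spark.streaming.receiver.maxRate", "10000")
  // Cap each Kafka partition at 2,000 records per second for the direct approach (placeholder value)
  .set("spark.streaming.kafka.maxRatePerPartition", "2000")
  // Or let Spark estimate and adjust the ingestion rate automatically (Spark 1.5+)
  .set("spark.streaming.backpressure.enabled", "true")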
Continuing with the code above: two important components, the ReceiverTracker and the JobGenerator, are then created and started. Let's analyze ReceiverTracker.start() first.
Since ReceiverTracker.start() essentially just calls launchReceivers(), that is the method we look at:
private def launchReceivers(): Unit = {
  // Collect all the receivers
  val receivers = receiverInputStreams.map(nis => {
    // For every receiver input DStream defined in the application,
    // call getReceiver() to obtain its Receiver
    val rcvr = nis.getReceiver()
    // Assign the receiver its stream id
    rcvr.setReceiverId(nis.id)
    rcvr
  })
  runDummySparkJob()
  logInfo("Starting " + receivers.length + " receivers")
  // Send a StartAllReceivers message to the ReceiverTrackerEndpoint;
  // this is just local message passing
  endpoint.send(StartAllReceivers(receivers))
}
As the code shows, the receivers are started by sending a local StartAllReceivers message. Let's look at how that message is handled:
override def receive: PartialFunction[Any, Unit] = {
  // Local messages
  // Start all the receivers
  case StartAllReceivers(receivers) =>
    // Compute where each receiver should run, i.e. on which executor it will be launched
    val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
    for (receiver <- receivers) {
      // The executors this receiver has been scheduled to
      val executors = scheduledLocations(receiver.streamId)
      // Record them in the receiver's tracking info
      updateReceiverScheduledExecutors(receiver.streamId, executors)
      // Remember each receiver's preferred location
      receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
      // Start the receiver, passing in the executors it should be launched on
      startReceiver(receiver, executors)
    }
  // code omitted
}
The work here is done by startReceiver(); note its two parameters: the receiver to be started and the locations it should be started at (i.e. on which worker's executor).
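As the code above also shows, the scheduling policy records each receiver's preferredLocation. A custom receiver can hint at where it would like to run by overriding that method; a minimal sketch, where the host name "worker-1" is purely hypothetical:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class PinnedReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  // Hint for the receiver scheduling policy: prefer an executor on this host
  override def preferredLocation: Option[String] = Some("worker-1")
  override def onStart(): Unit = { /* start the receiving thread(s) here */ }
  override def onStop(): Unit = { /* clean up here */ }
}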
Let's now look at startReceiver() in detail:
private def startReceiver(
    receiver: Receiver[_],
    scheduledLocations: Seq[TaskLocation]): Unit = {
  // Check whether this receiver should (still) be started,
  // e.g. it may already have been started or the tracker may be shutting down
  val receiverId = receiver.streamId
  if (!shouldStartReceiver) {
    onReceiverJobFinish(receiverId)
    return
  }

  // Pick up the checkpoint directory, if one has been set
  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  /**
   * The core start-up logic of the receiver is defined here.
   * Note: this and everything that follows is only a definition; the function
   * is not executed on the driver. It runs on an executor later on.
   * In other words, receivers are never started on the driver, only on executors.
   */
  // The function that starts the receiver(s) contained in the iterator
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      if (TaskContext.get().attemptNumber() == 0) {
        // Get the receiver
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        // Wrap the receiver in a ReceiverSupervisorImpl and start it
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        // This calls start() defined in the parent class ReceiverSupervisor
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }

  // Wrap the receiver into a single-element RDD whose preferred locations are the
  // scheduled executors, so that the receiver task is launched on one of them
  val receiverRDD: RDD[Receiver[_]] =
    if (scheduledLocations.isEmpty) {
      ssc.sc.makeRDD(Seq(receiver), 1)
    } else {
      val preferredLocations = scheduledLocations.map(_.toString).distinct
      ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
    }
  receiverRDD.setName(s"Receiver $receiverId")
  ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
  ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

  // submitJob is what actually distributes the receiver start function
  // to an executor on a worker node for execution
  val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
    receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
  // We will keep restarting the receiver job until ReceiverTracker is stopped
  // Handle the completion status of the receiver job
  future.onComplete {
    case Success(_) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
    case Failure(e) =>
      if (!shouldStartReceiver) {
        onReceiverJobFinish(receiverId)
      } else {
        logError("Receiver has been stopped. Try to restart it.", e)
        logInfo(s"Restarting Receiver $receiverId")
        self.send(RestartReceiver(receiver))
      }
  }(submitJobThreadPool)
  logInfo(s"Receiver ${receiver.streamId} started")
}
The key part of the code above is the startReceiverFunc built for each receiver. Inside that start function, the receiver is wrapped in a ReceiverSupervisorImpl, whose start() method launches the receiver. The receiver is also wrapped into a receiverRDD, and, most importantly, the two are handed to SparkContext.submitJob, which ships the job to an executor on a worker node for execution.
Note that the receiver is actually started on a worker's executor, not on the driver. The driver only wraps the receivers and ships them to the executors, where they are started.
Let's now see how a receiver is started on each worker's executor; the first call is ReceiverSupervisorImpl.start().
ReceiverSupervisorImpl does not define start() itself; it is implemented in its parent class, so let's look at ReceiverSupervisor.start():
def start() {
  // Calls onStart(), which is overridden in ReceiverSupervisorImpl
  onStart()
  startReceiver()
}
start() only does two things: onStart(), shown below, starts the BlockGenerators that have been registered (more on those later); startReceiver() then starts the receiver itself.
override protected def onStart() {
  // Start the BlockGenerators. This is important: a BlockGenerator runs on the
  // worker's executor and takes care of buffering and storing the data after it
  // has been received, cooperating with the ReceiverTracker.
  // So on the executor, the BlockGenerators associated with a receiver are
  // started before the receiver itself.
  // Here all BlockGenerators that have already been registered are started.
  registeredBlockGenerators.foreach { _.start() }
}
The receiver itself is then started in startReceiver():
// This is where the receiver is actually started
def startReceiver(): Unit = synchronized {
  try {
    // First register with the ReceiverTracker by asking it whether the receiver may start
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      // Start the receiver
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
So the supervisor first sends a registration message to the ReceiverTracker asking to start the receiver. Once registration succeeds, the receiver is started by calling its onStart() method. We take the socket receiver as an example here; other receivers start in much the same way.
// Start the receiver
def onStart() {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}
// Establish a socket connection and receive data from it
def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    if (!isStopped()) {
      restart("Socket data stream had no more data")
    } else {
      logInfo("Stopped receiving")
    }
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case NonFatal(e) =>
      logWarning("Error receiving data", e)
      restart("Error receiving data", e)
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
As the code makes clear, the SocketReceiver running on a worker's executor simply opens a socket connection to the data source, receives the data, and stores it via store() (ultimately into the BlockManager on that executor); from there the data flows into the downstream operators.
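The same pattern applies when writing a custom receiver: spawn a receiving thread in onStart(), hand each record to Spark with store(), and call restart() on recoverable errors. A minimal sketch, where the data source is left abstract (the source parameter is hypothetical and must be serializable):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver(source: () => Iterator[String])
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Receive on a separate thread so onStart() returns immediately
    new Thread("My Receiver") {
      setDaemon(true)
      override def run(): Unit = receiveLoop()
    }.start()
  }

  override def onStop(): Unit = { /* nothing to clean up in this sketch */ }

  private def receiveLoop(): Unit = {
    try {
      val it = source()              // hypothetical data source
      while (!isStopped() && it.hasNext) {
        store(it.next())             // hand the record to the BlockGenerator / BlockManager
      }
      if (!isStopped()) restart("Source exhausted, restarting")
    } catch {
      case e: Exception => restart("Error receiving data", e)
    }
  }
}

Such a receiver would be plugged in with ssc.receiverStream(new MyReceiver(...)), after which its output is processed like any other input DStream.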
To sum up: we have analyzed the initialization of the StreamingContext and its start() method, during which four important components are created: JobScheduler, DStreamGraph, ReceiverTracker and JobGenerator. We looked in detail at how receivers are started: each receiver is wrapped into the start function startReceiverFunc and distributed to an executor on a worker node via SparkContext.submitJob. On the executor, ReceiverSupervisor.startReceiver() is called, which first registers with the ReceiverTracker and then calls the receiver's onStart() method to begin receiving data. We used the SocketReceiver as a simple example.