Blog: http://blog.csdn.net/yueqian_zhu/
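The previous two parts covered the construction of StreamingContext and the operations defined on it.
Calling the start method is what actually kicks off scheduling and execution. It first checks that the state is INITIALIZED, then calls JobScheduler's start method and sets the state to ACTIVE.
Let's look inside JobScheduler's start method: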

    def start(): Unit = synchronized {
      if (eventLoop != null) return // scheduler has already been started

      logDebug("Starting JobScheduler")
      eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
        override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

        override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
      }
      eventLoop.start()

      listenerBus.start(ssc.sparkContext)
      receiverTracker = new ReceiverTracker(ssc)
      inputInfoTracker = new InputInfoTracker(ssc)
      receiverTracker.start()
      jobGenerator.start()
      logInfo("Started JobScheduler")
    }

1. First, an eventLoop is constructed whose event type is JobSchedulerEvent (covering three events: JobStarted, JobCompleted and ErrorReported). Internally it has a thread that keeps taking events off a queue and handles each one as it arrives, which in practice means calling the onReceive/onError methods shown above. Once eventLoop.start is called, the internal thread actually starts running and waits for events to arrive.
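
To make the event loop idea concrete, here is a minimal sketch of a queue plus a single consumer thread. It is not Spark's EventLoop class: SimpleEventLoop, EventLoopDemo and the demo events are names I made up for illustration, and the error handling is deliberately simplified.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingDeque}
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified event loop (an assumption, not Spark's implementation).
abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingDeque[E]()
  private val stopped = new AtomicBoolean(false)

  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = queue.take() // blocks until an event is posted
          try onReceive(event)
          catch { case e: Exception => onError(e) }
        }
      } catch {
        case _: InterruptedException => // stop() interrupts the blocking take()
      }
    }
  }

  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event)
  def stop(): Unit = if (stopped.compareAndSet(false, true)) {
    thread.interrupt()
    thread.join()
  }

  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit
}

object EventLoopDemo {
  sealed trait DemoEvent
  case class JobStarted(id: Int) extends DemoEvent
  case class JobCompleted(id: Int) extends DemoEvent

  // Posts two events and returns them in the order the loop thread handled them.
  def run(): List[String] = {
    val handled = ArrayBuffer[String]()
    val done = new CountDownLatch(2)
    val loop = new SimpleEventLoop[DemoEvent]("demo-loop") {
      override protected def onReceive(event: DemoEvent): Unit = {
        handled.synchronized { handled += event.toString }
        done.countDown()
      }
      override protected def onError(e: Throwable): Unit = {
        handled.synchronized { handled += ("error: " + e.getMessage) }
        ()
      }
    }
    loop.start()
    loop.post(JobStarted(1))
    loop.post(JobCompleted(1))
    done.await() // wait until both events have been handled
    loop.stop()
    handled.synchronized { handled.toList }
  }
}
```

Because a single thread drains the queue, events are handled strictly in the order they were posted, which is why the scheduler can treat JobStarted and JobCompleted as an ordered stream.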
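2. Construct the ReceiverTracker
(1) Get the registered ReceiverInputStreams from the DStreamGraph
(2) Get the streamId of every ReceiverInputStream
(3) Construct a ReceiverLauncher, which is what launches the receivers (see startReceivers below)
(4) Construct a ReceivedBlockTracker, which maintains the information (ReceivedBlockInfo) about every block received by all of the receivers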
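3. Call receiverTracker's start method.
If receiverInputStreams is not empty, an akka RPC service named ReceiverTracker is set up. It handles four messages: RegisterReceiver, AddBlock, ReportError and DeregisterReceiver.
It then calls receiverExecutor's start method, which eventually calls startReceivers.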

    /**
     * Get the receivers from the ReceiverInputDStreams, distributes them to the
     * worker nodes as a parallel collection, and runs them.
     */
    private def startReceivers() {
      val receivers = receiverInputStreams.map(nis => {
        val rcvr = nis.getReceiver()
        rcvr.setReceiverId(nis.id)
        rcvr
      })

      // Right now, we only honor preferences if all receivers have them
      val hasLocationPreferences = receivers.map(_.preferredLocation.isDefined).reduce(_ && _)

      // Create the parallel collection of receivers to distributed them on the worker nodes
      val tempRDD =
        if (hasLocationPreferences) {
          val receiversWithPreferences = receivers.map(r => (r, Seq(r.preferredLocation.get)))
          ssc.sc.makeRDD[Receiver[_]](receiversWithPreferences)
        } else {
          ssc.sc.makeRDD(receivers, receivers.size)
        }

      val checkpointDirOption = Option(ssc.checkpointDir)
      val serializableHadoopConf = new SerializableWritable(ssc.sparkContext.hadoopConfiguration)

      // Function to start the receiver on the worker node
      val startReceiver = (iterator: Iterator[Receiver[_]]) => {
        if (!iterator.hasNext) {
          throw new SparkException(
            "Could not start receiver as object not found.")
        }
        val receiver = iterator.next()
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      }

      // Run the dummy Spark job to ensure that all slaves have registered.
      // This avoids all the receivers to be scheduled on the same node.
      if (!ssc.sparkContext.isLocal) {
        ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
      }

      // Distribute the receivers and start them
      logInfo("Starting " + receivers.length + " receivers")
      running = true
      ssc.sparkContext.runJob(tempRDD, ssc.sparkContext.clean(startReceiver))
      running = false
      logInfo("All of the receivers have been terminated")
    }

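1) Get all the receivers
2) Build tempRDD from the receivers and parallelize it so that each partition holds exactly one element, a receiver
3) Define the function startReceiver, which takes an iterator over a partition's elements (receivers) as its parameter. This function is then passed to runJob, which applies it to each partition in turn
4) runJob executes startReceiver. Since each partition holds only one receiver, the function constructs a ReceiverSupervisorImpl, which is where data is actually received and stored, and a RegisterReceiver message is sent to the driver.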
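
The "one receiver per partition" pattern in steps 2) to 4) can be imitated without a cluster. The sketch below is only an analogy in plain Scala: FakeReceiver, distribute and run are hypothetical names, a Seq of single-element iterators stands in for the partitions of tempRDD, and nothing here touches the real Spark API.

```scala
object ReceiverPartitionSketch {
  // Hypothetical stand-in for a receiver; it only carries the stream id.
  case class FakeReceiver(streamId: Int)

  // Same shape as startReceiver above: consume the single element of a partition.
  def startReceiver(partition: Iterator[FakeReceiver]): String = {
    if (!partition.hasNext) {
      throw new IllegalStateException("Could not start receiver as object not found.")
    }
    val receiver = partition.next()
    s"started receiver for stream ${receiver.streamId}"
  }

  // Imitates makeRDD(receivers, receivers.size): one element per partition.
  def distribute(receivers: Seq[FakeReceiver]): Seq[Iterator[FakeReceiver]] =
    receivers.map(r => Iterator(r))

  // Imitates runJob: apply the function to every partition.
  def run(receivers: Seq[FakeReceiver]): Seq[String] =
    distribute(receivers).map(startReceiver)
}
```

In the real code each call blocks in supervisor.awaitTermination(), so the job submitted through runJob keeps running for as long as the receivers do.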
The logic inside supervisor.start deserves a closer look. It consists mainly of the following two methods:

    /** Start the supervisor */
    def start() {
      onStart()
      startReceiver()
    }

(1) The onStart method:

    override protected def onStart() {
      blockGenerator.start()
    }

(2) The startReceiver method:

    /** Start receiver */
    def startReceiver(): Unit = synchronized {
      try {
        logInfo("Starting receiver")
        receiver.onStart()
        logInfo("Called receiver onStart")
        onReceiverStart()
        receiverState = Started
      } catch {
        case t: Throwable =>
          stop("Error starting receiver " + streamId, Some(t))
      }
    }

1) The receiver.onStart method
2) The onReceiverStart method
This sends a RegisterReceiver message to the receiverTracker (on the driver side), reporting that the receiver has started so that this can be shown in the UI. The ReceiverTracker records each registered receiver in receiverInfo, which is a HashMap, while the blocks that each stream has received but not yet processed are kept by the ReceivedBlockTracker. The later generateJobs step draws on these blocks to build the corresponding RDDs.
4. Call jobGenerator's start method.
(1) First build an EventLoop for events of type JobGeneratorEvent, covering four events: GenerateJobs, ClearMetadata, DoCheckpoint and ClearCheckpointData, and start it running.
(2) Call startFirstTime to start the generator
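
    /** Starts the generator for the first time */
    private def startFirstTime() {
      val startTime = new Time(timer.getStartTime())
      graph.start(startTime - graph.batchDuration)
      timer.start(startTime.milliseconds)
      logInfo("Started JobGenerator at " + startTime)
    }

timer.getStartTime computes the time at which the next period expires, using the formula (math.floor(clock.currentTime.toDouble / period) + 1).toLong * period. That is, divide the current time by the interval and take math.floor to get the number of whole periods that have elapsed (which locates the previous period boundary), add 1, and multiply by the period to get the time at which the next period expires.
(3) Start the DStreamGraph by calling graph.start. The time passed in is one batch interval earlier than startTime. Why? As far as I can tell, it is because graph.start stores this value as zeroTime and DStream.isTimeValid only accepts times later than zeroTime, so zeroTime has to sit one interval before the first batch at startTime. Corrections are welcome!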

    def start(time: Time) {
      this.synchronized {
        if (zeroTime != null) {
          throw new Exception("DStream graph computation already started")
        }
        zeroTime = time
        startTime = time
        // set the zeroTime of every outputStream to time
        outputStreams.foreach(_.initialize(zeroTime))
        // if a rememberDuration has been set, apply it to every outputStream
        outputStreams.foreach(_.remember(rememberDuration))
        outputStreams.foreach(_.validateAtStart)
        inputStreams.par.foreach(_.start())
      }
    }

(4) Call timer.start with startTime as the argument
Here the timer is:

    private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
      longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

Internally it contains a timer that posts a GenerateJobs message to eventLoop once every batchDuration. The longTime argument is the point in time at which that interval fires.

    /**
     * Start at the given start time.
     */
    def start(startTime: Long): Long = synchronized {
      nextTime = startTime
      thread.start()
      logInfo("Started timer for " + name + " at time " + nextTime)
      nextTime
    }

The internal thread.start call sets the timer running, so that jobs are generated at every batch interval.
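
A quick worked example helps with the getStartTime formula from step (2). The sketch below reuses the formula exactly as quoted; the object name TimerSketch, the helper methods and the sample numbers (a 2000 ms batch interval and a current time of 5300 ms) are my own assumptions for illustration.

```scala
object TimerSketch {
  // Same formula as quoted above: the next multiple of `period` strictly after `currentTime`.
  def getStartTime(currentTime: Long, period: Long): Long =
    (math.floor(currentTime.toDouble / period) + 1).toLong * period

  // The instants at which GenerateJobs would be posted for the first `n` batches.
  def firstTicks(currentTime: Long, period: Long, n: Int): Seq[Long] = {
    val start = getStartTime(currentTime, period)
    (0 until n).map(i => start + i * period)
  }

  // The value handed to graph.start: one batch interval before the first tick.
  def zeroTime(currentTime: Long, period: Long): Long =
    getStartTime(currentTime, period) - period
}
```

With period = 2000 and currentTime = 5300: 5300 / 2000 = 2.65, floor gives 2, plus 1 gives 3, times 2000 gives 6000. The first batches therefore fire at 6000, 8000 and 10000, and the graph's zeroTime is 4000. Note that a current time that falls exactly on a boundary (say 6000) still yields the following boundary (8000), because of the "+ 1".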
5. Handling of the GenerateJobs and ClearMetadata events
As noted above, the JobGenerator's EventLoop handles four JobGeneratorEvent events: GenerateJobs, ClearMetadata, DoCheckpoint and ClearCheckpointData.
GenerateJobs:

    /** Generate jobs and perform checkpoint for the given `time`. */
    private def generateJobs(time: Time) {
      // Set the SparkEnv in this thread, so that job generation code can access the environment
      // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
      // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
      SparkEnv.set(ssc.env)
      Try {
        jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
        graph.generateJobs(time) // generate jobs using allocated block
      } match {
        case Success(jobs) =>
          val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
          val streamIdToNumRecords = streamIdToInputInfos.mapValues(_.numRecords)
          jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToNumRecords))
        case Failure(e) =>
          jobScheduler.reportError("Error generating jobs for time " + time, e)
      }
      eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
    }

(1) allocateBlocksToBatch: based on the value of time, it first fetches the metadata of the blocks that the receivers received earlier and reported to the receiverTracker through AddBlock messages, then saves the mapping from time to those blocks.
So how does time line up with the blocks, which are produced every 200 ms? The answer is that when time is reached, all blocks that have been received but not yet allocated are assigned to the batch interval for this time.
(2) generateJobs: one job is generated per outputStream. In the end every outputStream calls the method below; see the code comments.
Note: the generateJob actually called here is the one overridden for the particular outputStream. For print, for example, the job outputs some values:

    override def generateJob(time: Time): Option[Job] = {
      // This effectively invokes ReceiverInputDStream's compute method by hand,
      // producing an RDD (a BlockRDD, to be precise); see the explanation below
      parent.getOrCompute(time) match {
        case Some(rdd) =>
          val jobFunc = () => createRDDWithLocalProperties(time) {
            ssc.sparkContext.setCallSite(creationSite)
            // The BlockRDD obtained above and a function to run on every partition are
            // wrapped into jobFunc; inside foreachFunc, runJob submits the tasks and
            // fetches the values to output
            foreachFunc(rdd, time)
          }
          // Wrap time and jobFunc into a Job and return it, to be scheduled later
          Some(new Job(time, jobFunc))
        case None => None
      }
    }

The compute method of ReceiverInputDStream needs some explanation here:
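1) First, use time to retrieve the block metadata that was mapped earlier
2) Get the blockId of each of these blocks. A blockId is simply the streamId plus a unique value, which guarantees that the id is unique within a stream
3) Aggregate the metadata of the blocks in this batchTime and save it to the inputInfoTracker
4) Wrap the sparkContext and the blockIds into a BlockRDD and return it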
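
The way allocateBlocksToBatch ties blocks to a batch time, and the way compute later looks them up by that time, can be sketched roughly as follows. This is only an illustration under my own assumptions: Tracker, BlockInfo and the method bodies are simplified stand-ins, not the real ReceivedBlockTracker, and the block id strings are made-up sample values.

```scala
import scala.collection.mutable

object BlockAllocationSketch {
  // Hypothetical stand-in for a block's metadata.
  case class BlockInfo(streamId: Int, blockId: String)

  class Tracker {
    private val unallocated = mutable.Queue[BlockInfo]()
    private val allocatedByBatch = mutable.Map[Long, List[BlockInfo]]()

    // Called when a receiver reports a new block (the AddBlock message).
    def addBlock(block: BlockInfo): Unit = synchronized {
      unallocated.enqueue(block)
    }

    // Called when the batch timer fires: everything received so far but not
    // yet allocated becomes part of the batch identified by batchTime.
    def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
      allocatedByBatch(batchTime) = unallocated.toList
      unallocated.clear()
    }

    // What compute would read back for a given batch time.
    def getBlocksOfBatch(batchTime: Long): List[BlockInfo] = synchronized {
      allocatedByBatch.getOrElse(batchTime, Nil)
    }
  }
}
```

A block that arrives just after a batch has been allocated simply waits in the queue and ends up in the next batch, so no block is assigned twice and none is lost.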
At this point the jobs have been generated. If generation succeeded, the case Success(jobs) => branch is taken.
jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToNumRecords)) wraps time, jobs, and the mapping from each streamId to its record count into a JobSet, then calls submitJobSet:

    def submitJobSet(jobSet: JobSet) {
      if (jobSet.jobs.isEmpty) {
        logInfo("No jobs added for time " + jobSet.time)
      } else {
        listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
        jobSets.put(jobSet.time, jobSet)
        jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
        logInfo("Added jobs for time " + jobSet.time)
      }
    }

As you can see, the jobSet is stored in the jobSets map, and each job is wrapped in a JobHandler and then run on a thread. That thread comes from a thread pool whose size is set by the "spark.streaming.concurrentJobs" parameter, which defaults to 1.
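
The practical effect of a pool of size 1 is that jobs run one after another in submission order. The sketch below shows this with a plain java.util.concurrent pool; JobExecutorSketch and runJobs are hypothetical names, and I am assuming a fixed-size pool here rather than reproducing how Spark builds its jobExecutor.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object JobExecutorSketch {
  // Runs each job on a fixed-size pool and returns the order in which jobs ran.
  def runJobs(jobNames: Seq[String], concurrentJobs: Int): List[String] = {
    val executed = ArrayBuffer[String]()
    val jobExecutor = Executors.newFixedThreadPool(concurrentJobs)
    jobNames.foreach { name =>
      jobExecutor.execute(new Runnable {
        override def run(): Unit = {
          executed.synchronized { executed += name }
          ()
        }
      })
    }
    jobExecutor.shutdown() // accept no new jobs, let queued ones finish
    jobExecutor.awaitTermination(10, TimeUnit.SECONDS)
    executed.synchronized { executed.toList }
  }
}
```

With concurrentJobs = 1 the returned order always matches the submission order. With a larger value the jobs of one batch (or of consecutive batches) may overlap, so the order is no longer guaranteed.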
Next, the logic that runs when a thread processes a JobHandler; see the code comments:

    private class JobHandler(job: Job) extends Runnable with Logging {
      def run() {
        ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
        ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)
        try {
          // Mainly sets processingStartTime of the JobSet this job belongs to to the current time
          eventLoop.post(JobStarted(job))
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details.
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            // run invokes the second argument given when the Job was built,
            // a function, i.e. the jobFunc above
            job.run()
          }
          // Once every job in this job's JobSet has completed, sets processingEndTime and
          // posts a ClearMetadata message to the event loop (covered below)
          eventLoop.post(JobCompleted(job))
        } finally {
          ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
          ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
        }
      }
    }

ClearMetadata:
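Once a jobset has completed, the ClearMetadata message is processed:
1. Based on time, select the RDDs generated before time; if rememberDuration is set, select those earlier than (time - rememberDuration)
2. Call unpersist on the selected RDDs
3. Remove the corresponding blocks from the blockManager
4. Do the same along the dependencies chain, starting from the outputStream and following the links one by one
5. Remove the other in-memory records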
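
Steps 1 and 2 amount to a filter over a map keyed by batch time. Here is a rough sketch under my own assumptions: FakeDStream and its generatedRDDs map are illustrative stand-ins, RDDs are represented by plain strings, and the strict "<" comparison follows the wording above rather than a check against the Spark source.

```scala
import scala.collection.mutable

object ClearMetadataSketch {
  // Hypothetical stand-in for a DStream: batch time (ms) -> name of the RDD generated for it.
  class FakeDStream(rememberDuration: Long) {
    val generatedRDDs = mutable.HashMap[Long, String]()

    // Drops every entry earlier than (time - rememberDuration) and returns the
    // removed names, which is the point where unpersist would be called.
    def clearMetadata(time: Long): List[String] = {
      val oldKeys = generatedRDDs.keys.filter(_ < time - rememberDuration).toList.sorted
      val removed = oldKeys.map(generatedRDDs)
      generatedRDDs --= oldKeys
      removed
    }
  }
}
```

For example, with rememberDuration = 4000 and batches at 2000, 4000, 6000, 8000 and 10000, calling clearMetadata(10000) drops the entries for 2000 and 4000 and keeps the rest.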
That concludes the analysis of scheduling and execution, the most important part of Spark Streaming!