In this installment:
1. Thinking in depth about Spark Streaming Job generation
2. Source code walkthrough of Spark Streaming Job generation
Start with the JobGenerator class. Its constructor takes a JobScheduler, and JobScheduler is the core of Spark Streaming for generating Jobs and submitting them to the cluster. JobGenerator generates Jobs from the DStreamGraph. To stress it again: a Job here is comparable to a Java Runnable that wraps business logic, and it is not the same concept as a Job in Spark Core. A Spark Core Job is the unit of work that actually runs; a Spark Streaming Job is a higher-level abstraction.

    /**
     * This class generates jobs from DStreams as well as drives checkpointing and cleaning
     * up DStream metadata.
     */
    private[streaming]
    class JobGenerator(jobScheduler: JobScheduler) extends Logging {

      private val ssc = jobScheduler.ssc
      private val conf = ssc.conf
      private val graph = ssc.graph
A Spark Streaming Job is just a simple holder object, much like a Java Bean; the business logic lives in the func function.

    /**
     * Class representing a Spark computation. It may contain multiple Spark jobs.
     */
    private[streaming]
    class Job(val time: Time, func: () => _) {
      private var _id: String = _
      private var _outputOpId: Int = _
      private var isSet = false
      private var _result: Try[_] = null
      private var _callSite: CallSite = null
      private var _startTime: Option[Long] = None
      private var _endTime: Option[Long] = None
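DStreams come in three kinds. The first is the input stream, built from a data source such as Socket, Kafka, or Flume. The second is the output stream: outputStreams are logical-level actions, and because they still live at the Spark Streaming framework level they must eventually be turned into physical-level actions. The third is the transformed stream, produced from another DStream by a transformation. DStreamGraph records the input DStreams and the output DStreams.

    // DStreamGraph is the static template for RDDs: it describes the processing steps
    // formed by the RDD dependencies
    final private[streaming] class DStreamGraph extends Serializable with Logging {

      // Input streams: the data sources (a growable array of InputDStream)
      private val inputStreams = new ArrayBuffer[InputDStream[_]]()
      // Output streams: the output operations, i.e. the actions
      private val outputStreams = new ArrayBuffer[DStream[_]]()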
Driven by the BatchDuration interval, JobGenerator keeps producing jobs as time passes. It also drives checkpointing and the cleanup of earlier DStream metadata.
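Some thoughts on stream processing versus batch processing. If the batch interval is short enough, batch processing is effectively stream processing. Spark Streaming's stream processing is triggered by time, whereas Storm's is triggered by events. Scheduled tasks, stream processing, and J2EE-triggered jobs can all be compared in terms of what triggers the work.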
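A question to think about: when the logical-level DStreamGraph is translated into the physical-level RDD graph, the last operation is an RDD action. Does that trigger a Job immediately?
A Job produced by JobGenerator is a Runnable-style wrapper. The translation turns the dependencies between DStreams into dependencies between RDDs, and the final operation is an action. All of this, however, sits inside a function that has not been called yet, so no Job is triggered at translation time. If translation did trigger the Job, Spark Streaming would lose control over Job submission.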
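The point about deferred execution can be illustrated with a minimal sketch in plain Scala, with no Spark dependency. The names `SketchJob`, `actionRuns`, and `generateJob` are hypothetical stand-ins invented for this example; the counter only simulates the "action" so that it is visible when the wrapped function actually runs.

```scala
import scala.util.Try

object DeferredJobSketch {
  // Counts how many times the simulated "action" has actually executed.
  var actionRuns = 0

  // Hypothetical stand-in for a streaming Job: it only stores the function.
  final class SketchJob(val time: Long, func: () => Any) {
    private var result: Option[Try[Any]] = None
    // Only here is the wrapped function invoked.
    def run(): Unit = { result = Some(Try(func())) }
    def hasRun: Boolean = result.isDefined
  }

  // "Translation" step: builds the closure but does not call it.
  def generateJob(time: Long): SketchJob =
    new SketchJob(time, () => { actionRuns += 1; s"batch-$time done" })
}
```

Creating the job leaves `actionRuns` at 0; it only becomes 1 once `run()` is called, which is the moment a scheduler would choose.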
When JobScheduler schedules a Job, it takes a thread from the thread pool and runs the function that wraps the DStream-to-RDD logic.
Next we look at Job generation from three angles: JobGenerator, JobScheduler, and ReceiverTracker. JobGenerator generates Jobs, JobScheduler schedules them, and ReceiverTracker keeps track of the incoming data. JobGenerator and ReceiverTracker are both members of JobScheduler.

    /**
     * This class schedules jobs to be run on Spark. It uses the JobGenerator to generate
     * the jobs and runs them using a thread pool.
     */
    private[streaming]
    class JobScheduler(val ssc: StreamingContext) extends Logging {

      // Use of ConcurrentHashMap.keySet later causes an odd runtime problem due to Java 7/8 diff
      // https://gist.github.com/AlainODea/1375759b8720a3f9f094
      private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]
      // The number of concurrent jobs defaults to 1
      private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
      // Jobs are executed on a thread pool
      private val jobExecutor =
        ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
      // Create the JobGenerator (analyzed in detail later)
      private val jobGenerator = new JobGenerator(this)
      val clock = jobGenerator.clock
      val listenerBus = new StreamingListenerBus()

      // These two are created only when scheduler starts.
      // eventLoop not being null means the scheduler has been started and not stopped
      var receiverTracker: ReceiverTracker = null
      // A tracker to track all the input stream information as well as processed record number
      var inputInfoTracker: InputInfoTracker = null
JobScheduler's start method calls the start methods of both ReceiverTracker and JobGenerator.

    def start(): Unit = synchronized {
      if (eventLoop != null) return // scheduler has already been started

      logDebug("Starting JobScheduler")
      // Message-driven event system
      eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
        override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

        override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
      }
      // Start the thread that runs the message loop
      eventLoop.start()

      // attach rate controllers of input streams to receive batch completion updates
      for {
        inputDStream <- ssc.graph.getInputStreams
        rateController <- inputDStream.rateController
      } ssc.addStreamingListener(rateController)

      listenerBus.start(ssc.sparkContext)
      receiverTracker = new ReceiverTracker(ssc)
      inputInfoTracker = new InputInfoTracker(ssc)
      // Start the ReceiverTracker
      receiverTracker.start()
      // Start the job generator
      jobGenerator.start()
      logInfo("Started JobScheduler")
    }
Look first at JobGenerator's start method: it initializes checkpointing, instantiates and starts the EventLoop message loop, and starts the timer that generates Jobs periodically.

    /** Start generation of jobs */
    def start(): Unit = synchronized {
      if (eventLoop != null) return // generator has already been started

      // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
      // See SPARK-10125
      checkpointWriter

      eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
        override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

        override protected def onError(e: Throwable): Unit = {
          jobScheduler.reportError("Error in job generator", e)
        }
      }
      // Start the thread that runs the message loop
      eventLoop.start()

      if (ssc.isCheckpointPresent) {
        restart()
      } else {
        // Start the timer that generates Jobs periodically
        startFirstTime()
      }
    }
EventLoop holds a LinkedBlockingDeque that stores messages, plus a background thread. The background thread takes messages from the queue and calls onReceive to handle each one. Here onReceive is the override in the anonymous subclass, which delegates to processEvent.

    private[spark] abstract class EventLoop[E](name: String) extends Logging {

      private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

      private val stopped = new AtomicBoolean(false)

      private val eventThread = new Thread(name) {
        setDaemon(true)

        override def run(): Unit = {
          try {
            while (!stopped.get) {
              val event = eventQueue.take()
              try {
                onReceive(event)
              } catch {
                case NonFatal(e) => {
                  try {
                    onError(e)
                  } catch {
                    case NonFatal(e) => logError("Unexpected error in " + name, e)
                  }
                }
              }
            }
          } catch {
            case ie: InterruptedException => // exit even if eventQueue is not empty
            case NonFatal(e) => logError("Unexpected error in " + name, e)
          }
        }
      }

      def start(): Unit = {
        if (stopped.get) {
          throw new IllegalStateException(name + " has already been stopped")
        }
        // Call onStart before starting the event thread to make sure it happens before onReceive
        onStart()
        eventThread.start()
      }
processEvent pattern-matches on the message type and routes the message to the method that handles it. Message handling is generally handed off to another thread; the message loop itself does not run time-consuming business logic.

    /** Processes all events */
    private def processEvent(event: JobGeneratorEvent) {
      logDebug("Got event " + event)
      event match {
        case GenerateJobs(time) => generateJobs(time)
        case ClearMetadata(time) => clearMetadata(time)
        case DoCheckpoint(time, clearCheckpointDataLater) =>
          doCheckpoint(time, clearCheckpointDataLater)
        case ClearCheckpointData(time) => clearCheckpointData(time)
      }
    }
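The routing pattern can be reproduced outside Spark as a small sketch. The event types below are hypothetical copies that only mirror the shape of the real JobGeneratorEvent messages, and the handlers just return a description of the call they would make.

```scala
object EventRoutingSketch {
  // Hypothetical event types mirroring the shape of JobGeneratorEvent.
  sealed trait SketchEvent
  final case class GenerateJobs(time: Long) extends SketchEvent
  final case class ClearMetadata(time: Long) extends SketchEvent
  final case class DoCheckpoint(time: Long, clearLater: Boolean) extends SketchEvent

  // Route each event to its handler; the sealed trait lets the compiler
  // warn when a case is missing.
  def processEvent(event: SketchEvent): String = event match {
    case GenerateJobs(t)        => s"generateJobs($t)"
    case ClearMetadata(t)       => s"clearMetadata($t)"
    case DoCheckpoint(t, later) => s"doCheckpoint($t, $later)"
  }
}
```

Because the match only dispatches, it stays cheap; anything slow belongs in the handler's own thread rather than in the loop.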
Take generateJobs, the handler for the GenerateJobs message, as an example: after obtaining the data for the batch, it calls DStreamGraph's generateJobs method to generate the Jobs.

    /** Generate jobs and perform checkpoint for the given `time`. */
    private def generateJobs(time: Time) {
      // Set the SparkEnv in this thread, so that job generation code can access the environment
      // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
      // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
      SparkEnv.set(ssc.env)
      Try {
        // Obtain the data that belongs to this specific batch time
        jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
        // Call DStreamGraph.generateJobs to generate the Jobs
        graph.generateJobs(time) // generate jobs using allocated block
      } match {
        case Success(jobs) =>
          val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
          jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
        case Failure(e) => jobScheduler.reportError("Error generating jobs for time " + time, e)
      }
      eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
    }
In generateJobs, outputStreams holds the last DStreams of the whole DStream chain. The call outputStream.generateJob(time) works much like RDDs do, deriving the dependencies backward from the end.

    def generateJobs(time: Time): Seq[Job] = {
      logDebug("Generating jobs for time " + time)
      val jobs = this.synchronized {
        outputStreams.flatMap { outputStream =>
          val jobOption = outputStream.generateJob(time)
          jobOption.foreach(_.setCallSite(outputStream.creationSite))
          jobOption
        }
      }
      logDebug("Generated " + jobs.length + " jobs for time " + time)
      jobs
    }
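The flatMap over Option values is what lets an output stream contribute zero or one Job per batch. A minimal sketch in plain Scala follows; `SketchOutputStream`, `SketchJob`, and the rule that a job exists only when the time is a multiple of `slideMs` are assumptions made for illustration, not Spark's actual types or validity check.

```scala
object GenerateJobsSketch {
  final case class SketchJob(time: Long, name: String)

  // Hypothetical output stream: yields a job only when `time` is a
  // multiple of its slide interval, otherwise nothing.
  final case class SketchOutputStream(name: String, slideMs: Long) {
    def generateJob(time: Long): Option[SketchJob] =
      if (time % slideMs == 0) Some(SketchJob(time, name)) else None
  }

  // flatMap drops the None results and unwraps the Some results.
  def generateJobs(outputs: Seq[SketchOutputStream], time: Long): Seq[SketchJob] =
    outputs.flatMap(_.generateJob(time))
}
```

With one stream sliding every 1000 ms and another every 2000 ms, the batch at time 1000 produces one job and the batch at time 2000 produces two.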
In generateJob, jobFunc wraps context.sparkContext.runJob(rdd, emptyFunc).

    private[streaming] def generateJob(time: Time): Option[Job] = {
      getOrCompute(time) match {
        case Some(rdd) => {
          // The Job itself is wrapped in a function; nothing is executed at this point
          val jobFunc = () => {
            val emptyFunc = { (iterator: Iterator[T]) => {} }
            context.sparkContext.runJob(rdd, emptyFunc)
          }
          Some(new Job(time, jobFunc))
        }
        case None => None
      }
    }
In the Job object, the run method is what causes the func passed in to be called.

    /**
     * Class representing a Spark computation. It may contain multiple Spark jobs.
     */
    private[streaming]
    class Job(val time: Time, func: () => _) {
      private var _id: String = _
      private var _outputOpId: Int = _
      private var isSet = false
      private var _result: Try[_] = null
      private var _callSite: CallSite = null
      private var _startTime: Option[Long] = None
      private var _endTime: Option[Long] = None

      def run() {
        _result = Try(func())
      }
getOrCompute first looks up the RDD for the given time in a HashMap. If it is not there, it calls compute to produce the RDD, persists it if storageLevel requires it, marks it for checkpointing if the checkpoint time has been reached, and finally puts the RDD into the HashMap.

    /**
     * Get the RDD corresponding to the given time; either retrieve it from cache
     * or compute-and-cache it.
     */
    private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
      // If RDD was already generated, then retrieve it from HashMap,
      // or else compute the RDD
      generatedRDDs.get(time).orElse {
        // Compute the RDD if time is valid (e.g. correct time in a sliding window)
        // of RDD generation, else generate nothing.
        if (isTimeValid(time)) {

          val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
            // Disable checks for existing output directories in jobs launched by the streaming
            // scheduler, since we may need to write output to an existing directory during checkpoint
            // recovery; see SPARK-4835 for more details. We need to have this call here because
            // compute() might cause Spark jobs to be launched.
            PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
              compute(time)
            }
          }

          rddOption.foreach { case newRDD =>
            // Register the generated RDD for caching and checkpointing
            if (storageLevel != StorageLevel.NONE) {
              newRDD.persist(storageLevel)
              logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
            }
            if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
              newRDD.checkpoint()
              logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
            }
            generatedRDDs.put(time, newRDD)
          }
          rddOption
        } else {
          None
        }
      }
    }
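The lookup-then-compute-then-cache shape of getOrCompute can be sketched in plain Scala. Everything here is a simplification for illustration: a String stands in for the RDD, `isTimeValid` is reduced to a multiple-of-slide check, and persist and checkpoint are left out.

```scala
import scala.collection.mutable

object GetOrComputeSketch {
  // Hypothetical cache keyed by batch time; a String stands in for the RDD.
  private val generated = mutable.HashMap.empty[Long, String]
  // Exposed only so the caching effect can be observed.
  var computeCalls = 0

  // Simplified validity rule assumed for this sketch.
  private def isTimeValid(time: Long, slideMs: Long): Boolean = time % slideMs == 0

  private def compute(time: Long): Option[String] = {
    computeCalls += 1
    Some(s"rdd@$time")
  }

  def getOrCompute(time: Long, slideMs: Long = 1000L): Option[String] =
    generated.get(time).orElse {
      if (isTimeValid(time, slideMs)) {
        val rddOption = compute(time)
        // Remember the result so the same batch time is never computed twice.
        rddOption.foreach(rdd => generated.put(time, rdd))
        rddOption
      } else {
        None
      }
    }
}
```

Calling `getOrCompute(1000L)` twice runs `compute` only once, and an invalid time such as 1500 yields `None` without computing anything.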
Back in JobGenerator, look at start again. After the message loop has started, it checks whether a checkpoint already exists. If so, the state is read from the checkpoint directory and restart is called to restart JobGenerator; if this is the first run, startFirstTime is called.

    /** Start generation of jobs */
    def start(): Unit = synchronized {
      if (eventLoop != null) return // generator has already been started

      // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
      // See SPARK-10125
      checkpointWriter

      eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
        override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

        override protected def onError(e: Throwable): Unit = {
          jobScheduler.reportError("Error in job generator", e)
        }
      }
      // Start the thread that runs the message loop
      eventLoop.start()

      if (ssc.isCheckpointPresent) {
        restart()
      } else {
        // Start the timer that generates Jobs periodically
        startFirstTime()
      }
    }
JobGenerator's startFirstTime method starts the Timer that generates Jobs periodically.

    /** Starts the generator for the first time */
    private def startFirstTime() {
      val startTime = new Time(timer.getStartTime())
      graph.start(startTime - graph.batchDuration)
      timer.start(startTime.milliseconds)
      logInfo("Started JobGenerator at " + startTime)
    }
The timer object is a RecurringTimer. Its start method starts a thread, and that thread repeatedly calls triggerActionForNextInterval.

    // Recurring timer that periodically calls back
    // eventLoop.post(GenerateJobs(new Time(longTime))).
    // This defines the function to be triggered on each tick: it posts an event of
    // type GenerateJobs. Note that only the callback is defined here; nothing runs yet.
    // GenerateJobs messages are posted at the batchInterval passed in when the
    // StreamingContext was created.
    private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
      longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

    /**
     * Start at the given start time.
     */
    def start(startTime: Long): Long = synchronized {
      nextTime = startTime
      thread.start()
      logInfo("Started timer for " + name + " at time " + nextTime)
      nextTime
    }

    // A daemon thread is created here
    private val thread = new Thread("RecurringTimer - " + name) {
      setDaemon(true)
      override def run() { loop }
    }

    /**
     * Repeatedly call the callback every interval.
     */
    private def loop() {
      try {
        while (!stopped) {
          triggerActionForNextInterval()
        }
        triggerActionForNextInterval()
      } catch {
        case e: InterruptedException =>
      }
    }
triggerActionForNextInterval waits until the next BatchDuration boundary and then invokes callback. The callback is the function passed in when the RecurringTimer was constructed, namely longTime => eventLoop.post(GenerateJobs(new Time(longTime))), so GenerateJobs messages are posted to the message loop over and over.

    private def triggerActionForNextInterval(): Unit = {
      clock.waitTillTime(nextTime)
      callback(nextTime)
      prevTime = nextTime
      nextTime += period
      logDebug("Callback for " + name + " called at time " + prevTime)
    }

    private[streaming]
    class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
      extends Logging {
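The wait, call back, advance cycle can be simulated deterministically with a manual clock. This is a sketch under stated assumptions: `ManualClock` and `SketchTimer` are hypothetical classes written for this example, the loop runs a fixed number of ticks instead of on a daemon thread, and the choice to align the start time to the next period boundary is an assumption of the sketch rather than something shown in the excerpts above.

```scala
import scala.collection.mutable.ArrayBuffer

object RecurringTimerSketch {
  // Hypothetical manual clock: "waiting" just jumps the current time forward.
  final class ManualClock(var now: Long) {
    def waitTillTime(target: Long): Unit = if (target > now) now = target
  }

  final class SketchTimer(clock: ManualClock, period: Long, callback: Long => Unit) {
    private var nextTime = 0L

    // Next period boundary strictly after the current time (assumed for this sketch;
    // expects a non-negative clock value).
    def getStartTime(): Long = (clock.now / period + 1) * period

    // Fire the callback `ticks` times, once per period, starting at `startTime`.
    def run(startTime: Long, ticks: Int): Unit = {
      nextTime = startTime
      for (_ <- 0 until ticks) {
        clock.waitTillTime(nextTime)
        callback(nextTime)
        nextTime += period
      }
    }
  }

  // Returns the batch times for which a "GenerateJobs" message would be posted.
  def simulate(now: Long, period: Long, ticks: Int): Seq[Long] = {
    val posted = ArrayBuffer.empty[Long]
    val timer = new SketchTimer(new ManualClock(now), period, t => { posted += t; () })
    timer.run(timer.getStartTime(), ticks)
    posted.toList
  }
}
```

Each callback receives the scheduled batch time rather than the wall-clock time at which it happened to run, so batch times stay exactly one period apart.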
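Let us focus once more on the steps generateJobs takes to generate Jobs:
Step 1: Obtain the data for the current batch interval.
Step 2: Generate the Jobs, that is, the dependencies between RDDs.
Step 3: Obtain the input information, keyed by StreamId, for the generated Jobs.
Step 4: Wrap everything in a JobSet and hand it to JobScheduler.
Step 5: Perform the checkpoint operation.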
    /** Generate jobs and perform checkpoint for the given `time`. */
    private def generateJobs(time: Time) {
      // Set the SparkEnv in this thread, so that job generation code can access the environment
      // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
      // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
      SparkEnv.set(ssc.env)
      Try {
        // Step 1: obtain the data for the current batch interval.
        jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
        // Step 2: generate the Jobs, that is, the dependencies between RDDs.
        graph.generateJobs(time) // generate jobs using allocated block
      } match {
        case Success(jobs) =>
          // Step 3: obtain the input information, keyed by StreamId, for the generated Jobs.
          val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
          // Step 4: wrap everything in a JobSet and hand it to JobScheduler.
          jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
        case Failure(e) => jobScheduler.reportError("Error generating jobs for time " + time, e)
      }
      // Step 5: perform the checkpoint operation.
      eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
    }
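The Try, Success, Failure structure that separates the submit path from the error path can be shown in isolation. In this sketch `allocateBlocks`, `buildJobs`, and `SketchJobSet` are hypothetical placeholders, and the failure on a negative time exists only to exercise the Failure branch.

```scala
import scala.util.{Failure, Success, Try}

object GenerateAndSubmitSketch {
  final case class SketchJobSet(time: Long, jobs: Seq[String])

  // Hypothetical step 1: fails for negative times to exercise the Failure branch.
  private def allocateBlocks(time: Long): Unit =
    require(time >= 0, s"invalid batch time $time")

  // Hypothetical step 2: pretend every batch has two output operations.
  private def buildJobs(time: Long): Seq[String] = Seq(s"job-$time-0", s"job-$time-1")

  // Returns the job set that would be submitted, or the error that would be reported.
  def generateJobs(time: Long): Either[String, SketchJobSet] =
    Try {
      allocateBlocks(time)
      buildJobs(time)
    } match {
      case Success(jobs) => Right(SketchJobSet(time, jobs))
      case Failure(e)    => Left(s"Error generating jobs for time $time: ${e.getMessage}")
    }
}
```

An exception in either of the first two steps is captured by Try, so the caller still reaches whatever follows the match, just as the checkpoint message is posted in both cases.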
submitJobSet simply puts the JobSet into a ConcurrentHashMap, wraps each Job in a JobHandler, and submits it to the jobExecutor thread pool.

    def submitJobSet(jobSet: JobSet) {
      if (jobSet.jobs.isEmpty) {
        logInfo("No jobs added for time " + jobSet.time)
      } else {
        listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
        jobSets.put(jobSet.time, jobSet)
        jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
        logInfo("Added jobs for time " + jobSet.time)
      }
    }

    private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]
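To see what handing Runnables to a fixed-size pool means for ordering, here is a small sketch using only java.util.concurrent. `runAll` and its parameters are hypothetical names; the default pool size of 1 mirrors the default concurrency quoted earlier, and a plain (non-daemon) pool is used instead of Spark's ThreadUtils helper.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

object SubmitJobSetSketch {
  // Runs each job on a fixed-size pool and returns the order in which they ran.
  def runAll(jobNames: Seq[String], numConcurrentJobs: Int = 1): Vector[String] = {
    val executed = new AtomicReference(Vector.empty[String])
    val pool = Executors.newFixedThreadPool(numConcurrentJobs)
    try {
      jobNames.foreach { name =>
        // Each job is wrapped in a Runnable, in the spirit of JobHandler.
        pool.execute(new Runnable {
          override def run(): Unit = { executed.updateAndGet(v => v :+ name); () }
        })
      }
    } finally {
      pool.shutdown() // stop accepting work, but finish what was queued
      pool.awaitTermination(10, TimeUnit.SECONDS)
    }
    executed.get()
  }
}
```

With a single worker thread the jobs run strictly in submission order; raising `numConcurrentJobs` lets them overlap, so only the set of completed jobs is guaranteed, not their order.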
JobHandler implements the Runnable interface. Its call to the job's run method is what invokes func, that is, the DStream-based business logic.

      private class JobHandler(job: Job) extends Runnable with Logging {
        import JobScheduler._

        def run() {
          try {
            val formattedTime = UIUtils.formatBatchTime(
              job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
            val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
            val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

            ssc.sc.setJobDescription(
              s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
            ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
            ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

            // We need to assign `eventLoop` to a temp variable. Otherwise, because
            // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
            // it's possible that when `post` is called, `eventLoop` happens to null.
            var _eventLoop = eventLoop
            if (_eventLoop != null) {
              _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
              // Disable checks for existing output directories in jobs launched by the streaming
              // scheduler, since we may need to write output to an existing directory during checkpoint
              // recovery; see SPARK-4835 for more details.
              PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
                job.run()
              }
              _eventLoop = eventLoop
              if (_eventLoop != null) {
                _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
              }
            } else {
              // JobScheduler has been stopped.
            }
          } finally {
            ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
            ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
          }
        }
      }
    }
Notes:
1. WeChat official account of DT大数据梦工厂: DT_Spark
2. IMF 8 p.m. big data hands-on sessions, YY live channel: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains