Source Code Walkthrough
1. Program Entry
// Initialize the StreamingContext
SparkConf conf = new SparkConf().setAppName("SparkStreaming_demo")
        .set("spark.streaming.stopGracefullyOnShutdown", "true");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(30));
// Build the processing chain
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", BOOTSTRAP_SERVERS);
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "exampleGroup");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList("exampleTopic");
JavaInputDStream<ConsumerRecord<String, String>> inputDStream = KafkaUtils.createDirectStream(streamingContext,
        LocationStrategies.PreferConsistent(), ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
JavaDStream<String> transformDStream = inputDStream.transform(new Function<JavaRDD<ConsumerRecord<String, String>>, JavaRDD<String>>() {
    @Override
    public JavaRDD<String> call(JavaRDD<ConsumerRecord<String, String>> v1) throws Exception {
        JavaRDD<String> tempRdd = v1.map(new Function<ConsumerRecord<String, String>, String>() {
            @Override
            public String call(ConsumerRecord<String, String> v1) throws Exception {
                return v1.value();
            }
        });
        return tempRdd;
    }
});
JavaDStream<String> filterDStream = transformDStream.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String v1) throws Exception {
        return StringUtils.isNotBlank(v1);
    }
});
filterDStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> javaRDD) throws Exception {
        javaRDD.saveAsTextFile("/home/example/result/");
    }
});
// Start the application
streamingContext.start();
streamingContext.awaitTermination();
This article focuses on the startup and running process of the StreamingContext.
2. Into the Source
2.1 Following streamingContext.start()
- Enter org.apache.spark.streaming.StreamingContext.scala
def start(): Unit = synchronized {
state match {
case INITIALIZED =>
startSite.set(DStream.getCreationSite())
StreamingContext.ACTIVATION_LOCK.synchronized {
StreamingContext.assertNoOtherContextIsActive()
try {
validate()
// Start the streaming scheduler in a new thread, so that thread local properties
// like call sites and job groups can be reset without affecting those of the
// current thread.
ThreadUtils.runInNewThread("streaming-start") {
sparkContext.setCallSite(startSite.get)
sparkContext.clearJobGroup()
sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
scheduler.start()
}
state = StreamingContextState.ACTIVE
scheduler.listenerBus.post(
StreamingListenerStreamingStarted(System.currentTimeMillis()))
} catch {
case NonFatal(e) =>
logError("Error starting the context, marking it as stopped", e)
scheduler.stop(false)
state = StreamingContextState.STOPPED
throw e
}
StreamingContext.setActiveContext(this)
}
logDebug("Adding shutdown hook") // force eager creation of logger
shutdownHookRef = ShutdownHookManager.addShutdownHook(
StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
// Registering Streaming Metrics at the start of the StreamingContext
assert(env.metricsSystem != null)
env.metricsSystem.registerSource(streamingSource)
uiTab.foreach(_.attach())
logInfo("StreamingContext started")
case ACTIVE =>
logWarning("StreamingContext has already been started")
case STOPPED =>
throw new IllegalStateException("StreamingContext has already been stopped")
}
}
The initial StreamingContextState is INITIALIZED, so the INITIALIZED branch is taken: a new thread is spawned, the scheduler is started in that thread (scheduler.start()), and the state is then updated to ACTIVE.
2.2 Following scheduler.start()
- Enter org.apache.spark.streaming.scheduler.JobScheduler.scala
def start(): Unit = synchronized {
if (eventLoop != null) return // scheduler has already been started
logDebug("Starting JobScheduler")
eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
}
eventLoop.start()
// attach rate controllers of input streams to receive batch completion updates
for {
inputDStream <- ssc.graph.getInputStreams
rateController <- inputDStream.rateController
} ssc.addStreamingListener(rateController)
listenerBus.start()
receiverTracker = new ReceiverTracker(ssc)
inputInfoTracker = new InputInfoTracker(ssc)
val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
case _ => null
}
executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
executorAllocClient,
receiverTracker,
ssc.conf,
ssc.graph.batchDuration.milliseconds,
clock)
executorAllocationManager.foreach(ssc.addStreamingListener)
receiverTracker.start()
jobGenerator.start()
executorAllocationManager.foreach(_.start())
logInfo("Started JobScheduler")
}
When JobScheduler starts, it:
- Creates and starts an EventLoop[JobSchedulerEvent] (eventLoop.start()) to handle JobSchedulerEvent events.
- Collects every InputDStream in the DStreamGraph together with its rateController and registers each rateController with the listenerBus; in this example the InputDStream is a DirectKafkaInputDStream.
- Creates and starts the ReceiverTracker (receiverTracker.start()).
- Starts the jobGenerator (jobGenerator.start()).
2.3 Following receiverTracker.start()
- Enter org.apache.spark.streaming.scheduler.ReceiverTracker.scala
def start(): Unit = synchronized {
if (isTrackerStarted) {
throw new SparkException("ReceiverTracker already started")
}
if (!receiverInputStreams.isEmpty) {
endpoint = ssc.env.rpcEnv.setupEndpoint(
"ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
if (!skipReceiverLaunch) launchReceivers()
logInfo("ReceiverTracker started")
trackerState = Started
}
}
- It first checks whether receiverInputStreams is empty, where
private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
- getReceiverInputStreams() scans the inputStreams registered in the DStreamGraph and keeps only those of type ReceiverInputDStream:
def getReceiverInputStreams(): Array[ReceiverInputDStream[_]] = this.synchronized {
inputStreams.filter(_.isInstanceOf[ReceiverInputDStream[_]])
.map(_.asInstanceOf[ReceiverInputDStream[_]])
.toArray
}
- In this example, KafkaUtils.createDirectStream() created a DirectKafkaInputDStream and added it to the DStreamGraph's inputStreams. Since DirectKafkaInputDStream is not a ReceiverInputDStream, receiverInputStreams is empty and the rest of the method is skipped.
Note: the Direct approach differs from the Receiver approach. With a Receiver-based source the input stream is a ReceiverInputDStream, so the remaining logic (launching receivers) would run; that path will be analyzed separately. A receiver-based source would look like the sketch below.
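For contrast, a minimal sketch of a Receiver-based source; the host and port are placeholders. socketTextStream yields a JavaReceiverInputDStream, whose underlying ReceiverInputDStream would make receiverInputStreams non-empty and let launchReceivers() run:
// Receiver-based source (contrast with KafkaUtils.createDirectStream above)
JavaReceiverInputDStream<String> socketStream =
        streamingContext.socketTextStream("localhost", 9999); // placeholder host/port
// A direct Kafka stream is not receiver-based, so in this article's example
// ReceiverTracker.start() finds no ReceiverInputDStream and launches no receivers.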
2.4 Following jobGenerator.start()
- Enter org.apache.spark.streaming.scheduler.JobGenerator.scala
def start(): Unit = synchronized {
if (eventLoop != null) return // generator has already been started
// Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
// See SPARK-10125
checkpointWriter
eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
override protected def onError(e: Throwable): Unit = {
jobScheduler.reportError("Error in job generator", e)
}
}
eventLoop.start()
if (ssc.isCheckpointPresent) {
restart()
} else {
startFirstTime()
}
}
- Creates and starts an EventLoop[JobGeneratorEvent] (eventLoop.start()), which handles JobGeneratorEvent events.
- Calls startFirstTime() (no checkpoint is present on a fresh start).
Note: this eventLoop is not the one from section 2.2. This one processes JobGeneratorEvent, while the eventLoop in 2.2 processes JobSchedulerEvent.
2.5 First, following startFirstTime() from 2.4
- Enter org.apache.spark.streaming.scheduler.JobGenerator.scala
/** Starts the generator for the first time */
private def startFirstTime() {
val startTime = new Time(timer.getStartTime())
graph.start(startTime - graph.batchDuration)
timer.start(startTime.milliseconds)
logInfo("Started JobGenerator at " + startTime)
}
- Starts the DStreamGraph: graph.start(startTime - graph.batchDuration)
- Starts the RecurringTimer: timer.start(startTime.milliseconds)
2.5.1 Following graph.start(startTime - graph.batchDuration)
- Enter org.apache.spark.streaming.DStreamGraph.scala
def start(time: Time) {
this.synchronized {
require(zeroTime == null, "DStream graph computation already started")
zeroTime = time
startTime = time
outputStreams.foreach(_.initialize(zeroTime))
outputStreams.foreach(_.remember(rememberDuration))
outputStreams.foreach(_.validateAtStart)
inputStreams.par.foreach(_.start())
}
}
- Initializes the outputStreams
- Starts the inputStreams (inputStreams.par.foreach(_.start())); in this example that is DirectKafkaInputDStream.start()
- Enter org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.scala
override def start(): Unit = {
val c = consumer
paranoidPoll(c)
if (currentOffsets.isEmpty) {
currentOffsets = c.assignment().asScala.map { tp =>
tp -> c.position(tp)
}.toMap
}
// don't actually want to consume any messages, so pause all partitions
c.pause(currentOffsets.keySet.asJava)
}
- Creates the KafkaConsumer
- Performs one poll (paranoidPoll) to refresh the consumer's offsets
- Initializes currentOffsets from the consumer's current positions
- Pauses consumption on all assigned partitions (the sketch below shows the same steps against the raw consumer API)
Note: how the KafkaConsumer actually fetches data will be covered when we study the Kafka source code.
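As a rough illustration only (not the actual DirectKafkaInputDStream code), the same start-up steps expressed with the plain KafkaConsumer API, reusing the kafkaParams and topic from section 1:
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(kafkaParams);
consumer.subscribe(Arrays.asList("exampleTopic"));
consumer.poll(0); // one poll just to trigger partition assignment, not to consume records
Map<TopicPartition, Long> currentOffsets = new HashMap<>();
for (TopicPartition tp : consumer.assignment()) {
    currentOffsets.put(tp, consumer.position(tp)); // remember the starting position per partition
}
consumer.pause(consumer.assignment()); // pause everything so the driver fetches no data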
2.5.2 Following timer.start(startTime.milliseconds)
- Enter org.apache.spark.streaming.util.RecurringTimer.scala
/**
* Start at the given start time.
*/
def start(startTime: Long): Long = synchronized {
nextTime = startTime
thread.start()
logInfo("Started timer for " + name + " at time " + nextTime)
nextTime
}
private val thread = new Thread("RecurringTimer - " + name) {
setDaemon(true)
override def run() { loop }
}
/**
* Repeatedly call the callback every interval.
*/
private def loop() {
try {
while (!stopped) {
triggerActionForNextInterval()
}
triggerActionForNextInterval()
} catch {
case e: InterruptedException =>
}
}
private def triggerActionForNextInterval(): Unit = {
clock.waitTillTime(nextTime)
callback(nextTime)
prevTime = nextTime
nextTime += period
logDebug("Callback for " + name + " called at time " + prevTime)
}
- Creates and starts a new daemon thread
- The thread repeatedly calls triggerActionForNextInterval
- triggerActionForNextInterval waits until the next batch time and then invokes the callback passed to the RecurringTimer, which is defined in JobGenerator as:
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
- Inside this callback, a GenerateJobs event (a JobGeneratorEvent) is posted to the eventQueue of the EventLoop[JobGeneratorEvent]
2.6 Next, following eventLoop.start() from 2.4
- Enter org.apache.spark.util.EventLoop.scala
def start(): Unit = {
if (stopped.get) {
throw new IllegalStateException(name + " has already been stopped")
}
// Call onStart before starting the event thread to make sure it happens before onReceive
onStart()
eventThread.start()
}
- Starts the eventThread
- The thread's run() method:
private val eventThread = new Thread(name) {
setDaemon(true)
override def run(): Unit = {
try {
while (!stopped.get) {
val event = eventQueue.take()
try {
onReceive(event)
} catch {
case NonFatal(e) =>
try {
onError(e)
} catch {
case NonFatal(e) => logError("Unexpected error in " + name, e)
}
}
}
} catch {
case ie: InterruptedException => // exit even if eventQueue is not empty
case NonFatal(e) => logError("Unexpected error in " + name, e)
}
}
}
- The eventThread watches eventLoop.eventQueue and keeps taking events from it
- Each event is handed to eventLoop.onReceive(event) for processing
Note: in 2.5.2, when triggerActionForNextInterval invoked the RecurringTimer callback, a JobGeneratorEvent was already placed in the eventQueue, so the event taken and processed here is exactly that GenerateJobs event. The timer thread and the eventThread thus form a simple producer/consumer pair, sketched below.
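A hypothetical sketch of that producer/consumer relationship using a plain java.util.concurrent BlockingQueue; the strings and names here are placeholders, not Spark classes:
BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();

// Producer: what the RecurringTimer callback does once per batch interval
Runnable timerCallback = () -> eventQueue.offer("GenerateJobs@" + System.currentTimeMillis());

// Consumer: what the EventLoop's eventThread does
Thread eventThread = new Thread(() -> {
    try {
        while (!Thread.currentThread().isInterrupted()) {
            String event = eventQueue.take(); // blocks until the timer posts an event
            System.out.println("onReceive -> " + event); // stands in for processEvent(event)
        }
    } catch (InterruptedException ignored) { }
});
eventThread.setDaemon(true);
eventThread.start();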
2.7 Looking at the definition of onReceive(event)
- Back in org.apache.spark.streaming.scheduler.JobGenerator.scala
eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)
override protected def onError(e: Throwable): Unit = {
jobScheduler.reportError("Error in job generator", e)
}
}
onReceive() delegates to JobGenerator.processEvent(JobGeneratorEvent) to handle the event:
private def processEvent(event: JobGeneratorEvent) {
logDebug("Got event " + event)
event match {
case GenerateJobs(time) => generateJobs(time)
case ClearMetadata(time) => clearMetadata(time)
case DoCheckpoint(time, clearCheckpointDataLater) =>
doCheckpoint(time, clearCheckpointDataLater)
case ClearCheckpointData(time) => clearCheckpointData(time)
}
}
JobGenerator.processEvent(JobGeneratorEvent) handles four kinds of events:
- GenerateJobs
- ClearMetadata
- DoCheckpoint
- ClearCheckpointData
The event taken from eventLoop.eventQueue earlier was GenerateJobs, so it matches the first case.
2.8 Following generateJobs(time)
- Enter org.apache.spark.streaming.scheduler.JobGenerator.scala
private def generateJobs(time: Time) {
// Checkpoint all RDDs marked for checkpointing to ensure their lineages are
// truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
Try {
jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
graph.generateJobs(time) // generate jobs using allocated block
} match {
case Success(jobs) =>
val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
case Failure(e) =>
jobScheduler.reportError("Error generating jobs for time " + time, e)
PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
}
eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
- The receiverTracker allocates the received blocks to the current batch
- The DStreamGraph generates the jobs for the current batch
- If job generation succeeds, the JobSet is submitted
2.8.1 Following jobScheduler.receiverTracker.allocateBlocksToBatch(time)
- Enter org.apache.spark.streaming.scheduler.ReceiverTracker.scala
def allocateBlocksToBatch(batchTime: Time): Unit = {
if (receiverInputStreams.nonEmpty) {
receivedBlockTracker.allocateBlocksToBatch(batchTime)
}
}
- allocateBlocksToBatch only runs when receiverInputStreams is non-empty
- In this example the input is a DirectKafkaInputDStream, not a ReceiverInputDStream, so nothing happens here
2.8.2 Following graph.generateJobs(time)
- Enter org.apache.spark.streaming.DStreamGraph.scala
def generateJobs(time: Time): Seq[Job] = {
logDebug("Generating jobs for time " + time)
val jobs = this.synchronized {
outputStreams.flatMap { outputStream =>
val jobOption = outputStream.generateJob(time)
jobOption.foreach(_.setCallSite(outputStream.creationSite))
jobOption
}
}
logDebug("Generated " + jobs.length + " jobs for time " + time)
jobs
}
- Iterates over the outputStreams; in this example there is only a ForEachDStream
- Each outputStream produces one Job
Since ForEachDStream overrides generateJob, the method that runs is ForEachDStream.generateJob():
- Enter org.apache.spark.streaming.dstream.ForEachDStream.scala
override def generateJob(time: Time): Option[Job] = {
parent.getOrCompute(time) match {
case Some(rdd) =>
val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
foreachFunc(rdd, time)
}
Some(new Job(time, jobFunc))
case None => None
}
}
- Calls getOrCompute(time) on its parent
- If an RDD is returned, it builds jobFunc, a closure that, when run, calls foreachFunc(rdd, time) inside createRDDWithLocalProperties (a curried function)
- Finally creates new Job(time, jobFunc)
In this example, the parent stored in the ForEachDStream instance is a FilteredDStream.
FilteredDStream does not override getOrCompute(time), so we look at its superclass DStream:
- Enter org.apache.spark.streaming.dstream.DStream.scala
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
// If RDD was already generated, then retrieve it from HashMap,
// or else compute the RDD
generatedRDDs.get(time).orElse {
// Compute the RDD if time is valid (e.g. correct time in a sliding window)
// of RDD generation, else generate nothing.
if (isTimeValid(time)) {
val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
// Disable checks for existing output directories in jobs launched by the streaming
// scheduler, since we may need to write output to an existing directory during checkpoint
// recovery; see SPARK-4835 for more details. We need to have this call here because
// compute() might cause Spark jobs to be launched.
PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
compute(time)
}
}
rddOption.foreach { case newRDD =>
// Register the generated RDD for caching and checkpointing
if (storageLevel != StorageLevel.NONE) {
newRDD.persist(storageLevel)
logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
}
if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
newRDD.checkpoint()
logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
}
generatedRDDs.put(time, newRDD)
}
rddOption
} else {
None
}
}
}
- Looks up the RDD for this batch time in generatedRDDs (a HashMap[Time, RDD[T]])
- If there is no cached RDD, compute(time) is called to build one
- The newly created RDD is then put into generatedRDDs, making this a compute-once-per-batch cache (see the sketch below)
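The same get-or-compute caching idea, sketched with a plain Java map; generatedRdds and the lambda body are placeholders, not Spark APIs:
Map<Long, List<String>> generatedRdds = new HashMap<>();

List<String> getOrCompute(long batchTime) {
    // Return the value cached for this batch time, or compute and cache it exactly once
    return generatedRdds.computeIfAbsent(batchTime,
            time -> Collections.singletonList("records computed for batch " + time));
}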
- Enter org.apache.spark.streaming.dstream.FilteredDStream.scala
override def compute(validTime: Time): Option[RDD[T]] = {
parent.getOrCompute(validTime).map(_.filter(filterFunc))
}
- Calls parent.getOrCompute(validTime)
- Maps over the returned Option and applies filter(filterFunc) to the RDD inside it
Likewise, in this example the parent of the FilteredDStream is a TransformedDStream
- Enter org.apache.spark.streaming.dstream.TransformedDStream.scala
override def compute(validTime: Time): Option[RDD[U]] = {
val parentRDDs = parents.map { parent => parent.getOrCompute(validTime).getOrElse(
// Guard out against parent DStream that return None instead of Some(rdd) to avoid NPE
throw new SparkException(s"Couldn't generate RDD from parent at time $validTime"))
}
val transformedRDD = transformFunc(parentRDDs, validTime)
if (transformedRDD == null) {
throw new SparkException("Transform function must not return null. " +
"Return SparkContext.emptyRDD() instead to represent no element " +
"as the result of transformation.")
}
Some(transformedRDD)
}
- Calls parent.getOrCompute(validTime) on each of its parents
- Applies transformFunc(parentRDDs, validTime) to the returned RDDs
In this example, the parent of the TransformedDStream is the DirectKafkaInputDStream
- Enter org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.scala
override def compute(validTime: Time): Option[KafkaRDD[K, V]] = {
val untilOffsets = clamp(latestOffsets())
val offsetRanges = untilOffsets.map { case (tp, uo) =>
val fo = currentOffsets(tp)
OffsetRange(tp.topic, tp.partition, fo, uo)
}
val rdd = new KafkaRDD[K, V](
context.sparkContext, executorKafkaParams, offsetRanges.toArray, getPreferredHosts, true)
// Report the record number and metadata of this batch interval to InputInfoTracker.
val description = offsetRanges.filter { offsetRange =>
// Don't display empty ranges.
offsetRange.fromOffset != offsetRange.untilOffset
}.map { offsetRange =>
s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
}.mkString("\n")
// Copy offsetRanges to immutable.List to prevent from being modified by the user
val metadata = Map(
"offsets" -> offsetRanges.toList,
StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
val inputInfo = StreamInputInfo(id, rdd.count, metadata)
ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
currentOffsets = untilOffsets
commitAll()
Some(rdd)
}
- Computes the OffsetRanges and creates a KafkaRDD
- Reports the record count and metadata of this batch to the InputInfoTracker
- Commits any queued offsets to Kafka (commitAll(); see the sketch after this list for how an application queues offsets)
Note: how the KafkaConsumer commits offsets will be covered when we study the Kafka source code.
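commitAll() only flushes offsets that the application has queued beforehand. With enable.auto.commit set to false, as in section 1, a common pattern from the kafka010 integration API looks roughly like this sketch (error handling omitted; not part of the article's original example):
inputDStream.foreachRDD(rdd -> {
    // The RDD produced by the direct stream carries its Kafka offset ranges
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

    // ... process the batch here ...

    // Queue the offsets; DirectKafkaInputDStream.commitAll() sends them during a later batch
    ((CanCommitOffsets) inputDStream.inputDStream()).commitAsync(offsetRanges);
});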
To summarize:
1. In this example, the DStreamGraph holds the inputStreams (the DirectKafkaInputDStream) and the outputStreams (the ForEachDStream)
2. While the processing chain was built, each DStream recorded its upstream operation in its parent field, yielding the chain 'DirectKafkaInputDStream - TransformedDStream - FilteredDStream - ForEachDStream'
3. The DStreamGraph starts job creation from the outputStreams
-- ForEachDStream calls parent.getOrCompute(time) inside generateJob()
-- FilteredDStream calls parent.getOrCompute(time) inside compute()
-- TransformedDStream calls parent.getOrCompute(time) inside compute()
-- DirectKafkaInputDStream creates a KafkaRDD inside compute()
4. The calls then unwind, and each parent.getOrCompute(time) caller finishes its work on the way back up
-- TransformedDStream applies transformFunc to the KafkaRDD, producing newRDD1
-- FilteredDStream applies filterFunc to newRDD1, producing newRDD2
-- each of these RDDs is cached in the owning DStream's generatedRDDs by getOrCompute
-- ForEachDStream wraps foreachFunc(newRDD2, time) into jobFunc; foreachFunc itself runs only when the Job is executed later
5. Job creation finishes with new Job(time, jobFunc)
2.8.3 Following jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
- Enter org.apache.spark.streaming.scheduler.JobScheduler.scala
def submitJobSet(jobSet: JobSet) {
if (jobSet.jobs.isEmpty) {
logInfo("No jobs added for time " + jobSet.time)
} else {
listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
jobSets.put(jobSet.time, jobSet)
jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
logInfo("Added jobs for time " + jobSet.time)
}
}
- Wraps each Job in a JobHandler
- Submits the JobHandler to the jobExecutor thread pool (not a Spark executor)
jobExecutor is a thread pool:
private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
private val jobExecutor = ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
The JobHandler's run() method is then executed on this jobExecutor pool.
Note:
1. All of this still runs on the driver. Spark Streaming batch backlog is exactly this: JobHandlers waiting in the thread pool's internal queue because earlier batches have not finished.
2. spark.streaming.concurrentJobs sets both the corePoolSize and maximumPoolSize of this pool, so a value greater than 1 lets multiple JobHandlers run concurrently (see the sketch below).
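A minimal sketch of raising that setting; the value 2 is only an example, and running jobs from different batches concurrently can weaken ordering guarantees between batches:
SparkConf conf = new SparkConf()
        .setAppName("SparkStreaming_demo")
        .set("spark.streaming.stopGracefullyOnShutdown", "true")
        // Let up to 2 streaming jobs run at the same time in the driver's jobExecutor pool
        .set("spark.streaming.concurrentJobs", "2");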
- Enter org.apache.spark.streaming.scheduler.JobScheduler.scala
private class JobHandler(job: Job) extends Runnable with Logging {
import JobScheduler._
def run() {
val oldProps = ssc.sparkContext.getLocalProperties
try {
ssc.sparkContext.setLocalProperties(SerializationUtils.clone(ssc.savedProperties.get()))
val formattedTime = UIUtils.formatBatchTime(
job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"
ssc.sc.setJobDescription(
s"""Streaming job from $batchLinkText""")
ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)
// Checkpoint all RDDs marked for checkpointing to ensure their lineages are
// truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
// We need to assign `eventLoop` to a temp variable. Otherwise, because
// `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
// it's possible that when `post` is called, `eventLoop` happens to null.
var _eventLoop = eventLoop
if (_eventLoop != null) {
_eventLoop.post(JobStarted(job, clock.getTimeMillis()))
// Disable checks for existing output directories in jobs launched by the streaming
// scheduler, since we may need to write output to an existing directory during checkpoint
// recovery; see SPARK-4835 for more details.
PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
job.run()
}
_eventLoop = eventLoop
if (_eventLoop != null) {
_eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
}
} else {
// JobScheduler has been stopped.
}
} finally {
ssc.sparkContext.setLocalProperties(oldProps)
}
}
}
In the run() method:
- The Job is wrapped in a JobStarted event (a JobSchedulerEvent) and posted to the eventQueue
- The job is run: job.run()
- The Job is wrapped in a JobCompleted event (a JobSchedulerEvent) and posted to the eventQueue
- Since the JobScheduler's eventLoop was started earlier (section 2.2), it keeps taking these events from the eventQueue and handles each JobSchedulerEvent in its onReceive method
- Enter org.apache.spark.streaming.scheduler.Job.scala
def run() {
_result = Try(func())
}
job.run() executes func(), the function supplied when the Job was created, i.e. the jobFunc built in org.apache.spark.streaming.dstream.ForEachDStream.scala in section 2.8.2:
override def generateJob(time: Time): Option[Job] = {
parent.getOrCompute(time) match {
case Some(rdd) =>
val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
foreachFunc(rdd, time)
}
Some(new Job(time, jobFunc))
case None => None
}
}
jobFunc invokes foreachFunc(rdd, time), which in this example is:
javaRDD.saveAsTextFile("/home/example/result/");
Internally, saveAsTextFile calls self.context.runJob(self, writeToFile) to run the job, and that runJob is SparkContext.runJob, i.e. the ordinary Spark job-submission path.
At this point Spark Streaming hands off to core Spark.
Summary
- Start the JobScheduler, which starts an internal EventLoop for handling JobSchedulerEvent events
- Start the JobGenerator, which starts its own EventLoop for handling JobGeneratorEvent events
- Start the RecurringTimer, which posts a GenerateJobs event (a JobGeneratorEvent) to the eventQueue once per batch interval
- A dedicated thread keeps taking GenerateJobs events from the eventQueue and lets the JobGenerator's EventLoop process them
- Using the DStreamGraph and the processing chain built earlier, jobs are generated for the batch
- Each Job is wrapped in a JobHandler and submitted to a thread pool on the driver, where it queues until it is scheduled
- Inside JobHandler.run(), a JobStarted event (a JobSchedulerEvent) is posted to the eventQueue to mark the start of the job
- Inside JobHandler.run(), job.run() invokes the jobFunc captured when the Job was built, which ultimately calls SparkContext.runJob to submit the work (the hand-off to core Spark)
- Inside JobHandler.run(), a JobCompleted event (a JobSchedulerEvent) is posted to the eventQueue to mark completion