Example code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

def rddBasics: Unit = {
  val sparkConf: SparkConf = new SparkConf().setAppName("rdd basics implement")
  val sparkContext: SparkContext = SparkContext.getOrCreate(sparkConf)
  val rdd: RDD[Int] = sparkContext.parallelize(Array(1, 2, 3, 4, 5, 6))
  rdd.count()
}
The actual submission of this job starts from the action count. count triggers SparkContext's runJob method to submit the job; the submission happens through this internal call to runJob. The source of count:
/** Return the number of elements in the RDD*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
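Other actions follow the same pattern and also funnel into sc.runJob. A paraphrased sketch of collect (not verbatim source, shown only to illustrate the pattern):
def collect(): Array[T] = {
  // runJob computes every partition and returns one array per partition
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}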
RDDs form a directed acyclic graph (DAG) according to their dependencies, and this DAG is handed to the DAGScheduler for processing. After a few chained calls, SparkContext's runJob reaches DAGScheduler's runJob. SparkContext's runJob method:
/** Run a function on a given set of partitions in an RDD and pass the results to the given
* handler function. This is the main entry point for all actions in Spark.*/
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
if (stopped.get()) {
throw new IllegalStateException("SparkContext has been shutdown")
}
val callSite = getCallSite // record the call site (the method call stack)
val cleanedFunc = clean(func) // clean the closure so that the function can be serialized
logInfo("Starting job: " + callSite.shortForm)
if (conf.getBoolean("spark.logLineage", false)) {
logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
}
/** Call the high-level scheduler DAGScheduler's runJob method. */
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
progressBar.foreach(_.finishAll())
rdd.doCheckpoint()
}
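The "few chained calls" mentioned above are overloads of SparkContext.runJob. A paraphrased sketch (not verbatim source) of how the single-argument function passed by count is widened into the four-argument form shown above:
// count passes a function of type Iterator[T] => U; this overload runs it on all partitions
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] =
  runJob(rdd, func, 0 until rdd.partitions.length)
// the TaskContext parameter is added by wrapping func
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U, partitions: Seq[Int]): Array[U] =
  runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => func(it), partitions)
// results from each partition are collected into an array by a result handler
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int]): Array[U] = {
  val results = new Array[U](partitions.size)
  runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res) // the overload shown above
  results
}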
The DAGScheduler.runJob method that SparkContext calls:
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
val start = System.nanoTime
/** submitJob is called here; runJob then blocks on waiter.awaitResult() until the job finishes or fails.
 * submitJob returns a JobWaiter object, and the submission itself is posted as an internal message to
 * DAGScheduler's nested class DAGSchedulerEventProcessLoop for processing. */
val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
waiter.awaitResult() match {
case JobSucceeded =>
logInfo("Job %d finished: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
case JobFailed(exception: Exception) =>
logInfo("Job %d failed: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
// SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
val callerStackTrace = Thread.currentThread().getStackTrace.tail
exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
throw exception
}
}
DAGScheduler's submitJob method, which returns a JobWaiter object:
def submitJob[T, U](
    rdd: RDD[_],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
// Check that the partitions to be processed exist; if any do not, throw an exception.
val maxPartitions = rdd.partitions.length
partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
throw new IllegalArgumentException(
"Attempting to access a non-existent partition: " + p + ". " + "Total number of partitions: " + maxPartitions)}
// If the job contains zero tasks, create a JobWaiter for zero tasks and return immediately.
val jobId = nextJobId.getAndIncrement()
if (partitions.size == 0) {
// Return immediately if the job is running 0 tasks
return new JobWaiter[U](this, jobId, 0, resultHandler)
}
assert(partitions.size > 0)
// Create a JobWaiter that waits for the job to finish; the job is submitted through the nested event-processing class
val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
SerializationUtils.clone(properties)))
waiter
}
In DAGScheduler's submitJob method, rdd.partitions.length is first read to verify that the requested partitions actually exist. The submission is then handed over as an internal message: eventProcessLoop.post sends a JobSubmitted event to DAGSchedulerEventProcessLoop for processing. JobSubmitted is a case class rather than a case object, because an application contains many jobs and each job needs its own JobSubmitted instance with its own contents; with a case object there would be only one shared, identical instance. The source of the case class JobSubmitted:
private[scheduler] case class JobSubmitted(
jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties = null)
extends DAGSchedulerEvent
JobSubmitted is declared private[scheduler], so user code cannot post it directly. It encapsulates the jobId, the final RDD (finalRDD), the function to apply to the RDD, the partitions that participate in the computation, the job listener, the properties, and so on.
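A minimal standalone sketch (the names here are illustrative, not Spark's) of why per-job events need to be case classes while parameterless triggers can be case objects:
sealed trait SchedulerEvent
// each submitted job carries its own payload, so a case class gives distinct instances per job
case class DemoJobSubmitted(jobId: Int, numPartitions: Int) extends SchedulerEvent
// a pure trigger with no payload can be a single shared case object
case object DemoResubmitFailedStages extends SchedulerEvent

object EventDemo extends App {
  val e1 = DemoJobSubmitted(jobId = 0, numPartitions = 4)
  val e2 = DemoJobSubmitted(jobId = 1, numPartitions = 8)
  println(e1 == e2)                                             // false: different jobs, different contents
  println(DemoResubmitFailedStages eq DemoResubmitFailedStages) // true: always the same single instance
}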
An action triggers SparkContext.runJob, which ultimately leads to DAGScheduler.submitJob; at its core, submitJob simply sends a JobSubmitted case class instance to eventProcessLoop. eventProcessLoop is an instance of the DAGSchedulerEventProcessLoop class:
private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
DAGSchedulerEventProcessLoop extends the abstract class EventLoop, so it must implement EventLoop's abstract onReceive method. The onReceive method overridden in DAGSchedulerEventProcessLoop:
override def onReceive(event: DAGSchedulerEvent): Unit = {
val timerContext = timer.time()
try {
doOnReceive(event)
} finally {
timerContext.stop()
}
}
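EventLoop itself is essentially a blocking queue plus a daemon thread that keeps taking events off the queue and dispatching them to onReceive; post simply enqueues an event. A simplified standalone sketch of this pattern (not the actual Spark EventLoop source):
import java.util.concurrent.LinkedBlockingDeque

abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (!stopped) {
        try {
          val event = eventQueue.take() // block until an event is posted
          onReceive(event)              // dispatch, as DAGSchedulerEventProcessLoop does
        } catch {
          case _: InterruptedException =>      // woken up by stop()
          case t: Throwable => onError(t)
        }
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped = true; eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event) // what eventProcessLoop.post does

  protected def onReceive(event: E): Unit          // implemented by subclasses
  protected def onError(e: Throwable): Unit = e.printStackTrace()
}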
When doOnReceive in DAGSchedulerEventProcessLoop receives the JobSubmitted case class and pattern-matches it, it calls DAGScheduler's handleJobSubmitted method to submit the job; stage division is completed inside handleJobSubmitted. The onReceive method of DAGSchedulerEventProcessLoop mainly delegates to doOnReceive:
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)
case StageCancelled(stageId) =>
dagScheduler.handleStageCancellation(stageId)
case JobCancelled(jobId) =>
dagScheduler.handleJobCancellation(jobId)
case JobGroupCancelled(groupId) =>
dagScheduler.handleJobGroupCancelled(groupId)
case AllJobsCancelled =>
dagScheduler.doCancelAllJobs()
case ExecutorAdded(execId, host) =>
dagScheduler.handleExecutorAdded(execId, host)
case ExecutorLost(execId) =>
dagScheduler.handleExecutorLost(execId, fetchFailed = false)
case BeginEvent(task, taskInfo) =>
dagScheduler.handleBeginEvent(task, taskInfo)
case GettingResultEvent(taskInfo) =>
dagScheduler.handleGetTaskResult(taskInfo)
case completion @ CompletionEvent(task, reason, _, _, taskInfo, taskMetrics) =>
dagScheduler.handleTaskCompletion(completion)
case TaskSetFailed(taskSet, reason, exception) =>
dagScheduler.handleTaskSetFailed(taskSet, reason, exception)
case ResubmitFailedStages =>
dagScheduler.resubmitFailedStages()
}
That is how a Spark job is submitted. To summarize:
Inside DAGScheduler a chain of method calls takes place. First, runJob calls submitJob to carry the submission forward and then blocks until the job finishes or fails. Next, submitJob creates a JobWaiter object and, through internal message passing, posts the submission to DAGScheduler's nested class DAGSchedulerEventProcessLoop. Finally, when DAGSchedulerEventProcessLoop's onReceive method receives and pattern-matches the JobSubmitted case class, it calls DAGScheduler's handleJobSubmitted method to submit the job; that method is where the stages of tasks are divided.
Stage division in Spark is implemented by DAGScheduler. Starting from the final RDD, DAGScheduler traverses the whole dependency tree breadth-first and divides it into stages. The division criterion is whether a dependency is a wide dependency (ShuffleDependency): when an RDD operation involves a shuffle, that shuffle becomes the boundary between two adjacent stages.
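For example, in the small job below (assuming the sparkContext from the earlier example), reduceByKey introduces a ShuffleDependency, so the DAG is split into a ShuffleMapStage (parallelize plus map) and a ResultStage (everything after the shuffle); rdd.toDebugString prints the lineage with the shuffle boundary visible:
val words: RDD[String] = sparkContext.parallelize(Seq("a", "b", "a", "c", "b", "a"))
// map is a narrow dependency and stays in the same stage as parallelize
val pairs: RDD[(String, Int)] = words.map(word => (word, 1))
// reduceByKey needs a shuffle, so everything before it becomes a ShuffleMapStage
val counts: RDD[(String, Int)] = pairs.reduceByKey(_ + _)
println(counts.toDebugString) // prints the lineage; the ShuffledRDD marks the stage boundary
counts.collect()              // the action submits a job consisting of two stages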
In code, this starts in DAGScheduler's handleJobSubmitted method, where a ResultStage is created from the final RDD. The source of DAGScheduler's handleJobSubmitted:
private[scheduler] def handleJobSubmitted(
    jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
// Create the last stage, finalStage, from the final RDD.
var finalStage: ResultStage = null
try {
finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
case e: Exception =>
listener.jobFailed(e)
return
}
// Create the job from the final stage, finalStage
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
clearCacheLocs()
val jobSubmissionTime = clock.getTimeMillis()
jobIdToActiveJob(jobId) = job
activeJobs += job
finalStage.setActiveJob(job)
val stageIds = jobIdToStageIds(jobId).toArray
val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
listenerBus.post(
SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
submitStage(finalStage) // submit the stage (and, recursively, any missing parent stages)
submitWaitingStages()
}
handleJobSubmitted calls DAGScheduler's newResultStage method and assigns the result to finalStage. The source of newResultStage:
/**Create a ResultStage associated with the provided jobId.*/
private def newResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
val (parentStages: List[Stage], id: Int) = getParentStagesAndId(rdd, jobId)
val stage = new ResultStage(id, rdd, func, partitions, parentStages, jobId, callSite)
stageIdToStage(id) = stage
updateJobIdStageIdMaps(jobId, stage)
stage
}
newResultStage calls getParentStagesAndId, passing in the final RDD; getParentStagesAndId in turn calls getParentStages, and this is how the final stage finalStage comes to be created. The source of DAGScheduler's getParentStagesAndId method:
/**Helper function to eliminate some code re-use when creating new stages.*/
private def getParentStagesAndId(rdd: RDD[_], firstJobId: Int): (List[Stage], Int) = {
val parentStages = getParentStages(rdd, firstJobId)
val id = nextStageId.getAndIncrement()
(parentStages, id)
}
The source of DAGScheduler's getParentStages method:
/**Get or create the list of parent stages for a given RDD. The new Stages will be created with the provided firstJobId.*/
private def getParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
val parents = new HashSet[Stage]
val visited = new HashSet[RDD[_]]
// a stack of RDDs waiting to be visited; it holds RDDs reached through non-ShuffleDependency edges
val waitingForVisit = new Stack[RDD[_]]
// the traversal function: mark the RDD as visited, then handle each of its dependencies according to the dependency type
def visit(r: RDD[_]) {
if (!visited(r)) {
visited += r
// a ShuffleDependency requires a new ShuffleMap stage; getShuffleMapStage is the entry point for walking further back and dividing stages
for (dep <- r.dependencies) {
dep match {
case shufDep: ShuffleDependency[_, _, _] =>
parents += getShuffleMapStage(shufDep, firstJobId)
case _ =>
// for a non-Shuffle (narrow) dependency, push the parent RDD onto the stack of RDDs waiting to be visited
waitingForVisit.push(dep.rdd)
}
}
}
}
// walk the dependency tree backwards starting from the final RDD; parent stages exist only if the tree contains a ShuffleDependency
waitingForVisit.push(rdd)
while (waitingForVisit.nonEmpty) {
visit(waitingForVisit.pop())
}
parents.toList
}
If an RDD's dependency is of type ShuffleDependency, a ShuffleMap stage has to be created (or reused) through the getShuffleMapStage method. The source of DAGScheduler's getShuffleMapStage method:
/**Get or create a shuffle map stage for the given shuffle dependency's map side.*/
private def getShuffleMapStage(
shuffleDep: ShuffleDependency[_, _, _],
firstJobId: Int): ShuffleMapStage = {
shuffleToMapStage.get(shuffleDep.shuffleId) match {
case Some(stage) => stage
case None =>
// first create (or look up) stages for the ancestor shuffle dependencies between parent and child RDDs
getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
shuffleToMapStage(dep.shuffleId) = newOrUsedShuffleStage(dep, firstJobId)
}
// Then register current shuffleDep
val stage = newOrUsedShuffleStage(shuffleDep, firstJobId)
shuffleToMapStage(shuffleDep.shuffleId) = stage
stage
}
}
In getShuffleMapStage, when the final RDD has parent stages, the traversal walks backwards from the RDD at which the shuffle occurs and finds all the ShuffleMapStages; this is the most critical part of stage division. The algorithm is similar to getParentStages and is implemented by getAncestorShuffleDependencies, which finds every RDD whose dependency is a wide dependency; newOrUsedShuffleStage is then used to create all of the shuffle stages.
The source of DAGScheduler's getAncestorShuffleDependencies method:
/** Find ancestor shuffle dependencies that are not registered in shuffleToMapStage yet */
private def getAncestorShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
val parents = new Stack[ShuffleDependency[_, _, _]]
val visited = new HashSet[RDD[_]]
// a stack of RDDs waiting to be visited (RDDs reached through non-Shuffle dependencies)
val waitingForVisit = new Stack[RDD[_]]
def visit(r: RDD[_]) {
if (!visited(r)) {
visited += r
for (dep <- r.dependencies) {
dep match {
// a ShuffleDependency serves as the boundary of a ShuffleMap stage
case shufDep: ShuffleDependency[_, _, _] =>
if (!shuffleToMapStage.contains(shufDep.shuffleId)) {
parents.push(shufDep)
}
case _ =>
}
waitingForVisit.push(dep.rdd)
}
}
}
// walk the dependency tree backwards and collect every ShuffleDependency as a stage-division boundary
waitingForVisit.push(rdd)
while (waitingForVisit.nonEmpty) {
visit(waitingForVisit.pop())
}
parents
}
Once all stages have been divided, dependencies are established between them. These dependencies are recorded in each stage's parents: List[Stage] property, from which all ancestor stages of the current stage can be obtained, so that stages can be submitted for execution in the right order. Stage division is a central part of Spark job execution; the detailed steps are:
When a job is submitted through SparkContext, DAGScheduler's handleJobSubmitted handles it; it calls newResultStage, which starts from the final RDD and calls getParentStages;
getParentStages checks whether the RDD's dependency tree contains a shuffle; if so, getShuffleMapStage is called, and it in turn calls getAncestorShuffleDependencies to collect the RDD's ancestor shuffle dependencies;
getAncestorShuffleDependencies walks the RDDs backwards to find any further shuffle dependencies on each branch, and newOrUsedShuffleStage is then called to create the corresponding ShuffleMapStage for each of them, as illustrated by the sketch below.
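The traversal used by getParentStages and getAncestorShuffleDependencies can be illustrated with a standalone sketch over a toy dependency graph, where wide edges play the role of ShuffleDependency (illustrative code only, not Spark's implementation):
import scala.collection.mutable

// a toy RDD node: each dependency is either narrow or wide (i.e. a shuffle)
case class Node(name: String, narrowDeps: Seq[Node] = Nil, wideDeps: Seq[Node] = Nil)

/** Return the names of nodes sitting directly behind a wide (shuffle) dependency,
 *  i.e. the parents that would become separate ShuffleMap stages. */
def parentStageRoots(finalNode: Node): Set[String] = {
  val parents = mutable.Set[String]()
  val visited = mutable.Set[Node]()
  val waitingForVisit = mutable.Stack[Node](finalNode)
  while (waitingForVisit.nonEmpty) {
    val node = waitingForVisit.pop()
    if (!visited(node)) {
      visited += node
      parents ++= node.wideDeps.map(_.name)         // a wide dependency marks a stage boundary
      node.narrowDeps.foreach(waitingForVisit.push)  // narrow dependencies stay in the same stage
    }
  }
  parents.toSet
}

// lineage: source --narrow--> mapped --wide--> reduced --narrow--> result
val source  = Node("source")
val mapped  = Node("mapped", narrowDeps = Seq(source))
val reduced = Node("reduced", wideDeps = Seq(mapped))
val result  = Node("result", narrowDeps = Seq(reduced))
println(parentStageRoots(result)) // Set(mapped): the stage ending at "mapped" is a parent stage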
Submitting stages
In DAGScheduler's handleJobSubmitted method, the dependencies between all stages are established while finalStage is created; a job instance (ActiveJob) is then built from finalStage, and the stages of that job are submitted for execution in order. During execution, the listener bus is used to track the progress of the job and its stages.
Stages are submitted by DAGScheduler's submitStage method; its source is as follows:
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
// get the parent stages of this stage; this is done by walking the RDD dependencies backwards looking for shuffles, not by using the stage dependency graph itself
val missing = getMissingParentStages(stage).sortBy(_.id)
if (missing.isEmpty) {
// if there is no missing parent stage, submit this stage for execution directly
submitMissingTasks(stage, jobId.get)
} else {
// if parent stages are missing, add this stage to the waiting list and recursively call submitStage on each parent, until stages with no parents are reached
for (parent <- missing) {
submitStage(parent)
}
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
When stage submission starts, submitStage calls getMissingParentStages to obtain the parent stages of finalStage. If there are none, the stage is submitted for execution with submitMissingTasks; otherwise the stage is placed into the waitingStages list and submitStage is called recursively on the parents. In this way, stages that still have unfinished parents end up in waitingStages, and stages without parents become the entry points of job execution.
When an entry stage finishes, the subsequent stages are submitted. Before a stage is submitted, the results of the parent stages it depends on are checked for availability (i.e. whether they ran successfully); if they are all available the stage is submitted, otherwise the parent stages whose results are unavailable are resubmitted. The availability check happens when a ShuffleMapTask completes: DAGScheduler checks whether all tasks of the stage have finished. If some tasks failed, the stage is resubmitted; if all tasks finished, the waiting stage list is scanned, and any waiting stage whose parent stages are all complete is ready, so its tasks are generated and submitted for execution.
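The recursive "submit missing parents first, park the child until they finish" logic of submitStage can be illustrated in isolation (an illustrative sketch, not Spark code):
import scala.collection.mutable

case class DemoStage(id: Int, parents: List[DemoStage])

val waiting = mutable.Set[DemoStage]()     // plays the role of waitingStages
val submitted = mutable.ListBuffer[Int]()  // ids of stages submitted for execution, in order

def submitDemoStage(stage: DemoStage): Unit = {
  val missing = stage.parents.filterNot(p => submitted.contains(p.id)).sortBy(_.id)
  if (missing.isEmpty) {
    submitted += stage.id            // no missing parents: this stage can run now
  } else {
    missing.foreach(submitDemoStage) // recurse into the parents first
    waiting += stage                 // park this stage until its parents have completed
  }
}

val stage0 = DemoStage(0, Nil)
val stage1 = DemoStage(1, List(stage0))
val stage2 = DemoStage(2, List(stage1))
submitDemoStage(stage2)
println(submitted)         // ListBuffer(0): only the root stage is submitted immediately
println(waiting.map(_.id)) // contains 1 and 2: the others wait until their parents finish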
Although task submission starts from DAGScheduler's submitStage method, the actual work is done by submitMissingTasks, which submitStage calls. Once a stage has been submitted for execution, DAGScheduler's submitMissingTasks splits it into tasks according to the number of partitions (in general, one task per partition); these tasks form a task set (TaskSet) that is submitted to the TaskScheduler for processing. The source of submitMissingTasks:
/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
logDebug("submitMissingTasks(" + stage + ")")
// clear the set of pending partitions for this stage
stage.pendingPartitions.clear()
// figure out the indices of the partitions that still need to be computed
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
// reset the internal accumulators if they do not exist yet, or if every partition of the stage is being computed
if (stage.internalAccumulators.isEmpty || stage.numPartitions == partitionsToCompute.size) {
stage.resetInternalAccumulators()
}
// use the scheduling pool, job group, description, etc. from the ActiveJob associated with this stage
val properties = jobIdToActiveJob(jobId).properties
runningStages += stage
// the SparkListenerStageSubmitted event must be posted before the tasks are serialized
stage match {
case s: ShuffleMapStage =>
// s.numPartitions is the total number of partitions of this ShuffleMapStage; when only some partitions need to be recomputed, partitionsToCompute.size can be smaller than stage.numPartitions
outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
case s: ResultStage =>
outputCommitCoordinator.stageStart(
stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
}
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
stage match {
case s: ShuffleMapStage =>
partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
case s: ResultStage =>
val job = s.activeJob.get
partitionsToCompute.map { id =>
val p = s.partitions(id)
(id, getPreferredLocs(stage.rdd, p))
}.toMap
}
} catch {
case NonFatal(e) =>
stage.makeNewStageAttempt(partitionsToCompute.size)
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
runningStages -= stage
return
}
stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
var taskBinary: Broadcast[Array[Byte]] = null
try {
val taskBinaryBytes: Array[Byte] = stage match {
case stage: ShuffleMapStage =>
closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
case stage: ResultStage =>
closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
}
taskBinary = sc.broadcast(taskBinaryBytes) // broadcast the serialized stage/task binary to the executors
} catch {
case e: NotSerializableException =>
abortStage(stage, "Task not serializable: " + e.toString, Some(e))
runningStages -= stage
return
case NonFatal(e) =>
abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}", Some(e))
runningStages -= stage
return
}
val tasks: Seq[Task[_]] = try {
stage match {
case stage: ShuffleMapStage =>
partitionsToCompute.map { id =>
val locs = taskIdToLocations(id)
val part = stage.rdd.partitions(id)
new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, stage.internalAccumulators)
}
case stage: ResultStage =>
val job = stage.activeJob.get
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = stage.rdd.partitions(p)
val locs = taskIdToLocations(id)
new ResultTask(stage.id, stage.latestInfo.attemptId,
taskBinary, part, locs, id, stage.internalAccumulators)
}
}
} catch {
case NonFatal(e) =>
abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
runningStages -= stage
return
}
if (tasks.size > 0) {
logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
stage.pendingPartitions ++= tasks.map(_.partitionId)
logDebug("New pending partitions: " + stage.pendingPartitions)
// submit these tasks to the TaskScheduler as a TaskSet
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
} else {
markStageAsFinished(stage, None) // if the stage has no tasks to run, mark it as finished
val debugString = stage match {
case stage: ShuffleMapStage =>
s"Stage ${stage} is actually done; " +
s"(available: ${stage.isAvailable}," +
s"available outputs: ${stage.numAvailableOutputs}," +
s"partitions: ${stage.numPartitions})"
case stage : ResultStage =>
s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
}
logDebug(debugString)
}
}
The submitTasks method of the TaskScheduler trait has no body. TaskSchedulerImpl extends TaskScheduler, so submitTasks is implemented there: TaskSchedulerImpl overrides TaskScheduler's submitTasks method. The implementation of submitTasks in TaskSchedulerImpl:
override def submitTasks(taskSet: TaskSet) {
val tasks = taskSet.tasks
logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
this.synchronized {
// create a TaskSetManager to manage the lifecycle of this task set
val manager = createTaskSetManager(taskSet, maxTaskFailures)
val stage = taskSet.stageId
val stageTaskSets =
taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
stageTaskSets(taskSet.stageAttemptId) = manager
val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
ts.taskSet != taskSet && !ts.isZombie
}
if (conflictingTaskSet) {
throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
}
// add the task set to the scheduling pool; it is then scheduled by the system according to the configured policy (FIFO or FAIR)
schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
if (!isLocal && !hasReceivedTask) {
starvationTimer.scheduleAtFixedRate(new TimerTask() {
override def run() {
if (!hasLaunchedTask) {
logWarning("Initial job has not accepted any resources; " +
"check your cluster UI to ensure that workers are registered " +
"and have sufficient resources")
} else {
this.cancel()
}
}
}, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
}
hasReceivedTask = true
}
backend.reviveOffers() // the SchedulerBackend sends a ReviveOffers message to the driver
}
TaskSchedulerImpl's submitTasks calls createTaskSetManager to build a TaskSetManager instance that manages the lifecycle of this task set. The TaskSetManager is placed into the system's scheduling pool via schedulableBuilder.addTaskSetManager. SchedulableBuilder is a trait implemented by FIFOSchedulableBuilder and FairSchedulableBuilder, so addTaskSetManager schedules the task set according to the scheduling algorithm configured for the system (FAIR or FIFO). createTaskSetManager itself simply invokes the TaskSetManager constructor:
private[scheduler] def createTaskSetManager(
    taskSet: TaskSet,
    maxTaskFailures: Int): TaskSetManager = {
new TaskSetManager(this, taskSet, maxTaskFailures)
}
The system provides two scheduling modes, FIFO and FAIR, implemented through the buildPools and addTaskSetManager methods of the SchedulableBuilder trait. FIFOSchedulableBuilder and FairSchedulableBuilder both extend SchedulableBuilder and override these two methods.
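A hedged configuration sketch: the scheduling mode is chosen with spark.scheduler.mode, FAIR pools can be described in an allocation file referenced by spark.scheduler.allocation.file, and a job can be routed to a pool through the spark.scheduler.pool local property (the file path and pool name below are illustrative):
val fairConf = new SparkConf()
  .setAppName("fair scheduling demo")
  .set("spark.scheduler.mode", "FAIR")                                  // default is FIFO
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // illustrative path
val fairSc = SparkContext.getOrCreate(fairConf)
// jobs submitted from this thread go into the named pool (pool name is illustrative)
fairSc.setLocalProperty("spark.scheduler.pool", "reporting")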
submitTasks then calls the reviveOffers method of the SchedulerBackend trait to allocate resources and run the tasks. In standalone mode, CoarseGrainedSchedulerBackend implements the SchedulerBackend trait and therefore overrides reviveOffers:
override def reviveOffers() {
driverEndpoint.send(ReviveOffers)
}
ReviveOffers itself is just an empty case object; it merely acts as a trigger for the underlying resource scheduling and is sent whenever tasks are submitted or compute resources change. The source of ReviveOffers:
// Internal messages in driver
case object ReviveOffers extends CoarseGrainedClusterMessage
After CoarseGrainedSchedulerBackend.reviveOffers has sent the ReviveOffers message, the message is received and matched in CoarseGrainedSchedulerBackend's receive method; the class also performs a series of related operations, such as overriding receiveAndReply.
In CoarseGrainedSchedulerBackend's receive method, when the DriverEndpoint receives ReviveOffers it triggers makeOffers:
override def receive: PartialFunction[Any, Unit] = {
case StatusUpdate(executorId, taskId, state, data) =>
scheduler.statusUpdate(taskId, state, data.value)
if (TaskState.isFinished(state)) {
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
executorInfo.freeCores += scheduler.CPUS_PER_TASK
makeOffers(executorId)
case None =>
logWarning(s"Ignored task status update ($taskId state $state) " +
s"from unknown executor with ID $executorId")
}
}
case ReviveOffers =>
makeOffers() // receiving the ReviveOffers case object triggers makeOffers
case KillTask(taskId, executorId, interruptThread) =>
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
executorInfo.executorEndpoint.send(KillTask(taskId, executorId, interruptThread))
case None =>
// Ignoring the task kill since the executor is not registered.
logWarning(s"Attempted to kill task $taskId for unknown executor $executorId.")
}
}
In reviveOffers, the RpcEndpointRef (how driverEndpoint is defined) sends the message to the DriverEndpoint. Whenever the DriverEndpoint learns that a new task set needs to be submitted (case ReviveOffers), that a task's completion status needs to be updated (the receive: PartialFunction[Any, Unit] method), or that a new executor needs to register (the receiveAndReply(context: RpcCallContext) method), CoarseGrainedSchedulerBackend performs scheduling, i.e. it calls makeOffers or makeOffers(executorId: String). That method first collects the executors currently available in the cluster, then lets TaskSchedulerImpl assign resources to the tasks of the task sets, and finally hands the result to launchTasks. The source of CoarseGrainedSchedulerBackend's makeOffers method:
private def makeOffers() {
// get the list of alive executors in the cluster
val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
val workOffers = activeExecutors.map { case (id, executorData) =>
new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
}.toSeq
// assign resources to the tasks of the task sets and submit them for execution
launchTasks(scheduler.resourceOffers(workOffers))
}
makeOffers calls TaskSchedulerImpl's resourceOffers method to perform the resource allocation. The source of TaskSchedulerImpl's resourceOffers method:
def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
// record each slave node's hostname (ip) and watch for newly added executors; register any that appear
var newExecAvail = false
for (o <- offers) {
executorIdToHost(o.executorId) = o.host
executorIdToTaskCount.getOrElseUpdate(o.executorId, 0)
if (!executorsByHost.contains(o.host)) {
executorsByHost(o.host) = new HashSet[String]()
executorAdded(o.executorId, o.host)
newExecAvail = true
}
for (rack <- getRackForHost(o.host)) {
hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
}
}
// shuffle the offers randomly so that tasks are not always placed on the same workers
val shuffledOffers = Random.shuffle(offers)
// build a list of task buffers, one per worker offer
val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
val availableCpus = shuffledOffers.map(o => o.cores).toArray
// get the TaskSetManagers sorted according to the scheduling policy
val sortedTaskSets = rootPool.getSortedTaskSetQueue
// if a new executor has joined, the data locality of each task set has to be recomputed
for (taskSet <- sortedTaskSets) {
logDebug("parentName: %s, name: %s, runningTasks: %s".format(
taskSet.parent.name, taskSet.name, taskSet.runningTasks))
if (newExecAvail) {
taskSet.executorAdded()
}
}
// assign resources to the sorted TaskSetManagers following data locality, in the order defined by myLocalityLevels: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
var launchedTask = false
for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
do {
launchedTask = resourceOfferSingleTaskSet(
taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
} while (launchedTask)
}
if (tasks.size > 0) {
hasLaunchedTask = true
}
return tasks
}
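How long resourceOffers waits at a given locality level before degrading to the next one is governed by the spark.locality.wait family of settings; a hedged configuration sketch (the values shown are illustrative, not recommendations):
val localityConf = new SparkConf()
  .set("spark.locality.wait", "3s")          // base wait before degrading a locality level
  .set("spark.locality.wait.process", "3s")  // wait specific to PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // wait specific to NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")     // wait specific to RACK_LOCAL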
The tasks with allocated resources are handed to CoarseGrainedSchedulerBackend's launchTasks method. This method performs the RPC communication with the executors on each node that drives the actual execution: it sends the tasks one by one to the CoarseGrainedExecutorBackend on a worker node, whose internal Executor then runs them. The source of CoarseGrainedSchedulerBackend's launchTasks method:
// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
for (task <- tasks.flatten) {
val serializedTask = ser.serialize(task)
if (serializedTask.limit >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
try {
var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
"spark.akka.frameSize (%d bytes) - reserved (%d bytes). Consider increasing " +
"spark.akka.frameSize or using broadcast variables for large values."
msg = msg.format(task.taskId, task.index, serializedTask.limit, akkaFrameSize,
AkkaUtils.reservedSizeBytes)
taskSetMgr.abort(msg)
} catch {
case e: Exception => logError("Exception in error callback", e)
}
}
}
else {
val executorData = executorDataMap(task.executorId)
executorData.freeCores -= scheduler.CPUS_PER_TASK
// the CoarseGrainedExecutorBackend on the worker node receives the LaunchTask message and starts the task on its Executor
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
}
}
}
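As the error message in launchTasks suggests, large values captured by a task's closure are serialized into every task and can exceed the frame-size limit; broadcasting them instead ships them to each executor once. A hedged sketch (the lookup table below is illustrative):
// a large lookup table captured by the closure would be serialized into every single task
val lookup: Map[String, Int] = (1 to 1000000).map(i => ("key" + i, i)).toMap
val lookupBc = sparkContext.broadcast(lookup) // shipped to each executor once instead

val ids: RDD[String] = sparkContext.parallelize(Seq("key1", "key42", "key7"))
val resolved = ids.map(id => lookupBc.value.getOrElse(id, -1))
resolved.collect()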
Executing tasks
After DAGScheduler has divided the stages, submitMissingTasks creates the tasks and hands them to TaskSchedulerImpl's submitTasks method, which puts them into the scheduling pool; CoarseGrainedSchedulerBackend's reviveOffers method then assigns resources to the tasks and selects executors for them. Once the resources are assigned, CoarseGrainedSchedulerBackend sends a LaunchTask message to CoarseGrainedExecutorBackend, delivering the concrete task to an Executor for computation. When CoarseGrainedExecutorBackend receives the LaunchTask message, it calls Executor's launchTask method, which wraps the task in a TaskRunner (defined inside Executor) that manages the details of task execution, and then submits the TaskRunner to a thread pool for execution.
In TaskRunner's run method, the task that was sent over, together with the jars and files it depends on, is first deserialized; then the run method of the deserialized Task is invoked. Task itself is an abstract class, and the concrete behaviour is implemented by its two subclasses, ShuffleMapTask and ResultTask. An abridged version of TaskRunner's run method:
override def run(): Unit = {
// create a TaskMemoryManager instance that manages memory while the task runs
val taskMemoryManager = new TaskMemoryManager(env.memoryManager, taskId)
val deserializeStartTime = System.currentTimeMillis()
Thread.currentThread.setContextClassLoader(replClassLoader)
val ser = env.closureSerializer.newInstance()
logInfo(s"Running $taskName (TID $taskId)")
// through the CoarseGrainedExecutorBackend, notify the driver that the task has started running
execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
var taskStart: Long = 0
startGCTime = computeTotalGcTime()
try {
// deserialize the files, jars and task code needed at runtime
val (taskFiles, taskJars, taskBytes) = Task.deserializeWithDependencies(serializedTask)
updateDependencies(taskFiles, taskJars)
task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
task.setTaskMemoryManager(taskMemoryManager)
if (killed) {
// the task was killed before deserialization finished, so throw an exception and exit
throw new TaskKilledException
}
logDebug("Task " + taskId + "'s epoch is " + task.epoch)
env.mapOutputTracker.updateEpoch(task.epoch)
taskStart = System.currentTimeMillis()
var threwException = true
val (value, accumUpdates) = try {
// run the task by calling Task.run
val res = task.run(
taskAttemptId = taskId,
attemptNumber = attemptNumber,
metricsSystem = env.metricsSystem)
threwException = false
res
}
........
}
}
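For context, the way Executor hands a TaskRunner to its thread pool when CoarseGrainedExecutorBackend receives LaunchTask can be sketched roughly as follows (a paraphrase, not verbatim source):
def launchTask(
    context: ExecutorBackend,
    taskId: Long,
    attemptNumber: Int,
    taskName: String,
    serializedTask: ByteBuffer): Unit = {
  // wrap the task in a TaskRunner that manages the details of its execution
  val tr = new TaskRunner(context, taskId, attemptNumber, taskName, serializedTask)
  runningTasks.put(taskId, tr) // remembered so the task can be killed later
  threadPool.execute(tr)       // TaskRunner is a Runnable; its run() method is shown above
}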
TaskRunner's run method calls ExecutorBackend's statusUpdate method, which is implemented by CoarseGrainedExecutorBackend (CoarseGrainedExecutorBackend implements the ExecutorBackend trait). TaskState has six values: LAUNCHING, RUNNING, FINISHED, FAILED, KILLED and LOST. A specific TaskState value is sent from the ExecutorBackend (more precisely its implementation CoarseGrainedExecutorBackend; ExecutorBackend is a trait) back to the SchedulerBackend (more precisely its implementation CoarseGrainedSchedulerBackend; SchedulerBackend is a trait), which handles it according to that value. When the Executor (in fact the TaskRunner) runs, it reports the task's execution state back to CoarseGrainedSchedulerBackend, whose receive method dispatches different handling depending on the state. The source of the receive method in CoarseGrainedSchedulerBackend that handles StatusUpdate:
override def receive: PartialFunction[Any, Unit] = {
case StatusUpdate(executorId, taskId, state, data) =>
// call TaskSchedulerImpl's statusUpdate method to update the task's state
scheduler.statusUpdate(taskId, state, data.value)
if (TaskState.isFinished(state)) {
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
// give back the CPUs used by the finished task (CPUS_PER_TASK, i.e. spark.task.cpus, per task)
executorInfo.freeCores += scheduler.CPUS_PER_TASK
makeOffers(executorId) // re-offer the freed resources of this executor
case None =>
logWarning(s"Ignored task status update ($taskId state $state) " +
s"from unknown executor with ID $executorId") }}
case ReviveOffers =>
makeOffers()
case KillTask(taskId, executorId, interruptThread) =>
executorDataMap.get(executorId) match {
case Some(executorInfo) =>
executorInfo.executorEndpoint.send(KillTask(taskId, executorId, interruptThread))
case None =>
logWarning(s"Attempted to kill task $taskId for unknown executor $executorId.")
}
}