Spark Job Execution Flow

Spark Job Execution

Example code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

def rddBasics(): Unit = {
  val sparkConf: SparkConf = new SparkConf().setAppName("rdd basics implement")
  val sparkContext: SparkContext = SparkContext.getOrCreate(sparkConf)
  val rdd: RDD[Int] = sparkContext.parallelize(Array(1, 2, 3, 4, 5, 6))
  rdd.count()
}

Submitting the Job

The job is really submitted only when the action method count is called. count triggers SparkContext's runJob method to submit the job; the submission happens through this implicit, internal call to runJob. The source of count:

/** Return the number of elements in the RDD*/
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

RDDs form a directed acyclic graph (DAG) according to their dependencies, and this DAG is handed to the DAGScheduler for processing. After several nested calls, SparkContext's runJob method eventually reaches DAGScheduler's runJob method. SparkContext's runJob method:

/** Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark. */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite      // record the call site (the method call stack) of the action
  val cleanedFunc = clean(func)   // clean the closure so the function can be serialized
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }

  // Call the high-level scheduler DAGScheduler's runJob method
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}

The DAGScheduler.runJob method that SparkContext calls:

def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime

  // DAGScheduler.runJob calls submitJob and then blocks until the job finishes or fails.
  // submitJob returns a JobWaiter object (waiter); the job itself is sent, via internal
  // message handling, to DAGScheduler's inner class DAGSchedulerEventProcessLoop for processing.
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  waiter.awaitResult() match {
    case JobSucceeded =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case JobFailed(exception: Exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}

DAGScheduler's submitJob method, which returns the JobWaiter object:

def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check that the partitions to be processed actually exist; if not, throw an exception.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }

  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // If the job contains zero tasks, create a zero-task JobWaiter and return immediately.
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)

  // Create a JobWaiter to wait for the job to finish, and submit the job via the inner event loop.
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}

In DAGScheduler's submitJob method, rdd.partitions.length is obtained first to verify that the requested partitions exist at runtime. The event is then sent, through eventProcessLoop's post method and the internal message-handling mechanism, to DAGSchedulerEventProcessLoop for processing. The posted JobSubmitted is a case class rather than a case object: an application contains many jobs, and different jobs need distinct JobSubmitted instances, whereas a case object would carry the same content for every job. Source of case class JobSubmitted:

private[scheduler] case class JobSubmitted(
    jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties = null)
  extends DAGSchedulerEvent

JobSubmitted has private[scheduler] visibility, so users cannot invoke it directly. It encapsulates the jobId, the final RDD (finalRDD), the function to apply to the RDD, the partitions that take part in the computation, the job listener, the job properties, and so on.

An action causes SparkContext.runJob to execute, which ultimately leads to DAGScheduler.submitJob; its core is to send a JobSubmitted case class instance to eventProcessLoop. eventProcessLoop is an instance of the DAGSchedulerEventProcessLoop class:

private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)

DAGSchedulerEventProcessLoop extends the abstract class EventLoop, so it must implement EventLoop's abstract onReceive method. The overridden onReceive method in DAGSchedulerEventProcessLoop:

override def onReceive(event: DAGSchedulerEvent): Unit = {
  val timerContext = timer.time()
  try {
    doOnReceive(event)
  } finally {
    timerContext.stop()
  }
}

When doOnReceive in DAGSchedulerEventProcessLoop pattern-matches the received JobSubmitted case class, it calls DAGScheduler's handleJobSubmitted method to submit the job; stage division is completed inside handleJobSubmitted. The onReceive method of DAGSchedulerEventProcessLoop mainly delegates to doOnReceive:

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)
  case StageCancelled(stageId) =>
    dagScheduler.handleStageCancellation(stageId)
  case JobCancelled(jobId) =>
    dagScheduler.handleJobCancellation(jobId)
  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)
  case AllJobsCancelled =>
    dagScheduler.doCancelAllJobs()
  case ExecutorAdded(execId, host) =>
    dagScheduler.handleExecutorAdded(execId, host)
  case ExecutorLost(execId) =>
    dagScheduler.handleExecutorLost(execId, fetchFailed = false)
  case BeginEvent(task, taskInfo) =>
    dagScheduler.handleBeginEvent(task, taskInfo)
  case GettingResultEvent(taskInfo) =>
    dagScheduler.handleGetTaskResult(taskInfo)
  case completion @ CompletionEvent(task, reason, _, _, taskInfo, taskMetrics) =>
    dagScheduler.handleTaskCompletion(completion)
  case TaskSetFailed(taskSet, reason, exception) =>
    dagScheduler.handleTaskSetFailed(taskSet, reason, exception)
  case ResubmitFailedStages =>
    dagScheduler.resubmitFailedStages()
}

That covers Spark's job submission. To summarize:

Inside the DAGScheduler a series of method calls takes place. First, runJob calls submitJob to continue submitting the job and then blocks until the job finishes or fails. Next, submitJob creates a JobWaiter object and, through internal message handling, posts the event to DAGScheduler's inner class DAGSchedulerEventProcessLoop for processing. Finally, in DAGSchedulerEventProcessLoop's message-receiving method onReceive, once the JobSubmitted case class is matched, DAGScheduler's handleJobSubmitted method is called to submit the job; stage division for the tasks takes place in that method.
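Condensed into a single call chain (only methods quoted above; argument lists elided), the submission path looks like this:

// rdd.count()                                        -- the action triggers the job
//   -> SparkContext.runJob(...)                      -- entry point for all actions
//     -> DAGScheduler.runJob(...)                    -- blocks on a JobWaiter
//       -> DAGScheduler.submitJob(...)               -- posts JobSubmitted to eventProcessLoop
//         -> DAGSchedulerEventProcessLoop.onReceive / doOnReceive
//           -> DAGScheduler.handleJobSubmitted(...)  -- stage division starts here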

Dividing Stages

Stage division in Spark is implemented by the DAGScheduler. Starting from the final RDD, the DAGScheduler traverses the whole dependency tree breadth-first and splits it into stages. The division criterion is whether an operation is a wide dependency (ShuffleDependency): whenever an RDD operation involves a shuffle, that shuffle becomes the boundary that splits the computation into two stages.
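As a minimal, illustrative sketch (reusing the sparkContext from the rddBasics example above; the names words and counts are made up for illustration), a reduceByKey introduces a ShuffleDependency and therefore a stage boundary, while map does not:

// reduceByKey creates a ShuffleDependency, so the action below produces two stages:
// a ShuffleMapStage followed by a ResultStage.
val words = sparkContext.parallelize(Seq("a", "b", "a", "c"))
val counts = words.map(w => (w, 1))   // narrow dependency, stays in the same stage
  .reduceByKey(_ + _)                 // wide (shuffle) dependency, stage boundary here
counts.count()                        // the action submits the job
// counts.toDebugString prints the lineage and shows where the shuffle boundary falls.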

In the code, this starts in DAGScheduler's handleJobSubmitted method, which creates a ResultStage from the final RDD. The source of DAGScheduler's handleJobSubmitted:

private[scheduler] def handleJobSubmitted(
    jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  // Create the last stage, finalStage, from the final RDD.
  var finalStage: ResultStage = null
  try {
    finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      listener.jobFailed(e)
      return
  }

  // Create the job from the final stage, finalStage.
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))

  submitStage(finalStage)  // submit the stage
  submitWaitingStages()
}

handleJobSubmitted calls DAGScheduler's newResultStage method and assigns the result to finalStage. Source of newResultStage:

/** Create a ResultStage associated with the provided jobId. */
private def newResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  val (parentStages: List[Stage], id: Int) = getParentStagesAndId(rdd, jobId)
  val stage = new ResultStage(id, rdd, func, partitions, parentStages, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}

newResultStage passes the final RDD into getParentStagesAndId, which calls getParentStages; this is how the final stage, finalStage, is generated. Source of DAGScheduler's getParentStagesAndId:

/**Helper function to eliminate some code re-use when creating new stages.*/
private def getParentStagesAndId(rdd: RDD[_], firstJobId: Int): (List[Stage], Int) = {
  val parentStages = getParentStages(rdd, firstJobId)
  val id = nextStageId.getAndIncrement()
  (parentStages, id)
}

Source of DAGScheduler's getParentStages method:

/** Get or create the list of parent stages for a given RDD. The new Stages will be created with the provided firstJobId. */
private def getParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  val parents = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // Stack of RDDs waiting to be visited; it holds RDDs reached through non-shuffle dependencies.
  val waitingForVisit = new Stack[RDD[_]]

  // Traversal function: mark the RDD as visited, then handle each dependency according to its type.
  def visit(r: RDD[_]) {
    if (!visited(r)) {
      visited += r
      for (dep <- r.dependencies) {
        dep match {
          // A ShuffleDependency requires a ShuffleMap stage; getShuffleMapStage is the entry
          // point for walking further back and dividing stages.
          case shufDep: ShuffleDependency[_, _, _] =>
            parents += getShuffleMapStage(shufDep, firstJobId)
          case _ =>
            // For a non-shuffle dependency, push the parent RDD onto the stack to visit later.
            waitingForVisit.push(dep.rdd)
        }
      }
    }
  }

  // Walk the dependency tree backwards from the final RDD; if the tree contains an RDD with a
  // ShuffleDependency, a parent stage exists, otherwise it does not.
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  parents.toList
}

If an RDD's dependency is a ShuffleDependency, a ShuffleMap stage must be created through getShuffleMapStage. Source of DAGScheduler's getShuffleMapStage:

/**Get or create a shuffle map stage for the given shuffle dependency's map side.*/
private def getShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) => stage
    case None =>
      // First register the shuffle dependencies between ancestor (parent/child) RDDs
      getAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        shuffleToMapStage(dep.shuffleId) = newOrUsedShuffleStage(dep, firstJobId)
      }
      // Then register current shuffleDep
      val stage = newOrUsedShuffleStage(shuffleDep, firstJobId)
      shuffleToMapStage(shuffleDep.shuffleId) = stage
      stage
  }
}

In getShuffleMapStage, when the final RDD has parent stages, the traversal walks backwards from the RDD where the shuffle occurs and finds all ShuffleMapStages; this is the most important part of stage division. The algorithm is similar to getParentStages and is implemented by getAncestorShuffleDependencies, which finds every RDD whose operation is a wide dependency; the ShuffleMapStages are then created through the newOrUsedShuffleStage method.

Source of DAGScheduler's getAncestorShuffleDependencies method:

/** Find ancestor shuffle dependencies that are not registered in shuffleToMapStage yet */
private def getAncestorShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
  val parents = new Stack[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  // Stack of RDDs waiting to be visited (reached through non-shuffle dependencies)
  val waitingForVisit = new Stack[RDD[_]]
  def visit(r: RDD[_]) {
    if (!visited(r)) {
      visited += r
      for (dep <- r.dependencies) {
        dep match {
          // A ShuffleDependency marks the boundary of a ShuffleMap stage
          case shufDep: ShuffleDependency[_, _, _] =>
            if (!shuffleToMapStage.contains(shufDep.shuffleId)) {
              parents.push(shufDep)
            }
          case _ =>
        }
        waitingForVisit.push(dep.rdd)
      }
    }
  }

  // Walk the dependency tree backwards and collect every ShuffleDependency as a division boundary
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  parents
}

Once all stages have been divided, dependency relationships are established among them. These relationships are defined through the stage attribute parents: List[Stage], from which all ancestor stages of the current stage can be obtained, so that stages can be submitted and run in the right order. Stage division is an important part of Spark job execution; the detailed steps are:

  1. When the job is submitted from SparkContext, DAGScheduler's handleJobSubmitted method handles it; that method calls newResultStage, which starts from the final RDD and invokes getParentStages;

  2. getParentStages checks whether the RDD's dependency tree contains a shuffle operation; if it does, getShuffleMapStage is called, which in turn calls getAncestorShuffleDependencies to obtain the RDD's ancestor shuffle dependencies.

  3. getAncestorShuffleDependencies walks backwards over the RDDs and checks whether each dependency branch contains further shuffle dependencies; newOrUsedShuffleStage is then called to create the corresponding ShuffleMapStage.

Submitting Stages

In DAGScheduler's handleJobSubmitted method, creating finalStage also establishes the dependencies among all the stages; a job instance is then generated from finalStage, and within that job instance the stages are submitted for execution in order. During execution, the listener bus is used to track the progress of the job and its stages.

Stage submission is handled by DAGScheduler's submitStage method, whose source is:

/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // Get this stage's parent stages. They are found by walking the RDD dependencies
      // backwards and looking for shuffles, not by using the stage dependency links.
      val missing = getMissingParentStages(stage).sortBy(_.id)
      if (missing.isEmpty) {
        // If there is no parent stage, submit this stage directly for execution.
        submitMissingTasks(stage, jobId.get)
      } else {
        // If parent stages exist, submit them recursively first and add this stage to the list
        // of waiting stages, until a stage with no parent (the starting stage) is reached.
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}

When stage submission starts, submitStage calls getMissingParentStages to obtain finalStage's parent stages. If there is no parent stage, submitMissingTasks submits the stage for execution; if parent stages exist, the stage is placed on the waitingStages list and submitStage is called recursively. In this way, stages that still have parent stages end up in waitingStages, while stages with no parent stage become the entry points of job execution.

After the entry stages finish, the subsequent stages are submitted one after another. Before a stage is scheduled, Spark first checks whether the results of its parent stages are available (i.e. whether they ran successfully). If all results are available, the stage is submitted; if some are not, the corresponding parent stages are resubmitted. This availability check runs when a ShuffleMapTask finishes: the DAGScheduler checks whether all tasks of the stage have completed. If some tasks failed, the stage is resubmitted; if all tasks completed, the list of waiting stages is scanned, and any stage whose parent stages have all finished is considered ready, so its task set is created and submitted for execution.

Submitting Tasks

Task submission also starts from DAGScheduler's submitStage method, which calls submitMissingTasks. When a stage is submitted for execution, submitMissingTasks splits it into tasks according to the stage's number of partitions (generally one task per partition); these tasks form a task set that is submitted to the TaskScheduler for processing. Source of submitMissingTasks:

/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")

  // Clear the stage's pending partitions first.
  stage.pendingPartitions.clear()

  // Figure out the indices of the partitions that still need to be computed.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

  // Create the internal accumulators if they do not exist yet; if the whole stage is being
  // recomputed, reset them (otherwise values from other tasks could be overwritten).
  if (stage.internalAccumulators.isEmpty || stage.numPartitions == partitionsToCompute.size) {
    stage.resetInternalAccumulators()
  }

  // Use the scheduling pool, job group, etc. from the ActiveJob associated with this stage.
  val properties = jobIdToActiveJob(jobId).properties
  runningStages += stage

  // The SparkListenerStageSubmitted event must be posted before any task is serialized.
  stage match {
    case s: ShuffleMapStage =>
      // numPartitions is the number of the RDD's partitions that take part in the computation;
      // when only some partitions need to be computed, it is not necessarily equal to
      // partitionsToCompute.size.
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        val job = s.activeJob.get
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
      runningStages -= stage
      return
  }
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)

  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

  var taskBinary: Broadcast[Array[Byte]] = null

  try {
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
      case stage: ResultStage =>
        closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
    }

    // Broadcast the serialized stage (its RDD plus the shuffle dependency or function) to the executors.
    taskBinary = sc.broadcast(taskBinaryBytes)

  } catch {
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}", Some(e))
      runningStages -= stage
      return
  }
  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage =>
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.internalAccumulators)
        }
      case stage: ResultStage =>
        val job = stage.activeJob.get
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, stage.internalAccumulators)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
      runningStages -= stage
      return
  }
  if (tasks.size > 0) {
    logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
    stage.pendingPartitions ++= tasks.map(_.partitionId)
    logDebug("New pending partitions: " + stage.pendingPartitions)

    // Submit the tasks to the TaskScheduler as a TaskSet.
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // If the stage has no tasks to run, mark it as finished immediately.
    markStageAsFinished(stage, None)
    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage: ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)
  }
}

The submitTasks method of the TaskScheduler trait has no method body. TaskSchedulerImpl extends TaskScheduler and overrides submitTasks, so the actual implementation lives in TaskSchedulerImpl. Its submitTasks source:

override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    // Create a TaskSetManager to manage the lifecycle of this task set.
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    stageTaskSets(taskSet.stageAttemptId) = manager
    val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
      ts.taskSet != taskSet && !ts.isZombie
    }
    if (conflictingTaskSet) {
      throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
        s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
    }

    // Add the task set to the scheduling pool, where it is scheduled system-wide;
    // both FIFO and FAIR modes are supported.
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)

    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient resources")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
    }
    hasReceivedTask = true
  }

  backend.reviveOffers()  // the SchedulerBackend sends a ReviveOffers message to the driver
}

In TaskSchedulerImpl's submitTasks, createTaskSetManager builds a TaskSetManager instance that manages the lifecycle of the task set; that TaskSetManager is then placed in the system's scheduling pool through SchedulableBuilder.addTaskSetManager. SchedulableBuilder is a trait implemented by FIFOSchedulableBuilder and FairSchedulableBuilder, so addTaskSetManager schedules the task set according to the configured scheduling algorithm (FIFO or FAIR). createTaskSetManager itself simply invokes the TaskSetManager constructor:

private[scheduler] def createTaskSetManager(
    taskSet: TaskSet,
    maxTaskFailures: Int): TaskSetManager = {
  new TaskSetManager(this, taskSet, maxTaskFailures)
}

The scheduling pool supports two modes, FIFO and FAIR, implemented through the buildPools and addTaskSetManager methods of the SchedulableBuilder trait. FIFOSchedulableBuilder and FairSchedulableBuilder both extend SchedulableBuilder and override these two methods.
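As a hedged configuration sketch, the mode is selected through the spark.scheduler.mode property (FIFO is the default); spark.scheduler.allocation.file can optionally point to a FAIR pool definition file. The variable name schedulingConf below is illustrative:

import org.apache.spark.SparkConf

// FIFO is the default; setting FAIR makes the root pool schedule task sets fairly.
val schedulingConf = new SparkConf()
  .setAppName("scheduling mode example")
  .set("spark.scheduler.mode", "FAIR")
// Optional: supply pool definitions for FAIR mode via an allocation file.
// schedulingConf.set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")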

submitTasks then calls the reviveOffers method of the SchedulerBackend trait to allocate resources and start execution. In standalone mode, CoarseGrainedSchedulerBackend implements the SchedulerBackend trait and overrides reviveOffers:

override def reviveOffers() {
  driverEndpoint.send(ReviveOffers)
}

ReviveOffers itself is just an empty case object; it only acts as a trigger for the underlying resource scheduling and is sent whenever tasks are submitted or compute resources change. Source of ReviveOffers:

// Internal messages in driver
case object ReviveOffers extends CoarseGrainedClusterMessage

After CoarseGrainedSchedulerBackend's reviveOffers finishes sending the ReviveOffers message, the message is received and matched in CoarseGrainedSchedulerBackend's receive method, which drives a series of further operations inside CoarseGrainedSchedulerBackend (for example, the overridden receiveAndReply method).

In CoarseGrainedSchedulerBackend's receive method, when the DriverEndpoint receives ReviveOffers it triggers the makeOffers method:

override def receive: PartialFunction[Any, Unit] = {
  case StatusUpdate(executorId, taskId, state, data) =>
    scheduler.statusUpdate(taskId, state, data.value)
    if (TaskState.isFinished(state)) {
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          executorInfo.freeCores += scheduler.CPUS_PER_TASK
          makeOffers(executorId)
        case None =>
          logWarning(s"Ignored task status update ($taskId state $state) " +
            s"from unknown executor with ID $executorId")
      }
    }

  case ReviveOffers =>
    makeOffers()  // trigger makeOffers after receiving the ReviveOffers case object

  case KillTask(taskId, executorId, interruptThread) =>
    executorDataMap.get(executorId) match {
      case Some(executorInfo) =>
        executorInfo.executorEndpoint.send(KillTask(taskId, executorId, interruptThread))
      case None =>
        // Ignoring the task kill since the executor is not registered.
        logWarning(s"Attempted to kill task $taskId for unknown executor $executorId.")
    }

}

In reviveOffers, the RpcEndpointRef (driverEndpoint) sends a message to the DriverEndpoint. Whenever the DriverEndpoint learns that a new task set needs to be submitted (case ReviveOffers), that a task's status needs updating (the receive: PartialFunction[Any, Unit] method), or that a new executor needs to be registered (the overridden receiveAndReply(context: RpcCallContext) method), CoarseGrainedSchedulerBackend performs scheduling by calling makeOffers or makeOffers(executorId: String). That method first collects the executors currently available in the cluster, then hands them to TaskSchedulerImpl to allocate resources to the tasks of the task sets, and finally passes the result to launchTasks. Source of CoarseGrainedSchedulerBackend's makeOffers:

private def makeOffers() {
  // Get the list of alive executors in the cluster.
  val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
  val workOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toSeq

  // Allocate resources to the tasks of the task sets and submit them for execution.
  launchTasks(scheduler.resourceOffers(workOffers))
}

makeOffers calls TaskSchedulerImpl's resourceOffers method to perform the resource allocation. Source of TaskSchedulerImpl's resourceOffers:

def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
  // Mark each slave node, record its hostname, and check whether any new executor has joined;
  // if so, register it.
  var newExecAvail = false
  for (o <- offers) {
    executorIdToHost(o.executorId) = o.host
    executorIdToTaskCount.getOrElseUpdate(o.executorId, 0)
    if (!executorsByHost.contains(o.host)) {
      executorsByHost(o.host) = new HashSet[String]()
      executorAdded(o.executorId, o.host)
      newExecAvail = true
    }
    for (rack <- getRackForHost(o.host)) {
      hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
    }
  }

  // Randomly shuffle the offers to avoid always placing tasks on the same workers.
  val shuffledOffers = Random.shuffle(offers)

  // Build a list of task buffers, one per worker, plus each worker's available CPUs.
  val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
  val availableCpus = shuffledOffers.map(o => o.cores).toArray

  // Get the TaskSetManagers sorted according to the scheduling policy.
  val sortedTaskSets = rootPool.getSortedTaskSetQueue

  // If a new executor has joined, recompute the data locality of each task set.
  for (taskSet <- sortedTaskSets) {
    logDebug("parentName: %s, name: %s, runningTasks: %s".format(
      taskSet.parent.name, taskSet.name, taskSet.runningTasks))
    if (newExecAvail) {
      taskSet.executorAdded()
    }
  }

  // Allocate resources to the sorted TaskSetManagers following data locality, in the order
  // defined by myLocalityLevels: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY.
  var launchedTask = false
  for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
    do {
      launchedTask = resourceOfferSingleTaskSet(
          taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
    } while (launchedTask)
  }
  if (tasks.size > 0) {
    hasLaunchedTask = true
  }
  return tasks
}

The tasks with their allocated resources are handed to CoarseGrainedSchedulerBackend's launchTasks method, which handles the RPC communication with the executors on each node and drives the actual scheduling: it sends each task to the CoarseGrainedExecutorBackend on a worker node, whose internal Executor then runs the task. Source of CoarseGrainedSchedulerBackend's launchTasks:

// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    val serializedTask = ser.serialize(task)
    if (serializedTask.limit >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
      scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.akka.frameSize (%d bytes) - reserved (%d bytes). Consider increasing " +
            "spark.akka.frameSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit, akkaFrameSize,
            AkkaUtils.reservedSizeBytes)
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    }
    else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK

      // The CoarseGrainedExecutorBackend on the worker node receives the LaunchTask message
      // and starts executing the task on its Executor.
      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}

 

Executing Tasks

After the DAGScheduler has divided the stages, it creates the tasks in submitMissingTasks and hands them to TaskSchedulerImpl's submitTasks method, which puts them into the scheduling pool; CoarseGrainedSchedulerBackend's reviveOffers method is then called to allocate resources to the tasks and choose executors. Once the resources are allocated, CoarseGrainedSchedulerBackend sends a LaunchTask message to CoarseGrainedExecutorBackend, delivering the concrete task to an Executor for computation. When CoarseGrainedExecutorBackend receives the LaunchTask message, it calls the Executor's launchTask method. launchTask wraps the task in a TaskRunner (a class defined inside Executor) that manages the run-time details of the task, and then hands the TaskRunner to a thread pool for execution.
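A minimal sketch of that launch pattern, assuming nothing beyond the JDK thread-pool API; the names ExecutorSketch, TaskRunnerSketch and the simplified launchTask signature below are illustrative stand-ins, not the actual Spark classes:

import java.util.concurrent.{Executors, ExecutorService}

object ExecutorSketch {
  // A cached thread pool stands in for the Executor's internal task thread pool.
  private val threadPool: ExecutorService = Executors.newCachedThreadPool()

  // Stand-in for TaskRunner: wraps one task and manages its run-time details.
  final class TaskRunnerSketch(taskId: Long, serializedTask: Array[Byte]) extends Runnable {
    override def run(): Unit = {
      // In Spark, this is where the task and its dependencies are deserialized,
      // RUNNING is reported to the driver, and Task.run() is invoked.
      println(s"running task $taskId (${serializedTask.length} bytes)")
    }
  }

  // Stand-in for Executor.launchTask: enqueue the wrapped task on the thread pool.
  def launchTask(taskId: Long, serializedTask: Array[Byte]): Unit = {
    threadPool.execute(new TaskRunnerSketch(taskId, serializedTask))
  }
}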

In TaskRunner's run method, the task itself and the jars and files it depends on are first deserialized, and then the Task's run method is called on the deserialized task. Since Task is an abstract class, the concrete run behaviour is implemented by its two subclasses, ShuffleMapTask and ResultTask.

override def run(): Unit = {
  // Create a TaskMemoryManager instance to manage memory while the task runs.
  val taskMemoryManager = new TaskMemoryManager(env.memoryManager, taskId)
  val deserializeStartTime = System.currentTimeMillis()
  Thread.currentThread.setContextClassLoader(replClassLoader)
  val ser = env.closureSerializer.newInstance()
  logInfo(s"Running $taskName (TID $taskId)")

  // Tell the driver, via the CoarseGrainedExecutorBackend, that the task has started running.
  execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
  var taskStart: Long = 0
  startGCTime = computeTotalGcTime()
  try {
    // Deserialize the files, jars and code the task needs at run time.
    val (taskFiles, taskJars, taskBytes) = Task.deserializeWithDependencies(serializedTask)
    updateDependencies(taskFiles, taskJars)
    task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
    task.setTaskMemoryManager(taskMemoryManager)
    if (killed) {
      // If the task was killed before it was deserialized, throw an exception and exit.
      throw new TaskKilledException
    }
    logDebug("Task " + taskId + "'s epoch is " + task.epoch)
    env.mapOutputTracker.updateEpoch(task.epoch)
    taskStart = System.currentTimeMillis()
    var threwException = true
    val (value, accumUpdates) = try {
      // Call the Task's run method to actually run the task.
      val res = task.run(
        taskAttemptId = taskId,
        attemptNumber = attemptNumber,
        metricsSystem = env.metricsSystem)
      threwException = false
      res
    }
    ........
  }
}
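As a rough, hedged sketch of the Task abstraction just mentioned (the names SimpleTask, EchoTask and the simplified run signature are illustrative, not Spark's actual API), the shared run wrapper delegates the per-partition computation to the subclass's runTask:

// Template-method sketch: the shared run() wrapper delegates to the subclass's runTask().
abstract class SimpleTask[T] {
  // Each concrete task implements only the computation for its partition.
  protected def runTask(): T

  // In Spark, run() additionally sets up the TaskContext, memory manager and metrics
  // before delegating to runTask().
  final def run(taskAttemptId: Long): T = runTask()
}

// A ShuffleMapTask-like subclass would write shuffle output and return a map status;
// a ResultTask-like subclass would apply the user function and return its result.
class EchoTask(value: String) extends SimpleTask[String] {
  override protected def runTask(): String = value
}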

TaskRunner's run method calls ExecutorBackend's statusUpdate method, whose concrete implementation is CoarseGrainedExecutorBackend's statusUpdate (CoarseGrainedExecutorBackend implements the ExecutorBackend trait). TaskState has six values: LAUNCHING, RUNNING, FINISHED, FAILED, KILLED and LOST. The specific TaskState value is reported through the ExecutorBackend (more precisely its implementation CoarseGrainedExecutorBackend; ExecutorBackend is a trait) back to the SchedulerBackend (more precisely CoarseGrainedSchedulerBackend; SchedulerBackend is also a trait), which then handles each TaskState value accordingly. When the Executor (in fact the TaskRunner) runs, it reports the task's execution status to CoarseGrainedSchedulerBackend, whose receive method processes the report and dispatches different calls depending on the task state. The receive method of CoarseGrainedSchedulerBackend that handles the StatusUpdate message:

override def receive: PartialFunction[Any, Unit] = {
  case StatusUpdate(executorId, taskId, state, data) =>
    // Call TaskSchedulerImpl's statusUpdate method to update the task's state.
    scheduler.statusUpdate(taskId, state, data.value)
    if (TaskState.isFinished(state)) {
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          // Return the CPUs the finished task was using (CPUS_PER_TASK) to the executor's free cores.
          executorInfo.freeCores += scheduler.CPUS_PER_TASK
          // Re-offer the freed resources on this executor via makeOffers.
          makeOffers(executorId)
        case None =>
          logWarning(s"Ignored task status update ($taskId state $state) " +
            s"from unknown executor with ID $executorId")
      }
    }
  case ReviveOffers =>
    makeOffers()
  case KillTask(taskId, executorId, interruptThread) =>
    executorDataMap.get(executorId) match {
      case Some(executorInfo) =>
        executorInfo.executorEndpoint.send(KillTask(taskId, executorId, interruptThread))
      case None =>
        logWarning(s"Attempted to kill task $taskId for unknown executor $executorId.")
    }
}

 

 
