Study Notes on 《Apache Spark源码剖析》 (Apache Spark Source Code Analysis): Spark Job Submission

1. Job Submission

We start from the foreach function:

foreach

-------------------------------------------------------------------------------------------

/**
 * Applies a function f to all elements of this RDD.
 */
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
--------------------------------------------------------------------------------------------

The runJob function called from foreach has several overloaded variants. What does each of these overloads do?

The steps below answer that question.
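As a concrete illustration (a minimal sketch, not taken from the book), a driver program as small as the following is enough to trigger the whole chain described below: the foreach action calls sc.runJob, which eventually hands the job to the DAGScheduler.

-------------------------------------------------------------------------------------------

import org.apache.spark.{SparkConf, SparkContext}

object RunJobDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RunJobDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // The action below invokes sc.runJob and, ultimately, dagScheduler.runJob.
    sc.parallelize(1 to 4).foreach(x => println(x))
    sc.stop()
  }
}
-------------------------------------------------------------------------------------------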

Step 1: specify the final RDD and the Function applied to it.

runJob (1)

--------------------------------------------------------------------------------------------


/**
 * Run a job on all partitions in an RDD and return the results in an array.
 */
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
  runJob(rdd, func, 0 until rdd.partitions.size, false)
}

-------------------------------------------------------------------------------------------
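For illustration, a hypothetical direct use of this simplest overload (assuming an sc as in spark-shell): it runs func once per partition and returns one result per partition.

-------------------------------------------------------------------------------------------

// Sum each partition of a 4-partition RDD; the result array has one entry per partition.
val rdd = sc.parallelize(1 to 10, 4)
val perPartitionSums: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)
// e.g. Array(3, 12, 13, 27) -- per-partition sums, not a single global sum
-------------------------------------------------------------------------------------------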

Step 2: take the partitions of the final RDD and specify whether local execution is allowed.

runJob (2)

----------------------------------------------------------------------------------------------

/**
 * Run a job on a given set of partitions of an RDD, but take a function of type
 * `Iterator[T] => U` instead of `(TaskContext, Iterator[T]) => U`.
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: Iterator[T] => U,
    partitions: Seq[Int],
    allowLocal: Boolean
    ): Array[U] = {
  runJob(rdd, (context: TaskContext, iter: Iterator[T]) => func(iter), partitions, allowLocal)
}
------------------------------------------------------------------------------------------------

Step 3: anonymous-function conversion; this overload allocates the results array and passes a closure that collects each partition's result into it.

------------------------------------------------------------------------------------------------

/**
 * Run a function on a given set of partitions in an RDD and return the results as an array. The
 * allowLocal flag specifies whether the scheduler can run the computation on the driver rather
 * than shipping it out to the cluster, for short actions like first().
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean
    ): Array[U] = {
  val results = new Array[U](partitions.size)
  runJob[T, U](rdd, func, partitions, allowLocal, (index, res) => results(index) = res)
  results
}
---------------------------------------------------------------------------------------------------

Step 4: add the handler that processes the Job's computed results.

---------------------------------------------------------------------------------------------------

/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark. The allowLocal
 * flag specifies whether the scheduler can run the computation on the driver rather than
 * shipping it out to the cluster, for short actions like first().
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit) {
  if (stopped) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
    resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
----------------------------------------------------------------------------------------------------------

Note the call to clean(func) here:

/**
 * Clean a closure to make it ready to serialized and send to tasks
 * (removes unreferenced variables in $outer's, updates REPL variables)
 * If checkSerializable is set, clean will also proactively
 * check to see if f is serializable and throw a SparkException
 * if not.
 *
 * @param f the closure to clean
 * @param checkSerializable whether or not to immediately check f for serializability
 * @throws SparkException if checkSerializable is set but f is not
 *   serializable
 */
private[spark] def clean[F <: AnyRef](f: F, checkSerializable: Boolean = true): F = {
  ClosureCleaner.clean(f, checkSerializable)
  f
}
-------------------------------------------------------------------------------------------------------------

The main role of ClosureCleaner

When Scala creates a closure, it first determines which variables the closure uses and stores those variables inside the closure. This is what allows a closure to run correctly even outside the scope in which it was created.

However, Scala sometimes captures more external variables than necessary. In most cases this does no harm beyond carrying a few unused variables. For Spark, though, the closure may be executed on another machine, so redundant external variables waste network bandwidth at best, and at worst some of them are not serializable, which makes serialization of the whole closure fail.

To eliminate this potential problem, Spark provides ClosureCleaner to strip out the unnecessary external variables; a cleaned closure can be serialized normally and executed on any machine.

Once you understand why ClosureCleaner exists, you also understand where the familiar "Task not serializable" error in Spark applications comes from: a variable that cannot be serialized was referenced inside an RDD operation.
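A minimal sketch of that failure mode (QueryRunner and threshold are made-up names): referencing a field of a non-serializable enclosing object pulls the whole object into the closure; copying the value into a local val avoids capturing `this`.

-------------------------------------------------------------------------------------------

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class QueryRunner(sc: SparkContext) {        // QueryRunner itself is NOT Serializable
  val threshold = 10

  def bad(rdd: RDD[Int]): Long =
    rdd.filter(x => x > threshold).count()   // threshold is really this.threshold, so the closure
                                             // captures `this` -> "Task not serializable"

  def good(rdd: RDD[Int]): Long = {
    val t = threshold                        // copy into a local val; only `t` is captured
    rdd.filter(x => x > t).count()
  }
}
-------------------------------------------------------------------------------------------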

2. Job Execution

The complete flow of job submission and execution mainly involves two kinds of nodes: the Driver and the Executor.
During job submission the Driver is chiefly responsible for:
(1) analyzing the dependencies between RDDs to produce the DAG;
(2) splitting the Job into multiple Stages according to the RDD DAG;
(3) once the Stages are determined, generating the corresponding Tasks and distributing them to the Executors.
After an Executor receives the instruction to execute tasks, it starts new threads to run the received tasks and returns their results.

2.1 Dependency Analysis and Stage Partitioning

Spark divides the dependencies between RDDs into narrow dependencies and wide dependencies.
A narrow dependency means all of a parent RDD's output is consumed by a designated child partition, i.e. the output path is fixed. A wide dependency means a parent RDD's output is consumed by several different child partitions, i.e. the output path is not fixed.
The scheduler computes the dependency relationships between RDDs, merges RDDs connected only by narrow dependencies into the same Stage, and uses wide dependencies as the criterion for splitting Stages.
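A quick way to see this in practice (a sketch assuming an sc as in spark-shell): map only introduces a narrow dependency, while reduceByKey introduces a ShuffleDependency and therefore a new Stage; rdd.dependencies and rdd.toDebugString make the boundary visible.

---------------------------------------------------------------------------------------------------

val words  = sc.parallelize(Seq("a", "b", "a", "c"))
val pairs  = words.map(w => (w, 1))          // narrow dependency: stays in the same Stage
val counts = pairs.reduceByKey(_ + _)        // ShuffleDependency: starts a new Stage

println(counts.dependencies.head)            // a ShuffleDependency
println(counts.toDebugString)                // lineage printout; the shuffle marks the Stage split
---------------------------------------------------------------------------------------------------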
The main work of handleJobSubmitted is to build the finalStage and then create an ActiveJob based on it.
handleJobSubmitted
---------------------------------------------------------------------------------------------------
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    allowLocal: Boolean,
    callSite: CallSite,
    listener: JobListener,
    properties: Properties = null)
{
  var finalStage: Stage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  if (finalStage != null) {
    val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
    clearCacheLocs()
    logInfo("Got job %s (%s) with %d output partitions (allowLocal=%s)".format(
      job.jobId, callSite.shortForm, partitions.length, allowLocal))
    logInfo("Final stage: " + finalStage + "(" + finalStage.name + ")")
    logInfo("Parents of final stage: " + finalStage.parents)
    logInfo("Missing parents: " + getMissingParentStages(finalStage))
    val shouldRunLocally =
      localExecutionEnabled && allowLocal && finalStage.parents.isEmpty && partitions.length == 1
    val jobSubmissionTime = clock.getTimeMillis()
    if (shouldRunLocally) {
      // Compute very short actions like first() or take() with no parent stages locally.
      listenerBus.post(
        SparkListenerJobStart(job.jobId, jobSubmissionTime, Seq.empty, properties))
      runLocally(job)
    } else {
      jobIdToActiveJob(jobId) = job
      activeJobs += job
      finalStage.resultOfJob = Some(job)
      val stageIds = jobIdToStageIds(jobId).toArray
      val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
      listenerBus.post(
        SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
      submitStage(finalStage)
    }
  }
  submitWaitingStages()
}
---------------------------------------------------------------------------------------------------------------------
finalStage = newStage(finalRDD, partitions.size, None, jobId, callSite)
This call creates a new Stage.
---------------------------------------------------------------------------------------------------------------------
/**
 * Create a Stage -- either directly for use as a result stage, or as part of the (re)-creation
 * of a shuffle map stage in newOrUsedStage.  The stage will be associated with the provided
 * jobId. Production of shuffle map stages should always use newOrUsedStage, not newStage
 * directly.
 */
private def newStage(
    rdd: RDD[_],
    numTasks: Int,
    shuffleDep: Option[ShuffleDependency[_, _, _]],
    jobId: Int,
    callSite: CallSite)
  : Stage =
{
  val parentStages = getParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new Stage(id, rdd, numTasks, shuffleDep, parentStages, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}
---------------------------------------------------------------------------------------------------------------------
Stage initialization parameters: before a Stage can be created, we must know how many partitions it needs to read data from; this number directly determines how many Tasks will be created.
--------------------------------------------------------------------------

private[spark] class Stage(
    val id: Int,                      // Stage id; parent Stages have smaller ids and are executed first
    val rdd: RDD[_],                  // the last RDD belonging to this Stage
    val numTasks: Int,                // number of Tasks to create, equal to the number of partitions to compute
    val shuffleDep: Option[ShuffleDependency[_, _, _]],  // Output shuffle if stage is a map stage
                                                         // i.e. whether a ShuffleDependency exists
    val parents: List[Stage],         // list of parent Stages
    val jobId: Int,                   // id of the job this Stage belongs to
    val callSite: CallSite)
  extends Logging {

---------------------------------------------------------------------------------------------------------------

In other words, by the time a Stage is created it is already clear how many partitions the Stage reads data from and how many partitions it writes to, i.e. both the input and the output counts are determined.

The initialization parameters of ActiveJob are as follows.

---------------------------------------------------------------------------------------------------------------

/**
 * Tracks information about an active job in the DAGScheduler.
 */
private[spark] class ActiveJob(
    val jobId: Int,                             // each job is assigned a unique id
    val finalStage: Stage,                      // the final Stage
    val func: (TaskContext, Iterator[_]) => _,  // the function applied on the final Stage
    val partitions: Array[Int],                 // list of partitions; note that this says how many
                                                // partitions data is read from and processed
    val callSite: CallSite,
    val listener: JobListener,
    val properties: Properties) {

  val numPartitions = partitions.length
  val finished = Array.fill[Boolean](numPartitions)(false)
  var numFinished = 0
}
-------------------------------------------------------------------------
submitStage proceeds as follows:

  • If any Stage it depends on has not finished, the missing parent Stages are submitted first.
  • If all dependencies have finished, the Stage itself is submitted.
--------------------------------------------------------------------------
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing == Nil) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id)
  }
}
--------------------------------------------------------------------------------
val missing = getMissingParentStages(stage).sortBy(_.id)
This traverses the dependency graph to find all parent Stages that are still missing.
--------------------------------------------------------------------------------
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      if (getCacheLocs(rdd).contains(Nil)) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getShuffleMapStage(shufDep, stage.jobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)
  while (!waitingForVisit.isEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}
-----------------------------------------------------------
How is the division into Stages determined? The key criterion is whether a ShuffleDependency exists; if one does, a new Stage is created.
How do we know whether a ShuffleDependency exists? That depends on the RDD transformation itself. The following RDDs return a ShuffleDependency:
  • ShuffledRDD
  • CoGroupedRDD
  • SubtractedRDD
If a new kind of RDD is added in the future, its dependency type must be made explicit, concretely by overriding the getDependencies function, as in the ShuffledRDD implementation:
----------------------------------------------------------------------------
override def getDependencies: Seq[Dependency[_]] = {
  List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
}
----------------------------------------------------------------------------
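Purely as an illustration (MyMappedRDD is a made-up class, not part of Spark), a custom RDD that should stay inside the same Stage would declare a narrow dependency in getDependencies, following the same pattern:

----------------------------------------------------------------------------

import scala.reflect.ClassTag
import org.apache.spark.{Dependency, OneToOneDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// A map-like RDD that declares a one-to-one (narrow) dependency on its parent,
// so the DAGScheduler keeps it in the parent's Stage.
class MyMappedRDD[T, U: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev.context, Nil) {

  override def getDependencies: Seq[Dependency[_]] = Seq(new OneToOneDependency(prev))

  override def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    prev.iterator(split, context).map(f)
}
----------------------------------------------------------------------------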
Once the Stages have been determined, the following are clear:
(1) how many partitions the Stage reads data from;
(2) how many partitions the Stage produces;
(3) whether the Stage is of the ShuffleMap type.
The partition count decides how many distinct Tasks need to be produced, and the ShuffleMap flag decides the type of Task generated. Spark has two kinds of Tasks: ShuffleMapTask and ResultTask.

2.2 Task Creation and Dispatch

Spark divides the Tasks executed by Executors into two kinds, ShuffleMapTask and ResultTask, which can loosely be thought of as the Map and Reduce of Hadoop.
submitMissingTasks is responsible for creating the new Tasks.
When a Stage generates its Tasks, the Stage's isShuffleMap flag decides the Task type: if the flag is true, ShuffleMapTasks are created; otherwise ResultTasks are created.
Tasks belonging to the same Stage can run concurrently. What decides how many Tasks a Stage generates? As the source below shows, the partitions determine how many Tasks each Stage produces.
Note in particular that the number of Tasks is not the number of Tasks that actually run concurrently: if 8 Tasks are generated but only 2 cores are available, they run in 4 waves of 2 concurrent Tasks, as sketched below.
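A back-of-the-envelope sketch of the "waves" point above (assuming an sc running with 2 cores, e.g. --master local[2]):

------------------------------------------------------------------------------

val rdd = sc.parallelize(1 to 80, 8)   // 8 partitions => 8 tasks for this stage
println(rdd.partitions.length)         // 8
rdd.map(_ * 2).count()                 // 8 tasks, but only 2 run at a time: 4 waves
------------------------------------------------------------------------------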
------------------------------------------------------------------------------
/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // Get our pending tasks and remember them in our pendingTasks entry
  stage.pendingTasks.clear()

  // First figure out the indexes of partition ids to compute.
  val partitionsToCompute: Seq[Int] = {
    if (stage.isShuffleMap) {
      (0 until stage.numPartitions).filter(id => stage.outputLocs(id) == Nil)
    } else {
      val job = stage.resultOfJob.get
      (0 until job.numPartitions).filter(id => !job.finished(id))
    }
  }

  val properties = if (jobIdToActiveJob.contains(jobId)) {
    jobIdToActiveJob(stage.jobId).properties
  } else {
    // this stage will be assigned to "default" pool
    null
  }

  runningStages += stage
  // SparkListenerStageSubmitted should be posted before testing whether tasks are
  // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
  // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
  // event.
  stage.latestInfo = StageInfo.fromStage(stage, Some(partitionsToCompute.size))
  outputCommitCoordinator.stageStart(stage.id)
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
  // the serialized copy of the RDD and for each task we will deserialize it, which means each
  // task gets a different copy of the RDD. This provides stronger isolation between tasks that
  // might modify state of objects referenced in their closures. This is necessary in Hadoop
  // where the JobConf/Configuration object is not thread-safe.
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] =
      if (stage.isShuffleMap) {
        closureSerializer.serialize((stage.rdd, stage.shuffleDep.get) : AnyRef).array()
      } else {
        closureSerializer.serialize((stage.rdd, stage.resultOfJob.get.func) : AnyRef).array()
      }
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString)
      runningStages -= stage
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}")
      runningStages -= stage
      return
  }

  val tasks: Seq[Task[_]] = if (stage.isShuffleMap) {
    partitionsToCompute.map { id =>
      val locs = getPreferredLocs(stage.rdd, id)
      val part = stage.rdd.partitions(id)
      new ShuffleMapTask(stage.id, taskBinary, part, locs)
    }
  } else {
    val job = stage.resultOfJob.get
    partitionsToCompute.map { id =>
      val p: Int = job.partitions(id)
      val part = stage.rdd.partitions(p)
      val locs = getPreferredLocs(stage.rdd, p)
      new ResultTask(stage.id, taskBinary, part, locs, id)
    }
  }

  if (tasks.size > 0) {
    logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
    stage.pendingTasks ++= tasks
    logDebug("New pending tasks: " + stage.pendingTasks)
    taskScheduler.submitTasks(
      new TaskSet(tasks.toArray, stage.id, stage.newAttemptId(), stage.jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should post
    // SparkListenerStageCompleted here in case there are no tasks to run.
    outputCommitCoordinator.stageEnd(stage.id)
    listenerBus.post(SparkListenerStageCompleted(stage.latestInfo))
    logDebug("Stage " + stage + " is actually done; %b %d %d".format(
      stage.isAvailable, stage.numAvailableOutputs, stage.numPartitions))
    runningStages -= stage
  }
}
------------------------------------------------------------------------------
Once the Task type and Task count are determined, what remains is to dispatch the Tasks to the Executors, which start the corresponding threads to execute them. This is the transition from planning to actual execution.
TaskSchedulerImpl sends a ReviveOffers message to DriverActor (the backend); on receiving ReviveOffers, DriverActor calls the makeOffers handler.
------------------------------------------------------------------------------
// Make fake resource offers on all executors
def makeOffers() {
  launchTasks(scheduler.resourceOffers(executorDataMap.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toSeq))
}
------------------------------------------------------------------------------
makeOffers works as follows:
(1) find the idle Executors; the dispatch strategy is random, i.e. tasks are spread as evenly as possible across the Executors;
(2) if idle Executors exist, send part of the task list to the chosen Executors via launchTasks.
The resource-allocation work itself is done by the resourceOffers function.
-----------------------------------------------------------------------------------
/**
 * Called by cluster manager to offer resources on slaves. We respond by asking our active task
 * sets for tasks in order of priority. We fill each node with tasks in a round-robin manner so
 * that tasks are balanced across the cluster.
 */
def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
  // Mark each slave as alive and remember its hostname
  // Also track if new executor is added
  var newExecAvail = false
  for (o <- offers) {
    executorIdToHost(o.executorId) = o.host
    activeExecutorIds += o.executorId
    if (!executorsByHost.contains(o.host)) {
      executorsByHost(o.host) = new HashSet[String]()
      executorAdded(o.executorId, o.host)
      newExecAvail = true
    }
    for (rack <- getRackForHost(o.host)) {
      hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
    }
  }

  // Randomly shuffle offers to avoid always placing tasks on the same set of workers.
  val shuffledOffers = Random.shuffle(offers)
  // Build a list of tasks to assign to each worker.
  val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
  val availableCpus = shuffledOffers.map(o => o.cores).toArray
  val sortedTaskSets = rootPool.getSortedTaskSetQueue
  for (taskSet <- sortedTaskSets) {
    logDebug("parentName: %s, name: %s, runningTasks: %s".format(
      taskSet.parent.name, taskSet.name, taskSet.runningTasks))
    if (newExecAvail) {
      taskSet.executorAdded()
    }
  }
  // ... (the remainder of resourceOffers, which assigns tasks per locality level, is omitted here)
---------------------------------------------------------------------------

2.3 Task Execution

The LaunchTask message is received by the Executor, which handles it with launchTask.
Note that if the Executor has not registered itself with the Driver, it will not do anything even when it receives the LaunchTask command.
---------------------------------------------------------------------------
def launchTask(
    context: ExecutorBackend,
    taskId: Long,
    attemptNumber: Int,
    taskName: String,
    serializedTask: ByteBuffer) {
  val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
    serializedTask)
  runningTasks.put(taskId, tr)
  threadPool.execute(tr)
}
------------------------------------------------------------------------------
val tr = new TaskRunner(context, taskId = taskId, attemptNumber = attemptNumber, taskName,
  serializedTask)
The deserialization work is done inside this TaskRunner, in its run method:
-------------------------------------------------------------------------------
override def run() {
  val deserializeStartTime = System.currentTimeMillis()
  Thread.currentThread.setContextClassLoader(replClassLoader)
  val ser = env.closureSerializer.newInstance()
  logInfo(s"Running $taskName (TID $taskId)")
  execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
  var taskStart: Long = 0
  startGCTime = gcTime

  try {
    val (taskFiles, taskJars, taskBytes) = Task.deserializeWithDependencies(serializedTask)
    updateDependencies(taskFiles, taskJars)
    task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
-------------------------------------------------------------------------------
Dependencies are resolved by:
 updateDependencies(taskFiles, taskJars)
-------------------------------------------------------------------------------

/**
 * Download any missing dependencies if we receive a new set of files and JARs from the
 * SparkContext. Also adds any new JARs we fetched to the class loader.
 */
private def updateDependencies(newFiles: HashMap[String, Long], newJars: HashMap[String, Long]) {
  lazy val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)
  synchronized {
    // Fetch missing dependencies
    for ((name, timestamp) <- newFiles if currentFiles.getOrElse(name, -1L) < timestamp) {
      logInfo("Fetching " + name + " with timestamp " + timestamp)
      // Fetch file with useCache mode, close cache for local mode.
      Utils.fetchFile(name, new File(SparkFiles.getRootDirectory), conf,
        env.securityManager, hadoopConf, timestamp, useCache = !isLocal)
      currentFiles(name) = timestamp
    }
    for ((name, timestamp) <- newJars) {
      val localName = name.split("/").last
      val currentTimeStamp = currentJars.get(name)
        .orElse(currentJars.get(localName))
        .getOrElse(-1L)
      if (currentTimeStamp < timestamp) {
        logInfo("Fetching " + name + " with timestamp " + timestamp)
        // Fetch file with useCache mode, close cache for local mode.
        Utils.fetchFile(name, new File(SparkFiles.getRootDirectory), conf,
          env.securityManager, hadoopConf, timestamp, useCache = !isLocal)
        currentJars(name) = timestamp
        // Add it to our class loader
        val url = new File(SparkFiles.getRootDirectory, localName).toURI.toURL
        if (!urlClassLoader.getURLs.contains(url)) {
          logInfo("Adding " + url + " to class loader")
          urlClassLoader.addURL(url)
        }
      }
    }
  }
}
------------------------------------------------------------------------
Utils.fetchFile downloads the required dependency files from the HttpFileServer; the dependency files were uploaded to the HttpFileServer at submit time. The supported storage backends are:

  • HttpFileServer
  • HDFS
  • local files
Before fetching any file, Utils.fetchFile first calls createTempDir to create a temporary directory for the downloaded files. The directory is specified through java.io.tmpdir and by default lives under /tmp.
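For context, a hedged sketch of how dependencies usually reach the file server in the first place and how a task retrieves them (the paths and file names below are placeholders):

-------------------------------------------------------------------------------

import org.apache.spark.SparkFiles

sc.addJar("/path/to/extra-lib.jar")      // shipped to executors and added to their class loader
sc.addFile("/path/to/lookup-table.csv")  // fetched by Utils.fetchFile before the task runs

sc.parallelize(1 to 2).foreach { _ =>
  // On the executor, resolve the local copy placed in the temporary directory.
  val localPath = SparkFiles.get("lookup-table.csv")
  println(localPath)
}
-------------------------------------------------------------------------------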

2.4 Shuffle Task

TaskRunner starts a new thread; that part is unremarkable. The question is how run invokes the user-defined processing function, that is, how the operations applied to an RDD actually take effect.



Let's look at the runTask implementation in ShuffleMapTask:
------------------------------------------------------------------------
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

  metrics = Some(context.taskMetrics)
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    return writer.stop(success = true).get
  } catch {
    case e: Exception =>
      try {
        if (writer != null) {
          writer.stop(success = false)
        }
      } catch {
        case e: Exception =>
          log.debug("Could not stop writer", e)
      }
      throw e
  }
}
----------------------------------------------------------
The iterator method is important here; this is how it is defined in RDD:
----------------------------------------------------------
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}
-------------------------------------------------------
private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointed) firstParent[T].iterator(split, context) else compute(split, context)
}
------------------------------------------------------
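A small sketch of the branch taken in iterator(): once an RDD is persisted, later uses go through CacheManager.getOrCompute instead of recomputing via computeOrReadCheckpoint (assuming an sc as before).

------------------------------------------------------

import org.apache.spark.storage.StorageLevel

val doubled = sc.parallelize(1 to 100, 4).map(_ * 2).persist(StorageLevel.MEMORY_ONLY)
doubled.count()   // first action: compute() runs and each partition is cached
doubled.count()   // later actions: iterator() serves the cached blocks via getOrCompute
------------------------------------------------------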
The compute path is fairly involved for a ShuffleMapTask, with quite a few layers of indirection; for a ResultTask it is much more direct.
Unlike ShuffleMapTask, runTask in ResultTask does not return a MapStatus but simply the value produced by func; the handleTaskCompletion function examined later shows why the results are handled this way.
------------------------------------------------------
private[spark] class ResultTask[T, U](
    stageId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient locs: Seq[TaskLocation],
    val outputId: Int)
  extends Task[U](stageId, partition.index) with Serializable {

  @transient private[this] val preferredLocs: Seq[TaskLocation] = {
    if (locs == null) Nil else locs.toSet.toSeq
  }

  override def runTask(context: TaskContext): U = {
    // Deserialize the RDD and the func using the broadcast variables.
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
      ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

    metrics = Some(context.taskMetrics)
    func(context, rdd.iterator(partition, context))
  }
--------------------------------------------------------------

2.5 Returning Results


2.6 WebUI


2.7 Metrics


2.8 Storage Mechanism
