SparkContext has two ways of submitting a job:
1. runJob, which I covered in the previous chapter
2. submitJob
Both hand the job to DAGScheduler; they correspond to the two entry points DAGScheduler exposes.
The difference is that DAGScheduler.runJob calls DAGScheduler.submitJob internally to obtain a JobWaiter and then blocks until the job finishes or fails, whereas the latter calls DAGScheduler.submitJob directly and returns, so it can be used asynchronously to check whether the job has completed or to cancel it. A small usage sketch follows below.
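To make the contrast concrete, here is a minimal sketch of how the two entry points look from the caller's side. It assumes an already-created SparkContext named sc and uses placeholder lambdas, so treat it as an illustration rather than canonical usage.

```scala
// assuming an existing SparkContext called `sc`
val rdd = sc.parallelize(1 to 100, numSlices = 4)

// runJob blocks until the job finishes or fails and returns one result per partition
val sizes: Array[Int] = sc.runJob(rdd, (it: Iterator[Int]) => it.size)

// submitJob returns immediately with a future-like handle (a FutureAction),
// so the caller can wait on it, poll it, or cancel the job
val action = sc.submitJob(
  rdd,
  (it: Iterator[Int]) => it.size,   // processPartition: runs on each partition
  0 until rdd.partitions.size,      // which partitions to run
  (index: Int, res: Int) => (),     // resultHandler: called as each partition finishes
  ()                                // resultFunc: overall result once all partitions are done
)
// action.cancel()                  // e.g. cancel the job asynchronously
```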
The call chain from there is:

eventProcessActor ! JobSubmitted --> DAGScheduler.handleJobSubmitted --> newStage

newStage creates a Stage object:
```scala
new Stage(id, rdd, numTasks, shuffleDep, getParentStages(rdd, jobId), jobId, callSite)

private[spark] class Stage(
    val id: Int,
    val rdd: RDD[_],
    val numTasks: Int,
    val shuffleDep: Option[ShuffleDependency[_,_]],  // Output shuffle if stage is a map stage
    val parents: List[Stage],
    val jobId: Int,
    callSite: Option[String])
```
As you can see, a Stage holds an RDD, and that RDD is the last RDD of the Stage.
So what are its parent Stages? The following code generates them:
```scala
private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
  val parents = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  def visit(r: RDD[_]) {  // walk the RDD dependency chain
    if (!visited(r)) {
      visited += r
      // Kind of ugly: need to register RDDs with the cache here since
      // we can't do it in its constructor because # of partitions is unknown
      for (dep <- r.dependencies) {
        dep match {
          case shufDep: ShuffleDependency[_,_] =>
            // a ShuffleDependency marks a stage boundary: the resulting stage
            // is added to this stage's list of parent stages
            parents += getShuffleMapStage(shufDep, jobId)
          case _ =>
            // otherwise keep walking up the RDD dependency chain
            visit(dep.rdd)
        }
      }
    }
  }
  visit(rdd)
  parents.toList
}
```
So every Stage has a list of parent Stages (List[Stage]); the process is illustrated in the figure below.
Note that the MapPartitionsRDD, ShuffledRDD and MapPartitionsRDD in the figure are produced by the reduceByKey transformation.
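You can see such a chain for yourself with toDebugString. A minimal sketch, assuming a SparkContext named sc (the exact RDD class names vary slightly between Spark versions):

```scala
// assuming an existing SparkContext `sc`
val words = sc.parallelize(Seq("a", "b", "a", "c"), 2)
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// toDebugString prints the lineage; the ShuffledRDD in the middle is exactly
// the ShuffleDependency boundary that getParentStages uses to split Stages
println(counts.toDebugString)
```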
Once the finalStage has been generated, it is time to submit the Stage:
```scala
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)  // check for parent stages that are not ready yet
      logDebug("missing: " + missing)
      if (missing == Nil) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)  // no missing parent stages: submit this stage directly
        runningStages += stage
      } else {
        for (parent <- missing) {
          submitStage(parent)  // otherwise recurse into the parents until the earliest stage is reached
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id)
  }
}
```
Submitting a stage is a recursive process: the parent stages are submitted first via submitStage, and the current stage is added to waitingStages; once a stage has no missing parent stages, its tasks are submitted. The toy sketch below strips out the Spark bookkeeping to show just this pattern.
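This is a toy sketch only (a hypothetical Node class, not Spark code) of the same parent-first submission pattern: a node "runs" only once it has no missing parents; otherwise its parents are submitted first and the node waits.

```scala
import scala.collection.mutable

case class Node(id: Int, parents: List[Node])

// `running` plays the role of runningStages, `waiting` of waitingStages
def submit(node: Node, running: mutable.Set[Int], waiting: mutable.Set[Int]): Unit = {
  val missing = node.parents.filterNot(p => running(p.id))  // parents not yet submitted
  if (missing.isEmpty) {
    running += node.id                            // nothing missing: submit this node's "tasks" now
  } else {
    missing.foreach(submit(_, running, waiting))  // recurse into the parents first
    waiting += node.id                            // park this node until its parents are handled
  }
}
```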
Next, look at the submitMissingTasks method:
```scala
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // Get our pending tasks and remember them in our pendingTasks entry
  val myPending = pendingTasks.getOrElseUpdate(stage, new HashSet)
  ......
  if (stage.isShuffleMap) {  // every stage except the final stage produces ShuffleMapTasks
    for (p <- 0 until stage.numPartitions if stage.outputLocs(p) == Nil) {
      val locs = getPreferredLocs(stage.rdd, p)
      tasks += new ShuffleMapTask(stage.id, stage.rdd, stage.shuffleDep.get, p, locs)
    }
  } else {
    // This is a final stage; figure out its job's missing partitions
    val job = resultStageToJob(stage)
    for (id <- 0 until job.numPartitions if !job.finished(id)) {
      val partition = job.partitions(id)
      val locs = getPreferredLocs(stage.rdd, partition)
      tasks += new ResultTask(stage.id, stage.rdd, job.func, partition, locs, id)
    }
  }
  ......
    taskScheduler.submitTasks(
      new TaskSet(tasks.toArray, stage.id, stage.newAttemptId(), stage.jobId, properties))
    stageToInfos(stage).submissionTime = Some(System.currentTimeMillis())
  } else {
    logDebug("Stage " + stage + " is actually done; %b %d %d".format(
      stage.isAvailable, stage.numAvailableOutputs, stage.numPartitions))
    runningStages -= stage
  }
}
```
There are two kinds of Task: ShuffleMapTask and ResultTask. Pay particular attention to the runTask method of each.
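As a rough conceptual sketch (hypothetical classes, not the real Spark ones): a ShuffleMapTask's runTask materialises shuffle output for one partition and reports where it lives, while a ResultTask's runTask applies the job's function to one partition and returns the value.

```scala
// purely illustrative stand-ins for ShuffleMapTask / ResultTask
sealed trait TaskSketch[+T] {
  def runTask(partition: Iterator[Int]): T
}

class ShuffleMapTaskSketch extends TaskSketch[String] {
  override def runTask(partition: Iterator[Int]): String = {
    // ... write shuffle files bucketed by the partitioner (elided) ...
    "MapStatus: the block manager address and sizes of the shuffle outputs"
  }
}

class ResultTaskSketch(func: Iterator[Int] => Int) extends TaskSketch[Int] {
  override def runTask(partition: Iterator[Int]): Int = func(partition)  // the user function runs here
}
```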
Finally the Tasks are handed to TaskSchedulerImpl; note that they are wrapped in a TaskSet before being submitted:
```scala
override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    val manager = new TaskSetManager(this, taskSet, maxTaskFailures)
    activeTaskSets(taskSet.id) = manager
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient memory")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT, STARVATION_TIMEOUT)
    }
    hasReceivedTask = true
  }
  backend.reviveOffers()
}
```
The TaskSet is wrapped in a TaskSetManager --> SchedulableBuilder.addTaskSetManager --> rootPool.addSchedulable, which adds the TaskSetManager to the root pool. TaskSetManager and Pool actually extend the same trait, Schedulable, yet the core interfaces of the two classes are completely different (personally I think this is a poor design). A Pool can therefore be understood as a container for TaskSetManagers, which may also hold other Pools.
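As a side note, which SchedulableBuilder wires up the rootPool (a flat FIFO queue versus a FAIR hierarchy of Pools) is driven by configuration. A hedged sketch, using real Spark property names but illustrative values and paths:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("scheduling-demo")
  .set("spark.scheduler.mode", "FAIR")                                    // FIFO (default) or FAIR
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")   // defines the Pools

// later, per job, the TaskSetManagers it produces can be routed into a named pool:
// sc.setLocalProperty("spark.scheduler.pool", "production")
```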
CoarseGrainedSchedulerBackend.reviveOffers --> DriverActor ! ReviveOffers --> makeOffers
```scala
def makeOffers() {
  launchTasks(scheduler.resourceOffers(
    executorHost.toArray.map { case (id, host) => new WorkerOffer(id, host, freeCores(id)) }))
}
```
Here the SchedulerBackend offers its resources to the TaskScheduler. First, look at TaskScheduler.resourceOffers, whose result is fed into launchTasks:
```scala
def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
  ......
  val sortedTaskSets = rootPool.getSortedTaskSetQueue  // TaskSetManagers ordered according to the scheduling mode
  ......
  // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
  // of locality levels so that it gets a chance to launch local tasks on all of them.
  var launchedTask = false
  for (taskSet <- sortedTaskSets; maxLocality <- TaskLocality.values) {
    do {
      launchedTask = false
      for (i <- 0 until shuffledOffers.size) {
        val execId = shuffledOffers(i).executorId
        val host = shuffledOffers(i).host
        if (availableCpus(i) >= CPUS_PER_TASK) {
          for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
            // hand the task with the best data locality to this worker
            ......
          }
        }
      }
    } while (launchedTask)
  }
  return tasks
}
```
First, a look at the structure of the TaskSetManager class:
```scala
private[spark] class TaskSetManager(
    sched: TaskSchedulerImpl,
    val taskSet: TaskSet,
    val maxTaskFailures: Int,
    clock: Clock = SystemClock)
  extends Schedulable with Logging {
  ......
  private val pendingTasksForExecutor = new HashMap[String, ArrayBuffer[Int]]
  private val pendingTasksForHost = new HashMap[String, ArrayBuffer[Int]]
  private val pendingTasksForRack = new HashMap[String, ArrayBuffer[Int]]
  ......
  for (i <- (0 until numTasks).reverse) {
    // runs when the object is constructed: records each task's locality in the collections above
    addPendingTask(i)
  }

  // Figure out which locality levels we have in our TaskSet, so we can do delay scheduling
  // collects the locality levels present in the collections above, ordered process -> node -> rack
  val myLocalityLevels = computeValidLocalityLevels()
  val localityWaits = myLocalityLevels.map(getLocalityWait)  // Time to wait at each level (read from configuration)

  // Delay scheduling variables: we keep track of our current locality level and the time we
  // last launched a task at that level, and move up a level when localityWaits[curLevel] expires.
  // We then move down if we manage to launch a "more local" task.
  var currentLocalityIndex = 0          // index into myLocalityLevels, starting at 0
  var lastLaunchTime = clock.getTime()  // time the last task was launched; if the time since then
                                        // exceeds the wait threshold, currentLocalityIndex is incremented
  ......
}
```
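The per-level waits returned by getLocalityWait come from configuration. A hedged sketch using the relevant property names (the values here are purely illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "3000")          // fallback wait applied to every level (ms)
  .set("spark.locality.wait.process", "3000")  // wait before giving up on PROCESS_LOCAL
  .set("spark.locality.wait.node", "3000")     // wait before giving up on NODE_LOCAL
  .set("spark.locality.wait.rack", "3000")     // wait before giving up on RACK_LOCAL
```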
Next, look at TaskSetManager.resourceOffer:
```scala
override def resourceOffer(
    execId: String,
    host: String,
    availableCpus: Int,
    maxLocality: TaskLocality.TaskLocality)
  : Option[TaskDescription] =
{
  if (tasksFinished < numTasks && availableCpus >= CPUS_PER_TASK) {
    // only if there are unfinished tasks and enough available cores
    val curTime = clock.getTime()

    var allowedLocality = getAllowedLocalityLevel(curTime)  // on timeout currentLocalityIndex is bumped; this returns the currently allowed locality level
    if (allowedLocality > maxLocality) {
      // never exceed the maxLocality the caller passed in
      allowedLocality = maxLocality  // We're not allowed to search for farther-away tasks
    }

    findTask(execId, host, allowedLocality) match {  // find a suitable task for this allowedLocality, execId and host
      case Some((index, taskLocality)) => {
        ......
        // Update our locality level for delay scheduling
        // update currentLocalityIndex with the chosen task's locality; it may decrease here,
        // because the caller's maxLocality can tighten the previously allowed locality
        currentLocalityIndex = getLocalityIndex(taskLocality)
        lastLaunchTime = curTime  // refresh lastLaunchTime
        // Serialize and return the task
        ......
        return Some(new TaskDescription(taskId, execId, taskName, index, serializedTask))  // finally return the scheduled task
      }
      case _ =>
    }
  }
  return None
}
```
To summarise: internally, TaskSetManager.resourceOffer automatically adjusts its own locality strategy based on when the previous task was successfully launched.
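That summary can be captured in a small, simplified sketch of the delay-scheduling rule implemented by getAllowedLocalityLevel (hypothetical signature, not the real method): stay at the current locality level until its configured wait expires, then fall back to the next, less local, level.

```scala
// waits: how long to wait at each locality level (best-first, e.g. process, node, rack, any)
def allowedLevelIndex(
    waits: Array[Long],
    numLevels: Int,
    currentIndex: Int,
    lastLaunchTime: Long,
    now: Long): Int = {
  var idx = currentIndex
  var last = lastLaunchTime
  // while the wait at the current level has expired and a less local level remains,
  // move one level further away from the data
  while (idx < numLevels - 1 && now - last >= waits(idx)) {
    last += waits(idx)
    idx += 1
  }
  idx
}
```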