DAGScheduler.submitStage builds the physical execution graph of a Spark application. It calls DAGScheduler.getMissingParentStages to find a stage's missing ancestor stages and adds them to the physical execution graph. If every partition of a dependent RDD is already stored in the BlockManager, that is, the RDD has been successfully cached, then that RDD and its ancestor RDDs are left out of the physical execution plan: no tasks are created to recompute the cached RDD or its ancestors. DAGScheduler.getMissingParentStages is defined as follows:
// Find all missing parent stages of the given stage; once a parent stage is found,
// do not keep searching further up for that parent's own ancestor stages.
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      /*
       * If every partition of this RDD is already cached in the BlockManager, then neither
       * this RDD nor its ancestor RDDs are used when building the physical execution plan;
       * in other words, no task will actually recompute the cached RDD or its ancestors.
       */
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        // Split stages according to the RDD's dependencies
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              /*
               * A parent stage is found here, so the parent RDD is not pushed onto the
               * waitingForVisit stack. getShuffleMapStage also registers every ancestor
               * stage on the shufDep dependency chain in the
               * DAGScheduler.shuffleToMapStage HashMap.
               */
              val mapStage = getShuffleMapStage(shufDep, stage.firstJobId)
              /*
               * isAvailable returns true once all of the map stage's output partitions have
               * been computed and registered, i.e. after its tasks have completed. In that
               * case the parent is not missing, missing stays empty, and submitStage can
               * go ahead and submit this stage's own tasks.
               */
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            case narrowDep: NarrowDependency[_] =>
              // No parent stage found here; push the parent RDD onto the waitingForVisit stack
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    /*
     * Pop an RDD off the waitingForVisit stack and continue the dependency
     * analysis and stage splitting.
     */
    visit(waitingForVisit.pop())
  }
  missing.toList
}
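The effect of this pruning is easy to observe from the application side. The driver program below is a hypothetical sketch, not part of the DAGScheduler source: it caches the RDD produced by a shuffle, so the first action computes and caches it, and for the second action getMissingParentStages finds no uncached partitions upstream, so the ancestor shuffle map stage is reported as skipped in the web UI instead of being resubmitted.

import org.apache.spark.{SparkConf, SparkContext}

object CachePruningDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-pruning-demo").setMaster("local[2]"))

    // reduceByKey introduces a shuffle, i.e. a stage boundary; cache the post-shuffle RDD
    val counts = sc.parallelize(1 to 1000)
      .map(x => (x % 10, 1))
      .reduceByKey(_ + _)
      .cache()

    counts.count()  // Job 1: runs the shuffle map stage and caches the result
    counts.count()  // Job 2: every partition is cached, so the map stage is skipped

    sc.stop()
  }
}

The rddHasUncachedPartitions check above relies on DAGScheduler.getCacheLocs, which is defined as follows: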
def getCacheLocs(rdd: RDD[_]): Seq[Seq[TaskLocation]] = cacheLocs.synchronized {
  // Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times
  if (!cacheLocs.contains(rdd.id)) {
    // Note: if the storage level is NONE, we don't need to get locations from block manager.
    val locs: Seq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) {
      Seq.fill(rdd.partitions.size)(Nil)
    } else {
      /*
       * The RDD has been marked for caching; build the RDDBlockId of each of its partitions.
       */
      val blockIds = rdd.partitions.indices.map(index => RDDBlockId(rdd.id, index)).toArray[BlockId]
      // Ask the BlockManagerMaster on the driver which node's BlockManager stores each
      // partition; the nodes holding those BlockManagers become the preferred nodes for
      // launching the tasks.
      blockManagerMaster.getLocations(blockIds).map { bms =>
        bms.map(bm => TaskLocation(bm.host, bm.executorId))
      }
    }
    cacheLocs(rdd.id) = locs
  }
  cacheLocs(rdd.id)
}
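Two details of getCacheLocs are worth noting: the result is memoized in the cacheLocs HashMap because the method is called O(num tasks) times, and the locations of each partition are expressed as a Seq[TaskLocation], where an empty Seq (Nil) means the partition is not cached anywhere. The standalone sketch below mirrors this memoize-under-a-lock pattern; Loc and LocationCache are hypothetical stand-ins for the scheduler's internal types, used only for illustration.

import scala.collection.mutable.HashMap

// Hypothetical stand-in for TaskLocation.
case class Loc(host: String, executorId: String)

class LocationCache(lookup: Int => Seq[Seq[Loc]]) {
  // rdd.id -> per-partition locations, mirroring DAGScheduler.cacheLocs
  private val cacheLocs = new HashMap[Int, Seq[Seq[Loc]]]

  // Same pattern as getCacheLocs: compute the locations once per RDD id, under a
  // lock, and serve every later call from the map.
  def get(rddId: Int): Seq[Seq[Loc]] = cacheLocs.synchronized {
    if (!cacheLocs.contains(rddId)) {
      cacheLocs(rddId) = lookup(rddId)
    }
    cacheLocs(rddId)
  }

  // Nil for any partition means "not cached"; this is exactly the
  // getCacheLocs(rdd).contains(Nil) test used by getMissingParentStages.
  def hasUncachedPartitions(rddId: Int): Boolean = get(rddId).contains(Nil)
}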
When DAGScheduler.submitMissingTasks creates the ShuffleMapTasks or ResultTasks of a stage, it uses the values stored in the DAGScheduler.cacheLocs HashMap to resolve each task's preferred locations.
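A simplified, self-contained sketch of that step is shown below; TaskLoc, TaskSketch and buildTasks are hypothetical names that only illustrate how the per-partition locations are resolved before the tasks are constructed, not the scheduler's real API.

// Hypothetical simplified types, for illustration only.
case class TaskLoc(host: String, executorId: String)
case class TaskSketch(stageId: Int, partition: Int, preferredLocs: Seq[TaskLoc])

def buildTasks(stageId: Int,
               partitionsToCompute: Seq[Int],
               getPreferredLocs: Int => Seq[TaskLoc]): Seq[TaskSketch] = {
  // submitMissingTasks first resolves a locality hint for every missing partition
  // (cached BlockManager locations win, see getPreferredLocsInternal below), then
  // creates one ShuffleMapTask or ResultTask per partition carrying that hint.
  val taskIdToLocations = partitionsToCompute.map(p => p -> getPreferredLocs(p)).toMap
  partitionsToCompute.map(p => TaskSketch(stageId, p, taskIdToLocations(p)))
}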
DAGScheduler.submitMissingTasks calls DAGScheduler.getPreferredLocs, which in turn calls DAGScheduler.getPreferredLocsInternal to determine the node on which a task should run. Inside getPreferredLocsInternal, DAGScheduler.getCacheLocs is called; if the partition is already cached in a BlockManager, the TaskLocation of that BlockManager is used as the node on which the task executes. DAGScheduler.getPreferredLocsInternal is defined as follows:
private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // If the partition has already been visited, no need to re-visit.
  // This avoids exponential path exploration.  SPARK-695
  if (!visited.add((rdd, partition))) {
    // Nil has already been returned for previously visited partitions.
    return Nil
  }
  // If the partition is cached, return the cache locations
  val cached = getCacheLocs(rdd)(partition)
  if (cached.nonEmpty) {
    // The nodes the partition is cached on become the task's execution nodes;
    // return the TaskLocation objects for those nodes.
    return cached
  }
  // If the RDD has some placement preferences (as is the case for input RDDs), get those
  /*
   * For a ShuffledRDD, preferredLocations is empty, so rddPrefs.nonEmpty is false.
   */
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
  if (rddPrefs.nonEmpty) {
    return rddPrefs.map(TaskLocation(_))
  }
  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      // If the RDD has narrow dependencies, pick the first partition of the first narrow dep
      // that has any placement preferences. Ideally we would choose based on transfer sizes,
      // but this will do for now.
      for (inPart <- n.getParents(partition)) {
        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
        if (locs != Nil) {
          return locs
        }
      }
    case s: ShuffleDependency[_, _, _] =>
      // For shuffle dependencies, pick locations which have at least REDUCER_PREF_LOCS_FRACTION
      // of data as preferred locations
      if (shuffleLocalityEnabled && rdd.partitions.size < SHUFFLE_PREF_REDUCE_THRESHOLD &&
          s.rdd.partitions.size < SHUFFLE_PREF_MAP_THRESHOLD) {
        // Get the preferred map output locations for this reducer
        /*
         * Use the map output sizes reported by the tasks of the map-side stage to decide
         * the locality of the reduce-side tasks: if the map output at one location accounts
         * for more than REDUCER_PREF_LOCS_FRACTION of all the data this reduce partition
         * will read, that location becomes a preferred TaskLocation for the reduce task.
         */
        val topLocsForReducer = mapOutputTracker.getLocationsWithLargestOutputs(s.shuffleId,
          partition, rdd.partitions.size, REDUCER_PREF_LOCS_FRACTION)
        if (topLocsForReducer.nonEmpty) {
          return topLocsForReducer.get.map(loc => TaskLocation(loc.host, loc.executorId))
        }
      }
    case _ =>
  }
  Nil
}
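To summarize the order in which getPreferredLocsInternal makes its decision, the standalone sketch below expresses the same ladder with hypothetical simplified types (an illustration, not the scheduler's API): cached block locations win, then the RDD's own placement preferences, then preferences inherited through a narrow dependency, and only then the optional shuffle-reduce locality from the MapOutputTracker.

// Hypothetical simplified model of the locality ladder, for illustration only.
case class Loc(host: String)

case class PartitionInfo(
    cachedOn: Seq[Loc],                   // from getCacheLocs / BlockManagerMaster
    rddPrefs: Seq[Loc],                   // from rdd.preferredLocations (e.g. HDFS block hosts)
    narrowParent: Option[PartitionInfo],  // first narrow-dependency parent partition
    reduceLocality: Seq[Loc])             // from MapOutputTracker, if the thresholds allow it

def preferredLocs(p: PartitionInfo): Seq[Loc] =
  if (p.cachedOn.nonEmpty) p.cachedOn
  else if (p.rddPrefs.nonEmpty) p.rddPrefs
  else p.narrowParent.map(preferredLocs).filter(_.nonEmpty).getOrElse(p.reduceLocality)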
When a subsequent job runs and its root RDD is an RDD that has already been cached, the task enters the CacheManager.getOrCompute path, which calls BlockManager.get to fetch the cached blocks and then continues with the rest of the processing. The source of CacheManager.getOrCompute is as follows:
def getOrCompute[T](
    rdd: RDD[T],
    partition: Partition,
    context: TaskContext,
    storageLevel: StorageLevel): Iterator[T] = {
  // Build the id of the storage block from the RDD id and the partition index
  val key = RDDBlockId(rdd.id, partition.index)
  logDebug(s"Looking for partition $key")
  blockManager.get(key) match {
    /*
     * The block has already been cached, so read its data back from the BlockManager.
     */
    case Some(blockResult) =>
      // Partition is already materialized, so just return its values
      val existingMetrics = context.taskMetrics
        .getInputMetricsForReadMethod(blockResult.readMethod)
      existingMetrics.incBytesRead(blockResult.bytes)

      val iter = blockResult.data.asInstanceOf[Iterator[T]]
      new InterruptibleIterator[T](context, iter) {
        override def next(): T = {
          existingMetrics.incRecordsRead(1)
          delegate.next()
        }
      }
    case None =>
      // Acquire a lock for loading this partition
      // If another thread already holds the lock, wait for it to finish return its results
      /*
       * The block has not been cached yet: compute this partition's data and save it to
       * the BlockManager.
       */
      val storedValues = acquireLockForPartition[T](key)
      if (storedValues.isDefined) {
        return new InterruptibleIterator[T](context, storedValues.get)
      }

      // Otherwise, we have to load the partition ourselves
      try {
        logInfo(s"Partition $key not found, computing it")
        // Compute the data of this partition
        val computedValues = rdd.computeOrReadCheckpoint(partition, context)

        // If the task is running locally, do not persist the result
        if (context.isRunningLocally) {
          return computedValues
        }

        // Otherwise, cache the values and keep track of any updates in block statuses
        val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
        // Cache this partition's data in the BlockManager
        val cachedValues = putInBlockManager(key, computedValues, storageLevel, updatedBlocks)
        val metrics = context.taskMetrics
        val lastUpdatedBlocks = metrics.updatedBlocks.getOrElse(Seq[(BlockId, BlockStatus)]())
        metrics.updatedBlocks = Some(lastUpdatedBlocks ++ updatedBlocks.toSeq)
        new InterruptibleIterator(context, cachedValues)
      } finally {
        loading.synchronized {
          loading.remove(key)
          loading.notifyAll()
        }
      }
  }
}
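For orientation, the entry point into this path is RDD.iterator, which every task calls for each partition it processes. In the Spark 1.x code base it dispatches to CacheManager.getOrCompute whenever the RDD has a storage level other than NONE, roughly as in the simplified sketch below (shown for context, not as the exact source):

// Simplified sketch of RDD.iterator: persisted RDDs go through the CacheManager
// (and therefore through BlockManager.get above); everything else is computed
// directly or read from a checkpoint.
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}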