Using the WordCount example to walk through how a Spark job is triggered.
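For reference, here is a minimal, self-contained version of the job traced below; the app name and input path are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val linesRDD = sc.textFile("hdfs://...")                       // HadoopRDD -> MapPartitionsRDD
    val wordsRDD = linesRDD.flatMap(line => line.split(" "))       // MapPartitionsRDD
    val pairsRDD = wordsRDD.map(word => (word, 1))                 // MapPartitionsRDD
    val countRDD = pairsRDD.reduceByKey(_ + _)                     // ShuffledRDD (shuffle boundary)
    countRDD.foreach(count => println(count._1 + ":" + count._2))  // action: triggers the job

    sc.stop()
  }
}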
Code: val linesRDD = sc.textFile("hdfs://...")
The textFile method in SparkContext
/**
* The hadoopFile call creates a HadoopRDD whose elements are (key, value) pairs:
* the key is the offset of each line within the HDFS/text file, and the value is the line itself.
* A map is then applied to drop the key and keep only the value, yielding a MapPartitionsRDD whose elements are the individual text lines.
*/
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
assertNotStopped()
// Hadoop MapReduce TextInputFormat: one way of reading text data; LongWritable is the offset of the line read, Text is the line content
// In the map, pair is the (offset, line) tuple produced above, and pair._2.toString extracts the line text
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString)
}
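As the body above shows, textFile is just hadoopFile plus a map that drops the key; the same lines RDD could be built by hand (the path is a placeholder):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val lines = sc
  .hadoopFile("hdfs://...", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map(pair => pair._2.toString)   // keep only the line text, drop the byte offset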
The hadoopFile method
def hadoopFile[K, V](
path: String,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
assertNotStopped()
// A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
// The HadoopRDD created here reads the configuration later; since it was broadcast above, every worker can read it locally
new HadoopRDD(
this,
confBroadcast,
Some(setInputPathsFunc),
inputFormatClass,
keyClass,
valueClass,
minPartitions).setName(path)
}
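The broadcast of the Hadoop configuration above uses the same Broadcast mechanism available to user code; a small hedged usage sketch (the stop-word set is a made-up example):

val stopWords = sc.broadcast(Set("a", "an", "the"))
val keptWords = linesRDD
  .flatMap(line => line.split(" "))
  .filter(word => !stopWords.value.contains(word))   // each task reads the broadcast copy locally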
The .map() in hadoopFile(...).map(...) ultimately calls the map method in RDD.scala
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
Code: val wordsRDD = linesRDD.flatMap(line => line.split(" "))
The flatMap method in RDD.scala
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
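Both map and flatMap are thin wrappers that hand an iterator-level function to MapPartitionsRDD; the same result can be written directly with mapPartitions, roughly as follows (a sketch; wordsRDD2 is just an illustrative name):

val wordsRDD2 = linesRDD.mapPartitions(iter => iter.flatMap(line => line.split(" ")))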
Code: val pairsRDD = wordsRDD.map(word => (word, 1))
The map method in RDD.scala (same as above)
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
Code: val countRDD = pairsRDD.reduceByKey(_ + _)
RDD itself has no reduceByKey method. An implicit conversion (comparable to wrapping an object in Java) kicks in: the compiler looks up an implicit conversion defined in RDD.scala.
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
(implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
new PairRDDFunctions(rdd)
}
This creates a PairRDDFunctions instance: calling reduceByKey on a MapPartitionsRDD triggers Scala's implicit search in the enclosing scope, the MapPartitionsRDD is wrapped in PairRDDFunctions, and its reduceByKey method is invoked.
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKey[V]((v: V) => v, func, func, partitioner)
}
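What the compiler inserts is, in effect, an explicit call to that conversion. A sketch of the desugared form, assuming the implicit lives in object RDD as in recent Spark versions (countRDD2 is just an illustrative name):

import org.apache.spark.rdd.RDD

val countRDD2 = RDD.rddToPairRDDFunctions(pairsRDD).reduceByKey(_ + _)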
Code: countRDD.foreach(count => println(count._1 + ":" + count._2))
The foreach method in RDD.scala
def foreach(f: T => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
Several runJob overloads are called in a nested chain:
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
runJob(rdd, func, 0 until rdd.partitions.length)
}
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: Iterator[T] => U,
partitions: Seq[Int]): Array[U] = {
val cleanedFunc = clean(func)
runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
}
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int]): Array[U] = {
val results = new Array[U](partitions.size)
runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
results
}
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit): Unit = {
if (stopped.get()) {
throw new IllegalStateException("SparkContext has been shutdown")
}
val callSite = getCallSite
val cleanedFunc = clean(func)
logInfo("Starting job: " + callSite.shortForm)
if (conf.getBoolean("spark.logLineage", false)) {
logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
}
// Key step: call runJob on the DAGScheduler that was created when the SparkContext was initialized
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
progressBar.foreach(_.finishAll())
rdd.doCheckpoint()
}
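The same entry point can be called directly from user code. A minimal sketch using the first overload above to count the elements of each partition of countRDD:

val perPartitionSizes: Array[Int] =
  sc.runJob(countRDD, (iter: Iterator[(String, Int)]) => iter.size)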
Because Spark operators are normally chained, the question is how to execute such a chain. Spark's strategy is to first divide the operators into stages and only then run the computation.
Algorithm summary: starting from the RDD on which the action was triggered, the scheduler works backwards. It first creates a stage (stage 1) for the last RDD; while walking backwards, whenever it finds a wide dependency on some RDD, it creates a new stage (stage 0) for that RDD, which becomes the last RDD of the new stage. It keeps walking backwards in this way, splitting stages at wide dependencies and keeping narrow dependencies within the same stage, until every RDD has been visited.
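A much-simplified sketch of that backward traversal, for illustration only (SimpleStage and buildStage are hypothetical names; the real logic, including de-duplication of shuffle stages by shuffleId, lives in submitStage/getMissingParentStages below):

import scala.collection.mutable
import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.rdd.RDD

case class SimpleStage(lastRdd: RDD[_], parents: List[SimpleStage])

def buildStage(rdd: RDD[_]): SimpleStage = {
  val parentStages = mutable.ListBuffer[SimpleStage]()
  val visited = mutable.HashSet[RDD[_]]()
  val toVisit = mutable.Stack[RDD[_]](rdd)
  while (toVisit.nonEmpty) {
    val r = toVisit.pop()
    if (visited.add(r)) {
      r.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          // wide dependency: its RDD becomes the last RDD of a new parent stage
          parentStages += buildStage(shuffleDep.rdd)
        case narrowDep: NarrowDependency[_] =>
          // narrow dependency: stays inside the current stage, keep walking backwards
          toVisit.push(narrowDep.rdd)
      }
    }
  }
  SimpleStage(rdd, parentStages.toList)
}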
Source code walkthrough
Step 1: the runJob method of the DAGScheduler class
Source location: org.apache.spark.scheduler.DAGScheduler
def runJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): Unit = {
val start = System.nanoTime
val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
waiter.completionFuture.value.get match {
case scala.util.Success(_) =>
logInfo("Job %d finished: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
case scala.util.Failure(exception) =>
logInfo("Job %d failed: %s, took %f s".format
(waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
// SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
val callerStackTrace = Thread.currentThread().getStackTrace.tail
exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
throw exception
}
}
Step 2: the submitJob method of the DAGScheduler class
private[spark] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
taskScheduler.setDAGScheduler(this)
......
def submitJob[T, U](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
callSite: CallSite,
resultHandler: (Int, U) => Unit,
properties: Properties): JobWaiter[U] = {
// Check to make sure we are not launching a task on a partition that does not exist.
val maxPartitions = rdd.partitions.length
partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
throw new IllegalArgumentException(
"Attempting to access a non-existent partition: " + p + ". " +
"Total number of partitions: " + maxPartitions)
}
val jobId = nextJobId.getAndIncrement()
if (partitions.size == 0) {
// Return immediately if the job is running 0 tasks
return new JobWaiter[U](this, jobId, 0, resultHandler)
}
assert(partitions.size > 0)
val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
// Post a JobSubmitted event to eventProcessLoop (the event/message loop)
eventProcessLoop.post(JobSubmitted(
jobId, rdd, func2, partitions.toArray, callSite, waiter,
SerializationUtils.clone(properties)))
waiter
}
// Encapsulates which partitions need to be computed, the JobListener for job callbacks, and so on
private[scheduler] case class JobSubmitted(
jobId: Int,
finalRDD: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties = null)
extends DAGSchedulerEvent
......
Step 3: the post method of eventProcessLoop (EventLoop)
// A double-ended blocking queue; elements can be added or removed at either end
private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()
private val stopped = new AtomicBoolean(false)
// Exposed for testing.
private[spark] val eventThread = new Thread(name) {
setDaemon(true)
override def run(): Unit = {
try {
while (!stopped.get) {
// Receive events. onReceive is not implemented here; the concrete implementation is DAGSchedulerEventProcessLoop#onReceive
val event = eventQueue.take() // take the next posted event off the queue
try {
onReceive(event)
} catch {
case NonFatal(e) =>
try {
onError(e)
} catch {
case NonFatal(e) => logError("Unexpected error in " + name, e)
}
}
}
} catch {
case ie: InterruptedException => // exit even if eventQueue is not empty
case NonFatal(e) => logError("Unexpected error in " + name, e)
}
}
}
......
def post(event: E): Unit = {
eventQueue.put(event)
}
Step 3 (continued): the onReceive method
protected def onReceive(event: E): Unit // abstract; the subclass's implementation is invoked
Step 3 (continued): the onReceive implementation in the subclass DAGSchedulerEventProcessLoop
Source location: org.apache.spark.scheduler.DAGSchedulerEventProcessLoop (defined in DAGScheduler.scala)
/**
* The main event loop of the DAG scheduler.
*/
override def onReceive(event: DAGSchedulerEvent): Unit = {
val timerContext = timer.time()
try {
doOnReceive(event)
} finally {
timerContext.stop()
}
}
Step 3 (continued): the doOnReceive method
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
......
}
The DAGScheduler starts an EventLoop (message loop) thread that keeps taking events off the queue. Events are placed on the queue via the EventLoop's post method; when the thread takes an event it calls onReceive, which DAGSchedulerEventProcessLoop implements and which dispatches through doOnReceive, ultimately invoking dagScheduler.handleJobSubmitted.
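The same "post to a queue, consume on a daemon thread" pattern in miniature (illustrative only; MiniEventLoop is not Spark code and omits the error handling and shutdown logic shown above):

import java.util.concurrent.LinkedBlockingDeque

class MiniEventLoop[E](name: String)(handle: E => Unit) {
  private val queue = new LinkedBlockingDeque[E]()
  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = while (true) handle(queue.take())   // block until an event arrives
  }
  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event)
}

// Usage sketch:
// val loop = new MiniEventLoop[String]("mini-loop")(e => println(s"got $e"))
// loop.start(); loop.post("JobSubmitted")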
Step 4: the handleJobSubmitted method
/**
* Core entry point of DAGScheduler job scheduling
*/
private[scheduler] def handleJobSubmitted(jobId: Int,
finalRDD: RDD[_], // finalRDD is the last RDD, the one on which the action was triggered
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
callSite: CallSite,
listener: JobListener,
properties: Properties) {
// Create the final stage from the last RDD of the job
var finalStage: ResultStage = null
try {
// New stage creation may throw an exception if, for example, jobs are run on a
// HadoopRDD whose underlying HDFS files have been deleted.
// First: create a result (output) stage and add it to the DAGScheduler's internal in-memory caches
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
case e: BarrierJobSlotsNumberCheckFailed =>
logWarning(s"The job $jobId requires to run a barrier stage that requires more slots " +
"than the total number of slots in the cluster currently.")
// If jobId doesn't exist in the map, Scala converts its value null to 0: Int automatically.
val numCheckFailures = barrierJobIdToNumTasksCheckFailures.compute(jobId,
new BiFunction[Int, Int, Int] {
override def apply(key: Int, value: Int): Int = value + 1
})
if (numCheckFailures <= maxFailureNumTasksCheck) {
messageScheduler.schedule(
new Runnable {
override def run(): Unit = eventProcessLoop.post(JobSubmitted(jobId, finalRDD, func,
partitions, callSite, listener, properties))
},
timeIntervalNumTasksCheck,
TimeUnit.SECONDS
)
return
} else {
// Job failed, clear internal data.
barrierJobIdToNumTasksCheckFailures.remove(jobId)
listener.jobFailed(e)
return
}
case e: Exception =>
logWarning("Creating new stage failed due to exception - job: " + jobId, e)
listener.jobFailed(e)
return
}
// Job submitted, clear internal data.
barrierJobIdToNumTasksCheckFailures.remove(jobId)
// Second: create a job from finalStage, i.e. the last stage of this job
val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
clearCacheLocs()
logInfo("Got job %s (%s) with %d output partitions".format(
job.jobId, callSite.shortForm, partitions.length))
logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
logInfo("Parents of final stage: " + finalStage.parents)
logInfo("Missing parents: " + getMissingParentStages(finalStage))
val jobSubmissionTime = clock.getTimeMillis()
// Third: add the job to the in-memory caches
jobIdToActiveJob(jobId) = job
activeJobs += job
finalStage.setActiveJob(job)
val stageIds = jobIdToStageIds(jobId).toArray
val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
listenerBus.post(
SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
// Fourth: submit finalStage via submitStage()
submitStage(finalStage)
}
Step 4 (continued): the createResultStage method
/**
* Create a ResultStage associated with the provided jobId.
*/
private def createResultStage(
rdd: RDD[_],
func: (TaskContext, Iterator[_]) => _,
partitions: Array[Int],
jobId: Int,
callSite: CallSite): ResultStage = {
checkBarrierStageWithDynamicAllocation(rdd)
checkBarrierStageWithNumSlots(rdd)
checkBarrierStageWithRDDChainPattern(rdd, partitions.toSet.size)
// Get (or create) the parent stages
val parents = getOrCreateParentStages(rdd, jobId)
val id = nextStageId.getAndIncrement()
val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
stageIdToStage(id) = stage // register the stage in the stageIdToStage HashMap
updateJobIdStageIdMaps(jobId, stage) // update the jobId -> stageIds mapping
stage
}
Step 5: the submitStage method, the entry point of the stage-division algorithm
/** Submits stage, but first recursively submits any missing parents. */
/**
* This is the entry point of stage division; submitStage and getMissingParentStages together perform the split into stages
*
*/
private def submitStage(stage: Stage) {
val jobId = activeJobForStage(stage)
if (jobId.isDefined) {
logDebug("submitStage(" + stage + ")")
if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
// Call getMissingParentStages() to get the missing parent stages of the current stage, sorted by id
// in ascending order so they are computed from front to back, since the RDDs depend on one another
val missing = getMissingParentStages(stage).sortBy(_.id)
logDebug("missing: " + missing)
if (missing.isEmpty) {
logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
// No missing parents: submit this stage (e.g. stage 0) first; the remaining stages all sit in waitingStages at this point
submitMissingTasks(stage, jobId.get)
} else {
// Recurse into the missing parent stages first
for (parent <- missing) {
submitStage(parent)
}
// Put the current stage into the queue of stages waiting to run
waitingStages += stage
}
}
} else {
abortStage(stage, "No active job for stage " + stage.id, None)
}
}
Step 6: the getMissingParentStages method
/**
* Get the missing parent stages of a given stage.
*
* If all dependencies of the last RDD are narrow, no new shuffle stage is created;
* if a wide dependency is found, a new stage is created from that RDD and returned as a missing parent.
*
*/
private def getMissingParentStages(stage: Stage): List[Stage] = {
// A HashSet to collect the missing parent stages
val missing = new HashSet[Stage]
// RDDs that have already been visited; the traversal walks backwards from the last RDD, and each visited RDD is recorded here
val visited = new HashSet[RDD[_]]
// We are manually maintaining a stack here to prevent StackOverflowError
// caused by recursively visiting
// Only RDD lineage information is traversed here; nothing is actually computed at this point
val waitingForVisit = new ArrayStack[RDD[_]] // stack of RDDs still to be visited
def visit(rdd: RDD[_]) {
if (!visited(rdd)) {
visited += rdd // not visited yet, so record this RDD in the set
val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
if (rddHasUncachedPartitions) { // only recurse if some of this RDD's partitions are not cached
// walk through the RDD's dependencies
for (dep <- rdd.dependencies) {
dep match {
// For a wide (shuffle) dependency, create a stage from that RDD.
// The very last stage is a ResultStage rather than a ShuffleMapStage,
// but every stage before the finalStage is a ShuffleMapStage.
case shufDep: ShuffleDependency[_, _, _] =>
val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
if (!mapStage.isAvailable) {
missing += mapStage
}
// For a narrow dependency, push the dependency's RDD onto the stack
case narrowDep: NarrowDependency[_] =>
waitingForVisit.push(narrowDep.rdd)
}
}
}
}
}
// Push the stage's last RDD onto the stack
waitingForVisit.push(stage.rdd)
while (waitingForVisit.nonEmpty) {
// call the local visit function
visit(waitingForVisit.pop())
}
// Return all missing parent stages of this stage, to be submitted recursively later
missing.toList
}
Step 7: the getOrCreateShuffleMapStage method
private def getOrCreateShuffleMapStage(
shuffleDep: ShuffleDependency[_, _, _],
firstJobId: Int): ShuffleMapStage = {
shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
case Some(stage) =>
stage
case None =>
// Create stages for all missing ancestor shuffle dependencies.
getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
// Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
// that were not already in shuffleIdToMapStage, it's possible that by the time we
// get to a particular dependency in the foreach loop, it's been added to
// shuffleIdToMapStage by the stage creation process for an earlier dependency. See
// SPARK-13902 for more information.
if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
// create a ShuffleMapStage
createShuffleMapStage(dep, firstJobId)
}
}
// Finally, create a stage for the given shuffle dependency.
createShuffleMapStage(shuffleDep, firstJobId)
}
}
First iteration (the walkthrough uses an example DAG in which the final RDD G depends on RDD B through a narrow dependency and on RDD F through a wide dependency; RDD B in turn depends on RDD A through a wide dependency, and all of RDD F's parents are narrow dependencies):
1. RDD G, the last RDD of the final stage, is passed in and pushed onto the stack.
waitingForVisit.push(stage.rdd)
2. The stack is not empty, so RDD G is popped off and passed into the visit function.
while (waitingForVisit.nonEmpty) {
// call the local visit function
visit(waitingForVisit.pop())
}
3. RDD G has not been visited yet, so the body of the if runs.
def visit(rdd: RDD[_]) {
if (!visited(rdd)) {
4. RDD G is processed: it is added to visited.
visited += rdd
5. Iterate over the parent RDDs that RDD G depends on:
for (dep <- rdd.dependencies) {
dep match {
// For the wide dependency on RDD F, create a new stage and store it in missing
case shufDep: ShuffleDependency[_, _, _] =>
val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
if (!mapStage.isAvailable) {
missing += mapStage
}
// For the narrow dependency on RDD B, RDD B belongs to the same stage as RDD G, so push RDD B onto the stack
case narrowDep: NarrowDependency[_] =>
waitingForVisit.push(narrowDep.rdd)
}
}
RDD G therefore depends on two RDDs: the wide dependency on RDD F creates a new stage (Stage 2) that is added to missing, and the narrow dependency on RDD B pushes RDD B onto the stack.
Second iteration: the stack now holds RDD B.
RDD B's parent RDD is RDD A, and the dependency between them is wide, so another new stage, Stage 1, is created.
At this point missing contains two stages: Stage 1 and Stage 2.
submitStage is then called recursively for Stage 1; RDD A has no parent RDDs, so this stage is submitted.
// Recurse until reaching the earliest stage, which has no parent stage
for (parent <- missing) {
submitStage(parent)
}
submitStage is likewise called recursively for Stage 2; all of RDD F's parent dependencies are narrow, so no new stage is produced and those RDDs all belong to Stage 2.
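Applied to the WordCount job traced earlier, reduceByKey introduces the only ShuffleDependency, so the DAG splits into two stages; the shuffle boundary can be inspected from the lineage (the comments below are a sketch, and the exact toDebugString output varies by Spark version):

// ShuffleMapStage 0: HadoopRDD -> MapPartitionsRDD (textFile) -> flatMap -> map(word => (word, 1))
// ResultStage 1:     ShuffledRDD (reduceByKey) -> foreach action
println(countRDD.toDebugString)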
Once the stages are divided, tasks have to be created for each stage and placed as close to their data as possible (task best-location computation):
1. In submitMissingTasks, task locality is obtained by the code below.
// Submit a stage by creating a batch of tasks for it; the number of tasks equals the number of partitions
private def submitMissingTasks(stage: Stage, jobId: Int) {
logDebug("submitMissingTasks(" + stage + ")")
// First figure out the indexes of partition ids to compute.
// Figure out which partitions (and hence how many tasks) need to be computed
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
// Use the scheduling pool, job group, description, etc. from an ActiveJob associated
// with this Stage
val properties = jobIdToActiveJob(jobId).properties
// Add the stage to the runningStages set
runningStages += stage
// SparkListenerStageSubmitted should be posted before testing whether tasks are
// serializable. If tasks are not serializable, a SparkListenerStageCompleted event
// will be posted, which should always come after a corresponding SparkListenerStageSubmitted
// event.
stage match {
case s: ShuffleMapStage =>
outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
case s: ResultStage =>
outputCommitCoordinator.stageStart(
stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
}
// Compute task locality: the best-location algorithm for each task
val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
stage match {
case s: ShuffleMapStage =>
partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
case s: ResultStage =>
partitionsToCompute.map { id =>
val p = s.partitions(id)
(id, getPreferredLocs(stage.rdd, p))
}.toMap
}
} catch {
case NonFatal(e) =>
stage.makeNewStageAttempt(partitionsToCompute.size)
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
// If there are tasks to execute, record the submission time of the stage. Otherwise,
// post the event without the submission time, which indicates that this stage was
// skipped.
if (partitionsToCompute.nonEmpty) {
stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
}
listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
// TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
// Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
// the serialized copy of the RDD and for each task we will deserialize it, which means each
// task gets a different copy of the RDD. This provides stronger isolation between tasks that
// might modify state of objects referenced in their closures. This is necessary in Hadoop
// where the JobConf/Configuration object is not thread-safe.
var taskBinary: Broadcast[Array[Byte]] = null
var partitions: Array[Partition] = null
try {
// For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
// For ResultTask, serialize and broadcast (rdd, func).
var taskBinaryBytes: Array[Byte] = null
// taskBinaryBytes and partitions are both affected by the checkpoint status. We need
// this synchronization in case another concurrent job is checkpointing this RDD, so we get a
// consistent view of both variables.
RDDCheckpointData.synchronized {
taskBinaryBytes = stage match {
case stage: ShuffleMapStage =>
JavaUtils.bufferToArray(
closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
case stage: ResultStage =>
JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
partitions = stage.rdd.partitions
}
taskBinary = sc.broadcast(taskBinaryBytes)
} catch {
// In the case of a failure during serialization, abort the stage.
case e: NotSerializableException =>
abortStage(stage, "Task not serializable: " + e.toString, Some(e))
runningStages -= stage
// Abort execution
return
case NonFatal(e) =>
abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
// Create a different kind of task depending on the stage type
val tasks: Seq[Task[_]] = try {
val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
stage match {
case stage: ShuffleMapStage =>
stage.pendingPartitions.clear()
partitionsToCompute.map { id =>
// Create a ShuffleMapTask for each partition, along with its preferred locations
val locs = taskIdToLocations(id)
val part = partitions(id)
stage.pendingPartitions += id
// Every stage other than the finalStage gets ShuffleMapTasks
new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
}
case stage: ResultStage =>
partitionsToCompute.map { id =>
val p: Int = stage.partitions(id)
val part = partitions(p)
val locs = taskIdToLocations(id)
// For the finalStage (a ResultStage), create ResultTasks instead
new ResultTask(stage.id, stage.latestInfo.attemptNumber,
taskBinary, part, locs, id, properties, serializedTaskMetrics,
Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
stage.rdd.isBarrier())
}
}
} catch {
case NonFatal(e) =>
abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
runningStages -= stage
return
}
if (tasks.size > 0) {
logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
// Wrap the tasks in a TaskSet and submit it to the TaskScheduler
taskScheduler.submitTasks(new TaskSet(
tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
} else {
// Because we posted SparkListenerStageSubmitted earlier, we should mark
// the stage as completed here in case there are no tasks to run
markStageAsFinished(stage, None)
stage match {
case stage: ShuffleMapStage =>
logDebug(s"Stage ${stage} is actually done; " +
s"(available: ${stage.isAvailable}," +
s"available outputs: ${stage.numAvailableOutputs}," +
s"partitions: ${stage.numPartitions})")
// No tasks to run, so any map-stage jobs depending on this stage are marked as finished
markMapStageJobsAsFinished(stage)
case stage: ResultStage =>
logDebug(s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})")
}
submitWaitingChildStages(stage)
}
}
2. The getPreferredLocsInternal method
/**
* Recursive implementation for getPreferredLocs.
*
* This method is thread-safe because it only accesses DAGScheduler state through thread-safe
* methods (getCacheLocs()); please be careful when modifying this method, because any new
* DAGScheduler state accessed by it may require additional synchronization.
*
* Computes the best location for the partition that each task will process.
* Starting from the stage's last RDD, look for an RDD whose partition has been cached or checkpointed;
* the task's best location is then the location of that cached/checkpointed partition,
* because the task can run on that node without recomputing the earlier RDDs.
*/
private def getPreferredLocsInternal(
rdd: RDD[_],
partition: Int,
visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
// If the partition has already been visited, no need to re-visit.
// This avoids exponential path exploration. SPARK-695
// If this (rdd, partition) has already been visited, its TaskLocations have already been obtained; no need to do it again
if (!visited.add((rdd, partition))) {
// Nil has already been returned for previously visited partitions.
return Nil
}
// If the partition is cached, return the cache locations
// Check whether this partition of the current RDD is cached
val cached = getCacheLocs(rdd)(partition)
if (cached.nonEmpty) {
return cached
}
// If the RDD has some placement preferences (as is the case for input RDDs), get those
// Check whether this partition of the current RDD has been checkpointed (or has other placement preferences, e.g. HDFS block locations)
val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
if (rddPrefs.nonEmpty) {
return rddPrefs.map(TaskLocation(_))
}
// If the RDD has narrow dependencies, pick the first partition of the first narrow dependency
// that has any placement preferences. Ideally we would choose based on transfer sizes,
// but this will do for now.
// Finally, recurse into the RDD's parents (narrow dependencies) to see whether their corresponding partitions are cached or checkpointed
rdd.dependencies.foreach {
case n: NarrowDependency[_] =>
for (inPart <- n.getParents(partition)) {
val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
if (locs != Nil) {
return locs
}
}
case _ =>
}
// If, from the last RDD of the stage back to the first, no partition is cached or checkpointed,
// then the task's preferred locations (preferredLocs) are Nil
Nil
}
No matter which path is taken, the locality data for the very first computation ultimately comes from the RDD's preferredLocations method. Different RDDs implement preferredLocations differently, but the data essentially lives in one of three places: cached in executor memory, cached by HDFS, or on a host's (non-HDFS) storage, and each of these has a concrete TaskLocation implementation (a sketch of a custom RDD supplying such locations follows at the end of this section):
/**
* A location that includes both a host and an executor id on that host.
*/
// Data cached in the memory of a specific executor
private [spark]
case class ExecutorCacheTaskLocation(override val host: String, executorId: String)
extends TaskLocation {
override def toString: String = s"${TaskLocation.executorLocationTag}${host}_$executorId"
}
/**
* A location on a host.
*/
// Data on a host, e.g. on local disk (not in the HDFS cache)
private [spark] case class HostTaskLocation(override val host: String) extends TaskLocation {
override def toString: String = host
}
/**
* A location on a host that is cached by HDFS.
*/
// Data cached by HDFS on the host
private [spark] case class HDFSCacheTaskLocation(override val host: String) extends TaskLocation {
override def toString: String = TaskLocation.inMemoryLocationTag + host
}
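As noted above, the locality data for the first computation ultimately comes from each RDD's preferredLocations; a hedged sketch of a custom RDD that supplies its own preferred hosts (LocalityAwareRDD and HostedPartition are made-up names for illustration):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A partition that remembers the host its data lives on (hypothetical)
class HostedPartition(val index: Int, val host: String) extends Partition

class LocalityAwareRDD(sc: SparkContext, hosts: Seq[String]) extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => new HostedPartition(i, h) }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)

  // The hook that getPreferredLocsInternal ultimately consults via rdd.preferredLocations
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[HostedPartition].host)
}

// Usage sketch: new LocalityAwareRDD(sc, Seq("host1", "host2")).collect()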