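TaskSetManager implements the Schedulable trait and takes part in scheduling within the scheduling pool. It manages a TaskSet, which covers speculative execution, task locality, and allocating resources to tasks. TaskSchedulerImpl depends on TaskSetManager, so this article analyzes how TaskSetManager is implemented.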
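When DAGScheduler submits tasks to TaskScheduler, it packs multiple Task instances into a TaskSet. The TaskSet is the basic unit by which the scheduling pool schedules and manages tasks, and each one is managed by a TaskSetManager in the pool. It is defined as follows: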
//org.apache.spark.scheduler.TaskSet
private[spark] class TaskSet(
    val tasks: Array[Task[_]],
    val stageId: Int,
    val stageAttemptId: Int,
    val priority: Int,
    val properties: Properties) {
  val id: String = stageId + "." + stageAttemptId
  override def toString: String = "TaskSet " + id
}
The following looks at some of its member properties.
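In Hadoop 2.x.x, after an application submits a job to a YARN cluster, the job's tasks can finish at different times because of unbalanced load, uneven resource distribution, and similar causes. One task attempt may even run markedly slower than the other attempts of the same job. Left unoptimized, the slowest task attempt holds back the whole job. The MapReduce framework therefore provides speculative task execution: when necessary it starts a backup task, and the result of whichever finishes first, the backup or the original task, is used as the final result.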
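As in Hadoop, after a Spark application submits a job to a Spark cluster, slow tasks can hold back the whole job for similar reasons. To address this, Pool and TaskSetManager provide Spark's implementation of speculative execution. Their speculation-related operations fall into two groups: detecting and caching speculatable tasks, and picking a speculatable task from that cache to run speculatively. The checkSpeculatableTasks methods of Pool and TaskSetManager detect and cache speculatable tasks through a depth-first traversal. TaskSetManager's dequeueSpeculativeTask method then picks a speculatable task from the cache for speculative execution.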
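The depth-first detection described above can be pictured with a small model. This is only a sketch under simplifying assumptions: SchedulableSketch, LeafSketch, and PoolSketch are hypothetical stand-ins for Schedulable, TaskSetManager, and Pool, and the leaf's answer is hard-coded instead of being computed from task durations.

```scala
// Hypothetical, simplified model of the scheduling tree; not Spark's real classes.
object SpeculationTraversalSketch {
  trait SchedulableSketch {
    def checkSpeculatableTasks(): Boolean
  }

  // Leaf node: plays the role of a TaskSetManager.
  final class LeafSketch(hasSlowTask: Boolean) extends SchedulableSketch {
    var checked = false
    def checkSpeculatableTasks(): Boolean = {
      checked = true // a real leaf would cache its speculatable task indices here
      hasSlowTask
    }
  }

  // Inner node: plays the role of a Pool that holds child schedulables.
  final class PoolSketch(children: Seq[SchedulableSketch]) extends SchedulableSketch {
    def checkSpeculatableTasks(): Boolean = {
      var found = false
      // Visit every child without short-circuiting, so each leaf gets checked.
      for (child <- children) {
        found |= child.checkSpeculatableTasks()
      }
      found
    }
  }

  def main(args: Array[String]): Unit = {
    val slow = new LeafSketch(hasSlowTask = true)
    val fast = new LeafSketch(hasSlowTask = false)
    val root = new PoolSketch(Seq(new PoolSketch(Seq(slow)), fast))
    println(root.checkSpeculatableTasks()) // true
    println(fast.checked)                  // true
  }
}
```

The non-short-circuiting loop matters in this model: stopping at the first hit would leave later leaves without a chance to record their own candidates.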
3.1 checkSpeculatableTasks
Checks whether the current TaskSetManager has any tasks that need speculative execution.
//org.apache.spark.scheduler.TaskSetManager
override def checkSpeculatableTasks(): Boolean = {
  if (isZombie || numTasks == 1) {
    return false // no task can be speculated
  }
  var foundTasks = false
  val minFinishedForSpeculation = (SPECULATION_QUANTILE * numTasks).floor.toInt
  logDebug("Checking for speculative tasks: minFinished = " + minFinishedForSpeculation)
  if (tasksSuccessful >= minFinishedForSpeculation && tasksSuccessful > 0) {
    val time = clock.getTimeMillis()
    val durations = taskInfos.values.filter(_.successful).map(_.duration).toArray
    Arrays.sort(durations)
    val medianDuration = durations(min((0.5 * tasksSuccessful).round.toInt, durations.length - 1))
    val threshold = max(SPECULATION_MULTIPLIER * medianDuration, 100)
    logDebug("Task length threshold for speculation: " + threshold)
    for ((tid, info) <- taskInfos) { // scan taskInfos for tasks that meet the speculation criteria
      val index = info.index
      if (!successful(index) && copiesRunning(index) == 1 && info.timeRunning(time) > threshold &&
        !speculatableTasks.contains(index)) {
        logInfo(
          "Marking task %d in stage %s (on %s) as speculatable because it ran more than %.0f ms"
            .format(index, taskSet.id, info.host, threshold))
        speculatableTasks += index
        foundTasks = true
      }
    }
  }
  foundTasks
}
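To make the threshold arithmetic in checkSpeculatableTasks concrete, here is a minimal worked sketch. The quantile and multiplier are passed in as plain parameters because the article does not show how SPECULATION_QUANTILE and SPECULATION_MULTIPLIER are set; the values 0.75 and 1.5 used in main are assumptions for illustration, and SpeculationThresholdSketch is a hypothetical name.

```scala
// Hypothetical helper that reproduces only the arithmetic of the method above.
object SpeculationThresholdSketch {
  // Number of tasks that must have succeeded before speculation is considered.
  def minFinished(quantile: Double, numTasks: Int): Int =
    (quantile * numTasks).floor.toInt

  // Running time (ms) beyond which a still-running task is marked speculatable.
  // Assumes successfulDurations is non-empty.
  def threshold(successfulDurations: Array[Long], multiplier: Double): Double = {
    val sorted = successfulDurations.sorted
    val n = sorted.length
    val median = sorted(math.min((0.5 * n).round.toInt, n - 1))
    math.max(multiplier * median, 100.0) // never below 100 ms
  }

  def main(args: Array[String]): Unit = {
    println(minFinished(0.75, 10))                              // 7
    println(threshold(Array(1000L, 1200L, 900L, 1100L), 1.5))   // 1650.0
    println(threshold(Array(10L), 1.5))                         // 100.0
  }
}
```

With four successful tasks the index (0.5 * 4).round is 2, so the "median" is the third-smallest duration (1100 ms), and any task running longer than 1650 ms becomes a speculation candidate.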
3.2 dequeueSpeculativeTask
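Given a host, an executor, and a locality level, finds a task among the speculatable tasks that can run there, and returns its index in the TaskSet together with the corresponding locality level.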
protected def dequeueSpeculativeTask(execId: String, host: String, locality: TaskLocality.Value)
  : Option[(Int, TaskLocality.Value)] =
{
  speculatableTasks.retain(index => !successful(index)) // drop tasks that have already succeeded
  def canRunOnHost(index: Int): Boolean =
    !hasAttemptOnHost(index, host) && !executorIsBlacklisted(execId, index)
  if (!speculatableTasks.isEmpty) {
    for (index <- speculatableTasks if canRunOnHost(index)) {
      val prefs = tasks(index).preferredLocations
      val executors = prefs.flatMap(_ match {
        case e: ExecutorCacheTaskLocation => Some(e.executorId)
        case _ => None
      });
      if (executors.contains(execId)) { // found a task to speculate on the given executor
        speculatableTasks -= index
        return Some((index, TaskLocality.PROCESS_LOCAL))
      }
    }
    if (TaskLocality.isAllowed(locality, TaskLocality.NODE_LOCAL)) {
      for (index <- speculatableTasks if canRunOnHost(index)) {
        val locations = tasks(index).preferredLocations.map(_.host)
        if (locations.contains(host)) { // found a task to speculate on the local node
          speculatableTasks -= index
          return Some((index, TaskLocality.NODE_LOCAL))
        }
      }
    }
    if (TaskLocality.isAllowed(locality, TaskLocality.NO_PREF)) {
      for (index <- speculatableTasks if canRunOnHost(index)) {
        val locations = tasks(index).preferredLocations
        if (locations.size == 0) { // a task with no locality preference is speculated on the given executor
          speculatableTasks -= index
          return Some((index, TaskLocality.PROCESS_LOCAL))
        }
      }
    }
    if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
      for (rack <- sched.getRackForHost(host)) {
        for (index <- speculatableTasks if canRunOnHost(index)) {
          val racks = tasks(index).preferredLocations.map(_.host).flatMap(sched.getRackForHost)
          if (racks.contains(rack)) { // found a task to speculate on the local rack
            speculatableTasks -= index
            return Some((index, TaskLocality.RACK_LOCAL))
          }
        }
      }
    }
    if (TaskLocality.isAllowed(locality, TaskLocality.ANY)) {
      for (index <- speculatableTasks if canRunOnHost(index)) {
        speculatableTasks -= index // found a task that can be speculated on any node or rack
        return Some((index, TaskLocality.ANY))
      }
    }
  }
  None
}
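As in Hadoop, Spark takes data locality into account when handling tasks. Good data locality greatly reduces data transfer between nodes and improves execution efficiency. Spark currently supports five locality levels, from highest to lowest: PROCESS_LOCAL (same process), NODE_LOCAL (same node), NO_PREF (no preference), RACK_LOCAL (same rack), and ANY (anywhere).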
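Locality assignment prefers the higher locality levels and falls back to lower ones, down to ANY, only when a higher level cannot be satisfied. A TaskSet may have one or more locality levels, but each Task is assigned exactly one of them. All tasks in a TaskSet share the same set of allowed locality levels, but at run time, factors such as insufficient resources or elapsed time can cause tasks of the same TaskSet to end up running at different locality levels.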
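The locality operations implemented in TaskSetManager include computing the locality levels of a TaskSet, obtaining the wait time for a given locality level, and obtaining the allowed locality level when resources are allocated to a task.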
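The ordering of the five levels, and the TaskLocality.isAllowed check that dequeueSpeculativeTask relies on, can be modelled with a plain Scala Enumeration. This is a sketch, not Spark's own definition: the Locality object below is a hypothetical stand-in, and it assumes that isAllowed simply tests whether a level is no looser than the given constraint.

```scala
// Hypothetical stand-in for the locality enumeration; a smaller id means better locality.
object LocalityOrderSketch {
  object Locality extends Enumeration {
    val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
  }

  // Assumption: a level is allowed when it is at least as strict as the constraint.
  def isAllowed(constraint: Locality.Value, condition: Locality.Value): Boolean =
    condition <= constraint

  def main(args: Array[String]): Unit = {
    // With RACK_LOCAL as the loosest permitted level, NODE_LOCAL is fine but ANY is not.
    println(isAllowed(Locality.RACK_LOCAL, Locality.NODE_LOCAL)) // true
    println(isAllowed(Locality.RACK_LOCAL, Locality.ANY))        // false
  }
}
```

Declaring the values in order from strictest to loosest is what lets a single comparison express "this level or better".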
4.1 computeValidLocalityLevels
Computes the valid locality levels, so that tasks can be assigned to eligible executors in order of locality level, from highest to lowest.
private def computeValidLocalityLevels(): Array[TaskLocality.TaskLocality] = {
  import TaskLocality.{PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY}
  val levels = new ArrayBuffer[TaskLocality.TaskLocality]
  if (!pendingTasksForExecutor.isEmpty && getLocalityWait(PROCESS_LOCAL) != 0 &&
      pendingTasksForExecutor.keySet.exists(sched.isExecutorAlive(_))) {
    levels += PROCESS_LOCAL // the allowed levels include PROCESS_LOCAL
  }
  if (!pendingTasksForHost.isEmpty && getLocalityWait(NODE_LOCAL) != 0 &&
      pendingTasksForHost.keySet.exists(sched.hasExecutorsAliveOnHost(_))) {
    levels += NODE_LOCAL // the allowed levels include NODE_LOCAL
  }
  if (!pendingTasksWithNoPrefs.isEmpty) {
    levels += NO_PREF // the allowed levels include NO_PREF
  }
  if (!pendingTasksForRack.isEmpty && getLocalityWait(RACK_LOCAL) != 0 &&
      pendingTasksForRack.keySet.exists(sched.hasHostAliveOnRack(_))) {
    levels += RACK_LOCAL // the allowed levels include RACK_LOCAL
  }
  levels += ANY // ANY is always added to the allowed levels
  logDebug("Valid locality levels for " + taskSet + ": " + levels.mkString(", "))
  levels.toArray // return all allowed locality levels
}
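A small model helps show which levels survive under different cluster states. In this sketch the Pending and Alive case classes are hypothetical containers for what the real method reads from the pending-task maps and from the scheduler, levels are plain strings, and the wait lookup is passed in as a function.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified model of computeValidLocalityLevels.
object ValidLevelsSketch {
  final case class Pending(executors: Set[String], hosts: Set[String],
                           hasNoPrefTasks: Boolean, racks: Set[String])
  final case class Alive(executors: Set[String], hosts: Set[String], racks: Set[String])

  def validLevels(pending: Pending, alive: Alive, waitMs: String => Long): List[String] = {
    val levels = ArrayBuffer.empty[String]
    // A level is kept only if its wait is non-zero and some pending key is still alive.
    if (waitMs("PROCESS_LOCAL") != 0 && pending.executors.exists(alive.executors)) {
      levels += "PROCESS_LOCAL"
    }
    if (waitMs("NODE_LOCAL") != 0 && pending.hosts.exists(alive.hosts)) {
      levels += "NODE_LOCAL"
    }
    if (pending.hasNoPrefTasks) {
      levels += "NO_PREF"
    }
    if (waitMs("RACK_LOCAL") != 0 && pending.racks.exists(alive.racks)) {
      levels += "RACK_LOCAL"
    }
    levels += "ANY" // always present as the final fallback
    levels.toList
  }

  def main(args: Array[String]): Unit = {
    val pending = Pending(Set("exec-1"), Set("host-1", "host-2"),
      hasNoPrefTasks = false, racks = Set("rack-1"))
    // exec-1 is gone and no rack is known, but host-2 still has live executors.
    val alive = Alive(executors = Set.empty, hosts = Set("host-2"), racks = Set.empty)
    println(validLevels(pending, alive, _ => 3000L)) // List(NODE_LOCAL, ANY)
    println(validLevels(pending, alive, _ => 0L))    // List(ANY)
  }
}
```

The second call shows the effect of a zero wait: a level nobody is willing to wait for is dropped from the list entirely.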
4.2 getLocalityWait
Returns the wait time for a given locality level.
private def getLocalityWait(level: TaskLocality.TaskLocality): Long = {
  val defaultWait = conf.get("spark.locality.wait", "3s")
  val localityWaitKey = level match {
    case TaskLocality.PROCESS_LOCAL => "spark.locality.wait.process"
    case TaskLocality.NODE_LOCAL => "spark.locality.wait.node"
    case TaskLocality.RACK_LOCAL => "spark.locality.wait.rack"
    case _ => null
  }
  if (localityWaitKey != null) {
    conf.getTimeAsMs(localityWaitKey, defaultWait)
  } else {
    0L
  }
}
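The fallback from a per-level key to spark.locality.wait can be tried out without a Spark dependency. In the sketch below a plain Map stands in for the configuration object, levels are strings, and parseMs is a hypothetical helper that understands only the "ms" and "s" suffixes and treats a bare number as milliseconds; the real conf.getTimeAsMs accepts more formats.

```scala
// Hypothetical, dependency-free model of getLocalityWait.
object LocalityWaitSketch {
  // Minimal duration parser for this example only.
  def parseMs(s: String): Long =
    if (s.endsWith("ms")) s.dropRight(2).toLong
    else if (s.endsWith("s")) s.dropRight(1).toLong * 1000L
    else s.toLong

  def localityWait(conf: Map[String, String], level: String): Long = {
    val defaultWait = conf.getOrElse("spark.locality.wait", "3s")
    val key = level match {
      case "PROCESS_LOCAL" => Some("spark.locality.wait.process")
      case "NODE_LOCAL"    => Some("spark.locality.wait.node")
      case "RACK_LOCAL"    => Some("spark.locality.wait.rack")
      case _               => None // NO_PREF and ANY never wait
    }
    key.map(k => parseMs(conf.getOrElse(k, defaultWait))).getOrElse(0L)
  }

  def main(args: Array[String]): Unit = {
    val conf = Map("spark.locality.wait" -> "1s", "spark.locality.wait.node" -> "500ms")
    println(localityWait(conf, "PROCESS_LOCAL"))      // 1000 (falls back to spark.locality.wait)
    println(localityWait(conf, "NODE_LOCAL"))         // 500
    println(localityWait(conf, "ANY"))                // 0
    println(localityWait(Map.empty, "RACK_LOCAL"))    // 3000 (built-in default "3s")
  }
}
```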
4.3 getLocalityIndex
Finds the index in myLocalityLevels that corresponds to the given locality level.
def getLocalityIndex(locality: TaskLocality.TaskLocality): Int = {
  var index = 0
  while (locality > myLocalityLevels(index)) {
    index += 1
  }
  index
}
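Because myLocalityLevels may omit some levels, the loop in getLocalityIndex effectively returns the position of the first entry that is at least as loose as the requested level. The sketch below illustrates this with a hypothetical Locality enumeration and an explicit levels array; it relies on the array ending with ANY, which computeValidLocalityLevels always appends.

```scala
// Hypothetical stand-alone version of the getLocalityIndex loop.
object LocalityIndexSketch {
  object Locality extends Enumeration {
    val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
  }

  // Assumes levels is sorted from strictest to loosest and ends with ANY,
  // so the loop always stops before running off the end.
  def localityIndex(levels: Array[Locality.Value], locality: Locality.Value): Int = {
    var index = 0
    while (locality > levels(index)) {
      index += 1
    }
    index
  }

  def main(args: Array[String]): Unit = {
    import Locality._
    val levels = Array(NODE_LOCAL, RACK_LOCAL, ANY)
    println(localityIndex(levels, PROCESS_LOCAL)) // 0: stricter than every entry
    println(localityIndex(levels, NO_PREF))       // 1: first entry at least as loose is RACK_LOCAL
    println(localityIndex(levels, ANY))           // 2
  }
}
```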
4.4 getAllowedLocalityLevel
Returns the locality level that is currently allowed.
private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
  while (currentLocalityIndex < myLocalityLevels.length - 1) {
    val moreTasks = myLocalityLevels(currentLocalityIndex) match { // does the current level still have tasks to run?
      case TaskLocality.PROCESS_LOCAL => moreTasksToRunIn(pendingTasksForExecutor)
      case TaskLocality.NODE_LOCAL => moreTasksToRunIn(pendingTasksForHost)
      case TaskLocality.NO_PREF => pendingTasksWithNoPrefs.nonEmpty
      case TaskLocality.RACK_LOCAL => moreTasksToRunIn(pendingTasksForRack)
    }
    if (!moreTasks) {
      lastLaunchTime = curTime // no tasks at this level, so reset lastLaunchTime to curTime
      logDebug(s"No tasks for locality level ${myLocalityLevels(currentLocalityIndex)}, " +
        s"so moving to locality level ${myLocalityLevels(currentLocalityIndex + 1)}")
      currentLocalityIndex += 1
    } else if (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex)) {
      lastLaunchTime += localityWaits(currentLocalityIndex) // waited long enough, drop to the next lower level
      logDebug(s"Moving to ${myLocalityLevels(currentLocalityIndex + 1)} after waiting for " +
        s"${localityWaits(currentLocalityIndex)}ms")
      currentLocalityIndex += 1
    } else {
      return myLocalityLevels(currentLocalityIndex) // return the current locality level
    }
  }
  myLocalityLevels(currentLocalityIndex) // reached the last entry, so return the lowest locality level
}
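The method proceeds as follows:
1. While currentLocalityIndex has not reached the last entry of myLocalityLevels, check whether the current locality level still has tasks waiting to run.
2. If it has none, set lastLaunchTime to curTime and move on to the next locality level.
3. If it has tasks but the time elapsed since lastLaunchTime has reached the wait time of the current level, add that wait time to lastLaunchTime and move on to the next, lower locality level.
4. Otherwise, return the current locality level.
5. If the loop ends without returning, return the last entry of myLocalityLevels, which is the lowest locality level.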
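Whenever Spark determines the locality level for a task, it waits for a period tied to that level. Every task would like to be placed on a node where it can read its data locally, since that gives the greatest performance benefit. But a task's running time cannot be predicted in advance. If no resource satisfying the best locality (PROCESS_LOCAL) is available when a task is being placed, and the scheduler stubbornly holds out for one, the task may well be held up by long-running tasks that already occupy those resources. The code above therefore settles for second best: when the best locality is unavailable, it accepts slightly worse resources.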
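The fallback behaviour can be traced with a small simulation of the loop in getAllowedLocalityLevel. This is a sketch under stated assumptions: LocalityState is a hypothetical class, levels are strings, the pending-task check is a caller-supplied function, the clock starts at 0, and nothing here models what happens to the state when a task is actually launched.

```scala
// Hypothetical simulation of the wait-then-fall-back loop.
object DelaySchedulingSketch {
  final class LocalityState(levels: Array[String], waitsMs: Array[Long],
                            hasPending: String => Boolean) {
    private var currentIndex = 0
    private var lastLaunchTime = 0L

    def allowedLevel(curTime: Long): String = {
      while (currentIndex < levels.length - 1) {
        if (!hasPending(levels(currentIndex))) {
          lastLaunchTime = curTime // nothing to run here: skip the level immediately
          currentIndex += 1
        } else if (curTime - lastLaunchTime >= waitsMs(currentIndex)) {
          lastLaunchTime += waitsMs(currentIndex) // waited long enough: relax one level
          currentIndex += 1
        } else {
          return levels(currentIndex) // still within the wait: hold out for this level
        }
      }
      levels(currentIndex) // last entry: nothing left to fall back to
    }
  }

  def main(args: Array[String]): Unit = {
    val state = new LocalityState(
      Array("PROCESS_LOCAL", "NODE_LOCAL", "ANY"), Array(3000L, 3000L, 0L), _ => true)
    println(state.allowedLevel(1000L)) // PROCESS_LOCAL
    println(state.allowedLevel(3500L)) // NODE_LOCAL
    println(state.allowedLevel(6000L)) // ANY
  }
}
```

In this trace each level is held for its 3000 ms wait before the next lower one is admitted, which is the "settle for second best" behaviour the paragraph above describes.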