Contents:
1. TaskScheduler and SchedulerBackend
2. The FIFO and FAIR scheduling modes demystified
3. How Task data-locality-aware resource allocation is implemented in the source
==========Observing TaskScheduler internals by running a program in spark-shell============
First, run an example:
root@Master:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-shell --master spark://Master:7077,Worker1:7077,Worker2:7077
scala> sc.textFile("/historyserverforSpark/README.md", 3).flatMap(_.split(" ")).map(word=>(word,1)).reduceByKey(_+_,1).saveAsTextFile("/historyserverforSpark /output/onclick3")
16/02/20 15:32:47 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 212.8 KB, free 539.1 KB)
16/02/20 15:32:47 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 19.7 KB, free 558.8 KB)
16/02/20 15:32:47 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.145.131:53080 (size: 19.7 KB, free: 1247.2 MB)
16/02/20 15:32:47 INFO spark.SparkContext: Created broadcast 3 from textFile at <console>:28
16/02/20 15:32:49 INFO spark.SparkContext: Starting job: saveAsTextFile at <console>:28
16/02/20 15:32:49 INFO mapred.FileInputFormat: Total input paths to process : 1
16/02/20 15:32:49 INFO scheduler.DAGScheduler: Registering RDD 9 (map at <console>:28)
16/02/20 15:32:49 INFO scheduler.DAGScheduler: Got job 1 (saveAsTextFile at <console>:28) with 1 output partitions
16/02/20 15:32:49 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (saveAsTextFile at <console>:28)
16/02/20 15:32:49 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
16/02/20 15:32:49 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 2)
16/02/20 15:32:49 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[9] at map at <console>:28), which has no missing parents
16/02/20 15:32:49 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 4.1 KB, free 562.9 KB)
16/02/20 15:32:49 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 2.3 KB, free 565.2 KB)
16/02/20 15:32:49 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.145.131:53080 (size: 2.3 KB, free: 1247.2 MB)
16/02/20 15:32:49 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
16/02/20 15:32:49 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[9] at map at <console>:28)
16/02/20 15:32:49 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 3 tasks
16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, Master, partition 0,NODE_LOCAL, 2141 bytes)
16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 3, Worker1, partition 1,NODE_LOCAL, 2141 bytes)
16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 4, Master, partition 2,NODE_LOCAL, 2141 bytes)
16/02/20 15:32:49 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on Master:44388 (size: 2.3 KB, free: 511.1 MB)
16/02/20 15:32:49 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Master:44388 (size: 19.7 KB, free: 511.1 MB)
16/02/20 15:32:49 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 2.0 (TID 4) in 369 ms on Master (1/3)
16/02/20 15:32:49 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 394 ms on Master (2/3)
16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.145.131:53080 in memory (size: 19.7 KB, free: 1247.2 MB)
16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on Master:44388 in memory (size: 19.7 KB, free: 511.1 MB)
16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on 192.168.145.131:53080 in memory (size: 22.6 KB, free: 1247.2 MB)
16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on Master:44388 in memory (size: 22.6 KB, free: 511.1 MB)
16/02/20 15:32:50 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on Worker1:35104 (size: 2.3 KB, free: 511.1 MB)
16/02/20 15:32:50 INFO spark.ContextCleaner: Cleaned accumulator 2
16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.145.131:53080 in memory (size: 2.3 KB, free: 1247.2 MB)
16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on Master:44388 in memory (size: 2.3 KB, free: 511.1 MB)
16/02/20 15:32:50 INFO spark.ContextCleaner: Cleaned accumulator 1
16/02/20 15:32:50 INFO spark.ContextCleaner: Cleaned shuffle 0
16/02/20 15:32:52 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Worker1:35104 (size: 19.7 KB, free: 511.1 MB)
16/02/20 15:32:58 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 2.0 (TID 3) in 9579 ms on Worker1 (3/3)
16/02/20 15:32:58 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
16/02/20 15:32:58 INFO scheduler.DAGScheduler: ShuffleMapStage 2 (map at <console>:28) finished in 9.569 s
16/02/20 15:32:58 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/02/20 15:32:58 INFO scheduler.DAGScheduler: running: Set()
16/02/20 15:32:58 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 3)
16/02/20 15:32:58 INFO scheduler.DAGScheduler: failed: Set()
16/02/20 15:32:58 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[11] at saveAsTextFile at <console>:28), which has no missing parents
16/02/20 15:32:59 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 64.9 KB, free 303.8 KB)
16/02/20 15:32:59 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 22.6 KB, free 326.4 KB)
16/02/20 15:32:59 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.145.131:53080 (size: 22.6 KB, free: 1247.2 MB)
16/02/20 15:32:59 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
16/02/20 15:32:59 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[11] at saveAsTextFile at <console>:28)
16/02/20 15:32:59 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
16/02/20 15:32:59 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 5, Master, partition 0,NODE_LOCAL, 1894 bytes)
16/02/20 15:32:59 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on Master:44388 (size: 22.6 KB, free: 511.1 MB)
16/02/20 15:32:59 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to Master:47683
16/02/20 15:32:59 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 166 bytes
16/02/20 15:33:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 5) in 3598 ms on Master (1/1)
16/02/20 15:33:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
16/02/20 15:33:02 INFO scheduler.DAGScheduler: ResultStage 3 (saveAsTextFile at <console>:28) finished in 3.601 s
16/02/20 15:33:02 INFO scheduler.DAGScheduler: Job 1 finished: saveAsTextFile at <console>:28, took 13.339011 s
1. When we start spark-shell itself, the terminal output comes mainly from ClientEndpoint and SparkDeploySchedulerBackend. No Job has been triggered yet; we are only launching the Application itself, so the work consists of instantiating SparkContext, registering the current application with the Master, and obtaining ExecutorBackend compute resources from the cluster.
2. After the DAGScheduler has split the job into Stages, TaskSchedulerImpl manages all the tasks of the Stage to be run (the TaskSet) through a TaskSetManager. The TaskSetManager allocates compute resources to each Task based on locality awareness and monitors task execution (for example, retrying failed tasks and launching speculative copies of slow tasks).
Note the NODE_LOCAL entries in the log lines below: they show data locality at work, which is why the three tasks each ran on a machine that holds their data (in my setup the data is replicated three times, so all three machines have a copy):
16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, Master, partition 0,NODE_LOCAL, 2141 bytes)
16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 3, Worker1, partition 1,NODE_LOCAL, 2141 bytes)
16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 4, Master, partition 2,NODE_LOCAL, 2141 bytes)
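To see where those NODE_LOCAL decisions come from, you can print each input partition's preferred hosts directly in the same spark-shell session. A small sketch of my own (same path as the example above; RDD.preferredLocations is the public API that exposes the block hosts TaskSetManager uses):

// Paste into the same spark-shell session: prints each input partition's preferred hosts,
// i.e. the HDFS block locations that drive the NODE_LOCAL assignments seen in the log.
val rdd = sc.textFile("/historyserverforSpark/README.md", 3)
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} preferred locations: ${rdd.preferredLocations(p).mkString(", ")}")
}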
==========TaskScheduler and SchedulerBackend============
1. The overall flow of low-level task scheduling is as follows:
1) TaskSchedulerImpl.submitTasks: its main job is to hand the TaskSet to a TaskSetManager for management;
2) SchedulableBuilder.addTaskSetManager: the SchedulableBuilder determines the scheduling order among TaskSetManagers, and each TaskSetManager then uses locality awareness to determine which ExecutorBackend each of its Tasks runs on;
// default scheduler is FIFO
private val schedulingModeConf = conf.get("spark.scheduler.mode", "FIFO")
val schedulingMode: SchedulingMode = try {
  SchedulingMode.withName(schedulingModeConf.toUpperCase)
} catch {
  case e: java.util.NoSuchElementException =>
    throw new SparkException(s"Unrecognized spark.scheduler.mode: $schedulingModeConf")
}
def initialize(backend: SchedulerBackend) {
  this.backend = backend
  // temporarily set rootPool name to empty
  rootPool = new Pool("", schedulingMode, 0, 0)
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
    }
  }
  schedulableBuilder.buildPools()
}
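To make the mode-string handling above concrete, here is a minimal self-contained sketch of my own (not Spark source, names are illustrative) showing why SchedulingMode.withName on the upper-cased config value raises NoSuchElementException for an unrecognized mode:

// Standalone demo: Enumeration.withName only accepts exact member names, so the config
// value is upper-cased first and an unknown value surfaces as NoSuchElementException.
object SchedulingModeParseDemo {
  object SchedulingMode extends Enumeration {
    type SchedulingMode = Value
    val FAIR, FIFO, NONE = Value
  }

  def parse(raw: String): SchedulingMode.SchedulingMode =
    try {
      SchedulingMode.withName(raw.toUpperCase)
    } catch {
      case _: java.util.NoSuchElementException =>
        throw new IllegalArgumentException(s"Unrecognized spark.scheduler.mode: $raw")
    }

  def main(args: Array[String]): Unit = {
    println(parse("fifo"))  // FIFO
    println(parse("fair"))  // FAIR
    // parse("foo") would throw IllegalArgumentException
  }
}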
3) CoarseGrainedSchedulerBackend.reviveOffers: sends a ReviveOffers message to DriverEndpoint. ReviveOffers itself is an empty case object; it only acts as a trigger for the underlying resource scheduling, and it is sent whenever Tasks are submitted or compute resources change;
backend.reviveOffers()
override def reviveOffers() {
  driverEndpoint.send(ReviveOffers)
}
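As a sketch of this "empty message as trigger" pattern (my own illustration, not Spark's DriverEndpoint; all names here are invented), a data-less case object drives behavior purely by arriving:

// Demo of a payload-free trigger message alongside a message that carries data.
object TriggerMessageDemo {
  sealed trait SchedulerMessage
  case object ReviveOffers extends SchedulerMessage                     // trigger only, no payload
  case class StatusUpdate(taskId: Long, ok: Boolean) extends SchedulerMessage

  def receive(msg: SchedulerMessage): Unit = msg match {
    case ReviveOffers          => println("makeOffers(): re-run resource offers")
    case StatusUpdate(id, ok)  => println(s"task $id finished ok=$ok; free its cores, then revive offers")
  }

  def main(args: Array[String]): Unit = {
    receive(ReviveOffers)
    receive(StatusUpdate(42L, ok = true))
  }
}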
4) DriverEndpoint receives the ReviveOffers message and routes it to the makeOffers method. makeOffers first assembles all the workOffers available for computation (they represent the usable cores on every available ExecutorBackend);
5) TaskSchedulerImpl.resourceOffers allocates compute resources to each Task. Its input is the ExecutorBackends and their available cores; its output is a two-dimensional array of TaskDescription that records which ExecutorBackend each Task will run on. How does resourceOffers actually decide where a Task runs?
a) It reshuffles all compute resources with Random.shuffle, seeking a balanced computational load;
b) It declares, for each ExecutorBackend, an ArrayBuffer of TaskDescription sized by that ExecutorBackend's number of cores;
c) If new ExecutorBackends have been allocated to our Job, executorAdded is called so that scheduling sees the latest, complete set of available compute resources;
d) It then calls TaskSetManager's resourceOffer to decide, at the given locality level, which Task (if any) should run on each offered ExecutorBackend (see the simplified sketch below).
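A simplified, self-contained sketch of steps a) through d) (my own approximation that ignores locality levels; it is not Spark's resourceOffers, and WorkerOffer/TaskDesc here are local stand-ins): shuffle the offers, prepare one buffer per offer, then hand out tasks one core at a time:

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ResourceOfferSketch {
  case class WorkerOffer(executorId: String, host: String, cores: Int)
  case class TaskDesc(taskId: Long, executorId: String)

  def allocate(offers: Seq[WorkerOffer], pendingTasks: Iterator[Long]): Seq[Seq[TaskDesc]] = {
    val shuffled = Random.shuffle(offers)                               // a) shuffle for load balancing
    val tasks = shuffled.map(o => new ArrayBuffer[TaskDesc](o.cores))   // b) one buffer per offer, sized by cores
    val availableCpus = shuffled.map(_.cores).toArray
    for (i <- shuffled.indices) {                                       // d) offer each executor tasks
      while (availableCpus(i) > 0 && pendingTasks.hasNext) {
        tasks(i) += TaskDesc(pendingTasks.next(), shuffled(i).executorId)
        availableCpus(i) -= 1                                           // one core per task by default
      }
    }
    tasks.map(_.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val offers = Seq(WorkerOffer("exec-0", "Master", 2), WorkerOffer("exec-1", "Worker1", 2))
    println(allocate(offers, (0L until 3L).iterator))
  }
}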
6) The TaskDescription has already fixed which ExecutorBackend the Task will run on, and at which LocalityLevel:
/**
 * Description of a task that gets passed onto executors to be executed, usually created by
 * [[TaskSetManager.resourceOffer]].
 */
private[spark] class TaskDescription(
    val taskId: Long,
    val attemptNumber: Int,
    val executorId: String,
    val name: String,
    val index: Int,    // Index within this task's TaskSet
    _serializedTask: ByteBuffer)
  extends Serializable {
The algorithm that actually determines which ExecutorBackend a Task runs on is implemented in TaskSetManager's resourceOffer method;
7) launchTasks then sends the tasks to the chosen ExecutorBackends for execution.
2. The data locality levels, from highest to lowest priority, are PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL and ANY, where NO_PREF means the task has no locality preference. The code below shows that the scheduler always pursues the highest locality level it can get:
// Take each TaskSet in our scheduling order, and then offer it each node in increasing order
// of locality levels so that it gets a chance to launch local tasks on all of them.
// NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
var launchedTask = false
for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
  do {
    launchedTask = resourceOfferSingleTaskSet(
      taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
  } while (launchedTask)
}
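In practice, how long the scheduler waits at a higher locality level before falling back to a lower one is configurable; a small sketch of the relevant knobs (the "3s" values shown are, to my knowledge, the usual defaults):

val conf = new org.apache.spark.SparkConf()
  .set("spark.locality.wait", "3s")          // base wait before dropping one locality level
  .set("spark.locality.wait.process", "3s")  // wait specific to PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // wait specific to NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")     // wait specific to RACK_LOCAL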
3. By default each Task reserves a single core and is computed in a single thread:
// CPUs to request per task
val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1)
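A small example of my own (not from the course) of changing this: with spark.task.cpus set to 2 and 4 cores available to the local master, at most 4 / 2 = 2 tasks of a stage should run concurrently:

import org.apache.spark.{SparkConf, SparkContext}

object CpusPerTaskExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")               // 4 cores offered to the scheduler
      .setAppName("CpusPerTaskExample")
      .set("spark.task.cpus", "2")         // CPUS_PER_TASK: cores reserved per task (default 1)
    val sc = new SparkContext(conf)
    // With 4 cores and 2 cores per task, at most 2 of these 4 partitions run at once.
    println(sc.parallelize(1 to 8, 4).map(_ * 2).sum())
    sc.stop()
  }
}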
4. DAGScheduler reasons about preferredLocations from the data's point of view, whereas TaskScheduler internally reasons about locality from the point of view of the specific Task being computed;
5. The Akka frame size used when shipping Tasks is 128 MB (benefit: large tasks can be shipped). If a serialized task is at least 128 MB minus the roughly 200 KB reserved overhead, the TaskSet is aborted with an error instead of being launched; if it is smaller, CoarseGrainedSchedulerBackend launches the task on the chosen ExecutorBackend:
// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    val serializedTask = ser.serialize(task)
    if (serializedTask.limit >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
      scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.akka.frameSize (%d bytes) - reserved (%d bytes). Consider increasing " +
            "spark.akka.frameSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit, akkaFrameSize,
            AkkaUtils.reservedSizeBytes)
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    }
    else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK
      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}
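A back-of-the-envelope sketch of the limit described in point 5, assuming the Spark 1.6 defaults of spark.akka.frameSize = 128 (MB) and roughly 200 KB reserved by AkkaUtils.reservedSizeBytes (object and variable names here are my own):

object FrameSizeLimitSketch {
  def main(args: Array[String]): Unit = {
    val akkaFrameSize = 128 * 1024 * 1024      // default spark.akka.frameSize of 128 MB, in bytes
    val reservedSizeBytes = 200 * 1024         // overhead reserved for Akka, in bytes
    val maxSerializedTask = akkaFrameSize - reservedSizeBytes
    // Any serialized task >= this threshold makes launchTasks abort the whole TaskSet.
    println(s"Largest serialized task that can still be launched: ${maxSerializedTask - 1} bytes")
  }
}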
Supplementary notes:
1. By default a Task is allowed at most 4 failed attempts (spark.task.maxFailures = 4):
def this(sc: SparkContext) = this(sc, sc.conf.getInt("spark.task.maxFailures", 4))
2. Spark applications currently support two schedulers, FIFO and FAIR. The mode is chosen with the spark.scheduler.mode configuration property (for example in spark-defaults.conf or on SparkConf); the default is FIFO. See the sketch after this list;
3. The TaskScheduler is responsible for allocating compute resources to Tasks; concretely, it applies the compute-locality principle to decide which ExecutorBackend each Task runs on.
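A sketch of my own of switching to FAIR scheduling from application code (the property can equally be passed via --conf on spark-submit or in spark-defaults.conf; the pool name "production" is just an example):

import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("FairSchedulingExample")
      .set("spark.scheduler.mode", "FAIR")   // default is FIFO
    val sc = new SparkContext(conf)
    // Jobs submitted from this thread go into the named fair-scheduler pool.
    sc.setLocalProperty("spark.scheduler.pool", "production")
    println(sc.parallelize(1 to 100, 4).count())
    sc.stop()
  }
}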