TaskScheduler and SchedulerBackend (DT大数据梦工厂)

Contents:

1. TaskScheduler and SchedulerBackend;

2. A thorough look at the FIFO and FAIR scheduling modes;

3. How data-locality-aware resource allocation for Tasks is implemented in the source code;

==========Observing TaskScheduler internals by running a program through spark-shell============

First, run an example:

root@Master:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-shell --master spark://Master:7077,Worker1:7077,Worker2:7077

scala> sc.textFile("/historyserverforSpark/README.md", 3).flatMap(_.split(" ")).map(word=>(word,1)).reduceByKey(_+_,1).saveAsTextFile("/historyserverforSpark /output/onclick3")

16/02/20 15:32:47 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 212.8 KB, free 539.1 KB)

16/02/20 15:32:47 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 19.7 KB, free 558.8 KB)

16/02/20 15:32:47 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.145.131:53080 (size: 19.7 KB, free: 1247.2 MB)

16/02/20 15:32:47 INFO spark.SparkContext: Created broadcast 3 from textFile at <console>:28

16/02/20 15:32:49 INFO spark.SparkContext: Starting job: saveAsTextFile at <console>:28

16/02/20 15:32:49 INFO mapred.FileInputFormat: Total input paths to process : 1

16/02/20 15:32:49 INFO scheduler.DAGScheduler: Registering RDD 9 (map at <console>:28)

16/02/20 15:32:49 INFO scheduler.DAGScheduler: Got job 1 (saveAsTextFile at <console>:28) with 1 output partitions

16/02/20 15:32:49 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (saveAsTextFile at <console>:28)

16/02/20 15:32:49 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)

16/02/20 15:32:49 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 2)

16/02/20 15:32:49 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[9] at map at <console>:28), which has no missing parents

16/02/20 15:32:49 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 4.1 KB, free 562.9 KB)

16/02/20 15:32:49 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 2.3 KB, free 565.2 KB)

16/02/20 15:32:49 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.145.131:53080 (size: 2.3 KB, free: 1247.2 MB)

16/02/20 15:32:49 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006

16/02/20 15:32:49 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[9] at map at <console>:28)

16/02/20 15:32:49 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 3 tasks

16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, Master, partition 0,NODE_LOCAL, 2141 bytes)

16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 3, Worker1, partition 1,NODE_LOCAL, 2141 bytes)

16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 4, Master, partition 2,NODE_LOCAL, 2141 bytes)

16/02/20 15:32:49 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on Master:44388 (size: 2.3 KB, free: 511.1 MB)

16/02/20 15:32:49 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Master:44388 (size: 19.7 KB, free: 511.1 MB)

16/02/20 15:32:49 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 2.0 (TID 4) in 369 ms on Master (1/3)

16/02/20 15:32:49 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 394 ms on Master (2/3)

16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.145.131:53080 in memory (size: 19.7 KB, free: 1247.2 MB)

16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on Master:44388 in memory (size: 19.7 KB, free: 511.1 MB)

16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on 192.168.145.131:53080 in memory (size: 22.6 KB, free: 1247.2 MB)

16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on Master:44388 in memory (size: 22.6 KB, free: 511.1 MB)

16/02/20 15:32:50 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on Worker1:35104 (size: 2.3 KB, free: 511.1 MB)

16/02/20 15:32:50 INFO spark.ContextCleaner: Cleaned accumulator 2

16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.145.131:53080 in memory (size: 2.3 KB, free: 1247.2 MB)

16/02/20 15:32:50 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on Master:44388 in memory (size: 2.3 KB, free: 511.1 MB)

16/02/20 15:32:50 INFO spark.ContextCleaner: Cleaned accumulator 1

16/02/20 15:32:50 INFO spark.ContextCleaner: Cleaned shuffle 0

16/02/20 15:32:52 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on Worker1:35104 (size: 19.7 KB, free: 511.1 MB)

16/02/20 15:32:58 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 2.0 (TID 3) in 9579 ms on Worker1 (3/3)

16/02/20 15:32:58 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool

16/02/20 15:32:58 INFO scheduler.DAGScheduler: ShuffleMapStage 2 (map at <console>:28) finished in 9.569 s

16/02/20 15:32:58 INFO scheduler.DAGScheduler: looking for newly runnable stages

16/02/20 15:32:58 INFO scheduler.DAGScheduler: running: Set()

16/02/20 15:32:58 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 3)

16/02/20 15:32:58 INFO scheduler.DAGScheduler: failed: Set()

16/02/20 15:32:58 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[11] at saveAsTextFile at <console>:28), which has no missing parents

16/02/20 15:32:59 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 64.9 KB, free 303.8 KB)

16/02/20 15:32:59 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 22.6 KB, free 326.4 KB)

16/02/20 15:32:59 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.145.131:53080 (size: 22.6 KB, free: 1247.2 MB)

16/02/20 15:32:59 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006

16/02/20 15:32:59 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[11] at saveAsTextFile at <console>:28)

16/02/20 15:32:59 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks

16/02/20 15:32:59 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 5, Master, partition 0,NODE_LOCAL, 1894 bytes)

16/02/20 15:32:59 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on Master:44388 (size: 22.6 KB, free: 511.1 MB)

16/02/20 15:32:59 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to Master:47683

16/02/20 15:32:59 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 166 bytes

16/02/20 15:33:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 5) in 3598 ms on Master (1/1)

16/02/20 15:33:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool

16/02/20 15:33:02 INFO scheduler.DAGScheduler: ResultStage 3 (saveAsTextFile at <console>:28) finished in 3.601 s

16/02/20 15:33:02 INFO scheduler.DAGScheduler: Job 1 finished: saveAsTextFile at <console>:28, took 13.339011 s

1. When we launch spark-shell itself, the messages echoed back to the terminal come mainly from ClientEndpoint and SparkDeploySchedulerBackend. No Job has been triggered at this point; we are only starting the Application itself, so the work consists of instantiating SparkContext, registering the current application with the Master, and obtaining ExecutorBackend compute resources from the cluster.

2. After DAGScheduler has divided the job into Stages, TaskSchedulerImpl uses a TaskSetManager to manage the TaskSet, i.e. all the Tasks of the Stage currently being run. Based on locality awareness, the TaskSetManager assigns compute resources to each Task and monitors task execution (for example retries, and speculative execution of slow tasks).
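
The retry and speculation behaviour mentioned above is controlled by ordinary Spark properties. As a minimal sketch (the values shown are the usual defaults, written out explicitly only for illustration; speculation itself is off by default):

import org.apache.spark.SparkConf

// Knobs behind the TaskSetManager behaviour described above.
val conf = new SparkConf()
  .set("spark.task.maxFailures", "4")          // failures allowed per task before the stage is aborted
  .set("spark.speculation", "true")            // enable speculative execution of slow tasks (off by default)
  .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median counts as "slow"
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculation starts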

Note the NODE_LOCAL entries in the log lines below (taken from the output above): they show the data-locality behaviour, which is why the three tasks ran on their respective machines (even though my cluster stores the data with a replication factor of 3, so all three machines hold a copy):

16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, Master, partition 0,NODE_LOCAL, 2141 bytes)

16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 3, Worker1, partition 1,NODE_LOCAL, 2141 bytes)

16/02/20 15:32:49 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 2.0 (TID 4, Master, partition 2,NODE_LOCAL, 2141 bytes)
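
To see the locality information the scheduler works from, you can ask the RDD itself for the preferred hosts of each partition. A minimal sketch to run in the same spark-shell session (it reuses the HDFS path from the example above):

// Preferred (data-local) hosts per partition; this is the information the
// TaskSetManager uses when it assigns NODE_LOCAL tasks.
val rdd = sc.textFile("/historyserverforSpark/README.md", 3)
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}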

==========TaskScheduler and SchedulerBackend============

1. The overall low-level task scheduling process is as follows:

1) TaskSchedulerImpl.submitTasks: its main job is to wrap the TaskSet in a TaskSetManager so it can be managed (an abridged sketch of the flow follows below);
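
An abridged paraphrase of TaskSchedulerImpl.submitTasks in Spark 1.6 (locking details and bookkeeping are omitted; treat this as a sketch of the flow rather than the verbatim source):

// Wrap the TaskSet in a TaskSetManager, register it with the SchedulableBuilder
// (FIFO or FAIR pool), then trigger a round of resource offers.
override def submitTasks(taskSet: TaskSet) {
  logInfo("Adding task set " + taskSet.id + " with " + taskSet.tasks.length + " tasks")
  this.synchronized {
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    schedulableBuilder.addTaskSetManager(manager, manager.properties)
  }
  backend.reviveOffers()
}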

2) SchedulableBuilder.addTaskSetManager: the SchedulableBuilder determines the scheduling order of the TaskSetManagers, and each TaskSetManager's locality awareness then determines which ExecutorBackend each Task runs on. The relevant TaskSchedulerImpl code is:

// default scheduler is FIFO
private val schedulingModeConf = conf.get("spark.scheduler.mode", "FIFO")
val schedulingMode: SchedulingMode = try {
  SchedulingMode.withName(schedulingModeConf.toUpperCase)
} catch {
  case e: java.util.NoSuchElementException =>
    throw new SparkException(s"Unrecognized spark.scheduler.mode: $schedulingModeConf")
}

def initialize(backend: SchedulerBackend) {
  this.backend = backend
  // temporarily set rootPool name to empty
  rootPool = new Pool("", schedulingMode, 0, 0)
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
    }
  }
  schedulableBuilder.buildPools()
}
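
The code above shows that the scheduling mode defaults to FIFO and that initialize builds the matching SchedulableBuilder. A minimal usage sketch for switching an application to FAIR mode (the allocation-file path below is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// With these settings TaskSchedulerImpl.initialize builds a FairSchedulableBuilder
// instead of the default FIFOSchedulableBuilder.
val conf = new SparkConf()
  .setAppName("FairSchedulingExample")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/usr/local/spark/conf/fairscheduler.xml") // hypothetical path
val sc = new SparkContext(conf)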

3) CoarseGrainedSchedulerBackend.reviveOffers sends a ReviveOffers message to DriverEndpoint. ReviveOffers itself is an empty case object; it merely acts as a trigger for the underlying resource scheduling, and is sent whenever a Task is submitted or the available compute resources change:

backend.reviveOffers()

override def reviveOffers() {
  driverEndpoint.send(ReviveOffers)
}

4) DriverEndpoint receives the ReviveOffers message and routes it to the makeOffers method. makeOffers first assembles all the workOffers available for computation (each one representing the usable cores on an available ExecutorBackend);
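
Roughly, makeOffers in Spark 1.6 looks like the abridged paraphrase below (the real code also filters out executors that are being killed); each WorkerOffer simply records an executor's id, host and currently free cores:

// Spark's own resource-offer record: case class WorkerOffer(executorId: String, host: String, cores: Int)
private def makeOffers() {
  // Build one WorkerOffer per registered executor from the driver-side bookkeeping map,
  // let TaskSchedulerImpl.resourceOffers decide placement, then launch the chosen tasks.
  val workOffers = executorDataMap.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toSeq
  launchTasks(scheduler.resourceOffers(workOffers))
}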

5) TaskSchedulerImpl.resourceOffers assigns compute resources to each task. Its input is the set of ExecutorBackends with their available cores; its output is a two-dimensional array of TaskDescription, which records exactly which ExecutorBackend each Task will run on. How does resourceOffers make that decision?

a) It reshuffles all the offered resources with Random.shuffle to balance the load across the cluster;

b) It creates, per ExecutorBackend, an ArrayBuffer of TaskDescription sized by that ExecutorBackend's number of cores;

c) If a new ExecutorBackend has been allocated to our Job, executorAdded is called so that the latest, complete view of the available compute resources is obtained;

d) Finally, it calls TaskSetManager's resourceOffer to pin each Task to a concrete ExecutorBackend at the best achievable locality level (an abridged paraphrase of the set-up part of resourceOffers is sketched below).
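
An abridged paraphrase of the set-up part of TaskSchedulerImpl.resourceOffers in Spark 1.6, covering steps a) and b) above (a sketch, not the verbatim source):

// (a) shuffle the offers for load balancing; (b) pre-size one TaskDescription buffer
// per offer according to its core count; then sort the TaskSets into FIFO or FAIR order
// before matching tasks to offers.
val shuffledOffers = Random.shuffle(offers)
val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
val availableCpus = shuffledOffers.map(o => o.cores).toArray
val sortedTaskSets = rootPool.getSortedTaskSetQueue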

6) By the time a TaskDescription exists, the ExecutorBackend the Task will run on, and at which LocalityLevel, has already been decided:

/**
 * Description of a task that gets passed onto executors to be executed, usually created by
 * [[TaskSetManager.resourceOffer]].
 */
private[spark] class TaskDescription(
    val taskId: Long,
    val attemptNumber: Int,
    val executorId: String,
    val name: String,
    val index: Int,    // Index within this task's TaskSet
    _serializedTask: ByteBuffer)
  extends Serializable {

The algorithm that actually decides which ExecutorBackend a Task runs on lives in TaskSetManager's resourceOffer method.
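
For reference, the Spark 1.6 signature of that method looks roughly like this: given one executor and the highest locality level allowed in the current round, it either hands back a TaskDescription or declines the offer:

// TaskSetManager.resourceOffer: returns Some(TaskDescription) if a task in this TaskSet
// can run on the given executor at a locality level no worse than maxLocality, else None.
def resourceOffer(
    execId: String,
    host: String,
    maxLocality: TaskLocality.TaskLocality): Option[TaskDescription]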

7) launchTasks then sends the tasks to the ExecutorBackends for execution.

2. Data-locality levels, from highest to lowest priority, are PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL and ANY (NO_PREF means the Task has no locality preference). The code below shows that scheduling always tries the highest locality level first:

// Take each TaskSet in our scheduling order, and then offer it each node in increasing order
// of locality levels so that it gets a chance to launch local tasks on all of them.
// NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
var launchedTask = false
for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
  do {
    launchedTask = resourceOfferSingleTaskSet(
        taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
  } while (launchedTask)
}
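
How long the scheduler insists on a given locality level before falling back to the next one is governed by the locality-wait properties. A minimal sketch (3s is the usual default, written out explicitly for illustration):

import org.apache.spark.SparkConf

// If no task can be launched at the preferred level within the wait time,
// the TaskSetManager downgrades to the next (lower) locality level.
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")          // base wait applied at each level
  .set("spark.locality.wait.process", "3s")  // override for PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // override for NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")     // override for RACK_LOCAL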

3. By default, each Task is computed using a single thread (one CPU core):

// CPUs to request per task
val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1)
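
A usage sketch: if each task internally spawns its own threads, you can reserve more than one core per task so the scheduler does not oversubscribe executors:

import org.apache.spark.SparkConf

// Reserve 2 cores per task; a task is only placed on an executor whose
// free cores are at least CPUS_PER_TASK.
val conf = new SparkConf().set("spark.task.cpus", "2")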

4. DAGScheduler considers preferredLocation from the data's point of view, whereas TaskScheduler considers locality from the point of view of the concrete Task being computed.

5. When a Task is shipped to an executor, the Akka frame size limit is 128 MB by default (benefit: large tasks can be sent). If the serialized task is at least 128 MB minus the 200 KB reserved size, the TaskSetManager aborts the TaskSet (rather than silently dropping the Task); if it is smaller, CoarseGrainedSchedulerBackend launches it on the specific ExecutorBackend:

// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    val serializedTask = ser.serialize(task)
    if (serializedTask.limit >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
      scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.akka.frameSize (%d bytes) - reserved (%d bytes). Consider increasing " +
            "spark.akka.frameSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit, akkaFrameSize,
            AkkaUtils.reservedSizeBytes)
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    }
    else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK
      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}
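
If a job genuinely needs larger task payloads, the limit can be raised; a sketch using the Spark 1.x Akka-based property (later releases replace it with spark.rpc.message.maxSize):

import org.apache.spark.SparkConf

// Raise the Akka frame size (value in MB, default 128) so larger serialized tasks fit;
// for large data values, prefer broadcast variables instead of shipping them in the task.
val conf = new SparkConf().set("spark.akka.frameSize", "256")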

Supplementary notes:

1. By default a Task may fail at most 4 times before the job is aborted (spark.task.maxFailures):

def this(sc: SparkContext) = this(sc, sc.conf.getInt("spark.task.maxFailures", 4))

2. Spark applications currently support two schedulers, FIFO and FAIR. The mode is set with the spark.scheduler.mode property (for example in spark-defaults.conf or directly on SparkConf); the default is FIFO.

3. TaskScheduler is responsible for assigning compute resources to Tasks; concretely, it applies the locality principle to decide which ExecutorBackend each Task should run on.


