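Spark in depth: the scheduling algorithm

Based on a reading of the Spark 1.3.1 source code.
The Spark source is impressively concise: in standalone mode, resource scheduling takes fewer than 100 lines. Spark offers two scheduling strategies. Spread out: an application's executors are placed on as many workers as possible. Consolidate: an application's executors are placed on as few workers as possible. The choice is controlled by the spreadOutApps variable (read from the spark.deploy.spreadOut setting), where true means spread out.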

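  private def schedule() {
    // If this master is not ALIVE, return immediately.
    // In other words, a standby master takes no part in resource scheduling.
    if (state != RecoveryState.ALIVE) { return }
    // Take all previously registered workers, keep only those whose state is ALIVE,
    // and use Random.shuffle to put them in random order.
    val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
    val numWorkersAlive = shuffledAliveWorkers.size
    // Round-robin cursor into shuffledAliveWorkers, used by the driver loop below.
    var curPos = 0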

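A driver is registered with the Master only when the application is submitted in standalone cluster deploy mode (for example, spark-submit with --deploy-mode cluster against a spark:// master). In client mode the driver starts locally in the submitting process, never registers with the Master, and so can never be scheduled by it. The YARN modes do not go through this Master at all.
So the for loop below only does any work for cluster-mode submissions.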

    for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
      // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
      // start from the last worker that was assigned a driver, and continue onwards until we have
      // explored all alive workers.
      var launched = false
      var numWorkersVisited = 0

      // Keep looping while there are alive workers we have not visited yet
      // and this driver has not been launched on any worker (launched == false).
      while (numWorkersVisited < numWorkersAlive && !launched) {
        val worker = shuffledAliveWorkers(curPos)
        numWorkersVisited += 1
        // The worker qualifies if its free memory is at least the memory the driver needs
        // and its free CPU cores are at least the cores the driver needs.
        if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
          // Launch the driver on this worker.
          launchDriver(worker, driver)
          // Remove the driver from waitingDrivers.
          waitingDrivers -= driver
          launched = true
        }
        // Advance the cursor to the next worker.
        curPos = (curPos + 1) % numWorkersAlive
      }
    }
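
To make the round-robin placement easier to follow, here is a minimal standalone sketch of the same loop. It is not Spark code: Worker, Driver, and placeDrivers are hypothetical stand-ins, the workers are taken in the order given rather than shuffled (so the result is reproducible), and "launching" is reduced to subtracting the driver's resources from the chosen worker.

```scala
object DriverPlacementSketch {
  // Hypothetical, simplified stand-ins for Spark's WorkerInfo and DriverInfo.
  final case class Worker(id: String, var memoryFree: Int, var coresFree: Int)
  final case class Driver(id: String, mem: Int, cores: Int)

  /** Returns driverId -> workerId for every driver that found a worker. */
  def placeDrivers(workers: IndexedSeq[Worker], drivers: Seq[Driver]): Map[String, String] = {
    var placements = Map.empty[String, String]
    var curPos = 0 // round-robin cursor, shared across drivers
    for (driver <- drivers) {
      var launched = false
      var numWorkersVisited = 0
      // Visit each worker at most once per driver, starting where the last driver left off.
      while (numWorkersVisited < workers.size && !launched) {
        val worker = workers(curPos)
        numWorkersVisited += 1
        if (worker.memoryFree >= driver.mem && worker.coresFree >= driver.cores) {
          // Assumption: placing a driver simply consumes its memory and cores.
          worker.memoryFree -= driver.mem
          worker.coresFree -= driver.cores
          placements += (driver.id -> worker.id)
          launched = true
        }
        curPos = (curPos + 1) % workers.size
      }
    }
    placements
  }
}
```

A driver that fits on no worker is simply left out of the result, which mirrors how the real loop leaves it in waitingDrivers for a later scheduling round.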

That covers driver scheduling.
Next, let's look at resource scheduling for applications.
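/**
  • Application scheduling (the heart of the scheduler)
  • Two algorithms: spreadOutApps (the default) and non-spreadOutApps
  • The spreadOutApps algorithm spreads the executors each application needs evenly across the workers
  • For example, with 20 CPU cores to assign and 10 workers, the loop makes two passes over the workers, giving each worker one core per pass
  • In the end each worker has been assigned two cores
    */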
    if (spreadOutApps) {
      for (app <- waitingApps if app.coresLeft > 0) {
        // Keep only the workers that are ALIVE and that this app can use,
        // sorted by free cores in descending order.
        val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
          .filter(canUse(app, _)).sortBy(_.coresFree).reverse
        val numUsable = usableWorkers.length
        // Zero-filled array that will hold the number of cores to assign on each worker.
        val assigned = new Array[Int](numUsable) // Number of cores to give on each node
        // Total cores that can be assigned: the smaller of the cores the application
        // still needs and the free cores across all usable workers.
        var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
        var pos = 0
        // Loop as long as there are cores left to assign.
        while (toAssign > 0) {
          // If this worker has more free cores than it has already been assigned,
          // it can still take another one.
          if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
            // One fewer core remains to be assigned.
            toAssign -= 1
            // This worker gets one more core.
            assigned(pos) += 1
          }
          // Move on to the next worker.
          pos = (pos + 1) % numUsable
        }
        // Now that we've decided how many cores to give on each node, let's actually give them
        for (pos <- 0 until numUsable) {
          // Skip workers that were assigned no cores.
          if (assigned(pos) > 0) {
            // Create an ExecutorDesc describing the executor
            // and add it to the application's cache.
            val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
            // Launch the executor on the worker.
            launchExecutor(usableWorkers(pos), exec)
            // Mark the application as RUNNING.
            app.state = ApplicationState.RUNNING
          }
        }
      }
    }

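Summary of this scheduling algorithm:
Earlier we specified in spark-submit how many executors to use and how many CPU cores each executor needs. Under this mechanism, however, the actual number of executors and the cores per executor may differ from that configuration, because allocation is driven by the total core count. Suppose we ask for three executors with three cores each, nine cores in total. If there are nine workers, this algorithm assigns one core to each worker and then launches one executor on each of them. The result is nine executors with one core each.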
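
The core-by-core loop can be isolated into a few lines to check examples like the one above. This is only a sketch under stated assumptions: SpreadOutSketch and assign are hypothetical names, workers are reduced to their free-core counts, and the input is assumed to be already filtered and sorted the way usableWorkers is.

```scala
object SpreadOutSketch {
  /**
   * Hypothetical helper, not part of Spark.
   * coresFree(i) is the number of free cores on usable worker i;
   * the result holds the number of cores assigned to each worker.
   */
  def assign(coresFree: Array[Int], coresLeft: Int): Array[Int] = {
    val assigned = new Array[Int](coresFree.length)
    // Never try to hand out more cores than the workers have in total.
    var toAssign = math.min(coresLeft, coresFree.sum)
    var pos = 0
    while (toAssign > 0) {
      // Give this worker one more core if it still has spare capacity.
      if (coresFree(pos) - assigned(pos) > 0) {
        toAssign -= 1
        assigned(pos) += 1
      }
      pos = (pos + 1) % coresFree.length
    }
    assigned
  }
}
```

Assuming, for illustration, nine workers with four free cores each, SpreadOutSketch.assign(Array.fill(9)(4), 9) gives every worker exactly one core, which is the nine-executor outcome described above. With ten workers and 20 cores to assign, every worker ends up with two.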

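The second scheduling algorithm:
This algorithm is the opposite of the one above: each application is packed onto as few workers as possible. Suppose there are 10 workers with 10 cores each and an application needs 20 cores in total. It is assigned to just two workers, each with all 10 of its cores taken, and other applications have to go to the remaining workers. So if spark-submit asks for 10 executors with 2 cores each, 20 cores in total, this algorithm actually launches only two executors with 10 cores each.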

      // Iterate over the workers that are ALIVE and still have free cores.
      for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
        // Iterate over the applications that still need cores.
        for (app <- waitingApps if app.coresLeft > 0) {
          // Check that this worker can be used by the application.
          if (canUse(app, worker)) {
            // Take the smaller of the worker's free cores and the cores the application still needs.
            val coresToUse = math.min(worker.coresFree, app.coresLeft)
            // Only launch an executor if there is at least one core to give.
            if (coresToUse > 0) {
              val exec = app.addExecutor(worker, coresToUse)
              // Launch the executor on the worker.
              launchExecutor(worker, exec)
              // Mark the application as RUNNING.
              app.state = ApplicationState.RUNNING
            }
          }
        }
      }
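
For comparison with the spread-out strategy, the packing behavior for a single application can be sketched as follows. PackSketch and assign are hypothetical names, workers are again reduced to free-core counts, and the canUse check and the presence of other waiting applications are deliberately ignored.

```scala
object PackSketch {
  /**
   * Hypothetical helper, not part of Spark.
   * Walks the workers in order and takes as many cores as possible
   * from each one before moving on to the next.
   */
  def assign(coresFree: Seq[Int], coresNeeded: Int): Seq[Int] = {
    var coresLeft = coresNeeded
    coresFree.map { free =>
      // Same rule as coresToUse in the source, clamped at zero.
      val coresToUse = math.max(0, math.min(free, coresLeft))
      coresLeft -= coresToUse
      coresToUse
    }
  }
}
```

With ten workers of ten cores each and 20 cores needed, PackSketch.assign(Seq.fill(10)(10), 20) takes ten cores from each of the first two workers and none from the rest, matching the two-executor outcome described earlier.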

Both branches rely on one very important method:

def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc) {
    logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
    // Add the executor to the worker's cache.
    worker.addExecutor(exec)
    // Send LaunchExecutor to the worker's actor so that the worker starts the executor.
    worker.actor ! LaunchExecutor(masterUrl,
      exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
    exec.application.driver ! ExecutorAdded(
      exec.id, worker.id, worker.hostPort, exec.cores, exec.memory)
  }

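In this method, the call
worker.actor ! LaunchExecutor(masterUrl, exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
is what sends the message to the worker (it wraps Akka's message passing). The worker handles this message on its side and launches the Executor.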
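
The fire-and-forget pattern in launchExecutor can be illustrated without Akka. The sketch below is a toy model only: Mailbox is a hypothetical stand-in for an actor reference, the two message classes carry a reduced set of fields compared with Spark's real messages, and nothing here reflects how Akka delivers messages internally.

```scala
object MessageSketch {
  sealed trait Message
  // Simplified, hypothetical versions of the two messages sent by launchExecutor.
  final case class LaunchExecutor(appId: String, execId: Int, cores: Int, memory: Int) extends Message
  final case class ExecutorAdded(execId: Int, workerId: String, cores: Int, memory: Int) extends Message

  // Toy stand-in for an actor reference: `!` just queues the message and returns.
  final class Mailbox(val name: String) {
    private val buffer = scala.collection.mutable.ArrayBuffer.empty[Message]
    def !(msg: Message): Unit = { buffer += msg; () }
    def received: List[Message] = buffer.toList
  }

  // Mirrors the shape of launchExecutor: tell the worker to start the executor,
  // then tell the application's driver that an executor was added.
  def launchExecutor(worker: Mailbox, driver: Mailbox,
                     appId: String, execId: Int, cores: Int, memory: Int): Unit = {
    worker ! LaunchExecutor(appId, execId, cores, memory)
    driver ! ExecutorAdded(execId, worker.name, cores, memory)
  }
}
```

The point of the sketch is that the sender does not wait for a reply: the master sends both messages and returns, and each recipient reacts when it processes its own mailbox.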
