The previous post analyzed Spark's deployment process in Standalone mode in detail, and mentioned that once a Worker finishes registering, a schedule operation runs to allocate resources. This post looks at exactly how that method allocates them.
Note: all posts in this series use the Spark-1.6.3 source code as the reference; where Spark-2.1.0 makes significant changes, they will be pointed out.
When is schedule called?
In fact, schedule is called whenever a new application joins or the available resources change, and it then redistributes resources. So how does it allocate them? Let's walk through the source code.
schedule
Let's start with the source code of Master.schedule:
private def schedule(): Unit = {
  // Resources can only be scheduled while the Master's state is ALIVE
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Drivers take strict precedence over executors
  // As the comment says: launch Drivers first, then executors
  // 1. Shuffle the ALIVE workers so that drivers are not always placed on the same worker
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  // 2. Iterate over the waiting drivers and launch them
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    var numWorkersVisited = 0
    // 2.1 Keep looking while there are unvisited alive workers and this driver is not yet launched
    while (numWorkersVisited < numWorkersAlive && !launched) {
      // 2.2 Pick a worker; the first index is 0, later ones follow curPos = (curPos + 1) % numWorkersAlive
      val worker = shuffledAliveWorkers(curPos)
      // 2.3 Count this worker as visited
      numWorkersVisited += 1
      // 2.4 Check whether the worker's free memory and cores satisfy the driver's requirements
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // 2.5 If the resources suffice, launch the driver on this worker
        launchDriver(worker, driver)
        // 2.6 After launching, remove the driver from the waiting queue and mark it as launched
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  // 3. Launch the executors
  startExecutorsOnWorkers()
}
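Before moving on, it may help to see the round-robin placement in isolation. Below is a minimal, self-contained sketch of the same loop over plain data; FreeWorker, DriverReq and placeDrivers are invented for this illustration (the real code works on WorkerInfo and DriverInfo):

import scala.util.Random

// Invented stand-ins for the resource fields of WorkerInfo / DriverInfo
case class FreeWorker(id: String, var memoryFree: Int, var coresFree: Int)
case class DriverReq(mem: Int, cores: Int)

// Round-robin driver placement, mirroring the loop in Master.schedule
def placeDrivers(workers: Seq[FreeWorker], drivers: Seq[DriverReq]): Unit = {
  val shuffled = Random.shuffle(workers) // avoid always trying the same worker first
  val n = shuffled.size
  var curPos = 0
  for (driver <- drivers) {
    var launched = false
    var visited = 0
    while (visited < n && !launched) {
      val w = shuffled(curPos)
      visited += 1
      if (w.memoryFree >= driver.mem && w.coresFree >= driver.cores) {
        w.memoryFree -= driver.mem   // stand-in for launchDriver + addDriver bookkeeping
        w.coresFree -= driver.cores
        launched = true
      }
      curPos = (curPos + 1) % n
    }
  }
}

Note that a driver that fits on no alive worker simply stays in waitingDrivers and is retried on the next schedule call.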
Launching the Driver
The allocation flow is annotated in detail in the source above; now let's look at the launchDriver method:
private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  // 1. Log the launch
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // 2. Record the driver on the worker, bumping the worker's used memory and cores
  worker.addDriver(driver)
  // 3. Give the driver a reference to its worker
  driver.worker = Some(worker)
  // 4. Send a LaunchDriver message telling the Worker to start the driver
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  // 5. Mark the driver's state as RUNNING
  driver.state = DriverState.RUNNING
}
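As an aside, worker.addDriver on the Master side is pure bookkeeping; the snippet below is paraphrased from WorkerInfo in the 1.6.3 source (see WorkerInfo.scala for the exact code):

// Paraphrased from o.a.s.deploy.master.WorkerInfo: record the driver and
// account for the memory and cores it now occupies on this worker
def addDriver(driver: DriverInfo) {
  drivers(driver.id) = driver
  memoryUsed += driver.desc.mem
  coresUsed += driver.desc.cores
}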
Next, let's see how the Worker handles the LaunchDriver message it receives:
case LaunchDriver(driverId, driverDesc) => {
  // 1. Log the request
  logInfo(s"Asked to launch driver $driverId")
  // 2. Instantiate a DriverRunner
  val driver = new DriverRunner(
    conf,
    driverId,
    workDir,
    sparkHome,
    driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
    self,
    workerUri,
    securityMgr)
  // 3. Record the new runner in the drivers map
  drivers(driverId) = driver
  // 4. Start the driver
  driver.start()
  // 5. Account for the resources the driver now occupies
  coresUsed += driverDesc.cores
  memoryUsed += driverDesc.mem
}
Let's follow driver.start(); as its doc comment says, it starts a thread to run and manage the driver:
/** Starts a thread to run and manage the driver. */
private[worker] def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      try {
        // Create the driver's working directory
        val driverDir = createWorkingDirectory()
        // Download the user's jar into the working directory and return its local path
        val localJarFilename = downloadUserJar(driverDir)
        def substituteVariables(argument: String): String = argument match {
          case "{{WORKER_URL}}" => workerUrl
          case "{{USER_JAR}}" => localJarFilename
          case other => other
        }
        // TODO: If we add ability to submit multiple jars they should also be added here
        val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
          driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)
        // The actual launch of the driver process; not analyzed in detail here
        launchDriver(builder, driverDir, driverDesc.supervise)
      }
      catch {
        case e: Exception => finalException = Some(e)
      }
      // Derive the driver's final state from how the run ended
      val state =
        if (killed) {
          DriverState.KILLED
        } else if (finalException.isDefined) {
          DriverState.ERROR
        } else {
          finalExitCode match {
            case Some(0) => DriverState.FINISHED
            case _ => DriverState.FAILED
          }
        }
      finalState = Some(state)
      // Report the final state (and any exception) back to the worker
      worker.send(DriverStateChanged(driverId, state, finalException))
    }
  }.start()
}
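The inner launchDriver call above is where the driver JVM is actually spawned. We will not trace it in detail, but conceptually it runs the built command from the working directory and, when --supervise was requested, restarts the driver whenever it exits with a non-zero code. A simplified, hypothetical sketch of that loop (runAndSupervise is invented; the real method is DriverRunner.launchDriver):

import java.io.File

// Hypothetical sketch: run the driver command, restarting it on failure if supervised
def runAndSupervise(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
  builder.directory(baseDir)                 // run the driver from its working directory
  var exitCode = -1
  var keepTrying = true
  while (keepTrying) {
    val process = builder.start()            // spawn the driver process
    exitCode = process.waitFor()             // block until it exits
    keepTrying = supervise && exitCode != 0  // retry only supervised drivers that failed
  }
  exitCode
}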
Whatever the outcome, the thread ends by sending a DriverStateChanged message to the worker. On receiving it, the Worker calls handleDriverStateChanged for a series of bookkeeping steps that we will not detail here; the key step is forwarding a DriverStateChanged message to the Master, which removes the driver's info when it receives it:
case DriverStateChanged(driverId, state, exception) => {
state match {
case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
removeDriver(driverId, state, exception)
case _ =>
throw new Exception(s"Received unexpected state update for driver $driverId: $state")
}
}
This completes allocating resources to the Driver and launching it. Next, let's look at launching the Executors, i.e. the startExecutorsOnWorkers() flow.
Launching Executors
startExecutorsOnWorkers():
/**
 * Schedule and launch executors on workers
 */
private def startExecutorsOnWorkers(): Unit = {
  // Applications are served in FIFO order
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // Filter out workers that don't have enough resources to launch an executor,
    // keeping only ALIVE workers, sorted by free cores in descending order
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    // Decide how many cores to allocate on each worker
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}
Let's first see how the number of cores to allocate on each worker is decided. Here we only reproduce the English doc comment of scheduleExecutorsOnWorkers and summarize it; for the full implementation, go read the source:
/**
* Schedule executors to be launched on the workers.
* Returns an array containing number of cores assigned to each worker.
*
* There are two modes of launching executors. The first attempts to spread out an application's
* executors on as many workers as possible, while the second does the opposite (i.e. launch them
* on as few workers as possible). The former is usually better for data locality purposes and is
* the default.
*
* The number of cores assigned to each executor is configurable. When this is explicitly set,
* multiple executors from the same application may be launched on the same worker if the worker
* has enough cores and memory. Otherwise, each executor grabs all the cores available on the
* worker by default, in which case only one executor may be launched on each worker.
*
* It is important to allocate coresPerExecutor on each worker at a time (instead of 1 core
* at a time). Consider the following example: cluster has 4 workers with 16 cores each.
* User requests 3 executors (spark.cores.max = 48, spark.executor.cores = 16). If 1 core is
* allocated at a time, 12 cores from each worker would be assigned to each executor.
* Since 12 < 16, no executors would launch [SPARK-8881].
*/
Roughly, there are two allocation modes: the first spreads an application's executors across as many workers as possible, while the second does the opposite. The first is the default, as it is usually better for data locality. The number of cores per executor is configurable (via spark-submit or spark-env.sh). If it is set explicitly, several executors of the same application may be launched on the same worker, provided that worker has enough cores and memory; otherwise each executor grabs all the cores available on its worker, in which case at most one executor is launched per worker. When handing out cores across the cluster, the scheduler satisfies the request as far as it can: if the requested core count exceeds the workers' total free cores, it simply hands out all the free cores there are.
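To make the policy concrete, here is a heavily simplified, self-contained model of the core loop of scheduleExecutorsOnWorkers. It ignores the memory and executor-count limits the real method also checks, and scheduleCores and its parameters are invented for this illustration:

// Distribute cores over workers, minCores at a time: either spread out
// (one slot per worker per round) or packed (fill a worker before moving on)
def scheduleCores(
    coresWanted: Int,              // app.coresLeft
    workerFreeCores: Array[Int],   // free cores of each usable worker
    coresPerExecutor: Option[Int], // spark.executor.cores, if set
    spreadOut: Boolean): Array[Int] = {
  val minCores = coresPerExecutor.getOrElse(1)
  val assigned = Array.fill(workerFreeCores.length)(0)
  var toAssign = math.min(coresWanted, workerFreeCores.sum)
  def canLaunch(pos: Int): Boolean =
    toAssign >= minCores && workerFreeCores(pos) - assigned(pos) >= minCores
  var freeWorkers = workerFreeCores.indices.filter(canLaunch)
  while (freeWorkers.nonEmpty) {
    freeWorkers.foreach { pos =>
      var keepScheduling = true
      while (keepScheduling && canLaunch(pos)) {
        toAssign -= minCores
        assigned(pos) += minCores
        if (spreadOut) keepScheduling = false // move on to the next worker
      }
    }
    freeWorkers = freeWorkers.filter(canLaunch)
  }
  assigned
}

With 4 workers of 16 free cores each, coresWanted = 48 and coresPerExecutor = Some(16), this yields 16 cores on each of three workers, which is exactly the SPARK-8881 example from the doc comment: allocating coresPerExecutor at a time avoids stranding 12 unusable cores on every worker.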
Next comes the actual assignment of compute resources to the executors and their launch:
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    // Record the executor on the application
    val exec = app.addExecutor(worker, coresToAssign)
    // Launch the executor
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}
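A quick worked example of that division, with hypothetical numbers: suppose this worker was assigned 8 cores. If spark.executor.cores = 4, two 4-core executors are launched; if coresPerExecutor is unset, a single executor grabs all 8 cores. (scheduleExecutorsOnWorkers guarantees assignedCores is a multiple of coresPerExecutor, so the division leaves no remainder.)

// Hypothetical inputs, just to trace the two branches above
val assignedCores = 8
val execCoresSet: Option[Int] = Some(4)        // spark.executor.cores = 4
val numExecutors = execCoresSet.map(assignedCores / _).getOrElse(1) // 2 executors
val coresEach    = execCoresSet.getOrElse(assignedCores)            // 4 cores each

val execCoresUnset: Option[Int] = None
val singleExec = execCoresUnset.map(assignedCores / _).getOrElse(1) // 1 executor
val allCores   = execCoresUnset.getOrElse(assignedCores)            // with all 8 cores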
Launching the executors:
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  // Tell the worker to launch the executor
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  // Then tell the driver about the newly added executor
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}
On receiving the LaunchExecutor message, the worker performs the actual launch and reports back to the Master. The driver is likewise informed of the new executor's resources via ExecutorAdded; with its executors registered, the driver can run the application. That completes the rough flow of allocating resources to executors and launching them.
Finally, a diagram summarizing the simplified flow of launching the Driver and the Executors:
This post only sketched how the Master, while running schedule, allocates resources to the Driver and the Executors and launches them. In a later post we will analyze the complete run of an application and revisit these steps in more depth.
References and further reading:
Spark-1.6.3 source code
Spark-2.1.0 source code
This is an original post. You are welcome to repost it; please credit the source and the author. Thanks!