Reading the Spark Source Code: Dissecting the Master

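In the previous article we walked through the whole process of creating and starting a SparkContext. But once the SparkContext has been created, how does the TaskScheduler register the application with the Master, and how does the Master schedule Workers to launch executors? With these questions in mind, let's look at the Master's internals.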

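We will dig into the Master source code from the following four angles:

  1. Active/standby failover mechanism
  2. Registration mechanism
  3. State change mechanism
  4. Resource scheduling mechanism (two resource scheduling algorithms)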

Active/standby failover mechanism

Active/standby failover means that when the Active Master goes down, the cluster can switch over to a Standby Master.
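A Spark cluster can be configured with two Masters. Note that the Master process exists only in Spark's native standalone mode, so this failover mechanism applies to standalone clusters; on YARN there is no Spark Master, and high availability is handled by YARN's ResourceManager instead.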

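Spark Master failover can be implemented in two ways: one based on the file system and one based on ZooKeeper. With the file-system-based mechanism, after the Active Master dies we have to switch to the Standby Master manually; with the ZooKeeper-based mechanism the switch happens automatically.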
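To make the two options concrete, here is a minimal sketch of how a recovery mode setting could be mapped to a backend. The property name spark.deploy.recoveryMode and its values ZOOKEEPER and FILESYSTEM come from Spark's standalone configuration; the RecoveryBackend type and the function names are hypothetical and only model the idea, they are not Spark classes.

```scala
// Simplified model of recovery-mode selection.
// RecoveryBackend and its cases are hypothetical names, not Spark classes.
object RecoveryModeSketch {
  sealed trait RecoveryBackend
  case object ZooKeeperBackend extends RecoveryBackend  // state in ZooKeeper, automatic leader election
  case object FileSystemBackend extends RecoveryBackend // state in a directory, Master restarted by hand
  case object NoRecovery extends RecoveryBackend        // nothing is persisted

  // Maps the value of spark.deploy.recoveryMode to a backend.
  def backendFor(recoveryMode: String): RecoveryBackend = recoveryMode match {
    case "ZOOKEEPER"  => ZooKeeperBackend
    case "FILESYSTEM" => FileSystemBackend
    case _            => NoRecovery
  }

  // Only the ZooKeeper backend promotes a Standby Master without operator action.
  def failsOverAutomatically(backend: RecoveryBackend): Boolean =
    backend == ZooKeeperBackend
}
```

For example, RecoveryModeSketch.backendFor("FILESYSTEM") yields FileSystemBackend, for which failsOverAutomatically returns false.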

So what work does the Standby Master do when it takes over from the Active Master?
The flowchart below shows the steps:
[Figure 1: flowchart of the Master failover process]

The source of the completeRecovery method is shown below:

  /**
    * Filter out and clean up the Drivers and Workers that did not send a response message
    */
  def completeRecovery() {
    // Ensure "only-once" recovery semantics using a short synchronization period.
    synchronized {
      if (state != RecoveryState.RECOVERING) { return }
      state = RecoveryState.COMPLETING_RECOVERY
    }

    // Kill off any workers and apps that didn't respond to us.
    workers.filter(_.state == WorkerState.UNKNOWN).foreach(removeWorker)
    apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication)

    // Reschedule drivers which were not claimed by any workers
    drivers.filter(_.worker.isEmpty).foreach { d =>
      logWarning(s"Driver ${d.id} was not found after master recovery")
      if (d.desc.supervise) {
        logWarning(s"Re-launching ${d.id}")
        relaunchDriver(d)
      } else {
        removeDriver(d.id, DriverState.ERROR, None)
        logWarning(s"Did not re-launch ${d.id} because it was not supervised")
      }
    }

    state = RecoveryState.ALIVE
    schedule()
    logInfo("Recovery complete - resuming operations!")
  }

  def removeWorker(worker: WorkerInfo) {
    logInfo("Removing worker " + worker.id + " on " + worker.host + ":" + worker.port)
    // Mark the worker as DEAD and remove its information from memory
    worker.setState(WorkerState.DEAD)
    idToWorker -= worker.id
    addressToWorker -= worker.actor.path.address
    for (exec <- worker.executors.values) {
      logInfo("Telling app of lost executor: " + exec.id)
      exec.application.driver ! ExecutorUpdated(
        exec.id, ExecutorState.LOST, Some("worker lost"), None)
      exec.application.removeExecutor(exec)
    }
    // ... (excerpt: the remaining steps, handling the worker's drivers and removing the worker
    // from the persistence engine, appear in the full listing in the registration section)
  }

  def finishApplication(app: ApplicationInfo) {
    removeApplication(app, ApplicationState.FINISHED)
  }
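The two ideas in completeRecovery, a guard that lets recovery run only once and the removal of everything still in the UNKNOWN state, can be shown in isolation. The sketch below is a toy model under simplifying assumptions: ToyMaster, Node, and the state names are hypothetical, and only workers are tracked.

```scala
// Toy model of the "only-once" recovery guard; all names here are hypothetical.
object RecoverySketch {
  sealed trait MasterState
  case object Recovering extends MasterState
  case object CompletingRecovery extends MasterState
  case object Alive extends MasterState

  sealed trait NodeState
  case object Unknown extends NodeState   // has not answered the new Master yet
  case object Responded extends NodeState // answered the new Master

  final case class Node(id: String, state: NodeState)

  final class ToyMaster(initialWorkers: Seq[Node]) {
    private var state: MasterState = Recovering
    private var workers: Seq[Node] = initialWorkers

    def currentState: MasterState = state
    def liveWorkerIds: Seq[String] = workers.map(_.id)

    // Returns true only for the call that actually performs the recovery.
    def completeRecovery(): Boolean = {
      // Short critical section: only one caller may leave the Recovering state.
      val proceed = synchronized {
        if (state != Recovering) false
        else { state = CompletingRecovery; true }
      }
      if (proceed) {
        // Drop every worker that never responded.
        workers = workers.filter(_.state != Unknown)
        state = Alive
      }
      proceed
    }
  }
}
```

A second call to completeRecovery() on the same ToyMaster returns false and changes nothing, which mirrors the early return in the real method.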

Registration mechanism

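While a Spark application is running, Applications, Workers, and Drivers all register with the Master. How does each of them register?

The registration flowchart is shown below:

[Figure 2: flowchart of registration with the Master]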

  /**
    * Registration of a Spark application
    * @param app the application to register
    */
  def registerApplication(app: ApplicationInfo): Unit = {
    val appAddress = app.driver.path.address
    if (addressToApp.contains(appAddress)) {
      logInfo("Attempted to re-register application at same address: " + appAddress)
      return
    }

    applicationMetricsSystem.registerSource(app.appSource)
    // Add the application to the in-memory cache
    //val apps = new HashSet[ApplicationInfo]
    apps += app
    idToApp(app.id) = app
    //val actorToApp = new HashMap[ActorRef, ApplicationInfo]
    actorToApp(app.driver) = app
    addressToApp(appAddress) = app
    // Add the application to the waiting queue
    //  val waitingApps = new ArrayBuffer[ApplicationInfo]
    waitingApps += app
  }
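The bookkeeping in registerApplication boils down to "reject duplicates by address, index the app, and queue it". Here is a small self-contained sketch of that pattern; the App and Registry types are hypothetical stand-ins, and the driver address is modelled as a plain string.

```scala
import scala.collection.mutable

// Toy registry illustrating the registration bookkeeping; App and Registry are hypothetical.
object RegistrationSketch {
  final case class App(id: String, driverAddress: String)

  final class Registry {
    private val addressToApp = mutable.HashMap.empty[String, App]
    private val idToApp = mutable.HashMap.empty[String, App]
    private val waitingApps = mutable.ArrayBuffer.empty[App]

    // Returns false if an app at the same driver address is already registered.
    def register(app: App): Boolean =
      if (addressToApp.contains(app.driverAddress)) false
      else {
        idToApp(app.id) = app                    // index by application id
        addressToApp(app.driverAddress) = app    // index by driver address
        waitingApps += app                       // queue for scheduling, FIFO
        true
      }

    // Ids of the apps waiting to be scheduled, in arrival order.
    def waiting: List[String] = waitingApps.map(_.id).toList
  }
}
```

Registering two apps with the same driverAddress leaves only the first one in the waiting queue.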

  def removeWorker(worker: WorkerInfo) {
    logInfo("Removing worker " + worker.id + " on " + worker.host + ":" + worker.port)
    // Mark the worker as DEAD and remove its information from memory
    worker.setState(WorkerState.DEAD)
    idToWorker -= worker.id
    addressToWorker -= worker.actor.path.address
    for (exec <- worker.executors.values) {
      logInfo("Telling app of lost executor: " + exec.id)
      exec.application.driver ! ExecutorUpdated(
        exec.id, ExecutorState.LOST, Some("worker lost"), None)
      exec.application.removeExecutor(exec)
    }


    for (driver <- worker.drivers.values) {
      /**
        * If driver.desc.supervise is set, the Driver is supervised, so the Master tries to relaunch it;
        * otherwise the Driver is simply removed
        */
      if (driver.desc.supervise) {
        logInfo(s"Re-launching ${driver.id}")
        relaunchDriver(driver)
      } else {
        logInfo(s"Not re-launching ${driver.id} because it was not supervised")
        removeDriver(driver.id, DriverState.ERROR, None)
      }
    }

    /**
      * Remove the worker from the persistence engine. There are two engines:
      * the file-system-based FileSystemPersistenceEngine and the ZooKeeper-based ZookeeperPersistenceEngine
      */
    persistenceEngine.removeWorker(worker)
  }
  def relaunchDriver(driver: DriverInfo) {
    driver.worker = None
    driver.state = DriverState.RELAUNCHING
    waitingDrivers += driver
    schedule()
  }

State change mechanism

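The state change mechanism means that when the state of a component such as a Driver or an Executor changes, a message is sent to the Master.
The source that handles a Driver state change is as follows: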

    case DriverStateChanged(driverId, state, exception) => {
      state match {
        // If the driver's new state is ERROR, FINISHED, KILLED, or FAILED, remove the driver
        case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
          // Remove the driver
          removeDriver(driverId, state, exception)
        case _ =>
          throw new Exception(s"Received unexpected state update for driver $driverId: $state")
      }
    }


  /**
    * Remove a Driver
    * @param driverId   ID of the driver to remove
    * @param finalState final state to record for the driver
    * @param exception  exception reported with the state change, if any
    */
  def removeDriver(driverId: String, finalState: DriverState, exception: Option[Exception]) {
    // Look up the driver by driverId using Scala's higher-order function find
    drivers.find(d => d.id == driverId) match {
      // A matching driver was found (Some is a Scala case class)
      case Some(driver) =>
        logInfo(s"Removing driver: $driverId")
        // Remove it from the in-memory cache
        drivers -= driver
        if (completedDrivers.size >= RETAINED_DRIVERS) {
          val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
          completedDrivers.trimStart(toRemove)
        }
        // Add the driver to the list of completed drivers
        completedDrivers += driver
        // Remove the driver's information from the persistence engine
        persistenceEngine.removeDriver(driver)
        driver.state = finalState
        driver.exception = exception
        // Remove the driver from the worker it was running on
        driver.worker.foreach(w => w.removeDriver(driver))
        schedule()
      case None =>
        logWarning(s"Asked to remove unknown driver: $driverId")
    }
  }
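One detail in removeDriver that is easy to miss is how the list of completed drivers is kept bounded: once it reaches RETAINED_DRIVERS entries, the oldest tenth (at least one entry) is dropped before the new driver is appended. A minimal sketch of that trimming rule follows; addBounded and its parameters are hypothetical names, and the buffer holds plain values rather than DriverInfo objects.

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of the bounded-history rule; addBounded is a hypothetical helper.
object RetainedHistorySketch {
  // Appends `item`, first dropping the oldest max(retained / 10, 1) entries
  // if the buffer already holds `retained` or more items.
  def addBounded[A](history: ArrayBuffer[A], item: A, retained: Int): Unit = {
    if (history.size >= retained) {
      val toRemove = math.max(retained / 10, 1)
      // Remove from the front, where the oldest entries live.
      history.remove(0, math.min(toRemove, history.size))
    }
    history += item
  }
}
```

Dropping a batch rather than a single entry means the trim runs only once every few insertions after the limit is reached.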

When an Executor's state changes:

    /**
      * How the Master handles executor state changes
      */
    case ExecutorStateChanged(appId, execId, state, message, exitStatus) => {
      val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
      execOption match {
        case Some(exec) => {
          val appInfo = idToApp(appId)
          exec.state = state
          if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }
          exec.application.driver ! ExecutorUpdated(execId, state, message, exitStatus)
          if (ExecutorState.isFinished(state)) {
            // Remove this executor from the worker and app
            logInfo(s"Removing executor ${exec.fullId} because it is $state")
            // Remove the executor from the application's in-memory info
            appInfo.removeExecutor(exec)
            // Remove the executor from its worker
            exec.worker.removeExecutor(exec)

            val normalExit = exitStatus == Some(0)
            // Only retry certain number of times so we don't go into an infinite loop.
            if (!normalExit) {
              if (appInfo.incrementRetryCount() < ApplicationState.MAX_NUM_RETRY) {
                schedule()
              } else {
                val execs = appInfo.executors.values
                if (!execs.exists(_.state == ExecutorState.RUNNING)) {
                  logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
                    s"${appInfo.retryCount} times; removing it")
                  removeApplication(appInfo, ApplicationState.FAILED)
                }
              }
            }
          }
        }
        case None =>
          logWarning(s"Got status update for unknown executor $appId/$execId")
      }
    }
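The retry logic in the handler above can be read as a small decision function: a normal exit needs no action, an abnormal exit triggers rescheduling until the retry limit is hit, and after that the application is failed only if none of its executors is still running. The sketch below models just that decision; ToyApp, Decision, and the method names are hypothetical, and the retry limit is passed in rather than read from ApplicationState.MAX_NUM_RETRY.

```scala
// Toy model of the executor retry decision; all names are hypothetical.
object ExecutorRetrySketch {
  sealed trait Decision
  case object NoAction extends Decision         // executor exited normally
  case object Reschedule extends Decision       // abnormal exit, retries remain
  case object FailApplication extends Decision  // retries exhausted, nothing running
  case object KeepRunning extends Decision      // retries exhausted, other executors still run

  final class ToyApp(maxRetries: Int) {
    private var retryCount = 0

    def retries: Int = retryCount

    // A RUNNING executor resets the counter, as resetRetryCount() does.
    def onExecutorRunning(): Unit = { retryCount = 0 }

    def onExecutorFinished(exitStatus: Option[Int], anyExecutorRunning: Boolean): Decision =
      if (exitStatus.contains(0)) NoAction
      else {
        retryCount += 1
        if (retryCount < maxRetries) Reschedule
        else if (!anyExecutorRunning) FailApplication
        else KeepRunning
      }
  }
}
```

Because a RUNNING executor resets the counter, only consecutive abnormal exits count toward the limit.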

Resource scheduling mechanism (two resource scheduling algorithms)

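Resources have to be scheduled throughout the life of a Spark program, which makes this the most important part of the Master. We can understand the Master's resource scheduling algorithms by reading the schedule() method: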

  /**
    * Schedule the currently available resources among waiting apps. This method will be called
    * every time a new app joins or resource availability changes.
    * This is the Master's resource scheduling mechanism.
    */
  private def schedule() {
    // Return immediately if the Master is not in the ALIVE state
    if (state != RecoveryState.ALIVE) { return }

    // First schedule drivers, they take strict precedence over applications
    // Randomization helps balance drivers
    /**
      * Take all workers, keep only those in the ALIVE state, then shuffle them with Random.shuffle.
      */
    val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
    val numWorkersAlive = shuffledAliveWorkers.size
    var curPos = 0

    /**
      * Schedule Drivers first; this only happens in cluster deploy mode
      */
    for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
      // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
      // start from the last worker that was assigned a driver, and continue onwards until we have
      // explored all alive workers.
      var launched = false
      var numWorkersVisited = 0
      // Walk the shuffled ALIVE workers until the driver is launched or every worker has been visited
      while (numWorkersVisited < numWorkersAlive && !launched) {
        // Take the worker at the current position in the shuffled list
        val worker = shuffledAliveWorkers(curPos)
        numWorkersVisited += 1

        /**
          * If the worker's free memory is at least the memory the driver needs
          * and its free CPU cores are at least the cores the driver needs
          */
        if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
          // Launch the driver
          launchDriver(worker, driver)
          // Remove the driver from the waiting queue
          waitingDrivers -= driver
          launched = true
        }
        // Advance the position to the next worker
        curPos = (curPos + 1) % numWorkersAlive
      }
    }

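    /**
      * There are two Application scheduling algorithms: spreadOutApps and non-spreadOutApps.
      * The choice is controlled by the spark.deploy.spreadOut parameter, which defaults to true
      */
    // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
    // in the queue, then the second app, etc.
    /**
      * The spreadOutApps algorithm tries to spread an app's cores across as many nodes as possible.
      * For example, if we ask for three executors with three cores each, that is 9 cores in total;
      * if the cluster has 9 usable workers, each worker is given one core.
      * (In other words, executors are not necessarily launched the way they were specified.)
      */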
    if (spreadOutApps) {
      // Try to spread out each app among all the nodes, until it has all its cores
      for (app <- waitingApps if app.coresLeft > 0) {
        // Again keep only the ALIVE workers this app can use, sorted by free cores in descending order
        val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
          .filter(canUse(app, _)).sortBy(_.coresFree).reverse
        val numUsable = usableWorkers.length
        val assigned = new Array[Int](numUsable) // Number of cores to give on each node
        // Cores to assign: the smaller of the cores the app still needs and the total free cores on usable workers
        var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
        var pos = 0
        while (toAssign > 0) {
          // If this worker still has free cores beyond those already assigned to it, assign one more
          if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
            toAssign -= 1
            assigned(pos) += 1
          }
          pos = (pos + 1) % numUsable
        }

        /**
          * Now that the cores the application asked for have been divided among the workers,
          * iterate over the workers
          */
        // Now that we've decided how many cores to give on each node, let's actually give them
        for (pos <- 0 until numUsable) {
          if (assigned(pos) > 0) {
            // If this worker was assigned any cores, launch an executor on it
            val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
            launchExecutor(usableWorkers(pos), exec)
            app.state = ApplicationState.RUNNING
          }
        }
      }
    } else {
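      /**
        * The non-spreadOutApps algorithm packs each app onto as few workers as possible,
        * giving every worker as many of the app's cores as it can hold.
        * For example, suppose the app asks for 20 cores (requested as 10 executors with 2 cores each)
        * and each worker has 10 free cores. The app is then packed onto just 2 workers,
        * each running one executor with 10 cores.
        */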
      // Pack each app into as few nodes as possible until we've assigned all its cores
      for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
        for (app <- waitingApps if app.coresLeft > 0) {
          if (canUse(app, worker)) {
            val coresToUse = math.min(worker.coresFree, app.coresLeft)
            if (coresToUse > 0) {
              val exec = app.addExecutor(worker, coresToUse)
              launchExecutor(worker, exec)
              app.state = ApplicationState.RUNNING
            }
          }
        }
      }
    }
  }
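The difference between the two application scheduling algorithms is easiest to see with the Spark types stripped away. The sketch below reduces each worker to its number of free cores and returns how many cores each worker would receive. It is a simplified model, not Spark's code: the function names are hypothetical, canUse and memory checks are ignored, and a single app is considered.

```scala
// Core-assignment sketch; spreadOut and pack are hypothetical names.
object CoreAssignmentSketch {
  // Spread-out: hand out one core at a time, round-robin over the workers
  // (assumed to be already sorted by free cores, descending).
  def spreadOut(freeCores: Vector[Int], coresWanted: Int): Vector[Int] = {
    val assigned = Array.fill(freeCores.length)(0)
    // Never try to assign more cores than the workers have in total.
    var toAssign = math.min(math.max(coresWanted, 0), freeCores.sum)
    var pos = 0
    while (toAssign > 0) {
      if (freeCores(pos) - assigned(pos) > 0) {
        assigned(pos) += 1
        toAssign -= 1
      }
      pos = (pos + 1) % freeCores.length
    }
    assigned.toVector
  }

  // Packed: give each worker as many cores as it can hold before moving on.
  def pack(freeCores: Vector[Int], coresWanted: Int): Vector[Int] = {
    var left = math.max(coresWanted, 0)
    freeCores.map { free =>
      val use = math.min(free, left)
      left -= use
      use
    }
  }
}
```

With four workers that each have 10 free cores and an app that wants 20 cores, spreadOut(Vector(10, 10, 10, 10), 20) gives every worker 5 cores, while pack(Vector(10, 10, 10, 10), 20) fills the first two workers with 10 cores each and leaves the other two untouched.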

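That concludes our look at the Master source code from these four angles. If you spot any problems above, please don't hesitate to point them out. Thanks!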
