在分析源代码之前,首先对Standalone Cluster Mode的资源调度有一个基本的认识:
首先,运行一个Application需要Driver进程和一组Executor进程。在Standalone Cluster Mode下,Driver和Executor都是在Master的监护下给Worker发消息创建(Driver进程和Executor进程都需要分配内存和CPU,这就需要Master进行资源的调度)。
另外,在为Application的Executor进程s分配CPU内核时,需要考虑CPU内核是尽可能的分散到所有的Worker上分配,还是尽可能在尽量少的Worker上分配,这是有计算任务的特点决定的,是数据密集型还是计算密集型(比如求PI值)
源代码:
/** * Schedule the currently available resources among waiting apps. This method will be called * every time a new app joins or resource availability changes. */ private def schedule() { if (state != RecoveryState.ALIVE) { return } // First schedule drivers, they take strict precedence over applications // Randomization helps balance drivers //首先为已经提交Application,但是Application对应的Driver进程还没有启动的应用分配Driver资源 val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE)) val numWorkersAlive = shuffledAliveWorkers.size var curPos = 0 ///waitingDrivers列表保存了已经提交Application,但是Application对应的Driver进程还没有启动的DriverInfo信息,每个元素是DriverInfo类型 for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers // We assign workers to each waiting driver in a round-robin fashion. For each driver, we // start from the last worker that was assigned a driver, and continue onwards until we have // explored all alive workers. var launched = false var numWorkersVisited = 0 //对于给定的待申请资源的Driver,遍历所有可用的Worker,寻找满足条件的Worker(CPU和内存足够) while (numWorkersVisited < numWorkersAlive && !launched) { //获取第curPos号的可用Worker val worker = shuffledAliveWorkers(curPos) numWorkersVisited += 1 ////如果该worker满足Driver的资源需求,则启动Driver进程,并且设置为已经动(launched=true) if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) { launchDriver(worker, driver) waitingDrivers -= driver launched = true } ///判断下一个可用的Worker curPos = (curPos + 1) % numWorkersAlive } } // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app // in the queue, then the second app, etc. //将Executor尽量分散到尽可能多的Worker上 if (spreadOutApps) { // Try to spread out each app among all the nodes, until it has all its cores for (app <- waitingApps if app.coresLeft > 0) { val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE) .filter(canUse(app, _)).sortBy(_.coresFree).reverse val numUsable = usableWorkers.length //assigned数组记录每个Worker上已经分配的CPU内核数 val assigned = new Array[Int](numUsable) // Number of cores to give on each node //toAssign表示为该app将要分配的CPU内核数,它是app申请的内核数和所有Worker的剩余CPU内核数的较小值 var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum) var pos = 0 while (toAssign > 0) { //遍历所有的可用Worker,一次分配一个CPU内核,记录在assigned数组中 if (usableWorkers(pos).coresFree - assigned(pos) > 0) { toAssign -= 1 assigned(pos) += 1 } pos = (pos + 1) % numUsable } // Now that we've decided how many cores to give on each node, let's actually give them //根据assigned数组中记录的每个Worker要分配的CPU内核数,进行实际的分配 for (pos <- 0 until numUsable) { if (assigned(pos) > 0) { val exec = app.addExecutor(usableWorkers(pos), assigned(pos)) //启动给定指定内核数的Executor进程(实际上是CoarceGrainedExecutorBackend进程) launchExecutor(usableWorkers(pos), exec) app.state = ApplicationState.RUNNING } } } } else { // Pack each app into as few nodes as possible until we've assigned all its cores //将Executor的分配集中在尽可能上的Worker上,外围循环是Worker,内层循环是Application for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) { for (app <- waitingApps if app.coresLeft > 0) { if (canUse(app, worker)) {//针对该Application,如果可以分配资源, //分配的内核数是该Worker可用的内核数与App申请的内核数的最小值 val coresToUse = math.min(worker.coresFree, app.coresLeft) if (coresToUse > 0) { //要么Worker的内核数全部分配,要么app需要的内核数得到满足 val exec = app.addExecutor(worker, coresToUse) launchExecutor(worker, exec) app.state = ApplicationState.RUNNING } } } } } }