The previous post analyzed Spark's deployment process in Standalone mode in detail, and mentioned that once a Worker finishes registering, a schedule operation runs to allocate resources. This post looks at exactly how that method allocates them.
Note: all posts in this series use the Spark-1.6.3 source code as the reference; where Spark-2.1.0 makes significant changes, they will be pointed out.
When is schedule called?
In fact, schedule is called whenever a new application joins or the available resources change, and it then redistributes resources. So how does it allocate them? Let's walk through the source code.
schedule
Let's start with the source code of Master.schedule:
private def schedule(): Unit = {
  // Resources can only be scheduled while the Master's state is ALIVE
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Drivers take strict precedence over executors
  // As the comment says: launch Drivers first, then executors
  // 1. Shuffle the ALIVE workers so that drivers are not always placed on the same worker
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  // 2. Iterate over the waiting drivers and launch them
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    var numWorkersVisited = 0
    // 2.1 Keep looking while there are unvisited alive workers and this driver is not yet launched
    while (numWorkersVisited < numWorkersAlive && !launched) {
      // 2.2 Pick a worker; the first index is 0, later ones follow curPos = (curPos + 1) % numWorkersAlive
      val worker = shuffledAliveWorkers(curPos)
      // 2.3 Count this worker as visited
      numWorkersVisited += 1
      // 2.4 Check whether the worker's free memory and cores satisfy the driver's requirements
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // 2.5 If the resources suffice, launch the driver on this worker
        launchDriver(worker, driver)
        // 2.6 After launching, remove the driver from the waiting queue and mark it as launched
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  // 3. Launch the executors
  startExecutorsOnWorkers()
}
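Before moving on, it may help to see the round-robin placement in isolation. Below is a minimal, self-contained sketch of the same loop over plain data; FreeWorker, DriverReq and placeDrivers are invented for this illustration (the real code works on WorkerInfo and DriverInfo):

import scala.util.Random

// Invented stand-ins for the resource fields of WorkerInfo / DriverInfo
case class FreeWorker(id: String, var memoryFree: Int, var coresFree: Int)
case class DriverReq(mem: Int, cores: Int)

// Round-robin driver placement, mirroring the loop in Master.schedule
def placeDrivers(workers: Seq[FreeWorker], drivers: Seq[DriverReq]): Unit = {
  val shuffled = Random.shuffle(workers) // avoid always trying the same worker first
  val n = shuffled.size
  var curPos = 0
  for (driver <- drivers) {
    var launched = false
    var visited = 0
    while (visited < n && !launched) {
      val w = shuffled(curPos)
      visited += 1
      if (w.memoryFree >= driver.mem && w.coresFree >= driver.cores) {
        w.memoryFree -= driver.mem   // stand-in for launchDriver + addDriver bookkeeping
        w.coresFree -= driver.cores
        launched = true
      }
      curPos = (curPos + 1) % n
    }
  }
}

Note that a driver that fits on no alive worker simply stays in waitingDrivers and is retried on the next schedule call.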
Launching the Driver
The allocation flow is annotated in detail in the source above; now let's look at the launchDriver method:
private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  // 1. Log the launch
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  // 2. Record the driver on the worker, bumping the worker's used memory and cores
  worker.addDriver(driver)
  // 3. Give the driver a reference to its worker
  driver.worker = Some(worker)
  // 4. Send a LaunchDriver message telling the Worker to start the driver
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  // 5. Mark the driver's state as RUNNING
  driver.state = DriverState.RUNNING
}
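As an aside, worker.addDriver on the Master side is pure bookkeeping; the snippet below is paraphrased from WorkerInfo in the 1.6.3 source (see WorkerInfo.scala for the exact code):

// Paraphrased from o.a.s.deploy.master.WorkerInfo: record the driver and
// account for the memory and cores it now occupies on this worker
def addDriver(driver: DriverInfo) {
  drivers(driver.id) = driver
  memoryUsed += driver.desc.mem
  coresUsed += driver.desc.cores
}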
Next, let's see how the Worker handles the LaunchDriver message it receives:
case LaunchDriver(driverId, driverDesc) => {
  // 1. Log the request
  logInfo(s"Asked to launch driver $driverId")
  // 2. Instantiate a DriverRunner
  val driver = new DriverRunner(
    conf,
    driverId,
    workDir,
    sparkHome,
    driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
    self,
    workerUri,
    securityMgr)
  // 3. Record the new runner in the drivers map
  drivers(driverId) = driver
  // 4. Start the driver
  driver.start()
  // 5. Account for the resources the driver now occupies
  coresUsed += driverDesc.cores
  memoryUsed += driverDesc.mem
}
Let's follow driver.start(); as its doc comment says, it starts a thread to run and manage the driver:
/** Starts a thread to run and manage the driver. */
private[worker] def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      try {
        // Create the driver's working directory
        val driverDir = createWorkingDirectory()
        // Download the user's jar into the working directory and return its local path
        val localJarFilename = downloadUserJar(driverDir)
        def substituteVariables(argument: String): String = argument match {
          case "{{WORKER_URL}}" => workerUrl
          case "{{USER_JAR}}" => localJarFilename
          case other => other
        }
        // TODO: If we add ability to submit multiple jars they should also be added here
        val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
          driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)
        // The actual launch of the driver process; not analyzed in detail here
        launchDriver(builder, driverDir, driverDesc.supervise)
      }
      catch {
        case e: Exception => finalException = Some(e)
      }
      // Derive the driver's final state from how the run ended
      val state =
        if (killed) {
          DriverState.KILLED
        } else if (finalException.isDefined) {
          DriverState.ERROR
        } else {
          finalExitCode match {
            case Some(0) => DriverState.FINISHED
            case _ => DriverState.FAILED
          }
        }
      finalState = Some(state)
      // Report the final state (and any exception) back to the worker
      worker.send(DriverStateChanged(driverId, state, finalException))
    }
  }.start()
}
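The inner launchDriver call above is where the driver JVM is actually spawned. We will not trace it in detail, but conceptually it runs the built command from the working directory and, when --supervise was requested, restarts the driver whenever it exits with a non-zero code. A simplified, hypothetical sketch of that loop (runAndSupervise is invented; the real method is DriverRunner.launchDriver):

import java.io.File

// Hypothetical sketch: run the driver command, restarting it on failure if supervised
def runAndSupervise(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
  builder.directory(baseDir)                 // run the driver from its working directory
  var exitCode = -1
  var keepTrying = true
  while (keepTrying) {
    val process = builder.start()            // spawn the driver process
    exitCode = process.waitFor()             // block until it exits
    keepTrying = supervise && exitCode != 0  // retry only supervised drivers that failed
  }
  exitCode
}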
Whatever the outcome, the thread ends by sending a DriverStateChanged message to the worker. On receiving it, the Worker calls handleDriverStateChanged for a series of bookkeeping steps that we will not detail here; the key step is forwarding a DriverStateChanged message to the Master, which removes the driver's info when it receives it:
case DriverStateChanged(driverId, state, exception) => {
state match {
case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
removeDriver(driverId, state, exception)
case _ =>
throw new Exception(s"Received unexpected state update for driver $driverId: $state")
}
}
This completes allocating resources to the Driver and launching it. Next, let's look at launching the Executors, i.e. the startExecutorsOnWorkers() flow.
Launching Executors
startExecutorsOnWorkers():
/**
 * Schedule and launch executors on workers
 */
private def startExecutorsOnWorkers(): Unit = {
  // Applications are served in FIFO order
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // Filter out workers that don't have enough resources to launch an executor,
    // keeping only ALIVE workers, sorted by free cores in descending order
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    // Decide how many cores to allocate on each worker
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}
Let's first see how the number of cores to allocate on each worker is decided. Here we only reproduce the English doc comment of scheduleExecutorsOnWorkers and summarize it; for the full implementation, go read the source:
/**
* Schedule executors to be launched on the workers.
* Returns an array containing number of cores assigned to each worker.
*
* There are two modes of launching executors. The first attempts to spread out an application's
* executors on as many workers as possible, while the second does the opposite (i.e. launch them
* on as few workers as possible). The former is usually better for data locality purposes and is
* the default.
*
* The number of cores assigned to each executor is configurable. When this is explicitly set,
* multiple executors from the same application may be launched on the same worker if the worker
* has enough cores and memory. Otherwise, each executor grabs all the cores available on the
* worker by default, in which case only one executor may be launched on each worker.
*
* It is important to allocate coresPerExecutor on each worker at a time (instead of 1 core
* at a time). Consider the following example: cluster has 4 workers with 16 cores each.
* User requests 3 executors (spark.cores.max = 48, spark.executor.cores = 16). If 1 core is
* allocated at a time, 12 cores from each worker would be assigned to each executor.
* Since 12 < 16, no executors would launch [SPARK-8881].
*/
Roughly, there are two allocation modes: the first spreads an application's executors across as many workers as possible, while the second does the opposite. The first is the default, as it is usually better for data locality. The number of cores per executor is configurable (via spark-submit or spark-env.sh). If it is set explicitly, several executors of the same application may be launched on the same worker, provided that worker has enough cores and memory; otherwise each executor grabs all the cores available on its worker, in which case at most one executor is launched per worker. When handing out cores across the cluster, the scheduler satisfies the request as far as it can: if the requested core count exceeds the workers' total free cores, it simply hands out all the free cores there are.
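To make the policy concrete, here is a heavily simplified, self-contained model of the core loop of scheduleExecutorsOnWorkers. It ignores the memory and executor-count limits the real method also checks, and scheduleCores and its parameters are invented for this illustration:

// Distribute cores over workers, minCores at a time: either spread out
// (one slot per worker per round) or packed (fill a worker before moving on)
def scheduleCores(
    coresWanted: Int,              // app.coresLeft
    workerFreeCores: Array[Int],   // free cores of each usable worker
    coresPerExecutor: Option[Int], // spark.executor.cores, if set
    spreadOut: Boolean): Array[Int] = {
  val minCores = coresPerExecutor.getOrElse(1)
  val assigned = Array.fill(workerFreeCores.length)(0)
  var toAssign = math.min(coresWanted, workerFreeCores.sum)
  def canLaunch(pos: Int): Boolean =
    toAssign >= minCores && workerFreeCores(pos) - assigned(pos) >= minCores
  var freeWorkers = workerFreeCores.indices.filter(canLaunch)
  while (freeWorkers.nonEmpty) {
    freeWorkers.foreach { pos =>
      var keepScheduling = true
      while (keepScheduling && canLaunch(pos)) {
        toAssign -= minCores
        assigned(pos) += minCores
        if (spreadOut) keepScheduling = false // move on to the next worker
      }
    }
    freeWorkers = freeWorkers.filter(canLaunch)
  }
  assigned
}

With 4 workers of 16 free cores each, coresWanted = 48 and coresPerExecutor = Some(16), this yields 16 cores on each of three workers, which is exactly the SPARK-8881 example from the doc comment: allocating coresPerExecutor at a time avoids stranding 12 unusable cores on every worker.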
Next comes the actual assignment of compute resources to the executors and their launch:
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    // Record the executor on the application
    val exec = app.addExecutor(worker, coresToAssign)
    // Launch the executor
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}
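A quick worked example of that division, with hypothetical numbers: suppose this worker was assigned 8 cores. If spark.executor.cores = 4, two 4-core executors are launched; if coresPerExecutor is unset, a single executor grabs all 8 cores. (scheduleExecutorsOnWorkers guarantees assignedCores is a multiple of coresPerExecutor, so the division leaves no remainder.)

// Hypothetical inputs, just to trace the two branches above
val assignedCores = 8
val execCoresSet: Option[Int] = Some(4)        // spark.executor.cores = 4
val numExecutors = execCoresSet.map(assignedCores / _).getOrElse(1) // 2 executors
val coresEach    = execCoresSet.getOrElse(assignedCores)            // 4 cores each

val execCoresUnset: Option[Int] = None
val singleExec = execCoresUnset.map(assignedCores / _).getOrElse(1) // 1 executor
val allCores   = execCoresUnset.getOrElse(assignedCores)            // with all 8 cores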
Launching the executors:
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  // Tell the worker to launch the executor
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  // Then tell the driver about the newly added executor
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}
On receiving the LaunchExecutor message, the worker performs the actual launch and reports back to the Master. The driver is likewise informed of the new executor's resources via ExecutorAdded; with its executors registered, the driver can run the application. That completes the rough flow of allocating resources to executors and launching them.
Finally, a diagram summarizing the simplified flow of launching the Driver and the Executors:
This post only sketched how the Master, while running schedule, allocates resources to the Driver and the Executors and launches them. In a later post we will analyze the complete run of an application and revisit these steps in more depth.
References and further reading:
Spark-1.6.3 source code
Spark-2.1.0 source code
This is an original post. You are welcome to repost it; please credit the source and the author. Thanks!