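A walkthrough of the Spark 1.3.1 source code
The Spark source is impressively compact: in standalone mode, resource scheduling takes fewer than 100 lines. Spark offers two scheduling strategies. Spread out: an application's executors are distributed across as many workers as possible. Consolidate: an application's executors are packed onto as few workers as possible. The spreadOutApps variable (read from the spark.deploy.spreadOut setting) selects between them; true means spread out.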
private def schedule() {
// If this master is not ALIVE, return immediately.
// In other words, a STANDBY master takes no part in resource scheduling.
if (state != RecoveryState.ALIVE) { return }
// Random.shuffle returns the collection in random order.
// Take every worker that has registered, keep only those whose state is ALIVE,
// and shuffle them.
val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(
_.state == WorkerState.ALIVE))
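val numWorkersAlive = shuffledAliveWorkers.size
var curPos = 0
// A driver is registered with the master only when the application is submitted
// in standalone cluster deploy mode. In client mode the driver starts on the
// submitting machine, never registers, and so is never scheduled by the master.
// The for loop below therefore only does work for cluster-mode submissions.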
for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
// We assign workers to each waiting driver in a round-robin fashion. For each driver, we
// start from the last worker that was assigned a driver, and continue onwards until we have
// explored all alive workers.
var launched = false
var numWorkersVisited = 0
/**
 * Keep looping while some alive workers have not been visited yet
 * and this driver has not been launched (launched == false).
 */
while (numWorkersVisited < numWorkersAlive && !launched) {
val worker = shuffledAliveWorkers(curPos)
numWorkersVisited += 1
/**
 * The worker's free memory must be at least the memory the driver needs,
 * and its free cores must be at least the cores the driver needs.
 */
if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
// Launch the driver on this worker.
launchDriver(worker, driver)
// Remove the driver from waitingDrivers.
waitingDrivers -= driver
launched = true
}
// Advance the pointer to the next worker.
curPos = (curPos + 1) % numWorkersAlive
}
}
That covers driver scheduling.
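The round-robin placement above can be tried out in isolation. The following is a minimal sketch, not the Master's code: Worker and Driver are hypothetical stand-ins for WorkerInfo and DriverInfo, launchDriver is replaced by simply reserving the resources, and the Random.shuffle step is left out so the result is deterministic.

```scala
// Hypothetical, simplified stand-ins for the Master's WorkerInfo and DriverInfo.
final case class Worker(id: String, var memoryFree: Int, var coresFree: Int)
final case class Driver(id: String, mem: Int, cores: Int)

object DriverScheduling {
  /** Returns (driverId -> workerId placements, drivers that are still waiting). */
  def schedule(workers: IndexedSeq[Worker],
               waiting: Seq[Driver]): (Map[String, String], Seq[Driver]) = {
    var placed = Map.empty[String, String]
    var stillWaiting = Vector.empty[Driver]
    var curPos = 0 // shared across drivers, so each driver starts where the last one stopped
    for (driver <- waiting) {
      var launched = false
      var visited = 0
      // Visit each worker at most once per driver.
      while (visited < workers.size && !launched) {
        val worker = workers(curPos)
        visited += 1
        if (worker.memoryFree >= driver.mem && worker.coresFree >= driver.cores) {
          // Stand-in for launchDriver: reserve the resources on the chosen worker.
          worker.memoryFree -= driver.mem
          worker.coresFree -= driver.cores
          placed += (driver.id -> worker.id)
          launched = true
        }
        curPos = (curPos + 1) % workers.size
      }
      if (!launched) stillWaiting :+= driver
    }
    (placed, stillWaiting)
  }
}
```

A driver that fits on no worker simply stays in the waiting list, which mirrors how the real loop leaves it in waitingDrivers until a later call to schedule().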
Next, let's look at resource scheduling for applications.
if (spreadOutApps) {
for (app <- waitingApps if app.coresLeft > 0) {
// From workers, keep those that are ALIVE and usable by this app,
// sorted by free cores in descending order.
val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
.filter(canUse(app, _)).sortBy(_.coresFree).reverse
val numUsable = usableWorkers.length
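// Array of zeros holding the number of cores to assign on each worker.
val assigned = new Array[Int](numUsable) // Number of cores to give on each node
// Cores that can actually be assigned: the smaller of the cores the application
// still needs and the total free cores on the usable workers.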
var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
var pos = 0
// Loop while there are still cores to assign.
while (toAssign > 0) {
// Does this worker still have free cores beyond those already assigned to it?
if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
// One fewer core left to assign.
toAssign -= 1
// One more core assigned to this worker.
assigned(pos) += 1
}
// Move on to the next worker.
pos = (pos + 1) % numUsable
}
// Now that we've decided how many cores to give on each node, let's actually give them
// At this point every worker has its share of the application's cores.
for (pos <- 0 until numUsable) {
// Skip workers that were given no cores.
if (assigned(pos) > 0) {
// Create an ExecutorDesc describing the executor
// and record it in the application's executor bookkeeping.
val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
// Launch the executor on the worker.
launchExecutor(usableWorkers(pos), exec)
// Mark the application as RUNNING.
app.state = ApplicationState.RUNNING
}
}
}
}
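Summary of this scheduling algorithm:
Earlier we said that spark-submit specifies how many executors to start and how many CPU cores each executor needs. Under this mechanism, the actual number of executors and the cores per executor may differ from what was configured, because allocation is driven by the total core count. For example, suppose we ask for three executors with three cores each, nine cores in total. If there are nine workers, this algorithm assigns one core on each worker and then starts one executor per worker. The result is nine executors with one core each.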
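The second scheduling algorithm:
This algorithm is the opposite of the one above: each application is packed onto as few workers as possible. For example, with 10 workers of 10 cores each and an application that needs 20 cores, only two workers are used, and each gives up all 10 of its cores, so other applications have to be placed on the remaining workers. So if spark-submit asks for 10 executors with 2 cores each, 20 cores in total, this algorithm actually starts only two executors with 10 cores each.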
// This is the else branch of if (spreadOutApps).
// Iterate over workers that are ALIVE and still have free cores.
for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
// Iterate over applications that still need cores.
for (app <- waitingApps if app.coresLeft > 0) {
// Check that the application can use this worker.
if (canUse(app, worker)) {
// Take the smaller of the worker's free cores and the cores the application still needs.
val coresToUse = math.min(worker.coresFree, app.coresLeft)
// Only launch if there is at least one core to give; otherwise nothing is left to assign.
if (coresToUse > 0) {
val exec = app.addExecutor(worker, coresToUse)
// Launch the executor on the worker.
launchExecutor(worker, exec)
// Mark the application as RUNNING.
app.state = ApplicationState.RUNNING
}
}
}
}
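The consolidating strategy can be sketched the same way. This version handles a single application and takes the workers' free-core counts as plain integers; it leaves out canUse, worker state, and the inner loop over waitingApps. Consolidate and assign are hypothetical names, not part of Spark.

```scala
object Consolidate {
  /**
   * coresFree(i) is the number of free cores on worker i;
   * coresLeft is the number of cores the application still needs.
   * Returns one (workerIndex, cores) pair per executor that would be launched.
   */
  def assign(coresFree: Seq[Int], coresLeft: Int): Seq[(Int, Int)] = {
    var remaining = coresLeft
    val executors = Seq.newBuilder[(Int, Int)]
    // Walk the workers in order, taking as many cores as possible from each one.
    for ((free, idx) <- coresFree.zipWithIndex if free > 0) {
      val coresToUse = math.min(free, remaining)
      if (coresToUse > 0) {
        executors += ((idx, coresToUse))
        remaining -= coresToUse
      }
    }
    executors.result()
  }
}
```

For 10 workers with 10 free cores each and an application needing 20 cores, the result is two executors of 10 cores on the first two workers, matching the example in the text.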
One method used here is especially important:
def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc) {
logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
// Record the executor in the worker's bookkeeping.
worker.addExecutor(exec)
// Send LaunchExecutor to the worker's actor so the worker starts the executor.
worker.actor ! LaunchExecutor(masterUrl,
exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
exec.application.driver ! ExecutorAdded(
exec.id, worker.id, worker.hostPort, exec.cores, exec.memory)
}
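In this method, the line
worker.actor ! LaunchExecutor(masterUrl, exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory)
sends this message to the worker (the call wraps Akka's message passing); the worker handles the message on its side and starts the Executor.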
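The message-passing pattern can be illustrated without Akka. The sketch below is only an approximation under stated assumptions: the case classes are trimmed-down, hypothetical versions of the real messages (the application description field is omitted), and WorkerSketch.receive is a plain function standing in for the worker actor's message handler.

```scala
// Hypothetical, trimmed-down message types; the real messages carry more fields.
sealed trait DeployMessage
final case class LaunchExecutor(masterUrl: String, appId: String, execId: Int,
                                cores: Int, memory: Int) extends DeployMessage
final case class ExecutorAdded(execId: Int, workerId: String, hostPort: String,
                               cores: Int, memory: Int) extends DeployMessage

object WorkerSketch {
  /** Stand-in for the worker actor's handler: dispatch on the message type. */
  def receive(msg: DeployMessage): String = msg match {
    case LaunchExecutor(_, appId, execId, cores, memory) =>
      s"launching executor $appId/$execId with $cores cores, $memory MB"
    case other =>
      s"ignored $other"
  }
}
```

In the real code the sender does not call the handler directly: the ! operator enqueues the message in the worker actor's mailbox and returns at once, so the master never blocks waiting for the executor to start.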