In the previous article we walked through the entire flow of creating and starting a SparkContext. But once the SparkContext exists, how does the TaskScheduler register the application with the Master, and how does the Master schedule workers to launch it? With these questions in mind, let's look inside the Master.
We will dig into the Master source code from four angles: the active/standby failover mechanism, the registration mechanism, the state-change mechanism, and the resource scheduling mechanism.
The so-called active/standby failover mechanism means that when the active Master goes down, a standby Master can take over.
In a Spark cluster we can configure two Masters. Failover is supported both in Spark's native standalone mode (where the Master itself fails over) and when running on YARN (where high availability is handled by YARN's ResourceManager).
The standalone Master's failover can be backed by two mechanisms: one based on the file system and one based on ZooKeeper. With the file-system-based mechanism, promoting a standby Master to active requires manual intervention, while the ZooKeeper-based mechanism performs the switch automatically.
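For reference, both mechanisms are enabled through the Master's JVM properties, typically in conf/spark-env.sh (the ZooKeeper hosts and directory paths below are placeholders):

# ZooKeeper-based failover: automatic election among multiple Masters
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

# File-system-based recovery: a single Master recovers state after a manual restart
# export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
#   -Dspark.deploy.recoveryDirectory=/path/to/recovery/dir"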
So what work does the Master do when a standby is promoted to active? In short, recovery ends by filtering out any workers and applications that never responded, relaunching (or removing) drivers that no worker claimed, marking the Master ALIVE, and resuming scheduling. The completeRecovery method implements this final step:
/**
 * Filter out any drivers and workers that never sent a response.
 */
def completeRecovery() {
  // Ensure "only-once" recovery semantics using a short synchronization period.
  synchronized {
    if (state != RecoveryState.RECOVERING) { return }
    state = RecoveryState.COMPLETING_RECOVERY
  }

  // Kill off any workers and apps that didn't respond to us.
  workers.filter(_.state == WorkerState.UNKNOWN).foreach(removeWorker)
  apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication)

  // Reschedule drivers which were not claimed by any workers
  drivers.filter(_.worker.isEmpty).foreach { d =>
    logWarning(s"Driver ${d.id} was not found after master recovery")
    if (d.desc.supervise) {
      logWarning(s"Re-launching ${d.id}")
      relaunchDriver(d)
    } else {
      removeDriver(d.id, DriverState.ERROR, None)
      logWarning(s"Did not re-launch ${d.id} because it was not supervised")
    }
  }

  state = RecoveryState.ALIVE
  schedule()
  logInfo("Recovery complete - resuming operations!")
}
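A note on timing: completeRecovery is not called immediately after election. In the Spark 1.x source this walkthrough follows, the newly elected Master first re-registers every application, driver, and worker read back from the persistence engine and marks them UNKNOWN; completeRecovery then runs once they have reconnected, or after a timeout, to sweep away whatever is still UNKNOWN.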
def removeWorker(worker: WorkerInfo) {
  logInfo("Removing worker " + worker.id + " on " + worker.host + ":" + worker.port)
  // Mark the worker DEAD and drop its info from the in-memory caches
  worker.setState(WorkerState.DEAD)
  idToWorker -= worker.id
  addressToWorker -= worker.actor.path.address
  for (exec <- worker.executors.values) {
    logInfo("Telling app of lost executor: " + exec.id)
    exec.application.driver ! ExecutorUpdated(
      exec.id, ExecutorState.LOST, Some("worker lost"), None)
    exec.application.removeExecutor(exec)
  }
  // ... (driver handling and persistence cleanup are shown in the full
  // listing in the registration section below)
}
def finishApplication(app: ApplicationInfo) {
  removeApplication(app, ApplicationState.FINISHED)
}
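finishApplication is just a thin wrapper: the real work happens in removeApplication (not listed here), which mirrors removeWorker. In the Spark 1.x source it drops the application from the in-memory caches, kills the application's remaining executors, removes it from the persistence engine, and calls schedule() again so the freed resources can be reassigned.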
While a Spark application runs, the Application, Worker, and Driver each register with the Master. How does each registration work?
Application registration is handled by registerApplication:
/**
 * Application registration on the Master.
 * @param app the application to register
 */
def registerApplication(app: ApplicationInfo): Unit = {
  val appAddress = app.driver.path.address
  if (addressToApp.contains(appAddress)) {
    logInfo("Attempted to re-register application at same address: " + appAddress)
    return
  }

  applicationMetricsSystem.registerSource(app.appSource)
  // Add the application to the in-memory caches
  // val apps = new HashSet[ApplicationInfo]
  apps += app
  idToApp(app.id) = app
  // val actorToApp = new HashMap[ActorRef, ApplicationInfo]
  actorToApp(app.driver) = app
  addressToApp(appAddress) = app
  // Add the application to the waiting queue
  // val waitingApps = new ArrayBuffer[ApplicationInfo]
  waitingApps += app
}
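For context, registerApplication is invoked from the Master's RegisterApplication message handler. Paraphrasing the Spark 1.x handler from memory (treat this as a sketch, not an exact listing), it looks roughly like:

case RegisterApplication(description) => {
  if (state == RecoveryState.STANDBY) {
    // A standby Master ignores registration attempts entirely
  } else {
    logInfo("Registering app " + description.name)
    val app = createApplication(description, sender)
    registerApplication(app)
    // Persist the app so a new Master can recover it after a failover
    persistenceEngine.addApplication(app)
    // Ack back to the driver, then kick off scheduling
    sender ! RegisteredApplication(app.id, masterUrl)
    schedule()
  }
}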
When a worker is lost (for example, it times out), the Master removes it via removeWorker, shown here in full:
def removeWorker(worker: WorkerInfo) {
  logInfo("Removing worker " + worker.id + " on " + worker.host + ":" + worker.port)
  // Mark the worker DEAD and drop its info from the in-memory caches
  worker.setState(WorkerState.DEAD)
  idToWorker -= worker.id
  addressToWorker -= worker.actor.path.address
  for (exec <- worker.executors.values) {
    logInfo("Telling app of lost executor: " + exec.id)
    exec.application.driver ! ExecutorUpdated(
      exec.id, ExecutorState.LOST, Some("worker lost"), None)
    exec.application.removeExecutor(exec)
  }
  for (driver <- worker.drivers.values) {
    /**
     * If driver.desc.supervise is set, the driver is supervised and the
     * Master tries to relaunch it; otherwise the driver is simply removed.
     */
    if (driver.desc.supervise) {
      logInfo(s"Re-launching ${driver.id}")
      relaunchDriver(driver)
    } else {
      logInfo(s"Not re-launching ${driver.id} because it was not supervised")
      removeDriver(driver.id, DriverState.ERROR, None)
    }
  }
  /**
   * Remove the worker from the persistence engine (there are two: the
   * file-system-based FileSystemPersistenceEngine and the ZooKeeper-based
   * ZookeeperPersistenceEngine) so it is not recovered after a failover.
   */
  persistenceEngine.removeWorker(worker)
}
def relaunchDriver(driver: DriverInfo) {
  // Detach the driver from its dead worker and queue it for rescheduling
  driver.worker = None
  driver.state = DriverState.RELAUNCHING
  waitingDrivers += driver
  schedule()
}
The so-called state-change mechanism refers to the messages sent to the Master when the state of a Driver or Executor changes.
The handler for a Driver state change looks like this:
case DriverStateChanged(driverId, state, exception) => {
  state match {
    // If the driver's new state is ERROR, FINISHED, KILLED, or FAILED, remove it
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }
}
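In the Spark 1.x codebase, this message originates on the worker side: the DriverRunner that spawned the driver process reports its final state to its Worker, which forwards the DriverStateChanged message to the Master. The removeDriver it invokes is shown next: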
/**
 * Remove a driver.
 * @param driverId
 * @param finalState
 * @param exception
 */
def removeDriver(driverId: String, finalState: DriverState, exception: Option[Exception]) {
  // Use Scala's higher-order find to look the driver up by id
  drivers.find(d => d.id == driverId) match {
    // The driver was found (Some is one of Scala's Option cases)
    case Some(driver) =>
      logInfo(s"Removing driver: $driverId")
      // Drop it from the in-memory cache
      drivers -= driver
      if (completedDrivers.size >= RETAINED_DRIVERS) {
        val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
        completedDrivers.trimStart(toRemove)
      }
      // Add the finished driver to the completed queue
      completedDrivers += driver
      // Delete the driver's info from the persistence engine
      persistenceEngine.removeDriver(driver)
      driver.state = finalState
      driver.exception = exception
      // Detach the driver from its worker, if it had one
      driver.worker.foreach(w => w.removeDriver(driver))
      schedule()
    case None =>
      logWarning(s"Asked to remove unknown driver: $driverId")
  }
}
And the handler for an Executor state change:
/**
 * How the Master handles an executor state change.
 */
case ExecutorStateChanged(appId, execId, state, message, exitStatus) => {
  val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
  execOption match {
    case Some(exec) => {
      val appInfo = idToApp(appId)
      exec.state = state
      if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }
      exec.application.driver ! ExecutorUpdated(execId, state, message, exitStatus)
      if (ExecutorState.isFinished(state)) {
        // Remove this executor from the worker and app
        logInfo(s"Removing executor ${exec.fullId} because it is $state")
        // Drop the ExecutorInfo from the application's cache
        appInfo.removeExecutor(exec)
        // Drop the executor from its worker
        exec.worker.removeExecutor(exec)

        val normalExit = exitStatus == Some(0)
        // Only retry certain number of times so we don't go into an infinite loop.
        if (!normalExit) {
          if (appInfo.incrementRetryCount() < ApplicationState.MAX_NUM_RETRY) {
            schedule()
          } else {
            val execs = appInfo.executors.values
            if (!execs.exists(_.state == ExecutorState.RUNNING)) {
              logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
                s"${appInfo.retryCount} times; removing it")
              removeApplication(appInfo, ApplicationState.FAILED)
            }
          }
        }
      }
    }
    case None =>
      logWarning(s"Got status update for unknown executor $appId/$execId")
  }
}
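In other words, if an application's executors keep exiting abnormally, the Master reschedules them, but only up to ApplicationState.MAX_NUM_RETRY attempts (10 in the Spark 1.x source, if memory serves); once the limit is hit and no executor is still running, the whole application is marked FAILED and removed.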
Resource scheduling happens throughout a Spark program's lifetime, and it is the most important part of the Master. We can understand the Master's resource scheduling algorithm by reading the schedule() method:
/**
 * Schedule the currently available resources among waiting apps. This method will be called
 * every time a new app joins or resource availability changes.
 */
/**
 * The Master's resource scheduling mechanism.
 */
private def schedule() {
  // If the Master is not in the ALIVE state, return immediately
  if (state != RecoveryState.ALIVE) { return }

  // First schedule drivers, they take strict precedence over applications
  // Randomization helps balance drivers
  /**
   * Take all workers, keep only those in the ALIVE state, and use
   * Random.shuffle to randomize their order.
   */
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0

  /**
   * Drivers are scheduled first. Note this only applies to drivers submitted
   * in cluster mode; in client mode the driver never runs on a worker.
   */
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    var numWorkersVisited = 0
    // Walk the shuffled alive workers until the driver launches or all have been visited
    while (numWorkersVisited < numWorkersAlive && !launched) {
      // Pick the worker at the current position
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      /**
       * If this worker's free memory is at least what the driver needs,
       * and its free CPU cores are at least what the driver needs...
       */
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // ...launch the driver on it
        launchDriver(worker, driver)
        // and remove the driver from the waiting queue
        waitingDrivers -= driver
        launched = true
      }
      // Advance the pointer to the next worker
      curPos = (curPos + 1) % numWorkersAlive
    }
  }

  /**
   * There are two application scheduling strategies: spread-out and
   * non-spread-out, toggled by the spark.deploy.spreadOut parameter
   * (true by default).
   */
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  /**
   * The spread-out strategy tries to distribute an app's cores across as many
   * workers as possible. For example, if you ask for 3 executors with 3 cores
   * each (9 cores in total) and the cluster has 9 usable workers, it may end
   * up assigning one core on each of the 9 workers. In other words, the actual
   * executor layout is not necessarily the one you requested.
   */
  if (spreadOutApps) {
    // Try to spread out each app among all the nodes, until it has all its cores
    for (app <- waitingApps if app.coresLeft > 0) {
      // Again keep only ALIVE workers that can host this app,
      // sorted by free cores in descending order
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(canUse(app, _)).sortBy(_.coresFree).reverse
      val numUsable = usableWorkers.length
      val assigned = new Array[Int](numUsable) // Number of cores to give on each node
      // Compute how many cores to hand out: the minimum of what the app still
      // needs and the total free cores across all usable workers
      var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
      var pos = 0
      while (toAssign > 0) {
        // If this worker still has free cores beyond what we've already
        // assigned to it, give it one more
        if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
          toAssign -= 1
          assigned(pos) += 1
        }
        pos = (pos + 1) % numUsable
      }
      /**
       * Having decided how many cores each worker gets, walk the workers
       * and actually hand the cores out.
       */
      // Now that we've decided how many cores to give on each node, let's actually give them
      for (pos <- 0 until numUsable) {
        if (assigned(pos) > 0) {
          // If this worker was assigned any cores, launch an executor on it
          val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
          launchExecutor(usableWorkers(pos), exec)
          app.state = ApplicationState.RUNNING
        }
      }
    }
  } else {
    /**
     * The non-spread-out strategy packs an app onto as few workers as
     * possible, giving each worker as many of the app's remaining cores as it
     * can take. For example, if an app still needs 20 cores and each worker
     * has 10 free cores, only 2 workers will be used, with a 10-core executor
     * launched on each.
     */
    // Pack each app into as few nodes as possible until we've assigned all its cores
    for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
      for (app <- waitingApps if app.coresLeft > 0) {
        if (canUse(app, worker)) {
          val coresToUse = math.min(worker.coresFree, app.coresLeft)
          if (coresToUse > 0) {
            val exec = app.addExecutor(worker, coresToUse)
            launchExecutor(worker, exec)
            app.state = ApplicationState.RUNNING
          }
        }
      }
    }
  }
}
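To make the difference between the two strategies concrete, here is a small self-contained simulation. WorkerSim and both functions are made up for illustration; they only mirror the assignment loops above, not Spark's actual classes.

object ScheduleSimulation {
  // A stand-in for WorkerInfo that tracks only free cores
  case class WorkerSim(id: String, var coresFree: Int)

  // Spread-out: round-robin one core at a time across all usable workers
  def spreadOut(workers: Array[WorkerSim], coresLeft: Int): Array[Int] = {
    val assigned = new Array[Int](workers.length)
    var toAssign = math.min(coresLeft, workers.map(_.coresFree).sum)
    var pos = 0
    while (toAssign > 0) {
      if (workers(pos).coresFree - assigned(pos) > 0) {
        toAssign -= 1
        assigned(pos) += 1
      }
      pos = (pos + 1) % workers.length
    }
    assigned
  }

  // Non-spread-out: pack as many cores as possible onto each worker in turn
  def packed(workers: Array[WorkerSim], coresLeft: Int): Array[Int] = {
    val assigned = new Array[Int](workers.length)
    var remaining = coresLeft
    for (pos <- workers.indices if remaining > 0) {
      val coresToUse = math.min(workers(pos).coresFree, remaining)
      assigned(pos) = coresToUse
      remaining -= coresToUse
    }
    assigned
  }

  def main(args: Array[String]): Unit = {
    // Three workers with 10 free cores each; the app still needs 12 cores
    def freshWorkers = Array(WorkerSim("w1", 10), WorkerSim("w2", 10), WorkerSim("w3", 10))
    println("spreadOut: " + spreadOut(freshWorkers, 12).mkString(", ")) // 4, 4, 4
    println("packed:    " + packed(freshWorkers, 12).mkString(", "))    // 10, 2, 0
  }
}

With spark.deploy.spreadOut left at its default of true, the 12 cores end up spread across all three workers (4 each); with it set to false, the first worker is filled completely and the third is never touched.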
That wraps up our dive into the Master source code from these four angles. If you spot any problems in the above, please don't hesitate to point them out. Thanks!