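The earlier article "Spark Internals: Source-Level Implementation of Master High Availability (HA) Based on ZooKeeper" described in detail how Master HA is implemented with ZK. So how does the Master recover quickly after a failure?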
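Once a Master in the Standby state receives the ElectedLeader message sent by org.apache.spark.deploy.master.ZooKeeperLeaderElectionAgent, it starts failure recovery from the Application, Driver, and Worker metadata stored in ZK, and its state changes from RecoveryState.STANDBY to RecoveryState.RECOVERING. If there is no data to recover, the Master's state changes directly to RecoveryState.ALIVE and it starts serving requests.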
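The state change on election can be modeled in a few lines. This is only a sketch: the RecoveryState values mirror the ones named in this article, while the object name and the onElectedLeader helper are hypothetical and are not Spark's actual code.

```scala
// Illustrative model of the transition taken when a standby Master is elected leader.
// RecoveryStateSketch and onElectedLeader are hypothetical names used for this sketch.
object RecoveryStateSketch {
  object RecoveryState extends Enumeration {
    val STANDBY, ALIVE, RECOVERING, COMPLETING_RECOVERY = Value
  }

  /** Next state after election, given how many persisted entries were read back. */
  def onElectedLeader(storedApps: Int, storedDrivers: Int, storedWorkers: Int): RecoveryState.Value =
    if (storedApps == 0 && storedDrivers == 0 && storedWorkers == 0) {
      RecoveryState.ALIVE      // nothing to recover: start serving immediately
    } else {
      RecoveryState.RECOVERING // persisted metadata exists: recover it first
    }
}
```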
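On the one hand, the Master calls beginRecovery to start restoring state; on the other hand, it schedules a CompleteRecovery message to itself that fires after WORKER_TIMEOUT milliseconds: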
beginRecovery(storedApps, storedDrivers, storedWorkers)
recoveryCompletionTask = context.system.scheduler.scheduleOnce(WORKER_TIMEOUT millis, self, CompleteRecovery)
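The "send myself a message after a timeout" pattern can be tried without Akka. The sketch below swaps in a JDK ScheduledExecutorService for the actor system's scheduler, so it is an analogy rather than the Master's implementation; RecoveryTimeoutSketch and scheduleCompletion are hypothetical names.

```scala
import java.util.concurrent.{Executors, ScheduledFuture, TimeUnit}

// Stand-in for scheduleOnce(WORKER_TIMEOUT millis, self, CompleteRecovery):
// run a callback exactly once after a delay, and hand back a handle that can be cancelled.
object RecoveryTimeoutSketch {
  def scheduleCompletion(timeoutMillis: Long)(onTimeout: () => Unit): ScheduledFuture[_] = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = scheduler.schedule(
      new Runnable { def run(): Unit = onTimeout() },
      timeoutMillis,
      TimeUnit.MILLISECONDS)
    // With the JDK's default policy, a delayed task that is already scheduled
    // still runs after shutdown(), so the worker thread exits once it has fired.
    scheduler.shutdown()
    task
  }
}
```

The returned handle plays the role of recoveryCompletionTask: calling cancel(false) on it before the delay elapses prevents the callback from running.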
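First, let's look at how the persisted data is read back through the readPersistedData interface of the ZooKeeper persistence engine (ZooKeeperPersistenceEngine).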
override def readPersistedData(): (Seq[ApplicationInfo], Seq[DriverInfo], Seq[WorkerInfo]) = {
  val sortedFiles = zk.getChildren().forPath(WORKING_DIR).toList.sorted // list all persisted files
  val appFiles = sortedFiles.filter(_.startsWith("app_")) // serialized Application files
  val apps = appFiles.map(deserializeFromFile[ApplicationInfo]).flatten // deserialize the Application metadata
  val driverFiles = sortedFiles.filter(_.startsWith("driver_")) // serialized Driver files
  val drivers = driverFiles.map(deserializeFromFile[DriverInfo]).flatten // deserialize the Driver metadata
  val workerFiles = sortedFiles.filter(_.startsWith("worker_")) // serialized Worker files
  val workers = workerFiles.map(deserializeFromFile[WorkerInfo]).flatten // deserialize the Worker metadata
  (apps, drivers, workers)
}
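The prefix filtering and the map-then-flatten step can be exercised on plain strings. This is a minimal sketch: the node names in the usage note are made up, and decodeAll assumes a decoder that returns an Option, which is what the .flatten calls in the listing suggest but which this article does not show.

```scala
// Pure-Scala sketch of the grouping done by readPersistedData; no ZooKeeper involved.
// PersistedFilesSketch, splitByKind and decodeAll are hypothetical names.
object PersistedFilesSketch {
  /** Sort the child names, then split them into (app, driver, worker) groups by prefix. */
  def splitByKind(children: Seq[String]): (Seq[String], Seq[String], Seq[String]) = {
    val sorted = children.sorted
    (sorted.filter(_.startsWith("app_")),
     sorted.filter(_.startsWith("driver_")),
     sorted.filter(_.startsWith("worker_")))
  }

  /** Decode every entry and silently drop the ones whose decoder returns None. */
  def decodeAll[T](files: Seq[String])(decode: String => Option[T]): Seq[T] =
    files.map(decode).flatten
}
```

For example, splitByKind(Seq("worker_w1", "app_a2", "app_a1", "driver_d1", "other")) puts app_a1 and app_a2 (in that order) in the first group and ignores the entry named "other".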
The steps for recovering Applications and the steps for recovering Workers are both carried out in beginRecovery:
def beginRecovery(storedApps: Seq[ApplicationInfo], storedDrivers: Seq[DriverInfo],
    storedWorkers: Seq[WorkerInfo]) {
  for (app <- storedApps) { // recover the Applications one by one
    logInfo("Trying to recover app: " + app.id)
    try {
      registerApplication(app)
      app.state = ApplicationState.UNKNOWN
      // Tell the AppClient that the Master has changed; it replies with MasterChangeAcknowledged.
      app.driver ! MasterChanged(masterUrl, masterWebUiUrl)
    } catch {
      case e: Exception => logInfo("App " + app.id + " had exception on reconnect")
    }
  }
  for (driver <- storedDrivers) {
    // Here we just read in the list of drivers. Any drivers associated with now-lost workers
    // will be re-launched when we detect that the worker is missing.
    // Once a Worker is recovered, it reports the executors and drivers running on it,
    // which lets the Master rebuild its executor and driver information.
    drivers += driver
  }
  for (worker <- storedWorkers) { // recover the Workers one by one
    logInfo("Trying to recover worker: " + worker.id)
    try {
      registerWorker(worker) // re-register the Worker
      worker.state = WorkerState.UNKNOWN
      // Tell the Worker that the Master has changed; it replies with WorkerSchedulerStateResponse.
      worker.actor ! MasterChanged(masterUrl, masterWebUiUrl)
    } catch {
      case e: Exception => logInfo("Worker " + worker.id + " had exception on reconnect")
    }
  }
}
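The bookkeeping behind "mark everything UNKNOWN, then wait for replies" can be sketched as a small tracker. This is an assumed model rather than the Master's code: the Tracker class, its method names, and the two-value state are hypothetical, and the real Master uses richer Application and Worker states.

```scala
import scala.collection.mutable

// Sketch: every recovered app and worker starts as Unknown; each reply flips one entry,
// and recovery may complete as soon as nothing is Unknown any more.
object RecoveryAckSketch {
  sealed trait State
  case object Unknown extends State
  case object Responded extends State

  final class Tracker(appIds: Seq[String], workerIds: Seq[String]) {
    private val apps = mutable.Map.empty[String, State]
    private val workers = mutable.Map.empty[String, State]
    appIds.foreach(id => apps(id) = Unknown)
    workerIds.foreach(id => workers(id) = Unknown)

    /** True once no app and no worker is left in the Unknown state. */
    def canComplete: Boolean =
      apps.values.forall(_ != Unknown) && workers.values.forall(_ != Unknown)

    /** Record an AppClient acknowledgement; returns whether recovery can now complete. */
    def appAcked(id: String): Boolean = {
      if (apps.contains(id)) apps(id) = Responded
      canComplete
    }

    /** Record a Worker reply; returns whether recovery can now complete. */
    def workerResponded(id: String): Boolean = {
      if (workers.contains(id)) workers(id) = Responded
      canComplete
    }
  }
}
```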
// When this is called:
// 1. It is forcibly invoked 60s after recovery begins.
// 2. After each reply from an AppClient or a Worker, the Master checks whether every
//    Application and Worker has left the UNKNOWN state, and calls it if so.
def completeRecovery() {
  // Ensure "only-once" recovery semantics using a short synchronization period.
  synchronized {
    if (state != RecoveryState.RECOVERING) { return }
    state = RecoveryState.COMPLETING_RECOVERY
  }

  // Kill off any workers and apps that didn't respond to us,
  // i.e. remove the apps and workers that did not reply within 60s.
  workers.filter(_.state == WorkerState.UNKNOWN).foreach(removeWorker)
  apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication)

  // Reschedule drivers which were not claimed by any workers
  drivers.filter(_.worker.isEmpty).foreach { d => // a driver whose worker is empty is relaunched
    logWarning(s"Driver ${d.id} was not found after master recovery")
    if (d.desc.supervise) {
      logWarning(s"Re-launching ${d.id}")
      relaunchDriver(d)
    } else {
      removeDriver(d.id, DriverState.ERROR, None)
      logWarning(s"Did not re-launch ${d.id} because it was not supervised")
    }
  }

  state = RecoveryState.ALIVE
  schedule()
  logInfo("Recovery complete - resuming operations!")
}
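The driver handling at the end of completeRecovery reduces to a two-input decision, which is easy to check in isolation. The sketch below only restates that branch; DriverRecoverySketch, Action, and decide are hypothetical names and not part of Spark.

```scala
// Decision table for a recovered driver, following the branch in completeRecovery.
object DriverRecoverySketch {
  sealed trait Action
  case object Keep extends Action     // some worker claimed the driver: leave it alone
  case object Relaunch extends Action // unclaimed but supervised: start it again
  case object Remove extends Action   // unclaimed and not supervised: drop it with an error state

  def decide(claimedByWorker: Boolean, supervise: Boolean): Action =
    if (claimedByWorker) Keep
    else if (supervise) Relaunch
    else Remove
}
```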
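But is 60s a reasonable setting for a cluster with several thousand nodes? Then again, probably nobody deploys thousands of nodes in Standalone mode today. So the 60s value looks quite reasonable: the calls involved are all logically simple, and if some nodes have not replied within 60s, taking those machines offline is reasonable too.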
The timeout can be customized by setting spark.worker.timeout (in seconds; the default is 60).
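How such a setting might turn into the millisecond value used by the scheduler can be sketched with a plain Map standing in for the real configuration object. The assumptions here are that the value is given in seconds and multiplied by 1000, which matches the 60s default discussed above; WorkerTimeoutSketch and workerTimeoutMillis are hypothetical names.

```scala
// Sketch: resolve the recovery timeout from a key/value configuration.
// A plain Map is used in place of SparkConf so the example runs without Spark.
object WorkerTimeoutSketch {
  val DefaultTimeoutSeconds = 60L

  /** Timeout in milliseconds; falls back to 60s when the key is absent. */
  def workerTimeoutMillis(conf: Map[String, String]): Long =
    conf.get("spark.worker.timeout").map(_.toLong).getOrElse(DefaultTimeoutSeconds) * 1000
}
```

With this sketch, an empty configuration yields 60000 ms, and setting the key to "120" yields 120000 ms.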