This article is a set of reading notes on the book 《图解Spark核心技术与案例实战》 and gives a brief overview of several Spark runtime architectures.
Spark can run in several modes: local mode, standalone mode, cluster mode, YARN mode, Mesos mode, and so on. In every mode the main moving parts can be viewed as three components: SparkContext, Cluster Manager, and Executor. The SparkContext manages the execution of an Application; it talks to the Cluster Manager to request resources, schedule tasks, and monitor them. The Cluster Manager manages the cluster's resources, and the Executors run the tasks.
Which component plays the Cluster Manager role depends on the mode: in local, standalone, and cluster modes it is the Master, in YARN mode it is the ResourceManager, and in Mesos mode it is the Mesos Master.
The TaskScheduler converts the stages produced by the DAGScheduler into task sets and submits them through submitTasks() to the SchedulerBackend; on receiving the tasks, the SchedulerBackend uses reviveOffers() to allocate execution resources and launch them. The TaskScheduler is the bridge between the DAGScheduler and the SchedulerBackend: the DAGScheduler sits at a high level of abstraction while the SchedulerBackend deals with the underlying interfaces, so the TaskScheduler shields the upper layer from the differences between resource-allocation mechanisms. The main implementation of TaskScheduler is TaskSchedulerImpl; YARN mode provides two subclasses of TaskSchedulerImpl, YarnScheduler and YarnClusterScheduler.
The SchedulerBackend depends on the run mode: LocalBackend for local mode, CoarseGrainedSchedulerBackend for the coarse-grained modes, and MesosSchedulerBackend for the Mesos fine-grained mode. The coarse-grained backends are further divided into SparkDeploySchedulerBackend for standalone mode, YarnSchedulerBackend for YARN mode, and CoarseMesosSchedulerBackend for Mesos coarse-grained mode; YarnSchedulerBackend in turn splits into YarnClientSchedulerBackend and YarnClusterSchedulerBackend for the client and cluster deploy modes.
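To make the division of labor concrete, here is a deliberately simplified, hypothetical sketch of this layering; the Toy* names are invented for illustration and are not Spark's real interfaces, only the submitTasks/reviveOffers method names mirror the ones mentioned above:

// Hypothetical, heavily simplified model of the scheduling layers.
case class ToyTask(id: Int, body: () => Unit)

// Plays the role of a SchedulerBackend: knows how to obtain resources and
// actually run tasks (in one JVM, on standalone workers, on YARN, on Mesos...).
trait ToyBackend {
  def reviveOffers(tasks: Seq[ToyTask]): Unit
}

// Plays the role of the TaskScheduler: receives task sets from the DAG layer
// and hands them to whichever backend is plugged in, hiding the
// resource-allocation details from the layer above.
class ToyTaskScheduler(backend: ToyBackend) {
  def submitTasks(taskSet: Seq[ToyTask]): Unit = backend.reviveOffers(taskSet)
}

// A "local mode" backend that simply runs everything in the current thread.
object ToyLocalBackend extends ToyBackend {
  def reviveOffers(tasks: Seq[ToyTask]): Unit = tasks.foreach(_.body())
}

object ToySchedulingDemo extends App {
  val scheduler = new ToyTaskScheduler(ToyLocalBackend)
  scheduler.submitTasks(Seq(ToyTask(0, () => println("task 0")),
                            ToyTask(1, () => println("task 1"))))
}

Swapping ToyLocalBackend for another ToyBackend implementation would not change ToyTaskScheduler at all, which is exactly the kind of isolation the real TaskScheduler provides.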
Local mode is typically used for testing. In this mode all Spark processes run inside a single JVM, for example when a program is launched directly from IDEA. If nothing is configured, local mode is used by default; it can also be specified explicitly on the command line or in code:
Argument: --master local[4]
Code: SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]");
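For reference, the same thing in Scala (written as spark-shell/REPL-style statements to mirror the Java line above), together with the master URLs used by the other modes in this article; host names and ports are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// local[*] uses as many worker threads as there are cores; local[4] would use four.
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[*]")
// Other master URLs that appear later in this article (hosts/ports are placeholders):
//   spark://master-host:7077      standalone mode
//   yarn-client / yarn-cluster    YARN client / cluster mode
//   mesos://mesos-host:5050       Mesos mode
val sc = new SparkContext(conf)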
In local mode the SparkContext is started first. While starting, it initializes the DAGScheduler and starts the TaskSchedulerImpl; initializing the TaskSchedulerImpl starts the LocalBackend, which instantiates a LocalEndpoint, which in turn instantiates the Executor.
Call diagram of job execution in local mode:
The diagram omits some details and only highlights the more important components in the call chain. The overall flow is roughly as follows:
The main method runs and creates the SparkContext, DAGScheduler, TaskSchedulerImpl, and the other components; when the program reaches an action, the job starts to execute.
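As a concrete illustration of that flow, a minimal local-mode program could look like the following sketch (the numbers and names are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

object LocalModeDemo {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext sets up the DAGScheduler, TaskSchedulerImpl
    // and, in local mode, the LocalBackend described above.
    val conf = new SparkConf().setAppName("LocalModeDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Transformations only build up the RDD lineage; nothing runs yet.
    val squares = sc.parallelize(1 to 100).map(x => x * x)

    // The action is what triggers a job: the DAGScheduler splits it into stages,
    // TaskSchedulerImpl turns the stages into task sets, and the backend runs them.
    val total = squares.reduce(_ + _)
    println(s"sum of squares: $total")

    sc.stop()
  }
}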
Standalone mode uses Spark's own resource-management component rather than YARN or Mesos. It consists of a client, a Master node, and Worker nodes. The SparkContext may run on the local client or on the Master node: when a program is launched with run-example, the SparkContext runs on the Master node; when a job is submitted with spark-submit, the SparkContext runs on the client that submitted it.
Each Worker node uses ExecutorRunner to launch a number of CoarseGrainedExecutorBackend processes; each process contains one Executor, which holds a thread pool used to schedule and run tasks.
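As a rough mental model of what each CoarseGrainedExecutorBackend process holds, here is a hypothetical, stripped-down executor with a task thread pool; ToyExecutor is invented for illustration and is not Spark's real Executor class:

import java.util.concurrent.Executors

// Stand-in for the Executor inside a CoarseGrainedExecutorBackend process:
// a fixed-size thread pool that runs whatever tasks the driver sends over.
class ToyExecutor(cores: Int) {
  private val pool = Executors.newFixedThreadPool(cores)

  def launchTask(taskId: Long, body: () => Unit): Unit =
    pool.submit(new Runnable {
      override def run(): Unit = {
        println(s"running task $taskId")
        body()
      }
    })

  def stop(): Unit = pool.shutdown()
}

Back to the real code: in standalone mode the backend starts by registering the application with the Master through an AppClient: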
// SparkDeploySchedulerBackend.start() in standalone mode
override def start() {
super.start()
……
client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf)
client.start()
waitForRegistration()
}
// CoarseGrainedSchedulerBackend.start(), invoked via super.start() above
override def start() {
val properties = new ArrayBuffer[(String, String)]
for ((key, value) <- scheduler.sc.conf.getAll) {
if (key.startsWith("spark.")) {
properties += ((key, value))
}
}
// TODO (prashant) send conf instead of properties
  // create the DriverEndpoint
driverEndpoint = rpcEnv.setupEndpoint(
CoarseGrainedSchedulerBackend.ENDPOINT_NAME, new DriverEndpoint(rpcEnv, properties))
}
AppClient has an inner class, ClientActor. AppClient's start method creates this ClientActor; its lifecycle method then fires, registerWithMaster is called, and a registration request is sent to the Master.
def tryRegisterAllMasters() {
for (masterAkkaUrl <- masterAkkaUrls) {
logInfo("Connecting to master " + masterAkkaUrl + "...")
val actor = context.actorSelection(masterAkkaUrl)
    // Akka's ! (tell) sends the RegisterApplication message to the Master
actor ! RegisterApplication(appDescription)
}
}
override def receiveWithLogging: PartialFunction[Any, Unit] = {
case RegisterApplication(description) => {
if (state == RecoveryState.STANDBY) {
// ignore, don't send response
} else {
logInfo("Registering app " + description.name)
      // create the ApplicationInfo
val app = createApplication(description, sender)
      // register the application
registerApplication(app)
logInfo("Registered app " + description.name + " with ID " + app.id)
persistenceEngine.addApplication(app)
      // reply to the AppClient
sender ! RegisteredApplication(app.id, masterUrl)
      // schedule resources: launch drivers and executors on workers
schedule()
}
}
/**
* Schedule the currently available resources among waiting apps. This method will be called
* every time a new app joins or resource availability changes.
*/
private def schedule(): Unit = {
if (state != RecoveryState.ALIVE) { return }
// Drivers take strict precedence over executors
val shuffledWorkers = Random.shuffle(workers) // Randomization helps balance drivers
  // iterate over all alive workers
for (worker <- shuffledWorkers if worker.state == WorkerState.ALIVE) {
    // iterate over all waiting drivers
for (driver <- waitingDrivers) {
      // if the worker's free resources can satisfy the driver's requirements
if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // launch the driver on this worker
launchDriver(worker, driver)
waitingDrivers -= driver
}
}
}
  // launch executors on the workers
startExecutorsOnWorkers()
}
The launchDriver call above sends a LaunchDriver message to the worker, which tries to start the driver when it receives it, while startExecutorsOnWorkers sends LaunchExecutor messages to the selected workers, which then start the Executors.
override def receiveWithLogging: PartialFunction[Any, Unit] = {
  // launch an executor
case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
if (masterUrl != activeMasterUrl) {
logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
} else {
try {
logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))
// Create the executor's working directory
val executorDir = new File(workDir, appId + "/" + execId)
if (!executorDir.mkdirs()) {
throw new IOException("Failed to create directory " + executorDir)
}
// Create local dirs for the executor. These are passed to the executor via the
// SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
// application finishes.
val appLocalDirs = appDirectories.get(appId).getOrElse {
Utils.getOrCreateLocalRootDirs(conf).map { dir =>
Utils.createDirectory(dir, namePrefix = "executor").getAbsolutePath()
}.toSeq
}
appDirectories(appId) = appLocalDirs
        // create the ExecutorRunner
val manager = new ExecutorRunner(
appId,
execId,
appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
cores_,
memory_,
self,
workerId,
host,
webUi.boundPort,
publicAddress,
sparkHome,
executorDir,
akkaUrl,
conf,
appLocalDirs, ExecutorState.LOADING)
executors(appId + "/" + execId) = manager
manager.start()
coresUsed += cores_
memoryUsed += memory_
master ! ExecutorStateChanged(appId, execId, manager.state, None, None)
} catch {
……
}
  // launch a driver
case LaunchDriver(driverId, driverDesc) => {
logInfo(s"Asked to launch driver $driverId")
val driver = new DriverRunner(
conf,
driverId,
workDir,
sparkHome,
driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
self,
akkaUrl,
securityMgr)
drivers(driverId) = driver
driver.start()
coresUsed += driverDesc.cores
memoryUsed += driverDesc.mem
}
case ReregisterWithMaster =>
reregisterWithMaster()
case ApplicationFinished(id) =>
finishedApps += id
maybeCleanupApplication(id)
}
One thing not covered above: in its preStart, the Worker calls registerWithMaster to register itself with the Master.
Moving on to the YARN modes: first, a brief look at YARN's runtime architecture.
The ResourceManager is responsible for allocating resources. The basic unit of allocation is the container: memory, disk, CPU, and other resources are packaged into containers, and tasks can only run inside containers. The NodeManager starts the containers an Application needs, monitors the resource usage on its node, and communicates with the ResourceManager. The ApplicationMaster is tied to a specific Application; it negotiates with the ResourceManager to obtain suitable containers and monitors their status.
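On the Spark side, the container size requested from YARN is driven mainly by the executor settings; a REPL-style sketch (the values are arbitrary):

import org.apache.spark.SparkConf

// Each executor runs in one YARN container, so these settings roughly
// determine the size and number of containers the ApplicationMaster requests
// (YARN adds some memory overhead on top of spark.executor.memory).
val conf = new SparkConf()
  .set("spark.executor.memory", "2g")    // memory per executor / container
  .set("spark.executor.cores", "2")      // vcores per executor / container
  .set("spark.executor.instances", "4")  // how many executors to ask for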
Note that the code for this part lives in the spark-yarn module.
The startup sequence is the same as before; in this mode YarnClientSchedulerBackend is used as the SchedulerBackend, and the Application registration then starts from YarnClientSchedulerBackend:
/**
* Submit an application running our ApplicationMaster to the ResourceManager.
 * Submits the application to the ResourceManager and starts the ApplicationMaster.
* The stable Yarn API provides a convenience method (YarnClient#createApplication) for
* creating applications and setting up the application submission context. This was not
* available in the alpha API.
*/
def submitApplication(): ApplicationId = {
var appId: ApplicationId = null
try {
launcherBackend.connect()
// Setup the credentials before doing anything else,
// so we have don't have issues at any point.
setupCredentials()
    // initialize yarnClient so we can talk to the YARN cluster
yarnClient.init(yarnConf)
yarnClient.start()
logInfo("Requesting a new application from cluster with %d NodeManagers"
.format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))
// Get a new application from our RM
    // request a new application ID from the ResourceManager
val newApp = yarnClient.createApplication()
val newAppResponse = newApp.getNewApplicationResponse()
appId = newAppResponse.getApplicationId()
reportLauncherState(SparkAppHandle.State.SUBMITTED)
launcherBackend.setAppId(appId.toString)
// Verify whether the cluster has enough resources for our AM
    // verify that the cluster has enough resources to start the ApplicationMaster
verifyClusterResources(newAppResponse)
// Set up the appropriate contexts to launch our AM
val containerContext = createContainerLaunchContext(newAppResponse)
val appContext = createApplicationSubmissionContext(newApp, containerContext)
// Finally, submit and monitor the application
    // submit the application to the cluster
logInfo(s"Submitting application $appId to ResourceManager")
yarnClient.submitApplication(appContext)
appId
} catch {
case e: Throwable =>
if (appId != null) {
cleanupStagingDir(appId)
}
throw e
}
}
// inside createContainerLaunchContext(newAppResponse), called above:
val amClass =
if (isClusterMode) {
Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
} else {
Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
}
The ResourceManager's code lives in YARN itself and is not included here. After receiving the request, the ResourceManager picks a suitable NodeManager and starts the ExecutorLauncher, whose main method starts the ApplicationMaster.
Once started, the ApplicationMaster registers itself with the DriverEndpoint through the registerAM method and then calls YarnAllocator's allocateResources method to request resources:
private def registerAM(
_rpcEnv: RpcEnv,
driverRef: RpcEndpointRef,
uiAddress: String,
securityMgr: SecurityManager) = {
val sc = sparkContextRef.get()
val appId = client.getAttemptId().getApplicationId().toString()
val attemptId = client.getAttemptId().getAttemptId().toString()
val historyAddress =
sparkConf.get(HISTORY_SERVER_ADDRESS)
.map { text => SparkHadoopUtil.get.substituteHadoopVariables(text, yarnConf) }
.map { address => s"${address}${HistoryServer.UI_PATH_PREFIX}/${appId}/${attemptId}" }
.getOrElse("")
val _sparkConf = if (sc != null) sc.getConf else sparkConf
val driverUrl = RpcEndpointAddress(
_sparkConf.get("spark.driver.host"),
_sparkConf.get("spark.driver.port").toInt,
CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
  // tell the driver that the ApplicationMaster has started
allocator = client.register(driverUrl,
driverRef,
yarnConf,
_sparkConf,
uiAddress,
historyAddress,
securityMgr,
localResources)
  // allocate resources
allocator.allocateResources()
reporterThread = launchReporterThread()
}
The register call above completes the AM registration, and allocator.allocateResources() performs the resource allocation.
/**
* Request resources such that, if YARN gives us all we ask for, we'll have a number of containers
* equal to maxExecutors.
*
* Deal with any containers YARN has granted to us by possibly launching executors in them.
*
* This must be synchronized because variables read in this method are mutated by other methods.
*/
def allocateResources(): Unit = synchronized {
updateResourceRequests()
val progressIndicator = 0.1f
// Poll the ResourceManager. This doubles as a heartbeat if there are no pending container
// requests.
val allocateResponse = amClient.allocate(progressIndicator)
val allocatedContainers = allocateResponse.getAllocatedContainers()
if (allocatedContainers.size > 0) {
logDebug("Allocated containers: %d. Current executor count: %d. Cluster resources: %s."
.format(
allocatedContainers.size,
numExecutorsRunning,
allocateResponse.getAvailableResources))
    // handle the allocated containers: launch executors in them
handleAllocatedContainers(allocatedContainers.asScala)
}
val completedContainers = allocateResponse.getCompletedContainersStatuses()
if (completedContainers.size > 0) {
logDebug("Completed %d containers".format(completedContainers.size))
processCompletedContainers(completedContainers.asScala)
logDebug("Finished processing %d completed containers. Current running executor count: %d."
.format(completedContainers.size, numExecutorsRunning))
}
}
handleAllocatedContainers above calls runAllocatedContainers(), which launches the Executors:
/**
* Launches executors in the allocated containers.
*/
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
for (container <- containersToUse) {
executorIdCounter += 1
val executorHostname = container.getNodeId.getHost
val containerId = container.getId
val executorId = executorIdCounter.toString
assert(container.getResource.getMemory >= resource.getMemory)
logInfo("Launching container %s for on host %s".format(containerId, executorHostname))
def updateInternalState(): Unit = synchronized {
numExecutorsRunning += 1
assert(numExecutorsRunning <= targetNumExecutors)
executorIdToContainer(executorId) = container
containerIdToExecutorId(container.getId) = executorId
      // record the newly allocated container
val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
new HashSet[ContainerId])
containerSet += containerId
allocatedContainerToHostMap.put(containerId, executorHostname)
}
if (launchContainers) {
logInfo("Launching ExecutorRunnable. driverUrl: %s, executorHostname: %s".format(
driverUrl, executorHostname))
      // launch the executor
launcherPool.execute(new Runnable {
override def run(): Unit = {
try {
new ExecutorRunnable(
container,
conf,
sparkConf,
driverUrl,
executorId,
executorHostname,
executorMemory,
executorCores,
appAttemptId.getApplicationId.toString,
securityMgr,
localResources
).run()
updateInternalState()
} catch {
case NonFatal(e) =>
logError(s"Failed to launch executor $executorId on container $containerId", e)
// Assigned container should be released immediately to avoid unnecessary resource
// occupation.
amClient.releaseAssignedContainer(containerId)
}
}
})
} else {
// For test only
updateInternalState()
}
}
}
How does ExecutorRunnable start the Executor? A look at ExecutorRunnable's code makes it clear:
val commands = prefixEnv ++ Seq(
YarnSparkHadoopUtil.expandEnvironment(Environment.JAVA_HOME) + "/bin/java",
"-server") ++
javaOpts ++
Seq("org.apache.spark.executor.CoarseGrainedExecutorBackend",
"--driver-url", masterAddress.toString,
"--executor-id", slaveId.toString,
"--hostname", hostname.toString,
"--cores", executorCores.toString,
"--app-id", appId) ++
userClassPath ++
Seq(
"1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
"2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")
In yarn-cluster mode, the client submits the command for starting the ApplicationMaster together with the Application. After receiving the Application, the ResourceManager first picks a NodeManager and starts an ApplicationMaster on it, and that ApplicationMaster then carries out the subsequent SparkContext initialization.
In yarn-cluster mode, SparkContext's createTaskScheduler first matches YarnClusterScheduler as the TaskScheduler and YarnClusterSchedulerBackend as the SchedulerBackend:
case "yarn-standalone" | "yarn-cluster" =>
if (master == "yarn-standalone") {
logWarning(
"\"yarn-standalone\" is deprecated as of Spark 1.0. Use \"yarn-cluster\" instead.")
}
val scheduler = try {
val clazz = Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
val cons = clazz.getConstructor(classOf[SparkContext])
cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
} catch {
……
}
val backend = try {
val clazz =
Class.forName("org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend")
val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
} catch {
……
}
scheduler.initialize(backend)
(backend, scheduler)
A program submitted through spark-submit is executed by SparkSubmit's runMain. runMain() invokes the main method of the user-submitted program via reflection, but in yarn-cluster mode this mainClass is replaced:
// In yarn-cluster mode, use yarn.Client as a wrapper around the user class
if (isYarnCluster) {
childMainClass = "org.apache.spark.deploy.yarn.Client"
So what actually runs here is:
def main(argStrings: Array[String]) {
if (!sys.props.contains("SPARK_SUBMIT")) {
logWarning("WARNING: This client is deprecated and will be removed in a " +
"future version of Spark. Use ./bin/spark-submit with \"--master yarn\"")
}
// Set an env variable indicating we are running in YARN mode.
// Note that any env variable with the SPARK_ prefix gets propagated to all (remote) processes
System.setProperty("SPARK_YARN_MODE", "true")
val sparkConf = new SparkConf
val args = new ClientArguments(argStrings)
new Client(args, sparkConf).run()
}
Here a Client object is created first and its run method is called. run invokes the same submitApplication shown earlier for client mode, so the flow again goes through yarnClient.createApplication() and yarnClient.submitApplication(), which asks the ResourceManager to start the ApplicationMaster. In cluster mode the class launched by the ResourceManager is ApplicationMaster itself; in client mode it is ExecutorLauncher, whose main simply delegates to ApplicationMaster's main:
/**
* This object does not provide any special functionality. It exists so that it's easy to tell
* apart the client-mode AM from the cluster-mode AM when using tools such as ps or jps.
*/
object ExecutorLauncher {
def main(args: Array[String]): Unit = {
ApplicationMaster.main(args)
}
}
ApplicationMaster.main in turn calls ApplicationMaster.run:
def main(args: Array[String]): Unit = {
SignalUtils.registerLogger(log)
val amArgs = new ApplicationMasterArguments(args)
// Load the properties file with the Spark configuration and set entries as system properties,
// so that user code run inside the AM also has access to them.
// Note: we must do this before SparkHadoopUtil instantiated
if (amArgs.propertiesFile != null) {
Utils.getPropertiesFromFile(amArgs.propertiesFile).foreach { case (k, v) =>
sys.props(k) = v
}
}
SparkHadoopUtil.get.runAsSparkUser { () =>
master = new ApplicationMaster(amArgs, new YarnRMClient)
    // call the run method
System.exit(master.run())
}
}
Execution eventually reaches ApplicationMaster's run method, which checks whether it is running in cluster mode and then either runs the driver or runs the ExecutorLauncher logic:
if (isClusterMode) {
runDriver(securityMgr)
} else {
runExecutorLauncher(securityMgr)
}
runDriver above calls startUserApplication() to start executing the user-submitted program:
private def runDriver(securityMgr: SecurityManager): Unit = {
addAmIpFilter()
  // start running the user program
userClassThread = startUserApplication()
// This a bit hacky, but we need to wait until the spark.driver.port property has
// been set by the Thread executing the user class.
val sc = waitForSparkContextInitialized()
// If there is no SparkContext at this point, just fail the app.
if (sc == null) {
finish(FinalApplicationStatus.FAILED,
ApplicationMaster.EXIT_SC_NOT_INITED,
"Timed out waiting for SparkContext.")
} else {
rpcEnv = sc.env.rpcEnv
val driverRef = runAMEndpoint(
sc.getConf.get("spark.driver.host"),
sc.getConf.get("spark.driver.port"),
isClusterMode = true)
    // tell the driver that the ApplicationMaster started successfully;
    // then allocate containers and launch executors, just as in client mode
registerAM(rpcEnv, driverRef, sc.ui.map(_.appUIAddress).getOrElse(""), securityMgr)
userClassThread.join()
}
}
So during SparkContext initialization it is YarnClusterScheduler.start() that gets called; this class does not override start, so TaskSchedulerImpl's start runs and then calls YarnClusterSchedulerBackend.start(), which is much simpler than the client version:
override def start() {
val attemptId = ApplicationMaster.getAttemptId
bindToYarn(attemptId.getApplicationId(), Some(attemptId))
super.start()
totalExpectedExecutors = YarnSparkHadoopUtil.getInitialTargetExecutorNumber(sc.conf)
}
The difference between yarn-client and yarn-cluster is essentially the difference between their ApplicationMasters:
In yarn-client mode the ApplicationMaster is only responsible for requesting containers from the ResourceManager; the client communicates with the containers and schedules job execution itself, so the client must stay alive for the lifetime of the application.
In yarn-cluster mode the Driver runs inside the ApplicationMaster, so the user does not have to stay online after submitting the job.
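In the Spark versions this code comes from, the two deploy modes are selected through the master string (normally passed via spark-submit's --master option rather than hard-coded); a REPL-style sketch:

import org.apache.spark.SparkConf

// yarn-client: the driver runs in the submitting JVM and must stay alive;
// the ApplicationMaster (ExecutorLauncher) only negotiates containers.
val clientConf = new SparkConf().setAppName("demo").setMaster("yarn-client")

// yarn-cluster: the driver runs inside the ApplicationMaster on the cluster,
// so the submitting process may exit once the application is accepted.
val clusterConf = new SparkConf().setAppName("demo").setMaster("yarn-cluster")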
Mesos mode comes in two flavors, Mesos coarse-grained mode and Mesos fine-grained mode. Mesos uses ZooKeeper to avoid a single point of failure. Mesos consists of four components: the Mesos master, the Mesos slaves, frameworks, and executor containers. The architecture is shown in the figure:
The Mesos slaves manage the resources on each node; the Mesos master connects the Mesos slaves with the registered application frameworks and allocates the slaves' resources to the frameworks according to some policy.
Frameworks are distributed computing frameworks such as Hadoop and Spark. To give different frameworks a uniform way to plug in, Mesos provides MesosSchedulerDriver: a framework's scheduler uses the MesosSchedulerDriver interface to communicate with the Mesos master and request compute resources. MesosSchedulerDriver lives in the org.apache.mesos package, i.e. it is an implementation class provided by Mesos itself; it can be used directly, does not need to be subclassed, and is not called back by the framework.
Executor containers run the compute tasks. To accommodate the different ways frameworks launch tasks, Mesos provides MesosExecutorDriver; a framework that wants to plug into Mesos must extend Mesos's Executor class and override its task-control methods. This is the strategy pattern.
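To illustrate this callback (strategy) pattern without pulling in the real Mesos API, here is a hypothetical, heavily simplified model; the Toy* names are invented, and the real interfaces in org.apache.mesos have many more callbacks:

// The framework implements this and plugs it into the driver.
trait ToyScheduler {
  def registered(frameworkId: String): Unit
  def resourceOffers(offers: Seq[(String, Int)]): Unit  // (slaveId, cpus)
}

// Stand-in for MesosSchedulerDriver: provided by "Mesos", it calls the
// framework back when events happen instead of being subclassed by it.
class ToySchedulerDriver(scheduler: ToyScheduler) {
  def run(): Unit = {
    scheduler.registered("framework-1")
    scheduler.resourceOffers(Seq(("slave-1", 4), ("slave-2", 2)))
  }
}

object ToyMesosDemo extends App {
  val driver = new ToySchedulerDriver(new ToyScheduler {
    def registered(id: String): Unit = println(s"registered as $id")
    def resourceOffers(offers: Seq[(String, Int)]): Unit =
      offers.foreach { case (slave, cpus) => println(s"offer: $cpus cpus on $slave") }
  })
  driver.run()
}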
The SchedulerBackend used in Mesos coarse-grained mode is CoarseMesosSchedulerBackend, which not only extends CoarseGrainedSchedulerBackend but also implements MScheduler, a callback interface. In its start method, CoarseMesosSchedulerBackend instantiates a MesosSchedulerDriver, which plays a role similar to the YARN Client above but is largely backed by native methods; its constructor takes an MScheduler, and when events occur it calls back into the MScheduler's methods so the framework can react.
case mesosUrl @ MESOS_REGEX(_) =>
MesosNativeLibrary.load()
val scheduler = new TaskSchedulerImpl(sc)
val coarseGrained = sc.conf.getBoolean("spark.mesos.coarse", false)
val url = mesosUrl.stripPrefix("mesos://") // strip scheme from raw Mesos URLs
val backend = if (coarseGrained) {
new CoarseMesosSchedulerBackend(scheduler, sc, url)
} else {
new MesosSchedulerBackend(scheduler, sc, url)
}
scheduler.initialize(backend)
(backend, scheduler)
override def start() {
super.start()
val fwInfo = FrameworkInfo.newBuilder().setUser(sc.sparkUser).setName(sc.appName).build()
startScheduler(master, CoarseMesosSchedulerBackend.this, fwInfo)
}
/**
* Starts the MesosSchedulerDriver with the provided information. This method returns
* only after the scheduler has registered with Mesos.
* @param masterUrl Mesos master connection URL
* @param scheduler Scheduler object
* @param fwInfo FrameworkInfo to pass to the Mesos master
*/
def startScheduler(masterUrl: String, scheduler: Scheduler, fwInfo: FrameworkInfo): Unit = {
synchronized {
if (mesosDriver != null) {
registerLatch.await()
return
}
new Thread(Utils.getFormattedClassName(this) + "-mesos-driver") {
setDaemon(true)
override def run() {
mesosDriver = new MesosSchedulerDriver(scheduler, fwInfo, masterUrl)
try {
val ret = mesosDriver.run()
logInfo("driver.run() returned with code " + ret)
if (ret.equals(Status.DRIVER_ABORTED)) {
System.exit(1)
}
} catch {
case e: Exception => {
logError("driver.run() failed", e)
System.exit(1)
}
}
}
}.start()
registerLatch.await()
}
}
override def registered(d: SchedulerDriver, frameworkId: FrameworkID, masterInfo: MasterInfo) {
appId = frameworkId.getValue
logInfo("Registered as framework ID " + appId)
markRegistered()
}
After the registration completes, the Mesos master calls reviveOffers; this ends up invoking CoarseMesosSchedulerBackend.resourceOffers on the Spark side and, later, CoarseGrainedExecutorBackend's onStart.
Let's look at resourceOffers first:
override def resourceOffers(d: SchedulerDriver, offers: JList[Offer]) {
synchronized {
val filters = Filters.newBuilder().setRefuseSeconds(5).build()
    // iterate over all resource offers
for (offer <- offers) {
      // get the slave id
val slaveId = offer.getSlaveId.toString
      // get the offered memory
val mem = getResource(offer.getResourcesList, "mem")
      // get the number of offered CPUs
val cpus = getResource(offer.getResourcesList, "cpus").toInt
      // if the offer has enough resources
if (totalCoresAcquired < maxCores &&
mem >= MemoryUtils.calculateTotalMemory(sc) &&
cpus >= 1 &&
failuresBySlaveId.getOrElse(slaveId, 0) < MAX_SLAVE_FAILURES &&
!slaveIdsWithExecutors.contains(slaveId)) {
// Launch an executor on the slave
val cpusToUse = math.min(cpus, maxCores - totalCoresAcquired)
totalCoresAcquired += cpusToUse
val taskId = newMesosTaskId()
taskIdToSlaveId(taskId) = slaveId
slaveIdsWithExecutors += slaveId
coresByTaskId(taskId) = cpusToUse
val task = MesosTaskInfo.newBuilder()
.setTaskId(TaskID.newBuilder().setValue(taskId.toString).build())
.setSlaveId(offer.getSlaveId)
.setCommand(createCommand(offer, cpusToUse + extraCoresPerSlave))
.setName("Task " + taskId)
.addResources(createResource("cpus", cpusToUse))
.addResources(createResource("mem",
MemoryUtils.calculateTotalMemory(sc)))
sc.conf.getOption("spark.mesos.executor.docker.image").foreach { image =>
MesosSchedulerBackendUtil
.setupContainerBuilderDockerInfo(image, sc.conf, task.getContainerBuilder())
}
d.launchTasks(
Collections.singleton(offer.getId), Collections.singletonList(task.build()), filters)
} else {
// Filter it out
d.launchTasks(
Collections.singleton(offer.getId), Collections.emptyList[MesosTaskInfo](), filters)
}
}
}
}
CoarseGrainedExecutorBackend is started through an external command; the details are in createCommand:
if (uri.isEmpty) {
val runScript = new File(executorSparkHome, "./bin/spark-class").getCanonicalPath
command.setValue(
"%s \"%s\" org.apache.spark.executor.CoarseGrainedExecutorBackend"
.format(prefixEnv, runScript) +
s" --driver-url $driverUrl" +
s" --executor-id ${offer.getSlaveId.getValue}" +
s" --hostname ${offer.getHostname}" +
s" --cores $numCores" +
s" --app-id $appId")
} else {
// Grab everything to the first '.'. We'll use that and '*' to
// glob the directory "correctly".
val basename = uri.get.split('/').last.split('.').head
command.setValue(
s"cd $basename*; $prefixEnv " +
"./bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend" +
s" --driver-url $driverUrl" +
s" --executor-id ${offer.getSlaveId.getValue}" +
s" --hostname ${offer.getHostname}" +
s" --cores $numCores" +
s" --app-id $appId")
command.addUris(CommandInfo.URI.newBuilder().setValue(uri.get))
}
This command passes the driverUrl to CoarseGrainedExecutorBackend; what then runs is CoarseGrainedExecutorBackend's main method. After starting, CoarseGrainedExecutorBackend first communicates with the Driver and then starts the Executor.
The "fine" in Mesos fine-grained mode means that Mesos takes part in the execution of individual tasks: resources are no longer managed per Application but per task. In coarse-grained mode, once the Driver has obtained containers from the Mesos master it talks to the Executors directly, and Spark manages the execution of individual tasks inside the containers Mesos allocated. In fine-grained mode, Spark sends the tasks to the Mesos master through the ExecutorBackend; the Mesos master schedules Mesos slaves to execute them, the slaves report execution status back to the Mesos master, and the Mesos master reports it back to the Driver.
The main implementation difference between the two Mesos modes is the SchedulerBackend: fine-grained mode uses MesosSchedulerBackend. Its startup and resource-request flow is similar to the coarse-grained case; the main change is in resourceOffers. The details follow:
override def start() {
val fwInfo = FrameworkInfo.newBuilder().setUser(sc.sparkUser).setName(sc.appName).build()
classLoader = Thread.currentThread.getContextClassLoader
  // start the MesosSchedulerDriver
startScheduler(master, MesosSchedulerBackend.this, fwInfo)
}
def createExecutorInfo(execId: String): MesosExecutorInfo = {
val executorSparkHome = sc.conf.getOption("spark.mesos.executor.home")
.orElse(sc.getSparkHome()) // Fall back to driver Spark home for backward compatibility
.getOrElse {
throw new SparkException("Executor Spark home `spark.mesos.executor.home` is not set!")
}
val environment = Environment.newBuilder()
sc.conf.getOption("spark.executor.extraClassPath").foreach { cp =>
environment.addVariables(
Environment.Variable.newBuilder().setName("SPARK_CLASSPATH").setValue(cp).build())
}
val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions").getOrElse("")
val prefixEnv = sc.conf.getOption("spark.executor.extraLibraryPath").map { p =>
Utils.libraryPathEnvPrefix(Seq(p))
}.getOrElse("")
environment.addVariables(
Environment.Variable.newBuilder()
.setName("SPARK_EXECUTOR_OPTS")
.setValue(extraJavaOpts)
.build())
sc.executorEnvs.foreach { case (key, value) =>
environment.addVariables(Environment.Variable.newBuilder()
.setName(key)
.setValue(value)
.build())
}
val command = CommandInfo.newBuilder()
.setEnvironment(environment)
val uri = sc.conf.getOption("spark.executor.uri")
.orElse(Option(System.getenv("SPARK_EXECUTOR_URI")))
val executorBackendName = classOf[MesosExecutorBackend].getName
if (uri.isEmpty) {
val executorPath = new File(executorSparkHome, "/bin/spark-class").getCanonicalPath
command.setValue(s"$prefixEnv $executorPath $executorBackendName")
} else {
// Grab everything to the first '.'. We'll use that and '*' to
// glob the directory "correctly".
val basename = uri.get.split('/').last.split('.').head
command.setValue(s"cd ${basename}*; $prefixEnv ./bin/spark-class $executorBackendName")
command.addUris(CommandInfo.URI.newBuilder().setValue(uri.get))
}
val cpus = Resource.newBuilder()
.setName("cpus")
.setType(Value.Type.SCALAR)
.setScalar(Value.Scalar.newBuilder()
.setValue(mesosExecutorCores).build())
.build()
val memory = Resource.newBuilder()
.setName("mem")
.setType(Value.Type.SCALAR)
.setScalar(
Value.Scalar.newBuilder()
.setValue(MemoryUtils.calculateTotalMemory(sc)).build())
.build()
val executorInfo = MesosExecutorInfo.newBuilder()
.setExecutorId(ExecutorID.newBuilder().setValue(execId).build())
.setCommand(command)
.setData(ByteString.copyFrom(createExecArg()))
.addResources(cpus)
.addResources(memory)
sc.conf.getOption("spark.mesos.executor.docker.image").foreach { image =>
MesosSchedulerBackendUtil
.setupContainerBuilderDockerInfo(image, sc.conf, executorInfo.getContainerBuilder())
}
executorInfo.build()
}
MesosExecutorBackend implements the Mesos Executor interface as well as Spark's ExecutorBackend; the Executor is created in its registered callback:
// MesosExecutorBackend.scala
override def registered(
driver: ExecutorDriver,
executorInfo: ExecutorInfo,
frameworkInfo: FrameworkInfo,
slaveInfo: SlaveInfo) {
……
executor = new Executor(
executorId,
slaveInfo.getHostname,
env)
}
// snippet from MesosSchedulerBackend.scala
mesosTasks.foreach { case (slaveId, tasks) =>
slaveIdToWorkerOffer.get(slaveId).foreach(o =>
listenerBus.post(SparkListenerExecutorAdded(System.currentTimeMillis(), slaveId,
// TODO: Add support for log urls for Mesos
new ExecutorInfo(o.host, o.cores, Map.empty)))
)
  // launch the tasks through the Mesos master
d.launchTasks(Collections.singleton(slaveIdToOffer(slaveId).getId), tasks, filters)
}
The coarse-grained Mesos mode works much like the other frameworks above: the Mesos master starts resource containers on the slaves and hands the resources back to the Driver as offers, and the Driver then schedules tasks on those resources. Once allocated, the resources cannot be changed, and allocation is done per Application.
In fine-grained mode the tasks themselves are shipped to the Mesos master: the Driver interacts only with the Mesos master, and the tasks are scheduled and executed by Mesos, which requests resources and schedules execution at the granularity of individual tasks.
Scheduling at task granularity improves resource utilization, but it adds scheduling overhead and increases the load on the Mesos master.
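Which of the two Mesos modes is used is controlled by a single setting, as the backend-selection code earlier in this article showed; a REPL-style sketch (the host is a placeholder):

import org.apache.spark.SparkConf

// spark.mesos.coarse picks the SchedulerBackend in createTaskScheduler:
// true  -> CoarseMesosSchedulerBackend (coarse-grained mode)
// false -> MesosSchedulerBackend (fine-grained mode, the default in the code above)
val conf = new SparkConf()
  .setMaster("mesos://mesos-master:5050")  // placeholder host
  .set("spark.mesos.coarse", "false")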