Lesson 35: Tracing the Internal Execution Loop of the Spark Runtime
Spark's DAGScheduler divides an entire Job into Stages. Stages are determined from the last Stage backward, but executed from front to back. Each Stage contains a set of tasks that run in parallel; the tasks within a Stage execute identical logic over different partitions of data. The DAGScheduler packages all the tasks of each Stage built from the DAG into a TaskSet and submits it to the low-level scheduler, the TaskScheduler. TaskScheduler is an interface that decouples scheduling from any concrete deployment, so it can run under different scheduling modes, for example Standalone or YARN.
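For example, a minimal word-count sketch shows where the DAGScheduler cuts the DAG into Stages: the narrow transformations before the shuffle introduced by reduceByKey are pipelined into one Stage, and the shuffle boundary starts the next Stage, whose parallel tasks all run the same logic over different data partitions. The application name and HDFS paths below are hypothetical, and the master URL is assumed to be supplied by spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

object StageSplitSketch {
  def main(args: Array[String]): Unit = {
    // The master URL is expected from spark-submit; the paths are hypothetical.
    val sc = new SparkContext(new SparkConf().setAppName("StageSplitSketch"))
    // Stage 0: textFile -> flatMap -> map are narrow dependencies, pipelined into one Stage.
    val pairs = sc.textFile("hdfs://namenode:9000/input")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
    // reduceByKey introduces a shuffle dependency, so the DAG is cut and a new Stage begins here.
    val counts = pairs.reduceByKey(_ + _)
    // The action triggers the DAGScheduler, which submits each Stage's tasks as a TaskSet.
    counts.saveAsTextFile("hdfs://namenode:9000/output")
    sc.stop()
  }
}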
Spark's basic scheduling picture involves RDDs, the DAGScheduler, the TaskScheduler and the Workers; this lesson explains how the TaskScheduler works.
Figure 8-3 Spark runtime architecture
When the DAGScheduler submits a TaskSet to the low-level scheduler, it programs against the TaskScheduler interface. This follows the object-oriented principle of depending on abstractions rather than concrete implementations, and it makes the underlying resource scheduler pluggable, which is why Spark can run on many resource managers such as Standalone, YARN, Mesos, Local, EC2 or custom schedulers. In Standalone mode we focus on TaskSchedulerImpl.
TaskScheduler is a trait, the low-level task scheduling interface, implemented by [[org.apache.spark.scheduler.TaskSchedulerImpl]]. The interface allows different task schedulers to be plugged in. Each TaskScheduler schedules tasks for a single SparkContext. It receives the sets of tasks submitted by the DAGScheduler for each Stage, is responsible for sending them to the cluster and running them, retries tasks that fail, and returns events to the DAGScheduler.
The TaskScheduler source is as follows:
private[spark] trait TaskScheduler {

  private val appId = "spark-application-" + System.currentTimeMillis

  def rootPool: Pool

  def schedulingMode: SchedulingMode

  def start(): Unit

  // Invoked after system has successfully initialized (typically in spark context).
  // Yarn uses this to bootstrap allocation of resources based on preferred locations,
  // wait for slave registrations, etc.
  def postStartHook() { }

  // Disconnect from the cluster.
  def stop(): Unit

  // Submit a sequence of tasks to run.
  def submitTasks(taskSet: TaskSet): Unit

  // Cancel a stage.
  def cancelTasks(stageId: Int, interruptThread: Boolean): Unit

  // Set the DAG scheduler for upcalls. This is guaranteed to be set before submitTasks is called.
  def setDAGScheduler(dagScheduler: DAGScheduler): Unit

  // Get the default level of parallelism to use in the cluster, as a hint for sizing jobs.
  def defaultParallelism(): Int

  /**
   * Update metrics for in-progress tasks and let the master know that the BlockManager is still
   * alive. Return true if the driver knows about the given block manager. Otherwise, return false,
   * indicating that the block manager should re-register.
   */
  def executorHeartbeatReceived(
      execId: String,
      accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])],
      blockManagerId: BlockManagerId): Boolean

  /**
   * Get an application ID associated with the job.
   *
   * @return An application ID
   */
  def applicationId(): String = appId

  /**
   * Process a lost executor
   */
  def executorLost(executorId: String, reason: ExecutorLossReason): Unit

  /**
   * Get an application's attempt ID associated with the job.
   *
   * @return An application's Attempt ID
   */
  def applicationAttemptId(): Option[String]

}
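The defaultParallelism() declared in the trait above is what the public SparkContext.defaultParallelism ultimately delegates to, and it is commonly used as a hint when sizing RDDs. A minimal sketch, assuming an already-constructed SparkContext named sc:

// Sketch: use the scheduler's default parallelism as a partitioning hint.
val hint = sc.defaultParallelism                  // delegates to TaskScheduler.defaultParallelism()
val data = sc.parallelize(1 to 1000000, hint)     // one task per partition in the resulting Stage
println(s"defaultParallelism = $hint, partitions = ${data.getNumPartitions}")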
The DAGScheduler hands the TaskSet to the low-level TaskScheduler interface, which has different concrete implementations; the main one is TaskSchedulerImpl:
private[spark] class TaskSchedulerImpl(
    val sc: SparkContext,
    val maxTaskFailures: Int,
    isLocal: Boolean = false)
  extends TaskScheduler with Logging
{
TaskSchedulerImpl in turn has its own subclass, YarnScheduler:
private[spark] class YarnScheduler(sc: SparkContext) extends TaskSchedulerImpl(sc) {

  // RackResolver logs an INFO message whenever it resolves a rack, which is way too often.
  if (Logger.getLogger(classOf[RackResolver]).getLevel == null) {
    Logger.getLogger(classOf[RackResolver]).setLevel(Level.WARN)
  }

  // By default, rack is unknown
  override def getRackForHost(hostPort: String): Option[String] = {
    val host = Utils.parseHostPort(hostPort)._1
    Option(RackResolver.resolve(sc.hadoopConfiguration, host).getNetworkLocation)
  }
}
YarnScheduler's subclass YarnClusterScheduler is implemented as follows:
private[spark] class YarnClusterScheduler(sc: SparkContext) extends YarnScheduler(sc) {
  logInfo("Created YarnClusterScheduler")

  override def postStartHook() {
    ApplicationMaster.sparkContextInitialized(sc)
    super.postStartHook()
    logInfo("YarnClusterScheduler.postStartHook done")
  }

}
By default we study Standalone mode, so the focus is on TaskSchedulerImpl. The DAGScheduler hands the TaskSet to the TaskScheduler, and the TaskScheduler manages the individual tasks through a TaskSetManager. The TaskScheduler's core job is to submit the TaskSet to the cluster for execution and to report back the results:
- it creates and maintains a TaskSetManager for the TaskSet and tracks task locality and failure information;
- straggler tasks that lag behind are retried on other nodes;
- it reports execution status to the DAGScheduler, including fetch-failed errors when shuffle output is lost.
TaskSet is a plain class whose first member, tasks, is an array of tasks. The TaskSet source is as follows:
private[spark] class TaskSet(
    val tasks: Array[Task[_]],
    val stageId: Int,
    val stageAttemptId: Int,
    val priority: Int,
    val properties: Properties) {
  val id: String = stageId + "." + stageAttemptId

  override def toString: String = "TaskSet " + id
}
Inside the TaskScheduler there is a SchedulerBackend, which manages Executor resources. In Standalone mode the concrete implementation is StandaloneSchedulerBackend (in Spark 2.0 the earlier SparkDeploySchedulerBackend was renamed to StandaloneSchedulerBackend).
SchedulerBackend itself is an interface, a trait. Its source is as follows:
private[spark] trait SchedulerBackend {
  private val appId = "spark-application-" + System.currentTimeMillis

  def start(): Unit
  def stop(): Unit
  def reviveOffers(): Unit
  def defaultParallelism(): Int

  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
    throw new UnsupportedOperationException
  def isReady(): Boolean = true

  /**
   * Get an application ID associated with the job.
   *
   * @return An application ID
   */
  def applicationId(): String = appId

  /**
   * Get the attempt ID for this run, if the cluster manager supports multiple
   * attempts. Applications run in client mode will not have attempt IDs.
   *
   * @return The application attempt id, if available.
   */
  def applicationAttemptId(): Option[String] = None

  /**
   * Get the URLs for the driver logs. These URLs are used to display the links in the UI
   * Executors tab for the driver.
   * @return Map containing the log names and their respective URLs
   */
  def getDriverLogUrls: Option[Map[String, String]] = None

}
StandaloneSchedulerBackend is responsible for collecting the Workers' resource information. It receives the registrations that the Workers' ExecutorBackends send to the Driver when they start, thereby preparing compute resources, at process granularity, for the current application.
StandaloneSchedulerBackend is declared as follows:
private[spark] class StandaloneSchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext,
    masters: Array[String])
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.rpcEnv)
  with StandaloneAppClientListener
  with Logging {
  private var client: StandaloneAppClient = null
  ......
StandaloneSchedulerBackend holds a client of type StandaloneAppClient:
private[spark] class StandaloneAppClient(
    rpcEnv: RpcEnv,
    masterUrls: Array[String],
    appDescription: ApplicationDescription,
    listener: StandaloneAppClientListener,
    conf: SparkConf)
  extends Logging {
StandaloneAppClient allows an application to talk to the Spark standalone cluster manager. It takes the Master URLs, an application description and a listener for cluster events, and invokes the listener's callbacks when the various events occur. masterUrls has the format spark://host:port, and StandaloneAppClient uses them to register with the Master.
The client field is assigned in the start method of StandaloneSchedulerBackend.scala, where a new StandaloneAppClient is constructed:
private[spark] class StandaloneSchedulerBackend(
  ......

  override def start() {
    ......
    val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
    client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    client.start()
    launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
    waitForRegistration()
    launcherBackend.setState(SparkAppHandle.State.RUNNING)
  }
StandaloneAppClient.scala contains an inner class, ClientEndpoint, whose core job is to register with the Master at startup. When StandaloneAppClient's start method runs, it news up a ClientEndpoint.
The StandaloneAppClient source is as follows:
private[spark] class StandaloneAppClient(
  ......
  private class ClientEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint
    with Logging {
    ......

  def start() {
    // Just launch an rpcEndpoint; it will call back into the listener.
    endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
  }
So when StandaloneSchedulerBackend starts, it builds a StandaloneAppClient instance; when the StandaloneAppClient instance starts, it launches the ClientEndpoint message loop; and when ClientEndpoint starts, it registers the current application with the Master.
The onStart() method of the ClientEndpoint class in StandaloneAppClient:
override def onStart(): Unit = {
  try {
    registerWithMaster(1)
  } catch {
    case e: Exception =>
      logWarning("Failed to connect to master", e)
      markDisconnected()
      stop()
  }
}
This registration is the first core function of StandaloneSchedulerBackend. StandaloneSchedulerBackend extends CoarseGrainedSchedulerBackend, and CoarseGrainedSchedulerBackend creates a DriverEndpoint when it starts; seen from the instance's point of view, that DriverEndpoint also belongs to the StandaloneSchedulerBackend instance:
private[spark]
class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: RpcEnv)
  extends ExecutorAllocationClient with SchedulerBackend with Logging
{
  ......
  class DriverEndpoint(override val rpcEnv: RpcEnv, sparkProperties: Seq[(String, String)])
    extends ThreadSafeRpcEndpoint with Logging {
    ......
When StandaloneSchedulerBackend's parent class CoarseGrainedSchedulerBackend starts, it instantiates a message loop of type DriverEndpoint, which is the classic Driver object of a running program.
At runtime StandaloneSchedulerBackend registers with the Master to request resources. When a Worker's ExecutorBackend starts, it sends a RegisterExecutor message to DriverEndpoint to register itself; at that point StandaloneSchedulerBackend knows exactly which compute resources the current application owns, and the TaskScheduler runs Tasks on the resources held by StandaloneSchedulerBackend. StandaloneSchedulerBackend is not the overall manager of the application; DAGScheduler and TaskScheduler are. StandaloneSchedulerBackend obtains the concrete compute resources for the application's Tasks and ships the Tasks to the cluster.
SparkContext, DAGScheduler, TaskSchedulerImpl and StandaloneSchedulerBackend are each instantiated only once, when the application starts, and they exist for as long as the application runs.
The discussion here is based on Spark 2.1.
The Spark scheduler has three core objects: SparkContext, DAGScheduler and TaskSchedulerImpl. TaskSchedulerImpl, as the concrete low-level scheduler, needs compute resources at runtime, which is where StandaloneSchedulerBackend comes in. The clever part of StandaloneSchedulerBackend's design is that its start launches a StandaloneAppClient; when the StandaloneAppClient starts, it brings up a ClientEndpoint message loop, and when that message loop starts it registers the application with the Master.
When StandaloneSchedulerBackend's parent class CoarseGrainedSchedulerBackend starts, it instantiates DriverEndpoint. Every ExecutorBackend registers with DriverEndpoint when it starts, and the registration ultimately ends up in StandaloneSchedulerBackend's in-memory data structures. On the surface this happens in CoarseGrainedSchedulerBackend, but the object actually instantiated is a StandaloneSchedulerBackend, so whatever is registered with the parent class's members really belongs to the subclass instance.
A preliminary question first: how are the TaskScheduler and the StandaloneSchedulerBackend started, and when is TaskSchedulerImpl instantiated?
TaskSchedulerImpl is instantiated inside SparkContext. When the SparkContext class is instantiated, every statement that is not inside a method body is executed (see the sketch after the snippet below). (sched, ts) is a member of SparkContext; it is produced by calling createTaskScheduler, which returns a tuple of two elements: sched is the SchedulerBackend and ts is the TaskScheduler.
class SparkContext(config: SparkConf) extends Logging {
  ......
  // Create and start the scheduler
  val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
  _schedulerBackend = sched
  _taskScheduler = ts
  _dagScheduler = new DAGScheduler(this)
  _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
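The reason the (sched, ts) line above runs during construction is a general Scala rule: statements written directly in a class body belong to the primary constructor and execute at instantiation. A small self-contained sketch of that rule (the class and its members are purely illustrative):

class ConstructorDemo {
  // Runs as soon as `new ConstructorDemo` is evaluated, just like the
  // createTaskScheduler call in SparkContext's class body.
  println("constructor body statement executed")
  val (sched, ts) = ("schedulerBackendPlaceholder", "taskSchedulerPlaceholder") // fields initialized eagerly
  def work(): Unit = println("a method body runs only when it is explicitly called")
}

object ConstructorDemoApp {
  def main(args: Array[String]): Unit = {
    val demo = new ConstructorDemo() // prints the constructor message; work() has not run yet
    demo.work()
  }
}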
createTaskScheduler covers many run modes; here we focus on Standalone. It first news up a TaskSchedulerImpl. TaskSchedulerImpl and SparkContext are in one-to-one correspondence: a running program has exactly one TaskSchedulerImpl, just as it has exactly one SparkContext. It then instantiates a StandaloneSchedulerBackend, of which the running program also has exactly one. The createTaskScheduler method is as follows:
private def createTaskScheduler(
    sc: SparkContext,
    master: String,
    deployMode: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._
  ......
  master match {
    ......
    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)
  ......
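A minimal sketch of a driver whose master URL takes the SPARK_REGEX branch above; the host names and ports are hypothetical, and a comma-separated list of Masters is allowed for standalone high availability:

import org.apache.spark.{SparkConf, SparkContext}

object StandaloneMasterSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("StandaloneMasterSketch")
      .setMaster("spark://master1:7077,master2:7077")  // hypothetical standalone Masters
    // Constructing SparkContext runs createTaskScheduler, which for this URL
    // builds a TaskSchedulerImpl plus a StandaloneSchedulerBackend.
    val sc = new SparkContext(conf)
    println(sc.master)
    sc.stop()
  }
}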
So during SparkContext's instantiation, createTaskScheduler creates the TaskSchedulerImpl and the StandaloneSchedulerBackend and then calls scheduler.initialize(backend). The initialize method takes the StandaloneSchedulerBackend as its parameter, and schedulingMode is pattern-matched against two modes: FIFO and FAIR.
The source of TaskSchedulerImpl's initialize is as follows:
def initialize(backend: SchedulerBackend) {
  this.backend = backend
  // temporarily set rootPool name to empty
  rootPool = new Pool("", schedulingMode, 0, 0)
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
      case _ =>
        throw new IllegalArgumentException(s"Unsupported spark.scheduler.mode: $schedulingMode")
    }
  }
  schedulableBuilder.buildPools()
}
Inside initialize, schedulableBuilder.buildPools() is called; buildPools has different implementations in FIFOSchedulableBuilder and FairSchedulableBuilder depending on the scheduling mode:
private[spark] trait SchedulableBuilder {
  def rootPool: Pool

  def buildPools(): Unit

  def addTaskSetManager(manager: Schedulable, properties: Properties): Unit
}
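Which SchedulableBuilder gets built above is controlled by the spark.scheduler.mode configuration (FIFO by default). A minimal sketch of selecting FAIR scheduling; the pool definition file is optional and its path here is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("FairSchedulingSketch")
      .setMaster("local[*]")                                                  // any master works; the mode is resolved in initialize
      .set("spark.scheduler.mode", "FAIR")                                    // default is FIFO
      .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml") // hypothetical pool file
    val sc = new SparkContext(conf)   // TaskSchedulerImpl.initialize now builds a FairSchedulableBuilder
    sc.stop()
  }
}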
The initialize method receives the StandaloneSchedulerBackend but does not start it: TaskSchedulerImpl.initialize merely assigns it to TaskSchedulerImpl's backend field. Only when TaskSchedulerImpl's start method is called does it call backend.start, and that start method is where the application is finally registered.
Let's look at how the taskScheduler is started in SparkContext.scala:
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
......
_taskScheduler.start()
_applicationId = _taskScheduler.applicationId()
_applicationAttemptId = taskScheduler.applicationAttemptId()
_conf.set("spark.app.id", _applicationId)
......
Here the start method of _taskScheduler is invoked:
private[spark] trait TaskScheduler {
  ......

  def start(): Unit
  ......
TaskScheduler's start() has no concrete implementation; the start() of TaskSchedulerImpl, which implements the trait, reads as follows:
override def start() {
  backend.start()

  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}
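The speculative-execution branch above only runs when spark.speculation is enabled and the mode is not local. A minimal sketch of turning it on; the interval and multiplier values are illustrative, not recommendations:

import org.apache.spark.SparkConf

object SpeculationConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SpeculationConfSketch")
      .set("spark.speculation", "true")             // enables the checkSpeculatableTasks timer
      .set("spark.speculation.interval", "100ms")   // how often stragglers are checked
      .set("spark.speculation.multiplier", "1.5")   // how much slower than the median counts as a straggler
    println(conf.get("spark.speculation"))
  }
}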
TaskSchedulerImpl's start() thus calls backend.start(), which runs StandaloneSchedulerBackend's start method:
override def start() {
  super.start()
  launcherBackend.connect()
  ......
  val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
    args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
  ......
  val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
    appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
  client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
  client.start()
  ......
}
In StandaloneSchedulerBackend's start method, the command is packaged up and registered with the Master, and the Master in turn tells a Worker to launch the concrete Executor. The command specifies the entry class of the Executor process, CoarseGrainedExecutorBackend. A StandaloneAppClient is then created and started via client.start().
StandaloneAppClient's start method news up a ClientEndpoint:
def start() {
  // Just launch an rpcEndpoint; it will call back into the listener.
  endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
}
The ClientEndpoint source is as follows:
private class ClientEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint
  with Logging {
  ......
  override def onStart(): Unit = {
    try {
      registerWithMaster(1)
    } catch {
      case e: Exception =>
        logWarning("Failed to connect to master", e)
        markDisconnected()
        stop()
    }
  }
ClientEndpoint is a ThreadSafeRpcEndpoint. Its onStart() method calls registerWithMaster(1) to register the program with the Master. The registerWithMaster method is as follows:
private def registerWithMaster(nthRetry: Int) {
  registerMasterFutures.set(tryRegisterAllMasters())
  registrationRetryTimer.set(registrationRetryThread.schedule(new Runnable {
    override def run(): Unit = {
      if (registered.get) {
        registerMasterFutures.get.foreach(_.cancel(true))
        registerMasterThreadPool.shutdownNow()
      } else if (nthRetry >= REGISTRATION_RETRIES) {
        markDead("All masters are unresponsive! Giving up.")
      } else {
        registerMasterFutures.get.foreach(_.cancel(true))
        registerWithMaster(nthRetry + 1)
      }
    }
  }, REGISTRATION_TIMEOUT_SECONDS, TimeUnit.SECONDS))
}
After the program is registered, the Master allocates resources for it via schedule() and tells a Worker to launch an Executor. The Executor process that gets launched is CoarseGrainedExecutorBackend, and once it is up it registers back with the Driver; the Driver here is in fact DriverEndpoint, a message loop inside StandaloneSchedulerBackend's parent class CoarseGrainedSchedulerBackend.
Overall summary:
When SparkContext is instantiated, it calls createTaskScheduler to create TaskSchedulerImpl and StandaloneSchedulerBackend, and it also calls TaskSchedulerImpl's start. That start calls StandaloneSchedulerBackend's start, which creates a StandaloneAppClient object and calls its start method, which in turn creates a ClientEndpoint. When the ClientEndpoint is created, a Command is passed in that names CoarseGrainedExecutorBackend as the entry class of the Executors to be launched for the current application. The ClientEndpoint then starts and registers the current application with the Master via tryRegisterAllMasters. When the Master receives the registration and decides the program can run, it generates an ID for the application and allocates compute resources via schedule; the concrete allocation is decided by the application's run mode and settings such as memory and cores. The Master then sends instructions to the Workers. When a Worker allocates compute resources for the application, it first creates an ExecutorRunner, and the ExecutorRunner internally uses a thread to build a ProcessBuilder that launches another JVM process. The main class that this JVM loads is exactly the one named in the Command passed when the ClientEndpoint was created, CoarseGrainedExecutorBackend. Once the JVM launched by the ProcessBuilder has loaded CoarseGrainedExecutorBackend, it calls its main method, which instantiates the CoarseGrainedExecutorBackend message loop itself; during instantiation, through the onStart callback, it sends RegisterExecutor to DriverEndpoint to register the current CoarseGrainedExecutorBackend. DriverEndpoint receives the registration and stores it in the in-memory data structures of the StandaloneSchedulerBackend instance, and at that point the Driver has obtained its compute resources!
The main method of CoarseGrainedExecutorBackend.scala:
def main(args: Array[String]) {
  var driverUrl: String = null
  var executorId: String = null
  var hostname: String = null
  var cores: Int = 0
  var appId: String = null
  var workerUrl: Option[String] = None
  val userClassPath = new mutable.ListBuffer[URL]()

  var argv = args.toList
  ......
  run(driverUrl, executorId, hostname, cores, appId, workerUrl, userClassPath)
  System.exit(0)
}
CoarseGrainedExecutorBackend's main then calls the run method:
private def run(
    driverUrl: String,
    executorId: String,
    hostname: String,
    cores: Int,
    appId: String,
    workerUrl: Option[String],
    userClassPath: Seq[URL]) {
  ......
  env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
    env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
  ......
In the run method invoked from CoarseGrainedExecutorBackend's main, env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(...)) constructs the CoarseGrainedExecutorBackend instance itself and registers it as an RPC endpoint.