Lesson 35: Tracing the Internal Execution Loop of the Spark Runtime
Spark's DAGScheduler divides an entire Job into Stages. Stages are determined from the last Stage backward, but executed from front to back. Each Stage contains a set of tasks that run in parallel; the tasks within a Stage execute identical logic over different partitions of data. The DAGScheduler packages all the tasks of each Stage built from the DAG into a TaskSet and submits it to the low-level scheduler, the TaskScheduler. TaskScheduler is an interface that decouples scheduling from any concrete deployment, so it can run under different scheduling modes, for example Standalone or YARN.
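For example, a minimal word-count sketch shows where the DAGScheduler cuts the DAG into Stages: the narrow transformations before the shuffle introduced by reduceByKey are pipelined into one Stage, and the shuffle boundary starts the next Stage, whose parallel tasks all run the same logic over different data partitions. The application name and HDFS paths below are hypothetical, and the master URL is assumed to be supplied by spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

object StageSplitSketch {
  def main(args: Array[String]): Unit = {
    // The master URL is expected from spark-submit; the paths are hypothetical.
    val sc = new SparkContext(new SparkConf().setAppName("StageSplitSketch"))
    // Stage 0: textFile -> flatMap -> map are narrow dependencies, pipelined into one Stage.
    val pairs = sc.textFile("hdfs://namenode:9000/input")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
    // reduceByKey introduces a shuffle dependency, so the DAG is cut and a new Stage begins here.
    val counts = pairs.reduceByKey(_ + _)
    // The action triggers the DAGScheduler, which submits each Stage's tasks as a TaskSet.
    counts.saveAsTextFile("hdfs://namenode:9000/output")
    sc.stop()
  }
}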
Spark's basic scheduling picture involves RDDs, the DAGScheduler, the TaskScheduler and the Workers; this lesson explains how the TaskScheduler works.
Figure 8-3 Spark runtime architecture
When the DAGScheduler submits a TaskSet to the low-level scheduler, it programs against the TaskScheduler interface. This follows the object-oriented principle of depending on abstractions rather than concrete implementations, and it makes the underlying resource scheduler pluggable, which is why Spark can run on many resource managers such as Standalone, YARN, Mesos, Local, EC2 or custom schedulers. In Standalone mode we focus on TaskSchedulerImpl.
TaskScheduler is a trait, the low-level task scheduling interface, implemented by [[org.apache.spark.scheduler.TaskSchedulerImpl]]. The interface allows different task schedulers to be plugged in. Each TaskScheduler schedules tasks for a single SparkContext. It receives the sets of tasks submitted by the DAGScheduler for each Stage, is responsible for sending them to the cluster and running them, retries tasks that fail, and returns events to the DAGScheduler.
The TaskScheduler source is as follows:
private[spark] trait TaskScheduler {

  private val appId = "spark-application-" + System.currentTimeMillis

  def rootPool: Pool

  def schedulingMode: SchedulingMode

  def start(): Unit

  // Invoked after system has successfully initialized (typically in spark context).
  // Yarn uses this to bootstrap allocation of resources based on preferred locations,
  // wait for slave registrations, etc.
  def postStartHook() { }

  // Disconnect from the cluster.
  def stop(): Unit

  // Submit a sequence of tasks to run.
  def submitTasks(taskSet: TaskSet): Unit

  // Cancel a stage.
  def cancelTasks(stageId: Int, interruptThread: Boolean): Unit

  // Set the DAG scheduler for upcalls. This is guaranteed to be set before submitTasks is called.
  def setDAGScheduler(dagScheduler: DAGScheduler): Unit

  // Get the default level of parallelism to use in the cluster, as a hint for sizing jobs.
  def defaultParallelism(): Int

  /**
   * Update metrics for in-progress tasks and let the master know that the BlockManager is still
   * alive. Return true if the driver knows about the given block manager. Otherwise, return false,
   * indicating that the block manager should re-register.
   */
  def executorHeartbeatReceived(
      execId: String,
      accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])],
      blockManagerId: BlockManagerId): Boolean

  /**
   * Get an application ID associated with the job.
   *
   * @return An application ID
   */
  def applicationId(): String = appId

  /**
   * Process a lost executor
   */
  def executorLost(executorId: String, reason: ExecutorLossReason): Unit

  /**
   * Get an application's attempt ID associated with the job.
   *
   * @return An application's Attempt ID
   */
  def applicationAttemptId(): Option[String]

}
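The defaultParallelism() declared in the trait above is what the public SparkContext.defaultParallelism ultimately delegates to, and it is commonly used as a hint when sizing RDDs. A minimal sketch, assuming an already-constructed SparkContext named sc:

// Sketch: use the scheduler's default parallelism as a partitioning hint.
val hint = sc.defaultParallelism                  // delegates to TaskScheduler.defaultParallelism()
val data = sc.parallelize(1 to 1000000, hint)     // one task per partition in the resulting Stage
println(s"defaultParallelism = $hint, partitions = ${data.getNumPartitions}")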
The DAGScheduler hands the TaskSet to the low-level TaskScheduler interface, which has different concrete implementations; the main one is TaskSchedulerImpl:
private[spark] class TaskSchedulerImpl(
    val sc: SparkContext,
    val maxTaskFailures: Int,
    isLocal: Boolean = false)
  extends TaskScheduler with Logging
{
TaskSchedulerImpl in turn has its own subclass, YarnScheduler:
private[spark] class YarnScheduler(sc: SparkContext) extends TaskSchedulerImpl(sc) {

  // RackResolver logs an INFO message whenever it resolves a rack, which is way too often.
  if (Logger.getLogger(classOf[RackResolver]).getLevel == null) {
    Logger.getLogger(classOf[RackResolver]).setLevel(Level.WARN)
  }

  // By default, rack is unknown
  override def getRackForHost(hostPort: String): Option[String] = {
    val host = Utils.parseHostPort(hostPort)._1
    Option(RackResolver.resolve(sc.hadoopConfiguration, host).getNetworkLocation)
  }
}
YarnScheduler's subclass YarnClusterScheduler is implemented as follows:
private[spark] class YarnClusterScheduler(sc: SparkContext) extends YarnScheduler(sc) {
  logInfo("Created YarnClusterScheduler")

  override def postStartHook() {
    ApplicationMaster.sparkContextInitialized(sc)
    super.postStartHook()
    logInfo("YarnClusterScheduler.postStartHook done")
  }

}
By default we study Standalone mode, so the focus is on TaskSchedulerImpl. The DAGScheduler hands the TaskSet to the TaskScheduler, and the TaskScheduler manages the individual tasks through a TaskSetManager. The TaskScheduler's core job is to submit the TaskSet to the cluster for execution and to report back the results:
- it creates and maintains a TaskSetManager for the TaskSet and tracks task locality and failure information;
- straggler tasks that lag behind are retried on other nodes;
- it reports execution status to the DAGScheduler, including fetch-failed errors when shuffle output is lost.
TaskSet is a plain class whose first member, tasks, is an array of tasks. The TaskSet source is as follows:
private[spark] class TaskSet(
    val tasks: Array[Task[_]],
    val stageId: Int,
    val stageAttemptId: Int,
    val priority: Int,
    val properties: Properties) {
  val id: String = stageId + "." + stageAttemptId

  override def toString: String = "TaskSet " + id
}
Inside the TaskScheduler there is a SchedulerBackend, which manages Executor resources. In Standalone mode the concrete implementation is StandaloneSchedulerBackend (in Spark 2.0 the earlier SparkDeploySchedulerBackend was renamed to StandaloneSchedulerBackend).
SchedulerBackend itself is an interface, a trait. Its source is as follows:
private[spark] trait SchedulerBackend {
  private val appId = "spark-application-" + System.currentTimeMillis

  def start(): Unit
  def stop(): Unit
  def reviveOffers(): Unit
  def defaultParallelism(): Int

  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
    throw new UnsupportedOperationException
  def isReady(): Boolean = true

  /**
   * Get an application ID associated with the job.
   *
   * @return An application ID
   */
  def applicationId(): String = appId

  /**
   * Get the attempt ID for this run, if the cluster manager supports multiple
   * attempts. Applications run in client mode will not have attempt IDs.
   *
   * @return The application attempt id, if available.
   */
  def applicationAttemptId(): Option[String] = None

  /**
   * Get the URLs for the driver logs. These URLs are used to display the links in the UI
   * Executors tab for the driver.
   * @return Map containing the log names and their respective URLs
   */
  def getDriverLogUrls: Option[Map[String, String]] = None

}
StandaloneSchedulerBackend is responsible for collecting the Workers' resource information. It receives the registrations that the Workers' ExecutorBackends send to the Driver when they start, thereby preparing compute resources, at process granularity, for the current application.
StandaloneSchedulerBackend is declared as follows:
private[spark] class StandaloneSchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext,
    masters: Array[String])
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.rpcEnv)
  with StandaloneAppClientListener
  with Logging {
  private var client: StandaloneAppClient = null
  ......
StandaloneSchedulerBackend holds a client of type StandaloneAppClient:
private[spark] class StandaloneAppClient(
    rpcEnv: RpcEnv,
    masterUrls: Array[String],
    appDescription: ApplicationDescription,
    listener: StandaloneAppClientListener,
    conf: SparkConf)
  extends Logging {
StandaloneAppClient allows an application to talk to the Spark standalone cluster manager. It takes the Master URLs, an application description and a listener for cluster events, and invokes the listener's callbacks when the various events occur. masterUrls has the format spark://host:port, and StandaloneAppClient uses them to register with the Master.
The client field is assigned in the start method of StandaloneSchedulerBackend.scala, where a new StandaloneAppClient is constructed:
private[spark] class StandaloneSchedulerBackend(
  ......

  override def start() {
    ......
    val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
    client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    client.start()
    launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
    waitForRegistration()
    launcherBackend.setState(SparkAppHandle.State.RUNNING)
  }
StandaloneAppClient.scala contains an inner class, ClientEndpoint, whose core job is to register with the Master at startup. When StandaloneAppClient's start method runs, it news up a ClientEndpoint.
The StandaloneAppClient source is as follows:
private[spark] class StandaloneAppClient(
  ......
  private class ClientEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint
    with Logging {
    ......

  def start() {
    // Just launch an rpcEndpoint; it will call back into the listener.
    endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
  }
So when StandaloneSchedulerBackend starts, it builds a StandaloneAppClient instance; when the StandaloneAppClient instance starts, it launches the ClientEndpoint message loop; and when ClientEndpoint starts, it registers the current application with the Master.
The onStart() method of the ClientEndpoint class in StandaloneAppClient:
override def onStart(): Unit = {
  try {
    registerWithMaster(1)
  } catch {
    case e: Exception =>
      logWarning("Failed to connect to master", e)
      markDisconnected()
      stop()
  }
}
This registration is the first core function of StandaloneSchedulerBackend. StandaloneSchedulerBackend extends CoarseGrainedSchedulerBackend, and CoarseGrainedSchedulerBackend creates a DriverEndpoint when it starts; seen from the instance's point of view, that DriverEndpoint also belongs to the StandaloneSchedulerBackend instance:
private[spark]
class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: RpcEnv)
  extends ExecutorAllocationClient with SchedulerBackend with Logging
{
  ......
  class DriverEndpoint(override val rpcEnv: RpcEnv, sparkProperties: Seq[(String, String)])
    extends ThreadSafeRpcEndpoint with Logging {
    ......
When StandaloneSchedulerBackend's parent class CoarseGrainedSchedulerBackend starts, it instantiates a message loop of type DriverEndpoint, which is the classic Driver object of a running program.
At runtime StandaloneSchedulerBackend registers with the Master to request resources. When a Worker's ExecutorBackend starts, it sends a RegisterExecutor message to DriverEndpoint to register itself; at that point StandaloneSchedulerBackend knows exactly which compute resources the current application owns, and the TaskScheduler runs Tasks on the resources held by StandaloneSchedulerBackend. StandaloneSchedulerBackend is not the overall manager of the application; DAGScheduler and TaskScheduler are. StandaloneSchedulerBackend obtains the concrete compute resources for the application's Tasks and ships the Tasks to the cluster.
SparkContext, DAGScheduler, TaskSchedulerImpl and StandaloneSchedulerBackend are each instantiated only once, when the application starts, and they exist for as long as the application runs.
The discussion here is based on Spark 2.1.
The Spark scheduler has three core objects: SparkContext, DAGScheduler and TaskSchedulerImpl. TaskSchedulerImpl, as the concrete low-level scheduler, needs compute resources at runtime, which is where StandaloneSchedulerBackend comes in. The clever part of StandaloneSchedulerBackend's design is that its start launches a StandaloneAppClient; when the StandaloneAppClient starts, it brings up a ClientEndpoint message loop, and when that message loop starts it registers the application with the Master.
When StandaloneSchedulerBackend's parent class CoarseGrainedSchedulerBackend starts, it instantiates DriverEndpoint. Every ExecutorBackend registers with DriverEndpoint when it starts, and the registration ultimately ends up in StandaloneSchedulerBackend's in-memory data structures. On the surface this happens in CoarseGrainedSchedulerBackend, but the object actually instantiated is a StandaloneSchedulerBackend, so whatever is registered with the parent class's members really belongs to the subclass instance.
A preliminary question first: how are the TaskScheduler and the StandaloneSchedulerBackend started, and when is TaskSchedulerImpl instantiated?
TaskSchedulerImpl is instantiated inside SparkContext. When the SparkContext class is instantiated, every statement that is not inside a method body is executed (see the sketch after the snippet below). (sched, ts) is a member of SparkContext; it is produced by calling createTaskScheduler, which returns a tuple of two elements: sched is the SchedulerBackend and ts is the TaskScheduler.
class SparkContext(config: SparkConf) extends Logging {
  ......
  // Create and start the scheduler
  val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
  _schedulerBackend = sched
  _taskScheduler = ts
  _dagScheduler = new DAGScheduler(this)
  _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
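The reason the (sched, ts) line above runs during construction is a general Scala rule: statements written directly in a class body belong to the primary constructor and execute at instantiation. A small self-contained sketch of that rule (the class and its members are purely illustrative):

class ConstructorDemo {
  // Runs as soon as `new ConstructorDemo` is evaluated, just like the
  // createTaskScheduler call in SparkContext's class body.
  println("constructor body statement executed")
  val (sched, ts) = ("schedulerBackendPlaceholder", "taskSchedulerPlaceholder") // fields initialized eagerly
  def work(): Unit = println("a method body runs only when it is explicitly called")
}

object ConstructorDemoApp {
  def main(args: Array[String]): Unit = {
    val demo = new ConstructorDemo() // prints the constructor message; work() has not run yet
    demo.work()
  }
}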
createTaskScheduler covers many run modes; here we focus on Standalone. It first news up a TaskSchedulerImpl. TaskSchedulerImpl and SparkContext are in one-to-one correspondence: a running program has exactly one TaskSchedulerImpl, just as it has exactly one SparkContext. It then instantiates a StandaloneSchedulerBackend, of which the running program also has exactly one. The createTaskScheduler method is as follows:
private def createTaskScheduler(
    sc: SparkContext,
    master: String,
    deployMode: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._
  ......
  master match {
    ......
    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)
  ......
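A minimal sketch of a driver whose master URL takes the SPARK_REGEX branch above; the host names and ports are hypothetical, and a comma-separated list of Masters is allowed for standalone high availability:

import org.apache.spark.{SparkConf, SparkContext}

object StandaloneMasterSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("StandaloneMasterSketch")
      .setMaster("spark://master1:7077,master2:7077")  // hypothetical standalone Masters
    // Constructing SparkContext runs createTaskScheduler, which for this URL
    // builds a TaskSchedulerImpl plus a StandaloneSchedulerBackend.
    val sc = new SparkContext(conf)
    println(sc.master)
    sc.stop()
  }
}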
So during SparkContext's instantiation, createTaskScheduler creates the TaskSchedulerImpl and the StandaloneSchedulerBackend and then calls scheduler.initialize(backend). The initialize method takes the StandaloneSchedulerBackend as its parameter, and schedulingMode is pattern-matched against two modes: FIFO and FAIR.
The source of TaskSchedulerImpl's initialize is as follows:
def initialize(backend: SchedulerBackend) {
  this.backend = backend
  // temporarily set rootPool name to empty
  rootPool = new Pool("", schedulingMode, 0, 0)
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
      case _ =>
        throw new IllegalArgumentException(s"Unsupported spark.scheduler.mode: $schedulingMode")
    }
  }
  schedulableBuilder.buildPools()
}
Inside initialize, schedulableBuilder.buildPools() is called; buildPools has different implementations in FIFOSchedulableBuilder and FairSchedulableBuilder depending on the scheduling mode:
private[spark] trait SchedulableBuilder {
  def rootPool: Pool

  def buildPools(): Unit

  def addTaskSetManager(manager: Schedulable, properties: Properties): Unit
}
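Which SchedulableBuilder gets built above is controlled by the spark.scheduler.mode configuration (FIFO by default). A minimal sketch of selecting FAIR scheduling; the pool definition file is optional and its path here is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("FairSchedulingSketch")
      .setMaster("local[*]")                                                  // any master works; the mode is resolved in initialize
      .set("spark.scheduler.mode", "FAIR")                                    // default is FIFO
      .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml") // hypothetical pool file
    val sc = new SparkContext(conf)   // TaskSchedulerImpl.initialize now builds a FairSchedulableBuilder
    sc.stop()
  }
}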
The initialize method receives the StandaloneSchedulerBackend but does not start it: TaskSchedulerImpl.initialize merely assigns it to TaskSchedulerImpl's backend field. Only when TaskSchedulerImpl's start method is called does it call backend.start, and that start method is where the application is finally registered.
Let's look at how the taskScheduler is started in SparkContext.scala:
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
......
_taskScheduler.start()
_applicationId = _taskScheduler.applicationId()
_applicationAttemptId = taskScheduler.applicationAttemptId()
_conf.set("spark.app.id", _applicationId)
......
Here the start method of _taskScheduler is invoked:
private[spark] trait TaskScheduler {
  ......

  def start(): Unit
  ......
TaskScheduler's start() has no concrete implementation; the start() of TaskSchedulerImpl, which implements the trait, reads as follows:
override def start() {
  backend.start()

  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}
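The speculative-execution branch above only runs when spark.speculation is enabled and the mode is not local. A minimal sketch of turning it on; the interval and multiplier values are illustrative, not recommendations:

import org.apache.spark.SparkConf

object SpeculationConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SpeculationConfSketch")
      .set("spark.speculation", "true")             // enables the checkSpeculatableTasks timer
      .set("spark.speculation.interval", "100ms")   // how often stragglers are checked
      .set("spark.speculation.multiplier", "1.5")   // how much slower than the median counts as a straggler
    println(conf.get("spark.speculation"))
  }
}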
TaskSchedulerImpl's start() thus calls backend.start(), which runs StandaloneSchedulerBackend's start method:
override def start() {
  super.start()
  launcherBackend.connect()
  ......
  val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
    args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
  ......
  val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
    appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
  client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
  client.start()
  ......
}
In StandaloneSchedulerBackend's start method, the command is packaged up and registered with the Master, and the Master in turn tells a Worker to launch the concrete Executor. The command specifies the entry class of the Executor process, CoarseGrainedExecutorBackend. A StandaloneAppClient is then created and started via client.start().
StandaloneAppClient's start method news up a ClientEndpoint:
def start() {
  // Just launch an rpcEndpoint; it will call back into the listener.
  endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
}
The ClientEndpoint source is as follows:
private class ClientEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint
  with Logging {
  ......
  override def onStart(): Unit = {
    try {
      registerWithMaster(1)
    } catch {
      case e: Exception =>
        logWarning("Failed to connect to master", e)
        markDisconnected()
        stop()
    }
  }
ClientEndpoint is a ThreadSafeRpcEndpoint. Its onStart() method calls registerWithMaster(1) to register the program with the Master. The registerWithMaster method is as follows:
private def registerWithMaster(nthRetry: Int) {
  registerMasterFutures.set(tryRegisterAllMasters())
  registrationRetryTimer.set(registrationRetryThread.schedule(new Runnable {
    override def run(): Unit = {
      if (registered.get) {
        registerMasterFutures.get.foreach(_.cancel(true))
        registerMasterThreadPool.shutdownNow()
      } else if (nthRetry >= REGISTRATION_RETRIES) {
        markDead("All masters are unresponsive! Giving up.")
      } else {
        registerMasterFutures.get.foreach(_.cancel(true))
        registerWithMaster(nthRetry + 1)
      }
    }
  }, REGISTRATION_TIMEOUT_SECONDS, TimeUnit.SECONDS))
}
After the program is registered, the Master allocates resources for it via schedule() and tells a Worker to launch an Executor. The Executor process that gets launched is CoarseGrainedExecutorBackend, and once it is up it registers back with the Driver; the Driver here is in fact DriverEndpoint, a message loop inside StandaloneSchedulerBackend's parent class CoarseGrainedSchedulerBackend.
Overall summary:
When SparkContext is instantiated, it calls createTaskScheduler to create TaskSchedulerImpl and StandaloneSchedulerBackend, and it also calls TaskSchedulerImpl's start. That start calls StandaloneSchedulerBackend's start, which creates a StandaloneAppClient object and calls its start method, which in turn creates a ClientEndpoint. When the ClientEndpoint is created, a Command is passed in that names CoarseGrainedExecutorBackend as the entry class of the Executors to be launched for the current application. The ClientEndpoint then starts and registers the current application with the Master via tryRegisterAllMasters. When the Master receives the registration and decides the program can run, it generates an ID for the application and allocates compute resources via schedule; the concrete allocation is decided by the application's run mode and settings such as memory and cores. The Master then sends instructions to the Workers. When a Worker allocates compute resources for the application, it first creates an ExecutorRunner, and the ExecutorRunner internally uses a thread to build a ProcessBuilder that launches another JVM process. The main class that this JVM loads is exactly the one named in the Command passed when the ClientEndpoint was created, CoarseGrainedExecutorBackend. Once the JVM launched by the ProcessBuilder has loaded CoarseGrainedExecutorBackend, it calls its main method, which instantiates the CoarseGrainedExecutorBackend message loop itself; during instantiation, through the onStart callback, it sends RegisterExecutor to DriverEndpoint to register the current CoarseGrainedExecutorBackend. DriverEndpoint receives the registration and stores it in the in-memory data structures of the StandaloneSchedulerBackend instance, and at that point the Driver has obtained its compute resources!
The main method of CoarseGrainedExecutorBackend.scala:
def main(args: Array[String]) {
  var driverUrl: String = null
  var executorId: String = null
  var hostname: String = null
  var cores: Int = 0
  var appId: String = null
  var workerUrl: Option[String] = None
  val userClassPath = new mutable.ListBuffer[URL]()

  var argv = args.toList
  ......
  run(driverUrl, executorId, hostname, cores, appId, workerUrl, userClassPath)
  System.exit(0)
}
CoarseGrainedExecutorBackend's main then calls the run method:
private def run(
    driverUrl: String,
    executorId: String,
    hostname: String,
    cores: Int,
    appId: String,
    workerUrl: Option[String],
    userClassPath: Seq[URL]) {
  ......
  env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
    env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
  ......
In the run method invoked from CoarseGrainedExecutorBackend's main, env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(...)) constructs the CoarseGrainedExecutorBackend instance itself and registers it as an RPC endpoint.