In the previous article we saw SparkContext create its three core objects: TaskScheduler, SchedulerBackend, and DAGScheduler.
Once these three objects have been created, TaskScheduler's start method is called right away.
// SparkContext.scala line 527
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()
The comment above already says it: before TaskScheduler.start is called, DAGScheduler's constructor must have set the DAGScheduler reference inside TaskScheduler. As explained in the previous article when DAGScheduler was created, the two objects hold references to each other.
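As a reminder of how that mutual reference is completed, the tail of DAGScheduler's class body (paraphrased here, not quoted verbatim) hands the newly built DAGScheduler back to the TaskScheduler:

// DAGScheduler.scala (paraphrased): the last statement of the class body registers this
// DAGScheduler with the TaskScheduler, completing the mutual reference described above.
taskScheduler.setDAGScheduler(this)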
Let's see what actually happens when TaskScheduler.start is called.
// TaskSchedulerImpl.scala line 143
override def start() {
  backend.start()

  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}
First, backend.start is called. As noted earlier, TaskScheduler holds a reference to a SchedulerBackend; you can think of the SchedulerBackend as TaskScheduler's backend, so it makes sense that the backend has to be started first.
The SchedulerBackend here is an instance of SparkDeploySchedulerBackend; the sketch below recalls how it was chosen, and then we trace into its start method.
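For reference, a paraphrased excerpt of SparkContext.createTaskScheduler: for a spark:// master URL (standalone mode) a TaskSchedulerImpl and a SparkDeploySchedulerBackend are created and wired together via scheduler.initialize(backend).

// SparkContext.createTaskScheduler (paraphrased excerpt)
case SPARK_REGEX(sparkUrl) =>
  val scheduler = new TaskSchedulerImpl(sc)
  val masterUrls = sparkUrl.split(",").map("spark://" + _)
  val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
  scheduler.initialize(backend)
  (backend, scheduler)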
// SparkDeploySchedulerBackend.scala line 52
override def start() {
  super.start()
  launcherBackend.connect()

  // The endpoint for executors to talk to us
  val driverUrl = rpcEnv.uriOf(SparkEnv.driverActorSystemName,
    RpcAddress(sc.conf.get("spark.driver.host"), sc.conf.get("spark.driver.port").toInt),
    CoarseGrainedSchedulerBackend.ENDPOINT_NAME)
  val args = Seq(
    "--driver-url", driverUrl,
    "--executor-id", "{{EXECUTOR_ID}}",
    "--hostname", "{{HOSTNAME}}",
    "--cores", "{{CORES}}",
    "--app-id", "{{APP_ID}}",
    "--worker-url", "{{WORKER_URL}}")
  val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
    .map(Utils.splitCommandString).getOrElse(Seq.empty)
  val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath")
    .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)
  val libraryPathEntries = sc.conf.getOption("spark.executor.extraLibraryPath")
    .map(_.split(java.io.File.pathSeparator).toSeq).getOrElse(Nil)

  // When testing, expose the parent class path to the child. This is processed by
  // compute-classpath.{cmd,sh} and makes all needed jars available to child processes
  // when the assembly is built with the "*-provided" profiles enabled.
  val testingClassPath =
    if (sys.props.contains("spark.testing")) {
      sys.props("java.class.path").split(java.io.File.pathSeparator).toSeq
    } else {
      Nil
    }

  // Start executors with a few necessary configs for registering with the scheduler
  val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
  val javaOpts = sparkJavaOpts ++ extraJavaOpts
  val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
    args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
  val appUIAddress = sc.ui.map(_.appUIAddress).getOrElse("")
  val coresPerExecutor = conf.getOption("spark.executor.cores").map(_.toInt)
  val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory,
    command, appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor)
  client = new AppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
  client.start()
  launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
  waitForRegistration()
  launcherBackend.setState(SparkAppHandle.State.RUNNING)
}
In this start method, super.start is called first; as we saw earlier, the superclass is CoarseGrainedSchedulerBackend.
In the parent's start method (shown next), besides collecting the configuration entries whose keys start with spark., something very important happens: an endpoint is created, the DriverEndpoint.
// CoarseGrainedSchedulerBackend.scala line 303
override def start() {
  val properties = new ArrayBuffer[(String, String)]
  for ((key, value) <- scheduler.sc.conf.getAll) {
    if (key.startsWith("spark.")) {
      properties += ((key, value))
    }
  }

  // TODO (prashant) send conf instead of properties
  driverEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint(properties)) // create the DriverEndpoint and register it with the RpcEnv
}
The endpoint is named CoarseGrainedScheduler:
// CoarseGrainedSchedulerBackend.scala line 510
private[spark] object CoarseGrainedSchedulerBackend {
  val ENDPOINT_NAME = "CoarseGrainedScheduler"
}
The DriverEndpoint created here implements the ThreadSafeRpcEndpoint trait, which in turn extends the RpcEndpoint trait.
// CoarseGrainedSchedulerBackend.scala line 303
protected def createDriverEndpoint(properties: Seq[(String, String)]): DriverEndpoint = {
  new DriverEndpoint(rpcEnv, properties)
}

// CoarseGrainedSchedulerBackend.scala line 81
class DriverEndpoint(override val rpcEnv: RpcEnv, sparkProperties: Seq[(String, String)])
  extends ThreadSafeRpcEndpoint with Logging

// org.apache.spark.rpc.RpcEndpoint.scala line 148
private[spark] trait ThreadSafeRpcEndpoint extends RpcEndpoint

// org.apache.spark.rpc.RpcEndpoint.scala line 46
/**
 * An end point for the RPC that defines what functions to trigger given a message.
 *
 * It is guaranteed that `onStart`, `receive` and `onStop` will be called in sequence.
 *
 * The life-cycle of an endpoint is:
 *
 * constructor -> onStart -> receive* -> onStop
 *
 * Note: `receive` can be called concurrently. If you want `receive` to be thread-safe, please use
 * [[ThreadSafeRpcEndpoint]]
 *
 * If any error is thrown from one of [[RpcEndpoint]] methods except `onError`, `onError` will be
 * invoked with the cause. If `onError` throws an error, [[RpcEnv]] will ignore it.
 */
private[spark] trait RpcEndpoint
The comments above make the role of RpcEndpoint quite clear.
An RpcEndpoint, in RPC (Remote Procedure Call), defines which method a received message will trigger.
They also spell out the life cycle clearly: constructor -> onStart -> receive* -> onStop.
Here receive* refers to both receive and receiveAndReply.
The difference between them: receive handles messages that need no reply (fire-and-forget), while receiveAndReply handles messages sent via ask, for which the sender expects an answer (delivered through a Future that the sender can wait on).
An analogy: when a restaurant takes a takeout order over the phone, the customer does not eat right away; the restaurant cooks the food and delivers it later (the message only needs to be passed along, no immediate feedback is required). With dine-in, on the other hand, once the customer orders, the dish has to be prepared and served straight away.
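To make the contrast concrete, here is a toy endpoint (hypothetical, not Spark source; the name EchoEndpoint is made up, and Spark's rpc package is private[spark], so this is illustration only):

import org.apache.spark.rpc.{RpcCallContext, RpcEnv, ThreadSafeRpcEndpoint}

class EchoEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint {

  // Messages arriving via RpcEndpointRef.send land here: fire-and-forget, nothing is sent back.
  override def receive: PartialFunction[Any, Unit] = {
    case msg: String => println(s"one-way message: $msg")
  }

  // Messages arriving via RpcEndpointRef.ask land here: the reply goes back through the context.
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case msg: String => context.reply(s"echo: $msg")
  }
}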
Another key member is self. In fact, when an endpoint is created, both the receiving and the sending sides are set up: the sending side is the RpcEndpointRef, which can be obtained through the self method, as shown below.
// org.apache.spark.rpc.RpcEndpoint.scala line 60
final def self: RpcEndpointRef = {
  require(rpcEnv != null, "rpcEnv has not been initialized")
  rpcEnv.endpointRef(this)
}

// org.apache.spark.rpc.RpcEndpoint.scala line 65
/**
 * Process messages from [[RpcEndpointRef.send]] or [[RpcCallContext.reply)]]. If receiving a
 * unmatched message, [[SparkException]] will be thrown and sent to `onError`.
 */
def receive: PartialFunction[Any, Unit] = {
  case _ => throw new SparkException(self + " does not implement 'receive'")
}

/**
 * Process messages from [[RpcEndpointRef.ask]]. If receiving a unmatched message,
 * [[SparkException]] will be thrown and sent to `onError`.
 */
def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case _ => context.sendFailure(new SparkException(self + " won't reply anything"))
}
Since RpcEndpoint is the receiving side, there must also be a sending side.
// RpcEndpointRef.scala line 26
/**
 * A reference for a remote [[RpcEndpoint]]. [[RpcEndpointRef]] is thread-safe.
 */
private[spark] abstract class RpcEndpointRef(conf: SparkConf)
  extends Serializable with Logging {

  // line 46
  /**
   * Sends a one-way asynchronous message. Fire-and-forget semantics.
   */
  def send(message: Any): Unit

  // line 54
  /**
   * Send a message to the corresponding [[RpcEndpoint.receiveAndReply)]] and return a [[Future]] to
   * receive the reply within the specified timeout.
   *
   * This method only sends the message once and never retries.
   */
  def ask[T: ClassTag](message: Any, timeout: RpcTimeout): Future[T]

  // line 62
  /**
   * Send a message to the corresponding [[RpcEndpoint.receiveAndReply)]] and return a [[Future]] to
   * receive the reply within a default timeout.
   *
   * This method only sends the message once and never retries.
   */
  def ask[T: ClassTag](message: Any): Future[T] = ask(message, defaultAskTimeout)

  // line 93
  def askWithRetry[T: ClassTag](message: Any, timeout: RpcTimeout): T = {
    // ...
  }
}
Sending mirrors receiving and comes in two flavors, send and ask. As the name suggests, ask expects a result back; there is also a retrying variant, see askWithRetry.
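On the sending side, a minimal sketch (again hypothetical; it reuses the EchoEndpoint from the earlier sketch, and because the rpc classes are private[spark] it would only compile inside Spark itself):

import scala.concurrent.Future
import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.rpc.{RpcEndpointRef, RpcEnv}

object EchoDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rpc-demo")
    // Create an RpcEnv and register the toy endpoint under a name, just like
    // CoarseGrainedSchedulerBackend registers "CoarseGrainedScheduler".
    val rpcEnv = RpcEnv.create("demo", "localhost", 0, conf, new SecurityManager(conf))
    val ref: RpcEndpointRef = rpcEnv.setupEndpoint("echo", new EchoEndpoint(rpcEnv))

    ref.send("ping")                                     // handled by receive, no reply expected
    val reply: Future[String] = ref.ask[String]("ping")  // handled by receiveAndReply
  }
}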
With the life cycle in mind, let's return to the DriverEndpoint. Once the DriverEndpoint is constructed, the next step in the life cycle is a call to its onStart method.
In onStart, a ReviveOffers message is sent at a fixed interval, by default once every second.
// CoarseGrainedSchedulerBackend.scala line 95
override def onStart() {
  // Periodically revive offers to allow delay scheduling to work
  val reviveIntervalMs = conf.getTimeAsMs("spark.scheduler.revive.interval", "1s")

  reviveThread.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = Utils.tryLogNonFatalError {
      Option(self).foreach(_.send(ReviveOffers))
    }
  }, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
}
You can think of it this way: once the Driver is up, it periodically pinches itself by sending itself a ReviveOffers message. The send method here belongs to self, and as we saw above, self is an RpcEndpointRef.
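The ReviveOffers message itself carries no payload; paraphrased from CoarseGrainedClusterMessages, it is just a marker telling the DriverEndpoint to try to schedule pending tasks again:

// CoarseGrainedClusterMessage.scala (paraphrased): ReviveOffers carries no data.
case object ReviveOffers extends CoarseGrainedClusterMessage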
After the message is sent, it is received and handled by DriverEndpoint's receive:
// CoarseGrainedSchedulerBackend.scala line 122
case ReviveOffers =>
  makeOffers()
Thanks to 王家林 (Wang Jialin) for sharing his knowledge.