Spark Source Code Series (1): Initialization of SparkContext

Let's first sketch the end-to-end execution flow of a Spark application.

[Figure 1: overall execution flow of a Spark application]

Steps 1-2: When we submit an application with spark-submit, a Driver process is launched (our main class is invoked via reflection). The Driver creates a SparkContext, and the SparkContext initializes its two most important components: the DAGScheduler and the TaskScheduler.

Steps 3-7: The TaskScheduler (through its backend) registers the application with the Master. The Master then assigns the application to Workers and tells them to launch Executor processes; once an Executor starts, it registers itself back with the TaskScheduler (reverse registration).

Steps 8-10: In the Spark program we write ourselves, every action produces a job. The DAGScheduler splits each job into stages at shuffle boundaries; each stage corresponds to a set of tasks, which the DAGScheduler wraps in a TaskSet and submits to the TaskScheduler. The TaskScheduler ships the TaskSet to the Executors; inside an Executor, each task is wrapped in a TaskRunner and run on a thread taken from the Executor's thread pool, and the execution status and results are reported back to the Driver.
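For example, in the hedged sketch below (the input path is made up, and sc is assumed to be an already-created SparkContext), reduceByKey introduces a shuffle, so the single job triggered by collect() is split into two stages:

val counts = sc.textFile("hdfs://namenode:8020/input.txt")  // hypothetical path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // shuffle dependency -> stage boundary here
  .collect()            // action -> submits one job to the DAGScheduler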

-------------------------------------------------

That is the end-to-end flow of a Spark application. Now let's start the analysis from the very beginning, new SparkContext(), which is the first step of every Spark program.
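As a reminder of where this call sits, here is a minimal driver skeleton (the object name, app name and master URL are made up):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {                              // hypothetical object name
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("word-count")                    // hypothetical app name
      .setMaster("spark://master:7077")            // standalone master URL (made-up host)
    val sc = new SparkContext(conf)                // everything analyzed below starts here
    // ... transformations and actions on RDDs ...
    sc.stop()
  }
}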

Before diving in, here is a diagram of the SparkContext initialization flow:

[Figure 2: SparkContext initialization flow]

Open the Spark source code and go to the SparkContext.scala class:

_heartbeatReceiver = env.rpcEnv.setupEndpoint(
  HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master)  // ------ (1)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()

Stepping into (1), createTaskScheduler:

private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, 1)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_REGEX(threads) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      if (threadCount <= 0) {
        throw new SparkException(s"Asked to run locally with $threadCount threads")
      }
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*, M] means the number of cores on the computer with M failures
      // local[N, M] means exactly N threads with M failures
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)

    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)

    // ... (the remaining cases are omitted)

This is a pattern match on the master URL ("local", "local[N]" or "local[*]", "local[N, M]", "spark://host:port", and so on); let's look at the last case shown, the SPARK_REGEX branch used for a standalone cluster.

It first creates a TaskSchedulerImpl and a SparkDeploySchedulerBackend, and then calls the TaskScheduler's initialize method:

def initialize(backend: SchedulerBackend) {
  this.backend = backend
  // temporarily set rootPool name to empty
  rootPool = new Pool("", schedulingMode, 0, 0)
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
    }
  }
  schedulableBuilder.buildPools()
}

This method essentially initializes the scheduler pool; there are two scheduling modes, FIFO and FAIR.
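Which branch of that match runs is controlled by the spark.scheduler.mode setting (FIFO by default). A minimal sketch of switching to the FAIR pool, with a made-up application name:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("fair-scheduling-demo")    // hypothetical name
  .set("spark.scheduler.mode", "FAIR")   // default is FIFO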

Once initialize finishes, the match arm returns a tuple made up of the SparkDeploySchedulerBackend and the TaskSchedulerImpl.

Back in SparkContext, the next step creates a DAGScheduler object.

Then step into taskScheduler.start():

override def start() {
  backend.start()

  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}

Focus on the first line, backend.start(); here the backend is the SparkDeploySchedulerBackend:

override def start() {
  // ...
  val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory,
    command, appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor)
  client = new AppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
  client.start()
  launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
  waitForRegistration()
  launcherBackend.setState(SparkAppHandle.State.RUNNING)
}

We can skip the code before this point; the interesting piece is appDesc, an ApplicationDescription object. Stepping into it, we can see its fields:

name: String,
maxCores: Option[Int],
memoryPerExecutorMB: Int,
command: Command,
appUiUrl: String,
eventLogDir: Option[URI] = None,
// short name of compression codec used when writing event logs, if any (e.g. lzf)
eventLogCodec: Option[String] = None,
coresPerExecutor: Option[Int] = None,
user: String = System.getProperty("user.name", ""))

These are essentially the parameters we pass along with spark-submit: application name, cores, executor memory, and so on.
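For reference, a hedged sketch (all values are made up) of how the usual driver-side settings line up with the fields above:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("myApp")                  // -> name
  .setMaster("spark://master:7077")     //    the master(s) the AppClient will connect to
  .set("spark.cores.max", "8")          // -> maxCores
  .set("spark.executor.memory", "2g")   // -> memoryPerExecutorMB
  .set("spark.executor.cores", "2")     // -> coresPerExecutor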

Next, an AppClient is created and its start() method is called; its core is registering a ClientEndpoint with the RpcEnv:

endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))

Note the ClientEndpoint's onStart method:

override def onStart(): Unit = {
  try {
    registerWithMaster(1)
  } catch {
    case e: Exception =>
      logWarning("Failed to connect to master", e)
      markDisconnected()
      stop()
  }
}

onStart is where the client registers the application with the Master:

private def registerWithMaster(nthRetry: Int) {
  registerMasterFutures.set(tryRegisterAllMasters())
  registrationRetryTimer.set(registrationRetryThread.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      Utils.tryOrExit {
        if (registered.get) {
          registerMasterFutures.get.foreach(_.cancel(true))
          registerMasterThreadPool.shutdownNow()
        } else if (nthRetry >= REGISTRATION_RETRIES) {
          markDead("All masters are unresponsive! Giving up.")
        } else {
          registerMasterFutures.get.foreach(_.cancel(true))
          registerWithMaster(nthRetry + 1)
        }
      }
    }
  }, REGISTRATION_TIMEOUT_SECONDS, REGISTRATION_TIMEOUT_SECONDS, TimeUnit.SECONDS))
}

At its core this is a retry mechanism driven by a scheduled thread pool: registration requests are sent asynchronously to every Master, and a timer checks whether any of them succeeded, retrying up to REGISTRATION_RETRIES times before marking the client dead.
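A stripped-down sketch of the same retry pattern (the names and values below are made up; this is not the Spark source):

import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

object RegistrationRetryDemo {
  private val registered = new AtomicBoolean(false)   // flipped once a master answers
  private val retryTimer = Executors.newSingleThreadScheduledExecutor()
  private val MaxRetries = 3
  private val TimeoutSeconds = 20L

  // placeholder for "send a registration request to every known master"
  private def tryRegisterAllMasters(): Unit = ()

  def registerWithMaster(nthRetry: Int): Unit = {
    tryRegisterAllMasters()
    retryTimer.schedule(new Runnable {
      override def run(): Unit = {
        if (registered.get) {
          retryTimer.shutdownNow()                             // success: stop retrying
        } else if (nthRetry >= MaxRetries) {
          println("All masters are unresponsive! Giving up.")  // Spark marks the client dead here
        } else {
          registerWithMaster(nthRetry + 1)                     // resend and re-arm the check
        }
      }
    }, TimeoutSeconds, TimeUnit.SECONDS)
  }
}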

--------

Now let's move into the Master.scala class.

 

Find the following case in the receive method:

case RegisterApplication(description, driver) => {
  // TODO Prevent repeated registrations from some driver
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    val app = createApplication(description, driver)
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    persistenceEngine.addApplication(app)
    driver.send(RegisteredApplication(app.id, self))
    schedule()
  }
}

It first checks the Master's state: only a Master that is not in STANDBY continues.

It then creates an app from the ApplicationDescription and registers it:

val appAddress = app.driver.address
if (addressToApp.contains(appAddress)) {
  logInfo("Attempted to re-register application at same address: " + appAddress)
  return
}

applicationMetricsSystem.registerSource(app.appSource)
apps += app
idToApp(app.id) = app
endpointToApp(app.driver) = app
addressToApp(appAddress) = app
waitingApps += app

This is the body of registerApplication: it essentially just records the app in the Master's in-memory collections, HashMaps and buffers such as idToApp, endpointToApp, addressToApp, apps and waitingApps.

Then the Master persists the app with the persistence engine. A quick note on the persistence engine: the most common implementation is ZooKeeper-based, i.e. the application info is persisted into ZooKeeper so that a standby Master can recover it for HA.
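For reference, ZooKeeper-backed recovery is usually enabled by passing properties like the following to the Master (commonly via SPARK_DAEMON_JAVA_OPTS in spark-env.sh; the hosts and znode path below are made up):

-Dspark.deploy.recoveryMode=ZOOKEEPER
-Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181
-Dspark.deploy.zookeeper.dir=/spark

After persisting the app, the Master sends a RegisteredApplication message back to the driver and then calls schedule():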

private def schedule(): Unit = {
  if (state != RecoveryState.ALIVE) { return }
  // Drivers take strict precedence over executors
  val shuffledWorkers = Random.shuffle(workers) // Randomization helps balance drivers
  for (worker <- shuffledWorkers if worker.state == WorkerState.ALIVE) {
    for (driver <- waitingDrivers) {
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver)
        waitingDrivers -= driver
      }
    }
  }
  startExecutorsOnWorkers()
}

The most important part of schedule() is its last line, startExecutorsOnWorkers(), which launches the Executors on the Workers.

-------------

Next, a quick look at the DAGScheduler.

This class contains an eventProcessLoop (a DAGSchedulerEventProcessLoop). Events such as job submissions are posted onto this loop and handled one by one on a dedicated event thread, and those handlers are what ultimately drive the TaskScheduler underneath.
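A simplified sketch of that event-loop pattern (this is not the Spark source; all names are made up): events are put on a blocking queue and consumed one at a time by a dedicated daemon thread.

import java.util.concurrent.LinkedBlockingDeque

abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (!stopped) {
          onReceive(eventQueue.take())   // blocks until the next event is posted
        }
      } catch {
        case _: InterruptedException =>  // stop() interrupts the blocked take()
      }
    }
  }

  def start(): Unit = eventThread.start()
  def stop(): Unit = { stopped = true; eventThread.interrupt() }
  def post(event: E): Unit = eventQueue.put(event)

  protected def onReceive(event: E): Unit  // e.g. handle a job-submitted event
}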

------

SparkContext also creates a SparkUI.

Under the hood, the SparkUI starts an embedded Jetty server to serve the web UI pages, which listens on port 4040 by default.
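The port can be changed through spark.ui.port; a minimal hedged snippet (the app name is made up):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ui-demo")          // hypothetical name
  .set("spark.ui.port", "4041")   // default is 4040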
