Study Notes on "Apache Spark Source Code Analysis" (《Apache Spark源码剖析》): SparkContext

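Overview of SparkContext initialization

SparkContext is the main entry point for Spark application development. It is the intermediary between the upper-level Spark application and the underlying implementation.

During initialization, SparkContext mainly involves the following components: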

  • SparkEnv
  • DAGScheduler
  • TaskScheduler
  • SchedulerBackend
  • WebUI
The most important argument of the SparkContext constructor is SparkConf.

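Step 1: build a SparkConf from the constructor arguments, then create the SparkEnv from that SparkConf. According to the book, SparkEnv mainly contains these key components: BlockManager, MapOutputTracker, ShuffleFetcher, ConnectionManager. (The SparkEnv class quoted further down comes from a later Spark version: it has shuffleManager and blockTransferService fields and no ShuffleFetcher or ConnectionManager.)
Creating SparkEnv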
--------------------------------------------------------------------------------------------
// Create the Spark execution environment (cache, map output tracker, etc)
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)
-------------------------------------------------------------------------------------------
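The SparkEnv.set(_env) call above publishes the environment through a process-wide holder so that other code can later retrieve it with SparkEnv.get. A minimal sketch of that holder pattern follows. MiniEnv and its fields are hypothetical stand-ins invented for this note, not the real SparkEnv, and the real class may differ in detail.

```scala
// Hypothetical stand-in for SparkEnv; only the set/get holder pattern is shown.
final case class MiniEnv(executorId: String, settings: Map[String, String])

object MiniEnv {
  // @volatile makes a reference written by one thread visible to the others.
  @volatile private var env: MiniEnv = null

  /** Publishes the environment for the whole process. */
  def set(e: MiniEnv): Unit = { env = e }

  /** Returns the environment most recently published with set (null if none). */
  def get: MiniEnv = env
}
```

With this in place, the driver side would call MiniEnv.set(...) once during initialization, and any later code can read MiniEnv.get without having the object passed to it explicitly.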
SparkEnv
------------------------------------------------------------------------------------------------
@DeveloperApi
class SparkEnv (
    val executorId: String,
    private[spark] val rpcEnv: RpcEnv,
    val serializer: Serializer,
    val closureSerializer: Serializer,
    val cacheManager: CacheManager, // stores intermediate computation results
    val mapOutputTracker: MapOutputTracker, // caches MapStatus info and can fetch it from MapOutputTrackerMaster
    val shuffleManager: ShuffleManager, // shuffle manager (annotated in the notes as a routing table)
    val broadcastManager: BroadcastManager, // broadcast variables
    val blockTransferService: BlockTransferService,
    val blockManager: BlockManager, // block management
    val securityManager: SecurityManager, // security management
    val httpFileServer: HttpFileServer, // file server
    val sparkFilesDir: String, // directory for stored files
    val metricsSystem: MetricsSystem, // metrics
    val shuffleMemoryManager: ShuffleMemoryManager,
    val executorMemoryManager: ExecutorMemoryManager,
    val outputCommitCoordinator: OutputCommitCoordinator,
    val conf: SparkConf) extends Logging { // configuration
--------------------------------------------------------------------------------------------------
Step 2: create the TaskScheduler, choosing the SchedulerBackend that matches Spark's run mode, and then start the TaskScheduler. This step is crucial.
Creating TaskScheduler
-------------------------------------------------------------------------------------------------
val (sched, ts) = SparkContext.createTaskScheduler(this, master)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()
------------------------------------------------------------------
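The key point of createTaskScheduler is that it inspects the master URL to determine how Spark is currently deployed, and creates the matching subclass of SchedulerBackend. The SchedulerBackend is handed to the TaskScheduler through scheduler.initialize(backend) and plays an important role later when tasks are dispatched.
createTaskScheduler decides the deployment mode as follows: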
---------------------------------------------------------------------
master match {
  case "local" =>
    val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
    val backend = new LocalBackend(sc.getConf, scheduler, 1)
    scheduler.initialize(backend)
    (backend, scheduler)

  case LOCAL_N_REGEX(threads) =>
    def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
    // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
    val threadCount = if (threads == "*") localCpuCount else threads.toInt
    if (threadCount <= 0) {
      throw new SparkException(s"Asked to run locally with $threadCount threads")
    }
    val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
    val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
    scheduler.initialize(backend)
    (backend, scheduler)

  case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
    def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
    // local[*, M] means the number of cores on the computer with M failures
    // local[N, M] means exactly N threads with M failures
    val threadCount = if (threads == "*") localCpuCount else threads.toInt
    val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
    val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
    scheduler.initialize(backend)
    (backend, scheduler)

  case SPARK_REGEX(sparkUrl) =>
    val scheduler = new TaskSchedulerImpl(sc)
    val masterUrls = sparkUrl.split(",").map("spark://" + _)
    val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
    scheduler.initialize(backend)
    (backend, scheduler)

  case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
    // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
    val memoryPerSlaveInt = memoryPerSlave.toInt
    if (sc.executorMemory > memoryPerSlaveInt) {
      throw new SparkException(
        "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
          memoryPerSlaveInt, sc.executorMemory))
    }

    val scheduler = new TaskSchedulerImpl(sc)
    val localCluster = new LocalSparkCluster(
      numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
    val masterUrls = localCluster.start()
    val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
    scheduler.initialize(backend)
    backend.shutdownCallback = (backend: SparkDeploySchedulerBackend) => {
      localCluster.stop()
    }
    (backend, scheduler)

  case "yarn-standalone" | "yarn-cluster" =>
    if (master == "yarn-standalone") {
      logWarning(
        "\"yarn-standalone\" is deprecated as of Spark 1.0. Use \"yarn-cluster\" instead.")
    }
    val scheduler = try {
      val clazz = Utils.classForName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
      val cons = clazz.getConstructor(classOf[SparkContext])
      cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
    } catch {
      // TODO: Enumerate the exact reasons why it can fail
      // But irrespective of it, it means we cannot proceed !
      case e: Exception => {
        throw new SparkException("YARN mode not available ?", e)
      }
    }
    val backend = try {
      val clazz =
        Utils.classForName("org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend")
      val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
      cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
    } catch {
      case e: Exception => {
        throw new SparkException("YARN mode not available ?", e)
      }
    }
    scheduler.initialize(backend)
    (backend, scheduler)

  case "yarn-client" =>
    val scheduler = try {
      val clazz = Utils.classForName("org.apache.spark.scheduler.cluster.YarnScheduler")
      val cons = clazz.getConstructor(classOf[SparkContext])
      cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]

    } catch {
      case e: Exception => {
        throw new SparkException("YARN mode not available ?", e)
      }
    }

    val backend = try {
      val clazz =
        Utils.classForName("org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend")
      val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
      cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
    } catch {
      case e: Exception => {
        throw new SparkException("YARN mode not available ?", e)
      }
    }

    scheduler.initialize(backend)
    (backend, scheduler)

  case mesosUrl @ MESOS_REGEX(_) =>
    MesosNativeLibrary.load()
    val scheduler = new TaskSchedulerImpl(sc)
    val coarseGrained = sc.conf.getBoolean("spark.mesos.coarse", false)
    val url = mesosUrl.stripPrefix("mesos://") // strip scheme from raw Mesos URLs
    val backend = if (coarseGrained) {
      new CoarseMesosSchedulerBackend(scheduler, sc, url, sc.env.securityManager)
    } else {
      new MesosSchedulerBackend(scheduler, sc, url)
    }
    scheduler.initialize(backend)
    (backend, scheduler)

  case SIMR_REGEX(simrUrl) =>
    val scheduler = new TaskSchedulerImpl(sc)
    val backend = new SimrSchedulerBackend(scheduler, sc, simrUrl)
    scheduler.initialize(backend)
    (backend, scheduler)

  case _ =>
    throw new SparkException("Could not parse Master URL: '" + master + "'")
}
-------------------------------------------------------------------------------------
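The dispatch above relies on Scala regex extractors: each case only fires when the whole master string matches the pattern, and the capture groups become the bound variables. The sketch below reproduces just that mechanism in plain Scala with no Spark dependency. The regular expressions are approximations written for this note, not copied from the Spark source, and classify is a hypothetical helper that returns a description instead of building a scheduler.

```scala
// Sketch only: patterns and names are assumptions made for illustration.
object MasterUrlSketch {
  private val LocalN         = """local\[([0-9]+|\*)\]""".r
  private val LocalNFailures = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
  private val LocalCluster   = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*\]""".r
  private val SparkUrl       = """spark://(.*)""".r
  private val MesosUrl       = """(mesos|zk)://.*""".r

  /** Returns a short description of the deployment mode selected by `master`. */
  def classify(master: String): String = master match {
    case "local" =>
      "local, 1 thread"
    case LocalN(threads) =>
      s"local, $threads threads"
    case LocalNFailures(threads, maxFailures) =>
      s"local, $threads threads, $maxFailures task failures allowed"
    case SparkUrl(hosts) =>
      // A comma-separated host list becomes one spark:// URL per master.
      val masters = hosts.split(",").map("spark://" + _).mkString(", ")
      s"standalone, masters: $masters"
    case LocalCluster(workers, cores, memory) =>
      s"local-cluster, $workers workers, $cores cores each, $memory MB each"
    case "yarn-cluster" | "yarn-client" =>
      "yarn"
    case MesosUrl(_) =>
      "mesos"
    case other =>
      throw new IllegalArgumentException(s"Could not parse Master URL: '$other'")
  }
}
```

For example, MasterUrlSketch.classify("local[2, 3]") takes the LocalNFailures branch, because a regex extractor in a pattern match has to match the entire input and "local[2, 3]" does not fit the single-group LocalN pattern.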
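_taskScheduler.start() starts the corresponding SchedulerBackend and, when spark.speculation is enabled outside local mode, starts a timer that periodically checks for speculatable tasks.
Starting SchedulerBackend (TaskSchedulerImpl.start)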
-------------------------------------------------------------------------------------
override def start() {
  backend.start()

  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}
-------------------------------------------------------------------------------------
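The timer in start() is an ordinary fixed-rate task on a scheduled executor. Below is a small self-contained sketch of that mechanism using java.util.concurrent directly. PeriodicCheckSketch and runChecks are hypothetical names, and the counter increment merely stands in for checkSpeculatableTasks().

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object PeriodicCheckSketch {
  /**
   * Schedules a dummy check every `intervalMs` milliseconds, waits until it
   * has fired at least `times` times (or 10 seconds pass), then stops the
   * timer and returns how many runs were counted.
   */
  def runChecks(times: Int, intervalMs: Long): Int = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val runs = new AtomicInteger(0)
    val done = new CountDownLatch(times)
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        runs.incrementAndGet() // placeholder for the periodic check
        done.countDown()
      }
    }, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
    try {
      done.await(10, TimeUnit.SECONDS)
    } finally {
      scheduler.shutdownNow() // always stop the timer thread
    }
    runs.get()
  }
}
```

As in the excerpt, the same value is passed as both the initial delay and the period, so the first check happens one interval after the timer is started.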
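Step 3: create the DAGScheduler and start it, using the TaskScheduler instance created in the previous step. In the excerpt the constructor argument is the SparkContext (this), through which the DAGScheduler reaches that TaskScheduler.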
-------------------------------------------------------------------------------------
_dagScheduler = new DAGScheduler(this)
-------------------------------------------------------------------------------------
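The comment in the step 2 excerpt says the TaskScheduler receives its DAGScheduler reference inside the DAGScheduler constructor, which is why start() is only called after the DAGScheduler has been constructed. The sketch below illustrates that ordering constraint with miniature classes. MiniTaskScheduler, MiniDagScheduler and WiringSketch are hypothetical names invented for this note, and the require check is an assumption added to make the ordering visible.

```scala
// Hypothetical miniature classes; none of these names exist in Spark.
class MiniTaskScheduler {
  private var dag: Option[MiniDagScheduler] = None
  private var started = false

  def setDagScheduler(d: MiniDagScheduler): Unit = { dag = Some(d) }

  def start(): Unit = {
    // Assumed check: refuse to start before a DAG scheduler is registered.
    require(dag.isDefined, "DAG scheduler must be registered before start()")
    started = true
  }

  def isStarted: Boolean = started
}

class MiniDagScheduler(val taskScheduler: MiniTaskScheduler) {
  // Registration happens during construction, so merely creating the
  // DAG scheduler is enough to wire the two objects together.
  taskScheduler.setDagScheduler(this)
}

object WiringSketch {
  /** Creates both schedulers in the same order as the excerpt, then starts. */
  def init(): (MiniTaskScheduler, MiniDagScheduler) = {
    val ts = new MiniTaskScheduler
    val dag = new MiniDagScheduler(ts) // registers itself with ts
    ts.start()                         // safe only after the line above
    (ts, dag)
  }
}
```

Swapping the last two statements of init() would make start() fail in this sketch, which is the situation the ordering in SparkContext avoids.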
Step 4: start the WebUI
-------------------------------------------------------------------------------------
_ui =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
      _env.securityManager, appName, startTime = startTime))
  } else {
    // For tests, do not enable the UI
    None
  }
// Bind the UI before starting the task scheduler to communicate
// the bound port to the cluster manager properly
_ui.foreach(_.bind())
--------------------------------------------------------------------------------------
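The WebUI code combines two small idioms: a boolean setting read with a default value, and an Option whose foreach runs the bind step only when a UI was actually created. A rough sketch of both follows, using a plain Map in place of SparkConf. UiToggleSketch, MiniUi and this getBoolean helper are hypothetical and written only for this note.

```scala
object UiToggleSketch {
  // Hypothetical stand-in for the UI object; it only records that bind() ran.
  final class MiniUi {
    var bound = false
    def bind(): Unit = { bound = true }
  }

  /** Reads a boolean flag from a plain Map, using `default` when the key is absent. */
  def getBoolean(conf: Map[String, String], key: String, default: Boolean): Boolean =
    conf.get(key).map(_.toBoolean).getOrElse(default)

  /** Creates and binds the UI unless "spark.ui.enabled" is set to false. */
  def createUi(conf: Map[String, String]): Option[MiniUi] = {
    val ui =
      if (getBoolean(conf, "spark.ui.enabled", default = true)) Some(new MiniUi)
      else None
    ui.foreach(_.bind()) // does nothing when ui is None
    ui
  }
}
```

Because the default is true, an empty configuration yields a bound UI, while setting the key to "false" yields None and skips bind() entirely.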