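Spark Source Code Walkthrough: SparkContext Initialization

Initializing SparkContext is a prerequisite for submitting and running a Driver application. This article uses local mode to walk through the SparkContext initialization process.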

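This article uses the following code as its example:

val conf = new SparkConf().setAppName("mytest").setMaster("local[2]")
val sc = new SparkContext(conf)

Run it in debug mode and step through the initialization to analyze it.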

I. SparkConf Overview

    SparkContext must be given a SparkConf in order to initialize; SparkConf maintains Spark's configuration properties. The official description:

Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
A quick look at the SparkConf source:

class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {

  import SparkConf._

  /** Create a SparkConf that loads defaults from system properties and the classpath */
  def this() = this(true)

  private val settings = new ConcurrentHashMap[String, String]()

  if (loadDefaults) {
    // Load any spark.* system properties
    for ((key, value) <- Utils.getSystemProperties if key.startsWith("spark.")) {
      set(key, value)
    }
  }

  /** Set a configuration variable. */
  def set(key: String, value: String): SparkConf = {
    if (key == null) {
      throw new NullPointerException("null key")
    }
    if (value == null) {
      throw new NullPointerException("null value for " + key)
    }
    logDeprecationWarning(key)
    settings.put(key, value)
    this
  }

  /**
   * The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to
   * run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
   */
  def setMaster(master: String): SparkConf = {
    set("spark.master", master)
  }

  /** Set a name for your application. Shown in the Spark web UI. */
  def setAppName(name: String): SparkConf = {
    set("spark.app.name", name)
  }
// omitted
}

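SparkConf uses a ConcurrentHashMap internally to hold all of its settings. Because every setter returns this (the SparkConf object itself), calls can be chained to set several properties in one expression.

For example: new SparkConf().setAppName("mytest").setMaster("local[2]")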
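The chaining behavior can be illustrated with a minimal sketch. MiniConf and MiniConfDemo are hypothetical stand-ins written for this article, not Spark classes; they only mirror the set/setMaster/setAppName shape shown above.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for SparkConf, reduced to the parts discussed above.
class MiniConf {
  private val settings = new ConcurrentHashMap[String, String]()

  // Returning `this` is what makes chained calls possible.
  def set(key: String, value: String): MiniConf = {
    if (key == null) throw new NullPointerException("null key")
    if (value == null) throw new NullPointerException("null value for " + key)
    settings.put(key, value)
    this
  }

  def setMaster(master: String): MiniConf = set("spark.master", master)
  def setAppName(name: String): MiniConf = set("spark.app.name", name)

  // ConcurrentHashMap.get returns null for a missing key; Option(...) turns that into None.
  def get(key: String): Option[String] = Option(settings.get(key))
}

object MiniConfDemo {
  def build(): MiniConf = new MiniConf().setAppName("mytest").setMaster("local[2]")
}
```

Each setter hands back the same object, so the chain mutates a single configuration instance rather than creating copies.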


II. SparkContext Initialization

SparkContext initialization mainly consists of the following steps:

1) Create the JobProgressListener

2) Create the SparkEnv

3) Create


1. Copy the SparkConf settings, then validate them or add new ones

The primary constructor of SparkContext takes a SparkConf:

class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {

  // The call site where this SparkContext was constructed.
  private val creationSite: CallSite = Utils.getCallSite()

  // If true, log warnings instead of throwing exceptions when multiple SparkContexts are active
  private val allowMultipleContexts: Boolean =
    config.getBoolean("spark.driver.allowMultipleContexts", false)

  // In order to prevent multiple SparkContexts from being active at the same time, mark this
  // context as having started construction.
  // NOTE: this must be placed at the beginning of the SparkContext constructor.
  SparkContext.markPartiallyConstructed(this, allowMultipleContexts)
// omitted
}

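    The getCallSite method returns a CallSite object, which records the user class closest to the top of the thread stack and the Scala or Spark core class closest to the bottom. By default only one SparkContext instance may be active; this is controlled by the "spark.driver.allowMultipleContexts" property. The markPartiallyConstructed method enforces that uniqueness and marks the current SparkContext as under construction.

    After that, several other fields are initialized. For example, SparkContext declares an internal SparkConf field named _conf, assigns it a clone of the config that was passed in, and then validates the settings.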

private var _conf: SparkConf = _
...

_conf = config.clone()
_conf.validateSettings()  // Checks for illegal or deprecated config settings

if (!_conf.contains("spark.master")) {
  throw new SparkException("A master URL must be set in your configuration")
}
if (!_conf.contains("spark.app.name")) {
  throw new SparkException("An application name must be set in your configuration")
}

As the code above shows, both spark.master (the run mode) and spark.app.name (the application name) must be set; otherwise an exception is thrown.
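The copy-then-validate step can be sketched without Spark on the classpath. ConfCheckSketch and copyAndValidate are hypothetical names, a mutable Map stands in for SparkConf, and IllegalStateException stands in for SparkException.

```scala
import scala.collection.mutable

object ConfCheckSketch {
  // Hypothetical helper: copy the caller's settings, then check the two required keys.
  def copyAndValidate(config: mutable.Map[String, String]): mutable.Map[String, String] = {
    // Defensive copy, so later changes to `config` do not leak into the context.
    val conf = mutable.Map.empty[String, String] ++= config

    if (!conf.contains("spark.master")) {
      throw new IllegalStateException("A master URL must be set in your configuration")
    }
    if (!conf.contains("spark.app.name")) {
      throw new IllegalStateException("An application name must be set in your configuration")
    }
    conf
  }
}
```

Working on a copy means that changing the original configuration after the context has been built has no effect on the copy held by the context.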


2. Create SparkEnv

    SparkEnv holds the runtime environment objects of a Spark application. The official description:

* Holds all the runtime environment objects for a running Spark instance (either master or worker),
* including the serializer, Akka actor system, block manager, map output tracker, etc. Currently
* Spark code finds the SparkEnv through a global variable, so all the threads can access the same
* SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).

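   In other words, SparkEnv bundles the runtime environment objects of a Spark instance (master or worker), including the serializer, the Akka actor system, the block manager, the map output tracker, and so on.

    The part of SparkContext that creates the SparkEnv instance: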

// An asynchronous listener bus for Spark events
private[spark] val listenerBus = new LiveListenerBus

// This function allows components created by SparkEnv to be mocked in unit tests:
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
}

...

// "_jobProgressListener" should be set up before creating SparkEnv because when creating
// "SparkEnv", some messages will be posted to "listenerBus" and we should not miss them.
_jobProgressListener = new JobProgressListener(_conf)
listenerBus.addListener(jobProgressListener)

// Create the Spark execution environment (cache, map output tracker, etc)
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)

In other words, this ends up calling the SparkEnv.createDriverEnv method:

/**
* Create a SparkEnv for the driver.
*/
private[spark] def createDriverEnv(
  conf: SparkConf,
  isLocal: Boolean,
  listenerBus: LiveListenerBus,
  numCores: Int,
  mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
	assert(conf.contains("spark.driver.host"), "spark.driver.host is not set on the driver!")
	assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
	val hostname = conf.get("spark.driver.host")
	val port = conf.get("spark.driver.port").toInt
	create(
	  conf,
	  SparkContext.DRIVER_IDENTIFIER,
	  hostname,
	  port,
	  isDriver = true,
	  isLocal = isLocal,
	  numUsableCores = numCores,
	  listenerBus = listenerBus,
	  mockOutputCommitCoordinator = mockOutputCommitCoordinator
	)
}

private def create(
	conf: SparkConf,
	executorId: String,
	hostname: String,
	port: Int,
	isDriver: Boolean,
	isLocal: Boolean,
	numUsableCores: Int,
	listenerBus: LiveListenerBus = null,
	mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
	
...

	val envInstance = new SparkEnv(
	  executorId,
	  rpcEnv,
	  actorSystem,  // Akka-based distributed messaging system
	  serializer,
	  closureSerializer,
	  cacheManager,  // cache manager
	  mapOutputTracker,  // map task output tracker
	  shuffleManager,  // shuffle manager
	  broadcastManager,  // broadcast manager
	  blockTransferService,  // block transfer service
	  blockManager,  // block manager
	  securityManager,  // security manager
	  sparkFilesDir,
	  metricsSystem,  // metrics system
	  memoryManager,  // memory manager
	  outputCommitCoordinator,
	  conf)
	  
...

}

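     SparkEnv's createDriverEnv method calls the private create method, which builds the serializer, closureSerializer, cacheManager, and the other components, and then constructs the SparkEnv object from them.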
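The source comment quoted earlier says the JobProgressListener must be registered before SparkEnv is created, because events are posted to the listener bus during creation. The sketch below illustrates that ordering concern only. MiniBus, createEnv, and eventsSeen are hypothetical, and the bus here is synchronous and unbuffered, unlike the real LiveListenerBus.

```scala
import scala.collection.mutable.ArrayBuffer

object ListenerOrderSketch {
  // Hypothetical, synchronous stand-in for a listener bus.
  class MiniBus {
    private val listeners = ArrayBuffer.empty[String => Unit]
    def addListener(l: String => Unit): Unit = listeners += l
    def post(event: String): Unit = listeners.foreach(l => l(event))
  }

  // Stand-in for environment creation: it posts an event while it is being built.
  def createEnv(bus: MiniBus): String = {
    bus.post("env-component-registered")
    "env"
  }

  // Returns the events observed by a listener registered before or after env creation.
  def eventsSeen(registerFirst: Boolean): List[String] = {
    val bus = new MiniBus
    val seen = ArrayBuffer.empty[String]
    val listener: String => Unit = e => seen += e
    if (registerFirst) bus.addListener(listener)
    createEnv(bus)
    if (!registerFirst) bus.addListener(listener)
    seen.toList
  }
}
```

In this simplified model, a listener added after createEnv never sees the event, which is the situation the ordering in SparkContext avoids.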


3. Create MetadataCleaner

    MetadataCleaner periodically cleans up metadata. There are six kinds of metadata, enumerated in the MetadataCleanerType object.

private[spark] object MetadataCleanerType extends Enumeration {

  val MAP_OUTPUT_TRACKER, SPARK_CONTEXT, HTTP_BROADCAST, BLOCK_MANAGER,
  SHUFFLE_BLOCK_MANAGER, BROADCAST_VARS = Value

  type MetadataCleanerType = Value

  def systemProperty(which: MetadataCleanerType.MetadataCleanerType): String = {
    "spark.cleaner.ttl." + which.toString
  }
}

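MAP_OUTPUT_TRACKER: output metadata of map tasks.

SPARK_CONTEXT: RDDs persisted in memory.

HTTP_BROADCAST: metadata for broadcasts distributed over HTTP.

BLOCK_MANAGER: Block data in the BlockManager that is not of Broadcast type.

SHUFFLE_BLOCK_MANAGER: data written by shuffle output.

BROADCAST_VARS: metadata for broadcasts distributed with the Torrent method, which relies on the BlockManager underneath.

    During SparkContext initialization, a MetadataCleaner is created to clean up the RDDs persisted in memory.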

  // Keeps track of all persisted RDDs
  private[spark] val persistentRdds = new TimeStampedWeakValueHashMap[Int, RDD[_]]
  
  /** Called by MetadataCleaner to clean up the persistentRdds map periodically */
  private[spark] def cleanup(cleanupTime: Long) {
    persistentRdds.clearOldValues(cleanupTime)
  }
  
  _metadataCleaner = new MetadataCleaner(MetadataCleanerType.SPARK_CONTEXT, this.cleanup, _conf)
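The idea behind clearOldValues can be sketched as follows. TimeStampedMap here is a hypothetical simplification: it stores explicit timestamps supplied by the caller and holds values strongly, whereas the real TimeStampedWeakValueHashMap also holds values weakly.

```scala
import scala.collection.mutable

object CleanerSketch {
  // Hypothetical simplification of a timestamped map.
  class TimeStampedMap[K, V] {
    private val entries = mutable.Map.empty[K, (V, Long)]

    def put(key: K, value: V, timestamp: Long): Unit =
      entries.update(key, (value, timestamp))

    def get(key: K): Option[V] = entries.get(key).map(_._1)

    def size: Int = entries.size

    // Remove every entry whose timestamp is older than threshTime.
    def clearOldValues(threshTime: Long): Unit = {
      val stale = entries.iterator
        .filter { case (_, (_, ts)) => ts < threshTime }
        .map(_._1)
        .toList
      entries --= stale
    }
  }
}
```

A periodic cleaner would call clearOldValues with a threshold derived from the current time minus the configured TTL; how the real MetadataCleaner computes that threshold is not shown in the excerpt above.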


4. Create SparkStatusTracker

    SparkStatusTracker is a low-level status reporting API used to monitor jobs and stages.


5. Initialize the Spark UI

    SparkUI is Spark's web monitoring interface. It exposes the Spark environment and the full lifecycle of jobs for monitoring.


6. HadoopConfiguration

    Because Spark uses HDFS as its distributed file system by default, it needs to obtain the relevant Hadoop configuration.

_hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)

Let's look at what newConfiguration does.

  def newConfiguration(conf: SparkConf): Configuration = {
    val hadoopConf = new Configuration()

    // Note: this null check is around more than just access to the "conf" object to maintain
    // the behavior of the old implementation of this code, for backwards compatibility.
    if (conf != null) {
      // Explicitly check for S3 environment variables
      if (System.getenv("AWS_ACCESS_KEY_ID") != null &&
          System.getenv("AWS_SECRET_ACCESS_KEY") != null) {
        val keyId = System.getenv("AWS_ACCESS_KEY_ID")
        val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY")

        hadoopConf.set("fs.s3.awsAccessKeyId", keyId)
        hadoopConf.set("fs.s3n.awsAccessKeyId", keyId)
        hadoopConf.set("fs.s3a.access.key", keyId)
        hadoopConf.set("fs.s3.awsSecretAccessKey", accessKey)
        hadoopConf.set("fs.s3n.awsSecretAccessKey", accessKey)
        hadoopConf.set("fs.s3a.secret.key", accessKey)
      }
      // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
      conf.getAll.foreach { case (key, value) =>
        if (key.startsWith("spark.hadoop.")) {
          hadoopConf.set(key.substring("spark.hadoop.".length), value)
        }
      }
      val bufferSize = conf.get("spark.buffer.size", "65536")
      hadoopConf.set("io.file.buffer.size", bufferSize)
    }

    hadoopConf
  }

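1) Load the Amazon S3 AccessKeyId and SecretAccessKey (read from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables) into the Hadoop Configuration.

2) Copy every SparkConf property whose key starts with spark.hadoop. into the Hadoop Configuration, with that prefix stripped from the key.

3) Copy SparkConf's spark.buffer.size property (65536 if unset) into the Hadoop Configuration as io.file.buffer.size.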
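Steps 2) and 3) above can be sketched with plain maps. HadoopConfSketch and toHadoopSettings are hypothetical names; immutable Maps stand in for SparkConf and the Hadoop Configuration, and the S3 credential handling from step 1) is left out.

```scala
object HadoopConfSketch {
  private val Prefix = "spark.hadoop."

  // Hypothetical helper mirroring steps 2) and 3).
  def toHadoopSettings(sparkSettings: Map[String, String]): Map[String, String] = {
    // Step 2: keep only spark.hadoop.* keys and strip the prefix.
    val copied = sparkSettings.collect {
      case (key, value) if key.startsWith(Prefix) =>
        key.substring(Prefix.length) -> value
    }
    // Step 3: spark.buffer.size (default 65536) becomes io.file.buffer.size.
    val bufferSize = sparkSettings.getOrElse("spark.buffer.size", "65536")
    copied + ("io.file.buffer.size" -> bufferSize)
  }
}
```

For instance, a setting spark.hadoop.fs.defaultFS would appear in the result under the key fs.defaultFS, while keys without the prefix are not copied.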


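7. ExecutorEnvs

    The environment variables in executorEnvs are sent to the Master when the application registers. After the Master has scheduled the application onto Workers, each Worker uses the information in executorEnvs to launch its Executor.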

private[spark] val executorEnvs = HashMap[String, String]()

// Convert java options to env vars as a work around
// since we can't set env vars directly in sbt.
for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
  value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
  executorEnvs(envKey) = value
}
Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
  executorEnvs("SPARK_PREPEND_CLASSES") = v
}
// The Mesos scheduler backend relies on this environment variable to set executor memory.
// TODO: Set this only in the Mesos scheduler.
executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
executorEnvs ++= _conf.getExecutorEnv
executorEnvs("SPARK_USER") = sparkUser

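    As the code above shows, SPARK_EXECUTOR_MEMORY is set from executorMemory. The amount of memory an Executor uses can be specified through the spark.executor.memory property, or through the environment variables SPARK_EXECUTOR_MEMORY or SPARK_MEM.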
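A fallback chain of this kind is commonly written with Option.orElse, the same idiom the code above uses for SPARK_TESTING. The sketch below is an assumption-laden illustration: ExecutorMemorySketch and resolve are hypothetical, the precedence order (property first, then SPARK_EXECUTOR_MEMORY, then SPARK_MEM) and the "1024m" default are assumed rather than taken from the excerpt, and the environment is passed in as a Map so the function stays testable.

```scala
object ExecutorMemorySketch {
  // Hypothetical resolver. The precedence order and the default are assumptions.
  def resolve(
      conf: Map[String, String],
      env: Map[String, String],
      default: String = "1024m"): String =
    conf.get("spark.executor.memory")            // configuration property
      .orElse(env.get("SPARK_EXECUTOR_MEMORY"))  // environment variable
      .orElse(env.get("SPARK_MEM"))              // second environment variable
      .getOrElse(default)
}
```

Each orElse is evaluated only when the previous lookup returned None, so the first source that defines a value wins.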


8. Register HeartbeatReceiver

  We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)

    _heartbeatReceiver = env.rpcEnv.setupEndpoint(
      HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))


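9. Create TaskScheduler

    TaskScheduler is Spark's task scheduler: Spark submits tasks through it and asks the cluster to schedule them. The master setting is matched against the deployment mode to create a TaskSchedulerImpl, and a different SchedulerBackend is created depending on the cluster manager (local, local[n], standalone, local-cluster, Mesos, YARN).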

    val (sched, ts) = SparkContext.createTaskScheduler(this, master)
    _schedulerBackend = sched
    _taskScheduler = ts

The createTaskScheduler method uses pattern matching to create the appropriate TaskSchedulerImpl and backend. Because this example runs in local mode, a LocalBackend is returned.

    case LOCAL_N_REGEX(threads) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      if (threadCount <= 0) {
        throw new SparkException(s"Asked to run locally with $threadCount threads")
      }
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)
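How a master string such as local[2] is turned into a thread count can be sketched with a regex extractor. MasterMatchSketch and threadCount are hypothetical, and the pattern assigned to LocalN is an assumption, since the definition of LOCAL_N_REGEX is not shown in the excerpt. The available core count is passed in as a parameter to keep the result deterministic.

```scala
object MasterMatchSketch {
  // Assumed shape of the pattern; the article does not show LOCAL_N_REGEX itself.
  private val LocalN = """local\[([0-9]+|\*)\]""".r

  // Hypothetical helper: thread count implied by a local master URL.
  def threadCount(master: String, availableCores: Int): Int = master match {
    case "local" => 1 // plain "local" runs with one thread
    case LocalN(threads) =>
      // local[*] uses the machine's core count; local[N] uses exactly N threads.
      val n = if (threads == "*") availableCores else threads.toInt
      if (n <= 0) {
        throw new IllegalArgumentException(s"Asked to run locally with $n threads")
      }
      n
    case other =>
      throw new IllegalArgumentException(s"Not a local master URL: $other")
  }
}
```

A Scala Regex used as an extractor must match the whole string, so a value like local[2]x falls through to the last case instead of being partially matched.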

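9.1 Create TaskSchedulerImpl

    TaskSchedulerImpl is constructed as follows:

    1) Read settings from SparkConf, including the number of CPUs allocated to each task and the scheduling mode (FAIR or FIFO, FIFO by default).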

  val conf = sc.conf

  // How often to check for speculative tasks
  val SPECULATION_INTERVAL_MS = conf.getTimeAsMs("spark.speculation.interval", "100ms")

  private val speculationScheduler =
    ThreadUtils.newDaemonSingleThreadScheduledExecutor("task-scheduler-speculation")

  // Threshold above which we warn user initial TaskSet may be starved
  val STARVATION_TIMEOUT_MS = conf.getTimeAsMs("spark.starvation.timeout", "15s")

  // CPUs to request per task
  val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1)
  
  ...
  
  // default scheduler is FIFO
  private val schedulingModeConf = conf.get("spark.scheduler.mode", "FIFO")

    2) Create the TaskResultGetter, whose role is:

  Runs a thread pool that deserializes and remotely fetches (if necessary) task results.

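It uses a thread pool to process the task results sent back by the Executors on the Workers. By default, ThreadUtils.newDaemonFixedThreadPool creates a fixed-size pool of 4 threads whose names start with task-result-getter.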

  private val THREADS = sparkEnv.conf.getInt("spark.resultGetter.threads", 4)
  private val getTaskResultExecutor = ThreadUtils.newDaemonFixedThreadPool(
    THREADS, "task-result-getter")
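A fixed pool of named daemon threads can be built with the standard java.util.concurrent API, roughly as sketched here. DaemonPoolSketch and its methods are written for illustration and are not Spark's ThreadUtils; the naming scheme prefix-N is an assumption.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

object DaemonPoolSketch {
  // Illustrative helper: fixed-size pool whose threads are daemons named "<prefix>-<n>".
  def newDaemonFixedThreadPool(nThreads: Int, prefix: String): ExecutorService = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$prefix-${counter.getAndIncrement()}")
        t.setDaemon(true) // daemon threads do not keep the JVM alive
        t
      }
    }
    Executors.newFixedThreadPool(nThreads, factory)
  }

  // Runs one task on the pool and reports (thread name, isDaemon).
  def probe(): (String, Boolean) = {
    val pool = newDaemonFixedThreadPool(4, "task-result-getter")
    try {
      pool.submit(new Callable[(String, Boolean)] {
        override def call(): (String, Boolean) = {
          val t = Thread.currentThread()
          (t.getName, t.isDaemon)
        }
      }).get()
    } finally {
      pool.shutdown()
    }
  }
}
```

Marking the threads as daemons means a pool that is never shut down will not prevent the JVM from exiting.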

TaskSchedulerImpl has two scheduling modes, FAIR and FIFO. The final scheduling of tasks is actually carried out by the SchedulerBackend. In local mode, the SchedulerBackend is LocalBackend.


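9.2 Initializing TaskSchedulerImpl

    After TaskSchedulerImpl and LocalBackend have been created, TaskSchedulerImpl must be initialized by calling its initialize method. Taking the default FIFO scheduling as an example, the initialization proceeds as follows.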

  def initialize(backend: SchedulerBackend) {
    this.backend = backend
    // temporarily set rootPool name to empty
    rootPool = new Pool("", schedulingMode, 0, 0)
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
      }
    }
    schedulableBuilder.buildPools()
  }

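1) Make TaskSchedulerImpl hold a reference to the LocalBackend.

2) Create a Pool, which holds the scheduling queue, the scheduling algorithm, the set of TaskSetManagers, and related information.

3) Create a FIFOSchedulableBuilder, which is used to operate on the scheduling queue in the Pool.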
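The selection of a builder from the configured scheduling mode can be sketched as below. SchedulingModeSketch is hypothetical: the builders are represented only by their names, and upper-casing the configured value before matching is an assumption, not something shown in the excerpt.

```scala
object SchedulingModeSketch {
  sealed trait Mode
  case object FIFO extends Mode
  case object FAIR extends Mode

  // Assumption: the configured value is matched case-insensitively.
  def parse(name: String): Mode = name.toUpperCase match {
    case "FIFO" => FIFO
    case "FAIR" => FAIR
    case other => throw new IllegalArgumentException(s"Unrecognized scheduling mode: $other")
  }

  // Returns the name of the builder that initialize would create for this mode.
  def builderFor(mode: Mode): String = mode match {
    case FIFO => "FIFOSchedulableBuilder"
    case FAIR => "FairSchedulableBuilder"
  }

  // spark.scheduler.mode defaults to FIFO when it is not set.
  def builderFromConf(conf: Map[String, String]): String =
    builderFor(parse(conf.getOrElse("spark.scheduler.mode", "FIFO")))
}
```

Modeling the modes as a sealed trait lets the compiler check that the match in builderFor covers every mode.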


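10. Create DAGScheduler

    DAGScheduler's main role is to do the preparatory work before TaskSchedulerImpl actually submits tasks, including creating the Job, dividing the RDDs in the DAG into Stages, submitting the Stages, and so on. The code that creates the DAGScheduler is as follows: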

    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    _taskScheduler.start()

11. Start TaskScheduler

    Starting the TaskScheduler actually calls the backend's start method.

  override def start() {
    backend.start()

    if (!isLocal && conf.getBoolean("spark.speculation", false)) {
      logInfo("Starting speculative execution thread")
      speculationScheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
          checkSpeculatableTasks()
        }
      }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
    }
  }
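Note that the speculation check above is only scheduled when the scheduler is not local and spark.speculation is true, so it is not started in this article's local[2] example. The fixed-rate scheduling itself can be sketched with the standard ScheduledExecutorService API. SpeculationTimerSketch and runPeriodically are hypothetical, and the latch exists only so the sketch can stop after a known number of runs.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object SpeculationTimerSketch {
  // Runs `check` at a fixed rate until it has run `times` times or the timeout expires.
  // Returns true if all runs completed before the timeout.
  def runPeriodically(intervalMs: Long, times: Int, timeoutMs: Long)(check: () => Unit): Boolean = {
    val latch = new CountDownLatch(times)
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    try {
      scheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = {
          check()
          latch.countDown()
        }
      }, intervalMs, intervalMs, TimeUnit.MILLISECONDS) // initial delay equals the interval
      latch.await(timeoutMs, TimeUnit.MILLISECONDS)
    } finally {
      scheduler.shutdownNow()
    }
  }
}
```

As in the start method above, the initial delay and the period are the same value, so the first check runs one interval after scheduling.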


The next article will cover Spark Source Code Walkthrough: RDD Construction and Transformation.

