@(spark)[configure|exception|env|sparkContext]
def setMaster(master: String): SparkConf
getOption(key).map(_.toBoolean).getOrElse(defaultValue)
override def clone: SparkConf
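These fragments are from SparkConf: setMaster simply stores the master URL in the conf, getBoolean is built on getOption as shown, and clone returns an independent copy. A minimal usage sketch, spark-shell style (the config key is just an example):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local[2]")    // stores "spark.master" in the conf
  .setAppName("conf-demo")

// getBoolean is getOption(key).map(_.toBoolean).getOrElse(defaultValue)
val eventLogEnabled = conf.getBoolean("spark.eventLog.enabled", false)

// clone gives an independent copy, so later set() calls do not affect the original
val confCopy = conf.clone
```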
`DeprecatedConfig` is used to mark deprecated config entries.
The exception class is nothing special; it is simply the base class of all Spark exceptions.
This class (SparkEnv) is seriously impressive: it is basically a collection of global variables, and the structures it holds are essentially the core structures of Spark.
/**
* :: DeveloperApi ::
* Holds all the runtime environment objects for a running Spark instance (either master or worker),
* including the serializer, Akka actor system, block manager, map output tracker, etc. Currently
* Spark code finds the SparkEnv through a global variable, so all the threads can access the same
* SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).
*
* NOTE: This is not intended for external use. This is exposed for Shark and may be made private
* in a future release.
*/
@DeveloperApi
class SparkEnv (
    val executorId: String,
    val actorSystem: ActorSystem,
    val serializer: Serializer,
    val closureSerializer: Serializer,
    val cacheManager: CacheManager,
    val mapOutputTracker: MapOutputTracker,
    val shuffleManager: ShuffleManager,
    val broadcastManager: BroadcastManager,
    val blockTransferService: BlockTransferService,
    val blockManager: BlockManager,
    val securityManager: SecurityManager,
    val httpFileServer: HttpFileServer,
    val sparkFilesDir: String,
    val metricsSystem: MetricsSystem,
    val shuffleMemoryManager: ShuffleMemoryManager,
    val outputCommitCoordinator: OutputCommitCoordinator,
    val conf: SparkConf) extends Logging {
private var env: SparkEnv = _
The `= _` means that `env` is initialized to the default value of its type (null, since SparkEnv is a reference type). The companion object SparkEnv also contains the initialization of some of these vals.
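A small sketch of both points (the Holder class is purely illustrative): `= _` gives a var the default value of its type, and once a SparkContext exists any thread in the process can reach the shared environment through SparkEnv.get.

```scala
import org.apache.spark.SparkEnv

class Holder {
  private var env: SparkEnv = _          // reference type, so it starts out as null
  def current: Option[SparkEnv] = Option(env)
}

// After a SparkContext has been created, any thread in the process can do:
//   val serializer = SparkEnv.get.serializer
//   val blockMgr   = SparkEnv.get.blockManager
```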
/**
* Resolves paths to files added through `SparkContext.addFile()`.
*/
object SparkFiles {
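A short usage sketch of that API, spark-shell style (the file path is hypothetical): SparkContext.addFile ships a file to every node, and SparkFiles.get resolves the local copy on whichever node the code runs.

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(new SparkConf().setAppName("files-sketch").setMaster("local[2]"))
sc.addFile("/tmp/lookup.txt")                  // hypothetical file, distributed to every node
val localPath = SparkFiles.get("lookup.txt")   // resolves the node-local copy of that file
println(localPath)
sc.stop()
```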
SparkContext is something every Spark program has, and it is the focus of these notes. The file is very long, 2300+ lines. During SparkContext initialization, roughly the following happens (a minimal creation sketch follows the list):
1. Load the configuration (SparkConf)
2. Create the SparkEnv
3. Create the TaskScheduler
4. Create the DAGScheduler
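A minimal sketch of the entry point these steps set up (app name and file path are placeholders): constructing the SparkContext triggers steps 1-4 above, and RDD creation and actions then go through it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ContextSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("context-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)    // loads conf, builds SparkEnv, TaskScheduler, DAGScheduler

    val lines = sc.textFile("/tmp/input.txt")   // placeholder path; yields a HadoopRDD under the hood
    println(lines.count())                      // an action, which ends up in SparkContext.runJob

    sc.stop()   // only one active SparkContext per JVM, so stop it when done
  }
}
```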
/**
* Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
* cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
*
* Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before
* creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
*
* @param config a Spark Config object describing the application configuration. Any settings in
* this config overrides the default configs as well as system properties.
*/
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {
A few things worth noting about this definition:
1. `ExecutorAllocationClient`:
/**
* A client that communicates with the cluster manager to request or kill executors.
* This is currently supported only in YARN mode.
*/
private[spark] trait ExecutorAllocationClient {
This trait declares the corresponding functions for requesting and killing executors. Note that it only takes effect in YARN mode; in fact, SparkContext contains quite a few things that are YARN-specific.
2. Only one SparkContext can be active per JVM.
3. There is more than one constructor; a number of auxiliary constructors exist as well.
4. The xxxFile defs, whose main purpose is to produce the appropriate RDD for each kind of input. RDDs are covered elsewhere.
5. The accumulator-related defs, whose main purpose is to create different kinds of accumulators for different situations.
6. A set of getter functions, e.g. getPersistentRDDs and the getXXXInfo family.
7. A set of hook functions, e.g. postEnvironmentUpdate.
8. A set of private members with their corresponding getters and setters.
9. textFile and wholeTextFiles handle reading files from HDFS and actually produce a HadoopRDD.
10. The common entry point of every action is runJob; the runJob flow will be examined in detail later. Its source is shown below.
/**
* Run a function on a given set of partitions in an RDD and pass the results to the given
* handler function. This is the main entry point for all actions in Spark. The allowLocal
* flag specifies whether the scheduler can run the computation on the driver rather than
* shipping it out to the cluster, for short actions like first().
*/
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit) {
  if (stopped) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
    resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
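To make that entry point concrete, here is a hedged sketch of how an action-style call reaches runJob: build a per-partition function, hand it to sc.runJob (the overload that runs on all partitions), and combine the per-partition results on the driver. This is the shape of a count-like action, not the actual RDD.count implementation.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object RunJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("runJob-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // Count each partition on the executors, then sum the partial counts on the driver.
    val perPartition: Array[Long] =
      sc.runJob(rdd, (_: TaskContext, it: Iterator[Int]) => it.size.toLong)
    println(s"total = ${perPartition.sum}")

    sc.stop()
  }
}
```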
### Object SparkContext
1. Probably the most important member is `private var activeContext: Option[SparkContext] = None`.
2. It contains a lot of @deprecated members. They are not useless; they have simply moved elsewhere and are kept here for backward compatibility.
3. The function most worth paying attention to is probably the one below, which creates a different scheduler depending on the master URL:
/**
* Create a task scheduler based on a given master URL.
* Return a 2-tuple of the scheduler backend and the task scheduler.
*/
private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
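A simplified, assumed sketch of the dispatch pattern inside createTaskScheduler (this is not the actual Spark code; MasterUrlSketch and its regexes are illustrative): the master URL is matched against a handful of literals and regexes, and each case picks a scheduler backend.

```scala
object MasterUrlSketch {
  private val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
  private val SPARK_REGEX = """spark://(.*)""".r

  def describe(master: String): String = master match {
    case "local"                   => "local backend, single thread"
    case LOCAL_N_REGEX(threads)    => s"local backend, $threads threads"
    case SPARK_REGEX(url)          => s"standalone cluster at $url"
    case m if m.startsWith("yarn") => "YARN backend (loaded reflectively)"
    case other                     => s"unsupported master URL: $other"
  }

  def main(args: Array[String]): Unit =
    Seq("local", "local[4]", "spark://host:7077", "yarn-client")
      .foreach(m => println(s"$m -> ${describe(m)}"))
}
```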
A pile of implicits is also defined, mainly in order to bring in simpleWritableConverter.
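Those implicits are what make a call like the following work; a hedged sketch (the path is hypothetical): the WritableConverter implicits let you ask for plain Int/String instead of IntWritable/Text when reading a SequenceFile.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // needed on older Spark versions for the Writable converters

object SequenceFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("seqfile-sketch").setMaster("local[2]"))
    // Hypothetical path to a SequenceFile[IntWritable, Text]; the implicit converters
    // unwrap the Writables so we can request plain Int/String.
    val pairs = sc.sequenceFile[Int, String]("/tmp/data.seq")
    pairs.take(5).foreach(println)
    sc.stop()
  }
}
```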
SparkStatusTracker is used for tracking job information; the real work is actually done by the jobProgressListener.
/**
* Low-level status reporting APIs for monitoring job and stage progress.
*
* These APIs intentionally provide very weak consistency semantics; consumers of these APIs should
* be prepared to handle empty / missing information. For example, a job's stage ids may be known
* but the status API may not have any information about the details of those stages, so
* `getStageInfo` could potentially return `None` for a valid stage id.
*
* To limit memory usage, these APIs only provide information on recent jobs / stages. These APIs
* will provide information for the last `spark.ui.retainedStages` stages and
* `spark.ui.retainedJobs` jobs.
*
* NOTE: this class's constructor should be considered private and may be subject to change.
*/
class SparkStatusTracker private[spark] (sc: SparkContext) {
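A hedged usage sketch of these status APIs: obtain the tracker from sc.statusTracker, poll recent jobs and stages, and be prepared for missing data (the getters return Option values or empty arrays).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StatusTrackerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("status-sketch").setMaster("local[2]"))
    val tracker = sc.statusTracker

    sc.parallelize(1 to 1000, 8).count()   // run something so there is a recent job to report

    // null job group = jobs not associated with any group; getJobInfo may return None
    for (jobId <- tracker.getJobIdsForGroup(null); info <- tracker.getJobInfo(jobId)) {
      println(s"job $jobId: status=${info.status()} stages=${info.stageIds().mkString(",")}")
    }
    sc.stop()
  }
}
```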
/**
* An HTTP server for static content used to allow worker nodes to access JARs added to SparkContext
* as well as classes created by the interpreter when the user types in code. This is just a wrapper
* around a Jetty server.
*/
private[spark] class HttpServer(