# Spark: Miscellaneous Classes Whose Names Start with "Spark"

@(spark)[configure|exception|env|sparkContext]

## SparkConf

  1. At its core is a `java.util.concurrent.ConcurrentHashMap[String, String]`; each configuration key maps to its own value.
  2. Besides the ordinary `set` methods there are plenty of 'utility' setters such as `def setMaster(master: String): SparkConf`.
  3. When a config value is read as a specific type, the Scala idiom `getOption(key).map(_.toBoolean).getOrElse(defaultValue)` is used.
  4. It is cloneable: `override def clone: SparkConf`.
  5. The companion `object SparkConf` also defines `DeprecatedConfig`, used to mark deprecated configuration keys.
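A minimal sketch pulling the points above together (not the real SparkConf; `MiniConf` is a made-up name): a ConcurrentHashMap as the backing store, chained builder-style setters such as setMaster, a typed getter built from `getOption(...).map(...).getOrElse(...)`, and a `clone` override.

```scala
import java.util.concurrent.ConcurrentHashMap

class MiniConf extends Cloneable {
  // Core storage: a concurrent map from config key to config value.
  private val settings = new ConcurrentHashMap[String, String]()

  def set(key: String, value: String): MiniConf = {
    settings.put(key, value)
    this // return this so calls can be chained
  }

  // A 'utility' setter in the style of SparkConf.setMaster.
  def setMaster(master: String): MiniConf = set("spark.master", master)

  def getOption(key: String): Option[String] = Option(settings.get(key))

  // Typed getter using the getOption(...).map(...).getOrElse(...) idiom.
  def getBoolean(key: String, defaultValue: Boolean): Boolean =
    getOption(key).map(_.toBoolean).getOrElse(defaultValue)

  override def clone: MiniConf = {
    val cloned = new MiniConf
    val it = settings.entrySet().iterator()
    while (it.hasNext) {
      val e = it.next()
      cloned.set(e.getKey, e.getValue)
    }
    cloned
  }
}
```

Usage would look like `new MiniConf().setMaster("local[2]").getBoolean("spark.logLineage", defaultValue = false)`, which is exactly the chaining style SparkConf's helpers enable.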

## SparkException

Nothing special here; it is simply the base class for Spark's own exceptions.

## SparkEnv

This class is quite something: it is basically a collection of global variables, and the objects it holds are pretty much Spark's core runtime structures.

```scala
/**
 * :: DeveloperApi ::
 * Holds all the runtime environment objects for a running Spark instance (either master or worker),
 * including the serializer, Akka actor system, block manager, map output tracker, etc. Currently
 * Spark code finds the SparkEnv through a global variable, so all the threads can access the same
 * SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).
 *
 * NOTE: This is not intended for external use. This is exposed for Shark and may be made private
 *       in a future release.
 */
@DeveloperApi
class SparkEnv (
    val executorId: String,
    val actorSystem: ActorSystem,
    val serializer: Serializer,
    val closureSerializer: Serializer,
    val cacheManager: CacheManager,
    val mapOutputTracker: MapOutputTracker,
    val shuffleManager: ShuffleManager,
    val broadcastManager: BroadcastManager,
    val blockTransferService: BlockTransferService,
    val blockManager: BlockManager,
    val securityManager: SecurityManager,
    val httpFileServer: HttpFileServer,
    val sparkFilesDir: String,
    val metricsSystem: MetricsSystem,
    val shuffleMemoryManager: ShuffleMemoryManager,
    val outputCommitCoordinator: OutputCommitCoordinator,
    val conf: SparkConf) extends Logging {
```

`private var env: SparkEnv = _` means that `env` is initialized to the default value of its type (null for a reference type).

The companion `object SparkEnv` takes care of initializing some of these vals.
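A minimal sketch of the pattern (not Spark's actual code; `Env` is a stand-in class): `= _` initializes the var with the type's default value, and a set/get pair makes the same instance visible to all threads, analogous to SparkEnv.set / SparkEnv.get.

```scala
class Env(val executorId: String)

object Env {
  // `= _` gives the default value of the type: null for a reference type like Env.
  private var env: Env = _

  def set(e: Env): Unit = { env = e }

  // Analogous to SparkEnv.get: any thread can reach the shared instance.
  def get: Env = env
}
```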

## SparkFiles

```scala
/**
 * Resolves paths to files added through `SparkContext.addFile()`.
 */
object SparkFiles {
```
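A small usage sketch of the public API (`SparkContext.addFile` plus `SparkFiles.get`); the path and file name are arbitrary example values.

```scala
import org.apache.spark.{SparkContext, SparkFiles}

object SparkFilesExample {
  def distribute(sc: SparkContext): Unit = {
    // Registered on the driver, then shipped to every executor.
    sc.addFile("/path/to/lookup.txt")
    sc.parallelize(1 to 2).foreach { _ =>
      // Runs on the executors; resolves to the locally downloaded copy.
      println(SparkFiles.get("lookup.txt"))
    }
  }
}
```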

## SparkContext

Every Spark program has one of these, and it is the focus of this post. The file is very long, 2300+ lines. Roughly speaking, SparkContext's initialization does the following (a user-level view is sketched after the list):
1. Load the configuration (SparkConf)
2. Create the SparkEnv
3. Create the TaskScheduler
4. Create the DAGScheduler
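A user-level sketch of those steps: constructing a SparkContext from a SparkConf triggers the SparkEnv, TaskScheduler and DAGScheduler creation internally. The master URL and app name are arbitrary example values.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("misc-notes-example")
    val sc = new SparkContext(conf) // SparkEnv and the schedulers are created here
    try {
      println(sc.parallelize(1 to 10).count())
    } finally {
      sc.stop() // only one active SparkContext per JVM
    }
  }
}
```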

### Class SparkContext

```scala
/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * Only one SparkContext may be active per JVM.  You must `stop()` the active SparkContext before
 * creating a new one.  This limitation may eventually be removed; see SPARK-2243 for more details.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {
```

A few things are worth noting about this definition:
1. It mixes in `ExecutorAllocationClient`:

```scala
/**
 * A client that communicates with the cluster manager to request or kill executors.
 * This is currently supported only in YARN mode.
 */
private[spark] trait ExecutorAllocationClient {
```

SparkContext implements the methods of this trait. Note that the trait only takes effect in YARN mode; in fact SparkContext contains quite a few YARN-only pieces.
2. Only one SparkContext may be active per JVM.
3. There is more than one constructor; a number of auxiliary constructors are provided.
4. The various xxxFile methods mainly build the appropriate kind of RDD for each input; RDDs are covered elsewhere.
5. The accumulator-related methods create different kinds of accumulators as needed.
6. Several getters are defined, e.g. getPersistentRDDs and the getXXXInfo family.
7. Several hook methods are defined, such as postEnvironmentUpdate.
8. Some private members are defined, with corresponding getters and setters.
9. textFile and wholeTextFiles handle reading files from HDFS; under the hood they produce a HadoopRDD.
10. runJob is the single entry point for every action; its flow will be examined in detail later. The signature and a usage sketch follow below.

```scala
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark. The allowLocal
 * flag specifies whether the scheduler can run the computation on the driver rather than
 * shipping it out to the cluster, for short actions like first().
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit) {
  if (stopped) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
    resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
```
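A hedged sketch of how an action-style call maps onto runJob's parameters (rdd, per-partition function, partition ids, result handler). It targets the signature quoted above, which still has the allowLocal flag; newer Spark releases removed that parameter, so the call would need adjusting for those versions.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object RunJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("runJob-sketch"))
    val rdd = sc.parallelize(1 to 100, 4)

    // Count the elements of every partition; the result handler stores each
    // partition's result, which is how collect()/count() are built on runJob.
    val sizes = new Array[Int](rdd.partitions.length)
    sc.runJob(
      rdd,
      (_: TaskContext, it: Iterator[Int]) => it.size,
      0 until rdd.partitions.length,
      false, // allowLocal
      (partition: Int, size: Int) => sizes(partition) = size)

    println("partition sizes: " + sizes.mkString(", "))
    sc.stop()
  }
}
```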

### Object SparkContext
1. Probably its most important member is `private var activeContext: Option[SparkContext] = None`.
2. It contains a large number of @deprecated members. They are not useless; their functionality now lives elsewhere, and they are kept for backward compatibility.
3. The function most worth paying attention to is probably this one, which creates a different scheduler depending on the master URL (a simplified dispatch sketch follows the snippet):

```scala
/**
 * Create a task scheduler based on a given master URL.
 * Return a 2-tuple of the scheduler backend and the task scheduler.
 */
private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
```
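A simplified sketch of the dispatch createTaskScheduler performs (not the real implementation; `MasterUrlDispatch` and `describe` are made up for illustration): pattern match on the master string and pick a scheduler/backend accordingly. The real method builds a TaskSchedulerImpl plus a local, standalone, or YARN backend.

```scala
object MasterUrlDispatch {
  private val LocalN   = """local\[([0-9]+|\*)\]""".r
  private val SparkUrl = """spark://(.*)""".r

  def describe(master: String): String = master match {
    case "local"                   => "local scheduler, 1 thread"
    case LocalN(threads)           => s"local scheduler, $threads threads"
    case SparkUrl(hostPort)        => s"standalone cluster backend at $hostPort"
    case m if m.startsWith("yarn") => "YARN scheduler backend"
    case other                     => sys.error(s"Could not parse Master URL: '$other'")
  }
}
```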

### Writable-related implicits

A pile of implicits, mainly there to bring in simpleWritableConverter.
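A hedged sketch of the underlying pattern only (the names below are illustrative, not Spark's private API): wrap a `Writable => T` function into a converter object and expose the converters implicitly, so the right one is picked from the element type.

```scala
import org.apache.hadoop.io.{IntWritable, Text, Writable}
import scala.reflect.ClassTag

case class SimpleConverter[T, W <: Writable](writableClass: Class[W], convert: W => T)

object WritableConverters {
  private def simpleConverter[T, W <: Writable: ClassTag](convert: W => T): SimpleConverter[T, W] =
    SimpleConverter(implicitly[ClassTag[W]].runtimeClass.asInstanceOf[Class[W]], convert)

  implicit val intConverter: SimpleConverter[Int, IntWritable] =
    simpleConverter[Int, IntWritable](_.get)
  implicit val stringConverter: SimpleConverter[String, Text] =
    simpleConverter[String, Text](_.toString)

  // Any caller can now just demand the right converter implicitly:
  def convertAll[T, W <: Writable](ws: Seq[W])(implicit c: SimpleConverter[T, W]): Seq[T] =
    ws.map(c.convert)
}
```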

## SparkStatusTracker

Used to track job information; the heavy lifting is actually done by the jobProgressListener.

```scala
/**
 * Low-level status reporting APIs for monitoring job and stage progress.
 *
 * These APIs intentionally provide very weak consistency semantics; consumers of these APIs should
 * be prepared to handle empty / missing information.  For example, a job's stage ids may be known
 * but the status API may not have any information about the details of those stages, so
 * `getStageInfo` could potentially return `None` for a valid stage id.
 *
 * To limit memory usage, these APIs only provide information on recent jobs / stages.  These APIs
 * will provide information for the last `spark.ui.retainedStages` stages and
 * `spark.ui.retainedJobs` jobs.
 *
 * NOTE: this class's constructor should be considered private and may be subject to change.
 */
class SparkStatusTracker private[spark] (sc: SparkContext) {
```
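A small usage sketch: the tracker is obtained through `SparkContext.statusTracker`, and, as the scaladoc warns, every lookup may legitimately come back empty or None.

```scala
import org.apache.spark.SparkContext

object StatusReport {
  def reportActiveJobs(sc: SparkContext): Unit = {
    val tracker = sc.statusTracker
    // getJobInfo may return None even for ids the tracker just reported as active.
    for (jobId <- tracker.getActiveJobIds(); info <- tracker.getJobInfo(jobId)) {
      println(s"job $jobId: status=${info.status}, stages=${info.stageIds.mkString(",")}")
    }
  }
}
```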

## HttpServer

```scala
/**
 * An HTTP server for static content used to allow worker nodes to access JARs added to SparkContext
 * as well as classes created by the interpreter when the user types in code. This is just a wrapper
 * around a Jetty server.
 */
private[spark] class HttpServer(
```
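A hedged sketch of the underlying idea only (Spark's HttpServer is private[spark]): a bare Jetty server handing out static files from a directory, roughly what the wrapper does for jars and REPL-generated classes. Assumes the Jetty 9.x API and an arbitrary example directory.

```scala
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.server.handler.ResourceHandler

object StaticFileServerSketch {
  def main(args: Array[String]): Unit = {
    val handler = new ResourceHandler()
    handler.setResourceBase("/tmp/spark-served-jars") // directory to expose

    val server = new Server(0) // 0 = pick any free port
    server.setHandler(handler)
    server.start()
    println(s"serving on ${server.getURI}")
    server.join() // block until stopped
  }
}
```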
