This article is licensed under the Creative Commons Attribution-ShareAlike (CC-BY-SA) 3.0 license.

Table of Contents
Auxiliary Properties in SparkContext
creationSite
allowMultipleContexts
startTime & stopped
addedFiles/addedJars & _files/_jars
persistentRdds
executorEnvs & _executorMemory & _sparkUser
checkpointDir
localProperties
_eventLogDir & _eventLogCodec
_applicationId & _applicationAttemptId
_shutdownHookRef
nextShuffleId & nextRddId
SparkContext Post-initialization
The setupAndStartListenerBus() Method
The postEnvironmentUpdate() Method
The postApplicationStart() Method
Other Items
Summary
// The call site in user code where this SparkContext was created
private val creationSite: CallSite = Utils.getCallSite()
// Whether multiple active SparkContexts are tolerated in the same JVM
private val allowMultipleContexts: Boolean = config.getBoolean("spark.driver.allowMultipleContexts", false)
// Creation timestamp of the SparkContext, and a flag marking whether it has been stopped
val startTime = System.currentTimeMillis()
private[spark] val stopped: AtomicBoolean = new AtomicBoolean(false)
// Files and JARs added via addFile()/addJar(), mapped to the timestamp when they were added
private[spark] val addedFiles = new ConcurrentHashMap[String, Long]().asScala
private[spark] val addedJars = new ConcurrentHashMap[String, Long]().asScala
// Registry of persisted (cached) RDDs, keyed by RDD id and held through weak references
private[spark] val persistentRdds = {
  val map: ConcurrentMap[Int, RDD[_]] = new MapMaker().weakValues().makeMap[Int, RDD[_]]()
  map.asScala
}
// Environment variables to pass to executors, and the user running the application
private[spark] val executorEnvs = HashMap[String, String]()
val sparkUser = Utils.getCurrentUserName()
// Directory under which RDD checkpoint data is stored
private[spark] var checkpointDir: Option[String] = None
// Thread-local properties (e.g. job group, scheduler pool), cloned when inherited by child threads
protected[spark] val localProperties = new InheritableThreadLocal[Properties] {
  override protected def childValue(parent: Properties): Properties = {
    SerializationUtils.clone(parent)
  }
  override protected def initialValue(): Properties = new Properties()
}
// Counters used to allocate unique shuffle ids and RDD ids
private val nextShuffleId = new AtomicInteger(0)
private val nextRddId = new AtomicInteger(0)
// Event log directory and compression codec, used when event logging is enabled
private var _eventLogDir: Option[URI] = None
private var _eventLogCodec: Option[String] = None
// Executor memory in MB
private var _executorMemory: Int = _
// Application id and attempt id obtained from the TaskScheduler
private var _applicationId: String = _
private var _applicationAttemptId: Option[String] = None
// User JARs and files resolved from the configuration
private var _jars: Seq[String] = _
private var _files: Seq[String] = _
// Handle of the shutdown hook registered for this SparkContext
private var _shutdownHookRef: AnyRef = _
_jars = Utils.getUserJars(_conf)
if (jars != null) {
  jars.foreach(addJar)
}
First, Utils.getUserJars() extracts the sequence of paths from SparkConf's spark.jars configuration entry, and then addJar() is called for each of them.
Code #3.3 - the o.a.s.SparkContext.addJar() method
def addJar(path: String) {
  def addJarFile(file: File): String = {
    try {
      if (!file.exists()) {
        throw new FileNotFoundException(s"Jar ${file.getAbsolutePath} not found")
      }
      if (file.isDirectory) {
        throw new IllegalArgumentException(
          s"Directory ${file.getAbsoluteFile} is not allowed for addJar")
      }
      env.rpcEnv.fileServer.addJar(file)
    } catch {
      case NonFatal(e) =>
        logError(s"Failed to add $path to Spark environment", e)
        null
    }
  }

  if (path == null) {
    logWarning("null specified as parameter to addJar")
  } else {
    val key = if (path.contains("\\")) {
      addJarFile(new File(path))
    } else {
      val uri = new URI(path)
      Utils.validateURL(uri)
      uri.getScheme match {
        case null =>
          addJarFile(new File(uri.getRawPath))
        case "file" => addJarFile(new File(uri.getPath))
        case "local" => "file:" + uri.getPath
        case _ => path
      }
    }
    if (key != null) {
      val timestamp = System.currentTimeMillis
      if (addedJars.putIfAbsent(key, timestamp).isEmpty) {
        logInfo(s"Added JAR $path at $key with timestamp $timestamp")
        postEnvironmentUpdate()
      }
    }
  }
}
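To make the branches above concrete, here is a hedged usage sketch, assuming sc is an already-created SparkContext; the paths are made up purely for illustration.
// Hypothetical paths, used only to illustrate the branches of addJar() above.
sc.addJar("/opt/libs/app.jar")          // no URI scheme: the JAR is served through the driver's RPC file server
sc.addJar("file:///opt/libs/app.jar")   // "file" scheme: handled the same way as above
sc.addJar("local:/opt/libs/app.jar")    // "local" scheme: assumed to already exist on every node, nothing is uploaded
sc.addJar("hdfs:///libs/app.jar")       // any other scheme: the path itself is recorded as the key in addedJars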
private[spark] def persistRDD(rdd: RDD[_]) {
  persistentRdds(rdd.id) = rdd
}

private[spark] def unpersistRDD(rddId: Int, blocking: Boolean = true) {
  env.blockManager.master.removeRdd(rddId, blocking)
  persistentRdds.remove(rddId)
  listenerBus.post(SparkListenerUnpersistRDD(rddId))
}
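These two helpers are not called from user code directly; RDD.persist() and RDD.unpersist() delegate to them. A minimal sketch of the user-facing side (the input path is hypothetical):
val cached = sc.textFile("hdfs:///data/input.txt").cache()  // RDD.persist() registers the RDD in persistentRdds
cached.count()                                              // the first action actually materializes the cached blocks
cached.unpersist()                                          // RDD.unpersist() calls unpersistRDD(), removing the blocks and posting SparkListenerUnpersistRDD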
_executorMemory = _conf.getOption("spark.executor.memory")
  .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
  .orElse(Option(System.getenv("SPARK_MEM"))
    .map(warnSparkMem))
  .map(Utils.memoryStringToMb)
  .getOrElse(1024)
for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
      value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
  executorEnvs(envKey) = value
}
Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
  executorEnvs("SPARK_PREPEND_CLASSES") = v
}
executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
executorEnvs ++= _conf.getExecutorEnv
executorEnvs("SPARK_USER") = sparkUser
def setCheckpointDir(directory: String) {
  if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
    logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
      s"must not be on the local filesystem. Directory '$directory' " +
      "appears to be on the local filesystem.")
  }

  checkpointDir = Option(directory).map { dir =>
    val path = new Path(dir, UUID.randomUUID().toString)
    val fs = path.getFileSystem(hadoopConfiguration)
    fs.mkdirs(path)
    fs.getFileStatus(path).getPath.toString
  }
}
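As a usage note, a minimal sketch (sc is an existing SparkContext, and the HDFS path is made up) of how this directory is consumed by RDD checkpointing:
// On a cluster the directory should live on a shared filesystem such as HDFS,
// which is exactly what the warning in setCheckpointDir() above is about.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()   // marks the RDD for checkpointing under the directory chosen above
rdd.count()        // the first action triggers the actual checkpoint write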
_applicationId = _taskScheduler.applicationId()
_applicationAttemptId = taskScheduler.applicationAttemptId()
_shutdownHookRef defines the shutdown hook of the SparkContext. Its main purpose is to explicitly call SparkContext.stop() when the JVM exits, so that a mess is not left behind if the user forgets to stop the context. This is actually post-initialization logic, and it appears in code #3.8 below.
The post-initialization code (code #3.8, referenced above) is shown below. Its main logic lies in the three methods called at its beginning; we will examine their code one by one afterwards.
setupAndStartListenerBus()
postEnvironmentUpdate()
postApplicationStart()
_taskScheduler.postStartHook()
_env.metricsSystem.registerSource(_dagScheduler.metricsSource)
_env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
_executorAllocationManager.foreach { e =>
  _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
}
logDebug("Adding shutdown hook") // force eager creation of logger
_shutdownHookRef = ShutdownHookManager.addShutdownHook(
  ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
  logInfo("Invoking stop() from shutdown hook")
  stop()
}
SparkContext.setActiveContext(this, allowMultipleContexts)
private def setupAndStartListenerBus(): Unit = {
  try {
    conf.get(EXTRA_LISTENERS).foreach { classNames =>
      val listeners = Utils.loadExtensions(classOf[SparkListenerInterface], classNames, conf)
      listeners.foreach { listener =>
        listenerBus.addToSharedQueue(listener)
        logInfo(s"Registered listener ${listener.getClass().getName()}")
      }
    }
  } catch {
    case e: Exception =>
      try {
        stop()
      } finally {
        throw new SparkException(s"Exception when registering SparkListener", e)
      }
  }

  listenerBus.start(this, _env.metricsSystem)
  _listenerBusStarted = true
}
This method registers user-defined listeners and finally starts the LiveListenerBus. Custom listeners all implement the SparkListener trait and are specified through the spark.extraListeners configuration parameter. Utils.loadExtensions() then constructs instances of these listeners via reflection and registers them with the LiveListenerBus.
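As an illustration, here is a minimal sketch of such a listener; the class name MyAppListener and the package com.example are hypothetical, and a listener loaded this way needs a zero-argument (or SparkConf-accepting) constructor so that Utils.loadExtensions() can instantiate it reflectively.
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart, SparkListenerApplicationEnd}

class MyAppListener extends SparkListener {
  // Invoked when the SparkListenerApplicationStart event (posted by postApplicationStart(), see below) arrives
  override def onApplicationStart(event: SparkListenerApplicationStart): Unit = {
    println(s"Application ${event.appName} started at ${event.time}")
  }
  override def onApplicationEnd(event: SparkListenerApplicationEnd): Unit = {
    println(s"Application ended at ${event.time}")
  }
}

// Registration is purely configuration-driven, for example:
//   spark-submit --conf spark.extraListeners=com.example.MyAppListener ...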
private def postEnvironmentUpdate() {
  if (taskScheduler != null) {
    val schedulingMode = getSchedulingMode.toString
    val addedJarPaths = addedJars.keys.toSeq
    val addedFilePaths = addedFiles.keys.toSeq
    val environmentDetails = SparkEnv.environmentDetails(conf, schedulingMode, addedJarPaths,
      addedFilePaths)
    val environmentUpdate = SparkListenerEnvironmentUpdate(environmentDetails)
    listenerBus.post(environmentUpdate)
  }
}
This method is also called whenever custom files or JAR packages are added, because the added resources affect the program's execution environment. It collects the current lists of custom files and JARs, together with the Spark configuration and the scheduling mode, then obtains the JVM parameters, Java system properties and so on via SparkEnv.environmentDetails(), wraps all of this into a SparkListenerEnvironmentUpdate event, and posts it to the event bus.
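For reference, a hedged sketch of a listener that observes these updates (EnvWatcher is a made-up name); environmentDetails is a Map[String, Seq[(String, String)]] keyed by sections such as "JVM Information", "Spark Properties", "System Properties" and "Classpath Entries".
import org.apache.spark.scheduler.{SparkListener, SparkListenerEnvironmentUpdate}

class EnvWatcher extends SparkListener {
  override def onEnvironmentUpdate(event: SparkListenerEnvironmentUpdate): Unit = {
    // Print only the "Spark Properties" section of the update
    event.environmentDetails.getOrElse("Spark Properties", Seq.empty).foreach {
      case (k, v) => println(s"$k = $v")
    }
  }
}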
private def postApplicationStart() {
  listenerBus.post(SparkListenerApplicationStart(appName, Some(applicationId),
    startTime, sparkUser, applicationAttemptId, schedulerBackend.getDriverLogUrls))
}
This method is straightforward: it posts a SparkListenerApplicationStart event to the event bus, signalling that the Application has started.
Call TaskScheduler.postStartHook() to wait until the SchedulerBackend has finished initializing.
Register the metrics sources of DAGScheduler, BlockManager and ExecutorAllocationManager with the metrics system, so that their monitoring data can be collected.
Add the shutdown hook, which was already covered above and is not repeated here.
Call setActiveContext() on the companion object to mark the current SparkContext as the active one.