How should you approach the Spark source code? Start with the SparkContext class: once you understand SparkContext you understand most of Spark, because it is the entry point of every Spark program and the foundation everything else builds on.
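Before diving into the internals, here is a minimal sketch of the user-side view: everything discussed below runs inside new SparkContext(conf). The app name, master URL and the little job are only placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object MiniApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("mini-app")       // becomes spark.app.name
      .setMaster("local[2]")        // becomes spark.master
    val sc = new SparkContext(conf) // the constructor this post walks through
    try {
      val sum = sc.parallelize(1 to 100).reduce(_ + _)
      println(s"sum = $sum")
    } finally {
      sc.stop()
    }
  }
}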
class SparkContext(config: SparkConf) extends Logging {
// The call site where this SparkContext was constructed.
// When a class in the spark package is called, this captures the name of the user-code class that
// invoked Spark, together with the Spark method it called.
private val creationSite: CallSite = Utils.getCallSite()
// If true, log warnings instead of throwing exceptions when multiple SparkContexts are active.
private val allowMultipleContexts: Boolean =
config.getBoolean("spark.driver.allowMultipleContexts", false)
// In order to prevent multiple SparkContexts from being active at the same time, mark this
// context as having started construction.
// NOTE: this must be placed at the beginning of the SparkContext constructor.
SparkContext.markPartiallyConstructed(this, allowMultipleContexts)
// Record the current system time as the context's start time.
val startTime = System.currentTimeMillis()
// AtomicBoolean is not an ordinary Boolean: its reads and updates (get, set, compareAndSet) are atomic,
// so concurrent threads always observe a consistent value. Here it guarantees that the context can be
// marked as stopped exactly once, no matter how many threads call stop() at the same time.
private[spark] val stopped: AtomicBoolean = new AtomicBoolean(false)
Related post: http://blog.csdn.net/qq_21383435/article/details/78559871
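A tiny standard-library illustration of that atomicity guarantee (plain JDK code, not Spark source):
import java.util.concurrent.atomic.AtomicBoolean

val stopped = new AtomicBoolean(false)
// compareAndSet(expect, update) succeeds for exactly one caller.
// SparkContext.stop() relies on this so the shutdown logic runs only once,
// even if several threads call stop() at the same time.
if (stopped.compareAndSet(false, true)) {
  println("this thread performs the shutdown work")
} else {
  println("another thread already stopped the context")
}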
// Warn if a deprecated Scala version is in use (Scala 2.10 support is deprecated as of Spark 2.1.0).
warnDeprecatedVersions()
// Checks the running Java/Scala versions and logs a warning for deprecated ones.
private def warnDeprecatedVersions(): Unit = {
val javaVersion = System.getProperty("java.version").split("[+.\\-]+", 3) // used by the Java-version deprecation check, elided here
if (scala.util.Properties.releaseVersion.exists(_.startsWith("2.10"))) {
logWarning("Support for Scala 2.10 is deprecated as of Spark 2.1.0")
}
}
/**
 * Return a copy of this SparkContext's configuration. The configuration ''cannot'' be
 * changed at runtime.
 */
def getConf: SparkConf = conf.clone()
def jars: Seq[String] = _jars
def files: Seq[String] = _files
def master: String = _conf.get("spark.master")
def deployMode: String = _conf.getOption("spark.submit.deployMode").getOrElse("client")
def appName: String = _conf.get("spark.app.name")
private[spark] def isEventLogEnabled: Boolean = _conf.getBoolean("spark.eventLog.enabled", false)
private[spark] def eventLogDir: Option[URI] = _eventLogDir
private[spark] def eventLogCodec: Option[String] = _eventLogCodec
// Whether the master URL indicates local mode.
def isLocal: Boolean = Utils.isLocalMaster(_conf)
/**
 * @return true if context is stopped or in the midst of stopping.
 */
def isStopped: Boolean = stopped.get()
Think of listenerBus as a bus line: events are the passengers, they get on when they are posted and get off at their stops, the registered listeners. The pattern is the classic producer–consumer model; see the reference below for the implementation details.
// An asynchronous listener bus for Spark events.
// listenerBus is created first because it is passed as a constructor argument to SparkEnv.
private[spark] val listenerBus = new LiveListenerBus(this)
For the concrete implementation, see: Spark study 48 – the LiveListenerBus event listener, the SparkListenerBus trait and the ListenerBus trait:
http://blog.csdn.net/qq_21383435/article/details/78666141
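As a concrete example of this producer–consumer model, user code can register its own listener on the bus through the public API. The listener class below is purely illustrative, and sc is assumed to be an existing SparkContext:
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// A hypothetical listener that logs job lifecycle events delivered by listenerBus.
class JobLoggingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

// addSparkListener registers the listener on the same LiveListenerBus discussed above.
sc.addSparkListener(new JobLoggingListener)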
// Keeps track of all persisted RDDs. Values are held weakly, so RDDs that are no longer referenced elsewhere can be garbage-collected and silently drop out of the map.
private[spark] val persistentRdds = {
val map: ConcurrentMap[Int, RDD[_]] = new MapMaker().weakValues().makeMap[Int, RDD[_]]()
map.asScala
}
// Set SPARK_USER for the user who is running the SparkContext.
// Returns the current user name: the currently logged-in OS user, unless overridden by the SPARK_USER environment variable.
val sparkUser = Utils.getCurrentUserName()
Logged in to Windows as the user hzjs, the following output is printed:
17/12/19 17:06:13 INFO SecurityManager: Changing view acls to: hzjs,root
17/12/19 17:06:13 INFO SecurityManager: Changing modify acls to: hzjs,root
17/12/19 17:06:13 INFO SecurityManager: Changing view acls groups to:
17/12/19 17:06:13 INFO SecurityManager: Changing modify acls groups to:
Resolving the user here matters: it is used for read/write permissions on local and HDFS files. On Linux it is simply the Linux user.
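Roughly, Utils.getCurrentUserName resolves the user as sketched below (a simplification of the behaviour described above: SPARK_USER wins, otherwise the Hadoop login user):
import org.apache.hadoop.security.UserGroupInformation

def currentSparkUser: String =
  Option(System.getenv("SPARK_USER"))
    .getOrElse(UserGroupInformation.getCurrentUser().getShortUserName())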
/**
 * A unique identifier for the Spark application.
 * Its format depends on the scheduler implementation.
 * (i.e.
 *  in case of local spark app something like 'local-1433865536131'
 *  in case of YARN something like 'application_1433865536131_34483'
 * )
 */
def applicationId: String = _applicationId
def applicationAttemptId: Option[String] = _applicationAttemptId
The long try block that follows initializes most of the Spark runtime environment (its matching catch appears near the end of this section).
_conf = config.clone() /** clone the configuration object */
_conf.validateSettings() /** check for illegal or deprecated settings; throws an exception if any are found */
// spark.master must be set
if (!_conf.contains("spark.master")) {
throw new SparkException("A master URL must be set in your configuration")
}
// An application name must be set.
if (!_conf.contains("spark.app.name")) {
throw new SparkException("An application name must be set in your configuration")
}
// log out spark.app.name in the Spark driver logs
logInfo(s"Submitted application: $appName")
// System property spark.yarn.app.id must be set if user code is run by the AM on a YARN cluster.
if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) {
throw new SparkException("Detected yarn cluster mode, but isn't running on a cluster. " +
"Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
}
if (_conf.getBoolean("spark.logConf", false)) {
logInfo("Spark configuration:\n" + _conf.toDebugString)
}
// Set Spark driver host and port system properties. This explicitly sets the configuration
// instead of relying on the default value of the config constant.
_conf.set(DRIVER_HOST_ADDRESS, _conf.get(DRIVER_HOST_ADDRESS))
// If spark.driver.port is not set, default to 0 (bind to a random free port).
_conf.setIfMissing("spark.driver.port", "0")
// DRIVER_IDENTIFIER = "driver"
_conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)
Both settings deal with dependencies: the jars specified by spark.jars are registered with the file server by the addJar method, and the files specified by spark.files are registered by the addFile method, so that executors can fetch them.
// _jars is populated here from spark.jars and any user jars
_jars = Utils.getUserJars(_conf)
_files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))
.toSeq.flatten
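For reference, these two settings normally arrive via spark-submit flags or SparkConf before the context is built; a sketch with placeholder paths (dep1.jar, lookup.txt and friends are made up, and sc is assumed to be an existing SparkContext):
// Equivalent ways to populate _jars and _files:
//   spark-submit --jars /path/dep1.jar,/path/dep2.jar --files /path/lookup.txt ...
val depsConf = new SparkConf()
  .set("spark.jars", "/path/dep1.jar,/path/dep2.jar")
  .set("spark.files", "/path/lookup.txt")
// After the SparkContext exists, dependencies can also be added at runtime:
sc.addJar("/path/dep3.jar")
sc.addFile("/path/extra.txt")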
// The event-log directory defaults to /tmp/spark-events (EventLoggingListener.DEFAULT_LOG_DIR) unless spark.eventLog.dir is set.
_eventLogDir =
if (isEventLogEnabled) {
val unresolvedDir = conf.get("spark.eventLog.dir", EventLoggingListener.DEFAULT_LOG_DIR)
.stripSuffix("/")
Some(Utils.resolveURI(unresolvedDir))
} else {
None
}
_eventLogCodec = {
// whether the event log is compressed
val compress = _conf.getBoolean("spark.eventLog.compress", false)
// both compression and event logging must be enabled; isEventLogEnabled defaults to false
if (compress && isEventLogEnabled) {
Some(CompressionCodec.getCodecName(_conf)).map(CompressionCodec.getShortName)
} else {
None
}
}
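To actually exercise these two branches, event logging must be switched on explicitly; a typical configuration (the directory is a placeholder) looks like this:
// Enable the event log so that _eventLogDir and _eventLogCodec above are populated:
val eventLogConf = new SparkConf()
  .set("spark.eventLog.enabled", "true")                 // isEventLogEnabled becomes true
  .set("spark.eventLog.dir", "hdfs:///tmp/spark-events") // placeholder path
  .set("spark.eventLog.compress", "true")                // triggers the codec lookup above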
if (master == "yarn" && deployMode == "client") System.setProperty("SPARK_YARN_MODE", "true")
// "_jobProgressListener" should be set up before creating SparkEnv because when creating
// "SparkEnv", some messages will be posted to "listenerBus" and we should not miss them.
/** Create the JobProgressListener (scheduled for removal after Spark 2.2). */
_jobProgressListener = new JobProgressListener(_conf)
listenerBus.addListener(jobProgressListener)
// Create the Spark execution environment (cache, map output tracker, etc)
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)
// This function allows components created by SparkEnv to be mocked in unit tests:
private[spark] def createSparkEnv(
conf: SparkConf,
isLocal: Boolean,
listenerBus: LiveListenerBus): SparkEnv = {
SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
}
/**
 * Create a SparkEnv for the driver.
 *
 * conf is a copy of the SparkConf.
 * listenerBus uses the listener pattern to dispatch the various events.
 */
private[spark] def createDriverEnv(
conf: SparkConf,
isLocal: Boolean,
listenerBus: LiveListenerBus,
numCores: Int,
mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
assert(conf.contains(DRIVER_HOST_ADDRESS),
s"${DRIVER_HOST_ADDRESS.key} is not set on the driver!")
assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
val bindAddress = conf.get(DRIVER_BIND_ADDRESS)
val advertiseAddress = conf.get(DRIVER_HOST_ADDRESS)
val port = conf.get("spark.driver.port").toInt
val ioEncryptionKey = if (conf.get(IO_ENCRYPTION_ENABLED)) {
Some(CryptoStreamUtils.createKey(conf))
} else {
None
}
// createDriverEnv ultimately delegates to the private create() method to build the SparkEnv
create(
conf,
SparkContext.DRIVER_IDENTIFIER,
bindAddress,
advertiseAddress,
port,
isLocal,
numCores,
ioEncryptionKey,
listenerBus = listenerBus,
mockOutputCommitCoordinator = mockOutputCommitCoordinator
)
}
For the details of how SparkEnv is created, see:
http://blog.csdn.net/qq_21383435/article/details/78559977
// If running the REPL, register the repl's output dir with the file server.
_conf.getOption("spark.repl.class.outputDir").foreach { path =>
val replUri = _env.rpcEnv.fileServer.addDirectory("/classes", new File(path))
_conf.set("spark.repl.class.uri", replUri)
}
// A low-level status-reporting API for monitoring job and stage progress (mostly delegating to jobProgressListener).
/** Create the SparkStatusTracker. */
_statusTracker = new SparkStatusTracker(this)
/** Create the console progress bar (shown only when spark.ui.showConsoleProgress is true and INFO logging is disabled). */
_progressBar =
if (_conf.getBoolean("spark.ui.showConsoleProgress", true) && !log.isInfoEnabled) {
Some(new ConsoleProgressBar(this))
} else {
None
}
/** The Spark UI is created here; it is enabled by default. */
_ui =
if (conf.getBoolean("spark.ui.enabled", true)) {
// SparkUI.createLiveUI() in turn calls create()
Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
_env.securityManager, appName, startTime = startTime))
} else {
// For tests, do not enable the UI
None
}
// Bind the UI before starting the task scheduler to communicate
// the bound port to the cluster manager properly
// This calls WebUI.bind().
_ui.foreach(_.bind()) // Start Jetty. bind() is inherited from WebUI, the class that talks to the actual Jetty server API.
For details, see the SparkUI source walkthrough: http://blog.csdn.net/qq_21383435/article/details/78760594
/**
 * By default Spark uses HDFS as its distributed file system, so the Hadoop-related configuration is assembled here. This includes:
 * 1. loading the Amazon S3 AccessKeyId and SecretAccessKey into the Hadoop Configuration;
 * 2. copying every SparkConf property whose key starts with spark.hadoop. into the Hadoop Configuration (with the prefix stripped);
 * 3. copying SparkConf's spark.buffer.size into the Hadoop Configuration as io.file.buffer.size.
 * Note: if SPARK_YARN_MODE is set, YarnSparkHadoopUtil is used; otherwise the default SparkHadoopUtil.
 */
_hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)
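A small sketch of the passthrough just described; the Hadoop key used here is only an example:
// Any property prefixed with spark.hadoop. ends up in sc.hadoopConfiguration (prefix stripped):
val hadoopDemoConf = new SparkConf()
  .setMaster("local[1]").setAppName("hadoop-conf-demo")
  .set("spark.hadoop.fs.s3a.connection.maximum", "100") // example Hadoop key
  .set("spark.buffer.size", "65536")                    // copied to io.file.buffer.size
val demoSc = new SparkContext(hadoopDemoConf)
println(demoSc.hadoopConfiguration.get("fs.s3a.connection.maximum")) // 100
println(demoSc.hadoopConfiguration.get("io.file.buffer.size"))       // 65536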
// Add each JAR given through the constructor.
if (jars != null) {
jars.foreach(addJar)
}
if (files != null) {
files.foreach(addFile)
}
Executor memory, somewhat confusingly, can be supplied in several ways.
// After the Master sends the scheduling decision to a Worker, the Worker launches the Executor with the information in executorEnvs.
// Executor memory can be set with spark.executor.memory, or through the SPARK_EXECUTOR_MEMORY or SPARK_MEM environment variables.
_executorMemory = _conf.getOption("spark.executor.memory")
.orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
.orElse(Option(System.getenv("SPARK_MEM"))
.map(warnSparkMem))
.map(Utils.memoryStringToMb)
.getOrElse(1024)
// Convert java options to env vars as a workaround
// since we can't set env vars directly in sbt.
for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
executorEnvs(envKey) = value
}
Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
executorEnvs("SPARK_PREPEND_CLASSES") = v
}
// The Mesos scheduler backend relies on this environment variable to set executor memory.
// TODO: Set this only in the Mesos scheduler.
executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
executorEnvs ++= _conf.getExecutorEnv
executorEnvs("SPARK_USER") = sparkUser
// HeartbeatReceiver: the driver-side RPC endpoint that receives executor heartbeats.
/**
 * rpcEnv is an abstract class; env.rpcEnv.setupEndpoint actually invokes NettyRpcEnv's setupEndpoint method.
 */
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
// Create and start the scheduler.
// This creates the SchedulerBackend and the TaskSchedulerImpl.
/**
 * This is where schedulerBackend, taskScheduler and dagScheduler are actually created.
 */
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched // the SchedulerBackend
_taskScheduler = ts // the TaskScheduler
_dagScheduler = new DAGScheduler(this)
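createTaskScheduler decides which backend to build by pattern-matching the master URL. The function below is only a descriptive sketch of that dispatch, not the real implementation:
def describeScheduler(master: String): String = master match {
  case "local"                       => "TaskSchedulerImpl + LocalSchedulerBackend, single thread"
  case m if m.startsWith("local[")   => "LocalSchedulerBackend with the requested number of threads"
  case m if m.startsWith("spark://") => "TaskSchedulerImpl + StandaloneSchedulerBackend (standalone cluster)"
  case _                             => "delegated to an ExternalClusterManager (e.g. YARN)"
}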
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
// Creating the TaskScheduler is not enough: it must be started before it can do any work.
/**
 * Starting taskScheduler here calls TaskSchedulerImpl.start().
 */
_taskScheduler.start()
_applicationId = _taskScheduler.applicationId()
_applicationAttemptId = taskScheduler.applicationAttemptId()
_conf.set("spark.app.id", _applicationId)
Spark job scheduling: http://blog.csdn.net/qq_21383435/article/details/78700430
// SparkUI-related: set spark.ui.proxyBase, which corresponds to the uiRoot parameter in YARN mode.
if (_conf.getBoolean("spark.ui.reverseProxy", false)) {
System.setProperty("spark.ui.proxyBase", "/proxy/" + _applicationId)
}
_ui.foreach(_.setAppId(_applicationId))
/** Initialize the BlockManager. */
_env.blockManager.initialize(_applicationId)
// The metrics system for the driver needs spark.app.id to be set to the app ID,
// so it is started only after we get the app ID from the task scheduler and set spark.app.id.
_env.metricsSystem.start()
// Attach the driver metrics servlet handler to the web ui after the metrics system is started.
_env.metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))
/**
 * Start the event logger. Since isEventLogEnabled defaults to false, it is not started by default.
 */
_eventLogger =
if (isEventLogEnabled) {
val logger =
// create an EventLoggingListener instance
new EventLoggingListener(_applicationId, _applicationAttemptId, _eventLogDir.get,
_conf, _hadoopConfiguration)
// call its start() method
logger.start()
// register it with the listener bus
listenerBus.addListener(logger)
Some(logger)
} else {
None
}
Spark's EventLoggingListener: http://blog.csdn.net/qq_21383435/article/details/78760594
// Optionally scale number of executors dynamically based on workload. Exposed for testing.
/**
 * dynamicAllocationEnabled indicates whether dynamic allocation is enabled in the given conf.
 * It is disabled by default; set spark.dynamicAllocation.enabled to true to turn it on.
 * When enabled, an ExecutorAllocationManager is created and started to grow and shrink the
 * executor set based on workload.
 */
val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
_executorAllocationManager =
if (dynamicAllocationEnabled) {
// In local mode _schedulerBackend is a LocalSchedulerBackend, which is not an ExecutorAllocationClient, so dynamic allocation is not available.
schedulerBackend match {
case b: ExecutorAllocationClient =>
/** Create an ExecutorAllocationManager to allocate executors dynamically. */
Some(new ExecutorAllocationManager(
schedulerBackend.asInstanceOf[ExecutorAllocationClient], listenerBus, _conf))
case _ =>
None
}
} else {
None
}
// call its start() method
_executorAllocationManager.foreach(_.start())
Spark's dynamic resource allocation (ExecutorAllocationManager): http://blog.csdn.net/qq_21383435/article/details/78790231
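Enabling it is purely a matter of configuration; a typical setup (the executor counts are example values, and an external shuffle service is usually required alongside it):
val dynAllocConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")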
/** Create Spark's ContextCleaner, which cleans up no-longer-referenced RDD, shuffle and broadcast state. */
_cleaner =
if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
Some(new ContextCleaner(this))
} else {
None
}
// call its start() method
_cleaner.foreach(_.start())
Source walkthrough: the ContextCleaner:
http://blog.csdn.net/qq_21383435/article/details/78826589
/**
 * Registers the listener classes specified via spark.extraListeners and then starts the listener bus (listenerBus).
 */
setupAndStartListenerBus()
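Concretely, listeners can also be attached declaratively through spark.extraListeners; setupAndStartListenerBus instantiates each named class (it needs a zero-argument or SparkConf constructor), registers it, and only then starts the bus. The class name below is hypothetical:
val listenerConf = new SparkConf()
  .set("spark.extraListeners", "com.example.JobLoggingListener") // hypothetical listener class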
/**
 * Initializing the SparkContext may have changed the environment (e.g. jars or files were added), so post an environment-update event.
 */
postEnvironmentUpdate()
/** Post the application-start event. */
postApplicationStart()
// Post init: wait for the SchedulerBackend to be ready.
_taskScheduler.postStartHook()
// DAGSchedulerSource exposes job- and stage-related metrics.
_env.metricsSystem.registerSource(_dagScheduler.metricsSource)
// Register the BlockManagerSource.
_env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
// Register the metrics source for dynamically allocated executors, if enabled.
_executorAllocationManager.foreach { e =>
_env.metricsSystem.registerSource(e.executorAllocationManagerSource)
}
// Make sure the context is stopped if the user forgets about it. This avoids leaving
// unfinished event logs around after the JVM exits cleanly. It doesn't help if the JVM
// is killed, though.
logDebug("Adding shutdown hook") // force eager creation of logger
/** Register a shutdown hook with ShutdownHookManager so that cleanup runs when the Spark application dies. */
_shutdownHookRef = ShutdownHookManager.addShutdownHook(
ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
logInfo("Invoking stop() from shutdown hook")
// Call stop() here to shut down the SparkContext from the shutdown hook.
stop()
}
Source walkthrough: ShutdownHookManager, the JVM shutdown-hook manager: http://blog.csdn.net/qq_21383435/article/details/78828343
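Stripped of Spark's priority bookkeeping, the idea is just a JVM shutdown hook; a plain-Scala illustration (not the ShutdownHookManager API itself):
sys.addShutdownHook {
  // In SparkContext this is where stop() gets invoked.
  println("JVM exiting - releasing resources")
}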
} catch {
case NonFatal(e) =>
logError("Error initializing SparkContext.", e)
try {
stop()
} catch {
case NonFatal(inner) =>
logError("Error stopping SparkContext after init error.", inner)
} finally {
throw e
}
}
// At the very end of SparkContext initialization, the context's state is switched from contextBeingConstructed (under construction) to activeContext (active).
SparkContext.setActiveContext(this, allowMultipleContexts)
/**
 * Called at the end of the SparkContext constructor to ensure that no other SparkContext has
 * raced with this constructor and started.
 *
 * This is where the bookkeeping for this SparkContext flips from contextBeingConstructed
 * (under construction) to activeContext (active).
 */
private[spark] def setActiveContext(
sc: SparkContext,
allowMultipleContexts: Boolean): Unit = {
SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
// make sure no other thread is constructing or running a SparkContext at the same time
assertNoOtherContextIsRunning(sc, allowMultipleContexts)
contextBeingConstructed = None
activeContext.set(sc)
}
}
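The whole multiple-context guard boils down to a small state machine protected by a lock. A heavily simplified sketch of the idea (not the actual companion-object code; names are made up):
import java.util.concurrent.atomic.AtomicReference

object ContextGuard {
  private val lock = new Object
  private val active = new AtomicReference[AnyRef](null)
  private var underConstruction: Option[AnyRef] = None

  // Corresponds to markPartiallyConstructed: fail fast if another context exists.
  def markPartiallyConstructed(ctx: AnyRef): Unit = lock.synchronized {
    require(active.get() == null && underConstruction.isEmpty,
      "Only one SparkContext may be running in this JVM")
    underConstruction = Some(ctx)
  }

  // Corresponds to setActiveContext: construction finished, promote to active.
  def setActive(ctx: AnyRef): Unit = lock.synchronized {
    underConstruction = None
    active.set(ctx)
  }
}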