First, a note: this series is based on the source code of spark-2.4.6.
When we submit a job with spark-submit, we usually use a command of the following form:
spark-submit \
  --class xxxxx \
  --name 'test_xxxx' \
  --master yarn-cluster \
  --queue yarn-test \
  --principal ad-bigdata-test --keytab 'xxxx.keytab' \
  --num-executors 30 \
  --driver-memory 8g \
  --executor-memory 40g \
  --executor-cores 20 \
  --conf spark.task.maxFailures=0 \
  --conf spark.memory.fraction=0.8 \
  --conf spark.storage.memoryFraction=0.2 \
  --conf spark.default.parallelism=600 \
  --conf spark.sql.shuffle.partitions=2400 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.executor.heartbeatInterval=100
When we run this command, the bin/spark-submit script launches the SparkSubmit class on the client, which receives the request from the command line:
def doSubmit(args: Array[String]): Unit = {
  val uninitLog = initializeLogIfNecessary(true, silent = true)

  val appArgs = parseArguments(args)
  if (appArgs.verbose) {
    logInfo(appArgs.toString)
  }
  appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    case SparkSubmitAction.PRINT_VERSION => printVersion()
  }
}
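For context, doSubmit is not invoked directly by the shell: the spark-submit script launches the SparkSubmit companion object, whose main method constructs a SparkSubmit instance (overriding parseArguments so that the arguments are parsed into a SparkSubmitArguments) and then delegates to doSubmit. In spark-2.4.6 it looks roughly like this (slightly abridged):

override def main(args: Array[String]): Unit = {
  val submit = new SparkSubmit() { self =>
    // Build the SparkSubmitArguments, routing its logging through this instance
    override protected def parseArguments(args: Array[String]): SparkSubmitArguments = {
      new SparkSubmitArguments(args) {
        override protected def logInfo(msg: => String): Unit = self.logInfo(msg)
        override protected def logWarning(msg: => String): Unit = self.logWarning(msg)
      }
    }

    override protected def logInfo(msg: => String): Unit = printMessage(msg)
    override protected def logWarning(msg: => String): Unit = printMessage(s"Warning: $msg")

    // Translate a SparkUserAppException into an exit code
    override def doSubmit(args: Array[String]): Unit = {
      try {
        super.doSubmit(args)
      } catch {
        case e: SparkUserAppException =>
          exitFn(e.exitCode)
      }
    }
  }

  submit.doSubmit(args)
}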
parseArguments is called first to process the command-line arguments, which all end up parsed into a SparkSubmitArguments instance; for a SUBMIT action, control eventually reaches the runMain method:
private def runMain(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
  val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)
  // Let the main class re-initialize the logging system once it starts.
  if (uninitLog) {
    Logging.uninitialize()
  }

  if (args.verbose) {
    logInfo(s"Main class:\n$childMainClass")
    logInfo(s"Arguments:\n${childArgs.mkString("\n")}")
    // sysProps may contain sensitive information, so redact before printing
    logInfo(s"Spark config:\n${Utils.redact(sparkConf.getAll.toMap).mkString("\n")}")
    logInfo(s"Classpath elements:\n${childClasspath.mkString("\n")}")
    logInfo("\n")
  }
  val loader =
    if (sparkConf.get(DRIVER_USER_CLASS_PATH_FIRST)) {
      new ChildFirstURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    } else {
      new MutableURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    }
  Thread.currentThread.setContextClassLoader(loader)

  for (jar <- childClasspath) {
    addJarToClasspath(jar, loader)
  }

  var mainClass: Class[_] = null

  try {
    mainClass = Utils.classForName(childMainClass)
  } catch {
    case e: ClassNotFoundException =>
      logWarning(s"Failed to load $childMainClass.", e)
      if (childMainClass.contains("thriftserver")) {
        logInfo(s"Failed to load main class $childMainClass.")
        logInfo("You need to build Spark with -Phive and -Phive-thriftserver.")
      }
      throw new SparkUserAppException(CLASS_NOT_FOUND_EXIT_STATUS)
    case e: NoClassDefFoundError =>
      logWarning(s"Failed to load $childMainClass: ${e.getMessage()}")
      if (e.getMessage.contains("org/apache/hadoop/hive")) {
        logInfo(s"Failed to load hive class.")
        logInfo("You need to build Spark with -Phive and -Phive-thriftserver.")
      }
      throw new SparkUserAppException(CLASS_NOT_FOUND_EXIT_STATUS)
  }

  val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
    mainClass.newInstance().asInstanceOf[SparkApplication]
  } else {
    // SPARK-4170
    if (classOf[scala.App].isAssignableFrom(mainClass)) {
      logWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
    }
    new JavaMainApplication(mainClass)
  }

  // ...

  try {
    app.start(childArgs.toArray, sparkConf)
  } catch {
    // ...
  }
}
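A note on the JavaMainApplication branch above: when the main class does not implement SparkApplication, runMain wraps it so that an ordinary static main method can be launched through the same interface. In spark-2.4.6 the wrapper is roughly the following: it looks up the static main via reflection, copies the SparkConf entries into system properties, and invokes it:

private[deploy] class JavaMainApplication(klass: Class[_]) extends SparkApplication {

  override def start(args: Array[String], conf: SparkConf): Unit = {
    // Locate main(String[]) and make sure it is static
    val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }

    // Expose the Spark configuration to the user code via system properties
    val sysProps = conf.getAll.toMap
    sysProps.foreach { case (k, v) =>
      sys.props(k) = v
    }

    mainMethod.invoke(null, args)
  }
}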
In the prepareSubmitEnvironment method, the environment for launching the Spark application is prepared according to the arguments we passed in. Among other things, it determines the deploy mode; for yarn cluster mode the corresponding launch class is:
org.apache.spark.deploy.yarn.YarnClusterApplication
So what we get back here is in fact YarnClusterApplication. The return value of prepareSubmitEnvironment is the tuple:
val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)
Here childMainClass is YarnClusterApplication, and childArgs collects the options parsed into SparkSubmitArguments as a Seq that is later handed to YarnClusterApplication. In particular, the class we specified with --class on the command line is forwarded like this:
childArgs += ("--class", args.mainClass)
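Putting those pieces together, the yarn-cluster branch of prepareSubmitEnvironment in spark-2.4.6 looks roughly like this (heavily abridged; the Python and R branches are omitted):

if (isYarnCluster) {
  // YARN_CLUSTER_SUBMIT_CLASS = "org.apache.spark.deploy.yarn.YarnClusterApplication"
  childMainClass = YARN_CLUSTER_SUBMIT_CLASS
  if (args.isPython) {
    // ... (Python branch omitted)
  } else if (args.isR) {
    // ... (R branch omitted)
  } else {
    if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
      childArgs += ("--jar", args.primaryResource)
    }
    childArgs += ("--class", args.mainClass)
  }
  // The user application's own arguments are forwarded as repeated --arg options
  if (args.childArgs != null) {
    args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
  }
}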
runMain then checks whether the childMainClass obtained from prepareSubmitEnvironment is a subclass of SparkApplication; YarnClusterApplication is such a subclass:
private[spark] class YarnClusterApplication extends SparkApplication {

  override def start(args: Array[String], conf: SparkConf): Unit = {
    conf.remove("spark.jars")
    conf.remove("spark.files")

    new Client(new ClientArguments(args), conf).run()
  }
}
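Before run() is called, the childArgs assembled earlier (--jar, --class, --arg, ...) are parsed by org.apache.spark.deploy.yarn.ClientArguments. A simplified sketch of that parser, abridged from the spark-2.4.6 source to the options relevant here:

private[spark] class ClientArguments(args: Array[String]) {

  var userJar: String = null
  var userClass: String = null
  var userArgs: ArrayBuffer[String] = new ArrayBuffer[String]()

  parseArgs(args.toList)

  private def parseArgs(inputArgs: List[String]): Unit = {
    var args = inputArgs
    while (!args.isEmpty) {
      args match {
        case ("--jar") :: value :: tail =>
          userJar = value
          args = tail
        case ("--class") :: value :: tail =>
          userClass = value
          args = tail
        case ("--arg") :: value :: tail =>
          userArgs += value
          args = tail
        // ... (other options omitted)
        case Nil =>
        case _ =>
          throw new IllegalArgumentException(getUsageMessage(args))
      }
    }
  }
}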
YarnClusterApplication is then started. As we can see, its start method simply calls the org.apache.spark.deploy.yarn.Client.run() method, which we will cover in the next section. What happens from there is the YARN client submitting the application, as described before; see the earlier article: YARN overall architecture and client programming.