This series assumes the reader already has a working knowledge of Spark. The Spark version used here is 2.2.0; other versions may differ slightly.
1. To run a job we first submit it to the cluster, which invokes the spark-submit script. It lives in the bin directory under the Spark installation; let's take a look at its contents.
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
The script first checks whether SPARK_HOME is set; if it is not, find-spark-home is sourced, which locates the Spark home and exports it as an environment variable. The key part is the final exec line, which hands control to spark-class with org.apache.spark.deploy.SparkSubmit as the class to run.
2. Next, let's look at the spark-class script.
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell removes it from the
# command array and checks the value to see if the launcher succeeded.
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}

# Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
# the code that parses the output of the launcher to get confused. In those cases, check if the
# exit code is an integer, and if it's not, handle it as a special error case.
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"
That is quite a lot of script, but section by section it comes down to a few things:
checking the SPARK_HOME environment variable again;
locating the Spark jars directory and building the launch classpath;
running org.apache.spark.launcher.Main, which parses the submit arguments and prints back the final launch command (see the sketch after this list);
checking whether the launcher succeeded; if it did, the resulting command is exec'd, which starts org.apache.spark.deploy.SparkSubmit.
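To make the contract between spark-class and the launcher concrete, here is a deliberately simplified Scala sketch of what org.apache.spark.launcher.Main conceptually does: resolve the final JVM command and print each argument terminated by a NUL byte, so the shell's while/read loop can recover arguments that contain spaces or shell metacharacters. This is not Spark's implementation; the java path, classpath and the way the command is assembled below are hypothetical placeholders.

// Hypothetical, much-simplified model of the launcher's output contract.
// It only illustrates the NUL-separated protocol described in the script comments.
object LauncherContractSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder command; the real launcher derives the java binary, JVM options,
    // classpath and main class from the spark-submit options, spark-defaults.conf
    // and environment variables.
    val finalCommand = Seq(
      "/usr/bin/java",                          // hypothetical java binary
      "-cp", "/opt/spark/jars/*",               // hypothetical launch classpath
      "org.apache.spark.deploy.SparkSubmit"     // the class spark-class was asked to run
    ) ++ args
    // Each argument is written followed by '\0'; spark-class reads them back
    // with `while IFS= read -d '' -r ARG`.
    finalCommand.foreach(arg => print(arg + "\u0000"))
  }
}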
3. Now open the SparkSubmit class and start with its main method.
override def main(args: Array[String]): Unit = {
  val appArgs = new SparkSubmitArguments(args)
  if (appArgs.verbose) {
    // scalastyle:off println
    printStream.println(appArgs)
    // scalastyle:on println
  }
  appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}
main first constructs a SparkSubmitArguments(args), which parses the command line and initializes the submit parameters. The most important parts are:
// Use `sparkProperties` map along with env vars to fill in any missing parameters
loadEnvironmentArguments()
This fills in any missing parameters from environment variables, Spark properties and so on. The other important piece is the initialization of action, which defaults to SUBMIT:
// Action should be SUBMIT unless otherwise specified
action = Option(action).getOrElse(SUBMIT)
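As a minimal illustration (not Spark's code) of the fallback pattern loadEnvironmentArguments applies: a value given on the command line wins, otherwise a matching Spark property is used, otherwise an environment variable. The hypothetical resolveMaster below sketches the idea for the master setting; the method name and the trailing default exist only for this sketch.

object MasterFallbackSketch {
  // Hypothetical helper mirroring the orElse-chain style used when filling in
  // missing submit arguments.
  def resolveMaster(cliMaster: Option[String],
                    sparkProperties: Map[String, String]): String = {
    cliMaster                                      // 1) --master on the command line
      .orElse(sparkProperties.get("spark.master")) // 2) Spark properties (e.g. spark-defaults.conf)
      .orElse(sys.env.get("MASTER"))               // 3) environment variable
      .getOrElse("local[*]")                       // 4) sketch-only default
  }

  def main(args: Array[String]): Unit = {
    println(resolveMaster(None, Map("spark.master" -> "yarn"))) // prints "yarn"
  }
}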
Back in SparkSubmit's main method, the match on appArgs.action dispatches to the submit(appArgs) method. Its doc comment explains what it does:
/**
* Submit the application using the provided parameters.
*
* This runs in two steps. First, we prepare the launch environment by setting up
* the appropriate classpath, system properties, and application arguments for
* running the child main class based on the cluster manager and the deploy mode.
* Second, we use this launch environment to invoke the main method of the child
* main class.
*/
In other words, submit runs in two steps. First it prepares the launch environment: the classpath, system properties and application arguments needed to run the child main class, which is chosen according to the cluster manager and deploy mode. Second, it uses that launch environment to invoke the main method of the child main class.
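The following Scala sketch is only an illustration of the decision made in step one, not the actual prepareSubmitEnvironment code: in client mode the child main class is the user's own main class, while in cluster mode it is a cluster-manager-specific client class that ships the application. The method name chooseChildMainClass and its signature are invented for this sketch.

object ChildMainClassSketch {
  // Hypothetical, simplified stand-in for part of prepareSubmitEnvironment;
  // the real method also builds childArgs, the classpath and system properties.
  def chooseChildMainClass(deployMode: String, master: String, userMainClass: String): String = {
    if (deployMode == "client") {
      userMainClass                                         // client mode: run the user's class in this JVM
    } else if (master.startsWith("yarn")) {
      "org.apache.spark.deploy.yarn.Client"                 // YARN cluster mode
    } else if (master.startsWith("spark://")) {
      "org.apache.spark.deploy.rest.RestSubmissionClient"   // standalone cluster mode (REST submission)
    } else {
      userMainClass
    }
  }

  def main(args: Array[String]): Unit = {
    println(chooseChildMainClass("cluster", "yarn", "com.example.MyApp"))
    // prints org.apache.spark.deploy.yarn.Client
  }
}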
The flow then goes through doRunMain(), which ends up calling runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose).
Inside runMain, the part we mainly care about is:
try {
  // Load the main class named by the submit arguments via reflection
  // (Utils.classForName wraps Class.forName)
  mainClass = Utils.classForName(childMainClass)
}
...
// Look up its main(Array[String]) method
val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
try {
  // Invoke the application's main method with the child arguments
  mainMethod.invoke(null, childArgs.toArray)
}
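The same load-by-name / getMethod / invoke pattern can be reproduced in isolation. Below is a small, self-contained Scala example that launches another object's main method the way runMain does; HelloApp and ReflectLauncher are invented names for illustration, and the example relies on the static main forwarder the Scala compiler generates for a top-level object.

// A trivial, hypothetical application object standing in for the user's app.
object HelloApp {
  def main(args: Array[String]): Unit = println("Hello, " + args.mkString(" "))
}

// Mirrors the reflection steps in runMain: Class.forName -> getMethod("main") -> invoke.
object ReflectLauncher {
  def main(args: Array[String]): Unit = {
    val mainClass = Class.forName("HelloApp")
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    // main is static here (a compiler-generated forwarder), so the receiver is null.
    mainMethod.invoke(null, Array("from", "reflection"))
  }
}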
A Spark application's main method typically begins by creating a SparkContext. In the next part we will pick up from there and look at how Spark initializes itself internally after an application is submitted from the client.