Spark Source Code Analysis Series (Part 1: Job Submission)

Table of Contents

    • Preface
    • Job Submission Flow Analysis

Preface

This series assumes you already have a working knowledge of Spark. The version analyzed here is Spark 2.2.0; other versions may differ slightly.

Job Submission Flow Analysis

1. When we submit a job to the cluster, the spark-submit script is invoked. It lives in the bin/ directory of the Spark installation. Let's take a look at its contents.

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

The script first checks whether SPARK_HOME is set; if not, it sources find-spark-home, which locates the installation and exports SPARK_HOME as an environment variable. The key line is the final exec, which hands everything over to spark-class with org.apache.spark.deploy.SparkSubmit as the class to run and all of our original arguments forwarded via "$@".
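
To see concretely what reaches SparkSubmit, here is a small sketch (not part of Spark or this script) that invokes spark-submit with scala.sys.process. The install path, master URL, main class, and jar path are all hypothetical placeholders; the point is that everything after the script name is forwarded untouched, through "$@", into SparkSubmit's argument array.

import scala.sys.process._

object SubmitInvocationSketch {
  def main(args: Array[String]): Unit = {
    val sparkHome = sys.env.getOrElse("SPARK_HOME", "/opt/spark") // assumed install location
    val cmd = Seq(
      s"$sparkHome/bin/spark-submit",
      "--class", "com.example.MyApp",     // hypothetical application main class
      "--master", "spark://master:7077",  // hypothetical cluster manager URL
      "/path/to/my-app.jar",              // hypothetical application jar
      "appArg1", "appArg2"                // application arguments
    )
    // spark-submit passes all of these on to spark-class, which in turn
    // launches org.apache.spark.deploy.SparkSubmit with the same arguments.
    val exitCode = cmd.!
    println(s"spark-submit exited with code $exitCode")
  }
}
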
2. Next, let's look at the spark-class script.

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

# For tests
if [[ -n "$SPARK_TESTING" ]]; then
  unset YARN_CONF_DIR
  unset HADOOP_CONF_DIR
fi

# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
#
# The exit code of the launcher is appended to the output, so the parent shell removes it from the
# command array and checks the value to see if the launcher succeeded.
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}

# Certain JVM failures result in errors being printed to stdout (instead of stderr), which causes
# the code that parses the output of the launcher to get confused. In those cases, check if the
# exit code is an integer, and if it's not, handle it as a special error case.
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n-1 1>&2
  exit 1
fi

if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"

That is quite a lot of script, but module by module it does just a few things:
check the SPARK_HOME environment variable again;
locate the Spark jars directory and build the launch classpath;
run org.apache.spark.launcher.Main, which parses the arguments and prints the final command to launch (see the sketch below for the output protocol it uses);
check whether the launcher succeeded by examining the exit code it appends to its output.
If it did, the script execs the resulting command, which runs org.apache.spark.deploy.SparkSubmit.
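
The comment block above build_command describes the protocol: the launcher prints the command arguments separated by NUL characters and appends its own exit code, and the shell loop splits that stream back into an array and pops the exit code off the end. As a rough illustration (this is not Spark's actual code, and the sample output below is fabricated), the same parsing could be sketched in Scala like this:

object LauncherOutputParser {
  private val NUL = 0.toChar

  /** Split NUL-delimited launcher output into (command, exitCode). */
  def parse(raw: Array[Byte]): (Seq[String], Int) = {
    // Arguments are separated by '\0'; the last field is the launcher's exit code.
    val fields = new String(raw, "UTF-8").split(NUL).toSeq
    (fields.dropRight(1), fields.last.toInt)
  }

  def main(args: Array[String]): Unit = {
    // A fabricated example of what the launcher might print.
    val sample = Seq("java", "-cp", "/opt/spark/jars/*",
                     "org.apache.spark.deploy.SparkSubmit", "0").mkString(NUL.toString)
    val (cmd, code) = parse(sample.getBytes("UTF-8"))
    println(s"exit code = $code, command = ${cmd.mkString(" ")}")
  }
}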

3. Now let's open the SparkSubmit class and start with its main method.

override def main(args: Array[String]): Unit = {
    val appArgs = new SparkSubmitArguments(args)
    if (appArgs.verbose) {
      // scalastyle:off println
      printStream.println(appArgs)
      // scalastyle:on println
    }
    appArgs.action match {
      case SparkSubmitAction.SUBMIT => submit(appArgs)
      case SparkSubmitAction.KILL => kill(appArgs)
      case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    }
  }

The first thing main does is construct a new SparkSubmitArguments(args), which parses the command-line options and initializes the submission parameters. The most important piece is:

// Use `sparkProperties` map along with env vars to fill in any missing parameters
loadEnvironmentArguments()

loadEnvironmentArguments() fills in any parameters that were not given on the command line, using environment variables and Spark properties. It also initializes action, which defaults to SUBMIT:

// Action should be SUBMIT unless otherwise specified
action = Option(action).getOrElse(SUBMIT)
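
The general pattern inside loadEnvironmentArguments is "explicit argument, otherwise a Spark property, otherwise an environment variable". As a minimal sketch of that idea (the property contents are made up, and this is not the real method):

object EnvFallbackSketch {
  // Stand-ins for SparkSubmitArguments' state: the value parsed from --master (may be null),
  // properties loaded from spark-defaults.conf, and the process environment.
  var master: String = null
  val sparkProperties = Map("spark.master" -> "spark://master:7077") // hypothetical contents
  val env = sys.env

  def loadEnvironmentArgumentsSketch(): Unit = {
    // Fall back from the explicit flag to the spark.master property, then to the MASTER env var.
    master = Option(master)
      .orElse(sparkProperties.get("spark.master"))
      .orElse(env.get("MASTER"))
      .orNull
  }

  def main(args: Array[String]): Unit = {
    loadEnvironmentArgumentsSketch()
    println(s"resolved master = $master")
  }
}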

Back in SparkSubmit's main method, the match on appArgs.action dispatches to submit(appArgs). Let's first look at that method's doc comment:

/**
   * Submit the application using the provided parameters.
   *
   * This runs in two steps. First, we prepare the launch environment by setting up
   * the appropriate classpath, system properties, and application arguments for
   * running the child main class based on the cluster manager and the deploy mode.
   * Second, we use this launch environment to invoke the main method of the child
   * main class.
   */

In short, submission runs in two steps: first, prepare the launch environment (the classpath, system properties, and application arguments for the child main class) based on the cluster manager and deploy mode; second, use that environment to invoke the child main class's main method.
The flow then goes through doRunMain(), which calls runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose).
Inside runMain, the parts we care about are:

try {
  // Load the child main class by name via reflection (Class.forName under the hood)
  mainClass = Utils.classForName(childMainClass)
}
...
// Look up the static main(String[]) method
val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
try {
  // Invoke the application's main method with the child arguments
  mainMethod.invoke(null, childArgs.toArray)
}
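
So the user application is not spawned as a separate process here: SparkSubmit simply reflects on the child main class and calls its main. A minimal, self-contained sketch of that reflection pattern (the class and arguments are made up for illustration):

// A stand-in "user application" with a conventional main method.
// A Scala object compiles to a class of the same name with a static forwarder for main,
// which is why invoking it with a null receiver works, just as in SparkSubmit.
object MyApp {
  def main(args: Array[String]): Unit =
    println(s"MyApp.main called with: ${args.mkString(", ")}")
}

object ReflectiveMainSketch {
  def main(args: Array[String]): Unit = {
    // Load the class by name, as Utils.classForName(childMainClass) does for the --class argument.
    val mainClass  = Class.forName("MyApp")
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])

    val childArgs = Array("--input", "/tmp/data") // hypothetical child arguments
    mainMethod.invoke(null, childArgs)
  }
}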

Our Spark application then typically starts by creating a SparkContext. In the next part we will pick up from there and walk through Spark's internal initialization after the application is submitted from the client.
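
For reference, this is roughly the kind of user main class that runMain ends up invoking; a minimal sketch (the app name is a placeholder, and in a real spark-submit deployment the master normally comes from --master rather than being hard-coded):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // setMaster("local[*]") is only for running this sketch locally;
    // under spark-submit the master is supplied by the launch environment.
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf) // this is where the next article picks up

    val counts = sc.parallelize(Seq("a b", "b c", "a"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}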
