Spark 2.0.2 Source Code Analysis: SparkSubmit Task Submission

Sequence diagram:

[Figure 1: sequence diagram of the SparkSubmit submission flow]

1. Client-side task submission:

./spark-submit --master spark://node1:7077 --deploy-mode cluster --class org.apache.spark.examples.SparkPi ../examples/jars/spark-examples_2.11-2.4.0.jar 100

The spark-submit script itself is as follows:

[Figure 2: contents of the spark-submit script. The script simply delegates to bin/spark-class, which launches org.apache.spark.deploy.SparkSubmit with all of the command-line arguments forwarded to it.]

2. Executing the main method of SparkSubmit:

def main(args: Array[String]): Unit = {
    val appArgs = new SparkSubmitArguments(args)
    if (appArgs.verbose) {
      // scalastyle:off println
      printStream.println(appArgs)
      // scalastyle:on println
    }
    appArgs.action match {
      case SparkSubmitAction.SUBMIT => submit(appArgs)
      case SparkSubmitAction.KILL => kill(appArgs)
      case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    }
  }

In the code above, the command-line arguments passed by the spark-submit script arrive through main's args parameter and are parsed by SparkSubmitArguments. The appArgs.action field is then matched to dispatch to the submit, kill, or requestStatus operation.
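For reference, the action field is a value of SparkSubmitAction, a small enumeration defined in the same source file (sketched here from the Spark 2.x source):

private[deploy] object SparkSubmitAction extends Enumeration {
  type SparkSubmitAction = Value
  val SUBMIT, KILL, REQUEST_STATUS = Value
}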

Let's step into SparkSubmitArguments and look at how the arguments are parsed.

  • The relevant initialization code in SparkSubmitArguments is shown below:
// Set parameters from command line arguments
  try {
    // Call parse to extract each option from the command line
    parse(args.asJava)
  } catch {
    // On an IllegalArgumentException, print the error and exit
    case e: IllegalArgumentException =>
      SparkSubmit.printErrorAndExit(e.getMessage())
  }
  // Merge in the default Spark properties; explicitly passed settings override the defaults
  mergeDefaultSparkProperties()
  // Drop entries from sparkProperties whose keys do not start with "spark."
  ignoreNonSparkProperties()
  // Load configuration from environment variables
  loadEnvironmentArguments()
  // Validate that the resulting arguments are legal
  validateArguments()

Parsing is delegated to the parse method of the parent class SparkSubmitOptionParser, which scans args for -- options and splits each one into a name and a value: for example, --master yarn-client is parsed into the name --master and the value yarn-client. The parser then calls SparkSubmitArguments#handle(MASTER, "yarn-client") to process the pair.

// In SparkSubmitOptionParser#parse: each recognized option is dispatched to
// handle(); a false return value stops the parsing loop
if (!handle(name, value)) {
  break;
}
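Options may also be written in the form --name=value. A minimal Scala sketch of how such a token splits into name and value, assuming the (--[^=]+)=(.+) pattern that the parser uses for equals-separated options:

// Sketch of the "--name=value" splitting done inside parse(); the regex is
// assumed to mirror SparkSubmitOptionParser's eqSeparatedOpt pattern
val eqSeparatedOpt = """(--[^=]+)=(.+)""".r

def split(arg: String): (String, Option[String]) = arg match {
  case eqSeparatedOpt(name, value) => (name, Some(value))
  case bare => (bare, None) // the value, if any, comes from the next argument
}

println(split("--master=yarn-client")) // (--master,Some(yarn-client))
println(split("--master"))             // (--master,None)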
  • handle(opt: String, value: String) looks like this:
override protected def handle(opt: String, value: String): Boolean = {
    opt match {
      case NAME =>
        name = value

      case MASTER =>
        master = value

      case CLASS =>
        mainClass = value
      case DEPLOY_MODE =>
        // Anything other than "client" or "cluster" is rejected with an error
        if (value != "client" && value != "cluster") {
          SparkSubmit.printErrorAndExit("--deploy-mode must be either \"client\" or \"cluster\"")
        }
        deployMode = value

...

}

This function is straightforward: it assigns the appropriate member for each opt/value pair. Continuing the earlier example, when parse calls handle("--master", "yarn-client"), the master member is set to yarn-client.

  • Notably, if no --class argument is given, SparkSubmitArguments tries to read the main class from the JAR's manifest:
// Try to set main class from JAR if no --class argument is given
    if (mainClass == null && !isPython && !isR && primaryResource != null) {
      val uri = new URI(primaryResource)
      val uriScheme = uri.getScheme()

      uriScheme match {
        case "file" =>
          try {
            val jar = new JarFile(uri.getPath)
            // Note that this might still return null if no main-class is set; we catch that later
            mainClass = jar.getManifest.getMainAttributes.getValue("Main-Class")
          } catch {
            case e: Exception =>
              SparkSubmit.printErrorAndExit(s"Cannot load main class from JAR $primaryResource")
          }
        case _ =>
          SparkSubmit.printErrorAndExit(
            s"Cannot load main class from JAR $primaryResource with URI $uriScheme. " +
            "Please specify a class through --class.")
      }
    }
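The same lookup is easy to reproduce on its own. A minimal sketch, assuming a local JAR at the hypothetical path /tmp/app.jar:

import java.util.jar.JarFile

// Read the Main-Class manifest attribute, as SparkSubmitArguments does;
// the jar path here is a hypothetical example
val jar = new JarFile("/tmp/app.jar")
val mainClass = Option(jar.getManifest)
  .flatMap(m => Option(m.getMainAttributes.getValue("Main-Class")))
println(mainClass.getOrElse("no Main-Class attribute set"))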


  • Note that the MASTER matched in case MASTER is defined as "--master" in SparkSubmitOptionParser, alongside the other option constants:
  protected final String CLASS = "--class";
  protected final String CONF = "--conf";
  protected final String DEPLOY_MODE = "--deploy-mode";
  protected final String DRIVER_CLASS_PATH = "--driver-class-path";
  protected final String DRIVER_CORES = "--driver-cores";
  protected final String DRIVER_JAVA_OPTIONS = "--driver-java-options";
  protected final String DRIVER_LIBRARY_PATH = "--driver-library-path";
  protected final String DRIVER_MEMORY = "--driver-memory";
  protected final String EXECUTOR_MEMORY = "--executor-memory";
  protected final String FILES = "--files";
  protected final String JARS = "--jars";
  protected final String KILL_SUBMISSION = "--kill";
  protected final String MASTER = "--master";
  protected final String NAME = "--name";
  protected final String PACKAGES = "--packages";
  protected final String PACKAGES_EXCLUDE = "--exclude-packages";
  protected final String PROPERTIES_FILE = "--properties-file";
  protected final String PROXY_USER = "--proxy-user";
  protected final String PY_FILES = "--py-files";
  protected final String REPOSITORIES = "--repositories";
  protected final String STATUS = "--status";
  protected final String TOTAL_EXECUTOR_CORES = "--total-executor-cores";

  // Options that do not take arguments.
  protected final String HELP = "--help";
  protected final String SUPERVISE = "--supervise";
  protected final String USAGE_ERROR = "--usage-error";
  protected final String VERBOSE = "--verbose";
  protected final String VERSION = "--version";

  // Standalone-only options.

  // YARN-only options.
  protected final String ARCHIVES = "--archives";
  protected final String EXECUTOR_CORES = "--executor-cores";
  protected final String KEYTAB = "--keytab";
  protected final String NUM_EXECUTORS = "--num-executors";
  protected final String PRINCIPAL = "--principal";
  protected final String QUEUE = "--queue";


SparkSubmit#main contains the branch case SparkSubmitAction.SUBMIT => submit(appArgs): when the action is a submission, submit(appArgs) is invoked. The appArgs parameter is the SparkSubmitArguments instance holding every submission parameter, both those passed on the command line and the default configuration items.

  • submit(appArgs) is shown below:
private def submit(args: SparkSubmitArguments): Unit = {
    // Returns a four-element tuple
    val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

    def doRunMain(): Unit = {
      if (args.proxyUser != null) {
        val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
          UserGroupInformation.getCurrentUser())
        try {
          proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
            override def run(): Unit = {
              runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
            }
          })
        } catch {
          case e: Exception =>
            if (e.getStackTrace().length == 0) {
              printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
              exitFn(1)
            } else {
              throw e
            }
        }
      } else {
        runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
      }
    }
    // doRunMain is invoked on both paths, whether or not this is standalone cluster mode with the REST gateway
    if (args.isStandaloneCluster && args.useRest) {
      try {
        printStream.println("Running Spark using the REST application submission protocol.")
        doRunMain()
      } catch {
        case e: SubmitRestConnectionException =>
          printWarning(s"Master endpoint ${args.master} was not a REST server. " +
            "Falling back to legacy submission gateway instead.")
          args.useRest = false
          submit(args)
      }
    } else {
      doRunMain()
    }
  }

Here, prepareSubmitEnvironment(args) prepares the submission environment. It returns a four-element tuple: the child process arguments, the child classpath entries, a map of system properties, and the main class for the child process to run. Once the environment is prepared, the next step is to launch the child process.

  • prepareSubmitEnvironment(args):

(0) Determine the deploy mode (client is the default: case "client" | null => CLIENT):

// Set the deploy mode; default is client mode
    var deployMode: Int = args.deployMode match {
      case "client" | null => CLIENT
      case "cluster" => CLUSTER
      case _ => printErrorAndExit("Deploy mode must be either client or cluster"); -1
    }
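CLIENT and CLUSTER here are simple bit-mask constants; a sketch from SparkSubmit's companion object in the Spark 2.x source:

// Deploy modes, encoded as a bit mask
private[deploy] val CLIENT = 1
private[deploy] val CLUSTER = 2
private[deploy] val ALL_DEPLOY_MODES = CLIENT | CLUSTER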

(1) In client mode, childMainClass is set to the application's main class name:

// In client mode, launch the application main class directly
    // In addition, add the main application jar and any added jars (if any) to the classpath
    if (deployMode == CLIENT) {
      childMainClass = args.mainClass
      if (isUserJar(args.primaryResource)) {
        childClasspath += args.primaryResource
      }
      if (args.jars != null) { childClasspath ++= args.jars.split(",") }
      if (args.childArgs != null) { childArgs ++= args.childArgs }
    }

(2) In standalone cluster mode, childMainClass is set to StandaloneRestClient or Client:

// In standalone cluster mode, use the REST client to submit the application (Spark 1.3+).
    // All Spark parameters are expected to be passed to the client through system properties.
    if (args.isStandaloneCluster) {
      if (args.useRest) {
        childMainClass = "org.apache.spark.deploy.rest.StandaloneRestClient"
        childArgs += (args.primaryResource, args.mainClass)
      } else {
        // In legacy standalone cluster mode, use Client as a wrapper around the user class
        childMainClass = "org.apache.spark.deploy.Client"
        if (args.supervise) { childArgs += "--supervise" }
        Option(args.driverMemory).foreach { m => childArgs += ("--memory", m) }
        Option(args.driverCores).foreach { c => childArgs += ("--cores", c) }
        childArgs += "launch"
        childArgs += (args.master, args.primaryResource, args.mainClass)
      }
      if (args.childArgs != null) {
        childArgs ++= args.childArgs
      }
    }
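As a worked example, for the command from section 1 on the legacy gateway path (args.useRest == false), the prepared values would look roughly as follows; the resolved jar URI is illustrative:

// Illustrative only: what prepareSubmitEnvironment produces for
//   --master spark://node1:7077 --deploy-mode cluster ... SparkPi ... 100
// when the legacy submission gateway is used
val childMainClass = "org.apache.spark.deploy.Client"
val childArgs = Seq(
  "launch",
  "spark://node1:7077",                          // args.master
  "file:/path/to/spark-examples_2.11-2.4.0.jar", // args.primaryResource (illustrative URI)
  "org.apache.spark.examples.SparkPi",           // args.mainClass
  "100")                                         // args.childArgs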

(3) In yarn-cluster mode, childMainClass is set to org.apache.spark.deploy.yarn.Client:

// In yarn-cluster mode, use yarn.Client as a wrapper around the user class
    if (isYarnCluster) {
      childMainClass = "org.apache.spark.deploy.yarn.Client"
      if (args.isPython) {
        childArgs += ("--primary-py-file", args.primaryResource)
        childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
      } else if (args.isR) {
        val mainFile = new Path(args.primaryResource).getName
        childArgs += ("--primary-r-file", mainFile)
        childArgs += ("--class", "org.apache.spark.deploy.RRunner")
      } else {
        if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
          childArgs += ("--jar", args.primaryResource)
        }
        childArgs += ("--class", args.mainClass)
      }
      if (args.childArgs != null) {
        args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
      }
    }

Note that when childMainClass is not the application's main class name, the class invoked later via reflection is one of Spark's own wrapper classes. So where does the --class passed on the shell command line actually run? The user's main class is handed to the wrapper through childArgs:

childArgs += ("--class", args.mainClass)

mainMethod.invoke(null, childArgs.toArray)

The wrapper parses this argument and arranges for the user class to run on the driver: in yarn-cluster mode the ApplicationMaster starts it, and in legacy standalone cluster mode it is wrapped by org.apache.spark.deploy.worker.DriverWrapper, as sketched at the end of section 3.

  • childMainClass takes one of the following values: 
  • the application's main class name; // yarn-client and standalone-client modes
  • org.apache.spark.deploy.rest.StandaloneRestClient; // standalone cluster mode, REST gateway
  • org.apache.spark.deploy.Client; // standalone cluster mode, legacy gateway
  • org.apache.spark.deploy.yarn.Client. // yarn-cluster mode
  • The actual launch is carried out by runMain, shown below:
private def runMain(
      childArgs: Seq[String],
      childClasspath: Seq[String],
      sysProps: Map[String, String],
      childMainClass: String,
      verbose: Boolean): Unit = {
    // scalastyle:off println
    if (verbose) {
      printStream.println(s"Main class:\n$childMainClass")
      printStream.println(s"Arguments:\n${childArgs.mkString("\n")}")
      printStream.println(s"System properties:\n${sysProps.mkString("\n")}")
      printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}")
      printStream.println("\n")
    }
    // scalastyle:on println

    val loader =
      if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
        new ChildFirstURLClassLoader(new Array[URL](0),
          Thread.currentThread.getContextClassLoader)
      } else {
        new MutableURLClassLoader(new Array[URL](0),
          Thread.currentThread.getContextClassLoader)
      }
    Thread.currentThread.setContextClassLoader(loader)
    // Walk the classpath entries
    for (jar <- childClasspath) {
      // Add each jar to the classpath through the class loader
      addJarToClasspath(jar, loader)
    }

    for ((key, value) <- sysProps) {
      // Copy every entry in sysProps into the JVM-wide system properties
      System.setProperty(key, value)
    }

    var mainClass: Class[_] = null

    try {
      // Resolve the main class to launch
      mainClass = Utils.classForName(childMainClass)
    } catch {
      case e: ClassNotFoundException =>
        e.printStackTrace(printStream)
        if (childMainClass.contains("thriftserver")) {
          // scalastyle:off println
          printStream.println(s"Failed to load main class $childMainClass.")
          printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
          // scalastyle:on println
        }
        System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
      case e: NoClassDefFoundError =>
        e.printStackTrace(printStream)
        if (e.getMessage.contains("org/apache/hadoop/hive")) {
          // scalastyle:off println
          printStream.println(s"Failed to load hive class.")
          printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
          // scalastyle:on println
        }
        System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
    }

    // SPARK-4170
    if (classOf[scala.App].isAssignableFrom(mainClass)) {
      printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
    }
    // Look up the main method of the class to launch
    val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }

    @tailrec
    def findCause(t: Throwable): Throwable = t match {
      case e: UndeclaredThrowableException =>
        if (e.getCause() != null) findCause(e.getCause()) else e
      case e: InvocationTargetException =>
        if (e.getCause() != null) findCause(e.getCause()) else e
      case e: Throwable =>
        e
    }

    try {
      // Invoke main via reflection, passing childArgs as its arguments
      mainMethod.invoke(null, childArgs.toArray)
    } catch {
      case t: Throwable =>
        findCause(t) match {
          case SparkUserAppException(exitCode) =>
            System.exit(exitCode)

          case t: Throwable =>
            throw t
        }
    }
  }

In the code above, Utils.classForName(childMainClass) resolves the main class to execute, getMethod retrieves its main method, and mainMethod.invoke runs it. Note that invoke is passed childArgs, which carries the configuration prepared earlier.
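The pattern is easy to try in isolation. A minimal, self-contained sketch (the DemoApp object here is hypothetical, standing in for childMainClass):

// Demo of the reflective-launch pattern used by runMain; run
// ReflectiveLauncher to see DemoApp started through reflection
object DemoApp {
  def main(args: Array[String]): Unit = println(s"DemoApp got: ${args.mkString(" ")}")
}

object ReflectiveLauncher {
  def main(args: Array[String]): Unit = {
    // Same three steps as runMain: resolve the class, find its static main, invoke it
    val mainClass = Class.forName("DemoApp") // top-level objects get a static forwarder
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, Array("hello", "spark"))
  }
}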

Note: this mainClass is a different variable from the mainClass in SparkSubmitArguments.

The program then starts the class selected for the given mode and runs its main method. What does that class do? Let's look at how the call completes there.

3. Executing mainMethod in each mode:

/*
Standalone cluster mode (legacy gateway)
*/
object Client {
  def main(args: Array[String]) {
    // scalastyle:off println
    if (!sys.props.contains("SPARK_SUBMIT")) {
      println("WARNING: This client is deprecated and will be removed in a future version of Spark")
      println("Use ./bin/spark-submit with \"--master spark://host:port\"")
    }
    // scalastyle:on println
    // Create the SparkConf
    val conf = new SparkConf()
    val driverArgs = new ClientArguments(args)
    // Set the RPC ask timeout to 10 seconds
    conf.set("spark.rpc.askTimeout", "10")
    Logger.getRootLogger.setLevel(driverArgs.logLevel)
    // Create the RPC environment via RpcEnv.create
    val rpcEnv =
      RpcEnv.create("driverClient", Utils.localHostName(), 0, conf, new SecurityManager(conf))
    // Obtain the RpcEndpointRefs used to talk to the Master(s)
    val masterEndpoints = driverArgs.masters.map(RpcAddress.fromSparkURL).
      map(rpcEnv.setupEndpointRef(_, Master.ENDPOINT_NAME))
    // Register the ClientEndpoint
    rpcEnv.setupEndpoint("client", new ClientEndpoint(rpcEnv, driverArgs, masterEndpoints, conf))
    // Block until the RpcEnv terminates
    rpcEnv.awaitTermination()
  }
}
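The registered ClientEndpoint performs the actual submission. A condensed sketch of what its onStart does (paraphrased from org.apache.spark.deploy.ClientEndpoint in the Spark 2.x source; the classpath, library-path, and Java-option values are shown as empty placeholders):

// How the user class from --class reaches the cluster: it is wrapped in
// DriverWrapper and shipped to the Master inside a DriverDescription
val classPathEntries = Seq.empty[String]   // placeholder
val libraryPathEntries = Seq.empty[String] // placeholder
val extraJavaOpts = Seq.empty[String]      // placeholder

val command = new Command(
  "org.apache.spark.deploy.worker.DriverWrapper",
  Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
  sys.env, classPathEntries, libraryPathEntries, extraJavaOpts)

val driverDescription = new DriverDescription(
  driverArgs.jarUrl, driverArgs.memory, driverArgs.cores, driverArgs.supervise, command)

// The description travels to the Master as a RequestSubmitDriver message; the
// Master schedules it on a Worker, which launches DriverWrapper, which in turn
// reflectively runs the user's main class (SparkPi in our example)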


/*
yarn-client and standalone-client modes: the user's own main class, here SparkPi
*/
import scala.math.random

object SparkPi {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("Spark Pi")
      .getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / (n - 1)}")
    spark.stop()
  }
}



/*
SparkSession.Builder: SparkContext initialization hook
*/
private[this] var userSuppliedContext: Option[SparkContext] = None

// Allows an existing SparkContext to be injected into the builder
private[spark] def sparkContext(sparkContext: SparkContext): Builder = synchronized {
  userSuppliedContext = Option(sparkContext)
  this
}
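When getOrCreate later builds the session, it either reuses this user-supplied context (or an already active one) or creates a new SparkContext from the accumulated options. Condensed from SparkSession.Builder#getOrCreate in the Spark 2.x source:

// Condensed sketch: how getOrCreate obtains its SparkContext
val sparkContext = userSuppliedContext.getOrElse {
  val sparkConf = new SparkConf()
  options.foreach { case (k, v) => sparkConf.set(k, v) }
  // Reuses the active SparkContext if one already exists
  SparkContext.getOrCreate(sparkConf)
}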

This concludes the SparkSubmit task-submission flow.

 

