Spark on YARN Deployment Flow Analysis and Core Runtime Mechanism

1. Submission Flow Diagram

[Figure 1: Spark on YARN submission flow diagram]

Text walkthrough of the submission flow:

1. After bin/spark-submit is executed, the Client assembles a command and sends it to the ResourceManager of the YARN cluster.

The command is bin/java org.apache.spark.deploy.yarn.ApplicationMaster; in non-cluster mode it is bin/java org.apache.spark.deploy.yarn.ExecutorLauncher instead. For example, a shell started with bin/spark-shell --master yarn cannot run in cluster mode.

2. When the ResourceManager receives the command, it picks an appropriate NodeManager to execute it; the command's job is to create and start an ApplicationMaster.

Once the ApplicationMaster has started, its run method in turn starts a Driver.

3. The Driver requests resources from the ResourceManager.

4. The ResourceManager returns a list of available Containers to the Driver.

5. The Driver then sends the NodeManagers a command to start Container JVMs; when a NodeManager receives the command it executes it, starting a JVM process and producing a Container.

6. After the Containers start, Executor launch tasks are placed into a thread pool and executed. Once an Executor backend starts, it registers itself back with the Driver so that the Driver can monitor its execution; the Driver in turn sends messages such as RegisteredExecutor and LaunchTask, which each node handles in its receive method, finally creating and running the actual Executor.

 

2. Tracing the Submission Flow Through the Source Code

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples.jar 100

Submitting the script above amounts to starting a Java process whose entry point is the SparkSubmit object traced below.

The program's call chain is as follows:

1) object SparkSubmit.scala

SparkSubmit is a companion object; every statement in its body other than method definitions is executed when the object is first accessed, much like a static initializer.
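
As a quick standalone illustration (a toy example, not Spark code): the statements in an object's body run once, on first access, before main in the case of an entry-point object.

  object InitDemo {
    // runs once, on first access to the object (like a static initializer)
    println("object body executed")

    def main(args: Array[String]): Unit = {
      println("main executed")   // printed after the object-body line
    }
  }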

  —>main()

  def main(args: Array[String]): Unit = {
    // Parse the command-line arguments and wrap them in a SparkSubmitArguments object
    val appArgs = new SparkSubmitArguments(args)
    if (appArgs.verbose) {
      // scalastyle:off println
      printStream.println(appArgs)
      // scalastyle:on println
    }
    // Pattern match on appArgs.action; action = Option(action).getOrElse(SUBMIT), i.e. it defaults to SUBMIT when not set. We are submitting, so submit(appArgs) runs.
    appArgs.action match {
      case SparkSubmitAction.SUBMIT => submit(appArgs)
      case SparkSubmitAction.KILL => kill(appArgs)
      case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
    }
  }

  —>submit()

  private def submit(args: SparkSubmitArguments): Unit = {
    // Prepare the submit environment: child args, child classpath, system properties, and the child main class
    val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
    ...
    runMain()
  }

  —>runMain()

    // Use the current thread's context class loader as the parent and load all the jars on the classpath
    val loader =
      if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
        new ChildFirstURLClassLoader(new Array[URL](0),
          Thread.currentThread.getContextClassLoader)
      } else {
        new MutableURLClassLoader(new Array[URL](0),
          Thread.currentThread.getContextClassLoader)
      }
    Thread.currentThread.setContextClassLoader(loader)

    for (jar <- childClasspath) {
      addJarToClasspath(jar, loader)
    }

    for ((key, value) <- sysProps) {
      System.setProperty(key, value)
    }

    // Load the main class via reflection and invoke its main method.
    // In cluster deploy mode the main class is org.apache.spark.deploy.yarn.Client;
    // in client mode it is the class we passed in: org.apache.spark.examples.SparkPi
    mainClass = Utils.classForName(childMainClass)
    val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
    mainMethod.invoke(null, childArgs.toArray)

 

2) ——> org.apache.spark.deploy.yarn.Client.scala

  —>main()

  def main(argStrings: Array[String]) {
    // Parse the arguments, then create a Client and call its run method
    val args = new ClientArguments(argStrings)
    new Client(args, sparkConf).run()
  }

  —>run()

  def run(): Unit = {
    this.appId = submitApplication()
  }

  —>submitApplication()

  def submitApplication(): ApplicationId = {
    var appId: ApplicationId = null
    try {
      launcherBackend.connect()
      // Set up credentials and start the YARN client
      setupCredentials()
      yarnClient.init(yarnConf)
      yarnClient.start()
      // The YARN client creates an application on the ResourceManager and gets back a global appId
      val newApp = yarnClient.createApplication()
      val newAppResponse = newApp.getNewApplicationResponse()
      appId = newAppResponse.getApplicationId()
      reportLauncherState(SparkAppHandle.State.SUBMITTED)
      launcherBackend.setAppId(appId.toString)

      new CallerContext("CLIENT", Option(appId.toString)).setCurrentContext()

      // Verify that the cluster has enough resources for the ApplicationMaster
      verifyClusterResources(newAppResponse)

      // Assemble the launch command: bin/java org.apache.spark.deploy.yarn.ApplicationMaster,
      // or bin/java org.apache.spark.deploy.yarn.ExecutorLauncher in non-cluster mode (e.g. bin/spark-shell --master yarn cannot run in cluster mode).
      // The command is submitted to the YARN cluster; the ResourceManager hands it to one of the NodeManagers, which executes it and starts an ApplicationMaster.
      val containerContext = createContainerLaunchContext(newAppResponse)
      val appContext = createApplicationSubmissionContext(newApp, containerContext)

      // Submit the application to the YARN cluster's ResourceManager
      yarnClient.submitApplication(appContext)
      // Return the global appId, which also shows up in the logs
      appId
    } catch {
      ...
    }
  }
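
The command assembly itself happens in createContainerLaunchContext. A condensed sketch of the class-selection part (an illustrative paraphrase, not the exact Spark source; java options, classpath setup, and resource localization are omitted):

  // Which main class the AM container's bin/java command should run
  def amMainClass(isClusterMode: Boolean): String =
    if (isClusterMode) "org.apache.spark.deploy.yarn.ApplicationMaster"
    else "org.apache.spark.deploy.yarn.ExecutorLauncher"

  // The resulting container command looks roughly like:
  //   {{JAVA_HOME}}/bin/java -server <javaOpts> <amMainClass> --class <userClass> ...
  //     1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr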

------------------------ The code above runs in the YARN client and the ResourceManager; the code below runs on the NodeManagers.

3) ——> org.apache.spark.deploy.yarn.ApplicationMaster

  —>main()

  def main(args: Array[String]): Unit = {
    // Wrap the command-line arguments
    val amArgs = new ApplicationMasterArguments(args)
    // Create an ApplicationMaster and call run(); the YarnRMClient argument is the ResourceManager client through which this NodeManager talks to the ResourceManager
    SparkHadoopUtil.get.runAsSparkUser { () =>
      master = new ApplicationMaster(amArgs, new YarnRMClient)
      System.exit(master.run())
    }
  }

  —>run(): its purpose is to start the Driver. To recap the path so far: the yarnClient submitted a command to the ResourceManager, the ResourceManager handed it to one of the NodeManagers, executing it started this ApplicationMaster, and the ApplicationMaster's run method now starts the Driver.

  final def run(): Int = {
    ...
    try {
      if (isClusterMode) {
        // Start the Driver (cluster mode)
        runDriver(securityMgr)
      } else {
        runExecutorLauncher(securityMgr)
      }
    } catch {
    }
  }

  —>runDriver():

  private def runDriver(securityMgr: SecurityManager): Unit = {
    addAmIpFilter()
    // Start the user application (the class we passed in) in a separate thread on this NodeManager
    userClassThread = startUserApplication()

    try {
      val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
        Duration(totalWaitTime, TimeUnit.MILLISECONDS))
        // Register our ApplicationMaster with YARN,
        // establishing the link between the ApplicationMaster and the ResourceManager; the AM and RM are separate Java processes that talk to each other over RPC.
        // The point of registering is to request resources; once granted, Containers are created and run.
        rpcEnv = sc.env.rpcEnv
        val driverRef = runAMEndpoint(
          sc.getConf.get("spark.driver.host"),
          sc.getConf.get("spark.driver.port"),
          isClusterMode = true)
        registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.appUIAddress).getOrElse(""),
          securityMgr)

      userClassThread.join()
    } catch {
    }
  }
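
Note the handoff here: the user thread started in startUserApplication (below) creates the SparkContext, which is passed back through sparkContextPromise, while runDriver blocks on the corresponding future until then. A minimal standalone sketch of that promise handoff pattern (illustrative only, not Spark code):

  import scala.concurrent.{Await, Promise}
  import scala.concurrent.duration._

  // A "driver" thread hands an object back to the thread waiting on the
  // promise, bounded by a timeout (like totalWaitTime above)
  val contextPromise = Promise[String]()

  val driverThread = new Thread(() => {
    Thread.sleep(100)                       // pretend to build a SparkContext
    contextPromise.success("SparkContext")  // signal the waiting thread
  }, "Driver")
  driverThread.start()

  val ctx = Await.result(contextPromise.future, 10.seconds)
  println(s"got $ctx, now the AM can register and request resources")
  driverThread.join()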

  —>startUserApplication(): starts a new thread that runs the main method of our user-defined class; in other words, this is where our own driver code begins to execute.

  private def startUserApplication(): Thread = {
    // Build the class loader for the user classpath
    val classpath = Client.getUserClasspath(sparkConf)
    val userClassLoader =
      if (Client.isUserClassPathFirst(sparkConf, isDriver = true)) {
        new ChildFirstURLClassLoader(urls, Utils.getContextOrSparkClassLoader)
      } else {
        new MutableURLClassLoader(urls, Utils.getContextOrSparkClassLoader)
      }

    var userArgs = args.userArgs
    if (args.primaryPyFile != null && args.primaryPyFile.endsWith(".py")) {
      userArgs = Seq(args.primaryPyFile, "") ++ userArgs
    }
    val mainMethod = userClassLoader.loadClass(args.userClass)
      .getMethod("main", classOf[Array[String]])

    val userThread = new Thread {
      override def run() {
        try {
          // Invoke the main method of our user-defined class
          mainMethod.invoke(null, userArgs.toArray)
          finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
          logDebug("Done running users class")
        } catch {
        } finally {
        }
      }
    }
    userThread.setContextClassLoader(userClassLoader)
    userThread.setName("Driver")
    userThread.start()
    userThread
  }

4) org.apache.spark.deploy.yarn.YarnAllocator.scala: holds a thread pool (launcherPool) used to launch ExecutorRunnable instances; see the sketch after the code below.

  —>allocateResources():

  —>handleAllocatedContainers():

  —>runAllocatedContainers():

  private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
    for (container <- containersToUse) {
      // executorId, executorHostname, containerId, etc. are set up here in the real source (elided)

      def updateInternalState(): Unit = synchronized {
        numExecutorsRunning += 1
        executorIdToContainer(executorId) = container
        containerIdToExecutorId(container.getId) = executorId

        val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
          new HashSet[ContainerId])
        containerSet += containerId
        allocatedContainerToHostMap.put(containerId, executorHostname)
      }

      // Hand the launch off to the launcher thread pool so the slow NodeManager
      // RPC does not block the allocation loop
      launcherPool.execute(new Runnable {
        override def run(): Unit = {
          try {
            new ExecutorRunnable(
              Some(container),
              conf,
              sparkConf,
              driverUrl,
              executorId,
              executorHostname,
              executorMemory,
              executorCores,
              appAttemptId.getApplicationId.toString,
              securityMgr,
              localResources
            ).run()
            updateInternalState()
          } catch {
            ...
          }
        }
      })
    }
  }
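
launcherPool itself is created elsewhere in YarnAllocator as a daemon thread pool; a minimal standalone sketch of the same pattern, using plain java.util.concurrent purely for illustration:

  import java.util.concurrent.Executors

  // Container launches involve slow NodeManager RPC calls, so each one runs
  // on a pool thread instead of blocking the allocation loop
  val launcherPool = Executors.newCachedThreadPool()

  launcherPool.execute(new Runnable {
    override def run(): Unit = {
      // in Spark this is where ExecutorRunnable(...).run() starts the container
      println("launching container on " + Thread.currentThread().getName)
    }
  })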

5) org.apache.spark.deploy.yarn.ExecutorRunnable: its purpose is to send the NodeManager a Java command that starts the Executor process.

  —>run():

  —>startContainer():
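
The command that startContainer asks the NodeManager to run looks roughly like the sketch below (a hypothetical, simplified helper; the flag names follow CoarseGrainedExecutorBackend, while memory settings, classpath, and java options are omitted):

  // Hypothetical sketch of the executor launch command; the real code also adds
  // JVM memory flags, the classpath, and log redirection
  def executorCommand(driverUrl: String, executorId: String, hostname: String,
                      cores: Int, appId: String): Seq[String] =
    Seq("{{JAVA_HOME}}/bin/java", "-server",
      "org.apache.spark.executor.CoarseGrainedExecutorBackend",
      "--driver-url", driverUrl,
      "--executor-id", executorId,
      "--hostname", hostname,
      "--cores", cores.toString,
      "--app-id", appId,
      "1>", "<LOG_DIR>/stdout", "2>", "<LOG_DIR>/stderr")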

6) org.apache.spark.executor.CoarseGrainedExecutorBackend: once it starts, its onStart method registers the executor with the driver so the driver can monitor its execution; the driver then sends messages such as RegisteredExecutor and LaunchTask, which are handled in the receive method on each node, finally creating and running the actual Executor.

  —>main():

  —>run():
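
  —>onStart(): this is where the backend registers itself with the driver, as described above. A condensed paraphrase (not the verbatim source; logging and some error handling are omitted, and details vary across Spark versions):

  override def onStart(): Unit = {
    // Look up the driver endpoint and send it a RegisterExecutor message;
    // the driver answers by sending RegisteredExecutor, handled in receive() below
    rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
      driver = Some(ref)
      ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
    }(ThreadUtils.sameThread).onComplete {
      case Success(_) => // registered; RegisteredExecutor arrives next
      case Failure(e) => exitExecutor(1, s"Cannot register with driver: $driverUrl", e)
    }(ThreadUtils.sameThread)
  }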

  —>receive():

  override def receive: PartialFunction[Any, Unit] = {
    case RegisteredExecutor =>
      logInfo("Successfully registered with driver")
      try {
        executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
      } catch {
        case NonFatal(e) =>
          exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
      }

    case RegisterExecutorFailed(message) =>
      exitExecutor(1, "Slave registration failed: " + message)

    case LaunchTask(data) =>
      if (executor == null) {
        exitExecutor(1, "Received LaunchTask command but executor was null")
      } else {
        val taskDesc = ser.deserialize[TaskDescription](data.value)
        logInfo("Got assigned task " + taskDesc.taskId)
        executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
          taskDesc.name, taskDesc.serializedTask)
      }
    case KillTask(taskId, _, interruptThread) =>
      if (executor == null) {
        exitExecutor(1, "Received KillTask command but executor was null")
      } else {
        executor.killTask(taskId, interruptThread)
      }
    case StopExecutor =>
      stopping.set(true)
      self.send(Shutdown)
    case Shutdown =>
      stopping.set(true)
      new Thread("CoarseGrainedExecutorBackend-stop-executor") {
        override def run(): Unit = {
          executor.stop()
        }
      }.start()
  }

 
