This section walks through the startup flow when spark-submit runs in YARN cluster mode.
When runMain() in the SparkSubmit class reaches start(), local mode would simply enter the main() of the locally submitted --class and begin execution there.
// Start the application instance
app.start(childArgs.toArray, sparkConf)
In YARN cluster mode, however, prepareSubmitEnvironment() has already made that decision while preparing the runtime environment, so start() actually calls start() of org.apache.spark.deploy.yarn.YarnClusterApplication.
// In yarn-cluster mode, use yarn.Client as a wrapper around the user class
// That is, in yarn-cluster mode yarn.Client wraps the execution of the user-submitted class
if (isYarnCluster) {
// Defined in object SparkSubmit as "org.apache.spark.deploy.yarn.YarnClusterApplication"
childMainClass = YARN_CLUSTER_SUBMIT_CLASS
...
}
// Append every user argument to the child arguments as a --arg pair
if (args.childArgs != null) {
args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
}
}
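To make the wiring concrete, here is a hypothetical illustration (command line and values are made up, not taken from the source) of what prepareSubmitEnvironment() ends up producing for a cluster-mode submission:
// For a submission such as:
//   spark-submit --master yarn --deploy-mode cluster --class com.example.WordCount app.jar hdfs:///input
// the prepared child environment looks roughly like:
val childMainClass = "org.apache.spark.deploy.yarn.YarnClusterApplication"
val childArgs = Seq(
  "--jar", "app.jar",                  // the user jar
  "--class", "com.example.WordCount",  // the user's main class
  "--arg", "hdfs:///input")            // each application argument becomes its own --arg pair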
The YarnClusterApplication class is defined in org.apache.spark.deploy.yarn.Client.scala; in essence it stages the resources the application needs and then drives everything through the Client's run().
// Also extends SparkApplication and overrides start()
private[spark] class YarnClusterApplication extends SparkApplication {
override def start(args: Array[String], conf: SparkConf): Unit = {
// SparkSubmit would use yarn cache to distribute files & jars in yarn mode,
// so remove them from sparkConf here for yarn mode.
// In yarn mode the YARN distributed cache is used to ship jars and files, so drop the earlier Spark settings
// (see prepareSubmitEnvironment(), where options() sets the corresponding parameters for each deploy mode)
conf.remove("spark.jars")
conf.remove("spark.files")
// Build the Client instance; a ClientArguments instance is constructed first to parse the arguments
new Client(new ClientArguments(args), conf).run()
}
}
ClientArguments simply takes in the code, jars and arguments that were handed over: --jar, --class, --arg and so on.
// TODO: Add code and support for ensuring that yarn resource 'tasks' are location aware !
private[spark] class ClientArguments(args: Array[String]) {
var userJar: String = null
var userClass: String = null
var primaryPyFile: String = null
var primaryRFile: String = null
var userArgs: ArrayBuffer[String] = new ArrayBuffer[String]()
parseArgs(args.toList)
// Parse the incoming arguments
private def parseArgs(inputArgs: List[String]): Unit = {
var args = inputArgs
while (!args.isEmpty) {
args match {
case ("--jar") :: value :: tail =>
userJar = value
args = tail
case ("--class") :: value :: tail =>
userClass = value
args = tail
case ("--primary-py-file") :: value :: tail =>
primaryPyFile = value
args = tail
case ("--primary-r-file") :: value :: tail =>
primaryRFile = value
args = tail
case ("--arg") :: value :: tail =>
userArgs += value
args = tail
case Nil =>
case _ =>
throw new IllegalArgumentException(getUsageMessage(args))
}
}
// A primary Python file and a primary R file cannot both be set
if (primaryPyFile != null && primaryRFile != null) {
throw new IllegalArgumentException("Cannot have primary-py-file and primary-r-file" +
" at the same time")
}
}
private def getUsageMessage(unknownParam: List[String] = null): String = {
val message = if (unknownParam != null) s"Unknown/unsupported param $unknownParam\n" else ""
message +
s"""
|Usage: org.apache.spark.deploy.yarn.Client [options]
|Options:
| --jar JAR_PATH Path to your application's JAR file (required in yarn-cluster
| mode)
| --class CLASS_NAME Name of your application's main class (required)
| --primary-py-file A main Python file
| --primary-r-file A main R file
| --arg ARG Argument to be passed to your application's main class.
| Multiple invocations are possible, each will be passed in order.
""".stripMargin
}
}
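As a quick sanity check, a hypothetical round trip through ClientArguments (only compilable from within the spark package, since the class is private[spark]; values are illustrative):
import scala.collection.mutable.ArrayBuffer

val parsed = new ClientArguments(Array(
  "--jar", "app.jar",
  "--class", "com.example.WordCount",
  "--arg", "hdfs:///input", "--arg", "hdfs:///output"))
assert(parsed.userJar == "app.jar")
assert(parsed.userClass == "com.example.WordCount")
assert(parsed.userArgs == ArrayBuffer("hdfs:///input", "hdfs:///output"))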
Go straight into Client's run().
private[spark] class Client(
val args: ClientArguments,
val sparkConf: SparkConf)
extends Logging {
...
/**
* Submit an application to the ResourceManager.
* If set spark.yarn.submit.waitAppCompletion to true, it will stay alive
* reporting the application's status until the application has exited for any reason.
* Otherwise, the client process will exit after submission.
* If the application finishes with a failed, killed, or undefined status,
* throw an appropriate SparkException.
*/
// Submit the application to the ResourceManager
def run(): Unit = {
// Submit the application and obtain its id
// With spark.yarn.submit.waitAppCompletion=true the client process stays alive and keeps reporting the app's status until it exits
// If the app finishes with a failed, killed, or undefined status, a SparkException is thrown
this.appId = submitApplication()
// Monitor the application's status
if (!launcherBackend.isConnected() && fireAndForget) {
val report = getApplicationReport(appId)
val state = report.getYarnApplicationState
logInfo(s"Application report for $appId (state: $state)")
logInfo(formatReportDetails(report))
if (state == YarnApplicationState.FAILED || state == YarnApplicationState.KILLED) {
throw new SparkException(s"Application $appId finished with status: $state")
}
} else {
val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
diags.foreach { err =>
logError(s"Application diagnostics message: $err")
}
throw new SparkException(s"Application $appId finished with failed status")
}
if (appState == YarnApplicationState.KILLED || finalState == FinalApplicationStatus.KILLED) {
throw new SparkException(s"Application $appId is killed")
}
if (finalState == FinalApplicationStatus.UNDEFINED) {
throw new SparkException(s"The final status of application $appId is undefined")
}
}
}
}
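For reference, the fireAndForget flag checked above is derived from the deploy mode and spark.yarn.submit.waitAppCompletion; a simplified sketch (not copied verbatim from the source):
// Simplified: cluster mode plus waitAppCompletion=false means "submit and walk away",
// in which case run() logs a single application report instead of blocking until the app finishes.
private val isClusterMode = sparkConf.get("spark.submit.deployMode", "client") == "cluster"
private val fireAndForget = isClusterMode &&
  !sparkConf.getBoolean("spark.yarn.submit.waitAppCompletion", defaultValue = true)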
submitApplication()
Let's look at how the application is submitted and its id obtained.
def submitApplication(): ApplicationId = {
var appId: ApplicationId = null
try {
// Initialize launcherBackend and connect to the LauncherServer (if one is listening)
launcherBackend.connect()
// Initialize the yarnClient
yarnClient.init(hadoopConf)
// Start the yarnClient and connect to the cluster
yarnClient.start()
// Log the number of NodeManagers in the cluster
logInfo("Requesting a new application from cluster with %d NodeManagers"
.format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))
// Get a new application from our RM
val newApp = yarnClient.createApplication()
// Response to the new-application request
val newAppResponse = newApp.getNewApplicationResponse()
// Extract the application id
appId = newAppResponse.getApplicationId()
// Set up the Spark caller context that ends up in the Hadoop (HDFS/YARN) audit logs
new CallerContext("CLIENT", sparkConf.get(APP_CALLER_CONTEXT),
Option(appId.toString)).setCurrentContext()
// Verify whether the cluster has enough resources for our AM
verifyClusterResources(newAppResponse)
// Set up the appropriate contexts to launch our AM
// (build the launch environment, local resources and command for the AM container)
val containerContext = createContainerLaunchContext(newAppResponse)
val appContext = createApplicationSubmissionContext(newApp, containerContext)
// Finally, submit and monitor the application
logInfo(s"Submitting application $appId to ResourceManager")
// Submit the application; appContext describes everything the RM needs
yarnClient.submitApplication(appContext)
// Report the submission to the launcher backend
launcherBackend.setAppId(appId.toString)
reportLauncherState(SparkAppHandle.State.SUBMITTED)
// Return the appId
appId
} catch {
case e: Throwable =>
if (appId != null) {
cleanupStagingDir(appId)
}
throw e
}
}
Now let's walk through that process step by step.
launcherBackend.connect()
launcherBackend is an instance of the LauncherBackend class, which is used mainly to communicate with the LauncherServer.
private val launcherBackend = new LauncherBackend() {
override protected def conf: SparkConf = sparkConf
override def onStopRequest(): Unit = {
// On a stop request: in cluster mode with a known appId, kill the YARN application; otherwise mark the handle as KILLED and stop
if (isClusterMode && appId != null) {
yarnClient.killApplication(appId)
} else {
setState(SparkAppHandle.State.KILLED)
stop()
}
}
}
yarnClient.init(hadoopConf)
The YarnClient instance is actually obtained through YarnClientImpl.
Also in the Client class:
private val yarnClient = YarnClient.createYarnClient
The YarnClient class:
public abstract class YarnClient extends AbstractService {
/**
* Create a new instance of YarnClient.
*/
@Public
public static YarnClient createYarnClient() {
YarnClient client = new YarnClientImpl();
return client;
}
...
}
The YarnClientImpl class:
public class YarnClientImpl extends YarnClient {
...
public YarnClientImpl() {
super(YarnClientImpl.class.getName());
}
...
}
yarnClient.init() and start()
init() and start() are inherited from AbstractService, the base class of YarnClient; they are mostly about checking and advancing the service state.
public abstract class AbstractService implements Service {
...
/**
* {@inheritDoc}
* This invokes {@link #serviceInit}
* @param conf the configuration of the service. This must not be null
* @throws ServiceStateException if the configuration was null,
* the state change not permitted, or something else went wrong
*/
@Override
public void init(Configuration conf) {
if (conf == null) {
throw new ServiceStateException("Cannot initialize service "
+ getName() + ": null configuration");
}
// Already in the INITED state: nothing to do
if (isInState(STATE.INITED)) {
return;
}
synchronized (stateChangeLock) {
if (enterState(STATE.INITED) != STATE.INITED) {
setConfig(conf);
try {
// Run the subclass's initialization
serviceInit(config);
if (isInState(STATE.INITED)) {
//if the service ended up here during init,
//notify the listeners
notifyListeners();
}
} catch (Exception e) {
noteFailure(e);
ServiceOperations.stopQuietly(LOG, this);
throw ServiceStateException.convert(e);
}
}
}
}
/**
* {@inheritDoc}
* @throws ServiceStateException if the current service state does not permit
* this action
*/
@Override
public void start() {
if (isInState(STATE.STARTED)) {
return;
}
//enter the started state
synchronized (stateChangeLock) {
if (stateModel.enterState(STATE.STARTED) != STATE.STARTED) {
try {
startTime = System.currentTimeMillis();
// Run the subclass's startup logic
serviceStart();
if (isInState(STATE.STARTED)) {
//if the service started (and isn't now in a later state), notify
if (LOG.isDebugEnabled()) {
LOG.debug("Service " + getName() + " is started");
}
notifyListeners();
}
} catch (Exception e) {
noteFailure(e);
ServiceOperations.stopQuietly(LOG, this);
throw ServiceStateException.convert(e);
}
}
}
}
...
}
yarnClient.getYarnClusterMetrics.getNumNodeManagers
Gets the number of NodeManagers in the cluster.
public abstract class YarnClient extends AbstractService {
...
/**
*
* Get metrics ({@link YarnClusterMetrics}) about the cluster.
*
*
* @return cluster metrics
* @throws YarnException
* @throws IOException
*/
public abstract YarnClusterMetrics getYarnClusterMetrics() throws YarnException,
IOException;
...
}
getNumNodeManagers
The number of YARN nodes was already collected by the metrics system when the YARN cluster was brought up; that will be covered separately later.
/**
* YarnClusterMetrics represents cluster metrics.
*
* Currently only number of NodeManagers is provided.
*/
@Public
@Stable
public abstract class YarnClusterMetrics {
@Private
@Unstable
public static YarnClusterMetrics newInstance(int numNodeManagers) {
YarnClusterMetrics metrics = Records.newRecord(YarnClusterMetrics.class);
metrics.setNumNodeManagers(numNodeManagers);
return metrics;
}
/**
* Get the number of NodeManagers in the cluster.
* @return number of NodeManagers in the cluster
*/
@Public
@Stable
public abstract int getNumNodeManagers();
@Private
@Unstable
public abstract void setNumNodeManagers(int numNodeManagers);
}
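Outside of Spark, the same lifecycle can be driven directly against the Hadoop client API; a minimal standalone sketch (cluster settings assumed to be available on the classpath):
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnConf = new YarnConfiguration()
val client = YarnClient.createYarnClient()
client.init(yarnConf)   // NOTINITED -> INITED: serviceInit() binds the configuration
client.start()          // INITED -> STARTED: serviceStart() opens the proxy to the ResourceManager
try {
  println(s"NodeManagers: ${client.getYarnClusterMetrics.getNumNodeManagers}")
} finally {
  client.stop()         // STARTED -> STOPPED
}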
Back in Client, continuing on: creating the application on the RM.
// Ask the RM to create a new application
val newApp = yarnClient.createApplication()
// Response to the new-application request
val newAppResponse = newApp.getNewApplicationResponse()
// Extract the application id
appId = newAppResponse.getApplicationId()
yarnClient.createApplication() is declared in the YarnClient class.
public abstract YarnClientApplication createApplication()
throws YarnException, IOException;
YarnClientApplication
It mainly carries the application's context information.
public class YarnClientApplication {
private final GetNewApplicationResponse newAppResponse;
private final ApplicationSubmissionContext appSubmissionContext;
public YarnClientApplication(GetNewApplicationResponse newAppResponse,
ApplicationSubmissionContext appContext) {
this.newAppResponse = newAppResponse;
this.appSubmissionContext = appContext;
}
public GetNewApplicationResponse getNewApplicationResponse() {
return newAppResponse;
}
public ApplicationSubmissionContext getApplicationSubmissionContext() {
return appSubmissionContext;
}
}
GetNewApplicationResponse
getApplicationId() here is what yields the appId.
public abstract class GetNewApplicationResponse {
@Private
@Unstable
public static GetNewApplicationResponse newInstance(
ApplicationId applicationId, Resource minCapability,
Resource maxCapability) {
GetNewApplicationResponse response =
Records.newRecord(GetNewApplicationResponse.class);
response.setApplicationId(applicationId);
response.setMaximumResourceCapability(maxCapability);
return response;
}
/**
* Get the new ApplicationId allocated by the ResourceManager.
* @return new ApplicationId allocated by the ResourceManager
*/
@Public
@Stable
// Returns the appId
public abstract ApplicationId getApplicationId();
@Private
@Unstable
public abstract void setApplicationId(ApplicationId applicationId);
/**
* Get the maximum capability for any {@link Resource} allocated by the
* ResourceManager in the cluster.
* @return maximum capability of allocated resources in the cluster
*/
@Public
@Stable
public abstract Resource getMaximumResourceCapability();
@Private
@Unstable
public abstract void setMaximumResourceCapability(Resource capability);
}
ApplicationId
public abstract class ApplicationId implements Comparable<ApplicationId> {
@Private
@Unstable
public static final String appIdStrPrefix = "application_";
@Private
@Unstable
public static ApplicationId newInstance(long clusterTimestamp, int id) {
ApplicationId appId = Records.newRecord(ApplicationId.class);
appId.setClusterTimestamp(clusterTimestamp);
appId.setId(id);
appId.build();
return appId;
}
...
}
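As an illustration (timestamp and counter are made up), the id renders with the application_ prefix, the RM start timestamp, and a zero-padded sequence number:
val appId = ApplicationId.newInstance(1700000000000L, 42)
// appId.toString is expected to look like "application_1700000000000_0042"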
Back to Client again.
// Set up the Spark caller context that ends up in the Hadoop (HDFS/YARN) audit logs
new CallerContext("CLIENT", sparkConf.get(APP_CALLER_CONTEXT),
Option(appId.toString)).setCurrentContext()
/**
* An utility class used to set up Spark caller contexts to HDFS and Yarn. The `context` will be
* constructed by parameters passed in.
* When Spark applications run on Yarn and HDFS, its caller contexts will be written into Yarn RM
* audit log and hdfs-audit.log. That can help users to better diagnose and understand how
* specific applications impacting parts of the Hadoop system and potential problems they may be
* creating (e.g. overloading NN). As HDFS mentioned in HDFS-9184, for a given HDFS operation, it's
* very helpful to track which upper level job issues it.
*
* @param from who sets up the caller context (TASK, CLIENT, APPMASTER)
*
* The parameters below are optional:
* @param upstreamCallerContext caller context the upstream application passes in
* @param appId id of the app this task belongs to
* @param appAttemptId attempt id of the app this task belongs to
* @param jobId id of the job this task belongs to
* @param stageId id of the stage this task belongs to
* @param stageAttemptId attempt id of the stage this task belongs to
* @param taskId task id
* @param taskAttemptNumber task attempt id
*/
private[spark] class CallerContext(
from: String,
upstreamCallerContext: Option[String] = None,
appId: Option[String] = None,
appAttemptId: Option[String] = None,
jobId: Option[Int] = None,
stageId: Option[Int] = None,
stageAttemptId: Option[Int] = None,
taskId: Option[Long] = None,
taskAttemptNumber: Option[Int] = None) extends Logging {
private val context = prepareContext("SPARK_" +
from +
appId.map("_" + _).getOrElse("") +
appAttemptId.map("_" + _).getOrElse("") +
jobId.map("_JId_" + _).getOrElse("") +
stageId.map("_SId_" + _).getOrElse("") +
stageAttemptId.map("_" + _).getOrElse("") +
taskId.map("_TId_" + _).getOrElse("") +
taskAttemptNumber.map("_" + _).getOrElse("") +
upstreamCallerContext.map("_" + _).getOrElse(""))
private def prepareContext(context: String): String = {
// The default max size of Hadoop caller context is 128
lazy val len = SparkHadoopUtil.get.conf.getInt("hadoop.caller.context.max.size", 128)
if (context == null || context.length <= len) {
context
} else {
val finalContext = context.substring(0, len)
logWarning(s"Truncated Spark caller context from $context to $finalContext")
finalContext
}
}
/**
* Set up the caller context [[context]] by invoking Hadoop CallerContext API of
* [[org.apache.hadoop.ipc.CallerContext]], which was added in hadoop 2.8.
*/
def setCurrentContext(): Unit = {
if (CallerContext.callerContextSupported) {
try {
val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
val builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
val builderInst = builder.getConstructor(classOf[String]).newInstance(context)
val hdfsContext = builder.getMethod("build").invoke(builderInst)
callerContext.getMethod("setCurrent", callerContext).invoke(null, hdfsContext)
} catch {
case NonFatal(e) =>
logWarning("Fail to set Spark caller context", e)
}
}
}
}
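So for the client-side call shown earlier, with no upstream caller context configured, the resulting string would look roughly like this (appId made up):
// "SPARK_" + from + "_" + appId  =>  "SPARK_CLIENT_application_1700000000000_0042"
new CallerContext("CLIENT", None, Some("application_1700000000000_0042")).setCurrentContext()
// Once setCurrentContext() succeeds, that string shows up in the RM audit log and hdfs-audit.log.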
Moving on, still in Client:
// Verify that the cluster has enough resources to run the AM
verifyClusterResources(newAppResponse)
/**
* Fail fast if we have requested more resources per container than is available in the cluster.
*/
private def verifyClusterResources(newAppResponse: GetNewApplicationResponse): Unit = {
// Maximum memory a single container may be allocated
val maxMem = newAppResponse.getMaximumResourceCapability().getMemory()
logInfo("Verifying our application has not requested more than the maximum " +
s"memory capability of the cluster ($maxMem MB per container)")
// Memory required per executor (heap + overhead + PySpark memory)
val executorMem = executorMemory + executorMemoryOverhead + pysparkWorkerMemory
if (executorMem > maxMem) {
throw new IllegalArgumentException(s"Required executor memory ($executorMemory), overhead " +
s"($executorMemoryOverhead MB), and PySpark memory ($pysparkWorkerMemory MB) is above " +
s"the max threshold ($maxMem MB) of this cluster! Please check the values of " +
s"'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.")
}
// Memory required by the AM (heap + overhead)
val amMem = amMemory + amMemoryOverhead
if (amMem > maxMem) {
throw new IllegalArgumentException(s"Required AM memory ($amMemory" +
s"+$amMemoryOverhead MB) is above the max threshold ($maxMem MB) of this cluster! " +
"Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or " +
"'yarn.nodemanager.resource.memory-mb'.")
}
logInfo("Will allocate AM container, with %d MB memory including %d MB overhead".format(
amMem,
amMemoryOverhead))
// We could add checks to make sure the entire cluster has enough resources but that involves
// getting all the node reports and computing ourselves.
}
executorMemory: spark.executor.memory, default 1g.
executorMemoryOverhead: spark.executor.memoryOverhead if set, otherwise max(384 MB, 0.10 * executorMemory).
amMemory: in yarn-cluster mode it is the driver memory, spark.driver.memory (default 1g); in yarn-client mode it is spark.yarn.am.memory (default 512m).
amMemoryOverhead: in yarn-cluster mode, spark.driver.memoryOverhead if set, otherwise max(384 MB, 0.10 * spark.driver.memory); in yarn-client mode, spark.yarn.am.memoryOverhead if set, otherwise max(384 MB, 0.10 * spark.yarn.am.memory).
pysparkWorkerMemory: spark.executor.pyspark.memory if set, otherwise 0.
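A worked example with assumed values (not from the source), using the 10% / 384 MB overhead rule:
val executorMemory = 4096                                                    // spark.executor.memory=4g, in MB
val executorMemoryOverhead = math.max((0.10 * executorMemory).toLong, 384)   // 409 MB
val executorMem = executorMemory + executorMemoryOverhead                    // 4505 MB (PySpark memory: 0)
val driverMemory = 2048                                                      // spark.driver.memory=2g -> amMemory in cluster mode
val amMemoryOverhead = math.max((0.10 * driverMemory).toLong, 384)           // 384 MB
val amMem = driverMemory + amMemoryOverhead                                  // 2432 MB
// Both totals must stay within yarn.scheduler.maximum-allocation-mb, otherwise verifyClusterResources() throws.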
containerContext
// Build the launch context for the AM container (environment, local resources, launch command)
val containerContext = createContainerLaunchContext(newAppResponse)
val appContext = createApplicationSubmissionContext(newApp, containerContext)
createContainerLaunchContext()
/**
* Set up a ContainerLaunchContext to launch our ApplicationMaster container.
* This sets up the launch environment, java options, and the command for launching the AM.
*/
private def createContainerLaunchContext(newAppResponse: GetNewApplicationResponse)
: ContainerLaunchContext = {
logInfo("Setting up container launch context for our AM")
val appId = newAppResponse.getApplicationId
val appStagingDirPath = new Path(appStagingBaseDir, getAppStagingDir(appId))
val pySparkArchives =
if (sparkConf.get(IS_PYTHON_APP)) {
findPySparkArchives()
} else {
Nil
}
// Build the launch environment variables
val launchEnv = setupLaunchEnv(appStagingDirPath, pySparkArchives)
// Prepare the local resources (jars, conf, archives) to distribute with the app
val localResources = prepareLocalResources(appStagingDirPath, pySparkArchives)
val amContainer = Records.newRecord(classOf[ContainerLaunchContext])
amContainer.setLocalResources(localResources.asJava)
amContainer.setEnvironment(launchEnv.asJava)
val javaOpts = ListBuffer[String]()
// Set the environment variable through a command prefix
// to append to the existing value of the variable
var prefixEnv: Option[String] = None
// Add Xmx for AM memory
javaOpts += "-Xmx" + amMemory + "m"
val tmpDir = new Path(Environment.PWD.$$(), YarnConfiguration.DEFAULT_CONTAINER_TEMP_DIR)
javaOpts += "-Djava.io.tmpdir=" + tmpDir
// TODO: Remove once cpuset version is pushed out.
// The context is, default gc for server class machines ends up using all cores to do gc -
// hence if there are multiple containers in same node, Spark GC affects all other containers'
// performance (which can be that of other Spark containers)
// Instead of using this, rely on cpusets by YARN to enforce "proper" Spark behavior in
// multi-tenant environments. Not sure how default Java GC behaves if it is limited to subset
// of cores on a node.
// JVM memory and GC options for the AM
// SPARK_USE_CONC_INCR_GC: whether to use the CMS collector, disabled by default
val useConcurrentAndIncrementalGC = launchEnv.get("SPARK_USE_CONC_INCR_GC").exists(_.toBoolean)
if (useConcurrentAndIncrementalGC) {
// In our expts, using (default) throughput collector has severe perf ramifications in
// multi-tenant machines
javaOpts += "-XX:+UseConcMarkSweepGC"
javaOpts += "-XX:MaxTenuringThreshold=31"
javaOpts += "-XX:SurvivorRatio=8"
javaOpts += "-XX:+CMSIncrementalMode"
javaOpts += "-XX:+CMSIncrementalPacing"
javaOpts += "-XX:CMSIncrementalDutyCycleMin=0"
javaOpts += "-XX:CMSIncrementalDutyCycle=10"
}
// Include driver-specific java options if we are launching a driver
// Driver-specific java options, used when launching a driver (cluster mode)
if (isClusterMode) {
sparkConf.get(DRIVER_JAVA_OPTIONS).foreach { opts =>
javaOpts ++= Utils.splitCommandString(opts)
.map(Utils.substituteAppId(_, appId.toString))
.map(YarnSparkHadoopUtil.escapeForShell)
}
val libraryPaths = Seq(sparkConf.get(DRIVER_LIBRARY_PATH),
sys.props.get("spark.driver.libraryPath")).flatten
if (libraryPaths.nonEmpty) {
prefixEnv = Some(createLibraryPathPrefix(libraryPaths.mkString(File.pathSeparator),
sparkConf))
}
if (sparkConf.get(AM_JAVA_OPTIONS).isDefined) {
logWarning(s"${AM_JAVA_OPTIONS.key} will not take effect in cluster mode")
}
} else {
// Validate and include yarn am specific java options in yarn-client mode.
sparkConf.get(AM_JAVA_OPTIONS).foreach { opts =>
if (opts.contains("-Dspark")) {
val msg = s"${AM_JAVA_OPTIONS.key} is not allowed to set Spark options (was '$opts')."
throw new SparkException(msg)
}
if (opts.contains("-Xmx")) {
val msg = s"${AM_JAVA_OPTIONS.key} is not allowed to specify max heap memory settings " +
s"(was '$opts'). Use spark.yarn.am.memory instead."
throw new SparkException(msg)
}
javaOpts ++= Utils.splitCommandString(opts)
.map(Utils.substituteAppId(_, appId.toString))
.map(YarnSparkHadoopUtil.escapeForShell)
}
sparkConf.get(AM_LIBRARY_PATH).foreach { paths =>
prefixEnv = Some(createLibraryPathPrefix(paths, sparkConf))
}
}
// For log4j configuration to reference
javaOpts += ("-Dspark.yarn.app.container.log.dir=" + ApplicationConstants.LOG_DIR_EXPANSION_VAR)
val userClass =
if (isClusterMode) {
Seq("--class", YarnSparkHadoopUtil.escapeForShell(args.userClass))
} else {
Nil
}
val userJar =
if (args.userJar != null) {
Seq("--jar", args.userJar)
} else {
Nil
}
val primaryPyFile =
if (isClusterMode && args.primaryPyFile != null) {
Seq("--primary-py-file", new Path(args.primaryPyFile).getName())
} else {
Nil
}
val primaryRFile =
if (args.primaryRFile != null) {
Seq("--primary-r-file", args.primaryRFile)
} else {
Nil
}
val amClass =
if (isClusterMode) {
Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
} else {
Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
}
if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
args.userArgs = ArrayBuffer(args.primaryRFile) ++ args.userArgs
}
val userArgs = args.userArgs.flatMap { arg =>
Seq("--arg", YarnSparkHadoopUtil.escapeForShell(arg))
}
// All the arguments for the AM
val amArgs =
Seq(amClass) ++ userClass ++ userJar ++ primaryPyFile ++ primaryRFile ++ userArgs ++
Seq("--properties-file", buildPath(Environment.PWD.$$(), LOCALIZED_CONF_DIR, SPARK_CONF_FILE))
// Command for the ApplicationMaster
// Build the command that launches the ApplicationMaster
val commands = prefixEnv ++
Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
javaOpts ++ amArgs ++
Seq(
"1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
"2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")
// TODO: it would be nicer to just make sure there are no null commands here
val printableCommands = commands.map(s => if (s == null) "null" else s).toList
amContainer.setCommands(printableCommands.asJava)
logDebug("===============================================================================")
logDebug("YARN AM launch context:")
logDebug(s" user class: ${Option(args.userClass).getOrElse("N/A")}")
logDebug(" env:")
if (log.isDebugEnabled) {
Utils.redact(sparkConf, launchEnv.toSeq).foreach { case (k, v) =>
logDebug(s" $k -> $v")
}
}
logDebug(" resources:")
localResources.foreach { case (k, v) => logDebug(s" $k -> $v")}
logDebug(" command:")
logDebug(s" ${printableCommands.mkString(" ")}")
logDebug("===============================================================================")
// send the acl settings into YARN to control who has access via YARN interfaces
val securityManager = new SecurityManager(sparkConf)
amContainer.setApplicationACLs(
YarnSparkHadoopUtil.getApplicationAclsForYarn(securityManager).asJava)
setupSecurityToken(amContainer)
amContainer
}
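Putting the pieces together, for a cluster-mode application the command list assembled above renders to something like the following (memory size, class names and conf file paths are illustrative):
// {{JAVA_HOME}}/bin/java -server -Xmx2048m -Djava.io.tmpdir={{PWD}}/tmp \
//   -Dspark.yarn.app.container.log.dir=<LOG_DIR> \
//   org.apache.spark.deploy.yarn.ApplicationMaster \
//   --class 'com.example.WordCount' --jar app.jar --arg 'hdfs:///input' \
//   --properties-file {{PWD}}/__spark_conf__/__spark_conf__.properties \
//   1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr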
createApplicationSubmissionContext()
Sets up the context for submitting our ApplicationMaster.
/**
* Set up the context for submitting our ApplicationMaster.
* This uses the YarnClientApplication not available in the Yarn alpha API.
*/
def createApplicationSubmissionContext(
newApp: YarnClientApplication,
containerContext: ContainerLaunchContext): ApplicationSubmissionContext = {
val appContext = newApp.getApplicationSubmissionContext
appContext.setApplicationName(sparkConf.get("spark.app.name", "Spark"))
appContext.setQueue(sparkConf.get(QUEUE_NAME))
appContext.setAMContainerSpec(containerContext)
appContext.setApplicationType("SPARK")
sparkConf.get(APPLICATION_TAGS).foreach { tags =>
appContext.setApplicationTags(new java.util.HashSet[String](tags.asJava))
}
sparkConf.get(MAX_APP_ATTEMPTS) match {
case Some(v) => appContext.setMaxAppAttempts(v)
case None => logDebug(s"${MAX_APP_ATTEMPTS.key} is not set. " +
"Cluster's default value will be used.")
}
sparkConf.get(AM_ATTEMPT_FAILURE_VALIDITY_INTERVAL_MS).foreach { interval =>
appContext.setAttemptFailuresValidityInterval(interval)
}
val capability = Records.newRecord(classOf[Resource])
capability.setMemory(amMemory + amMemoryOverhead)
capability.setVirtualCores(amCores)
sparkConf.get(AM_NODE_LABEL_EXPRESSION) match {
case Some(expr) =>
val amRequest = Records.newRecord(classOf[ResourceRequest])
amRequest.setResourceName(ResourceRequest.ANY)
amRequest.setPriority(Priority.newInstance(0))
amRequest.setCapability(capability)
amRequest.setNumContainers(1)
amRequest.setNodeLabelExpression(expr)
appContext.setAMContainerResourceRequest(amRequest)
case None =>
appContext.setResource(capability)
}
sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
try {
val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
// These two methods were added in Hadoop 2.6.4, so we still need to use reflection to
// avoid compile error when building against Hadoop 2.6.0 ~ 2.6.3.
val setRolledLogsIncludePatternMethod =
logAggregationContext.getClass.getMethod("setRolledLogsIncludePattern", classOf[String])
setRolledLogsIncludePatternMethod.invoke(logAggregationContext, includePattern)
sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
val setRolledLogsExcludePatternMethod =
logAggregationContext.getClass.getMethod("setRolledLogsExcludePattern", classOf[String])
setRolledLogsExcludePatternMethod.invoke(logAggregationContext, excludePattern)
}
appContext.setLogAggregationContext(logAggregationContext)
} catch {
case NonFatal(e) =>
logWarning(s"Ignoring ${ROLLED_LOG_INCLUDE_PATTERN.key} because the version of YARN " +
"does not support it", e)
}
}
appContext
}
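The entries read above map to user-facing configuration keys; a few of the ones involved (values illustrative):
// spark.app.name=WordCount                       -> application name shown in the RM UI
// spark.yarn.queue=prod                          -> QUEUE_NAME
// spark.yarn.maxAppAttempts=2                    -> MAX_APP_ATTEMPTS
// spark.yarn.am.attemptFailuresValidityInterval  -> AM_ATTEMPT_FAILURE_VALIDITY_INTERVAL_MS
// spark.yarn.am.nodeLabelExpression              -> AM_NODE_LABEL_EXPRESSION (per-AM ResourceRequest)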
yarnClient.submitApplication()
Submits the application to YARN; the submission context carries everything the RM needs to run it.
/**
* Submit a new application to YARN. It is a blocking call - it
* will not return {@link ApplicationId} until the submitted application is
* submitted successfully and accepted by the ResourceManager.
*
* Users should provide an {@link ApplicationId} as part of the parameter
* {@link ApplicationSubmissionContext} when submitting a new application,
* otherwise it will throw the {@link ApplicationIdNotProvidedException}.
*
* This internally calls {@link ApplicationClientProtocol#submitApplication
* (SubmitApplicationRequest)}, and after that, it internally invokes
* {@link ApplicationClientProtocol#getApplicationReport
* (GetApplicationReportRequest)} and waits till it can make sure that the
* application gets properly submitted. If RM fails over or RM restart
* happens before ResourceManager saves the application's state,
* {@link ApplicationClientProtocol
* #getApplicationReport(GetApplicationReportRequest)} will throw
* the {@link ApplicationNotFoundException}. This API automatically resubmits
* the application with the same {@link ApplicationSubmissionContext} when it
* catches the {@link ApplicationNotFoundException}
*
* @param appContext
* {@link ApplicationSubmissionContext} containing all the details
* needed to submit a new application
* @return {@link ApplicationId} of the accepted application
* @throws YarnException
* @throws IOException
* @see #createApplication()
*/
public abstract ApplicationId submitApplication(
ApplicationSubmissionContext appContext) throws YarnException,
IOException;
launcherBackend.setAppId(appId.toString)
private val launcherBackend = new LauncherBackend() {
override protected def conf: SparkConf = sparkConf
override def onStopRequest(): Unit = {
if (isClusterMode && appId != null) {
yarnClient.killApplication(appId)
} else {
setState(SparkAppHandle.State.KILLED)
stop()
}
}
}
LauncherBackend
/**
* A class that can be used to talk to a launcher server. Users should extend this class to
* provide implementation for the abstract methods.
*
* See `LauncherServer` for an explanation of how launcher communication works.
*/
private[spark] abstract class LauncherBackend {
private var clientThread: Thread = _
private var connection: BackendConnection = _
private var lastState: SparkAppHandle.State = _
@volatile private var _isConnected = false
protected def conf: SparkConf
def connect(): Unit = {
val port = conf.getOption(LauncherProtocol.CONF_LAUNCHER_PORT)
.orElse(sys.env.get(LauncherProtocol.ENV_LAUNCHER_PORT))
.map(_.toInt)
val secret = conf.getOption(LauncherProtocol.CONF_LAUNCHER_SECRET)
.orElse(sys.env.get(LauncherProtocol.ENV_LAUNCHER_SECRET))
if (port != None && secret != None) {
val s = new Socket(InetAddress.getLoopbackAddress(), port.get)
connection = new BackendConnection(s)
connection.send(new Hello(secret.get, SPARK_VERSION))
clientThread = LauncherBackend.threadFactory.newThread(connection)
clientThread.start()
_isConnected = true
}
}
def close(): Unit = {
if (connection != null) {
try {
connection.close()
} finally {
if (clientThread != null) {
clientThread.join()
}
}
}
}
def setAppId(appId: String): Unit = {
if (connection != null && isConnected) {
connection.send(new SetAppId(appId))
}
}
def setState(state: SparkAppHandle.State): Unit = {
if (connection != null && isConnected && lastState != state) {
connection.send(new SetState(state))
lastState = state
}
}
/** Return whether the launcher handle is still connected to this backend. */
def isConnected(): Boolean = _isConnected
/**
* Implementations should provide this method, which should try to stop the application
* as gracefully as possible.
*/
protected def onStopRequest(): Unit
/**
* Callback for when the launcher handle disconnects from this backend.
*/
protected def onDisconnected() : Unit = { }
private def fireStopRequest(): Unit = {
val thread = LauncherBackend.threadFactory.newThread(new Runnable() {
override def run(): Unit = Utils.tryLogNonFatalError {
onStopRequest()
}
})
thread.start()
}
private class BackendConnection(s: Socket) extends LauncherConnection(s) {
override protected def handle(m: Message): Unit = m match {
case _: Stop =>
fireStopRequest()
case _ =>
throw new IllegalArgumentException(s"Unexpected message type: ${m.getClass().getName()}")
}
override def close(): Unit = {
try {
_isConnected = false
super.close()
} finally {
onDisconnected()
}
}
}
}
private object LauncherBackend {
val threadFactory = ThreadUtils.namedThreadFactory("LauncherBackend")
}
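For context, the other end of this connection is the LauncherServer started by the public org.apache.spark.launcher API; a hedged sketch of how the handshake surfaces to a user (resource names made up):
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val handle: SparkAppHandle = new SparkLauncher()
  .setAppResource("app.jar")
  .setMainClass("com.example.WordCount")
  .setMaster("yarn")
  .setDeployMode("cluster")
  .startApplication()      // starts a LauncherServer that this LauncherBackend connects back to

println(handle.getAppId)   // populated once the backend sends SetAppId(appId)
println(handle.getState)   // SUBMITTED after reportLauncherState(SparkAppHandle.State.SUBMITTED)
handle.stop()              // sends Stop, which triggers onStopRequest() in the backend above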
reportLauncherState(SparkAppHandle.State.SUBMITTED)
Reports the launcher state via the backend.
def reportLauncherState(state: SparkAppHandle.State): Unit = {
launcherBackend.setState(state)
}