Launching Spark on YARN

Building

To run Spark jobs on YARN, the Spark JAR must be re-assembled with YARN support. Build it by setting the Hadoop version and the SPARK_YARN environment variable:
SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
The assembled JAR will be produced inside the Spark build, e.g. ./assembly/target/scala-2.10/spark-assembly_0.9.1-hadoop2.0.5.jar.

Preparations

1. Build a Spark assembly with YARN support, as described above.
2. The assembly JAR can be installed on HDFS or kept locally (see the sketch after this list).
3. Your application code must be packaged into a separate JAR.

If you want to test the YARN deployment mode, run sbt/sbt assembly to also produce the spark-examples_2.10-0.9.1 JAR file.
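As a minimal sketch of step 2 (the HDFS paths below are hypothetical and only for illustration), the assembly JAR can be uploaded to HDFS once and referenced from there, so it does not have to be shipped with every job:

# Hypothetical HDFS location; adjust to your cluster layout.
hadoop fs -mkdir -p /user/spark/jars
hadoop fs -put ./assembly/target/scala-2.10/spark-assembly_0.9.1-hadoop2.0.5.jar /user/spark/jars/
# Point SPARK_JAR at the HDFS copy when launching the YARN client.
export SPARK_JAR=hdfs:///user/spark/jars/spark-assembly_0.9.1-hadoop2.0.5.jar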

Configuration

Most of the configuration for Spark on YARN is the same as for the other deploy modes; this section only covers the points that differ.
Environment variables
  • SPARK_YARN_USER_ENV, environment variables to add to the Spark processes launched on YARN. Multiple variables are separated by commas, e.g. SPARK_YARN_USER_ENV="JAVA_HOME=/jdk64,FOO=bar".
System properties
  • spark.yarn.applicationMaster.waitTries, property to set the number of times the ApplicationMaster waits for the Spark master, and also the number of tries it waits for the SparkContext to be initialized. Default is 10.
  • spark.yarn.submit.file.replication, the HDFS replication level for the files uploaded into HDFS for the application. These include things like the spark jar, the app jar, and any distributed cache files/archives.
  • spark.yarn.preserve.staging.files, set to true to preserve the staged files (Spark JAR, app JAR, distributed cache files) at the end of the job rather than deleting them.
  • spark.yarn.scheduler.heartbeat.interval-ms, the interval in ms in which the Spark application master heartbeats into the YARN ResourceManager. Default is 5 seconds.
  • spark.yarn.max.worker.failures, the maximum number of worker failures before failing the application. Default is the number of workers requested times 2, with a minimum of 3.
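These are ordinary Java system properties. As a hedged sketch (the property values below are purely illustrative), in Spark 0.9.x they can be passed to the client process through SPARK_JAVA_OPTS:

# Illustrative values only; set whichever properties apply to your job.
SPARK_JAVA_OPTS="-Dspark.yarn.max.worker.failures=6 -Dspark.yarn.preserve.staging.files=true" \
SPARK_JAR=<SPARK_ASSEMBLY_JAR_FILE> ./bin/spark-class org.apache.spark.deploy.yarn.Client ...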
Running Spark on YARN
  • Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory containing the configuration files for your Hadoop cluster. These are used to connect to the cluster, write to HDFS, and submit jobs to the resource manager (see the example below).
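For example (the directory shown is a placeholder; use your cluster's actual Hadoop configuration directory):

# Hypothetical location of the Hadoop client configuration.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf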

Launching a Spark application with the YARN client in yarn-standalone mode

The command to launch the YARN client is as follows:

SPARK_JAR=<SPARK_ASSEMBLY_JAR_FILE> ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar <YOUR_APP_JAR_FILE> \
  --class <APP_MAIN_CLASS> \
  --args <APP_MAIN_ARGUMENTS> \
  --num-workers <NUMBER_OF_WORKER_MACHINES> \
  --master-class <ApplicationMaster_CLASS> \
  --master-memory <MEMORY_FOR_MASTER> \
  --worker-memory <MEMORY_PER_WORKER> \
  --worker-cores <CORES_PER_WORKER> \
  --name <application_name> \
  --queue <queue_name> \
  --addJars <any_local_files_used_in_SparkContext.addJar> \
  --files <files_for_distributed_cache> \
  --archives <archives_for_distributed_cache>


Example

# Build the Spark assembly JAR and the Spark examples JAR
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

# Configure logging
$ cp conf/log4j.properties.template conf/log4j.properties

# Submit Spark's ApplicationMaster to YARN's ResourceManager, and instruct Spark to run the SparkPi example
$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.5-alpha.jar \
    ./bin/spark-class org.apache.spark.deploy.yarn.Client \
      --jar examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar \
      --class org.apache.spark.examples.SparkPi \
      --args yarn-standalone \
      --num-workers 3 \
      --master-memory 4g \
      --worker-memory 2g \
      --worker-cores 1

# Examine the output (replace $YARN_APP_ID in the following with the "application identifier" output by the previous command)
# (Note: YARN_APP_LOGS_DIR is usually /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version.)
$ cat $YARN_APP_LOGS_DIR/$YARN_APP_ID/container*_000001/stdout
Pi is roughly 3.13794

The command above starts the YARN client program, which launches the default ApplicationMaster. SparkPi then runs as a child thread of that ApplicationMaster. The YARN client polls the ApplicationMaster periodically for status and displays it in the console, exiting once the application has finished. In this mode the application actually runs on the machine hosting the ApplicationMaster, so applications that involve local interaction will run into problems.

In this mode, both the SparkContext and the tasks run inside the YARN cluster.
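Depending on your Hadoop version, and assuming YARN log aggregation is enabled, the aggregated ApplicationMaster and worker logs can also be fetched with the yarn CLI instead of reading the per-node log directory shown above:

# Replace <application_id> with the identifier printed by the client, e.g. application_1234567890123_0001 (hypothetical).
yarn logs -applicationId <application_id>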

Launching a Spark application in yarn-client mode

In yarn-client mode the application runs locally, just like running an application or spark-shell against the local/mesos/standalone modes, and it is launched in the same way. The only difference is that you use yarn-client as the master URL. You also need to set SPARK_JAR and SPARK_YARN_APP_JAR. If you run spark-shell against a secured HDFS, you additionally need to set SPARK_YARN_MODE=true.
  • SPARK_YARN_APP_JAR, Path to your application’s JAR file (required)
  • SPARK_WORKER_INSTANCES, Number of workers to start (Default: 2)
  • SPARK_WORKER_CORES, Number of cores for the workers (Default: 1).
  • SPARK_WORKER_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
  • SPARK_MASTER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 MB)
  • SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
  • SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
  • SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
  • SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.

     
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.5-alpha.jar \
SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar \
./bin/run-example org.apache.spark.examples.SparkPi yarn-client

SPARK_YARN_MODE=true \
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.5-alpha.jar \
SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar \
MASTER=yarn-client ./bin/spark-shell
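Assuming the yarn command from your Hadoop installation is on the PATH, you can confirm that the shell has registered an application with the ResourceManager:

# Lists the applications currently known to the ResourceManager.
yarn application -list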

 



 

