Submitting a Spark job to run on YARN
In this setup Spark acts only as a client.
./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
/home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
3
--master yarn on its own is equivalent to adding --deploy-mode client, i.e. yarn-client mode, so --deploy-mode client can be omitted.
For yarn-cluster mode you must add --deploy-mode cluster.
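For reference, a yarn-cluster submission of the same example would look roughly like this (a sketch reusing the jar path from above):
./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
/home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
3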
Launching directly with the spark-submit command at the top produces an error:
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:288)
at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:248)
at org.apache.spark.deploy.SparkSubmitArguments.&lt;init&gt;(SparkSubmitArguments.scala:120)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:130)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
You need to set HADOOP_CONF_DIR or YARN_CONF_DIR as an environment variable:
[hadoop@hadoop001 ~]$ cd $SPARK_HOME/conf
[hadoop@hadoop001 conf]$ vi spark-env.sh
export HADOOP_CONF_DIR=/home/hadoop/app/hadoop-2.6.0/etc/hadoop
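As an optional sanity check (assuming the Hadoop path above), you can confirm that the directory actually holds the client configs Spark needs:
[hadoop@hadoop001 conf]$ ls /home/hadoop/app/hadoop-2.6.0/etc/hadoop | grep -E 'core-site|hdfs-site|yarn-site'
If the path is correct, core-site.xml, hdfs-site.xml and yarn-site.xml should be listed.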
Checking the logs, one step takes a noticeably long time:
Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME
18/09/19 17:30:32 INFO yarn.Client: Preparing resources for our AM container
18/09/19 17:30:35 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/09/19 17:30:44 INFO yarn.Client: Uploading resource file:/tmp/spark-8152492d-487e-4d35-962a-42344edea033/__spark_libs__2104928720237052389.zip -> hdfs://192.168.137.251:9000/user/hadoop/.sparkStaging/application_1537349385350_0001/__spark_libs__2104928720237052389.zip
18/09/19 17:30:54 INFO yarn.Client: Uploading resource file:/tmp/spark-8152492d-487e-4d35-962a-42344edea033/__spark_conf__1822648312505136721.zip -> hdfs://192.168.137.251:9000/user/hadoop/.sparkStaging/application_1537349385350_0001/__spark_conf__.zip
The official documentation mentions this as well:
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
You can configure it as follows:
[hadoop@hadoop000 ~]$ hadoop fs -mkdir -p /system/spark-lib
[hadoop@hadoop000 ~]$ hadoop fs -put /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/jars/* /system/spark-lib
[hadoop@hadoop000 ~]$ hadoop fs -chmod -R 755 /system/spark-lib
[hadoop@hadoop000 ~]$ cd $SPARK_HOME/conf
[hadoop@hadoop000 conf]$ cp spark-defaults.conf.template spark-defaults.conf
[hadoop@hadoop000 conf]$ vi spark-defaults.conf
spark.yarn.jars hdfs://192.168.137.251:9000/system/spark-lib/*
(Without the trailing *, you get an error: Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher)
With this in place, the run logs change from the previous Uploading resource ... lines to:
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/api-asn1-api-1.0.0-M20.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/api-util-1.0.0-M20.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arpack_combined_all-0.1.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arrow-format-0.8.0.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arrow-memory-0.8.0.jar
.............
.............
spark.yarn.jars points at a shared jar library on HDFS. With this setting, submitting a job no longer uploads the jars from the local machine; the dependency handling becomes an HDFS-side operation instead (a copy between HDFS directories, or, as the log above shows, no copy at all when the source and destination file systems are the same), which saves a little time overall. (Some articles online claim this configuration skips the jar-distribution step entirely; that is not accurate: it only turns the local upload into an operation on HDFS.)
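The documentation quoted above also allows spark.yarn.archive as an alternative to spark.yarn.jars, packing all jars into a single archive on HDFS. A minimal sketch (the archive name and the /tmp staging path are assumptions, not from the original setup):
[hadoop@hadoop000 ~]$ cd /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/jars
[hadoop@hadoop000 jars]$ zip -q /tmp/spark-libs.zip *.jar
[hadoop@hadoop000 jars]$ hadoop fs -put /tmp/spark-libs.zip /system/spark-lib/
and then in spark-defaults.conf:
spark.yarn.archive hdfs://192.168.137.251:9000/system/spark-lib/spark-libs.zip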
This dependency distribution is the root cause of every submission spending tens of seconds on resource setup: the jars have to be accessible from the YARN side, meaning every node and every container in the YARN cluster must be able to reach them. For offline batch workloads that is acceptable, but under stricter latency requirements, spending tens of seconds to launch each Spark job is not. One option is to combine Spark with microservices: use Spring Boot or similar to wrap Spark as a long-running service that stays up 24/7, so that submitting a job does not have to re-apply for resources over and over again.
Other commonly used spark-submit options
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
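A submission combining these options might look like the following (a sketch; the queue name and resource sizes are arbitrary examples):
./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--queue default \
--num-executors 2 \
--executor-cores 2 \
--executor-memory 1G \
/home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
3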