When running compute jobs in spark on yarn mode, job submission was noticeably slow.
According to the logs, the slow submission mainly came down to the following issues:
I. Uploading resource files is too slow
17/05/09 10:13:28 INFO yarn.Client: Uploading resource file:/opt/cloudera/parcels/spark-1.6.3-bin-hadoop2.6/lib/spark-assembly-1.6.3-hadoop2.6.0.jar -> hdfs://nameservice1/user/root/.sparkStaging/application_1493349445616_12544/spark-assembly-1.6.3-hadoop2.6.0.jar
17/05/09 10:13:36 INFO yarn.Client: Uploading resource file:/home/wis2_work/wis-spark-stream-1.0.0-all.jar -> hdfs://nameservice1/user/root/.sparkStaging/application_1493349445616_12544/wis-spark-stream-1.0.0-all.jar
After these lines are logged, the jars that the application depends on are uploaded as well; the upload takes roughly 30 seconds and is the main cause of the slow submission. The fix according to the official documentation: to make the Spark runtime jars accessible on the YARN side (i.e. on the YARN nodes), set spark.yarn.archive or spark.yarn.jars. If neither parameter is set, Spark uploads all jars under $SPARK_HOME/jars/ to the distributed cache on every submission, which is exactly why job submission was so slow before.
The fix is as follows:
1. Upload the relevant dependency jars under $SPARK_HOME/ to HDFS
hadoop fs -mkdir /wis/tmp
hadoop fs -put /opt/cloudera/parcels/spark-1.6.3-bin-hadoop2.6/lib/spark-*.jar /wis/tmp/
2. Edit spark-defaults.conf and add:
spark.yarn.jar hdfs://nameservice1/wis/tmp/*.jar
The following variants also work:
#spark.yarn.jar hdfs://nameservice1/wis/tmp/*
## Configuring multiple jars separated by commas also works.
Note: in version 1.6.3 the property is spark.yarn.jar, see http://spark.apache.org/docs/1.6.3/running-on-yarn.html#configuration
In version 2.1.1 it is spark.yarn.jars, see http://spark.apache.org/docs/latest/running-on-yarn.html#configuration
3. Change the application jar path in the job submission script to the HDFS path as well (see the example below); otherwise the jar is still uploaded from the local filesystem to HDFS on every submit, which hurts efficiency.
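For illustration, a submit command that follows steps 2 and 3 might look like the sketch below; it assumes the application jar from the log above was also uploaded to /wis/tmp, and the main class name and resource sizes are placeholder assumptions rather than the real script's values.
# Sketch only: class name and resource sizes are hypothetical; the HDFS location
# of the application jar assumes it was uploaded next to the Spark jars.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WisStreamMain \
  --driver-memory 2g \
  --executor-memory 4g \
  --num-executors 4 \
  hdfs://nameservice1/wis/tmp/wis-spark-stream-1.0.0-all.jar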
II. Spark cannot detect HADOOP_HOME
17/05/16 17:10:49 DEBUG Shell: Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:302)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java)
at org.apache.hadoop.yarn.conf.YarnConfiguration.<clinit>(YarnConfiguration.java)
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.newConfiguration(YarnSparkHadoopUtil.scala:66)
at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala)
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.<init>(YarnSparkHadoopUtil.scala)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
Fix: add the following to spark-env.sh:
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export SPARK_HOME=/opt/cloudera/parcels/spark-1.6.3-bin-hadoop2.6
export HADOOP_CONF_DIR=/etc/hadoop/conf
if [ -n "$HADOOP_HOME" ]; then
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native
fi
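To check that the variables take effect (a quick sanity check; the conf path below assumes the parcel layout used above):
# Source the edited spark-env.sh in a throwaway shell and confirm the paths resolve
bash -c 'source /opt/cloudera/parcels/spark-1.6.3-bin-hadoop2.6/conf/spark-env.sh; echo "HADOOP_HOME=$HADOOP_HOME"; ls "$HADOOP_HOME/bin/hadoop" "$HADOOP_CONF_DIR/core-site.xml"'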
III. Spark fails to load the Hadoop native library
17/05/16 17:11:18 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
17/05/16 17:11:18 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
17/05/16 17:11:18 DEBUG NativeCodeLoader: java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
17/05/16 17:11:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Cause: libhadoop.so and libsnappy.so are missing from the JRE directory. Specifically, spark-shell depends on Scala, and Scala runs on the JDK under JAVA_HOME, so libhadoop.so and libsnappy.so should be placed under $JAVA_HOME/jre/lib/amd64. The two .so files are normally found under Hadoop's native lib directory. Fix by adding the following to spark-defaults.conf:
spark.executor.extraJavaOptions -XX:MetaspaceSize=300M -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.rootCategory=INFO -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
Then cp the two files into the JRE directory (see the commands below); this resolves the issue.
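For reference, the copy step might look like this; the source directory is the CDH native lib path already referenced above, while the exact $JAVA_HOME value depends on the JDK installed on the host and is an assumption here.
# $JAVA_HOME must point to the JDK that Spark actually runs on;
# both .so files come from the CDH native lib directory used above
cp /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/libhadoop.so \
   /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/libsnappy.so \
   "$JAVA_HOME/jre/lib/amd64/"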
The GC strategy and GC log options are left unconfigured for now, since they may cause container initialization failures.
IV. Slow to leave the YARN ACCEPTED state
Rebalance the Hadoop roles and optimize the resource allocation on the server hosting the ResourceManager (RM) role; the checks below can help confirm the bottleneck first.
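Before moving roles around, it can help to confirm that applications really are waiting on scheduler capacity. Two standard YARN CLI checks (nothing assumed beyond a working yarn client):
# Applications currently stuck in the ACCEPTED state
yarn application -list -appStates ACCEPTED
# Per-node state and running-container counts; use "yarn node -status <nodeId>" for memory/vcore details
yarn node -list -all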