Spark Operations Issue Log

Environment: spark-2.1.0-bin-hadoop2.7

1. Spark startup warning: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME

Cause:
If spark.yarn.jars is not set, every submission to YARN packs $SPARK_HOME/jars into a zip and uploads it to the submitting user's directory on HDFS. Pointing spark.yarn.jars at a shared HDFS path lets all applications reuse one copy of the dependencies, which speeds up submission and saves space.

Solution:
1. Create the directory /spark/jars on HDFS
2. Upload everything under $SPARK_HOME/jars to /spark/jars
3. Add the following to spark-defaults.conf

spark.yarn.jars   hdfs://ns1/spark/jars/*
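
Steps 1 and 2 with the hdfs CLI, as a minimal sketch (ns1 is the HA nameservice explained in the notes below):

hdfs dfs -mkdir -p hdfs://ns1/spark/jars                   # step 1: create the shared directory
hdfs dfs -put $SPARK_HOME/jars/* hdfs://ns1/spark/jars/    # step 2: upload all the Spark jars
hdfs dfs -ls hdfs://ns1/spark/jars | head                  # quick sanity check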

Notes:
1. ns1 is my cluster's namenode HA nameservice ID; if HA is not enabled, use ip:port instead
2. Alternatively, pack all the jars into one assembly archive and point the setting in step 3 at that single package (a sketch follows)
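
Under Spark 2.x there is no single assembly jar anymore, so a common substitute for note 2 is to zip the jars into one archive and reference it via spark.yarn.archive instead of spark.yarn.jars; a sketch, where the archive name spark-libs.zip is only an assumption:

cd $SPARK_HOME/jars
zip -q spark-libs.zip *.jar                      # the jars must sit at the root of the archive
hdfs dfs -put spark-libs.zip hdfs://ns1/spark/

and in spark-defaults.conf:

spark.yarn.archive   hdfs://ns1/spark/spark-libs.zip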

2. Spark UI cannot be viewed
I. Local or standalone mode
Just open http://<driver-node>:4040 in a web browser. If several SparkContexts run on the same host, they bind to consecutive ports starting at 4040 (4041, 4042, and so on).
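
If 4040 is already taken or a fixed port is wanted, the UI port can be pinned with spark.ui.port; the port 4050 below is only an example:

spark-submit --conf spark.ui.port=4050 ...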

II. YARN mode
Option 1: configure yarn.resourcemanager.webapp.address and click ApplicationMaster in the YARN UI
Option 2: configure a web proxy; if no web proxy is configured, it falls back to option 1 (a sketch follows)
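
A sketch of option 2, assuming a standalone YARN web proxy on a host called proxy-host (hostname and port are assumptions). In yarn-site.xml:

<property>
    <name>yarn.web-proxy.address</name>
    <value>proxy-host:8089</value>
</property>

then start the proxy daemon:

$HADOOP_HOME/sbin/yarn-daemon.sh start proxyserver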

3. Spark startup error: java.lang.IllegalStateException: Spark context stopped while waiting for backend

Cause: Java 8 processes use more memory than the YARN that ships with Hadoop 2.7.0 allows for by default, so containers exceed the memory checks and get killed, which terminates the application abnormally.
Solution:
Method 1: edit yarn-site.xml to disable the physical and virtual memory checks

<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

Method 2: use Java 7
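
For method 2, a sketch of one way to switch JDKs: point JAVA_HOME at a Java 7 install in spark-env.sh (the path below is only an example):

export JAVA_HOME=/usr/java/jdk1.7.0_80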

4. Configuring the Spark history server to redirect logs to the MapReduce history server kept failing even though every setting looked correct. After a long hunt it turned out I had only restarted YARN and never the MapReduce history server, so aggregation failed. Maddening!

Logs not available for attempt_1449222634779_0065_m_000000_0. Aggregation may not be complete, Check back later or try the nodemanager at slave2:60459
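
What fixed it was restarting the MapReduce JobHistory Server, sketched here with the standard Hadoop 2.x daemon script (run on the history-server host; the path assumes a stock install):

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver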

5. A Spark job that had been running for a while failed with the following error

18/01/10 17:32:03 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 10.255.xx.xxx, executor 1): java.nio.file.FileSystemException: /tmp/spark-2d179f1d-6404-44ea-b3e8-93701b62c416/executor-d477df32-8607-4ade-b999-b2a688ea7dc5/spark-2e014723-56c7-4cfd-81fa-cc369b0177d8/-4418563231515576679876_cache -> ./xxx.jar: No space left on device
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:253)
    at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)
    at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
    at java.nio.file.Files.copy(Files.java:1274)
    at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:608)
    at org.apache.spark.util.Utils$.copyFile(Utils.scala:579)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:456)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:651)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:297)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Analysis: there are two likely causes
1. The disk is out of space
2. The disk has free space left, but it has run out of inodes
Use df -h to check disk space and df -i to check inode usage (see the sketch below).
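
To drill down to what is eating the space, a quick sketch (the directories assume a default standalone install; adjust the paths):

df -h                        # free space per filesystem
df -i                        # free inodes per filesystem
du -sh $SPARK_HOME/work/*    # size of each application's work directory on a worker
du -sh /tmp/spark-*          # executor scratch directories, as seen in the stack trace above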
Solution:
Investigation showed the Spark worker's work directory had grown huge and filled the disk. Add the following to spark-env.sh so that files of stopped applications are cleaned up periodically:

export SPARK_WORKER_OPTS="
-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.interval=1800
-Dspark.worker.cleanup.appDataTtl=604800"

spark.worker.cleanup.interval   how often the cleanup runs, in seconds
spark.worker.cleanup.appDataTtl   how long each stopped application's files are kept, in seconds
