File does not exist: hdfs://localhost:9000/user/someone/.sparkStaging/application_1512614402012_0009

I've recently been learning Spark and wrote a Spark WordCount job, running it on YARN against a local pseudo-distributed Hadoop cluster. I ran into the problem described below and am writing up the cause and fix for anyone who hits the same issue.

 

The Spark job was submitted as follows:

 

spark-submit --class "com.my.WordCount" --master yarn --deploy-mode cluster build/libs/spark_study-0.0.1.jar

The exception was:

 

17/12/07 16:31:47 INFO yarn.ApplicationMaster: Preparing Local resources
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/user/someone/.sparkStaging/application_1512614402012_0009/__spark_conf__.zip
        at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
        at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$6.apply(ApplicationMaster.scala:161)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$6.apply(ApplicationMaster.scala:158)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.deploy.yarn.ApplicationMaster.<init>(ApplicationMaster.scala:158)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:763)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:762)
        at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
LogType:stdout
Log Upload Time:Thu Dec 07 16:31:49 +0800 2017
LogLength:0
Log Contents:

The YARN ResourceManager log shows:

 

2017-12-07 16:31:40,119 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1512614402012_0009_01_000001 Container Transitioned from ACQUIRED to RUNNING
2017-12-07 16:31:45,574 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1512614402012_0009_01_000001 Container Transitioned from RUNNING to COMPLETED
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Completed container: container_1512614402012_0009_01_000001 in state: COMPLETED event:FINISHED
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=adorechen OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1512614402012_0009 CONTAINERID=container_1512614402012_0009_01_000001
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1512614402012_0009_01_000001 of capacity on host 172.29.6.27:51111, which currently has 0 containers, used and available, release resources=true
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1512614402012_0009_000001 released container container_1512614402012_0009_01_000001 on node: host: 172.29.6.27:51111 #containers=0 available=8192 used=0 with event: FINISHED
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1512614402012_0009_000001 with final state: FAILED, and exit status: 0
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1512614402012_0009_000001 State change from LAUNCHED to FINAL_SAVING on event = CONTAINER_FINISHED
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1512614402012_0009_000001
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1512614402012_0009_000001
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1512614402012_0009_000001 State change from FINAL_SAVING to FAILED on event = ATTEMPT_UPDATE_SAVED
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 1. The max attempts is 2
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1512614402012_0009_000002
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application appattempt_1512614402012_0009_000001 is done. finalState=FAILED
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1512614402012_0009_000002 State change from NEW to SUBMITTED on event = START
2017-12-07 16:31:45,575 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1512614402012_0009 requests cleared
2017-12-07 16:31:45,576 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1512614402012_0009_000002 to scheduler from user: adorechen
2017-12-07 16:31:45,576 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1512614402012_0009_000002 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
2017-12-07 16:31:46,578 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1512614402012_0009_02_000001 Container Transitioned from NEW to ALLOCATED
2017-12-07 16:31:46,578 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=adorechen OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1512614402012_0009 CONTAINERID=container_1512614402012_0009_02_000001
2017-12-07 16:31:46,578 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1512614402012_0009_02_000001 of capacity on host 172.29.6.27:51111, which has 1 containers, used and available after allocation
2017-12-07 16:31:46,578 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : 172.29.6.27:51111 for container : container_1512614402012_0009_02_000001

Reading back through the logs, the job had actually completed successfully in Container 1, and the staging files for the submitted Spark job (jar, conf, and so on) were then cleaned up. However, Container 1 failed to report its status back to the ResourceManager, so the ResourceManager treated the attempt as unsuccessful and launched Container 2 to run the job again. By that point the Spark conf files had already been deleted by the successful Container 1 run, so Container 2 could not find them and failed with the error above.
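A quick way to observe this on a running cluster is to list the staging directory while the first attempt is running and again after it finishes; this is just an illustrative check, with the user and application id taken from the logs above:

    # staging files exist while attempt 1 is still running
    hdfs dfs -ls hdfs://localhost:9000/user/someone/.sparkStaging/application_1512614402012_0009
    # after attempt 1 finishes and cleans up, the directory is gone,
    # which is why attempt 2 cannot localize __spark_conf__.zip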

 

That much can be read straight from the logs, but why did it happen in the first place? The job clearly ran successfully, so why did YARN launch Container 2 for a second attempt at all?

 

After a good deal of analysis and searching, I found that the culprit was a line accidentally left behind in the source code:

 

SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[4]");


It turned out that after testing locally I had forgotten to remove the local master setting. A master set in code through SparkConf takes precedence over the --master yarn option passed to spark-submit, so when the ResourceManager allocated Container 1 the application ran in local mode inside it. The job itself finished successfully, but it never reported a successful completion back to the ResourceManager through the ApplicationMaster, the attempt's status update failed, and YARN started a second container for another attempt, which then hit the missing-file error above.

Solution: remove the setMaster(...) call from the code and let spark-submit supply the master.
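For reference, a minimal sketch of the corrected driver setup, with the class and app name matching the spark-submit command above (the actual word-count logic is omitted):

    package com.my;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class WordCount {
        public static void main(String[] args) {
            // No setMaster() here: a master hard-coded in SparkConf overrides
            // the --master yarn flag passed to spark-submit, which is what
            // silently switched the job back to local mode above.
            SparkConf conf = new SparkConf().setAppName("WordCount");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // ... word-count logic (textFile, flatMap, reduceByKey, saveAsTextFile) ...

            sc.stop();
        }
    }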

 

Follow-up:

Today I ran WordCount on my local YARN cluster and hit this problem again. Sure enough, a setMaster call had crept back in, so I removed it right away, but this time that did not fix it. The ResourceManager log looked the same as before, a failed status update with no other useful information, so I went on to check the NodeManager log and found this exception:

2018-08-03 18:29:46,468 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:adorechen (auth:SIMPLE) cause:java.io.FileNotFoundException: File file:/private/var/folders/y8/rztx6q994k383p8rl1l3l5000000gp/T/spark-c1b3857e-8f9d-4b8f-bf7e-5155f31fab06/__spark_libs__8401638615010674587.zip does not exist
2018-08-03 18:29:46,477 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: { file:/private/var/folders/y8/rztx6q994k383p8rl1l3l5000000gp/T/spark-c1b3857e-8f9d-4b8f-bf7e-5155f31fab06/__spark_libs__8401638615010674587.zip, 1533292183000, ARCHIVE, null } failed: File file:/private/var/folders/y8/rztx6q994k383p8rl1l3l5000000gp/T/spark-c1b3857e-8f9d-4b8f-bf7e-5155f31fab06/__spark_libs__8401638615010674587.zip does not exist
java.io.FileNotFoundException: File file:/private/var/folders/y8/rztx6q994k383p8rl1l3l5000000gp/T/spark-c1b3857e-8f9d-4b8f-bf7e-5155f31fab06/__spark_libs__8401638615010674587.zip does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:598)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:811)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:588)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:432)
        at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
        at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
:
 

Searching for this exception turned up a solution on Stack Overflow:

This error was due to the config in the core-site.xml file.

Please note that to find this file your HADOOP_CONF_DIR env variable must be set.

In my case I added HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop/ to ./conf/spark-env.sh

See: Spark Job running on Yarn Cluster java.io.FileNotFoundException: File does not exits , eventhough the file exits on the master node

core-site.xml:

    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>

If this endpoint is unreachable, or if Spark detects that the file system is the same as the current system, the lib files will not be distributed to the other nodes in your cluster causing the errors above.

In my situation the node I was on couldn't reach port 9000 on the specified host.

 

https://stackoverflow.com/questions/40905224/spark-shell-spark-libs-zip-does-not-exist
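Applied to the pseudo-distributed setup in this post, the fix comes down to pointing Spark at the Hadoop client configuration. As a sketch, the install path below is the one from the quoted answer and will differ on other machines:

    # conf/spark-env.sh
    # Make spark-submit read the same core-site.xml/hdfs-site.xml as YARN,
    # so __spark_libs__*.zip is uploaded to HDFS instead of being referenced
    # from a local temp directory that the NodeManager cannot see.
    export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop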
