Pitfalls I hit when submitting a shell script to YARN

I ran into a frustrating problem: running a script of my own through YARN failed, with the following output:
16/12/22 16:47:47 INFO distributedshell.Client: Initializing Client
16/12/22 16:47:47 INFO distributedshell.Client: Running Client
16/12/22 16:47:47 INFO client.RMProxy: Connecting to ResourceManager at tracing044/172.18.0.44:8032
16/12/22 16:47:47 INFO distributedshell.Client: Got Cluster metric info from ASM, numNodeManagers=3
16/12/22 16:47:47 INFO distributedshell.Client: Got Cluster node info from ASM
16/12/22 16:47:47 INFO distributedshell.Client: Got node report from ASM for, nodeId=tracing044:8041, nodeAddresstracing044:8042, nodeRackName/default, nodeNumContainers0
16/12/22 16:47:47 INFO distributedshell.Client: Got node report from ASM for, nodeId=tracing045:8041, nodeAddresstracing045:8042, nodeRackName/default, nodeNumContainers2
16/12/22 16:47:47 INFO distributedshell.Client: Got node report from ASM for, nodeId=tracing041:8041, nodeAddresstracing041:8042, nodeRackName/default, nodeNumContainers2
16/12/22 16:47:47 INFO distributedshell.Client: Queue info, queueName=default, queueCurrentCapacity=0.052083332, queueMaxCapacity=1.0, queueApplicationCount=2, queueChildQueueCount=0
16/12/22 16:47:47 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=SUBMIT_APPLICATIONS
16/12/22 16:47:47 INFO distributedshell.Client: User ACL Info for Queue, queueName=root, userAcl=ADMINISTER_QUEUE
16/12/22 16:47:47 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=SUBMIT_APPLICATIONS
16/12/22 16:47:47 INFO distributedshell.Client: User ACL Info for Queue, queueName=default, userAcl=ADMINISTER_QUEUE
16/12/22 16:47:47 INFO distributedshell.Client: Max mem capabililty of resources in this cluster 98304
16/12/22 16:47:47 INFO distributedshell.Client: Max virtual cores capabililty of resources in this cluster 24
16/12/22 16:47:47 INFO distributedshell.Client: Copy App Master jar from local filesystem and add to local environment
16/12/22 16:47:48 INFO distributedshell.Client: Set the environment for the application master
16/12/22 16:47:48 INFO distributedshell.Client: Setting up app master command
16/12/22 16:47:48 INFO distributedshell.Client: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx10m org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster --container_memory 10 --container_vcores 1 --num_containers 1 --priority 0 1>/AppMaster.stdout 2>/AppMaster.stderr
16/12/22 16:47:48 INFO distributedshell.Client: Submitting application to ASM
16/12/22 16:47:48 INFO impl.YarnClientImpl: Submitted application application_1481677093457_0077
16/12/22 16:47:49 INFO distributedshell.Client: Got application report from ASM for, appId=77, clientToAMToken=null, appDiagnostics=, appMasterHost=N/A, appQueue=default, appMasterRpcPort=-1, appStartTime=1482397003032, yarnAppState=ACCEPTED, distributedFinalState=UNDEFINED, appTrackingUrl=http://tracing044:8088/proxy/application_1481677093457_0077/, appUser=root
16/12/22 16:47:50 INFO distributedshell.Client: Got application report from ASM for, appId=77, clientToAMToken=null, appDiagnostics=, appMasterHost=tracing044/172.18.0.44, appQueue=default, appMasterRpcPort=-1, appStartTime=1482397003032, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://tracing044:8088/proxy/application_1481677093457_0077/A, appUser=root
16/12/22 16:47:51 INFO distributedshell.Client: Got application report from ASM for, appId=77, clientToAMToken=null, appDiagnostics=, appMasterHost=tracing044/172.18.0.44, appQueue=default, appMasterRpcPort=-1, appStartTime=1482397003032, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://tracing044:8088/proxy/application_1481677093457_0077/A, appUser=root
16/12/22 16:47:52 INFO distributedshell.Client: Got application report from ASM for, appId=77, clientToAMToken=null, appDiagnostics=, appMasterHost=tracing044/172.18.0.44, appQueue=default, appMasterRpcPort=-1, appStartTime=1482397003032, yarnAppState=RUNNING, distributedFinalState=UNDEFINED, appTrackingUrl=http://tracing044:8088/proxy/application_1481677093457_0077/A, appUser=root
16/12/22 16:47:53 INFO distributedshell.Client: Got application report from ASM for, appId=77, clientToAMToken=null, appDiagnostics=Diagnostics., total=1, completed=1, allocated=1, failed=1, appMasterHost=tracing044/172.18.0.44, appQueue=default, appMasterRpcPort=-1, appStartTime=1482397003032, yarnAppState=FINISHED, distributedFinalState=FAILED, appTrackingUrl=http://tracing044:8088/proxy/application_1481677093457_0077/A, appUser=root
16/12/22 16:47:53 INFO distributedshell.Client: Application did finished unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring loop
16/12/22 16:47:53 ERROR distributedshell.Client: Application failed to complete successfully

I was submitting the job to YARN like this:

yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar /(yarn home)/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell-2.6.0-cdh5.8.0.jar  -shell_command /server/job.sh

Checking the YARN logs turned up the following:

16/12/23 11:14:54 INFO distributedshell.ApplicationMaster: appattempt_1481677093457_0136_000001 got container status for containerID=container_1481677093457_0136_01_000002, state=COMPLETE, exitStatus=126, diagnostics=Exception from container-launch.
Container id: container_1481677093457_0136_01_000002
Exit code: 126
Stack trace: ExitCodeException exitCode=126:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:578)
        at org.apache.hadoop.util.Shell.run(Shell.java:481)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:763)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 126

To troubleshoot this kind of problem, first confirm that the environment itself is configured correctly by running a simple ls command:

yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar /(yarn home)/lib/hadoop-yarn/hadoop-yarn-applications-distributedshell-2.6.0-cdh5.8.0.jar  -shell_command ls

That succeeded, so the cluster setup is fine. Next, run the script locally to confirm that it works on its own.
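
A quick way to do that local check (just a sketch; /server/job.sh is the script path from the submit command above):

bash /server/job.sh
echo $?    # should print 0 if the script itself runs cleanly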

Exit codes like 126/127 mean the script itself could not be run on the node (126: the command exists but is not executable; 127: the command is not found), and every node that tries to launch it will report the same container-launch error. So the stack trace above is only a symptom; the root cause lies elsewhere.
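
The two exit codes are easy to reproduce in a plain local shell, which helps tell them apart (an illustration, not output from the cluster):

touch noexec.sh && ./noexec.sh      # file exists but has no execute bit
echo $?                             # 126 (Permission denied)
./does-not-exist.sh                 # file is missing entirely
echo $?                             # 127 (No such file or directory)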

We went on to check the logs on the other NodeManager nodes.
Running the job several times in a row and comparing the logs, we saw that it occasionally succeeded; every failed run reported either Permission denied or No such file or directory.
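
Instead of logging into each node one by one, the aggregated container logs can also be pulled from the client side (assuming log aggregation is enabled on the cluster; the application id below is the one from the failed run above):

yarn logs -applicationId application_1481677093457_0077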

Based on this, I guessed that YARN does not ship the script's directory to the other NodeManager nodes. So I did the following: confirm that the directory on the current node has execute permission, then create the same directory on every other NodeManager node, make sure it is executable, and drop a copy of the script into that directory on each node.
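
A rough sketch of that distribution step (assuming passwordless ssh to the nodes; the hostnames come from the node reports in the log above, and /server is the directory used in the submit command):

for host in tracing041 tracing044 tracing045; do
    ssh "$host" "mkdir -p /server"                 # create the matching directory
    scp /server/job.sh "$host":/server/job.sh      # copy the script over
    ssh "$host" "chmod +x /server/job.sh"          # make sure it is executable
done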

After doing that, the problem was solved.

I am still digging into the relevant source code to confirm the guess above...
